Lecture 7 - Case Studies

Overview

In this lecture, we cover some key aspects of being a data scientist and some principles for exploratory data analysis.

We also cover two case studies:

  • NYC airbnb
  • food prices for nutrition
References

What makes a good data scientist?

From Jeff Goldsmith (Columbia):

  • You need data skills
    • data wrangling
    • reproducibility
    • communication
    • analytics and modeling
  • You also need a mindset:
    • intellectual curiosity
    • ability to solve problems
    • interest in domain area

Problem solving

Quote from Chris Volinsky, “How Industry Views Data Science Education in Statistics Departments”

I’ve interviewed a lot of people over the years… recently, when people have an interview, I ask a single question that I think tries to get at the point of problem solving. The question I ask is along the lines of:

“Imagine you had access to a database of 100 million mobile devices. What questions would you ask? What types of things do you think you could learn, and how would you go about doing it?”

Exploratory Data Analysis

From R for Data Science (2e):

Exploratory data analysis (EDA) is an iterative cycle:

  1. Generate questions about your data.

  2. Search for answers by visualizing, transforming, and modelling your data.

  3. Use what you learn to refine your questions and/or generate new questions.

Often, insightful questions will only become clear after some exploration of the data.

Two helpful questions to start with are:

  • What type of variation occurs within my variables?

    Variation is the tendency of the values of a variable to change from measurement to measurement.

  • What type of covariation occurs between my variables?

    Covariation is the tendency for the values of two or more variables to vary together in a related way.

Case Study: New York City Airbnb Data

This following notebook is an annotated Python script translation of Jeff Goldsmith’s course P8105: Data Science I at the Biostatistics Department at the Columbia School of Public Health.

Data source

Our data was accessed, according to the P8105 site, from Inside Airbnb on September 2, 2017. Inside Airbnb provides some visualizations of the NYC Airbnb data here, where you can see maps showing type of room, activity, availability, and listings per host for all NYC Airbnb listings.

Brainstorm questions for reference

  • Where are max and min air bnb prices?

    • what are average prices?
    • What factors affect prices?
  • Where are prices going up and down over time?

  • What’s the relationship between number of reviews and average review?

  • Does room type affect availability?

  • What’s the location of the unit?

    • What areas are popular?
    • Popularity = number of units? Average price?
  • Are there repeat hosts?

    • If so, what does that mean?

Let’s start coding!

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
## Set palette for categorical display
sns.set_palette('Set2')
sns.color_palette("Set2")

Reading the data

The file nyc_airbnb contains a single dataframe nyc_airbnb with 40,753 rows of data on 17 variables.

  • id: listing id
  • review_scores_location: 0-5 stars converted into a 0-10 scale
  • name: listing name
  • host_id: host id
  • host_name: host name
  • neighbourhood_group: NYC borough
  • neighbourhood: NYC neighborhood
  • latitude: listing latitude
  • longitude: listing longitude
  • room_type: type of listing (Entire home/apt, Private room, Shared room)
  • price: listing price
  • minimum_nights: required minimum nights stay
  • number_of_reviews: total number of reviews
  • last_review: date of last review
  • reviews per month: average number of reviews per month
  • calculated_host_listings_count: total number of listings for this host
  • availability_365: number of days listing is available out of 365
nyc_airbnb = pd.read_csv('../data/nyc_airbnb.csv')
nyc_airbnb.head()
id review_scores_location name host_id host_name neighbourhood_group neighbourhood long lat room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 7949480 10.0 City Island Sanctuary relaxing BR & Bath w Par... 119445 Linda & Didier Bronx City Island -73.788680 40.852054 Private room 99 1 25 4/23/17 1.59 1 170
1 16042478 NaN WATERFRONT STUDIO APARTMENT 9117975 Collins Bronx City Island -73.788607 40.853491 Private room 200 7 0 NaN NaN 1 180
2 1886820 NaN Quaint City Island Community. 9815788 Steve Bronx City Island -73.783052 40.841144 Entire home/apt 300 7 0 NaN NaN 1 365
3 6627449 10.0 Large 1 BDRM in Great location 13886510 Arlene Bronx City Island -73.786609 40.849775 Entire home/apt 125 3 12 10/24/16 0.54 1 335
4 5557381 10.0 Quaint City Island Home 28811542 Phoebe Bronx City Island -73.789328 40.850024 Private room 69 3 86 4/22/17 3.63 1 352

Summary statistics

Max-min stats for prices

# Compute max, min, and mean of the 'price' column
max_price = nyc_airbnb['price'].max()
min_price = nyc_airbnb['price'].min()
mean_price = nyc_airbnb['price'].mean()

print(f"Max price: {max_price}")
print(f"Min price: {min_price}")
print(f"Mean price: {mean_price}")
Max price: 10000
Min price: 10
Mean price: 145.2863592864329
## Summary stats
nyc_airbnb.describe()
id review_scores_location host_id long lat price minimum_nights number_of_reviews reviews_per_month calculated_host_listings_count availability_365
count 4.075300e+04 30716.000000 4.075300e+04 40753.000000 40753.000000 40753.000000 40753.000000 40753.000000 31454.000000 40753.000000 40753.000000
mean 9.912016e+06 9.417991 3.059031e+07 -73.955952 40.730776 145.286359 3.964861 16.409467 1.312114 1.662675 134.081614
std 5.582254e+06 0.836230 3.267980e+07 0.041214 0.053690 209.946665 13.194728 29.973090 1.757380 2.159644 137.809093
min 2.515000e+03 2.000000 7.200000e+01 -74.240843 40.499794 10.000000 1.000000 0.000000 0.010000 1.000000 0.000000
25% 4.995083e+06 9.000000 4.857042e+06 -73.983946 40.692356 70.000000 1.000000 1.000000 0.250000 1.000000 0.000000
50% 1.001931e+07 10.000000 1.805129e+07 -73.957210 40.724119 100.000000 2.000000 4.000000 0.750000 1.000000 84.000000
75% 1.496799e+07 10.000000 4.664988e+07 -73.940474 40.764489 170.000000 3.000000 18.000000 1.910000 2.000000 277.000000
max 1.851999e+07 10.000000 1.284849e+08 -73.711750 40.912164 10000.000000 1250.000000 432.000000 125.920000 35.000000 365.000000

Factors that might affect prices

We first compare the distribution of prices across neighborhood groups:

# Histogram of price grouped by neighbourhood_group
g = sns.FacetGrid(data=nyc_airbnb, col='neighbourhood_group', 
                  col_wrap=5,hue='neighbourhood_group',
                  xlim = (0, 250))
g.map(sns.histplot, 'price',binwidth=3, element='step', common_norm=False)
g.add_legend(title = 'Neighbourhood Group')
g.set_titles('{col_name}')

plt.show()

Average prices by room type:

# Group by room_type and calculate mean price
mean_price_by_room = round(nyc_airbnb.groupby('room_type')['price'].mean(),2).reset_index()
print(mean_price_by_room)
         room_type   price
0  Entire home/apt  206.63
1     Private room   87.52
2      Shared room   70.20

Median prices by room type by neighborhood:

# Group by neighbourhood_group and room_type, compute median price, and pivot the table
median_price_pivot = nyc_airbnb.groupby(['neighbourhood_group', 'room_type'])['price'].median().unstack()
print(median_price_pivot)
room_type            Entire home/apt  Private room  Shared room
neighbourhood_group                                            
Bronx                          100.0          55.0         43.0
Brooklyn                       145.0          65.0         40.0
Manhattan                      190.0          90.0         65.0
Queens                         119.0          60.0         39.0
Staten Island                  112.5          55.0         25.0

Review scores’ affect on prices:

# Scatter plot of review_scores_location vs price
plt.figure(figsize=(10, 6))
sns.scatterplot(data=nyc_airbnb, x='review_scores_location', y='price')
plt.xlabel('Review Scores')
plt.ylabel('Price')
plt.show()

Does room type affect availabilities?

Now, we explore the relation between room types and availabilities. A violin plot is similar to a boxplot, with the exception of a rotated kernel density plot on each side.

# Violin plot of room_type vs availability_365
# Set cut = 0 to restrict violin plot within min-max range. Otherwise, default = 2 which extends the density past extreme datapoints
plt.figure(figsize=(10, 6))
sns.violinplot(data=nyc_airbnb, x='room_type', y='availability_365', cut = 0)
plt.xlabel('Room Type')
plt.ylabel('Days available in a year')
plt.show()

What about reviews and popularity?

Number of reviews by neighborhood:

# Group by neighbourhood_group and sum number_of_reviews, then sort
reviews_by_group = nyc_airbnb.groupby('neighbourhood_group')['number_of_reviews'].sum()\
                    .reset_index().sort_values(by='number_of_reviews', ascending=False)
print(reviews_by_group)
  neighbourhood_group  number_of_reviews
2           Manhattan             323941
1            Brooklyn             263542
3              Queens              66611
0               Bronx               9897
4       Staten Island               4744

Relationship between number of reviews and the review scores:

# Scatter plot of number_of_reviews vs review_scores_location
plt.figure(figsize=(10, 6))
sns.scatterplot(data=nyc_airbnb, x='number_of_reviews', y='review_scores_location')
plt.xlabel('Number of Reviews')
plt.ylabel('Review Scores')
plt.show()

Repeated hosts?

# Filter hosts with multiple calculated listings
filtered_hosts = nyc_airbnb[nyc_airbnb['calculated_host_listings_count'] > 34]
filtered_hosts.head()
id review_scores_location name host_id host_name neighbourhood_group neighbourhood long lat room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
1589 15057686 NaN Home 4 Medical Professionals- Stuyvesant Heigh... 26377263 Stat Brooklyn Bedford-Stuyvesant -73.950457 40.687853 Private room 54 30 0 NaN NaN 35 289
1810 15080936 NaN Home 4 Medical Professionals- Stuyvesant Heigh... 26377263 Stat Brooklyn Bedford-Stuyvesant -73.950478 40.689792 Private room 57 30 0 NaN NaN 35 264
2897 14776203 10.0 Home 4 Medical Professionals- Stuyvesant Heigh... 26377263 Stat Brooklyn Bedford-Stuyvesant -73.952462 40.687926 Private room 57 30 1 2/18/17 0.39 35 362
4288 15074005 NaN Home 4 Medical Professionals- Stuyvesant Heigh... 26377263 Stat Brooklyn Bedford-Stuyvesant -73.951861 40.689432 Private room 54 30 0 NaN NaN 35 320
4540 5866656 NaN Home 4 Medical Professionals-MAIM0 26377263 Stat Brooklyn Borough Park -73.997281 40.640318 Private room 50 30 0 NaN NaN 35 355

Now, let’s do a deeper dive in Manhattan.

We take a closer look in Manhattan. We only limit our search to those whose prices are under $1000.

# Manhattan listings with price < 1000 mapped by latitude and longitude
manhattan_filtered = nyc_airbnb[(nyc_airbnb['neighbourhood_group'] == 'Manhattan') & (nyc_airbnb['price'] < 1000)]
plt.figure(figsize=(8, 6))
sns.scatterplot(data=manhattan_filtered, x='lat', y='long', hue='price',size='price', alpha=0.7)
plt.xlabel('Latitude')
plt.ylabel('Longitude')
plt.legend(title='Price by $')
plt.show()

Let’s see what the average prices for entire homes under $1000 are, in each Manhattan neighborhood:

# Compute mean price by neighbourhood for Manhattan entire homes under $1000
manhattan_entire_home = manhattan_filtered[manhattan_filtered['room_type'] == 'Entire home/apt']
mean_price_neighbourhood = manhattan_entire_home.groupby('neighbourhood')['price'].mean().reset_index().sort_values(by='price', ascending=False)
print(mean_price_neighbourhood)
          neighbourhood       price
26              Tribeca  358.030000
20                 NoHo  311.573770
7     Flatiron District  306.840000
23                 SoHo  295.735043
25     Theater District  282.193548
17              Midtown  275.967939
0     Battery Park City  270.659091
9     Greenwich Village  255.833333
1               Chelsea  254.603922
6    Financial District  249.907895
11       Hell's Kitchen  246.286866
31         West Village  245.414356
21               Nolita  241.170732
14         Little Italy  237.829787
13             Kips Bay  227.217712
19          Murray Hill  226.341615
29      Upper West Side  223.983271
24      Stuyvesant Town  223.071429
3          Civic Center  218.880000
8              Gramercy  214.566667
5          East Village  210.396313
2             Chinatown  207.256410
15      Lower East Side  204.777778
28      Upper East Side  192.833167
27          Two Bridges  173.761905
10               Harlem  164.489627
4           East Harlem  163.313225
18  Morningside Heights  150.707006
30   Washington Heights  132.896154
22     Roosevelt Island  130.875000
12               Inwood  119.079545
16          Marble Hill  100.833333
# Violin plot of price by Manhattan neighbourhood, reordered by price
plt.figure(figsize=(14, 5))
manhattan_entire_home_sorted = manhattan_entire_home.sort_values(by='price')
sns.violinplot(data=manhattan_entire_home_sorted, x='neighbourhood', y='price')
plt.xticks(rotation=45,ha="right", rotation_mode="anchor")
plt.show()

# drop nas from review
manhattan_filtered_dropna = manhattan_filtered.dropna(subset='review_scores_location')
fig = px.scatter_map(
    manhattan_filtered_dropna,
    lon='long',
    lat='lat',
    opacity=0.6,
    color='review_scores_location',
    color_continuous_scale=px.colors.sequential.Viridis,
    labels={"review_scores_location": "Review Score"}
)

fig.update_layout(
    autosize=False, height=500, width=800,
    title=dict(text="AirBnB Review Scores in Manhattan")
)


fig.show()

Case Study: Food Prices for Nutrition

The Food Prices for Nutrition data set is taken from Word Bank Group International Comparison Program (ICP). It records costs and affordabilities of various diets across different countries from 2017 to 2022. You can check the summary and some basic analyses of this data set through its website in the link before. In this case study, we will take a dive into this data set and do some exploratory data analysis.

What does the data look like?

Before we ask any scientific question about the data set, we need to at least take a look at it!

df = pd.read_csv("../data/food_prices.csv")
df.head()
Classification Name Classification Code Country Name Country Code Time Time Code Cost of an energy sufficient diet [CoCA] Cost of a nutrient adequate diet [CoNA] Cost of a healthy diet [CoHD] Cost of a healthy diet relative to the cost of sufficient energy from starchy staples [CoHD_CoCA] ... Affordability of an energy sufficient diet: ratio of cost to food expenditures [CoCA_fexp] Affordability of a nutrient adequate diet: ratio of cost to food expenditures [CoNA_fexp] Affordability of a healthy diet: ratio of cost to food expenditures [CoHD_fexp] Percent of the population who cannot afford sufficient calories [CoCA_headcount] Percent of the population who cannot afford nutrient adequacy [CoNA_headcount] Percent of the population who cannot afford a healthy diet [CoHD_headcount] Millions of people who cannot afford sufficient calories [CoCA_unafford_n] Millions of people who cannot afford nutrient adequacy [CoNA_unafford_n] Millions of people who cannot afford a healthy diet [CoHD_unafford_n] Population [Pop]
0 Food Prices for Nutrition 1.0 FPN 1.0 Albania ALB 2017.0 YR2017 0.725 2.471 3.952 5.45 ... 0.078 0.266 0.425 0 13 37.8 0 0.4 1.1 2873457
1 Food Prices for Nutrition 1.0 FPN 1.0 Albania ALB 2018.0 YR2018 .. .. 4.051 .. ... .. .. .. .. .. 27.9 .. .. 0.8 2866376
2 Food Prices for Nutrition 1.0 FPN 1.0 Albania ALB 2019.0 YR2019 .. .. 4.117 .. ... .. .. .. .. .. 19.8 .. .. 0.6 2854191
3 Food Prices for Nutrition 1.0 FPN 1.0 Albania ALB 2020.0 YR2020 .. .. 4.197 .. ... .. .. .. .. .. 20.1 .. .. 0.6 2837743
4 Food Prices for Nutrition 1.0 FPN 1.0 Albania ALB 2021.0 YR2021 .. .. .. .. ... .. .. .. .. .. .. .. .. .. ..

5 rows × 40 columns

The countries included in this program.

df['Country Name'].unique()
array(['Albania', 'Algeria', 'Angola', 'Anguilla', 'Antigua and Barbuda',
       'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria',
       'Azerbaijan', 'Bahamas, The', 'Bahrain', 'Bangladesh', 'Barbados',
       'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan',
       'Bolivia', 'Bonaire', 'Bosnia and Herzegovina', 'Botswana',
       'Brazil', 'British Virgin Islands', 'Brunei Darussalam',
       'Bulgaria', 'Burkina Faso', 'Burundi', 'Cabo Verde', 'Cambodia',
       'Cameroon', 'Canada', 'Cayman Islands', 'Central African Republic',
       'Chad', 'Chile', 'China', 'Colombia', 'Comoros',
       'Congo, Dem. Rep.', 'Congo, Rep.', 'Costa Rica', "Côte d'Ivoire",
       'Croatia', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark',
       'Djibouti', 'Dominica', 'Dominican Republic',
       'East Asia & Pacific', 'Ecuador', 'Egypt, Arab Rep.',
       'El Salvador', 'Equatorial Guinea', 'Estonia', 'Eswatini',
       'Ethiopia', 'Europe & Central Asia', 'Fiji', 'Finland', 'France',
       'Gabon', 'Gambia, The', 'Germany', 'Ghana', 'Greece', 'Grenada',
       'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'High income',
       'Honduras', 'Hong Kong SAR, China', 'Hungary', 'Iceland', 'India',
       'Indonesia', 'Iran, Islamic Rep.', 'Iraq', 'Ireland', 'Israel',
       'Italy', 'Jamaica', 'Japan', 'Jordan', 'Kazakhstan', 'Kenya',
       'Korea, Rep.', 'Kuwait', 'Kyrgyz Republic', 'Lao PDR',
       'Latin America & Caribbean', 'Latvia', 'Lesotho', 'Liberia',
       'Lithuania', 'Low income', 'Lower middle income', 'Luxembourg',
       'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta',
       'Mauritania', 'Mauritius', 'Mexico', 'Middle East & North Africa',
       'Moldova', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco',
       'Mozambique', 'Myanmar', 'Namibia', 'Nepal', 'Netherlands',
       'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'North America',
       'North Macedonia', 'Norway', 'Oman', 'Pakistan', 'Panama',
       'Paraguay', 'Peru', 'Philippines', 'Poland', 'Portugal', 'Qatar',
       'Romania', 'Russian Federation', 'Rwanda', 'Sao Tome and Principe',
       'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone',
       'Singapore', 'Sint Maarten (Dutch part)', 'Slovak Republic',
       'Slovenia', 'South Africa', 'South Asia', 'Spain', 'Sri Lanka',
       'St. Kitts and Nevis', 'St. Lucia',
       'St. Vincent and the Grenadines', 'Sub-Saharan Africa', 'Sudan',
       'Suriname', 'Sweden', 'Switzerland', 'Taiwan, China', 'Tajikistan',
       'Tanzania', 'Thailand', 'Togo', 'Trinidad and Tobago', 'Tunisia',
       'Türkiye', 'Turks and Caicos Islands', 'Uganda',
       'United Arab Emirates', 'United Kingdom', 'United States',
       'Upper middle income', 'Uruguay', 'Vietnam', 'West Bank and Gaza',
       'WORLD', 'Zambia', 'Zimbabwe', nan], dtype=object)

The number of records and the number of variables.

df.shape
(3725, 40)

Let’s look at the variables:

df.columns
Index(['Classification Name', 'Classification Code', 'Country Name',
       'Country Code', 'Time', 'Time Code',
       'Cost of an energy sufficient diet [CoCA]',
       'Cost of a nutrient adequate diet [CoNA]',
       'Cost of a healthy diet [CoHD]',
       'Cost of a healthy diet relative to the cost of sufficient energy from starchy staples [CoHD_CoCA]',
       'Cost of fruits [CoHD_f]', 'Cost of vegetables [CoHD_v]',
       'Cost of starchy staples [CoHD_ss]',
       'Cost of animal-source foods [CoHD_asf]',
       'Cost of legumes, nuts and seeds [CoHD_lns]',
       'Cost of oils and fats [CoHD_of]',
       'Cost share for fruits in a least-cost healthy diet [CoHD_f_prop]',
       'Cost share for vegetables in a least-cost healthy diet [CoHD_v_prop]',
       'Cost share for starchy staples in a least-cost healthy diet [CoHD_ss_prop]',
       'Cost share for animal-sourced foods in a least-cost healthy diet [CoHD_asf_prop]',
       'Cost share for legumes, nuts and seeds in a least-cost healthy diet [CoHD_lns_prop]',
       'Cost share for oils and fats in a least-cost healthy diet [CoHD_of_prop]',
       'Cost of fruits relative to the starchy staples in a least-cost healthy diet [CoHD_f_ss]',
       'Cost of vegetables relative to the starchy staples in a least-cost healthy diet [CoHD_v_ss]',
       'Cost of animal-sourced foods relative to the starchy staples in a least-cost healthy diet [CoHD_asf_ss]',
       'Cost of legumes, nuts and seeds relative to the starchy staples in a least-cost healthy diet [CoHD_lns_ss]',
       'Cost of oils and fats relative to the starchy staples in a least-cost healthy diet [CoHD_of_ss]',
       'Affordability of an energy sufficient diet: ratio of cost to the food poverty line [CoCA_pov]',
       'Affordability of a nutrient adequate diet: ratio of cost to the food poverty line [CoNA_pov]',
       'Affordability of a healthy diet: ratio of cost to the food poverty line [CoHD_pov]',
       'Affordability of an energy sufficient diet: ratio of cost to food expenditures [CoCA_fexp]',
       'Affordability of a nutrient adequate diet: ratio of cost to food expenditures [CoNA_fexp]',
       'Affordability of a healthy diet: ratio of cost to food expenditures [CoHD_fexp]',
       'Percent of the population who cannot afford sufficient calories [CoCA_headcount]',
       'Percent of the population who cannot afford nutrient adequacy [CoNA_headcount]',
       'Percent of the population who cannot afford a healthy diet [CoHD_headcount]',
       'Millions of people who cannot afford sufficient calories [CoCA_unafford_n]',
       'Millions of people who cannot afford nutrient adequacy [CoNA_unafford_n]',
       'Millions of people who cannot afford a healthy diet [CoHD_unafford_n]',
       'Population [Pop]'],
      dtype='object')

Data Cleaning

We need to process the data to a format that we can easily manipulate.

df.dropna(inplace = True)
selected_countries = ["India", "Senegal", "Albania", "China", "United States", "United Arab Emirates", "Türkiye", "Egypt, Arab Rep.", "Ghana"]
df = df[df['Country Name'].isin(selected_countries)]
df.head()
Classification Name Classification Code Country Name Country Code Time Time Code Cost of an energy sufficient diet [CoCA] Cost of a nutrient adequate diet [CoNA] Cost of a healthy diet [CoHD] Cost of a healthy diet relative to the cost of sufficient energy from starchy staples [CoHD_CoCA] ... Affordability of an energy sufficient diet: ratio of cost to food expenditures [CoCA_fexp] Affordability of a nutrient adequate diet: ratio of cost to food expenditures [CoNA_fexp] Affordability of a healthy diet: ratio of cost to food expenditures [CoHD_fexp] Percent of the population who cannot afford sufficient calories [CoCA_headcount] Percent of the population who cannot afford nutrient adequacy [CoNA_headcount] Percent of the population who cannot afford a healthy diet [CoHD_headcount] Millions of people who cannot afford sufficient calories [CoCA_unafford_n] Millions of people who cannot afford nutrient adequacy [CoNA_unafford_n] Millions of people who cannot afford a healthy diet [CoHD_unafford_n] Population [Pop]
0 Food Prices for Nutrition 1.0 FPN 1.0 Albania ALB 2017.0 YR2017 0.725 2.471 3.952 5.45 ... 0.078 0.266 0.425 0 13 37.8 0 0.4 1.1 2873457
1 Food Prices for Nutrition 1.0 FPN 1.0 Albania ALB 2018.0 YR2018 .. .. 4.051 .. ... .. .. .. .. .. 27.9 .. .. 0.8 2866376
2 Food Prices for Nutrition 1.0 FPN 1.0 Albania ALB 2019.0 YR2019 .. .. 4.117 .. ... .. .. .. .. .. 19.8 .. .. 0.6 2854191
3 Food Prices for Nutrition 1.0 FPN 1.0 Albania ALB 2020.0 YR2020 .. .. 4.197 .. ... .. .. .. .. .. 20.1 .. .. 0.6 2837743
4 Food Prices for Nutrition 1.0 FPN 1.0 Albania ALB 2021.0 YR2021 .. .. .. .. ... .. .. .. .. .. .. .. .. .. ..

5 rows × 40 columns

cost_of_diet, percent_cannot_afford = "Cost of a healthy diet [CoHD]", "Percent of the population who cannot afford a healthy diet [CoHD_headcount]"
df = df[["Country Name", cost_of_diet,  "Classification Code", "Time", "Population [Pop]", percent_cannot_afford]]
df = df.replace("..", pd.NA)
df = df.dropna()
df = df[df["Classification Code"] == "FPN 2.1"]
df['Time'] = df['Time'].astype(int)
df['Country Name'] = df['Country Name'].astype(str)
df['Population [Pop]'] = df['Population [Pop]'].astype(int)
df[percent_cannot_afford] = df[percent_cannot_afford].astype(float)
df[cost_of_diet] = df[cost_of_diet].astype(float)
df = df.drop_duplicates()
df = df.drop(columns = ["Classification Code"])
df.dtypes
Country Name                                                                    object
Cost of a healthy diet [CoHD]                                                  float64
Time                                                                             int64
Population [Pop]                                                                 int64
Percent of the population who cannot afford a healthy diet [CoHD_headcount]    float64
dtype: object
df.head()
Country Name Cost of a healthy diet [CoHD] Time Population [Pop] Percent of the population who cannot afford a healthy diet [CoHD_headcount]
2790 Albania 3.952 2017 2873457 31.3
2791 Albania 4.069 2018 2866376 23.0
2792 Albania 4.262 2019 2854191 22.2
2793 Albania 4.280 2020 2837849 19.9
2794 Albania 4.388 2021 2811666 15.7

The population is at a very large scale, so usually we need to trasform it to log scale for better visualization, etc.

df["pop_log"] = np.log(df["Population [Pop]"])
df.head()
Country Name Cost of a healthy diet [CoHD] Time Population [Pop] Percent of the population who cannot afford a healthy diet [CoHD_headcount] pop_log
2790 Albania 3.952 2017 2873457 31.3 14.871026
2791 Albania 4.069 2018 2866376 23.0 14.868559
2792 Albania 4.262 2019 2854191 22.2 14.864299
2793 Albania 4.280 2020 2837849 19.9 14.858557
2794 Albania 4.388 2021 2811666 15.7 14.849288

Interactive Scatter Plot

# Create an interactive scatter plot with a slider using Plotly Express
fig = px.scatter(
    df,
    x='Cost of a healthy diet [CoHD]',
    y='Percent of the population who cannot afford a healthy diet [CoHD_headcount]',
    size='pop_log',
    animation_frame='Time',  
    color = 'Country Name',
    title='Cost of diet Vs % of population who cannot afford it through the years',
    labels={'Population [Pop]': 'Point Size'}
)

fig.update_xaxes(range=[2, 4.5])  # Adjust the range based on your desired x-axis limits
fig.update_yaxes(range=[-10, 100])  

fig.update_layout(
    title_x=0.5,  # Center the title
    yaxis_title="% cannot afford",
    xaxis_title='Cost of a Healthy Diet',
    height=600, width=800,
)

# Show the interactive plot
fig.show()