Lecture 3 - Pandas II
Overview
This lecture introduces:
- the concept of tidy data
- more advanced Pandas functionality, including groupby, categorical datatypes, hierarchical indexing, and merging datasets
References
This lecture contains material from:
- Chapter 8, Python for Data Analysis, 3E (Wes McKinney, 2022)
- Chapter 3, Data Science: A First Introduction with Python (Timbers et al. 2022)
- Pierre Bellec’s notes
- Hadley Wickham. Tidy data. Journal of Statistical Software, 59(10):1–23, 2014.
Tidy Data
Let’s define the following helpful terms:
- variable: a characteristic, number, or quantity that can be measured.
- observation: all of the measurements for a given entity.
- value: a single measurement of a single variable for a given entity.
A tidy data frame satisfies the following three criteria (Wickham, 2014):
- each row is a single observation,
- each column is a single variable, and
- each value is a single cell (i.e., its entry in the data frame is not shared with another value).
Illustration: NYC flights data
We now look at NYC flights data, contained in the nycflights13 package.
This package contains information about all flights that departed from NYC (e.g. EWR, JFK and LGA) to destinations in the United States, Puerto Rico, and the American Virgin Islands in 2013: 336,776 flights in total. To help understand what causes delays, it also includes a number of other useful datasets.
This package provides the following data tables.
- flights: all flights that departed from NYC in 2013
- weather: hourly meteorological data for each airport
- planes: construction information about each plane
- airports: airport names and locations
- airlines: translation between two-letter carrier codes and names
Install the nycflights13 package in your conda environment (in a terminal):
conda activate msds597
pip install nycflights13
Load the nycflights13 data:
import numpy as np
import pandas as pd
import seaborn as sns
from nycflights13 import flights, airports, airlines, planes, weather
flights.head()
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
The above data is “tidy”:
- each row is a single observation (flight)
- each column is a single variable (e.g. year, month, departure time)
- each value is a single cell
Below are examples of “untidy” versions of this data.
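One possible untidy version (a sketch, not reproduced from the original examples): spread the counts of flights per month into one column per origin airport, so the origin variable is encoded in the column names rather than stored in its own column.
# a hypothetical 'wide' layout: one row per month, one column per origin
untidy_wide = flights.groupby(['month', 'origin']).size().unstack('origin')
untidy_wide.head()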
Let’s check the origin of all the flights:
flights['origin'].value_counts()
origin
EWR 120835
JFK 111279
LGA 104662
Name: count, dtype: int64
Practising skills from last lecture:
- Which flights on 1st Jan have missing departure times?
jan1 = flights[(flights.month == 1) & (flights.day == 1)]
jan1[jan1.dep_time.isna()]
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
838 | 2013 | 1 | 1 | NaN | 1630 | NaN | NaN | 1815 | NaN | EV | 4308 | N18120 | EWR | RDU | NaN | 416 | 16 | 30 | 2013-01-01T21:00:00Z |
839 | 2013 | 1 | 1 | NaN | 1935 | NaN | NaN | 2240 | NaN | AA | 791 | N3EHAA | LGA | DFW | NaN | 1389 | 19 | 35 | 2013-01-02T00:00:00Z |
840 | 2013 | 1 | 1 | NaN | 1500 | NaN | NaN | 1825 | NaN | AA | 1925 | N3EVAA | LGA | MIA | NaN | 1096 | 15 | 0 | 2013-01-01T20:00:00Z |
841 | 2013 | 1 | 1 | NaN | 600 | NaN | NaN | 901 | NaN | B6 | 125 | N618JB | JFK | FLL | NaN | 1069 | 6 | 0 | 2013-01-01T11:00:00Z |
jan1[jan1.dep_time.isna()]['flight']

# we can't do this!
# jan1[jan1.dep_time.isna(), 'flight']
# we CAN do this
jan1.loc[jan1.dep_time.isna(), 'flight']
838 4308
839 791
840 1925
841 125
Name: flight, dtype: int64
- Which flights flew to Houston (IAH or HOU)?
houston = flights[(flights.dest == 'IAH') | (flights.dest == 'HOU')]
- Which flights that departed from JFK were operated by United, American or Delta? (UA, AA, DL)
jfk = flights[flights.origin == 'JFK']
ua = jfk.carrier == 'UA'
aa = jfk.carrier == 'AA'
dl = jfk.carrier == 'DL'
jfk[((ua | aa) | dl)]
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
12 | 2013 | 1 | 1 | 558.0 | 600 | -2.0 | 924.0 | 917 | 7.0 | UA | 194 | N29129 | JFK | LAX | 345.0 | 2475 | 6 | 0 | 2013-01-01T11:00:00Z |
23 | 2013 | 1 | 1 | 606.0 | 610 | -4.0 | 837.0 | 845 | -8.0 | DL | 1743 | N3739P | JFK | ATL | 128.0 | 760 | 6 | 10 | 2013-01-01T11:00:00Z |
26 | 2013 | 1 | 1 | 611.0 | 600 | 11.0 | 945.0 | 931 | 14.0 | UA | 303 | N532UA | JFK | SFO | 366.0 | 2586 | 6 | 0 | 2013-01-01T11:00:00Z |
36 | 2013 | 1 | 1 | 628.0 | 630 | -2.0 | 1137.0 | 1140 | -3.0 | AA | 413 | N3BAAA | JFK | SJU | 192.0 | 1598 | 6 | 30 | 2013-01-01T11:00:00Z |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
336704 | 2013 | 9 | 30 | 2028.0 | 1910 | 78.0 | 2255.0 | 2215 | 40.0 | AA | 21 | N338AA | JFK | LAX | 294.0 | 2475 | 19 | 10 | 2013-09-30T23:00:00Z |
336715 | 2013 | 9 | 30 | 2041.0 | 2045 | -4.0 | 2147.0 | 2208 | -21.0 | DL | 985 | N359NB | JFK | BOS | 37.0 | 187 | 20 | 45 | 2013-10-01T00:00:00Z |
336718 | 2013 | 9 | 30 | 2050.0 | 2045 | 5.0 | 20.0 | 53 | -33.0 | DL | 347 | N396DA | JFK | SJU | 188.0 | 1598 | 20 | 45 | 2013-10-01T00:00:00Z |
336744 | 2013 | 9 | 30 | 2121.0 | 2100 | 21.0 | 2349.0 | 14 | -25.0 | DL | 2363 | N193DN | JFK | LAX | 296.0 | 2475 | 21 | 0 | 2013-10-01T01:00:00Z |
336751 | 2013 | 9 | 30 | 2140.0 | 2140 | 0.0 | 10.0 | 40 | -30.0 | AA | 185 | N335AA | JFK | LAX | 298.0 | 2475 | 21 | 40 | 2013-10-01T01:00:00Z |
39018 rows × 19 columns
- Sort the flights to find the most delayed flights.
flights.sort_values(by='arr_delay', ascending=False)
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7072 | 2013 | 1 | 9 | 641.0 | 900 | 1301.0 | 1242.0 | 1530 | 1272.0 | HA | 51 | N384HA | JFK | HNL | 640.0 | 4983 | 9 | 0 | 2013-01-09T14:00:00Z |
235778 | 2013 | 6 | 15 | 1432.0 | 1935 | 1137.0 | 1607.0 | 2120 | 1127.0 | MQ | 3535 | N504MQ | JFK | CMH | 74.0 | 483 | 19 | 35 | 2013-06-15T23:00:00Z |
8239 | 2013 | 1 | 10 | 1121.0 | 1635 | 1126.0 | 1239.0 | 1810 | 1109.0 | MQ | 3695 | N517MQ | EWR | ORD | 111.0 | 719 | 16 | 35 | 2013-01-10T21:00:00Z |
327043 | 2013 | 9 | 20 | 1139.0 | 1845 | 1014.0 | 1457.0 | 2210 | 1007.0 | AA | 177 | N338AA | JFK | SFO | 354.0 | 2586 | 18 | 45 | 2013-09-20T22:00:00Z |
270376 | 2013 | 7 | 22 | 845.0 | 1600 | 1005.0 | 1044.0 | 1815 | 989.0 | MQ | 3075 | N665MQ | JFK | CVG | 96.0 | 589 | 16 | 0 | 2013-07-22T20:00:00Z |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
336771 | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 9E | 3393 | NaN | JFK | DCA | NaN | 213 | 14 | 55 | 2013-09-30T18:00:00Z |
336772 | 2013 | 9 | 30 | NaN | 2200 | NaN | NaN | 2312 | NaN | 9E | 3525 | NaN | LGA | SYR | NaN | 198 | 22 | 0 | 2013-10-01T02:00:00Z |
336773 | 2013 | 9 | 30 | NaN | 1210 | NaN | NaN | 1330 | NaN | MQ | 3461 | N535MQ | LGA | BNA | NaN | 764 | 12 | 10 | 2013-09-30T16:00:00Z |
336774 | 2013 | 9 | 30 | NaN | 1159 | NaN | NaN | 1344 | NaN | MQ | 3572 | N511MQ | LGA | CLE | NaN | 419 | 11 | 59 | 2013-09-30T15:00:00Z |
336775 | 2013 | 9 | 30 | NaN | 840 | NaN | NaN | 1020 | NaN | MQ | 3531 | N839MQ | LGA | RDU | NaN | 431 | 8 | 40 | 2013-09-30T12:00:00Z |
336776 rows × 19 columns
- Select columns which have arrival related information.
[col for col in flights.columns if 'arr_' in col]
['arr_time', 'sched_arr_time', 'arr_delay']
- List the top three airlines that operated the most flights from EWR, arranged in descending order.
ewr = flights[flights.origin == 'EWR']
ewr.carrier.value_counts().sort_values(ascending=False)
carrier
UA 46087
EV 43939
B6 6557
WN 6188
US 4405
DL 4342
AA 3487
MQ 2276
VX 1566
9E 1268
AS 714
OO 6
Name: count, dtype: int64
- What are the most common flight routes?
def cat_func(x):
    base = ''
    for i in x:
        base += i + ' '
    return base

origin_dest = flights[['origin', 'dest']].apply(cat_func, axis=1)
origin_dest.value_counts(ascending=False)
JFK LAX 11262
LGA ATL 10263
LGA ORD 8857
JFK SFO 8204
LGA CLT 6168
...
JFK STL 1
JFK MEM 1
JFK BHM 1
LGA LEX 1
EWR LGA 1
Name: count, Length: 224, dtype: int64
- How many flights were there from NYC airports to Seattle (SEA)?
nyc_sea = flights[(flights['origin'].isin(['JFK', 'LGA', 'EWR'])) & (flights['dest'] == 'SEA')]
nyc_sea.shape
(3923, 19)
- How many carriers fly to SEA?
len(nyc_sea.carrier.unique()) # or nyc_sea.carrier.nunique()
5
- What is the average arrival delay?
nyc_sea['arr_delay'].mean()
np.float64(-1.0990990990990992)
- What proportion of flights come from each NYC airport?
ewr = 100 * (nyc_sea.origin == 'EWR').mean()
lga = 100 * (nyc_sea.origin == 'LGA').mean()
jfk = 100 * (nyc_sea.origin == 'JFK').mean()
f'EWR is {ewr:.2f}%, LGA is {lga:.2f}%, JFK is {jfk:.2f}%'
'EWR is 46.67%, LGA is 0.00%, JFK is 53.33%'
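An equivalent, more compact calculation (a sketch using value_counts with normalize=True) gives the same proportions:
# proportions of SEA-bound flights by origin airport, as percentages
(nyc_sea['origin'].value_counts(normalize=True) * 100).round(2)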
Groupby
In “wide” format, it is easy to apply functions to each of the currencies: e.g. finding the mean of USD, JPY etc. What about in the “tidy” format? Here, we can use the groupby method:
df_wide = pd.read_csv('../data/rates.csv')
df_wide.Time = pd.to_datetime(df_wide.Time)
df_long = df_wide.melt(id_vars='Time', var_name='currency', value_name='rate')
df_long.groupby('currency')['rate'].mean()
currency
BGN 1.955800
CHF 0.957208
CZK 25.199081
DKK 7.456484
GBP 0.855201
JPY 162.195484
USD 1.082711
Name: rate, dtype: float64
df_long.groupby('currency')['rate'].agg(['min', 'max'])
min | max | |
---|---|---|
currency | ||
BGN | 1.95580 | 1.95580 |
CHF | 0.93150 | 0.98460 |
CZK | 24.73400 | 25.46000 |
DKK | 7.45360 | 7.46090 |
GBP | 0.85098 | 0.85846 |
JPY | 158.96000 | 164.97000 |
USD | 1.06370 | 1.09390 |
With groupby, operations on tidy data follow a much more consistent pattern than operations on wide data. (This workflow is also similar to R’s dplyr and the R tidyverse more generally.)
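agg also accepts keyword arguments for named aggregations; a minimal sketch (the output column names min_rate, mean_rate and max_rate are illustrative):
# named aggregations: each keyword becomes an output column
df_long.groupby('currency')['rate'].agg(min_rate='min', mean_rate='mean', max_rate='max')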
df_long.groupby('Time')['rate'].agg(['min', 'max'])
min | max | |
---|---|---|
Time | ||
2024-01-18 | 0.85773 | 160.89 |
2024-01-19 | 0.85825 | 161.17 |
2024-01-22 | 0.85575 | 160.95 |
2024-01-23 | 0.85493 | 160.88 |
2024-01-24 | 0.85543 | 160.46 |
... | ... | ... |
2024-04-10 | 0.85515 | 164.89 |
2024-04-11 | 0.85525 | 164.18 |
2024-04-12 | 0.85424 | 163.16 |
2024-04-15 | 0.85405 | 164.05 |
2024-04-16 | 0.85440 | 164.54 |
62 rows × 2 columns
Compare to:
df_wide.agg(['min', 'max'])
Time | USD | JPY | BGN | CZK | DKK | GBP | CHF | |
---|---|---|---|---|---|---|---|---|
min | 2024-01-18 | 1.0637 | 158.96 | 1.9558 | 24.734 | 7.4536 | 0.85098 | 0.9315 |
max | 2024-04-16 | 1.0939 | 164.97 | 1.9558 | 25.460 | 7.4609 | 0.85846 | 0.9846 |
# this fails because we have Time as a column
# df_wide.agg(['min', 'max'], axis=1)

# first, get only currency columns (not Time)
cols = df_wide.columns[df_wide.columns != 'Time']
# agg over rows, excluding the Time column
df_wide[cols].agg(['min', 'max'], axis=1)
min | max | |
---|---|---|
0 | 0.85440 | 164.54 |
1 | 0.85405 | 164.05 |
2 | 0.85424 | 163.16 |
3 | 0.85525 | 164.18 |
4 | 0.85515 | 164.89 |
... | ... | ... |
57 | 0.85543 | 160.46 |
58 | 0.85493 | 160.88 |
59 | 0.85575 | 160.95 |
60 | 0.85825 | 161.17 |
61 | 0.85773 | 160.89 |
62 rows × 2 columns
# alternatively, turn Time into the index
df_wide.index = df_wide.Time
df_wide.drop(columns='Time').agg(['min', 'max'], axis=1)
min | max | |
---|---|---|
Time | ||
2024-04-16 | 0.85440 | 164.54 |
2024-04-15 | 0.85405 | 164.05 |
2024-04-12 | 0.85424 | 163.16 |
2024-04-11 | 0.85525 | 164.18 |
2024-04-10 | 0.85515 | 164.89 |
... | ... | ... |
2024-01-24 | 0.85543 | 160.46 |
2024-01-23 | 0.85493 | 160.88 |
2024-01-22 | 0.85575 | 160.95 |
2024-01-19 | 0.85825 | 161.17 |
2024-01-18 | 0.85773 | 160.89 |
62 rows × 2 columns
Back to nycflights13:
- Which airline carrier had the longest mean delay?
flights.groupby('carrier')['arr_delay'].mean().sort_values(ascending=False)
carrier
F9 21.920705
FL 20.115906
EV 15.796431
YV 15.556985
OO 11.931034
MQ 10.774733
WN 9.649120
B6 9.457973
9E 7.379669
UA 3.558011
US 2.129595
VX 1.764464
DL 1.644341
AA 0.364291
HA -6.915205
AS -9.930889
Name: arr_delay, dtype: float64
- What days of week had highest mean delay times?
# convert time_hour to datetime dtype
flights['time_hour'] = pd.to_datetime(flights['time_hour'])

# you can get the day of week as an integer (Monday=0, Sunday=6)
flights['dayofweek'] = flights['time_hour'].dt.dayofweek

# you can also get the day name:
flights['day_name'] = flights['time_hour'].dt.day_name()
flights.groupby('day_name')['arr_delay'].mean().sort_values(ascending=False)
day_name
Thursday 11.044998
Monday 9.846577
Friday 9.397136
Wednesday 6.581860
Tuesday 6.118452
Sunday 2.645165
Saturday 1.126204
Name: arr_delay, dtype: float64
- What month had highest mean delay times?
flights.groupby('month')['arr_delay'].mean().sort_values(ascending=False)
month
7 16.711307
6 16.481330
12 14.870355
4 11.176063
1 6.129972
8 6.040652
3 5.807577
2 5.613019
5 3.521509
11 0.461347
10 -0.167063
9 -4.018364
Name: arr_delay, dtype: float64
- What day-of-week / month combo had largest mean delay times?
flights.groupby(['month', 'day_name'])['arr_delay'].mean().sort_values(ascending=False)
month day_name
6 Monday 30.650388
7 Monday 29.503160
4 Friday 25.632658
Thursday 24.991816
8 Thursday 22.258595
...
5 Tuesday -4.991189
9 Tuesday -6.252610
Wednesday -7.006858
Sunday -8.932241
Saturday -12.576506
Name: arr_delay, Length: 84, dtype: float64
- What unique day had the largest mean delay times?
flights.groupby(['month', 'day'])['arr_delay'].mean().sort_values(ascending=False)
month day
3 8 85.862155
6 13 63.753689
7 22 62.763403
5 23 61.970899
7 10 59.626478
...
9 5 -15.540373
28 -16.371852
6 -17.895010
10 1 -18.959375
9 7 -20.349854
Name: arr_delay, Length: 365, dtype: float64
- Plot the delays by month for each carrier:
sns.relplot(data=flights,
            x='month',
            y='arr_delay',
            hue='carrier',
            kind='line')  # the default in relplot is a scatterplot - we want a line plot here
The default behavior in seaborn is to aggregate the multiple measurements at each x value by plotting the mean and the 95% confidence interval around the mean, where the confidence interval is calculated using bootstrapping.
sns.relplot(data=flights,
            x='month',
            y='arr_delay',
            hue='carrier',
            kind='line',
            errorbar=None)  # the confidence interval can be disabled with errorbar=None
- Now only plot the top 5 carriers (by number of flights)
top_carriers = flights['carrier'].value_counts().reset_index()
top_carriers = top_carriers.carrier[0:5]
top_carriers = top_carriers.to_list()
flights_subset = flights[flights.carrier.isin(top_carriers)]
sns.relplot(data=flights_subset,
            x='month',
            y='arr_delay',
            hue='carrier',
            kind='line',
            height=4,
            aspect=3)
- What about by hour of day?
sns.relplot(data=flights_subset,
            x='hour',
            y='arr_delay',
            hue='carrier',
            kind='line',
            height=4,
            aspect=3)
Categorical Data
Sometimes you see categorical data (e.g. USD, JPY etc.) represented as integers, often for ease of data storage. These data will also have a look-up table, so you can tell what integer corresponds to what category.
values = pd.Series([0, 1, 0, 0] * 2)
dim = pd.Series(['orange', 'apple'])
print(values)
0 0
1 1
2 0
3 0
4 0
5 1
6 0
7 0
dtype: int64
print(dim)
0 orange
1 apple
dtype: object
dim.take(values)
0 orange
1 apple
0 orange
0 orange
0 orange
1 apple
0 orange
0 orange
dtype: object
pandas has a special Categorical extension type for holding data that uses the integer-based categorical representation or encoding. This is a popular data compression technique for data with many occurrences of similar values and can provide significantly faster performance with lower memory use, especially for string data.
fruit = pd.Categorical.from_codes(values, dim)
fruit
['orange', 'apple', 'orange', 'orange', 'orange', 'apple', 'orange', 'orange']
Categories (2, object): ['orange', 'apple']
df_long['currency']
0 USD
1 USD
2 USD
3 USD
4 USD
...
429 CHF
430 CHF
431 CHF
432 CHF
433 CHF
Name: currency, Length: 434, dtype: object
currency_object = df_long['currency'].copy()
df_long['currency'] = df_long['currency'].astype('category')
df_long['currency']
0 USD
1 USD
2 USD
3 USD
4 USD
...
429 CHF
430 CHF
431 CHF
432 CHF
433 CHF
Name: currency, Length: 434, dtype: category
Categories (7, object): ['BGN', 'CHF', 'CZK', 'DKK', 'GBP', 'JPY', 'USD']
currency_cat = df_long['currency']
currency_object.memory_usage(deep=True)
22700
currency_cat.memory_usage(deep=True)
1230
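Under the hood, a categorical Series stores integer codes plus a small lookup table of categories; both can be inspected via the .cat accessor (a quick sketch):
# the category lookup table and the integer codes behind the categorical column
currency_cat.cat.categories
currency_cat.cat.codes.head()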
Binning
Continuous data is often discretized or otherwise separated into “bins” for analysis.
For the NYC flights data, let’s turn the integer day of week into ‘weekday’ or ‘weekend’ using pd.cut.
pd.cut(flights['dayofweek'], bins=[0, 5, 7], right=False)  # right=False specifies do not include the right endpoint
0 [0, 5)
1 [0, 5)
2 [0, 5)
3 [0, 5)
4 [0, 5)
...
336771 [0, 5)
336772 [0, 5)
336773 [0, 5)
336774 [0, 5)
336775 [0, 5)
Name: dayofweek, Length: 336776, dtype: category
Categories (2, interval[int64, left]): [[0, 5) < [5, 7)]
day_type = pd.cut(flights['dayofweek'], bins=[0, 5, 7], labels=['weekday', 'weekend'], right=False)
day_type
0 weekday
1 weekday
2 weekday
3 weekday
4 weekday
...
336771 weekday
336772 weekday
336773 weekday
336774 weekday
336775 weekday
Name: dayofweek, Length: 336776, dtype: category
Categories (2, object): ['weekday' < 'weekend']
day_type.cat.categories
Index(['weekday', 'weekend'], dtype='object')
Note that pd.cut produces ordered categories (['weekday' < 'weekend']).
day_type.cat.ordered
True
You can create a new unordered categorical vector by:
day_type.cat.as_unordered()
0 weekday
1 weekday
2 weekday
3 weekday
4 weekday
...
336771 weekday
336772 weekday
336773 weekday
336774 weekday
336775 weekday
Name: dayofweek, Length: 336776, dtype: category
Categories (2, object): ['weekday', 'weekend']
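Besides pd.cut with explicit bin edges, pd.qcut bins by sample quantiles; a minimal sketch on the arrival delays (the quartile labels are illustrative):
# quantile-based binning: four equal-sized groups of arrival delays
pd.qcut(flights['arr_delay'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])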
Illustration: Penguins data
Missing data
df_penguins = sns.load_dataset('penguins')
df_penguins
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
344 rows × 7 columns
How to get rid of NaNs?
keep = df_penguins.apply(lambda x: not any(x.isna()), axis=1)
df_penguins[keep]
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
... | ... | ... | ... | ... | ... | ... | ... |
338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
333 rows × 7 columns
Alternatively, use the dropna method:
df_penguins.dropna()  # drop rows with at least one NaN
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
... | ... | ... | ... | ... | ... | ... | ... |
338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
333 rows × 7 columns
We can also fill the NaNs with a particular value if we have some prior knowledge of what the values should be:
df_penguins.fillna(0)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
344 rows × 7 columns
We can fill the values with different values for different columns:
df_penguins.fillna({'bill_length_mm': 0, 'sex': 'Female'})
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | 0.0 | NaN | NaN | NaN | Female |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | 0.0 | NaN | NaN | NaN | Female |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
344 rows × 7 columns
df_penguins.mean(numeric_only=True)
bill_length_mm 43.921930
bill_depth_mm 17.151170
flipper_length_mm 200.915205
body_mass_g 4201.754386
dtype: float64
We can fill in with other values, like the mean:
df_penguins.fillna(df_penguins.mean(numeric_only=True))
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.10000 | 18.70000 | 181.000000 | 3750.000000 | Male |
1 | Adelie | Torgersen | 39.50000 | 17.40000 | 186.000000 | 3800.000000 | Female |
2 | Adelie | Torgersen | 40.30000 | 18.00000 | 195.000000 | 3250.000000 | Female |
3 | Adelie | Torgersen | 43.92193 | 17.15117 | 200.915205 | 4201.754386 | NaN |
4 | Adelie | Torgersen | 36.70000 | 19.30000 | 193.000000 | 3450.000000 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | 43.92193 | 17.15117 | 200.915205 | 4201.754386 | NaN |
340 | Gentoo | Biscoe | 46.80000 | 14.30000 | 215.000000 | 4850.000000 | Female |
341 | Gentoo | Biscoe | 50.40000 | 15.70000 | 222.000000 | 5750.000000 | Male |
342 | Gentoo | Biscoe | 45.20000 | 14.80000 | 212.000000 | 5200.000000 | Female |
343 | Gentoo | Biscoe | 49.90000 | 16.10000 | 213.000000 | 5400.000000 | Male |
344 rows × 7 columns
For now, let’s proceed with dropping the rows which have an NaN value:
df_penguins = df_penguins.dropna()
Checking for duplicated rows
df_penguins.duplicated()
0 False
1 False
2 False
4 False
5 False
...
338 False
340 False
341 False
342 False
343 False
Length: 333, dtype: bool
any(df_penguins.duplicated())
False
# if we did have duplicates, we could use
df_penguins.drop_duplicates()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
... | ... | ... | ... | ... | ... | ... | ... |
338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
333 rows × 7 columns
Computing Indicator/Dummy Variables
Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a dummy or indicator matrix. If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns containing all 1s and 0s. pandas has a pandas.get_dummies function for doing this, though you could also devise one yourself.
pd.get_dummies(df_penguins.sex)
Female | Male | |
---|---|---|
0 | False | True |
1 | True | False |
2 | True | False |
4 | True | False |
5 | False | True |
... | ... | ... |
338 | True | False |
340 | True | False |
341 | False | True |
342 | True | False |
343 | False | True |
333 rows × 2 columns
pd.get_dummies(df_penguins.sex, dtype=float)
Female | Male | |
---|---|---|
0 | 0.0 | 1.0 |
1 | 1.0 | 0.0 |
2 | 1.0 | 0.0 |
4 | 1.0 | 0.0 |
5 | 0.0 | 1.0 |
... | ... | ... |
338 | 1.0 | 0.0 |
340 | 1.0 | 0.0 |
341 | 0.0 | 1.0 |
342 | 1.0 | 0.0 |
343 | 0.0 | 1.0 |
333 rows × 2 columns
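get_dummies can also encode selected columns of a whole DataFrame; a minimal sketch (drop_first=True drops one redundant level per variable):
# encode species and sex, keeping the remaining columns unchanged
pd.get_dummies(df_penguins, columns=['species', 'sex'], drop_first=True, dtype=float)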
Seaborn pairplot
The default relplot is a scatterplot:
sns.relplot(data=df_penguins,
            x='bill_length_mm',
            y='bill_depth_mm')

sns.relplot(data=df_penguins,
            x='bill_length_mm',
            y='bill_depth_mm',
            hue='species')

sns.relplot(data=df_penguins,
            x='bill_length_mm',
            y='bill_depth_mm',
            hue='species',
            style='sex')

sns.relplot(data=df_penguins,
            x='bill_length_mm',
            y='flipper_length_mm',
            hue='species',
            style='sex')

sns.relplot(data=df_penguins,
            x='bill_length_mm',
            y='flipper_length_mm',
            hue='species',
            col='sex')
By default, pairplot creates a grid of Axes such that each numeric variable in the data is shared across the y-axes of a single row and the x-axes of a single column.
sns.pairplot(data=df_penguins, hue='species')
Groupby with multiple categories
Let’s take a look at summary statistics.
With .agg, we can apply several functions at once.
grouped = df_penguins.groupby(['species', 'island', 'sex'])
grouped.agg(['mean', 'median'])
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |||||||
---|---|---|---|---|---|---|---|---|---|---|
mean | median | mean | median | mean | median | mean | median | |||
species | island | sex | ||||||||
Adelie | Biscoe | Female | 37.359091 | 37.75 | 17.704545 | 17.70 | 187.181818 | 187.0 | 3369.318182 | 3375.0 |
Male | 40.590909 | 40.80 | 19.036364 | 18.90 | 190.409091 | 191.0 | 4050.000000 | 4000.0 | ||
Dream | Female | 36.911111 | 36.80 | 17.618519 | 17.80 | 187.851852 | 188.0 | 3344.444444 | 3400.0 | |
Male | 40.071429 | 40.25 | 18.839286 | 18.65 | 191.928571 | 190.5 | 4045.535714 | 3987.5 | ||
Torgersen | Female | 37.554167 | 37.60 | 17.550000 | 17.45 | 188.291667 | 189.0 | 3395.833333 | 3400.0 | |
Male | 40.586957 | 41.10 | 19.391304 | 19.20 | 194.913043 | 195.0 | 4034.782609 | 4000.0 | ||
Chinstrap | Dream | Female | 46.573529 | 46.30 | 17.588235 | 17.65 | 191.735294 | 192.0 | 3527.205882 | 3550.0 |
Male | 51.094118 | 50.95 | 19.252941 | 19.30 | 199.911765 | 200.5 | 3938.970588 | 3950.0 | ||
Gentoo | Biscoe | Female | 45.563793 | 45.50 | 14.237931 | 14.25 | 212.706897 | 212.0 | 4679.741379 | 4700.0 |
Male | 49.473770 | 49.50 | 15.718033 | 15.70 | 221.540984 | 221.0 | 5484.836066 | 5500.0 |
Note: .agg with a single function is equivalent to .apply
all(grouped.agg('mean') == grouped.apply('mean'))
True
Hierarchical indexing
Hierarchical indexing is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis.
grouped_mm = grouped.agg(['mean', 'median'])
grouped_mm.index
MultiIndex([( 'Adelie', 'Biscoe', 'Female'),
( 'Adelie', 'Biscoe', 'Male'),
( 'Adelie', 'Dream', 'Female'),
( 'Adelie', 'Dream', 'Male'),
( 'Adelie', 'Torgersen', 'Female'),
( 'Adelie', 'Torgersen', 'Male'),
('Chinstrap', 'Dream', 'Female'),
('Chinstrap', 'Dream', 'Male'),
( 'Gentoo', 'Biscoe', 'Female'),
( 'Gentoo', 'Biscoe', 'Male')],
names=['species', 'island', 'sex'])
Here, the columns also have a MultiIndex.
grouped_mm.columns
MultiIndex([( 'bill_length_mm', 'mean'),
( 'bill_length_mm', 'median'),
( 'bill_depth_mm', 'mean'),
( 'bill_depth_mm', 'median'),
('flipper_length_mm', 'mean'),
('flipper_length_mm', 'median'),
( 'body_mass_g', 'mean'),
( 'body_mass_g', 'median')],
)
You can see how many levels an index has by accessing its nlevels attribute:
grouped_mm.index.nlevels
3
grouped_mm.index.levels
FrozenList([['Adelie', 'Chinstrap', 'Gentoo'], ['Biscoe', 'Dream', 'Torgersen'], ['Female', 'Male']])
grouped_mm.columns.nlevels
2
With partial column indexing you can select groups of columns:
grouped_mm['bill_depth_mm']
mean | median | |||
---|---|---|---|---|
species | island | sex | ||
Adelie | Biscoe | Female | 17.704545 | 17.70 |
Male | 19.036364 | 18.90 | ||
Dream | Female | 17.618519 | 17.80 | |
Male | 18.839286 | 18.65 | ||
Torgersen | Female | 17.550000 | 17.45 | |
Male | 19.391304 | 19.20 | ||
Chinstrap | Dream | Female | 17.588235 | 17.65 |
Male | 19.252941 | 19.30 | ||
Gentoo | Biscoe | Female | 14.237931 | 14.25 |
Male | 15.718033 | 15.70 |
grouped_mm.loc['Adelie']
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | ||||||
---|---|---|---|---|---|---|---|---|---|
mean | median | mean | median | mean | median | mean | median | ||
island | sex | ||||||||
Biscoe | Female | 37.359091 | 37.75 | 17.704545 | 17.70 | 187.181818 | 187.0 | 3369.318182 | 3375.0 |
Male | 40.590909 | 40.80 | 19.036364 | 18.90 | 190.409091 | 191.0 | 4050.000000 | 4000.0 | |
Dream | Female | 36.911111 | 36.80 | 17.618519 | 17.80 | 187.851852 | 188.0 | 3344.444444 | 3400.0 |
Male | 40.071429 | 40.25 | 18.839286 | 18.65 | 191.928571 | 190.5 | 4045.535714 | 3987.5 | |
Torgersen | Female | 37.554167 | 37.60 | 17.550000 | 17.45 | 188.291667 | 189.0 | 3395.833333 | 3400.0 |
Male | 40.586957 | 41.10 | 19.391304 | 19.20 | 194.913043 | 195.0 | 4034.782609 | 4000.0 |
grouped_mm.loc['Adelie', "body_mass_g"]  # works for the level 0 index on the rows or the columns
mean | median | ||
---|---|---|---|
island | sex | ||
Biscoe | Female | 3369.318182 | 3375.0 |
Male | 4050.000000 | 4000.0 | |
Dream | Female | 3344.444444 | 3400.0 |
Male | 4045.535714 | 3987.5 | |
Torgersen | Female | 3395.833333 | 3400.0 |
Male | 4034.782609 | 4000.0 |
# we can't do this for the "inner level"
# grouped_mm.loc['Biscoe']
# grouped_mm.loc['Adelie', "mean"]
To traverse levels (i.e. get one key from first level and another key from second level), use tuples:
grouped_mm.loc[('Adelie', 'Biscoe')]
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |||||
---|---|---|---|---|---|---|---|---|
mean | median | mean | median | mean | median | mean | median | |
sex | ||||||||
Female | 37.359091 | 37.75 | 17.704545 | 17.7 | 187.181818 | 187.0 | 3369.318182 | 3375.0 |
Male | 40.590909 | 40.80 | 19.036364 | 18.9 | 190.409091 | 191.0 | 4050.000000 | 4000.0 |
grouped_mm.loc[('Adelie', 'Biscoe'), ('bill_length_mm', 'mean')]
sex
Female 37.359091
Male 40.590909
Name: (bill_length_mm, mean), dtype: float64
grouped_mm.loc[:, ('bill_length_mm', 'mean')]
species island sex
Adelie Biscoe Female 37.359091
Male 40.590909
Dream Female 36.911111
Male 40.071429
Torgersen Female 37.554167
Male 40.586957
Chinstrap Dream Female 46.573529
Male 51.094118
Gentoo Biscoe Female 45.563793
Male 49.473770
Name: (bill_length_mm, mean), dtype: float64
A list is used to specify multiple keys:
grouped_mm.loc[['Adelie', 'Gentoo'], ["body_mass_g", "flipper_length_mm"]]
body_mass_g | flipper_length_mm | |||||
---|---|---|---|---|---|---|
mean | median | mean | median | |||
species | island | sex | ||||
Adelie | Biscoe | Female | 3369.318182 | 3375.0 | 187.181818 | 187.0 |
Male | 4050.000000 | 4000.0 | 190.409091 | 191.0 | ||
Dream | Female | 3344.444444 | 3400.0 | 187.851852 | 188.0 | |
Male | 4045.535714 | 3987.5 | 191.928571 | 190.5 | ||
Torgersen | Female | 3395.833333 | 3400.0 | 188.291667 | 189.0 | |
Male | 4034.782609 | 4000.0 | 194.913043 | 195.0 | ||
Gentoo | Biscoe | Female | 4679.741379 | 4700.0 | 212.706897 | 212.0 |
Male | 5484.836066 | 5500.0 | 221.540984 | 221.0 |
# list of tuples
grouped_mm.loc[[('Adelie', 'Dream', 'Male'), ('Gentoo', 'Biscoe', 'Female')]]
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |||||||
---|---|---|---|---|---|---|---|---|---|---|
mean | median | mean | median | mean | median | mean | median | |||
species | island | sex | ||||||||
Adelie | Dream | Male | 40.071429 | 40.25 | 18.839286 | 18.65 | 191.928571 | 190.5 | 4045.535714 | 3987.5 |
Gentoo | Biscoe | Female | 45.563793 | 45.50 | 14.237931 | 14.25 | 212.706897 | 212.0 | 4679.741379 | 4700.0 |
It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).
See more examples here: https://pandas.pydata.org/docs/user_guide/advanced.html#advanced-indexing-with-hierarchical-index
s = pd.Series(
    [1, 2, 3, 4, 5, 6],
    index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]),
)
s
A c 1
d 2
e 3
B c 4
d 5
e 6
dtype: int64
s.loc[[("A", "c"), ("B", "d")]]  # list of tuples
A c 1
B d 5
dtype: int64
s.loc[(["A", "B"], ["c", "d"])]  # tuple of lists
A c 1
d 2
B c 4
d 5
dtype: int64
Reordering and sorting levels
At times you may need to rearrange the order of the levels on an axis or sort the data by the values in one specific level. The swaplevel method takes two level numbers or names and returns a new object with the levels interchanged (but the data is otherwise unaltered):
grouped_mm.swaplevel('species', 'island')
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |||||||
---|---|---|---|---|---|---|---|---|---|---|
mean | median | mean | median | mean | median | mean | median | |||
island | species | sex | ||||||||
Biscoe | Adelie | Female | 37.359091 | 37.75 | 17.704545 | 17.70 | 187.181818 | 187.0 | 3369.318182 | 3375.0 |
Male | 40.590909 | 40.80 | 19.036364 | 18.90 | 190.409091 | 191.0 | 4050.000000 | 4000.0 | ||
Dream | Adelie | Female | 36.911111 | 36.80 | 17.618519 | 17.80 | 187.851852 | 188.0 | 3344.444444 | 3400.0 |
Male | 40.071429 | 40.25 | 18.839286 | 18.65 | 191.928571 | 190.5 | 4045.535714 | 3987.5 | ||
Torgersen | Adelie | Female | 37.554167 | 37.60 | 17.550000 | 17.45 | 188.291667 | 189.0 | 3395.833333 | 3400.0 |
Male | 40.586957 | 41.10 | 19.391304 | 19.20 | 194.913043 | 195.0 | 4034.782609 | 4000.0 | ||
Dream | Chinstrap | Female | 46.573529 | 46.30 | 17.588235 | 17.65 | 191.735294 | 192.0 | 3527.205882 | 3550.0 |
Male | 51.094118 | 50.95 | 19.252941 | 19.30 | 199.911765 | 200.5 | 3938.970588 | 3950.0 | ||
Biscoe | Gentoo | Female | 45.563793 | 45.50 | 14.237931 | 14.25 | 212.706897 | 212.0 | 4679.741379 | 4700.0 |
Male | 49.473770 | 49.50 | 15.718033 | 15.70 | 221.540984 | 221.0 | 5484.836066 | 5500.0 |
grouped_mm.swaplevel('species', 'island').sort_index()
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |||||||
---|---|---|---|---|---|---|---|---|---|---|
mean | median | mean | median | mean | median | mean | median | |||
island | species | sex | ||||||||
Biscoe | Adelie | Female | 37.359091 | 37.75 | 17.704545 | 17.70 | 187.181818 | 187.0 | 3369.318182 | 3375.0 |
Male | 40.590909 | 40.80 | 19.036364 | 18.90 | 190.409091 | 191.0 | 4050.000000 | 4000.0 | ||
Gentoo | Female | 45.563793 | 45.50 | 14.237931 | 14.25 | 212.706897 | 212.0 | 4679.741379 | 4700.0 | |
Male | 49.473770 | 49.50 | 15.718033 | 15.70 | 221.540984 | 221.0 | 5484.836066 | 5500.0 | ||
Dream | Adelie | Female | 36.911111 | 36.80 | 17.618519 | 17.80 | 187.851852 | 188.0 | 3344.444444 | 3400.0 |
Male | 40.071429 | 40.25 | 18.839286 | 18.65 | 191.928571 | 190.5 | 4045.535714 | 3987.5 | ||
Chinstrap | Female | 46.573529 | 46.30 | 17.588235 | 17.65 | 191.735294 | 192.0 | 3527.205882 | 3550.0 | |
Male | 51.094118 | 50.95 | 19.252941 | 19.30 | 199.911765 | 200.5 | 3938.970588 | 3950.0 | ||
Torgersen | Adelie | Female | 37.554167 | 37.60 | 17.550000 | 17.45 | 188.291667 | 189.0 | 3395.833333 | 3400.0 |
Male | 40.586957 | 41.10 | 19.391304 | 19.20 | 194.913043 | 195.0 | 4034.782609 | 4000.0 |
grouped_mm.reorder_levels([2, 1, 0], axis=0)
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |||||||
---|---|---|---|---|---|---|---|---|---|---|
mean | median | mean | median | mean | median | mean | median | |||
sex | island | species | ||||||||
Female | Biscoe | Adelie | 37.359091 | 37.75 | 17.704545 | 17.70 | 187.181818 | 187.0 | 3369.318182 | 3375.0 |
Male | Biscoe | Adelie | 40.590909 | 40.80 | 19.036364 | 18.90 | 190.409091 | 191.0 | 4050.000000 | 4000.0 |
Female | Dream | Adelie | 36.911111 | 36.80 | 17.618519 | 17.80 | 187.851852 | 188.0 | 3344.444444 | 3400.0 |
Male | Dream | Adelie | 40.071429 | 40.25 | 18.839286 | 18.65 | 191.928571 | 190.5 | 4045.535714 | 3987.5 |
Female | Torgersen | Adelie | 37.554167 | 37.60 | 17.550000 | 17.45 | 188.291667 | 189.0 | 3395.833333 | 3400.0 |
Male | Torgersen | Adelie | 40.586957 | 41.10 | 19.391304 | 19.20 | 194.913043 | 195.0 | 4034.782609 | 4000.0 |
Female | Dream | Chinstrap | 46.573529 | 46.30 | 17.588235 | 17.65 | 191.735294 | 192.0 | 3527.205882 | 3550.0 |
Male | Dream | Chinstrap | 51.094118 | 50.95 | 19.252941 | 19.30 | 199.911765 | 200.5 | 3938.970588 | 3950.0 |
Female | Biscoe | Gentoo | 45.563793 | 45.50 | 14.237931 | 14.25 | 212.706897 | 212.0 | 4679.741379 | 4700.0 |
Male | Biscoe | Gentoo | 49.473770 | 49.50 | 15.718033 | 15.70 | 221.540984 | 221.0 | 5484.836066 | 5500.0 |
grouped_mm.reorder_levels([1, 0], axis=1)
mean | median | mean | median | mean | median | mean | median | |||
---|---|---|---|---|---|---|---|---|---|---|
bill_length_mm | bill_length_mm | bill_depth_mm | bill_depth_mm | flipper_length_mm | flipper_length_mm | body_mass_g | body_mass_g | |||
species | island | sex | ||||||||
Adelie | Biscoe | Female | 37.359091 | 37.75 | 17.704545 | 17.70 | 187.181818 | 187.0 | 3369.318182 | 3375.0 |
Male | 40.590909 | 40.80 | 19.036364 | 18.90 | 190.409091 | 191.0 | 4050.000000 | 4000.0 | ||
Dream | Female | 36.911111 | 36.80 | 17.618519 | 17.80 | 187.851852 | 188.0 | 3344.444444 | 3400.0 | |
Male | 40.071429 | 40.25 | 18.839286 | 18.65 | 191.928571 | 190.5 | 4045.535714 | 3987.5 | ||
Torgersen | Female | 37.554167 | 37.60 | 17.550000 | 17.45 | 188.291667 | 189.0 | 3395.833333 | 3400.0 | |
Male | 40.586957 | 41.10 | 19.391304 | 19.20 | 194.913043 | 195.0 | 4034.782609 | 4000.0 | ||
Chinstrap | Dream | Female | 46.573529 | 46.30 | 17.588235 | 17.65 | 191.735294 | 192.0 | 3527.205882 | 3550.0 |
Male | 51.094118 | 50.95 | 19.252941 | 19.30 | 199.911765 | 200.5 | 3938.970588 | 3950.0 | ||
Gentoo | Biscoe | Female | 45.563793 | 45.50 | 14.237931 | 14.25 | 212.706897 | 212.0 | 4679.741379 | 4700.0 |
Male | 49.473770 | 49.50 | 15.718033 | 15.70 | 221.540984 | 221.0 | 5484.836066 | 5500.0 |
Summary statistics by level
Many descriptive and summary statistics on DataFrame and Series have a level option in which you can specify the level you want to aggregate by on a particular axis.
grouped_mm.groupby(level=0).mean()
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |||||
---|---|---|---|---|---|---|---|---|
mean | median | mean | median | mean | median | mean | median | |
species | ||||||||
Adelie | 38.845610 | 39.050 | 18.356670 | 18.283333 | 190.096007 | 190.083333 | 3706.652380 | 3693.75 |
Chinstrap | 48.833824 | 48.625 | 18.420588 | 18.475000 | 195.823529 | 196.250000 | 3733.088235 | 3750.00 |
Gentoo | 47.518782 | 47.500 | 14.977982 | 14.975000 | 217.123940 | 216.500000 | 5082.288722 | 5100.00 |
grouped_mm.groupby(level=0, axis='columns').mean()  # takes the mean over column level 1
/var/folders/f0/m7l23y8s7p3_0x04b3td9nyjr2hyc8/T/ipykernel_4760/3198538097.py:1: FutureWarning:
DataFrame.groupby with axis=1 is deprecated. Do `frame.T.groupby(...)` without axis instead.
bill_depth_mm | bill_length_mm | body_mass_g | flipper_length_mm | |||
---|---|---|---|---|---|---|
species | island | sex | ||||
Adelie | Biscoe | Female | 17.702273 | 37.554545 | 3372.159091 | 187.090909 |
Male | 18.968182 | 40.695455 | 4025.000000 | 190.704545 | ||
Dream | Female | 17.709259 | 36.855556 | 3372.222222 | 187.925926 | |
Male | 18.744643 | 40.160714 | 4016.517857 | 191.214286 | ||
Torgersen | Female | 17.500000 | 37.577083 | 3397.916667 | 188.645833 | |
Male | 19.295652 | 40.843478 | 4017.391304 | 194.956522 | ||
Chinstrap | Dream | Female | 17.619118 | 46.436765 | 3538.602941 | 191.867647 |
Male | 19.276471 | 51.022059 | 3944.485294 | 200.205882 | ||
Gentoo | Biscoe | Female | 14.243966 | 45.531897 | 4689.870690 | 212.353448 |
Male | 15.709016 | 49.486885 | 5492.418033 | 221.270492 |
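As the warning above notes, groupby with axis='columns' is deprecated; a sketch of the recommended transpose-based equivalent:
# transpose, group the (former column) level 0, take the mean, then transpose back
grouped_mm.T.groupby(level=0).mean().T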
Cross-sections: Using .xs to select a subset of rows at a given level
grouped_mm.xs('Adelie')
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | ||||||
---|---|---|---|---|---|---|---|---|---|
mean | median | mean | median | mean | median | mean | median | ||
island | sex | ||||||||
Biscoe | Female | 37.359091 | 37.75 | 17.704545 | 17.70 | 187.181818 | 187.0 | 3369.318182 | 3375.0 |
Male | 40.590909 | 40.80 | 19.036364 | 18.90 | 190.409091 | 191.0 | 4050.000000 | 4000.0 | |
Dream | Female | 36.911111 | 36.80 | 17.618519 | 17.80 | 187.851852 | 188.0 | 3344.444444 | 3400.0 |
Male | 40.071429 | 40.25 | 18.839286 | 18.65 | 191.928571 | 190.5 | 4045.535714 | 3987.5 | |
Torgersen | Female | 37.554167 | 37.60 | 17.550000 | 17.45 | 188.291667 | 189.0 | 3395.833333 | 3400.0 |
Male | 40.586957 | 41.10 | 19.391304 | 19.20 | 194.913043 | 195.0 | 4034.782609 | 4000.0 |
grouped_mm.xs('Adelie', drop_level=False)
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |||||||
---|---|---|---|---|---|---|---|---|---|---|
mean | median | mean | median | mean | median | mean | median | |||
species | island | sex | ||||||||
Adelie | Biscoe | Female | 37.359091 | 37.75 | 17.704545 | 17.70 | 187.181818 | 187.0 | 3369.318182 | 3375.0 |
Male | 40.590909 | 40.80 | 19.036364 | 18.90 | 190.409091 | 191.0 | 4050.000000 | 4000.0 | ||
Dream | Female | 36.911111 | 36.80 | 17.618519 | 17.80 | 187.851852 | 188.0 | 3344.444444 | 3400.0 | |
Male | 40.071429 | 40.25 | 18.839286 | 18.65 | 191.928571 | 190.5 | 4045.535714 | 3987.5 | ||
Torgersen | Female | 37.554167 | 37.60 | 17.550000 | 17.45 | 188.291667 | 189.0 | 3395.833333 | 3400.0 | |
Male | 40.586957 | 41.10 | 19.391304 | 19.20 | 194.913043 | 195.0 | 4034.782609 | 4000.0 |
grouped_mm.xs('Male', level=2, drop_level=False)  # we need to specify the level, if it is not level=0
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |||||||
---|---|---|---|---|---|---|---|---|---|---|
mean | median | mean | median | mean | median | mean | median | |||
species | island | sex | ||||||||
Adelie | Biscoe | Male | 40.590909 | 40.80 | 19.036364 | 18.90 | 190.409091 | 191.0 | 4050.000000 | 4000.0 |
Dream | Male | 40.071429 | 40.25 | 18.839286 | 18.65 | 191.928571 | 190.5 | 4045.535714 | 3987.5 | |
Torgersen | Male | 40.586957 | 41.10 | 19.391304 | 19.20 | 194.913043 | 195.0 | 4034.782609 | 4000.0 | |
Chinstrap | Dream | Male | 51.094118 | 50.95 | 19.252941 | 19.30 | 199.911765 | 200.5 | 3938.970588 | 3950.0 |
Gentoo | Biscoe | Male | 49.473770 | 49.50 | 15.718033 | 15.70 | 221.540984 | 221.0 | 5484.836066 | 5500.0 |
# note: we need a tuple, not a list
grouped_mm.xs(('Adelie', 'Dream'), level=(0, 1), drop_level=False)
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | |||||||
---|---|---|---|---|---|---|---|---|---|---|
mean | median | mean | median | mean | median | mean | median | |||
species | island | sex | ||||||||
Adelie | Dream | Female | 36.911111 | 36.80 | 17.618519 | 17.80 | 187.851852 | 188.0 | 3344.444444 | 3400.0 |
Male | 40.071429 | 40.25 | 18.839286 | 18.65 | 191.928571 | 190.5 | 4045.535714 | 3987.5 |
# for the columns, we need axis=1
grouped_mm.xs(('bill_length_mm', 'mean'), level=(0, 1), axis=1)
bill_length_mm | |||
---|---|---|---|
mean | |||
species | island | sex | |
Adelie | Biscoe | Female | 37.359091 |
Male | 40.590909 | ||
Dream | Female | 36.911111 | |
Male | 40.071429 | ||
Torgersen | Female | 37.554167 | |
Male | 40.586957 | ||
Chinstrap | Dream | Female | 46.573529 |
Male | 51.094118 | ||
Gentoo | Biscoe | Female | 45.563793 |
Male | 49.473770 |
Combining and Merging Datasets
Data contained in pandas objects can be combined in a number of ways:
- pandas.merge: connect rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
- pandas.concat: concatenate or “stack” objects together along an axis. (Lecture 2)
Database-Style DataFrame Joins
Merge or join operations combine datasets by linking rows using one or more keys. These operations are particularly important in relational databases (e.g., SQL-based). The pandas.merge function in pandas is the main entry point for using these algorithms on your data.
df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "a", "b"],
                    "data1": pd.Series(range(7))})
df2 = pd.DataFrame({"key": ["a", "b", "d"],
                    "data2": pd.Series(range(3))})
df1
key | data1 | |
---|---|---|
0 | b | 0 |
1 | b | 1 |
2 | a | 2 |
3 | c | 3 |
4 | a | 4 |
5 | a | 5 |
6 | b | 6 |
df2
key | data2 | |
---|---|---|
0 | a | 0 |
1 | b | 1 |
2 | d | 2 |
pd.merge(df1, df2, on='key')
key | data1 | data2 | |
---|---|---|---|
0 | b | 0 | 1 |
1 | b | 1 | 1 |
2 | a | 2 | 0 |
3 | a | 4 | 0 |
4 | a | 5 | 0 |
5 | b | 6 | 1 |
If the column names are different in each object, you can specify them separately:
df3 = pd.DataFrame({"lkey": ["b", "b", "a", "c", "a", "a", "b"],
                    "data1": pd.Series(range(7), dtype="Int64")})
df4 = pd.DataFrame({"rkey": ["a", "b", "d"],
                    "data2": pd.Series(range(3), dtype="Int64")})
pd.merge(df3, df4, left_on="lkey", right_on="rkey")
lkey | data1 | rkey | data2 | |
---|---|---|---|---|
0 | b | 0 | b | 1 |
1 | b | 1 | b | 1 |
2 | a | 2 | a | 0 |
3 | a | 4 | a | 0 |
4 | a | 5 | a | 0 |
5 | b | 6 | b | 1 |
You may notice that the "c" and "d" values and associated data are missing from the result. By default, pandas.merge does an "inner" join; the keys in the result are the intersection, or the common set found in both tables. Other possible options are "left", "right", and "outer". The outer join takes the union of the keys, combining the effect of applying both left and right joins:
Option | Behavior |
---|---|
how=“inner” | Use only the key combinations observed in both tables |
how=“left” | Use all key combinations found in the left table |
how=“right” | Use all key combinations found in the right table |
how=“outer” | Use all key combinations observed in both tables together |
ind1 = set(df1.key)
ind2 = set(df2.key)
ind1.intersection(ind2)  # inner join
{'a', 'b'}
pd.merge(df1, df2, how="inner")
key | data1 | data2 | |
---|---|---|---|
0 | b | 0 | 1 |
1 | b | 1 | 1 |
2 | a | 2 | 0 |
3 | a | 4 | 0 |
4 | a | 5 | 0 |
5 | b | 6 | 1 |
ind2.union(ind1)  # outer join
{'a', 'b', 'c', 'd'}
pd.merge(df1, df2, how="outer")
key | data1 | data2 | |
---|---|---|---|
0 | a | 2.0 | 0.0 |
1 | a | 4.0 | 0.0 |
2 | a | 5.0 | 0.0 |
3 | b | 0.0 | 1.0 |
4 | b | 1.0 | 1.0 |
5 | b | 6.0 | 1.0 |
6 | c | 3.0 | NaN |
7 | d | NaN | 2.0 |
pd.merge(df1, df2, how="left")
key | data1 | data2 | |
---|---|---|---|
0 | b | 0 | 1.0 |
1 | b | 1 | 1.0 |
2 | a | 2 | 0.0 |
3 | c | 3 | NaN |
4 | a | 4 | 0.0 |
5 | a | 5 | 0.0 |
6 | b | 6 | 1.0 |
pd.merge(df1, df2, how="right")
key | data1 | data2 | |
---|---|---|---|
0 | a | 2.0 | 0 |
1 | a | 4.0 | 0 |
2 | a | 5.0 | 0 |
3 | b | 0.0 | 1 |
4 | b | 1.0 | 1 |
5 | b | 6.0 | 1 |
6 | d | NaN | 2 |
To merge with multiple keys, pass a list of column names:
left = pd.DataFrame({"key1": ["foo", "foo", "bar"],
                     "key2": ["one", "two", "one"],
                     "lval": pd.Series([1, 2, 3], dtype='Int64')})
right = pd.DataFrame({"key1": ["foo", "foo", "bar", "bar"],
                      "key2": ["one", "one", "one", "two"],
                      "rval": pd.Series([4, 5, 6, 7], dtype='Int64')})
pd.merge(left, right, on=["key1", "key2"], how="outer")
key1 | key2 | lval | rval | |
---|---|---|---|---|
0 | bar | one | 3 | 6 |
1 | bar | two | <NA> | 7 |
2 | foo | one | 1 | 4 |
3 | foo | one | 1 | 5 |
4 | foo | two | 2 | <NA> |
Another issue to consider in merge operations is the treatment of overlapping column names. For example:
pd.merge(left, right, on="key1")
key1 | key2_x | lval | key2_y | rval | |
---|---|---|---|---|---|
0 | foo | one | 1 | one | 4 |
1 | foo | one | 1 | one | 5 |
2 | foo | two | 2 | one | 4 |
3 | foo | two | 2 | one | 5 |
4 | bar | one | 3 | one | 6 |
5 | bar | one | 3 | two | 7 |
pd.merge(left, right, on="key1", suffixes=("_left", "_right"))
key1 | key2_left | lval | key2_right | rval | |
---|---|---|---|---|---|
0 | foo | one | 1 | one | 4 |
1 | foo | one | 1 | one | 5 |
2 | foo | two | 2 | one | 4 |
3 | foo | two | 2 | one | 5 |
4 | bar | one | 3 | one | 6 |
5 | bar | one | 3 | two | 7 |
Merging on index
Sometimes, the merge key is in the index (row labels). In this case, use left_index=True or right_index=True (or both) to indicate that the index should be used as the merge key.
left1 = pd.DataFrame({"key": ["a", "b", "a", "a", "b", "c"],
                      "value": pd.Series(range(6), dtype="Int64")})
right1 = pd.DataFrame({"group_val": [3.5, 7]}, index=["a", "b"])
left1
key | value | |
---|---|---|
0 | a | 0 |
1 | b | 1 |
2 | a | 2 |
3 | a | 3 |
4 | b | 4 |
5 | c | 5 |
right1
group_val | |
---|---|
a | 3.5 |
b | 7.0 |
pd.merge(left1, right1, left_on='key', right_index=True)
key | value | group_val | |
---|---|---|---|
0 | a | 0 | 3.5 |
1 | b | 1 | 7.0 |
2 | a | 2 | 3.5 |
3 | a | 3 | 3.5 |
4 | b | 4 | 7.0 |
Checking for duplicate keys
left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})
left
A | B | |
---|---|---|
0 | 1 | 1 |
1 | 2 | 2 |
right
A | B | |
---|---|---|
0 | 4 | 2 |
1 | 5 | 2 |
2 | 6 | 2 |
There are many values of A for the same value of B - how should these be merged? The validate argument can check whether there are issues:
- validate="one_to_one" or "1:1": check if merge keys are unique in both left and right datasets.
- validate="one_to_many" or "1:m": check if merge keys are unique in the left dataset.
- validate="many_to_one" or "m:1": check if merge keys are unique in the right dataset.
- validate="many_to_many" or "m:m": allowed, but does not result in checks.
# this fails because the merge keys in "B" are not unique in either table
# result = pd.merge(left, right, on="B", how="outer", validate="one_to_one")
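A sketch of what that failure looks like: pandas raises a MergeError when the validation check fails.
# catching the validation error instead of letting it stop the notebook
try:
    pd.merge(left, right, on="B", how="outer", validate="one_to_one")
except pd.errors.MergeError as err:
    print(err)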
The default behavior is to perform a many-to-many merge. This means that for each duplicate key in one DataFrame, it will be matched with all corresponding duplicate keys in the other DataFrame.
result = pd.merge(left, right, on="B", how="outer")
result
A_x | B | A_y | |
---|---|---|---|
0 | 1 | 1 | NaN |
1 | 2 | 2 | 4.0 |
2 | 2 | 2 | 5.0 |
3 | 2 | 2 | 6.0 |
# another example
df1 = pd.DataFrame({'key': [1, 1, 2], 'value1': ['a', 'b', 'c']})
df2 = pd.DataFrame({'key': [1, 1, 3], 'value2': ['x', 'y', 'z']})
merged_df = pd.merge(df1, df2, on='key')
df1
key | value1 | |
---|---|---|
0 | 1 | a |
1 | 1 | b |
2 | 2 | c |
df2
key | value2 | |
---|---|---|
0 | 1 | x |
1 | 1 | y |
2 | 3 | z |
merged_df
key | value1 | value2 | |
---|---|---|---|
0 | 1 | a | x |
1 | 1 | a | y |
2 | 1 | b | x |
3 | 1 | b | y |
Example: NYC flights data
dest_count = flights['dest'].value_counts()
dest_count.name = 'flight_count'
airports.head()
faa | name | lat | lon | alt | tz | dst | tzone | |
---|---|---|---|---|---|---|---|---|
0 | 04G | Lansdowne Airport | 41.130472 | -80.619583 | 1044 | -5 | A | America/New_York |
1 | 06A | Moton Field Municipal Airport | 32.460572 | -85.680028 | 264 | -6 | A | America/Chicago |
2 | 06C | Schaumburg Regional | 41.989341 | -88.101243 | 801 | -6 | A | America/Chicago |
3 | 06N | Randall Airport | 41.431912 | -74.391561 | 523 | -5 | A | America/New_York |
4 | 09J | Jekyll Island Airport | 31.074472 | -81.427778 | 11 | -5 | A | America/New_York |
airport_flight_count = pd.merge(airports, dest_count, left_on='faa', right_index=True)
airport_flight_count.sort_values(by='flight_count', ascending=False)
faa | name | lat | lon | alt | tz | dst | tzone | flight_count | |
---|---|---|---|---|---|---|---|---|---|
1026 | ORD | Chicago Ohare Intl | 41.978603 | -87.904842 | 668 | -6 | A | America/Chicago | 17283 |
153 | ATL | Hartsfield Jackson Atlanta Intl | 33.636719 | -84.428067 | 1026 | -5 | A | America/New_York | 17215 |
770 | LAX | Los Angeles Intl | 33.942536 | -118.408075 | 126 | -8 | A | America/Los_Angeles | 16174 |
223 | BOS | General Edward Lawrence Logan Intl | 42.364347 | -71.005181 | 19 | -5 | A | America/New_York | 15508 |
852 | MCO | Orlando Intl | 28.429394 | -81.308994 | 96 | -5 | A | America/New_York | 14082 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
927 | MTJ | Montrose Regional Airport | 38.509794 | -107.894242 | 5759 | -7 | A | America/Denver | 15 |
1192 | SBN | South Bend Rgnl | 41.708661 | -86.317250 | 799 | -5 | A | America/New_York | 10 |
128 | ANC | Ted Stevens Anchorage Intl | 61.174361 | -149.996361 | 152 | -9 | A | America/Anchorage | 8 |
782 | LEX | Blue Grass | 38.036500 | -84.605889 | 979 | -5 | A | America/New_York | 1 |
786 | LGA | La Guardia | 40.777245 | -73.872608 | 22 | -5 | A | America/New_York | 1 |
101 rows × 9 columns
dest_delay = flights.groupby('dest')['arr_delay'].mean()
dest_delay
dest
ABQ 4.381890
ACK 4.852273
ALB 14.397129
ANC -2.500000
ATL 11.300113
...
TPA 7.408525
TUL 33.659864
TVC 12.968421
TYS 24.069204
XNA 7.465726
Name: arr_delay, Length: 105, dtype: float64
airport_flight_count_with_delay = pd.merge(airport_flight_count, dest_delay, left_on='faa', right_index=True)
airport_flight_count_with_delay
faa | name | lat | lon | alt | tz | dst | tzone | flight_count | arr_delay | |
---|---|---|---|---|---|---|---|---|---|---|
87 | ABQ | Albuquerque International Sunport | 35.040222 | -106.609194 | 5355 | -7 | A | America/Denver | 254 | 4.381890 |
91 | ACK | Nantucket Mem | 41.253053 | -70.060181 | 48 | -5 | A | America/New_York | 265 | 4.852273 |
118 | ALB | Albany Intl | 42.748267 | -73.801692 | 285 | -5 | A | America/New_York | 439 | 14.397129 |
128 | ANC | Ted Stevens Anchorage Intl | 61.174361 | -149.996361 | 152 | -9 | A | America/Anchorage | 8 | -2.500000 |
153 | ATL | Hartsfield Jackson Atlanta Intl | 33.636719 | -84.428067 | 1026 | -5 | A | America/New_York | 17215 | 11.300113 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1327 | TPA | Tampa Intl | 27.975472 | -82.533250 | 26 | -5 | A | America/New_York | 7466 | 7.408525 |
1334 | TUL | Tulsa Intl | 36.198389 | -95.888111 | 677 | -6 | A | America/Chicago | 315 | 33.659864 |
1337 | TVC | Cherry Capital Airport | 44.741445 | -85.582235 | 624 | -5 | A | America/New_York | 101 | 12.968421 |
1347 | TYS | Mc Ghee Tyson | 35.810972 | -83.994028 | 981 | -5 | A | America/New_York | 631 | 24.069204 |
1430 | XNA | NW Arkansas Regional | 36.281869 | -94.306811 | 1287 | -6 | A | America/Chicago | 1036 | 7.465726 |
101 rows × 10 columns
import plotly.express as px

px.scatter_geo(airport_flight_count_with_delay,
               lat='lat', lon='lon',
               hover_name="name",
               fitbounds="locations",
               size='flight_count')
px.scatter_geo(airport_flight_count_with_delay,
               lat='lat', lon='lon',
               hover_name="name",
               fitbounds="locations",
               size='flight_count',
               color='arr_delay',
               color_continuous_midpoint=0,
               color_continuous_scale='icefire')
Resources
- pandas cheat sheet (link)