# Import packages
import numpy as np
import pandas as pd
import seaborn as sns
Lecture 2 - Introduction to Pandas
Overview
This lecture provides an introduction to Pandas, a Python package with structures and data manipulation tools designed to make data cleaning and analysis fast and convenient. Topics include:
- Pandas objects (pd.Series, pd.DataFrame)
- Essential pandas functionality (indexing, selection, filtering, operations on dataframes)
- The concept of tidy data (Wickham, 2014)
References
This lecture contains material from:
- Chapter 5, Python for Data Analysis, 3E (Wes McKinney, 2022)
- Pierre Bellec’s notes
About Pandas
Pandas is the Python module that provides functionality similar to R’s built-in data.frame object. Pandas is designed for working with tabular or heterogeneous data.
Pandas Series
A Series is a one-dimensional array-like object containing a sequence of values of the same type (similar to NumPy types) and an associated array of data labels, called its index.
s = pd.Series([2, 4, -1, 5])
s
0 2
1 4
2 -1
3 5
dtype: int64
s.array
<NumpyExtensionArray>
[np.int64(2), np.int64(4), np.int64(-1), np.int64(5)]
Length: 4, dtype: int64
s.index
RangeIndex(start=0, stop=4, step=1)
s = pd.Series([2, 4, -1, 5], index=['a', 'b', 'c', 'd'])
s.index
Index(['a', 'b', 'c', 'd'], dtype='object')
s['a']
np.int64(2)
s[['a', 'd']]
a 2
d 5
dtype: int64
s * 2
a 4
b 8
c -2
d 10
dtype: int64
s[s > 2]
b 4
d 5
dtype: int64
We can create a series from a numpy array:
# Creating a numpy array
a = np.arange(6, 15, 1) + 0.2
print(a.shape)
a
(9,)
array([ 6.2, 7.2, 8.2, 9.2, 10.2, 11.2, 12.2, 13.2, 14.2])
s_from_np = pd.Series(a)
s_from_np
0 6.2
1 7.2
2 8.2
3 9.2
4 10.2
5 11.2
6 12.2
7 13.2
8 14.2
dtype: float64
s_from_np.array
<NumpyExtensionArray>
[ np.float64(6.2), np.float64(7.2), np.float64(8.2), np.float64(9.2),
np.float64(10.2), np.float64(11.2), np.float64(12.2), np.float64(13.2),
np.float64(14.2)]
Length: 9, dtype: float64
s_from_np.index
RangeIndex(start=0, stop=9, step=1)
We can adjust the index in place:
s_from_np.index = [2 * i for i in range(len(s_from_np))]
s_from_np
0 6.2
2 7.2
4 8.2
6 9.2
8 10.2
10 11.2
12 12.2
14 13.2
16 14.2
dtype: float64
Finally, we can convert the Series object back to an ndarray:
s_from_np.to_numpy()
array([ 6.2, 7.2, 8.2, 9.2, 10.2, 11.2, 12.2, 13.2, 14.2])
We can also create a series from a dictionary:
dat = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
s_from_dict = pd.Series(dat)
s_from_dict
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
s_from_dict.to_dict()
{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ["California", "Ohio", "Oregon", "Texas"]
s_new = pd.Series(dat, index=states)
s_new
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
pd.isna(s_new)
California True
Ohio False
Oregon False
Texas False
dtype: bool
pd.notna(s_new)
California False
Ohio True
Oregon True
Texas True
dtype: bool
s_new.isna()
California True
Ohio False
Oregon False
Texas False
dtype: bool
Pandas DataFrame
A DataFrame represents a rectangular table of data and contains an ordered, named collection of columns, each of which can be a different value type (numeric, string, Boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dictionary of Series all sharing the same index.
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame
| | state | year | pop |
---|---|---|---|
0 | Ohio | 2000 | 1.5 |
1 | Ohio | 2001 | 1.7 |
2 | Ohio | 2002 | 3.6 |
3 | Nevada | 2001 | 2.4 |
4 | Nevada | 2002 | 2.9 |
5 | Nevada | 2003 | 3.2 |
# adding a column (similar to dictionaries)
frame['debt'] = 16
frame
| | state | year | pop | debt |
---|---|---|---|---|
0 | Ohio | 2000 | 1.5 | 16 |
1 | Ohio | 2001 | 1.7 | 16 |
2 | Ohio | 2002 | 3.6 | 16 |
3 | Nevada | 2001 | 2.4 | 16 |
4 | Nevada | 2002 | 2.9 | 16 |
5 | Nevada | 2003 | 3.2 | 16 |
len(frame)
6
frame['debt'] = np.linspace(16, 20, num=len(frame))
frame
| | state | year | pop | debt |
---|---|---|---|---|
0 | Ohio | 2000 | 1.5 | 16.0 |
1 | Ohio | 2001 | 1.7 | 16.8 |
2 | Ohio | 2002 | 3.6 | 17.6 |
3 | Nevada | 2001 | 2.4 | 18.4 |
4 | Nevada | 2002 | 2.9 | 19.2 |
5 | Nevada | 2003 | 3.2 | 20.0 |
# another example
A2d = np.random.normal(size=(8, 2))
A2d.shape
(8, 2)
df_from_np = pd.DataFrame(A2d)
df_from_np
| | 0 | 1 |
---|---|---|
0 | -2.372890 | 0.461543 |
1 | -0.280463 | 1.104957 |
2 | -0.078033 | -0.759848 |
3 | 1.013987 | 0.562257 |
4 | 0.168685 | 0.785529 |
5 | -0.862294 | -0.784686 |
6 | -1.642501 | -0.204560 |
7 | 0.056823 | 0.733871 |
df_from_np.head()
| | 0 | 1 |
---|---|---|
0 | -2.372890 | 0.461543 |
1 | -0.280463 | 1.104957 |
2 | -0.078033 | -0.759848 |
3 | 1.013987 | 0.562257 |
4 | 0.168685 | 0.785529 |
df_from_np.tail()
| | 0 | 1 |
---|---|---|
3 | 1.013987 | 0.562257 |
4 | 0.168685 | 0.785529 |
5 | -0.862294 | -0.784686 |
6 | -1.642501 | -0.204560 |
7 | 0.056823 | 0.733871 |
df_from_np.to_numpy()
array([[-2.37288997, 0.4615429 ],
[-0.28046288, 1.10495676],
[-0.07803338, -0.75984847],
[ 1.01398664, 0.5622572 ],
[ 0.16868548, 0.78552939],
[-0.86229404, -0.78468577],
[-1.64250143, -0.20455967],
[ 0.05682344, 0.73387118]])
Indices
df_from_np.index
RangeIndex(start=0, stop=8, step=1)
indices = list('abcdefgh')
indices
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
df_from_np = pd.DataFrame(A2d, index=indices)
df_from_np
| | 0 | 1 |
---|---|---|
a | -2.372890 | 0.461543 |
b | -0.280463 | 1.104957 |
c | -0.078033 | -0.759848 |
d | 1.013987 | 0.562257 |
e | 0.168685 | 0.785529 |
f | -0.862294 | -0.784686 |
g | -1.642501 | -0.204560 |
h | 0.056823 | 0.733871 |
isinstance(df_from_np.index, pd.Series)
False
isinstance(df_from_np.columns, pd.Index)
True
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
obj.reindex(['a', 'b', 'c', 'd', 'e'])
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
obj2 = obj.reindex(['a', 'a', 'b', 'c', 'd'])
obj2.index.is_unique
False
obj2.index.unique()
Index(['a', 'b', 'c', 'd'], dtype='object')
obj2['a']
a -5.3
a -5.3
dtype: float64
obj2.index.isin(['a'])
array([ True, True, False, False, False])
Columns
df_from_np.columns
RangeIndex(start=0, stop=2, step=1)
df_from_np.columns = ['firstCol', 'secondCol']
df_from_np
| | firstCol | secondCol |
---|---|---|
a | -2.372890 | 0.461543 |
b | -0.280463 | 1.104957 |
c | -0.078033 | -0.759848 |
d | 1.013987 | 0.562257 |
e | 0.168685 | 0.785529 |
f | -0.862294 | -0.784686 |
g | -1.642501 | -0.204560 |
h | 0.056823 | 0.733871 |
isinstance(df_from_np.columns, pd.Series)
False
isinstance(df_from_np.columns, pd.Index)
True
Datatypes
df_from_np.dtypes
firstCol float64
secondCol float64
dtype: object
df = pd.DataFrame([['a', 1, 'b'], [1, 2, 3]])
df
| | 0 | 1 | 2 |
---|---|---|---|
0 | a | 1 | b |
1 | 1 | 2 | 3 |
df.dtypes
0 object
1 int64
2 object
dtype: object
type(df.dtypes)
pandas.core.series.Series
isinstance(df.dtypes, pd.Series)
True
Illustration: Historical Rates Data
Reading in csv files
First, make sure the file is in your working directory - this is ‘where you are currently’ on the computer.
Python needs to know where the files live - they can be on your computer (local) or on the internet (remote).
The place where your file lives is referred to as its ‘path’. You can think of the path as directions to the file. There are two kinds of paths:
- relative paths indicate where a file is with respect to your working directory
- absolute paths indicate where the file is with respect to the computer's filesystem base (or root) folder, regardless of where you are working
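As an aside, Python's standard library can combine a working directory with a relative path. Below is a minimal sketch using os.path; the directory names are hypothetical, not the course's actual folders:

```python
import os

cwd = "/Users/student/project/notebooks"  # hypothetical working directory
rel = "../data/rates.csv"                 # relative path: ".." goes one level up

# Join the two and normalize away the ".." component
full = os.path.normpath(os.path.join(cwd, rel))
print(full)  # /Users/student/project/data/rates.csv
```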
Image from Data Science: A first introduction with Python (link)
In VSCode, your working directory is typically where your Jupyter notebook is saved.
import os
os.getcwd()
'/Users/gm845/Library/CloudStorage/Box-Box/teaching/2025/msds-597/website/lec-2'
We will be working with exchange rates data from the European Central Bank.
This data has the path:
/Users/gm845/Library/CloudStorage/Box-Box/teaching/2025/msds-597/lectures/data/rates.csv
To get to rates.csv, I need to "go one level up", then enter the "data" folder.
Relative to the Jupyter notebook, this path is ../data/rates.csv. Here, ../ indicates "go one level up".
df = pd.read_csv('../data/rates.csv')
Note: the default is header=0, which takes the first row in the csv file as the column names.
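To see what header=0 does, here is a small sketch with an inline CSV string (made up for illustration) rather than a file:

```python
import io
import pandas as pd

csv_text = "Time,USD\n2024-04-16,1.0637\n2024-04-15,1.0656\n"

# Default (header=0): the first row supplies the column names
df1 = pd.read_csv(io.StringIO(csv_text))
print(list(df1.columns))   # ['Time', 'USD']

# header=None: integer column names, and the first row becomes data
df2 = pd.read_csv(io.StringIO(csv_text), header=None)
print(list(df2.columns))   # [0, 1]
print(len(df1), len(df2))  # 2 3
```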
Let’s look at the dataframe.
df.head()
| | Time | USD | JPY | BGN | CZK | DKK | GBP | CHF |
---|---|---|---|---|---|---|---|---|
0 | 2024-04-16 | 1.0637 | 164.54 | 1.9558 | 25.210 | 7.4609 | 0.85440 | 0.9712 |
1 | 2024-04-15 | 1.0656 | 164.05 | 1.9558 | 25.324 | 7.4606 | 0.85405 | 0.9725 |
2 | 2024-04-12 | 1.0652 | 163.16 | 1.9558 | 25.337 | 7.4603 | 0.85424 | 0.9716 |
3 | 2024-04-11 | 1.0729 | 164.18 | 1.9558 | 25.392 | 7.4604 | 0.85525 | 0.9787 |
4 | 2024-04-10 | 1.0860 | 164.89 | 1.9558 | 25.368 | 7.4594 | 0.85515 | 0.9810 |
df.tail()
| | Time | USD | JPY | BGN | CZK | DKK | GBP | CHF |
---|---|---|---|---|---|---|---|---|
57 | 2024-01-24 | 1.0905 | 160.46 | 1.9558 | 24.786 | 7.4568 | 0.85543 | 0.9415 |
58 | 2024-01-23 | 1.0872 | 160.88 | 1.9558 | 24.824 | 7.4574 | 0.85493 | 0.9446 |
59 | 2024-01-22 | 1.0890 | 160.95 | 1.9558 | 24.758 | 7.4585 | 0.85575 | 0.9458 |
60 | 2024-01-19 | 1.0887 | 161.17 | 1.9558 | 24.813 | 7.4575 | 0.85825 | 0.9459 |
61 | 2024-01-18 | 1.0875 | 160.89 | 1.9558 | 24.734 | 7.4571 | 0.85773 | 0.9432 |
df.shape
(62, 8)
We can inspect the data types of each column:
df.dtypes
Time object
USD float64
JPY float64
BGN float64
CZK float64
DKK float64
GBP float64
CHF float64
dtype: object
We see that the Time column has type object. We can fix this by reading in the data again, this time with the argument parse_dates=['Time'].
df = pd.read_csv('../data/rates.csv', parse_dates=['Time'])
df.dtypes
Time datetime64[ns]
USD float64
JPY float64
BGN float64
CZK float64
DKK float64
GBP float64
CHF float64
dtype: object
Accessing a column
df['USD']
0 1.0637
1 1.0656
2 1.0652
3 1.0729
4 1.0860
...
57 1.0905
58 1.0872
59 1.0890
60 1.0887
61 1.0875
Name: USD, Length: 62, dtype: float64
isinstance(df['USD'], pd.Series)
True
df.USD
0 1.0637
1 1.0656
2 1.0652
3 1.0729
4 1.0860
...
57 1.0905
58 1.0872
59 1.0890
60 1.0887
61 1.0875
Name: USD, Length: 62, dtype: float64
Note: when we get the USD rates, we unfortunately lose information about the time.
# here is how to fix it and use Time as the index
df.index = df['Time']
df
| | Time | USD | JPY | BGN | CZK | DKK | GBP | CHF |
---|---|---|---|---|---|---|---|---|
Time | ||||||||
2024-04-16 | 2024-04-16 | 1.0637 | 164.54 | 1.9558 | 25.210 | 7.4609 | 0.85440 | 0.9712 |
2024-04-15 | 2024-04-15 | 1.0656 | 164.05 | 1.9558 | 25.324 | 7.4606 | 0.85405 | 0.9725 |
2024-04-12 | 2024-04-12 | 1.0652 | 163.16 | 1.9558 | 25.337 | 7.4603 | 0.85424 | 0.9716 |
2024-04-11 | 2024-04-11 | 1.0729 | 164.18 | 1.9558 | 25.392 | 7.4604 | 0.85525 | 0.9787 |
2024-04-10 | 2024-04-10 | 1.0860 | 164.89 | 1.9558 | 25.368 | 7.4594 | 0.85515 | 0.9810 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2024-01-24 | 2024-01-24 | 1.0905 | 160.46 | 1.9558 | 24.786 | 7.4568 | 0.85543 | 0.9415 |
2024-01-23 | 2024-01-23 | 1.0872 | 160.88 | 1.9558 | 24.824 | 7.4574 | 0.85493 | 0.9446 |
2024-01-22 | 2024-01-22 | 1.0890 | 160.95 | 1.9558 | 24.758 | 7.4585 | 0.85575 | 0.9458 |
2024-01-19 | 2024-01-19 | 1.0887 | 161.17 | 1.9558 | 24.813 | 7.4575 | 0.85825 | 0.9459 |
2024-01-18 | 2024-01-18 | 1.0875 | 160.89 | 1.9558 | 24.734 | 7.4571 | 0.85773 | 0.9432 |
62 rows × 8 columns
# Now when we extract a column, we keep the index
# we keep information about the dates
df.USD
Time
2024-04-16 1.0637
2024-04-15 1.0656
2024-04-12 1.0652
2024-04-11 1.0729
2024-04-10 1.0860
...
2024-01-24 1.0905
2024-01-23 1.0872
2024-01-22 1.0890
2024-01-19 1.0887
2024-01-18 1.0875
Name: USD, Length: 62, dtype: float64
Selecting/dropping multiple columns or rows
# Subsets of columns
df[['USD', 'CHF']]
| | USD | CHF |
---|---|---|
Time | ||
2024-04-16 | 1.0637 | 0.9712 |
2024-04-15 | 1.0656 | 0.9725 |
2024-04-12 | 1.0652 | 0.9716 |
2024-04-11 | 1.0729 | 0.9787 |
2024-04-10 | 1.0860 | 0.9810 |
... | ... | ... |
2024-01-24 | 1.0905 | 0.9415 |
2024-01-23 | 1.0872 | 0.9446 |
2024-01-22 | 1.0890 | 0.9458 |
2024-01-19 | 1.0887 | 0.9459 |
2024-01-18 | 1.0875 | 0.9432 |
62 rows × 2 columns
df[['USD', 'CHF']].head()
| | USD | CHF |
---|---|---|
Time | ||
2024-04-16 | 1.0637 | 0.9712 |
2024-04-15 | 1.0656 | 0.9725 |
2024-04-12 | 1.0652 | 0.9716 |
2024-04-11 | 1.0729 | 0.9787 |
2024-04-10 | 1.0860 | 0.9810 |
df_dropped = df.drop(columns=['Time', 'BGN', 'DKK', 'CZK'])
df_dropped.head()
| | USD | JPY | GBP | CHF |
---|---|---|---|---|
Time | ||||
2024-04-16 | 1.0637 | 164.54 | 0.85440 | 0.9712 |
2024-04-15 | 1.0656 | 164.05 | 0.85405 | 0.9725 |
2024-04-12 | 1.0652 | 163.16 | 0.85424 | 0.9716 |
2024-04-11 | 1.0729 | 164.18 | 0.85525 | 0.9787 |
2024-04-10 | 1.0860 | 164.89 | 0.85515 | 0.9810 |
# original df unaffected
df.head()
| | Time | USD | JPY | BGN | CZK | DKK | GBP | CHF |
---|---|---|---|---|---|---|---|---|
Time | ||||||||
2024-04-16 | 2024-04-16 | 1.0637 | 164.54 | 1.9558 | 25.210 | 7.4609 | 0.85440 | 0.9712 |
2024-04-15 | 2024-04-15 | 1.0656 | 164.05 | 1.9558 | 25.324 | 7.4606 | 0.85405 | 0.9725 |
2024-04-12 | 2024-04-12 | 1.0652 | 163.16 | 1.9558 | 25.337 | 7.4603 | 0.85424 | 0.9716 |
2024-04-11 | 2024-04-11 | 1.0729 | 164.18 | 1.9558 | 25.392 | 7.4604 | 0.85525 | 0.9787 |
2024-04-10 | 2024-04-10 | 1.0860 | 164.89 | 1.9558 | 25.368 | 7.4594 | 0.85515 | 0.9810 |
# dropping a row
df_dropped.drop('2024-04-16').head()
# same as: df_dropped.drop(index='2024-04-16').head()
| | USD | JPY | GBP | CHF |
---|---|---|---|---|
Time | ||||
2024-04-15 | 1.0656 | 164.05 | 0.85405 | 0.9725 |
2024-04-12 | 1.0652 | 163.16 | 0.85424 | 0.9716 |
2024-04-11 | 1.0729 | 164.18 | 0.85525 | 0.9787 |
2024-04-10 | 1.0860 | 164.89 | 0.85515 | 0.9810 |
2024-04-09 | 1.0867 | 164.97 | 0.85663 | 0.9819 |
# now that time is the index, let's drop the Time column
df = df.drop(columns='Time')
Selecting rows
Series
Note: be careful when accessing data with [] - integers in the index are treated as labels, not positions.
usd = df['USD'].copy()
usd.index = [i * 2 for i in range(len(usd))]
usd
0 1.0637
2 1.0656
4 1.0652
6 1.0729
8 1.0860
...
114 1.0905
116 1.0872
118 1.0890
120 1.0887
122 1.0875
Name: USD, Length: 62, dtype: float64
usd[2]
np.float64(1.0656)
To avoid this, use the .loc method for labels, and .iloc for integer positions.
usd.loc[2]
np.float64(1.0656)
usd.iloc[2]
np.float64(1.0652)
Dataframes
df.loc['2024-04-16']
USD 1.0637
JPY 164.5400
BGN 1.9558
CZK 25.2100
DKK 7.4609
GBP 0.8544
CHF 0.9712
Name: 2024-04-16 00:00:00, dtype: float64
df.loc['2024-04-16', ['USD', 'GBP']]
USD 1.0637
GBP 0.8544
Name: 2024-04-16 00:00:00, dtype: float64
df.loc[["2024-01-18"], ['USD', 'GBP']]
| | USD | GBP |
---|---|---|
Time | ||
2024-01-18 | 1.0875 | 0.85773 |
print('First is', type(df.loc["2024-01-18", ['USD', 'GBP']]), ', second is', type(df.loc[["2024-01-18"], ['USD', 'GBP']]))
First is <class 'pandas.core.series.Series'> , second is <class 'pandas.core.frame.DataFrame'>
df.loc[["2024-01-18", "2024-04-16"], ['USD', 'GBP']]
| | USD | GBP |
---|---|---|
Time | ||
2024-01-18 | 1.0875 | 0.85773 |
2024-04-16 | 1.0637 | 0.85440 |
# this will fail: df.loc[:, 1:3]
# loc expects labels for rows and columns
iloc
df.iloc[0]
USD 1.0637
JPY 164.5400
BGN 1.9558
CZK 25.2100
DKK 7.4609
GBP 0.8544
CHF 0.9712
Name: 2024-04-16 00:00:00, dtype: float64
df.iloc[:, 1:3]
| | JPY | BGN |
---|---|---|
Time | ||
2024-04-16 | 164.54 | 1.9558 |
2024-04-15 | 164.05 | 1.9558 |
2024-04-12 | 163.16 | 1.9558 |
2024-04-11 | 164.18 | 1.9558 |
2024-04-10 | 164.89 | 1.9558 |
... | ... | ... |
2024-01-24 | 160.46 | 1.9558 |
2024-01-23 | 160.88 | 1.9558 |
2024-01-22 | 160.95 | 1.9558 |
2024-01-19 | 161.17 | 1.9558 |
2024-01-18 | 160.89 | 1.9558 |
62 rows × 2 columns
df.iloc[:, 1:] # all columns except 0th one
| | JPY | BGN | CZK | DKK | GBP | CHF |
---|---|---|---|---|---|---|
Time | ||||||
2024-04-16 | 164.54 | 1.9558 | 25.210 | 7.4609 | 0.85440 | 0.9712 |
2024-04-15 | 164.05 | 1.9558 | 25.324 | 7.4606 | 0.85405 | 0.9725 |
2024-04-12 | 163.16 | 1.9558 | 25.337 | 7.4603 | 0.85424 | 0.9716 |
2024-04-11 | 164.18 | 1.9558 | 25.392 | 7.4604 | 0.85525 | 0.9787 |
2024-04-10 | 164.89 | 1.9558 | 25.368 | 7.4594 | 0.85515 | 0.9810 |
... | ... | ... | ... | ... | ... | ... |
2024-01-24 | 160.46 | 1.9558 | 24.786 | 7.4568 | 0.85543 | 0.9415 |
2024-01-23 | 160.88 | 1.9558 | 24.824 | 7.4574 | 0.85493 | 0.9446 |
2024-01-22 | 160.95 | 1.9558 | 24.758 | 7.4585 | 0.85575 | 0.9458 |
2024-01-19 | 161.17 | 1.9558 | 24.813 | 7.4575 | 0.85825 | 0.9459 |
2024-01-18 | 160.89 | 1.9558 | 24.734 | 7.4571 | 0.85773 | 0.9432 |
62 rows × 6 columns
df.iloc[:10, 1:4] # select a few rows, only keep columns 1,2,3
| | JPY | BGN | CZK |
---|---|---|---|
Time | |||
2024-04-16 | 164.54 | 1.9558 | 25.210 |
2024-04-15 | 164.05 | 1.9558 | 25.324 |
2024-04-12 | 163.16 | 1.9558 | 25.337 |
2024-04-11 | 164.18 | 1.9558 | 25.392 |
2024-04-10 | 164.89 | 1.9558 | 25.368 |
2024-04-09 | 164.97 | 1.9558 | 25.380 |
2024-04-08 | 164.43 | 1.9558 | 25.354 |
2024-04-05 | 164.10 | 1.9558 | 25.286 |
2024-04-04 | 164.69 | 1.9558 | 25.322 |
2024-04-03 | 163.66 | 1.9558 | 25.352 |
Type | Notes |
---|---|
df[column] | Select single column or sequence of columns from the DataFrame; special case conveniences: Boolean array (filter rows), slice (slice rows), or Boolean DataFrame (set values based on some criterion) |
df.loc[rows] | Select single row or subset of rows from the DataFrame by label |
df.loc[:, cols] | Select single column or subset of columns by label |
df.loc[rows, cols] | Select both row(s) and column(s) by label |
df.iloc[rows] | Select single row or subset of rows from the DataFrame by integer position |
df.iloc[:, cols] | Select single column or subset of columns by integer position |
df.iloc[rows, cols] | Select both row(s) and column(s) by integer position |
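As a quick check of the table above, here is a toy dataframe (invented for illustration) exercising each access pattern:

```python
import pandas as pd

# String row labels, so that label- and position-based access are easy to contrast
toy = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["a", "b", "c"])

print(toy["x"].tolist())        # df[column]        -> [1, 2, 3]
print(toy.loc["b"].tolist())    # df.loc[rows]      -> [2, 5]
print(toy.loc[:, "y"].tolist()) # df.loc[:, cols]   -> [4, 5, 6]
print(toy.loc["b", "y"])        # df.loc[rows, cols]-> 5
print(toy.iloc[1].tolist())     # df.iloc[rows]     -> [2, 5]
print(toy.iloc[1, 1])           # df.iloc[rows,cols]-> 5
```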
Selecting based on logical conditions
df.query('USD > 1.09')
| | USD | JPY | BGN | CZK | DKK | GBP | CHF |
---|---|---|---|---|---|---|---|
Time | |||||||
2024-03-21 | 1.0907 | 164.96 | 1.9558 | 25.243 | 7.4579 | 0.85678 | 0.9766 |
2024-03-14 | 1.0925 | 161.70 | 1.9558 | 25.198 | 7.4568 | 0.85420 | 0.9616 |
2024-03-13 | 1.0939 | 161.83 | 1.9558 | 25.273 | 7.4573 | 0.85451 | 0.9599 |
2024-03-12 | 1.0916 | 161.39 | 1.9558 | 25.272 | 7.4571 | 0.85458 | 0.9588 |
2024-03-11 | 1.0926 | 160.43 | 1.9558 | 25.322 | 7.4552 | 0.85208 | 0.9594 |
2024-03-08 | 1.0932 | 160.99 | 1.9558 | 25.308 | 7.4547 | 0.85168 | 0.9588 |
2024-01-24 | 1.0905 | 160.46 | 1.9558 | 24.786 | 7.4568 | 0.85543 | 0.9415 |
df.query('USD > 1.09 and JPY < 161')
| | USD | JPY | BGN | CZK | DKK | GBP | CHF |
---|---|---|---|---|---|---|---|
Time | |||||||
2024-03-11 | 1.0926 | 160.43 | 1.9558 | 25.322 | 7.4552 | 0.85208 | 0.9594 |
2024-03-08 | 1.0932 | 160.99 | 1.9558 | 25.308 | 7.4547 | 0.85168 | 0.9588 |
2024-01-24 | 1.0905 | 160.46 | 1.9558 | 24.786 | 7.4568 | 0.85543 | 0.9415 |
df.USD > 1.09 # boolean mask
Time
2024-04-16 False
2024-04-15 False
2024-04-12 False
2024-04-11 False
2024-04-10 False
...
2024-01-24 True
2024-01-23 False
2024-01-22 False
2024-01-19 False
2024-01-18 False
Name: USD, Length: 62, dtype: bool
df[df.USD > 1.09] # you can select boolean rows with []
| | USD | JPY | BGN | CZK | DKK | GBP | CHF |
---|---|---|---|---|---|---|---|
Time | |||||||
2024-03-21 | 1.0907 | 164.96 | 1.9558 | 25.243 | 7.4579 | 0.85678 | 0.9766 |
2024-03-14 | 1.0925 | 161.70 | 1.9558 | 25.198 | 7.4568 | 0.85420 | 0.9616 |
2024-03-13 | 1.0939 | 161.83 | 1.9558 | 25.273 | 7.4573 | 0.85451 | 0.9599 |
2024-03-12 | 1.0916 | 161.39 | 1.9558 | 25.272 | 7.4571 | 0.85458 | 0.9588 |
2024-03-11 | 1.0926 | 160.43 | 1.9558 | 25.322 | 7.4552 | 0.85208 | 0.9594 |
2024-03-08 | 1.0932 | 160.99 | 1.9558 | 25.308 | 7.4547 | 0.85168 | 0.9588 |
2024-01-24 | 1.0905 | 160.46 | 1.9558 | 24.786 | 7.4568 | 0.85543 | 0.9415 |
mask1 = df.USD > 1.09
mask2 = df.JPY < 161
df[mask1 & mask2] # mask1 and mask2
| | USD | JPY | BGN | CZK | DKK | GBP | CHF |
---|---|---|---|---|---|---|---|
Time | |||||||
2024-03-11 | 1.0926 | 160.43 | 1.9558 | 25.322 | 7.4552 | 0.85208 | 0.9594 |
2024-03-08 | 1.0932 | 160.99 | 1.9558 | 25.308 | 7.4547 | 0.85168 | 0.9588 |
2024-01-24 | 1.0905 | 160.46 | 1.9558 | 24.786 | 7.4568 | 0.85543 | 0.9415 |
df[mask1 | mask2].head() # mask1 or mask2
| | USD | JPY | BGN | CZK | DKK | GBP | CHF |
---|---|---|---|---|---|---|---|
Time | |||||||
2024-03-21 | 1.0907 | 164.96 | 1.9558 | 25.243 | 7.4579 | 0.85678 | 0.9766 |
2024-03-14 | 1.0925 | 161.70 | 1.9558 | 25.198 | 7.4568 | 0.85420 | 0.9616 |
2024-03-13 | 1.0939 | 161.83 | 1.9558 | 25.273 | 7.4573 | 0.85451 | 0.9599 |
2024-03-12 | 1.0916 | 161.39 | 1.9558 | 25.272 | 7.4571 | 0.85458 | 0.9588 |
2024-03-11 | 1.0926 | 160.43 | 1.9558 | 25.322 | 7.4552 | 0.85208 | 0.9594 |
# this fails
# df[mask1 & mask2, 'USD']
# to select rows with booleans AND columns, use loc
Boolean arrays can be used with loc but not iloc:
df.loc[mask1 & mask2, 'USD']
Time
2024-03-11 1.0926
2024-03-08 1.0932
2024-01-24 1.0905
Name: USD, dtype: float64
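If you do need boolean selection through iloc, one workaround (a sketch on a made-up Series, not the rates data) is to pass a plain NumPy boolean array instead of a boolean Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])
mask = s > 15                            # boolean Series, aligned by index

print(s.loc[mask].tolist())              # loc accepts the boolean Series: [20, 30]
print(s.iloc[mask.to_numpy()].tolist())  # iloc needs a plain boolean array: [20, 30]
```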
# Using a boolean mask to select columns
df.columns != 'Time'
array([ True, True, True, True, True, True, True])
df_dropped2 = df.loc[:, df.columns != 'Time']
df_dropped2
| | USD | JPY | BGN | CZK | DKK | GBP | CHF |
---|---|---|---|---|---|---|---|
Time | |||||||
2024-04-16 | 1.0637 | 164.54 | 1.9558 | 25.210 | 7.4609 | 0.85440 | 0.9712 |
2024-04-15 | 1.0656 | 164.05 | 1.9558 | 25.324 | 7.4606 | 0.85405 | 0.9725 |
2024-04-12 | 1.0652 | 163.16 | 1.9558 | 25.337 | 7.4603 | 0.85424 | 0.9716 |
2024-04-11 | 1.0729 | 164.18 | 1.9558 | 25.392 | 7.4604 | 0.85525 | 0.9787 |
2024-04-10 | 1.0860 | 164.89 | 1.9558 | 25.368 | 7.4594 | 0.85515 | 0.9810 |
... | ... | ... | ... | ... | ... | ... | ... |
2024-01-24 | 1.0905 | 160.46 | 1.9558 | 24.786 | 7.4568 | 0.85543 | 0.9415 |
2024-01-23 | 1.0872 | 160.88 | 1.9558 | 24.824 | 7.4574 | 0.85493 | 0.9446 |
2024-01-22 | 1.0890 | 160.95 | 1.9558 | 24.758 | 7.4585 | 0.85575 | 0.9458 |
2024-01-19 | 1.0887 | 161.17 | 1.9558 | 24.813 | 7.4575 | 0.85825 | 0.9459 |
2024-01-18 | 1.0875 | 160.89 | 1.9558 | 24.734 | 7.4571 | 0.85773 | 0.9432 |
62 rows × 7 columns
Sorting dataframes
df.sort_index()
| | USD | JPY | BGN | CZK | DKK | GBP | CHF |
---|---|---|---|---|---|---|---|
Time | |||||||
2024-01-18 | 1.0875 | 160.89 | 1.9558 | 24.734 | 7.4571 | 0.85773 | 0.9432 |
2024-01-19 | 1.0887 | 161.17 | 1.9558 | 24.813 | 7.4575 | 0.85825 | 0.9459 |
2024-01-22 | 1.0890 | 160.95 | 1.9558 | 24.758 | 7.4585 | 0.85575 | 0.9458 |
2024-01-23 | 1.0872 | 160.88 | 1.9558 | 24.824 | 7.4574 | 0.85493 | 0.9446 |
2024-01-24 | 1.0905 | 160.46 | 1.9558 | 24.786 | 7.4568 | 0.85543 | 0.9415 |
... | ... | ... | ... | ... | ... | ... | ... |
2024-04-10 | 1.0860 | 164.89 | 1.9558 | 25.368 | 7.4594 | 0.85515 | 0.9810 |
2024-04-11 | 1.0729 | 164.18 | 1.9558 | 25.392 | 7.4604 | 0.85525 | 0.9787 |
2024-04-12 | 1.0652 | 163.16 | 1.9558 | 25.337 | 7.4603 | 0.85424 | 0.9716 |
2024-04-15 | 1.0656 | 164.05 | 1.9558 | 25.324 | 7.4606 | 0.85405 | 0.9725 |
2024-04-16 | 1.0637 | 164.54 | 1.9558 | 25.210 | 7.4609 | 0.85440 | 0.9712 |
62 rows × 7 columns
df.sort_index(axis='columns')
| | BGN | CHF | CZK | DKK | GBP | JPY | USD |
---|---|---|---|---|---|---|---|
Time | |||||||
2024-04-16 | 1.9558 | 0.9712 | 25.210 | 7.4609 | 0.85440 | 164.54 | 1.0637 |
2024-04-15 | 1.9558 | 0.9725 | 25.324 | 7.4606 | 0.85405 | 164.05 | 1.0656 |
2024-04-12 | 1.9558 | 0.9716 | 25.337 | 7.4603 | 0.85424 | 163.16 | 1.0652 |
2024-04-11 | 1.9558 | 0.9787 | 25.392 | 7.4604 | 0.85525 | 164.18 | 1.0729 |
2024-04-10 | 1.9558 | 0.9810 | 25.368 | 7.4594 | 0.85515 | 164.89 | 1.0860 |
... | ... | ... | ... | ... | ... | ... | ... |
2024-01-24 | 1.9558 | 0.9415 | 24.786 | 7.4568 | 0.85543 | 160.46 | 1.0905 |
2024-01-23 | 1.9558 | 0.9446 | 24.824 | 7.4574 | 0.85493 | 160.88 | 1.0872 |
2024-01-22 | 1.9558 | 0.9458 | 24.758 | 7.4585 | 0.85575 | 160.95 | 1.0890 |
2024-01-19 | 1.9558 | 0.9459 | 24.813 | 7.4575 | 0.85825 | 161.17 | 1.0887 |
2024-01-18 | 1.9558 | 0.9432 | 24.734 | 7.4571 | 0.85773 | 160.89 | 1.0875 |
62 rows × 7 columns
df.USD.sort_values(ascending=False)
Time
2024-03-13 1.0939
2024-03-08 1.0932
2024-03-11 1.0926
2024-03-14 1.0925
2024-03-12 1.0916
...
2024-04-11 1.0729
2024-02-14 1.0713
2024-04-15 1.0656
2024-04-12 1.0652
2024-04-16 1.0637
Name: USD, Length: 62, dtype: float64
df_sorted = df.sort_values(by='USD', ascending=False)
df_sorted.head()
| | USD | JPY | BGN | CZK | DKK | GBP | CHF |
---|---|---|---|---|---|---|---|
Time | |||||||
2024-03-13 | 1.0939 | 161.83 | 1.9558 | 25.273 | 7.4573 | 0.85451 | 0.9599 |
2024-03-08 | 1.0932 | 160.99 | 1.9558 | 25.308 | 7.4547 | 0.85168 | 0.9588 |
2024-03-11 | 1.0926 | 160.43 | 1.9558 | 25.322 | 7.4552 | 0.85208 | 0.9594 |
2024-03-14 | 1.0925 | 161.70 | 1.9558 | 25.198 | 7.4568 | 0.85420 | 0.9616 |
2024-03-12 | 1.0916 | 161.39 | 1.9558 | 25.272 | 7.4571 | 0.85458 | 0.9588 |
Unique values
df.USD.unique()
array([1.0637, 1.0656, 1.0652, 1.0729, 1.086 , 1.0867, 1.0823, 1.0841,
1.0852, 1.0783, 1.0749, 1.0811, 1.0816, 1.0855, 1.0835, 1.0907,
1.0844, 1.0854, 1.0892, 1.0925, 1.0939, 1.0916, 1.0926, 1.0932,
1.0895, 1.0874, 1.0849, 1.0846, 1.0813, 1.0826, 1.0808, 1.0856,
1.0834, 1.0809, 1.0802, 1.0776, 1.0768, 1.0743, 1.0713, 1.0793,
1.0773, 1.0772, 1.0758, 1.0746, 1.0883, 1.0814, 1.0837, 1.0871,
1.0893, 1.0905, 1.0872, 1.089 , 1.0887, 1.0875])
df.USD.value_counts().head()
USD
1.0823 3
1.0846 2
1.0844 2
1.0776 2
1.0852 2
Name: count, dtype: int64
df.BGN.unique()
array([1.9558])
df.BGN.value_counts()
BGN
1.9558 62
Name: count, dtype: int64
Adding or modifying columns
df_copy = df.copy() # copy by value, not by reference
df_copy['new_col'] = 100 * df_copy.JPY / df_copy.USD
df_copy.head()
| | USD | JPY | BGN | CZK | DKK | GBP | CHF | new_col |
---|---|---|---|---|---|---|---|---|
Time | ||||||||
2024-04-16 | 1.0637 | 164.54 | 1.9558 | 25.210 | 7.4609 | 0.85440 | 0.9712 | 15468.647175 |
2024-04-15 | 1.0656 | 164.05 | 1.9558 | 25.324 | 7.4606 | 0.85405 | 0.9725 | 15395.082583 |
2024-04-12 | 1.0652 | 163.16 | 1.9558 | 25.337 | 7.4603 | 0.85424 | 0.9716 | 15317.311303 |
2024-04-11 | 1.0729 | 164.18 | 1.9558 | 25.392 | 7.4604 | 0.85525 | 0.9787 | 15302.451300 |
2024-04-10 | 1.0860 | 164.89 | 1.9558 | 25.368 | 7.4594 | 0.85515 | 0.9810 | 15183.241252 |
df_copy[df_copy > 200] = 200
# if we are changing a single column, use loc
df_copy.loc[df_copy['new_col'] == 200, 'new_col'] = 0
df_copy
| | USD | JPY | BGN | CZK | DKK | GBP | CHF | new_col |
---|---|---|---|---|---|---|---|---|
Time | ||||||||
2024-04-16 | 1.0637 | 164.54 | 1.9558 | 25.210 | 7.4609 | 0.85440 | 0.9712 | 0.0 |
2024-04-15 | 1.0656 | 164.05 | 1.9558 | 25.324 | 7.4606 | 0.85405 | 0.9725 | 0.0 |
2024-04-12 | 1.0652 | 163.16 | 1.9558 | 25.337 | 7.4603 | 0.85424 | 0.9716 | 0.0 |
2024-04-11 | 1.0729 | 164.18 | 1.9558 | 25.392 | 7.4604 | 0.85525 | 0.9787 | 0.0 |
2024-04-10 | 1.0860 | 164.89 | 1.9558 | 25.368 | 7.4594 | 0.85515 | 0.9810 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2024-01-24 | 1.0905 | 160.46 | 1.9558 | 24.786 | 7.4568 | 0.85543 | 0.9415 | 0.0 |
2024-01-23 | 1.0872 | 160.88 | 1.9558 | 24.824 | 7.4574 | 0.85493 | 0.9446 | 0.0 |
2024-01-22 | 1.0890 | 160.95 | 1.9558 | 24.758 | 7.4585 | 0.85575 | 0.9458 | 0.0 |
2024-01-19 | 1.0887 | 161.17 | 1.9558 | 24.813 | 7.4575 | 0.85825 | 0.9459 | 0.0 |
2024-01-18 | 1.0875 | 160.89 | 1.9558 | 24.734 | 7.4571 | 0.85773 | 0.9432 | 0.0 |
62 rows × 8 columns
Alternatively, we can use the assign function to create a new dataframe with the new column:
df.assign(new = df.USD / df.JPY * 100)
| | USD | JPY | BGN | CZK | DKK | GBP | CHF | new |
---|---|---|---|---|---|---|---|---|
Time | ||||||||
2024-04-16 | 1.0637 | 164.54 | 1.9558 | 25.210 | 7.4609 | 0.85440 | 0.9712 | 0.646469 |
2024-04-15 | 1.0656 | 164.05 | 1.9558 | 25.324 | 7.4606 | 0.85405 | 0.9725 | 0.649558 |
2024-04-12 | 1.0652 | 163.16 | 1.9558 | 25.337 | 7.4603 | 0.85424 | 0.9716 | 0.652856 |
2024-04-11 | 1.0729 | 164.18 | 1.9558 | 25.392 | 7.4604 | 0.85525 | 0.9787 | 0.653490 |
2024-04-10 | 1.0860 | 164.89 | 1.9558 | 25.368 | 7.4594 | 0.85515 | 0.9810 | 0.658621 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2024-01-24 | 1.0905 | 160.46 | 1.9558 | 24.786 | 7.4568 | 0.85543 | 0.9415 | 0.679609 |
2024-01-23 | 1.0872 | 160.88 | 1.9558 | 24.824 | 7.4574 | 0.85493 | 0.9446 | 0.675783 |
2024-01-22 | 1.0890 | 160.95 | 1.9558 | 24.758 | 7.4585 | 0.85575 | 0.9458 | 0.676608 |
2024-01-19 | 1.0887 | 161.17 | 1.9558 | 24.813 | 7.4575 | 0.85825 | 0.9459 | 0.675498 |
2024-01-18 | 1.0875 | 160.89 | 1.9558 | 24.734 | 7.4571 | 0.85773 | 0.9432 | 0.675928 |
62 rows × 8 columns
df.head()
| | USD | JPY | BGN | CZK | DKK | GBP | CHF |
---|---|---|---|---|---|---|---|
Time | |||||||
2024-04-16 | 1.0637 | 164.54 | 1.9558 | 25.210 | 7.4609 | 0.85440 | 0.9712 |
2024-04-15 | 1.0656 | 164.05 | 1.9558 | 25.324 | 7.4606 | 0.85405 | 0.9725 |
2024-04-12 | 1.0652 | 163.16 | 1.9558 | 25.337 | 7.4603 | 0.85424 | 0.9716 |
2024-04-11 | 1.0729 | 164.18 | 1.9558 | 25.392 | 7.4604 | 0.85525 | 0.9787 |
2024-04-10 | 1.0860 | 164.89 | 1.9558 | 25.368 | 7.4594 | 0.85515 | 0.9810 |
Concatenating
new_col = 100 * df_copy.JPY / df_copy.USD
pd.concat([df_copy['USD'], new_col]) # vertical stacking
Time
2024-04-16 1.063700
2024-04-15 1.065600
2024-04-12 1.065200
2024-04-11 1.072900
2024-04-10 1.086000
...
2024-01-24 14714.351215
2024-01-23 14797.645327
2024-01-22 14779.614325
2024-01-19 14803.894553
2024-01-18 14794.482759
Length: 124, dtype: float64
pd.concat([df_copy['USD'], new_col], axis=1) # horizontal stacking
| | USD | 0 |
---|---|---|
Time | ||
2024-04-16 | 1.0637 | 15468.647175 |
2024-04-15 | 1.0656 | 15395.082583 |
2024-04-12 | 1.0652 | 15317.311303 |
2024-04-11 | 1.0729 | 15302.451300 |
2024-04-10 | 1.0860 | 15183.241252 |
... | ... | ... |
2024-01-24 | 1.0905 | 14714.351215 |
2024-01-23 | 1.0872 | 14797.645327 |
2024-01-22 | 1.0890 | 14779.614325 |
2024-01-19 | 1.0887 | 14803.894553 |
2024-01-18 | 1.0875 | 14794.482759 |
62 rows × 2 columns
Summary statistics
Pandas objects have a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics: methods that extract a single value (like the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Pandas methods have built-in handling for missing data (unlike NumPy methods). Specifically, the default for pandas methods is skipna=True, which excludes NA/null values when computing the result.
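The effect of skipna is easiest to see on a tiny made-up Series containing one missing value:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(s.mean())              # NaN skipped: (1 + 3) / 2 = 2.0
print(s.mean(skipna=False))  # nan: one missing value poisons the result
print(np.mean(s.to_numpy())) # nan: NumPy's plain mean propagates NaN too
```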
df.mean()
USD 1.082711
JPY 162.195484
BGN 1.955800
CZK 25.199081
DKK 7.456484
GBP 0.855201
CHF 0.957208
dtype: float64
df_na = df.copy()
df_na.iloc[0, 0] = np.nan
df_na.head()
| | USD | JPY | BGN | CZK | DKK | GBP | CHF |
---|---|---|---|---|---|---|---|
Time | |||||||
2024-04-16 | NaN | 164.54 | 1.9558 | 25.210 | 7.4609 | 0.85440 | 0.9712 |
2024-04-15 | 1.0656 | 164.05 | 1.9558 | 25.324 | 7.4606 | 0.85405 | 0.9725 |
2024-04-12 | 1.0652 | 163.16 | 1.9558 | 25.337 | 7.4603 | 0.85424 | 0.9716 |
2024-04-11 | 1.0729 | 164.18 | 1.9558 | 25.392 | 7.4604 | 0.85525 | 0.9787 |
2024-04-10 | 1.0860 | 164.89 | 1.9558 | 25.368 | 7.4594 | 0.85515 | 0.9810 |
df_na.mean()
USD 1.083023
JPY 162.195484
BGN 1.955800
CZK 25.199081
DKK 7.456484
GBP 0.855201
CHF 0.957208
dtype: float64
df_na.mean(skipna=False)
USD NaN
JPY 162.195484
BGN 1.955800
CZK 25.199081
DKK 7.456484
GBP 0.855201
CHF 0.957208
dtype: float64
df.describe()
| | USD | JPY | BGN | CZK | DKK | GBP | CHF |
---|---|---|---|---|---|---|---|
count | 62.000000 | 62.000000 | 6.200000e+01 | 62.000000 | 62.000000 | 62.000000 | 62.000000 |
mean | 1.082711 | 162.195484 | 1.955800e+00 | 25.199081 | 7.456484 | 0.855201 | 0.957208 |
std | 0.006755 | 1.623635 | 2.686287e-15 | 0.220672 | 0.002161 | 0.001680 | 0.015174 |
min | 1.063700 | 158.960000 | 1.955800e+00 | 24.734000 | 7.453600 | 0.850980 | 0.931500 |
25% | 1.078550 | 160.905000 | 1.955800e+00 | 25.046000 | 7.454425 | 0.854267 | 0.944900 |
50% | 1.083900 | 162.150000 | 1.955800e+00 | 25.297000 | 7.456500 | 0.855125 | 0.956150 |
75% | 1.087350 | 163.432500 | 1.955800e+00 | 25.354000 | 7.458200 | 0.856272 | 0.971500 |
max | 1.093900 | 164.970000 | 1.955800e+00 | 25.460000 | 7.460900 | 0.858460 | 0.984600 |
df.describe(percentiles=[0.01, 0.05, 0.25, 0.75, 0.95])
USD | JPY | BGN | CZK | DKK | GBP | CHF | |
---|---|---|---|---|---|---|---|
count | 62.000000 | 62.000000 | 6.200000e+01 | 62.000000 | 62.000000 | 62.000000 | 62.000000 |
mean | 1.082711 | 162.195484 | 1.955800e+00 | 25.199081 | 7.456484 | 0.855201 | 0.957208 |
std | 0.006755 | 1.623635 | 2.686287e-15 | 0.220672 | 0.002161 | 0.001680 | 0.015174 |
min | 1.063700 | 158.960000 | 1.955800e+00 | 24.734000 | 7.453600 | 0.850980 | 0.931500 |
1% | 1.064615 | 159.326000 | 1.955800e+00 | 24.742540 | 7.453661 | 0.851407 | 0.932781 |
5% | 1.071380 | 159.710500 | 1.955800e+00 | 24.759400 | 7.453905 | 0.852504 | 0.934705 |
25% | 1.078550 | 160.905000 | 1.955800e+00 | 25.046000 | 7.454425 | 0.854267 | 0.944900 |
50% | 1.083900 | 162.150000 | 1.955800e+00 | 25.297000 | 7.456500 | 0.855125 | 0.956150 |
75% | 1.087350 | 163.432500 | 1.955800e+00 | 25.354000 | 7.458200 | 0.856272 | 0.971500 |
95% | 1.092455 | 164.682500 | 1.955800e+00 | 25.414800 | 7.460395 | 0.857946 | 0.981000 |
max | 1.093900 | 164.970000 | 1.955800e+00 | 25.460000 | 7.460900 | 0.858460 | 0.984600 |
df[['USD', 'CHF', 'CZK']].describe()
USD | CHF | CZK | |
---|---|---|---|
count | 62.000000 | 62.000000 | 62.000000 |
mean | 1.082711 | 0.957208 | 25.199081 |
std | 0.006755 | 0.015174 | 0.220672 |
min | 1.063700 | 0.931500 | 24.734000 |
25% | 1.078550 | 0.944900 | 25.046000 |
50% | 1.083900 | 0.956150 | 25.297000 |
75% | 1.087350 | 0.971500 | 25.354000 |
max | 1.093900 | 0.984600 | 25.460000 |
df.mean()
USD 1.082711
JPY 162.195484
BGN 1.955800
CZK 25.199081
DKK 7.456484
GBP 0.855201
CHF 0.957208
dtype: float64
df.median()
USD 1.083900
JPY 162.150000
BGN 1.955800
CZK 25.297000
DKK 7.456500
GBP 0.855125
CHF 0.956150
dtype: float64
df.min()
USD 1.06370
JPY 158.96000
BGN 1.95580
CZK 24.73400
DKK 7.45360
GBP 0.85098
CHF 0.93150
dtype: float64
type(df.mean())
pandas.core.series.Series
df.agg(['mean', 'median'])  # aggregate multiple statistics together
USD | JPY | BGN | CZK | DKK | GBP | CHF | |
---|---|---|---|---|---|---|---|
mean | 1.082711 | 162.195484 | 1.9558 | 25.199081 | 7.456484 | 0.855201 | 0.957208 |
median | 1.083900 | 162.150000 | 1.9558 | 25.297000 | 7.456500 | 0.855125 | 0.956150 |
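agg also accepts a dict mapping columns to statistics, which is handy when different columns need different summaries. A minimal sketch on a toy frame (the column names and values here are made up):

```python
import pandas as pd

# Toy frame: per-column aggregation with a dict of column -> statistic(s)
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

result = toy.agg({'a': 'mean', 'b': ['min', 'max']})
print(result)  # rows for mean/min/max; NaN where a statistic was not requested
```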
Correlation
df['USD'].corr(df['JPY'])
np.float64(-0.10333665136813898)
df.corr()
USD | JPY | BGN | CZK | DKK | GBP | CHF | |
---|---|---|---|---|---|---|---|
USD | 1.000000 | -0.103337 | NaN | -0.218888 | -0.232007 | 0.074199 | -0.042449 |
JPY | -0.103337 | 1.000000 | NaN | 0.655093 | 0.463510 | 0.484794 | 0.901636 |
BGN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
CZK | -0.218888 | 0.655093 | NaN | 1.000000 | 0.008358 | 0.128065 | 0.649767 |
DKK | -0.232007 | 0.463510 | NaN | 0.008358 | 1.000000 | 0.307508 | 0.604572 |
GBP | 0.074199 | 0.484794 | NaN | 0.128065 | 0.307508 | 1.000000 | 0.424830 |
CHF | -0.042449 | 0.901636 | NaN | 0.649767 | 0.604572 | 0.424830 | 1.000000 |
df.corrwith(df['USD'])
USD 1.000000e+00
JPY -1.033367e-01
BGN -3.260316e-14
CZK -2.188876e-01
DKK -2.320074e-01
GBP 7.419901e-02
CHF -4.244863e-02
dtype: float64
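Note the NaN row and column for BGN in the correlation matrix above: its rate is essentially constant, so its standard deviation is (numerically) zero and the Pearson correlation is undefined. A toy example of the same effect:

```python
import pandas as pd

# A constant column has zero variance, so its Pearson correlation is undefined
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'const': [5.0, 5.0, 5.0]})
c = toy.corr()
print(c)  # the 'const' row and column are all NaN
```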
Operations between series
s1 = df['USD'].iloc[0:5].copy()
s1
Time
2024-04-16 1.0637
2024-04-15 1.0656
2024-04-12 1.0652
2024-04-11 1.0729
2024-04-10 1.0860
Name: USD, dtype: float64
s2 = df['USD'].iloc[2:7].copy()
s2
Time
2024-04-12 1.0652
2024-04-11 1.0729
2024-04-10 1.0860
2024-04-09 1.0867
2024-04-08 1.0823
Name: USD, dtype: float64
s1 + s2
Time
2024-04-08 NaN
2024-04-09 NaN
2024-04-10 2.1720
2024-04-11 2.1458
2024-04-12 2.1304
2024-04-15 NaN
2024-04-16 NaN
Freq: B, Name: USD, dtype: float64
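The NaNs above come from index alignment: arithmetic between Series first aligns the operands on their index labels, and a label present in only one operand yields a missing result. A self-contained toy example:

```python
import pandas as pd

# Arithmetic aligns on index labels; non-overlapping labels produce NaN
a = pd.Series([1.0, 2.0], index=['x', 'y'])
b = pd.Series([10.0, 20.0], index=['y', 'z'])

res = a + b
print(res)  # x -> NaN, y -> 12.0, z -> NaN
```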
df1 = df.iloc[0:4, 0:3].copy()
df1
USD | JPY | BGN | |
---|---|---|---|
Time | |||
2024-04-16 | 1.0637 | 164.54 | 1.9558 |
2024-04-15 | 1.0656 | 164.05 | 1.9558 |
2024-04-12 | 1.0652 | 163.16 | 1.9558 |
2024-04-11 | 1.0729 | 164.18 | 1.9558 |
df2 = df.iloc[1:5, 1:4].copy()
df2
JPY | BGN | CZK | |
---|---|---|---|
Time | |||
2024-04-15 | 164.05 | 1.9558 | 25.324 |
2024-04-12 | 163.16 | 1.9558 | 25.337 |
2024-04-11 | 164.18 | 1.9558 | 25.392 |
2024-04-10 | 164.89 | 1.9558 | 25.368 |
df1 + df2
BGN | CZK | JPY | USD | |
---|---|---|---|---|
Time | ||||
2024-04-10 | NaN | NaN | NaN | NaN |
2024-04-11 | 3.9116 | NaN | 328.36 | NaN |
2024-04-12 | 3.9116 | NaN | 326.32 | NaN |
2024-04-15 | 3.9116 | NaN | 328.10 | NaN |
2024-04-16 | NaN | NaN | NaN | NaN |
When one dataframe’s labels are a subset of the other’s, you can fill the missing entries instead of propagating NaN:
df1
USD | JPY | BGN | |
---|---|---|---|
Time | |||
2024-04-16 | 1.0637 | 164.54 | 1.9558 |
2024-04-15 | 1.0656 | 164.05 | 1.9558 |
2024-04-12 | 1.0652 | 163.16 | 1.9558 |
2024-04-11 | 1.0729 | 164.18 | 1.9558 |
df3 = df.iloc[0:3, 0:2].copy()
df3
USD | JPY | |
---|---|---|
Time | ||
2024-04-16 | 1.0637 | 164.54 |
2024-04-15 | 1.0656 | 164.05 |
2024-04-12 | 1.0652 | 163.16 |
df1.add(df3, fill_value=0)
BGN | JPY | USD | |
---|---|---|---|
Time | |||
2024-04-11 | 1.9558 | 164.18 | 1.0729 |
2024-04-12 | 1.9558 | 326.32 | 2.1304 |
2024-04-15 | 1.9558 | 328.10 | 2.1312 |
2024-04-16 | 1.9558 | 329.08 | 2.1274 |
Operations between dataframes and series
Operations between a dataframe and a series follow broadcasting rules similar to NumPy’s.
arr = np.arange(12.).reshape((3, 4))
print(arr.shape, arr[0].shape) # trailing dimensions match
(3, 4) (4,)
arr
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
arr - arr[0]
array([[0., 0., 0., 0.],
[4., 4., 4., 4.],
[8., 8., 8., 8.]])
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])
series = frame.iloc[0]
frame - series
b | d | e | |
---|---|---|---|
Utah | 0.0 | 0.0 | 0.0 |
Ohio | 3.0 | 3.0 | 3.0 |
Texas | 6.0 | 6.0 | 6.0 |
Oregon | 9.0 | 9.0 | 9.0 |
series2 = pd.Series(np.arange(3), index=["b", "e", "f"])
frame + series2
b | d | e | f | |
---|---|---|---|---|
Utah | 0.0 | NaN | 3.0 | NaN |
Ohio | 3.0 | NaN | 6.0 | NaN |
Texas | 6.0 | NaN | 9.0 | NaN |
Oregon | 9.0 | NaN | 12.0 | NaN |
If you instead want to broadcast over the columns, matching on the rows, you must use one of the arithmetic methods and specify that matching happens over the index. For example:
series3 = frame["d"]
series3
Utah 1.0
Ohio 4.0
Texas 7.0
Oregon 10.0
Name: d, dtype: float64
frame.sub(series3, axis="index")
b | d | e | |
---|---|---|---|
Utah | -1.0 | 0.0 | 1.0 |
Ohio | -1.0 | 0.0 | 1.0 |
Texas | -1.0 | 0.0 | 1.0 |
Oregon | -1.0 | 0.0 | 1.0 |
frame.sub(series3)  # this doesn't do the right thing: it matches on the columns
Ohio | Oregon | Texas | Utah | b | d | e | |
---|---|---|---|---|---|---|---|
Utah | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Ohio | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Texas | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Oregon | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Standardizing dataframes
df.mean()
USD 1.082711
JPY 162.195484
BGN 1.955800
CZK 25.199081
DKK 7.456484
GBP 0.855201
CHF 0.957208
dtype: float64
df - df.mean()
USD | JPY | BGN | CZK | DKK | GBP | CHF | |
---|---|---|---|---|---|---|---|
Time | |||||||
2024-04-16 | -0.019011 | 2.344516 | 4.440892e-16 | 0.010919 | 0.004416 | -0.000801 | 0.013992 |
2024-04-15 | -0.017111 | 1.854516 | 4.440892e-16 | 0.124919 | 0.004116 | -0.001151 | 0.015292 |
2024-04-12 | -0.017511 | 0.964516 | 4.440892e-16 | 0.137919 | 0.003816 | -0.000961 | 0.014392 |
2024-04-11 | -0.009811 | 1.984516 | 4.440892e-16 | 0.192919 | 0.003916 | 0.000049 | 0.021492 |
2024-04-10 | 0.003289 | 2.694516 | 4.440892e-16 | 0.168919 | 0.002916 | -0.000051 | 0.023792 |
... | ... | ... | ... | ... | ... | ... | ... |
2024-01-24 | 0.007789 | -1.735484 | 4.440892e-16 | -0.413081 | 0.000316 | 0.000229 | -0.015708 |
2024-01-23 | 0.004489 | -1.315484 | 4.440892e-16 | -0.375081 | 0.000916 | -0.000271 | -0.012608 |
2024-01-22 | 0.006289 | -1.245484 | 4.440892e-16 | -0.441081 | 0.002016 | 0.000549 | -0.011408 |
2024-01-19 | 0.005989 | -1.025484 | 4.440892e-16 | -0.386081 | 0.001016 | 0.003049 | -0.011308 |
2024-01-18 | 0.004789 | -1.305484 | 4.440892e-16 | -0.465081 | 0.000616 | 0.002529 | -0.014008 |
62 rows × 7 columns
df.sub(df.mean())
USD | JPY | BGN | CZK | DKK | GBP | CHF | |
---|---|---|---|---|---|---|---|
Time | |||||||
2024-04-16 | -0.019011 | 2.344516 | 4.440892e-16 | 0.010919 | 0.004416 | -0.000801 | 0.013992 |
2024-04-15 | -0.017111 | 1.854516 | 4.440892e-16 | 0.124919 | 0.004116 | -0.001151 | 0.015292 |
2024-04-12 | -0.017511 | 0.964516 | 4.440892e-16 | 0.137919 | 0.003816 | -0.000961 | 0.014392 |
2024-04-11 | -0.009811 | 1.984516 | 4.440892e-16 | 0.192919 | 0.003916 | 0.000049 | 0.021492 |
2024-04-10 | 0.003289 | 2.694516 | 4.440892e-16 | 0.168919 | 0.002916 | -0.000051 | 0.023792 |
... | ... | ... | ... | ... | ... | ... | ... |
2024-01-24 | 0.007789 | -1.735484 | 4.440892e-16 | -0.413081 | 0.000316 | 0.000229 | -0.015708 |
2024-01-23 | 0.004489 | -1.315484 | 4.440892e-16 | -0.375081 | 0.000916 | -0.000271 | -0.012608 |
2024-01-22 | 0.006289 | -1.245484 | 4.440892e-16 | -0.441081 | 0.002016 | 0.000549 | -0.011408 |
2024-01-19 | 0.005989 | -1.025484 | 4.440892e-16 | -0.386081 | 0.001016 | 0.003049 | -0.011308 |
2024-01-18 | 0.004789 | -1.305484 | 4.440892e-16 | -0.465081 | 0.000616 | 0.002529 | -0.014008 |
62 rows × 7 columns
df.sub(df.mean()).mean()  # almost zero: success
USD -2.184632e-16
JPY 3.483951e-14
BGN 4.440892e-16
CZK -5.042561e-15
DKK -9.311548e-16
GBP 1.146037e-16
CHF -2.148819e-17
dtype: float64
df.sub(df.mean(), axis="columns")
USD | JPY | BGN | CZK | DKK | GBP | CHF | |
---|---|---|---|---|---|---|---|
Time | |||||||
2024-04-16 | -0.019011 | 2.344516 | 4.440892e-16 | 0.010919 | 0.004416 | -0.000801 | 0.013992 |
2024-04-15 | -0.017111 | 1.854516 | 4.440892e-16 | 0.124919 | 0.004116 | -0.001151 | 0.015292 |
2024-04-12 | -0.017511 | 0.964516 | 4.440892e-16 | 0.137919 | 0.003816 | -0.000961 | 0.014392 |
2024-04-11 | -0.009811 | 1.984516 | 4.440892e-16 | 0.192919 | 0.003916 | 0.000049 | 0.021492 |
2024-04-10 | 0.003289 | 2.694516 | 4.440892e-16 | 0.168919 | 0.002916 | -0.000051 | 0.023792 |
... | ... | ... | ... | ... | ... | ... | ... |
2024-01-24 | 0.007789 | -1.735484 | 4.440892e-16 | -0.413081 | 0.000316 | 0.000229 | -0.015708 |
2024-01-23 | 0.004489 | -1.315484 | 4.440892e-16 | -0.375081 | 0.000916 | -0.000271 | -0.012608 |
2024-01-22 | 0.006289 | -1.245484 | 4.440892e-16 | -0.441081 | 0.002016 | 0.000549 | -0.011408 |
2024-01-19 | 0.005989 | -1.025484 | 4.440892e-16 | -0.386081 | 0.001016 | 0.003049 | -0.011308 |
2024-01-18 | 0.004789 | -1.305484 | 4.440892e-16 | -0.465081 | 0.000616 | 0.002529 | -0.014008 |
62 rows × 7 columns
# also works, but maybe less explicit than axis="columns"
df.sub(df.mean(), axis=1).mean()
USD -2.184632e-16
JPY 3.483951e-14
BGN 4.440892e-16
CZK -5.042561e-15
DKK -9.311548e-16
GBP 1.146037e-16
CHF -2.148819e-17
dtype: float64
df_standardized = (df - df.mean()) / df.std()
df_standardized2 = df.sub(df.mean()).div(df.std())
# most explicit
df_standardized3 = df.sub(df.mean(), axis="columns").div(df.std(), axis="columns")
df_standardized4 = df.sub(df.mean(), axis=1).div(df.std(), axis=1)
((df_standardized4 - df_standardized) == 0.).all()
USD True
JPY True
BGN True
CZK True
DKK True
GBP True
CHF True
dtype: bool
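We can check on a small toy frame (not the exchange-rate data) that this standardization indeed yields numerically zero mean and unit standard deviation in each column:

```python
import numpy as np
import pandas as pd

# Toy frame: verify that (x - mean) / std gives mean ~0 and std ~1 per column
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'b': [10.0, 20.0, 30.0, 40.0]})
z = toy.sub(toy.mean()).div(toy.std())

print(np.allclose(z.mean(), 0.0))  # True
print(np.allclose(z.std(), 1.0))   # True
```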
df_max_standardized = df / df.max()
df_standardized
USD | JPY | BGN | CZK | DKK | GBP | CHF | |
---|---|---|---|---|---|---|---|
Time | |||||||
2024-04-16 | -2.814246 | 1.443993 | 0.165317 | 0.049482 | 2.043993 | -0.476651 | 0.922100 |
2024-04-15 | -2.532988 | 1.142200 | 0.165317 | 0.566086 | 1.905139 | -0.684975 | 1.007773 |
2024-04-12 | -2.592200 | 0.594048 | 0.165317 | 0.624997 | 1.766285 | -0.571885 | 0.948461 |
2024-04-11 | -1.452368 | 1.222268 | 0.165317 | 0.874235 | 1.812570 | 0.029281 | 1.416367 |
2024-04-10 | 0.486828 | 1.659558 | 0.165317 | 0.765476 | 1.349722 | -0.030241 | 1.567942 |
... | ... | ... | ... | ... | ... | ... | ... |
2024-01-24 | 1.152964 | -1.068888 | 0.165317 | -1.871920 | 0.146319 | 0.136419 | -1.035197 |
2024-01-23 | 0.664465 | -0.810209 | 0.165317 | -1.699719 | 0.424028 | -0.161188 | -0.830900 |
2024-01-22 | 0.930919 | -0.767096 | 0.165317 | -1.998805 | 0.933160 | 0.326887 | -0.751817 |
2024-01-19 | 0.886510 | -0.631598 | 0.165317 | -1.749567 | 0.470313 | 1.814921 | -0.745227 |
2024-01-18 | 0.708874 | -0.804050 | 0.165317 | -2.107564 | 0.285174 | 1.505410 | -0.923163 |
62 rows × 7 columns
Function application and mapping
def f1(x):
return x.max() - x.min()
df.apply(f1)  # applied to each column by default
USD 0.03020
JPY 6.01000
BGN 0.00000
CZK 0.72600
DKK 0.00730
GBP 0.00748
CHF 0.05310
dtype: float64
df.apply(f1, axis='columns')
Time
2024-04-16 163.68560
2024-04-15 163.19595
2024-04-12 162.30576
2024-04-11 163.32475
2024-04-10 164.03485
...
2024-01-24 159.60457
2024-01-23 160.02507
2024-01-22 160.09425
2024-01-19 160.31175
2024-01-18 160.03227
Length: 62, dtype: float64
def f2(x):
return pd.Series([x.min(), x.max()], index=["min", "max"])
df.apply(f2)
USD | JPY | BGN | CZK | DKK | GBP | CHF | |
---|---|---|---|---|---|---|---|
min | 1.0637 | 158.96 | 1.9558 | 24.734 | 7.4536 | 0.85098 | 0.9315 |
max | 1.0939 | 164.97 | 1.9558 | 25.460 | 7.4609 | 0.85846 | 0.9846 |
def my_format(x):
return f"{x:.2f}"
For element-wise operations, use pd.DataFrame.map:
df.map(my_format)
USD | JPY | BGN | CZK | DKK | GBP | CHF | |
---|---|---|---|---|---|---|---|
Time | |||||||
2024-04-16 | 1.06 | 164.54 | 1.96 | 25.21 | 7.46 | 0.85 | 0.97 |
2024-04-15 | 1.07 | 164.05 | 1.96 | 25.32 | 7.46 | 0.85 | 0.97 |
2024-04-12 | 1.07 | 163.16 | 1.96 | 25.34 | 7.46 | 0.85 | 0.97 |
2024-04-11 | 1.07 | 164.18 | 1.96 | 25.39 | 7.46 | 0.86 | 0.98 |
2024-04-10 | 1.09 | 164.89 | 1.96 | 25.37 | 7.46 | 0.86 | 0.98 |
... | ... | ... | ... | ... | ... | ... | ... |
2024-01-24 | 1.09 | 160.46 | 1.96 | 24.79 | 7.46 | 0.86 | 0.94 |
2024-01-23 | 1.09 | 160.88 | 1.96 | 24.82 | 7.46 | 0.85 | 0.94 |
2024-01-22 | 1.09 | 160.95 | 1.96 | 24.76 | 7.46 | 0.86 | 0.95 |
2024-01-19 | 1.09 | 161.17 | 1.96 | 24.81 | 7.46 | 0.86 | 0.95 |
2024-01-18 | 1.09 | 160.89 | 1.96 | 24.73 | 7.46 | 0.86 | 0.94 |
62 rows × 7 columns
df.map(lambda x: f"{x:.2f}")
USD | JPY | BGN | CZK | DKK | GBP | CHF | |
---|---|---|---|---|---|---|---|
Time | |||||||
2024-04-16 | 1.06 | 164.54 | 1.96 | 25.21 | 7.46 | 0.85 | 0.97 |
2024-04-15 | 1.07 | 164.05 | 1.96 | 25.32 | 7.46 | 0.85 | 0.97 |
2024-04-12 | 1.07 | 163.16 | 1.96 | 25.34 | 7.46 | 0.85 | 0.97 |
2024-04-11 | 1.07 | 164.18 | 1.96 | 25.39 | 7.46 | 0.86 | 0.98 |
2024-04-10 | 1.09 | 164.89 | 1.96 | 25.37 | 7.46 | 0.86 | 0.98 |
... | ... | ... | ... | ... | ... | ... | ... |
2024-01-24 | 1.09 | 160.46 | 1.96 | 24.79 | 7.46 | 0.86 | 0.94 |
2024-01-23 | 1.09 | 160.88 | 1.96 | 24.82 | 7.46 | 0.85 | 0.94 |
2024-01-22 | 1.09 | 160.95 | 1.96 | 24.76 | 7.46 | 0.86 | 0.95 |
2024-01-19 | 1.09 | 161.17 | 1.96 | 24.81 | 7.46 | 0.86 | 0.95 |
2024-01-18 | 1.09 | 160.89 | 1.96 | 24.73 | 7.46 | 0.86 | 0.94 |
62 rows × 7 columns
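Series have an analogous map method for transforming a single column element-wise; a minimal sketch on toy values:

```python
import pandas as pd

# Element-wise formatting of a single Series with Series.map
s = pd.Series([0.12345, 6.789])
formatted = s.map(lambda x: f"{x:.2f}")
print(formatted)  # '0.12', '6.79'
```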
Wide vs. long data formats
df_standardized = df_standardized.reset_index()  # turn Time back into a column
df_standardized.head()
Time | USD | JPY | BGN | CZK | DKK | GBP | CHF | |
---|---|---|---|---|---|---|---|---|
0 | 2024-04-16 | -2.814246 | 1.443993 | 0.165317 | 0.049482 | 2.043993 | -0.476651 | 0.922100 |
1 | 2024-04-15 | -2.532988 | 1.142200 | 0.165317 | 0.566086 | 1.905139 | -0.684975 | 1.007773 |
2 | 2024-04-12 | -2.592200 | 0.594048 | 0.165317 | 0.624997 | 1.766285 | -0.571885 | 0.948461 |
3 | 2024-04-11 | -1.452368 | 1.222268 | 0.165317 | 0.874235 | 1.812570 | 0.029281 | 1.416367 |
4 | 2024-04-10 | 0.486828 | 1.659558 | 0.165317 | 0.765476 | 1.349722 | -0.030241 | 1.567942 |
The data is in what’s known as a ‘wide’ format: each currency has its own column. We can transform it to a ‘long’ format using melt with the arguments:
- id_vars: the column(s) to use as identifier variables
- var_name: the name of the new column that will hold the former column names
- value_name: the name of the new column that will hold the values
df_long = df_standardized.melt(id_vars='Time', var_name='currency', value_name='rate')
df_long
Time | currency | rate | |
---|---|---|---|
0 | 2024-04-16 | USD | -2.814246 |
1 | 2024-04-15 | USD | -2.532988 |
2 | 2024-04-12 | USD | -2.592200 |
3 | 2024-04-11 | USD | -1.452368 |
4 | 2024-04-10 | USD | 0.486828 |
... | ... | ... | ... |
429 | 2024-01-24 | CHF | -1.035197 |
430 | 2024-01-23 | CHF | -0.830900 |
431 | 2024-01-22 | CHF | -0.751817 |
432 | 2024-01-19 | CHF | -0.745227 |
433 | 2024-01-18 | CHF | -0.923163 |
434 rows × 3 columns
This “long” format is an example of a “tidy data frame”, which we will discuss in more detail next lecture.
To go back to the “wide” format, we can use the pivot method:
df_long.pivot(index=['Time'],        # column which will be the index
              columns=['currency'],  # variable which will expand out to columns
              values=['rate'])       # values in the table
rate | |||||||
---|---|---|---|---|---|---|---|
currency | BGN | CHF | CZK | DKK | GBP | JPY | USD |
Time | |||||||
2024-01-18 | 0.165317 | -0.923163 | -2.107564 | 0.285174 | 1.505410 | -0.804050 | 0.708874 |
2024-01-19 | 0.165317 | -0.745227 | -1.749567 | 0.470313 | 1.814921 | -0.631598 | 0.886510 |
2024-01-22 | 0.165317 | -0.751817 | -1.998805 | 0.933160 | 0.326887 | -0.767096 | 0.930919 |
2024-01-23 | 0.165317 | -0.830900 | -1.699719 | 0.424028 | -0.161188 | -0.810209 | 0.664465 |
2024-01-24 | 0.165317 | -1.035197 | -1.871920 | 0.146319 | 0.136419 | -1.068888 | 1.152964 |
... | ... | ... | ... | ... | ... | ... | ... |
2024-04-10 | 0.165317 | 1.567942 | 0.765476 | 1.349722 | -0.030241 | 1.659558 | 0.486828 |
2024-04-11 | 0.165317 | 1.416367 | 0.874235 | 1.812570 | 0.029281 | 1.222268 | -1.452368 |
2024-04-12 | 0.165317 | 0.948461 | 0.624997 | 1.766285 | -0.571885 | 0.594048 | -2.592200 |
2024-04-15 | 0.165317 | 1.007773 | 0.566086 | 1.905139 | -0.684975 | 1.142200 | -2.532988 |
2024-04-16 | 0.165317 | 0.922100 | 0.049482 | 2.043993 | -0.476651 | 1.443993 | -2.814246 |
62 rows × 7 columns
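The melt/pivot round trip can be checked on a small self-contained frame (the times and rates below are hypothetical):

```python
import pandas as pd

# Toy wide frame -> long -> back to wide; the round trip preserves the values
wide = pd.DataFrame({'Time': ['t1', 't2'],
                     'USD': [1.06, 1.07],
                     'JPY': [164.5, 164.0]})

long = wide.melt(id_vars='Time', var_name='currency', value_name='rate')
back = long.pivot(index='Time', columns='currency', values='rate')

print(back.loc['t1', 'USD'])  # 1.06, same as in the original wide frame
```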
Example visualization with seaborn
We use the “long” data for a Seaborn plot:
sns.relplot(data=df_long,
            x='Time',
            y='rate',
            hue='currency',
            kind='line',
            height=3,
            aspect=3)