import pandas as pd
import numpy as np
import seaborn as sns
Lecture 4 - Data loading and cleaning
Overview
In this lecture, we cover data loading, both from a local file and from the web. Loading data is the process of converting data (stored as a csv, HTML, etc.) into an object that Python can access and manipulate.
We also cover basic data cleaning strategies.
References
This lecture contains material from:
- Chapters 6, 7, Python for Data Analysis, 3E (Wes McKinney, 2022)
- Chapter 2, Data Science: A First Introduction with Python (Timbers et al. 2022)
- Pierre Bellec’s notes
- Chapter 5, R for Data Science (2e) (Wickham et al. 2023)
Directories
Recall from Lecture 2, we discussed your working directory - this is ‘where you are currently’ on the computer.
Python needs to know where the files live - they can be on your computer (local) or on the internet (remote).
The place where your file lives is referred to as its ‘path’. You can think of the path as directions to the file. There are two kinds of paths: relative and absolute. A relative path indicates where a file is with respect to your working directory. An absolute path indicates where the file is with respect to the computer's filesystem base (or root) folder, regardless of where you are working.
Example from Data Science: A first introduction with Python (link)
Here are some helpful shortcuts for file paths.
Symbol | Meaning
---|---
~ | Home directory
. | Current working directory
.. | One directory up from current working directory
../.. | Two directories up from current working directory
Example: suppose we are in /home/dsci-100/project3. The location of happiness_report.csv relative to our current position is:
- data/happiness_report.csv (you could also write ./data/happiness_report.csv, with a single dot for the current directory rather than the two dots that mean the parent directory!)
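For example, here is a quick sketch using Python's pathlib module to see where you currently are and how a relative path resolves (the file name is the one from the example above and is purely illustrative):

from pathlib import Path

# absolute path of the current working directory
print(Path.cwd())

# a relative path is interpreted with respect to the working directory
rel = Path('data/happiness_report.csv')
print(rel.resolve())   # the corresponding absolute path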
Best practices
In general, you should avoid using absolute file paths in your code, as these will generally not work on someone else's computer.
A good workflow for a data analysis project is:
- create a project directory (e.g. /Users/gm845/Library/2025/project1)
- in project1, create a file structure - I like:
  - src: this is where you put your code
  - data: this is where you put your data
  - doc: this is where you put your documents (e.g. Word or LaTeX files)
  - etc: miscellaneous files (e.g. pictures of whiteboards)

Within your project, you should use relative file paths. This makes your code portable if someone else copies your project1 folder.
In terms of naming your files, here are some good file naming practices from Jenny Bryan (link). Some key principles:
- machine readable file names (easy to narrow file lists based on names, or extract info from file names by splitting; NO SPACES)
- human readable file names (enough info to know what the file is)
- plays well with default ordering (e.g. zero pad files 01-exercise.py, 02-exercise.py, …, 11-exercise.py)
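For example (the file names below are invented for illustration):

# machine-readable names can be split to recover information
fname = '2025-09-01_lec-04_data-loading.ipynb'
date, lecture, topic = fname.split('_')
print(date, lecture, topic)

# zero padding keeps the default (alphabetical) ordering correct
print(sorted(['11-exercise.py', '01-exercise.py', '02-exercise.py']))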
Loading data
In Lectures 2 and 3, we looked at the exchange rate data, which we read in as a .csv
(comma separated value) file using pd.read_csv('../data/rates.csv')
.
Comma separated values files look like this:
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
ex1 = pd.read_csv('../data/example_01.csv')
ex1
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
Suppose we do not have a header in the file - then, we can specify header=None:
ex2 = pd.read_csv('../data/example_02.csv', header=None)
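If there is no header row, you can also supply your own column names with the names argument; a minimal sketch (the column names here are just illustrative):

ex2_named = pd.read_csv('../data/example_02.csv', header=None,
                        names=['a', 'b', 'c', 'd', 'message'])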
Skipping rows
!cat ../data/example_03.csv
Here is a dataframe that we created:
(it is a simple example)
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
Note: Above, the !
allows shell commands to be run within cells. That is, in Terminal (or another shell), we can run cat ../data/example_03.csv
to display the file.
Suppose we try to load example_03.csv
:
# this gives an error
# ex3 = pd.read_csv('../data/example_03.csv')
We don’t want the first 2 lines; we can use the argument skiprows
to skip them when loading the data:
ex3 = pd.read_csv('../data/example_03.csv', skiprows=2)
ex3
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
Using the sep argument for different separators
So far we have seen csv
files. What about .txt
files with tabs separating datapoints?
pd.read_csv('../data/example_04.txt', header=None)
0 | |
---|---|
0 | 1\t2\t3\t4\thello |
1 | 5\t6\t7\t8\tworld |
2 | 9\t10\t11\t12\tfoo |
ex4 = pd.read_csv('../data/example_04.txt', sep='\t', header=None)
ex4
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
Note: \t
is an example of an escaped character, which always starts with a backslash (\
). Escaped characters are used to represent non-printing characters (like the tab or new line \n
).
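For example, a tab or newline is a single (non-printing) character, not a two-character string:

s = 'a\tb\nc'
print(s)        # the tab and newline are rendered, not shown as \t and \n
print(len(s))   # 5 characters: 'a', tab, 'b', newline, 'c'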
Note: there are lots of potential arguments for pd.read_csv
. In Lecture 2, we saw the parse_dates
argument as:
df = pd.read_csv('../data/rates.csv', parse_dates=['Time'])
The parse_dates
argument takes a list of column names and tries to parse each of the columns as a date column.
The pd.read_csv
arguments may fall into a few categories:
Indexing: Can treat one or more columns as the index of the returned DataFrame, and control whether to get column names from the file, from arguments you provide, or not at all.
Type inference and data conversion: Includes user-defined value conversions and custom lists of missing value markers.
Date and time parsing: Includes the ability to combine date and time information spread over multiple columns into a single column in the result.
Iterating: Support for iterating over chunks of very large files.
Unclean data issues: Includes skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.
For full information, you can see the documentation here.
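For example, a few of these arguments used together on the exchange rate data from Lecture 2 (a sketch assuming the same rates.csv file; the na_values and nrows choices are just for illustration):

rates = pd.read_csv('../data/rates.csv',
                    parse_dates=['Time'],   # date parsing
                    index_col='Time',       # indexing: use a column as the row index
                    na_values=['NA'],       # extra strings to treat as missing
                    nrows=10)               # iterating: read only the first 10 rows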
Reading URLs
We can also read in data from a URL.
= pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv") penguins
penguins.head()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
Reading Excel files
Gapminder is an independent education non-profit which releases public datasets (link).
We downloaded three datasets:
- gdp_pcap.xlsx (Gross Domestic Product per capita)
- lex.xlsx (Life expectancy)
- pop.xlsx (Population)
Note: there are years past 2025 - these are projections, with details of projections in the documentation linked above.
The file name extension .xlsx
refers to Microsoft Excel files, which store data very differently to csv
or txt
files:
(base) sas5ce91e6ae9a7:lec-4 gm845$ cat ../data/gdp_pcap.xlsx | head -n 2
PK?]Nd??xl/_rels/workbook.xml.rels<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships"><Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet" Target="worksheets/sheet1.xml"/><Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/sharedStrings" Target="sharedStrings.xml"/><Relationship Id="rId3"
To read this data, we use pd.read_excel
. Note: you may need to install the openpyxl
package in terminal (after activating your environment):
conda activate msds597
conda install openpyxl
and restart the VSCode python kernel.
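Note: by default pd.read_excel reads the first sheet of a workbook. If a workbook has several sheets, the sheet_name argument selects which one to load; a quick sketch (the sheet name 'Sheet1' is hypothetical):

# first sheet (the default)
gdp_first = pd.read_excel('../data/gdp_pcap.xlsx', sheet_name=0)

# a named sheet, or all sheets as a dict of DataFrames
# gdp_named = pd.read_excel('../data/gdp_pcap.xlsx', sheet_name='Sheet1')
# all_sheets = pd.read_excel('../data/gdp_pcap.xlsx', sheet_name=None)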
gdp = pd.read_excel('../data/gdp_pcap.xlsx')
pop = pd.read_excel('../data/pop.xlsx')
lex = pd.read_excel('../data/lex.xlsx')
gdp.head()
country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2091 | 2092 | 2093 | 2094 | 2095 | 2096 | 2097 | 2098 | 2099 | 2100 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 481 | 481 | 481 | 481 | 481 | 481 | 481 | 481 | 481 | ... | 4680 | 4790 | 4910 | 5020 | 5140 | 5260 | 5380 | 5510 | 5640 | 5780 |
1 | Angola | 373 | 374 | 376 | 378 | 379 | 381 | 383 | 385 | 386 | ... | 24.5k | 25k | 25.6k | 26.1k | 26.6k | 27.1k | 27.7k | 28.2k | 28.8k | 29.3k |
2 | Albania | 469 | 471 | 472 | 473 | 475 | 476 | 477 | 479 | 480 | ... | 54.5k | 55.1k | 55.7k | 56.3k | 56.9k | 57.4k | 58k | 58.6k | 59.2k | 59.8k |
3 | Andorra | 1370 | 1370 | 1370 | 1380 | 1380 | 1380 | 1390 | 1390 | 1390 | ... | 79.9k | 80.2k | 80.4k | 80.7k | 81k | 81.3k | 81.5k | 81.8k | 82k | 82.3k |
4 | UAE | 1140 | 1150 | 1150 | 1150 | 1160 | 1160 | 1170 | 1170 | 1180 | ... | 92.6k | 92.6k | 92.6k | 92.7k | 92.7k | 92.7k | 92.8k | 92.8k | 92.8k | 92.9k |
5 rows × 302 columns
pop.head()
country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2091 | 2092 | 2093 | 2094 | 2095 | 2096 | 2097 | 2098 | 2099 | 2100 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 3.28M | 3.28M | 3.28M | 3.28M | 3.28M | 3.28M | 3.28M | 3.28M | 3.28M | ... | 124M | 125M | 126M | 126M | 127M | 128M | 128M | 129M | 130M | 130M |
1 | Angola | 1.57M | 1.57M | 1.57M | 1.57M | 1.57M | 1.57M | 1.57M | 1.57M | 1.57M | ... | 139M | 140M | 142M | 143M | 144M | 145M | 147M | 148M | 149M | 150M |
2 | Albania | 400k | 402k | 404k | 405k | 407k | 409k | 411k | 413k | 414k | ... | 1.34M | 1.32M | 1.3M | 1.29M | 1.27M | 1.25M | 1.23M | 1.22M | 1.2M | 1.18M |
3 | Andorra | 2650 | 2650 | 2650 | 2650 | 2650 | 2650 | 2650 | 2650 | 2650 | ... | 52.8k | 52.1k | 51.5k | 50.8k | 50.2k | 49.6k | 49k | 48.4k | 47.8k | 47.2k |
4 | UAE | 40.2k | 40.2k | 40.2k | 40.2k | 40.2k | 40.2k | 40.2k | 40.2k | 40.2k | ... | 24.1M | 24.3M | 24.5M | 24.7M | 25M | 25.2M | 25.4M | 25.7M | 25.9M | 26.1M |
5 rows × 302 columns
lex.head()
country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2091 | 2092 | 2093 | 2094 | 2095 | 2096 | 2097 | 2098 | 2099 | 2100 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 28.2 | 28.2 | 28.2 | 28.2 | 28.2 | 28.2 | 28.1 | 28.1 | 28.1 | ... | 75.9 | 76.1 | 76.2 | 76.4 | 76.5 | 76.7 | 76.8 | 77.0 | 77.1 | 77.3 |
1 | Angola | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | ... | 75.8 | 75.9 | 76.0 | 76.1 | 76.3 | 76.4 | 76.5 | 76.7 | 76.8 | 76.9 |
2 | Albania | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | ... | 88.9 | 89.0 | 89.2 | 89.3 | 89.4 | 89.5 | 89.6 | 89.7 | 89.8 | 90.0 |
3 | Andorra | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 91.0 | 91.1 | 91.3 | 91.4 | 91.5 | 91.6 | 91.7 | 91.8 | 91.9 | 92.0 |
4 | UAE | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | ... | 83.7 | 83.8 | 83.9 | 84.0 | 84.1 | 84.2 | 84.3 | 84.5 | 84.6 | 84.7 |
5 rows × 302 columns
Let’s practice some of our data skills from previous lectures.
- Turn each dataset into tidy data
gdp = gdp.melt(id_vars='country',
               var_name='year',
               value_name='gdp')
gdp.head()
country | year | gdp | |
---|---|---|---|
0 | Afghanistan | 1800 | 481 |
1 | Angola | 1800 | 373 |
2 | Albania | 1800 | 469 |
3 | Andorra | 1800 | 1370 |
4 | UAE | 1800 | 1140 |
pop = pop.melt(id_vars='country',
               var_name='year',
               value_name='pop')

lex = lex.melt(id_vars='country',
               var_name='year',
               value_name='lex')
- Merge the datasets together
gapminder = gdp.merge(pop, on=['country', 'year']).merge(lex, on=['country', 'year'])
gapminder.head()
country | year | gdp | pop | lex | |
---|---|---|---|---|---|
0 | Afghanistan | 1800 | 481 | 3.28M | 28.2 |
1 | Angola | 1800 | 373 | 1.57M | 27.0 |
2 | Albania | 1800 | 469 | 400k | 35.4 |
3 | Andorra | 1800 | 1370 | 2650 | NaN |
4 | UAE | 1800 | 1140 | 40.2k | 30.7 |
- Keep only years 1950-2019
keep = (gapminder['year'] >= 1950) & (gapminder['year'] <= 2019)
# an alternative
keep = gapminder['year'].between(1950, 2019)

gapminder = gapminder[keep]

gapminder = gapminder.reset_index().drop(columns='index')
- Are there any NaNs?
gapminder.isna().sum()
country 0
year 0
gdp 0
pop 0
lex 0
dtype: int64
String manipulation
Illustration: Gapminder data
We can still see some issues - numbers are in 14M
and 14k
formats and are read as strings, not integers or floats.
We can use Python string methods to clean this data.
First, let’s look at some of the data:
test = gapminder.loc[0, 'pop']
test
'7.78M'
isinstance(test, str)
True
Numbers in the millions are stored as strings.
test2 = gapminder.loc[3, 'pop']
test2
6000
type(test2)
int
We can iterate over strings:
for ch in test:
    print(ch)
7
.
7
8
M
For each string element, we can see if it’s a letter using .isalpha()
.
for ch in test:
    if ch.isalpha():
        print(ch)
M
Finally, let's write a function that returns all the letters we may see in a pd.Series:
def get_number_formats(x):
    num_forms = []
    for val in x:
        if isinstance(val, str):
            for ch in val:
                if ch.isalpha():
                    num_forms.append(ch)
    return set(num_forms)
num_forms = get_number_formats(gapminder['pop'])
num_forms
{'B', 'M', 'k'}
Now, convert the strings to floats:
def convert_to_float(x):
    if isinstance(x, str):
        if 'k' in x:
            x_new = float(x.replace('k', '')) * 1000
        if 'M' in x:
            x_new = float(x.replace('M', '')) * 1000000
        if 'B' in x:
            x_new = float(x.replace('B', '')) * 1000000000
        return x_new
    else:
        return x
gapminder[['gdp', 'pop']] = gapminder[['gdp', 'pop']].map(convert_to_float)
gapminder['pop'] = gapminder['pop'].astype('int64')
gapminder['year'] = gapminder['year'].astype('int64')
year_plot = 2007

g = sns.relplot(gapminder[gapminder['year'] == year_plot],
                x='gdp',
                y='lex',
                size='pop',
                height=4,
                aspect=2)
g
Let’s plot in log scale for the x-axis:
g = sns.relplot(gapminder[gapminder['year'] == year_plot],
                x='gdp',
                y='lex',
                size='pop',
                height=4,
                aspect=2)
g.set(xscale='log')
g
Above, we used the following Python built-in string methods:
.isalpha
.replace
Another helpful method is .split(), which breaks strings into a list of substrings using the passed delimiter:
my_string = '2,3,4,5'
my_string.split(',')
['2', '3', '4', '5']
String functions in pandas
We can apply string methods to a pandas Series (for example, a DataFrame column) using the .str accessor, i.e. pd.Series.str.<method>.
Here is a list of some more helpful functions.
subset = gapminder['country'].str.contains('U')
gapminder['country'][subset].unique()
array(['UAE', 'UK', 'Uganda', 'Ukraine', 'Uruguay', 'USA', 'Uzbekistan'],
dtype=object)
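A few other vectorized .str methods, as a quick sketch on the same column:

gapminder['country'].str.upper().head()         # upper-case every entry
gapminder['country'].str.len().head()           # length of each string
gapminder['country'].str.startswith('U').sum()  # how many names start with 'U'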
Standardizing Country Names
Let’s go back to our gapminder data! We are going to add continent data.
Install these packages in your msds597
environment:
Terminal
pip install pycountry
pip install pycountry_convert
import pycountry
import pycountry_convert as pc
df_country = gapminder[['country']].drop_duplicates()
df_country['country_code'] = pd.NA
df_country.dtypes
country object
country_code object
dtype: object
exceptions = []
for c in df_country['country']:
    try:
        df_country.loc[df_country['country'] == c, 'country_code'] = pc.country_name_to_country_alpha2(c)
    except:
        exceptions.append(c)

exceptions
['UAE',
"Cote d'Ivoire",
'Congo, Dem. Rep.',
'Congo, Rep.',
'Micronesia, Fed. Sts.',
'UK',
'Hong Kong, China',
'Lao',
'St. Vincent and the Grenadines']
Let's now do the painstaking work of handling the exceptions:
uae = pycountry.countries.search_fuzzy('emirates')
uae
[Country(alpha_2='AE', alpha_3='ARE', flag='🇦🇪', name='United Arab Emirates', numeric='784')]
df_country['country'].str.contains('UAE')
0 False
1 False
2 False
3 False
4 True
...
190 False
191 False
192 False
193 False
194 False
Name: country, Length: 195, dtype: bool
df_country.loc[df_country['country'].str.contains('UAE'), 'country_code'] = uae[0].alpha_2

ivoire = pycountry.countries.search_fuzzy('ivoire')
df_country.loc[df_country['country'].str.contains('Ivoire'), 'country_code'] = ivoire[0].alpha_2

congo = pycountry.countries.search_fuzzy('congo')
congo
[Country(alpha_2='CG', alpha_3='COG', flag='🇨🇬', name='Congo', numeric='178', official_name='Republic of the Congo'),
Country(alpha_2='CD', alpha_3='COD', flag='🇨🇩', name='Congo, The Democratic Republic of the', numeric='180')]
df_country.loc[df_country['country'].str.contains('Congo, Dem.'), 'country_code'] = congo[1].alpha_2
df_country.loc[df_country['country'].str.contains('Congo, Rep.'), 'country_code'] = congo[0].alpha_2

mic = pycountry.countries.search_fuzzy('micronesia')
df_country.loc[df_country['country'].str.contains('Micronesia, Fed. Sts.'), 'country_code'] = mic[0].alpha_2

gb = pycountry.countries.search_fuzzy('great britain')
df_country.loc[df_country['country'].str.contains('UK'), 'country_code'] = gb[0].alpha_2

hk = pycountry.countries.search_fuzzy('hong kong')
df_country.loc[df_country['country'].str.contains('Hong Kong'), 'country_code'] = hk[0].alpha_2

laos = pycountry.countries.search_fuzzy('Laos')
df_country.loc[df_country['country'].str.contains('Lao'), 'country_code'] = laos[0].alpha_2

vin = pycountry.countries.search_fuzzy('Vincent')
df_country.loc[df_country['country'].str.contains('Vincent'), 'country_code'] = vin[0].alpha_2
Have we found them all?
any(df_country['country_code'].isna())
False
df_country['continent'] = pd.NA

exceptions = []
for c in df_country['country_code']:
    try:
        cc = pc.country_alpha2_to_continent_code(c)
        df_country.loc[df_country['country_code'] == c, 'continent'] = pc.convert_continent_code_to_continent_name(cc)
    except:
        exceptions.append(c)

exceptions
['TL']
df_country.loc[df_country['country_code'] == 'TL', 'continent'] = 'Asia'
any(df_country['continent'].isna())
False
Merging continent data
gapminder = gapminder.merge(df_country, on='country')
gapminder.head()
country | year | gdp | pop | lex | country_code | continent | |
---|---|---|---|---|---|---|---|
0 | Afghanistan | 1950 | 1450.0 | 7780000 | 42.7 | AF | Asia |
1 | Angola | 1950 | 2230.0 | 4550000 | 45.6 | AO | Africa |
2 | Albania | 1950 | 1980.0 | 1250000 | 52.2 | AL | Europe |
3 | Andorra | 1950 | 8350.0 | 6000 | 74.6 | AD | Europe |
4 | UAE | 1950 | 1710.0 | 74500 | 58.4 | AE | Asia |
g = sns.relplot(gapminder[gapminder['year'] == year_plot],
                x='gdp',
                y='lex',
                size='pop',
                hue='continent',
                height=4.5,
                aspect=1.5)
g.set(xscale='log')
g.set_axis_labels("Gross Domestic Product per Capita", "Life Expectancy")
g.set(title=f'Gapminder Data (Year = {year_plot})')
g
Illustration: World Health Organization Tuberculosis Data
We now consider another data cleaning exercise, adapted from R For Data Science (2e).
The data is who_data.csv
, collected by the World Health Organisation (link). It contains information about tuberculosis diagnoses. The data contains:
- country
- year
- 56 columns like new_sp_m014, new_ep_m4554, and newrel_m3544.

Each of the 56 columns follows a pattern.
The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.
The next two letters describe the type of TB:
- rel stands for cases of relapse
- ep stands for cases of extrapulmonary TB
- sn stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)
- sp stands for cases of pulmonary TB that could be diagnosed by a pulmonary smear (smear positive)
The sixth letter gives the sex of TB patients. The dataset groups cases by males (m) and females (f).
The remaining numbers give the age group. The dataset groups cases into seven age groups:
- 014 = 0 – 14 years old
- 1524 = 15 – 24 years old
- 2534 = 25 – 34 years old
- 3544 = 35 – 44 years old
- 4554 = 45 – 54 years old
- 5564 = 55 – 64 years old
- 65 = 65 or older
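For example, splitting one of these column names on '_' recovers each piece; this is a quick sketch of the splitting we will apply to the whole dataset below:

key = 'new_sp_m014'
new, tb_type, sexage = key.split('_')
sex, age = sexage[0], sexage[1:]
print(new, tb_type, sex, age)   # new sp m 014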
who = pd.read_csv('../data/who_data.csv')
who.head()
country | iso2 | iso3 | year | new_sp_m014 | new_sp_m1524 | new_sp_m2534 | new_sp_m3544 | new_sp_m4554 | new_sp_m5564 | ... | newrel_m4554 | newrel_m5564 | newrel_m65 | newrel_f014 | newrel_f1524 | newrel_f2534 | newrel_f3544 | newrel_f4554 | newrel_f5564 | newrel_f65 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | AF | AFG | 1980 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Afghanistan | AF | AFG | 1981 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Afghanistan | AF | AFG | 1982 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Afghanistan | AF | AFG | 1983 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | Afghanistan | AF | AFG | 1984 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 60 columns
Let’s first put the data in tidy form:
who_melt = who.melt(id_vars=['country', 'iso2', 'iso3', 'year'],
                    var_name='key',
                    value_name='cases')
who_melt.head()
country | iso2 | iso3 | year | key | cases | |
---|---|---|---|---|---|---|
0 | Afghanistan | AF | AFG | 1980 | new_sp_m014 | NaN |
1 | Afghanistan | AF | AFG | 1981 | new_sp_m014 | NaN |
2 | Afghanistan | AF | AFG | 1982 | new_sp_m014 | NaN |
3 | Afghanistan | AF | AFG | 1983 | new_sp_m014 | NaN |
4 | Afghanistan | AF | AFG | 1984 | new_sp_m014 | NaN |
Let’s drop NaNs.
who_melt = who_melt.dropna()
who_melt.head()
country | iso2 | iso3 | year | key | cases | |
---|---|---|---|---|---|---|
17 | Afghanistan | AF | AFG | 1997 | new_sp_m014 | 0.0 |
18 | Afghanistan | AF | AFG | 1998 | new_sp_m014 | 30.0 |
19 | Afghanistan | AF | AFG | 1999 | new_sp_m014 | 8.0 |
20 | Afghanistan | AF | AFG | 2000 | new_sp_m014 | 52.0 |
21 | Afghanistan | AF | AFG | 2001 | new_sp_m014 | 129.0 |
who_melt['key'].unique()
array(['new_sp_m014', 'new_sp_m1524', 'new_sp_m2534', 'new_sp_m3544',
'new_sp_m4554', 'new_sp_m5564', 'new_sp_m65', 'new_sp_f014',
'new_sp_f1524', 'new_sp_f2534', 'new_sp_f3544', 'new_sp_f4554',
'new_sp_f5564', 'new_sp_f65', 'new_sn_m014', 'new_sn_m1524',
'new_sn_m2534', 'new_sn_m3544', 'new_sn_m4554', 'new_sn_m5564',
'new_sn_m65', 'new_sn_f014', 'new_sn_f1524', 'new_sn_f2534',
'new_sn_f3544', 'new_sn_f4554', 'new_sn_f5564', 'new_sn_f65',
'new_ep_m014', 'new_ep_m1524', 'new_ep_m2534', 'new_ep_m3544',
'new_ep_m4554', 'new_ep_m5564', 'new_ep_m65', 'new_ep_f014',
'new_ep_f1524', 'new_ep_f2534', 'new_ep_f3544', 'new_ep_f4554',
'new_ep_f5564', 'new_ep_f65', 'newrel_m014', 'newrel_m1524',
'newrel_m2534', 'newrel_m3544', 'newrel_m4554', 'newrel_m5564',
'newrel_m65', 'newrel_f014', 'newrel_f1524', 'newrel_f2534',
'newrel_f3544', 'newrel_f4554', 'newrel_f5564', 'newrel_f65'],
dtype=object)
Looks like most keys have pattern new_type_sexage
, except newrel
. Let’s change that.
who_melt['key'] = who_melt['key'].str.replace('newrel', 'new_rel')
who_melt['key'].unique()
array(['new_sp_m014', 'new_sp_m1524', 'new_sp_m2534', 'new_sp_m3544',
'new_sp_m4554', 'new_sp_m5564', 'new_sp_m65', 'new_sp_f014',
'new_sp_f1524', 'new_sp_f2534', 'new_sp_f3544', 'new_sp_f4554',
'new_sp_f5564', 'new_sp_f65', 'new_sn_m014', 'new_sn_m1524',
'new_sn_m2534', 'new_sn_m3544', 'new_sn_m4554', 'new_sn_m5564',
'new_sn_m65', 'new_sn_f014', 'new_sn_f1524', 'new_sn_f2534',
'new_sn_f3544', 'new_sn_f4554', 'new_sn_f5564', 'new_sn_f65',
'new_ep_m014', 'new_ep_m1524', 'new_ep_m2534', 'new_ep_m3544',
'new_ep_m4554', 'new_ep_m5564', 'new_ep_m65', 'new_ep_f014',
'new_ep_f1524', 'new_ep_f2534', 'new_ep_f3544', 'new_ep_f4554',
'new_ep_f5564', 'new_ep_f65', 'new_rel_m014', 'new_rel_m1524',
'new_rel_m2534', 'new_rel_m3544', 'new_rel_m4554', 'new_rel_m5564',
'new_rel_m65', 'new_rel_f014', 'new_rel_f1524', 'new_rel_f2534',
'new_rel_f3544', 'new_rel_f4554', 'new_rel_f5564', 'new_rel_f65'],
dtype=object)
How to split up these keys?
who_melt['key'].str.split('_')
17 [new, sp, m014]
18 [new, sp, m014]
19 [new, sp, m014]
20 [new, sp, m014]
21 [new, sp, m014]
...
405269 [new, rel, f65]
405303 [new, rel, f65]
405371 [new, rel, f65]
405405 [new, rel, f65]
405439 [new, rel, f65]
Name: key, Length: 75752, dtype: object
who_split = who_melt['key'].str.split('_', expand=True)
who_split
0 | 1 | 2 | |
---|---|---|---|
17 | new | sp | m014 |
18 | new | sp | m014 |
19 | new | sp | m014 |
20 | new | sp | m014 |
21 | new | sp | m014 |
... | ... | ... | ... |
405269 | new | rel | f65 |
405303 | new | rel | f65 |
405371 | new | rel | f65 |
405405 | new | rel | f65 |
405439 | new | rel | f65 |
75752 rows × 3 columns
Change column names:
who_split.columns = ['new', 'type', 'sexage']
Split sexage into sex and age columns:
who_split['sexage'].str.slice(start=0, stop=1)
17 m
18 m
19 m
20 m
21 m
..
405269 f
405303 f
405371 f
405405 f
405439 f
Name: sexage, Length: 75752, dtype: object
who_split['sex'] = who_split['sexage'].str.slice(start=0, stop=1)
who_split['age'] = who_split['sexage'].str.slice(start=1)
who_split.head()
new | type | sexage | sex | age | |
---|---|---|---|---|---|
17 | new | sp | m014 | m | 014 |
18 | new | sp | m014 | m | 014 |
19 | new | sp | m014 | m | 014 |
20 | new | sp | m014 | m | 014 |
21 | new | sp | m014 | m | 014 |
... | ... | ... | ... | ... | ... |
405269 | new | rel | f65 | f | 65 |
405303 | new | rel | f65 | f | 65 |
405371 | new | rel | f65 | f | 65 |
405405 | new | rel | f65 | f | 65 |
405439 | new | rel | f65 | f | 65 |
75752 rows × 5 columns
In this dataset, all cases are new
:
who_split['new'].unique()
array(['new'], dtype=object)
who_split = who_split.drop(columns=['new', 'sexage'])

who_split['age'] = who_split['age'].astype('category')
who_split['age'] = who_split['age'].cat.as_ordered()

who_split['age']
17 014
18 014
19 014
20 014
21 014
...
405269 65
405303 65
405371 65
405405 65
405439 65
Name: age, Length: 75752, dtype: category
Categories (7, object): ['014' < '1524' < '2534' < '3544' < '4554' < '5564' < '65']
who_split.head()
type | sex | age | |
---|---|---|---|
17 | sp | m | 014 |
18 | sp | m | 014 |
19 | sp | m | 014 |
20 | sp | m | 014 |
21 | sp | m | 014 |
who_tidy = who_melt.merge(who_split, left_index=True, right_index=True)
who_tidy.head()
country | iso2 | iso3 | year | key | cases | type | sex | age | |
---|---|---|---|---|---|---|---|---|---|
17 | Afghanistan | AF | AFG | 1997 | new_sp_m014 | 0.0 | sp | m | 014 |
18 | Afghanistan | AF | AFG | 1998 | new_sp_m014 | 30.0 | sp | m | 014 |
19 | Afghanistan | AF | AFG | 1999 | new_sp_m014 | 8.0 | sp | m | 014 |
20 | Afghanistan | AF | AFG | 2000 | new_sp_m014 | 52.0 | sp | m | 014 |
21 | Afghanistan | AF | AFG | 2001 | new_sp_m014 | 129.0 | sp | m | 014 |
who_tidy = who_tidy.drop(columns=['key'])
who_tidy.head()
country | iso2 | iso3 | year | cases | type | sex | age | |
---|---|---|---|---|---|---|---|---|
17 | Afghanistan | AF | AFG | 1997 | 0.0 | sp | m | 014 |
18 | Afghanistan | AF | AFG | 1998 | 30.0 | sp | m | 014 |
19 | Afghanistan | AF | AFG | 1999 | 8.0 | sp | m | 014 |
20 | Afghanistan | AF | AFG | 2000 | 52.0 | sp | m | 014 |
21 | Afghanistan | AF | AFG | 2001 | 129.0 | sp | m | 014 |
sns.relplot(who_tidy[who_tidy.iso2 == 'AF'],
            x='year',
            y='cases',
            hue='type',
            style='sex',
            col='age',
            col_wrap=3)
Internet data types
JSON data
JSON (short for JavaScript Object Notation) is a standard format for sending data by HTTP request between web browsers and other applications.
Here is an example from NYC Open Data
The data is on Citi-Bike Stations in NYC, and the start of it looks like:
{"data": {"stations": [{"is_installed": 0, "is_returning": 0, "num_bikes_disabled": 0, "is_renting": 0, "station_id": "1817822909864556798", "num_docks_disabled": 0, "legacy_id": "1817822909864556798", "last_reported": 1739366429, "num_ebikes_available": 0, "num_bikes_available": 0, "num_docks_available": 0, "eightd_has_available_keys": false}, {"is_installed": 1, "is_returning": 1, "num_bikes_disabled": 4, "is_renting": 1, "station_id": "5a24acbb-94ef-4cb5-9db8-f6073ac26a2b", "num_docks_disabled": 0, "legacy_id": "4302", "num_scooters_available": 0, "last_reported": 1739394755, "num_ebikes_available": 0,
We can see it looks nearly like valid Python code (e.g. a dictionary).
A helpful package for reading JSON data is json
:
import json
bike_data = json.load(open('../data/citibike.json'))
bike_data.keys()
dict_keys(['data', 'last_updated', 'ttl', 'version'])
bike_data['data'].keys()
dict_keys(['stations'])
stations = pd.DataFrame(bike_data['data']['stations'])
stations.head()
is_installed | is_returning | num_bikes_disabled | is_renting | station_id | num_docks_disabled | legacy_id | last_reported | num_ebikes_available | num_bikes_available | num_docks_available | eightd_has_available_keys | num_scooters_available | num_scooters_unavailable | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 1817822909864556798 | 0 | 1817822909864556798 | 1739366429 | 0 | 0 | 0 | False | NaN | NaN |
1 | 1 | 1 | 1 | 1 | bacfe3c2-7946-45ed-91f9-f5b91791de75 | 0 | 3503 | 1739395003 | 1 | 2 | 27 | False | 0.0 | 0.0 |
2 | 1 | 1 | 5 | 1 | 9394fb49-583f-4426-8d2f-a2b0ffb584d2 | 0 | 3737 | 1739395004 | 4 | 34 | 10 | False | 0.0 | 0.0 |
3 | 1 | 1 | 9 | 1 | 1869753654337059182 | 0 | 1869753654337059182 | 1739394999 | 1 | 10 | 2 | False | 0.0 | 0.0 |
4 | 1 | 1 | 6 | 1 | 1f8f6714-42a7-45d5-86cd-a093c940b2c9 | 0 | 4635 | 1739395000 | 1 | 2 | 19 | False | 0.0 | 0.0 |
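Note: for nested JSON records, pandas also provides pd.json_normalize, which flattens them into columns; for this (already fairly flat) list of stations it gives essentially the same result as pd.DataFrame:

stations_norm = pd.json_normalize(bike_data['data']['stations'])
stations_norm.head()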
HTML data
To read in HTML tables, for example, we can use the pd.read_html
command.
Note: we need the following packages:
conda install lxml beautifulsoup4 html5lib
This is because pd.read_html
uses lxml or Beautiful Soup under the hood to parse data from HTML.
The following is from the pandas documentation, and gets a table of failed banks from the FDIC website.
= "https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list"
url = pd.read_html(url) banks
type(banks)
list
Note: pd.read_html
reads HTML tables into a list of DataFrame objects. This is helpful if there is more than one table on the website.
banks[0]
Bank Name | City | State | Cert | Aquiring Institution | Closing Date | Fund Sort ascending | |
---|---|---|---|---|---|---|---|
0 | Pulaski Savings Bank | Chicago | Illinois | 28611 | Millennium Bank | January 17, 2025 | 10548 |
1 | The First National Bank of Lindsay | Lindsay | Oklahoma | 4134 | First Bank & Trust Co., Duncan, OK | October 18, 2024 | 10547 |
2 | Republic First Bank dba Republic Bank | Philadelphia | Pennsylvania | 27332 | Fulton Bank, National Association | April 26, 2024 | 10546 |
3 | Citizens Bank | Sac City | Iowa | 8758 | Iowa Trust & Savings Bank | November 3, 2023 | 10545 |
4 | Heartland Tri-State Bank | Elkhart | Kansas | 25851 | Dream First Bank, N.A. | July 28, 2023 | 10544 |
5 | First Republic Bank | San Francisco | California | 59017 | JPMorgan Chase Bank, N.A. | May 1, 2023 | 10543 |
6 | Signature Bank | New York | New York | 57053 | Flagstar Bank, N.A. | March 12, 2023 | 10540 |
7 | Silicon Valley Bank | Santa Clara | California | 24735 | First Citizens Bank & Trust Company | March 10, 2023 | 10539 |
8 | Almena State Bank | Almena | Kansas | 15426 | Equity Bank | October 23, 2020 | 10538 |
9 | First City Bank of Florida | Fort Walton Beach | Florida | 16748 | United Fidelity Bank, fsb | October 16, 2020 | 10537 |
XML data
XML is another common structured data format supporting hierarchical, nested data with metadata. XML and HTML are similar, but XML is more general.
The following is an example from the xml.etree.ElementTree
documentation.
!cat ../data/country_data.xml
<?xml version="1.0"?>
<data>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/>
</country>
<country name="Panama">
<rank>68</rank>
<year>2011</year>
<gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W"/>
<neighbor name="Colombia" direction="E"/>
</country>
</data>
import xml.etree.ElementTree as ET
tree = ET.parse('../data/country_data.xml')
root = tree.getroot()
XML documents are represented as a tree of elements.
Each element has:
- a tag (name)
- attributes (optional)
- text content (optional)
root.tag
'data'
root.attrib
{}
for child in root:
print(child.tag, child.attrib, child.text)
for child2 in child:
print('--', child2.tag, child2.attrib, child2.text)
print('\n')
country {'name': 'Liechtenstein'}
-- rank {} 1
-- year {} 2008
-- gdppc {} 141100
-- neighbor {'name': 'Austria', 'direction': 'E'} None
-- neighbor {'name': 'Switzerland', 'direction': 'W'} None
country {'name': 'Singapore'}
-- rank {} 4
-- year {} 2011
-- gdppc {} 59900
-- neighbor {'name': 'Malaysia', 'direction': 'N'} None
country {'name': 'Panama'}
-- rank {} 68
-- year {} 2011
-- gdppc {} 13600
-- neighbor {'name': 'Costa Rica', 'direction': 'W'} None
-- neighbor {'name': 'Colombia', 'direction': 'E'} None
We can access elements using indexing:
root[0]
<Element 'country' at 0x3064bb830>
root[0].attrib
{'name': 'Liechtenstein'}
root[0].attrib['name']
'Liechtenstein'
# We can also use .get to extract a specific attribute value
root[0].get('name')
'Liechtenstein'
# We can use .find to get the first matching subelement
root[0].find('rank')
<Element 'rank' at 0x3064b9990>
# .find only works one subelement down - use the path for deeper elements
root.find('country/rank')
<Element 'rank' at 0x3064b9990>
# if you don't know the path, use './/tag_name'
root.find('.//rank')
<Element 'rank' at 0x3064b9990>
# We can use .findall to get a list of all matching subelements
root.findall('.//rank')
[<Element 'rank' at 0x3064b9990>,
<Element 'rank' at 0x3064b84f0>,
<Element 'rank' at 0x306499cb0>]
You can also use pd.read_xml; however, this method is best suited to relatively shallow XML documents.
This function will always return a single DataFrame or raise an exception due to issues with the XML document, xpath, or other parameters.
xml_df = pd.read_xml('../data/country_data.xml')
xml_df
name | rank | year | gdppc | neighbor | |
---|---|---|---|---|---|
0 | Liechtenstein | 1 | 2008 | 141100 | NaN |
1 | Singapore | 4 | 2011 | 59900 | NaN |
2 | Panama | 68 | 2011 | 13600 | NaN |
Note that we lose the neighbor information.
More specifically, this is the kind of format that is easiest to read with pd.read_xml:
<root>
<row>
<column1>data</column1>
<column2>data</column2>
<column3>data</column3>
...
</row>
<row>
...
</row>
...
</root>
Illustration: Exchange Rate Data
Let’s look again at the exchange rate data from Lectures 2 and 3, but this time in XML form.
df = pd.read_xml('../data/eurofxref-hist-90d.xml')
df
subject | name | Cube | |
---|---|---|---|
0 | Reference rates | None | None |
1 | None | European Central Bank | None |
2 | None | None | \n |
The structure is too complicated to be directly read with pd.read_xml
.
tree = ET.parse('../data/eurofxref-hist-90d.xml')
root = tree.getroot()
root.tag
'{http://www.gesmes.org/xml/2002-08-01}Envelope'
for child in root:
    print(child.tag, child.attrib, child.text)
{http://www.gesmes.org/xml/2002-08-01}subject {} Reference rates
{http://www.gesmes.org/xml/2002-08-01}Sender {}
{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube {}
for child in root:
    print(child.tag, child.attrib, child.text)
    for i, child2 in enumerate(child):
        if i < 5:
            print('---', child2.tag, child2.attrib, child2.text)
{http://www.gesmes.org/xml/2002-08-01}subject {} Reference rates
{http://www.gesmes.org/xml/2002-08-01}Sender {}
--- {http://www.gesmes.org/xml/2002-08-01}name {} European Central Bank
{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube {}
--- {http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube {'time': '2024-04-16'}
--- {http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube {'time': '2024-04-15'}
--- {http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube {'time': '2024-04-12'}
--- {http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube {'time': '2024-04-11'}
--- {http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube {'time': '2024-04-10'}
root[2].tag
'{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube'
cube = root[2]
for i, child in enumerate(cube):
    if i < 5:
        print(child.attrib)
        for j, child2 in enumerate(child):
            if j < 5:
                print('----', child2.attrib)
{'time': '2024-04-16'}
---- {'currency': 'USD', 'rate': '1.0637'}
---- {'currency': 'JPY', 'rate': '164.54'}
---- {'currency': 'BGN', 'rate': '1.9558'}
---- {'currency': 'CZK', 'rate': '25.21'}
---- {'currency': 'DKK', 'rate': '7.4609'}
{'time': '2024-04-15'}
---- {'currency': 'USD', 'rate': '1.0656'}
---- {'currency': 'JPY', 'rate': '164.05'}
---- {'currency': 'BGN', 'rate': '1.9558'}
---- {'currency': 'CZK', 'rate': '25.324'}
---- {'currency': 'DKK', 'rate': '7.4606'}
{'time': '2024-04-12'}
---- {'currency': 'USD', 'rate': '1.0652'}
---- {'currency': 'JPY', 'rate': '163.16'}
---- {'currency': 'BGN', 'rate': '1.9558'}
---- {'currency': 'CZK', 'rate': '25.337'}
---- {'currency': 'DKK', 'rate': '7.4603'}
{'time': '2024-04-11'}
---- {'currency': 'USD', 'rate': '1.0729'}
---- {'currency': 'JPY', 'rate': '164.18'}
---- {'currency': 'BGN', 'rate': '1.9558'}
---- {'currency': 'CZK', 'rate': '25.392'}
---- {'currency': 'DKK', 'rate': '7.4604'}
{'time': '2024-04-10'}
---- {'currency': 'USD', 'rate': '1.086'}
---- {'currency': 'JPY', 'rate': '164.89'}
---- {'currency': 'BGN', 'rate': '1.9558'}
---- {'currency': 'CZK', 'rate': '25.368'}
---- {'currency': 'DKK', 'rate': '7.4594'}
cube[0][0].attrib['currency']
'USD'
cube[0][0].attrib['rate']
'1.0637'
xml_dict = {}

for i, child in enumerate(cube):
    temp_dict = {}
    for j, child2 in enumerate(child):
        temp_dict[child2.attrib['currency']] = child2.attrib['rate']
    xml_dict[child.attrib['time']] = temp_dict
df_xml = pd.DataFrame(xml_dict)
df_xml.head()
2024-04-16 | 2024-04-15 | 2024-04-12 | 2024-04-11 | 2024-04-10 | 2024-04-09 | 2024-04-08 | 2024-04-05 | 2024-04-04 | 2024-04-03 | ... | 2024-01-31 | 2024-01-30 | 2024-01-29 | 2024-01-26 | 2024-01-25 | 2024-01-24 | 2024-01-23 | 2024-01-22 | 2024-01-19 | 2024-01-18 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
USD | 1.0637 | 1.0656 | 1.0652 | 1.0729 | 1.086 | 1.0867 | 1.0823 | 1.0841 | 1.0852 | 1.0783 | ... | 1.0837 | 1.0846 | 1.0823 | 1.0871 | 1.0893 | 1.0905 | 1.0872 | 1.089 | 1.0887 | 1.0875 |
JPY | 164.54 | 164.05 | 163.16 | 164.18 | 164.89 | 164.97 | 164.43 | 164.1 | 164.69 | 163.66 | ... | 160.19 | 159.97 | 160.13 | 160.62 | 160.81 | 160.46 | 160.88 | 160.95 | 161.17 | 160.89 |
BGN | 1.9558 | 1.9558 | 1.9558 | 1.9558 | 1.9558 | 1.9558 | 1.9558 | 1.9558 | 1.9558 | 1.9558 | ... | 1.9558 | 1.9558 | 1.9558 | 1.9558 | 1.9558 | 1.9558 | 1.9558 | 1.9558 | 1.9558 | 1.9558 |
CZK | 25.21 | 25.324 | 25.337 | 25.392 | 25.368 | 25.38 | 25.354 | 25.286 | 25.322 | 25.352 | ... | 24.891 | 24.831 | 24.806 | 24.748 | 24.756 | 24.786 | 24.824 | 24.758 | 24.813 | 24.734 |
DKK | 7.4609 | 7.4606 | 7.4603 | 7.4604 | 7.4594 | 7.459 | 7.4588 | 7.459 | 7.4589 | 7.4589 | ... | 7.455 | 7.4543 | 7.4538 | 7.4549 | 7.456 | 7.4568 | 7.4574 | 7.4585 | 7.4575 | 7.4571 |
5 rows × 62 columns
df_xml = df_xml.T
df_xml.head()
USD | JPY | BGN | CZK | DKK | GBP | HUF | PLN | RON | SEK | ... | ILS | INR | KRW | MXN | MYR | NZD | PHP | SGD | THB | ZAR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2024-04-16 | 1.0637 | 164.54 | 1.9558 | 25.21 | 7.4609 | 0.8544 | 394.63 | 4.3435 | 4.9764 | 11.635 | ... | 4.0001 | 88.8975 | 1481.62 | 17.9141 | 5.0973 | 1.8072 | 60.561 | 1.4509 | 38.958 | 20.2104 |
2024-04-15 | 1.0656 | 164.05 | 1.9558 | 25.324 | 7.4606 | 0.85405 | 394.25 | 4.2938 | 4.9742 | 11.5583 | ... | 3.9578 | 88.895 | 1473.79 | 17.6561 | 5.0925 | 1.7943 | 60.558 | 1.4498 | 39.124 | 20.2067 |
2024-04-12 | 1.0652 | 163.16 | 1.9558 | 25.337 | 7.4603 | 0.85424 | 391.63 | 4.2603 | 4.9717 | 11.5699 | ... | 4.0175 | 88.914 | 1469.79 | 17.5982 | 5.081 | 1.7872 | 60.256 | 1.4478 | 38.853 | 19.9907 |
2024-04-11 | 1.0729 | 164.18 | 1.9558 | 25.392 | 7.4604 | 0.85525 | 389.8 | 4.257 | 4.9713 | 11.531 | ... | 4.03 | 89.4385 | 1469.32 | 17.6448 | 5.0936 | 1.792 | 60.577 | 1.4518 | 39.22 | 20.1614 |
2024-04-10 | 1.086 | 164.89 | 1.9558 | 25.368 | 7.4594 | 0.85515 | 389.18 | 4.2563 | 4.969 | 11.4345 | ... | 4.0324 | 90.3585 | 1467.33 | 17.7305 | 5.1558 | 1.7856 | 61.408 | 1.4605 | 39.536 | 20.0851 |
5 rows × 30 columns
df_xml.index.name = 'Time'
df_xml = df_xml.reset_index()
df_xml.head()
Time | USD | JPY | BGN | CZK | DKK | GBP | HUF | PLN | RON | ... | ILS | INR | KRW | MXN | MYR | NZD | PHP | SGD | THB | ZAR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2024-04-16 | 1.0637 | 164.54 | 1.9558 | 25.21 | 7.4609 | 0.8544 | 394.63 | 4.3435 | 4.9764 | ... | 4.0001 | 88.8975 | 1481.62 | 17.9141 | 5.0973 | 1.8072 | 60.561 | 1.4509 | 38.958 | 20.2104 |
1 | 2024-04-15 | 1.0656 | 164.05 | 1.9558 | 25.324 | 7.4606 | 0.85405 | 394.25 | 4.2938 | 4.9742 | ... | 3.9578 | 88.895 | 1473.79 | 17.6561 | 5.0925 | 1.7943 | 60.558 | 1.4498 | 39.124 | 20.2067 |
2 | 2024-04-12 | 1.0652 | 163.16 | 1.9558 | 25.337 | 7.4603 | 0.85424 | 391.63 | 4.2603 | 4.9717 | ... | 4.0175 | 88.914 | 1469.79 | 17.5982 | 5.081 | 1.7872 | 60.256 | 1.4478 | 38.853 | 19.9907 |
3 | 2024-04-11 | 1.0729 | 164.18 | 1.9558 | 25.392 | 7.4604 | 0.85525 | 389.8 | 4.257 | 4.9713 | ... | 4.03 | 89.4385 | 1469.32 | 17.6448 | 5.0936 | 1.792 | 60.577 | 1.4518 | 39.22 | 20.1614 |
4 | 2024-04-10 | 1.086 | 164.89 | 1.9558 | 25.368 | 7.4594 | 0.85515 | 389.18 | 4.2563 | 4.969 | ... | 4.0324 | 90.3585 | 1467.33 | 17.7305 | 5.1558 | 1.7856 | 61.408 | 1.4605 | 39.536 | 20.0851 |
5 rows × 31 columns
df_xml['Time'] = pd.to_datetime(df_xml['Time'])
df_xml = df_xml.melt(id_vars=['Time'],
                     var_name='currency',
                     value_name='rate')
df_xml
Time | currency | rate | |
---|---|---|---|
0 | 2024-04-16 | USD | 1.0637 |
1 | 2024-04-15 | USD | 1.0656 |
2 | 2024-04-12 | USD | 1.0652 |
3 | 2024-04-11 | USD | 1.0729 |
4 | 2024-04-10 | USD | 1.086 |
... | ... | ... | ... |
1855 | 2024-01-24 | ZAR | 20.5366 |
1856 | 2024-01-23 | ZAR | 20.7022 |
1857 | 2024-01-22 | ZAR | 20.8846 |
1858 | 2024-01-19 | ZAR | 20.6892 |
1859 | 2024-01-18 | ZAR | 20.597 |
1860 rows × 3 columns
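One last cleaning step: the rates came out of the XML attributes as strings, so before any analysis we would convert them to floats. A minimal sketch, which then plots one currency over the 90-day window as a quick sanity check:

# convert the rate column from strings (e.g. '1.0637') to floats
df_xml['rate'] = df_xml['rate'].astype(float)

# plot the USD exchange rate over time
usd = df_xml[df_xml['currency'] == 'USD']
g = sns.relplot(usd, x='Time', y='rate', kind='line', height=4, aspect=2)
g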