import pandas as pd
Lecture 12 - Big data and databases
Overview
In this lecture, we discuss computing issues that arise with big data, and potential solutions including:
- pandas chunking
- sqlite
- duckdb
- polars
References
- Python for Data Analysis (Wes McKinney, 2022)
- DuckDB tutorial
- Modern Polars (Kevin Heavey, 2024)
Computing and big data issues
Simply put, a computer is made up of three basic parts:
- the memory units (e.g. random access memory, RAM; solid-state drive, SSD)
- the computing units (e.g. central processing unit, CPU; graphics processing unit, GPU)
- the connections between them
Memory
RAM | Solid-state drive |
---|---|
Temporary memory used for actively running programs/data | Permanent storage for your operating system, applications, and files |
Extremely fast read/write | Slower read/write |
Smaller (typically 8-64 GB) | Larger (TBs) |
If a dataset is loaded into RAM, we typically say it is “in memory”.
If a dataset is on the solid-state drive, we typically say it is “on disk”.
Compute
CPU | GPU |
---|---|
General-purpose processor | Highly parallel processor for repetitive tasks (originally for graphics) |
Few cores (e.g. 4–32) | Thousands of smaller cores |
Strong per-core performance | Weak individual cores, but excellent at parallel tasks |
Ideal for logic-heavy tasks, branching, single-thread performance | Ideal for matrix operations, linear algebra, deep learning |
Issues
Pandas in particular struggles with big data due to:
- memory: pandas loads the entire dataset into RAM. As RAM is typically on the order of tens of gigabytes, this is problematic for datasets of comparable or larger size (a quick way to check a DataFrame's footprint is sketched after this list).
- compute: pandas typically takes advantage of only one CPU core, unlike other libraries which parallelize operations to multiple cores under the hood. For large data, this means pandas operations can be much slower than other libraries.
- eager evaluation: pandas uses eager evaluation, which means it executes computations immediately when invoked. Other libraries use lazy evaluation, where computations are only run when needed. Lazy evaluation is often optimized to avoid unnecessary computations and to avoid storing intermediate results.
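For the memory point, we can check a DataFrame's in-memory footprint before deciding whether chunking is needed. A minimal sketch (the file path reuses the rates data from later in this lecture):

import pandas as pd

rates_df = pd.read_csv('../data/rates.csv')
# total footprint in bytes, including object (string) columns
print(rates_df.memory_usage(deep=True).sum())
# per-column dtypes plus the same deep memory estimate
rates_df.info(memory_usage="deep")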
Pandas and big data
There are a few things we can do in pandas to help with memory issues:
- subsetting columns: load in only a subset of columns at a time
- chunking: load in only a few rows at a time
Subsetting columns
head_df = pd.read_csv('../data/rates.csv', nrows=0)
head_df.columns
Index(['Time', 'USD', 'JPY', 'BGN', 'CZK', 'DKK', 'GBP', 'CHF'], dtype='object')
rates_df = pd.read_csv('../data/rates.csv', usecols=head_df.columns[:3])
rates_df.head()
Time | USD | JPY | |
---|---|---|---|
0 | 2024-04-16 | 1.0637 | 164.54 |
1 | 2024-04-15 | 1.0656 | 164.05 |
2 | 2024-04-12 | 1.0652 | 163.16 |
3 | 2024-04-11 | 1.0729 | 164.18 |
4 | 2024-04-10 | 1.0860 | 164.89 |
Chunking
The function pd.read_csv has an optional argument chunksize. When you specify chunksize, pd.read_csv returns an iterable instead of a DataFrame. To access the chunks of data, iterate over the iterable.
Let’s consider this dataset from Kaggle of over 900K songs.
# get header
= pd.read_csv("../data/spotify_dataset.csv", nrows=0) spotify_header
spotify_header.columns
Index(['Artist(s)', 'song', 'text', 'Length', 'emotion', 'Genre', 'Album',
'Release Date', 'Key', 'Tempo', 'Loudness (db)', 'Time signature',
'Explicit', 'Popularity', 'Energy', 'Danceability', 'Positiveness',
'Speechiness', 'Liveness', 'Acousticness', 'Instrumentalness',
'Good for Party', 'Good for Work/Study',
'Good for Relaxation/Meditation', 'Good for Exercise',
'Good for Running', 'Good for Yoga/Stretching', 'Good for Driving',
'Good for Social Gatherings', 'Good for Morning Routine',
'Similar Artist 1', 'Similar Song 1', 'Similarity Score 1',
'Similar Artist 2', 'Similar Song 2', 'Similarity Score 2',
'Similar Artist 3', 'Similar Song 3', 'Similarity Score 3'],
dtype='object')
= pd.read_csv("../data/spotify_dataset.csv", chunksize=1000) chunk_iter
= next(chunk_iter) chunk
'Genre'].unique() chunk[
array(['hip hop', 'jazz', 'indie rock,britpop', 'classical',
'rock,pop,comedy', 'rock,pop rock', 'rap,hip hop',
'trance,electronic,psychedelic', 'country,classic rock,hard rock',
'hip hop,trap,cloud rap', 'electronic,rap',
'rock,hardcore,garage rock', 'pop',
'chillout,psychedelic,electronic', 'alternative rock,new wave',
'hip-hop,hip hop', 'punk,punk rock', 'k-pop',
'metal,hard rock,heavy metal', 'rock,synthpop,alternative rock',
'country,pop', 'folk', 'alternative rock,indie rock,indie', 'rock'],
dtype=object)
= pd.read_csv("../data/spotify_dataset.csv", chunksize=1000)
chunk_iter = []
result
for chunk in chunk_iter:
'genre_1'] = chunk['Genre'].str.split(',').str[0]
chunk[= chunk.groupby("genre_1")["Danceability"].agg(['sum', 'count'])
group
result.append(group)
final = pd.concat(result).reset_index()
final['genre_1'] = final['genre_1'].str.replace('hip hop', 'hip-hop')
final = final.groupby("genre_1").sum()
final = final.reset_index()
total_songs = final['count'].sum()
final.rename(columns={'genre_1': 'Genre', 'sum': 'Danceability'}, inplace=True)
final = final[final['count'] > 1000]
final['Danceability'] = final['Danceability'] / final['count']
import seaborn as sns
import matplotlib.pyplot as plt
sns.stripplot(final.sort_values(by='Danceability', ascending=False),
              x='Danceability',
              y='Genre')
plt.gca().xaxis.grid(False)
plt.gca().yaxis.grid(True)
plt.gcf().set_size_inches(5, 8)
plt.gca().set(xlim=(0, 76), title='Danceability by Genre', xlabel="", ylabel="");
Other Pandas tips
- Chain commands instead of doing step-by-step
Example:
# DO THIS
group = chunk.groupby("genre_1")["Danceability"].agg(['sum', 'count'])

# NOT THIS
group = chunk.groupby("genre_1")
dance = group["Danceability"].agg(['sum', 'count'])
- Use categoricals instead of strings
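For example, converting a repetitive string column to a categorical stores the values as integer codes plus a small lookup table, which usually shrinks memory use considerably. A minimal sketch, reusing the chunk from above (the exact savings depend on the data):

genres_str = chunk['Genre']                     # plain object (string) column
genres_cat = chunk['Genre'].astype('category')  # same values stored as categorical codes

print(genres_str.memory_usage(deep=True))
print(genres_cat.memory_usage(deep=True))       # typically much smaller when values repeat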
Databases
Much of this section is drawn from Software Carpentry.
Databases can handle large, complex datasets. Databases include powerful tools for search and analysis, without loading the data into memory.
Relational databases
A relational database is a way to store and manipulate information.
The data is stored in objects called tables. Tables have:
- columns (also known as fields)
- rows (also known as records)
The database is “relational” in that the tables are connected in some way. For example, one table may be a list of customers and their characteristics; another table may be customer transactions.
Querying databases
When we use a database, we send commands - usually called queries - to a database manager: a program that manipulates the database for us.
Queries are written in a language called SQL, which stands for “Structured Query Language”. SQL commands fall into the “CRUD” categories:
- Create: e.g. CREATE, INSERT
- Read: e.g. SELECT
- Update: e.g. UPDATE
- Delete: e.g. DELETE, DROP
SQLite
We introduce the basics of SQL using SQLite.
SQLite is a server-less, open-source database management system.
SQLite can handle datasets of hundreds of gigabytes or more, as long as they fit on a single computer's disk.
In Python, sqlite3 is a built-in library.
Connecting to a database
We work with the database survey.db, downloaded from here.
The database contains data from an expedition to the Pole of Inaccessibility in the South Pacific and Antarctica in the late 1920s and early 1930s. The expedition was led by William Dyer, Frank Pabodie, and Valentina Roerich.
- Open a connection to a database using sqlite3.connect()
If the database does not exist, sqlite will create a new database with that name.
import sqlite3
= sqlite3.connect("../data/survey.db") conn
The returned Connection object conn represents the connection to the on-disk database.
To execute SQL statements and fetch results from queries to the database, we need a database cursor.
- Call conn.cursor() to create the Cursor:
cursor = conn.cursor()
Schema
What does the database contain?
Every SQLite database contains a single schema table that stores the schema (or description) for that database.
The schema is a description of all the tables, indexes, triggers and views (collectively “objects”) that are contained within the database.
The sqlite_schema table contains one row for each object in the schema. Its columns include:
- type: one of ‘table’, ‘index’, ‘view’ or ‘trigger’
- name: name of object
- sql: SQL text that describes the object
(for the full column list, see the docs).
Querying the schema
We can use a SQL query to inspect the schema:
SELECT name, sql FROM sqlite_schema
- SELECT defines which columns to return
- FROM defines the source table
"SELECT name, sql FROM sqlite_schema")
cursor.execute(= cursor.fetchall()
tables
for tab in tables:
print(tab)
('Person', 'CREATE TABLE Person (id text, personal text, family text)')
('Site', 'CREATE TABLE Site (name text, lat real, long real)')
('Survey', 'CREATE TABLE Survey (taken integer, person text, quant text, reading real)')
('Visited', 'CREATE TABLE Visited (id integer, site text, dated text)')
Consider the first line. The sql column tells us that the table Person has three columns:
- id with type text
- personal with type text
- family with type text
Selecting
Now we know what is in our database, let’s look at the tables.
We can select all columns in a table using *:
"SELECT * FROM Person")
cursor.execute(= cursor.fetchall()
rows = [description[0] for description in cursor.description]
cols
=cols) pd.DataFrame(rows, columns
id | personal | family | |
---|---|---|---|
0 | dyer | William | Dyer |
1 | pb | Frank | Pabodie |
2 | lake | Anderson | Lake |
3 | roe | Valentina | Roerich |
4 | danforth | Frank | Danforth |
"SELECT * FROM Site")
cursor.execute(= cursor.fetchall()
rows = [description[0] for description in cursor.description]
cols
=cols) pd.DataFrame(rows, columns
name | lat | long | |
---|---|---|---|
0 | DR-1 | -49.85 | -128.57 |
1 | DR-3 | -47.15 | -126.72 |
2 | MSK-4 | -48.87 | -123.40 |
We can also count how many rows there are in a table.
"SELECT COUNT(*) FROM Visited")
cursor.execute(= cursor.fetchall()
rows = [description[0] for description in cursor.description]
cols
=cols) pd.DataFrame(rows, columns
COUNT(*) | |
---|---|
0 | 8 |
"SELECT * FROM Visited")
cursor.execute(= cursor.fetchall()
rows = [description[0] for description in cursor.description]
cols
=cols).head() pd.DataFrame(rows, columns
id | site | dated | |
---|---|---|---|
0 | 619 | DR-1 | 1927-02-08 |
1 | 622 | DR-1 | 1927-02-10 |
2 | 734 | DR-3 | 1930-01-07 |
3 | 735 | DR-3 | 1930-01-12 |
4 | 751 | DR-3 | 1930-02-26 |
"SELECT COUNT(*) FROM Survey")
cursor.execute(= cursor.fetchall()
rows = [description[0] for description in cursor.description]
cols
=cols) pd.DataFrame(rows, columns
COUNT(*) | |
---|---|
0 | 21 |
"SELECT * FROM Survey")
cursor.execute(= cursor.fetchall()
rows = [description[0] for description in cursor.description]
cols
=cols).head() pd.DataFrame(rows, columns
taken | person | quant | reading | |
---|---|---|---|---|
0 | 619 | dyer | rad | 9.82 |
1 | 619 | dyer | sal | 0.13 |
2 | 622 | dyer | rad | 7.80 |
3 | 622 | dyer | sal | 0.09 |
4 | 734 | pb | rad | 8.41 |
We can also select and order at the same time.
"SELECT * FROM Person ORDER BY id")
cursor.execute(= cursor.fetchall()
rows = [description[0] for description in cursor.description]
cols
=cols) pd.DataFrame(rows, columns
id | personal | family | |
---|---|---|---|
0 | danforth | Frank | Danforth |
1 | dyer | William | Dyer |
2 | lake | Anderson | Lake |
3 | pb | Frank | Pabodie |
4 | roe | Valentina | Roerich |
To sort in descending order, use
SELECT * FROM Person ORDER BY id DESC
Now, suppose we want to know the different types of quantities (quant) collected. We can use the keyword DISTINCT.
"SELECT DISTINCT quant FROM Survey")
cursor.execute(= cursor.fetchall()
rows = [description[0] for description in cursor.description]
cols
=cols).head() pd.DataFrame(rows, columns
quant | |
---|---|
0 | rad |
1 | sal |
2 | temp |
If we want to determine which visits (stored in the taken column) have which quant measurements, we can use the DISTINCT keyword on multiple columns.
"SELECT DISTINCT taken, quant FROM Survey")
cursor.execute(= cursor.fetchall()
rows = [description[0] for description in cursor.description] cols
=cols) pd.DataFrame(rows, columns
taken | quant | |
---|---|---|
0 | 619 | rad |
1 | 619 | sal |
2 | 622 | rad |
3 | 622 | sal |
4 | 734 | rad |
5 | 734 | sal |
6 | 734 | temp |
7 | 735 | rad |
8 | 735 | sal |
9 | 735 | temp |
10 | 751 | rad |
11 | 751 | temp |
12 | 751 | sal |
13 | 752 | rad |
14 | 752 | sal |
15 | 752 | temp |
16 | 837 | rad |
17 | 837 | sal |
18 | 844 | rad |
In order to look at which scientist measured quantities during each visit, we can look again at the Survey table.
This query sorts results first by taken, and then by person within each group of equal taken values:
"""
cursor.execute( SELECT taken, person, quant
FROM Survey
ORDER BY taken , person
""")
= cursor.fetchall()
rows = [description[0] for description in cursor.description] cols
=cols) pd.DataFrame(rows, columns
taken | person | quant | |
---|---|---|---|
0 | 619 | dyer | rad |
1 | 619 | dyer | sal |
2 | 622 | dyer | rad |
3 | 622 | dyer | sal |
4 | 734 | lake | sal |
5 | 734 | pb | rad |
6 | 734 | pb | temp |
7 | 735 | None | sal |
8 | 735 | None | temp |
9 | 735 | pb | rad |
10 | 751 | lake | sal |
11 | 751 | pb | rad |
12 | 751 | pb | temp |
13 | 752 | lake | rad |
14 | 752 | lake | sal |
15 | 752 | lake | temp |
16 | 752 | roe | sal |
17 | 837 | lake | rad |
18 | 837 | lake | sal |
19 | 837 | roe | sal |
20 | 844 | roe | rad |
Filtering
We can filter results using the keyword WHERE:
"SELECT * FROM Visited WHERE site = 'DR-1'")
cursor.execute(= cursor.fetchall()
rows = [description[0] for description in cursor.description]
cols
=cols) pd.DataFrame(rows, columns
id | site | dated | |
---|---|---|---|
0 | 619 | DR-1 | 1927-02-08 |
1 | 622 | DR-1 | 1927-02-10 |
2 | 844 | DR-1 | 1932-03-22 |
We can combine logical statements with AND:
"""
cursor.execute( SELECT *
FROM Visited
WHERE site = 'DR-1' AND dated < '1930-01-01'
""")
= cursor.fetchall()
rows = [description[0] for description in cursor.description]
cols
=cols) pd.DataFrame(rows, columns
id | site | dated | |
---|---|---|---|
0 | 619 | DR-1 | 1927-02-08 |
1 | 622 | DR-1 | 1927-02-10 |
Most database managers have a special data type for dates. SQLite doesn’t: instead, it stores dates as either:
- text (in the ISO-8601 standard format “YYYY-MM-DD HH:MM:SS.SSSS”);
- real numbers (Julian days, the number of days since November 24, 4714 BCE);
- integers (Unix time, the number of seconds since midnight, January 1, 1970).
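As a small illustration (not part of the original material), SQLite's built-in date functions can work with these representations; for example, julianday converts ISO-8601 text so we can compute the number of days between two visits:

# days between the first two DR-1 visits shown above
cursor.execute("SELECT julianday('1927-02-10') - julianday('1927-02-08')")
cursor.fetchall()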
Logicals can also be combined with OR:
"""
cursor.execute( SELECT * FROM Survey
WHERE person = 'lake' OR person = 'roe'
""")
= cursor.fetchall()
rows = [description[0] for description in cursor.description]
cols
=cols) pd.DataFrame(rows, columns
taken | person | quant | reading | |
---|---|---|---|---|
0 | 734 | lake | sal | 0.05 |
1 | 751 | lake | sal | 0.10 |
2 | 752 | lake | rad | 2.19 |
3 | 752 | lake | sal | 0.09 |
4 | 752 | lake | temp | -16.00 |
5 | 752 | roe | sal | 41.60 |
6 | 837 | lake | rad | 1.46 |
7 | 837 | lake | sal | 0.21 |
8 | 837 | roe | sal | 22.50 |
9 | 844 | roe | rad | 11.25 |
Alternatively, we can use IN.
"""
cursor.execute( SELECT * FROM Survey WHERE person IN ('lake', 'roe')
""")
= cursor.fetchall()
rows = [description[0] for description in cursor.description]
cols
=cols) pd.DataFrame(rows, columns
taken | person | quant | reading | |
---|---|---|---|---|
0 | 734 | lake | sal | 0.05 |
1 | 751 | lake | sal | 0.10 |
2 | 752 | lake | rad | 2.19 |
3 | 752 | lake | sal | 0.09 |
4 | 752 | lake | temp | -16.00 |
5 | 752 | roe | sal | 41.60 |
6 | 837 | lake | rad | 1.46 |
7 | 837 | lake | sal | 0.21 |
8 | 837 | roe | sal | 22.50 |
9 | 844 | roe | rad | 11.25 |
"""
cursor.execute( SELECT * FROM Survey
WHERE quant = 'sal' AND (person = 'lake' OR person = 'roe')
""")
= cursor.fetchall()
rows = [description[0] for description in cursor.description]
cols
=cols) pd.DataFrame(rows, columns
taken | person | quant | reading | |
---|---|---|---|---|
0 | 734 | lake | sal | 0.05 |
1 | 751 | lake | sal | 0.10 |
2 | 752 | lake | sal | 0.09 |
3 | 752 | roe | sal | 41.60 |
4 | 837 | lake | sal | 0.21 |
5 | 837 | roe | sal | 22.50 |
"""
cursor.execute( SELECT * FROM Visited WHERE site LIKE 'DR%'
""")
= cursor.fetchall()
rows = [description[0] for description in cursor.description]
cols
=cols) pd.DataFrame(rows, columns
id | site | dated | |
---|---|---|---|
0 | 619 | DR-1 | 1927-02-08 |
1 | 622 | DR-1 | 1927-02-10 |
2 | 734 | DR-3 | 1930-01-07 |
3 | 735 | DR-3 | 1930-01-12 |
4 | 751 | DR-3 | 1930-02-26 |
5 | 752 | DR-3 | None |
6 | 844 | DR-1 | 1932-03-22 |
"""
cursor.execute( SELECT DISTINCT person, quant FROM Survey
WHERE person = 'lake' OR person = 'roe'
""")
= cursor.fetchall()
rows = [description[0] for description in cursor.description]
cols
=cols) pd.DataFrame(rows, columns
person | quant | |
---|---|---|
0 | lake | sal |
1 | lake | rad |
2 | lake | temp |
3 | roe | sal |
4 | roe | rad |
Null values
Databases represent missing values using a special value called null.
To check whether a value is null or not, we use a special test IS NULL:
"""
cursor.execute( SELECT * FROM Visited WHERE dated IS NULL;
""")
= cursor.fetchall()
rows = [description[0] for description in cursor.description]
cols
=cols) pd.DataFrame(rows, columns
id | site | dated | |
---|---|---|---|
0 | 752 | DR-3 | None |
Calculating new values
Suppose we need to correct values upward by 5%. We can do this as part of our SELECT query:
"""
cursor.execute( SELECT 1.05 * reading FROM Survey WHERE quant = 'rad'
""")
= cursor.fetchall()
rows = [description[0] for description in cursor.description]
cols
=cols) pd.DataFrame(rows, columns
1.05 * reading | |
---|---|
0 | 10.3110 |
1 | 8.1900 |
2 | 8.8305 |
3 | 7.5810 |
4 | 4.5675 |
5 | 2.2995 |
6 | 1.5330 |
7 | 11.8125 |
We convert from Fahrenheit to Celsius:
"""
cursor.execute( SELECT taken, round(5 * (reading - 32) / 9, 2) as Celcius
FROM Survey
WHERE quant = 'temp'
""")
= cursor.fetchall()
rows = [description[0] for description in cursor.description]
cols
=cols) pd.DataFrame(rows, columns
taken | Celsius | |
---|---|---|
0 | 734 | -29.72 |
1 | 735 | -32.22 |
2 | 751 | -28.06 |
3 | 752 | -26.67 |
Aggregation functions
We have aggregation functions including:
- MIN() returns the smallest value within the selected column
- MAX() returns the largest value within the selected column
- SUM() returns the total sum of a numerical column
- AVG() returns the average value of a numerical column
"""
cursor.execute( SELECT MIN(dated) FROM Visited
""")
= cursor.fetchall()
rows = [description[0] for description in cursor.description]
cols
=cols) pd.DataFrame(rows, columns
MIN(dated) | |
---|---|
0 | 1927-02-08 |
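The other aggregation functions follow the same pattern; as a sketch (not part of the original notes), the average radiation reading:

cursor.execute("SELECT AVG(reading) FROM Survey WHERE quant = 'rad'")
cursor.fetchall()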
Joining
Joining tables is very similar to pandas merging. We use the keyword JOIN. Similar to pandas, the default is an inner join.
"""
cursor.execute( SELECT * FROM Site
JOIN Visited ON Site.name = Visited.site
""")
= cursor.fetchall()
rows = [description[0] for description in cursor.description]
cols
=cols) pd.DataFrame(rows, columns
name | lat | long | id | site | dated | |
---|---|---|---|---|---|---|
0 | DR-1 | -49.85 | -128.57 | 619 | DR-1 | 1927-02-08 |
1 | DR-1 | -49.85 | -128.57 | 622 | DR-1 | 1927-02-10 |
2 | DR-1 | -49.85 | -128.57 | 844 | DR-1 | 1932-03-22 |
3 | DR-3 | -47.15 | -126.72 | 734 | DR-3 | 1930-01-07 |
4 | DR-3 | -47.15 | -126.72 | 735 | DR-3 | 1930-01-12 |
5 | DR-3 | -47.15 | -126.72 | 751 | DR-3 | 1930-02-26 |
6 | DR-3 | -47.15 | -126.72 | 752 | DR-3 | None |
7 | MSK-4 | -48.87 | -123.40 | 837 | MSK-4 | 1932-01-14 |
Primary keys and foreign keys
A primary key is a field (or combination of fields) whose values uniquely identify the records in a table.
A foreign key is a field (or combination of fields) in one table whose values are a primary key in another table.
In survey.db, Person.id is the primary key in the Person table, while Survey.person is a foreign key relating the Survey table's entries to entries in Person.
Most database designers believe that every table should have a well-defined primary key. One easy way to do this is to create an arbitrary, unique ID for each record as we add it to the database.
SQLite automatically numbers records as they are added to tables, and we can use those record numbers in queries:
"""
cursor.execute( SELECT rowid, * FROM Person
""")
= cursor.fetchall()
rows = [description[0] for description in cursor.description]
cols
=cols) pd.DataFrame(rows, columns
rowid | id | personal | family | |
---|---|---|---|---|
0 | 1 | dyer | William | Dyer |
1 | 2 | pb | Frank | Pabodie |
2 | 3 | lake | Anderson | Lake |
3 | 4 | roe | Valentina | Roerich |
4 | 5 | danforth | Frank | Danforth |
Closing connection
Once we are finished with our database, we need to close the connection:
conn.close()
Context manager
We can also use the connection as a context manager. Note that sqlite3's context manager commits (or rolls back) the open transaction on exit; it does not close the connection, so an explicit conn.close() is still needed when we are done:
with sqlite3.connect("../data/survey.db") as conn:
= conn.cursor()
cursor "SELECT * FROM Person ORDER BY id")
cursor.execute(= cursor.fetchall()
rows = [description[0] for description in cursor.description]
cols
=cols) pd.DataFrame(rows, columns
id | personal | family | |
---|---|---|---|
0 | danforth | Frank | Danforth |
1 | dyer | William | Dyer |
2 | lake | Anderson | Lake |
3 | pb | Frank | Pabodie |
4 | roe | Valentina | Roerich |
Creating databases
We can also create databases with sqlite3.
Suppose we want to create a new database named tutorial.db. We can run:
= sqlite3.connect("../data/tutorial.db") conn
This will create a new tutorial.db file in ../data.
cur = conn.cursor()
To create a new table in our database, we use the command CREATE TABLE.
"CREATE TABLE movie(title, year, score)") cur.execute(
<sqlite3.Cursor at 0x3261d9bc0>
We can check the schema to see the table has been created:
"SELECT name, sql FROM sqlite_schema")
cur.execute(= cur.fetchall()
tables
for tab in tables:
print(tab)
('movie', 'CREATE TABLE movie(title, year, score)')
We can use an INSERT statement to add the values:
"""
cur.execute( INSERT INTO movie VALUES
('Monty Python and the Holy Grail', 1975, 8.2),
('And Now for Something Completely Different', 1970, 7.5)
""")
<sqlite3.Cursor at 0x3261d9bc0>
"SELECT * FROM movie")
cur.execute(= cur.fetchall()
tables
for tab in tables:
print(tab)
('Monty Python and the Holy Grail', 1975, 8.2)
('And Now for Something Completely Different', 1970, 7.5)
We can update values:
"""
cur.execute( UPDATE movie SET year=1971
WHERE title = 'And Now for Something Completely Different';
""")
<sqlite3.Cursor at 0x3261d9bc0>
"SELECT * FROM movie")
cur.execute(= cur.fetchall()
tables
for tab in tables:
print(tab)
('Monty Python and the Holy Grail', 1975, 8.2)
('And Now for Something Completely Different', 1971, 7.5)
We can also DELETE values:
"""
cur.execute( DELETE FROM movie WHERE year=1971
""")
<sqlite3.Cursor at 0x3261d9bc0>
"SELECT * FROM movie")
cur.execute(= cur.fetchall()
tables
for tab in tables:
print(tab)
('Monty Python and the Holy Grail', 1975, 8.2)
Finally, we can remove tables using DROP TABLE:
"""
cur.execute( DROP TABLE movie
""")
<sqlite3.Cursor at 0x3261d9bc0>
"SELECT name, sql FROM sqlite_schema")
cur.execute(= cur.fetchall()
tables
for tab in tables:
print(tab)
After making changes to a database, we need to commit the changes:
conn.commit()
conn.close()
DuckDB
SQLite is a general-purpose database engine. However, it is primarily designed for fast online transaction processing, e.g. fast scanning and lookup of individual rows.
DuckDB is the preferred choice for analytical query workloads, e.g. queries that apply aggregate calculations across large numbers of rows.
Install duckdb in your conda environment:
Terminal
pip install duckdb
The following is based on the DuckDB tutorial.
import duckdb
# conn = duckdb.connect()
By default, connect with no parameters creates an in-memory database globally stored inside the Python module. If you close the program, you lose the database and the data.
If you want the data to persist, pass a filename to connect so it creates a file corresponding to the database.
conn = duckdb.connect("../data/railway.db")
"""
conn.sql( CREATE TABLE services AS
FROM 'https://blobs.duckdb.org/nl-railway/services-2024.csv.gz'
""")
"""
conn.sql( CREATE TABLE stations AS
FROM 'https://blobs.duckdb.org/nl-railway/stations-2023-09.csv'
""")
We note the following about the above commands:
- there is no need to define a schema like we did for SQLite. DuckDB automatically detects that services-2024.csv.gz refers to a gzip-compressed CSV file, so it calls the read_csv function, which decompresses the file and infers its schema from its content using the CSV sniffer (a sketch of the explicit call follows this list).
- the query makes use of DuckDB's FROM-first syntax, which allows users to omit the SELECT * clause. Hence, the SQL statement FROM 'services-2024.csv.gz'; is shorthand for SELECT * FROM 'services-2024.csv.gz';
- the query creates a table called services and populates it with the result from the CSV reader. This is achieved using a CREATE TABLE ... AS statement.
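As a sketch of the explicit form (an illustration, not from the tutorial), we can call DuckDB's read_csv function ourselves and just preview a few rows:

conn.sql("""
    SELECT * FROM read_csv('https://blobs.duckdb.org/nl-railway/stations-2023-09.csv')
    LIMIT 5
""")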
conn.sql('SELECT * FROM services LIMIT 10')
┌────────────────┬──────────────┬──────────────┬─────────────────┬──────────────────────┬──────────────────────────────┬──────────────────────────┬───────────────────────┬─────────────┬───────────────────┬─────────────────────────┬─────────────────────┬────────────────────┬────────────────────────┬─────────────────────┬──────────────────────┬──────────────────────────┐
│ Service:RDT-ID │ Service:Date │ Service:Type │ Service:Company │ Service:Train number │ Service:Completely cancelled │ Service:Partly cancelled │ Service:Maximum delay │ Stop:RDT-ID │ Stop:Station code │ Stop:Station name │ Stop:Arrival time │ Stop:Arrival delay │ Stop:Arrival cancelled │ Stop:Departure time │ Stop:Departure delay │ Stop:Departure cancelled │
│ int64 │ date │ varchar │ varchar │ int64 │ boolean │ boolean │ int64 │ int64 │ varchar │ varchar │ timestamp │ int64 │ boolean │ timestamp │ int64 │ boolean │
├────────────────┼──────────────┼──────────────┼─────────────────┼──────────────────────┼──────────────────────────────┼──────────────────────────┼───────────────────────┼─────────────┼───────────────────┼─────────────────────────┼─────────────────────┼────────────────────┼────────────────────────┼─────────────────────┼──────────────────────┼──────────────────────────┤
│ 12690865 │ 2024-01-01 │ Intercity │ NS │ 1410 │ false │ false │ 2 │ 114307592 │ RTD │ Rotterdam Centraal │ NULL │ NULL │ NULL │ 2024-01-01 01:00:00 │ 0 │ false │
│ 12690865 │ 2024-01-01 │ Intercity │ NS │ 1410 │ false │ false │ 0 │ 114307593 │ DT │ Delft │ 2024-01-01 01:13:00 │ 0 │ false │ 2024-01-01 01:13:00 │ 0 │ false │
│ 12690865 │ 2024-01-01 │ Intercity │ NS │ 1410 │ false │ false │ 0 │ 114307594 │ GV │ Den Haag HS │ 2024-01-01 01:21:00 │ 0 │ false │ 2024-01-01 01:22:00 │ 0 │ false │
│ 12690865 │ 2024-01-01 │ Intercity │ NS │ 1410 │ false │ false │ 0 │ 114307595 │ LEDN │ Leiden Centraal │ 2024-01-01 01:35:00 │ 0 │ false │ 2024-01-01 01:40:00 │ 0 │ false │
│ 12690865 │ 2024-01-01 │ Intercity │ NS │ 1410 │ false │ false │ 0 │ 114307596 │ SHL │ Schiphol Airport │ 2024-01-01 02:00:00 │ 0 │ false │ 2024-01-01 02:03:00 │ 0 │ false │
│ 12690865 │ 2024-01-01 │ Intercity │ NS │ 1410 │ false │ false │ 0 │ 114307597 │ ASS │ Amsterdam Sloterdijk │ 2024-01-01 02:12:00 │ 0 │ false │ 2024-01-01 02:12:00 │ 0 │ false │
│ 12690865 │ 2024-01-01 │ Intercity │ NS │ 1410 │ false │ false │ 0 │ 114307598 │ ASD │ Amsterdam Centraal │ 2024-01-01 02:18:00 │ 1 │ false │ 2024-01-01 02:20:00 │ 2 │ false │
│ 12690865 │ 2024-01-01 │ Intercity │ NS │ 1410 │ false │ false │ 0 │ 114307599 │ ASB │ Amsterdam Bijlmer ArenA │ 2024-01-01 02:31:00 │ 2 │ false │ 2024-01-01 02:31:00 │ 2 │ false │
│ 12690865 │ 2024-01-01 │ Intercity │ NS │ 1410 │ false │ false │ 0 │ 114307600 │ UT │ Utrecht Centraal │ 2024-01-01 02:50:00 │ 0 │ false │ NULL │ NULL │ NULL │
│ 12690866 │ 2024-01-01 │ Nightjet │ NS Int │ 420 │ false │ false │ 6 │ 114307601 │ NURNB │ Nürnberg Hbf │ NULL │ NULL │ NULL │ 2024-01-01 01:01:00 │ 0 │ false │
├────────────────┴──────────────┴──────────────┴─────────────────┴──────────────────────┴──────────────────────────────┴──────────────────────────┴───────────────────────┴─────────────┴───────────────────┴─────────────────────────┴─────────────────────┴────────────────────┴────────────────────────┴─────────────────────┴──────────────────────┴──────────────────────────┤
│ 10 rows 17 columns │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
- We can extract a table using conn.table("table-name")
- DuckDB has a describe() method which displays summary statistics:
conn.table('stations').describe()
┌─────────┬────────────────────┬─────────┬────────────────────┬────────────┬──────────────────┬──────────────────┬───────────────────┬─────────┬────────────────────┬────────────────────┬────────────────────┐
│ aggr │ id │ code │ uic │ name_short │ name_medium │ name_long │ slug │ country │ type │ geo_lat │ geo_lng │
│ varchar │ double │ varchar │ double │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ double │ double │
├─────────┼────────────────────┼─────────┼────────────────────┼────────────┼──────────────────┼──────────────────┼───────────────────┼─────────┼────────────────────┼────────────────────┼────────────────────┤
│ count │ 591.0 │ 591 │ 591.0 │ 591 │ 591 │ 591 │ 591 │ 591 │ 591 │ 591.0 │ 591.0 │
│ mean │ 399.32148900169204 │ NULL │ 8364171.463620981 │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ 51.582670490663915 │ 6.140781107285482 │
│ stddev │ 240.29090038664916 │ NULL │ 230145.59489590116 │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ 1.5830165154574618 │ 1.9482965139140689 │
│ min │ 5.0 │ AC │ 7015400.0 │ 't Harde │ 's-Hertogenb. O. │ 's-Hertogenbosch │ aachen-hbf │ A │ facultatiefStation │ 43.30381 │ -0.126136 │
│ max │ 850.0 │ ZZS │ 8894821.0 │ de Riet │ de Riet │ Zürich HB │ zwolle-stadshagen │ S │ stoptreinstation │ 57.708889 │ 16.375833 │
│ median │ 389.0 │ NULL │ 8400313.0 │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ 51.965278625488 │ 5.7922220230103 │
└─────────┴────────────────────┴─────────┴────────────────────┴────────────┴──────────────────┴──────────────────┴───────────────────┴─────────┴────────────────────┴────────────────────┴────────────────────┘
conn.table('services').describe()
┌─────────┬────────────────────┬──────────────┬───────────────────┬─────────────────┬──────────────────────┬──────────────────────────────┬──────────────────────────┬───────────────────────┬────────────────────┬───────────────────┬───────────────────┬─────────────────────┬────────────────────┬────────────────────────┬─────────────────────┬──────────────────────┬──────────────────────────┐
│ aggr │ Service:RDT-ID │ Service:Date │ Service:Type │ Service:Company │ Service:Train number │ Service:Completely cancelled │ Service:Partly cancelled │ Service:Maximum delay │ Stop:RDT-ID │ Stop:Station code │ Stop:Station name │ Stop:Arrival time │ Stop:Arrival delay │ Stop:Arrival cancelled │ Stop:Departure time │ Stop:Departure delay │ Stop:Departure cancelled │
│ varchar │ double │ varchar │ varchar │ varchar │ double │ varchar │ varchar │ double │ double │ varchar │ varchar │ varchar │ double │ varchar │ varchar │ double │ varchar │
├─────────┼────────────────────┼──────────────┼───────────────────┼─────────────────┼──────────────────────┼──────────────────────────────┼──────────────────────────┼───────────────────────┼────────────────────┼───────────────────┼───────────────────┼─────────────────────┼────────────────────┼────────────────────────┼─────────────────────┼──────────────────────┼──────────────────────────┤
│ count │ 21857914.0 │ 21857914 │ 21857914 │ 21857914 │ 21857914.0 │ 21857914 │ 21857914 │ 21857914.0 │ 21857914.0 │ 21857914 │ 21857914 │ 19448924 │ 19448924.0 │ 19448924 │ 19446167 │ 19446167.0 │ 19446167 │
│ mean │ 13888657.881730618 │ NULL │ NULL │ NULL │ 59390.83299838219 │ NULL │ NULL │ 0.23097483135856423 │ 125241230.89369969 │ NULL │ NULL │ NULL │ 1.0032194583103928 │ NULL │ NULL │ 0.9348018558104535 │ NULL │
│ stddev │ 699038.8168290926 │ NULL │ NULL │ NULL │ 191873.59274439292 │ NULL │ NULL │ 1.7712675501431556 │ 6309866.281116365 │ NULL │ NULL │ NULL │ 3.604251176865116 │ NULL │ NULL │ 3.407607275674939 │ NULL │
│ min │ 12690865.0 │ 2024-01-01 │ Belbus │ Arriva │ 1.0 │ false │ false │ 0.0 │ 114307592.0 │ AC │ 's-Hertogenbosch │ 2024-01-01 01:13:00 │ 0.0 │ false │ 2024-01-01 01:00:00 │ 0.0 │ false │
│ max │ 15102578.0 │ 2024-12-31 │ Taxibus ipv trein │ ZLSM │ 987590.0 │ true │ true │ 1446.0 │ 136324429.0 │ ZZS │ Zürich HB │ 2025-01-01 08:14:00 │ 1469.0 │ true │ 2025-01-01 07:55:00 │ 1470.0 │ true │
│ median │ 13882717.0 │ NULL │ NULL │ NULL │ 6914.0 │ NULL │ NULL │ 0.0 │ 125241233.5 │ NULL │ NULL │ NULL │ 0.0 │ NULL │ NULL │ 0.0 │ NULL │
└─────────┴────────────────────┴──────────────┴───────────────────┴─────────────────┴──────────────────────┴──────────────────────────────┴──────────────────────────┴───────────────────────┴────────────────────┴───────────────────┴───────────────────┴─────────────────────┴────────────────────┴────────────────────────┴─────────────────────┴──────────────────────┴──────────────────────────┘
"""
conn.sql( SELECT format('{:,}', count(*)) AS num_services
FROM services;
""")
┌──────────────┐
│ num_services │
│ varchar │
├──────────────┤
│ 21,857,914 │
└──────────────┘
(Above we use DuckDB formatting).
What were the busiest railway stations in the Netherlands in the first 6 months of 2024?
"""
conn.sql( SELECT
month("Service:Date") AS month,
"Stop:Station name" AS station,
count(*) AS num_services
FROM services
GROUP BY month, station
LIMIT 5;
""")
┌───────┬────────────────┬──────────────┐
│ month │ station │ num_services │
│ int64 │ varchar │ int64 │
├───────┼────────────────┼──────────────┤
│ 1 │ Steenwijk │ 4182 │
│ 1 │ Zwolle │ 21860 │
│ 1 │ Ede-Wageningen │ 8646 │
│ 1 │ Eijsden │ 1150 │
│ 1 │ Stavoren │ 1314 │
└───────┴────────────────┴──────────────┘
Let's create a table named services_per_month based on this query. Note we use the DuckDB shortcut GROUP BY ALL (see here).
"""
conn.sql( CREATE TABLE services_per_month AS
SELECT
month("Service:Date") AS month,
"Stop:Station name" AS station,
count(*) AS num_services
FROM services
GROUP BY ALL;
""")
Let’s look at just the first 6 months of 2024:
"""
conn.sql( SELECT
month, arg_max(station, num_services) AS station,
max(num_services) AS num_services
FROM services_per_month
WHERE month <= 6
GROUP BY ALL;
""")
┌───────┬──────────────────┬──────────────┐
│ month │ station │ num_services │
│ int64 │ varchar │ int64 │
├───────┼──────────────────┼──────────────┤
│ 1 │ Utrecht Centraal │ 40573 │
│ 2 │ Utrecht Centraal │ 39201 │
│ 3 │ Utrecht Centraal │ 40444 │
│ 4 │ Utrecht Centraal │ 38195 │
│ 5 │ Utrecht Centraal │ 40425 │
│ 6 │ Utrecht Centraal │ 38812 │
└───────┴──────────────────┴──────────────┘
Maybe surprisingly, in most months, the busiest railway station is not in Amsterdam but in the country’s 4th largest city, Utrecht, thanks to its central geographic location.
Let’s try another more complex query.
From the table services_per_month, let's select:
- month (integer between 1-12)
- month name (using strftime(make_date(2023, month, 1), '%B'))
- station
- number of services (num_services)
- rank of station by number of services per month
How do we rank the stations by number of services per month?
- we use rank() OVER to specify rank() as a window function. This computes per-group values without collapsing the rows (unlike GROUP BY)
- then PARTITION BY is used inside OVER to define the groups (by month, in this case)
- we also use ORDER BY ... DESC to return the stations in order
"""
conn.sql( SELECT
month,
strftime(make_date(2023, month, 1), '%B') AS month_name,
station,
num_services,
rank() OVER
(PARTITION BY month ORDER BY num_services DESC) AS rank,
FROM services_per_month
""")
┌───────┬────────────┬───────────────────────┬──────────────┬───────┐
│ month │ month_name │ station │ num_services │ rank │
│ int64 │ varchar │ varchar │ int64 │ int64 │
├───────┼────────────┼───────────────────────┼──────────────┼───────┤
│ 4 │ April │ Utrecht Centraal │ 38195 │ 1 │
│ 4 │ April │ Amsterdam Centraal │ 33030 │ 2 │
│ 4 │ April │ Schiphol Airport │ 24121 │ 3 │
│ 4 │ April │ Rotterdam Centraal │ 22999 │ 4 │
│ 4 │ April │ Amsterdam Sloterdijk │ 22690 │ 5 │
│ 4 │ April │ Zwolle │ 20892 │ 6 │
│ 4 │ April │ Den Haag Centraal │ 20305 │ 7 │
│ 4 │ April │ Arnhem Centraal │ 19665 │ 8 │
│ 4 │ April │ Leiden Centraal │ 19091 │ 9 │
│ 4 │ April │ Groningen │ 17456 │ 10 │
│ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │
│ 9 │ September │ Bad Oeynhausen │ 5 │ 524 │
│ 9 │ September │ Minden (Westf) │ 5 │ 524 │
│ 9 │ September │ Regensburg Hbf │ 3 │ 526 │
│ 9 │ September │ Wels Hbf │ 3 │ 526 │
│ 9 │ September │ Wien Hbf │ 3 │ 526 │
│ 9 │ September │ Linz Hbf │ 3 │ 526 │
│ 9 │ September │ St.Pölten Hbf │ 3 │ 526 │
│ 9 │ September │ Passau Hbf │ 3 │ 526 │
│ 9 │ September │ Heerenveen IJsstadion │ 1 │ 532 │
│ 9 │ September │ Kohlscheid │ 1 │ 532 │
├───────┴────────────┴───────────────────────┴──────────────┴───────┤
│ 6423 rows (20 shown) 5 columns │
└───────────────────────────────────────────────────────────────────┘
Now, let’s select just
- month
- month name
- station
from the new table we created in the previous step. We just put the table from the previous step in parentheses, and use SELECT ... FROM (...).
Let’s also keep only:
- months between 6 and 8 (i.e. June - August)
- stations with rank less than or equal to 3
"""
conn.sql( SELECT month, month_name, station AS top3_stations
FROM (
SELECT
month,
strftime(make_date(2023, month, 1), '%B') AS month_name,
rank() OVER
(PARTITION BY month ORDER BY num_services DESC) AS rank,
station,
num_services
FROM services_per_month
WHERE month BETWEEN 6 AND 8
)
WHERE rank <= 3
""")
┌───────┬────────────┬──────────────────────┐
│ month │ month_name │ top3_stations │
│ int64 │ varchar │ varchar │
├───────┼────────────┼──────────────────────┤
│ 8 │ August │ Utrecht Centraal │
│ 8 │ August │ Amsterdam Centraal │
│ 8 │ August │ Amsterdam Sloterdijk │
│ 7 │ July │ Utrecht Centraal │
│ 7 │ July │ Amsterdam Centraal │
│ 7 │ July │ Rotterdam Centraal │
│ 6 │ June │ Utrecht Centraal │
│ 6 │ June │ Amsterdam Centraal │
│ 6 │ June │ Schiphol Airport │
└───────┴────────────┴──────────────────────┘
Finally, we can return one row per month and aggregate the train stations using array_agg and GROUP BY ALL.
"""
conn.sql( SELECT month, month_name, array_agg(station) AS top3_stations
FROM (
SELECT
month,
strftime(make_date(2023, month, 1), '%B') AS month_name,
rank() OVER
(PARTITION BY month ORDER BY num_services DESC) AS rank,
station,
num_services
FROM services_per_month
WHERE month BETWEEN 6 AND 8
)
WHERE rank <= 3
GROUP BY ALL;
""")
┌───────┬────────────┬──────────────────────────────────────────────────────────────┐
│ month │ month_name │ top3_stations │
│ int64 │ varchar │ varchar[] │
├───────┼────────────┼──────────────────────────────────────────────────────────────┤
│ 6 │ June │ [Utrecht Centraal, Amsterdam Centraal, Schiphol Airport] │
│ 8 │ August │ [Utrecht Centraal, Amsterdam Centraal, Amsterdam Sloterdijk] │
│ 7 │ July │ [Utrecht Centraal, Amsterdam Centraal, Rotterdam Centraal] │
└───────┴────────────┴──────────────────────────────────────────────────────────────┘
Finally, order by month!
"""
conn.sql( SELECT month, month_name, array_agg(station) AS top3_stations
FROM (
SELECT
month,
strftime(make_date(2023, month, 1), '%B') AS month_name,
rank() OVER
(PARTITION BY month ORDER BY num_services DESC) AS rank,
station,
num_services
FROM services_per_month
WHERE month BETWEEN 6 AND 8
)
WHERE rank <= 3
GROUP BY ALL
ORDER BY month
""")
┌───────┬────────────┬──────────────────────────────────────────────────────────────┐
│ month │ month_name │ top3_stations │
│ int64 │ varchar │ varchar[] │
├───────┼────────────┼──────────────────────────────────────────────────────────────┤
│ 6 │ June │ [Utrecht Centraal, Amsterdam Centraal, Schiphol Airport] │
│ 7 │ July │ [Utrecht Centraal, Amsterdam Centraal, Rotterdam Centraal] │
│ 8 │ August │ [Utrecht Centraal, Amsterdam Centraal, Amsterdam Sloterdijk] │
└───────┴────────────┴──────────────────────────────────────────────────────────────┘
Efficient File Formats
There are also efficient file formats that we can use.
Parquet
Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming languages and analytics tools.
If you install pyarrow, you can use the pandas function pd.read_parquet.
pip install pyarrow
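As a brief sketch (assuming pyarrow is installed, and using the Parquet file queried just below), pandas can read the file directly, optionally selecting only some columns:

spotify_pq = pd.read_parquet('../data/spotify.parquet', columns=['Artist(s)', 'song', 'Genre'])
spotify_pq.head()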
We can also use DuckDB on parquet files.
"SELECT * FROM '../data/spotify.parquet' LIMIT 5").df() duckdb.query(
Artist(s) | song | text | Length | emotion | Genre | Album | Release Date | Key | Tempo | ... | Good for Morning Routine | Similar Artist 1 | Similar Song 1 | Similarity Score 1 | Similar Artist 2 | Similar Song 2 | Similarity Score 2 | Similar Artist 3 | Similar Song 3 | Similarity Score 3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | !!! | Even When the Waters Cold | Friends told her she was better off at the bot... | 03:47 | sadness | hip hop | Thr!!!er | 2013-04-29 | D min | 0.437870 | ... | 0 | Corey Smith | If I Could Do It Again | 0.986061 | Toby Keith | Drinks After Work | 0.983719 | Space | Neighbourhood | 0.983236 |
1 | !!! | One Girl / One Boy | Well I heard it, playing soft From a drunken b... | 04:03 | sadness | hip hop | Thr!!!er | 2013-04-29 | A# min | 0.508876 | ... | 0 | Hiroyuki Sawano | BRE@TH//LESS | 0.995409 | When In Rome | Heaven Knows | 0.990905 | Justice Crew | Everybody | 0.984483 |
2 | !!! | Pardon My Freedom | Oh my god, did I just say that out loud? Shoul... | 05:51 | joy | hip hop | Louden Up Now | 2004-06-08 | A Maj | 0.532544 | ... | 0 | Ricky Dillard | More Abundantly Medley Live | 0.993176 | Juliet | Avalon | 0.965147 | The Jacksons | Lovely One | 0.956752 |
3 | !!! | Ooo | [Verse 1] Remember when I called you on the te... | 03:44 | joy | hip hop | As If | 2015-10-16 | A min | 0.538462 | ... | 0 | Eric Clapton | Man Overboard | 0.992749 | Roxette | Don't Believe In Accidents | 0.991494 | Tiwa Savage | My Darlin | 0.990381 |
4 | !!! | Freedom 15 | [Verse 1] Calling me like I got something to s... | 06:00 | joy | hip hop | As If | 2015-10-16 | F min | 0.544379 | ... | 0 | Cibo Matto | Lint Of Love | 0.981610 | Barrington Levy | Better Than Gold | 0.981524 | Freestyle | Its Automatic | 0.981415 |
5 rows × 39 columns
HDF5
HDF stands for “hierarchical data format”. HDF5 files are commonly used to store scientific array data. Each HDF5 file can store multiple datasets and corresponding metadata.
h5py is a helpful package for working with HDF5 data:
Terminal
conda install h5py
The following comes from the h5py tutorial.
import h5py
f = h5py.File('../data/mytestfile.hdf5', 'r')
list(f.keys())
['mydataset']
dset = f['mydataset']
dset.shape
(100,)
We can get all values of the dataset using [...] (you can also use standard numpy indexing to get elements).
dset[...]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
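For completeness, a file like this can be created by opening it in write mode and calling create_dataset, as in the h5py quickstart (a minimal sketch; the filename is illustrative):

import numpy as np

with h5py.File('../data/mytestfile_new.hdf5', 'w') as f_new:
    # a dataset of 100 int32 values, like the one read above
    dset_new = f_new.create_dataset('mydataset', (100,), dtype='i')
    dset_new[...] = np.arange(100)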
Polars
Polars is a recently developed package which is extremely efficient at handling large scale datasets.
Similarly to Pandas, Polars is built around a DataFrame object (in this case, a Polars DataFrame).
Install Polars in your conda environment:
Terminal
pip install polars
What makes Polars efficient?
- Lazy evaluation
Unlike Pandas, Polars can employ lazy evaluation. In Polars' lazy evaluation, operations are built into an execution plan, which is optimized to avoid unnecessary computations and executed only when needed.
- Query optimization
Polars automatically optimizes the execution plan for efficient resource use, based on expressions and data characteristics.
- Parallelization
Polars divides the workload among the available CPU cores without any additional configuration.
- Memory
Unlike Pandas, Polars can process your results without requiring all your data to be in memory at the same time.
Let’s compare Pandas and Polars.
import pandas as pd
= pd.read_csv("../data/penguins.csv")
penguins_pd print(f"{type(penguins_pd) = }")
type(penguins_pd) = <class 'pandas.core.frame.DataFrame'>
import polars as pl
= pl.read_csv("../data/penguins.csv", separator=",", has_header=True)
penguins_pl print(f"{type(penguins_pl) = }")
type(penguins_pl) = <class 'polars.dataframe.frame.DataFrame'>
penguins_pd.head()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
penguins_pl.head()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|
str | str | f64 | f64 | i64 | i64 | str |
"Adelie" | "Torgersen" | 39.1 | 18.7 | 181 | 3750 | "Male" |
"Adelie" | "Torgersen" | 39.5 | 17.4 | 186 | 3800 | "Female" |
"Adelie" | "Torgersen" | 40.3 | 18.0 | 195 | 3250 | "Female" |
"Adelie" | "Torgersen" | null | null | null | null | null |
"Adelie" | "Torgersen" | 36.7 | 19.3 | 193 | 3450 | "Female" |
Selecting columns:
penguins_pd['bill_length_mm'].head()
0 39.1
1 39.5
2 40.3
3 NaN
4 36.7
Name: bill_length_mm, dtype: float64
penguins_pl.get_column('bill_length_mm').head(5)
bill_length_mm |
---|
f64 |
39.1 |
39.5 |
40.3 |
null |
36.7 |
Dropping columns:
penguins_pd = penguins_pd.drop(columns=['body_mass_g'])
penguins_pl = penguins_pl.drop('body_mass_g')
Note above that we didn't need to specify columns for penguins_pl.drop. This is because Polars has column-only methods (e.g. drop) and row-only methods (e.g. filter).
This is in contrast to Pandas, which operates on rows by default, and needs axis=1 or columns for column operations.
A major difference between Polars and Pandas is that Polars does not have an Index.
Duplicates:
="species") penguins_pd.drop_duplicates(subset
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | sex | |
---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | Male |
152 | Chinstrap | Dream | 46.5 | 17.9 | 192.0 | Female |
220 | Gentoo | Biscoe | 46.1 | 13.2 | 211.0 | Female |
="species") penguins_pl.unique(subset
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | sex |
---|---|---|---|---|---|
str | str | f64 | f64 | i64 | str |
"Chinstrap" | "Dream" | 46.5 | 17.9 | 192 | "Female" |
"Adelie" | "Torgersen" | 39.1 | 18.7 | 181 | "Male" |
"Gentoo" | "Biscoe" | 46.1 | 13.2 | 211 | "Female" |
pandas has many possible options for missing values: None, NaN, NA, <NA>, NaT, and np.nan, which also depend on the data type.
Polars represents missing values as null. Invalid numbers are NaN (not a number).
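A small illustration of the distinction (not from the original notes):

s = pl.Series("x", [1.0, None, float("nan")])
s.is_null()   # True only for the missing (None) entry
s.is_nan()    # flags the floating-point NaN; the missing entry stays null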
penguins_pd.dropna()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | sex | |
---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | Female |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | Female |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | Male |
... | ... | ... | ... | ... | ... | ... |
338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | Female |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | Male |
333 rows × 6 columns
penguins_pl.drop_nulls()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | sex |
---|---|---|---|---|---|
str | str | f64 | f64 | i64 | str |
"Adelie" | "Torgersen" | 39.1 | 18.7 | 181 | "Male" |
"Adelie" | "Torgersen" | 39.5 | 17.4 | 186 | "Female" |
"Adelie" | "Torgersen" | 40.3 | 18.0 | 195 | "Female" |
"Adelie" | "Torgersen" | 36.7 | 19.3 | 193 | "Female" |
"Adelie" | "Torgersen" | 39.3 | 20.6 | 190 | "Male" |
… | … | … | … | … | … |
"Gentoo" | "Biscoe" | 47.2 | 13.7 | 214 | "Female" |
"Gentoo" | "Biscoe" | 46.8 | 14.3 | 215 | "Female" |
"Gentoo" | "Biscoe" | 50.4 | 15.7 | 222 | "Male" |
"Gentoo" | "Biscoe" | 45.2 | 14.8 | 212 | "Female" |
"Gentoo" | "Biscoe" | 49.9 | 16.1 | 213 | "Male" |
Sorting:
"bill_length_mm", ascending=False) penguins_pd.sort_values(
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | sex | |
---|---|---|---|---|---|---|
253 | Gentoo | Biscoe | 59.6 | 17.0 | 230.0 | Male |
169 | Chinstrap | Dream | 58.0 | 17.8 | 181.0 | Female |
321 | Gentoo | Biscoe | 55.9 | 17.0 | 228.0 | Male |
215 | Chinstrap | Dream | 55.8 | 19.8 | 207.0 | Male |
335 | Gentoo | Biscoe | 55.1 | 16.0 | 230.0 | Male |
... | ... | ... | ... | ... | ... | ... |
70 | Adelie | Torgersen | 33.5 | 19.0 | 190.0 | Female |
98 | Adelie | Dream | 33.1 | 16.1 | 178.0 | Female |
142 | Adelie | Dream | 32.1 | 15.5 | 188.0 | Female |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN |
344 rows × 6 columns
"bill_length_mm", descending=True) penguins_pl.sort(
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | sex |
---|---|---|---|---|---|
str | str | f64 | f64 | i64 | str |
"Adelie" | "Torgersen" | null | null | null | null |
"Gentoo" | "Biscoe" | null | null | null | null |
"Gentoo" | "Biscoe" | 59.6 | 17.0 | 230 | "Male" |
"Chinstrap" | "Dream" | 58.0 | 17.8 | 181 | "Female" |
"Gentoo" | "Biscoe" | 55.9 | 17.0 | 228 | "Male" |
… | … | … | … | … | … |
"Adelie" | "Torgersen" | 34.1 | 18.1 | 193 | null |
"Adelie" | "Dream" | 34.0 | 17.1 | 185 | "Female" |
"Adelie" | "Torgersen" | 33.5 | 19.0 | 190 | "Female" |
"Adelie" | "Dream" | 33.1 | 16.1 | 178 | "Female" |
"Adelie" | "Dream" | 32.1 | 15.5 | 188 | "Female" |
Method chaining:
(
    penguins_pd
    .loc[penguins_pd['species'].isin(['Adelie', 'Gentoo']), ["bill_length_mm", "bill_depth_mm"]]
)
bill_length_mm | bill_depth_mm | |
---|---|---|
0 | 39.1 | 18.7 |
1 | 39.5 | 17.4 |
2 | 40.3 | 18.0 |
3 | NaN | NaN |
4 | 36.7 | 19.3 |
... | ... | ... |
339 | NaN | NaN |
340 | 46.8 | 14.3 |
341 | 50.4 | 15.7 |
342 | 45.2 | 14.8 |
343 | 49.9 | 16.1 |
276 rows × 2 columns
(
    penguins_pl
    .filter(pl.col("species").is_in(['Adelie', 'Gentoo']))
    .select(["bill_length_mm", "bill_depth_mm"])
)
bill_length_mm | bill_depth_mm |
---|---|
f64 | f64 |
39.1 | 18.7 |
39.5 | 17.4 |
40.3 | 18.0 |
null | null |
36.7 | 19.3 |
… | … |
null | null |
46.8 | 14.3 |
50.4 | 15.7 |
45.2 | 14.8 |
49.9 | 16.1 |
Group by:
"species", "island"])[["bill_length_mm"]].mean() penguins_pd.groupby([
bill_length_mm | ||
---|---|---|
species | island | |
Adelie | Biscoe | 38.975000 |
Dream | 38.501786 | |
Torgersen | 38.950980 | |
Chinstrap | Dream | 48.833824 |
Gentoo | Biscoe | 47.504878 |
"species", "island").agg(pl.col("bill_length_mm").mean()) penguins_pl.group_by(
species | island | bill_length_mm |
---|---|---|
str | str | f64 |
"Adelie" | "Torgersen" | 38.95098 |
"Adelie" | "Biscoe" | 38.975 |
"Chinstrap" | "Dream" | 48.833824 |
"Adelie" | "Dream" | 38.501786 |
"Gentoo" | "Biscoe" | 47.504878 |
Switching between Pandas and Polars:
penguins_pl = pl.DataFrame(penguins_pd)
penguins_pd = penguins_pl.to_pandas()
Example lazy query in Polars:
lazy_query = (
    pl.scan_csv("../data/penguins.csv")
    .group_by(["species", "island"])
    .agg(pl.col("bill_length_mm").mean())
    .filter(pl.col("island") == "Biscoe")
)

lazy_query.collect()
species | island | bill_length_mm |
---|---|---|
str | str | f64 |
"Adelie" | "Biscoe" | 38.975 |
"Gentoo" | "Biscoe" | 47.504878 |
Other libraries
Some other popular libraries are Dask and PySpark.
Both Dask and PySpark excel at processing massive datasets that require distributed processing across multiple computers. (These are out of scope for this course.)