Lecture 5a - Generalized Linear Models

Bias-Variance Illustration

Suppose we have fit a model \widehat{f}(x) to some training data \{{X}_i,{Y}_i\}_{i=1}^n. Let (x_0, y_0) be a test observation from the population. The true model is Y=f(X)+\varepsilon (where f(x)=\mathbb{E}[Y|X=x]). Then the MSE of the test data is: \begin{align*} \mathbb{E}[(y_0-\widehat{f}(x_0))^2] = \mathrm{Var}(\varepsilon) + [\mathrm{Bias}(\widehat{f}(x_0))]^2 + \mathrm{Var}(\widehat{f}(x_0)) \end{align*} where \mathrm{Bias}(\widehat{f}(x_0)) = \mathbb{E}[\widehat{f}(x_0)] - f(x_0).

The expectation \mathbb{E} averages over the variability of y_0 as well as variability in the training data.

Below we compare different models to predict Y from X where the true relationship is nonlinear.

  • Left: Plot of true function, draw of training data, fit to the training data
  • Middle: Plot of fitted functions for different training datasets (to give a visual of variance of \widehat{f})
  • Right: Plot of average of fitted functions and true function (to give a visual of bias of \mathbb{E}[\widehat{f}(x)])

Linear regression fit to nonlinear data

Piecewise polynomial fit to nonlinear data (with regularization)

Piecewise polynomial fit to nonlinear data (no regularization)

Poisson Regression

Adapted from Introduction to Statistical Learning with Python (link)

We consider the Bike Share data from class.

import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize)
Bike = load_data('Bikeshare')
Bike.shape, Bike.columns
((8645, 15),
 Index(['season', 'mnth', 'day', 'hr', 'holiday', 'weekday', 'workingday',
        'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual',
        'registered', 'bikers'],
       dtype='object'))
features = ['mnth', 'hr', 'workingday', 'temp', 'weathersit']
Bike['workingday'].value_counts()
workingday
1    5911
0    2734
Name: count, dtype: int64
Bike['weathersit'].value_counts()
weathersit
clear              5645
cloudy/misty       2218
light rain/snow     781
heavy rain/snow       1
Name: count, dtype: int64
X = MS(['mnth',
        'hr',
        'workingday',
        'temp',
        'weathersit']).fit_transform(Bike)
Y = Bike['bikers']
X.columns
Index(['intercept', 'mnth[Feb]', 'mnth[March]', 'mnth[April]', 'mnth[May]',
       'mnth[June]', 'mnth[July]', 'mnth[Aug]', 'mnth[Sept]', 'mnth[Oct]',
       'mnth[Nov]', 'mnth[Dec]', 'hr[1]', 'hr[2]', 'hr[3]', 'hr[4]', 'hr[5]',
       'hr[6]', 'hr[7]', 'hr[8]', 'hr[9]', 'hr[10]', 'hr[11]', 'hr[12]',
       'hr[13]', 'hr[14]', 'hr[15]', 'hr[16]', 'hr[17]', 'hr[18]', 'hr[19]',
       'hr[20]', 'hr[21]', 'hr[22]', 'hr[23]', 'workingday', 'temp',
       'weathersit[cloudy/misty]', 'weathersit[heavy rain/snow]',
       'weathersit[light rain/snow]'],
      dtype='object')
M_pois = sm.GLM(Y, X, family=sm.families.Poisson()).fit()
M_pois.summary()
Generalized Linear Model Regression Results
Dep. Variable: bikers No. Observations: 8645
Model: GLM Df Residuals: 8605
Model Family: Poisson Df Model: 39
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -1.4054e+05
Date: Mon, 23 Feb 2026 Deviance: 2.2804e+05
Time: 13:27:35 Pearson chi2: 2.20e+05
No. Iterations: 7 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
intercept 2.6937 0.010 277.124 0.000 2.675 2.713
mnth[Feb] 0.2260 0.007 32.521 0.000 0.212 0.240
mnth[March] 0.3764 0.007 56.263 0.000 0.363 0.390
mnth[April] 0.6917 0.007 98.996 0.000 0.678 0.705
mnth[May] 0.9106 0.007 122.469 0.000 0.896 0.925
mnth[June] 0.8934 0.008 108.402 0.000 0.877 0.910
mnth[July] 0.7738 0.009 87.874 0.000 0.757 0.791
mnth[Aug] 0.8213 0.008 98.573 0.000 0.805 0.838
mnth[Sept] 0.9037 0.008 118.578 0.000 0.889 0.919
mnth[Oct] 0.9377 0.007 139.054 0.000 0.925 0.951
mnth[Nov] 0.8204 0.006 126.334 0.000 0.808 0.833
mnth[Dec] 0.6868 0.006 108.724 0.000 0.674 0.699
hr[1] -0.4716 0.013 -36.278 0.000 -0.497 -0.446
hr[2] -0.8088 0.015 -55.220 0.000 -0.837 -0.780
hr[3] -1.4439 0.019 -76.631 0.000 -1.481 -1.407
hr[4] -2.0761 0.025 -83.728 0.000 -2.125 -2.027
hr[5] -1.0603 0.016 -65.957 0.000 -1.092 -1.029
hr[6] 0.3245 0.011 30.585 0.000 0.304 0.345
hr[7] 1.3296 0.009 146.822 0.000 1.312 1.347
hr[8] 1.8313 0.009 211.630 0.000 1.814 1.848
hr[9] 1.3362 0.009 148.191 0.000 1.318 1.354
hr[10] 1.0912 0.009 117.831 0.000 1.073 1.109
hr[11] 1.2485 0.009 137.304 0.000 1.231 1.266
hr[12] 1.4340 0.009 160.486 0.000 1.417 1.452
hr[13] 1.4280 0.009 159.529 0.000 1.410 1.445
hr[14] 1.3793 0.009 153.266 0.000 1.362 1.397
hr[15] 1.4081 0.009 156.862 0.000 1.391 1.426
hr[16] 1.6287 0.009 184.979 0.000 1.611 1.646
hr[17] 2.0490 0.009 239.221 0.000 2.032 2.066
hr[18] 1.9667 0.009 229.065 0.000 1.950 1.983
hr[19] 1.6684 0.009 190.830 0.000 1.651 1.686
hr[20] 1.3706 0.009 152.737 0.000 1.353 1.388
hr[21] 1.1186 0.009 121.383 0.000 1.101 1.137
hr[22] 0.8719 0.010 91.429 0.000 0.853 0.891
hr[23] 0.4814 0.010 47.164 0.000 0.461 0.501
workingday 0.0147 0.002 7.502 0.000 0.011 0.018
temp 0.7853 0.011 68.434 0.000 0.763 0.808
weathersit[cloudy/misty] -0.0752 0.002 -34.528 0.000 -0.080 -0.071
weathersit[heavy rain/snow] -0.9263 0.167 -5.554 0.000 -1.253 -0.599
weathersit[light rain/snow] -0.5758 0.004 -141.905 0.000 -0.584 -0.568
S_pois = summarize(M_pois)

coef_month = S_pois[S_pois.index.str.contains('mnth')]['coef']
coef_month = pd.concat([pd.Series([0],
                                   index=['mnth[Jan]']), coef_month])
coef_hr = S_pois[S_pois.index.str.contains('hr')]['coef']
coef_hr = pd.concat([pd.Series([0],
                     index=['hr[0]']), coef_hr])
coef_month
mnth[Jan]      0.0000
mnth[Feb]      0.2260
mnth[March]    0.3764
mnth[April]    0.6917
mnth[May]      0.9106
mnth[June]     0.8934
mnth[July]     0.7738
mnth[Aug]      0.8213
mnth[Sept]     0.9037
mnth[Oct]      0.9377
mnth[Nov]      0.8204
mnth[Dec]      0.6868
dtype: float64
x_month = np.arange(coef_month.shape[0])
x_hr = np.arange(coef_hr.shape[0])
fig_pois, (ax_month, ax_hr) = subplots(1, 2, figsize=(16,8))
ax_month.plot(x_month, coef_month, marker='o', ms=10)
ax_month.set_xticks(x_month)
ax_month.set_xticklabels([l[5] for l in coef_month.index], fontsize=20)
ax_month.set_xlabel('Month', fontsize=20)
ax_month.set_ylabel('Coefficient', fontsize=20)
ax_hr.plot(x_hr, coef_hr, marker='o', ms=10)
ax_hr.set_xticklabels(range(24)[::2], fontsize=20)
ax_hr.set_xlabel('Hour', fontsize=20)
ax_hr.set_ylabel('Coefficient', fontsize=20);
/var/folders/f0/m7l23y8s7p3_0x04b3td9nyjr2hyc8/T/ipykernel_28453/3779510511.py:8: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  ax_hr.set_xticklabels(range(24)[::2], fontsize=20)