Generalized Additive Models (GAMs): A Powerful Approach to Modeling Non-Linear Data, Part I

“All considered, it is conceivable that in a minor way, nonparametric regression might, like linear regression, become an object treasured for both its artistic merit as well as usefulness.”

– Leo Breiman, statistician, 1977

My first project at The General® involved classifying vehicles by their risk profile. While conducting R&D ahead of the project, I came across a model type I was unfamiliar with: the Generalized Additive Model, or GAM. Given the ability of GAMs to handle the non-linear nature of the data I was working with, their built-in penalties against overfitting, and the somewhat unconventional but straightforward interpretation of GAM output, they appeared to be a promising model type for my project.

This post will introduce the GAM by way of its history, its powerful smoothing parameters, and the interpretation of its output, and will offer practical considerations for implementation. We’ll be using R’s mgcv and ggplot2 packages.

Brief History Lesson

GAMs are a generalization of multiple regression and are considered “traditional statistics” as opposed to modern machine learning methods. Regression can be traced back to the mathematicians Carl Friedrich Gauss and Adrien-Marie Legendre at the start of the 19th century (for an entertaining and informative article outlining their dispute over just who invented ordinary least squares, read this). The GAM, however, is a much more recent development: though additive models were discussed in the 1950s, the GAM was formally introduced in 1986 by Trevor Hastie and Robert Tibshirani, both of whom are now professors at Stanford (also, shout out to Tibshirani, a fellow Canadian).

The Essentials

As the name implies, GAMs are generalized in the sense that the response variable can be specified to have a non-normal distribution and does not have to be continuous. GAMs are additive because the components on the right-hand side of the equation are added together (y = f(x1) + f(x2)) rather than multiplied together (y = f(x1) × f(x2)). The main benefit of GAMs is their use of smooth functions with smoothing parameters, which allow the model to fit non-linear data while penalizing overfitting. Let’s take a look at a basic form of the math:

y = sλ(x) + e
y = response variable
λ (lambda) = smoothing parameter
sλ(x) = non-parametric smooth function of x
e = error term
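
Before we simulate data, it may help to see what this looks like in mgcv syntax. Below is a minimal, hypothetical sketch (the variables x1, x2, the binary response yb, and toy_gam are placeholders, not the data we use in the rest of the post): each predictor gets its own smooth, s(), the smooths are added together, and the family argument supplies the “generalized” part.

# a hypothetical GAM: two predictors and a binary response
library(mgcv)

set.seed(1)
x1 <- runif(200)
x2 <- runif(200)
yb <- rbinom(200, size = 1, prob = plogis(sin(2 * pi * x1) + x2 - 0.5))

# additive: s(x1) + s(x2); generalized: binomial family with a logit link
toy_gam <- gam(yb ~ s(x1) + s(x2), family = binomial())
summary(toy_gam)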

We’ll revisit more of the internals of GAMs throughout this post, but let’s begin by simulating some non-linear data and plotting our variables, X and y.  

# load the packages we'll use throughout
library(mgcv)
library(ggplot2)

# set seed for reproducibility
set.seed(1986)

X <- seq(0, 10, 0.025)

# add a small amount of 'noise' to X
X <- jitter(X, factor = 10)

# simulate a sinusoidal response with Gaussian noise
y <- sin(X) + rnorm(mean = 1, sd = 0.5 * sd(sin(X)), n = length(X))

# combine data
data <- as.data.frame(cbind(X, y))

# plot
ggplot(data, aes(X, y)) + geom_point()
[Figure: scatter plot of the simulated data, X vs. y]

Method #1: Linear Model

As an exercise, let’s see what happens when we fit a linear model to our data. The blue line indicates the best fit from our regression. As intuition might have told you before we even plotted the data, a linear fit leaves a lot to be desired.


# linear model
lm <- lm(y ~ X, data = data)

# get coefficients
coeff <- coefficients(lm)

# visualize linear model
ggplot(data, aes(X, y)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "blue") +
  ggtitle("Linear Regression") + theme(plot.title = element_text(hjust = 0.5))
[Figure: scatter plot with the linear regression fit shown as a blue line]

Assessing the summary output confirms our visual inspection. Our coefficient for X is very small and not statistically significant. The adjusted R-squared (the proportion of variation in y explained by X) is practically zero.

# summary
summary(lm)
Call:
lm(formula = y ~ X, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.94923 -0.56761  0.05555  0.61133  1.68722 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.27266    0.07362  17.288   <2e-16 ***
X           -0.01362    0.01274  -1.069    0.286    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7385 on 399 degrees of freedom
Multiple R-squared:  0.002854,	Adjusted R-squared:  0.0003551 
F-statistic: 1.142 on 1 and 399 DF,  p-value: 0.2859

Let’s move on to our next model type.


Method #2: Polynomial Model

One solution may be to fit a polynomial regression, a special case of a linear model that introduces non-linearity into the prediction. Non-linearity is introduced by applying polynomial functions (quadratic, cubic, and so on) to one or more explanatory variables in the model, as sketched below.
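
To see what poly() does under the hood, here is a tiny, standalone illustration (the vector x_demo is just a placeholder): each degree of the polynomial becomes its own column in the design matrix, and lm() fits a coefficient to each column.

# poly() expands one predictor into several (orthogonal) polynomial columns,
# one per degree: linear, quadratic, cubic
x_demo <- 1:6
poly(x_demo, degree = 3)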

Our data has three inflection points, so let’s begin with a quartic, or 4th-degree, polynomial.

# fit polynomial 
pnl <- lm(y ~ poly(X, 4))

# visualize polynomial model
ggplot(data, aes(X, y)) +
   geom_point() + 
   stat_smooth(method = "lm", formula = y ~ poly(x, 4)) +
   ggtitle("Linear Regression w/ 4th Degree Polynomial") + 
   theme(plot.title = element_text(hjust = 0.5))
[Figure: scatter plot with the 4th-degree polynomial fit]

That’s a much better fit!

Let’s take a look at our summary data. Our residual standard error of 0.3897 indicates our quartic polynomial model is a much better fit than our basic linear model (RSE = 0.7385), the adjusted R-squared value is much improved, and the p-values indicate all of our polynomial terms are significant at the 0.05 level (and all but the first-degree term at the 0.001 level).

# summary
summary(pnl)
Call:
lm(formula = y ~ poly(X, 4))

Residuals:
     Min       1Q   Median       3Q      Max 
-1.06788 -0.25621  0.02535  0.25502  1.06345 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.20457    0.01946  61.893   <2e-16 ***
poly(X, 4)1  -0.78918    0.38973  -2.025   0.0435 *  
poly(X, 4)2   5.66094    0.38973  14.525   <2e-16 ***
poly(X, 4)3  -4.06198    0.38973 -10.423   <2e-16 ***
poly(X, 4)4 -10.43490    0.38973 -26.775   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3897 on 396 degrees of freedom
Multiple R-squared:  0.7244,	Adjusted R-squared:  0.7216 
F-statistic: 260.2 on 4 and 396 DF,  p-value: < 2.2e-16
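
As a quick, optional cross-check that isn’t part of the summary output above, we can also compare the two fits with AIC (lower is better); the object names are the ones we created earlier.

# compare the linear and 4th-degree polynomial fits;
# AIC balances goodness of fit against model complexity
AIC(lm, pnl)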

All in all, visual inspection and summary metrics indicate our polynomial model is pretty strong. We could tinker and improve it further, but this is a post about GAMs, so let’s see what they are capable of!

Method #3: GAM

The value of GAMs comes from the increased flexibility of their smooth functions, or spline curves. A spline curve is a combination of two or more polynomial curves; however, unlike our polynomial model, which might have required tinkering with higher-degree terms that are more prone to overfitting, splines are non-parametric and can adapt to the shape of the data. Furthermore, much as ridge regression penalizes large coefficients, GAMs penalize overly “wiggly” smooths, minimizing the potential for overfitting.

When fitting a GAM, we estimate the spline for an explanatory variable by adjusting a smoothing parameter, λ: the higher the λ, the smoother the function, and vice versa. These splines are themselves a linear combination of a series of “basis functions”, or small curves. Clearly there are a lot of moving parts to consider; a peek at the basis functions is sketched below.
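
As an optional aside (using the data frame we simulated above), mgcv’s smoothCon() builds a smooth’s basis without fitting a model, so we can look at the basis functions behind s(X) directly:

# construct the smooth for X the same way gam() would; sm$X is the model
# matrix, with one column per basis function (k = 10 by default for s())
sm <- smoothCon(s(X), data = data)[[1]]
dim(sm$X)

# plot a few of the basis functions against X
ord <- order(data$X)
matplot(data$X[ord], sm$X[ord, 1:5], type = "l",
        xlab = "X", ylab = "basis functions")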

First, we’ll estimate a GAM with a smoothing parameter equal to 10. Though the GAM curve loosely follows the data, our fit is clearly under-fitting:

# fit GAM with smoothing parameter = 10
gam.1 <- gam(y ~ s(X), data = data, sp = 10)


# plot
plot(gam.1, residuals = TRUE, se = FALSE, pch = 1, rug = FALSE, col = "blue", main = "GAM w/ Smoothing Parameter = 10")
[Figure: GAM fit with smoothing parameter = 10]

Let’s try again and fit a model with a smoothing parameter equal to 0.1. The smoothing parameter scales a penalty on wiggliness (the integral of the squared second derivative of the smooth), so a lower value of λ applies less of that penalty and we should expect a more “wiggly” curve.

# smoothing parameter = 0.1
gam.2 <- gam(y ~ s(X), data = data, sp = 0.1)


# plot
plot(gam.2, residuals = TRUE, se = FALSE, pch = 1, rug = FALSE, col = "blue", main = "GAM w/ Smoothing Parameter = 0.1")
[Figure: GAM fit with smoothing parameter = 0.1]

Thankfully, we don’t have to peck away endlessly to find the optimal fit: the mgcv package can estimate the smoothing parameter for us via “REML”, or restricted maximum likelihood. Using the REML method gives us the following:

# smoothing parameter estimated by REML
gam.3 <- gam(y ~ s(X), data = data, method = 'REML')


# plot 
plot(gam.3, residuals = TRUE, seWithMean = TRUE, pch = 1, rug = FALSE, col = "blue", 
     main = "GAM w/ Smoothing Parameter Estimated by REML")

[Figure: GAM fit with smoothing parameter estimated by REML]

Looks like a keeper. Let’s evaluate the essential parts of our GAM. From the first part of the summary below, we see our GAM formula in addition to the following:

  • Family = gaussian: indicates we are modeling the response with a normal distribution.

  • Link function = identity: the response is modeled on its original scale, with no transformation.

# summary
summary(gam.3)
Family: gaussian 
Link function: identity 
Formula: y ~ s(X)

The parametric coefficients refer to the strictly parametric parts of the model, in this case just the intercept. We see that our intercept is significant at the 0.001 level, and its standard error is small.

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.20457    0.01662   72.49   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Next we look at the summary output for the non-parametric aspects of our model, the smooth terms. The first metric, edf (effective degrees of freedom), represents the complexity of the smooth:

  • edf of 1 is equivalent to a straight line, which implies a linear model would be just as effective.

  • edf of 2 is equivalent to a quadratic curve.

  • edf > 2 describes more wiggly curves.

Our edf is 8.55, meaning the fit for our s(X) term is relatively complex. It is important to note that the edf does not measure the significance of the smooth term. Significance is assessed by an F-test based on the next two columns, Ref.df and F, with the result reported in the p-value column.

We have an adjusted r-squared of 0.797, meaning we are capturing 79.7% of the variation in Y with s(X)—quite an improvement over our linear and polynomial models.

Approximate significance of smooth terms:
       edf Ref.df     F p-value    
s(X) 8.547  8.941 175.8  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.797   Deviance explained = 80.1%
-REML = 149.78  Scale est. = 0.11072   n = 401
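
If you’d rather pull these numbers out programmatically than read them off the printout, the object returned by summary() stores them (an optional convenience, not something we rely on below):

gam_summary <- summary(gam.3)

gam_summary$s.table    # edf, Ref.df, F, and p-value for each smooth term
gam_summary$r.sq       # adjusted R-squared
gam_summary$dev.expl   # proportion of deviance explained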

Our smooth term is again significant at the 0.001 level. Statistical significance is a thorny issue with GAMs, as it is in general (a topic we’ll explore in a future post). With GAMs, checking the p-value column alone is insufficient; visual inspection is a must when evaluating the significance of variables. If a horizontal line can fit roughly within the bounds of the 95% confidence interval across the range of the smooth, the variable is likely not significant.

When we add 95% confidence intervals and a horizontal line at zero to our GAM plot, it becomes apparent that X is significant.

# plot 
plot(gam.3, residuals = TRUE, seWithMean = TRUE, pch = 1, rug = FALSE, col = "blue", 
     main = "GAM w/ Smoothing Parameter Estimated by REML")
abline(h = 0 , col = "red")
[Figure: GAM fit with 95% confidence interval and horizontal reference line at zero]

You might’ve noticed the summary for our smooth terms differs from what we typically see in regression output: we do not get coefficients for our smooth terms. This is because each smooth term has multiple coefficients, one for each basis function. We can pull the basis-function coefficients with a standard coef() call, as shown below; however, it is recommended to interpret GAMs through visual inspection alongside the standard summary metrics, as we have done.
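
For the curious, here is what those coefficients look like; there is one per basis function, plus the intercept, and the individual values are not meaningful on their own.

# basis-function coefficients for the fitted smooth
length(coef(gam.3))
head(coef(gam.3))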

We’ve fit a simple yet powerful GAM with an easy, out-of-the-box process!

Practical Considerations

As data scientists, we have to be aware of the costs, not just the benefits, of our models. GAMs, as with anything under the sun, are not without trade-offs. We’ve already touched on the potential for overfitting with GAMs, but there is another drawback to keep in mind: GAMs can be computationally expensive relative to other model types, because the smoothing parameters are estimated through an iterative process. A rough timing comparison is sketched below.
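
As an optional, machine-dependent illustration (timings will differ from run to run and are not from the original analysis), we can compare fitting times with system.time():

# elapsed fitting time for each model on our 401-row data set;
# the GAM's iterative smoothness selection generally makes it the slowest
system.time(lm(y ~ X, data = data))
system.time(lm(y ~ poly(X, 4), data = data))
system.time(gam(y ~ s(X), data = data, method = "REML"))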

A model whose predictions initially appear to knock it out of the park may turn out to be a foul ball if the computational (and therefore financial) costs are excessive. In that case, a more practical model may suit your overall business needs far better.

Comparing estimates of memory usage with object.size() for our linear, polynomial, and best GAM models, we see that the memory footprint grows as model complexity increases. Our GAM requires about 43% more memory than the linear model, and roughly 17% more than the 4th-degree polynomial.

object.size(lm)
object.size(pnl)
object.size(gam.3)

116576 bytes
141648 bytes
166136 bytes

The expense here is a non-issue, but as datasets reach into the hundreds of thousands of observations, and as the number of variables and smooths grows, the gaps between these model types are unlikely to scale proportionally: expect GAMs to demand relatively more memory and compute. One mgcv option for such cases is sketched below.
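
For genuinely large datasets, mgcv also provides bam(), a version of gam() designed for big additive models. A minimal sketch, using our toy data purely to show the call (at this size it offers no real advantage):

# bam() fits the same kind of model with algorithms aimed at large n;
# method = "fREML" is its fast REML-based smoothness selection
big_gam <- bam(y ~ s(X), data = data, method = "fREML")
summary(big_gam)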

Summary

This post provided an introduction to GAMs. We offered some high-level historical and theoretical context, and illustrated their ability to manage the bias/variance trade-off via the use of splines. We also fit a basic GAM, evaluated its summary metrics, and showed the superiority of our simple GAM over another non-linear approach, polynomial regression. Practical considerations for implementing GAMs were also offered.

In a future post, we’ll add more complexity to our GAM, take a step back to understand it from the ground up via its basis functions, and explore additional ways to visualize and evaluate our model.