Local Polynomial Regression 1: Introduction
Local polynomial regression is an important statistical tool for non-parametric regression. This post, the first in a short series, covers the general problem setup and introduces the Nadaraya–Watson estimator.
The Python code is available on GitHub.
Regression
Regression problems form a large class of problems that are central to statistical theory and methods. Their applications include prediction, variable selection and causal inference.
The data
Suppose we observe $n$ pairs of observations $(x_i, y_i)$, where each $x_i \in \R^d$ for some dimension $d \geq 1$, and each $y_i \in \R$.
For example, we might perform an experiment on $n = 100$ people to determine whether a drug helps to lower blood pressure. We could take $x_i$ as the amount of drug administered to person $i$, while $y_i$ could represent the change in blood pressure for person $i$.
The model
It is natural to assume that the change in blood pressure depends on the amount of drug administered, possibly with some noise in the observations. This gives the model
\[y_i = \mu(x_i) + \varepsilon_i ,\]where $\mu$ is an unknown function describing the dependence of $y$ on $x$, and $\varepsilon_i$ is the unknown error in observation $i$. This error could come from the drug affecting different people in different ways, from measurement error in the blood pressure reading, or from any other source of noise. We impose the condition that $\E[\varepsilon_i | x_i] = 0$ for each $i$ to ensure that on average each error is zero, allowing $\mu$ to be identified.
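To make this concrete, the following minimal sketch simulates data from such a model in one dimension. The particular choices of $\mu$, the noise level and the sample size are hypothetical and are used only for illustration; they are not taken from the post's figures or repository.

```python
import numpy as np

rng = np.random.default_rng(0)

def mu(x):
    # A hypothetical "true" regression function, chosen only for illustration.
    return np.sin(2 * x) + 0.5 * x

n = 100
x = rng.uniform(0, 3, size=n)       # covariates x_i
eps = rng.normal(0, 0.3, size=n)    # errors with E[eps_i | x_i] = 0
y = mu(x) + eps                     # responses y_i = mu(x_i) + eps_i
```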
Parametric vs. non-parametric
The aim of regression is to use the data points $(x_i, y_i)$ to calculate a function $\widehat \mu$ which estimates the unknown regression function $\mu$. If we assume that $\mu$ is a specific type of function which can be determined by finitely many parameters, then the problem is known as parametric regression. Otherwise, when we do not assume anything about the form of $\mu$, the problem is called non-parametric regression. For example, if we suppose that $\mu$ is a quadratic function (hence determined by its three coefficients), the problem is parametric. In these posts we will explore the more general setting of the non-parametric problem.
Parametric linear regression
The simplest regression estimator is parametric linear regression. This estimator takes $\widehat \mu$ to be the linear function which minimises the mean squared error (MSE), defined by
\[\MSE(\widehat \mu) = \frac{1}{n} \sum_{i=1}^n \big(y_i - \widehat \mu(x_i) \big)^2.\]Figure 2 shows how linear regression fits a straight line to the data.
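As a rough sketch (not the code from the GitHub repository), the linear estimator can be computed by ordinary least squares. This reuses the simulated `x` and `y` from above; `fit_linear` and `mse` are hypothetical helper names.

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares fit of y on (1, x), i.e. the line minimising the MSE."""
    X = np.column_stack([np.ones_like(x), x])     # design matrix with intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda x_new: beta[0] + beta[1] * np.asarray(x_new)

def mse(y, y_hat):
    """Mean squared error between observed and fitted responses."""
    return np.mean((y - y_hat) ** 2)

mu_hat_linear = fit_linear(x, y)
print(mse(y, mu_hat_linear(x)))
```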
However, if the regression function is not linear, this method can perform poorly, as seen in Figure 3. Here the regression function is clearly some kind of curve, but our estimator is limited to straight lines.
Linear regression with transformed features
One possible solution to this problem is to include transformations of $x_i$ as extra features. For example, suppose that $x_i \in \R$, so that $d=1$. Then one could use not only the variables $x_i$ but also $x_i^2$, allowing quadratic curves to be fitted to the data. This gives rise to so-called polynomial regression. Figure 4 shows how including $x_i^2$ can give a much better fit to the data.
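A minimal sketch of this idea, again reusing the simulated data from above; `fit_polynomial` is an illustrative helper, not the function used in the repository.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Least-squares fit of y on the transformed features 1, x, x^2, ..., x^degree."""
    X = np.vander(x, degree + 1, increasing=True)   # columns 1, x, x^2, ...
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda x_new: np.vander(np.atleast_1d(x_new), degree + 1, increasing=True) @ beta

mu_hat_quadratic = fit_polynomial(x, y, degree=2)   # quadratic fit
```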
However, if the regression function is not well approximated by any low-degree polynomial, this regression method can still perform poorly, as seen in Figure 5.
While it is tempting to use higher and higher-degree polynomials such as cubics, quartics and quintics, this can lead to overfitting as shown in Figure 6, especially when there are not many data points.
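One way to see this, continuing the sketch above (with an arbitrary choice of sample split and degrees): a high-degree polynomial fitted to a small training set typically achieves a small training MSE but a much larger error on held-out points.

```python
# Fit on a small training set and evaluate on held-out data.
x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

for degree in (1, 2, 10):
    fit = fit_polynomial(x_train, y_train, degree)
    print(degree, mse(y_train, fit(x_train)), mse(y_test, fit(x_test)))
```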
Non-parametric local regression
Non-parametric estimators attempt to solve these problems in a variety of ways. The idea behind local regression methods is that at each evaluation point the fitted regression function only needs to depend on the data points “nearby.” This concept is made more concrete using the notion of a kernel function.
Kernels
A kernel $K$ is a function from $\R$ to $\R$ which allows us to quantify the local nature of a non-parametric estimator. If $x$ is a point at which we want to estimate the regression function and $x_i$ is a data point, then we can use
\[K\left(\frac{x_i - x}{h}\right)\]as a measure of the influence of $x_i$ at $x$, where $h>0$ is a parameter called the bandwidth. Kernels must integrate to one, and are typically (though not always) symmetric non-negative functions. Figure 7 shows some commonly-used examples.
The bandwidth $h$ controls the degree of locality. If $h$ is very small, then only data points very close to the evaluation point receive appreciable weight; if $h$ is larger, then points further away contribute as well.
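For concreteness, here is a sketch of two commonly-used kernels and the resulting weights; the function names are illustrative choices rather than the repository's API.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density: symmetric, non-negative, integrates to one."""
    return np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)

def epanechnikov_kernel(u):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return 0.75 * (1 - u ** 2) * (np.abs(u) <= 1)

def kernel_weights(x_eval, x_data, h, kernel=gaussian_kernel):
    """Influence K((x_i - x) / h) of each data point x_i at the evaluation point x."""
    return kernel((x_data - x_eval) / h)
```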
The Nadaraya–Watson estimator
The simplest local regression estimator is the Nadaraya–Watson estimator, which works as follows. First pick a kernel and a bandwidth. For each evaluation point $x$, find the “importance” of each data point $x_i$ using the kernel function and bandwidth. Then average all of the responses $y_i$, weighted according to this importance. As an equation, this gives
\[\widehat \mu(x) = \frac{ \sum_{i=1}^n y_i K\left(\frac{x_i-x}{h}\right) } { \sum_{i=1}^n K\left(\frac{x_i-x}{h}\right) }.\]Note how we do not assume anything about the form of the underlying regression function, though it is necessary to choose an appropriate bandwidth $h$.
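A direct translation of this formula into code might look as follows. This is a sketch using the simulated data and the Gaussian kernel defined above, not the repository's implementation, and the bandwidth $h = 0.2$ is an arbitrary choice.

```python
import numpy as np

def nadaraya_watson(x_eval, x_data, y_data, h, kernel=gaussian_kernel):
    """Kernel-weighted average of the responses y_i at each evaluation point."""
    x_eval = np.atleast_1d(x_eval)
    # weights[j, i] = K((x_i - x_eval[j]) / h)
    weights = kernel((x_data[None, :] - x_eval[:, None]) / h)
    return weights @ y_data / weights.sum(axis=1)

# Evaluate the estimator on a grid of points.
x_grid = np.linspace(0, 3, 200)
mu_hat_nw = nadaraya_watson(x_grid, x, y, h=0.2)
```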
There are several interesting features in Figure 8. Firstly, note how $\widehat \mu$ underestimates $\mu$ where $\mu$ is concave (around $x=1$) and overestimates $\mu$ where $\mu$ is convex (around $x=2$). Secondly, note how $\widehat \mu$ overestimates $\mu$ at the left boundary and underestimates $\mu$ at the right boundary. These illustrate problems relating to the bias of the Nadaraya–Watson estimator, which will be investigated in later posts.
Next time
In the next post we will address the issue of how to choose an appropriate bandwidth for a local regression estimator, both in theory and in practice.