Local Polynomial Regression 3: Correcting Bias
This post is the third in a series on local polynomial regression, motivating the local polynomial estimator through bias reduction.
Towards the end of part one we briefly noted some potential problems regarding the bias of the Nadaraya–Watson estimator. In this post we will finally introduce the local polynomial estimator and show how it can alleviate these issues. As in part two we focus on the Epanechnikov kernel, and in this post the plots will be significantly oversmoothed (bandwidth too large) to better display the effects of bias.
Boundary bias
The Nadaraya–Watson estimator is susceptible to boundary bias, where an estimator consistently over- or underestimates the true regression function at the edge of the support of the data. Note how in Figure 1 the estimated function lies significantly below the true function at the left edge of the plot.
The reason for this effect is related to the gradient of the true regression function at the boundary. In particular, suppose that the regression function is decreasing at the left boundary, as in Figure 1. When evaluating the estimator near this boundary point, almost all of the data “seen” by the kernel lies to the right. Then the negative gradient implies that these data points are on average lower than would be expected if the data were to continue beyond the edge of the plot. As a result, the estimator is downward biased at the left edge.
This effect worsens as the bandwidth increases since a wider kernel uses data further from the point of interest. Note that this phenomenon also appears at the right boundary of Figure 1, but to a lesser extent, due to the relatively small positive gradient of the regression function at the right edge.
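The mechanism can be checked numerically. The following is a minimal sketch, assuming a hypothetical noiseless regression function $\mu(x) = 1 - x$ on $[0, 1]$ (not the data behind Figure 1) and illustrative function names. With no noise, any gap between the estimate and the truth is pure bias.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def nadaraya_watson(x_eval, x, y, h):
    """Kernel-weighted average of y at each evaluation point."""
    u = (x[None, :] - x_eval[:, None]) / h  # (n_eval, n) scaled distances
    w = epanechnikov(u)
    return (w @ y) / w.sum(axis=1)

# Hypothetical noiseless data: mu(x) = 1 - x, decreasing on [0, 1].
x = np.linspace(0.0, 1.0, 201)
y = 1.0 - x
x_eval = np.array([0.0, 0.5])  # left boundary and an interior point

biases = {}
for h in (0.1, 0.3):
    biases[h] = nadaraya_watson(x_eval, x, y, h) - (1.0 - x_eval)
    print(f"h={h}: bias at x=0 is {biases[h][0]:+.4f}, at x=0.5 is {biases[h][1]:+.4f}")
```

In this toy setting the bias at the left boundary should come out negative and grow in magnitude with the bandwidth, while the interior point, whose kernel window is symmetric, should show essentially none.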
The local linear smoother
A popular method to fix the issue of boundary bias is to use a local linear smoother. Recall that the Nadaraya–Watson estimator is a local average, with locality measured by the kernel function:
\[\widehat \mu(x) = \frac{ \sum_{i=1}^n Y_i K\left(\frac{X_i - x}{h}\right) } { \sum_{i=1}^n K\left(\frac{X_i - x}{h}\right) }.\]
Suppose that at each evaluation point we fit a local linear model rather than a simple local average. This is equivalent to weighted least-squares regression, with the weights given by the kernel, and yields the following formulation:
\[\widehat \mu(x) = e_1^\T \big(P(x)^\T W(x) P(x)\big)^{-1} P(x)^\T W(x) Y\]
where $e_1 = (1, 0)^\T \in \R^2$ is a standard basis vector, $P(x) \in \R^{n \times 2}$ with $P(x)_{i1} = 1$ and $P(x)_{i2} = \frac{X_i - x}{h}$, and $W(x) \in \R^{n \times n}$ is diagonal with $W(x)_{ii} = \frac{1}{h} K\left(\frac{X_i - x}{h}\right)$.
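Written out as an optimization problem, the local linear fit at $x$ can be sketched as follows. The coefficient names $\beta_0, \beta_1$ are notation introduced here for illustration, and $Y_i$ denotes the $i$th entry of $Y$:

\[\big(\widehat \beta_0(x), \widehat \beta_1(x)\big) = \operatorname*{arg\,min}_{\beta_0, \beta_1} \sum_{i=1}^n \frac{1}{h} K\left(\frac{X_i - x}{h}\right) \left(Y_i - \beta_0 - \beta_1 \frac{X_i - x}{h}\right)^2, \qquad \widehat \mu(x) = \widehat \beta_0(x).\]

The weighted least-squares solution is $\widehat \beta(x) = \big(P(x)^\T W(x) P(x)\big)^{-1} P(x)^\T W(x) Y$, and multiplying by $e_1^\T$ picks out the intercept. Dropping the slope term leaves a weighted mean, which is exactly the Nadaraya–Watson estimator.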
As seen in Figure 2, the local linear smoother is able to reproduce the linear trend at boundaries and thus accounts for boundary bias much better than the Nadaraya–Watson estimator.
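A minimal NumPy sketch of the local linear smoother, under stated assumptions: the function names are illustrative, and the test data is the hypothetical noiseless line $\mu(x) = 1 - x$ rather than the data in the figures. Because the truth is exactly linear, the smoother should reproduce it up to rounding error, including at both boundaries.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def local_linear(x_eval, x, y, h):
    """Local linear smoother: kernel-weighted least squares at each point."""
    fits = np.empty(len(x_eval))
    for k, x0 in enumerate(x_eval):
        u = (x - x0) / h
        w = epanechnikov(u) / h                    # diagonal of W(x0)
        P = np.column_stack([np.ones_like(u), u])  # design matrix P(x0)
        PtW = P.T * w                              # P^T W without forming W
        beta = np.linalg.solve(PtW @ P, PtW @ y)   # weighted least squares
        fits[k] = beta[0]                          # intercept = e_1^T beta
    return fits

# Hypothetical noiseless data: mu(x) = 1 - x on [0, 1].
x = np.linspace(0.0, 1.0, 201)
y = 1.0 - x
x_eval = np.array([0.0, 0.5, 1.0])

fit = local_linear(x_eval, x, y, h=0.3)
max_abs_error = np.max(np.abs(fit - (1.0 - x_eval)))
print(f"largest absolute error: {max_abs_error:.2e}")
```

Solving the small $2 \times 2$ system with `np.linalg.solve` avoids forming an explicit inverse or the full $n \times n$ weight matrix.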
The local polynomial estimator
A subtle bias problem still remains with the estimator depicted in Figure 2. Note how in the center of the plot the estimator is significantly above the regression function. This is because we used a linear (first-order) smoother which is unable to take into account the second-order curvature of the regression function.
We could address this issue by using a local quadratic smoother or even a higher-order polynomial. This leads to the degree-$p$ local polynomial estimator, defined analogously to the local linear smoother as
\[\widehat \mu(x) = e_1^\T \big(P(x)^\T W(x) P(x)\big)^{-1} P(x)^\T W(x) Y\]
where $e_1 = (1, 0, \ldots, 0)^\T \in \R^{p+1}$ is the first standard basis vector, $P(x) \in \R^{n \times (p+1)}$ with $P(x)_{i1} = 1$ and $P(x)_{ij} = \left(\frac{X_i - x}{h}\right)^{j-1}$ for $j = 2, \ldots, p+1$, and $W(x) \in \R^{n \times n}$ is diagonal with $W(x)_{ii} = \frac{1}{h} K\left(\frac{X_i - x}{h}\right)$.
Figure 3 shows that a second-order (quadratic) local polynomial estimator does remove the systematic overestimation due to curvature, but note how the fit is less smooth. This is a general principle: bias reduction comes at the expense of increased variance. Therefore in practice the degree is usually taken as $p=0$ (Nadaraya–Watson) or $p=1$ (local linear smoother) to avoid overfitting.
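Both sides of this trade-off can be illustrated with a short sketch. It assumes a hypothetical noiseless convex function $\mu(x) = x^2$ and illustrative function names. Since the estimator is linear in $Y$, say $\widehat \mu(x) = \sum_i \ell_i(x) Y_i$, its variance under independent noise with common variance $\sigma^2$ is $\sigma^2 \sum_i \ell_i(x)^2$, so the sum of squared weights serves as a noise-free proxy for variance.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def local_poly_weights(x0, x, h, p):
    """Weights l(x0) such that the degree-p estimate at x0 is l(x0) @ y."""
    u = (x - x0) / h
    w = epanechnikov(u) / h                     # diagonal of W(x0)
    P = np.vander(u, N=p + 1, increasing=True)  # columns u^0, u^1, ..., u^p
    PtW = P.T * w                               # P^T W without forming W
    # First row of (P^T W P)^{-1} P^T W, i.e. e_1^T applied to the solution.
    return np.linalg.solve(PtW @ P, PtW)[0]

# Hypothetical noiseless data with curvature: mu(x) = x^2 on [0, 1].
x = np.linspace(0.0, 1.0, 201)
y = x**2
x0, h = 0.5, 0.3

bias, var_factor = {}, {}
for p in (1, 2):
    weights = local_poly_weights(x0, x, h, p)
    bias[p] = weights @ y - x0**2         # pure bias, since y has no noise
    var_factor[p] = weights @ weights     # variance is sigma^2 times this
    print(f"p={p}: bias={bias[p]:+.5f}, variance factor={var_factor[p]:.4f}")
```

In this toy example the local linear fit sits above the convex truth at the interior point, the quadratic fit removes that bias, and the variance factor is larger for $p=2$.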
The bandwidth for a local polynomial estimator can be selected by leave-one-out cross-validation (LOOCV), as was presented in part two.
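As a rough illustration, LOOCV for a local polynomial estimator might look like the brute-force sketch below. It is not necessarily the procedure from part two: the simulated sine data, the bandwidth grid, and the function names are all assumptions made for this example.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def local_poly_fit(x0, x, y, h, p):
    """Degree-p local polynomial estimate of the regression function at x0."""
    u = (x - x0) / h
    w = epanechnikov(u) / h
    P = np.vander(u, N=p + 1, increasing=True)  # columns u^0, ..., u^p
    PtW = P.T * w
    return np.linalg.solve(PtW @ P, PtW @ y)[0]

def loocv_score(x, y, h, p):
    """Mean squared error when each point is predicted without itself."""
    n = len(x)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                # drop the i-th observation
        errors[i] = y[i] - local_poly_fit(x[i], x[keep], y[keep], h, p)
    return np.mean(errors**2)

# Hypothetical simulated data: a noisy sine curve on an even grid.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(x.size)

# Candidate bandwidths, kept wide enough that every local fit has data.
bandwidths = np.linspace(0.05, 0.5, 10)
scores = np.array([loocv_score(x, y, h, p=1) for h in bandwidths])
h_best = bandwidths[np.argmin(scores)]
print(f"selected bandwidth: {h_best:.2f}")
```

Refitting at every left-out point is the simplest way to compute the score and is fast enough at this sample size. The smallest candidate bandwidth must leave at least $p+1$ points with positive weight in every window, otherwise the local least-squares system is singular.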
Next time
In the next and final post we will apply the concepts discussed during the previous three parts to some real-world data and discuss the conclusions.
References

The University of Oxford’s course in Applied and Computational Statistics, taught by Geoff Nicholls in 2018

The Wikipedia article on local regression