Tutorial 2: Regression with kNN and Linear Regression

Author: Alejandro Monroy

In this notebook we will cover two of the most basic regression models: kNN and Linear Regression. Furthermore, we will see some metrics to evaluate regression models.

[1]:
import numpy as np

1. Loading and preparing the data

We will use the diabetes dataset from Sklearn as we did in the previous tutorial. This time, we will set scaled=True to skip the normalization step:

[2]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes(as_frame=True, scaled=True)
diabetes.data
[2]:
age sex bmi bp s1 s2 s3 s4 s5 s6
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068332 -0.092204
2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592 0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022688 -0.009362
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031988 -0.046641
... ... ... ... ... ... ... ... ... ... ...
437 0.041708 0.050680 0.019662 0.059744 -0.005697 -0.002566 -0.028674 -0.002592 0.031193 0.007207
438 -0.005515 0.050680 -0.015906 -0.067642 0.049341 0.079165 -0.028674 0.034309 -0.018114 0.044485
439 0.041708 0.050680 -0.015906 0.017293 -0.037344 -0.013840 -0.024993 -0.011080 -0.046883 0.015491
440 -0.045472 -0.044642 0.039062 0.001215 0.016318 0.015283 -0.028674 0.026560 0.044529 -0.025930
441 -0.045472 -0.044642 -0.073030 -0.081413 0.083740 0.027809 0.173816 -0.039493 -0.004222 0.003064

442 rows × 10 columns
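As a rough illustration of what the scaled features look like, the sketch below applies the same kind of transformation to a small, made-up matrix: each column is mean-centered and divided by its standard deviation times the square root of the number of samples, so that every column has zero mean and a sum of squares equal to 1. The array `X_raw` and its values are hypothetical, and the sketch is not meant to reproduce Sklearn's internal code.

```python
import numpy as np

# Hypothetical raw feature matrix (5 samples, 2 features), for illustration only
X_raw = np.array([[59.0, 32.1],
                  [48.0, 21.6],
                  [72.0, 30.5],
                  [24.0, 25.3],
                  [50.0, 23.0]])

n_samples = X_raw.shape[0]

# Center each column, then divide by (standard deviation * sqrt(n_samples))
X_centered = X_raw - X_raw.mean(axis=0)
X_scaled = X_centered / (X_raw.std(axis=0) * np.sqrt(n_samples))

print("Column means:", X_scaled.mean(axis=0))                  # approximately 0
print("Column sums of squares:", (X_scaled ** 2).sum(axis=0))  # approximately 1
```

This explains why the values in the table above are so small: with 442 samples, each entry is divided by roughly 21 standard deviations.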

[3]:
X = diabetes.data.values
y = diabetes.target.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

2. K-Nearest Neighbors

K-Nearest Neighbors (kNN) is a simple yet powerful algorithm used for both classification and regression tasks. In regression, kNN predicts the target value of a new point as the average of the target values of its k nearest neighbors in the training set. The “neighbors” are determined by calculating the distance between data points, typically using the Euclidean distance. kNN regression is non-parametric, meaning it makes no assumptions about the underlying data distribution, which makes it versatile for various types of data.

2.1 Implementation from scratch

An important step in the kNN algorithm is computing distances between data points. We will use the Euclidean distance. The Euclidean distance between two points \(\mathbf{p} = (p_1, p_2, \ldots, p_n)\) and \(\mathbf{q} = (q_1, q_2, \ldots, q_n)\) in Euclidean n-space is given by:

\[d(\mathbf{p}, \mathbf{q}) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2}\]

The following function implements the Euclidean distance by computing the element-wise difference between the points and then taking its norm with np.linalg.norm:

[4]:
def euclidean_distance(p, q):
    """
    Calculates the Euclidean distance between two points.

    Args:
        p (np.ndarray): First point.
        q (np.ndarray): Second point.

    Returns:
        float: Euclidean distance between the two points.
    """
    return np.linalg.norm(p - q)

# Sample usage
point1 = np.array([1, 2, 3])
point2 = np.array([4, 5, 6])
distance = euclidean_distance(point1, point2)
print("Euclidean distance:", distance)
Euclidean distance: 5.196152422706632
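Since kNN needs the distance from a query point to many training points, it can help to see how the same formula extends to whole arrays. The following is a minimal sketch using NumPy broadcasting; the function name `pairwise_euclidean` and the sample arrays are hypothetical and only serve as an illustration.

```python
import numpy as np

def pairwise_euclidean(A, B):
    """
    Euclidean distances between every row of A and every row of B.

    Args:
        A (np.ndarray): Array of shape (m, d).
        B (np.ndarray): Array of shape (n, d).

    Returns:
        np.ndarray: Array of shape (m, n) where entry (i, j) is d(A[i], B[j]).
    """
    # Broadcasting creates an (m, n, d) array of coordinate differences
    diff = A[:, np.newaxis, :] - B[np.newaxis, :, :]
    # Sum the squared differences over the last axis and take the square root
    return np.sqrt((diff ** 2).sum(axis=-1))

A = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
B = np.array([[4.0, 5.0, 6.0], [0.0, 3.0, 4.0]])
D = pairwise_euclidean(A, B)
print(D)
```

The entry `D[0, 0]` is the distance between the two points of the sample usage above, \(\sqrt{27} \approx 5.196\), and `D[1, 1]` is the distance from the origin to \((0, 3, 4)\), which is 5.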

We can now implement the kNN algorithm:

[5]:
import numpy as np

def knn_regressor(X_train, y_train, X_test, k=5):
    """
    K-Nearest Neighbors regressor.

    Args:
        X_train (np.ndarray): Training data features.
        y_train (np.ndarray): Training data labels.
        X_test (np.ndarray): Data to predict.
        k (int): Number of neighbors to use for prediction. Default is 5.

    Returns:
        np.ndarray: Predicted target values.
    """
    # Calculate predictions for each test point
    predictions = []
    for x in X_test:
        # Compute distances from the test point to all training points
        distances = np.linalg.norm(X_train - x, axis=1)
        # Find the indices of the k nearest neighbors
        k_indices = np.argsort(distances)[:k]
        # Get the target values of the k nearest neighbors
        k_nearest_values = y_train[k_indices]
        # Compute the mean of the k nearest values
        prediction = np.mean(k_nearest_values)
        predictions.append(prediction)

    return np.array(predictions)

# Predict on the diabetes dataset
y_pred = knn_regressor(X_train, y_train, X_test, k=2)
print("First 5 predicted labels:", y_pred[:5])
print("First 5 true labels:", y_test[:5])
First 5 predicted labels: [148.5 148.  183.  248.5 123.5]
First 5 true labels: [219.  70. 202. 230. 111.]
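The loop above handles one test point at a time. As a possible alternative, the sketch below computes all distances at once and selects the neighbors with np.argpartition, which avoids fully sorting each row. The name `knn_regressor_vectorized` and the toy data are hypothetical. The sketch assumes that the full distance matrix of shape (n_test, n_train) fits in memory, and ties between equally distant neighbors may be broken differently than with np.argsort.

```python
import numpy as np

def knn_regressor_vectorized(X_train, y_train, X_test, k=5):
    """
    Vectorized kNN regression (illustrative sketch).

    Args:
        X_train (np.ndarray): Training features, shape (n_train, d).
        y_train (np.ndarray): Training targets, shape (n_train,).
        X_test (np.ndarray): Points to predict, shape (n_test, d).
        k (int): Number of neighbors. Must satisfy 1 <= k <= n_train.

    Returns:
        np.ndarray: Predicted target values, shape (n_test,).
    """
    # Distance from every test point to every training point: (n_test, n_train)
    diff = X_test[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    distances = np.linalg.norm(diff, axis=2)
    # Indices of the k smallest distances in each row (in no particular order)
    k_indices = np.argpartition(distances, k - 1, axis=1)[:, :k]
    # Average the targets of the selected neighbors
    return y_train[k_indices].mean(axis=1)

# Toy 1-D data where the result can be checked by hand
X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy = np.array([0.0, 10.0, 20.0, 100.0])
X_new = np.array([[0.4], [9.0]])

preds = knn_regressor_vectorized(X_toy, y_toy, X_new, k=2)
print(preds)  # [ 5. 60.]
```

For the query 0.4, the two nearest training points are 0 and 1, so the prediction is the mean of 0 and 10. For the query 9.0, the nearest points are 10 and 2, so the prediction is the mean of 100 and 20.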

Recall that in the previous homework assignment we implemented the DummyRegressor mimicking Sklearn’s implementation. Let’s do the same here. In this case, fitting the model just means storing the training set in the regressor object, and predictions are made in a similar way as in our previous implementation, but accessing the training set that is stored in the class:

[6]:
class KNeighborsRegressor:
    """
    K-Nearest Neighbors regressor.
    """

    def __init__(self, n_neighbors=5):
        """
        Initializes the KNeighborsRegressor with the specified number of neighbors.

        Args:
            n_neighbors (int): Number of neighbors to use for prediction. Default is 5.
        """
        self.n_neighbors = n_neighbors
        self.X_train = None
        self.y_train = None

    def fit(self, X, y):
        """
        Fit the KNN regressor on the training data.

        Args:
            X (np.ndarray): Training data features.
            y (np.ndarray): Training data labels.
        """
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        """
        Predict the target for the given data.

        Args:
            X (np.ndarray): Data to predict.

        Returns:
            np.ndarray: Predicted target values.
        """
        predictions = []
        for x in X:
            distances = np.linalg.norm(self.X_train - x, axis=1)
            k_indices = np.argsort(distances)[:self.n_neighbors]
            k_nearest_values = self.y_train[k_indices]
            prediction = np.mean(k_nearest_values)
            predictions.append(prediction)
        return np.array(predictions)

# Sample usage
knn_regressor = KNeighborsRegressor(n_neighbors=2)
knn_regressor.fit(X_train, y_train)
y_pred = knn_regressor.predict(X_test)
print("First 5 predicted labels:", y_pred[:5])
print("First 5 true labels:", y_test[:5])
First 5 predicted labels: [148.5 148.  183.  248.5 123.5]
First 5 true labels: [219.  70. 202. 230. 111.]

2.2. Importing the model from Sklearn

Now that we know how the model works, we can simply import it from the sklearn.neighbors module so that we don’t have to implement it from scratch every time we use it:

[7]:
from sklearn.neighbors import KNeighborsRegressor

knn_regressor = KNeighborsRegressor(n_neighbors=2)
knn_regressor.fit(X_train, y_train)
y_pred = knn_regressor.predict(X_test)
print("First 5 predicted labels:", y_pred[:5])
print("First 5 true labels:", y_test[:5])
First 5 predicted labels: [148.5 148.  183.  248.5 123.5]
First 5 true labels: [219.  70. 202. 230. 111.]

Our results and Sklearn’s coincide :) Hurray!

3. Evaluating regression models

Evaluating an ML model consists of assessing how well the model’s predictions match the actual values. The most common metrics for evaluating regression models are:

  • Mean Squared Error (MSE): Measures the average squared difference between the actual and predicted values.

    \[MSE = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2.\]
  • Root Mean Squared Error (RMSE): The square root of the MSE, providing a measure in the same units as the target variable.

    \[RMSE = \sqrt{MSE}.\]
  • R-squared (\(R^2\)): Indicates the proportion of the variance in the dependent variable that is predictable from the independent variables.

    \[R^2 = 1 - \frac{\sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2}{\sum_{i=1}^{n} (y_{i} - \bar{\mathbf{y}})^2},\]

where \(n\) is the number of samples, \(y_{i}\) is the true target value for the \(i\)-th sample, \(\hat{y_{i}}\) is the predicted target value for the \(i\)-th sample, and \(\bar{\mathbf{y}}\) is the mean of the true target values.
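The three formulas above translate almost directly into NumPy. The following is a minimal sketch on made-up values; the function names `mse_score`, `rmse_score` and `r2_manual`, as well as the toy arrays, are hypothetical and chosen so the results can be verified by hand.

```python
import numpy as np

def mse_score(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    return np.mean((y_true - y_pred) ** 2)

def rmse_score(y_true, y_pred):
    """Square root of the MSE, in the same units as the target."""
    return np.sqrt(mse_score(y_true, y_pred))

def r2_manual(y_true, y_pred):
    """One minus the ratio of residual sum of squares to total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.0, 5.0, 8.0, 9.0])

print("MSE:", mse_score(y_true_toy, y_pred_toy))    # (1 + 0 + 1 + 0) / 4 = 0.5
print("RMSE:", rmse_score(y_true_toy, y_pred_toy))  # sqrt(0.5), about 0.707
print("R-squared:", r2_manual(y_true_toy, y_pred_toy))  # 1 - 2 / 20 = 0.9

# A model that always predicts the mean of the true values gets R-squared = 0
y_mean_toy = np.full_like(y_true_toy, y_true_toy.mean())
print("R-squared of mean predictor:", r2_manual(y_true_toy, y_mean_toy))
```

The last line shows a useful reference point: \(R^2 = 0\) corresponds to always predicting \(\bar{\mathbf{y}}\), and a model that does worse than that gets a negative \(R^2\).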

The implementation for these metrics is available in the metrics module of Sklearn:

[8]:
from sklearn.metrics import mean_squared_error, r2_score

# y_test holds the true targets; y_pred holds the kNN (k=2) predictions from the previous cell
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("R-squared:", r2)

Mean Squared Error: 3537.612359550562
Root Mean Squared Error: 59.477830824186604
R-squared: 0.3322931226835779

4. Linear Regression

Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. The goal is to find the best-fitting linear equation that describes how the dependent variable changes as the independent variables change. The equation of a multiple linear regression model is:

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon\]

where:

  • \(y\) is the dependent variable or target,

  • \(x_1, x_2, \ldots, x_p\) are the independent variables,

  • \(\beta_0\) is the y-intercept,

  • \(\beta_1, \beta_2, \ldots, \beta_p\) are the coefficients,

  • \(\epsilon\) is the error term.

In the context of Machine Learning, we can use a training set to compute estimates of the parameters (\(\hat{\beta_0}, \hat{\beta_1}, \hat{\beta_2}, \ldots, \hat{\beta_p}\)), which can then be used to make predictions for new test points. These estimates can be computed either in closed form with the Ordinary Least Squares (OLS) method or iteratively with a numerical optimization algorithm such as gradient descent. We will not implement this model from scratch; instead, we will directly use Sklearn’s implementation:

[9]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
y_pred = lin_reg.predict(X_test)

print("First 5 predicted labels:", y_pred[:5])
print("First 5 true labels:", y_test[:5])

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r_squared = r2_score(y_test, y_pred)
print(f"\nMean Squared Error (MSE): {mse:.3f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.3f}")
print(f"R-squared: {r_squared:.3f}")
First 5 predicted labels: [139.5475584  179.51720835 134.03875572 291.41702925 123.78965872]
First 5 true labels: [219.  70. 202. 230. 111.]

Mean Squared Error (MSE): 2900.194
Root Mean Squared Error (RMSE): 53.853
R-squared: 0.453

The MSE is lower than for kNN (with k=2) and the R-squared is higher, which indicates that linear regression performs better on this test split.
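Although this tutorial relies on Sklearn for fitting, the two estimation strategies mentioned earlier (a least-squares solve and gradient descent) can be sketched in a few lines of NumPy. The example below is an illustration on synthetic data, not Sklearn's implementation: the data, the variable names, the learning rate and the number of iterations are all assumptions picked for this toy problem.

```python
import numpy as np

# Synthetic data (hypothetical): 200 samples, 3 features, known parameters
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))
true_beta = np.array([4.0, 2.0, -1.0, 0.5])  # intercept first, then coefficients
noise = rng.normal(scale=0.1, size=200)
y_syn = true_beta[0] + X_syn @ true_beta[1:] + noise

# Prepend a column of ones so the intercept is estimated like any other parameter
X_design = np.column_stack([np.ones(len(X_syn)), X_syn])

# Strategy 1: least-squares solve (minimizes the sum of squared residuals)
beta_ols, *_ = np.linalg.lstsq(X_design, y_syn, rcond=None)

# Strategy 2: gradient descent on the mean squared error
beta_gd = np.zeros(X_design.shape[1])
learning_rate = 0.1   # assumed value, suitable for this well-scaled toy problem
for _ in range(2000):
    residuals = X_design @ beta_gd - y_syn
    gradient = 2 * X_design.T @ residuals / len(y_syn)
    beta_gd -= learning_rate * gradient

print("True parameters:  ", true_beta)
print("Least squares:    ", beta_ols)
print("Gradient descent: ", beta_gd)
```

On this problem both strategies should arrive at essentially the same estimates, close to the parameters used to generate the data. Gradient descent is more sensitive to feature scaling and to the choice of learning rate, which is one reason a direct solve is attractive when the dataset is small.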

We can inspect the lin_reg object to see the estimated parameters:

[10]:
print("Estimated coefficients: ", lin_reg.coef_)
print("Estimated intercept: ", lin_reg.intercept_)
Estimated coefficients:  [  37.90402135 -241.96436231  542.42875852  347.70384391 -931.48884588
  518.06227698  163.41998299  275.31790158  736.1988589    48.67065743]
Estimated intercept:  151.34560453985995

Recall that the features are:

[11]:
diabetes.feature_names
[11]:
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

Therefore, the formula for the linear regression model (i.e., the one that is applied when we call lin_reg.predict) is

\[\begin{aligned} \hat{y} = & \ 151.35 + 37.90 \cdot \text{age} - 241.96 \cdot \text{sex} + 542.43 \cdot \text{bmi} \\ & + 347.70 \cdot \text{bp} - 931.49 \cdot \text{s1} + 518.06 \cdot \text{s2} \\ & + 163.42 \cdot \text{s3} + 275.32 \cdot \text{s4} + 736.20 \cdot \text{s5} \\ & + 48.67 \cdot \text{s6} \end{aligned}\]
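To make the formula concrete, the sketch below evaluates it by hand for a single sample. The coefficients and intercept are copied from the printed output above, and the feature vector is row 0 of diabetes.data as displayed in the first table. Because the displayed values are rounded, the result may differ slightly from what lin_reg.predict returns for the same row.

```python
import numpy as np

# Values copied from the printed lin_reg.coef_ and lin_reg.intercept_
coef = np.array([37.90402135, -241.96436231, 542.42875852, 347.70384391,
                 -931.48884588, 518.06227698, 163.41998299, 275.31790158,
                 736.1988589, 48.67065743])
intercept = 151.34560453985995

# Row 0 of diabetes.data as displayed (age, sex, bmi, bp, s1, ..., s6), rounded
x0 = np.array([0.038076, 0.050680, 0.061696, 0.021872, -0.044223,
               -0.034821, -0.043401, -0.002592, 0.019907, -0.017646])

# Intercept plus the dot product of features and coefficients
y_hat = intercept + x0 @ coef
print(f"Manual prediction for row 0: {y_hat:.2f}")
```

The dot product `x0 @ coef` is just a compact way of writing the sum of the ten products in the formula.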

🔎 Observation: When writing math formulas for ML/statistics, we usually use the hat ( \(\hat{ }\) ) to denote estimates/predictions. For example, \(\hat{\beta_1}\) is the estimate of \(\beta_1\) that we learn from data.