Tutorial 2: Regression with kNN and Linear Regression
Author: Alejandro Monroy
In this notebook we will cover two of the most basic regression models: kNN and Linear Regression. Furthermore, we will see some metrics to evaluate regression models.
[1]:
import numpy as np
1. Loading and preparing the data
We will use the diabetes dataset from Sklearn as we did in the previous tutorial. This time, we will set scaled=True to skip the normalization step:
[2]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
diabetes = datasets.load_diabetes(as_frame=True, scaled=True)
diabetes.data
[2]:
| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005670 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.025930 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 437 | 0.041708 | 0.050680 | 0.019662 | 0.059744 | -0.005697 | -0.002566 | -0.028674 | -0.002592 | 0.031193 | 0.007207 |
| 438 | -0.005515 | 0.050680 | -0.015906 | -0.067642 | 0.049341 | 0.079165 | -0.028674 | 0.034309 | -0.018114 | 0.044485 |
| 439 | 0.041708 | 0.050680 | -0.015906 | 0.017293 | -0.037344 | -0.013840 | -0.024993 | -0.011080 | -0.046883 | 0.015491 |
| 440 | -0.045472 | -0.044642 | 0.039062 | 0.001215 | 0.016318 | 0.015283 | -0.028674 | 0.026560 | 0.044529 | -0.025930 |
| 441 | -0.045472 | -0.044642 | -0.073030 | -0.081413 | 0.083740 | 0.027809 | 0.173816 | -0.039493 | -0.004222 | 0.003064 |
442 rows × 10 columns
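For reference, Sklearn's documentation describes the scaled features as mean-centered and divided by the standard deviation times the square root of the number of samples, so that the squared values of each column sum to 1. The following is a minimal sketch of that transformation on a made-up column; the helper name `scale_like_diabetes` and the input values are illustrative assumptions, not part of Sklearn:

```python
import numpy as np

def scale_like_diabetes(column):
    """Mean-center a column and scale it so its squared values sum to 1."""
    centered = column - column.mean()
    # Population standard deviation (ddof=0) times sqrt(n_samples)
    scale = column.std() * np.sqrt(len(column))
    return centered / scale

raw = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical raw feature values
scaled = scale_like_diabetes(raw)
print("Mean:", scaled.mean())
print("Sum of squares:", np.sum(scaled ** 2))
```

This is why the values in the table above are small and centered around zero.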
[3]:
X = diabetes.data.values
y = diabetes.target.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
2. K-Nearest Neighbors
K-Nearest Neighbors (kNN) is a simple, yet powerful, algorithm used for both classification and regression tasks. In regression, kNN predicts the target value of a new point as the average of the target values of its k nearest neighbors in the training set. The “neighbors” are determined by calculating the distance between data points, typically using Euclidean distance. kNN regression is non-parametric, meaning it makes no assumptions about the underlying data distribution, making it versatile for various types of data.
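To make the idea concrete, here is a toy sketch with a single feature. The training values and the helper name `knn_predict_1d` are made up for illustration; with one feature, the distance between two points is simply the absolute difference:

```python
# Hypothetical 1-D training set: feature values and their targets
train_x = [1.0, 2.0, 3.0, 10.0]
train_y = [10.0, 20.0, 30.0, 100.0]

def knn_predict_1d(query, k):
    """Predict by averaging the targets of the k closest training points."""
    # Pair each absolute distance with its target, then sort by distance
    by_distance = sorted(zip((abs(x - query) for x in train_x), train_y))
    nearest_targets = [target for _, target in by_distance[:k]]
    return sum(nearest_targets) / k

print(knn_predict_1d(2.4, k=2))  # neighbors are x=2 and x=3
print(knn_predict_1d(2.4, k=3))  # adds x=1 as the third neighbor
```

Note how the prediction changes with k: a larger k averages over more (and more distant) neighbors.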
2.1 Implementation from scratch
An important step in the kNN algorithm is computing distances between data points. We will use the Euclidean distance. The Euclidean distance between two points \(\mathbf{p} = (p_1, p_2, \ldots, p_n)\) and \(\mathbf{q} = (q_1, q_2, \ldots, q_n)\) in Euclidean n-space is given by:
\[d(\mathbf{p}, \mathbf{q}) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}.\]
The following function implements the Euclidean distance by computing the element-wise difference between the points and then computing the norm using np.linalg.norm:
[4]:
def euclidean_distance(p, q):
    """
    Calculates the Euclidean distance between two points.

    Args:
        p (np.ndarray): First point.
        q (np.ndarray): Second point.

    Returns:
        float: Euclidean distance between the two points.
    """
    return np.linalg.norm(p - q)

# Sample usage
point1 = np.array([1, 2, 3])
point2 = np.array([4, 5, 6])
distance = euclidean_distance(point1, point2)
print("Euclidean distance:", distance)
Euclidean distance: 5.196152422706632
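As a check on the printed value, the same distance can be worked out by hand for the sample points \((1, 2, 3)\) and \((4, 5, 6)\):
\[d = \sqrt{(1 - 4)^2 + (2 - 5)^2 + (3 - 6)^2} = \sqrt{9 + 9 + 9} = \sqrt{27} \approx 5.196.\]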
We can now implement the kNN algorithm:
[5]:
import numpy as np

def knn_regressor(X_train, y_train, X_test, k=5):
    """
    K-Nearest Neighbors regressor.

    Args:
        X_train (np.ndarray): Training data features.
        y_train (np.ndarray): Training data labels.
        X_test (np.ndarray): Data to predict.
        k (int): Number of neighbors to use for prediction. Default is 5.

    Returns:
        np.ndarray: Predicted target values.
    """
    # Calculate predictions for each test point
    predictions = []
    for x in X_test:
        # Compute distances from the test point to all training points
        distances = np.linalg.norm(X_train - x, axis=1)
        # Find the indices of the k nearest neighbors
        k_indices = np.argsort(distances)[:k]
        # Get the target values of the k nearest neighbors
        k_nearest_values = y_train[k_indices]
        # Compute the mean of the k nearest values
        prediction = np.mean(k_nearest_values)
        predictions.append(prediction)
    return np.array(predictions)

# Predict on the diabetes dataset
y_pred = knn_regressor(X_train, y_train, X_test, k=2)
print("First 5 predicted labels:", y_pred[:5])
print("First 5 true labels:", y_test[:5])
First 5 predicted labels: [148.5 148. 183. 248.5 123.5]
First 5 true labels: [219. 70. 202. 230. 111.]
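The loop over test points can also be replaced by a single broadcasted computation. The sketch below is one possible variant, not the implementation used in the rest of this notebook; the function name `knn_regressor_vectorized` and the random toy data are assumptions made for illustration:

```python
import numpy as np

def knn_regressor_vectorized(X_train, y_train, X_test, k=5):
    """kNN regression computing all test-to-train distances at once."""
    # (n_test, 1, n_features) minus (1, n_train, n_features)
    # broadcasts to (n_test, n_train, n_features)
    diffs = X_test[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    distances = np.linalg.norm(diffs, axis=2)  # shape (n_test, n_train)
    # Indices of the k smallest distances in each row
    k_indices = np.argsort(distances, axis=1)[:, :k]
    # Average the targets of those neighbors for each test point
    return y_train[k_indices].mean(axis=1)

# Small made-up dataset for illustration
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(20, 3))
y_tr = rng.normal(size=20)
X_te = rng.normal(size=(5, 3))

# Reference: the same per-point logic as the loop-based version
loop_pred = np.array([
    np.mean(y_tr[np.argsort(np.linalg.norm(X_tr - x, axis=1))[:2]])
    for x in X_te
])
vec_pred = knn_regressor_vectorized(X_tr, y_tr, X_te, k=2)
print(np.allclose(loop_pred, vec_pred))
```

The trade-off is memory: the intermediate array holds one difference vector per (test point, training point) pair, so the loop may be preferable for large datasets.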
Recall that in the previous homework assignment we implemented the DummyRegressor mimicking Sklearn’s implementation. Let’s do the same here. In this case, fitting the model just means storing the training set in the regressor object, and predictions are made in a similar way as in our previous implementation, but accessing the training set that is stored in the class:
[6]:
class KNeighborsRegressor:
    """
    K-Nearest Neighbors regressor.
    """

    def __init__(self, n_neighbors=5):
        """
        Initializes the KNeighborsRegressor with the specified number of neighbors.

        Args:
            n_neighbors (int): Number of neighbors to use for prediction. Default is 5.
        """
        self.n_neighbors = n_neighbors
        self.X_train = None
        self.y_train = None

    def fit(self, X, y):
        """
        Fit the KNN regressor on the training data.

        Args:
            X (np.ndarray): Training data features.
            y (np.ndarray): Training data labels.
        """
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        """
        Predict the target for the given data.

        Args:
            X (np.ndarray): Data to predict.

        Returns:
            np.ndarray: Predicted target values.
        """
        predictions = []
        for x in X:
            distances = np.linalg.norm(self.X_train - x, axis=1)
            k_indices = np.argsort(distances)[:self.n_neighbors]
            k_nearest_values = self.y_train[k_indices]
            prediction = np.mean(k_nearest_values)
            predictions.append(prediction)
        return np.array(predictions)

# Sample usage
knn_regressor = KNeighborsRegressor(n_neighbors=2)
knn_regressor.fit(X_train, y_train)
y_pred = knn_regressor.predict(X_test)
print("First 5 predicted labels:", y_pred[:5])
print("First 5 true labels:", y_test[:5])
First 5 predicted labels: [148.5 148. 183. 248.5 123.5]
First 5 true labels: [219. 70. 202. 230. 111.]
2.2. Importing the model from Sklearn
Now that we know how the model works, we can just import it from the sklearn.neighbors module so we don’t have to implement it from scratch every time we use it:
[7]:
from sklearn.neighbors import KNeighborsRegressor
knn_regressor = KNeighborsRegressor(n_neighbors=2)
knn_regressor.fit(X_train, y_train)
y_pred = knn_regressor.predict(X_test)
print("First 5 predicted labels:", y_pred[:5])
print("First 5 true labels:", y_test[:5])
First 5 predicted labels: [148.5 148. 183. 248.5 123.5]
First 5 true labels: [219. 70. 202. 230. 111.]
Our results and Sklearn’s coincide :) Hurray!
3. Evaluating regression models
Evaluating an ML model consists of assessing how well the model’s predictions match the actual values. The most common metrics for evaluating regression models are:
Mean Squared Error (MSE): Measures the average squared difference between the actual and predicted values.
\[MSE = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2.\]
Root Mean Squared Error (RMSE): The square root of the MSE, providing a measure in the same units as the target variable.
R-squared (\(R^2\)): Indicates the proportion of the variance in the dependent variable that is predictable from the independent variables.
\[R^2 = 1 - \frac{\sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2}{\sum_{i=1}^{n} (y_{i} - \bar{\mathbf{y}})^2},\]
where \(n\) is the number of samples, \(y_{i}\) is the true target value for the \(i\)-th sample, \(\hat{y_{i}}\) is the predicted target value for the \(i\)-th sample, and \(\bar{\mathbf{y}}\) is the mean of the true target values.
The implementation for these metrics is available in the metrics module of Sklearn:
[8]:
from sklearn.metrics import mean_squared_error, r2_score
# y_test holds the true target values and y_pred the kNN predictions from the previous cell
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("R-squared:", r2)
Mean Squared Error: 3537.612359550562
Root Mean Squared Error: 59.477830824186604
R-squared: 0.3322931226835779
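To connect these numbers back to the formulas, the metrics can also be computed directly with NumPy. This is a minimal sketch on made-up values; the function names `mse_from_scratch` and `r2_from_scratch` and the toy arrays are assumptions for illustration, not part of Sklearn:

```python
import numpy as np

def mse_from_scratch(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return np.mean((y_true - y_pred) ** 2)

def r2_from_scratch(y_true, y_pred):
    """One minus the ratio of the residual to the total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Made-up values for illustration
y_true_toy = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_toy = np.array([2.5, 0.0, 2.0, 8.0])

mse_toy = mse_from_scratch(y_true_toy, y_pred_toy)
print("MSE:", mse_toy)  # 0.375
print("RMSE:", np.sqrt(mse_toy))
print("R-squared:", r2_from_scratch(y_true_toy, y_pred_toy))

# Always predicting the mean of the true values gives R-squared = 0
baseline_toy = np.full_like(y_true_toy, y_true_toy.mean())
print("Baseline R-squared:", r2_from_scratch(y_true_toy, baseline_toy))
```

The baseline case gives a useful reading of \(R^2\): a value of 0 means the model is no better than always predicting the mean, and a value of 1 means perfect predictions.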
4. Linear Regression
Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. The goal is to find the best-fitting linear equation that describes how the dependent variable changes as the independent variables change. The equation of a multiple linear regression model is:
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon,\]
where:
\(y\) is the dependent variable or target,
\(x_1, x_2, \ldots, x_p\) are the independent variables,
\(\beta_0\) is the y-intercept,
\(\beta_1, \beta_2, \ldots, \beta_p\) are the coefficients,
\(\epsilon\) is the error term.
In the context of Machine Learning, we can use a training set to compute estimates of the parameters (\(\hat{\beta_0}, \hat{\beta_1}, \hat{\beta_2}, \ldots, \hat{\beta_p}\)), which can be used to make predictions for new test points. In this specific case, we could compute them either with the closed-form Ordinary Least Squares solution or with a numerical optimization algorithm such as gradient descent. However, we will not implement this model from scratch; instead, we will directly use Sklearn’s implementation:
[9]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
y_pred = lin_reg.predict(X_test)
print("First 5 predicted labels:", y_pred[:5])
print("First 5 true labels:", y_test[:5])
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r_squared = r2_score(y_test, y_pred)
print(f"\nMean Squared Error (MSE): {mse:.3f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.3f}")
print(f"R-squared: {r_squared:.3f}")
First 5 predicted labels: [139.5475584 179.51720835 134.03875572 291.41702925 123.78965872]
First 5 true labels: [219. 70. 202. 230. 111.]
Mean Squared Error (MSE): 2900.194
Root Mean Squared Error (RMSE): 53.853
R-squared: 0.453
The MSE is lower than for kNN with k=2 (2900.194 vs. 3537.612) and the R-squared is higher (0.453 vs. 0.332), which indicates that this model performs better on this test set.
We can inspect the lin_reg object to find out what the estimates for the parameters are:
[10]:
print("Estimated coefficients: ", lin_reg.coef_)
print("Estimated intercept: ", lin_reg.intercept_)
Estimated coefficients: [ 37.90402135 -241.96436231 542.42875852 347.70384391 -931.48884588
518.06227698 163.41998299 275.31790158 736.1988589 48.67065743]
Estimated intercept: 151.34560453985995
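Estimates like these can be obtained by least squares or by gradient descent. The sketch below illustrates both on synthetic, noise-free data with known parameters, so that the recovered values can be checked. It is not a description of how Sklearn computes its solution, and all names (`X_toy`, `beta_ols`, `beta_gd`, `learning_rate`) and values are assumptions made for illustration:

```python
import numpy as np

# Synthetic, noise-free data with known parameters (illustrative values)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
true_intercept, true_coef = 2.0, np.array([3.0, -1.0])
y_toy = true_intercept + X_toy @ true_coef

# Prepend a column of ones so the intercept is estimated like any coefficient
X_design = np.column_stack([np.ones(len(X_toy)), X_toy])

# 1) Ordinary Least Squares: minimize ||X_design @ beta - y||^2 directly
beta_ols, *_ = np.linalg.lstsq(X_design, y_toy, rcond=None)

# 2) Gradient descent on the mean squared error
beta_gd = np.zeros(X_design.shape[1])
learning_rate = 0.1
for _ in range(2000):
    residuals = X_design @ beta_gd - y_toy
    gradient = 2 * X_design.T @ residuals / len(y_toy)
    beta_gd -= learning_rate * gradient

print("OLS estimates:", beta_ols)
print("Gradient descent estimates:", beta_gd)
```

Both approaches should recover an intercept close to 2 and coefficients close to 3 and -1. Gradient descent needs a suitable learning rate and enough iterations, whereas least squares gives the solution in one step.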
Recall that the features are:
[11]:
diabetes.feature_names
[11]:
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
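Therefore, the formula for the linear regression model (i.e., the one that is applied when we call lin_reg.predict) is, with the estimates rounded to two decimals:
\[\begin{aligned}\hat{y} = {} & 151.35 + 37.90\,\text{age} - 241.96\,\text{sex} + 542.43\,\text{bmi} + 347.70\,\text{bp} - 931.49\,\text{s1} \\ & + 518.06\,\text{s2} + 163.42\,\text{s3} + 275.32\,\text{s4} + 736.20\,\text{s5} + 48.67\,\text{s6}.\end{aligned}\]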
🔎 Observation: When writing math formulas for ML/statistics, we usually use the hat ( \(\hat{ }\) ) to denote estimates/predictions. For example, \(\hat{\beta_1}\) is the estimate of \(\beta_1\) that we learn from data.
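The prediction formula of a fitted linear model can be applied by hand as an intercept plus a dot product. The sketch below uses made-up parameters and test points (`intercept_hat`, `coef_hat`, and `X_new` are hypothetical); with the fitted model from this notebook, the analogous expression would be `lin_reg.intercept_ + X_test @ lin_reg.coef_`:

```python
import numpy as np

# Hypothetical fitted parameters for a model with three features
intercept_hat = 150.0
coef_hat = np.array([40.0, -240.0, 540.0])

# Two made-up test points, one per row
X_new = np.array([
    [0.04, 0.05, 0.06],
    [-0.01, -0.04, -0.05],
])

# y_hat = beta_0_hat + beta_1_hat * x_1 + ... + beta_p_hat * x_p, for every row
y_hat = intercept_hat + X_new @ coef_hat
print(y_hat)
```

Each prediction is the intercept plus the sum of every feature value multiplied by its estimated coefficient.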