{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial 2: Regression with kNN and Linear Regression\n", "[![View notebooks on Github](https://img.shields.io/static/v1.svg?logo=github&label=Repo&message=View%20On%20Github&color=lightgrey)](https://github.com/amonroym99/uva-applied-ml/blob/main/docs/notebooks/2_reg_knn_linreg.ipynb)\n", "\n", "**Author:** Alejandro Monroy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we will cover two of the most basic regression models: kNN and Linear Regression. Furthermore, we will see some metrics to evaluate regression models." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Loading and preparing the data\n", "We will use the `diabetes` dataset from Sklearn as we did in the previous tutorial. This time, we will set `scaled=True` to skip the normalization step:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesexbmibps1s2s3s4s5s6
00.0380760.0506800.0616960.021872-0.044223-0.034821-0.043401-0.0025920.019907-0.017646
1-0.001882-0.044642-0.051474-0.026328-0.008449-0.0191630.074412-0.039493-0.068332-0.092204
20.0852990.0506800.044451-0.005670-0.045599-0.034194-0.032356-0.0025920.002861-0.025930
3-0.089063-0.044642-0.011595-0.0366560.0121910.024991-0.0360380.0343090.022688-0.009362
40.005383-0.044642-0.0363850.0218720.0039350.0155960.008142-0.002592-0.031988-0.046641
.................................
4370.0417080.0506800.0196620.059744-0.005697-0.002566-0.028674-0.0025920.0311930.007207
438-0.0055150.050680-0.015906-0.0676420.0493410.079165-0.0286740.034309-0.0181140.044485
4390.0417080.050680-0.0159060.017293-0.037344-0.013840-0.024993-0.011080-0.0468830.015491
440-0.045472-0.0446420.0390620.0012150.0163180.015283-0.0286740.0265600.044529-0.025930
441-0.045472-0.044642-0.073030-0.0814130.0837400.0278090.173816-0.039493-0.0042220.003064
\n", "

442 rows × 10 columns

\n", "
" ], "text/plain": [ " age sex bmi bp s1 s2 s3 \\\n", "0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 \n", "1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 \n", "2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356 \n", "3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 \n", "4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 \n", ".. ... ... ... ... ... ... ... \n", "437 0.041708 0.050680 0.019662 0.059744 -0.005697 -0.002566 -0.028674 \n", "438 -0.005515 0.050680 -0.015906 -0.067642 0.049341 0.079165 -0.028674 \n", "439 0.041708 0.050680 -0.015906 0.017293 -0.037344 -0.013840 -0.024993 \n", "440 -0.045472 -0.044642 0.039062 0.001215 0.016318 0.015283 -0.028674 \n", "441 -0.045472 -0.044642 -0.073030 -0.081413 0.083740 0.027809 0.173816 \n", "\n", " s4 s5 s6 \n", "0 -0.002592 0.019907 -0.017646 \n", "1 -0.039493 -0.068332 -0.092204 \n", "2 -0.002592 0.002861 -0.025930 \n", "3 0.034309 0.022688 -0.009362 \n", "4 -0.002592 -0.031988 -0.046641 \n", ".. ... ... ... \n", "437 -0.002592 0.031193 0.007207 \n", "438 0.034309 -0.018114 0.044485 \n", "439 -0.011080 -0.046883 0.015491 \n", "440 0.026560 0.044529 -0.025930 \n", "441 -0.039493 -0.004222 0.003064 \n", "\n", "[442 rows x 10 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import datasets\n", "from sklearn.model_selection import train_test_split\n", "\n", "diabetes = datasets.load_diabetes(as_frame=True, scaled=True)\n", "diabetes.data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "X = diabetes.data.values\n", "y = diabetes.target.values\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. K-Nearest Neighbors\n", "K-Nearest Neighbors (kNN) is a simple, yet powerful, algorithm used for both classification and regression tasks. In regression, kNN predicts the value of a target variable based on the average of the values of its k-nearest neighbors. The \"neighbors\" are determined by calculating the distance between data points, typically using Euclidean distance. KNN regression is non-parametric, meaning it makes no assumptions about the underlying data distribution, making it versatile for various types of data.\n", "\n", "### 2.1 Implementation from scratch\n", "An important step in the k-NN algorithm is computing distances between datapoints. We will use the euclidean distance. The Euclidean distance between two points $\\mathbf{p} = (p_1, p_2, \\ldots, p_n)$ and $\\mathbf{q} = (q_1, q_2, \\ldots, q_n)$ in Euclidean n-space is given by:\n", "\n", "\n", "$$ d(\\mathbf{p}, \\mathbf{q}) = \\sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \\cdots + (p_n - q_n)^2} $$\n", "\n", "The following function implements the euclidean distance by computing the element-wise difference between the points and then computing the norm using `np.linalg.norm`:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Euclidean distance: 5.196152422706632\n" ] } ], "source": [ "def euclidean_distance(p, q):\n", " \"\"\"\n", " Calculates the Euclidean distance between two points.\n", "\n", " Args:\n", " p (np.ndarray): First point.\n", " q (np.ndarray): Second point.\n", "\n", " Returns:\n", " float: Euclidean distance between the two points.\n", " \"\"\"\n", " return np.linalg.norm(p - q)\n", "\n", "# Sample usage\n", "point1 = np.array([1, 2, 3])\n", "point2 = np.array([4, 5, 6])\n", "distance = euclidean_distance(point1, point2)\n", "print(\"Euclidean distance:\", distance)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now implement the k-NN algorithm:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First 5 predicted labels: [148.5 148. 183. 248.5 123.5]\n", "First 5 true labels: [219. 70. 202. 230. 111.]\n" ] } ], "source": [ "import numpy as np\n", "\n", "def knn_regressor(X_train, y_train, X_test, k=5):\n", " \"\"\"\n", " K-Nearest Neighbors regressor.\n", "\n", " Args:\n", " X_train (np.ndarray): Training data features.\n", " y_train (np.ndarray): Training data labels.\n", " X_test (np.ndarray): Data to predict.\n", " k (int): Number of neighbors to use for prediction. Default is 5.\n", "\n", " Returns:\n", " np.ndarray: Predicted target values.\n", " \"\"\"\n", " # Calculate predictions for each test point\n", " predictions = []\n", " for x in X_test:\n", " # Compute distances from the test point to all training points\n", " distances = np.linalg.norm(X_train - x, axis=1)\n", " # Find the indices of the k nearest neighbors\n", " k_indices = np.argsort(distances)[:k]\n", " # Get the target values of the k nearest neighbors\n", " k_nearest_values = y_train[k_indices]\n", " # Compute the mean of the k nearest values\n", " prediction = np.mean(k_nearest_values)\n", " predictions.append(prediction)\n", " \n", " return np.array(predictions)\n", "\n", "# Predict on the diabetes dataset\n", "y_pred = knn_regressor(X_train, y_train, X_test, k=2)\n", "print(\"First 5 predicted labels:\", y_pred[:5])\n", "print(\"First 5 true labels:\", y_test[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that in the previous homework assignment we implemented the `DummyRegressor` mimicking Sklearn's implementation. Let's do the same here. In this case, fitting the model just means storing the training set in the regressor object, and predictions are made in a similar way as in our previous implementation, but accessing the training set that is stored in the class:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First 5 predicted labels: [148.5 148. 183. 248.5 123.5]\n", "First 5 true labels: [219. 70. 202. 230. 111.]\n" ] } ], "source": [ "class KNeighborsRegressor:\n", " \"\"\"\n", " K-Nearest Neighbors regressor.\n", " \"\"\"\n", "\n", " def __init__(self, n_neighbors=5):\n", " \"\"\"\n", " Initializes the KNeighborsRegressor with the specified number of neighbors.\n", "\n", " Args:\n", " n_neighbors (int): Number of neighbors to use for prediction. Default is 5.\n", " \"\"\"\n", " self.n_neighbors = n_neighbors\n", " self.X_train = None\n", " self.y_train = None\n", "\n", " def fit(self, X, y):\n", " \"\"\"\n", " Fit the KNN regressor on the training data.\n", "\n", " Args:\n", " X (np.ndarray): Training data features.\n", " y (np.ndarray): Training data labels.\n", " \"\"\"\n", " self.X_train = X\n", " self.y_train = y\n", "\n", " def predict(self, X):\n", " \"\"\"\n", " Predict the target for the given data.\n", "\n", " Args:\n", " X (np.ndarray): Data to predict.\n", "\n", " Returns:\n", " np.ndarray: Predicted target values.\n", " \"\"\"\n", " predictions = []\n", " for x in X:\n", " distances = np.linalg.norm(self.X_train - x, axis=1)\n", " k_indices = np.argsort(distances)[:self.n_neighbors]\n", " k_nearest_values = self.y_train[k_indices]\n", " prediction = np.mean(k_nearest_values)\n", " predictions.append(prediction)\n", " return np.array(predictions)\n", " \n", "# Sample usage\n", "knn_regressor = KNeighborsRegressor(n_neighbors=2)\n", "knn_regressor.fit(X_train, y_train)\n", "y_pred = knn_regressor.predict(X_test)\n", "print(\"First 5 predicted labels:\", y_pred[:5])\n", "print(\"First 5 true labels:\", y_test[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2. Importing the model from Sklearn\n", "Now that we know how the model works, we can just import it from the `sklearn.neighbors` module so we don't have to implement it from scratch everytime we use it:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First 5 predicted labels: [148.5 148. 183. 248.5 123.5]\n", "First 5 true labels: [219. 70. 202. 230. 111.]\n" ] } ], "source": [ "from sklearn.neighbors import KNeighborsRegressor\n", "\n", "knn_regressor = KNeighborsRegressor(n_neighbors=2)\n", "knn_regressor.fit(X_train, y_train)\n", "y_pred = knn_regressor.predict(X_test)\n", "print(\"First 5 predicted labels:\", y_pred[:5])\n", "print(\"First 5 true labels:\", y_test[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our results and Sklearn\"s coincide :) Hurray!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Evaluating regression models\n", "Evaluating a ML model consists assessing how well the model's predictions match the actual values. The most common metrics for evaluating regression models are:\n", "\n", "- **Mean Squared Error (MSE)**: Measures the average squared difference between the actual and predicted values.\n", " \n", " $$MSE = \\frac{1}{n} \\sum_{i=1}^{n} (y_{i} - \\hat{y_{i}})^2.$$\n", "\n", " - **Root Mean Squared Error (RMSE)**: The square root of the MSE, providing a measure in the same units as the target variable.\n", "\n", " $$RMSE = \\sqrt{MSE}.$$\n", "\n", "- **R-squared (**$R^2$**)**: Indicates the proportion of the variance in the dependent variable that is predictable from the independent variables.\n", " \n", " $$R^2 = 1 - \\frac{\\sum_{i=1}^{n} (y_{i} - \\hat{y_{i}})^2}{\\sum_{i=1}^{n} (y_{i} - \\bar{\\mathbf{y}})^2},$$\n", "\n", "where $n$ is the number of samples, $y_{i}$ is the true target value for the $i$-th sample, $\\hat{y_{i}}$ is the predicted target value for the $i$-th sample, and $\\bar{\\mathbf{y}}$ is the mean of the true target values.\n", "\n", "The implementation for these metrics is available in the `metrics` module of Sklearn:\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean Squared Error: 3537.612359550562\n", "Root Mean Squared Error: 59.477830824186604\n", "R-squared: 0.3322931226835779\n" ] } ], "source": [ "from sklearn.metrics import mean_squared_error, r2_score\n", "\n", "# Assuming y_true and y_pred are the true and predicted target values, respectively\n", "mse = mean_squared_error(y_test, y_pred)\n", "rmse = np.sqrt(mse)\n", "r2 = r2_score(y_test, y_pred)\n", "\n", "print(\"Mean Squared Error:\", mse)\n", "print(\"Root Mean Squared Error:\", rmse)\n", "print(\"R-squared:\", r2)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Linear Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. The goal is to find the best-fitting linear equation that describes how the dependent variable changes as the independent variables change. The equation of a multiple linear regression model is:\n", "\n", "$$ y = \\beta_0 + \\beta_1 x_1 + \\beta_2 x_2 + \\cdots + \\beta_p x_p + \\epsilon $$\n", "\n", "where:\n", "\n", "- $y$ is the dependent variable or target,\n", "- $x_1, x_2, \\ldots, x_p$ are the independent variables,\n", "- $\\beta_0$ is the y-intercept,\n", "- $\\beta_1, \\beta_2, \\ldots, \\beta_p$ are the coefficients,\n", "- $\\epsilon$ is the error term.\n", "\n", "In the context of Machine Learning, we can use a training set to compute estimates of the parameters ($\\hat{\\beta_0}, \\hat{\\beta_1}, \\hat{\\beta_2}, \\ldots, \\hat{\\beta_p}$), which can be used to make predictions from new test points. In this specific case, we could either compute them using the Ordinary Least Squares method or using a numerical optimization algorithm such as gradient descent. However, we will not implement this model form scratch, but we will directly use Sklearn's implementation:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First 5 predicted labels: [139.5475584 179.51720835 134.03875572 291.41702925 123.78965872]\n", "First 5 true labels: [219. 70. 202. 230. 111.]\n", "\n", "Mean Squared Error (MSE): 2900.194\n", "Root Mean Squared Error (RMSE): 53.853\n", "R-squared: 0.453\n" ] } ], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "lin_reg = LinearRegression()\n", "lin_reg.fit(X_train, y_train)\n", "y_pred = lin_reg.predict(X_test)\n", "\n", "print(\"First 5 predicted labels:\", y_pred[:5])\n", "print(\"First 5 true labels:\", y_test[:5])\n", "\n", "mse = mean_squared_error(y_test, y_pred)\n", "rmse = np.sqrt(mse)\n", "r_squared = r2_score(y_test, y_pred)\n", "print(f\"\\nMean Squared Error (MSE): {mse:.3f}\")\n", "print(f\"Root Mean Squared Error (RMSE): {rmse:.3f}\")\n", "print(f\"R-squared: {r_squared:.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The MSE is lower than for kNN and the R-squared is higher, which indicate that this model performs better." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can inspect the `lin_reg` object to find out what are the estimates for the parameters:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Estimated coefficients: [ 37.90402135 -241.96436231 542.42875852 347.70384391 -931.48884588\n", " 518.06227698 163.41998299 275.31790158 736.1988589 48.67065743]\n", "Estimated intercept: 151.34560453985995\n" ] } ], "source": [ "print(\"Estimated coefficients: \", lin_reg.coef_)\n", "print(\"Estimated intercept: \", lin_reg.intercept_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that the features are:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "diabetes.feature_names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Therefore, the formula for the linear regression model (i.e., the one that is applied when we call `lin_reg.predict`) is\n", "\n", "$$\n", "\\begin{align*}\n", "\\hat{y} = & \\ 151.35 + 37.90 \\cdot \\text{age} - 241.96 \\cdot \\text{sex} + 542.43 \\cdot \\text{bmi} \\\\\n", " & + 347.70 \\cdot \\text{bp} - 931.49 \\cdot \\text{s1} + 518.06 \\cdot \\text{s2} \\\\\n", " & + 163.42 \\cdot \\text{s3} + 275.32 \\cdot \\text{s4} + 736.20 \\cdot \\text{s5} \\\\\n", " & + 48.67 \\cdot \\text{s6}\n", "\\end{align*}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "🔎 **Observation:** When writing math formulas for ML/statistics, we usually use the hat ( $\\hat{ }$ ) to denote esimates/predictions. For example $\\hat{\\beta_1}$ is the estimate of $\\beta_1$ that we learn from data." ] } ], "metadata": { "kernelspec": { "display_name": "tml_25", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 2 }