{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial 2: Regression with kNN and Linear Regression\n",
    "[![View notebooks on Github](https://img.shields.io/static/v1.svg?logo=github&label=Repo&message=View%20On%20Github&color=lightgrey)](https://github.com/amonroym99/uva-applied-ml/blob/main/docs/notebooks/2_reg_knn_linreg.ipynb)\n",
    "\n",
    "**Author:** Alejandro Monroy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook we will cover two of the most basic regression models: kNN and Linear Regression. Furthermore, we will see some metrics to evaluate regression models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Loading and preparing the data\n",
    "We will use the `diabetes` dataset from Sklearn as we did in the previous tutorial. This time, we will set `scaled=True` to skip the normalization step:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>bmi</th>\n",
       "      <th>bp</th>\n",
       "      <th>s1</th>\n",
       "      <th>s2</th>\n",
       "      <th>s3</th>\n",
       "      <th>s4</th>\n",
       "      <th>s5</th>\n",
       "      <th>s6</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.038076</td>\n",
       "      <td>0.050680</td>\n",
       "      <td>0.061696</td>\n",
       "      <td>0.021872</td>\n",
       "      <td>-0.044223</td>\n",
       "      <td>-0.034821</td>\n",
       "      <td>-0.043401</td>\n",
       "      <td>-0.002592</td>\n",
       "      <td>0.019907</td>\n",
       "      <td>-0.017646</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.001882</td>\n",
       "      <td>-0.044642</td>\n",
       "      <td>-0.051474</td>\n",
       "      <td>-0.026328</td>\n",
       "      <td>-0.008449</td>\n",
       "      <td>-0.019163</td>\n",
       "      <td>0.074412</td>\n",
       "      <td>-0.039493</td>\n",
       "      <td>-0.068332</td>\n",
       "      <td>-0.092204</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.085299</td>\n",
       "      <td>0.050680</td>\n",
       "      <td>0.044451</td>\n",
       "      <td>-0.005670</td>\n",
       "      <td>-0.045599</td>\n",
       "      <td>-0.034194</td>\n",
       "      <td>-0.032356</td>\n",
       "      <td>-0.002592</td>\n",
       "      <td>0.002861</td>\n",
       "      <td>-0.025930</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-0.089063</td>\n",
       "      <td>-0.044642</td>\n",
       "      <td>-0.011595</td>\n",
       "      <td>-0.036656</td>\n",
       "      <td>0.012191</td>\n",
       "      <td>0.024991</td>\n",
       "      <td>-0.036038</td>\n",
       "      <td>0.034309</td>\n",
       "      <td>0.022688</td>\n",
       "      <td>-0.009362</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.005383</td>\n",
       "      <td>-0.044642</td>\n",
       "      <td>-0.036385</td>\n",
       "      <td>0.021872</td>\n",
       "      <td>0.003935</td>\n",
       "      <td>0.015596</td>\n",
       "      <td>0.008142</td>\n",
       "      <td>-0.002592</td>\n",
       "      <td>-0.031988</td>\n",
       "      <td>-0.046641</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>437</th>\n",
       "      <td>0.041708</td>\n",
       "      <td>0.050680</td>\n",
       "      <td>0.019662</td>\n",
       "      <td>0.059744</td>\n",
       "      <td>-0.005697</td>\n",
       "      <td>-0.002566</td>\n",
       "      <td>-0.028674</td>\n",
       "      <td>-0.002592</td>\n",
       "      <td>0.031193</td>\n",
       "      <td>0.007207</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>438</th>\n",
       "      <td>-0.005515</td>\n",
       "      <td>0.050680</td>\n",
       "      <td>-0.015906</td>\n",
       "      <td>-0.067642</td>\n",
       "      <td>0.049341</td>\n",
       "      <td>0.079165</td>\n",
       "      <td>-0.028674</td>\n",
       "      <td>0.034309</td>\n",
       "      <td>-0.018114</td>\n",
       "      <td>0.044485</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>439</th>\n",
       "      <td>0.041708</td>\n",
       "      <td>0.050680</td>\n",
       "      <td>-0.015906</td>\n",
       "      <td>0.017293</td>\n",
       "      <td>-0.037344</td>\n",
       "      <td>-0.013840</td>\n",
       "      <td>-0.024993</td>\n",
       "      <td>-0.011080</td>\n",
       "      <td>-0.046883</td>\n",
       "      <td>0.015491</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>440</th>\n",
       "      <td>-0.045472</td>\n",
       "      <td>-0.044642</td>\n",
       "      <td>0.039062</td>\n",
       "      <td>0.001215</td>\n",
       "      <td>0.016318</td>\n",
       "      <td>0.015283</td>\n",
       "      <td>-0.028674</td>\n",
       "      <td>0.026560</td>\n",
       "      <td>0.044529</td>\n",
       "      <td>-0.025930</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>441</th>\n",
       "      <td>-0.045472</td>\n",
       "      <td>-0.044642</td>\n",
       "      <td>-0.073030</td>\n",
       "      <td>-0.081413</td>\n",
       "      <td>0.083740</td>\n",
       "      <td>0.027809</td>\n",
       "      <td>0.173816</td>\n",
       "      <td>-0.039493</td>\n",
       "      <td>-0.004222</td>\n",
       "      <td>0.003064</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>442 rows × 10 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "          age       sex       bmi        bp        s1        s2        s3  \\\n",
       "0    0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401   \n",
       "1   -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412   \n",
       "2    0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356   \n",
       "3   -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038   \n",
       "4    0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142   \n",
       "..        ...       ...       ...       ...       ...       ...       ...   \n",
       "437  0.041708  0.050680  0.019662  0.059744 -0.005697 -0.002566 -0.028674   \n",
       "438 -0.005515  0.050680 -0.015906 -0.067642  0.049341  0.079165 -0.028674   \n",
       "439  0.041708  0.050680 -0.015906  0.017293 -0.037344 -0.013840 -0.024993   \n",
       "440 -0.045472 -0.044642  0.039062  0.001215  0.016318  0.015283 -0.028674   \n",
       "441 -0.045472 -0.044642 -0.073030 -0.081413  0.083740  0.027809  0.173816   \n",
       "\n",
       "           s4        s5        s6  \n",
       "0   -0.002592  0.019907 -0.017646  \n",
       "1   -0.039493 -0.068332 -0.092204  \n",
       "2   -0.002592  0.002861 -0.025930  \n",
       "3    0.034309  0.022688 -0.009362  \n",
       "4   -0.002592 -0.031988 -0.046641  \n",
       "..        ...       ...       ...  \n",
       "437 -0.002592  0.031193  0.007207  \n",
       "438  0.034309 -0.018114  0.044485  \n",
       "439 -0.011080 -0.046883  0.015491  \n",
       "440  0.026560  0.044529 -0.025930  \n",
       "441 -0.039493 -0.004222  0.003064  \n",
       "\n",
       "[442 rows x 10 columns]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import datasets\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "diabetes = datasets.load_diabetes(as_frame=True, scaled=True)\n",
    "diabetes.data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = diabetes.data.values\n",
    "y = diabetes.target.values\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. K-Nearest Neighbors\n",
    "K-Nearest Neighbors (kNN) is a simple, yet powerful, algorithm used for both classification and regression tasks. In regression, kNN predicts the value of a target variable based on the average of the values of its k-nearest neighbors. The \"neighbors\" are determined by calculating the distance between data points, typically using Euclidean distance. KNN regression is non-parametric, meaning it makes no assumptions about the underlying data distribution, making it versatile for various types of data.\n",
    "\n",
    "### 2.1 Implementation from scratch\n",
    "An important step in the k-NN algorithm is computing distances between datapoints. We will use the euclidean distance. The Euclidean distance between two points $\\mathbf{p} = (p_1, p_2, \\ldots, p_n)$ and $\\mathbf{q} = (q_1, q_2, \\ldots, q_n)$ in Euclidean n-space is given by:\n",
    "\n",
    "\n",
    "$$ d(\\mathbf{p}, \\mathbf{q}) = \\sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \\cdots + (p_n - q_n)^2} $$\n",
    "\n",
    "The following function implements the euclidean distance by computing the element-wise difference between the points and then computing the norm using `np.linalg.norm`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Euclidean distance: 5.196152422706632\n"
     ]
    }
   ],
   "source": [
    "def euclidean_distance(p, q):\n",
    "    \"\"\"\n",
    "    Calculates the Euclidean distance between two points.\n",
    "\n",
    "    Args:\n",
    "        p (np.ndarray): First point.\n",
    "        q (np.ndarray): Second point.\n",
    "\n",
    "    Returns:\n",
    "        float: Euclidean distance between the two points.\n",
    "    \"\"\"\n",
    "    return np.linalg.norm(p - q)\n",
    "\n",
    "# Sample usage\n",
    "point1 = np.array([1, 2, 3])\n",
    "point2 = np.array([4, 5, 6])\n",
    "distance = euclidean_distance(point1, point2)\n",
    "print(\"Euclidean distance:\", distance)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now implement the k-NN algorithm:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "First 5 predicted labels: [148.5 148.  183.  248.5 123.5]\n",
      "First 5 true labels: [219.  70. 202. 230. 111.]\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "def knn_regressor(X_train, y_train, X_test, k=5):\n",
    "    \"\"\"\n",
    "    K-Nearest Neighbors regressor.\n",
    "\n",
    "    Args:\n",
    "        X_train (np.ndarray): Training data features.\n",
    "        y_train (np.ndarray): Training data labels.\n",
    "        X_test (np.ndarray): Data to predict.\n",
    "        k (int): Number of neighbors to use for prediction. Default is 5.\n",
    "\n",
    "    Returns:\n",
    "        np.ndarray: Predicted target values.\n",
    "    \"\"\"\n",
    "    # Calculate predictions for each test point\n",
    "    predictions = []\n",
    "    for x in X_test:\n",
    "        # Compute distances from the test point to all training points\n",
    "        distances = np.linalg.norm(X_train - x, axis=1)\n",
    "        # Find the indices of the k nearest neighbors\n",
    "        k_indices = np.argsort(distances)[:k]\n",
    "        # Get the target values of the k nearest neighbors\n",
    "        k_nearest_values = y_train[k_indices]\n",
    "        # Compute the mean of the k nearest values\n",
    "        prediction = np.mean(k_nearest_values)\n",
    "        predictions.append(prediction)\n",
    "    \n",
    "    return np.array(predictions)\n",
    "\n",
    "# Predict on the diabetes dataset\n",
    "y_pred = knn_regressor(X_train, y_train, X_test, k=2)\n",
    "print(\"First 5 predicted labels:\", y_pred[:5])\n",
    "print(\"First 5 true labels:\", y_test[:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recall that in the previous homework assignment we implemented the `DummyRegressor` mimicking Sklearn's implementation. Let's do the same here. In this case, fitting the model just means storing the training set in the regressor object, and predictions are made in a similar way as in our previous implementation, but accessing the training set that is stored in the class:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "First 5 predicted labels: [148.5 148.  183.  248.5 123.5]\n",
      "First 5 true labels: [219.  70. 202. 230. 111.]\n"
     ]
    }
   ],
   "source": [
    "class KNeighborsRegressor:\n",
    "    \"\"\"\n",
    "    K-Nearest Neighbors regressor.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, n_neighbors=5):\n",
    "        \"\"\"\n",
    "        Initializes the KNeighborsRegressor with the specified number of neighbors.\n",
    "\n",
    "        Args:\n",
    "            n_neighbors (int): Number of neighbors to use for prediction. Default is 5.\n",
    "        \"\"\"\n",
    "        self.n_neighbors = n_neighbors\n",
    "        self.X_train = None\n",
    "        self.y_train = None\n",
    "\n",
    "    def fit(self, X, y):\n",
    "        \"\"\"\n",
    "        Fit the KNN regressor on the training data.\n",
    "\n",
    "        Args:\n",
    "            X (np.ndarray): Training data features.\n",
    "            y (np.ndarray): Training data labels.\n",
    "        \"\"\"\n",
    "        self.X_train = X\n",
    "        self.y_train = y\n",
    "\n",
    "    def predict(self, X):\n",
    "        \"\"\"\n",
    "        Predict the target for the given data.\n",
    "\n",
    "        Args:\n",
    "            X (np.ndarray): Data to predict.\n",
    "\n",
    "        Returns:\n",
    "            np.ndarray: Predicted target values.\n",
    "        \"\"\"\n",
    "        predictions = []\n",
    "        for x in X:\n",
    "            distances = np.linalg.norm(self.X_train - x, axis=1)\n",
    "            k_indices = np.argsort(distances)[:self.n_neighbors]\n",
    "            k_nearest_values = self.y_train[k_indices]\n",
    "            prediction = np.mean(k_nearest_values)\n",
    "            predictions.append(prediction)\n",
    "        return np.array(predictions)\n",
    "    \n",
    "# Sample usage\n",
    "knn_regressor = KNeighborsRegressor(n_neighbors=2)\n",
    "knn_regressor.fit(X_train, y_train)\n",
    "y_pred = knn_regressor.predict(X_test)\n",
    "print(\"First 5 predicted labels:\", y_pred[:5])\n",
    "print(\"First 5 true labels:\", y_test[:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2. Importing the model from Sklearn\n",
    "Now that we know how the model works, we can just import it from the `sklearn.neighbors` module so we don't have to implement it from scratch everytime we use it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "First 5 predicted labels: [148.5 148.  183.  248.5 123.5]\n",
      "First 5 true labels: [219.  70. 202. 230. 111.]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.neighbors import KNeighborsRegressor\n",
    "\n",
    "knn_regressor = KNeighborsRegressor(n_neighbors=2)\n",
    "knn_regressor.fit(X_train, y_train)\n",
    "y_pred = knn_regressor.predict(X_test)\n",
    "print(\"First 5 predicted labels:\", y_pred[:5])\n",
    "print(\"First 5 true labels:\", y_test[:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our results and Sklearn\"s coincide :) Hurray!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Evaluating regression models\n",
    "Evaluating a ML model consists assessing how well the model's predictions match the actual values. The most common metrics for evaluating regression models are:\n",
    "\n",
    "- **Mean Squared Error (MSE)**: Measures the average squared difference between the actual and predicted values.\n",
    "  \n",
    "  $$MSE = \\frac{1}{n} \\sum_{i=1}^{n} (y_{i} - \\hat{y_{i}})^2.$$\n",
    "\n",
    " - **Root Mean Squared Error (RMSE)**: The square root of the MSE, providing a measure in the same units as the target variable.\n",
    "\n",
    "  $$RMSE = \\sqrt{MSE}.$$\n",
    "\n",
    "- **R-squared (**$R^2$**)**: Indicates the proportion of the variance in the dependent variable that is predictable from the independent variables.\n",
    "  \n",
    "  $$R^2 = 1 - \\frac{\\sum_{i=1}^{n} (y_{i} - \\hat{y_{i}})^2}{\\sum_{i=1}^{n} (y_{i} - \\bar{\\mathbf{y}})^2},$$\n",
    "\n",
    "where $n$ is the number of samples, $y_{i}$ is the true target value for the $i$-th sample, $\\hat{y_{i}}$ is the predicted target value for the $i$-th sample, and $\\bar{\\mathbf{y}}$ is the mean of the true target values.\n",
    "\n",
    "The implementation for these metrics is available in the `metrics` module of Sklearn:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean Squared Error: 3537.612359550562\n",
      "Root Mean Squared Error: 59.477830824186604\n",
      "R-squared: 0.3322931226835779\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import mean_squared_error, r2_score\n",
    "\n",
    "# Assuming y_true and y_pred are the true and predicted target values, respectively\n",
    "mse = mean_squared_error(y_test, y_pred)\n",
    "rmse = np.sqrt(mse)\n",
    "r2 = r2_score(y_test, y_pred)\n",
    "\n",
    "print(\"Mean Squared Error:\", mse)\n",
    "print(\"Root Mean Squared Error:\", rmse)\n",
    "print(\"R-squared:\", r2)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Linear Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. The goal is to find the best-fitting linear equation that describes how the dependent variable changes as the independent variables change. The equation of a multiple linear regression model is:\n",
    "\n",
    "$$ y = \\beta_0 + \\beta_1 x_1 + \\beta_2 x_2 + \\cdots + \\beta_p x_p + \\epsilon $$\n",
    "\n",
    "where:\n",
    "\n",
    "- $y$ is the dependent variable or target,\n",
    "- $x_1, x_2, \\ldots, x_p$ are the independent variables,\n",
    "- $\\beta_0$ is the y-intercept,\n",
    "- $\\beta_1, \\beta_2, \\ldots, \\beta_p$ are the coefficients,\n",
    "- $\\epsilon$ is the error term.\n",
    "\n",
    "In the context of Machine Learning, we can use a training set to compute estimates of the parameters ($\\hat{\\beta_0}, \\hat{\\beta_1}, \\hat{\\beta_2}, \\ldots, \\hat{\\beta_p}$), which can be used to make predictions from new test points. In this specific case, we could either compute them using the Ordinary Least Squares method or using a numerical optimization algorithm such as gradient descent. However, we will not implement this model form scratch, but we will directly use Sklearn's implementation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "First 5 predicted labels: [139.5475584  179.51720835 134.03875572 291.41702925 123.78965872]\n",
      "First 5 true labels: [219.  70. 202. 230. 111.]\n",
      "\n",
      "Mean Squared Error (MSE): 2900.194\n",
      "Root Mean Squared Error (RMSE): 53.853\n",
      "R-squared: 0.453\n"
     ]
    }
   ],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "lin_reg = LinearRegression()\n",
    "lin_reg.fit(X_train, y_train)\n",
    "y_pred = lin_reg.predict(X_test)\n",
    "\n",
    "print(\"First 5 predicted labels:\", y_pred[:5])\n",
    "print(\"First 5 true labels:\", y_test[:5])\n",
    "\n",
    "mse = mean_squared_error(y_test, y_pred)\n",
    "rmse = np.sqrt(mse)\n",
    "r_squared = r2_score(y_test, y_pred)\n",
    "print(f\"\\nMean Squared Error (MSE): {mse:.3f}\")\n",
    "print(f\"Root Mean Squared Error (RMSE): {rmse:.3f}\")\n",
    "print(f\"R-squared: {r_squared:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The MSE is lower than for kNN and the R-squared is higher, which indicate that this model performs better."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can inspect the `lin_reg` object to find out what are the estimates for the parameters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Estimated coefficients:  [  37.90402135 -241.96436231  542.42875852  347.70384391 -931.48884588\n",
      "  518.06227698  163.41998299  275.31790158  736.1988589    48.67065743]\n",
      "Estimated intercept:  151.34560453985995\n"
     ]
    }
   ],
   "source": [
    "print(\"Estimated coefficients: \", lin_reg.coef_)\n",
    "print(\"Estimated intercept: \", lin_reg.intercept_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recall that the features are:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "diabetes.feature_names"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Therefore, the formula for the linear regression model (i.e., the one that is applied when we call `lin_reg.predict`) is\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "\\hat{y} = & \\ 151.35 + 37.90 \\cdot \\text{age} - 241.96 \\cdot \\text{sex} + 542.43 \\cdot \\text{bmi} \\\\\n",
    "          & + 347.70 \\cdot \\text{bp} - 931.49 \\cdot \\text{s1} + 518.06 \\cdot \\text{s2} \\\\\n",
    "          & + 163.42 \\cdot \\text{s3} + 275.32 \\cdot \\text{s4} + 736.20 \\cdot \\text{s5} \\\\\n",
    "          & + 48.67 \\cdot \\text{s6}\n",
    "\\end{align*}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "🔎 **Observation:** When writing math formulas for ML/statistics, we usually use the hat (  $\\hat{ }$  ) to denote esimates/predictions. For example $\\hat{\\beta_1}$ is the estimate of $\\beta_1$ that we learn from data."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "tml_25",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}