
Machine Learning

Deep Dive into LLMs Like ChatGPT - Karpathy

Original video from Andrej Karpathy. Masterpiece.


Fineweb

Tokenization. From words to tokens. Tokens are not just words and punctuation; they can be the root of a word, or some sequence of letters without explicit meaning. This changes depending on the model. (See tiktokenizer to watch it in action.)

How does it work? Going from words to bits with some encoding could be a first step: from words to 1s and 0s. But this becomes a very long representation because we only have two symbols. We can group every 8 bits into one of 256 different bytes [0 to 255]; the sequence is much shorter because we have more symbols. We could use this, but SOTA models go beyond that. They use Byte-Pair Encoding, which looks for common pairs of bytes and creates a new symbol for each, starting from 256, and this can be repeated many times. More symbols, shorter representation. GPT-4 ends up with over 100k symbols.
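
As a rough sketch of the idea (illustrative only, not GPT-4's actual tokenizer), repeatedly merging the most frequent adjacent pair of symbols looks like this:

from collections import Counter

def most_common_pair(ids):
    # count adjacent pairs of symbols and return the most frequent one
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    # replace every occurrence of `pair` with the new symbol `new_id`
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id); i += 2
        else:
            out.append(ids[i]); i += 1
    return out

ids = list("hello hello world".encode("utf-8"))  # start from raw bytes, symbols 0..255
for new_id in range(256, 260):                   # a few merges; GPT-4-scale vocabularies do ~100k
    ids = merge(ids, most_common_pair(ids), new_id)
print(ids)                                       # shorter sequence, larger symbol set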

Neural Networks

We take windows of tokens of flexible length, up to some maximum (e.g. 4k, 8k or 16k tokens). Longer windows become computationally expensive.

The idea is to predict the next token. The window used is the context. We do this for every context and token in the training data. The process adjusts the probability of each possible next token: the probabilities are initialized at random and, by the end, they should match the statistical properties of the dataset.

Training is done via a neural network whose mathematical expression is updated through its weights. The input is the token sequence, the output is the probability of each token being the next one. In the middle sits the NN architecture, Transformers etc. The first part embeds the tokens into a numerical representation.

Inference

Put in an initial token and sample from the output probability distribution; this is your next token. We do that again, but now the context is 2 tokens, and so on.
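
A toy sketch of that loop (here model is a hypothetical function that maps a list of token ids to a probability distribution over the vocabulary):

import numpy as np

def generate(model, context, n_tokens, rng=np.random.default_rng(0)):
    tokens = list(context)
    for _ in range(n_tokens):
        probs = model(tokens)                      # P(next token | context so far)
        next_id = rng.choice(len(probs), p=probs)  # sample instead of taking the argmax
        tokens.append(int(next_id))                # the sample becomes part of the context
    return tokens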

GPT-2 train and inference

Useful because the technical parts are still relevant; models are now just bigger and more complex.

The base model is the model that comes out after pretraining. It's an inference machine, just token prediction, but it's not useful for chat, for example. Some companies release their base models, but not all of them. GPT-2 was released. A base model release requires the model architecture and the model weights.

Llama 3 is a more recent one (2024, 405 billion parameters).

Base models are somewhat good at memorization (regurgitation), which is not desirable. If you paste the first sentence of a Wikipedia page, it will probably output the exact rest of the article up until some point and then deviate.

Wikipedia also has more weight in the model because the source is trustworthy. I'm not sure if this is because Wikipedia extracts appear more often in the corpus (because of citations and so on) or because some sources are given more importance than others in the pretraining methodology.

Base model is still useful to some extent without being an assistant if you are clever with the prompt.

Few-shot prompt. The model has some in-context learning ability, meaning it can pick up a pattern in the prompt. In the video, AK prompts a list of English words with their Korean translations and leaves the last one untranslated, so the model infers it from that point.

He also shows that you can get assistant-like behaviour via in-context learning, by passing an example of a human-assistant interaction and letting the model generate the answer to your actual question from there.

Post training

Pretraining is everything we saw before: get data from the web, tokenize it and create a base model that predicts the next token. It's not an assistant; it's more like an internet text generator. That's all included in pretraining, and it's the expensive part, the one that takes millions of dollars and a lot of time.

Much less computation than pretraining and it's the step that moves us from a token generator to an assistant.

Because this is a neural network, we can't explicitly program the assistant, give it a personality, or make it refuse certain kinds of questions. We can only do that by training the network on datasets.

Programming by example. And the examples require human labelers. We train the model on these responses and it tries to imitate that behaviour.

We substitute the training dataset of the model: we remove all the internet text and start using the conversation dataset. We keep training the model, but now with this dataset, and the model will pick up the statistics of this new dataset and how conversations should go. (Supervised Finetuning, aka SFT)

Post-training for a SOTA model can take around 3 hours, for example, vs 3 months of pretraining on thousands of GPUs. This is because the post-training dataset is human-created and much smaller.

How do we go from conversations in the new dataset to tokens?

We need to encode and decode conversations in some specific way. Each model has a slightly different methodology, but GPT-4o, for example, has a few extra special tokens: one marking that a new turn begins (user or assistant), then a token for the user or the assistant, then a token for the start of the actual message, then the message itself tokenized, and then a token for the end of the message. Then it repeats: the turn-start token, the user/assistant token, and so on.
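
As an illustration, the layout he inspects on tiktokenizer for GPT-4o looks roughly like this (exact special-token names vary per model):

<|im_start|>user<|im_sep|>What is 2+2?<|im_end|>
<|im_start|>assistant<|im_sep|>2+2 = 4<|im_end|>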

So when we go to ChatGPT and ask a question, it's sent to the backend encoded in the above format; they append the tokens that start a new assistant message and run inference from there, letting the LLM complete all the next tokens.

InstructGPT paper on SFT: the first paper to talk about post-training. It describes the heavy human-labeling effort from which the post-training conversation datasets emerged, and some of the instructions the labelers received. The OpenAI dataset is not released, but OpenAssistant is an open-source alternative with a similar format.

Currently LLMs are being used to help create these conversation datasets, so much less human effort is needed. But in the end the root of all these conversations is the initial human labelers following OpenAI's (and other companies') instructions. ChatGPT, for example, answers in the tone of, and guided by, those examples, so it's kind of recreating how the labelers wrote. It's a labeler text-generation machine.

"What would a human labeler say in this conversation?" You are talking to a simulation of an average labeler (who is probably some skilled person but still)

Hallucinations

They exist because the model is sampling from the statistics of the training dataset, trying to answer something even if it's not the truth. The problem has improved over the years but it's still relevant.

How to fix it?

We need to include in the post training dataset some conversations where the answer is that it doesn't know.

Mitigation #1: model interrogation

What Meta did for Llama is super clever. We don't know exactly what the model knows or doesn't know, so we need to let it decide in some way. They assume there is some internal representation of lack of knowledge, some neuron that gets activated when the model doesn't "know" something. To teach that pattern, they take random text (from Wikipedia, let's say) and use an LLM to create a few questions with factual answers about that text. They then interrogate the model with those questions and compare its answers to the actual truth (again with an LLM as judge, no humans needed), several times per question.

If the model answers correctly, the conversation is fine and nothing changes. But if the model hallucinates and answers wrongly (as judged by another LLM against the ground truth), then the answer to that question in the conversation dataset becomes "Sorry, I don't know". If you add some amount of answers like this (of course you can't just add every question the model gets wrong), the model picks up the pattern: when that lack-of-knowledge neuron gets activated, answering "I don't know" is what it should do. This worked quite well to mitigate hallucinations!
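
A rough sketch of that data-generation loop (the function names here, generate_qa, ask and judge, are made-up stand-ins for LLM calls):

def build_idk_examples(documents, model, n_tries=3):
    examples = []
    for doc in documents:
        for question, true_answer in generate_qa(doc):           # an LLM writes factual Q&A about the doc
            answers = [ask(model, question) for _ in range(n_tries)]
            if not any(judge(a, true_answer) for a in answers):  # LLM-as-judge against the known truth
                # the model doesn't actually know this fact: teach it to say so
                examples.append((question, "Sorry, I don't know."))
    return examples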

Mitigation #2: allow the model to search the internet, i.e. to use tools. We do this by introducing new tokens, in this case special tokens for search_start and search_end with a query in the middle. The assistant looks up that query in a browser and the information it returns is pasted just after the special tokens, so the internet information is now in the context. It goes directly into the model, like refreshing our memory as humans.

To add this functionality we again teach the model by example, adding a bunch of conversations to the dataset that show how to use search.
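
A made-up example of what such a training conversation could look like once laid out (token names are illustrative):

<|im_start|>user<|im_sep|>When was the Eiffel Tower built?<|im_end|>
<|im_start|>assistant<|im_sep|><search_start>Eiffel Tower construction dates<search_end>
[browser results pasted here, now part of the context]
It was built between 1887 and 1889.<|im_end|>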

Knowledge in the parameters == vague recollection (e.g. of something you read a month ago).
Knowledge in the tokens of the context window == working memory.

Models need tokens to think

Given the neural network architecture, there is a finite amount of computation available per token. Given the context, the forward/inference pass will predict the next token using the network's capability, but that capability is finite/limited. We should try to spread the computation, "the thinking", across many tokens, so that each token gets the full computational power of the network and we end up using more computation in total for our answer. Concretely, he shows a simple math problem. If we ask the model to answer directly, we force the network to use its context and finite computation to answer in a single attempt; everything that comes after that answer is post-hoc justification. Whereas if we ask the model to elaborate and go step by step (disguised CoT?), it uses full computation for each step, and by the time it outputs the answer it has a good, detailed context to support that final "calculation".

image

This is the same reason why models are not good at counting: too much is expected from a single forward pass, with finite computation.

"Use code" is a great way to make the model good at those tasks, because the model is good at copy-pasting and code gives the right answer. You can just copy the string into code and use Python to actually count the number of letters, or whatever. Same for calculations. It's much more likely to give the right answer than relying on the model's "mental arithmetic".

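For example, instead of asking the model to count letters "in its head", the prompt pushes it to emit something like:

word = "strawberry"
print(word.count("r"))   # 3 -- the counting is done by Python, not by a forward pass
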
It's also interesting to understand why a model might be good at solving complex PhD-level math problems but fail at simple tasks like "what is bigger, 9.11 or 9.9?", which is usually answered wrongly or randomly.

One hypothesis he mentions comes from a research team that pointed out that in the Bible 9.11 > 9.9 in terms of verses, which may confuse the neural network, but the problem is not fully understood.

Reinforcement Learning

The last major stage. It is sometimes included as part of post-training, but it's really a separate major step.

It's like going to school.

He compares training a model to a textbook: the general explanations are like pretraining; the worked examples in the book are like post-training, with examples of how things should be solved (how to answer like an assistant); and then there are the exercises, for which the student doesn't have the worked solution, perhaps only the final answer, and needs to find the path to it. Reinforcement learning is like this last step.

The motivation is that we as humans (labelers) don't know what the best way is for the LLM to solve a specific problem, such as a math problem. He shows a few options: going straight to the answer, doing some arithmetic, talking it through in plain English and giving the answer, setting the problem up as a system of equations, etc. What's easy for us might not be easy for the LLM, so we need to try and see which approach gives the best results.

So the idea is to generate many (thousands of) solutions for some non-trivial problem and keep the stochastic solutions / inferences / token sequences that led to a right answer among all the tries. Some will get the right answer, some won't. The model is then retrained on the right solutions. So it's not human-labeled anymore: we just try solutions and retrain on the ones that were correct, so the network learns to keep doing that in similar situations in the future.
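
A toy sketch of that loop (sample_solution, is_correct and retrain_on are hypothetical stand-ins for the real machinery):

def rl_round(model, problems, n_samples=1000):
    winners = []
    for prompt, true_answer in problems:
        for _ in range(n_samples):
            solution = sample_solution(model, prompt)   # stochastic generation, full token sequence
            if is_correct(solution, true_answer):       # verifiable check (or LLM as judge)
                winners.append((prompt, solution))      # keep only the sequences that worked
    return retrain_on(model, winners)                   # reinforce what led to right answers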

Pretraining and post-training are quite standard and used across all providers, but the reinforcement learning step is at an earlier stage and not standardized. Different providers are trying different approaches, and how the details of the process are handled (the idea is simple overall) has high impact and is not trivial; for example, how we select "the best answer" among the correct ones plays a big role.

Deepseek R1

The paper was innovative and a game changer, in part because it is open and explicit about RL while OpenAI and others kept the details to themselves.

With RL, the model learns over time to give better answers to the questions, and it uses more and more tokens to do it. It "discovers" that trying many paths, backtracking and trying again is a better way to reach a good answer. Chains of thought emerge without being hardcoded anywhere by researchers (that would be impossible anyway; it's something the model needs to discover).

A "thinking/reasoning" model is one that has been trained with Reinforcement Learning

ChatGPT-4o is not a reasoning model; it's mostly SFT (learn by example, just finetuned, no RL; he says there is a bit of RL but we should really think of it as SFT). DeepSeek uses RL. o1 and o3 are also RL models.

AlphaGo

RL made it possible for the model to surpass top players' Elo, while supervised learning was not capable of that. RL is not restricted to human-style play, and that's how move 37 happened: a move no top-level player expected, but that turned out to be really powerful. This happened because training wasn't guided by supervised learning but by RL (the AI playing against itself, roughly).

Learning in unverifiable domains (Reinforcement Learning from Human Feedback)

The previous problems were easily verifiable: in RL we could just check whether the solution was correct by comparing the final output of the LLM with the right answer, either by direct comparison or by using an LLM as judge, where we ask another LLM whether the solution provided by the model is consistent with the actual solution (currently that approach is quite reliable). But this can't be done in unverifiable domains such as "write a joke about pelicans", "write a poem", etc.

For the pelican jokes, in principle you could use humans to judge whether each joke is funny and reward it, but since you need to evaluate thousands of generations for thousands of prompts, this becomes unfeasible. We need another strategy. This paper introduced RLHF.

RLHF approach:
STEP 1: Take 1000 prompts, generate 5 options each, and order them from best to worst.
STEP 2: Train a neural net simulator of human preferences (the "reward model").
STEP 3: Run RL as usual, but using the simulator instead of actual humans.

The reward model is not a perfect human simulator but it's currently good enough to work meaningfully.
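
A minimal sketch of the reward-model idea, assuming each (prompt, response) pair has already been turned into a feature vector and using a simple Bradley-Terry-style pairwise loss (an illustration under those assumptions, not any lab's actual recipe; ranked_pairs would hold the labelers' orderings):

import numpy as np

def reward(w, features):
    return features @ w                                      # scalar score for one response

def pairwise_update(w, better, worse, lr=0.1):
    # probability the reward model assigns to the human-preferred ordering
    p = 1.0 / (1.0 + np.exp(-(reward(w, better) - reward(w, worse))))
    # gradient step pushing the preferred response's score above the other one
    return w + lr * (1.0 - p) * (better - worse)

w = np.zeros(128)   # reward-model parameters; a real reward model is a large neural net
# for better, worse in ranked_pairs: w = pairwise_update(w, better, worse)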

Upside

We can run RL in arbitrary domains (even unverifiable ones) and empirically it gives better results.

He says this probably improves the model due to the discriminator-generator gap: it's easier for a human to discriminate than to generate. It's easier to say which jokes are good or bad than to create good ones. The labeler doesn't need to write a good joke; that hard task is left to the model, and the labeler just points out which ones are good and which are bad.

Downsides

RL is done with respect to a lossy simulation of humans. It might be misleading: we generate rankings with a model that might not reflect actual human judgement.

RL discovers ways to "game" the reward model.

It happens that after a lot of updates, the jokes considered the best are nonsensical. With the first updates the jokes might improve, but after some point they become much worse: a "top" joke could be "the the the the the", and somehow it gets a high score from the reward model. These weird top answers are adversarial examples: the RL finds answers that slip through little paths that fire high scores without making human sense. Reward models are massive neural nets, and they have cracks.

You could take these nonsensical answers and add them to your dataset with a very low ranking, so the model learns they are not good jokes, but this is an infinite process: there will always be more adversarial examples for the neural net to find.

What to do? Train with RLHF for some time and then cut training off; don't go too far, so you avoid the adversarial-example regime.

This is true for RLHF, not RL. Plain RL can be run indefinitely because you can't really game the answer: you are looking for a specific, verifiable answer and the neural net will find ways, even non-standard ones, to reach it. RLHF is different: a reward model can be gamed, so RLHF can't be run forever, while plain RL can.

E-commerce Image Similarity via Visual Embeddings

How I implemented an API to retrieve similar images from an e-commerce catalog in 3 steps

In this post, we explore a system to identify similar articles solely based on e-commerce images. The approach is designed with two primary objectives in mind:

Objectives

  1. Catalog Similarity:
    Determine which items in the client's catalog resemble each other using only their photos.

  2. External Querying:
    Although not implemented in the current API, the plan is to eventually allow querying with external images. The envisioned workflow is to source images from suppliers, compare them with the client's catalog, and perform this embedding generation locally to keep the API lightweight.

Proposed Solution

The core idea is to use a pretrained image model to extract embeddings from each photo. Once these embeddings are available, we can perform similarity searches to find items that are visually alike.

For our implementation, we experimented with both OpenAI's clip-ViT-B-32 and ResNet. In general, both models produced comparable results, though we opted for CLIP in our main experiments.

Implementation Steps

Step 1: Download Catalog Images

  • Script: embeddings/download_images/get_images.py
  • Details:
    This script downloads all catalog images from the e-commerce using a ThreadPool to speed up the process.

Step 2: Generate and Index Embeddings

  • Script: embeddings/clip_faiss.py
  • Details:
    The script generates embeddings for each photo and stores them in a Faiss index, which is saved under embeddings/faiss_index/.
    Note: Since the process is deterministic, a simple overwrite will not impact the results. Idempotent as they call it.

Additional notebooks are available to illustrate the process, check results, and experiment with alternative models and tests.
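
Conceptually, the embedding-and-indexing step could look roughly like this (a hedged sketch, not the actual script; the paths and file names are assumptions):

import faiss
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")
paths = ["images/001.jpg", "images/002.jpg"]                # downloaded catalog photos
embeddings = model.encode([Image.open(p) for p in paths])   # one vector per image
embeddings = np.asarray(embeddings, dtype="float32")

index = faiss.IndexFlatL2(embeddings.shape[1])              # exact L2 search, fine for catalog sizes
index.add(embeddings)
faiss.write_index(index, "faiss_index/catalog.index")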

Step 3: Query the Faiss Index

  • API Functionality:
    The Faiss index is already built. We expose an API endpoint where you can pass an article_id (used during the embedding generation) and retrieve the most similar items.
    Validation:
    As a sanity check, querying an article should return itself as the top match with a distance of 0.
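
Continuing the sketch above, the lookup behind the endpoint is essentially (article_row stands for the index position mapped from the article_id):

k = 5
query = embeddings[article_row : article_row + 1]   # the article's own embedding
distances, neighbors = index.search(query, k)
# sanity check: neighbors[0][0] == article_row and distances[0][0] is ~0.0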

Conclusion

By leveraging pretrained image models and efficient similarity search with Faiss, this approach provides a scalable method for identifying visually similar items in an e-commerce catalog. This system not only improves internal catalog management but also sets the groundwork for integrating external image queries in the future.

Kullback-Leibler divergence

To understand KL divergence we first need to understand entropy. The most important thing to keep in mind is that entropy can be thought of as a measure of "information" or, the way I like best, as the expected surprise you get from each observed value of the distribution.

For a highly concentrated distribution, you will almost surely get a value from the dense region, so the expected surprise is really low (but you are highly surprised when you see a value outside that region!).

Relative entropy

How does this relate to KL? Well, Kullback-Leibler is a divergence, but it is also called relative entropy: how that measure of entropy differs between two distributions. I find that easier to grasp.

Usually we have a true distribution \(p(x)\) and an estimated distribution \(q(x)\) that we use to approximate \(p(x)\). KL divergence helps us understand how well it does that job.

If entropy is:

\(H[x] = - \sum_x p(x) \ln p(x)\)
(originally \(\log_2\), but \(\ln\) works too and is used everywhere)

\[KL(p||q) = - \int{p(x) \ln q(x)dx} - (-\int{p(x) \ln p(x)dx})\]
\[ = - \int{p(x) \ln \frac{q(x)}{p(x)}dx}\]

The first row is clear: it is the cross-entropy of \(q\) under \(p\) minus the entropy of \(p\). This is the relative entropy between \(p(x)\) and \(q(x)\).

Important: KL divergence is not symmetrical, \(KL(p||q) \neq KL(q||p)\), so KL is not a distance metric (though it is often loosely thought of as one). It is a divergence.

Why is it not symmetrical?

I think about it this way: it is a dissimilarity between one distribution and the distribution we use to approximate it. Think about two totally different distributions (awful approximations of each other):

  • one highly centered around one specific value
  • the other with a uniform-ish shape, highly dispersed.

The "surprise" you get from approximating in one direction vs the other is not the same.

If you are approximating the highly dense distribution with the uniform one, you are using a distribution whose surprise \(-\ln q(x)\) is, say, moderately high everywhere, even in the specific dense region.

But the other way around, if you approximate the dispersed distribution with the dense one, you are using a distribution with high surprise over most of the region of the dispersed distribution and really low surprise at one specific value.

The term \(\int{p(x) \ln q(x)dx}\) behaves differently in the two scenarios, and there is no guarantee or reason that they should match.
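
A quick numerical check of the asymmetry (scipy's entropy computes the KL divergence when given two distributions):

import numpy as np
from scipy.stats import entropy

p = np.array([0.01, 0.01, 0.96, 0.01, 0.01])   # highly concentrated
q = np.full(5, 0.2)                            # uniform-ish
print(entropy(p, q))                           # KL(p || q)
print(entropy(q, p))                           # KL(q || p): a different number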

References

Deep Learning, Bishop
Wikipedia

Mutual Information

When two variables \(x\) and \(y\) are independent, their joint distribution will factorize into the product of their marginals \(p(x,y) = p(x)p(y)\). If the variables are not independent, we can gain some idea of whether they are "close" to being independent by considering the KL Divergence between the joint distribution and the product of the marginals, given by:

\[I[x,y] = KL(p(x,y)||p(x)p(y))\]
\[= - \int \int p(x,y) \ln \left(\frac{p(x)p(y)}{p(x,y)}\right) dx\,dy\]

which is called the mutual information between \(x\) and \(y\).

Using the product rule \(p(x,y) = p(y|x)p(x)\), the mutual information can also be written as \(I[x,y] = H[x] - H[x|y] = H[y] - H[y|x]\). Thus, the mutual information represents the reduction in uncertainty about \(x\) by virtue of being told the value of \(y\) (or vice versa).

From a Bayesian perspective, we can view \(p(x)\) as the prior distribution for \(x\) and \(p(x|y)\) as the posterior distribution after we have observed new data \(y\). The mutual information therefore represents the reduction in uncertainty about \(x\) as a consequence of the new observation \(y\).

Entropy and information

The term comes from information theory. The most intuitive way to think about the information of a variable is to relate it to the degree of surprise on learning the value of the variable X.
This definition is mentioned both in Deep Learning (Bishop) and in Hamming; the former is the text I was reading just before starting this note.

So, given a variable X with distribution p(x), what is h(x), the information of observing X? This quantity h(x) should be a monotonic function of p(x). Remember, information is tied to surprise: observing an almost-sure event (high p(x)) reveals less information than observing an unexpected event (low p(x)). In the extreme, observing a known value gives 0 information.

The mathematical definition of information comes (somewhat intuitively) from the requirement that observing two independent variables x and y should provide an amount of information equal to the sum of the individual informations.

\[h(x,y) = h(x) + h(y)\]

We know that for independent events \(p(x, y) = p(x)*p(y)\)

It can be derived then that

\[h(x) = - \log_2 p(x)\]

The log turns the multiplication into a summation. The negative sign ensures values are 0 or positive; remember that we are taking the log of a value between 0 and 1.

Now, there is another important term, entropy. It summarizes the average amount of information that is transmitted if a sender wishes to transmit the value of a random variable to a receiver. The entropy is the expectation of the information, with respect to the distribution p(x).

\[H[x] = - \sum_x p(x) \log_2 p(x)\]

For p(x) = 0, where the log would be problematic, we take \(p(x) \log p(x) = 0\).

Classical information theory uses \(\log_2\) because it relates to bits and the number of bits required to send a message, but the book switches to \(\ln\) because it is much more common in ML; it is just a different unit for measuring entropy.
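
A tiny example with a biased coin, showing it's the same quantity in different units:

import numpy as np

p = np.array([0.9, 0.1])            # biased coin
print(-(p * np.log2(p)).sum())      # entropy in bits (~0.47)
print(-(p * np.log(p)).sum())       # entropy in nats (= bits * ln 2, ~0.33)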

Expectation Maximization for Poisson process

The problem

What if you want to run a regression to estimate the coefficients and retrieve the data generating process of some data, BUT you are in a scenario with censored data? How does that affect the estimation? Can we do better?

Censored data

In short, you have censored data when some observations are unobserved or constrained for some specific reason that is natural and unavoidable.

Unobserved

For instance, in survival analysis, if the subject is not yet dead (luckily) you cannot observe the time of death yet, and hence you don't know if they will die tomorrow or in two years. You simply can't observe the death date yet.

Constrained

For goods with limited stock, once you deplete your inventory you can't, by definition, sell more, so you can't observe all the demand you would have had if more items were available. An airline, for example, can't observe the full demand for a flight once all the seats have been sold; maybe more people were willing to fly.

We will simulate an example of the latter case, an airline, with simple toy data.

import numpy as np
import statsmodels.api as sm
from scipy.stats import poisson
import matplotlib.pyplot as plt

We create some fake data where seats sold are generated from a Poisson distribution, with an intercept (2) and a coefficient on how many days are left until departure (-0.2): the further from departure, the fewer seats are sold.

What we also do is constrain the demand on the last date. We don't get all the seats that would have been sold under the Poisson distribution, but that amount minus 3. Only we know that, because we are the ones tweaking the data. In real life we would see data similar to the generated one, and we couldn't see the sales that would have occurred under the complete Poisson generating process (which is how the tickets are actually sold in our example).

# Step 1: Generate data
np.random.seed(100)
n_samples = 20
n_features = 1
substract_sales = 3

# Generate features
X = np.random.normal(size=(n_samples, n_features))
X = np.array(range(n_samples+1, 1, -1))
X = sm.add_constant(X)  # Add intercept term

# True coefficients
beta_true = np.array([2, -0.2])

# Generate Poisson-distributed target values
linear_pred = np.dot(X, beta_true)
lambda_true = np.exp(linear_pred)
y = np.random.poisson(lambda_true)
# y

# Right-censoring threshold in the last date (when the fare is closed)
censor_threshold = np.concatenate([(y+1)[:-1],np.array([y[-1]-substract_sales])])
y_censored = np.where(y > censor_threshold, censor_threshold, y)
is_censored = (y > censor_threshold)
Maybe it's easier to visualize.
The green curve is the actual lambda parameter for each date, based on the intercept and the real coefficient multiplied by days to departure.
The blue dots are the actual data points we see for all dates except the last one.
The grey dot is the data we would see if there were no limit on how many seats can be sold, following the Poisson distribution.
The red dot is the actual data point we have: on the last day the airline sold all the remaining seats, couldn't fulfill the true demand, and hence we see lower sales than the ones generated by the Poisson distribution.

# Plot the data
plt.xlabel('Days to departure')
plt.ylabel('Bookings')
plt.title('Data Plot')
plt.plot(X[:-1, 1], y[:-1], 'o')
plt.plot(X[:, 1], lambda_true, label='Lambda True', color="green")
# plt.plot(X[:, 1], lambda_est, label='Lambda Estimated')
plt.scatter(X[-1, 1], y[-1:],  color="grey", label="Real unseen demand" )
plt.scatter(X[-1, 1], y[-1:]-substract_sales, color='red', label='Constrained')
plt.legend()
plt.show()

Image

Estimation
Poisson Regression

First we estimate the parameters using a Poisson regression. We assume the data follows a Poisson process and do nothing to handle the constrained data.
The results are kind of OK; the estimates are somewhat close to the true parameters, but they are clearly biased.

model_censored = sm.GLM(y_censored, X, family=sm.families.Poisson())
results_censored = model_censored.fit()
print(f'Estimated Censored coefficients: {results_censored.params}')
## Estimated Censored coefficients: [ 1.86976922 -0.21124541]
Expectation Maximization

OK, here is our alternative. What we can do, a bit more sophisticated, is to apply the EM algorithm, an iterative process, to estimate the coefficients of the Poisson regression while incorporating the knowledge we have: that some data might be constrained.

# Step 2: Initialize parameters
beta_est = np.zeros(n_features + 1)
tol = 1e-4
max_iter = 1000

for iteration in range(max_iter):
    # Step 3: E-step
    # Estimate the expected values of censored data
    lambda_est = np.exp(np.dot(X, beta_est))
    expected_y = np.where(
        is_censored,
        (censor_threshold + 1) / (1 - poisson.cdf(censor_threshold, lambda_est)),
        y_censored
    )
    # Ensure expected_y values are valid
    expected_y = np.nan_to_num(expected_y, nan=np.mean(y_censored), posinf=np.max(y_censored), neginf=0)
    # Step 4: M-step
    # Update parameter estimates using Poisson regression
    model = sm.GLM(expected_y, X, family=sm.families.Poisson())
    results = model.fit(method="lbfgs")
    # results = model.fit_regularized(L1_wt=0, alpha=0.1) 
    new_beta_est = results.params

    # Check convergence
    if np.linalg.norm(new_beta_est - beta_est) < tol:
        break

    beta_est = new_beta_est
## <string>:7: RuntimeWarning: divide by zero encountered in divide
print(f'Estimated coefficients: {beta_est}')

## Estimated coefficients: [ 2.11015996 -0.23573144]
We can see our estimates are of course not perfect, but the intercept is closer to the true parameter.
Let's visualize the results to get a clearer intuition.

uncensored_predictions = results.predict(X)
censored_predictions = results_censored.predict(X)
Promising!
For the furthest dates we see a bias in both approaches; we are not doing better than the Poisson regression, but not worse either.
As we move closer to the departure, the EM algorithm predictions get closer and closer to the true lambdas, while the regular Poisson regression continues its biased trajectory.
The EM procedure adjusted the coefficients to better match the constrained data.

plt.plot(X[:, 1], lambda_true, label='Lambda True', color="green")
plt.plot(X[:, 1], uncensored_predictions, label='Uncensored Predictions (EM)', color="blue")
plt.plot(X[:, 1], censored_predictions, label='Censored Predictions', color="red")
plt.xlabel('Days to departure')
plt.ylabel('Lambda - Poisson regression')
plt.title('Lambda Predictions')
plt.legend()
plt.show()

Image

Weighted regression

Weighted regression consists of assigning a different weight to each observation, and hence giving it more or less importance when fitting the regression.

One way to look at it is as solving the regression problem by minimizing the Weighted Mean Squared Error (WMSE) instead of the Mean Squared Error (MSE):

\[WMSE(\beta, w) = \frac{1}{n} \sum_{i=1}^n w_i(y_i - \overrightarrow{x_i} \beta)^2\]
Intuitively, we are looking for the coefficients that minimize the MSE but putting different weights on each observation. OLS is the particular case where all the \(w_i = 1\).

Why do this? A few reasons (Shalizi 2015, Chapter 7.1):

  • Focusing Accuracy: We want to predict especially well at some particular points or regions of points, maybe because that's the focus in production, or because being wrong at those observations has a huge cost, etc. Weighted regression will make an extra effort to match that data.

  • Discount imprecision: OLS returns the maximum likelihood estimates when the residuals are independent and normal, with mean 0 and constant variance. When we face non-constant variance, OLS no longer returns the MLE. The logic behind weighted regression is that it makes no sense to pay equal attention to all observations, since some have higher variance and are less indicative of the conditional mean. We should put more emphasis on the regions of lower variance, predict them well, and "expect to fit poorly where the noise is big".
    The weights that recover the MLE are \(\frac{1}{\sigma_i^2}\).

  • Sampling bias: If we think or know that the observations in our data are not completely random and some subset of the population might be under-represented (in a survey, for example, or because of data availability), it might make sense to weight observations inversely to their probability of being included in the sample. Under-represented observations get more weight and over-represented ones get less.
    Another similar situation is covariate shift. If the distribution of the variable x changes over time, we can use a weight defined as the ratio of the probability density functions:

    "If the old pdf was p(x) and the new one is q(x), the weight we'd want to use is \(w_i=q(x_i)/p(x_i)\)"

  • Other: Related to GLM, when the conditional mean is a non linear function of a linear predictor. (Not further explained in the book at this point)

Is there a scenario where OLS is better than weighted regression, assuming we can compute the weights?

Example.

First we will see the impact of using weighted regression in a simulated scenario where we actually know the variance of the error of each observation. This is not realistic, but it is useful to see it in action.

library(tidyverse)

We generate 1000 data points with a linear relation between y and x: intercept = 0, slope = 5. We let the variance of the error depend on the value of x; higher values of x are associated with a higher error variance.

set.seed(23)
n=1000
x = runif(n,0,10)
error = rnorm(n,0, x/1.5)
df = data.frame(x)
df = df %>% mutate(y = 5*x + error)
Visually..

ggplot(data=df, aes(x=x, y=y)) +
  geom_point(alpha=0.3) + 
  # geom_smooth(color="blue") +
  # geom_smooth(method = "lm", mapping = aes(weight = (1/sqrt(x)^2)),
  #             color = "red", show.legend = FALSE) +
  NULL

Image

Linear regression
ols = lm(formula = "y~x", data=df)
summary(ols)

## 
## Call:
## lm(formula = "y~x", data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.868  -1.720  -0.137   1.918  14.722 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.19192    0.24278   0.791    0.429    
## x            4.95585    0.04148 119.489   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.855 on 998 degrees of freedom
## Multiple R-squared:  0.9347, Adjusted R-squared:  0.9346 
## F-statistic: 1.428e+04 on 1 and 998 DF,  p-value: < 2.2e-16
We get an intercept of 0.19 (non-significant; the actual value is 0) and a slope of 4.96 (the actual value is 5).

Weighted linear regression
wols = lm(formula = "y~x", data=df, weights = (1/sqrt(x)^2) )
summary(wols)
## 
## Call:
## lm(formula = "y~x", data = df, weights = (1/sqrt(x)^2))
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8880 -0.8601 -0.0016  0.8936  4.6535 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.001483   0.030072   0.049    0.961    
## x           4.993473   0.021874 228.286   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.498 on 998 degrees of freedom
## Multiple R-squared:  0.9812, Adjusted R-squared:  0.9812 
## F-statistic: 5.211e+04 on 1 and 998 DF,  p-value: < 2.2e-16

We get an intercept of roughly 0 (0.0015), non-significant too but much closer to 0 and with a lower standard error, and a slope of 4.99, also much closer to the actual value of 5 and with a lower standard error.

Conclusion: if we know the right weights, we can get better estimates from a linear regression under heteroscedasticity.

Inference is not valid in the dataset used for model selection.

Let's say we have a dataset, we want to fit a model to it and do some inference, such as obtaining the coefficients and looking at their confidence intervals.

For such a task we would first need to find a model that we think approximates the real data generating process behind the phenomenon.
This is the model selection step.
Then we would look at the output of our model and get the standard errors of the coefficients, calculate confidence intervals, or any other similar task. This is the inference step.

The issue here is that, if we don't know the true model and we do model selection, the selected model is itself a random object. Why? Because the particular dataset we are using is also a set of random variables. Other datasets might return a different model formula as the best among our options, since they would contain other observations and particularities.

Main problem:

Since we are selecting a model based on a particular dataset, the standard errors and p-values will be smaller than the actual ones.

"That means there is some extra randomness in your estimated parameters (and everything else), which isn't accounted for by formulas which assume a fixed model.
This is not just a problem with formal model-selection devices like cross-validation. If you do an initial, exploratory data analysis before deciding which model to use - and that's generally a good idea - you are, yourself, acting as a noisy, complicated model-selection device" (Shalizi 2017)

The most straightforward way to deal with this (if you have independent observations) is to split the data: do model selection on one part and then fit the selected model on the other part. The second fit is the one to use for inference.
You could fit the model to the full data, but that would include the part used for model selection and you would still get falsely overconfident standard errors.

Let's see an example.
We will generate data following a somewhat "complicated" model with interactions. We will split the data into two equal-sized parts, one for model selection and one for inference.
We will then fit a few formulas to the model selection part and pick the one with the minimum RMSE. We will compare the standard errors obtained on the model selection part with the ones obtained by fitting that model to the inference part.

Thanks to BrunoRodrigues for this post, which I used as a guideline to fit models with cross-validation in R.

We start by generating the data, including interactions.

library(rsample)    # initial_split(), training(), testing(), mc_cv(), analysis(), assessment()
library(yardstick)  # rmse()

set.seed(1)
N = 5000
b0 = 4
b1 = 1
b2 = 2
b3 = 3
b4 = 4
b5 = 5

x1 = runif(N, 0, 10)
x2 = rnorm(N, 20, 3)
x3 = runif(N, 20, 40)
error = rnorm(N, 0, 200)

y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x1*x2 + b5*x2*x3 + error


df = tibble(y, x1, x2, x3)

We do the first split: df_selection will be used to try different models and pick one.
df_inference will be used to do the actual inference given the selected model.

prop = 0.5

selection_inference_split = initial_split(df, prop=prop)

df_selection = training(selection_inference_split)
df_inference = testing(selection_inference_split)

To select a model using df_selection we will use cross-validation, to try to pick the model that generalizes best.
We will generate 30 splits with 70% of the data and use the other 30% to calculate the RMSE metric.

validation_data <- mc_cv(df_selection, prop = 0.7, times = 30)

We create two functions. my_lm() runs a linear regression on the training part of each CV split and gets predictions for the testing part of that split; we will run this for each formula.
return_model() fits the model to the whole selection data, to extract the parameters and standard errors we would get if we used the same dataset that was used for model selection.

my_lm <- function(formula, split, id){

    analysis_set = analysis(split)  

    model <- lm(formula = formula, data=analysis_set)

    assessment_set <- assessment(split)


    pred = tibble::tibble("id" = id,
        "formula" = formula,
        "truth" = assessment_set$y,
        "prediction" = unlist(predict(model, newdata = assessment_set)))

}


return_model = function(formula){


    model <- lm(formula = formula, data=df_selection)
    output = list(model$coefficients, summary(model))

}

We will try 5 formulas. The first one is the actual data generating process and should be the best in terms of RMSE. We exclude it from model selection, since the aim is to simulate a scenario where we don't know the actual formula behind the data; we calculate it just for reference and will pick one of the other 4 models for inference.

formulas = list("y ~ x1 + x2 +x3 + x1*x2 + x2*x3", 
                "y ~ .", 
                "y ~ x1 + x2", 
                "y ~ x1 + x2 + x3 + x1*x2",
                "y ~ x1 + x2 + x3 + x2*x3")
results = data.frame()

models = list()
for (formula in formulas){

results_selection <- map2_df(.x = validation_data$splits,
                           .y = validation_data$id,
                           ~my_lm(formula = formula, split = .x, id = .y))

model = return_model(formula)


results = rbind.data.frame(results, results_selection)
models = c(models, model )

}

We retrieve the mean RMSE across splits, calculated on the test part of each split.
We can see that the real model is the best in terms of RMSE. Among the others, the one including the x2:x3 interaction is the best, so we keep that one as our "selected model".

results %>%
    group_by(id, formula) %>%
    rmse(truth, prediction) %>%
    group_by(formula) %>%
    summarise(mean_rmse = mean(.estimate)) %>%
    as.data.frame()
##                           formula mean_rmse
## 1                           y ~ .  219.4756
## 2                     y ~ x1 + x2  625.0173
## 3        y ~ x1 + x2 + x3 + x1*x2  217.3185
## 4        y ~ x1 + x2 + x3 + x2*x3  198.9802
## 5 y ~ x1 + x2 +x3 + x1*x2 + x2*x3  196.4747

We can check the parameters and the standard errors when fitted to the whole selection dataset.

## 
## Call:
## lm(formula = formula, data = df_selection)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -854.07 -132.91   -0.97  137.65  714.53 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -245.1084   138.9684  -1.764   0.0779 .  
## x1            82.7919     1.3540  61.148   <2e-16 ***
## x2            15.7118     6.8628   2.289   0.0221 *  
## x3            -1.5177     4.5626  -0.333   0.7394    
## x2:x3          5.1676     0.2256  22.906   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 199.1 on 2495 degrees of freedom
## Multiple R-squared:  0.947,  Adjusted R-squared:  0.9469 
## F-statistic: 1.115e+04 on 4 and 2495 DF,  p-value: < 2.2e-16

And let's see what happens if we fit the same model to the inference set.

model_test = lm(formula=formulas[[5]], data=df_inference)

summary(model_test)

## 
## Call:
## lm(formula = formulas[[5]], data = df_inference)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -656.47 -138.67   -5.64  130.21  773.99 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -438.7059   140.2618  -3.128 0.001782 ** 
## x1            81.4724     1.3622  59.812  < 2e-16 ***
## x2            23.4856     6.9475   3.380 0.000735 ***
## x3             3.6309     4.5942   0.790 0.429417    
## x2:x3          4.9750     0.2275  21.869  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 199.4 on 2495 degrees of freedom
## Multiple R-squared:  0.9473, Adjusted R-squared:  0.9472 
## F-statistic: 1.121e+04 on 4 and 2495 DF,  p-value: < 2.2e-16
First, we can see that the parameters have changed a bit.
Second, we can see that the standard errors relative to the parameter estimates are generally bigger in the inference set, and will generate wider confidence intervals.

ggplot(data = ratio_df) +
  geom_point(aes(x=parameter, y=ratio, col=set), size=3) +
  theme(legend.title = element_blank()) +
  theme_light() +
  xlab("") +
  ylab("Ratio") +
  ggtitle("Absolute ratio between SD and Estimate")

Image

My idea is to add a plot with the confidence intervals so the effect can be seen directly, but I don't have the time today. Anyway, it is clear that the standard-error-to-parameter ratio is bigger in the inference set, showing that inference on the same dataset used for model selection is invalid, as it is overconfident in the results.

Remarks on R2

R2 depends on the variance of the predictors

Quoting from Shalizi[^1], assuming a true linear model
$$ Y = aX + \epsilon$$
and assuming we know \(a\) exactly.
The variance of Y will be \(a^2\mathbb{V}[X] + \mathbb{V}[\epsilon]\).
So \(R^2 = \frac{a^2\mathbb{V}[X]}{a^2\mathbb{V}[X] + \mathbb{V}[\epsilon]}\)
This goes to 0 as \(\mathbb{V}[X] \rightarrow 0\) and it goes to 1 as \(\mathbb{V}[X] \rightarrow \infty\). "It thus has little to do with the quality of the fit, and a lot to do with how spread out the predictor variable is. Notice also how easy it is to get a high \(R^2\) even when the true model is not linear!"

Below is a quick comparison between two linear relationships, one with much higher variance in the predictor than the other.
A different constant is added to one of them for better display in the plot.

library(tidyverse)

x1 = rnorm(1000, mean=0, sd=1)
x2 = rnorm(1000, mean=0, sd=10)
error = rnorm(1000, mean=0, sd=0.5)

y1 = x1 + error
y2 = 10 + x2 +  error

df = data.frame(x1,x2,y1,y2)

model1 = lm(y1 ~ x1, data = df)   # pass data= so lm finds the columns
model2 = lm(y2 ~ x2, data = df)
summary(model1)$r.squared
summary(model2)$r.squared
Given the formula above, the first fit should show an \(R^2\) of roughly \(1/1.25 = 0.8\), while the second should be close to \(100/100.25 \approx 0.998\), even though the quality of the fit is the same in both cases.

Linear Smoothers

Linear regression as smoothing

Let's assume the DGP (data generating process) is: $$ Y = \mu(x) + \epsilon$$ where \(\mu(x)\) is the mean Y value for that particular x and \(\epsilon\) is an error with mean 0.

When running OLS we are trying to approximate \(\mu(x)\) with a linear function of the form \(\alpha + \beta x\) and trying to retrieve the best \(\alpha\) and \(\beta\) minimizing the mean-squared error.

The conclusions don't change but the math gets easier if we assume both X and Y are centered (mean=0).
With that in mind we can write down the MSE and optimize to get the best parameters.

\[MSE(\alpha, \beta) = \mathbb{E}[(Y - \alpha - \beta X)^2] \\ = \mathbb{E}[\mathbb{E}[(Y - \alpha - \beta X)^2 \mid X]] \\ = \mathbb{E}[\mathbb{V}[Y \mid X]] + \mathbb{E}[(\mathbb{E}[Y - \alpha - \beta X \mid X])^2]\]

Differentiating with respect to \(\alpha\) and \(\beta\) to optimize:
The first term can be dropped since it doesn't include any parameter.

$$\frac{\partial MSE}{\partial \alpha} = \mathbb{E}[2(Y - \alpha - \beta X)(-1)] \\ \mathbb{E}[Y - a - b X] = 0 \\ a = \mathbb{E}[Y] - b\,\mathbb{E}[X] = 0$$ when Y and X are centered.

and $$\frac{\partial MSE}{\partial \beta} = \mathbb{E}[2(Y - \alpha - \beta X)(-X)] \\ \mathbb{E}[XY] - b\,\mathbb{E}[X^2] = 0 \\ b = \frac{Cov[X,Y]}{\mathbb{V}[X]}$$

The optimal beta is a function of the covariance between Y and X, and the variance of X.

Putting together \(a\) and \(b\) we get \(\mu(x) = x \frac{Cov[X,Y]}{\mathbb{V}[X]}\)

Replacing with the values from the sampled data we get an estimation of \(a\) and \(b\).

Remember the variables are centered at 0, so the variance and covariance expressions simplify.

\[ \hat a = 0 \\ \hat b = \frac{\sum_i y_i x_i}{\sum_i x_i^2}\]

With all this we can see how OLS is a smoothing of the data.
Writing in terms of the data points:
$$\hat \mu(x) = \hat b x \\ = x \frac{\sum_i y_i x_i}{\sum_i x_i^2} \\ = \sum_i y_i \frac{x_i}{\sum_j x_j^2} x \\ = \sum_i y_i \frac{x_i}{n \hat \sigma_x^2} x$$ where \(\hat \sigma_x^2\) is the sample variance of X.
In words, our prediction is a weighted average of the observed values \(y_i\) of the dependent variable, where the weights are proportional to how far \(x_i\) is from the center (relative to the variance), and proportional to the magnitude of \(x\). If \(x_i\) is on the same side of the center as \(x\), it gets a positive weight, and if it's on the opposite side it gets a negative weight. (Shalizi 2017)
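
A quick numerical check of this view of OLS as a linear smoother (centered data, no intercept):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100); x = x - x.mean()
y = 2 * x + rng.normal(size=100); y = y - y.mean()

b_hat = (x * y).sum() / (x ** 2).sum()    # OLS slope on centered data
x0 = 1.5                                  # point where we want a prediction
weights = x * x0 / (x ** 2).sum()         # weight given to each observed y_i
print(b_hat * x0, (weights * y).sum())    # same number: the fit is a weighted sum of the y_i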

If \(\mu(x)\) really is a straight line, this is fine, but when it's not, the fact that the weights depend on how far each \(x_i\) is from the center, rather than from the point we want to predict, can lead to awful predictions.

Alternative smoothers

Other methods smooth the data in different ways to mitigate this.

As quick examples, we have KNN regression, where the smoothing uses only the observations closest to the point to predict (which can get quite noisy, since it depends heavily on the sample points in a small area).

Kernel smoothers are a variant where, depending on the kernel selected, we get different smoothing. The main idea is that we use a window of data, putting more weight on points close to the one to predict. It could be a Gaussian weight around x, for example, or a uniform weight within a window. Note this is different from KNN regression, since we don't just take the average of those points; we fit a regression in that area.
A nice thing about these smoothers (and KNN regression) is that if we predict points far from the training data we won't get a linear extrapolation as with OLS; the prediction is pulled towards the closest data points seen in training.