
Weighted regression

Weighted regression consists of assigning a different weight to each observation, and hence giving it more or less importance when fitting the regression.

One way to look at it is to think of it as solving the regression problem by minimizing the Weighted Mean Squared Error (WMSE) instead of the Mean Squared Error (MSE).

\[WMSE(\beta, w) = \frac{1}{n} \sum_{i=1}^n w_i(y_i - \overrightarrow {x_i} \beta)^2\]
Intuitively, we are looking for the coefficients that minimize the MSE while putting a different weight on each observation. OLS is the particular case where all the \(w_i = 1\).
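As a sanity check of this equivalence, here is a minimal sketch (with made-up data and weights, not the example used later) showing that directly minimizing the WMSE gives essentially the same coefficients as lm() with its weights argument:

set.seed(1)
n = 100
x = runif(n)
w = runif(n)                  # arbitrary positive weights
y = 1 + 2*x + rnorm(n)

# WMSE as a function of the coefficients (intercept, slope)
wmse = function(beta) mean(w * (y - beta[1] - beta[2]*x)^2)

optim(c(0, 0), wmse)$par      # numerical minimizer of the WMSE
coef(lm(y ~ x, weights = w))  # same coefficients from weighted lm()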

Why do this? A few reasons (Shalizi 2015, Chapter 7.1):

  • Focusing accuracy: We want to predict especially well some particular points or regions of points, maybe because that's the focus in production, or because being wrong at those observations has a huge cost, etc. Weighted regression will make an extra effort to match that data.

  • Discount imprecision: OLS returns the maximum likelihood estimates when the residuals are independent and normal, with mean 0 and constant variance. When we face non-constant variance, OLS no longer returns the MLE. The logic behind weighted regression is that it makes no sense to pay equal attention to all the observations, since some of them have higher variance and are less indicative of the conditional mean. We should put more emphasis on the regions of lower variance, predict them well, and "expect to fit poorly where the noise is big".
    The weights that recover the MLE are \(w_i = \frac{1}{\sigma_i^2}\)

  • Sampling bias: If we think or know that the observations in our data are not completely random and some subset of the population might be under-represented (in a survey, for example, or because of data availability), it might make sense to weight observations inversely to their probability of being included in the sample. Under-represented observations will get more weight and over-represented ones less.
    Another similar situation is covariate shift. If the distribution of a variable x changes over time, we can use a weight defined as the ratio of the probability density functions.

    "If the old pdf was p(x) and the new one is q(x), the weight we'd want to is \(w_i=q(x_i)/p(x_i)\)

  • Other: Related to GLMs, when the conditional mean is a non-linear function of a linear predictor. (Not further explained in the book at this point.)

Is there a scenario where OLS is better than weighted regression, assuming we can compute the weights?

Example.

First we will look at the impact of weighted regression in a simulated scenario where we actually know the variance of the error of each observation. This is not realistic, but it is useful to see it in action.

library(tidyverse)

We generate 1000 data points with a linear relation between y and x: intercept = 0, slope = 5. We let the variance of the error depend on the value of x; higher values of x are associated with a higher error variance.

set.seed(23)
n=1000
x = runif(n,0,10)
error = rnorm(n,0, x/1.5)
df = data.frame(x)
df = df %>% mutate(y = 5*x + error)
Visually:

ggplot(data=df, aes(x=x, y=y)) +
  geom_point(alpha=0.3) + 
  # geom_smooth(color="blue") +
  # geom_smooth(method = "lm", mapping = aes(weight = (1/sqrt(x)^2)),
  #             color = "red", show.legend = FALSE) +
  NULL

[Figure: scatterplot of y vs x; the spread of y increases with x.]

Linear regression
ols = lm(formula = "y~x", data=df)
summary(ols)

## 
## Call:
## lm(formula = "y~x", data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.868  -1.720  -0.137   1.918  14.722 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.19192    0.24278   0.791    0.429    
## x            4.95585    0.04148 119.489   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.855 on 998 degrees of freedom
## Multiple R-squared:  0.9347, Adjusted R-squared:  0.9346 
## F-statistic: 1.428e+04 on 1 and 998 DF,  p-value: < 2.2e-16
We get an intercept of 0.19, non-significant (the actual value is 0), and a slope of 4.96 (the actual value is 5).

Weighted linear regression
wols = lm(formula = "y~x", data=df, weights = (1/sqrt(x)^2) )
summary(wols)
## 
## Call:
## lm(formula = "y~x", data = df, weights = (1/sqrt(x)^2))
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8880 -0.8601 -0.0016  0.8936  4.6535 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.001483   0.030072   0.049    0.961    
## x           4.993473   0.021874 228.286   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.498 on 998 degrees of freedom
## Multiple R-squared:  0.9812, Adjusted R-squared:  0.9812 
## F-statistic: 5.211e+04 on 1 and 998 DF,  p-value: < 2.2e-16

We get an intercept of 0.001, again non-significant but much closer to 0 and with a lower standard error, and a slope of 4.99, also much closer to the actual value of 5 and with a lower standard error.

Conclusion: if we know the right weights, we can get better estimates from a linear regression in the presence of heteroscedasticity.

Inference is not valid in the dataset used for model selection.

Let's say we have a dataset and we want to fit a model to it and do some inference such as obtaining the coefficients and look for their confidence intervals.

For such a task we would first need to find a model that we think approximates the real data generating process behind the phenomenon.
This will be the model selection step.
Then we would look at the output of our model and get the standard error of the coefficients or calculate the confidence interval or any other similar task. This will be the inference step.

The issue here is that, if we don't know the true model and we do model selection, our chosen model will itself be a random object. Why? Because the particular dataset we are using is also a set of random variables. Other datasets might return another model formula as the best among our options, since those datasets would have other observations and particularities.

Main problem:

since we are selecting a model based on a particular dataset, the standard errors and p-values will be smaller than the actual ones.

"That means there is some extra randomness in your estimated parameters (and everything else), which isn't accounted for by formulas which assume a fixed model.
This is not just a problem with formal model-selection devices like cross-validation. If you do an initial, exploratory data analysis before deciding which model to use - and that's generally a good idea - you are, yourself, acting as a noisy, complicated model-selection device" (Shalizi 2017)

The most straightforward way to deal with this (if you are using independent observations) is to split the data, do model selection in one part and then fit the best model in the other part. Your second fit will be the one useful for inference.
You could fit the model to the full data but that would include the part used for model selection and you would still get false, overconfident standard errors.

Let's see an example.
We will generate data following a somewhat "complicated" model with interactions. We will split the data into two equal-size parts: one for model selection and one for inference.
We will then fit a few formulas to the model selection part and pick the one with the minimum RMSE. We will compare the standard errors obtained in the model selection part with the ones obtained by fitting that model to the inference part.

Thanks to BrunoRodrigues for this post, which I used as a guideline to fit models with cross-validation in R.

We start by generating the data, including interactions.

library(tidyverse)
library(rsample)   # initial_split, training, testing, mc_cv, analysis, assessment
library(yardstick) # rmse

set.seed(1)
N = 5000
b0 = 4
b1 = 1
b2 = 2
b3 = 3
b4 = 4
b5 = 5

x1 = runif(N, 0, 10)
x2 = rnorm(N, 20, 3)
x3 = runif(N, 20, 40)
error = rnorm(N, 0, 200)

y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x1*x2 + b5*x2*x3 + error


df = tibble(y, x1, x2, x3)

We do the first split. df_selection will be used to try different models and pick one.
df_inference will be used to do the actual inference given the selected model.

prop = 0.5

selection_inference_split = initial_split(df, prop=prop)

df_selection = training(selection_inference_split)
df_inference = testing(selection_inference_split)

To select a model using df_selection we will use cross-validation, to try to get the model that generalizes best.
We will generate 30 splits with 70% of the data and use the other 30% to calculate the RMSE metric.

validation_data <- mc_cv(df_selection, prop = 0.7, times = 30)

We create two functions. my_lm() will run a linear regression on the training part of each CV split and get the predictions for the testing part of that split. We will run this for each candidate formula.
return_model() will fit the model to the whole selection data, to extract the parameters and standard errors we would get if we used the same dataset that was used for model selection.

my_lm <- function(formula, split, id){

    # fit on the analysis (training) part of the split
    analysis_set = analysis(split)
    model <- lm(formula = formula, data = analysis_set)

    # predict on the assessment (testing) part and return truth vs prediction
    assessment_set <- assessment(split)

    tibble::tibble("id" = id,
        "formula" = formula,
        "truth" = assessment_set$y,
        "prediction" = unlist(predict(model, newdata = assessment_set)))

}


return_model = function(formula){

    # fit on the full selection set to inspect coefficients and standard errors
    model <- lm(formula = formula, data = df_selection)
    list(model$coefficients, summary(model))

}

We will try 5 formulas. The first one is the actual data generating process and should be the best in terms of RMSE. We will exclude that one from model selection, since the aim is to simulate a scenario where we don't know the actual formula behind the data. We calculate it just for reference, but we will pick one of the other 4 models for inference.

formulas = list("y ~ x1 + x2 +x3 + x1*x2 + x2*x3", 
                "y ~ .", 
                "y ~ x1 + x2", 
                "y ~ x1 + x2 + x3 + x1*x2",
                "y ~ x1 + x2 + x3 + x2*x3")
results = data.frame()

models = list()
for (formula in formulas){

    results_selection <- map2_df(.x = validation_data$splits,
                                 .y = validation_data$id,
                                 ~my_lm(formula = formula, split = .x, id = .y))

    model = return_model(formula)

    results = rbind.data.frame(results, results_selection)
    models = c(models, model)

}

We retrieve the mean RMSE across the splits, calculated on the test part of each split.
We can see that the real model is the best in terms of RMSE. Among the others, the one including the x2:x3 interaction is the best, so we will keep that one as our "selected model".

results %>%
    group_by(id, formula) %>%
    rmse(truth, prediction) %>%
    group_by(formula) %>%
    summarise(mean_rmse = mean(.estimate)) %>%
    as.data.frame()
##                           formula mean_rmse
## 1                           y ~ .  219.4756
## 2                     y ~ x1 + x2  625.0173
## 3        y ~ x1 + x2 + x3 + x1*x2  217.3185
## 4        y ~ x1 + x2 + x3 + x2*x3  198.9802
## 5 y ~ x1 + x2 +x3 + x1*x2 + x2*x3  196.4747

We can check the parameters and the standard errors of the selected model when fitted to the whole selection dataset.

## 
## Call:
## lm(formula = formula, data = df_selection)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -854.07 -132.91   -0.97  137.65  714.53 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -245.1084   138.9684  -1.764   0.0779 .  
## x1            82.7919     1.3540  61.148   <2e-16 ***
## x2            15.7118     6.8628   2.289   0.0221 *  
## x3            -1.5177     4.5626  -0.333   0.7394    
## x2:x3          5.1676     0.2256  22.906   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 199.1 on 2495 degrees of freedom
## Multiple R-squared:  0.947,  Adjusted R-squared:  0.9469 
## F-statistic: 1.115e+04 on 4 and 2495 DF,  p-value: < 2.2e-16

And let's see what happens if we fit the same model to the inference set.

model_test = lm(formula=formulas[[5]], data=df_inference)

summary(model_test)

## 
## Call:
## lm(formula = formulas[[5]], data = df_inference)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -656.47 -138.67   -5.64  130.21  773.99 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -438.7059   140.2618  -3.128 0.001782 ** 
## x1            81.4724     1.3622  59.812  < 2e-16 ***
## x2            23.4856     6.9475   3.380 0.000735 ***
## x3             3.6309     4.5942   0.790 0.429417    
## x2:x3          4.9750     0.2275  21.869  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 199.4 on 2495 degrees of freedom
## Multiple R-squared:  0.9473, Adjusted R-squared:  0.9472 
## F-statistic: 1.121e+04 on 4 and 2495 DF,  p-value: < 2.2e-16
First, we can see that the parameter estimates have changed a bit.
Second, we can see that, relative to the estimates, the standard errors are generally bigger in the inference set and will generate wider confidence intervals.
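The ratio_df used in the plot below is not constructed above; here is a minimal sketch of how it might be built, refitting the selected formula on the selection set and computing the absolute SE-to-estimate ratio for both fits (the names ratio_df, get_ratios and model_selection are illustrative):

# hypothetical reconstruction of ratio_df: |SE / estimate| per coefficient, for both sets
model_selection = lm(formula = formulas[[5]], data = df_selection)

get_ratios = function(model, set_name) {
  cf = summary(model)$coefficients
  tibble(parameter = rownames(cf),
         ratio = abs(cf[, "Std. Error"] / cf[, "Estimate"]),
         set = set_name)
}

ratio_df = bind_rows(get_ratios(model_selection, "selection"),
                     get_ratios(model_test, "inference"))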

ggplot(data = ratio_df) +
  geom_point(aes(x=parameter, y=ratio, col=set), size=3) +
  theme(legend.title = element_blank()) +
  theme_light() +
  xlab("") +
  ylab("Ratio") +
  ggtitle("Absolute ratio between SD and Estimate")

[Figure: absolute ratio between standard error and estimate for each parameter, in the selection and inference sets.]

My idea is to add a plot with the confidence intervals so the effect can be seen directly, but I don't have the time today. Anyway, it is clear that the standard-error-to-parameter ratio is bigger in the inference set, showing that inference on the same dataset used for model selection is invalid, as it is overconfident in the results.

Remarks on R2

R2 depends on the variance of the predictors

Quoting from Shalizi, assuming a true linear model
$$ Y = aX + \epsilon$$
and assuming we know \(a\) exactly.
The variance of Y will be \(a^2\mathbb{V}[X] + \mathbb{V}[\epsilon]\).
So \(R^2 = \frac{a^2\mathbb{V}[X]}{a^2\mathbb{V}[X] + \mathbb{V}[\epsilon]}\)
This goes to 0 as \(\mathbb{V}[X] \rightarrow 0\) and it goes to 1 as \(\mathbb{V}[X] \rightarrow \infty\). "It thus has little to do with the quality of the fit, and a lot to do with how spread out the predictor variable is. Notice also how easy it is to get a high \(R^2\) even when the true model is not linear!"

Below is a quick comparison between two linear relationships, one with much higher variance in the predictor than the other.
A different constant is added to one of them for better display in the plot.

library(tidyverse)

x1 = rnorm(1000, mean=0, sd=1)
x2 = rnorm(1000, mean=0, sd=10)
error = rnorm(1000, mean=0, sd=0.5)

y1 = x1 + error
y2 = 10 + x2 +  error

df = data.frame(x1,x2,y1,y2)

model1 = lm(y1 ~ x1, data = df)
model2 = lm(y2 ~ x2, data = df)
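With the data argument fixed, we can compare the \(R^2\) of the two fits; given the simulated variances we'd expect roughly \(1/1.25 = 0.8\) for the first and \(100/100.25 \approx 0.998\) for the second:

summary(model1)$r.squared  # V[X] = 1,   V[error] = 0.25  -> roughly 0.8
summary(model2)$r.squared  # V[X] = 100, V[error] = 0.25  -> roughly 0.998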

Linear Smoothers

Linear regression as smoothing

Let's assume the DGP (data generating process) is: $$ Y = \mu(x) + \epsilon$$ where \(\mu(x)\) is the mean Y value for that particular x and \(\epsilon\) is an error with mean 0.

When running OLS we are trying to approximate \(\mu(x)\) with a linear function of the form \(\alpha + \beta x\) and trying to retrieve the best \(\alpha\) and \(\beta\) minimizing the mean-squared error.

The conclusions don't change but the math gets easier if we assume both X and Y are centered (mean=0).
With that in mind we can write down the MSE and optimize to get the best parameters.

\[MSE(\alpha, \beta) = \mathbb{E}[(Y - \alpha - \beta X)^2] \\ = \mathbb{E}[\mathbb{E}[(Y - \alpha - \beta X)^2 | X]] \\ = \mathbb{E}[\mathbb{V}[Y|X]] + \mathbb{E}[(\mathbb{E}[Y- \alpha - \beta X | X])^2]\]

Differentiating with respect to \(\alpha\) and \(\beta\) to optimize:
The first term can be dropped since it doesn't include any parameter.

$$\frac{\partial MSE}{\partial \alpha} = \mathbb{E}[2(Y - \alpha - \beta X)(-1)] \\ \mathbb{E}[Y - a - b X] = 0 \\ a = \mathbb{E}[Y] - b \mathbb{E}[X] = 0 $$ when Y and X are centered.

and $$\frac{\partial MSE}{\partial \beta} = \mathbb{E}[2(Y - \alpha - \beta X)(-X)] \\ \mathbb{E}[XY] - b\mathbb{E}[X^2] = 0 \\ b = \frac{Cov[X,Y]}{\mathbb{V}[X]} $$

The optimal beta is a function of the covariance between Y and X, and the variance of X.

Putting together \(a\) and \(b\) we get \(\mu(x) = x \frac{Cov[X,Y]}{\mathbb{V}[X]}\)

Replacing with the values from the sampled data we get an estimation of \(a\) and \(b\).

Remember they are 0 centered so variance and covariance get simplified.

\[ \hat a = 0 \\ \hat b = \frac{\sum_i y_i x_i}{\sum_i x_i^2}\]

With all this we can see how OLS is a smoothing of the data.
Writing in terms of the data points:
$$\hat \mu(x) = \hat b x \\ = x \frac{\sum_i y_i x_i}{\sum_i x_i^2} \\ = \sum_i y_i \frac{x_i}{\sum_j x_j^2} x \\ = \sum_i y_i \frac{x_i}{n \hat \sigma_x^2} x $$ where \(\hat \sigma_x^2\) is the sample variance of X.
In words, our prediction is a weighted average of the observed values \(y_i\) of the dependent variable, where the weights are proportional to how far \(x_i\) is from the center (relative to the variance), and proportional to the magnitude of \(x\). If \(x_i\) is on the same side of the center as \(x\), it gets a positive weight, and if it's on the opposite side it gets a negative weight. (Shalizi 2017)
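A minimal sketch of this view with made-up data (x0 is an arbitrary point at which to predict): the weighted average of the \(y_i\), with weights \(x_i x_0 / \sum_j x_j^2\), matches the lm() prediction on the centered data.

set.seed(42)
x = rnorm(200)
y = 3*x + rnorm(200)

# center both variables so the intercept drops out
xc = x - mean(x)
yc = y - mean(y)

x0 = 1.5                      # arbitrary point at which to predict
w = xc * x0 / sum(xc^2)       # smoothing weight of each observation
sum(w * yc)                   # prediction as a weighted average of the y's
predict(lm(yc ~ xc - 1), newdata = data.frame(xc = x0))  # same value from lm()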

If \(\mu(x)\) really is a straight line, this is fine, but when it's not, the fact that the weights depend on how far each \(x_i\) is from the center, rather than from the point we want to predict, can lead to awful predictions.

Alternative smoothers

Other methods smooth the data in different ways to help mitigate that.

As a quick example, we have KNN regression, where the smoothing uses only the observations closest to the one to predict (and can get quite noisy, since it depends heavily on the sample points in a small area).

Kernel smoothers are a variant where, depending on the kernel selected, we get a different smoothing. The main idea is that we use a window of data, putting more weight on points close to the one to predict. It could be a Gaussian weight around x, for example, or a uniform weight over a window. Note this is different from KNN regression, since we do not simply take the average of those points; we get a regression for that area.
A nice thing about these smoothers (and KNN regression) is that if we want to predict points far from the training data we won't get a linear extrapolation as with OLS; the prediction will be pushed towards the closest data points we had in training. A sketch of this is shown below.
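A minimal sketch of the kernel idea, using the simplest variant (a Gaussian kernel-weighted average; local-regression variants replace the average with a weighted fit). The bandwidth h and the point x0 = 15 are arbitrary choices here: far from the training range the kernel prediction is pulled towards the nearest observed values, while OLS keeps extrapolating the line.

set.seed(7)
x = runif(100, 0, 10)
y = sin(x) + rnorm(100, sd = 0.2)

# Gaussian kernel-weighted average of y around the point x0
kernel_smooth = function(x0, x, y, h = 0.5) {
  w = dnorm((x0 - x) / h)
  sum(w * y) / sum(w)
}

kernel_smooth(15, x, y)                            # pulled towards the y's near x = 10
predict(lm(y ~ x), newdata = data.frame(x = 15))   # linear extrapolation instead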

Bias Variance Tradeoff

Mean squared error (MSE) is a measure of how far our prediction is from the true values of the dependent variable. It's the expectation of the squared error.

The squared error being:

\[(Y - \hat \mu(x))^2\]

where Y is the true value and \(\hat \mu(x)\) is the prediction for a given x.

We can decompose it into:

\[(Y - \hat \mu(x))^2 \\ = (Y - \mu(x) + \mu(x) - \hat \mu(x))^2 \\ = (Y - \mu(x))^2 + 2(Y - \mu(x))(\mu(x) - \hat \mu(x)) + (\mu(x) - \hat \mu(x))^2\]

So, that's the squared error. The MSE is the expectation of that.

The expectation is a linear operator so we can apply it independently to different terms of a summation.
The expectation of the first term is the variance of the error intrinsic to the DGP.
The second term goes to 0 because it involves \(\mathbb{E}[Y-\mu(x)]\), the expectation of the error, which is 0.
The third term remains as it is, since (for a fixed estimate \(\hat \mu\)) it doesn't involve random variables.

\[MSE(\hat \mu(x)) = \sigma^2_x + (\mu(x) - \hat \mu(x))^2\]

This is our first bias-variance decomposition. The first term is the intrinsic difficulty of the problem: it is the variance of the error, it cannot be reduced, and it is what it is.
The second term is how far off our prediction is from the true expected value for that particular x.

This would be fine if we didn't need to consider \(\hat \mu(x)\) a random variable itself, which it is, since it depends on the specific dataset we are using. Given another dataset our estimate would be different, despite using the same modelling method.
What we actually want is the MSE of the method \(\hat M\), not only the result of a particular realization.

\[MSE(\hat M_n(x)) = \mathbb{E}[(Y - \hat M_n(X))^2 | X=x] \\ = ... \\ = \sigma^2_x + (\mu(x) - \mathbb{E}[\hat M_n(x)])^2 + \mathbb{V}[\hat M_n(x)] \]
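The omitted step is the standard expansion of the squared error around \(\mathbb{E}[\hat M_n(x)]\): conditional on \(X = x\), and averaging over the randomness of \(\hat M_n\),

\[\mathbb{E}[(\mu(x) - \hat M_n(x))^2] = (\mu(x) - \mathbb{E}[\hat M_n(x)])^2 + \mathbb{V}[\hat M_n(x)]\]

which, added to the irreducible \(\sigma^2_x\), gives the decomposition above.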

This is our 2nd bias-variance decomposition.
The first term is still the irreducible error.
The second term is the (squared) bias of using \(\hat M_n\) to approximate \(\mu(x)\): the approximation bias/error.
The third term is the variance of the estimate of the regression function. If our estimates have high variance we can have large errors despite using an unbiased approximation.

Flexible methods will be able to approximate \(\mu(x)\) closely; however, using more flexible methods usually increases the variance of the estimate. That's the bias-variance tradeoff. We need to evaluate how to balance it: sometimes accepting some bias reduces the error much more by decreasing the variance.
Usually a larger N decreases the MSE, since it decreases both the bias and the variance terms.

Reference

Based on Section 1.4.1 of Advanced Data Analysis from an Elementary Point of View (Shalizi).

Spark and Pyspark

What's Spark?

The definition says:

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

Basically, it is a framework to work with big amounts of data stored in distributed systems instead of just one machine. This allows parallelization and hence much faster computations.
Its biggest difference from plain Hadoop MapReduce is that Spark keeps data in memory (RAM) while processing, whereas MapReduce goes through disk.

Not being a data engineer myself, I can tell you that you can use Spark to work with data stored in HDFS, S3 buckets or a data lake, for example - all distributed systems.

Since those usually store huge amounts of data, you can see how all this relates. The use case I have been exposed to, as a data scientist, is to query this distributed data and process it before using it for some purpose (modeling, reporting, etc.).

How to use it?

I haven't deployed a distributed storage system myself, but I think it's safe to assume that such amounts of data are gathered in big organizations and that probably some data engineer has already done all the setup. You just want to access the data from an environment connected to the Spark cluster.

There are several languages that can interact with Spark. Scala is the original one, but you could also use Java or Python. As data scientists we are probably more familiar with Python, so I will show you Pyspark.

Pyspark

Pyspark is an API to work with Spark using Python. In order to run it you also need Java and Apache Spark installed. In our fictional organization a data engineer might have set up a server with Jupyter notebooks linked to the data lake and with all the dependencies.

There are probably ways to connect to the remote spark server from your local machine but I haven't done that.

So, Pyspark allows you to query the data lake/big data storage from a Jupyter notebook and then convert the result to a Pandas DataFrame and work as you are used to.

Spark/Pyspark has a particular syntax that is quite clear, but it has some particularities that come from the parallelization notion. For example, many functions are lazy and don't actually retrieve the data; that only happens when you ask for it. show() or collect() do retrieve the data (and can take a while if you are working with a lot of data), while filter() or withColumn() don't.

Another thing to notice is that you will need to create/initiate a sparkContext before actually being able to query data.

To understand this and have a good amount of examples regarding the functions and syntax I highly recommend THIS SITE.

How to practice?

You can practice Pyspark queries and scripts by installing Pyspark on your local machine, even without a cluster holding distributed data. With Pyspark installed you can create some data and use it as if it were real.
You will be able to use all the functions and check them by yourself.

How to install it? You can check THIS GUIDE FOR WINDOWS.

I have struggled a bit to make it work, so these are some things I learned along the way.

  • I downloaded Java 8, since that's what the guide says and what we use at my current organization.
  • To avoid creating an account in Oracle to download Java you can check THIS SOLUTION.
  • When creating environment variables, avoid blank spaces.
  • If Pyspark doesn't run because it can't find Java, check the %JAVA_HOME% path.
  • If the error is related to a missing python3, check the %PYTHONPATH% and create, in the Anaconda path, a copy of python.exe renamed to python3.exe.

Softmax vs sigmoid

When using neural nets for a multiclass classification problem it's standard to have a softmax layer at the end to normalize the probabilities for each class. This means that the output of our net is a vector of probabilities (one per class) that sums to 1. If there isn't a softmax layer at the end, the net will output a value in each of the last cells (one per class) but without a bounded range:
just a set of numbers where the highest usually corresponds to the most probable class, but it's not obvious how to interpret the differences between them.

So, you have an ordered set of numbers, you know which one is the most probable, but you want to transform that into proper probabilities. You use the softmax layer.

You could instead use a sigmoid activation in each of the last cells to get individual probabilities. For each class, it transforms the output of the net into a probability. However, those probabilities are not guaranteed to sum to 1; in practice they won't. It's a simple proxy, but you get better intuitions with softmax.
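To make the difference concrete, for an output vector \(z\) the two activations are:

\[softmax(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}, \qquad sigmoid(z_i) = \frac{1}{1 + e^{-z_i}}\]

The softmax denominator couples all the classes, while the sigmoid looks at each output cell on its own; this is what drives the different gradients we inspect below.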

We will compare how these two approaches affect the last group of weights by inspecting the gradient after calculating the loss for an observation.

I'm using the reticulate package in R to include Python code in Rmarkdown. Pretty nice.

library(reticulate)

We import pytorch to handle tensors and neural net functions.

import numpy as np
import torch
import torch.nn.functional as F
torch.manual_seed(99)
## <torch._C.Generator object at 0x00000262714CF730>
  • 1 obs
  • 5 features (X)
  • 3 possible classes (index 1 = class 2)
  • W. 3 output cells, each one with 5 weights (one per feature)
  • W1 = W2 because we run the comparison twice (two scenarios) and we can't reuse the same weight tensor, because of the gradient calculated on it
X = torch.randn(5)
W1 = torch.randn(3,5)
W2 = W1.detach().clone() 
y = torch.tensor([1]) 

We transform everything to positive values to make it cleaner, and we call requires_grad_(), which tells pytorch that those tensors need gradients computed during the backward pass.

X = X.abs()
W1 = W1.abs().requires_grad_()
W2 = W2.abs().requires_grad_()

We define both losses (softmax and sigmoid).

Softmax

  • Weights * input: cell value
  • We change the dimension of the output to use it as input of softmax
  • We calculate the softmax (probabilities for each class that sum to 1)
  • Apply log because we will use the negative log likelihood
  • We calculate the loss (log of softmax probabilities vs actual class)

Sigmoid

  • Weights * input: cell value
  • We change the dimension of the output to use it as input of sigmoid
  • We calculate the sigmoid (individual probabilities for each class)
  • Apply log because we will use the negative log likelihood
  • We calculate the loss (log of sigmoid probabilities vs actual class)
# loss function with a softmax at the end
def softmax_loss(W):
    z = W @ X
    z = z.unsqueeze(0)
    z = torch.softmax(z, dim=1)
    z = torch.log(z)
    return F.nll_loss(z, y)

# loss function with a sigmoid activation per output cell
def sigmoid_loss(W):
    z = W @ X
    z = z.unsqueeze(0)
    z = torch.sigmoid(z)
    z = torch.log(z)
    return F.nll_loss(z, y)

We run the forward pass and calculate the loss for the sigmoid first. Then we look at the gradient.
As we can see in the results, only the weights that go to the correct class's output cell are modified. Classes one and three remain untouched. This is because the sigmoid activation only involves the individual weights of each cell (and the negative log-likelihood loss only looks at the correct class).

out_sigmoid = sigmoid_loss(W1)
out_sigmoid.backward()
W1.grad

## tensor([[ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
##         [-0.0452, -0.0867, -0.0564, -0.0492, -0.0549],
##         [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000]])
On the contrary, when running the same net with a softmax layer we see that all the weights are updated. The correct class has a gradient with the same sign as in the sigmoid example, but the other two classes now have opposite-sign gradients (which makes sense, since we want them pushed in the other direction).
This happens because the softmax includes the other classes in each cell, since they are needed to normalize and return probabilities that sum to 1.

out_softmax = softmax_loss(W2)
out_softmax.backward()
W2.grad

## tensor([[ 0.5393,  1.0346,  0.6731,  0.5868,  0.6552],
##         [-0.5576, -1.0697, -0.6959, -0.6066, -0.6775],
##         [ 0.0183,  0.0351,  0.0228,  0.0199,  0.0222]])
This is a simple case with just one layer of weights so we can clearly see this. If you had a fully connected net with more layers, this is valid just for the last one because the gradient is backpropagated and the weights from "other paths" still affect the cell that corresponds to the second class.

Conclusion

The net should evolve during training in a similar way with either last-layer activation, but the way it does so is different, and we have tried to show here why. In the end, the sigmoid still reflects the preference for one of the classes, and at each epoch it will move along the desired path, but updating only some of the weights rather than all of them at the same time.

Counter Strike: chance of winning

This CS GO Kaggle link has data about several competitive CS GO matches. In a few words:

  • These are 5 vs 5 matches where each team tries to kill the other or complete a task (planting or defusing the bomb, depending on the role you are playing) before the time expires.
  • The goal is to win 16 rounds before the other team.
  • After 15 rounds both teams switch sides/roles.

The data mostly has information about each time a player damages another one (and eventually kills them), some grenade usage, and some general data for each round, such as the amount of money spent by each team and the result of that round.

Here I have followed Namita Nandakumar's hockey example to obtain and model a basic win probability based on the lead and how many rounds have been played so far.

This is how the probability of winning looks as the game progresses, grouped by how much the current winner is leading (averaging leads greater than 4 to keep it clean).
The thin line is the empirical probability, based solely on segmenting the data.
The thick line is a local regression with its standard deviation.

[Figure: empirical and smoothed win probability by rounds played, for each lead size.]

So, as we can see, there is some noise around the trend and the approximation wiggles a bit as we move along the x axis. We would like a model where winning by some amount is always better the closer you are to 16. It is not crazy to assume that if you are winning by 3, say 15 to 12, you should always have a higher chance of winning than if you are leading 6-3.

Namita shows that xgboost is a nice tool to impose that kind of constraint on a simple model, using the monotone_constraints parameter.

params = list(objective = "binary:logistic",
              eval_metric = "logloss",
              max_depth = 2,
              eta = 0.1,
              monotone_constraints = c(1,1)) 
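A minimal sketch of how these params might be passed to xgboost (the data frame train_df, its columns rounds_played and lead, and the label leader_won are hypothetical names, not the actual fields of the Kaggle dataset):

library(xgboost)

# hypothetical training data: features = rounds played and current lead, label = leader won
dtrain = xgb.DMatrix(data = as.matrix(train_df[, c("rounds_played", "lead")]),
                     label = train_df$leader_won)

model_win = xgb.train(params = params, data = dtrain, nrounds = 100)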

What we get is a model that follows the constraints, although it has some bias for the lower lead categories. Nevertheless, it is a quick approach to approximate the probabilities in a credible way.
You could use the dataset to explore other things, since it has rich information about locations.

[Figure: win probabilities predicted by the constrained xgboost model.]

The Dirichlet distribution as a prior for the Multinomial

Based on:
http://www.mas.ncl.ac.uk/~nmf16/teaching/mas3301/week6.pdf
http://www.inf.ed.ac.uk/teaching/courses/mlpr/assignments/multinomial.pdf

The Dirichlet distribution is a multivariate distribution for a set of quantities \(\theta_1,...,\theta_m\) where \(\theta_i >= 0\) and \(\sum_{i=1}^m \theta_i = 1\). This makes it a useful candidate to model a set of probabilities over a partition (a set of mutually exclusive events). That is, a group of probabilities of exclusive events that sum to 1.
We can replace the \(\theta\) by \(p\) if it is clearer that we are talking about probabilities later on.

The PDF is:

\[f(\theta_1,...,\theta_m; \alpha_1,..., \alpha_m) = \frac{\Gamma(\sum_i{\alpha_i})}{\prod_{i=1}^m \Gamma(\alpha_i)}\prod_{i = 1}^m \theta_i^{(\alpha_i-1)}\]

where the \(\Gamma\) function is \(\Gamma(\alpha) = (\alpha -1)!\) (for integer \(\alpha\)). For more details see here.
The \(\alpha_i\) are parameters of the distribution and must be greater than 0.
When m = 2, we obtain a \(beta(\alpha_1, \alpha_2)\) distribution as a particular case of the Dirichlet.

The Dirichlet in Bayesian inference

In the same way that the Beta distribution is often used as a prior for the Binomial distribution, since it is the conjugate distribution for that case, the Dirichlet distribution is often used for Multinomial distributions, that is, where there are more than 2 possible categories (more than 2 \(p_i\)). It is also a conjugate distribution; it is simply the multinomial version of the Beta.

The multinomial distribution is the following:

\[\frac{n!}{\prod_{i = 1}^m x_i!}\prod_{i=1}^m p_i^{x_i}\]

When m = 2, it is the binomial distribution.

If we had an experiment that can be modeled as a multinomial and we want to estimate the \(p_i\), we can use the maximum likelihood estimators (frequentist) or go down the Bayesian path, where we start with a prior for each p, which we will model with the Dirichlet. The prior for each \(p_i\) will be defined by the choice of the \(\alpha\).

Going down the Bayesian path, we will have our posterior distribution: $$ P(p | x) \propto P(x|p) * P(p)$$ where \(P(x|p)\) is nothing but the multinomial distribution and \(P(p)\) is our prior for \(p\), given by the Dirichlet. We omit the normalizing denominator since it is a constant.

We then multiply the multinomial PDF by the Dirichlet and obtain:

It is important to note that we have indeed replaced \(\theta\) by \(p\) in the Dirichlet so that it is consistent with the multinomial.

\[\frac{n!}{\prod_{i = 1}^m x_i!}\prod_{i=1}^m p_i^{x_i} * \frac{\Gamma(\sum_i{\alpha_i})}{\prod_{i=1}^m \Gamma(\alpha_i)}\prod_{i = 1}^m p_i^{(\alpha_i-1)} \\ \propto \prod_i p_i^{\alpha_i + x_i -1}\]

For the proportionality, we drop everything that is a factorial (and \(\Gamma\)), since it is constant, and combine the exponents with base \(p_i\).

We then see that our posterior distribution is proportional to that term which, if we look closely, is a Dirichlet missing only its constant term! That is why the Dirichlet is called a conjugate prior: the posterior belongs to the same family as the prior (with other parameter values, of course).
It is then a Dirichlet with parameters \(\alpha_i + x_i\), and we can complete the missing term, obtaining: $$ \frac{\Gamma(\sum_i{\alpha_i + x_i})}{\prod_{i=1}^m \Gamma(\alpha_i + x_i)}\prod_{i=1}^m p_i^{(\alpha_i + x_i-1)}$$

There we have our posterior distribution for the values of \(p\) of the multinomial.

To quickly compute the expectation of each \(p_i\) we simply do: \[E(p_i) = \frac{\alpha_i + x_i}{\sum_j (\alpha_j + x_j)}\]

If we obtain new information we can repeat the process, but our new prior should be the previously computed posterior. This way we keep adding information as it is collected and updating our inference about the \(p_i\).

Clarification: the proportion of each initial \(\alpha_i\) in the Dirichlet prior over the sum of all the \(\alpha_i\) is our prior for \(p_i\). The larger their absolute values, the more weight the prior gets relative to the data, since our new \(p_i\) is a function of \(\alpha_i\) and \(x_i\). It remains to check how to adjust the \(\alpha\) according to the magnitude of \(x\), if that is needed at all.

Example

We want to model the purchase of basketball jerseys in a store. A random customer walks in and has certain probabilities of buying a Lakers jersey, a Celtics one, a San Antonio one, or one from any other team.

At first we don't know the proportions, so we start with priors \(\alpha_1 : \alpha_4 = [8,6,4,2]\), which correspond to 40%, 30%, 20% and 10% respectively.

We collect the data of 100 customers and see that the sales were the following:
Lakers: 45
Celtics: 22
Spurs: 27
Others: 6

Quickly computing, with the expectation formula, the probabilities derived from our posterior, we obtain:

Lakers = 0.442
Celtics = 0.233
Spurs = 0.258
Others = 0.067

To be more thorough we should add the variance of each \(p\). To be added in the future.

If we had computed the maximum likelihood estimates of the p's, they would be nothing more than the proportion of each team in the data, without taking our prior into account. We see that here the posterior estimates are obviously close to the proportions in the data but lean towards the prior. Remember that the weight of the prior is affected by the chosen \(\alpha\) and by the amount of data collected. The quick computation below shows both.
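A minimal sketch of the computation in R (same numbers as above):

alpha_prior = c(8, 6, 4, 2)              # prior pseudo-counts: 40%, 30%, 20%, 10%
x = c(45, 22, 27, 6)                     # observed sales: Lakers, Celtics, Spurs, Others

alpha_post = alpha_prior + x             # Dirichlet posterior parameters
round(alpha_post / sum(alpha_post), 3)   # posterior means: 0.442 0.233 0.258 0.067
round(x / sum(x), 3)                     # maximum likelihood estimates, for comparison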

In ML this is quite useful for the case where a new category appears in the test set. If it was not seen in training it would get probability 0, while with a prior we can avoid that problem.
In NLP it is quite common to use the Dirichlet distribution as a prior. Something to look into.