
My words, sometimes technical, sometimes not

I previously had a blog where I used to write about technical topics in English or Spanish (lately only English). Here you can find all that old content plus anything new that I publish; this is the actual source of truth.

You can expect to find topics around statistics, AI, machine learning, learning to do business or software in general but also about myself.

Follow me on X

Deep Dive into LLMs Like ChatGPT - Karpathy

Original video from Andrej Karpathy. Masterpiece.


Fineweb

Tokenization. From words to tokens. Tokens are not words and punctuation; they can be the root of a word, or some sequence of letters without explicit meaning. This changes depending on the model. (tiktokenizer to see it in action.)

How does it work? From words to bits with an encoding could be the first step: from words to 1s and 0s. But this becomes a very long representation because we only have two symbols. We can group every 8 bits into one of 256 different bytes [0 to 255]; the sequence is much shorter because we have more symbols. We could use this, but in SOTA models we go beyond that. We use byte-pair encoding, which looks for common pairs of bytes and creates a new symbol starting from 256, and we can do this many times. More symbols, shorter representation. GPT-4 ends up with over 100k symbols.
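A toy sketch of the byte-pair-encoding idea (no relation to any real tokenizer's implementation): repeatedly find the most frequent adjacent pair of symbols and replace it with a new symbol id.

from collections import Counter

def bpe_train(ids, num_merges, first_new_id=256):
    """Toy byte-pair encoding: ids is a list of ints (e.g. UTF-8 bytes)."""
    merges = {}
    for new_id in range(first_new_id, first_new_id + num_merges):
        pair_counts = Counter(zip(ids, ids[1:]))
        if not pair_counts:
            break
        top_pair = pair_counts.most_common(1)[0][0]
        merges[top_pair] = new_id
        # Replace every occurrence of the most frequent pair with the new symbol.
        merged, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == top_pair:
                merged.append(new_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
    return ids, merges

text = "low lower lowest low low"
ids, merges = bpe_train(list(text.encode("utf-8")), num_merges=10)
print(len(text.encode("utf-8")), "bytes ->", len(ids), "tokens")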

Neural Networks

We take windows of tokens of flexible length, up to some maximum (e.g. 4k, 8k, 16k tokens). Too much would be computationally expensive.

The idea is to predict the next token. The window used is the context. We do this for every context and token in the training data. The process adjusts the probabilities of each next token: they are initialized at random and in the end they should match the statistical properties of the dataset.
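As a rough illustration (not any particular implementation), building (context, next token) training pairs from a token stream with a maximum window length could look like this:

def make_examples(tokens, max_context=4):
    """Every prefix of up to max_context tokens predicts the token that follows it."""
    examples = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - max_context):i]
        examples.append((context, tokens[i]))
    return examples

tokens = [464, 3797, 2082, 319, 262, 2603]  # made-up token ids
for context, target in make_examples(tokens):
    print(context, "->", target)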

Training is done via a neural network: a mathematical expression whose weights get updated. The input is the token sequence, the output is the probability of each token being the next one. In the middle sits the architecture, with transformers and so on. The first part includes embedding the tokens into a numerical representation.

Inference

Put in an initial token and sample from the output probability distribution; this is your next token. We do that again, but now the context is 2 tokens, and so on.
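A schematic of that loop, assuming a model(context) function that returns a probability for every token in the vocabulary (the trained network is, of course, the hard part):

import numpy as np

def generate(model, context, num_tokens, seed=0):
    """Sample one token at a time; the growing sequence is the new context."""
    rng = np.random.default_rng(seed)
    context = list(context)
    for _ in range(num_tokens):
        probs = model(context)                        # distribution over the whole vocabulary
        next_token = int(rng.choice(len(probs), p=probs))
        context.append(next_token)
    return context

# Dummy stand-in for a trained network: a uniform distribution over 10 token ids.
dummy_model = lambda ctx: np.full(10, 0.1)
print(generate(dummy_model, [0], num_tokens=5))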

GPT-2 train and inference

Useful because the technical parts are still relevant; they're just bigger and more complex now.

The base model is the model that comes out after training. It's an inference machine, just token prediction, but it's not useful for chat, for example. Some companies release their base models but not all of them. GPT-2 was released. A base model release requires the model architecture and the model weights.

Llama 3 is a more recent one (2024, 405 billion parameters).

Base models are somewhat good at memorization (regurgitation), which is not desirable. If you paste the first sentence of a Wikipedia page it will probably output the exact rest of the article up until some point and then deviate.

Wikipedia also has more weight in the model because the source is truthful. I'm not sure if this is because Wikipedia extracts appear more often in the corpus (citations and so on) or because some sources are given more importance than others in the pretraining methodology.

Base model is still useful to some extent without being an assistant if you are clever with the prompt.

Few-shot prompt. The model has some in-context learning ability: it can pick up a pattern in the prompt. In the video, AK prompts a list of English words with their Korean translations and leaves the last one untranslated, so the model completes it from that point.

He also shows that you can get some assistant-like behaviour via in-context learning by passing an example of a human-assistant interaction and making the model generate the answer to your actual question via inference.

Post training

Pretraining is all that we saw before: get data from the web, tokenize it, and train a base model that predicts the next token. It's not an assistant; it's like an internet text generator. That's all included in pretraining, and it's the expensive part, the one that takes millions of dollars and a lot of time.

Post-training requires much less computation than pretraining, and it's the step that moves us from a token generator to an assistant.

Because this is a neural network, we can't explicitly program the assistant, give it a personality, or make it refuse certain kinds of questions. We can only do that via training on datasets.

Programming by example. And the examples require human labelers. We train the model on these responses and it tries to imitate that behaviour.

We substitute the training dataset of the model: we remove all the internet text and start using the conversation dataset. We keep training the model, but now with this dataset, and the model will pick up the statistics of this new dataset and how conversations should go. (Supervised finetuning, aka SFT.)

Post-training for SOTA models can take around 3 hours, for example, vs. 3 months of pretraining on thousands of GPUs. This is because the post-training dataset is human-created and much smaller.

How do we go from conversations in the new dataset to tokens?

We need to encode and decode in some specific way. Each model has a slightly different methodology, but GPT-4o, for example, has a few extra tokens: one that marks the beginning of a new turn (user or assistant), then a token for the user or the assistant, then a token for the start of the actual message, then the message tokenized, and then a token for the end of the message. Then we go again: the same token marking the beginning of a turn, then the token for user/assistant, etc.
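The exact special tokens differ per model, and the names below are made up for illustration (not GPT-4o's real ones), but rendering a conversation into a flat string before tokenization looks roughly like this:

def render_conversation(turns):
    """turns is a list of (role, message); the <|...|> strings stand in for
    the special turn/role/message/end tokens described above."""
    parts = []
    for role, message in turns:
        parts.append(f"<|start|>{role}<|message|>{message}<|end|>")
    return "".join(parts)

conversation = [("user", "What is 2+2?"), ("assistant", "2+2 = 4.")]
prompt = render_conversation(conversation) + "<|start|>assistant<|message|>"
print(prompt)  # the model is asked to keep completing tokens from here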

So, when we go to ChatGPT and ask a question, it's sent to the backend encoded in the above format; they append the tokens for the start of a new assistant message and run inference from there, letting the LLM complete all the next tokens.

InstructGPT paper on SFT: the first paper to talk about post-training. It mentions the heavy human-labeler work from which the post-training conversation datasets emerged, and some of the instructions the labelers received. The dataset from OpenAI is not released, but OpenAssistant is an open-source alternative with a similar format.

Currently, LLMs are being used to help create these conversation datasets, so not as much human effort is needed. But in the end the root of all these conversations is the initial human labelers following the instructions of OpenAI and other companies. ChatGPT, for example, answers in the tone of, and guided by, those examples, so it's kind of recreating how the labelers wrote. It's a labeler text generation machine.

"What would a human labeler say in this conversation?" You are talking to a simulation of an average labeler (who is probably some skilled person but still)

Hallucinations

They exist because the model is sampling from the statistics of the training dataset, trying to answer something even if it's not the truth. The problem has improved over the years but it's still relevant.

How to fix it?

We need to include in the post training dataset some conversations where the answer is that it doesn't know.

Mitigation #1: model interrogation

What Meta did for Llama is super clever. We don't know exactly what the model knows or doesn't know, so we need to let it decide in some way. They assume there is some internal representation of lack of knowledge, some neuron that gets activated when the model doesn't "know" something. So, to instill that pattern, they take random text (from Wikipedia, let's say) and use an LLM to create a few questions with factual answers about that text. They interrogate the model with those questions and compare the answers to the actual truth (also with an LLM as judge, no need for a human), a few times per question. If the model answers correctly, the conversation output is fine and all good. But if the model hallucinates and answers wrongly (as judged by another LLM against the actual truth), then the answer to that question in the conversation dataset becomes "Sorry, I don't know". If you add some amount of answers like this (of course you can't add every question the model gets wrong), the model picks up the pattern: when that neuron related to lack of knowledge gets activated, answering "I don't know" is what it should do. This worked quite well to mitigate hallucinations!
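A rough sketch of that pipeline; every function name here is a placeholder standing in for an LLM call, not Meta's actual tooling:

def build_idk_examples(documents, question_llm, model_under_test, judge_llm, n_tries=3):
    """For each document: generate factual questions, interrogate the model a few times,
    and turn reliably wrong answers into 'I don't know' training conversations."""
    new_examples = []
    for doc in documents:
        for question, true_answer in question_llm(doc):
            answers = [model_under_test(question) for _ in range(n_tries)]
            correct = [a for a in answers if judge_llm(question, a, true_answer)]
            if not correct:
                # The model hallucinates here: teach it to refuse instead.
                new_examples.append((question, "Sorry, I don't know."))
    return new_examples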

Mitigation #2: tool use

Allow the model to search the internet; allow the model to use tools. We do this by introducing new tokens, in this case special tokens for search_start and search_end with a query in between. The assistant will look up that query in a browser and paste the information it gets right after the special tokens, so the internet information is now in the context. It goes directly into the model, like refreshing our memory as humans.

To add this functionality we again teach the model by example, adding a bunch of examples to the dataset of how to use search.
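Purely as an illustration of what such a training example could look like once rendered (the special token names are invented, not any model's real ones):

# Illustrative only: the special token names below are placeholders, not any model's real ones.
tool_use_example = (
    "<|user|>Who won the 2022 World Cup?"
    "<|assistant|><|search_start|>2022 World Cup winner<|search_end|>"
    "<|search_result|>Argentina won the 2022 FIFA World Cup final against France...<|search_result_end|>"
    "Argentina won the 2022 World Cup."
)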

Knowledge in the parameters == vague recollection (e.g. of something you read 1 month ago).
Knowledge in the tokens of the context window == working memory.

Models need tokens to think

Given the neural network architecture, there is a finite amount of computation available per token. Given the context, the forward/inference pass predicts the next token using the network's capability, but that capability is finite/limited. We should try to spread the computation, "the thinking", across many tokens, so that the full computation power of the network is used for each token and we end up spending more computation in total on our answer.

Concretely, he shows a simple math problem. If we aim for the model to answer directly, we are forcing the network to use its context and finite computation to answer in a single attempt; everything that comes after that answer is a post hoc justification. If instead we aim for the model to elaborate and go step by step (disguised CoT?), it uses the full computation for each step, and by the time it outputs the answer it has a good detailed context from which to produce that final "calculation".


This is the same reason why models are not good at counting: too much is expected from a single forward pass with finite computation.

"Use code" is a great way to make the model good at those tasks, because the model is good at copy-pasting and code gives the right answer. You can just copy the string into code and use Python to actually count the number of letters or whatever. Same for calculations. It's much more likely to get the right answer than relying on the model's "mental arithmetic".

It's also interesting to understand why a model might be good at solving complex PhD-level math problems but fail at simple tasks like "what is bigger, 9.11 or 9.9?", which is usually answered wrongly or randomly.

One hypothesis he mentions is that some research team noted that in the Bible, verse 9.11 comes after 9.9 (so 9.11 > 9.9 in terms of verses), and this may confuse the neural network, but it's a problem that is not fully understood.
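Both failure modes have the escape hatch he mentions: hand the operation to code instead of the model's forward pass. In Python:

word = "strawberry"
print(word.count("r"))   # 3: the interpreter counts, not a single forward pass
print(9.11 > 9.9)        # False: an explicit comparison sidesteps the confusion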

Reinforcement Learning

The last major stage. Sometimes it is included as part of post-training, but it's really a separate major step.

It's like going to school.

He compares the training of a model to a textbook: the general explanations are like pretraining; the worked examples in the book are like post-training, with examples of how things should be solved (how to answer like an assistant); and then there are exercises for which the student doesn't have the solution and needs to try to solve them. You might have the final answer but not the path towards it. Reinforcement learning is like this last step.

The motivation is that we as humans (labelers) don't know the best way for the LLM to solve a specific problem, such as a math problem. He shows a few options: going straight to the answer, doing some arithmetic, talking it through in natural English and giving the answer, setting the problem up as a system of equations, etc. What's easier for us might not be easy for the LLM, so we need to find which approach gives the best results.

So, the idea is to generate multiple (thousands of) solutions for some non-trivial problem and keep the stochastic solutions / inferences / token sequences that led to a right answer among all the tries. Some will get the right answer, some won't. Then the model is retrained on the right solutions. So it's not human-labeled anymore: it's just trying solutions and re-training on the ones that were correct, so the network learns to keep doing that in similar situations in the future.
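A heavily simplified sketch of that loop (sample many solutions, keep the ones whose final answer checks out, retrain on those); sample_fn and the answer parsing are placeholders, not any lab's actual implementation:

def extract_final_answer(solution_text):
    # Placeholder: in practice you would parse a boxed/final answer from the solution.
    return solution_text.strip().split()[-1]

def rl_on_verifiable_problems(sample_fn, problems, num_samples=1000):
    """problems: list of (prompt, correct_answer). sample_fn(prompt) returns one
    stochastic solution string. Returns the (prompt, solution) pairs to retrain on."""
    keep = []
    for prompt, correct_answer in problems:
        for _ in range(num_samples):
            solution = sample_fn(prompt)
            if extract_final_answer(solution) == correct_answer:
                keep.append((prompt, solution))   # these trajectories get reinforced
    return keep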

Pretraining and post-training are quite standard and used across all providers, but the reinforcement learning step is at an earlier stage and not standardized. Different providers are trying different approaches, and how the details of the process are handled (which is a simple idea overall) has a high impact and is not trivial; details like how to select "the best answer" among the correct ones play a big role.

Deepseek R1

The paper was innovative and a game changer in part because it is open and explicit about RL, while OpenAI and others kept the details to themselves.

With RL, the model learns over time to give better answers to the questions, and it uses more and more tokens to do it. It "discovers" that trying many paths, backtracking and trying again is better for reaching a good answer. Chains of thought emerge without being hardcoded anywhere by the researchers (which would be impossible anyway; it's something the model needs to discover).

A "thinking/reasoning" model is one that has been trained with Reinforcement Learning

GPT-4o is not a reasoning model; it's mostly SFT (learn by example, just finetuned, no RL; he says there is a bit of RL but we should think of them as SFT really). DeepSeek uses RL. o1 and o3 are also RL models.

AlphaGo

RL made it possible for the model to surpass top players' Elo, while supervised learning was not capable of that. RL is not restricted to human-like plays, and that's how move 37 happened: a play that was not expected by top-level players but that actually was really powerful. This happened because the training wasn't guided by supervised learning but by RL (the AI playing against itself, essentially).

Learning in unverifiable domains (Reinforcement Learning from Human Feedback)

The previous problems were easily verifiable: in RL we could just check whether the solution was correct by comparing the final output of the LLM with the right answer, either by direct comparison or with an LLM as judge, where we ask another LLM to check if the solution provided by the model is consistent with the actual solution (currently that approach is quite reliable). But this can't be done in unverifiable domains such as "write a joke about pelicans", "write a poem", etc.

For the pelican jokes, in principle you could use humans to judge whether each joke is funny and reward it, but since you need to evaluate thousands of generations for thousands of prompts, this becomes unfeasible. We need another strategy. This paper introduced the RLHF subject.

RLHF approach:

  1. Take 1000 prompts, generate 5 options each, and have humans order them from best to worst.
  2. Train a neural net simulator of human preferences (the "reward model").
  3. Run RL as usual, but using the simulator instead of actual humans.

The reward model is not a perfect human simulator but it's currently good enough to work meaningfully.
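The real reward model is a large neural net scoring whole token sequences; as a toy stand-in for the idea, here is a linear "reward model" on made-up feature vectors fitted with a pairwise (Bradley-Terry style) preference loss:

import numpy as np

def train_reward_model(pairs, dim, lr=0.1, epochs=200, seed=0):
    """pairs: list of (preferred_features, rejected_features) vectors.
    Fits w so that w @ preferred > w @ rejected via a pairwise logistic loss."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=dim)
    for _ in range(epochs):
        for better, worse in pairs:
            margin = w @ better - w @ worse
            grad = -(1 - 1 / (1 + np.exp(-margin))) * (better - worse)
            w -= lr * grad
    return w  # reward(x) = w @ x stands in for the human judge during RL

rng = np.random.default_rng(1)
good = rng.normal(size=(50, 4)) + 1.0   # fake features of preferred answers
bad = rng.normal(size=(50, 4)) - 1.0    # fake features of rejected answers
w = train_reward_model(list(zip(good, bad)), dim=4)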

Upside

We can run RL in arbitrary domains (even unverifiable ones) and empirically it gives better results.

He says this probably improves the model because of the discriminator-generator gap. It's easier for a human to discriminate than to generate: it's easier to say which jokes are good or bad than to create good ones. The labeler doesn't need to create a good joke; that hard task is left to the model, and the labeler just points out which ones are good and which are bad.

Downsides

RL is done with respect to a lossy simulation of humans. It might be misleading: we generate orderings based on a model that might not reflect actual human judgement.

RL discovers ways to "game" the reward model.

It happens that after a lot of updates, the jokes considered the best are nonsensical. With the initial updates the jokes might improve, but after some point they become much worse; a top joke could be "the the the the the" and somehow that gets a high score from the reward model. Those weird top answers are adversarial examples: the RL finds answers that slip through little paths that fire good scores without making human sense. Reward models are massive neural nets and they have cracks.

You could take these nonsensical answers and add them to your dataset with a very low ranking so the model learns they are not good jokes, but this is an infinite process; there will always be more adversarial examples for the neural net to find.

What to do? Train with RL for some time and then crop the training; don't go too far, so you avoid the adversarial-example regime.

This is true for RLHF, not RL. Plain RL can be run indefinitely because you can't really game the answer: you are looking for a specific answer and the neural net will find ways, even non-standard ways, to reach it, but it's totally verifiable. RLHF is different; a reward model can be gamed, so RLHF can't be run forever while plain RL can.

E-commerce Image Similarity via Visual Embeddings

How I implemented an API to retrieve similar images from an E-commerce in 3 steps

In this post, we explore a system to identify similar articles solely based on e-commerce images. The approach is designed with two primary objectives in mind:

Objectives

  1. Catalog Similarity:
    Determine which items in the client's catalog resemble each other using only their photos.

  2. External Querying:
    Although not implemented in the current API, the plan is to eventually allow querying with external images. The envisioned workflow is to source images from suppliers, compare them with the client's catalog, and perform this embedding generation locally to keep the API lightweight.

Proposed Solution

The core idea is to use a pretrained image model to extract embeddings from each photo. Once these embeddings are available, we can perform similarity searches to find items that are visually alike.

For our implementation, we experimented with both OpenAI's clip-ViT-B-32 and ResNet. In general, both models produced comparable results, though we opted for CLIP in our main experiments.

Implementation Steps

Step 1: Download Catalog Images

  • Script: embeddings/download_images/get_images.py
  • Details:
    This script downloads all catalog images from the e-commerce using a ThreadPool to speed up the process.

Step 2: Generate and Index Embeddings

  • Script: embeddings/clip_faiss.py
  • Details:
    The script generates embeddings for each photo and stores them in a Faiss index, which is saved under embeddings/faiss_index/.
    Note: Since the process is deterministic, a simple overwrite will not impact the results. Idempotent as they call it.

Additional notebooks are available to illustrate the process, check results, and experiment with alternative models and tests.
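The actual script isn't reproduced here, but the core of Step 2 roughly amounts to the following (file names and paths are illustrative; the real embeddings/clip_faiss.py may differ):

import glob
import numpy as np
import faiss
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")

image_paths = sorted(glob.glob("images/*.jpg"))           # downloaded in Step 1
embeddings = model.encode([Image.open(p) for p in image_paths])
embeddings = np.asarray(embeddings, dtype="float32")

index = faiss.IndexFlatL2(embeddings.shape[1])            # exact L2 search
index.add(embeddings)
faiss.write_index(index, "embeddings/faiss_index/catalog.index")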

Step 3: Query the Faiss Index

  • API Functionality:
    The Faiss index is already built. We expose an API endpoint where you can pass an article_id (used during the embedding generation) and retrieve the most similar items.
    Validation:
    As a sanity check, querying an article should return itself as the top match with a distance of 0.
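A sketch of the lookup behind that endpoint (names and paths are illustrative; the real API code may differ):

import faiss

index = faiss.read_index("embeddings/faiss_index/catalog.index")

def most_similar(article_id, article_ids, k=5):
    """article_ids maps each row of the index back to its catalog article_id."""
    position = article_ids.index(article_id)
    query = index.reconstruct(position).reshape(1, -1)
    distances, rows = index.search(query, k)
    # The first hit should be the article itself with distance ~0 (the sanity check above).
    return [(article_ids[r], float(d)) for r, d in zip(rows[0], distances[0])]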

Conclusion

By leveraging pretrained image models and efficient similarity search with Faiss, this approach provides a scalable method for identifying visually similar items in an e-commerce catalog. This system not only improves internal catalog management but also sets the groundwork for integrating external image queries in the future.

Kullback-Leibler divergence

To understand KL divergence we first need to understand entropy. The most important thing to keep in mind is that entropy can be thought of as a measure of "information" or, what I like most, as a measure of the expected surprise one gets for every observed value of the distribution.

For a highly concentrated distribution, you will almost surely get a value from the dense region and the expected surprise will be really low (but you are highly surprised when you do see a value off that region!).

Relative entropy

How does this relate to KL? Well, Kullback-Leibler is a divergence, but it is also called relative entropy: how that measure of entropy differs between two distributions. I find that easier to grasp.

Usually we have a true distribution \(p(x)\) and an estimated distribution \(q(x)\) that we use to approximate \(p(x)\). KL divergence can help us understand how well it does that job.

If entropy is:

\(H[x] = - \sum_x p(x) \ln p(x)\)
(originally \(\log_2\) but \(\ln\) works too and it's used everywhere )

\[KL(p||q) = - \int{p(x) \ln q(x)dx} - (-\int{p(x) \ln p(x)dx})\]
\[ = - \int{p(x) \ln \frac{q(x)}{p(x)}dx}\]

The first row is clear: it is the cross-entropy of \(q\) under \(p\) minus the entropy of \(p\). This is the relative entropy between \(p(x)\) and \(q(x)\).

Important: KL divergence is not symmetrical, \(KL(p||q) \neq KL(q||p)\), so KL is not a distance metric (though it can loosely be thought of as one). It's actually a divergence.

Why is it not symmetrical?

I think about it this way: this is a dissimilarity between one distribution and how we approximate it. Think about two totally different distributions (awful approximations of each other):

  • one highly centered around one specific value
  • the other with a uniform-ish shape, highly dispersed.

The "surprise" you get from the two directions of approximation is not the same.

If you are approximating the highly dense distribution with the uniform one, you are approximating it with a distribution whose surprise \(-\ln q(x)\) is, let's say, moderately high everywhere, even in the specific dense region.

But the other way around, if you approximate the dispersed distribution with the dense one, you are using a distribution with high surprise over most of the region of the dispersed distribution and really low surprise at one specific value.

The term \(\int{p(x) \ln q(x)dx}\) behaves differently in the two scenarios, and there is no guarantee, or need, for them to match.
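A quick numeric illustration of the asymmetry (toy distributions of my own choosing), using scipy's entropy function, which computes the KL divergence when given two distributions:

import numpy as np
from scipy.stats import entropy

p = np.array([0.90, 0.05, 0.03, 0.02])   # highly concentrated
q = np.array([0.25, 0.25, 0.25, 0.25])   # uniform

print(entropy(p, q))   # KL(p || q): approximating the peaked p with the uniform q
print(entropy(q, p))   # KL(q || p): the other direction gives a different number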

References

Deep Learning, Bishop
Wikipedia

Mutual Information

When two variables \(x\) and \(y\) are independent, their joint distribution will factorize into the product of their marginals \(p(x,y) = p(x)p(y)\). If the variables are not independent, we can gain some idea of whether they are "close" to being independent by considering the KL Divergence between the joint distribution and the product of the marginals, given by:

\[I[x,y] = KL(p(x,y)||p(x)p(y))\]
\[= - \int \int p(x,y) \ln \left(\frac{p(x)p(y)}{p(x,y)}\right) dx\,dy\]

which is called the mutual information between \(x\) and \(y\).

Thus, the mutual information represents the reduction in uncertainty about \(x\) by virtue of being told the value of \(y\) (or vice versa)

From a Bayesian perspective, we can view \(p(x)\) as the prior distribution for \(x\) and \(p(x|y)\) as the posterior distribution after we have observed new data \(y\). The mutual information therefore represents the reduction in uncertainty about \(x\) as a consequence of the new observation \(y\).
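A small numeric example (a made-up 2x2 joint distribution), computing the mutual information directly from the definition:

import numpy as np

p_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])           # joint distribution p(x, y)
p_x = p_xy.sum(axis=1, keepdims=True)     # marginal p(x)
p_y = p_xy.sum(axis=0, keepdims=True)     # marginal p(y)

mutual_info = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print(mutual_info)   # would be 0 if x and y were independent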

Obsidian Workflow

This is my first personal system for Obsidian, with some actual rules instead of improvising everything.

I've read Kepano's way and this other guy's; I think both are useful in some way, but I'm sticking closer to Kepano. Not copying him, though, as he has a lot of templates with structured information in the YAML, since he seems to create a lot of lists and rankings. Currently I'm using Obsidian a lot for technical notes, daily notes and paper annotation, but also some lists such as books I've read, movies, etc. It's a mix, and I'm not sure what info to put for everything that is not a book or a game. For notes on some technical topic I can put the area or the author, but there is no ranking.

Also, I like folders and he hates them. I'll do the following:

  • Have folders by general topic to order the notes (books, movies, companies I work for, projects). Notes will be written where they belong the most. For general technical topics (such as XGBoost) I might create a data science or technical folder with everything there. I'll add reasonable tags; many notes in different folders might share tags.
  • Notes across folders might share useful information. Maybe I have some deployment technical note for one company that could apply to another project, or I'll want to find it without remembering where I applied it, so I'll also add tags so I can see everything together using dataview.
  • I'll try to add relevant information usable by dataview in some cases, such as Author for books, or technical notes.
  • I'll have a [[tags]] folder with relevant tags with queries from dataview so I can have stuff organized.
  • I'll keep tags in English for technical stuff, but for books, series, etc., where it's more natural to remember in Spanish (I'm from Argentina) and they probably won't be shared, I'll use them in Spanish (such as "policial", "novela", etc., words that I never use in English).

Entropy and information

A term that comes from information theory. The most intuitive way to think about the information of a variable is to relate it to the degree of surprise on learning the value of the variable X.
This definition is mentioned both in Deep Learning (Bishop) and in Hamming; the former is the text I was reading just before starting this note.

So, having a variable X with distribution p(x), what is h(x), the information of observing X? This quantity h(x) should be a monotonic function of p(x). Remember, information is tied to surprise: observing an almost sure event (high p(x)) reveals less information than observing an unexpected event (low p(x)). In the extreme, observing a known value gives 0 information.

How to define information mathematically follows (somewhat intuitively) from the requirement that observing two independent variables x and y should provide an amount of information equal to the sum of the individual informations.

\[h(x,y) = h(x) + h(y)\]

We know that for independent events \(p(x, y) = p(x)*p(y)\)

It can be derived then that

\[h(x) = - \log_2 p(x)\]

The log provides the summation coming from a multiplication. The negative sign ensures 0 or positive values; remember that we are taking the log of a value between 0 and 1.

Now, there is another important term, entropy. It summarizes the average amount of information that is transmitted if a sender wishes to transmit the value of a random variable to a receiver. The entropy is the expectation of the information, with respect to the distribution p(x).

\[H[x] = - \sum_x p(x) \log_2 p(x)\]

For p(x) = 0, where the log would cause problems, we take \(p(x) \log p(x) = 0\).

Classical information theory uses \(\log_2\) because it relates to bits and the number of bits required to send a message, but the book switches to \(\ln\) because it is much more common in ML; it's just a different unit for measuring entropy.
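As a quick sanity check (a toy computation of mine, not from the book), here is the entropy of a fair and a biased coin in both units:

import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution; base=2 gives bits, base=np.e gives nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # convention: 0 * log 0 = 0
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5]))               # 1.0 bit: a fair coin flip
print(entropy([0.99, 0.01]))             # much less: the outcome is rarely surprising
print(entropy([0.5, 0.5], base=np.e))    # the same fair coin in nats (ln 2 ~= 0.693)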

Taking notes is life changing.

I constantly feel that my memory is failing me more and more, even for things that I put effort into retaining.
Why is a more complex topic. I'm not sure: some medical condition? Chronic bad sleep? Nothing wrong, and everyone just has this poor memory?

The feeling is like when you read some code you wrote few weeks/months ago and you can't remember anything nor why you did it this way. Well, that but for a lot of stuff.

Whatever the cause, I've decided to help myself, stop thinking "I will remember this, it's super important", and begin taking notes regularly. This has helped me in a huge way for personal topics but also for work and technical subjects.

It helps in two ways:

1) Writing helps memorization: you go through the material again and you need to organize your ideas a bit to put them on paper.
2) I have a place to fall back on where I can find information I should have memorized.

I started doing it randomly for everyday things, like things I want to buy for the house, places I would like to visit, places I've gone.
This Christmas I'm also writing the gifts I gave, and what I received. I hate the feeling of someone telling me "Hey, this is the shirt you gave me!" and I'm speechless for a second and it's obvious I don't remember, or to avoid giving twice the same gift.
I'm expanding these notes as I encounter situations where I regret the status of my memory.

But taking notes is also huge for me in terms of Work and research.

Some examples:

  • Daily notes on what I've done in my full-time job (YYYY-MM-DD). I can always come back if someone asks, and sometimes I add brief notes about decisions made in meetings. Especially useful on Mondays, when I forget what I did or talked about on Friday.
  • Notes on how I did something for the first time. In my job or in my side businesses, including decisions, commands, installation processes. I find myself coming back to these really often and I think this saves me huge amounts of time, stress and frustration. It's surprising how I forget THE command I've been running twice a day for some deployment test if I have to come back to that topic two weeks later. Something that seems impossible to forget at some point just vanishes. Also super useful for my side business, where I set something up once (DB, platform, server, whatever) and need to revisit how I did it or what resource I followed in case I have to do it again.
  • Studying. I'm also taking side notes on books and papers that I later write down as more structured texts: what I understood about papers, doubts, summaries. As usual, while reading I feel I understand, but if I have to revisit the paper without notes it feels like I need to re-read it. Many times I read my notes and am surprised because I don't remember a lot of the content; it's like the note was given to me by someone else.

Motivation

It's a tough topic because everyone has a different perception and it's hard to share emotions, but I feel that what I call motivation really has an impact on me. Sometimes it's a silent impact (I'm talking now about lack of motivation), noticeable only when it comes back.
You don't know what you don't know. You don't know how much energy and willingness to do things you should have if you never realized it.

My biggest adulthood lack of motivation came from jobs. I can't complain: I work in software/machine learning, good salaries, remote work if desired, etc., but still, a day-to-day that is not fulfilling has devastating effects. Fully remote work post-pandemic probably plays a role in this too, although I think the net effect is positive, since I can't value enough the commute time regained.

The difference between being motivated to do something and being indifferent is the largest gap possible. If you truly desire something and work for it, the odds of succeeding increase exponentially.

Things I've realized that affect my motivation:

  • Sleep. I have regular issues sleeping. Usually after 2 days of bad sleep my brain doesn't work anymore. I work poorly, I get anxious because I can't get anything done nor can I think properly. I sleep badly again, and so on. Disaster.
  • Anxiety. It's related to sleep but not always. Feeling that I can't do something, or that I don't understand some topic or how to solve a problem, makes me anxious, and after a while I want to quit. I trust I can understand or solve complex machine learning topics, but not having a clear north is really stressful. If only I could have someone guiding me a bit, everything would look better. Not knowing if you are on the right path when you have a short time to figure something out is horrible. For instance: we need to solve problem X. You ask for a few days of research, you accumulate papers or blog posts, but sometimes none is exactly your solution, and each one has something different, and then you don't know what to read, and all are long pieces of text, and they mention other topics you don't know. I feel anxious just writing this, but one needs to make peace with the fact that you can't have a PhD in every topic you need to solve. Maybe in none, if you are in industry and responsible for multiple projects, and things will probably be done suboptimally. But it's hard to really accept that, and on top of it, if you have a boss or some colleague asking for this, the stress is twice as big, at least for me. Fear of disappointing. Maybe that's another whole note.
  • Daily meetings. Daily standups or just status meetings are terrible for me. I feel the pressure to explain myself every day, to tell someone else what I did for 8 hours, and sometimes, when motivation is at a low level, it's not that much, or maybe I'm just researching without having grasped full understanding so it seems unfruitful, and this goes back to anxiety. Every day feeling the need to have something meaningful to share. Currently I don't have daily meetings; I have some freedom to work and then bring my manager up to speed. I feel more free and, funnily enough, I feel that I make more progress on a daily basis this way.
  • Social media. You can see unrealistic success stories all day long that make it feel like someone found the way to make a living from a 5-minute revelation and minimum effort. All fake, all lies, but you can quickly fall into the trap of feeling less than others. You just have to compare yourself with your previous self, not with others who are probably fake anyway. On top of that, when you realize all the time you get back, you feel like your days are longer and you can fight the anxiety better. At some point I realized this, and whenever I automatically reached for my phone, which happened really often, to open Twitter, I would hit my wrist (without hurting!), take a deep breath, and drop the phone. You get back the 5/10 minutes you would have spent on social media, and that accumulates fast.

I've quit because of this, because of not being happy, because of being stressed every morning by lack of progress. Maybe I wasn't up to the task; maybe if you know exactly what you are doing you can move forward steadily without worrying about this. But I'm sure there was another time, in another project, where I felt all this lack of motivation and it was not a "skill issue" but continuous urgency. Every morning I could wake up (or maybe not, you don't know until you check your phone) to some urgent message about modifying something because yadda yadda the client. Sometimes it made sense, others... not so much, and that feeling of being alert and thinking about Slack messages all the time was awful: you can't focus, you can't feel you are working in peace on something.

Consequence

Eventually I quit all that and started learning how to set up a job board, reading, drinking coffee. I took a 6-month sabbatical, or at least an industry sabbatical: I would still spend time at the computer learning, working on the job board, etc. But I was free and owner of my own time, with no one expecting anything. If I wanted to do nothing and just play with my dog all afternoon, there was no daily standup the next morning to dread. I could enjoy things without worrying.

Unfortunately, in those six months I didn't make a fortune nor win the jackpot, so I went back to industry, but without rushing. I looked for jobs that, at least before joining, made me want to join. Maybe because of the area, maybe because of the technology, and in some cases maybe because it seemed chill. I had my share of rejections and processes that went nowhere, but I ended up in a startup with a really enjoyable day-to-day structure. I'm really grateful for it: I have a couple of meetings per week and that's it. I can do things at my own pace (given some realistic timelines), I can share ideas with my manager, but I don't need to impress anyone on a daily basis. You could think that I'm coasting, but no: I work hard, I learn a lot of stuff and I give the best of me, because I feel comfortable and grateful for how my manager runs the startup.

Conclusions

  • Fix your sleep. Try everything and don't give up too quickly. You can try:
    • Magnesium
    • Having dinner earlier
    • Start relaxing one hour before (lights down, no screens, reading a bit, etc). This is the most unrealistic for me, can be done once, not regularly if you are anxious.
    • Drink some relaxing tea.
    • Legal drugs (alprazolam or similar, check with doctor.)
    • Nose strips
    • Meditate

You need to find the main cause, but if it's only "being anxious", start trying stuff; at least you will feel you are doing something about it.

  • If you can afford it, change your job, take a sabbatical, take some time and look for what you really want.

Expectation Maximization for Poisson process

The problem

What if you want to run a regression to estimate the coefficients and retrieve the data generating process of some data BUT you are in a scenario with censored data? How does that affect my estimation? Can I do better?

Censored data

In short, you have censored data when some observation is unobserved or constrained because of some specific reason which is natural and unavoidable.

Unobserved

For instance, in survival analysis, if the subject is not yet dead (luckily) you can not observe the time of death yet and hence you don't know if he will die tomorrow or in two years. You can't observe yet the death date.

Constrained

For goods with limited stock, once you deplete it, by definition you can't sell more and you can't observe all the demand you would have had if more items were available. An airline, when selling seats, can't observe the full demand after all the seats have been sold; maybe more people were willing to fly.

We will simulate an example of the latter case, an airline, with simple toy data.

import numpy as np
import statsmodels.api as sm
from scipy.stats import poisson
import matplotlib.pyplot as plt

We create some fake data where seats sold are generated from a Poisson distribution, with an intercept (2) and a coefficient related to how many days are left until departure (-0.2): the further from departure, the fewer seats are sold.

What we also do is constrain the demand on the last date. We don't record the full number of seats that would have been sold under a Poisson distribution, but that amount minus 3. Only we know that, because we are the ones tweaking the data. In real life we would see data similar to the generated one, and we couldn't see the actual sales that a complete Poisson generating process would have produced (which is actually how the tickets are sold in our example).

# Step 1: Generate data
np.random.seed(100)
n_samples = 20
n_features = 1
substract_sales = 3

# Generate features
X = np.random.normal(size=(n_samples, n_features))
X = np.array(range(n_samples+1, 1, -1))
X = sm.add_constant(X)  # Add intercept term

# True coefficients
beta_true = np.array([2, -0.2])

# Generate Poisson-distributed target values
linear_pred = np.dot(X, beta_true)
lambda_true = np.exp(linear_pred)
y = np.random.poisson(lambda_true)
# y

# Right-censoring threshold in the last date (when the fare is closed)
censor_threshold = np.concatenate([(y+1)[:-1],np.array([y[-1]-substract_sales])])
y_censored = np.where(y > censor_threshold, censor_threshold, y)
is_censored = (y > censor_threshold)

Maybe it's easier to visualize.
The green curve is the actual lambda parameter for each date, based on the intercept and the real coefficient multiplied by days to departure.
The blue dots are the actual data points we see for all dates except the last one.
The grey dot is the data we would see if there were no limit on how many seats can be sold, following the Poisson distribution.
The red dot is the actual data point we have: on the last day the airline sold all the remaining seats, couldn't fulfill the true demand, and hence we see lower sales than the Poisson distribution generated.

# Plot the data
plt.xlabel('Days to departure')
plt.ylabel('Bookings')
plt.title('Data Plot')
plt.plot(X[:-1, 1], y[:-1], 'o')
plt.plot(X[:, 1], lambda_true, label='Lambda True', color="green")
# plt.plot(X[:, 1], lambda_est, label='Lambda Estimated')
plt.scatter(X[-1, 1], y[-1:],  color="grey", label="Real unseen demand" )
plt.scatter(X[-1, 1], y[-1:]-substract_sales, color='red', label='Constrained')
plt.legend()
plt.show()

[Figure: data plot of bookings vs. days to departure]

Estimation
Poisson Regression

First we estimate the parameters using a Poisson regression. We assume the data follows a Poisson process and we don't do anything to handle the constrained data.
The results are kind of OK, we get somewhat close to the true parameters, but the estimates are clearly biased.

model_censored = sm.GLM(y_censored, X, family=sm.families.Poisson())
results_censored = model_censored.fit()
print(f'Estimated Censored coefficients: {results_censored.params}')
## Estimated Censored coefficients: [ 1.86976922 -0.21124541]
Expectation Maximization

OK, here is our alternative. What we can do, a bit more sophisticated, is to apply the EM algorithm, an iterative process, to estimate the coefficients of the Poisson regression while incorporating the knowledge we have that there might be constrained data.

# Step 2: Initialize parameters
beta_est = np.zeros(n_features + 1)
tol = 1e-4
max_iter = 1000

for iteration in range(max_iter):
    # Step 3: E-step
    # Estimate the expected values of censored data
    lambda_est = np.exp(np.dot(X, beta_est))
    expected_y = np.where(
        is_censored,
        (censor_threshold + 1) / (1 - poisson.cdf(censor_threshold, lambda_est)),
        y_censored
    )
    # Ensure expected_y values are valid
    expected_y = np.nan_to_num(expected_y, nan=np.mean(y_censored), posinf=np.max(y_censored), neginf=0)
    # Step 4: M-step
    # Update parameter estimates using Poisson regression
    model = sm.GLM(expected_y, X, family=sm.families.Poisson())
    results = model.fit(method="lbfgs")
    # results = model.fit_regularized(L1_wt=0, alpha=0.1) 
    new_beta_est = results.params

    # Check convergence
    if np.linalg.norm(new_beta_est - beta_est) < tol:
        break

    beta_est = new_beta_est
## <string>:7: RuntimeWarning: divide by zero encountered in divide
print(f'Estimated coefficients: {beta_est}')

## Estimated coefficients: [ 2.11015996 -0.23573144]

We can see our estimates are of course not perfect, but the intercept is closer to the true parameter.
Let's visualize the results to get a clearer intuition.

uncensored_predictions = results.predict(X)
censored_predictions = results_censored.predict(X)

Promising!
For the furthest dates we see a bias in both approaches; we are not doing better than the Poisson regression, but not worse either.
As we move closer to the departure, the EM algorithm's predictions get closer and closer to the true lambdas, while the regular Poisson regression continues on its biased trajectory.
The EM procedure adjusted the coefficients to better match the constrained data.

plt.plot(X[:, 1], lambda_true, label='Lambda True', color="green")
plt.plot(X[:, 1], uncensored_predictions, label='Uncensored Predictions (EM)', color="blue")
plt.plot(X[:, 1], censored_predictions, label='Censored Predictions', color="red")
plt.xlabel('Days to departure')
plt.ylabel('Lambda - Poisson regression')
plt.title('Lambda Predictions')
plt.legend()
plt.show()

[Figure: lambda predictions vs. true lambda]

Reflections on quitting my ML job

As I'm starting my sabbatical journey I am reading some posts by Jason Liu, since he seems to have had a career similar to what I would like if I got into ML and LLMs riding the hype wave. Furthermore, I find his writing pleasant, and his content could be useful even as a solopreneur.

One of my goals for this period is to write more. To take more notes, since I struggle to memorize without them (or even with them, but at least I can read a summary later), and to actually do more. More of everything, less thinking about it and more actual doing. From shipping products to really learning, and that includes writing.

Reading his post and starting now my journey, a real action would be to take notes about his post, which I found useful, at least some bits of it. Not taking notes would be less writing, less doing, less memorizing.

Choosing

Right now, I feel more at this point

"This despair arises from the realization of one's absolute freedom and the responsibility for creating one's own essence and purpose."

Lately, in the last few years probably, I have started to understand that quote. We can go a long way by following the usual paths, at least the ones usual to our surroundings. In my case, without major distress that turned everything upside down, it was high school, university, get a job, get married, retire. At least, that's something I (many?) took as given, a fate that, if I didn't screw it up, would happen automatically.

And as I was a dedicated student and did well in school, I got the first 3 steps quite easily. I never really thought of going off that path, and I saw those who did as outliers or as people with greater safety nets in case they failed. In some cases that could be true; in others it was probably me being short-sighted. I guess it was not my fault, just lack of adult life.

As years go by, I start questioning the meaning of what I'm doing. Is my job meaningful? Does it create value or help anyone? The corporate world feels like a charade. Even though I was working in analytics and studying and building ML models, I couldn't see whether that was worth the effort and time. Once you get past the hype of learning models and cool tech bits, it feels empty. I changed jobs. Still the same: I was drowning in boredom and the rat race. Needing to "impress" people, or feeling that I had to be there for work 9-5 each day without feeling motivated, was awful. I quit after a few months, willing to take some time off. What am I doing? I have no purpose and my day-to-day means nothing. I like seeing my friends and family, traveling, etc., but regular life has a lot more things and time to fill. I can't do this for 30 more years.

Fortunately, tech and ML pay high salaries, so I felt safe. I could take some time off, I could always look for another job, etc. But the feeling of "I need to do something different" was there. I started looking for other opportunities, looking for purpose in jobs, looking for motivation. I was in despair as I understood that I needed to make my own way and no one was doing it for me.

Ironically, I took another job quite quickly because it paid much more and I could just try it; maybe I was just in the wrong place. It seemed to work better: it was less of a charade, I had more time for crafting and working, and I got some motivation back as I at least felt more comfortable day to day. Not that I really had purpose, but the work at least came with less pressure. The time zone difference helped because I didn't feel like I was supposed to be there 9 hours a day, with someone one click away from sending me a Slack message. I stayed two years. Earning good money compared to my spending, saving, and "happy". Eventually everything started to decay: the job was less free, the business requests were more urgent, and I quickly lost motivation and started to fall into despair again. What am I doing? Who cares about this parquet file with funny numbers?

I quit, this time for good. Despair again, realization: I need to find something for myself, or at least try. I fear its complexity, I fear whether I can do it, whether I can make a living another way. I'm taking some time to rest, but I know eventually I will need to figure things out. On my own, making my own way, and I see how no one will do it for me.

As a side note, finding purpose, finding your way, etc. was also a thing when I thought about romantic relationships. Finding someone that you really care about and who cares about you is not easy at all, and it doesn't happen magically as I thought as a kid. I'm not going to expand on it here, but it was another topic that made me realize how much our fate depends on our own efforts.

Side side note: retirement is another one. I don't see my national retirement plan as something you can live off. I see my family, which luckily does well, but despite that nothing is granted, and we can support the elders in my family because of our own safety net. A younger me didn't know that, but as my twenties went by I started to see a lot more of adult/real life. An eye-opening period of my life.

How to be lucky

"Okay, I'm focused on getting X, but let's not forget to read the headlines."

High Agency and be the plumber

Reminder to myself: focus on bringing solutions and not the shiny new tool. I think I'm good at high agency, or at least at being responsible and wanting to finish what I commit to.
I need to work on focusing on which solution I am bringing. Finding the right niche and actually bringing value. In my case with sportsjobs.online, not just throwing things into it, but making it clear what I am providing. For other topics related to ML and LLMs, I need to decide what I'll focus on this time.

Impostor Syndrome

I'm the classical example of not trusting myself and thinking stuff like this:

but at the end of the day, you must just think I have shit taste and that you've somehow tricked me into thinking you're good when you're an impostor? Right?

I need to stop with that. When quitting I got plenty of nice words from every colleague, including managers wanting me to stay, or to come back if I get bored of not having a job. Of course, as I write that, in the back of my mind I think of exceptions, and that technical colleagues were less effusive than my managers, etc., but all of that is probably not true. I'm going against all the evidence, just guessing and making things up in my head.

How to Be Good at Many Things

Consistency. Everyone says this. I need to do it the right way now that I'll have the time. No excuses.

Do things, practice, keep going for it. Everything will be easier and quicker.

And I need to be grateful for what I have and how lucky I am to be able to get a sabbatical time to try to change the path I'm going.