-Adwita Singh

If you want to skip the article or just prefer video, head over to this dudebro cuz he explains it really well: https://www.youtube.com/watch?v=BcxJk4WQVIw

If you have a terrible attention span like me, buckle up folks cuz this is actually pretty easy lol.

This article actually came about as I was reading a section from Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze.

In the 2nd chapter, The term vocabulary and postings lists, section 2.2.2 (Dropping common terms: stop words) says:

“The general trend in IR systems over time has been from standard use of quite large stop lists (200–300 terms) to very small stop lists (7–12 terms) to no stop list whatsoever. Web search engines generally do not use stop lists.”

which made no sense to me, because every tutorial I had read on this topic (prior to 2023) went into great detail about how stop-words are the bane of our existence when building language models, and must always be excluded before being passed to them.

Naturally, I assumed that was the case for all related systems (it isn't, my bros, don't assume things).

The book essentially explains how stop-words are mitigated via compression techniques, and goes on to explain why common words have little impact on document rankings thanks to standard term-weighting techniques

(the book is a banger, read it, 10/10 recommended; my French classes felt funnier when I thought about tokenizing l'ordonnance and watching my pipeline struggle).
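To see why those weighting schemes sideline stop-words, here's a quick toy sketch. It's not the book's code, just the standard idf formula idf_t = log(N / df_t) applied to made-up document frequencies:

```python
import math

# Inverse document frequency: idf_t = log10(N / df_t), where N is the
# collection size and df_t is how many documents contain term t.
# The numbers below are made up purely for illustration.
N = 1_000_000
doc_freq = {
    "the": 999_000,     # a stop-word appears in almost every document
    "ordonnance": 120,  # a rare term appears in very few
}

for term, df in doc_freq.items():
    idf = math.log10(N / df)
    print(f"{term:12s} idf = {idf:.3f}")

# the          idf = 0.000
# ordonnance   idf = 3.921
```

A near-zero weight means a stop-word barely nudges any document's score, which is why leaving it in the index costs so little in ranking quality.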

All this might sound unrelated to tokenizer algorithms, but the question is still worthwhile: how do large systems tokenize words such that perfect sentence reconstruction is possible? Because when you think about it, if you train an LLM with no stop-words, the LLM will never learn that stop-words exist in the human vocabulary.
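To make that concrete, here's a toy sketch with a tiny hand-picked stop list (not any particular library's list): once the stop-words are dropped, there is no way to rebuild the original sentence.

```python
# Tiny hand-picked stop list, purely for illustration.
STOP_WORDS = {"the", "is", "on", "of", "a", "an", "and", "to"}

sentence = "the cat is on the mat"
kept = [word for word in sentence.split() if word not in STOP_WORDS]

print(kept)            # ['cat', 'mat']
print(" ".join(kept))  # 'cat mat' -- the original sentence is unrecoverable
```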

How, then, are inputs prepared for LLM training, and how are they passed to the model at inference time?

You all have probably used some kind of tokenizer.encode() call while working with LLMs to turn the input prompt into token IDs before passing it to the LLM. Similarly, the output produced by the LLM is converted back into readable text using tokenizer.decode().
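For example, a minimal round trip with the Hugging Face transformers tokenizer might look like this (assuming transformers is installed and you don't mind downloading the GPT-2 tokenizer files; any pretrained tokenizer would do):

```python
from transformers import AutoTokenizer

# GPT-2's BPE tokenizer, used here only because it's a familiar default.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "the cat is on the mat"

# encode: text -> token IDs (the integers the model actually consumes)
ids = tokenizer.encode(prompt)
print(ids)                    # a list of integers, e.g. [1169, 3797, ...]

# decode: token IDs -> text, reconstructing the prompt, stop-words and all
print(tokenizer.decode(ids))  # 'the cat is on the mat'
```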