What are Tokens?
Tokens are at the heart of how LLMs are so powerful. Yet they are not “logical” in the way we would like them to be... Knowing how they work is critical for every developer planning to build with AI!
If you’re interested in using an API like OpenAI’s to integrate AI functionality into your app, you will see that the pricing is based on tokens. For example, here is OpenAI’s pricing model for using GPT-4 Turbo:
You may have also seen AI-related tweets, blog posts, and news about tokens:
So what exactly are tokens? Understanding them is critical not just as terminology or for making sense of pricing, but also for understanding why certain bugs happen in AI systems.
The best explanation I’ve seen is from OpenAI founding team member Andrej Karpathy in his video Let's build the GPT Tokenizer. As the video is over 2 hours long, I’ll summarize it here for easy reference. However, I do highly recommend that everyone watch it…
The Goal
It is much easier to understand tokens if we understand the goal of using tokens. The task of Large Language Models (LLMs) is to take an input from us (the user) and to process it based on an insanely large amount of data it’s been trained on (think the whole internet!) to provide us with an output that makes sense to us as humans.
To do this processing, it is extremely inefficient for a computer to work with words directly. In addition, words cannot easily be used in the mathematical operations needed for more complex “reasoning” or for extremely complex probability calculations.
So in order to make LLMs powerful, we need to convert language into numbers. Tokens are part of this conversion process.
What are Tokens?
In order to make Large Language Models (LLMs) powerful, we want them to be able to take in as much context as possible. Think of how much worse the ChatGPT experience would be without it keeping the context of your entire conversation.
Imagine you’d like to ask it questions about a certain text you put in. Now imagine that instead of it “remembering” the text and just answering every question, you have to copy and paste the text every single time for every single question. It wouldn’t feel like a conversation and would be annoying.
So being able to take in as much information as possible is key to a powerful LLM. It is for this reason that simply using Unicode is not good enough (as of now): it creates too many extra numbers, which shrinks the effective context window. To compress the Unicode and pass in the most data in the most efficient way, researchers found that byte pair encoding (BPE) is the key.
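To get a feel for how wasteful raw Unicode is, here’s a quick check you can run yourself. This is only a rough sketch: it assumes OpenAI’s open-source tiktoken library and the cl100k_base encoding (used by GPT-4 class models), neither of which appears in the screenshots above.

```python
# pip install tiktoken
import tiktoken

text = "Tokens are at the heart of how LLMs are so powerful."

# Raw Unicode: one number per character (code point)
code_points = [ord(ch) for ch in text]

# BPE tokens via OpenAI's tokenizer library (assumption: cl100k_base encoding)
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode(text)

print(len(code_points))  # one number for every single character
print(len(tokens))       # far fewer numbers for the same text
```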
The algorithm is pretty simple. Take the sentence:
"she sells seashells by the seashore."
We can see that the characters “sh” occur very often. So to compress it, we’re going to assign a variable for “sh”:
X = “sh”
And replace it in the sentence:
"Xe sells seaXells by the seaXore."
You may also notice that the combination “se” appears a few times, so we replace it as well:
Y = “se”
"Xe Ylls YaXells by the YaXore."
We can now see that “Xe” occurs 2 times, so we can do one more replacement:
Z = “Xe”
"Z Ylls YaZlls by the YaXore."
So using this algorithm, we went from 36 characters to 28 characters (at the cost of storing the small lookup table for X, Y, and Z).
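If you want to replay those steps, here’s a tiny sketch in plain Python. The X / Y / Z variables are just the placeholders from this example, not real BPE merge rules:

```python
sentence = "she sells seashells by the seashore."
print(len(sentence))  # 36 characters

# Apply the same substitutions as above, in order
merges = {"sh": "X", "se": "Y", "Xe": "Z"}
compressed = sentence
for pair, symbol in merges.items():
    compressed = compressed.replace(pair, symbol)

print(compressed)       # "Z Ylls YaZlls by the YaXore."
print(len(compressed))  # 28 characters
```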
Now imagine running this algorithm over billions of words and sentences and extracting the most common combinations of characters that appear together. Those combinations are what tokens are.
The Tokenizer
Now that we have tokens (groups of letters that often appear together), we need an additional layer, the Tokenizer, which encodes / decodes these tokens to / from numbers. Because of this encoding / decoding procedure, there needs to be a sweet spot for how many tokens exist. If there are too many tokens (as in the case of a word-to-token mapping), the Tokenizer will be very slow and inefficient.
To keep the context of an entire sentence / paragraph / etc., the Tokenizer puts the number assigned to each token into an array:
Here the word “Code” becomes token number 2123, the word “ strings” becomes token 9246, the word “ awake” becomes token number 35477, the “.” becomes 306, and so on.
You can use the OpenAI Tokenizer or the Vercel Tokenizer to put in any sentences and see the Tokens and the ending Vector produced by the Tokenizer.
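If you’d rather do this in code, here’s a rough sketch using OpenAI’s tiktoken library (an assumption on my part; the online tokenizers above don’t require any code). The exact IDs you get depend on which encoding / model you pick:

```python
import tiktoken

# Assumption: the cl100k_base encoding used by GPT-4 class models
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("Code strings awake.")  # illustrative sentence
print(ids)  # an array of token IDs, one number per token

# Decode each ID back into the piece of text it stands for
for token_id in ids:
    print(token_id, repr(enc.decode([token_id])))
```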
Some interesting things about Tokens to keep in mind:
Tokens are not words
As mentioned above, tokens are the groups of letters that most often appear together, optimized for the Tokenizer’s encoding / decoding function. In many cases, especially in English, a token maps to a word, but in many cases it does not. For example, the words “Emojis”, “chime”, and “rhythmic” are each split into 2 tokens in the sentence below:
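You can check splits like this yourself. A minimal sketch, assuming tiktoken and the cl100k_base encoding (the exact splits will vary by model):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumption: GPT-4 class encoding

for word in ["Emojis", "chime", "rhythmic"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(word, "->", len(ids), "tokens:", pieces)
```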
Not English = More Tokens
This brings us to the issue of language. LLMs have largely been trained and optimized for English, so the same text needs fewer tokens in English versus other languages. Take a look at the same sentence in Hindi.
What was 18 tokens in English has become 66 tokens in Hindi!!
What does that mean? Since OpenAI charges on a per-token basis, if you’re working in a non-English language, you will be paying A LOT more. In addition, the context window (how much information the LLM can process at one time) will effectively be much smaller for non-English languages.
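Here’s a hedged way to measure the difference yourself. The Hindi sentence below is just my own rough translation for illustration (not the sentence from the screenshot), and I’m again assuming tiktoken with cl100k_base:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumption: GPT-4 class encoding

english = "I want to know how tokens work."
hindi = "मैं जानना चाहता हूँ कि टोकन कैसे काम करते हैं।"  # rough Hindi translation, for illustration

print("English tokens:", len(enc.encode(english)))
print("Hindi tokens:  ", len(enc.encode(hindi)))
# The Hindi version typically needs several times more tokens for the same meaning.
```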
Numbers Are Divided
Just like words, numbers are split into the digit sequences that occur next to each other most often. That means a number is not handled as one whole value, but as a sequence of several tokens. For example:
Notice that while 100 is one token, 1000 is a set of 2 tokens: 100 and 0. This is why simple math can sometimes be surprisingly hard for an LLM.
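A quick sketch to see the digit grouping for yourself (same tiktoken / cl100k_base assumption as before; exact groupings vary by encoding):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumption: GPT-4 class encoding

for number in ["100", "1000", "123456789"]:
    ids = enc.encode(number)
    pieces = [enc.decode([i]) for i in ids]
    print(number, "->", pieces)
# Longer numbers get split into digit chunks rather than kept as one whole value,
# so the model never "sees" them as a single number.
```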
Spaces & Tokens
Notice that spaces are included as part of the tokens…
While “In” is a token on its own, “a” is actually “ a” (space + a). The same goes for “ world” (space + world), “ and” (space + and), etc. This is good because spaces are not counted as extra tokens (more money saved for us!). But it may sometimes cause an issue if, for example, the user leaves an extra space at the end of a prompt, since the system expects the next token to start with a space.
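A small sketch to see the leading-space behavior (same tiktoken / cl100k_base assumption):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumption: GPT-4 class encoding

# " world" (with a leading space) and "world" are different tokens with different IDs
print(enc.encode(" world"))
print(enc.encode("world"))

# A trailing space at the end of a prompt becomes part of the input too:
print(enc.encode("In a world"))
print(enc.encode("In a world "))  # the extra space changes the final tokens
```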
Tokens are Case Sensitive
The same word starting with a capital letter vs non-capital letter will become different tokens.
In addition, when the same word is placed at the end of a sentence (i.e., as space + word), it might be split differently into tokens. For example, the word “Pearl” / “pearl” is 2 tokens at the front of a sentence, but 1 token at the end of a sentence:
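You can verify both effects, capitalization and the leading space, with a few lines (same tiktoken / cl100k_base assumption; exact counts depend on the encoding):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumption: GPT-4 class encoding

for text in ["Pearl", "pearl", " Pearl", " pearl"]:
    ids = enc.encode(text)
    print(repr(text), "->", len(ids), "token(s):", [enc.decode([i]) for i in ids])
# Capitalization and the presence of a leading space each change how the word
# is split, so the same word can cost a different number of tokens depending
# on where it appears in a sentence.
```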
Special Tokens
If you check the Vercel Tokenizer, you will notice that the ChatGPT-type user / assistant model actually has a few special tokens encoded that we do not explicitly see as a user:
This is similar to HTML: these special tokens signal information about the conversation to the LLM. While you yourself likely won’t be using them, it’s good to know that these special HTML-like tokens exist and that OpenAI created them to train the LLM for the user / assistant interface.
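If you want to poke at special tokens in code, tiktoken (assumed again, with cl100k_base) deliberately refuses to encode them unless you opt in, precisely so that ordinary user text can’t inject them:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumption: GPT-4 class encoding

# Special tokens are blocked by default; you must opt in explicitly to encode one
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))

# Treated as plain text instead, the same string costs several ordinary tokens
print(enc.encode("<|endoftext|>", disallowed_special=()))
```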
Are More Tokens Better?
As you might have seen in the tweet posted at the beginning of this post, Google announced that their next-generation Gemini model will have a context window of up to 1 million tokens!
While this sounds impressive (you could upload and analyze your entire codebase, a whole book / PDF, or even videos), are more tokens actually better? In theory yes, but when @GregKamradt pressure-tested GPT-4’s 128K-token context recall, he found a few problems.
The way LLMs work is by calculating the probability of the most likely next token through various calculations done on the vectors (remember: tokens are converted into numbers in an array, or vector). The more tokens there are to calculate over, the more complicated the calculations become, and the more likely it is that the LLM hallucinates, or gives wrong information.
He specifically found that GPT-4’s performance started to degrade after around 73K tokens, and that it was better at recalling the beginning and end of the document than the middle parts.
Now, it’s possible that Google has solved this problem. Those who have early access are impressed. But this is just a note to be wary of using extreme amounts of tokens. Double check that what the LLM gives back as a response is actually correct!
Conclusion
Tokens are at the heart of how LLMs are so powerful. Yet they are not “logical” in the way we would like them to be. They were created because this was the best way to give LLMs as much data as possible in a compressed manner while still relying on the Tokenizer as an in-between layer.
So as a developer working with tokens, it is important to understand how they work, as they can lead to weird issues / bugs and, more importantly, a bigger bill if you’re using an API that charges per token.
Have questions / thoughts / feedback? Let me know in the comments!