Embeddings are a cornerstone in the world of machine learning, transforming the way machines interpret and process complex data. At their core, embeddings are numerical representations of information — such as words, sentences, or images — mapped into a continuous vector space. In other words, they translate data into a language that machines can understand and process efficiently.

To fully grasp how embeddings work, we must understand how they are defined mathematically, how they are computed, and the various operations we can perform on them. I will start with a broad definition of embeddings before talking specifically about token embeddings and how models like GPT-3 create and process them.


A vector is simply a sequence of numbers. For instance, in a 2D space, a vector like [2, 3] signifies a movement of 2 steps horizontally and 3 steps vertically. A vector can be used to represent a position in a space relative to its various dimensions.

The number of elements in a vector determines its dimension. Therefore, a 2D vector contains two numbers, a 3D vector has three, and so forth, defining the vector's complexity and the information it can represent. When it comes to natural language processing, embeddings capture not just the standalone meaning of words but also their context.
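As a minimal illustration of these two ideas, here is a toy sketch using plain Python lists (the vectors themselves are arbitrary examples, not model weights):

```python
# A vector is just an ordered sequence of numbers;
# the number of elements is its dimension.
v2 = [2, 3]           # a 2-dimensional vector
v3 = [1.5, -0.2, 4.0] # a 3-dimensional vector

print(len(v2))  # 2: the dimension of v2
print(len(v3))  # 3: the dimension of v3
```

Real embedding vectors work the same way, just with thousands of elements instead of two or three.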

In the context of machine learning, each dimension in a vector space captures some aspect of the object’s features or context. High dimensionality allows the model to capture and process a vast range of features in the data. It's key to the model's ability to understand and generate nuanced and contextually appropriate outputs.

The number of dimensions is defined by the design of the model. GPT-3, for example, like other transformer-based models, processes input through a series of layers, each comprising multi-headed self-attention and feedforward neural networks. Its largest variant (175B parameters) has a hidden size of 12,288, which means each token is represented in a 12,288-dimensional space.

The layers that lead to embedding dimensions in models like GPT or other deep learning architectures are defined through a combination of model design choices and training processes. The architecture of a model can include different types of layers, each with a specific function. Common layers in deep learning models include embedding layers, convolutional layers (in CNNs), recurrent layers (in RNNs) and transformer blocks (in models like GPT).

These layers are designed to learn and transform data in a way that captures relevant patterns and features for the task at hand. In models like GPT, the embedding dimensions are not just initial representations but are refined through the network's layers, adding rich contextual information.

Token embeddings

There are various types of embeddings used to represent different types of data. For the sake of simplicity, I'll only be talking about token embeddings, which are a little non-traditional in that they don't represent whole words in a space: depending on the exact implementation, a token can be a word, part of a word, or even a punctuation mark. Like word embeddings, though, they capture semantic and syntactic properties.

In the case of token embeddings used by models like GPT-3, the ability to understand the context in which a word or phrase is used allows the model to generate more accurate and contextually appropriate responses.
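Mechanically, a token embedding layer is essentially a lookup table: each token id indexes one row of a matrix of shape (vocabulary size × embedding dimension). The vocabulary, dimension, and values below are toy placeholders for illustration, not anything from a real model:

```python
import numpy as np

# Toy vocabulary mapping tokens (words, subwords, punctuation) to ids.
vocab = {"the": 0, "cat": 1, "sat": 2, "##ting": 3, ".": 4}
embedding_dim = 8

# The embedding matrix: one row per token, initialized randomly.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))

def embed(tokens):
    """Look up the embedding vector for each token."""
    return embedding_matrix[[vocab[t] for t in tokens]]

vectors = embed(["the", "cat", "sat"])
print(vectors.shape)  # (3, 8): three tokens, each an 8-dimensional vector
```

In a real transformer these looked-up vectors are only the starting point; the layers above then refine them with context.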

An interesting aspect of embeddings in machine learning is their evolution during the training process. Initially, embeddings are set to random values. This randomness is key to starting the learning process, as it provides a diverse range of starting points for the model's optimization algorithms. During training, as the model is exposed to data and learns from its tasks (like predicting the next word in a sentence for language models), these embeddings are continuously adjusted.

The adjustments are made to minimize the difference between the model's predictions and the actual outcomes. Through this iterative process, the embeddings slowly evolve from their random initialization to a state where they meaningfully represent the underlying features of the data. For instance, in language models, word or token embeddings start to capture semantic and syntactic properties of the language, enabling the model to understand and generate text with high coherence and relevance.
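The mechanics of that adjustment can be sketched in a few lines. This is a deliberately simplified stand-in: the "task" here is just matching a fixed target vector under a squared-error loss, whereas real language models use prediction losses over text, but the update rule (follow the negative gradient) is the same idea:

```python
import numpy as np

rng = np.random.default_rng(42)
embedding = rng.normal(size=4)            # random initialization
target = np.array([1.0, 0.0, -1.0, 0.5])  # stand-in for "what training wants"
lr = 0.1                                  # learning rate

for step in range(200):
    grad = 2 * (embedding - target)  # gradient of squared-error loss
    embedding -= lr * grad           # gradient-descent update

# After many small updates, the random vector has converged on the target.
print(np.allclose(embedding, target, atol=1e-3))  # True
```

The embedding starts as noise and ends up meaningful only because thousands of such small corrections accumulate.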

You can perform various operations on vectors to extract information from embeddings and compare them. For example, cosine similarity measures how similar two embeddings are, irrespective of their magnitude. This operation is crucial in tasks like document clustering or finding words that share semantic meanings. Operations unlock different capabilities of embeddings: they are the tools that allow us to extract meaning, see patterns, and make predictions.
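Cosine similarity is just the cosine of the angle between two vectors. A minimal implementation, applied to toy vectors chosen to make the point:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, ignoring magnitude."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the magnitude
c = np.array([-1.0, 0.0, 1.0])  # a different direction

print(round(cosine_similarity(a, b), 3))  # 1.0: identical direction
print(round(cosine_similarity(a, c), 3))  # 0.378: only loosely aligned
```

Because the magnitudes cancel out, a and b score a perfect 1.0 even though b is twice as long.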

In natural language processing, for example, adding or subtracting word embeddings can lead to intuitive results — like the famous example of "king" - "man" + "woman" resulting in a vector close to "queen". This showcases the embeddings' ability to capture and manipulate linguistic relationships.
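The arithmetic is easy to demonstrate with hand-crafted toy vectors (these are not real Word2Vec weights; they are constructed so that the king/queen offset matches the man/woman offset, which is roughly what trained embeddings learn):

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" built so that the
# gender direction is shared between the two word pairs.
words = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.8]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "woman": np.array([0.1, 0.2, 0.8]),
}

result = words["king"] - words["man"] + words["woman"]

def closest(vec):
    """Return the vocabulary word nearest to vec by Euclidean distance."""
    return min(words, key=lambda w: np.linalg.norm(words[w] - vec))

print(closest(result))  # queen
```

With real trained embeddings the result is not an exact match but lands closest to "queen" among the vocabulary, which is what made the example famous.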

Evolution of techniques

The journey of token embeddings from simple to complex mirrors the evolution of our understanding of language in the digital world. Starting off, we had methods like One-Hot Encoding and TF-IDF. Picture One-Hot Encoding as giving each word its unique, but excessively large, digital fingerprint, and TF-IDF as a smarter cousin that knows which words matter more in a document.

Both were like novices in a world of poetry and prose – they knew words but missed the music in them.
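One-hot encoding's weakness is easy to see in code. Each word gets a vector as long as the vocabulary with a single 1 in its slot, so every pair of distinct words is orthogonal and equally "far apart" (a toy four-word vocabulary for illustration):

```python
import numpy as np

vocab = ["cat", "dog", "fish", "bird"]

def one_hot(word):
    """Vector with a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

print(one_hot("dog"))  # [0. 1. 0. 0.]

# The dot product between any two different words is 0:
# "cat" is no closer to "dog" than to "bird". No meaning is captured.
print(one_hot("cat") @ one_hot("dog"))  # 0.0
```

With a realistic vocabulary these vectors would also be tens of thousands of elements long, almost all zeros, which is the "excessively large fingerprint" problem.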

The arrival of Word2Vec, GloVe, and FastText marked a notable shift. Word2Vec began to map words into a space where similar meanings were closer together, a bit like plotting words on a map based on their relationships. GloVe added depth by combining broader language patterns with specific word contexts. FastText brought an additional layer, deconstructing words into smaller fragments, which was especially useful for understanding less common words. These models were a big step forward, yet they still struggled with the challenge of words that change meaning based on context.

The latest advancements, led by transformer models like BERT, ELMo, and GPT, have brought a new level of sophistication. These models don't just look at words in isolation; they consider the whole sentence or paragraph, allowing them to grasp the nuances of language. They're like context detectives, examining not just the word but its surroundings to understand its role and meaning. This approach has significantly improved how machines interpret text, though it comes with its own complexities, including the need for substantial computational resources and intricate training.

This evolution from static to dynamic, context-sensitive embeddings is a significant step towards machines that can truly understand the subtlety and richness of human language.

As we delve deeper into the intricacies of machine learning and natural language processing, the significance of embeddings becomes even more pronounced. These embeddings are not just static representations; they are dynamic entities that evolve and adapt, capturing the nuanced and multifaceted nature of language and other forms of data.

The transformative power of embeddings extends beyond mere word associations. They enable machines to grasp the subtleties of context, sentiment, and even cultural nuances, paving the way for more advanced, empathetic, and intuitive AI systems. With embeddings, we're not just teaching machines to understand words; we're guiding them towards a deeper comprehension of human communication in all its complexity.

As we refine these techniques, we edge closer to a world where AI can seamlessly interact with humans, providing insights and assistance in ways we're just beginning to imagine. Whether it's through enhancing natural language understanding, improving recommendation systems, or enabling more sophisticated human-computer interactions, embeddings are at the forefront of this technological revolution.

From simple numerical representations to sophisticated tools that capture the essence of human language and thought, embeddings are not just a cornerstone of AI; they are a gateway to a future where the lines between human and machine intelligence become increasingly blurred. As we continue to explore and innovate in this domain, the possibilities are as limitless as they are exciting.