Embeddings: What they are and why they matter

Published: 2024-01-19

Notes while reading: Embeddings: What they are and why they matter.

What are Embeddings?

After reading that blog I got a rough idea of what embeddings are: a technique for turning a piece of content (whatever it is) into an array of floating point numbers that still captures the meaning of the original content.

Here’s what the blog says:

Embeddings are a technology that’s adjacent to the wider field of Large Language Models—the technology behind ChatGPT and Bard and Claude.

Embeddings are based around one trick: take a piece of content—in this case a blog entry—and turn that piece of content into an array of floating point numbers.

The key thing about that array is that it will always be the same length, no matter how long the content is. The length is defined by the embedding model you are using—an array might be 300, or 1,000, or 1,536 numbers long.

The best way to think about this array of numbers is to imagine it as co-ordinates in a very weird multi-dimensional space.

Once we treat the array as coordinates in a higher-dimensional space, we arrive at one interesting property: the closer the coordinates of two words (or sentences, or whatever), the more similar their meanings.
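As a concrete illustration, here is a minimal sketch of producing such fixed-length arrays with the sentence-transformers library (my own choice of library, not the blog's llm tool; the model name is an assumption):

import numpy as np
from sentence_transformers import SentenceTransformer

# Load a small embedding model (model name is an assumption)
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["A short blog entry about embeddings", "A recipe for sourdough bread"]
vectors = model.encode(sentences)

# Every vector has the same fixed length, no matter how long the input is
print(vectors.shape)  # e.g. (2, 384) for this particular model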

What can Embeddings do?

The author of that blog maintains a Python package called llm, which implements many of these features. The blog suggests a number of uses for embeddings, such as searching for related content, clustering GitHub issues, etc.

One basic operation is to calculate the cosine similarity between two floating point arrays, which quantifies how similar the meanings of their corresponding contents are.

def cosine_similarity(a, b):
    # Dot product of the two vectors
    dot_product = sum(x * y for x, y in zip(a, b))
    # Euclidean lengths (magnitudes) of each vector
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    # 1.0 means same direction, 0.0 means unrelated, -1.0 means opposite
    return dot_product / (magnitude_a * magnitude_b)
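For example, vectors pointing in the same direction score 1.0, while orthogonal ones score 0.0:

print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (unrelated)
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # 1.0 (same direction)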

We can also use other distance functions to measure similarity. By quantifying the degree of similarity between pieces of content, we can rank all articles by how similar they are to a particular article. The blog says ‘We can call this semantic search. I like to think of it as vibes-based search.’, but I still prefer to call it semantic search.
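A sketch of that ranking idea, assuming we already have an embedding for every article (for example from the snippet above) and reusing the cosine_similarity function:

def rank_by_similarity(query_embedding, article_embeddings):
    # article_embeddings: dict mapping article title -> embedding vector
    scored = [
        (cosine_similarity(query_embedding, emb), title)
        for title, emb in article_embeddings.items()
    ]
    # Highest similarity first
    return sorted(scored, reverse=True)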

As multi-modal models such as CLIP, a fascinating model that can embed both text and images, come into view, searching images directly with text has become practical. By simply calculating the distance between the floating point arrays, people can get a score representing the similarity between an image and a piece of text. What’s more, this gives a fresh way to compare two images: not by their pixels but by their embeddings.
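Here is a minimal sketch of comparing an image against two captions with a CLIP model loaded through Hugging Face transformers (the model name, the local image path, and the preprocessing details are my own assumptions; this is not code from the blog):

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # hypothetical local image
texts = ["a dog on a beach", "a plate of pasta"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Normalise, then take cosine similarity between the image and each caption
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)  # higher value = closer match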

Clustering content with embeddings

The blog shows an example using GitHub issues. Tagging them manually is a lot of work, but we can crawl information about each issue, such as its title and description, compute embeddings for them, and measure the distances between those embedding vectors. After clustering the vectors, we can use an LLM to generate a descriptive name for each cluster.
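A sketch of that clustering step with scikit-learn’s KMeans (the toy random vectors, the number of clusters, and the vector size are assumptions; in practice the vectors would come from an embedding model):

import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in: one embedding vector per GitHub issue (title + description)
rng = np.random.default_rng(0)
issue_embeddings = rng.normal(size=(100, 384))

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(issue_embeddings)

# labels[i] is the cluster index for issue i; the issues in each cluster
# can then be passed to an LLM to suggest a descriptive cluster name
print(labels[:10])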

Retrieval-Augmented Generation

This part talks about how to improve an LLM’s answers using private notes or internal documents, without fine-tuning a custom model.

The key idea is to find the most related content via semantic search with embeddings, then assemble excerpts from the most relevant articles or documents and attach them to the original question in the prompt.
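Putting that together, here is a minimal sketch of the retrieval step (reusing the cosine_similarity function above; the document store format and the prompt wording are my own assumptions, not the blog’s code):

def build_rag_prompt(question, question_embedding, documents, top_k=3):
    # documents: list of (text, embedding) pairs for private notes or internal docs
    scored = sorted(
        documents,
        key=lambda doc: cosine_similarity(question_embedding, doc[1]),
        reverse=True,
    )
    # Attach excerpts from the top_k most relevant documents to the question
    excerpts = "\n\n".join(text for text, _ in scored[:top_k])
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{excerpts}\n\nQuestion: {question}"
    )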