Gillius's Programming

Enhance Your Search With Embeddings

21 Aug 2025

You don’t need to use full AI models to get some of the benefits that have come out of the Gen AI / large language model (LLM) wave, especially when it comes to search.

One of the interesting components to come out of this wave is practical uses for embeddings, which represent a concept as a high-dimensional vector (a series of decimal numbers). While the idea of embeddings predates LLMs (Word2vec, for example, dates to 2013), the embedding models we have access to now, like OpenAI’s text-embedding-3-small, have been enhanced to understand context. For example, the word “ruler” can refer to a measuring device or to a king; the term used here is “attention”, in the sense of paying attention to the rest of the context. Humans can easily differentiate the two meanings of “ruler” in the phrases “I want to measure this wood, please give me the ruler” and “King George was a ruler of England”.

Even without bringing “AI” to bear on a problem, you can use these embedding models to get what I think is one of the most interesting fundamental advancements around LLMs: representing a query as a concept and not as keywords. In this sense, “male ruler”, “king”, and “George III of England” map to nearby (or “similar”) vectors, even though in a traditional keyword search these terms share no words.
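As a rough illustration, here is a minimal sketch of comparing concepts with embeddings, assuming the OpenAI Python client and an API key in the environment; the phrases are illustrative:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts):
    # Each input string becomes one 1536-dimensional vector.
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in response.data]

def cosine_similarity(a, b):
    # These embeddings are normalized, so the dot product alone would do,
    # but the full formula is shown for clarity.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

king, male_ruler, measuring = embed(["king", "male ruler", "a ruler for measuring length"])
print(cosine_similarity(king, male_ruler))   # expected to be relatively high
print(cosine_similarity(king, measuring))    # expected to be noticeably lower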

Example

Let’s look at a concrete example. I found a list of cat facts online with one fact per line. Using the text-embedding-3-small model, each fact is converted into a vector. I set up an application using PostgreSQL’s pgvector extension, with a table containing an “embedding” column of type vector(1536), since that is the size of embeddings produced by text-embedding-3-small (it can actually produce smaller vectors, which you should try out in a real application to optimize performance). Each fact and its embedding is inserted as a row.
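A sketch of this ingestion step, assuming the OpenAI Python client, the psycopg driver, the document_chunks table shown in the References section, and a hypothetical cat-facts.txt file with one fact per line (the file and database names are illustrative):

import psycopg
from openai import OpenAI

client = OpenAI()

with open("cat-facts.txt", encoding="utf-8") as f:
    facts = [line.strip() for line in f if line.strip()]

# Embed all facts in one API call; the vectors come back in input order.
response = client.embeddings.create(model="text-embedding-3-small", input=facts)
embeddings = [item.embedding for item in response.data]

with psycopg.connect("dbname=catfacts") as conn:
    with conn.cursor() as cur:
        for fact, embedding in zip(facts, embeddings):
            # pgvector accepts a bracketed text literal such as '[0.1,0.2,...]';
            # the pgvector Python package can also register a native adapter.
            literal = "[" + ",".join(str(x) for x in embedding) + "]"
            cur.execute(
                "INSERT INTO document_chunks (content, embedding) VALUES (%s, %s::vector)",
                (fact, literal),
            )
    conn.commit()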

Now, we take a user’s query and convert it into an embedding, then find the 5 closest matches to that vector. In PostgreSQL I set it up to use an HNSW index and cosine distance (the <=> operator in Postgres); the score shown is (1 - cosine_distance). HNSW is an approximation algorithm, so it may not find the exact top 5, but in this example I got the following:

What is the gestation period for cats?

Fact | Score
A cat is pregnant for about 58-65 days. | 0.683
The oldest cat to give birth was Kitty who, at the age of 30, gave birth to two kittens. During her life, she gave birth to 218 kittens. | 0.477
On average, cats spend 2/3 of every day sleeping. That means a nine-year-old cat has been awake for only three years of its life. | 0.453
Most cats give birth to a litter of between one and nine kittens. The largest known litter ever produced was 19 kittens, of which 15 survived. | 0.440
Cats sleep 16 to 18 hours per day. When cats are asleep, they are still alert to incoming stimuli. If you poke the tail of a sleeping cat, it will respond accordingly. | 0.437

Notice that the top result answers the question yet shares no keywords with the query except the word “cat” (assuming a keyword search smart enough to match word roots like cats -> cat). A traditional fulltext search for “gestation period cats” would therefore not find this fact.
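For completeness, here is a sketch of that query flow in application code, assuming the psycopg driver and the same hypothetical catfacts database as the ingestion sketch above; the SQL mirrors the example query in the References section below:

import psycopg
from openai import OpenAI

client = OpenAI()

def search(query, limit=5):
    # Embed the user's query with the same model used for the documents.
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small", input=[query]
    ).data[0].embedding
    literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with psycopg.connect("dbname=catfacts") as conn:
        rows = conn.execute(
            """
            SELECT content, (1 - (embedding <=> %s::vector)) AS similarity
            FROM document_chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (literal, literal, limit),
        ).fetchall()
    return rows

for content, score in search("What is the gestation period for cats?"):
    print(f"{score:.3f}  {content}")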

In a full implementation, we would need to handle larger documents, which are normally “chunked” into pieces, with the idea that each small section represents one or a few ideas that will generate a useful embedding. For example, in an article on the American Revolution, we might have one chunk mentioning King George III that would generate embeddings highly similar to “who was King George”, and another chunk about Boston that would better match a query on the Tea Party protest.
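A rough illustration of chunking by paragraph; real pipelines often add overlap and token-aware size limits, and the 1,000-character cap here is an arbitrary assumption:

def chunk_text(text, max_chars=1000):
    # Split a document into chunks of roughly one idea each.
    # Paragraphs are kept together and merged until the size cap is reached,
    # so each chunk stays small enough to embed as a single concept.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for paragraph in paragraphs:
        if current and len(current) + len(paragraph) + 2 > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)
    return chunks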

Retrieval Augmented Generation

This is the basis of the widely covered RAG (Retrieval Augmented Generation) technique, but what is less often said is that you can use these embeddings without the “AI” side of it to find concepts instead of words in your documents. Embedding models are much smaller than LLMs and much more feasible to run locally in tools like Ollama and LM Studio, whether you want to keep processing private or just enhance a “traditional” search tool in your application.
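If you do want to keep embedding generation local, a sketch against Ollama’s local REST API might look like the following; nomic-embed-text is just one example of a model you could pull, and note that its vectors are 768-dimensional rather than 1536, so the column type would change accordingly:

import requests

def embed_locally(text, model="nomic-embed-text"):
    # Ollama serves a local REST API on port 11434 by default.
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
    )
    response.raise_for_status()
    return response.json()["embedding"]

vector = embed_locally("What is the gestation period for cats?")
print(len(vector))  # dimension depends on the chosen model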

Of course, you could use this in a RAG. As a simple example, you can try it out by manually injecting the retrieved facts into a prompt. I dropped this query to gpt-4.1-nano, since this is a task even the smallest and cheapest LLMs can handle:

Use the following context to answer the user's question. Each bullet point is a separate fact. Only use the context to
answer the question and not predefined knowledge. If the context does not provide an answer to the user's question,
state, "I'm sorry, but I don't know the answer to that".

Context facts:
* A cat is pregnant for about 58-65 days.
* The oldest cat to give birth was Kitty who, at the age of 30, gave birth to two kittens. During her life, she gave
  birth to 218 kittens.
* On average, cats spend 2/3 of every day sleeping. That means a nine-year-old cat has been awake for only three years
  of its life.
* Most cats give birth to a litter of between one and nine kittens. The largest known litter ever produced was 19
  kittens, of which 15 survived.
* Cats sleep 16 to 18 hours per day. When cats are asleep, they are still alert to incoming stimuli. If you poke the
  tail of a sleeping cat, it will respond accordingly.

The user's question is:
what is the gestation period for cats?

GPT 4.1 Nano’s response in my case was simply repeating the first fact: “A cat is pregnant for about 58-65 days”. Here the LLM is just filtering out the facts that are relevant but don’t directly answer the question, which saves the human a little bit of time, but it is quite feasible to skip this step if you want to reduce usage of LLMs for cost reasons. Of course, you can modify the instructions. In essence this is all a RAG is doing, although people try to wrap it up in abstracted frameworks that may make it seem like it’s more than it is. Note that such frameworks are still useful, since they handle things like document conversion (MS Word to text), chunking, database management, etc., which we did not talk about here, but the core concept is the same.
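To tie it together, the prompt above can be assembled programmatically from the retrieved rows; here is a sketch using the Chat Completions API, reusing the hypothetical search function sketched earlier:

from openai import OpenAI

client = OpenAI()

def answer(question):
    # Retrieve the closest facts and inject them into the prompt as bullet points.
    facts = [content for content, _score in search(question)]
    context = "\n".join(f"* {fact}" for fact in facts)
    prompt = (
        "Use the following context to answer the user's question. "
        "Each bullet point is a separate fact. Only use the context to answer "
        "the question and not predefined knowledge. If the context does not provide "
        "an answer to the user's question, state, \"I'm sorry, but I don't know the answer to that\".\n\n"
        f"Context facts:\n{context}\n\n"
        f"The user's question is:\n{question}"
    )
    completion = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content

print(answer("what is the gestation period for cats?"))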

References

Example table SQL:

-- One row per document chunk (here, one cat fact per row);
-- 1536 dimensions matches the output size of text-embedding-3-small.
CREATE TABLE IF NOT EXISTS document_chunks (
    id BIGSERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    embedding vector(1536)
);

-- HNSW index for approximate nearest-neighbor search using cosine distance.
CREATE INDEX IF NOT EXISTS idx_document_chunks_embedding ON document_chunks USING hnsw (embedding vector_cosine_ops);

Example query (where parameters like :queryEmbedding are replaced with desired values):

-- <=> is cosine distance; ordering by it returns the most similar chunks first.
SELECT id, content, (1 - (embedding <=> CAST(:queryEmbedding AS vector))) AS similarity
FROM document_chunks
ORDER BY embedding <=> CAST(:queryEmbedding AS vector)
LIMIT :limit;