How RAG & vectors actually work

An interactive, click-to-learn guide to embeddings, semantic search and Retrieval-Augmented Generation — the machinery that lets an AI answer questions over your own documents. No math degree required.

  • Embeddings
  • Cosine similarity
  • Semantic search
  • Vector DBs & pgvector
  • LangChain / LangGraph

Everything below is interactive — click the points, drag the sliders, run the pipeline.

Start here

The one problem RAG solves

A large language model (LLM) only knows what was in its training data, and you can't fit your whole document base into a single prompt. So instead of stuffing everything in, you find just the relevant pieces and attach them to the question. The model then answers from your data — not only from memory. That, in one sentence, is Retrieval-Augmented Generation (RAG).

Your documents

Contracts, wiki, tickets, PDFs — far more than any prompt could hold.

Retrieve the relevant bits

A vector search finds the few chunks that actually relate to the question.

Generate the answer

The LLM answers using those chunks as grounded context.

To make any of this work, a computer needs a way to measure “how related are these two pieces of text?” — even when they share no words. That measure is built from embeddings and cosine similarity, which the rest of this guide makes tangible.

The framework

What is LangChain, and what is it good for?

LangChain is a framework for building applications on top of LLMs. It started in Python (a JS/TS version exists too). The goal: give you building blocks and abstractions so you don't have to write everything around a model call from scratch.

Its main components

Model abstraction

One interface over OpenAI, Anthropic, Azure, Gemini… Switch provider without rewriting the rest of your code.

Prompt templates

Templating prompts with variables instead of string concatenation.
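
In plain Ruby the same idea can be sketched with `format` and named placeholders. The template wording, the `RAG_PROMPT` constant and the `render_prompt` helper are made up for this illustration; they are not LangChain's API.

```ruby
# Hypothetical template: the wording and placeholder names are illustrative.
RAG_PROMPT = <<~PROMPT
  Answer the question using only the context below.

  Context:
  %{context}

  Question: %{question}
PROMPT

# Fill the named placeholders; a missing variable raises KeyError
# instead of silently producing a broken prompt.
def render_prompt(template, **variables)
  format(template, variables)
end

prompt = render_prompt(
  RAG_PROMPT,
  context: "Clause 7: the notice period is 30 days.",
  question: "How long is the notice period?"
)
puts prompt
```

Compared with string concatenation, the template stays readable in one place and a forgotten variable fails loudly.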

Chains

Composing steps into a pipeline: input → format prompt → call model → parse output → pass on. Hence the name.
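
A minimal sketch of that pipeline shape in plain Ruby, using `Proc#>>` to compose steps. The three lambdas are hypothetical, and `call_model` is a stand-in that only upper-cases the prompt; a real step would call an LLM API.

```ruby
# Each step is a lambda; >> composes them left to right (Ruby 2.6+).
format_prompt = ->(question) { "Answer briefly: #{question}" }

# Stand-in for a real model call: it just echoes the prompt in upper case.
call_model = ->(prompt) { { "text" => prompt.upcase } }

# Pull the text out of the (fake) response structure.
parse_output = ->(response) { response.fetch("text") }

# input -> format prompt -> call model -> parse output
chain = format_prompt >> call_model >> parse_output

puts chain.call("what is rag?")
```

Data only ever flows forward here, which is exactly why this shape has no natural place for a retry or a branch.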

Retrieval / RAG

The strongest use case: vector databases, embeddings, document chunking, semantic search over your own data.

Agents

The model decides which “tools” to call (search, calculator, API) and iterates until it reaches a result.

Memory

Keeping conversation state across calls.

The catch

LangChain has a reputation for too many abstractions. For simple tasks it adds a layer that mostly gets in the way, and debugging through several wrappers gets tedious. Many teams use it only for ideas or a quick RAG prototype, then rewrite the production version as a thin layer over the raw API. Two sibling projects address part of this: LangGraph (an explicit state graph for agents) and LangSmith (observability and tracing).

Orchestration

LangChain vs LangGraph

In short: LangChain is about chaining steps one after another (a linear pipeline). LangGraph is about orchestration where steps can branch, loop back and repeat (a cyclic graph). Toggle the diagram to see the difference.

Chains = a DAG

A chain is a directed acyclic graph: data flows one way, input → A → B → output. Great for a clear sequence (format prompt, call model, parse). It struggles the moment you need a loop or a decision like “if the result isn't good enough, try again differently.”

Graphs = state + cycles

LangGraph adds a shared State that flows through the graph, Nodes (steps), and conditional edges that decide where to go next. Crucially the graph can contain cycles — a node can loop back and iterate until done. Exactly what agents need.
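
The state, node and conditional-edge idea can be sketched in a few lines of plain Ruby. This is not LangGraph's API: `NODES`, `EDGES`, `run_graph` and the "good enough after three attempts" check are all invented for the illustration.

```ruby
# Nodes take the shared state hash and return an updated copy.
NODES = {
  draft: lambda do |state|
    attempts = state[:attempts] + 1
    state.merge(attempts: attempts, answer: "draft ##{attempts}")
  end,
  # Hypothetical quality check: accept the third draft.
  check: ->(state) { state.merge(good_enough: state[:attempts] >= 3) }
}.freeze

# Edges pick the next node from the state; :check can loop back to :draft.
EDGES = {
  draft: ->(_state) { :check },
  check: ->(state) { state[:good_enough] ? :done : :draft }
}.freeze

# Walk the graph until an edge returns :done; max_steps guards against
# a cycle that never terminates.
def run_graph(state, start: :draft, max_steps: 20)
  node = start
  max_steps.times do
    state = NODES.fetch(node).call(state)
    node = EDGES.fetch(node).call(state)
    return state if node == :done
  end
  raise "graph did not finish within #{max_steps} steps"
end

final = run_graph({ attempts: 0 })
puts final[:answer]
```

The cycle lives in the `check` edge: it sends the state back to `draft` until the condition holds, which a one-way chain cannot express.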

A Rails-world analogy

A chain is like a single linear service object — Foo.call runs top to bottom and returns. LangGraph is more like a state machine (think AASM / state_machine): defined states, transitions and the conditions under which a transition fires — except here the LLM often decides the transition based on the state's contents.

The foundation

Embeddings: turning meaning into numbers

Run each piece of text through an embedding model (OpenAI's text-embedding-3-small, Voyage, or a local model via Ollama) and you get back a vector: a list of numbers (1536 of them for text-embedding-3-small) that represents the meaning of the text. Texts with similar meaning end up with vectors that sit close together.

1 · A vector is just a row of numbers

Pick a phrase. We show its embedding as 16 cells: each cell is one dimension, one number. Real models use hundreds to thousands of dimensions (1536 is a common size); the idea is identical.

2 · Similar meaning → nearby vectors

Each dot is one phrase, placed by meaning (this 2-D map is a flattened projection of the high-dimensional space). Click any two dots to compare them — we'll compute their cosine similarity. Notice the legal phrases cluster together, far from animals or food, even though they share no words.

The math, made friendly

Cosine similarity — and what it has to do with sine & cosine

The surprise: cosine similarity literally is a cosine — the cosine of the angle between two vectors. It's the same cosine from school trigonometry:

cos(θ) = (A · B) / (|A| · |B|)

The dot product A · B is computed component by component: a₁·b₁ + a₂·b₂ + … + aₙ·bₙ. It works the same in 2-D as in 1536-D — you just can't picture the 1536-D arrow.

Drag the angle. Vector A is fixed; rotate B and watch the cosine. Two aligned arrows → 1. Perpendicular → 0. Opposite → −1.

Example reading: at θ = 45°, cos(θ) ≈ 0.71.
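
The slider can be reproduced numerically. In this sketch vector A is fixed along the x-axis and B is a unit vector rotated by the chosen angle; `unit_vector` is a helper invented for the example.

```ruby
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  magnitude = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  dot / (magnitude.call(a) * magnitude.call(b))
end

# Unit vector in 2-D at the given angle (in degrees) from the x-axis.
def unit_vector(degrees)
  radians = degrees * Math::PI / 180
  [Math.cos(radians), Math.sin(radians)]
end

a = unit_vector(0) # fixed vector A

# Aligned, 45 degrees apart, perpendicular, opposite.
[0, 45, 90, 180].each do |angle|
  similarity = cosine_similarity(a, unit_vector(angle))
  puts format("%3d degrees -> %.2f", angle, similarity)
end
```

The four printed values follow the pattern described above: 1 when aligned, about 0.71 at 45°, 0 when perpendicular and −1 when opposite.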

Why cosine, not sine?

Because cosine is 1 at a zero angle and falls as the vectors diverge — exactly what we want from “similarity.” Sine is 0 at 0°, so it would call identical vectors maximally dissimilar. Recall the right triangle: cos = adjacent / hypotenuse. When two directions are aligned, they project almost entirely onto each other → cosine near 1.

Why the angle, not the distance?

We care about direction, not length. A short note and a long article on the same topic can produce vectors of different length that point the same way. Cosine ignores length (that's the |A|·|B| division), so it sees “same topic.” Euclidean distance would be fooled by length. (Some models, OpenAI's among them, already return unit-length vectors; in that case both measures rank results identically.)
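
A small numeric check of the direction-versus-length point. The two vectors are made-up numbers, not real embeddings: the second is simply the first scaled by ten.

```ruby
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  magnitude = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  dot / (magnitude.call(a) * magnitude.call(b))
end

def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Illustrative vectors only: same direction, ten times the length.
short_note   = [1.0, 2.0, 3.0]
long_article = short_note.map { |x| x * 10 }

cosine   = cosine_similarity(short_note, long_article)
distance = euclidean_distance(short_note, long_article)

puts "cosine similarity:  #{cosine.round(4)}"
puts "euclidean distance: #{distance.round(2)}"
```

Cosine reports the two as pointing the same way, while the Euclidean distance is large purely because one vector is longer.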

cosine_similarity.rb
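# Cosine of the angle between two equal-length numeric arrays.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }            # a₁·b₁ + a₂·b₂ + … + aₙ·bₙ
  mag = ->(v) { Math.sqrt(v.sum { |x| x**2 }) }  # vector length |v|
  dot / (mag.call(a) * mag.call(b))              # NaN if either vector is all zeros
end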

At real scale you don't compute this in Ruby — that's what a vector database is for, where cosine distance is a single SQL operator (<=> in pgvector) the database can index.

Putting it together

The RAG pipeline, step by step

RAG has two phases: one you run once, up front (indexing), and one that runs on every question (retrieval + generation). Click any stage to expand it, or press Run to watch a question flow through.
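
The two phases can be sketched end to end in plain Ruby. Loud assumption: `TOY_EMBED` is a stand-in that only counts words from a tiny fixed vocabulary, so it captures word overlap, not meaning. A real pipeline would pass a callable that hits an embedding model instead. All helper names are hypothetical.

```ruby
VOCABULARY = %w[contract notice termination invoice payment days].freeze

# Toy stand-in for an embedding model: one dimension per vocabulary word.
TOY_EMBED = lambda do |text|
  words = text.downcase.scan(/[a-z]+/)
  VOCABULARY.map { |term| words.count(term).to_f }
end

def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Phase 1, indexing (once): store each chunk next to its vector.
def build_index(chunks, embed)
  chunks.map { |text| { text: text, vector: embed.call(text) } }
end

# Phase 2a, retrieval (per query): embed the question, rank chunks by cosine.
def retrieve(index, question, embed, top_k: 2)
  query_vector = embed.call(question)
  index.max_by(top_k) { |entry| cosine_similarity(entry[:vector], query_vector) }
end

# Phase 2b, generation input: attach the retrieved chunks to the question.
def build_prompt(question, retrieved)
  context = retrieved.map { |entry| "- #{entry[:text]}" }.join("\n")
  "Context:\n#{context}\n\nQuestion: #{question}"
end

chunks = [
  "Termination requires 30 days written notice.",
  "Each invoice is due for payment within 14 days.",
  "The office is closed on public holidays."
]
index = build_index(chunks, TOY_EMBED)
question = "How many days of notice for termination?"
retrieved = retrieve(index, question, TOY_EMBED, top_k: 1)
puts build_prompt(question, retrieved)
```

The resulting prompt is what would be sent to the LLM. Because `embed` is passed in as a parameter, the toy function can be swapped for a real model without touching the rest.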

Phase 1: Indexing (once, offline)
Phase 2: Retrieval + Generation (every query)

Where the vectors live

Vector databases & pgvector

A vector database stores the vectors plus the original text and metadata, and answers “find the N nearest vectors to this one” fast. Three common choices:

Pinecone

Managed cloud, scales, no infrastructure to run — you pay for it.

Chroma

Lightweight, open-source, great for a prototype and local work.

pgvector

A PostgreSQL extension — no new database in your stack. Add a vector column to an existing table and search with operators right in SQL: <-> (L2 / Euclidean), <=> (cosine distance), <#> (negative inner product).
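
To make the three operators concrete, here is what each one computes, written out in plain Ruby. The method names are mine and the vectors are arbitrary; in practice pgvector evaluates these inside PostgreSQL.

```ruby
# <-> : L2 / Euclidean distance
def l2_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# <=> : cosine distance, i.e. 1 minus cosine similarity
def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norms = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  1 - dot / norms
end

# <#> : inner product, negated so that smaller still means closer
def negative_inner_product(a, b)
  -a.zip(b).sum { |x, y| x * y }
end

a = [1.0, 2.0]
b = [2.0, 1.0]

puts "l2:             #{l2_distance(a, b).round(3)}"
puts "cosine dist:    #{cosine_distance(a, b).round(3)}"
puts "neg inner prod: #{negative_inner_product(a, b)}"
```

All three are distances in the sense that lower means nearer, so `ORDER BY ... ASC` works with any of them. Note that `<=>` returns a distance, so a cosine similarity of 0.8 shows up as 0.2.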

In Rails, the neighbor gem adds vector search as an Active Record scope:

document_chunk.rb
class DocumentChunk < ApplicationRecord
  has_neighbors :embedding
end

# store a chunk
DocumentChunk.create!(content: chunk_text, embedding: embedding_vector)

# retrieval — the 5 nearest chunks to the query vector
DocumentChunk
  .nearest_neighbors(:embedding, query_vector, distance: "cosine")
  .first(5)

The embedding call (OpenAI / Voyage / Ollama) is just an HTTP request, chunking is plain Ruby, and retrieval is pgvector + neighbor — so a RAG pipeline can be built natively in Rails, with no separate Python service.
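
As a sketch of how small that HTTP request is, here it is with only Ruby's standard library. The endpoint and JSON shape follow OpenAI's public embeddings API; the model name default and the helper names `build_embedding_request` and `fetch_embedding` are assumptions for the example, and error handling is minimal.

```ruby
require "net/http"
require "json"
require "uri"

EMBEDDINGS_URI = URI("https://api.openai.com/v1/embeddings")

# Build the POST request without sending it (easy to inspect and test).
def build_embedding_request(text, api_key:, model: "text-embedding-3-small")
  request = Net::HTTP::Post.new(EMBEDDINGS_URI)
  request["Authorization"] = "Bearer #{api_key}"
  request["Content-Type"] = "application/json"
  request.body = JSON.generate(model: model, input: text)
  request
end

# Send the request and return the embedding as an array of floats.
def fetch_embedding(text, api_key:)
  request = build_embedding_request(text, api_key: api_key)
  response = Net::HTTP.start(EMBEDDINGS_URI.host, EMBEDDINGS_URI.port, use_ssl: true) do |http|
    http.request(request)
  end
  raise "embedding request failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  JSON.parse(response.body).dig("data", 0, "embedding")
end

# Usage (needs a real key and network access):
# embedding_vector = fetch_embedding("notice period", api_key: ENV.fetch("OPENAI_API_KEY"))
```

The returned array is what would go into the `embedding` column above. Voyage or a local Ollama server would need a different URL and payload, so treat this as one provider's shape rather than a general client.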

Semantic search vs keyword search

This is the whole point. Keyword search matches words; semantic search matches meaning. Run the query both ways and compare what each retrieves.
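
The keyword side of that comparison is easy to sketch, and it shows where literal matching breaks. The stop-word list and the `keyword_score` helper are invented for this example; the semantic side needs a real embedding model and is not reproduced here.

```ruby
STOP_WORDS = %w[a an the this to of may].freeze

# Lower-cased words of the text, minus stop words.
def keywords(text)
  text.downcase.scan(/[a-z]+/) - STOP_WORDS
end

# Number of distinct query keywords that literally appear in the document.
def keyword_score(query, document)
  (keywords(query) & keywords(document)).size
end

document = "Either party may terminate this contract."

same_words = keyword_score("terminate contract", document)
same_meaning = keyword_score("cancel the agreement", document)

puts "terminate contract   -> #{same_words}"
puts "cancel the agreement -> #{same_meaning}"
```

The second query asks for the same thing in different words and scores zero, which is the gap that embedding-based retrieval is meant to close.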

Beyond the basics

What makes a RAG system good (or bad)

Chunking quality

Badly cut chunks → irrelevant retrieval. Split with overlap so context isn't lost at the boundaries, and for structured docs (contracts) split by structure — clauses, sections — not blindly every N characters.
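
Both splitting strategies can be sketched briefly. `chunk_with_overlap` is a fixed-size sliding window; `split_by_paragraph` assumes, purely for illustration, that sections are separated by blank lines. Real contracts would need a pattern matched to their clause numbering.

```ruby
# Fixed-size windows of `size` characters; consecutive windows share
# `overlap` characters so text cut at one boundary survives in the next.
def chunk_with_overlap(text, size:, overlap:)
  raise ArgumentError, "overlap must be smaller than size" unless overlap < size

  step = size - overlap
  chunks = []
  start = 0
  while start < text.length
    chunks << text[start, size]
    break if start + size >= text.length # the last window reached the end

    start += step
  end
  chunks
end

# Structure-aware alternative: one chunk per blank-line-separated section.
def split_by_paragraph(text)
  text.split(/\n{2,}/).map(&:strip).reject(&:empty?)
end

windows = chunk_with_overlap("abcdefghij", size: 4, overlap: 2)
sections = split_by_paragraph("1. Term\nTwo years.\n\n2. Notice\n30 days.")

p windows
p sections
```

In the window version each chunk repeats the last two characters of the previous one; with real text you would pick sizes in the hundreds of characters or tokens.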

Re-ranking

Vector search is fast but coarse. Take the top candidates and re-score them with a more precise re-ranker to pick the truly best ones.
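
The two-stage shape looks roughly like this. The `toy_reranker` is a deliberately crude stand-in (fraction of query words found in the text), and the candidate scores are invented. Production re-rankers are typically dedicated models that read the query and the chunk together.

```ruby
# candidates: hashes of { text:, vector_score: } from the fast first pass.
# reranker: any callable (query, text) -> Float.
def rerank(query, candidates, reranker, keep: 2)
  candidates
    .map { |c| c.merge(rerank_score: reranker.call(query, c[:text])) }
    .max_by(keep) { |c| c[:rerank_score] }
end

# Stand-in scorer: share of distinct query words present in the text.
toy_reranker = lambda do |query, text|
  query_words = query.downcase.scan(/[a-z]+/).uniq
  text_words = text.downcase.scan(/[a-z]+/)
  query_words.count { |word| text_words.include?(word) }.fdiv(query_words.size)
end

# Invented first-pass results, already sorted by vector score.
candidates = [
  { text: "Payment terms are net 30.", vector_score: 0.82 },
  { text: "The notice period is 30 days.", vector_score: 0.80 },
  { text: "Notice must be sent in writing.", vector_score: 0.79 }
]

top = rerank("notice period", candidates, toy_reranker, keep: 2)
top.each { |c| puts "#{c[:rerank_score]}  #{c[:text]}" }
```

The chunk that led the vector ranking drops out after re-scoring, which is the point: the cheap pass casts a wide net, the slower pass decides the final order.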

Hybrid search

Combine semantic + classic keyword search. Vectors sometimes miss exact terms (paragraph numbers, product names); keyword catches them.
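
The text above does not say how to combine the two result lists. One common option, swapped in here as an assumption, is reciprocal rank fusion: each list contributes 1 / (k + rank) per document, with k = 60 as the conventional default. The document IDs are placeholders.

```ruby
# Merge several ranked lists of document IDs into one ranking.
# A document scores 1 / (k + rank) in every list it appears in.
def reciprocal_rank_fusion(rankings, k: 60)
  scores = Hash.new(0.0)
  rankings.each do |ranking|
    ranking.each_with_index do |doc_id, index|
      scores[doc_id] += 1.0 / (k + index + 1) # ranks start at 1
    end
  end
  scores.sort_by { |_doc_id, score| -score }.map(&:first)
end

semantic_ranking = %w[doc_a doc_b doc_c] # from vector search
keyword_ranking  = %w[doc_c doc_a doc_d] # from keyword search

fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
p fused
```

Documents found by both searches rise to the top, and a keyword-only hit such as `doc_d` still makes the list. Because only ranks are used, the two searches' raw scores never need to be on the same scale.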

Metadata filtering

Before searching vectors, narrow by metadata — only this client's documents, only this contract type. Smaller, cleaner search space.
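
A sketch of filter-then-search over in-memory data. The chunk records, their metadata keys and the two-dimensional vectors are all invented; `filtered_search` is a hypothetical helper, not part of any library.

```ruby
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

# Keep only chunks whose metadata matches every filter, then rank the
# survivors by cosine similarity to the query vector.
def filtered_search(chunks, query_vector, top_k:, **filters)
  candidates = chunks.select do |chunk|
    filters.all? { |key, value| chunk[:metadata][key] == value }
  end
  candidates.max_by(top_k) { |chunk| cosine_similarity(chunk[:vector], query_vector) }
end

# Invented sample data.
chunks = [
  { text: "Acme NDA, clause 3",   vector: [0.9, 0.1], metadata: { client: "acme",   type: "nda" } },
  { text: "Globex NDA, clause 3", vector: [1.0, 0.0], metadata: { client: "globex", type: "nda" } },
  { text: "Acme lease, clause 9", vector: [0.2, 0.8], metadata: { client: "acme",   type: "lease" } }
]

results = filtered_search(chunks, [1.0, 0.0], top_k: 1, client: "acme")
puts results.first[:text]
```

The Globex chunk is the closest vector overall, yet it never enters the ranking because the filter removes it first. That is the guarantee you want when documents belong to different clients.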

In one picture

Recap: the whole thing

  1. Embedding turns text into a vector — a list of numbers that captures meaning.
  2. Cosine similarity = the cosine of the angle between two vectors: 1 = same meaning, 0 = unrelated, −1 = opposite.
  3. Indexing (once): chunk your documents, embed each chunk, store the vectors in a vector DB.
  4. Retrieval (per query): embed the question, find the nearest chunks by cosine, attach them to the prompt.
  5. Generation: the LLM answers grounded in those retrieved chunks — that's RAG.
  6. LangChain wires linear steps; LangGraph adds state, branches and loops for agents.
  7. In Rails it's all native: HTTP for embeddings, plain Ruby for chunking, pgvector + neighbor for search.