The one problem RAG solves
A large language model (LLM) only knows what was in its training data, and you can't fit your whole document base into a single prompt. So instead of stuffing everything in, you find just the relevant pieces and attach them to the question. The model then answers from your data — not only from memory. That, in one sentence, is Retrieval-Augmented Generation (RAG).
Your documents
Contracts, wiki, tickets, PDFs — far more than any prompt could hold.
Retrieve the relevant bits
A vector search finds the few chunks that actually relate to the question.
Generate the answer
The LLM answers using those chunks as grounded context.
To make any of this work, a computer needs a way to measure “how related are these two pieces of text?” — even when they share no words. That measure is built from embeddings and cosine similarity, which the rest of this guide makes tangible.
What is LangChain, and what is it good for?
LangChain is a framework for building applications on top of LLMs. It started in Python (a JS/TS version exists too). The goal: give you building blocks and abstractions so you don't have to write everything around a model call from scratch.
Its main components
Model abstraction
One interface over OpenAI, Anthropic, Azure, Gemini… Switch provider without rewriting the rest of your code.
Prompt templates
Templating prompts with variables instead of string concatenation.
Chains
Composing steps into a pipeline: input → format prompt → call model → parse output → pass on. Hence the name.
Retrieval / RAG
The strongest use case: vector databases, embeddings, document chunking, semantic search over your own data.
Agents
The model decides which “tools” to call (search, calculator, API) and iterates until it reaches a result.
Memory
Keeping conversation state across calls.
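To make the "prompt template" and "chain" ideas concrete, here is a minimal plain-Ruby sketch. It does not use LangChain's API; `TEMPLATE`, `FAKE_MODEL` and the other names are hypothetical, and the "model" is a stand-in lambda so the example runs offline.

```ruby
# Plain-Ruby sketch of a prompt template and a chain. Not LangChain's API.

# Prompt template: named placeholders filled in by Kernel#format.
TEMPLATE = "Answer using only this context:\n%<context>s\n\nQuestion: %<question>s"

FORMAT_PROMPT = ->(vars) { format(TEMPLATE, vars) }

# Stand-in for a real model call (hypothetical): echoes the prompt's last line.
FAKE_MODEL = ->(prompt) { "  ANSWER for -> #{prompt.lines.last.strip}  " }

# Output parser: tidy up the raw model text.
PARSE_OUTPUT = ->(raw) { raw.strip }

# Proc#>> composes left to right: format prompt -> call model -> parse output.
CHAIN = FORMAT_PROMPT >> FAKE_MODEL >> PARSE_OUTPUT

puts CHAIN.call({ context: "Notice period is 30 days.",
                  question: "How long is the notice period?" })
```

Swapping `FAKE_MODEL` for a lambda that calls a real provider would leave the rest of the chain untouched, which is the point of the abstraction.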
The catch
LangChain has a reputation for too many abstractions. For simple tasks it adds a layer that mostly gets in the way, and debugging through several wrappers gets annoying. Many people use it only for inspiration or a quick RAG prototype, then rewrite it for production as a thin layer over the raw API. That's why LangGraph (explicit state graph for agents) and LangSmith (observability / tracing) exist.
LangChain vs LangGraph
In short: LangChain is about chaining steps one after another (a linear pipeline). LangGraph is about orchestration where steps can branch, loop back and repeat (a cyclic graph).
Chains = a DAG
A chain is a directed acyclic graph: data flows one way, input → A → B → output. Great for a clear sequence (format prompt, call model, parse). It struggles the moment you need a loop or a decision like “if the result isn't good enough, try again differently.”
Graphs = state + cycles
LangGraph adds a shared State that flows through the graph, Nodes (steps), and conditional edges that decide where to go next. Crucially the graph can contain cycles — a node can loop back and iterate until done. Exactly what agents need.
A Rails-world analogy
A chain is like a single linear service object — Foo.call runs top to bottom and returns. LangGraph is more like a state machine (think AASM / state_machine): defined states, transitions and the conditions under which a transition fires — except here the LLM often decides the transition based on the state's contents.
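The state-plus-cycles idea can be sketched in a few lines of plain Ruby. This is not LangGraph's API: `NODES`, `ROUTER` and `run_graph` are hypothetical names, and the "review" step is a deterministic stand-in for a decision an LLM would normally make.

```ruby
# Toy state graph: nodes transform a shared state hash, a router picks the next node.
# Illustrative only; this is not LangGraph's API.

NODES = {
  # Produce a new draft and count the attempt.
  draft: ->(state) { state.merge(answer: "draft ##{state[:attempts] + 1}", attempts: state[:attempts] + 1) },
  # Stand-in for an LLM judgement: accept the third attempt.
  review: ->(state) { state.merge(good_enough: state[:attempts] >= 3) }
}.freeze

# Conditional edges: given the node just run and the new state, name the next node.
ROUTER = {
  draft: ->(_state) { :review },
  review: ->(state) { state[:good_enough] ? :done : :draft } # the cycle: loop back to :draft
}.freeze

def run_graph(start, state, max_steps: 20)
  node = start
  steps = 0
  until node == :done
    raise "graph did not finish within #{max_steps} steps" if steps >= max_steps

    state = NODES.fetch(node).call(state)
    node = ROUTER.fetch(node).call(state)
    steps += 1
  end
  state
end

final = run_graph(:draft, { attempts: 0 })
puts final.inspect
```

The `max_steps` guard is a design choice worth keeping in any real version: once a graph can loop, something has to stop it from looping forever.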
Embeddings: turning meaning into numbers
Run each piece of text through an embedding model (OpenAI text-embedding-3-small, Voyage, or a local model via Ollama) and you get back a vector: a list of numbers (e.g. 1536 of them) that represents the meaning of the text. Texts with similar meaning end up with vectors that sit close together.
1 · A vector is just a row of numbers
Picture a phrase's embedding as a row of 16 cells: each cell is one dimension, one number. Real models use hundreds to thousands of dimensions (1536 is a common size); the idea is identical.
2 · Similar meaning → nearby vectors
Picture a map where each dot is one phrase, placed by meaning (a 2-D map like this is a flattened projection of the high-dimensional space). Any two dots can be compared by computing their cosine similarity. Legal phrases cluster together, far from animals or food, even though they share no words.
Cosine similarity — and what it has to do with sine & cosine
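The surprise: cosine similarity literally is a cosine, namely the cosine of the angle θ between two vectors A and B. It's the same cosine from school trigonometry:
cos θ = (A · B) / (|A| · |B|)
Here |A| is the length (magnitude) of A: √(a₁² + a₂² + … + aₙ²).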
The dot product A · B is computed component by component: a₁·b₁ + a₂·b₂ + … + aₙ·bₙ. It works the same in 2-D as in 1536-D — you just can't picture the 1536-D arrow.
Hold vector A fixed, rotate B, and watch the cosine. Two aligned arrows → 1. Perpendicular → 0. Opposite → −1.
Why cosine, not sine?
Because cosine is 1 at a zero angle and falls as the vectors diverge, which is exactly what we want from “similarity.” Sine is 0 at 0°, so it would give identical vectors the lowest score, and it is 0 again at 180°, so it could not tell aligned vectors from opposite ones. Recall the right triangle: cos = adjacent / hypotenuse. When two directions are nearly aligned, they project almost entirely onto each other → cosine near 1.
Why the angle, not the distance?
We care about direction, not length. A short note and a long article on the same topic produce different-length vectors that point the same way. Cosine ignores length (that's the |A|·|B| division), so it sees “same topic.” Euclidean distance would be fooled by length.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  mag = ->(v) { Math.sqrt(v.sum { |x| x**2 }) }
  dot / (mag.call(a) * mag.call(b))
end
At real scale you don't compute this in Ruby — that's what a vector database is for, where cosine distance is a single SQL operator (<=> in pgvector) the database can index.
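At toy scale, though, the function is handy for checking the claims above in numbers. The sketch below repeats `cosine_similarity` so it is self-contained and adds a hypothetical `euclidean_distance` helper; the vectors are invented for illustration and are not real embeddings.

```ruby
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  mag = ->(v) { Math.sqrt(v.sum { |x| x**2 }) }
  dot / (mag.call(a) * mag.call(b))
end

# Straight-line (L2) distance between the two vector tips.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

SHORT_NOTE   = [1.0, 2.0, 3.0].freeze
LONG_ARTICLE = [10.0, 20.0, 30.0].freeze # same direction, ten times the length

# Same direction: cosine says "identical", Euclidean says "far apart".
puts cosine_similarity(SHORT_NOTE, LONG_ARTICLE).round(6)   # 1.0
puts euclidean_distance(SHORT_NOTE, LONG_ARTICLE).round(2)  # 33.67

# Perpendicular and opposite directions.
puts cosine_similarity([1.0, 0.0], [0.0, 1.0]).round(6)                 # 0.0
puts cosine_similarity(SHORT_NOTE, SHORT_NOTE.map { |x| -x }).round(6)  # -1.0
```

The rounding is there because floating-point square roots can leave the raw result a hair away from exactly 1 or −1.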
The RAG pipeline, step by step
RAG has two phases: one you run once, up front (indexing), and one that runs on every question (retrieval + generation).
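The two phases can be sketched end to end in plain Ruby. One loud assumption: `toy_embed` is a stand-in that only counts words from a fixed vocabulary, so it captures word overlap, not meaning; a real pipeline would call an embedding model at that point. All method names here are hypothetical.

```ruby
# Toy RAG flow. toy_embed is a placeholder for a real embedding model.
VOCAB = %w[notice period days invoice payment due refund shipping].freeze

def toy_embed(text)
  words = text.downcase.scan(/[a-z]+/)
  VOCAB.map { |term| words.count(term).to_f }
end

def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x**2 }) * Math.sqrt(b.sum { |x| x**2 })
  norm.zero? ? 0.0 : dot / norm # guard against all-zero vectors
end

# Phase 1, indexing (run once): embed each chunk and store text + vector.
def build_index(chunks)
  chunks.map { |text| { text: text, vector: toy_embed(text) } }
end

# Phase 2a, retrieval (per question): embed the question, keep the k nearest chunks.
def retrieve(index, question, k: 2)
  query_vector = toy_embed(question)
  index.max_by(k) { |entry| cosine_similarity(entry[:vector], query_vector) }
end

# Phase 2b, generation: build the prompt that would be sent to the LLM.
def build_prompt(question, chunks)
  context = chunks.map { |c| "- #{c[:text]}" }.join("\n")
  "Answer from the context only.\n\nContext:\n#{context}\n\nQuestion: #{question}"
end

CHUNKS = [
  "The notice period is 30 days.",
  "Each invoice has payment due within 14 days.",
  "Shipping is free; a refund takes 5 days."
].freeze

index = build_index(CHUNKS)
question = "When is payment due on an invoice?"
puts build_prompt(question, retrieve(index, question, k: 1))
```

The in-memory array plays the role of the vector database here; the final prompt string is what the generation step would hand to the model.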
Vector databases & pgvector
A vector database stores the vectors plus the original text and metadata, and answers “find the N nearest vectors to this one” fast. Three common choices:
Pinecone
Managed cloud, scales, no infrastructure to run — you pay for it.
Chroma
Lightweight, open-source, great for a prototype and local work.
pgvector
A PostgreSQL extension — no new database in your stack. Add a vector column to an existing table and search with operators right in SQL: <-> (L2 / Euclidean), <=> (cosine distance), <#> (negative inner product).
In Rails, the neighbor gem adds vector search as an Active Record scope:
class DocumentChunk < ApplicationRecord
  has_neighbors :embedding
end

# store a chunk
DocumentChunk.create!(content: chunk_text, embedding: embedding_vector)

# retrieval: the 5 nearest chunks to the query vector
DocumentChunk
  .nearest_neighbors(:embedding, query_vector, distance: "cosine")
  .first(5)
The embedding call (OpenAI / Voyage / Ollama) is just an HTTP request, chunking is plain Ruby, and retrieval is pgvector + neighbor — so a RAG pipeline can be built natively in Rails, with no separate Python service.
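As a rough sketch of "just an HTTP request", the snippet below builds an embeddings call with Ruby's standard library. It assumes OpenAI's public embeddings endpoint and the `text-embedding-3-small` model, with a response shaped like `{"data": [{"embedding": [...]}]}`; other providers use different URLs and payloads, so check their documentation. The helper names are hypothetical.

```ruby
require "json"
require "net/http"
require "uri"

# Assumed endpoint: OpenAI's embeddings API.
EMBEDDINGS_URL = URI("https://api.openai.com/v1/embeddings")

# Build the POST request without sending it (easy to inspect and test).
def build_embedding_request(text, api_key:, model: "text-embedding-3-small")
  request = Net::HTTP::Post.new(EMBEDDINGS_URL)
  request["Authorization"] = "Bearer #{api_key}"
  request["Content-Type"] = "application/json"
  request.body = JSON.generate({ model: model, input: text })
  request
end

# Pull the vector out of a response body shaped like {"data":[{"embedding":[...]}]}.
def parse_embedding(response_body)
  JSON.parse(response_body).dig("data", 0, "embedding")
end

# Send the request. Needs network access and a real API key.
def fetch_embedding(text, api_key:)
  request = build_embedding_request(text, api_key: api_key)
  response = Net::HTTP.start(EMBEDDINGS_URL.host, EMBEDDINGS_URL.port, use_ssl: true) do |http|
    http.request(request)
  end
  raise "embedding request failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  parse_embedding(response.body)
end
```

In a Rails app, `fetch_embedding(chunk_text, api_key: ...)` would supply the `embedding_vector` stored on each `DocumentChunk`.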
Semantic search vs keyword search
This is the whole point. Keyword search matches words; semantic search matches meaning. Run the same query both ways and the two can retrieve very different results.
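A small sketch makes the contrast visible. The 3-D vectors below are invented by hand (axes loosely meaning legal, animals, food) and stand in for what an embedding model might produce; the helper names are hypothetical.

```ruby
# Hand-made vectors for illustration only; real embeddings come from a model.
DOCS = [
  { text: "Either party may cancel the contract", vector: [0.9, 0.1, 0.0] },
  { text: "The cat sleeps on the sofa",           vector: [0.0, 0.9, 0.1] },
  { text: "Bake the bread for forty minutes",     vector: [0.1, 0.0, 0.9] }
].freeze

QUERY = "terminate our agreement"
QUERY_VECTOR = [0.8, 0.0, 0.1].freeze # invented: a "legal-ish" direction

def tokens(text)
  text.downcase.scan(/[a-z]+/)
end

# Keyword search: keep documents that share at least one word with the query.
def keyword_search(docs, query)
  query_words = tokens(query)
  docs.select { |doc| (tokens(doc[:text]) & query_words).any? }
end

def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x**2 }) * Math.sqrt(b.sum { |x| x**2 }))
end

# Semantic search: rank documents by cosine similarity to the query vector.
def semantic_search(docs, query_vector, k: 1)
  docs.max_by(k) { |doc| cosine_similarity(doc[:vector], query_vector) }
end

p keyword_search(DOCS, QUERY).map { |d| d[:text] }
# no shared words, so keyword search finds nothing: []
p semantic_search(DOCS, QUERY_VECTOR).map { |d| d[:text] }
# nearest by direction: ["Either party may cancel the contract"]
```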
What makes a RAG system good (or bad)
Chunking quality
Badly cut chunks → irrelevant retrieval. Split with overlap so context isn't lost at the boundaries, and for structured docs (contracts) split by structure — clauses, sections — not blindly every N characters.
Re-ranking
Vector search is fast but coarse. Take the top candidates and re-score them with a more precise re-ranker to pick the truly best ones.
Hybrid search
Combine semantic + classic keyword search. Vectors sometimes miss exact terms (paragraph numbers, product names); keyword catches them.
Metadata filtering
Before searching vectors, narrow by metadata — only this client's documents, only this contract type. Smaller, cleaner search space.
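Of the four levers above, chunking with overlap is the easiest to show in code. The sketch below is one simple approach, a sliding window over words; `chunk_words` and its `size` / `overlap` parameters are hypothetical, and for structured documents you would split on clauses or sections first, as noted above.

```ruby
# Sliding-window chunking: each chunk repeats the last `overlap` words of the previous one.
def chunk_words(text, size: 8, overlap: 2)
  raise ArgumentError, "need 0 <= overlap < size" unless overlap >= 0 && overlap < size

  words = text.split
  step = size - overlap
  chunks = []
  start = 0
  while start < words.length
    chunks << words[start, size].join(" ")
    break if start + size >= words.length # this window already reached the end

    start += step
  end
  chunks
end

clause = "The tenant shall give thirty days written notice before ending the lease agreement"
puts chunk_words(clause, size: 6, overlap: 2)
# The tenant shall give thirty days
# thirty days written notice before ending
# before ending the lease agreement
```

Because "thirty days" and "before ending" appear in two chunks each, a question that straddles a boundary can still match a chunk that contains the full phrase.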
Recap: the whole thing
- Embedding turns text into a vector — a list of numbers that captures meaning.
- Cosine similarity = the cosine of the angle between two vectors: 1 = same meaning, 0 = unrelated, −1 = opposite.
- Indexing (once): chunk your documents, embed each chunk, store the vectors in a vector DB.
- Retrieval (per query): embed the question, find the nearest chunks by cosine, attach them to the prompt.
- Generation: the LLM answers grounded in those retrieved chunks — that's RAG.
- LangChain wires linear steps; LangGraph adds state, branches and loops for agents.
- In Rails it's all native: HTTP for embeddings, plain Ruby for chunking, pgvector + neighbor for search.