Quantization & Local LLMs — How Much RAM to Run a Model, Quantization, Parameters, MoE & GGUF Explained

Start here

How much RAM do I need to run this model?

Short version: the memory a model needs is set almost entirely by two numbers — how many parameters it has, and how many bits you store each one in (its quantization). Pick a model and a quant level below and watch the numbers move. Every term here gets its own hands-on section underneath.

Total parameters 744 B

All weights must sit in memory — even the ones that don't run on every word.

Mixture of experts?

Active per token 40 B

Only this slice computes per word — it sets the speed, not the memory.

Quantization (bits per weight)

Fewer bits = smaller & faster, a little less accurate. The Quantization section below lets you feel it.

Context window 32K

The conversation's memory (KV cache). Bigger window = more RAM on top of the weights.
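
As a rough sketch of why a bigger window costs more RAM: with standard attention, the cache holds one key vector and one value vector per layer, per KV head, per token. The layer count, head count and head size below are hypothetical, picked only to make the arithmetic visible; they do not describe any particular model, and architectures that compress the cache will come out smaller.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Approximate KV-cache size for standard attention.

    The leading 2 covers the two cached tensors (keys and values).
    bytes_per_value=2 assumes the cache is stored in a 16-bit format.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Hypothetical model shape, for illustration only.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=32 * 1024)
print(f"KV cache at 32K context: {size / 1024**3:.1f} GiB")
```

The cache grows linearly with the window: double the context and this number doubles, while the weights stay the same size.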

These figures are approximate and meant for intuition: real GGUF quant files carry a little overhead, and "dynamic" quants keep important layers at higher precision, so actual sizes drift a bit. The relationship is what's solid: size in bytes ≈ parameters × bits ÷ 8.
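
The size rule is easy to try by hand. Here is a minimal sketch using the 744B example from this page; the function name is made up for illustration, and the result ignores file overhead and the KV cache.

```python
def model_size_bytes(n_params, bits_per_weight):
    """Weights-only size: parameters x bits / 8 (8 bits per byte)."""
    return n_params * bits_per_weight / 8


TOTAL_PARAMS = 744 * 10**9  # the "744B" example model

for bits in (16, 8, 4, 2):
    gb = model_size_bytes(TOTAL_PARAMS, bits) / 1e9
    print(f"{bits:>2}-bit: about {gb:,.0f} GB")
```

Halving the bits halves the file, which is the whole appeal of quantization.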

The first number

Parameters: a model is just a giant pile of numbers

When people say a model is "744B", they mean it's built from 744 billion numbers — the parameters (also called weights). Each one is a tiny dial the model tuned during training. That's literally all a model is: an enormous list of numbers plus the simple rules for multiplying your input through them. More parameters generally means more capability — and more memory to hold them.

Numbers grouped into "tensors"

Those billions of numbers aren't a flat list — they're packed into grids called tensors (a fancy word for a multi-dimensional array of numbers). A model is hundreds of tensors stacked in layers. "Tensor type" just means what number format each grid uses.
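
To make "parameters packed into tensors" concrete, here is a small sketch that counts parameters from tensor shapes. The tensor names and shapes are hypothetical, not read from a real model file.

```python
from math import prod

# Hypothetical tensors: name -> shape (rows, columns, ...).
tensors = {
    "token_embedding": (32000, 4096),
    "layer0.attention.q_proj": (4096, 4096),
    "layer0.ffn.up_proj": (4096, 11008),
    "layer0.norm": (4096,),
}

# A tensor holds one parameter per cell, so its count is the product of its shape.
counts = {name: prod(shape) for name, shape in tensors.items()}
total = sum(counts.values())

for name, count in counts.items():
    print(f"{name:<26} {count:>12,}")
print(f"{'total':<26} {total:>12,}")
```

A model's headline parameter count is this same sum taken over every tensor in every layer.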

Total vs active

In some models (mixture-of-experts) only a fraction of the parameters run on each word. The full count still has to live in memory; the active slice is what does the maths. Hold that thought — it's the whole trick behind big models that still run fast.

The big idea

Quantization: store each number in fewer bits

"Quant" is short for quantization — a compression trick that shrinks the model by storing every weight with fewer bits of precision. Trained models keep each number as a 16-bit value (2 bytes). Quantization rounds them to fewer bits — trading a little accuracy for a lot less memory. It's like saving a photo as a lower-quality JPEG: still clearly the photo, just lighter and slightly less crisp.

Drag the precision down and watch one row of real weights snap to a coarser grid. The file shrinks; the picture gets blockier. Notice how gracefully it degrades — that's why 4-bit is so popular.

Bits per weight 4-bit

File size

100%

Detail kept

100%

Distinct values each weight can take

The bits-per-weight ladder

The names just tell you the bits. Fewer bits → smaller file, less memory, a bit more error.
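
Here is a minimal sketch of the "snap to a coarser grid" idea. It is a deliberately simplified scheme (one evenly spaced grid across the whole row); real GGUF quants work on small blocks of weights with their own scales, so treat this as intuition rather than the actual file format. The weight values are made up.

```python
def quantize(weights, bits):
    """Snap each weight to one of 2**bits evenly spaced values between min and max."""
    levels = 2 ** bits
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (levels - 1)
    codes = [round((w - lo) / step) for w in weights]  # small integers: what gets stored
    restored = [lo + c * step for c in codes]          # what the model computes with
    return codes, restored, step


weights = [-0.42, -0.13, 0.02, 0.07, 0.31, 0.58]  # made-up values

for bits in (8, 4, 2):
    codes, restored, step = quantize(weights, bits)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit: {2 ** bits:>3} distinct values, worst rounding error {worst:.4f}")
```

Each bit you remove halves the number of distinct values a weight can take, and the worst-case rounding error is about half the gap between neighbouring grid points.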

"Dynamic" quants: smart about where to spend bits

A plain 2-bit quant crushes every weight to 2 bits. A dynamic quant (e.g. Unsloth's "UD" GGUFs) keeps the sensitive layers at higher precision and squeezes the rest harder. That's why a good 2-bit dynamic quant still works surprisingly well — it isn't uniformly 2-bit, it's clever about which bits matter. Which layers are sensitive? That's the next section.
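
A quick way to see what "not uniformly 2-bit" means is to work out the average bits per weight of a mixed recipe. The split below (10% of the weights at 6 bits, the rest at 2 bits) is a hypothetical example, not the recipe of any real dynamic quant.

```python
def average_bits(groups):
    """groups: list of (number_of_weights, bits_per_weight) pairs."""
    total_weights = sum(n for n, _ in groups)
    total_bits = sum(n * bits for n, bits in groups)
    return total_bits / total_weights


plain = [(100 * 10**9, 2)]                    # every weight at 2 bits
dynamic = [(10 * 10**9, 6), (90 * 10**9, 2)]  # hypothetical: sensitive 10% kept at 6 bits

print(f"plain:   {average_bits(plain):.1f} bits per weight")
print(f"dynamic: {average_bits(dynamic):.1f} bits per weight")
```

In this made-up case the file grows by a fifth, but the layers that hurt most when rounded keep three times the precision.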

The source format

Tensor types: BF16, F32 — the precision before you quantize

A model card often says something like "Tensor type: BF16 · F32". That's the original precision the weights were trained and released in — before anyone quantizes them. It's usually a mix: most weights in BF16, a few sensitive ones kept in F32. Every number format trades off two things — range (how big/small a number it can hold) and precision (how finely it splits the values in between). Click each to see how the bits are split.

Why models ship in BF16, not FP16

BF16 and FP16 are both 16-bit, but they split the bits differently. FP16 has 5 exponent bits and 10 mantissa bits: more precision, but a narrow range, so values above 65,504 overflow and very small ones vanish to zero. BF16 has 8 exponent bits and 7 mantissa bits: the same range as F32, with less precision. During training, not overflowing matters more than fine detail, so most modern models are released in BF16.
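
The range-versus-precision trade can be checked with nothing but the standard library. This sketch round-trips a few values through each format. One simplification to note: the BF16 helper truncates the low 16 bits of the F32 value, whereas real converters usually round to nearest.

```python
import struct


def to_bf16(x):
    """Round-trip x through BF16 by keeping only the top 16 bits of its F32 form."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def to_fp16(x):
    """Round-trip x through FP16; out-of-range values become infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return float("inf") if x > 0 else float("-inf")


for x in (70000.0, 1e-10, 1.001):
    print(f"{x:>10}: FP16 -> {to_fp16(x)!r:<14} BF16 -> {to_bf16(x)!r}")
```

The large and tiny values survive in BF16 but not in FP16, while 1.001 keeps more of its detail in FP16: range on one side, precision on the other.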

Why a few layers stay F32

A handful of layers are kept at full 32-bit because a rounding error there hurts the most — things like normalization and the MoE router. Quantization tools (and dynamic quants) protect exactly these. To see why the router is so touchy, route a token through the experts next.

Why a huge model can still run

Mixture of Experts: a router that wakes only a few specialists

In a normal ("dense") model, every weight is used for every word. A Mixture-of-Experts model splits part of each layer into many separate sub-networks — the experts. For any given word, a tiny router (a.k.a. the gating network) scores all the experts and sends the word to only the top few. That's how a model can have a huge total size but a small active slice. Send a token through it:

a word

router scores everyone

Experts in the layer 32

Experts picked per word (top-k) 8

In memory (all experts) 100%

Computed (only the chosen) 25%
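
The routing step can be sketched in a few lines. This assumes one common variant (pick the top-k scores, then softmax over just those to get mixing weights); real models differ in the details, and the scores here are made up.

```python
import math


def route(scores, top_k):
    """Pick the top_k highest-scoring experts and give them softmax mixing weights."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:top_k]
    exps = [math.exp(scores[i]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


scores = [0.1, 2.3, -0.5, 1.7, 0.0, -1.2, 0.9, 0.4]  # hypothetical router scores, 8 experts
picked = route(scores, top_k=2)

for expert, weight in picked:
    print(f"expert {expert}: mixing weight {weight:.2f}")
print(f"computed: {len(picked) / len(scores):.0%} of the experts in this layer")
```

Only the chosen experts do any maths for this word; the others sit in memory untouched until some other word needs them.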

The library mental model

The model is a huge library of books (the experts); the router is the index card saying which few to grab; the inference program is the librarian who fetches just those. But the whole library still has to be built and stocked — which is why all the weights live in RAM even though only a few run per word.

Why the router is touchy

The router makes a discrete choice — which experts win. If quantization nudges a score and a different expert gets picked, the answer can change sharply, not gracefully. A small error in an ordinary weight just makes one number slightly off; a small error in the router can reroute the whole word. That's why it's kept at high precision.
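
Here is a toy illustration of that difference. The scores and the size of the rounding error are hypothetical, and the error is applied straight to the scores for simplicity (in a real model it would enter through the router's weights).

```python
def top1(scores):
    """Index of the winning expert."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Router: two experts nearly tied, then a small hypothetical rounding error.
exact_scores = [1.02, 0.98, 0.20]
error = [-0.03, 0.03, 0.00]
noisy_scores = [s + e for s, e in zip(exact_scores, error)]

print("router, exact:  expert", top1(exact_scores))
print("router, nudged: expert", top1(noisy_scores))

# Ordinary weight: an error of the same size just shifts the result a little.
x = 2.0
print("ordinary weight, exact: ", 0.50 * x)
print("ordinary weight, nudged:", (0.50 + 0.03) * x)
```

The same 0.03 error moves the ordinary result by a few percent but hands the whole word to a different expert.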

What makes it fast or slow

Speed: the active slice and your memory bandwidth

The active parameters set the speed, not the total. Generating each word means reading those active weights out of RAM and doing the maths. On a CPU the bottleneck during generation is almost never the maths; it's memory bandwidth: how fast weights can be shovelled from RAM into the chip. So a rough rule is speed (words per second) ≈ bandwidth (bytes per second) ÷ bytes read per word. Fewer bits (lower quant) means fewer bytes to read, which is why quantizing also makes it faster.
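
The rule of thumb works out like this for the 40B-active example. The 100 GB/s bandwidth figure is a hypothetical round number, not a measurement of any specific machine, and the result is an upper bound that ignores the KV cache and other overheads.

```python
def tokens_per_second(active_params, bits_per_weight, bandwidth_bytes_per_s):
    """Upper-bound speed if every active weight is read from RAM once per word."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_bytes_per_s / bytes_per_token


ACTIVE_PARAMS = 40 * 10**9  # active parameters per word
BANDWIDTH = 100 * 10**9     # hypothetical: 100 GB/s of memory bandwidth

for bits in (16, 8, 4):
    speed = tokens_per_second(ACTIVE_PARAMS, bits, BANDWIDTH)
    print(f"{bits:>2}-bit: about {speed:.2f} words per second")
```

Going from 16-bit to 4-bit quarters the bytes read per word, so the ceiling on speed rises fourfold.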

Active params per word 40 B

Quantization

Hardware (memory bandwidth)

What the chip actually does: multiply-add

Once weights arrive, the core operation is multiply-accumulate (MAC): multiply each weight by an input number, add it to a running total, and repeat roughly once per active weight, which means billions of times per word. Chips do many at once with SIMD instructions (AVX on x86, NEON on ARM). Weak SIMD support = much slower CPU inference.
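
Written out as a plain loop, multiply-accumulate is just this (a sketch for intuition; real inference code does the same thing in SIMD-sized batches rather than one pair at a time):

```python
def dot(weights, inputs):
    """Multiply-accumulate: one multiply and one add per weight."""
    acc = 0.0
    for w, x in zip(weights, inputs):
        acc += w * x  # the MAC step
    return acc


print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 1*4 + 2*5 + 3*6
```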

Why it waits on the conveyor belt

The SIMD units finish their multiply-adds faster than RAM can deliver the next batch of weights, so the chip waits. The maths isn't the holdup; the conveyor belt feeding it is. GPUs face the same limit when generating one word at a time, but their memory bandwidth is far higher (GDDR on consumer cards, HBM in datacentres), so they are fed much faster and run far faster.

A common myth

Languages: it's the training data, not the final step

It's tempting to think a model is "bad at Czech" because of some final step. It isn't. The last thing a model does is pick the next token from its vocabulary — the tokenizer's set of text chunks. As long as a language's characters are in the vocabulary (they almost always are for major languages), the model can emit it. How well it does comes from how much of that language it saw in training — and quantization erodes the weakest languages first. Type something in both languages:

English

Czech

For under-represented languages the tokenizer often chops text into more, smaller pieces — so the same sentence costs more tokens. That doesn't break the language, but it's slower, pricier, and eats the context window faster. (Simplified illustration.)

Vocabulary = can it emit the characters

The vocabulary mostly sets efficiency, not what's possible: most modern tokenizers are byte-level or fall back to raw bytes, so a model can spell out any character byte by byte even with no dedicated token. It just costs more tokens. Czech ř č ž are representable everywhere; rare scripts simply cost more. So the vocabulary is rarely a hard limit.
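
You can see the byte-level floor without any tokenizer at all. This sketch only counts UTF-8 bytes, which is the worst case for a byte-level tokenizer; real tokenizers merge common byte sequences into single tokens, so actual counts are lower. The sample sentences are just illustrations.

```python
def byte_cost(text):
    """Return (characters, UTF-8 bytes). Bytes are the worst-case token count."""
    return len(text), len(text.encode("utf-8"))


for sample in ("hello", "ř", "How are you today?", "Jak se dnes máš?"):
    chars, n_bytes = byte_cost(sample)
    print(f"{sample!r}: {chars} characters, {n_bytes} bytes")
```

Accented letters such as ř take two bytes each, so text full of them starts from a higher floor before any merging happens.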

Quantization hits weak languages first

A model's grip on a low-resource language is already thin, so squeezing weights to 2-bit degrades that before it dents its rock-solid English. If a heavily-quantized model's Czech feels shakier than its English, that's quantization eroding the thinnest representations — not the vocabulary refusing the language.

What you download

PyTorch, Safetensors, GGUF, MLX — who does what

On a model page you'll see a row of library tags. They split into three jobs: the engines that do the tensor maths, the high-level library that loads models conveniently, and the file formats that store the weights. For running a model locally on CPU/Ollama, the one that matters is GGUF. Tap a tag:

So what does GGUF stand for?

GGUF = GGML Universal File. It grew out of GGML — a lightweight C/C++ tensor library named after its creator, Georgi Gerganov (the developer behind llama.cpp). The old GGML files broke every time the format changed; GGUF replaced them in 2023 with a self-describing format — one file holds the quantized weights, the tokenizer, and metadata ("I'm this model, MoE, 4-bit, here's the chat template"). That's why Ollama or llama.cpp can run a single .gguf with no extra config files.
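
"Self-describing" starts with the first 24 bytes of the file. As a hedged sketch, assuming the little-endian header layout used by GGUF version 2 and later (4-byte magic, 32-bit version, 64-bit tensor count, 64-bit metadata count), the header can be read like this. The function name is made up, and the demo parses a synthetic header built in memory rather than a real model file.

```python
import struct

HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor count, metadata key/value count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def read_gguf_header(data):
    """Parse the fixed-size start of a GGUF file (assumes version 2+ layout)."""
    magic, version, tensor_count, metadata_kv_count = struct.unpack_from(HEADER_FORMAT, data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


# Synthetic header for demonstration; the counts are arbitrary.
fake_header = struct.pack(HEADER_FORMAT, b"GGUF", 3, 291, 24)
print(read_gguf_header(fake_header))
```

For a real file you would pass in the first HEADER_SIZE bytes read in binary mode. The metadata entries that follow are where the tokenizer, architecture and chat template live.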

Curious how the model "thinks" in the first place?

This guide was about running a model. How one actually learns and predicts the next word is its own story. Open "How AI Works" →

In one breath

The whole thing, recapped

A model is a pile of numbers — the parameters (weights). "744B" = 744 billion of them.
RAM for the weights in bytes ≈ parameters × bits ÷ 8, plus the KV cache for the context window. All the weights must live in memory at once.
Quantization stores each weight in fewer bits (Q4, Q2…) — smaller and faster, slightly less accurate; dynamic quants protect the important layers.
Tensor types (BF16/F32) are the original precision; BF16 keeps F32's range, and touchy layers like the router stay F32.
Mixture of experts: a router wakes only a few experts per word — huge total, small active slice.
Active params set the speed; on CPU the real limit is memory bandwidth (bytes read per word).
Languages come from training data (and quant erodes weak ones first), not the final vocabulary step.
You download a GGUF (GGML Universal File) to run a quantized model locally with llama.cpp or Ollama.
