# Quantization and Memory Budget

> TinyRustLM treats quantization as a product architecture decision: q8_0 and q4_0 determine which tiny models fit in browser memory, and the compute cost of each format is stated rather than hidden.

- Site: TinyRustLM
- Canonical: https://TinyRustLM.mirust.com/quantization/
- Version: 1.3.0
- Updated UTC: 2026-06-29T22:59:10Z

## f32, q8_0, and q4_0

The runtime data paths are intentionally limited. f32 is useful for tests and baselines, q8_0 is the balanced path for quality and compactness, and q4_0 exists for maximum model capacity under tight browser memory limits.

- f32: 4.0 bytes per parameter
- q8_0: roughly 1.0625 bytes per parameter
- q4_0: roughly 0.5625 bytes per parameter
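
The per-parameter figures above are consistent with a block layout of 32 weights sharing one 2-byte scale: 34 bytes per block for q8_0 and 18 bytes per block for q4_0. That layout is an assumption borrowed from the common GGML-style formats, not something this page specifies. Under that assumption, a minimal sketch of a weight-memory estimate might look like the following; the `Format` type, its methods, and the 15,000,000 parameter count are hypothetical and chosen only for illustration.

```rust
/// Storage formats discussed on this page (hypothetical type, for illustration only).
#[allow(non_camel_case_types)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum Format {
    F32,
    Q8_0,
    Q4_0,
}

impl Format {
    /// Returns (bytes per block, weights per block).
    /// ASSUMPTION: 32-weight blocks with one 2-byte scale each.
    fn block_layout(self) -> (u64, u64) {
        match self {
            Format::F32 => (4, 1),    // one 4-byte float per weight
            Format::Q8_0 => (34, 32), // 32 one-byte quants + 2-byte scale
            Format::Q4_0 => (18, 32), // 16 bytes of packed nibbles + 2-byte scale
        }
    }

    /// Average storage cost of one parameter, in bytes.
    fn bytes_per_param(self) -> f64 {
        let (bytes, weights) = self.block_layout();
        bytes as f64 / weights as f64
    }

    /// Bytes needed for `params` weights, rounded up to whole blocks.
    fn weight_bytes(self, params: u64) -> u64 {
        let (bytes, weights) = self.block_layout();
        let blocks = (params + weights - 1) / weights;
        blocks * bytes
    }
}

fn main() {
    // Hypothetical parameter count; not a measured figure for any listed model.
    let params: u64 = 15_000_000;
    for format in [Format::F32, Format::Q8_0, Format::Q4_0] {
        let mib = format.weight_bytes(params) as f64 / (1024.0 * 1024.0);
        println!(
            "{:?}: {:.4} bytes/param, {:.2} MiB of weights",
            format,
            format.bytes_per_param(),
            mib
        );
    }
}
```

This counts weight storage only. Activations, the KV cache, and tokenizer data add to the real browser memory budget.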

## 4-bit cost in WASM

q4_0 saves bytes but costs instructions. Standard WASM does not provide native 4-bit tensor arithmetic, so the runtime must unpack nibbles with masks and shifts before accumulation. That makes q4_0 a memory win but not a free speed win.

- Nibble unpacking
- Scale per block
- SIMD v128 acceleration where results remain deterministic
- Dequantize on demand
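
The unpacking steps listed above can be sketched in scalar Rust. This is an illustration under stated assumptions, not the TinyRustLM implementation: the `BlockQ4` struct, the function names, the nibble order (low nibble holds element `i`, high nibble holds element `i + 16`), the offset of 8, and the `f32` scale are all hypothetical. Real q4_0-style formats usually store the scale as a 16-bit float, and an `f32` is used here only to keep the sketch dependency-free.

```rust
/// Number of weights per block (assumed).
const BLOCK: usize = 32;

/// Hypothetical q4_0-style block: one scale plus 32 weights packed two per byte.
struct BlockQ4 {
    scale: f32,
    quants: [u8; BLOCK / 2],
}

/// Splits one packed byte into two signed 4-bit values in the range -8..=7.
fn unpack_nibbles(byte: u8) -> (i32, i32) {
    let lo = (byte & 0x0F) as i32 - 8; // mask keeps the low 4 bits
    let hi = (byte >> 4) as i32 - 8;   // shift moves the high 4 bits down
    (lo, hi)
}

/// Expands a whole block to f32, applying the per-block scale.
fn dequantize_block(block: &BlockQ4) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (i, &byte) in block.quants.iter().enumerate() {
        let (lo, hi) = unpack_nibbles(byte);
        out[i] = lo as f32 * block.scale;
        out[i + BLOCK / 2] = hi as f32 * block.scale;
    }
    out
}

/// Dot product that dequantizes on demand: nibbles are unpacked inside the
/// accumulation loop and the scale is applied once per block.
fn dot_block(block: &BlockQ4, x: &[f32; BLOCK]) -> f32 {
    let mut acc = 0.0f32;
    for (i, &byte) in block.quants.iter().enumerate() {
        let (lo, hi) = unpack_nibbles(byte);
        acc += lo as f32 * x[i] + hi as f32 * x[i + BLOCK / 2];
    }
    acc * block.scale
}

fn main() {
    // 0x9F: low nibble 15 decodes to 7, high nibble 9 decodes to 1.
    let block = BlockQ4 { scale: 0.5, quants: [0x9F; BLOCK / 2] };
    let weights = dequantize_block(&block);
    let ones = [1.0f32; BLOCK];
    println!("w[0] = {}, w[16] = {}", weights[0], weights[16]);
    println!("dot = {}", dot_block(&block, &ones));
}
```

Each packed byte costs a mask, a shift, and two subtractions before any multiply-accumulate happens, which is the instruction overhead that f32 and q8_0 paths avoid or reduce.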

## Practical model candidates

The source analysis identifies candidates such as llama2.c-stories15M, Aurelius 10M, Pythia-14M, Pythia-31M, tiny-gpt2, and larger q4-only stress candidates. The site does not claim all are production-ready; it explains why each is useful for validation or browser experiments.

- Llama path validation
- GPT-NeoX path validation
- Small vocabulary experiments
- q8/q4 stress testing