01
f32, q8_0, and q4_0
The runtime deliberately supports only three data paths. f32 serves tests and baselines, q8_0 balances quality against compactness, and q4_0 fits the largest possible model under tight browser memory limits.
- f32: 4.0 bytes per parameter
- q8_0: roughly 1.0625 bytes per parameter
- q4_0: roughly 0.5625 bytes per parameter
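The fractional figures above follow from block overhead. A minimal sketch of the arithmetic, assuming a block of 32 weights that carries one 16-bit scale (the text does not state the block layout, but this assumption reproduces both figures; `bytesPerParameter` and `weightMiB` are hypothetical helper names, and the 15-million parameter count is only an illustrative input):

```typescript
// Assumed layout, not stated in the text above: 32 weights per block,
// one 16-bit scale stored alongside each block.
const WEIGHTS_PER_BLOCK = 32;
const SCALE_BYTES = 2;

type Format = "f32" | "q8_0" | "q4_0";

// Bits used to store a single weight in each format.
const BITS_PER_WEIGHT: Record<Format, number> = { f32: 32, q8_0: 8, q4_0: 4 };

// Average storage cost of one parameter, including the per-block scale.
function bytesPerParameter(format: Format): number {
  if (format === "f32") return 4; // plain floats: no blocks, no scales
  const payloadBytes = (WEIGHTS_PER_BLOCK * BITS_PER_WEIGHT[format]) / 8;
  return (payloadBytes + SCALE_BYTES) / WEIGHTS_PER_BLOCK;
}

// Weight memory in MiB for a given parameter count (weights only,
// ignoring activations, KV cache, and tokenizer data).
function weightMiB(parameters: number, format: Format): number {
  return (parameters * bytesPerParameter(format)) / (1024 * 1024);
}

// Illustrative input: a hypothetical model with 15 million parameters.
for (const format of ["f32", "q8_0", "q4_0"] as const) {
  console.log(format, bytesPerParameter(format), weightMiB(15e6, format).toFixed(1), "MiB");
}
```

Under this assumption a q8_0 block is 34 bytes and a q4_0 block is 18 bytes, so the scale adds 0.0625 bytes per parameter in both formats.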
02
4-bit cost in WASM
q4_0 saves bytes but costs instructions. Standard WASM has no native 4-bit tensor arithmetic, so the runtime must unpack each nibble with masks and shifts before accumulating. That makes q4_0 a memory win, not an automatic speed win.
- Nibble unpacking
- Scale per block
- WASM SIMD (v128) acceleration where results stay deterministic
- Dequantize on demand
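The steps listed above can be sketched in scalar form. This is an illustration under stated assumptions, not the runtime's actual kernel: it assumes 32 weights per block packed into 16 bytes, with byte i holding weight i in its low nibble and weight i + 16 in its high nibble, and stored nibbles offset by 8. The scale is passed as a plain number rather than decoded from 16 bits, and `dequantizeQ4Block` and `dotQ4Block` are hypothetical names.

```typescript
// Assumed layout (hypothetical): 32 weights per block in 16 packed bytes.
// Byte i holds weight i in its low nibble and weight i + 16 in its high
// nibble; each stored nibble is the signed value plus 8.
const Q4_WEIGHTS_PER_BLOCK = 32;
const Q4_PACKED_BYTES = Q4_WEIGHTS_PER_BLOCK / 2;

// Expand one block into 32 floats.
function dequantizeQ4Block(scale: number, packed: Uint8Array, out: Float32Array): void {
  for (let i = 0; i < Q4_PACKED_BYTES; i++) {
    const byte = packed[i];
    const low = (byte & 0x0f) - 8; // mask keeps the low nibble
    const high = (byte >> 4) - 8;  // shift exposes the high nibble
    out[i] = low * scale;
    out[i + Q4_PACKED_BYTES] = high * scale;
  }
}

// Dot product of one block with 32 f32 activations, unpacking on demand
// instead of materializing the dequantized weights first.
function dotQ4Block(scale: number, packed: Uint8Array, x: Float32Array): number {
  let acc = 0;
  for (let i = 0; i < Q4_PACKED_BYTES; i++) {
    const byte = packed[i];
    acc += ((byte & 0x0f) - 8) * x[i];
    acc += ((byte >> 4) - 8) * x[i + Q4_PACKED_BYTES];
  }
  return acc * scale; // the scale is applied once per block
}

// Example block: every byte is 0x9f, so low nibbles are 15 and high nibbles are 9.
const packed = new Uint8Array(Q4_PACKED_BYTES).fill(0x9f);
const weights = new Float32Array(Q4_WEIGHTS_PER_BLOCK);
dequantizeQ4Block(0.5, packed, weights);
console.log(weights[0], weights[16]); // (15 - 8) * 0.5 and (9 - 8) * 0.5
```

Each packed byte costs one mask, one shift, and two subtractions before any multiply-accumulate can happen, which is the instruction overhead described above. Applying the scale once per block in `dotQ4Block` keeps the inner loop free of per-weight scale multiplies.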
03
Practical model candidates
The source analysis identifies candidates such as llama2.c-stories15M, Aurelius 10M, Pythia-14M, Pythia-31M, and tiny-gpt2, plus larger q4-only stress candidates. The site does not claim that all of them are production-ready; it explains why each is useful for validation or browser experiments.
- Llama path validation
- GPT-NeoX path validation
- Small vocabulary experiments
- q8_0/q4_0 stress testing