Memory Math

Quantization and Memory Budget

TinyRustLM treats quantization as a product architecture decision: the choice between q8_0 and q4_0 determines which tiny models fit into browser memory, and the compute cost of each format is stated openly rather than hidden.

01

f32, q8_0, and q4_0

The runtime intentionally supports only three data paths. f32 is useful for tests and baselines, q8_0 is the balanced path for quality and compactness, and q4_0 exists for maximum model capacity under tight browser memory limits.

  • f32: 4.0 bytes per parameter
  • q8_0: roughly 1.0625 bytes per parameter
  • q4_0: roughly 0.5625 bytes per parameter
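
The per-parameter figures above can be reproduced with a little arithmetic. The sketch below assumes blocks of 32 weights that each carry one 2-byte scale, which is consistent with 1.0625 and 0.5625 but is not confirmed by the text here. The `Format` enum, `weight_bytes` helper, and the 15,000,000-parameter count are hypothetical names and values chosen for illustration. It counts weight storage only, not activations or any cache.

```rust
/// Storage formats discussed above (names are illustrative).
#[allow(non_camel_case_types)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum Format {
    F32,
    Q8_0,
    Q4_0,
}

/// Weights per quantization block. Assumption: 32, which matches the
/// bytes-per-parameter figures listed above.
const BLOCK_SIZE: u64 = 32;

impl Format {
    /// Bytes needed to store one block of 32 weights.
    /// Assumption: each quantized block carries one 2-byte scale.
    fn bytes_per_block(self) -> u64 {
        match self {
            Format::F32 => BLOCK_SIZE * 4,      // 32 * 4 bytes = 128
            Format::Q8_0 => BLOCK_SIZE + 2,     // 32 * 1 byte + scale = 34
            Format::Q4_0 => BLOCK_SIZE / 2 + 2, // 32 * 0.5 byte + scale = 18
        }
    }

    /// Average bytes per parameter for this format.
    fn bytes_per_param(self) -> f64 {
        self.bytes_per_block() as f64 / BLOCK_SIZE as f64
    }
}

/// Weight bytes for `params` parameters, rounding up to whole blocks.
fn weight_bytes(params: u64, format: Format) -> u64 {
    let blocks = (params + BLOCK_SIZE - 1) / BLOCK_SIZE;
    blocks * format.bytes_per_block()
}

fn main() {
    // Hypothetical parameter count, used only to show the scale of the savings.
    let params: u64 = 15_000_000;
    for format in [Format::F32, Format::Q8_0, Format::Q4_0] {
        let mib = weight_bytes(params, format) as f64 / (1024.0 * 1024.0);
        println!(
            "{:?}: {:.4} bytes/param, {:.1} MiB of weights",
            format,
            format.bytes_per_param(),
            mib
        );
    }
}
```

Under these assumptions, moving from f32 to q8_0 cuts weight storage to a bit over a quarter, and q4_0 roughly halves it again; the extra 0.0625 bytes per parameter in both quantized formats is the per-block scale.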

02

4-bit cost in WASM

q4_0 saves bytes but costs instructions. Standard WASM does not provide native 4-bit tensor arithmetic, so the runtime must unpack nibbles with masks and shifts before accumulation. That makes q4_0 a memory win but not a free speed win.

  • Nibble unpacking
  • Scale per block
  • SIMD v128 acceleration where deterministic
  • Dequantize on demand
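
The unpacking cost described above can be made concrete with a scalar sketch. It is not the TinyRustLM kernel: the `BlockQ4` struct, the nibble ordering (low nibble of byte i is weight i, high nibble is weight i + 16), the offset of 8, and the f32 scale are all assumptions made for illustration, and a real format would more likely store the scale in 16 bits. A SIMD v128 version would do the same masks and shifts on 16 bytes at a time.

```rust
/// Weights per block (assumed).
const BLOCK_SIZE: usize = 32;

/// Hypothetical 4-bit block: 32 weights packed two per byte, plus one scale.
struct BlockQ4 {
    scale: f32,
    packed: [u8; BLOCK_SIZE / 2],
}

/// Split one byte into two signed 4-bit values in the range -8..=7.
/// Assumption: nibbles are stored with an offset of 8.
fn unpack_byte(byte: u8) -> (i32, i32) {
    let lo = (byte & 0x0F) as i32 - 8; // mask keeps the low nibble
    let hi = (byte >> 4) as i32 - 8; // shift exposes the high nibble
    (lo, hi)
}

/// Expand one block into 32 f32 weights.
fn dequantize_block(block: &BlockQ4) -> [f32; BLOCK_SIZE] {
    let mut out = [0.0f32; BLOCK_SIZE];
    for (i, &byte) in block.packed.iter().enumerate() {
        let (lo, hi) = unpack_byte(byte);
        out[i] = lo as f32 * block.scale;
        out[i + BLOCK_SIZE / 2] = hi as f32 * block.scale;
    }
    out
}

/// Dot product against f32 activations, dequantizing on demand:
/// nibbles are unpacked inside the loop and the scale is applied once per block,
/// so no full f32 copy of the weights is ever materialized.
fn dot_block(block: &BlockQ4, x: &[f32; BLOCK_SIZE]) -> f32 {
    let mut acc = 0.0f32;
    for (i, &byte) in block.packed.iter().enumerate() {
        let (lo, hi) = unpack_byte(byte);
        acc += lo as f32 * x[i] + hi as f32 * x[i + BLOCK_SIZE / 2];
    }
    acc * block.scale
}

fn main() {
    // 0x9F: low nibble 15 maps to 7, high nibble 9 maps to 1.
    let block = BlockQ4 { scale: 0.5, packed: [0x9F; BLOCK_SIZE / 2] };
    let weights = dequantize_block(&block);
    println!("w[0] = {}, w[16] = {}", weights[0], weights[16]);

    let x = [1.0f32; BLOCK_SIZE];
    println!("dot = {}", dot_block(&block, &x));
}
```

Each packed byte costs one mask, one shift, and two subtractions before any multiply-accumulate can happen, which is the extra instruction work that q4_0 trades for its smaller footprint.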

03

Practical model candidates

The source analysis identifies candidates such as llama2.c-stories15M, Aurelius 10M, Pythia-14M, Pythia-31M, tiny-gpt2, and larger q4-only stress candidates. The site does not claim all are production-ready; it explains why each is useful for validation or browser experiments.

  • Llama path validation
  • GPT-NeoX path validation
  • Small vocabulary experiments
  • q8/q4 stress testing

Plain PHP deployment notes

The site needs no WordPress bootstrap, theme system, database, Composer package, npm build, CDN script, Bootstrap class dependency, or jQuery dependency.

Routes are handled by index.php and server rewrites. Core content lives in data/pages.php. HTML and Markdown share the same page data.

Upload the package to the subdomain root, point Apache or Nginx at that folder, and on Apache keep .htaccess enabled so that clean routes work.