Memory Math

Quantization and Memory Budget

TinyRustLM treats quantization as a product architecture decision: q8_0 and q4_0 determine which tiny models fit into browser memory, and the compute cost of each format is stated openly rather than hidden.


f32, q8_0, and q4_0

The runtime supports only three data paths by design. f32 is useful for tests and baselines, q8_0 is the balanced path for quality and compactness, and q4_0 exists for maximum model capacity under tight browser memory limits.

f32: 4.0 bytes per parameter
q8_0: roughly 1.0625 bytes per parameter
q4_0: roughly 0.5625 bytes per parameter
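
The per-parameter figures above can be turned into a quick budget estimate. The sketch below is illustrative only: it assumes a ggml-style layout in which each block holds 32 weights plus one 2-byte scale (which is where 34/32 = 1.0625 and 18/32 = 0.5625 come from), and the names Format, bytes_per_param, and weight_bytes are hypothetical, not TinyRustLM API. It counts weight storage only; activations, KV cache, and any tensors kept in f32 are not included.

```rust
/// Storage formats named in the text.
#[allow(non_camel_case_types)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum Format {
    F32,
    Q8_0,
    Q4_0,
}

impl Format {
    /// Bytes per parameter, assuming 32-weight blocks with a 2-byte scale.
    fn bytes_per_param(self) -> f64 {
        match self {
            Format::F32 => 4.0,
            Format::Q8_0 => 34.0 / 32.0, // 32 one-byte weights + 2-byte scale
            Format::Q4_0 => 18.0 / 32.0, // 16 packed bytes + 2-byte scale
        }
    }
}

/// Estimated weight storage in bytes for `params` parameters.
fn weight_bytes(params: u64, format: Format) -> u64 {
    (params as f64 * format.bytes_per_param()).ceil() as u64
}

fn main() {
    // Hypothetical 15M-parameter model, chosen only for illustration.
    let params = 15_000_000u64;
    for format in [Format::F32, Format::Q8_0, Format::Q4_0] {
        let mib = weight_bytes(params, format) as f64 / (1024.0 * 1024.0);
        println!("{:?}: {:.1} MiB", format, mib);
    }
}
```

Under these assumptions, q4_0 holds the same weights in roughly one seventh of the f32 footprint and q8_0 in a little over one quarter.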

4-bit cost in WASM

q4_0 saves bytes but costs instructions. Standard WASM does not provide native 4-bit tensor arithmetic, so the runtime must unpack nibbles with masks and shifts before accumulation. That makes q4_0 a memory win but not a free speed win.

Nibble unpacking
Scale per block
SIMD v128 acceleration where deterministic
Dequantize on demand
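
The unpacking cost described above can be sketched in scalar Rust. This is a minimal illustration, not the TinyRustLM kernel: it assumes a ggml-style q4_0 block (32 weights, two per byte, low nibbles holding the first 16 weights and high nibbles the last 16, each offset by 8), and it keeps the scale as an f32 for simplicity where the assumed storage format uses 2 bytes. Q4Block, dequantize_block, and dot_block are hypothetical names.

```rust
const BLOCK_SIZE: usize = 32;

/// One hypothetical q4_0 block: a shared scale plus 32 weights packed two per byte.
struct Q4Block {
    scale: f32, // widened to f32 here; assumed to be 2 bytes in storage
    packed: [u8; BLOCK_SIZE / 2],
}

/// Unpack one block into 32 f32 weights (nibble unpacking, scale per block).
fn dequantize_block(block: &Q4Block, out: &mut [f32; BLOCK_SIZE]) {
    for (i, &byte) in block.packed.iter().enumerate() {
        // Mask for the low nibble, shift for the high nibble, then recenter to [-8, 7].
        let lo = (byte & 0x0F) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        out[i] = lo as f32 * block.scale;
        out[i + BLOCK_SIZE / 2] = hi as f32 * block.scale;
    }
}

/// Dot product of one block with 32 activations, dequantizing on demand
/// so that no full f32 copy of the weights is materialized.
fn dot_block(block: &Q4Block, x: &[f32; BLOCK_SIZE]) -> f32 {
    let mut acc = 0.0f32;
    for (i, &byte) in block.packed.iter().enumerate() {
        let lo = (byte & 0x0F) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        acc += lo as f32 * x[i] + hi as f32 * x[i + BLOCK_SIZE / 2];
    }
    // The scale is shared by the whole block, so apply it once at the end.
    acc * block.scale
}

fn main() {
    let block = Q4Block { scale: 0.5, packed: [0x9A; BLOCK_SIZE / 2] };
    let mut weights = [0.0f32; BLOCK_SIZE];
    dequantize_block(&block, &mut weights);
    println!("first = {}, seventeenth = {}", weights[0], weights[16]);
    println!("dot with ones = {}", dot_block(&block, &[1.0; BLOCK_SIZE]));
}
```

Each packed byte needs a mask, a shift, and two subtractions before any multiply-accumulate can happen, which is the extra instruction cost the text refers to. A v128 SIMD path would perform the same steps on many bytes at once.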

Practical model candidates

The source analysis identifies candidates such as llama2.c-stories15M, Aurelius 10M, Pythia-14M, Pythia-31M, tiny-gpt2, and larger q4-only stress candidates. The site does not claim all are production-ready; it explains why each is useful for validation or browser experiments.

Llama path validation
GPT-NeoX path validation
Small vocabulary experiments
q8/q4 stress testing

Plain PHP deployment notes

The site requires no WordPress bootstrap, theme system, database, Composer package, npm build, CDN script, Bootstrap classes, or jQuery.

Routes are handled by index.php and server rewrites. Core content lives in data/pages.php. HTML and Markdown share the same page data.

Upload the package to the subdomain root and point the Apache or Nginx document root at that folder. On Apache, keep .htaccess enabled so that clean routes work.