# **Engineering Tiny Language Models for Deterministic WebAssembly and Rust Runtimes**

## **The Paradigm Shift Toward Micro Language Models in Constrained Environments**

The deployment landscape of artificial intelligence is undergoing a foundational realignment. Historically, running large language models (LLMs) required cloud infrastructure with abundant memory bandwidth and arrays of high-throughput graphics processing units. However, privacy concerns, the demand for low-latency local inference, and the proliferation of edge devices have driven a migration toward executing inference workloads directly on client hardware [1]. The most constrained target in this transition is the browser sandbox running WebAssembly (WASM). Operating within WASM introduces severe architectural constraints: strict linear memory limits, the frequent absence of native multithreading or hardware-accelerated tensor cores, and the need for zero-dependency execution to guarantee cross-platform security and portability [3].  
To satisfy these constraints, the machine learning community has developed a tier of architectures variously called Micro Language Models (μLMs), Super Tiny Language Models (STLMs), or Small Language Models (SLMs) [1]. These networks operate within parameter budgets that are a small fraction of those of conventional foundation models, deliberately targeting ceilings of 100 million, 50 million, 25 million, and even 10 million parameters [7]. At this scale, the engineering calculus changes profoundly. Brute-force scaling, which relies on massive parameter counts to memorize factual data, no longer works. Instead, micro-models must lean on careful quantization, specialized architectural topologies, contrastive learning techniques, and highly curated, domain-specific training distributions to retain reasoning and grammatical capabilities [1].  
When engineering these systems for a pure Rust execution environment compiled to WebAssembly, the technical requirements extend well beyond basic inference. The runtime must function entirely without external C++ bindings (such as the llama.cpp or PyTorch backend ecosystems) to preserve a fully auditable, zero-dependency footprint [3]. Furthermore, the deployment lifecycle of these embedded engines requires bit-exact deterministic smoke testing to validate execution across the disparate host CPU architectures underlying the WASM virtual machine [12]. This analysis surveys the current micro-model candidates, evaluating their architectural designs, tokenizer efficiency, quantization support, dynamic memory footprints, and licensing, along with the systemic challenges of achieving strict determinism in a sandboxed Rust runtime.

## **Architectural Topologies and the Depth-Versus-Width Tradeoff**

The compression of transformer architectures below the 100 million parameter threshold requires an asymmetric allocation of the available parameter budget. Empirical research into scaling dynamics at the microscopic level reveals that scaling a model's depth (the number of sequential transformer blocks) is significantly more critical for maintaining contextual tracking, syntactic coherence, and logical reasoning than scaling its width (the dimension of the hidden states)9. Width primarily governs the network's capacity to store factual, domain-specific knowledge9. Consequently, the most highly capable micro-models feature deep, narrow transformer stacks that prioritize processing complexity over factual memorization.  
Furthermore, these architectures frequently benefit from linear mode connectivity (LMC) optimization. Recent frameworks for bidirectional endpoint alignment demonstrate that properly parameterized functionality-preserving weight transformations—such as residual-space rotation and attention-head permutation—allow independently trained models to be merged or optimized along a shared linear interpolation path with near-zero loss barriers, even at the 160M parameter scale and below14. These techniques provide a mechanism for stabilizing training and merging specialized micro-model capabilities without requiring ensembles or expensive retraining14.  
Another pivotal mechanism being explored in small-scale models is test-time compute scaling, exemplified by models such as PonderingPythia. By introducing recurrent pondering steps—where the model generates a weighted sum of token embeddings and feeds them back into the language model to iteratively refine predictions—a 70M parameter model can achieve perplexity improvements comparable to much larger networks under the same computational budget16.

### **The Embedding Vocabulary Paradox**

A dominant architectural constraint in the sub-25M and sub-10M tiers is the "embedding vocabulary paradox." In a multi-billion parameter model, the token embedding table represents a negligible fraction of the total parameter count. However, in a micro-model, it can consume the vast majority of the network's capacity.  
For instance, a model using a standard 50,257-token vocabulary with a modest hidden dimension of 256 requires approximately 12.9 million parameters (50,257 × 256) for the input embedding matrix alone, and the same again for the output projection if the two are not tied [17]. If the total parameter budget is constrained to 15 million, over 85% of the network's capacity is locked in static vocabulary mapping before a single self-attention calculation occurs. This paradox forces engineers to drastically truncate the vocabulary, tie the input and output embeddings, or rely on byte-level tokenization [6].
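
The arithmetic behind this paradox is easy to check. The following is a minimal sketch in Rust; the helper names `embedding_params` and `embedding_share` are hypothetical and exist only for illustration, and the 15-million budget is the figure used in the text.

```rust
/// Parameters in one embedding matrix of shape (vocab_size x hidden_dim).
pub fn embedding_params(vocab_size: u64, hidden_dim: u64) -> u64 {
    vocab_size * hidden_dim
}

/// Fraction of a total parameter budget consumed by the vocabulary matrices.
/// `tied = true` means the input embedding and output projection share one matrix.
pub fn embedding_share(vocab_size: u64, hidden_dim: u64, total_budget: u64, tied: bool) -> f64 {
    let copies: u64 = if tied { 1 } else { 2 };
    (copies * embedding_params(vocab_size, hidden_dim)) as f64 / total_budget as f64
}
```

Calling `embedding_share(50_257, 256, 15_000_000, true)` gives a share just under 0.86, matching the "over 85%" figure; with `tied = false` the two matrices alone would exceed the whole budget, which is why weight tying is close to mandatory at this scale.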

### **The Sub-100M Tier: Hybrid Architectures and Specialized Agents**

Models operating just below the 100 million parameter ceiling represent the upper echelon of micro architectures. They are highly capable of serving as specialized router agents, on-device code completion assistants, or tool-calling orchestrators that generate the first few tokens of a response locally to mask the latency of a larger cloud model1.  
The Falcon-H1-Tiny series (90M parameters for English-centric tasks, 100M for multilingual contexts) currently defines the state-of-the-art in this tier18. Breaking from pure decoder-only norms, the Falcon-H1-Tiny models utilize a hybrid topology that interleaves standard Transformer attention mechanisms with Mamba (State Space Model) blocks19. This hybrid structure maintains strong instruction-following capabilities and long-context coherence while circumventing the quadratic memory scaling inherently associated with pure attention during continuous execution19. The models are trained using a novel optimization paradigm that combines Learnable Multipliers (LRM) with the Muon optimizer, consistently outperforming standard AdamW optimizers10. Furthermore, rather than relying on a traditional pre-training phase followed by fine-tuning, the Falcon-H1-Tiny series utilizes an "anti-curriculum" strategy. Target-domain data—such as supervised fine-tuning (SFT) traces, tool-calling patterns, and reasoning steps—is injected from the very first training token10. Through memorization-aware repetition algorithms that calculate the model's forgetting window, the data is iterated safely across up to 100 epochs without inducing catastrophic overfitting10.  
Conversely, the Pythia-70M model serves as the standardized, classical baseline for this tier23. Engineered upon the GPT-NeoX architecture, it utilizes 6 hidden layers, a hidden dimension of 512, and 8 attention heads24. Because all Pythia models across all scales are trained on the exact same dataset (The Pile) in the exact same order, they offer unparalleled transparency for interpretability research24. Pythia-70M utilizes conventional rotary position embeddings (RoPE) and standard multi-head attention, rendering it highly compatible with bare-metal Rust parsing implementations that lack support for exotic hybrid kernels24.

### **The Sub-50M Tier: Balanced Edge Deployment**

The 30M to 50M parameter range offers a critical balance, maintaining enough capacity to support standard vocabularies while ensuring rapid continuous execution speeds within a WASM interpreter.  
The llama2.c-stories42M model, frequently distributed as quantized GGUF artifacts, compresses the LLaMA architecture into an exceptionally small footprint [26]. At approximately 58.1 million total parameters (inclusive of its 32,000-token embedding matrix), it uses 8 hidden layers, 8 attention heads, and a hidden dimension of 512 [28]. It inherits SwiGLU activation functions and Root Mean Square Normalization (RMSNorm) from LLaMA. However, the SiLU gate inside SwiGLU depends on the exponential function, which IEEE 754 does not require to be correctly rounded, so results can differ between math libraries. A pure Rust implementation should therefore ship one fixed software implementation of it to avoid divergent rounding across platforms.  
Pythia-31M is a stable alternative within this bracket. It uses 6 layers and 8 attention heads, with the hidden dimension constrained to 256 [24]. Because the hidden dimension is exactly half that of Pythia-70M, each per-layer weight matrix holds roughly a quarter as many values, so the memory traffic for vector-matrix multiplications during token generation drops sharply [24]. This reduction matters in WASM environments, where memory bandwidth often throttles inference more than raw CPU compute.

### **The Sub-25M Tier: Navigating Strict Browser Constraints**

Dropping beneath 25 million parameters requires synthetic data curation; models at this scale lose the capacity to generalize across broad distributions and must be focused on restricted language subsets [8].  
The llama2.c-stories15M model (reported at about 24.4 million parameters in some conversions, depending on the tokenizer configuration and on whether the output matrix is stored separately from the embedding) is a prime example of synthetic distillation [32]. It was trained predominantly on the TinyStories dataset, which is synthetically generated by larger models (GPT-3.5 and GPT-4) to contain only the vocabulary typically understood by a three- to four-year-old child [8]. Architecturally, it features 6 hidden layers, a hidden size of 288, and 6 attention heads [34]. Because it follows the LLaMA configuration format, the architecture can express Grouped Query Attention (GQA), which shrinks the Key-Value (KV) cache when fewer KV heads than query heads are configured [34]. With the 6 KV heads assumed in the memory estimates later in this report, however, the cache is the same size as under standard multi-head attention.  
Pythia-14M represents the lower operational bound of the official EleutherAI suite [24]. It features 6 layers, 4 attention heads, and a highly restricted hidden size of 128 [24]. Excluding the embedding table, the core transformer logic contains roughly 1.19 million parameters [24]. While Pythia-14M is a cornerstone for learning dynamics and algorithmic interpretability research, its capacity to generate fluent, contextually coherent prose is noticeably inferior to models of similar size that were trained exclusively on synthetic, simplified datasets like TinyStories [24].

### **The Sub-10M Tier: Extreme Theoretical Viability**

Models beneath 10 million parameters push the absolute theoretical limits of decoder-only language modeling. At this microscopic scale, models face severe representation collapse unless both the vocabulary and the training distribution are ruthlessly constrained8.  
The original TinyStories-1M model (operating at approximately 3.7 million total parameters) and iterations like the TinyLlama-v0 5M (roughly 4.6 million parameters) reside here17. To bypass the embedding vocabulary paradox, the TinyStories-1M model relies on a heavily truncated tokenizer. Researchers retained only the top 10,000 most common tokens from the standard GPT-Neo tokenizer, reducing the embedding matrix size to a fraction of standard implementations17. The architecture relies on an extreme bottleneck: a hidden dimension of 64 and a single transformer block31. While they demonstrate that grammatical and syntactic abilities can emerge independently of broad factual knowledge, their utility in a production environment is generally restricted to highly deterministic, narrow routing tasks or specialized syntactic parsing9.

### **Architectural Comparison Matrix**

| Model Identifier | Base Architecture | Total Params | Hidden Layers | Hidden Dim | Attention Heads | Vocabulary Size | Context Window |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| Falcon-H1-Tiny-90M | Hybrid (Attn \+ Mamba) | 91.1M | N/A | 512 | 8 | \~65,024 | 262,144 |
| Pythia-70M | GPT-NeoX | 70.0M | 6 | 512 | 8 | 50,304 | 2,048 |
| llama2.c-stories42M | LLaMA | 58.1M | 8 | 512 | 8 | 32,000 | 1,024 |
| Pythia-31M | GPT-NeoX | 31.0M | 6 | 256 | 8 | 50,304 | 2,048 |
| llama2.c-stories15M | LLaMA | 24.4M | 6 | 288 | 6 | 32,000 | 1,024 |
| Pythia-14M | GPT-NeoX | 14.0M | 6 | 128 | 4 | 50,304 | 2,048 |
| TinyLlama-v0-5M | LLaMA | 4.6M | 8 | 64 | 16 | 32,000 | 2,048 |
| TinyStories-1M | GPT-Neo | \~3.7M | 1 | 64 | 6 | 10,000 (pruned) | 256 |

## **Tokenization Mechanics in Zero-Dependency Contexts**

Tokenization presents a substantial engineering bottleneck when deploying micro-models into zero-dependency environments. Standard implementations rely on libraries like OpenAI's tiktoken or Google's SentencePiece, which bring native-code dependencies of their own (a Rust core with a regular expression engine in tiktoken, C++ in SentencePiece) [42]. Compiling these dependencies into WASM significantly inflates the binary size and execution surface area.  
For the models evaluated, two primary tokenization algorithms are utilized: Byte-Pair Encoding (BPE), native to the Pythia/GPT-NeoX architectures, and SentencePiece/BPE hybrids, standard in LLaMA-derived models25. Implementing a pure Rust, zero-dependency BPE tokenizer necessitates manually parsing the tokenizer.ggml.merges and tokenizer.ggml.tokens arrays directly from the embedded GGUF metadata44.  
To achieve acceptable latency during WASM execution, the vocabulary must be mapped into a high-performance Trie data structure or pre-computed hash map42. Zero-dependency Rust crates, such as texten, or standalone structural implementations achieve this by eschewing external regex crates in favor of standard library string iteration and byte-level fallback parsing43.  
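
The merge loop at the heart of such a tokenizer can be sketched with nothing but the standard library. The example below is a simplified illustration rather than the implementation of any crate named above: it assumes the merge list has already been read (for instance from the tokenizer.ggml.merges array), it splits on Unicode characters instead of the byte-level alphabet that GPT-style BPE uses, and the function names `build_ranks` and `bpe_encode` are hypothetical.

```rust
use std::collections::HashMap;

/// Builds a rank table from an ordered merge list: earlier merges get lower ranks.
pub fn build_ranks(merges: &[(&str, &str)]) -> HashMap<(String, String), usize> {
    merges
        .iter()
        .enumerate()
        .map(|(rank, (l, r))| ((l.to_string(), r.to_string()), rank))
        .collect()
}

/// Applies BPE merges to one word: repeatedly merge the adjacent pair
/// with the lowest rank until no known pair remains.
pub fn bpe_encode(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    let mut parts: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the lowest-ranked adjacent pair; ties resolve to the leftmost pair.
        let mut best: Option<(usize, usize)> = None; // (rank, index)
        for i in 0..parts.len().saturating_sub(1) {
            let key = (parts[i].clone(), parts[i + 1].clone());
            if let Some(&rank) = ranks.get(&key) {
                if best.map_or(true, |(best_rank, _)| rank < best_rank) {
                    best = Some((rank, i));
                }
            }
        }
        let i = match best {
            Some((_, i)) => i,
            None => break, // no mergeable pair left
        };
        let merged = format!("{}{}", parts[i], parts[i + 1]);
        parts[i] = merged;
        parts.remove(i + 1);
    }
    parts
}
```

With the illustrative merge list `[("l", "o"), ("lo", "w"), ("e", "r")]`, the word "lower" is split into the pieces "low" and "er". The per-step cloning and linear scan keep the sketch short; a production tokenizer would replace them with the trie or pre-computed hash map described above.
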
Furthermore, memory efficiency must be monitored. For models with 32,000 or 50,304 tokens, the raw string representations of the vocabulary consume between 1.5 and 2.5 megabytes of memory [17]. While seemingly trivial, this allocation stays resident in the WASM heap for the lifetime of the module, reducing the space available for dynamic inference structures.  
Emergent research into "structure-aware" tokenization, such as json-tokenizer, offers a compelling alternative for specific deployment tasks like parsing APIs or system telemetry on the edge. By assigning dedicated, deterministic tokens to JSON grammar elements (e.g., brackets and quotes) and applying BPE only to variable values, these tokenizers yield 5-15% token compression savings and reduce vocabulary sizes by up to 90%, thereby mitigating the embedding vocabulary paradox entirely for structured data applications42.

## **Quantization Dynamics and the GGUF Format**

For efficient distribution and deployment in a WASM environment, GGUF (the binary model format used by the GGML and llama.cpp ecosystem) is the de facto standard. GGUF packs both the quantized tensor weights and all necessary key-value metadata (tokenizer vocabularies, RoPE scaling factors, and architectural hyperparameters) into a single contiguous binary file [3]. This layout suits browser deployment: the entire model can be fetched into an ArrayBuffer and accessed directly via byte-slice offsets, with no secondary configuration files or file system access [44].

### **The Degradation Floor of Extreme Quantization**

In large-scale models (e.g., 7B parameters and above), engineers routinely deploy aggressive quantization techniques—reducing 16-bit floating-point weights to 4-bit (Q4\_K\_M) or even 2-bit (Q2\_K) precision. In massive networks, the sheer volume of parameters provides a buffering effect; the degradation in perplexity is sub-linear, and the memory savings justify the minor precision loss.  
However, micro-models exhibit a rigid "degradation floor." When a model possesses only 14 million or 25 million parameters, the network relies on the high-fidelity precision of every individual weight to encode critical semantic and syntactic boundaries. Applying a Q2\_K or Q3\_K\_M quantization profile to a sub-25M model typically triggers catastrophic failure48. The model loses the capacity to form coherent sentences, instead yielding outputs composed of repeating tokens, hallucinations, or pure computational noise48.  
As an engineering rule of thumb, models under 25 million parameters should rarely be quantized below 8-bit integers (Q8\_0) [26]. In many instances, the models should simply be executed in half-precision floating-point (F16) [49]. For example, the Pythia-14M model in Q2\_K format occupies 12.9 MB of storage, the far more reliable Q8\_0 format occupies 16.8 MB, and the F16 format, which involves no integer quantization at all, occupies 29.9 MB [49]. In a WASM environment, where 32-bit linear memory tops out at 4 GB and the footprints discussed here are tens of megabytes, saving roughly 17 MB by choosing Q2\_K over F16 is hard to justify when weighed against the loss of model coherence [49].  
Recent advances in quantization, such as Importance Matrix (imatrix) calibrations used in IQ4\_XS formats, attempt to preserve quality by weighting the quantization error against the statistical importance of specific tensors [26]. While this slightly improves 4-bit viability, it still falls short of the stability provided by 8-bit or 16-bit execution at the micro-model scale [28]. Further research into ternary (roughly 1.58-bit) or pure 1-bit models, backed by custom inference engines like OxiBonsai, holds promise for the future, but remains highly experimental and requires specifically trained checkpoints rather than post-training quantization [11].
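
To make the Q8\_0 recommendation concrete, the sketch below shows the general idea of block-wise 8-bit quantization: weights are grouped into blocks of 32, and each block stores one scale plus 32 signed bytes. This is an illustration under stated assumptions, not a byte-compatible GGUF codec: the scale is kept in f32 here for simplicity (the on-disk format uses a half-precision scale), and the type and function names are hypothetical.

```rust
/// Number of weights sharing one scale factor.
pub const BLOCK: usize = 32;

/// One quantized block: a shared scale and 32 signed 8-bit values.
pub struct BlockQ8 {
    pub scale: f32,
    pub qs: [i8; BLOCK],
}

/// Quantizes a block so that the largest magnitude maps to +/-127.
pub fn quantize_block(x: &[f32; BLOCK]) -> BlockQ8 {
    let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = amax / 127.0;
    // Guard against an all-zero block, where the scale would be zero.
    let inv = if scale == 0.0 { 0.0 } else { 1.0 / scale };
    let mut qs = [0i8; BLOCK];
    for (q, v) in qs.iter_mut().zip(x.iter()) {
        *q = (*v * inv).round() as i8;
    }
    BlockQ8 { scale, qs }
}

/// Reconstructs approximate f32 weights from a quantized block.
pub fn dequantize_block(b: &BlockQ8) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (o, q) in out.iter_mut().zip(b.qs.iter()) {
        *o = f32::from(*q) * b.scale;
    }
    out
}
```

The worst-case reconstruction error per weight is about half the block scale. With 8 bits that step is small relative to the weights; with 2 or 3 bits the same scheme leaves only a handful of representable levels per block, which is one way to picture the degradation floor described above.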

### **Memory Footprint and Expected Runtime RAM Utilization**

The static download size of a GGUF file does not equate to the total runtime memory footprint. During execution, the WASM runtime must also allocate linear memory for the Key-Value (KV) cache, the forward-pass activation buffers, and the scratch space needed for logits, sampling, and argmax calculations [4].  
The KV cache stores the computed key and value vectors for all previous tokens in the context window, avoiding redundant recalculation during autoregressive generation. For a standard LLaMA-based model like llama2.c-stories15M running at a batch size of 1 with a full context window of 1024 tokens, the KV cache size can be calculated exactly. Assuming 6 layers, 6 KV heads, a head dimension of 48 (288 hidden size / 6 heads), and values stored in F16 (2 bytes), the calculation is: 2 (Key and Value) × 1024 (tokens) × 6 (layers) × 6 (KV heads) × 48 (head dim) × 2 bytes = 7,077,888 bytes, or about 7.1 MB (6.75 MiB) [34].  
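
The same calculation can be written as a small helper so that it can be repeated for other configurations. This is a sketch; the function name `kv_cache_bytes` and its parameter names are hypothetical, and it assumes a batch size of 1 with every layer caching both keys and values.

```rust
/// Bytes needed for a KV cache at batch size 1.
/// The leading factor of 2 accounts for storing both keys and values.
pub fn kv_cache_bytes(
    context_len: usize,
    layers: usize,
    kv_heads: usize,
    head_dim: usize,
    bytes_per_element: usize,
) -> usize {
    2 * context_len * layers * kv_heads * head_dim * bytes_per_element
}
```

For the configuration above, `kv_cache_bytes(1024, 6, 6, 48, 2)` returns 7,077,888. Halving `bytes_per_element` (an 8-bit cache) or `kv_heads` (grouped-query attention with 3 KV heads) each halves the result.
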
Advanced inference engines can utilize KV cache compression techniques like TurboQuant to store the cache in FP8, reducing the memory requirement by roughly 50% relative to F16 [4]. Regardless, the dynamic RAM requirements for these micro-models remain low, ensuring they execute comfortably within the contiguous bounds of WASM linear memory [34].

### **Download Size and Footprint Projections**

| Model (Format) | Precision | Storage Size (MB) | Est. KV Cache (1024 ctx) | Total RAM Envelope |
| :---- | :---- | :---- | :---- | :---- |
| Falcon-H1-Tiny-90M | Q8\_0 | \~118.0 MB | \~12.5 MB | \~150 MB |
| Pythia-70M | Q8\_0 | \~75.0 MB | \~12.6 MB | \~100 MB |
| llama2.c-stories42M | Q4\_K\_M | \~39.7 MB | \~16.8 MB | \~60 MB |
| Pythia-31M | F16 | \~86.2 MB | \~6.3 MB | \~110 MB |
| llama2.c-stories15M | Q8\_0 | \~26.7 MB | \~7.1 MB | \~40 MB |
| llama2.c-stories15M | F16 | \~49.6 MB | \~7.1 MB | \~65 MB |
| Pythia-14M | Q8\_0 | \~16.8 MB | \~3.1 MB | \~30 MB |
| Pythia-14M | F16 | \~29.9 MB | \~3.1 MB | \~45 MB |

## **Designing a Zero-Dependency Rust Inference Runtime**

Deploying LLMs in a constrained browser environment calls for an uncompromising architectural approach: extensive C++ frameworks like llama.cpp compiled to WASM via Emscripten are a poor fit if the goal is a minimal, fully auditable, and easily embedded footprint [11]. Crossing a Foreign Function Interface (FFI) boundary adds complexity, increases binary size (often to 50MB+ just for the engine), and complicates security audits. Consequently, a pure, zero-dependency Rust implementation is the preferred path [4].

### **GGUF Parsing in Linear Memory**

To load micro-models efficiently, a dedicated zero-dependency parser is required to interpret the GGUF file format. Purpose-built crates, such as gguf-rs, provide the exact programmatic functionality needed to deserialize the binary structure, extract the key-value metadata dictionaries, and establish safe slice references to the underlying tensor buffers without allocating redundant memory3.  
Because the runtime operates within the WASM sandbox, traditional operating system-level memory mapping (e.g., POSIX mmap) is completely unavailable44. The entire GGUF file must be streamed as an array of bytes via browser APIs—such as fetching a blob into an ArrayBuffer—and passed into the Rust WASM module as a contiguous, immutable slice &\[u8\]. The parser then iterates through the headers, validates the GGUF magic bytes, and calculates the exact byte offsets for each tensor3.  
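
The first step of that parse, validating the magic bytes and reading the fixed-size header, can be sketched using only slice operations. This example is not the API of gguf-rs or any other crate; the type and function names are hypothetical. It assumes the GGUF version 2 or 3 header layout: the 4-byte magic "GGUF", a little-endian u32 version, then little-endian u64 tensor and metadata counts.

```rust
/// Fixed-size fields at the start of a GGUF file (v2/v3 layout assumed).
#[derive(Debug, PartialEq)]
pub struct GgufHeader {
    pub version: u32,
    pub tensor_count: u64,
    pub metadata_kv_count: u64,
}

#[derive(Debug, PartialEq)]
pub enum HeaderError {
    TooShort,
    BadMagic,
}

/// Reads a little-endian u64 at `offset`; the caller guarantees bounds.
fn read_u64_le(bytes: &[u8], offset: usize) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[offset..offset + 8]);
    u64::from_le_bytes(buf)
}

/// Parses the header from the start of an in-memory GGUF image
/// without copying or allocating.
pub fn parse_gguf_header(bytes: &[u8]) -> Result<GgufHeader, HeaderError> {
    // 4 (magic) + 4 (version) + 8 (tensor count) + 8 (metadata count)
    if bytes.len() < 24 {
        return Err(HeaderError::TooShort);
    }
    if &bytes[0..4] != b"GGUF" {
        return Err(HeaderError::BadMagic);
    }
    let version = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
    Ok(GgufHeader {
        version,
        tensor_count: read_u64_le(bytes, 8),
        metadata_kv_count: read_u64_le(bytes, 16),
    })
}
```

A full parser would continue from byte 24 into the metadata key-value pairs and tensor descriptors, checking every length field against the remaining slice before indexing, since the file arrives from the network and cannot be trusted.
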
By restricting themselves to Rust's core and alloc libraries, developers can build inference engines (such as xinfer, OxiBonsai, or educational workspaces like zero-depend-pub) that avoid heavy system dependencies, rely on plain iterators over slices, and keep computational overhead low during the initialization phase [4].

### **Implementing Tensor Operations and SIMD Dispatch**

Once the tensors are mapped into linear memory, the inference engine must perform the forward pass mathematically. In a zero-dependency Rust setup, developers must write custom General Matrix Multiply (GEMM) and General Matrix-Vector Multiply (GEMV) kernels11. To achieve acceptable token generation speeds (e.g., \>20 tokens per second), these kernels must utilize Single Instruction, Multiple Data (SIMD) capabilities11.  
Frameworks like OxiBonsai implement multi-tiered SIMD dispatch, detecting the target platform and compiling specific kernels that leverage the WASM v128 specification11. This allows the engine to multiply arrays of floats or integers in parallel, drastically reducing inference latency while maintaining the constraint of using pure, safe Rust primitives11.
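
The structure of such a kernel can be shown without any intrinsics. The sketch below is a scalar stand-in, not the code of any framework named above: it keeps four independent accumulators to mirror the four f32 lanes of a v128 register and combines them in one fixed order, so the result does not depend on how the loop is scheduled. The function names are hypothetical; a real WASM build would replace the inner loop with the 128-bit SIMD intrinsics from `core::arch::wasm32`.

```rust
/// Dot product with four lane accumulators and a fixed reduction order.
pub fn dot_fixed_order(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut lanes = [0.0f32; 4];
    let chunks = a.len() / 4;
    for c in 0..chunks {
        for l in 0..4 {
            let i = c * 4 + l;
            lanes[l] += a[i] * b[i];
        }
    }
    // Elements that do not fill a whole 4-wide chunk are handled separately.
    let mut tail = 0.0f32;
    for i in chunks * 4..a.len() {
        tail += a[i] * b[i];
    }
    // The lane combination order is part of the kernel's contract.
    ((lanes[0] + lanes[1]) + (lanes[2] + lanes[3])) + tail
}

/// Row-major matrix-vector product: y = W * x, with W of shape rows x cols.
pub fn gemv(w: &[f32], rows: usize, cols: usize, x: &[f32], y: &mut [f32]) {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    assert_eq!(y.len(), rows);
    for r in 0..rows {
        y[r] = dot_fixed_order(&w[r * cols..(r + 1) * cols], x);
    }
}
```

Fixing the accumulation layout in the scalar reference makes it possible to test a SIMD kernel for bit-exact agreement rather than approximate closeness, which matters for the determinism requirements discussed in the next section.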

## **Achieving Deterministic Execution for WASM Smoke Tests**

A fundamental requirement for validating the integrity of edge-deployed AI models across enterprise networks is deterministic smoke testing. If a continuous integration (CI) pipeline feeds a specific prompt to the Rust inference engine, the resulting sequence of output tokens—and the underlying numerical structure of the floating-point logits—must be bit-for-bit identical across all test runs, regardless of whether the host hardware is an Intel x86 server or an Apple ARM64 processor12.  
However, WebAssembly and the Rust compiler toolchain present several systemic vulnerabilities regarding floating-point determinism that must be aggressively mitigated to prevent test failures.

### **The Inherent Non-Determinism of Floating-Point Operations**

While the IEEE 754-2008 standard governs floating-point arithmetic across all modern processors, it deliberately leaves certain edge-case behaviors implementation-defined, introducing non-determinism into the computational stack [12]. The most notorious of these is the handling of Not-a-Number (NaN) values [13]. When an invalid operation (like dividing zero by zero) produces a NaN, the specification guarantees that the exponent bits are all set and the mantissa is non-zero, but does not mandate the sign bit or the exact contents of the mantissa bits (the NaN payload) [56].  
Different hardware architectures (e.g., x86\_64 vs. ARM64), and even different optimization passes executed by LLVM, can yield NaNs that differ in their sign and payload bits [55]. If the inference logic inadvertently depends on these bits, or if a hash is taken over the raw bytes of the output tensor for a strict smoke test assertion, the test will fail unpredictably across platforms [56].  
To resolve this within a WASM context, the host engine executing the WebAssembly (such as Wasmtime) must enforce NaN canonicalization. Enabling a configuration option such as wasmtime::Config::cranelift\_nan\_canonicalization makes the engine rewrite every NaN produced by a floating-point operation into a single, predictable bit pattern [13]. This incurs a minor performance penalty, because extra instructions are appended after each susceptible floating-point operation, but it is necessary to guarantee cross-platform reproducibility [13].
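
Where the host engine cannot be configured, a test harness can apply the same idea in software at the point where logits are hashed. The sketch below is a complementary technique, not a replacement for host-level canonicalization: it only normalizes the values being compared, not intermediate results. The canonical pattern `0x7fc0_0000` (a positive quiet NaN with an otherwise empty payload) is a choice made for this example, the function names are hypothetical, and FNV-1a is used only because it is short enough to inline; it is not a cryptographic hash.

```rust
/// Bit pattern chosen here to represent every f32 NaN.
pub const CANONICAL_NAN_F32: u32 = 0x7fc0_0000;

/// Returns the bits of `x`, with any NaN replaced by the canonical pattern.
pub fn canonical_bits(x: f32) -> u32 {
    if x.is_nan() {
        CANONICAL_NAN_F32
    } else {
        x.to_bits()
    }
}

/// 64-bit FNV-1a hash over the canonicalized little-endian bytes of the logits.
pub fn hash_logits(logits: &[f32]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &x in logits {
        for byte in canonical_bits(x).to_le_bytes().iter() {
            h ^= u64::from(*byte);
            h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
        }
    }
    h
}
```

Two logit vectors that differ only in the sign or payload of a NaN now hash to the same value, while any difference in an ordinary value (including the sign of zero) still changes the hash.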

### **Fused Multiply-Add and Relaxed SIMD Variations**

The core operations of transformer inference—specifically the massive vector dot products and matrix multiplications within the attention blocks—rely heavily on SIMD execution54. WebAssembly provides a standardized 128-bit SIMD proposal (v128) which maps deterministically to SSE/AVX on x86 processors and NEON instructions on ARM processors54.  
However, a major computational divergence occurs with the Fused Multiply-Add (FMA) operation (a \* b \+ c). A true hardware FMA computes the multiplication and the subsequent addition retaining infinite internal precision, applying only a single rounding step at the very end55. Conversely, executing the multiplication followed by an addition as separate instructions incurs two sequential rounding steps, resulting in divergent least-significant bits in the final mantissa55.  
WebAssembly's "Relaxed SIMD" proposal was introduced to expose native hardware FMA for enhanced performance, explicitly and purposefully trading absolute determinism for execution speed13. If a WASM module utilizes Relaxed SIMD instructions, an older x86 host lacking native FMA support might emulate the operation with two steps, while a modern ARM64 host executes a true single-step FMA. This discrepancy immediately breaks the deterministic smoke test62.  
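
The size of this effect is easy to reproduce on a single value. The sketch below compares an explicit multiply-then-add against Rust's `f32::mul_add`, which is specified to round only once; the function name `separate_vs_fused` and the chosen inputs are illustrative.

```rust
/// Computes a * a + c in two ways:
/// `separate` rounds after the multiply and again after the add,
/// `fused` rounds once, as a hardware FMA would.
pub fn separate_vs_fused(a: f32, c: f32) -> (f32, f32) {
    let separate = a * a + c;
    let fused = a.mul_add(a, c);
    (separate, fused)
}
```

With `a = 1 + 2^-12` and `c = -(1 + 2^-11)`, the exact product is `1 + 2^-11 + 2^-24`. The separate path rounds the product to `1 + 2^-11` before adding `c` and returns exactly 0, while the fused path keeps the low-order term and returns `2^-24`. A dot product over thousands of weights accumulates many such discrepancies, which is why a kernel must commit to one form or the other.
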
Therefore, to guarantee determinism in a custom Rust runtime:

1. **Disable Relaxed SIMD:** The build must enable only the standard simd128 target feature and leave relaxed-simd disabled, so that the module contains only deterministic v128 instructions [13].  
2. **Enforce Strict Evaluation:** The codebase performing matrix multiplication must avoid any option or intrinsic that permits associative reordering of floating-point operations (e.g., \-ffast-math equivalents). Floating-point addition is not associative; altering the order of a parallel reduction changes the final sum [55].
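
The second point can be demonstrated with three numbers. In the sketch below, the same values are summed in two orders; the function names are hypothetical.

```rust
/// Sums left to right, starting from zero.
pub fn sum_forward(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0, |acc, &x| acc + x)
}

/// Sums right to left, starting from zero.
pub fn sum_reverse(xs: &[f32]) -> f32 {
    xs.iter().rev().fold(0.0, |acc, &x| acc + x)
}
```

For the input `[1.0e8, -1.0e8, 1.0]`, the forward sum is 1: the two large terms cancel first and the 1 survives. The reverse sum is 0: adding 1 to -1.0e8 is lost to rounding, because adjacent f32 values near 1.0e8 are 8 apart. A reduction that is split differently across SIMD lanes or threads reorders additions in exactly this way.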

### **The Fixed-Point Mathematical Fallback**

If the aforementioned compiler flags and Wasmtime configurations prove insufficient or too brittle to maintain across highly disparate enterprise CI environments, the ultimate, unassailable fallback for ensuring bit-identical smoke tests is to abandon floating-point arithmetic entirely in favor of fixed-point mathematics55.  
Pioneering concepts like the Valori memory kernel demonstrate this approach by modeling AI states and matrix multiplications utilizing a Q16.16 fixed-point representation55. In a Q16.16 structure, 16 bits encode the integer component and 16 bits encode the fractional component55. By translating the model's weights into fixed-point, the runtime leverages standard integer Arithmetic Logic Units (ALUs) rather than floating-point units (FPUs)55.  
Integer operations are universally deterministic and bit-identical across x86, ARM, RISC-V, and WASM architectures [55]. Similarly, game physics engines written in Rust, such as Rapier, offer a cross-platform deterministic mode that avoids platform-provided floating-point math functions (e.g., sin, cos, sqrt) and instead routes them through software implementations behind the numeric traits of crates like nalgebra [12]. While executing a Q8\_0 quantized model with Q16.16 integer accumulators will produce outputs that deviate slightly from a standard F16 reference, it establishes a rigorous determinism boundary suitable for strict validation testing [55].
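
A minimal sketch of Q16.16 arithmetic follows. It is an illustration of the number format only and is not taken from the Valori kernel; the function names are hypothetical. It assumes inputs are small enough that the 64-bit accumulator cannot overflow, and that the float-to-fixed conversion happens once, offline, so that no floating-point operation remains on the inference path.

```rust
/// Number of fractional bits in the Q16.16 format.
pub const FRAC_BITS: u32 = 16;

/// Converts an f32 to Q16.16 by scaling by 2^16 and rounding to nearest.
pub fn to_fixed(x: f32) -> i32 {
    (x * 65536.0).round() as i32
}

/// Multiplies two Q16.16 values using a 64-bit intermediate product.
pub fn fixed_mul(a: i32, b: i32) -> i32 {
    ((i64::from(a) * i64::from(b)) >> FRAC_BITS) as i32
}

/// Dot product of two Q16.16 vectors.
/// Products are accumulated at full precision and rescaled once at the end.
pub fn fixed_dot(a: &[i32], b: &[i32]) -> i32 {
    assert_eq!(a.len(), b.len());
    let mut acc: i64 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        acc += i64::from(*x) * i64::from(*y);
    }
    (acc >> FRAC_BITS) as i32
}
```

For example, the dot product of (1.5, -0.25) and (2.0, 4.0) evaluates to exactly 2.0 in Q16.16. Because integer addition is associative, the accumulation order no longer affects the result, which removes the reduction-order constraint that applies to the floating-point kernels.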

## **Licensing, Governance, and Enterprise Viability**

When embedding language models deeply into software application stacks via WASM, the underlying licensing dictates the legal viability of the deployment, particularly concerning commercialization and patent litigation.  
The micro-models evaluated in this research operate under a spectrum of open-source-friendly licenses:

* **Pythia Series (14M, 31M, 70M):** The entire EleutherAI suite is licensed under the highly permissive **Apache 2.0 License**24. This framework provides robust legal protection, explicitly addressing patent grants and allowing unencumbered commercial deployment. It is generally considered the gold standard for enterprise integration.  
* **TinyStories Series (1M to 33M):** These synthetic models are generally released under the **MIT License**64. Representing the most permissive tier of open-source software, the MIT license requires only basic copyright attribution, making it ideal for lightweight browser utilities and open-source experimental projects.  
* **Falcon-H1-Tiny Series:** These models are governed by the custom **Falcon-LLM License**19. While structured to be permissive and explicitly aimed at advancing open science, custom corporate licenses inevitably require specific, localized legal review19. Organizations must carefully audit the text to ensure compliance with enterprise distribution policies, paying particular attention to any embedded commercialization revenue limits, acceptable use policies, or strict attribution clauses65.

## **Strategic Architectural Recommendations**

Developing a zero-dependency, deterministically testable AI inference runtime in pure Rust for WebAssembly requires the meticulous alignment of the model architecture, the quantization strategy, and the strict mathematical constraints of the target execution engine.  
Based on this analysis, **llama2.c-stories15M** and **Pythia-14M** are the most suitable architectural substrates for this engineering effort. Running these models exclusively in **F16 or Q8\_0 GGUF format** avoids the severe degradation floor observed in aggressively quantized (e.g., 4-bit) micro-models. This preserves logical coherence and syntactic stability while keeping the total dynamic memory footprint within a manageable 30 MB to 65 MB. At this scale the model fits comfortably within browser heap limits, and the KV cache can be allocated once up front rather than grown repeatedly during generation.  
To fulfill the rigorous determinism requirement for CI/CD smoke testing, the pure Rust inference engine must parse the GGUF file via a deterministic mapping library (such as gguf-rs) and utilize strict, byte-level tokenization loops that avoid non-deterministic C++ regular expression dependencies. Most critically, the WASM runtime environment must enforce explicit NaN canonicalization and disable the Wasm Relaxed SIMD proposal to prevent FMA-induced floating-point drift. Alternatively, migrating the inference backend's tensor accumulation matrices to integer-based fixed-point arithmetic (Q16.16) will guarantee bit-exact cross-platform reproducibility, ensuring that the intersection of edge computing and artificial intelligence remains both computationally powerful and mathematically provable.

#### **Works cited**

1. Daily Papers \- Hugging Face, [https://huggingface.co/papers?q=domain-specialized%20small%20language%20models](https://huggingface.co/papers?q=domain-specialized+small+language+models)  
2. The Rise of the SLMs: Building the Smallest Possible Language Model. \- Medium, [https://medium.com/@naimishbaranwal/the-rise-of-the-slms-building-the-smallest-possible-language-model-d7cdada8faa3](https://medium.com/@naimishbaranwal/the-rise-of-the-slms-building-the-smallest-possible-language-model-d7cdada8faa3)  
3. gguf-rs \- crates.io: Rust Package Registry, [https://crates.io/crates/gguf-rs](https://crates.io/crates/gguf-rs)  
4. guoqingbao/xinfer: Blazing-fast LLM inference in pure Rust. No PyTorch and Python runtime. \- GitHub, [https://github.com/guoqingbao/xinfer](https://github.com/guoqingbao/xinfer)  
5. Paper page \- Micro Language Models Enable Instant Responses \- Hugging Face, [https://huggingface.co/papers/2604.19642](https://huggingface.co/papers/2604.19642)  
6. Paper page \- Super Tiny Language Models \- Hugging Face, [https://huggingface.co/papers/2405.14159](https://huggingface.co/papers/2405.14159)  
7. LeonGuertler/SuperTinyLanguageModels \- GitHub, [https://github.com/leonguertler/supertinylanguagemodels](https://github.com/leonguertler/supertinylanguagemodels)  
8. TinyStories: How Small Can Language Models Be and Still Speak Coherent English?, [https://huggingface.co/papers/2305.07759](https://huggingface.co/papers/2305.07759)  
9. TinyStories: The Smallest GPT with Coherent English (by Microsoft) : r/LocalLLaMA \- Reddit, [https://www.reddit.com/r/LocalLLaMA/comments/13ooc3o/tinystories\_the\_smallest\_gpt\_with\_coherent/](https://www.reddit.com/r/LocalLLaMA/comments/13ooc3o/tinystories_the_smallest_gpt_with_coherent/)  
10. Falcon-H1-Tiny (90M) is out \- specialized micro-models that actually work : r/LocalLLaMA, [https://www.reddit.com/r/LocalLLaMA/comments/1qsx51z/falconh1tiny\_90m\_is\_out\_specialized\_micromodels/](https://www.reddit.com/r/LocalLLaMA/comments/1qsx51z/falconh1tiny_90m_is_out_specialized_micromodels/)  
11. OxiBonsai: The World's First Pure Rust 1-Bit LLM Inference Engine | by KitaSan | Medium, [https://medium.com/@kitasanio/oxibonsai-the-worlds-first-pure-rust-1-bit-llm-inference-engine-4c15abf53fce](https://medium.com/@kitasanio/oxibonsai-the-worlds-first-pure-rust-1-bit-llm-inference-engine-4c15abf53fce)  
12. Determinism \- Rapier.rs, [https://rapier.rs/docs/user\_guides/rust/determinism/](https://rapier.rs/docs/user_guides/rust/determinism/)  
13. Deterministic Execution \- Wasmtime, [https://docs.wasmtime.dev/examples-deterministic-wasm-execution.html](https://docs.wasmtime.dev/examples-deterministic-wasm-execution.html)  
14. Scaling Linear Mode Connectivity and Merging to Billion Parameter Pretrained Transformers \- arXiv, [https://arxiv.org/html/2606.23607v1](https://arxiv.org/html/2606.23607v1)  
15. Scaling Linear Mode Connectivity and Merging to Billion Parameter Pretrained Transformers \- arXiv, [https://arxiv.org/pdf/2606.23607](https://arxiv.org/pdf/2606.23607)  
16. Pretraining Language Models to Ponder in Continuous Space \- arXiv, [https://arxiv.org/html/2505.20674v1](https://arxiv.org/html/2505.20674v1)  
17. roneneldan/TinyStories-1M · Actual number of parameters? \- Hugging Face, [https://huggingface.co/roneneldan/TinyStories-1M/discussions/5](https://huggingface.co/roneneldan/TinyStories-1M/discussions/5)  
18. Falcon-H1-Tiny: A series of extremely small, yet powerful language models redefining capabilities at small scale \- a Hugging Face Space by tiiuae, [https://huggingface.co/spaces/tiiuae/tiny-h1-blogpost](https://huggingface.co/spaces/tiiuae/tiny-h1-blogpost)  
19. tiiuae/Falcon-H1-Tiny-90M-Instruct-GGUF \- Hugging Face, [https://huggingface.co/tiiuae/Falcon-H1-Tiny-90M-Instruct-GGUF](https://huggingface.co/tiiuae/Falcon-H1-Tiny-90M-Instruct-GGUF)  
20. Luigi/Falcon-H1-Tiny-Multilingual-100M-Instruct-GGUF \- Hugging Face, [https://huggingface.co/Luigi/Falcon-H1-Tiny-Multilingual-100M-Instruct-GGUF](https://huggingface.co/Luigi/Falcon-H1-Tiny-Multilingual-100M-Instruct-GGUF)  
21. Falcon Models, [https://falconllm.tii.ae/falcon-models.html](https://falconllm.tii.ae/falcon-models.html)  
22. Falcon-H1 Family of Hybrid-Head Language Models, including 0.5B, 1.5B, 1.5B-Deep, 3B, 7B, and 34B : r/LocalLLaMA \- Reddit, [https://www.reddit.com/r/LocalLLaMA/comments/1krtvpj/falconh1\_family\_of\_hybridhead\_language\_models/](https://www.reddit.com/r/LocalLLaMA/comments/1krtvpj/falconh1_family_of_hybridhead_language_models/)  
23. Models \- Hugging Face, [https://huggingface.co/models?sort=downloads\&search=eleutherAI](https://huggingface.co/models?sort=downloads&search=eleutherAI)  
24. EleutherAI/pythia-14m \- Hugging Face, [https://huggingface.co/EleutherAI/pythia-14m](https://huggingface.co/EleutherAI/pythia-14m)  
25. Pythia: Interpreting Transformers Across Time and Scale \- GitHub, [https://github.com/eleutherai/pythia](https://github.com/eleutherai/pythia)  
26. mradermacher/llama2.c-stories42M-GGUF \- Hugging Face, [https://huggingface.co/mradermacher/llama2.c-stories42M-GGUF](https://huggingface.co/mradermacher/llama2.c-stories42M-GGUF)  
27. Xenova/llama2.c-stories42M at main \- Hugging Face, [https://huggingface.co/Xenova/llama2.c-stories42M/tree/main](https://huggingface.co/Xenova/llama2.c-stories42M/tree/main)  
28. mradermacher/llama2.c-stories42M-GGUF at main \- Hugging Face, [https://huggingface.co/mradermacher/llama2.c-stories42M-GGUF/blob/main/llama2.c-stories42M.IQ4\_XS.gguf](https://huggingface.co/mradermacher/llama2.c-stories42M-GGUF/blob/main/llama2.c-stories42M.IQ4_XS.gguf)  
29. config.json · DNA-LLM/virus-pythia-4.7M-2048-cross\_entropy at 9f86217dbd7a34ce3d8ad4e6999b9b6510f7656c \- Hugging Face, [https://huggingface.co/DNA-LLM/virus-pythia-4.7M-2048-cross\_entropy/blame/9f86217dbd7a34ce3d8ad4e6999b9b6510f7656c/config.json](https://huggingface.co/DNA-LLM/virus-pythia-4.7M-2048-cross_entropy/blame/9f86217dbd7a34ce3d8ad4e6999b9b6510f7656c/config.json)  
30. EleutherAI/pythia-31m at e556ace21b489575e94e9d50b6dad2fcc7419679 \- Hugging Face, [https://huggingface.co/EleutherAI/pythia-31m/tree/e556ace21b489575e94e9d50b6dad2fcc7419679](https://huggingface.co/EleutherAI/pythia-31m/tree/e556ace21b489575e94e9d50b6dad2fcc7419679)  
31. TinyStories: How Small Can Language Models Be and Still Speak Coherent English? \- arXiv, [https://arxiv.org/pdf/2305.07759](https://arxiv.org/pdf/2305.07759)  
32. mradermacher/llama2.c-stories15M-GGUF \- Hugging Face, [https://huggingface.co/mradermacher/llama2.c-stories15M-GGUF](https://huggingface.co/mradermacher/llama2.c-stories15M-GGUF)  
33. Xenova/llama2.c-stories15M \- Hugging Face, [https://huggingface.co/Xenova/llama2.c-stories15M](https://huggingface.co/Xenova/llama2.c-stories15M)  
34. config.json · Xenova/llama2.c-stories15M at ... \- Hugging Face, [https://huggingface.co/Xenova/llama2.c-stories15M/blob/9397aa7180a7b88fa20a0197da2a0601973373e9/config.json](https://huggingface.co/Xenova/llama2.c-stories15M/blob/9397aa7180a7b88fa20a0197da2a0601973373e9/config.json)  
35. README.md · roneneldan/TinyStories at main \- Hugging Face, [https://huggingface.co/datasets/roneneldan/TinyStories/blob/main/README.md](https://huggingface.co/datasets/roneneldan/TinyStories/blob/main/README.md)  
36. config.json · EleutherAI/pythia-14m-seed5 at main \- Hugging Face, [https://huggingface.co/EleutherAI/pythia-14m-seed5/blob/main/config.json](https://huggingface.co/EleutherAI/pythia-14m-seed5/blob/main/config.json)  
37. config.json · clam004/emotion\_base\_v5 at 4ad8e8dd3a5068cc48daca50aa72b396087ceeae \- Hugging Face, [https://huggingface.co/clam004/emotion\_base\_v5/blob/4ad8e8dd3a5068cc48daca50aa72b396087ceeae/config.json](https://huggingface.co/clam004/emotion_base_v5/blob/4ad8e8dd3a5068cc48daca50aa72b396087ceeae/config.json)  
38. mofosyne/TinyLLama-v0-5M-F16-llamafile at ... \- Hugging Face, [https://huggingface.co/mofosyne/TinyLLama-v0-5M-F16-llamafile/blob/a9a7525f39f968362e870b145f8520810a3cba73/TinyLLama-4.6M-v0.0-F16.gguf](https://huggingface.co/mofosyne/TinyLLama-v0-5M-F16-llamafile/blob/a9a7525f39f968362e870b145f8520810a3cba73/TinyLLama-4.6M-v0.0-F16.gguf)  
39. roneneldan/TinyStories-1M at main \- Hugging Face, [https://huggingface.co/roneneldan/TinyStories-1M/tree/main](https://huggingface.co/roneneldan/TinyStories-1M/tree/main)  
40. config.json · roneneldan/TinyStories-1M at ... \- Hugging Face, [https://huggingface.co/roneneldan/TinyStories-1M/blame/ac533fb8b4f69c71894bf96badfe11e6294d9fcf/config.json](https://huggingface.co/roneneldan/TinyStories-1M/blame/ac533fb8b4f69c71894bf96badfe11e6294d9fcf/config.json)  
41. config.json · mrferr3t/3d7a17ad-5173-47fb-911c-ad48cd93e7d9 at main \- Hugging Face, [https://huggingface.co/mrferr3t/3d7a17ad-5173-47fb-911c-ad48cd93e7d9/blob/main/config.json](https://huggingface.co/mrferr3t/3d7a17ad-5173-47fb-911c-ad48cd93e7d9/blob/main/config.json)  
42. Structure-Aware Tokenization for JSON: Exploiting Schema Repetition for Compact Token Sequences with a 90× Smaller Vocabulary \- ResearchGate, [https://www.researchgate.net/publication/401584599\_Structure-Aware\_Tokenization\_for\_JSON\_Exploiting\_Schema\_Repetition\_for\_Compact\_Token\_Sequences\_with\_a\_90\_Smaller\_Vocabulary](https://www.researchgate.net/publication/401584599_Structure-Aware_Tokenization_for_JSON_Exploiting_Schema_Repetition_for_Compact_Token_Sequences_with_a_90_Smaller_Vocabulary)  
43. GitHub \- reinterpretcat/zero-depend-pub: An educational Rust workspace featuring zero-dependency crates built using only standard library. As showcase, has pytorch-like framework rewritten in pure rust (with poor performance, but it works\!), [https://github.com/reinterpretcat/zero-depend-pub](https://github.com/reinterpretcat/zero-depend-pub)  
44. zackshen/gguf: a GGUF file parser \- GitHub, [https://github.com/zackshen/gguf](https://github.com/zackshen/gguf)  
45. A Thorough Guide to the ZINC Project: The New Era of Blazing-Fast Local Execution of Massive AI Models on AMD GPUs and Apple Silicon \- note, [https://note.com/humble\_bobcat51/n/n0ac42cafd794?hl=en](https://note.com/humble_bobcat51/n/n0ac42cafd794?hl=en)  
46. StentorLabs/Stentor2-12M-Preview \- Hugging Face, [https://huggingface.co/StentorLabs/Stentor2-12M-Preview](https://huggingface.co/StentorLabs/Stentor2-12M-Preview)  
47. llama\_gguf \- Rust \- Docs.rs, [https://docs.rs/llama-gguf](https://docs.rs/llama-gguf)  
48. tensorblock/pythia-14m-GGUF \- Hugging Face, [https://huggingface.co/tensorblock/pythia-14m-GGUF](https://huggingface.co/tensorblock/pythia-14m-GGUF)  
49. mradermacher/pythia-14m-GGUF \- Hugging Face, [https://huggingface.co/mradermacher/pythia-14m-GGUF](https://huggingface.co/mradermacher/pythia-14m-GGUF)  
50. pythia-14m.f16.gguf \- mradermacher \- Hugging Face, [https://huggingface.co/mradermacher/pythia-14m-GGUF/blob/main/pythia-14m.f16.gguf](https://huggingface.co/mradermacher/pythia-14m-GGUF/blob/main/pythia-14m.f16.gguf)  
51. FlashLM v4: 4.3M ternary model trained on CPU in 2 hours — coherent stories from adds and subtracts only \- Reddit, [https://www.reddit.com/r/LocalLLaMA/comments/1r8c6th/flashlm\_v4\_43m\_ternary\_model\_trained\_on\_cpu\_in\_2/](https://www.reddit.com/r/LocalLLaMA/comments/1r8c6th/flashlm_v4_43m_ternary_model_trained_on_cpu_in_2/)  
52. Show HN: 5MB Rust binary that runs HuggingFace models (no Python) \- Hacker News, [https://news.ycombinator.com/item?id=45199898](https://news.ycombinator.com/item?id=45199898)  
53. Contract Rust Dialect \- Stellar Developer Docs, [https://developers.stellar.org/docs/learn/fundamentals/contract-development/rust-dialect](https://developers.stellar.org/docs/learn/fundamentals/contract-development/rust-dialect)  
54. WASM SIMD: Browser Support, Features, Limitations | TestMu AI (Formerly LambdaTest), [https://www.testmuai.com/learning-hub/wasm-simd-browser-support/](https://www.testmuai.com/learning-hub/wasm-simd-browser-support/)  
55. Valori: A Deterministic Memory Substrate for AI Systems \- arXiv, [https://arxiv.org/pdf/2512.22280](https://arxiv.org/pdf/2512.22280)  
56. 3514-float-semantics \- The Rust RFC Book, [https://rust-lang.github.io/rfcs/3514-float-semantics.html](https://rust-lang.github.io/rfcs/3514-float-semantics.html)  
57. Stronger floating-point NaN guarantees \- IR & Optimizations \- LLVM Discussion Forums, [https://discourse.llvm.org/t/stronger-floating-point-nan-guarantees/72165](https://discourse.llvm.org/t/stronger-floating-point-nan-guarantees/72165)  
58. Usage of "non-deterministic" floating point operations is misleading · Issue \#150323 · rust-lang/rust \- GitHub, [https://github.com/rust-lang/rust/issues/150323](https://github.com/rust-lang/rust/issues/150323)  
59. LLVM in wasmer\_compiler\_llvm\_near \- Rust \- Docs.rs, [https://docs.rs/wasmer-compiler-llvm-near/latest/wasmer\_compiler\_llvm\_near/struct.LLVM.html](https://docs.rs/wasmer-compiler-llvm-near/latest/wasmer_compiler_llvm_near/struct.LLVM.html)  
60. Config in wasmtime \- Rust, [https://rustdocs.bsx.fi/wasmtime/struct.Config.html](https://rustdocs.bsx.fi/wasmtime/struct.Config.html)  
61. Valori: A Deterministic Memory Substrate for AI Systems \- arXiv, [https://arxiv.org/html/2512.22280v1](https://arxiv.org/html/2512.22280v1)  
62. Add a deterministic FMA \#44 \- WebAssembly/relaxed-simd \- GitHub, [https://github.com/WebAssembly/relaxed-simd/issues/44](https://github.com/WebAssembly/relaxed-simd/issues/44)  
63. For the longest time (and for good reasons), floating point operations were co... | Hacker News, [https://news.ycombinator.com/item?id=47410604](https://news.ycombinator.com/item?id=47410604)  
64. Generate completion \- Datafund App, [https://app.datafund.io/viewentry/3](https://app.datafund.io/viewentry/3)  
65. tiiuae/Falcon-H1-Tiny-90M-Instruct \- Hugging Face, [https://huggingface.co/tiiuae/Falcon-H1-Tiny-90M-Instruct](https://huggingface.co/tiiuae/Falcon-H1-Tiny-90M-Instruct)