The numbers
On a 480 MHz Cortex-M85 with 2 MB of TCM and 4 MB of external QSPI flash, EAI 0.9 sustains 11 tokens per second on a 1.3B-parameter chat model quantized to INT4. End-to-end latency from a 32-token prompt to first emitted token is 380 ms. Peak power, measured at the 3.3 V rail, is 412 mW.
That model fits in 312 MB of compressed weights. Yes — three hundred megabytes, on a microcontroller. The trick isn't a smaller model. It's that EAI 0.9 streams quantized blocks directly from QSPI through the DMA controller into the matmul kernel without ever materializing the full weight matrix in RAM.
Block-streamed inference
The kernel scheduler takes the dependency graph of an LLM forward pass and partitions it into 64-row weight blocks. Three pipeline stages run concurrently: a QSPI fetcher prefetches the next block, a matmul kernel consumes the current block, and an activation accumulator holds the partial sums. The pipeline depth is set so that the matmul kernel never stalls waiting for weights.
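To make the prefetch/compute overlap concrete, here is a host-side sketch in C that streams a packed INT4 matrix through two ping-pong staging buffers. It is not EAI's kernel. The names `fetch_block`, `int4_block_matvec`, and `streamed_matvec` are hypothetical, the nibble layout (low nibble first, two's complement) and the INT8 activations are assumptions, and a blocking `memcpy` stands in for the asynchronous QSPI DMA transfer, so nothing truly overlaps in this version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_ROWS  64                        /* block height quoted above */
#define COLS        8                         /* toy width; must be even */
#define ROW_BYTES   (COLS / 2)                /* two INT4 weights per byte */
#define BLOCK_BYTES (BLOCK_ROWS * ROW_BYTES)

/* Sign-extend a 4-bit two's-complement nibble (0..15) to int (-8..7). */
static int int4_to_int(uint8_t nibble) {
    return (nibble & 0x8) ? (int)nibble - 16 : (int)nibble;
}

/* Assumption: stand-in for an async QSPI DMA transfer (blocking copy). */
static void fetch_block(const uint8_t *flash, size_t block, uint8_t *dst) {
    memcpy(dst, flash + block * BLOCK_BYTES, BLOCK_BYTES);
}

/* Multiply one staged block by x, writing BLOCK_ROWS outputs to y. */
static void int4_block_matvec(const uint8_t *blk, const int8_t *x,
                              int32_t *y) {
    for (size_t r = 0; r < BLOCK_ROWS; r++) {
        int32_t acc = 0;
        for (size_t c = 0; c < COLS; c += 2) {
            uint8_t packed = blk[r * ROW_BYTES + c / 2];
            acc += int4_to_int(packed & 0x0F) * x[c];      /* low nibble  */
            acc += int4_to_int(packed >> 4) * x[c + 1];    /* high nibble */
        }
        y[r] = acc;
    }
}

/* y = W x, with W read from `flash` one 64-row block at a time.
 * Only two blocks of weights are ever resident in RAM. */
void streamed_matvec(const uint8_t *flash, size_t rows,
                     const int8_t *x, int32_t *y) {
    static uint8_t staging[2][BLOCK_BYTES];   /* ping-pong buffers */
    size_t nblocks = rows / BLOCK_ROWS;
    assert(rows % BLOCK_ROWS == 0);
    if (nblocks == 0) return;

    fetch_block(flash, 0, staging[0]);        /* prime the pipeline */
    for (size_t b = 0; b < nblocks; b++) {
        if (b + 1 < nblocks)                  /* stage the next block */
            fetch_block(flash, b + 1, staging[(b + 1) & 1]);
        int4_block_matvec(staging[b & 1], x, y + b * BLOCK_ROWS);
    }
}
```

On real hardware the fetch would be issued as a DMA request and the loop would wait for its completion before swapping buffers. The structure stays the same: the weight footprint in RAM is two blocks, regardless of how large the matrix in flash is.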
// EAI scheduler (simplified)
for each block in layer.weight_blocks {
    qspi.prefetch(block.next);       // async DMA
    int4_matmul(block.current,
                activations,
                accumulator);        // CPU
    activations.barrier_release();   // wakes attention head
}
Quantization quality
INT4 is aggressive. We measured a 0.31 perplexity increase on WikiText relative to the FP16 reference, within noise of the published GPTQ baselines. For chat use, blind A/B testing on 200 prompts showed no statistically significant preference at the 95% confidence level.
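For readers who want to run a similar check on their own A/B data, here is a minimal sketch of a two-sided sign test at the 95% level. It is an illustration under stated assumptions (ties are discarded before counting, and the normal approximation to the binomial is used), not necessarily the analysis behind the result above. `prefers_one_side` is a hypothetical name.

```c
#include <stdbool.h>

/* Two-sided sign test at the 95% level, normal approximation.
 * Under H0 (no preference), wins_a ~ Binomial(n, 0.5), so
 * z = (2*wins_a - n) / sqrt(n). Comparing z*z against 1.96^2
 * avoids calling sqrt. `n` counts decisive (non-tied) trials. */
bool prefers_one_side(unsigned wins_a, unsigned n) {
    if (n == 0 || wins_a > n) return false;       /* no usable data */
    double d = 2.0 * (double)wins_a - (double)n;  /* wins minus losses */
    return d * d > 1.96 * 1.96 * (double)n;
}
```

With 200 decisive prompts, this test reports no significant preference for any split in which one model wins between 87 and 113 times.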
What this enables
This is not a research demo. EAI 0.9 ships in the eos-platform 1.0 profiles for both gateway and wearable targets. eApps targeting on-device assistants can now declare a model dependency in their manifest and rely on EAI to handle quantization, scheduling, and power budgeting at install time.