Embedded AI

EAI 0.9 ships INT4 LLM runtime — 11 tok/s on a Cortex-M85

EAI's new quantized inference path squeezes a 1.3B-parameter model into 312 MB of flash and runs at interactive speed on a 480 MHz microcontroller. We dig into the kernel scheduler that made it possible.

The numbers

On a 480 MHz Cortex-M85 with 2 MB of TCM and 4 MB of external QSPI flash, EAI 0.9 sustains 11 tokens per second on a 1.3B-parameter chat model quantized to INT4. End-to-end latency from a 32-token prompt to first emitted token is 380 ms. Peak power, measured at the 3.3 V rail, is 412 mW.
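The headline figures translate into per-token budgets with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative, and because the power figure is a peak rather than an average, the energy result is a ceiling, not a measurement.

```python
# Per-token budgets implied by the quoted figures (back-of-envelope only).
CLOCK_HZ = 480e6        # Cortex-M85 core clock
TOKENS_PER_S = 11.0     # sustained decode rate
PEAK_POWER_W = 0.412    # peak power at the 3.3 V rail
RAIL_V = 3.3

ms_per_token = 1000.0 / TOKENS_PER_S
cycles_per_token = CLOCK_HZ / TOKENS_PER_S
# Peak power bounds average power from above, so this is an upper bound.
max_mj_per_token = PEAK_POWER_W / TOKENS_PER_S * 1000.0
peak_current_ma = PEAK_POWER_W / RAIL_V * 1000.0

print(f"{ms_per_token:.1f} ms per token")
print(f"{cycles_per_token / 1e6:.1f} M core cycles per token")
print(f"<= {max_mj_per_token:.1f} mJ per token")
print(f"{peak_current_ma:.0f} mA peak rail current")
```

In round numbers, each token has about 91 ms and roughly 44 million core cycles to work with, at no more than about 37 mJ.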

That model fits in 312 MB of compressed weights. Yes — three hundred megabytes, on a microcontroller. The trick isn't a smaller model. It's that EAI 0.9 streams quantized blocks directly from QSPI through the DMA controller into the matmul kernel without ever materializing the full weight matrix in RAM.
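A quick footprint check puts the 312 MB figure in context. This sketch assumes 1 MB means 10^6 bytes and counts only the parameter count and size quoted above; it says nothing about how the stored format is actually encoded, which the article does not describe.

```python
# Storage arithmetic for the quoted model size (assumes MB = 10**6 bytes).
PARAMS = 1.3e9              # parameter count
COMPRESSED_BYTES = 312e6    # stored weight size quoted in the article

raw_int4_bytes = PARAMS * 4 / 8                   # plain 4-bit packing
bits_per_weight = COMPRESSED_BYTES * 8 / PARAMS   # average stored bits
compression_vs_int4 = raw_int4_bytes / COMPRESSED_BYTES

print(f"plain INT4 packing: {raw_int4_bytes / 1e6:.0f} MB")
print(f"stored average:     {bits_per_weight:.2f} bits per weight")
print(f"vs. plain INT4:     {compression_vs_int4:.2f}x smaller")
```

Plain 4-bit packing would need 650 MB, so the stored image averages a little under 2 bits per weight. That is consistent with the "compressed weights" wording: the on-flash format has to be smaller than raw INT4.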

Block-streamed inference

The kernel scheduler takes the dependency graph of an LLM forward pass and partitions it into 64-row weight blocks. Three pipeline stages work on the block stream concurrently: a QSPI fetcher prefetches the next block, a matmul kernel consumes the current block, and an activation accumulator holds the partial sum. The pipeline depth is set so that the matmul kernel never stalls waiting for weights.
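To make "pipeline depth" concrete, here is a toy timing model, not EAI's actual policy. It assumes a single serial fetch channel, a fixed compute time per block, and `depth` block-sized buffers; `stall_time` and `min_stall_free_depth` are hypothetical names. Extra buffers can absorb an occasional slow fetch, but they cannot help if fetches are slower than compute on average.

```python
def stall_time(fetch_times, compute_time, depth):
    """Time the kernel waits for weights, ignoring the initial pipeline fill.

    fetch_times[i] is how long block i takes to arrive; fetches are serial.
    A buffer is reusable once the block `depth` positions earlier is consumed.
    """
    done = []            # done[i]: when the kernel finishes block i
    fetch_free = 0.0     # when the fetch channel is next idle
    stall = 0.0
    for i, fetch in enumerate(fetch_times):
        buf_free = done[i - depth] if i >= depth else 0.0
        ready = max(fetch_free, buf_free) + fetch
        fetch_free = ready
        kernel_free = done[i - 1] if i > 0 else 0.0
        start = max(kernel_free, ready)
        if i > 0:
            stall += start - kernel_free
        done.append(start + compute_time)
    return stall


def min_stall_free_depth(fetch_times, compute_time, max_depth=16):
    """Smallest buffer count with no stalls on this trace, or None."""
    for depth in range(1, max_depth + 1):
        if stall_time(fetch_times, compute_time, depth) <= 1e-12:
            return depth
    return None


# One slow fetch (5 units) in an otherwise fast stream, compute = 2 units.
trace = [1, 1, 1, 1, 1, 5, 1, 1]
print(min_stall_free_depth(trace, 2.0))        # depth needed for this trace
print(min_stall_free_depth([3] * 6, 2.0))      # fetch-bound: no depth helps
```

In this model a single buffer serializes fetch and compute entirely, two buffers give classic double buffering, and deeper queues only pay off when fetch latency is uneven.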

// EAI scheduler (simplified)
for each block in layer.weight_blocks {
    qspi.prefetch(block.next);     // async DMA
    int4_matmul(block.current,
                activations,
                accumulator);      // CPU
    activations.barrier_release(); // wakes attention head
}
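The loop above can be modelled on a host machine to check the core idea: the result of a matrix-vector product does not change when the weights are consumed one packed 64-row block at a time. This is a minimal sketch under stated assumptions, not EAI code. The nibble layout (two signed 4-bit values per byte, low nibble first), the helper names, and the list standing in for QSPI flash are all illustrative, and the "prefetch" is an ordinary lookahead read, so nothing actually overlaps.

```python
import random

BLOCK_ROWS = 64  # block height quoted in the article


def pack_int4(values):
    """Pack signed 4-bit ints (-8..7) two per byte, low nibble first."""
    assert len(values) % 2 == 0
    return bytes((lo & 0xF) | ((hi & 0xF) << 4)
                 for lo, hi in zip(values[0::2], values[1::2]))


def unpack_int4(packed):
    """Inverse of pack_int4: sign-extend each nibble back to -8..7."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values


def prefetching_reader(flash_blocks):
    """Yield the current block while already holding the next one."""
    it = iter(flash_blocks)
    nxt = next(it, None)
    while nxt is not None:
        current, nxt = nxt, next(it, None)
        yield current


def streamed_matvec(flash_blocks, cols, activations):
    """Compute W @ activations while touching one packed block at a time."""
    row_bytes = cols // 2
    out = []
    for block in prefetching_reader(flash_blocks):
        for r in range(len(block) // row_bytes):
            row = unpack_int4(block[r * row_bytes:(r + 1) * row_bytes])
            out.append(sum(w * a for w, a in zip(row, activations)))
    return out


# Demo with made-up data: 160 rows split into blocks of 64, 64, and 32 rows.
random.seed(0)
ROWS, COLS = 160, 32
weights = [[random.randint(-8, 7) for _ in range(COLS)] for _ in range(ROWS)]
activations = [random.randint(-128, 127) for _ in range(COLS)]
flash_blocks = [b"".join(pack_int4(row) for row in weights[i:i + BLOCK_ROWS])
                for i in range(0, ROWS, BLOCK_ROWS)]

reference = [sum(w * a for w, a in zip(row, activations)) for row in weights]
result = streamed_matvec(flash_blocks, COLS, activations)
print(result == reference, "largest block:", max(map(len, flash_blocks)), "bytes")
```

Only one block of packed weights is unpacked at any moment, so the working set is bounded by the block size rather than by the full matrix.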

Quantization quality

INT4 is aggressive. We measured a perplexity increase of 0.31 on WikiText relative to the FP16 reference, which is within noise of the published GPTQ baselines. For chat use, blind A/B testing on 200 prompts showed no statistically significant preference at the 95% confidence level.
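For readers who want to run this kind of check on their own models, the two quantities can be computed as follows. The article does not say which statistical test was used; the sketch assumes an exact two-sided sign test on non-tied preferences, and every input in the demo is made up for illustration.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def sign_test_p_value(wins_a, wins_b):
    """Exact two-sided binomial test of 'no preference' (p = 0.5).

    Ties are assumed to have been dropped before counting.
    """
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Made-up inputs: per-token losses for two models, then A/B vote counts.
fp16_nlls = [2.10, 1.95, 2.30, 2.05]
int4_nlls = [2.13, 1.99, 2.31, 2.09]
delta = perplexity(int4_nlls) - perplexity(fp16_nlls)
print(f"perplexity increase: {delta:.3f}")

for a, b in [(105, 95), (130, 70)]:
    p = sign_test_p_value(a, b)
    verdict = "significant" if p < 0.05 else "no significant preference"
    print(f"{a} vs {b}: p = {p:.4f} ({verdict})")
```

With 200 prompts, a split close to even cannot be told apart from chance at the 5% level, while a lopsided one can.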

What this enables

This is not a research demo. EAI 0.9 ships in the eos-platform 1.0 profiles for both gateway and wearable targets. eApps targeting on-device assistants can now declare a model dependency in their manifest and rely on EAI to handle quantization, scheduling, and power budgeting at install time.
