EAI 0.9 ships INT4 LLM runtime — 11 tok/s on a Cortex-M85

The numbers

On a 480 MHz Cortex-M85 with 2 MB of TCM and 4 MB of external QSPI flash, EAI 0.9 sustains 11 tokens per second on a 1.3B-parameter chat model quantized to INT4. End-to-end latency from a 32-token prompt to first emitted token is 380 ms. Peak power, measured at the 3.3 V rail, is 412 mW.

That model's compressed weights occupy 312 MB. Yes, three hundred megabytes, on a microcontroller. The trick isn't a smaller model: EAI 0.9 streams quantized blocks directly from QSPI through the DMA controller into the matmul kernel, without ever materializing the full weight matrix in RAM.
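The throughput and power figures quoted above also imply an energy cost per token. The following is a back-of-the-envelope sketch, not a measurement: it assumes the 412 mW peak is drawn for the whole decode, so the result is an upper bound rather than an average. The function name is illustrative.

```python
def energy_per_token_mj(power_mw: float, tokens_per_s: float) -> float:
    """Upper-bound energy per generated token, in millijoules.

    1 mW is 1 mJ per second, so mW divided by tokens/s gives mJ per token.
    """
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return power_mw / tokens_per_s


# Figures quoted in the article: 412 mW peak, 11 tokens per second.
bound = energy_per_token_mj(412.0, 11.0)
print(f"{bound:.1f} mJ per token (upper bound)")  # prints "37.5 mJ per token (upper bound)"
```

Because peak power overstates average draw, the real per-token energy is presumably lower than this bound.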

Block-streamed inference

The kernel scheduler takes the dependency graph of an LLM forward pass and partitions its weight matrices into 64-row blocks. Three stages run concurrently: a QSPI fetcher prefetches the next block, the matmul kernel consumes the current block, and an activation accumulator holds the partial sums. The pipeline depth is set so that the matmul kernel never stalls waiting for weights.

// EAI scheduler (simplified)
for each block in layer.weight_blocks {
    qspi.prefetch(block.next);     // async DMA
    int4_matmul(block.current,
                activations,
                accumulator);      // CPU
    activations.barrier_release(); // wakes attention head
}
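To make the data flow concrete, here is a minimal Python model of block-streamed matrix-vector multiplication. It is a sketch under stated assumptions, not EAI's implementation: the nibble layout (low nibble first), the symmetric per-block scale, and every function name are hypothetical, and the "prefetch" runs sequentially here, whereas on hardware the DMA transfer would overlap with compute. What it does show is that at most two weight blocks are held at any time.

```python
from typing import List, Tuple

BLOCK_ROWS = 64  # block height quoted in the article
Block = Tuple[bytes, float]  # (packed INT4 weights, per-block scale) -- assumed format


def pack_int4(values: List[int]) -> bytes:
    """Pack signed values in [-8, 7] two per byte, low nibble first (assumed layout)."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    return bytes((lo & 0x0F) | ((hi & 0x0F) << 4)
                 for lo, hi in zip(values[0::2], values[1::2]))


def unpack_int4(packed: bytes) -> List[int]:
    """Inverse of pack_int4: sign-extend each 4-bit nibble."""
    out = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out


def quantize_to_blocks(weights: List[List[float]],
                       block_rows: int = BLOCK_ROWS) -> List[Block]:
    """Symmetric per-block INT4 quantization (an assumed scheme).

    Each block must contain an even number of weights.
    """
    blocks = []
    for start in range(0, len(weights), block_rows):
        flat = [w for row in weights[start:start + block_rows] for w in row]
        peak = max(abs(w) for w in flat)
        scale = peak / 7.0 if peak else 1.0
        q = [max(-8, min(7, round(w / scale))) for w in flat]
        blocks.append((pack_int4(q), scale))
    return blocks


def streamed_matvec(flash: List[Block], x: List[float]) -> List[float]:
    """Compute y = W @ x while holding at most two weight blocks at once."""
    cols = len(x)
    y: List[float] = []
    fetch = iter(flash)          # stands in for the QSPI/DMA fetcher
    current = next(fetch, None)  # prime the pipeline with the first block
    while current is not None:
        upcoming = next(fetch, None)  # "prefetch" the following block
        packed, scale = current
        q = unpack_int4(packed)
        for r in range(0, len(q), cols):
            acc = sum(qw * xv for qw, xv in zip(q[r:r + cols], x))  # partial sum
            y.append(scale * acc)  # dequantize once per output row
        current = upcoming  # the old block can now be overwritten
    return y
```

A small usage example with hypothetical weights:

```python
blocks = quantize_to_blocks([[7.0, -7.0], [3.5, 0.0]], block_rows=1)
print(streamed_matvec(blocks, [1.0, 2.0]))  # prints [-7.0, 3.5]
```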

Quantization quality

INT4 is aggressive. We measured a perplexity increase of 0.31 on WikiText relative to the FP16 reference, within noise of the published GPTQ baselines. For chat use, blind A/B testing on 200 prompts showed no statistically significant preference at the 95% confidence level.
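The article does not say which statistical test was applied to the A/B results. One common choice for a two-way preference test is an exact two-sided binomial (sign) test against a 50/50 null hypothesis. The sketch below assumes that test, and the win counts in the usage example are hypothetical, not the measured results.

```python
from math import comb


def two_sided_binomial_p(wins: int, trials: int) -> float:
    """Exact two-sided p-value for H0: each variant is preferred with probability 0.5."""
    if not 0 <= wins <= trials:
        raise ValueError("wins must be between 0 and trials")
    k = min(wins, trials - wins)  # size of the smaller tail
    tail = sum(comb(trials, i) for i in range(k + 1)) / 2 ** trials
    return min(1.0, 2 * tail)     # the null is symmetric, so double one tail


def shows_preference(wins: int, trials: int, alpha: float = 0.05) -> bool:
    """True if the split is significant at the given level (0.05 for 95% confidence)."""
    return two_sided_binomial_p(wins, trials) < alpha
```

For example, with a hypothetical 105-to-95 split over 200 prompts:

```python
print(shows_preference(105, 200))  # prints False
```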

What this enables

This is not a research demo. EAI 0.9 ships in the eos-platform 1.0 profiles for both gateway and wearable targets. eApps targeting on-device assistants can now declare a model dependency in their manifest and rely on EAI to handle quantization, scheduling, and power budgeting at install time.

Filed under: Embedded AI

