I quantized the model. It still didn't fit.
This was the moment on-device AI stopped being theoretical for us.
We were deploying a source separation model with TFLite Micro on embedded hardware with severely constrained memory.
The standard advice: quantize your model from FP32 to INT8. Each weight drops from 4 bytes to 1, so you get a roughly 4× size reduction. Problem solved.
So we did. The model shrank significantly. On paper, it fit.
It didn't run.
The model size is not your memory budget. The inference runtime is.
Before your model processes a single sample, the runtime has already consumed memory. Operator kernels, tensor arenas, scratch buffers, runtime overhead — all of it lands before inference even begins. By the time the model loads, your headroom is already gone.
We had optimized the wrong thing.
Most ML engineers do this. They spend days squeezing the model — pruning, quantizing, distilling — and treat the runtime as a black box that "just works." On a server, that's fine. On an embedded device with 256KB of RAM, the runtime is the problem.
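To make that arithmetic concrete, here is a minimal sketch of a RAM budget for a 256 KB device. Only the 256 KB ceiling comes from the text above; every other number, and every name in the code, is a hypothetical placeholder rather than something we measured. The sketch also assumes the weights sit in flash, which is why the model file size never appears in it.

```python
# Hypothetical RAM budget for a 256 KB device. Every number except the
# 256 KB ceiling is an illustrative assumption, not a measurement.
KB = 1024

ram_total = 256 * KB

# Assumed RAM costs that exist before any inference runs (placeholders).
fixed_costs = {
    "runtime_structs": 20 * KB,  # interpreter state, allocator bookkeeping
    "stack_and_heap": 32 * KB,   # firmware stack, heap, other tasks
    "io_buffers": 24 * KB,       # e.g. audio input/output buffers
}

# Assumed arena requirement (activations plus scratch buffers) for the model.
tensor_arena_needed = 200 * KB


def arena_headroom(ram_total, fixed_costs, arena_needed):
    """Return RAM left after fixed costs and the arena (negative means no fit)."""
    available_for_arena = ram_total - sum(fixed_costs.values())
    return available_for_arena - arena_needed


available = ram_total - sum(fixed_costs.values())
headroom = arena_headroom(ram_total, fixed_costs, tensor_arena_needed)

print(f"RAM available for arena: {available // KB} KB")
print(f"Headroom after arena:    {headroom // KB} KB")
print("fits" if headroom >= 0 else "does not fit")
```

With these made-up numbers the arena is 20 KB short, however small the model file is. On a real project, each placeholder would be replaced by a value measured on the target.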
Here's what we learned:
Profile the runtime first, not the model.
Measure how much memory the inference engine itself consumes on your target hardware before you touch the model.
Your dev machine is lying to you.
Benchmark numbers from a laptop tell you almost nothing about the target. The only numbers that matter are the ones measured on the actual target device, with the actual runtime, under the actual memory ceiling.
Operator support shapes your architecture.
Not every op your model uses is supported on every runtime. If you design the model first and check runtime compatibility later, you redesign the model later. Do it in the other order.
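One way to run that check early is a plain set difference between the ops the model needs and the ops the target runtime build provides. The sketch below is only an illustration: both op lists and the function name `unsupported_ops` are hypothetical, and in practice the two sets would come from your converted model and from the kernels compiled into your firmware.

```python
# Hypothetical op lists: neither set is taken from a real model or runtime build.
model_ops = {"CONV_2D", "DEPTHWISE_CONV_2D", "RESHAPE", "LSTM_CUSTOM", "SOFTMAX"}
runtime_ops = {"CONV_2D", "DEPTHWISE_CONV_2D", "RESHAPE", "SOFTMAX", "ADD", "MUL"}


def unsupported_ops(model_ops, runtime_ops):
    """Return the sorted ops the model needs that the runtime does not provide."""
    return sorted(set(model_ops) - set(runtime_ops))


missing = unsupported_ops(model_ops, runtime_ops)
if missing:
    print("Redesign needed, unsupported ops:", ", ".join(missing))
else:
    print("All ops supported by the target runtime")
```

Running a check like this before training, rather than after, is the cheap version of the redesign described above.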
The model and the runtime are a system.
Optimizing one while ignoring the other is like upgrading a car engine while ignoring that the fuel line is clogged.
On-device AI is hard not because the ML is hard. It's hard because the gap between "works in a notebook" and "runs on the device" is wider than most people expect — and almost nobody talks honestly about what's in that gap.
That gap is where the real engineering happens.