nn to cpp: What you need to know about porting deep learning models to the edge
It’s been incredibly exciting to see the big surge of interest in AI at the edge—the idea that we can run sophisticated AI algorithms on embedded devices—over the past couple of weeks. Folks like @ggerganov, who ported Whisper and LLaMA to C++, @nouamanetazi, who ported BLOOM, and @antimatter15, who ported Stanford Alpaca, have pulled back the curtain and shown the community that deep neural networks are just code, and they will run on any device that can fit them.
I’ve been building on-device AI for the past four years—first at Google, where we launched TensorFlow Lite for Microcontrollers, and now at Edge Impulse, where I’m Head of Machine Learning—and I’ve written a couple of books along the way. I am thrilled to see so much interest and enthusiasm from a ton of people who are new to the field and have wonderful ideas about what they could build.
Embedded ML is a big field to play in. It makes use of the entire history of computer science and electrical engineering, from semiconductor physics to distributed computing, laid out end to end. If you’re just starting out, it can be a bit overwhelming, and some things may come as a surprise. Here are my top tips for newcomers:
All targets are different. Play to their strengths.
Edge devices span a huge range of capabilities, from GPU-powered workhorses to ultra-efficient microcontrollers. Every unique system (which we call a target) represents a different set of trade-offs: RAM and ROM availability, clock speed, and processor features designed to speed up deep learning inference, along with peripherals and connectivity features for getting data in and out. Mobile phones and Raspberry Pi-style computers sit at the high end of this range; microcontrollers occupy the mid to low end. There are even purpose-built deep learning accelerators, including neuromorphic chips—inspired by the spiking behavior of biological neurons—designed for low latency and energy efficiency.
There are billions of microcontrollers (aka MCUs) manufactured every year; if you can run a model on an MCU, you can run it anywhere. In theory, the same C++ code should run on any device—but in practice, every processor family has custom instructions that you’ll need to use in order to perform computation fast. There are orders-of-magnitude performance penalties for running naive, unoptimized code. To make matters worse, optimization for different deep learning operations varies from target to target, and not all operations are equally supported. A simple change in convolutional stride or filter size may result in a huge performance difference.
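To make that gap concrete, here’s a minimal sketch of the kind of inner loop that dominates quantized inference: an int8 dot product. The naive version below is what a straightforward port produces on any target; on a real chip you’d swap it for the vendor’s optimized kernel, which packs multiple 8-bit values per register and uses the processor’s multiply-accumulate instructions. The `vendor_nn_dot_s8` name in the comment is hypothetical, standing in for whatever your target’s library provides.

```cpp
#include <cstdint>
#include <cstddef>

// Naive int8 dot product: the inner loop of a quantized dense or conv layer.
// This is what a straight C++ port looks like, and why it runs slowly:
// one scalar multiply-accumulate per iteration, no SIMD, no dual-issue MACs.
int32_t dot_product_naive(const int8_t* a, const int8_t* b, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
    }
    return acc;
}

// On a real target you would replace the loop with the vendor's kernel,
// e.g. (hypothetical call, API differs per vendor):
//
//   int32_t acc = vendor_nn_dot_s8(a, b, n);
//
// The available instructions and the achievable speedup differ from target
// to target, which is exactly why the same model can be fast on one chip
// and painfully slow on another.
```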
The matrix of targets versus models is extraordinarily vast, and traditional tools for optimization are fragmented and difficult to use. Every vendor has their own toolchain, so moving from one target to another is a challenge. Fortunately, you don’t need to hand-craft C++ for every model and target. There are high-level tools available (like Edge Impulse) that will take an arbitrary model and generate optimized C++ or bytecode designed for a specific target. They’ll let you focus on your application and not worry so much about the implementation details. And you’re not always stuck with fixed architectures: you can design a model specifically to run well on a given target.
Compression is lossy. Quantify that loss.
It’s common to compress models so that they fit in the constrained memory of smaller devices and run faster (using integer math) on their limited processors. Quantization is the most important form of compression for edge AI. Other approaches, like pruning, are still waiting for adequate hardware support.
Quantization involves reducing the precision of the model’s weights (and often its activations), for example from 32-bit floats to 8-bit integers. It can routinely get you anything from a 2x to an 8x reduction in model size. Since there’s no such thing as a free lunch, this shrinkage will result in reduced model performance. Quantization results in forgetting, as explained in this fantastic paper by my friend @sarahookr (https://arxiv.org/abs/1911.05248). As the model loses precision, it loses performance on the “long tail” of samples in its dataset, especially those that are infrequent or underrepresented.
Forgetting can lead to serious problems, amplifying any bias in the dataset, so it’s absolutely critical to evaluate a quantized model against the same criteria as its full-sized progenitor in order to understand what was lost. For example, a quantized translation model may lose its abilities unevenly: it might “forget” languages or words that occur less frequently.
Typically, you can get a roughly 4x reduction in size (from 32 to 8 bits) with potentially minimal performance impact (always evaluate), and without doing any additional training. If you quantize below 8 bits, it’s generally necessary to do so during training (quantization-aware training), so you’ll need access to the model’s original dataset.
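For a sense of the mechanics, here’s a minimal sketch of post-training affine quantization for a single tensor, using one per-tensor scale and zero point. Real toolchains add per-channel scales, activation calibration, and operator fusion, but the core arithmetic is just this.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Affine (asymmetric) quantization of a float tensor to int8:
//   q = round(x / scale) + zero_point, clamped to [-128, 127]
//   x ≈ (q - zero_point) * scale
struct QuantizedTensor {
    std::vector<int8_t> values;
    float scale;
    int32_t zero_point;
};

QuantizedTensor quantize_per_tensor(const std::vector<float>& x) {
    float lo = *std::min_element(x.begin(), x.end());
    float hi = *std::max_element(x.begin(), x.end());
    lo = std::min(lo, 0.0f);  // make sure zero is exactly representable
    hi = std::max(hi, 0.0f);

    QuantizedTensor q;
    q.scale = (hi - lo) / 255.0f;            // map the float range onto 256 int8 levels
    if (q.scale == 0.0f) q.scale = 1.0f;     // degenerate all-zero tensor
    q.zero_point = static_cast<int32_t>(std::lround(-128.0f - lo / q.scale));

    q.values.reserve(x.size());
    for (float v : x) {
        int32_t qv = static_cast<int32_t>(std::lround(v / q.scale)) + q.zero_point;
        q.values.push_back(static_cast<int8_t>(std::clamp<int32_t>(qv, -128, 127)));
    }
    return q;
}

// Dequantize to see what was lost: the gap between x and
// dequantize(quantize(x)) is the rounding error the model has to live with.
float dequantize(const QuantizedTensor& q, size_t i) {
    return static_cast<float>(q.values[i] - q.zero_point) * q.scale;
}
```

This is also where the roughly 4x size reduction comes from: each 32-bit float weight becomes a single 8-bit integer, plus a handful of scale and zero-point parameters per tensor.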
A fun part of edge AI product design is figuring out how to make clever use of models that have been partially compromised by their design constraints. Model output is just one contextual clue that you can use to understand a given situation. Even an unreliable model can contribute to an overall system that feels like magic.
Devices have sensors. Learn how to use them.
While it’s super exciting to see today’s large language models ported to embedded devices, they’re only scratching the surface of what’s possible. Edge devices can be equipped with sensors—everything from cameras to radar—that can give them contextual information about the world around them. Combined with deep learning, sensor data gives devices incredible insight into everything from industrial processes to the inner state of a human body.
Today’s large, multimodal models are built using web-scraped data, so they’re biased towards text and vision. The sensors available to an embedded device go far beyond that—you can capture motion, audio, any part of the EM spectrum, gases and other chemicals, and human biosignals, including EEG data representing brain activity! I’m most excited to see the community make use of this additional data to train models that have far more insight than anything possible on the web.
Raw sensor data is high-dimensional and noisy. Digital signal processing algorithms help us sift the signal from the noise. DSP is an incredibly important part of embedded engineering, and many edge processors have on-board acceleration for DSP. As an ML engineer, learning basic DSP gives you superpowers for handling high-frequency time-series data in your models.
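As a taste of what that front end looks like, here’s a small sketch that slides a window over a raw signal and computes two simple features per window: RMS energy and zero-crossing rate. Real pipelines typically use FFT-based spectrograms or MFCCs, often through hardware-accelerated DSP libraries, but the shape of the code is the same: raw samples in, a compact feature vector out.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal DSP front end: slide a window over a raw sensor signal and turn
// each window into a tiny feature vector. Models train far better on these
// features than on thousands of raw, noisy samples.
struct Features {
    float rms;                  // energy in the window
    float zero_crossing_rate;   // rough proxy for dominant frequency
};

std::vector<Features> extract_features(const std::vector<float>& signal,
                                       size_t window, size_t stride) {
    std::vector<Features> out;
    for (size_t start = 0; start + window <= signal.size(); start += stride) {
        float sum_sq = 0.0f;
        size_t crossings = 0;
        for (size_t i = 0; i < window; ++i) {
            float v = signal[start + i];
            sum_sq += v * v;
            if (i > 0 && std::signbit(v) != std::signbit(signal[start + i - 1])) {
                ++crossings;
            }
        }
        out.push_back({std::sqrt(sum_sq / window),
                       static_cast<float>(crossings) / window});
    }
    return out;
}
```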
We can’t rely on pre-trained models.
A lot of ML/AI discourse revolves around pre-trained models, like LLaMA or ResNet, which are passed around as artifacts and treated like black boxes. This approach works fine for data with universal structure, like language or photographs, since any user can provide compatible inputs. It falls apart when the structure of the data starts to vary from device to device.
For example, imagine you’ve built an edge AI device with on-board sensors. The model, calibration, and location of these sensors, along with any signal processing, will affect the data they produce. If you capture data with one device, train a model, and then share it with another developer who has a device with different sensors, the data will be different and the model may not work.
Devices are infinitely variable, and as physical objects, they can even change over time. This makes pre-trained models less useful for AI at the edge. Instead, we train custom models for every application. We often use pre-trained models as feature extractors: for example, we might use a pre-trained MobileNet to obtain high-level features from an image sensor, then input those into a custom model—alongside other sensor data—in order to make predictions.
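Here’s a sketch of that pattern: a small, application-specific dense head that takes an embedding from a pre-trained backbone and concatenates it with features from the device’s other sensors. The dimensions, and the assumption that the embedding arrives as a plain float vector, are placeholders; in practice it would come out of whatever inference runtime is running your quantized MobileNet.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the "pre-trained backbone + custom head" pattern.
// image_embedding: output of a pre-trained feature extractor (e.g. a
//   MobileNet's penultimate layer) -- dimensions are placeholders.
// sensor_features: DSP features from the device's other sensors.
// weights/bias:    the small, application-specific model you actually train.
std::vector<float> dense_head(const std::vector<float>& image_embedding,
                              const std::vector<float>& sensor_features,
                              const std::vector<std::vector<float>>& weights,  // [num_classes][input_dim]
                              const std::vector<float>& bias) {
    // Concatenate the backbone embedding with the extra sensor features.
    std::vector<float> input = image_embedding;
    input.insert(input.end(), sensor_features.begin(), sensor_features.end());

    // One fully connected layer: logits = W * input + b.
    std::vector<float> logits(weights.size());
    for (size_t c = 0; c < weights.size(); ++c) {
        float acc = bias[c];
        for (size_t i = 0; i < input.size(); ++i) {
            acc += weights[c][i] * input[i];
        }
        logits[c] = acc;
    }
    return logits;
}
```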
Making on-device AI the norm.
I’m confident that edge AI will enable a world of ambient computing, where our built environment is imbued with subtle, embodied intelligence that improves our lives in myriad ways—while remaining grounded in physical reality. It’s a refreshing new vision, diametrically opposed to the narrative of inevitable centralization that has characterized the era of cloud compute.
The challenges, constraints, and opportunities of embedded machine learning make it the most fascinating branch of computer science. I’m incredibly excited to see the field open up to new people with diverse perspectives and bold ideas.
Our goal at Edge Impulse is to make this field accessible to everyone. If you’re an ML engineer, we have tools to help you transform your existing models into optimized C++ for pretty much any target—and use digital signal processing to add sensor data to the mix. If you’re an embedded engineer or domain expert otherwise new to AI, we take the mystery out of the ML parts: you can upload data, train a model, and deploy it as a C++ library without needing a PhD. It’s easy to get started; just take a look at our docs.
It’s been amazing to see the gathering momentum behind porting deep learning models to C++, and it comes at an exciting time for us. Inside the company, I’ve been personally leading an effort to make this kind of work quick, easy, and accessible for every engineer. Watch this space: we’ll soon be making some big announcements that will open up our field even more.
Warmly,
Dan