Run AI Models on Raspberry Pi
Run AI and machine learning models on a Raspberry Pi using TensorFlow Lite, ONNX Runtime, and llama.cpp. Includes benchmarks.
Introduction
Running AI models on a Raspberry Pi is entirely feasible—but expectations matter. Inference (prediction) is realistic; training is not. This guide covers practical frameworks and real performance numbers for Pi 4 and Pi 5.
Prerequisites
- Raspberry Pi 4 (2GB+) or Pi 5 (4GB+)
- Python 3.9+
- Virtual environment (python3 -m venv)
- ~1GB free disk space
- For accelerated inference: Coral USB Accelerator (optional)
What's Realistic
- Inference: Yes. Run pre-trained models for classification, detection, segmentation.
- Training: No. RAM and CPU constraints make training impractical on Pi.
- Model size: 1–200MB typically for vision models; quantized small LLMs run to several hundred MB. Larger models need quantization or distillation to fit.
- Speed: Expect roughly 25–500ms per inference depending on model and device.
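To make the size constraint concrete, here is a rough sketch of how weight precision drives model size. The function name and the 1.1 billion parameter count (approximately TinyLlama) are illustrative assumptions, and real quantized files are somewhat larger because formats store scale factors and keep some tensors at higher precision.

```python
def estimate_model_size_mb(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in MB (1 MB = 10**6 bytes), ignoring file metadata."""
    return num_params * bits_per_weight / 8 / 1e6


# Roughly 1.1 billion parameters, as in TinyLlama (assumed round figure).
for label, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("4-bit", 4)]:
    size = estimate_model_size_mb(1_100_000_000, bits)
    print(f"{label:>8}: {size:,.0f} MB")
```

The same arithmetic shows why 8-bit quantization cuts a float32 vision model to about a quarter of its original size.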
Step 1 — Install TensorFlow Lite Runtime
TensorFlow Lite is the fastest path for most inference tasks. Install the lightweight runtime (no build required):
python3 -m venv ~/tflite_env
source ~/tflite_env/bin/activate
pip install --upgrade pip
pip install tflite-runtime numpy pillow
Verify:
python3 -c "import tflite_runtime.interpreter as tflite; print('TFLite OK')"
Step 2 — Download a Pre-trained Model
Download MobileNet V2 (image classification, 3.5MB, optimized for Pi):
mkdir -p ~/models && cd ~/models
wget https://storage.googleapis.com/download.tensorflow.org/models/mobilenet_v2_1.0_224_quant.tflite
Also grab the label file:
wget https://storage.googleapis.com/download.tensorflow.org/models/labels_imagenet_slim.txt
Step 3 — Image Classification Script
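Create classify.py in ~/models, next to the model, the label file, and a test image named test.jpg (the script uses relative paths and needs Pillow and NumPy installed in the venv):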
import numpy as np
import tflite_runtime.interpreter as tflite
from PIL import Image

# Load the model and allocate its tensors
interpreter = tflite.Interpreter(model_path="mobilenet_v2_1.0_224_quant.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Load the image, force 3-channel RGB, and resize to the model's 224x224 input
img = Image.open("test.jpg").convert("RGB").resize((224, 224))
input_data = np.array(img, dtype=np.uint8)
input_data = np.expand_dims(input_data, axis=0)  # add batch dimension: (1, 224, 224, 3)

# Run inference
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]['index'])
top_class = int(np.argmax(output[0]))
# Scores are quantized uint8; dividing by 255 gives an approximate probability
confidence = output[0][top_class] / 255.0

# Load labels (one per line, index-aligned with the model output)
with open("labels_imagenet_slim.txt") as f:
    labels = [line.strip() for line in f]

print(f"Class: {labels[top_class]} ({confidence:.2%})")
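The division by 255 in the script only approximates the output scaling. TFLite reports the real quantization parameters as a (scale, zero_point) pair under the 'quantization' key of each output detail. The sketch below shows how they could be applied to get dequantized scores and a top-k list. The helper names dequantize and top_k and the sample values are illustrative, not part of the original script.

```python
def dequantize(raw_scores, scale, zero_point):
    """Map quantized integer scores back to real values: scale * (q - zero_point)."""
    # int(q) avoids unsigned-integer wraparound when q is a NumPy uint8.
    return [scale * (int(q) - zero_point) for q in raw_scores]


def top_k(scores, k=5):
    """Return the indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Stand-in values. In classify.py these would come from:
#   scale, zero_point = output_details[0]['quantization']
#   raw = output[0]
raw = [3, 200, 17, 120]
scores = dequantize(raw, scale=1 / 256, zero_point=0)
for idx in top_k(scores, k=2):
    print(idx, f"{scores[idx]:.2%}")
```

Printing several candidates rather than one makes misclassifications easier to diagnose, since the correct label often appears in second or third place.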
Run:
python3 classify.py
MobileNet V2 takes ~50ms on Pi 5, ~150ms on Pi 4.
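Latency figures like these vary with cooling, power supply, and OS load, so it is worth measuring on your own board. A minimal timing sketch follows. The benchmark helper, its default warm-up and run counts, and the stand-in workload are assumptions for illustration.

```python
import statistics
import time


def benchmark(fn, warmup=5, runs=50):
    """Call fn repeatedly and return latency statistics in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not timed
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }


# Stand-in workload; in classify.py you would pass interpreter.invoke instead.
stats = benchmark(lambda: sum(range(10_000)))
print(stats)
```

The median is reported rather than the mean because occasional slow runs (for example, from thermal throttling) would otherwise skew the result.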
Step 4 — ONNX Runtime for Broader Model Support
If you have ONNX models (for example, exported from PyTorch or scikit-learn), use ONNX Runtime:
pip install onnxruntime
Example inference:
import onnxruntime as ort
import numpy as np
sess = ort.InferenceSession("model.onnx")
input_name = sess.get_inputs()[0].name
output_name = sess.get_outputs()[0].name
# Dummy input
x = np.random.randn(1, 3, 224, 224).astype(np.float32)
output = sess.run([output_name], {input_name: x})
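Classification models exported to ONNX often return raw logits rather than probabilities. Whether yours does depends on how it was exported, so treat this as an assumption. If so, a softmax turns the output into a probability distribution. The sketch below uses plain Python and made-up logits.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Stand-in logits; with the session above you might pass output[0][0].tolist().
probs = softmax([2.0, 1.0, 0.1])
best = max(range(len(probs)), key=probs.__getitem__)
print(best, round(probs[best], 3))
```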
Step 5 — Small LLMs with llama.cpp
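For small language models (TinyLlama 1.1B), use llama.cpp. Recent versions build with CMake, so install cmake and a C++ toolchain first if they are missing:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j 4
Download TinyLlama in GGUF format:
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
Run inference (current builds name the binary llama-cli; older Makefile builds produced ./main in the repository root):
./build/bin/llama-cli -m tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "Hello, how are you?" -n 128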
Expect ~2 tokens/sec on Pi 5 (feasible for chatbots but slow).
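To judge whether that throughput suits your use case, convert it into wall-clock time. The sketch below is simple arithmetic with illustrative function names. It ignores prompt processing time, which adds to the total.

```python
def ms_per_token_to_tps(ms_per_token):
    """Convert a per-token latency in milliseconds to tokens per second."""
    return 1000.0 / ms_per_token


def generation_time_s(n_tokens, tokens_per_sec):
    """Seconds needed to generate n_tokens at a steady rate."""
    return n_tokens / tokens_per_sec


# Time for a 128-token reply at 2 tokens/sec.
print(generation_time_s(128, 2.0))
# A latency of 400 ms/token expressed as tokens/sec.
print(ms_per_token_to_tps(400))
```

At these rates a full 128-token reply takes about a minute, which suits background tasks better than interactive chat.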
Model Performance Benchmark
| Model | Size | Pi 4 Latency | Pi 5 Latency | Notes |
|---|---|---|---|---|
| MobileNet V2 | 3.5MB | 150ms | 50ms | Image classification |
| SqueezeNet | 1.2MB | 80ms | 25ms | Lightweight CNN |
| TinyLlama 1.1B | 650MB | N/A | 400ms/token | Quantized (Q4) |
| YOLO v8n | 6.2MB | 250ms | 80ms | Object detection |
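The table's latencies convert directly into throughput and Pi 4 to Pi 5 speedup. The sketch below copies the vision-model figures from the table. The resulting FPS values are upper bounds, since they exclude image capture and preprocessing.

```python
# Latencies in milliseconds, copied from the table above: (Pi 4, Pi 5).
LATENCY_MS = {
    "MobileNet V2": (150, 50),
    "SqueezeNet": (80, 25),
    "YOLO v8n": (250, 80),
}


def fps(latency_ms):
    """Frames per second achievable at a given per-frame latency."""
    return 1000.0 / latency_ms


def speedup(slow_ms, fast_ms):
    """How many times faster the second latency is than the first."""
    return slow_ms / fast_ms


for name, (pi4, pi5) in LATENCY_MS.items():
    print(f"{name}: Pi 4 {fps(pi4):.1f} FPS, Pi 5 {fps(pi5):.1f} FPS, "
          f"speedup {speedup(pi4, pi5):.2f}x")
```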
Step 6 — Coral USB Accelerator (10x Speedup)
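For latency-sensitive workloads, the Coral USB Accelerator (an Edge TPU) offloads quantized TFLite models from the CPU. PyCoral is distributed from Coral's own package index rather than PyPI, and it also needs the Edge TPU runtime (libedgetpu) installed from Coral's apt repository. The published wheels target specific Python versions, so check that one exists for your interpreter:
pip install --extra-index-url https://google-coral.github.io/py-repo/ pycoral~=2.0
Compile MobileNet for the Edge TPU. The edgetpu_compiler tool runs on x86-64 Linux only, so run it on a desktop machine and copy the resulting _edgetpu.tflite file to the Pi:
edgetpu_compiler mobilenet_v2_1.0_224_quant.tflite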
Then run inference:
from pycoral.adapters import classify
from pycoral.utils.edgetpu import make_interpreter
interpreter = make_interpreter("mobilenet_v2_1.0_224_quant_edgetpu.tflite")
interpreter.allocate_tensors()
# ... same inference pattern
Latency drops to ~5ms on Coral.
Troubleshooting
"No module named tflite_runtime" — Ensure you activated the venv and ran pip install inside it. On armv6 (Pi Zero), pre-built wheels may not exist; build from source or use Docker.
Out of memory — Reduce batch size, quantize models further, or use Pi 5 with 8GB RAM. Monitor with free -h.
Model not found — Use absolute paths: /home/pi/models/model.tflite, not relative paths.
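For the out-of-memory case above, you can check available RAM from Python before loading a model. The sketch below parses /proc/meminfo-style text. The helper names, the sample figures, and the 2x headroom factor are illustrative assumptions rather than measured requirements.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: kilobytes}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def fits_in_memory(model_mb, available_kb, headroom=2.0):
    """True if available RAM covers the model size times a safety factor."""
    return available_kb / 1024 >= model_mb * headroom


# Sample text; on the Pi use parse_meminfo(open("/proc/meminfo").read()).
SAMPLE = "MemTotal:        3884392 kB\nMemAvailable:    2871000 kB\n"
info = parse_meminfo(SAMPLE)
print(fits_in_memory(650, info["MemAvailable"]))
```

MemAvailable is used rather than MemFree because it accounts for cache the kernel can reclaim.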
Summary
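TensorFlow Lite handles most inference workloads efficiently on Pi 4/5. ONNX Runtime adds flexibility; llama.cpp enables small LLMs. For latency-critical applications, the Coral USB Accelerator provides roughly a 10x speedup. Real-world success depends on model choice: keep vision models under about 50MB and expect latencies of tens to hundreds of milliseconds per inference. Quantized LLMs such as TinyLlama are far larger (about 650MB) and generate around 2 tokens per second on a Pi 5.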