Run AI Models on Raspberry Pi

Run AI and machine learning models on a Raspberry Pi using TensorFlow Lite, ONNX Runtime, and llama.cpp. Includes benchmarks.

Andreas · April 12, 2026 · 10 min read

Introduction

Running AI models on a Raspberry Pi is entirely feasible, but expectations matter. Inference (running a pre-trained model to make predictions) is realistic; training is not. This guide covers practical frameworks and real performance numbers for Pi 4 and Pi 5.

Prerequisites

  • Raspberry Pi 4 (2GB+) or Pi 5 (4GB+)
  • Python 3.9+
  • Virtual environment (python3 -m venv)
  • ~1GB free disk space
  • For accelerated inference: Coral USB Accelerator (optional)

What's Realistic

  • Inference: Yes. Run pre-trained models for classification, detection, segmentation.
  • Training: No. RAM and CPU constraints make training impractical on Pi.
  • Model size: Typically 1–200MB. Larger models need quantization or distillation to fit; a 4-bit quantized TinyLlama 1.1B, for example, is about 650MB.
  • Speed: Expect 50–500ms latency depending on model and device.
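
As a rough check on the model-size point above, weight storage is approximately parameter count times bits per weight. The sketch below is illustrative only: `estimate_model_size_mb` is a hypothetical helper, and the 4.5 bits-per-weight figure used for 4-bit quantization is an assumed average, not a measured value.

```python
def estimate_model_size_mb(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in megabytes (10^6 bytes), ignoring file metadata."""
    return num_params * bits_per_weight / 8 / 1_000_000


if __name__ == "__main__":
    # A 1.1B-parameter model at several precisions.
    # 4.5 bits/weight is an assumed average for a 4-bit quantization scheme.
    for label, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("~4-bit", 4.5)]:
        size = estimate_model_size_mb(1_100_000_000, bits)
        print(f"{label:>8}: {size:,.0f} MB")
```

The estimate ignores activations and runtime buffers, so the RAM needed at inference time will be higher than the file size.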

Step 1 — Install TensorFlow Lite Runtime

TensorFlow Lite is the fastest path for most inference tasks. Install the lightweight runtime (no build required):

python3 -m venv ~/tflite_env
source ~/tflite_env/bin/activate
pip install --upgrade pip
pip install tflite-runtime

Verify:

python3 -c "import tflite_runtime.interpreter as tflite; print('TFLite OK')"
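
Pre-built tflite-runtime wheels may not exist for every Python version. If the import check fails, a script can probe for an alternative package that exposes a compatible Interpreter. This is a minimal sketch: `first_available` is a hypothetical helper, and the fallback module names (`ai_edge_litert.interpreter`, `tensorflow.lite`) are assumptions about what might be installed on your system.

```python
import importlib.util

# Candidate modules, in order of preference. Only the first comes from this guide;
# the other two are assumed alternatives that expose a TFLite-style Interpreter.
CANDIDATES = [
    "tflite_runtime.interpreter",
    "ai_edge_litert.interpreter",
    "tensorflow.lite",
]


def first_available(module_names):
    """Return the first importable module name, or None if none can be found."""
    for name in module_names:
        try:
            if importlib.util.find_spec(name) is not None:
                return name
        except ImportError:
            # Raised for dotted names when the parent package is missing.
            continue
    return None
```

Calling `first_available(CANDIDATES)` returns a module name that can then be loaded with `importlib.import_module`, or None if nothing suitable is installed.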

Step 2 — Download a Pre-trained Model

Download MobileNet V2 (image classification, 3.5MB, 8-bit quantized, which suits the Pi's CPU):

mkdir -p ~/models
cd ~/models
wget https://storage.googleapis.com/download.tensorflow.org/models/mobilenet_v2_1.0_224_quant.tflite

Also grab the label file:

wget https://storage.googleapis.com/download.tensorflow.org/models/labels_imagenet_slim.txt
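
A failed download can leave an HTML error page saved under the model's filename. One quick sanity check, sketched below, is to look for the TFLite FlatBuffer file identifier `TFL3` at byte offset 4. `looks_like_tflite` is a hypothetical helper; it confirms only the header, not that the model is complete or valid.

```python
from pathlib import Path


def looks_like_tflite(path) -> bool:
    """Return True if the file carries the 'TFL3' identifier at bytes 4-7."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        header = f.read(8)
    return len(header) == 8 and header[4:8] == b"TFL3"


if __name__ == "__main__":
    model = Path.home() / "models" / "mobilenet_v2_1.0_224_quant.tflite"
    print(f"{model}: {'OK' if looks_like_tflite(model) else 'missing or not a TFLite file'}")
```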

Step 3 — Image Classification Script

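Create classify.py in ~/models, next to the model and label file, and put a test image named test.jpg there too. The script needs Pillow and NumPy (pip install pillow numpy):

import numpy as np
import tflite_runtime.interpreter as tflite
from PIL import Image

# Load the model and allocate its tensors
interpreter = tflite.Interpreter(model_path="mobilenet_v2_1.0_224_quant.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Load the image as 3-channel RGB and resize it to the model's 224x224 input
img = Image.open("test.jpg").convert("RGB").resize((224, 224))
input_data = np.expand_dims(np.array(img, dtype=np.uint8), axis=0)  # shape (1, 224, 224, 3)

# Run inference
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()

# Dequantize the uint8 scores: real = scale * (raw - zero_point)
output = interpreter.get_tensor(output_details[0]['index'])[0]
scale, zero_point = output_details[0]['quantization']
scores = scale * (output.astype(np.float32) - zero_point)
top_class = int(np.argmax(scores))
confidence = scores[top_class]

# Load labels (one per line, in class-index order)
with open("labels_imagenet_slim.txt") as f:
    labels = [line.strip() for line in f]

print(f"Class: {labels[top_class]} ({confidence:.2%})")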

Run:

python3 classify.py
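
The script prints only the single best class. A quantized model stores scores as integers that map to real values via scale * (raw - zero_point), with both parameters available from output_details[0]['quantization']. The sketch below applies that mapping and ranks the top k classes using plain Python lists. `dequantize` and `top_k` are hypothetical helpers, and the sample scores and labels are made up for illustration.

```python
def dequantize(raw_scores, scale, zero_point):
    """Map integer scores to real values: real = scale * (raw - zero_point)."""
    return [scale * (q - zero_point) for q in raw_scores]


def top_k(scores, labels, k=5):
    """Return the k highest-scoring (label, score) pairs, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(labels[i], scores[i]) for i in ranked[:k]]


if __name__ == "__main__":
    raw = [3, 200, 40, 13]                                 # made-up uint8 outputs
    labels = ["background", "tabby", "tiger cat", "lynx"]  # made-up labels
    probs = dequantize(raw, scale=1 / 256, zero_point=0)   # assumed quantization parameters
    for name, p in top_k(probs, labels, k=3):
        print(f"{name}: {p:.2%}")
```

With a NumPy output array, `output[0].tolist()` produces a list these helpers accept.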

MobileNet V2 takes ~50ms on Pi 5, ~150ms on Pi 4.
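
Latency depends on the board, cooling, and OS, so it is worth measuring on your own device. The following is a minimal timing sketch, assuming a zero-argument callable; `benchmark_ms` is a hypothetical helper, and the warm-up and run counts are arbitrary defaults.

```python
import statistics
import time


def benchmark_ms(fn, warmup=5, runs=50):
    """Time a zero-argument callable and return latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not recorded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
        "runs": runs,
    }


if __name__ == "__main__":
    # Stand-in workload; replace with the real inference call.
    print(benchmark_ms(lambda: sum(range(100_000))))
```

In classify.py, passing `interpreter.invoke` as `fn` (after `set_tensor`) would time only the inference call, excluding image loading and preprocessing.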

Step 4 — ONNX Runtime for Broader Model Support

If you have ONNX models (PyTorch, scikit-learn exports), use ONNX Runtime:

pip install onnxruntime

Example inference:

import onnxruntime as ort
import numpy as np

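# Load the model; the CPU execution provider is the one available on a stock Pi
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
output_name = sess.get_outputs()[0].name

# Dummy input in NCHW layout (batch, channels, height, width); match your model's input shape
x = np.random.randn(1, 3, 224, 224).astype(np.float32)
output = sess.run([output_name], {input_name: x})[0]  # run() returns a list, one array per requested output
print(output.shape)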
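
Many classification models exported to ONNX return raw logits rather than probabilities, although this depends on how the model was exported. If yours does, a softmax converts the scores. The sketch below uses only the standard library and operates on a plain list; `softmax` here is an illustrative helper, not part of ONNX Runtime.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    probs = softmax([2.0, 1.0, 0.1])  # made-up logits
    best = probs.index(max(probs))
    print(f"class {best} with probability {probs[best]:.2%}")
```

For a NumPy result of shape (1, num_classes), `result[0].tolist()` gives a list this function accepts.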

Step 5 — Small LLMs with llama.cpp

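For small language models (such as TinyLlama 1.1B), use llama.cpp. Current versions build with CMake rather than a plain make; if cmake is missing, install it with sudo apt install cmake.

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j 4

Download TinyLlama in GGUF format:

wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

Run inference (the binary formerly named main is now llama-cli):

./build/bin/llama-cli -m tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "Hello, how are you?" -n 128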

Expect roughly 2–2.5 tokens/sec on Pi 5 (about 400ms per token): usable for a chatbot, but slow.
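
To see what that rate means in practice, the arithmetic below converts per-token latency into throughput and total generation time. It uses the 400ms/token figure from the benchmark table and the 128-token limit from the command above. The helper names are hypothetical, and prompt processing time is ignored.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert per-token latency in milliseconds to throughput."""
    return 1000.0 / ms_per_token


def generation_seconds(num_tokens: int, ms_per_token: float) -> float:
    """Estimate wall-clock time to generate num_tokens, ignoring prompt processing."""
    return num_tokens * ms_per_token / 1000.0


if __name__ == "__main__":
    rate = tokens_per_second(400)
    total = generation_seconds(128, 400)
    print(f"{rate:.1f} tokens/sec, about {total:.0f} s for 128 tokens")
```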

Model Performance Benchmark

Model            Size     Pi 4 Latency   Pi 5 Latency   Notes
MobileNet V2     3.5MB    150ms          50ms           Image classification
SqueezeNet       1.2MB    80ms           25ms           Lightweight CNN
TinyLlama 1.1B   650MB    N/A            400ms/token    Quantized (Q4)
YOLO v8n         6.2MB    250ms          80ms           Object detection
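
The latencies in the table translate directly into frames per second and a Pi 4 to Pi 5 speedup. The sketch below derives both from the table's numbers; TinyLlama is left out because it has no Pi 4 figure. The helper names are hypothetical.

```python
# Latencies in milliseconds, copied from the benchmark table above.
LATENCY_MS = {
    "MobileNet V2": {"pi4": 150, "pi5": 50},
    "SqueezeNet": {"pi4": 80, "pi5": 25},
    "YOLO v8n": {"pi4": 250, "pi5": 80},
}


def speedup(pi4_ms: float, pi5_ms: float) -> float:
    """How many times faster the Pi 5 is than the Pi 4 for one model."""
    return pi4_ms / pi5_ms


def frames_per_second(latency_ms: float) -> float:
    """Upper bound on throughput if inference is the only cost per frame."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    for model, ms in LATENCY_MS.items():
        print(
            f"{model}: {speedup(ms['pi4'], ms['pi5']):.1f}x faster on Pi 5, "
            f"up to {frames_per_second(ms['pi5']):.0f} FPS"
        )
```

Real pipelines also spend time on capture and preprocessing, so achieved frame rates will be lower than this bound.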

Step 6 — Coral USB Accelerator (10x Speedup)

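For latency-critical workloads, the Coral USB Accelerator (an Edge TPU coprocessor) offloads quantized TFLite models from the CPU. PyCoral is distributed through Coral's own package index rather than PyPI, and it needs the Edge TPU runtime (libedgetpu) installed first; see the Coral getting-started guide for the apt repository setup. PyCoral wheels target older Python versions, so check that one exists for your interpreter.

pip install --extra-index-url https://google-coral.github.io/py-repo/ pycoral~=2.0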

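Compile MobileNet for the Edge TPU. The edgetpu_compiler tool runs on x86-64 Debian-based Linux, not on the Pi itself, so compile on a desktop machine and copy the resulting mobilenet_v2_1.0_224_quant_edgetpu.tflite to the Pi:

edgetpu_compiler mobilenet_v2_1.0_224_quant.tflite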

Then run inference:

from pycoral.adapters import classify
from pycoral.utils.edgetpu import make_interpreter

interpreter = make_interpreter("mobilenet_v2_1.0_224_quant_edgetpu.tflite")
interpreter.allocate_tensors()

# ... same inference pattern

Latency drops to ~5ms on Coral.
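
If the interpreter fails to find the accelerator, it helps to confirm that the device is visible on the USB bus. The sketch below parses lsusb output for the two USB IDs the Coral USB Accelerator is commonly reported under (one before the runtime initializes it, one after). Treat these IDs as an assumption and compare them against your own lsusb output. `coral_ids_in` and `coral_attached` are hypothetical helpers.

```python
import subprocess

# Assumed vendor:product IDs for the Coral USB Accelerator
# (before and after the Edge TPU runtime initializes the device).
CORAL_USB_IDS = {"1a6e:089a", "18d1:9302"}


def coral_ids_in(lsusb_output: str):
    """Return the Coral USB IDs found in lsusb-style text, in order of appearance."""
    found = []
    for line in lsusb_output.splitlines():
        parts = line.split()
        # Lines look like: "Bus 001 Device 004: ID 18d1:9302 Google Inc."
        if "ID" in parts:
            idx = parts.index("ID")
            if idx + 1 < len(parts) and parts[idx + 1].lower() in CORAL_USB_IDS:
                found.append(parts[idx + 1].lower())
    return found


def coral_attached() -> bool:
    """Run lsusb and report whether a Coral ID is present; False if lsusb is unavailable."""
    try:
        result = subprocess.run(["lsusb"], capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return False
    return bool(coral_ids_in(result.stdout))
```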

Troubleshooting

"No module named tflite_runtime" — Ensure you activated the venv and ran pip install inside it. On armv6 (Pi Zero), pre-built wheels may not exist; build from source or use Docker.

Out of memory — Reduce batch size, quantize models further, or use Pi 5 with 8GB RAM. Monitor with free -h.
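
The same check that free -h provides can be done from Python before loading a model. On Linux, /proc/meminfo reports MemAvailable in kB. The sketch below parses that text and compares it with a model's size. `mem_available_mb` and `fits_in_memory` are hypothetical helpers, and the 1.5x headroom factor is an arbitrary assumption to leave room for activations and runtime buffers.

```python
def mem_available_mb(meminfo_text: str) -> float:
    """Extract MemAvailable from /proc/meminfo-style text and return it in MB (1024 kB)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / 1024  # the value is reported in kB
    raise ValueError("MemAvailable not found")


def fits_in_memory(model_mb: float, meminfo_text: str, headroom: float = 1.5) -> bool:
    """Check whether model_mb times an assumed headroom factor fits in available memory."""
    return model_mb * headroom <= mem_available_mb(meminfo_text)
```

On the Pi, read the text with `open("/proc/meminfo").read()` and pass it in along with the model's file size in MB.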

Model not found — Use absolute paths: /home/pi/models/model.tflite, not relative paths.
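
One way to avoid path surprises is to resolve model names against a fixed directory and fail early with the full path in the error message. This is a minimal sketch: `resolve_model_path` is a hypothetical helper, and the `~/models` default matches the directory used earlier in this guide.

```python
from pathlib import Path


def resolve_model_path(name: str, models_dir="~/models") -> Path:
    """Return the absolute path to a model file, or raise with the full path if it is missing."""
    path = (Path(models_dir).expanduser() / name).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Model file not found: {path}")
    return path


if __name__ == "__main__":
    try:
        print(resolve_model_path("mobilenet_v2_1.0_224_quant.tflite"))
    except FileNotFoundError as err:
        print(err)
```

The returned path can be passed as `str(path)` to the interpreter's `model_path` argument.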

Summary

TensorFlow Lite handles most inference workloads efficiently on Pi 4/5. ONNX Runtime adds flexibility; llama.cpp enables small LLMs. For latency-critical applications, the Coral USB Accelerator provides roughly a 10x speedup. Real-world success depends on model choice: for vision models, stay under about 50MB and expect latencies in the 50–500ms range; small quantized LLMs such as TinyLlama are larger (about 650MB) and run at around 400ms per token on Pi 5.
