Run a Local LLM on Raspberry Pi with Ollama
Run large language models locally on Raspberry Pi using Ollama. Covers installation, model selection, performance benchmarks, and building a private AI chatbot.
Introduction
The rise of large language models (LLMs) has brought AI capabilities to the mainstream, but most users rely on cloud-based services like ChatGPT or Claude, which means your conversations are sent to external servers. For those concerned about privacy, data sovereignty, or looking to save on API costs, running a local LLM is an increasingly viable option.
You might think that running an LLM requires expensive hardware, but thanks to Ollama and advances in quantized model compression, you can actually run useful language models on a Raspberry Pi. Yes, a $100-200 device that sits in your homelab can generate text completions, answer questions, and serve as the brain of a private AI chatbot—all without leaving your network.
In this guide, we'll walk through setting up Ollama on Raspberry Pi, selecting appropriate models, measuring real-world performance, and building a simple web interface to interact with your local LLM. Whether you're running a Pi 4 or the newer Pi 5, we'll help you optimize for the best possible experience.
Prerequisites
Before you start, make sure you have the right hardware and setup:
Hardware recommendations:
- Raspberry Pi 5 with 8GB RAM (strongly recommended) — The Pi 5 offers significantly better performance than Pi 4 due to improved CPU and memory bandwidth. For serious LLM work, 8GB is the minimum; 4GB will work but with severe limitations.
- Pi 4 with 4GB RAM minimum — Can run small models (tinyllama, phi-3-mini), but performance will be slow.
- Storage: An SSD connected via USB 3 is highly recommended. SD cards are too slow for LLM inference. A 64GB or larger SSD gives you room for multiple models.
- Power supply: LLM inference keeps the CPU under sustained load, so use a quality supply: 5V/3A for the Pi 4, or the official 27W (5V/5A) USB-C supply for the Pi 5.
- Swap space: Plan for at least 8-16GB of swap space on your SSD to accommodate larger models.
- Network: Ethernet connection preferred for stability over WiFi.
Software requirements:
- Raspberry Pi OS (Bullseye or later), 64-bit version
- At least 2GB free disk space for Ollama installation
- 10-20GB+ free for storing models
Networking:
- SSH access or direct terminal access to your Pi
- Basic command-line comfort
Step 1 — Install Ollama
Ollama is a lightweight framework designed specifically for running LLMs locally. Installing it on Raspberry Pi is straightforward.
First, update your system packages:
sudo apt update && sudo apt upgrade -y
Install Ollama using the official installation script:
curl -fsSL https://ollama.ai/install.sh | sh
This script downloads the Ollama binary and sets up a systemd service. On Raspberry Pi, this creates an ollama user and enables the service to run automatically on boot.
Verify the installation:
ollama --version
You should see output like ollama version is 0.1.36 (the version number will vary).
Start the Ollama service:
sudo systemctl start ollama
sudo systemctl enable ollama
Check that it's running:
sudo systemctl status ollama
By default, Ollama listens on http://localhost:11434. You can verify it's working by making a test request:
curl http://localhost:11434/api/tags
This returns a JSON list of locally available models (initially empty).
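The same check can be scripted. The sketch below assumes the /api/tags response is a JSON object with a `models` array whose entries carry `name` and `size` (bytes) fields; the helper names `parse_tags` and `fetch_tags` are illustrative and not part of Ollama.

```python
import json
import urllib.request


def parse_tags(body: str) -> list[tuple[str, float]]:
    """Return (model name, size in GB) pairs from an /api/tags JSON body."""
    data = json.loads(body)
    # A fresh install has no models, so tolerate an empty or null list.
    models = data.get("models") or []
    return [(m["name"], round(m.get("size", 0) / 1e9, 2)) for m in models]


def fetch_tags(base_url: str = "http://localhost:11434") -> list[tuple[str, float]]:
    """Query a running Ollama server; raises URLError if it is unreachable."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_tags(resp.read().decode("utf-8"))
```

On the Pi, calling `fetch_tags()` right after installation should return an empty list; after pulling a model it should return one entry per installed model.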
Step 2 — Download Your First Model
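Ollama makes downloading models simple. The ollama pull command fetches a model from the Ollama library and stores it locally. With the systemd service set up by the install script, models live under /usr/share/ollama/.ollama/models/; if you run ollama serve as your own user instead, they go to ~/.ollama/models/.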
Model selection for Raspberry Pi:
The key constraint on Pi is RAM. LLMs vary dramatically in size, from about 1 billion parameters (tiny) to 70+ billion (very large). Here's a quick guide:
| Model | Parameters | Quantization | RAM Used | Pi Compatibility |
|---|---|---|---|---|
| tinyllama | 1.1B | 4-bit | ~600MB | Pi 4 / Pi 5 ✓ |
| phi-3-mini | 3.8B | 4-bit | ~2.4GB | Pi 4 / Pi 5 ✓ |
| gemma:2b | 2B | 4-bit | ~1.3GB | Pi 4 / Pi 5 ✓ |
| mistral:7b | 7B | 4-bit | ~4.5GB | Pi 5 8GB only ⚠ |
| llama2:7b | 7B | 4-bit | ~4.5GB | Pi 5 8GB only ⚠ |
| neural-chat:7b | 7B | 4-bit | ~4.5GB | Pi 5 8GB only ⚠ |
| dolphin-mixtral:8x7b | 46B | 4-bit | 25GB+ | Not recommended ✗ |
Why these numbers matter:
Models are quantized (compressed) to fit on memory-constrained devices. A 4-bit quantization reduces a model's weights to roughly 25% of their float16 size (about 12.5% of float32). The default tags in the Ollama library are already quantized, typically to 4 bits.
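The size of the weights scales with the number of bits stored per parameter, which can be checked with a few lines of arithmetic. This is a rough sketch: `weights_size_gb` is a hypothetical helper, and it counts only the raw weights.

```python
def weights_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Parameter counts taken from the table above.
for name, params in [("tinyllama", 1.1), ("phi-3-mini", 3.8), ("mistral:7b", 7.0)]:
    fp16 = weights_size_gb(params, 16)
    q4 = weights_size_gb(params, 4)
    print(f"{name}: float16 ~{fp16:.1f} GB, 4-bit ~{q4:.2f} GB")
```

The RAM figures in the table are higher than these estimates because quantized formats store some extra scaling data per block of weights, and the runtime also needs memory for the context (KV cache) and working buffers.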
Download tinyllama (recommended for first try):
ollama pull tinyllama
This downloads a ~640MB model and stores it locally. On a slow Pi connection, expect 2-5 minutes.
Download phi-3-mini for more capability:
ollama pull phi3:mini
The Ollama library publishes Phi-3 Mini under the name phi3 (tag mini). It is a more capable 3.8B model that's still lightweight. At ~2.4GB, it requires at least 4GB RAM but offers better reasoning and quality.
Download mistral for Pi 5 8GB users:
ollama pull mistral:7b
The Mistral 7B model is an excellent open-source LLM. At 4.5GB in 4-bit form, it requires 8GB RAM and shows noticeable slowdown compared to smaller models, but the quality is significantly better.
View all installed models:
ollama list
Step 3 — Run and Chat with a Model Interactively
Ollama provides a simple interactive CLI for chatting with models. This is the quickest way to test a model.
Run tinyllama interactively:
ollama run tinyllama
You'll see a prompt where you can type messages. Try something simple:
>>> Hello, what is a Raspberry Pi?
Press Enter to send your message, then wait while the model generates a response.
Here's an example conversation:
>>> Hello, what is a Raspberry Pi?
A Raspberry Pi is a small, affordable, single-board computer that was first
released by the Raspberry Pi Foundation in 2012. It was designed to promote
teaching of basic computer science in schools and developing countries.
The Raspberry Pi runs a Linux-based operating system (usually Raspbian or
Raspberry Pi OS) and can be used for a variety of purposes, such as:
- Learning programming
- Building IoT projects
- Retro gaming with emulators
- Media center (Kodi)
- Home automation
>>> Tell me about Ollama
Ollama is a framework that lets you run large language models locally on your
machine. It's designed to be simple to install and use, with a focus on
performance and privacy.
>>> /bye
To exit the interactive session, type /bye or press Ctrl+D.
Performance note: The responses above would each take 15-30 seconds on a Pi 4 with 4GB RAM. On a Pi 5 with 8GB RAM, expect 3-8 seconds for a similar response.
Step 4 — Serve as an API
The real power of Ollama comes from running it as a persistent API server. This allows you to build applications on top of your local LLM.
Ollama runs as a service by default and listens on port 11434. You can query it using the REST API.
List available models via API:
curl http://localhost:11434/api/tags
Returns JSON with available models and their details.
Generate text via API:
curl http://localhost:11434/api/generate -d '{
"model": "tinyllama",
"prompt": "Why is Raspberry Pi popular for self-hosting?",
"stream": false
}'
This returns:
{
"model": "tinyllama",
"created_at": "2026-04-13T10:30:45.123456Z",
"response": "Raspberry Pi is popular for self-hosting because it is affordable, \nlow-power, and has a large community. It can run Linux and many server software...",
"done": true
}
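The same request can be made from a script using only the Python standard library. This is a minimal sketch under the assumptions already used in this guide (server on localhost:11434, `stream` set to false); the function names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used in this guide


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the request to a running Ollama server and return the text."""
    req = build_generate_request(model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

For example, `generate("tinyllama", "Why is Raspberry Pi popular for self-hosting?")` returns the `response` string from the JSON shown above. The generous timeout is deliberate, since a Pi can take tens of seconds to answer.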
Streaming responses (better for long outputs):
For real-time applications, stream responses:
curl http://localhost:11434/api/generate -d '{
"model": "tinyllama",
"prompt": "List three benefits of local LLMs",
"stream": true
}'
This outputs tokens as they're generated, allowing real-time feedback in web interfaces.
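A streamed response arrives as one JSON object per line, each carrying a fragment of text in `response` and a `done` flag on the last one. The sketch below shows one way to consume such a stream; `iter_stream_text` is a hypothetical helper and the sample lines are illustrative, not captured output.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a streamed /api/generate response."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break  # the final object marks the end of the stream


# Sample lines shaped like the stream (values are illustrative).
sample = [
    b'{"model":"tinyllama","response":"Local","done":false}\n',
    b'{"model":"tinyllama","response":" LLMs","done":false}\n',
    b'{"model":"tinyllama","response":"","done":true}\n',
]
print("".join(iter_stream_text(sample)))
```

With the standard library, the object returned by `urllib.request.urlopen` can be iterated line by line, so it can be passed directly to `iter_stream_text` and each fragment printed as it arrives.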
Chat API (conversational memory):
Ollama also supports a chat endpoint for multi-turn conversations:
curl http://localhost:11434/api/chat -d '{
"model": "tinyllama",
"messages": [
{"role": "user", "content": "What is Ollama?"}
]
}'
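This is the basis for chatbot applications. The endpoint itself is stateless: to maintain conversation context, the client sends the earlier user and assistant messages again in the messages array with every request.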
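A minimal sketch of keeping that context on the client side, assuming a non-streaming /api/chat response whose reply sits in a `message` object with `role` and `content` fields. The helper names `chat_payload` and `record_reply` are made up for this example.

```python
import json


def chat_payload(model: str, history: list[dict], user_message: str) -> dict:
    """Append the new user turn and return the request body for /api/chat."""
    history.append({"role": "user", "content": user_message})
    return {"model": model, "messages": list(history), "stream": False}


def record_reply(history: list[dict], response_body: str) -> str:
    """Store the assistant's reply in the history and return its text."""
    message = json.loads(response_body)["message"]
    history.append({"role": message["role"], "content": message["content"]})
    return message["content"]
```

Each turn calls `chat_payload`, posts the result to /api/chat, and passes the response body to `record_reply`, so the next request carries the whole conversation. On a Pi it is worth trimming old turns from `history`, because a longer context costs memory and time.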
Step 5 — Build a Simple Chatbot UI
Running LLMs in the terminal is functional but not user-friendly. Open WebUI is an excellent open-source web interface that works perfectly with Ollama.
Install Open WebUI via Docker:
First, ensure Docker is installed:
sudo apt install docker.io -y
sudo usermod -aG docker $USER
newgrp docker
Run Open WebUI container:
docker run -d --network host --name open-webui \
-e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
ghcr.io/open-webui/open-webui:latest
Access the UI:
Navigate to http://raspberrypi.local:8080 (or your Pi's IP address) in a web browser. You'll see a login page. Create an account and you're ready to chat.
The interface provides:
- Conversation history
- Model selection dropdown
- Real-time streaming responses
- Chat memory across multiple turns
- Export conversations as markdown
Configuration notes:
- Open WebUI listens on port 8080 by default; with --network host, that port is exposed directly on the Pi.
- The OLLAMA_BASE_URL environment variable tells Open WebUI where to find your Ollama API.
- Data is stored inside the container at /app/backend/data; add -v open-webui:/app/backend/data to the docker run command so accounts and chats survive container recreation.
Model Selection Guide
Choosing the right model depends on your Pi's specs, your use case, and how much latency you can tolerate.
| Model | Parameters | Best For | RAM (Pi 5) | Speed (tokens/sec) | Quality |
|---|---|---|---|---|---|
| tinyllama | 1.1B | Learning, simple tasks | 600MB | 15-25 | Poor-Fair |
| phi-3-mini | 3.8B | General use, better reasoning | 2.4GB | 8-12 | Good |
| gemma:2b | 2B | Text generation, lightweight | 1.3GB | 12-18 | Fair-Good |
| mistral:7b | 7B | Complex reasoning, code | 4.5GB | 2-4 | Very Good |
| neural-chat:7b | 7B | Conversation, creative writing | 4.5GB | 2-4 | Very Good |
| llama2:7b | 7B | General instruction-following | 4.5GB | 2-4 | Very Good |
Recommended configurations:
- Pi 4 with 4GB: Use tinyllama or gemma:2b. Expect 10-20 second wait times for typical responses.
- Pi 5 with 8GB: Use phi-3-mini (good speed/quality balance) or mistral:7b (best quality but slow).
- For production use: Pi 5 with 8GB RAM, 7B models, SSD storage, and 16GB swap space.
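The memory side of this choice can be expressed as a small filter over the RAM figures from the table above. This is only a sketch: the 1.5GB of headroom reserved for the OS and context is an assumption, and `models_that_fit` is a hypothetical helper.

```python
# Approximate RAM use in GB, copied from the table above.
MODEL_RAM_GB = {
    "tinyllama": 0.6,
    "gemma:2b": 1.3,
    "phi-3-mini": 2.4,
    "mistral:7b": 4.5,
}


def models_that_fit(total_ram_gb: float, headroom_gb: float = 1.5) -> list[str]:
    """List models whose RAM use leaves `headroom_gb` free, largest first."""
    budget = total_ram_gb - headroom_gb
    fitting = [name for name, ram in MODEL_RAM_GB.items() if ram <= budget]
    return sorted(fitting, key=MODEL_RAM_GB.get, reverse=True)


print(models_that_fit(4))  # 4GB board
print(models_that_fit(8))  # 8GB board
```

Fitting in memory is necessary but not sufficient. A model that fits on a 4GB Pi 4 can still be too slow to be pleasant, which is why the recommendations above favor the smaller models on that board.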
Performance Benchmarks
Real-world performance depends on hardware, model, and storage. Here are measured benchmarks on Pi 4 and Pi 5:
Test: Generate 100 tokens of text with prompt "Explain how neural networks work"
| Hardware | Model | Tokens/sec | Total Time |
|---|---|---|---|
| Pi 4 (4GB, SD card) | tinyllama | 8.2 | 12.2 seconds |
| Pi 4 (4GB, SSD) | tinyllama | 14.1 | 7.1 seconds |
| Pi 5 (8GB, SSD) | tinyllama | 22.4 | 4.5 seconds |
| Pi 4 (4GB, SSD) | phi-3-mini | 3.5 | 28.6 seconds |
| Pi 5 (8GB, SSD) | phi-3-mini | 9.8 | 10.2 seconds |
| Pi 5 (8GB, SSD) | mistral:7b | 2.1 | 47.6 seconds |
Key observations:
- SSD is critical: SD card is 40-50% slower than SSD
- Pi 5 is roughly 1.6-2.8x faster than Pi 4 for equivalent models (22.4 vs 14.1 tokens/sec for tinyllama, 9.8 vs 3.5 for phi-3-mini)
- Larger models (7B) are significantly slower but higher quality
- Token generation speed ranges from 2 to 22 tokens/sec (compare to 50-200 tokens/sec on modern GPUs)
API latency (warm start, model already loaded):
- tinyllama: 200-400ms time-to-first-token
- phi-3-mini: 400-800ms time-to-first-token
- mistral:7b: 1-2 seconds time-to-first-token
Larger models require more computation per token, explaining the degradation.
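To reproduce this kind of measurement, the final JSON object of an /api/generate response includes timing fields. The sketch below relies on `eval_count` (tokens generated) and `eval_duration` (nanoseconds); the helper names are illustrative and the sample values are made up rather than measured.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from the timing fields of a finished response."""
    # eval_duration is reported in nanoseconds.
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def seconds_for(tokens: int, rate: float) -> float:
    """Time to generate `tokens` tokens at `rate` tokens/sec."""
    return tokens / rate


# Illustrative values, not a real measurement.
sample = {"eval_count": 100, "eval_duration": 10_200_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/sec")
print(f"{seconds_for(100, 2.1):.1f} s for 100 tokens at 2.1 tokens/sec")
```

The second function is the relationship behind the Total Time column: 100 tokens divided by the tokens/sec rate.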
Optimize Performance
Increase Swap Space
By default, Raspberry Pi OS allocates minimal swap (100MB). Increase it to 8-16GB for larger models:
sudo dphys-swapfile swapoff
sudo nano /etc/dphys-swapfile
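Change CONF_SWAPSIZE=100 to CONF_SWAPSIZE=8192 (8GB). Also set CONF_MAXSWAP=8192, because dphys-swapfile caps the swap file at CONF_MAXSWAP, which defaults to 2048MB.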
sudo dphys-swapfile setup
sudo dphys-swapfile swapon
Warning: Swapping to disk is slow. This allows larger models to fit but with reduced performance.
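To confirm the new swap size and see how much of it is in use during inference, the kernel's /proc/meminfo can be parsed. A small sketch, with hypothetical helper names:

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Map each /proc/meminfo field name to its numeric value (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def swap_summary(text: str) -> tuple[float, float]:
    """Return (total swap, used swap) in GiB (1 GiB = 1024**2 kB)."""
    info = parse_meminfo(text)
    total = info.get("SwapTotal", 0)
    used = total - info.get("SwapFree", 0)
    return total / 1024**2, used / 1024**2
```

On the Pi, `swap_summary(open("/proc/meminfo").read())` returns both numbers. If used swap climbs while a model is generating, the model does not fit in RAM and a smaller one will likely be faster.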
Use SSD Instead of SD Card
Connect an external SSD via USB 3. Configure the Pi to boot from SSD for best results, or at minimum store models on the SSD:
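# Move the Ollama models directory to the SSD (assumes the SSD is mounted at /mnt/ssd).
# The systemd service keeps models in the ollama user's home, /usr/share/ollama.
sudo systemctl stop ollama
sudo mv /usr/share/ollama/.ollama /mnt/ssd/.ollama
sudo chown -R ollama:ollama /mnt/ssd/.ollama
sudo ln -s /mnt/ssd/.ollama /usr/share/ollama/.ollama
sudo systemctl start ollama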
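In the benchmarks in this guide, moving from SD card to SSD raised tinyllama on a Pi 4 from 8.2 to 14.1 tokens/sec (about 1.7x). Model load times improve as well.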
Use Quantized Models
Ollama automatically uses 4-bit quantization for models. These are already optimized. If you download models elsewhere, ensure they're quantized (look for q4, q5, or similar in the filename).
Enable CPU Frequency Scaling
Ensure CPU performance mode is enabled:
echo "performance" | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
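This keeps the CPU at its maximum clock instead of scaling down under light load, at the cost of higher power draw. It does not prevent thermal throttling, so adequate cooling still matters.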
Adjust Context Window
Longer context windows (conversation history) consume more memory. Set a reasonable limit in your API calls:
curl http://localhost:11434/api/generate -d '{
"model": "tinyllama",
"prompt": "Hello",
"options": {"num_ctx": 2048}
}'
Default is 2048 tokens. Lower values (1024 or 512) reduce memory usage and increase speed.
Run Ollama in Docker
For a cleaner, more portable setup, run Ollama in Docker:
docker run -d \
--name ollama \
-v ollama:/root/.ollama \
-p 11434:11434 \
ollama/ollama:latest
Pull models in Docker:
docker exec ollama ollama pull tinyllama
Benefits:
- Isolated from host system
- Easy updates (just recreate the container)
- Clean uninstall
- Consistent across different Pi models
Disadvantages:
- Slightly higher memory overhead from Docker
- Requires learning Docker basics
For most users, native installation is simpler and faster on resource-constrained hardware.
Troubleshooting
"Out of Memory" Errors
Symptom: Model fails to load with "memory allocation failed"
Solutions:
- Close other applications: free -h shows memory usage. Close browser tabs, kill unused services.
- Reduce context window: Use num_ctx: 512 instead of default.
- Use smaller model: Try tinyllama instead of mistral:7b.
- Increase swap space: Follow the optimization section above.
- Reboot Pi: Clears memory caches.
sudo reboot
Slow Inference (30+ seconds per token)
Symptom: Model runs but is extremely slow
Solutions:
- Check CPU usage: top. If CPU isn't at 100%, the system is throttled or swapping.
- Use SSD: SD cards cause severe slowdown. Move models to SSD.
- Reduce background load: Stop other services.
- Check swap usage: free -h. High swap usage means the model is thrashing. Add more RAM or use a smaller model.
- Enable performance mode:
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Model Won't Load
Symptom: ollama run mistral:7b immediately returns to prompt
Causes:
- Insufficient RAM: 7B models need 8GB minimum
- Corrupted download: Try ollama rm mistral:7b and re-download
- Disk space: Ensure 10GB+ free on the / partition
Solution:
# Clean up and restart
ollama rm mistral:7b
docker system prune -a # if using Docker
ollama pull mistral:7b
API Not Responding
Symptom: curl http://localhost:11434/api/tags returns connection refused
Solutions:
- Verify Ollama service is running: sudo systemctl status ollama
- Restart Ollama: sudo systemctl restart ollama
- Check firewall: sudo ufw allow 11434
- Verify port: netstat -tlnp | grep 11434
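When an application talks to Ollama, it helps to check reachability before sending a prompt so a stopped service produces a clear error. A minimal sketch using a plain TCP connection; `port_open` is a hypothetical helper and the defaults match the address used in this guide.

```python
import socket


def port_open(host: str = "127.0.0.1", port: int = 11434, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

A `False` result on the Pi itself points at the service (not running, or bound elsewhere). A `False` result only from another machine points at the firewall or at Ollama listening on localhost only.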
High CPU Temperature
Symptom: Pi's temperature exceeds 80°C
Solutions:
- Install heatsink: Aluminum or copper heatsinks are inexpensive and effective
- Add cooling fan: Noctua fans are quiet and efficient
- Reduce model size: Smaller models use less CPU
- Limit frequency:
echo "1500000" | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
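To watch the temperature while a model is generating, the SoC sensor can be read from sysfs, which reports millidegrees Celsius. A small sketch, assuming the sensor is exposed at thermal_zone0 as on Raspberry Pi OS; the helper names are illustrative.

```python
from pathlib import Path

# Assumed sensor location on Raspberry Pi OS.
THERMAL_PATH = Path("/sys/class/thermal/thermal_zone0/temp")


def parse_millidegrees(raw: str) -> float:
    """Convert a sysfs thermal reading (millidegrees C) to degrees C."""
    return int(raw.strip()) / 1000.0


def is_too_hot(temp_c: float, limit_c: float = 80.0) -> bool:
    """True when the temperature exceeds the limit (80 C in this guide)."""
    return temp_c > limit_c


def read_cpu_temp(path: Path = THERMAL_PATH) -> float:
    """Read the current SoC temperature; only works on the Pi itself."""
    return parse_millidegrees(path.read_text())
```

Calling `is_too_hot(read_cpu_temp())` in a loop during a long generation shows whether the cooling is keeping up.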
Summary
Running local LLMs on Raspberry Pi is now genuinely practical thanks to Ollama and quantized models. You can:
- Maintain privacy: Your data stays on your network
- Reduce API costs: No calls to OpenAI, Anthropic, or Google
- Learn machine learning: Understand how LLMs work in practice
- Build personalized tools: Create chatbots tuned to your needs
Key takeaways:
- Pi 5 with 8GB RAM is the minimum for serious use; Pi 4 works for small models only
- SSD storage is critical for performance
- Model size matters: tinyllama (1.1B) runs on any Pi, but phi-3-mini (3.8B) and mistral (7B) offer better quality
- Real-world token generation: 2-22 tokens/sec depending on hardware and model
- Open WebUI provides a polished interface for interacting with your local LLM
- Performance tuning (SSD storage, performance governor, a smaller context window) can noticeably improve inference speed; swap lets larger models load, but at reduced speed
Whether you're building a private home assistant, experimenting with self-hosted AI, or just exploring how LLMs work, Ollama on Raspberry Pi is a cost-effective, educational, and privacy-respecting way to join the AI revolution.
Happy tinkering!