Run a Local LLM on Raspberry Pi with Ollama

Run large language models locally on Raspberry Pi using Ollama. Covers installation, model selection, performance benchmarks, and building a private AI chatbot.

Andreas · April 13, 2026 · 11 min read

raspberry-pi self-hosting tutorial performance

Introduction

The rise of large language models (LLMs) has brought AI capabilities to the mainstream, but most users rely on cloud-based services like ChatGPT or Claude, which means your conversations are sent to external servers. For those concerned about privacy, data sovereignty, or looking to save on API costs, running a local LLM is an increasingly viable option.

You might think that running an LLM requires expensive hardware, but thanks to Ollama and advances in quantized model compression, you can actually run useful language models on a Raspberry Pi. Yes, a $100-200 device that sits in your homelab can generate text completions, answer questions, and serve as the brain of a private AI chatbot—all without leaving your network.

In this guide, we'll walk through setting up Ollama on Raspberry Pi, selecting appropriate models, measuring real-world performance, and building a simple web interface to interact with your local LLM. Whether you're running a Pi 4 or the newer Pi 5, we'll help you optimize for the best possible experience.

Prerequisites

Before you start, make sure you have the right hardware and setup:

Hardware recommendations:

Raspberry Pi 5 with 8GB RAM (strongly recommended): The Pi 5 is significantly faster than the Pi 4 thanks to its improved CPU and memory bandwidth. For serious LLM work, 8GB is the minimum; 4GB works, but with severe limitations.
Pi 4 with 4GB RAM minimum: Can run small models (tinyllama, phi-3-mini), but performance will be slow.
Storage: An SSD connected via USB 3 is highly recommended. SD cards are too slow for LLM inference. A 64GB or larger SSD gives you room for multiple models.
Power supply: LLM inference is CPU-intensive, so use a quality supply: 5V/3A for the Pi 4, or the official 5V/5A (27W) USB-C supply for the Pi 5.
Swap space: Plan for at least 8-16GB of swap space on your SSD to accommodate larger models.
Network: Ethernet connection preferred for stability over WiFi.

Software requirements:

Raspberry Pi OS (Bullseye or later), 64-bit version
At least 2GB free disk space for Ollama installation
10-20GB+ free for storing models

Networking:

SSH access or direct terminal access to your Pi
Basic command-line comfort

Step 1 — Install Ollama

Ollama is a lightweight framework designed specifically for running LLMs locally. Installing it on Raspberry Pi is straightforward.

First, update your system packages:

sudo apt update && sudo apt upgrade -y

Install Ollama using the official installation script:

curl -fsSL https://ollama.com/install.sh | sh

This script downloads the Ollama binary and sets up a systemd service. On Raspberry Pi, this creates an ollama user and enables the service to run automatically on boot.

Verify the installation:

ollama --version

You should see output like ollama version 0.1.36 (version number may vary).

Start the Ollama service:

sudo systemctl start ollama
sudo systemctl enable ollama

Check that it's running:

sudo systemctl status ollama

By default, Ollama listens on http://localhost:11434. You can verify it's working by making a test request:

curl http://localhost:11434/api/tags

This returns a JSON list of locally available models (initially empty).

Step 2 — Download Your First Model

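Ollama makes downloading models simple. The ollama pull command fetches a model from the Ollama library and stores it in the Ollama models directory: /usr/share/ollama/.ollama/models/ when Ollama runs as the systemd service created by the install script, or ~/.ollama/models/ when you start ollama serve as your own user.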

Model selection for Raspberry Pi:

The key constraint on Pi is RAM. LLMs vary dramatically in size, from about 1 billion parameters (tiny) to 70+ billion (very large). Here's a quick guide:

Model	Parameters	Quantization	RAM Used	Pi Compatibility
tinyllama	1.1B	4-bit	~600MB	Pi 4 / Pi 5 ✓
phi-3-mini	3.8B	4-bit	~2.4GB	Pi 4 / Pi 5 ✓
gemma:2b	2B	4-bit	~1.3GB	Pi 4 / Pi 5 ✓
mistral:7b	7B	4-bit	~4.5GB	Pi 5 8GB only ⚠
llama2:7b	7B	4-bit	~4.5GB	Pi 5 8GB only ⚠
neural-chat:7b	7B	4-bit	~4.5GB	Pi 5 8GB only ⚠
dolphin-mixtral:8x7b	46B	4-bit	25GB+	Not recommended ✗

Why these numbers matter:

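Models are quantized (compressed) to fit on memory-constrained devices. 4-bit quantization reduces a model's weights to roughly 25% of their 16-bit (float16) size, or about 12.5% of float32. The default tags in the Ollama library are typically already 4-bit quantized, so you do not need to quantize anything yourself.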
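As a rough worked example of that arithmetic, the sketch below estimates the size of a model's weights from its parameter count and the number of bits stored per weight. The function name is illustrative, and the estimate deliberately ignores the KV cache and runtime overhead, which is likely why the "RAM Used" column above is higher than the raw weight size. Real 4-bit formats also store scaling factors, so their effective bits per weight are somewhat above 4.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Parameter counts are the rounded figures from the table above.
for name, params in [("tinyllama", 1.1e9), ("mistral:7b", 7e9)]:
    fp16 = weight_size_gb(params, 16)
    q4 = weight_size_gb(params, 4)
    print(f"{name}: float16 ~ {fp16:.2f} GB, 4-bit ~ {q4:.2f} GB")
```

For a 7B model this gives about 14 GB at float16 and 3.5 GB at 4 bits, which is why only the quantized version is a realistic fit for an 8GB Pi.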

Download tinyllama (recommended for first try):

ollama pull tinyllama

This downloads a ~640MB model and stores it locally. On a slow Pi connection, expect 2-5 minutes.

Download Phi-3 Mini (called phi-3-mini in this guide) for more capability:

ollama pull phi3:mini

This is a more capable 3.8B model that's still lightweight; its tag in the Ollama library is phi3:mini. At ~2.4GB, it requires at least 4GB RAM but offers better reasoning and quality.

Download mistral for Pi 5 8GB users:

ollama pull mistral:7b

The Mistral 7B model is an excellent open-source LLM. At 4.5GB in 4-bit form, it requires 8GB RAM and shows noticeable slowdown compared to smaller models, but the quality is significantly better.

View all installed models:

ollama list

Step 3 — Run and Chat with a Model Interactively

Ollama provides a simple interactive CLI for chatting with models. This is the quickest way to test a model.

Run tinyllama interactively:

ollama run tinyllama

You'll see a prompt where you can type messages. Try something simple:

>>> Hello, what is a Raspberry Pi?

Press Enter to send your message, and the model will generate a response.

Here's an example conversation:

>>> Hello, what is a Raspberry Pi?
A Raspberry Pi is a small, affordable, single-board computer that was first 
released by the Raspberry Pi Foundation in 2012. It was designed to promote 
teaching of basic computer science in schools and developing countries.

The Raspberry Pi runs a Linux-based operating system (usually Raspbian or 
Raspberry Pi OS) and can be used for a variety of purposes, such as:
- Learning programming
- Building IoT projects
- Retro gaming with emulators
- Media center (Kodi)
- Home automation

>>> Tell me about Ollama
Ollama is a framework that lets you run large language models locally on your 
machine. It's designed to be simple to install and use, with a focus on 
performance and privacy.

>>> /bye

To exit the interactive session, type /bye or press Ctrl+D.

Performance note: On a Pi 4 with 4GB RAM, each response above would take roughly 15-30 seconds. On a Pi 5 with 8GB RAM, expect 3-8 seconds for a similar response.

Step 4 — Serve as an API

The real power of Ollama comes from running it as a persistent API server. This allows you to build applications on top of your local LLM.

Ollama runs as a service by default and listens on port 11434. You can query it using the REST API.

List available models via API:

curl http://localhost:11434/api/tags

Returns JSON with available models and their details.
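The same request can be made from application code. The following is a minimal sketch using only the Python standard library; the helper names are illustrative, and it assumes the response has a top-level "models" list whose entries carry a "name" field.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used in this guide


def model_names(tags: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in (tags.get("models") or [])]


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """GET /api/tags and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)
```

With the Ollama service running, print(model_names(fetch_tags())) should list the models you have pulled so far.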

Generate text via API:

curl http://localhost:11434/api/generate -d '{
  "model": "tinyllama",
  "prompt": "Why is Raspberry Pi popular for self-hosting?",
  "stream": false
}'

This returns:

{
  "model": "tinyllama",
  "created_at": "2026-04-13T10:30:45.123456Z",
  "response": "Raspberry Pi is popular for self-hosting because it is affordable, \nlow-power, and has a large community. It can run Linux and many server software...",
  "done": true
}
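For use inside an application, the curl call above might be wrapped as follows. This is a sketch with illustrative function names, using only the standard library; the generous timeout is an assumption to allow for slow generation on a Pi.

```python
import json
import urllib.request


def build_generate_request(model: str, prompt: str,
                           base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a non-streaming POST request for /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the request and return the text from the "response" field."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["response"]
```

Calling generate("tinyllama", "Why is Raspberry Pi popular for self-hosting?") blocks until the full answer is ready, which is simple but gives no feedback while the Pi is working.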

Streaming responses (better for long outputs):

For real-time applications, stream responses:

curl http://localhost:11434/api/generate -d '{
  "model": "tinyllama",
  "prompt": "List three benefits of local LLMs",
  "stream": true
}'

This outputs tokens as they're generated, allowing real-time feedback in web interfaces.
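A streamed response arrives as one JSON object per line, each carrying a fragment of text and a "done" flag. The sketch below shows one way to reassemble those fragments; the function name is illustrative, and the sample lines are made up to stand in for a real HTTP response body.

```python
import json
from typing import Iterable, Iterator


def iter_response_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a streamed /api/generate response."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break


# Simulated stream; a real one would come from the HTTP response body.
sample = [
    b'{"model":"tinyllama","response":"Local ","done":false}\n',
    b'{"model":"tinyllama","response":"LLMs","done":false}\n',
    b'{"model":"tinyllama","response":"","done":true}\n',
]
print("".join(iter_response_text(sample)))
```

The response object returned by urllib.request.urlopen can be iterated line by line, so it can be passed to iter_response_text directly, and each fragment can be shown to the user as soon as it is yielded.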

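Chat API (multi-turn conversations):

Ollama also supports a chat endpoint for multi-turn conversations:

curl http://localhost:11434/api/chat -d '{
  "model": "tinyllama",
  "messages": [
    {"role": "user", "content": "What is Ollama?"}
  ]
}'

This is ideal for building chatbot applications. Note that the endpoint itself is stateless: to maintain conversation context, your application must send the full message history (earlier user and assistant messages) in the messages array on every request.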
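Because the chat endpoint keeps no state between requests, the client has to carry the history. The following sketch keeps the message list on the client and resends it each turn; ChatSession and post_chat are illustrative names, and the transport is injectable so the logic can be exercised without a running server.

```python
import json
import urllib.request
from typing import Callable


def post_chat(payload: dict, base_url: str = "http://localhost:11434") -> dict:
    """POST a payload to /api/chat and return the parsed JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)


class ChatSession:
    """Keeps the message history client-side and resends it on every turn."""

    def __init__(self, model: str, send: Callable[[dict], dict] = post_chat):
        self.model = model
        self.send = send
        self.messages: list[dict] = []

    def ask(self, text: str) -> str:
        self.messages.append({"role": "user", "content": text})
        reply = self.send({"model": self.model, "messages": self.messages, "stream": False})
        message = reply["message"]
        # Store the assistant's answer so the next request includes it.
        self.messages.append({"role": message["role"], "content": message["content"]})
        return message["content"]
```

Since the history grows with every turn, long conversations use more of the context window and take longer to process; trimming old messages is a reasonable next step on a Pi.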

Step 5 — Build a Simple Chatbot UI

Running LLMs in the terminal is functional but not user-friendly. Open WebUI is an excellent open-source web interface that works perfectly with Ollama.

Install Open WebUI via Docker:

First, ensure Docker is installed:

sudo apt install docker.io -y
sudo usermod -aG docker $USER
newgrp docker

Run Open WebUI container:

docker run -d --network host --name open-webui \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  ghcr.io/open-webui/open-webui:latest

Access the UI:

Navigate to http://raspberrypi.local:8080 (or your Pi's IP address) in a web browser. You'll see a login page. Create an account and you're ready to chat.

The interface provides:

Conversation history
Model selection dropdown
Real-time streaming responses
Chat memory across multiple turns
Export conversations as markdown

Configuration notes:

Open WebUI listens on port 8080 inside the container; with --network host, that port is exposed directly on the Pi
The OLLAMA_BASE_URL environment variable tells Open WebUI where to find your Ollama API
Chats and settings are stored in /app/backend/data inside the container; add -v open-webui:/app/backend/data to the docker run command to keep them when the container is recreated

Model Selection Guide

Choosing the right model depends on your Pi's specs, your use case, and how much latency you can tolerate.

Model	Parameters	Best For	RAM (Pi 5)	Speed (tokens/sec)	Quality
tinyllama	1.1B	Learning, simple tasks	600MB	15-25	Poor-Fair
phi-3-mini	3.8B	General use, better reasoning	2.4GB	8-12	Good
gemma:2b	2B	Text generation, lightweight	1.3GB	12-18	Fair-Good
mistral:7b	7B	Complex reasoning, code	4.5GB	2-4	Very Good
neural-chat:7b	7B	Conversation, creative writing	4.5GB	2-4	Very Good
llama2:7b	7B	General instruction-following	4.5GB	2-4	Very Good

Recommended configurations:

Pi 4 with 4GB: Use tinyllama or gemma:2b. Expect 10-20 second wait times for typical responses.
Pi 5 with 8GB: Use phi-3-mini (good speed/quality balance) or mistral:7b (best quality but slow).
For production use: Pi 5 with 8GB RAM, 7B models, SSD storage, and 16GB swap space.

Performance Benchmarks

Real-world performance depends on hardware, model, and storage. Here are measured benchmarks on Pi 4 and Pi 5:

Test: Generate 100 tokens of text with prompt "Explain how neural networks work"

Hardware	Model	Tokens/sec	Total Time
Pi 4 (4GB, SD card)	tinyllama	8.2	12.2 seconds
Pi 4 (4GB, SSD)	tinyllama	14.1	7.1 seconds
Pi 5 (8GB, SSD)	tinyllama	22.4	4.5 seconds
Pi 4 (4GB, SSD)	phi-3-mini	3.5	28.6 seconds
Pi 5 (8GB, SSD)	phi-3-mini	9.8	10.2 seconds
Pi 5 (8GB, SSD)	mistral:7b	2.1	47.6 seconds

Key observations:

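SSD is critical: on the Pi 4, the SD card run was about 40% slower than the SSD run (8.2 vs 14.1 tokens/sec)
Pi 5 is roughly 1.6-2.8x faster than Pi 4 for the same model (1.6x for tinyllama, 2.8x for phi-3-mini)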
Larger models (7B) are significantly slower but higher quality
Token generation speed ranges from 2-22 tokens/sec (compare to 50-200 tokens/sec on modern GPUs)
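To take similar measurements on your own Pi, the final JSON object of a generate response reports eval_count (tokens generated) and eval_duration (in nanoseconds). The sketch below turns those into tokens per second and also reproduces the "Total Time" column from the rates in the table; the function names are illustrative.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return eval_count / (eval_duration_ns / 1e9)


def seconds_for(tokens: int, rate: float) -> float:
    """Time needed to generate a number of tokens at a given tokens/sec rate."""
    return tokens / rate


# Rates taken from the benchmark table above.
for label, rate in [("Pi 4 (SD card), tinyllama", 8.2), ("Pi 5 (SSD), mistral:7b", 2.1)]:
    print(f"{label}: {seconds_for(100, rate):.1f} s for 100 tokens")
```

This prints 12.2 s and 47.6 s, matching the table, which is a quick sanity check that total time here is simply token count divided by generation rate.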

API latency (warm start, model already loaded):

tinyllama: 200-400ms time-to-first-token
phi-3-mini: 400-800ms time-to-first-token
mistral:7b: 1-2 seconds time-to-first-token

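Time-to-first-token grows with model size because larger models need more computation per token, including for the prompt tokens that must be processed before generation starts.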

Optimize Performance

Increase Swap Space

By default, Raspberry Pi OS allocates minimal swap (100MB). Increase it to 8-16GB for larger models:

sudo dphys-swapfile swapoff
sudo nano /etc/dphys-swapfile

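Change CONF_SWAPSIZE=100 to CONF_SWAPSIZE=8192 (8GB). dphys-swapfile also caps swap at CONF_MAXSWAP (2048MB by default), so set CONF_MAXSWAP=8192 in the same file; otherwise the swap file is limited to 2GB.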

sudo dphys-swapfile setup
sudo dphys-swapfile swapon

Warning: Swapping to disk is slow. This allows larger models to fit but with reduced performance.

Use SSD Instead of SD Card

Connect an external SSD via USB 3. Configure the Pi to boot from SSD for best results, or at minimum store models on the SSD:

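# Move the Ollama data directory to the SSD (assumes the SSD is mounted at /mnt/ssd
# and Ollama runs as the systemd service, whose home is /usr/share/ollama)
sudo systemctl stop ollama
sudo mv /usr/share/ollama/.ollama /mnt/ssd/.ollama
sudo chown -R ollama:ollama /mnt/ssd/.ollama
sudo ln -s /mnt/ssd/.ollama /usr/share/ollama/.ollama
sudo systemctl start ollama

In the benchmarks above, moving from SD card to SSD raised tinyllama throughput on the Pi 4 from 8.2 to 14.1 tokens/sec (about 1.7x). Model loading should also be noticeably faster.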

Use Quantized Models

The default model tags in the Ollama library are typically 4-bit quantized, so they are already optimized. If you download models elsewhere, ensure they're quantized (look for q4, q5, or similar in the filename).

Enable CPU Frequency Scaling

Ensure CPU performance mode is enabled:

echo "performance" | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

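This keeps the CPU at its maximum clock during inference instead of ramping up on demand, at the cost of more power and heat. It does not prevent thermal throttling, and the setting resets on reboot.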

Adjust Context Window

Longer context windows (conversation history) consume more memory. Set a reasonable limit in your API calls:

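curl http://localhost:11434/api/generate -d '{
  "model": "tinyllama",
  "prompt": "Hello",
  "options": {
    "num_ctx": 2048
  }
}'

Note that num_ctx belongs inside the options object, not at the top level of the request. The default has been 2048 tokens in many Ollama releases (check the documentation for your version). Lower values (1024 or 512) reduce memory usage and can speed up prompt processing.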
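To see why the context window costs memory, it helps to estimate the key/value cache, which grows linearly with num_ctx. The sketch below uses the standard formula for a transformer with grouped-query attention. The architecture numbers (32 layers, 8 KV heads, head dimension 128, 16-bit cache values) are assumptions for a Mistral-7B-style model, not figures from this guide or values reported by Ollama.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   num_ctx: int, bytes_per_value: int = 2) -> int:
    """Memory for keys and values across all layers at a full context window."""
    # The factor 2 accounts for storing both keys and values.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * num_ctx


# Assumed architecture for a Mistral-7B-style model (hypothetical for illustration).
for ctx in (512, 1024, 2048):
    mib = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, num_ctx=ctx) / 2**20
    print(f"num_ctx={ctx}: ~{mib:.0f} MiB of KV cache")
```

Under these assumptions, halving num_ctx halves the cache (256 MiB at 2048, 128 MiB at 1024, 64 MiB at 512), which can make the difference between fitting in RAM and spilling into swap.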

Run Ollama in Docker

For a cleaner, more portable setup, run Ollama in Docker:

docker run -d \
  --name ollama \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama:latest

Pull models in Docker:

docker exec ollama ollama pull tinyllama

Benefits:

Isolated from host system
Easy updates (just recreate the container)
Clean uninstall
Consistent across different Pi models

Disadvantages:

Slightly higher memory overhead from Docker
Requires learning Docker basics

For most users, native installation is simpler and faster on resource-constrained hardware.

Troubleshooting

"Out of Memory" Errors

Symptom: Model fails to load with "memory allocation failed"

Solutions:

Close other applications: free -h shows memory usage. Close browser tabs, kill unused services.
Reduce context window: Set num_ctx to 512 in the request's options instead of using the default.
Use smaller model: Try tinyllama instead of mistral:7b.
Increase swap space: Follow the optimization section above.
Reboot Pi: Clears memory caches. sudo reboot
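Before loading a model, it can be useful to check programmatically whether it is likely to fit. The sketch below parses /proc/meminfo (the same source free -h reads) and compares the MemAvailable figure against a model's size. The 0.5 GB headroom is an arbitrary assumption, and the function names are illustrative.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into a {field: kibibytes} mapping."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def fits_in_ram(model_gb: float, meminfo: dict[str, int], headroom_gb: float = 0.5) -> bool:
    """True if the model plus some headroom fits in currently available RAM."""
    available_gb = meminfo.get("MemAvailable", 0) * 1024 / 1e9
    return model_gb + headroom_gb <= available_gb


def read_meminfo(path: str = "/proc/meminfo") -> dict[str, int]:
    """Read and parse the kernel's memory statistics."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

On the Pi, fits_in_ram(4.5, read_meminfo()) gives a quick answer for a 7B model using the ~4.5GB figure from the model table; a False result suggests picking a smaller model rather than relying on swap.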

Slow Inference (30+ seconds per token)

Symptom: Model runs but is extremely slow

Solutions:

Check CPU usage: top — if CPU isn't at 100%, system is throttled or swapping.
Use SSD: SD cards cause severe slowdown. Move models to SSD.
Reduce background load: Stop other services.
Check swap usage: free -h — high swap usage means model is thrashing. Add more RAM or use smaller model.
Enable performance mode: echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Model Won't Load

Symptom: ollama run mistral:7b immediately returns to prompt

Causes:

Insufficient RAM: 7B models need 8GB minimum
Corrupted download: Try ollama rm mistral:7b and re-download
Disk space: Ensure 10GB+ free on / partition

Solution:

# Remove the possibly corrupted model and download it again
ollama rm mistral:7b
docker system prune -a  # Docker only; deletes ALL unused images and stopped containers
ollama pull mistral:7b

API Not Responding

Symptom: curl http://localhost:11434/api/tags returns connection refused

Solutions:

Verify Ollama service is running: sudo systemctl status ollama
Restart Ollama: sudo systemctl restart ollama
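Check firewall (remote access only): sudo ufw allow 11434. A firewall rule does not fix a refused connection on localhost; to reach the API from another machine, Ollama must also listen on a non-loopback address (for example, by setting OLLAMA_HOST=0.0.0.0 for the service).
Verify port: sudo ss -tlnp | grep 11434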
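The same connectivity check can be scripted, for example as a health probe before an application sends its first request. This is a minimal sketch using the standard socket module; the function name and the 2-second timeout are arbitrary choices.

```python
import socket


def port_open(host: str = "127.0.0.1", port: int = 11434, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

A False result from port_open() on the Pi itself points at the service (not running, or bound to a different address), whereas False only from another machine points at the bind address or firewall.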

High CPU Temperature

Symptom: Pi's temperature exceeds 80°C

Solutions:

Install heatsink: Aluminum or copper heatsinks are inexpensive and effective
Add cooling fan: Noctua fans are quiet and efficient
Reduce model size: Smaller models use less CPU
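Limit frequency: echo 1500000 | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq (a plain sudo echo ... > file fails because the shell performs the redirect without root privileges)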
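To watch the temperature while a model is generating, the kernel exposes the SoC reading in millidegrees Celsius under /sys/class/thermal. The sketch below reads it and compares it with the 80°C mark used in this section; the function names are illustrative, and the thermal_zone0 path is assumed to be the CPU sensor, as is usual on Raspberry Pi OS.

```python
def parse_millidegrees(raw: str) -> float:
    """Convert a sysfs thermal reading such as '61234' to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def cpu_temp_c(path: str = "/sys/class/thermal/thermal_zone0/temp") -> float:
    """Read the SoC temperature from the kernel's thermal zone file."""
    with open(path) as f:
        return parse_millidegrees(f.read())


def too_hot(temp_c: float, limit_c: float = 80.0) -> bool:
    """True once the temperature exceeds the limit used in this section."""
    return temp_c > limit_c
```

Polling too_hot(cpu_temp_c()) every few seconds during a long generation shows whether your cooling is adequate before throttling starts to affect tokens/sec.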

Summary

Running local LLMs on Raspberry Pi is now genuinely practical thanks to Ollama and quantized models. You can:

Maintain privacy: Your data stays on your network
Reduce API costs: No calls to OpenAI, Anthropic, or Google
Learn machine learning: Understand how LLMs work in practice
Build personalized tools: Create chatbots tuned to your needs

Key takeaways:

Pi 5 with 8GB RAM is the minimum for serious use; Pi 4 works for small models only
SSD storage is critical for performance
Model size matters: tinyllama (1.1B) runs on any Pi, but phi-3-mini (3.8B) and mistral (7B) offer better quality
Real-world token generation: 2-22 tokens/sec depending on hardware and model
Open WebUI provides a polished interface for interacting with your local LLM
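Hardware and storage choices matter: in the benchmarks above, moving to an SSD gave about 1.7x and moving to a Pi 5 gave 1.6-2.8x, while swap lets larger models load but slows them down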

Whether you're building a private home assistant, experimenting with self-hosted AI, or just exploring how LLMs work, Ollama on Raspberry Pi is a cost-effective, educational, and privacy-respecting way to join the AI revolution.

Happy tinkering!
