Voice Assistant on Raspberry Pi
Build a private voice assistant on a Raspberry Pi using open-source tools. No cloud services required — everything runs locally.
Introduction
This guide walks through the four pieces of a fully local voice assistant on a Raspberry Pi: speech-to-text (whisper.cpp), intent matching (plain Python), text-to-speech (Piper), and wake-word detection (openWakeWord). Audio never leaves the device, so nothing is sent to third-party servers.
Prerequisites
- Raspberry Pi 4 (4GB+) or Pi 5 (8GB+ recommended)
- USB microphone and speaker (or 3.5mm audio jack)
- Python 3.9+
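- A Python virtual environment (for example, one created with python3 -m venv)
- 2GB free disk space
Install the base Python packages inside the virtual environment:
pip install numpy scipy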
Hardware Setup
Connect USB audio devices:
# List USB devices
lsusb | grep -i audio
# Verify sound devices
arecord -l
aplay -l
# Test recording: capture 5 seconds, then play it back
arecord -D plughw:1,0 -d 5 -c 1 -r 16000 -f S16_LE test.wav
aplay test.wav
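The card and device numbers in plughw:1,0 differ between setups, so it can help to read them from the arecord -l listing rather than guess. The following is a minimal sketch, not part of the original guide: the function name list_alsa_devices and the hard-coded sample listing are illustrative assumptions.

```python
import re

# Matches lines such as:
# "card 1: Device [USB Audio Device], device 0: USB Audio [USB Audio]"
CARD_LINE = re.compile(r"^card (\d+): .*?, device (\d+):", re.MULTILINE)

def list_alsa_devices(listing: str) -> list[str]:
    """Return ALSA device strings (e.g. 'plughw:1,0') found in `arecord -l` output."""
    return [f"plughw:{card},{dev}" for card, dev in CARD_LINE.findall(listing)]

# Sample listing, hard-coded for illustration. On the Pi you would pass
# subprocess.run(["arecord", "-l"], capture_output=True, text=True).stdout
SAMPLE = """\
**** List of CAPTURE Hardware Devices ****
card 1: Device [USB Audio Device], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
"""

print(list_alsa_devices(SAMPLE))
```

The same parser works on aplay -l output, since both commands print card and device lines in the same layout.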
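Set the default audio device in ~/.asoundrc, replacing hw:1,0 with the card and device numbers reported by aplay -l (playback) and arecord -l (capture). A separate USB microphone and USB speaker will usually have different card numbers: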
pcm.!default {
type asym
playback.pcm "playback"
capture.pcm "capture"
}
pcm.playback {
type plug
slave.pcm "hw:1,0"
}
pcm.capture {
type plug
slave.pcm "hw:1,0"
}
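Because the configuration above hard-codes hw:1,0 in two places, a small generator avoids editing mistakes when playback and capture sit on different cards. This is a sketch under assumptions: render_asoundrc and its parameter names are hypothetical helpers, and the template simply mirrors the configuration shown above.

```python
# Template mirroring the ~/.asoundrc layout above; doubled braces are literal braces
ASOUNDRC_TEMPLATE = """\
pcm.!default {{
    type asym
    playback.pcm "playback"
    capture.pcm "capture"
}}
pcm.playback {{
    type plug
    slave.pcm "hw:{play_card},{play_dev}"
}}
pcm.capture {{
    type plug
    slave.pcm "hw:{cap_card},{cap_dev}"
}}
"""

def render_asoundrc(play_card: int, play_dev: int, cap_card: int, cap_dev: int) -> str:
    """Fill in the playback and capture hardware addresses (hypothetical helper)."""
    return ASOUNDRC_TEMPLATE.format(
        play_card=play_card, play_dev=play_dev, cap_card=cap_card, cap_dev=cap_dev
    )

# Example: speaker on card 2, microphone on card 1
print(render_asoundrc(play_card=2, play_dev=0, cap_card=1, cap_dev=0))
```

Review the printed text before writing it to ~/.asoundrc, since a wrong card number silently routes audio to the wrong device.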
Step 1 — Speech-to-Text with Whisper.cpp
whisper.cpp is a C/C++ port of OpenAI's Whisper that runs on the Pi's CPU without a Python runtime. Build it from source:
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make
Download the tiny model (75MB, sufficient for Pi):
bash ./models/download-ggml-model.sh tiny
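Record five seconds of audio and transcribe it. Recent whisper.cpp releases build with CMake and name the binary build/bin/whisper-cli; if ./main is missing or prints a deprecation notice, use that path instead:
arecord -D plughw:1,0 -d 5 -c 1 -r 16000 -f S16_LE audio.wav
./main -m models/ggml-tiny.bin audio.wav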
Latency: ~3 seconds for 5 seconds of audio on Pi 5.
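By default whisper.cpp prefixes each transcribed segment with a timestamp range, which gets in the way of intent matching. The sketch below strips those prefixes from captured output; clean_transcript and the sample text are illustrative names and data, not part of whisper.cpp.

```python
import re

# Matches a leading "[00:00:00.000 --> 00:00:03.000]" prefix and the spaces after it
TIMESTAMP = re.compile(
    r"^\[\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}\]\s*"
)

def clean_transcript(stdout: str) -> str:
    """Drop timestamp prefixes and blank lines, joining segments with spaces."""
    lines = (TIMESTAMP.sub("", line).strip() for line in stdout.splitlines())
    return " ".join(line for line in lines if line)

# Hard-coded sample in the default output layout
sample = (
    "\n"
    "[00:00:00.000 --> 00:00:03.000]   Turn on the lights.\n"
    "[00:00:03.000 --> 00:00:05.000]   Please.\n"
)
print(clean_transcript(sample))  # Turn on the lights. Please.
```

Passing -nt (--no-timestamps) on the command line avoids the prefixes in the first place; the parser is useful when the timestamps are wanted elsewhere.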
Step 2 — Intent Matching with Simple Python
Create intent_matcher.py for basic command parsing:
intents = {
    "lights_on": ["turn on the lights", "lights on", "switch on"],
    "lights_off": ["turn off the lights", "lights off", "switch off"],
    "temperature": ["what is the temperature", "how warm", "current temp"],
    "shutdown": ["shut down", "power off", "goodbye"],
}

def match_intent(text):
    """Return (intent, matched_phrase) for the first phrase found in the text."""
    text_lower = text.lower()
    for intent, patterns in intents.items():
        for pattern in patterns:
            if pattern in text_lower:
                return intent, pattern
    return None, None

# Test
result, matched = match_intent("turn on the lights please")
print(f"Intent: {result}, Matched: {matched}")
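Plain substring checks also fire inside longer words: "lights on" is a substring of "lights only". One possible refinement, sketched here as an assumption rather than the guide's method, compiles each intent's phrases into a regular expression with word boundaries. The names INTENTS, COMPILED, and match_intent_regex are illustrative.

```python
import re

INTENTS = {
    "lights_on": ["turn on the lights", "lights on", "switch on"],
    "lights_off": ["turn off the lights", "lights off", "switch off"],
    "temperature": ["what is the temperature", "how warm", "current temp"],
    "shutdown": ["shut down", "power off", "goodbye"],
}

# One compiled pattern per intent; \b requires the phrase to start and end on a word boundary
COMPILED = {
    intent: re.compile(r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b")
    for intent, phrases in INTENTS.items()
}

def match_intent_regex(text):
    """Return (intent, matched_phrase), or (None, None) if no phrase matches."""
    text_lower = text.lower()
    for intent, pattern in COMPILED.items():
        found = pattern.search(text_lower)
        if found:
            return intent, found.group(0)
    return None, None

print(match_intent_regex("Turn on the lights please"))
print(match_intent_regex("keep the lights only in the hall"))
```

The second call returns (None, None), whereas a substring check would report lights_on.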
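Whisper rarely returns exactly the phrase you defined, so fuzzy matching is more forgiving than substring checks. Install rapidfuzz and return the best-scoring intent above a threshold:
pip install rapidfuzz
from rapidfuzz import fuzz

# Reuses the intents dict defined in intent_matcher.py
def match_intent_fuzzy(text, threshold=80):
    """Return the intent whose phrase scores highest, or None below the threshold."""
    text_lower = text.lower()
    best_intent, best_score = None, 0
    for intent, patterns in intents.items():
        for pattern in patterns:
            score = fuzz.partial_ratio(text_lower, pattern)
            if score > best_score:
                best_score, best_intent = score, intent
    return best_intent if best_score >= threshold else None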
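Choosing a threshold is easier with a handful of labelled utterances to test against. The harness below is a hypothetical sketch: evaluate, CASES, and the stand-in substring_matcher are illustrative names, and the stand-in is used only so the example runs without rapidfuzz installed. In practice you would pass match_intent_fuzzy, or a lambda that fixes its threshold.

```python
def substring_matcher(text):
    """Stand-in matcher used only for this demo; swap in your real matcher."""
    rules = {"lights_on": "lights on", "lights_off": "lights off"}
    for intent, phrase in rules.items():
        if phrase in text.lower():
            return intent
    return None

def evaluate(matcher, cases):
    """Return (accuracy, misses) for a matcher over (utterance, expected_intent) pairs."""
    misses = []
    for text, expected in cases:
        got = matcher(text)
        if got != expected:
            misses.append((text, expected, got))
    return 1 - len(misses) / len(cases), misses

# Hypothetical labelled utterances, including a plausible transcription slip
CASES = [
    ("Lights on, please", "lights_on"),
    ("lights off", "lights_off"),
    ("light on", "lights_on"),
    ("what time is it", None),
]

accuracy, misses = evaluate(substring_matcher, CASES)
print(f"accuracy: {accuracy:.0%}")
for text, expected, got in misses:
    print(f"missed: {text!r} expected={expected} got={got}")
```

Including utterances that should match nothing (expected None) matters as much as the positive cases, since a low threshold tends to show up as false matches there.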
Step 3 — Text-to-Speech with Piper TTS
Piper is a neural TTS engine that is fast enough to run on a Pi. Install the Python package, which provides the piper command:
pip install piper-tts
Download a voice model and its config file (the medium-quality voices are roughly 60MB each):
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/medium/en_US-amy-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/medium/en_US-amy-medium.onnx.json
Synthesize and play:
echo "Hello, I'm your voice assistant" | piper \
--model en_US-amy-medium.onnx \
--output_file output.wav
aplay output.wav
Latency: ~2 seconds for a 10-word sentence on Pi 5.
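When calling Piper from Python, passing the text on standard input avoids shell quoting problems with apostrophes and quotes in the sentence. The sketch below assumes the piper and aplay commands are on the PATH and uses the same --model and --output_file options as the command above; build_piper_command and speak are illustrative helper names.

```python
import subprocess

def build_piper_command(model_path, wav_path):
    """Argument list for piper; no shell is involved, so no quoting is needed."""
    return ["piper", "--model", model_path, "--output_file", wav_path]

def speak(text, model_path="en_US-amy-medium.onnx", wav_path="/tmp/output.wav"):
    """Synthesize text to a WAV file, then play it (assumes piper and aplay exist)."""
    # The sentence goes to piper's stdin, mirroring `echo "..." | piper ...`
    subprocess.run(build_piper_command(model_path, wav_path),
                   input=text, text=True, check=True)
    subprocess.run(["aplay", wav_path], check=True)

print(build_piper_command("en_US-amy-medium.onnx", "/tmp/output.wav"))
```

On the Pi, speak('Hello, I\'m your "voice assistant"') would then work even though the sentence contains both kinds of quotes.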
Step 4 — Wake-Word Detection with openWakeWord
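For hands-free activation, openWakeWord runs wake-word detection locally. It ships pre-trained models for phrases such as "hey jarvis"; a custom phrase like "Hey Raspberry Pi" requires training your own model and passing its file path instead of a model name. PyAudio needs the PortAudio headers (sudo apt install portaudio19-dev) before pip can build it:
pip install openwakeword pyaudio scipy numpy
Create wake_word.py:
import numpy as np
import pyaudio
from openwakeword.model import Model
from openwakeword.utils import download_models

# One-time download of the pre-trained models
download_models()

# Load a pre-trained model by name (or pass the path to a custom model file)
model = Model(wakeword_models=["hey_jarvis"])

# Audio capture: 16 kHz mono, 1280-sample (80 ms) frames
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=1280)
print("Listening for wake word...")
while True:
    # openWakeWord expects raw 16-bit PCM samples, not normalized floats
    audio = np.frombuffer(stream.read(1280, exception_on_overflow=False), dtype=np.int16)
    # predict() returns a dict mapping each loaded model name to a score between 0 and 1
    scores = model.predict(audio)
    name, score = max(scores.items(), key=lambda item: item[1])
    if score > 0.5:
        print(f"Wake word detected! ({name}, confidence: {score:.2%})")
        # Trigger STT, intent matching, TTS here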
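A wake word usually spans several consecutive 80 ms frames, so a bare threshold check can fire many times for one utterance. A cooldown is one simple way to collapse those into a single trigger. The class below is a hypothetical sketch (WakeTrigger, its parameters, and the 2-second default are assumptions), written with an injectable clock so it can be exercised without a microphone.

```python
import time

class WakeTrigger:
    """Fire once when the score crosses the threshold, then ignore frames for `cooldown` seconds."""

    def __init__(self, threshold=0.5, cooldown=2.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable for testing
        self._last_fired = None     # time of the most recent trigger

    def update(self, score):
        """Return True if this frame's score should start a new interaction."""
        now = self.clock()
        if score <= self.threshold:
            return False
        if self._last_fired is not None and now - self._last_fired < self.cooldown:
            return False
        self._last_fired = now
        return True

# Simulated frames 80 ms apart with made-up scores
ticks = iter([0.00, 0.08, 0.16, 0.24, 0.32])
trigger = WakeTrigger(clock=lambda: next(ticks))
fired = [trigger.update(s) for s in [0.1, 0.7, 0.9, 0.8, 0.2]]
print(fired)  # [False, True, False, False, False]
```

In the detection loop, the score check would become a call to trigger.update(score), with the threshold tuned as described in the troubleshooting section.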
Step 5 — Complete Voice Assistant Loop
Combine all components in assistant.py:
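#!/usr/bin/env python3
import subprocess
import wave
import numpy as np
import pyaudio
from openwakeword.model import Model
from openwakeword.utils import download_models
from rapidfuzz import fuzz

RATE = 16000   # Hz, used by both whisper.cpp and openWakeWord
CHUNK = 1280   # samples per frame (80 ms)

intents = {
    "lights_on": ["turn on", "lights on"],
    "lights_off": ["turn off", "lights off"],
    "temperature": ["temperature", "how warm"],
}

def capture_audio(stream, duration=5, path="/tmp/audio.wav"):
    """Record from the already-open stream and save a 16 kHz mono WAV"""
    # Reusing the stream avoids a "device busy" error from opening the mic a second time with arecord
    frames = [stream.read(CHUNK, exception_on_overflow=False)
              for _ in range(int(RATE / CHUNK * duration))]
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit samples
        wf.setframerate(RATE)
        wf.writeframes(b"".join(frames))

def transcribe(path="/tmp/audio.wav"):
    """Speech-to-text"""
    # -nt suppresses timestamps so stdout holds only the transcript
    result = subprocess.run(
        ["./whisper.cpp/main", "-m", "./whisper.cpp/models/ggml-tiny.bin", "-nt", path],
        capture_output=True, text=True
    )
    return result.stdout.strip()

def match_intent(text, threshold=80):
    """Intent recognition: best-scoring intent at or above the threshold"""
    # Taking the best score matters: "turn off the lights" also scores above 80 against "turn on"
    best_intent, best_score = None, 0
    for intent, patterns in intents.items():
        for pattern in patterns:
            score = fuzz.partial_ratio(text.lower(), pattern)
            if score > best_score:
                best_score, best_intent = score, intent
    return best_intent if best_score >= threshold else None

def speak(text):
    """Text-to-speech"""
    # Text goes to piper on stdin rather than through a shell, so quotes cannot break the command
    subprocess.run(
        ["piper", "--model", "en_US-amy-medium.onnx", "--output_file", "/tmp/output.wav"],
        input=text, text=True
    )
    subprocess.run(["aplay", "/tmp/output.wav"])

def handle_intent(intent):
    """Execute action based on intent"""
    if intent == "lights_on":
        # Call GPIO or smart home API
        speak("Turning on the lights")
    elif intent == "lights_off":
        speak("Turning off the lights")
    elif intent == "temperature":
        speak("The temperature is 21 degrees")  # placeholder value, not a sensor reading

# Main loop
download_models()  # one-time download of the pre-trained wake-word models
wake_model = Model(wakeword_models=["hey_jarvis"])
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=RATE, input=True, frames_per_buffer=CHUNK)
print("Listening for wake word...")
while True:
    # openWakeWord expects raw 16-bit PCM; predict() returns a dict of model name -> score
    audio = np.frombuffer(stream.read(CHUNK, exception_on_overflow=False), dtype=np.int16)
    if max(wake_model.predict(audio).values()) > 0.5:
        print("Wake word detected!")
        speak("I'm listening")
        capture_audio(stream, duration=5)
        text = transcribe()
        print(f"You said: {text}")
        intent = match_intent(text)
        if intent:
            handle_intent(intent)
        else:
            speak("I didn't understand that")
        wake_model.reset()  # clear buffered audio so the earlier wake word cannot retrigger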
Run:
python3 assistant.py
Latency Reality Check
| Component | Time on Pi 5 |
|---|---|
| Wake word detection (continuous) | <100ms |
| Audio capture (5 sec) | 5s |
| Speech-to-text (Whisper tiny) | 3s |
| Intent matching | <100ms |
| Text-to-speech (Piper) | 2s |
| Total response time | ~10s |
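Expect roughly 10 seconds from wake word to spoken reply, about half of which is the fixed 5-second recording window. Cloud assistants (Alexa, Google) typically respond in 1–2 seconds because they send audio to datacenter hardware for processing. Local processing is private but slower.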
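The figures in the table are rough, so it is worth measuring each stage on your own hardware. The sketch below adds up the table's budget and shows a small timing helper; timed, timings, and BUDGET are hypothetical names, and the two "<100ms" rows are entered as 0.1 s upper bounds.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> seconds, filled in by timed()

@contextmanager
def timed(stage):
    """Record the wall-clock time spent inside the with-block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

# Figures from the table above (seconds)
BUDGET = {"wake_word": 0.1, "capture": 5.0, "stt": 3.0, "intent": 0.1, "tts": 2.0}
total = sum(BUDGET.values())
capture_share = BUDGET["capture"] / total

with timed("demo"):
    time.sleep(0.01)  # stand-in for a real stage such as transcription

print(f"budget total: {total:.1f}s, capture share: {capture_share:.0%}")
print(f"measured demo stage: {timings['demo']:.3f}s")
```

Wrapping each call in the main loop with timed("stt"), timed("tts"), and so on gives a per-stage breakdown to compare against the table.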
Troubleshooting
Audio not working: Run arecord -l and aplay -l to confirm the card and device numbers, then update ~/.asoundrc and any plughw:X,Y arguments to match.
Whisper too slow: Use the tiny model, not base or small. Tiny is about 75MB; base is about 142MB and noticeably slower on a Pi.
Wake word never triggers: Lower the detection threshold (for example, 0.3 instead of 0.5), accepting more false triggers, or train a custom model with openWakeWord's training script.
Memory errors: Check available memory with free -h and look for stray processes with ps aux | grep python, then stop the ones you don't need. On Pi 4, consider smaller models or adding swap.
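For the memory check, the assistant can also look at available memory itself before loading a model. The sketch below parses text in the /proc/meminfo layout; parse_meminfo, available_mb, and the sample figures are illustrative assumptions, not measurements from a real Pi.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: kibibytes}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info

def available_mb(text):
    """MemAvailable in MiB, rounded down."""
    return parse_meminfo(text)["MemAvailable"] // 1024

# Made-up sample; on the Pi, read the real file with open("/proc/meminfo").read()
SAMPLE = (
    "MemTotal:        8245760 kB\n"
    "MemFree:         5000000 kB\n"
    "MemAvailable:    6291456 kB\n"
)

print(f"{available_mb(SAMPLE)} MiB available")  # 6144 MiB available
```

A check such as available_mb(...) < 500 before loading a larger Whisper model is one way to fall back to tiny automatically; the 500 MiB figure is only an example.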
Summary
A private voice assistant on a Pi 4 or Pi 5 is feasible using whisper.cpp, Piper, and openWakeWord. Expect roughly 10 seconds of response latency in exchange for full privacy: no audio is sent to cloud servers. Start with the basic loop, then integrate it with your own smart home system (GPIO, MQTT, Home Assistant).