Voice Assistant on Raspberry Pi

Build a private voice assistant on a Raspberry Pi using open-source tools. No cloud services required — everything runs locally.

Andreas · April 12, 2026 · 9 min read

Introduction

This guide builds a voice assistant that runs entirely on a Raspberry Pi, so recorded audio never leaves the device. It covers four open-source components: speech-to-text with Whisper (via whisper.cpp), intent matching in plain Python, text-to-speech with Piper, and wake-word detection with openWakeWord.

Prerequisites

  • Raspberry Pi 4 (4GB+) or Pi 5 (8GB+ recommended)
  • USB microphone and speaker (or 3.5mm audio jack)
  • Python 3.9+
  • A Python virtual environment (python3 -m venv)
  • 2GB free disk space
  • NumPy and SciPy (pip install numpy scipy)

Hardware Setup

Connect USB audio devices:

# List USB devices
lsusb | grep -i audio

# Verify sound devices
arecord -l
aplay -l

# Test recording: 5 seconds, mono, 16 kHz (replace 1,0 with your card,device numbers)
arecord -D plughw:1,0 -d 5 -c 1 -r 16000 -f S16_LE test.wav
aplay test.wav

Set default audio device in ~/.asoundrc:

pcm.!default {
  type asym
  playback.pcm "playback"
  capture.pcm "capture"
}

pcm.playback {
  type plug
  slave.pcm "hw:1,0"
}

pcm.capture {
  type plug
  slave.pcm "hw:1,0"
}
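The card and device numbers in hw:X,Y come from the arecord -l and aplay -l listings. As a rough sketch, the helper below pulls those numbers out of the listing text so they can be checked from Python. The function names and the sample listing are illustrative assumptions, not part of ALSA or of this project.

```python
import re
import subprocess

# Matches listing lines such as:
# card 1: Device [USB Audio Device], device 0: USB Audio [USB Audio]
CARD_LINE = re.compile(r"^card (\d+): .*?\[(.+?)\], device (\d+):")

def parse_alsa_devices(listing):
    """Return (card, device, name) tuples from `arecord -l` or `aplay -l` text."""
    devices = []
    for line in listing.splitlines():
        match = CARD_LINE.match(line)
        if match:
            card, name, device = match.groups()
            devices.append((int(card), int(device), name))
    return devices

def list_capture_devices():
    """Run `arecord -l` and parse its output (needs alsa-utils installed)."""
    result = subprocess.run(["arecord", "-l"], capture_output=True, text=True, check=True)
    return parse_alsa_devices(result.stdout)

# Hypothetical sample listing, used here so the parser can be tried without hardware.
SAMPLE = """**** List of CAPTURE Hardware Devices ****
card 1: Device [USB Audio Device], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
"""

for card, device, name in parse_alsa_devices(SAMPLE):
    print(f"{name}: plughw:{card},{device}")
```

On the Pi itself, list_capture_devices() returns the same tuples for the real hardware, and the card,device pair goes into the slave.pcm lines of ~/.asoundrc.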

Step 1 — Speech-to-Text with Whisper.cpp

whisper.cpp is a C/C++ port of OpenAI's Whisper model that runs on the CPU with no Python dependencies, which suits the Pi well. Install from source:

git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make

Download the tiny model (75MB, sufficient for Pi):

bash ./models/download-ggml-model.sh tiny

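Record five seconds of audio and transcribe it. Recent whisper.cpp releases build with CMake and name the binary build/bin/whisper-cli; if ./main is missing, use that path instead:

arecord -D plughw:1,0 -d 5 -c 1 -r 16000 -f S16_LE audio.wav
./main -m models/ggml-tiny.bin audio.wav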

Latency: ~3 seconds for 5 seconds of audio on Pi 5.
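By default the whisper.cpp command-line tool prefixes each transcribed segment with a timestamp range in square brackets, which gets in the way of intent matching. The sketch below strips those prefixes and joins the segments into one string. The function name and the sample text are assumptions made for illustration.

```python
import re

# Matches a leading segment marker such as "[00:00:00.000 --> 00:00:05.000]".
TIMESTAMP = re.compile(
    r"^\[\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}\]\s*"
)

def clean_transcript(raw_output):
    """Remove timestamp prefixes and blank lines, returning a single line of text."""
    parts = []
    for line in raw_output.splitlines():
        text = TIMESTAMP.sub("", line.strip()).strip()
        if text:
            parts.append(text)
    return " ".join(parts)

# Hypothetical output in the shape the CLI prints for a two-segment recording.
raw = """
[00:00:00.000 --> 00:00:03.000]   Turn on the lights.
[00:00:03.000 --> 00:00:05.000]   Please.
"""
print(clean_transcript(raw))
```

The same cleanup is harmless when timestamps are already disabled, since lines without a marker pass through unchanged.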

Step 2 — Intent Matching with Simple Python

Create intent_matcher.py for basic command parsing:

import re

intents = {
    "lights_on": ["turn on the lights", "lights on", "switch on"],
    "lights_off": ["turn off the lights", "lights off", "switch off"],
    "temperature": ["what is the temperature", "how warm", "current temp"],
    "shutdown": ["shut down", "power off", "goodbye"],
}

def match_intent(text):
    text_lower = text.lower()
    for intent, patterns in intents.items():
        for pattern in patterns:
            if pattern in text_lower:
                return intent, pattern
    return None, None

# Test
result, matched = match_intent("turn on the lights please")
print(f"Intent: {result}, Matched: {matched}")

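For production, use fuzzy matching so that small transcription errors still match. Install RapidFuzz first with pip install rapidfuzz, then add to intent_matcher.py:

from rapidfuzz import fuzz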

def match_intent_fuzzy(text, threshold=80):
    text_lower = text.lower()
    best_intent, best_score = None, 0
    
    for intent, patterns in intents.items():
        for pattern in patterns:
            score = fuzz.partial_ratio(text_lower, pattern)
            if score > best_score:
                best_score, best_intent = score, intent
    
    return best_intent if best_score >= threshold else None
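Whisper returns text with capitalization and punctuation, so both matchers behave more predictably when the transcript is normalized first. The following is a minimal sketch using only the standard library; the normalize name is a hypothetical helper, and keeping the apostrophe is a design choice made here so that contractions such as "what's" survive.

```python
import re
import string

# Strip all ASCII punctuation except the apostrophe, so "what's" stays intact.
_STRIP = str.maketrans("", "", string.punctuation.replace("'", ""))

def normalize(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    cleaned = text.lower().translate(_STRIP)
    return re.sub(r"\s+", " ", cleaned).strip()

print(normalize("  Turn ON the lights, please! "))
print(normalize("What's the temperature?"))
```

Calling normalize(text) before match_intent or match_intent_fuzzy means the patterns in the intents dictionary only ever need to be written in lowercase without punctuation.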

Step 3 — Text-to-Speech with Piper TTS

Piper is a neural text-to-speech engine that is fast enough for the Pi and sounds natural. Install the Python package, which provides the piper command:

pip install piper-tts

Download a voice model and its JSON config file (the medium-quality Amy voice used here is roughly 60MB):

wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/medium/en_US-amy-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/medium/en_US-amy-medium.onnx.json

Synthesize and play:

echo "Hello, I'm your voice assistant" | piper \
  --model en_US-amy-medium.onnx \
  --output_file output.wav

aplay output.wav

Latency: ~2 seconds for a 10-word sentence on Pi 5.
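The same synthesis step can be driven from Python. This sketch passes the text to Piper on standard input through subprocess, which avoids shell quoting problems with apostrophes and quotes in the spoken text. It uses only the --model and --output_file flags shown above; the function names and default paths are assumptions.

```python
import subprocess

def piper_command(model_path, output_path):
    """Build the argument list for the piper command used in the shell example."""
    return ["piper", "--model", model_path, "--output_file", output_path]

def synthesize(text, model_path="en_US-amy-medium.onnx", output_path="output.wav"):
    """Send text to Piper on stdin, then play the resulting WAV file with aplay."""
    subprocess.run(
        piper_command(model_path, output_path),
        input=text, text=True, check=True,
    )
    subprocess.run(["aplay", output_path], check=True)

print(" ".join(piper_command("en_US-amy-medium.onnx", "output.wav")))
```

With Piper and the voice model installed, synthesize("Hello, I'm your voice assistant") reproduces the shell pipeline above.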

Step 4 — Wake-Word Detection with openWakeWord

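For hands-free activation, openWakeWord runs wake-word detection locally. It ships pretrained models for phrases such as "hey jarvis"; a custom phrase like "Hey Raspberry Pi" requires training your own model and passing its file path instead.

sudo apt install portaudio19-dev
pip install openwakeword pyaudio numpy

Create wake_word.py:

import numpy as np
import openwakeword.utils
import pyaudio
from openwakeword.model import Model

# One-time download of the pretrained models
openwakeword.utils.download_models()

# Load a pretrained model; pass the path to a custom model file
# to detect a phrase such as "Hey Raspberry Pi"
model = Model(wakeword_models=["hey_jarvis"])

# Audio capture: 1280 samples at 16 kHz is one 80 ms frame
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=1280)

print("Listening for wake word...")
while True:
    # openWakeWord expects raw 16-bit PCM samples, not normalized floats
    audio = np.frombuffer(stream.read(1280, exception_on_overflow=False), dtype=np.int16)
    prediction = model.predict(audio)  # dict: model name -> score between 0 and 1

    name, score = max(prediction.items(), key=lambda item: item[1])
    if score > 0.5:
        print(f"Wake word detected! ({name}, confidence: {score:.2%})")
        # Trigger STT, intent matching, TTS here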
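Because the detector scores every 80 ms frame (1280 samples at 16 kHz), a single spoken wake word usually pushes several consecutive frames over the threshold. One way to avoid firing the pipeline repeatedly is a short cooldown after each trigger. The class below is a hypothetical helper sketched for this purpose, not part of openWakeWord, and the 2-second cooldown is an arbitrary starting value.

```python
import time

FRAME_SAMPLES = 1280
SAMPLE_RATE = 16000
FRAME_SECONDS = FRAME_SAMPLES / SAMPLE_RATE  # 0.08 s of audio per frame

class WakeDebouncer:
    """Accept a detection only if the score is high and the cooldown has elapsed."""

    def __init__(self, threshold=0.5, cooldown=2.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable so the logic can be tested without waiting
        self._last_trigger = None

    def should_trigger(self, score):
        if score <= self.threshold:
            return False
        now = self.clock()
        if self._last_trigger is not None and now - self._last_trigger < self.cooldown:
            return False            # still inside the cooldown window
        self._last_trigger = now
        return True

# Simulated scores for five consecutive frames, using a fake clock.
fake_time = [0.0]
debouncer = WakeDebouncer(clock=lambda: fake_time[0])
for score in [0.1, 0.9, 0.95, 0.8, 0.2]:
    print(score, debouncer.should_trigger(score))
    fake_time[0] += FRAME_SECONDS
```

In the listening loop, the check against 0.5 would become a call to debouncer.should_trigger(score). Lowering the threshold makes detection more sensitive at the cost of more false triggers.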

Step 5 — Complete Voice Assistant Loop

Combine all components in assistant.py:

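#!/usr/bin/env python3
import subprocess

import numpy as np
import openwakeword.utils
import pyaudio
from openwakeword.model import Model
from rapidfuzz import fuzz

AUDIO_PATH = "/tmp/audio.wav"
SPEECH_PATH = "/tmp/output.wav"

intents = {
    "lights_on": ["turn on", "lights on"],
    "lights_off": ["turn off", "lights off"],
    "temperature": ["temperature", "how warm"],
}

def open_mic(p):
    """Open the microphone in 80 ms frames (1280 samples at 16 kHz)"""
    return p.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=1280)

def capture_audio(duration=5):
    """Record audio and save to file"""
    subprocess.run(
        ["arecord", "-D", "plughw:1,0", "-d", str(duration),
         "-c", "1", "-r", "16000", "-f", "S16_LE", AUDIO_PATH],
        check=True,
    )

def transcribe():
    """Speech-to-text; -nt leaves timestamps out of the output"""
    result = subprocess.run(
        ["./whisper.cpp/main", "-m", "./whisper.cpp/models/ggml-tiny.bin", "-nt", AUDIO_PATH],
        capture_output=True, text=True
    )
    return result.stdout.strip()

def match_intent(text, threshold=80):
    """Intent recognition: pick the best-scoring intent, not the first above the threshold.
    Returning the first match would map "turn off" to lights_on, because
    "turn on" and "turn off" score above 80 against each other."""
    best_intent, best_score = None, 0
    for intent, patterns in intents.items():
        for pattern in patterns:
            score = fuzz.partial_ratio(text.lower(), pattern)
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent if best_score > threshold else None

def speak(text):
    """Text-to-speech; text goes to Piper on stdin to avoid shell quoting issues"""
    subprocess.run(
        ["piper", "--model", "en_US-amy-medium.onnx", "--output_file", SPEECH_PATH],
        input=text, text=True, check=True,
    )
    subprocess.run(["aplay", SPEECH_PATH], check=True)

def handle_intent(intent):
    """Execute action based on intent"""
    if intent == "lights_on":
        # Call GPIO or smart home API
        speak("Turning on the lights")
    elif intent == "lights_off":
        speak("Turning off the lights")
    elif intent == "temperature":
        # Placeholder reply; read a real sensor here
        speak("The temperature is 21 degrees")

# Main loop
openwakeword.utils.download_models()  # one-time download of the pretrained models
# "hey_jarvis" is a pretrained model; pass the path to a custom model for other phrases
wake_model = Model(wakeword_models=["hey_jarvis"])
p = pyaudio.PyAudio()
stream = open_mic(p)

print("Listening for wake word...")
while True:
    # openWakeWord expects raw int16 samples and returns {model name: score}
    audio = np.frombuffer(stream.read(1280, exception_on_overflow=False), dtype=np.int16)
    scores = wake_model.predict(audio)
    if max(scores.values()) > 0.5:
        print("Wake word detected!")
        stream.close()  # free the microphone so arecord can open it
        speak("I'm listening")

        capture_audio(duration=5)
        text = transcribe()
        print(f"You said: {text}")

        intent = match_intent(text)
        if intent:
            handle_intent(intent)
        else:
            speak("I didn't understand that")

        wake_model.reset()  # clear buffered audio so the same utterance does not retrigger
        stream = open_mic(p)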

Run:

python3 assistant.py
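As more intents are added, the if/elif chain in handle_intent can be replaced with a dispatch table that maps each intent name to a function. The sketch below shows that pattern with placeholder handlers that only return the sentence to speak. The handler names and the respond function are illustrative assumptions, and the temperature value is the same placeholder used in assistant.py.

```python
def lights_on():
    # Placeholder: call GPIO or a smart home API here.
    return "Turning on the lights"

def lights_off():
    # Placeholder: call GPIO or a smart home API here.
    return "Turning off the lights"

def temperature():
    # Placeholder value; a real handler would read a sensor.
    return "The temperature is 21 degrees"

# Intent name -> handler function. Adding an intent means adding one entry.
HANDLERS = {
    "lights_on": lights_on,
    "lights_off": lights_off,
    "temperature": temperature,
}

def respond(intent):
    """Return the sentence to speak for an intent, or a fallback if it is unknown."""
    handler = HANDLERS.get(intent)
    if handler is None:
        return "I didn't understand that"
    return handler()

for name in ["lights_on", "temperature", None]:
    print(name, "->", respond(name))
```

In assistant.py the main loop would then call speak(respond(intent)), which also folds the "I didn't understand that" branch into one place.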

Latency Reality Check

Component                           Time on Pi 5
Wake word detection (continuous)    <100ms
Audio capture (5 sec)               5s
Speech-to-text (Whisper tiny)       3s
Intent matching                     <100ms
Text-to-speech (Piper)              2s
Total response time                 ~10s

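Expect roughly 10 seconds between the wake word and the spoken reply, half of which is the fixed 5-second recording window. Cloud assistants (Alexa, Google) respond in 1–2 seconds largely because they stream audio to datacenter hardware and start processing while you are still speaking. Local processing is private but slower.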
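The figures above are approximate, so it is worth measuring each stage on your own hardware. The sketch below is a small timing helper built on time.perf_counter. StageTimer is a hypothetical name, and the sleep calls stand in for the real capture and transcription steps.

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Record how long each named stage of the pipeline takes."""

    def __init__(self):
        self.durations = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Store the elapsed time even if the stage raised an exception.
            self.durations[name] = time.perf_counter() - start

    def total(self):
        return sum(self.durations.values())

    def report(self):
        lines = [f"{name:<12} {seconds:6.2f}s" for name, seconds in self.durations.items()]
        lines.append(f"{'total':<12} {self.total():6.2f}s")
        return "\n".join(lines)

timer = StageTimer()
with timer.stage("capture"):
    time.sleep(0.05)   # stand-in for capture_audio()
with timer.stage("stt"):
    time.sleep(0.03)   # stand-in for transcribe()
print(timer.report())
```

Wrapping capture_audio(), transcribe(), match_intent() and speak() in timer.stage(...) blocks inside the main loop shows where the seconds actually go before you try to optimize anything.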

Troubleshooting

Audio not working — Run arecord -l and aplay -l, confirm device numbers. Update ~/.asoundrc with correct hw:X,Y values.

Whisper too slow — Use the tiny model, not base/small. Tiny is 75MB; base is 140MB.

Wake word never triggers — Train a custom model with openWakeWord's training script, or adjust the detection threshold (e.g., > 0.3 instead of 0.5).

Memory errors — Check available memory with free -h and look for stray processes with ps aux | grep python, then stop any you don't need. On Pi 4, consider smaller models or adding swap.
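The assistant can also check memory itself before loading models. On Linux, /proc/meminfo reports MemAvailable in kB, and the sketch below parses it with the standard library. The function names and the sample text are assumptions for illustration.

```python
def parse_meminfo(text):
    """Return /proc/meminfo fields as a dict of integer values (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values

def available_mb(path="/proc/meminfo"):
    """Read the file and return available memory in megabytes (Linux only)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())["MemAvailable"] / 1024

# Hypothetical excerpt in the format the kernel uses.
sample = "MemTotal:        8000000 kB\nMemFree:         1000000 kB\nMemAvailable:    4096000 kB\n"
print(parse_meminfo(sample)["MemAvailable"] / 1024, "MB available")
```

Calling available_mb() at startup and printing a warning below some limit of your choosing gives an earlier and clearer signal than an out-of-memory crash halfway through transcription.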

Summary

A private voice assistant on Pi 4/5 is feasible using Whisper, Piper, and openWakeWord. Expect 10-second response latency but full privacy—no audio sent to cloud servers. Start with the basic loop, then integrate with your own smart home system (GPIO, MQTT, Home Assistant).
