Offline Voice Control: Building a Hands-Free Mobile App with On-Device AI
Imagine you’re a field engineer repairing equipment on a remote site: your hands are full, the environment is noisy, and connectivity is spotty. In such constrained environments, hands-free voice control can be a game-changer. Voice commands let users interact with mobile or embedded apps without touching the screen, improving safety and efficiency. However, traditional voice assistants often depend on cloud services, which isn’t always practical in the field. This post explores how to build an offline, real-time voice control system for mobile apps using on-device AI. We’ll use Switchboard, a toolkit for real-time audio and AI processing, to achieve reliable voice interaction entirely on-device.
Why Offline Voice Control Matters
Offline voice control offers several key advantages over cloud-based solutions:
Low Latency: Processing speech on-device eliminates network round-trips. The result is near-instant response time, which is crucial for a natural user experience. For example, OpenAI’s Whisper ASR running locally responds with much lower latency because no cloud server is involved. Real-time interaction feels snappier without the 200ms to 500ms overhead of sending audio to a server and waiting for a reply.
Reliability Anywhere: An offline voice UI works anytime, anywhere: even in a basement, rural area, or airplane mode. There’s no dependence on an internet connection, so the voice commands still function in low or no-connectivity environments. This reliability is critical for field use cases where network access can’t be assumed.
Cost Efficiency: Cloud speech APIs may seem inexpensive per request, but costs add up at scale (and can spike with usage). With on-device speech recognition, once the model is on the device, each additional voice command is essentially free. There are no hourly or per-character fees for transcription or synthesis, making offline solutions far more cost-effective for high-volume or long-duration use.
Privacy & Compliance: Keeping voice data on-device means sensitive audio never leaves the user’s control. Cloud-based voice assistants send recordings to servers, raising concerns about data breaches or violating regulations. Offline processing mitigates these risks; no audio streams over the internet, which is especially important in domains like healthcare, defense, or enterprise settings with strict data policies. By design, an on-device voice assistant provides strong privacy guarantees.
In short, offline voice control gives you speed, dependability, and user trust that cloud-dependent solutions often can’t match. It lets your app’s voice features work in real time under all conditions, without recurring service fees or privacy headaches.
Problems with Cloud-Based Voice UIs
Conversely, cloud-reliant voice user interfaces come with several pitfalls that affect both developers and users:
Connectivity Issues: A cloud voice UI simply fails when offline. If a technician is in a dead zone or a secure facility with no internet, cloud speech recognition won’t function: no network, no voice UI. Even with connectivity, high latency or jitter can degrade the experience (e.g. delays or mid-command dropouts).
Ongoing Costs: Relying on third-party speech services means ongoing usage fees. What starts cheap in prototyping can become expensive at scale, or if you hit tier limits. For instance, transcribing audio via a popular cloud API might cost on the order of $0.18 per hour of audio, and those costs accumulate every time a user talks to your app: a thousand users each speaking an hour a day adds up to roughly $180 per day. This can hurt the viability of voice features in a high-usage app.
Compliance and Privacy Risks: Many industries have regulations that forbid sending user data (especially voice, which may contain personal or sensitive info) to external servers. Cloud voice services introduce data residency and security concerns, since audio is streamed and stored outside the device. There’s an inherent risk in transmitting customer conversations to the cloud. Meeting GDPR, HIPAA, or internal compliance standards becomes much harder with a cloud pipeline.
Battery Drain: Constantly streaming audio to the cloud can also impact battery life. The device’s radios (Wi-Fi or cellular) must stay active, using power for data transmission. In contrast, on-device processing can be optimized to use the device’s local compute resources more efficiently. While running AI models locally does consume CPU, modern on-device models can be tuned to balance performance and energy use. Avoiding the network can actually save power in scenarios where the alternative is an always-on uplink.
The bottom line: cloud voice UIs may work for casual consumer use, but they can stumble in mission-critical or resource-constrained scenarios. An app meant for field work, offline environments, or privacy-sensitive tasks demands an on-device solution to ensure it’s fast, reliable, and secure under all conditions.
Demo Use Case: Voice Commands for Field Service
To make this concrete, let’s consider a demo use case: a field service mobile app for maintenance technicians. These users are frequently away from a desk: climbing ladders, wearing gloves, working in tight spaces. That makes hands-free operation highly valuable. We’ll imagine our app helps a technician manage work orders via voice. For example, the user could say: “Open ticket for unit 42.”
In this scenario, the app would interpret the speech command, create a new maintenance ticket for equipment unit 42, and confirm back to the user (perhaps saying “Ticket 42 opened”). All of this needs to happen offline in real time, since the technician might be in a factory basement with no connectivity. Low latency is important so the workflow isn’t slowed down, and accuracy is key because misrecognitions could lead to the wrong unit number.
This demo encompasses a complete voice interaction loop:
Voice capture: continuously listen for the user’s speech.
Speech recognition (STT): transcribe the spoken command into text.
Command parsing: understand the intent (open a ticket) and extract details (unit number 42).
Take action: create the ticket in the app’s database/backend.
Text-to-speech (TTS): speak a confirmation back to the user.
We’ll build this with Switchboard, which provides a convenient way to set up an on-device voice pipeline for such an app.
Why Switchboard?
Switchboard is a framework of modular audio and AI components built for real-time, on-device processing. It allows you to assemble custom audio pipelines (called audio graphs) with minimal integration effort. For our offline voice control app, Switchboard brings several benefits:
On-Device, Real-Time Processing: All voice data stays on the device, and inference happens locally with minimal latency. Switchboard’s nodes leverage efficient libraries; for example, the Silero VAD model can analyze a 30ms audio frame in under 1ms on a single CPU thread. OpenAI’s Whisper model (for STT) is integrated via C++ for speed, achieving low latencies on CPU even on mobile hardware. This means voice commands can be recognized and responded to essentially in real time, without needing any cloud compute.
Cross-Platform Simple Integration: Switchboard provides a unified API across iOS (Swift), Android (Kotlin/C++), desktop (macOS/Windows/Linux in C++), and even embedded Linux. You can integrate it as a library or even design your audio graph visually in the Switchboard Editor and deploy it to different platforms. Our focus here is iOS, but the same graph can run on Android or a Raspberry Pi with minimal changes. This flexibility is a boon for teams targeting multiple environments.
All-in-One Voice Pipeline: Out of the box, Switchboard includes nodes for the core tasks we need: Voice Activity Detection (VAD), Speech-To-Text (STT), Intent processing via an LLM or rule engine, and Text-To-Speech (TTS). Under the hood it uses proven open-source models: Silero VAD to detect speech segments, OpenAI Whisper (via whisper.cpp) for transcription, and Silero TTS for voice synthesis (or other TTS engines as extensions). There’s even support to incorporate a local LLM (e.g. Llama 2 via llama.cpp) to handle more complex intent logic. Because these components are pre-integrated as Switchboard nodes, you don’t have to stitch together separate libraries or processes; they all run in one seamless audio graph.
No GPU Required: Switchboard’s AI nodes are optimized for both GPU and CPU execution, often using quantized models and efficient C++ inference. You do not need a dedicated GPU or Neural Engine to run this pipeline. For example, Whisper’s tiny/base models run comfortably on modern mobile ARM CPUs, and the entire pipeline (VAD → STT → LLM → TTS) can run on a typical smartphone or embedded board in real time. This makes the solution viable on devices like iPhones, Android phones, or edge IoT hardware without specialized accelerators. It also simplifies deployment: no extra drivers or cloud instances needed.
In short, Switchboard provides the building blocks to implement offline voice control quickly and robustly. We get to focus on our app’s logic (the “open ticket” functionality) rather than low-level audio processing or model integration details. Next, let’s look at the architecture and how these pieces connect together.
Architecture Overview
To build our voice control system, we set up an audio graph in Switchboard with the following key nodes:
SileroVADNode: Listens to the microphone audio and detects when the user starts and stops speaking. This voice activity detector filters out background noise and avoids sending silence to the speech recognizer. It outputs events (like “speech started” and “speech ended”) which we use to trigger transcription.
WhisperNode: Takes in audio and produces text transcripts using the Whisper speech-to-text model. This node will give us the recognized command, e.g. converting the audio of "open ticket for unit 42" into the string "open ticket for unit 42". We configure it to run continuously but only actually transcribe when triggered by the VAD (to save compute and improve accuracy).
IntentLLMNode (or Intent Logic Node): Processes the transcribed text to decide what the user intends and what to do. This could be a simple rule-based parser (e.g. find the phrase "open ticket" and extract the unit number) or a more sophisticated LLM prompt that interprets free-form speech. In Switchboard, you can route text into a local LLM to handle intent if needed. In our example, the intent is straightforward, so a small function will parse the command.
TTSNode: Converts a text response into spoken audio in real time. We’ll use a TTS node to output a confirmation like “Opened ticket for unit 42.” Switchboard’s TTS node (backed by Silero TTS) generates speech waveform on-device, which we send to the device speaker.
All these nodes run on-device and are connected in a pipeline. The overall flow is: Microphone → VAD → STT → Intent Logic → (App action + TTS) → Speaker.
The microphone stream is fed into both the VAD and STT components (in Switchboard we use a splitter node to branch the audio). The VAD continuously monitors the audio, but we only proceed when it detects actual speech (voice activity). When the user finishes speaking (the VAD detects the end of the utterance), it triggers the Whisper STT node to transcribe just that segment of audio. The recognized text is then passed into our intent logic (which could be a custom Swift function or an LLM node). The app processes the intent (e.g. opening a ticket in the database) and generates a response string. Finally, that response text is sent into the TTS node, which synthesizes audible speech feedback for the user through the speaker output. All of this happens within a fraction of a second to a couple of seconds (depending on the length of the utterance and the model sizes), enabling a smooth conversational interaction without any cloud calls.
Now that we understand the architecture, let’s walk through building this step-by-step.
Hands-On Guide
In this section, we’ll go through the implementation process with Switchboard, from setting up the SDK to handling voice commands in code. We assume you’re using Swift on iOS for the examples, but similar classes and APIs exist for Kotlin/C++ on Android and other platforms.
1. Setup: Adding the Switchboard SDK
First, add Switchboard to your project. The easiest way on iOS is via Swift Package Manager. You can add the Switchboard SDK by entering its GitHub URL in Xcode (the package repository is https://github.com/switchboard-sdk/switchboard-sdk-ios). Alternatively, download the precompiled SwitchboardSDK.xcframework from the Switchboard website and embed it in your project. Make sure to sign up for a free API key on the Switchboard console to get your appID and appSecret; the SDK uses these to enable the audio engine.
Before using any Switchboard features, initialize the SDK (a good place is in your AppDelegate). For example:
import UIKit
import SwitchboardSDK

@main
class AppDelegate: UIResponder, UIApplicationDelegate {

    func application(_ application: UIApplication,
                     didFinishLaunchingWithOptions launchOptions: [UIApplication.LaunchOptionsKey: Any]?) -> Bool {
        // Initialize Switchboard with your credentials
        SwitchboardSDK.initialize(
            appID: "YOUR_APP_ID",
            appSecret: "YOUR_APP_SECRET"
        )
        return true
    }
}
This registers your app with the Switchboard SDK using the credentials from your dashboard. Now we’re ready to construct the audio graph for voice control.
2. Using SileroVADNode for Voice Activity Detection
Next, we set up the audio processing graph and add a VAD node to detect speech. The VAD will help us ignore background noise or pauses and focus only on actual spoken commands. Switchboard’s SileroVADNode uses a neural network to accurately detect speech segments in real time. We’ll create one and attach it to the microphone input.
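As a rough sketch of what that looks like in isolation, the snippet below creates the VAD node and hooks up speech-start and speech-end handlers. The property and callback names used here (threshold, onSpeechStarted, onSpeechEnded) are illustrative assumptions rather than the documented Switchboard API, so check the SDK reference for the exact names on your version:

let vadNode = SBSileroVADNode()

// Hypothetical tuning and event hooks -- the names are assumptions, not the documented API.
// vadNode.threshold = 0.5               // how aggressively frames are classified as speech
vadNode.onSpeechStarted = {
    print("VAD: speech started")         // the user began talking
}
vadNode.onSpeechEnded = {
    print("VAD: speech ended")           // we'll use this event to trigger transcription in step 4
}

The node itself is added to the audio graph and wired to the microphone in the next step.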
3. Streaming Audio to Whisper STT
We also add the Whisper STT node to transcribe speech to text. Whisper is a high-accuracy speech recognition model that runs fully offline. We’ll use a smaller Whisper model (for instance, the Tiny or Base model) to ensure real-time performance on device. The STT node will receive audio frames and output text whenever it transcribes something.
In our graph, the microphone input needs to go to both the VAD and STT nodes in parallel. We accomplish this by first converting the mic input to mono (Whisper expects a mono audio stream) and then splitting the stream. Switchboard provides a MultiChannelToMonoNode for channel mixing and a BusSplitterNode to fork the audio flow. Below is how we set up the nodes and connections in Swift:
import SwitchboardSDK
// 1. Create the audio engine and an empty graph
let audioEngine = SBAudioEngine()
let audioGraph = SBAudioGraph()
// 2. Instantiate the necessary nodes
let monoNode = SBMultiChannelToMonoNode() // convert stereo mic input to mono
let splitterNode = SBBusSplitterNode() // split audio into two paths
let vadNode = SBSileroVADNode() // voice activity detector
let sttNode = SBWhisperSTTNode() // speech-to-text (Whisper)
let ttsNode = SBSileroTTSNode() // text-to-speech
// 3. Add nodes to the audio graph
[audioGraph.inputNode, monoNode, splitterNode, vadNode, sttNode, ttsNode]
.forEach { audioGraph.addNode($0) }
// 4. Connect audio signal path: Mic -> Mono -> Splitter -> (VAD + STT)
audioGraph.connect(audioGraph.inputNode, to: monoNode)
audioGraph.connect(monoNode, to: splitterNode)
audioGraph.connect(splitterNode, to: vadNode)
audioGraph.connect(splitterNode, to: sttNode)
// Also connect TTS output to the audio output (speaker)
audioGraph.connect(ttsNode, to: audioGraph.outputNode)
// 5. Start the audio engine with the configured graph
audioEngine.start(audioGraph)
Let’s break down what this code does:
We initialize SBAudioEngine and SBAudioGraph to manage the audio I/O and processing graph.
We create instances of our nodes: a VAD node (SBSileroVADNode), a Whisper STT node, and a Silero TTS node. We also set up an SBMultiChannelToMonoNode (to downmix the microphone input from potentially stereo to mono) and an SBBusSplitterNode (to duplicate the mono stream).
We add the processing nodes, along with the graph’s built-in audioGraph.inputNode, to the graph (the default audioGraph.outputNode is used below when we connect the TTS output).
We connect the nodes: the microphone input feeds into the mono converter, then into the splitter. The splitter sends the audio to two places: the VAD node and the STT node. Finally, the TTS node is connected to the graph’s output (speaker). At this point, the audio engine is piping mic audio into the STT and VAD, and it can play audio from the TTS out through the speaker.
We start the engine to begin audio capture and processing. (Ensure you have microphone permission in your app’s Info.plist as noted in Switchboard’s docs.)
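For completeness: microphone access on iOS needs an NSMicrophoneUsageDescription entry in Info.plist plus a runtime permission grant. A minimal way to gate the engine start on that grant, using standard AVFoundation (this part is independent of Switchboard):

import AVFoundation

// Request microphone permission, then start the audio engine only if it was granted.
AVAudioSession.sharedInstance().requestRecordPermission { granted in
    DispatchQueue.main.async {
        if granted {
            audioEngine.start(audioGraph)
        } else {
            print("Microphone permission denied -- voice control is unavailable")
        }
    }
}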
Now our app is actively listening and ready to handle voice input. The VAD is monitoring the audio stream for voice activity. The Whisper node is loaded (it will initialize the model, which may take a second on first run) and standing by. But we haven’t yet told the system when to transcribe or how to handle the recognized text. That’s where the next steps come in.
4. Implementing Intent Logic (LLM or Rule-Based)
With audio flowing through the graph, we need to decide how to interpret the transcribed text. There are two approaches here:
Rule-Based Parsing: If your voice commands are relatively simple or structured (like our “open ticket for [unit]” example), you can implement the intent logic with straightforward code. For instance, you might look for certain keywords in the text or use a regular expression to extract parameters (like the unit number). This approach is fast and adds no extra runtime overhead, but it’s less flexible about how users phrase commands (a concrete sketch follows this list).
LLM-Based Interpretation: For more complex or open-ended interactions, you can incorporate a local Large Language Model via Switchboard. The IntentLLMNode can route the text into an on-device LLM (such as a quantized Llama 2 model) which can parse the intent or even carry on a dialogue. This would allow understanding a variety of phrasings (“I need to create a ticket for unit 42” or “there’s an issue with unit 42, open a new case”). The LLM could output a formalized command or even directly produce the response text. While powerful, this approach uses more CPU and memory, but since it’s all on-device, it’s still privacy-safe. You’d choose a model small enough to run on your target device (e.g. a 7B parameter model with 4-bit quantization for mobile).
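To make the rule-based route from the first bullet concrete, here is a minimal, self-contained sketch in plain Swift (no Switchboard APIs involved): an Intent type plus a parser that matches the “open ticket” phrase and pulls out the first number it finds. The matching rules are illustrative, not exhaustive.

import Foundation

enum Intent {
    case openTicket(unit: Int)
    case unknown
}

func parseIntent(from transcript: String) -> Intent {
    let text = transcript.lowercased()
    // Match "open ticket" (or "open a ticket") anywhere in the utterance.
    guard text.contains("open ticket") || text.contains("open a ticket") else {
        return .unknown
    }
    // Pull out the first number, e.g. 42 from "open ticket for unit 42".
    let unit = text.components(separatedBy: CharacterSet.decimalDigits.inverted)
        .compactMap(Int.init)
        .first
    if let unit = unit {
        return .openTicket(unit: unit)
    }
    return .unknown
}

The transcription callback shown a bit further below does essentially the same check inline; factoring it out like this just makes it easier to add more commands later.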
For our field app demo, we’ll implement a simple rule-based intent handler in Swift, since the voice command format is known. We’ll use the transcribed text from Whisper and determine if it’s an “open ticket” command, then parse out the unit number.
Switchboard allows us to get the transcription result via a callback or by connecting the STT node’s output. One convenient method is to use the delegate or closure that provides the final recognized text. We can then perform intent logic in that callback. For example:
// Assume we've set up the graph as above and started the engine.
// Now configure the STT node's callback for when a transcript is ready:
sttNode.onTranscription = { recognizedText in
    let command = recognizedText.lowercased()
    if command.starts(with: "open ticket") {
        // Extract the unit number (e.g. "42") from the command
        let unitNumber = command.components(separatedBy: CharacterSet.decimalDigits.inverted)
            .compactMap(Int.init).first
        if let unit = unitNumber {
            // 1. Trigger the app action: open a ticket in the system
            TicketManager.shared.openTicket(forUnit: unit)
            // 2. Use TTS to confirm the action to the user
            ttsNode.speak("Opened ticket for unit \(unit).")
        } else {
            ttsNode.speak("Opened a new ticket.") // fallback if number not found
        }
    } else {
        // Handle unrecognized command or pass to LLM for further handling
        ttsNode.speak("Sorry, I didn't catch that.")
    }
}
In this snippet, whenever Whisper produces a transcription, we:
Normalize the text to lowercase and check if it starts with "open ticket". (This is a simplistic check; you could make it more robust with NLP libraries or patterns.)
Extract the numeric portion of the command to get the unit number. We use a quick trick of stripping non-digits and taking the first number we find (e.g., from "unit 42" we get the integer 42).
If we got a unit number, we call our app’s ticket management logic to actually open the ticket (this would be your app-specific code; for demonstration we call a TicketManager.shared.openTicket method).
We then instruct the TTS node to speak a confirmation like "Opened ticket for unit 42." The ttsNode.speak(...) method feeds the given text into the TTS processing pipeline, and since our TTS node is connected to the output, the user will hear this via the device speakers.
If the command didn’t match "open ticket", we handle it as an unrecognized command. Here, we simply respond with a polite failure message via TTS. In a more advanced app, you might pass the text to an LLM node for further interpretation, or handle other keywords accordingly.
A key detail is that we only want onTranscription to fire after the user has finished speaking the command. That’s why we have the VAD in place. We’d configure the VAD node to signal the STT node when to start or stop transcribing. In Switchboard’s graph, this can be done by connecting the VAD’s end-of-speech event to the Whisper node’s trigger (so Whisper only transcribes after the user pauses). In the code above, we assumed onTranscription gives us the final result of that triggered transcription.
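One way that wiring could look in code, reusing the hypothetical VAD callbacks from step 2 and assuming the Whisper node exposes start/stop-style trigger methods (again, startTranscription and stopTranscription are illustrative assumptions, not the documented Switchboard API):

// Gate transcription on voice activity so Whisper only processes complete utterances.
// Method and callback names are assumptions -- consult the Switchboard docs for the real API.
vadNode.onSpeechStarted = {
    sttNode.startTranscription()    // begin capturing/decoding this utterance
}
vadNode.onSpeechEnded = {
    sttNode.stopTranscription()     // finalize; onTranscription then fires with the result
}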
5. Triggering App Actions from Intents
As shown in the code, once the intent is recognized, we invoke the necessary app action: in this case, opening a new ticket in the app. This would typically involve updating your model/database and UI. The voice control pipeline simply acts as another input modality (like a voice-driven button press). You can update the app’s state directly in the transcription callback or dispatch an event in your app’s architecture (e.g. post a Notification or call a ViewModel method). The important part is that the voice command translates into a real action in the app.
Switchboard doesn’t dictate how you structure your app logic; it just provides the voice interface. In our example, TicketManager.shared.openTicket(forUnit:) could create a new ticket object and maybe post a notification that the UI observes to show the ticket details. The integration can be as tight or loose as you prefer. The main point is that offline voice control can tie into existing app functionality seamlessly; the app treats it like any other user input.
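As a purely hypothetical illustration of the app side (plain application code, not part of Switchboard), TicketManager could be a small singleton that stores the ticket and posts a notification for the UI to observe:

import Foundation

final class TicketManager {
    static let shared = TicketManager()
    private init() {}

    func openTicket(forUnit unit: Int) {
        // Persist the new ticket in your local store or sync queue here...
        // ...then notify interested UI components so they can display it.
        NotificationCenter.default.post(
            name: Notification.Name("TicketOpened"),
            object: nil,
            userInfo: ["unit": unit]
        )
    }
}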
6. Speaking Responses with TTS
To complete the loop, we give the user feedback via spoken response. We’ve already utilized the ttsNode in the code above to speak the confirmation. The TTS node in Switchboard uses a lightweight text-to-speech model (from Silero) that runs on-device. By default it may use a generic English voice. You can customize aspects of the voice output if needed, for example, choosing a male/female voice or adjusting the speaking rate, depending on what the TTS engine supports. Since everything is local, the response is generated quickly, and the user hears the result within moments of speaking their command.
One advantage of having TTS in the pipeline is you can provide audio feedback in any language or style that suits your app. For instance, the app could read back more details: "Opened ticket 42 for Generator Unit – priority set to High." Because it’s offline, even the synthesized response text is not sent to any server. This keeps the entire interaction private.
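As a small illustration, the richer readback is just another speak call; the commented voice settings are hypothetical property names, since the exact options depend on what the TTS engine exposes:

// Richer spoken feedback, generated fully on-device.
ttsNode.speak("Opened ticket 42 for Generator Unit. Priority set to High.")

// Hypothetical voice customization -- property names are assumptions, not the documented API:
// ttsNode.voice = "en_female_1"
// ttsNode.speakingRate = 1.1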
At this point, our field service app demo is functional: the user says “Open ticket for unit 42,” the app hears it, understands it, creates the ticket, and speaks back a confirmation; all without any network connection or cloud services involved.
Going Further
We have built a basic hands-free voice control feature, but there are many ways you can extend and refine this system:
Better Noise Handling: In a field environment, background noise can be a challenge. Switchboard provides noise suppression nodes like RNNoiseNode (a denoising ML model) and others that you can put in the pipeline before the STT node (a wiring sketch follows this list). You can also tune the Silero VAD’s sensitivity or use a noise gate to ignore constant hums. Selecting the right Whisper model size for accuracy versus speed also matters if the domain has lots of noise or technical jargon; larger models like Whisper Small or Medium may give better accuracy at the cost of some speed, so balance the trade-off for your target devices and users.
Custom Wake Words: Our current setup is always listening, which might not be ideal for battery or user experience. You can incorporate a wake word (like “Hey AppName”) to activate voice processing only when needed. Switchboard can integrate with Picovoice Porcupine or other wake word detectors as nodes. This way, the VAD+STT only runs after the wake word is detected, saving power and avoiding unintended commands. In an embedded scenario, a wake word can be extremely low-power compared to running full ASR constantly.
Multiple Intents and Dialogues: We demonstrated one command, but you can extend the intent handler to support multiple voice commands (open ticket, close ticket, lookup manual, etc.). For complex interactions, consider using an LLM to manage a dialogue. Switchboard’s LLM integration could maintain context; e.g. the user could ask "What’s the status of unit 42?" after opening the ticket, and the LLM node (with some prompt engineering) could fetch that info and reply via TTS. This would turn your app into a more conversational assistant, all offline. Just be mindful of the device limitations when adding more AI tasks.
Error Handling and Retries: In practice, you’ll want to handle cases where the speech wasn’t clear or the STT confidence is low. You might implement simple retry logic: if the transcription confidence or intent match is below a threshold, ask the user to repeat (using TTS to say “Sorry, could you repeat that?”). Whisper doesn’t provide confidence scores out of the box, but you can estimate one or fall back on heuristics (e.g. no intent identified). Ensuring a smooth fallback will improve usability.
Multi-Language Support: Whisper models can handle many languages. If your app needs to support multilingual users, you could set the WhisperNode to auto-detect language or explicitly load models for the target languages. Switchboard allows switching out models or running multiple STT nodes if needed (though running two large models at once on device might be heavy). Similarly, you can use TTS voices for different languages, all without cloud services. This is great for apps that must operate in remote regions with various local languages (imagine an agriculture app for remote villages, etc.).
Deployment on Embedded Devices: While our example was mobile-focused, the same pipeline can run on an embedded Linux board or even inside a desktop app. You might deploy an offline voice-controlled interface on an industrial device or a kiosk. Switchboard’s C++ API lets you integrate into such environments. The absence of cloud dependencies means you just have to ship the model files and the binary; everything runs entirely on-premises. Do monitor memory and CPU usage on lower-end hardware and use the smallest models that meet your accuracy needs.
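Picking up the noise-handling bullet above: a denoiser is just another node in the graph. The sketch below assumes the class is exposed as SBRNNoiseNode (matching the naming of the other nodes; the docs refer to it as RNNoiseNode) and places it between the mono converter and the splitter so both the VAD and the STT receive cleaned audio:

// Assumed class name (SB prefix) for the RNNoise denoiser node.
let denoiseNode = SBRNNoiseNode()
audioGraph.addNode(denoiseNode)

// Rewire: Mic -> Mono -> Denoise -> Splitter -> (VAD + STT)
// (these two connections replace the earlier monoNode -> splitterNode connection)
audioGraph.connect(monoNode, to: denoiseNode)
audioGraph.connect(denoiseNode, to: splitterNode)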
There is a lot of room to tailor the solution to your specific use case. The modular nature of Switchboard means you can plug and play components (swap Whisper STT with another model, or Silero TTS with say a custom TTS) and tweak the graph configuration. Switchboard’s documentation is a great resource to learn about available nodes and best practices for real-time audio graphs.
Available Now
Hands-free voice control is no longer limited to big-tech assistants. With on-device AI, any mobile or embedded app can have a reliable voice interface that works offline. By using Switchboard’s SDK, we integrated state-of-the-art speech models (for VAD, STT, and TTS) into a cohesive pipeline, all running locally. The result is an app that’s faster (low latency), cheaper (no cloud fees), more secure (user data never leaves the device), and more robust (it works in a Faraday cage or the middle of nowhere). These benefits are game-changing for developers building solutions in industrial, healthcare, military, or any scenario where cloud connectivity isn’t guaranteed or desirable.
We demonstrated a field service app that opens tickets via voice, but the possibilities are endless: from voice-controlled IoT appliances, to offline voice assistants for vehicles, to mobile apps that users can operate while exercising or driving. With on-device voice AI, you control the experience end-to-end, and users get the convenience of voice interaction with full privacy and reliability.
If you’re ready to add offline voice capabilities to your own app, give Switchboard a try. The SDK is actively maintained, and the official documentation has detailed guides and examples to get you started. You can start with a simple command or two and gradually build up a powerful voice UX tailored to your domain. Empower your users to talk to your app anywhere, no internet required. Happy coding, and happy talking!
Early Access
We're opening early access to Switchboard. If you're building something serious with AI and audio and you're tired of paying big bucks for cloud AI, this is for you. It gave us back the control we needed to make our app reliable, and we think it'll do the same for you.
Want to see what else we're building? Check out Switchboard and Synervoz.