Voice Kiosk on QL601: Building a Low-Power, Voice-First Edge Terminal
Imagine walking up to a kiosk in a busy fast-food restaurant. Instead of tapping through a complex menu on a touchscreen, you simply say, "I'd like a cheeseburger, no onions, with a side of fries and a vanilla milkshake." The kiosk confirms your order instantly, and you're ready to pay. This isn't science fiction; it's the future of customer interaction, and it's powered by AI running directly at the edge.
For years, conversational AI has been tethered to the cloud, leading to API fees, reliance on a stable internet connection, and the need to send sensitive customer data to third-party servers. By processing data locally, we can achieve instantaneous responses, ensure that private conversations remain private, and build systems that are completely reliable, even when the internet isn't.
This is precisely what we've built. Using our QL601 AI turnkey solution, powered by the Qualcomm® QCS6490 chipset, we've engineered a low-power, self-contained voice conversational AI terminal. We tackled a challenge that isn't officially supported out-of-the-box by the manufacturer: we built the entire voice AI pipeline from scratch. This demonstrates not only the power of our hardware but also our unique expertise in unlocking its full potential for your products.
The Core Challenge
To create a seamless voice-first experience, you need a sophisticated pipeline that can intelligently process human speech. The architecture is straightforward in concept:
```mermaid
graph TD
    A@{ shape: text, label: "audio" }
    A --> B["VAD (Voice Activity Detection)"]
    B --> C["Speech Recognition (ASR)"]
    C --> D["Language Model (SLM)"]
    D --> E@{ shape: text, label: "text response" }
```
The voice AI pipeline.
However, while Qualcomm's QIM SDK provides handy GStreamer plugins for building visual AI pipelines, similar plugins for audio are noticeably absent.
This was our core challenge. There was no pre-built, optimized path to get from a raw audio stream to an AI-powered text response. To make our voice kiosk a reality, our team had to engineer the entire voice AI pipeline from the ground up, creating a bridge between the hardware and the complex AI models that would give the device its voice.
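To give a feel for where that bridge begins, the sketch below shows one way to pull raw microphone audio into custom code using stock GStreamer audio elements and an `appsink`. It is illustrative only: the element chain and the `process_audio()` hook are assumptions for the example, not our production pipeline.

```cpp
// Illustrative sketch: capture 16 kHz mono PCM with stock GStreamer elements
// and hand each buffer to a custom callback. The element chain and the
// process_audio() hook are assumptions, not the production pipeline.
#include <gst/gst.h>
#include <gst/app/gstappsink.h>
#include <cstddef>
#include <cstdint>

static void process_audio(const int16_t* samples, std::size_t count) {
    // In the full system this would feed the VAD and ASR stages.
    (void)samples; (void)count;
}

static GstFlowReturn on_new_sample(GstAppSink* sink, gpointer /*user_data*/) {
    GstSample* sample = gst_app_sink_pull_sample(sink);
    if (!sample) return GST_FLOW_ERROR;

    GstBuffer* buf = gst_sample_get_buffer(sample);
    GstMapInfo map;
    if (gst_buffer_map(buf, &map, GST_MAP_READ)) {
        process_audio(reinterpret_cast<const int16_t*>(map.data),
                      map.size / sizeof(int16_t));
        gst_buffer_unmap(buf, &map);
    }
    gst_sample_unref(sample);
    return GST_FLOW_OK;
}

int main(int argc, char** argv) {
    gst_init(&argc, &argv);

    // PulseAudio source, converted and resampled to the 16 kHz mono S16LE
    // format the speech models expect.
    GstElement* pipeline = gst_parse_launch(
        "pulsesrc ! audioconvert ! audioresample ! "
        "audio/x-raw,format=S16LE,rate=16000,channels=1 ! "
        "appsink name=sink emit-signals=true", nullptr);

    GstElement* sink = gst_bin_get_by_name(GST_BIN(pipeline), "sink");
    g_signal_connect(sink, "new-sample", G_CALLBACK(on_new_sample), nullptr);

    gst_element_set_state(pipeline, GST_STATE_PLAYING);
    g_main_loop_run(g_main_loop_new(nullptr, FALSE));  // Run until interrupted.
    return 0;
}
```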
Engineering the "Ear": A Real-Time C++ ASR Engine
The heart of any voice system is its ability to accurately convert spoken words into text. The Qualcomm AI Hub provided a starting point with its Whisper-Tiny-En model, but it wasn't a plug-and-play solution. The model was split into separate encoder and decoder TFLite files, requiring us to implement a complex process called autoregressive decoding—where the model predicts the next word based on the words it has already identified. We also had to handle all the pre- and post-processing ourselves, including feature extraction, logits processing, token sampling, etc. To manage this, we developed a custom C++ library to orchestrate the entire Speech-to-Text (STT) process efficiently on the QL601.
Diagram of the Whisper Engine: the Basic Speech-To-Text Process.
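To make the decoding step concrete, here is a heavily simplified greedy decoding loop against the TFLite decoder. The tensor indices, token IDs, and input layout are assumptions for illustration; a production engine also has to deal with attention caches, special tokens, and smarter sampling, all of which are omitted here.

```cpp
// Heavily simplified greedy autoregressive decode loop over the TFLite decoder.
// Tensor indices, token IDs, and the input layout are illustrative assumptions;
// attention caches, special tokens, and smarter sampling are omitted.
#include <tensorflow/lite/interpreter.h>
#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<int> GreedyDecode(tflite::Interpreter& decoder,
                              const std::vector<float>& encoder_out,
                              int start_token, int eot_token, int max_tokens) {
    std::vector<int> tokens = {start_token};

    for (int step = 0; step < max_tokens; ++step) {
        // Feed the encoder output (cross-attention input) and the tokens so far.
        std::copy(encoder_out.begin(), encoder_out.end(),
                  decoder.typed_input_tensor<float>(0));
        int32_t* token_in = decoder.typed_input_tensor<int32_t>(1);
        for (std::size_t i = 0; i < tokens.size(); ++i) token_in[i] = tokens[i];

        if (decoder.Invoke() != kTfLiteOk) break;

        // Greedy step: pick the highest-scoring token at the last position,
        // assuming logits of shape [1, sequence, vocabulary].
        const TfLiteTensor* out = decoder.tensor(decoder.outputs()[0]);
        const int vocab = out->dims->data[out->dims->size - 1];
        const float* logits = decoder.typed_output_tensor<float>(0)
                              + (tokens.size() - 1) * vocab;
        int next = static_cast<int>(
            std::max_element(logits, logits + vocab) - logits);

        if (next == eot_token) break;  // End-of-transcript reached.
        tokens.push_back(next);
    }
    return tokens;
}
```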
Designing the STT process was only half the battle; it needed to perform in the real world, and in real time. A customer won't repeat themselves, so our system must listen continuously. We engineered a sophisticated buffering and inference mechanism that processes audio in chunks as it arrives. The system continuously transcribes incomplete sentences and revises the output as the user speaks. This approach results in a final transcription that is both fast and robust.
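Conceptually, the streaming logic resembles the sketch below: keep a bounded rolling window of samples, re-run recognition as each chunk arrives, and treat every result as a revisable partial until the utterance ends. `Transcribe()` is a hypothetical stand-in for the Whisper engine, not our actual buffering code.

```cpp
// Sketch of the streaming idea: keep a bounded rolling window of samples,
// re-transcribe it as new chunks arrive, and treat each result as a revisable
// partial until the utterance ends. Transcribe() is a hypothetical stand-in.
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

class StreamingTranscriber {
public:
    // Called for every incoming chunk of 16 kHz mono samples.
    std::string OnChunk(const std::vector<int16_t>& chunk) {
        window_.insert(window_.end(), chunk.begin(), chunk.end());

        // Bound the window so each encoder pass has a predictable cost.
        constexpr std::size_t kMaxSamples = 16000 * 30;  // Whisper's 30 s limit.
        if (window_.size() > kMaxSamples)
            window_.erase(window_.begin(), window_.end() - kMaxSamples);

        partial_ = Transcribe(window_);  // Revised hypothesis for the window.
        return partial_;                 // Updated as the user keeps speaking.
    }

    // Called when end of speech is detected: finalize and reset.
    std::string OnUtteranceEnd() {
        std::string final_text = partial_;
        window_.clear();
        partial_.clear();
        return final_text;
    }

private:
    std::string Transcribe(const std::vector<int16_t>& samples);  // Runs Whisper.
    std::vector<int16_t> window_;
    std::string partial_;
};
```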
To further enhance accuracy in noisy environments, we integrated the Silero VAD (Voice Activity Detection) model. While Whisper has its own speech detection mechanism, we found it still suffers from severe "hallucinations" in a loud setting like a restaurant, where the model may transcribe environmental noises into nonsense. Our solution uses the small, fast, and highly accurate Silero VAD to act as a gatekeeper, activating the main Whisper model only when it detects actual speech. This two-stage approach dramatically improves reliability and reduces unnecessary processing, making the entire system more efficient.
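The gating logic itself is simple; the sketch below shows the idea, with `SileroVad` and `AsrEngine` as hypothetical interfaces and 0.5 as an illustrative threshold.

```cpp
// Two-stage gating sketch: the lightweight Silero VAD screens every audio
// frame, and the Whisper engine only runs while speech is actually present.
// SileroVad and AsrEngine are hypothetical interfaces; 0.5 is an illustrative
// threshold.
#include <cstdint>
#include <string>
#include <vector>

struct SileroVad {
    float SpeechProbability(const std::vector<int16_t>& frame);  // 0.0 .. 1.0
};

struct AsrEngine {                       // e.g. the streaming engine sketched above.
    std::string OnChunk(const std::vector<int16_t>& frame);
    std::string OnUtteranceEnd();
};

void ProcessFrame(SileroVad& vad, AsrEngine& asr,
                  const std::vector<int16_t>& frame, bool& in_speech) {
    if (vad.SpeechProbability(frame) > 0.5f) {
        in_speech = true;
        asr.OnChunk(frame);              // Whisper runs only while someone talks.
    } else if (in_speech) {
        in_speech = false;
        std::string text = asr.OnUtteranceEnd();   // Finalize the transcript.
        // Hand `text` to the language model stage here.
        (void)text;
    }
    // Silence outside an utterance is dropped entirely: fewer wasted encoder
    // passes and far fewer hallucinated transcriptions of background noise.
}
```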
On Real-Time ASR Performance
In a live ASR system, audio is processed in short chunks, often containing just a few words. Since autoregressive decoding generates tokens sequentially, shorter text outputs require fewer decoding steps, making this phase relatively fast. However, the encoder becomes the primary bottleneck because it must process fixed-size (zero-padded) audio features regardless of speech duration. Consequently, hardware acceleration for the encoder is critical for achieving responsive, real-time performance.
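The sketch below illustrates both points: features are always padded to Whisper's fixed 30-second, 80-bin log-mel shape, and the encoder graph is the part worth handing to an accelerator through a TFLite delegate. `CreateNpuDelegate()` is a placeholder for whichever delegate the platform provides, not a specific API.

```cpp
// Why the encoder dominates: its input is always the full fixed-size window,
// so its cost does not shrink with shorter utterances. Padding follows
// Whisper's 30 s / 80-bin log-mel layout; CreateNpuDelegate() is a placeholder
// for whichever TFLite delegate the platform provides, not a specific API.
#include <tensorflow/lite/interpreter.h>
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr int kMelBins = 80;
constexpr int kMelFrames = 3000;   // 30 s of audio at a 10 ms hop.

// Zero-pad (or truncate) the log-mel features to the encoder's fixed shape.
std::vector<float> PadFeatures(const std::vector<float>& mel) {
    std::vector<float> padded(static_cast<std::size_t>(kMelBins) * kMelFrames, 0.0f);
    std::copy_n(mel.begin(), std::min(mel.size(), padded.size()), padded.begin());
    return padded;
}

TfLiteDelegate* CreateNpuDelegate();   // Placeholder: platform-specific.

bool AccelerateEncoder(tflite::Interpreter& encoder) {
    // Offload the fixed-cost encoder graph to the accelerator; the short
    // per-token decoder steps are comparatively cheap.
    return encoder.ModifyGraphWithDelegate(CreateNpuDelegate()) == kTfLiteOk;
}
```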
Giving the AI a "Brain": Deploying and Customizing Phi-3-mini
Once the user's speech is converted to text, the system needs to understand the intent and generate a helpful response. This is the job of a language model. We've moved beyond the days of rigid, keyword-based voice commands. Small Language Models (SLMs) like Microsoft's Phi-3-mini allow for natural, fluid conversations, even on resource-limited edge devices.
The true power of our solution lies in its adaptability. Instead of spending months and significant resources fine-tuning a massive model, we use several prompt engineering techniques to craft a specialized prompt, which gives the general-purpose Phi-3 model the specific knowledge and "personality" it needs for a particular job.
For our restaurant kiosk demo, we "primed" the model with the menu, instructions, and examples of how to handle customer queries. The result? The SLM instantly transforms into an expert restaurant clerk capable of answering questions and taking complex orders. This lightweight customization makes it incredibly easy to deploy our solution across different industries—from retail and hospitality to healthcare—with minimal engineering effort.
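For illustration, a prompt in this style might look like the following, using Phi-3's chat template markers. The menu, prices, and wording are invented for the example and this is not our production prompt; note how it also asks the model to append the machine-readable order line that appears in the demo below.

```text
<|system|>
You are the ordering assistant for a fast-food kiosk. Only help with the menu below.
Menu: Hamburger $5, Cheeseburger $6, Fries $3, Coke $2, Vanilla Milkshake $4.
Rules:
- Confirm each order back to the customer in one short, friendly sentence.
- After confirming, append a line "[order] <Item>: <quantity>, ..." for the backend.
- If asked about anything other than food and drinks, politely explain you cannot help.
Example:
Customer: Two cheeseburgers and a coke.
You: Got it, two cheeseburgers and one coke. [order] Cheeseburger: 2, Coke: 1<|end|>
<|user|>
I'd like a cheeseburger, no onions, with a side of fries and a vanilla milkshake.<|end|>
<|assistant|>
```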
Putting It All Together: The Smart Retail Demo in Action
So, what does this all look like to the end-user? The experience is seamless. Here are some videos of demos running live on QL601.
Customer: I'll have 3 hamburgers, 1 milkshake, and 2 cokes, please.
Kiosk: Got it. So that's 3 hamburgers, 1 milkshake, and 2 cokes.
(Besides the kiosk's response, Phi-3 also formats the order as "[order] Hamburger: 3, Milkshake: 1, Coke: 2" to inform the backend system.)
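As a minimal sketch of how a backend might consume that tag (the function name and error handling are illustrative; the format follows the example above):

```cpp
// Minimal sketch of parsing the "[order] ..." line emitted by the SLM.
// The tag format follows the example above; error handling is omitted.
#include <map>
#include <sstream>
#include <string>

std::map<std::string, int> ParseOrderTag(const std::string& line) {
    std::map<std::string, int> order;
    const std::string prefix = "[order]";
    auto pos = line.find(prefix);
    if (pos == std::string::npos) return order;

    // Remaining text looks like "Hamburger: 3, Milkshake: 1, Coke: 2".
    std::stringstream items(line.substr(pos + prefix.size()));
    std::string entry;
    while (std::getline(items, entry, ',')) {
        auto colon = entry.find(':');
        if (colon == std::string::npos) continue;
        std::string name = entry.substr(0, colon);
        // Trim leading/trailing spaces from the item name.
        name.erase(0, name.find_first_not_of(' '));
        name.erase(name.find_last_not_of(' ') + 1);
        order[name] = std::stoi(entry.substr(colon + 1));
    }
    return order;
}
```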
This demo shows that our kiosk can not only properly answer questions but also communicate with the backend system using a structured format. Furthermore, thanks to the general-purpose nature of Phi-3, it is robust enough to handle unexpected questions:
Customer: Two tickets to New York, please.
(Guided by the prompt, Phi-3 knows that it is a restaurant kiosk and cannot help with booking tickets.)
Kiosk: I'm sorry, but we don't offer ticket services. We specialize in food and drinks.
These demos showcase the power of prompt engineering. With proper prompts, even a small model like Phi-3-mini can perform complex, specialized tasks, without losing the ability to answer general questions. The entire process happens right on the device, with no cloud connection required. It's fast, private, and reliable—exactly what's needed to build the next generation of smart, voice-first devices.
Enabling USB Microphone on QL601
In the demos, a USB microphone is connected to the QL601 to capture the user's voice. However, the PulseAudio service on the QL601 disables the USB microphone by default, so it has to be enabled manually with `pactl` before any audio can be captured.
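The exact invocation depends on how the device enumerates, so treat the following as a plausible example rather than the definitive fix: assuming the microphone shows up as ALSA card 1, the ALSA source module can be loaded manually.

```sh
# Assumption: the USB microphone enumerates as ALSA card 1, device 0.
# Check with `arecord -l` and adjust the hw:<card>,<device> string if needed.
pactl load-module module-alsa-source device=hw:1,0
```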
The USB microphone should then appear in the `pactl list sources` output and be ready to use.
Conclusion
Building a high-performance voice AI solution on an edge device from the ground up is a complex challenge. It requires expertise not just in AI models, but in low-level software engineering and hardware optimization. We've demonstrated that it's possible by creating a custom C++ ASR engine and an intelligently customized on-device SLM.
The result is a solution that delivers on the promise of edge AI: enhanced privacy, real-time responsiveness, and unmatched reliability. If you're ready to redefine your customer experience with voice, we're ready to help.