AI on the QL601: Bringing Your Models to Life Fast with LiteRT on Python
Imagine you’ve trained an amazing AI model on your workstation. Now, it’s time for it to shine on the QL601 edge device. Whether you want HTP acceleration, GPU power, or just CPU execution, LiteRT on Python makes this transition seamless.
Deploying AI models to edge devices often means rewriting your entire pipeline, learning new frameworks, or accepting significant performance compromises. But what if you could keep your Python workflow intact while unlocking 10-20x performance gains?
LiteRT acts as the bridge between your existing Python workflow and Qualcomm’s optimized hardware. Instead of rewriting your pipeline, you convert your model to TFLite, attach the LiteRT delegate, and keep your preprocessing, postprocessing, and business logic intact. With only minor changes to the inference step, your Python app can continue running on your laptop or in the cloud while the QL601 handles fast, edge-ready inference.
LiteRT doesn’t rewrite your story—it simply makes your model run faster in the real world.
The Journey Ahead
Before we begin, let's understand what we're building. You'll need two environments—think of it like preparing ingredients in your kitchen (host machine) and then cooking them in a specialized oven (QL601):
- Model Quantization Environment (on your host machine) - Where you'll prepare and optimize your model
- LiteRT Runtime Environment (on QL601) - Where your model will run and perform inference
This separation keeps your development workflow clean and efficient. The host machine handles the heavy lifting of model conversion and quantization, while the QL601 focuses purely on fast, optimized inference.
Setting Up Your Development Kitchen (Host)
Let's start by preparing your host machine where you'll convert and quantize your model.
Install System Dependencies
Set Up Python Quantization Environment
mkdir ~/litert/ && cd ~/litert
python3.10 -m venv litert_venv
source litert_venv/bin/activate
pip3 install torch==2.8.0 super-gradients==3.7.1 onnx_tf==1.8.0 tqdm opencv-python
You can also install these packages in one step with pip3 install -r quantization_model_requirements.txt using the provided requirements file.
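For reference, a quantization_model_requirements.txt that mirrors the pip install command above would contain something like this (the provided file may pin additional dependencies):

torch==2.8.0
super-gradients==3.7.1
onnx_tf==1.8.0
tqdm
opencv-python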
Preparing Your QL601 Device
Now let's set up the QL601 device where your model will run. This is where the magic happens!
QL601 Platform
The QL601 supports multiple BSPs (Board Support Packages), including Ubuntu. The instructions below assume you are using a QL601 with the Ubuntu BSP and have access via SSH or direct terminal.
Install Qualcomm AI SDK
This is the key that unlocks the power of QL601's specialized hardware:
mkdir /opt/litert
wget https://softwarecenter.qualcomm.com/api/download/software/sdks/Qualcomm_AI_Runtime_Community/All/2.40.0.251030/v2.40.0.251030.zip
unzip v2.40.0.251030.zip -d /opt/litert
Install System Dependencies
/bin/bash /opt/litert/qairt/2.40.0.251030/bin/check-linux-dependency.sh
apt update
apt install python3-pip libhdf5-dev libgtk2.0-dev pkg-config
Set Up Python Runtime Environment
# The default python3 on the QL601 Ubuntu BSP is Python 3.8
python3 -m venv /opt/litert/litert_venv
source /opt/litert/litert_venv/bin/activate
python3 /opt/litert/qairt/2.40.0.251030/bin/check-python-dependency
pip3 install tensorflow==2.20 opencv-python==4.12.0.88 tqdm
You can also install these packages in one step with pip3 install -r liteRT_requirements.txt using the provided requirements file.
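Likewise, a liteRT_requirements.txt mirroring the pip install command above would contain something like:

tensorflow==2.20
opencv-python==4.12.0.88
tqdm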
With both environments ready, we're now prepared to transform your model. This is where the real magic begins—taking your trained model and making it QL601-ready.
Transforming Your Model
Now comes the exciting part - converting your model into a format that QL601 can understand and optimize. For this guide, we'll use YOLO-NAS-S as our running example—a modern, high-performance object detection model from Deci AI that showcases what LiteRT can do.
YOLO-NAS comes pre-trained and ready to use through SuperGradients. We'll load the pretrained COCO weights directly—no need to download or manage model files yourself. The conversion process we'll walk through applies to any PyTorch model, so you can easily adapt these steps to your own architecture.
Converting Your Model to TensorFlow SavedModel Format
If your model is in PyTorch format (like many modern models), here's how to convert it:
# Convert torch model into ONNX
import torch
from super_gradients.training import models
from super_gradients.common.object_names import Models
yoloNas_url = "https://sg-hub-nv.s3.amazonaws.com/models/yolo_nas_s_coco.pth"
# Load your model with pretrained weights
model = models.get(Models.YOLO_NAS_S, num_classes=80, checkpoint_path=yoloNas_url)
# Prepare model for conversion
# Input size is in NCHW format: [Batch x Channels x Height x Width]
model.eval()
model.prep_model_for_conversion(input_size=[1, 3, 640, 640])
# Create dummy input for conversion
dummy_input = torch.randn([1, 3, 640, 640], device="cpu")
# Convert model to ONNX
torch.onnx.export(model, dummy_input, "yolo_nas_s.onnx", opset_version=11)
# Convert ONNX model into SavedModel
import onnx
from onnx_tf.backend import prepare
model = onnx.load("yolo_nas_s.onnx")
graph = model.graph
print("Old input name:", graph.input[0].name)
graph.input[0].name = "input"
# Rewrite input name to "input" (not "input.1") so TensorFlow recognizes it as a string, not a dict key
# Update all references to the input name throughout the graph
for node in graph.node:
for i, name in enumerate(node.input):
if name == "input.1":
node.input[i] = "input"
# Prepare TensorFlow representation
tf_rep = prepare(model)
# Export the model
tf_rep.export_graph("new_yolo_nas_saved_model")
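Before quantizing, it's worth a quick sanity check that the exported SavedModel loads and exposes the renamed input. A minimal check (assuming the default serving_default signature):

import tensorflow as tf

# Load the exported SavedModel and list its signatures and tensors as a sanity check
loaded = tf.saved_model.load("new_yolo_nas_saved_model")
print("Signatures:", list(loaded.signatures.keys()))
infer = loaded.signatures["serving_default"]
print("Inputs:", infer.structured_input_signature)
print("Outputs:", list(infer.structured_outputs.keys()))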
For Your Own Model
Most AI models start in PyTorch, but LiteRT on the QL601 works with TensorFlow Lite. The important step is converting your model into TensorFlow’s SavedModel format so LiteRT can run it efficiently.
The Art of Quantization
Quantization is like compressing a high-quality image - it makes your model smaller and faster while maintaining most of its accuracy. QL601 works best with quantized models, especially full-integer quantization.
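Concretely, int8 quantization maps floating-point values to integers using a scale and a zero point, the same affine mapping you'll see again in the preprocessing and postprocessing code below. A tiny illustration with made-up scale and zero-point values:

import numpy as np

# Affine quantization: q = round(x / scale) + zero_point
scale, zero_point = 0.0039, -128          # example values for a [0, 1] input range
x = np.array([0.0, 0.5, 1.0], dtype=np.float32)

q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
x_back = (q.astype(np.float32) - zero_point) * scale   # dequantize

print(q)        # roughly [-128, 0, 127]
print(x_back)   # close to the original values, within quantization error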
Qualcomm AI Hub
If you just want to get to an optimized model with the least friction, start with Qualcomm AI Hub. Most of the featured vision and language models are already quantized, so you can download them and run immediately on QL601 without converting and quantizing the model yourself.
Dynamic Range Quantization (Quick Start)
This is the fastest way to get started, but may not give you the best performance:
Dynamic Range Quantization
import tensorflow as tf
saved_model_dir = "/opt/litert/yolo_nas_s/yolo_nas_s_saved_model"
liteRT_model_name = "yolo_nas_s"
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS]
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
save_name = f"{liteRT_model_name}/quantized_{liteRT_model_name}.tflite"
with open(save_name, 'wb') as f:
    f.write(tflite_model)
Full-Integer Quantization (Recommended for QL601)
This is the gold standard for QL601. It provides the best performance, especially on HTP. Unlike dynamic range quantization which keeps some operations in float32, full-integer quantization converts everything to int8, allowing the HTP accelerator to process the entire model with maximum efficiency. The process uses representative data from your training dataset to calibrate the quantization thresholds, ensuring accuracy is preserved:
import cv2
import tensorflow as tf
import os, glob
import numpy as np
from tqdm import tqdm
import requests
import zipfile
import sys
# Helper function to download validation dataset
def download_and_unzip_val_dataset():
    url = "http://images.cocodataset.org/zips/val2017.zip"
    filename = "val2017.zip"
    output_dir = "val2017"
    print("Downloading", url)
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(filename, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    print("Download complete:", filename)
    print("Unzipping", filename)
    os.makedirs(output_dir, exist_ok=True)
    with zipfile.ZipFile(filename, "r") as zip_ref:
        zip_ref.extractall(output_dir)
    print("Unzip complete:", output_dir)
def load_random_val_images(max: int = -1):
    image_paths = np.array(glob.glob("val2017/*.jpg"))
    if max > 0:
        index = np.arange(0, image_paths.size)
        np.random.shuffle(index)
        image_paths = image_paths[index[:max]]
    return image_paths
# Configuration for your model
saved_model_dir = "yolo_nas_s/yolo_nas_saved_model"
liteRT_model_name = "yolo_nas_s"
input_dimension = (1, 3, 640, 640)
# Number of images for calibration (more = better accuracy, but slower)
num_images = 100
if not os.path.exists(saved_model_dir):
    print(f"{saved_model_dir} does not exist!")
    sys.exit()
if not os.path.exists("val2017"):
    download_and_unzip_val_dataset()
# Prepare converter
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Full-integer quantization settings
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
# Representative dataset function
# This preprocessing must match your training pipeline exactly
# YOLO-NAS expects RGB images, normalized to [0, 1], resized to 640x640
# The quantization converter uses these samples to determine optimal scaling factors
def representative_dataset():
    image_paths = load_random_val_images(max=num_images)
    for img_path in tqdm(image_paths, desc="Quantizing"):
        img = cv2.imread(img_path, cv2.IMREAD_COLOR)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (640, 640))
        img = img.astype(np.float32) / 255.0
        # Adjust based on your model's input format
        if input_dimension[-1] != 3:
            img = np.transpose(img, (2, 0, 1))
        if len(input_dimension) != 3:
            img = np.expand_dims(img, axis=0)
        yield [img]
# Set representative dataset
converter.representative_dataset = representative_dataset
# Convert model
print("Converting model... This may take a while.")
tflite_quant_model = converter.convert()
save_name = f"{liteRT_model_name}/quantized_{liteRT_model_name}_int8.tflite"
with open(save_name, 'wb') as f:
    f.write(tflite_quant_model)
print(f"Model saved to {save_name}")
For Your Own Model
- Adjust input_dimension to match your model's input shape
- Modify the preprocessing in representative_dataset() to match how you preprocess images during training (see the sketch below)
- Use your own validation dataset instead of COCO if possible
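For example, if your own model expects NHWC input and was trained with ImageNet-style mean/std normalization, a drop-in variant of the function above might look like this (paths, input size, and normalization values are placeholders for your own pipeline):

def representative_dataset():
    # Hypothetical example for an NHWC model trained with mean/std normalization
    image_paths = glob.glob("my_val_images/*.jpg")[:num_images]
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    for img_path in image_paths:
        img = cv2.cvtColor(cv2.imread(img_path), cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (224, 224)).astype(np.float32) / 255.0
        img = (img - mean) / std                 # match your training normalization
        yield [np.expand_dims(img, axis=0)]      # NHWC with batch dimension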
Choosing Your Hardware Backend
One of the most powerful features of QL601 is its ability to run models on different hardware backends. Let's explore your options:
- CPU: Universal, always works, moderate performance
- GPU: Good for parallel workloads, better than CPU
- HTP (Hexagon Tensor Processor): The star performer! Optimized for AI workloads, often 10-20x faster than CPU
Loading Your Model with the Right Backend
import tensorflow as tf
import cv2
import numpy as np
# QL601 AI SDK library path
ai_sdk_lib_path = "/opt/litert/qairt/2.40.0.251030/lib"
# Machine architecture (for QL601 Ubuntu)
machine_arch = "aarch64-ubuntu-gcc9.4"
# QL601 uses QCS6490, which is based on hexagon-v68
hexagon_version = "hexagon-v68"
# ============================================
# CHOOSE YOUR BACKEND HERE
# ============================================
# Option 1: HTP (Hexagon Tensor Processor) - RECOMMENDED for best performance
backend_type = "htp"
library_path = f"{ai_sdk_lib_path}/{machine_arch}/libQnnHtp.so"
# Option 2: GPU - Good for parallel workloads
# backend_type = "gpu"
# library_path = f"{ai_sdk_lib_path}/{machine_arch}/libQnnGpu.so"
# Option 3: CPU - Fallback option (no special library needed)
# backend_type = "cpu"
# library_path = None
# Load the delegate
delegate = [
    tf.lite.experimental.load_delegate(
        library=f"{ai_sdk_lib_path}/{machine_arch}/libQnnTFLiteDelegate.so",
        options={
            "backend_type": backend_type,
            # "log_level": 5,  # Uncomment for debugging
            "library_path": library_path,
            "skel_library_dir": f"{ai_sdk_lib_path}/{hexagon_version}/unsigned/",
            "htp_performance_mode": 2,
            "htp_use_fold_relu": 1,
        }
    )
] if backend_type != "cpu" else None
# Load your quantized model
model_path = "yolo_nas_s/quantized_yolo_nas_s_int8.tflite" # Your model path here
interpreter = tf.lite.Interpreter(
model_path=model_path,
experimental_delegates=delegate,
)
interpreter.allocate_tensors()
print(f"Model loaded successfully on {backend_type.upper()} backend!")
You're now ready to run inference. The next step is preprocessing your input data to match the model's expectations.
Preprocessing Your Input
Before the model can understand your image, it needs it in the right “language.” This function takes your image, converts it from BGR to RGB, resizes it to fit the model, and maps the pixels into the quantized integer format the model expects. Get this step right, and the model sees your image just as it did during training.
def image_preprocess(frame, input_details, model_input_size):
    """
    Preprocess image for model inference.

    Args:
        frame: Input image (BGR format from OpenCV)
        input_details: Model input details from interpreter
        model_input_size: Tuple of (width, height) for resizing
    """
    # Get input specifications
    input_type = input_details[0]['dtype']

    # Convert BGR to RGB and resize
    frame_resized = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    frame_resized = cv2.resize(frame_resized, model_input_size, interpolation=cv2.INTER_LINEAR)

    # Handle quantization parameters
    scale, zero_point = input_details[0]['quantization']
    scale = 1 if scale == 0 else scale

    # Normalize pixel values to [0, 1] range
    frame_resized = frame_resized.astype(np.float32) / 255.0

    # Apply quantization: convert normalized float to quantized integer
    # Formula: quantized = (normalized_value / scale) + zero_point
    frame_resized = frame_resized / scale + zero_point

    # Convert to the correct data type
    if input_type == np.int8:
        np.clip(frame_resized, -128, 127, out=frame_resized)
        frame_resized = frame_resized.astype(np.int8, copy=False)
    elif input_type == np.uint8:
        np.clip(frame_resized, 0, 255, out=frame_resized)
        frame_resized = frame_resized.astype(np.uint8, copy=False)

    # Check if model expects CHW format (channels first) vs HWC (channels last)
    # If shape is [1, 3, 640, 640] (CHW), we need to transpose from HWC
    # If shape is [1, 640, 640, 3] (HWC), no transpose needed
    input_shape = input_details[0]['shape']
    if input_shape[-1] != 3:
        frame_resized = frame_resized.transpose(2, 0, 1)

    frame_resized = np.expand_dims(frame_resized, axis=0)
    return frame_resized
After preprocessing, your image is ready to be fed into the model, transformed into its “native language” with RGB colors, the correct size, and integer pixels—just like it saw during training.
Running Inference
Now for the moment of truth - running your model! Here's how to do it:
# Get model input details
input_details = interpreter.get_input_details()
# Load your input image (replace with your own image source)
frame = cv2.imread("your_image.jpg")
model_input_size = (640, 640)  # Adjust to your model's input size

# Preprocess the image
input_data = image_preprocess(frame, input_details, model_input_size)
# Set input tensor
interpreter.set_tensor(input_details[0]['index'], input_data)
# Run inference
import time
start_time = time.time()
interpreter.invoke()
inference_time = time.time() - start_time
print(f"Inference completed in {inference_time*1000:.2f}ms")
# Get outputs
output_details = interpreter.get_output_details()
# Process your results here...
The inference is complete, but the raw outputs from a quantized model need special handling. We need to dequantize the outputs and then apply your model-specific postprocessing logic to get meaningful results.
Postprocessing the Results
After inference, you need to dequantize and process the outputs:
# Dequantize outputs (adjust based on your model's output structure)
def dequantize_output(output_details, output_index):
    """Dequantize model output."""
    scale, zero_point = output_details[output_index]['quantization']
    scale = 1 if scale == 0 else scale
    output = interpreter.get_tensor(output_details[output_index]['index'])
    return scale * (output.astype(np.float32) - zero_point)
# For YOLO models, typically you get confidence and bounding boxes
# Output 0: Class confidence scores (1 x 8400 x 80 for COCO)
class_confidence = dequantize_output(output_details, 0)
# Output 1: Bounding box coordinates (1 x 8400 x 4)
bbox_coord = dequantize_output(output_details, 1)
# Now apply your post-processing (NMS, filtering, etc.)
# This depends on your specific model
After dequantization, you'll typically apply model-specific logic like Non-Maximum Suppression (NMS) for object detection models to remove duplicate detections, or argmax for classification models to get the predicted class.
For YOLO-NAS, you'd apply NMS to filter overlapping bounding boxes and select the most confident detections. The exact steps depend on your model architecture, but the dequantization step ensures you're working with properly scaled floating-point values.
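As a rough illustration, here is a minimal, class-agnostic NMS pass over the dequantized outputs above. The confidence and IoU thresholds are arbitrary examples, and the boxes are assumed to be [x1, y1, x2, y2] in pixels:

import numpy as np
import tensorflow as tf

scores = class_confidence[0]   # (8400, 80) per-anchor class scores
boxes = bbox_coord[0]          # (8400, 4) per-anchor boxes

# Keep the best class per anchor and drop low-confidence anchors
class_ids = np.argmax(scores, axis=1)
best_scores = scores[np.arange(scores.shape[0]), class_ids]
keep = best_scores > 0.4       # example confidence threshold
boxes, best_scores, class_ids = boxes[keep], best_scores[keep], class_ids[keep]

# Class-agnostic Non-Maximum Suppression to remove overlapping detections
selected = tf.image.non_max_suppression(
    boxes, best_scores, max_output_size=100, iou_threshold=0.5).numpy()

for i in selected:
    x1, y1, x2, y2 = boxes[i]
    print(f"class={class_ids[i]} score={best_scores[i]:.2f} "
          f"box=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")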
Seeing LiteRT in Action: A Live Demo
After all this setup and configuration, you might wonder: does it actually work in practice? The answer is a resounding yes. Here's a live demonstration of YOLO-NAS running on QL601, processing real-time video with bounding boxes drawn on the HTP backend.
Notice how smoothly the model tracks multiple objects simultaneously—this is the power of HTP acceleration with full-integer quantization. You can replicate this demo yourself using the provided script:
- ai_demo.py – Python source code that runs YOLO-NAS inference on QL601
Video by Alley Walker, licensed under Creative Commons Attribution (CC BY 4.0).
Performance Benchmarks
Numbers don't lie. When we ran benchmarks on the QL601 with different quantization strategies, the results revealed something fundamental about how model format directly determines hardware capability.
We tested each configuration by running models on 640x640 images continuously for 30 seconds, measuring latency across CPU, GPU, and HTP backends.
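If you want to reproduce these numbers, a simple timing loop is enough. The sketch below is our own illustration rather than the exact benchmark script, and it assumes the interpreter and input tensor from the earlier sections are already set up:

import time

# Warm up (the first few invocations include delegate initialization overhead)
for _ in range(5):
    interpreter.invoke()

# Time repeated invocations for ~30 seconds and report the average latency
latencies = []
end_time = time.time() + 30
while time.time() < end_time:
    start = time.time()
    interpreter.invoke()
    latencies.append(time.time() - start)

avg_ms = 1000 * sum(latencies) / len(latencies)
print(f"{len(latencies)} runs, average latency: {avg_ms:.1f} ms")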
Dynamic Range Quantization
The convenient starting point. Minimal setup, works immediately. Here's what it delivers:
| Model | CPU | GPU | HTP |
|---|---|---|---|
| InceptionV3 | 125 ms | 303 ms | 125 ms |
| yolo_nas_s | 526 ms | 1111 ms | 769 ms |
| yolov5_s | 385 ms | 833 ms | 385 ms |
| yolov5_n | 175 ms | 667 ms | 185 ms |
Notice the HTP performance is comparable to CPU—not the advantage you'd expect. This is because dynamic range quantization uses mixed precision (e.g., FP16 activations + INT8 weights).
The HTP's tensor processing units are optimized specifically for full INT8 operations; when they encounter multiple data types, they can't leverage their specialization and execute less efficiently.
Full-Integer Quantization (Recommended)
Everything changes when you commit to int8 across the entire model. Same hardware, same input, completely different results:
| Model | Dataset Used | CPU | GPU | HTP |
|---|---|---|---|---|
| yolo_nas_s | VAL2017 | 263 ms | 270 ms | 18.4 ms |
| yolov5_s | VAL2017 | 208 ms | 217 ms | 10.4 ms |
| yolov5_n | VAL2017 | 87 ms | 94 ms | 5.6 ms |
What makes this transformation possible? Full-integer quantization converts every operation to int8, allowing the HTP's dedicated tensor processing units to execute the entire model with maximum efficiency. This delivers 10-20x speedup compared to CPU—enabling real-time inference at 60+ fps, concurrent multi-model execution, and sub-10ms latency on edge devices.
Speedup
Full-integer quantization with the QL601 HTP backend provides a 10-20x speedup compared to CPU! For example, yolo_nas_s drops from 263 ms on CPU to 18.4 ms on HTP, roughly a 14x speedup. Notice how HTP performance scales dramatically with full-integer quantization, while CPU and GPU show more modest gains.
Your Complete Deployment Checklist
Before deploying your model, make sure you've:
- Set up the quantization environment on your host machine
- Set up the LiteRT runtime environment on QL601
- Converted your model to TensorFlow SavedModel format
- Quantized your model (preferably full-integer)
- Tested preprocessing matches your training pipeline
- Chosen the right backend (HTP recommended)
- Implemented proper postprocessing
- Tested with real data
The LiteRT Advantage – Your Final Takeaway
LiteRT lets you run your Python AI app on Qualcomm hardware like QL601 with almost no changes. Your UI, business logic, and data pipeline stay the same. Simply:
- One-time setup: Initialize LiteRT for your device.
- Swap inference: Replace .predict() with the LiteRT call.
- That’s it—no rewrites, no major refactoring. Your app now runs on QL601 with a 10–20× speed boost.
Important
Ensure preprocessing/postprocessing matches your model, especially for quantized inputs/outputs, to fully leverage LiteRT and the HTP backend.
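To make the "swap inference" step concrete, here is a minimal sketch that wraps the LiteRT interpreter behind a predict-style helper so the rest of your Python app stays untouched (the helper name is ours, not part of any API):

def litert_predict(interpreter, batch):
    # Drop-in replacement for a framework-native predict() call
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    interpreter.set_tensor(input_details[0]['index'], batch)
    interpreter.invoke()
    return [interpreter.get_tensor(d['index']) for d in output_details]

# Before: predictions = model.predict(batch)
# After:  predictions = litert_predict(interpreter, batch)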
Now it’s your turn—grab a QL601 and deploy your AI models with blazing speed! Experience effortless performance and see your projects come alive: Get yours here
Reference Documents
- Qualcomm AI Runtime Options - Detailed configuration options
- TensorFlow Lite Developer Workflow - Official workflow guide
Related Guides
- How to Download Qualcomm AI Hub Models - Using pre-quantized models from Qualcomm AI Hub
- Voice Kiosk on QL601 - Another example of AI deployment on QL601