AI on the QL601: Bringing Your Models to Life Fast with LiteRT on Python
Imagine you’ve trained an amazing AI model on your workstation. Now, it’s time for it to shine on the QL601 edge device. Whether you want HTP acceleration, GPU power, or just CPU execution, LiteRT on Python makes this transition seamless.
Deploying AI models to edge devices often means rewriting your entire pipeline, learning new frameworks, or accepting significant performance compromises. But what if you could keep your Python workflow intact while unlocking 10-20x performance gains?
LiteRT acts as the bridge between your existing Python workflow and Qualcomm’s optimized hardware. Instead of rewriting your pipeline, you convert your model to TFLite, attach the LiteRT delegate, and keep your preprocessing, postprocessing, and business logic intact. With only minor changes to the inference step, your Python app can continue running on your laptop or in the cloud while the QL601 handles fast, edge-ready inference.
LiteRT doesn’t rewrite your story—it simply makes your model run faster in the real world.
The Journey Ahead
Before we begin, let's understand what we're building. You'll need two environments—think of it like preparing ingredients in your kitchen (host machine) and then cooking them in a specialized oven (QL601):
- Model Quantization Environment (on your host machine) - Where you'll prepare and optimize your model
- LiteRT Runtime Environment (on QL601) - Where your model will run and perform inference
This separation keeps your development workflow clean and efficient. The host machine handles the heavy lifting of model conversion and quantization, while the QL601 focuses purely on fast, optimized inference.
Setting Up Your Development Kitchen (Host)
Let's start by preparing your host machine where you'll convert and quantize your model.
Install System Dependencies
Set Up Python Quantization Environment
mkdir ~/litert/ && cd ~/litert
python3.10 -m venv litert_venv
source litert_venv/bin/activate
pip3 install torch==2.8.0 super-gradients==3.7.1 onnx_tf==1.8.0 tqdm opencv-python
You can also install these packages in one step with pip3 install -r quantization_model_requirements.txt using the provided requirements file.
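For reference, a quantization_model_requirements.txt that mirrors the pip install command above would contain something like this (the provided file may pin additional dependencies):

torch==2.8.0
super-gradients==3.7.1
onnx_tf==1.8.0
tqdm
opencv-python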
Preparing Your QL601 Device
Now let's set up the QL601 device where your model will run. This is where the magic happens!
QL601 Platform
The QL601 supports multiple BSPs (Board Support Packages), including Ubuntu. The instructions below assume you are using a QL601 with the Ubuntu BSP and have access via SSH or direct terminal.
Install Qualcomm AI SDK
This is the key that unlocks the power of QL601's specialized hardware:
mkdir /opt/litert
wget https://softwarecenter.qualcomm.com/api/download/software/sdks/Qualcomm_AI_Runtime_Community/All/2.40.0.251030/v2.40.0.251030.zip
unzip v2.40.0.251030.zip -d /opt/litert
Install System Dependencies
/bin/bash /opt/litert/qairt/2.40.0.251030/bin/check-linux-dependency.sh
apt update
apt install python3-pip libhdf5-dev libgtk2.0-dev pkg-config
Set Up Python Runtime Environment
# The default python3 on the QL601 Ubuntu BSP is Python 3.8
python3 -m venv /opt/litert/litert_venv
source /opt/litert/litert_venv/bin/activate
python3 /opt/litert/qairt/2.40.0.251030/bin/check-python-dependency
pip3 install tensorflow==2.20 opencv-python==4.12.0.88 tqdm
You can also install these packages in one step with pip3 install -r liteRT_requirements.txt using the provided requirements file.
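Likewise, a liteRT_requirements.txt mirroring the pip install command above would contain something like:

tensorflow==2.20
opencv-python==4.12.0.88
tqdm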
With both environments ready, we're now prepared to transform your model. This is where the real magic begins—taking your trained model and making it QL601-ready.
Transforming Your Model
Now comes the exciting part - converting your model into a format that QL601 can understand and optimize. For this guide, we'll use YOLO-NAS-S as our running example—a modern, high-performance object detection model from Deci AI that showcases what LiteRT can do.
YOLO-NAS comes pre-trained and ready to use through SuperGradients. We'll load the pretrained COCO weights directly—no need to download or manage model files yourself. The conversion process we'll walk through applies to any PyTorch model, so you can easily adapt these steps to your own architecture.
Converting Your Model to TensorFlow SavedModel Format
If your model is in PyTorch format (like many modern models), here's how to convert it:
# Convert torch model into ONNX
import torch
from super_gradients.training import models
from super_gradients.common.object_names import Models
yoloNas_url = "https://sg-hub-nv.s3.amazonaws.com/models/yolo_nas_s_coco.pth"
# Load your model with pretrained weights
model = models.get(Models.YOLO_NAS_S, num_classes=80, checkpoint_path=yoloNas_url)
# Prepare model for conversion
# Input size is in NCHW format: [Batch x Channels x Height x Width]
model.eval()
model.prep_model_for_conversion(input_size=[1, 3, 640, 640])
# Create dummy input for conversion
dummy_input = torch.randn([1, 3, 640, 640], device="cpu")
# Convert model to ONNX
torch.onnx.export(model, dummy_input, "yolo_nas_s.onnx", opset_version=11)
# Convert ONNX model into SavedModel
import onnx
from onnx_tf.backend import prepare
model = onnx.load("yolo_nas_s.onnx")
graph = model.graph
print("Old input name:", graph.input[0].name)
graph.input[0].name = "input"
# Rewrite input name to "input" (not "input.1") so TensorFlow recognizes it as a string, not a dict key
# Update all references to the input name throughout the graph
for node in graph.node:
for i, name in enumerate(node.input):
if name == "input.1":
node.input[i] = "input"
# Prepare TensorFlow representation
tf_rep = prepare(model)
# Export the model
tf_rep.export_graph("new_yolo_nas_saved_model")
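Before quantizing, it's worth a quick sanity check that the exported SavedModel loads and exposes the renamed input. A minimal check (assuming the default serving_default signature):

import tensorflow as tf

# Load the exported SavedModel and list its signatures and tensors as a sanity check
loaded = tf.saved_model.load("new_yolo_nas_saved_model")
print("Signatures:", list(loaded.signatures.keys()))
infer = loaded.signatures["serving_default"]
print("Inputs:", infer.structured_input_signature)
print("Outputs:", list(infer.structured_outputs.keys()))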
For Your Own Model
Most AI models start in PyTorch, but LiteRT on the QL601 works with TensorFlow Lite. The important step is converting your model into TensorFlow’s SavedModel format so LiteRT can run it efficiently.
The Art of Quantization
Quantization is like compressing a high-quality image - it makes your model smaller and faster while maintaining most of its accuracy. QL601 works best with quantized models, especially full-integer quantization.
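Concretely, int8 quantization maps floating-point values to integers using a scale and a zero point, the same affine mapping you'll see again in the preprocessing and postprocessing code below. A tiny illustration with made-up scale and zero-point values:

import numpy as np

# Affine quantization: q = round(x / scale) + zero_point
scale, zero_point = 0.0039, -128          # example values for a [0, 1] input range
x = np.array([0.0, 0.5, 1.0], dtype=np.float32)

q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
x_back = (q.astype(np.float32) - zero_point) * scale   # dequantize

print(q)        # roughly [-128, 0, 127]
print(x_back)   # close to the original values, within quantization error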
Qualcomm AI Hub
If you just want to get to an optimized model with the least friction, start with Qualcomm AI Hub. Most of the featured vision and language models are already quantized, so you can download them and run immediately on QL601 without converting and quantizing the model yourself.
Dynamic Range Quantization (Quick Start)
This is the fastest way to get started, but may not give you the best performance:
Dynamic Range Quantization
import tensorflow as tf
saved_model_dir = "/opt/litert/yolo_nas_s/yolo_nas_s_saved_model"
liteRT_model_name = "yolo_nas_s"
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS]
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
save_name = f"{liteRT_model_name}/quantized_{liteRT_model_name}.tflite"
with open(save_name, 'wb') as f:
    f.write(tflite_model)
Full-Integer Quantization (Recommended for QL601)
This is the gold standard for QL601. It provides the best performance, especially on HTP. Unlike dynamic range quantization which keeps some operations in float32, full-integer quantization converts everything to int8, allowing the HTP accelerator to process the entire model with maximum efficiency. The process uses representative data from your training dataset to calibrate the quantization thresholds, ensuring accuracy is preserved:
import cv2
import tensorflow as tf
import os, glob
import numpy as np
from tqdm import tqdm
import requests
import zipfile
import sys
# Helper function to download validation dataset
def download_and_unzip_val_dataset():
    url = "http://images.cocodataset.org/zips/val2017.zip"
    filename = "val2017.zip"
    output_dir = "val2017"
    print("Downloading", url)
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(filename, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    print("Download complete:", filename)
    print("Unzipping", filename)
    os.makedirs(output_dir, exist_ok=True)
    with zipfile.ZipFile(filename, "r") as zip_ref:
        zip_ref.extractall(output_dir)
    print("Unzip complete:", output_dir)
def load_random_val_images(max: int = -1):
    image_paths = np.array(glob.glob("val2017/*.jpg"))
    if max > 0:
        index = np.arange(0, image_paths.size)
        np.random.shuffle(index)
        image_paths = image_paths[index[:max]]
    return image_paths
# Configuration for your model
saved_model_dir = "yolo_nas_s/yolo_nas_saved_model"
liteRT_model_name = "yolo_nas_s"
input_dimension = (1, 3, 640, 640)
# Number of images for calibration (more = better accuracy, but slower)
num_images = 100
if not os.path.exists(saved_model_dir):
    print(f"{saved_model_dir} does not exist!")
    sys.exit()
if not os.path.exists("val2017"):
    download_and_unzip_val_dataset()
# Prepare converter
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Full-integer quantization settings
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
# Representative dataset function
# This preprocessing must match your training pipeline exactly
# YOLO-NAS expects RGB images, normalized to [0, 1], resized to 640x640
# The quantization converter uses these samples to determine optimal scaling factors
def representative_dataset():
    image_paths = load_random_val_images(max=num_images)
    for img_path in tqdm(image_paths, desc="Quantizing"):
        img = cv2.imread(img_path, cv2.IMREAD_COLOR)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (640, 640))
        img = img.astype(np.float32) / 255.0
        # Adjust based on your model's input format
        if input_dimension[-1] != 3:
            img = np.transpose(img, (2, 0, 1))
        if len(input_dimension) != 3:
            img = np.expand_dims(img, axis=0)
        yield [img]
# Set representative dataset
converter.representative_dataset = representative_dataset
# Convert model
print("Converting model... This may take a while.")
tflite_quant_model = converter.convert()
save_name = f"{liteRT_model_name}/quantized_{liteRT_model_name}_int8.tflite"
with open(save_name, 'wb') as f:
    f.write(tflite_quant_model)
print(f"Model saved to {save_name}")
For Your Own Model
- Adjust input_dimension to match your model's input shape
- Modify the preprocessing in representative_dataset() to match how you preprocess images during training (see the sketch below)
- Use your own validation dataset instead of COCO if possible
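For example, if your own model expects NHWC input and was trained with ImageNet-style mean/std normalization, a drop-in variant of the function above might look like this (paths, input size, and normalization values are placeholders for your own pipeline):

def representative_dataset():
    # Hypothetical example for an NHWC model trained with mean/std normalization
    image_paths = glob.glob("my_val_images/*.jpg")[:num_images]
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    for img_path in image_paths:
        img = cv2.cvtColor(cv2.imread(img_path), cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (224, 224)).astype(np.float32) / 255.0
        img = (img - mean) / std                 # match your training normalization
        yield [np.expand_dims(img, axis=0)]      # NHWC with batch dimension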
Choosing Your Hardware Backend
One of the most powerful features of QL601 is its ability to run models on different hardware backends. Let's explore your options:
- CPU: Universal, always works, moderate performance
- GPU: Good for parallel workloads, better than CPU
- HTP (Hexagon Tensor Processor): The star performer! Optimized for AI workloads, often 10-20x faster than CPU
Loading Your Model with the Right Backend
import tensorflow as tf
import cv2
import numpy as np
# QL601 AI SDK library path
ai_sdk_lib_path = "/opt/litert/qairt/2.40.0.251030/lib"
# Machine architecture (for QL601 Ubuntu)
machine_arch = "aarch64-ubuntu-gcc9.4"
# QL601 uses QCS6490, which is based on hexagon-v68
hexagon_version = "hexagon-v68"
# ============================================
# CHOOSE YOUR BACKEND HERE
# ============================================
# Option 1: HTP (Hexagon Tensor Processor) - RECOMMENDED for best performance
backend_type = "htp"
library_path = f"{ai_sdk_lib_path}/{machine_arch}/libQnnHtp.so"
# Option 2: GPU - Good for parallel workloads
# backend_type = "gpu"
# library_path = f"{ai_sdk_lib_path}/{machine_arch}/libQnnGpu.so"
# Option 3: CPU - Fallback option (no special library needed)
# backend_type = "cpu"
# library_path = None
# Load the delegate
delegate = [
    tf.lite.experimental.load_delegate(
        library=f"{ai_sdk_lib_path}/{machine_arch}/libQnnTFLiteDelegate.so",
        options={
            "backend_type": backend_type,
            # "log_level": 5,  # Uncomment for debugging
            "library_path": library_path,
            "skel_library_dir": f"{ai_sdk_lib_path}/{hexagon_version}/unsigned/",
            "htp_performance_mode": 2,
            "htp_use_fold_relu": 1,
        }
    )
] if backend_type != "cpu" else None
# Load your quantized model
model_path = "yolo_nas_s/quantized_yolo_nas_s_int8.tflite" # Your model path here
interpreter = tf.lite.Interpreter(
model_path=model_path,
experimental_delegates=delegate,
)
interpreter.allocate_tensors()
print(f"Model loaded successfully on {backend_type.upper()} backend!")
You're now ready to run inference. The next step is preprocessing your input data to match the model's expectations.
Preprocessing Your Input
Before the model can understand your image, it needs it in the right “language.” This function takes your image, converts it from BGR to RGB, resizes it to fit the model, and maps the pixels into the quantized integer format the model expects. Get this step right, and the model sees your image just as it did during training.
def image_preprocess(frame, input_details, model_input_size):
    """
    Preprocess image for model inference.

    Args:
        frame: Input image (BGR format from OpenCV)
        input_details: Model input details from interpreter
        model_input_size: Tuple of (width, height) for resizing
    """
    # Get input specifications
    input_type = input_details[0]['dtype']

    # Convert BGR to RGB and resize
    frame_resized = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    frame_resized = cv2.resize(frame_resized, model_input_size, interpolation=cv2.INTER_LINEAR)

    # Handle quantization parameters
    scale, zero_point = input_details[0]['quantization']
    scale = 1 if scale == 0 else scale

    # Normalize pixel values to [0, 1] range
    frame_resized = frame_resized.astype(np.float32) / 255.0

    # Apply quantization: convert normalized float to quantized integer
    # Formula: quantized = (normalized_value / scale) + zero_point
    frame_resized = frame_resized / scale + zero_point

    # Convert to the correct data type
    if input_type == np.int8:
        np.clip(frame_resized, -128, 127, out=frame_resized)
        frame_resized = frame_resized.astype(np.int8, copy=False)
    elif input_type == np.uint8:
        np.clip(frame_resized, 0, 255, out=frame_resized)
        frame_resized = frame_resized.astype(np.uint8, copy=False)

    # Check if model expects CHW format (channels first) vs HWC (channels last)
    # If shape is [1, 3, 640, 640] (CHW), we need to transpose from HWC
    # If shape is [1, 640, 640, 3] (HWC), no transpose needed
    input_shape = input_details[0]['shape']
    if input_shape[-1] != 3:
        frame_resized = frame_resized.transpose(2, 0, 1)

    frame_resized = np.expand_dims(frame_resized, axis=0)
    return frame_resized
After preprocessing, your image is ready to be fed into the model, transformed into its “native language” with RGB colors, the correct size, and integer pixels—just like it saw during training.
Running Inference
Now for the moment of truth - running your model! Here's how to do it:
# Get model input details
input_details = interpreter.get_input_details()
# Load your input image (replace with your own image source)
frame = cv2.imread("your_image.jpg")
model_input_size = (640, 640)  # Adjust to your model's input size

# Preprocess the image
input_data = image_preprocess(frame, input_details, model_input_size)
# Set input tensor
interpreter.set_tensor(input_details[0]['index'], input_data)
# Run inference
import time
start_time = time.time()
interpreter.invoke()
inference_time = time.time() - start_time
print(f"Inference completed in {inference_time*1000:.2f}ms")
# Get outputs
output_details = interpreter.get_output_details()
# Process your results here...
The inference is complete, but the raw outputs from a quantized model need special handling. We need to dequantize the outputs and then apply your model-specific postprocessing logic to get meaningful results.
Postprocessing the Results
After inference, you need to dequantize and process the outputs:
# Dequantize outputs (adjust based on your model's output structure)
def dequantize_output(output_details, output_index):
    """Dequantize model output."""
    scale, zero_point = output_details[output_index]['quantization']
    scale = 1 if scale == 0 else scale
    output = interpreter.get_tensor(output_details[output_index]['index'])
    return scale * (output.astype(np.float32) - zero_point)
# For YOLO models, typically you get confidence and bounding boxes
# Output 0: Class confidence scores (1 x 8400 x 80 for COCO)
class_confidence = dequantize_output(output_details, 0)
# Output 1: Bounding box coordinates (1 x 8400 x 4)
bbox_coord = dequantize_output(output_details, 1)
# Now apply your post-processing (NMS, filtering, etc.)
# This depends on your specific model
After dequantization, you'll typically apply model-specific logic like Non-Maximum Suppression (NMS) for object detection models to remove duplicate detections, or argmax for classification models to get the predicted class.
For YOLO-NAS, you'd apply NMS to filter overlapping bounding boxes and select the most confident detections. The exact steps depend on your model architecture, but the dequantization step ensures you're working with properly scaled floating-point values.
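As a rough illustration, here is a minimal, class-agnostic NMS pass over the dequantized outputs above. The confidence and IoU thresholds are arbitrary examples, and the boxes are assumed to be [x1, y1, x2, y2] in pixels:

import numpy as np
import tensorflow as tf

scores = class_confidence[0]   # (8400, 80) per-anchor class scores
boxes = bbox_coord[0]          # (8400, 4) per-anchor boxes

# Keep the best class per anchor and drop low-confidence anchors
class_ids = np.argmax(scores, axis=1)
best_scores = scores[np.arange(scores.shape[0]), class_ids]
keep = best_scores > 0.4       # example confidence threshold
boxes, best_scores, class_ids = boxes[keep], best_scores[keep], class_ids[keep]

# Class-agnostic Non-Maximum Suppression to remove overlapping detections
selected = tf.image.non_max_suppression(
    boxes, best_scores, max_output_size=100, iou_threshold=0.5).numpy()

for i in selected:
    x1, y1, x2, y2 = boxes[i]
    print(f"class={class_ids[i]} score={best_scores[i]:.2f} "
          f"box=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")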
Seeing LiteRT in Action: A Live Demo
After all this setup and configuration, you might wonder: does it actually work in practice? The answer is a resounding yes. Here's a live demonstration of YOLO-NAS running on QL601, processing real-time video with bounding boxes drawn on the HTP backend.
Notice how smoothly the model tracks multiple objects simultaneously—this is the power of HTP acceleration with full-integer quantization. You can replicate this demo yourself using the provided script:
- ai_demo.py – Python source code that runs YOLO-NAS inference on QL601
Video by Alley Walker, licensed under Creative Commons Attribution (CC BY 4.0).
Performance Benchmarks
Numbers don't lie. When we ran benchmarks on the QL601 with different quantization strategies, the results revealed something fundamental about how model format directly determines hardware capability.
We tested each configuration by running models on 640x640 images continuously for 30 seconds, measuring latency across CPU, GPU, and HTP backends.
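If you want to reproduce these numbers, a simple timing loop is enough. The sketch below is our own illustration rather than the exact benchmark script, and it assumes the interpreter and input tensor from the earlier sections are already set up:

import time

# Warm up (the first few invocations include delegate initialization overhead)
for _ in range(5):
    interpreter.invoke()

# Time repeated invocations for ~30 seconds and report the average latency
latencies = []
end_time = time.time() + 30
while time.time() < end_time:
    start = time.time()
    interpreter.invoke()
    latencies.append(time.time() - start)

avg_ms = 1000 * sum(latencies) / len(latencies)
print(f"{len(latencies)} runs, average latency: {avg_ms:.1f} ms")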
Dynamic Range Quantization
The convenient starting point. Minimal setup, works immediately. Here's what it delivers:
| Model | CPU | GPU | HTP |
|---|---|---|---|
| InceptionV3 | 125 ms | 303 ms | 125 ms |
| yolo_nas_s | 526 ms | 1111 ms | 769 ms |
| yolov5_s | 385 ms | 833 ms | 385 ms |
| yolov5_n | 175 ms | 667 ms | 185 ms |
Notice the HTP performance is comparable to CPU—not the advantage you'd expect. This is because dynamic range quantization uses mixed precision (e.g., FP16 activations + INT8 weights).
The HTP's tensor processing units are optimized specifically for full INT8 operations; when they encounter multiple data types, they can't leverage their specialization and execute less efficiently.
Full-Integer Quantization (Recommended)
Everything changes when you commit to int8 across the entire model. Same hardware, same input, completely different results:
| Model | Dataset Used | CPU | GPU | HTP |
|---|---|---|---|---|
| yolo_nas_s | VAL2017 | 263 ms | 270 ms | 18.4 ms |
| yolov5_s | VAL2017 | 208 ms | 217 ms | 10.4 ms |
| yolov5_n | VAL2017 | 87 ms | 94 ms | 5.6 ms |
What makes this transformation possible? Full-integer quantization converts every operation to int8, allowing the HTP's dedicated tensor processing units to execute the entire model with maximum efficiency. This delivers 10-20x speedup compared to CPU—enabling real-time inference at 60+ fps, concurrent multi-model execution, and sub-10ms latency on edge devices.
Speedup
Full-integer quantization with the QL601 HTP backend provides a 10-20x speedup compared to CPU! For example, yolo_nas_s drops from 263 ms on CPU to 18.4 ms on HTP, roughly a 14x speedup. Notice how HTP performance scales dramatically with full-integer quantization, while CPU and GPU show more modest gains.
Your Complete Deployment Checklist
Before deploying your model, make sure you've:
- Set up the quantization environment on your host machine
- Set up the LiteRT runtime environment on QL601
- Converted your model to TensorFlow SavedModel format
- Quantized your model (preferably full-integer)
- Tested preprocessing matches your training pipeline
- Chosen the right backend (HTP recommended)
- Implemented proper postprocessing
- Tested with real data
The LiteRT Advantage – Your Final Takeaway
LiteRT lets you run your Python AI app on Qualcomm hardware like QL601 with almost no changes. Your UI, business logic, and data pipeline stay the same. Simply:
- One-time setup: Initialize LiteRT for your device.
- Swap inference: Replace .predict() with the LiteRT call.
- That’s it—no rewrites, no major refactoring. Your app now runs on QL601 with a 10–20× speed boost.
Important
Ensure preprocessing/postprocessing matches your model, especially for quantized inputs/outputs, to fully leverage LiteRT and the HTP backend.
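To make the "swap inference" step concrete, here is a minimal sketch that wraps the LiteRT interpreter behind a predict-style helper so the rest of your Python app stays untouched (the helper name is ours, not part of any API):

def litert_predict(interpreter, batch):
    # Drop-in replacement for a framework-native predict() call
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    interpreter.set_tensor(input_details[0]['index'], batch)
    interpreter.invoke()
    return [interpreter.get_tensor(d['index']) for d in output_details]

# Before: predictions = model.predict(batch)
# After:  predictions = litert_predict(interpreter, batch)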
Now it’s your turn—grab a QL601 and deploy your AI models with blazing speed! Experience effortless performance and see your projects come alive: Get yours here
Reference Documents
- Qualcomm AI Runtime Options - Detailed configuration options
- TensorFlow Lite Developer Workflow - Official workflow guide
Related Guides
- How to Download Qualcomm AI Hub Models - Using pre-quantized models from Qualcomm AI Hub
- Voice Kiosk on QL601 - Another example of AI deployment on QL601