Time to First Token
Time to First Token (TTFT) is the latency between the moment a user hits the Enter key and the moment the first character appears on the screen. An excessive TTFT can greatly diminish the overall user experience.
TTFT is a crucial responsiveness indicator for any online interactive application powered by a large language model (LLM), since it reflects how quickly users see the first character from the model on a web page.
Here, we will explore two simple ways to measure the first-token latency of a language model.
Prerequisites
Please refer to the installation instructions at the following links:
- Install vLLM
- Install Docker
Download LLM Model
Meta-Llama-3.1-8B-Instruct-GPTQ-INT4
You can use a different LLM model instead of Meta-Llama-3.1-8B-Instruct-GPTQ-INT4.
In the following demonstration, we will use Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 for simplicity.
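If you prefer to fetch the model ahead of time, here is a minimal sketch using the Hugging Face CLI (assuming the huggingface_hub CLI is installed; vLLM will otherwise download the model automatically on first use):

```bash
# Pre-download the GPTQ-quantized Llama 3.1 model from the Hugging Face Hub.
# This step is optional: vLLM pulls the model automatically on first run.
huggingface-cli download hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4
```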
Experiments
vLLM Server + Open WebUI
In this approach, you interact with the LLM through a web page and can intuitively feel the latency before the first token appears.
- Run the vLLM server
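A minimal sketch of the serve command, assuming vLLM's OpenAI-compatible server on its default port 8000, with flags chosen to mirror the Python example later in this post (adjust them to your GPU):

```bash
# Serve the GPTQ-quantized Llama 3.1 model via vLLM's OpenAI-compatible API.
vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \
    --gpu-memory-utilization 0.4 \
    --max-model-len 1024 \
    --port 8000
```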
- Run the Open WebUI client
```bash
# Start Open WebUI and point it at the local vLLM OpenAI-compatible endpoint.
docker run -d \
    --name open-webui \
    -v ${HOME}/open-webui:/app/backend/data \
    -e OPENAI_API_BASE_URL=http://localhost:8000/v1 \
    --restart always \
    --network host \
    ghcr.io/open-webui/open-webui:main
```
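With --network host, Open WebUI should then be reachable in a browser at http://localhost:8080 (its default port), assuming no other service occupies that port.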
Notice
If you want to browse the web page from another machine, replace localhost with the IP address of the machine running the vLLM server, e.g. `-e OPENAI_API_BASE_URL=http://192.168.1.100:8000/v1` (a placeholder address).
- Interact with Llama-3.1 on the web page
Enjoy playing around with the LLM.
Notice
When you visit the Open WebUI page for the first time, you have to set up an administrator account and password for local development; keep this information in a memo so you can log in again later.
Python API
Through the Python API of vLLM, we can easily measure the exact latency of the first token output by the LLM. Here, TTFT is taken as the interval between the moment the request is first scheduled and the moment the first token is emitted, as recorded in vLLM's per-request metrics.
- Run the following Python code:
```python
from vllm import LLM, SamplingParams

queries = ["Hello!", "Are you male or female"]

# Load the GPTQ-quantized Llama 3.1 model with a small KV-cache footprint.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.4,
    enforce_eager=True,
    max_model_len=1024,
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
    num_gpu_blocks_override=260,
)

# Greedy decoding, at most 50 new tokens per prompt.
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=50,
)

for n, prompt in enumerate(queries):
    print("*" * 80)
    print(f"My query {n + 1}: ")
    print(' ' + prompt)

    output = llm.generate(prompt, sampling_params)[0]
    generated_text = output.outputs[0].text
    print("\nGenerated text: ")
    print(f" {generated_text!r}\n")

    # TTFT = time the first token was produced minus the time the request
    # was first scheduled, both recorded by the engine in output.metrics.
    prefill_time_taken = output.metrics.first_token_time - output.metrics.first_scheduled_time
    print("TTFT : ", prefill_time_taken)
    print("*" * 80)
```
- TTFT measurement on the terminal
```
********************************************************************************
My query 1: 
 Hello!
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.23s/it, est. speed input: 0.90 toks/s, output: 22.40 toks/s]

Generated text: 
 ' I could have sworn I’ve visited this website before but then again I never get bored of reading through articles that can be written from such a lot of first-hand experience! I will bookmark your blog and check again here frequently. I am quite certain I'

TTFT :  0.09638261795043945
********************************************************************************
********************************************************************************
My query 2: 
 Are you male or female
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.10s/it, est. speed input: 2.38 toks/s, output: 23.79 toks/s]

Generated text: 
 '?"\n answer = input()\n if answer == "male":\n print("You are a male.")\n elif answer == "female":\n print("You are a female.")\n else:\n print("I didn\'t understand your answer.")\n``'

TTFT :  0.04160714149475098
********************************************************************************
```
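Note that the first request reports a noticeably higher TTFT than the second; this is likely a one-time warm-up effect on the first generate call (e.g., lazy kernel and allocator initialization) rather than a property of the prompts themselves.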