Time to First Token

Time to First Token (TTFT) refers to the latency between the moment a user presses the Enter key and the moment the first character appears on the screen. Excessive TTFT can greatly diminish the overall user experience.

TTFT is a crucial response-time indicator for an online interactive application powered by a large language model (LLM), as it reflects how quickly users see the first character from the model on a web page.

Here, we will explore two simple ways to measure the first-token latency of a language model.
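
In other words, TTFT is the wall-clock gap between sending a request and receiving the first streamed token. As a rough client-side illustration, here is a minimal sketch; the endpoint, model name, and payload are assumptions matching the vLLM server started later in this guide:

    import time
    import requests

    # Measure TTFT from the client's point of view: the gap between
    # sending the request and receiving the first streamed chunk.
    payload = {
        "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",
        "prompt": "Hello!",
        "max_tokens": 50,
        "stream": True,
    }

    start = time.perf_counter()
    with requests.post("http://localhost:8000/v1/completions",
                       json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            if line:  # the first non-empty SSE line carries the first token
                print(f"TTFT (client side): {time.perf_counter() - start:.3f} s")
                break

Note that this number includes network overhead, so it will be slightly larger than the server-side TTFT measured in the Python API section below.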

Prerequisites

Please refer to the installation instructions at the following links:

  1. Install vLLM

    vLLM Installation

  2. Install docker

    Install Docker Engine on Ubuntu

Download LLM Model

Meta-Llama-3.1-8B-Instruct-GPTQ-INT4

You can use a different LLM model instead of Meta-Llama-3.1-8B-Instruct-GPTQ-INT4.

In the following demonstration, we will use Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 for simplicity.
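
If you prefer to fetch the weights ahead of time instead of letting vLLM download them on first launch, a minimal sketch using the huggingface_hub package (a dependency of vLLM) looks like this:

    from huggingface_hub import snapshot_download

    # Optional: pre-download the model weights from the Hugging Face Hub.
    # vLLM will otherwise download them automatically on first launch.
    snapshot_download("hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4")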

Experiments

vLLM Server + Open WebUI

In this approach, you interact with the LLM through a web page and can intuitively feel the latency before the first token appears.

  1. Run the vLLM server

    vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \
    --gpu-memory-utilization 0.4 \
    --max-model-len 1024 \
    --num-gpu-blocks-override 260 \
    --tensor-parallel-size 1 \
    --enable-chunked-prefill \
    --enable-prefix-caching
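
    Once the server is up, you can sanity-check it before connecting the UI. A minimal check, assuming the default port 8000 and the requests package:

    import requests

    # Confirm the vLLM OpenAI-compatible server is reachable and list
    # the models it serves.
    resp = requests.get("http://localhost:8000/v1/models")
    print([m["id"] for m in resp.json()["data"]])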
    
  2. Run the Open WebUI client

    docker run -d \
    --name open-webui \
    -v ${HOME}/open-webui:/app/backend/data \
    -e OPENAI_API_BASE_URL=http://localhost:8000/v1 \
    --restart always \
    --network host  \
    ghcr.io/open-webui/open-webui:main
    

    Notice

    If you want to browse the web page from another machine, replace localhost with the IP address of the machine that runs the vLLM server.

  3. Interact with Llama-3.1 on the web page

    Browse to http://localhost:8080 (the default port Open WebUI listens on when run with host networking) and start chatting.

    (Screenshot: interacting with Llama-3.1 in the Open WebUI chat page)

    Enjoy playing around with the LLM.

    Notice

    When you open the Open WebUI page for the first time, you have to set up an administrator account and password for local development. Keep this information in a safe place so you can log in again later.

Python API

Through the Python API of vLLM, we can easily and accurately measure the latency of the first token output by the LLM.

  1. Run the following Python code:

    from vllm import LLM, SamplingParams

    queries = []
    queries.append("Hello!")
    queries.append("Are you male or female")

    # Load the model with the same settings used for the vLLM server above.
    llm = LLM(
        model="hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",
        tensor_parallel_size=1,
        gpu_memory_utilization=0.4,
        enforce_eager=True,
        max_model_len=1024,
        enable_prefix_caching=True,
        enable_chunked_prefill=True,
        num_gpu_blocks_override=260
    )

    # Greedy decoding, capped at 50 generated tokens per query.
    sampling_params = SamplingParams(
        temperature=0,
        max_tokens=50,
    )

    for n, prompt in enumerate(queries):
        print("*" * 80)
        print(f"My query {n+1}: ")
        print('    ' + prompt)

        output = llm.generate(prompt, sampling_params)[0]
        generated_text = output.outputs[0].text
        print("\nGenerated text: ")
        print(f"    {generated_text!r}\n")

        # TTFT: time from when the request was first scheduled to when
        # its first token was produced.
        prefill_time_taken = output.metrics.first_token_time - output.metrics.first_scheduled_time
        print("TTFT : ", prefill_time_taken)
        print("*" * 80)
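
    The metrics object also records arrival_time, so you can include any time the request spent queued before being scheduled. A minimal variant, reusing the output object from the loop above:

    # TTFT measured from request arrival instead of first scheduling,
    # which additionally counts any queueing delay.
    ttft_from_arrival = output.metrics.first_token_time - output.metrics.arrival_time
    print("TTFT (from arrival): ", ttft_from_arrival)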
    

  2. TTFT measurements on the terminal

    ********************************************************************************
    My query 1: 
        Hello!
    Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.23s/it, est. speed input: 0.90 toks/s, output: 22.40 toks/s]
    
    Generated text: 
        ' I could have sworn I’ve visited this website before but then again I never get bored of reading through articles that can be written from such a lot of first-hand experience! I will bookmark your blog and check again here frequently. I am quite certain I'
    
    TTFT :  0.09638261795043945
    ********************************************************************************
    ********************************************************************************
    My query 2: 
        Are you male or female
    Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.10s/it, est. speed input: 2.38 toks/s, output: 23.79 toks/s]
    
    Generated text: 
        '?"\n    answer = input()\n    if answer == "male":\n        print("You are a male.")\n    elif answer == "female":\n        print("You are a female.")\n    else:\n        print("I didn\'t understand your answer.")\n``'
    
    TTFT :  0.04160714149475098
    ********************************************************************************