Time to First Token
Time to First Token (TTFT) is the latency between the moment a user hits the Enter key and the moment the first character appears on the screen. An excessive TTFT can greatly diminish the overall user experience.
TTFT is a crucial responsiveness indicator for any online interactive application powered by a large language model (LLM), since it reflects how quickly users see the first character from the model on a web page.
Here, we will explore two simple ways to measure the first-token latency of a language model.
Prerequisites
Please refer to the installation instructions at the following links:
- Install vLLM
- Install Docker
Download LLM Model
Meta-Llama-3.1-8B-Instruct-GPTQ-INT4
You can use a different LLM model instead of Meta-Llama-3.1-8B-Instruct-GPTQ-INT4.
In the following demonstration, we will use Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 for simplicity.
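If you prefer to fetch the model ahead of time, here is a minimal sketch using the Hugging Face CLI (assuming the huggingface_hub CLI is installed; vLLM will otherwise download the model automatically on first use):

```bash
# Pre-download the GPTQ-quantized Llama 3.1 model from the Hugging Face Hub.
# This step is optional: vLLM pulls the model automatically on first run.
huggingface-cli download hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4
```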
Experiments
vLLM Server + Open WebUI
In this approach, you interact with the LLM through a web page and can intuitively feel the latency before the first token appears.
- Run the vLLM server
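A minimal sketch of the serve command, assuming vLLM's OpenAI-compatible server on its default port 8000, with flags chosen to mirror the Python example later in this post (adjust them to your GPU):

```bash
# Serve the GPTQ-quantized Llama 3.1 model via vLLM's OpenAI-compatible API.
vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \
    --gpu-memory-utilization 0.4 \
    --max-model-len 1024 \
    --port 8000
```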
- Run the Open WebUI client
```bash
# Start Open WebUI and point it at the local vLLM OpenAI-compatible endpoint.
docker run -d \
    --name open-webui \
    -v ${HOME}/open-webui:/app/backend/data \
    -e OPENAI_API_BASE_URL=http://localhost:8000/v1 \
    --restart always \
    --network host \
    ghcr.io/open-webui/open-webui:main
```
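With --network host, Open WebUI should then be reachable in a browser at http://localhost:8080 (its default port), assuming no other service occupies that port.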
Notice
If you want to browse the web page from another machine, replace localhost with the IP address of the machine running the vLLM server, e.g. `-e OPENAI_API_BASE_URL=http://192.168.1.100:8000/v1` (a placeholder address).
- Interact with Llama-3.1 on the web page
Enjoy playing around with the LLM.
Notice
When you visit the Open WebUI page for the first time, you have to set up an administrator account and password for local development; keep this information in a memo so you can log in again later.
Python API
Through the Python API of vLLM, we can easily measure the exact latency of the first token output by the LLM. Here, TTFT is taken as the interval between the moment the request is first scheduled and the moment the first token is emitted, as recorded in vLLM's per-request metrics.
- Run the following Python code:
```python
from vllm import LLM, SamplingParams

queries = ["Hello!", "Are you male or female"]

# Load the GPTQ-quantized Llama 3.1 model with a small KV-cache footprint.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.4,
    enforce_eager=True,
    max_model_len=1024,
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
    num_gpu_blocks_override=260,
)

# Greedy decoding, at most 50 new tokens per prompt.
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=50,
)

for n, prompt in enumerate(queries):
    print("*" * 80)
    print(f"My query {n + 1}: ")
    print(' ' + prompt)

    output = llm.generate(prompt, sampling_params)[0]
    generated_text = output.outputs[0].text
    print("\nGenerated text: ")
    print(f" {generated_text!r}\n")

    # TTFT = time the first token was produced minus the time the request
    # was first scheduled, both recorded by the engine in output.metrics.
    prefill_time_taken = output.metrics.first_token_time - output.metrics.first_scheduled_time
    print("TTFT : ", prefill_time_taken)
    print("*" * 80)
```
- TTFT measurement on the terminal
```
********************************************************************************
My query 1: 
 Hello!
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.23s/it, est. speed input: 0.90 toks/s, output: 22.40 toks/s]

Generated text: 
 ' I could have sworn I’ve visited this website before but then again I never get bored of reading through articles that can be written from such a lot of first-hand experience! I will bookmark your blog and check again here frequently. I am quite certain I'

TTFT :  0.09638261795043945
********************************************************************************
********************************************************************************
My query 2: 
 Are you male or female
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.10s/it, est. speed input: 2.38 toks/s, output: 23.79 toks/s]

Generated text: 
 '?"\n answer = input()\n if answer == "male":\n print("You are a male.")\n elif answer == "female":\n print("You are a female.")\n else:\n print("I didn\'t understand your answer.")\n``'

TTFT :  0.04160714149475098
********************************************************************************
```
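Note that the first request reports a noticeably higher TTFT than the second; this is likely a one-time warm-up effect on the first generate call (e.g., lazy kernel and allocator initialization) rather than a property of the prompts themselves.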