In our previous benchmarking blog post, we compared the performance of different inference backends using two key metrics: Time to First Token and Token Generation Rate. We intentionally did not tune the inference configurations, such as GPU memory utilization, maximum number of sequences, and paged KV cache block size, so that we could measure each backend's out-of-the-box performance and ease of use, highlighting their practicality in real-world applications.
In this blog post, the BentoML engineering team shifts focus to the impact of performance tuning, specifically examining how tuning inference configurations can significantly enhance the serving performance of large language models (LLMs) using TensorRT-LLM (TRT-LLM). By adjusting key parameters like batch size and prefix chunking, we aim to demonstrate the substantial improvements that can be achieved.
This post serves as a comprehensive guide to optimizing TRT-LLM settings, offering practical insights and detailed steps to help you achieve superior performance. Specifically, it covers:
- Key findings of performance tuning
- Best practices and explanations of key parameters
- Main steps to serve LLMs with TRT-LLM and BentoML
- Benchmark client

Key Findings
As in the previous blog post, we evaluated TensorRT-LLM serving performance with two key metrics (see the sketch after this list for how they can be measured on the client side):
- Time to First Token (TTFT): Measures the time from when a request is sent to when the first token is generated, recorded in milliseconds. TTFT is important for applications requiring immediate feedback, such as interactive chatbots. Lower latency improves perceived performance and user satisfaction.
- Token Generation Rate: Assesses the model's decoding throughput, measured in tokens per second. The token generation rate is an indicator of the model's capacity to handle high loads. A higher rate suggests that the model can efficiently manage multiple requests and generate responses quickly, making it suitable for high-concurrency environments.
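Both metrics can be captured on the client side from a streaming response. Below is a minimal sketch of the bookkeeping involved; `token_stream` is a hypothetical placeholder for whatever streaming iterator your client exposes, and the exact decode window may differ from the one used by our internal benchmark client.

```python
import time
from typing import Iterable, Optional

def measure_request(token_stream: Iterable[str]) -> dict:
    """Compute TTFT (ms) and token generation rate (tokens/s) for one request."""
    start = time.perf_counter()
    first_token_time: Optional[float] = None
    n_tokens = 0

    for _ in token_stream:              # each item is one generated token/chunk
        now = time.perf_counter()
        if first_token_time is None:
            first_token_time = now      # first token observed -> TTFT
        n_tokens += 1
    end = time.perf_counter()

    ttft_ms = (first_token_time - start) * 1000 if first_token_time is not None else None
    # Count decoding throughput over the window after the first token arrives.
    decode_time = end - first_token_time if first_token_time is not None else 0.0
    tokens_per_second = (n_tokens - 1) / decode_time if decode_time > 0 else 0.0
    return {"ttft_ms": ttft_ms, "tokens_per_second": tokens_per_second}
```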
We compared the performance of TRT-LLM serving Llama-3 8B before and after configuration tuning.
TRT-LLM Model Compilation
Start by compiling your model into a TensorRT-LLM engine. This step converts the model checkpoint into a format optimized for TensorRT execution on your target GPU. Refer to Figure 3 for a visual guide to the compilation process.
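As a rough illustration, the sketch below drives the two-step flow (checkpoint conversion, then engine build) from Python. The model path, output directories, and flag values are assumptions for illustration only, and flag names change between TensorRT-LLM releases, so check `trtllm-build --help` for your version.

```python
import subprocess

HF_MODEL_DIR = "./Meta-Llama-3-8B-Instruct"  # local Hugging Face checkout (assumed path)
CKPT_DIR = "./trtllm-checkpoint"
ENGINE_DIR = "./trtllm-engine"

# Step 1: convert the Hugging Face weights into TensorRT-LLM's checkpoint format.
# convert_checkpoint.py ships under examples/llama/ in the TensorRT-LLM repository.
subprocess.run(
    [
        "python", "examples/llama/convert_checkpoint.py",
        "--model_dir", HF_MODEL_DIR,
        "--output_dir", CKPT_DIR,
        "--dtype", "float16",
    ],
    check=True,
)

# Step 2: build the TensorRT engine. max_batch_size is one of the tuning knobs
# discussed in this post; paged context FMHA is needed for chunked prefill.
subprocess.run(
    [
        "trtllm-build",
        "--checkpoint_dir", CKPT_DIR,
        "--output_dir", ENGINE_DIR,
        "--gemm_plugin", "float16",
        "--max_batch_size", "64",
        "--use_paged_context_fmha", "enable",
    ],
    check=True,
)
```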
Serving Models with Triton Inference Server
Use the trtllm-backend to serve TensorRT-LLM models with the Triton Inference Server. This backend is purpose-built for TensorRT-LLM engines, ensuring optimal performance during inference.
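Once the Triton model repository is up, a quick smoke test can go through Triton's generate endpoint. The sketch below assumes the default example configuration from the trtllm-backend repository (an `ensemble` model with `text_input`/`text_output` tensors); adjust the model name, field names, and port if your deployment differs.

```python
import requests

# Triton's HTTP generate endpoint for the default "ensemble" model (assumed setup).
TRITON_URL = "http://localhost:8000/v2/models/ensemble/generate"

payload = {
    "text_input": "What is the capital of France?",
    "max_tokens": 64,
    "stream": False,
}

resp = requests.post(TRITON_URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["text_output"])  # generated text in the default ensemble config
```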
How to Benchmark
To accurately assess the performance of LLM backends, we created a custom benchmark script. This script simulates real-world scenarios by varying user loads and sending generation requests under different levels of concurrency.
Our benchmark client spawns the target number of users within 20 seconds, after which it stress-tests the LLM backend by sending concurrent generation requests with randomly selected prompts. We tested with 10, 50, and 100 concurrent users to evaluate the system under varying loads.
Each stress test ran for 5 minutes, during which time we collected inference metrics every 5 seconds. This duration was sufficient to observe potential performance degradation, resource utilization bottlenecks, or other issues that might not be evident in shorter tests.
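The simplified sketch below captures the shape of that client: ramp up to the target number of users over 20 seconds, then keep every user sending generation requests with randomly chosen prompts until the 5-minute window closes. The endpoint URL, payload fields, and prompt pool are placeholders for illustration, and the per-request metric collection (TTFT and token counts sampled every 5 seconds) is omitted for brevity.

```python
import asyncio
import random
import time

import aiohttp

URL = "http://localhost:3000/generate"                   # placeholder serving endpoint
PROMPTS = ["Summarize ...", "Explain ...", "Write ..."]  # replace with a real prompt pool
NUM_USERS = 50                                           # we tested 10, 50, and 100
RAMP_UP_S = 20
DURATION_S = 300                                         # 5-minute stress test

async def user(session: aiohttp.ClientSession, stop_at: float) -> None:
    """One simulated user: send generation requests back-to-back until the test ends."""
    while time.monotonic() < stop_at:
        payload = {"prompt": random.choice(PROMPTS), "max_tokens": 128}
        async with session.post(URL, json=payload) as resp:
            await resp.read()                            # metric recording omitted here

async def main() -> None:
    stop_at = time.monotonic() + RAMP_UP_S + DURATION_S
    async with aiohttp.ClientSession() as session:
        tasks = []
        for _ in range(NUM_USERS):                       # spawn users gradually over the ramp-up window
            tasks.append(asyncio.create_task(user(session, stop_at)))
            await asyncio.sleep(RAMP_UP_S / NUM_USERS)
        await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(main())
```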