
In our previous benchmarking blog post, we compared the performance of different inference backends using two key metrics: Time to First Token and Token Generation Rate. We intentionally did not tune the inference configurations, such as GPU memory utilization, maximum number of sequences, and paged KV cache block size, to implicitly measure the performance and ease of use of each backend, highlighting their practicality in real-world applications.

In this blog post, the BentoML engineering team shifts focus to the impact of performance tuning, specifically examining how tuning inference configurations can significantly enhance the serving performance of large language models (LLMs) using TensorRT-LLM (TRT-LLM). By adjusting key parameters like batch size and prefix chunking, we aim to demonstrate the substantial improvements that can be achieved.

This post serves as a comprehensive guide for optimizing TRT-LLM settings, offering practical insights and detailed steps to help you achieve superior performance. Specifically, it will cover:

  • Key findings of performance tuning
  • Best practices and explanations of key parameters
  • Main steps to serve LLMs with TRT-LLM and BentoML
  • Benchmark client

Key Findings

As in the previous blog post, we evaluated TensorRT-LLM serving performance with two key metrics:

  1. Time to First Token (TTFT): Measures the time from when a request is sent to when the first token is generated, recorded in milliseconds. TTFT is important for applications requiring immediate feedback, such as interactive chatbots. Lower latency improves perceived performance and user satisfaction.
  2. Token Generation Rate: Measures how many tokens the model generates per second during the decoding phase. The token generation rate indicates the model's capacity to handle high loads: a higher rate means the model can efficiently manage multiple concurrent requests and generate responses quickly, making it suitable for high-concurrency environments. (See the sketch after this list for how both metrics are computed per request.)
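
Concretely, both metrics can be derived per request from a few timestamps. The sketch below is a minimal illustration of that arithmetic, assuming we record when the request was sent, when the first and last streamed tokens arrived, and how many tokens were generated; it is not the exact instrumentation behind our numbers.

```python
# Hypothetical per-request metric computation; the timestamps are assumed to
# come from a streaming client (see the benchmark client sketch further down).
def compute_metrics(t_start: float, t_first_token: float, t_last_token: float, n_tokens: int):
    """Return (TTFT in ms, decode-phase tokens per second) for one request."""
    ttft_ms = (t_first_token - t_start) * 1000
    # Count only the decode phase (from the first token onward), so a long
    # prefill affects TTFT but does not distort the generation rate.
    # Assumes at least two generated tokens.
    tokens_per_second = (n_tokens - 1) / (t_last_token - t_first_token)
    return ttft_ms, tokens_per_second
```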

We compared the performance of TRT-LLM serving Llama-3 8B before and after configuration tuning.

TRT-LLM Model Compilation

Start by compiling your model with TRT-LLM. This step converts the model checkpoint into a TensorRT engine optimized for your target GPU. Refer to Figure 3 for a visual guide to the compilation process.
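
As a concrete sketch (not a definitive recipe), the compile flow has two stages in the TensorRT-LLM Llama example: convert the Hugging Face checkpoint into a TensorRT-LLM checkpoint, then build the engine with trtllm-build. The paths and flag values below are placeholders, and flag names can differ between TensorRT-LLM versions, so check `trtllm-build --help` for your install.

```python
# Minimal sketch of the two-stage compile flow, driven from Python for
# convenience. Paths, dtype, and limits are illustrative assumptions.
import subprocess

MODEL_DIR = "./Meta-Llama-3-8B-Instruct"   # local Hugging Face checkpoint (assumed)
CKPT_DIR = "./tllm_checkpoint"             # intermediate TensorRT-LLM checkpoint
ENGINE_DIR = "./trt_engines/llama3-8b"     # compiled engine output

# Stage 1: convert HF weights into the TensorRT-LLM checkpoint format
# (convert_checkpoint.py ships with the Llama example in the TensorRT-LLM repo).
subprocess.run(
    [
        "python", "examples/llama/convert_checkpoint.py",
        "--model_dir", MODEL_DIR,
        "--output_dir", CKPT_DIR,
        "--dtype", "float16",
    ],
    check=True,
)

# Stage 2: build the optimized engine. The batch and token limits set here are
# build-time caps on what the runtime scheduler can batch together, which is
# why they matter for the tuning discussed in this post.
subprocess.run(
    [
        "trtllm-build",
        "--checkpoint_dir", CKPT_DIR,
        "--output_dir", ENGINE_DIR,
        "--gemm_plugin", "float16",
        "--max_batch_size", "256",
        "--max_input_len", "2048",
        "--max_num_tokens", "8192",
    ],
    check=True,
)
```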

Serving Models with Triton Inference Server

Utilize the trtllm-backend to serve TensorRT-LLM models using the Triton Inference Server. This backend is specifically designed to handle TensorRT-LLM models efficiently, ensuring optimal performance during inference.
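
As a rough sketch: once the compiled engine and tokenizer have been wired into a Triton model repository (for example, one prepared from the all_models/inflight_batcher_llm templates in the tensorrtllm_backend repository), starting the server is a single command. The repository path and ports below are placeholders.

```python
# Launch Triton Inference Server against a prepared model repository.
# The repository layout and the engine path inside config.pbtxt are assumed
# to have been filled in beforehand.
import subprocess

MODEL_REPO = "./triton_model_repo"   # assumed Triton model repository

subprocess.run(
    [
        "tritonserver",
        f"--model-repository={MODEL_REPO}",
        "--http-port=8000",      # HTTP/REST endpoint used by the benchmark client
        "--metrics-port=8002",   # Prometheus metrics (GPU and queue statistics)
    ],
    check=True,
)
```

Once the server is up, streaming generations can be requested through Triton's generate extension (for example, POST /v2/models/ensemble/generate_stream), which is the endpoint the benchmark client sketch below assumes.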

How to Benchmark

To accurately assess the performance of LLM backends, we created a custom benchmark script. This script simulates real-world scenarios by varying user loads and sending generation requests under different levels of concurrency.

Our benchmark client ramps up to the target number of users within 20 seconds, after which it stress-tests the LLM backend by sending concurrent generation requests with randomly selected prompts. We tested with 10, 50, and 100 concurrent users to evaluate the system under varying loads.

Each stress test ran for 5 minutes, during which time we collected inference metrics every 5 seconds. This duration was sufficient to observe potential performance degradation, resource utilization bottlenecks, or other issues that might not be evident in shorter tests.
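
The snippet below is a simplified sketch of such a client rather than our exact script: it ramps up to the target number of simulated users over 20 seconds, keeps each user issuing streaming generation requests with randomly chosen prompts until the 5-minute window ends, and prints aggregate metrics every 5 seconds. The endpoint URL, payload fields, and prompt list are illustrative assumptions.

```python
import asyncio
import random
import statistics
import time

import aiohttp

BASE_URL = "http://localhost:8000/v2/models/ensemble/generate_stream"  # assumed endpoint
PROMPTS = [
    "Explain how paged KV caching works.",
    "Summarize the trade-offs of larger batch sizes.",
    "Write a short story about a GPU cluster.",
]
CONCURRENT_USERS = 50          # we tested 10, 50, and 100
RAMP_UP_SECONDS = 20
TEST_DURATION_SECONDS = 300    # 5-minute stress test
REPORT_INTERVAL_SECONDS = 5

results: list[tuple[float, float]] = []   # (ttft_ms, tokens_per_second)


async def user_loop(session: aiohttp.ClientSession, delay: float, deadline: float) -> None:
    """One simulated user: wait for its ramp-up slot, then send requests until the deadline."""
    await asyncio.sleep(delay)
    while time.perf_counter() < deadline:
        payload = {"text_input": random.choice(PROMPTS), "max_tokens": 256, "stream": True}
        start = time.perf_counter()
        first_token_at = None
        chunks = 0
        async with session.post(BASE_URL, json=payload) as resp:
            # Each streamed chunk carries one or more tokens, so the count is an approximation.
            async for _chunk in resp.content.iter_any():
                if first_token_at is None:
                    first_token_at = time.perf_counter()
                chunks += 1
        if first_token_at is not None and chunks > 1:
            ttft_ms = (first_token_at - start) * 1000
            rate = (chunks - 1) / (time.perf_counter() - first_token_at)
            results.append((ttft_ms, rate))


async def reporter(deadline: float) -> None:
    """Print cumulative aggregates every few seconds, mirroring the 5-second collection interval."""
    while time.perf_counter() < deadline:
        await asyncio.sleep(REPORT_INTERVAL_SECONDS)
        if results:
            ttfts = [r[0] for r in results]
            rates = [r[1] for r in results]
            print(f"requests={len(results)} "
                  f"median_ttft_ms={statistics.median(ttfts):.1f} "
                  f"median_tokens_per_s={statistics.median(rates):.1f}")


async def main() -> None:
    deadline = time.perf_counter() + RAMP_UP_SECONDS + TEST_DURATION_SECONDS
    async with aiohttp.ClientSession() as session:
        users = [
            asyncio.create_task(user_loop(session, i * RAMP_UP_SECONDS / CONCURRENT_USERS, deadline))
            for i in range(CONCURRENT_USERS)
        ]
        await asyncio.gather(*users, reporter(deadline))


if __name__ == "__main__":
    asyncio.run(main())
```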
