Hi, my name is Mega, and I am a software engineer at Databricks. Today I'll be talking about some of the work that we have been doing at Databricks to improve LLM inference performance. So, let's get started.

This is the outline of the talk. I'll start with a brief introduction of our LLM serving products, why we continue using vLLM, and some of the key improvements that came with the latest release, focusing on some of the optimizations that we have upstreamed to vLLM, and also some other projects related to FP8 and speculative decoding.

Hopefully, these 30 minutes will be useful before you guys can head to happy hour. So, starting with what Databricks has to offer and why we care about LLM inference performance. Foundation Model APIs is one of our products, and it has two modes for different use cases. The first is per-token LLM serving. This is where any user can access and query any open-source model or any proprietary models, including DBRX, which was one of our own mixture-of-experts models trained in-house. It has very simple pricing for input and output tokens. This is one of the products that is very widely used by a lot of our users, especially developers.

The second one that we have is provisioned throughput. This is again where you can use any popular open-source models or even bring your own. You can also fine-tune your models on our platform and then serve them with provisioned throughput as an end-to-end experience. What we provide is the entire infrastructure: you pick a minimum and maximum throughput under some latency constraints, and we provide the hardware that it runs on and also autoscale it up or down for you based on your unique traffic patterns, with guaranteed throughput.

Our inference engine is a mixture of backends, and we have integrated vLLM to support many of our serving use cases. So why do we continue using vLLM? Some of the key features are highlighted in this slide. First, it's flexible: vLLM supports a broad range of models. As soon as a new open-source model comes out, you will find it supported in vLLM on the same day it launches. We ourselves upstreamed the support for DBRX at the time of the launch, and it was very easy to add a custom model. It also has very efficient memory management, and continuous batching for improved throughput. PagedAttention, as you all might be aware, was one of the key optimizations introduced by vLLM that enables continuous batching. And lastly, it has a very active open-source community, including some of the people here in the audience, that continues pushing the boundaries of LLM inference performance.

Taking a bird's-eye view of what vLLM's architecture looks like: at a very high level, the core of vLLM has an LLM engine which runs an infinite loop in the background, and all that is happening inside that loop is one step that iteratively generates a batch, executes it, and then processes the output. Inside every step in vLLM, you have a scheduler which prepares the input tensors to run on the GPU. It allocates KV cache for any new requests, updates the status of the sequences as prefill or decode, mixes them all together into one batch, and then finally ships this off and broadcasts it to the workers, each of which runs its own shard of the model weights. All of this is happening on the CPU.

These input tensors are then run on the GPU, involving many kernel launches. Basically, a CUDA graph replays all these forward passes on the GPU, and then the sampler generates the next token ID. This is then shipped back to the CPU, where output processing happens. That includes detokenization (for checking stop sequences), removal of finished sequences from the batch, and finally, streaming the output tokens back to the user. So a lot is happening in one decoding step, and the same thing keeps happening in an infinite loop.

We ran profiling to see how much time the GPU is idle in each decoding step. As you can see, only the forward pass uses the GPU. We ran some Torch traces for Llama 3.1 70B on A100s with FP8 per-tensor quantization. At batch size 1, the GPU is idle 2.6% of the time. As you increase the batch size, it becomes 14%, and further increases result in 25%. Clearly, this is very inefficient, with much room for improvement.

This was the state before the latest vLLM release (0.6.0+), which included many performance upgrades: a 2.7x throughput improvement and a 5x latency reduction. This resulted from three key changes driven by the open-source community, a collaboration across different teams.
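The decoding step described above (schedule on the CPU, forward pass on the GPU, output processing back on the CPU) can be sketched as a toy synchronous loop. This is a minimal illustration, not vLLM's code: `Request`, `schedule`, `forward`, `process_outputs`, `run_engine`, and `gpu_idle_fraction` are hypothetical names, the "forward pass" just emits dummy token ids, and any timings passed to `gpu_idle_fraction` are made-up inputs rather than the profiled numbers from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """A toy generation request (hypothetical, for illustration only)."""
    request_id: str
    max_tokens: int
    output_tokens: list = field(default_factory=list)

    @property
    def finished(self) -> bool:
        return len(self.output_tokens) >= self.max_tokens


def schedule(requests):
    """CPU phase: pick the unfinished requests that form the next batch."""
    return [r for r in requests if not r.finished]


def forward(batch):
    """Stand-in for the GPU forward pass and sampler: one dummy token id per request."""
    return [len(r.output_tokens) for r in batch]


def process_outputs(batch, token_ids):
    """CPU phase: record new tokens (a real engine also detokenizes and streams here)."""
    for request, token_id in zip(batch, token_ids):
        request.output_tokens.append(token_id)


def run_engine(requests):
    """Synchronous loop: each step runs schedule, forward, output in order.

    Returns the number of decoding steps executed.
    """
    steps = 0
    while True:
        batch = schedule(requests)
        if not batch:
            return steps
        token_ids = forward(batch)
        process_outputs(batch, token_ids)
        steps += 1


def gpu_idle_fraction(schedule_ms, forward_ms, output_ms):
    """Share of one synchronous step in which only the CPU is busy."""
    return (schedule_ms + output_ms) / (schedule_ms + forward_ms + output_ms)
```

In this model the GPU only works during `forward`, so any growth in the two CPU phases shows up directly as GPU idle time.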

Unlike the snail developers in this meme, the community was pretty fast at shipping all of these features, in less than a month.

For the next few slides, I'm going to focus more on the last optimization, asynchronous output processing, which was a joint collaboration between Neural Magic and Databricks. The key idea behind asynchronous output processing is that you delay the output processing of step i to step i+1. If you look at this diagram, the top shows the state before asynchronous output processing was enabled: you have schedule, forward, and output. But with asynchronous output processing, while the model executor is running for step i, we're simultaneously processing the outputs from the previous step, step i-1. So what happens is this entire large white bubble gets reduced, and your per-token output latency is reduced because you're reducing GPU idle time by running your output processor concurrently with the model executor.

However, there are some tradeoffs. Because this delays the output processing, the time to first token will be slightly increased, by a few milliseconds. Also, this approach assumes that all sequences from the previous step will be scheduled for the next step, because the output processing has been delayed. So you might end up decoding one extra token per request. But overall, the benefits of reduced time per output token and improved throughput make this a worthwhile tradeoff for most of our use cases.

So, while this is a very simple idea in theory, implementing it had some challenges, especially because of the complexity of vLLM's architecture. Some of the details I mention here: asynchronous output processing is compatible only when you enable CUDA graphs in vLLM. vLLM has integrated CUDA graphs very nicely. The power of CUDA graphs is that you can reduce all the kernel launch overheads, and you can also fuse operations together and get very low latency on just the model executor part. While that is happening, the CPU is always running ahead of the GPU: inside the model executor, the CPU has already lined up a bunch of operations. So the way we implemented asynchronous output processing was as a callback function. Right after the CUDA graph replay is lined up on the model executor, we call the output processing as a callback function, and this way we achieve concurrency.

We also implemented seamless switching of scheduler states. The main idea is that the scheduler can easily switch from an asynchronous to a synchronous state. This is especially important when you have requests with beam search or sampling parameters with best-of-N greater than one. In such cases, asynchronous output processing is not compatible. So right now, if the scheduler sees any request with beam search, it will automatically switch to a synchronous state, and you default back to the original behavior of vLLM.

Some results of asynchronous output processing: we again ran Llama 3.1 70B on A100s with FP8 per-tensor quantization, and looked at the speedup on prefill-heavy workloads. In this graph, the blue line is the baseline, and the red line is with asynchronous output processing. The x-axis is throughput; the y-axis is latency. So you see lower latency for the same throughput.
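The one-step delay described earlier, where the outputs of step i-1 are processed while the forward pass of step i runs, can be sketched as follows. This is a simplified model under stated assumptions, not the actual callback implementation: the function names are hypothetical, the timings are illustrative inputs, and the steady-state formula assumes output processing is fully hidden behind the forward pass whenever it is the shorter of the two.

```python
def sync_step_ms(schedule_ms, forward_ms, output_ms):
    """Synchronous step: the three phases run back to back."""
    return schedule_ms + forward_ms + output_ms


def async_step_ms(schedule_ms, forward_ms, output_ms):
    """Steady-state asynchronous step (assumed model).

    Output processing for the previous step overlaps with the current
    forward pass, so only the longer of the two is on the critical path.
    """
    return schedule_ms + max(forward_ms, output_ms)


def async_event_order(num_steps):
    """Order in which the CPU issues work when output processing is delayed by one step."""
    events = []
    pending = None  # index of the step whose outputs are not yet processed
    for step in range(num_steps):
        events.append(f"schedule {step}")
        events.append(f"launch forward {step}")
        if pending is not None:
            events.append(f"process output {pending} (overlaps forward {step})")
        pending = step
    if pending is not None:
        # The last step's outputs have no later forward pass to hide behind.
        events.append(f"process output {pending}")
    return events
```

For example, with illustrative timings of 1 ms scheduling, 8 ms forward, and 2 ms output processing, the synchronous step takes 11 ms and the modeled asynchronous step takes 9 ms. The event order also shows why the first token arrives slightly later: step 0's output is only processed after step 1 has been launched.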

Essentially, there's a 1.01x to 1.09x improvement, that is, up to a 9% speedup, for prefill-heavy workloads. This is for 3500 input and 300 output tokens. The advantage here lies at high batch sizes, because the CPU overhead grows linearly with the batch size. So you see a much higher speedup (the green line) at higher batch sizes, that is, at higher requests per second. Similarly, for decode-heavy workloads, where I send requests of about 100 input and 300 output tokens, the speedup is larger: you get almost a 14% speedup with asynchronous output processing at high requests per second.

Another optimization I'd like to talk about, which was led by Anyscale and Neural Magic, is multi-step scheduling. This brought about a huge improvement in latency and throughput. The core idea behind multi-step scheduling is that you schedule multiple decode passes in one large step, and what this essentially does is amortize the cost of scheduling and post-processing over *n* steps. Remember the graph that we saw before, which had schedule, forward, and output processing? With multi-step, you schedule *n* steps, then you do a forward pass for *n* steps, and then you do output processing of *n* steps.
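The multi-step pattern just described (schedule once, run *n* forward passes, then process *n* outputs) can be sketched as below. This is only an illustration under assumptions: the function names are hypothetical, and the cost model treats scheduling plus post-processing as one fixed CPU overhead paid once per group of *n* forward passes, which is a simplification of whatever the real costs are.

```python
def multi_step_trace(num_tokens, n):
    """Event order when decode steps are scheduled in groups of n."""
    events = []
    for start in range(0, num_tokens, n):
        last = min(start + n, num_tokens) - 1
        events.append(f"schedule steps {start}-{last}")
        events.extend(f"forward {i}" for i in range(start, last + 1))
        events.append(f"process outputs {start}-{last}")
    return events


def time_per_token_ms(forward_ms, cpu_overhead_ms, n):
    """Assumed cost model: a fixed CPU overhead amortized over n forward passes."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return forward_ms + cpu_overhead_ms / n
```

With illustrative numbers of 8 ms per forward pass and 4 ms of CPU overhead, the modeled time per token drops from 12 ms at n = 1 to 8.5 ms at n = 8. The trace also makes visible that outputs for a group only become available after the whole group has run.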

Right. But the disadvantage is that, if you just have multi-step scheduling, the output tokens are streamed back to the user only once every *n* steps. Let's assume you run your multi-step schedule for eight steps: without asynchronous output processing, you will be streaming the output tokens back only once every eight steps. But asynchronous output processing works quite nicely with multi-step scheduling, and you end up seeing even further boosts to the performance. Essentially, as soon as the output tokens are ready, the asynchronous output processor will stream them back to the user.

Here are some results when you have these two features together. In this graph, we again look at the Llama 3.1 70B model with FP8 per-tensor quantization on A100s. The blue line is the baseline, where you have neither multi-step nor asynchronous output processing. The red line is when you enable just the asynchronous output processor, and you get a 9% speedup over the baseline. As you can see, the advantage of asynchronous output processing shows up only at high RPS, because the CPU overhead in that case grows linearly with the batch size.

The yellow line is when you enable multi-step, in this case with eight steps. With multi-step, you get an overall improvement even at low RPS, because it helps even at low batch sizes by amortizing the cost of the scheduler. That's why you see lower latency even with just multi-step. But if you add the asynchronous output processor on top of that, you get a further 8% speedup over multi-step. So overall, going from the blue line to the green line, you get a 1.23x speedup over the baseline.

Databricks' vLLM Optimization for Cost-Effective LLM Inference | Ray Summit 2024


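Databricks' LLM serving products and performance improvements

Introduction to Foundation Model APIs: two modes, per-token serving and provisioned throughput

Per-token serving: access and query open-source and proprietary models (including DBRX), with simple per-token pricing

Provisioned throughput: use open-source or custom models, fine-tune and serve them, with guaranteed throughput

vLLM-based inference engine: flexibility, broad model support, efficient memory management, active open-source community engagement

vLLM architecture overview: an LLM engine built on an infinite loop, with batch preparation, GPU execution, and output processing explained

Performance improvements in the latest release (0.6.0+) and asynchronous output processing

2.7x throughput improvement and 5x latency reduction achieved

Asynchronous output processing: step i's output is processed during step i+1, reducing GPU idle time and improving throughput

Tradeoffs of asynchronous output processing: increased time to first token due to delayed output processing and a possible extra decoded token, but an overall performance gain

Implementation challenges and CUDA graph integration; asynchronous processing implemented via a callback function

Asynchronous output processing results: reduced latency and improved throughput confirmed across various workloads

Multi-step scheduling and other optimizations

Multi-step scheduling: schedules multiple decode passes as one large step, minimizing scheduling and post-processing costs

Synergy between multi-step scheduling and asynchronous output processing: output tokens are streamed as soon as they are ready

Results: 1.23x speedup over the baseline when multi-step scheduling and asynchronous output processing are combined

FP8 and speculative decoding: performance gains through quantization, enabling high parallelism without accuracy loss

Speculative decoding: improves Llama model performance in memory-limited settings

Performance gains from asynchronous output processing and multi-step scheduling

Performance improvements from FP8 quantization and speculative decoding

vLLM's future roadmap: asynchronous scheduling and a further improved core implementation

Criteria for choosing the number of multi-step scheduling steps: a tradeoff between throughput and latency

Performance profiling tools and how optimization targets are chosen: using Torch Profiler and analyzing traces to find bottlenecks

Speculative decoding metrics: lookahead of 4 to 8 tokens, acceptance rate above 70%

Plan to dynamically change the speculative decoding strategy based on the live workload

Interaction between the scheduler and speculative decoding, and plans for asynchronous scheduling

Role of caching systems and potential for latency reduction: existing caching mechanisms (e.g., prefix caching) and future directions for improvement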
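The speculative decoding figures in the summary above (a lookahead of 4 to 8 tokens and an acceptance rate above 70%) can be related to expected output per target-model pass with a standard textbook estimate. This is a sketch under assumptions that the source does not state: each draft token is accepted independently with the same probability, and the target model always contributes one additional token per pass. The function name is hypothetical.

```python
def expected_tokens_per_target_pass(acceptance_rate, lookahead):
    """Expected tokens produced per target-model forward pass.

    Assumes each of the `lookahead` draft tokens is accepted independently
    with probability `acceptance_rate`, and that verification stops at the
    first rejection. Under that assumption the expectation is the geometric
    sum 1 + a + a^2 + ... + a^lookahead.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if lookahead < 0:
        raise ValueError("lookahead must be non-negative")
    if acceptance_rate == 1.0:
        return float(lookahead + 1)
    return (1.0 - acceptance_rate ** (lookahead + 1)) / (1.0 - acceptance_rate)
```

Under these assumptions, a 70% acceptance rate with a lookahead of 4 gives roughly 2.77 tokens per target pass, and a lookahead of 8 gives roughly 3.20, which illustrates why longer lookaheads have diminishing returns at a fixed acceptance rate.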
