Splitwise improves GPU usage by splitting LLM inference phases

Growing LLM use is creating new demands on cloud GPU capacity. Splitwise offers an efficient solution: separating the two essential phases of LLM inference to achieve higher throughput within a limited power budget.

The recent surge in large language model (LLM) use is causing significant challenges for cloud providers, requiring them to deploy more GPUs at an unprecedented rate. However, the capacity to provision the power needed to run these GPUs is limited, and with demand for computation surpassing supply, it is not uncommon for user queries to be denied. Therefore, any approach to making the existing infrastructure more efficient—enabling it to serve more queries faster under the same power budget—can have very tangible benefits to both cloud providers and users.

One aspect of LLM inference that currently limits efficient use of resources is that it has two distinct phases with different characteristics: the prompt phase and the token-generation phase. During the prompt phase, LLMs process all user input, or prompts, in parallel, efficiently utilizing GPU compute. However, during the token-generation phase, LLMs generate each output token sequentially and are limited by GPU memory bandwidth. Even when employing state-of-the-art batching mechanisms, the discrepancy between these two phases results in low overall hardware utilization, leading to much higher costs when offering LLMs to users. Figure 1 illustrates the differences between these two phases.

Figure 1. An example of the generative LLM inference process and its two phases. For the prompt "Which is better, pizza or burger?", all input tokens are processed in parallel to produce the first output token ("Pizza"); this prompt phase is compute intensive but accounts for a small share of end-to-end latency. The remaining tokens ("is", "better", ".") are generated serially in the token phase, which is memory intensive and tends to dominate end-to-end latency.
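
To make the two phases concrete, here is a minimal greedy-decoding sketch written against a Hugging Face-style causal-LM interface (a model that accepts input_ids and past_key_values and returns logits plus an updated cache). It is purely illustrative and is not Splitwise code; the model interface and the eos_id value are assumptions.

```python
# Minimal greedy-decoding sketch (illustrative only, not Splitwise code).
# Assumes a Hugging Face-style causal LM: model(input_ids=..., past_key_values=...,
# use_cache=True) returns .logits and .past_key_values; eos_id is an assumption.
import torch

@torch.no_grad()
def generate(model, input_ids, max_new_tokens=32, eos_id=2):
    # Prompt phase: one parallel forward pass over all prompt tokens.
    # Compute intensive; produces the first output token and the KV-cache.
    out = model(input_ids=input_ids, use_cache=True)
    past_kv = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = [next_token]

    # Token-generation phase: one token per forward pass, each step re-reading
    # the growing KV-cache. Serialized and memory-bandwidth-bound; typically
    # dominates end-to-end latency.
    for _ in range(max_new_tokens - 1):
        out = model(input_ids=next_token, past_key_values=past_kv, use_cache=True)
        past_kv = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)
        if next_token.item() == eos_id:  # batch size 1 assumed in this sketch
            break
    return torch.cat(generated, dim=-1)
```

The single batched forward pass is what makes the prompt phase compute-bound, while the per-token loop that repeatedly reads the growing KV-cache is what makes generation memory-bandwidth-bound.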

Splitting the phases with Splitwise

At Azure Research – Systems, we tackled this by creating Splitwise, a technique designed to optimally utilize available hardware by separating the prompt computation and token-generation phases onto separate machines. This approach is underpinned by the insight that prompt processing and token-generation are distinct in their computational, memory, and power requirements. By separating these two phases, we can enhance hardware utilization during both phases. Our paper, “Splitwise: Efficient Generative LLM Inference Using Phase Splitting,” details our methods for developing and testing this technique, including an exploration of how different types of GPUs perform during each phase.   

To create a sustainable approach for GPU provisioning, we used Splitwise to design GPU clusters with three primary objectives: maximizing throughput, minimizing costs, and reducing power. In addition to separating the two LLM inference phases into two distinct machine pools, we include a third machine pool for mixed batching across the prompt and token phases, sized dynamically based on real-time computational demands. Lastly, we transfer the state context (i.e., the KV-cache in the LLM transformer attention layers) from the prompt machines to the token machines over InfiniBand without any perceivable latency impact to the user. This high-level system architecture is illustrated in Figure 2.

Figure 2. A high-level diagram of the Splitwise architecture. Machines maintained in different pools are dedicated to the two distinct LLM inference phases. The mixed pool grows and reduces according to runtime demand. KV-cache encompassing the state of the query after the prompt phase is transferred from the prompt machines to the token machines over InfiniBand with very low latency.
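
The sketch below gives a conceptual, simplified view of this routing. All class names, pool-selection logic, and helper functions here are hypothetical illustrations rather than the actual Splitwise scheduler: the prompt phase runs on a prompt machine, the KV-cache is handed off, and generation continues on a token machine, with the mixed pool available as a fallback.

```python
# Conceptual sketch of phase-split request routing. All names and helper
# functions here are hypothetical illustrations, not the Splitwise scheduler.
from dataclasses import dataclass, field
from typing import List

def run_prompt_phase(machine, request):
    # Stub: in reality, a parallel forward pass over all prompt tokens on a
    # compute-optimized prompt machine, producing the KV-cache and first token.
    return {"kv": f"kv-cache for {request!r}"}, "<first-token>"

def transfer_kv_cache(kv_cache, src, dst):
    # Stub: Splitwise streams the cache from prompt to token machines over
    # InfiniBand, with no perceivable latency impact to the user.
    pass

def run_token_phase(machine, request, kv_cache, first_token):
    # Stub: autoregressive decoding that continues from the transferred cache.
    return first_token + " ... generated tokens"

@dataclass
class Machine:
    name: str
    busy: bool = False

@dataclass
class Cluster:
    prompt_pool: List[Machine]
    token_pool: List[Machine]
    mixed_pool: List[Machine] = field(default_factory=list)  # flexes with demand

    def _grab(self, *pools):
        for pool in pools:
            for m in pool:
                if not m.busy:
                    m.busy = True
                    return m
        raise RuntimeError("no free machine (queueing omitted in this sketch)")

    def serve(self, request):
        # 1) Prompt phase on a prompt machine (fall back to the mixed pool).
        prompt_m = self._grab(self.prompt_pool, self.mixed_pool)
        kv_cache, first_token = run_prompt_phase(prompt_m, request)
        # 2) Hand the KV-cache off to a token machine.
        token_m = self._grab(self.token_pool, self.mixed_pool)
        transfer_kv_cache(kv_cache, src=prompt_m, dst=token_m)
        prompt_m.busy = False
        # 3) Token-generation phase, possibly on cheaper or power-capped hardware.
        output = run_token_phase(token_m, request, kv_cache, first_token)
        token_m.busy = False
        return output

cluster = Cluster(prompt_pool=[Machine("prompt-0")], token_pool=[Machine("token-0")])
print(cluster.serve("Which is better, pizza or burger?"))
```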


Tests show Splitwise maximizes throughput while lowering costs

To evaluate its performance, we used Splitwise to design clusters with different types of GPUs, including NVIDIA DGX-A100 and DGX-H100, while optimizing cost, power, and throughput under specific latency service level agreements (SLAs) for each query. Table 1 shows the machine types we used for each cluster design. Our application of Splitwise encompassed two use cases: code and conversation using the Llama-2-70B and BLOOM-176B LLMs.

Table 1. Details for the prompt and token machines we used for each cluster design, evaluated with Splitwise. All values are normalized to a baseline of DGX-A100. DGX-H100 capped is a system with all GPUs power-capped to half the maximum power.
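
To give a feel for the design space this enables, the sketch below enumerates candidate prompt/token machine types and counts and keeps the highest-throughput combination that fits given power and cost budgets. The machine names echo Table 1, but every numeric value and the simple bottleneck-throughput model are placeholder assumptions, not the paper's measurements or its actual optimizer, which also enforces per-query latency SLAs and sizes the mixed pool.

```python
# Hedged sketch of a cluster-design search: pick prompt/token machine types and
# counts that maximize throughput under power and cost budgets. Machine names
# echo Table 1, but every number below is a placeholder assumption, not a
# measurement from the paper.
from itertools import product

MACHINE_TYPES = {
    # name: (relative cost, relative power, prompt throughput, token throughput)
    "DGX-A100":        (1.0, 1.00, 1.0, 1.0),
    "DGX-H100":        (2.0, 1.70, 2.5, 1.5),
    "DGX-H100-capped": (2.0, 0.85, 2.0, 1.4),
}

def best_design(power_budget, cost_budget, max_machines=16):
    best = None
    for p_type, t_type in product(MACHINE_TYPES, MACHINE_TYPES):
        p_cost, p_pow, p_tput, _ = MACHINE_TYPES[p_type]
        t_cost, t_pow, _, t_tput = MACHINE_TYPES[t_type]
        for n_p in range(1, max_machines + 1):
            for n_t in range(1, max_machines + 1):
                cost = n_p * p_cost + n_t * t_cost
                power = n_p * p_pow + n_t * t_pow
                if cost > cost_budget or power > power_budget:
                    continue
                # End-to-end throughput is capped by the slower of the two
                # phase pools (a simplification of the paper's model).
                tput = min(n_p * p_tput, n_t * t_tput)
                if best is None or tput > best[0]:
                    best = (tput, p_type, n_p, t_type, n_t)
    return best

print(best_design(power_budget=20.0, cost_budget=24.0))
```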

Our findings demonstrate that Splitwise successfully achieves our three goals of maximizing throughput, minimizing costs, and reducing power. Through our evaluation, we observed that the Splitwise cluster design can maximize throughput at the same cost compared with an A100 baseline cluster. Moreover, Splitwise delivers much higher throughput while operating within the same provisioned power constraints as the baseline cluster. Figure 3 shows that compared with Baseline-H100, we can achieve 1.4x higher throughput at 20 percent lower cost. Alternatively, we can achieve 2.35x more throughput with the same cost and power budgets.

Figure 3. Results from baseline and Splitwise clusters optimized for throughput, all under the same power constraints. Splitwise-HH requires the fewest machines, Splitwise-HHcap provides the best throughput, and Splitwise-AA is the cheapest option.

Looking forward

Splitwise marks a leap toward efficient, high-performance LLM deployments. By separating the prompt and token phases, we can unlock new potential in GPU use. Looking forward, we at Microsoft Azure envision tailored machine pools driving maximum throughput, reduced costs, and power efficiency, and we will continue to focus on making LLM inference efficient and sustainable.

Our approach is now part of vLLM and can also be implemented with other frameworks.


Acknowledgements

This work was done in collaboration with our intern, Pratyush Patel from the University of Washington. We also appreciate the help and guidance of Suriya Kalivardhan, Gopi Kumar, and Chetan Bansal.
