As organizations race to productionize large language model (LLM) workloads, two powerful open source projects have emerged to tackle the complexity of inference at scale: vLLM and llm-d. Are llm-d and vLLM on the same track, or are they steering toward different finishing lines?

vLLM: The High-Performance Inference Engine

vLLM is an enterprise-grade, open source inference engine for LLMs. Its performance edge comes from innovations like:

- PagedAttention, which enables efficient KV cache management
- Speculative decoding support
- Tensor parallelism (TP) and multi-model support
- Integration with Hugging Face and OpenAI-compatible APIs

vLLM is the engine responsible for executing inference workloads, managing GPU memory efficiently, and delivering fast responses with high throughput.

Use it when: You want state-of-the-art model serving on a single node or a well-tuned multi-GPU cluster.

llm-d: The Cloud-Native Distributed Inference Framework

llm-d is a newer project that focuses on orchestrating vLLM as part of a cloud-native, distributed inference system. Instead of running vLLM as a single monolith, llm-d provides:

- Disaggregated inference, which splits serving into separate prefill and decode components. Using a racing analogy:
  - Prefill is the formation lap: the driver warms the tires and checks the car's systems (processes the user's prompt) to prepare for the start (computing the initial key-value (KV) cache). This phase is compute-intensive and demands a high volume of operations.
  - Decode is the fast, iterative race itself: the model generates its response one token at a time. This phase needs high memory bandwidth to access the KV cache and produce new tokens quickly.
- Prefix-aware routing, which determines which backend handles a request, much like a race team telling drivers where to line up or setting the high-level race strategy.
- Observability and monitoring, which act as race control, keeping the operation running smoothly. Built on tools such as Prometheus and OpenTelemetry, they provide essential visibility into performance metrics.
- Kubernetes-native APIs, which enable robust policy control for managing traffic and deployments efficiently.

What problems does llm-d solve?

Running monolithic LLM servers leads to inefficient GPU utilization, unpredictable latency, and poor cost efficiency, especially as workloads vary in context length, token rate, and model size. llm-d addresses these issues with a distributed, cache-aware scheduling layer that intelligently routes requests, disaggregates the prefill and decode stages, and maintains session affinity to maximize throughput.

Use it when: You want to scale vLLM-based inference across a fleet of GPUs, enable load balancing, route traffic based on session cache, or integrate with Kubernetes-based observability.

vLLM is the Performance Car, llm-d is the Pit Crew and Race Management

Continuing with our racing analogy, think of it this way:

- vLLM is a high-performance car: capable, fast, and efficient. It can certainly take you to the track on its own.
- llm-d is the pit crew, race strategist, and telemetry system that helps you manage a fleet of those cars across a modern cloud-native stack, ensuring optimal performance and victory.

You wouldn't expect a Formula 1 car, even a fully functioning one, to win a championship without a dedicated team supporting it. Likewise, you wouldn't deploy a globally scalable LLM service using only a single vLLM process, without orchestration, routing, or telemetry.
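To make the "car on its own" concrete, here is a minimal sketch of querying a standalone vLLM server through its OpenAI-compatible API. It assumes the server has already been started separately, and the model name, port, and prompt are placeholders to adapt to your own deployment.

```python
# Minimal sketch: query a standalone vLLM server via its OpenAI-compatible API.
# Assumes the server was started separately, for example:
#   vllm serve Qwen/Qwen2.5-7B-Instruct
# Port 8000 is vLLM's default; the model name here is a placeholder.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # any value works unless --api-key is set on the server
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",     # must match the model the server loaded
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

This is exactly the piece llm-d then multiplies: the same OpenAI-compatible surface, but fronted by routing, scheduling, and telemetry across many such servers.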
They're designed to work together, not replace each other.

What You Can Do with llm-d + vLLM Together

- Serve multi-billion-parameter LLMs with disaggregated prefill/decode workers to achieve higher concurrency and more efficient resource utilization. This architecture lets prefill and decode scale independently, optimizing performance for diverse workloads.
- Run MoE (Mixture of Experts) models across multiple GPU nodes with expert-parallel scheduling. This scheduling technique distributes the different experts within an MoE model across GPUs, maximizing the benefits of MoE architectures for large-scale inference.
- Achieve prefix cache reuse with KV cache-aware routing, significantly reducing latency and compute cost. By intelligently reusing cached key-value pairs from previous requests, the system minimizes redundant computation and accelerates responses, especially for requests that share a common prefix.
- Observe per-token metrics such as Time to First Token, KV cache hit rate, and GPU memory pressure. These metrics provide detailed insight into the inference process, enabling fine-grained performance monitoring and optimization (a simple way to measure Time to First Token yourself is sketched at the end of this post).
- Scale elastically using Kubernetes tools like KEDA, ArgoCD, and Prometheus-based autoscaling. This integration with leading Kubernetes technologies lets the system adjust resources dynamically with demand, ensuring high availability and cost-effectiveness.

All of these capabilities rely on vLLM as the core execution engine, providing high-performance inference, and llm-d as the distributed control plane, orchestrating and managing the entire distributed inference system.

Final Thoughts

llm-d doesn't replace vLLM; it enhances it. It streamlines running vLLM at scale, across clusters, and with cloud-native tools.

Want to try it out? Explore the llm-d guides.
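Once you have an endpoint running, whether a single vLLM server or an llm-d-fronted service, a small client-side script is enough to get a first feel for one of the per-token metrics mentioned above. The sketch below measures Time to First Token by streaming a response; the URL and model name are placeholders, and this is an illustration rather than an official llm-d tool.

```python
# Rough, unofficial sketch: measure Time to First Token (TTFT) against any
# OpenAI-compatible endpoint (a single vLLM pod or an llm-d-fronted service).
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder: use your deployed model
    messages=[{"role": "user", "content": "Briefly explain prefill vs. decode."}],
    stream=True,
    max_tokens=64,
)

for chunk in stream:
    # Some chunks carry no content (e.g. role-only or final chunks); skip them.
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1
        if first_token_at is None:
            first_token_at = time.perf_counter()

total = time.perf_counter() - start
ttft = (first_token_at - start) if first_token_at else float("nan")
print(f"TTFT: {ttft:.3f}s  total: {total:.3f}s  content chunks: {chunks}")
```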