Stop wasting budget on idle GPUs. Learn how to implement dynamic allocation, multi-tenancy, and effective autoscaling for your AI workloads.

The need for GPUaaS on Red Hat OpenShift AI

For organizations investing heavily in AI, the cost of specialized hardware is a primary concern. GPUs and other accelerators are expensive, and hardware that sits idle wastes budget and makes it harder to scale your AI projects. One solution is to adopt GPU-as-a-Service (GPUaaS), an operational model designed to help maximize the return on investment (ROI) of your hardware.

Red Hat OpenShift AI is a Kubernetes-based platform that can be used to implement a multi-user GPUaaS solution. Provisioning the hardware is only the first step. True GPUaaS also requires dynamic allocation based on workload demand, so GPUs are reclaimed quickly and idle time is minimized, and it requires multi-tenancy. This is where an advanced queuing tool like Kueue, a Kubernetes-native job queueing system, becomes indispensable. Kueue partitions shared resources and enforces multi-tenancy through quotas, guaranteeing fair, predictable access for multiple teams and projects. Once this governance is in place, the core challenge shifts to creating an autoscaling pipeline for AI workloads.

AI workload integration and autoscaling

The goal of a GPUaaS platform is to integrate popular AI frameworks and automatically scale resources based on workload demand. OpenShift AI simplifies the deployment of common AI frameworks. These workloads fall into three main categories, all supported by Kueue:

- Inference: Frameworks like KServe and vLLM serve models efficiently, especially large language models (LLMs).
- Training (distributed): Tools like the Kubeflow Training Operator and KubeRay manage complex, multi-node distributed training jobs.
- Interactive data science: Workbenches, the OpenShift AI data science notebook experience, also integrate with Kueue so notebooks are only launched when resources are available, reducing resource waste.

Queue management with Kueue

The central challenge in a multi-tenant AI cluster is managing the flood of GPU job requests. This is the precise role of Kueue. Kueue provides essential queuing and batch management for these compute-intensive jobs. Instead of immediately failing a resource request when the cluster is momentarily saturated, Kueue holds the request in a managed waiting list. This capability is key to maintaining fairness and efficiency: requests aren't arbitrarily rejected, and no single team can monopolize the GPUs.

Effective autoscaling with KEDA

Kueue and KEDA (Kubernetes Event-driven Autoscaling) work together to optimize resource use through both automated scale up and automated scale down.

Automated scale up: KEDA monitors Kueue's metrics, specifically the length of the GPU job queue. By observing this backlog, KEDA can proactively initiate the scale up of new GPU nodes, so new resources are provisioned before current capacity is overwhelmed by demand, improving availability and cost efficiency. This integration turns Kueue's queue into a scaling signal, enabling proactive, demand-driven resource management.

Automated scale down: KEDA also helps reclaim capacity from idle workloads. When a workload (for example, a RayCluster) finishes its task but is not deleted, a custom metric (exposed via Prometheus or similar) reports its idle status. KEDA monitors this idle metric and, through a ScaledObject, scales the idle workload's worker components down to zero replicas, significantly reducing operational costs. Similar methods can be applied to inference, using KEDA to scale KServe components to zero during idle periods. Scaling worker components down frees the underlying node resources, while the Kueue Workload object and its reserved quota remain. Teams keep their quota reservation for the next job without a full re-queueing process, while reducing the waste of expensive, idle compute resources.
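To make the multi-tenancy model concrete, here is a minimal sketch of what the Kueue side of such a setup might look like. The team-a namespace, flavor and queue names, and quota values are illustrative, not a prescribed configuration; adjust them to your own teams and hardware.

```yaml
# Illustrative Kueue setup: one GPU flavor, a per-team quota, and a namespaced queue.
# Names, namespaces, and quota values are examples only.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-flavor                  # represents a class of GPU nodes in the cluster
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-cluster-queue
spec:
  namespaceSelector: {}             # which namespaces may submit via a LocalQueue
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: gpu-flavor
      resources:
      - name: "cpu"
        nominalQuota: 32
      - name: "memory"
        nominalQuota: 256Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 4             # team A's share of the cluster's GPUs
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue
  namespace: team-a
spec:
  clusterQueue: team-a-cluster-queue
```

Training jobs, RayClusters, and workbenches then opt in by carrying the kueue.x-k8s.io/queue-name: team-a-queue label. Kueue admits them when quota is available and holds them in the queue otherwise, rather than rejecting them outright.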
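On the KEDA side, a ScaledObject can turn that queue backlog into a scaling signal. The sketch below assumes a hypothetical gpu-worker Deployment whose pods request GPUs (so scaling it up prompts the cluster autoscaler to add GPU nodes) and queries Kueue's pending-workloads metric through the cluster monitoring stack. The exact metric name and labels depend on your Kueue version, and on OpenShift the Prometheus trigger also needs a TriggerAuthentication, omitted here for brevity.

```yaml
# Illustrative KEDA ScaledObject: scale a GPU-requesting Deployment based on the
# depth of the Kueue queue, and let it drop back to zero when the queue is empty.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gpu-worker-scaler
  namespace: team-a
spec:
  scaleTargetRef:
    name: gpu-worker                # hypothetical Deployment whose pods request nvidia.com/gpu
  minReplicaCount: 0                # scale to zero when there is no backlog
  maxReplicaCount: 8
  cooldownPeriod: 600               # wait 10 minutes of quiet before scaling back down
  triggers:
  - type: prometheus
    metadata:
      serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
      # Pending work in team A's ClusterQueue; metric name/labels may vary by Kueue version.
      query: sum(kueue_pending_workloads{cluster_queue="team-a-cluster-queue"})
      threshold: "1"
```

The same pattern, pointed at an idleness metric instead of queue depth, can drive the scale-to-zero behavior described above for idle Ray worker or KServe components.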
Observability-driven optimization

Continuous monitoring is critical to improving efficiency and maximizing ROI for your GPUaaS. Administrators must continuously track GPU health, temperature, and utilization rates. OpenShift AI's built-in Prometheus/Grafana stack lets administrators create custom dashboards that track GPU utilization per tenant, per project, and per GPU. These metrics feed back into the system, enabling administrators to refine GPU quotas, adjust the fair-sharing policies enforced by Kueue, and confirm that ROI is being maximized.

Conclusion

GPUaaS on OpenShift AI delivers direct business benefits: cost savings through dynamic GPU allocation, improved governance through the multi-tenancy enforced by Kueue's queues and quotas, and improved scalability through integrated autoscaling for all your AI workloads. Red Hat OpenShift AI provides the enterprise-grade solution that transforms expensive, often underutilized GPU hardware into a high-efficiency, multi-tenant GPUaaS platform. Explore the OpenShift AI page to learn more.

Resource: The adaptable enterprise: Why AI readiness is disruption readiness