Red Hat is announcing the latest addition to our portfolio of validated and optimized models: a compressed version of NVIDIA Nemotron-Nano-9B-v2. By leveraging the open source LLM Compressor library, we have created a new INT4 (W4A16) variant that unlocks significant performance gains on NVIDIA AI infrastructure with negligible impact on the model's reasoning capabilities. This release continues our commitment to providing enterprises with open, flexible, and efficient AI solutions that are ready for production deployment across the hybrid cloud.

Key benefits include:

- Smaller and faster Nemotron: A new, compressed Nemotron-Nano-9B-v2 model in INT4 (W4A16) format, created using LLM Compressor, offers a smaller variant that performs best in latency-sensitive use cases.
- Optimized for NVIDIA accelerated computing: The INT4 model, coupled with our open source Machete kernel, a mixed-input GEMM kernel optimized for NVIDIA Hopper GPUs, is engineered to deliver substantial improvements in inference speed and efficiency in vLLM.
- Validated for the enterprise: This model has been rigorously tested and validated by Red Hat, providing the confidence and predictability needed to deploy AI workloads in enterprise environments on Red Hat AI.
- Open and extensible: The model is available in the Red Hat AI repository on Hugging Face, ready for deployment with vLLM and further customization using LLM Compressor.
- Hybrid model support in vLLM: With the release of the vLLM V1 engine, hybrid models like Nemotron-Nano-9B-v2 have become first-class citizens through an exceptionally efficient implementation that delivers blazing performance.

## A more efficient Nemotron for all

NVIDIA Nemotron-Nano-9B-v2 stands out for its innovative hybrid Mamba-Transformer architecture, which is specifically designed to accelerate the long reasoning traces common in agentic AI tasks. This 9-billion-parameter model delivers exceptional performance, generating responses up to six times faster than comparable dense models while supporting a 128K token context length and a wide range of languages. The model also has a "thinking budget" feature that keeps it from overthinking, allowing users to balance accuracy, performance, and inference cost.

To make this powerful model even more accessible and efficient for enterprise use, we have applied state-of-the-art compression techniques using LLM Compressor. Our new variant uses the INT4 (W4A16) quantization scheme, a mixed-precision format that uses 4-bit integers for the model's weights and 16-bit floating-point numbers for the activations. The result is a model that is not only faster but also has a significantly smaller memory footprint, enabling more concurrent requests and longer context lengths on the same hardware, with gains in both performance and memory efficiency. Extensive evaluations have shown that W4A16 quantization maintains a consistently negligible accuracy loss, making it highly competitive with the unquantized baseline model.
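If you want to reproduce or adapt this kind of compression for your own models, the sketch below shows one way to apply a W4A16 GPTQ recipe with LLM Compressor. It is a minimal illustration rather than the exact recipe behind the published checkpoint; the base model ID, calibration dataset, and hyperparameters are assumptions.

```python
# Minimal sketch of W4A16 weight-only quantization with LLM Compressor.
# The base model ID, calibration dataset, and hyperparameters are illustrative
# assumptions, not the exact recipe used for the published checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"  # assumed base model ID
SAVE_DIR = "Nemotron-Nano-9B-v2-W4A16"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# GPTQ recipe: 4-bit weights for Linear layers, activations left in 16-bit,
# with the output head kept at higher precision.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# One-shot, post-training quantization with a small calibration set.
oneshot(
    model=model,
    tokenizer=tokenizer,
    dataset="open_platypus",        # illustrative calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Save the compressed weights in a vLLM-loadable format.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

The resulting checkpoint can then be served with vLLM in the same way as the published `RedHatAI/NVIDIA-Nemotron-Nano-9B-v2-quantized.w4a16` model (see the deployment examples later in this post).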
## Why INT4 (W4A16)?

For this release, we specifically chose the INT4 (W4A16) quantization scheme to deliver maximum efficiency for latency-sensitive enterprise applications. This mixed-precision format offers several key advantages:

- Maximum memory reduction: By compressing model weights to just 4 bits, W4A16 dramatically reduces the model's memory footprint. This is critical for deploying large models on resource-constrained hardware or for maximizing the number of models that can be served on a single GPU.
- Optimized for low latency: The W4A16 format offers excellent performance in synchronous, single-stream deployments. This makes it an ideal choice for interactive applications like chatbots, coding assistants, and agentic workflows where fast response times are essential.
- High accuracy preservation: Despite the aggressive 4-bit weight compression, the W4A16 scheme maintains consistently low accuracy loss. By applying state-of-the-art quantization techniques implemented in LLM Compressor, the model preserves the nuance required for complex reasoning tasks, making it competitive with less aggressive 8-bit quantization methods.

While 8-bit formats like INT8 or FP8 can be more effective for high-throughput, asynchronous workloads on high-end GPUs, W4A16 provides a powerful solution for developers who need to prioritize low latency and minimize memory usage without a significant trade-off in accuracy.

## Unlocking long-context performance: vLLM and the hybrid Mamba-2-Transformer architecture

The innovative hybrid architecture of NVIDIA Nemotron-Nano-9B-v2, which combines traditional attention with Mamba-2 layers, is key to its performance but also presents unique challenges for inference engines. Recent work in the vLLM community has elevated hybrid models from early experimental implementations to fully supported, first-class citizens, unlocking their potential for long-context applications.

Supporting hybrid models in vLLM requires careful treatment of their state. Attention layers rely on a paged KV cache, organized into blocks that are appended as the sequence grows. Mamba-2 layers, by contrast, maintain a large, fixed-size state for each sequence that is updated in place.

In early vLLM versions (V0), hybrid support was achieved through a fragile workaround: the KV cache was managed efficiently, but the Mamba-2 state was allocated separately based on a user-defined parameter. This forced users to guess the right value to avoid CUDA out-of-memory errors, creating a poor developer experience.

In vLLM V1, support was rebuilt around a unified allocator that manages both the KV cache and the Mamba-2 state. This design enables advanced features like prefix caching and allows hybrid models to benefit from V1-wide optimizations. However, Mamba-2 complicates this because its state pages are much larger than attention blocks. To address this, V1 relaxes the requirement that all layers use the same block size. Instead, attention block sizes are automatically increased until their page size aligns with Mamba-2's. While using unusual block sizes might seem inefficient, empirical testing shows little impact on performance.
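To make the page-size alignment idea concrete, here is a small back-of-the-envelope sketch. It is not vLLM's internal code, and all layer dimensions, dtypes, and state sizes are hypothetical placeholders chosen only to illustrate the alignment loop described above.

```python
# Illustrative sketch of the block-size alignment idea described above.
# This is NOT vLLM's internal implementation; layer dimensions, dtypes, and
# state sizes below are hypothetical placeholders.

def attention_page_bytes(block_size_tokens: int, num_kv_heads: int,
                         head_dim: int, dtype_bytes: int) -> int:
    # One attention page holds a key and a value vector for every token in the
    # block, for every KV head.
    return block_size_tokens * num_kv_heads * head_dim * 2 * dtype_bytes


def aligned_attention_block_size(mamba_state_bytes: int, num_kv_heads: int,
                                 head_dim: int, dtype_bytes: int,
                                 base_block: int = 16) -> int:
    # Grow the attention block size in multiples of the base block until one
    # attention page is at least as large as one fixed-size Mamba-2 state page.
    block = base_block
    while attention_page_bytes(block, num_kv_heads, head_dim, dtype_bytes) < mamba_state_bytes:
        block += base_block
    return block


if __name__ == "__main__":
    # Hypothetical numbers purely for illustration: a large fixed per-sequence
    # Mamba-2 state versus a 16-token attention block in 16-bit precision.
    mamba_state_bytes = 64 * 128 * 128 * 4
    print(aligned_attention_block_size(mamba_state_bytes,
                                       num_kv_heads=8, head_dim=128, dtype_bytes=2))
```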
This first-class hybrid Mamba-2-Transformer support in vLLM delivers several key benefits:

- Efficient long-context processing: Mamba-2's primary advantage is its linear-time processing and constant-size state, which avoids the memory and compute costs that cause traditional attention to slow down on long sequences. vLLM's unified memory management fully enables this capability, making Nemotron-Nano-9B-v2 highly effective for use cases like RAG and complex agentic workflows that rely on its 128K context window.
- Optimized performance with CUDA graphs: Mamba-2 architectures can introduce significant CPU overhead due to multiple kernel calls. vLLM's full support for CUDA graphs is crucial for these models, as it captures the entire model execution graph and dramatically reduces CPU bottlenecks. This leads to substantial improvements in low-concurrency scenarios.
- Foundation for future features: By treating hybrid models as a core part of the architecture, this unified approach ensures that future vLLM features like prefix caching and prefill-decode disaggregation will work seamlessly, further enhancing performance and efficiency.

## Performance

The key consideration when quantizing a model is the trade-off between inference performance and model accuracy. The INT4 (W4A16) format is specifically engineered to optimize this trade-off for latency-sensitive enterprise workloads.

In single-stream, low-latency, synchronous deployments, reducing model weights to 4-bit integers directly translates to lower latency because it cuts the number of weight bits that must be read from memory by 75%. This makes it an excellent choice for interactive, real-time applications where a user is actively waiting for a response.

For use cases that demand high throughput with many concurrent users (asynchronous deployment), 8-bit formats like INT8 or FP8 often provide better overall performance on high-end GPUs. However, for developers prioritizing the fastest possible response time for individual queries, or deploying on memory-constrained edge devices, the INT4 model delivers compelling performance gains.

To ensure a consistent and transparent evaluation, all benchmarks were conducted on a single NVIDIA H100 GPU (H100x1) using vLLM (v0.11.0) as the underlying inference backend. Each model was launched with the following command:

```bash
python3 -m vllm.entrypoints.openai.api_server \
  --model <model_name> \
  --max-model-len 16384 \
  --tensor-parallel-size 1 \
  --trust-remote-code
```

We used GuideLLM to drive benchmark traffic and measure performance under different constant RPS (requests per second) values ranging from 1 to 9. The benchmark configuration for a given RPS was:

```bash
guidellm benchmark \
  --target http://localhost:10000 \
  --output-path /tmp/outputs/guidellm_results/report.json \
  --rate-type constant \
  --rate <rps> \
  --data prompt_tokens=512,prompt_tokens_stdev=51,prompt_tokens_min=410,prompt_tokens_max=614,output_tokens=256,output_tokens_stdev=26,output_tokens_min=205,output_tokens_max=307,samples=1000 \
  --max-seconds 900 \
  --processor <model_name> \
  --model <model_name> \
  --request-samples 2147483647 \
  --stop-over-saturated \
  --max-error-rate 0.5
```

This configuration simulates realistic workloads with 512 input/256 output token lengths (with some standard deviation) and controlled RPS rates (1-9) to capture performance across different low-latency workloads. As shown in the graph below [Figure 5], the higher the RPS, the more the INT4 quantized model shines, delivering superior latency performance.

## Accuracy

While aggressive quantization can sometimes result in a noticeable degradation in model quality, our team has developed and fully open sourced advanced methods in LLM Compressor that enhance state-of-the-art quantization algorithms such as GPTQ, pushing the boundaries of accuracy recovery in quantized models. For this Nemotron model in particular, we augment the standard GPTQ approach with mean-squared-error-optimal quantization scales and an importance-based ordering of weight quantization. This enables the algorithm to compensate for the quantization errors of more challenging weights by leveraging the redundancy of weights that are easier to quantize.
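Accuracy results like the ones reported below can be spot-checked with open source tooling. The following is a hedged sketch using the Language Model Evaluation Harness with vLLM as the backend; the task selection and settings are illustrative and not the exact configuration behind our reported benchmark numbers.

```python
# Hedged sketch: spot-checking the quantized model with the
# Language Model Evaluation Harness (lm-eval) using its vLLM backend.
# The task list and settings are illustrative, not the exact configuration
# used to produce the reported AIME25 / GPQA-Diamond / MATH-500 results.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=RedHatAI/NVIDIA-Nemotron-Nano-9B-v2-quantized.w4a16,"
        "trust_remote_code=True,max_model_len=16384"
    ),
    tasks=["gsm8k"],   # illustrative reasoning task
    num_fewshot=5,
)

# Per-task metrics (e.g., exact-match accuracy) live under results["results"].
print(results["results"])
```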
The Nemotron-Nano-9B-v2 model itself achieves state-of-the-art accuracy on a wide range of reasoning benchmarks, outperforming other models in its class on tasks like math, coding, and general knowledge. Our INT4 variant retains the vast majority of this capability, ensuring that the benefits of quantization come with minimal trade-off in reasoning performance.

In the figure below, we present reasoning results for three configurations: the baseline (unquantized) Nemotron-Nano-9B-v2 model, its FP8 weight-and-activation quantized counterpart, and our INT4 weight-only quantized variant. As shown, the well-optimized INT4 model performs on par with both the FP8 and unquantized baselines on popular reasoning benchmarks, including AIME25, GPQA-Diamond, and MATH-500.

## Compressed Nemotron in action

The INT4 compressed Nemotron-Nano-9B-v2 model is ready for immediate deployment on Red Hat AI, integrating seamlessly with the vLLM serving engine in Red Hat AI Inference Server and Red Hat OpenShift AI. Developers can get started quickly with just a few lines of code.

### Deploy on vLLM

```bash
VLLM_USE_PRECOMPILED=1 uv pip install --no-cache git+https://github.com/vllm-project/vllm.git

vllm serve RedHatAI/NVIDIA-Nemotron-Nano-9B-v2-quantized.w4a16 \
  --trust-remote-code \
  --max-num-seqs 64 \
  --mamba_ssm_cache_dtype float32
```

Note:

- Remember to add `--mamba_ssm_cache_dtype float32` to preserve response quality. Without this option, the model's accuracy may degrade.
- If you encounter a CUDA out-of-memory (OOM) issue, try `--max-num-seqs 64` and consider lowering the value further if the error persists.
- See more ways to deploy with vLLM.

### Deploy on Red Hat AI

#### Deploy on Red Hat AI Inference Server

```bash
podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
  --ipc=host \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" \
  -v ~/.cache/vllm:/home/vllm/.cache \
  --name=vllm \
  registry.access.redhat.com/rhaiis/rh-vllm-cuda \
  vllm serve RedHatAI/NVIDIA-Nemotron-Nano-9B-v2-quantized.w4a16 \
  --trust-remote-code \
  --max-num-seqs 64 \
  --mamba_ssm_cache_dtype float32
```

See the Red Hat AI Inference Server documentation for more details.
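Once a server is running, whether through `vllm serve` directly or the Red Hat AI Inference Server container above, it exposes an OpenAI-compatible API. The snippet below is a minimal sketch of querying it with the OpenAI Python client; the endpoint URL and port are assumptions based on the commands above, so adjust them to match your environment.

```python
# Minimal sketch: querying the OpenAI-compatible endpoint served above.
# The base URL and port are assumptions based on the deployment commands
# in this post; adjust them to match your environment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/NVIDIA-Nemotron-Nano-9B-v2-quantized.w4a16",
    messages=[
        {"role": "user", "content": "How can a bee fly when its wings are so small?"}
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```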
#### Deploy on Red Hat OpenShift AI

1. Set up the vLLM server with a ServingRuntime. Save the following as `vllm-servingruntime.yaml`:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
  annotations:
    openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  annotations:
    prometheus.io/port: '8080'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  containers:
    - name: kserve-container
      image: quay.io/modh/vllm:rhoai-2.25-cuda
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--port=8080"
        - "--model=/mnt/models"
        - "--served-model-name={{.Name}}"
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      ports:
        - containerPort: 8080
          protocol: TCP
```

2. Attach the model to the vLLM server with an InferenceService. Save the following as `inferenceservice.yaml`:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: NVIDIA-Nemotron-Nano-9B-v2-quantized.w4a16 # OPTIONAL CHANGE
    serving.kserve.io/deploymentMode: RawDeployment
  name: nvidia-Nemotron-Nano-9B-v2-quantized.w4a16 # specify model name; this value will be used to invoke the model in the payload
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '2' # this is model specific
          memory: 8Gi # this is model specific
          nvidia.com/gpu: '1' # this is accelerator specific
        requests: # same comments apply to this block
          cpu: '1'
          memory: 4Gi
          nvidia.com/gpu: '1'
      runtime: vllm-cuda-runtime # must match the ServingRuntime name above
      storageUri: oci://registry.redhat.io/rhelai1/modelcar-nvidia-Nemotron-Nano-9B-v2-quantized.w4a16:1.5
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
```

3. Apply the resources to your cluster:

```bash
# Make sure you are in the project where you want to deploy the model
# oc project <project-name>

# Apply the ServingRuntime
oc apply -f vllm-servingruntime.yaml

# Apply the InferenceService
oc apply -f inferenceservice.yaml
```

4. Call the server using curl:

```bash
# Replace <inference-service-name> and <cluster-ingress-domain> below
# (run `oc get inferenceservice` to find your URL if unsure):
curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia-Nemotron-Nano-9B-v2-quantized.w4a16",
    "stream": true,
    "stream_options": {
      "include_usage": true
    },
    "max_tokens": 1,
    "messages": [
      {
        "role": "user",
        "content": "How can a bee fly when its wings are so small?"
      }
    ]
  }'
```

See the Red Hat OpenShift AI documentation for more details.

## Validated for the enterprise: Deploying with confidence

True enterprise AI requires more than just a high score on a public benchmark. It demands confidence, predictability, and flexibility. Deploying a model into production without rigorous testing can lead to spiraling inference costs, rising latency under load, and unexpected alignment issues, all of which pose a risk to business operations. This is why Red Hat's model validation process is so critical. We move beyond leaderboard hype to assess AI models with real-world data and tasks, across diverse hardware, targeting key enterprise use cases.

Our validation is built on four pillars designed to accelerate your time to value:

- Clear deployment guidance: We provide the confidence and predictability you need to select and deploy the right third-party models for your specific needs on Red Hat AI. Our process offers guidance to right-size your deployments, helping you select the optimal combination of models and hardware for your use cases and accelerate your time to production.
- Concrete data: We offer a clear understanding of each model's scalable performance and accuracy, all tested within the context of specific hardware and real-world enterprise scenarios. By running workload-specific benchmarks, we provide transparent results that clarify the complex trade-offs between performance, accuracy, and cost.
- Reproducible results: We ensure that the performance you see in our benchmarks can be reliably reproduced in your own production environments. Our entire validation process uses open source tools like the Language Model Evaluation Harness and vLLM, ensuring our accuracy and performance results are transparent and repeatable.
- Enterprise-ready packaging: We provide standardized container formats in our production registry to create a version-controlled asset with an enhanced security footprint for your AI workloads. These assets are scanned for vulnerabilities and simplify deployment and lifecycle management, integrating into a trusted AI software supply chain from creation to deployment.
By running the INT4 Nemotron-Nano-9B-v2 variant through this process, we provide organizations with the empirical data needed to confidently deploy the best-fit models on their infrastructure of choice, including Red Hat OpenShift AI, Red Hat Enterprise Linux AI, and Red Hat AI Inference Server, with full visibility into expected performance and cost.

## Open and scalable AI

The release of the compressed INT4 Nemotron-Nano-9B-v2 model exemplifies the power of a collaborative, open source ecosystem. By combining a state-of-the-art model from NVIDIA with open, extensible tools like LLM Compressor and vLLM, and backing it with Red Hat's enterprise-grade validation and support, we are making powerful AI more accessible, efficient, and reliable for everyone.

Explore the new model and our compression recipes in the Red Hat AI repository on Hugging Face or in the Red Hat Container Registry, and soon in the Red Hat OpenShift AI 3.0 catalog.