DiffusionGemma: NVIDIA RTX Accelerates Google's Parallel Text AI Model

Google DeepMind's DiffusionGemma generates text in 256-token parallel blocks rather than one token at a time. NVIDIA has optimised the model for up to 4x the speed of autoregressive equivalents on RTX, DGX Spark, and H100 hardware.

Published: June 11, 2026 By Sarah Chen, AI & Automotive Technology Editor Category: AI

Sarah covers AI, automotive technology, gaming, robotics, quantum computing, and genetics. Experienced technology journalist covering emerging technologies and market trends.

LONDON, June 11, 2026 — Google DeepMind released DiffusionGemma on June 10, a new open-weight language model that generates text through diffusion rather than sequential token prediction. NVIDIA has optimised the model across GeForce RTX, RTX PRO, and DGX hardware from day one. The release marks the most significant architectural departure from mainstream autoregressive language models in several years, with coordinated ecosystem support across Hugging Face, vLLM, and Unsloth on launch day rather than weeks later.

What Is DiffusionGemma

Built on the Gemma 4 26-billion-parameter mixture-of-experts architecture, DiffusionGemma replaces the standard autoregressive output head with a diffusion head that denoises a full block of up to 256 tokens per step. In a conventional large language model, each token must wait for the previous one, because the model feeds its own output back into the forward pass at every step. DiffusionGemma removes that dependency within a block: all of the block's tokens (up to 256) are computed simultaneously, then iteratively refined through a denoising process until stable text emerges.
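To make the contrast concrete, here is a minimal toy sketch in Python. It is not DiffusionGemma's actual algorithm: the `toy_forward_pass` function, the confidence-based commit rule, and the choice of 16 refinement steps are all assumptions made for illustration. It only counts how many forward passes each approach needs to fill one 256-token block.

```python
import random

BLOCK_SIZE = 256  # block length quoted in the article
MASK = None       # marks a position that has not been decided yet


def toy_forward_pass(block, rng):
    """Hypothetical stand-in for a model call: one (token, confidence) pair per position."""
    return [(f"tok{i}", rng.random()) for i in range(len(block))]


def autoregressive_generate(n_tokens=BLOCK_SIZE):
    """Sequential decoding: one forward pass for every generated token."""
    tokens, passes = [], 0
    for i in range(n_tokens):
        passes += 1              # each new token waits for the previous one
        tokens.append(f"tok{i}")
    return tokens, passes


def block_denoise(n_tokens=BLOCK_SIZE, n_steps=16, seed=0):
    """Parallel refinement: each pass scores every position, then commits the most confident."""
    rng = random.Random(seed)
    block = [MASK] * n_tokens
    passes = 0
    for step in range(n_steps):
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        if not masked:
            break
        proposals = toy_forward_pass(block, rng)
        passes += 1
        steps_left = n_steps - step
        quota = -(-len(masked) // steps_left)  # ceiling division: even share per remaining step
        keep = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:quota]
        for i in keep:
            block[i] = proposals[i][0]
    return block, passes


if __name__ == "__main__":
    _, ar_passes = autoregressive_generate()
    _, diff_passes = block_denoise()
    print("autoregressive passes:", ar_passes)
    print("block-denoising passes:", diff_passes)
```

With these assumed settings the sequential loop needs 256 passes and the block loop needs 16. Each block pass does far more work than a single-token pass, so the pass count alone is not a speed measurement.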

The architectural consequence is a shift from memory-bandwidth-bound inference — where GPUs spend most time waiting on data movement — to compute-bound inference, where dense matrix operations dominate. NVIDIA Tensor Cores are specifically designed to saturate that regime, which is why the performance advantage is so pronounced on NVIDIA silicon and why the company moved so quickly to validate it.
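The shift can be illustrated with a rough arithmetic-intensity estimate, sketched below in Python. The figures are assumptions rather than DiffusionGemma measurements: a single dense square weight matrix of hypothetical width `d_model = 4096`, 2-byte values, and no mixture-of-experts routing or attention cache traffic. Arithmetic intensity is the number of floating-point operations performed per byte moved; low values mean the GPU is waiting on memory, high values mean it is kept busy computing.

```python
def arithmetic_intensity(tokens, d_model=4096, bytes_per_value=2):
    """FLOPs per byte for multiplying `tokens` activation rows by one d_model x d_model matrix.

    Simplifying assumptions: the weights are read once per pass, and each token's
    input is read once and its output written once.
    """
    flops = 2 * tokens * d_model * d_model                     # one multiply and one add per weight per token
    weight_bytes = bytes_per_value * d_model * d_model         # load the weight matrix
    activation_bytes = bytes_per_value * 2 * tokens * d_model  # read inputs, write outputs
    return flops / (weight_bytes + activation_bytes)


if __name__ == "__main__":
    for tokens in (1, 256):
        print(tokens, "token(s) per pass:", round(arithmetic_intensity(tokens), 2), "FLOPs/byte")
```

Under these assumptions a single-token pass performs about one operation per byte moved, while a 256-token pass performs a couple of hundred. Whether a given GPU is then compute-bound depends on its own ratio of peak compute to memory bandwidth.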

Performance Benchmarks

According to the NVIDIA RTX AI Garage announcement, the optimised model delivers 1,000 tokens per second on a single H100 Tensor Core GPU, 150 tokens per second on NVIDIA DGX Spark, and up to 2,000 tokens per second on DGX Station — approximately four times the throughput of an equivalent autoregressive model in single-user inference. The NVIDIA-hosted API at build.nvidia.com gives developers immediate access before local GeForce RTX deployment is finalised.

| Platform | Tokens/sec | Memory |
|---|---|---|
| NVIDIA H100 Tensor Core GPU | 1,000 | HBM3 |
| DGX Station | 2,000 | 748GB coherent |
| DGX Spark (GB10 Grace Blackwell) | 150 | 128GB unified |
| GeForce RTX (llama.cpp) | Coming soon | Consumer GDDR |

The DGX Spark figure deserves particular attention for the local AI market: 150 tokens per second on a deskside device with 128GB unified memory is a meaningful threshold for interactive agentic loops. It crosses the point at which a local model can keep pace with developer iteration cycles without noticeable latency.
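A quick back-of-the-envelope calculation shows what these rates mean for one 256-token block. The Python sketch below uses the throughput figures reported above. Two assumptions are not from NVIDIA: the approximate 4x ratio is applied uniformly to every platform, and the implied autoregressive rates are derived from that ratio rather than measured. The times are steady-state averages, not time-to-first-token.

```python
# Tokens per second as reported in the article.
REPORTED_TOKENS_PER_SEC = {
    "H100": 1000,
    "DGX Station": 2000,
    "DGX Spark": 150,
}
ASSUMED_SPEEDUP = 4.0  # the article's approximate ratio, assumed here to hold on every platform
BLOCK_SIZE = 256       # tokens per block


def seconds_to_generate(n_tokens, tokens_per_sec):
    """Average time to produce n_tokens at a steady throughput."""
    return n_tokens / tokens_per_sec


def implied_autoregressive_rate(tokens_per_sec, speedup=ASSUMED_SPEEDUP):
    """Baseline rate implied by the quoted speedup; a derived figure, not a measurement."""
    return tokens_per_sec / speedup


if __name__ == "__main__":
    for platform, rate in REPORTED_TOKENS_PER_SEC.items():
        block_time = seconds_to_generate(BLOCK_SIZE, rate)
        baseline = implied_autoregressive_rate(rate)
        print(f"{platform}: {block_time:.2f} s per block, implied baseline {baseline:.1f} tokens/sec")
```

On DGX Spark this works out to roughly 1.7 seconds per 256-token block, against an implied 37.5 tokens per second for the autoregressive baseline under the same assumption.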

Open Weights and Ecosystem Readiness

DiffusionGemma is released under an Apache 2.0 licence. Weights are available on Hugging Face with same-day support in the Transformers library. vLLM provides production serving from launch, and fine-tuning is available through Unsloth and NVIDIA NeMo, with published DGX Spark playbooks for both inference and fine-tuning workflows.

Support in three independent runtimes on day one signals a coordinated launch, not a research release. NVIDIA's newsroom has framed DiffusionGemma as a validation of RTX AI Garage, the company's programme for certifying and optimising open models for local and professional deployment. The NVIDIA developer blog has also published a technical deep-dive covering the CUDA-level implementation details.

Industry Analysis

The timing aligns with accelerating enterprise interest in local inference. As Business 2.0 has reported, AI chip architectures are evolving rapidly under combined cost and latency pressure. Running a 26B-parameter model locally at 150 tokens per second eliminates per-token API costs and data-egress risk — precisely the factors driving the agentic AI build-out enterprises are now funding at scale.

For the GPU market, DiffusionGemma is a useful proof point at a sensitive moment. As rival AI chip architectures gain momentum, NVIDIA's ability to deliver a roughly fourfold speed-up on a third-party model at launch reinforces CUDA software depth as a durable competitive advantage beyond hardware specifications alone. Investors tracking AI capital allocation trends should note that local inference infrastructure is increasingly where hardware premiums are being validated in practice.

One important caveat: DiffusionGemma is explicitly experimental. Google DeepMind has not published exhaustive quality benchmarks comparing diffusion text generation against leading autoregressive models at equivalent parameter counts. The 4x throughput advantage is measured in single-user, low-batch workloads, the regime where autoregressive models are most constrained by sequential decoding and memory bandwidth. In high-batch server inference the advantage narrows considerably. Investors following the broader AI spending wave should weigh the single-user framing carefully before extrapolating to data-centre economics.

Forward Outlook

DiffusionGemma introduces a credible second inference paradigm to the open-model ecosystem alongside autoregressive generation. If independent quality benchmarks validate output quality at Gemma 4 scale, the model could reshape how local-first AI applications — particularly agentic workloads where latency is the binding constraint — are architected and deployed. NVIDIA's day-zero investment across its full hardware stack signals that the company views diffusion-based text generation as strategically material. Developers can test it now via Hugging Face Transformers or the hosted API at build.nvidia.com.

Frequently Asked Questions

What is DiffusionGemma?

DiffusionGemma is an experimental open-weight language model from Google DeepMind that generates text via diffusion — denoising up to 256 tokens in parallel per step — rather than the sequential token-by-token method used by standard LLMs.

How fast is DiffusionGemma on NVIDIA hardware?

NVIDIA's optimised build delivers 1,000 tokens/sec on H100, 150 tokens/sec on DGX Spark, and up to 2,000 tokens/sec on DGX Station — roughly 4x faster than autoregressive equivalents in single-user workloads.

Is DiffusionGemma open source?

Yes. DiffusionGemma ships under an Apache 2.0 licence with weights on Hugging Face and day-zero support in Transformers, vLLM, and Unsloth.

What hardware do I need to run DiffusionGemma locally?

GeForce RTX support through llama.cpp is listed as coming soon. DGX Spark and RTX PRO 6000 workstations can run the model today, with published setup playbooks from NVIDIA.

Does DiffusionGemma replace autoregressive models?

Not yet — it remains experimental and independent quality benchmarks at scale are pending. Its speed advantage is largest in single-user, low-batch settings, not high-throughput server inference.