CUDA Agent | Large-Scale Agentic RL for CUDA Kernel Generation

Latest News

2026.02.27 The GitHub repository now includes the agent workdir for the CUDA Agent workflow.

2026.02.27 The training dataset has been released on Hugging Face as CUDA-Agent-Ops-6K.

Abstract

GPU kernel optimization is fundamental to modern deep learning but remains a specialized task requiring deep hardware expertise. Existing CUDA code generation approaches rely on either training-free refinement or fixed execution-feedback loops, neither of which improves the model's intrinsic optimization ability.

We present CUDA Agent, a large-scale agentic reinforcement learning system with three core components: scalable data synthesis, a skill-augmented CUDA development environment with reliable verification and profiling, and RL algorithmic techniques for stable long-context training.

CUDA Agent achieves state-of-the-art results on KernelBench, delivering faster rates of 100%, 100%, and 92% over torch.compile on the Level-1, Level-2, and Level-3 splits, respectively.

KernelBench comparison against torch.compile and strong proprietary models.

Key Contributions

Large-Scale Agentic RL System for CUDA Optimization

We introduce CUDA Agent, a large-scale agentic reinforcement learning system that improves intrinsic CUDA kernel generation and optimization ability through scalable synthesis, a skill-augmented environment, and stable long-horizon training.

State-of-the-Art KernelBench Performance

CUDA Agent achieves state-of-the-art results on KernelBench, with a 96.8% overall faster rate vs. torch.compile, and outperforms strong proprietary models on the hardest Level-3 setting.

Data Release: CUDA-Agent-Ops-6K

We release CUDA-Agent-Ops-6K, a high-quality synthesized training dataset with filtering and contamination control, supporting reproducible research on RL-based CUDA kernel optimization.

System Pipeline

Data Synthesis

We build training tasks with a three-stage pipeline: seed problem crawling, LLM-based combinatorial synthesis, and execution-driven filtering. Seed operators are mined from torch and transformers, each represented as a Python class with initialization and forward methods.

Combinatorial synthesis samples up to 5 torch operators and composes them sequentially into fused tasks.
Filtering keeps only tasks that run in both eager and compile modes and removes stochastic operators.
Anti-hacking checks remove constant or indistinguishable outputs across different inputs.
Workload control keeps eager runtime in the 1 ms to 100 ms range and removes tasks that are highly similar to KernelBench cases.

The final curated dataset contains 6,000 training samples (CUDA-Agent-Ops-6K), designed for scalable RL training with broad task diversity and reduced contamination risk.
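
The synthesis and filtering steps above can be illustrated with a minimal sketch. It is not the released pipeline: plain Python functions on lists stand in for torch operators, and every name here (SEED_OPS, compose, sample_task, has_distinguishable_outputs, within_workload) is hypothetical. Only the limit of 5 operators, the sequential composition, the distinguishable-output check, and the 1 ms to 100 ms window come from the description above.

```python
import random
from typing import Callable, Dict, List, Sequence

Op = Callable[[List[float]], List[float]]

# Hypothetical stand-ins for seed operators (the real seeds are mined
# from torch and transformers and are not reproduced here).
SEED_OPS: Dict[str, Op] = {
    "scale": lambda xs: [2.0 * x for x in xs],
    "shift": lambda xs: [x + 1.0 for x in xs],
    "relu": lambda xs: [max(0.0, x) for x in xs],
    "square": lambda xs: [x * x for x in xs],
    "zero": lambda xs: [0.0 for _ in xs],  # degenerate: output ignores input
}


def compose(names: Sequence[str]) -> Op:
    """Chain the named operators sequentially into one fused task."""
    def fused(xs: List[float]) -> List[float]:
        out = list(xs)
        for name in names:
            out = SEED_OPS[name](out)
        return out
    return fused


def sample_task(rng: random.Random, max_ops: int = 5) -> List[str]:
    """Sample between 1 and max_ops operator names, with replacement."""
    k = rng.randint(1, max_ops)
    return [rng.choice(sorted(SEED_OPS)) for _ in range(k)]


def has_distinguishable_outputs(task: Op, inputs: Sequence[List[float]]) -> bool:
    """Anti-hacking check: reject tasks whose output is the same for all inputs."""
    outputs = {tuple(task(xs)) for xs in inputs}
    return len(outputs) > 1


def within_workload(eager_ms: float, lo: float = 1.0, hi: float = 100.0) -> bool:
    """Workload control: keep tasks whose eager runtime is in [lo, hi] ms."""
    return lo <= eager_ms <= hi
```

For example, compose(["relu", "zero"]) always returns zeros, so has_distinguishable_outputs rejects it, while compose(["scale", "shift"]) passes. In a real pipeline the eager runtime would be measured on a GPU rather than passed in as a number.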

Agent Environment

The agent loop follows a ReAct-style workflow with coding tools and a CUDA skill specification (SKILL.md), enabling iterative coding, compile-debug cycles, and profiler-guided optimization.

Standard workflow: profile native PyTorch, implement CUDA kernels/bindings, compile in GPU sandbox, iterate.
Target requirement: pass correctness checks and exceed a 5% speedup over torch.compile.
Reward schedule: milestone-based discrete rewards for correctness and speed gains, chosen for robustness.
Anti-reward-hacking controls: protected verify/profile scripts, forbidden fallback calls, 5-input correctness checks, synchronized warm-up profiling, no web retrieval.

These constraints provide reliable execution-based feedback so policy learning emphasizes true kernel quality rather than shortcut behaviors.
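
A milestone-based reward of this kind can be sketched as follows. The specific reward values (-1, 1, 2, 3) and the tolerance settings are assumptions made for illustration and are not stated above. Only the multi-input correctness gate and the 5% margin over the baseline come from the text. The function names are hypothetical, and plain Python lists stand in for tensors.

```python
import math
from typing import Callable, List, Sequence

Fn = Callable[[List[float]], List[float]]


def passes_correctness(candidate: Fn, reference: Fn,
                       inputs: Sequence[List[float]],
                       rel_tol: float = 1e-4, abs_tol: float = 1e-6) -> bool:
    """Require elementwise agreement with the reference on every input."""
    for xs in inputs:
        got, want = candidate(xs), reference(xs)
        if len(got) != len(want):
            return False
        if not all(math.isclose(g, w, rel_tol=rel_tol, abs_tol=abs_tol)
                   for g, w in zip(got, want)):
            return False
    return True


def milestone_reward(correct: bool, kernel_ms: float, eager_ms: float,
                     compile_ms: float, margin: float = 0.05) -> int:
    """Map an outcome to a discrete reward (values are illustrative only)."""
    if not correct:
        return -1  # assumed penalty for failing the correctness gate
    beats_eager = eager_ms / kernel_ms > 1.0 + margin
    beats_compile = compile_ms / kernel_ms > 1.0 + margin
    if beats_eager and beats_compile:
        return 3  # assumed top milestone: faster than both baselines
    if beats_eager:
        return 2  # assumed middle milestone: faster than eager only
    return 1      # correct, but no qualifying speed gain
```

Discrete milestones of this shape keep the reward insensitive to small timing noise: a kernel that is 1% faster than torch.compile earns the same reward as one that merely matches it, so only gains beyond the margin are reinforced.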

Training Pipeline

Training is staged to stabilize long-horizon RL for CUDA coding. We first run a single-turn PPO warm-up, then initialize both the actor and the critic before starting full multi-turn agentic RL.

Single-turn warm-up improves base CUDA generation before entering interactive agent training.
Actor initialization uses Rejection Fine-Tuning (RFT) on sampled trajectories with positive outcomes.
RFT filtering removes inefficient loops and invalid tool-call patterns to reduce policy collapse risk.
Critic initialization uses value pretraining so advantage estimates are reliable from early steps.

With this multi-stage design, training remains stable in long-context settings (up to 128k tokens of context, up to 150 turns during training, and up to 200 turns during evaluation), enabling sustained reward growth.
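
The trajectory selection used for RFT can be sketched roughly as below. The Trajectory record, the repeated-call heuristic for detecting inefficient loops, and the max_repeat threshold are all assumptions for illustration. The text above only states that trajectories with positive outcomes are kept and that inefficient loops and invalid tool-call patterns are filtered out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trajectory:
    """Hypothetical record of one sampled agent rollout."""
    reward: float
    tool_calls: List[str]  # tool name used at each turn, in order
    invalid_calls: int     # number of malformed tool calls


def longest_repeat(calls: List[str]) -> int:
    """Length of the longest run of identical consecutive tool calls."""
    best = 0
    run = 0
    prev: Optional[str] = None
    for call in calls:
        run = run + 1 if call == prev else 1
        best = max(best, run)
        prev = call
    return best


def select_for_rft(trajs: List[Trajectory], max_repeat: int = 3) -> List[Trajectory]:
    """Keep positive-outcome rollouts without invalid calls or long loops."""
    return [
        t for t in trajs
        if t.reward > 0
        and t.invalid_calls == 0
        and longest_repeat(t.tool_calls) <= max_repeat
    ]
```

A rollout that reaches a positive reward only after calling the same tool many times in a row would be dropped under this heuristic, so the fine-tuned actor does not imitate that pattern.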

Main Results

We report full metrics for both Overall and Level-3 splits on KernelBench: Pass Rate, Faster Rate (vs. Eager / vs. Compile), and Geomean Speed-up (vs. Eager / vs. Compile).

Overall

Pass Rate: 98.8%
Faster Rate vs. Eager: 98.4%
Faster Rate vs. Compile: 96.8%
Geomean Speed-up vs. Eager: 2.60x
Geomean Speed-up vs. Compile: 2.11x

Level-3

Pass Rate: 94%
Faster Rate vs. Eager: 94%
Faster Rate vs. Compile: 90%
Geomean Speed-up vs. Eager: 1.80x
Geomean Speed-up vs. Compile: 1.52x

Compared with strong proprietary baselines, CUDA Agent holds a clear lead in compile-relative performance. On overall KernelBench, it reaches a 96.8% faster rate vs. compile and a 2.11x geomean speed-up, while Claude Opus 4.5 and Gemini 3 Pro reach faster rates of roughly 66.4% to 69.6% and speed-ups of 1.42x to 1.46x. The advantage is most pronounced on difficult settings: on Level-3, CUDA Agent achieves a 90% faster rate vs. compile (about 40 points above the strongest proprietary baselines), and on Level-2 operator-sequence tasks it reaches a 100% faster rate with a 2.80x geomean speed-up vs. compile.
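
For readers who want to reproduce these three metrics from per-task timings, the sketch below shows one plausible way to compute them. The TaskResult record and function names are hypothetical. How failed tasks are treated is an assumption here: they count against the faster rate and are excluded from the geometric mean, which may differ from the exact evaluation protocol.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    """Hypothetical per-task record; runtimes are in milliseconds."""
    passed: bool
    kernel_ms: float
    eager_ms: float
    compile_ms: float


def pass_rate(results: List[TaskResult]) -> float:
    """Fraction of tasks whose generated kernel passes correctness checks."""
    return sum(r.passed for r in results) / len(results)


def faster_rate(results: List[TaskResult], baseline: str) -> float:
    """Fraction of all tasks that pass and run faster than the baseline.

    baseline is the field name "eager_ms" or "compile_ms".
    """
    wins = sum(
        1 for r in results
        if r.passed and getattr(r, baseline) / r.kernel_ms > 1.0
    )
    return wins / len(results)


def geomean_speedup(results: List[TaskResult], baseline: str) -> float:
    """Geometric mean of baseline/kernel runtime ratios over passing tasks."""
    ratios = [getattr(r, baseline) / r.kernel_ms for r in results if r.passed]
    return math.exp(sum(math.log(x) for x in ratios) / len(ratios))
```

The geometric mean is the natural average for ratios: speed-ups of 2x, 8x, and 4x average to 4x, and a 2x speed-up paired with a 2x slowdown averages to 1x rather than the 1.25x an arithmetic mean would report.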

Overall performance and speedup metrics on KernelBench.

Citation

If you use CUDA Agent in your research, please cite:

@article{cudaagent2026,
  title   = {CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation},
  author  = {Dai, Weinan and Wu, Hanlin and Yu, Qiying and Gao, Huan-ang and Li, Jiahao and Jiang, Chengquan and Lou, Weiqiang and Song, Yufan and Yu, Hongli and Chen, Jiaze and Ma, Wei-Ying and Zhang, Ya-Qin and Liu, Jingjing and Wang, Mingxuan and Liu, Xin and Zhou, Hao},
  journal = {arXiv preprint},
  year    = {2026}
}