
CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

High-Quality Training Tasks via a Scalable Data Pipeline

CUDA Agent is a large-scale agentic reinforcement learning system that develops robust CUDA kernel optimization ability through scalable data synthesis, a skill-augmented execution environment, and stable long-horizon RL training.

Weinan Dai1,2,3*, Hanlin Wu1,2,3*, Qiying Yu1,2,3, Huan-ang Gao1,2,3, Jiahao Li1, Chengquan Jiang1, Weiqiang Lou1, Yufan Song1, Hongli Yu1,2,3, Jiaze Chen1,3, Wei-Ying Ma2,3, Ya-Qin Zhang2,3, Jingjing Liu2,3, Mingxuan Wang1,3, Xin Liu1, Hao Zhou2,3
1ByteDance Seed
2Institute for AI Industry Research (AIR), Tsinghua University
3SIA-Lab of Tsinghua AIR and ByteDance Seed
* Equal contributions, † Corresponding Authors
98.8% Overall Pass Rate
96.8% Overall Faster than torch.compile
2.11x Overall Speedup vs torch.compile
6K Synthesized Training Ops

Latest News

2026.02.27 The GitHub repository now includes the agent workdir for the CUDA Agent workflow.
2026.02.27 The training dataset has been released on Hugging Face as CUDA-Agent-Ops-6K.

Abstract

GPU kernel optimization is fundamental to modern deep learning but remains a specialized task requiring deep hardware expertise. Existing CUDA code generation approaches either rely on training-free refinement or fixed execution-feedback loops, which limits intrinsic optimization ability.

We present CUDA Agent, a large-scale agentic reinforcement learning system with three core components: scalable data synthesis, a skill-augmented CUDA development environment with reliable verification and profiling, and RL algorithmic techniques for stable long-context training.

CUDA Agent achieves state-of-the-art results on KernelBench, with faster-than-torch.compile rates of 100%, 100%, and 90% on the Level-1, Level-2, and Level-3 splits, respectively.

KernelBench benchmark chart for CUDA Agent

KernelBench comparison against torch.compile and strong proprietary models.

Key Contributions

Large-Scale Agentic RL System for CUDA Optimization

We introduce CUDA Agent, a large-scale agentic reinforcement learning system that improves intrinsic CUDA kernel generation and optimization ability through scalable synthesis, a skill-augmented environment, and stable long-horizon training.

State-of-the-Art KernelBench Performance

CUDA Agent achieves state-of-the-art results on KernelBench, delivering strong faster-than-compile rates across all levels and outperforming strong proprietary models on the hardest Level-3 setting.

Data Release: CUDA-Agent-Ops-6K

We release CUDA-Agent-Ops-6K, a high-quality synthesized training dataset with filtering and contamination control, supporting reproducible research on RL-based CUDA kernel optimization.

System Pipeline

Data Synthesis

We build training tasks with a three-stage pipeline: seed problem crawling, LLM-based combinatorial synthesis, and execution-driven filtering. Seed operators are mined from torch and transformers, each represented as a Python class with initialization and forward methods.
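Concretely, each task exposes an initialization method, a forward method over the composed operators, and an input generator. A minimal framework-free sketch of this interface (the class name, the scale parameter, and the toy scale-then-ReLU composition are illustrative; real seed operators wrap torch modules and operate on tensors):

```python
class FusedScaleReluTask:
    """Illustrative task skeleton; real seed operators wrap torch modules."""

    def __init__(self, scale=2.0):
        # Initialization holds operator parameters (weights, shapes, ...).
        self.scale = scale

    def forward(self, xs):
        # Forward composes the sampled operators; here: scale, then ReLU.
        return [max(0.0, self.scale * x) for x in xs]

    @staticmethod
    def get_inputs():
        # Each task also defines representative inputs for verification.
        return [-1.0, 0.5, 3.0]
```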

  • Combinatorial synthesis samples up to 5 torch operators and composes them sequentially into fused tasks.
  • Filtering keeps only tasks that run in both eager and compile modes and removes stochastic operators.
  • Anti-hacking checks remove constant or indistinguishable outputs across different inputs.
  • Workload control keeps eager runtime in the 1ms-100ms range and removes high-similarity KernelBench cases.

The final curated dataset contains 6,000 training samples (CUDA-Agent-Ops-6K), designed for scalable RL training with broad task diversity and reduced contamination risk.
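The filtering stages above can be sketched as a single keep/drop predicate. The runtime window mirrors the workload-control range from the text, while the constant-output check is a simplified stand-in for the anti-hacking filters:

```python
def keep_task(eager_ok, compile_ok, eager_ms, outputs_per_input):
    """Return True iff a synthesized task survives execution-driven filtering.

    eager_ok / compile_ok: the task runs without error in both modes.
    eager_ms: measured eager-mode runtime in milliseconds.
    outputs_per_input: outputs for several distinct inputs (anti-hacking:
    a task whose output is constant across inputs is trivially gameable).
    """
    if not (eager_ok and compile_ok):
        return False
    if not (1.0 <= eager_ms <= 100.0):      # workload-control window
        return False
    distinct = {repr(o) for o in outputs_per_input}
    return len(distinct) > 1                # outputs must distinguish inputs
```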

CUDA Agent data synthesis pipeline

Agent Environment

The agent loop follows a ReAct-style workflow with coding tools and a CUDA skill specification (SKILL.md), enabling iterative coding, compile-debug cycles, and profiler-guided optimization.

  • Standard workflow: profile native PyTorch, implement CUDA kernels/bindings, compile in GPU sandbox, iterate.
  • Target requirement: pass correctness checks and exceed a 5% speedup over torch.compile.
  • Robust reward schedule uses milestone-based discrete rewards for correctness and speed gains.
  • Anti-reward-hacking controls: protected verify/profile scripts, forbidden fallback calls, 5-input correctness checks, synchronized warm-up profiling, no web retrieval.

These constraints provide reliable execution-based feedback so policy learning emphasizes true kernel quality rather than shortcut behaviors.
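A milestone-style reward of the kind described above might look like the following; the specific milestones and their values are illustrative assumptions, not the system's exact schedule:

```python
def milestone_reward(compiles, correct, speedup_vs_compile):
    """Discrete milestone reward (illustrative milestone values).

    compiles / correct: binary outcomes from the protected verify script.
    speedup_vs_compile: measured speed-up ratio against torch.compile.
    """
    reward = 0.0
    if compiles:
        reward += 0.1                      # milestone: code builds
    if compiles and correct:
        reward += 0.4                      # milestone: passes 5-input check
        if speedup_vs_compile >= 1.05:     # target: >5% over torch.compile
            reward += 0.5                  # milestone: meets the speed target
    return reward
```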

CUDA Agent environment loop

Training Pipeline

Training is staged to stabilize long-horizon RL for CUDA coding. We first run single-turn PPO warm-up, then initialize both actor and critic before full multi-turn agentic RL.

  • Single-turn warm-up improves base CUDA generation before entering interactive agent training.
  • Actor initialization uses Rejection Fine-Tuning (RFT) on sampled trajectories with positive outcomes.
  • RFT filtering removes inefficient loops and invalid tool-call patterns to reduce policy collapse risk.
  • Critic initialization uses value pretraining so advantage estimates are reliable from early steps.

With this multi-stage design, training remains stable for long-context settings (up to 128k context, 150 training turns, and up to 200 turns during evaluation), enabling sustained reward growth.
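The RFT filtering step can be sketched as a trajectory filter. The trajectory fields and the repeated-action heuristic for "inefficient loops" are assumptions for illustration, not the exact filtering rules:

```python
def select_rft_trajectories(trajectories, max_repeat=3):
    """Keep trajectories suitable for Rejection Fine-Tuning.

    Each trajectory is a dict with:
      reward      - final outcome (keep only positive outcomes)
      tool_calls  - list of (tool_name, args) the agent issued
      valid_calls - whether every tool call parsed and executed
    """
    kept = []
    for traj in trajectories:
        if traj["reward"] <= 0 or not traj["valid_calls"]:
            continue
        # Heuristic for inefficient loops: reject long runs of identical calls.
        calls = traj["tool_calls"]
        longest, run = 1, 1
        for prev, cur in zip(calls, calls[1:]):
            run = run + 1 if cur == prev else 1
            longest = max(longest, run)
        if longest <= max_repeat:
            kept.append(traj)
    return kept
```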

CUDA Agent training stages

Main Results

We report full metrics for both Overall and Level-3 splits on KernelBench: Pass Rate, Faster Rate (vs. Eager / vs. Compile), and Geomean Speed-up (vs. Eager / vs. Compile).
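Geomean speed-up aggregates per-task ratios multiplicatively, so a single outlier cannot dominate the average the way it would in an arithmetic mean. For reference, it can be computed as:

```python
import math

def geomean_speedup(ratios):
    """Geometric mean of per-task speed-up ratios (baseline_time / kernel_time)."""
    assert ratios and all(r > 0 for r in ratios)
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```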

Overall

Pass Rate 98.8%
Faster Rate vs. Eager 98.4%
Faster Rate vs. Compile 96.8%
Speed-up vs. Eager 2.60x
Speed-up vs. Compile 2.11x

Level-3

Pass Rate 94%
Faster Rate vs. Eager 94%
Faster Rate vs. Compile 90%
Speed-up vs. Eager 1.80x
Speed-up vs. Compile 1.52x

Compared with strong proprietary baselines, CUDA Agent shows a clear margin in compile-relative performance: on overall KernelBench it reaches a 96.8% faster rate vs. compile and a 2.11x geomean speed-up, while Claude Opus 4.5 and Gemini 3 Pro sit around 66.4%-69.6% faster rate and 1.42x-1.46x speed-up. The advantage is most pronounced on difficult settings: on Level-3, CUDA Agent achieves a 90% faster rate vs. compile (about +40 points over the strongest proprietary baseline), and on Level-2 operator-sequence tasks it reaches a 100% faster rate with a 2.80x geomean speed-up vs. compile.
Main experimental results on KernelBench

Overall performance and speedup metrics on KernelBench.

Citation

If you use CUDA Agent in your research, please cite:

@article{cudaagent2026,
  title   = {CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation},
  author  = {Dai, Weinan and Wu, Hanlin and Yu, Qiying and Gao, Huan-ang and Li, Jiahao and Jiang, Chengquan and Lou, Weiqiang and Song, Yufan and Yu, Hongli and Chen, Jiaze and Ma, Wei-Ying and Zhang, Ya-Qin and Liu, Jingjing and Wang, Mingxuan and Liu, Xin and Zhou, Hao},
  journal = {arXiv preprint},
  year    = {2026}
}