High-Throughput Distributed RL Training System
A high-throughput distributed reinforcement learning (RL) system designed to scale experience generation and policy optimization across multi-node CPU/GPU infrastructure.
20× training speedup
100+ concurrent actors
Multi-node CPU/GPU scaling
Stable asynchronous training
Problem
Single-node RL training cannot keep up with production workloads. Environment interaction is CPU-bound, policy optimization is GPU-bound, and experience throughput must keep both busy without letting either one stall the other. Naïve parallelization adds synchronization overhead that can cancel out the gains from the extra compute.
The core challenge: decouple experience generation from policy optimization while preserving training stability at scale.
Contribution
Designed and implemented a distributed RL system that fully decouples actors, replay memory, and learning into independently scalable components operating across multi-node HPC infrastructure. The system sustains continuous high-throughput training by combining asynchronous execution, sharded prioritized replay, and centralized GPU-based optimization.
System Architecture
Distributed actors interact with parallel simulation environments and stream transitions into sharded replay memory. A centralized GPU learner performs gradient updates and asynchronously broadcasts updated policy weights to all actors.
Figure: Distributed RL Training Architecture. Actors generate experience from simulation groups, replay memory stores and samples training data, and the learner updates policies that are broadcast back to actors.
System Flow
Simulators generate transitions
Actors stream experience into replay shards
Replay memory samples prioritized minibatches
GPU learner updates the policy
Updated weights are broadcast asynchronously
Actors continue generating experience without blocking
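The six steps above can be sketched as a single-process toy, with threads standing in for the separate actor and learner processes. This is only an illustration under stated assumptions: `Replay`, `PolicyStore`, `actor`, and `run_demo` are hypothetical names, sampling is uniform rather than prioritized, and the "policy update" is a placeholder arithmetic step, not the system's actual learner.

```python
import random
import threading
import time
from collections import deque


class Replay:
    """Bounded buffer written by actors and read by the learner (toy stand-in)."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)  # oldest transitions are evicted
        self._lock = threading.Lock()

    def add(self, transition):
        with self._lock:
            self._items.append(transition)

    def sample(self, batch_size):
        with self._lock:
            if len(self._items) < batch_size:
                return []  # still warming up
            return random.sample(list(self._items), batch_size)

    def __len__(self):
        with self._lock:
            return len(self._items)


class PolicyStore:
    """Latest policy weights plus a version counter (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._weights, self._version = {"bias": 0.0}, 0

    def publish(self, weights):
        with self._lock:
            self._weights, self._version = dict(weights), self._version + 1

    def latest(self):
        with self._lock:
            return dict(self._weights), self._version


def actor(actor_id, replay, store, stop):
    """Steps 1, 2 and 6: simulate, stream transitions, never wait on the learner."""
    rng = random.Random(actor_id)
    while not stop.is_set():
        weights, version = store.latest()        # quick read of whatever is newest
        reward = rng.random() + weights["bias"]  # placeholder for an environment step
        replay.add((actor_id, version, reward))
        time.sleep(0.0005)                       # placeholder for simulation time


def run_demo(num_actors=4, num_updates=20, batch_size=32):
    replay, store, stop = Replay(10_000), PolicyStore(), threading.Event()
    threads = [threading.Thread(target=actor, args=(i, replay, store, stop))
               for i in range(num_actors)]
    for t in threads:
        t.start()
    updates = 0
    while updates < num_updates:
        batch = replay.sample(batch_size)        # step 3: sample a minibatch
        if not batch:
            time.sleep(0.001)
            continue
        mean_reward = sum(r for _, _, r in batch) / len(batch)
        store.publish({"bias": 0.01 * mean_reward})  # steps 4 and 5: toy update, publish
        updates += 1
    stop.set()
    for t in threads:
        t.join()
    return store.latest()[1], len(replay)
```

Calling `run_demo()` returns the final policy version and the number of stored transitions. The actors keep producing experience for the whole run because they only ever read the newest published weights and never wait for a specific update.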
Design Decisions
Fully Decoupled Components
Actors, replay, and learner run as independent processes with no shared state, which eliminates synchronization bottlenecks between them.
Sharded Replay Memory
Experience is partitioned across shards with per-shard priority queues, sustaining high write throughput and concurrent read access.
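A minimal sketch of sharded prioritized replay, assuming proportional prioritization (sampling probability p_i / sum of all p_j). For brevity it uses a linear-time weighted draw from the standard library in place of the per-shard priority queues described above; a production version would more likely use a heap or sum tree per shard. `Shard`, `ShardedReplay`, and `shard_for` are hypothetical names, and routing writes by actor ID is an assumption.

```python
import random
import zlib


class Shard:
    """One replay shard: a ring buffer of transitions with parallel priorities."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.priorities = []
        self.next_slot = 0

    def add(self, transition, priority):
        if len(self.items) < self.capacity:
            self.items.append(transition)
            self.priorities.append(priority)
        else:  # buffer full: overwrite the oldest slot
            self.items[self.next_slot] = transition
            self.priorities[self.next_slot] = priority
        self.next_slot = (self.next_slot + 1) % self.capacity

    def sample(self, k, rng):
        # Weighted draw with replacement, proportional to priority.
        return rng.choices(self.items, weights=self.priorities, k=k)


class ShardedReplay:
    def __init__(self, num_shards, shard_capacity, seed=0):
        self.shards = [Shard(shard_capacity) for _ in range(num_shards)]
        self.rng = random.Random(seed)

    def shard_for(self, actor_id):
        # Stable hash, so a given actor always writes to the same shard.
        return zlib.crc32(str(actor_id).encode()) % len(self.shards)

    def add(self, actor_id, transition, priority):
        # Clamp so every stored transition keeps a non-zero chance of being drawn.
        self.shards[self.shard_for(actor_id)].add(transition, max(priority, 1e-6))

    def sample(self, batch_size):
        live = [s for s in self.shards if s.items]
        if not live:
            return []
        # Pick a shard by its total priority, then an item inside it by priority.
        # The two stages multiply out to p_i / sum(p) over the whole buffer.
        totals = [sum(s.priorities) for s in live]
        picks = self.rng.choices(range(len(live)), weights=totals, k=batch_size)
        return [live[i].sample(1, self.rng)[0] for i in picks]
```

Because each actor maps to one shard, writes from different actors mostly land on different shards, which is what lets write throughput grow with the shard count. In a multi-process deployment each shard would sit behind its own process or server rather than in one Python object.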
Asynchronous Weight Broadcast
Policy weights are pushed to actors without blocking the training loop. Each actor keeps acting on its current copy until a newer one arrives, so actors tolerate slightly stale policies.
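One way to get this behavior is a single-slot, latest-wins inbox per actor: the learner overwrites any unread update, and the actor polls without waiting. The sketch below uses in-process queues purely for illustration; `WeightMailbox`, `Actor`, and `broadcast` are hypothetical names, and it assumes exactly one learner is pushing.

```python
import queue


class WeightMailbox:
    """Single-slot, latest-wins inbox for one actor (hypothetical helper)."""

    def __init__(self):
        self._slot = queue.Queue(maxsize=1)

    def push(self, version, weights):
        # Learner side: never waits. Discard an unread older update, then store
        # the new one. Safe only with a single producer (one learner).
        try:
            self._slot.get_nowait()
        except queue.Empty:
            pass
        self._slot.put_nowait((version, weights))

    def poll(self):
        # Actor side: return the newest (version, weights), or None if nothing new.
        try:
            return self._slot.get_nowait()
        except queue.Empty:
            return None


class Actor:
    def __init__(self, weights):
        self.version, self.weights = 0, weights
        self.inbox = WeightMailbox()

    def step(self):
        update = self.inbox.poll()
        if update is not None:
            self.version, self.weights = update
        # An environment step would run here using self.weights.
        return self.version  # policy version used for this step


def broadcast(actors, version, weights):
    """Push one update to every actor without waiting for any of them."""
    for a in actors:
        a.inbox.push(version, weights)
```

If the learner publishes versions 1 and 2 before an actor polls, the actor skips straight to version 2. Recording the version alongside each transition, as a real system might, makes it possible to measure how stale the acting policy was.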
Centralized GPU Learner
A single learner processes large minibatches on the GPU, amortizing the cost of each gradient step over experience drawn from all actors.
Horizontally Scaled Actors
100+ CPU workers generate experience in parallel. Experience generation scales linearly as actors are added across available compute.
Results
20× training speedup over single-node baseline
100+ actors sustaining continuous experience generation
Linear scaling across multi-node CPU/GPU infrastructure
Stable convergence under fully asynchronous operation
Impact
This system reduced experiment cycle time from weeks to hours, making large-scale RL experimentation operationally feasible. By removing infrastructure as the bottleneck, it enabled rapid iteration on policy architectures and training configurations that were previously impractical to evaluate at scale.
My Role
- Designed the distributed actor–replay–learner architecture.
- Implemented high-throughput replay and actor coordination.
- Integrated PyTorch RPC, Redis, ZeroMQ, and HPC scheduling.
- Optimized the system for experiment throughput and stable asynchronous training.
