High-Throughput Distributed RL Training System
A high-throughput distributed reinforcement learning system designed to scale experience generation and policy optimization across multi-node CPU/GPU infrastructure.
Problem
Single-node RL training collapses under production workloads. Environment interaction is CPU-bound, policy optimization is GPU-bound, and experience throughput must sustain both without bottlenecking either. Naïve parallelization introduces synchronization overhead that negates the compute gains — actors stall waiting for weight updates, learners idle waiting for data, and replay buffers become contention points.
The core challenge: decouple experience generation from policy optimization while preserving training stability at scale.
Contribution
Designed and implemented a distributed RL system that fully decouples actors, replay memory, and learning into independently scalable components operating across multi-node HPC infrastructure.
The system sustains continuous high-throughput training by combining asynchronous execution, sharded prioritized replay, and centralized GPU-based optimization — eliminating the synchronization barriers that limit conventional distributed RL implementations.
System Architecture
Figure: Distributed RL Training Architecture. Actors generate experience from simulation groups, replay memory stores and samples training data, and the learner updates policies that are broadcast back to actors.
Distributed actors interact with parallel simulation environments and stream transitions into sharded replay memory. The replay system partitions storage across shards, aggregates experience, and serves prioritized minibatches to the learner.
A centralized GPU learner performs gradient updates on sampled batches and asynchronously broadcasts updated policy weights to all actors. Actors never block on learner updates; the learner never blocks on actor throughput. This decoupling allows both pipelines to saturate their respective compute independently.
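A minimal single-process sketch of this decoupling is below. Everything in it is illustrative rather than the production code: the actor and learner run here as threads instead of separate processes, a plain queue stands in for the sharded replay path, a dict stands in for policy weights, and the gradient step is elided.

```python
# Sketch of the decoupled actor/learner pattern (hypothetical names, not the real system).
import queue
import random
import threading
import time

transitions = queue.Queue(maxsize=10_000)   # actor -> learner data path
latest_weights = {"version": 0}             # learner -> actor "broadcast" (actors pull)
weights_lock = threading.Lock()
stop = threading.Event()

def actor_loop():
    """Generate experience with whatever policy version is currently published."""
    while not stop.is_set():
        with weights_lock:
            policy_version = latest_weights["version"]   # may be slightly stale
        transition = {"obs": random.random(), "policy_version": policy_version}
        transitions.put(transition)   # back-pressure only if the buffer is full

def learner_loop():
    """Consume minibatches and publish new weights without blocking actors."""
    step = 0
    while not stop.is_set():
        batch = [transitions.get() for _ in range(32)]   # many actors keep this queue full
        step += 1                                        # gradient update on batch elided
        with weights_lock:
            latest_weights["version"] = step             # publish the new policy version

for target in (actor_loop, learner_loop):
    threading.Thread(target=target, daemon=True).start()
time.sleep(1.0)   # let both loops run concurrently, then exit
stop.set()
```

The point of the sketch is the shape of the interaction: the actor only reads the most recent published weights, and the learner only writes them, so neither loop ever waits on the other's progress.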
Design Decisions
- Fully Decoupled Components — Actors, replay, and learner operate as independent processes with no shared state — eliminating synchronization bottlenecks and enabling per-component scaling.
- Sharded Replay Memory — Experience is partitioned across multiple shards with per-shard priority queues, sustaining high write throughput from actors and concurrent read access for minibatch sampling (see the sketch after this list).
- Asynchronous Weight Broadcast — Policy weights are pushed to actors without blocking the training loop. Actors tolerate slightly stale policies — a well-understood trade-off in off-policy methods.
- Centralized GPU Learner — A single learner processes large minibatches on GPU, amortizing gradient computation and maintaining stable optimization dynamics across the full experience distribution.
- Horizontally Scaled Actors — 100+ CPU workers generate experience in parallel, each running independent simulation instances. Actor count scales linearly with available compute.
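As referenced above, here is a rough single-process sketch of the sharded-replay idea. The class names, round-robin write placement, and proportional sampling are illustrative assumptions, not the deployed implementation, which shards across processes and, as is standard for prioritized replay, would typically derive priorities from TD error.

```python
# Illustrative sharded prioritized replay (not production code).
import random

class ReplayShard:
    """One shard: a bounded buffer with a priority per stored transition."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items, self.priorities = [], []

    def add(self, transition, priority):
        if len(self.items) >= self.capacity:   # drop the oldest entry when full
            self.items.pop(0)
            self.priorities.pop(0)
        self.items.append(transition)
        self.priorities.append(priority)

    def sample(self, k):
        k = min(k, len(self.items))
        # sample with replacement, proportional to priority
        return random.choices(self.items, weights=self.priorities, k=k)

class ShardedReplay:
    """Spread writes round-robin across shards; fan reads out over all shards."""
    def __init__(self, num_shards, capacity_per_shard):
        self.shards = [ReplayShard(capacity_per_shard) for _ in range(num_shards)]
        self._next = 0

    def add(self, transition, priority):
        self.shards[self._next].add(transition, priority)
        self._next = (self._next + 1) % len(self.shards)

    def sample(self, batch_size):
        per_shard = max(1, batch_size // len(self.shards))
        batch = []
        for shard in self.shards:
            if shard.items:
                batch.extend(shard.sample(per_shard))
        return batch[:batch_size]

# Usage: 4 shards; placeholder priorities stand in for |TD error|.
replay = ShardedReplay(num_shards=4, capacity_per_shard=1000)
for i in range(100):
    replay.add({"step": i}, priority=random.random() + 1e-3)
minibatch = replay.sample(32)
```

Spreading writes round-robin keeps any single shard from becoming the contention point described in the Problem section, and fanning reads out across shards lets sampling throughput grow with shard count.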
Results
- 20× training speedup over single-node baseline
- 100+ concurrent actors sustaining continuous experience generation
- Linear scaling across multi-node CPU/GPU infrastructure
- Stable convergence under fully asynchronous operation
Impact
This system reduced experiment cycle time from weeks to hours, making large-scale RL experimentation operationally feasible. By removing infrastructure as the bottleneck, it enabled rapid iteration on policy architectures and training configurations that were previously impractical to evaluate at scale.
