Burak Demirel
← Back to Projects

High-Throughput Distributed RL Training System

High-throughput distributed reinforcement learning system designed to scale experience generation and policy optimization across multi-node CPU/GPU infrastructure.

PyTorch RPCRedisZeroMQSlurm / LSFHPC

20×

Training speedup

100+

Concurrent actors

Multi-node

CPU/GPU scaling

Async

Stable training

Problem

Single-node RL training collapses under production workloads. Environment interaction is CPU-bound, policy optimization is GPU-bound, and experience throughput must sustain both without bottlenecking either. Naïve parallelization introduces synchronization overhead that negates the compute gains.

The core challenge: decouple experience generation from policy optimization while preserving training stability at scale.

Contribution

Designed and implemented a distributed RL system that fully decouples actors, replay memory, and learning into independently scalable components operating across multi-node HPC infrastructure. The system sustains continuous high-throughput training by combining asynchronous execution, sharded prioritized replay, and centralized GPU-based optimization.

System Architecture

Distributed actors interact with parallel simulation environments and stream transitions into sharded replay memory. A centralized GPU learner performs gradient updates and asynchronously broadcasts updated policy weights to all actors.

Distributed RL Training Architecture

SIMULATION GROUPSSim 1Sim 2Sim NACTORS100+ CPU workersCPUCPUCPUCPUCPUCPUREPLAY MEMORYshardedS1S2S3prioritized samplingGPU LEARNERpolicy updategradient stepenv dataexperiencesminibatchesbroadcast model weightspriorities

Actors generate experience from simulation groups, replay memory stores and samples training data, and the learner updates policies that are broadcast back to actors.

System Flow

1

Simulators generate transitions

2

Actors stream experience into replay shards

3

Replay memory samples prioritized minibatches

4

GPU learner updates the policy

5

Updated weights are broadcast asynchronously

6

Actors continue generating experience without blocking

Design Decisions

Fully Decoupled Components

Actors, replay, and learner operate as independent processes with no shared state — eliminating synchronization bottlenecks.

Sharded Replay Memory

Experience is partitioned across shards with per-shard priority queues, sustaining high write throughput and concurrent read access.

Asynchronous Weight Broadcast

Policy weights are pushed to actors without blocking the training loop. Actors tolerate slightly stale policies.

Centralized GPU Learner

Single learner processes large minibatches on GPU, amortizing gradient computation across the full experience distribution.

Horizontally Scaled Actors

100+ CPU workers generate experience in parallel. Actor count scales linearly with available compute.

Results

20×

Training speedup over single-node baseline

100+

Actors sustaining continuous experience generation

Linear

Scaling across multi-node CPU/GPU infrastructure

Stable

Convergence under fully asynchronous operation

Impact

This system reduced experiment cycle time from weeks to hours, making large-scale RL experimentation operationally feasible. By removing infrastructure as the bottleneck, it enabled rapid iteration on policy architectures and training configurations that were previously impractical to evaluate at scale.

My Role

  • Designed the distributed actor–replay–learner architecture.
  • Implemented high-throughput replay and actor coordination.
  • Integrated PyTorch RPC, Redis, ZeroMQ, and HPC scheduling.
  • Optimized the system for experiment throughput and stable asynchronous training.

References & Coverage