Burak Demirel

When AI Has No Time to Think: Deploying Models Under Extreme Latency

The Problem: Microsecond Decision Windows

Most AI applications enjoy generous latency budgets. A recommendation engine can take hundreds of milliseconds. An LLM chatbot can take seconds. But in a 5G radio access network, the physics of the air interface dictates the clock.

A single TTI (Transmission Time Interval) in 5G NR ranges from 1ms down to 62.5μs depending on the numerology. Within that window, the RAN must make dozens of decisions — channel estimation, link adaptation, scheduling, power control — each with its own slice of the total budget.
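The 1ms-to-62.5μs range follows from the standard NR numerology relation: subcarrier spacing is 15·2^μ kHz and one slot lasts 1 ms / 2^μ. A minimal sketch, assuming one TTI equals one slot and covering only numerology indices 0 through 4 (the function names are illustrative, not taken from any library):

```python
def subcarrier_spacing_khz(mu: int) -> int:
    """Subcarrier spacing in kHz for NR numerology index mu."""
    return 15 * 2 ** mu


def slot_duration_us(mu: int) -> float:
    """Slot duration in microseconds for NR numerology index mu (0..4)."""
    if not 0 <= mu <= 4:
        raise ValueError("this sketch covers numerologies 0 through 4 only")
    # A 1 ms subframe holds 2**mu slots.
    return 1000.0 / (2 ** mu)


for mu in range(5):
    print(f"mu={mu}: SCS={subcarrier_spacing_khz(mu)} kHz, "
          f"slot={slot_duration_us(mu)} us")
```

Every step up in numerology halves the time available for all per-slot decisions, which is why the high-band column in the table below is so much tighter than the mid-band one.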

The numbers are brutal:

| Function | Mid-band | High-band |
|---|---|---|
| L1 symbol-related | ~30μs | 4–8μs |
| L1 slot-related | ~400μs | 50–100μs |
| L2 aggregate | 200–300μs | ~50μs |
| Link adaptation | 10–30μs | ~5μs |

Link adaptation — selecting the right modulation and coding scheme (MCS) for each user — gets 10–30 microseconds in mid-band and just 5 microseconds in high-band. That's not a latency "budget." That's a hard wall.

Why This Matters for AI

Rule-based link adaptation algorithms are fast but brittle. They rely on CQI reports that are often stale, quantized, or mismatched to actual channel conditions. Reinforcement learning can outperform these heuristics by learning directly from the environment — but RL policies are typically neural networks, and neural networks are not free to evaluate.

The core tension: model expressiveness versus execution feasibility. A large, expressive model generalizes better across diverse network conditions but cannot run within the latency budget. A tiny model fits the timing constraints but may lack the capacity to generalize.

This isn't a problem you can solve by buying faster hardware.

Hardware Won't Save You

RAN baseband platforms are built from CPUs, digital signal processors (DSPs), and application-specific circuits. Some newer platforms include GPU or NPU accelerators — but even with acceleration, the story is more nuanced than "throw a GPU at it."

GPU inference introduces non-negligible overheads: data movement between host and device, accelerator access contention, and kernel invocation latency. For small models (the only ones that fit the timing budget), CPU inference often achieves lower latency than GPU inference because it avoids these fixed costs.
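To see why fixed costs dominate for small models, consider a deliberately crude latency model: a fixed per-call overhead plus compute time proportional to the parameter count (one multiply-accumulate per parameter, as in a batch-size-1 MLP forward pass). Every number below is invented for illustration. None of them are measurements from any real CPU, GPU, or baseband platform.

```python
def latency_us(params: int, fixed_us: float, macs_per_us: float) -> float:
    """Crude model: fixed per-call overhead plus one MAC per parameter."""
    return fixed_us + params / macs_per_us


# Hypothetical platform characteristics, chosen only to show the shape
# of the trade-off (low overhead and low throughput vs. the opposite).
CPU = {"fixed_us": 1.0, "macs_per_us": 500.0}
GPU = {"fixed_us": 20.0, "macs_per_us": 50_000.0}


def crossover_params(cpu: dict, gpu: dict) -> float:
    """Model size at which both platforms take the same time."""
    # Solve fixed_c + p / r_c == fixed_g + p / r_g for p.
    return (gpu["fixed_us"] - cpu["fixed_us"]) / (
        1.0 / cpu["macs_per_us"] - 1.0 / gpu["macs_per_us"]
    )


for params in (3_500, 115_000, 1_000_000):
    print(params, latency_us(params, **CPU), latency_us(params, **GPU))
print("crossover:", crossover_params(CPU, GPU))
```

Under these made-up numbers the accelerator only pays off well above the crossover size, and its fixed overhead alone already consumes most of a 10–30μs budget. Real platforms need real measurements, but the structure of the argument is the same.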

[Figure: Worst-case inference latency versus model size across compute platforms]

The critical finding from our analysis: even models with just 0.5–1M parameters exceed the link adaptation latency budget when measured end-to-end on production hardware. The feasible region is far smaller than most ML practitioners assume — we're talking thousands of parameters, not millions.

The Distillation Approach

If you can't deploy the model you want, train the model you can deploy to behave like the model you want.

Knowledge distillation (or policy distillation in the RL setting) transfers learned behaviors from a large, expressive teacher model to a compact student model that meets real-time constraints. The teacher trains offline with no latency pressure. The student inherits the teacher's decision-making capability in a fraction of the parameters.
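One common way to express "behave like the teacher" is a KL divergence between the teacher's and the student's action distributions, softened by a temperature. The sketch below assumes both policies output one logit per action (for example, per MCS index) and that this loss form is acceptable for the task. The article does not state which loss the authors used, so treat it as a generic example rather than their method.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical logits for a single state with three candidate actions.
teacher_out = [2.0, 0.5, -1.0]
student_out = [1.0, 1.0, 0.0]
print(distillation_loss(teacher_out, student_out))
print(distillation_loss(teacher_out, teacher_out))  # identical policies
```

Minimizing this loss over states drawn from the training distribution pulls the student's action preferences toward the teacher's, without the student ever seeing the reward signal directly.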

For link adaptation, this looks like:

  1. Teacher: An eight-layer MLP (~115k parameters) trained via distributed RL with domain randomization across diverse network conditions
  2. Student: A three-layer MLP with 32 neurons per layer (~3.5k parameters) — small enough for deterministic sub-30μs inference on baseband hardware

The roughly 30x parameter reduction (about 115k down to about 3.5k) sounds aggressive, but the key insight is that most of a teacher's capacity encodes the training distribution, not the decision boundary. A well-distilled student retains the decision-making quality while shedding the representational overhead.
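The parameter counts are easy to sanity-check. The sketch below counts weights and biases of a fully connected network. The input width (14) and output width (28) are hypothetical values picked only so that three hidden layers of 32 neurons land near the quoted ~3.5k; the article does not give the real feature or action dimensions, and the teacher's layer widths are not given either, so only its quoted total is used.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected net with these layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical student: 14 inputs, three hidden layers of 32, 28 outputs.
student_params = mlp_param_count([14, 32, 32, 32, 28])

# Teacher size as quoted in the text (approximate).
TEACHER_PARAMS = 115_000

print("student parameters:", student_params)
print("reduction factor:", TEACHER_PARAMS / student_params)
```

With 32-wide layers, each hidden-to-hidden connection costs only 1,056 parameters, so the budget is dominated by how wide the input and output layers need to be.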

Two Distillation Strategies

Single-Teacher Distillation

Train one large teacher that generalizes across all scenarios. Compress it into one student. Simple, but the teacher must already solve the generalization problem — which is itself a hard research challenge.

Multi-Teacher Distillation

Train multiple specialized teachers, each expert in a specific deployment scenario (e.g., urban macro, rural, high-mobility). Consolidate their knowledge into a single generalist student. The student learns when to apply which expert's strategy without needing explicit scenario detection.
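A toy version of multi-teacher consolidation, under heavy assumptions: two hand-written "teachers" stand in for trained scenario experts, the state is just an SINR value and a user speed, and the student is a lookup table rather than a neural network. All names and thresholds are hypothetical. The point is only that the student never receives a scenario label and still picks up each expert's behavior, because the features themselves reveal which regime applies.

```python
import random
from collections import Counter, defaultdict


def urban_teacher(sinr_db):
    """Hypothetical expert: map SINR directly to an MCS index (0..27)."""
    return max(0, min(27, int(sinr_db)))


def high_mobility_teacher(sinr_db):
    """Hypothetical expert: back off three MCS steps for fast users."""
    return max(0, min(27, int(sinr_db) - 3))


# Each scenario pairs a teacher with the speed range (km/h) it was trained on.
SCENARIOS = {
    "urban_macro": (urban_teacher, (0.0, 30.0)),
    "high_mobility": (high_mobility_teacher, (100.0, 300.0)),
}


def features(sinr_db, speed_kmh):
    """Student input: quantized SINR and a coarse speed flag (no scenario id)."""
    return (int(sinr_db), speed_kmh >= 60.0)


def build_dataset(samples_per_scenario, rng):
    """Label states from each scenario with that scenario's teacher."""
    data = []
    for teacher, (v_lo, v_hi) in SCENARIOS.values():
        for _ in range(samples_per_scenario):
            sinr = rng.uniform(0.0, 27.0)
            speed = rng.uniform(v_lo, v_hi)
            data.append((features(sinr, speed), teacher(sinr)))
    return data


def fit_student(data):
    """Tabular student: majority teacher action per feature value."""
    votes = defaultdict(Counter)
    for x, action in data:
        votes[x][action] += 1
    return {x: counts.most_common(1)[0][0] for x, counts in votes.items()}


student_table = fit_student(build_dataset(2000, random.Random(0)))
print(student_table[features(10.5, 20.0)], student_table[features(10.5, 150.0)])
```

In a real system the pooled dataset would feed a distillation loss on a small network, and the scenarios would overlap far more than these two disjoint speed ranges do.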

Latency-Aware Online Distillation

The most interesting contribution is what we call latency-aware online distillation. Traditional distillation is a two-phase process: train teacher, then compress. But in a distributed RL system, you can integrate distillation into the training loop itself.

The architecture:

  • A centralized teacher trains in the cloud with full computational resources
  • Distributed actors (running on RAN hardware or simulators) generate experience using a periodically distilled student
  • The student is re-distilled at regular intervals as the teacher improves
  • Critically, all online policy evaluation respects the target latency budget — the student deployed to actors always fits within execution constraints

This ensures that the training data distribution reflects what the deployed model will actually encounter. Because the deployment-sized model generates the data throughout training, the usual mismatch between the states a large policy visits during training and the states a small policy visits after deployment is largely avoided.
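The loop described above can be sketched with a toy tabular stand-in. Everything here is an assumption made for illustration: the 16-state environment, the reward, the tabular teacher, the bucketed student, and the use of a parameter cap (`MAX_STUDENT_PARAMS`) as a proxy for "fits the latency budget". It is not the authors' system. It only shows the control flow: actors always act with the compact student, the teacher learns from that experience, and the student is re-distilled every round.

```python
import random

N_STATES = 16           # hypothetical quantized channel-quality states
N_ACTIONS = 4           # hypothetical MCS choices
STUDENT_BUCKETS = 4     # student keeps one action per coarse state bucket
MAX_STUDENT_PARAMS = 8  # hypothetical proxy for the latency budget


def reward(state, action):
    """Toy environment: the best action grows with channel quality."""
    best = state * N_ACTIONS // N_STATES
    return 1.0 if action == best else 0.0


class Teacher:
    """Large policy: a running-mean value estimate per (state, action)."""

    def __init__(self):
        self.q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
        self.n = [[0] * N_ACTIONS for _ in range(N_STATES)]

    def update(self, state, action, r):
        self.n[state][action] += 1
        self.q[state][action] += (r - self.q[state][action]) / self.n[state][action]

    def act(self, state):
        return max(range(N_ACTIONS), key=lambda a: self.q[state][a])


class Student:
    """Compact policy that must stay within the deployment budget."""

    def __init__(self):
        if STUDENT_BUCKETS > MAX_STUDENT_PARAMS:
            raise ValueError("student exceeds the deployment budget")
        self.table = [0] * STUDENT_BUCKETS

    def bucket(self, state):
        return state * STUDENT_BUCKETS // N_STATES

    def act(self, state):
        return self.table[self.bucket(state)]


def distill(teacher, student):
    """Set each bucket to the action the teacher prefers most often in it."""
    for b in range(STUDENT_BUCKETS):
        votes = [0] * N_ACTIONS
        for s in range(N_STATES):
            if student.bucket(s) == b:
                votes[teacher.act(s)] += 1
        student.table[b] = votes.index(max(votes))


def train(rounds=20, steps_per_round=500, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    teacher, student = Teacher(), Student()
    for _ in range(rounds):
        for _ in range(steps_per_round):
            s = rng.randrange(N_STATES)
            explore = rng.random() < epsilon
            # Actors use the student (plus exploration), never the teacher.
            a = rng.randrange(N_ACTIONS) if explore else student.act(s)
            teacher.update(s, a, reward(s, a))
        distill(teacher, student)  # periodic re-distillation
    return teacher, student


teacher, student = train()
print([student.act(s) for s in range(N_STATES)])
```

The design choice worth noticing is that `teacher.act` is never called to generate experience. The teacher only learns from data and serves as a distillation target, so the data distribution is always the one induced by a deployable policy.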

Results

We evaluated latency-aware distillation on a link adaptation task across three unseen deployment scenarios (different cell configurations, traffic patterns, and mobility profiles):

  • Distilled students (~3.5k params) closely preserved the teacher's throughput and spectral efficiency across all scenarios
  • Directly trained small models (same architecture, trained from scratch without distillation) suffered a 10–25% performance loss relative to the teacher
  • The gap widened in high-mobility and heterogeneous traffic conditions — precisely the scenarios where generalization matters most

The distillation overhead is negligible at deployment time: you pay the cost once during training (cloud-side), and the deployed student is just a three-layer MLP with fast, deterministic inference.

Design Principles

From this work, a few principles emerge for deploying AI in latency-critical RAN functions:

1. Dimension for worst-case, not average-case. Latency budgets in real-time systems are hard deadlines. A model that meets the average budget but occasionally exceeds it is unusable. Design for the worst-case execution time on the target platform.

2. Separate learning capacity from deployment capacity. The model that learns and the model that deploys don't need to be the same model. Use distillation to decouple these concerns.

3. Validate on hardware early. End-to-end inference latency on production baseband hardware differs significantly from GPU benchmarks. Profile on the target platform before committing to an architecture.

4. Integrate constraints into training. Latency-aware distillation keeps the training distribution aligned with deployment reality throughout the learning process, an alignment that post-hoc compression cannot provide.
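Principles 1 and 3 translate into a simple habit: time the inference call many times on the target platform and judge it by the worst sample, not the mean. The harness below is a minimal sketch using Python's `time.perf_counter_ns`. The workload, run counts, and the 30μs budget in the usage line are placeholders, and timings taken in a Python process on a general-purpose OS say nothing about baseband hardware. Note also that the observed maximum is only a lower bound on the true worst-case execution time.

```python
import time


def profile_latency_us(fn, runs=10_000, warmup=100):
    """Call fn repeatedly and summarize per-call latency in microseconds."""
    for _ in range(warmup):
        fn()  # let caches and lazy initialization settle before timing
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        fn()
        samples.append((time.perf_counter_ns() - start) / 1000.0)
    samples.sort()
    return {
        "mean": sum(samples) / len(samples),
        "p99": samples[int(0.99 * (len(samples) - 1))],
        "max": samples[-1],
    }


def meets_deadline(stats, budget_us):
    """Hard-deadline check: the worst observed sample must fit the budget."""
    return stats["max"] <= budget_us


# Placeholder workload standing in for a student forward pass.
stats = profile_latency_us(lambda: sum(range(100)), runs=2_000)
print(stats, "meets 30us budget:", meets_deadline(stats, 30.0))
```

A model can pass on mean and p99 and still fail `meets_deadline`; for a hard real-time function, that failure is the result that matters.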

Implications Beyond RANs

The constraints we face in RAN AI — microsecond budgets, deterministic execution, limited on-chip memory — are an extreme case of a more general trend. As AI moves from cloud to edge to embedded systems, the gap between "what we can train" and "what we can deploy" will only grow.

Policy distillation isn't just a RAN technique. It's a design pattern for any system where inference must be fast, deterministic, and resource-constrained — from autonomous vehicles to real-time trading to robotic control.

The key insight: feasible deployment depends not on hardware scaling alone, but on principled model dimensioning aligned with worst-case execution timelines and platform characteristics. The era of "just scale the model" doesn't apply when physics sets the clock.


This post is based on our paper "When AI Has No Time to Think: Inference Under Extreme Latency and Compute Constraints in RANs," published in Ericsson Technology Review, 2026.