Engineering · Performance

Building an Adaptive Media Engine

Hyper Team

A video call is only as good as its worst moment. One freeze during a critical point, one garbled sentence during a presentation, and the entire experience breaks down. That’s why we’ve invested heavily in Hyper’s media engine — the layer responsible for capturing, encoding, transmitting, and rendering audio and video in real time.

The challenge

Real-world networks are unpredictable. Bandwidth fluctuates, packets get lost, and latency spikes happen at the worst times. A media engine needs to adapt to all of this continuously, making decisions in milliseconds about how to allocate limited resources.

Most platforms handle this reactively — they detect congestion after it happens and degrade quality in response. By the time the adjustment kicks in, you’ve already experienced the glitch.

Hyper’s approach: predict, don’t react

Our media engine uses a three-layer adaptive strategy:

1. Proactive bandwidth estimation

Rather than waiting for packet loss to signal congestion, we use a combination of techniques to predict available bandwidth before it becomes a bottleneck:

  • Send-side delay-based estimation tracks inter-packet arrival times to detect early signs of congestion
  • Receiver-side jitter analysis provides a second signal that’s compared with sender estimates
  • Historical network profiling learns patterns specific to each user’s connection over time

These signals feed into a bandwidth allocation model that adjusts encoding parameters proactively — before quality visibly degrades.
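As a minimal sketch of the delay-based idea above: if the gap between packet arrivals grows relative to the gap between their send times, queues are building somewhere along the path, and that trend shows up before any packet is dropped. The class below is an illustrative simplification (all names and thresholds are hypothetical, not Hyper's actual implementation):

```python
class DelayTrendEstimator:
    """Tracks inter-packet delay variation to flag congestion early."""

    def __init__(self, threshold_ms: float = 2.0, smoothing: float = 0.9):
        self.threshold_ms = threshold_ms  # hypothetical trigger level
        self.smoothing = smoothing        # exponential smoothing factor
        self.avg_delta = 0.0
        self.prev_send_ms = None
        self.prev_recv_ms = None

    def on_packet(self, send_ms: float, recv_ms: float) -> str:
        """Return 'overuse', 'underuse', or 'normal' after each packet."""
        if self.prev_send_ms is not None:
            # Delay gradient: how much more the receive gap grew
            # than the send gap since the previous packet.
            delta = (recv_ms - self.prev_recv_ms) - (send_ms - self.prev_send_ms)
            self.avg_delta = (self.smoothing * self.avg_delta
                              + (1 - self.smoothing) * delta)
        self.prev_send_ms, self.prev_recv_ms = send_ms, recv_ms

        if self.avg_delta > self.threshold_ms:
            return "overuse"    # queues building: back off before loss appears
        if self.avg_delta < -self.threshold_ms:
            return "underuse"   # queues draining: room to probe upward
        return "normal"
```

A signal like `overuse` would then feed the allocation model, which lowers encoder bitrate targets before the queue overflows into visible loss.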

2. Scalable Video Coding (SVC)

Hyper encodes video using SVC with temporal and spatial layers. This means a single encoded stream contains multiple quality levels that can be selectively forwarded by the server:

  • When bandwidth is plentiful, receivers get the full-quality stream
  • When bandwidth drops, the server strips higher layers without re-encoding
  • When a participant is displayed in a small tile, only lower layers are sent

This approach is far more efficient than simulcast (encoding multiple independent streams) and allows instant quality adaptation without waiting for encoder reconfiguration.
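The server-side forwarding decision reduces to a small selection problem: given the layers present in the stream, pick the highest one that fits both the receiver's estimated bandwidth and its display size. A hypothetical sketch (the layer table and bitrates are illustrative, not Hyper's real ladder):

```python
# Illustrative SVC layer ladder, ordered from cheapest to most expensive:
# (spatial_layer, temporal_layer, approx_bitrate_kbps)
LAYERS = [
    (0, 0, 100),   # e.g. 180p at low frame rate
    (0, 1, 150),   # 180p, full frame rate
    (1, 0, 350),   # 360p
    (1, 1, 500),
    (2, 0, 900),   # 720p
    (2, 1, 1500),
]

def select_layers(available_kbps: float, max_spatial: int) -> tuple:
    """Pick the best layer pair that fits bandwidth and tile size.

    max_spatial caps resolution for small tiles; falls back to the
    lowest layer if nothing fits the budget.
    """
    best = LAYERS[0]
    for spatial, temporal, kbps in LAYERS:  # ascending by bitrate
        if spatial <= max_spatial and kbps <= available_kbps:
            best = (spatial, temporal, kbps)
    return best
```

Because the layers already exist inside the single encoded stream, switching is just a change in which packets the server forwards — no encoder round-trip, which is what makes the adaptation effectively instant.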

3. Intelligent prioritization

Not all media is created equal. During a screen share, the presenter’s video is less important than the shared content. When someone is speaking, their audio takes absolute priority. Our media engine understands these contexts and allocates bandwidth accordingly:

  • Audio always wins — we reserve bandwidth for audio before allocating to video
  • Active speaker boost — the current speaker gets higher video quality
  • Content-aware encoding — screen shares use different encoding parameters than camera feeds, optimizing for text sharpness over motion smoothness
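The allocation order described above can be sketched as a simple budget split: reserve audio off the top, then give the active speaker a larger share of what remains. This is a toy model under assumed numbers (the 40 kbps audio reserve and 50% speaker boost are hypothetical), not the engine's actual policy:

```python
AUDIO_RESERVE_KBPS = 40  # assumed per-participant audio reservation

def allocate_video(total_kbps: float, participants: list,
                   active_speaker: str) -> dict:
    """Split bandwidth among video streams after reserving audio.

    Audio is taken off the top for every participant; the active
    speaker then gets half the remaining video budget, with the rest
    divided evenly among everyone else.
    """
    budget = max(total_kbps - AUDIO_RESERVE_KBPS * len(participants), 0)
    if not participants:
        return {}

    alloc = {}
    if active_speaker in participants and len(participants) > 1:
        boost = budget * 0.5
        share = (budget - boost) / (len(participants) - 1)
        for p in participants:
            alloc[p] = boost if p == active_speaker else share
    else:
        share = budget / len(participants)
        for p in participants:
            alloc[p] = share
    return alloc
```

The key property is the ordering, not the exact ratios: audio never competes with video for its reservation, so speech stays intact even when the video budget collapses to near zero.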

Results

In our testing across a range of network conditions — including 4G mobile connections, congested home Wi-Fi, and intercontinental calls — Hyper maintains usable call quality at bandwidths where other platforms show freezing or heavy pixelation.

Some numbers from our internal benchmarks:

  • 50% less CPU usage for encoding compared to VP8/simulcast approaches
  • 200 ms faster adaptation to bandwidth changes versus reactive-only systems
  • Clear audio maintained down to 15 kbps total available bandwidth

What’s next

We’re continuing to improve the media engine with work on:

  • AI-enhanced audio — noise suppression and echo cancellation using lightweight on-device models
  • Super-resolution upscaling — using neural networks to enhance low-bandwidth video on the receiver side
  • Network-aware scheduling — pre-buffering and prefetching for participants with known intermittent connectivity

We’ll share more technical deep dives as we continue building. If you’re interested in the intersection of real-time media and systems engineering, we’re hiring.