Acoustic Echo Cancellation: How WebRTC AEC3 Works

Every voice and video call you make through a browser uses acoustic echo cancellation to keep your conversation clean. Without it, your own voice would bounce back at you from the other person's speaker, creating an unbearable feedback loop. WebRTC AEC3 is the echo canceller built into Chrome, Edge, and every WebRTC-based application, and it handles this problem in real time with remarkably low latency. Despite being one of the most widely deployed audio processing algorithms in the world, WebRTC AEC3 has almost no accessible documentation explaining how it actually works.

This guide covers how acoustic echo cancellation (AEC) works and examines how WebRTC's AEC3 implementation handles each stage of the pipeline. Whether you're integrating echo cancellation into a voice SDK, debugging echo issues in a real-time audio application, evaluating AEC solutions for a mobile platform, or simply want to understand how this technology works under the hood, this is the reference that WebRTC's own documentation doesn't provide.

What Is Acoustic Echo Cancellation?

Acoustic echo cancellation is the process of removing the sound that loops back from a loudspeaker into a microphone during a two-way audio conversation. When a remote speaker's voice plays through your device's speaker, it travels through the room and arrives at your microphone as an echo. Without AEC, this echo gets transmitted back to the remote speaker, who then hears a delayed, distorted copy of their own voice.

The core idea behind echo cancellation technology is deceptively simple: if you know what signal was played through the speaker, and you can model how the room transforms that signal before it reaches the microphone, you can predict the echo and subtract it from the captured audio. What remains (ideally) is only the near-end speaker's voice, with the echo removed.

In practice, this is significantly harder than it sounds. The room's acoustic characteristics (its impulse response) change constantly as people move and furniture shifts. The speaker and microphone introduce non-linear distortions. Both people may talk at the same time (double-talk). And all of this must be handled in real time, with processing latency measured in milliseconds.

How Echo Cancellation Works: The Adaptive Filter

At the heart of every acoustic echo canceller is an adaptive filter. This filter maintains an estimate of the room's impulse response, which is the acoustic path between the loudspeaker and the microphone. Using this estimate, the filter predicts what the echo will look like and subtracts that prediction from the microphone signal.

The process works in a continuous loop:

  1. The far-end signal (what the remote speaker said) is played through the loudspeaker

  2. That signal bounces around the room and arrives at the microphone, mixed with any near-end speech

  3. The adaptive filter takes the far-end signal as input and produces an echo estimate

  4. The echo estimate is subtracted from the microphone signal

  5. The residual (the difference between the actual signal and the estimate) is used to update the filter, improving future predictions

The most common algorithm for this adaptive filter is the Normalized Least Mean Squares (NLMS) algorithm. NLMS adjusts the filter coefficients after each sample (or block of samples) to minimize the error between the predicted echo and the actual microphone signal. The "normalized" part scales the update step by the energy of the input signal, which prevents the filter from diverging when the input is loud and from stalling when it's quiet.

A key parameter is the filter length, measured in milliseconds. This determines the maximum echo delay the canceller can handle. A typical room might have a reverberation tail of 100 to 300 milliseconds, so the filter needs enough taps to cover that duration. Longer filters can handle more reverberant spaces but require more computation.
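
To make the update rule and the tap-count trade-off concrete, here is a minimal time-domain NLMS sketch in C++. It is illustrative only: AEC3 itself adapts a partitioned frequency-domain filter (covered below), and all names in this snippet are made up for the example rather than taken from WebRTC.

```cpp
#include <cstddef>
#include <vector>

// Minimal time-domain NLMS echo canceller (sketch, not AEC3's implementation).
struct NlmsFilter {
    std::vector<float> weights;   // filter taps: the estimate of the room impulse response
    std::vector<float> history;   // most recent far-end (reference) samples
    float mu = 0.5f;              // adaptation step size, 0 < mu < 2
    float eps = 1e-6f;            // regulariser so quiet input doesn't blow up the update

    explicit NlmsFilter(std::size_t taps) : weights(taps, 0.0f), history(taps, 0.0f) {}

    // Process one microphone sample given one new far-end sample.
    // Returns the echo-cancelled residual.
    float Process(float far_end, float mic) {
        // Shift the reference history and insert the newest far-end sample.
        for (std::size_t i = history.size() - 1; i > 0; --i) history[i] = history[i - 1];
        history[0] = far_end;

        // Predict the echo as the dot product of the taps and the reference history.
        float echo_estimate = 0.0f;
        float energy = eps;
        for (std::size_t i = 0; i < weights.size(); ++i) {
            echo_estimate += weights[i] * history[i];
            energy += history[i] * history[i];
        }

        // Residual = microphone signal minus predicted echo.
        float error = mic - echo_estimate;

        // NLMS update: the step is normalised by the reference energy.
        float step = mu * error / energy;
        for (std::size_t i = 0; i < weights.size(); ++i) weights[i] += step * history[i];

        return error;
    }
};

// Filter length example: at 16 kHz, covering a 200 ms echo tail needs
// 0.200 * 16000 = 3200 taps.
```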

The Acoustic Echo Cancellation Pipeline

A production echo canceller involves far more than just an adaptive filter. The complete acoustic echo cancellation pipeline has several stages, each addressing a specific challenge. Here's how the signal flows through a typical AEC implementation, including WebRTC AEC3:

Delay Estimation

Before the adaptive filter can work, the system needs to know the time offset between the far-end reference signal and the echo in the microphone signal. This delay comes from several sources: the audio playback buffer, the DAC (digital-to-analogue converter), the speaker-to-microphone acoustic path, the ADC (analogue-to-digital converter), and the capture buffer. On mobile devices, this total delay can range from 20 to 200 milliseconds and can vary during a call.

WebRTC AEC3's render delay controller continuously estimates this delay by cross-correlating the reference signal with the capture signal. Getting the delay estimate right is critical: if the adaptive filter is looking at the wrong time offset in the reference buffer, it cannot converge on a useful echo estimate.
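
As a rough illustration of the idea, the toy function below picks the lag that maximizes the cross-correlation between the reference and the capture signal. AEC3's render delay controller is considerably more robust (it works on downsampled, filtered representations and tracks the estimate over time), so treat this as a sketch of the principle, not the implementation; the names are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Toy delay estimator: return the lag (in samples) at which the capture signal
// correlates most strongly with the far-end reference.
std::size_t EstimateDelaySamples(const std::vector<float>& reference,
                                 const std::vector<float>& capture,
                                 std::size_t max_delay) {
    std::size_t best_lag = 0;
    float best_corr = 0.0f;
    for (std::size_t lag = 0; lag < max_delay; ++lag) {
        float corr = 0.0f;
        for (std::size_t n = lag; n < capture.size() && n - lag < reference.size(); ++n) {
            corr += capture[n] * reference[n - lag];
        }
        if (corr > best_corr) {  // the correlation peaks where the echo aligns with the reference
            best_corr = corr;
            best_lag = lag;
        }
    }
    return best_lag;  // convert to milliseconds: lag / sample_rate * 1000
}
```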

Linear Adaptive Filter

Once the delay is estimated, the linear adaptive filter does the heavy lifting. WebRTC AEC3 uses a partitioned block frequency-domain adaptive filter (PBFDAF). Instead of processing one sample at a time in the time domain, it works on blocks of samples in the frequency domain using the FFT.

This frequency-domain approach has two major advantages. First, convolution in the time domain becomes multiplication in the frequency domain, which is computationally much cheaper for long filter lengths. Second, partitioning the filter into blocks allows the system to update the filter incrementally, reducing latency compared to processing the entire filter length at once.
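
The sketch below shows the core of the partitioned frequency-domain idea: the echo spectrum for one block is the sum, over partitions, of element-wise products between stored reference spectra and filter partitions. The FFT/IFFT and overlap-save bookkeeping are assumed and omitted, and the types are illustrative rather than AEC3's own.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using Spectrum = std::vector<std::complex<float>>;

// Each filter partition holds the FFT of one block of the filter. Per-bin
// multiplication here replaces time-domain convolution.
Spectrum EstimateEchoSpectrum(const std::vector<Spectrum>& reference_spectra,  // newest block first
                              const std::vector<Spectrum>& filter_partitions) {
    Spectrum echo(filter_partitions.front().size(), {0.0f, 0.0f});
    for (std::size_t p = 0; p < filter_partitions.size(); ++p) {
        for (std::size_t k = 0; k < echo.size(); ++k) {
            echo[k] += filter_partitions[p][k] * reference_spectra[p][k];
        }
    }
    return echo;  // IFFT + overlap-save yields the time-domain echo estimate
}
```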

The linear filter typically removes 20 to 40 dB of echo. For many scenarios this is sufficient, but for challenging conditions (reflective rooms, non-linear speaker distortion), the residual echo can still be audible.

Double-Talk Detection

Double-talk occurs when the near-end speaker talks at the same time as the far-end speaker. This is the hardest problem in echo cancellation because the adaptive filter's update mechanism assumes the error signal (microphone minus echo estimate) represents only the filter's estimation error. During double-talk, the near-end speech is also present in the error signal, and if the filter adapts to it, the filter diverges: it starts trying to model the near-end speaker as part of the echo, which corrupts the echo estimate.

WebRTC AEC3 handles double-talk by monitoring the coherence between the reference signal and the capture signal, along with the energy levels of both. When double-talk is detected, the filter adaptation rate is reduced or halted entirely, protecting the filter coefficients from corruption. Once the double-talk ends, normal adaptation resumes.
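
A simplified energy-based gate along these lines is sketched below: it flags double-talk when the capture block is clearly louder than the predicted echo can explain, and the caller then freezes or slows adaptation. AEC3's detector also tracks coherence between the signals and adapts its thresholds, so this is only the shape of the idea; names and the threshold value are illustrative.

```cpp
#include <cmath>
#include <vector>

// Returns true when the capture energy exceeds the predicted echo energy by
// more than `headroom_db`, which we interpret as near-end speech being present.
bool IsDoubleTalk(const std::vector<float>& capture_block,
                  const std::vector<float>& echo_estimate_block,
                  float headroom_db = 6.0f) {
    float capture_energy = 0.0f, echo_energy = 1e-10f;
    for (float s : capture_block) capture_energy += s * s;
    for (float s : echo_estimate_block) echo_energy += s * s;
    float ratio_db = 10.0f * std::log10(capture_energy / echo_energy);
    return ratio_db > headroom_db;  // capture much louder than the predicted echo
}

// Usage: if (IsDoubleTalk(capture, echo_estimate)) skip or slow the filter update.
```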

Residual Echo Suppression (Non-Linear Processing)

After the linear filter has done its work, some echo usually remains. This residual echo has two main sources: the linear filter's inability to perfectly model the room (especially in changing conditions), and non-linear effects that the linear filter cannot capture by design. Non-linearity comes from the loudspeaker itself (which distorts at high volumes), the amplifier, dynamic range processing in the audio path, and clipping at the ADC.

WebRTC AEC3's residual echo suppressor estimates the power of the remaining echo in each frequency band and applies a frequency-dependent suppression gain. Bands with more estimated residual echo get suppressed more aggressively. This stage acts as a safety net that catches what the linear filter misses.

The suppression gain calculation is a balancing act. Too much suppression removes the echo but also degrades the near-end speech (creating a "hollow" or underwater sound). Too little suppression leaves audible echo. AEC3 uses the quality of the linear filter's estimate (measured by the echo return loss enhancement, or ERLE) to calibrate how aggressively the residual suppressor should act.
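
The sketch below shows one common way to compute such a per-band gain: a Wiener-style rule with a gain floor, so that bands dominated by residual echo are attenuated heavily but never fully muted. AEC3's actual rule is more elaborate (it folds in ERLE, masking, and temporal smoothing), and the function and parameter names here are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Compute one suppression gain per frequency band from per-band power estimates.
std::vector<float> SuppressionGains(const std::vector<float>& capture_power,
                                    const std::vector<float>& residual_echo_power,
                                    float min_gain = 0.05f) {
    std::vector<float> gains(capture_power.size(), 1.0f);
    for (std::size_t k = 0; k < gains.size(); ++k) {
        // Estimated "clean" power is whatever the residual echo doesn't explain.
        float clean = std::max(capture_power[k] - residual_echo_power[k], 0.0f);
        float gain = clean / (capture_power[k] + 1e-10f);
        gains[k] = std::max(gain, min_gain);  // floor avoids fully muting a band
    }
    return gains;  // multiply each band of the capture spectrum by its gain
}
```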

Comfort Noise Generation

When the suppressor removes residual echo, it can create unnaturally silent gaps in the audio. These gaps feel jarring to listeners because background noise (room tone, ventilation hum) suddenly disappears during suppression. Comfort noise generation fills these gaps with synthetic noise that matches the spectral characteristics of the room's background noise, preserving a natural listening experience.
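
A minimal version of that idea is sketched below: the more a band was suppressed, the more random-phase noise, shaped by a tracked background-noise spectrum, is added back. The spectral representation and the names are simplified for illustration and are not AEC3's.

```cpp
#include <complex>
#include <cstddef>
#include <random>
#include <vector>

// Fill suppressed bands with synthetic noise matching the tracked noise floor.
void AddComfortNoise(std::vector<std::complex<float>>& capture_spectrum,
                     const std::vector<float>& noise_floor_magnitude,  // tracked per band
                     const std::vector<float>& suppression_gains) {
    static std::mt19937 rng{42};
    std::uniform_real_distribution<float> phase(0.0f, 6.2831853f);
    for (std::size_t k = 0; k < capture_spectrum.size(); ++k) {
        // Heavily suppressed bands (low gain) receive more synthetic room tone.
        float fill = (1.0f - suppression_gains[k]) * noise_floor_magnitude[k];
        capture_spectrum[k] += std::polar(fill, phase(rng));
    }
}
```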

WebRTC AEC3: Architecture and Implementation

WebRTC AEC3 (the "3" denotes the third iteration, replacing the older AEC and AECM modules) was introduced into the WebRTC codebase around 2017-2018. It represents a significant architectural improvement over its predecessors, and it runs in Chromium-based browsers, Android WebRTC applications, native desktop apps, and any other platform that links the WebRTC audio processing module.

Why AEC3 Replaced the Older Modules

The original WebRTC AEC module used a time-domain adaptive filter with a fixed delay estimator. It worked adequately in controlled environments but struggled with several real-world conditions: rapidly changing echo paths, devices with variable audio latency (common on Android), non-linear speaker distortion at high volumes, and Bluetooth audio routing changes. The AECM (AEC Mobile) module was a lighter-weight alternative for constrained devices, but it sacrificed quality for efficiency.

AEC3 addressed these limitations with a redesigned architecture:

  • A robust, continuously-adapting delay estimator that tracks changing latencies

  • A frequency-domain partitioned block filter that converges faster and handles longer reverberation tails efficiently

  • Improved echo path change detection that recovers quickly when someone moves or a door opens

  • A more sophisticated suppression gain calculation that balances echo removal against speech quality

  • Sub-band processing for better frequency resolution in the suppression stage

Block Processing in AEC3

AEC3 processes audio in blocks (typically 64 samples at 16 kHz, or 4 ms per block). Each block passes through the full pipeline: the render delay buffer is updated with the latest far-end audio, the delay controller adjusts the alignment, the linear filter produces an echo estimate and adapts its coefficients, and the residual echo suppressor applies its gains.

This block-based architecture aligns well with how audio hardware delivers data (in buffers, not individual samples) and enables efficient use of SIMD (Single Instruction, Multiple Data) instructions on modern CPUs. AEC3 can process a block of audio well within the 4 ms budget on mobile ARM processors, leaving headroom for other audio processing stages.

Source Code

For developers who want to trace the implementation, the WebRTC AEC3 source code lives under modules/audio_processing/aec3/ in the WebRTC repository. The main entry point is echo_canceller3.cc, which orchestrates the full pipeline. The code is C++ with hand-optimized SIMD paths for x86 (SSE2) and ARM (NEON).

Common Echo Cancellation Problems and How to Debug Them

Understanding the AEC pipeline makes it much easier to diagnose echo problems in practice. Here are the failure modes developers encounter most often, along with what causes them and how to investigate.

Echo Breakthrough

Echo breakthrough is the most common AEC complaint: the remote party hears clearly audible echo even though AEC is enabled.

Possible causes:

  • The delay estimator has locked onto the wrong offset. If the estimated delay is off by even a few milliseconds, the linear filter operates on misaligned data and cannot converge. This happens most often on Android devices with unpredictable audio latency.

  • The filter length is too short for the room's reverberation time. The echo tail extends beyond what the filter can model.

  • The echo path changed suddenly (someone moved the device, a Bluetooth audio route switched) and the filter hasn't re-converged yet.

How to debug: Log the estimated render delay and the ERLE (echo return loss enhancement). If ERLE is consistently low (under 10 dB), the linear filter isn't converging. If the delay estimate is unstable (jumping between values), the delay controller is struggling with the audio path.

Half-Duplex Behaviour

Instead of allowing both parties to speak simultaneously, the system suppresses the near-end speaker whenever the far-end speaker is active. It feels like a walkie-talkie.

Possible causes:

  • The residual echo suppressor is being too aggressive, treating near-end speech as echo during double-talk.

  • The double-talk detector is not recognizing simultaneous speech, allowing the filter to diverge and then over-suppressing to compensate.

How to debug: Reduce the suppression aggressiveness if your implementation exposes that parameter. In WebRTC, the EchoCanceller3Config struct contains tuning parameters for the suppressor. Check whether the issue correlates with the far-end signal level (worse at higher volumes suggests non-linear distortion is fooling the suppressor).

Filter Divergence

Echo cancellation works initially but degrades over time as ERLE drops and echo becomes audible again.

Possible causes:

  • Double-talk is corrupting the filter. The adaptation isn't being paused correctly during simultaneous speech.

  • A feedback loop exists in the audio path. If the cancelled output is accidentally being fed back as the reference signal, the filter chases its own tail.

  • Numeric instability in the filter coefficients, usually from an excessively high adaptation rate.

How to debug: Check your audio routing. The reference signal must be the far-end audio before it's mixed with any near-end audio. Verify that the adaptation rate (step size) is within expected bounds. If you're using WebRTC's AEC3 directly, the defaults are well-tested, so divergence usually points to an audio routing problem.

Echo Cancellation on Mobile Platforms

Mobile devices present unique challenges for acoustic echo cancellation. Variable audio latency, non-linear speaker behaviour at high volumes, diverse hardware configurations, and power constraints all complicate the task.

iOS and AVAudioSession

On iOS, echo cancellation is tightly integrated with the AVAudioSession system. Setting the audio session mode to .voiceChat enables the system's built-in AEC along with other voice processing (automatic gain control, noise suppression). This is the simplest path for most iOS voice applications.

However, the built-in iOS echo cancellation has limitations. It assumes a standard phone-call-like scenario and may not perform well for applications with custom audio routing or music mixed with voice. If your application uses .measurement or .default mode for other reasons, the built-in AEC is not active, and you'll need to provide your own.

Switchboard's iOS SDK provides a WebRTC AEC3 node that you can insert into your audio graph, giving you echo cancellation without requiring .voiceChat mode. This is particularly useful for applications that need AEC alongside other audio processing that .voiceChat mode would interfere with.

Android and AudioEffect

Android provides the AcousticEchoCanceler class in its android.media.audiofx package. This wraps the device manufacturer's AEC implementation, which varies significantly across devices and Android versions. Some devices have excellent echo cancellation; others have noticeably poor implementations.

Because of this inconsistency, many voice applications on Android bypass the platform AEC and use WebRTC AEC3 directly (or through an SDK like Switchboard that wraps it). This provides consistent behaviour across the fragmented Android device ecosystem at the cost of slightly higher CPU usage.

The biggest challenge on Android is audio latency. The round-trip latency between playing audio and capturing it varies from 10 ms on flagship devices to over 100 ms on budget hardware. AEC3's delay controller handles this variability, but extreme or rapidly changing latencies can still cause problems. Using Android's low-latency audio path (AAudio with AAUDIO_PERFORMANCE_MODE_LOW_LATENCY) helps stabilize the delay.
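
For reference, requesting that path from native code looks roughly like the snippet below (AAudio is an NDK C API, callable from C++); only the performance-mode flag is shown, with the rest of the stream setup elided.

```cpp
#include <aaudio/AAudio.h>

// Ask AAudio for the low-latency (fast mixer) path. Error handling and the
// remaining stream configuration are omitted for brevity.
void OpenLowLatencyStream() {
    AAudioStreamBuilder* builder = nullptr;
    AAudio_createStreamBuilder(&builder);
    AAudioStreamBuilder_setPerformanceMode(builder, AAUDIO_PERFORMANCE_MODE_LOW_LATENCY);
    // ... set direction, format, sample rate, and callback, then
    //     AAudioStreamBuilder_openStream(builder, &stream);
    AAudioStreamBuilder_delete(builder);
}
```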

Embedded and Edge Devices

For edge AI applications running on embedded Linux boards, Raspberry Pi devices, or custom hardware, there is no platform AEC to fall back on. You need to run an echo canceller as part of your audio pipeline. WebRTC AEC3 is a strong choice here because it's pure C++ with no platform dependencies and runs efficiently on ARM CPUs, having been battle-tested across billions of WebRTC calls. Switchboard's C++ API provides AEC3 as a node that you can integrate into any audio graph on embedded platforms.

Measuring AEC Performance

When evaluating or tuning an acoustic echo canceller, several metrics matter:

  • ERLE (Echo Return Loss Enhancement): The ratio of echo power before and after cancellation, measured in dB. Higher is better. A well-performing linear filter typically achieves 20 to 40 dB ERLE. Below 10 dB indicates the filter isn't converging (see the sketch after this list).

  • Residual echo level: The absolute level of remaining echo after both linear cancellation and non-linear suppression. This is what the remote listener actually hears.

  • Near-end speech degradation: AEC can damage the near-end speech, especially during double-talk. Measuring with PESQ (Perceptual Evaluation of Speech Quality) or POLQA gives an objective quality score.

  • Convergence time: How quickly the filter adapts to a new room or a changed echo path. AEC3 typically converges within one to two seconds in normal conditions.
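
As referenced in the ERLE bullet above, the measurement itself is just a power ratio in dB; a small sketch, with illustrative names:

```cpp
#include <cmath>
#include <vector>

// ERLE over one block: echo power at the microphone (with no near-end speech)
// divided by the residual power after cancellation, expressed in dB.
float ErleDb(const std::vector<float>& mic_during_echo,
             const std::vector<float>& residual_after_aec) {
    float mic_power = 1e-10f, residual_power = 1e-10f;
    for (float s : mic_during_echo) mic_power += s * s;
    for (float s : residual_after_aec) residual_power += s * s;
    return 10.0f * std::log10(mic_power / residual_power);
}
```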

Integrating Echo Cancellation in Your Application

If you're building a real-time voice application, you have several options for echo cancellation:

Use the platform's built-in AEC. On iOS (.voiceChat mode) and in WebRTC-based browser applications, this is the simplest option. The trade-off is limited control over tuning and behaviour.

Use WebRTC AEC3 directly. If you're already using the WebRTC native library, AEC3 is available as part of the audio processing module. You feed it the render (far-end) signal and the capture (near-end) signal, and it returns the cleaned audio. You need to manage the audio routing yourself.
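
In outline, that integration looks like the sketch below, which drives AEC3 through WebRTC's AudioProcessing module. Exact headers, return types, and call signatures vary across WebRTC revisions, so treat this as a hedged outline rather than copy-paste code; the per-frame calls are shown as comments.

```cpp
#include "modules/audio_processing/include/audio_processing.h"

void SetUpEchoCancellation() {
    // Build the audio processing module and enable the echo canceller
    // (the full, non-mobile configuration selects AEC3 in current builds).
    rtc::scoped_refptr<webrtc::AudioProcessing> apm =
        webrtc::AudioProcessingBuilder().Create();

    webrtc::AudioProcessing::Config config;
    config.echo_canceller.enabled = true;
    config.echo_canceller.mobile_mode = false;  // true selects the lighter AECM path
    apm->ApplyConfig(config);

    // Then, for every 10 ms frame:
    //   apm->ProcessReverseStream(...);   // far-end (render) audio about to be played out
    //   apm->set_stream_delay_ms(delay);  // report the playout+capture delay if known
    //   apm->ProcessStream(...);          // capture frame; cleaned audio is written back
}
```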

Use an audio SDK with AEC built in. Switchboard provides WebRTC AEC3 as a node in its audio graph architecture. You connect it alongside your other audio processing nodes (noise suppression, voice activity detection, speech-to-text, text-to-speech) and the SDK handles the signal routing, buffering, thread management, and sample-rate conversion. This approach gives you AEC3's quality with less integration effort, and it works cross-platform across iOS, Android, desktop, and embedded Linux.

Whichever path you choose, the critical integration requirement is the same: the echo canceller must receive the far-end reference signal and the near-end capture signal with accurate timing. If these signals are misaligned or if the reference signal doesn't match what was actually played through the speaker (e.g., because of post-processing or mixing after the reference tap point), AEC performance will suffer.

Available Now

Acoustic echo cancellation is one of those technologies that works so well, most people never think about it. From browser-based calls to voice chat in games to telehealth consultations to enterprise conferencing, AEC runs silently in the background. WebRTC AEC3 handles this for billions of calls, and understanding how it works gives you the foundation to debug echo problems when they arise and make informed decisions about your audio pipeline.

Switchboard's audio SDK provides WebRTC AEC3 as part of its modular audio graph, alongside noise suppression (RNNoise), voice activity detection, speech-to-text, and text-to-speech nodes. If you're building a voice application that needs echo cancellation across platforms, check out the AEC documentation to get started.