A New Paradigm for Lossy Media Compression

Introduction

The world has moved toward a more fluid work life where flexibility is the new normal, and the demand for quality online communication is greater than ever. Recent times have highlighted the disadvantages faced by users in developing countries and rural areas when trying to access reliable internet. Even in locations with strong internet, we often struggle to get through a meeting without interruption.

Classical Compression Schemes

Classical lossy compression schemes such as JPEG for images and MP3 for music work by removing as much information from the original media as possible, while still retaining the essential components for viewing or listening at an acceptable level of quality. The algorithms behind these codecs use knowledge of human vision (psychovisual models) and hearing (psychoacoustic models) to remove the information that has the least impact on how a human perceives the media.

Comparing Vaux to Classical Audio Compression Schemes

Here we present a quick comparison between Vaux, Opus and MP3 given a noisy input file. These audio clips have a sample rate of 16 kHz; the sketch after the list shows what these bitrates mean in raw data terms.

The original audio, encoded as 16-bit PCM (WAV), with a bitrate of 256 kbps.
Audio encoded as Opus, with a bitrate of 6 kbps, the lowest bitrate Opus will operate at.
Audio encoded as MP3, with a bitrate of 24 kbps.
Audio encoded using Vaux, with a bitrate of 2 kbps.
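
To put those numbers in perspective, here is a tiny back-of-the-envelope sketch (the 10-second clip length is an arbitrary example) of how much data each codec needs to transmit:

```python
# Hypothetical illustration: data needed for a 10-second clip at the bitrates
# quoted above (bitrates are in kilobits per second).
CLIP_SECONDS = 10

bitrates_kbps = {
    "WAV (16-bit PCM)": 256,
    "MP3": 24,
    "Opus": 6,
    "Vaux": 2,
}

for codec, kbps in bitrates_kbps.items():
    size_kb = kbps * CLIP_SECONDS / 8  # kilobits -> kilobytes
    print(f"{codec:>18}: {size_kb:6.1f} kB for {CLIP_SECONDS} s of speech")
```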

Core Enabling Technologies

Two key AI technologies give rise to the era of AI-powered codecs: Representation Learning, techniques that learn to extract compact and meaningful representations from media; and powerful Generative Models that can synthesise natural-looking and natural-sounding media from these compact representations.

Generative Models

Generative models are trained over enormous amounts of real-world data from a specified domain — images, audio, or text, for example. The goal is that the model will be able to generate data similar to what it has seen during training. The model is forced, by the design of its architecture, to discover and efficiently internalise the essence of the media in order to generate new examples.

Outputs from a generative model during training, starting with random noise and moving towards intelligible images. Source: https://openai.com/blog/generative-models/
Block diagram of our theoretical image compression scheme.

Representation Learning

Autoencoders are one example of a representation learning technique. They leverage neural networks to learn a compact, yet rich and meaningful representation of the input media. The autoencoder architecture consists of an encoder and a decoder, with an information bottleneck imposed between the two; the bottleneck forces a compressed representation to be extracted from the input.

A diagram of an autoencoder, showing the input vector, the encoder with 2 hidden layers, the output of the encoder is a vector smaller than that of the input, the decoder also has 2 hidden layers and finally the output vector which is the same size as the input vector.
Basic Autoencoder architecture, Encoder, Bottleneck / Code, Decoder. Source: https://external.codecademy.com/articles/common-applications-of-deep-learning
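
To make the idea concrete, here is a minimal autoencoder sketch in PyTorch. The layer sizes and dimensions are illustrative assumptions, not the Vaux model; the point is that the bottleneck vector is much smaller than the input, which forces the network to learn a compact representation.

```python
# A minimal autoencoder sketch (illustrative only; sizes are assumptions).
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=256, bottleneck_dim=16):
        super().__init__()
        # Encoder: two hidden layers squeezing the input down to the bottleneck.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, bottleneck_dim),
        )
        # Decoder: two hidden layers reconstructing the input from the bottleneck.
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        code = self.encoder(x)           # compact representation ("the code")
        recon = self.decoder(code)       # reconstruction, same size as the input
        return recon, code

model = AutoEncoder()
x = torch.randn(8, 256)                  # a batch of 8 dummy input vectors
recon, code = model(x)
loss = nn.functional.mse_loss(recon, x)  # train by minimising reconstruction error
```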

Vaux Architecture

The Vaux denoising and compression pipeline brings together a handful of modern deep learning techniques: a time-domain denoising model, a recurrent autoencoder with feedback, and a hybrid neural-network and Digital Signal Processing (DSP) vocoder.

The Vaux audio compression model has an autoencoder architecture. Input speech is first denoised, then compressed by the encoder and quantised. The compressed frame is sent to the decoder, which outputs a set of parameters that tell the vocoder how to generate the output speech.
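
The flow can be sketched, very roughly, as follows. Every function here is a placeholder standing in for a real model, and the frame sizes, quantisation scheme, and parameter set are assumptions for illustration only; the key point is that only the small quantised latent needs to cross the network.

```python
# Conceptual sketch of the pipeline (placeholder functions, not the Vaux API).
# Each frame of input speech flows: denoiser -> encoder -> quantiser -> decoder -> vocoder.
import numpy as np

def denoise(frame):                    # suppress background noise, keep the speech
    return frame

def encode(frame):                     # map the frame to a small latent vector
    return frame[:16]

def quantise(latent, levels=256):      # discretise the latent so it can be sent as bits
    return np.round(latent * levels) / levels

def decode(latent):                    # turn the latent into vocoder parameters
    return {"f0": 120.0, "loudness": float(np.abs(latent).mean()), "latent": latent}

def vocode(params, frame_len=320):     # synthesise an audio frame from the parameters
    t = np.arange(frame_len) / 16000.0
    return params["loudness"] * np.sin(2 * np.pi * params["f0"] * t)

noisy_frame = np.random.randn(320)     # 20 ms of 16 kHz audio
clean = denoise(noisy_frame)
sent = quantise(encode(clean))         # this is all that crosses the network
output = vocode(decode(sent))          # the receiver regenerates the speech
```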

Denoising

Denoising is the task of un-mixing two signals: a speech signal, which we want to keep, and a noise signal, which we want to discard. This task is essentially the same as blind speaker separation, for which there is a wealth of research in the Deep Learning community. At Vaux we have implemented and trained our own version of Conv-TasNet, a speaker separation model that has achieved State-Of-The-Art (SOTA) results.

Block diagram of Conv-TasNet, showing the encoder or filterbank, the separation network and masking, and the inverse-filterbank or decoder. Source: https://arxiv.org/abs/1910.11615
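
For intuition, here is a toy mask-based separation sketch in PyTorch in the spirit of Conv-TasNet. The layer sizes and the tiny separation network are assumptions for illustration; the real model is considerably deeper. A learned 1-D conv filterbank encodes the mixture, a network predicts a mask per source, and a transposed convolution (the "inverse filterbank") decodes each source.

```python
# Toy mask-based source separation (illustrative sizes, not the production model).
import torch
import torch.nn as nn

n_filters, kernel, stride, n_sources = 64, 16, 8, 2

encoder = nn.Conv1d(1, n_filters, kernel, stride=stride, bias=False)   # filterbank
separator = nn.Sequential(                                             # mask estimator
    nn.Conv1d(n_filters, 128, 1), nn.ReLU(),
    nn.Conv1d(128, n_filters * n_sources, 1),
)
decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride, bias=False)

mixture = torch.randn(1, 1, 16000)                  # 1 second of 16 kHz audio
features = encoder(mixture)                         # (batch, n_filters, frames)
masks = torch.sigmoid(separator(features))          # one mask per source
masks = masks.view(1, n_sources, n_filters, -1)

sources = [decoder(features * masks[:, i]) for i in range(n_sources)]
speech, noise = sources                             # keep the speech, discard the noise
```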

Feedback Recurrent Auto Encoder

The Vaux encoder/decoder architecture is based on the Feedback Recurrent Auto Encoder (FRAE), proposed in this paper from Qualcomm. This work introduces a recurrent autoencoder scheme in which the decoder's hidden state is fed back to, and shared with, the encoder.

Block diagram of the FRAE, showing Encoder, Quantisation, and Decoder, and the sharing of the hidden state. Source: https://arxiv.org/abs/1911.04018
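
A heavily simplified sketch of the feedback idea in PyTorch might look like the following. Dimensions, cell choices, and the crude rounding-based quantisation are assumptions for illustration (see the FRAE paper for the real design); the key point is that the decoder's hidden state from the previous frame is fed back into the encoder, so both sides share context without spending any extra bits.

```python
# Simplified feedback recurrent autoencoder sketch (illustrative, not the Vaux model).
import torch
import torch.nn as nn

class FRAE(nn.Module):
    def __init__(self, frame_dim=128, latent_dim=16, hidden_dim=64):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.enc = nn.GRUCell(frame_dim + hidden_dim, hidden_dim)
        self.to_latent = nn.Linear(hidden_dim, latent_dim)
        self.dec = nn.GRUCell(latent_dim, hidden_dim)
        self.to_frame = nn.Linear(hidden_dim, frame_dim)

    def forward(self, frames):
        batch = frames.size(1)
        h_enc = frames.new_zeros(batch, self.hidden_dim)
        h_dec = frames.new_zeros(batch, self.hidden_dim)  # this state is the feedback signal
        outputs = []
        for frame in frames:                              # iterate over time steps
            # The encoder sees the current frame plus the decoder's previous state.
            h_enc = self.enc(torch.cat([frame, h_dec], dim=-1), h_enc)
            latent = torch.round(self.to_latent(h_enc))   # crude stand-in for quantisation
            # The decoder updates its state from the (quantised) latent only.
            h_dec = self.dec(latent, h_dec)
            outputs.append(self.to_frame(h_dec))
        return torch.stack(outputs)

frames = torch.randn(50, 4, 128)      # (time steps, batch, frame features)
recon = FRAE()(frames)                # reconstruction, same shape as the input
```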

Vocoder

Vaux has developed a Differentiable Digital Signal Processing (DDSP) vocoder, based on the DDSP work published by Google Magenta.

Block diagram of a similar DDSP vocoder. The vocoder is conditioned on a number of inputs (shown in green), including the pitch (F0), the latent (the quantised latent vector mentioned in the FRAE section above), and other information such as loudness. For each frame of audio, the decoder parameterises the DSP modules (shown in yellow); their output can be passed through further modules such as filtering and reverb for a more natural-sounding audio output. Source: https://magenta.tensorflow.org/ddsp
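
As a rough illustration of the harmonic-plus-noise idea behind such a vocoder (the constants here are arbitrary assumptions, not the Vaux vocoder), a single frame could be synthesised like this: pitch drives a bank of sinusoidal harmonics, loudness scales them, and filtered noise fills in the non-harmonic parts of the signal.

```python
# Toy harmonic-plus-noise synthesis for one frame (illustrative constants only).
import numpy as np

def synthesise_frame(f0, loudness, noise_gain=0.05, n_harmonics=20,
                     sample_rate=16000, frame_len=320):
    t = np.arange(frame_len) / sample_rate
    # Sum of harmonics at integer multiples of the fundamental frequency (F0).
    harmonics = sum(np.sin(2 * np.pi * f0 * k * t) / k
                    for k in range(1, n_harmonics + 1))
    noise = noise_gain * np.random.randn(frame_len)   # unvoiced / breathy component
    return loudness * harmonics + noise

frame = synthesise_frame(f0=120.0, loudness=0.1)      # one 20 ms frame at 16 kHz
```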

Note About Compute Costs

One concern about ML-powered compression is that ML models are generally known for their high computational requirements, and often require expensive accelerators such as GPUs. It's useful to think of ML-powered compression as trading network bandwidth for local compute power. We've worked to avoid this issue by selecting ML architectures that are computationally light, or that can be made computationally light using techniques like pruning, removing the need for a high-power accelerator like a GPU.
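
As an example of the kind of technique we mean, here is a minimal sketch of magnitude pruning using PyTorch's built-in utilities (the layer and the 50% sparsity level are arbitrary examples, not Vaux's settings). Pruning zeroes out the smallest weights, so the model needs less compute and memory at inference time.

```python
# Minimal magnitude-pruning sketch (arbitrary layer and sparsity, for illustration).
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 256)
prune.l1_unstructured(layer, name="weight", amount=0.5)   # zero the 50% smallest weights
prune.remove(layer, "weight")                             # make the pruning permanent

sparsity = (layer.weight == 0).float().mean().item()
print(f"fraction of zeroed weights: {sparsity:.0%}")
```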

Development Roadmap

Variable Bit Rate

While we’ve achieved great results so far with the models described above, we are exploring a technique that will allow Vaux to operate in a Variable Bit Rate (VBR) mode. This technique will remove the need for quantisation, which is currently a source of artefacts in the output audio.

Vocoder Upgrade

We are moving to a real-time WaveRNN vocoder; while not as interpretable as our current DDSP vocoder, it generates higher-fidelity audio and removes the need for accurate pitch tracking. Extracting accurate pitch values from real-world audio has been a limitation on robust generalisation for our DDSP vocoder. A vocoder with fewer 'moving parts' enables us to achieve better generalisation to any voice in any environment.

Follow Our Journey!

Our next demo will incorporate VBR and our new vocoder, and feature more real-world comparisons; our next update will include details about our beta product!