A New Paradigm for Lossy Media Compression

Introduction

Classical Compression Schemes

Comparing Vaux to Classical Audio Compression Schemes

The original audio, encoded as 16-bit PCM (WAV), with a bitrate of 256 kbps.
Audio encoded as Opus, with a bitrate of 6 kbps, the lowest bitrate Opus will operate at.
Audio encoded as MP3, with a bitrate of 24 kbps.
Audio encoded using Vaux, with a bitrate of 2 kbps.
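
To put the bitrates in the captions above in perspective, the short sketch below computes each sample's compression ratio relative to the 256 kbps PCM original, along with the storage needed for one minute of audio. Only the four bitrates come from the captions. The helper names, the one-minute duration, and the convention that 1 kbps means 1000 bits per second are illustrative assumptions.

```python
# Bitrates in kbps, taken from the sample captions above.
BITRATES_KBPS = {"PCM": 256, "MP3": 24, "Opus": 6, "Vaux": 2}


def compression_ratio(original_kbps, encoded_kbps):
    """How many times smaller the encoded stream is than the original."""
    return original_kbps / encoded_kbps


def size_bytes(bitrate_kbps, seconds):
    """Bytes needed to store `seconds` of audio (assumes 1 kbps = 1000 bits/s)."""
    return bitrate_kbps * 1000 * seconds // 8


for name, kbps in BITRATES_KBPS.items():
    ratio = compression_ratio(BITRATES_KBPS["PCM"], kbps)
    print(f"{name}: ratio {ratio:.1f}, {size_bytes(kbps, 60)} bytes per minute")
```

At these figures the 2 kbps stream is 128 times smaller than the PCM original and 3 times smaller than the 6 kbps Opus stream.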

Core Enabling Technologies

Generative Models

Outputs from a generative model during training, starting with random noise and moving towards intelligible images. Source: https://openai.com/blog/generative-models/
Block diagram of our theoretical image compression scheme.

Representation Learning

A diagram of an autoencoder. The input vector feeds an encoder with 2 hidden layers, whose output is a vector smaller than the input. The decoder also has 2 hidden layers and produces an output vector the same size as the input vector.
Basic autoencoder architecture: encoder, bottleneck (code), and decoder. Source: https://external.codecademy.com/articles/common-applications-of-deep-learning
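
The shape of the autoencoder in the diagram can be sketched in a few lines. The example below is a minimal, untrained illustration that only shows how the vector narrows to a bottleneck and widens again. The layer sizes (8, 6, 4, 2), the tanh activation, and the random weights are all assumptions chosen for readability, not details from the diagram's source.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Create one dense layer with random weights and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(layer, x):
    """Apply one dense layer followed by a tanh activation."""
    weights, biases = layer
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def run(layers, x):
    """Pass a vector through a stack of layers."""
    for layer in layers:
        x = dense(layer, x)
    return x


rng = random.Random(0)
INPUT_DIM, CODE_DIM = 8, 2                    # illustrative sizes
encoder_sizes = [INPUT_DIM, 6, 4, CODE_DIM]   # two hidden layers, then the bottleneck
decoder_sizes = [CODE_DIM, 4, 6, INPUT_DIM]   # two hidden layers, then the output

encoder = [make_layer(a, b, rng) for a, b in zip(encoder_sizes, encoder_sizes[1:])]
decoder = [make_layer(a, b, rng) for a, b in zip(decoder_sizes, decoder_sizes[1:])]

x = [rng.uniform(-1, 1) for _ in range(INPUT_DIM)]
code = run(encoder, x)                # compressed representation
reconstruction = run(decoder, code)   # same size as the input

print(len(x), len(code), len(reconstruction))  # prints "8 2 8"
```

With random weights the reconstruction is meaningless. Training would adjust the weights so that the reconstruction matches the input as closely as the narrow code allows.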

Vaux Architecture

The Vaux audio compression model has an autoencoder architecture. Input speech is first denoised, then compressed by the encoder and quantised. The compressed frame is sent to the decoder, which outputs a set of parameters that tell the vocoder how to generate the output speech.
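
The quantisation step is what turns the encoder's output into a fixed number of bits per frame. The sketch below shows one simple way this could work, a uniform scalar quantiser, together with the resulting bitrate. The text does not say which quantiser Vaux uses, so the quantiser itself, the latent size of 8, the 5 bits per value, and the 50 frames per second are all hypothetical values picked so the arithmetic lands on 2 kbps.

```python
def quantise(values, bits, lo=-1.0, hi=1.0):
    """Map each value in [lo, hi] to an integer code in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    codes = []
    for v in values:
        clipped = min(max(v, lo), hi)
        codes.append(round((clipped - lo) / (hi - lo) * levels))
    return codes


def dequantise(codes, bits, lo=-1.0, hi=1.0):
    """Map integer codes back to approximate values in [lo, hi]."""
    levels = (1 << bits) - 1
    return [lo + c / levels * (hi - lo) for c in codes]


def bitrate_bps(latent_dim, bits, frames_per_second):
    """Bits per second needed to send one quantised latent per frame."""
    return latent_dim * bits * frames_per_second


# Hypothetical encoder output for a single frame.
latent = [0.12, -0.80, 0.45, 0.99, -0.33, 0.0, 0.71, -1.0]
BITS = 5

codes = quantise(latent, BITS)        # what would be transmitted
restored = dequantise(codes, BITS)    # what the decoder would see
max_error = max(abs(a - b) for a, b in zip(latent, restored))

print(codes)
print(f"max quantisation error: {max_error:.4f}")
print(bitrate_bps(len(latent), BITS, 50), "bits per second")
```

Under these assumed numbers each frame costs 40 bits, and 50 frames per second gives 2000 bits per second. Fewer bits per value lowers the bitrate but increases the quantisation error the decoder has to cope with.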

Denoising

Block diagram of ConvTASNet, showing the encoder or filterbank, the separation network and masking, and the inverse-filterbank or decoder. Source: https://arxiv.org/abs/1910.11615
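
The encode, mask, decode structure in the diagram can be illustrated with a much simpler stand-in. In the sketch below a fixed discrete Fourier transform plays the role of the filterbank and a magnitude threshold plays the role of the separation network. Both are swapped-in simplifications: ConvTASNet learns its mask, and its filterbank may also be learned. The frame length, the threshold, and the function names are assumptions made for this example.

```python
import cmath
import math
import random


def dft(x):
    """Analysis 'filterbank': naive discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(coeffs):
    """Synthesis 'inverse filterbank': naive inverse DFT, real part only."""
    n = len(coeffs)
    return [(sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def mask_denoise(frame, threshold):
    """Encode, apply a 0/1 mask, decode. The threshold stands in for a learned mask."""
    coeffs = dft(frame)
    mask = [1.0 if abs(c) > threshold else 0.0 for c in coeffs]
    masked = [m * c for m, c in zip(mask, coeffs)]
    return idft(masked), mask


def squared_error(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


rng = random.Random(1)
N = 16
clean = [math.sin(2 * math.pi * 2 * t / N) for t in range(N)]   # tone in bin 2
noisy = [s + rng.uniform(-0.1, 0.1) for s in clean]

denoised, mask = mask_denoise(noisy, threshold=2.0)
kept_bins = [k for k, m in enumerate(mask) if m == 1.0]

print("bins kept:", kept_bins)
print("error before:", squared_error(noisy, clean))
print("error after: ", squared_error(denoised, clean))
```

The mask keeps only the two bins that carry the tone, so most of the noise energy is discarded before the signal is converted back to the time domain.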

Feedback Recurrent Autoencoder

Block diagram of the FRAE, showing Encoder, Quantisation, and Decoder, and the sharing of the hidden state. Source: https://arxiv.org/abs/1911.04018
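
The key idea in the diagram, an encoder and decoder that share state from one frame to the next, can be shown with a toy that uses no neural networks at all. The sketch below is a hand-written predictive coder, in the spirit of DPCM, standing in for the learned recurrent networks of the FRAE. The step size, the code range, and the function names are assumptions. The point is only that the encoder runs a copy of the decoder so both sides always hold the same state, which lets each frame be coded relative to what the decoder already knows.

```python
import math

STEP = 0.05      # quantiser step size (illustrative)
MAX_CODE = 15    # codes are clipped to [-15, 15], roughly 5 bits


def encode_frame(x, state):
    """Encoder sees the input and the shared state, and emits an integer code."""
    residual = x - state
    code = round(residual / STEP)
    return max(-MAX_CODE, min(MAX_CODE, code))


def decode_frame(code, state):
    """Decoder sees only the code and the shared state. Returns (output, next state)."""
    y = state + code * STEP
    return y, y


def run_codec(signal):
    enc_state = 0.0   # encoder's copy of the decoder state
    dec_state = 0.0   # state held on the receiving side
    codes, outputs = [], []
    for x in signal:
        code = encode_frame(x, enc_state)
        # The encoder runs the decoder locally so its state stays in sync.
        _, enc_state = decode_frame(code, enc_state)
        y, dec_state = decode_frame(code, dec_state)
        assert enc_state == dec_state
        codes.append(code)
        outputs.append(y)
    return codes, outputs


signal = [math.sin(2 * math.pi * t / 40) for t in range(80)]
codes, outputs = run_codec(signal)
max_error = max(abs(a - b) for a, b in zip(signal, outputs))

print("largest code magnitude:", max(abs(c) for c in codes))
print(f"max reconstruction error: {max_error:.4f}")
```

Because only the difference from the shared state is transmitted, the codes stay small even though the signal spans the full range from -1 to 1. In the FRAE the hand-written predictor and quantiser are replaced by trained networks, but the feedback of the shared state serves the same purpose.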

Vocoder

Block diagram of a similar DDSP vocoder. The vocoder is conditioned on a number of inputs (shown in green). These include the pitch (F0), the latent (the quantised latent vector described in the FRAE section), and other information such as loudness. For each frame of audio, the decoder parameterises the DSP modules (shown in yellow), whose output can be passed through other modules, such as filtering and reverb, for a more natural-sounding audio output. Source: https://magenta.tensorflow.org/ddsp
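
To make "the decoder parameterises the DSP modules" concrete, the sketch below implements one such module, a harmonic oscillator bank, driven by per-frame parameters. It is a simplified illustration and not the DDSP library's API. The sample rate, the frame length, the parameter values, and the function name are assumptions, and the filtered-noise and reverb modules shown in the diagram are left out.

```python
import math

SAMPLE_RATE = 16000   # Hz, an assumed value
FRAME_LEN = 320       # 20 ms at 16 kHz, an assumed frame size


def harmonic_frame(f0_hz, amplitude, harmonic_weights, n_samples, phase=0.0):
    """Synthesise one frame as a weighted sum of harmonics of f0.

    Harmonics at or above the Nyquist frequency are dropped to avoid aliasing,
    and the remaining non-negative weights are normalised to sum to 1.
    Returns the samples and the fundamental's phase at the start of the next frame.
    """
    nyquist = SAMPLE_RATE / 2
    audible = [(k, w) for k, w in enumerate(harmonic_weights, start=1)
               if k * f0_hz < nyquist]
    total = sum(w for _, w in audible)
    step = 2 * math.pi * f0_hz / SAMPLE_RATE
    samples = []
    for n in range(n_samples):
        theta = phase + step * n
        if total > 0:
            value = sum(w * math.sin(k * theta) for k, w in audible) / total
        else:
            value = 0.0   # nothing audible: output silence
        samples.append(amplitude * value)
    next_phase = (phase + step * n_samples) % (2 * math.pi)
    return samples, next_phase


# Hypothetical per-frame parameters, as a decoder might emit them:
# (pitch in Hz, overall amplitude, relative strength of each harmonic).
frames = [
    (220.0, 0.8, [0.5, 0.3, 0.15, 0.05]),
    (233.0, 0.6, [0.6, 0.25, 0.1, 0.05]),
]

audio, phase = [], 0.0
for f0, amp, weights in frames:
    samples, phase = harmonic_frame(f0, amp, weights, FRAME_LEN, phase)
    audio.extend(samples)

print(len(audio))  # prints 640
```

Carrying the phase from one frame to the next keeps the waveform continuous when the pitch changes between frames, which avoids audible clicks at frame boundaries.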

Note About Compute Costs

Development Roadmap

Variable Bit Rate

Vocoder Upgrade

Follow Our Journey!