DDSP: Differentiable Digital Signal Processing

Jesse Engel, Lamtharn (Hanoi) Hantrakul, Chenjie Gu, Adam Roberts

Keywords: adversarial, audio, autoencoder, autoregressive models, disentanglement, expressive power, generation, generative models, inductive bias

Mon Session 1 (05:00-07:00 GMT)
Mon Session 5 (20:00-22:00 GMT)
Monday: Sequence Representations

Abstract: Most generative models of audio directly generate samples in one of two domains: time or frequency. While sufficient to express any signal, these representations are inefficient, as they do not utilize existing knowledge of how sound is generated and perceived. A third approach (vocoders/synthesizers) successfully incorporates strong domain knowledge of signal processing and perception, but has been less actively researched due to limited expressivity and difficulty integrating with modern auto-differentiation-based machine learning methods. In this paper, we introduce the Differentiable Digital Signal Processing (DDSP) library, which enables direct integration of classic signal processing elements with deep learning methods. Focusing on audio synthesis, we achieve high-fidelity generation without the need for large autoregressive models or adversarial losses, demonstrating that DDSP enables utilizing strong inductive biases without losing the expressive power of neural networks. Further, we show that combining interpretable modules permits manipulation of each separate model component, with applications such as independent control of pitch and loudness, realistic extrapolation to pitches not seen during training, blind dereverberation of room acoustics, transfer of extracted room acoustics to new environments, and transformation of timbre between disparate sources. In short, DDSP enables an interpretable and modular approach to generative modeling, without sacrificing the benefits of deep learning. The library is available at https://github.com/magenta/ddsp and we encourage further contributions from the community and domain experts.
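To make the idea of an interpretable synthesis module with independent pitch and loudness controls concrete, here is a minimal sketch of a sum-of-sinusoids (harmonic) oscillator. It is an illustration only, not the DDSP library's API: the function name `harmonic_synth`, its parameters, and the 16 kHz sample rate are assumptions chosen for this example. It uses only the Python standard library, so it does not compute gradients. The signal path is built from sums, products, and sines, which is the kind of computation an auto-differentiation framework could in principle backpropagate through.

```python
import math


def harmonic_synth(f0_hz, amplitudes, harmonic_weights, sample_rate=16000):
    """Hypothetical sum-of-sinusoids oscillator (illustrative, not the DDSP API).

    f0_hz:            fundamental frequency for each output sample, in Hz
    amplitudes:       overall amplitude (loudness control) for each sample
    harmonic_weights: relative weights of harmonics 1..K (normalized here)
    """
    if len(f0_hz) != len(amplitudes):
        raise ValueError("f0_hz and amplitudes must have the same length")

    # Normalize so the harmonic distribution sums to 1; loudness is then
    # controlled only by `amplitudes`.
    total = sum(harmonic_weights)
    weights = [w / total for w in harmonic_weights]

    nyquist = sample_rate / 2.0
    phase = 0.0  # running phase of the fundamental, in radians
    audio = []
    for f0, amp in zip(f0_hz, amplitudes):
        # Accumulate instantaneous frequency into phase.
        phase += 2.0 * math.pi * f0 / sample_rate
        sample = 0.0
        for k, w in enumerate(weights, start=1):
            # Skip harmonics at or above Nyquist, which would alias.
            if k * f0 < nyquist:
                sample += w * math.sin(k * phase)
        audio.append(amp * sample)
    return audio


# One second of a 220 Hz tone with three harmonics at half amplitude.
n = 16000
audio = harmonic_synth([220.0] * n, [0.5] * n, [1.0, 0.5, 0.25])
```

Because pitch (`f0_hz`) and loudness (`amplitudes`) enter as separate inputs, either can be changed without touching the other, which loosely mirrors the independent control described in the abstract. In a learned system, a neural network would predict these control signals; that part is omitted here.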

Similar Papers

High Fidelity Speech Synthesis with Adversarial Networks
Mikołaj Bińkowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C. Cobo, Karen Simonyan
Semi-Supervised Generative Modeling for Controllable Speech Synthesis
Raza Habib, Soroosh Mariooryad, Matt Shannon, Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, David Kao, Tom Bagby
Masked Based Unsupervised Content Transfer
Ron Mokady, Sagie Benaim, Lior Wolf, Amit Bermano
On the "steerability" of generative adversarial networks
Ali Jahanian, Lucy Chai, Phillip Isola