Time-frequency (TF) representations provide powerful and intuitive features for the analysis of time series such as audio. Nevertheless, generative modeling of audio in the TF domain is a subtle matter. Consequently, neural audio synthesis relies largely on modeling the waveform directly, and previous attempts at unconditionally synthesizing audio from neurally generated TF features still struggle to produce audio of satisfactory quality. In this contribution, focusing on the short-time Fourier transform, we discuss the challenges that arise in audio synthesis based on generated TF features and how to overcome them.
We demonstrate the potential of deliberate generative TF modeling with TiFGAN, which successfully generates audio using an invertible TF representation and improves on the current state of the art for audio synthesis with GANs. TiFGAN is an adaptation of DCGAN, which was originally proposed for image generation. The TiFGAN architecture additionally relies on the guidelines and principles for generating short-time Fourier data that we presented in the accompanying paper.
Since the distribution of values in the magnitude of the Short-Time Fourier Transform (STFT) is not well suited for a GAN (see Figure above), we use log-magnitude coefficients. We first normalize the STFT magnitude to have maximum value 1, such that the log-magnitude is confined to (-inf, 0]. Then, the dynamic range of the log-magnitude is limited by clipping at -r (in our experiments r=10), before scaling and shifting to the range of the generator output [-1,1], i.e., dividing by r/2 and then adding the constant 1.
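The normalization, clipping, and rescaling steps can be sketched in a few lines of NumPy. This is an illustrative sketch, not the TiFGAN code: the function names `preprocess_log_magnitude` and `postprocess_log_magnitude` are hypothetical, and the use of the natural logarithm is an assumption, since the text does not state the base.

```python
import numpy as np

def preprocess_log_magnitude(magnitude, r=10.0):
    """Map an STFT magnitude array to the generator range [-1, 1].

    Assumes the natural logarithm; `r` is the clipping threshold.
    """
    magnitude = np.asarray(magnitude, dtype=np.float64)
    peak = magnitude.max()
    if peak <= 0:
        raise ValueError("magnitude must contain a positive value")
    # Normalize so that the maximum is 1, hence log-magnitude <= 0.
    normalized = magnitude / peak
    # log(0) = -inf is expected here and is removed by the clipping below.
    with np.errstate(divide="ignore"):
        log_mag = np.log(normalized)
    # Limit the dynamic range to [-r, 0].
    clipped = np.maximum(log_mag, -r)
    # Dividing by r/2 gives [-2, 0]; adding 1 gives [-1, 1].
    return clipped / (r / 2.0) + 1.0

def postprocess_log_magnitude(scaled, r=10.0):
    """Invert the scaling; returns a magnitude with maximum value 1."""
    scaled = np.asarray(scaled, dtype=np.float64)
    return np.exp((scaled - 1.0) * (r / 2.0))
```

Note that the inverse recovers the magnitude only up to the original peak value, and that coefficients below exp(-r) of the peak are raised to that floor by the clipping.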
The network trained to generate log-magnitudes will be referred to as TiFGAN-M. For TiFGAN-M, the phase derivatives are estimated from the generated log-magnitude. Generation of, and synthesis from, the log-magnitude STFT is the main focus of this contribution. Nonetheless, we also trained a variant architecture TiFGAN-MTF for which we additionally provided the time- and frequency-direction derivatives of the (unwrapped, demodulated) phase.
It may seem straightforward to restore the phase from its time-direction derivative by summation along frequency channels, as proposed by Engel et al. However, even on real, unmodified STFTs, the resulting phase misalignment introduces cancellation between frequency bands, resulting in energy loss; see Figure above (2) for a simple example. In practice, such cancellations often lead to clearly perceptible changes of timbre (see below). Moreover, in areas of small STFT magnitude, the phase is known to be unreliable (see Balazs et al.) and highly sensitive to distortions (see Alaifari et al.), so it cannot be modeled reliably, and synthesis from generated phase derivatives is likely to introduce more distortion.
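For concreteness, the naive summation can be sketched as a cumulative sum over time within each frequency channel. This is a minimal sketch under stated assumptions: the function name `integrate_time_phase_derivative` is hypothetical, the derivative is assumed to be given as per-frame phase increments in radians with shape (channels, frames - 1), and the starting phase of each channel defaults to zero.

```python
import numpy as np

def integrate_time_phase_derivative(phase_increments, initial_phase=None):
    """Rebuild a phase array from frame-to-frame phase increments.

    phase_increments: shape (n_freq, n_frames - 1), radians per hop.
    initial_phase: shape (n_freq,); assumed zero when not provided.
    Returns an array of shape (n_freq, n_frames).
    """
    inc = np.asarray(phase_increments, dtype=np.float64)
    n_freq = inc.shape[0]
    if initial_phase is None:
        start = np.zeros(n_freq)
    else:
        start = np.asarray(initial_phase, dtype=np.float64)
    # Each channel is integrated on its own, independently of its neighbors.
    cumulative = np.cumsum(inc, axis=1)
    return np.concatenate(
        [start[:, None], start[:, None] + cumulative], axis=1
    )
```

Because every channel is summed independently, the time-direction derivative alone does not determine the starting phase of each channel, and any error in an increment is carried into all later frames of that channel.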
Phase-gradient heap integration (PGHI) (see Prusa et al.) relies on the phase-magnitude relations and bypasses phase instabilities by avoiding integration through areas of small magnitude, leading to significantly better and more robust phase estimates; see Figure above (4). In the phaseless reconstruction (PLR) task, PGHI often outperforms more expensive iterative schemes that rely on alternating projections, e.g., Griffin-Lim.
The following examples were computed on samples from the EBU SQAM dataset.