Sound examples from Adversarial Audio Synthesis

Chris Donahue, Julian McAuley, Miller Puckette

We present sound examples from our WaveGAN and SpecGAN models (paper). Each sound file contains fifty one-second examples concatenated together, with a half second of silence after each. All models were trained in the unsupervised setting, and the results here are a random sampling of fifty latent vectors.
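The layout of each file can be sketched with a small helper (this function is our own illustration, not from the paper; it assumes 16 kHz mono float arrays, the sample rate of the Speech Commands audio):

```python
import numpy as np

SR = 16000  # assumed sample rate for the one-second clips

def concatenate_examples(clips, sr=SR, gap_s=0.5):
    """Join one-second clips into a single file, inserting a
    half-second of silence after each clip."""
    gap = np.zeros(int(sr * gap_s), dtype=np.float32)
    return np.concatenate([np.concatenate([c, gap]) for c in clips])
```

For fifty one-second clips at 16 kHz this yields a single array of 50 × (16000 + 8000) samples.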

Speech Commands Zero through Nine (SC09)

The SC09 dataset, a subset of the Speech Commands dataset (license), has many speakers and a ten-word vocabulary. Our WaveGAN and SpecGAN models learn the ten semantic modes (words) of this dataset without supervision. The results are arranged in numerical order by post-hoc labeling of random examples using the classifier discussed in the paper.
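The post-hoc ordering amounts to labeling each generated example with the classifier and sorting by the predicted digit; a minimal sketch (the `predict` callable is a stand-in for the paper's classifier, which is not reproduced here):

```python
import numpy as np

def order_by_label(examples, predict):
    """Sort generated examples so they run 'zero' through 'nine',
    using an external classifier's predicted digit (0-9) as the key."""
    labels = [predict(x) for x in examples]
    order = np.argsort(labels, kind="stable")  # stable keeps ties in order
    return [examples[i] for i in order]
```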

Bird vocalizations

Bird vocalizations collected from Peter Boesman (license)


Drums

Single drum hits from drum machines collected from here


Piano

Professional pianist playing a variety of Bach compositions (original dataset collected for this work by the authors)

Large vocabulary speech (TIMIT)

The TIMIT dataset consists of many speakers reading English sentences. Recording conditions were carefully controlled and the utterances have much less noise than those from SC09.

Real data resynthesized with Griffin-Lim

The SpecGAN model operates on lossy audio spectrograms. As a point of comparison, we provide examples of the real data projected into this domain and resynthesized back into audio. This is useful for gauging roughly how much distortion is introduced by the audio feature representation itself versus the SpecGAN generative process.
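Resynthesis from magnitude-only spectrograms is done with the Griffin-Lim algorithm, which iteratively re-estimates the discarded phase. A minimal SciPy sketch of the idea (the function name, STFT settings, and iteration count below are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=60, nperseg=256, noverlap=192, seed=0):
    """Recover a time-domain signal from a magnitude spectrogram by
    iterative phase re-estimation (Griffin & Lim, 1984)."""
    rng = np.random.default_rng(seed)
    # start from a random phase estimate
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        # invert with the current phase, then re-analyze the result
        _, x = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
        _, _, spec = stft(x, nperseg=nperseg, noverlap=noverlap)
        # keep only the new phase; crop/pad in case the STFT round trip
        # changed the frame count by one
        ang = np.angle(spec)[:, :mag.shape[1]]
        if ang.shape[1] < mag.shape[1]:
            ang = np.pad(ang, ((0, 0), (0, mag.shape[1] - ang.shape[1])))
        phase = np.exp(1j * ang)
    _, x = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
    return x
```

Because the phase is only approximated, even real audio passed through this magnitude-spectrogram round trip acquires some audible distortion, which is exactly what these reference examples let you hear.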

Quantitative evaluation experiments

We also provide examples for all models in Table 1 of our paper.

Comparison to existing methods

WARNING: Loud volume

On the SC09 dataset, we also compare to two other methods that learn to generate audio in the unsupervised setting: WaveNet (Oord et al. 2016) and SampleRNN (Mehri et al. 2017). While these methods produce reasonable examples when trained on speech datasets with cleaner recording conditions and less sparsity, neither appears to produce semantically meaningful results on the single-word SC09 dataset.