We present sound examples from our WaveGAN and SpecGAN models (paper). Each sound file contains fifty one-second examples concatenated together, with a half second of silence after each example. All models are trained in the unsupervised setting, and the results here are a random sampling of fifty latent vectors.
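The per-file layout described above can be reproduced with a few lines of NumPy. This is a sketch only; `SR` and the helper name are our own and not from the released code:

```python
import numpy as np

SR = 16000  # SC09 audio is 16 kHz

def concat_examples(examples, gap_s=0.5, sr=SR):
    """Concatenate one-second examples, inserting gap_s seconds of
    silence after each, matching the layout of the demo files."""
    gap = np.zeros(int(gap_s * sr), dtype=np.float32)
    return np.concatenate([np.concatenate([e, gap]) for e in examples])
```

For fifty one-second examples at 16 kHz this yields a file of 50 × 1.5 s = 75 s.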
Speech Commands Zero through Nine (SC09)
The SC09 dataset, a subset of the Speech Commands dataset (license), has many speakers and a ten-word vocabulary. Our WaveGAN and SpecGAN models learn the ten semantic modes (words) of this dataset without supervision. Our results are arranged in numerical order by post-hoc labeling of random examples using the classifier discussed in the paper.
Single drum hits from drum machines, collected from here
Professional pianist playing a variety of Bach compositions (original dataset collected for this work by the authors)
Large vocabulary speech (TIMIT)
The TIMIT dataset consists of many speakers reading English sentences. Recording conditions were carefully controlled and the utterances have much less noise than those from SC09.
Real data resynthesized with Griffin-Lim
The SpecGAN model operates on lossy audio spectrograms. As a point of comparison, we provide examples of the real data projected into this domain and resynthesized back into audio. This is useful for gauging roughly how much distortion is caused by the audio feature representation versus how much by the SpecGAN generative process itself.
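This resynthesis baseline can be sketched with a minimal Griffin-Lim loop, here built on `scipy.signal.stft`/`istft`. Note this is an illustration: the paper's exact STFT parameters, normalization, and iteration count may differ.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=32, nperseg=256, noverlap=128):
    """Estimate audio from a magnitude spectrogram |S| by iteratively
    alternating inverse/forward STFT projections (Griffin-Lim),
    starting from random phase."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        _, audio = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
        _, _, spec = stft(audio, nperseg=nperseg, noverlap=noverlap)
        phase = np.exp(1j * np.angle(spec))  # keep phase, discard magnitude
    _, audio = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
    return audio
```

Projecting real audio with `stft`, discarding phase, and inverting with `griffin_lim` reproduces the "real data resynthesized" condition: any audible artifacts come from the lossy spectrogram representation, not from a generative model.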
Quantitative evaluation experiments
We also provide examples for all models in Table 1 of our paper.
Parametric (Buchanan 2017)
WaveGAN + phase shuffle (n=2)
WaveGAN + phase shuffle (n=4)
WaveGAN + nearest neighbor
WaveGAN + linear interpolation
WaveGAN + cubic interpolation
WaveGAN + post-processing filter
WaveGAN + DCGAN + batchnorm
WaveGAN + dropout
SpecGAN + phase shuffle (n=1)
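Phase shuffle (the n=2 and n=4 variants above) randomly time-shifts each discriminator layer's activations by up to n samples, filling boundaries by reflection. A minimal NumPy sketch, assuming a (batch, time, channels) activation layout; the released code may structure this differently:

```python
import numpy as np

def phase_shuffle(x, n, rng):
    """Shift activations x of shape (batch, time, channels) along the
    time axis by a random offset drawn uniformly from [-n, n],
    filling the vacated edge with reflection padding."""
    s = int(rng.integers(-n, n + 1))
    if s == 0:
        return x
    t = x.shape[1]
    padded = np.pad(x, ((0, 0), (abs(s), abs(s)), (0, 0)), mode='reflect')
    start = abs(s) + s  # maps s in [-n, n] to a start index in [0, 2n]
    return padded[:, start:start + t, :]
```

The output shape always matches the input, so the operation can be dropped in after any discriminator layer.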
Comparison to existing methods
WARNING: Loud volume
On the SC09 dataset, we also compare to two other methods that learn to generate audio in the unsupervised setting: WaveNet (Oord et al. 2016) and SampleRNN (Mehri et al. 2017). While these methods produce reasonable examples when trained on speech datasets with cleaner recording conditions and less sparsity, neither appears to produce semantically meaningful results on the single-word SC09 dataset.
WaveNet (Oord et al. 2016) public implementation 1 (link)
WaveNet (Oord et al. 2016) public implementation 2 (link)
SampleRNN (Mehri et al. 2017) official implementation 1 (link)