Master 2018-2019
Internships of the SAR specialization
Invertible Speech Encodings for Speech and Singing Synthesis

Website: Invertible Speech Encodings for Speech and Singing Synthesis
Location: Analysis/Synthesis team, IRCAM, 1, place Igor-Stravinsky, 75004 Paris
Supervisors: Axel Roebel, Nicolas Obin
Dates: 01.02.2019 - 30.07.2019
Remuneration: 600€ / month + benefits (RATP transport tickets and meal vouchers)
Keywords: ATIAM track: Signal processing



The Analysis/Synthesis team of IRCAM has a long history of developing state-of-the-art speech and singing synthesis and transformation algorithms [Beller 2009, Roebel 2010, Lanchantin 2011, Ardaillon 2017]. The recent success of deep-learning-based speech synthesis systems [Oord 2016, Blaauw 2017, Shen 2017] has fundamentally shifted the focus of speech synthesis research away from HMM and concatenative approaches towards deep neural networks that directly learn to convert the input text into a voice representation (mel-band or STFT magnitude spectrograms), which is then converted into speech using either classical methods [Griffin and Lim 1984] or the WaveNet vocoder [Shen 2017, Oord 2017, Lorenzo-Trueba 2018]. WaveNet vocoders operating on mel-band magnitude input are complex, slow, and computationally costly. Given that neither the mel-band nor the STFT magnitude representation can be inverted unambiguously, the question of the optimal intermediate speech representation appears to be of central importance.
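The ambiguity of magnitude-only representations can be illustrated with a minimal sketch. It is not part of the internship material: the toy frame and the helper `dft_magnitude` are hypothetical, and a single DFT frame stands in for one STFT analysis window. The sketch shows that three different signals share the same magnitude spectrum, so the magnitude alone cannot determine which one was observed. A mel-band representation pools frequency bins into fewer bands and therefore discards additional information.

```python
import cmath
import math


def dft_magnitude(signal):
    """Return the magnitudes of the discrete Fourier transform of a real signal."""
    n = len(signal)
    magnitudes = []
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)
        )
        magnitudes.append(abs(coeff))
    return magnitudes


def same_magnitudes(a, b, tol=1e-9):
    """Compare two magnitude spectra up to a small numerical tolerance."""
    return all(math.isclose(x, y, abs_tol=tol) for x, y in zip(a, b))


# Hypothetical toy frame standing in for one analysis window of a voice signal.
frame = [0.0, 1.0, 0.5, -0.25, -1.0, 0.3, 0.8, -0.6]

# Two different frames derived from it: sign-flipped and time-reversed.
flipped = [-x for x in frame]
reversed_frame = frame[::-1]

mag = dft_magnitude(frame)
same_as_flipped = same_magnitudes(mag, dft_magnitude(flipped))
same_as_reversed = same_magnitudes(mag, dft_magnitude(reversed_frame))

print(same_as_flipped, same_as_reversed)
```

In a real STFT, overlapping frames constrain the phase further, which is what iterative methods such as [Griffin and Lim 1984] exploit, but the magnitude still does not fix a unique signal; a global sign flip, for example, leaves every frame magnitude unchanged.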

Objectives:

The aim of this internship is to develop new representations of speech and singing voice signals for use in speech or singing synthesis. A central objective of this work is to devise a computationally efficient, invertible representation of arbitrary voice signals. The approaches to be studied will not be disclosed publicly. The intern will develop innovative encoder/decoder signal processing chains, evaluate them using speech and singing databases, and investigate the use of the new approaches in the context of Tacotron2-type [Shen 2017] speech synthesis systems. All implementations will be done in Python using the TensorFlow framework [TF 2017]. The networks will be trained on the GPU cluster of the Analysis/Synthesis team.


[Ardaillon 2017] L. Ardaillon, “Synthesis and expressive transformation of singing voice”, PhD thesis, Université Paris 6 (UPMC), 2017.

[Beller 2009] G. Beller, “Analyse et modèle génératif de l’expressivité. Application à la parole et à l’interprétation musicale”, PhD thesis, IRCAM, 2009.

[Blaauw 2017] M. Blaauw and J. Bonada, “A Neural Parametric Singing Synthesizer Modeling Timbre and Expression from Natural Songs”, Applied Sciences, vol. 7, no. 12, p. 1313, 2017.

[Roebel 2010] A. Roebel, “Shape-invariant speech transformation with the phase vocoder”, Proc. International Conf. on Spoken Language Processing (InterSpeech), pp. 2146-2149, 2010.

[Lanchantin 2011] P. Lanchantin, et al., “Vivos Voco: A survey of recent research on voice transformation at IRCAM”, Proc. International Conf. on Digital Audio Effects (DAFx), pp. 277-285, 2011.

[Lorenzo-Trueba 2018] J. Lorenzo-Trueba, et al., “Robust universal neural vocoding”, arXiv:1811.06292, 2018.

[Oord 2016] A. van den Oord, et al., “WaveNet: A Generative Model for Raw Audio”, arXiv:1609.03499v2, 2016.

[Oord 2017] A. van den Oord, et al., “Parallel WaveNet: Fast High-Fidelity Speech Synthesis”, arXiv:1711.10433, 2017.

[Shen 2017] J. Shen, et al., “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions”, arXiv:1712.05884, 2017.

[TF 2017]