Master 2018-2019
Internships of the SAR specialization
Invertible Speech Encodings for Speech and Singing Synthesis

Site: Invertible Speech Encodings for Speech and Singing Synthesis
Location: Analysis/Synthesis team, IRCAM, 1, place Igor-Stravinsky, 75004 Paris
Supervisors: Axel Roebel, Nicolas Obin
Dates: 01.02.2019 - 30.07.2019
Remuneration: 600€ / month + benefits (RATP transport tickets and meal vouchers)
Keywords: ATIAM track: Signal processing



The Analysis/Synthesis team of IRCAM has a long history of developing state-of-the-art speech and singing synthesis and transformation algorithms [Beller 2009, Roebel 2010, Lanchantin 2011, Ardaillon 2017]. The recent success of deep-learning-based speech synthesis systems [Oord 2016, Blaauw 2017, Shen 2017] has fundamentally shifted the focus of speech synthesis research away from HMM and concatenative approaches towards deep neural networks that directly learn to convert the input text into a voice representation (mel-band or STFT magnitude spectrograms), which is then converted into speech using either classical methods [Griffin and Lim 1984] or the WaveNet vocoder [Shen 2017, Oord 2017, Lorenzo-Trueba 2018]. WaveNet vocoders operating on mel-band magnitude input are complex, very slow, and computationally costly. Because the mel-band and STFT magnitude representations discard phase information and therefore cannot be inverted unambiguously, the question of the optimal intermediate speech representation seems to be of central importance.
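
The ambiguity of magnitude-only representations can be illustrated with a minimal NumPy sketch. It is not one of the approaches studied in the internship. The helper name `stft_magnitude`, the frame length, the hop size, and the random test signal are all assumptions chosen for the example. It shows that a signal and its polarity-inverted copy are different waveforms with identical STFT magnitudes, so the magnitude alone cannot determine the signal.

```python
import numpy as np

def stft_magnitude(x, frame_len=256, hop=64):
    """Magnitude of a Hann-windowed STFT, shape (frames, frequency bins).

    Hypothetical helper for illustration; parameters are arbitrary choices.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Taking the absolute value discards the phase of every bin.
    return np.abs(np.fft.rfft(frames, axis=1))

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)  # stand-in for a voice signal (assumption)
y = -x                         # polarity-inverted copy: a different waveform

mag_x = stft_magnitude(x)
mag_y = stft_magnitude(y)

same_magnitude = np.allclose(mag_x, mag_y)  # magnitudes coincide
same_signal = np.allclose(x, y)             # waveforms do not
```

Polarity inversion is only the simplest case. Any inversion method working from magnitudes, such as the iterative scheme of [Griffin and Lim 1984], has to estimate the missing phase.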

Objectives:

The aim of the present internship is to develop new representations of speech and singing voice signals for use in speech or singing synthesis. A central objective of this work is to devise a computationally efficient, invertible representation of arbitrary voice signals. The approaches to be studied will not be disclosed publicly. The intern will develop innovative encoder/decoder signal processing chains, evaluate them using speech and singing databases, and investigate the use of the new approaches in the context of Tacotron2-type [Shen 2017] speech synthesis systems. All implementations will be written in Python using the TensorFlow framework [TF 2017]. The networks will be trained on the GPU cluster of the Analysis/Synthesis team.


