Master 2017-2018
Internships of the SAR specialization
Singing Synthesis with Deep Neural Networks

Site: Analysis/Synthesis Team, IRCAM
Location: IRCAM, 1, place Igor-Stravinsky, 75004 Paris
Supervisor: Axel Roebel
Dates: 01/02/2018 to 30/06/2018
Stipend: €600/month + benefits (RATP transport tickets and meal vouchers)
Keywords: ATIAM track: Signal processing



In the context of the ANR project Chanter, the Analysis/Synthesis team has developed the singing synthesis system ISiS, based on concatenative synthesis [Ardaillon 2016, 2017] and semi-parametric speech transformation [Degottex 2013, Huber 2015].

Over the last two years, signal synthesis algorithms based on deep neural networks (DNNs) have gained significant momentum. The WaveNet [van den Oord 2016] and Deep Voice [Arik 2017] architectures for speech, as well as related approaches that use convolutional neural networks to synthesize the spectral envelope of singing signals [Blaauw 2017], have outperformed previous state-of-the-art systems in terms of signal quality and flexibility, and the most recent implementations are also competitive in terms of computational complexity.


Starting with a literature survey of current DNN architectures proposed for signal synthesis, the intern will propose a deep neural network architecture intended to replace the concatenative synthesis of spectral envelopes in the ISiS singing synthesis system.

The neural network will be implemented using, for example, TensorFlow [TF 2017], trained on the Analysis/Synthesis team's GPU cluster with the singer databases recorded within the Chanter project, and integrated into the ISiS system, which will allow it to be compared with the existing concatenative approach by means of subjective perceptual tests. Of major interest is the investigation of approaches that allow coherent pitch and intensity modifications of the singing signal by integrating small databases that cover pitch and intensity variations for the different singers.


[Ardaillon 2017] L. Ardaillon, "Synthesis and Expressive Transformation of Singing Voice", PhD thesis, Sorbonne University/UPMC, defended 18 September 2017.
[Ardaillon 2016] L. Ardaillon, C. Chabot-Canet, A. Roebel, "Expressive control of singing voice synthesis using musical contexts and a parametric F0 model", Proc. Interspeech 2016, pp. 1250-1254.
[Arik 2017] S. Arik et al., "Deep Voice 2: Multi-Speaker Neural Text-to-Speech", arXiv:1705.08947.
[Blaauw 2017] M. Blaauw and J. Bonada, "A Neural Parametric Singing Synthesizer", arXiv:1704.03809.
[Degottex 2013] G. Degottex, P. Lanchantin, A. Roebel, X. Rodet, "Mixed source model and its adapted vocal tract filter estimate for voice transformation and synthesis", Speech Communication, Vol. 55, No. 2, pp. 278-294, 2013.
[Huber 2015] S. Huber and A. Roebel, "On glottal source shape parameter transformation using a novel deterministic and stochastic speech analysis and synthesis system", Proc. Interspeech 2015.
[TF 2017] TensorFlow.
[van den Oord 2016] A. van den Oord et al., "WaveNet: A Generative Model for Raw Audio", arXiv:1609.03499.