Master 2018-2019
Internships of the SAR specialization
Multimodal variational learning for musical generation

Location: IRCAM, Représentations Musicales team
Supervisors: Philippe Esling, Mathieu Prang
Dates: 18/02/2019 to 18/08/2019
Compensation: standard IRCAM rate
Keywords: ATIAM track: Computer music


This project aims to develop new joint representations for symbolic and audio music. The goal is to represent musical symbols and the corresponding short audio excerpts in the same space, called a multimodal embedding space. This approach makes it possible to address the problem of matching musical audio directly to musical symbols. Moreover, these kinds of spaces could be very powerful tools for the field of orchestration. By disentangling the correlation between the orchestral score and the resulting audio signal, we can provide efficient systems to analyze and generate specific orchestral effects. It has been shown that embedding spaces provide metric relationships over semantic concepts [Mikolov13]. These approaches can generate analogies, infer missing modalities and even perform knowledge inference [Socher13]. Hence, such metric relationships could be used for musical purposes. In the context of orchestration, this would allow us to find audio signals that have identical timbre properties but come from widely different musical notations (different scores leading to a similar perceptual effect).
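As an illustration of the metric relationships mentioned above, the vector-offset method for analogies can be sketched as follows. This is a minimal sketch on hand-made toy vectors: the dictionary `toy`, its labels and its values are hypothetical and chosen only so that the offset is easy to read, and the function name `analogy` is an assumption of this example rather than part of any cited work.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset method.

    The target point is emb[b] - emb[a] + emb[c]; the answer is the
    remaining item whose embedding is closest to it in cosine similarity.
    """
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Hypothetical toy embeddings (not learned from data).
toy = {
    "man": [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king": [0.2, 0.0, 1.0],
    "queen": [0.2, 1.0, 1.0],
    "apple": [-1.0, 0.3, -0.5],
}

result = analogy(toy, "man", "woman", "king")
```

In a multimodal space, the same arithmetic could in principle be applied to embeddings of score excerpts and audio excerpts instead of words.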

In order to capture features of both modalities and represent them within a joint space, the idea is to train two encoding models simultaneously so that they map corresponding inputs to similar points in the common space. First, you will prepare your dataset by synthesizing and aligning the audio that corresponds to given MIDI files. Then, you will implement the model proposed by Dorfer et al. (2018), which is composed of two networks, and train it on your synthesized dataset. Once your model performs well on the training data, you will test it on real data through two tasks: (1) piece/score identification from audio queries and (2) retrieval of relevant performances given a score as a search query. Finally, you will propose (or even implement) improvements to the architecture or the training of the model.
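The training objective and the two retrieval tasks described above can be sketched as follows. This is a minimal sketch, not the reference implementation: it assumes a max-margin pairwise ranking loss on cosine similarity, which is one common choice for cross-modal embedding and should be checked against the paper. The two encoders are left out, and the embeddings `audio_emb` and `score_emb`, the margin value and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_ranking_loss(audio_emb, score_emb, margin=0.7):
    """Hinge loss over a batch of aligned (audio, score) embeddings.

    The matching pair (i, i) should be more similar than every
    mismatched pair (i, j) by at least `margin`.
    """
    total = 0.0
    for i, a in enumerate(audio_emb):
        pos = cosine(a, score_emb[i])
        for j, s in enumerate(score_emb):
            if j != i:
                total += max(0.0, margin - pos + cosine(a, s))
    return total / len(audio_emb)

def retrieve(query, candidates):
    """Indices of candidates, from most to least similar to the query."""
    return sorted(range(len(candidates)),
                  key=lambda k: cosine(query, candidates[k]),
                  reverse=True)

# Hypothetical encoder outputs: row i of each list comes from the same excerpt.
audio_emb = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.2]]
score_emb = [[0.9, 0.0], [0.1, 0.9], [-0.8, 0.1]]

aligned_loss = pairwise_ranking_loss(audio_emb, score_emb)
# Swapping two scores breaks the alignment, so the loss should increase.
shuffled_loss = pairwise_ranking_loss(
    audio_emb, [score_emb[1], score_emb[0], score_emb[2]])
ranking = retrieve(audio_emb[0], score_emb)  # score identification from audio
```

Task (1) corresponds to calling `retrieve` with an audio embedding as the query and score embeddings as candidates; task (2) swaps the roles of the two modalities.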

A wide set of first-of-their-kind composition applications and music generation tasks can then be explored from this multimodal inference. The link between different perceptual effects could also be evaluated through metric analysis, which will contribute to the foundations of a theory of orchestration. Finally, the generative mechanism of variational learning could be exploited to directly generate orchestrations based on perceptual effects, simply by moving across the latent space. Hence, this project aims to understand higher-level semantic regularities, along with the innovative re-use of embedding spaces in a generative framework.


Mikolov, T., Yih, W.-T. & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751.

Socher, R., Ganjoo, M., Manning, C. & Ng, A. (2013). Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems, pages 935–943.

Michalski, V., Memisevic, R. & Konda, K. (2014). Modeling deep temporal dependencies with recurrent grammar cells. In Advances in Neural Information Processing Systems, pages 1925–1933.

Dorfer, M., Hajic Jr., J., Arzt, A., Frostel, H. & Widmer, G. (2018). Learning audio–sheet music correspondences for cross-modal retrieval and piece identification. Transactions of the International Society for Music Information Retrieval, 1(1).