Master 2019-2020
Internships for the SAR specialization
Musical source separation for automatic karaoke generation (machine learning)

Host organization: A-Volute / Nahimic / R&D Team
Location: Lille
Supervisors: Nathan Souviraà-Labastie, R&D engineer (Nahimic); Damien Granger, R&D engineer (Nahimic); Raphaël Greff, CTO and R&D Director (Nahimic)
Dates: February/March, for 5 to 6 months
Compensation: Internship stipend
Keywords: ATIAM track: Acoustics; ATIAM track: Computer music; ATIAM track: Signal processing


COMPANY DESCRIPTION Nahimic (a.k.a. A-Volute) is a company based in Villeneuve d’Ascq, Lille (France) that publishes audio enhancement software for the gaming industry, in particular the Nahimic software on MSI laptops. Nahimic has developed a solution for digital, real-time 3D sound. The suite of audio effects offered by Nahimic includes effects that improve multimedia content (music or movies) and the gaming experience, as well as microphone effects for communication, such as noise reduction. You will join the R&D team, which is in charge of proposing and prototyping innovative audio algorithms.

PROBLEM STATEMENT A karaoke version of a piece of music is a version in which the singer’s voice is no longer present. Such a version is generally presented with the lyrics as subtitles, allowing the user to sing along to the rhythm of the "instrumental" piece. Most of the time, these karaoke versions are produced ("mastered") by hand by a sound engineer. Entertainment companies already have large databases of this type of content. However, they cannot keep up with the number of songs created every day, especially by amateur musicians, and must focus on the most famous songs. An automatic karaoke generation tool would therefore give the general public access to a potentially unlimited karaoke catalogue. Similarly, for streamed content, an automatic (and real-time) tool would also be required.

APPROACHES AND TOPICS Our algorithm already matches the state of the art [12, 13], and many avenues for improvement remain, in terms of both implementation and applications (details below). The successful candidate will work on one or several of the following topics, according to her/his aspirations and skills, using our substantial internal dataset (description available upon request).

*New core algorithm Machine learning is a fast-changing research domain, and an algorithm can go from state of the art to obsolete in less than a year. The work would be to try recent, powerful neural network approaches on the audio source separation task. Research domains outside audio (such as computer vision) may also be considered as sources of inspiration. For instance, the approaches in [14, 6] have shown promising results on other tasks.

*Multi-task approach Metadata such as music style or genre could be used during training. One possible way is to treat those classes as additional tasks to be solved jointly with the separation tasks. This is a very challenging machine learning problem, especially because the tasks are heterogeneous (classification, regression, signal estimation) and only a few studies on audio multi-task learning have been carried out so far (to the advisors’ knowledge, [5, 8, 10] is an exhaustive list). Potential advantages are improved performance on the main task and reduced computational cost in products, since several tasks are solved at the same time. The work would be to investigate this approach, building on previous internal work and the existing network architecture.
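To make the idea of heterogeneous tasks concrete, here is a minimal sketch of a joint objective that adds a genre classification term to a separation term. The function name `multitask_loss`, the task weights, and the choice of a spectrogram mean squared error plus a cross-entropy are assumptions made for illustration; they do not describe the internal architecture.

```python
import numpy as np


def multitask_loss(est_spec, true_spec, genre_logits, genre_label,
                   separation_weight=1.0, genre_weight=0.1):
    """Weighted sum of a separation loss and a genre classification loss.

    Hypothetical example: the weights are free hyperparameters and would
    need tuning, since the two terms live on different scales.
    """
    # Signal-estimation task: mean squared error between spectrograms.
    separation = np.mean((est_spec - true_spec) ** 2)
    # Classification task: cross-entropy via a numerically stable log-softmax.
    shifted = genre_logits - np.max(genre_logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    classification = -log_probs[genre_label]
    return separation_weight * separation + genre_weight * classification


# Toy inputs: a perfect separation and an uninformed 4-class genre prediction.
spec = np.zeros((4, 8))
loss = multitask_loss(spec, spec, np.zeros(4), genre_label=2)
```

With a perfect separation estimate and uniform genre logits over four classes, only the classification term contributes, so the loss equals 0.1 times the natural logarithm of 4.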

*Data augmentation One core point raised by the last SiSEC challenge [12] is that the use of additional data was key for music separation. Most existing approaches use data augmentation (remixing, additional audio effects, reverb), and this could be an avenue for improvement for our current algorithm as well. The work would be to investigate this approach. Interesting recent studies can be found in [2, 9].
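As an illustration of the remixing idea mentioned above, the following sketch creates a new training mixture by applying independent random gains to a vocal stem and an accompaniment stem before summing them. The helper name `remix`, the gain range, and the synthetic sine-wave stems are assumptions made for this example, not a description of the internal pipeline or of the methods in [2, 9].

```python
import numpy as np


def remix(vocals, accompaniment, rng, gain_db_range=(-6.0, 6.0)):
    """Return (mixture, scaled_vocals, scaled_accompaniment).

    Hypothetical helper: each stem receives an independent random gain
    (in dB) before summation, so the ground-truth sources of the new
    mixture are known exactly.
    """
    n = min(len(vocals), len(accompaniment))  # crop to a common length
    low, high = gain_db_range
    gain_v = 10.0 ** (rng.uniform(low, high) / 20.0)
    gain_a = 10.0 ** (rng.uniform(low, high) / 20.0)
    v = gain_v * vocals[:n]
    a = gain_a * accompaniment[:n]
    return v + a, v, a


rng = np.random.default_rng(0)
# Synthetic stand-ins for stems that could come from two different songs.
t = np.arange(16000) / 16000.0
vocals = np.sin(2 * np.pi * 440.0 * t)
accompaniment = 0.5 * np.sin(2 * np.pi * 110.0 * t[:12000])
mixture, v, a = remix(vocals, accompaniment, rng)
```

Because the scaled stems are returned alongside the mixture, each call yields a complete (input, targets) training example.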

*Backing track generation While a karaoke version of a song is the song without the singing voice, a backing track is a version of a song without a given instrument (e.g., drums, guitar, piano). The work would mainly consist of trying new parameters (window size, overlap between windows, instrument-specific spectral bases and cost functions, etc.) so that our current algorithm can also address automatic backing track generation. Interesting recent studies can be found in [7, 3].
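The effect of the window size and overlap parameters mentioned above can be seen with a small short-time Fourier transform sketch. The function `stft`, its default values, and the Hann window are illustrative assumptions; the front end of the actual algorithm is not specified in this offer.

```python
import numpy as np


def stft(signal, window_size=1024, overlap=0.75):
    """Short-time Fourier transform with a Hann window (complex output).

    Returns an array of shape (n_frames, window_size // 2 + 1).
    The signal must contain at least `window_size` samples.
    """
    hop = int(round(window_size * (1.0 - overlap)))
    if hop < 1:
        raise ValueError("overlap too large for this window size")
    window = np.hanning(window_size)
    n_frames = 1 + (len(signal) - window_size) // hop
    frames = np.stack([
        signal[i * hop:i * hop + window_size] * window
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1)


signal = np.zeros(16000)
short_windows = stft(signal, window_size=1024, overlap=0.75)  # finer in time
long_windows = stft(signal, window_size=4096, overlap=0.5)    # finer in frequency
```

For this 16000-sample signal, the first setting gives 59 frames of 513 frequency bins and the second gives 6 frames of 2049 bins, which illustrates the time/frequency resolution trade-off that may matter differently for percussive and sustained instruments.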

*Lyrics generation For a full karaoke experience, subtitles (and sometimes a video clip) are usually required and must be synchronized with the music. A first line of work would be to adapt a state-of-the-art speech-to-text method to the singing voice, yielding a singing-to-text algorithm. A further improvement would be to use the lyrics that are available online in plain text. The task would then be to synchronize and display this text by comparing it with the output of the singing-to-text algorithm.
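The comparison step described above could be sketched as a word-level sequence alignment. The example below uses Python's standard `difflib.SequenceMatcher` to transfer timestamps from recognized words to the reference lyrics; the function `align_lyrics`, the (word, start time) input format, and the sample words are hypothetical, and a real system would likely need a more robust alignment.

```python
from difflib import SequenceMatcher


def align_lyrics(reference_words, recognized):
    """Attach timestamps to reference words that match recognized words.

    `recognized` is a list of (word, start_time_in_seconds) pairs, as a
    hypothetical singing-to-text system might produce. Reference words
    without a match get a timestamp of None.
    """
    recognized_words = [w.lower() for w, _ in recognized]
    ref_lower = [w.lower() for w in reference_words]
    timed = [(w, None) for w in reference_words]
    matcher = SequenceMatcher(a=ref_lower, b=recognized_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            ref_idx = block.a + k
            timed[ref_idx] = (reference_words[ref_idx],
                              recognized[block.b + k][1])
    return timed


reference = ["sing", "along", "with", "the", "music"]
# The recognizer misheard "with" and missed "the" entirely.
recognized = [("sing", 0.5), ("along", 1.1), ("width", 1.9), ("music", 2.6)]
aligned = align_lyrics(reference, recognized)
```

Unmatched words ("with" and "the" here) keep a `None` timestamp; one simple option would be to interpolate their times from the neighbouring matched words before display.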

*Extension Most music separation algorithms are able to handle stereo versions of songs. Enabling our current algorithm to also exploit the spatial information of multi-channel recordings could lead to significant improvements. The work would mainly consist of designing an evolution of the network architecture currently in use (more references upon request). So far, most state-of-the-art approaches have addressed the backing track problem as a one-instrument-versus-the-rest problem, and hence use a specific network for each instrument when multiple instruments are present in the mix. A more challenging problem would be to estimate all the instruments with a single network.

*Subjective cost functions Most audio separation techniques seek to optimize an objective criterion, for example the Itakura-Saito divergence, the mean square error between spectrograms, or energy metrics such as the signal-to-distortion or signal-to-artifact ratio [15]. However, these metrics say little about the quality perceived by the listener, even though perceived quality should be the goal of such a separation algorithm. For example, most audio source separation techniques use a time-frequency masking step, and in many cases this step induces artifacts (chirping) that are audible but not fully captured by objective criteria.
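For reference, an energy-based objective criterion of the kind mentioned above can be written in a few lines. The sketch below computes a simplified signal-to-distortion ratio; the name `simple_sdr` is an assumption, and the function deliberately omits the decomposition of the error into interference and artifact components used in [15].

```python
import numpy as np


def simple_sdr(reference, estimate, eps=1e-12):
    """Simplified signal-to-distortion ratio in dB.

    Computes 10 * log10(||s||^2 / ||s - s_hat||^2). This is NOT the full
    measure of [15]: all of the error is treated as distortion.
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((signal_power + eps) / (error_power + eps))


reference = np.ones(100)
estimate = 0.9 * reference  # uniform 10% amplitude error
sdr_db = simple_sdr(reference, estimate)
```

A uniform 10% amplitude error gives an error power 100 times smaller than the signal power, i.e. about 20 dB. Two estimates with the same value of this metric can nevertheless sound very different, which is the limitation the paragraph above points out.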

Subjective evaluation of source separation results is usually carried out non-automatically, i.e. by humans, e.g. through MUSHRA tests in which listeners assign a score between 0 and 100 to each audio excerpt. Some automatic approaches to subjective evaluation exist but are not often used [1], [4]. In particular, the algorithm for predicting subjective scores developed in [4] (1), although it is the state of the art in this domain, is particularly slow and cannot be used as a cost function when training a source separation algorithm: doing so would increase the training time by 7000% (which is prohibitive given that a complete training run on this task takes on the order of several weeks). In previous work, A-Volute successfully developed an end-to-end subjective score prediction algorithm with the same performance as [4] but with minimal impact on training time. The work would be to demonstrate that this approach improves separation results perceptually, i.e. subjectively. It will first be necessary to convert the algorithm from Keras to PyTorch and verify that the same prediction performance is achieved, and then to use it as a cost function in the training of an open state-of-the-art algorithm such as [11] (2) (3).
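One possible way to use a score predictor as a cost function is to add a penalty derived from the predicted score to a conventional loss. The sketch below is framework-agnostic NumPy for readability; in practice it would be written with differentiable PyTorch tensors. The names `combined_loss` and `toy_predictor`, the weight, and the stand-in predictor are all assumptions: neither the model of [4] nor the internal A-Volute predictor is reproduced here.

```python
import numpy as np


def combined_loss(estimate, target, score_predictor, weight=0.1):
    """Spectrogram MSE plus a penalty derived from a predicted quality score.

    `score_predictor` is a hypothetical callable returning a MUSHRA-like
    score in [0, 100] (higher is better). The penalty is 1 - score / 100,
    so a perfect predicted score adds nothing to the loss.
    """
    mse = np.mean((estimate - target) ** 2)
    score = float(np.clip(score_predictor(estimate, target), 0.0, 100.0))
    return mse + weight * (1.0 - score / 100.0)


def toy_predictor(estimate, target):
    """Stand-in predictor for illustration only (not a perceptual model)."""
    err = np.mean(np.abs(estimate - target))
    return 100.0 / (1.0 + err)


target = np.ones((4, 8))
perfect = combined_loss(target, target, toy_predictor)
degraded = combined_loss(target + 1.0, target, toy_predictor)
```

Here the perfect estimate yields a loss of 0, while the degraded one yields 1.05 (an MSE of 1 plus 0.1 times a penalty of 0.5). The weight balances the objective and perceptual terms and would have to be tuned experimentally.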

*Other applications Audio source separation for gaming and communication applications is also of interest to A-Volute. Transfer learning is one avenue to explore.

(1) Approach based on computing frequency features, followed by an MLP with a layer of 8 neurons. (2) (3) Itself implemented in PyTorch.

SKILLS Who are we looking for? You are preparing a master’s degree and preferably have knowledge of the development and implementation of advanced algorithms for digital audio signal processing, or experience in Natural Language Processing (NLP) or symbolic data processing. Although not mandatory, familiarity with the following fields would be appreciated:
- Audio, acoustics and psychoacoustics.
- Audio effects in general: compression, equalization, etc.
- Machine learning and artificial neural networks.
- Statistics, probabilistic approaches, optimization.
- Programming languages and frameworks: Matlab, Python, PyTorch, Keras, TensorFlow.
- Sound spatialization effects: binaural synthesis, ambisonics, artificial reverberation.
- Voice recognition, voice command.
- Voice processing effects: noise reduction, echo cancellation, array processing.
- Virtual, augmented and mixed reality.
- Computer programming and development: Max/MSP, C/C++/C#.
- Video game engines: Unity, Unreal Engine, Wwise, FMod, etc.
- Audio editing software: Audacity, Adobe Audition, etc.
- Scientific publications and patent applications.
- Fluency in English and French.
- Intellectual curiosity.



[1] M. Cartwright, B. Pardo and G. J. Mysore. "Crowdsourced pairwise-comparison for source separation evaluation". In: p. 5.

[2] A. Cohen-Hadria, A. Roebel and G. Peeters. "Improving singing voice separation using Deep U-Net and Wave-U-Net with data augmentation". In: arXiv preprint arXiv:1903.01415 (2019).

[3] D. Ditter and T. Gerkmann. "A Multi-Phase Gammatone Filterbank for Speech Separation via TasNet". In: arXiv preprint arXiv:1910.11615 (2019).

[4] V. Emiya et al. "Subjective and Objective Quality Assessment of Audio Source Separation". In: IEEE Transactions on Audio, Speech, and Language Processing 19.7 (Sept. 2011), p. 2046-2057.

[5] P. Georgiev et al. "Low-resource multi-task audio sensing for mobile and embedded devices via shared deep neural network representations". In: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1.3 (2017), p. 50.

[6] Language Modelling on Penn Treebank (Word Level). 2019 (accessed December 5, 2019).

[7] M. Pariente et al. "Filterbank design for end-to-end speech separation". In: arXiv preprint arXiv:1910.10400 (2019).

[8] G. Pironkov, S. Dupont and T. Dutoit. "Multi-task learning for speech recognition: an overview". In: ESANN. 2016.

[9] L. Prétet et al. "Singing voice separation: a study on training data". In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2019, p. 506-510.

[10] D. Stoller, S. Ewert and S. Dixon. "Jointly detecting and separating singing voice: A multi-task approach". In: International Conference on Latent Variable Analysis and Signal Separation. Springer. 2018, p. 329-339.

[11] F.-R. Stöter et al. "Open-Unmix - A Reference Implementation for Music Source Separation". In: Journal of Open Source Software (2019).

[12] F.-R. Stöter, A. Liutkus and N. Ito. "The 2018 signal separation evaluation campaign". In: International Conference on Latent Variable Analysis and Signal Separation. Springer. 2018, p. 293-305.

[13] N. Takahashi and Y. Mitsufuji. "Multi-scale Multi-band DenseNets for Audio Source Separation". In: 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). June 29, 2017. arXiv: 1706.09588.

[14] A. Vaswani et al. "Attention is all you need". In: Advances in Neural Information Processing Systems. 2017, p. 5998-6008.

[15] E. Vincent, R. Gribonval and C. Févotte. "Performance measurement in blind audio source separation". In: IEEE Transactions on Audio, Speech, and Language Processing 14.4 (July 2006), p. 1462-1469.