F0 Modeling For Singing Voice Synthesizers with LSTM Recurrent Neural Networks

Title: F0 Modeling For Singing Voice Synthesizers with LSTM Recurrent Neural Networks
Publication Type: Master Thesis
Year of Publication: 2015
Authors: Özer, S.
Abstract: In the singing voice synthesis process, the score and lyrics of a target song are converted into singing voice expression parameters such as F0, spectra and dynamics. This study focuses on the F0 parameter: it aims to model and automatically generate F0 while ensuring expressiveness and human-likeness in the final synthesized singing voice. Musical context is an important factor in the evolution of F0 over the course of a singing performance. We therefore propose a machine-learning framework that learns the F0 of singing from a set of real human singing recordings with respect to their musical contexts, while capturing the expressiveness and naturalness of the human singer. Given the musical contexts of a score, the trained model can then generate the F0 parameter automatically. Recurrent neural networks with Long Short-Term Memory (LSTM) units are applied to this specific problem for the first time, chosen for their flexibility and their strength in modeling complex sequences. Two recurrent neural networks are trained to learn the baseline and vibrato parts of F0 separately. F0 sequences are then generated from the trained networks and applied to a singing voice synthesizer. Finally, the synthesized songs are evaluated with AB preference tests.
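The abstract states that the baseline and vibrato parts of F0 are learned by separate networks, but it does not say how a recorded F0 contour is split into those two parts. The sketch below illustrates one simple possibility, assuming a moving-average smoother for the baseline and treating the residual as vibrato. The function name `split_f0`, the frame rate, the window length and the use of cents as the pitch unit are all hypothetical choices made for illustration, not details taken from the thesis.

```python
import numpy as np


def split_f0(f0, frame_rate_hz=200.0, window_s=0.3):
    """Split an F0 contour into a smooth baseline and a vibrato residual.

    Assumption: a centered moving average stands in for the baseline
    extractor; the thesis may use a different decomposition.
    """
    f0 = np.asarray(f0, dtype=float)
    # Window length in frames, forced to be odd so the average is centered.
    win = max(1, int(round(window_s * frame_rate_hz)))
    if win % 2 == 0:
        win += 1
    pad = win // 2
    # Repeat the edge values so the output has the same length as the input.
    padded = np.pad(f0, pad, mode="edge")
    kernel = np.ones(win) / win
    baseline = np.convolve(padded, kernel, mode="valid")
    # Whatever the smoother removes is treated as the vibrato part.
    vibrato = f0 - baseline
    return baseline, vibrato


# Synthetic example: a steady note (6000 cents, hypothetical unit) with a
# 5.5 Hz vibrato of 50 cents depth, sampled at 200 frames per second.
t = np.arange(400) / 200.0
f0 = 6000.0 + 50.0 * np.sin(2.0 * np.pi * 5.5 * t)
baseline, vibrato = split_f0(f0)
```

Under this assumed decomposition the two parts sum back to the original contour, so sequences generated separately for each part could be recombined by simple addition before being passed to a synthesizer.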