Improving automatic phonetic segmentation for creating singing voice synthesizer corpora

Title: Improving automatic phonetic segmentation for creating singing voice synthesizer corpora
Publication Type: Master Thesis
Year of Publication: 2013
Authors: Jewalikar, V.

Phonetic segmentation is the division and classification of a sound signal into a string of phones. It is a fundamental step in using a corpus for a singing voice synthesizer. We propose improvements to an existing automatic phonetic segmentation method by adding more relevant descriptors to the computed feature set and by using a different regression model.
We start with a short introduction to singing voice synthesizers and how their corpora are created. We discuss the importance of automatic phonetic segmentation for these corpora. We briefly review and critique works relevant to phonetic segmentation of both speech and singing voice. This is followed by an introduction to score predictive modelling and how it can benefit from some fundamental modifications.
We then describe in detail how score predictive modelling is adapted for our corpus and how it is implemented. The corpus contains sentences sung in Spanish by a professional female singer, together with accurate manual phonetic segmentation information. It is divided into a train set and a test set (in a 3 to 1 ratio respectively). Relevant audio features are extracted, and these serve as the backbone for training and testing the machine learning models. A score function is calculated for candidate boundaries in the train set. The scores and features for the train set are used to train random forest regression models. These trained models (called score predictive models) are used to predict improved phoneme boundaries for the test set, searching around the boundaries predicted by Hidden Markov Models (HMMs). The predicted boundaries are then evaluated against the manually labelled boundaries (true boundaries) and against the boundaries previously found using HMMs (baseline).
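The refinement and evaluation steps described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the candidate grid (a 50 ms half-window in 5 ms steps), the triangular shape of the training score, and the 20 ms evaluation tolerance are all assumptions chosen for illustration. The trained random forest is stood in for by any callable that maps a candidate time to a predicted score. A real model would score a vector of audio features extracted at that time.

```python
from typing import Callable, List, Sequence


def candidate_times(hmm_boundary: float,
                    half_window: float = 0.05,
                    step: float = 0.005) -> List[float]:
    """Candidate boundary times (seconds) on a grid centred on the HMM boundary.

    The window and step sizes are illustrative assumptions.
    """
    n = int(round(half_window / step))
    return [round(hmm_boundary + k * step, 6) for k in range(-n, n + 1)]


def training_score(candidate: float,
                   true_boundary: float,
                   width: float = 0.02) -> float:
    """Assumed target score for training: 1.0 at the manual boundary,
    falling linearly to 0.0 at `width` seconds away."""
    return max(0.0, 1.0 - abs(candidate - true_boundary) / width)


def refine_boundary(hmm_boundary: float,
                    predict_score: Callable[[float], float]) -> float:
    """Return the candidate near the HMM boundary with the highest predicted score.

    `predict_score` is a stand-in for a trained score predictive model.
    """
    return max(candidate_times(hmm_boundary), key=predict_score)


def within_tolerance(predicted: Sequence[float],
                     reference: Sequence[float],
                     tol: float = 0.02) -> float:
    """Fraction of predicted boundaries within `tol` seconds of the reference.

    The two sequences are assumed to be aligned one to one.
    """
    hits = sum(1 for p, r in zip(predicted, reference) if abs(p - r) <= tol)
    return hits / len(reference)


if __name__ == "__main__":
    true_boundaries = [0.50, 1.20]   # manual labels (made-up values)
    hmm_boundaries = [0.53, 1.17]    # baseline (made-up values)

    # An oracle scorer replaces the trained model, only to exercise the code.
    refined = [
        refine_boundary(h, lambda t, tb=tb: training_score(t, tb))
        for h, tb in zip(hmm_boundaries, true_boundaries)
    ]
    print("baseline accuracy:", within_tolerance(hmm_boundaries, true_boundaries))
    print("refined accuracy:", within_tolerance(refined, true_boundaries))
```

Passing the scorer in as a callable keeps the search and evaluation logic independent of the regression model, so a different regressor could be substituted without changing the rest of the sketch.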
The results obtained are promising and justify our modifications of using a larger feature set and a different regression model. A number of interesting possibilities for future work are presented. We conclude with a summary of the work, its conclusions and its contributions.
