Improving automatic phonetic segmentation for creating singing voice synthesizer corpora

Varun Jewalikar

Note: This bibliographic page is archived and will no longer be updated. For an up-to-date list of publications from the Music Technology Group see the Publications list.

Title	Improving automatic phonetic segmentation for creating singing voice synthesizer corpora
Publication Type	Master Thesis
Year of Publication	2013
Authors	Jewalikar, V.
Abstract	Phonetic segmentation is the breakup and classification of the sound signal into a string of phones. This is a fundamental step in using a corpus for a singing voice synthesizer. We propose improvements to an existing automatic phonetic segmentation method by adding more relevant descriptors to the computed feature set and by using a different regression model. We start with a short introduction to singing voice synthesizers and how their corpora are created. We discuss the importance of automatic phonetic segmentation for these corpora. We briefly review and critique works relevant to phonetic segmentation of both speech and singing voice. This is followed by an introduction to score predictive modelling and how it can benefit from some fundamental modifications. We then describe in detail how score predictive modelling is adapted to our corpus and how it is implemented. The corpus contains sentences sung in Spanish by a professional female singer, together with accurate manual phonetic segmentation information. The corpus is divided into a train set and a test set in a 3 to 1 ratio. Relevant audio features are extracted, and these serve as the backbone for training and testing the machine learning models. A score function is calculated for candidate boundaries in the train set. The scores and features for the train set are used to train random forest regression models. For the test set, these trained models (called score predictive models) are used to predict improved phoneme boundaries around the boundaries predicted by Hidden Markov Models (HMMs). The predicted boundaries are then evaluated against the manually labelled boundaries (true boundaries) and the boundaries previously found using HMMs (baseline). The results obtained are promising and justify our modifications of using a large feature set and a different regression model. A number of interesting possibilities for future work are presented. We close with a summary of the work, its conclusions, and its contributions.
Final publication	https://doi.org/10.5281/zenodo.1161281
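
The boundary refinement and evaluation steps summarised in the abstract could be sketched roughly as below. This is an illustration only, not the thesis implementation: the function names, the search window of 30 ms, the candidate step of 5 ms and the 20 ms tolerance are all assumptions, and `predict_score` is a stand-in for a trained score predictive model (in the thesis, a random forest regressor fed with audio features).

```python
from typing import Callable, List, Sequence


def candidate_times(hmm_boundary: float, half_window: float, step: float) -> List[float]:
    """Evenly spaced candidate times (seconds) centred on the HMM boundary.

    half_window and step are hypothetical parameters chosen for illustration.
    """
    n = int(round(half_window / step))
    return [round(hmm_boundary + k * step, 6) for k in range(-n, n + 1)]


def refine_boundary(
    hmm_boundary: float,
    predict_score: Callable[[float], float],
    half_window: float = 0.03,
    step: float = 0.005,
) -> float:
    """Return the candidate near the HMM boundary with the highest predicted score.

    predict_score is a placeholder for a trained regression model that maps a
    candidate time (via its audio features) to a boundary score.
    """
    candidates = candidate_times(hmm_boundary, half_window, step)
    return max(candidates, key=predict_score)


def hit_rate(predicted: Sequence[float], truth: Sequence[float], tolerance: float = 0.02) -> float:
    """Fraction of predicted boundaries within `tolerance` seconds of the true ones."""
    if not truth or len(predicted) != len(truth):
        raise ValueError("predicted and truth must be non-empty and of equal length")
    hits = sum(1 for p, t in zip(predicted, truth) if abs(p - t) <= tolerance)
    return hits / len(truth)
```

Computing `hit_rate` once for the HMM boundaries and once for the refined boundaries, both against the manual labels, gives the kind of baseline comparison the abstract describes.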