Voice Quality Modelling with the Wide-Band Harmonic Sinusoidal Modelling Algorithm

S.I. Mimilakis

Note: This bibliographic page is archived and will no longer be updated. For an up-to-date list of publications from the Music Technology Group, see the Publications list.

Title	Voice Quality Modelling with the Wide-Band Harmonic Sinusoidal Modelling Algorithm
Publication Type	Master Thesis
Year of Publication	2014
Authors	Mimilakis, S. I.
Abstract	Recent advances in speech and voice processing have underlined the significance of voice qualities. These qualities have been shown to increase perceived naturalness in applications ranging from text-to-speech synthesis and sound source separation to singing voice conversion and transformation. As a result, multiple approaches co-exist whose main task is to reproduce and transmit these specific voice characteristics. In this work, we aim to model these voice qualities by combining robust analysis algorithms with machine learning. This methodology allows the extraction and modelling of specific features and patterns that enable the re-synthesis of the phenomena involved in each voice quality. The extracted patterns are then used to train an ensemble of Artificial Neural Networks, capable of generalisation and satisfactory performance on a restricted audio corpus. In the final transformation stage, each input voice activates the Artificial Neural Networks, which predict the voice quality patterns to be re-synthesised, allowing the operation to perform adaptively. The proposed method was also evaluated through a series of subjective listening tests, in which a set of singing voices was processed and 8 experienced listeners rated the perceived naturalness, expressivity and transparency of each audio segment. The results demonstrate solid performance: the perceived naturalness was almost on par with that of the original audio corpus, while the perceptual expressivity grade was higher for the transformed audio corpus. As far as transparency is concerned, a mean total success rate of 47.2% was achieved in distinguishing between original natural voices and transformed ones.

Stylianos-Ioannis-Mimilakis-Master-Thesis-2014.pdf