Expression Control of Singing Voice Synthesis: Modeling Pitch and Dynamics with Unit Selection and Statistical Approaches

Publication Type: PhD Thesis
Year of Publication: 2016
University: Universitat Pompeu Fabra
Authors: Umbert, M.
Advisor: Bonada, J., & Serra, X.
Academic Department: Department of Information and Communication Technologies
Number of Pages: 177
Date Published: 01/2016
City: Barcelona
Keywords: dynamics, expression control, pitch, singing voice synthesis, statistical modeling, unit selection
Abstract

Sound synthesis technologies have been applied to speech, instruments, and the singing voice. While these technologies need a sound representation that is as realistic as possible, the synthesis should also reproduce the expressive characteristics of the original sound. By this we refer to emotional speech synthesis, expressive performances of synthesized instruments, and expression in singing voice synthesis. Indeed, the singing voice has commonalities with both speech (the sound source is the same) and instruments (in musical aspects such as melody and expression resources).

Modeling singing voice expression is a difficult task. We are all thoroughly familiar with the singing voice as an instrument, and thus easily detect whether artificially generated results resemble a real singer. Many features related to melody, dynamics, rhythm, and timbre must be controlled, which makes achieving natural expression a complex task.

This thesis focuses on controlling a singing voice synthesizer to achieve natural expression similar to that of a real singer. Specifically, we examine the control of pitch and dynamics. In the unit selection-based system, we define the cost functions for unit selection as well as the unit transformation and concatenation steps. The statistical systems model both sequences of notes and sequences of note transitions and sustains. Finally, we present a system that combines the previous ones. These systems are trained with two expression databases that we have designed, recorded, and labeled. These databases comprise sequences of three notes or rests.

Our perceptual evaluation compares the proposed systems with a baseline expression system and a performance-driven approach. It shows that the hybrid system achieves the expression closest to that of a human voice. In the objective evaluation, we focus on the systems' efficiency.

This thesis makes the following contributions to the field: 1) it provides a discussion on expression and summarizes some expression definitions, 2) it reviews previous work on expression control in singing voice synthesis, 3) it provides an online compilation of sound excerpts from different works, 4) it proposes a methodology for expression database creation, 5) it implements a unit selection-based system for expression control, 6) it proposes two statistical systems, 7) it presents a hybrid system, 8) it compares the proposed systems with other state-of-the-art systems, 9) it proposes another use case in which the proposed systems can be applied, and 10) it provides a set of proposals to improve the evaluation.

All sounds mentioned in this thesis have been collected on a single website (PhD evaluation section).
