Knowledge-based Probabilistic Modeling for Tracking Lyrics in Music Audio Signals

Georgi Dzhambazov

Note: This bibliographic page is archived and will no longer be updated. For an up-to-date list of publications from the Music Technology Group see the Publications list.

Title	Knowledge-based Probabilistic Modeling for Tracking Lyrics in Music Audio Signals
Publication Type	PhD Thesis
Year of Publication	2017
University	Universitat Pompeu Fabra
Authors	Dzhambazov, G.
Advisor	Serra, X.
Academic Department	Department of Information and Communication Technologies
Abstract	This thesis proposes specific signal processing and machine learning methods for automatically aligning the lyrics of a song to its corresponding audio recording. The research carried out falls in the broader field of music information retrieval (MIR) and, in this respect, we aim at improving some existing state-of-the-art methods by introducing domain-specific knowledge. The goal of this work is to devise models capable of tracking in the music audio signal the sequential aspect of one particular element of lyrics: the phonemes. Music can be understood as comprising different facets, one of which is lyrics. The models we build take into account the complementary context that exists around lyrics, which is any musical facet complementary to lyrics. The facets used in this thesis include the structure of the music composition, the temporal structure of a lyrics line, and the structure of the metrical cycle. From this perspective, we analyse not only the low-level acoustic characteristics, representing the timbre of the phonemes, but also higher-level characteristics, in which the complementary context manifests itself. We propose specific probabilistic models to represent how the transitions between consecutive sung phonemes are conditioned by different facets of complementary context. The complementary context that we address unfolds in time according to principles that are particular to a music tradition. To capture these, we created corpora and datasets for two music traditions that have a rich set of such principles: Ottoman Turkish makam and Beijing opera. The datasets and the corpora comprise different data types: audio recordings, music scores, and metadata. From this perspective, the proposed models can take advantage both of the data and the music-domain knowledge of particular musical styles to improve existing baseline approaches.
As a baseline, we choose a phonetic recognizer based on hidden Markov models (HMM): a widely used method for tracking phonemes both in singing and speech processing problems. We present refinements in the typical steps of existing phonetic recognizer approaches, tailored towards the characteristics of the studied music traditions. On top of the refined baseline, we devise probabilistic models, based on dynamic Bayesian networks (DBN), that represent the relation of phoneme transitions to their complementary context. Two separate models are built for two granularities of complementary context: the temporal structure of a lyrics line (higher-level) and the structure of the metrical cycle (finer-level). In one model we exploit the fact that syllable durations depend on their position within a lyrics line. Information about the expected durations is obtained from the score, as well as from music-specific knowledge. Then in another model, we analyse how vocal note onsets, estimated from audio recordings, influence the transitions between consecutive vowels and consonants. We also propose how to detect the time positions of sung note onsets by tracking simultaneously the positions in the metrical cycle (i.e. metrical accents). In order to evaluate the potential of the proposed models, we use lyrics-to-audio alignment as a concrete task. Each model improves the alignment accuracy, compared to the baseline, which is based solely on the acoustics of the phonetic timbre. This validates our hypothesis that knowledge of complementary context is an important stepping stone for computationally tracking lyrics, especially in the challenging case of singing with instrumental accompaniment. The outcomes of this study are not only theoretical methods and data, but also specific software tools that have been integrated into Dunya, a suite of tools built in the context of CompMusic, a project for advancing the computational analysis of the world's music.
With this application, we have also shown that the developed methods are useful not only for tracking lyrics, but also for other use cases, such as enriched music listening and appreciation, and for educational purposes.
Final publication	https://doi.org/10.5281/zenodo.841980
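
Illustrative note: the abstract describes an HMM-based phonetic recognizer used as a baseline for lyrics-to-audio alignment. The following is a minimal sketch of the general idea behind such alignment (Viterbi decoding over a left-to-right chain of phoneme states), not the thesis's actual implementation. The function name `forced_align`, the one-state-per-phoneme topology, and the toy log-likelihood values are all assumptions made for illustration. Transition probabilities are omitted on the assumption that staying and advancing are equally likely, in which case every valid path pays the same transition cost.

```python
import math


def forced_align(log_likelihoods, n_states):
    """Align a known left-to-right phoneme sequence to audio frames.

    log_likelihoods[t][s] is the (hypothetical) acoustic log-likelihood of
    frame t under phoneme state s. Returns the state index for each frame.
    """
    n_frames = len(log_likelihoods)
    if n_frames < n_states:
        raise ValueError("need at least one frame per phoneme state")

    neg_inf = -math.inf
    # best[t][s]: best log score of any path that is in state s at frame t
    best = [[neg_inf] * n_states for _ in range(n_frames)]
    # advanced[t][s]: True if the best path entered s from s - 1 at frame t
    advanced = [[False] * n_states for _ in range(n_frames)]

    # The alignment must start in the first phoneme.
    best[0][0] = log_likelihoods[0][0]

    for t in range(1, n_frames):
        for s in range(n_states):
            stay = best[t - 1][s]
            advance = best[t - 1][s - 1] if s > 0 else neg_inf
            if advance > stay:
                best[t][s] = advance + log_likelihoods[t][s]
                advanced[t][s] = True
            else:
                best[t][s] = stay + log_likelihoods[t][s]

    # Backtrace from the last phoneme at the last frame.
    path = [0] * n_frames
    s = n_states - 1
    for t in range(n_frames - 1, -1, -1):
        path[t] = s
        if advanced[t][s]:
            s -= 1
    return path


# Toy example: 6 frames, 3 phonemes; values are invented for illustration.
toy_scores = [
    [0, -5, -5],
    [0, -5, -5],
    [-5, 0, -5],
    [-5, 0, -5],
    [-5, 0, -5],
    [-5, -5, 0],
]
alignment = forced_align(toy_scores, 3)  # [0, 0, 1, 1, 1, 2]
```

In this sketch, phoneme boundaries fall wherever the state index changes between consecutive frames. The context-aware models described in the abstract would, roughly speaking, replace the uniform stay/advance choice with transition probabilities conditioned on expected durations or detected note onsets.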

Additional material:

Companion thesis page with links to additional materials at http://compmusic.upf.edu/phd-thesis-georgi

Video of PhD Defense

PhD Defense presentation slides

The LyX source for generating this thesis document and its figures can be found at https://github.com/georgid/PhDThesis/