Knowledge-based Probabilistic Modeling for Tracking Lyrics in Music Audio Signals

TitleKnowledge-based Probabilistic Modeling for Tracking Lyrics in Music Audio Signals
Publication TypePhD Thesis
Year of Publication2017
UniversityUniversitat Pompeu Fabra
AuthorsDzhambazov, G.
AdvisorSerra, X.
KeywordsBeijing opera, Dynamic Bayesian Networks, hidden Markov models, lyrics, lyrics-to-audio alignment, Machine learning, music information retrieval, music scores, phonemes, signal processing, singing voice, Turkish makam
Abstract

This thesis proposes specific signal processing and machine learning methodologies for automatically aligning the lyrics of a song to its corresponding audio recording. The research carried out falls in the broader field of music information retrieval (MIR) and in this respect, we aim at improving some existing state-of-the-art methodologies, by introducing domain-specific knowledge.

The goal of this work is to devise models capable of tracking in the music audio signal the sequential aspect of one particular element of lyrics - the phonemes. Music can be understood as comprising different facets, one of which is lyrics. The models we build take into account the complementary context that exists around lyrics, which is any musical facet complementary to lyrics. The facets used in this thesis include the structure of the music composition, temporal structure of a lyrics line, the structure of the metrical cycle. From this perspective, we analyse not only the low-level acoustic characteristics, representing the timbre of the phonemes, but also higher-level characteristics, in which the complementary context manifests. We propose specific probabilistic models to represent how the transitions between consecutive sung phonemes are conditioned by different facets of complementary context.

The complementary context, which we address, unfolds in time according to principles that are particular of a music tradition. To capture these, we created corpora and datasets for two music traditions, which have a rich set of such principles: Ottoman Turkish makam and Beijing opera. The datasets and the corpora comprise different data types: audio recordings, music scores, and metadata. From this perspective, the proposed models can take advantage both of the data and the music-domain knowledge of particular musical styles to improve existing baseline approaches.

As a baseline, we choose a phonetic recognizer based on hidden Markov models (HMM): a widely-used methodology for tracking phonemes both in singing and speech processing problems. We present refinements in the typical steps of existing phonetic recognizer approaches, tailored towards the characteristics of the studied music traditions. On top of the refined baseline, we device probabilistic models, based on dynamic Bayesian networks (DBN) that represent the relation of phoneme transitions to its complementary context. Two separate models are built for two granularities of complementary context: the temporal structure of a lyrics line (higher-level) and the structure of the metrical cycle (finer-level). In one model we exploit the fact the syllable durations depend on their position within a lyrics line. Information about the expected durations is obtained from the score, as well as from music-specific knowledge. Then in another model, we analyse how vocal note onsets, estimated from audio recordings, influence the transitions between consecutive vowels and consonants. We also propose how to detect the time positions of sung note onsets by tracking simultaneously the positions in the metrical cycle (i.e. metrical accents).

In order to evaluate the potential of the proposed models, we use the lyrics-to-audio alignment as a concrete task. Each model improves the alignment accuracy, compared to the baseline, which is based solely on the acoustics of the phonetic timbre. This validates our hypothesis that knowledge of complementary context is an important stepping stone for computationally tracking lyrics, especially in the challenging case of singing with instrumental accompaniment.

The outcomes of this study are not only theoretic methodologies and data, but also specific software tools that have been integrated into Dunya - a suite of tools, built in the context of CompMusic, a project for advancing the computational analysis of the world's music. With this application, we have also shown that the developed methodologies are useful not only for tracking lyrics, but also for other use cases, such as enriched music listening and appreciation, or for educational purposes.

Additional material: 

Companion thesis page with links to additional materials at http://compmusic.upf.edu/phd-thesis-georgi

Video of PhD Defense  

PhD Defense presentation slides

lyx document to generate this thesis document and the figures can be found in https://github.com/georgid/PhDThesis/

intranet