Automatic Assessment of Singing Voice Pronunciation: A Case Study with Jingju Music

Title: Automatic Assessment of Singing Voice Pronunciation: A Case Study with Jingju Music
Publication Type: PhD Thesis
Year of Publication: 2018
University: Universitat Pompeu Fabra
Authors: Gong, R.
Advisor: Serra, X.
Academic Department: Department of Information and Communications Technologies
Number of Pages: xxxii + 235
Date Published: 11/2018
Keywords: automatic assessment, Beijing opera, hmm, jingju, neural networks, pronunciation, singing voice
Abstract: Online learning has altered music education remarkably in the last decade. A large and increasing number of music performance learners participate in online music learning courses because they are easily accessible and free of time and space constraints. However, online music learning cannot be extended to a large scale unless there is an automatic system to provide assessment feedback on student music performances. Singing can be considered the most basic form of music performance, and the critical role it plays in music education cannot be overemphasized. Automatic singing voice assessment, an important task in Music Information Research (MIR), aims to extract musically meaningful information and measure the quality of learners’ singing voices. Singing correctness and quality are culture-specific, and their assessment requires culture-aware methodologies. Jingju (also known as Beijing opera) music is one of the representative music traditions of China and has spread to many places in the world where there are Chinese communities. The Chinese tonal languages and the strict conventions of oral transmission adopted in jingju singing training pose unique challenges that have not been addressed by current MIR research, which motivates us to select it as the major music tradition for this dissertation. Our goal is to tackle unexplored automatic singing voice assessment problems in jingju music, to make the current eurogeneric assessment approaches more culture-aware, and, in return, to develop new assessment approaches that can be generalized to other music traditions. This dissertation aims to develop data-driven audio signal processing and machine learning (deep learning) models for automatic singing voice assessment in audio collections of jingju music. We identify challenges and opportunities, and present several research tasks relevant to automatic singing voice assessment of jingju music.
Data-driven computational approaches require well-organized data for model training and testing, and we report in detail the process of curating the data collections (audio and editorial metadata). We then focus on the research topics of automatic syllable and phoneme segmentation, automatic mispronunciation detection, and automatic pronunciation similarity measurement in jingju music. Jingju singing training is extremely demanding: students have to pronounce each sung syllable correctly and reproduce the teacher’s reference pronunciation quality. Automatic syllable and phoneme segmentation, as a preliminary step for the assessment, aims to divide the singing audio stream into finer-grained units: syllables and phonemes. The proposed method adopts deep learning models to calculate syllable and phoneme onset probabilities, and achieves state-of-the-art segmentation accuracy by incorporating side information (syllable and phoneme durations estimated from musical scores) into the algorithm. Jingju singing uses a unique pronunciation system which is a mixture of several Chinese dialects. This pronunciation system contains various specially pronounced syllables that are not included in standard Mandarin. A crucial step in jingju singing training is to pronounce these special syllables correctly. We approach the problem of automatic mispronunciation detection for specially pronounced syllables using a deep learning-based classification method by which the student’s interpretation of a specially pronounced syllable segment is assessed. Compared with the existing forced alignment-based approach, the proposed method shows great potential, which indicates its validity for pronunciation correctness assessment. The strict oral transmission convention in jingju singing teaching requires that students accurately reproduce the teacher’s reference pronunciation at the phoneme level.
Hence, the proposed assessment method needs to be able to measure the pronunciation similarity between the teacher’s and the student’s corresponding phonemes. Acoustic phoneme embeddings learned by deep learning models can capture pronunciation nuances and convert a variable-length phoneme segment into a fixed-length vector, which facilitates pronunciation similarity measurement. The technologies developed in this dissertation are part of the comprehensive toolset of the CompMusic project, aimed at enriching the online learning experience for jingju singing. The data and methodologies should also contribute to computational musicology research and to other MIR or speech tasks related to automatic voice assessment.
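The phoneme-level similarity measurement mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: mean pooling over frames is assumed here as a stand-in for the learned deep learning embedding, cosine similarity is assumed as the comparison measure, and the function names and toy feature values are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def embed_segment(frames: Sequence[Sequence[float]]) -> Vector:
    """Map a variable-length phoneme segment to one fixed-length vector.

    ASSUMPTION: mean pooling over frames stands in for the learned
    embedding model; the abstract does not specify the architecture.
    """
    if not frames:
        raise ValueError("segment must contain at least one frame")
    dim = len(frames[0])
    if any(len(frame) != dim for frame in frames):
        raise ValueError("all frames must have the same feature dimension")
    # Average each feature dimension across all frames of the segment.
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pronunciation_similarity(
    teacher_frames: Sequence[Sequence[float]],
    student_frames: Sequence[Sequence[float]],
) -> float:
    """Compare two phoneme segments of possibly different lengths."""
    return cosine_similarity(
        embed_segment(teacher_frames), embed_segment(student_frames)
    )


# Toy 2-D frame features (hypothetical values, not real acoustic features).
teacher = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]                # 3 frames
student_close = [[2.0, 0.0]] * 5                              # 5 frames, same direction
student_far = [[0.0, 1.0], [0.0, 1.0]]                        # 2 frames, orthogonal

score_close = pronunciation_similarity(teacher, student_close)
score_far = pronunciation_similarity(teacher, student_far)
```

Because both segments are reduced to vectors of the same dimension before comparison, the teacher's and student's phonemes do not need to have the same duration.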