Audio Source Separation for Music in Low-latency and High-latency Scenarios

Title: Audio Source Separation for Music in Low-latency and High-latency Scenarios
Publication Type: PhD Thesis
Year of Publication: 2013
University: Universitat Pompeu Fabra
Authors: Marxer, R.
Academic Department: Department of Information and Communication Technologies
Date Published: 09/2013
Keywords: audio processing, low-latency, source separation
Abstract: The source separation problem in digital signal processing consists of finding the original signals that were mixed together into a set of mixture signals. Solutions to this problem have been extensively studied for the specific case of musical signals; however, their application to real-world practical situations remains infrequent. There are two main obstacles to their widespread adoption, depending on the scenario. In some cases the main limitation is their high latency and computational requirements. In other cases the quality of the results is still unacceptable. There has been extensive work on improving the quality of music separation, but few studies have been devoted to the development of low-latency, low-computational-cost separation of monaural music signals. We propose specific methods to address these issues in each of these scenarios independently. First, we focus on methods with low computational cost and low latency. We propose the use of Tikhonov regularization as a method for spectrum decomposition in the low-latency context. We compare it to existing techniques in pitch and multipitch estimation and tracking tasks, which are a crucial step in many separation methods. We then use the proposed spectrum decomposition method in low-latency music separation tasks targeting singing voice, bass and drums. Second, we develop methods that achieve improved separation results with respect to existing state-of-the-art methods, at the cost of increased computation and latency. We propose several high-latency, computationally complex methods that improve the separation of singing voice by modeling components that are often not accounted for, such as breathiness and consonants. Finally, we explore the use of temporal correlations and human annotations to enhance the separation of drums and complex polyphonic music signals.
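The abstract names Tikhonov regularization as the spectrum decomposition method for the low-latency scenario. As a rough illustration (not the thesis implementation), the idea can be sketched as ridge-regularized least squares: decompose one magnitude-spectrum frame onto a fixed dictionary of spectral templates, which admits a closed-form per-frame solution and hence low latency. The function and variable names below are hypothetical.

```python
import numpy as np

def tikhonov_decompose(spectrum, basis, lam=0.1):
    """Decompose one spectrum frame onto a template dictionary.

    Solves min_g ||spectrum - basis @ g||^2 + lam * ||g||^2,
    whose closed-form (ridge) solution is
        g = (B^T B + lam * I)^{-1} B^T x.
    Because the solution is a single linear solve (and the factor
    of B^T B + lam*I can be precomputed for a fixed dictionary),
    each frame is processed independently with low latency.
    """
    B = basis                                   # (n_bins, n_templates)
    A = B.T @ B + lam * np.eye(B.shape[1])      # regularized Gram matrix
    return np.linalg.solve(A, B.T @ spectrum)   # template gains
```

Unlike iterative NMF-style decompositions, this closed-form solution does not enforce non-negativity of the gains, which is part of the trade-off between latency and decomposition quality the abstract alludes to.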