Automatic musical instrument recognition from polyphonic music audio signals

Ferdinand Fuhrmann

Title	Automatic musical instrument recognition from polyphonic music audio signals
Publication Type	PhD Thesis
Year of Publication	2012
University	Universitat Pompeu Fabra
Authors	Fuhrmann, F.
Advisor	Serra, X.
Abstract	The rapidly growing amount of digital media makes effective data management a challenge for technology. In this context, we approach the problem of automatically recognising musical instruments from music audio signals. Information regarding the instrumentation is among the most important semantic concepts humans use to communicate musical meaning. Hence, knowledge of the instrumentation eases a meaningful description of a music piece, which is indispensable for addressing the aforementioned need with modern (music) technology. Given the competence of the human auditory system, the addressed problem may sound elementary. However, over at least two decades of study, during which it has been tackled from various perspectives, the problem has proven to be highly complex; no system presented so far comes close to human-comparable performance. In particular, resolving multiple simultaneously sounding sources poses the main difficulty for computational approaches. In this dissertation we present a general-purpose method for the automatic recognition of musical instruments from music audio signals. Unlike many related approaches, our conception mostly avoids laboratory constraints on the method's algorithmic design, its input data, or the targeted application context. In particular, the developed method models 12 instrumental categories, including pitched and percussive instruments as well as the human singing voice, all of them frequently used in Western music. To account for the presumably complex nature of the input signal, we limit the most basic process in the algorithmic chain to the recognition of a single predominant musical instrument from a short audio fragment.
By applying statistical pattern recognition techniques together with properly designed, extensive datasets, we predict one source from the analysed polytimbral sound and thereby avoid having to resolve the mixture. To compensate for this restriction we further incorporate information derived from a hierarchical music analysis. First, we utilise musical context to extract instrumental labels from the time-varying model decisions. Second, the method incorporates information regarding the piece's formal aspects into the recognition process. Finally, we include information from the collection level by exploiting associations between musical genres and instrumentations. In our experiments we assess the performance of the developed method with a thorough evaluation methodology using real music signals only, estimating the method's accuracy, generality, scalability, robustness, and efficiency. More precisely, both the models' recognition performance and the label extraction algorithm exhibit reasonable, and thus expected, accuracies given the problem at hand. Furthermore, we demonstrate that the method generalises well in terms of the modelled categories and scales to input data of any complexity; hence it provides a robust extraction of the targeted information. Moreover, we show that the information regarding the instrumentation of a Western music piece is highly redundant, which enables a great reduction of the data to analyse. Here, our best settings lead to a recognition performance of almost 0.7 in terms of the applied F-score from less than 50% of the input data. Lastly, the experiments incorporating information on the musical genre of the analysed music pieces do not show the expected improvement in recognition performance, suggesting that a more fine-grained instrumental taxonomy is needed to exploit this kind of information.
Final publication	http://hdl.handle.net/10803/81328