Deep Neural Networks for Music and Audio Tagging

Title: Deep Neural Networks for Music and Audio Tagging
Publication Type: PhD Thesis
Year of Publication: 2019
University: Universitat Pompeu Fabra
Authors: Pons, J.
Advisor: Serra, X.
Academic Department: Information and Communication Technologies
Number of Pages: 216+xxiii
Date Published: 11/2019
City: Barcelona, Spain
Keywords: audio tagging, convolutional neural networks, deep learning, music tagging, neural networks
Abstract

Automatic music and audio tagging can help increase the retrieval and re-use possibilities of many audio databases that remain poorly labeled. In this dissertation, we tackle the task of music and audio tagging from the deep learning perspective and, within that context, we address the following research questions:

i. Which deep learning architectures are most appropriate for (music) audio signals?

ii. In which scenarios is waveform-based end-to-end learning feasible?

iii. How much data is required for carrying out competitive deep learning research?

In pursuit of answering research question (i), we propose musically motivated convolutional neural networks as a domain-knowledge-driven alternative for designing deep learning models, and we evaluate several deep learning architectures for audio at a low computational cost with a novel methodology based on non-trained (randomly weighted) convolutional neural networks. Throughout our work, we find that employing music and audio domain knowledge during a model's design can help improve the efficiency, interpretability, and performance of spectrogram-based deep learning models.
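To illustrate the idea of musically motivated filter shapes (this is a minimal numpy sketch, not code from the thesis; the filter sizes and spectrogram dimensions are assumptions for the example), vertical filters spanning many frequency bins capture timbral cues, while horizontal filters spanning many time frames capture temporal cues. The random filter weights also echo the non-trained CNN evaluation methodology mentioned above:

```python
import numpy as np

def conv2d_valid(spec, kernel):
    """Naive 'valid' 2D cross-correlation of a spectrogram with one filter."""
    kf, kt = kernel.shape
    out = np.empty((spec.shape[0] - kf + 1, spec.shape[1] - kt + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(spec[i:i + kf, j:j + kt] * kernel)
    return out

rng = np.random.default_rng(0)
spectrogram = rng.standard_normal((96, 128))  # (mel bands, time frames) - toy input

# Vertical filter: tall in frequency, narrow in time -> timbral features.
vertical = rng.standard_normal((32, 1))
# Horizontal filter: wide in time, narrow in frequency -> temporal features.
horizontal = rng.standard_normal((1, 64))

timbral_map = conv2d_valid(spectrogram, vertical)     # shape (65, 128)
temporal_map = conv2d_valid(spectrogram, horizontal)  # shape (96, 65)
```

The design choice is that the filter geometry, rather than learned weights alone, encodes domain knowledge about what spectrogram patterns are musically meaningful.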

For research questions (ii) and (iii), we perform a study with the SampleCNN, a recently proposed end-to-end learning model, to assess its viability for music audio tagging when variable amounts of training data (ranging from 25k to 1.2M songs) are available. We compare the SampleCNN against a spectrogram-based architecture that is musically motivated and conclude that, given enough data, end-to-end learning models can achieve better results. Finally, throughout our quest for answering research question (iii), we also investigate whether a naive regularization of the solution space, prototypical networks, transfer learning, or their combination can foster deep learning models to better leverage a small number of training examples. Results indicate that transfer learning and prototypical networks are powerful strategies in such low-data regimes.
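The prototypical-networks idea mentioned above can be sketched in a few lines: each class is summarized by the mean of its support-set embeddings, and queries are assigned to the nearest prototype. This is a generic numpy illustration of the technique with toy 2-D embeddings, not the embedding model or data used in the thesis:

```python
import numpy as np

def prototypes(support_emb, support_labels, n_classes):
    """Class prototype = mean embedding of that class's support examples."""
    return np.stack([support_emb[support_labels == c].mean(axis=0)
                     for c in range(n_classes)])

def classify(query_emb, protos):
    """Assign each query to the class of its nearest (Euclidean) prototype."""
    dists = np.linalg.norm(query_emb[:, None, :] - protos[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Toy two-class example with hand-picked 2-D embeddings (hypothetical data).
support = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
labels = np.array([0, 0, 1, 1])
protos = prototypes(support, labels, n_classes=2)  # [[0., 0.5], [5., 5.5]]
queries = np.array([[0.2, 0.3], [4.8, 5.1]])
print(classify(queries, protos))  # -> [0 1]
```

Because prototypes are simple averages, a class can be represented with very few examples, which is why the approach suits the low-data regimes studied here.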

Preprint/postprint document: http://jordipons.me/media/PhD_Thesis_JordiPons.pdf