| Abstract | In this paper we analyze the reliability of the evaluation
of Audio Melody Extraction algorithms. We focus on the
procedures and collections currently used as part of the
annual Music Information Retrieval Evaluation eXchange
(MIREX), which has become the de facto benchmark for
evaluating and comparing melody extraction algorithms.
We study several factors: the duration of the audio clips,
time offsets in the ground truth annotations, and the size
and musical content of the collection. The results show
that the clips currently used are too short to predict performance
on full songs, highlighting the paramount need
to use complete musical pieces. Concerning the ground
truth, we show how a minor error, specifically a time offset
between the annotation and the audio, can have a dramatic
effect on the results, emphasizing the importance of
establishing a common protocol for ground truth annotation
and system output. We also show that results based on
the small ADC04, MIREX05 and INDIAN08 collections
are unreliable, while the MIREX09 collections are larger
than necessary. These findings point to the need for new, larger
collections containing realistic music material, to enable reliable
and meaningful evaluation of Audio Melody Extraction.
|