PianoCoRe: Combined and Refined Piano MIDI Dataset

Anonymous Authors
Anonymous Submission
Figure: Overview of the data matching process.

PianoCoRe is a new large-scale dataset of over 250,000 piano performance MIDI files, unified from five major corpora.

Abstract

Symbolic music datasets with matched scores and performances are essential for a range of Music Information Retrieval (MIR) tasks. Yet existing resources suffer from significant limitations: they often cover a narrow range of composers, lack performance variety, omit precise note-level alignments, or use inconsistent naming formats. We present PianoCoRe, a large-scale piano dataset that combines and refines major open-source piano corpora into a unified collection of scores and performances. The dataset contains 253,633 performances of 6,008 pieces written by 624 composers, totaling over 22,000 hours of performed music. Our methodological contributions are: (1) a MIDI quality classifier that identifies and removes low-quality transcriptions, and (2) RAScoP, a novel pipeline that refines raw note alignments by correcting temporal errors and interpolating missing notes. PianoCoRe is released in tiered subsets (C, B, A, and A*) tailored to different applications, from large-scale pre-training to expressive performance rendering with precise note alignments. PianoCoRe-A, the note-aligned subset, is the largest available score-aligned performance MIDI dataset, containing 160,207 performances aligned to 1,697 musical scores. Experiments demonstrate the effectiveness of our processing pipeline and show that a performance rendering model trained on the complete aligned dataset generalizes better out of domain. PianoCoRe provides a robust, ready-to-use foundation for the next generation of expressive piano performance modeling.
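To make the idea of "interpolating missing notes" concrete, the following is a minimal illustrative sketch, not the RAScoP implementation, whose details are not given here. It assumes a simplified setting in which each score note has an onset in beats and each matched performance note has an onset in seconds; unmatched notes are marked `None` and filled by linear interpolation between the nearest matched neighbors. The function name `interpolate_missing_onsets` and the data layout are hypothetical.

```python
from bisect import bisect_left


def interpolate_missing_onsets(score_onsets, perf_onsets):
    """Fill unmatched performance onsets by linear interpolation.

    Hypothetical helper for illustration only (not the RAScoP pipeline).
    score_onsets: score onset times in beats, sorted ascending.
    perf_onsets:  performed onset times in seconds, or None where the
                  score note has no matched performance note.
    """
    # Matched notes act as anchors mapping score time to performance time.
    anchors = [(s, p) for s, p in zip(score_onsets, perf_onsets) if p is not None]
    if len(anchors) < 2:
        raise ValueError("need at least two matched notes to interpolate")
    anchor_scores = [s for s, _ in anchors]

    filled = []
    for s, p in zip(score_onsets, perf_onsets):
        if p is not None:
            filled.append(p)
            continue
        # Locate the anchor pair surrounding s; clamp at the ends so that
        # leading or trailing gaps are extrapolated from the nearest pair.
        i = bisect_left(anchor_scores, s)
        i = min(max(i, 1), len(anchors) - 1)
        (s0, p0), (s1, p1) = anchors[i - 1], anchors[i]
        if s1 == s0:
            # Both anchors share a score onset (e.g. a chord): reuse its time.
            filled.append(p0)
        else:
            filled.append(p0 + (s - s0) * (p1 - p0) / (s1 - s0))
    return filled


# Two notes are missing from the performance and get interpolated onsets.
example = interpolate_missing_onsets(
    [0, 1, 2, 3, 4],
    [0.0, None, 1.0, None, 2.0],
)
```

A real pipeline would also need to estimate durations and velocities for the inserted notes and handle local tempo changes; this sketch covers onset timing only.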

Key Contributions

BibTeX

@article{anonymous2025pianocore,
  author    = {Anonymous Authors},
  title     = {PianoCoRe: Combined and Refined Piano {MIDI} Dataset},
  journal   = {Anonymous Submission},
  year      = {2025},
}