PianoCoRe: Combined and Refined Piano MIDI Dataset

Anonymous Authors
Anonymous Submission
Figure: Overview of the data matching process.

PianoCoRe is a new large-scale dataset of 250,046 piano performance MIDI files unified from five major corpora.

Abstract

Symbolic music datasets with matched scores and performances are essential for many Music Information Retrieval (MIR) tasks. Yet existing resources often cover a narrow range of composers, lack performance variety, omit note-level alignments, or use inconsistent naming formats. We present PianoCoRe, a large-scale piano MIDI dataset that combines and refines major open-source piano corpora into a unified collection. The dataset contains 250,046 performances of 5,625 pieces written by 483 composers, totaling 21,763 hours of performed music. PianoCoRe is released in tiered subsets to support different applications, from large-scale analysis and pre-training (PianoCoRe-C and the deduplicated PianoCoRe-B) to expressive performance modeling with note-level score alignment (PianoCoRe-A/A*). The note-aligned subset, PianoCoRe-A, is the largest open-source collection of its kind, with 157,199 performances aligned to 1,591 scores. Beyond the dataset, our contributions include: (1) a MIDI quality classifier for detecting corrupted and score-like transcriptions, and (2) RAScoP, an alignment refinement pipeline that corrects temporal errors and interpolates missing notes. Evaluation shows that the refinement pipeline reduces alignment errors and temporal noise. Moreover, an expressive performance rendering model trained on PianoCoRe is more robust to unseen pieces than models trained on raw or smaller datasets. PianoCoRe provides a ready-to-use foundation for the next generation of expressive piano performance research.

Key Contributions

BibTeX

@article{anonymous2026pianocore,
  author    = {Anonymous Authors},
  title     = {{PianoCoRe}: Combined and Refined Piano {MIDI} Dataset},
  journal   = {Anonymous Submission},
  year      = {2026},
}