We provide the resources to enable reproducibility of our work. The links above contain a demo version of the final PianoCoRe dataset, as well as the source code for our RAScoP refinement pipeline.
Note: The dataset is provided solely for the review process and may not be redistributed or used for any other purpose. The metadata inside the .tar.gz file covers all performances and all 6,008 musical pieces. However, to reduce the file size for review (from 7 GB to 700 MB), the demo dataset itself is limited to a representative set of up to four performances of each piece from each data source. Once unarchived, the demo dataset takes around 1.7 GB of disk space.
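The per-piece, per-source cap described above can be sketched as follows. This is only an illustration under stated assumptions: the field names `piece_id` and `source` and the function `limit_performances` are hypothetical and are not taken from the released metadata, and the sketch simply keeps the first entries it encounters, whereas the criterion used to pick the representative performances is not specified here.

```python
from collections import defaultdict


def limit_performances(records, max_per_group=4):
    """Keep at most `max_per_group` records per (piece, source) pair.

    `records` is an iterable of dicts with hypothetical keys
    "piece_id" and "source"; input order is preserved.
    """
    counts = defaultdict(int)  # (piece_id, source) -> number kept so far
    kept = []
    for rec in records:
        key = (rec["piece_id"], rec["source"])
        if counts[key] < max_per_group:
            counts[key] += 1
            kept.append(rec)
    return kept
```

Applied to a metadata table loaded as a list of dictionaries, the function returns a reduced list in which no piece has more than four performances from any single source.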
The full dataset will be released upon final publication. The data will be archived and distributed on HuggingFace/Zenodo under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license to align with the licenses used for the source datasets.
The code release will include a documented RAScoP pipeline and a MIDI quality classifier.
Symbolic music datasets with matched scores and performances are essential for a range of Music Information Retrieval (MIR) tasks. Yet, existing resources suffer from significant limitations: they often cover a narrow range of composers, lack performance variety, omit precise note-level alignments, or use inconsistent naming formats. We present PianoCoRe, a large-scale piano dataset that combines and refines major open-source piano corpora into a unified collection of scores and performances. The dataset contains 253,633 performances of 6,008 pieces written by 624 composers, totaling over 22,000 hours of performed music. Our methodological contributions are: (1) a MIDI quality classifier to identify and remove low-quality transcriptions, and (2) RAScoP, a novel pipeline that refines raw note alignments by correcting temporal errors and interpolating missing notes. PianoCoRe is released in tiered subsets (C, B, A, and A*) tailored for different applications, from large-scale pre-training to expressive performance rendering with precise note alignments. PianoCoRe-A, the note-aligned subset, is the largest available score-aligned performance MIDI dataset, containing 160,207 performances aligned to 1,697 musical scores. Experimental validation shows the effectiveness of our processing pipeline and improved out-of-domain generalization of the performance rendering model trained on a complete aligned dataset. PianoCoRe provides a robust, ready-to-use foundation for the next generation of expressive piano performance modeling.
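To make the idea of interpolating missing notes concrete, the sketch below fills in a missing performance onset by linear interpolation between the nearest aligned notes before and after it in score time. It is not the RAScoP implementation, whose details are not given in this summary; the function name `interpolate_missing_onsets` and the input layout (parallel lists of score onsets and performance onsets, with `None` marking an unaligned note) are assumptions made for illustration.

```python
def interpolate_missing_onsets(score_onsets, perf_onsets):
    """Fill missing performance onsets by linear interpolation.

    score_onsets: score positions (e.g. in beats), sorted ascending.
    perf_onsets:  performed times in seconds, or None where the note
                  has no aligned performance event.
    Notes without an aligned neighbor on both sides are left as None
    (no extrapolation).
    """
    known = [(s, p) for s, p in zip(score_onsets, perf_onsets) if p is not None]
    filled = list(perf_onsets)
    for i, (s, p) in enumerate(zip(score_onsets, perf_onsets)):
        if p is not None:
            continue
        left = right = None
        for ks, kp in known:
            if ks <= s:
                left = (ks, kp)   # latest aligned note at or before s
            else:
                right = (ks, kp)  # first aligned note strictly after s
                break
        if left is None or right is None:
            continue
        # right[0] > s >= left[0], so the denominator is never zero.
        frac = (s - left[0]) / (right[0] - left[0])
        filled[i] = left[1] + frac * (right[1] - left[1])
    return filled
```

For example, with score onsets `[0, 1, 2, 3]` and performance onsets `[0.0, None, 1.0, 1.6]`, the missing second note falls halfway between its aligned neighbors in score time and is therefore placed halfway between them in performance time.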
@article{anonymous2025pianocore,
author = {Anonymous Authors},
title = {PianoCoRe: Combined and Refined Piano {MIDI} Dataset},
journal = {Anonymous Submission},
year = {2025},
}