Pianist Transformer
Towards Expressive Piano Performance Rendering via Scalable Self-Supervised Pre-Training
Abstract
Existing methods for expressive music performance rendering rely on supervised learning over small labeled datasets, which limits the scaling of both data volume and model size, even though vast amounts of unlabeled music are available, as in vision and language. To address this gap, we introduce Pianist Transformer, with four key contributions: 1) a unified Musical Instrument Digital Interface (MIDI) data representation for learning the shared principles of musical structure and expression without explicit annotation; 2) an efficient asymmetric architecture, enabling longer contexts and faster inference without sacrificing rendering quality; 3) a self-supervised pre-training pipeline with 10B tokens and a 135M-parameter model, unlocking the data- and model-scaling advantages for expressive performance rendering; 4) a state-of-the-art performance model that achieves strong objective metrics and human-level subjective ratings. Overall, Pianist Transformer establishes a scalable path toward human-like performance synthesis in the music domain.
Audio Demonstrations
The following examples compare three versions of each piece: a mechanical playback of the score, the performance generated by our model, and a recording of a human performance. These comparisons highlight the nuances the model generates in timing (rubato), dynamics (volume), articulation, and pedaling.
Each piece is presented in the three versions (score playback, Pianist Transformer, human performance):

- Bach, J. S. — Prelude, BWV 885
- Haydn, J. — Sonata No. 58 in C major, Hob. XVI:48, II
- Mozart, W. A. — Sonata No. 9 in A minor, K. 310, I
- Beethoven, L. v. — Sonata No. 7 in D major, Op. 10 No. 3, I
- Chopin, F. — Sonata No. 3 in B minor, Op. 58, IV
- Liszt, F. — Étude d’exécution transcendante No. 1 “Preludio”, S. 139
- Ravel, M. — Miroirs, III “Une barque sur l’océan”
Subjective Evaluation Results
We conducted a blind listening study in which participants rated performances from our model, baseline models, and human pianists. The results indicate that our model's performances were rated as highly as the human performances and were significantly preferred over the baselines.
Editable MIDI Output
The model generates a standard MIDI file containing an editable tempo map created by our Expressive Tempo Mapping algorithm. This tempo map captures the performance's fine-grained timing variations, and the resulting file can be imported directly into any Digital Audio Workstation (DAW) for further editing and use in music production.
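Because the output is a standard MIDI file, the tempo map can also be inspected or post-processed with any general-purpose MIDI library. Below is a minimal sketch using the third-party mido package (our choice for illustration; the filename performance.mid is a placeholder, not part of the release) that prints every tempo change the rendered file contains:

```python
import mido  # third-party MIDI library: pip install mido

# Load a performance rendered by the model ("performance.mid" is a placeholder name).
mid = mido.MidiFile("performance.mid")

# Walk each track and print the tempo map: absolute tick positions with tempi in BPM.
for track in mid.tracks:
    tick = 0
    for msg in track:
        tick += msg.time  # message times are delta ticks
        if msg.type == "set_tempo":
            # msg.tempo is stored as microseconds per quarter note
            print(f"tick {tick:>8}: {mido.tempo2bpm(msg.tempo):7.2f} BPM")
```

Since the expressive timing lives in ordinary set_tempo meta messages, a DAW (or a short script like this) can rescale or redraw the tempo curve without altering the notes themselves.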
Try the GUI
To make our model more accessible, we offer a desktop GUI that generates expressive performances without writing any code. For downloads and instructions, please visit our GitHub repository.
Citation
If you find our work useful, please consider citing our paper:
@misc{you2025pianisttransformerexpressivepiano,
  title={Pianist Transformer: Towards Expressive Piano Performance Rendering via Scalable Self-Supervised Pre-Training},
  author={Hong-Jie You and Jie-Jing Shao and Xiao-Wen Yang and Lin-Han Jia and Lan-Zhe Guo and Yu-Feng Li},
  year={2025},
  eprint={2512.02652},
  archivePrefix={arXiv},
  primaryClass={cs.SD}
}