Pianist Transformer

Towards Expressive Piano Performance Rendering via Scalable Self-Supervised Pre-Training

Hong-Jie You, Jie-Jing Shao, Xiao-Wen Yang, Lin-Han Jia, Lan-Zhe Guo, Yu-Feng Li

Abstract

Existing methods for expressive music performance rendering rely on supervised learning over small labeled datasets, which limits the scaling of both data volume and model size, despite the availability of vast unlabeled music, as in vision and language. To address this gap, we introduce Pianist Transformer, with four key contributions: 1) a unified Musical Instrument Digital Interface (MIDI) data representation for learning the shared principles of musical structure and expression without explicit annotation; 2) an efficient asymmetric architecture, enabling longer contexts and faster inference without sacrificing rendering quality; 3) a self-supervised pre-training pipeline with 10B tokens and a 135M-parameter model, unlocking data- and model-scaling advantages for expressive performance rendering; 4) a state-of-the-art performance model that achieves strong objective metrics and human-level subjective ratings. Overall, Pianist Transformer establishes a scalable path toward human-like performance synthesis in the music domain.
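
As a concrete illustration of what a unified MIDI representation can look like, the minimal Python sketch below tokenizes a note list into a flat stream of event tokens. The token names (NOTE_ON, NOTE_OFF, TIME_SHIFT, VEL) and the velocity bucketing are hypothetical choices for this sketch, not the vocabulary actually used by Pianist Transformer.

from dataclasses import dataclass

@dataclass
class Note:
    pitch: int     # MIDI pitch, 0-127
    velocity: int  # MIDI velocity, 1-127
    start: int     # onset time in ticks
    end: int       # offset time in ticks

def tokenize(notes: list[Note], max_shift: int = 480) -> list[str]:
    # Hypothetical event vocabulary; the paper defines the real one.
    events = []  # (time, tie-break priority, token)
    for n in notes:
        events.append((n.start, 1, f"VEL_{n.velocity // 8}"))  # coarse velocity bucket
        events.append((n.start, 2, f"NOTE_ON_{n.pitch}"))
        events.append((n.end, 0, f"NOTE_OFF_{n.pitch}"))
    events.sort()

    tokens, now = [], 0
    for time, _, token in events:
        gap = time - now
        while gap > 0:  # split long pauses into bounded time shifts
            shift = min(gap, max_shift)
            tokens.append(f"TIME_SHIFT_{shift}")
            gap -= shift
        now = time
        tokens.append(token)
    return tokens

melody = [Note(60, 80, 0, 240), Note(64, 90, 240, 480)]
print(tokenize(melody))
# ['VEL_10', 'NOTE_ON_60', 'TIME_SHIFT_240', 'NOTE_OFF_60',
#  'VEL_11', 'NOTE_ON_64', 'TIME_SHIFT_240', 'NOTE_OFF_64']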

Architecture diagram of Pianist Transformer showing the workflow from pre-training on a massive unlabeled corpus to supervised fine-tuning and inference.
The overall architecture and workflow of Pianist Transformer, featuring a unified tokenizer and an asymmetric Transformer for efficient, high-quality performance rendering.
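
To see why an asymmetric layout helps at inference time, consider the PyTorch sketch below. It assumes the asymmetry pairs a deep encoder reading the full score with a shallow autoregressive decoder (the layer counts and dimensions here are made up; the paper specifies the actual architecture): the expensive encoder runs once per score, while each generation step only pays for the cheap decoder.

import torch
import torch.nn as nn

# Assumed asymmetric layout: most parameters in the encoder (run once per
# score), few in the decoder (run at every autoregressive step).
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=10,  # hypothetical depth
    num_decoder_layers=2,   # hypothetical depth
    batch_first=True,
)

score = torch.randn(1, 2048, 512)       # long score context
performance = torch.randn(1, 512, 512)  # expressive tokens generated so far
out = model(score, performance)         # decoder output: (1, 512, 512)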

Audio Demonstrations

The following examples compare three versions of each piece: the mechanical playback of the score, the performance generated by our model, and a recording of a human performance. These comparisons highlight the nuances the model generates in timing (rubato), dynamics (volume), articulation, and pedaling.

Score · Pianist Transformer · Human Performance
Bach, J. S. — Prelude, BWV 885
Haydn, J. — Sonata No. 58 in C Major, Hob. XVI:48, II
Mozart, W. A. — Sonata No. 9 in A minor, K. 310, I
Beethoven, L. v. — Sonata No. 7 in D Major, Op. 10 No. 3, I
Chopin, F. — Sonata No. 3 in B minor, Op. 58, IV
Liszt, F. — Étude d’exécution transcendante No. 1 “Preludio”, S. 139
Ravel, M. — Miroirs, III “Une barque sur l’océan”

Subjective Evaluation Results

We conducted a blind listening study in which participants rated performances from our model, baseline models, and human pianists. The results indicate that our model's performances were rated as highly as the human performances and were significantly preferred over the baselines.

Graph showing subjective preference ranking results, where 'Ours' is ranked highest, slightly above 'Human'.
Figure 3 from our paper: The average rank of our Pianist Transformer is statistically indistinguishable from that of the human performances, demonstrating strong listener appeal.

Editable MIDI Output

The model generates a standard MIDI file containing an editable tempo map created by our Expressive Tempo Mapping algorithm; the tempo map captures the performance's detailed timing variations. The resulting file can be imported directly into any Digital Audio Workstation (DAW) for further editing and use in music production.

A GIF showing a MIDI file being dragged into a DAW, with the tempo track visible and fluctuating during playback.
The generated performance can be imported into any DAW, preserving expressive timing as an editable tempo map.
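
As a minimal sketch of how such a tempo map ends up in a standard MIDI file, the snippet below uses the open-source mido library with made-up BPM values; it illustrates the file format only, not the Expressive Tempo Mapping algorithm itself.

import mido

TICKS_PER_BEAT = 480

def write_tempo_map(beat_bpms, path="performance.mid"):
    # One set_tempo meta event per beat makes the DAW's tempo track
    # fluctuate with the performance; note events would go in further tracks.
    mid = mido.MidiFile(ticks_per_beat=TICKS_PER_BEAT)
    track = mido.MidiTrack()
    for i, bpm in enumerate(beat_bpms):
        delta = 0 if i == 0 else TICKS_PER_BEAT  # one beat between events
        track.append(mido.MetaMessage("set_tempo",
                                      tempo=mido.bpm2tempo(bpm), time=delta))
    mid.tracks.append(track)
    mid.save(path)

write_tempo_map([120, 116, 108, 96])  # a gentle ritardando over four beats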

Try the GUI

To make our model more accessible, we offer a desktop GUI for easy experimentation. This provides a user-friendly interface for generating expressive performances without writing any code. For download and instructions, please visit our GitHub repository.

Screenshot of the Pianist Transformer GUI application.
The Pianist Transformer GUI provides an intuitive interface for generating and comparing expressive performances.
Get the GUI on GitHub

Citation

If you find our work useful, please consider citing our paper:

@misc{you2025pianisttransformerexpressivepiano,
      title={Pianist Transformer: Towards Expressive Piano Performance Rendering via Scalable Self-Supervised Pre-Training}, 
      author={Hong-Jie You and Jie-Jing Shao and Xiao-Wen Yang and Lin-Han Jia and Lan-Zhe Guo and Yu-Feng Li},
      year={2025},
      eprint={2512.02652},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}