This entire website is fully anonymous to support double-blind peer review.
Link to Anonymized PDF paper.
Really? Yet another MIDI inpainting paper? Yes! Sure, MIDI inpainting's been demonstrated many times, but can we make it easier to do, and more expressive? Yes!
In this study, we obtain performance comparable to prior methods while using a new(er) Hierarchical Diffusion Transformer (HDiT) that keeps things simple and easy, and handles big images (long sequences) too! And we can inpaint uniquely shaped regions for our melodies and accompaniments.
But why merely solo piano MIDI??
The point of this study is not piano or MIDI per se; it's about exploring how to control generative music models, and piano MIDI is just a nice, compact data representation with which to conduct these investigations. Transformer-based approaches to music modeling tend to offer a limited suite of user-control opportunities compared to diffusion models. This paper is an early exploration of the simple idea: "what if we took advantage of the prolific work done on controllable image diffusion methods and applied it toward music generation?"
Examples
I. Click here for example Subjective Evaluation (Listening) Test
II. Example Generations below... (Work in Progress)
Example of "Drawn" Melody:
Original
PoM Undirected Melody
PoM-Drawn - RePaint=1 👎
PoM-Drawn Melody - RePaint=2 😀
PoM-Drawn - RePaint=4 👎
Trying to Spell Musical Words (Like Jacob Collier)
...doesn't sound amazing. We need to crank up the RePaint parameter to get enough notes to read the words, but more RePaint seems to introduce more randomness. Conditioning on chords might help, but that part of the code isn't working yet. What we can do instead is use a lower value of RePaint (say, 3) and then "ReMask" a few times, i.e., run the sampling again with a new mask in which the previously-generated notes are left alone. ReMask-ing with a lower RePaint value seems to preserve the "musicality" better than the randomness we got from cranking RePaint up.
(This ReMask stuff is not in the preprint, BTW, because of page limits & because I didn't develop the idea much until after submission. I can add it to the final paper, demo, & code later; for now you have to do the iteration manually yourself. A rough sketch of the loop is below.)
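Here's a minimal sketch of that ReMask loop, just to make the idea concrete. The function name `sample_inpaint`, the array layout, and the mask convention are placeholders rather than the actual code from this project; only the control flow matters: each pass inpaints with a modest RePaint value, then the freshly generated notes are dropped from the editable mask so the next pass leaves them alone.

```python
def remask(sample_inpaint, piano_roll, region_mask, n_remask=3, repaint=3):
    """Repeatedly inpaint a region, freezing notes generated on earlier passes.

    piano_roll     : (time, pitch) NumPy-like array; nonzero entries are notes
    region_mask    : boolean array, True where the model may write
    sample_inpaint : stand-in for the diffusion inpainting sampler
    """
    editable = region_mask.copy()
    for _ in range(n_remask):
        # One inpainting pass with a modest RePaint setting.
        piano_roll = sample_inpaint(piano_roll, mask=editable, repaint=repaint)
        # Freeze whatever was just generated: drop those cells from the mask.
        editable = editable & (piano_roll == 0)
    return piano_roll
```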
Interactive Demo
Iterative Workflow Idea: Once the output image is generated, download it, upload it as a new input image, edit it via drawing, then re-run the model! (A small scripting sketch of the same round trip is below, for anyone who prefers to edit the image outside the demo.)
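A minimal sketch of that round trip, assuming the demo exchanges piano rolls as ordinary image files; the filenames, pixel coordinates, and note sizes here are made up for illustration.

```python
from PIL import Image, ImageDraw

# The output image downloaded from the demo (filename is hypothetical).
img = Image.open("generated_output.png").convert("RGB")

# Hand-draw a few extra "notes" as short horizontal bars.
# x = time, y = pitch in piano-roll pixel coordinates; values are arbitrary.
draw = ImageDraw.Draw(img)
for x0, y0, length in [(40, 60, 8), (52, 55, 8), (64, 48, 12)]:
    draw.rectangle([x0, y0, x0 + length, y0 + 1], fill=(255, 255, 255))

# Save, upload this as the new input image, and re-run the model.
img.save("edited_input.png")
```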
If the demo below is "down", there is a HuggingFace Spaces version of it, but you'll have to search for it yourself, since linking it here would break double-blind anonymity.