Hi all, I have been working on a method of "Compositor handoffs"--allowing clients to switch between compositors at runtime--and I wanted to share my progress with everyone and gather some feedback.
# Overview

The design is very simple at its core. A client knows everything about the state of its windows and still owns all appropriate memory. In the event of a compositor exit, the client should be able to just send everything again as if it were a brand new client.

I have successfully ported Qt, SDL and Xwayland to do this, and they have worked with practically every application using those toolkits that I have thrown at them.

The primary goal of this is to handle a compositor crash. Whilst rare, crashes happen to some extent on all of the desktops, and with Wayland still being such a dynamic landscape and taking on more and more responsibilities for security reasons, this isn't something that will end anytime soon. We want to make sure we don't have a path where users can lose data.

Long term this could solve other future goals, like switching between compositors across graphics cards, supporting CRIU (checkpoint/restore in userspace) for GUI applications for mobile, or dynamically switching between the active compositor and waypipe, all of which will be really interesting new features.

It also makes the developer experience a billion times better, as I no longer need to log out to actually test changes, or only test in contrived nested situations.

# Slightly more detail

## The compositor

The only change we needed to make in the compositor was making our public-facing socket (i.e. /var/run/.../wayland-0) persist on the file system. We run a small helper that creates the socket and then spawns our real compositor. If our compositor crashes, the replacement re-uses the same socket. Clients still receive the same error about being disconnected, but there are no races should a client try to reconnect whilst a compositor is starting.

systemd's socket activation would also be an option here, though we ended up writing something custom.

## Toolkit changes

Whilst this appears to be an invasive thing to retroactively insert, it turns out to be relatively unintrusive and small.
Most of the tasks we need to do involve functions that already exist anyway. We need to:

 - reset every wl_output (which can happen at runtime already)
 - reset input devices (which can happen at runtime already)
 - reset our data devices (which can get cancelled at runtime already)
 - reset our wl_surface/xdg_shell objects
 - send new contents (which happens already)

So for the most part we're just hooking into existing code.

## SDL:
https://github.com/davidedmundson/SDL/ branch reconnect

With the above patchset, if the compositor crashes, all SDL apps will reconnect. This was tested against everything in SDL's test directory as well as, obviously, supertuxkart, the most important client on all our machines.

This is a good example to look at to see the extent of toolkit changes. The changes come in at around 200 lines, and the vast majority of that is in the cursor code.

## Qt (Qt5 version)
https://invent.kde.org/davidedmundson/qtwayland/-/tree/reconnect_main

Qt's usage is slightly more complicated than SDL's, but the changes are still relatively small and manageable. I have been running this for the best part of 6 months, and everything "just works". I've thrown games, full IDEs (kdevelop) and image editors at this and everything works flawlessly without client changes.

This is slightly held back by us having to be API and ABI compatible with the frozen Qt5 base. A change landing in Qt6 could potentially address this more cleanly, especially the part about refreshing the QBackingStore.

Note that if running Qt on Plasma we have a "plasma-integration" plugin that also makes some Wayland calls, and it has also needed adjusting. The wiki (https://invent.kde.org/plasma/kwin/-/wikis/Restarting) lists all repositories and pending changes.

## Xwayland (somewhat WIP):
https://gitlab.freedesktop.org/davidedmundson/xserver/-/tree/reconnections

Xwayland was a whole new adventure, but the code for resending windows was very simple, even tackling details like pointer constraints.
Being Xwayland, it even retains positional information, as the X client itself remains intact.

The biggest challenge here is that the compositor typically starts Xwayland and does so with custom file descriptors for direct connections. That obviously doesn't work for handling the compositor changing. This required quite a lot of reshuffling about--moving Xwayland spawning to the wrapping helper, as well as adding new mutexes to handle startup being announced. Ultimately everything was doable. Messy branches for that are available on request.

## Other toolkits

One nice thing is that it's completely opt-in on the client. For some system services (e.g. plasmashell) just restarting is a perfectly acceptable approach. For wl-paste, it may as well just exit. Firefox has such good in-built crash handling that I just wrapped the executable in a tiny wrapper and it works well enough.

We don't need to have one solution fitting everything. Ultimately the absolute worst case is that the application quits/crashes, which is exactly what happens now.

## Handling OpenGL
https://gitlab.freedesktop.org/davidedmundson/mesa/ branch reconnect

OpenGL presented the biggest challenge. There is only one EGLDisplay object, which persists for the lifespan of the application, and the wl_display is passed in as an argument when it is created. We therefore need to keep the same wl_display object across connections.

This has meant we need to change both Mesa and libwayland. When a client reconnects after an error, a signal is emitted which is picked up by Mesa. Any globals are then replaced, so new objects are created against the new connection.

With these changes the rest is easy. We need to recreate the EGLWindow against a new wl_surface on the new connection, but the context survives completely untouched and we can render a new frame right away.
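
To make the OpenGL flow above a bit more concrete, here is a toy model of it. This is only a sketch under loud assumptions: `ToyDisplay`, `ToyGLStack` and every method on them are hypothetical names invented for illustration, and none of this is the actual libwayland, EGL or Mesa API from the branches. It only shows the shape of the idea: the display object stays the same across connections, a reconnect notification lets the GL side rebind its globals, the window surface is recreated lazily, and the context is never touched.

```python
class ToyDisplay:
    """Hypothetical stand-in for a wl_display that outlives one connection."""

    def __init__(self):
        self.generation = 0      # which connection we are currently on
        self._listeners = []     # callbacks to run after each reconnect

    def on_reconnect(self, callback):
        self._listeners.append(callback)

    def reconnect(self):
        # Same object, new connection: anything holding a reference to
        # this display (the toy GL stack below) keeps a valid handle.
        self.generation += 1
        for callback in list(self._listeners):
            callback(self)


class ToyGLStack:
    """Hypothetical stand-in for the EGL side: one context, one window."""

    def __init__(self, display):
        self.display = display   # fixed for the application's lifetime
        self.context_id = 1      # assumption: the context is never recreated
        self.globals_generation = display.generation
        self.surface_generation = None
        self.surfaces_created = 0
        display.on_reconnect(self._rebind_globals)
        self._create_window_surface()

    def _rebind_globals(self, display):
        # Replace anything that was bound against the old connection.
        self.globals_generation = display.generation

    def _create_window_surface(self):
        self.surface_generation = self.display.generation
        self.surfaces_created += 1

    def draw_frame(self):
        # Lazily recreate the window surface if it belongs to an old
        # connection; the context is reused as-is.
        if self.surface_generation != self.display.generation:
            self._create_window_surface()
        return (self.context_id, self.surface_generation)
```

In this model, a client would call `draw_frame()` as usual after `reconnect()`; the first frame on the new connection pays for one surface recreation and nothing else.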

## libwayland changes:
https://gitlab.freedesktop.org/davidedmundson/wayland/-/tree/reconnections

As mentioned above, we have a new method called "reconnect" and a signal that is emitted to all subscribers once reconnect is called. As it introduces new API, all other changes rely on it.

Another big change here is the handling of dangling objects. After a client reconnects, all proxy objects are marked as "defunct" and we no longer send any requests on these objects, as the compositor is obviously unaware of these IDs. However, calling the proxy's destructor still needs to work to free memory.

By doing this inside libwayland we are able to make it quite easy for toolkits to recreate objects "lazily". This has proved very useful when Wayland proxies are hidden inside opaque pointers inside the toolkit, or if rendering is on another thread.

# Expected questions

## Pics or it didn't happen

Here is a video of restoration working fully:
https://home.davidedmundson.co.uk/index.php/s/KbQyb3eiBodTcFm

## If the changes address only toolkits, what if I use low level wayland proxies in my apps?

The relevant change (watching for the signal on the wl_display and recreating a registry, new globals and new objects) would need to be redone in the client. Within KDE, at least, direct use of proxies outside the toolkit and a few select libraries is very, very rare.

## What happens to the clipboard?

The current selection is lost, just as if the client had closed. A clipboard manager can alleviate this. Every new selection afterwards behaves as expected.

## What about window size/position/stacking order?

Because everything is treated as a new window (from a compositor POV), stacking order and positioning are effectively random. Size is often maintained, as the clients do know their original size.

Session management (https://gitlab.freedesktop.org/wayland/wayland-protocols/-/merge_requests/18) is a possible solution to this.

## Could it be done without toolkit changes via a magic proxy?
Possibly, but I don't think it adds anything. Clients are the canonical source of what a client needs, and know best how to handle any sudden changes. There is also zero overhead this way.

## What if the EGLDisplay capabilities change?

My Mesa change keeps the same connection to the card specified by the first wl_drm.device event from the first Wayland connection. If the new compositor passes a different device, it is ignored. That definitely has some theoretical quirks, but none that have come up in practice for the immediate task at hand.

## What if the compositor has no support?

Then the socket on the file system will disappear, and clients will fail to reconnect and quit. We also use this in clients to detect a graceful KWin exit and just exit accordingly.

## What about Vulkan?

It will need similar changes to what we've done for OpenGL.

## That was too much text

There will be an XDC talk (https://indico.freedesktop.org/event/1/contributions/20/) where I will describe everything mentioned here.

# Next steps

The changes to libwayland are the most invasive, and potentially the most controversial. I wanted to send an overview email and get a discussion going before I submitted merge requests. Unfortunately it's the blocker on everything else.