On Mon, 16 Aug 2021 14:13:06 +0100 David Edmundson <da...@davidedmundson.co.uk> said:
FYI we did this a few years back for efl and enlightenment... on a loss
of the compositor socket, the toolkit retries connecting and comes back.
we added an extended protocol for the compositor to send a UUID to the
client per surface, and clients on reconnect provide that UUID to the
compositor - this allows the compositor to fix all the stacking and other
state when the surfaces come back. :)

https://git.enlightenment.org/core/enlightenment.git/tree/src/protocol/session-recovery.xml

pretty simple - you set_uuid if you are recovering and set the previous
uuid you had. if not and it's a fresh new surface you get_uuid to get the
compositor provided uuid and then store this for recovery purposes. on a
CLEAN destroy of your window you destroy_uuid(). these uuids can also be
used for full sessions (eg shut down machine, boot again and then have all
your windows opened up in their previous state - so the compositor can
store the uuids on disk and client apps can do the same).

you might want to give this a go and then all the rest of the state is
solved. i think in theory efl apps should then work with your hand-off
(for us it's session recovery but that is your intent too - handling
compositor crashes and/or on the fly upgrades which require it to
re-execute).

> Hi all,
>
> I have been working on a method of "Compositor handoffs"--allowing
> clients to switch between compositors at runtime--and I wanted to
> share my progress with everyone and gather some feedback.
>
> # Overview
>
> The design is very simple at its core. A client knows everything
> about the state of its windows and still owns all appropriate memory.
> In the event of a compositor exit, the client should be able to just
> send everything again like it's a brand new client.
>
> I have successfully ported Qt, SDL and XWayland to do this, and they
> have worked with practically every application using those toolkits
> that I have thrown at them.
>
> The primary goal of this is to handle a compositor crash.
> Whilst rare,
> it happens to some extent to all of the desktops, and with wayland
> still being such a dynamic landscape and taking on more and more
> responsibilities for security reasons this isn't something that will
> end anytime soon. We want to make sure we don't have a path where
> users can lose data.
>
> Long term this could solve other future goals, like switching between
> compositors across graphics cards, supporting CRIU (checkpoint restore
> in userspace) for GUI applications for mobile, or dynamically
> switching between the active compositor and waypipe. All of which will
> be really interesting new features.
>
> It also makes the developer experience a billion times better, as I am
> no longer needing to log out to actually test changes, or only test in
> contrived nested situations.
>
> # Slightly more detail
>
> ## The compositor
>
> The only change we needed to make in the compositor was making our
> public facing socket (i.e. /var/run/.../wayland-0) persist on the file
> system. We run a small helper that creates the socket then spawns our
> real compositor. If our compositor crashes, we re-use the same socket.
> Clients still receive the same error about being disconnected, but
> there are no races should a client try to reconnect whilst a
> compositor is starting.
>
> systemd's socket activation would also be an option here, though we
> ended up writing something custom.
>
> ## Toolkit changes
>
> Whilst this appears as an invasive thing to retroactively insert, it
> turns out to be relatively unintrusive and small. Most of the tasks we
> need to do involve functions that already exist anyway.
>
> We need to:
> - reset every wl_output (which can happen at runtime already)
> - reset input devices (which can happen at runtime already)
> - reset our data devices (which can get cancelled at runtime already)
> - reset our wl_surface/xdg_shell objects
> - send new contents (which happens already)
>
> So for the most part we're just hooking into existing code.
>
> ## SDL: https://github.com/davidedmundson/SDL/ branch reconnect
>
> With the above patchset if the compositor crashes, all SDL apps will
> reconnect. This was tested against everything in SDL's test directory
> as well as obviously supertuxkart, the most important client on all
> our machines.
>
> This is a good example to look at to see the extent of toolkit
> changes: they come in at around 200 lines, and the vast
> majority of that is in the cursor code.
>
> ## Qt (Qt5 version)
> https://invent.kde.org/davidedmundson/qtwayland/-/tree/reconnect_main
>
> Qt's usage is slightly more complicated than SDL's, but the
> changes are still relatively small and manageable.
>
> I have been running this for the best part of 6 months, and everything
> "just works". I've thrown games, full IDEs (kdevelop) and image
> editors at this and everything works flawlessly without client
> changes.
>
> This is slightly held back by us having to be API and ABI compatible
> with the frozen Qt5 base; a change landing in Qt6 could potentially
> address this more cleanly, especially the part about refreshing the
> QBackingStore.
>
> Note that if running Qt on Plasma we have a "plasma-integration"
> plugin that also makes some wayland calls and has also needed
> adjusting. The wiki
> (https://invent.kde.org/plasma/kwin/-/wikis/Restarting) lists all
> repositories and pending changes.
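
A rough way to picture the "just send everything again" idea from the
quoted text, as a toy model only. Nothing below talks to a real Wayland
socket: FakeConnection merely records requests, and every class, method
and request name is a hypothetical stand-in rather than anything from
the SDL or Qt patches.

```python
from dataclasses import dataclass, field


class ConnectionLost(Exception):
    """Raised by FakeConnection once the pretend compositor has gone away."""


@dataclass
class FakeConnection:
    alive: bool = True
    requests: list = field(default_factory=list)

    def send(self, *request):
        if not self.alive:
            raise ConnectionLost()
        self.requests.append(request)


@dataclass
class Window:
    # State the client owns, so it can be resent at any time.
    title: str
    size: tuple
    contents: bytes

    def announce(self, conn):
        conn.send("create_surface", self.title)
        conn.send("configure", self.title, self.size)
        conn.send("attach_and_commit", self.title, self.contents)


class Toolkit:
    def __init__(self, connect):
        self._connect = connect      # callable returning a fresh connection
        self.conn = connect()
        self.windows = []

    def show(self, window):
        self.windows.append(window)
        self._with_reconnect(lambda: window.announce(self.conn))

    def redraw(self, window, contents):
        window.contents = contents   # update client-side state first
        self._with_reconnect(
            lambda: self.conn.send("attach_and_commit", window.title, contents))

    def _with_reconnect(self, action):
        try:
            action()
        except ConnectionLost:
            self._reconnect()

    def _reconnect(self):
        # Behave like a brand-new client: new connection, resend all state.
        self.conn = self._connect()
        for window in self.windows:
            window.announce(self.conn)


connections = []


def connect():
    connections.append(FakeConnection())
    return connections[-1]


tk = Toolkit(connect)
win = Window("editor", (800, 600), b"frame-1")
tk.show(win)
connections[0].alive = False     # simulate a compositor crash
tk.redraw(win, b"frame-2")       # fails, reconnects, replays the window
```

Because the window's contents are updated before the send is attempted,
the replay after reconnecting already carries the newest frame, so the
failed request does not need to be retried separately.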
>
> ## XWayland (somewhat WIP):
> https://gitlab.freedesktop.org/davidedmundson/xserver/-/tree/reconnections
>
> Xwayland was a whole new adventure, but the code for resending windows
> was very simple, even tackling details like pointer constraints. Being
> xwayland it even retains positional information as the X client itself
> remains intact.
>
> The biggest challenge here is that the compositor typically starts
> Xwayland and does so with custom file descriptors for direct
> connections. That obviously doesn't work for handling the compositor
> changing. This required quite a lot of reshuffling about--moving
> Xwayland spawning to the wrapping helper, as well as adding new
> mutexes to handle startup being announced. Ultimately everything was
> do-able. Messy branches for that are available on request.
>
> ## Other toolkits.
>
> One nice thing is that it's completely opt-in on the client. For some
> system services (e.g. plasmashell) just restarting is a perfectly
> acceptable approach. For wl-paste, it may as well just exit. Firefox
> has such good in-built crash handling that I just wrapped the
> executable in a tiny wrapper and it works well enough. We don't need
> to have one solution fitting everything.
>
> Ultimately the absolutely worst case is the application quits/crashes,
> which is exactly what happens now.
>
> ## Handling OpenGL https://gitlab.freedesktop.org/davidedmundson/mesa/
> branch reconnect
>
> OpenGL presented the biggest challenge. We only have the one
> EGLDisplay object persist for the lifespan of the application, we pass
> the wl_display constructor in as an argument, and therefore we need to
> persist the same wl_display object across connections.
>
> This has meant we need to change both Mesa and libwayland:
>
> When a client reconnects after an error a signal is emitted which is
> picked up by mesa. Any globals are then replaced, so new objects are
> created against the new connection.
>
> With these changes the rest is easy.
> We need to recreate the EGLWindow
> against a new wl_surface on the new connection, but the context
> survives completely untouched and we can render a new frame right
> away.
>
> ## libwayland changes:
> https://gitlab.freedesktop.org/davidedmundson/wayland/-/tree/reconnections
>
> As mentioned above, we have a new method called "reconnect" and a
> signal that is emitted to all subscribers once reconnect is called. As
> it introduces new API all other changes rely on it.
>
> Another big change here is handling of dangling objects. After a
> client reconnects all proxy objects are marked as "defunct" and we no
> longer send any requests on these objects, as the compositor is
> obviously unaware of these IDs. However calling the proxy's destructor
> needs to still work to free memory. By doing this inside libwayland
> we are able to make it quite easy for toolkits to recreate objects
> "lazily". This has proved very useful when wayland proxies are hidden
> inside opaque pointers inside the toolkit, or if rendering is on
> another thread.
>
> # Expected questions
>
> ## Pics or it didn't happen
>
> Here is a video of restoration working fully:
> https://home.davidedmundson.co.uk/index.php/s/KbQyb3eiBodTcFm
>
> ## If the changes address only toolkits, what if I use low level
> wayland proxies in my apps?
>
> The relevant change (watching for the signal on the wl_display and
> recreating a registry, new globals and new objects) would need to be
> redone in the client. Within KDE at least, direct use of proxies
> outside the toolkit and a few select libraries is very very rare.
>
> ## What happens to the clipboard
>
> The current selection is lost, just like if the client closed. A
> clipboard manager can alleviate this. Every new selection afterwards
> behaves as expected.
>
> ## What about window size/position/stacking order
>
> Because everything is treated as new windows (from a compositor POV)
> stacking order and positioning is effectively random.
> Size is often
> maintained as the clients do know their original size. Session
> management
> (https://gitlab.freedesktop.org/wayland/wayland-protocols/-/merge_requests/18)
> is a possible solution to this.
>
> ## Could it be done without toolkit changes via a magic proxy?
>
> Possibly, but I don't think it adds anything. Clients are the
> canonical source of what a client needs, and know how to handle any
> sudden changes the best. There is also zero overhead this way.
>
> ## What if the EGLDisplay capabilities change?
>
> My mesa change keeps the same connection to the card specified by the
> first wl_drm.device event from the first wayland connection. If the
> new compositor passes a different device it is ignored.
> That definitely has some theoretical quirks, but none that have come
> up in practice for the immediate task at hand.
>
> ## What if the compositor has no support?
>
> Then the socket on the file system will disappear and clients will
> fail to reconnect and quit. We also use this in clients to detect a
> graceful kwin exit and just exit accordingly.
>
> ## What about vulkan?
>
> It will need similar changes to what we've done for OpenGL.
>
> ## That was too much text
>
> There will be an XDC talk
> (https://indico.freedesktop.org/event/1/contributions/20/) where I
> will describe everything mentioned here.
>
> # Next steps
>
> The changes to libwayland are the most invasive, and the most
> potentially controversial. I wanted to send an overview email and get
> a discussion going before I submitted merge requests. Unfortunately
> it's the blocker on everything else.
>

--
------------- Codito, ergo sum - "I code, therefore I am" --------------
Carsten Haitzler - ras...@rasterman.com
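
For the stacking-order question, the set_uuid / get_uuid / destroy_uuid
flow described in the reply at the top of this mail can be modelled
roughly as below. This is a toy sketch of the bookkeeping only, not the
session-recovery.xml wire protocol: the Compositor and ClientSurface
classes, the saved_stack argument and the use of a plain list for
stacking order are all assumptions made for illustration.

```python
import uuid


class Compositor:
    def __init__(self, saved_stack=()):
        # Bottom-to-top stacking order keyed by UUID; saved_stack stands in
        # for whatever the compositor persisted before it went away.
        self.stack = list(saved_stack)
        self.mapped = set()

    def get_uuid(self):
        """Fresh surface: hand out a new UUID and place it on top."""
        new_id = str(uuid.uuid4())
        self.stack.append(new_id)
        self.mapped.add(new_id)
        return new_id

    def set_uuid(self, old_id):
        """Recovering surface: reuse its remembered slot if we know the UUID."""
        if old_id not in self.stack:
            self.stack.append(old_id)
        self.mapped.add(old_id)

    def destroy_uuid(self, old_id):
        """Clean window destroy: forget the UUID entirely."""
        if old_id in self.stack:
            self.stack.remove(old_id)
        self.mapped.discard(old_id)

    def visible_stack(self):
        return [i for i in self.stack if i in self.mapped]


class ClientSurface:
    def __init__(self, name):
        self.name = name
        self.uuid = None             # stored client-side for recovery

    def attach(self, compositor):
        if self.uuid is None:
            self.uuid = compositor.get_uuid()
        else:
            compositor.set_uuid(self.uuid)

    def close(self, compositor):
        compositor.destroy_uuid(self.uuid)
        self.uuid = None


first = Compositor()
a, b, c = (ClientSurface(n) for n in "abc")
for s in (a, b, c):
    s.attach(first)
original_order = first.visible_stack()

second = Compositor(saved_stack=first.stack)   # compositor restarted
for s in (c, a, b):                            # clients return in any order
    s.attach(second)
restored_order = second.visible_stack()
```

In this model the order in which clients reconnect does not matter,
because each surface slots back into the position remembered for its
UUID; only a surface the compositor has never seen is placed on top.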