Re: Compositor handoffs: Switching clients between compositors

Carsten Haitzler Mon, 16 Aug 2021 13:19:41 -0700

On Mon, 16 Aug 2021 14:13:06 +0100 David Edmundson <da...@davidedmundson.co.uk>
said:


FYI we did this a few years back for efl and enlightenment... on a loss of the
compositor socket, the toolkit retries connecting and comes back.

we added an extended protocol for the compositor to send a UUID to the client
per surface and clients on reconnect provide that UUID to the compositor - this
allows the compositor to fix all the stacking and other state when the surfaces
come back. :)

https://git.enlightenment.org/core/enlightenment.git/tree/src/protocol/session-recovery.xml

pretty simple - you set_uuid if you are recovering and pass the previous uuid
you had. if not, and it's a fresh new surface, you get_uuid to get the
compositor-provided uuid and then store this for recovery purposes.

on a CLEAN destroy of your window you destroy_uuid(). these uuids can also be
used for full sessions (eg shut down machine, boot again and then have all your
windows opened up in their previous state - so the compositor can store the
uuids on disk, and client apps can do the same).
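
to make the flow concrete, here's a rough python sketch of the client-side
bookkeeping. it is not the real protocol binding: FakeCompositor, Client,
map_surface and close_surface are names made up for illustration, and the json
file is just one assumed way of storing uuids. the fake compositor keeps its
state in memory where a real one would write it to disk.

```python
import json
import uuid
from pathlib import Path


class FakeCompositor:
    """stand-in for the compositor side (hypothetical, in-memory only)."""

    def __init__(self):
        self.state = {}  # uuid -> remembered state (stacking, geometry, ...)

    def get_uuid(self, initial_state):
        # fresh surface: hand out a new uuid and start tracking it
        new_id = str(uuid.uuid4())
        self.state[new_id] = initial_state
        return new_id

    def set_uuid(self, old_id):
        # recovering surface: return what we remembered, or None if unknown
        return self.state.get(old_id)

    def destroy_uuid(self, old_id):
        # clean destroy: forget the surface
        self.state.pop(old_id, None)


class Client:
    """stores one uuid per named surface so it survives a reconnect."""

    def __init__(self, store_path):
        self.store_path = Path(store_path)
        if self.store_path.exists():
            self.uuids = json.loads(self.store_path.read_text())
        else:
            self.uuids = {}

    def _save(self):
        self.store_path.write_text(json.dumps(self.uuids))

    def map_surface(self, compositor, name, state):
        old = self.uuids.get(name)
        if old is not None:
            restored = compositor.set_uuid(old)
            if restored is not None:
                return restored  # compositor puts the surface back
        # no usable uuid: treat it as a fresh surface
        self.uuids[name] = compositor.get_uuid(state)
        self._save()
        return state

    def close_surface(self, compositor, name):
        old = self.uuids.pop(name, None)
        if old is not None:
            compositor.destroy_uuid(old)
        self._save()
```

the only thing the client has to remember is the uuid; stacking and the rest
stay on the compositor side and come back when the uuid is presented again.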

you might want to give this a go and then all the rest of the state is solved.
i think in theory efl apps should then work with your hand-off (for us it's
session recovery but that is your intent too - handling compositor crashes
and/or on the fly upgrades which require it to re-execute).

> Hi all,
> 
> I have been working on a method of "Compositor handoffs"--allowing
> clients to switch between compositors at runtime--and I wanted to
> share my progress with everyone and gather some feedback.
> 
> # Overview
> 
> The design is very simple at its core. A client knows everything
> about the state of its windows and still owns all appropriate memory.
> In the event of a compositor exit, the client should be able to just
> send everything again like it's a brand new client.
> 
> I have successfully ported: Qt, SDL and XWayland to do this, and they
> have worked with practically every application using those toolkits
> that I have thrown at it.
> 
> The primary goal of this is to handle a compositor crash. Whilst rare,
> crashes happen to some extent on all of the desktops, and with wayland
> still being such a dynamic landscape and taking on more and more
> responsibilities for security reasons, this isn't something that will
> end anytime soon. We want to make sure we don't have a path where
> users can lose data.
> 
> Long term this could solve other future goals, like switching between
> compositors across graphic cards, supporting CRIU (checkpoint restore
> in userspace) for GUI applications for mobile, or dynamically
> switching between the active compositor and waypipe. All of which will
> be really interesting new features.
> 
> It also makes the developer experience a billion times better, as I
> no longer need to log out to test changes, or only test in
> contrived nested situations.
> 
> # Slightly more detail
> 
> ## The compositor
> 
> The only change we needed to make in the compositor was making our
> public facing socket (e.g. /var/run/.../wayland-0) persist on the file
> system. We run a small helper that creates the socket then spawns our
> real compositor. If our compositor crashes, we re-use the same socket.
> Clients still receive the same error about being disconnected, but
> there are no races should a client try to reconnect whilst a
> compositor is starting.
> 
> systemd's socket activation would also be an option here, though we
> ended up writing something custom.
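
a minimal sketch of that helper idea in python, just to show the shape. this
is not the actual kwin wrapper: HANDOFF_LISTEN_FD is a made-up variable name
for telling the child which inherited fd to listen on, and the restart limit
is an assumption too.

```python
import os
import socket
import subprocess


def create_listening_socket(path, backlog=128):
    """bind and listen once; the helper keeps this socket for its whole life."""
    if os.path.exists(path):
        os.unlink(path)  # stale socket left by an earlier helper run
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(path)
    sock.listen(backlog)
    return sock


def supervise(sock, path, compositor_argv, max_restarts=5):
    """run the compositor and re-run it on a crash, always on the same socket."""
    fd = sock.fileno()
    # hypothetical convention: the child reads the fd number from this variable
    env = dict(os.environ, HANDOFF_LISTEN_FD=str(fd))
    status = 1
    try:
        for _ in range(max_restarts + 1):
            # pass_fds keeps the listening fd open in the child
            status = subprocess.run(compositor_argv, pass_fds=[fd], env=env).returncode
            if status == 0:
                break  # clean exit, stop restarting
        return status
    finally:
        # removing the socket tells clients there is nothing to come back to
        sock.close()
        os.unlink(path)
```

since the helper never closes the listening socket between runs, a client that
connects while the compositor is down just sits in the backlog until the next
instance calls accept().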
> 
> ## Toolkit changes
> 
> Whilst this looks like an invasive thing to retrofit, it
> turns out to be relatively unintrusive and small. Most of the tasks we
> need to do involve functions that already exist anyway.
> 
> We need to:
>      - reset every wl_output (which can happen at runtime already)
>      - reset input devices (which can happen at runtime already)
>      - reset our data devices (which can get cancelled at runtime already)
>      - reset our wl_surface/xdg_shell objects
>      - send new contents (which happens already)
> 
> So for the most part we're just hooking into existing code.
> 
> ## SDL: https://github.com/davidedmundson/SDL/ branch reconnect
> 
> With the above patchset if the compositor crashes, all SDL apps will
> reconnect. This was tested against everything in SDL's test directory
> as well as obviously supertuxkart, the most important client on all
> our machines.
> 
> This is a good example to look at to see the extent of toolkit
> changes: they come in at around 200 lines, and the vast
> majority of that is in the cursor code.
> 
> ## Qt (Qt5 version)
> https://invent.kde.org/davidedmundson/qtwayland/-/tree/reconnect_main
> 
> Qt's usage is slightly more complicated than SDL's, but the
> changes are still relatively small and manageable.
> 
> I have been running this for the best part of 6 months, and everything
> "just works". I've thrown games, full IDEs (kdevelop) and image
> editors at this and everything works flawlessly without client
> changes.
> 
> This is slightly held back by us having to be API and ABI compatible
> with the frozen Qt5 base. A change landing in Qt6 could potentially
> address this more cleanly, especially the part about refreshing the
> QBackingStore.
> 
> Note that if running Qt on Plasma we have a "plasma-integration"
> plugin that also makes some wayland calls and has needed
> adjusting too. The wiki
> (https://invent.kde.org/plasma/kwin/-/wikis/Restarting) lists all
> repositories and pending changes.
> 
> ## XWayland (somewhat WIP):
> https://gitlab.freedesktop.org/davidedmundson/xserver/-/tree/reconnections
> 
> Xwayland was a whole new adventure, but the code for resending windows
> was very simple even tackling details like pointer constraints. Being
> xwayland it even retains positional information as the X client itself
> remains intact.
> 
> The biggest challenge here is that the compositor typically starts
> Xwayland and does so with custom file descriptors for direct
> connections. That obviously doesn't work for handling the compositor
> changing. This required quite a lot of reshuffling about--moving
> Xwayland spawning to the wrapping helper, as well as adding new
> mutexes to handle startup being announced. Ultimately everything was
> do-able. Messy branches for that are available on request.
> 
> ## Other toolkits.
> 
> One nice thing is that it's completely opt-in on the client. For some
> system services (e.g. plasmashell) just restarting is a perfectly
> acceptable approach. For wl-paste, it may as well just exit. Firefox
> has such good in-built crash handling that I just wrapped the
> executable in a tiny wrapper and it works well enough. We don't need
> to have one solution fitting everything.
> 
> Ultimately the absolute worst case is the application quits/crashes,
> which is exactly what happens now.
> 
> ## Handling OpenGL:
> https://gitlab.freedesktop.org/davidedmundson/mesa/ branch reconnect
> 
> OpenGL presented the biggest challenge. A single EGLDisplay object
> persists for the lifespan of the application, and the wl_display is
> passed in as an argument when it is constructed, so we need to
> persist the same wl_display object across connections.
> 
> This has meant we need to change both Mesa and libwayland:
> 
> When a client reconnects after an error a signal is emitted which is
> picked up by mesa. Any globals are then replaced, so new objects are
> created against the new connection.
> 
> With these changes the rest is easy. We need to recreate the EGLWindow
> against a new wl_surface on the new connection, but the context
> survives completely untouched and we can render a new frame right
> away.
> 
> ## libwayland changes:
> https://gitlab.freedesktop.org/davidedmundson/wayland/-/tree/reconnections
> 
> As mentioned above, we have a new method called "reconnect" and a
> signal that is emitted to all subscribers once reconnect is called. As
> it introduces new API all other changes rely on it.
> 
> Another big change here is handling of dangling objects. After a
> client reconnects all proxy objects are marked as "defunct" and we no
> longer send any requests on these objects, as the compositor is
> obviously unaware of these IDs. However, calling the proxy's destructor
> needs to still work to free memory. By doing this inside libwayland
> we are able to make it quite easy for toolkits to recreate objects
> "lazily". This has proved very useful when wayland proxies are hidden
> inside opaque pointers inside the toolkit, or if rendering is on
> another thread.
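
the defunct idea can be modelled without any wayland code. the sketch below is
a toy: Display, Proxy and Window are invented names, and the generation
counter is just my assumed way of marking proxies defunct, not necessarily
what the libwayland branch does.

```python
class Display:
    """toy model of a client connection; not libwayland's real api."""

    def __init__(self):
        self.generation = 0   # bumped on every reconnect
        self.sent = []        # requests that actually went out on the wire
        self._listeners = []

    def on_reconnect(self, callback):
        self._listeners.append(callback)

    def reconnect(self):
        # every proxy created before this point is now defunct
        self.generation += 1
        for callback in list(self._listeners):
            callback(self)


class Proxy:
    def __init__(self, display, name):
        self.display = display
        self.name = name
        self.generation = display.generation
        self.destroyed = False

    @property
    def defunct(self):
        return self.generation != self.display.generation

    def request(self, opcode):
        if self.destroyed:
            raise RuntimeError("request on destroyed proxy")
        if self.defunct:
            return False  # dropped: the new compositor never saw this id
        self.display.sent.append((self.name, opcode))
        return True

    def destroy(self):
        # client-side cleanup always happens; only live proxies tell the compositor
        if not self.defunct:
            self.display.sent.append((self.name, "destroy"))
        self.destroyed = True


class Window:
    """toolkit object that recreates its surface lazily, on next use."""

    def __init__(self, display):
        self.display = display
        self.surface = Proxy(display, "wl_surface")

    def commit(self):
        if self.surface.defunct:
            self.surface.destroy()
            self.surface = Proxy(self.display, "wl_surface")
        return self.surface.request("commit")
```

the point of the lazy path is that Window never needs to hear about the
reconnect itself; it finds out the next time it touches its surface.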
> 
> # Expected questions
> 
> ## Pics or it didn't happen
> 
> Here is a video of restoration working fully:
> https://home.davidedmundson.co.uk/index.php/s/KbQyb3eiBodTcFm
> 
> ## If the changes address only toolkits, what if I use low level
> wayland proxies in my apps?
> 
> The relevant change (watching for the signal on the wl_display and
> recreating a registry, new globals and new objects) would need to be
> redone in the client. Within KDE at least, direct use of proxies
> outside the toolkit and a few select libraries is very rare.
> 
> ## What happens to the clipboard
> 
> The current selection is lost, just like if the client closed. A
> clipboard manager can alleviate this. Every new selection afterwards
> behaves as expected.
> 
> ## What about window size/position/stacking order
> 
> Because everything is treated as new windows (from a compositor POV)
> stacking order and positioning are effectively random. Size is often
> maintained as the clients do know their original size. Session
> management
> (https://gitlab.freedesktop.org/wayland/wayland-protocols/-/merge_requests/18)
> is a possible solution to this.
> 
> ## Could it be done without toolkit changes via a magic proxy?
> 
> Possibly, but I don't think it adds anything. Clients are the
> canonical source of what a client needs, and know best how to handle
> any sudden changes. There is also zero overhead this way.
> 
> ## What if the EGLDisplay capabilities change?
> 
> My mesa change keeps the same connection to the card specified by the
> first wl_drm.device event from the first wayland connection. If the
> new compositor passes a different device it is ignored.
> That definitely has some theoretical quirks, but none that have come
> up in practice for the immediate task at hand.
> 
> ## What if the compositor has no support?
> 
> Then the socket on the file system will disappear and clients will
> fail to reconnect and quit. We also use this in clients to detect a
> graceful kwin exit and just exit accordingly.
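
client-side, that decision might look something like this. again just a
sketch: the function name, the retry count and the delay are assumptions, not
what qt or sdl actually do.

```python
import os
import socket
import time


def reconnect(path, attempts=50, delay=0.1):
    """return a connected socket, or None if the client should give up."""
    for _ in range(attempts):
        if not os.path.exists(path):
            # socket file is gone: graceful exit or no handoff support
            return None
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            return sock
        except (ConnectionRefusedError, FileNotFoundError):
            # file exists but nobody is listening (yet): wait and retry
            sock.close()
            time.sleep(delay)
    return None
```

with a helper holding the socket open the first connect() should succeed
straight away; the retry loop only matters when the socket file can be left
behind with nobody listening on it.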
> 
> ## What about vulkan?
> 
> It will need similar changes to what we've done for OpenGL.
> 
> ## That was too much text
> 
> There will be an XDC talk
> (https://indico.freedesktop.org/event/1/contributions/20/) where I
> will describe everything mentioned here.
> 
> # Next steps
> 
> The changes to libwayland are the most invasive, and potentially the
> most controversial. I wanted to send an overview email and get
> a discussion going before I submitted merge requests. Unfortunately
> they're the blocker on everything else.
> 


-- 
------------- Codito, ergo sum - "I code, therefore I am" --------------
Carsten Haitzler - ras...@rasterman.com
