Re: Plumbing explicit synchronization through the Linux ecosystem
It seems I may not have set the tone I intended with this e-mail... My intention was never to stomp on anyone's favorite window system (Adam isn't the only one who's seemed a bit miffed). My intention was to try to solve some very real problems that we have with Vulkan, in the hope that a solution there could be helpful for others. The problem we have in Vulkan is that we have an inherently explicit-sync graphics API and we're trying to strap it onto some inherently implicit-sync window systems and kernel interfaces. Our mechanisms for doing so have evolved over the course of the last 4-5 years, and it's way better now than when we started, but it's still pretty bad and very invasive to the driver. My objective is to eventually remove the concept of implicit sync from the Vulkan driver completely.

Also (and this is going further down the rabbit hole), I would like to begin cleaning up our i915 UAPI to better separate memory residency handling, command submission, and synchronization. Eventually (and this may sound crazy to some), I'd like to get to the point where i915 doesn't own any of the synchronization primitives except what it needs to handle memory management internally. Linux graphics UAPI is about 10 years behind Windows in terms of design (roughly equivalent to Win7) and I think it's costing us in terms of latency and CPU overhead. Some of that may just be implementation problems in i915; some of it may be core API design. It's a bit unclear.

Why am I bringing up kernel APIs? Because one of the biggest obstacles to evolving things is the fact that our kernel APIs are tied to implicit sync on dma-buf. We can't untangle that until we can remove implicit dma-buf signaling from the command execution APIs. This means that we either need to get rid of ALL implicit synchronization from window-system APIs far enough back in time that we don't run the risk of "breaking userspace", or else we need a plan which lets the kernel driver not support implicit sync but makes implicit sync work anyway. What I'm proposing with dma-buf sync_file import/export is one such plan (rough UAPI sketch below).

So, while this may not solve any problems for Wayland compositors as I previously thought (KMS/atomic supports sync_file. Yay!), we still have a very real problem in Vulkan. It's great that Wayland has an explicit sync API, but until all compositors have supported it for at least 2 years, I can't assume its existence and start deleting my old code paths. Currently, it's only implemented in Weston and the ChromeOS compositor; gnome-shell, kwin, and sway are all still 100% implicit sync AFAIK. We also have to deal with X11.

For those who are asking the question in the back of their minds: Yes, I'm trying to solve a userspace problem with kernel code and, no, I don't think that's necessarily the wrong way around. Don't get me wrong; I very much want to solve the problem "properly", but unless we're very sure we can get it solved properly everywhere quickly, a solution which lets us improve our kernel driver APIs independently of misc. Wayland compositors seems advantageous.
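(To make the dma-buf sync_file import/export proposal concrete, here's a rough sketch of the kind of UAPI I have in mind. The ioctl numbers, names, and struct layout below are illustrative only, not a final patch:)

    /* Sketch of a hypothetical dma-buf sync_file import/export UAPI.
     * Names, numbers, and layout are illustrative, not final. */
    #include <linux/ioctl.h>
    #include <linux/types.h>

    struct dma_buf_sync_file {
            __u32 flags;  /* DMA_BUF_SYNC_READ and/or DMA_BUF_SYNC_WRITE */
            __s32 fd;     /* sync_file fd: out for export, in for import */
    };

    /* Export a sync_file wrapping the dma-buf's current fences, so a
     * purely explicit-sync driver can hand them to a consumer. */
    #define DMA_BUF_IOCTL_EXPORT_SYNC_FILE \
            _IOWR('b', 2, struct dma_buf_sync_file)

    /* Import a sync_file's fence into the dma-buf, so explicit-sync
     * work still looks implicitly synchronized to everyone else. */
    #define DMA_BUF_IOCTL_IMPORT_SYNC_FILE \
            _IOW('b', 3, struct dma_buf_sync_file)

With something like this, a Vulkan driver could export a sync_file after its explicit-sync submit and attach it to the dma-buf before handing the buffer to an implicit-sync compositor, and import the compositor's fences before reusing the buffer.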
On Wed, Mar 11, 2020 at 6:02 PM Adam Jackson wrote:
> On Wed, 2020-03-11 at 12:31 -0500, Jason Ekstrand wrote:
>
> > - X11: With present, it has these "explicit" fence objects but
> > they're always a shmfence which lets the X server and client do a
> > userspace CPU-side hand-off without going over the socket (and
> > round-tripping through the kernel). However, the only thing that
> > fence does is order the OpenGL API calls in the client and server and
> > the real synchronization is still implicit.
>
> I'm pretty sure "the only thing that fence does" is an implementation
> detail.

So I've been told, many times.

> PresentPixmap blocks until the wait-fence signals, but when and
> how it signals are properties of the fence itself. You could have drm
> give the client back a fence fd, pass that to xserver to create a fence
> object, and name that in the PresentPixmap request, and then drm can do
> whatever it wants to signal the fence.

Poking around at things, X11 may not be quite as bad as I thought here. It's not really set up for sync_file for a couple of reasons:

1. It only passes the file descriptor in once at xcb_dri3_fence_from_fd rather than re-creating it every frame from a new sync_file.

2. It only takes a fence on present and doesn't return one in the PRESENT_COMPLETE event.

That said, plumbing syncobj in as an extension looks like a real possibility. A syncobj is just a container that holds a pointer to a dma_fence, and it has roughly the same CPU signal/reset behavior that's exposed by the SyncFenceFuncsRec struct (a rough sketch of the syncobj side is at the end of this mail). There are a few things I'm not sure how to handle:

1. The Sync extension has these trigger funcs which get called when the fence is signaled. I'm not sure how to handle that with syncobj without a thread polling on them somehow.

2. Not all kernel GPU drivers support syncobj; currently it's just i915, amdgpu, and
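(For concreteness, here's roughly what the syncobj container behavior described above looks like through the existing libdrm wrappers. Error handling is omitted and this is purely illustrative:)

    /* Sketch: a syncobj as a reusable fence container, via libdrm.
     * Error handling omitted; illustrative only. */
    #include <stdint.h>
    #include <xf86drm.h>

    /* Create the container once (analogous to the X11 fence object). */
    uint32_t create_fence_container(int drm_fd)
    {
        uint32_t handle;
        drmSyncobjCreate(drm_fd, 0 /* flags */, &handle);
        return handle;
    }

    /* Each frame: point the container at this frame's dma_fence by
     * importing the sync_file returned from command submission. */
    void set_frame_fence(int drm_fd, uint32_t handle, int sync_file_fd)
    {
        drmSyncobjImportSyncFile(drm_fd, handle, sync_file_fd);
    }

    /* Consumer side: block until the currently installed fence signals.
     * WAIT_FOR_SUBMIT also waits for a fence to be installed at all. */
    void wait_frame_fence(int drm_fd, uint32_t handle)
    {
        drmSyncobjWait(drm_fd, &handle, 1, INT64_MAX,
                       DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT, NULL);
    }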
Plumbing explicit synchronization through the Linux ecosystem
All,

Sorry for casting such a broad net with this one. I'm sure most people who reply will get at least one mailing list rejection. However, this is an issue that affects a LOT of components and that's why it's thorny to begin with. Please pardon the length of this e-mail as well; I promise there's a concrete point/proposal at the end.

Explicit synchronization is the future of graphics and media. At least, that seems to be the consensus among all the graphics people I've talked to. I had a chat with one of the lead Android graphics engineers recently who told me that doing explicit sync from the start was one of the best engineering decisions Android ever made. It's also the direction being taken by more modern APIs such as Vulkan.

## What are implicit and explicit synchronization?

For those who aren't familiar with this space: GPUs, media encoders, etc. are massively parallel, and synchronization of some form is required to ensure that everything happens in the right order and data races are avoided. Implicit synchronization is when bits of work (3D, compute, video encode, etc.) are ordered implicitly, based on the absolute CPU-time order in which the API calls occur. Explicit synchronization is when the client (whatever that means in any given context) provides the dependency graph explicitly via some sort of synchronization primitives. If you're still confused, consider the following examples:

With OpenGL and EGL, almost everything is implicit sync. Say you have two OpenGL contexts sharing an image where one writes to it and the other textures from it. The way the OpenGL spec works, the client has to make the API calls to render to the image before (in CPU time) it makes the API calls which texture from the image. As long as it does this (and maybe inserts a glFlush?), the driver will ensure that the rendering completes before the texturing happens and you get correct contents.

Implicit synchronization can also happen across processes. Wayland, for instance, is currently built on implicit sync, where the client does its rendering and then does a hand-off (via wl_surface::commit) to tell the compositor it's done, at which point the compositor can texture from the surface. The hand-off ensures that the client's OpenGL API calls happen before the server's OpenGL API calls.

A good example of explicit synchronization is the Vulkan API. There, a client (or multiple clients) can simultaneously build command buffers in different threads, where one of those command buffers renders to an image and the other textures from it, and then submit both of them at the same time with instructions to the driver about which order to execute them in. The execution order is described via the VkSemaphore primitive. With the new VK_KHR_timeline_semaphore extension, you can even submit the work which does the texturing BEFORE the work which does the rendering and the driver will sort it out (a sketch of this is below).

The #1 problem with implicit synchronization (which explicit solves) is that it leads to a lot of over-synchronization, both in client space and in driver/device space. The client has to synchronize a lot more because it has to ensure that the API calls happen in a particular order. The driver/device have to synchronize a lot more because they never know what is going to end up being a synchronization point, as an API call on another thread/process may occur at any time. As we move to more and more multi-threaded programming, this synchronization (on the client side especially) becomes more and more painful.
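(For anyone who hasn't seen the timeline semaphore API, here's a minimal sketch of the out-of-order submit described above, using the Vulkan 1.2 core names for brevity. It assumes device, queue, and the two command buffers already exist; error handling omitted:)

    /* Sketch: out-of-order submits with timeline semaphores.
     * Assumes the device, queue, and command buffers already exist. */
    #include <vulkan/vulkan.h>

    void submit_out_of_order(VkDevice device, VkQueue queue,
                             VkCommandBuffer render_cmd,
                             VkCommandBuffer texture_cmd)
    {
        VkSemaphoreTypeCreateInfo type_info = {
            .sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO,
            .semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE,
            .initialValue = 0,
        };
        VkSemaphoreCreateInfo sem_info = {
            .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
            .pNext = &type_info,
        };
        VkSemaphore timeline;
        vkCreateSemaphore(device, &sem_info, NULL, &timeline);

        /* Submit the texturing work FIRST, waiting on timeline value 1. */
        uint64_t wait_value = 1;
        VkTimelineSemaphoreSubmitInfo wait_timeline = {
            .sType = VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO,
            .waitSemaphoreValueCount = 1,
            .pWaitSemaphoreValues = &wait_value,
        };
        VkPipelineStageFlags wait_stage =
            VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
        VkSubmitInfo texture_submit = {
            .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
            .pNext = &wait_timeline,
            .waitSemaphoreCount = 1,
            .pWaitSemaphores = &timeline,
            .pWaitDstStageMask = &wait_stage,
            .commandBufferCount = 1,
            .pCommandBuffers = &texture_cmd,
        };
        vkQueueSubmit(queue, 1, &texture_submit, VK_NULL_HANDLE);

        /* THEN submit the rendering work, signaling value 1 when done.
         * The driver sorts out the actual execution order. */
        uint64_t signal_value = 1;
        VkTimelineSemaphoreSubmitInfo signal_timeline = {
            .sType = VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO,
            .signalSemaphoreValueCount = 1,
            .pSignalSemaphoreValues = &signal_value,
        };
        VkSubmitInfo render_submit = {
            .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
            .pNext = &signal_timeline,
            .signalSemaphoreCount = 1,
            .pSignalSemaphores = &timeline,
            .commandBufferCount = 1,
            .pCommandBuffers = &render_cmd,
        };
        vkQueueSubmit(queue, 1, &render_submit, VK_NULL_HANDLE);
    }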
## Current status in Linux

Implicit synchronization in Linux works via the kernel's internal dma_buf and dma_fence data structures. A dma_fence is a tiny object which represents the "done" status for some bit of work. Typically, dma_fences are created as a by-product of someone submitting some bit of work (say, 3D rendering) to the kernel. The dma_buf object has a set of dma_fences on it representing shared (read) and exclusive (write) access to the object. When work is submitted which, for instance, renders to the dma_buf, it's queued waiting on all the fences on the dma_buf, and a dma_fence is created representing the end of said rendering work and installed as the dma_buf's exclusive fence. This way, the kernel can manage all its internal queues (3D rendering, display, video encode, etc.) and know which things to submit in what order.

For the last few years, we've had sync_file in the kernel, and it's plumbed into some drivers. A sync_file is just a wrapper around a single dma_fence. A sync_file is typically created as a by-product of submitting work (3D, compute, etc.) to the kernel and is signaled when that work completes. When a sync_file is created, it is guaranteed by the kernel that it will become signaled in finite time and, once it's signaled, it remains signaled for the rest of time. A sync_file is represented in UAPIs as a file descriptor and can be used with normal file APIs such as dup().
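(Because a sync_file is just a file descriptor, waiting on one from userspace needs no driver-specific API at all. A minimal sketch, assuming sync_fd came back from some work-submission ioctl:)

    /* Sketch: waiting on and merging sync_files from userspace.
     * Assumes sync_fd was returned by some work-submission ioctl. */
    #include <poll.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/sync_file.h>

    void wait_sync_file(int sync_fd)
    {
        /* Blocks until the wrapped dma_fence signals; the kernel
         * guarantees this happens in finite time. */
        struct pollfd p = { .fd = sync_fd, .events = POLLIN };
        poll(&p, 1, -1);
    }

    /* Two sync_files can also be merged into one that signals once
     * both underlying fences have signaled. */
    int merge_sync_files(int fd_a, int fd_b)
    {
        struct sync_merge_data merge = { .fd2 = fd_b };
        strcpy(merge.name, "merged");
        if (ioctl(fd_a, SYNC_IOC_MERGE, &merge) < 0)
            return -1;
        return merge.fence;
    }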
Re: Plumbing explicit synchronization through the Linux ecosystem
On Wed, 2020-03-11 at 12:31 -0500, Jason Ekstrand wrote:

> - X11: With present, it has these "explicit" fence objects but
> they're always a shmfence which lets the X server and client do a
> userspace CPU-side hand-off without going over the socket (and
> round-tripping through the kernel). However, the only thing that
> fence does is order the OpenGL API calls in the client and server and
> the real synchronization is still implicit.

I'm pretty sure "the only thing that fence does" is an implementation detail. PresentPixmap blocks until the wait-fence signals, but when and how it signals are properties of the fence itself. You could have drm give the client back a fence fd, pass that to xserver to create a fence object, and name that in the PresentPixmap request, and then drm can do whatever it wants to signal the fence.

> From my perspective, as a Vulkan driver developer, I have to deal with
> the fact that Vulkan is an explicit sync API but Wayland and X11
> aren't.

I'm quite sure we can give you an explicit-sync X11 API. I think you may already have one.

- ajax
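(For reference, the client-side flow ajax describes looks roughly like this. Note that today's X servers expect the fd to back a shmfence, which is exactly the implementation detail in question; error handling omitted:)

    /* Sketch: naming a fence fd in a PresentPixmap request via DRI3.
     * Today the fd is expected to be a shmfence; the protocol itself
     * doesn't care how it signals. Error handling omitted. */
    #include <xcb/xcb.h>
    #include <xcb/dri3.h>
    #include <xcb/present.h>

    void present_with_fence(xcb_connection_t *conn, xcb_window_t window,
                            xcb_pixmap_t pixmap, int fence_fd)
    {
        /* Create a server-side fence object from the fd the driver
         * handed back (the X server takes ownership of the fd). */
        xcb_sync_fence_t fence = xcb_generate_id(conn);
        xcb_dri3_fence_from_fd(conn, window, fence,
                               0 /* initially untriggered */, fence_fd);

        /* Name the fence as the wait-fence: the server won't present
         * the pixmap until the fence signals. */
        xcb_present_pixmap(conn, window, pixmap,
                           0 /* serial */,
                           XCB_NONE /* valid region */,
                           XCB_NONE /* update region */,
                           0, 0 /* x_off, y_off */,
                           XCB_NONE /* target_crtc */,
                           fence /* wait_fence */,
                           XCB_NONE /* idle_fence */,
                           XCB_PRESENT_OPTION_NONE,
                           0, 0, 0 /* target_msc, divisor, remainder */,
                           0, NULL /* notifies */);
        xcb_flush(conn);
    }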