Re: Plumbing explicit synchronization through the Linux ecosystem

2020-03-12 Thread Jason Ekstrand
It seems I may not have set the tone I intended with this e-mail... My
intention was never to stomp on anyone's favorite window system (Adam
isn't the only one who's seemed a bit miffed).  My intention was to
try and solve some very real problems that we have with Vulkan and I
had the hope that a solution there could be helpful for others.

The problem we have in Vulkan is that we have an inherently explicit
sync graphics API and we're trying to strap it onto some inherently
implicit sync window systems and kernel interfaces.  Our mechanisms
for doing so have evolved over the course of the last 4-5 years and
it's way better now than it was when we started but it's still pretty
bad and very invasive to the driver.  My objective is to completely
remove the concept of implicit sync from the Vulkan driver eventually.

Also (and this is going further down the rabbit hole), I would like to
begin cleaning up our i915 UAPI to better separate memory residency
handling, command submission, and synchronization.  Eventually (and
this may sound crazy to some), I'd like to get to the point where i915
doesn't own any of the synchronization primitives except what it needs
to handle memory management internally.  Linux graphics UAPI is about
10 years behind Windows in terms of design (roughly equivalent to
Win7) and I think it's costing us in terms of latency and CPU
overhead.  Some of that may just be implementation problems in i915;
some of it may be core API design.  It's a bit unclear.

Why am I bringing up kernel APIs?  Because one of the biggest problems
in evolving things is the fact that our kernel APIs are tied to
implicit sync on dma-buf.  We can't detangle that until we can remove
implicit dma-buf signaling from the command execution APIs.  This
means that we either need to get rid of ALL implicit synchronization
from window-system APIs far enough back in time that we don't run the
risk of "breaking userspace" or else we need a plan which lets the
kernel driver not support implicit sync but make implicit sync work
anyway.  What I'm proposing with dma-buf sync_file import/export is
one such plan.
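
To make that concrete, here's a rough sketch of what the userspace side
of such an import/export UAPI could look like.  Everything below is
hypothetical (it's the proposal, not existing kernel UAPI); the ioctl
names, numbers, and struct layout are all up for bikeshedding:

    /* HYPOTHETICAL UAPI SKETCH -- none of this exists yet. */
    #include <linux/types.h>
    #include <stdint.h>
    #include <sys/ioctl.h>

    struct dma_buf_sync_file {
            __u32 flags;  /* which fences: read, write, or both */
            __s32 fd;     /* sync_file fd: out on export, in on import */
    };

    #define DMA_BUF_IOCTL_EXPORT_SYNC_FILE \
            _IOWR('b', 2, struct dma_buf_sync_file)
    #define DMA_BUF_IOCTL_IMPORT_SYNC_FILE \
            _IOW('b', 3, struct dma_buf_sync_file)

    /* Snapshot the dma-buf's current fences as a sync_file so an
     * explicit-sync driver can wait on them without the kernel doing
     * implicit sync on its behalf. */
    int export_dma_buf_fences(int dmabuf_fd, uint32_t flags)
    {
            struct dma_buf_sync_file arg = { .flags = flags, .fd = -1 };
            if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg) < 0)
                    return -1;
            return arg.fd;
    }

The import direction would be symmetric: take the sync_file from an
explicit-sync submit and install its dma_fence as the dma-buf's
exclusive fence, so legacy implicit-sync consumers still wait correctly.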

So, while this may not solve any problems for Wayland compositors as I
previously thought (KMS/atomic supports sync_file.  Yay!), we still
have a very real problem in Vulkan.  It's great that Wayland has an
explicit sync API but until all compositors have supported it for at
least 2 years, I can't assume its existence and start deleting my old
code paths.  Currently, it's only implemented in Weston and the
ChromeOS compositor; gnome-shell, kwin, and sway are all still 100%
implicit sync AFAIK.  We also have to deal with X11.

For those who are asking the question in the back of their minds:
Yes, I'm trying to solve a userspace problem with kernel code and, no,
I don't think that's necessarily the wrong way around.  Don't get me
wrong; I very much want to solve the problem "properly" but unless
we're very sure we can get it solved properly everywhere quickly, a
solution which lets us improve our driver kernel APIs independently of
misc. Wayland compositors seems advantageous.

On Wed, Mar 11, 2020 at 6:02 PM Adam Jackson  wrote:
>
> On Wed, 2020-03-11 at 12:31 -0500, Jason Ekstrand wrote:
>
> >  - X11: With present, it has these "explicit" fence objects but
> > they're always a shmfence which lets the X server and client do a
> > userspace CPU-side hand-off without going over the socket (and
> > round-tripping through the kernel).  However, the only thing that
> > fence does is order the OpenGL API calls in the client and server and
> > the real synchronization is still implicit.
>
> I'm pretty sure "the only thing that fence does" is an implementation
> detail.

So I've been told, many times.

> PresentPixmap blocks until the wait-fence signals, but when and
> how it signals are properties of the fence itself. You could have drm
> give the client back a fence fd, pass that to xserver to create a fence
> object, and name that in the PresentPixmap request, and then drm can do
> whatever it wants to signal the fence.

Poking around at things, X11 may not be quite as bad as I thought
here.  It's not really set up for sync_file for a couple reasons:

 1. It only passes the file descriptor in once, at
xcb_dri3_fence_from_fd, rather than creating a fence from a new
sync_file every frame (see the sketch below)
 2. It only takes a fence on present and doesn't return one in the
PRESENT_COMPLETE event
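
For reference, a sketch of how that wiring looks today with xcb (the
shmfence fd setup is assumed to happen elsewhere; this just shows the
one-time fence creation and its reuse in every PresentPixmap):

    /* Create the fence object once, from an fd shared with the server.
     * Today that fd is an xshmfence, not a sync_file. */
    uint32_t fence_id = xcb_generate_id(conn);
    xcb_dri3_fence_from_fd(conn, window, fence_id,
                           0 /* initially untriggered */, shmfence_fd);

    /* Every frame names the same fence; nothing per-frame is imported. */
    xcb_present_pixmap(conn, window, pixmap, serial,
                       XCB_NONE, XCB_NONE,  /* valid, update regions */
                       0, 0,                /* x_off, y_off */
                       XCB_NONE,            /* target_crtc */
                       fence_id,            /* wait_fence */
                       XCB_NONE,            /* idle_fence */
                       XCB_PRESENT_OPTION_NONE,
                       0, 0, 0,             /* target_msc, divisor, rem */
                       0, NULL);            /* notifies */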

That said, plumbing syncobj in as an extension looks like a real
possibility.  A syncobj is just a container that holds a pointer to a
dma_fence and it has roughly the same CPU signal/reset behavior that's
exposed by the SyncFenceFuncsRec struct.  There are a few things I'm not
sure how to handle:

 1. The Sync extension has these trigger funcs which get called when
the fence is signaled.  I'm not sure how to handle that with syncobj
without a thread polling on them somehow (see the sketch after this list).
 2. Not all kernel GPU drivers support syncobj; currently it's just
i915, amdgpu, and 
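
For item 1, a minimal sketch of the syncobj CPU-side operations involved
(these are all real libdrm entry points; 'drm_fd' is an open DRM device
fd and error handling is dropped):

    #include <xf86drm.h>

    uint32_t handle;
    drmSyncobjCreate(drm_fd, 0, &handle);

    /* Point the syncobj at a dma_fence, e.g. one handed to us as a
     * sync_file by the client: */
    drmSyncobjImportSyncFile(drm_fd, handle, sync_file_fd);

    /* "Is it triggered?" is just a wait with a zero timeout: */
    int ret = drmSyncobjWait(drm_fd, &handle, 1, 0 /* timeout_nsec */,
                             0 /* flags */, NULL);
    /* ret == 0 -> signaled; -ETIME -> still pending */

    /* CPU signal and reset, matching the Sync extension's semantics: */
    drmSyncobjSignal(drm_fd, &handle, 1);
    drmSyncobjReset(drm_fd, &handle, 1);

The trigger funcs are the awkward part: a bare syncobj has no fd to
select() on, so a thread sitting in drmSyncobjWait (or exporting to a
sync_file and poll()ing that) seems to be the only way to run callbacks.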

Plumbing explicit synchronization through the Linux ecosystem

2020-03-12 Thread Jason Ekstrand
All,

Sorry for casting such a broad net with this one. I'm sure most people
who reply will get at least one mailing list rejection.  However, this
is an issue that affects a LOT of components and that's why it's
thorny to begin with.  Please pardon the length of this e-mail as
well; I promise there's a concrete point/proposal at the end.


Explicit synchronization is the future of graphics and media.  At
least, that seems to be the consensus among all the graphics people
I've talked to.  I had a chat with one of the lead Android graphics
engineers recently who told me that doing explicit sync from the start
was one of the best engineering decisions Android ever made.  It's
also the direction being taken by more modern APIs such as Vulkan.


## What are implicit and explicit synchronization?

For those that aren't familiar with this space, GPUs, media encoders,
etc. are massively parallel and synchronization of some form is
required to ensure that everything happens in the right order and
avoid data races.  Implicit synchronization is when bits of work (3D,
compute, video encode, etc.) are synchronized implicitly, based on the
absolute CPU-time order in which API calls occur.  Explicit synchronization is
when the client (whatever that means in any given context) provides
the dependency graph explicitly via some sort of synchronization
primitives.  If you're still confused, consider the following
examples:

With OpenGL and EGL, almost everything is implicit sync.  Say you have
two OpenGL contexts sharing an image where one writes to it and the
other textures from it.  The way the OpenGL spec works, the client has
to make the API calls to render to the image before (in CPU time) it
makes the API calls which texture from the image.  As long as it does
this (and maybe inserts a glFlush?), the driver will ensure that the
rendering completes before the texturing happens and you get correct
contents.
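
As a sketch (context and texture setup assumed; the names here are
illustrative, not from any real codebase):

    /* Context A renders to the shared texture first, in CPU time. */
    eglMakeCurrent(dpy, surf_a, surf_a, ctx_a);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo_for_shared_tex);
    draw_scene();
    glFlush();  /* make sure the rendering is actually submitted */

    /* Only then does context B issue the calls that texture from it. */
    eglMakeCurrent(dpy, surf_b, surf_b, ctx_b);
    glBindTexture(GL_TEXTURE_2D, shared_tex);
    draw_using_texture();

    /* There is no fence anywhere in this code; the driver has to
     * detect the hazard and order the GPU work itself. */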

Implicit synchronization can also happen across processes.  Wayland,
for instance, is currently built on implicit sync where the client
does their rendering and then does a hand-off (via wl_surface::commit)
to tell the compositor it's done at which point the compositor can now
texture from the surface.  The hand-off ensures that the client's
OpenGL API calls happen before the server's OpenGL API calls.
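
In code, the entire cross-process "synchronization" is this ordering
(surface and buffer setup assumed; with EGL, eglSwapBuffers issues the
attach/commit for you):

    glFlush();                                 /* client GPU work submitted */
    wl_surface_attach(surface, buffer, 0, 0);  /* hand the buffer over */
    wl_surface_damage(surface, 0, 0, width, height);
    wl_surface_commit(surface);  /* compositor may now texture from it */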

A good example of explicit synchronization is the Vulkan API.  There,
a client (or multiple clients) can simultaneously build command
buffers in different threads where one of those command buffers
renders to an image and the other textures from it and then submit
both of them at the same time with instructions to the driver for
which order to execute them in.  The execution order is described via
the VkSemaphore primitive.  With the new VK_KHR_timeline_semaphore
extension, you can even submit the work which does the texturing
BEFORE the work which does the rendering and the driver will sort it
out.
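
Sketched out, the cross-command-buffer ordering looks like this (handles
such as 'device', 'queue', and the two pre-recorded command buffers are
assumed to exist; error handling omitted):

    VkSemaphore sem;
    VkSemaphoreCreateInfo sem_info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
    };
    vkCreateSemaphore(device, &sem_info, NULL, &sem);

    /* Submit the rendering work; it signals the semaphore when done. */
    VkSubmitInfo render_submit = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .commandBufferCount = 1,
        .pCommandBuffers = &render_cmd,
        .signalSemaphoreCount = 1,
        .pSignalSemaphores = &sem,
    };
    vkQueueSubmit(queue, 1, &render_submit, VK_NULL_HANDLE);

    /* Submit the texturing work; the GPU waits on the semaphore before
     * the fragment stage runs.  No CPU-side ordering is required. */
    VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
    VkSubmitInfo texture_submit = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .waitSemaphoreCount = 1,
        .pWaitSemaphores = &sem,
        .pWaitDstStageMask = &wait_stage,
        .commandBufferCount = 1,
        .pCommandBuffers = &texture_cmd,
    };
    vkQueueSubmit(queue, 1, &texture_submit, VK_NULL_HANDLE);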

The #1 problem with implicit synchronization (which explicit solves)
is that it leads to a lot of over-synchronization both in client space
and in driver/device space.  The client has to synchronize a lot more
because it has to ensure that the API calls happen in a particular
order.  The driver/device have to synchronize a lot more because they
never know what is going to end up being a synchronization point as an
API call on another thread/process may occur at any time.  As we move
to more and more multi-threaded programming this synchronization (on
the client-side especially) becomes more and more painful.


## Current status in Linux

Implicit synchronization in Linux works via the kernel's internal
dma_buf and dma_fence data structures.  A dma_fence is a tiny object
which represents the "done" status for some bit of work.  Typically,
dma_fences are created as a by-product of someone submitting some bit
of work (say, 3D rendering) to the kernel.  The dma_buf object has a
set of dma_fences on it representing shared (read) and exclusive
(write) access to the object.  When work is submitted which, for
instance, renders to the dma_buf, it's queued waiting on all the fences
on the dma_buf, and a dma_fence is created representing the end of
said rendering work and installed as the dma_buf's exclusive
fence.  This way, the kernel can manage all its internal queues (3D
rendering, display, video encode, etc.) and know which things to
submit in what order.
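
In kernel terms (simplified, using the dma_resv API as of roughly
5.4-era kernels, and glossing over per-driver details), installing those
fences looks something like:

    static void install_fence(struct dma_resv *resv,
                              struct dma_fence *fence, bool write)
    {
            dma_resv_lock(resv, NULL);
            if (write) {
                    /* Exclusive fence: all later access waits on it. */
                    dma_resv_add_excl_fence(resv, fence);
            } else {
                    /* Shared (read) fence: only later writers wait. */
                    dma_resv_reserve_shared(resv, 1); /* error check elided */
                    dma_resv_add_shared_fence(resv, fence);
            }
            dma_resv_unlock(resv);
    }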

For the last few years, we've had sync_file in the kernel and it's
plumbed into some drivers.  A sync_file is just a wrapper around a
single dma_fence.  A sync_file is typically created as a by-product of
submitting work (3D, compute, etc.) to the kernel and is signaled when
that work completes.  When a sync_file is created, it is guaranteed by
the kernel that it will become signaled in finite time and, once it's
signaled, it remains signaled for the rest of time.  A sync_file is
represented in UAPIs as a file descriptor and can be used with normal
file APIs such as dup().  It 
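
To illustrate the fd-based model (poll() and SYNC_IOC_MERGE are the real
sync_file UAPI; error handling trimmed):

    #include <linux/sync_file.h>
    #include <poll.h>
    #include <string.h>
    #include <sys/ioctl.h>

    /* Block until the fence signals; POLLIN means "signaled". */
    int wait_sync_file(int fd)
    {
            struct pollfd p = { .fd = fd, .events = POLLIN };
            return poll(&p, 1, -1);  /* safe: signaling is guaranteed */
    }

    /* Merge two sync_files into one that signals once both have. */
    int merge_sync_files(int fd1, int fd2)
    {
            struct sync_merge_data merge = { .fd2 = fd2 };
            strcpy(merge.name, "merged");
            if (ioctl(fd1, SYNC_IOC_MERGE, &merge) < 0)
                    return -1;
            return merge.fence;  /* the new, merged sync_file fd */
    }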

Re: Plumbing explicit synchronization through the Linux ecosystem

2020-03-12 Thread Nicolas Dufresne
(I know I'm going to be spammed by so many mailing lists ...)

On Wednesday, March 11, 2020 at 14:21 -0500, Jason Ekstrand wrote:
> On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand  wrote:
> > [snip]

Re: Plumbing explicit synchronization through the Linux ecosystem

2020-03-12 Thread Adam Jackson
On Wed, 2020-03-11 at 12:31 -0500, Jason Ekstrand wrote:

>  - X11: With present, it has these "explicit" fence objects but
> they're always a shmfence which lets the X server and client do a
> userspace CPU-side hand-off without going over the socket (and
> round-tripping through the kernel).  However, the only thing that
> fence does is order the OpenGL API calls in the client and server and
> the real synchronization is still implicit.

I'm pretty sure "the only thing that fence does" is an implementation
detail. PresentPixmap blocks until the wait-fence signals, but when and
how it signals are properties of the fence itself. You could have drm
give the client back a fence fd, pass that to xserver to create a fence
object, and name that in the PresentPixmap request, and then drm can do
whatever it wants to signal the fence.

> From my perspective, as a Vulkan driver developer, I have to deal with
> the fact that Vulkan is an explicit sync API but Wayland and X11
> aren't.

I'm quite sure we can give you an explicit-sync X11 API. I think you
may already have one.

- ajax

___
xorg-devel@lists.x.org: X.Org development
Archives: http://lists.x.org/archives/xorg-devel
Info: https://lists.x.org/mailman/listinfo/xorg-devel

