Re: [PATCH v2] drm/doc: add rfc section for small BAR uapi

2022-04-27 Thread Daniel Vetter
On Wed, Apr 27, 2022 at 08:55:07AM +0200, Christian König wrote:
> Well usually we increment the drm minor version when adding some new flags
> on amdgpu.
> 
> Additional to that just one comment from our experience with that: You don't
> just need one flag, but two. The first one is a hint which says "CPU access
> needed" and the second is a promise which says "CPU access never needed".
> 
> The background is that on a whole bunch of buffers you can 100% certain say
> that you will never ever need CPU access.
> 
> Then at least we have a whole bunch of buffers where we might need CPU
> access, but can't tell for sure.
> 
> And last we have stuff like transfer buffers you can be 100% sure that you
> need CPU access.
> 
> Separating it like this helped a lot with performance on small BAR systems.

So my assumption was that for transfer buffers you'd fill them with the
cpu first anyway, so no need for the extra flag.

I guess this is for transfer buffers used for gpu -> cpu transfers, where it
would result in a costly bo move plus stalls, and it's better to make sure
it's cpu accessible from the start? At least on the current gpus we have,
where there's no coherent interconnect, those buffers have to be in system
memory or your cpu access will be a disaster, so again they're naturally
cpu accessible.

What's the use-case for the "cpu access required" flag where "cpu access
before gpu access" isn't a good enough hint already to get the same perf
benefits?
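
(For concreteness, a hedged sketch of what object creation with the hint
looks like from userspace, assuming the uapi lands roughly as proposed in
the small BAR RFC quoted below; the flags field, the flag name and the
fallback errno are taken from / assumed per that proposal, not settled
uapi.)

#include <errno.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* Hedged sketch: create a bo with the proposed "CPU access needed" hint. */
static int create_bo_with_cpu_access(int drm_fd, __u64 size, __u32 *handle)
{
        struct drm_i915_gem_create_ext create = {
                .size = size,
                /* Proposed flag from the RFC header. */
                .flags = I915_GEM_CREATE_EXT_FLAG_NEEDS_CPU_ACCESS,
        };

        if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_CREATE_EXT, &create)) {
                /* Older kernels reject unknown flags (errno assumed to be
                 * EINVAL); fall back to plain creation. */
                if (errno != EINVAL)
                        return -errno;
                create.flags = 0;
                if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_CREATE_EXT, &create))
                        return -errno;
        }
        *handle = create.handle;
        return 0;
}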

Also for scanout my idea at least is that we just fail mmap when you
haven't set the flag and the scanout is pinned to the unmappable range, for
two reasons:
- 4k buffers are big; if we force them all into mappable, things are
  non-pretty.
- You need mesa anyway to access tiled buffers, and mesa knows how to use
  a transfer buffer. Desktop switching, fastboot and stuff like that should
  all keep working through the getfb2 ioctl (and without getfb2 it's doomed
  to garbage anyway).

So only dumb kms buffers (which are linear) would ever get the
NEEDS_CPU_ACCESS flag, and only those would we ever pin into the cpu
accessible range for scanout. Is there a hole in that plan?

Cheers, Daniel

> 
> Regards,
> Christian.
> 
> Am 27.04.22 um 08:48 schrieb Lionel Landwerlin:
> > One question though, how do we detect that this flag
> > (I915_GEM_CREATE_EXT_FLAG_NEEDS_CPU_ACCESS) is accepted on a given
> > kernel?
> > I assume older kernels are going to reject object creation if we use
> > this flag?
> > 
> > I didn't plan to use __drm_i915_query_vma_info, but isn't it
> > inconsistent to select the placement on the GEM object and then query
> > whether it's mappable by address?
> > You made a comment stating this is racy, wouldn't querying on the GEM
> > object prevent this?
> > 
> > Thanks,
> > 
> > -Lionel
> > 
> > On 27/04/2022 09:35, Lionel Landwerlin wrote:
> > > Hi Matt,
> > > 
> > > 
> > > The proposal looks good to me.
> > > 
> > > Looking forward to try it on drm-tip.
> > > 
> > > 
> > > -Lionel
> > > 
> > > On 20/04/2022 20:13, Matthew Auld wrote:
> > > > Add an entry for the new uapi needed for small BAR on DG2+.
> > > > 
> > > > v2:
> > > >    - Some spelling fixes and other small tweaks. (Akeem & Thomas)
> > > >    - Rework error capture interactions, including no longer needing
> > > >  NEEDS_CPU_ACCESS for objects marked for capture. (Thomas)
> > > >    - Add probed_cpu_visible_size. (Lionel)
> > > > 
> > > > Signed-off-by: Matthew Auld 
> > > > Cc: Thomas Hellström 
> > > > Cc: Lionel Landwerlin 
> > > > Cc: Jon Bloomfield 
> > > > Cc: Daniel Vetter 
> > > > Cc: Jordan Justen 
> > > > Cc: Kenneth Graunke 
> > > > Cc: Akeem G Abodunrin 
> > > > Cc: mesa-dev@lists.freedesktop.org
> > > > ---
> > > >   Documentation/gpu/rfc/i915_small_bar.h   | 190
> > > > +++
> > > >   Documentation/gpu/rfc/i915_small_bar.rst |  58 +++
> > > >   Documentation/gpu/rfc/index.rst  |   4 +
> > > >   3 files changed, 252 insertions(+)
> > > >   create mode 100644 Documentation/gpu/rfc/i915_small_bar.h
> > > >   create mode 100644 Documentation/gpu/rfc/i915_small_bar.rst
> > > > 
> > > > diff --git a/Documentation/gpu/rfc/i915_small_bar.h
> > > > b/Documentation/gpu/rfc/i915_small_bar.h
> > > > new file mode 100644
> > > > index ..7bfd0cf44d35
> > > > --- /de

[Mesa-dev] [PATCH] dma-buf: Document dma-buf implicit fencing/resv fencing rules

2021-06-24 Thread Daniel Vetter

v5:
- A few s/should/must/ to make clear what must be done (if the driver
  does implicit sync) and what's more a maybe (Daniel Stone)
- drop all the example api discussion, that needs to be expanded,
  clarified and put into a new chapter in drm-uapi.rst (Daniel Stone)

Cc: Daniel Stone 
Acked-by: Daniel Stone 
Reviewed-by: Dave Airlie  (v4)
Reviewed-by: Christian König  (v3)
Cc: mesa-dev@lists.freedesktop.org
Cc: Bas Nieuwenhuizen 
Cc: Dave Airlie 
Cc: Rob Clark 
Cc: Kristian H. Kristensen 
Cc: Michel Dänzer 
Cc: Daniel Stone 
Cc: Sumit Semwal 
Cc: "Christian König" 
Cc: Alex Deucher 
Cc: Daniel Vetter 
Cc: Deepak R Varma 
Cc: Chen Li 
Cc: Kevin Wang 
Cc: Dennis Li 
Cc: Luben Tuikov 
Cc: linaro-mm-...@lists.linaro.org
Signed-off-by: Daniel Vetter 
---
 include/linux/dma-buf.h | 34 ++
 1 file changed, 34 insertions(+)

diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 81cebf414505..2b814fde0d11 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -386,6 +386,40 @@ struct dma_buf {
 * @resv:
 *
 * Reservation object linked to this dma-buf.
+*
+* IMPLICIT SYNCHRONIZATION RULES:
+*
+* Drivers which support implicit synchronization of buffer access as
+* e.g. exposed in `Implicit Fence Poll Support`_ must follow the
+* below rules.
+*
+* - Drivers must add a shared fence through dma_resv_add_shared_fence()
+*   for anything the userspace API considers a read access. This highly
+*   depends upon the API and window system.
+*
+* - Similarly drivers must set the exclusive fence through
+*   dma_resv_add_excl_fence() for anything the userspace API considers
+*   write access.
+*
+* - Drivers may just always set the exclusive fence, since that only
+*   causes unnecessary synchronization, but no correctness issues.
+*
+* - Some drivers only expose a synchronous userspace API with no
+*   pipelining across drivers. These do not set any fences for their
+*   access. An example here is v4l.
+*
+* DYNAMIC IMPORTER RULES:
+*
+* Dynamic importers, see dma_buf_attachment_is_dynamic(), have
+* additional constraints on how they set up fences:
+*
+* - Dynamic importers must obey the exclusive fence and wait for it to
+*   signal before allowing access to the buffer's underlying storage
+*   through the device.
+*
+* - Dynamic importers should set fences for any access that they can't
+*   disable immediately from their &dma_buf_attach_ops.move_notify
+*   callback.
 */
struct dma_resv *resv;
 
-- 
2.32.0.rc2
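
(To make the two main rules above concrete, a hedged sketch of how a driver
submitting GPU work might attach its job fence, using the 2021-era dma_resv
API; the helper name is hypothetical and error handling is trimmed.)

#include <linux/dma-resv.h>
#include <linux/dma-fence.h>

/* Hedged sketch: reads land in the shared slots, writes in the exclusive one. */
static int sketch_attach_job_fence(struct dma_resv *resv,
                                   struct dma_fence *job_fence,
                                   bool is_write)
{
        int ret;

        ret = dma_resv_lock(resv, NULL);
        if (ret)
                return ret;

        if (is_write) {
                /* Anything userspace considers a write access. */
                dma_resv_add_excl_fence(resv, job_fence);
        } else {
                /* Anything userspace considers a read access. */
                ret = dma_resv_reserve_shared(resv, 1);
                if (!ret)
                        dma_resv_add_shared_fence(resv, job_fence);
        }

        dma_resv_unlock(resv);
        return ret;
}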


Re: [Mesa-dev] [PATCH] dma-buf: Document dma-buf implicit fencing/resv fencing rules

2021-06-24 Thread Daniel Vetter
On Thu, Jun 24, 2021 at 1:08 PM Daniel Stone  wrote:
>
> Hi,
>
> On Wed, 23 Jun 2021 at 17:20, Daniel Vetter  wrote:
> > +*
> > +* IMPLICIT SYNCHRONIZATION RULES:
> > +*
> > +* Drivers which support implicit synchronization of buffer access 
> > as
> > +* e.g. exposed in `Implicit Fence Poll Support`_ should follow the
> > +* below rules.
>
> 'Should' ... ? Must.

Yeah, I guess I can upgrade a bunch of them.

> > +* - Drivers should add a shared fence through
> > +*   dma_resv_add_shared_fence() for anything the userspace API
> > +*   considers a read access. This highly depends upon the API and
> > +*   window system: E.g. OpenGL is generally implicitly 
> > synchronized on
> > +*   Linux, but explicitly synchronized on Android. Whereas Vulkan 
> > is
> > +*   generally explicitly synchronized for everything, and window 
> > system
> > +*   buffers have explicit API calls (which then need to make sure 
> > the
> > +*   implicit fences store here in @resv are updated correctly).
> > +*
> > +* - [...]
>
> Mmm, I think this is all right, but it could be worded much more
> clearly. Right now it's a bunch of points all smashed into one, and
> there's a lot of room for misinterpretation.
>
> Here's a strawman, starting with most basic and restrictive, working
> through to when you're allowed to wriggle your way out:
>
> Rule 1: Drivers must add a shared fence through
> dma_resv_add_shared_fence() for any read accesses against that buffer.
> This appends a fence to the shared array, ensuring that any future
> non-read access will be synchronised against this operation to only
> begin after it has completed.
>
> Rule 2: Drivers must add an exclusive fence through
> dma_resv_add_excl_fence() for any write accesses against that buffer.
> This replaces the exclusive fence with the new operation, ensuring
> that all future access will be synchronised against this operation to
> only begin after it has completed.
>
> Rule 3: Drivers must synchronise all accesses to buffers against
> existing implicit fences. Read accesses must synchronise against the
> exclusive fence (read-after-write), and write accesses must
> synchronise against both the exclusive (write-after-write) and shared
> (write-after-read) fences.
>
> Note 1: Users like OpenGL and window systems on non-Android userspace
> are generally implicitly synchronised. An implicitly-synchronised
> userspace is unaware of fences from prior operations, so the kernel
> mediates scheduling to create the illusion that GPU work is FIFO. For
> example, an application will flush and schedule GPU write work to
> render its image, then immediately tell the window system to display
> that image; the window system may immediately flush and schedule GPU
> read work to display that image, with neither waiting for the write to
> have completed. The kernel provides coherence by synchronising the
> read access against the write fence in the exclusive slot, so that the
> image displayed is correct.
>
> Note 2: Users like Vulkan and Android window system are generally
> explicitly synchronised. An explicitly-synchronised userspace is
> responsible for tracking its own read and write access and providing
> the kernel with synchronisation barriers. For instance, a Vulkan
> application rendering to a buffer and subsequently using it as a read
> texture, must annotate the read operation with a read-after-write
> synchronisation barrier.
>
> Note 3: Implicit and explicit userspace can coexist. For instance, an
> explicitly-synchronised Vulkan application may be running as a client
> of an implicitly-synchronised window system which uses OpenGL for
> composition; an implicitly-synchronised OpenGL application may be
> running as a client of a window system which uses Vulkan for
> composition.
>
> Note 4: Some subsystems, for example V4L2, do not pipeline operations,
> and instead only return to userspace when the scheduled work against a
> buffer has fully retired.
>
> Exemption 1: Fully self-coherent userspace may skip implicit
> synchronisation barriers. For instance, accesses between two
> Vulkan-internal buffers allocated by a single application do not need
> to synchronise against each other's implicit fences, as the client is
> responsible for explicitly providing barriers for access. A
> self-contained OpenGL userspace also has no need to implicitly
> synchronise its access if the driver instead tracks all access and
> inserts the appropriate synchronisation barriers.
>
> Exemption 2: When implicit and explicit use
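
(To make Rule 3 of the strawman above concrete, a hedged sketch using the
dma_resv API of that era; a pipelined driver would add these fences as
scheduler dependencies instead of blocking, but the read/write asymmetry is
the same.)

#include <linux/dma-resv.h>
#include <linux/sched.h>

/*
 * Hedged sketch: readers wait for the exclusive (write) fence only,
 * writers wait for everything (exclusive + all shared read fences).
 */
static long sketch_sync_before_access(struct dma_resv *resv, bool is_write)
{
        return dma_resv_wait_timeout_rcu(resv, /* wait_all */ is_write,
                                         true /* interruptible */,
                                         MAX_SCHEDULE_TIMEOUT);
}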

[Mesa-dev] [PATCH] dma-buf: Document dma-buf implicit fencing/resv fencing rules

2021-06-23 Thread Daniel Vetter

Reviewed-by: Christian König  (v3)

Cc: mesa-dev@lists.freedesktop.org
Cc: Bas Nieuwenhuizen 
Cc: Dave Airlie 
Cc: Rob Clark 
Cc: Kristian H. Kristensen 
Cc: Michel Dänzer 
Cc: Daniel Stone 
Cc: Sumit Semwal 
Cc: "Christian König" 
Cc: Alex Deucher 
Cc: Daniel Vetter 
Cc: Deepak R Varma 
Cc: Chen Li 
Cc: Kevin Wang 
Cc: Dennis Li 
Cc: Luben Tuikov 
Cc: linaro-mm-...@lists.linaro.org
Signed-off-by: Daniel Vetter 
---
 include/linux/dma-buf.h | 39 +++
 1 file changed, 39 insertions(+)

diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 81cebf414505..494f639ee486 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -386,6 +386,45 @@ struct dma_buf {
 * @resv:
 *
 * Reservation object linked to this dma-buf.
+*
+* IMPLICIT SYNCHRONIZATION RULES:
+*
+* Drivers which support implicit synchronization of buffer access as
+* e.g. exposed in `Implicit Fence Poll Support`_ should follow the
+* below rules.
+*
+* - Drivers should add a shared fence through
+*   dma_resv_add_shared_fence() for anything the userspace API
+*   considers a read access. This highly depends upon the API and
+*   window system: E.g. OpenGL is generally implicitly synchronized on
+*   Linux, but explicitly synchronized on Android. Whereas Vulkan is
+*   generally explicitly synchronized for everything, and window system
+*   buffers have explicit API calls (which then need to make sure the
+*   implicit fences stored here in @resv are updated correctly).
+*
+* - Similarly drivers should set the exclusive fence through
+*   dma_resv_add_excl_fence() for anything the userspace API considers
+*   write access.
+*
+* - Drivers may just always set the exclusive fence, since that only
+*   causes unnecessary synchronization, but no correctness issues.
+*
+* - Some drivers only expose a synchronous userspace API with no
+*   pipelining across drivers. These do not set any fences for their
+*   access. An example here is v4l.
+*
+* DYNAMIC IMPORTER RULES:
+*
+* Dynamic importers, see dma_buf_attachment_is_dynamic(), have
+* additional constraints on how they set up fences:
+*
+* - Dynamic importers must obey the exclusive fence and wait for it to
+*   signal before allowing access to the buffer's underlying storage
+*   through the device.
+*
+* - Dynamic importers should set fences for any access that they can't
+*   disable immediately from their &dma_buf_attach_ops.move_notify
+*   callback.
 */
struct dma_resv *resv;
 
-- 
2.32.0.rc2



Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Daniel Vetter
On Wed, Jun 23, 2021 at 05:07:17PM +0200, Christian König wrote:
> Am 23.06.21 um 17:03 schrieb Daniel Vetter:
> > On Wed, Jun 23, 2021 at 04:58:27PM +0200, Bas Nieuwenhuizen wrote:
> > > On Wed, Jun 23, 2021 at 4:50 PM Daniel Vetter  
> > > wrote:
> > > > On Wed, Jun 23, 2021 at 4:02 PM Christian König
> > > >  wrote:
> > > > > Am 23.06.21 um 15:49 schrieb Daniel Vetter:
> > > > > > On Wed, Jun 23, 2021 at 3:44 PM Christian König
> > > > > >  wrote:
> > > > > > > Am 23.06.21 um 15:38 schrieb Bas Nieuwenhuizen:
> > > > > > > > On Wed, Jun 23, 2021 at 2:59 PM Christian König
> > > > > > > >  wrote:
> > > > > > > > > Am 23.06.21 um 14:18 schrieb Daniel Vetter:
> > > > > > > > > > On Wed, Jun 23, 2021 at 11:45 AM Bas Nieuwenhuizen
> > > > > > > > > >  wrote:
> > > > > > > > > > > On Tue, Jun 22, 2021 at 6:55 PM Daniel Vetter 
> > > > > > > > > > >  wrote:
> > > > > > > > > > > > WARNING: Absolutely untested beyond "gcc isn't dying in 
> > > > > > > > > > > > agony".
> > > > > > > > > > > > 
> > > > > > > > > > > > Implicit fencing done properly needs to treat the 
> > > > > > > > > > > > implicit fencing
> > > > > > > > > > > > slots like a funny kind of IPC mailbox. In other words 
> > > > > > > > > > > > it needs to be
> > > > > > > > > > > > explicitly. This is the only way it will mesh well with 
> > > > > > > > > > > > explicit
> > > > > > > > > > > > fencing userspace like vk, and it's also the bare 
> > > > > > > > > > > > minimum required to
> > > > > > > > > > > > be able to manage anything else that wants to use the 
> > > > > > > > > > > > same buffer on
> > > > > > > > > > > > multiple engines in parallel, and still be able to 
> > > > > > > > > > > > share it through
> > > > > > > > > > > > implicit sync.
> > > > > > > > > > > > 
> > > > > > > > > > > > amdgpu completely lacks such an uapi. Fix this.
> > > > > > > > > > > > 
> > > > > > > > > > > > Luckily the concept of ignoring implicit fences exists 
> > > > > > > > > > > > already, and
> > > > > > > > > > > > takes care of all the complexities of making sure that 
> > > > > > > > > > > > non-optional
> > > > > > > > > > > > fences (like bo moves) are not ignored. This support 
> > > > > > > > > > > > was added in
> > > > > > > > > > > > 
> > > > > > > > > > > > commit 177ae09b5d699a5ebd1cafcee78889db968abf54
> > > > > > > > > > > > Author: Andres Rodriguez 
> > > > > > > > > > > > Date:   Fri Sep 15 20:44:06 2017 -0400
> > > > > > > > > > > > 
> > > > > > > > > > > > drm/amdgpu: introduce 
> > > > > > > > > > > > AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2
> > > > > > > > > > > > 
> > > > > > > > > > > > Unfortuantely it's the wrong semantics, because it's a 
> > > > > > > > > > > > bo flag and
> > > > > > > > > > > > disables implicit sync on an allocated buffer 
> > > > > > > > > > > > completely.
> > > > > > > > > > > > 
> > > > > > > > > > > > We _do_ want implicit sync, but control it explicitly. 
> > > > > > > > > > > > For this we
> > > > > > > > > > > > need a flag on the drm_file, so that a given userspace 
> > > > > > > > > > > > (like vulkan)
> > > > > > > > > > > > can 

Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Daniel Vetter
On Wed, Jun 23, 2021 at 04:58:27PM +0200, Bas Nieuwenhuizen wrote:
> On Wed, Jun 23, 2021 at 4:50 PM Daniel Vetter  wrote:
> >
> > On Wed, Jun 23, 2021 at 4:02 PM Christian König
> >  wrote:
> > >
> > > Am 23.06.21 um 15:49 schrieb Daniel Vetter:
> > > > On Wed, Jun 23, 2021 at 3:44 PM Christian König
> > > >  wrote:
> > > >> Am 23.06.21 um 15:38 schrieb Bas Nieuwenhuizen:
> > > >>> On Wed, Jun 23, 2021 at 2:59 PM Christian König
> > > >>>  wrote:
> > > >>>> Am 23.06.21 um 14:18 schrieb Daniel Vetter:
> > > >>>>> On Wed, Jun 23, 2021 at 11:45 AM Bas Nieuwenhuizen
> > > >>>>>  wrote:
> > > >>>>>> On Tue, Jun 22, 2021 at 6:55 PM Daniel Vetter 
> > > >>>>>>  wrote:
> > > >>>>>>> WARNING: Absolutely untested beyond "gcc isn't dying in agony".
> > > >>>>>>>
> > > >>>>>>> Implicit fencing done properly needs to treat the implicit fencing
> > > >>>>>>> slots like a funny kind of IPC mailbox. In other words it needs 
> > > >>>>>>> to be
> > > >>>>>>> explicitly. This is the only way it will mesh well with explicit
> > > >>>>>>> fencing userspace like vk, and it's also the bare minimum 
> > > >>>>>>> required to
> > > >>>>>>> be able to manage anything else that wants to use the same buffer 
> > > >>>>>>> on
> > > >>>>>>> multiple engines in parallel, and still be able to share it 
> > > >>>>>>> through
> > > >>>>>>> implicit sync.
> > > >>>>>>>
> > > >>>>>>> amdgpu completely lacks such an uapi. Fix this.
> > > >>>>>>>
> > > >>>>>>> Luckily the concept of ignoring implicit fences exists already, 
> > > >>>>>>> and
> > > >>>>>>> takes care of all the complexities of making sure that 
> > > >>>>>>> non-optional
> > > >>>>>>> fences (like bo moves) are not ignored. This support was added in
> > > >>>>>>>
> > > >>>>>>> commit 177ae09b5d699a5ebd1cafcee78889db968abf54
> > > >>>>>>> Author: Andres Rodriguez 
> > > >>>>>>> Date:   Fri Sep 15 20:44:06 2017 -0400
> > > >>>>>>>
> > > >>>>>>>drm/amdgpu: introduce AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2
> > > >>>>>>>
> > > >>>>>>> Unfortuantely it's the wrong semantics, because it's a bo flag and
> > > >>>>>>> disables implicit sync on an allocated buffer completely.
> > > >>>>>>>
> > > >>>>>>> We _do_ want implicit sync, but control it explicitly. For this we
> > > >>>>>>> need a flag on the drm_file, so that a given userspace (like 
> > > >>>>>>> vulkan)
> > > >>>>>>> can manage the implicit sync slots explicitly. The other side of 
> > > >>>>>>> the
> > > >>>>>>> pipeline (compositor, other process or just different stage in a 
> > > >>>>>>> media
> > > >>>>>>> pipeline in the same process) can then either do the same, or 
> > > >>>>>>> fully
> > > >>>>>>> participate in the implicit sync as implemented by the kernel by
> > > >>>>>>> default.
> > > >>>>>>>
> > > >>>>>>> By building on the existing flag for buffers we avoid any issues 
> > > >>>>>>> with
> > > >>>>>>> opening up additional security concerns - anything this new flag 
> > > >>>>>>> here
> > > >>>>>>> allows is already.
> > > >>>>>>>
> > > >>>>>>> All drivers which supports this concept of a userspace-specific
> > > >>>>>>> opt-out of implicit sync have a flag in their CS ioctl, but in 
> > > >>>>>>> reality
> >

Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Daniel Vetter
On Wed, Jun 23, 2021 at 4:02 PM Christian König
 wrote:
>
> Am 23.06.21 um 15:49 schrieb Daniel Vetter:
> > On Wed, Jun 23, 2021 at 3:44 PM Christian König
> >  wrote:
> >> Am 23.06.21 um 15:38 schrieb Bas Nieuwenhuizen:
> >>> On Wed, Jun 23, 2021 at 2:59 PM Christian König
> >>>  wrote:
> >>>> Am 23.06.21 um 14:18 schrieb Daniel Vetter:
> >>>>> On Wed, Jun 23, 2021 at 11:45 AM Bas Nieuwenhuizen
> >>>>>  wrote:
> >>>>>> On Tue, Jun 22, 2021 at 6:55 PM Daniel Vetter  
> >>>>>> wrote:
> >>>>>>> WARNING: Absolutely untested beyond "gcc isn't dying in agony".
> >>>>>>>
> >>>>>>> Implicit fencing done properly needs to treat the implicit fencing
> >>>>>>> slots like a funny kind of IPC mailbox. In other words it needs to be
> >>>>>>> explicitly. This is the only way it will mesh well with explicit
> >>>>>>> fencing userspace like vk, and it's also the bare minimum required to
> >>>>>>> be able to manage anything else that wants to use the same buffer on
> >>>>>>> multiple engines in parallel, and still be able to share it through
> >>>>>>> implicit sync.
> >>>>>>>
> >>>>>>> amdgpu completely lacks such an uapi. Fix this.
> >>>>>>>
> >>>>>>> Luckily the concept of ignoring implicit fences exists already, and
> >>>>>>> takes care of all the complexities of making sure that non-optional
> >>>>>>> fences (like bo moves) are not ignored. This support was added in
> >>>>>>>
> >>>>>>> commit 177ae09b5d699a5ebd1cafcee78889db968abf54
> >>>>>>> Author: Andres Rodriguez 
> >>>>>>> Date:   Fri Sep 15 20:44:06 2017 -0400
> >>>>>>>
> >>>>>>>drm/amdgpu: introduce AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2
> >>>>>>>
> >>>>>>> Unfortuantely it's the wrong semantics, because it's a bo flag and
> >>>>>>> disables implicit sync on an allocated buffer completely.
> >>>>>>>
> >>>>>>> We _do_ want implicit sync, but control it explicitly. For this we
> >>>>>>> need a flag on the drm_file, so that a given userspace (like vulkan)
> >>>>>>> can manage the implicit sync slots explicitly. The other side of the
> >>>>>>> pipeline (compositor, other process or just different stage in a media
> >>>>>>> pipeline in the same process) can then either do the same, or fully
> >>>>>>> participate in the implicit sync as implemented by the kernel by
> >>>>>>> default.
> >>>>>>>
> >>>>>>> By building on the existing flag for buffers we avoid any issues with
> >>>>>>> opening up additional security concerns - anything this new flag here
> >>>>>>> allows is already.
> >>>>>>>
> >>>>>>> All drivers which supports this concept of a userspace-specific
> >>>>>>> opt-out of implicit sync have a flag in their CS ioctl, but in reality
> >>>>>>> that turned out to be a bit too inflexible. See the discussion below,
> >>>>>>> let's try to do a bit better for amdgpu.
> >>>>>>>
> >>>>>>> This alone only allows us to completely avoid any stalls due to
> >>>>>>> implicit sync, it does not yet allow us to use implicit sync as a
> >>>>>>> strange form of IPC for sync_file.
> >>>>>>>
> >>>>>>> For that we need two more pieces:
> >>>>>>>
> >>>>>>> - a way to get the current implicit sync fences out of a buffer. Could
> >>>>>>>  be done in a driver ioctl, but everyone needs this, and 
> >>>>>>> generally a
> >>>>>>>  dma-buf is involved anyway to establish the sharing. So an ioctl 
> >>>>>>> on
> >>>>>>>  the dma-buf makes a ton more sense:
> >>>>>>>
> >>>>>>>  
> >>>>>>> https://nam11.safelinks.protection.outlook.com/?u

Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Daniel Vetter
On Wed, Jun 23, 2021 at 3:44 PM Christian König
 wrote:
>
> Am 23.06.21 um 15:38 schrieb Bas Nieuwenhuizen:
> > On Wed, Jun 23, 2021 at 2:59 PM Christian König
> >  wrote:
> >> Am 23.06.21 um 14:18 schrieb Daniel Vetter:
> >>> On Wed, Jun 23, 2021 at 11:45 AM Bas Nieuwenhuizen
> >>>  wrote:
> >>>> On Tue, Jun 22, 2021 at 6:55 PM Daniel Vetter  
> >>>> wrote:
> >>>>> WARNING: Absolutely untested beyond "gcc isn't dying in agony".
> >>>>>
> >>>>> Implicit fencing done properly needs to treat the implicit fencing
> >>>>> slots like a funny kind of IPC mailbox. In other words it needs to be
> >>>>> explicitly. This is the only way it will mesh well with explicit
> >>>>> fencing userspace like vk, and it's also the bare minimum required to
> >>>>> be able to manage anything else that wants to use the same buffer on
> >>>>> multiple engines in parallel, and still be able to share it through
> >>>>> implicit sync.
> >>>>>
> >>>>> amdgpu completely lacks such an uapi. Fix this.
> >>>>>
> >>>>> Luckily the concept of ignoring implicit fences exists already, and
> >>>>> takes care of all the complexities of making sure that non-optional
> >>>>> fences (like bo moves) are not ignored. This support was added in
> >>>>>
> >>>>> commit 177ae09b5d699a5ebd1cafcee78889db968abf54
> >>>>> Author: Andres Rodriguez 
> >>>>> Date:   Fri Sep 15 20:44:06 2017 -0400
> >>>>>
> >>>>>   drm/amdgpu: introduce AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2
> >>>>>
> >>>>> Unfortuantely it's the wrong semantics, because it's a bo flag and
> >>>>> disables implicit sync on an allocated buffer completely.
> >>>>>
> >>>>> We _do_ want implicit sync, but control it explicitly. For this we
> >>>>> need a flag on the drm_file, so that a given userspace (like vulkan)
> >>>>> can manage the implicit sync slots explicitly. The other side of the
> >>>>> pipeline (compositor, other process or just different stage in a media
> >>>>> pipeline in the same process) can then either do the same, or fully
> >>>>> participate in the implicit sync as implemented by the kernel by
> >>>>> default.
> >>>>>
> >>>>> By building on the existing flag for buffers we avoid any issues with
> >>>>> opening up additional security concerns - anything this new flag here
> >>>>> allows is already.
> >>>>>
> >>>>> All drivers which supports this concept of a userspace-specific
> >>>>> opt-out of implicit sync have a flag in their CS ioctl, but in reality
> >>>>> that turned out to be a bit too inflexible. See the discussion below,
> >>>>> let's try to do a bit better for amdgpu.
> >>>>>
> >>>>> This alone only allows us to completely avoid any stalls due to
> >>>>> implicit sync, it does not yet allow us to use implicit sync as a
> >>>>> strange form of IPC for sync_file.
> >>>>>
> >>>>> For that we need two more pieces:
> >>>>>
> >>>>> - a way to get the current implicit sync fences out of a buffer. Could
> >>>>> be done in a driver ioctl, but everyone needs this, and generally a
> >>>>> dma-buf is involved anyway to establish the sharing. So an ioctl on
> >>>>> the dma-buf makes a ton more sense:
> >>>>>
> >>>>> 
> >>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2F20210520190007.534046-4-jason%40jlekstrand.net%2Fdata=04%7C01%7Cchristian.koenig%40amd.com%7Ca401fc4551f045c95d8808d9364c38f6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637600523287217723%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=L8KCz8711Y2qZx0%2FJWT6HSg4o6OMhn%2BC4U2IR06nViE%3Dreserved=0
> >>>>>
> >>>>> Current drivers in upstream solves this by having the opt-out flag
> >>>>> on their CS ioctl. This has the downside that very often the CS
> >>>>> which must actually stall for the implicit fence is run a while
> >>>>> aft

Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Daniel Vetter
On Wed, Jun 23, 2021 at 11:45 AM Bas Nieuwenhuizen
 wrote:
>
> On Tue, Jun 22, 2021 at 6:55 PM Daniel Vetter  wrote:
> >
> > WARNING: Absolutely untested beyond "gcc isn't dying in agony".
> >
> > Implicit fencing done properly needs to treat the implicit fencing
> > slots like a funny kind of IPC mailbox. In other words it needs to be
> > explicitly. This is the only way it will mesh well with explicit
> > fencing userspace like vk, and it's also the bare minimum required to
> > be able to manage anything else that wants to use the same buffer on
> > multiple engines in parallel, and still be able to share it through
> > implicit sync.
> >
> > amdgpu completely lacks such an uapi. Fix this.
> >
> > Luckily the concept of ignoring implicit fences exists already, and
> > takes care of all the complexities of making sure that non-optional
> > fences (like bo moves) are not ignored. This support was added in
> >
> > commit 177ae09b5d699a5ebd1cafcee78889db968abf54
> > Author: Andres Rodriguez 
> > Date:   Fri Sep 15 20:44:06 2017 -0400
> >
> > drm/amdgpu: introduce AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2
> >
> > Unfortuantely it's the wrong semantics, because it's a bo flag and
> > disables implicit sync on an allocated buffer completely.
> >
> > We _do_ want implicit sync, but control it explicitly. For this we
> > need a flag on the drm_file, so that a given userspace (like vulkan)
> > can manage the implicit sync slots explicitly. The other side of the
> > pipeline (compositor, other process or just different stage in a media
> > pipeline in the same process) can then either do the same, or fully
> > participate in the implicit sync as implemented by the kernel by
> > default.
> >
> > By building on the existing flag for buffers we avoid any issues with
> > opening up additional security concerns - anything this new flag here
> > allows is already.
> >
> > All drivers which supports this concept of a userspace-specific
> > opt-out of implicit sync have a flag in their CS ioctl, but in reality
> > that turned out to be a bit too inflexible. See the discussion below,
> > let's try to do a bit better for amdgpu.
> >
> > This alone only allows us to completely avoid any stalls due to
> > implicit sync, it does not yet allow us to use implicit sync as a
> > strange form of IPC for sync_file.
> >
> > For that we need two more pieces:
> >
> > - a way to get the current implicit sync fences out of a buffer. Could
> >   be done in a driver ioctl, but everyone needs this, and generally a
> >   dma-buf is involved anyway to establish the sharing. So an ioctl on
> >   the dma-buf makes a ton more sense:
> >
> >   
> > https://lore.kernel.org/dri-devel/20210520190007.534046-4-ja...@jlekstrand.net/
> >
> >   Current drivers in upstream solves this by having the opt-out flag
> >   on their CS ioctl. This has the downside that very often the CS
> >   which must actually stall for the implicit fence is run a while
> >   after the implicit fence point was logically sampled per the api
> >   spec (vk passes an explicit syncobj around for that afaiui), and so
> >   results in oversync. Converting the implicit sync fences into a
> >   snap-shot sync_file is actually accurate.
> >
> > - Simillar we need to be able to set the exclusive implicit fence.
> >   Current drivers again do this with a CS ioctl flag, with again the
> >   same problems that the time the CS happens additional dependencies
> >   have been added. An explicit ioctl to only insert a sync_file (while
> >   respecting the rules for how exclusive and shared fence slots must
> >   be update in struct dma_resv) is much better. This is proposed here:
> >
> >   
> > https://lore.kernel.org/dri-devel/20210520190007.534046-5-ja...@jlekstrand.net/
> >
> > These three pieces together allow userspace to fully control implicit
> > fencing and remove all unecessary stall points due to them.
> >
> > Well, as much as the implicit fencing model fundamentally allows:
> > There is only one set of fences, you can only choose to sync against
> > only writers (exclusive slot), or everyone. Hence suballocating
> > multiple buffers or anything else like this is fundamentally not
> > possible, and can only be fixed by a proper explicit fencing model.
> >
> > Aside from that caveat this model gets implicit fencing as closely to
> > explicit fencing semantics as possible:
> >
> > On the actual implementation I opted for a simple setparam ioctl, n

[Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-22 Thread Daniel Vetter
WARNING: Absolutely untested beyond "gcc isn't dying in agony".

Implicit fencing done properly needs to treat the implicit fencing
slots like a funny kind of IPC mailbox. In other words it needs to be
explicit. This is the only way it will mesh well with explicit
fencing userspace like vk, and it's also the bare minimum required to
be able to manage anything else that wants to use the same buffer on
multiple engines in parallel, and still be able to share it through
implicit sync.

amdgpu completely lacks such an uapi. Fix this.

Luckily the concept of ignoring implicit fences exists already, and
takes care of all the complexities of making sure that non-optional
fences (like bo moves) are not ignored. This support was added in

commit 177ae09b5d699a5ebd1cafcee78889db968abf54
Author: Andres Rodriguez 
Date:   Fri Sep 15 20:44:06 2017 -0400

drm/amdgpu: introduce AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2

Unfortunately it's the wrong semantics, because it's a bo flag and
disables implicit sync on an allocated buffer completely.

We _do_ want implicit sync, but control it explicitly. For this we
need a flag on the drm_file, so that a given userspace (like vulkan)
can manage the implicit sync slots explicitly. The other side of the
pipeline (compositor, other process or just different stage in a media
pipeline in the same process) can then either do the same, or fully
participate in the implicit sync as implemented by the kernel by
default.

By building on the existing flag for buffers we avoid any issues with
opening up additional security concerns - anything this new flag here
allows is already possible.

All drivers which support this concept of a userspace-specific
opt-out of implicit sync have a flag in their CS ioctl, but in reality
that turned out to be a bit too inflexible. See the discussion below,
let's try to do a bit better for amdgpu.

This alone only allows us to completely avoid any stalls due to
implicit sync, it does not yet allow us to use implicit sync as a
strange form of IPC for sync_file.

For that we need two more pieces:

- a way to get the current implicit sync fences out of a buffer. Could
  be done in a driver ioctl, but everyone needs this, and generally a
  dma-buf is involved anyway to establish the sharing. So an ioctl on
  the dma-buf makes a ton more sense:

  
https://lore.kernel.org/dri-devel/20210520190007.534046-4-ja...@jlekstrand.net/

  Current drivers in upstream solve this by having the opt-out flag
  on their CS ioctl. This has the downside that very often the CS
  which must actually stall for the implicit fence is run a while
  after the implicit fence point was logically sampled per the api
  spec (vk passes an explicit syncobj around for that afaiui), and so
  results in oversync. Converting the implicit sync fences into a
  snap-shot sync_file is actually accurate.

- Similarly we need to be able to set the exclusive implicit fence.
  Current drivers again do this with a CS ioctl flag, with again the
  same problem that by the time the CS happens additional dependencies
  have been added. An explicit ioctl to only insert a sync_file (while
  respecting the rules for how exclusive and shared fence slots must
  be updated in struct dma_resv) is much better. This is proposed here:

  
https://lore.kernel.org/dri-devel/20210520190007.534046-5-ja...@jlekstrand.net/

These three pieces together allow userspace to fully control implicit
fencing and remove all unnecessary stall points due to them.
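
(For illustration: a hedged userspace sketch of the resulting dance,
assuming the dma-buf export/import sync_file ioctls from the linked series
land roughly as proposed there; the ioctl and struct names below are taken
from that proposal and are assumptions, not part of this patch.)

#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/*
 * Hedged sketch: snapshot the current implicit fences of a dma-buf as a
 * sync_file fd (to feed into explicit-sync userspace like Vulkan), and
 * push an explicit sync_file back in as the implicit write fence.
 */
static int sketch_export_implicit_fences(int dmabuf_fd)
{
        struct dma_buf_export_sync_file args = {
                .flags = DMA_BUF_SYNC_RW,       /* readers and writers */
        };

        if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args))
                return -1;
        return args.fd;                         /* sync_file fd */
}

static int sketch_import_write_fence(int dmabuf_fd, int sync_file_fd)
{
        struct dma_buf_import_sync_file args = {
                .flags = DMA_BUF_SYNC_WRITE,
                .fd = sync_file_fd,
        };

        return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &args);
}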

Well, as much as the implicit fencing model fundamentally allows:
There is only one set of fences, you can only choose to sync against
only writers (exclusive slot), or everyone. Hence suballocating
multiple buffers or anything else like this is fundamentally not
possible, and can only be fixed by a proper explicit fencing model.

Aside from that caveat this model gets implicit fencing as closely to
explicit fencing semantics as possible:

On the actual implementation I opted for a simple setparam ioctl, no
locking (just atomic reads/writes) for simplicity. There is a nice
flag parameter in the VM ioctl which we could use, except:
- it's not checked, so userspace likely passes garbage
- there's already a comment that userspace _does_ pass garbage in the
  priority field
So yeah unfortunately this flag parameter for setting vm flags is
useless, and we need to hack up a new one.
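
(A hedged sketch of that shape, with all names hypothetical: a per-drm_file
flag flipped by a setparam-style ioctl and read atomically in the CS path.)

#include <linux/atomic.h>
#include <linux/errno.h>
#include <linux/types.h>

#define SKETCH_SETPARAM_EXPLICIT_SYNC   1       /* hypothetical param id */

/* Hypothetical per-drm_file private data. */
struct sketch_fpriv {
        atomic_t explicit_sync;         /* 0 = kernel-managed implicit sync */
};

/* setparam-style ioctl backend: no locking, just an atomic write. */
static int sketch_setparam(struct sketch_fpriv *fpriv, u64 param, u64 value)
{
        if (param != SKETCH_SETPARAM_EXPLICIT_SYNC)
                return -EINVAL;
        atomic_set(&fpriv->explicit_sync, value ? 1 : 0);
        return 0;
}

/* CS path: atomic read decides whether to attach/obey implicit fences. */
static bool sketch_cs_uses_implicit_sync(struct sketch_fpriv *fpriv)
{
        return !atomic_read(&fpriv->explicit_sync);
}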

v2: Explain why a new SETPARAM (Jason)

v3: Bas noticed I forgot to hook up the dependency-side shortcut. We
need both, or this doesn't do much.

v4: Rebase over the amdgpu patch to always set the implicit sync
fences.

Cc: mesa-dev@lists.freedesktop.org
Cc: Bas Nieuwenhuizen 
Cc: Dave Airlie 
Cc: Rob Clark 
Cc: Kristian H. Kristensen 
Cc: Michel Dänzer 
Cc: Daniel Stone 
Cc: Sumit Semwal 
Cc: "Christian König" 
Cc: Alex Deucher 
Cc: Daniel Vetter 
Cc: Deepak R Varma 
Cc: Chen Li 
Cc: Kevin Wang 
Cc: Dennis Li 
Cc: Luben Tuikov 
Cc: linaro-mm-...@lists.linaro.org
Signed-off-by: Daniel Vetter 

[Mesa-dev] [PATCH 03/15] dma-buf: Document dma-buf implicit fencing/resv fencing rules

2021-06-22 Thread Daniel Vetter
 
Cc: Daniel Stone 
Cc: Sumit Semwal 
Cc: "Christian König" 
Cc: Alex Deucher 
Cc: Daniel Vetter 
Cc: Deepak R Varma 
Cc: Chen Li 
Cc: Kevin Wang 
Cc: Dennis Li 
Cc: Luben Tuikov 
Cc: linaro-mm-...@lists.linaro.org
Signed-off-by: Daniel Vetter 
---
 include/linux/dma-buf.h | 39 +++
 1 file changed, 39 insertions(+)

diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 6d18b9e448b9..4807cefe81f5 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -388,6 +388,45 @@ struct dma_buf {
 * @resv:
 *
 * Reservation object linked to this dma-buf.
+*
+* IMPLICIT SYNCHRONIZATION RULES:
+*
+* Drivers which support implicit synchronization of buffer access as
+* e.g. exposed in `Implicit Fence Poll Support`_ should follow the
+* below rules.
+*
+* - Drivers should add a shared fence through
+*   dma_resv_add_shared_fence() for anything the userspace API
+*   considers a read access. This highly depends upon the API and
+*   window system: E.g. OpenGL is generally implicitly synchronized on
+*   Linux, but explicitly synchronized on Android. Whereas Vulkan is
+*   generally explicitly synchronized for everything, and window system
+*   buffers have explicit API calls (which then need to make sure the
+*   implicit fences stored here in @resv are updated correctly).
+*
+* - Similarly drivers should set the exclusive fence through
+*   dma_resv_add_excl_fence() for anything the userspace API considers
+*   write access.
+*
+* - Drivers may just always set the exclusive fence, since that only
+*   causes unnecessary synchronization, but no correctness issues.
+*
+* - Some drivers only expose a synchronous userspace API with no
+*   pipelining across drivers. These do not set any fences for their
+*   access. An example here is v4l.
+*
+* DYNAMIC IMPORTER RULES:
+*
+* Dynamic importers, see dma_buf_attachment_is_dynamic(), have
+* additional constraints on how they set up fences:
+*
+* - Dynamic importers must obey the exclusive fence and wait for it to
+*   signal before allowing access to the buffer's underlying storage
+*   through the device.
+*
+* - Dynamic importers should set fences for any access that they can't
+*   disable immediately from their @dma_buf_attach_ops.move_notify
+*   callback.
 */
struct dma_resv *resv;
 
-- 
2.32.0.rc2



Re: [Mesa-dev] [PATCH 0/6] dma-buf: Add an API for exporting sync files (v12)

2021-06-21 Thread Daniel Vetter
On Mon, Jun 21, 2021 at 12:16:55PM +0200, Christian König wrote:
> Am 18.06.21 um 20:45 schrieb Daniel Vetter:
> > On Fri, Jun 18, 2021 at 8:02 PM Christian König
> >  wrote:
> > > Am 18.06.21 um 19:20 schrieb Daniel Vetter:
> > > [SNIP]
> > > The whole thing was introduced with this commit here:
> > > 
> > > commit f2c24b83ae90292d315aa7ac029c6ce7929e01aa
> > > Author: Maarten Lankhorst 
> > > Date:   Wed Apr 2 17:14:48 2014 +0200
> > > 
> > >   drm/ttm: flip the switch, and convert to dma_fence
> > > 
> > >   Signed-off-by: Maarten Lankhorst 
> > > 
> > >int ttm_bo_move_accel_cleanup(struct ttm_buffer_object *bo,
> > > 
> > > -   bo->sync_obj = driver->sync_obj_ref(sync_obj);
> > > +   reservation_object_add_excl_fence(bo->resv, fence);
> > >   if (evict) {
> > > 
> > > Maarten replaced the bo->sync_obj reference with the dma_resv exclusive
> > > fence.
> > > 
> > > This means that we need to apply the sync_obj semantic to all drivers
> > > using a DMA-buf with its dma_resv object, otherwise you break imports
> > > from TTM drivers.
> > > 
> > > Since then and up till now the exclusive fence must be waited on and
> > > never replaced with anything which signals before the old fence.
> > > 
> > > Maarten and I think Thomas did that and I was always assuming that you
> > > know about this design decision.
> > Surprisingly I do actually know this.
> > 
> > Still the commit you cite did _not_ change any of the rules around
> > dma_buf: Importers have _no_ obligation to obey the exclusive fence,
> > because the buffer is pinned. None of the work that Maarten has done
> > has fundamentally changed this contract in any way.
> 
> Well I now agree that the rules around dma_resv are different than I
> thought, but this change should have raised more eyebrows.
> 
> The problem is this completely broke interop with all drivers using TTM and
> I think might even explain some bug reports.
> 
> I re-introduced the moving fence by adding bo->moving a few years after the
> initial introduction of dma_resv, but that was just to work around
> performance problems introduced by using the exclusive fence for both use
> cases.

Ok that part is indeed not something I'd known.

> > If amdgpu (or any other ttm based driver) hands back and sgt without
> > waiting for ttm_bo->moving or the exclusive fence first, then that's a
> > bug we need to fix in these drivers. But if ttm based drivers did get
> > this wrong, then they got this wrong both before and after the switch
> > over to using dma_resv - this bug would go back all the way to Dave's
> > introduction of drm_prime.c and support for that.
> 
> I'm not 100% sure, but I think before the switch to the dma_resv object
> drivers just waited for the BOs to become idle and that should have
> prevented this.
> 
> Anyway let's stop discussing history and move forward. Sending patches for
> all affected TTM driver with CC: stable tags in a minute.
> 
> 
> > The only thing which importers have to do is not wreak the DAG nature
> > of the dma_resv fences and drop dependencies. Currently there's a
> > handful of drivers which break this (introduced over the last few
> > years), and I have it somewhere on my todo list to audit them all.
> 
> Please give that some priority.
> 
> Ignoring the moving fence is a information leak, but messing up the DAG
> gives you access to freed up memory.

Yeah will try to. I've also been hung up a bit on how to fix that, but I
think just closing the DAG-breakage is simplest. Any userspace which then
complains about the additional sync that causes would then be motivated to
look into the import ioctl Jason has. And I think the impact in practice
should be minimal, aside from some corner cases.

> > The goal with extracting dma_resv from ttm was to make implicit sync
> > working and get rid of some terrible stalls on the userspace side.
> > Eventually it was also the goal to make truly dynamic buffer
> > reservation possible, but that took another 6 or so years to realize
> > with your work. And we had to make dynamic dma-buf very much opt-in,
> > because auditing all the users is very hard work and no one
> > volunteered. And for dynamic dma-buf the rule is that the exclusive
> > fence must _never_ be ignored, and the two drivers supporting it (mlx5
> > and amdgpu) obey that.
> > 
> > So yeah for ttm drivers dma_resv is primarily for memory management,
> > with a side effect of also

Re: [Mesa-dev] [PATCH 0/6] dma-buf: Add an API for exporting sync files (v12)

2021-06-18 Thread Daniel Vetter
On Fri, Jun 18, 2021 at 8:02 PM Christian König
 wrote:
>
> Am 18.06.21 um 19:20 schrieb Daniel Vetter:
> > On Fri, Jun 18, 2021 at 6:43 PM Christian König
> >  wrote:
> >> Am 18.06.21 um 17:17 schrieb Daniel Vetter:
> >>> [SNIP]
> >>> Ignoring _all_ fences is officially ok for pinned dma-buf. This is
> >>> what v4l does. Aside from it's definitely not just i915 that does this
> >>> even on the drm side, we have a few more drivers nowadays.
> >> No it seriously isn't. If drivers are doing this they are more than broken.
> >>
> >> See the comment in dma-resv.h
> >>
> >>* Based on bo.c which bears the following copyright notice,
> >>* but is dual licensed:
> >> 
> >>
> >>
> >> The handling in ttm_bo.c is and always was that the exclusive fence is
> >> used for buffer moves.
> >>
> >> As I said multiple times now the *MAIN* purpose of the dma_resv object
> >> is memory management and *NOT* synchronization.
> >>
> >> Those restrictions come from the original design of TTM where the
> >> dma_resv object originated from.
> >>
> >> The resulting consequences are that:
> >>
> >> a) If you access the buffer without waiting for the exclusive fence you
> >> run into a potential information leak.
> >>   We kind of let that slip for V4L since they only access the buffers
> >> for writes, so you can't do any harm there.
> >>
> >> b) If you overwrite the exclusive fence with a new one without waiting
> >> for the old one to signal you open up the possibility for userspace to
> >> access freed up memory.
> >>   This is a complete show stopper since it means that taking over the
> >> system is just a typing exercise.
> >>
> >>
> >> What you have done by allowing this in is ripping open a major security
> >> hole for any DMA-buf import in i915 from all TTM based driver.
> >>
> >> This needs to be fixed ASAP, either by waiting in i915 and all other
> >> drivers doing this for the exclusive fence while importing a DMA-buf or
> >> by marking i915 and all other drivers as broken.
> >>
> >> Sorry, but if you allowed that in you seriously have no idea what you
> >> are talking about here and where all of this originated from.
> > Dude, get a grip, seriously. dma-buf landed in 2011
> >
> > commit d15bd7ee445d0702ad801fdaece348fdb79e6581
> > Author: Sumit Semwal 
> > Date:   Mon Dec 26 14:53:15 2011 +0530
> >
> > dma-buf: Introduce dma buffer sharing mechanism
> >
> > and drm prime landed in the same year
> >
> > commit 3248877ea1796915419fba7c89315fdbf00cb56a
> > (airlied/drm-prime-dmabuf-initial)
> > Author: Dave Airlie 
> > Date:   Fri Nov 25 15:21:02 2011 +
> >
> > drm: base prime/dma-buf support (v5)
> >
> > dma-resv was extracted much later
> >
> > commit 786d7257e537da0674c02e16e3b30a44665d1cee
> > Author: Maarten Lankhorst 
> > Date:   Thu Jun 27 13:48:16 2013 +0200
> >
> > reservation: cross-device reservation support, v4
> >
> > Maarten's patch only extracted the dma_resv stuff so it's there,
> > optionally. There was never any effort to roll this out to all the
> > existing drivers, of which there were plenty.
> >
> > It is, and has been since 10 years, totally fine to access dma-buf
> > without looking at any fences at all. From your pov of a ttm driver
> > dma-resv is mainly used for memory management and not sync, but I
> > think that's also due to some reinterpretation of the actual sync
> > rules on your side. For everyone else the dma_resv attached to a
> > dma-buf has been about implicit sync only, nothing else.
>
> No, that was way before my time.
>
> The whole thing was introduced with this commit here:
>
> commit f2c24b83ae90292d315aa7ac029c6ce7929e01aa
> Author: Maarten Lankhorst 
> Date:   Wed Apr 2 17:14:48 2014 +0200
>
>  drm/ttm: flip the switch, and convert to dma_fence
>
>  Signed-off-by: Maarten Lankhorst 
>
>   int ttm_bo_move_accel_cleanup(struct ttm_buffer_object *bo,
> 
> -   bo->sync_obj = driver->sync_obj_ref(sync_obj);
> +   reservation_object_add_excl_fence(bo->resv, fence);
>  if (evict) {
>
> Maarten replaced the bo->sync_obj reference with the dma_resv exclusive
> fence.
>
> This means that we need to apply the sync_obj semantic to all drivers
> using a DMA-buf with its dma_resv object, otherwise you br

Re: [Mesa-dev] [PATCH 0/6] dma-buf: Add an API for exporting sync files (v12)

2021-06-18 Thread Daniel Vetter
On Fri, Jun 18, 2021 at 6:43 PM Christian König
 wrote:
>
> Am 18.06.21 um 17:17 schrieb Daniel Vetter:
> > [SNIP]
> > Ignoring _all_ fences is officially ok for pinned dma-buf. This is
> > what v4l does. Aside from it's definitely not just i915 that does this
> > even on the drm side, we have a few more drivers nowadays.
>
> No it seriously isn't. If drivers are doing this they are more than broken.
>
> See the comment in dma-resv.h
>
>   * Based on bo.c which bears the following copyright notice,
>   * but is dual licensed:
> 
>
>
> The handling in ttm_bo.c is and always was that the exclusive fence is
> used for buffer moves.
>
> As I said multiple times now the *MAIN* purpose of the dma_resv object
> is memory management and *NOT* synchronization.
>
> Those restrictions come from the original design of TTM where the
> dma_resv object originated from.
>
> The resulting consequences are that:
>
> a) If you access the buffer without waiting for the exclusive fence you
> run into a potential information leak.
>  We kind of let that slip for V4L since they only access the buffers
> for writes, so you can't do any harm there.
>
> b) If you overwrite the exclusive fence with a new one without waiting
> for the old one to signal you open up the possibility for userspace to
> access freed up memory.
>  This is a complete show stopper since it means that taking over the
> system is just a typing exercise.
>
>
> What you have done by allowing this in is ripping open a major security
> hole for any DMA-buf import in i915 from all TTM based driver.
>
> This needs to be fixed ASAP, either by waiting in i915 and all other
> drivers doing this for the exclusive fence while importing a DMA-buf or
> by marking i915 and all other drivers as broken.
>
> Sorry, but if you allowed that in you seriously have no idea what you
> are talking about here and where all of this originated from.

Dude, get a grip, seriously. dma-buf landed in 2011

commit d15bd7ee445d0702ad801fdaece348fdb79e6581
Author: Sumit Semwal 
Date:   Mon Dec 26 14:53:15 2011 +0530

   dma-buf: Introduce dma buffer sharing mechanism

and drm prime landed in the same year

commit 3248877ea1796915419fba7c89315fdbf00cb56a
(airlied/drm-prime-dmabuf-initial)
Author: Dave Airlie 
Date:   Fri Nov 25 15:21:02 2011 +

   drm: base prime/dma-buf support (v5)

dma-resv was extracted much later

commit 786d7257e537da0674c02e16e3b30a44665d1cee
Author: Maarten Lankhorst 
Date:   Thu Jun 27 13:48:16 2013 +0200

   reservation: cross-device reservation support, v4

Maarten's patch only extracted the dma_resv stuff so it's there,
optionally. There was never any effort to roll this out to all the
existing drivers, of which there were plenty.

It is, and has been for 10 years, totally fine to access dma-buf
without looking at any fences at all. From your pov of a ttm driver
dma-resv is mainly used for memory management and not sync, but I
think that's also due to some reinterpretation of the actual sync
rules on your side. For everyone else the dma_resv attached to a
dma-buf has been about implicit sync only, nothing else.

_only_ when you have a dynamic importer/exporter can you assume that
the dma_resv fences must actually be obeyed. That's one of the reasons
why we had to make this a completely new mode (the other one was
locking, but they really tie together).

Wrt your problems:
a) needs to be fixed in drivers exporting buffers and failing to make
sure the memory is there by the time dma_buf_map_attachment returns.
b) needs to be fixed in the importers, and there's quite a few of
those. There's more than i915 here, which is why I think we should
have the dma_resv_add_shared_exclusive helper extracted from amdgpu.
Avoids hand-rolling this about 5 times (6 if we include the import
ioctl from Jason).

Also I've been trying to explain this ever since the entire
dynamic dma-buf thing started.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] [PATCH 0/6] dma-buf: Add an API for exporting sync files (v12)

2021-06-18 Thread Daniel Vetter
On Fri, Jun 18, 2021 at 4:42 PM Christian König
 wrote:
>
> Am 18.06.21 um 16:31 schrieb Daniel Vetter:
> > [SNIP]
> >> And that drivers choose to ignore the exclusive fence is an absolutely
> >> no-go from a memory management and security point of view. Exclusive
> >> access means exclusive access. Ignoring that won't work.
> > Yeah, this is why I've been going all over the place about lifting
> > ttm_bo->moving to dma_resv. And also that I flat out don't trust your
> > audit, if you haven't found these drivers then very clearly you didn't
> > audit much at all :-)
>
> I just didn't think that anybody could be so stupid as to allow such a
> thing in.
>
> >> The only thing which saved us so far is the fact that drivers doing this
> >> are not that complex.
> >>
> >> BTW: How does it even work? I mean then you would run into the same
> >> problem as amdgpu with its page table update fences, e.g. that your
> >> shared fences might signal before the exclusive one.
> > So we don't ignore any fences when we rip out the backing storage.
> >
> > And yes there's currently a bug in all these drivers that if you set
> > both the "ignore implicit fences" and the "set the exclusive fence"
> > flag, then we just break this. Which is why I think we want to have a
> > dma_fence_add_shared_exclusive() helper extracted from your amdgpu
> > code, which we can then use everywhere to plug this.
>
> Daniel, do you realize what you are talking about here? Does that also
> apply to imported DMA-bufs?
>
> If yes, then that is a security hole you can push an elephant through.
>
> Can you point me to the code using that?
>
> >>> For dma-buf this isn't actually a problem, because dma-buf are pinned. You
> >>> can't move them while other drivers are using them, hence there's not
> >>> actually a ttm_bo->moving fence we can ignore.
> >>>
> >>> p2p dma-buf aka dynamic dma-buf is a different beast, and i915 (and fwiw
> >>> these other drivers) need to change before they can do dynamic dma-buf.
> >>>
> >>>> Otherwise we have an information leak worth a CVE and that is certainly 
> >>>> not
> >>>> something we want.
> >>> Because yes otherwise we get a CVE. But right now I don't think we have
> >>> one.
> >> Yeah, agree. But this is just because of coincidence and not because of
> >> good engineering :)
> > Well the good news is that I think we're now talking slightly less
> > past each other than the past few weeks :-)
> >
> >>> We do have a quite big confusion on what exactly the signaling ordering is
> >>> supposed to be between exclusive and the collective set of shared fences,
> >>> and there's some unifying that needs to happen here. But I think what
> >>> Jason implements here in the import ioctl is the most defensive version
> >>> possible, so really can't break any driver. It really works like you have
> >>> an ad-hoc gpu engine that does nothing itself, but waits for the current
> >>> exclusive fence and then sets the exclusive fence with its "CS" completion
> >>> fence.
> >>>
> >>> That's imo perfectly legit use-case.
> >> The use case is certainly legit, but I'm not sure if merging this at the
> >> moment is a good idea.
> >>
> >> Your note that drivers are already ignoring the exclusive fence in the
> >> dma_resv object was eye opening to me. And I now have the very strong
> >> feeling that the synchronization and the design of the dma_resv object
> >> is even more messy than I thought it is.
> >>
> >> To summarize we can be really lucky that it didn't blow up into our
> >> faces already.
> > I don't think there was that much luck involved (ok I did find a
> > possible bug in i915 already around cpu cache flushing) - for SoC the
> > exclusive slot in dma_resv really is only used for implicit sync and
> > nothing else. The fun only starts when you throw in pipelined backing
> > storage movement.
> >
> > I guess this also explains why you just seemed to ignore me when I was
> > asking for a memory management exclusive fence for the p2p stuff, or
> > some other way to specifically handling movements (like ttm_bo->moving
> > or whatever it is). From my pov we clearly needed that to make p2p
> > dma-buf work well enough, mixing up the memory management exclusive
> > slot with the implicit sync exclusive slot never looked like a bright

Re: [Mesa-dev] [PATCH 0/6] dma-buf: Add an API for exporting sync files (v12)

2021-06-18 Thread Daniel Vetter
On Fri, Jun 18, 2021 at 11:15 AM Christian König
 wrote:
>
> Am 17.06.21 um 21:58 schrieb Daniel Vetter:
> > On Thu, Jun 17, 2021 at 09:37:36AM +0200, Christian König wrote:
> >> [SNIP]
> >>> But, to the broader point, maybe?  I'm a little fuzzy on exactly where
> >>> i915 inserts and/or depends on fences.
> >>>
> >>>> When you combine that with complex drivers which use TTM and buffer
> >>>> moves underneath you can construct an information leak using this and
> >>>> give userspace access to memory which is allocated to the driver, but
> >>>> not yet initialized.
> >>>>
> >>>> This way you can leak things like page tables, passwords, kernel data
> >>>> etc... in large amounts to userspace and is an absolutely no-go for
> >>>> security.
> >>> Ugh...  Unfortunately, I'm really out of my depth on the implications
> >>> going on here but I think I see your point.
> >>>
> >>>> That's why I'm said we need to get this fixed before we upstream this
> >>>> patch set here and especially the driver change which is using that.
> >>> Well, i915 has had uAPI for a while to ignore fences.
> >> Yeah, exactly that's illegal.
> > You're a few years too late with closing that barn door. The following
> > drivers have this concept
> > - i915
> > - msm
> > - etnaviv
> >
> > Because you can't write a competent vulkan driver without this.
>
> WHAT? ^^
>
> > This was discussed at absolute epic length in various xdcs iirc. We did 
> > ignore a
> > bit the vram/ttm/bo-moving problem because all the people present were
> > hacking on integrated gpu (see list above), but that just means we need to
> > treat the ttm_bo->moving fence properly.
>
> I should have visited more XDCs in the past, the problem is much larger
> than this.
>
> But I now start to understand what you are doing with that design and
> why it looks so messy to me, amdgpu is just currently the only driver
> which does Vulkan and complex memory management at the same time.
>
> >> At least the kernel internal fences like moving or clearing a buffer object
> >> need to be taken into account before a driver is allowed to access a
> >> buffer.
> > Yes i915 needs to make sure it never ignores ttm_bo->moving.
>
> No, that is only the tip of the iceberg. See TTM for example also puts
> fences which drivers need to wait for into the shared slots. Same thing
> for use cases like clear on release etc.
>
>  From my point of view the main purpose of the dma_resv object is to
> serve memory management, synchronization for command submission is just
> a secondary use case.
>
> And that drivers choose to ignore the exclusive fence is an absolutely
> no-go from a memory management and security point of view. Exclusive
> access means exclusive access. Ignoring that won't work.

Yeah, this is why I've been going all over the place about lifting
ttm_bo->moving to dma_resv. And also that I flat out don't trust your
audit, if you haven't found these drivers then very clearly you didn't
audit much at all :-)

> The only thing which saved us so far is the fact that drivers doing this
> are not that complex.
>
> BTW: How does it even work? I mean then you would run into the same
> problem as amdgpu with its page table update fences, e.g. that your
> shared fences might signal before the exclusive one.

So we don't ignore any fences when we rip out the backing storage.

And yes there's currently a bug in all these drivers that if you set
both the "ignore implicit fences" and the "set the exclusive fence"
flag, then we just break this. Which is why I think we want to have a
dma_fence_add_shared_exclusive() helper extracted from your amdgpu
code, which we can then use everywhere to plug this.

> > For dma-buf this isn't actually a problem, because dma-buf are pinned. You
> > can't move them while other drivers are using them, hence there's not
> > actually a ttm_bo->moving fence we can ignore.
> >
> > p2p dma-buf aka dynamic dma-buf is a different beast, and i915 (and fwiw
> > these other drivers) need to change before they can do dynamic dma-buf.
> >
> >> Otherwise we have an information leak worth a CVE and that is certainly not
> >> something we want.
> > Because yes otherwise we get a CVE. But right now I don't think we have
> > one.
>
> Yeah, agree. But this is just because of coincidence and not because of
> good engineering :)

Well the good news is that I think we're now talking slightly less
past each other than the past few weeks :-)

Re: [Mesa-dev] [PATCH 0/6] dma-buf: Add an API for exporting sync files (v12)

2021-06-17 Thread Daniel Vetter
ike the flag idea. But the problem
is we first need a ton more encapsulation and review of drivers before we
can change the internals. One thing at a time.

And yes for amdgpu this gets triple-hard because you both have the
ttm_bo->moving fence _and_ the current uapi of using fence ownership _and_
you need to figure out how to support vulkan properly with true opt-in
fencing. I'm pretty sure it's doable, I'm just not finding any time
anywhere to hack on these patches - too many other fires :-(

Cheers, Daniel

> 
> Christian.
> 
> > 
> > --Jason
> > 
> > 
> > > Regards,
> > > Christian.
> > > 
> > > Am 10.06.21 um 23:09 schrieb Jason Ekstrand:
> > > > Modern userspace APIs like Vulkan are built on an explicit
> > > > synchronization model.  This doesn't always play nicely with the
> > > > implicit synchronization used in the kernel and assumed by X11 and
> > > > Wayland.  The client -> compositor half of the synchronization isn't too
> > > > bad, at least on intel, because we can control whether or not i915
> > > > synchronizes on the buffer and whether or not it's considered written.
> > > > 
> > > > The harder part is the compositor -> client synchronization when we get
> > > > the buffer back from the compositor.  We're required to be able to
> > > > provide the client with a VkSemaphore and VkFence representing the point
> > > > in time where the window system (compositor and/or display) finished
> > > > using the buffer.  With current APIs, it's very hard to do this in such
> > > > a way that we don't get confused by the Vulkan driver's access of the
> > > > buffer.  In particular, once we tell the kernel that we're rendering to
> > > > the buffer again, any CPU waits on the buffer or GPU dependencies will
> > > > wait on some of the client rendering and not just the compositor.
> > > > 
> > > > This new IOCTL solves this problem by allowing us to get a snapshot of
> > > > the implicit synchronization state of a given dma-buf in the form of a
> > > > sync file.  It's effectively the same as a poll() or I915_GEM_WAIT only,
> > > > instead of CPU waiting directly, it encapsulates the wait operation, at
> > > > the current moment in time, in a sync_file so we can check/wait on it
> > > > later.  As long as the Vulkan driver does the sync_file export from the
> > > > dma-buf before we re-introduce it for rendering, it will only contain
> > > > fences from the compositor or display.  This allows to accurately turn
> > > > it into a VkFence or VkSemaphore without any over-synchronization.
> > > > 
> > > > This patch series actually contains two new ioctls.  There is the export
> > > > one mentioned above as well as an RFC for an import ioctl which provides
> > > > the other half.  The intention is to land the export ioctl since it 
> > > > seems
> > > > like there's no real disagreement on that one.  The import ioctl, 
> > > > however,
> > > > has a lot of debate around it so it's intended to be RFC-only for now.
> > > > 
> > > > Mesa MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
> > > > IGT tests: https://patchwork.freedesktop.org/series/90490/
> > > > 
> > > > v10 (Jason Ekstrand, Daniel Vetter):
> > > >- Add reviews/acks
> > > >- Add a patch to rename _rcu to _unlocked
> > > >- Split things better so import is clearly RFC status
> > > > 
> > > > v11 (Daniel Vetter):
> > > >- Add more CCs to try and get maintainers
> > > >- Add a patch to document DMA_BUF_IOCTL_SYNC
> > > >- Generally better docs
> > > >- Use separate structs for impo
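
(For reference, a hedged sketch of how userspace could use the export ioctl
described in the cover letter above. The struct and ioctl names follow the
posted series and should be treated as assumptions here, since the uapi was
still being revised at the time:)

#include <linux/dma-buf.h>
#include <sys/ioctl.h>

/* Snapshot the dma-buf's current implicit fences into a sync_file fd. */
static int dmabuf_export_sync_file(int dmabuf_fd)
{
        struct dma_buf_export_sync_file args = {
                .flags = DMA_BUF_SYNC_READ | DMA_BUF_SYNC_WRITE,
                .fd = -1,
        };

        if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args))
                return -1;

        /* args.fd can now be turned into a VkSemaphore/VkFence without
         * picking up the client's own later rendering */
        return args.fd;
}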

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-17 Thread Daniel Vetter
On Thu, Jun 17, 2021 at 02:28:06PM -0400, Marek Olšák wrote:
> The kernel will know who should touch the implicit-sync semaphore next, and
> at the same time, the copy of all write requests to the implicit-sync
> semaphore will be forwarded to the kernel for monitoring and bo_wait.
> 
> Syncobjs could either use the same monitored access as implicit sync or be
> completely unmonitored. We haven't decided yet.
> 
> Syncfiles could either use one of the above or wait for a syncobj to go
> idle before converting to a syncfile.

Hm this sounds all like you're planning to completely rewrap everything
... I'm assuming the plan is still that this is going to be largely
wrapped in dma_fence? Maybe with timeline objects being a bit more
optimized, but I'm not sure how much you can optimize without breaking the
interfaces.
-Daniel

> 
> Marek
> 
> 
> 
> On Thu, Jun 17, 2021 at 12:48 PM Daniel Vetter  wrote:
> 
> > On Mon, Jun 14, 2021 at 07:13:00PM +0200, Christian König wrote:
> > > As long as we can figure out who touched a certain sync object last
> > > that would indeed work, yes.
> >
> > Don't you need to know who will touch it next, i.e. who is holding up your
> > fence? Or maybe I'm just again totally confused.
> > -Daniel
> >
> > >
> > > Christian.
> > >
> > > Am 14.06.21 um 19:10 schrieb Marek Olšák:
> > > > The call to the hw scheduler has a limitation on the size of all
> > > > parameters combined. I think we can only pass a 32-bit sequence number
> > > > and a ~16-bit global (per-GPU) syncobj handle in one call and not much
> > > > else.
> > > >
> > > > The syncobj handle can be an element index in a global (per-GPU)
> > syncobj
> > > > table and it's read only for all processes with the exception of the
> > > > signal command. Syncobjs can either have per VMID write access flags
> > for
> > > > the signal command (slow), or any process can write to any syncobjs and
> > > > only rely on the kernel checking the write log (fast).
> > > >
> > > > In any case, we can execute the memory write in the queue engine and
> > > > only use the hw scheduler for logging, which would be perfect.
> > > >
> > > > Marek
> > > >
> > > > On Thu, Jun 10, 2021 at 12:33 PM Christian König
> > > >  > > > <mailto:ckoenig.leichtzumer...@gmail.com>> wrote:
> > > >
> > > > Hi guys,
> > > >
> > > > maybe soften that a bit. Reading from the shared memory of the
> > > > user fence is ok for everybody. What we need to take more care of
> > > > is the writing side.
> > > >
> > > > So my current thinking is that we allow read only access, but
> > > > writing a new sequence value needs to go through the
> > scheduler/kernel.
> > > >
> > > > So when the CPU wants to signal a timeline fence it needs to call
> > > > an IOCTL. When the GPU wants to signal the timeline fence it needs
> > > > to hand that off to the hardware scheduler.
> > > >
> > > > If we lockup the kernel can check with the hardware who did the
> > > > last write and what value was written.
> > > >
> > > > That together with an IOCTL to give out sequence number for
> > > > implicit sync to applications should be sufficient for the kernel
> > > > to track who is responsible if something bad happens.
> > > >
> > > > In other words when the hardware says that the shader wrote stuff
> > > > like 0xdeadbeef 0x0 or 0x into memory we kill the process
> > > > who did that.
> > > >
> > > > If the hardware says that seq - 1 was written fine, but seq is
> > > > missing then the kernel blames whoever was supposed to write seq.
> > > >
> > > > Just piping the write through a privileged instance should be
> > > > fine to make sure that we don't run into issues.
> > > >
> > > > Christian.
> > > >
> > > > Am 10.06.21 um 17:59 schrieb Marek Olšák:
> > > > > Hi Daniel,
> > > > >
> > > > > We just talked about this whole topic internally and we came
> > > > > to the conclusion that the hardware needs to understand sync
> > > > > object handles and have high-level wait and signal operations in
> >

Re: [Mesa-dev] [Intel-gfx] [PATCH 0/2] GuC submission / DRM scheduler integration plan + new uAPI

2021-06-17 Thread Daniel Vetter
On Fri, Jun 11, 2021 at 04:40:42PM -0700, Matthew Brost wrote:
> Subject and patches say it all.
> 
> v2: Address comments, patches have details of changes
> v3: Address comments, patches have details of changes
> v4: Address comments, patches have details of changes
> 
> Signed-off-by: Matthew Brost 

Imo ready (well overdue) for merging, please annoy Carl or someone from
media for an ack and then ask John or Daniele to merge it into
drm-intel-gt-next.
-Daniel

> 
> Matthew Brost (2):
>   drm/doc/rfc: i915 GuC submission / DRM scheduler
>   drm/doc/rfc: i915 new parallel submission uAPI plan
> 
>  Documentation/gpu/rfc/i915_parallel_execbuf.h | 117 ++
>  Documentation/gpu/rfc/i915_scheduler.rst  | 148 ++
>  Documentation/gpu/rfc/index.rst   |   4 +
>  3 files changed, 269 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/i915_parallel_execbuf.h
>  create mode 100644 Documentation/gpu/rfc/i915_scheduler.rst
> 
> -- 
> 2.28.0
> 
> ___
> Intel-gfx mailing list
> intel-...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-17 Thread Daniel Vetter
On Mon, Jun 14, 2021 at 07:13:00PM +0200, Christian König wrote:
> As long as we can figure out who touched to a certain sync object last that
> would indeed work, yes.

Don't you need to know who will touch it next, i.e. who is holding up your
fence? Or maybe I'm just again totally confused.
-Daniel

> 
> Christian.
> 
> Am 14.06.21 um 19:10 schrieb Marek Olšák:
> > The call to the hw scheduler has a limitation on the size of all
> > parameters combined. I think we can only pass a 32-bit sequence number
> > and a ~16-bit global (per-GPU) syncobj handle in one call and not much
> > else.
> > 
> > The syncobj handle can be an element index in a global (per-GPU) syncobj
> > table and it's read only for all processes with the exception of the
> > signal command. Syncobjs can either have per VMID write access flags for
> > the signal command (slow), or any process can write to any syncobjs and
> > only rely on the kernel checking the write log (fast).
> > 
> > In any case, we can execute the memory write in the queue engine and
> > only use the hw scheduler for logging, which would be perfect.
> > 
> > Marek
> > 
> > On Thu, Jun 10, 2021 at 12:33 PM Christian König
> >  > <mailto:ckoenig.leichtzumer...@gmail.com>> wrote:
> > 
> > Hi guys,
> > 
> > maybe soften that a bit. Reading from the shared memory of the
> > user fence is ok for everybody. What we need to take more care of
> > is the writing side.
> > 
> > So my current thinking is that we allow read only access, but
> > writing a new sequence value needs to go through the scheduler/kernel.
> > 
> > So when the CPU wants to signal a timeline fence it needs to call
> > an IOCTL. When the GPU wants to signal the timeline fence it needs
> > to hand that off to the hardware scheduler.
> > 
> > If we lockup the kernel can check with the hardware who did the
> > last write and what value was written.
> > 
> > That together with an IOCTL to give out sequence number for
> > implicit sync to applications should be sufficient for the kernel
> > to track who is responsible if something bad happens.
> > 
> > In other words when the hardware says that the shader wrote stuff
> > like 0xdeadbeef 0x0 or 0x into memory we kill the process
> > who did that.
> > 
> > If the hardware says that seq - 1 was written fine, but seq is
> > missing then the kernel blames whoever was supposed to write seq.
> > 
> > Just piping the write through a privileged instance should be
> > fine to make sure that we don't run into issues.
> > 
> > Christian.
> > 
> > Am 10.06.21 um 17:59 schrieb Marek Olšák:
> > > Hi Daniel,
> > > 
> > > We just talked about this whole topic internally and we came
> > > to the conclusion that the hardware needs to understand sync
> > > object handles and have high-level wait and signal operations in
> > > the command stream. Sync objects will be backed by memory, but
> > > they won't be readable or writable by processes directly. The
> > > hardware will log all accesses to sync objects and will send the
> > > log to the kernel periodically. The kernel will identify
> > > malicious behavior.
> > > 
> > > Example of a hardware command stream:
> > > ...
> > > ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence
> > > number is assigned by the kernel
> > > Draw();
> > > ImplicitSyncSignalWhenDone(syncObjHandle);
> > > ...
> > > 
> > > I'm afraid we have no other choice because of the TLB
> > > invalidation overhead.
> > > 
> > > Marek
> > > 
> > > 
> > > On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter  > > <mailto:dan...@ffwll.ch>> wrote:
> > > 
> > > On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
> > > > Am 09.06.21 um 15:19 schrieb Daniel Vetter:
> > > > > [SNIP]
> > > > > > Yeah, we call this the lightweight and the heavyweight
> > > tlb flush.
> > > > > >
> > > > > > The lightweight can be used when you are sure that you
> > > don't have any of the
> > > > > > PTEs currently in flight in the 3D/DMA engine and you
> > > just need to
> > >

Re: [Mesa-dev] [Intel-gfx] [RFC PATCH 2/2] drm/doc/rfc: i915 new parallel submission uAPI plan

2021-06-17 Thread Daniel Vetter
Sorry, I'm behind on mails ...

On Fri, Jun 11, 2021 at 12:50:29PM -0700, Matthew Brost wrote:
> On Fri, Jun 04, 2021 at 07:59:05PM +0200, Daniel Vetter wrote:
> > On Wed, May 26, 2021 at 04:33:57PM -0700, Matthew Brost wrote:
> > > Add entry for i915 new parallel submission uAPI plan.
> > > 
> > > v2:
> > >  (Daniel Vetter):
> > >   - Expand logical order explanation
> > >   - Add dummy header
> > >   - Only allow N BBs in execbuf IOCTL
> > >   - Configure parallel submission per slot not per gem context
> > > v3:
> > >  (Marcin Ślusarz):
> > >   - Lots of typos / bad English fixed
> > >  (Tvrtko Ursulin):
> > >   - Consistent pseudo code, clean up wording in descriptions
> > > 
> > > Cc: Tvrtko Ursulin 
> > > Cc: Tony Ye 
> > > CC: Carl Zhang 
> > > Cc: Daniel Vetter 
> > > Cc: Jason Ekstrand 
> > > Signed-off-by: Matthew Brost 
> > > ---
> > >  Documentation/gpu/rfc/i915_parallel_execbuf.h | 145 ++
> > >  Documentation/gpu/rfc/i915_scheduler.rst  |  55 ++-
> > >  2 files changed, 198 insertions(+), 2 deletions(-)
> > >  create mode 100644 Documentation/gpu/rfc/i915_parallel_execbuf.h
> > > 
> > > diff --git a/Documentation/gpu/rfc/i915_parallel_execbuf.h 
> > > b/Documentation/gpu/rfc/i915_parallel_execbuf.h
> > > new file mode 100644
> > > index ..20de206e3ab4
> > > --- /dev/null
> > > +++ b/Documentation/gpu/rfc/i915_parallel_execbuf.h
> > > @@ -0,0 +1,145 @@
> > > +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see 
> > > i915_context_engines_parallel_submit */
> > > +
> > > +/*
> > > + * i915_context_engines_parallel_submit:
> > 
> > So the idea is to make these kerneldoc and pull them into the rfc section.
> > Then when we merge, move them to the real uapi section, like what Matt has
> > done for lmem.
> > 
> 
> Yep, will fix in next rev.
> 
> > > + *
> > > + * Setup a slot in the context engine map to allow multiple BBs to be 
> > > submitted
> > > + * in a single execbuf IOCTL. Those BBs will then be scheduled to run on 
> > > the GPU
> > > + * in parallel. Multiple hardware contexts are created internally in the 
> > > i915
> > > + * run these BBs. Once a slot is configured for N BBs only N BBs can be
> > > + * submitted in each execbuf IOCTL and this is implicit behavior e.g. 
> > > The user
> > > + * doesn't tell the execbuf IOCTL there are N BBs, the execbuf IOCTL 
> > > know how
> > > + * many BBs there are based on the slots configuration. The N BBs are 
> > > the last N
> > > + * buffer objects for first N if I915_EXEC_BATCH_FIRST is set.
> > 
> > s/for/or/
> > 
> > > + *
> > > + * There are two currently defined ways to control the placement of the
> > > + * hardware contexts on physical engines: default behavior (no flags) and
> > > + * I915_PARALLEL_IMPLICIT_BONDS (a flag). More flags may be added the in 
> > > the
> > > + * future as new hardware / use cases arise. Details of how to use this
> > > + * interface above the flags field in this structure.
> > > + *
> > > + * Returns -EINVAL if hardware context placement configuration is 
> > > invalid or if
> > > + * the placement configuration isn't supported on the platform / 
> > > submission
> > > + * interface.
> > > + * Returns -ENODEV if extension isn't supported on the platform / 
> > > submission
> > > + * inteface.
> > > + */
> > > +struct i915_context_engines_parallel_submit {
> > > + struct i915_user_extension base;
> > > +
> > > + __u16 engine_index; /* slot for parallel engine */
> > 
> > Kernel doc here for the inline comments too.
> >
> 
> Yep.
>  
> > > + __u16 width;/* number of contexts per parallel engine */
> > > + __u16 num_siblings; /* number of siblings per context */
> > > + __u16 mbz16;
> > > +/*
> > > + * Default placement behavior (currently unsupported):
> > > + *
> > > + * Allow BBs to be placed on any available engine instance. In this case 
> > > each
> > > + * context's engine mask indicates where that context can be placed. It 
> > > is
> > > + * implied in this mode that all contexts have mutual exclusive 
> > > placement.
> > > + * e.g. If one context is running CSX[0] no other 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-09 Thread Daniel Vetter
On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
> Am 09.06.21 um 15:19 schrieb Daniel Vetter:
> > [SNIP]
> > > Yeah, we call this the lightweight and the heavyweight tlb flush.
> > > 
> > > The lightweight can be used when you are sure that you don't have any of 
> > > the
> > > PTEs currently in flight in the 3D/DMA engine and you just need to
> > > invalidate the TLB.
> > > 
> > > The heavyweight must be used when you need to invalidate the TLB *AND* 
> > > make
> > > sure that no concurrently operation moves new stuff into the TLB.
> > > 
> > > The problem is for this use case we have to use the heavyweight one.
> > Just for my own curiosity: So the lightweight flush is only for in-between
> > CS when you know access is idle? Or does that also not work if userspace
> > has a CS on a dma engine going at the same time because the tlb aren't
> > isolated enough between engines?
> 
> More or less correct, yes.
> 
> The problem is a lightweight flush only invalidates the TLB, but doesn't
> take care of entries which have been handed out to the different engines.
> 
> In other words what can happen is the following:
> 
> 1. Shader asks TLB to resolve address X.
> 2. TLB looks into its cache and can't find address X so it asks the walker
> to resolve.
> 3. Walker comes back with result for address X and TLB puts that into its
> cache and gives it to Shader.
> 4. Shader starts doing some operation using result for address X.
> 5. You send lightweight TLB invalidate and TLB throws away cached values for
> address X.
> 6. Shader happily still uses whatever the TLB gave it in step 3 to
> access address X
> 
> See it like the shader has its own 1-entry L0 TLB cache which is not
> affected by the lightweight flush.
> 
> The heavyweight flush on the other hand sends out a broadcast signal to
> everybody and only comes back when we are sure that an address is not in use
> any more.

Ah makes sense. On intel the shaders only operate in VA, everything goes
around as explicit async messages to IO blocks. So we don't have this, the
only difference in tlb flushes is between tlb flush in the IB and an mmio
one which is independent of anything currently being executed on an
engine.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-09 Thread Daniel Vetter
On Fri, Jun 04, 2021 at 01:27:15PM +0200, Christian König wrote:
> Am 04.06.21 um 10:57 schrieb Daniel Vetter:
> > On Fri, Jun 04, 2021 at 09:00:31AM +0200, Christian König wrote:
> > > Am 02.06.21 um 21:19 schrieb Daniel Vetter:
> > > > On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:
> > > > > Am 02.06.21 um 20:48 schrieb Daniel Vetter:
> > > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  
> > > > > > > wrote:
> > > > > > > 
> > > > > > > > Yes, we can't break anything because we don't want to 
> > > > > > > > complicate things
> > > > > > > > for us. It's pretty much all NAK'd already. We are trying to 
> > > > > > > > gather more
> > > > > > > > knowledge and then make better decisions.
> > > > > > > > 
> > > > > > > > The idea we are considering is that we'll expose memory-based 
> > > > > > > > sync objects
> > > > > > > > to userspace for read only, and the kernel or hw will strictly 
> > > > > > > > control the
> > > > > > > > memory writes to those sync objects. The hole in that idea is 
> > > > > > > > that
> > > > > > > > userspace can decide not to signal a job, so even if userspace 
> > > > > > > > can't
> > > > > > > > overwrite memory-based sync object states arbitrarily, it can 
> > > > > > > > still decide
> > > > > > > > not to signal them, and then a future fence is born.
> > > > > > > > 
> > > > > > > This would actually be treated as a GPU hang caused by that 
> > > > > > > context, so it
> > > > > > > should be fine.
> > > > > > This is practically what I proposed already, except your not doing 
> > > > > > it with
> > > > > > dma_fence. And on the memory fence side this also doesn't actually 
> > > > > > give
> > > > > > what you want for that compute model.
> > > > > > 
> > > > > > This seems like a bit a worst of both worlds approach to me? Tons 
> > > > > > of work
> > > > > > in the kernel to hide these not-dma_fence-but-almost, and still 
> > > > > > pain to
> > > > > > actually drive the hardware like it should be for compute or direct
> > > > > > display.
> > > > > > 
> > > > > > Also maybe I've missed it, but I didn't see any replies to my 
> > > > > > suggestion
> > > > > > how to fake the entire dma_fence stuff on top of new hw. Would be
> > > > > > interesting to know what doesn't work there instead of amd folks 
> > > > > > going of
> > > > > > into internal again and then coming back with another rfc from out 
> > > > > > of
> > > > > > nowhere :-)
> > > > > Well to be honest I would just push back on our hardware/firmware 
> > > > > guys that
> > > > > we need to keep kernel queues forever before going down that route.
> > > > I looked again, and you said the model won't work because preemption is 
> > > > way
> > > > too slow, even when the context is idle.
> > > > 
> > > > I guess at that point I got maybe too fed up and just figured "not my
> > > > problem", but if preempt is too slow as the unload fence, you can do it
> > > > with pte removal and tlb shootdown too (that is hopefully not too slow,
> > > > otherwise your hw is just garbage and won't even be fast for direct 
> > > > submit
> > > > compute workloads).
> > > Have you seen that one here:
> > > https://www.spinics.net/lists/amd-gfx/msg63101.html :)
> > > 
> > > I've rejected it because I think polling for 6 seconds on a TLB flush 
> > > which
> > > can block interrupts as well is just madness.
> > Hm but I thought you had like 2 tlb flush modes, the shitty one (with
> > retrying page faults) and the not so shitty one?
> 
> Yeah, we call this the lightweight and the heavyweight tlb flush.
> 
> The lightweight can be used when you are sure that you don't have any of the
> PTEs currently 

Re: [Mesa-dev] [Intel-gfx] [RFC PATCH 2/2] drm/doc/rfc: i915 new parallel submission uAPI plan

2021-06-04 Thread Daniel Vetter
On Wed, May 26, 2021 at 04:33:57PM -0700, Matthew Brost wrote:
> Add entry for i915 new parallel submission uAPI plan.
> 
> v2:
>  (Daniel Vetter):
>   - Expand logical order explanation
>   - Add dummy header
>   - Only allow N BBs in execbuf IOCTL
>   - Configure parallel submission per slot not per gem context
> v3:
>  (Marcin Ślusarz):
>   - Lots of typos / bad English fixed
>  (Tvrtko Ursulin):
>   - Consistent pseudo code, clean up wording in descriptions
> 
> Cc: Tvrtko Ursulin 
> Cc: Tony Ye 
> CC: Carl Zhang 
> Cc: Daniel Vetter 
> Cc: Jason Ekstrand 
> Signed-off-by: Matthew Brost 
> ---
>  Documentation/gpu/rfc/i915_parallel_execbuf.h | 145 ++
>  Documentation/gpu/rfc/i915_scheduler.rst  |  55 ++-
>  2 files changed, 198 insertions(+), 2 deletions(-)
>  create mode 100644 Documentation/gpu/rfc/i915_parallel_execbuf.h
> 
> diff --git a/Documentation/gpu/rfc/i915_parallel_execbuf.h 
> b/Documentation/gpu/rfc/i915_parallel_execbuf.h
> new file mode 100644
> index ..20de206e3ab4
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_parallel_execbuf.h
> @@ -0,0 +1,145 @@
> +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see 
> i915_context_engines_parallel_submit */
> +
> +/*
> + * i915_context_engines_parallel_submit:

So the idea is to make these kerneldoc and pull them into the rfc section.
Then when we merge, move them to the real uapi section, like what Matt has
done for lmem.

> + *
> + * Setup a slot in the context engine map to allow multiple BBs to be 
> submitted
> + * in a single execbuf IOCTL. Those BBs will then be scheduled to run on the 
> GPU
> + * in parallel. Multiple hardware contexts are created internally in the i915
> + * run these BBs. Once a slot is configured for N BBs only N BBs can be
> + * submitted in each execbuf IOCTL and this is implicit behavior e.g. The 
> user
> + * doesn't tell the execbuf IOCTL there are N BBs, the execbuf IOCTL know how
> + * many BBs there are based on the slots configuration. The N BBs are the 
> last N
> + * buffer objects for first N if I915_EXEC_BATCH_FIRST is set.

s/for/or/

> + *
> + * There are two currently defined ways to control the placement of the
> + * hardware contexts on physical engines: default behavior (no flags) and
> + * I915_PARALLEL_IMPLICIT_BONDS (a flag). More flags may be added the in the
> + * future as new hardware / use cases arise. Details of how to use this
> + * interface above the flags field in this structure.
> + *
> + * Returns -EINVAL if hardware context placement configuration is invalid or 
> if
> + * the placement configuration isn't supported on the platform / submission
> + * interface.
> + * Returns -ENODEV if extension isn't supported on the platform / submission
> + * inteface.
> + */
> +struct i915_context_engines_parallel_submit {
> + struct i915_user_extension base;
> +
> + __u16 engine_index; /* slot for parallel engine */

Kernel doc here for the inline comments too.

> + __u16 width;/* number of contexts per parallel engine */
> + __u16 num_siblings; /* number of siblings per context */
> + __u16 mbz16;
> +/*
> + * Default placement behavior (currently unsupported):
> + *
> + * Allow BBs to be placed on any available engine instance. In this case each
> + * context's engine mask indicates where that context can be placed. It is
> + * implied in this mode that all contexts have mutual exclusive placement.
> + * e.g. If one context is running CSX[0] no other contexts can run on 
> CSX[0]).
> + *
> + * Example 1 pseudo code:
> + * CSX,Y[N] = generic engine class X or Y, logical instance N
> + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> + * set_engines(INVALID)
> + * set_parallel(engine_index=0, width=2, num_siblings=2,
> + *   engines=CSX[0],CSX[1],CSY[0],CSY[1])
> + *
> + * Results in the following valid placements:
> + * CSX[0], CSY[0]
> + * CSX[0], CSY[1]
> + * CSX[1], CSY[0]
> + * CSX[1], CSY[1]
> + *
> + * This can also be thought of as 2 virtual engines described by 2-D array in
> + * the engines the field:
> + * VE[0] = CSX[0], CSX[1]
> + * VE[1] = CSY[0], CSY[1]
> + *
> + * Example 2 pseudo code:
> + * CSX[Y] = generic engine of same class X, logical instance N
> + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> + * set_engines(INVALID)
> + * set_parallel(engine_index=0, width=2, num_siblings=3,
> + *   engines=CSX[0],CSX[1],CSX[2],CSX[0],CSX[1],CSX[2])
> + *
> + * Results in the following valid placements:
> + * CSX[0], CSX[1]
> + * CSX[0], CSX[2]
> + * CSX[1], CSX[0]
> + * CSX[1], CSX[2]
> + * CSX[2], CSX[0]

Re: [Mesa-dev] [Intel-gfx] [RFC PATCH 1/2] drm/doc/rfc: i915 GuC submission / DRM scheduler

2021-06-04 Thread Daniel Vetter
On Wed, May 26, 2021 at 04:33:56PM -0700, Matthew Brost wrote:
> Add entry for i915 GuC submission / DRM scheduler integration plan.
> Follow up patch with details of new parallel submission uAPI to come.
> 
> v2:
>  (Daniel Vetter)
>   - Expand explanation of why bonding isn't supported for GuC
> submission
>   - CC some of the DRM scheduler maintainers
>   - Add priority inheritance / boosting use case
>   - Add reasoning for removing in order assumptions
>  (Daniel Stone)
>   - Add links to priority spec
> 
> Cc: Christian König 
> Cc: Luben Tuikov 
> Cc: Alex Deucher 
> Cc: Steven Price 
> Cc: Jon Bloomfield 
> Cc: Jason Ekstrand 
> Cc: Dave Airlie 
> Cc: Daniel Vetter 
> Cc: Jason Ekstrand 
> Cc: dri-de...@lists.freedesktop.org
> Signed-off-by: Matthew Brost 

You have a one-line hunk in the next patch that probably should be here.
With that moved.

Reviewed-by: Daniel Vetter 

Also did you get an ack from Carl on these?
-Daniel

> ---
>  Documentation/gpu/rfc/i915_scheduler.rst | 85 
>  Documentation/gpu/rfc/index.rst  |  4 ++
>  2 files changed, 89 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/i915_scheduler.rst
> 
> diff --git a/Documentation/gpu/rfc/i915_scheduler.rst 
> b/Documentation/gpu/rfc/i915_scheduler.rst
> new file mode 100644
> index ..7faa46cde088
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_scheduler.rst
> @@ -0,0 +1,85 @@
> +=
> +I915 GuC Submission/DRM Scheduler Section
> +=
> +
> +Upstream plan
> +=
> +For upstream the overall plan for landing GuC submission and integrating the
> +i915 with the DRM scheduler is:
> +
> +* Merge basic GuC submission
> + * Basic submission support for all gen11+ platforms
> + * Not enabled by default on any current platforms but can be enabled via
> +   modparam enable_guc
> + * Lots of rework will need to be done to integrate with DRM scheduler so
> +   no need to nit pick everything in the code, it just should be
> +   functional, no major coding style / layering errors, and not regress
> +   execlists
> + * Update IGTs / selftests as needed to work with GuC submission
> + * Enable CI on supported platforms for a baseline
> + * Rework / get CI healthy for GuC submission in place as needed
> +* Merge new parallel submission uAPI
> + * Bonding uAPI completely incompatible with GuC submission, plus it has
> +   severe design issues in general, which is why we want to retire it no
> +   matter what
> + * New uAPI adds I915_CONTEXT_ENGINES_EXT_PARALLEL context setup step
> +   which configures a slot with N contexts 
> + * After I915_CONTEXT_ENGINES_EXT_PARALLEL a user can submit N batches to
> +   a slot in a single execbuf IOCTL and the batches run on the GPU in
> +   parallel
> + * Initially only for GuC submission but execlists can be supported if
> +   needed
> +* Convert the i915 to use the DRM scheduler
> + * GuC submission backend fully integrated with DRM scheduler
> + * All request queues removed from backend (e.g. all backpressure
> +   handled in DRM scheduler)
> + * Resets / cancels hook in DRM scheduler
> + * Watchdog hooks into DRM scheduler
> + * Lots of complexity of the GuC backend can be pulled out once
> +   integrated with DRM scheduler (e.g. state machine gets
> +   simpler, locking gets simpler, etc...)
> + * Execlist backend will do the minimum required to hook in the DRM
> +   scheduler so it can live next to the fully integrated GuC backend
> + * Legacy interface
> + * Features like timeslicing / preemption / virtual engines would
> +   be difficult to integrate with the DRM scheduler and these
> +   features are not required for GuC submission as the GuC does
> +   these things for us
> + * ROI low on fully integrating into DRM scheduler
> + * Fully integrating would add lots of complexity to DRM
> +   scheduler
> + * Port i915 priority inheritance / boosting feature in DRM scheduler
> + * Used for i915 page flip, may be useful to other DRM drivers as
> +   well
> + * Will be an optional feature in the DRM scheduler
> + * Remove in-order completion assumptions from DRM scheduler
> + * Even when using the DRM scheduler the backends will handle
> +   preemption, timeslicing, etc... so it is possible for jobs to
> +   finish out of order

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-04 Thread Daniel Vetter
On Fri, Jun 04, 2021 at 09:00:31AM +0200, Christian König wrote:
> Am 02.06.21 um 21:19 schrieb Daniel Vetter:
> > On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:
> > > 
> > > Am 02.06.21 um 20:48 schrieb Daniel Vetter:
> > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:
> > > > > 
> > > > > > Yes, we can't break anything because we don't want to complicate 
> > > > > > things
> > > > > > for us. It's pretty much all NAK'd already. We are trying to gather 
> > > > > > more
> > > > > > knowledge and then make better decisions.
> > > > > > 
> > > > > > The idea we are considering is that we'll expose memory-based sync 
> > > > > > objects
> > > > > > to userspace for read only, and the kernel or hw will strictly 
> > > > > > control the
> > > > > > memory writes to those sync objects. The hole in that idea is that
> > > > > > userspace can decide not to signal a job, so even if userspace can't
> > > > > > overwrite memory-based sync object states arbitrarily, it can still 
> > > > > > decide
> > > > > > not to signal them, and then a future fence is born.
> > > > > > 
> > > > > This would actually be treated as a GPU hang caused by that context, 
> > > > > so it
> > > > > should be fine.
> > > > This is practically what I proposed already, except your not doing it 
> > > > with
> > > > dma_fence. And on the memory fence side this also doesn't actually give
> > > > what you want for that compute model.
> > > > 
> > > > This seems like a bit a worst of both worlds approach to me? Tons of 
> > > > work
> > > > in the kernel to hide these not-dma_fence-but-almost, and still pain to
> > > > actually drive the hardware like it should be for compute or direct
> > > > display.
> > > > 
> > > > Also maybe I've missed it, but I didn't see any replies to my suggestion
> > > > how to fake the entire dma_fence stuff on top of new hw. Would be
> > > > interesting to know what doesn't work there instead of amd folks going 
> > > > of
> > > > into internal again and then coming back with another rfc from out of
> > > > nowhere :-)
> > > Well to be honest I would just push back on our hardware/firmware guys 
> > > that
> > > we need to keep kernel queues forever before going down that route.
> > I looked again, and you said the model won't work because preemption is way
> > too slow, even when the context is idle.
> > 
> > I guess at that point I got maybe too fed up and just figured "not my
> > problem", but if preempt is too slow as the unload fence, you can do it
> > with pte removal and tlb shootdown too (that is hopefully not too slow,
> > otherwise your hw is just garbage and won't even be fast for direct submit
> > compute workloads).
> 
> Have you seen that one here:
> https://www.spinics.net/lists/amd-gfx/msg63101.html :)
> 
> I've rejected it because I think polling for 6 seconds on a TLB flush which
> can block interrupts as well is just madness.

Hm but I thought you had like 2 tlb flush modes, the shitty one (with
retrying page faults) and the not so shitty one?

But yeah at that point I think you just have to bite one of the bullets.

The thing is with hmm/userspace memory fence model this will be even
worse, because you will _have_ to do this tlb flush deep down in core mm
functions, so this is going to be userptr, but worse.

With the dma_resv/dma_fence bo memory management model you can at least
wrap that tlb flush into a dma_fence and push the waiting/pinging onto a
separate thread or something like that. If the hw really is that slow.
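
(A rough kernel-side sketch of that idea, assuming the stock dma_fence API; the
hw_* hooks and all names are made up for illustration:)

#include <linux/dma-fence.h>
#include <linux/workqueue.h>
#include <linux/spinlock.h>

void hw_start_tlb_flush(void);          /* made-up hw hooks */
void hw_wait_for_tlb_flush_ack(void);

struct tlb_flush_fence {
        struct dma_fence base;
        spinlock_t lock;
        struct work_struct work;
};

static const char *tlb_fence_driver_name(struct dma_fence *f)
{
        return "example";
}

static const char *tlb_fence_timeline_name(struct dma_fence *f)
{
        return "tlb-flush";
}

static const struct dma_fence_ops tlb_fence_ops = {
        .get_driver_name = tlb_fence_driver_name,
        .get_timeline_name = tlb_fence_timeline_name,
};

static void tlb_flush_work(struct work_struct *work)
{
        struct tlb_flush_fence *f = container_of(work, typeof(*f), work);

        hw_wait_for_tlb_flush_ack();    /* the slow waiting/pinging happens here */
        dma_fence_signal(&f->base);
        dma_fence_put(&f->base);        /* drop the worker's reference */
}

/* Kick the flush and return a fence others can wait on (e.g. via dma_resv). */
static struct dma_fence *issue_tlb_flush_fence(struct tlb_flush_fence *f,
                                               u64 context, u64 seqno)
{
        spin_lock_init(&f->lock);
        dma_fence_init(&f->base, &tlb_fence_ops, &f->lock, context, seqno);
        INIT_WORK(&f->work, tlb_flush_work);

        hw_start_tlb_flush();
        dma_fence_get(&f->base);        /* reference held by the worker */
        schedule_work(&f->work);

        return &f->base;                /* caller owns the initial reference */
}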

Somewhat aside: Personally I think that sriov needs to move over to the
compute model, i.e. indefinite timeouts, no tdr, because everything takes
too long. At least looking around sriov timeouts tend to be 10x bare
metal, across the board.

But for stuff like cloud gaming that's serious amounts of heavy lifting
since it brings us right back "the entire linux/android 3d stack is built
on top of dma_fence right now".

> > The only thing that you need to do when you use pte clearing + tlb
> > shootdown instad of preemption as the unload fence for buffers that get
> > moved is that if you get any gpu page fault, you don't serve that, but
> > instead treat it

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-03 Thread Daniel Vetter
On Thu, Jun 3, 2021 at 7:53 PM Marek Olšák  wrote:
>
> Daniel, I think what you are suggesting is that we need to enable user queues 
> with the drm scheduler and dma_fence first, and once that works, we can 
> investigate how much of that kernel logic can be moved to the hw. Would that 
> work? In theory it shouldn't matter whether the kernel does it or the hw does 
> it. It's the same code, just in a different place.

Yeah I guess that's another way to look at it. Maybe in practice
you'll just move it from the kernel to userspace, which then programs
the hw waits directly into its IB. That's at least how I'd do it on
i915, assuming I'd have such hw. So these fences that userspace
programs directly (to sync within itself) won't even show up as
dependencies in the kernel.

And then yes on the other side you can lift work from the
drm/scheduler wrt dependencies you get in the kernel (whether explicit
sync with sync_file, or implicit sync fished out of dma_resv) and
program the hw directly that way. That would mean that userspace won't
fill the ringbuffer directly, but the kernel would do that, so that
you have space to stuff in the additional waits. Again assuming i915
hw model, maybe works differently on amd. Iirc we have some of that
already in the i915 scheduler, but I'd need to recheck how much it
really uses the hw semaphores.
-Daniel
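
(Illustrative sketch of that split, with everything below except dma_fence_wait()
made up: the kernel owns the ring, prepends hw semaphore waits for the
dependencies it collected, and only then chains to the user's IB:)

#include <linux/dma-fence.h>

/* Made-up types and hw hooks, for illustration only. */
struct ring;
struct user_ib;
bool hw_semaphore_for_fence(struct dma_fence *f, u64 *addr, u64 *value);
void ring_emit_sem_wait(struct ring *ring, u64 addr, u64 value);
void ring_emit_ib(struct ring *ring, struct user_ib *ib);
void ring_commit(struct ring *ring);

static int submit_with_kernel_ring(struct ring *ring, struct user_ib *ib,
                                   struct dma_fence **deps, unsigned int ndeps)
{
        unsigned int i;

        for (i = 0; i < ndeps; i++) {
                u64 addr, value;
                long ret;

                /* If the fence maps to a hw semaphore, wait on the gpu ... */
                if (hw_semaphore_for_fence(deps[i], &addr, &value)) {
                        ring_emit_sem_wait(ring, addr, value);
                        continue;
                }

                /* ... otherwise fall back to a cpu wait in the kernel. */
                ret = dma_fence_wait(deps[i], true);
                if (ret)
                        return ret;
        }

        ring_emit_ib(ring, ib);         /* then chain to the user-provided IB */
        ring_commit(ring);
        return 0;
}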

> Thanks,
> Marek
>
> On Thu, Jun 3, 2021 at 7:22 AM Daniel Vetter  wrote:
>>
>> On Thu, Jun 3, 2021 at 12:55 PM Marek Olšák  wrote:
>> >
>> > On Thu., Jun. 3, 2021, 06:03 Daniel Vetter,  wrote:
>> >>
>> >> On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
>> >> > On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter  wrote:
>> >> >
>> >> > > On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
>> >> > > > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter  
>> >> > > > wrote:
>> >> > > >
>> >> > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
>> >> > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  
>> >> > > > > > wrote:
>> >> > > > > >
>> >> > > > > > > Yes, we can't break anything because we don't want to 
>> >> > > > > > > complicate
>> >> > > things
>> >> > > > > > > for us. It's pretty much all NAK'd already. We are trying to 
>> >> > > > > > > gather
>> >> > > > > more
>> >> > > > > > > knowledge and then make better decisions.
>> >> > > > > > >
>> >> > > > > > > The idea we are considering is that we'll expose memory-based 
>> >> > > > > > > sync
>> >> > > > > objects
>> >> > > > > > > to userspace for read only, and the kernel or hw will strictly
>> >> > > control
>> >> > > > > the
>> >> > > > > > > memory writes to those sync objects. The hole in that idea is 
>> >> > > > > > > that
>> >> > > > > > > userspace can decide not to signal a job, so even if userspace
>> >> > > can't
>> >> > > > > > > overwrite memory-based sync object states arbitrarily, it can 
>> >> > > > > > > still
>> >> > > > > decide
>> >> > > > > > > not to signal them, and then a future fence is born.
>> >> > > > > > >
>> >> > > > > >
>> >> > > > > > This would actually be treated as a GPU hang caused by that 
>> >> > > > > > context,
>> >> > > so
>> >> > > > > it
>> >> > > > > > should be fine.
>> >> > > > >
>> >> > > > > This is practically what I proposed already, except your not 
>> >> > > > > doing it
>> >> > > with
>> >> > > > > dma_fence. And on the memory fence side this also doesn't 
>> >> > > > > actually give
>> >> > > > > what you want for that compute model.
>> >> > > > >
>> >> > > > > This seems like a bit a worst of both worlds approach to me? Tons 
>> >> > > > > of
>> >> > > work
>> >> > > > > in the kerne

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-03 Thread Daniel Vetter
On Thu, Jun 3, 2021 at 12:55 PM Marek Olšák  wrote:
>
> On Thu., Jun. 3, 2021, 06:03 Daniel Vetter,  wrote:
>>
>> On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
>> > On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter  wrote:
>> >
>> > > On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
>> > > > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter  wrote:
>> > > >
>> > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
>> > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Yes, we can't break anything because we don't want to complicate
>> > > things
>> > > > > > > for us. It's pretty much all NAK'd already. We are trying to 
>> > > > > > > gather
>> > > > > more
>> > > > > > > knowledge and then make better decisions.
>> > > > > > >
>> > > > > > > The idea we are considering is that we'll expose memory-based 
>> > > > > > > sync
>> > > > > objects
>> > > > > > > to userspace for read only, and the kernel or hw will strictly
>> > > control
>> > > > > the
>> > > > > > > memory writes to those sync objects. The hole in that idea is 
>> > > > > > > that
>> > > > > > > userspace can decide not to signal a job, so even if userspace
>> > > can't
>> > > > > > > overwrite memory-based sync object states arbitrarily, it can 
>> > > > > > > still
>> > > > > decide
>> > > > > > > not to signal them, and then a future fence is born.
>> > > > > > >
>> > > > > >
>> > > > > > This would actually be treated as a GPU hang caused by that 
>> > > > > > context,
>> > > so
>> > > > > it
>> > > > > > should be fine.
>> > > > >
>> > > > > This is practically what I proposed already, except your not doing it
>> > > with
>> > > > > dma_fence. And on the memory fence side this also doesn't actually 
>> > > > > give
>> > > > > what you want for that compute model.
>> > > > >
>> > > > > This seems like a bit a worst of both worlds approach to me? Tons of
>> > > work
>> > > > > in the kernel to hide these not-dma_fence-but-almost, and still pain 
>> > > > > to
>> > > > > actually drive the hardware like it should be for compute or direct
>> > > > > display.
>> > > > >
>> > > > > Also maybe I've missed it, but I didn't see any replies to my
>> > > suggestion
>> > > > > how to fake the entire dma_fence stuff on top of new hw. Would be
>> > > > > interesting to know what doesn't work there instead of amd folks 
>> > > > > going
>> > > of
>> > > > > into internal again and then coming back with another rfc from out of
>> > > > > nowhere :-)
>> > > > >
>> > > >
>> > > > Going internal again is probably a good idea to spare you the long
>> > > > discussions and not waste your time, but we haven't talked about the
>> > > > dma_fence stuff internally other than acknowledging that it can be
>> > > solved.
>> > > >
>> > > > The compute use case already uses the hw as-is with no inter-process
>> > > > sharing, which mostly keeps the kernel out of the picture. It uses
>> > > glFinish
>> > > > to sync with GL.
>> > > >
>> > > > The gfx use case needs new hardware logic to support implicit and
>> > > explicit
>> > > > sync. When we propose a solution, it's usually torn apart the next day 
>> > > > by
>> > > > ourselves.
>> > > >
>> > > > Since we are talking about next hw or next next hw, preemption should 
>> > > > be
>> > > > better.
>> > > >
>> > > > user queue = user-mapped ring buffer
>> > > >
>> > > > For implicit sync, we will only let userspace lock access to a buffer
>> > > via a
>>

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-03 Thread Daniel Vetter
On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
> On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter  wrote:
> 
> > On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
> > > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter  wrote:
> > >
> > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:
> > > > >
> > > > > > Yes, we can't break anything because we don't want to complicate
> > things
> > > > > > for us. It's pretty much all NAK'd already. We are trying to gather
> > > > more
> > > > > > knowledge and then make better decisions.
> > > > > >
> > > > > > The idea we are considering is that we'll expose memory-based sync
> > > > objects
> > > > > > to userspace for read only, and the kernel or hw will strictly
> > control
> > > > the
> > > > > > memory writes to those sync objects. The hole in that idea is that
> > > > > > userspace can decide not to signal a job, so even if userspace
> > can't
> > > > > > overwrite memory-based sync object states arbitrarily, it can still
> > > > decide
> > > > > > not to signal them, and then a future fence is born.
> > > > > >
> > > > >
> > > > > This would actually be treated as a GPU hang caused by that context,
> > so
> > > > it
> > > > > should be fine.
> > > >
> > > > This is practically what I proposed already, except your not doing it
> > with
> > > > dma_fence. And on the memory fence side this also doesn't actually give
> > > > what you want for that compute model.
> > > >
> > > > This seems like a bit a worst of both worlds approach to me? Tons of
> > work
> > > > in the kernel to hide these not-dma_fence-but-almost, and still pain to
> > > > actually drive the hardware like it should be for compute or direct
> > > > display.
> > > >
> > > > Also maybe I've missed it, but I didn't see any replies to my
> > suggestion
> > > > how to fake the entire dma_fence stuff on top of new hw. Would be
> > > > interesting to know what doesn't work there instead of amd folks going
> > of
> > > > into internal again and then coming back with another rfc from out of
> > > > nowhere :-)
> > > >
> > >
> > > Going internal again is probably a good idea to spare you the long
> > > discussions and not waste your time, but we haven't talked about the
> > > dma_fence stuff internally other than acknowledging that it can be
> > solved.
> > >
> > > The compute use case already uses the hw as-is with no inter-process
> > > sharing, which mostly keeps the kernel out of the picture. It uses
> > glFinish
> > > to sync with GL.
> > >
> > > The gfx use case needs new hardware logic to support implicit and
> > explicit
> > > sync. When we propose a solution, it's usually torn apart the next day by
> > > ourselves.
> > >
> > > Since we are talking about next hw or next next hw, preemption should be
> > > better.
> > >
> > > user queue = user-mapped ring buffer
> > >
> > > For implicit sync, we will only let userspace lock access to a buffer
> > via a
> > > user queue, which waits for the per-buffer sequence counter in memory to
> > be
> > > >= the number assigned by the kernel, and later unlock the access with
> > > another command, which increments the per-buffer sequence counter in
> > memory
> > > with atomic_inc regardless of the number assigned by the kernel. The
> > kernel
> > > counter and the counter in memory can be out-of-sync, and I'll explain
> > why
> > > it's OK. If a process increments the kernel counter but not the memory
> > > counter, that's its problem and it's the same as a GPU hang caused by
> > that
> > > process. If a process increments the memory counter but not the kernel
> > > counter, the ">=" condition alongside atomic_inc guarantee that
> > signaling n
> > > will signal n+1, so it will never deadlock but also it will effectively
> > > disable synchronization. This method of disabling synchronization is
> > > similar to a process corrupting the buffer, which should be fine. Can you
> > > find any flaw in 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-03 Thread Daniel Vetter
On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
> On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter  wrote:
> 
> > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:
> > >
> > > > Yes, we can't break anything because we don't want to complicate things
> > > > for us. It's pretty much all NAK'd already. We are trying to gather
> > more
> > > > knowledge and then make better decisions.
> > > >
> > > > The idea we are considering is that we'll expose memory-based sync
> > objects
> > > > to userspace for read only, and the kernel or hw will strictly control
> > the
> > > > memory writes to those sync objects. The hole in that idea is that
> > > > userspace can decide not to signal a job, so even if userspace can't
> > > > overwrite memory-based sync object states arbitrarily, it can still
> > decide
> > > > not to signal them, and then a future fence is born.
> > > >
> > >
> > > This would actually be treated as a GPU hang caused by that context, so
> > it
> > > should be fine.
> >
> > This is practically what I proposed already, except you're not doing it with
> > dma_fence. And on the memory fence side this also doesn't actually give
> > what you want for that compute model.
> >
> > This seems like a bit of a worst-of-both-worlds approach to me? Tons of work
> > in the kernel to hide these not-dma_fence-but-almost, and still pain to
> > actually drive the hardware like it should be for compute or direct
> > display.
> >
> > Also maybe I've missed it, but I didn't see any replies to my suggestion
> > how to fake the entire dma_fence stuff on top of new hw. Would be
> > interesting to know what doesn't work there instead of amd folks going off
> > into internal again and then coming back with another rfc from out of
> > nowhere :-)
> >
> 
> Going internal again is probably a good idea to spare you the long
> discussions and not waste your time, but we haven't talked about the
> dma_fence stuff internally other than acknowledging that it can be solved.
> 
> The compute use case already uses the hw as-is with no inter-process
> sharing, which mostly keeps the kernel out of the picture. It uses glFinish
> to sync with GL.
> 
> The gfx use case needs new hardware logic to support implicit and explicit
> sync. When we propose a solution, it's usually torn apart the next day by
> ourselves.
> 
> Since we are talking about next hw or next next hw, preemption should be
> better.
> 
> user queue = user-mapped ring buffer
> 
> For implicit sync, we will only let userspace lock access to a buffer via a
> user queue, which waits for the per-buffer sequence counter in memory to be
> >= the number assigned by the kernel, and later unlock the access with
> another command, which increments the per-buffer sequence counter in memory
> with atomic_inc regardless of the number assigned by the kernel. The kernel
> counter and the counter in memory can be out-of-sync, and I'll explain why
> it's OK. If a process increments the kernel counter but not the memory
> counter, that's its problem and it's the same as a GPU hang caused by that
> process. If a process increments the memory counter but not the kernel
> counter, the ">=" condition alongside atomic_inc guarantee that signaling n
> will signal n+1, so it will never deadlock but also it will effectively
> disable synchronization. This method of disabling synchronization is
> similar to a process corrupting the buffer, which should be fine. Can you
> find any flaw in it? I can't find any.
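
A minimal CPU-side sketch of the per-buffer sequence-counter scheme described
above, purely for illustration (the real thing would be wait/signal packets in
the user queue, not C functions, and all names here are invented):

#include <stdatomic.h>
#include <stdint.h>

struct buffer_sync {
    _Atomic uint64_t mem_counter;   /* lives in GPU-visible memory */
};

/* "lock access" command: wait until the in-memory counter reaches the number
 * the kernel assigned for this submission. */
static void buf_lock(struct buffer_sync *s, uint64_t kernel_assigned)
{
    while (atomic_load(&s->mem_counter) < kernel_assigned)
        ;   /* hw would sleep on this condition instead of spinning */
}

/* "unlock access" command: unconditionally bump the in-memory counter,
 * regardless of the kernel-assigned number. Per the argument above, the ">="
 * wait plus the atomic increment mean that a skipped increment only disables
 * synchronization for this buffer rather than deadlocking later waiters. */
static void buf_unlock(struct buffer_sync *s)
{
    atomic_fetch_add(&s->mem_counter, 1);
}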

Hm, maybe I misunderstood what exactly you wanted to do earlier. That kind
of "we let userspace free-wheel whatever it wants, and the kernel ensures
correctness of the resulting chain of dma_fence by resetting the entire
context" is what I proposed too.

Like you say, userspace is allowed to render garbage already.

> The explicit submit can be done by userspace (if there is no
> synchronization), but we plan to use the kernel to do it for implicit sync.
> Essentially, the kernel will receive a buffer list and addresses of wait
> commands in the user queue. It will assign new sequence numbers to all
> buffers and write those numbers into the wait commands, and ring the hw
> doorbell to start execution of that queue.
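
A rough sketch of the kernel-side flow just described; next_kernel_seqno(),
write_user_queue() and ring_doorbell() are hypothetical helpers standing in
for the real driver plumbing, this is not actual amdgpu code:

#include <stdint.h>

/* Hypothetical helpers, declared only so the sketch is self-contained. */
extern uint64_t next_kernel_seqno(void *bo);
extern void write_user_queue(uint64_t queue_addr, uint64_t value);
extern void ring_doorbell(void);

struct implicit_submit {
    unsigned int num_buffers;
    void **buffers;             /* BOs touched by this submission */
    uint64_t *wait_cmd_addrs;   /* where each wait command sits in the user queue */
};

static void kernel_implicit_submit(struct implicit_submit *s)
{
    for (unsigned int i = 0; i < s->num_buffers; i++) {
        /* Assign a new kernel-side sequence number for this buffer... */
        uint64_t seq = next_kernel_seqno(s->buffers[i]);
        /* ...and patch it into that buffer's wait command in the user queue. */
        write_user_queue(s->wait_cmd_addrs[i], seq);
    }
    /* Only once everything is patched does the hw doorbell start the queue. */
    ring_doorbell();
}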

Yeah, for implicit sync I think having the kernel submit and using
drm/scheduler to sort out the dma_fence dependencies is probably best.
Since you can filter out
which dma_fence you hand to the scheduler for dependency tracking you can
filter out your own ones and let the hw handle those di

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-02 Thread Daniel Vetter
On Wed, Jun 02, 2021 at 10:09:01AM +0200, Michel Dänzer wrote:
> On 2021-06-01 12:49 p.m., Michel Dänzer wrote:
> > On 2021-06-01 12:21 p.m., Christian König wrote:
> > 
> >> Another question is if that is sufficient as security for the display 
> >> server or if we need further handling down the road? I mean essentially we 
> >> are moving the reliability problem into the display server.
> > 
> > Good question. This should generally protect the display server from 
> > freezing due to client fences never signalling, but there might still be 
> > corner cases.
> 
> E.g. a client might be able to sneak in a fence between when the
> compositor checks fences and when it submits its drawing to the kernel.

This is why implicit sync should be handled with explicit IPC. You pick
the fence up once, and then you need to tell your GL stack to _not_ do
implicit sync. Would need a new extension. vk afaiui does this
automatically already.
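
A sketch of what "pick the fence up once" could look like for a compositor,
assuming the dma-buf export-sync-file ioctl proposed around this time (not
merged when this mail was written; struct and flag names below follow the
proposal and may differ from what eventually landed):

#include <linux/dma-buf.h>
#include <sys/ioctl.h>

/* Snapshot the buffer's current write fences as a sync_file fd, or return -1
 * on error. After this the compositor waits on that fd explicitly and tells
 * its GL/VK stack not to do implicit sync for its own access. */
static int pick_up_implicit_fence(int dmabuf_fd)
{
    struct dma_buf_export_sync_file arg = {
        .flags = DMA_BUF_SYNC_READ, /* we only want to read the buffer */
        .fd = -1,
    };

    if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg) < 0)
        return -1;
    return arg.fd;
}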
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-02 Thread Daniel Vetter
On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:
> 
> 
> On 02.06.21 at 20:48, Daniel Vetter wrote:
> > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:
> > > 
> > > > Yes, we can't break anything because we don't want to complicate things
> > > > for us. It's pretty much all NAK'd already. We are trying to gather more
> > > > knowledge and then make better decisions.
> > > > 
> > > > The idea we are considering is that we'll expose memory-based sync 
> > > > objects
> > > > to userspace for read only, and the kernel or hw will strictly control 
> > > > the
> > > > memory writes to those sync objects. The hole in that idea is that
> > > > userspace can decide not to signal a job, so even if userspace can't
> > > > overwrite memory-based sync object states arbitrarily, it can still 
> > > > decide
> > > > not to signal them, and then a future fence is born.
> > > > 
> > > This would actually be treated as a GPU hang caused by that context, so it
> > > should be fine.
> > This is practically what I proposed already, except you're not doing it with
> > dma_fence. And on the memory fence side this also doesn't actually give
> > what you want for that compute model.
> > 
> > This seems like a bit of a worst-of-both-worlds approach to me? Tons of work
> > in the kernel to hide these not-dma_fence-but-almost, and still pain to
> > actually drive the hardware like it should be for compute or direct
> > display.
> > 
> > Also maybe I've missed it, but I didn't see any replies to my suggestion
> > how to fake the entire dma_fence stuff on top of new hw. Would be
> > interesting to know what doesn't work there instead of amd folks going off
> > into internal again and then coming back with another rfc from out of
> > nowhere :-)
> 
> Well to be honest I would just push back on our hardware/firmware guys that
> we need to keep kernel queues forever before going down that route.

I looked again, and you said the model won't work because preemption is way
too slow, even when the context is idle.

I guess at that point I got maybe too fed up and just figured "not my
problem", but if preempt is too slow as the unload fence, you can do it
with pte removal and tlb shootdown too (that is hopefully not too slow,
otherwise your hw is just garbage and won't even be fast for direct submit
compute workloads).

The only thing that you need to do when you use pte clearing + tlb
shootdown instead of preemption as the unload fence for buffers that get
moved is that if you get any gpu page fault, you don't serve that, but
instead treat it as a tdr and shoot the context permanently.

So summarizing the model I proposed:

- you allow userspace to directly write into the ringbuffer, and also
  write the fences directly

- actual submit is done by the kernel, using drm/scheduler. The kernel
  blindly trusts userspace to set up everything else, and even just wraps
  dma_fences around the userspace memory fences.

- the only check is tdr. If a fence doesn't complete, a tdr fires: a) the
  kernel shoots the entire context and b) userspace recovers by setting up a
  new ringbuffer

- memory management is done using ttm only, you still need to supply the
  buffer list (ofc that list includes the always present ones, so CS will
  only get the list of special buffers like today). If your hw can't turn off
  gpu page faults and you ever get one, we pull up the same old solution:
  the kernel shoots the entire context.

  The important thing is that from the gpu pov memory management works
  exactly like compute workload with direct submit, except that you just
  terminate the context on _any_ page fault, instead of only those that go
  somewhere where there's really no mapping and repair the others.

  Also I guess from reading the old thread this means you'd disable page
  fault retry because that is apparently also way too slow for anything.

- memory management uses an unload fence. That unload fences waits for all
  userspace memory fences (represented as dma_fence) to complete, with
  maybe some fudge to busy-spin until we've reached the actual end of the
  ringbuffer (maybe you have an IB tail there after the memory fence write,
  we have that on intel hw), and it waits for the memory to get
  "unloaded". This is either preemption, or pte clearing + tlb shootdown,
  or whatever else your hw provides which is a) used for dynamic memory
  management b) fast enough for actual memory management.

- any time a context dies we force-complete all its pending fences,
  in-order ofc
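
A rough sketch of the tdr step of the model above: the kernel wraps the
userspace memory fences in dma_fences at submit time, and when one of them
misses the timeout the whole context is banned and every pending dma_fence is
force-completed in order. Structs and the helper are invented for
illustration (and locking is omitted); this is not real driver code:

#include <linux/dma-fence.h>
#include <linux/list.h>

struct user_ctx {
    struct list_head pending;   /* wrapped fences, in submission order */
    bool banned;
};

struct wrapped_fence {
    struct dma_fence base;      /* kernel fence wrapping a user memory fence */
    struct list_head link;
};

static void user_ctx_tdr(struct user_ctx *ctx)
{
    struct wrapped_fence *wf, *tmp;

    /* Ban the context; userspace recovers by setting up a new ringbuffer. */
    ctx->banned = true;

    /* Force-complete everything in order so other processes and memory
     * management never wait forever on this context's fences. */
    list_for_each_entry_safe(wf, tmp, &ctx->pending, link) {
        dma_fence_set_error(&wf->base, -EIO);
        dma_fence_signal(&wf->base);
        list_del(&wf->link);
    }
}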

So from hw pov this looks 99% like direct userspace submit, with the exact

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-02 Thread Daniel Vetter
On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:
> 
> > Yes, we can't break anything because we don't want to complicate things
> > for us. It's pretty much all NAK'd already. We are trying to gather more
> > knowledge and then make better decisions.
> >
> > The idea we are considering is that we'll expose memory-based sync objects
> > to userspace for read only, and the kernel or hw will strictly control the
> > memory writes to those sync objects. The hole in that idea is that
> > userspace can decide not to signal a job, so even if userspace can't
> > overwrite memory-based sync object states arbitrarily, it can still decide
> > not to signal them, and then a future fence is born.
> >
> 
> This would actually be treated as a GPU hang caused by that context, so it
> should be fine.

This is practically what I proposed already, except your not doing it with
dma_fence. And on the memory fence side this also doesn't actually give
what you want for that compute model.

This seems like a bit a worst of both worlds approach to me? Tons of work
in the kernel to hide these not-dma_fence-but-almost, and still pain to
actually drive the hardware like it should be for compute or direct
display.

Also maybe I've missed it, but I didn't see any replies to my suggestion
how to fake the entire dma_fence stuff on top of new hw. Would be
interesting to know what doesn't work there instead of amd folks going of
into internal again and then coming back with another rfc from out of
nowhere :-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Daniel Vetter
On Tue, Jun 1, 2021 at 2:10 PM Christian König
 wrote:
>
> On 01.06.21 at 12:49, Michel Dänzer wrote:
> > On 2021-06-01 12:21 p.m., Christian König wrote:
> >> On 01.06.21 at 11:02, Michel Dänzer wrote:
> >>> On 2021-05-27 11:51 p.m., Marek Olšák wrote:
> >>>> 3) Compositors (and other privileged processes, and display flipping) 
> >>>> can't trust imported/exported fences. They need a timeout recovery 
> >>>> mechanism from the beginning, and the following are some possible 
> >>>> solutions to timeouts:
> >>>>
> >>>> a) use a CPU wait with a small absolute timeout, and display the 
> >>>> previous content on timeout
> >>>> b) use a GPU wait with a small absolute timeout, and conditional 
> >>>> rendering will choose between the latest content (if signalled) and 
> >>>> previous content (if timed out)
> >>>>
> >>>> The result would be that the desktop can run close to 60 fps even if an 
> >>>> app runs at 1 fps.
> >>> FWIW, this is working with
> >>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with 
> >>> implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to 
> >>> provide the same dma-buf poll semantics as other drivers and high 
> >>> priority GFX contexts via EGL_IMG_context_priority which can preempt 
> >>> lower priority ones).
> >> Yeah, that is really nice to have.
> >>
> >> One question is if you wait on the CPU or the GPU for the new surface to 
> >> become available?
> > It's based on polling dma-buf fds, i.e. CPU.
> >
> >> The former is a bit bad for latency and power management.
> > There isn't a choice for Wayland compositors in general, since there can be 
> > arbitrary other state which needs to be applied atomically together with 
> > the new buffer. (Though in theory, a compositor might get fancy and 
> > special-case surface commits which can be handled by waiting on the GPU)
> >
> > Latency is largely a matter of scheduling in the compositor. The latency 
> > incurred by the compositor shouldn't have to be more than single-digit 
> > milliseconds. (I've seen total latency from when the client starts 
> > processing a (static) frame to when it starts being scanned out as low as 
> > ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, 
> > lower than typical with Xorg)
>
> Well let me describe it like this:
>
> We have use cases for a guaranteed 144 Hz refresh rate. That
> essentially means that the client application needs to be able to spit
> out one frame/window content every ~6.9ms. That's tough, but doable.
>
> When you now add 6ms latency in the compositor that means the client
> application has only .9ms left for its frame, which is basically
> impossible to do.
>
> See for the user fences handling the display engine will learn to read
> sequence numbers from memory and decide on its own if the old frame or
> the new one is scanned out. To get the latency there as low as possible.

This won't work with implicit sync at all.

If you want to enable this use-case with driver magic and without the
compositor being aware of what's going on, the solution is EGLStreams.
Not sure we want to go there, but it's definitely a lot more feasible
than trying to stuff eglstreams semantics into dma-buf implicit
fencing support in a desperate attempt to not change compositors.

I still think the most reasonable approach here is that we wrap a
dma_fence compat layer/mode over new hw for existing
userspace/compositors. And then enable userspace memory fences and the
fancy new features those allow with a new model that's built for them.
Also even with dma_fence we could implement your model of staying with
the previous buffer (or an older buffer that's already rendered),
but it needs explicit involvement of the compositor. At least without
adding an eglstreams fd to the kernel and wiring up all the relevant
extensions.
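
As a sketch, the "explicit involvement" variant with plain dma_fence/implicit
sync could be as simple as the compositor polling the client's dma-buf with a
small timeout and falling back to the previous buffer (illustrative
compositor-side pseudocode, not taken from any real compositor):

#include <poll.h>

/* Return the buffer to scan out this frame: the new one if its fences
 * signalled within budget_ms, otherwise the previously shown one. */
static int pick_scanout_buffer(int new_dmabuf_fd, int prev_dmabuf_fd,
                               int budget_ms)
{
    struct pollfd pfd = {
        .fd = new_dmabuf_fd,
        /* POLLIN on a dma-buf waits for the writers to finish, which is what
         * a reader like the compositor needs. */
        .events = POLLIN,
    };

    if (poll(&pfd, 1, budget_ms) == 1 && (pfd.revents & POLLIN))
        return new_dmabuf_fd;

    return prev_dmabuf_fd;  /* not ready in time, keep showing the old frame */
}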
-Daniel

> >> Another question is if that is sufficient as security for the display 
> >> server or if we need further handling down the road? I mean essentially we 
> >> are moving the reliability problem into the display server.
> > Good question. This should generally protect the display server from 
> > freezing due to client fences never signalling, but there might still be 
> > corner cases.
> >
> >
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] [Intel-gfx] [RFC PATCH 1/2] drm/doc/rfc: i915 GuC submission / DRM scheduler

2021-05-27 Thread Daniel Vetter
On Thu, May 27, 2021 at 11:06:38AM +0100, Tvrtko Ursulin wrote:
> 
> On 27/05/2021 00:33, Matthew Brost wrote:
> > Add entry for i915 GuC submission / DRM scheduler integration plan.
> > Follow up patch with details of new parallel submission uAPI to come.
> > 
> > v2:
> >   (Daniel Vetter)
> >- Expand explaination of why bonding isn't supported for GuC
> >  submission
> >- CC some of the DRM scheduler maintainers
> >- Add priority inheritance / boosting use case
> >- Add reasoning for removing in order assumptions
> >   (Daniel Stone)
> >- Add links to priority spec
> 
> Where will the outstanding items like, from the top of my head only, error
> capture and open source logging tool be tracked? I thought here but maybe
> not.

I thought the same that we'd put these really important bits into the
rfc/todo here. Matt, can you pls do that?
-Daniel

> 
> Regards,
> 
> Tvrtko
> 
> > Cc: Christian König 
> > Cc: Luben Tuikov 
> > Cc: Alex Deucher 
> > Cc: Steven Price 
> > Cc: Jon Bloomfield 
> > Cc: Jason Ekstrand 
> > Cc: Dave Airlie 
> > Cc: Daniel Vetter 
> > Cc: Jason Ekstrand 
> > Cc: dri-de...@lists.freedesktop.org
> > Signed-off-by: Matthew Brost 
> > ---
> >   Documentation/gpu/rfc/i915_scheduler.rst | 85 
> >   Documentation/gpu/rfc/index.rst  |  4 ++
> >   2 files changed, 89 insertions(+)
> >   create mode 100644 Documentation/gpu/rfc/i915_scheduler.rst
> > 
> > diff --git a/Documentation/gpu/rfc/i915_scheduler.rst 
> > b/Documentation/gpu/rfc/i915_scheduler.rst
> > new file mode 100644
> > index ..7faa46cde088
> > --- /dev/null
> > +++ b/Documentation/gpu/rfc/i915_scheduler.rst
> > @@ -0,0 +1,85 @@
> > +=
> > +I915 GuC Submission/DRM Scheduler Section
> > +=
> > +
> > +Upstream plan
> > +=
> > +For upstream the overall plan for landing GuC submission and integrating 
> > the
> > +i915 with the DRM scheduler is:
> > +
> > +* Merge basic GuC submission
> > +   * Basic submission support for all gen11+ platforms
> > +   * Not enabled by default on any current platforms but can be enabled via
> > + modparam enable_guc
> > +   * Lots of rework will need to be done to integrate with DRM scheduler so
> > + no need to nit pick everything in the code, it just should be
> > + functional, no major coding style / layering errors, and not regress
> > + execlists
> > +   * Update IGTs / selftests as needed to work with GuC submission
> > +   * Enable CI on supported platforms for a baseline
> > +   * Rework / get CI heathly for GuC submission in place as needed
> > +* Merge new parallel submission uAPI
> > +   * Bonding uAPI completely incompatible with GuC submission, plus it has
> > + severe design issues in general, which is why we want to retire it no
> > + matter what
> > +   * New uAPI adds I915_CONTEXT_ENGINES_EXT_PARALLEL context setup step
> > + which configures a slot with N contexts
> > +   * After I915_CONTEXT_ENGINES_EXT_PARALLEL a user can submit N batches to
> > + a slot in a single execbuf IOCTL and the batches run on the GPU in
> > + paralllel
> > +   * Initially only for GuC submission but execlists can be supported if
> > + needed
> > +* Convert the i915 to use the DRM scheduler
> > +   * GuC submission backend fully integrated with DRM scheduler
> > +   * All request queues removed from backend (e.g. all backpressure
> > + handled in DRM scheduler)
> > +   * Resets / cancels hook in DRM scheduler
> > +   * Watchdog hooks into DRM scheduler
> > +   * Lots of complexity of the GuC backend can be pulled out once
> > + integrated with DRM scheduler (e.g. state machine gets
> > + simplier, locking gets simplier, etc...)
> > +   * Execlist backend will do the minimum required to hook in the DRM
> > + scheduler so it can live next to the fully integrated GuC backend
> > +   * Legacy interface
> > +   * Features like timeslicing / preemption / virtual engines would
> > + be difficult to integrate with the DRM scheduler and these
> > + features are not required for GuC submission as the GuC does
> > + these things for us
> > +   * ROI low on fully integrating into DRM scheduler
> > +   * Fully integrating would add lots o

Re: [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

2021-05-26 Thread Daniel Vetter
On Wed, May 26, 2021 at 3:32 PM Christian König
 wrote:
>
> On 25.05.21 at 17:23, Daniel Vetter wrote:
> > On Tue, May 25, 2021 at 5:05 PM Christian König
> >  wrote:
> >> Hi Daniel,
> >>
> >> On 25.05.21 at 15:05, Daniel Vetter wrote:
> >>> Hi Christian,
> >>>
> >>> On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:
> >>>> On 21.05.21 at 20:31, Daniel Vetter wrote:
> >>>> This works by adding the fence of the last eviction DMA operation to BOs
> >>>> when their backing store is newly allocated. That's what the
> >>>> ttm_bo_add_move_fence() function you stumbled over is good for: 
> >>>> https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo.c#L692
> >>>>
> >>>> Now the problem is it is possible that the application is terminated 
> >>>> before
> >>>> it can complete its command submission. But since resource management 
> >>>> only
> >>>> waits for the shared fences when there are some there is a chance that we
> >>>> free up memory while it is still in use.
> >>> Hm where is this code? Would help in my audit that I wanted to do this
> >>> week? If you look at most other places like
> >>> drm_gem_fence_array_add_implicit() I mentioned earlier, then we don't
> >>> treat the shared fences special and always also include the exclusive one.
> >> See amdgpu_gem_object_close():
> >>
> >> ...
> >>   fence = dma_resv_get_excl(bo->tbo.base.resv);
> >>   if (fence) {
> >>   amdgpu_bo_fence(bo, fence, true);
> >>   fence = NULL;
> >>   }
> >> ...
> >>
> >> We explicitly added that because resource management of some other
> >> driver was going totally bananas without that.
> >>
> >> But I'm not sure which one that was. Maybe dig a bit in the git and
> >> mailing history of that.
> > Hm I looked and it's
> >
> > commit 82c416b13cb7d22b96ec0888b296a48dff8a09eb
> > Author: Christian König 
> > Date:   Thu Mar 12 12:03:34 2020 +0100
> >
> > drm/amdgpu: fix and cleanup amdgpu_gem_object_close v4
> >
> > That sounded more like amdgpu itself needing this, not another driver?
>
> No, that patch was just a follow up moving the functionality around.

That patch added the "add exclusive fence to shared slots before
amdgpu_vm_clear_freed() call", which I thought was at least part of
your fix.

> > But looking at amdgpu_vm_bo_update_mapping() it seems to pick the
> > right fencing mode for gpu pte clearing, so I'm really not sure what
> > the bug was that you worked around here? The implementation boils down
> > to amdgpu_sync_resv() which syncs for the exclusive fence, always. And
> > there's nothing else that I could find in public history at least, no
> > references to bug reports or anything. I think you need to dig
> > internally, because as-is I'm not seeing the problem here.
> >
> > Or am I missing something here?
>
> See the code here for example:
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/nouveau/nouveau_fence.c#L361
>
> Nouveau assumes that when a shared fence is present it doesn't need to
> wait for the exclusive one because the shared ones are always supposed to
> finish after the exclusive one.
>
> But for page table unmap fences that isn't true and we ran into a really
> nasty and hard to reproduce bug because of this.
>
> I think it would be much more defensive if we could say that we always
> wait for the exclusive fence and fix the use case in nouveau and double
> check if somebody else does stuff like that as well.

Yeah most other drivers do the defensive thing here. I think it would
be good to standardize on that. I'll add that to my list and do more
auditing.
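
For reference, a sketch of the "defensive" dependency collection using the
drm_gem_fence_array_add_implicit() helper mentioned earlier in this thread
(the job struct is invented for illustration; assumes a kernel of this era
where that helper still exists):

#include <drm/drm_gem.h>
#include <linux/xarray.h>

struct my_job {
    struct xarray deps;     /* dma_fences the scheduler must wait on */
};

/* For reads this adds the exclusive (writer) fence; for writes it adds the
 * exclusive fence plus all shared (reader) fences. Either way the exclusive
 * fence is never skipped. */
static int my_job_add_implicit_deps(struct my_job *job,
                                    struct drm_gem_object *obj, bool write)
{
    return drm_gem_fence_array_add_implicit(&job->deps, obj, write);
}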
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

2021-05-25 Thread Daniel Vetter
On Tue, May 25, 2021 at 5:05 PM Christian König
 wrote:
>
> Hi Daniel,
>
> On 25.05.21 at 15:05, Daniel Vetter wrote:
> > Hi Christian,
> >
> > On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:
> >> On 21.05.21 at 20:31, Daniel Vetter wrote:
> >> This works by adding the fence of the last eviction DMA operation to BOs
> >> when their backing store is newly allocated. That's what the
> >> ttm_bo_add_move_fence() function you stumbled over is good for: 
> >> https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo.c#L692
> >>
> >> Now the problem is it is possible that the application is terminated before
> it can complete its command submission. But since resource management only
> >> waits for the shared fences when there are some there is a chance that we
> >> free up memory while it is still in use.
> > Hm where is this code? Would help in my audit that I wanted to do this
> > week? If you look at most other places like
> > drm_gem_fence_array_add_implicit() I mentioned earlier, then we don't
> > treat the shared fences special and always also include the exclusive one.
>
> See amdgpu_gem_object_close():
>
> ...
>  fence = dma_resv_get_excl(bo->tbo.base.resv);
>  if (fence) {
>  amdgpu_bo_fence(bo, fence, true);
>  fence = NULL;
>  }
> ...
>
> We explicitly added that because resource management of some other
> driver was going totally bananas without that.
>
> But I'm not sure which one that was. Maybe dig a bit in the git and
> mailing history of that.

Hm I looked and it's

commit 82c416b13cb7d22b96ec0888b296a48dff8a09eb
Author: Christian König 
Date:   Thu Mar 12 12:03:34 2020 +0100

   drm/amdgpu: fix and cleanup amdgpu_gem_object_close v4

That sounded more like amdgpu itself needing this, not another driver?

But looking at amdgpu_vm_bo_update_mapping() it seems to pick the
right fencing mode for gpu pte clearing, so I'm really not sure what
the bug was that you worked around here? The implementation boils down
to amdgpu_sync_resv() which syncs for the exclusive fence, always. And
there's nothing else that I could find in public history at least, no
references to bug reports or anything. I think you need to dig
internally, because as-is I'm not seeing the problem here.

Or am I missing something here?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

2021-05-25 Thread Daniel Vetter
Hi Christian,

On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:
> On 21.05.21 at 20:31, Daniel Vetter wrote:
> > [SNIP]
> > > We could provide an IOCTL for the BO to change the flag.
> > That's not the semantics we need.
> > 
> > > But could we first figure out the semantics we want to use here?
> > > 
> > > Cause I'm pretty sure we don't actually need those changes at all and as
> > > said before I'm certainly NAKing things which break existing use cases.
> > Please read how other drivers do this and at least _try_ to understand
> > it. I'm really losing my patience here with you NAKing patches you're
> > not even understanding (or did you actually read and fully understand
> > the entire story I typed up here, and your NAK is on the entire
> > thing?). There's not much useful conversation to be had with that
> > approach. And with drivers I mean kernel + userspace here.
> 
> Well to be honest I did fully read that, but I was just too emotionally
> attached to answer more appropriately in that moment.
> 
> And I'm sorry that I react emotional on that, but it is really frustrating
> that I'm not able to convince you that we have a major problem which affects
> all drivers and not just amdgpu.
> 
> Regarding the reason why I'm NAKing this particular patch, you are breaking
> existing uAPI for RADV with that. And as a maintainer of the driver I have
> simply no other choice than saying halt, stop we can't do it like this.
> 
> I'm perfectly aware that I've some holes in the understanding of how ANV or
> other Vulkan/OpenGL stacks work. But you should probably also admit that you
> have some holes in how amdgpu works, or otherwise I can't imagine why you
> suggest a patch which simply breaks RADV.
> 
> I mean we are working together for years now and I think you know me pretty
> well, do you really think I scream bloody hell we can't do this without a
> good reason?
> 
> So let's stop throwing half-baked solutions at each other and discuss what
> we can do to solve the different problems we are both seeing here.

Well this was meant to be a goal post/semantics discussion starter. Yes
the patch breaks performance (but not correctness) for amdgpu, but it also
contains my suggestion for how to fix that issue (in text form at least).

Plus a plan how to roll it out so that anyone who cares doesn't hit the
perf issues this patch can cause.

Also the overall series is really meant as a subsystem wide assessment of
the status quo. Aside from a correctness issue Lucas spotted in my panfrost
patches, there's been no substantial input from others on this yet unfortunately. I need
to poke more people I think.

Anyway since the plan as a text didn't stick I'm typing up now something
more detailed in form of amdgpu patches. Maybe Bas can do the radv
conversion too.

It won't be complete by far either (I'm not working for amd after all
...), I'll leave out the opengl/media side entirely. But if this works for
radv is should be a useful blueprint for gl/media too (with some changes
in the interfaces, but not really the exposed semantics).

> > That's the other frustration part: You're trying to fix this purely in
> > the kernel. This is exactly one of these issues why we require open
> > source userspace, so that we can fix the issues correctly across the
> > entire stack. And meanwhile you're steadfastly refusing to even look
> > at the userspace side of the picture.
> 
> Well I do fully understand the userspace side of the picture for the AMD
> stack. I just don't think we should give userspace that much control over
> the fences in the dma_resv object without untangling them from resource
> management.
> 
> And RADV is exercising exclusive sync for amdgpu already. You can do
> submission to both the GFX, Compute and SDMA queues in Vulkan and those
> currently won't over-synchronize.
> 
> When you then send a texture generated by multiple engines to the Compositor
> the kernel will correctly insert waits for all submissions of the other
> process.
> 
> So this already works for RADV and completely without the IOCTL Jason
> proposed. IIRC we also have unit tests which exercised that feature for the
> video decoding use case long before RADV even existed.

Yeah, multiple engines on the producer side work fine with the current
scheme you have (if we ignore for now that the way amdgpu uses dma_resv is
different from other drivers by not setting the exclusive slots for
producers).

Where it breaks down is you have overlapping reads once a frame is
generated, on either side. E.g.

- compositors read the buffer as consumer
- but also the producer reads the buffer again (maybe a reference frame for
  media, or maybe for some post-processing like motion blur or whatever).

Then

Re: [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

2021-05-21 Thread Daniel Vetter
On Fri, May 21, 2021 at 8:08 PM Christian König
 wrote:
>
> On 21.05.21 at 17:16, Daniel Vetter wrote:
> > On Fri, May 21, 2021 at 05:00:46PM +0200, Bas Nieuwenhuizen wrote:
> >> On Fri, May 21, 2021 at 4:37 PM Daniel Vetter  wrote:
> >>> On Fri, May 21, 2021 at 11:46:23AM +0200, Bas Nieuwenhuizen wrote:
> >>>> On Fri, May 21, 2021 at 11:10 AM Daniel Vetter  
> >>>> wrote:
> >>>>> ---
> >>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++--
> >>>>>   1 file changed, 2 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
> >>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> >>>>> index 88a24a0b5691..cc8426e1e8a8 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> >>>>> @@ -617,8 +617,8 @@ static int amdgpu_cs_parser_bos(struct 
> >>>>> amdgpu_cs_parser *p,
> >>>>>  amdgpu_bo_list_for_each_entry(e, p->bo_list) {
> >>>>>  struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
> >>>>>
> >>>>> -   /* Make sure we use the exclusive slot for shared BOs */
> >>>>> -   if (bo->prime_shared_count)
> >>>>> +   /* Make sure we use the exclusive slot for all 
> >>>>> potentially shared BOs */
> >>>>> +   if (!(bo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID))
> >>>>>  e->tv.num_shared = 0;
> >>>> I think it also makes sense to skip this with
> >>>> AMDGPU_GEM_CREATE_EXPLICIT_SYNC? It can be shared but I don't think
> >>>> anyone expects implicit sync to happen with those.
> >>> Ah yes, I missed this entirely. So the "no implicit flag" is already
> >>> there, and the _only_ thing that's missing really is a way to fish out the
> >>> implicit fences, and set them.
> >>>
> >>> https://lore.kernel.org/dri-devel/20210520190007.534046-1-ja...@jlekstrand.net/
> >>>
> >>> So I think all that's really needed in radv is not setting
> >>> RADEON_FLAG_IMPLICIT_SYNC for winsys buffers when Jason's dma-buf ioctl
> >>> are present (means you need to do some import/export and keep the fd
> >>> around for winsys buffers, but shouldn't be too bad), and then control the
> >>> implicit fences entirely explicitly like vk expects.
> >> That is the part I'm less sure about. This is a BO wide flag so we are
> >> also disabling implicit sync in the compositor. If the compositor does
> >> only do read stuff that is ok, as the inserted exclusive fence will
> >> work for that. But as I learned recently the app provided buffer may
> >> end up being written to by the X server which open a whole can of
> >> potential problems if implicit sync gets disabled between Xserver
> >> operations on the app provided buffer. Hence setting that on the WSI
> >> buffer is a whole new can of potential problems and hence I've said a
> >> submission based flag would be preferred.
> >>
> >> I can certainly try it out though.
> > Hm yeah that's the wrong flag. We need a flag on the drm_file which the
> > explicit userspace sets. And which is valid only for itself.
> >
> > There's a nice flags field when creating a ctx, but it's not validated and
> > there's already a comment that we have to filter out garbage priority, so
> > that's no use. I'll whip up something entirely untested just as a draft.
>
> We could provide an IOCTL for the BO to change the flag.

That's not the semantics we need.

> But could we first figure out the semantics we want to use here?
>
> Cause I'm pretty sure we don't actually need those changes at all and as
> said before I'm certainly NAKing things which break existing use cases.

Please read how other drivers do this and at least _try_ to understand
it. I'm really losing my patience here with you NAKing patches you're
not even understanding (or did you actually read and fully understand
the entire story I typed up here, and your NAK is on the entire
thing?). There's not much useful conversation to be had with that
approach. And with drivers I mean kernel + userspace here.

That's the other frustration part: You're trying to fix this purely in
the kernel. This is exactly one of these issues why we require open
source userspace, so that we can fix the issues correctly

Re: [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

2021-05-21 Thread Daniel Vetter
On Fri, May 21, 2021 at 05:00:46PM +0200, Bas Nieuwenhuizen wrote:
> On Fri, May 21, 2021 at 4:37 PM Daniel Vetter  wrote:
> >
> > On Fri, May 21, 2021 at 11:46:23AM +0200, Bas Nieuwenhuizen wrote:
> > > On Fri, May 21, 2021 at 11:10 AM Daniel Vetter  
> > > wrote:
> > > > ---
> > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > > > index 88a24a0b5691..cc8426e1e8a8 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > > > @@ -617,8 +617,8 @@ static int amdgpu_cs_parser_bos(struct 
> > > > amdgpu_cs_parser *p,
> > > > amdgpu_bo_list_for_each_entry(e, p->bo_list) {
> > > > struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
> > > >
> > > > -   /* Make sure we use the exclusive slot for shared BOs */
> > > > -   if (bo->prime_shared_count)
> > > > +   /* Make sure we use the exclusive slot for all 
> > > > potentially shared BOs */
> > > > +   if (!(bo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID))
> > > > e->tv.num_shared = 0;
> > >
> > > I think it also makes sense to skip this with
> > > AMDGPU_GEM_CREATE_EXPLICIT_SYNC? It can be shared but I don't think
> > > anyone expects implicit sync to happen with those.
> >
> > Ah yes, I missed this entirely. So the "no implicit flag" is already
> > there, and the _only_ thing that's missing really is a way to fish out the
> > implicit fences, and set them.
> >
> > https://lore.kernel.org/dri-devel/20210520190007.534046-1-ja...@jlekstrand.net/
> >
> > So I think all that's really needed in radv is not setting
> > RADEON_FLAG_IMPLICIT_SYNC for winsys buffers when Jason's dma-buf ioctl
> > are present (means you need to do some import/export and keep the fd
> > around for winsys buffers, but shouldn't be too bad), and then control the
> > implicit fences entirely explicitly like vk expects.
> 
> That is the part I'm less sure about. This is a BO wide flag so we are
> also disabling implicit sync in the compositor. If the compositor does
> only do read stuff that is ok, as the inserted exclusive fence will
> work for that. But as I learned recently the app provided buffer may
> end up being written to by the X server which opens a whole can of
> potential problems if implicit sync gets disabled between Xserver
> operations on the app provided buffer. Hence setting that on the WSI
> buffer is a whole new can of potential problems and hence I've said a
> submission based flag would be preferred.
> 
> I can certainly try it out though.

Hm yeah that's the wrong flag. We need a flag on the drm_file which the
explicit userspace sets. And which is valid only for itself.

There's a nice flags field when creating a ctx, but it's not validated and
there's already a comment that we have to filter out garbage priority, so
that's no use. I'll whip up something entirely untested just as a draft.
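
On the userspace side, "control the implicit fences entirely explicitly"
boils down to rendering with implicit sync disabled and then stuffing the
submission's sync_file into the winsys buffer right before handing it over.
A sketch assuming Jason's proposed import-sync-file ioctl from the lore link
above (not merged at the time; names follow the proposal):

#include <linux/dma-buf.h>
#include <sys/ioctl.h>

/* Attach the fence behind sync_file_fd to the dma-buf as a write fence, so a
 * compositor relying on implicit sync will wait for it before reading. */
static int set_implicit_write_fence(int dmabuf_fd, int sync_file_fd)
{
    struct dma_buf_import_sync_file arg = {
        .flags = DMA_BUF_SYNC_WRITE,
        .fd = sync_file_fd,
    };

    return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg);
}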
-Daniel



> 
> >
> > Are you bored enough to type this up for radv? I'll give Jason's kernel
> > stuff another review meanwhile.
> > -Daniel
> >
> > > > e->bo_va = amdgpu_vm_bo_find(vm, bo);
> > > > }
> > > > --
> > > > 2.31.0
> > > >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

2021-05-21 Thread Daniel Vetter
On Fri, May 21, 2021 at 07:58:57AM -0700, Rob Clark wrote:
> On Fri, May 21, 2021 at 2:10 AM Daniel Vetter  wrote:
> >
> > - msm is mildly entertaining. It also supports MSM_SUBMIT_NO_IMPLICIT,
> >   but because it doesn't use the drm/scheduler it handles fences from
> >   the wrong context with a synchronous dma_fence_wait. See
> >   submit_fence_sync() leading to msm_gem_sync_object(). Investing into
> >   a scheduler might be a good idea.
> 
> Yeah, drm/scheduler is (along with a lot of other things) on the TODO
> list.  But this isn't quite as bad as it sounds because userspace uses
> a u_queue thread to call the submit ioctl rather than blocking the
> driver.  (It also offloads some other work from the driver thread,
> like submit merging to reduce # of ioctls.  Coincidentally that
> arrangement was a step towards preparing userspace for some
> hypothetical non-ioctl uapi ;-))

You're also holding a pile of locks, which I expect will become a pain at
the latest with multi-engine buffer sharing. If you push this to the
scheduler then the locks aren't held. Or maybe I've misread the flow, it's
all become a bit of a blur after all these drivers :-)

> OTOH it would be good to move blocking until the system can free
> enough pages to repin bo's out of the ioctl path to better handle some
> memory pressure corner cases without having to be interruptable over a
> lot more of the submit path..  Running chrome+android on devices
> without a lot of memory is fun..

Uh that one needs the userspace thread. Or entirely different semantics of
your ioctl, because you're not allowed to allocate memory once any
dma_fence is visible. So offloading the entire pinning to a submit thread
is a no-go.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

2021-05-21 Thread Daniel Vetter
On Fri, May 21, 2021 at 11:46:23AM +0200, Bas Nieuwenhuizen wrote:
> On Fri, May 21, 2021 at 11:10 AM Daniel Vetter  wrote:
> >
> > Docs for struct dma_resv are fairly clear:
> >
> > "A reservation object can have attached one exclusive fence (normally
> > associated with write operations) or N shared fences (read
> > operations)."
> >
> > https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html#reservation-objects
> >
> > Furthermore a review across all of upstream.
> >
> > First of render drivers and how they set implicit fences:
> >
> > - nouveau follows this contract, see in validate_fini_no_ticket()
> >
> > nouveau_bo_fence(nvbo, fence, !!b->write_domains);
> >
> >   and that last boolean controls whether the exclusive or shared fence
> >   slot is used.
> >
> > - radeon follows this contract by setting
> >
> > p->relocs[i].tv.num_shared = !r->write_domain;
> >
> >   in radeon_cs_parser_relocs(), which ensures that the call to
> >   ttm_eu_fence_buffer_objects() in radeon_cs_parser_fini() will do the
> >   right thing.
> >
> > - vmwgfx seems to follow this contract with the shotgun approach of
> >   always setting ttm_val_buf->num_shared = 0, which means
> >   ttm_eu_fence_buffer_objects() will only use the exclusive slot.
> >
> > - etnaviv follows this contract, as can be trivially seen by looking
> >   at submit_attach_object_fences()
> >
> > - i915 is a bit a convoluted maze with multiple paths leading to
> >   i915_vma_move_to_active(). Which sets the exclusive flag if
> >   EXEC_OBJECT_WRITE is set. This can either come as a buffer flag for
> >   softpin mode, or through the write_domain when using relocations. It
> >   follows this contract.
> >
> > - lima follows this contract, see lima_gem_submit() which sets the
> >   exclusive fence when the LIMA_SUBMIT_BO_WRITE flag is set for that
> >   bo
> >
> > - msm follows this contract, see msm_gpu_submit() which sets the
> >   exclusive flag when the MSM_SUBMIT_BO_WRITE is set for that buffer
> >
> > - panfrost follows this contract with the shotgun approach of just
> >   always setting the exclusive fence, see
> >   panfrost_attach_object_fences(). Benefits of a single engine I guess
> >
> > - v3d follows this contract with the same shotgun approach in
> >   v3d_attach_fences_and_unlock_reservation(), but it has at least an
> >   XXX comment that maybe this should be improved
> >
> > - vc4 uses the same shotgun approach of always setting an exclusive
> >   fence, see vc4_update_bo_seqnos()
> >
> > - vgem also follows this contract, see vgem_fence_attach_ioctl() and
> >   the VGEM_FENCE_WRITE. This is used in some igts to validate prime
> >   sharing with i915.ko without the need of a 2nd gpu
> >
> > - virtio follows this contract again with the shotgun approach of
> >   always setting an exclusive fence, see virtio_gpu_array_add_fence()
> >
> > This covers the setting of the exclusive fences when writing.
> >
> > Synchronizing against the exclusive fence is a lot more tricky, and I
> > only spot checked a few:
> >
> > - i915 does it, with the optional EXEC_OBJECT_ASYNC to skip all
> >   implicit dependencies (which is used by vulkan)
> >
> > - etnaviv does this. Implicit dependencies are collected in
> >   submit_fence_sync(), again with an opt-out flag
> >   ETNA_SUBMIT_NO_IMPLICIT. These are then picked up in
> >   etnaviv_sched_dependency which is the
> >   drm_sched_backend_ops->dependency callback.
> >
> > - vc4 seems to not do much here, maybe gets away with it by not having
> >   a scheduler and only a single engine. Since all newer broadcom chips than
> >   the OG vc4 use v3d for rendering, which follows this contract, the
> >   impact of this issue is fairly small.
> >
> > - v3d does this using the drm_gem_fence_array_add_implicit() helper,
> >   which then its drm_sched_backend_ops->dependency callback
> >   v3d_job_dependency() picks up.
> >
> > - panfrost is nice here and tracks the implicit fences in
> >   panfrost_job->implicit_fences, which again the
> >   drm_sched_backend_ops->dependency callback panfrost_job_dependency()
> >   picks up. It is mildly questionable though since it only picks up
> >   exclusive fences in panfrost_acquire_object_fences(), but not buggy
> >   in practice because it also always sets the exclusive fence. It
> >   should pick up both sets o

[Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules

2021-05-21 Thread Daniel Vetter
epare_fb() helper will
  correctly. Overwhelmingly most drivers get this right, except a few
  totally don't. I'll follow up with a patch to make this the default
  and avoid a bunch of bugs.

- I didn't audit the ttm drivers, but given that dma_resv started
  there I hope they get this right.

In conclusion this IS the contract, both as documented and
overwhelmingly implemented, specifically as implemented by all render
drivers except amdgpu.
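
Spelled out as code, the contract is roughly this (a sketch with the dma_resv
API of this era, since reworked into dma_resv_add_fence() with a usage
argument; the caller holds the reservation lock and the wrapper is invented
for illustration):

#include <linux/dma-resv.h>

static int attach_job_fence(struct dma_resv *resv, struct dma_fence *fence,
                            bool write)
{
    int ret;

    if (write) {
        /* Writers go into the exclusive slot; readers must wait on it. */
        dma_resv_add_excl_fence(resv, fence);
        return 0;
    }

    /* Readers go into the shared slots; make sure a slot is reserved. */
    ret = dma_resv_reserve_shared(resv, 1);
    if (ret)
        return ret;
    dma_resv_add_shared_fence(resv, fence);
    return 0;
}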

Amdgpu tried to fix this already in

commit 049aca4363d8af87cab8d53de5401602db3b
Author: Christian König 
Date:   Wed Sep 19 16:54:35 2018 +0200

drm/amdgpu: fix using shared fence for exported BOs v2

but this fix falls short on a number of areas:

- It's racy, by the time the buffer is shared it might be too late. To
  make sure there's definitely never a problem we need to set the
  fences correctly for any buffer that's potentially exportable.

- It's breaking uapi, dma-buf fds support poll() and differentiate
  between, which was introduced in

commit 9b495a5887994a6d74d5c261d012083a92b94738
Author: Maarten Lankhorst 
Date:   Tue Jul 1 12:57:43 2014 +0200

dma-buf: add poll support, v3

- Christian König wants to nack new uapi building further on this
  dma_resv contract because it breaks amdgpu, quoting

  "Yeah, and that is exactly the reason why I will NAK this uAPI change.

  "This doesn't works for amdgpu at all for the reasons outlined above."

  
https://lore.kernel.org/dri-devel/f2eb6751-2f82-9b23-f57e-548de5b72...@gmail.com/

  Rejecting new development because your own driver is broken and
  violates established cross driver contracts and uapi is really not
  how upstream works.

Now this patch will have a severe performance impact on anything that
runs on multiple engines. So we can't just merge it outright, but need
a bit a plan:

- amdgpu needs a proper uapi for handling implicit fencing. The funny
  thing is that to do it correctly, implicit fencing must be treated
  as a very strange IPC mechanism for transporting fences, where both
  setting the fence and dependency intercepts must be handled
  explicitly. Current best practice is a per-bo flag to indicate
  writes, and a per-bo flag to skip implicit fencing in the CS
  ioctl as a new chunk.

- Since amdgpu has been shipping with broken behaviour we need an
  opt-out flag from the butchered implicit fencing model to enable the
  proper explicit implicit fencing model.

- for kernel memory fences due to bo moves at least the i915 idea is
  to use ttm_bo->moving. amdgpu probably needs the same.

- since the current p2p dma-buf interface assumes the kernel memory
  fence is in the exclusive dma_resv fence slot we need to add a new
  fence slot for kernel fences, which must never be ignored. Since
  currently only amdgpu supports this there's no real problem here
  yet, until amdgpu gains a NO_IMPLICIT CS flag.

- New userspace needs to ship in enough desktop distros so that users
  won't notice the perf impact. I think we can ignore LTS distros who
  upgrade their kernels but not their mesa3d snapshot.

- Then when this is all in place we can merge this patch here.

What is not a solution to this problem here is trying to make the
dma_resv rules in the kernel more clever. The fundamental issue here
is that the amdgpu CS uapi is the least expressive one across all
drivers (only equalled by panfrost, which has an actual excuse) by not
allowing any userspace control over how implicit sync is conducted.

Until this is fixed it's completely pointless to make the kernel more
clever to improve amdgpu, because all we're doing is papering over
this uapi design issue. amdgpu needs to attain the status quo
established by other drivers first, once that's achieved we can tackle
the remaining issues in a consistent way across drivers.

Cc: mesa-dev@lists.freedesktop.org
Cc: Bas Nieuwenhuizen 
Cc: Dave Airlie 
Cc: Rob Clark 
Cc: Kristian H. Kristensen 
Cc: Michel Dänzer 
Cc: Daniel Stone 
Cc: Sumit Semwal 
Cc: "Christian König" 
Cc: Alex Deucher 
Cc: Daniel Vetter 
Cc: Deepak R Varma 
Cc: Chen Li 
Cc: Kevin Wang 
Cc: Dennis Li 
Cc: Luben Tuikov 
Cc: linaro-mm-...@lists.linaro.org
Signed-off-by: Daniel Vetter 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 88a24a0b5691..cc8426e1e8a8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -617,8 +617,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
amdgpu_bo_list_for_each_entry(e, p->bo_list) {
struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
 
-   /* Make sure we use the exclusive slot for shared BOs */
-   if (bo->prime_shared_count)
+   /* Make sure we use the exclusive slot for all potentially 
shared BOs

Re: [Mesa-dev] [Intel-gfx] [RFC 2/2] drm/doc/rfc: i915 new parallel submission uAPI plan

2021-05-20 Thread Daniel Vetter
On Thu, May 20, 2021 at 08:10:59AM -0700, Matthew Brost wrote:
> On Thu, May 20, 2021 at 11:54:25AM +0200, Daniel Vetter wrote:
> > On Wed, May 19, 2021 at 7:19 PM Matthew Brost  
> > wrote:
> > >
> > > On Wed, May 19, 2021 at 01:10:04PM +0200, Daniel Vetter wrote:
> > > > On Tue, May 18, 2021 at 04:58:30PM -0700, Matthew Brost wrote:
> > > > > Add entry fpr i915 new parallel submission uAPI plan.
> > > > >
> > > > > v2:
> > > > >  (Daniel Vetter):
> > > > >   - Expand logical order explaination
> > > > >   - Add dummy header
> > > > >   - Only allow N BBs in execbuf IOCTL
> > > > >   - Configure parallel submission per slot not per gem context
> > > > >
> > > > > Cc: Tvrtko Ursulin 
> > > > > Cc: Tony Ye 
> > > > > CC: Carl Zhang 
> > > > > Cc: Daniel Vetter 
> > > > > Cc: Jason Ekstrand 
> > > > > Signed-off-by: Matthew Brost 
> > > > > ---
> > > > >  Documentation/gpu/rfc/i915_parallel_execbuf.h | 144 
> > > > > ++
> > > > >  Documentation/gpu/rfc/i915_scheduler.rst  |  53 ++-
> > > > >  2 files changed, 196 insertions(+), 1 deletion(-)
> > > > >  create mode 100644 Documentation/gpu/rfc/i915_parallel_execbuf.h
> > > > >
> > > > > diff --git a/Documentation/gpu/rfc/i915_parallel_execbuf.h 
> > > > > b/Documentation/gpu/rfc/i915_parallel_execbuf.h
> > > > > new file mode 100644
> > > > > index ..8c64b983ccad
> > > > > --- /dev/null
> > > > > +++ b/Documentation/gpu/rfc/i915_parallel_execbuf.h
> > > > > @@ -0,0 +1,144 @@
> > > > > +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see 
> > > > > i915_context_engines_parallel_submit */
> > > > > +
> > > > > +/*
> > > > > + * i915_context_engines_parallel_submit:
> > > > > + *
> > > > > + * Setup a slot to allow multiple BBs to be submitted in a single 
> > > > > execbuf IOCTL.
> > > > > + * Those BBs will then be scheduled to run on the GPU in parallel. 
> > > > > Multiple
> > > > > + * hardware contexts are created internally in the i915 run these 
> > > > > BBs. Once a
> > > > > + * slot is configured for N BBs only N BBs can be submitted in each 
> > > > > execbuf
> > > > > + * IOCTL and this is implict behavior (e.g. the user doesn't tell 
> > > > > the execbuf
> > > > > + * IOCTL there are N BBs, the execbuf IOCTL know how many BBs there 
> > > > > are based on
> > > > > + * the slots configuration).
> > > > > + *
> > > > > + * Their are two currently defined ways to control the placement of 
> > > > > the
> > > > > + * hardware contexts on physical engines: default behavior (no 
> > > > > flags) and
> > > > > + * I915_PARALLEL_IMPLICT_BONDS (a flag). More flags may be added the 
> > > > > in the
> > > > > + * future as new hardware / use cases arise. Details of how to use 
> > > > > this
> > > > > + * interface below above the flags.
> > > > > + *
> > > > > + * Returns -EINVAL if hardware context placement configuration 
> > > > > invalid or if the
> > > > > + * placement configuration isn't supported on the platform / 
> > > > > submission
> > > > > + * interface.
> > > > > + * Returns -ENODEV if extension isn't supported on the platform / 
> > > > > submission
> > > > > + * inteface.
> > > > > + */
> > > > > +struct i915_context_engines_parallel_submit {
> > > > > +   struct i915_user_extension base;
> > > > > +
> > > > > +   __u16 engine_index; /* slot for parallel engine */
> > > > > +   __u16 width;/* number of contexts per parallel engine 
> > > > > */
> > > > > +   __u16 num_siblings; /* number of siblings per context */
> > > > > +   __u16 mbz16;
> > > >
> > > > Ok the big picture looks reasonable now, the flags still confuse me.
> > > >
> > >
> > > Yea, it is a bit confusing.
> > >
> > > > > +/*
> > > > > + * Default placement 

Re: [Mesa-dev] [Intel-gfx] [RFC 2/2] drm/doc/rfc: i915 new parallel submission uAPI plan

2021-05-20 Thread Daniel Vetter
On Thu, May 20, 2021 at 11:57:44AM +0100, Tvrtko Ursulin wrote:
> 
> On 20/05/2021 10:54, Daniel Vetter wrote:
> > On Wed, May 19, 2021 at 7:19 PM Matthew Brost  
> > wrote:
> > > 
> > > On Wed, May 19, 2021 at 01:10:04PM +0200, Daniel Vetter wrote:
> > > > On Tue, May 18, 2021 at 04:58:30PM -0700, Matthew Brost wrote:
> > > > > Add entry fpr i915 new parallel submission uAPI plan.
> > > > > 
> > > > > v2:
> > > > >   (Daniel Vetter):
> > > > >- Expand logical order explanation
> > > > >- Add dummy header
> > > > >- Only allow N BBs in execbuf IOCTL
> > > > >    - Configure parallel submission per slot not per gem context
> > > > > 
> > > > > Cc: Tvrtko Ursulin 
> > > > > Cc: Tony Ye 
> > > > > CC: Carl Zhang 
> > > > > Cc: Daniel Vetter 
> > > > > Cc: Jason Ekstrand 
> > > > > Signed-off-by: Matthew Brost 
> > > > > ---
> > > > >   Documentation/gpu/rfc/i915_parallel_execbuf.h | 144 
> > > > > ++
> > > > >   Documentation/gpu/rfc/i915_scheduler.rst  |  53 ++-
> > > > >   2 files changed, 196 insertions(+), 1 deletion(-)
> > > > >   create mode 100644 Documentation/gpu/rfc/i915_parallel_execbuf.h
> > > > > 
> > > > > diff --git a/Documentation/gpu/rfc/i915_parallel_execbuf.h 
> > > > > b/Documentation/gpu/rfc/i915_parallel_execbuf.h
> > > > > new file mode 100644
> > > > > index ..8c64b983ccad
> > > > > --- /dev/null
> > > > > +++ b/Documentation/gpu/rfc/i915_parallel_execbuf.h
> > > > > @@ -0,0 +1,144 @@
> > > > > +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see 
> > > > > i915_context_engines_parallel_submit */
> > > > > +
> > > > > +/*
> > > > > + * i915_context_engines_parallel_submit:
> > > > > + *
> > > > > + * Set up a slot to allow multiple BBs to be submitted in a single 
> > > > > execbuf IOCTL.
> > > > > + * Those BBs will then be scheduled to run on the GPU in parallel. 
> > > > > Multiple
> > > > > + * hardware contexts are created internally in the i915 to run these 
> > > > > BBs. Once a
> > > > > + * slot is configured for N BBs, only N BBs can be submitted in each 
> > > > > execbuf
> > > > > + * IOCTL and this is implicit behavior (e.g. the user doesn't tell 
> > > > > the execbuf
> > > > > + * IOCTL there are N BBs, the execbuf IOCTL knows how many BBs there 
> > > > > are based on
> > > > > + * the slot's configuration).
> > > > > + *
> > > > > + * There are two currently defined ways to control the placement of 
> > > > > the
> > > > > + * hardware contexts on physical engines: default behavior (no 
> > > > > flags) and
> > > > > + * I915_PARALLEL_IMPLICT_BONDS (a flag). More flags may be added in 
> > > > > the
> > > > > + * future as new hardware / use cases arise. Details of how to use 
> > > > > this
> > > > > + * interface are described below, above the flags.
> > > > > + *
> > > > > + * Returns -EINVAL if the hardware context placement configuration is 
> > > > > invalid or if the
> > > > > + * placement configuration isn't supported on the platform / 
> > > > > submission
> > > > > + * interface.
> > > > > + * Returns -ENODEV if the extension isn't supported on the platform / 
> > > > > submission
> > > > > + * interface.
> > > > > + */
> > > > > +struct i915_context_engines_parallel_submit {
> > > > > +   struct i915_user_extension base;
> > > > > +
> > > > > +   __u16 engine_index; /* slot for parallel engine */
> > > > > +   __u16 width;/* number of contexts per parallel engine 
> > > > > */
> > > > > +   __u16 num_siblings; /* number of siblings per context */
> > > > > +   __u16 mbz16;
> > > > 
> > > > Ok the big picture looks reasonable now, the flags still confuse me.
> > > > 
> > > 
> > > Yea, it is a bit confusing.
> > > 
> > > > > +/*
> > > > > + * Default placeme

Re: [Mesa-dev] [Intel-gfx] [RFC 2/2] drm/doc/rfc: i915 new parallel submission uAPI plan

2021-05-20 Thread Daniel Vetter
On Wed, May 19, 2021 at 7:19 PM Matthew Brost  wrote:
>
> On Wed, May 19, 2021 at 01:10:04PM +0200, Daniel Vetter wrote:
> > On Tue, May 18, 2021 at 04:58:30PM -0700, Matthew Brost wrote:
> > > Add entry for i915 new parallel submission uAPI plan.
> > >
> > > v2:
> > >  (Daniel Vetter):
> > >   - Expand logical order explanation
> > >   - Add dummy header
> > >   - Only allow N BBs in execbuf IOCTL
> > >   - Configure parallel submission per slot not per gem context
> > >
> > > Cc: Tvrtko Ursulin 
> > > Cc: Tony Ye 
> > > CC: Carl Zhang 
> > > Cc: Daniel Vetter 
> > > Cc: Jason Ekstrand 
> > > Signed-off-by: Matthew Brost 
> > > ---
> > >  Documentation/gpu/rfc/i915_parallel_execbuf.h | 144 ++
> > >  Documentation/gpu/rfc/i915_scheduler.rst  |  53 ++-
> > >  2 files changed, 196 insertions(+), 1 deletion(-)
> > >  create mode 100644 Documentation/gpu/rfc/i915_parallel_execbuf.h
> > >
> > > diff --git a/Documentation/gpu/rfc/i915_parallel_execbuf.h 
> > > b/Documentation/gpu/rfc/i915_parallel_execbuf.h
> > > new file mode 100644
> > > index ..8c64b983ccad
> > > --- /dev/null
> > > +++ b/Documentation/gpu/rfc/i915_parallel_execbuf.h
> > > @@ -0,0 +1,144 @@
> > > +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see 
> > > i915_context_engines_parallel_submit */
> > > +
> > > +/*
> > > + * i915_context_engines_parallel_submit:
> > > + *
> > > + * Set up a slot to allow multiple BBs to be submitted in a single 
> > > execbuf IOCTL.
> > > + * Those BBs will then be scheduled to run on the GPU in parallel. 
> > > Multiple
> > > + * hardware contexts are created internally in the i915 to run these BBs. 
> > > Once a
> > > + * slot is configured for N BBs, only N BBs can be submitted in each 
> > > execbuf
> > > + * IOCTL and this is implicit behavior (e.g. the user doesn't tell the 
> > > execbuf
> > > + * IOCTL there are N BBs, the execbuf IOCTL knows how many BBs there are 
> > > based on
> > > + * the slot's configuration).
> > > + *
> > > + * There are two currently defined ways to control the placement of the
> > > + * hardware contexts on physical engines: default behavior (no flags) and
> > > + * I915_PARALLEL_IMPLICT_BONDS (a flag). More flags may be added in 
> > > the
> > > + * future as new hardware / use cases arise. Details of how to use this
> > > + * interface are described below, above the flags.
> > > + *
> > > + * Returns -EINVAL if the hardware context placement configuration is invalid 
> > > or if the
> > > + * placement configuration isn't supported on the platform / submission
> > > + * interface.
> > > + * Returns -ENODEV if the extension isn't supported on the platform / 
> > > submission
> > > + * interface.
> > > + */
> > > +struct i915_context_engines_parallel_submit {
> > > +   struct i915_user_extension base;
> > > +
> > > +   __u16 engine_index; /* slot for parallel engine */
> > > +   __u16 width;/* number of contexts per parallel engine */
> > > +   __u16 num_siblings; /* number of siblings per context */
> > > +   __u16 mbz16;
> >
> > Ok the big picture looks reasonable now, the flags still confuse me.
> >
>
> Yea, it is a bit confusing.
>
> > > +/*
> > > + * Default placement behavior (currently unsupported):
> > > + *
> > > + * Rather than restricting parallel submission to a single class with a
> > > + * logically contiguous placement (I915_PARALLEL_IMPLICT_BONDS), add a 
> > > mode that
> > > + * enables parallel submission across multiple engine classes. In this 
> > > case each
> > > + * context's logical engine mask indicates where that context can be 
> > > placed. It is
> > > + * implied in this mode that all contexts have mutually exclusive 
> > > placement (e.g.
> > > + * if one context is running on CS0 no other contexts can run on CS0).
> > > + *
> > > + * Example 1 pseudo code:
> > > + * CSX[Y] = engine class X, logical instance Y
> > > + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> > > + * set_engines(INVALID)
> > > + * set_parallel(engine_index=0, width=2, num_siblings=2,
> > > + * engines=CS0[0],CS0[1],CS1[0],CS1[1])
> > > + *
> > > + * Results in 

Re: [Mesa-dev] [Intel-gfx] [RFC 2/2] drm/doc/rfc: i915 new parallel submission uAPI plan

2021-05-19 Thread Daniel Vetter
On Tue, May 18, 2021 at 04:58:30PM -0700, Matthew Brost wrote:
> Add entry for i915 new parallel submission uAPI plan.
> 
> v2:
>  (Daniel Vetter):
>   - Expand logical order explanation
>   - Add dummy header
>   - Only allow N BBs in execbuf IOCTL
>   - Configure parallel submission per slot not per gem context
> 
> Cc: Tvrtko Ursulin 
> Cc: Tony Ye 
> CC: Carl Zhang 
> Cc: Daniel Vetter 
> Cc: Jason Ekstrand 
> Signed-off-by: Matthew Brost 
> ---
>  Documentation/gpu/rfc/i915_parallel_execbuf.h | 144 ++
>  Documentation/gpu/rfc/i915_scheduler.rst  |  53 ++-
>  2 files changed, 196 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/gpu/rfc/i915_parallel_execbuf.h
> 
> diff --git a/Documentation/gpu/rfc/i915_parallel_execbuf.h 
> b/Documentation/gpu/rfc/i915_parallel_execbuf.h
> new file mode 100644
> index ..8c64b983ccad
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_parallel_execbuf.h
> @@ -0,0 +1,144 @@
> +#define I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT 2 /* see 
> i915_context_engines_parallel_submit */
> +
> +/*
> + * i915_context_engines_parallel_submit:
> + *
> + * Set up a slot to allow multiple BBs to be submitted in a single execbuf 
> IOCTL.
> + * Those BBs will then be scheduled to run on the GPU in parallel. Multiple
> + * hardware contexts are created internally in the i915 to run these BBs. Once a
> + * slot is configured for N BBs, only N BBs can be submitted in each execbuf
> + * IOCTL and this is implicit behavior (e.g. the user doesn't tell the execbuf
> + * IOCTL there are N BBs, the execbuf IOCTL knows how many BBs there are 
> based on
> + * the slot's configuration).
> + *
> + * There are two currently defined ways to control the placement of the
> + * hardware contexts on physical engines: default behavior (no flags) and
> + * I915_PARALLEL_IMPLICT_BONDS (a flag). More flags may be added in the
> + * future as new hardware / use cases arise. Details of how to use this
> + * interface are described below, above the flags.
> + *
> + * Returns -EINVAL if the hardware context placement configuration is invalid or 
> if the
> + * placement configuration isn't supported on the platform / submission
> + * interface.
> + * Returns -ENODEV if the extension isn't supported on the platform / submission
> + * interface.
> + */
> +struct i915_context_engines_parallel_submit {
> + struct i915_user_extension base;
> +
> + __u16 engine_index; /* slot for parallel engine */
> + __u16 width;/* number of contexts per parallel engine */
> + __u16 num_siblings; /* number of siblings per context */
> + __u16 mbz16;

Ok the big picture looks reasonable now, the flags still confuse me.
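
To make the slot configuration more concrete, here is a rough userspace sketch of
what filling in this extension for Example 1 below could look like. The tail of the
struct (a flags field and a trailing engines[] array of width * num_siblings
entries) is cut off in the quote and is assumed here from the set_parallel()
examples, so the layout past mbz16 is illustrative, not the final uapi:

#include <string.h>
#include <drm/i915_drm.h>   /* i915_user_extension, i915_engine_class_instance */

/* Illustrative only: mirrors the RFC struct above, assuming it ends in a
 * flags field and an engines[] array of width * num_siblings entries, as the
 * set_parallel() examples imply.  I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT is
 * the define from the RFC header. */
struct parallel_submit_sketch {
        struct i915_user_extension base;
        __u16 engine_index;
        __u16 width;            /* BBs per execbuf */
        __u16 num_siblings;     /* possible placements per BB */
        __u16 mbz16;
        __u64 flags;            /* e.g. an implicit-bonds mode */
        struct i915_engine_class_instance engines[4];
};

static void setup_parallel_slot_example1(struct parallel_submit_sketch *ext)
{
        memset(ext, 0, sizeof(*ext));
        ext->base.name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT;
        ext->engine_index = 0;   /* which slot of the context's engine map */
        ext->width = 2;          /* two BBs per execbuf */
        ext->num_siblings = 2;   /* each BB can land on one of two engines */

        /* Example 1: BB0 may run on CS0[0] or CS0[1], BB1 on CS1[0] or CS1[1] */
        ext->engines[0] = (struct i915_engine_class_instance){ .engine_class = 0, .engine_instance = 0 };
        ext->engines[1] = (struct i915_engine_class_instance){ .engine_class = 0, .engine_instance = 1 };
        ext->engines[2] = (struct i915_engine_class_instance){ .engine_class = 1, .engine_instance = 0 };
        ext->engines[3] = (struct i915_engine_class_instance){ .engine_class = 1, .engine_instance = 1 };

        /* The extension would then be chained off
         * i915_context_param_engines::extensions and applied with
         * DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM. */
}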

> +/*
> + * Default placement behavior (currently unsupported):
> + *
> + * Rather than restricting parallel submission to a single class with a
> + * logically contiguous placement (I915_PARALLEL_IMPLICT_BONDS), add a mode 
> that
> + * enables parallel submission across multiple engine classes. In this case 
> each
> + * context's logical engine mask indicates where that context can be placed. It 
> is
> + * implied in this mode that all contexts have mutually exclusive placement 
> (e.g.
> + * if one context is running on CS0 no other contexts can run on CS0).
> + *
> + * Example 1 pseudo code:
> + * CSX[Y] = engine class X, logical instance Y
> + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> + * set_engines(INVALID)
> + * set_parallel(engine_index=0, width=2, num_siblings=2,
> + *   engines=CS0[0],CS0[1],CS1[0],CS1[1])
> + *
> + * Results in the following valid placements:
> + * CS0[0], CS1[0]
> + * CS0[0], CS1[1]
> + * CS0[1], CS1[0]
> + * CS0[1], CS1[1]
> + *
> + * This can also be thought of as 2 virtual engines:
> + * VE[0] = CS0[0], CS0[1]
> + * VE[1] = CS1[0], CS1[1]
> + *
> + * Example 2 pseudo code:
> + * CS[X] = generic engine of same class, logical instance X
> + * INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE
> + * set_engines(INVALID)
> + * set_parallel(engine_index=0, width=2, num_siblings=3,
> + *   engines=CS[0],CS[1],CS[2],CS[0],CS[1],CS[2])
> + *
> + * Results in the following valid placements:
> + * CS[0], CS[1]
> + * CS[0], CS[2]
> + * CS[1], CS[0]
> + * CS[1], CS[2]
> + * CS[2], CS[0]
> + * CS[2], CS[1]
> + *
> + *
> + * This can also be thought of as 2 virtual engines:
> + * VE[0] = CS[0], CS[1], CS[2]
> + * VE[1] = CS[0], CS[1], CS[2]
> +
> + * This enables a use case where all engines are created equally, we don't 
> care
> + * where they are scheduled, we just want a certain number of resources, for
> + * those resources to 
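
As an aside, the mutual-exclusion rule behind these placement examples is easy to
check mechanically. A toy C sketch (nothing to do with the uapi itself) that prints
exactly the six valid placements listed for Example 2 above:

#include <stdio.h>

/* Example 2: width = 2 BBs, num_siblings = 3, both BBs allowed on CS[0..2],
 * with the rule that no two BBs may land on the same physical engine. */
#define WIDTH        2
#define NUM_SIBLINGS 3

int main(void)
{
        /* engines[bb][sibling] = physical engine instance, as in set_parallel() */
        int engines[WIDTH][NUM_SIBLINGS] = {
                { 0, 1, 2 },   /* BB0: CS[0], CS[1], CS[2] */
                { 0, 1, 2 },   /* BB1: CS[0], CS[1], CS[2] */
        };

        for (int i = 0; i < NUM_SIBLINGS; i++) {
                for (int j = 0; j < NUM_SIBLINGS; j++) {
                        int bb0 = engines[0][i], bb1 = engines[1][j];

                        if (bb0 == bb1)
                                continue; /* placements are mutually exclusive */
                        printf("CS[%d], CS[%d]\n", bb0, bb1);
                }
        }
        return 0;
}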

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Daniel Vetter
On Tue, May 04, 2021 at 02:48:35PM +0200, Christian König wrote:
> Am 04.05.21 um 13:13 schrieb Daniel Vetter:
> > On Tue, May 4, 2021 at 12:53 PM Christian König
> >  wrote:
> > > Am 04.05.21 um 11:47 schrieb Daniel Vetter:
> > > > [SNIP]
> > > > > Yeah, it just takes too long for the preemption to complete to be 
> > > > > really
> > > > > useful for the feature we are discussing here.
> > > > > 
> > > > > As I said when the kernel requests to preempt a queue we can easily 
> > > > > expect a
> > > > > timeout of ~100ms until that comes back. For compute that is even in 
> > > > > the
> > > > > multiple seconds range.
> > > > 100ms for preempting an idle request sounds like broken hw to me. Of
> > > > course preempting something that actually runs takes a while, that's
> > > > nothing new. But it's also not the thing we're talking about here. Is 
> > > > this
> > > > 100ms actual numbers from hw for an actual idle ringbuffer?
> > > Well 100ms is just an example of the scheduler granularity. Let me
> > > explain in a wider context.
> > > 
> > > The hardware can have X queues mapped at the same time and every Y time
> > > interval the hardware scheduler checks if those queues have changed and
> > > only if they have changed the necessary steps to reload them are started.
> > > 
> > > Multiple queues can be rendering at the same time, so you can have X as
> > > a high priority queue active and just waiting for a signal to start and
> > > the client rendering one frame after another and a third background
> > > compute task mining bitcoins for you.
> > > 
> > > As long as everything is static this is perfectly performant. Adding a
> > > queue to the list of active queues is also relatively simple, but taking
> > > one down requires you to wait until we are sure the hardware has seen
> > > the change and reloaded the queues.
> > > 
> > > Think of it as an RCU grace period. This is simply not something which
> > > is made to be used constantly, but rather just at process termination.
> > Uh ... that indeed sounds rather broken.
> 
> Well I wouldn't call it broken. It's just not made for the use case we are
> trying to abuse it for.
> 
> > Otoh it's just a dma_fence that we'd inject as this unload-fence.
> 
> Yeah, exactly that's why it isn't much of a problem for process termination
> or freeing memory.

Ok so your hw really hates the unload fence. On ours the various queues
are a bit more explicit, so largely unload/preempt is the same as context
switch and pretty quick. Afaik at least.

Still baffled that you can't fix this in fw, but oh well. Judging from how
fast our fw team moves I'm not surprised :-/

Anyway so next plan: Make this work exactly like hmm:
1. wait for the user fence as a dma-fence fake thing, tdr makes this safe
2. remove pte
3. do synchronous tlb flush

Tada, no more 100ms stall in your buffer move callbacks. And feel free to
pack up 2&3 into an async worker or something if it takes too long and
treating it as a bo move dma_fence is better. Also that way you might be
able to batch up the tlb flushing if it's too damn expensive, by
collecting them all under a single dma_fence (and starting a new tlb flush
cycle every time ->enable_signalling gets called).
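
For illustration, a minimal sketch of such a batched TLB-flush fence, using only the
stock dma_fence API; every driver-side name here is made up and the actual flush is
left as a stub. The point is that the expensive flush cycle only starts once someone
enables signalling on the fence, so further PTE removals can keep piling onto the
same fence until then:

#include <linux/dma-fence.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

/* One fence per TLB-flush cycle; PTE removals get batched under it. */
struct tlb_flush_fence {
        struct dma_fence base;
        spinlock_t lock;
        struct work_struct work;        /* does the actual synchronous flush */
};

static const char *tlb_fence_driver_name(struct dma_fence *f)
{
        return "example-driver";
}

static const char *tlb_fence_timeline_name(struct dma_fence *f)
{
        return "tlb-flush";
}

static void tlb_flush_work_func(struct work_struct *work)
{
        struct tlb_flush_fence *fence =
                container_of(work, struct tlb_flush_fence, work);

        /* hypothetical: flush the GPU TLBs for everything batched so far,
         * e.g. example_gpu_tlb_flush(); */

        dma_fence_signal(&fence->base);
        dma_fence_put(&fence->base);    /* ref taken in enable_signaling */
}

static bool tlb_fence_enable_signaling(struct dma_fence *f)
{
        struct tlb_flush_fence *fence =
                container_of(f, struct tlb_flush_fence, base);

        /* Start the (expensive) flush cycle only once someone waits. */
        dma_fence_get(f);
        schedule_work(&fence->work);
        return true;
}

static const struct dma_fence_ops tlb_fence_ops = {
        .get_driver_name        = tlb_fence_driver_name,
        .get_timeline_name      = tlb_fence_timeline_name,
        .enable_signaling       = tlb_fence_enable_signaling,
};

static void tlb_fence_init(struct tlb_flush_fence *fence, u64 context, u64 seqno)
{
        spin_lock_init(&fence->lock);
        INIT_WORK(&fence->work, tlb_flush_work_func);
        dma_fence_init(&fence->base, &tlb_fence_ops, &fence->lock, context, seqno);
}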

As long as you nack any gpu faults and don't try to fill them for these
legacy contexts that support dma-fence there's no harm in using the hw
facilities.

Ofc if you're now telling me your synchronous tlb flush is also 100ms,
then maybe just throw the hw out the window, and accept that the
millisecond anything evicts anything (good luck with userptr) the screen
freezes for a bit.

> > So by and large everyone should already be able to cope with it taking a
> > bit longer. So from a design pov I don't see a huge problem, but I
> > guess you guys won't be happy since it means on amd hw there will be
> > random unsightly stalls in desktop linux usage.
> > 
> > > > > The "preemption" feature is really called suspend and made just for 
> > > > > the case
> > > > > when we want to put a process to sleep or need to forcefully kill it 
> > > > > for
> > > > > misbehavior or stuff like that. It is not meant to be used in normal
> > > > > operation.
> > > > > 
> > > > > If we only attach it on ->move then yeah maybe a last resort 
> > > > > possibility to
> > > > > do it this way, but I think in that case we could rather

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Daniel Vetter
On Tue, May 4, 2021 at 12:53 PM Christian König
 wrote:
>
> Am 04.05.21 um 11:47 schrieb Daniel Vetter:
> > [SNIP]
> >> Yeah, it just takes too long for the preemption to complete to be really
> >> useful for the feature we are discussing here.
> >>
> >> As I said when the kernel requests to preempt a queue we can easily expect 
> >> a
> >> timeout of ~100ms until that comes back. For compute that is even in the
> >> multiple seconds range.
> > 100ms for preempting an idle request sounds like broken hw to me. Of
> > course preempting something that actually runs takes a while, that's
> > nothing new. But it's also not the thing we're talking about here. Is this
> > 100ms actual numbers from hw for an actual idle ringbuffer?
>
> Well 100ms is just an example of the scheduler granularity. Let me
> explain in a wider context.
>
> The hardware can have X queues mapped at the same time and every Y time
> interval the hardware scheduler checks if those queues have changed and
> only if they have changed the necessary steps to reload them are started.
>
> Multiple queues can be rendering at the same time, so you can have X as
> a high priority queue active and just waiting for a signal to start and
> the client rendering one frame after another and a third background
> compute task mining bitcoins for you.
>
> As long as everything is static this is perfectly performant. Adding a
> queue to the list of active queues is also relatively simple, but taking
> one down requires you to wait until we are sure the hardware has seen
> the change and reloaded the queues.
>
> Think of it as an RCU grace period. This is simply not something which
> is made to be used constantly, but rather just at process termination.

Uh ... that indeed sounds rather broken.

Otoh it's just a dma_fence that we'd inject as this unload-fence. So
by and large everyone should already be able to cope with it taking a
bit longer. So from a design pov I don't see a huge problem, but I
guess you guys won't be happy since it means on amd hw there will be
random unsightly stalls in desktop linux usage.

> >> The "preemption" feature is really called suspend and made just for the 
> >> case
> >> when we want to put a process to sleep or need to forcefully kill it for
> >> misbehavior or stuff like that. It is not meant to be used in normal
> >> operation.
> >>
> >> If we only attach it on ->move then yeah maybe a last resort possibility to
> >> do it this way, but I think in that case we could rather stick with kernel
> >> submissions.
> > Well this is a hybrid userspace ring + kernel augmented submit mode, so you
> > can keep dma-fences working. Because the dma-fence stuff wont work with
> > pure userspace submit, I think that conclusion is rather solid. Once more
> > even after this long thread here.
>
> When assisted with unload fences, then yes. Problem is that I can't see
> how we could implement those performant currently.

Is there really no way to fix fw here? Like if process start/teardown
takes 100ms, that's going to suck no matter what.

> >>> Also, if userspace lies to us and keeps pushing crap into the ring
> >>> after it's supposed to be idle: Userspace is already allowed to waste
> >>> gpu time. If you're too worried about this set a fairly aggressive
> >>> preempt timeout on the unload fence, and kill the context if it takes
> >>> longer than what preempting an idle ring should take (because that
> >>> would indicate broken/evil userspace).
> >> I think you have the wrong expectation here. It is perfectly valid and
> >> expected for userspace to keep writing commands into the ring buffer.
> >>
> >> After all when one frame is completed they want to immediately start
> >> rendering the next one.
> > Sure, for the true userspace direct submit model. But with that you don't
> > get dma-fence, which means this gpu will not work for 3d accel on any
> > current linux desktop.
>
> I'm not sure of that. I've looked a bit into how we could add user
> fences to dma_resv objects and that isn't that hard after all.

I think as a proof of concept it's fine, but as an actual solution ...
pls no. Two reasons:
- implicit sync is bad
- this doesn't fix anything for explicit sync using dma_fence in terms
of sync_file or drm_syncobj.

So if we go with the route of papering over this in the kernel, then
it'll be a ton more work than just hacking something into dma_resv.

> > Which sucks, hence some hybrid model of using the userspace ring and
> > kernel augmented submit is needed. Which was my idea.
>
>

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Daniel Vetter
On Tue, May 04, 2021 at 11:14:06AM +0200, Christian König wrote:
> Am 04.05.21 um 10:27 schrieb Daniel Vetter:
> > On Tue, May 4, 2021 at 10:09 AM Christian König
> >  wrote:
> > > Am 04.05.21 um 09:32 schrieb Daniel Vetter:
> > > > On Tue, May 04, 2021 at 09:01:23AM +0200, Christian König wrote:
> > > > > Unfortunately as I pointed out to Daniel as well this won't work 100%
> > > > > reliably either.
> > > > You're claiming this, but there's no clear reason why really, and you
> > > > did't reply to my last mail on that sub-thread, so I really don't get
> > > > where exactly you're seeing a problem.
> > > Yeah, it's rather hard to explain without pointing out how the hardware
> > > works in detail.
> > > 
> > > > > See the signal on the ring buffer needs to be protected from 
> > > > > manipulation by
> > > > > userspace so that we can guarantee that the hardware really has 
> > > > > finished
> > > > > executing when it fires.
> > > > Nope you don't. Userspace is already allowed to submit all kinds of 
> > > > random
> > > > garbage, the only thing the kernel has to guarantee is:
> > > > - the dma-fence DAG stays a DAG
> > > > - dma-fence completes in finite time
> > > > 
> > > > Everything else is not the kernel's problem, and if userspace mixes 
> > > > stuff
> > > > up like manipulates the seqno, that's ok. It can do that kind of garbage
> > > > already.
> > > > 
> > > > > Protecting memory by immediate page table updates is a good first 
> > > > > step, but
> > > > > unfortunately not sufficient (and we would need to restructure large 
> > > > > parts
> > > > > of the driver to make this happen).
> > > > This is why you need the unload-fence on top, because indeed you can't
> > > > just rely on the fences created from the userspace ring, those are
> > > > unreliable for memory management.
> > > And exactly that's the problem! We can't provide a reliable unload-fence
> > > and the user fences are unreliable for that.
> > > 
> > > I've talked this through at length with our hardware/firmware guy last
> > > Thursday but couldn't find a solution either.
> > > 
> > > We can have a preemption fence for the kernel which says: Hey this queue
> > > was scheduled away you can touch its hardware descriptor, control
> > > registers, page tables, TLB, memory, GWS, GDS, OA etc etc etc... again.
> > > But that one is only triggered on preemption and then we have the same
> > > ordering problems once more.
> > > 
> > > Or we can have an end of operation fence for userspace which says: Hey
> > > this queue has finished its batch of execution, but this one is
> > > manipulable from userspace to either finish too early (very very bad for
> > > invalidations and memory management) or finish too late/never (deadlock
> > > prone but fixable by timeout).
> > > 
> > > What we could do is to use the preemption fence to emulate the unload
> > > fence, e.g. something like:
> > > 1. Preempt the queue in fixed intervals (let's say 100ms).
> > > 2. While preempted check if we have reached the checkpoint in question
> > > by looking at the hardware descriptor.
> > > 3. If we have reached the checkpoint signal the unload fence.
> > > 4. If we haven't reached the checkpoint resume the queue again.
> > > 
> > > The problem is that this might introduce a maximum of 100ms delay before
> > > signaling the unload fence and preempt/resume has such a hefty overhead
> > > that we waste a horrible amount of time on it.
> > So your hw can preempt? That's good enough.
> > 
> > The unload fence is just
> > 1. wait for all dma_fence that are based on the userspace ring. This
> > is unreliable, but we don't care because tdr will make it reliable.
> > And once tdr shot down a context we'll force-unload and thrash it
> > completely, which solves the problem.
> > 2. preempt the context, which /should/ now be stuck waiting for more
> > commands to be stuffed into the ringbuffer. Which means your
> > preemption is hopefully fast enough to not matter. If your hw takes
> > forever to preempt an idle ring, I can't help you :-)
> 
> Yeah, it just takes too long for the preemption to complete to be really
> useful for the feature we are discussing here.
> 
> As I said when the kernel r

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Daniel Vetter
On Tue, May 4, 2021 at 10:09 AM Christian König
 wrote:
>
> Am 04.05.21 um 09:32 schrieb Daniel Vetter:
> > On Tue, May 04, 2021 at 09:01:23AM +0200, Christian König wrote:
> >> Unfortunately as I pointed out to Daniel as well this won't work 100%
> >> reliably either.
> > You're claiming this, but there's no clear reason why really, and you
> > didn't reply to my last mail on that sub-thread, so I really don't get
> > where exactly you're seeing a problem.
>
> Yeah, it's rather hard to explain without pointing out how the hardware
> works in detail.
>
> >> See the signal on the ring buffer needs to be protected from manipulation 
> >> by
> >> userspace so that we can guarantee that the hardware really has finished
> >> executing when it fires.
> > Nope you don't. Userspace is already allowed to submit all kinds of random
> > garbage, the only thing the kernel has to guarantee is:
> > - the dma-fence DAG stays a DAG
> > - dma-fence completes in finite time
> >
> > Everything else is not the kernel's problem, and if userspace mixes stuff
> > up like manipulates the seqno, that's ok. It can do that kind of garbage
> > already.
> >
> >> Protecting memory by immediate page table updates is a good first step, but
> >> unfortunately not sufficient (and we would need to restructure large parts
> >> of the driver to make this happen).
> > This is why you need the unload-fence on top, because indeed you can't
> > just rely on the fences created from the userspace ring, those are
> > unreliable for memory management.
>
> And exactly that's the problem! We can't provide a reliable unload-fence
> and the user fences are unreliable for that.
>
> I've talked this through at length with our hardware/firmware guy last
> Thursday but couldn't find a solution either.
>
> We can have a preemption fence for the kernel which says: Hey this queue
> was scheduled away you can touch its hardware descriptor, control
> registers, page tables, TLB, memory, GWS, GDS, OA etc etc etc... again.
> But that one is only triggered on preemption and then we have the same
> ordering problems once more.
>
> Or we can have an end of operation fence for userspace which says: Hey
> this queue has finished its batch of execution, but this one is
> manipulable from userspace to either finish too early (very very bad for
> invalidations and memory management) or finish too late/never (deadlock
> prone but fixable by timeout).
>
> What we could do is to use the preemption fence to emulate the unload
> fence, e.g. something like:
> 1. Preempt the queue in fixed intervals (let's say 100ms).
> 2. While preempted check if we have reached the checkpoint in question
> by looking at the hardware descriptor.
> 3. If we have reached the checkpoint signal the unload fence.
> 4. If we haven't reached the checkpoint resume the queue again.
>
> The problem is that this might introduce a maximum of 100ms delay before
> signaling the unload fence and preempt/resume has such a hefty overhead
> that we waste a horrible amount of time on it.

So your hw can preempt? That's good enough.

The unload fence is just
1. wait for all dma_fence that are based on the userspace ring. This
is unreliable, but we don't care because tdr will make it reliable.
And once tdr shot down a context we'll force-unload and thrash it
completely, which solves the problem.
2. preempt the context, which /should/ now be stuck waiting for more
commands to be stuffed into the ringbuffer. Which means your
preemption is hopefully fast enough to not matter. If your hw takes
forever to preempt an idle ring, I can't help you :-)

Also, if userspace lies to us and keeps pushing crap into the ring
after it's supposed to be idle: Userspace is already allowed to waste
gpu time. If you're too worried about this set a fairly aggressive
preempt timeout on the unload fence, and kill the context if it takes
longer than what preempting an idle ring should take (because that
would indicate broken/evil userspace).

Again, I'm not seeing the problem. Except if your hw is really
completely busted to the point where it can't even support userspace
ringbuffers properly and with sufficient performance :-P

Of course if you issue the preempt context request before the
userspace fences have finished (or tdr cleaned up the mess) like you
do in your proposal, then it will be ridiculously expensive and/or
won't work. So just don't do that.
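
Spelled out as pseudo-kernel code, the two-step unload fence described above might
look roughly like this; the xyz_* hooks, struct and timeouts are hypothetical, only
the ordering and the TDR-backed timeouts are the point:

#include <linux/errno.h>
#include <linux/jiffies.h>

#define RING_IDLE_TIMEOUT_MS    2000    /* made-up numbers */
#define IDLE_PREEMPT_TIMEOUT_MS   10

struct example_ctx;                     /* opaque, hypothetical driver context */

/* hypothetical driver hooks */
int xyz_wait_ring_fences(struct example_ctx *ctx, unsigned long timeout);
int xyz_preempt(struct example_ctx *ctx, unsigned long timeout);
int xyz_kill_context(struct example_ctx *ctx);

static int example_unload_context(struct example_ctx *ctx)
{
        int ret;

        /*
         * 1. Wait for every dma_fence derived from the userspace ring.  This
         *    is untrusted, but TDR makes it reliable: on timeout the context
         *    is shot down, force-unloaded and thrashed completely.
         */
        ret = xyz_wait_ring_fences(ctx, msecs_to_jiffies(RING_IDLE_TIMEOUT_MS));
        if (ret == -ETIMEDOUT)
                return xyz_kill_context(ctx);

        /*
         * 2. Preempt the context, which should now be idle and merely waiting
         *    for more commands in its ringbuffer, so an aggressive timeout
         *    sized for preempting an *idle* ring is enough.  If that still
         *    times out, it indicates broken/evil userspace.
         */
        ret = xyz_preempt(ctx, msecs_to_jiffies(IDLE_PREEMPT_TIMEOUT_MS));
        if (ret == -ETIMEDOUT)
                return xyz_kill_context(ctx);

        return 0;
}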

> > btw I thought some more, and I think it's probably best if we only attach
> > the unload-fence in the ->move(_notify) callbacks. Kinda like we already
> > do for async copy jobs. So the overall buffer move sequence would be:
> >
> > 1. wait for (untrusted for kernel, but ne

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Daniel Vetter
On Tue, May 04, 2021 at 09:01:23AM +0200, Christian König wrote:
> Unfortunately as I pointed out to Daniel as well this won't work 100%
> reliably either.

You're claiming this, but there's no clear reason why really, and you
didn't reply to my last mail on that sub-thread, so I really don't get
where exactly you're seeing a problem.

> See the signal on the ring buffer needs to be protected from manipulation by
> userspace so that we can guarantee that the hardware really has finished
> executing when it fires.

Nope you don't. Userspace is already allowed to submit all kinds of random
garbage, the only thing the kernel has to guarantee is:
- the dma-fence DAG stays a DAG
- dma-fence completes in finite time

Everything else is not the kernel's problem, and if userspace mixes stuff
up like manipulates the seqno, that's ok. It can do that kind of garbage
already.

> Protecting memory by immediate page table updates is a good first step, but
> unfortunately not sufficient (and we would need to restructure large parts
> of the driver to make this happen).

This is why you need the unload-fence on top, because indeed you can't
just rely on the fences created from the userspace ring, those are
unreliable for memory management.

btw I thought some more, and I think it's probably best if we only attach
the unload-fence in the ->move(_notify) callbacks. Kinda like we already
do for async copy jobs. So the overall buffer move sequence would be:

1. wait for (untrusted for kernel, but necessary for userspace
correctness) fake dma-fences that rely on the userspace ring

2. unload ctx

3. copy buffer

Ofc 2&3 would be done async behind a dma_fence.
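
A rough sketch of that buffer-move sequence, with all driver hooks hypothetical;
step 1 is the untrusted wait on the fake dma-fences from the userspace ring, and
steps 2 and 3 run asynchronously behind the dma_fence handed back as the bo-move
fence:

#include <linux/dma-fence.h>
#include <linux/err.h>

struct example_bo;                      /* opaque, hypothetical */

/* hypothetical driver hooks */
int xyz_wait_userspace_ring_fences(struct example_bo *bo);              /* 1. */
struct dma_fence *xyz_ctx_unload_async(struct example_bo *bo);          /* 2. */
struct dma_fence *xyz_copy_buffer_async(struct example_bo *bo,
                                        struct dma_fence *after);       /* 3. */

static struct dma_fence *example_bo_move(struct example_bo *bo)
{
        struct dma_fence *unload, *copy;
        int ret;

        /* 1. untrusted (but needed for userspace correctness) wait on the
         *    fake dma-fences derived from the userspace ring */
        ret = xyz_wait_userspace_ring_fences(bo);
        if (ret)
                return ERR_PTR(ret);

        /* 2. unload the context(s) using this bo, asynchronously */
        unload = xyz_ctx_unload_async(bo);
        if (IS_ERR(unload))
                return unload;

        /* 3. copy the buffer once the unload fence has signalled */
        copy = xyz_copy_buffer_async(bo, unload);
        dma_fence_put(unload);

        return copy;    /* attached as the bo-move fence */
}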

> On older hardware we often had the situation that for reliable invalidation
> we need the guarantee that every previous operation has finished executing.
> It's not so much of a problem when the next operation has already started,
> since then we had the opportunity to do things in between the last and the
> next operation. Just see cache invalidation and VM switching for example.

If you have gpu page faults you generally have synchronous tlb
invalidation, so this also shouldn't be a big problem. Combined with the
unload fence at least. If you don't have synchronous tlb invalidate it
gets a bit more nasty and you need to force a preemption to a kernel
context which has the required flushes across all the caches. Slightly
nasty, but the exact same thing would be required for handling page faults
anyway with the direct userspace submit model.

Again I'm not seeing a problem.

> Additional to that it doesn't really buy us anything, e.g. there is not much
> advantage to this. Writing the ring buffer in userspace and then ringing in
> the kernel has the same overhead as doing everything in the kernel in the
> first place.

It gets you dma-fence backwards compat without having to rewrite the
entire userspace ecosystem. Also since you have the hw already designed
for ringbuffer in userspace it would be silly to copy that through the cs
ioctl, that's just overhead.

Also I thought the problem you're having is that all the kernel ringbuf
stuff is going away, so the old cs ioctl won't work anymore for sure?

Maybe also pick up that other subthread which ended with my last reply.

Cheers, Daniel


> 
> Christian.
> 
> Am 04.05.21 um 05:11 schrieb Marek Olšák:
> > Proposal for a new CS ioctl, kernel pseudo code:
> > 
> > lock(_lock);
> > serial = get_next_serial(dev);
> > add_wait_command(ring, serial - 1);
> > add_exec_cmdbuf(ring, user_cmdbuf);
> > add_signal_command(ring, serial);
> > *ring->doorbell = FIRE;
> > unlock(_lock);
> > 
> > See? Just like userspace submit, but in the kernel without
> > concurrency/preemption. Is this now safe enough for dma_fence?
> > 
> > Marek
> > 
> > On Mon, May 3, 2021 at 4:36 PM Marek Olšák wrote:
> > 
> > What about direct submit from the kernel where the process still
> > has write access to the GPU ring buffer but doesn't use it? I
> > think that solves your preemption example, but leaves a potential
> > backdoor for a process to overwrite the signal commands, which
> > shouldn't be a problem since we are OK with timeouts.
> > 
> > Marek
> > 
> > On Mon, May 3, 2021 at 11:23 AM Jason Ekstrand wrote:
> > 
> > On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen wrote:
> > >
> >     > On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand wrote:
> > > >
> > > > Sorry for the top-post but there's no good thing to
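
For reference, a slightly fleshed-out version of Marek's pseudo code quoted above,
with hypothetical names throughout; the point is only that the kernel serialises
ring writes and brackets the user command buffer with a wait on the previous serial
and a signal of its own:

#include <linux/io.h>
#include <linux/mutex.h>
#include <linux/types.h>

struct example_ring {
        struct mutex lock;
        u64 next_serial;
        u32 __iomem *doorbell;
};

#define RING_DOORBELL_FIRE 0x1          /* made-up doorbell value */

/* hypothetical ring emit helpers */
void example_ring_emit_wait(struct example_ring *ring, u64 serial);
void example_ring_emit_exec(struct example_ring *ring, u64 cmdbuf_va);
void example_ring_emit_signal(struct example_ring *ring, u64 serial);

static u64 example_submit(struct example_ring *ring, u64 user_cmdbuf_va)
{
        u64 serial;

        mutex_lock(&ring->lock);
        serial = ++ring->next_serial;

        /* wait for the previous submission so completions stay in order */
        example_ring_emit_wait(ring, serial - 1);
        /* jump into the userspace-provided command buffer */
        example_ring_emit_exec(ring, user_cmdbuf_va);
        /* write the new serial back so a dma_fence can key off it */
        example_ring_emit_signal(ring, serial);

        writel(RING_DOORBELL_FIRE, ring->doorbell);
        mutex_unlock(&ring->lock);

        return serial;  /* becomes the seqno of the returned dma_fence */
}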

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-30 Thread Daniel Vetter
On Fri, Apr 30, 2021 at 11:08 AM Christian König
 wrote:
>
> Am 30.04.21 um 10:58 schrieb Daniel Vetter:
> > [SNIP]
> >>> When the user allocates usermode queues, the kernel driver sets up a
> >>> queue descriptor in the kernel which defines the location of the queue
> >>> in memory, what priority it has, what page tables it should use, etc.
> >>> User mode can then start writing commands to its queues.  When they
> >>> are ready for the hardware to start executing them, they ring a
> >>> doorbell which signals the scheduler and it maps the queue descriptors
> >>> to HW queue slots and they start executing.  The user only has access
> >>> to it's queues and any buffers it has mapped in it's GPU virtual
> >>> address space.  While the queues are scheduled, the user can keep
> >>> submitting work to them and they will keep executing unless they get
> >>> preempted by the scheduler due to oversubscription or a priority call
> >>> or a request from the kernel driver to preempt, etc.
> >> Yeah, works like with our stuff.
> >>
> >> I don't see a problem tbh. It's slightly silly going the detour with the
> >> kernel ioctl, and it's annoying that you still have to use drm/scheduler
> >> to resolve dependencies instead of gpu semaphores and all that. But this
> >> only applies to legacy winsys mode, compute (e.g. vk without winsys) can
> >> use the full power. Just needs a flag or something when setting up the
> >> context.
> >>
> >> And best part is that from hw pov this really is indistinguishable from
> >> the full on userspace submit model.
> >>
> >> The thing where it gets annoying is when you use one of these new cpu
> >> instructions which do direct submit to hw and pass along the pasid id
> >> behind the scenes. That's truly something you can't intercept anymore in
> >> the kernel and fake the legacy dma_fence world.
> >>
> >> But what you're describing here sounds like bog standard stuff, and also
> >> pretty easy to keep working with exactly the current model.
> >>
> >> Ofc we'll want to push forward a more modern model that better suits
> >> modern gpus, but I don't see any hard requirement here from the hw side.
> > Adding a bit more detail on what I have in mind:
> >
> > - memory management works like amdgpu does today, so all buffers are
> > pre-bound to the gpu vm, we keep the entire bo set marked as busy with
> > the bulk lru trick for every command submission.
> >
> > - for the ringbuffer, userspace allocates a suitably sized bo for
> > ringbuffer, ring/tail/seqno and whatever else it needs
> >
> > - userspace then asks the kernel to make that into a hw context, with
> > all the privileges setup. Doorbell will only be mapped into kernel
> > (hw can't tell the difference anyway), but if it happens to also be
> > visible to userspace that's no problem. We assume userspace can ring
> > the doorbell anytime it wants to.
>
> This doesn't work in hardware. We at least need to setup a few registers
> and memory locations from inside the VM which userspace shouldn't have
> access to when we want the end of batch fence and ring buffer start to
> be reliable.

The thing is, we don't care whether it's reliable or not. Userspace is
allowed to lie, not signal, signal the wrong thing, out of order,
everything.

The design assumes all this is possible.

So unless you can't signal at all from userspace, this works. And for
the "can't signal at all" it just means something needs to do a cpu
busy wait and burn down lots of cpu time. I hope that's not your hw
design :-)

> > - we do double memory management: One dma_fence works similar to the
> > amdkfd preempt fence, except it doesn't preempt but does anything
> > required to make the hw context unrunnable and take it out of the hw
> > scheduler entirely. This might involve unmapping the doorbell if
> > userspace has access to it.
> >
> > - but we also do classic end-of-batch fences, so that implicit fencing
> > and all that keeps working. The "make hw ctx unrunnable" fence must
> > also wait for all of these pending submissions to complete.
>
> This together doesn't work from the software side, e.g. you can either
> have preemption fences or end of batch fences but never both or your end
> of batch fences would have another dependency on the preemption fences
> which we currently can't express in the dma_fence framework.

It's _not_ a preempt fence. It's a ctx unload fence. Not the same
thing. Normal preempt fence would indee

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-30 Thread Daniel Vetter
On Thu, Apr 29, 2021 at 1:12 PM Daniel Vetter  wrote:
>
> On Wed, Apr 28, 2021 at 04:39:24PM -0400, Alex Deucher wrote:
> > On Wed, Apr 28, 2021 at 10:35 AM Daniel Vetter  wrote:
> > >
> > > On Wed, Apr 28, 2021 at 03:37:49PM +0200, Christian König wrote:
> > > > Am 28.04.21 um 15:34 schrieb Daniel Vetter:
> > > > > On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:
> > > > > > Am 28.04.21 um 14:26 schrieb Daniel Vetter:
> > > > > > > On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
> > > > > > > > On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> > > > > > > > > Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > > > > > > > > > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher 
> > > > > > > > > > wrote:
> > > > > > > > > > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser 
> > > > > > > > > > >  wrote:
> > > > > > > > > > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach 
> > > > > > > > > > > >  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > > Ok. So that would only make the following use cases 
> > > > > > > > > > > > > > broken for now:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > - amd render -> external gpu
> > > > > > > > > > > > > > - amd video encode -> network device
> > > > > > > > > > > > > FWIW, "only" breaking amd render -> external gpu will 
> > > > > > > > > > > > > make us pretty
> > > > > > > > > > > > > unhappy
> > > > > > > > > > > > I concur. I have quite a few users with a multi-GPU 
> > > > > > > > > > > > setup involving
> > > > > > > > > > > > AMD hardware.
> > > > > > > > > > > >
> > > > > > > > > > > > Note, if this brokenness can't be avoided, I'd prefer 
> > > > > > > > > > > > to get a clear
> > > > > > > > > > > > error, and not bad results on screen because nothing is 
> > > > > > > > > > > > synchronized
> > > > > > > > > > > > anymore.
> > > > > > > > > > > It's an upcoming requirement for windows[1], so you are 
> > > > > > > > > > > likely to
> > > > > > > > > > > start seeing this across all GPU vendors that support 
> > > > > > > > > > > windows.  I
> > > > > > > > > > > think the timing depends on how quickly the legacy 
> > > > > > > > > > > hardware support
> > > > > > > > > > > sticks around for each vendor.
> > > > > > > > > > Yeah but hw scheduling doesn't mean the hw has to be 
> > > > > > > > > > constructed to not
> > > > > > > > > > support isolating the ringbuffer at all.
> > > > > > > > > >
> > > > > > > > > > E.g. even if the hw loses the bit to put the ringbuffer 
> > > > > > > > > > outside of the
> > > > > > > > > > userspace gpu vm, if you have pagetables I'm seriously 
> > > > > > > > > > hoping you have r/o
> > > > > > > > > > pte flags. Otherwise the entire "share address space with 
> > > > > > > > > > cpu side,
> > > > > > > > > > seamlessly" thing is out of the window.
> > > > > > > > > >
> > > > > > > > > > And with that r/o bit on the ringbuffer you can once more 
> > > > > > > > > > force submit
> > > > > > > > > > through kernel space, and all the legacy dma_fence based 
> > > > > > > > > > stuff keeps
> > > > > > > > > > working. And we don't have to invent some horrendous 
> > > > > &g

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-29 Thread Daniel Vetter
On Wed, Apr 28, 2021 at 04:39:24PM -0400, Alex Deucher wrote:
> On Wed, Apr 28, 2021 at 10:35 AM Daniel Vetter  wrote:
> >
> > On Wed, Apr 28, 2021 at 03:37:49PM +0200, Christian König wrote:
> > > Am 28.04.21 um 15:34 schrieb Daniel Vetter:
> > > > On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:
> > > > > Am 28.04.21 um 14:26 schrieb Daniel Vetter:
> > > > > > On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
> > > > > > > On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> > > > > > > > Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > > > > > > > > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> > > > > > > > > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser 
> > > > > > > > > >  wrote:
> > > > > > > > > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach 
> > > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > > > Ok. So that would only make the following use cases 
> > > > > > > > > > > > > broken for now:
> > > > > > > > > > > > >
> > > > > > > > > > > > > - amd render -> external gpu
> > > > > > > > > > > > > - amd video encode -> network device
> > > > > > > > > > > > FWIW, "only" breaking amd render -> external gpu will 
> > > > > > > > > > > > make us pretty
> > > > > > > > > > > > unhappy
> > > > > > > > > > > I concur. I have quite a few users with a multi-GPU setup 
> > > > > > > > > > > involving
> > > > > > > > > > > AMD hardware.
> > > > > > > > > > >
> > > > > > > > > > > Note, if this brokenness can't be avoided, I'd prefer 
> > > > > > > > > > > to get a clear
> > > > > > > > > > > error, and not bad results on screen because nothing is 
> > > > > > > > > > > synchronized
> > > > > > > > > > > anymore.
> > > > > > > > > > It's an upcoming requirement for windows[1], so you are 
> > > > > > > > > > likely to
> > > > > > > > > > start seeing this across all GPU vendors that support 
> > > > > > > > > > windows.  I
> > > > > > > > > > think the timing depends on how quickly the legacy hardware 
> > > > > > > > > > support
> > > > > > > > > > sticks around for each vendor.
> > > > > > > > > Yeah but hw scheduling doesn't mean the hw has to be 
> > > > > > > > > constructed to not
> > > > > > > > > support isolating the ringbuffer at all.
> > > > > > > > >
> > > > > > > > > E.g. even if the hw loses the bit to put the ringbuffer 
> > > > > > > > > outside of the
> > > > > > > > > userspace gpu vm, if you have pagetables I'm seriously hoping 
> > > > > > > > > you have r/o
> > > > > > > > > pte flags. Otherwise the entire "share address space with cpu 
> > > > > > > > > side,
> > > > > > > > > seamlessly" thing is out of the window.
> > > > > > > > >
> > > > > > > > > And with that r/o bit on the ringbuffer you can once more 
> > > > > > > > > force submit
> > > > > > > > > through kernel space, and all the legacy dma_fence based 
> > > > > > > > > stuff keeps
> > > > > > > > > working. And we don't have to invent some horrendous 
> > > > > > > > > userspace fence based
> > > > > > > > > implicit sync mechanism in the kernel, but can instead do 
> > > > > > > > > this transition
> > > > > > > > > properly with drm_syncobj timeline explicit sync and protocol 
> > > > > > > > > reving.
> > > > > > 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-29 Thread Daniel Vetter
On Wed, Apr 28, 2021 at 04:45:01PM +0200, Christian König wrote:
> Am 28.04.21 um 16:34 schrieb Daniel Vetter:
> > On Wed, Apr 28, 2021 at 03:37:49PM +0200, Christian König wrote:
> > > Am 28.04.21 um 15:34 schrieb Daniel Vetter:
> > > > On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:
> > > > > Am 28.04.21 um 14:26 schrieb Daniel Vetter:
> > > > > > On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
> > > > > > > On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> > > > > > > > Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > > > > > > > > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> > > > > > > > > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser 
> > > > > > > > > >  wrote:
> > > > > > > > > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach 
> > > > > > > > > > >  wrote:
> > > > > > > > > > > 
> > > > > > > > > > > > > Ok. So that would only make the following use cases 
> > > > > > > > > > > > > broken for now:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > - amd render -> external gpu
> > > > > > > > > > > > > - amd video encode -> network device
> > > > > > > > > > > > FWIW, "only" breaking amd render -> external gpu will 
> > > > > > > > > > > > make us pretty
> > > > > > > > > > > > unhappy
> > > > > > > > > > > I concur. I have quite a few users with a multi-GPU setup 
> > > > > > > > > > > involving
> > > > > > > > > > > AMD hardware.
> > > > > > > > > > > 
> > > > > > > > > > > Note, if this brokenness can't be avoided, I'd prefer 
> > > > > > > > > > > to get a clear
> > > > > > > > > > > error, and not bad results on screen because nothing is 
> > > > > > > > > > > synchronized
> > > > > > > > > > > anymore.
> > > > > > > > > > It's an upcoming requirement for windows[1], so you are 
> > > > > > > > > > likely to
> > > > > > > > > > start seeing this across all GPU vendors that support 
> > > > > > > > > > windows.  I
> > > > > > > > > > think the timing depends on how quickly the legacy hardware 
> > > > > > > > > > support
> > > > > > > > > > sticks around for each vendor.
> > > > > > > > > Yeah but hw scheduling doesn't mean the hw has to be 
> > > > > > > > > constructed to not
> > > > > > > > > support isolating the ringbuffer at all.
> > > > > > > > > 
> > > > > > > > > E.g. even if the hw loses the bit to put the ringbuffer 
> > > > > > > > > outside of the
> > > > > > > > > userspace gpu vm, if you have pagetables I'm seriously hoping 
> > > > > > > > > you have r/o
> > > > > > > > > pte flags. Otherwise the entire "share address space with cpu 
> > > > > > > > > side,
> > > > > > > > > seamlessly" thing is out of the window.
> > > > > > > > > 
> > > > > > > > > And with that r/o bit on the ringbuffer you can once more 
> > > > > > > > > force submit
> > > > > > > > > through kernel space, and all the legacy dma_fence based 
> > > > > > > > > stuff keeps
> > > > > > > > > working. And we don't have to invent some horrendous 
> > > > > > > > > userspace fence based
> > > > > > > > > implicit sync mechanism in the kernel, but can instead do 
> > > > > > > > > this transition
> > > > > > > > > properly with drm_syncobj timeline explicit sync and protocol 
> > > > > > > > > reving.
> > > > > > > > &

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Daniel Vetter
On Wed, Apr 28, 2021 at 03:37:49PM +0200, Christian König wrote:
> Am 28.04.21 um 15:34 schrieb Daniel Vetter:
> > On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:
> > > Am 28.04.21 um 14:26 schrieb Daniel Vetter:
> > > > On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
> > > > > On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> > > > > > Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > > > > > > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> > > > > > > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  
> > > > > > > > wrote:
> > > > > > > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach 
> > > > > > > > >  wrote:
> > > > > > > > > 
> > > > > > > > > > > Ok. So that would only make the following use cases 
> > > > > > > > > > > broken for now:
> > > > > > > > > > > 
> > > > > > > > > > > - amd render -> external gpu
> > > > > > > > > > > - amd video encode -> network device
> > > > > > > > > > FWIW, "only" breaking amd render -> external gpu will make 
> > > > > > > > > > us pretty
> > > > > > > > > > unhappy
> > > > > > > > > I concur. I have quite a few users with a multi-GPU setup 
> > > > > > > > > involving
> > > > > > > > > AMD hardware.
> > > > > > > > > 
> > > > > > > > > Note, if this brokenness can't be avoided, I'd prefer to 
> > > > > > > > > get a clear
> > > > > > > > > error, and not bad results on screen because nothing is 
> > > > > > > > > synchronized
> > > > > > > > > anymore.
> > > > > > > > It's an upcoming requirement for windows[1], so you are likely 
> > > > > > > > to
> > > > > > > > start seeing this across all GPU vendors that support windows.  
> > > > > > > > I
> > > > > > > > think the timing depends on how quickly the legacy hardware 
> > > > > > > > support
> > > > > > > > sticks around for each vendor.
> > > > > > > Yeah but hw scheduling doesn't mean the hw has to be constructed 
> > > > > > > to not
> > > > > > > support isolating the ringbuffer at all.
> > > > > > > 
> > > > > > > E.g. even if the hw loses the bit to put the ringbuffer outside 
> > > > > > > of the
> > > > > > > userspace gpu vm, if you have pagetables I'm seriously hoping you 
> > > > > > > have r/o
> > > > > > > pte flags. Otherwise the entire "share address space with cpu 
> > > > > > > side,
> > > > > > > seamlessly" thing is out of the window.
> > > > > > > 
> > > > > > > And with that r/o bit on the ringbuffer you can once more force 
> > > > > > > submit
> > > > > > > through kernel space, and all the legacy dma_fence based stuff 
> > > > > > > keeps
> > > > > > > working. And we don't have to invent some horrendous userspace 
> > > > > > > fence based
> > > > > > > implicit sync mechanism in the kernel, but can instead do this 
> > > > > > > transition
> > > > > > > properly with drm_syncobj timeline explicit sync and protocol 
> > > > > > > reving.
> > > > > > > 
> > > > > > > At least I think you'd have to work extra hard to create a gpu 
> > > > > > > which
> > > > > > > cannot possibly be intercepted by the kernel, even when it's 
> > > > > > > designed to
> > > > > > > support userspace direct submit only.
> > > > > > > 
> > > > > > > Or are your hw engineers more creative here and we're screwed?
> > > > > > The upcoming hardware generation will have this hardware scheduler 
> > > > > > as a
> > > > > > must have, but there are certain ways we can still stick 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Daniel Vetter
On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:
> Am 28.04.21 um 14:26 schrieb Daniel Vetter:
> > On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
> > > On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> > > > Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > > > > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> > > > > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  
> > > > > > wrote:
> > > > > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach 
> > > > > > >  wrote:
> > > > > > > 
> > > > > > > > > Ok. So that would only make the following use cases broken 
> > > > > > > > > for now:
> > > > > > > > > 
> > > > > > > > > - amd render -> external gpu
> > > > > > > > > - amd video encode -> network device
> > > > > > > > FWIW, "only" breaking amd render -> external gpu will make us 
> > > > > > > > pretty
> > > > > > > > unhappy
> > > > > > > I concur. I have quite a few users with a multi-GPU setup 
> > > > > > > involving
> > > > > > > AMD hardware.
> > > > > > > 
> > > > > > > Note, if this brokenness can't be avoided, I'd prefer to get a 
> > > > > > > clear
> > > > > > > error, and not bad results on screen because nothing is 
> > > > > > > synchronized
> > > > > > > anymore.
> > > > > > It's an upcoming requirement for windows[1], so you are likely to
> > > > > > start seeing this across all GPU vendors that support windows.  I
> > > > > > think the timing depends on how quickly the legacy hardware support
> > > > > > sticks around for each vendor.
> > > > > Yeah but hw scheduling doesn't mean the hw has to be constructed to 
> > > > > not
> > > > > support isolating the ringbuffer at all.
> > > > > 
> > > > > E.g. even if the hw loses the bit to put the ringbuffer outside of the
> > > > > userspace gpu vm, if you have pagetables I'm seriously hoping you 
> > > > > have r/o
> > > > > pte flags. Otherwise the entire "share address space with cpu side,
> > > > > seamlessly" thing is out of the window.
> > > > > 
> > > > > And with that r/o bit on the ringbuffer you can once more force submit
> > > > > through kernel space, and all the legacy dma_fence based stuff keeps
> > > > > working. And we don't have to invent some horrendous userspace fence 
> > > > > based
> > > > > implicit sync mechanism in the kernel, but can instead do this 
> > > > > transition
> > > > > properly with drm_syncobj timeline explicit sync and protocol reving.
> > > > > 
> > > > > At least I think you'd have to work extra hard to create a gpu which
> > > > > cannot possibly be intercepted by the kernel, even when it's designed 
> > > > > to
> > > > > support userspace direct submit only.
> > > > > 
> > > > > Or are your hw engineers more creative here and we're screwed?
> > > > The upcoming hardware generation will have this hardware scheduler as a
> > > > must have, but there are certain ways we can still stick to the old
> > > > approach:
> > > > 
> > > > 1. The new hardware scheduler currently still supports kernel queues 
> > > > which
> > > > essentially is the same as the old hardware ring buffer.
> > > > 
> > > > 2. Mapping the top level ring buffer into the VM at least partially 
> > > > solves
> > > > the problem. This way you can't manipulate the ring buffer content, but 
> > > > the
> > > > location for the fence must still be writeable.
> > > Yeah allowing userspace to lie about completion fences in this model is
> > > ok. Though I haven't thought through full consequences of that, but I
> > > think it's not any worse than userspace lying about which buffers/address
> > > it uses in the current model - we rely on hw vm ptes to catch that stuff.
> > > 
> > > Also it might be good to switch to a non-recoverable ctx model for these.
> > > That's already 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Daniel Vetter
On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
> On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> > Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> > > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  wrote:
> > > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach 
> > > > >  wrote:
> > > > > 
> > > > > > > Ok. So that would only make the following use cases broken for 
> > > > > > > now:
> > > > > > > 
> > > > > > > - amd render -> external gpu
> > > > > > > - amd video encode -> network device
> > > > > > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > > > > > unhappy
> > > > > I concur. I have quite a few users with a multi-GPU setup involving
> > > > > AMD hardware.
> > > > > 
> > > > > Note, if this brokenness can't be avoided, I'd prefer a to get a clear
> > > > > error, and not bad results on screen because nothing is synchronized
> > > > > anymore.
> > > > It's an upcoming requirement for windows[1], so you are likely to
> > > > start seeing this across all GPU vendors that support windows.  I
> > > > think the timing depends on how quickly the legacy hardware support
> > > > sticks around for each vendor.
> > > Yeah but hw scheduling doesn't mean the hw has to be constructed to not
> > > support isolating the ringbuffer at all.
> > > 
> > > E.g. even if the hw loses the bit to put the ringbuffer outside of the
> > > userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
> > > pte flags. Otherwise the entire "share address space with cpu side,
> > > seamlessly" thing is out of the window.
> > > 
> > > And with that r/o bit on the ringbuffer you can once more force submit
> > > through kernel space, and all the legacy dma_fence based stuff keeps
> > > working. And we don't have to invent some horrendous userspace fence based
> > > implicit sync mechanism in the kernel, but can instead do this transition
> > > properly with drm_syncobj timeline explicit sync and protocol reving.
> > > 
> > > At least I think you'd have to work extra hard to create a gpu which
> > > cannot possibly be intercepted by the kernel, even when it's designed to
> > > support userspace direct submit only.
> > > 
> > > Or are your hw engineers more creative here and we're screwed?
> > 
> > The upcomming hardware generation will have this hardware scheduler as a
> > must have, but there are certain ways we can still stick to the old
> > approach:
> > 
> > 1. The new hardware scheduler currently still supports kernel queues which
> > essentially is the same as the old hardware ring buffer.
> > 
> > 2. Mapping the top level ring buffer into the VM at least partially solves
> > the problem. This way you can't manipulate the ring buffer content, but the
> > location for the fence must still be writeable.
> 
> Yeah allowing userspace to lie about completion fences in this model is
> ok. Though I haven't thought through full consequences of that, but I
> think it's not any worse than userspace lying about which buffers/address
> it uses in the current model - we rely on hw vm ptes to catch that stuff.
> 
> Also it might be good to switch to a non-recoverable ctx model for these.
> That's already what we do in i915 (opt-in, but all current umd use that
> mode). So any hang/watchdog just kills the entire ctx and you don't have
> to worry about userspace doing something funny with it's ringbuffer.
> Simplifies everything.
> 
> Also ofc userspace fencing still disallowed, but since userspace would
> queu up all writes to its ringbuffer through the drm/scheduler, we'd
> handle dependencies through that still. Not great, but workable.
> 
> Thinking about this, not even mapping the ringbuffer r/o is required, it's
> just that we must queue things throug the kernel to resolve dependencies
> and everything without breaking dma_fence. If userspace lies, tdr will
> shoot it and the kernel stops running that context entirely.
> 
> So I think even if we have hw with 100% userspace submit model only we
> should be still fine. It's ofc silly, because instead of using userspace
> fences and gpu semaphores the hw scheduler understands we still take the
> detour through drm/scheduler, but at least it's not a break-the-wor

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Daniel Vetter
On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  wrote:
> > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach 
> > > >  wrote:
> > > > 
> > > > > > Ok. So that would only make the following use cases broken for now:
> > > > > > 
> > > > > > - amd render -> external gpu
> > > > > > - amd video encode -> network device
> > > > > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > > > > unhappy
> > > > I concur. I have quite a few users with a multi-GPU setup involving
> > > > AMD hardware.
> > > > 
> > > > Note, if this brokenness can't be avoided, I'd prefer a to get a clear
> > > > error, and not bad results on screen because nothing is synchronized
> > > > anymore.
> > > It's an upcoming requirement for windows[1], so you are likely to
> > > start seeing this across all GPU vendors that support windows.  I
> > > think the timing depends on how quickly the legacy hardware support
> > > sticks around for each vendor.
> > Yeah but hw scheduling doesn't mean the hw has to be constructed to not
> > support isolating the ringbuffer at all.
> > 
> > E.g. even if the hw loses the bit to put the ringbuffer outside of the
> > userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
> > pte flags. Otherwise the entire "share address space with cpu side,
> > seamlessly" thing is out of the window.
> > 
> > And with that r/o bit on the ringbuffer you can once more force submit
> > through kernel space, and all the legacy dma_fence based stuff keeps
> > working. And we don't have to invent some horrendous userspace fence based
> > implicit sync mechanism in the kernel, but can instead do this transition
> > properly with drm_syncobj timeline explicit sync and protocol reving.
> > 
> > At least I think you'd have to work extra hard to create a gpu which
> > cannot possibly be intercepted by the kernel, even when it's designed to
> > support userspace direct submit only.
> > 
> > Or are your hw engineers more creative here and we're screwed?
> 
> The upcomming hardware generation will have this hardware scheduler as a
> must have, but there are certain ways we can still stick to the old
> approach:
> 
> 1. The new hardware scheduler currently still supports kernel queues which
> essentially is the same as the old hardware ring buffer.
> 
> 2. Mapping the top level ring buffer into the VM at least partially solves
> the problem. This way you can't manipulate the ring buffer content, but the
> location for the fence must still be writeable.

Yeah, allowing userspace to lie about completion fences in this model is
ok. I haven't thought through the full consequences of that, but I
think it's not any worse than userspace lying about which buffers/addresses
it uses in the current model - we rely on hw vm ptes to catch that stuff.

Also it might be good to switch to a non-recoverable ctx model for these.
That's already what we do in i915 (opt-in, but all current umds use that
mode). So any hang/watchdog just kills the entire ctx and you don't have
to worry about userspace doing something funny with its ringbuffer.
Simplifies everything.
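
For reference, the i915 opt-in is just a context param; a minimal sketch,
error handling omitted:

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static int i915_ctx_set_non_recoverable(int fd, uint32_t ctx_id)
{
        struct drm_i915_gem_context_param p = {
                .ctx_id = ctx_id,
                .param  = I915_CONTEXT_PARAM_RECOVERABLE,
                .value  = 0,    /* any hang kills this context for good */
        };

        return ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
}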

Also ofc userspace fencing is still disallowed, but since userspace would
queue up all writes to its ringbuffer through the drm/scheduler, we'd
handle dependencies through that still. Not great, but workable.

Thinking about this, not even mapping the ringbuffer r/o is required, it's
just that we must queue things through the kernel to resolve dependencies
and everything without breaking dma_fence. If userspace lies, tdr will
shoot it and the kernel stops running that context entirely.

So I think even if we have hw with a 100% userspace-submit-only model we
should still be fine. It's ofc silly, because instead of using userspace
fences and gpu semaphores the hw scheduler understands we still take the
detour through drm/scheduler, but at least it's not a break-the-world
event.

Or do I miss something here?

> For now and the next hardware we are safe to support the old submission
> model, but the functionality of kernel queues will sooner or later go away
> if it is only for Linux.
> 
> So we need to work on something which works in the long term and get us away
> from this implicit sync.

Yeah I think we have pretty clear consensus on that goal, just no one has
yet volunteered to get going with the winsys/wayland work to plumb drm_syncobj
through, and the kernel/mesa work to make that optionally a userspace
fence underneath. And it's for sure a lot of work.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Daniel Vetter
On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  wrote:
> >
> > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach 
> >  wrote:
> >
> > > > Ok. So that would only make the following use cases broken for now:
> > > >
> > > > - amd render -> external gpu
> > > > - amd video encode -> network device
> > >
> > > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > > unhappy
> >
> > I concur. I have quite a few users with a multi-GPU setup involving
> > AMD hardware.
> >
> > Note, if this brokenness can't be avoided, I'd prefer a to get a clear
> > error, and not bad results on screen because nothing is synchronized
> > anymore.
> 
> It's an upcoming requirement for windows[1], so you are likely to
> start seeing this across all GPU vendors that support windows.  I
> think the timing depends on how quickly the legacy hardware support
> sticks around for each vendor.

Yeah but hw scheduling doesn't mean the hw has to be constructed to not
support isolating the ringbuffer at all.

E.g. even if the hw loses the bit to put the ringbuffer outside of the
userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
pte flags. Otherwise the entire "share address space with cpu side,
seamlessly" thing is out of the window.

And with that r/o bit on the ringbuffer you can once more force submit
through kernel space, and all the legacy dma_fence based stuff keeps
working. And we don't have to invent some horrendous userspace fence based
implicit sync mechanism in the kernel, but can instead do this transition
properly with drm_syncobj timeline explicit sync and protocol reving.

At least I think you'd have to work extra hard to create a gpu which
cannot possibly be intercepted by the kernel, even when it's designed to
support userspace direct submit only.

Or are your hw engineers more creative here and we're screwed?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Daniel Vetter
On Tue, Apr 27, 2021 at 06:27:27PM +, Simon Ser wrote:
> On Tuesday, April 27th, 2021 at 8:01 PM, Alex Deucher  
> wrote:
> 
> > It's an upcoming requirement for windows[1], so you are likely to
> > start seeing this across all GPU vendors that support windows. I
> > think the timing depends on how quickly the legacy hardware support
> > sticks around for each vendor.
> 
> Hm, okay.
> 
> Will using the existing explicit synchronization APIs make it work
> properly? (e.g. IN_FENCE_FD + OUT_FENCE_PTR in KMS, EGL_KHR_fence_sync +
> EGL_ANDROID_native_fence_sync + EGL_KHR_wait_sync in EGL)

If you have hw which really _only_ supports userspace direct submission
(i.e. the ringbuffer has to be in the same gpu vm as everything else by
design, and can't be protected at all with e.g. read-only pte entries)
then all that stuff would be broken.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Daniel Vetter
On Wed, Apr 28, 2021 at 11:07:09AM +0200, Michel Dänzer wrote:
> On 2021-04-28 8:59 a.m., Christian König wrote:
> > Hi Dave,
> > 
> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> >> Supporting interop with any device is always possible. It depends on which 
> >> drivers we need to interoperate with and update them. We've already found 
> >> the path forward for amdgpu. We just need to find out how many other 
> >> drivers need to be updated and evaluate the cost/benefit aspect.
> >>
> >> Marek
> >>
> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie  >> <mailto:airl...@gmail.com>> wrote:
> >>
> >> On Tue, 27 Apr 2021 at 22:06, Christian König
> >>  >> <mailto:ckoenig.leichtzumer...@gmail.com>> wrote:
> >> >
> >> > Correct, we wouldn't have synchronization between device with and 
> >> without user queues any more.
> >> >
> >> > That could only be a problem for A+I Laptops.
> >>
> >> Since I think you mentioned you'd only be enabling this on newer
> >> chipsets, won't it be a problem for A+A where one A is a generation
> >> behind the other?
> >>
> > 
> > Crap, that is a good point as well.
> > 
> >>
> >> I'm not really liking where this is going btw, seems like a ill
> >> thought out concept, if AMD is really going down the road of designing
> >> hw that is currently Linux incompatible, you are going to have to
> >> accept a big part of the burden in bringing this support in to more
> >> than just amd drivers for upcoming generations of gpu.
> >>
> > 
> > Well we don't really like that either, but we have no other option as far 
> > as I can see.
> 
> I don't really understand what "future hw may remove support for kernel
> queues" means exactly. While the per-context queues can be mapped to
> userspace directly, they don't *have* to be, do they? I.e. the kernel
> driver should be able to either intercept userspace access to the
> queues, or in the worst case do it all itself, and provide the existing
> synchronization semantics as needed?
> 
> Surely there are resource limits for the per-context queues, so the
> kernel driver needs to do some kind of virtualization / multi-plexing
> anyway, or we'll get sad user faces when there's no queue available for
> .
> 
> I'm probably missing something though, awaiting enlightenment. :)

Yeah, in all this discussion what's unclear to me is: is this a hard amdgpu
requirement going forward? In that case you need a time machine and lots
of people to retroactively fix this, because this ain't fast to get fixed.

Or is this just musing about an ecosystem that better fits current
hw, for which I think we all agree on the rough direction?

The former is quite a glorious situation, and I'm with Dave here that if
your hw engineers really removed the bit to not map the ringbuffers to
userspace, then amd gets to eat a big chunk of the cost here.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Daniel Vetter
On Wed, Apr 28, 2021 at 08:59:47AM +0200, Christian König wrote:
> Hi Dave,
> 
> Am 27.04.21 um 21:23 schrieb Marek Olšák:
> > Supporting interop with any device is always possible. It depends on
> > which drivers we need to interoperate with and update them. We've
> > already found the path forward for amdgpu. We just need to find out how
> > many other drivers need to be updated and evaluate the cost/benefit
> > aspect.
> > 
> > Marek
> > 
> > On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie  > <mailto:airl...@gmail.com>> wrote:
> > 
> > On Tue, 27 Apr 2021 at 22:06, Christian König
> >  > <mailto:ckoenig.leichtzumer...@gmail.com>> wrote:
> > >
> > > Correct, we wouldn't have synchronization between device with
> > and without user queues any more.
> > >
> > > That could only be a problem for A+I Laptops.
> > 
> > Since I think you mentioned you'd only be enabling this on newer
> > chipsets, won't it be a problem for A+A where one A is a generation
> > behind the other?
> > 
> 
> Crap, that is a good point as well.
> 
> > 
> > I'm not really liking where this is going btw, seems like a ill
> > thought out concept, if AMD is really going down the road of designing
> > hw that is currently Linux incompatible, you are going to have to
> > accept a big part of the burden in bringing this support in to more
> > than just amd drivers for upcoming generations of gpu.
> > 
> 
> Well we don't really like that either, but we have no other option as far as
> I can see.
> 
> I have a couple of ideas how to handle this in the kernel without
> dma_fences, but it always require more or less changes to all existing
> drivers.

Yeah one horrible idea is to essentially do the plan we hashed out for
adding userspace fences to drm_syncobj timelines. And then add drm_syncobj
as another implicit fencing thing to dma-buf.

But:
- This is horrible. We're all agreeing that implicit sync is not a great
  idea, building an entire new world on this flawed thing doesn't sound
  like a good path forward.

- It's kernel uapi, so it's going to be forever.

- It's only fixing the correctness issue, since you have to stall for
  future/indefinite fences at the beginning of the CS ioctl. Or at the
  beginning of the atomic modeset ioctl, which kinda defeats the point of
  nonblocking.

- You still have to touch all kmd drivers.

- For performance, you still have to glue a submit thread onto all gl
  drivers.

It is horrendous.
-Daniel

> 
> Christian.
> 
> > 
> > Dave.
> > 
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Daniel Vetter
On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák  wrote:
> Ok. I'll interpret this as "yes, it will work, let's do it".

It works if all you care about is drm/amdgpu. I'm not sure that's a
reasonable approach for upstream, but it definitely is an approach :-)

We've already gone somewhat through the pain of drm/amdgpu redefining
how implicit sync works without sufficiently talking with other
people, maybe we should avoid a repeat of this ...
-Daniel

>
> Marek
>
> On Tue., Apr. 27, 2021, 08:06 Christian König, 
>  wrote:
>>
>> Correct, we wouldn't have synchronization between device with and without 
>> user queues any more.
>>
>> That could only be a problem for A+I Laptops.
>>
>> Memory management will just work with preemption fences which pause the user 
>> queues of a process before evicting something. That will be a dma_fence, but 
>> also a well known approach.
>>
>> Christian.
>>
>> Am 27.04.21 um 13:49 schrieb Marek Olšák:
>>
>> If we don't use future fences for DMA fences at all, e.g. we don't use them 
>> for memory management, it can work, right? Memory management can suspend 
>> user queues anytime. It doesn't need to use DMA fences. There might be 
>> something that I'm missing here.
>>
>> What would we lose without DMA fences? Just inter-device synchronization? I 
>> think that might be acceptable.
>>
>> The only case when the kernel will wait on a future fence is before a page 
>> flip. Everything today already depends on userspace not hanging the gpu, 
>> which makes everything a future fence.
>>
>> Marek
>>
>> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,  wrote:
>>>
>>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
>>> > Thanks everybody. The initial proposal is dead. Here are some thoughts on
>>> > how to do it differently.
>>> >
>>> > I think we can have direct command submission from userspace via
>>> > memory-mapped queues ("user queues") without changing window systems.
>>> >
>>> > The memory management doesn't have to use GPU page faults like HMM.
>>> > Instead, it can wait for user queues of a specific process to go idle and
>>> > then unmap the queues, so that userspace can't submit anything. Buffer
>>> > evictions, pinning, etc. can be executed when all queues are unmapped
>>> > (suspended). Thus, no BO fences and page faults are needed.
>>> >
>>> > Inter-process synchronization can use timeline semaphores. Userspace will
>>> > query the wait and signal value for a shared buffer from the kernel. The
>>> > kernel will keep a history of those queries to know which process is
>>> > responsible for signalling which buffer. There is only the wait-timeout
>>> > issue and how to identify the culprit. One of the solutions is to have the
>>> > GPU send all GPU signal commands and all timed out wait commands via an
>>> > interrupt to the kernel driver to monitor and validate userspace behavior.
>>> > With that, it can be identified whether the culprit is the waiting process
>>> > or the signalling process and which one. Invalid signal/wait parameters 
>>> > can
>>> > also be detected. The kernel can force-signal only the semaphores that 
>>> > time
>>> > out, and punish the processes which caused the timeout or used invalid
>>> > signal/wait parameters.
>>> >
>>> > The question is whether this synchronization solution is robust enough for
>>> > dma_fence and whatever the kernel and window systems need.
>>>
>>> The proper model here is the preempt-ctx dma_fence that amdkfd uses
>>> (without page faults). That means dma_fence for synchronization is doa, at
>>> least as-is, and we're back to figuring out the winsys problem.
>>>
>>> "We'll solve it with timeouts" is very tempting, but doesn't work. It's
>>> akin to saying that we're solving deadlock issues in a locking design by
>>> doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
>>> avoids having to reach the reset button, but that's about it.
>>>
>>> And the fundamental problem is that once you throw in userspace command
>>> submission (and syncing, at least within the userspace driver, otherwise
>>> there's kinda no point if you still need the kernel for cross-engine sync)
>>> means you get deadlocks if you still use dma_fence for sync under
>>> perfectly legit use-case. We've discu

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Daniel Vetter
On Tue, Apr 27, 2021 at 1:49 PM Marek Olšák  wrote:
>
> If we don't use future fences for DMA fences at all, e.g. we don't use them 
> for memory management, it can work, right? Memory management can suspend user 
> queues anytime. It doesn't need to use DMA fences. There might be something 
> that I'm missing here.

Other drivers use dma_fence for their memory management. So unless
you've converted them all over to the dma_fence/memory fence split,
dma_fence fences stay memory fences. In theory this is possible, but
maybe not if you want to complete the job this decade :-)

> What would we lose without DMA fences? Just inter-device synchronization? I 
> think that might be acceptable.
>
> The only case when the kernel will wait on a future fence is before a page 
> flip. Everything today already depends on userspace not hanging the gpu, 
> which makes everything a future fence.

That's not quite what we defined as future fences, because tdr
guarantees those complete, even if userspace hangs. It's when you put
userspace fence waits into the cs buffer you've submitted to the
kernel (or directly to hw) where the "real" future fences kick in.
-Daniel

>
> Marek
>
> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,  wrote:
>>
>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
>> > Thanks everybody. The initial proposal is dead. Here are some thoughts on
>> > how to do it differently.
>> >
>> > I think we can have direct command submission from userspace via
>> > memory-mapped queues ("user queues") without changing window systems.
>> >
>> > The memory management doesn't have to use GPU page faults like HMM.
>> > Instead, it can wait for user queues of a specific process to go idle and
>> > then unmap the queues, so that userspace can't submit anything. Buffer
>> > evictions, pinning, etc. can be executed when all queues are unmapped
>> > (suspended). Thus, no BO fences and page faults are needed.
>> >
>> > Inter-process synchronization can use timeline semaphores. Userspace will
>> > query the wait and signal value for a shared buffer from the kernel. The
>> > kernel will keep a history of those queries to know which process is
>> > responsible for signalling which buffer. There is only the wait-timeout
>> > issue and how to identify the culprit. One of the solutions is to have the
>> > GPU send all GPU signal commands and all timed out wait commands via an
>> > interrupt to the kernel driver to monitor and validate userspace behavior.
>> > With that, it can be identified whether the culprit is the waiting process
>> > or the signalling process and which one. Invalid signal/wait parameters can
>> > also be detected. The kernel can force-signal only the semaphores that time
>> > out, and punish the processes which caused the timeout or used invalid
>> > signal/wait parameters.
>> >
>> > The question is whether this synchronization solution is robust enough for
>> > dma_fence and whatever the kernel and window systems need.
>>
>> The proper model here is the preempt-ctx dma_fence that amdkfd uses
>> (without page faults). That means dma_fence for synchronization is doa, at
>> least as-is, and we're back to figuring out the winsys problem.
>>
>> "We'll solve it with timeouts" is very tempting, but doesn't work. It's
>> akin to saying that we're solving deadlock issues in a locking design by
>> doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
>> avoids having to reach the reset button, but that's about it.
>>
>> And the fundamental problem is that once you throw in userspace command
>> submission (and syncing, at least within the userspace driver, otherwise
>> there's kinda no point if you still need the kernel for cross-engine sync)
>> means you get deadlocks if you still use dma_fence for sync under
>> perfectly legit use-case. We've discussed that one ad nauseam last summer:
>>
>> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences
>>
>> See silly diagramm at the bottom.
>>
>> Now I think all isn't lost, because imo the first step to getting to this
>> brave new world is rebuilding the driver on top of userspace fences, and
>> with the adjusted cmd submit model. You probably don't want to use amdkfd,
>> but port that as a context flag or similar to render nodes for gl/vk. Of
>> course that means you can only use this mode in headless, without
>> glx/wayland winsys support, but it's a start.
>> -Daniel
>>
>> >
>> > Marek
>

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Daniel Vetter
On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
> Thanks everybody. The initial proposal is dead. Here are some thoughts on
> how to do it differently.
> 
> I think we can have direct command submission from userspace via
> memory-mapped queues ("user queues") without changing window systems.
> 
> The memory management doesn't have to use GPU page faults like HMM.
> Instead, it can wait for user queues of a specific process to go idle and
> then unmap the queues, so that userspace can't submit anything. Buffer
> evictions, pinning, etc. can be executed when all queues are unmapped
> (suspended). Thus, no BO fences and page faults are needed.
> 
> Inter-process synchronization can use timeline semaphores. Userspace will
> query the wait and signal value for a shared buffer from the kernel. The
> kernel will keep a history of those queries to know which process is
> responsible for signalling which buffer. There is only the wait-timeout
> issue and how to identify the culprit. One of the solutions is to have the
> GPU send all GPU signal commands and all timed out wait commands via an
> interrupt to the kernel driver to monitor and validate userspace behavior.
> With that, it can be identified whether the culprit is the waiting process
> or the signalling process and which one. Invalid signal/wait parameters can
> also be detected. The kernel can force-signal only the semaphores that time
> out, and punish the processes which caused the timeout or used invalid
> signal/wait parameters.
> 
> The question is whether this synchronization solution is robust enough for
> dma_fence and whatever the kernel and window systems need.

The proper model here is the preempt-ctx dma_fence that amdkfd uses
(without page faults). That means dma_fence for synchronization is doa, at
least as-is, and we're back to figuring out the winsys problem.

"We'll solve it with timeouts" is very tempting, but doesn't work. It's
akin to saying that we're solving deadlock issues in a locking design by
doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
avoids having to reach the reset button, but that's about it.

And the fundamental problem is that once you throw in userspace command
submission (and syncing, at least within the userspace driver, otherwise
there's kinda no point if you still need the kernel for cross-engine sync)
means you get deadlocks if you still use dma_fence for sync under
perfectly legit use-case. We've discussed that one ad nauseam last summer:

https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences

See the silly diagram at the bottom.

Now I think all isn't lost, because imo the first step to getting to this
brave new world is rebuilding the driver on top of userspace fences, and
with the adjusted cmd submit model. You probably don't want to use amdkfd,
but port that as a context flag or similar to render nodes for gl/vk. Of
course that means you can only use this mode in headless, without
glx/wayland winsys support, but it's a start.
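
To make "userspace fence" concrete: it's just a GPU-visible counter plus a
target value, with the wait being a bounded CPU-side poll. The names below
are made up for illustration, this is not an existing uapi:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

struct user_fence {
        _Atomic uint64_t *counter;      /* mapped GPU-visible memory */
        uint64_t wait_value;            /* signaled when *counter >= this */
};

static bool user_fence_wait(const struct user_fence *f, int64_t timeout_ns)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        int64_t deadline = (int64_t)ts.tv_sec * 1000000000ll + ts.tv_nsec +
                           timeout_ns;

        /* A real implementation would futex-wait or sleep instead of spinning. */
        while (atomic_load_explicit(f->counter, memory_order_acquire) <
               f->wait_value) {
                clock_gettime(CLOCK_MONOTONIC, &ts);
                int64_t now = (int64_t)ts.tv_sec * 1000000000ll + ts.tv_nsec;
                if (now >= deadline)
                        return false;   /* caller decides how to punish/skip */
        }
        return true;
}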
-Daniel

> 
> Marek
> 
> On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone  wrote:
> 
> > Hi,
> >
> > On Tue, 20 Apr 2021 at 20:30, Daniel Vetter  wrote:
> >
> >> The thing is, you can't do this in drm/scheduler. At least not without
> >> splitting up the dma_fence in the kernel into separate memory fences
> >> and sync fences
> >
> >
> > I'm starting to think this thread needs its own glossary ...
> >
> > I propose we use 'residency fence' for execution fences which enact
> > memory-residency operations, e.g. faulting in a page ultimately depending
> > on GPU work retiring.
> >
> > And 'value fence' for the pure-userspace model suggested by timeline
> > semaphores, i.e. fences being (*addr == val) rather than being able to look
> > at ctx seqno.
> >
> > Cheers,
> > Daniel
> > ___
> > mesa-dev mailing list
> > mesa-dev@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> >

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 1/9] drm/doc/rfc: i915 DG1 uAPI

2021-04-26 Thread Daniel Vetter
On Mon, Apr 26, 2021 at 11:25:09AM -0500, Jason Ekstrand wrote:
> On Mon, Apr 26, 2021 at 10:31 AM Matthew Auld  wrote:
> >
> > On 26/04/2021 16:11, Jason Ekstrand wrote:
> > > On Mon, Apr 26, 2021 at 4:42 AM Matthew Auld  
> > > wrote:
> > >>
> > >> Add an entry for the new uAPI needed for DG1. Also add the overall
> > >> upstream plan, including some notes for the TTM conversion.
> > >>
> > >> v2(Daniel):
> > >>- include the overall upstreaming plan
> > >>- add a note for mmap, there are differences here for TTM vs i915
> > >>- bunch of other suggestions from Daniel
> > >> v3:
> > >>   (Daniel)
> > >>- add a note for set/get caching stuff
> > >>- add some more docs for existing query and extensions stuff
> > >>- add an actual code example for regions query
> > >>- bunch of other stuff
> > >>   (Jason)
> > >>- uAPI change(!):
> > >>  - try a simpler design with the placements extension
> > >>  - rather than have a generic setparam which can cover multiple
> > >>use cases, have each extension be responsible for one thing
> > >>only
> > >> v4:
> > >>   (Daniel)
> > >>- add some more notes for ttm conversion
> > >>- bunch of other stuff
> > >>   (Jason)
> > >>- uAPI change(!):
> > >>  - drop all the extra rsvd members for the region_query and
> > >>region_info, just keep the bare minimum needed for padding
> > >>
> > >> Signed-off-by: Matthew Auld 
> > >> Cc: Joonas Lahtinen 
> > >> Cc: Thomas Hellström 
> > >> Cc: Daniele Ceraolo Spurio 
> > >> Cc: Lionel Landwerlin 
> > >> Cc: Jon Bloomfield 
> > >> Cc: Jordan Justen 
> > >> Cc: Daniel Vetter 
> > >> Cc: Kenneth Graunke 
> > >> Cc: Jason Ekstrand 
> > >> Cc: Dave Airlie 
> > >> Cc: dri-de...@lists.freedesktop.org
> > >> Cc: mesa-dev@lists.freedesktop.org
> > >> Acked-by: Daniel Vetter 
> > >> Acked-by: Dave Airlie 
> > >> ---
> > >>   Documentation/gpu/rfc/i915_gem_lmem.h   | 212 
> > >>   Documentation/gpu/rfc/i915_gem_lmem.rst | 130 +++
> > >>   Documentation/gpu/rfc/index.rst |   4 +
> > >>   3 files changed, 346 insertions(+)
> > >>   create mode 100644 Documentation/gpu/rfc/i915_gem_lmem.h
> > >>   create mode 100644 Documentation/gpu/rfc/i915_gem_lmem.rst
> > >>
> > >> diff --git a/Documentation/gpu/rfc/i915_gem_lmem.h 
> > >> b/Documentation/gpu/rfc/i915_gem_lmem.h
> > >> new file mode 100644
> > >> index ..7ed59b6202d5
> > >> --- /dev/null
> > >> +++ b/Documentation/gpu/rfc/i915_gem_lmem.h
> > >> @@ -0,0 +1,212 @@
> > >> +/**
> > >> + * enum drm_i915_gem_memory_class - Supported memory classes
> > >> + */
> > >> +enum drm_i915_gem_memory_class {
> > >> +   /** @I915_MEMORY_CLASS_SYSTEM: System memory */
> > >> +   I915_MEMORY_CLASS_SYSTEM = 0,
> > >> +   /** @I915_MEMORY_CLASS_DEVICE: Device local-memory */
> > >> +   I915_MEMORY_CLASS_DEVICE,
> > >> +};
> > >> +
> > >> +/**
> > >> + * struct drm_i915_gem_memory_class_instance - Identify particular 
> > >> memory region
> > >> + */
> > >> +struct drm_i915_gem_memory_class_instance {
> > >> +   /** @memory_class: See enum drm_i915_gem_memory_class */
> > >> +   __u16 memory_class;
> > >> +
> > >> +   /** @memory_instance: Which instance */
> > >> +   __u16 memory_instance;
> > >> +};
> > >> +
> > >> +/**
> > >> + * struct drm_i915_memory_region_info - Describes one region as known 
> > >> to the
> > >> + * driver.
> > >> + *
> > >> + * Note that we reserve some stuff here for potential future work. As 
> > >> an example
> > >> + * we might want expose the capabilities(see @caps) for a given region, 
> > >> which
> > >> + * could include things like if the region is CPU mappable/accessible, 
> > >> what are
> > >> + * the supported mapping types etc.
> > >
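
As a sketch of how userspace would consume such a region list, here is the
standard two-step i915 query flow, written against the
DRM_I915_QUERY_MEMORY_REGIONS id as it eventually landed in the merged
uapi; error handling is trimmed:

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static struct drm_i915_query_memory_regions *query_regions(int fd)
{
        struct drm_i915_query_item item = {
                .query_id = DRM_I915_QUERY_MEMORY_REGIONS,
        };
        struct drm_i915_query q = {
                .num_items = 1,
                .items_ptr = (uintptr_t)&item,
        };

        /* First pass: length == 0 asks the kernel how big the blob is. */
        if (ioctl(fd, DRM_IOCTL_I915_QUERY, &q) || item.length <= 0)
                return NULL;

        struct drm_i915_query_memory_regions *info = calloc(1, item.length);
        item.data_ptr = (uintptr_t)info;

        /* Second pass: kernel fills in num_regions + regions[]. */
        if (ioctl(fd, DRM_IOCTL_I915_QUERY, &q)) {
                free(info);
                return NULL;
        }
        return info;
}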

Re: [Mesa-dev] [PATCH v3 4/4] drm/doc/rfc: i915 DG1 uAPI

2021-04-21 Thread Daniel Vetter
On Wed, Apr 21, 2021 at 8:28 PM Tvrtko Ursulin
 wrote:
> On 21/04/2021 18:17, Jason Ekstrand wrote:
> > On Wed, Apr 21, 2021 at 9:25 AM Tvrtko Ursulin
> >  wrote:
> >> On 21/04/2021 14:54, Jason Ekstrand wrote:
> >>> On Wed, Apr 21, 2021 at 3:22 AM Tvrtko Ursulin
> >>>  wrote:
> >>>> On 20/04/2021 18:00, Jason Ekstrand wrote:
> >>>> I am not claiming to know memory region query will end up the same, and
> >>>> I definitely agree we cannot guess the future. I am just saying rsvd
> >>>> fields are inconsequential really in terms of maintenance burden and
> >>>> have been proven useful in the past. So I disagree with the drive to
> >>>> kick them all out.
> >>>
> >>> Sure, it doesn't cost anything to have extra zeros in the struct.
> >>> However, if/when the API grows using rsvd fields, we end up with "if
> >>> CAP_FOO is set, rsvd[5] means blah" which makes for a horribly
> >>> confusing API.  As a userspace person who has to remember how to use
> >>> this stuff, I'd rather make another call or chain in a struct than try
> >>> to remember and/or figure out what all 8 rsvd fields mean.
> >>
> >> Well it's not called rsvd in the uapi which is aware of the new field
> >> but has a new name.
> >
> > Are we allowed to do that?  This is a genuine question.  When I've
> > tried in the past (cliprects), I was told we couldn't rename it even
> > though literally no one had used it in code for years.
>
> Well we did the union for pad_to_size so I thought we are allowed that
> trick at least. From my experience backward source level compatibility
> is not always there even with things like glibc. Despite that, are we
> generally required to stay backward source compatible I will not claim
> either way.

I think the anonymous union with an exactly same-sized field is ok. We
also try hard to be source compatible, but we have screwed up in the
past and shrugged it off. The one example that comes to mind is
extended structures at the bottom with a new field, which the kernel
automatically zero-extends for old userspace. But when you recompile,
your new-old userspace might no longer clear the new fields because
the ioctl code didn't start out by memset()ing the entire struct.

But by and large we managed to not botch things up on source compat, though
it's definitely a lot trickier than ABI compat.
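
For illustration, the rename-via-anonymous-union trick being discussed,
with made-up struct and field names: old userspace compiled against the
reserved name keeps building, new userspace can use the new name, and the
ABI layout is unchanged.

#include <linux/types.h>

struct drm_foo_create {
        __u32 handle;
        __u32 flags;
        union {
                __u64 rsvd;             /* original name, kept for source compat */
                __u64 pad_to_size;      /* new meaning, same size and offset */
        };
};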
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Vetter
On Tue, Apr 20, 2021 at 9:14 PM Daniel Stone  wrote:
>
> Hi,
>
> On Tue, 20 Apr 2021 at 19:54, Daniel Vetter  wrote:
>>
>> So I can mostly get behind this, except it's _not_ going to be
>> dma_fence. That thing has horrendous internal ordering constraints
>> within the kernel, and the one thing that doesn't allow you is to make
>> a dma_fence depend upon a userspace fence.
>>
>> But what we can do is use the same currently existing container
>> objects like drm_syncobj or sync_file (timeline syncobj would fit best
>> tbh), and stuff a userspace fence behind it. The only trouble is that
>> currently timeline syncobj implement vulkan's spec, which means if you
>> build a wait-before-signal deadlock, you'll wait forever. Well until
>> the user ragequits and kills your process.
>>
>> So for winsys we'd need to be able to specify the wait timeout
>> somewhere for waiting for that dma_fence to materialize (plus the
>> submit thread, but userspace needs that anyway to support timeline
>> syncobj) if you're importing an untrusted timeline syncobj. And I
>> think that's roughly it.
>
>
> Right. The only way you get to materialise a dma_fence from an execbuf is 
> that you take a hard timeout, with a penalty for not meeting that timeout. 
> When I say dma_fence I mean dma_fence, because there is no extant winsys 
> support for drm_symcobj, so this is greenfield: the winsys gets to specify 
> its terms of engagement, and again, we've been the orange/green-site enemies 
> of users for quite some time already, so we're happy to continue doing so. If 
> the actual underlying primitive is not a dma_fence, and 
> compositors/protocol/clients need to eat a bunch of typing to deal with a 
> different primitive which offers the same guarantees, then that's fine, as 
> long as there is some tangible whole-of-system benefit.

So atm sync_file doesn't support future fences, but we could add the
support for those there. And since vulkan doesn't really say anything
about those, we could make the wait time out by default.

> How that timeout is actually realised is an implementation detail. Whether 
> it's a property of the last GPU job itself that the CPU-side driver can 
> observe, or that the kernel driver guarantees that there is a GPU job 
> launched in parallel which monitors the memory-fence status and reports back 
> through a mailbox/doorbell, or the CPU-side driver enqueues kqueue work for 
> $n milliseconds' time to check the value in memory and kill the context if it 
> doesn't meet expectations - whatever. I don't believe any of those choices 
> meaningfully impact on kernel driver complexity relative to the initial 
> proposal, but they do allow us to continue to provide the guarantees we do 
> today when buffers cross security boundaries.

The thing is, you can't do this in drm/scheduler. At least not without
splitting up the dma_fence in the kernel into separate memory fences
and sync fences, and the work to get there is imo just not worth it.
We've bikeshedded this ad nauseam for vk timeline syncobj, and the
solution was to have the submit thread in the userspace driver.
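
Very roughly, that submit-thread pattern looks like the sketch below;
queue_pop()/do_execbuf() are placeholders for driver internals:

#include <stdint.h>
#include <pthread.h>
#include <xf86drm.h>

struct submit {
        uint32_t *wait_syncobjs;        /* drm_syncobj handles */
        uint64_t *wait_points;
        unsigned  num_waits;
        /* ... the actual command buffer ... */
};

extern struct submit *queue_pop(void);          /* placeholder */
extern void do_execbuf(struct submit *s);       /* placeholder */

static void *submit_thread(void *arg)
{
        int fd = *(int *)arg;
        struct submit *s;

        while ((s = queue_pop())) {
                /* Block until every wait point exists and has signaled;
                 * INT64_MAX == vulkan semantics, i.e. wait forever. */
                drmSyncobjTimelineWait(fd, s->wait_syncobjs, s->wait_points,
                                       s->num_waits, INT64_MAX,
                                       DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL |
                                       DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT,
                                       NULL);
                do_execbuf(s);
        }
        return NULL;
}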

It won't really change anything wrt what applications can observe from
the egl/gl side of things though.

> There might well be an argument for significantly weakening those security 
> boundaries and shifting the complexity from the DRM scheduler into userspace 
> compositors. So far though, I have yet to see that argument made coherently.

Ah, we've had that argument. We have moved that into userspace as part
of vk submit threads. It ain't pretty, but it's better than the other
option :-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Vetter
On Tue, Apr 20, 2021 at 9:17 PM Jason Ekstrand  wrote:
>
> On Tue, Apr 20, 2021 at 1:54 PM Daniel Vetter  wrote:
> >
> > On Tue, Apr 20, 2021 at 7:45 PM Daniel Stone  wrote:
> >
> > > And something more concrete:
> > >
> > > dma_fence.
> > >
> > > This already has all of the properties described above. Kernel-wise, it 
> > > already devolves to CPU-side signaling when it crosses device boundaries. 
> > > We need to support it roughly forever since it's been plumbed so far and 
> > > so wide. Any primitive which is acceptable for winsys-like usage which 
> > > crosses so many device/subsystem/process/security boundaries has to meet 
> > > the same requirements. So why reinvent something which looks so similar, 
> > > and has the same requirements of the kernel babysitting completion, 
> > > providing little to no benefit for that difference?
> >
> > So I can mostly get behind this, except it's _not_ going to be
> > dma_fence. That thing has horrendous internal ordering constraints
> > within the kernel, and the one thing that doesn't allow you is to make
> > a dma_fence depend upon a userspace fence.
>
> Let me elaborate on this a bit.  One of the problems I mentioned
> earlier is the conflation of fence types inside the kernel.  dma_fence
> is used for solving two different semi-related but different problems:
> client command synchronization and memory residency synchronization.
> In the old implicit GL world, we conflated these two and thought we
> were providing ourselves a service.  Not so much
>
> It's all well and good to say that we should turn the memory fence
> into a dma_fence and throw a timeout on it.  However, these
> window-system sync primitives, as you said, have to be able to be
> shared across everything.  In particular, we have to be able to share
> them with drivers that don't make a good separation between command
> and memory synchronization.
>
> Let's say we're rendering on ANV with memory fences and presenting on
> some USB display adapter whose kernel driver is a bit old-school.
> When we pass that fence to the other driver via a sync_file or
> similar, that driver may shove that dma_fence into the dma_resv on
> some buffer somewhere.  Then our client, completely unaware of
> internal kernel dependencies, binds that buffer into its address space
> and kicks off another command buffer.  So i915 throws in a dependency
> on that dma_resv which contains the previously created dma_fence and
> refuses to execute any more command buffers until it signals.
> Unfortunately, unbeknownst to i915, that command buffer which the
> client kicked off after doing that bind was required for signaling the
> memory fence on which our first dma_fence depends.  Deadlock.

Nope. Because the waiting for this future fence will only happen in two places:
- driver submit thread, which is just userspace without holding
anything. From the kernel pov this can be preempted, memory
temporarily taken away, all these things. Until that's done you will
_not_ get a real dma_fence, but just another future fence.
- but what about the usb display, you're asking? Well, for that we'll
need a new atomic extension, which takes a timeline syncobj and gives
you back a timeline syncobj. And the rules are that if one of them is a
future fence/userspace fence, so is the other (even if it's created
by the kernel).

Either way you get a timeline syncobj back which anv can then again
handle properly with its submit thread. Not a dma_fence with a funny
timeout, because there are deadlock issues with those.

So no, you won't be able to get a dma_fence out of your sleight of hand here.

> Sure, we put a timeout on the dma_fence and it will eventually fire
> and unblock everything.  However, there's one very important point
> that's easy to miss here:  Neither i915 nor the client did anything
> wrong in the above scenario.  The Vulkan footgun approach works
> because there are a set of rules and, if you follow those rules,
> you're guaranteed everything works.  In the above scenario, however,
> the client followed all of the rules and got a deadlock anyway.  We
> can't have that.
>
>
> > But what we can do is use the same currently existing container
> > objects like drm_syncobj or sync_file (timeline syncobj would fit best
> > tbh), and stuff a userspace fence behind it. The only trouble is that
> > currently timeline syncobj implement vulkan's spec, which means if you
> > build a wait-before-signal deadlock, you'll wait forever. Well until
> > the user ragequits and kills your process.
>
> Yeah, it may be that this approach can be made to work.  Instead of
> reusing dma_fence, maybe we can reuse syncobj and have another 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Vetter
 
> context and permanently -EIO, whatever) the client. Maybe for future hardware 
> that would be the same thing - the kernel setting a timeout and comparing a 
> read on a particular address against a particular value - but the 'present 
> fence' proposal seems like it requires exactly this anyway.

Yeah, a return fence for flips/presents sounds unappealing. Android tried
it, we convinced them it's not great, and they changed that.

> That to me is the best compromise. We allow clients complete arbitrary 
> flexibility, but as soon as they vkQueuePresentKHR, they're crossing a 
> boundary out of happy fun GPU land and into strange hostile winsys land. 
> We've got a lot of practice at being the bad guys who hate users and are 
> always trying to ruin their dreams, so we'll happily wear the impact of 
> continuing to do that. In doing so, we collectively don't have to invent a 
> third new synchronisation primitive (to add to dma_fence and drm_syncobj) and 
> a third new synchronisation model (implicit sync, explicit-but-bounded sync, 
> explicit-and-maybe-unbounded sync) to support this, and we don't have to do 
> an NT4 where GDI was shoved into the kernel.
>
> It doesn't help with the goal of ridding dma_fence from the kernel, but it 
> does very clearly segregate the two worlds. Drawing that hard boundary would 
> allow drivers to hyperoptimise for clients which want to be extremely clever 
> and agile and quick because they're sailing so close to the wind that they 
> cannot bear the overhead of dma_fence, whilst also providing the guarantees 
> we need when crossing isolation boundaries. In the latter case, the overhead 
> of bouncing into a less-optimised primitive is totally acceptable because 
> it's not even measurable: vkQueuePresentKHR requires client CPU activity -> 
> kernel IPC -> compositor CPU activity -> wait for repaint cycle -> prepare 
> scene -> composition, against which dma_fence overhead isn't and will never 
> be measurable (even if it doesn't cross device/subsystem boundaries, which it 
> probably does). And the converse for vkAcquireNextImageKHR.
>
> tl;dr: we don't need to move winsys into the kernel, winsys and compute don't 
> need to share sync primitives, the client/winsys boundary does need to have a 
> sync primitive does need strong and onerous guarantees, and that transition 
> can be several orders of magnitude less efficient than intra-client sync 
> primitives
>
> Shoot me down. :)

So I can mostly get behind this, except it's _not_ going to be
dma_fence. That thing has horrendous internal ordering constraints
within the kernel, and the one thing it doesn't allow you to do is make
a dma_fence depend upon a userspace fence.

But what we can do is use the same currently existing container
objects like drm_syncobj or sync_file (timeline syncobj would fit best
tbh), and stuff a userspace fence behind it. The only trouble is that
currently timeline syncobj implement vulkan's spec, which means if you
build a wait-before-signal deadlock, you'll wait forever. Well until
the user ragequits and kills your process.

So for winsys we'd need to be able to specify the wait timeout
somewhere for waiting for that dma_fence to materialize (plus the
submit thread, but userspace needs that anyway to support timeline
syncobj) if you're importing an untrusted timeline syncobj. And I
think that's roughly it.
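
A minimal sketch of that bounded "materialize" wait, assuming the timeline
wait ioctl's WAIT_AVAILABLE flag via libdrm:

#include <stdint.h>
#include <xf86drm.h>

static int wait_point_materialized(int fd, uint32_t syncobj, uint64_t point,
                                   int64_t timeout_ns)
{
        /* WAIT_AVAILABLE waits only until a real dma_fence exists at this
         * point, not until it signals; the timeout is the compositor's
         * policy knob for untrusted imports. Returns 0 on success or a
         * negative error (e.g. -ETIME) if the client never submitted. */
        return drmSyncobjTimelineWait(fd, &syncobj, &point, 1, timeout_ns,
                                      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE,
                                      NULL);
}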

The fancy version would allow you to access the underlying memory
fence from the cmd streamer and do fancy conditional rendering and fun
stuff like that (pick the old/new frame depending on which one is ready), but
that's the fancy advanced compositor on top here. The "give me the
same thing as I get with dma_fence implicit sync today" case would just
need the timeout for importing untrusted timeline syncobj.

So a vk extension, and also probably a gl extension for timeline
syncobj (not sure that exists already), which probably wants to
specify the reasonable timeout limit by default. Because that's more
the gl way of doing things.

Oh also I really don't want to support this for implicit sync, but
heck we could even do that. It would stall pretty bad because there's
no submit thread in userspace. But we could then optimize that with
some new dma-buf ioctl to get out the syncobj, kinda like what Jason
has already proposed for sync_file or so. And then userspace which has
a submit thread could handle it correctly.
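
Something like this, going off the sync_file variant Jason proposed, shown
here with the names it eventually landed under in <linux/dma-buf.h>; treat
it as a sketch:

#include <sys/ioctl.h>
#include <linux/dma-buf.h>

static int dmabuf_export_sync_file(int dmabuf_fd)
{
        struct dma_buf_export_sync_file args = {
                .flags = DMA_BUF_SYNC_READ | DMA_BUF_SYNC_WRITE,
                .fd = -1,
        };

        /* Pulls the current implicit fences out of the dma-buf as a
         * sync_file fd that a submit thread can then wait on. */
        if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args))
                return -1;
        return args.fd;
}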
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Vetter
?) and the userspace 
>> Vulkan driver will somehow wait on this value  either before submitting work 
>> or as a possibly-hardware-assisted GPU-side wait (?)
>> * the kernel's scheduler is thus eliminated from the equation, and every 
>> execbuf is submitted directly to hardware, because either userspace knows 
>> that the fence has already been signaled, or it will issue a GPU-side wait 
>> (?)
>> * but the kernel is still required to monitor completion of every fence 
>> itself, so it can forcibly complete, or penalise the client (?)
>>
>> Lastly, let's say we stop ignoring KMS: what happens for the 
>> render-with-GPU-display-on-KMS case? Do we need to do the equivalent of 
>> glFinish() in userspace and only submit the KMS atomic request when the GPU 
>> work has fully retired?
>>
>> Clarifying those points would be really helpful so this is less of a 
>> strawman. I have some further opinions, but I'm going to wait until I 
>> understand what I'm actually arguing against before I go too far. :) The 
>> last point is very salient though.
>>
>> Cheers,
>> Daniel
>
> ___
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Vetter
On Tue, Apr 20, 2021 at 1:59 PM Christian König
 wrote:
>
> > Yeah. If we go with userspace fences, then userspace can hang itself. Not
> > the kernel's problem.
>
> Well, the path of inner peace begins with four words. “Not my fucking
> problem.”
>
> But I'm not that much concerned about the kernel, but rather about
> important userspace processes like X, Wayland, SurfaceFlinger etc...
>
> I mean attaching a page to a sync object and allowing to wait/signal
> from both CPU as well as GPU side is not so much of a problem.
>
> > You have to somehow handle that, e.g. perhaps with conditional
> > rendering and just using the old frame in compositing if the new one
> > doesn't show up in time.
>
> Nice idea, but how would you handle that on the OpenGL/Glamor/Vulkan level.

For opengl we give all the same guarantees, so if you get one of these
you just block until the fence is signalled. Doing that properly means
a submit thread to support drm_syncobj, like for vulkan.

For vulkan we probably want to represent these as proper vk timeline
objects, and the vulkan way is to just let the application (well,
compositor) here deal with it. If they import timelines from untrusted
other parties, they need to handle the potential fallback of being
lied to. How they do that is "not vulkan's fucking problem", because
"with great power (well, performance) comes great responsibility" is
the entire vk design paradigm.

Glamor will just rely on GL providing a nice packaging of the harsh
reality of gpus, as usual.

So I guess step 1 here for GL would be to provide some kind of
import/export of timeline syncobj, including properly handling this
"future/indefinite fences" aspect of them with a submit thread and
everything.
-Daniel

>
> Regards,
> Christian.
>
> Am 20.04.21 um 13:16 schrieb Daniel Vetter:
> > On Tue, Apr 20, 2021 at 07:03:19AM -0400, Marek Olšák wrote:
> >> Daniel, are you suggesting that we should skip any deadlock prevention in
> >> the kernel, and just let userspace wait for and signal any fence it has
> >> access to?
> > Yeah. If we go with userspace fences, then userspace can hang itself. Not
> > the kernel's problem. The only criteria is that the kernel itself must
> > never rely on these userspace fences, except for stuff like implementing
> > optimized cpu waits. And in those we must always guarantee that the
> > userspace process remains interruptible.
> >
> > It's a completely different world from dma_fence based kernel fences,
> > whether those are implicit or explicit.
> >
> >> Do you have any concern with the deprecation/removal of BO fences in the
> >> kernel assuming userspace is only using explicit fences? Any concern with
> >> the submit and return fences for modesetting and other producer<->consumer
> >> scenarios?
> > Let me work on the full replay for your rfc first, because there's a lot
> > of details here and nuance.
> > -Daniel
> >
> >> Thanks,
> >> Marek
> >>
> >> On Tue, Apr 20, 2021 at 6:34 AM Daniel Vetter  wrote:
> >>
> >>> On Tue, Apr 20, 2021 at 12:15 PM Christian König
> >>>  wrote:
> >>>> Am 19.04.21 um 17:48 schrieb Jason Ekstrand:
> >>>>> Not going to comment on everything on the first pass...
> >>>>>
> >>>>> On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák  wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> This is our initial proposal for explicit fences everywhere and new
> >>> memory management that doesn't use BO fences. It's a redesign of how Linux
> >>> graphics drivers work, and it can coexist with what we have now.
> >>>>>>
> >>>>>> 1. Introduction
> >>>>>> (skip this if you are already sold on explicit fences)
> >>>>>>
> >>>>>> The current Linux graphics architecture was initially designed for
> >>> GPUs with only one graphics queue where everything was executed in the
> >>> submission order and per-BO fences were used for memory management and
> >>> CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
> >>> queues were added on top, which required the introduction of implicit
> >>> GPU-GPU synchronization between queues of different processes using per-BO
> >>> fences. Recently, even parallel execution within one queue was enabled
> >>> where a command buffer starts draws and compute shaders, but doesn't wait
> >>> for them, enabling parallelism between back-to-back command buffers.
> >>

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Vetter
On Tue, Apr 20, 2021 at 3:04 PM Daniel Stone  wrote:
>
> Hi,
>
> On Tue, 20 Apr 2021 at 13:01, Daniel Vetter  wrote:
>>
>> - We live in a post xf86-video-$vendor world, and all these other
>>   compositors rely on implicit sync. You're not going to be able to get
>>   rid of them anytime soon. What's worse, all the various EGL/vk buffer
>>   sharing things also rely on implicit sync, so you get to fix up tons of
>>   applications on top. Any plan that's realistic needs to cope with
>>   implicit/explicit sync at the same time; anything that can't won't work.
>>
>> - Absolutely infuriating, but you can't use page-faulting together with any
>>   dma_fence synchronization primitives, whether implicit or explicit. This
>>   means until the entire ecosystem moved forward (good luck with that) we
>>   have to support dma_fence. The only sync model that works together with
>>   page faults is userspace fence based sync.
>>
>> This should get rid of the oversync issues, and since implicit sync is
>> baked in everywhere right now, you'll have to deal with implicit sync for
>> a very long time.
>
>
> Depends what you mean by 'implicit sync'. ;)
>
> Getting userspace (Vulkan WSI, EGL, Wayland compositors, browsers, media 
> clients) over to explicit sync is easy, _provided_ that the explicit sync 
> gives us the same guarantees as implicit sync, i.e. completes in bounded 
> time, GPU/display work can be flushed to the kernel predicated on fence 
> completion with the kernel handling synchronisation and scheduling. It's just 
> a matter of typing, and until now we haven't had a great reason to do that 
> typing. Now we do have that reason, so we are implementing it. Whether it's 
> dma_fence or drm_syncobj is mostly immaterial; we can encode in protocol 
> requirements that you can't try to use wait-before-signal with drm_syncobj 
> and you'll get killed if you try.
>
> Getting that userspace over to fully userspace-based sync (wait-before-signal 
> or wait-never-signal, no kernel assistance but you just have to roll your own 
> polling or signal handling on either CPU or GPU side) is not easy. It might 
> never happen, because it's an extraordinary amount of work, introduces a huge 
> amount of fragility into a super-critical path, and so far it's not clear 
> that it's a global performance improvement for the whole system, just 
> shifting performance problems from kernel to userspace, and probably (AFAICT) 
> making them worse in addition to the other problems it brings.
>
> What am I missing?

Nothing I think.

Which is why I'm arguing that kernel based sync with all the current
dma_fence guarantees is probably going to stick around for something
close to forever, and we need to assume so.

Only in specific cases does full userspace sync make sense imo:
- anything compute, excluding using compute/shaders to create
displayable buffers, but compute as in your final target is writing
some stuff to files and never interacting with any winsys. Those
really care because "run a compute kernel for a few hours" isn't
supported without userspace sync, and I don't think it ever will be.
- maybe vulkan direct display, once/if we have the extensions for
atomic kms wired up
- maybe someone wants to write a vulkan based compositor and deal with
all this themselves. That model I think would also imply that they
deal with all the timeouts and fallbacks, irrespective of whether
underneath we actually run on dma_fence timeline syncobjs or userspace
fence timeline syncobjs.

From about 2 years of screaming at this stuff it feels like this will
be a pretty exhaustive list for the next 10 years. Definitely doesn't
include your random linux desktop wayland compositor stack. But there
are definitely some specific areas where people care enough to take
all the pain. For everyone else it's all the other pieces I laid out.

This also means that I don't think we have the impetus right now to
start typing all the explicit sync protocol/compositor bits, since:
- the main driver is compute stuff, that needs mesa work (well vk/ocl
plus all the various repainted copies of cuda)
- with the tricks to make implicit sync work more like explicit sync
the oversyncing can be largely solved without protocol work
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Vetter
ls all fences. Other deadlocks will be handled like GPU
> hangs.

Nope, we can't just shrug off all deadlocks with "gpu reset rolls in". For
one, with userspace fencing the kernel isn't aware of any deadlocks, you
fundamentally can't tell "has deadlocked" from "is still doing useful
computations" because that amounts to solving the halting problem.

Any programming model we come up with where both kernel and userspace are
involved needs to come up with rules where at least non-evil userspace
never deadlocks. And if you just allow both then it's pretty easy to come
up with scenarios where both userspace and kernel alone are deadlock free,
but their interactions result in hangs. That's why we've recently documented
all the corner cases around indefinite dma_fences, and also why you currently
can't use gpu page faults together with anything that uses dma_fence for sync.

That's why I think with userspace fencing the kernel simply should not be
involved at all, aside from providing optimized/blocking cpu wait
functionality.
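
To make that concrete, a minimal sketch of such a userspace fence, i.e.
a 64-bit seqno living in a shared CPU-mapped BO as discussed elsewhere
in this thread; the spin/yield loop is just a stand-in for whatever
optimized kernel wait would back this:

/* Minimal sketch of a userspace memory fence: a monotonically increasing
 * 64-bit seqno in a CPU-mappable BO shared between producer and consumer. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <sched.h>

struct userspace_fence {
    _Atomic uint64_t *seqno;   /* points into the shared mapping */
};

/* Producer: signalling is just storing a higher value. */
static void uf_signal(struct userspace_fence *f, uint64_t point)
{
    atomic_store_explicit(f->seqno, point, memory_order_release);
}

/* Consumer: wait until the timeline reaches 'point'. A real implementation
 * would call an optimized kernel wait instead of spinning, and needs a
 * timeout/fallback to cope with a producer that never signals. */
static bool uf_wait(struct userspace_fence *f, uint64_t point)
{
    while (atomic_load_explicit(f->seqno, memory_order_acquire) < point)
        sched_yield();
    return true;
}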

> Other window system requests can follow the same idea.
> 
> Merged fences where one fence object contains multiple fences will be
> supported. A merged fence is signalled only when its fences are signalled.
> The consumer will have the option to redefine the unsignalled return fence
> to a merged fence.
> 
> *2.2. Modesetting*
> 
> Since a modesetting driver can also be the consumer, the present ioctl will
> contain a submit fence and a return fence too. One small problem with this
> is that userspace can hang the modesetting driver, but in theory, any later
> present ioctl can override the previous one, so the unsignalled
> presentation is never used.
> 
> 
> *3. New memory management*
> 
> The per-BO fences will be removed and the kernel will not know which
> buffers are busy. This will reduce CPU overhead and latency. The kernel
> will not need per-BO fences with explicit synchronization, so we just need
> to remove their last user: buffer evictions. It also resolves the current
> OOM deadlock.

What's "the current OOM deadlock"?

> 
> *3.1. Evictions*
> 
> If the kernel wants to move a buffer, it will have to wait for everything
> to go idle, halt all userspace command submissions, move the buffer, and
> resume everything. This is not expected to happen when memory is not
> exhausted. Other more efficient ways of synchronization are also possible
> (e.g. sync only one process), but are not discussed here.
> 
> *3.2. Per-process VRAM usage quota*
> 
> Each process can optionally and periodically query its VRAM usage quota and
> change domains of its buffers to obey that quota. For example, a process
> allocated 2 GB of buffers in VRAM, but the kernel decreased the quota to 1
> GB. The process can change the domains of the least important buffers to
> GTT to get the best outcome for itself. If the process doesn't do it, the
> kernel will choose which buffers to evict at random. (thanks to Christian
> Koenig for this idea)
> 
> *3.3. Buffer destruction without per-BO fences*
> 
> When the buffer destroy ioctl is called, an optional fence list can be
> passed to the kernel to indicate when it's safe to deallocate the buffer.
> If the fence list is empty, the buffer will be deallocated immediately.
> Shared buffers will be handled by merging fence lists from all processes
> that destroy them. Mitigation of malicious behavior:
> - If userspace destroys a busy buffer, it will get a GPU page fault.
> - If userspace sends fences that never signal, the kernel will have a
> timeout period and then will proceed to deallocate the buffer anyway.
> 
> *3.4. Other notes on MM*
> 
> Overcommitment of GPU-accessible memory will cause an allocation failure or
> invoke the OOM killer. Evictions to GPU-inaccessible memory might not be
> supported.
> 
> Kernel drivers could move to this new memory management today. Only buffer
> residency and evictions would stop using per-BO fences.
> 
> 
> 
> *4. Deprecating implicit synchronization*
> 
> It can be phased out by introducing a new generation of hardware where the
> driver doesn't add support for it (like a driver fork would do), assuming
> userspace has all the changes for explicit synchronization. This could
> potentially create an isolated part of the kernel DRM where all drivers
> only support explicit synchronization.

10-20 years I'd say before that's even an option.
-Daniel

> 
> Marek

> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Vetter
On Tue, Apr 20, 2021 at 07:03:19AM -0400, Marek Olšák wrote:
> Daniel, are you suggesting that we should skip any deadlock prevention in
> the kernel, and just let userspace wait for and signal any fence it has
> access to?

Yeah. If we go with userspace fences, then userspace can hang itself. Not
the kernel's problem. The only criteria is that the kernel itself must
never rely on these userspace fences, except for stuff like implementing
optimized cpu waits. And in those we must always guarantee that the
userspace process remains interruptible.

It's a completely different world from dma_fence based kernel fences,
whether those are implicit or explicit.

> Do you have any concern with the deprecation/removal of BO fences in the
> kernel assuming userspace is only using explicit fences? Any concern with
> the submit and return fences for modesetting and other producer<->consumer
> scenarios?

Let me work on the full reply for your rfc first, because there's a lot
of details here and nuance.
-Daniel

> 
> Thanks,
> Marek
> 
> On Tue, Apr 20, 2021 at 6:34 AM Daniel Vetter  wrote:
> 
> > On Tue, Apr 20, 2021 at 12:15 PM Christian König
> >  wrote:
> > >
> > > Am 19.04.21 um 17:48 schrieb Jason Ekstrand:
> > > > Not going to comment on everything on the first pass...
> > > >
> > > > On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák  wrote:
> > > >> Hi,
> > > >>
> > > >> This is our initial proposal for explicit fences everywhere and new
> > memory management that doesn't use BO fences. It's a redesign of how Linux
> > graphics drivers work, and it can coexist with what we have now.
> > > >>
> > > >>
> > > >> 1. Introduction
> > > >> (skip this if you are already sold on explicit fences)
> > > >>
> > > >> The current Linux graphics architecture was initially designed for
> > GPUs with only one graphics queue where everything was executed in the
> > submission order and per-BO fences were used for memory management and
> > CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
> > queues were added on top, which required the introduction of implicit
> > GPU-GPU synchronization between queues of different processes using per-BO
> > fences. Recently, even parallel execution within one queue was enabled
> > where a command buffer starts draws and compute shaders, but doesn't wait
> > for them, enabling parallelism between back-to-back command buffers.
> > Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
> > was created to enable all those use cases, and it's the only reason why the
> > scheduler exists.
> > > >>
> > > >> The GPU scheduler, implicit synchronization, BO-fence-based memory
> > management, and the tracking of per-BO fences increase CPU overhead and
> > latency, and reduce parallelism. There is a desire to replace all of them
> > with something much simpler. Below is how we could do it.
> > > >>
> > > >>
> > > >> 2. Explicit synchronization for window systems and modesetting
> > > >>
> > > >> The producer is an application and the consumer is a compositor or a
> > modesetting driver.
> > > >>
> > > >> 2.1. The Present request
> > > >>
> > > >> As part of the Present request, the producer will pass 2 fences (sync
> > objects) to the consumer alongside the presented DMABUF BO:
> > > >> - The submit fence: Initially unsignalled, it will be signalled when
> > the producer has finished drawing into the presented buffer.
> > > >> - The return fence: Initially unsignalled, it will be signalled when
> > the consumer has finished using the presented buffer.
> > > > I'm not sure syncobj is what we want.  In the Intel world we're trying
> > > > to go even further to something we're calling "userspace fences" which
> > > > are a timeline implemented as a single 64-bit value in some
> > > > CPU-mappable BO.  The client writes a higher value into the BO to
> > > > signal the timeline.
> > >
> > > Well that is exactly what our Windows guys have suggested as well, but
> > > it strongly looks like that this isn't sufficient.
> > >
> > > First of all you run into security problems when any application can
> > > just write any value to that memory location. Just imagine an
> > > application sets the counter to zero and X waits forever for some
> > > rendering to finish.
> >
>

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Daniel Vetter
emory budget APIs.
> >
> > --Jason
> >
> >
> >> 3.3. Buffer destruction without per-BO fences
> >>
> >> When the buffer destroy ioctl is called, an optional fence list can be 
> >> passed to the kernel to indicate when it's safe to deallocate the buffer. 
> >> If the fence list is empty, the buffer will be deallocated immediately. 
> >> Shared buffers will be handled by merging fence lists from all processes 
> >> that destroy them. Mitigation of malicious behavior:
> >> - If userspace destroys a busy buffer, it will get a GPU page fault.
> >> - If userspace sends fences that never signal, the kernel will have a 
> >> timeout period and then will proceed to deallocate the buffer anyway.
> >>
> >> 3.4. Other notes on MM
> >>
> >> Overcommitment of GPU-accessible memory will cause an allocation failure 
> >> or invoke the OOM killer. Evictions to GPU-inaccessible memory might not 
> >> be supported.
> >>
> >> Kernel drivers could move to this new memory management today. Only buffer 
> >> residency and evictions would stop using per-BO fences.
> >>
> >>
> >> 4. Deprecating implicit synchronization
> >>
> >> It can be phased out by introducing a new generation of hardware where the 
> >> driver doesn't add support for it (like a driver fork would do), assuming 
> >> userspace has all the changes for explicit synchronization. This could 
> >> potentially create an isolated part of the kernel DRM where all drivers 
> >> only support explicit synchronization.
> >>
> >> Marek
> >> ___
> >> dri-devel mailing list
> >> dri-de...@lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > ___
> > mesa-dev mailing list
> > mesa-dev@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH v3 4/4] drm/doc/rfc: i915 DG1 uAPI

2021-04-16 Thread Daniel Vetter
On Fri, Apr 16, 2021 at 6:38 PM Jason Ekstrand  wrote:
>
> On Thu, Apr 15, 2021 at 11:04 AM Matthew Auld  wrote:
> >
> > Add an entry for the new uAPI needed for DG1.
> >
> > v2(Daniel):
> >   - include the overall upstreaming plan
> >   - add a note for mmap, there are differences here for TTM vs i915
> >   - bunch of other suggestions from Daniel
> > v3:
> >  (Daniel)
> >   - add a note for set/get caching stuff
> >   - add some more docs for existing query and extensions stuff
> >   - add an actual code example for regions query
> >   - bunch of other stuff
> >  (Jason)
> >   - uAPI change(!):
> > - try a simpler design with the placements extension
> > - rather than have a generic setparam which can cover multiple
> >   use cases, have each extension be responsible for one thing
> >   only
> >
> > Signed-off-by: Matthew Auld 
> > Cc: Joonas Lahtinen 
> > Cc: Jordan Justen 
> > Cc: Daniel Vetter 
> > Cc: Kenneth Graunke 
> > Cc: Jason Ekstrand 
> > Cc: Dave Airlie 
> > Cc: dri-de...@lists.freedesktop.org
> > Cc: mesa-dev@lists.freedesktop.org
> > ---
> >  Documentation/gpu/rfc/i915_gem_lmem.h   | 255 
> >  Documentation/gpu/rfc/i915_gem_lmem.rst | 139 +
> >  Documentation/gpu/rfc/index.rst |   4 +
> >  3 files changed, 398 insertions(+)
> >  create mode 100644 Documentation/gpu/rfc/i915_gem_lmem.h
> >  create mode 100644 Documentation/gpu/rfc/i915_gem_lmem.rst
> >
> > diff --git a/Documentation/gpu/rfc/i915_gem_lmem.h 
> > b/Documentation/gpu/rfc/i915_gem_lmem.h
> > new file mode 100644
> > index ..2a82a452e9f2
> > --- /dev/null
> > +++ b/Documentation/gpu/rfc/i915_gem_lmem.h
> > @@ -0,0 +1,255 @@
> > +/*
> > + * Note that drm_i915_query_item and drm_i915_query are existing bits of 
> > uAPI.
> > + * For the regions query we are just adding a new query id, so no actual 
> > new
> > + * ioctl or anything, but including it here for reference.
> > + */
> > +struct drm_i915_query_item {
> > +#define DRM_I915_QUERY_MEMORY_REGIONS   0xdeadbeaf
> > +   
> > +__u64 query_id;
> > +
> > +/*
> > + * When set to zero by userspace, this is filled with the size of 
> > the
> > + * data to be written at the data_ptr pointer. The kernel sets this
> > + * value to a negative value to signal an error on a particular 
> > query
> > + * item.
> > + */
> > +__s32 length;
> > +
> > +__u32 flags;
> > +/*
> > + * Data will be written at the location pointed by data_ptr when 
> > the
> > + * value of length matches the length of the data to be written by 
> > the
> > + * kernel.
> > + */
> > +__u64 data_ptr;
> > +};
> > +
> > +struct drm_i915_query {
> > +__u32 num_items;
> > +/*
> > + * Unused for now. Must be cleared to zero.
> > + */
> > +__u32 flags;
> > +/*
> > + * This points to an array of num_items drm_i915_query_item 
> > structures.
> > + */
> > +__u64 items_ptr;
> > +};
> > +
> > +#define DRM_IOCTL_I915_QUERY   DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_QUERY, 
> > struct drm_i915_query)
> > +
> > +/**
> > + * enum drm_i915_gem_memory_class
> > + */
> > +enum drm_i915_gem_memory_class {
> > +   /** @I915_MEMORY_CLASS_SYSTEM: system memory */
> > +   I915_MEMORY_CLASS_SYSTEM = 0,
> > +   /** @I915_MEMORY_CLASS_DEVICE: device local-memory */
> > +   I915_MEMORY_CLASS_DEVICE,
> > +};
> > +
> > +/**
> > + * struct drm_i915_gem_memory_class_instance
> > + */
> > +struct drm_i915_gem_memory_class_instance {
> > +   /** @memory_class: see enum drm_i915_gem_memory_class */
> > +   __u16 memory_class;
> > +
> > +   /** @memory_instance: which instance */
> > +   __u16 memory_instance;
> > +};
> > +
> > +/**
> > + * struct drm_i915_memory_region_info
> > + *
> > + * Describes one region as known to the driver.
> > + *
> > + * Note that we reserve quite a lot of stuff here for potential future 
> > work. As
> > + * an example we might want expose the capabilities(see caps) for a given
> > + * region, which could include things like if the region is CPU
> > + * mappable/accessible et

Re: [Mesa-dev] [PATCH 2/4] drm/doc: add section for driver uAPI

2021-04-16 Thread Daniel Vetter
On Fri, Apr 16, 2021 at 12:37 PM Matthew Auld  wrote:
>
> Add section for drm/i915 uAPI and pull in i915_drm.h.
>
> Suggested-by: Daniel Vetter 
> Signed-off-by: Matthew Auld 
> Cc: Joonas Lahtinen 
> Cc: Jordan Justen 
> Cc: Daniel Vetter 
> Cc: Kenneth Graunke 
> Cc: Jason Ekstrand 
> Cc: Dave Airlie 
> Cc: dri-de...@lists.freedesktop.org
> Cc: mesa-dev@lists.freedesktop.org

lgtm. Reviewed-by: Daniel Vetter 

> ---
>  Documentation/gpu/driver-uapi.rst | 8 
>  Documentation/gpu/index.rst   | 1 +
>  2 files changed, 9 insertions(+)
>  create mode 100644 Documentation/gpu/driver-uapi.rst
>
> diff --git a/Documentation/gpu/driver-uapi.rst 
> b/Documentation/gpu/driver-uapi.rst
> new file mode 100644
> index ..4411e6919a3d
> --- /dev/null
> +++ b/Documentation/gpu/driver-uapi.rst
> @@ -0,0 +1,8 @@
> +===
> +DRM Driver uAPI
> +===
> +
> +drm/i915 uAPI
> +=
> +
> +.. kernel-doc:: include/uapi/drm/i915_drm.h
> diff --git a/Documentation/gpu/index.rst b/Documentation/gpu/index.rst
> index ec4bc72438e4..b9c1214d8f23 100644
> --- a/Documentation/gpu/index.rst
> +++ b/Documentation/gpu/index.rst
> @@ -10,6 +10,7 @@ Linux GPU Driver Developer's Guide
> drm-kms
> drm-kms-helpers
> drm-uapi
> +   driver-uapi
> drm-client
> drivers
> backlight
> --
> 2.26.3
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH v3 4/4] drm/doc/rfc: i915 DG1 uAPI

2021-04-16 Thread Daniel Vetter
On Thu, Apr 15, 2021 at 04:59:58PM +0100, Matthew Auld wrote:
> Add an entry for the new uAPI needed for DG1.
> 
> v2(Daniel):
>   - include the overall upstreaming plan
>   - add a note for mmap, there are differences here for TTM vs i915
>   - bunch of other suggestions from Daniel
> v3:
>  (Daniel)
>   - add a note for set/get caching stuff
>   - add some more docs for existing query and extensions stuff
>   - add an actual code example for regions query
>   - bunch of other stuff
>  (Jason)
>   - uAPI change(!):
>   - try a simpler design with the placements extension
>   - rather than have a generic setparam which can cover multiple
> use cases, have each extension be responsible for one thing
> only
> 
> Signed-off-by: Matthew Auld 
> Cc: Joonas Lahtinen 
> Cc: Jordan Justen 
> Cc: Daniel Vetter 
> Cc: Kenneth Graunke 
> Cc: Jason Ekstrand 
> Cc: Dave Airlie 
> Cc: dri-de...@lists.freedesktop.org
> Cc: mesa-dev@lists.freedesktop.org
> ---
>  Documentation/gpu/rfc/i915_gem_lmem.h   | 255 
>  Documentation/gpu/rfc/i915_gem_lmem.rst | 139 +
>  Documentation/gpu/rfc/index.rst |   4 +
>  3 files changed, 398 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/i915_gem_lmem.h
>  create mode 100644 Documentation/gpu/rfc/i915_gem_lmem.rst
> 
> diff --git a/Documentation/gpu/rfc/i915_gem_lmem.h 
> b/Documentation/gpu/rfc/i915_gem_lmem.h
> new file mode 100644
> index ..2a82a452e9f2
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_gem_lmem.h
> @@ -0,0 +1,255 @@
> +/*
> + * Note that drm_i915_query_item and drm_i915_query are existing bits of 
> uAPI.
> + * For the regions query we are just adding a new query id, so no actual new
> + * ioctl or anything, but including it here for reference.
> + */

Oops I didn't realize this. I think the better/prettier way is to just
mention how it's built on top of the query ioctl and structs, and use
kerneldoc hyperlinks to point there. That way it's still easy to find, and
also serves as better documentation for the uapi when it's all merged.

See 
https://dri.freedesktop.org/docs/drm/doc-guide/kernel-doc.html#highlights-and-cross-references

That's also why it matters that we pull the kerneldoc into our overall
documentation, otherwise the hyperlinks aren't working.
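
i.e. something roughly like this in the header comment (wording is just
a sketch), so the rendered docs link back to the real uapi structs
instead of carrying a copy:

/**
 * DOC: memory regions query
 *
 * The DRM_I915_QUERY_MEMORY_REGIONS query id is used through the existing
 * query uapi; see &struct drm_i915_query_item and &struct drm_i915_query,
 * plus &struct drm_i915_memory_region_info for what gets written back.
 */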

> +struct drm_i915_query_item {
> +#define DRM_I915_QUERY_MEMORY_REGIONS   0xdeadbeaf
> + 
> +__u64 query_id;
> +
> +/*
> + * When set to zero by userspace, this is filled with the size of the
> + * data to be written at the data_ptr pointer. The kernel sets this
> + * value to a negative value to signal an error on a particular query
> + * item.
> + */
> +__s32 length;
> +
> +__u32 flags;
> +/*
> + * Data will be written at the location pointed by data_ptr when the
> + * value of length matches the length of the data to be written by 
> the
> + * kernel.
> + */
> +__u64 data_ptr;
> +};
> +
> +struct drm_i915_query {
> +__u32 num_items;
> +/*
> + * Unused for now. Must be cleared to zero.
> + */
> +__u32 flags;
> +/*
> + * This points to an array of num_items drm_i915_query_item 
> structures.
> + */
> +__u64 items_ptr;
> +};
> +
> +#define DRM_IOCTL_I915_QUERY DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_QUERY, 
> struct drm_i915_query)
> +
> +/**
> + * enum drm_i915_gem_memory_class
> + */
> +enum drm_i915_gem_memory_class {
> + /** @I915_MEMORY_CLASS_SYSTEM: system memory */
> + I915_MEMORY_CLASS_SYSTEM = 0,
> + /** @I915_MEMORY_CLASS_DEVICE: device local-memory */
> + I915_MEMORY_CLASS_DEVICE,
> +};
> +
> +/**
> + * struct drm_i915_gem_memory_class_instance
> + */
> +struct drm_i915_gem_memory_class_instance {
> + /** @memory_class: see enum drm_i915_gem_memory_class */
> + __u16 memory_class;
> +
> + /** @memory_instance: which instance */
> + __u16 memory_instance;
> +};
> +
> +/**
> + * struct drm_i915_memory_region_info
> + *
> + * Describes one region as known to the driver.
> + *
> + * Note that we reserve quite a lot of stuff here for potential future work. 
> As
> + * an example we might want expose the capabilities(see caps) for a given
> + * region, which could include things like if the region is CPU
> + * mappable/accessible etc.
> + */
> +struct drm_i915_memory_region_info {
> + /** @region: class:instance pair encoding */
> + struct drm_i915_gem_memory_class_instance region;
> +
> + /** @rsvd0:

Re: [Mesa-dev] [PATCH v3 3/4] drm/i915/uapi: convert i915_query and friend to kernel doc

2021-04-16 Thread Daniel Vetter
On Fri, Apr 16, 2021 at 12:25 AM Ian Romanick  wrote:
> On 4/15/21 8:59 AM, Matthew Auld wrote:
> > Add a note about the two-step process.
> >
> > Suggested-by: Daniel Vetter 
> > Signed-off-by: Matthew Auld 
> > Cc: Joonas Lahtinen 
> > Cc: Jordan Justen 
> > Cc: Daniel Vetter 
> > Cc: Kenneth Graunke 
> > Cc: Jason Ekstrand 
> > Cc: Dave Airlie 
> > Cc: dri-de...@lists.freedesktop.org
> > Cc: mesa-dev@lists.freedesktop.org
> > ---
> >  include/uapi/drm/i915_drm.h | 57 ++---
> >  1 file changed, 46 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> > index d9c954a5a456..ef36f1a0adde 100644
> > --- a/include/uapi/drm/i915_drm.h
> > +++ b/include/uapi/drm/i915_drm.h
> > @@ -2210,14 +2210,23 @@ struct drm_i915_perf_oa_config {
> >   __u64 flex_regs_ptr;
> >  };
> >
> > +/**
> > + * struct drm_i915_query_item - An individual query for the kernel to 
> > process.
> > + *
> > + * The behaviour is determined by the @query_id. Note that exactly what
>
> Since we just had a big discussion about this on mesa-dev w.r.t. Mesa
> code and documentation... does the kernel have a policy about which
> flavor (pun intended) of English should be used?

I'm not finding it documented in
https://dri.freedesktop.org/docs/drm/doc-guide/sphinx.html but I
thought we've discussed it. Adding linux-doc and Jon Corbet.
-Daniel

>
> > + * @data_ptr is also depends on the specific @query_id.
> > + */
> >  struct drm_i915_query_item {
> > + /** @query_id: The id for this query */
> >   __u64 query_id;
> >  #define DRM_I915_QUERY_TOPOLOGY_INFO1
> >  #define DRM_I915_QUERY_ENGINE_INFO   2
> >  #define DRM_I915_QUERY_PERF_CONFIG  3
> >  /* Must be kept compact -- no holes and well documented */
> >
> > - /*
> > + /**
> > +  * @length:
> > +  *
> >* When set to zero by userspace, this is filled with the size of the
> >* data to be written at the data_ptr pointer. The kernel sets this
> >* value to a negative value to signal an error on a particular query
> > @@ -2225,21 +2234,26 @@ struct drm_i915_query_item {
> >*/
> >   __s32 length;
> >
> > - /*
> > + /**
> > +  * @flags:
> > +  *
> >* When query_id == DRM_I915_QUERY_TOPOLOGY_INFO, must be 0.
> >*
> >* When query_id == DRM_I915_QUERY_PERF_CONFIG, must be one of the
> > -  * following :
> > -  * - DRM_I915_QUERY_PERF_CONFIG_LIST
> > -  * - DRM_I915_QUERY_PERF_CONFIG_DATA_FOR_UUID
> > -  * - DRM_I915_QUERY_PERF_CONFIG_FOR_UUID
> > +  * following:
> > +  *
> > +  *  - DRM_I915_QUERY_PERF_CONFIG_LIST
> > +  *  - DRM_I915_QUERY_PERF_CONFIG_DATA_FOR_UUID
> > +  *  - DRM_I915_QUERY_PERF_CONFIG_FOR_UUID
> >*/
> >   __u32 flags;
> >  #define DRM_I915_QUERY_PERF_CONFIG_LIST  1
> >  #define DRM_I915_QUERY_PERF_CONFIG_DATA_FOR_UUID 2
> >  #define DRM_I915_QUERY_PERF_CONFIG_DATA_FOR_ID   3
> >
> > - /*
> > + /**
> > +  * @data_ptr:
> > +  *
> >* Data will be written at the location pointed by data_ptr when the
> >* value of length matches the length of the data to be written by the
> >* kernel.
> > @@ -2247,16 +2261,37 @@ struct drm_i915_query_item {
> >   __u64 data_ptr;
> >  };
> >
> > +/**
> > + * struct drm_i915_query - Supply an array of drm_i915_query_item for the 
> > kernel
> > + * to fill out.
> > + *
> > + * Note that this is generally a two step process for each 
> > drm_i915_query_item
> > + * in the array:
> > + *
> > + *   1.) Call the DRM_IOCTL_I915_QUERY, giving it our array of
> > + *   drm_i915_query_item, with drm_i915_query_item.size set to zero. The
> > + *   kernel will then fill in the size, in bytes, which tells userspace how
> > + *   memory it needs to allocate for the blob(say for an array of
> > + *   properties).
> > + *
> > + *   2.) Next we call DRM_IOCTL_I915_QUERY again, this time with the
> > + *   drm_i915_query_item.data_ptr equal to our newly allocated blob. Note
> > + *   that the i915_query_item.size should still be the same as what the
> > + *   kernel previously set. At this point the kernel can fill in the blob.
> > + *
> > + */
> >  struct drm_i915_query {
> > + /** @num_items

Re: [Mesa-dev] [PATCH v3 1/4] drm/i915/uapi: hide kernel doc warnings

2021-04-16 Thread Daniel Vetter
On Fri, Apr 16, 2021 at 10:44:28AM +0200, Daniel Vetter wrote:
> On Thu, Apr 15, 2021 at 04:59:55PM +0100, Matthew Auld wrote:
> > It's not properly formatted kernel doc, just nerf the warnings for now.
> > 
> > Signed-off-by: Matthew Auld 
> > Cc: Joonas Lahtinen 
> > Cc: Jordan Justen 
> > Cc: Daniel Vetter 
> > Cc: Kenneth Graunke 
> > Cc: Jason Ekstrand 
> > Cc: Dave Airlie 
> > Cc: dri-de...@lists.freedesktop.org
> > Cc: mesa-dev@lists.freedesktop.org
> 
> Reviewed-by: Daniel Vetter 

Ok I need to revise, we need to pull this into Documentation/gpu/. I think
best would be to create a new driver-uapi.rst file, put it right after
drm-uapi.rst, and then add a section for drm/i915 uapi or similar.

Also since pxp patches, Jason's ctx cleanup and lmem all need this prep
work in patches 1-3 here, can you pls just resend those with the review
feedback so we can fast-track merging?

Thanks, Daniel

> 
> > ---
> >  include/uapi/drm/i915_drm.h | 16 
> >  1 file changed, 8 insertions(+), 8 deletions(-)
> > 
> > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> > index ddc47bbf48b6..a50257cde9ff 100644
> > --- a/include/uapi/drm/i915_drm.h
> > +++ b/include/uapi/drm/i915_drm.h
> > @@ -1054,12 +1054,12 @@ struct drm_i915_gem_exec_fence {
> > __u32 flags;
> >  };
> >  
> > -/**
> > +/*
> >   * See drm_i915_gem_execbuffer_ext_timeline_fences.
> >   */
> >  #define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0
> >  
> > -/**
> > +/*
> >   * This structure describes an array of drm_syncobj and associated points 
> > for
> >   * timeline variants of drm_syncobj. It is invalid to append this 
> > structure to
> >   * the execbuf if I915_EXEC_FENCE_ARRAY is set.
> > @@ -1700,7 +1700,7 @@ struct drm_i915_gem_context_param {
> > __u64 value;
> >  };
> >  
> > -/**
> > +/*
> >   * Context SSEU programming
> >   *
> >   * It may be necessary for either functional or performance reason to 
> > configure
> > @@ -2067,7 +2067,7 @@ struct drm_i915_perf_open_param {
> > __u64 properties_ptr;
> >  };
> >  
> > -/**
> > +/*
> >   * Enable data capture for a stream that was either opened in a disabled 
> > state
> >   * via I915_PERF_FLAG_DISABLED or was later disabled via
> >   * I915_PERF_IOCTL_DISABLE.
> > @@ -2081,7 +2081,7 @@ struct drm_i915_perf_open_param {
> >   */
> >  #define I915_PERF_IOCTL_ENABLE _IO('i', 0x0)
> >  
> > -/**
> > +/*
> >   * Disable data capture for a stream.
> >   *
> >   * It is an error to try and read a stream that is disabled.
> > @@ -2090,7 +2090,7 @@ struct drm_i915_perf_open_param {
> >   */
> >  #define I915_PERF_IOCTL_DISABLE_IO('i', 0x1)
> >  
> > -/**
> > +/*
> >   * Change metrics_set captured by a stream.
> >   *
> >   * If the stream is bound to a specific context, the configuration change
> > @@ -2103,7 +2103,7 @@ struct drm_i915_perf_open_param {
> >   */
> >  #define I915_PERF_IOCTL_CONFIG _IO('i', 0x2)
> >  
> > -/**
> > +/*
> >   * Common to all i915 perf records
> >   */
> >  struct drm_i915_perf_record_header {
> > @@ -2151,7 +2151,7 @@ enum drm_i915_perf_record_type {
> > DRM_I915_PERF_RECORD_MAX /* non-ABI */
> >  };
> >  
> > -/**
> > +/*
> >   * Structure to upload perf dynamic configuration into the kernel.
> >   */
> >  struct drm_i915_perf_oa_config {
> > -- 
> > 2.26.3
> > 
> > ___
> > dri-devel mailing list
> > dri-de...@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH v3 3/4] drm/i915/uapi: convert i915_query and friend to kernel doc

2021-04-16 Thread Daniel Vetter
On Thu, Apr 15, 2021 at 04:59:57PM +0100, Matthew Auld wrote:
> Add a note about the two-step process.
> 
> Suggested-by: Daniel Vetter 
> Signed-off-by: Matthew Auld 
> Cc: Joonas Lahtinen 
> Cc: Jordan Justen 
> Cc: Daniel Vetter 
> Cc: Kenneth Graunke 
> Cc: Jason Ekstrand 
> Cc: Dave Airlie 
> Cc: dri-de...@lists.freedesktop.org
> Cc: mesa-dev@lists.freedesktop.org
> ---
>  include/uapi/drm/i915_drm.h | 57 ++---
>  1 file changed, 46 insertions(+), 11 deletions(-)
> 
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index d9c954a5a456..ef36f1a0adde 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -2210,14 +2210,23 @@ struct drm_i915_perf_oa_config {
>   __u64 flex_regs_ptr;
>  };
>  
> +/**
> + * struct drm_i915_query_item - An individual query for the kernel to 
> process.
> + *
> + * The behaviour is determined by the @query_id. Note that exactly what
> + * @data_ptr is also depends on the specific @query_id.
> + */
>  struct drm_i915_query_item {
> + /** @query_id: The id for this query */
>   __u64 query_id;
>  #define DRM_I915_QUERY_TOPOLOGY_INFO1
>  #define DRM_I915_QUERY_ENGINE_INFO   2
>  #define DRM_I915_QUERY_PERF_CONFIG  3
>  /* Must be kept compact -- no holes and well documented */
>  
> - /*
> + /**
> +  * @length:
> +  *
>* When set to zero by userspace, this is filled with the size of the
>* data to be written at the data_ptr pointer. The kernel sets this
>* value to a negative value to signal an error on a particular query
> @@ -2225,21 +2234,26 @@ struct drm_i915_query_item {
>*/
>   __s32 length;
>  
> - /*
> + /**
> +  * @flags:
> +  *
>* When query_id == DRM_I915_QUERY_TOPOLOGY_INFO, must be 0.
>*
>* When query_id == DRM_I915_QUERY_PERF_CONFIG, must be one of the
> -  * following :
> -  * - DRM_I915_QUERY_PERF_CONFIG_LIST
> -  * - DRM_I915_QUERY_PERF_CONFIG_DATA_FOR_UUID
> -  * - DRM_I915_QUERY_PERF_CONFIG_FOR_UUID
> +  * following:
> +  *
> +  *  - DRM_I915_QUERY_PERF_CONFIG_LIST
> +  *  - DRM_I915_QUERY_PERF_CONFIG_DATA_FOR_UUID
> +  *  - DRM_I915_QUERY_PERF_CONFIG_FOR_UUID
>*/
>   __u32 flags;
>  #define DRM_I915_QUERY_PERF_CONFIG_LIST  1
>  #define DRM_I915_QUERY_PERF_CONFIG_DATA_FOR_UUID 2
>  #define DRM_I915_QUERY_PERF_CONFIG_DATA_FOR_ID   3
>  
> - /*
> + /**
> +  * @data_ptr:
> +  *
>* Data will be written at the location pointed by data_ptr when the
>* value of length matches the length of the data to be written by the
>* kernel.
> @@ -2247,16 +2261,37 @@ struct drm_i915_query_item {
>   __u64 data_ptr;
>  };
>  
> +/**
> + * struct drm_i915_query - Supply an array of drm_i915_query_item for the 
> kernel
> + * to fill out.
> + *
> + * Note that this is generally a two step process for each 
> drm_i915_query_item
> + * in the array:
> + *
> + *   1.) Call the DRM_IOCTL_I915_QUERY, giving it our array of

I'm not sure this results in pretty rendering in htmldocs output. Please
check this.

This also made me realize that we're not pulling any of this into the drm
documents at all. I'll revise my review on patch 1.

Docs here look good:

Reviewed-by: Daniel Vetter 


> + *   drm_i915_query_item, with drm_i915_query_item.size set to zero. The
> + *   kernel will then fill in the size, in bytes, which tells userspace how
> + *   memory it needs to allocate for the blob(say for an array of
> + *   properties).
> + *
> + *   2.) Next we call DRM_IOCTL_I915_QUERY again, this time with the
> + *   drm_i915_query_item.data_ptr equal to our newly allocated blob. Note
> + *   that the i915_query_item.size should still be the same as what the
> + *   kernel previously set. At this point the kernel can fill in the blob.
> + *
> + */
>  struct drm_i915_query {
> + /** @num_items: The number of elements in the @items_ptr array */
>   __u32 num_items;
>  
> - /*
> -  * Unused for now. Must be cleared to zero.
> + /**
> +  * @flags: Unused for now. Must be cleared to zero.
>*/
>   __u32 flags;
>  
> - /*
> -  * This points to an array of num_items drm_i915_query_item structures.
> + /**
> +  * @items_ptr: This points to an array of num_items drm_i915_query_item
> +  * structures.
>*/
>   __u64 items_ptr;
>  };
> -- 
> 2.26.3
> 
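
For reference, a rough userspace sketch of the two-step flow documented
above (error handling kept minimal, query_blob() is a made-up helper
name, assuming libdrm's drmIoctl() and the existing DRM_IOCTL_I915_QUERY):

#include <stdint.h>
#include <stdlib.h>
#include <xf86drm.h>
#include <i915_drm.h>   /* from libdrm */

/* Two-step query: the first call learns the blob size, the second fills it. */
static void *query_blob(int fd, uint64_t query_id, int32_t *len_out)
{
    struct drm_i915_query_item item = { .query_id = query_id };
    struct drm_i915_query q = {
        .num_items = 1,
        .items_ptr = (uintptr_t)&item,
    };
    void *blob;

    /* Step 1: length == 0 asks the kernel for the required size. */
    if (drmIoctl(fd, DRM_IOCTL_I915_QUERY, &q) || item.length <= 0)
        return NULL;

    blob = calloc(1, item.length);
    if (!blob)
        return NULL;

    /* Step 2: same length, data_ptr now points at our allocation. */
    item.data_ptr = (uintptr_t)blob;
    if (drmIoctl(fd, DRM_IOCTL_I915_QUERY, &q) || item.length <= 0) {
        free(blob);
        return NULL;
    }

    *len_out = item.length;
    return blob;
}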

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH v3 2/4] drm/i915/uapi: convert i915_user_extension to kernel doc

2021-04-16 Thread Daniel Vetter
On Thu, Apr 15, 2021 at 04:59:56PM +0100, Matthew Auld wrote:
> Add some example usage for the extension chaining also, which is quite
> nifty.
> 
> Suggested-by: Daniel Vetter 
> Signed-off-by: Matthew Auld 
> Cc: Joonas Lahtinen 
> Cc: Jordan Justen 
> Cc: Daniel Vetter 
> Cc: Kenneth Graunke 
> Cc: Jason Ekstrand 
> Cc: Dave Airlie 
> Cc: dri-de...@lists.freedesktop.org
> Cc: mesa-dev@lists.freedesktop.org
> ---
>  include/uapi/drm/i915_drm.h | 46 +
>  1 file changed, 42 insertions(+), 4 deletions(-)
> 
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index a50257cde9ff..d9c954a5a456 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -62,8 +62,8 @@ extern "C" {
>  #define I915_ERROR_UEVENT"ERROR"
>  #define I915_RESET_UEVENT"RESET"
>  
> -/*
> - * i915_user_extension: Base class for defining a chain of extensions
> +/**
> + * struct i915_user_extension - Base class for defining a chain of extensions
>   *
>   * Many interfaces need to grow over time. In most cases we can simply
>   * extend the struct and have userspace pass in more data. Another option,
> @@ -76,12 +76,50 @@ extern "C" {
>   * increasing complexity, and for large parts of that interface to be
>   * entirely optional. The downside is more pointer chasing; chasing across
>   * the __user boundary with pointers encapsulated inside u64.
> + *
> + * Example chaining:
> + *
> + * .. code-block:: C
> + *
> + *   struct i915_user_extension ext3 {
> + *   .next_extension = 0, // end
> + *   .name = ...,
> + *   };
> + *   struct i915_user_extension ext2 {
> + *   .next_extension = (uintptr_t)&ext3,
> + *   .name = ...,
> + *   };
> + *   struct i915_user_extension ext1 {
> + *   .next_extension = (uintptr_t)&ext2,
> + *   .name = ...,
> + *   };
> + *
> + * Typically the i915_user_extension would be embedded in some uAPI struct, 
> and
> + * in this case we would feed it the head of the chain(i.e ext1), which would
> + * then apply all of the above extensions.
> + *
>   */
>  struct i915_user_extension {
> + /**
> +  * @next_extension:
> +  *
> +  * Pointer to the next i915_user_extension, or zero if the end.
> +  */
>   __u64 next_extension;
> + /** @name: Name of the extension */

Maybe clarify that the namespace here is per ioctl, not global. And maybe
also that the name is just an integer #define or something like that.

Either way, this is solid documentation:

Reviewed-by: Daniel Vetter 
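
i.e. something roughly like this, with the #define names made up purely
for illustration:

/* Hypothetical example only, these names don't exist anywhere: */
#define FOO_CREATE_EXT_MEMORY_REGIONS     0
#define FOO_CREATE_EXT_PROTECTED_CONTENT  1

struct i915_user_extension ext = {
    .next_extension = 0,
    .name = FOO_CREATE_EXT_MEMORY_REGIONS,  /* namespace is per ioctl */
};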

>   __u32 name;
> - __u32 flags; /* All undefined bits must be zero. */
> - __u32 rsvd[4]; /* Reserved for future use; must be zero. */
> + /**
> +  * @flags: MBZ
> +  *
> +  * All undefined bits must be zero.
> +  */
> + __u32 flags;
> + /**
> +  * @rsvd: MBZ
> +  *
> +  * Reserved for future use; must be zero.
> +  */
> + __u32 rsvd[4];
>  };
>  
>  /*
> -- 
> 2.26.3
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH v3 1/4] drm/i915/uapi: hide kernel doc warnings

2021-04-16 Thread Daniel Vetter
On Thu, Apr 15, 2021 at 04:59:55PM +0100, Matthew Auld wrote:
> It's not properly formatted kernel doc, just nerf the warnings for now.
> 
> Signed-off-by: Matthew Auld 
> Cc: Joonas Lahtinen 
> Cc: Jordan Justen 
> Cc: Daniel Vetter 
> Cc: Kenneth Graunke 
> Cc: Jason Ekstrand 
> Cc: Dave Airlie 
> Cc: dri-de...@lists.freedesktop.org
> Cc: mesa-dev@lists.freedesktop.org

Reviewed-by: Daniel Vetter 

> ---
>  include/uapi/drm/i915_drm.h | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index ddc47bbf48b6..a50257cde9ff 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1054,12 +1054,12 @@ struct drm_i915_gem_exec_fence {
>   __u32 flags;
>  };
>  
> -/**
> +/*
>   * See drm_i915_gem_execbuffer_ext_timeline_fences.
>   */
>  #define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0
>  
> -/**
> +/*
>   * This structure describes an array of drm_syncobj and associated points for
>   * timeline variants of drm_syncobj. It is invalid to append this structure 
> to
>   * the execbuf if I915_EXEC_FENCE_ARRAY is set.
> @@ -1700,7 +1700,7 @@ struct drm_i915_gem_context_param {
>   __u64 value;
>  };
>  
> -/**
> +/*
>   * Context SSEU programming
>   *
>   * It may be necessary for either functional or performance reason to 
> configure
> @@ -2067,7 +2067,7 @@ struct drm_i915_perf_open_param {
>   __u64 properties_ptr;
>  };
>  
> -/**
> +/*
>   * Enable data capture for a stream that was either opened in a disabled 
> state
>   * via I915_PERF_FLAG_DISABLED or was later disabled via
>   * I915_PERF_IOCTL_DISABLE.
> @@ -2081,7 +2081,7 @@ struct drm_i915_perf_open_param {
>   */
>  #define I915_PERF_IOCTL_ENABLE   _IO('i', 0x0)
>  
> -/**
> +/*
>   * Disable data capture for a stream.
>   *
>   * It is an error to try and read a stream that is disabled.
> @@ -2090,7 +2090,7 @@ struct drm_i915_perf_open_param {
>   */
>  #define I915_PERF_IOCTL_DISABLE  _IO('i', 0x1)
>  
> -/**
> +/*
>   * Change metrics_set captured by a stream.
>   *
>   * If the stream is bound to a specific context, the configuration change
> @@ -2103,7 +2103,7 @@ struct drm_i915_perf_open_param {
>   */
>  #define I915_PERF_IOCTL_CONFIG   _IO('i', 0x2)
>  
> -/**
> +/*
>   * Common to all i915 perf records
>   */
>  struct drm_i915_perf_record_header {
> @@ -2151,7 +2151,7 @@ enum drm_i915_perf_record_type {
>   DRM_I915_PERF_RECORD_MAX /* non-ABI */
>  };
>  
> -/**
> +/*
>   * Structure to upload perf dynamic configuration into the kernel.
>   */
>  struct drm_i915_perf_oa_config {
> -- 
> 2.26.3
> 
> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] 2020 X.Org Board of Directors Elections Nomination period is NOW

2020-03-24 Thread Daniel Vetter
Another reminder that we're in the election process, and the next
deadline is approaching:

- Send board nominations to elections AT x DOT org

- Go to https://members.x.org/ to renew your membership (or become
one to begin with!)

On Tue, Mar 17, 2020 at 7:21 AM Daniel Vetter  wrote:
>
> Just a quick reminder that both board nomination and membership
> renewal periods are still opening:
>
> - Send board nominations to elections AT x DOT org
>
> - Go to https://members.x.org/ to renew your membership (or become
> one to begin with!)
>
> Cheers, Daniel
>
> On Sun, Mar 8, 2020 at 8:51 PM Daniel Vetter  wrote:
> >
> > We are seeking nominations for candidates for election to the X.Org
> > Foundation Board of Directors. All X.Org Foundation members are
> > eligible for election to the board.
> >
> > Nominations for the 2020 election are now open and will remain open
> > until 23:59 UTC on 29th March 2020.
> >
> > The Board consists of directors elected from the membership. Each
> > year, an election is held to bring the total number of directors to
> > eight. The four members receiving the highest vote totals will serve
> > as directors for two year terms.
> >
> > The directors who received two year terms starting in 2019 were Samuel
> > Iglesias Gonsálvez, Manasi D Navare, Lyude Paul and Daniel Vetter.
> > They will continue to serve until their term ends in 2021. Current
> > directors whose term expires in 2020 are Eric Anholt,  Bryce
> > Harrington, Keith Packard and Harry Wentland.
> >
> > A director is expected to participate in the fortnightly IRC meeting
> > to discuss current business and to attend the annual meeting of the
> > X.Org Foundation, which will be held at a location determined in
> > advance by the Board of Directors.
> >
> > A member may nominate themselves or any other member they feel is
> > qualified. Nominations should be sent to the Election Committee at
> > elections at x.org.
> >
> > Nominees shall be required to be current members of the X.Org
> > Foundation, and submit a personal statement of up to 200 words that
> > will be provided to prospective voters. The collected statements,
> > along with the statement of contribution to the X.Org Foundation in
> > the member's account page on http://members.x.org, will be made
> > available to all voters to help them make their voting decisions.
> >
> > Nominations, membership applications or renewals and completed
> > personal statements must be received no later than 23:59 UTC on 02
> > April 2020.
> >
> > The slate of candidates will be published 6 April 2020 and candidate
> > Q&A will begin then. The deadline for Xorg membership applications and
> > renewals is 02 April 2020.
> >
> > Cheers, Daniel, on behalf of the X.Org BoD
> >
> > PS: I cc'ed the usual dev lists since not many members put in the renewal 
> > yet.
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem

2020-03-19 Thread Daniel Vetter
On Tue, Mar 17, 2020 at 11:01:57AM +0100, Michel Dänzer wrote:
> On 2020-03-16 7:33 p.m., Marek Olšák wrote:
> > On Mon, Mar 16, 2020 at 5:57 AM Michel Dänzer  wrote:
> >> On 2020-03-16 4:50 a.m., Marek Olšák wrote:
> >>> The synchronization works because the Mesa driver waits for idle (drains
> >>> the GFX pipeline) at the end of command buffers and there is only 1
> >>> graphics queue, so everything is ordered.
> >>>
> >>> The GFX pipeline runs asynchronously to the command buffer, meaning the
> >>> command buffer only starts draws and doesn't wait for completion. If the
> >>> Mesa driver didn't wait at the end of the command buffer, the command
> >>> buffer would finish and a different process could start execution of its
> >>> own command buffer while shaders of the previous process are still
> >> running.
> >>>
> >>> If the Mesa driver submits a command buffer internally (because it's
> >> full),
> >>> it doesn't wait, so the GFX pipeline doesn't notice that a command buffer
> >>> ended and a new one started.
> >>>
> >>> The waiting at the end of command buffers happens only when the flush is
> >>> external (Swap buffers, glFlush).
> >>>
> >>> It's a performance problem, because the GFX queue is blocked until the
> >> GFX
> >>> pipeline is drained at the end of every frame at least.
> >>>
> >>> So explicit fences for SwapBuffers would help.
> >>
> >> Not sure what difference it would make, since the same thing needs to be
> >> done for explicit fences as well, doesn't it?
> > 
> > No. Explicit fences don't require userspace to wait for idle in the command
> > buffer. Fences are signalled when the last draw is complete and caches are
> > flushed. Before that happens, any command buffer that is not dependent on
> > the fence can start execution. There is never a need for the GPU to be idle
> > if there is enough independent work to do.
> 
> I don't think explicit fences in the context of this discussion imply
> using that different fence signalling mechanism though. My understanding
> is that the API proposed by Jason allows implicit fences to be used as
> explicit ones and vice versa, so presumably they have to use the same
> signalling mechanism.
> 
> 
> Anyway, maybe the different fence signalling mechanism you describe
> could be used by the amdgpu kernel driver in general, then Mesa could
> drop the waits for idle and get the benefits with implicit sync as well?

Yeah, this is entirely about the programming model visible to userspace.
There shouldn't be any impact on the driver's choice of a top vs. bottom
of the gpu pipeline used for synchronization, that's entirely up to what
your hw/driver/scheduler can pull off.

Doing a full gfx pipeline flush for shared buffers, when your hw can do
better, sounds like an issue to me that's not related to this here at all. It
might be intertwined with amdgpu's special interpretation of dma_resv
fences though, no idea. We might need to revamp all that. But for a
userspace client that does nothing fancy (no multiple render buffer
targets in one bo, or vk style "I write to everything all the time,
perhaps" stuff) there should be 0 perf difference between implicit sync
through dma_resv and explicit sync through sync_file/syncobj/dma_fence
directly.

If there is, I'd consider that a bit of a driver bug.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem

2020-03-19 Thread Daniel Vetter
On Tue, Mar 17, 2020 at 11:27:28AM -0500, Jason Ekstrand wrote:
> On Tue, Mar 17, 2020 at 10:33 AM Nicolas Dufresne  
> wrote:
> >
> > On Monday 16 March 2020 at 23:15 +0200, Laurent Pinchart wrote:
> > > Hi Jason,
> > >
> > > On Mon, Mar 16, 2020 at 10:06:07AM -0500, Jason Ekstrand wrote:
> > > > On Mon, Mar 16, 2020 at 5:20 AM Laurent Pinchart wrote:
> > > > > Another issue is that V4L2 doesn't offer any guarantee on job 
> > > > > ordering.
> > > > > When you queue multiple buffers for camera capture for instance, you
> > > > > don't know until capture complete in which buffer the frame has been
> > > > > captured.
> > > >
> > > > Is this a Kernel UAPI issue?  Surely the kernel driver knows at the
> > > > start of frame capture which buffer it's getting written into.  I
> > > > would think that the kernel APIs could be adjusted (if we find good
> > > > reason to do so!) such that they return earlier and return a (buffer,
> > > > fence) pair.  Am I missing something fundamental about video here?
> > >
> > > For cameras I believe we could do that, yes. I was pointing out the
> > > issues caused by the current API. For video decoders I'll let Nicolas
> > > answer the question, he's way more knowledgeable that I am on that
> > > topic.
> >
> > Right now, there is simply no uAPI for supporting asynchronous errors
> > reporting when fences are invovled. That is true for both camera's and
> > CODEC. It's likely what all the attempt was missing, I don't know
> > enough myself to suggest something.
> >
> > Now, why Stateless video decoders are special is another subject. In
> > CODECs, the decoding and the presentation order may differ. For
> > Stateless kind of CODEC, a bitstream is passed to the HW. We don't know
> > if this bitstream is fully valid, since it is being parsed and
> > validated by the firmware. It's also firmware job to decide which
> > buffer should be presented first.
> >
> > In most firmware interface, that information is communicated back all
> > at once when the frame is ready to be presented (which may be quite
> > some time after it was decoded). So indeed, a fence model is not really
> > easy to add, unless the firmware was designed with that model in mind.
> 
> Just to be clear, I think we should do whatever makes sense here and
> not try to slam sync_file in when it doesn't make sense just because
> we have it.  The more I read on this thread, the less out-fences from
> video decode sound like they make sense unless we have a really solid
> plan for async error reporting.  It's possible, depending on how many
> processes are involved in the pipeline, that async error reporting
> could help reduce latency a bit if it let the kernel report the error
> directly to the last process in the chain.  However, I'm not convinced
> the potential for userspace programmer error is worth it..  That said,
> I'm happy to leave that up to the actual video experts. (I just do 3D)

dma_fence has an error state which you can set when things went south. The
fence still completes (to guarantee forward progress).

Currently that error code isn't really propagated anywhere (well i915 iirc
does something like that since it tracks the dependencies internally in the
scheduler). Definitely not at the dma_fence level, since we don't track
the dependency graph there at all. We might want to add that, would at
least be possible.
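
For reference, the driver-side mechanics today are just the following
(a minimal sketch, not tied to any particular driver):

#include <linux/dma-fence.h>

/* Sketch: record the error first, then complete the fence so that waiters
 * still make forward progress; consumers can inspect fence->error later. */
static void job_complete(struct dma_fence *fence, int err)
{
    if (err)
        dma_fence_set_error(fence, err);
    dma_fence_signal(fence);
}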

If we track the cascading dma_fence error state in the kernel I do think
this could work. I'm not sure whether it's actually a good/useful idea
still.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem

2020-03-19 Thread Daniel Vetter
On Wed, Mar 18, 2020 at 11:05:48AM +0100, Michel Dänzer wrote:
> On 2020-03-17 6:21 p.m., Lucas Stach wrote:
> > That's one of the issues with implicit sync that explicit may solve: 
> > a single client taking way too much time to render something can 
> > block the whole pipeline up until the display flip. With explicit 
> > sync the compositor can just decide to use the last client buffer if 
> > the latest buffer isn't ready by some deadline.
> 
> FWIW, the compositor can do this with implicit sync as well, by polling
> a dma-buf fd for the buffer. (Currently, it has to poll for writable,
> because waiting for the exclusive fence only isn't enough with amdgpu)

It would be great if we don't have to make this the recommended uapi, just
because amdgpu leaks its trickery into the wider world. Polling for read
really should be enough (and I guess Christian gets to fix up amdgpu more,
at least for anything that has a dma-buf attached, even if it's not shared
with anything !amdgpu.ko).
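
For illustration, that trick is literally just poll() on the dma-buf fd; a
minimal userspace sketch (function name made up, error handling omitted):

  #include <poll.h>

  /* Returns 1 once the implicit fences on the dma-buf have signalled,
   * 0 if the buffer is still busy when the timeout expires. */
  static int dmabuf_wait_idle(int dmabuf_fd, int timeout_ms)
  {
          struct pollfd pfd = {
                  .fd = dmabuf_fd,
                  /* POLLOUT waits for all attached fences; POLLIN would
                   * wait on the exclusive fence only. */
                  .events = POLLOUT,
          };

          return poll(&pfd, 1, timeout_ms) > 0;
  }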
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem

2020-03-19 Thread Daniel Vetter
On Tue, Mar 17, 2020 at 12:18:47PM -0500, Jason Ekstrand wrote:
> On Tue, Mar 17, 2020 at 12:13 PM Jacob Lifshay  
> wrote:
> >
> > One related issue with explicit sync using sync_file: combined
> > CPUs/GPUs (the CPU cores *are* the GPU cores) that do all the
> > rendering in userspace (like llvmpipe, but for Vulkan and with extra
> > instructions for GPU tasks), yet need to synchronize with other
> > drivers/processes, need some way to create an explicit
> > fence/semaphore from userspace and later signal it. This
> > seems to conflict with the requirement for a sync_file to complete in
> > finite time, since the user process could be stopped or killed.
> 
> Yeah... That's going to be a problem.  The only way I could see that
> working is if you created a sync_file that had a timeout associated
> with it.  However, then you run into the issue where you may have
> corruption if stuff doesn't complete on time.  Then again, you're not
> really dealing with an external unit and so the latency cost of going
> across the window system protocol probably isn't massively different
> from the latency cost of triggering the sync_file.  Maybe the answer
> there is to just do everything in-order and not worry about
> synchronization?

vgem does that already (fences with timeout). The corruption issue is also
not new: if your shaders take forever, real gpus will nick your rendering
with a quick reset. Iirc someone (from the Google CrOS team maybe) was even
looking into making llvmpipe run on top of vgem as a real dri/drm mesa
driver.
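
And just to illustrate the consumer side: a sync_file fd supports poll(),
so a waiter can always bound how long it is willing to wait on its own
(sketch only, function name made up):

  #include <poll.h>

  /* Returns >0 once the fence signals, 0 on timeout, <0 on poll error. */
  static int wait_sync_file(int sync_fd, int timeout_ms)
  {
          struct pollfd pfd = { .fd = sync_fd, .events = POLLIN };

          return poll(&pfd, 1, timeout_ms);
  }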
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] 2020 X.Org Board of Directors Elections Nomination period is NOW

2020-03-17 Thread Daniel Vetter
Just a quick reminder that both board nomination and membership
renewal periods are still open:

- Send board nominations to elections AT x DOT org

- Go to https://members.x.org/ to renew your membership (or become
one to begin with!)

Cheers, Daniel

On Sun, Mar 8, 2020 at 8:51 PM Daniel Vetter  wrote:
>
> We are seeking nominations for candidates for election to the X.Org
> Foundation Board of Directors. All X.Org Foundation members are
> eligible for election to the board.
>
> Nominations for the 2020 election are now open and will remain open
> until 23:59 UTC on 29th March 2020.
>
> The Board consists of directors elected from the membership. Each
> year, an election is held to bring the total number of directors to
> eight. The four members receiving the highest vote totals will serve
> as directors for two year terms.
>
> The directors who received two-year terms starting in 2019 were Samuel
> Iglesias Gonsálvez, Manasi D Navare, Lyude Paul and Daniel Vetter.
> They will continue to serve until their term ends in 2021. Current
> directors whose terms expire in 2020 are Eric Anholt, Bryce
> Harrington, Keith Packard and Harry Wentland.
>
> A director is expected to participate in the fortnightly IRC meeting
> to discuss current business and to attend the annual meeting of the
> X.Org Foundation, which will be held at a location determined in
> advance by the Board of Directors.
>
> A member may nominate themselves or any other member they feel is
> qualified. Nominations should be sent to the Election Committee at
> elections at x.org.
>
> Nominees shall be required to be current members of the X.Org
> Foundation, and submit a personal statement of up to 200 words that
> will be provided to prospective voters. The collected statements,
> along with the statement of contribution to the X.Org Foundation in
> the member's account page on http://members.x.org, will be made
> available to all voters to help them make their voting decisions.
>
> Nominations, membership applications or renewals and completed
> personal statements must be received no later than 23:59 UTC on 02
> April 2020.
>
> The slate of candidates will be published 6 April 2020 and candidate
> Q&A will begin then. The deadline for Xorg membership applications and
> renewals is 02 April 2020.
>
> Cheers, Daniel, on behalf of the X.Org BoD
>
> PS: I cc'ed the usual dev lists since not many members put in the renewal yet.
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


[Mesa-dev] 2020 X.Org Board of Directors Elections Nomination period is NOW

2020-03-08 Thread Daniel Vetter
We are seeking nominations for candidates for election to the X.Org
Foundation Board of Directors. All X.Org Foundation members are
eligible for election to the board.

Nominations for the 2020 election are now open and will remain open
until 23:59 UTC on 29th March 2020.

The Board consists of directors elected from the membership. Each
year, an election is held to bring the total number of directors to
eight. The four members receiving the highest vote totals will serve
as directors for two year terms.

The directors who received two-year terms starting in 2019 were Samuel
Iglesias Gonsálvez, Manasi D Navare, Lyude Paul and Daniel Vetter.
They will continue to serve until their term ends in 2021. Current
directors whose terms expire in 2020 are Eric Anholt, Bryce
Harrington, Keith Packard and Harry Wentland.

A director is expected to participate in the fortnightly IRC meeting
to discuss current business and to attend the annual meeting of the
X.Org Foundation, which will be held at a location determined in
advance by the Board of Directors.

A member may nominate themselves or any other member they feel is
qualified. Nominations should be sent to the Election Committee at
elections at x.org.

Nominees shall be required to be current members of the X.Org
Foundation, and submit a personal statement of up to 200 words that
will be provided to prospective voters. The collected statements,
along with the statement of contribution to the X.Org Foundation in
the member's account page on http://members.x.org, will be made
available to all voters to help them make their voting decisions.

Nominations, membership applications or renewals and completed
personal statements must be received no later than 23:59 UTC on 02
April 2020.

The slate of candidates will be published 6 April 2020 and candidate
Q&A will begin then. The deadline for Xorg membership applications and
renewals is 02 April 2020.

Cheers, Daniel, on behalf of the X.Org BoD

PS: I cc'ed the usual dev lists since not many members put in the renewal yet.
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [Mesa-dev] [Intel-gfx] gitlab.fd.o financial situation and impact on services

2020-02-28 Thread Daniel Vetter
> > [...] doing it anyway.  If it
> > takes me a single day to set all this up (I estimate a couple of
> > weeks), that costs my employer a lot more than sponsoring the costs of
> > the inefficiencies of the system that has accumulated.
>
> I'm not trying to knock the engineering work the CI contributors have
> done at all, but I've never seen a real discussion about costs until
> now. Engineers aren't accountants.
>
> The thing we seem to be missing here is fiscal responsibility. I know
> this email is us being fiscally responsible, but it's kinda after the
> fact.
>
> I cannot commit my employer to spending a large amount of money (> 0
> actually) without a long and lengthy process with checks and bounds.
> Can you?
>
> The X.org board has budgets and procedures as well. I as a developer
> of Mesa should not be able to commit the X.org foundation to spending
> large amounts of money without checks and bounds.
>
> The CI infrastructure lacks any checks and bounds. There is no link
> between editing .gitlab-ci/* and cashflow. There is no link to me
> adding support for a new feature to llvmpipe that blows out test times
> (granted it won't affect CI budget but just an example).

We're working to get the logging in place to know which projects
exactly burn down the money so that we can take specific actions. If
needed. So pretty soon you won't be able to just burn down endless
amounts of cash with a few gitlab-ci commits. Or at least not for long
until we catch you and you either fix things up or CI is gone for your
project.

> The fact that clouds run on credit means that it's not possible to say
> budget 30K and say when that runs out it runs out, you end up getting
> bills for ever increasing amounts that you have to cover, with nobody
> "responsible" for ever reducing those bills. Higher Faster Further
> baby comes to mind.

We're working on this, since it's the board's responsibility to be on
top of stuff. It's simply that we didn't expect a massive growth of
this scale and this quickly, so we're a bit behind on the controlling
aspect.

Also I guess it wasn't clear, but the board decision yesterday was the
stop-loss order where we cut the cord (for CI at least). So yeah the
short term budget is firmly in place now.

> Has X.org actually allocated the remaining cash in it's bank account
> to this task previously? Was there plans for this money that can't be
> executed now because we have to pay the cloud fees? If we continue to
> May and the X.org bank account hits 0, can XDC happen?

There's numbers elsewhere in this thread, but if you'd read the
original announcement it states that the stop loss would still
guarantee that we can pay for everything for at least one year. We're
not going to get even close to 0 in the bank account.

So yeah XDC happens, and it'll also still happen next year. Also fd.o
servers will keep running. The only thing we might need to switch off
is the CI support.

> Budgeting and cloud is hard, the feedback loops are messy. In the old
> system the feedback loop was simple, we don't have admin time or money
> for servers we don't get the features, cloud allows us to get the
> features and enjoy them and at some point in the future the bill gets
> paid by someone else. Credit cards lifestyles all the way.

Uh ... where exactly do you get the credit card approach from? SPI is
legally not allowed to extend us credit (we're not a legal org
anymore), so if we hit 0 it's over real quick. No credit for us. If SPI
isn't on top of that it's their loss (but they're getting pretty good
at tracking stuff with the contractor they now have and all that).

Which is not going to happen btw, if you've read the announcement mail
and all that.

Cheers, Daniel

> Like maybe we can grow up here and find sponsors to cover all of this,
> but it still feels a bit backwards from a fiscal pov.
>
> Again I'm not knocking the work people have done at all, CI is very
> valuable to the projects involved, but that doesn't absolve us from
> costs.
>
> Dave.



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [Mesa-dev] [Intel-gfx] gitlab.fd.o financial situation and impact on services

2020-02-28 Thread Daniel Vetter
On Fri, Feb 28, 2020 at 10:29 AM Erik Faye-Lund
 wrote:
>
> On Fri, 2020-02-28 at 13:37 +1000, Dave Airlie wrote:
> > On Fri, 28 Feb 2020 at 07:27, Daniel Vetter 
> > wrote:
> > > Hi all,
> > >
> > > You might have read the short take in the X.org board meeting
> > > minutes
> > > already, here's the long version.
> > >
> > > The good news: gitlab.fd.o has become very popular with our
> > > communities, and is used extensively. This especially includes all
> > > the
> > > CI integration. Modern development process and tooling, yay!
> > >
> > > The bad news: The cost of this growth has also been tremendous, and it's
> > > breaking our bank account. With reasonable estimates for continued
> > > growth we're expecting hosting expenses totalling 75k USD this
> > > year,
> > > and 90k USD next year. With the current sponsors we've set up we
> > > can't
> > > sustain that. We estimate that hosting expenses for gitlab.fd.o
> > > without any of the CI features enabled would total 30k USD, which
> > > is
> > > within X.org's ability to support through various sponsorships,
> > > mostly
> > > through XDC.
> > >
> > > Note that X.org no longer sponsors any CI runners itself,
> > > we've stopped that. The huge additional expenses are all just in
> > > storing and serving build artifacts and images to outside CI
> > > runners
> > > sponsored by various companies. A related topic is that with the
> > > growth in fd.o it's becoming infeasible to maintain it all on
> > > volunteer admin time. X.org is therefore also looking for admin
> > > sponsorship, at least medium term.
> > >
> > > Assuming that we want cash flow reserves for one year of
> > > gitlab.fd.o
> > > (without CI support) and a trimmed XDC and assuming no sponsor
> > > payment
> > > meanwhile, we'd have to cut CI services somewhere between May and
> > > June
> > > this year. The board is of course working on acquiring sponsors,
> > > but
> > > filling a shortfall of this magnitude is neither easy nor quick
> > > work,
> > > and we therefore decided to give an early warning as soon as
> > > possible.
> > > Any help in finding sponsors for fd.o is very much appreciated.
> >
> > a) Ouch.
> >
> > b) we probably need to take a large step back here.
> >
>
> I kinda agree, but maybe the step doesn't have to be *too* large?
>
> I wonder if we could solve this by restructuring the project a bit. I'm
> talking purely from a Mesa point of view here, so it might not solve
> the full problem, but:
>
> 1. It feels silly that we need to test changes to e.g the i965 driver
> on dragonboards. We only have a big "do not run CI at all" escape-
> hatch.
>
> 2. A lot of us are working for a company that can probably pay for
> their own needs in terms of CI. Perhaps moving some costs "up front" to
> the company that needs it can help secure the future of CI for those
> who can't do this.
>
> 3. I think we need a much more detailed break-down of the cost to make
> educated changes. For instance, how expensive is Docker image
> uploads/downloads (e.g intermediary artifacts) compared to build logs
> and final test-results? What kind of artifacts?

We have logs somewhere, but no one has yet got around to analyzing them.
That will be quite a bit of work, since the cloud storage is
totally disconnected from the gitlab front-end, so making the connection
to which project or CI job caused what is going to require
scripting. Volunteers are definitely very much welcome, I think.

> One suggestion would be to do something more similar to what the kernel
> does, and separate into different repos for different subsystems. This
> could allow us to have separate testing-pipelines for these repos,
> which would mean that for instance a change to RADV didn't trigger a
> full Panfrost test-run.

Uh, as someone who lives the kernel multi-tree model daily: there's a
_lot_ of pain involved. I think it's much better to look at filtering out
CI targets for when nothing relevant happened. But that gets somewhat
tricky, since "nothing relevant" is always only relative to some
baseline, so there's a bit of scripting and all involved to make sure you
don't run stuff too often or (probably worse) not often enough.
-Daniel

> This would probably require us to accept using a more branch-heavy
> work-flow. I don't personally think that would be a bad thing.
>
> But this is all kinda based on an assumption that running hardware-
> testing is the expensive part.

Re: [Mesa-dev] [Intel-gfx] gitlab.fd.o financial situation and impact on services

2020-02-27 Thread Daniel Vetter
On Fri, Feb 28, 2020 at 4:38 AM Dave Airlie  wrote:
>
> On Fri, 28 Feb 2020 at 07:27, Daniel Vetter  wrote:
> >
> > Hi all,
> >
> > You might have read the short take in the X.org board meeting minutes
> > already, here's the long version.
> >
> > The good news: gitlab.fd.o has become very popular with our
> > communities, and is used extensively. This especially includes all the
> > CI integration. Modern development process and tooling, yay!
> >
> > The bad news: The cost of this growth has also been tremendous, and it's
> > breaking our bank account. With reasonable estimates for continued
> > growth we're expecting hosting expenses totalling 75k USD this year,
> > and 90k USD next year. With the current sponsors we've set up we can't
> > sustain that. We estimate that hosting expenses for gitlab.fd.o
> > without any of the CI features enabled would total 30k USD, which is
> > within X.org's ability to support through various sponsorships, mostly
> > through XDC.
> >
> > Note that X.org no longer sponsors any CI runners itself,
> > we've stopped that. The huge additional expenses are all just in
> > storing and serving build artifacts and images to outside CI runners
> > sponsored by various companies. A related topic is that with the
> > growth in fd.o it's becoming infeasible to maintain it all on
> > volunteer admin time. X.org is therefore also looking for admin
> > sponsorship, at least medium term.
> >
> > Assuming that we want cash flow reserves for one year of gitlab.fd.o
> > (without CI support) and a trimmed XDC and assuming no sponsor payment
> > meanwhile, we'd have to cut CI services somewhere between May and June
> > this year. The board is of course working on acquiring sponsors, but
> > filling a shortfall of this magnitude is neither easy nor quick work,
> > and we therefore decided to give an early warning as soon as possible.
> > Any help in finding sponsors for fd.o is very much appreciated.
>
> a) Ouch.
>
> b) we probably need to take a large step back here.
>
> Look at this from a sponsor POV, why would I give X.org/fd.o
> sponsorship money that they are just giving straight to google to pay
> for hosting credits? Google are profiting in some minor way from these
> hosting credits being bought by us, and I assume we aren't getting any
> sort of discounts here. Having google sponsor the credits costs google
> substantially less than having any other company give us money to do
> it.
>
> If our current CI architecture is going to burn this amount of money a
> year and we hadn't worked this out in advance of deploying it then I
> suggest the system should be taken offline until we work out what a
> sustainable system would look like within the budget we have, whether
> that be never transferring containers and build artifacts from the
> google network, just having local runner/build combos etc.

Google sponsored 30k in hosting credits last year; these simply
ran out _much_ faster than anyone planned for. So this is by far not a
free thing for them. Plus there's also other companies sponsoring CI
runners and what not else in equally substantial amounts, plus the
biggest thing, sponsored admin time (more or less officially). So
there's a _lot_ of room for companies like Red Hat to sponsor without
throwing any money in google's revenue stream.

Or it doesn't happen, and then yeah the decision has already been made
to shutter the CI services. So this is also a question of whether we
(as a community and all the companies benefitting from the work done)
really want this, or maybe not quite.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


[Mesa-dev] gitlab.fd.o financial situation and impact on services

2020-02-27 Thread Daniel Vetter
Hi all,

You might have read the short take in the X.org board meeting minutes
already, here's the long version.

The good news: gitlab.fd.o has become very popular with our
communities, and is used extensively. This especially includes all the
CI integration. Modern development process and tooling, yay!

The bad news: The cost of this growth has also been tremendous, and it's
breaking our bank account. With reasonable estimates for continued
growth we're expecting hosting expenses totalling 75k USD this year,
and 90k USD next year. With the current sponsors we've set up we can't
sustain that. We estimate that hosting expenses for gitlab.fd.o
without any of the CI features enabled would total 30k USD, which is
within X.org's ability to support through various sponsorships, mostly
through XDC.

Note that X.org no longer sponsors any CI runners itself,
we've stopped that. The huge additional expenses are all just in
storing and serving build artifacts and images to outside CI runners
sponsored by various companies. A related topic is that with the
growth in fd.o it's becoming infeasible to maintain it all on
volunteer admin time. X.org is therefore also looking for admin
sponsorship, at least medium term.

Assuming that we want cash flow reserves for one year of gitlab.fd.o
(without CI support) and a trimmed XDC and assuming no sponsor payment
meanwhile, we'd have to cut CI services somewhere between May and June
this year. The board is of course working on acquiring sponsors, but
filling a shortfall of this magnitude is neither easy nor quick work,
and we therefore decided to give an early warning as soon as possible.
Any help in finding sponsors for fd.o is very much appreciated.

Thanks, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


[Mesa-dev] Update on Khronos conformance submissions

2019-10-15 Thread Daniel Vetter
Hi all,

At XDC and with a few follow ups with Neil we've clarified the process
for submitting conformance results to get the GL/VK trademarks and
everything:

https://www.x.org/wiki/Khronos/

Big update is that we've had a misunderstanding around submissions by
hardware vendors. Those are only a concern for hardware vendors (they
need to be adopters and pay fees even if their implementation is
based on X.org); we can submit anything that's open and conformant.
Even if there's no corresponding submission by a hardware vendor, and
including software-only renderers.

Hopefully we'll see a bunch more submissions in the future now!

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

Re: [Mesa-dev] Switching to Gitlab Issues instead of Bugzilla?

2019-09-04 Thread Daniel Vetter
On Wed, Sep 4, 2019 at 6:52 PM Adam Jackson  wrote:
>
> On Fri, 2019-08-30 at 14:26 +0100, Chris Wilson wrote:
> > Quoting Daniel Stone (2019-08-30 14:13:08)
> > > Hi,
> > >
> > > On Thu, 29 Aug 2019 at 21:35, Chris Wilson  
> > > wrote:
> > > >
> > > > I think so. I just want a list of all bugs that may affect the code I'm
> > > > working on, wherever they were filed. I have a search in bugs.fdo, I
> > > > just need instructions on how to get the same from gitlab, hopefully in
> > > > a compact format.
> > >
> > > It's not clear to me what you need. Can you please give more details?
> >
> > At the moment, I always have open a couple of searches which are basically
> >
> > Product: DRI, Mesa, xorg
> > Component: Driver/intel, Drivers/DRI/i830, Drivers/DRI/i915, 
> > Drivers/DRI/i965, Drivers/Vulkan/intel, DRM/AMDgpu, DRM/Intel, IGT
> > Status: NEW, ASSIGNED, REOPENED, NEEDINFO
> >
> > I would like a similar way of getting a quick glance at the issues under
> > discussion and any new issues across the products -- basically I want a
> > heads up in case I've broken something, however subtle. And sometimes
> > you just need to trawl through every bug in case you missed something.
>
> You can do a top-level search for arbitrary strings, and get a list of
> matching issues:
>
> https://gitlab.freedesktop.org/search?group_id=&project_id=&repository_ref=&scope=issues&search=i965
>
> But that's perhaps not super useful. There's no way to globally search
> for issues with a particular label, probably because labels are scoped
> either to projects or groups and not site-wide. But you _do_ get
> project-wide labels, so we could promote mesa/mesa's i965 label to be
> usable from mesa/*. The xorg project has this already for some labels:
>
> https://gitlab.freedesktop.org/groups/xorg/-/labels
> https://gitlab.freedesktop.org/groups/xorg/-/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=gsoc
>
> This probably implies that we'd want the kernel repo to be a mesa
> subproject. And then you'd just have top-level label searches for the
> xorg and mesa projects.

Looking at https://gitlab.freedesktop.org/drm and
https://cgit.freedesktop.org/drm we have the following list of kernel
projects we'd need to move:
- overall drm (we really probably want no bug reports on that, Dave
ignores them all anyway or at most redirects to subtrees)
- drm-misc
- drm-intel
- amdgpu tree
- msm
- nouveau is somewhere else, probably wants to keep its separate
bugzilla component too
- anything else that's not maintained in one of the above perhaps
(it's marginal, but might happen)
- igt
- libdrm (currently under gitlab/mesa/drm)
- maintainer-tools (not going to have a real need for reassigning bugs
with any of the above, but why leave it out)

btw for git repo reasons at least drm-misc, drm and drm-intel need to
be in a group of their own, for acl reasons. Or at least we need a
group somewhere for these, so we can give them all access to drm-tip.
But that's only for once we move the git repos, but I kinda don't want
to move everything yet again.
-Daniel

> > > If you want cross-component search results in a single list, that's
> > > not really something we can do today, and I don't know if it would
> > > land any time soon. You can however subscribe to particular issue
> > > labels, and when you see something that catches your eye add a 'todo'
> > > for it, then the main UI shows all your outstanding todos, including
> > > where people have mentioned you etc.
> >
> > One thing we did for bugzilla was set the default QA component to a
> > mailing list, so we had a single place to subscribe to get all the spam.
> > I presume something similar would be available to subscribe to every
> > issue across a range of categories.
>
> You (individually) can subscribe to a label (per-project-or-group),
> yes. Subscribing a mailing list to a label is somewhat awkward since
> the email address for an account is where things like password reset
> requests get sent.
>
> - ajax
>



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
