Re: [Nouveau] [PATCH drm-misc-next 2/3] drm/gpuva_mgr: generalize dma_resv/extobj handling and GEM validation

2023-10-12 Thread Christoph Hellwig
On Thu, Oct 12, 2023 at 02:35:15PM +0200, Christian König wrote:
> In addition to that, from the software side Felix summarized it quite well in
> the HMM peer2peer discussion thread recently.

Do you have a pointer to that discussion?



[Nouveau] [PATCH] drm/nouveau/dispnv04: fix a possible null pointer dereference

2023-10-12 Thread Ma Ke
In nv17_tv_get_hd_modes(), the return value of drm_mode_duplicate()
is assigned to mode and dereferenced without being checked, which
leads to a NULL pointer dereference if drm_mode_duplicate() fails.
The same applies to drm_cvt_mode(). Add checks to avoid the NULL
pointer dereferences.

Signed-off-by: Ma Ke 
---
 drivers/gpu/drm/nouveau/dispnv04/tvnv17.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/dispnv04/tvnv17.c b/drivers/gpu/drm/nouveau/dispnv04/tvnv17.c
index 670c9739e5e1..9c3dc9a5bb46 100644
--- a/drivers/gpu/drm/nouveau/dispnv04/tvnv17.c
+++ b/drivers/gpu/drm/nouveau/dispnv04/tvnv17.c
@@ -258,6 +258,8 @@ static int nv17_tv_get_hd_modes(struct drm_encoder *encoder,
 		if (modes[i].hdisplay == output_mode->hdisplay &&
 		    modes[i].vdisplay == output_mode->vdisplay) {
 			mode = drm_mode_duplicate(encoder->dev, output_mode);
+			if (!mode)
+				continue;
 			mode->type |= DRM_MODE_TYPE_PREFERRED;
 
 		} else {
@@ -265,6 +267,8 @@ static int nv17_tv_get_hd_modes(struct drm_encoder *encoder,
 					    modes[i].vdisplay, 60, false,
 					    (output_mode->flags &
 					     DRM_MODE_FLAG_INTERLACE), false);
+			if (!mode)
+				continue;
 		}
 
 		/* CVT modes are sometimes unsuitable... */
-- 
2.37.2
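For context, both drm_mode_duplicate() and drm_cvt_mode() allocate a fresh
drm_display_mode and return NULL on allocation failure, so the result has to
be checked before it is dereferenced or handed to the connector. A minimal
sketch of that pattern in a generic ->get_modes() handler (hypothetical
example_get_modes(), not the nouveau code) might look like:

#include <drm/drm_connector.h>
#include <drm/drm_modes.h>

static int example_get_modes(struct drm_connector *connector)
{
	struct drm_device *dev = connector->dev;
	struct drm_display_mode *mode;
	int count = 0;

	/* CVT-computed 1024x768@60; the allocation may fail under memory pressure. */
	mode = drm_cvt_mode(dev, 1024, 768, 60, false, false, false);
	if (!mode)
		return count;	/* skip rather than dereference a NULL mode */

	mode->type |= DRM_MODE_TYPE_PREFERRED;
	drm_mode_probed_add(connector, mode);
	count++;

	return count;
}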



Re: [Nouveau] [PATCH] drm/nouveau/disp: fix DP capable DSM connectors

2023-10-12 Thread Lyude Paul
Reviewed-by: Lyude Paul 

On Wed, 2023-10-11 at 13:41 +0200, Karol Herbst wrote:
> Just special case DP DSM connectors until we properly figure out how to
> deal with this.
> 
> This resolves user regressions on GPUs with such connectors without
> reverting the original fix.
> 
> Cc: Lyude Paul 
> Cc: sta...@vger.kernel.org # 6.4+
> Closes: https://gitlab.freedesktop.org/drm/nouveau/-/issues/255
> Fixes: 2b5d1c29f6c4 ("drm/nouveau/disp: PIOR DP uses GPIO for HPD, not PMGR AUX interrupts")
> Signed-off-by: Karol Herbst 
> ---
>  drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c | 14 +-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c b/drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c
> index 46b057fe1412e..3249e5c1c8930 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c
> +++ b/drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c
> @@ -62,6 +62,18 @@ nvkm_uconn_uevent_gpio(struct nvkm_object *object, u64 token, u32 bits)
>  	return object->client->event(token, &args, sizeof(args.v0));
>  }
>  
> +static bool
> +nvkm_connector_is_dp_dms(u8 type)
> +{
> +	switch (type) {
> +	case DCB_CONNECTOR_DMS59_DP0:
> +	case DCB_CONNECTOR_DMS59_DP1:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
>  static int
>  nvkm_uconn_uevent(struct nvkm_object *object, void *argv, u32 argc, struct nvkm_uevent *uevent)
>  {
> @@ -101,7 +113,7 @@ nvkm_uconn_uevent(struct nvkm_object *object, void *argv, u32 argc, struct nvkm_
>  	if (args->v0.types & NVIF_CONN_EVENT_V0_UNPLUG) bits |= NVKM_GPIO_LO;
>  	if (args->v0.types & NVIF_CONN_EVENT_V0_IRQ) {
>  		/* TODO: support DP IRQ on ANX9805 and remove this hack. */
> -		if (!outp->info.location)
> +		if (!outp->info.location && !nvkm_connector_is_dp_dms(conn->info.type))
>  			return -EINVAL;
>  	}
>  

-- 
Cheers,
 Lyude Paul (she/her)
 Software Engineer at Red Hat



Re: [Nouveau] [PATCH] drm/nouveau/dispnv04: fix a possible null pointer dereference

2023-10-12 Thread Danilo Krummrich

On 10/7/23 05:23, Ma Ke wrote:

In nv17_tv_get_ld_modes(), the return value of drm_mode_duplicate()
is assigned to mode and dereferenced without being checked, which
leads to a NULL pointer dereference if drm_mode_duplicate() fails.
Add a check to avoid the NULL pointer dereference.

Signed-off-by: Ma Ke 


Reviewed-by: Danilo Krummrich 


---
  drivers/gpu/drm/nouveau/dispnv04/tvnv17.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/dispnv04/tvnv17.c b/drivers/gpu/drm/nouveau/dispnv04/tvnv17.c
index 670c9739e5e1..4a08e61f3336 100644
--- a/drivers/gpu/drm/nouveau/dispnv04/tvnv17.c
+++ b/drivers/gpu/drm/nouveau/dispnv04/tvnv17.c
@@ -209,6 +209,8 @@ static int nv17_tv_get_ld_modes(struct drm_encoder *encoder,
 		struct drm_display_mode *mode;
 
 		mode = drm_mode_duplicate(encoder->dev, tv_mode);
+		if (!mode)
+			continue;
 
 		mode->clock = tv_norm->tv_enc_mode.vrefresh *
 			mode->htotal / 1000 *



Re: [Nouveau] [PATCH v2] drm/nouveau: exec: fix ioctl kernel-doc warning

2023-10-12 Thread Danilo Krummrich

On 10/8/23 16:02, Randy Dunlap wrote:

kernel-doc emits a warning:

include/uapi/drm/nouveau_drm.h:49: warning: Cannot understand  * @NOUVEAU_GETPARAM_EXEC_PUSH_MAX
 on line 49 - I thought it was a doc line

We don't have a way to document a macro value via kernel-doc, so
change the "/**" kernel-doc marker to a C comment and format the comment
more like a kernel-doc comment for consistency.

Fixes: d59e75eef52d ("drm/nouveau: exec: report max pushs through getparam")
Signed-off-by: Randy Dunlap 


Thanks for fixing this up, applied to drm-misc-fixes.

- Danilo


Cc: Dave Airlie 
Cc: Danilo Krummrich 
Cc: Karol Herbst 
Cc: Lyude Paul 
Cc: dri-de...@lists.freedesktop.org
Cc: nouveau@lists.freedesktop.org
Cc: Bragatheswaran Manickavel 
---
v2: update commit text; somehow I sent a version of the patch before
 adding the full text.
v1: https://lore.kernel.org/lkml/20231007005518.32015-1-rdun...@infradead.org/

  include/uapi/drm/nouveau_drm.h |4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff -- a/include/uapi/drm/nouveau_drm.h b/include/uapi/drm/nouveau_drm.h
--- a/include/uapi/drm/nouveau_drm.h
+++ b/include/uapi/drm/nouveau_drm.h
@@ -45,8 +45,8 @@ extern "C" {
  #define NOUVEAU_GETPARAM_HAS_BO_USAGE    15
  #define NOUVEAU_GETPARAM_HAS_PAGEFLIP    16
  
-/**
- * @NOUVEAU_GETPARAM_EXEC_PUSH_MAX
+/*
+ * NOUVEAU_GETPARAM_EXEC_PUSH_MAX - query max pushes through getparam
  *
  * Query the maximum amount of IBs that can be pushed through a single
  * &drm_nouveau_exec structure and hence a single &DRM_IOCTL_NOUVEAU_EXEC
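For reference, kernel-doc can attach per-constant documentation to enum
members via the @name syntax, but there is no equivalent hook for an
object-like #define, which is why the "/**" marker is downgraded to a plain
comment here. A small illustration with made-up names (not part of
nouveau_drm.h):

/**
 * enum example_param - hypothetical parameter IDs (illustration only)
 * @EXAMPLE_PARAM_FOO: documented per-member, which kernel-doc understands
 * @EXAMPLE_PARAM_BAR: another documented member
 */
enum example_param {
	EXAMPLE_PARAM_FOO,
	EXAMPLE_PARAM_BAR,
};

/*
 * EXAMPLE_PARAM_BAZ - a bare #define has no kernel-doc hook, so a plain
 * C comment formatted in the same style is used instead.
 */
#define EXAMPLE_PARAM_BAZ 2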





Re: [Nouveau] [PATCH drm-misc-next 2/3] drm/gpuva_mgr: generalize dma_resv/extobj handling and GEM validation

2023-10-12 Thread Daniel Vetter
On Thu, Oct 12, 2023 at 02:35:15PM +0200, Christian König wrote:
> Am 12.10.23 um 12:33 schrieb Dave Airlie:
> > On Wed, 11 Oct 2023 at 17:07, Christian König  
> > wrote:
> > > Am 10.10.23 um 22:23 schrieb Dave Airlie:
> > > > > I think we're then optimizing for different scenarios. Our compute
> > > > > driver will use mostly external objects only, and if shared, I don't
> > > > > foresee them bound to many VMs. What saves us currently here is that in
> > > > > compute mode we only really traverse the extobj list after a preempt
> > > > > fence wait, or when a vm is using a new context for the first time. So
> > > > > vm's extobj list is pretty large. Each bo's vma list will typically be
> > > > > pretty small.
> > > > Can I ask why we are optimising for this userspace, this seems
> > > > incredibly broken.
> > > > 
> > > > We've had this sort of problem in the past with Intel letting the tail
> > > > wag the horse, does anyone remember optimising relocations for a
> > > > userspace that didn't actually need to use relocations?
> > > > 
> > > > We need to ask why this userspace is doing this, can we get some
> > > > pointers to it? compute driver should have no reason to use mostly
> > > > external objects, the OpenCL and level0 APIs should be good enough to
> > > > figure this out.
> > > Well, that is a pretty normal use case; AMD works the same way.
> > > 
> > > In a multi GPU compute stack you have mostly all the data shared between
> > > different hardware devices.
> > > 
> > > As I said before, looking at just the Vulkan use case is not a good idea
> > > at all.
> > > 
> > It's okay, I don't think anyone is doing that; some of these
> > use-cases are buried in server land and you guys don't communicate
> > them very well.
> 
> Yeah, well everybody is trying very hard to get away from those approaches
> :)
> 
> But so far there hasn't been any breakthrough.
> 
> > 
> > multi-gpu compute would, I'd hope, be moving towards HMM/SVM type
> > solutions though?
> 
> Unfortunately not in the foreseeable future. HMM seems more and more like a
> dead end, at least for AMD.
> 
> AMD still has hardware support in all of their MI* products, but for Navi
> the features necessary for implementing HMM have been dropped. And it looks
> more and more like they are not going to come back.
> 
> In addition to that, from the software side Felix summarized it quite well in
> the HMM peer2peer discussion thread recently. A buffer object based
> approach is not only simpler to handle, but also, performance-wise, multiple
> orders of magnitude faster.

This matches what I'm hearing from all over. Turns out that handling page
faults in full generality in a compute/accel device (not just gpu) is just
too damn hard. At least for anyone who isn't nvidia. Usually time bound
preemption guarantees are the first to go, followed right after by a long
list of more fixed function hardware blocks that outright can't cope with
page faults.

There's so many corner cases where it breaks down that I feel like device
driver allocated memory of one flavor or another will stick around for a
very long time.

This isn't even counting the software challenges.
-Sima

> > I'm also not into looking at use-cases that used to be important but
> > might not be as important going forward.
> 
> Well multimedia applications and OpenGL are still around, but it's not the
> main focus any more.
> 
> Christian.
> 
> > 
> > Dave.
> > 
> > 
> > > Christian.
> > > 
> > > > Dave.
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
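As background for the extobj discussion above, the idea being debated is
roughly the following: VM-private BOs share the VM's dma_resv, while shared
("external") BOs keep their own, so only the latter need to be tracked and
re-validated individually, e.g. after a preempt-fence wait. A simplified
sketch with hypothetical names (not the gpuva_mgr/GPUVM code under review;
example_validate_bo() stands in for a driver's validate/rebind hook, and
per-object reservation locking is omitted):

#include <linux/list.h>
#include <drm/drm_gem.h>

struct example_vm {
	struct dma_resv *resv;		/* reservation shared by VM-private BOs */
	struct list_head extobj_list;	/* BOs whose dma_resv differs from resv */
};

struct example_vm_bo {
	struct drm_gem_object *obj;
	struct list_head vm_entry;	/* link in example_vm::extobj_list */
};

int example_validate_bo(struct drm_gem_object *obj);	/* hypothetical driver hook */

/* Only externally shared BOs need to be tracked on the extobj list. */
static void example_vm_track_bo(struct example_vm *vm, struct example_vm_bo *vm_bo)
{
	if (vm_bo->obj->resv != vm->resv)
		list_add_tail(&vm_bo->vm_entry, &vm->extobj_list);
}

/*
 * Revalidation (e.g. after a preempt-fence wait) then walks only the extobj
 * list; VM-private BOs are covered by locking vm->resv once.
 */
static int example_vm_validate_extobjs(struct example_vm *vm)
{
	struct example_vm_bo *vm_bo;
	int ret;

	list_for_each_entry(vm_bo, &vm->extobj_list, vm_entry) {
		ret = example_validate_bo(vm_bo->obj);
		if (ret)
			return ret;
	}
	return 0;
}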


Re: [Nouveau] [PATCH drm-misc-next 2/3] drm/gpuva_mgr: generalize dma_resv/extobj handling and GEM validation

2023-10-12 Thread Christian König

Am 12.10.23 um 12:33 schrieb Dave Airlie:

On Wed, 11 Oct 2023 at 17:07, Christian König  wrote:

Am 10.10.23 um 22:23 schrieb Dave Airlie:

I think we're then optimizing for different scenarios. Our compute
driver will use mostly external objects only, and if shared, I don't
forsee them bound to many VMs. What saves us currently here is that in
compute mode we only really traverse the extobj list after a preempt
fence wait, or when a vm is using a new context for the first time. So
vm's extobj list is pretty large. Each bo's vma list will typically be
pretty small.

Can I ask why we are optimising for this userspace, this seems
incredibly broken.

We've had this sort of problem in the past with Intel letting the tail
wag the horse, does anyone remember optimising relocations for a
userspace that didn't actually need to use relocations?

We need to ask why this userspace is doing this, can we get some
pointers to it? compute driver should have no reason to use mostly
external objects, the OpenCL and level0 APIs should be good enough to
figure this out.

Well, that is a pretty normal use case; AMD works the same way.

In a multi GPU compute stack you have mostly all the data shared between
different hardware devices.

As I said before, looking at just the Vulkan use case is not a good idea
at all.


It's okay, I don't think anyone is doing that; some of these
use-cases are buried in server land and you guys don't communicate
them very well.


Yeah, well everybody is trying very hard to get away from those 
approaches :)


But so far there hasn't been any breakthrough.



multi-gpu compute would, I'd hope, be moving towards HMM/SVM type
solutions though?


Unfortunately not in the foreseeable future. HMM seems more and more 
like a dead end, at least for AMD.


AMD still has hardware support in all of their MI* products, but for 
Navi the features necessary for implementing HMM have been dropped. And 
it looks more and more like they are not going to come back.


In addition to that, from the software side Felix summarized it quite well in
the HMM peer2peer discussion thread recently. A buffer object based
approach is not only simpler to handle, but also, performance-wise, multiple
orders of magnitude faster.



I'm also not into looking at use-cases that used to be important but
might not be as important going forward.


Well multimedia applications and OpenGL are still around, but it's not 
the main focus any more.


Christian.



Dave.



Christian.


Dave.




Re: [Nouveau] [PATCH drm-misc-next 2/3] drm/gpuva_mgr: generalize dma_resv/extobj handling and GEM validation

2023-10-12 Thread Dave Airlie
On Wed, 11 Oct 2023 at 17:07, Christian König  wrote:
>
> Am 10.10.23 um 22:23 schrieb Dave Airlie:
> >> I think we're then optimizing for different scenarios. Our compute
> >> driver will use mostly external objects only, and if shared, I don't
> >> foresee them bound to many VMs. What saves us currently here is that in
> >> compute mode we only really traverse the extobj list after a preempt
> >> fence wait, or when a vm is using a new context for the first time. So
> >> vm's extobj list is pretty large. Each bo's vma list will typically be
> >> pretty small.
> > Can I ask why we are optimising for this userspace, this seems
> > incredibly broken.
> >
> > We've had this sort of problem in the past with Intel letting the tail
> > wag the horse, does anyone remember optimising relocations for a
> > userspace that didn't actually need to use relocations?
> >
> > We need to ask why this userspace is doing this, can we get some
> > pointers to it? compute driver should have no reason to use mostly
> > external objects, the OpenCL and level0 APIs should be good enough to
> > figure this out.
>
> Well, that is a pretty normal use case; AMD works the same way.
>
> In a multi GPU compute stack you have mostly all the data shared between
> different hardware devices.
>
> As I said before, looking at just the Vulkan use case is not a good idea
> at all.
>

It's okay, I don't think anyone is doing that; some of these
use-cases are buried in server land and you guys don't communicate
them very well.

multi-gpu compute would, I'd hope, be moving towards HMM/SVM type
solutions though?

I'm also not into looking at use-cases that used to be important but
might not be as important going forward.

Dave.


> Christian.
>
> >
> > Dave.
>