Re: [Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

2023-01-27 Thread Linux kernel regression tracking (Thorsten Leemhuis)
On 27.01.23 20:46, Chris Clayton wrote:
> [Resend because the mail client on my phone decided to turn HTML on behind my 
> back, so my reply got bounced.]
> 
> Thanks Thorsten.
> 
> I did try to revert but it didn't revert cleanly and I don't have the 
> knowledge to fix it up.
> 
> The patch was part of a merge that included a number of related patches. 
> Tomorrow, I'll try to revert the lot and report back.

You are free to do so, but there is no need for that from my side. I
only wanted to know if a simple revert would do the trick; if it
doesn't, in my experience it is often best to leave things to the
developers of the code in question, as they know it best and thus have a
better idea which hidden side effects a more complex revert might have.

Ciao, Thorsten

> On 27/01/2023 11:20, Linux kernel regression tracking (Thorsten Leemhuis) 
> wrote:
>> Hi, this is your Linux kernel regression tracker. Top-posting for once,
>> to make this easily accessible to everyone.
>>
>> @nouveau-maintainers, did anyone take a look at this? The report is
>> already 8 days old and I don't see a single reply. Sure, we'll likely
>> get a -rc8, but still it would be good to not fix this at the finish line.
>>
>> Chris, btw, did you check whether you can revert the commit on top of the
>> latest mainline? And if so, does it fix the problem?
>>
>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>> --
>> Everything you wanna know about Linux kernel regression tracking:
>> https://linux-regtracking.leemhuis.info/about/#tldr
>> If I did something stupid, please tell me, as explained on that page.
>>
>> #regzbot poke
>>
>> On 19.01.23 15:33, Linux kernel regression tracking (Thorsten Leemhuis)
>> wrote:
>>> [adding various lists and the two other nouveau maintainers to the list
>>> of recipients]
>>
>>> On 18.01.23 21:59, Chris Clayton wrote:
 Hi.

 I built and installed the latest development kernel earlier this week. 
 I've found that when I try to shut the laptop down (or
 reboot it), it hangs right at the end of closing the current session. The 
 last line I see on the screen when rebooting is:

sd 4:0:0:0: [sda] Synchronising SCSI cache

 when closing down I see one additional line:

sd 4:0:0:0: [sda] Stopping disk

 In both cases the machine then hangs and I have to hold down the power 
 button for a few seconds to switch it off.

 Linux 6.1 is OK but 6.2-rc1 hangs, so I bisected between these two and 
 landed on:

# first bad commit: [0e44c21708761977dcbea9b846b51a6fb684907a] 
 drm/nouveau/flcn: new code to load+boot simple HS FWs
 (VPR scrubber)

 I built and installed a kernel with 
 f15cde64b66161bfa74fb58f4e5697d8265b802e (the parent of the bad commit) 
 checked out
 and that shuts down and reboots fine. I then did the same with the bad 
 commit checked out and that does indeed hang, so
 I'm confident the bisect outcome is OK.

 Kernels 6.1.6 and 5.15.88 are also OK.

 My system has dual GPUs - one Intel and one NVidia. Related extracts from 
 'lspci -v' are:

 00:02.0 VGA compatible controller: Intel Corporation CometLake-H GT2 [UHD 
 Graphics] (rev 05) (prog-if 00 [VGA controller])
 Subsystem: CLEVO/KAPOK Computer CometLake-H GT2 [UHD Graphics]
 Flags: bus master, fast devsel, latency 0, IRQ 142
 Memory at c200 (64-bit, non-prefetchable) [size=16M]
 Memory at a000 (64-bit, prefetchable) [size=256M]
 I/O ports at 5000 [size=64]
 Expansion ROM at 000c [virtual] [disabled] [size=128K]
 Capabilities: [40] Vendor Specific Information: Len=0c
 Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
 Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable- 64bit-
 Capabilities: [d0] Power Management version 2
 Kernel driver in use: i915
 Kernel modules: i915


 01:00.0 VGA compatible controller: NVIDIA Corporation TU117M [GeForce GTX 
 1650 Ti Mobile] (rev a1) (prog-if 00 [VGA
 controller])
 Subsystem: CLEVO/KAPOK Computer TU117M [GeForce GTX 1650 Ti Mobile]
 Flags: bus master, fast devsel, latency 0, IRQ 141
 Memory at c400 (32-bit, non-prefetchable) [size=16M]
 Memory at b000 (64-bit, prefetchable) [size=256M]
 Memory at c000 (64-bit, prefetchable) [size=32M]
 I/O ports at 4000 [size=128]
 Expansion ROM at c300 [disabled] [size=512K]
 Capabilities: [60] Power Management version 3
 Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
 Capabilities: [78] Express Legacy Endpoint, MSI 00
 Kernel driver in use: nouveau
 Kernel modules: nouveau

 DRI_PRIME=1 is exported in one of my init scripts (yes, I am still using sysvinit).

Re: [Nouveau] [PATCH] drm/nouveau/mmu: fix Use after Free bug in nvkm_vmm_node_split

2023-01-27 Thread Danilo Krummrich
On Fri, Jan 27, 2023 at 01:10:46PM +0100, Takashi Iwai wrote:
> On Tue, 03 Jan 2023 15:07:55 +0100,
> Takashi Iwai wrote:
> > 
> > On Fri, 30 Dec 2022 08:27:58 +0100,
> > Zheng Wang wrote:
> > > 
> > > Here is the function call chain:
> > > nvkm_vmm_pfn_map->nvkm_vmm_pfn_split_merge->nvkm_vmm_node_split
> > > If nvkm_vma_tail returns NULL in nvkm_vmm_node_split, it will
> > > finally invoke nvkm_vmm_node_merge->nvkm_vmm_node_delete, which
> > > will free the vma. However, nvkm_vmm_pfn_map doesn't notice that;
> > > it proceeds to the next label and a use-after-free happens.
> > > 
> > > Fix it by returning the return-value of nvkm_vmm_node_merge
> > > instead of NULL.
> > > 
> > > Signed-off-by: Zheng Wang 
> > 
> > FWIW, CVE-2023-0030 has been assigned to this bug.
> > It's a question whether it really deserves to be treated as a security
> > issue, but a bug is a bug...
> > 
> > Ben, could you review this please?
> 
> A gentle ping as a reminder. The bug is still present.

This was also reported in [1]. I had a closer look and FWICT this code is fine
and there isn't a bug.

Zheng Wang, the reporter of the BZ, also confirmed this to be a false positive
from CodeQL.

Anyway, here's the explanation I also posted in the BZ:

"In nvkm_vmm_node_merge() nvkm_vmm_node_delete() is only called when prev is
set. However, prev is NULL unless we enter the "if (vma->addr != addr)" path in
nvkm_vmm_node_split(). In such a case the vma pointer, which is also passed to
nvkm_vmm_node_merge(), is set to a freshly allocated struct nvkm_vma with
nvkm_vma_tail() right before prev is set to the old vma pointer.

Hence, the only thing happening there when nvkm_vma_tail() fails in the
"if (vma->size != size)" path is that either nvkm_vmm_node_merge() does nothing
in case prev wasn't set or it merges and frees the new vma created in the
"if (vma->addr != addr)" path. Or in other words the proper cleanup for the
error condition is done.

I can't see any case where the original vma pointer given by nvkm_vmm_pfn_map()
is actually freed."

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2157041
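To make that easier to follow, here is a compact, self-contained model of
the control flow described above; all structs and helpers below are
simplified stand-ins for the nvkm originals, not the actual driver code:

  /* uaf_model.c - models the nvkm_vmm_node_split() error path. */
  #include <stdio.h>
  #include <stdlib.h>

  struct vma { long addr, size; };

  static int fail_tail;   /* force the second vma_tail() call to fail */

  /* Stand-in for nvkm_vma_tail(): split @tail bytes off the end of @v. */
  static struct vma *vma_tail(struct vma *v, long tail)
  {
      struct vma *n;
      if (fail_tail)
          return NULL;
      n = malloc(sizeof(*n));
      n->addr = v->addr + v->size - tail;
      n->size = tail;
      v->size -= tail;
      return n;
  }

  /* Stand-in for nvkm_vmm_node_merge(): deletes @vma only when @prev is
   * set, i.e. only the node freshly created in the head-split path,
   * never the caller's original vma. */
  static struct vma *node_merge(struct vma *prev, struct vma *vma)
  {
      if (prev) {
          prev->size += vma->size;
          free(vma);
          return prev;
      }
      return vma;
  }

  /* Stand-in for nvkm_vmm_node_split(): prev stays NULL unless the
   * "vma->addr != addr" path runs, and in that case vma already points
   * at a freshly allocated node afterwards. */
  static struct vma *node_split(struct vma *vma, long addr, long size)
  {
      struct vma *prev = NULL;

      if (vma->addr != addr) {
          prev = vma;
          if (!(vma = vma_tail(vma, vma->size + vma->addr - addr)))
              return NULL;
      }
      if (vma->size != size) {
          fail_tail = 1;  /* simulate the allocation failure */
          if (!vma_tail(vma, vma->size - size))
              return node_merge(prev, vma);  /* proper cleanup */
      }
      return vma;
  }

  int main(void)
  {
      struct vma v = { 0, 100 };
      /* addr == v.addr: prev stays NULL, node_merge() is a no-op and
       * the caller's vma survives intact - no use-after-free. */
      printf("in %p, out %p\n", (void *)&v, (void *)node_split(&v, 0, 50));
      return 0;
  }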

- Danilo

> 
> 
> thanks,
> 
> Takashi
> 
> > 
> > 
> > thanks,
> > 
> > Takashi
> > 
> > > ---
> > >  drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c 
> > > b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
> > > index ae793f400ba1..84d6fc87b2e8 100644
> > > --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
> > > +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
> > > @@ -937,8 +937,8 @@ nvkm_vmm_node_split(struct nvkm_vmm *vmm,
> > >   if (vma->size != size) {
> > >   struct nvkm_vma *tmp;
> > >   if (!(tmp = nvkm_vma_tail(vma, vma->size - size))) {
> > > - nvkm_vmm_node_merge(vmm, prev, vma, NULL, vma->size);
> > > - return NULL;
> > > + tmp = nvkm_vmm_node_merge(vmm, prev, vma, NULL, 
> > > vma->size);
> > > + return tmp;
> > >   }
> > >   tmp->part = true;
> > >   nvkm_vmm_node_insert(vmm, tmp);
> > > -- 
> > > 2.25.1
> > > 
> 



Re: [Nouveau] [PATCH drm-next 05/14] drm/nouveau: new VM_BIND uapi interfaces

2023-01-27 Thread Danilo Krummrich

On 1/27/23 16:17, Christian König wrote:

On 27.01.23 at 15:44, Danilo Krummrich wrote:

[SNIP]


What you want is one component for tracking the VA allocations 
(drm_mm based) and a different component/interface for tracking the 
VA mappings (probably rb tree based).


That's what the GPUVA manager is doing. There are gpuva_regions 
which correspond to VA allocations and gpuvas which represent the 
mappings. Both are tracked separately (currently both with a 
separate drm_mm, though). However, the GPUVA manager needs to take 
regions into account when dealing with mappings to make sure the 
GPUVA manager doesn't propose that drivers merge across region 
boundaries. Speaking from a userspace PoV, the kernel wouldn't merge 
mappings from different VKBuffer objects even if they're virtually 
and physically contiguous.


Those are two completely different things and shouldn't be handled in 
a single component.


They are different things, but they're related in a way that for 
handling the mappings (in particular merging and sparse) the GPUVA 
manager needs to know the VA allocation (or region) boundaries.


I have the feeling there might be a misunderstanding. Userspace is in 
charge of actually allocating a portion of VA space and managing it. The 
GPUVA manager just needs to know about those VA space allocations and 
hence keeps track of them.


The GPUVA manager is not meant to be an allocator in the sense of 
finding and providing a hole for a given request.
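As a quick sketch of that distinction (the names are hypothetical, purely
illustrative, and not actual GPUVA manager API):

  struct va_space;

  /* Allocator style - what the GPUVA manager is NOT meant to be: the
   * kernel searches the VA space and hands out a free hole. */
  unsigned long va_alloc(struct va_space *vs, unsigned long size);

  /* Tracking style - what is described above: userspace already chose
   * addr and size; the kernel merely records the allocation so that
   * region boundaries can be honored later on. */
  int va_track(struct va_space *vs, unsigned long addr, unsigned long size);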


Maybe the non-ideal choice of using drm_mm was implying something else.


Uff, well long story short that doesn't even remotely match the 
requirements. This way the GPUVA manager won't be usable for a whole 
bunch of use cases.


What we have are mappings which say X needs to point to Y with this and 
hw dependent flags.


The whole idea of having ranges is not going to fly. Neither with AMD 
GPUs, and I strongly think not with Intel's Xe either.


A range in the sense of the GPUVA manager simply represents a VA space 
allocation (which in the case of Nouveau is done in userspace). Userspace 
allocates the portion of VA space and lets the kernel know about it. The 
current implementation needs that for the named reasons. So, I think 
there is no reason why this would work with one GPU, but not with 
another. It's just part of the design choice of the manager.


And I'm absolutely happy to discuss the details of the manager 
implementation though.




We should probably talk about the design of the GPUVA manager once 
more if it is meant to be applicable to all GPU drivers.


That's what I'm trying to figure out with this RFC, how to make it 
applicable for all GPU drivers, so I'm happy to discuss this. :-)


Yeah, that was a really good idea :) The proposal here is really far away 
from the actual requirements.




And those are the ones I'm looking for. Do you mind sharing the 
requirements for amdgpu in particular?


For sparse residency the kernel also needs to know the region 
boundaries to make sure that it keeps sparse mappings around.


What?


When userspace creates a new VKBuffer with the 
VK_BUFFER_CREATE_SPARSE_BINDING_BIT the kernel may need to create 
sparse mappings in order to ensure that using this buffer without any 
memory backed mappings doesn't fault the GPU.


Currently, the implementation does this the following way:

1. Userspace creates a new VKBuffer and hence allocates a portion of 
the VA space for it. It calls into the kernel indicating the new VA 
space region and the fact that the region is sparse.


2. The kernel picks up the region and stores it in the GPUVA manager, 
the driver creates the corresponding sparse mappings / page table 
entries.


3. Userspace might ask the driver to create a couple of memory backed 
mappings for this particular VA region. The GPUVA manager stores the 
mapping parameters, the driver creates the corresponding page table 
entries.


4. Userspace might ask to unmap all the memory backed mappings from 
this particular VA region. The GPUVA manager removes the mapping 
parameters, the driver cleans up the corresponding page table entries. 
However, the driver also needs to re-create the sparse mappings, since 
it's a sparse buffer, hence it needs to know the boundaries of the 
region it needs to create the sparse mappings in.
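As a toy sketch of those four steps (every name below is hypothetical and
only illustrates why step 4 needs the region boundaries; none of it is
actual GPUVA manager or nouveau code):

  #include <stdio.h>

  struct bo     { int id; };                    /* stand-in for a GEM BO */
  struct region { unsigned long addr, size; int sparse; };

  static void pte_sparse(unsigned long a, unsigned long s)
  {
      printf("PTEs [0x%lx+0x%lx] -> sparse (writes ignored, reads 0)\n", a, s);
  }

  static void pte_backed(unsigned long a, unsigned long s, struct bo *b)
  {
      printf("PTEs [0x%lx+0x%lx] -> bo %d\n", a, s, b->id);
  }

  /* Steps 1+2: track the region, back the whole range with sparse PTEs. */
  static void op_alloc_sparse(struct region *r)
  {
      pte_sparse(r->addr, r->size);
  }

  /* Step 3: a memory backed mapping inside the region overrides them. */
  static void op_map(struct region *r, unsigned long a, unsigned long s,
                     struct bo *b)
  {
      (void)r;
      pte_backed(a, s, b);
  }

  /* Step 4: unmapping inside a *sparse* region must restore sparse PTEs
   * instead of clearing them - this is where the boundaries matter. */
  static void op_unmap(struct region *r, unsigned long a, unsigned long s)
  {
      if (r->sparse)
          pte_sparse(a, s);
      else
          printf("PTEs [0x%lx+0x%lx] -> unmapped\n", a, s);
  }

  int main(void)
  {
      struct region r = { 0x100000, 0x40000, 1 };  /* sparse VKBuffer */
      struct bo buf = { 7 };

      op_alloc_sparse(&r);                   /* steps 1+2 */
      op_map(&r, 0x110000, 0x1000, &buf);    /* step 3    */
      op_unmap(&r, 0x110000, 0x1000);        /* step 4    */
      return 0;
  }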


Again, this is not how things are working. First of all the kernel 
absolutely should *NOT* know about those regions.


What we have inside the kernel is the information about what happens when an 
address X is accessed. On AMD HW this can be:


1. Route to the PCIe bus because the mapped BO is stored in system memory.
2. Route to the internal MC because the mapped BO is stored in local 
memory.

3. Route to other GPUs in the same hive.
4. Route to some doorbell to kick off other work.
...
x. Ignore write, return 0 on reads (this is what is used for sparse 
mappings).

x+1. Trigger a recoverable page fault. This is used for things like SVA.
x+2. Trigger a non-recoverable page fault. This is used for things like 
unmapped regions where access is illegal.

Re: [Nouveau] [PATCH drm-next 05/14] drm/nouveau: new VM_BIND uapi interfaces

2023-01-27 Thread David Airlie
On Sat, Jan 28, 2023 at 1:17 AM Christian König
 wrote:
>
> On 27.01.23 at 15:44, Danilo Krummrich wrote:
> > [SNIP]
> 
>  What you want is one component for tracking the VA allocations
>  (drm_mm based) and a different component/interface for tracking the
>  VA mappings (probably rb tree based).
> >>>
> >>> That's what the GPUVA manager is doing. There are gpuva_regions
> >>> which correspond to VA allocations and gpuvas which represent the
> >>> mappings. Both are tracked separately (currently both with a
> >>> separate drm_mm, though). However, the GPUVA manager needs to take
> >>> regions into account when dealing with mappings to make sure the
> >>> GPUVA manager doesn't propose that drivers merge across region
> >>> boundaries. Speaking from a userspace PoV, the kernel wouldn't merge
> >>> mappings from different VKBuffer objects even if they're virtually
> >>> and physically contiguous.
> >>
> >> Those are two completely different things and shouldn't be handled in
> >> a single component.
> >
> > They are different things, but they're related in a way that for
> > handling the mappings (in particular merging and sparse) the GPUVA
> > manager needs to know the VA allocation (or region) boundaries.
> >
> > I have the feeling there might be a misunderstanding. Userspace is in
> > charge of actually allocating a portion of VA space and managing it. The
> > GPUVA manager just needs to know about those VA space allocations and
> > hence keeps track of them.
> >
> > The GPUVA manager is not meant to be an allocator in the sense of
> > finding and providing a hole for a given request.
> >
> > Maybe the non-ideal choice of using drm_mm was implying something else.
>
> Uff, well long story short that doesn't even remotely match the
> requirements. This way the GPUVA manager won't be usable for a whole
> bunch of use cases.
>
> What we have are mappings which say X needs to point to Y with this and
> hw dependent flags.
>
> The whole idea of having ranges is not going to fly. Neither with AMD
> GPUs, and I strongly think not with Intel's Xe either.
>
> >> We should probably talk about the design of the GPUVA manager once
> >> more if it is meant to be applicable to all GPU drivers.
> >
> > That's what I'm trying to figure out with this RFC, how to make it
> > applicable for all GPU drivers, so I'm happy to discuss this. :-)
>
> Yeah, that was a really good idea :) The proposal here is really far away
> from the actual requirements.
>
> >>> For sparse residency the kernel also needs to know the region
> >>> boundaries to make sure that it keeps sparse mappings around.
> >>
> >> What?
> >
> > When userspace creates a new VKBuffer with the
> > VK_BUFFER_CREATE_SPARSE_BINDING_BIT the kernel may need to create
> > sparse mappings in order to ensure that using this buffer without any
> > memory backed mappings doesn't fault the GPU.
> >
> > Currently, the implementation does this the following way:
> >
> > 1. Userspace creates a new VKBuffer and hence allocates a portion of
> > the VA space for it. It calls into the kernel indicating the new VA
> > space region and the fact that the region is sparse.
> >
> > 2. The kernel picks up the region and stores it in the GPUVA manager,
> > the driver creates the corresponding sparse mappings / page table
> > entries.
> >
> > 3. Userspace might ask the driver to create a couple of memory backed
> > mappings for this particular VA region. The GPUVA manager stores the
> > mapping parameters, the driver creates the corresponding page table
> > entries.
> >
> > 4. Userspace might ask to unmap all the memory backed mappings from
> > this particular VA region. The GPUVA manager removes the mapping
> > parameters, the driver cleans up the corresponding page table entries.
> > However, the driver also needs to re-create the sparse mappings, since
> > it's a sparse buffer, hence it needs to know the boundaries of the
> > region it needs to create the sparse mappings in.
>
> Again, this is not how things are working. First of all the kernel
> absolutely should *NOT* know about those regions.
>
> What we have inside the kernel is the information about what happens when an
> address X is accessed. On AMD HW this can be:
>
> 1. Route to the PCIe bus because the mapped BO is stored in system memory.
> 2. Route to the internal MC because the mapped BO is stored in local memory.
> 3. Route to other GPUs in the same hive.
> 4. Route to some doorbell to kick off other work.
> ...
> x. Ignore write, return 0 on reads (this is what is used for sparse
> mappings).
> x+1. Trigger a recoverable page fault. This is used for things like SVA.
> x+2. Trigger a non-recoverable page fault. This is used for things like
> unmapped regions where access is illegal.
>
> All this is plus some hw specific caching flags.
>
> When Vulkan allocates a sparse VKBuffer what should happen is the following:
>
> 1. The Vulkan driver somehow figures out a VA region A..B for the
> buffer. This can be in userspace (libdrm_amdgpu) or 

Re: [Nouveau] [PATCH] drm/nouveau/disp: Fix nvif_outp_acquire_dp() argument size

2023-01-27 Thread Kees Cook
On Wed, Jan 25, 2023 at 04:24:19PM -0500, Lyude Paul wrote:
> Sorry! I've been pretty busy until now, this is:
> 
> Reviewed-by: Lyude Paul 
> 
> Let me know if you've pushed it already or if you want me to push it to
> drm-misc

Either way is fine. I'm currently carrying it, but I can easily drop it
if you prefer it go via drm-misc.

Thanks!

-Kees

> 
> On Wed, 2023-01-25 at 12:15 -0800, Kees Cook wrote:
> > Ping. I'll take this via my tree unless someone else wants to take it...
> > 
> > On Sun, Nov 27, 2022 at 10:30:41AM -0800, Kees Cook wrote:
> > > Both Coverity and GCC with -Wstringop-overflow noticed that
> > > nvif_outp_acquire_dp() accidentally defined its second argument with 1
> > > additional element:
> > > 
> > > drivers/gpu/drm/nouveau/dispnv50/disp.c: In function 
> > > 'nv50_pior_atomic_enable':
> > > drivers/gpu/drm/nouveau/dispnv50/disp.c:1813:17: error: 
> > > 'nvif_outp_acquire_dp' accessing 16 bytes in a region of size 15 
> > > [-Werror=stringop-overflow=]
> > >  1813 | nvif_outp_acquire_dp(&nv_encoder->outp, 
> > > nv_encoder->dp.dpcd, 0, 0, false, false);
> > >   | 
> > > ^~~
> > > drivers/gpu/drm/nouveau/dispnv50/disp.c:1813:17: note: referencing 
> > > argument 2 of type 'u8[16]' {aka 'unsigned char[16]'}
> > > drivers/gpu/drm/nouveau/include/nvif/outp.h:24:5: note: in a call to 
> > > function 'nvif_outp_acquire_dp'
> > >24 | int nvif_outp_acquire_dp(struct nvif_outp *, u8 dpcd[16],
> > >   | ^~~~
> > > 
> > > Avoid these warnings by defining the argument size using the matching
> > > define (DP_RECEIVER_CAP_SIZE, 15) instead of having it be a literal
> > > (and incorrect) value (16).
> > > 
> > > Reported-by: coverity-bot 
> > > Addresses-Coverity-ID: 1527269 ("Memory - corruptions")
> > > Addresses-Coverity-ID: 1527268 ("Memory - corruptions")
> > > Link: https://lore.kernel.org/lkml/202211100848.FFBA2432@keescook/
> > > Link: https://lore.kernel.org/lkml/202211100848.F4C2819BB@keescook/
> > > Fixes: 813443721331 ("drm/nouveau/disp: move DP link config into acquire")
> > > Cc: Ben Skeggs 
> > > Cc: Karol Herbst 
> > > Cc: Lyude Paul 
> > > Cc: David Airlie 
> > > Cc: Daniel Vetter 
> > > Cc: Dave Airlie 
> > > Cc: "Gustavo A. R. Silva" 
> > > Cc: dri-de...@lists.freedesktop.org
> > > Cc: nouveau@lists.freedesktop.org
> > > Signed-off-by: Kees Cook 
> > > ---
> > >  drivers/gpu/drm/nouveau/include/nvif/outp.h | 3 ++-
> > >  drivers/gpu/drm/nouveau/nvif/outp.c | 2 +-
> > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/nouveau/include/nvif/outp.h 
> > > b/drivers/gpu/drm/nouveau/include/nvif/outp.h
> > > index 45daadec3c0c..fa76a7b5e4b3 100644
> > > --- a/drivers/gpu/drm/nouveau/include/nvif/outp.h
> > > +++ b/drivers/gpu/drm/nouveau/include/nvif/outp.h
> > > @@ -3,6 +3,7 @@
> > >  #define __NVIF_OUTP_H__
> > >  #include 
> > >  #include 
> > > +#include 
> > >  struct nvif_disp;
> > >  
> > >  struct nvif_outp {
> > > @@ -21,7 +22,7 @@ int nvif_outp_acquire_rgb_crt(struct nvif_outp *);
> > >  int nvif_outp_acquire_tmds(struct nvif_outp *, int head,
> > >  bool hdmi, u8 max_ac_packet, u8 rekey, u8 scdc, bool 
> > > hda);
> > >  int nvif_outp_acquire_lvds(struct nvif_outp *, bool dual, bool bpc8);
> > > -int nvif_outp_acquire_dp(struct nvif_outp *, u8 dpcd[16],
> > > +int nvif_outp_acquire_dp(struct nvif_outp *outp, u8 
> > > dpcd[DP_RECEIVER_CAP_SIZE],
> > >int link_nr, int link_bw, bool hda, bool mst);
> > >  void nvif_outp_release(struct nvif_outp *);
> > >  int nvif_outp_infoframe(struct nvif_outp *, u8 type, struct 
> > > nvif_outp_infoframe_v0 *, u32 size);
> > > diff --git a/drivers/gpu/drm/nouveau/nvif/outp.c 
> > > b/drivers/gpu/drm/nouveau/nvif/outp.c
> > > index 7da39f1eae9f..c24bc5eae3ec 100644
> > > --- a/drivers/gpu/drm/nouveau/nvif/outp.c
> > > +++ b/drivers/gpu/drm/nouveau/nvif/outp.c
> > > @@ -127,7 +127,7 @@ nvif_outp_acquire(struct nvif_outp *outp, u8 proto, 
> > > struct nvif_outp_acquire_v0
> > >  }
> > >  
> > >  int
> > > -nvif_outp_acquire_dp(struct nvif_outp *outp, u8 dpcd[16],
> > > +nvif_outp_acquire_dp(struct nvif_outp *outp, u8 
> > > dpcd[DP_RECEIVER_CAP_SIZE],
> > >int link_nr, int link_bw, bool hda, bool mst)
> > >  {
> > >   struct nvif_outp_acquire_v0 args;
> > > -- 
> > > 2.34.1
> > > 
> > 
> 
> -- 
> Cheers,
>  Lyude Paul (she/her)
>  Software Engineer at Red Hat
> 

-- 
Kees Cook
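For reference, the prototype/caller mismatch behind the warning quoted above
can be reproduced in isolation; a minimal sketch, assuming a reasonably
recent GCC (the names are illustrative, not the nouveau ones):

  /* repro.c - build with: gcc -O2 -Wstringop-overflow -c repro.c */
  #define DP_RECEIVER_CAP_SIZE 15

  /* The prototype promises 16 accessible bytes... */
  void acquire_dp(unsigned char dpcd[16]);

  void caller(void)
  {
      /* ...but the caller only provides DP_RECEIVER_CAP_SIZE (15), so GCC
       * reports "accessing 16 bytes in a region of size 15" here. */
      unsigned char dpcd[DP_RECEIVER_CAP_SIZE] = { 0 };
      acquire_dp(dpcd);
  }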


Re: [Nouveau] [PATCH drm-next 05/14] drm/nouveau: new VM_BIND uapi interfaces

2023-01-27 Thread Christian König

On 27.01.23 at 15:44, Danilo Krummrich wrote:

[SNIP]


What you want is one component for tracking the VA allocations 
(drm_mm based) and a different component/interface for tracking the 
VA mappings (probably rb tree based).


That's what the GPUVA manager is doing. There are gpuva_regions 
which correspond to VA allocations and gpuvas which represent the 
mappings. Both are tracked separately (currently both with a 
separate drm_mm, though). However, the GPUVA manager needs to take 
regions into account when dealing with mappings to make sure the 
GPUVA manager doesn't propose that drivers merge across region 
boundaries. Speaking from a userspace PoV, the kernel wouldn't merge 
mappings from different VKBuffer objects even if they're virtually 
and physically contiguous.


Those are two completely different things and shouldn't be handled in 
a single component.


They are different things, but they're related in a way that for 
handling the mappings (in particular merging and sparse) the GPUVA 
manager needs to know the VA allocation (or region) boundaries.


I have the feeling there might be a misunderstanding. Userspace is in 
charge of actually allocating a portion of VA space and managing it. The 
GPUVA manager just needs to know about those VA space allocations and 
hence keeps track of them.


The GPUVA manager is not meant to be an allocator in the sense of 
finding and providing a hole for a given request.


Maybe the non-ideal choice of using drm_mm was implying something else.


Uff, well long story short that doesn't even remotely match the 
requirements. This way the GPUVA manager won't be usable for a whole 
bunch of use cases.


What we have are mappings which say X needs to point to Y with this and 
hw dependent flags.


The whole idea of having ranges is not going to fly. Neither with AMD 
GPUs, and I strongly think not with Intel's Xe either.


We should probably talk about the design of the GPUVA manager once 
more if it is meant to be applicable to all GPU drivers.


That's what I'm trying to figure out with this RFC, how to make it 
applicable for all GPU drivers, so I'm happy to discuss this. :-)


Yeah, that was a really good idea :) The proposal here is really far away 
from the actual requirements.


For sparse residency the kernel also needs to know the region 
boundaries to make sure that it keeps sparse mappings around.


What?


When userspace creates a new VKBuffer with the 
VK_BUFFER_CREATE_SPARSE_BINDING_BIT the kernel may need to create 
sparse mappings in order to ensure that using this buffer without any 
memory backed mappings doesn't fault the GPU.


Currently, the implementation does this the following way:

1. Userspace creates a new VKBuffer and hence allocates a portion of 
the VA space for it. It calls into the kernel indicating the new VA 
space region and the fact that the region is sparse.


2. The kernel picks up the region and stores it in the GPUVA manager, 
the driver creates the corresponding sparse mappings / page table 
entries.


3. Userspace might ask the driver to create a couple of memory backed 
mappings for this particular VA region. The GPUVA manager stores the 
mapping parameters, the driver creates the corresponding page table 
entries.


4. Userspace might ask to unmap all the memory backed mappings from 
this particular VA region. The GPUVA manager removes the mapping 
parameters, the driver cleans up the corresponding page table entries. 
However, the driver also needs to re-create the sparse mappings, since 
it's a sparse buffer, hence it needs to know the boundaries of the 
region it needs to create the sparse mappings in.


Again, this is not how things are working. First of all the kernel 
absolutely should *NOT* know about those regions.


What we have inside the kernel is the information about what happens when an 
address X is accessed. On AMD HW this can be:


1. Route to the PCIe bus because the mapped BO is stored in system memory.
2. Route to the internal MC because the mapped BO is stored in local memory.
3. Route to other GPUs in the same hive.
4. Route to some doorbell to kick off other work.
...
x. Ignore write, return 0 on reads (this is what is used for sparse 
mappings).

x+1. Trigger a recoverable page fault. This is used for things like SVA.
x+2. Trigger a non-recoverable page fault. This is used for things like 
unmapped regions where access is illegal.


All this is plus some hw specific caching flags.
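Sketched out (the names are hypothetical, not actual amdgpu definitions),
the point is that the kernel tracks per-range outcomes plus flags rather
than userspace-defined regions:

  /* Hypothetical encoding of the per-address outcomes listed above. */
  enum va_route {
      VA_ROUTE_SYSMEM,       /* 1. PCIe bus: BO lives in system memory   */
      VA_ROUTE_VRAM,         /* 2. internal MC: BO lives in local memory */
      VA_ROUTE_PEER_GPU,     /* 3. another GPU in the same hive          */
      VA_ROUTE_DOORBELL,     /* 4. doorbell that kicks off other work    */
      VA_ROUTE_SPARSE,       /* x. writes ignored, reads return 0        */
      VA_ROUTE_FAULT_RETRY,  /* x+1. recoverable fault, e.g. SVA         */
      VA_ROUTE_FAULT_FATAL,  /* x+2. non-recoverable fault, e.g. access
                              *      to unmapped, illegal ranges         */
  };

  /* A mapping is then essentially: VA range -> (route, hw caching flags). */
  struct va_mapping {
      unsigned long start, end;
      enum va_route route;
      unsigned int  hw_flags;
  };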

When Vulkan allocates a sparse VKBuffer what should happen is the following:

1. The Vulkan driver somehow figures out a VA region A..B for the 
buffer. This can be in userspace (libdrm_amdgpu) or kernel (drm_mm), but 
essentially is currently driver specific.


2. The kernel gets a request to map the VA range A..B as sparse, meaning 
that it updates the page tables from A..B with the sparse setting.


3. User space asks the kernel to map a couple of memory backings at 
locations A+1, A+10, A+15, etc.


4. The VKBuffer is de-allocated, userspace 

Re: [Nouveau] [PATCH drm-next 05/14] drm/nouveau: new VM_BIND uapi interfaces

2023-01-27 Thread Danilo Krummrich

On 1/27/23 14:23, Christian König wrote:



On 27.01.23 at 14:12, Danilo Krummrich wrote:

On 1/27/23 08:55, Christian König wrote:

On 27.01.23 at 02:26, Danilo Krummrich wrote:

On 1/27/23 02:05, Matthew Brost wrote:

On Wed, Jan 18, 2023 at 07:12:47AM +0100, Danilo Krummrich wrote:

This commit provides the interfaces for the new UAPI motivated by the
Vulkan API. It allows user mode drivers (UMDs) to:

1) Initialize a GPU virtual address (VA) space via the new
    DRM_IOCTL_NOUVEAU_VM_INIT ioctl. UMDs can provide a kernel reserved
    VA area.

2) Bind and unbind GPU VA space mappings via the new
    DRM_IOCTL_NOUVEAU_VM_BIND ioctl.

3) Execute push buffers with the new DRM_IOCTL_NOUVEAU_EXEC ioctl.

Both, DRM_IOCTL_NOUVEAU_VM_BIND and DRM_IOCTL_NOUVEAU_EXEC support
asynchronous processing with DRM syncobjs as synchronization 
mechanism.


The default DRM_IOCTL_NOUVEAU_VM_BIND is synchronous processing,
DRM_IOCTL_NOUVEAU_EXEC supports asynchronous processing only.

Co-authored-by: Dave Airlie 
Signed-off-by: Danilo Krummrich 
---
  Documentation/gpu/driver-uapi.rst |   8 ++
  include/uapi/drm/nouveau_drm.h    | 216 
++

  2 files changed, 224 insertions(+)

diff --git a/Documentation/gpu/driver-uapi.rst 
b/Documentation/gpu/driver-uapi.rst

index 4411e6919a3d..9c7ca6e33a68 100644
--- a/Documentation/gpu/driver-uapi.rst
+++ b/Documentation/gpu/driver-uapi.rst
@@ -6,3 +6,11 @@ drm/i915 uAPI
  =
    .. kernel-doc:: include/uapi/drm/i915_drm.h
+
+drm/nouveau uAPI
+
+
+VM_BIND / EXEC uAPI
+---
+
+.. kernel-doc:: include/uapi/drm/nouveau_drm.h
diff --git a/include/uapi/drm/nouveau_drm.h 
b/include/uapi/drm/nouveau_drm.h

index 853a327433d3..f6e7d40201d4 100644
--- a/include/uapi/drm/nouveau_drm.h
+++ b/include/uapi/drm/nouveau_drm.h
@@ -126,6 +126,216 @@ struct drm_nouveau_gem_cpu_fini {
  __u32 handle;
  };
  +/**
+ * struct drm_nouveau_sync - sync object
+ *
+ * This structure serves as synchronization mechanism for 
(potentially)

+ * asynchronous operations such as EXEC or VM_BIND.
+ */
+struct drm_nouveau_sync {
+    /**
+ * @flags: the flags for a sync object
+ *
+ * The first 8 bits are used to determine the type of the 
sync object.

+ */
+    __u32 flags;
+#define DRM_NOUVEAU_SYNC_SYNCOBJ 0x0
+#define DRM_NOUVEAU_SYNC_TIMELINE_SYNCOBJ 0x1
+#define DRM_NOUVEAU_SYNC_TYPE_MASK 0xf
+    /**
+ * @handle: the handle of the sync object
+ */
+    __u32 handle;
+    /**
+ * @timeline_value:
+ *
+ * The timeline point of the sync object in case the syncobj 
is of

+ * type DRM_NOUVEAU_SYNC_TIMELINE_SYNCOBJ.
+ */
+    __u64 timeline_value;
+};
+
+/**
+ * struct drm_nouveau_vm_init - GPU VA space init structure
+ *
+ * Used to initialize the GPU's VA space for a user client, 
telling the kernel
+ * which portion of the VA space is managed by the UMD and kernel 
respectively.

+ */
+struct drm_nouveau_vm_init {
+    /**
+ * @unmanaged_addr: start address of the kernel managed VA 
space region

+ */
+    __u64 unmanaged_addr;
+    /**
+ * @unmanaged_size: size of the kernel managed VA space 
region in bytes

+ */
+    __u64 unmanaged_size;
+};
+
+/**
+ * struct drm_nouveau_vm_bind_op - VM_BIND operation
+ *
+ * This structure represents a single VM_BIND operation. UMDs 
should pass
+ * an array of this structure via struct drm_nouveau_vm_bind's 
op_ptr field.

+ */
+struct drm_nouveau_vm_bind_op {
+    /**
+ * @op: the operation type
+ */
+    __u32 op;
+/**
+ * @DRM_NOUVEAU_VM_BIND_OP_ALLOC:
+ *
+ * The alloc operation is used to reserve a VA space region 
within the GPU's VA
+ * space. Optionally, the DRM_NOUVEAU_VM_BIND_SPARSE flag can be 
passed to
+ * instruct the kernel to create sparse mappings for the given 
region.

+ */
+#define DRM_NOUVEAU_VM_BIND_OP_ALLOC 0x0


Do you really need this operation? We have no concept of this in Xe,
e.g. we can create a VM and the entire address space is managed 
exactly

the same.


The idea for alloc/free is to let UMDs allocate a portion of the VA 
space (which I call a region), basically the same thing Vulkan 
represents with a VKBuffer.


If that's mangled into the same component/interface then I can say 
from experience that this is a pretty bad idea. We have tried 
something similar with radeon and it turned out horrible.


What was the exact constellation in radeon, and which problems arose 
from it?




What you want is one component for tracking the VA allocations 
(drm_mm based) and a different component/interface for tracking the 
VA mappings (probably rb tree based).


That's what the GPUVA manager is doing. There are gpuva_regions which 
correspond to VA allocations and gpuvas which represent the mappings. 
Both are tracked separately (currently both with a separate drm_mm, 
though). However, the GPUVA manager needs to take regions into account 
when dealing with mappings to make sure the GPUVA manager doesn't 

Re: [Nouveau] [PATCH drm-next 05/14] drm/nouveau: new VM_BIND uapi interfaces

2023-01-27 Thread Christian König




On 27.01.23 at 14:12, Danilo Krummrich wrote:

On 1/27/23 08:55, Christian König wrote:

On 27.01.23 at 02:26, Danilo Krummrich wrote:

On 1/27/23 02:05, Matthew Brost wrote:

On Wed, Jan 18, 2023 at 07:12:47AM +0100, Danilo Krummrich wrote:

This commit provides the interfaces for the new UAPI motivated by the
Vulkan API. It allows user mode drivers (UMDs) to:

1) Initialize a GPU virtual address (VA) space via the new
    DRM_IOCTL_NOUVEAU_VM_INIT ioctl. UMDs can provide a kernel reserved
    VA area.

2) Bind and unbind GPU VA space mappings via the new
    DRM_IOCTL_NOUVEAU_VM_BIND ioctl.

3) Execute push buffers with the new DRM_IOCTL_NOUVEAU_EXEC ioctl.

Both, DRM_IOCTL_NOUVEAU_VM_BIND and DRM_IOCTL_NOUVEAU_EXEC support
asynchronous processing with DRM syncobjs as synchronization 
mechanism.


The default DRM_IOCTL_NOUVEAU_VM_BIND is synchronous processing,
DRM_IOCTL_NOUVEAU_EXEC supports asynchronous processing only.

Co-authored-by: Dave Airlie 
Signed-off-by: Danilo Krummrich 
---
  Documentation/gpu/driver-uapi.rst |   8 ++
  include/uapi/drm/nouveau_drm.h    | 216 
++

  2 files changed, 224 insertions(+)

diff --git a/Documentation/gpu/driver-uapi.rst 
b/Documentation/gpu/driver-uapi.rst

index 4411e6919a3d..9c7ca6e33a68 100644
--- a/Documentation/gpu/driver-uapi.rst
+++ b/Documentation/gpu/driver-uapi.rst
@@ -6,3 +6,11 @@ drm/i915 uAPI
  =
    .. kernel-doc:: include/uapi/drm/i915_drm.h
+
+drm/nouveau uAPI
+
+
+VM_BIND / EXEC uAPI
+---
+
+.. kernel-doc:: include/uapi/drm/nouveau_drm.h
diff --git a/include/uapi/drm/nouveau_drm.h 
b/include/uapi/drm/nouveau_drm.h

index 853a327433d3..f6e7d40201d4 100644
--- a/include/uapi/drm/nouveau_drm.h
+++ b/include/uapi/drm/nouveau_drm.h
@@ -126,6 +126,216 @@ struct drm_nouveau_gem_cpu_fini {
  __u32 handle;
  };
  +/**
+ * struct drm_nouveau_sync - sync object
+ *
+ * This structure serves as synchronization mechanism for 
(potentially)

+ * asynchronous operations such as EXEC or VM_BIND.
+ */
+struct drm_nouveau_sync {
+    /**
+ * @flags: the flags for a sync object
+ *
+ * The first 8 bits are used to determine the type of the 
sync object.

+ */
+    __u32 flags;
+#define DRM_NOUVEAU_SYNC_SYNCOBJ 0x0
+#define DRM_NOUVEAU_SYNC_TIMELINE_SYNCOBJ 0x1
+#define DRM_NOUVEAU_SYNC_TYPE_MASK 0xf
+    /**
+ * @handle: the handle of the sync object
+ */
+    __u32 handle;
+    /**
+ * @timeline_value:
+ *
+ * The timeline point of the sync object in case the syncobj 
is of

+ * type DRM_NOUVEAU_SYNC_TIMELINE_SYNCOBJ.
+ */
+    __u64 timeline_value;
+};
+
+/**
+ * struct drm_nouveau_vm_init - GPU VA space init structure
+ *
+ * Used to initialize the GPU's VA space for a user client, 
telling the kernel
+ * which portion of the VA space is managed by the UMD and kernel 
respectively.

+ */
+struct drm_nouveau_vm_init {
+    /**
+ * @unmanaged_addr: start address of the kernel managed VA 
space region

+ */
+    __u64 unmanaged_addr;
+    /**
+ * @unmanaged_size: size of the kernel managed VA space 
region in bytes

+ */
+    __u64 unmanaged_size;
+};
+
+/**
+ * struct drm_nouveau_vm_bind_op - VM_BIND operation
+ *
+ * This structure represents a single VM_BIND operation. UMDs 
should pass
+ * an array of this structure via struct drm_nouveau_vm_bind's 
op_ptr field.

+ */
+struct drm_nouveau_vm_bind_op {
+    /**
+ * @op: the operation type
+ */
+    __u32 op;
+/**
+ * @DRM_NOUVEAU_VM_BIND_OP_ALLOC:
+ *
+ * The alloc operation is used to reserve a VA space region 
within the GPU's VA
+ * space. Optionally, the DRM_NOUVEAU_VM_BIND_SPARSE flag can be 
passed to
+ * instruct the kernel to create sparse mappings for the given 
region.

+ */
+#define DRM_NOUVEAU_VM_BIND_OP_ALLOC 0x0


Do you really need this operation? We have no concept of this in Xe,
e.g. we can create a VM and the entire address space is managed 
exactly

the same.


The idea for alloc/free is to let UMDs allocate a portion of the VA 
space (which I call a region), basically the same thing Vulkan 
represents with a VKBuffer.


If that's mangled into the same component/interface then I can say 
from experience that this is a pretty bad idea. We have tried 
something similar with radeon and it turned out horrible.


What was the exact constellation in radeon, and which problems arose 
from it?




What you want is one component for tracking the VA allocations 
(drm_mm based) and a different component/interface for tracking the 
VA mappings (probably rb tree based).


That's what the GPUVA manager is doing. There are gpuva_regions which 
correspond to VA allocations and gpuvas which represent the mappings. 
Both are tracked separately (currently both with a separate drm_mm, 
though). However, the GPUVA manager needs to take regions into account 
when dealing with mappings to make sure the GPUVA manager doesn't 
propose that drivers merge across region 

Re: [Nouveau] [PATCH drm-next 05/14] drm/nouveau: new VM_BIND uapi interfaces

2023-01-27 Thread Danilo Krummrich

On 1/27/23 08:55, Christian König wrote:

On 27.01.23 at 02:26, Danilo Krummrich wrote:

On 1/27/23 02:05, Matthew Brost wrote:

On Wed, Jan 18, 2023 at 07:12:47AM +0100, Danilo Krummrich wrote:

This commit provides the interfaces for the new UAPI motivated by the
Vulkan API. It allows user mode drivers (UMDs) to:

1) Initialize a GPU virtual address (VA) space via the new
    DRM_IOCTL_NOUVEAU_VM_INIT ioctl. UMDs can provide a kernel reserved
    VA area.

2) Bind and unbind GPU VA space mappings via the new
    DRM_IOCTL_NOUVEAU_VM_BIND ioctl.

3) Execute push buffers with the new DRM_IOCTL_NOUVEAU_EXEC ioctl.

Both, DRM_IOCTL_NOUVEAU_VM_BIND and DRM_IOCTL_NOUVEAU_EXEC support
asynchronous processing with DRM syncobjs as synchronization mechanism.

The default DRM_IOCTL_NOUVEAU_VM_BIND is synchronous processing,
DRM_IOCTL_NOUVEAU_EXEC supports asynchronous processing only.

Co-authored-by: Dave Airlie 
Signed-off-by: Danilo Krummrich 
---
  Documentation/gpu/driver-uapi.rst |   8 ++
  include/uapi/drm/nouveau_drm.h    | 216 
++

  2 files changed, 224 insertions(+)

diff --git a/Documentation/gpu/driver-uapi.rst 
b/Documentation/gpu/driver-uapi.rst

index 4411e6919a3d..9c7ca6e33a68 100644
--- a/Documentation/gpu/driver-uapi.rst
+++ b/Documentation/gpu/driver-uapi.rst
@@ -6,3 +6,11 @@ drm/i915 uAPI
  =
    .. kernel-doc:: include/uapi/drm/i915_drm.h
+
+drm/nouveau uAPI
+
+
+VM_BIND / EXEC uAPI
+---
+
+.. kernel-doc:: include/uapi/drm/nouveau_drm.h
diff --git a/include/uapi/drm/nouveau_drm.h 
b/include/uapi/drm/nouveau_drm.h

index 853a327433d3..f6e7d40201d4 100644
--- a/include/uapi/drm/nouveau_drm.h
+++ b/include/uapi/drm/nouveau_drm.h
@@ -126,6 +126,216 @@ struct drm_nouveau_gem_cpu_fini {
  __u32 handle;
  };
  +/**
+ * struct drm_nouveau_sync - sync object
+ *
+ * This structure serves as synchronization mechanism for 
(potentially)

+ * asynchronous operations such as EXEC or VM_BIND.
+ */
+struct drm_nouveau_sync {
+    /**
+ * @flags: the flags for a sync object
+ *
+ * The first 8 bits are used to determine the type of the sync 
object.

+ */
+    __u32 flags;
+#define DRM_NOUVEAU_SYNC_SYNCOBJ 0x0
+#define DRM_NOUVEAU_SYNC_TIMELINE_SYNCOBJ 0x1
+#define DRM_NOUVEAU_SYNC_TYPE_MASK 0xf
+    /**
+ * @handle: the handle of the sync object
+ */
+    __u32 handle;
+    /**
+ * @timeline_value:
+ *
+ * The timeline point of the sync object in case the syncobj is of
+ * type DRM_NOUVEAU_SYNC_TIMELINE_SYNCOBJ.
+ */
+    __u64 timeline_value;
+};
+
+/**
+ * struct drm_nouveau_vm_init - GPU VA space init structure
+ *
+ * Used to initialize the GPU's VA space for a user client, telling 
the kernel
+ * which portion of the VA space is managed by the UMD and kernel 
respectively.

+ */
+struct drm_nouveau_vm_init {
+    /**
+ * @unmanaged_addr: start address of the kernel managed VA 
space region

+ */
+    __u64 unmanaged_addr;
+    /**
+ * @unmanaged_size: size of the kernel managed VA space region 
in bytes

+ */
+    __u64 unmanaged_size;
+};
+
+/**
+ * struct drm_nouveau_vm_bind_op - VM_BIND operation
+ *
+ * This structure represents a single VM_BIND operation. UMDs 
should pass
+ * an array of this structure via struct drm_nouveau_vm_bind's 
op_ptr field.

+ */
+struct drm_nouveau_vm_bind_op {
+    /**
+ * @op: the operation type
+ */
+    __u32 op;
+/**
+ * @DRM_NOUVEAU_VM_BIND_OP_ALLOC:
+ *
+ * The alloc operation is used to reserve a VA space region within 
the GPU's VA
+ * space. Optionally, the DRM_NOUVEAU_VM_BIND_SPARSE flag can be 
passed to

+ * instruct the kernel to create sparse mappings for the given region.
+ */
+#define DRM_NOUVEAU_VM_BIND_OP_ALLOC 0x0


Do you really need this operation? We have no concept of this in Xe,
e.g. we can create a VM and the entire address space is managed exactly
the same.


The idea for alloc/free is to let UMDs allocate a portion of the VA 
space (which I call a region), basically the same thing Vulkan 
represents with a VKBuffer.


If that's mangled into the same component/interface then I can say from 
experience that this is a pretty bad idea. We have tried something 
similar with radeon and it turned out horrible.


What was the exact constellation in radeon, and which problems arose 
from it?




What you want is one component for tracking the VA allocations (drm_mm 
based) and a different component/interface for tracking the VA mappings 
(probably rb tree based).


That's what the GPUVA manager is doing. There are gpuva_regions which 
correspond to VA allocations and gpuvas which represent the mappings. 
Both are tracked separately (currently both with a separate drm_mm, 
though). However, the GPUVA manager needs to take regions into account 
when dealing with mappings to make sure the GPUVA manager doesn't 
propose that drivers merge across region boundaries. Speaking from a userspace 
PoV, the kernel wouldn't 

Re: [Nouveau] [PATCH] drm/nouveau/mmu: fix Use after Free bug in nvkm_vmm_node_split

2023-01-27 Thread Takashi Iwai
On Tue, 03 Jan 2023 15:07:55 +0100,
Takashi Iwai wrote:
> 
> On Fri, 30 Dec 2022 08:27:58 +0100,
> Zheng Wang wrote:
> > 
> > Here is the function call chain:
> > nvkm_vmm_pfn_map->nvkm_vmm_pfn_split_merge->nvkm_vmm_node_split
> > If nvkm_vma_tail returns NULL in nvkm_vmm_node_split, it will
> > finally invoke nvkm_vmm_node_merge->nvkm_vmm_node_delete, which
> > will free the vma. However, nvkm_vmm_pfn_map doesn't notice that;
> > it proceeds to the next label and a use-after-free happens.
> > 
> > Fix it by returning the return-value of nvkm_vmm_node_merge
> > instead of NULL.
> > 
> > Signed-off-by: Zheng Wang 
> 
> FWIW, CVE-2023-0030 has been assigned to this bug.
> It's a question whether it really deserves to be treated as a security
> issue, but a bug is a bug...
> 
> Ben, could you review this please?

A gentle ping as a reminder. The bug is still present.


thanks,

Takashi

> 
> 
> thanks,
> 
> Takashi
> 
> > ---
> >  drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c 
> > b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
> > index ae793f400ba1..84d6fc87b2e8 100644
> > --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
> > +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
> > @@ -937,8 +937,8 @@ nvkm_vmm_node_split(struct nvkm_vmm *vmm,
> > if (vma->size != size) {
> > struct nvkm_vma *tmp;
> > if (!(tmp = nvkm_vma_tail(vma, vma->size - size))) {
> > -   nvkm_vmm_node_merge(vmm, prev, vma, NULL, vma->size);
> > -   return NULL;
> > +   tmp = nvkm_vmm_node_merge(vmm, prev, vma, NULL, 
> > vma->size);
> > +   return tmp;
> > }
> > tmp->part = true;
> > nvkm_vmm_node_insert(vmm, tmp);
> > -- 
> > 2.25.1
> > 


Re: [Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

2023-01-27 Thread Karol Herbst
Where was the original email sent to, anyway? I don't have it at all.

Anyhow, I suspect we want to fetch logs to see what's happening, but
due to the nature of this bug it might get difficult.

I'm checking out the laptops I have here if I can reproduce this
issue, but I think all mine with Turing GPUs are fine.

Maybe Ben has an idea what might be wrong with
0e44c21708761977dcbea9b846b51a6fb684907a, or if that's an issue which
is already fixed by not-yet-upstreamed patches, as I think I remember Ben
talking about something like that recently.

Karol

On Fri, Jan 27, 2023 at 12:20 PM Linux kernel regression tracking
(Thorsten Leemhuis)  wrote:
>
> Hi, this is your Linux kernel regression tracker. Top-posting for once,
> to make this easily accessible to everyone.
>
> @nouveau-maintainers, did anyone take a look at this? The report is
> already 8 days old and I don't see a single reply. Sure, we'll likely
> get a -rc8, but still it would be good to not fix this at the finish line.
>
> Chris, btw, did you check whether you can revert the commit on top of the
> latest mainline? And if so, does it fix the problem?
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> #regzbot poke
>
> On 19.01.23 15:33, Linux kernel regression tracking (Thorsten Leemhuis)
> wrote:
> > [adding various lists and the two other nouveau maintainers to the list
> > of recipients]
>
> > On 18.01.23 21:59, Chris Clayton wrote:
> >> Hi.
> >>
> >> I built and installed the latest development kernel earlier this week. 
> >> I've found that when I try to shut the laptop down (or
> >> reboot it), it hangs right at the end of closing the current session. The 
> >> last line I see on the screen when rebooting is:
> >>
> >>  sd 4:0:0:0: [sda] Synchronising SCSI cache
> >>
> >> when closing down I see one additional line:
> >>
> >>  sd 4:0:0:0: [sda] Stopping disk
> >>
> >> In both cases the machine then hangs and I have to hold down the power 
> >> button for a few seconds to switch it off.
> >>
> >> Linux 6.1 is OK but 6.2-rc1 hangs, so I bisected between these two and 
> >> landed on:
> >>
> >>  # first bad commit: [0e44c21708761977dcbea9b846b51a6fb684907a] 
> >> drm/nouveau/flcn: new code to load+boot simple HS FWs
> >> (VPR scrubber)
> >>
> >> I built and installed a kernel with 
> >> f15cde64b66161bfa74fb58f4e5697d8265b802e (the parent of the bad commit) 
> >> checked out
> >> and that shuts down and reboots fine. I then did the same with the bad 
> >> commit checked out and that does indeed hang, so
> >> I'm confident the bisect outcome is OK.
> >>
> >> Kernels 6.1.6 and 5.15.88 are also OK.
> >>
> >> My system has dual GPUs - one Intel and one NVidia. Related extracts from 
> >> 'lspci -v' are:
> >>
> >> 00:02.0 VGA compatible controller: Intel Corporation CometLake-H GT2 [UHD 
> >> Graphics] (rev 05) (prog-if 00 [VGA controller])
> >> Subsystem: CLEVO/KAPOK Computer CometLake-H GT2 [UHD Graphics]
> >> Flags: bus master, fast devsel, latency 0, IRQ 142
> >> Memory at c200 (64-bit, non-prefetchable) [size=16M]
> >> Memory at a000 (64-bit, prefetchable) [size=256M]
> >> I/O ports at 5000 [size=64]
> >> Expansion ROM at 000c [virtual] [disabled] [size=128K]
> >> Capabilities: [40] Vendor Specific Information: Len=0c
> >> Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
> >> Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable- 64bit-
> >> Capabilities: [d0] Power Management version 2
> >> Kernel driver in use: i915
> >> Kernel modules: i915
> >>
> >>
> >> 01:00.0 VGA compatible controller: NVIDIA Corporation TU117M [GeForce GTX 
> >> 1650 Ti Mobile] (rev a1) (prog-if 00 [VGA
> >> controller])
> >> Subsystem: CLEVO/KAPOK Computer TU117M [GeForce GTX 1650 Ti Mobile]
> >> Flags: bus master, fast devsel, latency 0, IRQ 141
> >> Memory at c400 (32-bit, non-prefetchable) [size=16M]
> >> Memory at b000 (64-bit, prefetchable) [size=256M]
> >> Memory at c000 (64-bit, prefetchable) [size=32M]
> >> I/O ports at 4000 [size=128]
> >> Expansion ROM at c300 [disabled] [size=512K]
> >> Capabilities: [60] Power Management version 3
> >> Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
> >> Capabilities: [78] Express Legacy Endpoint, MSI 00
> >> Kernel driver in use: nouveau
> >> Kernel modules: nouveau
> >>
> >> DRI_PRIME=1 is exported in one of my init scripts (yes, I am still using 
> >> sysvinit).
> >>
> >> I've attached the bisect.log, but please let me know if I can provide any 
> >> other diagnostics. Please cc me as I'm not
> >> 

Re: [Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

2023-01-27 Thread Linux kernel regression tracking (Thorsten Leemhuis)
Hi, this is your Linux kernel regression tracker. Top-posting for once,
to make this easily accessible to everyone.

@nouveau-maintainers, did anyone take a look at this? The report is
already 8 days old and I don't see a single reply. Sure, we'll likely
get a -rc8, but still it would be good to not fix this at the finish line.

Chris, btw, did you check whether you can revert the commit on top of the
latest mainline? And if so, does it fix the problem?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

On 19.01.23 15:33, Linux kernel regression tracking (Thorsten Leemhuis)
wrote:
> [adding various lists and the two other nouveau maintainers to the list
> of recipients]

> On 18.01.23 21:59, Chris Clayton wrote:
>> Hi.
>>
>> I built and installed the latest development kernel earlier this week. I've 
>> found that when I try to shut the laptop down (or
>> reboot it), it hangs right at the end of closing the current session. The 
>> last line I see on the screen when rebooting is:
>>
>>  sd 4:0:0:0: [sda] Synchronising SCSI cache
>>
>> when closing down I see one additional line:
>>
>>  sd 4:0:0:0: [sda] Stopping disk
>>
>> In both cases the machine then hangs and I have to hold down the power 
>> button for a few seconds to switch it off.
>>
>> Linux 6.1 is OK but 6.2-rc1 hangs, so I bisected between these two and landed 
>> on:
>>
>>  # first bad commit: [0e44c21708761977dcbea9b846b51a6fb684907a] 
>> drm/nouveau/flcn: new code to load+boot simple HS FWs
>> (VPR scrubber)
>>
>> I built and installed a kernel with f15cde64b66161bfa74fb58f4e5697d8265b802e 
>> (the parent of the bad commit) checked out
>> and that shuts down and reboots fine. I then did the same with the bad 
>> commit checked out and that does indeed hang, so
>> I'm confident the bisect outcome is OK.
>>
>> Kernels 6.1.6 and 5.15.88 are also OK.
>>
>> My system has dual GPUs - one Intel and one NVidia. Related extracts from 
>> 'lspci -v' are:
>>
>> 00:02.0 VGA compatible controller: Intel Corporation CometLake-H GT2 [UHD 
>> Graphics] (rev 05) (prog-if 00 [VGA controller])
>> Subsystem: CLEVO/KAPOK Computer CometLake-H GT2 [UHD Graphics]
>> Flags: bus master, fast devsel, latency 0, IRQ 142
>> Memory at c200 (64-bit, non-prefetchable) [size=16M]
>> Memory at a000 (64-bit, prefetchable) [size=256M]
>> I/O ports at 5000 [size=64]
>> Expansion ROM at 000c [virtual] [disabled] [size=128K]
>> Capabilities: [40] Vendor Specific Information: Len=0c
>> Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
>> Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable- 64bit-
>> Capabilities: [d0] Power Management version 2
>> Kernel driver in use: i915
>> Kernel modules: i915
>>
>>
>> 01:00.0 VGA compatible controller: NVIDIA Corporation TU117M [GeForce GTX 
>> 1650 Ti Mobile] (rev a1) (prog-if 00 [VGA
>> controller])
>> Subsystem: CLEVO/KAPOK Computer TU117M [GeForce GTX 1650 Ti Mobile]
>> Flags: bus master, fast devsel, latency 0, IRQ 141
>> Memory at c400 (32-bit, non-prefetchable) [size=16M]
>> Memory at b000 (64-bit, prefetchable) [size=256M]
>> Memory at c000 (64-bit, prefetchable) [size=32M]
>> I/O ports at 4000 [size=128]
>> Expansion ROM at c300 [disabled] [size=512K]
>> Capabilities: [60] Power Management version 3
>> Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
>> Capabilities: [78] Express Legacy Endpoint, MSI 00
>> Kernel driver in use: nouveau
>> Kernel modules: nouveau
>>
>> DRI_PRIME=1 is exported in one of my init scripts (yes, I am still using 
>> sysvinit).
>>
>> I've attached the bisect.log, but please let me know if I can provide any 
>> other diagnostics. Please cc me as I'm not
>> subscribed.
> 
> Thanks for the report. To be sure the issue doesn't fall through the
> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
> tracking bot:
> 
> #regzbot ^introduced 0e44c2170876197
> #regzbot title drm: nouveau: hangs on poweroff/reboot
> #regzbot ignore-activity
> 
> This isn't a regression? This issue or a fix for it are already
> discussed somewhere else? It was fixed already? You want to clarify when
> the regression started to happen? Or point out I got the title or
> something else totally wrong? Then just reply and tell me -- ideally
> while also telling regzbot about it, as explained by the page listed in
> the footer of this mail.
> 
> Developers: When fixing the issue, remember to add 'Link:' tags pointing
> to the report (the parent of this mail). See page linked in footer for
> details.
> 
> Ciao, Thorsten (wearing 

Re: [Nouveau] [REGRESSION] GM20B probe fails after commit 2541626cfb79

2023-01-27 Thread Nicolas Chauvet
On Fri, Jan 27, 2023 at 07:01, Ben Skeggs  wrote:
>
> On Fri, 20 Jan 2023 at 21:37, Diogo Ivo  wrote:
> >
> > On Wed, Jan 18, 2023 at 11:28:49AM +1000, Ben Skeggs wrote:
> > > On Mon, 16 Jan 2023 at 22:27, Diogo Ivo  
> > > wrote:
> > > > On Mon, Jan 16, 2023 at 07:45:05AM +1000, David Airlie wrote:
> > > > > As a quick check can you try changing
> > > > >
> > > > > drivers/gpu/drm/nouveau/nvkm/core/firmware.c:nvkm_firmware_mem_target
> > > > > from NVKM_MEM_TARGET_HOST to NVKM_MEM_TARGET_NCOH ?
> >
> > > In addition to Dave's change, can you try changing the
> > > nvkm_falcon_load_dmem() call in gm20b_pmu_init() to:
> > >
> > > nvkm_falcon_pio_wr(falcon, (u8 *)&args, 0, 0, DMEM, addr_args,
> > > sizeof(args), 0, false);
> >
> > Hello!
> >
> > Chiming in just to say that with this change I see the same as Nicolas
> > except that the init message size is 255 instead of 0:
> >
> > [2.196934] nouveau 5700.gpu: pmu: unexpected init message size 255 
> > vs 42
> I've attached an entirely untested patch (to go on top of the other
> hacks/fixes so far), that will hopefully get us a little further.
>
> Would be great if you guys could test it out for me.

Hello,

Thanks for the patch. It works for me on jetson-tx1:
---
[ 1022.814699] nouveau 5700.gpu: NVIDIA GM20B (12b000a1)
[ 1022.814750] nouveau 5700.gpu: imem: using IOMMU
[ 1022.893976] nouveau 5700.gpu: DRM: VRAM: 0 MiB
[ 1022.893988] nouveau 5700.gpu: DRM: GART: 1048576 MiB
[ 1022.895356] nouveau 5700.gpu: DRM: MM: using COPY for buffer copies
[ 1022.897046] [drm] Initialized nouveau 1.3.1 20120801 for
5700.gpu on minor 1
---
I've tried to run glmark2-wayland under weston with DRI_PRIME=1; it
seems to work at the beginning, but then I get the following error:

[ 1510.861730] nouveau 5700.gpu: gr: DATA_ERROR 0003
[INVALID_OPERATION] ch 3 [04002a2000 glmark2-wayland[2753]] subc 0
class b197 mthd 19d0 data 003d
[ 1510.952000] nouveau 5700.gpu: gr: DATA_ERROR 0003
[INVALID_OPERATION] ch 3 [04002a2000 glmark2-wayland[2753]] subc 0
class b197 mthd 19d0 data 003d
[ 1510.952060] nouveau 5700.gpu: gr: DATA_ERROR 009c [] ch 3
[04002a2000 glmark2-wayland[2753]] subc 0 class b197 mthd 0d78 data
0006
I think it's a separate error, as I think I can reproduce it on kernel
6.1.x (I will open a separate thread).

So you can add my
Tested-By: Nicolas Chauvet 

Thanks