date:20210609

[Nouveau] [PATCH -next v2] drm/nouveau/gr: use list_move instead of list_del/list_add in nv40.c

2021-06-09 Thread Baokun Li

Using list_move() instead of list_del() + list_add() in nv40.c.

Reported-by: Hulk Robot 
Signed-off-by: Baokun Li 
---
V1->V2:
CC mailist

 drivers/gpu/drm/nouveau/nvkm/engine/gr/nv40.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/gr/nv40.c 
b/drivers/gpu/drm/nouveau/nvkm/engine/gr/nv40.c
index 67f3535ff97e..c65ca839e4de 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/gr/nv40.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/gr/nv40.c
@@ -252,8 +252,7 @@ nv40_gr_intr(struct nvkm_gr *base)
list_for_each_entry(temp, >chan, head) {
if (temp->inst >> 4 == inst) {
chan = temp;
-   list_del(>head);
-   list_add(>head, >chan);
+   list_move(>head, >chan);
break;
}
}

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

[Nouveau] [PATCH -next v2] drm/nouveau/fifo: use list_move instead of list_del/list_add in base.c

2021-06-09 Thread Baokun Li

Using list_move() instead of list_del() + list_add() in base.c.

Reported-by: Hulk Robot 
Signed-off-by: Baokun Li 
---
V1->V2:
CC mailist

 drivers/gpu/drm/nouveau/nvkm/engine/fifo/base.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/base.c 
b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/base.c
index 2ed4ff05d207..1802ac78b78f 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/base.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/base.c
@@ -79,8 +79,7 @@ nvkm_fifo_chan_inst_locked(struct nvkm_fifo *fifo, u64 inst)
struct nvkm_fifo_chan *chan;
list_for_each_entry(chan, >chan, head) {
if (chan->inst->addr == inst) {
-   list_del(>head);
-   list_add(>head, >chan);
+   list_move(>head, >chan);
return chan;
}
}
@@ -109,8 +108,7 @@ nvkm_fifo_chan_chid(struct nvkm_fifo *fifo, int chid, 
unsigned long *rflags)
spin_lock_irqsave(>lock, flags);
list_for_each_entry(chan, >chan, head) {
if (chan->chid == chid) {
-   list_del(>head);
-   list_add(>head, >chan);
+   list_move(>head, >chan);
*rflags = flags;
return chan;
}

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

[Nouveau] [PATCH -next v2] drm/nouveau/mpeg: use list_move instead of list_del/list_add in nv44.c

2021-06-09 Thread Baokun Li

Using list_move() instead of list_del() + list_add() in nv44.c.

Reported-by: Hulk Robot 
Signed-off-by: Baokun Li 
---
V1->V2:
CC mailist

 drivers/gpu/drm/nouveau/nvkm/engine/mpeg/nv44.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/mpeg/nv44.c 
b/drivers/gpu/drm/nouveau/nvkm/engine/mpeg/nv44.c
index 521ce43a2871..16acd33764de 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/mpeg/nv44.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/mpeg/nv44.c
@@ -158,8 +158,7 @@ nv44_mpeg_intr(struct nvkm_engine *engine)
list_for_each_entry(temp, >chan, head) {
if (temp->inst >> 4 == inst) {
chan = temp;
-   list_del(>head);
-   list_add(>head, >chan);
+   list_move(>head, >chan);
break;
}
}

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

[Nouveau] [PATCH -next v2] drm/nouveau/sw: use list_move instead of list_del/list_add in base.c

2021-06-09 Thread Baokun Li

Using list_move() instead of list_del() + list_add() in base.c.

Reported-by: Hulk Robot 
Signed-off-by: Baokun Li 
---
V1->V2:
CC mailist

 drivers/gpu/drm/nouveau/nvkm/engine/sw/base.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/sw/base.c 
b/drivers/gpu/drm/nouveau/nvkm/engine/sw/base.c
index 14871d0bd746..d8c55ea9aa6b 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/sw/base.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/sw/base.c
@@ -37,8 +37,7 @@ nvkm_sw_mthd(struct nvkm_sw *sw, int chid, int subc, u32 
mthd, u32 data)
list_for_each_entry(chan, >chan, head) {
if (chan->fifo->chid == chid) {
handled = nvkm_sw_chan_mthd(chan, subc, mthd, data);
-   list_del(>head);
-   list_add(>head, >chan);
+   list_move(>head, >chan);
break;
}
}

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

Re: [Nouveau] [PATCH v10 07/10] mm: Device exclusive memory access

2021-06-09 Thread Peter Xu

On Wed, Jun 09, 2021 at 07:38:04PM +1000, Alistair Popple wrote:
> On Wednesday, 9 June 2021 4:33:52 AM AEST Peter Xu wrote:
> > On Mon, Jun 07, 2021 at 05:58:52PM +1000, Alistair Popple wrote:
> > 
> > [...]
> > 
> > > +static bool page_make_device_exclusive_one(struct page *page,
> > > + struct vm_area_struct *vma, unsigned long address, void 
> > > *priv)
> > > +{
> > > + struct mm_struct *mm = vma->vm_mm;
> > > + struct page_vma_mapped_walk pvmw = {
> > > + .page = page,
> > > + .vma = vma,
> > > + .address = address,
> > > + };
> > > + struct make_exclusive_args *args = priv;
> > > + pte_t pteval;
> > > + struct page *subpage;
> > > + bool ret = true;
> > > + struct mmu_notifier_range range;
> > > + swp_entry_t entry;
> > > + pte_t swp_pte;
> > > +
> > > + mmu_notifier_range_init_owner(, MMU_NOTIFY_EXCLUSIVE, 0, vma,
> > > +   vma->vm_mm, address, min(vma->vm_end,
> > > +   address + page_size(page)), 
> > > args->owner);
> > > + mmu_notifier_invalidate_range_start();
> > > +
> > > + while (page_vma_mapped_walk()) {
> > > + /* Unexpected PMD-mapped THP? */
> > > + VM_BUG_ON_PAGE(!pvmw.pte, page);
> > 
> > [1]
> > 
> > > +
> > > + if (!pte_present(*pvmw.pte)) {
> > > + ret = false;
> > > + page_vma_mapped_walk_done();
> > > + break;
> > > + }
> > > +
> > > + subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
> > > + address = pvmw.address;
> > 
> > I raised a question here previously and didn't get an answer...
> > 
> > https://lore.kernel.org/linux-mm/YLDr%2FRyAdUR4q0kk@t490s/
> 
> Sorry, I had overlooked that. Will continue the discussion here.

No problem.  I also didn't really express clearly last time, I'm happy we can
discuss this more thoroughly, even if it may be a corner case only.

> 
> > I think I get your point now and it does look possible that the split page 
> > can
> > still be mapped somewhere else as thp, then having some subpage maintainance
> > looks necessary.  The confusing part is above [1] you've also got that
> > VM_BUG_ON_PAGE() assuming it must not be a mapped pmd at all..
> 
> Going back I thought your original question was whether subpage != page is
> possible. My main point was it's possible if we get a thp head. In that case 
> we
> need to replace all pte's with exclusive entries because I haven't (yet)
> defined a pmd version of device exclusive entries and also rmap_walk won't 
> deal
> with tail pages (see below).
> 
> > Then I remembered these code majorly come from the try_to_unmap() so I 
> > looked
> > there.  I _think_ what's missing here is something like:
> > 
> > if (flags & TTU_SPLIT_HUGE_PMD)
> > split_huge_pmd_address(vma, address, false, page);
> > 
> > at the entry of page_make_device_exclusive_one()?
> >
> > That !pte assertion in try_to_unmap() makes sense to me as long as it has 
> > split
> > the thp page first always.  However seems not the case for FOLL_SPLIT_PMD as
> > you previously mentioned.
> 
> At present this is limited to PageAnon pages which have had CoW broken, which 
> I
> think means there shouldn't be other mappings so I expect the PMD will always
> have been split into small PTEs mapping subpages by GUP which is what that
> assertion [1] is checking. I could call split_huge_pmd_address() 
> unconditionally
> as suggested but see the discussion below.

Yes, I think calling that unconditionally should be enough.

> 
> > Meanwhile, I also started to wonder whether it's even right to call 
> > rmap_walk()
> > with tail pages...  Please see below.
> > 
> > > +
> > > + /* Nuke the page table entry. */
> > > + flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> > > + pteval = ptep_clear_flush(vma, address, pvmw.pte);
> > > +
> > > + /* Move the dirty bit to the page. Now the pte is gone. */
> > > + if (pte_dirty(pteval))
> > > + set_page_dirty(page);
> > > +
> > > + /*
> > > +  * Check that our target page is still mapped at the 
> > > expected
> > > +  * address.
> > > +  */
> > > + if (args->mm == mm && args->address == address &&
> > > + pte_write(pteval))
> > > + args->valid = true;
> > > +
> > > + /*
> > > +  * Store the pfn of the page in a special migration
> > > +  * pte. do_swap_page() will wait until the migration
> > > +  * pte is removed and then restart fault handling.
> > > +  */
> > > + if (pte_write(pteval))
> > > + entry = make_writable_device_exclusive_entry(
> > > + 
> > >

Re: [Nouveau] [PATCH v10 07/10] mm: Device exclusive memory access

2021-06-09 Thread Alistair Popple

On Thursday, 10 June 2021 2:05:06 AM AEST Peter Xu wrote:
> On Wed, Jun 09, 2021 at 07:38:04PM +1000, Alistair Popple wrote:
> > On Wednesday, 9 June 2021 4:33:52 AM AEST Peter Xu wrote:
> > > On Mon, Jun 07, 2021 at 05:58:52PM +1000, Alistair Popple wrote:

[...]

> > For thp this means we could end up passing
> > tail pages to rmap_walk(), however it doesn't actually walk them.
> >
> > Based on the results of previous testing I had done I assumed rmap_walk()
> > filtered out tail pages. It does, and I didn't hit the BUG_ON above, but the
> > filtering was not as deliberate as assumed.
> >
> > I've gone back and looked at what was happening in my earlier tests and the
> > tail pages get filtered because the VMA is not getting locked in
> > page_lock_anon_vma_read() due to failing this check:
> >
> >   anon_mapping = (unsigned long)READ_ONCE(page->mapping);
> >   if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> >   goto out;
> >
> > And now I'm not sure it makes sense to read page->mapping of a tail page. So
> > it might be best if we explicitly ignore any tail pages returned from GUP, 
> > at
> > least for now (a future series will improve thp support such as adding a pmd
> > version for exclusive entries).
> 
> I feel like it's illegal to access page->mapping of tail pages; I looked at
> what happens if we call page_anon_vma() on a tail page:
> 
> struct anon_vma *page_anon_vma(struct page *page)
> {
> unsigned long mapping;
> 
> page = compound_head(page);
> mapping = (unsigned long)page->mapping;
> if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> return NULL;
> return __page_rmapping(page);
> }
> 
> It'll just take the head's mapping instead.  It makes sense since the tail 
> page
> shouldn't have a different value against the head page, afaiu.

Right, it makes no sense to look at ->mapping on a tail page because the field
is used for something else. On the 1st tail page it is ->compound_nr and on the
2nd tail page it is ->deferred_list. See the definitions of compound_nr() and
page_deferred_list() respectively. I suppose on the rest of the pages it could
be anything.

I think in practice it is probably ok - iuc bit 0 won't be set for compound_nr
and certainly not for deferred_list->next (a pointer). But none of that seems
intentional, so it would be better to be explicit and not walk the tail pages.

> It would be great if thp experts could chim in.  Before that happens, I agree
> with you that a safer approach is to explicitly not walk a tail page for its
> rmap (and I think the rmap of a tail page will be the same of the head
> anyways.. since they seem to share the anon_vma as quoted).
> >
> > > So... for thp mappings, wondering whether we should do normal GUP (without
> > > SPLIT), pass in always normal or head pages into rmap_walk(), but then
> > > unconditionally split_huge_pmd_address() in 
> > > page_make_device_exclusive_one()?
> >
> > That could work (although I think GUP will still return tail pages - see
> > follow_trans_huge_pmd() which is called from follow_pmd_mask() in gup).
> 
> Agreed.
> 
> > The main problem is split_huge_pmd_address() unconditionally calls a mmu
> > notifier so I would need to plumb in passing an owner everywhere which could
> > get messy.
> 
> Could I ask why?  split_huge_pmd_address() will notify with CLEAR, so I'm a 
> bit
> confused why we need to pass over the owner.

Sure, it is the same reason we need to pass it for the exclusive notifier.
Any invalidation during the make exclusive operation will break the mmu read
side critical section forcing a retry of the operation. The owner field is what
is used to filter out invalidations (such as the exclusive invalidation) that
don't need to be retried.
 
> I thought plumb it right before your EXCLUSIVE notifier init would work?

I did try this just to double check and it doesn't work due to the unconditional
notifier.

> ---8<---
> diff --git a/mm/rmap.c b/mm/rmap.c
> index a94d9aed9d95..360ce86f3822 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2042,6 +2042,12 @@ static bool page_make_device_exclusive_one(struct page 
> *page,
> swp_entry_t entry;
> pte_t swp_pte;
> 
> +   /*
> +* Make sure thps split as device exclusive entries only support pte
> +* level for now.
> +*/
> +   split_huge_pmd_address(vma, address, false, page);
> +
> mmu_notifier_range_init_owner(, MMU_NOTIFY_EXCLUSIVE, 0, vma,
>   vma->vm_mm, address, min(vma->vm_end,
>   address + page_size(page)), 
> args->owner);
> ---8<---
> 
> Thanks,
> 
> --
> Peter Xu
> 




___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

Re: [Nouveau] nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

2021-06-09 Thread Ondrej Zary

On Wednesday 09 June 2021 11:21:05 Christian König wrote:
> Am 09.06.21 um 09:10 schrieb Ondrej Zary:
> > On Wednesday 09 June 2021, Christian König wrote:
> >> Am 09.06.21 um 08:57 schrieb Ondrej Zary:
> >>> [SNIP]
>  Thanks for the heads up. So the problem with my patch is already fixed,
>  isn't it?
> >>> The NULL pointer dereference in nouveau_bo_wr16 introduced in
> >>> 141b15e59175aa174ca1f7596188bd15a7ca17ba was fixed by
> >>> aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.
> >>>
> >>> That's the bug I hit when bisecting the original problem:
> >>> NULL pointer dereference in nouveau_bo_sync_for_device
> >>> It's caused by:
> >>> # first bad commit: [e34b8feeaa4b65725b25f49c9b08a0f8707e8e86] drm/ttm: 
> >>> merge ttm_dma_tt back into ttm_tt
> >> Good that I've asked :)
> >>
> >> Ok that's a bit strange. e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 was
> >> created mostly automated.
> >>
> >> Do you have the original backtrace of that NULL pointer deref once more?
> > The original backtrace is here: 
> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.org%2Flkml%2F2021%2F6%2F5%2F350data=04%7C01%7Cchristian.koenig%40amd.com%7Ce905b6bd2aa842ace15508d92b15b96d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637588195000729460%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=zFqheBbJcOHtYgqG%2Fs63AT1dwuk4REmUDJWHvzaLAlc%3Dreserved=0
> 
> And the problem is that ttm_dma->dma_address is NULL, right? Mhm, I 
> don't see how that can happen since nouveau is using ttm_sg_tt_init().
> 
> Apart from that what nouveau does here is rather questionable since you 
> need a coherent architecture for most things anyway, but that's not what 
> we are trying to fix here.
> 
> Can you try to narrow down if ttm_sg_tt_init is called before calling 
> this function for the tt object in question?

ttm_sg_tt_init is not called:
[   12.150124] nouveau :01:00.0: DRM: VRAM: 31 MiB
[   12.150133] nouveau :01:00.0: DRM: GART: 128 MiB
[   12.150143] nouveau :01:00.0: DRM: BMP version 5.6
[   12.150151] nouveau :01:00.0: DRM: No DCB data found in VBIOS
[   12.151362] ttm_tt_init
[   12.151370] ttm_tt_init_fields
[   12.151374] ttm_tt_alloc_page_directory
[   12.151615] BUG: kernel NULL pointer dereference, address: 



-- 
Ondrej Zary
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

Re: [Nouveau] Trouble with TTM patches w/nouveau in linux-next

2021-06-09 Thread Christian König

Yeah, exactly that's the root cause of the problem.

GEM objects are the base class for TTM BOs for quite a while now.

So we at least need to initialize them or otherwise at least the size of
the object is not available.

Going to send a fix in a few minutes.

Thanks,
Christian.

Am 09.06.21 um 17:13 schrieb Ilia Mirkin:

GEM init happens here:

https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcgit.freedesktop.org%2Fdrm%2Fdrm%2Ftree%2Fdrivers%2Fgpu%2Fdrm%2Fnouveau%2Fnouveau_gem.c%23n221data=04%7C01%7Cchristian.koenig%40amd.com%7C228788b1c8524fa8128b08d92b591b81%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637588483983919147%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=f%2BFFnPEAEoeuD8bKtnGtL5ZU0HBYpJAjqKqD29Xn9Kw%3Dreserved=0

Note the bo alloc / gem init / bo init dance.

I don't think there is a GEM object for internal allocations at all --
we just allocate bo's directly and that's it. Perhaps you meant
something else? I thought GEM was meant for externally-available
objects.

Cheers,

-ilia

On Wed, Jun 9, 2021 at 10:58 AM Christian König
wrote:

Good point, but I think that is unrelated.

My suspicion is rather that nouveau is not initializing the underlying
GEM object for internal allocations.

So what happens is the same as on VMWGFX that TTM doesn't know anything
about the size to of the BO resulting in a kmalloc() with a random value
and eventually -ENOMEM.

Good news is that I can reproduce it, so going to look into that later
today.

Regards,
Christian.

Am 09.06.21 um 16:52 schrieb Ilia Mirkin:

Christian - potentially relevant is that Tegra doesn't have VRAM at
all -- all GTT (or GART or whatever it's called nowadays). No
fake/stolen VRAM.

Cheers,

-ilia

On Wed, Jun 9, 2021 at 10:18 AM Christian König
wrote:

Hi Mikko,

strange sounds like Nouveau was somehow also using the GEM workaround
for VMWGFX as well.

But -12 means -ENOMEM which doesn't fits into the picture.

I will try with a G710, but if that doesn't yields anything I need some
more input from you.

Thanks for the report,
Christian.

Am 09.06.21 um 15:47 schrieb Mikko Perttunen:

Hi,

I'm observing nouveau not initializing recently on linux-next on my
Tegra186 Jetson TX2 board. Specifically it looks like BO allocation is
failing when initializing the sync subsystem:

[ 21.858149] nouveau 1700.gpu: DRM: failed to initialise sync
subsystem, -28

I have been bisecting and I have found two patches that affect this.
Firstly, things first break on

d02117f8efaa drm/ttm: remove special handling for non GEM drivers

starting to return error code -12. Then, at

d79025c7f5e3 drm/ttm: always initialize the full ttm_resource v2

the error code changes to the above -28.

If I checkout one commit prior to d79025c7f5e3 and revert
d02117f8efaa, things work again. There are a bunch of other TTM
commits between this and HEAD, so reverting these on top of HEAD
doesn't work. However, I checked that both yesterday's and today's
nexts are also broken.

Thank you,
Mikko

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Fnouveaudata=04%7C01%7Cchristian.koenig%40amd.com%7C228788b1c8524fa8128b08d92b591b81%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637588483983919147%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=dn0%2FRAOKKddFQbhJcjd3v1L%2BHc4hGlpWIURPTG33g50%3Dreserved=0

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

Re: [Nouveau] Trouble with TTM patches w/nouveau in linux-next

2021-06-09 Thread Christian König

Good point, but I think that is unrelated.

My suspicion is rather that nouveau is not initializing the underlying
GEM object for internal allocations.

So what happens is the same as on VMWGFX that TTM doesn't know anything
about the size to of the BO resulting in a kmalloc() with a random value
and eventually -ENOMEM.

Good news is that I can reproduce it, so going to look into that later
today.

Regards,
Christian.

Am 09.06.21 um 16:52 schrieb Ilia Mirkin:

Christian - potentially relevant is that Tegra doesn't have VRAM at
all -- all GTT (or GART or whatever it's called nowadays). No
fake/stolen VRAM.

Cheers,

-ilia

On Wed, Jun 9, 2021 at 10:18 AM Christian König
wrote:

Hi Mikko,

strange sounds like Nouveau was somehow also using the GEM workaround
for VMWGFX as well.

But -12 means -ENOMEM which doesn't fits into the picture.

I will try with a G710, but if that doesn't yields anything I need some
more input from you.

Thanks for the report,
Christian.

Am 09.06.21 um 15:47 schrieb Mikko Perttunen:

Hi,

I'm observing nouveau not initializing recently on linux-next on my
Tegra186 Jetson TX2 board. Specifically it looks like BO allocation is
failing when initializing the sync subsystem:

[ 21.858149] nouveau 1700.gpu: DRM: failed to initialise sync
subsystem, -28

I have been bisecting and I have found two patches that affect this.
Firstly, things first break on

d02117f8efaa drm/ttm: remove special handling for non GEM drivers

starting to return error code -12. Then, at

d79025c7f5e3 drm/ttm: always initialize the full ttm_resource v2

the error code changes to the above -28.

Thank you,
Mikko

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Fnouveaudata=04%7C01%7Cchristian.koenig%40amd.com%7Caaf09cbea0b04d8dc01208d92b5637ba%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637588472445308290%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=ePoWVtHPXeK5RThkRuQSykKrfWCgPOzG5CLTzfw9%2Fuw%3Dreserved=0

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

Re: [Nouveau] Trouble with TTM patches w/nouveau in linux-next

2021-06-09 Thread Christian König


Hi Mikko,

strange sounds like Nouveau was somehow also using the GEM workaround 
for VMWGFX as well.


But -12 means -ENOMEM which doesn't fits into the picture.

I will try with a G710, but if that doesn't yields anything I need some 
more input from you.


Thanks for the report,
Christian.


Am 09.06.21 um 15:47 schrieb Mikko Perttunen:

Hi,

I'm observing nouveau not initializing recently on linux-next on my 
Tegra186 Jetson TX2 board. Specifically it looks like BO allocation is 
failing when initializing the sync subsystem:


[   21.858149] nouveau 1700.gpu: DRM: failed to initialise sync 
subsystem, -28


I have been bisecting and I have found two patches that affect this. 
Firstly, things first break on


d02117f8efaa drm/ttm: remove special handling for non GEM drivers

starting to return error code -12. Then, at

d79025c7f5e3 drm/ttm: always initialize the full ttm_resource v2

the error code changes to the above -28.

If I checkout one commit prior to d79025c7f5e3 and revert 
d02117f8efaa, things work again. There are a bunch of other TTM 
commits between this and HEAD, so reverting these on top of HEAD 
doesn't work. However, I checked that both yesterday's and today's 
nexts are also broken.


Thank you,
Mikko



___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

Re: [Nouveau] [PATCH] drm/nouveau: init the base GEM fields for internal BOs

2021-06-09 Thread Mikko Perttunen


On 6/9/21 8:29 PM, Christian König wrote:

TTMs buffer objects are based on GEM objects for quite a while
and rely on initializing those fields before initializing the TTM BO.

Noveau now doesn't init the GEM object for internally allocated BOs,


Nouveau


so make sure that we at least initialize some necessary fields.

Signed-off-by: Christian König 
---
  drivers/gpu/drm/nouveau/nouveau_bo.c | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c 
b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 520b1ea9d16c..085023624fb0 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -149,6 +149,8 @@ nouveau_bo_del_ttm(struct ttm_buffer_object *bo)
 */
if (bo->base.dev)
drm_gem_object_release(>base);
+   else
+   dma_resv_fini(>base._resv);
  
  	kfree(nvbo);

  }
@@ -330,6 +332,10 @@ nouveau_bo_new(struct nouveau_cli *cli, u64 size, int 
align,
if (IS_ERR(nvbo))
return PTR_ERR(nvbo);
  
+	nvbo->bo.base.size = size;

+   dma_resv_init(>bo.base._resv);
+   drm_vma_node_reset(>bo.base.vma_node);
+
ret = nouveau_bo_init(nvbo, size, align, domain, sg, robj);
if (ret)
return ret;



That works, thanks for the fix!

Tested-by: Mikko Perttunen 

Mikko
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

[Nouveau] [PATCH] drm/nouveau: init the base GEM fields for internal BOs

2021-06-09 Thread Christian König

TTMs buffer objects are based on GEM objects for quite a while
and rely on initializing those fields before initializing the TTM BO.

Noveau now doesn't init the GEM object for internally allocated BOs,
so make sure that we at least initialize some necessary fields.

Signed-off-by: Christian König 
---
 drivers/gpu/drm/nouveau/nouveau_bo.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c 
b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 520b1ea9d16c..085023624fb0 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -149,6 +149,8 @@ nouveau_bo_del_ttm(struct ttm_buffer_object *bo)
 */
if (bo->base.dev)
drm_gem_object_release(>base);
+   else
+   dma_resv_fini(>base._resv);
 
kfree(nvbo);
 }
@@ -330,6 +332,10 @@ nouveau_bo_new(struct nouveau_cli *cli, u64 size, int 
align,
if (IS_ERR(nvbo))
return PTR_ERR(nvbo);
 
+   nvbo->bo.base.size = size;
+   dma_resv_init(>bo.base._resv);
+   drm_vma_node_reset(>bo.base.vma_node);
+
ret = nouveau_bo_init(nvbo, size, align, domain, sg, robj);
if (ret)
return ret;
-- 
2.25.1

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

Re: [Nouveau] Trouble with TTM patches w/nouveau in linux-next

2021-06-09 Thread Ilia Mirkin

GEM init happens here:

https://cgit.freedesktop.org/drm/drm/tree/drivers/gpu/drm/nouveau/nouveau_gem.c#n221

Note the bo alloc / gem init / bo init dance.

I don't think there is a GEM object for internal allocations at all --
we just allocate bo's directly and that's it. Perhaps you meant
something else? I thought GEM was meant for externally-available
objects.

Cheers,

  -ilia

On Wed, Jun 9, 2021 at 10:58 AM Christian König
 wrote:
>
> Good point, but I think that is unrelated.
>
> My suspicion is rather that nouveau is not initializing the underlying
> GEM object for internal allocations.
>
> So what happens is the same as on VMWGFX that TTM doesn't know anything
> about the size to of the BO resulting in a kmalloc() with a random value
> and eventually -ENOMEM.
>
> Good news is that I can reproduce it, so going to look into that later
> today.
>
> Regards,
> Christian.
>
> Am 09.06.21 um 16:52 schrieb Ilia Mirkin:
> > Christian - potentially relevant is that Tegra doesn't have VRAM at
> > all -- all GTT (or GART or whatever it's called nowadays). No
> > fake/stolen VRAM.
> >
> > Cheers,
> >
> >-ilia
> >
> > On Wed, Jun 9, 2021 at 10:18 AM Christian König
> >  wrote:
> >> Hi Mikko,
> >>
> >> strange sounds like Nouveau was somehow also using the GEM workaround
> >> for VMWGFX as well.
> >>
> >> But -12 means -ENOMEM which doesn't fits into the picture.
> >>
> >> I will try with a G710, but if that doesn't yields anything I need some
> >> more input from you.
> >>
> >> Thanks for the report,
> >> Christian.
> >>
> >>
> >> Am 09.06.21 um 15:47 schrieb Mikko Perttunen:
> >>> Hi,
> >>>
> >>> I'm observing nouveau not initializing recently on linux-next on my
> >>> Tegra186 Jetson TX2 board. Specifically it looks like BO allocation is
> >>> failing when initializing the sync subsystem:
> >>>
> >>> [   21.858149] nouveau 1700.gpu: DRM: failed to initialise sync
> >>> subsystem, -28
> >>>
> >>> I have been bisecting and I have found two patches that affect this.
> >>> Firstly, things first break on
> >>>
> >>> d02117f8efaa drm/ttm: remove special handling for non GEM drivers
> >>>
> >>> starting to return error code -12. Then, at
> >>>
> >>> d79025c7f5e3 drm/ttm: always initialize the full ttm_resource v2
> >>>
> >>> the error code changes to the above -28.
> >>>
> >>> If I checkout one commit prior to d79025c7f5e3 and revert
> >>> d02117f8efaa, things work again. There are a bunch of other TTM
> >>> commits between this and HEAD, so reverting these on top of HEAD
> >>> doesn't work. However, I checked that both yesterday's and today's
> >>> nexts are also broken.
> >>>
> >>> Thank you,
> >>> Mikko
> >>>
> >> ___
> >> Nouveau mailing list
> >> Nouveau@lists.freedesktop.org
> >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Fnouveaudata=04%7C01%7Cchristian.koenig%40amd.com%7Caaf09cbea0b04d8dc01208d92b5637ba%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637588472445308290%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=ePoWVtHPXeK5RThkRuQSykKrfWCgPOzG5CLTzfw9%2Fuw%3Dreserved=0
>
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

Re: [Nouveau] Trouble with TTM patches w/nouveau in linux-next

2021-06-09 Thread Ilia Mirkin

Christian - potentially relevant is that Tegra doesn't have VRAM at
all -- all GTT (or GART or whatever it's called nowadays). No
fake/stolen VRAM.

Cheers,

  -ilia

On Wed, Jun 9, 2021 at 10:18 AM Christian König
 wrote:
>
> Hi Mikko,
>
> strange sounds like Nouveau was somehow also using the GEM workaround
> for VMWGFX as well.
>
> But -12 means -ENOMEM which doesn't fits into the picture.
>
> I will try with a G710, but if that doesn't yields anything I need some
> more input from you.
>
> Thanks for the report,
> Christian.
>
>
> Am 09.06.21 um 15:47 schrieb Mikko Perttunen:
> > Hi,
> >
> > I'm observing nouveau not initializing recently on linux-next on my
> > Tegra186 Jetson TX2 board. Specifically it looks like BO allocation is
> > failing when initializing the sync subsystem:
> >
> > [   21.858149] nouveau 1700.gpu: DRM: failed to initialise sync
> > subsystem, -28
> >
> > I have been bisecting and I have found two patches that affect this.
> > Firstly, things first break on
> >
> > d02117f8efaa drm/ttm: remove special handling for non GEM drivers
> >
> > starting to return error code -12. Then, at
> >
> > d79025c7f5e3 drm/ttm: always initialize the full ttm_resource v2
> >
> > the error code changes to the above -28.
> >
> > If I checkout one commit prior to d79025c7f5e3 and revert
> > d02117f8efaa, things work again. There are a bunch of other TTM
> > commits between this and HEAD, so reverting these on top of HEAD
> > doesn't work. However, I checked that both yesterday's and today's
> > nexts are also broken.
> >
> > Thank you,
> > Mikko
> >
>
> ___
> Nouveau mailing list
> Nouveau@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/nouveau
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

[Nouveau] Trouble with TTM patches w/nouveau in linux-next

2021-06-09 Thread Mikko Perttunen


Hi,

I'm observing nouveau not initializing recently on linux-next on my 
Tegra186 Jetson TX2 board. Specifically it looks like BO allocation is 
failing when initializing the sync subsystem:


[   21.858149] nouveau 1700.gpu: DRM: failed to initialise sync 
subsystem, -28


I have been bisecting and I have found two patches that affect this. 
Firstly, things first break on


d02117f8efaa drm/ttm: remove special handling for non GEM drivers

starting to return error code -12. Then, at

d79025c7f5e3 drm/ttm: always initialize the full ttm_resource v2

the error code changes to the above -28.

If I checkout one commit prior to d79025c7f5e3 and revert d02117f8efaa, 
things work again. There are a bunch of other TTM commits between this 
and HEAD, so reverting these on top of HEAD doesn't work. However, I 
checked that both yesterday's and today's nexts are also broken.


Thank you,
Mikko

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

Re: [Nouveau] nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

2021-06-09 Thread Christian König


Am 09.06.21 um 09:10 schrieb Ondrej Zary:

On Wednesday 09 June 2021, Christian König wrote:

Am 09.06.21 um 08:57 schrieb Ondrej Zary:

[SNIP]

Thanks for the heads up. So the problem with my patch is already fixed,
isn't it?

The NULL pointer dereference in nouveau_bo_wr16 introduced in
141b15e59175aa174ca1f7596188bd15a7ca17ba was fixed by
aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.

That's the bug I hit when bisecting the original problem:
NULL pointer dereference in nouveau_bo_sync_for_device
It's caused by:
# first bad commit: [e34b8feeaa4b65725b25f49c9b08a0f8707e8e86] drm/ttm: merge 
ttm_dma_tt back into ttm_tt

Good that I've asked :)

Ok that's a bit strange. e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 was
created mostly automated.

Do you have the original backtrace of that NULL pointer deref once more?

The original backtrace is here: 
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.org%2Flkml%2F2021%2F6%2F5%2F350data=04%7C01%7Cchristian.koenig%40amd.com%7Ce905b6bd2aa842ace15508d92b15b96d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637588195000729460%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=zFqheBbJcOHtYgqG%2Fs63AT1dwuk4REmUDJWHvzaLAlc%3Dreserved=0


And the problem is that ttm_dma->dma_address is NULL, right? Mhm, I 
don't see how that can happen since nouveau is using ttm_sg_tt_init().


Apart from that what nouveau does here is rather questionable since you 
need a coherent architecture for most things anyway, but that's not what 
we are trying to fix here.


Can you try to narrow down if ttm_sg_tt_init is called before calling 
this function for the tt object in question?


Thanks,
Christian.
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

Re: [Nouveau] nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

2021-06-09 Thread Christian König


Am 09.06.21 um 08:57 schrieb Ondrej Zary:

[SNIP]

Thanks for the heads up. So the problem with my patch is already fixed,
isn't it?

The NULL pointer dereference in nouveau_bo_wr16 introduced in
141b15e59175aa174ca1f7596188bd15a7ca17ba was fixed by
aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.

That's the bug I hit when bisecting the original problem:
NULL pointer dereference in nouveau_bo_sync_for_device
It's caused by:
# first bad commit: [e34b8feeaa4b65725b25f49c9b08a0f8707e8e86] drm/ttm: merge 
ttm_dma_tt back into ttm_tt


Good that I've asked :)

Ok that's a bit strange. e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 was 
created mostly automated.


Do you have the original backtrace of that NULL pointer deref once more?

Thanks,
Christian.
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

Re: [Nouveau] nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

2021-06-09 Thread Christian König


Am 08.06.21 um 23:59 schrieb Ondrej Zary:

On Tuesday 08 June 2021 22:01:56 Ondrej Zary wrote:

On Tuesday 08 June 2021 20:47:42 Ondrej Zary wrote:

On Monday 07 June 2021 22:58:43 Ondrej Zary wrote:

On Sunday 06 June 2021 23:16:03 Ondrej Zary wrote:

On Saturday 05 June 2021 23:34:23 Ondrej Zary wrote:

On Saturday 05 June 2021 21:43:52 Ondrej Zary wrote:

Hello,
I'm testing 5.13.0-rc4 and nouveau crashes with NULL pointer dereference in 
nouveau_bo_sync_for_device.
Found various reports like this but that was back in februaryso that should be 
fixed now.

So it is the same bug. Broken since 5.11. This revert fixes it in 5.11:
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Farchives%2Fdri-devel%2F2021-February%2F298531.htmldata=04%7C01%7Cchristian.koenig%40amd.com%7C605d2e3757ba466bb02a08d92ac8a895%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637587864017853132%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=M5KXSwD%2Fnro3cnCo8Nx4llFu%2Fj2T%2FGQAaMBLeGl0XMc%3Dreserved=0

Added some debug printks to nouveau_bo_sync_for_device:
[   22.225048] ttm_dma=fc33b500
[   22.225066] ttm_dma->num_pages=18
[   22.225071] i=0 num_pages=16
[   22.225077] ttm_dma->dma_address=
[   22.225094] BUG: kernel NULL pointer dereference, address: 

So ttm->dma_address is NULL.


Tested reverting f295c8cfec833c2707ff1512da10d65386dde7af again and it does not 
work...
Not sure what I did before.

Bisecting between 5.10 and 5.11 is impossible - I keep hitting neverending 
stream of bugs.
As always with nouveau...

e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 seems to be the first bad commit
Going back one commit makes it crash in a different way:

[   55.444208] BUG: kernel NULL pointer dereference, address: 01b0
[   55.444219] #PF: supervisor read access in kernel mode
[   55.444222] #PF: error_code(0x) - not-present page
[   55.444225] *pde = 
[   55.444231] Oops:  [#1] SMP
[   55.444237] CPU: 0 PID: 1740 Comm: Xorg Not tainted 5.9.0-rc5+ #361
[   55.444240] Hardware name:  /848P-ICH5, BIOS 6.00 PG 02/03/2005
[   55.444321] EIP: nouveau_bo_wr16+0x8/0x27 [nouveau]
[   55.444326] Code: 85 ff 74 0d 80 7d f3 00 74 07 80 a6 f4 01 00 00 fe 89 f0 e8 0c 
ef ff ff 8d 65 f4 89 f8 5b 5e 5f 5d c3 55 01 d2 89 e5 53 89 c3 <03> 93 b0 01 00 
00 0f b7 c1 f6 83 b8 01 00 00 80 74 07 e8 40 49 69
[   55.444330] EAX:  EBX:  ECX:  EDX: 
[   55.444334] ESI: 0020 EDI: e7a14400 EBP: e786fd98 ESP: e786fd94
[   55.444338] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210246
[   55.444341] CR0: 80050033 CR2: 01b0 CR3: 27896000 CR4: 0690
[   55.444344] Call Trace:
[   55.444395]  nv04_crtc_cursor_set+0x148/0x1d8 [nouveau]
[   55.42]  ? ttm_bo_reserve.constprop.15+0x1c/0x1c [nouveau]
[   55.51]  drm_mode_cursor_common+0x13b/0x1ad
[   55.97]  ? ttm_bo_reserve.constprop.15+0x1c/0x1c [nouveau]
[   55.444504]  drm_mode_cursor_ioctl+0x2e/0x36
[   55.444509]  ? drm_mode_setplane+0x203/0x203
[   55.444514]  drm_ioctl_kernel+0x66/0x99
[   55.444518]  drm_ioctl+0x211/0x2d8
[   55.444522]  ? drm_mode_setplane+0x203/0x203
[   55.444529]  ? _cond_resched+0x1e/0x22
[   55.444533]  ? mutex_lock+0xb/0x24
[   55.444582]  ? nouveau_bo_add_io_reserve_lru+0x53/0x58 [nouveau]
[   55.444589]  ? rpm_resume.part.13+0x72/0x365
[   55.444594]  ? ktime_get_mono_fast_ns+0x5e/0xf2
[   55.444598]  ? __pm_runtime_resume+0x5b/0x63
[   55.444647]  nouveau_drm_ioctl+0x65/0x81 [nouveau]
[   55.444696]  ? nouveau_cli_work+0xc3/0xc3 [nouveau]
[   55.444702]  vfs_ioctl+0x1a/0x24
[   55.444706]  __ia32_sys_ioctl+0x583/0x59d
[   55.444711]  ? doublefault_shim+0x120/0x120
[   55.444717]  ? exit_to_user_mode_prepare+0x71/0xba
[   55.444721]  do_int80_syscall_32+0x2c/0x39
[   55.444725]  entry_INT80_32+0xf0/0xf0
[   55.444729] EIP: 0xb7fb2092
[   55.444733] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 
ff ff ff ff a3 e8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 80  8d b4 26 00 
00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
[   55.444737] EAX: ffda EBX: 000e ECX: c01c64a3 EDX: bfe89750
[   55.444741] ESI: 02580b40 EDI: c01c64a3 EBP: 000e ESP: bfe89704
[   55.444744] DS: 007b ES: 007b FS:  GS: 0033 SS: 007b EFLAGS: 00200292
[   55.444748] Modules linked in: i2c_dev nouveau serial_cs snd_intel8x0 
snd_ac97_codec wmi hwmon ttm ac97_bus 8139cp snd_pcm pcmcia snd_timer snd sg 
soundcore psmouse yenta_socket serio_raw pcmcia_rsrc pcmcia_core intel_agp 
parport_pc parport
[   55.444769] CR2: 01b0
[   55.444774] ---[ end trace e2b0d4c3c2e4e488 ]---
[   55.444827] EIP: nouveau_bo_wr16+0x8/0x27 [nouveau]
[   55.444831] Code: 85 ff 74 0d 80 7d f3 00 74 07 80 a6 f4 01 00 00 fe 89 f0 e8 0c 
ef ff ff 8d 65 f4 89 f8 5b 5e 5f 5d c3 55 01 d2 89 e5 53 89 c3 <03> 93 b0 01 00 
00 0f b7 c1 f6 83 b8 01 00 00 80 74 07 e8 40 49 69
[   55.444835] EAX:  EBX:  ECX:

Re: [Nouveau] [PATCH v10 07/10] mm: Device exclusive memory access

2021-06-09 Thread Alistair Popple

On Wednesday, 9 June 2021 4:33:52 AM AEST Peter Xu wrote:
> On Mon, Jun 07, 2021 at 05:58:52PM +1000, Alistair Popple wrote:
> 
> [...]
> 
> > +static bool page_make_device_exclusive_one(struct page *page,
> > + struct vm_area_struct *vma, unsigned long address, void *priv)
> > +{
> > + struct mm_struct *mm = vma->vm_mm;
> > + struct page_vma_mapped_walk pvmw = {
> > + .page = page,
> > + .vma = vma,
> > + .address = address,
> > + };
> > + struct make_exclusive_args *args = priv;
> > + pte_t pteval;
> > + struct page *subpage;
> > + bool ret = true;
> > + struct mmu_notifier_range range;
> > + swp_entry_t entry;
> > + pte_t swp_pte;
> > +
> > + mmu_notifier_range_init_owner(, MMU_NOTIFY_EXCLUSIVE, 0, vma,
> > +   vma->vm_mm, address, min(vma->vm_end,
> > +   address + page_size(page)), 
> > args->owner);
> > + mmu_notifier_invalidate_range_start();
> > +
> > + while (page_vma_mapped_walk()) {
> > + /* Unexpected PMD-mapped THP? */
> > + VM_BUG_ON_PAGE(!pvmw.pte, page);
> 
> [1]
> 
> > +
> > + if (!pte_present(*pvmw.pte)) {
> > + ret = false;
> > + page_vma_mapped_walk_done();
> > + break;
> > + }
> > +
> > + subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
> > + address = pvmw.address;
> 
> I raised a question here previously and didn't get an answer...
> 
> https://lore.kernel.org/linux-mm/YLDr%2FRyAdUR4q0kk@t490s/

Sorry, I had overlooked that. Will continue the discussion here.

> I think I get your point now and it does look possible that the split page can
> still be mapped somewhere else as thp, then having some subpage maintainance
> looks necessary.  The confusing part is above [1] you've also got that
> VM_BUG_ON_PAGE() assuming it must not be a mapped pmd at all..

Going back I thought your original question was whether subpage != page is
possible. My main point was it's possible if we get a thp head. In that case we
need to replace all pte's with exclusive entries because I haven't (yet)
defined a pmd version of device exclusive entries and also rmap_walk won't deal
with tail pages (see below).

> Then I remembered these code majorly come from the try_to_unmap() so I looked
> there.  I _think_ what's missing here is something like:
> 
> if (flags & TTU_SPLIT_HUGE_PMD)
> split_huge_pmd_address(vma, address, false, page);
> 
> at the entry of page_make_device_exclusive_one()?
>
> That !pte assertion in try_to_unmap() makes sense to me as long as it has 
> split
> the thp page first always.  However seems not the case for FOLL_SPLIT_PMD as
> you previously mentioned.

At present this is limited to PageAnon pages which have had CoW broken, which I
think means there shouldn't be other mappings so I expect the PMD will always
have been split into small PTEs mapping subpages by GUP which is what that
assertion [1] is checking. I could call split_huge_pmd_address() unconditionally
as suggested but see the discussion below.

> Meanwhile, I also started to wonder whether it's even right to call 
> rmap_walk()
> with tail pages...  Please see below.
> 
> > +
> > + /* Nuke the page table entry. */
> > + flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> > + pteval = ptep_clear_flush(vma, address, pvmw.pte);
> > +
> > + /* Move the dirty bit to the page. Now the pte is gone. */
> > + if (pte_dirty(pteval))
> > + set_page_dirty(page);
> > +
> > + /*
> > +  * Check that our target page is still mapped at the expected
> > +  * address.
> > +  */
> > + if (args->mm == mm && args->address == address &&
> > + pte_write(pteval))
> > + args->valid = true;
> > +
> > + /*
> > +  * Store the pfn of the page in a special migration
> > +  * pte. do_swap_page() will wait until the migration
> > +  * pte is removed and then restart fault handling.
> > +  */
> > + if (pte_write(pteval))
> > + entry = make_writable_device_exclusive_entry(
> > + page_to_pfn(subpage));
> > + else
> > + entry = make_readable_device_exclusive_entry(
> > + page_to_pfn(subpage));
> > + swp_pte = swp_entry_to_pte(entry);
> > + if (pte_soft_dirty(pteval))
> > + swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > + if (pte_uffd_wp(pteval))
> > + swp_pte = pte_swp_mkuffd_wp(swp_pte);
> > +
> > + set_pte_at(mm, address, pvmw.pte, swp_pte);
> > +
> >

Re: [Nouveau] nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

2021-06-09 Thread Ondrej Zary

On Wednesday 09 June 2021, Christian König wrote:
> Am 09.06.21 um 08:57 schrieb Ondrej Zary:
> > [SNIP]
> >> Thanks for the heads up. So the problem with my patch is already fixed,
> >> isn't it?
> > The NULL pointer dereference in nouveau_bo_wr16 introduced in
> > 141b15e59175aa174ca1f7596188bd15a7ca17ba was fixed by
> > aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.
> >
> > That's the bug I hit when bisecting the original problem:
> > NULL pointer dereference in nouveau_bo_sync_for_device
> > It's caused by:
> > # first bad commit: [e34b8feeaa4b65725b25f49c9b08a0f8707e8e86] drm/ttm: 
> > merge ttm_dma_tt back into ttm_tt
> 
> Good that I've asked :)
> 
> Ok that's a bit strange. e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 was 
> created mostly automated.
> 
> Do you have the original backtrace of that NULL pointer deref once more?

The original backtrace is here: https://lkml.org/lkml/2021/6/5/350

-- 
Ondrej Zary
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

Re: [Nouveau] nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

2021-06-09 Thread Ondrej Zary

On Wednesday 09 June 2021, Christian König wrote:
> Am 08.06.21 um 23:59 schrieb Ondrej Zary:
> > On Tuesday 08 June 2021 22:01:56 Ondrej Zary wrote:
> >> On Tuesday 08 June 2021 20:47:42 Ondrej Zary wrote:
> >>> On Monday 07 June 2021 22:58:43 Ondrej Zary wrote:
>  On Sunday 06 June 2021 23:16:03 Ondrej Zary wrote:
> > On Saturday 05 June 2021 23:34:23 Ondrej Zary wrote:
> >> On Saturday 05 June 2021 21:43:52 Ondrej Zary wrote:
> >>> Hello,
> >>> I'm testing 5.13.0-rc4 and nouveau crashes with NULL pointer 
> >>> dereference in nouveau_bo_sync_for_device.
> >>> Found various reports like this but that was back in februaryso that 
> >>> should be fixed now.
> >> So it is the same bug. Broken since 5.11. This revert fixes it in 5.11:
> >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Farchives%2Fdri-devel%2F2021-February%2F298531.htmldata=04%7C01%7Cchristian.koenig%40amd.com%7C605d2e3757ba466bb02a08d92ac8a895%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637587864017853132%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=M5KXSwD%2Fnro3cnCo8Nx4llFu%2Fj2T%2FGQAaMBLeGl0XMc%3Dreserved=0
> >>
> >> Added some debug printks to nouveau_bo_sync_for_device:
> >> [   22.225048] ttm_dma=fc33b500
> >> [   22.225066] ttm_dma->num_pages=18
> >> [   22.225071] i=0 num_pages=16
> >> [   22.225077] ttm_dma->dma_address=
> >> [   22.225094] BUG: kernel NULL pointer dereference, address: 
> >>
> >> So ttm->dma_address is NULL.
> >>
> > Tested reverting f295c8cfec833c2707ff1512da10d65386dde7af again and it 
> > does not work...
> > Not sure what I did before.
> >
> > Bisecting between 5.10 and 5.11 is impossible - I keep hitting 
> > neverending stream of bugs.
> > As always with nouveau...
>  e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 seems to be the first bad commit
>  Going back one commit makes it crash in a different way:
> 
>  [   55.444208] BUG: kernel NULL pointer dereference, address: 01b0
>  [   55.444219] #PF: supervisor read access in kernel mode
>  [   55.444222] #PF: error_code(0x) - not-present page
>  [   55.444225] *pde = 
>  [   55.444231] Oops:  [#1] SMP
>  [   55.444237] CPU: 0 PID: 1740 Comm: Xorg Not tainted 5.9.0-rc5+ #361
>  [   55.444240] Hardware name:  /848P-ICH5, BIOS 6.00 PG 02/03/2005
>  [   55.444321] EIP: nouveau_bo_wr16+0x8/0x27 [nouveau]
>  [   55.444326] Code: 85 ff 74 0d 80 7d f3 00 74 07 80 a6 f4 01 00 00 fe 
>  89 f0 e8 0c ef ff ff 8d 65 f4 89 f8 5b 5e 5f 5d c3 55 01 d2 89 e5 53 89 
>  c3 <03> 93 b0 01 00 00 0f b7 c1 f6 83 b8 01 00 00 80 74 07 e8 40 49 69
>  [   55.444330] EAX:  EBX:  ECX:  EDX: 
>  [   55.444334] ESI: 0020 EDI: e7a14400 EBP: e786fd98 ESP: e786fd94
>  [   55.444338] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 
>  00210246
>  [   55.444341] CR0: 80050033 CR2: 01b0 CR3: 27896000 CR4: 0690
>  [   55.444344] Call Trace:
>  [   55.444395]  nv04_crtc_cursor_set+0x148/0x1d8 [nouveau]
>  [   55.42]  ? ttm_bo_reserve.constprop.15+0x1c/0x1c [nouveau]
>  [   55.51]  drm_mode_cursor_common+0x13b/0x1ad
>  [   55.97]  ? ttm_bo_reserve.constprop.15+0x1c/0x1c [nouveau]
>  [   55.444504]  drm_mode_cursor_ioctl+0x2e/0x36
>  [   55.444509]  ? drm_mode_setplane+0x203/0x203
>  [   55.444514]  drm_ioctl_kernel+0x66/0x99
>  [   55.444518]  drm_ioctl+0x211/0x2d8
>  [   55.444522]  ? drm_mode_setplane+0x203/0x203
>  [   55.444529]  ? _cond_resched+0x1e/0x22
>  [   55.444533]  ? mutex_lock+0xb/0x24
>  [   55.444582]  ? nouveau_bo_add_io_reserve_lru+0x53/0x58 [nouveau]
>  [   55.444589]  ? rpm_resume.part.13+0x72/0x365
>  [   55.444594]  ? ktime_get_mono_fast_ns+0x5e/0xf2
>  [   55.444598]  ? __pm_runtime_resume+0x5b/0x63
>  [   55.444647]  nouveau_drm_ioctl+0x65/0x81 [nouveau]
>  [   55.444696]  ? nouveau_cli_work+0xc3/0xc3 [nouveau]
>  [   55.444702]  vfs_ioctl+0x1a/0x24
>  [   55.444706]  __ia32_sys_ioctl+0x583/0x59d
>  [   55.444711]  ? doublefault_shim+0x120/0x120
>  [   55.444717]  ? exit_to_user_mode_prepare+0x71/0xba
>  [   55.444721]  do_int80_syscall_32+0x2c/0x39
>  [   55.444725]  entry_INT80_32+0xf0/0xf0
>  [   55.444729] EIP: 0xb7fb2092
>  [   55.444733] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 
>  00 00 e9 80 ff ff ff ff a3 e8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 
>  80  8d b4 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
>  [   55.444737] EAX: ffda EBX: 000e ECX: c01c64a3 EDX: bfe89750
>  [   55.444741] ESI: 02580b40 EDI: c01c64a3 EBP: 000e ESP: bfe89704
>  [   55.444744] DS: 007b ES: 007b FS:  GS: 0033 SS: 007b EFLAGS: 
>  00200292
>

[Nouveau] [PATCH -next v2] drm/nouveau/gr: use list_move instead of list_del/list_add in nv40.c

[Nouveau] [PATCH -next v2] drm/nouveau/fifo: use list_move instead of list_del/list_add in base.c

[Nouveau] [PATCH -next v2] drm/nouveau/mpeg: use list_move instead of list_del/list_add in nv44.c

[Nouveau] [PATCH -next v2] drm/nouveau/sw: use list_move instead of list_del/list_add in base.c

Re: [Nouveau] [PATCH v10 07/10] mm: Device exclusive memory access

Re: [Nouveau] [PATCH v10 07/10] mm: Device exclusive memory access

Re: [Nouveau] nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

Re: [Nouveau] Trouble with TTM patches w/nouveau in linux-next

Re: [Nouveau] Trouble with TTM patches w/nouveau in linux-next

Re: [Nouveau] Trouble with TTM patches w/nouveau in linux-next

Re: [Nouveau] [PATCH] drm/nouveau: init the base GEM fields for internal BOs

[Nouveau] [PATCH] drm/nouveau: init the base GEM fields for internal BOs

Re: [Nouveau] Trouble with TTM patches w/nouveau in linux-next

Re: [Nouveau] Trouble with TTM patches w/nouveau in linux-next

[Nouveau] Trouble with TTM patches w/nouveau in linux-next

Re: [Nouveau] nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

Re: [Nouveau] nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

Re: [Nouveau] nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

Re: [Nouveau] [PATCH v10 07/10] mm: Device exclusive memory access

Re: [Nouveau] nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

Re: [Nouveau] nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

21 matches

Site Navigation

Mail list logo

Footer information