Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

2024-04-09 Thread Jason Gunthorpe
On Fri, Apr 05, 2024 at 05:42:44PM -0400, Peter Xu wrote:
> In short, hugetlb mappings shouldn't be special compared to other huge pXd
> and large folio (cont-pXd) mappings for most of the walkers in my mind, if
> not all.  I need to look at all the walkers and there can be some tricky
> ones, but I believe that applies in general.  It's actually similar to what
> I did with slow gup here.

I think that is the big question, I also haven't done the research to
know the answer.

At this point focusing on moving what is reasonable to the pXX_* API
makes sense to me. Then reviewing what remains and making some
decision.

> Like this series, for cont-pXd we'll need multiple walks compared to
> before (when with hugetlb_entry()), but for that part I'll provide some
> performance tests too, and we also have a fallback plan, which is to detect
> cont-pXd existence, which will also work for large folios.

I think we can optimize this pretty easily.
 
> > I think if you do the easy places for pXX conversion you will have a
> > good idea about what is needed for the hard places.
> 
> Here IMHO we don't need to understand "what is the size of this hugetlb
> vma"

Yeh, I never really understood why hugetlb was linked to the VMA.. The
page table is self describing, obviously.

> or "which level of pgtable does this hugetlb vma pages locate",

Ditto

> because we may not need that, e.g., when we only want to collect some smaps
> statistics.  "whether it's hugetlb" may matter, though. E.g. in the mm
> walker we see a huge pmd, it can be a thp, it can be a hugetlb (when
> hugetlb_entry removed), we may need extra check later to put things into
> the right bucket, but for the walker itself it doesn't necessarily need
> hugetlb_entry().

Right, places may still need to know it is part of a huge VMA because we
have special stuff linked to that.

> > But then again we come back to power and its big list of page sizes
> > and variety :( Looks like some there have huge sizes at the pgd level
> > at least.
> 
> Yeah this is something I want to be super clear, because I may miss
> something: we don't have real pgd pages, right?  Powerpc doesn't even
> define p4d_leaf(), AFAICT.

AFAICT it is because it hides it all in hugepd.

If the goal is to purge hugepd then some of the options might turn out
to convert hugepd into huge p4d/pgd, as I understand it. It would be
nice to have certainty on this at least.

We have effectively three APIs to parse a single page table and
currently none of the APIs can return 100% of the data for power.

Jason


Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

2024-04-05 Thread Jason Gunthorpe
On Thu, Apr 04, 2024 at 05:48:03PM -0400, Peter Xu wrote:
> On Tue, Mar 26, 2024 at 11:02:52AM -0300, Jason Gunthorpe wrote:
> > The more I look at this the more I think we need to get to Matthew's
> > idea of having some kind of generic page table API that is not tightly
> > tied to level. Replacing the hugetlb trick of 'everything is a PTE'
> > with 5 special cases in every place seems just horrible.
> > 
> >struct mm_walk_ops {
> >int (*leaf_entry)(struct mm_walk_state *state, struct mm_walk *walk);
> >}
> > 
> > And many cases really want something like:
> >struct mm_walk_state state;
> > 
> >if (!mm_walk_seek_leaf(state, mm, address))
> >   goto no_present
> >if (mm_walk_is_write(state)) ..
> > 
> > And detailed walking:
> >for_each_pt_leaf(state, mm, address) {
> >if (mm_walk_is_write(state)) ..
> >}
> > 
> > Replacing it with a mm_walk_state that retains the level or otherwise
> > to allow decoding any entry composes a lot better. Forced Loop
> > unrolling can get back to the current code gen in a lot of places.
> > 
> > It also makes the power stuff a bit nicer as the mm_walk_state could
> > automatically retain back pointers to the higher levels in the state
> > struct too...
> > 
> > The puzzle is how to do it and still get reasonable efficient codegen,
> > many operations are going to end up switching on some state->level to
> > know how to decode the entry.
> 
> These discussions are definitely constructive, thanks Jason.  Very helpful.
> 
> I thought about this last week but got interrupted.  It does make sense to
> me; it looks pretty generic and it is flexible enough as a top design.  At
> least that's what I thought.

Yeah, exactly..

> However now when I rethink about it, and look more into the code when I got
> the chance, it turns out this will be a major rewrite of mostly every
> walkers..  

Indeed, it is why it may not be reasonable.

> Consider that what we (or.. I) want to teach the pXd layers are two things
> right now: (1) hugetlb mappings (2) MMIO (PFN) mappings.  That mostly
> shares the generic concept when working on the mm walkers no matter which
> way to go, just different treatment on different type of mem.  (2) is on
> top of current code and new stuff, while (1) is a refactoring to drop
> hugetlb_entry() hook point as the goal.

Right, I view this as a two pronged attack

On one front you teach the generic pXX_* macros to process huge pages
and push that around to the performance walkers like GUP

On another front you want to replace use of the hugepte with the new
walkers.

The challenge with the hugepte code is that it is all structured to
assume one API that works at all levels and that may be a hard fit to
replace with pXX_* functions.

The places that are easy to switch from hugetlb to pXX_* may as well
do so.

Other places maybe need a hugetlb replacement that has a similar
abstraction level of pointing to any page table level.

I think if you do the easy places for pXX conversion you will have a
good idea about what is needed for the hard places.

> Now the important question I'm asking myself is: do we really need huge p4d
> or even bigger?

Do we need huge p4d support with folios? Probably not..

huge p4d support for pfnmap, eg in VFIO. Yes I think that is possibly
interesting - but I wouldn't ask anyone to do the work :)

But then again we come back to power and its big list of page sizes
and variety :( Looks like some there have huge sizes at the pgd level
at least.

> So, can we over-engineer too much if we go the generic route now?

Yes we can, and it will probably be slow as well. The pXX macros are
the most efficient if code can be adapted to use them.

> Considering that we already have most of pmd/pud entries around in the mm
> walker ops.

Yeah, so you add pgd and maybe p4d and then we don't need any
generic thing. If it is easy it would be nice.

Jason


Re: [RFC PATCH 1/8] mm: Provide pagesize to pmd_populate()

2024-04-04 Thread Jason Gunthorpe
On Wed, Apr 03, 2024 at 06:24:38PM +, Christophe Leroy wrote:
> > If it is a software walker there might be value in just aligning to
> > the contig pte scheme in all levels and forgetting about the variable
> > size page table levels. That quarter page stuff is a PITA to manage
> > the memory allocation for on PPC anyhow..
> 
> Looking one step further, into nohash/32, I see a challenge: on that 
> platform, a PTE is 64 bits while a PGD/PMD entry is 32 bits. It is 
> therefore not possible as such to do PMD leaf or cont-PMD leaf.

Hmm, maybe not, I have a feeling you can hide this detail in the
pmd_offset routine if you pass in the PGD information too.

> - Double the size of PGD/PMD entries, but then we lose atomicity when 
> reading or writing an entry, could this be a problem ?

How does the 64 bit PTE work then? We have ignored this bug on x86 32
bit, but there is a general difficult race with 64 bit atomicity on 32
bit CPUs in the page tables.

Ideally you'd have 64 bit entries at the PMD level that encode the
page size the same as the PTE level. So you hit any level and you know
your size. This is less memory efficient (though every other arch
tolerates this) in general cases.

Can you throw away some bits of PA in the 32 bit entries to signal a
size?
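As a purely hypothetical illustration of that idea (the macro names and bit
positions below are invented, not any real nohash/32 layout), a couple of
spare low bits in the PA field of a 32 bit PMD entry could carry a size
code:

   /* Invented encoding, for illustration only */
   #define PMD_SZ_MASK  0x3UL   /* assumed-spare low bits of the PA */
   #define PMD_SZ_NONE  0x0UL   /* entry points to a PTE table */
   #define PMD_SZ_512K  0x1UL
   #define PMD_SZ_8M    0x2UL

   static inline unsigned long pmd_entry_leaf_size(unsigned long pmdval)
   {
           switch (pmdval & PMD_SZ_MASK) {
           case PMD_SZ_512K:
                   return SZ_512K;
           case PMD_SZ_8M:
                   return SZ_8M;
           default:
                   return 0;    /* not a leaf at this level */
           }
   }

A walker hitting the PMD level could then report the leaf size without
looking at the parent entry, at the cost of two PA bits.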

> - Do as for the 8xx, ie go down to PTEs even for pages greater than 4M.

Aside from the memory waste, this is the most logical thing, go down
far enough that you can encode the desired page size in the PTE and
use the contig PTE scheme.

Jason


Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback

2024-04-04 Thread Jason Gunthorpe
On Wed, Apr 03, 2024 at 02:25:20PM -0400, Peter Xu wrote:

> > I'd say the BUILD_BUG has done its job and found an issue, fix it by
> > not defining pud_leaf? I don't see any calls to pud_leaf in loongarch
> > at least
> 
> Yes, that sounds better too to me, however it means we may also risk other
> archs that can fail another defconfig build.. and I worry I bring trouble
> to multiple such cases.  Fundamentally it's indeed my patch that broke
> those builds, so I still sent the change and leave that for arch developers
> to decide the best for the archs.

But your change causes silent data corruption if the code path is
run.. I think we are overall better to wade through the compile time
bugs from linux-next. Honestly if there were a lot then I'd think there
would be more complaints already.

Maybe it should just be a separate step from this series.

Jason


Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback

2024-04-03 Thread Jason Gunthorpe
On Wed, Apr 03, 2024 at 01:17:06PM +, Christophe Leroy wrote:

> > That commit makes it sounds like the arch supports huge PUD's through
> > the hugepte mechanism - it says a LTP test failed so something
> > populated a huge PUD at least??
> 
> Not sure, I more see it just like a copy/paste of commit 501b81046701 
> ("mips: mm: add p?d_leaf() definitions").
> 
> The commit message says that the test failed because pmd_leaf() is 
> missing, it says nothing about PUD.

AH fair enough, it is probably a copy/paste then

Jason


Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback

2024-04-03 Thread Jason Gunthorpe
On Wed, Apr 03, 2024 at 12:26:43PM +, Christophe Leroy wrote:
> 
> 
> Le 03/04/2024 à 14:08, Jason Gunthorpe a écrit :
> > On Tue, Apr 02, 2024 at 07:35:45PM -0400, Peter Xu wrote:
> >> On Tue, Apr 02, 2024 at 07:53:20PM -0300, Jason Gunthorpe wrote:
> >>> On Tue, Apr 02, 2024 at 06:43:56PM -0400, Peter Xu wrote:
> >>>
> >>>> I actually tested this without hitting the issue (even though I didn't
> >>>> mention it in the cover letter..).  I re-kicked the build test, it turns
> >>>> out my "make alldefconfig" on loongarch will generate a config with both
> >>>> HUGETLB=n && THP=n, while arch/loongarch/configs/loongson3_defconfig has
> >>>> THP=y (which I assume was the one above build used).  I didn't further
> >>>> check how "make alldefconfig" generated the config; a bit surprising that
> >>>> it didn't fetch from there.
> >>>
> >>> I suspect it is weird compiler variations.. Maybe something is not
> >>> being inlined.
> >>>
> >>>> (and it also surprises me that this BUILD_BUG can trigger.. I used to try
> >>>>   triggering it elsewhere but failed..)
> >>>
> >>> As the pud_leaf() == FALSE should result in the BUILD_BUG never being
> >>> called and the optimizer removing it.
> >>
> >> Good point, for some reason loongarch defined pud_leaf() without defining
> >> pud_pfn(), which does look strange.
> >>
> >> #define pud_leaf(pud)  ((pud_val(pud) & _PAGE_HUGE) != 0)
> >>
> >> But I noticed at least MIPS also does it..  Logically I think one arch
> should define either none or both.
> > 
> > Wow, this is definitely an arch issue. You can't define pud_leaf() and
> > not have a pud_pfn(). It makes no sense at all..
> > 
> > I'd say the BUILD_BUG has done its job and found an issue, fix it by
> > not defining pud_leaf? I don't see any calls to pud_leaf in loongarch
> > at least
> 
> As far as I can see it was added by commit 303be4b33562 ("LoongArch: mm: 
> Add p?d_leaf() definitions").

That commit makes it sounds like the arch supports huge PUD's through
the hugepte mechanism - it says a LTP test failed so something
populated a huge PUD at least??

So maybe this?

#define pud_pfn pte_pfn

> Not sure it was added for a good reason, and I'm not sure what was added 
> is correct because arch/loongarch/include/asm/pgtable-bits.h has:
> 
> #define _PAGE_HUGE_SHIFT  6  /* HUGE is a PMD bit */
> 
> So I'm not sure it is correct to use that bit for PUD, is it ?

Could be, lots of arches repeat the bit layouts in each radix
level.. It is essentially why the hugepte trick of pretending every
level is a pte works.
 
Jason


Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback

2024-04-03 Thread Jason Gunthorpe
On Tue, Apr 02, 2024 at 07:35:45PM -0400, Peter Xu wrote:
> On Tue, Apr 02, 2024 at 07:53:20PM -0300, Jason Gunthorpe wrote:
> > On Tue, Apr 02, 2024 at 06:43:56PM -0400, Peter Xu wrote:
> > 
> > > I actually tested this without hitting the issue (even though I didn't
> > > mention it in the cover letter..).  I re-kicked the build test, it turns
> > > out my "make alldefconfig" on loongarch will generate a config with both
> > > HUGETLB=n && THP=n, while arch/loongarch/configs/loongson3_defconfig has
> > > THP=y (which I assume was the one above build used).  I didn't further
> > > check how "make alldefconfig" generated the config; a bit surprising that
> > > it didn't fetch from there.
> > 
> > I suspect it is weird compiler variations.. Maybe something is not
> > being inlined.
> > 
> > > (and it also surprises me that this BUILD_BUG can trigger.. I used to try
> > >  triggering it elsewhere but failed..)
> > 
> > As the pud_leaf() == FALSE should result in the BUILD_BUG never being
> > called and the optimizer removing it.
> 
> Good point, for some reason loongarch defined pud_leaf() without defining
> pud_pfn(), which does look strange.
> 
> #define pud_leaf(pud) ((pud_val(pud) & _PAGE_HUGE) != 0)
> 
> But I noticed at least MIPS also does it..  Logically I think one arch
> should define either none or both.

Wow, this is definitely an arch issue. You can't define pud_leaf() and
not have a pud_pfn(). It makes no sense at all..

I'd say the BUILD_BUG has done its job and found an issue, fix it by
not defining pud_leaf? I don't see any calls to pud_leaf in loongarch
at least

Jason


Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback

2024-04-02 Thread Jason Gunthorpe
On Tue, Apr 02, 2024 at 06:43:56PM -0400, Peter Xu wrote:

> I actually tested this without hitting the issue (even though I didn't
> mention it in the cover letter..).  I re-kicked the build test, it turns
> out my "make alldefconfig" on loongarch will generate a config with both
> HUGETLB=n && THP=n, while arch/loongarch/configs/loongson3_defconfig has
> THP=y (which I assume was the one above build used).  I didn't further
> check how "make alldefconfig" generated the config; a bit surprising that
> it didn't fetch from there.

I suspect it is weird compiler variations.. Maybe something is not
being inlined.

> (and it also surprises me that this BUILD_BUG can trigger.. I used to try
>  triggering it elsewhere but failed..)

As the pud_leaf() == FALSE should result in the BUILD_BUG never being
called and the optimizer removing it.

Perhaps the issue is that the pud_leaf() is too far from the pud_pfn?
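For reference, the fallback being discussed is roughly of this shape (a
sketch reconstructed from the thread, not a verbatim copy of the patch):

   #ifndef pud_pfn
   static inline unsigned long pud_pfn(pud_t pud)
   {
           /* Must never be reachable: only call pud_pfn() on a huge PUD */
           BUILD_BUG();
           return 0;
   }
   #define pud_pfn pud_pfn
   #endif

   /*
    * Intended usage: when pud_leaf() is a compile-time zero (no huge PUD
    * support), the call below is dead code and the BUILD_BUG() above is
    * optimized away.  Defining pud_leaf() without pud_pfn(), as loongarch
    * did, breaks that assumption.
    */
   static unsigned long example_pud_walk(pud_t pud)
   {
           if (pud_leaf(pud))
                   return pud_pfn(pud);
           return 0;
   }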

Jason


Re: [PATCH v1 3/3] mm: use "GUP-fast" instead "fast GUP" in remaining comments

2024-04-02 Thread Jason Gunthorpe
On Tue, Apr 02, 2024 at 02:55:16PM +0200, David Hildenbrand wrote:
> Let's fixup the remaining comments to consistently call that thing
> "GUP-fast". With this change, we consistently call it "GUP-fast".
> 
> Reviewed-by: Mike Rapoport (IBM) 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/filemap.c| 2 +-
>  mm/khugepaged.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH v1 2/3] mm/treewide: rename CONFIG_HAVE_FAST_GUP to CONFIG_HAVE_GUP_FAST

2024-04-02 Thread Jason Gunthorpe
On Tue, Apr 02, 2024 at 02:55:15PM +0200, David Hildenbrand wrote:
> Nowadays, we call it "GUP-fast", the external interface includes
> functions like "get_user_pages_fast()", and we renamed all internal
> functions to reflect that as well.
> 
> Let's make the config option reflect that.
> 
> Reviewed-by: Mike Rapoport (IBM) 
> Signed-off-by: David Hildenbrand 
> ---
>  arch/arm/Kconfig   |  2 +-
>  arch/arm64/Kconfig |  2 +-
>  arch/loongarch/Kconfig |  2 +-
>  arch/mips/Kconfig  |  2 +-
>  arch/powerpc/Kconfig   |  2 +-
>  arch/riscv/Kconfig |  2 +-
>  arch/s390/Kconfig  |  2 +-
>  arch/sh/Kconfig|  2 +-
>  arch/x86/Kconfig   |  2 +-
>  include/linux/rmap.h   |  8 
>  kernel/events/core.c   |  4 ++--
>  mm/Kconfig |  2 +-
>  mm/gup.c   | 10 +-
>  mm/internal.h      |  2 +-
>  14 files changed, 22 insertions(+), 22 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [RFC PATCH 1/8] mm: Provide pagesize to pmd_populate()

2024-03-27 Thread Jason Gunthorpe
On Wed, Mar 27, 2024 at 09:58:35AM +, Christophe Leroy wrote:
> > Just general remarks on the ones with huge pages:
> > 
> >   hash 64k and hugepage 16M/16G
> >   radix 64k/radix hugepage 2M/1G
> >   radix 4k/radix hugepage 2M/1G
> >   nohash 32
> >- I think this is just a normal x86 like scheme? PMD/PUD can be a
> >  leaf with the same size as a next level table.
> > 
> >  Do any of these cases need to know the higher level to parse the
> >  lower? eg is there a 2M bit in the PUD indicating that the PMD
> >  is a table of 2M leafs or does each PMD entry have a bit
> >  indicating it is a leaf?
> 
> For hash and radix there is a bit that tells it is leaf (_PAGE_PTE)
> 
> For nohash32/e500 I think the drawing is not fully right, there is a huge 
> page directory (hugepd) with a single entry. I think it should be 
> possible to change it to a leaf entry, it seems we have bit _PAGE_SW1 
> available in the PTE.

It sounds to me like PPC breaks down into only a couple fundamental
behaviors
 - x86 like leaf in many page levels. Use the pgd/pud/pmd_leaf() and
   related to implement it
 - ARM like contig PTE within a single page table level. Use the
   contig stuff to implement it
 - Contig PTE across two page table levels with a bit in the
   PMD. Needs new support like you showed
 - Page table levels with a variable page size. Ie a PUD can point to
   a directory of 8 pages or 512 pages of different size. Probably
   needs some new core support, but I think your changes to the
   *_offset go a long way already.

> > 
> >   hash 4k and hugepage 16M/16G
> >   nohash 64
> >- How does this work? I guess since 8xx explicitly calls out
> >  consecutive this is actually the pgd can point to 512 256M
> >  entries or 8 16G entries? Ie the table size at each level is
> >  variable? Or is it the same and the table size is still 512 and
> >  each 16G entry is replicated 64 times?
> 
> For those it is using the huge page directory (hugepd) which can be 
> hooked at any level and is a directory of huge pages on its own. There 
> are no consecutive entries involved here I think, although I'm not 
> completely sure.
> 
> For hash4k I'm not sure how it works, this was changed by commit 
> e2b3d202d1db ("powerpc: Switch 16GB and 16MB explicit hugepages to a 
> different page table format")
> 
> For the nohash/64, a PGD entry points either to a regular PUD directory 
> or to a HUGEPD directory. The size of the HUGEPD directory is encoded in 
> the 6 lower bits of the PGD entry.

If it is a software walker there might be value in just aligning to
the contig pte scheme in all levels and forgetting about the variable
size page table levels. That quarter page stuff is a PITA to manage
the memory allocation for on PPC anyhow..

Jason


Re: [PATCH RFC 1/3] mm/gup: consistently name GUP-fast functions

2024-03-27 Thread Jason Gunthorpe
On Wed, Mar 27, 2024 at 02:05:36PM +0100, David Hildenbrand wrote:
> Let's consistently call the "fast-only" part of GUP "GUP-fast" and rename
> all relevant internal functions to start with "gup_fast", to make it
> clearer that this is not ordinary GUP. The current mixture of
> "lockless", "gup" and "gup_fast" is confusing.
> 
> Further, avoid the term "huge" when talking about a "leaf" -- for
> example, we nowadays check pmd_leaf() because pmd_huge() is gone. For the
> "hugepd"/"hugepte" stuff, it's part of the name ("is_hugepd"), so that
> stays.
> 
> What remains is the "external" interface:
> * get_user_pages_fast_only()
> * get_user_pages_fast()
> * pin_user_pages_fast()
> 
> And the "internal" interface that handles GUP-fast + fallback:
> * internal_get_user_pages_fast()

This would like a better name too. How about gup_fast_fallback() ?

> The high-level internal function for GUP-fast is now:
> * gup_fast()
> 
> The basic GUP-fast walker functions:
> * gup_pgd_range() -> gup_fast_pgd_range()
> * gup_p4d_range() -> gup_fast_p4d_range()
> * gup_pud_range() -> gup_fast_pud_range()
> * gup_pmd_range() -> gup_fast_pmd_range()
> * gup_pte_range() -> gup_fast_pte_range()
> * gup_huge_pgd()  -> gup_fast_pgd_leaf()
> * gup_huge_pud()  -> gup_fast_pud_leaf()
> * gup_huge_pmd()  -> gup_fast_pmd_leaf()
> 
> The weird hugepd stuff:
> * gup_huge_pd() -> gup_fast_hugepd()
> * gup_hugepte() -> gup_fast_hugepte()
> 
> The weird devmap stuff:
> * __gup_device_huge_pud() -> gup_fast_devmap_pud_leaf()
> * __gup_device_huge_pmd()  -> gup_fast_devmap_pmd_leaf()
> * __gup_device_huge() -> gup_fast_devmap_leaf()
>
> Helper functions:
> * unpin_user_pages_lockless() -> gup_fast_unpin_user_pages()
> * gup_fast_folio_allowed() is already properly named
> * gup_fast_permitted() is already properly named
> 
> With "gup_fast()", we now even have a function that is referred to in
> comment in mm/mmu_gather.c.
> 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/gup.c | 164 ++++---
>  1 file changed, 84 insertions(+), 80 deletions(-)

I think it is a great idea, it always takes a moment to figure out if
a function is part of the fast callchain or not..

(even better would be to shift the fast stuff into its own file, but I
expect that is too much)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [RFC PATCH 1/8] mm: Provide pagesize to pmd_populate()

2024-03-26 Thread Jason Gunthorpe
On Mon, Mar 25, 2024 at 07:05:01PM +, Christophe Leroy wrote:

> Not looked into details yet, but I guess so.
> 
> By the way there is a wiki dedicated to huge pages on powerpc, you can 
> have a look at it here : 
> https://github.com/linuxppc/wiki/wiki/Huge-pages , maybe you'll find 
> good ideas there to help me.

There sure are a lot of page table types here

I'm a bit wondering about terminology, eg on the first diagram "huge
pte entry" means a PUD entry that is a leaf? Which ones are contiguous
replications?

Just general remarks on the ones with huge pages:

 hash 64k and hugepage 16M/16G
 radix 64k/radix hugepage 2M/1G
 radix 4k/radix hugepage 2M/1G
 nohash 32
  - I think this is just a normal x86 like scheme? PMD/PUD can be a
leaf with the same size as a next level table. 

Do any of these cases need to know the higher level to parse the
lower? eg is there a 2M bit in the PUD indicating that the PMD 
is a table of 2M leafs or does each PMD entry have a bit
indicating it is a leaf?

 hash 4k and hugepage 16M/16G
 nohash 64
  - How does this work? I guess since 8xx explicitly calls out
consecutive this is actually the pgd can point to 512 256M
entries or 8 16G entries? Ie the table size at each level is
variable? Or is it the same and the table size is still 512 and
each 16G entry is replicated 64 times?

Do the offset accessors already abstract this enough?

 8xx 4K
 8xx 16K
   - As this series does?

Jason


Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

2024-03-26 Thread Jason Gunthorpe
On Mon, Mar 25, 2024 at 02:58:48PM -0400, Peter Xu wrote:

> > This remark would be a little easier to understand if you refer to
> > hugetlb_walk() not huge_pte_offset() - the latter is only used to
> > implement hugetlb_walk() and isn't removed by this series, while a
> > single hugetlb_walk() was removed.
> 
> Right.  Here huge_pte_offset() is the arch API that I hope we can remove,
> the hugetlb_walk() is simply the wrapper.

But arguably hugetlb_walk is the thing that should go..

In the generic code we should really try to get away from the weird
hugetlb abuse of treating every level as a pte_t. That is not how the
generic page table API works and that weirdness is part of what
motivates the arch API to supply special functions for reading. Not
every arch can just cast every level to a pte_t and still work.

But that weirdness is also saving a lot of code so something else needs
to be thought up..

> > Regardless, I think the point is spot on, the direction should be to
> > make the page table reading generic with minimal/no interaction with
> > hugetlbfs in the generic code.
> 
> Yes, and I also like your terms on calling them "pgtable readers".  It's a
> better way to describe the difference in that regard between
> huge_pte_offset() v.s. huge_pte_alloc().  Exactly that's my goal, that we
> should start with the "readers".

Yeah, it makes a lot of sense to tackle the readers first - we are
pretty close now to having enough done to have generic readers. I
would imagine tackling everything outside mm/huge*.c to use the normal
page table API for reading.

Then consider what to do with the reading in mm/huge*.c
 
> The writers might change semantics when merge, and can be more
> challenging, I'll need to look into details of each one, like page fault
> handlers.  Such work may need to be analyzed case by case, and this GUP
> part is definitely the low hanging fruit comparing to the rest.

The write side is tricky, I think if the read side is sorted out then
it will be easier to reason about the write side. Today the write side
is paired with the special read side and it is extra hard to
understand if there is something weird hidden in the arch.
 
> measurements too when getting there.  And btw, IIUC the major challenge of
> pagewalk.c is not the removal of walk_hugetlb_range() alone - that may not
> be that hard if that's the sole purpose.  The better way to go is to remove
> mm_walk_ops.hugetlb_entry() altogether, which will cause a change in all
> callers; that's "the challenge".. pretty much labor works, not a major
> technical challenge it seems.  Not sure if it's a good news or bad..

Ugh, that is a big pain. It is relying on that hugetlbfs trick of
passing in a pte that is not a pte to make the API generic..

The more I look at this the more I think we need to get to Matthew's
idea of having some kind of generic page table API that is not tightly
tied to level. Replacing the hugetlb trick of 'everything is a PTE'
with 5 special cases in every place seems just horrible.

   struct mm_walk_ops {
   int (*leaf_entry)(struct mm_walk_state *state, struct mm_walk *walk);
   }

And many cases really want something like:
   struct mm_walk_state state;

   if (!mm_walk_seek_leaf(state, mm, address))
  goto no_present
   if (mm_walk_is_write(state)) ..

And detailed walking:
   for_each_pt_leaf(state, mm, address) {
   if (mm_walk_is_write(state)) ..
   }

Replacing it with a mm_walk_state that retains the level or otherwise
to allow decoding any entry composes a lot better. Forced Loop
unrolling can get back to the current code gen in a lot of places.

It also makes the power stuff a bit nicer as the mm_walk_state could
automatically retain back pointers to the higher levels in the state
struct too...

The puzzle is how to do it and still get reasonable efficient codegen,
many operations are going to end up switching on some state->level to
know how to decode the entry.
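To make the shape of that a little more concrete, a hypothetical sketch
(none of these types or helpers exist today; the level switch below is
exactly the codegen cost in question):

   /* Hypothetical only -- nothing below exists in the kernel today */
   struct mm_walk_state {
           unsigned int level;     /* table level the entry was read from */
           unsigned long addr;
           pgd_t pgd;
           p4d_t p4d;
           pud_t pud;
           pmd_t pmd;
           pte_t pte;
   };

   static bool mm_walk_is_write(const struct mm_walk_state *state)
   {
           switch (state->level) {
           case 1:
                   return pte_write(state->pte);
           case 2:
                   return pmd_write(state->pmd);
           case 3:
                   return pud_write(state->pud);
           default:
                   return false;   /* p4d/pgd leaves left out of the sketch */
           }
   }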

> One thing I'll soon look into is to allow huge mappings for PFNMAP; there's
> one request from VFIO side for MMIO. Dropping mm_walk_ops.hugetlb_entry()
> seems to be good for all such purposes; well, I may need to bite the bullet
> here.. for either of the purposes to move on.

That would be a nice feature!

> > If, or how much, the hugepd write side remains special is the open
> > question, I think.
> 
> It seems balls are rolling in that aspect, I haven't looked in depth there
> in Christophe's series but it's great to have started!

Yes!

Jason


Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx

2024-03-25 Thread Jason Gunthorpe
On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
> This series reimplements hugepages with hugepd on powerpc 8xx.
> 
> Unlike most architectures, powerpc 8xx HW requires a two-level
> pagetable topology for all page sizes. So a leaf PMD-contig approach
> is not feasible as such.
> 
> Possible sizes are 4k, 16k, 512k and 8M.
> 
> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
> must point to a single entry level-2 page table. Until now that was
> done using hugepd. This series changes it to use standard page tables
> where the entry is replicated 1024 times on each of the two pagetables
> referred by the two associated PMD entries for that 8M page.
> 
> At the moment it has to look into each helper to know if the
> hugepage ptep is a PTE or a PMD in order to know it is an 8M page or
> a lower size. I hope this can be handled by core-mm in the future.
> 
> There are probably several ways to implement stuff, so feedback is
> very welcome.

I thought it looks pretty good!

Thanks,
Jason


Re: [RFC PATCH 4/8] mm: Provide mm_struct and address to huge_ptep_get()

2024-03-25 Thread Jason Gunthorpe
On Mon, Mar 25, 2024 at 03:55:57PM +0100, Christophe Leroy wrote:

>  arch/arm64/include/asm/hugetlb.h |  2 +-
>  fs/hugetlbfs/inode.c |  2 +-
>  fs/proc/task_mmu.c   |  8 +++---
>  fs/userfaultfd.c |  2 +-
>  include/asm-generic/hugetlb.h|  2 +-
>  include/linux/swapops.h  |  2 +-
>  mm/damon/vaddr.c |  6 ++---
>  mm/gup.c |  2 +-
>  mm/hmm.c |  2 +-
>  mm/hugetlb.c | 46 
>  mm/memory-failure.c  |  2 +-
>  mm/mempolicy.c   |  2 +-
>  mm/migrate.c |  4 +--
>  mm/mincore.c |  2 +-
>  mm/userfaultfd.c |  2 +-
>  15 files changed, 43 insertions(+), 43 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/hugetlb.h 
> b/arch/arm64/include/asm/hugetlb.h
> index 2ddc33d93b13..1af39a74e791 100644
> --- a/arch/arm64/include/asm/hugetlb.h
> +++ b/arch/arm64/include/asm/hugetlb.h
> @@ -46,7 +46,7 @@ extern pte_t huge_ptep_clear_flush(struct vm_area_struct 
> *vma,
>  extern void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
>  pte_t *ptep, unsigned long sz);
>  #define __HAVE_ARCH_HUGE_PTEP_GET
> -extern pte_t huge_ptep_get(pte_t *ptep);
> +extern pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t 
> *ptep);

The header changed but not the implementation? This will need to do
riscv and s390 too.

Though, really, I think the right path is to work toward removing
huge_ptep_get() from the arch code..

riscv and arm are doing the same thing - propagating dirty/young bits
from the contig PTEs to the results. The core code can do this, maybe
with an ARCH #define opt-in.
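Roughly, a generic version could look like this sketch (the dirty/young
folding mirrors what the arm64 implementation does today; the ncontig
parameter would have to come from an arch hook and everything else is
simplified):

   static pte_t generic_contig_ptep_get(pte_t *ptep, unsigned int ncontig)
   {
           pte_t pte = ptep_get(ptep);
           unsigned int i;

           /* Fold dirty/young from every PTE of the contiguous range */
           for (i = 0; i < ncontig; i++) {
                   pte_t cur = ptep_get(ptep + i);

                   if (pte_dirty(cur))
                           pte = pte_mkdirty(pte);
                   if (pte_young(cur))
                           pte = pte_mkyoung(pte);
           }
           return pte;
   }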

s390.. Ouchy - is this because hugetlb wants to pretend that every
level is encoded as a PTE so it takes the PGD and recodes the flags to
the PTE layout??

Jason


Re: [RFC PATCH 1/8] mm: Provide pagesize to pmd_populate()

2024-03-25 Thread Jason Gunthorpe
On Mon, Mar 25, 2024 at 03:55:54PM +0100, Christophe Leroy wrote:
> Unlike many architectures, powerpc 8xx hardware tablewalk requires
> a two level process for all page sizes, although second level only
> has one entry when pagesize is 8M.
> 
> To fit with Linux page table topology and without requiring special
> page directory layout like hugepd, the page entry will be replicated
> 1024 times in the standard page table. However for large pages it is
> necessary to set bits in the level-1 (PMD) entry. At the time being,
> for 512k pages the flag is kept in the PTE and inserted in the PMD
> entry at TLB miss exception, that is necessary because we can have
> pages of different sizes in a page table. However the 12 PTE bits are
> fully used and there is no room for an additional bit for page size.
> 
> For 8M pages, there will be only one page per PMD entry, it is
> therefore possible to flag the pagesize in the PMD entry, with the
> advantage that the information will already be at the right place for
> the hardware.
> 
> To do so, add a new helper called pmd_populate_size() which takes the
> page size as an additional argument, and modify __pte_alloc() to also
> take that argument. pte_alloc() is left unmodified in order to
> reduce churn on callers, and a pte_alloc_size() is added for use by
> pte_alloc_huge().
> 
> When an architecture doesn't provide pmd_populate_size(),
> pmd_populate() is used as a fallback.

I think it would be a good idea to document what the semantic is
supposed to be for sz?

Just a general remark, probably nothing for this, but with these new
arguments the historical naming seems pretty tortured for
pte_alloc_size().. Something like pmd_populate_leaf(size) as a naming
scheme would make this more intuitive. Ie pmd_populate_leaf() gives
you a PMD entry where the entry points to a leaf page table able to
store folios of at least size.
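Purely to illustrate the naming idea (the signature is invented; the
fallback mirrors what the quoted commit message describes for
pmd_populate_size()):

   /*
    * Hypothetical: populate a PMD entry with a leaf page table able to
    * store folios of at least @sz.  Architectures that don't care about
    * the size fall back to the classic helper.
    */
   #ifndef pmd_populate_leaf
   static inline void pmd_populate_leaf(struct mm_struct *mm, pmd_t *pmdp,
                                        pgtable_t pte, unsigned long sz)
   {
           pmd_populate(mm, pmdp, pte);
   }
   #endif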

Anyhow, I thought the edits to the mm helpers were fine, certainly
much nicer than hugepd. Do you see a path to remove hugepd entirely
from here?

Thanks,
Jason


Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

2024-03-22 Thread Jason Gunthorpe
On Thu, Mar 21, 2024 at 06:07:50PM -0400, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> v3:
> - Rebased to latest mm-unstable (a824831a082f, of March 21st)
> - Dropped patch to introduce pmd_thp_or_huge(), replace such uses (and also
>   pXd_huge() users) with pXd_leaf() [Jason]
> - Add a comment for CONFIG_PGTABLE_HAS_HUGE_LEAVES [Jason]
> - Use IS_ENABLED() in follow_huge_pud() [Jason]
> - Remove redundant none pud check in follow_pud_mask() [Jason]
> 
> rfc: https://lore.kernel.org/r/20231116012908.392077-1-pet...@redhat.com
> v1:  https://lore.kernel.org/r/20231219075538.414708-1-pet...@redhat.com
> v2:  https://lore.kernel.org/r/20240103091423.400294-1-pet...@redhat.com
> 
> The series removes the hugetlb slow gup path after a previous refactor work
> [1], so that slow gup now uses the exact same path to process all kinds of
> memory including hugetlb.
> 
> For the long term, we may want to remove most, if not all, call sites of
> huge_pte_offset().  It'll be ideal if that API can be completely dropped
> from arch hugetlb API.  This series is one small step towards merging
> hugetlb specific codes into generic mm paths.  From that POV, this series
> removes one reference to huge_pte_offset() out of many others.

This remark would be a little easier to understand if you refer to
hugetlb_walk() not huge_pte_offset() - the latter is only used to
implement hugetlb_walk() and isn't removed by this series, while a
single hugetlb_walk() was removed.

Regardless, I think the point is spot on, the direction should be to
make the page table reading generic with minimal/no interaction with
hugetlbfs in the generic code.

After this series I would suggest doing the pagewalk.c stuff as it is
very parallel to GUP slow (indeed it would be amazing to figure out a
way to make GUP slow and pagewalk.c use the same code without a
performance cost)

Some of the other core mm callers don't look too bad either, getting
to the point where hugetlb_walk() is internal to hugetlb.c would be a
nice step that looks reasonable size.

> One goal of such a route is that we can reconsider merging hugetlb features
> like High Granularity Mapping (HGM).  It was not accepted in the past
> because it may add lots of hugetlb specific codes and make the mm code even
> harder to maintain.  With a merged codeset, features like HGM can hopefully
> share some code with THP, legacy (PMD+) or modern (continuous PTEs).

Yeah, if all the special hugetlb stuff is using generic arch code and
generic page walkers (maybe with that special vma locking hook) it is
much easier to swallow than adding yet another special class of code
to all the page walkers.

> To make it work, the generic slow gup code will need to at least understand
> hugepd, which is already done like so in fast-gup.  Due to the specialty of
> hugepd to be software-only solution (no hardware recognizes the hugepd
> format, so it's purely artificial structures), there's chance we can merge
> some or all hugepd formats with cont_pte in the future.  That question is
> yet unsettled from Power side to have an acknowledgement.  

At a minimum, I think we have a concurrence that the reading of the
hugepd entries should be fine through the standard contig pte APIs and
so all the page walkers doing read side operations could stop having
special hugepd code. It is just an implementation of contig pte with
the new property that the size of a PTE can be larger than a PMD
entry.
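As a rough read-side illustration: once the entry is reachable through the
normal PTE accessors, a walker mostly needs the mapping size, and
pte_leaf_size() (an existing helper, used by perf) already expresses that;
the wrapper around it here is invented:

   static unsigned long read_leaf_size(pte_t *ptep)
   {
           pte_t pte = ptep_get(ptep);

           if (!pte_present(pte))
                   return 0;
           /* On 8xx-style layouts this may exceed PMD_SIZE, e.g. 8M */
           return pte_leaf_size(pte);
   }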

If, or how much, the hugepd write side remains special is the open
question, I think.

> this series, I kept the hugepd handling because we may still need to do so
> before getting a clearer picture of the future of hugepd.  The other reason
> is simply that we did it already for fast-gup and most codes are still
> around to be reused.  It'll make more sense to keep slow/fast gup behave
> the same before a decision is made to remove hugepd.

Yeah, I think that is right for this series. Lets get this done and
then try to remove hugepd read side.

Thanks,
Jason


Re: [PATCH v3 12/12] mm/gup: Handle hugetlb in the generic follow_page_mask code

2024-03-22 Thread Jason Gunthorpe
On Fri, Mar 22, 2024 at 11:55:11AM -0400, Peter Xu wrote:
> Jason,
> 
> On Fri, Mar 22, 2024 at 10:30:12AM -0300, Jason Gunthorpe wrote:
> > On Thu, Mar 21, 2024 at 06:08:02PM -0400, pet...@redhat.com wrote:
> > 
> > > A quick performance test on an aarch64 VM on M1 chip shows 15% degrade 
> > > over
> > > a tight loop of slow gup after the path switched.  That shouldn't be a
> > > problem because slow-gup should not be a hot path for GUP in general: when
> > > page is commonly present, fast-gup will already succeed, while when the
> > > page is indeed missing and require a follow up page fault, the slow gup
> > > degrade will probably buried in the fault paths anyway.  It also explains
> > > why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
> > > accelerate thp gup even for "pages != NULL"") lands, the latter not part 
> > > of
> > > a performance analysis but a side benefit.  If the performance will be a
> > > concern, we can consider handle CONT_PTE in follow_page().
> > 
> > I think this is probably fine for the moment, at least for this
> > series, as CONT_PTE is still very new.
> > 
> > But it will need to be optimized. "slow" GUP is the only GUP that is
> > used by FOLL_LONGTERM and it still needs to be optimized because you
> > can't assume a FOLL_LONGTERM user will be hitting the really slow
> > fault path. There are enough important cases where it is just reading
> > already populated page tables, and these days, often with large folios.
> 
> Ah, I thought FOLL_LONGTERM should work in most cases for fast-gup,
> especially for hugetlb, but maybe I missed something?  

Ah, no this is my bad memory, there was a time where that was true,
but it is not the case now. Oh, it is a really bad memory because it
seems I removed parts of it :)

> I do see that devmap skips fast-gup for LONGTERM, we also have that
> writeback issue but none of those that I can find applies to
> hugetlb.  This might be a problem indeed if we have hugetlb cont_pte
> pages that will constantly fallback to slow gup.

Right, DAX would be the main use case I can think of. Today the
intersection of DAX and contig PTE is non-existent so let's not worry.

> OTOH, I also agree with you that such batching would be nice to have for
> slow-gup, likely devmap or many fs (exclude shmem/hugetlb) file mappings
> can at least benefit from it due to above.  But then that'll be a more
> generic issue to solve, IOW, we still don't do that for !hugetlb cont_pte
> large folios, before or after this series.

Right, improving contig pte is going to be a process and eventually it
will make sense to optimize this regardless of hugetlbfs

Jason


Re: [PATCH v3 12/12] mm/gup: Handle hugetlb in the generic follow_page_mask code

2024-03-22 Thread Jason Gunthorpe
On Thu, Mar 21, 2024 at 06:08:02PM -0400, pet...@redhat.com wrote:

> A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
> a tight loop of slow gup after the path switched.  That shouldn't be a
> problem because slow-gup should not be a hot path for GUP in general: when
> page is commonly present, fast-gup will already succeed, while when the
> page is indeed missing and require a follow up page fault, the slow gup
> degrade will probably buried in the fault paths anyway.  It also explains
> why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
> accelerate thp gup even for "pages != NULL"") lands, the latter not part of
> a performance analysis but a side benefit.  If the performance will be a
> concern, we can consider handle CONT_PTE in follow_page().

I think this is probably fine for the moment, at least for this
series, as CONT_PTE is still very new.

But it will need to be optimized. "slow" GUP is the only GUP that is
used by FOLL_LONGTERM and it still needs to be optimized because you
can't assume a FOLL_LONGTERM user will be hitting the really slow
fault path. There are enough important cases where it is just reading
already populated page tables, and these days, often with large folios.

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH v3 05/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

2024-03-22 Thread Jason Gunthorpe
On Thu, Mar 21, 2024 at 06:07:55PM -0400, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> Hugepd format for GUP is only used in PowerPC with hugetlbfs.  There are
> some kernel usage of hugepd (can refer to hugepd_populate_kernel() for
> PPC_8XX), however those pages are not candidates for GUP.
> 
> Commit a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
> file-backed mappings") added a check to fail gup-fast if there's potential
> risk of violating GUP over writeback file systems.  That should never apply
> to hugepd.  Considering that hugepd is an old format (and even
> software-only), there's no plan to extend hugepd into other file typed
> memories that is prone to the same issue.
> 
> Drop that check, not only because it'll never be true for hugepd per any
> known plan, but also it paves way for reusing the function outside
> fast-gup.
> 
> To make sure we'll still remember this issue just in case hugepd will be
> extended to support non-hugetlbfs memories, add a rich comment above
> gup_huge_pd(), explaining the issue with proper references.
> 
> Cc: Christoph Hellwig 
> Cc: Lorenzo Stoakes 
> Cc: Michael Ellerman 
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Peter Xu 
> ---
>  mm/gup.c | 13 -
>  1 file changed, 8 insertions(+), 5 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH v3 04/12] mm: Introduce vma_pgtable_walk_{begin|end}()

2024-03-22 Thread Jason Gunthorpe
On Thu, Mar 21, 2024 at 06:07:54PM -0400, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> Introduce per-vma begin()/end() helpers for pgtable walks.  This is a
> preparation work to merge hugetlb pgtable walkers with generic mm.
> 
> The helpers need to be called before and after a pgtable walk, will start
> to be needed if the pgtable walker code supports hugetlb pages.  It's a
> hook point for any type of VMA, but for now only hugetlb uses it to
> stablize the pgtable pages from getting away (due to possible pmd
> unsharing).
> 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Muchun Song 
> Signed-off-by: Peter Xu 
> ---
>  include/linux/mm.h |  3 +++
>  mm/memory.c| 12 
>  2 files changed, 15 insertions(+)

is_vm_hugetlb_page(vma) seems weirdly named. Regardless

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()

2024-03-19 Thread Jason Gunthorpe
On Tue, Mar 19, 2024 at 11:07:08PM +, Christophe Leroy wrote:
> 
> 
> Le 18/03/2024 à 17:15, Jason Gunthorpe a écrit :
> > On Thu, Mar 14, 2024 at 01:11:59PM +, Christophe Leroy wrote:
> >>
> >>
> >> Le 14/03/2024 à 13:53, Peter Xu a écrit :
> >>> On Thu, Mar 14, 2024 at 08:45:34AM +, Christophe Leroy wrote:
> >>>>
> >>>>
> >>>> Le 13/03/2024 à 22:47, pet...@redhat.com a écrit :
> >>>>> From: Peter Xu 
> >>>>>
> >>>>> PowerPC book3s 4K mostly has the same definition on both, except 
> >>>>> pXd_huge()
> >>>>> constantly returns 0 for hash MMUs.  As Michael Ellerman pointed out 
> >>>>> [1],
> >>>>> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be 
> >>>>> set so
> >>>>> it will keep returning false.
> >>>>>
> >>>>> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
> >>>>> such huge mappings for 4K hash MMUs.  Meanwhile, the major powerpc 
> >>>>> hugetlb
> >>>>> pgtable walker __find_linux_pte() already used pXd_leaf() to check 
> >>>>> hugetlb
> >>>>> mappings.
> >>>>>
> >>>>> The goal should be that we will have one API pXd_leaf() to detect all 
> >>>>> kinds
> >>>>> of huge mappings.  AFAICT we need to use the pXd_leaf() impl (rather 
> >>>>> than
> >>>>> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return 
> >>>>> true.
> >>>>
> >>>> All kinds of huge mappings ?
> >>>>
> >>>> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are
> >>>> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages
> >>>> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report
> >>>> those huge pages.
> >>>
> >>> Ah yes, I should always mention this is in the context of leaf huge pages
> >>> only.  Are the examples you provided all fall into hugepd category?  If so
> >>> I can reword the commit message, as:
> >>
> >> On powerpc 8xx, only the 8M huge pages fall into the hugepd case.
> >>
> >> The 512k hugepages are at PTE level, they are handled more or less like
> >> CONT_PTE on ARM. see function set_huge_pte_at() for more context.
> >>
> >> You can also look at pte_leaf_size() and pgd_leaf_size().
> > 
> > IMHO leaf should return false if the thing is pointing to a next level
> > page table, even if that next level is fully populated with contiguous
> > pages.
> > 
> > This seems more aligned with the contig page direction that hugepd
> > should be moved over to..
> 
> Should hugepd be moved to the contig page direction, really ?

Sure? Is there any downside for the reading side to do so?

> Would it be acceptable that a 8M hugepage requires 2048 contig entries
> in 2 page tables, when the hugepd allows a single entry ? 

? I thought we agreed the only difference would be that something new
is needed to merge the two identical sibling page tables into one, ie
you pay 2x the page table memory if that isn't fixed. That is write
side only change and I imagine it could be done with a single PPC
special API.

Honestly not totally sure that is a big deal, it is already really
memory inefficient compared to every other arch's huge page by needing
the child page table in the first place.
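To illustrate, the single special write-side API mentioned above might look
roughly like this invented sketch, replicating one leaf entry across the
two sibling level-2 tables (locking and TLB management omitted; set_pte()
stands in for whatever arch-level store helper would really be used):

   static void set_8m_leaf(pmd_t *pmdp, pte_t entry)
   {
           int i, j;

           /* Assumes an 8M-aligned address, i.e. index 0 in both tables */
           for (i = 0; i < 2; i++) {
                   pte_t *ptep = pte_offset_kernel(pmdp + i, 0);

                   for (j = 0; j < PTRS_PER_PTE; j++)
                           set_pte(ptep + j, entry);
           }
   }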

> Would it be acceptable performancewise ?

Isn't this particular PPC sub platform ancient? Are there current real
users that are going to have hugetlbfs special code and care about
this performance detail on a 6.20 era kernel?

In today's world wouldn't it be performance better if these platforms
could support THP by aligning to the contig API instead of being
special?

Am I wrong to question why we are polluting the core code for this
special optimization?

Jason


Re: [RFC PATCH 1/3] powerpc/pseries/iommu: Bring back userspace view for single level TCE tables

2024-03-19 Thread Jason Gunthorpe
On Tue, Mar 12, 2024 at 01:14:20PM -0500, Shivaprasad G Bhat wrote:
> The commit 090bad39b237a ("powerpc/powernv: Add indirect levels to
> it_userspace") which implemented the tce indirect levels
> support for PowerNV ended up removing the single level support
> which existed by default (generic tce_iommu_userspace_view_alloc/free()
> calls). On pSeries the TCEs are single level, and the allocation
> of userspace view is lost with the removal of generic code.

:( :(

If this has been broken since 2018 and nobody cared till now can we
please go in a direction of moving this code to the new iommu APIs
instead of doubling down on more of this old stuff that apparently
almost nobody cares about ??

Jason


Re: [PATCH 11/13] mm/treewide: Replace pXd_huge() with pXd_leaf()

2024-03-18 Thread Jason Gunthorpe
On Thu, Mar 14, 2024 at 08:59:22AM -0400, Peter Xu wrote:

> > > --- a/mm/hmm.c
> > > +++ b/mm/hmm.c
> > > @@ -429,7 +429,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned 
> > > long start, unsigned long end,
> > >   return hmm_vma_walk_hole(start, end, -1, walk);
> > >   }
> > >   
> > > - if (pud_huge(pud) && pud_devmap(pud)) {
> > > + if (pud_leaf(pud) && pud_devmap(pud)) {
> > 
> > Didn't previous patch say devmap implies leaf ? Or is it only for GUP ?
> 
> This is an extra safety check that I didn't remove.  Devmap used separate
> bits even though I'm not clear on why.  It should still imply a leaf though.

Yes, something is very wrong if devmap is true on non-leaf..

Jason


Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()

2024-03-18 Thread Jason Gunthorpe
On Thu, Mar 14, 2024 at 01:11:59PM +, Christophe Leroy wrote:
> 
> 
> Le 14/03/2024 à 13:53, Peter Xu a écrit :
> > On Thu, Mar 14, 2024 at 08:45:34AM +, Christophe Leroy wrote:
> >>
> >>
> >> Le 13/03/2024 à 22:47, pet...@redhat.com a écrit :
> >>> From: Peter Xu 
> >>>
> >>> PowerPC book3s 4K mostly has the same definition on both, except 
> >>> pXd_huge()
> >>> constantly returns 0 for hash MMUs.  As Michael Ellerman pointed out [1],
> >>> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set 
> >>> so
> >>> it will keep returning false.
> >>>
> >>> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
> >>> such huge mappings for 4K hash MMUs.  Meanwhile, the major powerpc hugetlb
> >>> pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb
> >>> mappings.
> >>>
> >>> The goal should be that we will have one API pXd_leaf() to detect all 
> >>> kinds
> >>> of huge mappings.  AFAICT we need to use the pXd_leaf() impl (rather than
> >>> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true.
> >>
> >> All kinds of huge mappings ?
> >>
> >> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are
> >> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages
> >> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report
> >> those huge pages.
> > 
> > Ah yes, I should always mention this is in the context of leaf huge pages
> > only.  Are the examples you provided all fall into hugepd category?  If so
> > I can reword the commit message, as:
> 
> On powerpc 8xx, only the 8M huge pages fall into the hugepd case.
> 
> The 512k hugepages are at PTE level, they are handled more or less like 
> CONT_PTE on ARM. see function set_huge_pte_at() for more context.
> 
> You can also look at pte_leaf_size() and pgd_leaf_size().

IMHO leaf should return false if the thing is pointing to a next level
page table, even if that next level is fully populated with contiguous
pages.

This seems more aligned with the contig page direction that hugepd
should be moved over to..

> By the way pgd_leaf_size() looks odd because it is called only when
> pgd_leaf() returns true, which never happens for 8M pages.

Like this, you should reach the actual final leaf that the HW will
load and leaf_size() should say it is greater size than the current
table level. Other levels should return 0.

If necessary the core MM code should deal with this by iterating over
adjacent tables.

Jason


Re: [PATCH RFC 13/13] mm: Document pXd_leaf() API

2024-03-08 Thread Jason Gunthorpe
On Wed, Mar 06, 2024 at 06:41:47PM +0800, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> There's one small section already, but since we're going to remove
> pXd_huge(), that comment may start to obsolete.
> 
> Rewrite that section with more information, hopefully with that the API is
> crystal clear on what it implies.
> 
> Signed-off-by: Peter Xu 
> ---
>  include/linux/pgtable.h | 24 +++-
>  1 file changed, 19 insertions(+), 5 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH RFC 10/13] mm/gup: Merge pXd huge mapping checks

2024-03-07 Thread Jason Gunthorpe
On Wed, Mar 06, 2024 at 06:41:44PM +0800, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> Huge mapping checks in GUP are slightly redundant and can be simplified.
> 
> pXd_huge() now is the same as pXd_leaf().  pmd_trans_huge() and
> pXd_devmap() should both imply pXd_leaf(). Time to merge them into one.
> 
> Signed-off-by: Peter Xu 
> ---
>  mm/gup.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)

Much better!

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH RFC 04/13] mm/x86: Change pXd_huge() behavior to exclude swap entries

2024-03-07 Thread Jason Gunthorpe
On Wed, Mar 06, 2024 at 06:41:38PM +0800, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> This patch partly reverts below commits:
> 
> 3a194f3f8ad0 ("mm/hugetlb: make pud_huge() and follow_huge_pud() aware of 
> non-present pud entry")
> cbef8478bee5 ("mm/hugetlb: pmd_huge() returns true for non-present hugepage")
> 
> Right now, pXd_huge() definition across kernel is unclear. We have two
> groups that think differently on swap entries:
> 
>   - x86/sparc: Allow pXd_huge() to accept swap entries
>   - all the rest:  Doesn't allow pXd_huge() to accept swap entries
> 
> This is so confusing.  Since the sparc helpers seem to be added in 2016,
> which is after x86's (2015), so sparc could have followed a trend.  x86
> proposed such swap handling in 2015 to resolve hugetlb swap entries hit in
> GUP, but now GUP guards swap entries with !pXd_present() in all layers so
> we should be safe.
> 
> We should define this API properly, one way or another, rather than keep
> them defined differently across archs.
> 
> Gut feeling tells me that pXd_huge() shouldn't include swap entries, and it
> turns out that I am not the only one thinking so, the question was raised
> when the current pmd_huge() for x86 was proposed by Ville Syrjälä:
> 
> https://lore.kernel.org/all/y2wq7i4lxh8iu...@intel.com/
> 
>   I might also be missing something obvious, but why is it even necessary
>   to treat PRESENT==0+PSE==0 as a huge entry?
> 
> It is also questioned when Jason Gunthorpe reviewed the other patchset on
> swap entry handlings:
> 
> https://lore.kernel.org/all/20240221125753.gq13...@nvidia.com/
> 
> Revert its meaning back to original.  It shouldn't have any functional
> change as we should be ready with guards on !pXd_present() explicitly
> everywhere.
> 
> Note that I also dropped the "#if CONFIG_PGTABLE_LEVELS > 2", it was there
> probably because it was breaking things when 3a194f3f8ad0 was proposed,
> according to the report here:
> 
> https://lore.kernel.org/all/y2lyxitkqyajt...@intel.com/
> 
> Now we shouldn't need that.
> 
> Instead of reverting to _PAGE_PSE raw check, leverage pXd_leaf().
> 
> Cc: Naoya Horiguchi 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: Dave Hansen 
> Cc: x...@kernel.org
> Signed-off-by: Peter Xu 
> ---
>  arch/x86/mm/hugetlbpage.c | 18 --
>  1 file changed, 4 insertions(+), 14 deletions(-)

I think this is the right thing to do, callers should be directly
sensitive to swap entries rather than getting that behavior indirectly
from a helper like this.

Jason


Re: [PATCH RFC 03/13] mm/gup: Check p4d presence before going on

2024-03-07 Thread Jason Gunthorpe
On Wed, Mar 06, 2024 at 06:41:37PM +0800, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> Currently there should be no p4d swap entries so it may not matter much,
> however this may help us to rule out swap entries in pXd_huge() API, which
> will include p4d_huge().  The p4d_present() checks make it 100% clear that
> we won't rely on p4d_huge() for swap entries.
> 
> Signed-off-by: Peter Xu 
> ---
>  mm/gup.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

>   p4dp = p4d_offset(pgdp, address);
>   p4d = READ_ONCE(*p4dp);
> - if (p4d_none(p4d))
> + if (p4d_none(p4d) || !p4d_present(p4d))
>   return no_page_table(vma, flags);

Like the other place I think this makes more sense as

if (!p4d_present(p4d))
        return no_page_table(vma, flags);

Since none can never be present.

IOW if the following code doesn't decode swap entries then it should
be guarded with a check on present not none..

Jason


Re: [PATCH RFC 01/13] mm/hmm: Process pud swap entry without pud_huge()

2024-03-07 Thread Jason Gunthorpe
On Wed, Mar 06, 2024 at 06:41:35PM +0800, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> Swap pud entries do not always return true for pud_huge() for all archs.
> x86 and sparc (so far) allow it, but all the rest do not accept a swap
> entry to be reported as pud_huge().  So it's not safe to check swap entries
> within pud_huge().  Check swap entries before pud_huge(), so it should be
> always safe.
> 
> This is the only place in the kernel that (IMHO, wrongly) relies on
> pud_huge() to return true on pud swap entries.  The plan is to cleanup
> pXd_huge() to only report non-swap mappings for all archs.
> 
> Cc: Alistair Popple 
> Signed-off-by: Peter Xu 
> ---
>  mm/hmm.c | 7 +--
>  1 file changed, 1 insertion(+), 6 deletions(-)

Reviewed-by: Jason Gunthorpe 

> @@ -424,7 +424,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long 
> start, unsigned long end,
>   walk->action = ACTION_CONTINUE;
>  
>   pud = READ_ONCE(*pudp);
> - if (pud_none(pud)) {
> + if (pud_none(pud) || !pud_present(pud)) {

Isn't this a tautology? pud_none() always implies !pud_present()?

Jason


Re: [PATCH v3 04/10] mm/x86: Replace pgd_large() with pgd_leaf()

2024-03-05 Thread Jason Gunthorpe
On Tue, Mar 05, 2024 at 12:37:44PM +0800, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> pgd_leaf() is a global API while pgd_large() is not.  Always use
> the global pgd_leaf(), then drop pgd_large().
> 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: Dave Hansen 
> Cc: x...@kernel.org
> Signed-off-by: Peter Xu 
> ---
>  arch/x86/include/asm/pgtable.h | 4 ++--
>  arch/x86/mm/pti.c  | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH v3 10/10] mm/treewide: Align up pXd_leaf() retval across archs

2024-03-05 Thread Jason Gunthorpe
On Tue, Mar 05, 2024 at 12:37:50PM +0800, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> Even if pXd_leaf() API is defined globally, it's not clear on the retval,
> and there are three types used (bool, int, unsigned long).
> 
> Always return a boolean for pXd_leaf() APIs.
> 
> Suggested-by: Jason Gunthorpe 
> Signed-off-by: Peter Xu 
> ---
>  arch/riscv/include/asm/pgtable-64.h | 2 +-
>  arch/riscv/include/asm/pgtable.h| 2 +-
>  arch/s390/include/asm/pgtable.h | 4 ++--
>  arch/sparc/include/asm/pgtable_64.h | 4 ++--
>  arch/x86/include/asm/pgtable.h  | 8 
>  include/linux/pgtable.h | 8 
>  6 files changed, 14 insertions(+), 14 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH v2 7/7] mm/treewide: Drop pXd_large()

2024-03-04 Thread Jason Gunthorpe
On Thu, Feb 29, 2024 at 04:42:58PM +0800, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> They're not used anymore, drop all of them.
> 
> Signed-off-by: Peter Xu 
> ---
>  arch/arm/include/asm/pgtable-2level.h|  1 -
>  arch/arm/include/asm/pgtable-3level.h|  1 -
>  arch/powerpc/include/asm/book3s/64/pgtable.h |  4 +---
>  arch/powerpc/include/asm/pgtable.h   |  4 
>  arch/s390/include/asm/pgtable.h  |  8 
>  arch/sparc/include/asm/pgtable_64.h  |  8 
>  arch/x86/include/asm/pgtable.h   | 19 +++
>  7 files changed, 16 insertions(+), 29 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH v2 6/7] mm/treewide: Replace pud_large() with pud_leaf()

2024-03-04 Thread Jason Gunthorpe
On Thu, Feb 29, 2024 at 04:42:57PM +0800, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> pud_large() is always defined as pud_leaf().  Merge their usages.  Chose
> pud_leaf() because pud_leaf() is a global API, while pud_large() is not.
> 
> Signed-off-by: Peter Xu 
> ---
>  arch/powerpc/mm/book3s64/pgtable.c | 2 +-
>  arch/s390/boot/vmem.c  | 2 +-
>  arch/s390/include/asm/pgtable.h| 4 ++--
>  arch/s390/mm/gmap.c| 2 +-
>  arch/s390/mm/hugetlbpage.c | 4 ++--
>  arch/s390/mm/pageattr.c| 2 +-
>  arch/s390/mm/pgtable.c | 2 +-
>  arch/s390/mm/vmem.c| 6 +++---
>  arch/sparc/mm/init_64.c| 2 +-
>  arch/x86/kvm/mmu/mmu.c | 2 +-
>  arch/x86/mm/fault.c| 4 ++--
>  arch/x86/mm/ident_map.c| 2 +-
>  arch/x86/mm/init_64.c  | 4 ++--
>  arch/x86/mm/kasan_init_64.c| 2 +-
>  arch/x86/mm/mem_encrypt_identity.c | 2 +-
>  arch/x86/mm/pat/set_memory.c   | 6 +++---
>  arch/x86/mm/pgtable.c  | 2 +-
>  arch/x86/mm/pti.c  | 2 +-
>  arch/x86/power/hibernate.c | 2 +-
>  arch/x86/xen/mmu_pv.c      | 4 ++--
>  20 files changed, 29 insertions(+), 29 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH v2 5/7] mm/treewide: Replace pmd_large() with pmd_leaf()

2024-03-04 Thread Jason Gunthorpe
On Thu, Feb 29, 2024 at 04:42:56PM +0800, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> pmd_large() is always defined as pmd_leaf().  Merge their usages.  Chose
> pmd_leaf() because pmd_leaf() is a global API, while pmd_large() is not.
> 
> Signed-off-by: Peter Xu 
> ---
>  arch/arm/mm/dump.c   |  4 ++--
>  arch/powerpc/mm/book3s64/pgtable.c   |  2 +-
>  arch/powerpc/mm/book3s64/radix_pgtable.c |  2 +-
>  arch/powerpc/mm/pgtable_64.c |  2 +-
>  arch/s390/boot/vmem.c|  2 +-
>  arch/s390/include/asm/pgtable.h  |  8 
>  arch/s390/mm/gmap.c  | 12 ++--
>  arch/s390/mm/hugetlbpage.c   |  2 +-
>  arch/s390/mm/pageattr.c  |  2 +-
>  arch/s390/mm/pgtable.c   |  6 +++---
>  arch/s390/mm/vmem.c  |  6 +++---
>  arch/sparc/mm/init_64.c  |  4 ++--
>  arch/x86/boot/compressed/ident_map_64.c  |  2 +-
>  arch/x86/kvm/mmu/mmu.c   |  2 +-
>  arch/x86/mm/fault.c  |  8 
>  arch/x86/mm/init_32.c|  2 +-
>  arch/x86/mm/init_64.c|  8 
>  arch/x86/mm/kasan_init_64.c  |  2 +-
>  arch/x86/mm/mem_encrypt_identity.c   |  4 ++--
>  arch/x86/mm/pat/set_memory.c |  4 ++--
>  arch/x86/mm/pgtable.c|  2 +-
>  arch/x86/mm/pti.c|  4 ++--
>  arch/x86/power/hibernate.c   |  2 +-
>  arch/x86/xen/mmu_pv.c|  4 ++--
>  drivers/misc/sgi-gru/grufault.c  |  2 +-
>  25 files changed, 49 insertions(+), 49 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH v2 4/7] mm/x86: Drop two unnecessary pud_leaf() definitions

2024-03-04 Thread Jason Gunthorpe
On Thu, Feb 29, 2024 at 04:42:55PM +0800, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> pud_leaf() has a fallback macro defined in include/linux/pgtable.h already.
> Drop the extra two for x86.
> 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: Dave Hansen 
> Cc: x...@kernel.org
> Signed-off-by: Peter Xu 
> ---
>  arch/x86/include/asm/pgtable.h  | 1 -
>  include/asm-generic/pgtable-nopmd.h | 1 -
>  2 files changed, 2 deletions(-)

Reviewed-by: Jason Gunthorpe 

> @@ -31,7 +31,6 @@ static inline int pud_none(pud_t pud)   { 
> return 0; }
>  static inline int pud_bad(pud_t pud) { return 0; }
>  static inline int pud_present(pud_t pud) { return 1; }
>  static inline int pud_user(pud_t pud){ return 0; }
> -static inline int pud_leaf(pud_t pud){ return 0; }

It would be nice to have a final patch making the signatures
consistent on all the arch inlines, it should return bool not int.

Jason


Re: [PATCH v2 3/7] mm/x86: Replace p4d_large() with p4d_leaf()

2024-03-04 Thread Jason Gunthorpe
On Thu, Feb 29, 2024 at 04:42:54PM +0800, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> p4d_large() is always defined as p4d_leaf().  Merge their usages.  Chose
> p4d_leaf() because p4d_leaf() is a global API, while p4d_large() is not.
> 
> Only x86 has p4d_leaf() defined as of now.  So it also means after this
> patch we removed all p4d_large() usages.
> 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: Dave Hansen 
> Cc: x...@kernel.org
> Signed-off-by: Peter Xu 
> ---
>  arch/x86/mm/fault.c  | 4 ++--
>  arch/x86/mm/init_64.c| 2 +-
>  arch/x86/mm/pat/set_memory.c | 4 ++--
>  arch/x86/mm/pti.c| 2 +-
>  arch/x86/power/hibernate.c   | 2 +-
>  arch/x86/xen/mmu_pv.c| 2 +-
>  6 files changed, 8 insertions(+), 8 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH v2 2/7] mm/ppc: Replace pXd_is_leaf() with pXd_leaf()

2024-03-04 Thread Jason Gunthorpe
On Thu, Feb 29, 2024 at 04:42:53PM +0800, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> They're the same macros underneath.  Drop pXd_is_leaf(), instead always use
> pXd_leaf().
> 
> In the meantime, instead of renaming, drop the pXd_is_leaf() fallback
> definitions directly in arch/powerpc/include/asm/pgtable.h, because similar
> fallback macros for pXd_leaf() are already defined in
> include/linux/pgtable.h.
> 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: "Aneesh Kumar K.V" 
> Cc: "Naveen N. Rao" 
> Cc: linuxppc-dev@lists.ozlabs.org
> Suggested-by: Christophe Leroy 
> Signed-off-by: Peter Xu 
> ---
>  arch/powerpc/include/asm/book3s/64/pgtable.h | 10 
>  arch/powerpc/include/asm/pgtable.h   | 24 
>  arch/powerpc/kvm/book3s_64_mmu_radix.c   | 12 +-
>  arch/powerpc/mm/book3s64/radix_pgtable.c | 14 ++--
>  arch/powerpc/mm/pgtable.c|  6 ++---
>  arch/powerpc/mm/pgtable_64.c |  6 ++---
>  arch/powerpc/xmon/xmon.c     |  6 ++---
>  7 files changed, 26 insertions(+), 52 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH v2 1/7] mm/ppc: Define pXd_large() with pXd_leaf()

2024-03-04 Thread Jason Gunthorpe
On Thu, Feb 29, 2024 at 04:42:52PM +0800, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> The two definitions are the same.  The only difference is that pXd_large()
> is only defined with THP selected, and only on book3s 64bits.
> 
> Instead of implementing it twice, make pXd_large() a macro to pXd_leaf().
> Define it unconditionally just like pXd_leaf().  This helps to prepare
> merging the two APIs.
> 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Christophe Leroy 
> Cc: "Aneesh Kumar K.V" 
> Cc: "Naveen N. Rao" 
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Peter Xu 
> ---
>  arch/powerpc/include/asm/book3s/64/pgtable.h | 16 ++--
>  arch/powerpc/include/asm/pgtable.h   |  2 +-
>  2 files changed, 3 insertions(+), 15 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH v2 03/13] mm: Provide generic pmd_thp_or_huge()

2024-02-21 Thread Jason Gunthorpe
On Wed, Feb 21, 2024 at 05:37:37PM +0800, Peter Xu wrote:
> On Mon, Jan 15, 2024 at 01:55:51PM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 03, 2024 at 05:14:13PM +0800, pet...@redhat.com wrote:
> > > From: Peter Xu 
> > > 
> > > ARM defines pmd_thp_or_huge(), detecting either a THP or a huge PMD.  It
> > > can be a helpful helper if we want to merge more THP and hugetlb code
> > > paths.  Make it a generic default implementation, only exist when
> > > CONFIG_MMU.  Arch can overwrite it by defining its own version.
> > > 
> > > For example, ARM's pgtable-2level.h defines it to always return false.
> > > 
> > > Keep the macro declared with all config, it should be optimized to a false
> > > anyway if !THP && !HUGETLB.
> > > 
> > > Signed-off-by: Peter Xu 
> > > ---
> > >  include/linux/pgtable.h | 4 
> > >  mm/gup.c| 3 +--
> > >  2 files changed, 5 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > > index 466cf477551a..2b42e95a4e3a 100644
> > > --- a/include/linux/pgtable.h
> > > +++ b/include/linux/pgtable.h
> > > @@ -1362,6 +1362,10 @@ static inline int pmd_write(pmd_t pmd)
> > >  #endif /* pmd_write */
> > >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> > >  
> > > +#ifndef pmd_thp_or_huge
> > > +#define pmd_thp_or_huge(pmd) (pmd_huge(pmd) || pmd_trans_huge(pmd))
> > > +#endif
> > 
> > Why not just use pmd_leaf() ?
> > 
> > This GUP case seems to me exactly like what pmd_leaf() should really
> > do and be used for..
> 
> I think I mostly agree with you, and these APIs are indeed confusing.  IMHO
> the challenge is about the risk of breaking others on small changes in the
> details where evil resides.

These APIs are super confusing, which is why I brought it up.. Adding
even more subtly different variations is not helping.

I think pmd_leaf means the entry is present and refers to a physical
page, not another radix level.
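
Roughly, that reading of the semantics could be written as the following
sketch (pmd_points_to_table() is a stand-in here, not a real generic helper):

/* Sketch only: a "leaf" entry is present and maps memory directly,
 * rather than pointing at a lower-level page table. */
static inline bool pmd_leaf_sketch(pmd_t pmd)
{
	return pmd_present(pmd) && !pmd_points_to_table(pmd);
}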

> > eg x86 does:
> > 
> > #define pmd_leaf pmd_large
> > static inline int pmd_large(pmd_t pte)
> > return pmd_flags(pte) & _PAGE_PSE;
> > 
> > static inline int pmd_trans_huge(pmd_t pmd)
> > return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
> > 
> > int pmd_huge(pmd_t pmd)
> > return !pmd_none(pmd) &&
> > (pmd_val(pmd) & (_PAGE_PRESENT|_PAGE_PSE)) != _PAGE_PRESENT;
> 
> For example, here I don't think it's strictly pmd_leaf()? As pmd_huge()
> will return true if PRESENT=0 && PSE=0 (as long as none pte ruled out
> first), while pmd_leaf() will return false; I think that came from
> cbef8478bee5. 

Yikes, but do you even want to handle non-present entries in GUP
world? Isn't everything gated by !present in the first place?

> Besides that, there're also other cases where it's not clear of such direct
> replacement, not until further investigated.  E.g., arm-3level has:
> 
> #define pmd_leaf(pmd) pmd_sect(pmd)
> #define pmd_sect(pmd) ((pmd_val(pmd) & PMD_TYPE_MASK) == \
>PMD_TYPE_SECT)
> #define PMD_TYPE_SECT (_AT(pmdval_t, 1) << 0)
> 
> While pmd_huge() there relies on PMD_TABLE_BIT

I looked at that, it looked OK..

#define PMD_TYPE_MASK   (_AT(pmdval_t, 3) << 0)
#define PMD_TABLE_BIT   (_AT(pmdval_t, 1) << 1)

It is the same stuff, just a little confusingly written
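
Spelled out, a hedged reading of those macros, consistent with the
pmd_sect() definition quoted above:

/* Bits [1:0] of a valid LPAE descriptor encode its type:
 *   0b01 = section (leaf)      -> PMD_TABLE_BIT (bit 1) is clear
 *   0b11 = table (next level)  -> PMD_TABLE_BIT is set
 * So, for a valid entry, "type == PMD_TYPE_SECT" and
 * "!(pmd_val(pmd) & PMD_TABLE_BIT)" test the same thing. */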

Jason


Re: [PATCH v2] powerpc/iommu: Fix the iommu group reference leak during platform domain attach

2024-02-14 Thread Jason Gunthorpe
On Wed, Feb 14, 2024 at 12:09:24PM -0600, Shivaprasad G Bhat wrote:
> The function spapr_tce_platform_iommu_attach_dev() is missing to call
> iommu_group_put() when the domain is already set. This refcount leak
> shows up with BUG_ON() during DLPAR remove operation as,
> 
>   KernelBug: Kernel bug in state 'None': kernel BUG at 
> arch/powerpc/platforms/pseries/iommu.c:100!
>   Oops: Exception in kernel mode, sig: 5 [#1]
>   LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=8192 NUMA pSeries
>   
>   Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 
> of:IBM,FW1060.00 (NH1060_016) hv:phyp pSeries
>   NIP:  c00ff4d4 LR: c00ff4cc CTR: 
>   REGS: c013aed5f840 TRAP: 0700   Tainted: G  I 
> (6.8.0-rc3-autotest-g99bd3cb0d12e)
>   MSR:  80029033   CR: 44002402  XER: 2004
>   CFAR: c0a0d170 IRQMASK: 0
>   GPR00: c00ff4cc c013aed5fae0 c1512700 c013aa362138
>   GPR04:    000119c8afd0
>   GPR08:  c01284442b00 0001 1003
>   GPR12: 0003 c0182f00  
>   GPR16:    
>   GPR20:    
>   GPR24: c013aed5fc40 0002  c2757d90
>   GPR28: c00ff440 c2757cb8 c0183799c1a0 c013aa362b00
>   NIP [c00ff4d4] iommu_reconfig_notifier+0x94/0x200
>   LR [c00ff4cc] iommu_reconfig_notifier+0x8c/0x200
>   Call Trace:
>   [c013aed5fae0] [c00ff4cc] iommu_reconfig_notifier+0x8c/0x200 
> (unreliable)
>   [c013aed5fb10] [c01a27b0] notifier_call_chain+0xb8/0x19c
>   [c013aed5fb70] [c01a2a78] blocking_notifier_call_chain+0x64/0x98
>   [c013aed5fbb0] [c0c4a898] of_reconfig_notify+0x44/0xdc
>   [c013aed5fc20] [c0c4add4] of_detach_node+0x78/0xb0
>   [c013aed5fc70] [c00f96a8] ofdt_write.part.0+0x86c/0xbb8
>   [c013aed5fce0] [c069b4bc] proc_reg_write+0xf4/0x150
>   [c013aed5fd10] [c05bfeb4] vfs_write+0xf8/0x488
>   [c013aed5fdc0] [c05c0570] ksys_write+0x84/0x140
>   [c013aed5fe10] [c0033358] system_call_exception+0x138/0x330
>   [c013aed5fe50] [c000d05c] 
> system_call_vectored_common+0x15c/0x2ec
>   --- interrupt: 3000 at 0x2433acb4
>   
>   ---[ end trace  ]---
> 
> The patch makes the iommu_group_get() call only when using it, thereby
> avoiding the leak.
> 
> Fixes: a8ca9fc9134c ("powerpc/iommu: Do not do platform domain attach 
> atctions after probe")
> Reported-by: Venkat Rao Bagalkote 
> Closes: 
> https://lore.kernel.org/all/274e0d2b-b5cc-475e-94e6-8427e88e2...@linux.vnet.ibm.com
> Signed-off-by: Shivaprasad G Bhat 
> ---
> Changelog:
> v1: 
> https://lore.kernel.org/all/170784021983.6249.10039296655906636112.st...@linux.ibm.com/
>  - Minor refactor to call the iommu_group_get() only if required.
>  - Updated the title, description and signature(Closes/Reported-by).
> 
>  arch/powerpc/kernel/iommu.c |3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Reviewed-by: Jason Gunthorpe 

Thanks again,
Jason


Re: [PATCH] powerpc/iommu: Fix the missing iommu_group_put() during platform domain attach

2024-02-14 Thread Jason Gunthorpe
On Wed, Feb 14, 2024 at 11:53:20PM +1100, Michael Ellerman wrote:
> Venkat Rao Bagalkote  writes:
> > Thanks for the patch. Applied this patch and verified and issue is fixed.
> >
> > This issue was originally reported in the mail below.
> >
> > https://marc.info/?l=linux-kernel=170737160630106=2
> 
> Please use lore for links, in this case:
> 
> https://lore.kernel.org/all/274e0d2b-b5cc-475e-94e6-8427e88e2...@linux.vnet.ibm.com/

Also if you are respinning you may prefer this

@@ -1285,14 +1285,15 @@ spapr_tce_platform_iommu_attach_dev(struct iommu_domain 
*platform_domain,
struct device *dev)
 {
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-   struct iommu_group *grp = iommu_group_get(dev);
struct iommu_table_group *table_group;
+   struct iommu_group *grp;
int ret = -EINVAL;
 
/* At first attach the ownership is already set */
if (!domain)
return 0;
 
+   grp = iommu_group_get(dev);
if (!grp)
return -ENODEV;

Which is sort of why this happened in the first place :)

Jason


Re: [PATCH] powerpc/iommu: Fix the missing iommu_group_put() during platform domain attach

2024-02-13 Thread Jason Gunthorpe
On Tue, Feb 13, 2024 at 10:05:22AM -0600, Shivaprasad G Bhat wrote:
> The function spapr_tce_platform_iommu_attach_dev() is missing to call
> iommu_group_put() when the domain is already set. This refcount leak
> shows up with BUG_ON() during DLPAR remove operation as,
> 
>   KernelBug: Kernel bug in state 'None': kernel BUG at 
> arch/powerpc/platforms/pseries/iommu.c:100!
>   Oops: Exception in kernel mode, sig: 5 [#1]
>   LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=8192 NUMA pSeries
>   
>   Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 
> of:IBM,FW1060.00 (NH1060_016) hv:phyp pSeries
>   NIP:  c00ff4d4 LR: c00ff4cc CTR: 
>   REGS: c013aed5f840 TRAP: 0700   Tainted: G  I 
> (6.8.0-rc3-autotest-g99bd3cb0d12e)
>   MSR:  80029033   CR: 44002402  XER: 2004
>   CFAR: c0a0d170 IRQMASK: 0
>   GPR00: c00ff4cc c013aed5fae0 c1512700 c013aa362138
>   GPR04:    000119c8afd0
>   GPR08:  c01284442b00 0001 1003
>   GPR12: 0003 c0182f00  
>   GPR16:    
>   GPR20:    
>   GPR24: c013aed5fc40 0002  c2757d90
>   GPR28: c00ff440 c2757cb8 c0183799c1a0 c013aa362b00
>   NIP [c00ff4d4] iommu_reconfig_notifier+0x94/0x200
>   LR [c00ff4cc] iommu_reconfig_notifier+0x8c/0x200
>   Call Trace:
>   [c013aed5fae0] [c00ff4cc] iommu_reconfig_notifier+0x8c/0x200 
> (unreliable)
>   [c013aed5fb10] [c01a27b0] notifier_call_chain+0xb8/0x19c
>   [c013aed5fb70] [c01a2a78] blocking_notifier_call_chain+0x64/0x98
>   [c013aed5fbb0] [c0c4a898] of_reconfig_notify+0x44/0xdc
>   [c013aed5fc20] [c0c4add4] of_detach_node+0x78/0xb0
>   [c013aed5fc70] [c00f96a8] ofdt_write.part.0+0x86c/0xbb8
>   [c013aed5fce0] [c069b4bc] proc_reg_write+0xf4/0x150
>   [c013aed5fd10] [c05bfeb4] vfs_write+0xf8/0x488
>   [c013aed5fdc0] [c05c0570] ksys_write+0x84/0x140
>   [c013aed5fe10] [c0033358] system_call_exception+0x138/0x330
>   [c013aed5fe50] [c000d05c] 
> system_call_vectored_common+0x15c/0x2ec
>   --- interrupt: 3000 at 0x2433acb4
>   
>   ---[ end trace  ]---
> 
> The patch adds the missing iommu_group_put() call.
> 
> Fixes: a8ca9fc9134c ("powerpc/iommu: Do not do platform domain attach 
> atctions after probe")
> Signed-off-by: Shivaprasad G Bhat 
> ---
>  arch/powerpc/kernel/iommu.c |4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)

Doh, that is a weird splat for this but thanks for finding it

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH v2] powerpc: iommu: Bring back table group release_ownership() call

2024-02-01 Thread Jason Gunthorpe
On Fri, Jan 26, 2024 at 09:09:18AM -0600, Shivaprasad G Bhat wrote:
> The commit 2ad56efa80db ("powerpc/iommu: Setup a default domain and
> remove set_platform_dma_ops") refactored the code removing the
> set_platform_dma_ops(). It missed out the table group
> release_ownership() call which would have got called otherwise
> during the guest shutdown via vfio_group_detach_container(). On
> PPC64, this particular call actually sets up the 32-bit TCE table,
> and enables the 64-bit DMA bypass etc. Now after guest shutdown,
> the subsequent host driver (e.g megaraid-sas) probe post unbind
> from vfio-pci fails like,
> 
> megaraid_sas 0031:01:00.0: Warning: IOMMU dma not supported: mask 
> 0x7fff, table unavailable
> megaraid_sas 0031:01:00.0: Warning: IOMMU dma not supported: mask 0x, 
> table unavailable
> megaraid_sas 0031:01:00.0: Failed to set DMA mask
> megaraid_sas 0031:01:00.0: Failed from megasas_init_fw 6539
> 
> The patch brings back the call to table_group release_ownership()
> call when switching back to PLATFORM domain from BLOCKED, while
> also separates the domain_ops for both.
> 
> Fixes: 2ad56efa80db ("powerpc/iommu: Setup a default domain and remove 
> set_platform_dma_ops")
> Signed-off-by: Shivaprasad G Bhat 
> ---
> Changelog:
> v1: 
> https://lore.kernel.org/linux-iommu/170618451433.3805.9015493852395837391.st...@ltcd48-lp2.aus.stglab.ibm.com/
>  - Split the common attach_dev call to platform and blocked attach_dev
>calls as suggested.
> 
>  arch/powerpc/kernel/iommu.c |   37 ++++-
>  1 file changed, 28 insertions(+), 9 deletions(-)

Reviewed-by: Jason Gunthorpe 

> @@ -1312,13 +1312,32 @@ static struct iommu_domain spapr_tce_platform_domain 
> = {
>   .ops = &spapr_tce_platform_domain_ops,
>  };
> 
> -static struct iommu_domain spapr_tce_blocked_domain = {
> - .type = IOMMU_DOMAIN_BLOCKED,
> +static int
> +spapr_tce_blocked_iommu_attach_dev(struct iommu_domain *platform_domain,

It was my typo but this should be "struct iommu_domain *blocked_domain"

Jason


Re: [PATCH 1/2] powerpc: iommu: Bring back table group release_ownership() call

2024-01-26 Thread Jason Gunthorpe
On Fri, Jan 26, 2024 at 09:39:56AM -0600, Timothy Pearson wrote:
> > Just forget about the weird KVM and SPAPR stuff, leave it under the
> > kconfig of the old code and nobody will run it. Almost nobody already
> > runs it, apparently.
> 
> We actually use QEMU/KVM/VFIO extensively at Raptor, so need the
> support and need it to be performant...

I wonder if you alone are the "almost" :)

The KVM entanglement was hairy and scary. I never did figure out what
was really going on there. Maybe you don't need all of it and can be
successful with a more typical iommu working model?

Suggest to tackle it after getting the first parts done.

Jason


Re: [PATCH 1/2] powerpc: iommu: Bring back table group release_ownership() call

2024-01-26 Thread Jason Gunthorpe
On Fri, Jan 26, 2024 at 09:29:55AM -0600, Timothy Pearson wrote:
> > On Fri, Jan 26, 2024 at 08:43:12PM +0530, Shivaprasad G Bhat wrote:
> >> > Also, is there any chance someone can work on actually fixing this to
> >> > be a proper iommu driver? I think that will become important for power
> >> > to use the common dma_iommu code in the next year...
> >> We are looking into it.
> > 
> > Okay, let me know, I can possibly help make parts of this happen
> > 
> > power is the last still-current architecture to be outside the modern
> > IOMMU and DMA API design and I'm going to start proposing things that
> > will not be efficient on power because of this.
> 
> I can get development resources on this fairly rapidly, including
> testing.  We should figure out the best way forward and how to deal
> with the VFIO side of things, even if that's a rewrite at the end of
> the day the machine-specific codebase isn't *that* large for our two
> target flavors (64-bit PowerNV and 64-bit pSeries).

I have a feeling the way forward is to just start a power driver under
drivers/iommu/ and use kconfig to make the user exclusively select
either the legacy arch or the modern iommu.

Get that working to a level where dma_iommu is happy.

Get iommufd support in the new driver good enough to run your
application.

Just forget about the weird KVM and SPAPR stuff, leave it under the
kconfig of the old code and nobody will run it. Almost nobody already
runs it, apparently.

Remove it in a few years

From what I remember the code under the arch directory is already very
nearly almost an iommu driver. I think someone could fairly quickly
get to something working and using dma_iommu.c. If s390 is any
experience there is some benchmarking and tweaking to get performance
equal to the arch's tweaked dma_iommu copy.

Jason


Re: [PATCH 1/2] powerpc: iommu: Bring back table group release_ownership() call

2024-01-26 Thread Jason Gunthorpe
On Fri, Jan 26, 2024 at 08:43:12PM +0530, Shivaprasad G Bhat wrote:
> > Also, is there any chance someone can work on actually fixing this to
> > be a proper iommu driver? I think that will become important for power
> > to use the common dma_iommu code in the next year...
> We are looking into it.

Okay, let me know, I can possibly help make parts of this happen

power is the last still-current architecture to be outside the modern
IOMMU and DMA API design and I'm going to start proposing things that
will not be efficient on power because of this.

I think a basic iommu driver using the dma API would not be so hard.

I don't know what to do about the SPAPR VFIO mess though. :(

Jason


Re: [PATCH 2/2] iommu: Fix the domain type checks when default_domain is set

2024-01-25 Thread Jason Gunthorpe
On Thu, Jan 25, 2024 at 06:08:52AM -0600, Shivaprasad G Bhat wrote:
> On PPC64, the iommu_ops.def_domain_type() is not defined and
> CONFIG_IOMMU_DMA not enabled. With commit 0f6a90436a57 ("iommu: Do not
> use IOMMU_DOMAIN_DMA if CONFIG_IOMMU_DMA is not enabled"), the
> iommu_get_default_domain_type() returns IOMMU_DOMAIN_IDENTITY. With
> commit 2ad56efa80db ("powerpc/iommu: Setup a default domain and remove
> set_platform_dma_ops"), the default_domain is set with the type being
> PLATFORM. With these two changes, iommu_group_alloc_default_domain()
> ends up returning the NULL(with recent changes, ERR_PTR(-EINVAL))
> leading to iommu_probe_device() failure and the device has no
> iommu_group set in effect. Subsequently, the bind to vfio(VFIO_IOMMU)
> fail as the iommu_group is not set for the device.
> 
> Make the iommu_get_default_domain_type() to take default_domain->type
> into consideration along with default_domain_type() and fix
> iommu_group_alloc_default_domain() to not error out if the requested
> type is same as default domain type.
> 
> Fixes: 2ad56efa80db ("powerpc/iommu: Setup a default domain and remove 
> set_platform_dma_ops")
> Fixes: 0f6a90436a57 ("iommu: Do not use IOMMU_DOMAIN_DMA if CONFIG_IOMMU_DMA 
> is not enabled")
> Signed-off-by: Shivaprasad G Bhat 
> ---
>  drivers/iommu/iommu.c |   14 ++
>  1 file changed, 10 insertions(+), 4 deletions(-)

Are you OK with this version?

https://lore.kernel.org/linux-iommu/20240123174905.gs50...@ziepe.ca/

?

I think it does the same thing

Jason


Re: [PATCH 1/2] powerpc: iommu: Bring back table group release_ownership() call

2024-01-25 Thread Jason Gunthorpe
On Thu, Jan 25, 2024 at 06:08:39AM -0600, Shivaprasad G Bhat wrote:
> The commit 2ad56efa80db ("powerpc/iommu: Setup a default domain and
> remove set_platform_dma_ops") refactored the code removing the
> set_platform_dma_ops(). It missed out the table group
> release_ownership() call which would have got called otherwise
> during the guest shutdown via vfio_group_detach_container(). On
> PPC64, this particular call actually sets up the 32-bit TCE table,
> and enables the 64-bit DMA bypass etc. Now after guest shutdown,
> the subsequent host driver (e.g megaraid-sas) probe post unbind
> from vfio-pci fails like,
> 
> megaraid_sas 0031:01:00.0: Warning: IOMMU dma not supported: mask 
> 0x7fff, table unavailable
> megaraid_sas 0031:01:00.0: Warning: IOMMU dma not supported: mask 0x, 
> table unavailable
> megaraid_sas 0031:01:00.0: Failed to set DMA mask
> megaraid_sas 0031:01:00.0: Failed from megasas_init_fw 6539
> 
> The patch brings back the call to table_group release_ownership()
> call when switching back to PLATFORM domain.
> 
> Fixes: 2ad56efa80db ("powerpc/iommu: Setup a default domain and remove 
> set_platform_dma_ops")
> Signed-off-by: Shivaprasad G Bhat 
> ---
>  arch/powerpc/kernel/iommu.c |   16 +---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index ebe259bdd462..ac7df43fa7ef 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -1296,9 +1296,19 @@ spapr_tce_platform_iommu_attach_dev(struct 
> iommu_domain *platform_domain,
>   if (!grp)
>   return -ENODEV;
>  
> - table_group = iommu_group_get_iommudata(grp);
> - ret = table_group->ops->take_ownership(table_group);
> - iommu_group_put(grp);
> + if (platform_domain->type == IOMMU_DOMAIN_PLATFORM) {
> + ret = 0;
> + table_group = iommu_group_get_iommudata(grp);
> + /*
> +  * The domain being set to PLATFORM from earlier
> +  * BLOCKED. The table_group ownership has to be released.
> +  */
> + table_group->ops->release_ownership(table_group);
> + } else if (platform_domain->type == IOMMU_DOMAIN_BLOCKED) {
> + table_group = iommu_group_get_iommudata(grp);
> + ret = table_group->ops->take_ownership(table_group);
> + iommu_group_put(grp);
> + }

Sure, but please split the function, don't test on the
platform->domain_type.

Also, is there any chance someone can work on actually fixing this to
be a proper iommu driver? I think that will become important for power
to use the common dma_iommu code in the next year...

Sort of like this:

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ebe259bdd46298..0d6a7fea2bd9a5 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1287,20 +1287,20 @@ spapr_tce_platform_iommu_attach_dev(struct iommu_domain 
*platform_domain,
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
struct iommu_group *grp = iommu_group_get(dev);
struct iommu_table_group *table_group;
-   int ret = -EINVAL;
 
/* At first attach the ownership is already set */
if (!domain)
return 0;
 
-   if (!grp)
-   return -ENODEV;
-
table_group = iommu_group_get_iommudata(grp);
-   ret = table_group->ops->take_ownership(table_group);
+   /*
+* The domain being set to PLATFORM from earlier
+* BLOCKED. The table_group ownership has to be released.
+*/
+   table_group->ops->release_ownership(table_group);
iommu_group_put(grp);
 
-   return ret;
+   return 0;
 }
 
 static const struct iommu_domain_ops spapr_tce_platform_domain_ops = {
@@ -1312,13 +1312,33 @@ static struct iommu_domain spapr_tce_platform_domain = {
.ops = &spapr_tce_platform_domain_ops,
 };
 
-static struct iommu_domain spapr_tce_blocked_domain = {
-   .type = IOMMU_DOMAIN_BLOCKED,
+static int
+spapr_tce_blocked_iommu_attach_dev(struct iommu_domain *platform_domain,
+struct device *dev)
+{
+   struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+   struct iommu_group *grp = iommu_group_get(dev);
+   struct iommu_table_group *table_group;
+   int ret = -EINVAL;
+
/*
 * FIXME: SPAPR mixes blocked and platform behaviors, the blocked domain
 * also sets the dma_api ops
 */
-   .ops = &spapr_tce_platform_domain_ops,
+   table_group = iommu_group_get_iommudata(grp);
+   ret = table_group->ops->take_ownership(table_group);
+   iommu_group_put(grp);
+
+   return ret;
+}
+
+static const struct iommu_domain_ops spapr_tce_blocked_domain_ops = {
+   .attach_dev = spapr_tce_blocked_iommu_attach_dev,
+};
+
+static struct iommu_domain spapr_tce_blocked_domain = {
+   .type = IOMMU_DOMAIN_BLOCKED,

Re: [PATCH v2 06/13] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

2024-01-17 Thread Jason Gunthorpe
On Tue, Jan 16, 2024 at 06:32:32PM +, Christophe Leroy wrote:
> >> hugepd is a page directory dedicated to huge pages, where you have huge
> >> pages listed instead of regular pages. For instance, on powerpc 32 with
> >> each PGD entries covering 4Mbytes, a regular page table has 1024 PTEs. A
> >> hugepd for 512k is a page table with 8 entries.
> >>
> >> And for 8Mbytes entries, the hugepd is a page table with only one entry.
> >> And 2 consecutive PGS entries will point to the same hugepd to cover the
> >> entire 8Mbytes.
> > 
> > That still sounds a lot like the ARM thing - except ARM replicates the
> > entry; you also said PPC replicates the entry like ARM to get to the
> > 8M?
> 
> Is it like ARM? Not sure. The PTE is not in the PGD; it must be in an L2
> directory, even for 8M.

Your diagram looks almost exactly like ARM to me.

The key thing is that the address for the L2 Table is *always* formed as:

   L2 Table Base << 12 + L2 Index << 2 + 00

Then the L2 Descriptor must contains bits indicating the page
size. The L2 Descriptor is replicated to every 4k entry that the page
size covers.

The only difference I see is the 8M case which has a page size greater
than a single L1 entry.

> Yes that's how it works on powerpc. For 8xx we used to do that for both 
> 8M and 512k pages. Now for 512k pages we do kind of like ARM (which 
> means replicating the entry 128 times) as that's needed to allow mixing 
> different page sizes for a given PGD entry.

Right, you want to have granular page sizes or it becomes unusable in
the general case
 
> But for 8M pages that would mean replicating the entry 2048 times. 
> That's a bit too much isn't it ?

Indeed, de-duplicating the L2 Table is a neat optimization.

> > So if you imagine a pmd_leaf(), pmd_leaf_size() and a pte_leaf_size()
> > that would return enough information for both.
> 
> pmd_leaf() ? Unless I'm missing something I can't do leaf at PMD (PGD) 
> level. It must be a two-level process even for pages bigger than a PMD 
> entry.

Right, this is the normal THP/hugetlb situation on x86/etc. It
wouldn't apply here since it seems the HW doesn't have a bit in the L1
descriptor to indicate leaf.

Instead for PPC this hugepd stuff should start to follow Ryan's
generic work for ARM contig:

https://lore.kernel.org/all/20231218105100.172635-1-ryan.robe...@arm.com/

Specifically the arch implementation:

https://lore.kernel.org/linux-mm/20231218105100.172635-15-ryan.robe...@arm.com/

Ie the arch should ultimately wire up the replication and variable
page size bits within its implementation of set_ptes(). set_ptes()s
gets a contiguous run of addresses and should install it with maximum
use of the variable page sizes. The core code will start to call
set_ptes() in more cases as Ryan's project progresses.

For the purposes of GUP, where we are today and where we are going,
it would be much better to not have a special PPC-specific "hugepd"
parser. Just process each of the 4k replicates one by one like ARM is
starting with.

The arch would still have to return the correct page address from
pte_phys() which I think Ryan is doing by having the replicates encode
the full 4k based address in each entry. The HW will ignore those low
bits and pte_phys() then works properly. This would work for PPC as
well, excluding the 8M optimization.

Going forward I'd expect to see some pte_page_size() that returns the
size bits and GUP can have logic to skip reading replicates.
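
A hedged sketch of that idea in a gup_pte_range()-style loop; pte_page_size()
here is the imagined helper from the previous paragraph, not an existing API:

	do {
		pte_t pte = ptep_get(ptep);
		/* Imagined helper: PAGE_SIZE for a normal pte, the leaf size
		 * (e.g. 512k or 8M on 8xx) for a replicated large entry. */
		unsigned long sz = pte_page_size(pte);
		unsigned int nr = sz >> PAGE_SHIFT;

		/* ... grab the folio covering 'nr' pages as usual ... */

		/* Skip the replicates that describe the same physical leaf. */
		addr += sz;
		ptep += nr;
	} while (addr != end);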

The advantage of all this is that it stops making the feature special
and the work Ryan is doing to generically push larger folios into
set_ptes will become usable on these PPC platforms as well. And we can
kill the PPC specific hugepd.

Jason


Re: [PATCH v2 06/13] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

2024-01-16 Thread Jason Gunthorpe
On Tue, Jan 16, 2024 at 06:30:39AM +, Christophe Leroy wrote:
> 
> 
> Le 15/01/2024 à 19:37, Jason Gunthorpe a écrit :
> > On Wed, Jan 03, 2024 at 05:14:16PM +0800, pet...@redhat.com wrote:
> >> From: Peter Xu 
> >>
> >> Hugepd format for GUP is only used in PowerPC with hugetlbfs.  There are
> >> some kernel usage of hugepd (can refer to hugepd_populate_kernel() for
> >> PPC_8XX), however those pages are not candidates for GUP.
> >>
> >> Commit a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
> >> file-backed mappings") added a check to fail gup-fast if there's potential
> >> risk of violating GUP over writeback file systems.  That should never apply
> >> to hugepd.  Considering that hugepd is an old format (and even
> >> software-only), there's no plan to extend hugepd into other file typed
> >> memories that is prone to the same issue.
> > 
> > I didn't dig into the ppc stuff too deeply, but this looks to me like
> > it is the same thing as ARM's contig bits?
> > 
> > ie a chunk of PMD/etc entries are all managed together as though they
> > are a virtual larger entry and we use the hugepte_addr_end() stuff to
> > iterate over each sub entry.
> 
> As far as I understand ARM's contig stuff, hugepd on powerpc is 
> something different.
> 
> hugepd is a page directory dedicated to huge pages, where you have huge 
> pages listed instead of regular pages. For instance, on powerpc 32 with 
> each PGD entries covering 4Mbytes, a regular page table has 1024 PTEs. A 
> hugepd for 512k is a page table with 8 entries.
> 
> And for 8Mbytes entries, the hugepd is a page table with only one entry. 
> And 2 consecutive PGS entries will point to the same hugepd to cover the 
> entire 8Mbytes.

That still sounds a lot like the ARM thing - except ARM replicates the
entry; you also said PPC replicates the entry like ARM to get to the
8M?

I guess the difference is in how the table memory is laid out? ARM
marks the size in the same entry that has the physical address, so the
entries are self-describing and then replicated. It kind of sounds
like PPC is marking the size in the prior level and then reconfiguring the
layout of the lower level? Otherwise it surely must do the same
replication to make a radix index work..

If yes, I guess that is the main problem: the mm APIs don't have a way
today to convey data from the pgd level to understand how to parse the
pmd level?

> > It seems to me we should see ARM and PPC agree on what the API is for
> > this and then get rid of hugepd by making both use the same page table
> > walker API. Is that too hopeful?
> 
> Can't see the similarity between ARM contig PTE and PPC huge page 
> directories.

Well, they are both variable sized entries.

So if you imagine a pmd_leaf(), pmd_leaf_size() and a pte_leaf_size()
that would return enough information for both.
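
A rough sketch of how a walker might consume such an API; handle_leaf() is a
placeholder, and the helpers are used only with the semantics imagined above:

static void walk_pmd_sketch(pmd_t *pmdp, unsigned long addr, unsigned long end)
{
	pmd_t pmd = READ_ONCE(*pmdp);
	pte_t *ptep;

	if (pmd_leaf(pmd)) {
		/* One mapping at this level; size reported by the arch. */
		handle_leaf(pmd_pfn(pmd), pmd_leaf_size(pmd));
		return;
	}

	for (ptep = pte_offset_kernel(pmdp, addr); addr < end; ) {
		pte_t pte = READ_ONCE(*ptep);
		/* PAGE_SIZE for a normal pte, larger for a contig/512k leaf. */
		unsigned long size = pte_leaf_size(pte);

		handle_leaf(pte_pfn(pte), size);
		/* Skip the replicated entries covering the same leaf. */
		addr += size;
		ptep += size / PAGE_SIZE;
	}
}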

Jason


Re: [PATCH v2 11/13] mm/gup: Handle huge pmd for follow_pmd_mask()

2024-01-15 Thread Jason Gunthorpe
On Wed, Jan 03, 2024 at 05:14:21PM +0800, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> Replace pmd_trans_huge() with pmd_thp_or_huge() to also cover pmd_huge() as
> long as enabled.
> 
> FOLL_TOUCH and FOLL_SPLIT_PMD only apply to THP, not yet huge.
> 
> Since now follow_trans_huge_pmd() can process hugetlb pages, renaming it
> into follow_huge_pmd() to match what it does.  Move it into gup.c so not
> depend on CONFIG_THP.
> 
> When at it, move the ctx->page_mask setup into follow_huge_pmd(), only set
> it when the page is valid.  It was not a bug to set it before even if GUP
> failed (page==NULL), because follow_page_mask() callers always ignores
> page_mask if so.  But doing so makes the code cleaner.
> 
> Signed-off-by: Peter Xu 
> ---
>  mm/gup.c | 107 ---
>  mm/huge_memory.c |  86 +
>  mm/internal.h|   5 +--
>  3 files changed, 105 insertions(+), 93 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH v2 10/13] mm/gup: Handle huge pud for follow_pud_mask()

2024-01-15 Thread Jason Gunthorpe
On Wed, Jan 03, 2024 at 05:14:20PM +0800, pet...@redhat.com wrote:
> diff --git a/mm/gup.c b/mm/gup.c
> index 63845b3ec44f..760406180222 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -525,6 +525,70 @@ static struct page *no_page_table(struct vm_area_struct 
> *vma,
>   return NULL;
>  }
>  
> +#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
> +static struct page *follow_huge_pud(struct vm_area_struct *vma,
> + unsigned long addr, pud_t *pudp,
> + int flags, struct follow_page_context *ctx)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + struct page *page;
> + pud_t pud = *pudp;
> + unsigned long pfn = pud_pfn(pud);
> + int ret;
> +
> + assert_spin_locked(pud_lockptr(mm, pudp));
> +
> + if ((flags & FOLL_WRITE) && !pud_write(pud))
> + return NULL;
> +
> + if (!pud_present(pud))
> + return NULL;
> +
> + pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
> +
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> + if (pud_devmap(pud)) {

Can this use IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) ?

> + /*
> +  * device mapped pages can only be returned if the caller
> +  * will manage the page reference count.
> +  *
> +  * At least one of FOLL_GET | FOLL_PIN must be set, so
> +  * assert that here:
> +  */
> + if (!(flags & (FOLL_GET | FOLL_PIN)))
> + return ERR_PTR(-EEXIST);
> +
> + if (flags & FOLL_TOUCH)
> + touch_pud(vma, addr, pudp, flags & FOLL_WRITE);
> +
> + ctx->pgmap = get_dev_pagemap(pfn, ctx->pgmap);
> + if (!ctx->pgmap)
> + return ERR_PTR(-EFAULT);
> + }
> +#endif   /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
> + page = pfn_to_page(pfn);
> +
> + if (!pud_devmap(pud) && !pud_write(pud) &&
> + gup_must_unshare(vma, flags, page))
> + return ERR_PTR(-EMLINK);
> +
> + ret = try_grab_page(page, flags);
> + if (ret)
> + page = ERR_PTR(ret);
> + else
> + ctx->page_mask = HPAGE_PUD_NR - 1;
> +
> + return page;
> +}
> +#else  /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
> +static struct page *follow_huge_pud(struct vm_area_struct *vma,
> + unsigned long addr, pud_t *pudp,
> + int flags, struct follow_page_context *ctx)
> +{
> + return NULL;
> +}
> +#endif   /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
> +
>  static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
>   pte_t *pte, unsigned int flags)
>  {
> @@ -760,11 +824,11 @@ static struct page *follow_pud_mask(struct 
> vm_area_struct *vma,
>  
>   pudp = pud_offset(p4dp, address);
>   pud = READ_ONCE(*pudp);
> - if (pud_none(pud))
> + if (pud_none(pud) || !pud_present(pud))
>   return no_page_table(vma, flags, address);

Isn't 'pud_none() || !pud_present()' redundant? A none pud is
non-present, by definition?

> - if (pud_devmap(pud)) {
> + if (pud_huge(pud)) {
>       ptl = pud_lock(mm, pudp);
> - page = follow_devmap_pud(vma, address, pudp, flags, &ctx->pgmap);
> + page = follow_huge_pud(vma, address, pudp, flags, ctx);
>   spin_unlock(ptl);
>   if (page)
>   return page;

Otherwise it looks OK to me

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH v2 09/13] mm/gup: Cache *pudp in follow_pud_mask()

2024-01-15 Thread Jason Gunthorpe
On Wed, Jan 03, 2024 at 05:14:19PM +0800, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> Introduce "pud_t pud" in the function, so the code won't dereference *pudp
> multiple times.  Not only because that looks less straightforward, but also
> because if the dereference really happened, it's not clear whether there
> can be race to see different *pudp values if it's being modified at the
> same time.
> 
> Acked-by: James Houghton 
> Signed-off-by: Peter Xu 
> ---
>  mm/gup.c | 17 +
>  1 file changed, 9 insertions(+), 8 deletions(-)

I think we have several more cases like this, and I certainly agree
code should not access a READ_ONCE variable more than once :(

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH v2 08/13] mm/gup: Handle hugetlb for no_page_table()

2024-01-15 Thread Jason Gunthorpe
On Wed, Jan 03, 2024 at 05:14:18PM +0800, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> no_page_table() is not yet used for hugetlb code paths. Make it prepared.
> 
> The major difference here is hugetlb will return -EFAULT as long as page
> cache does not exist, even if VM_SHARED.  See hugetlb_follow_page_mask().
> 
> Pass "address" into no_page_table() too, as hugetlb will need it.
> 
> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Peter Xu 
> ---
>  mm/gup.c | 44 ++--
>  1 file changed, 26 insertions(+), 18 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH v2 07/13] mm/gup: Refactor record_subpages() to find 1st small page

2024-01-15 Thread Jason Gunthorpe
On Wed, Jan 03, 2024 at 05:14:17PM +0800, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> All the fast-gup functions take a tail page to operate, always need to do
> page mask calculations before feeding that into record_subpages().
> 
> Merge that logic into record_subpages(), so that it will do the nth_page()
> calculation.
> 
> Signed-off-by: Peter Xu 
> ---
>  mm/gup.c | 25 ++---
>  1 file changed, 14 insertions(+), 11 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH v2 06/13] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

2024-01-15 Thread Jason Gunthorpe
On Wed, Jan 03, 2024 at 05:14:16PM +0800, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> Hugepd format for GUP is only used in PowerPC with hugetlbfs.  There are
> some kernel usage of hugepd (can refer to hugepd_populate_kernel() for
> PPC_8XX), however those pages are not candidates for GUP.
> 
> Commit a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
> file-backed mappings") added a check to fail gup-fast if there's potential
> risk of violating GUP over writeback file systems.  That should never apply
> to hugepd.  Considering that hugepd is an old format (and even
> software-only), there's no plan to extend hugepd into other file typed
> memories that is prone to the same issue.

I didn't dig into the ppc stuff too deeply, but this looks to me like
it is the same thing as ARM's contig bits?

ie a chunk of PMD/etc entries are all managed together as though they
are a virtual larger entry and we use the hugepte_addr_end() stuff to
iterate over each sub entry.

But WHY is GUP doing this or caring about this? GUP should have no
problem handling the super-size entry (eg 8M on nohash) as a single
thing. It seems we only lack an API to get this out of the arch code?

It seems to me we should see ARM and PPC agree on what the API is for
this and then get rid of hugepd by making both use the same page table
walker API. Is that too hopeful?

> Drop that check, not only because it'll never be true for hugepd per any
> known plan, but also it paves way for reusing the function outside
> fast-gup.

I didn't see any other caller of this function in this series? When
does this re-use happen??

Jason


Re: [PATCH v2 04/13] mm: Make HPAGE_PXD_* macros even if !THP

2024-01-15 Thread Jason Gunthorpe
On Wed, Jan 03, 2024 at 05:14:14PM +0800, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> These macros can be helpful when we plan to merge hugetlb code into generic
> code.  Move them out and define them even if !THP.
> 
> We actually already defined HPAGE_PMD_NR for other reasons even if !THP.
> Reorganize these macros.
> 
> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Peter Xu 
> ---
>  include/linux/huge_mm.h | 17 ++---
>  1 file changed, 6 insertions(+), 11 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH v2 03/13] mm: Provide generic pmd_thp_or_huge()

2024-01-15 Thread Jason Gunthorpe
On Wed, Jan 03, 2024 at 05:14:13PM +0800, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> ARM defines pmd_thp_or_huge(), detecting either a THP or a huge PMD.  It
> can be a helpful helper if we want to merge more THP and hugetlb code
> paths.  Make it a generic default implementation, only exist when
> CONFIG_MMU.  Arch can overwrite it by defining its own version.
> 
> For example, ARM's pgtable-2level.h defines it to always return false.
> 
> Keep the macro declared with all config, it should be optimized to a false
> anyway if !THP && !HUGETLB.
> 
> Signed-off-by: Peter Xu 
> ---
>  include/linux/pgtable.h | 4 
>  mm/gup.c| 3 +--
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 466cf477551a..2b42e95a4e3a 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1362,6 +1362,10 @@ static inline int pmd_write(pmd_t pmd)
>  #endif /* pmd_write */
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> +#ifndef pmd_thp_or_huge
> +#define pmd_thp_or_huge(pmd) (pmd_huge(pmd) || pmd_trans_huge(pmd))
> +#endif

Why not just use pmd_leaf() ?

This GUP case seems to me exactly like what pmd_leaf() should really
do and be used for..

eg x86 does:

#define pmd_leaf pmd_large
static inline int pmd_large(pmd_t pte)
return pmd_flags(pte) & _PAGE_PSE;

static inline int pmd_trans_huge(pmd_t pmd)
return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;

int pmd_huge(pmd_t pmd)
return !pmd_none(pmd) &&
(pmd_val(pmd) & (_PAGE_PRESENT|_PAGE_PSE)) != _PAGE_PRESENT;

I spot checked a couple arches and it looks like it holds up.

Further, it looks to me like this site in GUP is the only core code
caller..

So, I'd suggest a small series to go arch by arch and convert the arch
to use pmd_huge() == pmd_leaf(). Then retire pmd_huge() as a public
API.

> diff --git a/mm/gup.c b/mm/gup.c
> index df83182ec72d..eebae70d2465 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -3004,8 +3004,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, 
> unsigned long addr, unsigned lo
>   if (!pmd_present(pmd))
>   return 0;
>  
> - if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd) ||
> -  pmd_devmap(pmd))) {
> + if (unlikely(pmd_thp_or_huge(pmd) || pmd_devmap(pmd))) {
>   /* See gup_pte_range() */
>   if (pmd_protnone(pmd))
>   return 0;

And the devmap thing here doesn't make any sense either. The arch
should ensure that pmd_devmap() implies pmd_leaf(). Since devmap is a
purely SW construct it almost certainly does already anyhow.
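
In other words, a sketch of where this could end up, under the assumption
that every arch makes pmd_trans_huge() and pmd_devmap() imply pmd_leaf():

	/* Sketch of the suggested simplification in gup_pmd_range(). */
	if (unlikely(pmd_leaf(pmd))) {
		/* See gup_pte_range() */
		if (pmd_protnone(pmd))
			return 0;
		/* ... THP / devmap / huge handling as before ... */
	}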

Jason


Re: [PATCH v2 01/13] mm/Kconfig: CONFIG_PGTABLE_HAS_HUGE_LEAVES

2024-01-15 Thread Jason Gunthorpe
On Wed, Jan 03, 2024 at 05:14:11PM +0800, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> Introduce a config option that will be selected as long as huge leaves are
> involved in pgtable (thp or hugetlbfs).  It would be useful to mark any
> code with this new config that can process either hugetlb or thp pages in
> any level that is higher than pte level.
> 
> Signed-off-by: Peter Xu 
> ---
>  mm/Kconfig | 3 +++
>  1 file changed, 3 insertions(+)

So you mean anything that supports page table entries > PAGE_SIZE?

Makes sense to me, though maybe add a comment in the kconfig?

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH 05/26] vfio: KVM: Pass get/put helpers from KVM to VFIO, don't do circular lookup

2023-12-03 Thread Jason Gunthorpe
On Fri, Dec 01, 2023 at 04:51:55PM -0800, Sean Christopherson wrote:

> There's one more wrinkle: this patch is buggy in that it doesn't ensure the 
> liveliness
> of KVM-the-module, i.e. nothing prevents userspace from unloading kvm.ko 
> while VFIO
> still holds a reference to a kvm structure, and so invoking ->put_kvm() could 
> jump
> into freed code.  To fix that, KVM would also need to pass along a module 
> pointer :-(

Maybe we should be refcounting the struct file not the struct kvm?

Then we don't need special helpers and it keeps the module alive correctly.
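
A rough sketch of the idea; the names below are made up for illustration, and
the point is only that a held struct file pins the module that implements its
f_op, so a callback into freed code cannot happen:

/* Hypothetical sketch only. */
struct vfio_kvm_ref {
	struct file *kvm_file;
};

static void vfio_hold_kvm_file(struct vfio_kvm_ref *ref, struct file *kvm_file)
{
	/* get_file() pins the file; the file pins kvm.ko via f_op->owner,
	 * so its ->release cannot become freed code while we hold it. */
	ref->kvm_file = get_file(kvm_file);
}

static void vfio_drop_kvm_file(struct vfio_kvm_ref *ref)
{
	fput(ref->kvm_file);	/* kvm.ko may be unloaded only after this */
	ref->kvm_file = NULL;
}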

Jason


Re: Ping? Re: [PATCH rc] kvm: Prevent compiling virt/kvm/vfio.c unless VFIO is selected

2023-11-30 Thread Jason Gunthorpe
On Wed, Nov 29, 2023 at 06:02:08PM -0800, Sean Christopherson wrote:

> > > Ah, it's the same warning, I just missed the CONFIG_MODULES=n requirement.
> > 
> > Oh, wait, doesn't that mean the approach won't work? IIRC doesn't
> > symbol_get turn into just &(x) when non-modular, turning this into a
> > link failure without the kconfig part?
> 
> Yes, but it doesn't cause linker errors.  IIUC, because the extern declaration
> is tagged "weak", a dummy default is used.  E.g. on x86, this is what is 
> generated
> with VFIO=y

Oh wow that is some pretty dark magic there :|
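
For reference, a hedged sketch of what that expansion does on a VFIO=n,
CONFIG_MODULES=n build (based on the symbol_get() definition quoted elsewhere
in the thread): the weak, hidden declaration makes an undefined symbol resolve
to address 0, so the caller's NULL check is what keeps the link and runtime safe.

/* Roughly what symbol_get(vfio_file_iommu_group) expands to: */
extern typeof(vfio_file_iommu_group) vfio_file_iommu_group
	__attribute__((weak, visibility("hidden")));

fn = &vfio_file_iommu_group;	/* NULL when the symbol does not exist */
if (!fn)
	return false;		/* illustrative error path, not the real caller's */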

Jason


Re: Ping? Re: [PATCH rc] kvm: Prevent compiling virt/kvm/vfio.c unless VFIO is selected

2023-11-29 Thread Jason Gunthorpe
On Wed, Nov 29, 2023 at 05:07:45PM -0800, Sean Christopherson wrote:
> On Wed, Nov 29, 2023, Michael Ellerman wrote:
> > Sean Christopherson  writes:
> > > On Fri, Nov 10, 2023, Michael Ellerman wrote:
> > >> Jason Gunthorpe  writes:
> > >> > There are a bunch of reported randconfig failures now because of this,
> > >> > something like:
> > >> >
> > >> >>> arch/powerpc/kvm/../../../virt/kvm/vfio.c:89:7: warning: attribute 
> > >> >>> declaration must precede definition [-Wignored-attributes]
> > >> >fn = symbol_get(vfio_file_iommu_group);
> > >> > ^
> > >> >include/linux/module.h:805:60: note: expanded from macro 
> > >> > 'symbol_get'
> > >> >#define symbol_get(x) ({ extern typeof(x) x 
> > >> > __attribute__((weak,visibility("hidden"))); &(x); })
> > >> >
> > >> > It happens because the arch forces KVM_VFIO without knowing if VFIO is
> > >> > even enabled.
> > >> 
> > >> This is still breaking some builds. Can we get this fix in please?
> > >> 
> > >> cheers
> > >> 
> > >> > Split the kconfig so the arch selects the usual HAVE_KVM_ARCH_VFIO and
> > >> > then KVM_VFIO is only enabled if the arch wants it and VFIO is turned 
> > >> > on.
> > >
> > > Heh, so I was trying to figure out why things like vfio_file_set_kvm() 
> > > aren't
> > > problematic, i.e. why the existing mess didn't cause failures.  I can't 
> > > repro the
> > > warning (requires clang-16?), but IIUC the reason only the group code is 
> > > problematic
> > > is that vfio.h creates a stub for vfio_file_iommu_group() and thus 
> > > there's no symbol,
> > > whereas vfio.h declares vfio_file_set_kvm() unconditionally.
> > 
> > That warning I'm unsure about.
> 
> Ah, it's the same warning, I just missed the CONFIG_MODULES=n requirement.

Oh, wait, doesn't that mean the approach won't work? IIRC doesn't
symbol_get turn into just &(x) when non-modular, turning this into a
link failure without the kconfig part?

Jason


Re: Ping? Re: [PATCH rc] kvm: Prevent compiling virt/kvm/vfio.c unless VFIO is selected

2023-11-29 Thread Jason Gunthorpe
On Tue, Nov 28, 2023 at 06:21:42PM -0800, Sean Christopherson wrote:
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 454e9295970c..a65b2513f8cd 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -289,16 +289,12 @@ void vfio_combine_iova_ranges(struct rb_root_cached 
> *root, u32 cur_nodes,
>  /*
>   * External user API
>   */
> -#if IS_ENABLED(CONFIG_VFIO_GROUP)
>  struct iommu_group *vfio_file_iommu_group(struct file *file);
> +
> +#if IS_ENABLED(CONFIG_VFIO_GROUP)
>  bool vfio_file_is_group(struct file *file);
>  bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
>  #else
> -static inline struct iommu_group *vfio_file_iommu_group(struct file *file)
> -{
> -   return NULL;
> -}
> -
>  static inline bool vfio_file_is_group(struct file *file)
>  {
> return false;
> 

So you symbol_get() on a symbol that can never be defined? Still says to
me the kconfig needs fixing :|

But sure, as a small patch it looks fine

Thanks,
Jason


Re: get_user_pages() and EXEC_ONLY mapping.

2023-11-10 Thread Jason Gunthorpe
On Fri, Nov 10, 2023 at 08:27:19PM +0530, Aneesh Kumar K V wrote:
> On 11/10/23 8:23 PM, Jason Gunthorpe wrote:
> > On Fri, Nov 10, 2023 at 08:19:23PM +0530, Aneesh Kumar K.V wrote:
> >>
> >> Hello,
> >>
> >> Some architectures can now support EXEC_ONLY mappings and I am wondering
> >> what get_user_pages() on those addresses should return. 
> > 
> > -EPERM
> > 
> >> Earlier PROT_EXEC implied PROT_READ and pte_access_permitted()
> >> returned true for that. But arm64 does have this explicit comment
> >> that says
> >>
> >>  /*
> >>  * p??_access_permitted() is true for valid user mappings (PTE_USER
> >>  * bit set, subject to the write permission check). For execute-only
> >>  * mappings, like PROT_EXEC with EPAN (both PTE_USER and PTE_UXN bits
> >>  * not set) must return false. PROT_NONE mappings do not have the
> >>  * PTE_VALID bit set.
> >>  */
> >>
> >> Is that correct? We should be able to get struct page for PROT_EXEC
> >> mappings?
> > 
> > If the memory is unreadable then providing a back door through
> > O_DIRECT and everything else to read it sounds wrong to me.
> > 
> > If there is some case where a get_user_pages caller is exec-only
> > compatible then a new FOLL_EXEC flag to permit it would make sense.
> > 
> 
> I was expecting pin_user_pages() to return -EPERM and get_user_pages()
> return struct page. This was based on 
> Documentation/core-api/pin_user_pages.rst  

Not unconditionally, but you could argue that FOLL_GET should
succeed. It depends how much you care about absolute security of
unreadable memory vs compatibility.

> "Another way of thinking about these flags is as a progression of 
> restrictions:
> FOLL_GET is for struct page manipulation, without affecting the data that the
> struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
> short term pins on pages whose data *will* get accessed. "

This was sort of aspirational - have we got rid of all the FOLL_GET
users that are touching the data? Looks like no from a quick check..

Jason


Re: get_user_pages() and EXEC_ONLY mapping.

2023-11-10 Thread Jason Gunthorpe
On Fri, Nov 10, 2023 at 08:19:23PM +0530, Aneesh Kumar K.V wrote:
> 
> Hello,
> 
> Some architectures can now support EXEC_ONLY mappings and I am wondering
> what get_user_pages() on those addresses should return. 

-EPERM

> Earlier PROT_EXEC implied PROT_READ and pte_access_permitted()
> returned true for that. But arm64 does have this explicit comment
> that says
> 
>  /*
>  * p??_access_permitted() is true for valid user mappings (PTE_USER
>  * bit set, subject to the write permission check). For execute-only
>  * mappings, like PROT_EXEC with EPAN (both PTE_USER and PTE_UXN bits
>  * not set) must return false. PROT_NONE mappings do not have the
>  * PTE_VALID bit set.
>  */
> 
> Is that correct? We should be able to get struct page for PROT_EXEC
> mappings?

If the memory is unreadable then providing a back door through
O_DIRECT and everything else to read it sounds wrong to me.

If there is some case where a get_user_pages caller is exec-only
compatible then a new FOLL_EXEC flag to permit it would make sense.
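To be concrete about the shape of that - purely a sketch, FOLL_EXEC does not
exist today and the flag value and function name are only illustrative - the
idea is that GUP keeps rejecting execute-only VMAs unless the caller
explicitly opts in:

#include <linux/mm.h>

#define FOLL_EXEC	0x1000000	/* hypothetical flag, value for illustration */

static bool example_vma_permits_gup(struct vm_area_struct *vma,
				    unsigned int gup_flags)
{
	if (gup_flags & FOLL_WRITE)
		return vma->vm_flags & VM_WRITE;

	if (vma->vm_flags & VM_READ)
		return true;

	/* Execute-only VMA: only allow callers that explicitly opted in. */
	return (vma->vm_flags & VM_EXEC) && (gup_flags & FOLL_EXEC);
}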

Jason


Re: [Bisected] [commit 2ad56efa80db] [Hotplug] WARNING while performing hotplug operation on 6.6-rc3-next

2023-10-06 Thread Jason Gunthorpe
On Fri, Oct 06, 2023 at 06:50:00PM +0530, Tasmiya Nalatwad wrote:
>Greetings,
> 
>Thanks Jason.
> 
>The fix provided by you works. It is not giving WARNINGs but I am
>seeing the below logs. Would you please confirm the logs.

I don't know anything about your environment but those logs don't
appear to be related to this series?

Jason


Re: [Bisected] [commit 2ad56efa80db] [Hotplug] WARNING while performing hotplug operation on 6.6-rc3-next

2023-10-06 Thread Jason Gunthorpe
On Fri, Oct 06, 2023 at 01:20:17PM +0530, Tasmiya Nalatwad wrote:
> Greetings,
> 
> [linux-next] [6.6.0-rc3-next-20230929] WARNING: CPU: 5 PID: 185612 at
> drivers/iommu/iommu.c:3049 iommu_setup_default_domain+0x410/0x680
> 
> --- Traces ---
> 
> [ 6296.425934] WARNING: CPU: 5 PID: 185612 at drivers/iommu/iommu.c:3049
> iommu_setup_default_domain+0x410/0x680

Does this fix it too? I think it should?

https://lore.kernel.org/r/0-v1-2b52423411b9+164fc-iommu_ppc_defdomain_...@nvidia.com

Jason


[PATCH] powerpc/iommu: Do not do platform domain attach actions after probe

2023-10-05 Thread Jason Gunthorpe
POWER throws a splat at boot; it looks like the DMA ops were probably
changed while a driver was attached. Something is still weird about how
POWER sequences its bootup. Previously this was hidden since the core
iommu code did nothing during probe; now it calls
spapr_tce_platform_iommu_attach_dev().

Make spapr_tce_platform_iommu_attach_dev() do nothing on the probe time
call like it did before.

  WARNING: CPU: 0 PID: 8 at arch/powerpc/kernel/iommu.c:407 
__iommu_free+0x1e4/0x1f0
  Modules linked in: sd_mod t10_pi crc64_rocksoft crc64 sg ibmvfc mlx5_core(+) 
scsi_transport_fc ibmveth mlxfw psample dm_multipath dm_mirror dm_region_hash 
dm_log dm_mod fuse
  CPU: 0 PID: 8 Comm: kworker/0:0 Not tainted 6.6.0-rc3-next-20230929-auto #1
  Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 of:IBM,FW1030.30 
(NH1030_062) hv:phyp pSeries
  Workqueue: events work_for_cpu_fn
  NIP:  c005f6d4 LR: c005f6d0 CTR: 005ca81c
  REGS: c3a27890 TRAP: 0700   Not tainted (6.6.0-rc3-next-20230929-auto)
  MSR:  8282b033   CR: 48000824  XER: 
0008
  CFAR: c020f738 IRQMASK: 0
  GPR00: c005f6d0 c3a27b30 c1481800 017
  GPR04: 7fff c3a27950 c3a27948 0027
  GPR08: c00c18c07c10 0001 0027 c2ac8a08
  GPR12:  c2ff c019cc88 c3042300
  GPR16:    c3071ab0
  GPR20: c349f80d c3215440 c3215480 61c8864680b583eb
  GPR24:  7fff 08002000 0010
  GPR28: 0002 8002 cc5dc800 cc5dc880
  NIP [c005f6d4] __iommu_free+0x1e4/0x1f0
  LR [c005f6d0] __iommu_free+0x1e0/0x1f0
  Call Trace:
  [c3a27b30] [c005f6d0] __iommu_free+0x1e0/0x1f0 (unreliable)
  [c3a27bc0] [c005f848] iommu_free+0x28/0x70
  [c3a27bf0] [c0061518] iommu_free_coherent+0x68/0xa0
  [c3a27c20] [c005e8d4] dma_iommu_free_coherent+0x24/0x40
  [c3a27c40] [c024698c] dma_free_attrs+0x10c/0x140
  [c3a27c90] [c00800dcb8d4] mlx5_cmd_cleanup+0x5c/0x90 [mlx5_core]
  [c3a27cc0] [c00800dc45a0] mlx5_mdev_uninit+0xc8/0x100 [mlx5_core]
  [c3a27d00] [c00800dc4ac4] probe_one+0x3ec/0x530 [mlx5_core]
  [c3a27d90] [c08c5edc] local_pci_probe+0x6c/0x110
  [c3a27e10] [c0189c98] work_for_cpu_fn+0x38/0x60
  [c3a27e40] [c018d1d0] process_scheduled_works+0x230/0x4f0
  [c3a27f10] [c018ff14] worker_thread+0x1e4/0x500
  [c3a27f90] [c019cdb8] kthread+0x138/0x140
  [c3a27fe0] [c000df98] start_kernel_thread+0x14/0x18
  Code: 481b004d 6000 e89e0028 3c62ffe0 3863dd20 481b0039 6000 e89e0038 
3c62ffe0 3863dd38 481b0025 6000 <0fe0> 4b20 6000 3c4c0142
  ---[ end trace  ]---
  iommu_free: invalid entry
  entry = 0x800203d0
  dma_addr  = 0x800203d
  Table = 0xcc5dc800
  bus#  = 0x1
  size  = 0x2
  startOff  = 0x8000
  index = 0x70200016

Fixes: 2ad56efa80db ("powerpc/iommu: Setup a default domain and remove 
set_platform_dma_ops")
Reported-by: Tasmiya Nalatwad 
Link: 
https://lore.kernel.org/r/d06cee81-c47f-9d62-dfc6-4c77b6005...@linux.vnet.ibm.com
Tested-by: Tasmiya Nalatwad 
Signed-off-by: Jason Gunthorpe 
---
 arch/powerpc/kernel/iommu.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

Joerg, this would need to go to your tree, thanks

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d6ad3fde85a212..bf19932147517b 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1280,13 +1280,19 @@ struct iommu_table_group_ops spapr_tce_table_group_ops 
= {
 /*
  * A simple iommu_ops to allow less cruft in generic VFIO code.
  */
-static int spapr_tce_platform_iommu_attach_dev(struct iommu_domain *dom,
-  struct device *dev)
+static int
+spapr_tce_platform_iommu_attach_dev(struct iommu_domain *platform_domain,
+   struct device *dev)
 {
+   struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
struct iommu_group *grp = iommu_group_get(dev);
struct iommu_table_group *table_group;
int ret = -EINVAL;
 
+   /* At first attach the ownership is already set */
+   if (!domain)
+   return 0;
+
if (!grp)
return -ENODEV;
 

base-commit: 20700abdf2393c8c0bd3a100e3111d0fcd3d5bf3
-- 
2.42.0



Re: [Bisected] [commit 2ad56efa80dba89162106c06ebc00b611325e584] [linux-next] WARNING while booting to kernel 6.6.0-rc3-next-20230929

2023-10-04 Thread Jason Gunthorpe
On Wed, Oct 04, 2023 at 04:37:10PM +0530, Tasmiya Nalatwad wrote:
>Greetings,
> 
>[linux-next] [6.6.0-rc3-next-20230929] [git bisect ->
>2ad56efa80dba89162106c06ebc00b611325e584]WARNING: CPU: 0 PID: 8 at
>arch/powerpc/kernel/iommu.c:407 __iommu_free+0x1e4/0x1f0
 
>git bisect is pointing to the below commit
>commit 2ad56efa80dba89162106c06ebc00b611325e584
>powerpc/iommu: Setup a default domain and remove set_platform_dma_ops
 
I assume this means there are still sequencing problems with POWER at
boot time, e.g. we turned on the DMA ops in the wrong order or something
like that.

As far as I can see the only difference here is that we do the
operation to claim the DMA ops during the iommu driver probe. We can avoid that.

Does this work for you?

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d6ad3fde85a212..115b9031badac7 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1280,13 +1280,19 @@ struct iommu_table_group_ops spapr_tce_table_group_ops 
= {
 /*
  * A simple iommu_ops to allow less cruft in generic VFIO code.
  */
-static int spapr_tce_platform_iommu_attach_dev(struct iommu_domain *dom,
-  struct device *dev)
+static int
+spapr_tce_platform_iommu_attach_dev(struct iommu_domain *platform_domain,
+   struct device *dev)
 {
+   struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
struct iommu_group *grp = iommu_group_get(dev);
struct iommu_table_group *table_group;
int ret = -EINVAL;
 
+   /* At first attach the ownership is already set */
+   if (!domain)
+   return 0;
+
if (!grp)
return -ENODEV;
 



Re: [PATCH v8 20/24] iommu: Require a default_domain for all iommu drivers

2023-10-02 Thread Jason Gunthorpe
On Tue, Oct 03, 2023 at 12:21:59AM +0300, Dmitry Baryshkov wrote:
> On Wed, 13 Sept 2023 at 16:45, Jason Gunthorpe  wrote:
> >
> > At this point every iommu driver will cause a default_domain to be
> > selected, so we can finally remove this gap from the core code.
> >
> > The following table explains what each driver supports and what the
> > resulting default_domain will be:
> >
> > ops->defaut_domain
> > IDENTITY   DMA  PLATFORMv  ARM32  
> > dma-iommu  ARCH
> > amd/iommu.c Y   Y   N/A 
> > either
> > apple-dart.cY   Y   N/A 
> > either
> > arm-smmu.c  Y   Y   IDENTITY
> > either
> > qcom_iommu.cG   Y   IDENTITY
> > either
> > arm-smmu-v3.c   Y   Y   N/A 
> > either
> > exynos-iommu.c  G   Y   IDENTITY
> > either
> > fsl_pamu_domain.c   Y   Y   N/A N/A 
> > PLATFORM
> > intel/iommu.c   Y   Y   N/A 
> > either
> > ipmmu-vmsa.cG   Y   IDENTITY
> > either
> > msm_iommu.c G   IDENTITYN/A
> 
> Unfortunately this patch breaks msm_iommu platforms. This driver
> doesn't select ARM_DMA_USE_IOMMU, so iommu_get_default_domain_type()
> returns 0, bus_iommu_probe() fails with -ENODEV.
> If I make MSM_IOMMU select ARM_DMA_USE_IOMMU, then GPU probing fails
> with -EBUSY.

Oh, OK.

Does this fix it?

diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index cdc7b730192a35..f7ef081c33dcb2 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -685,10 +685,16 @@ irqreturn_t msm_iommu_fault_handler(int irq, void *dev_id)
return 0;
 }
 
+static int msm_iommu_def_domain_type(struct device *dev)
+{
+   return IOMMU_DOMAIN_IDENTITY;
+}
+
 static struct iommu_ops msm_iommu_ops = {
+   .identity_domain = &msm_iommu_identity_domain,
.domain_alloc_paging = msm_iommu_domain_alloc_paging,
.probe_device = msm_iommu_probe_device,
+   .def_domain_type = msm_iommu_def_domain_type,
.device_group = generic_device_group,
.pgsize_bitmap = MSM_IOMMU_PGSIZES,
.of_xlate = qcom_iommu_of_xlate,


[PATCH v2 2/9] iommu/vt-d: Update the definition of the blocking domain

2023-09-27 Thread Jason Gunthorpe
The global static should pre-define the type and the NOP free function can
be now left as NULL.

Reviewed-by: Lu Baolu 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/intel/iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 3685ba90ec88e8..ba9db95d2f1c5e 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4013,9 +4013,9 @@ static int blocking_domain_attach_dev(struct iommu_domain 
*domain,
 }
 
 static struct iommu_domain blocking_domain = {
+   .type = IOMMU_DOMAIN_BLOCKED,
.ops = &(const struct iommu_domain_ops) {
.attach_dev = blocking_domain_attach_dev,
-   .free   = intel_iommu_domain_free
}
 };
 
@@ -4060,7 +4060,7 @@ static struct iommu_domain 
*intel_iommu_domain_alloc(unsigned type)
 
 static void intel_iommu_domain_free(struct iommu_domain *domain)
 {
-   if (domain != &si_domain->domain && domain != &blocking_domain)
+   if (domain != &si_domain->domain)
domain_exit(to_dmar_domain(domain));
 }
 
-- 
2.42.0



[PATCH v2 0/9] iommu: Convert dart & iommufd to the new domain_alloc_paging()

2023-09-27 Thread Jason Gunthorpe
Continue converting drivers to the new interface. Introduce
ops->blocked_domain to hold the global static BLOCKED domain and convert
all drivers supporting BLOCKED to use it.

This makes it trivial for dart and iommufd to convert over to
domain_alloc_paging().

There are six drivers remaining:

drivers/iommu/amd/iommu.c:  .domain_alloc = amd_iommu_domain_alloc,
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:.domain_alloc   = 
arm_smmu_domain_alloc,
drivers/iommu/arm/arm-smmu/arm-smmu.c:  .domain_alloc   = 
arm_smmu_domain_alloc,
drivers/iommu/fsl_pamu_domain.c:.domain_alloc   = fsl_pamu_domain_alloc,
drivers/iommu/intel/iommu.c:.domain_alloc   = 
intel_iommu_domain_alloc,
drivers/iommu/virtio-iommu.c:   .domain_alloc   = viommu_domain_alloc,

v2:
 - Rebase to Joerg's for-next
 - New patch to remove force_bypass, as discussed with Janne
 - Move some hunks between patches to accommodate Robin's change to the
   attach_dev switch
v1: https://lore.kernel.org/r/0-v1-8060f06462cc+c0a39-dart_paging_...@nvidia.com

Jason Gunthorpe (9):
  iommu: Move IOMMU_DOMAIN_BLOCKED global statics to ops->blocked_domain
  iommu/vt-d: Update the definition of the blocking domain
  iommu/vt-d: Use ops->blocked_domain
  iommufd: Convert to alloc_domain_paging()
  iommu/dart: Use static global identity domains
  iommu/dart: Move the blocked domain support to a global static
  iommu/dart: Convert to domain_alloc_paging()
  iommu/dart: Call apple_dart_finalize_domain() as part of
alloc_paging()
  iommu/dart: Remove the force_bypass variable

 arch/powerpc/kernel/iommu.c  |   9 +-
 drivers/iommu/apple-dart.c   | 137 ++-
 drivers/iommu/intel/iommu.c  |   7 +-
 drivers/iommu/iommu.c|   2 +
 drivers/iommu/iommufd/selftest.c |  11 +--
 include/linux/iommu.h|   3 +
 6 files changed, 94 insertions(+), 75 deletions(-)


base-commit: 83653d8508051af13f74905fc3f6c2fa2e59dbee
-- 
2.42.0



[PATCH v2 9/9] iommu/dart: Remove the force_bypass variable

2023-09-27 Thread Jason Gunthorpe
This flag just caches if the IO page size is larger than the CPU
PAGE_SIZE. This only needs to be checked in two places so remove the
confusingly named cache.

dart would like to not support paging domains at all if the IO page size
is larger than the CPU page size. In this case we should ideally fail
domain_alloc_paging(), as there is no point in creating a domain that can
never be attached. Move the test into apple_dart_finalize_domain().

The check in apple_dart_mod_streams() will prevent the domain from being
attached to the wrong dart.

There is no HW limitation that prevents BLOCKED domains from working, so
remove that test.

The check in apple_dart_of_xlate() is redundant since the pgsize is
checked immediately after. Remove it.

Remove the variable.

Suggested-by: Janne Grunau 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/apple-dart.c | 20 ++--
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/apple-dart.c b/drivers/iommu/apple-dart.c
index 126da0d89f0dd4..821b4a3465dfb9 100644
--- a/drivers/iommu/apple-dart.c
+++ b/drivers/iommu/apple-dart.c
@@ -196,7 +196,6 @@ struct apple_dart_hw {
  * @lock: lock for hardware operations involving this dart
  * @pgsize: pagesize supported by this DART
  * @supports_bypass: indicates if this DART supports bypass mode
- * @force_bypass: force bypass mode due to pagesize mismatch?
  * @sid2group: maps stream ids to iommu_groups
  * @iommu: iommu core device
  */
@@ -217,7 +216,6 @@ struct apple_dart {
u32 pgsize;
u32 num_streams;
u32 supports_bypass : 1;
-   u32 force_bypass : 1;
 
struct iommu_group *sid2group[DART_MAX_STREAMS];
struct iommu_device iommu;
@@ -576,6 +574,9 @@ static int apple_dart_finalize_domain(struct 
apple_dart_domain *dart_domain,
int ret = 0;
int i, j;
 
+   if (dart->pgsize > PAGE_SIZE)
+   return -EINVAL;
+
mutex_lock(&dart_domain->init_lock);
 
if (dart_domain->finalized)
@@ -659,9 +660,6 @@ static int apple_dart_attach_dev_paging(struct iommu_domain 
*domain,
struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
struct apple_dart_domain *dart_domain = to_dart_domain(domain);
 
-   if (cfg->stream_maps[0].dart->force_bypass)
-   return -EINVAL;
-
ret = apple_dart_finalize_domain(dart_domain, cfg);
if (ret)
return ret;
@@ -706,9 +704,6 @@ static int apple_dart_attach_dev_blocked(struct 
iommu_domain *domain,
struct apple_dart_stream_map *stream_map;
int i;
 
-   if (cfg->stream_maps[0].dart->force_bypass)
-   return -EINVAL;
-
for_each_stream_map(i, cfg, stream_map)
apple_dart_hw_disable_dma(stream_map);
return 0;
@@ -803,8 +798,6 @@ static int apple_dart_of_xlate(struct device *dev, struct 
of_phandle_args *args)
if (cfg_dart) {
if (cfg_dart->supports_bypass != dart->supports_bypass)
return -EINVAL;
-   if (cfg_dart->force_bypass != dart->force_bypass)
-   return -EINVAL;
if (cfg_dart->pgsize != dart->pgsize)
return -EINVAL;
}
@@ -946,7 +939,7 @@ static int apple_dart_def_domain_type(struct device *dev)
 {
struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
 
-   if (cfg->stream_maps[0].dart->force_bypass)
+   if (cfg->stream_maps[0].dart->pgsize > PAGE_SIZE)
return IOMMU_DOMAIN_IDENTITY;
if (!cfg->stream_maps[0].dart->supports_bypass)
return IOMMU_DOMAIN_DMA;
@@ -1146,8 +1139,6 @@ static int apple_dart_probe(struct platform_device *pdev)
goto err_clk_disable;
}
 
-   dart->force_bypass = dart->pgsize > PAGE_SIZE;
-
ret = apple_dart_hw_reset(dart);
if (ret)
goto err_clk_disable;
@@ -1171,7 +1162,8 @@ static int apple_dart_probe(struct platform_device *pdev)
dev_info(
>dev,
"DART [pagesize %x, %d streams, bypass support: %d, bypass 
forced: %d] initialized\n",
-   dart->pgsize, dart->num_streams, dart->supports_bypass, 
dart->force_bypass);
+   dart->pgsize, dart->num_streams, dart->supports_bypass,
+   dart->pgsize > PAGE_SIZE);
return 0;
 
 err_sysfs_remove:
-- 
2.42.0



[PATCH v2 4/9] iommufd: Convert to alloc_domain_paging()

2023-09-27 Thread Jason Gunthorpe
Move the global static blocked domain to the ops and convert the unmanaged
domain to domain_alloc_paging.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommufd/selftest.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index fb981ba97c4e87..ee607984709102 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -141,16 +141,10 @@ static void *mock_domain_hw_info(struct device *dev, u32 
*length, u32 *type)
return info;
 }
 
-static struct iommu_domain *mock_domain_alloc(unsigned int iommu_domain_type)
+static struct iommu_domain *mock_domain_alloc_paging(struct device *dev)
 {
struct mock_iommu_domain *mock;
 
-   if (iommu_domain_type == IOMMU_DOMAIN_BLOCKED)
-   return &mock_blocking_domain;
-
-   if (iommu_domain_type != IOMMU_DOMAIN_UNMANAGED)
-   return NULL;
-
mock = kzalloc(sizeof(*mock), GFP_KERNEL);
if (!mock)
return NULL;
@@ -295,10 +289,11 @@ static const struct iommu_ops mock_ops = {
 * because it is zero.
 */
.default_domain = &mock_blocking_domain,
+   .blocked_domain = &mock_blocking_domain,
.owner = THIS_MODULE,
.pgsize_bitmap = MOCK_IO_PAGE_SIZE,
.hw_info = mock_domain_hw_info,
-   .domain_alloc = mock_domain_alloc,
+   .domain_alloc_paging = mock_domain_alloc_paging,
.capable = mock_domain_capable,
.device_group = generic_device_group,
.probe_device = mock_probe_device,
-- 
2.42.0



[PATCH v2 8/9] iommu/dart: Call apple_dart_finalize_domain() as part of alloc_paging()

2023-09-27 Thread Jason Gunthorpe
In many cases the dev argument will now be !NULL so we should use it to
finalize the domain at allocation.

Make apple_dart_finalize_domain() accept the correct type.

Reviewed-by: Janne Grunau 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/apple-dart.c | 28 +++-
 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/apple-dart.c b/drivers/iommu/apple-dart.c
index 2566cf3111c7c1..126da0d89f0dd4 100644
--- a/drivers/iommu/apple-dart.c
+++ b/drivers/iommu/apple-dart.c
@@ -568,10 +568,9 @@ apple_dart_setup_translation(struct apple_dart_domain 
*domain,
stream_map->dart->hw->invalidate_tlb(stream_map);
 }
 
-static int apple_dart_finalize_domain(struct iommu_domain *domain,
+static int apple_dart_finalize_domain(struct apple_dart_domain *dart_domain,
  struct apple_dart_master_cfg *cfg)
 {
-   struct apple_dart_domain *dart_domain = to_dart_domain(domain);
struct apple_dart *dart = cfg->stream_maps[0].dart;
struct io_pgtable_cfg pgtbl_cfg;
int ret = 0;
@@ -597,17 +596,18 @@ static int apple_dart_finalize_domain(struct iommu_domain 
*domain,
.iommu_dev = dart->dev,
};
 
-   dart_domain->pgtbl_ops =
-   alloc_io_pgtable_ops(dart->hw->fmt, &pgtbl_cfg, domain);
+   dart_domain->pgtbl_ops = alloc_io_pgtable_ops(dart->hw->fmt, &pgtbl_cfg,
+ &dart_domain->domain);
if (!dart_domain->pgtbl_ops) {
ret = -ENOMEM;
goto done;
}
 
-   domain->pgsize_bitmap = pgtbl_cfg.pgsize_bitmap;
-   domain->geometry.aperture_start = 0;
-   domain->geometry.aperture_end = (dma_addr_t)DMA_BIT_MASK(dart->ias);
-   domain->geometry.force_aperture = true;
+   dart_domain->domain.pgsize_bitmap = pgtbl_cfg.pgsize_bitmap;
+   dart_domain->domain.geometry.aperture_start = 0;
+   dart_domain->domain.geometry.aperture_end =
+   (dma_addr_t)DMA_BIT_MASK(dart->ias);
+   dart_domain->domain.geometry.force_aperture = true;
 
dart_domain->finalized = true;
 
@@ -662,7 +662,7 @@ static int apple_dart_attach_dev_paging(struct iommu_domain 
*domain,
if (cfg->stream_maps[0].dart->force_bypass)
return -EINVAL;
 
-   ret = apple_dart_finalize_domain(domain, cfg);
+   ret = apple_dart_finalize_domain(dart_domain, cfg);
if (ret)
return ret;
 
@@ -758,6 +758,16 @@ static struct iommu_domain 
*apple_dart_domain_alloc_paging(struct device *dev)
 
mutex_init(&dart_domain->init_lock);
 
+   if (dev) {
+   struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
+   int ret;
+
+   ret = apple_dart_finalize_domain(dart_domain, cfg);
+   if (ret) {
+   kfree(dart_domain);
+   return ERR_PTR(ret);
+   }
+   }
return &dart_domain->domain;
 }
 
-- 
2.42.0



[PATCH v2 5/9] iommu/dart: Use static global identity domains

2023-09-27 Thread Jason Gunthorpe
Move to the new static global for identity domains. Move the identity
specific code to apple_dart_attach_dev_identity().

Reviewed-by: Janne Grunau 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/apple-dart.c | 39 +++---
 1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/apple-dart.c b/drivers/iommu/apple-dart.c
index 0b892750842746..2c121c525749cf 100644
--- a/drivers/iommu/apple-dart.c
+++ b/drivers/iommu/apple-dart.c
@@ -659,11 +659,7 @@ static int apple_dart_attach_dev(struct iommu_domain 
*domain,
struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
struct apple_dart_domain *dart_domain = to_dart_domain(domain);
 
-   if (cfg->stream_maps[0].dart->force_bypass &&
-   domain->type != IOMMU_DOMAIN_IDENTITY)
-   return -EINVAL;
-   if (!cfg->stream_maps[0].dart->supports_bypass &&
-   domain->type == IOMMU_DOMAIN_IDENTITY)
+   if (cfg->stream_maps[0].dart->force_bypass)
return -EINVAL;
 
ret = apple_dart_finalize_domain(domain, cfg);
@@ -683,15 +679,35 @@ static int apple_dart_attach_dev(struct iommu_domain 
*domain,
for_each_stream_map(i, cfg, stream_map)
apple_dart_hw_disable_dma(stream_map);
break;
-   case IOMMU_DOMAIN_IDENTITY:
-   for_each_stream_map(i, cfg, stream_map)
-   apple_dart_hw_enable_bypass(stream_map);
-   break;
}
 
return ret;
 }
 
+static int apple_dart_attach_dev_identity(struct iommu_domain *domain,
+ struct device *dev)
+{
+   struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
+   struct apple_dart_stream_map *stream_map;
+   int i;
+
+   if (!cfg->stream_maps[0].dart->supports_bypass)
+   return -EINVAL;
+
+   for_each_stream_map(i, cfg, stream_map)
+   apple_dart_hw_enable_bypass(stream_map);
+   return 0;
+}
+
+static const struct iommu_domain_ops apple_dart_identity_ops = {
+   .attach_dev = apple_dart_attach_dev_identity,
+};
+
+static struct iommu_domain apple_dart_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &apple_dart_identity_ops,
+};
+
 static struct iommu_device *apple_dart_probe_device(struct device *dev)
 {
struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
@@ -722,7 +738,7 @@ static struct iommu_domain 
*apple_dart_domain_alloc(unsigned int type)
struct apple_dart_domain *dart_domain;
 
if (type != IOMMU_DOMAIN_DMA && type != IOMMU_DOMAIN_UNMANAGED &&
-   type != IOMMU_DOMAIN_IDENTITY && type != IOMMU_DOMAIN_BLOCKED)
+   type != IOMMU_DOMAIN_BLOCKED)
return NULL;
 
dart_domain = kzalloc(sizeof(*dart_domain), GFP_KERNEL);
@@ -732,7 +748,7 @@ static struct iommu_domain 
*apple_dart_domain_alloc(unsigned int type)
mutex_init(&dart_domain->init_lock);
 
/* no need to allocate pgtbl_ops or do any other finalization steps */
-   if (type == IOMMU_DOMAIN_IDENTITY || type == IOMMU_DOMAIN_BLOCKED)
+   if (type == IOMMU_DOMAIN_BLOCKED)
dart_domain->finalized = true;
 
return &dart_domain->domain;
@@ -947,6 +963,7 @@ static void apple_dart_get_resv_regions(struct device *dev,
 }
 
 static const struct iommu_ops apple_dart_iommu_ops = {
+   .identity_domain = &apple_dart_identity_domain,
.domain_alloc = apple_dart_domain_alloc,
.probe_device = apple_dart_probe_device,
.release_device = apple_dart_release_device,
-- 
2.42.0



[PATCH v2 1/9] iommu: Move IOMMU_DOMAIN_BLOCKED global statics to ops->blocked_domain

2023-09-27 Thread Jason Gunthorpe
Following the pattern of identity domains, just assign the BLOCKED domain
global statics to a value in ops. Update the core code to use the global
static directly.

Update powerpc to use the new scheme and remove its empty domain_alloc
callback.

Reviewed-by: Lu Baolu 
Signed-off-by: Jason Gunthorpe 
---
 arch/powerpc/kernel/iommu.c | 9 +
 drivers/iommu/iommu.c   | 2 ++
 include/linux/iommu.h   | 3 +++
 3 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d6ad3fde85a212..3c1d10be19c4c7 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1327,13 +1327,6 @@ static bool spapr_tce_iommu_capable(struct device *dev, 
enum iommu_cap cap)
return false;
 }
 
-static struct iommu_domain *spapr_tce_iommu_domain_alloc(unsigned int type)
-{
-   if (type != IOMMU_DOMAIN_BLOCKED)
-   return NULL;
-   return &spapr_tce_blocked_domain;
-}
-
 static struct iommu_device *spapr_tce_iommu_probe_device(struct device *dev)
 {
struct pci_dev *pdev;
@@ -1368,8 +1361,8 @@ static struct iommu_group 
*spapr_tce_iommu_device_group(struct device *dev)
 
 static const struct iommu_ops spapr_tce_iommu_ops = {
.default_domain = &spapr_tce_platform_domain,
+   .blocked_domain = &spapr_tce_blocked_domain,
.capable = spapr_tce_iommu_capable,
-   .domain_alloc = spapr_tce_iommu_domain_alloc,
.probe_device = spapr_tce_iommu_probe_device,
.release_device = spapr_tce_iommu_release_device,
.device_group = spapr_tce_iommu_device_group,
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 1ecac2b5c54f4a..89db35e2c21771 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2074,6 +2074,8 @@ static struct iommu_domain *__iommu_domain_alloc(const 
struct iommu_ops *ops,
 
if (alloc_type == IOMMU_DOMAIN_IDENTITY && ops->identity_domain)
return ops->identity_domain;
+   else if (alloc_type == IOMMU_DOMAIN_BLOCKED && ops->blocked_domain)
+   return ops->blocked_domain;
else if (type & __IOMMU_DOMAIN_PAGING && ops->domain_alloc_paging)
domain = ops->domain_alloc_paging(dev);
else if (ops->domain_alloc)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 64bd20142cbed0..e1a4c2c2c34d42 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -265,6 +265,8 @@ struct iommu_iotlb_gather {
  * @owner: Driver module providing these ops
  * @identity_domain: An always available, always attachable identity
  *   translation.
+ * @blocked_domain: An always available, always attachable blocking
+ *  translation.
  * @default_domain: If not NULL this will always be set as the default domain.
  *  This should be an IDENTITY/BLOCKED/PLATFORM domain.
  *  Do not use in new drivers.
@@ -303,6 +305,7 @@ struct iommu_ops {
unsigned long pgsize_bitmap;
struct module *owner;
struct iommu_domain *identity_domain;
+   struct iommu_domain *blocked_domain;
struct iommu_domain *default_domain;
 };
 
-- 
2.42.0



[PATCH v2 6/9] iommu/dart: Move the blocked domain support to a global static

2023-09-27 Thread Jason Gunthorpe
Move to the new static global for blocked domains. Move the blocked
specific code to apple_dart_attach_dev_blocked().

Reviewed-by: Janne Grunau 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/apple-dart.c | 53 +++---
 1 file changed, 32 insertions(+), 21 deletions(-)

diff --git a/drivers/iommu/apple-dart.c b/drivers/iommu/apple-dart.c
index 2c121c525749cf..a34812e8c9ea57 100644
--- a/drivers/iommu/apple-dart.c
+++ b/drivers/iommu/apple-dart.c
@@ -666,22 +666,13 @@ static int apple_dart_attach_dev(struct iommu_domain 
*domain,
if (ret)
return ret;
 
-   switch (domain->type) {
-   default:
-   ret = apple_dart_domain_add_streams(dart_domain, cfg);
-   if (ret)
-   return ret;
+   ret = apple_dart_domain_add_streams(dart_domain, cfg);
+   if (ret)
+   return ret;
 
-   for_each_stream_map(i, cfg, stream_map)
-   apple_dart_setup_translation(dart_domain, stream_map);
-   break;
-   case IOMMU_DOMAIN_BLOCKED:
-   for_each_stream_map(i, cfg, stream_map)
-   apple_dart_hw_disable_dma(stream_map);
-   break;
-   }
-
-   return ret;
+   for_each_stream_map(i, cfg, stream_map)
+   apple_dart_setup_translation(dart_domain, stream_map);
+   return 0;
 }
 
 static int apple_dart_attach_dev_identity(struct iommu_domain *domain,
@@ -708,6 +699,30 @@ static struct iommu_domain apple_dart_identity_domain = {
.ops = &apple_dart_identity_ops,
 };
 
+static int apple_dart_attach_dev_blocked(struct iommu_domain *domain,
+struct device *dev)
+{
+   struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
+   struct apple_dart_stream_map *stream_map;
+   int i;
+
+   if (cfg->stream_maps[0].dart->force_bypass)
+   return -EINVAL;
+
+   for_each_stream_map(i, cfg, stream_map)
+   apple_dart_hw_disable_dma(stream_map);
+   return 0;
+}
+
+static const struct iommu_domain_ops apple_dart_blocked_ops = {
+   .attach_dev = apple_dart_attach_dev_blocked,
+};
+
+static struct iommu_domain apple_dart_blocked_domain = {
+   .type = IOMMU_DOMAIN_BLOCKED,
+   .ops = &apple_dart_blocked_ops,
+};
+
 static struct iommu_device *apple_dart_probe_device(struct device *dev)
 {
struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
@@ -737,8 +752,7 @@ static struct iommu_domain 
*apple_dart_domain_alloc(unsigned int type)
 {
struct apple_dart_domain *dart_domain;
 
-   if (type != IOMMU_DOMAIN_DMA && type != IOMMU_DOMAIN_UNMANAGED &&
-   type != IOMMU_DOMAIN_BLOCKED)
+   if (type != IOMMU_DOMAIN_DMA && type != IOMMU_DOMAIN_UNMANAGED)
return NULL;
 
dart_domain = kzalloc(sizeof(*dart_domain), GFP_KERNEL);
@@ -747,10 +761,6 @@ static struct iommu_domain 
*apple_dart_domain_alloc(unsigned int type)
 
mutex_init(&dart_domain->init_lock);
 
-   /* no need to allocate pgtbl_ops or do any other finalization steps */
-   if (type == IOMMU_DOMAIN_BLOCKED)
-   dart_domain->finalized = true;
-
return &dart_domain->domain;
 }
 
@@ -964,6 +974,7 @@ static void apple_dart_get_resv_regions(struct device *dev,
 
 static const struct iommu_ops apple_dart_iommu_ops = {
.identity_domain = &apple_dart_identity_domain,
+   .blocked_domain = &apple_dart_blocked_domain,
.domain_alloc = apple_dart_domain_alloc,
.probe_device = apple_dart_probe_device,
.release_device = apple_dart_release_device,
-- 
2.42.0



[PATCH v2 7/9] iommu/dart: Convert to domain_alloc_paging()

2023-09-27 Thread Jason Gunthorpe
Since the IDENTITY and BLOCKED behaviors were moved to global statics all
that remains is the paging domain. Rename to
apple_dart_attach_dev_paging() and remove the left over type check.

Reviewed-by: Janne Grunau 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/apple-dart.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/apple-dart.c b/drivers/iommu/apple-dart.c
index a34812e8c9ea57..2566cf3111c7c1 100644
--- a/drivers/iommu/apple-dart.c
+++ b/drivers/iommu/apple-dart.c
@@ -651,8 +651,8 @@ static int apple_dart_domain_add_streams(struct 
apple_dart_domain *domain,
  true);
 }
 
-static int apple_dart_attach_dev(struct iommu_domain *domain,
-struct device *dev)
+static int apple_dart_attach_dev_paging(struct iommu_domain *domain,
+   struct device *dev)
 {
int ret, i;
struct apple_dart_stream_map *stream_map;
@@ -748,13 +748,10 @@ static void apple_dart_release_device(struct device *dev)
kfree(cfg);
 }
 
-static struct iommu_domain *apple_dart_domain_alloc(unsigned int type)
+static struct iommu_domain *apple_dart_domain_alloc_paging(struct device *dev)
 {
struct apple_dart_domain *dart_domain;
 
-   if (type != IOMMU_DOMAIN_DMA && type != IOMMU_DOMAIN_UNMANAGED)
-   return NULL;
-
dart_domain = kzalloc(sizeof(*dart_domain), GFP_KERNEL);
if (!dart_domain)
return NULL;
@@ -975,7 +972,7 @@ static void apple_dart_get_resv_regions(struct device *dev,
 static const struct iommu_ops apple_dart_iommu_ops = {
.identity_domain = &apple_dart_identity_domain,
.blocked_domain = &apple_dart_blocked_domain,
-   .domain_alloc = apple_dart_domain_alloc,
+   .domain_alloc_paging = apple_dart_domain_alloc_paging,
.probe_device = apple_dart_probe_device,
.release_device = apple_dart_release_device,
.device_group = apple_dart_device_group,
@@ -985,7 +982,7 @@ static const struct iommu_ops apple_dart_iommu_ops = {
.pgsize_bitmap = -1UL, /* Restricted during dart probe */
.owner = THIS_MODULE,
.default_domain_ops = &(const struct iommu_domain_ops) {
-   .attach_dev = apple_dart_attach_dev,
+   .attach_dev = apple_dart_attach_dev_paging,
.map_pages  = apple_dart_map_pages,
.unmap_pages= apple_dart_unmap_pages,
.flush_iotlb_all = apple_dart_flush_iotlb_all,
-- 
2.42.0



[PATCH v2 3/9] iommu/vt-d: Use ops->blocked_domain

2023-09-27 Thread Jason Gunthorpe
Trivially migrate to the ops->blocked_domain for the existing global
static.

Reviewed-by: Lu Baolu 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/intel/iommu.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ba9db95d2f1c5e..71c12e15ecd7b3 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4025,8 +4025,6 @@ static struct iommu_domain 
*intel_iommu_domain_alloc(unsigned type)
struct iommu_domain *domain;
 
switch (type) {
-   case IOMMU_DOMAIN_BLOCKED:
-   return &blocking_domain;
case IOMMU_DOMAIN_DMA:
case IOMMU_DOMAIN_UNMANAGED:
dmar_domain = alloc_domain(type);
@@ -4788,6 +4786,7 @@ static void *intel_iommu_hw_info(struct device *dev, u32 
*length, u32 *type)
 }
 
 const struct iommu_ops intel_iommu_ops = {
+   .blocked_domain = &blocking_domain,
.capable= intel_iommu_capable,
.hw_info= intel_iommu_hw_info,
.domain_alloc   = intel_iommu_domain_alloc,
-- 
2.42.0



Re: [PATCH 6/8] iommu/dart: Move the blocked domain support to a global static

2023-09-27 Thread Jason Gunthorpe
On Tue, Sep 26, 2023 at 09:05:08PM +0200, Janne Grunau wrote:
> > +static int apple_dart_attach_dev_blocked(struct iommu_domain *domain,
> > +struct device *dev)
> > +{
> > +   struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
> > +   struct apple_dart_stream_map *stream_map;
> > +   int i;
> > +
> > +   if (cfg->stream_maps[0].dart->force_bypass)
> > +   return -EINVAL;
> 
> unrelated to this change as this keeps the current behavior but I think
> force_bypass should not override IOMMU_DOMAIN_BLOCKED.

That would be great, dart is the only driver that can fail blocked..

> It is set if the CPU page size is smaller than dart's page
> size. Obviously dart can't translate in that situation but it should
> be still possible to block it completely.

Wonderful, actually it probably shouldn't even support allocating a
paging domain if it can't support PAGE_SIZE mappings?

> How do we manage this? I can write a patch either to the current state
> or based on this series.

Let me just add a patch to this series to remove that test with your
explanation, that will avoid patch ordering troubles.

Thanks,
Jason


Re: [PATCH 3/8] iommu/vt-d: Use ops->blocked_domain

2023-09-25 Thread Jason Gunthorpe
On Mon, Sep 25, 2023 at 10:29:52AM +0800, Baolu Lu wrote:
> On 9/23/23 1:07 AM, Jason Gunthorpe wrote:
> > Trivially migrate to the ops->blocked_domain for the existing global
> > static.
> > 
> > Signed-off-by: Jason Gunthorpe
> > ---
> >   drivers/iommu/intel/iommu.c | 3 +--
> >   1 file changed, 1 insertion(+), 2 deletions(-)
> 
> Reviewed-by: Lu Baolu 
> 
> P.S. We can further do the same thing to the identity domain. I will
> clean it up after all patches are landed.

I looked at that, and it is not trivial..

Both the Intel and virtio-iommu drivers create an "identity" domain
out of a paging domain and pass that off as a true "identity"
domain. So neither can set the global static since the determination
is made at runtime.

What I was thinking about doing is consolidating that code so that the
core logic is the thing turning a paging domain into an identity
domain.
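Roughly this kind of thing, purely to illustrate the idea - none of these
helpers exist and this is not a worked-out design:

/*
 * Illustration only: the core, not each driver, would build the
 * "identity" behaviour by allocating a paging domain and mapping the
 * usable memory 1:1.  example_map_identity_regions() is hypothetical.
 */
static struct iommu_domain *
example_identity_from_paging(const struct iommu_ops *ops, struct device *dev)
{
	struct iommu_domain *domain;

	domain = ops->domain_alloc_paging(dev);
	if (IS_ERR_OR_NULL(domain))
		return domain;

	if (example_map_identity_regions(domain, dev)) {
		iommu_domain_free(domain);
		return NULL;
	}

	domain->type = IOMMU_DOMAIN_IDENTITY;
	return domain;
}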

Jason


[PATCH 5/8] iommu/dart: Use static global identity domains

2023-09-22 Thread Jason Gunthorpe
Move to the new static global for identity domains. Move the identity
specific code to apple_dart_attach_dev_identity().

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/apple-dart.c | 41 --
 1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/apple-dart.c b/drivers/iommu/apple-dart.c
index 6bc2ad2207c3da..424f779ccc34df 100644
--- a/drivers/iommu/apple-dart.c
+++ b/drivers/iommu/apple-dart.c
@@ -658,11 +658,7 @@ static int apple_dart_attach_dev(struct iommu_domain 
*domain,
struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
struct apple_dart_domain *dart_domain = to_dart_domain(domain);
 
-   if (cfg->stream_maps[0].dart->force_bypass &&
-   domain->type != IOMMU_DOMAIN_IDENTITY)
-   return -EINVAL;
-   if (!cfg->stream_maps[0].dart->supports_bypass &&
-   domain->type == IOMMU_DOMAIN_IDENTITY)
+   if (cfg->stream_maps[0].dart->force_bypass)
return -EINVAL;
 
ret = apple_dart_finalize_domain(domain, cfg);
@@ -683,15 +679,37 @@ static int apple_dart_attach_dev(struct iommu_domain 
*domain,
for_each_stream_map(i, cfg, stream_map)
apple_dart_hw_disable_dma(stream_map);
break;
-   case IOMMU_DOMAIN_IDENTITY:
-   for_each_stream_map(i, cfg, stream_map)
-   apple_dart_hw_enable_bypass(stream_map);
-   break;
+   default:
+   return -EINVAL;
}
 
return ret;
 }
 
+static int apple_dart_attach_dev_identity(struct iommu_domain *domain,
+ struct device *dev)
+{
+   struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
+   struct apple_dart_stream_map *stream_map;
+   int i;
+
+   if (!cfg->stream_maps[0].dart->supports_bypass)
+   return -EINVAL;
+
+   for_each_stream_map(i, cfg, stream_map)
+   apple_dart_hw_enable_bypass(stream_map);
+   return 0;
+}
+
+static const struct iommu_domain_ops apple_dart_identity_ops = {
+   .attach_dev = apple_dart_attach_dev_identity,
+};
+
+static struct iommu_domain apple_dart_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &apple_dart_identity_ops,
+};
+
 static struct iommu_device *apple_dart_probe_device(struct device *dev)
 {
struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
@@ -722,7 +740,7 @@ static struct iommu_domain 
*apple_dart_domain_alloc(unsigned int type)
struct apple_dart_domain *dart_domain;
 
if (type != IOMMU_DOMAIN_DMA && type != IOMMU_DOMAIN_UNMANAGED &&
-   type != IOMMU_DOMAIN_IDENTITY && type != IOMMU_DOMAIN_BLOCKED)
+   type != IOMMU_DOMAIN_BLOCKED)
return NULL;
 
dart_domain = kzalloc(sizeof(*dart_domain), GFP_KERNEL);
@@ -732,7 +750,7 @@ static struct iommu_domain 
*apple_dart_domain_alloc(unsigned int type)
mutex_init(&dart_domain->init_lock);
 
/* no need to allocate pgtbl_ops or do any other finalization steps */
-   if (type == IOMMU_DOMAIN_IDENTITY || type == IOMMU_DOMAIN_BLOCKED)
+   if (type == IOMMU_DOMAIN_BLOCKED)
dart_domain->finalized = true;
 
return &dart_domain->domain;
@@ -947,6 +965,7 @@ static void apple_dart_get_resv_regions(struct device *dev,
 }
 
 static const struct iommu_ops apple_dart_iommu_ops = {
+   .identity_domain = &apple_dart_identity_domain,
.domain_alloc = apple_dart_domain_alloc,
.probe_device = apple_dart_probe_device,
.release_device = apple_dart_release_device,
-- 
2.42.0



[PATCH 7/8] iommu/dart: Convert to domain_alloc_paging()

2023-09-22 Thread Jason Gunthorpe
Since the IDENTITY and BLOCKED behaviors were moved to global statics all
that remains is the paging domain. Rename to
apple_dart_attach_dev_paging() and remove the left over cruft.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/apple-dart.c | 33 +++--
 1 file changed, 11 insertions(+), 22 deletions(-)

diff --git a/drivers/iommu/apple-dart.c b/drivers/iommu/apple-dart.c
index 376f4c5461e8f7..62efe0aa056f60 100644
--- a/drivers/iommu/apple-dart.c
+++ b/drivers/iommu/apple-dart.c
@@ -650,8 +650,8 @@ static int apple_dart_domain_add_streams(struct 
apple_dart_domain *domain,
  true);
 }
 
-static int apple_dart_attach_dev(struct iommu_domain *domain,
-struct device *dev)
+static int apple_dart_attach_dev_paging(struct iommu_domain *domain,
+   struct device *dev)
 {
int ret, i;
struct apple_dart_stream_map *stream_map;
@@ -665,21 +665,13 @@ static int apple_dart_attach_dev(struct iommu_domain 
*domain,
if (ret)
return ret;
 
-   switch (domain->type) {
-   case IOMMU_DOMAIN_DMA:
-   case IOMMU_DOMAIN_UNMANAGED:
-   ret = apple_dart_domain_add_streams(dart_domain, cfg);
-   if (ret)
-   return ret;
+   ret = apple_dart_domain_add_streams(dart_domain, cfg);
+   if (ret)
+   return ret;
 
-   for_each_stream_map(i, cfg, stream_map)
-   apple_dart_setup_translation(dart_domain, stream_map);
-   break;
-   default:
-   return -EINVAL;
-   }
-
-   return ret;
+   for_each_stream_map(i, cfg, stream_map)
+   apple_dart_setup_translation(dart_domain, stream_map);
+   return 0;
 }
 
 static int apple_dart_attach_dev_identity(struct iommu_domain *domain,
@@ -755,13 +747,10 @@ static void apple_dart_release_device(struct device *dev)
kfree(cfg);
 }
 
-static struct iommu_domain *apple_dart_domain_alloc(unsigned int type)
+static struct iommu_domain *apple_dart_domain_alloc_paging(struct device *dev)
 {
struct apple_dart_domain *dart_domain;
 
-   if (type != IOMMU_DOMAIN_DMA && type != IOMMU_DOMAIN_UNMANAGED)
-   return NULL;
-
dart_domain = kzalloc(sizeof(*dart_domain), GFP_KERNEL);
if (!dart_domain)
return NULL;
@@ -982,7 +971,7 @@ static void apple_dart_get_resv_regions(struct device *dev,
 static const struct iommu_ops apple_dart_iommu_ops = {
.identity_domain = &apple_dart_identity_domain,
.blocked_domain = &apple_dart_blocked_domain,
-   .domain_alloc = apple_dart_domain_alloc,
+   .domain_alloc_paging = apple_dart_domain_alloc_paging,
.probe_device = apple_dart_probe_device,
.release_device = apple_dart_release_device,
.device_group = apple_dart_device_group,
@@ -992,7 +981,7 @@ static const struct iommu_ops apple_dart_iommu_ops = {
.pgsize_bitmap = -1UL, /* Restricted during dart probe */
.owner = THIS_MODULE,
.default_domain_ops = &(const struct iommu_domain_ops) {
-   .attach_dev = apple_dart_attach_dev,
+   .attach_dev = apple_dart_attach_dev_paging,
.map_pages  = apple_dart_map_pages,
.unmap_pages= apple_dart_unmap_pages,
.flush_iotlb_all = apple_dart_flush_iotlb_all,
-- 
2.42.0



[PATCH 6/8] iommu/dart: Move the blocked domain support to a global static

2023-09-22 Thread Jason Gunthorpe
Move to the new static global for blocked domains. Move the blocked
specific code to apple_dart_attach_dev_blocked().

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/apple-dart.c | 36 ++--
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/apple-dart.c b/drivers/iommu/apple-dart.c
index 424f779ccc34df..376f4c5461e8f7 100644
--- a/drivers/iommu/apple-dart.c
+++ b/drivers/iommu/apple-dart.c
@@ -675,10 +675,6 @@ static int apple_dart_attach_dev(struct iommu_domain 
*domain,
for_each_stream_map(i, cfg, stream_map)
apple_dart_setup_translation(dart_domain, stream_map);
break;
-   case IOMMU_DOMAIN_BLOCKED:
-   for_each_stream_map(i, cfg, stream_map)
-   apple_dart_hw_disable_dma(stream_map);
-   break;
default:
return -EINVAL;
}
@@ -710,6 +706,30 @@ static struct iommu_domain apple_dart_identity_domain = {
.ops = &apple_dart_identity_ops,
 };
 
+static int apple_dart_attach_dev_blocked(struct iommu_domain *domain,
+struct device *dev)
+{
+   struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
+   struct apple_dart_stream_map *stream_map;
+   int i;
+
+   if (cfg->stream_maps[0].dart->force_bypass)
+   return -EINVAL;
+
+   for_each_stream_map(i, cfg, stream_map)
+   apple_dart_hw_disable_dma(stream_map);
+   return 0;
+}
+
+static const struct iommu_domain_ops apple_dart_blocked_ops = {
+   .attach_dev = apple_dart_attach_dev_blocked,
+};
+
+static struct iommu_domain apple_dart_blocked_domain = {
+   .type = IOMMU_DOMAIN_BLOCKED,
+   .ops = &apple_dart_blocked_ops,
+};
+
 static struct iommu_device *apple_dart_probe_device(struct device *dev)
 {
struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
@@ -739,8 +759,7 @@ static struct iommu_domain 
*apple_dart_domain_alloc(unsigned int type)
 {
struct apple_dart_domain *dart_domain;
 
-   if (type != IOMMU_DOMAIN_DMA && type != IOMMU_DOMAIN_UNMANAGED &&
-   type != IOMMU_DOMAIN_BLOCKED)
+   if (type != IOMMU_DOMAIN_DMA && type != IOMMU_DOMAIN_UNMANAGED)
return NULL;
 
dart_domain = kzalloc(sizeof(*dart_domain), GFP_KERNEL);
@@ -749,10 +768,6 @@ static struct iommu_domain 
*apple_dart_domain_alloc(unsigned int type)
 
mutex_init(&dart_domain->init_lock);
 
-   /* no need to allocate pgtbl_ops or do any other finalization steps */
-   if (type == IOMMU_DOMAIN_BLOCKED)
-   dart_domain->finalized = true;
-
return &dart_domain->domain;
 }
 
@@ -966,6 +981,7 @@ static void apple_dart_get_resv_regions(struct device *dev,
 
 static const struct iommu_ops apple_dart_iommu_ops = {
.identity_domain = &apple_dart_identity_domain,
+   .blocked_domain = &apple_dart_blocked_domain,
.domain_alloc = apple_dart_domain_alloc,
.probe_device = apple_dart_probe_device,
.release_device = apple_dart_release_device,
-- 
2.42.0



[PATCH 8/8] iommu/dart: Call apple_dart_finalize_domain() as part of alloc_paging()

2023-09-22 Thread Jason Gunthorpe
In many cases the dev argument will now be !NULL so we should use it to
finalize the domain at allocation.

Make apple_dart_finalize_domain() accept the correct type.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/apple-dart.c | 26 ++
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/apple-dart.c b/drivers/iommu/apple-dart.c
index 62efe0aa056f60..2c1832e357c7c6 100644
--- a/drivers/iommu/apple-dart.c
+++ b/drivers/iommu/apple-dart.c
@@ -568,10 +568,9 @@ apple_dart_setup_translation(struct apple_dart_domain 
*domain,
stream_map->dart->hw->invalidate_tlb(stream_map);
 }
 
-static int apple_dart_finalize_domain(struct iommu_domain *domain,
+static int apple_dart_finalize_domain(struct apple_dart_domain *dart_domain,
  struct apple_dart_master_cfg *cfg)
 {
-   struct apple_dart_domain *dart_domain = to_dart_domain(domain);
struct apple_dart *dart = cfg->stream_maps[0].dart;
struct io_pgtable_cfg pgtbl_cfg;
int ret = 0;
@@ -597,16 +596,17 @@ static int apple_dart_finalize_domain(struct iommu_domain 
*domain,
.iommu_dev = dart->dev,
};
 
-   dart_domain->pgtbl_ops =
-   alloc_io_pgtable_ops(dart->hw->fmt, &pgtbl_cfg, domain);
+   dart_domain->pgtbl_ops = alloc_io_pgtable_ops(dart->hw->fmt, &pgtbl_cfg,
+ &dart_domain->domain);
if (!dart_domain->pgtbl_ops) {
ret = -ENOMEM;
goto done;
}
 
-   domain->pgsize_bitmap = pgtbl_cfg.pgsize_bitmap;
-   domain->geometry.aperture_start = 0;
-   domain->geometry.aperture_end = (dma_addr_t)DMA_BIT_MASK(dart->ias);
+   dart_domain->domain.pgsize_bitmap = pgtbl_cfg.pgsize_bitmap;
+   dart_domain->domain.geometry.aperture_start = 0;
+   dart_domain->domain.geometry.aperture_end =
+   (dma_addr_t)DMA_BIT_MASK(dart->ias);
 
dart_domain->finalized = true;
 
@@ -661,7 +661,7 @@ static int apple_dart_attach_dev_paging(struct iommu_domain 
*domain,
if (cfg->stream_maps[0].dart->force_bypass)
return -EINVAL;
 
-   ret = apple_dart_finalize_domain(domain, cfg);
+   ret = apple_dart_finalize_domain(dart_domain, cfg);
if (ret)
return ret;
 
@@ -757,6 +757,16 @@ static struct iommu_domain 
*apple_dart_domain_alloc_paging(struct device *dev)
 
mutex_init(&dart_domain->init_lock);
 
+   if (dev) {
+   struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
+   int ret;
+
+   ret = apple_dart_finalize_domain(dart_domain, cfg);
+   if (ret) {
+   kfree(dart_domain);
+   return ERR_PTR(ret);
+   }
+   }
return &dart_domain->domain;
 }
 
-- 
2.42.0



[PATCH 4/8] iommufd: Convert to alloc_domain_paging()

2023-09-22 Thread Jason Gunthorpe
Move the global static blocked domain to the ops and convert the unmanaged
domain to domain_alloc_paging.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommufd/selftest.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index fb981ba97c4e87..ee607984709102 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -141,16 +141,10 @@ static void *mock_domain_hw_info(struct device *dev, u32 
*length, u32 *type)
return info;
 }
 
-static struct iommu_domain *mock_domain_alloc(unsigned int iommu_domain_type)
+static struct iommu_domain *mock_domain_alloc_paging(struct device *dev)
 {
struct mock_iommu_domain *mock;
 
-   if (iommu_domain_type == IOMMU_DOMAIN_BLOCKED)
-   return &mock_blocking_domain;
-
-   if (iommu_domain_type != IOMMU_DOMAIN_UNMANAGED)
-   return NULL;
-
mock = kzalloc(sizeof(*mock), GFP_KERNEL);
if (!mock)
return NULL;
@@ -295,10 +289,11 @@ static const struct iommu_ops mock_ops = {
 * because it is zero.
 */
.default_domain = &mock_blocking_domain,
+   .blocked_domain = &mock_blocking_domain,
.owner = THIS_MODULE,
.pgsize_bitmap = MOCK_IO_PAGE_SIZE,
.hw_info = mock_domain_hw_info,
-   .domain_alloc = mock_domain_alloc,
+   .domain_alloc_paging = mock_domain_alloc_paging,
.capable = mock_domain_capable,
.device_group = generic_device_group,
.probe_device = mock_probe_device,
-- 
2.42.0



[PATCH 1/8] iommu: Move IOMMU_DOMAIN_BLOCKED global statics to ops->blocked_domain

2023-09-22 Thread Jason Gunthorpe
Following the pattern of identity domains, just assign the BLOCKED domain
global statics to a value in ops. Update the core code to use the global
static directly.

Update powerpc to use the new scheme and remove its empty domain_alloc
callback.

Signed-off-by: Jason Gunthorpe 
---
 arch/powerpc/kernel/iommu.c | 9 +
 drivers/iommu/iommu.c   | 2 ++
 include/linux/iommu.h   | 3 +++
 3 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d6ad3fde85a212..3c1d10be19c4c7 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1327,13 +1327,6 @@ static bool spapr_tce_iommu_capable(struct device *dev, 
enum iommu_cap cap)
return false;
 }
 
-static struct iommu_domain *spapr_tce_iommu_domain_alloc(unsigned int type)
-{
-   if (type != IOMMU_DOMAIN_BLOCKED)
-   return NULL;
-   return &spapr_tce_blocked_domain;
-}
-
 static struct iommu_device *spapr_tce_iommu_probe_device(struct device *dev)
 {
struct pci_dev *pdev;
@@ -1368,8 +1361,8 @@ static struct iommu_group 
*spapr_tce_iommu_device_group(struct device *dev)
 
 static const struct iommu_ops spapr_tce_iommu_ops = {
.default_domain = &spapr_tce_platform_domain,
+   .blocked_domain = &spapr_tce_blocked_domain,
.capable = spapr_tce_iommu_capable,
-   .domain_alloc = spapr_tce_iommu_domain_alloc,
.probe_device = spapr_tce_iommu_probe_device,
.release_device = spapr_tce_iommu_release_device,
.device_group = spapr_tce_iommu_device_group,
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index fe033043be467a..7fa53d28feca87 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2049,6 +2049,8 @@ static struct iommu_domain *__iommu_domain_alloc(const 
struct iommu_ops *ops,
 
if (alloc_type == IOMMU_DOMAIN_IDENTITY && ops->identity_domain)
return ops->identity_domain;
+   else if (alloc_type == IOMMU_DOMAIN_BLOCKED && ops->blocked_domain)
+   return ops->blocked_domain;
else if (type & __IOMMU_DOMAIN_PAGING && ops->domain_alloc_paging)
domain = ops->domain_alloc_paging(dev);
else if (ops->domain_alloc)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 4f65879df853e4..6f9e0aacc4431a 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -264,6 +264,8 @@ struct iommu_iotlb_gather {
  * @owner: Driver module providing these ops
  * @identity_domain: An always available, always attachable identity
  *   translation.
+ * @blocked_domain: An always available, always attachable blocking
+ *  translation.
  * @default_domain: If not NULL this will always be set as the default domain.
  *  This should be an IDENTITY/BLOCKED/PLATFORM domain.
  *  Do not use in new drivers.
@@ -302,6 +304,7 @@ struct iommu_ops {
unsigned long pgsize_bitmap;
struct module *owner;
struct iommu_domain *identity_domain;
+   struct iommu_domain *blocked_domain;
struct iommu_domain *default_domain;
 };
 
-- 
2.42.0



[PATCH 2/8] iommu/vt-d: Update the definition of the blocking domain

2023-09-22 Thread Jason Gunthorpe
The global static should pre-define the type and the NOP free function can
be now left as NULL.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/intel/iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index c8036caad24e7a..0d0cb2534972a2 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4029,9 +4029,9 @@ static int blocking_domain_attach_dev(struct iommu_domain 
*domain,
 }
 
 static struct iommu_domain blocking_domain = {
+   .type = IOMMU_DOMAIN_BLOCKED,
.ops = &(const struct iommu_domain_ops) {
.attach_dev = blocking_domain_attach_dev,
-   .free   = intel_iommu_domain_free
}
 };
 
@@ -4075,7 +4075,7 @@ static struct iommu_domain 
*intel_iommu_domain_alloc(unsigned type)
 
 static void intel_iommu_domain_free(struct iommu_domain *domain)
 {
-   if (domain != &si_domain->domain && domain != &blocking_domain)
+   if (domain != &si_domain->domain)
domain_exit(to_dmar_domain(domain));
 }
 
-- 
2.42.0



[PATCH 3/8] iommu/vt-d: Use ops->blocked_domain

2023-09-22 Thread Jason Gunthorpe
Trivially migrate to the ops->blocked_domain for the existing global
static.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/intel/iommu.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 0d0cb2534972a2..b79188d0c67e95 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4041,8 +4041,6 @@ static struct iommu_domain 
*intel_iommu_domain_alloc(unsigned type)
struct iommu_domain *domain;
 
switch (type) {
-   case IOMMU_DOMAIN_BLOCKED:
-   return &blocking_domain;
case IOMMU_DOMAIN_DMA:
case IOMMU_DOMAIN_UNMANAGED:
dmar_domain = alloc_domain(type);
@@ -4803,6 +4801,7 @@ static void *intel_iommu_hw_info(struct device *dev, u32 
*length, u32 *type)
 }
 
 const struct iommu_ops intel_iommu_ops = {
+   .blocked_domain = &blocking_domain,
.capable= intel_iommu_capable,
.hw_info= intel_iommu_hw_info,
.domain_alloc   = intel_iommu_domain_alloc,
-- 
2.42.0



[PATCH 0/8] iommu: Convert dart & iommufd to the new domain_alloc_paging()

2023-09-22 Thread Jason Gunthorpe
Continue converting drivers to the new interface. Introduce
ops->blocked_domain to hold the global static BLOCKED domain and convert
all drivers supporting BLOCKED to use it.

This makes it trivial for dart and iommufd to convert over to
domain_alloc_paging().

There are six drivers remaining:

drivers/iommu/amd/iommu.c:  .domain_alloc = amd_iommu_domain_alloc,
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:.domain_alloc   = 
arm_smmu_domain_alloc,
drivers/iommu/arm/arm-smmu/arm-smmu.c:  .domain_alloc   = 
arm_smmu_domain_alloc,
drivers/iommu/fsl_pamu_domain.c:.domain_alloc   = fsl_pamu_domain_alloc,
drivers/iommu/intel/iommu.c:.domain_alloc   = 
intel_iommu_domain_alloc,
drivers/iommu/virtio-iommu.c:   .domain_alloc   = viommu_domain_alloc,

This follows the "Make default_domain's mandatory" series

Jason Gunthorpe (8):
  iommu: Move IOMMU_DOMAIN_BLOCKED global statics to ops->blocked_domain
  iommu/vt-d: Update the definition of the blocking domain
  iommu/vt-d: Use ops->blocked_domain
  iommufd: Convert to alloc_domain_paging()
  iommu/dart: Use static global identity domains
  iommu/dart: Move the blocked domain support to a global static
  iommu/dart: Convert to domain_alloc_paging()
  iommu/dart: Call apple_dart_finalize_domain() as part of
alloc_paging()

 arch/powerpc/kernel/iommu.c  |   9 +--
 drivers/iommu/apple-dart.c   | 124 ---
 drivers/iommu/intel/iommu.c  |   7 +-
 drivers/iommu/iommu.c|   2 +
 drivers/iommu/iommufd/selftest.c |  11 +--
 include/linux/iommu.h|   3 +
 6 files changed, 91 insertions(+), 65 deletions(-)


base-commit: 494e0fef6db2e604bca0fc552d92b760b6e25d10
-- 
2.42.0



[PATCH rc] kvm: Prevent compiling virt/kvm/vfio.c unless VFIO is selected

2023-09-20 Thread Jason Gunthorpe
There are a bunch of reported randconfig failures now because of this,
something like:

>> arch/powerpc/kvm/../../../virt/kvm/vfio.c:89:7: warning: attribute 
>> declaration must precede definition [-Wignored-attributes]
   fn = symbol_get(vfio_file_iommu_group);
^
   include/linux/module.h:805:60: note: expanded from macro 'symbol_get'
   #define symbol_get(x) ({ extern typeof(x) x 
__attribute__((weak,visibility("hidden"))); &(x); })

It happens because the arch forces KVM_VFIO without knowing if VFIO is
even enabled.

Split the kconfig so the arch selects the usual HAVE_KVM_ARCH_VFIO and
then KVM_VFIO is only enabled if the arch wants it and VFIO is turned on.

Reported-by: kernel test robot 
Closes: https://lore.kernel.org/oe-kbuild-all/202308251949.5iiav0sz-...@intel.com/
Closes: https://lore.kernel.org/oe-kbuild-all/202309030741.82alacdg-...@intel.com/
Closes: https://lore.kernel.org/oe-kbuild-all/202309110914.qlh0lu6l-...@intel.com/
Cc: Nick Desaulniers 
Fixes: c1cce6d079b8 ("vfio: Compile vfio_group infrastructure optionally")
Signed-off-by: Jason Gunthorpe 
---
 arch/arm64/kvm/Kconfig   | 2 +-
 arch/powerpc/kvm/Kconfig | 2 +-
 arch/s390/kvm/Kconfig| 2 +-
 arch/x86/kvm/Kconfig | 2 +-
 virt/kvm/Kconfig | 7 ++-
 5 files changed, 10 insertions(+), 5 deletions(-)

Sean's large series will also address this:

https://lore.kernel.org/kvm/20230916003118.2540661-7-sea...@google.com/

I don't know if it is severe enough to fix in the rc cycle, but here is the
patch.

diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 83c1e09be42e5b..7c43eaea51ce05 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -28,7 +28,7 @@ menuconfig KVM
select KVM_MMIO
select KVM_GENERIC_DIRTYLOG_READ_PROTECT
select KVM_XFER_TO_GUEST_WORK
-   select KVM_VFIO
+   select HAVE_KVM_ARCH_VFIO
select HAVE_KVM_EVENTFD
select HAVE_KVM_IRQFD
select HAVE_KVM_DIRTY_RING_ACQ_REL
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 902611954200df..b64824e4cbc1eb 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -22,7 +22,7 @@ config KVM
select PREEMPT_NOTIFIERS
select HAVE_KVM_EVENTFD
select HAVE_KVM_VCPU_ASYNC_IOCTL
-   select KVM_VFIO
+   select HAVE_KVM_ARCH_VFIO
select IRQ_BYPASS_MANAGER
select HAVE_KVM_IRQ_BYPASS
select INTERVAL_TREE
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index 45fdf2a9b2e326..d206ad3a777d5d 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -31,7 +31,7 @@ config KVM
select HAVE_KVM_IRQ_ROUTING
select HAVE_KVM_INVALID_WAKEUPS
select HAVE_KVM_NO_POLL
-   select KVM_VFIO
+   select HAVE_KVM_ARCH_VFIO
select INTERVAL_TREE
select MMU_NOTIFIER
help
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ed90f148140dfe..8e70e693f90e30 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -45,7 +45,7 @@ config KVM
select HAVE_KVM_NO_POLL
select KVM_XFER_TO_GUEST_WORK
select KVM_GENERIC_DIRTYLOG_READ_PROTECT
-   select KVM_VFIO
+   select HAVE_KVM_ARCH_VFIO
select INTERVAL_TREE
select HAVE_KVM_PM_NOTIFIER if PM
select KVM_GENERIC_HARDWARE_ENABLING
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 484d0873061ca5..0bf34809e1bbfe 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -59,9 +59,14 @@ config HAVE_KVM_MSI
 config HAVE_KVM_CPU_RELAX_INTERCEPT
bool
 
-config KVM_VFIO
+config HAVE_KVM_ARCH_VFIO
bool
 
+config KVM_VFIO
+   def_bool y
+   depends on HAVE_KVM_ARCH_VFIO
+   depends on VFIO
+
 config HAVE_KVM_INVALID_WAKEUPS
bool
 

base-commit: 0bb80ecc33a8fb5a682236443c1e740d5c917d1d
-- 
2.42.0



Re: [PATCH 06/26] KVM: Drop CONFIG_KVM_VFIO and just look at KVM+VFIO

2023-09-18 Thread Jason Gunthorpe
On Mon, Sep 18, 2023 at 08:52:40AM -0700, Sean Christopherson wrote:

> > I wonder if we should be making the VFIO drivers that need the kvm to
> > ask for it? 'select CONFIG_NEED_VFIO_KVM' or something?
> 
> I wondered the same thing, if only to make it easier to track which
> drivers actually end up interacting directly with KVM.

There are two usages I've seen..

GVT's usage is just totally broken:

https://lore.kernel.org/kvm/661447fd-b041-c08d-cedc-341b31c40...@intel.com/

It is trying to use KVM to write protect IOVA DMA memory, and it just
doesn't work. If we want to do something like this the core vfio code
should provide this service and it should be wired into KVM
properly.

power and s390 have actual architectural "virtual machines" and they
need actual arch operations to install VFIO devices into those
things. In this regard having the arch opt into the integration would
make some sense. I expect this will get worse in our CC future where
VFIO devices will need to be passed into arch specific CC code
somehow.

This arch stuff isn't cleanly done, the code is sprinkled all over the
place. Some in mdevs, some in PCI arch code, some in #ifdefs..

Maybe the CC people will clean it up instead of making the mess bigger :)

Jason


Re: [PATCH 05/26] vfio: KVM: Pass get/put helpers from KVM to VFIO, don't do circular lookup

2023-09-18 Thread Jason Gunthorpe
On Mon, Sep 18, 2023 at 08:49:57AM -0700, Sean Christopherson wrote:
> On Mon, Sep 18, 2023, Jason Gunthorpe wrote:
> > On Fri, Sep 15, 2023 at 05:30:57PM -0700, Sean Christopherson wrote:
> > > Explicitly pass KVM's get/put helpers to VFIO when attaching a VM to
> > > VFIO instead of having VFIO do a symbol lookup back into KVM.  Having both
> > > KVM and VFIO do symbol lookups increases the overall complexity and places
> > > an unnecessary dependency on KVM (from VFIO) without adding any value.
> > > 
> > > Signed-off-by: Sean Christopherson 
> > > ---
> > >  drivers/vfio/vfio.h  |  2 ++
> > >  drivers/vfio/vfio_main.c | 74 +++-
> > >  include/linux/vfio.h |  4 ++-
> > >  virt/kvm/vfio.c  |  9 +++--
> > >  4 files changed, 47 insertions(+), 42 deletions(-)
> > 
> > I don't mind this, but Christoph had disliked my prior attempt to do
> > this with function pointers..
> > 
> > The get can be inlined, IIRC, what about putting a pointer to the put
> > inside the kvm struct?
> 
> That wouldn't allow us to achieve our goal, which is to hide the details of
> "struct kvm" from VFIO (and the rest of the kernel).

> What's the objection to handing VFIO a function pointer?

Hmm, looks like it was this thread:

 
https://lore.kernel.org/r/0-v1-33906a626da1+16b0-vfio_kvm_no_group_...@nvidia.com

Your rationale looks a little better to me.

> > Then the normal kvm get/put don't have to be exported symbols at all?
> 
> The export of kvm_get_kvm_safe() can go away (I forgot to do that in this series),
> but kvm_get_kvm() will hang around as it's needed by KVM sub-modules (PPC and x86),
> KVMGT (x86), and drivers/s390/crypto/vfio_ap_ops.c (no idea what to call that beast).

My thought would be to keep it as an inline; there should be some way
to do that without breaking your desire to hide the bulk of the kvm
struct content. Like putting the refcount as the first element in the
struct and just not ifdef'ing it away?
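
Something like this, as a rough sketch of the idea only (not something from
the series):

#include <linux/refcount.h>

struct kvm {
        refcount_t users_count;        /* first member, never ifdef'd away */
        /* ... everything else can move behind a KVM-private header ... */
};

static inline void kvm_get_kvm(struct kvm *kvm)
{
        refcount_inc(&kvm->users_count);   /* no exported symbol needed */
}

The put would still have to reach real KVM code since destruction needs the
full struct, so that is where the exported symbol or function pointer comes
back in.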

Jason


Re: [PATCH 06/26] KVM: Drop CONFIG_KVM_VFIO and just look at KVM+VFIO

2023-09-18 Thread Jason Gunthorpe
On Fri, Sep 15, 2023 at 05:30:58PM -0700, Sean Christopherson wrote:
> Drop KVM's KVM_VFIO Kconfig, and instead compile in VFIO support if
> and only if VFIO itself is enabled.  Similar to the recent change to have
> VFIO stop looking at HAVE_KVM, compiling in support for talking to VFIO
> just because the architecture supports VFIO is nonsensical.
> 
> This fixes a bug where RISC-V doesn't select KVM_VFIO, i.e. would silently
> fail to connect KVM and VFIO, even though RISC-V supports VFIO.  The
> bug is benign as the only driver in all of Linux that actually uses the
> KVM reference provided by VFIO is KVM-GT, which is x86/Intel specific.

Hmm, I recall that all the S390 drivers need it as well.

static int vfio_ap_mdev_open_device(struct vfio_device *vdev)
{
struct ap_matrix_mdev *matrix_mdev =
container_of(vdev, struct ap_matrix_mdev, vdev);

if (!vdev->kvm)
return -EINVAL;

return vfio_ap_mdev_set_kvm(matrix_mdev, vdev->kvm);


I wonder if we should be making the VFIO drivers that need the kvm to
ask for it? 'select CONFIG_NEED_VFIO_KVM' or something?

Regardless, I fully agree with getting rid of the arch flag.

Reviewed-by: Jason Gunthorpe 

> --- a/virt/kvm/Makefile.kvm
> +++ b/virt/kvm/Makefile.kvm
> @@ -6,7 +6,9 @@
>  KVM ?= ../../../virt/kvm
>  
>  kvm-y := $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/binary_stats.o
> -kvm-$(CONFIG_KVM_VFIO) += $(KVM)/vfio.o
> +ifdef CONFIG_VFIO
> +kvm-y += $(KVM)/vfio.o
> +endif

I wonder if kvm-m magically works in kbuild so you don't need the ifdef?

Jason


Re: [PATCH 05/26] vfio: KVM: Pass get/put helpers from KVM to VFIO, don't do circular lookup

2023-09-18 Thread Jason Gunthorpe
On Fri, Sep 15, 2023 at 05:30:57PM -0700, Sean Christopherson wrote:
> Explicitly pass KVM's get/put helpers to VFIO when attaching a VM to
> VFIO instead of having VFIO do a symbol lookup back into KVM.  Having both
> KVM and VFIO do symbol lookups increases the overall complexity and places
> an unnecessary dependency on KVM (from VFIO) without adding any value.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  drivers/vfio/vfio.h  |  2 ++
>  drivers/vfio/vfio_main.c | 74 +++-
>  include/linux/vfio.h |  4 ++-
>  virt/kvm/vfio.c  |  9 +++--
>  4 files changed, 47 insertions(+), 42 deletions(-)

I don't mind this, but Christoph had disliked my prior attempt to do
this with function pointers..

The get can be inlined, IIRC, what about putting a pointer to the put
inside the kvm struct?

Then the normal kvm get/put don't have to be exported symbols at all?

Jason


Re: [PATCH 03/26] virt: Declare and define vfio_file_set_kvm() iff CONFIG_KVM is enabled

2023-09-18 Thread Jason Gunthorpe
On Fri, Sep 15, 2023 at 05:30:55PM -0700, Sean Christopherson wrote:
> Hide vfio_file_set_kvm() and its unique helpers if KVM is not enabled,
> nothing else in the kernel (or out of the kernel) should be using a
> KVM specific helper.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  drivers/vfio/vfio_main.c | 2 +-
>  include/linux/vfio.h | 2 ++
>  2 files changed, 3 insertions(+), 1 deletion(-)

Patch subject should be vfio not virt

Reviewed-by: Jason Gunthorpe 

> @@ -1388,6 +1387,7 @@ void vfio_file_set_kvm(struct file *file, struct kvm *kvm)
>   vfio_device_file_set_kvm(file, kvm);
>  }
>  EXPORT_SYMBOL_GPL(vfio_file_set_kvm);
> +#endif

We could even put this in a symbol namespace..
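
Something like this, just as a sketch (not part of the patch):

/* drivers/vfio/vfio_main.c: export the KVM-only helper into a namespace */
EXPORT_SYMBOL_NS_GPL(vfio_file_set_kvm, VFIO_KVM);

/* virt/kvm/vfio.c: the one intended consumer imports the namespace
 * explicitly; anything else linking against the symbol is flagged by
 * modpost. */
MODULE_IMPORT_NS(VFIO_KVM);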

Jason


Re: [PATCH 02/26] vfio: Move KVM get/put helpers to colocate it with other KVM related code

2023-09-18 Thread Jason Gunthorpe
On Fri, Sep 15, 2023 at 05:30:54PM -0700, Sean Christopherson wrote:
> Move the definitions of vfio_device_get_kvm_safe() and vfio_device_put_kvm()
> down in vfio_main.c to colocate them with other KVM-specific functions,
> e.g. to allow wrapping them all with a single CONFIG_KVM check.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  drivers/vfio/vfio_main.c | 104 +++
>  1 file changed, 52 insertions(+), 52 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


  1   2   3   4   5   6   >