Re: [PATCH v4 14/15] kprobes: remove dependency on CONFIG_MODULES

2024-04-19 Thread Christophe Leroy


Le 19/04/2024 à 17:49, Mike Rapoport a écrit :
> Hi Masami,
> 
> On Thu, Apr 18, 2024 at 06:16:15AM +0900, Masami Hiramatsu wrote:
>> Hi Mike,
>>
>> On Thu, 11 Apr 2024 19:00:50 +0300
>> Mike Rapoport  wrote:
>>
>>> From: "Mike Rapoport (IBM)" 
>>>
>>> kprobes depended on CONFIG_MODULES because it has to allocate memory for
>>> code.
>>>
>>> Since code allocations are now implemented with execmem, kprobes can be
>>> enabled in non-modular kernels.
>>>
>>> Add #ifdef CONFIG_MODULES guards for the code dealing with kprobes inside
>>> modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
>>> dependency of CONFIG_KPROBES on CONFIG_MODULES.
>>
>> Thanks for this work, but this conflicts with the latest fix in v6.9-rc4.
>> Also, can you use IS_ENABLED(CONFIG_MODULES) instead of #ifdefs in the
>> function body? We have enough dummy functions for that, so it should
>> not be a problem.
> 
> The code in check_kprobe_address_safe() that gets the module and checks for
> __init functions does not compile with IS_ENABLED(CONFIG_MODULES).
> I can pull it out to a helper or leave the #ifdef in the function body,
> whichever you prefer.

As far as I can see, the only problem is MODULE_STATE_COMING.
Can we move 'enum module_state' out of #ifdef CONFIG_MODULES in module.h?
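For reference, the enum as it stands in include/linux/module.h; the 
suggestion is simply to hoist it out of the #ifdef CONFIG_MODULES block 
so that a MODULE_STATE_COMING check guarded by IS_ENABLED(CONFIG_MODULES) 
still compiles in non-modular builds:

enum module_state {
	MODULE_STATE_LIVE,	/* Normal state. */
	MODULE_STATE_COMING,	/* Full formed, running module_init. */
	MODULE_STATE_GOING,	/* Going away. */
	MODULE_STATE_UNFORMED,	/* Still setting it up. */
};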


>   
>> -- 
>> Masami Hiramatsu
> 


Re: [0/2] powerpc/powernv/vas: Adjustments for two function implementations

2024-04-16 Thread Christophe Leroy


Le 16/04/2024 à 14:14, Markus Elfring a écrit :
>> This is explicit in Kernel documentation:
>>
>> /**
>>  * kfree - free previously allocated memory
>>  * @object: pointer returned by kmalloc() or kmem_cache_alloc()
>>  *
>>  * If @object is NULL, no operation is performed.
>>  */
>>
>> That's exactly the same behaviour as free() in libc.
>>
>> So Coccinelle should be fixed if it reports an error for that.
> 
> Redundant function calls can occasionally be avoided accordingly,
> can't they?

Sure they can, but is that worth it here?

Christophe


Re: [0/2] powerpc/powernv/vas: Adjustments for two function implementations

2024-04-16 Thread Christophe Leroy


Le 16/04/2024 à 13:11, Michael Ellerman a écrit :
> Markus Elfring  writes:
>>> A few update suggestions were taken into account
>>> from static source code analysis.
>>>
>>> Markus Elfring (2):
>>
>> I would appreciate a bit more information about the reasons
>> why this patch series was rejected.
>>
>>
>>>One function call less in vas_window_alloc() after error detection
>>
>> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/1f1c21cf-c34c-418c-b00c-8e6474f12...@web.de/
> 
> It introduced a new goto and label to avoid a kfree(NULL) call, but
> kfree() explicitly accepts NULL and handles it. So it complicates the
> source code for no gain.

This is explicit in Kernel documentation:

/**
 * kfree - free previously allocated memory
 * @object: pointer returned by kmalloc() or kmem_cache_alloc()
 *
 * If @object is NULL, no operation is performed.
 */

That's exactly the same behaviour as free() in libc.

So Coccinelle should be fixed if it reports an error for that.
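To illustrate with a hedged sketch (names hypothetical, not the actual 
VAS code): kfree(NULL) being a no-op is precisely what keeps a single 
error label sufficient, where Coccinelle's suggestion would add one 
label per allocation:

static struct foo *foo_alloc(void)
{
	struct foo *f;

	f = kzalloc(sizeof(*f), GFP_KERNEL);
	if (!f)
		return NULL;

	f->name = kasprintf(GFP_KERNEL, "foo");
	if (!f->name || foo_register(f))	/* foo_register() is hypothetical */
		goto err;

	return f;

err:
	kfree(f->name);	/* safe even when f->name is NULL */
	kfree(f);
	return NULL;
}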

> 
>>>Return directly after a failed kasprintf() in map_paste_region()
>>
>> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/f46f04bc-613c-4e98-b602-4c5120556...@web.de/
> 
> Basically the same reasoning. And it also changes the function from
> having two return paths (success and error), to three.
> 

Looking at that function, however, I see a missing region release when 
ioremap_cache() fails.
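A hedged sketch of the kind of fix implied, with the resource handling 
modeled loosely on map_paste_region() (names and details illustrative):

static void __iomem *map_region(resource_size_t start, resource_size_t len)
{
	void __iomem *addr;

	if (!request_mem_region(start, len, "paste"))
		return NULL;

	addr = ioremap_cache(start, len);
	if (!addr)
		release_mem_region(start, len);	/* the release noted as missing */

	return addr;
}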

Christophe


Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx

2024-04-16 Thread Christophe Leroy


Le 15/04/2024 à 21:12, Christophe Leroy a écrit :
> 
> 
> Le 12/04/2024 à 16:30, Peter Xu a écrit :
>> On Fri, Apr 12, 2024 at 02:08:03PM +0000, Christophe Leroy wrote:
>>>
>>>
>>> Le 11/04/2024 à 18:15, Peter Xu a écrit :
>>>> On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
>>>>> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
>>>>>> This series reimplements hugepages without hugepd on powerpc 8xx.
>>>>>>
>>>>>> Unlike most architectures, powerpc 8xx HW requires a two-level
>>>>>> pagetable topology for all page sizes. So a leaf PMD-contig approach
>>>>>> is not feasible as such.
>>>>>>
>>>>>> Possible sizes are 4k, 16k, 512k and 8M.
>>>>>>
>>>>>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
>>>>>> must point to a single entry level-2 page table. Until now that was
>>>>>> done using hugepd. This series changes it to use standard page tables
>>>>>> where the entry is replicated 1024 times on each of the two pagetables
>>>>>> referred to by the two associated PMD entries for that 8M page.
>>>>>>
>>>>>> At the moment it has to look into each helper to know if the
>>>>>> hugepage ptep is a PTE or a PMD in order to know it is an 8M page or
>>>>>> a lower size. I hope this can be handled by core-mm in the future.
>>>>>>
>>>>>> There are probably several ways to implement stuff, so feedback is
>>>>>> very welcome.
>>>>>
>>>>> I thought it looks pretty good!
>>>>
>>>> I second it.
>>>>
>>>> I saw the discussions in patch 1.  Christophe, I suppose you're exploring
>>>> the big hammer over hugepd, and perhaps went already with the 32bit pmd
>>>> solution for the nohash/32bit challenge you mentioned?
>>>>
>>>> I'm trying to position my next step; it seems like at least I should not
>>>> add any more hugepd code, then should I go with ARCH_HAS_HUGEPD checks,
>>>> or you're going to have an RFC soon then I can base on top?
>>>
>>> Depends on what you expect by "soon".
>>>
>>> I sure won't be able to send any RFC before end of April.
>>>
>>> Should be possible to have something during May.
>>
>> That's good enough, thanks.  I'll see what is the best I can do.
>>
>> Then do you think I can leave p4d/pgd leaves alone?  Please check the other
>> email where I'm not sure whether pgd leaves ever existed for any of
>> PowerPC.  That's so far what I plan to do, on teaching pgtable walkers
>> recognize pud and lower for all leaves.  Then if Power can switch from
>> hugepd to this it should just work.
> 
> Well, if I understand correctly, something with no PMD will include 
> <asm-generic/pgtable-nopmd.h> and will therefore only have pmd 
> entries (hence no pgd/p4d/pud entries). Looks odd but that's what it is. 
> pgd_populate(), p4d_populate(), pud_populate() are all "do { } while 
> (0)" and only pmd_populate() exists. So only pmd_leaf() will exist in that 
> case.
> 
> And therefore including <asm-generic/pgtable-nop4d.h> means you 
> have p4d entries. Doesn't mean you have p4d_leaf() but that needs to be 
> checked.
> 
> 
>>
>> Even if pgd exists (then something I overlooked..), I'm wondering whether
>> we can push that downwards to be either pud/pmd (and looks like we all
>> agree p4d is never used on Power).  That may involve some pgtable
>> operations moving from pgd level to lower, e.g. my pure imagination would
>> look like starting with:
> 
> Yes I think there is no doubt that p4d is never used:
> 
> arch/powerpc/include/asm/book3s/32/pgtable.h:#include <asm-generic/pgtable-nopmd.h>
> arch/powerpc/include/asm/book3s/64/pgtable.h:#include <asm-generic/pgtable-nop4d.h>
> arch/powerpc/include/asm/nohash/32/pgtable.h:#include <asm-generic/pgtable-nopmd.h>
> arch/powerpc/include/asm/nohash/64/pgtable-4k.h:#include <asm-generic/pgtable-nop4d.h>
> 
> But that means that PPC32 has pmd entries and PPC64 has p4d entries ...
> 
>>
>> #define PTE_INDEX_SIZE    PTE_SHIFT
>> #define PMD_INDEX_SIZE    0
>> #define PUD_INDEX_SIZE    0
>> #define PGD_INDEX_SIZE    (32 - PGDIR_SHIFT)
>>
>> To:
>>
>> #define PTE_INDEX_SIZE    PTE_SHIFT
>> #define PMD_INDEX_SIZE    (32 - PMD_SHIFT)
>> #define PUD_INDEX_SIZE    0
>> #define PGD_INDEX_SIZE    0
> 
> But then you can't have #define PTRS_PER_PMD 1 from 
> <asm-generic/pgtable-nopmd.h> anymore.
> 
>>
>> And the rest will need care too.  I hope moving downward is easier
>> (e.g. the walker should always exist for lower levels but not always for
>> higher levels), but I actually have little idea on whether there's any
> other implications, so please bear with me on stupid mistakes.
>>
>> I just hope pgd leaves don't exist already, then I think it'll be 
>> simpler.
>>
>> Thanks,
>>

Digging into asm-generic/pgtable-nopmd.h, I see a definition of 
pud_leaf() always returning 0, introduced by commit 2c8a81dc0cc5 
("riscv/mm: fix two page table check related issues")

So should asm-generic/pgtable-nopud.h contain the same for p4d_leaf() 
and asm-generic/pgtable-nop4d.h contain the same for pgd_leaf()?
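Concretely, that would mean fallbacks like these, by analogy with the 
existing pud_leaf() one (a sketch; whether these are the right homes is 
exactly the question above):

/* asm-generic/pgtable-nopud.h */
static inline int p4d_leaf(p4d_t p4d)	{ return 0; }

/* asm-generic/pgtable-nop4d.h */
static inline int pgd_leaf(pgd_t pgd)	{ return 0; }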

Christophe


Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx

2024-04-15 Thread Christophe Leroy


Le 12/04/2024 à 16:30, Peter Xu a écrit :
> On Fri, Apr 12, 2024 at 02:08:03PM +0000, Christophe Leroy wrote:
>>
>>
>> Le 11/04/2024 à 18:15, Peter Xu a écrit :
>>> On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
>>>> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
>>>>> This series reimplements hugepages without hugepd on powerpc 8xx.
>>>>>
>>>>> Unlike most architectures, powerpc 8xx HW requires a two-level
>>>>> pagetable topology for all page sizes. So a leaf PMD-contig approach
>>>>> is not feasible as such.
>>>>>
>>>>> Possible sizes are 4k, 16k, 512k and 8M.
>>>>>
>>>>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
>>>>> must point to a single entry level-2 page table. Until now that was
>>>>> done using hugepd. This series changes it to use standard page tables
>>>>> where the entry is replicated 1024 times on each of the two pagetables
>>>>> referred to by the two associated PMD entries for that 8M page.
>>>>>
>>>>> At the moment it has to look into each helper to know if the
>>>>> hugepage ptep is a PTE or a PMD in order to know it is an 8M page or
>>>>> a lower size. I hope this can be handled by core-mm in the future.
>>>>>
>>>>> There are probably several ways to implement stuff, so feedback is
>>>>> very welcome.
>>>>
>>>> I thought it looks pretty good!
>>>
>>> I second it.
>>>
>>> I saw the discussions in patch 1.  Christophe, I suppose you're exploring
>>> the big hammer over hugepd, and perhaps went already with the 32bit pmd
>>> solution for nohash/32bit challenge you mentioned?
>>>
>>> I'm trying to position my next step; it seems like at least I should not
>>> add any more hugepd code, then should I go with ARCH_HAS_HUGEPD checks,
>>> or you're going to have an RFC soon then I can base on top?
>>
>> Depends on what you expect by "soon".
>>
>> I sure won't be able to send any RFC before end of April.
>>
>> Should be possible to have something during May.
> 
> That's good enough, thanks.  I'll see what is the best I can do.
> 
> Then do you think I can leave p4d/pgd leaves alone?  Please check the other
> email where I'm not sure whether pgd leaves ever existed for any of
> PowerPC.  That's so far what I plan to do, on teaching pgtable walkers
> recognize pud and lower for all leaves.  Then if Power can switch from
> hugepd to this it should just work.

Well, if I understand correctly, something with no PMD will include 
<asm-generic/pgtable-nopmd.h> and will therefore only have pmd 
entries (hence no pgd/p4d/pud entries). Looks odd but that's what it is. 
pgd_populate(), p4d_populate(), pud_populate() are all "do { } while 
(0)" and only pmd_populate() exists. So only pmd_leaf() will exist in that 
case.

And therefore including <asm-generic/pgtable-nop4d.h> means you 
have p4d entries. Doesn't mean you have p4d_leaf() but that needs to be 
checked.


> 
> Even if pgd exists (then something I overlooked..), I'm wondering whether
> we can push that downwards to be either pud/pmd (and looks like we all
> agree p4d is never used on Power).  That may involve some pgtable
> operations moving from pgd level to lower, e.g. my pure imagination would
> look like starting with:

Yes I think there is no doubt that p4d is never used:

arch/powerpc/include/asm/book3s/32/pgtable.h:#include <asm-generic/pgtable-nopmd.h>
arch/powerpc/include/asm/book3s/64/pgtable.h:#include <asm-generic/pgtable-nop4d.h>
arch/powerpc/include/asm/nohash/32/pgtable.h:#include <asm-generic/pgtable-nopmd.h>
arch/powerpc/include/asm/nohash/64/pgtable-4k.h:#include <asm-generic/pgtable-nop4d.h>

But that means that PPC32 has pmd entries and PPC64 has p4d entries ...

> 
> #define PTE_INDEX_SIZE	PTE_SHIFT
> #define PMD_INDEX_SIZE	0
> #define PUD_INDEX_SIZE	0
> #define PGD_INDEX_SIZE	(32 - PGDIR_SHIFT)
> 
> To:
> 
> #define PTE_INDEX_SIZE	PTE_SHIFT
> #define PMD_INDEX_SIZE	(32 - PMD_SHIFT)
> #define PUD_INDEX_SIZE	0
> #define PGD_INDEX_SIZE	0

But then you can't have #define PTRS_PER_PMD 1 from 
<asm-generic/pgtable-nopmd.h> anymore.

> 
> And the rest will need care too.  I hope moving downward is easier
> (e.g. the walker should always exist for lower levels but not always for
> higher levels), but I actually have little idea on whether there's any
> other implications, so please bear with me on stupid mistakes.
> 
> I just hope pgd leaves don't exist already, then I think it'll be simpler.
> 
> Thanks,
> 


Re: [PATCH] bug: Fix no-return-statement warning with !CONFIG_BUG

2024-04-15 Thread Christophe Leroy


Le 15/04/2024 à 17:35, Arnd Bergmann a écrit :
> On Mon, Apr 15, 2024, at 04:19, Michael Ellerman wrote:
>> "Arnd Bergmann"  writes:
>>> On Thu, Apr 11, 2024, at 11:27, Adrian Hunter wrote:
>>>> On 11/04/24 11:22, Christophe Leroy wrote:
>>>>
>>>> That is fragile because it depends on defined(__OPTIMIZE__),
>>>> so it should still be:
>>>
>>> If there is a function that is defined but that must never be
>>> called, I think we are doing something wrong.
>>
>> It's a pretty inevitable result of using IS_ENABLED(), which the docs
>> encourage people to use.
> 
> Using IS_ENABLED() is usually a good idea, as it helps avoid
> adding extra #ifdef checks and just drops static functions as
> dead code, or lets you call extern functions that are conditionally
> defined in a different file.
> 
> The thing is that here it does not do either of those and
> adds more complexity than it avoids.
> 
>> In this case it could easily be turned into a build error by just making
>> it an extern rather than a static inline.
>>
>> But I think Christophe's solution is actually better, because it's more
>> explicit, ie. this function should not be called and if it is that's a
>> build time error.
> 
> I haven't seen a good solution here. Ideally we'd just define
> the functions unconditionally and have IS_ENABLED() take care
> of letting the compiler drop them silently, but that doesn't
> build because of missing struct members.
> 
> I won't object to either an 'extern' declaration or the
> 'BUILD_BUG_ON()' if you and others prefer that, both are better
> than BUG() here. I still think my suggestion would be a little
> simpler.

The advantage of BUILD_BUG() over the extern is that the error 
gets detected at build time. With the extern it gets detected only at 
link time.
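(A sketch of the extern variant being compared, assuming the timekeeping 
declaration: there is deliberately no definition anywhere, so a call 
compiles fine and only fails when the linker tries to resolve it.)

extern u64 timekeeping_debug_get_ns(const struct tk_read_base *tkr);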

But I agree with you, the missing struct members defeat the advantage of 
IS_ENABLED().

In the end, how many instances of struct timekeeper do we have in the 
system? With a quick look I see only two instances: tk_core.timekeeper 
and shadow_timekeeper. If I'm correct, wouldn't it just be simpler to 
have the three debug struct members defined at all times?
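For illustration, a hedged sketch of that last option (field names 
borrowed from the CONFIG_DEBUG_TIMEKEEPING block; treat them as 
illustrative):

struct timekeeper {
	/* ... */
	/* formerly under #ifdef CONFIG_DEBUG_TIMEKEEPING, now unconditional */
	long	last_warning;
	int	underflow_seen;
	int	overflow_seen;
};

With the members always present, timekeeping_debug_get_ns() always 
compiles and IS_ENABLED(CONFIG_DEBUG_TIMEKEEPING) dead-code elimination 
removes the unused paths.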

Christophe


Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

2024-04-12 Thread Christophe Leroy


Le 10/04/2024 à 21:58, Peter Xu a écrit :
>>
>> e500 has two modes: 32 bits and 64 bits.
>>
>> For 32 bits:
>>
>> 8xx is the only one handling it through HW-assisted pagetable walk, hence
>> requiring a 2-level topology whatever the pagesize is.
> 
> Hmm I think maybe finally I get it..
> 
> I think the confusion came from when I saw there's always such level-2
> table described in Figure 8-5 of the manual:
> 
> https://www.nxp.com/docs/en/reference-manual/MPC860UM.pdf

Yes indeed that figure is confusing.

Table 8-1 gives a pretty good idea of what is required. We only use 
MD_CTR[TWAM] = 1.

> 
> So I suppose you meant for 8M, the PowerPC 8xx system hardware will be
> aware of such 8M pgtable (from level-1's entry, where it has bit 28-29 set
> 011b), then it won't ever read anything starting from "Level-2 Descriptor
> 1" (but only read the only entry "Level-2 Descriptor 0"), so fundamentally
> hugepd format must look like such for 8xx?
> 
> But then perhaps it's still compatible with cont-pte because the rest
> entries (pte index 1+) will simply be ignored by the hardware?

Yes, still compatible with CONT-PTE, although things become tricky 
because you need two page tables to get the full 8M, so that's a kind of 
cont-PMD down to PTE level, as you can see in my RFC series.

> 
>>
>> On e500 it is all software so pages 2M and larger should be cont-PGD (by
>> the way I'm a bit puzzled that on arches that have only 2 levels, ie PGD
>> and PTE, the PGD entries are populated by a function called pmd_populate()).
> 
> Yeah.. I am also wondering whether pgd_populate() could also work there
> (perhaps with some trivial changes, or maybe not even needed..), as when
> p4d/pud/pmd levels are missing, linux should just do something like an
> enforced cast from pgd_t* -> pmd_t* in this case.
> 
> I think currently they're already not pgd, as __find_linux_pte() already
> skipped pgd unconditionally:
> 
>   pgdp = pgdir + pgd_index(ea);
>   p4dp = p4d_offset(pgdp, ea);
> 

Yes, that's what is confusing: some parts of the code consider we have only 
a PGD and a PT while other parts consider we have only a PMD and a PT.

>>
>> Current situation for 8xx is illustrated here:
>> https://github.com/linuxppc/wiki/wiki/Huge-pages#8xx
>>
>> I also tried to better illustrate e500/32 here:
>> https://github.com/linuxppc/wiki/wiki/Huge-pages#e500
>>
>> For 64 bits:
>> We have PTE/PMD/PUD/PGD, no P4D
>>
>> See arch/powerpc/include/asm/nohash/64/pgtable-4k.h
> 
> We don't have anything that is above pud in this category, right?  That's
> what I read from your wiki (and thanks for providing that in the first
> place; helps a lot for me to understand how it works on PowerPC).

Yes thanks to Michael and Aneesh who initiated that Wiki page.

> 
> I want to make sure if I can move on without caring on p4d/pgd leafs like
> what we do right now, even after if we can remove hugepd for good, in this
> case since p4d always missing, then it's about whether "pud|pmd|pte_leaf()"
> can also cover the pgd ones when that day comes, iiuc.

I guess so but I'd like Aneesh and/or Michael to confirm as I'm not an 
expert on PPC64.

Christophe


Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx

2024-04-12 Thread Christophe Leroy


Le 11/04/2024 à 18:15, Peter Xu a écrit :
> On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
>> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
>>> This series reimplements hugepages without hugepd on powerpc 8xx.
>>>
>>> Unlike most architectures, powerpc 8xx HW requires a two-level
>>> pagetable topology for all page sizes. So a leaf PMD-contig approach
>>> is not feasible as such.
>>>
>>> Possible sizes are 4k, 16k, 512k and 8M.
>>>
>>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
>>> must point to a single entry level-2 page table. Until now that was
>>> done using hugepd. This series changes it to use standard page tables
>>> where the entry is replicated 1024 times on each of the two pagetables
>>> referred to by the two associated PMD entries for that 8M page.
>>>
>>> At the moment it has to look into each helper to know if the
>>> hugepage ptep is a PTE or a PMD in order to know it is an 8M page or
>>> a lower size. I hope this can be handled by core-mm in the future.
>>>
>>> There are probably several ways to implement stuff, so feedback is
>>> very welcome.
>>
>> I thought it looks pretty good!
> 
> I second it.
> 
> I saw the discussions in patch 1.  Christophe, I suppose you're exploring
> the big hammer over hugepd, and perhaps went already with the 32bit pmd
> solution for nohash/32bit challenge you mentioned?
> 
> I'm trying to position my next step; it seems like at least I should not
> add any more hugepd code, then should I go with ARCH_HAS_HUGEPD checks,
> or you're going to have an RFC soon then I can base on top?

Depends on what you expect by "soon".

I sure won't be able to send any RFC before end of April.

Should be possible to have something during May.

Christophe


Re: [RFC PATCH 2/7] mm: vmalloc: don't account for number of nodes for HUGE_VMAP allocations

2024-04-12 Thread Christophe Leroy


Le 11/04/2024 à 18:05, Mike Rapoport a écrit :
> From: "Mike Rapoport (IBM)" 
> 
> vmalloc allocations with VM_ALLOW_HUGE_VMAP that do not explicitly
> specify node ID will use huge pages only if size_per_node is larger than
> PMD_SIZE.
> Still the actual allocated memory is not distributed between nodes and
> there is no advantage in such an approach.
> On the contrary, BPF allocates PMD_SIZE * num_possible_nodes() for each
> new bpf_prog_pack, while it could do with PMD_SIZE'ed packs.
> 
> Don't account for number of nodes for VM_ALLOW_HUGE_VMAP with
> NUMA_NO_NODE and use huge pages whenever the requested allocation size
> is larger than PMD_SIZE.

The patch looks OK but the message is confusing. We also use huge pages at PTE 
size, for instance 512k pages or 16k pages on powerpc 8xx, while 
PMD_SIZE is 4M.
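(For context, a hedged sketch of how PTE-level huge mappings enter the 
picture, modeled on the 8xx arch_vmap_pte_supported_shift() idea; treat 
the exact values as illustrative:)

static inline unsigned long arch_vmap_pte_supported_shift(unsigned long size)
{
	if (size >= SZ_512K)
		return 19;	/* 512k pages, still at PTE level */
	if (size >= SZ_16K)
		return 14;	/* 16k pages */
	return PAGE_SHIFT;
}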

Christophe

> 
> Signed-off-by: Mike Rapoport (IBM) 
> ---
>   mm/vmalloc.c | 9 ++---
>   1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 22aa63f4ef63..5fc8b514e457 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3737,8 +3737,6 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>   }
>   
>   if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP)) {
> - unsigned long size_per_node;
> -
>   /*
>* Try huge pages. Only try for PAGE_KERNEL allocations,
>* others like modules don't yet expect huge pages in
> @@ -3746,13 +3744,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>* supporting them.
>*/
>   
> - size_per_node = size;
> - if (node == NUMA_NO_NODE)
> - size_per_node /= num_online_nodes();
> - if (arch_vmap_pmd_supported(prot) && size_per_node >= PMD_SIZE)
> + if (arch_vmap_pmd_supported(prot) && size >= PMD_SIZE)
>   shift = PMD_SHIFT;
>   else
> - shift = arch_vmap_pte_supported_shift(size_per_node);
> + shift = arch_vmap_pte_supported_shift(size);
>   
>   align = max(real_align, 1UL << shift);
>   size = ALIGN(real_size, 1UL << shift);


Re: [PATCH] bug: Fix no-return-statement warning with !CONFIG_BUG

2024-04-11 Thread Christophe Leroy


Le 11/04/2024 à 10:12, Christophe Leroy a écrit :
> 
> 
> Le 11/04/2024 à 09:16, Adrian Hunter a écrit :
>> On 11/04/24 10:04, Arnd Bergmann wrote:
>>> On Wed, Apr 10, 2024, at 17:32, Adrian Hunter wrote:
>>>> BUG() does not return, and arch implementations of BUG() use 
>>>> unreachable()
>>>> or other non-returning code. However with !CONFIG_BUG, the default
>>>> implementation is often used instead, and that does not do that. x86 
>>>> always
>>>> uses its own implementation, but powerpc with !CONFIG_BUG gives a build
>>>> error:
>>>>
>>>>    kernel/time/timekeeping.c: In function ‘timekeeping_debug_get_ns’:
>>>>    kernel/time/timekeeping.c:286:1: error: no return statement in 
>>>> function
>>>>    returning non-void [-Werror=return-type]
>>>>
>>>> Add unreachable() to default !CONFIG_BUG BUG() implementation.
>>>
>>> I'm a bit worried about this patch, since we have had problems
>>> with unreachable() inside of BUG() in the past, and as far as I
>>> can remember, the current version was the only one that
>>> actually did the right thing on all compilers.
>>>
>>> One problem with an unreachable() annotation here is that if
>>> a compiler misanalyses the endless loop, it can decide to
>>> throw out the entire code path leading up to it and just
>>> run into undefined behavior instead of printing a BUG()
>>> message.
>>>
>>> Do you know which compiler version show the warning above?
>>
>> Original report has a list
>>
>> 
>> https://lore.kernel.org/all/CA+G9fYvjdZCW=7zgxs6a_3bysjq56yf7s-+pnlq_8a4dkh1...@mail.gmail.com/
>>
> 
> Looking at the report, I think the correct fix should be to use 
> BUILD_BUG() instead of BUG().

I confirm the error goes away with the following change to next-20240411 
on powerpc tinyconfig with gcc 13.2

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 4e18db1819f8..3d5ac0cdd721 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -282,7 +282,7 @@ static inline void timekeeping_check_update(struct timekeeper *tk, u64 offset)
  }
  static inline u64 timekeeping_debug_get_ns(const struct tk_read_base *tkr)
  {
-   BUG();
+   BUILD_BUG();
  }
  #endif



Re: [PATCH] bug: Fix no-return-statement warning with !CONFIG_BUG

2024-04-11 Thread Christophe Leroy


Le 11/04/2024 à 09:16, Adrian Hunter a écrit :
> On 11/04/24 10:04, Arnd Bergmann wrote:
>> On Wed, Apr 10, 2024, at 17:32, Adrian Hunter wrote:
>>> BUG() does not return, and arch implementations of BUG() use unreachable()
>>> or other non-returning code. However with !CONFIG_BUG, the default
>>> implementation is often used instead, and that does not do that. x86 always
>>> uses its own implementation, but powerpc with !CONFIG_BUG gives a build
>>> error:
>>>
>>>kernel/time/timekeeping.c: In function ‘timekeeping_debug_get_ns’:
>>>kernel/time/timekeeping.c:286:1: error: no return statement in function
>>>returning non-void [-Werror=return-type]
>>>
>>> Add unreachable() to default !CONFIG_BUG BUG() implementation.
>>
>> I'm a bit worried about this patch, since we have had problems
>> with unreachable() inside of BUG() in the past, and as far as I
>> can remember, the current version was the only one that
>> actually did the right thing on all compilers.
>>
>> One problem with an unreachable() annotation here is that if
>> a compiler misanalyses the endless loop, it can decide to
>> throw out the entire code path leading up to it and just
>> run into undefined behavior instead of printing a BUG()
>> message.
>>
>> Do you know which compiler version show the warning above?
> 
> Original report has a list
> 
>   
> https://lore.kernel.org/all/CA+G9fYvjdZCW=7zgxs6a_3bysjq56yf7s-+pnlq_8a4dkh1...@mail.gmail.com/
> 

Looking at the report, I think the correct fix should be to use 
BUILD_BUG() instead of BUG().


Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

2024-04-10 Thread Christophe Leroy


Le 10/04/2024 à 17:28, Peter Xu a écrit :
> On Tue, Apr 09, 2024 at 08:43:55PM -0300, Jason Gunthorpe wrote:
>> On Fri, Apr 05, 2024 at 05:42:44PM -0400, Peter Xu wrote:
>>> In short, hugetlb mappings shouldn't be special comparing to other huge pXd
>>> and large folio (cont-pXd) mappings for most of the walkers in my mind, if
>>> not all.  I need to look at all the walkers and there can be some tricky
>>> ones, but I believe that applies in general.  It's actually similar to what
>>> I did with slow gup here.
>>
>> I think that is the big question, I also haven't done the research to
>> know the answer.
>>
>> At this point focusing on moving what is reasonable to the pXX_* API
>> makes sense to me. Then reviewing what remains and making some
>> decision.
>>
>>> Like this series, for cont-pXd we'll need multiple walks comparing to
>>> before (when with hugetlb_entry()), but for that part I'll provide some
>>> performance tests too, and we also have a fallback plan, which is to detect
>>> cont-pXd existence, which will also work for large folios.
>>
>> I think we can optimize this pretty easy.
>>   
 I think if you do the easy places for pXX conversion you will have a
 good idea about what is needed for the hard places.
>>>
>>> Here IMHO we don't need to understand "what is the size of this hugetlb
>>> vma"
>>
>> Yeh, I never really understood why hugetlb was linked to the VMA.. The
>> page table is self describing, obviously.
> 
> Attaching to vma still makes sense to me, where we should definitely avoid
> a mixture of hugetlb and !hugetlb pages in a single vma - hugetlb pages are
> allocated, managed, ...  totally differently.
> 
> And since hugetlb is designed as file-based (which also makes sense to me,
> at least for now), it's also natural that it's vma-attached.
> 
>>
>>> or "which level of pgtable does this hugetlb vma pages locate",
>>
>> Ditto
>>
>>> because we may not need that, e.g., when we only want to collect some smaps
>>> statistics.  "whether it's hugetlb" may matter, though. E.g. in the mm
>>> walker we see a huge pmd, it can be a thp, it can be a hugetlb (when
>>> hugetlb_entry removed), we may need extra check later to put things into
>>> the right bucket, but for the walker itself it doesn't necessarily need
>>> hugetlb_entry().
>>
>> Right, places may still need to know it is part of a huge VMA because we
>> have special stuff linked to that.
>>
 But then again we come back to power and its big list of page sizes
 and variety :( Looks like some there have huge sizes at the pgd level
 at least.
>>>
>>> Yeah this is something I want to be super clear, because I may miss
>>> something: we don't have real pgd pages, right?  Powerpc doesn't even
>>> define p4d_leaf(), AFAICT.
>>
>> AFAICT it is because it hides it all in hugepd.
> 
> IMHO one thing we can benefit from such hugepd rework is, if we can squash
> all the hugepds like what Christophe does, then we push it one more layer
> down, and we have a good chance all things should just work.
> 
> So again my Power brain is close to zero, but now I'm referring to what
> Christophe shared in the other thread:
> 
> https://github.com/linuxppc/wiki/wiki/Huge-pages
> 
> Together with:
> 
> https://lore.kernel.org/r/288f26f487648d21fd9590e40b390934eaa5d24a.1711377230.git.christophe.le...@csgroup.eu
> 
> Where it has:
> 
> --- a/arch/powerpc/platforms/Kconfig.cputype
> +++ b/arch/powerpc/platforms/Kconfig.cputype
> @@ -98,6 +98,7 @@ config PPC_BOOK3S_64
>  select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
>  select ARCH_ENABLE_SPLIT_PMD_PTLOCK
>  select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
> +   select ARCH_HAS_HUGEPD if HUGETLB_PAGE
>  select ARCH_SUPPORTS_HUGETLBFS
>  select ARCH_SUPPORTS_NUMA_BALANCING
>  select HAVE_MOVE_PMD
> @@ -290,6 +291,7 @@ config PPC_BOOK3S
>   config PPC_E500
>  select FSL_EMB_PERFMON
>  bool
> +   select ARCH_HAS_HUGEPD if HUGETLB_PAGE
>  select ARCH_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64
>  select PPC_SMP_MUXED_IPI
>  select PPC_DOORBELL
> 
> So I think it means we have three PowerPC systems that support hugepd
> right now (besides the 8xx, where Christophe is trying to drop support);
> besides 8xx we still have book3s_64 and E500.
> 
> Let's check one by one:
> 
>- book3s_64
> 
>  - hash
> 
>- 64K: p4d is not used, largest pgsize pgd 16G @pud level.  It
>  means after squashing it'll be a bunch of cont-pmd, all good.
> 
>- 4K: p4d also not used, largest pgsize pgd 128G, after squashed
>  it'll be cont-pud. all good.
> 
>  - radix
> 
>- 64K: largest 1G @pud, then cont-pmd after squashed. all good.
> 
>- 4K: largest 1G @pud, then cont-pmd, all good.
> 
>- e500 & 8xx
> 
>  - both of them use 2-level pgtables (pgd + pte), after squashed hugepd
>@pgd level they become cont-pte. all good.


Re: [PATCH] powerpc: align memory_limit to 16MB in early_parse_mem

2024-04-10 Thread Christophe Leroy


Le 10/04/2024 à 17:22, Joel Savitz a écrit :
> 
> On Mon, Apr 1, 2024 at 10:17 AM Joel Savitz  wrote:
>>
>> On Tue, Mar 26, 2024 at 12:45 AM Joel Savitz  wrote:
>>>
>>> On Fri, Mar 8, 2024 at 5:18 AM Aneesh Kumar K.V  
>>> wrote:

 Joel Savitz  writes:

> On 64-bit powerpc, usage of a non-16MB-aligned value for the mem= kernel
> cmdline parameter results in a system hang at boot.
>
> For example, using 'mem=4198400K' will always reproduce this issue.
>
> This patch fixes the problem by aligning any argument to mem= to 16MB
> corresponding with the large page size on powerpc.
>
> Fixes: 2babf5c2ec2f ("[PATCH] powerpc: Unify mem= handling")
> Co-developed-by: Gonzalo Siero 
> Signed-off-by: Gonzalo Siero 
> Signed-off-by: Joel Savitz 
> ---
>   arch/powerpc/kernel/prom.c | 6 +-
>   1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index 0b5878c3125b..8cd3e2445d8a 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -82,8 +82,12 @@ static int __init early_parse_mem(char *p)
>  {
>  	if (!p)
>  		return 1;
> -
> +#ifdef CONFIG_PPC64
> +	/* Align to 16 MB == size of ppc64 large page */
> +	memory_limit = ALIGN(memparse(p, &p), 0x1000000);
> +#else
>  	memory_limit = PAGE_ALIGN(memparse(p, &p));
> +#endif
>  	DBG("memory limit = 0x%llx\n", memory_limit);
> 
>  	return 0;
> --
> 2.43.0

 Can you try this change?

 commit bc55e1aa71f545cff31e1eccdb4a2e39df84
 Author: Aneesh Kumar K.V (IBM) 
 Date:   Fri Mar 8 14:45:26 2024 +0530

 powerpc/mm: Align memory_limit value specified using mem= kernel parameter

 The value specified for the memory limit is used to set a restriction on
 memory usage. It is important to ensure that this restriction is within
 the linear map kernel address space range. The hash page table
 translation uses a 16MB page size to map the kernel linear map address
 space. The htab_bolt_mapping() function aligns down the size of the range
 while mapping kernel linear address space. Since the memblock limit is
 enforced very early during boot, before we can detect the type of memory
 translation (radix vs hash), we align the memory limit value specified
 as a kernel parameter to 16MB. This alignment value will work for both
 hash and radix translations.

 Signed-off-by: Aneesh Kumar K.V (IBM) 

 diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
 index 0b5878c3125b..9bd965d35352 100644
 --- a/arch/powerpc/kernel/prom.c
 +++ b/arch/powerpc/kernel/prom.c
 @@ -824,8 +824,11 @@ void __init early_init_devtree(void *params)
  	reserve_crashkernel();
  	early_reserve_mem();
 
 -	/* Ensure that total memory size is page-aligned. */
 -	limit = ALIGN(memory_limit ?: memblock_phys_mem_size(), PAGE_SIZE);
 +	if (memory_limit > memblock_phys_mem_size())
 +		memory_limit = 0;
 +
 +	/* Align down to 16 MB which is large page size with hash page translation */
 +	limit = ALIGN_DOWN(memory_limit ?: memblock_phys_mem_size(), SZ_16M);
  	memblock_enforce_memory_limit(limit);
 
  #if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_4K_PAGES)
 diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
 index e67effdba85c..d6410549e141 100644
 --- a/arch/powerpc/kernel/prom_init.c
 +++ b/arch/powerpc/kernel/prom_init.c
 @@ -817,8 +817,8 @@ static void __init early_cmdline_parse(void)
  		opt += 4;
  		prom_memory_limit = prom_memparse(opt, (const char **)&opt);
  #ifdef CONFIG_PPC64
 -		/* Align to 16 MB == size of ppc64 large page */
 -		prom_memory_limit = ALIGN(prom_memory_limit, 0x1000000);
 +		/* Align down to 16 MB which is large page size with hash page translation */
 +		prom_memory_limit = ALIGN_DOWN(prom_memory_limit, SZ_16M);
  #endif
  	}


>>>
>>> Sorry for the delayed reply. I just tested this patch and it fixes the
>>> bug for me.
>>
>> Hi,
>>
>> Just a quick follow up on this.
>>
>> The above patch fixed the bug for me.
>>
>> How do we want to proceed?
>>
>> Best,
>> Joel Savitz
> 
> Hi,
> 
> I haven't heard anything on this thread so I'm just sending a quick follow up.
> 
> Do we want to merge this
> 

Is it the same as 

Re: [PATCH] cpufreq: Covert to exit callback returning void

2024-04-10 Thread Christophe Leroy


Le 10/04/2024 à 15:42, lizhe a écrit :
> 
> Hi,
> I have already tested it, it is functioning properly. Please review.
> 
> Lizhe
> 
> Thanks
> 

Please don't top-post, see 
https://docs.kernel.org/process/submitting-patches.html?highlight=mailing+list+etiquette#use-trimmed-interleaved-replies-in-email-discussions

Please always post in plain text (ASCII-only), never as an HTML message.

And still, your changes cannot build without a change in the definition 
of exit() in struct cpufreq_driver in include/linux/cpufreq.h.
Never submit a patch that doesn't build.

Christophe

> At 2024-04-10 21:22:47, "Lizhe"  wrote:
>>The exit() callback function returns an int type value. This leads
>>many driver authors to mistakenly believe that error handling can be
>>performed by returning an error code. However, the returned value is
>>ignored, so to improve this situation it is proposed to modify the
>>return type of the exit() callback function to void.
>>
>>Signed-off-by: Lizhe 
>>---
>> drivers/cpufreq/acpi-cpufreq.c | 4 +---
>> drivers/cpufreq/amd-pstate.c   | 7 ++-
>> drivers/cpufreq/apple-soc-cpufreq.c| 4 +---
>> drivers/cpufreq/bmips-cpufreq.c| 4 +---
>> drivers/cpufreq/cppc_cpufreq.c | 3 +--
>> drivers/cpufreq/cpufreq-dt.c   | 3 +--
>> drivers/cpufreq/e_powersaver.c | 3 +--
>> drivers/cpufreq/intel_pstate.c | 4 +---
>> drivers/cpufreq/mediatek-cpufreq-hw.c  | 4 +---
>> drivers/cpufreq/mediatek-cpufreq.c | 4 +---
>> drivers/cpufreq/omap-cpufreq.c | 3 +--
>> drivers/cpufreq/pasemi-cpufreq.c   | 6 ++
>> drivers/cpufreq/powernow-k6.c  | 3 +--
>> drivers/cpufreq/powernow-k7.c  | 3 +--
>> drivers/cpufreq/powernow-k8.c  | 4 +---
>> drivers/cpufreq/powernv-cpufreq.c  | 4 +---
>> drivers/cpufreq/ppc_cbe_cpufreq.c  | 3 +--
>> drivers/cpufreq/qcom-cpufreq-hw.c  | 4 +---
>> drivers/cpufreq/qoriq-cpufreq.c| 4 +---
>> drivers/cpufreq/scmi-cpufreq.c | 4 +---
>> drivers/cpufreq/scpi-cpufreq.c | 4 +---
>> drivers/cpufreq/sh-cpufreq.c   | 4 +---
>> drivers/cpufreq/sparc-us2e-cpufreq.c   | 3 +--
>> drivers/cpufreq/sparc-us3-cpufreq.c| 3 +--
>> drivers/cpufreq/speedstep-centrino.c   | 4 +---
>> drivers/cpufreq/tegra194-cpufreq.c | 4 +---
>> drivers/cpufreq/vexpress-spc-cpufreq.c | 3 +--
>> 27 files changed, 29 insertions(+), 74 deletions(-)
>>
>>diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
>>index 37f1cdf46d29..33f18140e9a4 100644
>>--- a/drivers/cpufreq/acpi-cpufreq.c
>>+++ b/drivers/cpufreq/acpi-cpufreq.c
>>@@ -906,7 +906,7 @@ static int acpi_cpufreq_cpu_init(struct cpufreq_policy 
>>*policy)
>>  return result;
>> }
>> 
>>-static int acpi_cpufreq_cpu_exit(struct cpufreq_policy *policy)
>>+static void acpi_cpufreq_cpu_exit(struct cpufreq_policy *policy)
>> {
>>  struct acpi_cpufreq_data *data = policy->driver_data;
>> 
>>@@ -919,8 +919,6 @@ static int acpi_cpufreq_cpu_exit(struct cpufreq_policy 
>>*policy)
>>  free_cpumask_var(data->freqdomain_cpus);
>>  kfree(policy->freq_table);
>>  kfree(data);
>>-
>>- return 0;
>> }
>> 
>> static int acpi_cpufreq_resume(struct cpufreq_policy *policy)
>>diff --git a/drivers/cpufreq/amd-pstate.c b/drivers/cpufreq/amd-pstate.c
>>index 2015c9fcc3c9..60b3d20d5939 100644
>>--- a/drivers/cpufreq/amd-pstate.c
>>+++ b/drivers/cpufreq/amd-pstate.c
>>@@ -919,7 +919,7 @@ static int amd_pstate_cpu_init(struct cpufreq_policy 
>>*policy)
>>  return ret;
>> }
>> 
>>-static int amd_pstate_cpu_exit(struct cpufreq_policy *policy)
>>+static void amd_pstate_cpu_exit(struct cpufreq_policy *policy)
>> {
>>  struct amd_cpudata *cpudata = policy->driver_data;
>> 
>>@@ -927,8 +927,6 @@ static int amd_pstate_cpu_exit(struct cpufreq_policy 
>>*policy)
>>  freq_qos_remove_request(>req[0]);
>>  policy->fast_switch_possible = false;
>>  kfree(cpudata);
>>-
>>- return 0;
>> }
>> 
>> static int amd_pstate_cpu_resume(struct cpufreq_policy *policy)
>>@@ -1376,10 +1374,9 @@ static int amd_pstate_epp_cpu_init(struct 
>>cpufreq_policy *policy)
>>  return ret;
>> }
>> 
>>-static int amd_pstate_epp_cpu_exit(struct cpufreq_policy *policy)
>>+static void amd_pstate_epp_cpu_exit(struct cpufreq_policy *policy)
>> {
>>  pr_debug("CPU %d exiting\n", policy->cpu);
>>- return 0;
>> }
>> 
>> static void amd_pstate_epp_update_limit(struct cpufreq_policy *policy)
>>diff --git a/drivers/cpufreq/apple-soc-cpufreq.c 
>>b/drivers/cpufreq/apple-soc-cpufreq.c
>>index 021f423705e1..af34c22fa273 100644
>>--- a/drivers/cpufreq/apple-soc-cpufreq.c
>>+++ b/drivers/cpufreq/apple-soc-cpufreq.c
>>@@ -305,7 +305,7 @@ static int apple_soc_cpufreq_init(struct cpufreq_policy 
>>*policy)
>>  return ret;
>> }
>> 
>>-static int apple_soc_cpufreq_exit(struct cpufreq_policy *policy)
>>+static void 

Re: [RESEND PATCH net v4 1/2] soc: fsl: qbman: Always disable interrupts when taking cgr_lock

2024-04-09 Thread Christophe Leroy
Hi Vladimir,

Le 19/02/2024 à 16:30, Vladimir Oltean a écrit :
> Hi Sean,
> 
> On Thu, Feb 15, 2024 at 11:23:26AM -0500, Sean Anderson wrote:
>> smp_call_function_single disables IRQs when executing the callback. To
>> prevent deadlocks, we must disable IRQs when taking cgr_lock elsewhere.
>> This is already done by qman_update_cgr and qman_delete_cgr; fix the
>> other lockers.
>>
>> Fixes: 96f413f47677 ("soc/fsl/qbman: fix issue in qman_delete_cgr_safe()")
>> CC: sta...@vger.kernel.org
>> Signed-off-by: Sean Anderson 
>> Reviewed-by: Camelia Groza 
>> Tested-by: Vladimir Oltean 
>> ---
>> I got no response the first time I sent this, so I am resending to net.
>> This issue was introduced in a series which went through net, so I hope
>> it makes sense to take it via net.
>>
>> [1] 
>> https://lore.kernel.org/linux-arm-kernel/20240108161904.2865093-1-sean.ander...@seco.com/
>>
>> (no changes since v3)
>>
>> Changes in v3:
>> - Change blamed commit to something more appropriate
>>
>> Changes in v2:
>> - Fix one additional call to spin_unlock
> 
> Leo Li (Li Yang) is no longer with NXP. Until we figure out within NXP
> how to continue with the maintainership of drivers/soc/fsl/, yes, please
> continue to submit this series to 'net'. I would also like to point
> out to Arnd that this is the case.
> 
> Arnd, a large portion of drivers/soc/fsl/ is networking-related
> (dpio, qbman). Would it make sense to transfer the maintainership
> of these under the respective networking drivers, to simplify the
> procedures?

I see FREESCALE QUICC ENGINE LIBRARY (drivers/soc/fsl/qe/) is maintained 
by Qiang Zhao  but I can't find any mail from him in 
the past 4 years on the linuxppc-dev list, and every time I wanted to submit 
something I only got responses from Leo Li.

The last commit he reviewed is 661ea25e5319 ("soc: fsl: qe: Replace 
one-element array and use struct_size() helper"), it was in May 2020.

Is he still working at NXP and actively maintaining that library? 
Keeping this part maintained is vital for me as this SoC is embedded in 
the two powerpc platforms I maintain (8xx and 83xx).

If Qiang Zhao is not able to actively maintain that SoC anymore, I 
volunteer to maintain it.

Thanks
Christophe


Re: [PATCH 2/2] MAINTAINERS: Make cxl obsolete

2024-04-08 Thread Christophe Leroy


Le 09/04/2024 à 06:37, Michael Ellerman a écrit :
> Andrew Donnellan  writes:
>> The cxl driver is no longer actively maintained and we intend to remove it
>> in a future kernel release. Change its status to obsolete, and update the
>> sysfs ABI documentation accordingly.
>>
>> Signed-off-by: Andrew Donnellan 
>> ---
>>   Documentation/ABI/{testing => obsolete}/sysfs-class-cxl | 3 +++
>>   MAINTAINERS | 4 ++--
>>   2 files changed, 5 insertions(+), 2 deletions(-)
>>   rename Documentation/ABI/{testing => obsolete}/sysfs-class-cxl (99%)
> 
> This is a good start, but I suspect if there are any actual users they
> are not going to be monitoring the status of cxl in the MAINTAINERS file :)
> 
> I think we should probably modify Kconfig so that anyone who's using cxl
> on purpose has some chance to notice before we remove it.
> 
> Something like the patch below. Anyone who has an existing config and
> runs oldconfig will get a prompt, eg:
> 
>Deprecated support for IBM Coherent Accelerators (CXL) (DEPRECATED_CXL) 
> [N/m/y/?] (NEW)
> 
> Folks who just use defconfig etc. won't notice any change, which is a
> pity. We could also change the default to n, but that risks breaking
> someone's machine. Maybe we do that in another release's time.

When I boot one of my boards I see:

[0.641090] mcr3000-hwmon 1800.hwmon: hwmon_device_register() is 
deprecated. Please convert the driver to use 
hwmon_device_register_with_info().

Could we do something similar and print a message at boot time when the CXL 
driver gets probed?
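A hedged sketch of what that could look like in the cxl PCI probe path 
(function name and wording illustrative):

static int cxl_probe(struct pci_dev *dev, const struct pci_device_id *id)
{
	dev_warn(&dev->dev,
		 "cxl is deprecated and will be removed in a future kernel release\n");

	/* ... existing probe code ... */
	return 0;
}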

Christophe


Re: [FSL P50x0] Kernel 6.9-rc1 compiling issue

2024-04-04 Thread Christophe Leroy
Hi Christian, hi Hari,

Le 04/04/2024 à 19:44, Christian Zigotzky a écrit :
> Shall we use CONFIG_CRASH_DUMP to get int crashing_cpu = -1;?
> 
> Further information: 
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2024-March/269985.html 
> 

Looking at the problematic commit 5c4233cc0920 ("powerpc/kdump: Split 
KEXEC_CORE and CRASH_DUMP dependency"), my feeling is that the change 
should be as follows.

Hari, can you confirm ?

diff --git a/arch/powerpc/platforms/85xx/smp.c 
b/arch/powerpc/platforms/85xx/smp.c
index 40aa58206888..3209fc92ac19 100644
--- a/arch/powerpc/platforms/85xx/smp.c
+++ b/arch/powerpc/platforms/85xx/smp.c
@@ -362,7 +362,7 @@ struct smp_ops_t smp_85xx_ops = {
  #endif
  };

-#ifdef CONFIG_KEXEC_CORE
+#ifdef CONFIG_CRASH_DUMP
  #ifdef CONFIG_PPC32
  atomic_t kexec_down_cpus = ATOMIC_INIT(0);

@@ -465,7 +465,7 @@ static void mpc85xx_smp_machine_kexec(struct kimage 
*image)

default_machine_kexec(image);
  }
-#endif /* CONFIG_KEXEC_CORE */
+#endif /* CONFIG_CRASH_DUMP */

  static void smp_85xx_setup_cpu(int cpu_nr)
  {


Re: [PATCH 1/2] powerpc: Apply __always_inline to interrupt_{enter,exit}_prepare()

2024-04-04 Thread Christophe Leroy


Le 04/04/2024 à 06:45, Rohan McLure a écrit :
> In keeping with the advice given by Documentation/core-api/entry.rst,
> entry and exit handlers for interrupts should not be instrumented.
> Guarantee that the interrupt_{enter,exit}_prepare() routines are inlined
> so that they will inherit instrumentation from their caller.
> 
> KCSAN kernels were observed to compile without inlining these routines,
> which would lead to grief on NMI handlers.
> 
> Signed-off-by: Rohan McLure 

Reviewed-by: Christophe Leroy 

> ---
>   arch/powerpc/include/asm/interrupt.h | 12 ++--
>   1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/interrupt.h 
> b/arch/powerpc/include/asm/interrupt.h
> index 7b610864b364..f4343e0bfb13 100644
> --- a/arch/powerpc/include/asm/interrupt.h
> +++ b/arch/powerpc/include/asm/interrupt.h
> @@ -150,7 +150,7 @@ static inline void booke_restore_dbcr0(void)
>   #endif
>   }
>   
> -static inline void interrupt_enter_prepare(struct pt_regs *regs)
> +static __always_inline void interrupt_enter_prepare(struct pt_regs *regs)
>   {
>   #ifdef CONFIG_PPC64
>   irq_soft_mask_set(IRQS_ALL_DISABLED);
> @@ -215,11 +215,11 @@ static inline void interrupt_enter_prepare(struct 
> pt_regs *regs)
>* However interrupt_nmi_exit_prepare does return directly to regs, because
>* NMIs do not do "exit work" or replay soft-masked interrupts.
>*/
> -static inline void interrupt_exit_prepare(struct pt_regs *regs)
> +static __always_inline void interrupt_exit_prepare(struct pt_regs *regs)
>   {
>   }
>   
> -static inline void interrupt_async_enter_prepare(struct pt_regs *regs)
> +static __always_inline void interrupt_async_enter_prepare(struct pt_regs 
> *regs)
>   {
>   #ifdef CONFIG_PPC64
>   /* Ensure interrupt_enter_prepare does not enable MSR[EE] */
> @@ -238,7 +238,7 @@ static inline void interrupt_async_enter_prepare(struct 
> pt_regs *regs)
>   irq_enter();
>   }
>   
> -static inline void interrupt_async_exit_prepare(struct pt_regs *regs)
> +static __always_inline void interrupt_async_exit_prepare(struct pt_regs 
> *regs)
>   {
>   /*
>* Adjust at exit so the main handler sees the true NIA. This must
> @@ -278,7 +278,7 @@ static inline bool nmi_disables_ftrace(struct pt_regs 
> *regs)
>   return true;
>   }
>   
> -static inline void interrupt_nmi_enter_prepare(struct pt_regs *regs, struct 
> interrupt_nmi_state *state)
> +static __always_inline void interrupt_nmi_enter_prepare(struct pt_regs 
> *regs, struct interrupt_nmi_state *state)
>   {
>   #ifdef CONFIG_PPC64
>   state->irq_soft_mask = local_paca->irq_soft_mask;
> @@ -340,7 +340,7 @@ static inline void interrupt_nmi_enter_prepare(struct 
> pt_regs *regs, struct inte
>   nmi_enter();
>   }
>   
> -static inline void interrupt_nmi_exit_prepare(struct pt_regs *regs, struct 
> interrupt_nmi_state *state)
> +static __always_inline void interrupt_nmi_exit_prepare(struct pt_regs *regs, 
> struct interrupt_nmi_state *state)
>   {
>   if (mfmsr() & MSR_DR) {
>   // nmi_exit if relocations are on


Re: [PATCH 2/2] powerpc/64: Only warn for kuap locked when KCSAN not present

2024-04-04 Thread Christophe Leroy


Le 04/04/2024 à 06:45, Rohan McLure a écrit :
> Arbitrary instrumented locations, including syscall handlers, can call
> arch_local_irq_restore() transitively when KCSAN is enabled, and in turn
> also replay_soft_interrupts_irqrestore(). The precondition on entry to
> this routine that is checked is that KUAP is enabled (user access
> prohibited). Failure to meet this condition only triggers a warning
> however, and afterwards KUAP is enabled anyway. That is, KUAP being
> disabled on entry is in fact permissible, but not possible on an
> uninstrumented kernel.
> 
> Disable this assertion only when KCSAN is enabled.

Please elaborate on that arbitrary transitive call to 
arch_local_irq_restore(): when does it happen and why, and why only when 
KCSAN is enabled.

I don't understand the reasoning: if it is permissible as you say, just 
drop the warning. If the warning is there, it should stay also with 
KCSAN. You should fix the root cause instead.

> 
> Suggested-by: Nicholas Piggin 
> Signed-off-by: Rohan McLure 
> ---
>   arch/powerpc/kernel/irq_64.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/irq_64.c b/arch/powerpc/kernel/irq_64.c
> index d5c48d1b0a31..18b2048389a2 100644
> --- a/arch/powerpc/kernel/irq_64.c
> +++ b/arch/powerpc/kernel/irq_64.c
> @@ -189,7 +189,8 @@ static inline __no_kcsan void 
> replay_soft_interrupts_irqrestore(void)
>* and re-locking AMR but we shouldn't get here in the first place,
>* hence the warning.
>*/
> - kuap_assert_locked();
> + if (!IS_ENABLED(CONFIG_KCSAN))
> + kuap_assert_locked();
>   
>   if (kuap_state != AMR_KUAP_BLOCKED)
>   set_kuap(AMR_KUAP_BLOCKED);


Re: [RFC PATCH 1/8] mm: Provide pagesize to pmd_populate()

2024-04-03 Thread Christophe Leroy


Le 27/03/2024 à 17:57, Jason Gunthorpe a écrit :
> On Wed, Mar 27, 2024 at 09:58:35AM +0000, Christophe Leroy wrote:
>>> Just general remarks on the ones with huge pages:
>>>
>>>hash 64k and hugepage 16M/16G
>>>radix 64k/radix hugepage 2M/1G
>>>radix 4k/radix hugepage 2M/1G
>>>nohash 32
>>> - I think this is just a normal x86 like scheme? PMD/PUD can be a
>>>   leaf with the same size as a next level table.
>>>
>>>   Do any of these cases need to know the higher level to parse the
>>>   lower? eg is there a 2M bit in the PUD indicating that the PMD
>>>   is a table of 2M leafs or does each PMD entry have a bit
>>>   indicating it is a leaf?
>>
>> For hash and radix there is a bit that tells it is leaf (_PAGE_PTE)
>>
>> For nohash32/e500 I think the drawing is not full right, there is a huge
>> page directory (hugepd) with a single entry. I think it should be
>> possible to change it to a leaf entry, it seems we have bit _PAGE_SW1
>> available in the PTE.
> 
> It sounds to me like PPC breaks down into only a couple fundamental
> behaviors
>   - x86 like leaf in many page levels. Use the pgd/pud/pmd_leaf() and
> related to implement it
>   - ARM like contig PTE within a single page table level. Use the
> contig sutff to implement it
>   - Contig PTE across two page table levels with a bit in the
> PMD. Needs new support like you showed
>   - Page table levels with a variable page size. Ie a PUD can point to
> a directory of 8 pages or 512 pages of different size. Probably
> needs some new core support, but I think your changes to the
> *_offset go a long way already.
> 
>>>
>>>hash 4k and hugepage 16M/16G
>>>nohash 64
>>> - How does this work? I guess since 8xx explicitly calls out
>>>   consecutive this is actually the pgd can point to 512 256M
>>>   entries or 8 16G entries? Ie the table size at each level is
>>>   varable? Or is it the same and the table size is still 512 and
>>>   each 16G entry is replicated 64 times?
>>
>> For those it is using the huge page directory (hugepd) which can be
>> hooked at any level and is a directory of huge pages on its own. There
>> is no consecutive entries involved here I think, allthough I'm not
>> completely sure.
>>
>> For hash4k I'm not sure how it works, this was changed by commit
>> e2b3d202d1db ("powerpc: Switch 16GB and 16MB explicit hugepages to a
>> different page table format")
>>
>> For the nohash/64, a PGD entry points either to a regular PUD directory
>> or to a HUGEPD directory. The size of the HUGEPD directory is encoded in
>> the 6 lower bits of the PGD entry.
> 
> If it is a software walker there might be value in just aligning to
> the contig pte scheme in all levels and forgetting about the variable
> size page table levels. That quarter page stuff is a PITA to manage
> the memory allocation for on PPC anyhow..

Looking one step further, into nohash/32, I see a challenge: on that 
platform, a PTE is 64 bits while a PGD/PMD entry is 32 bits. It is 
therefore not possible as such to do PMD leaf or cont-PMD leaf.

I see two possible solutions:
- Double the size of PGD/PMD entries, but then we lose atomicity when 
reading or writing an entry; could this be a problem?
- Do as for the 8xx, ie go down to PTEs even for pages greater than 4M.

Any thoughts?
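For the second option, a hedged sketch of the idea (names illustrative): 
an 8M mapping is written as nr identical 64-bit PTEs, which the existing 
PTE accessors already know how to handle, so PGD/PMD entries can stay 
32 bits and keep their single-store atomicity:

static void pte_replicate(pte_t *ptep, pte_t entry, unsigned int nr)
{
	unsigned int i;

	for (i = 0; i < nr; i++)
		ptep[i] = entry;
}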

Christophe


Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback

2024-04-03 Thread Christophe Leroy


Le 03/04/2024 à 15:07, Jason Gunthorpe a écrit :
> On Wed, Apr 03, 2024 at 12:26:43PM +0000, Christophe Leroy wrote:
>>
>>
>> Le 03/04/2024 à 14:08, Jason Gunthorpe a écrit :
>>> On Tue, Apr 02, 2024 at 07:35:45PM -0400, Peter Xu wrote:
>>>> On Tue, Apr 02, 2024 at 07:53:20PM -0300, Jason Gunthorpe wrote:
>>>>> On Tue, Apr 02, 2024 at 06:43:56PM -0400, Peter Xu wrote:
>>>>>
>>>>>> I actually tested this without hitting the issue (even though I didn't
>>>>>> mention it in the cover letter..).  I re-kicked the build test, it turns
>>>>>> out my "make alldefconfig" on loongarch will generate a config with both
>>>>>> HUGETLB=n && THP=n, while arch/loongarch/configs/loongson3_defconfig has
>>>>>> THP=y (which I assume was the one above build used).  I didn't further
>>>>>> check how "make alldefconfig" generated the config; a bit surprising that
>>>>>> it didn't fetch from there.
>>>>>
>>>>> I suspect it is weird compiler variations.. Maybe something is not
>>>>> being inlined.
>>>>>
>>>>>> (and it also surprises me that this BUILD_BUG can trigger.. I used to try
>>>>>>triggering it elsewhere but failed..)
>>>>>
>>>>> As the pud_leaf() == FALSE should result in the BUILD_BUG never being
>>>>> called and the optimizer removing it.
>>>>
>>>> Good point, for some reason loongarch defined pud_leaf() without defining
>>>> pud_pfn(), which does look strange.
>>>>
>>>> #define pud_leaf(pud)  ((pud_val(pud) & _PAGE_HUGE) != 0)
>>>>
>>>> But I noticed at least MIPS also does it..  Logically I think one arch
>>>> should define either none of both.
>>>
>>> Wow, this is definately an arch issue. You can't define pud_leaf() and
>>> not have a pud_pfn(). It makes no sense at all..
>>>
>>> I'd say the BUILD_BUG has done it's job and found an issue, fix it by
>>> not defining pud_leaf? I don't see any calls to pud_leaf in loongarch
>>> at least
>>
>> As far as I can see it was added by commit 303be4b33562 ("LoongArch: mm:
>> Add p?d_leaf() definitions").
> 
> That commit makes it sounds like the arch supports huge PUD's through
> the hugepte mechanism - it says a LTP test failed so something
> populated a huge PUD at least??

Not sure; I see it more as just a copy/paste of commit 501b81046701 
("mips: mm: add p?d_leaf() definitions").

The commit message says that the test failed because pmd_leaf() is 
missing, it says nothing about PUD.

When looking where _PAGE_HUGE is used in loongarch, I have the 
impression that it is exclusively used at PMD level.

> 
> So maybe this?
> 
> #define pud_pfn pte_pfn
> 
>> Not sure it was added for a good reason, and I'm not sure what was added
>> is correct because arch/loongarch/include/asm/pgtable-bits.h has:
>>
>> #define _PAGE_HUGE_SHIFT	6	/* HUGE is a PMD bit */
>>
>> So I'm not sure it is correct to use that bit for PUD, is it?
> 
> Could be, lots of arches repeat the bit layouts in each radix
> level.. It is essentially why the hugepte trick of pretending every
> level is a pte works.
>   
> Jason


Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback

2024-04-03 Thread Christophe Leroy


Le 03/04/2024 à 14:08, Jason Gunthorpe a écrit :
> On Tue, Apr 02, 2024 at 07:35:45PM -0400, Peter Xu wrote:
>> On Tue, Apr 02, 2024 at 07:53:20PM -0300, Jason Gunthorpe wrote:
>>> On Tue, Apr 02, 2024 at 06:43:56PM -0400, Peter Xu wrote:
>>>
 I actually tested this without hitting the issue (even though I didn't
 mention it in the cover letter..).  I re-kicked the build test, it turns
 out my "make alldefconfig" on loongarch will generate a config with both
 HUGETLB=n && THP=n, while arch/loongarch/configs/loongson3_defconfig has
 THP=y (which I assume was the one above build used).  I didn't further
 check how "make alldefconfig" generated the config; a bit surprising that
 it didn't fetch from there.
>>>
>>> I suspect it is weird compiler variations.. Maybe something is not
>>> being inlined.
>>>
 (and it also surprises me that this BUILD_BUG can trigger.. I used to try
   triggering it elsewhere but failed..)
>>>
>>> As the pud_leaf() == FALSE should result in the BUILD_BUG never being
>>> called and the optimizer removing it.
>>
>> Good point, for some reason loongarch defined pud_leaf() without defining
>> pud_pfn(), which does look strange.
>>
>> #define pud_leaf(pud)((pud_val(pud) & _PAGE_HUGE) != 0)
>>
>> But I noticed at least MIPS also does it..  Logically I think one arch
>> should define either none of both.
> 
> Wow, this is definately an arch issue. You can't define pud_leaf() and
> not have a pud_pfn(). It makes no sense at all..
> 
> I'd say the BUILD_BUG has done it's job and found an issue, fix it by
> not defining pud_leaf? I don't see any calls to pud_leaf in loongarch
> at least

As far as I can see it was added by commit 303be4b33562 ("LoongArch: mm: 
Add p?d_leaf() definitions").

Not sure it was added for a good reason, and I'm not sure what was added 
is correct because arch/loongarch/include/asm/pgtable-bits.h has:

#define _PAGE_HUGE_SHIFT6  /* HUGE is a PMD bit */

So I'm not sure it is correct to use that bit for PUD, is it ?

Probably pud_leaf() should always return false.
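
i.e. something like this for loongarch (untested):

	#define pud_leaf(pud)	false

or simply drop the arch definition so the generic fallback in 
include/linux/pgtable.h takes over.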

Christophe


Re: [PATCH 01/34] powerpc/fsl-soc: hide unused const variable

2024-04-03 Thread Christophe Leroy


Le 03/04/2024 à 10:06, Arnd Bergmann a écrit :
> From: Arnd Bergmann 
> 
> vmpic_msi_feature is only used conditionally, which triggers a rare
> -Werror=unused-const-variable= warning with gcc:
> 
> arch/powerpc/sysdev/fsl_msi.c:567:37: error: 'vmpic_msi_feature' defined but 
> not used [-Werror=unused-const-variable=]
>567 | static const struct fsl_msi_feature vmpic_msi_feature =
> 
> Hide this one in the same #ifdef as the reference so we can turn on
> the warning by default.
> 
> Fixes: 305bcf26128e ("powerpc/fsl-soc: use CONFIG_EPAPR_PARAVIRT for hcalls")
> Signed-off-by: Arnd Bergmann 

Reviewed-by: Christophe Leroy 

> ---
>   arch/powerpc/sysdev/fsl_msi.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/arch/powerpc/sysdev/fsl_msi.c b/arch/powerpc/sysdev/fsl_msi.c
> index 8e6c84df4ca1..e205135ae1fe 100644
> --- a/arch/powerpc/sysdev/fsl_msi.c
> +++ b/arch/powerpc/sysdev/fsl_msi.c
> @@ -564,10 +564,12 @@ static const struct fsl_msi_feature ipic_msi_feature = {
>   .msiir_offset = 0x38,
>   };
>   
> +#ifdef CONFIG_EPAPR_PARAVIRT
>   static const struct fsl_msi_feature vmpic_msi_feature = {
>   .fsl_pic_ip = FSL_PIC_IP_VMPIC,
>   .msiir_offset = 0,
>   };
> +#endif
>   
>   static const struct of_device_id fsl_of_msi_ids[] = {
>   {


Re: [PATCH v3 2/2] powerpc/bpf: enable kfunc call

2024-04-02 Thread Christophe Leroy


Le 02/04/2024 à 12:58, Hari Bathini a écrit :
> Currently, bpf jit code on powerpc assumes all the bpf functions and
> helpers to be kernel text. This is false for kfunc case, as function
> addresses can be module addresses as well. So, ensure module addresses
> are supported to enable kfunc support.
> 
> Emit instructions based on whether the function address is kernel text
> address or module address to retain optimized instruction sequence for
> kernel text address case.
> 
> Also, as bpf programs are always module addresses and a bpf helper can
> be within kernel address as well, using relative addressing often fails
> with "out of range of pcrel address" error. Use unoptimized instruction
> sequence for both kernel and module addresses to work around this, when
> PCREL addressing is used.
> 
> With module addresses supported, override bpf_jit_supports_kfunc_call()
> to enable kfunc support. Since module address offsets can be more than
> 32-bit long on PPC64, override bpf_jit_supports_far_kfunc_call() to
> enable 64-bit pointers.
> 
> Signed-off-by: Hari Bathini 
> ---
> 
> * Changes in v3:
>- Retained optimized instruction sequence when function address is
>  a core kernel address as suggested by Naveen.
>- Used unoptimized instruction sequence for PCREL addressing to
>  avoid out of range errors for core kernel function addresses.
>- Folded patch that adds support for kfunc calls with patch that
>  enables/advertises this support as suggested by Naveen.
> 
> 
>   arch/powerpc/net/bpf_jit_comp.c   | 10 +++
>   arch/powerpc/net/bpf_jit_comp64.c | 48 ---
>   2 files changed, 42 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
> index 0f9a21783329..dc7ffafd7441 100644
> --- a/arch/powerpc/net/bpf_jit_comp.c
> +++ b/arch/powerpc/net/bpf_jit_comp.c
> @@ -359,3 +359,13 @@ void bpf_jit_free(struct bpf_prog *fp)
>   
>   bpf_prog_unlock_free(fp);
>   }
> +
> +bool bpf_jit_supports_kfunc_call(void)
> +{
> + return true;
> +}
> +
> +bool bpf_jit_supports_far_kfunc_call(void)
> +{
> + return IS_ENABLED(CONFIG_PPC64) ? true : false;

You don't need the true/false, the following is enough:

return IS_ENABLED(CONFIG_PPC64);

> +}
> diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
> b/arch/powerpc/net/bpf_jit_comp64.c
> index 7f62ac4b4e65..ec3adf715c55 100644
> --- a/arch/powerpc/net/bpf_jit_comp64.c
> +++ b/arch/powerpc/net/bpf_jit_comp64.c
> @@ -207,24 +207,14 @@ static int bpf_jit_emit_func_call_hlp(u32 *image, 
> struct codegen_context *ctx, u
>   unsigned long func_addr = func ? ppc_function_entry((void *)func) : 0;
>   long reladdr;
>   
> - if (WARN_ON_ONCE(!core_kernel_text(func_addr)))
> + /*
> +  * With the introduction of kfunc feature, BPF helpers can be part of 
> kernel as
> +  * well as module text address.
> +  */
> + if (WARN_ON_ONCE(!kernel_text_address(func_addr)))
>   return -EINVAL;
>   
> - if (IS_ENABLED(CONFIG_PPC_KERNEL_PCREL)) {
> - reladdr = func_addr - CTX_NIA(ctx);
> -
> - if (reladdr >= (long)SZ_8G || reladdr < -(long)SZ_8G) {
> - pr_err("eBPF: address of %ps out of range of pcrel 
> address.\n",
> - (void *)func);
> - return -ERANGE;
> - }
> - /* pla r12,addr */
> - EMIT(PPC_PREFIX_MLS | __PPC_PRFX_R(1) | IMM_H18(reladdr));
> - EMIT(PPC_INST_PADDI | ___PPC_RT(_R12) | IMM_L(reladdr));
> - EMIT(PPC_RAW_MTCTR(_R12));
> - EMIT(PPC_RAW_BCTR());
> -
> - } else {
> + if (core_kernel_text(func_addr) && 
> !IS_ENABLED(CONFIG_PPC_KERNEL_PCREL)) {
>   reladdr = func_addr - kernel_toc_addr();
>   if (reladdr > 0x7FFF || reladdr < -(0x8000L)) {
>   pr_err("eBPF: address of %ps out of range of 
> kernel_toc.\n", (void *)func);
> @@ -235,6 +225,32 @@ static int bpf_jit_emit_func_call_hlp(u32 *image, struct 
> codegen_context *ctx, u
>   EMIT(PPC_RAW_ADDI(_R12, _R12, PPC_LO(reladdr)));
>   EMIT(PPC_RAW_MTCTR(_R12));
>   EMIT(PPC_RAW_BCTRL());
> + } else {
> + if (IS_ENABLED(CONFIG_PPC64_ELF_ABI_V1)) {
> + /* func points to the function descriptor */
> + PPC_LI64(bpf_to_ppc(TMP_REG_2), func);
> + /* Load actual entry point from function descriptor */
> + EMIT(PPC_RAW_LD(bpf_to_ppc(TMP_REG_1), 
> bpf_to_ppc(TMP_REG_2), 0));
> + /* ... and move it to CTR */
> + EMIT(PPC_RAW_MTCTR(bpf_to_ppc(TMP_REG_1)));
> + /*
> +  * Load TOC from function descriptor at offset 8.
> +  * We can clobber r2 since we get called through a
> +  * function pointer (so 

Re: [PATCH v3 1/2] powerpc64/bpf: fix tail calls for PCREL addressing

2024-04-02 Thread Christophe Leroy


Le 02/04/2024 à 12:58, Hari Bathini a écrit :
> With PCREL addressing, there is no kernel TOC. So, it is not setup in
> prologue when PCREL addressing is used. But the number of instructions
> to skip on a tail call was not adjusted accordingly. That resulted in
> not so obvious failures while using tailcalls. 'tailcalls' selftest
> crashed the system with the below call trace:
> 
>bpf_test_run+0xe8/0x3cc (unreliable)
>bpf_prog_test_run_skb+0x348/0x778
>__sys_bpf+0xb04/0x2b00
>sys_bpf+0x28/0x38
>system_call_exception+0x168/0x340
>system_call_vectored_common+0x15c/0x2ec
> 
> Fixes: 7e3a68be42e1 ("powerpc/64: vmlinux support building with PCREL 
> addresing")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Hari Bathini 
> ---
> 
> * Changes in v3:
>- New patch to fix tailcall issues with PCREL addressing.
> 
> 
>   arch/powerpc/net/bpf_jit_comp64.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
> b/arch/powerpc/net/bpf_jit_comp64.c
> index 79f23974a320..7f62ac4b4e65 100644
> --- a/arch/powerpc/net/bpf_jit_comp64.c
> +++ b/arch/powerpc/net/bpf_jit_comp64.c
> @@ -285,8 +285,10 @@ static int bpf_jit_emit_tail_call(u32 *image, struct 
> codegen_context *ctx, u32 o
>   int b2p_index = bpf_to_ppc(BPF_REG_3);
>   int bpf_tailcall_prologue_size = 8;
>   
> +#ifndef CONFIG_PPC_KERNEL_PCREL

Any reason for not using IS_ENABLED(CONFIG_PPC_KERNEL_PCREL) ?

>   if (IS_ENABLED(CONFIG_PPC64_ELF_ABI_V2))
>   bpf_tailcall_prologue_size += 4; /* skip past the toc load */
> +#endif
>   
>   /*
>* if (index >= array->map.max_entries)
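
For reference, the IS_ENABLED() form being asked about could look like 
this (untested sketch combining the two hunks above):

	if (!IS_ENABLED(CONFIG_PPC_KERNEL_PCREL) && IS_ENABLED(CONFIG_PPC64_ELF_ABI_V2))
		bpf_tailcall_prologue_size += 4; /* skip past the toc load */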


Re: [PATCH v11 00/11] Support page table check PowerPC

2024-03-29 Thread Christophe Leroy


Le 28/03/2024 à 08:57, Christophe Leroy a écrit :
> 
> 
> Le 28/03/2024 à 07:52, Christophe Leroy a écrit :
>>
>>
>> Le 28/03/2024 à 05:55, Rohan McLure a écrit :
>>> Support page table check on all PowerPC platforms. This works by
>>> serialising assignments, reassignments and clears of page table
>>> entries at each level in order to ensure that anonymous mappings
>>> have at most one writable consumer, and likewise that file-backed
>>> mappings are not simultaneously also anonymous mappings.
>>>
>>> In order to support this infrastructure, a number of stubs must be
>>> defined for all powerpc platforms. Additionally, seperate set_pte_at()
>>> and set_pte_at_unchecked(), to allow for internal, uninstrumented 
>>> mappings.
>>
>> I gave it a try on QEMU e500 (64 bits), and get the following Oops. 
>> What do I have to look for ?
>>
>> Freeing unused kernel image (initmem) memory: 2588K
>> This architecture does not have kernel memory protection.
>> Run /init as init process
>> [ cut here ]
>> kernel BUG at mm/page_table_check.c:119!
>> Oops: Exception in kernel mode, sig: 5 [#1]
>> BE PAGE_SIZE=4K SMP NR_CPUS=32 QEMU e500
> 
> Same problem on my 8xx board:
> 
> [    7.358146] Freeing unused kernel image (initmem) memory: 448K
> [    7.363957] Run /init as init process
> [    7.370955] [ cut here ]
> [    7.375411] kernel BUG at mm/page_table_check.c:119!
> [    7.380393] Oops: Exception in kernel mode, sig: 5 [#1]
> [    7.385621] BE PAGE_SIZE=16K PREEMPT CMPC885

Both problems are fixed by the following change:

diff --git a/arch/powerpc/include/asm/nohash/pgtable.h 
b/arch/powerpc/include/asm/nohash/pgtable.h
index 413d01a51e6f..5b932632a5d7 100644
--- a/arch/powerpc/include/asm/nohash/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/pgtable.h
@@ -29,6 +29,8 @@ static inline pte_basic_t pte_update(struct mm_struct 
*mm, unsigned long addr, p

  #ifndef __ASSEMBLY__

+#include <linux/page_table_check.h>
+
  extern int icache_44x_need_flush;

  /*
@@ -92,7 +94,11 @@ static inline void ptep_set_wrprotect(struct 
mm_struct *mm, unsigned long addr,
  static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned 
long addr,
   pte_t *ptep)
  {
-   return __pte(pte_update(mm, addr, ptep, ~0UL, 0, 0));
+   pte_t old_pte = __pte(pte_update(mm, addr, ptep, ~0UL, 0, 0));
+
+   page_table_check_pte_clear(mm, addr, old_pte);
+
+   return old_pte;
  }
  #define __HAVE_ARCH_PTEP_GET_AND_CLEAR




Re: [PATCH v11 00/11] Support page table check PowerPC

2024-03-28 Thread Christophe Leroy


Le 28/03/2024 à 07:52, Christophe Leroy a écrit :
> 
> 
> Le 28/03/2024 à 05:55, Rohan McLure a écrit :
>> Support page table check on all PowerPC platforms. This works by
>> serialising assignments, reassignments and clears of page table
>> entries at each level in order to ensure that anonymous mappings
>> have at most one writable consumer, and likewise that file-backed
>> mappings are not simultaneously also anonymous mappings.
>>
>> In order to support this infrastructure, a number of stubs must be
>> defined for all powerpc platforms. Additionally, seperate set_pte_at()
>> and set_pte_at_unchecked(), to allow for internal, uninstrumented 
>> mappings.
> 
> I gave it a try on QEMU e500 (64 bits), and get the following Oops. What 
> do I have to look for ?
> 
> Freeing unused kernel image (initmem) memory: 2588K
> This architecture does not have kernel memory protection.
> Run /init as init process
> [ cut here ]
> kernel BUG at mm/page_table_check.c:119!
> Oops: Exception in kernel mode, sig: 5 [#1]
> BE PAGE_SIZE=4K SMP NR_CPUS=32 QEMU e500

Same problem on my 8xx board:

[7.358146] Freeing unused kernel image (initmem) memory: 448K
[7.363957] Run /init as init process
[7.370955] [ cut here ]
[7.375411] kernel BUG at mm/page_table_check.c:119!
[7.380393] Oops: Exception in kernel mode, sig: 5 [#1]
[7.385621] BE PAGE_SIZE=16K PREEMPT CMPC885
[7.393483] CPU: 0 PID: 1 Comm: init Not tainted 
6.8.0-s3k-dev-13737-g8d9e247585fb #787
[7.401505] Hardware name: MIAE 8xx 0x50 CMPC885
[7.406481] NIP:  c0183278 LR: c018316c CTR: 0001
[7.411541] REGS: c902bb20 TRAP: 0700   Not tainted 
(6.8.0-s3k-dev-13737-g8d9e247585fb)
[7.419657] MSR:  00029032   CR: 35055355  XER: 80007100
[7.426550]
[7.426550] GPR00: c018316c c902bbe0 c2118000 c7f7a0d8 7fab8000 
c23b5ae0 c902bc20 0002
[7.426550] GPR08: c11a c7f7a0d8 c11143e0  95003355 
 c0004a38 c23a0a00
[7.426550] GPR16: 4000 7fffc000 8000 c23a0a00 0001 
7fab8000 ffabc000 8000
[7.426550] GPR24: 7fffc000 c33be1c0 4000 c902bc20 7fab8000 
0001 c7fb0360 
[7.463291] NIP [c0183278] __page_table_check_ptes_set+0x1c8/0x210
[7.469491] LR [c018316c] __page_table_check_ptes_set+0xbc/0x210
[7.475514] Call Trace:
[7.477957] [c902bbe0] [c018316c] 
__page_table_check_ptes_set+0xbc/0x210 (unreliable)
[7.485809] [c902bc00] [c0012474] set_ptes+0x148/0x16c
[7.490958] [c902bc50] [c0158a3c] move_page_tables+0x228/0x578
[7.496806] [c902bcf0] [c0192f38] shift_arg_pages+0xf0/0x1d4
[7.502479] [c902bd90] [c0193b40] setup_arg_pages+0x1c8/0x36c
[7.508238] [c902be40] [c01f51a0] load_elf_binary+0x3c0/0x1218
[7.514086] [c902beb0] [c01934b0] bprm_execve+0x21c/0x4a4
[7.519497] [c902bf00] [c019516c] kernel_execve+0x13c/0x200
[7.525082] [c902bf20] [c0004aa8] kernel_init+0x70/0x1b0
[7.530406] [c902bf30] [c00111e4] ret_from_kernel_user_thread+0x10/0x18
[7.537038] --- interrupt: 0 at 0x0
[7.540534] Code: 39290004 7ce04828 30e70001 7ce0492d 40a2fff4 
2c07 4080ff94 0fe0 0fe0 0fe0 2c1f 4082ff80 
<0fe0> 0fe0 392a 4bfffef8
[7.556068] ---[ end trace  ]---
[7.560692]
[8.531997] note: init[1] exited with irqs disabled
[8.536891] note: init[1] exited with preempt_count 1
[8.542032] Kernel panic - not syncing: Attempted to kill init! 
exitcode=0x0005
[8.549602] Rebooting in 180 seconds..


Re: [PATCH v11 00/11] Support page table check PowerPC

2024-03-28 Thread Christophe Leroy


Le 28/03/2024 à 05:55, Rohan McLure a écrit :
> Support page table check on all PowerPC platforms. This works by
> serialising assignments, reassignments and clears of page table
> entries at each level in order to ensure that anonymous mappings
> have at most one writable consumer, and likewise that file-backed
> mappings are not simultaneously also anonymous mappings.
> 
> In order to support this infrastructure, a number of stubs must be
> defined for all powerpc platforms. Additionally, seperate set_pte_at()
> and set_pte_at_unchecked(), to allow for internal, uninstrumented mappings.

I gave it a try on QEMU e500 (64 bits), and get the following Oops. What 
do I have to look for ?

Freeing unused kernel image (initmem) memory: 2588K
This architecture does not have kernel memory protection.
Run /init as init process
[ cut here ]
kernel BUG at mm/page_table_check.c:119!
Oops: Exception in kernel mode, sig: 5 [#1]
BE PAGE_SIZE=4K SMP NR_CPUS=32 QEMU e500
Modules linked in:
CPU: 0 PID: 1 Comm: init Not tainted 6.8.0-13732-gc5347beead0b-dirty #784
Hardware name: QEMU ppce500 e5500 0x80240020 QEMU e500
NIP:  c02951a0 LR: c02951bc CTR: 
REGS: c32e7440 TRAP: 0700   Not tainted 
(6.8.0-13732-gc5347beead0b-dirty)
MSR:  80029002   CR: 24044248  XER: 
IRQMASK: 0
GPR00: c0029d90 c32e76e0 c0d44000 c3017e18
GPR04: ffb11000 c7f16888 000fc324123d 
GPR08:  0001 c1184000 84004248
GPR12: 00c0 c11b9000 c7f16888 c7f19000
GPR16: 1000 3000  
GPR20: 4000  0001 c000ffb12000
GPR24: c7f19000 c6008000 c6008000 0030
GPR28: 0001 c118afe8 c3017e18 0001
NIP [c02951a0] __page_table_check_ptes_set+0x210/0x2ac
LR [c02951bc] __page_table_check_ptes_set+0x22c/0x2ac
Call Trace:
[c32e76e0] [c32e7790] 0xc32e7790 (unreliable)
[c32e7730] [c0029d90] set_ptes+0x178/0x210
[c32e7790] [c024c72c] move_page_tables+0x298/0x750
[c32e7870] [c02a944c] shift_arg_pages+0x120/0x254
[c32e79a0] [c02a9f94] setup_arg_pages+0x244/0x418
[c32e7b30] [c0331610] load_elf_binary+0x584/0x17d4
[c32e7c30] [c02aa3e8] bprm_execve+0x280/0x704
[c32e7d00] [c02ac158] kernel_execve+0x16c/0x214
[c32e7d50] [c00011c8] run_init_process+0x100/0x168
[c32e7de0] [c000214c] kernel_init+0x84/0x1f8
[c32e7e50] [c594] ret_from_kernel_user_thread+0x14/0x1c
--- interrupt: 0 at 0x0
Code: 81230004 7d2907b4 0b09 7c0004ac 7d201828 31290001 7d20192d 
40c2fff4 7c0004ac 2c090002 3920 7d29e01e <0b09> e93d 
37ff 7fde4a14
---[ end trace  ]---

note: init[1] exited with irqs disabled
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0005
Rebooting in 180 seconds..


Re: [PATCH v11 09/11] poweprc: mm: Implement *_user_accessible_page() for ptes

2024-03-27 Thread Christophe Leroy


Le 28/03/2024 à 05:55, Rohan McLure a écrit :
> Page table checking depends on architectures providing an
> implementation of p{te,md,ud}_user_accessible_page. With
> refactorisations made on powerpc/mm, the pte_access_permitted() and
> similar methods verify whether a userland page is accessible with the
> required permissions.
> 
> Since page table checking is the only user of
> p{te,md,ud}_user_accessible_page(), implement these for all platforms,
> using some of the same preliminary checks taken by pte_access_permitted()
> on that platform.
> 
> Since Commit 8e9bd41e4ce1 ("powerpc/nohash: Replace pte_user() by pte_read()")
> pte_user() is no longer required to be present on all platforms as it
> may be equivalent to or implied by pte_read(). Hence implementations of
> pte_user_accessible_page() are specialised.
> 
> Signed-off-by: Rohan McLure 
> ---
> v9: New implementation
> v10: Let book3s/64 use pte_user(), but otherwise default other platforms
> to using the address provided with the call to infer whether it is a
> user page or not. pmd/pud variants will warn on all other platforms, as
> they should not be used for user page mappings
> v11: Conditionally define p{m,u}d_user_accessible_page(), as not all
> platforms have p{m,u}d_leaf(), p{m,u}d_pte() stubs.

See my comment to v10 patch 10.

p{m,u}d_leaf() is defined for all platforms (there is a fallback 
definition in include/linux/pgtable.h), so p{m,u}d_user_accessible_page() 
can be defined for all platforms; there is no need for a conditional define.
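
For reference, the generic fallbacks look roughly like this (simplified 
from include/linux/pgtable.h), so the helpers can rely on p{m,u}d_leaf() 
being available everywhere:

	#ifndef pmd_leaf
	#define pmd_leaf(x)	false
	#endif
	#ifndef pud_leaf
	#define pud_leaf(x)	false
	#endif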

> ---
>   arch/powerpc/include/asm/book3s/32/pgtable.h |  5 +
>   arch/powerpc/include/asm/book3s/64/pgtable.h | 17 +
>   arch/powerpc/include/asm/nohash/pgtable.h|  5 +
>   arch/powerpc/include/asm/pgtable.h   |  8 
>   4 files changed, 35 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h 
> b/arch/powerpc/include/asm/book3s/32/pgtable.h
> index 52971ee30717..83f7b98ef49f 100644
> --- a/arch/powerpc/include/asm/book3s/32/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
> @@ -436,6 +436,11 @@ static inline bool pte_access_permitted(pte_t pte, bool 
> write)
>   return true;
>   }
>   
> +static inline bool pte_user_accessible_page(pte_t pte, unsigned long addr)
> +{
> + return pte_present(pte) && !is_kernel_addr(addr);
> +}
> +
>   /* Conversion functions: convert a page and protection to a page entry,
>* and a page entry and page directory to the page they refer to.
>*
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index fac5615e6bc5..d8640ddbcad1 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -538,6 +538,11 @@ static inline bool pte_access_permitted(pte_t pte, bool 
> write)
>   return arch_pte_access_permitted(pte_val(pte), write, 0);
>   }
>   
> +static inline bool pte_user_accessible_page(pte_t pte, unsigned long addr)
> +{
> + return pte_present(pte) && pte_user(pte);
> +}
> +
>   /*
>* Conversion functions: convert a page and protection to a page entry,
>* and a page entry and page directory to the page they refer to.
> @@ -1441,5 +1446,17 @@ static inline bool pud_leaf(pud_t pud)
>   return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
>   }
>   
> +#define pmd_user_accessible_page pmd_user_accessible_page
> +static inline bool pmd_user_accessible_page(pmd_t pmd, unsigned long addr)
> +{
> + return pmd_leaf(pmd) && pte_user_accessible_page(pmd_pte(pmd), addr);
> +}
> +
> +#define pud_user_accessible_page pud_user_accessible_page
> +static inline bool pud_user_accessible_page(pud_t pud, unsigned long addr)
> +{
> + return pud_leaf(pud) && pte_user_accessible_page(pud_pte(pud), addr);
> +}
> +
>   #endif /* __ASSEMBLY__ */
>   #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
> diff --git a/arch/powerpc/include/asm/nohash/pgtable.h 
> b/arch/powerpc/include/asm/nohash/pgtable.h
> index 427db14292c9..413d01a51e6f 100644
> --- a/arch/powerpc/include/asm/nohash/pgtable.h
> +++ b/arch/powerpc/include/asm/nohash/pgtable.h
> @@ -213,6 +213,11 @@ static inline bool pte_access_permitted(pte_t pte, bool 
> write)
>   return true;
>   }
>   
> +static inline bool pte_user_accessible_page(pte_t pte, unsigned long addr)
> +{
> + return pte_present(pte) && !is_kernel_addr(addr);
> +}
> +
>   /* Conversion functions: convert a page and protection to a page entry,
>* and a page entry and page directory to the page they refer to.
>*
> diff --git a/arch/powerpc/include/asm/pgtable.h 
> b/arch/powerpc/include/asm/pgtable.h
> index ee8c82c0528f..f1ceae778cb1 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -219,6 +219,14 @@ static inline int pud_pfn(pud_t pud)
>   }
>   #endif
>   
> +#ifndef pmd_user_accessible_page
> +#define pmd_user_accessible_page(pmd, addr)  false
> +#endif
> +
> +#ifndef 

Re: [PATCH v11 08/11] powerpc: mm: Add pud_pfn() stub

2024-03-27 Thread Christophe Leroy


Le 28/03/2024 à 05:55, Rohan McLure a écrit :
> The page table check feature requires that pud_pfn() be defined
> on each consuming architecture. Since only 64-bit, Book3S platforms
> allow for hugepages at this upper level, and since the calling code is
> gated by a call to pud_user_accessible_page(), which will return zero,
> include this stub as a BUILD_BUG().
> 
> Signed-off-by: Rohan McLure 
> ---
> v11: pud_pfn() stub has been removed upstream as it has valid users now
> in transparent hugepages. Create a BUG_ON() for other, non Book3S64
> platforms.
> ---
>   arch/powerpc/include/asm/pgtable.h | 8 
>   1 file changed, 8 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/pgtable.h 
> b/arch/powerpc/include/asm/pgtable.h
> index 239709a2f68e..ee8c82c0528f 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -211,6 +211,14 @@ static inline bool 
> arch_supports_memmap_on_memory(unsigned long vmemmap_size)
>   
>   #endif /* CONFIG_PPC64 */
>   
> +#ifndef pud_pfn
> +#define pud_pfn pud_pfn
> +static inline int pud_pfn(pud_t pud)
> +{
> + BUILD_BUG();

This function must return something.
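
e.g. something like this (untested):

	static inline int pud_pfn(pud_t pud)
	{
		BUILD_BUG();
		return 0;	/* unreachable, but keeps -Wreturn-type happy */
	}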

> +}
> +#endif
> +
>   #endif /* __ASSEMBLY__ */
>   
>   #endif /* _ASM_POWERPC_PGTABLE_H */


Re: [PATCH] Add static_key_feature_checks_initialized flag

2024-03-27 Thread Christophe Leroy


Le 27/03/2024 à 05:59, Nicholas Miehlbradt a écrit :
> JUMP_LABEL_FEATURE_CHECK_DEBUG used static_key_initialized to determine
> whether {cpu,mmu}_has_feature() was used before static keys were
> initialized. However, {cpu,mmu}_has_feature() should not be used before
> setup_feature_keys() is called. As static_key_initalized is set much
> earlier during boot there is a window in which JUMP_LABEL_FEATURE_CHECK_DEBUG
> will not report errors. Add a flag specifically to indicate when
> {cpu,mmu}_has_feature() is safe to use.

What do you mean by "much earlier" ?

As far as I can see, static_key_initialized is set by jump_label_init(), 
and cpu_feature_keys_init() and mmu_feature_keys_init() are called 
immediately after. I don't think it is possible to do anything in between.

Or maybe you mean the problem is the call to jump_label_init() in 
early_init_devtree() ? You should make it explicit in the message, and 
see if it wouldn't be better to call cpu_feature_keys_init() and 
mmu_feature_keys_init() as well in early_init_devtree() in that case ?
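
i.e. roughly (completely untested sketch of that alternative, inside 
early_init_devtree()):

	jump_label_init();
	cpu_feature_keys_init();
	mmu_feature_keys_init();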

Christophe


Re: [RFC PATCH 1/8] mm: Provide pagesize to pmd_populate()

2024-03-27 Thread Christophe Leroy


Le 26/03/2024 à 16:01, Jason Gunthorpe a écrit :
> On Mon, Mar 25, 2024 at 07:05:01PM +0000, Christophe Leroy wrote:
> 
>> Not looked into details yet, but I guess so.
>>
>> By the way there is a wiki dedicated to huge pages on powerpc, you can
>> have a look at it here :
>> https://github.com/linuxppc/wiki/wiki/Huge-pages , maybe you'll find
>> good ideas there to help me.
> 
> There sure are alot of page tables types here
> 
> I'm a bit wondering about terminology, eg on the first diagram "huge
> pte entry" means a PUD entry that is a leaf? Which ones are contiguous
> replications?

Yes, on the first diagram, a huge pte entry covering the same size as 
pud entry means a leaf PUD entry.

Contiguous replications are only on 8xx for the time being and are 
displayed as "consecutive entries".

> 
> Just general remarks on the ones with huge pages:
> 
>   hash 64k and hugepage 16M/16G
>   radix 64k/radix hugepage 2M/1G
>   radix 4k/radix hugepage 2M/1G
>   nohash 32
>- I think this is just a normal x86 like scheme? PMD/PUD can be a
>  leaf with the same size as a next level table.
> 
>  Do any of these cases need to know the higher level to parse the
>  lower? eg is there a 2M bit in the PUD indicating that the PMD
>  is a table of 2M leafs or does each PMD entry have a bit
>  indicating it is a leaf?

For hash and radix there is a bit that tells it is leaf (_PAGE_PTE)

For nohash32/e500 I think the drawing is not fully right: there is a huge 
page directory (hugepd) with a single entry. I think it should be 
possible to change it to a leaf entry; it seems we have bit _PAGE_SW1 
available in the PTE.

> 
>   hash 4k and hugepage 16M/16G
>   nohash 64
>- How does this work? I guess since 8xx explicitly calls out
>  consecutive this is actually the pgd can point to 512 256M
>  entries or 8 16G entries? Ie the table size at each level is
>  varable? Or is it the same and the table size is still 512 and
>  each 16G entry is replicated 64 times?

For those it is using the huge page directory (hugepd), which can be 
hooked at any level and is a directory of huge pages on its own. There 
are no consecutive entries involved here I think, although I'm not 
completely sure.

For hash4k I'm not sure how it works, this was changed by commit 
e2b3d202d1db ("powerpc: Switch 16GB and 16MB explicit hugepages to a 
different page table format")

For the nohash/64, a PGD entry points either to a regular PUD directory 
or to a HUGEPD directory. The size of the HUGEPD directory is encoded in 
the 6 lower bits of the PGD entry.
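
Roughly, the shift is recovered like this (simplified from the nohash 
hugetlb headers):

	#define HUGEPD_SHIFT_MASK	0x3f

	static inline unsigned int hugepd_shift(hugepd_t hpd)
	{
		return hpd_val(hpd) & HUGEPD_SHIFT_MASK;
	}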

> 
>  Do the offset accessors already abstract this enough?
> 
>   8xx 4K
>   8xx 16K
> - As this series does?

This is how it is prior to the series, ie 16k and 512k pages are 
implemented as contiguous PTEs in a standard page table while 8M pages 
are implemented with hugepd and a single entry in it (with two PGD 
entries pointing to the same huge page directory).
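
Schematically, a 512k mapping with 4k base pages then occupies 
SZ_512K / SZ_4K = 128 consecutive PTE slots, all carrying the same 
value (sketch only, variables assumed):

	for (i = 0; i < SZ_512K / SZ_4K; i++)
		ptep[i] = pte;	/* same entry replicated across the range */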

Christophe


Re: [PATCH v2 3/3] powerpc/code-patching: Restore 32-bit patching performance

2024-03-26 Thread Christophe Leroy


Le 25/03/2024 à 23:48, Benjamin Gray a écrit :
> The new open/close abstraction makes it more difficult for a
> compiler to optimise. This causes 10% worse performance on
> ppc32 as in [1]. Restoring the page alignment mask and inlining
> the helpers allows the compiler to better reason about the address
> alignment, allowing more optimised cache flushing selection.

This should be squashed into patch 1. There is no point in having that 
as a separate patch when in the same series.

> 
> [1]: 
> https://lore.kernel.org/all/77fdcdeb-4af5-4ad0-a4c6-57bf0762d...@csgroup.eu/
> 
> Suggested-by: Christophe Leroy 
> Signed-off-by: Benjamin Gray 
> 
> ---
> 
> v2: * New in v2
> 
> I think Suggested-by is an appropriate tag. The patch is Christophe's
> from the link, I just added the commit description, so it could well
> be better to change the author to Christophe completely.
> ---
>   arch/powerpc/lib/code-patching.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/lib/code-patching.c 
> b/arch/powerpc/lib/code-patching.c
> index b3a644290369..d089da115987 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c
> @@ -282,13 +282,13 @@ struct patch_window {
>* Interrupts must be disabled for the entire duration of the patching. The 
> PIDR
>* is potentially changed during this time.
>*/
> -static int open_patch_window(void *addr, struct patch_window *ctx)
> +static __always_inline int open_patch_window(void *addr, struct patch_window 
> *ctx)
>   {
>   unsigned long pfn = get_patch_pfn(addr);
>   
>   lockdep_assert_irqs_disabled();
>   
> - ctx->text_poke_addr = (unsigned 
> long)__this_cpu_read(cpu_patching_context.addr);
> + ctx->text_poke_addr = (unsigned 
> long)__this_cpu_read(cpu_patching_context.addr) & PAGE_MASK;
>   
>   if (!mm_patch_enabled()) {
>   ctx->ptep = __this_cpu_read(cpu_patching_context.pte);
> @@ -331,7 +331,7 @@ static int open_patch_window(void *addr, struct 
> patch_window *ctx)
>   return 0;
>   }
>   
> -static void close_patch_window(struct patch_window *ctx)
> +static __always_inline void close_patch_window(struct patch_window *ctx)
>   {
>   lockdep_assert_irqs_disabled();
>   


Re: [RFC PATCH 1/8] mm: Provide pagesize to pmd_populate()

2024-03-25 Thread Christophe Leroy


Le 25/03/2024 à 17:19, Jason Gunthorpe a écrit :
> On Mon, Mar 25, 2024 at 03:55:54PM +0100, Christophe Leroy wrote:
>> Unlike many architectures, powerpc 8xx hardware tablewalk requires
>> a two level process for all page sizes, although second level only
>> has one entry when pagesize is 8M.
>>
>> To fit with Linux page table topology and without requiring special
>> page directory layout like hugepd, the page entry will be replicated
>> 1024 times in the standard page table. However for large pages it is
>> necessary to set bits in the level-1 (PMD) entry. At the time being,
>> for 512k pages the flag is kept in the PTE and inserted in the PMD
>> entry at TLB miss exception, that is necessary because we can have
>> pages of different sizes in a page table. However the 12 PTE bits are
>> fully used and there is no room for an additional bit for page size.
>>
>> For 8M pages, there will be only one page per PMD entry, it is
>> therefore possible to flag the pagesize in the PMD entry, with the
>> advantage that the information will already be at the right place for
>> the hardware.
>>
>> To do so, add a new helper called pmd_populate_size() which takes the
>> page size as an additional argument, and modify __pte_alloc() to also
>> take that argument. pte_alloc() is left unmodified in order to
>> reduce churn on callers, and a pte_alloc_size() is added for use by
>> pte_alloc_huge().
>>
>> When an architecture doesn't provide pmd_populate_size(),
>> pmd_populate() is used as a fallback.
> 
> I think it would be a good idea to document what the semantic is
> supposed to be for sz?
> 
> Just a general remark, probably nothing for this, but with these new
> arguments the historical naming seems pretty tortured for
> pte_alloc_size().. Something like pmd_populate_leaf(size) as a naming
> scheme would make this more intuitive. Ie pmd_populate_leaf() gives
> you a PMD entry where the entry points to a leaf page table able to
> store folios of at least size.
> 
> Anyhow, I thought the edits to the mm helpers were fine, certainly
> much nicer than hugepd. Do you see a path to remove hugepd entirely
> from here?

Not looked into details yet, but I guess so.

By the way there is a wiki dedicated to huge pages on powerpc, you can 
have a look at it here : 
https://github.com/linuxppc/wiki/wiki/Huge-pages , maybe you'll find 
good ideas there to help me.

Christophe


[RFC PATCH 8/8] powerpc/8xx: Add back support for 8M pages using contiguous PTE entries

2024-03-25 Thread Christophe Leroy
In order to fit better with standard Linux page tables layout, add
support for 8M pages using contiguous PTE entries in a standard
page table. Page tables will then be populated with 1024 similar
entries and two PMD entries will point to that page table.

The PMD entries also get a flag to indicate they address an 8M page;
this is required for the HW tablewalk assistance.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/hugetlb.h| 11 -
 .../include/asm/nohash/32/hugetlb-8xx.h   | 28 +++-
 arch/powerpc/include/asm/nohash/32/pgalloc.h  |  2 +
 arch/powerpc/include/asm/nohash/32/pte-8xx.h  | 43 +--
 arch/powerpc/include/asm/pgtable.h|  1 +
 arch/powerpc/kernel/head_8xx.S|  1 +
 arch/powerpc/mm/hugetlbpage.c | 12 +-
 arch/powerpc/mm/nohash/8xx.c  | 31 ++---
 arch/powerpc/mm/nohash/tlb.c  |  3 ++
 arch/powerpc/mm/pgtable.c | 24 +++
 arch/powerpc/mm/pgtable_32.c  |  2 +-
 11 files changed, 134 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/hugetlb.h 
b/arch/powerpc/include/asm/hugetlb.h
index a05657e5701b..bd60ea134f8e 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -41,7 +41,16 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned 
long addr,
 static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
 {
-   return __pte(pte_update(mm, addr, ptep, ~0UL, 0, 1));
+   pmd_t *pmdp = (pmd_t *)ptep;
+   pte_t pte;
+
+   if (pmdp == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M))) {
+   pte = __pte(pte_update(mm, addr, pte_offset_kernel(pmdp, 0), 
~0UL, 0, 1));
+   pte_update(mm, addr, pte_offset_kernel(pmdp + 1, 0), ~0UL, 0, 
1);
+   } else {
+   pte = __pte(pte_update(mm, addr, ptep, ~0UL, 0, 1));
+   }
+   return pte;
 }
 
 #define __HAVE_ARCH_HUGE_PTEP_CLEAR_FLUSH
diff --git a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h 
b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
index 178ed9fdd353..1414cfd28987 100644
--- a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
@@ -15,6 +15,16 @@ static inline int check_and_get_huge_psize(int shift)
return shift_to_mmu_psize(shift);
 }
 
+#define __HAVE_ARCH_HUGE_PTEP_GET
+static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, 
pte_t *ptep)
+{
+   pmd_t *pmdp = (pmd_t *)ptep;
+
+   if (pmdp == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M)))
+   ptep = pte_offset_kernel(pmdp, 0);
+   return ptep_get(ptep);
+}
+
 #define __HAVE_ARCH_HUGE_SET_HUGE_PTE_AT
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
 pte_t pte, unsigned long sz);
@@ -23,7 +33,14 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long 
addr, pte_t *ptep,
 static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
  pte_t *ptep, unsigned long sz)
 {
-   pte_update(mm, addr, ptep, ~0UL, 0, 1);
+   pmd_t *pmdp = (pmd_t *)ptep;
+
+   if (pmdp == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M))) {
+   pte_update(mm, addr, pte_offset_kernel(pmdp, 0), ~0UL, 0, 1);
+   pte_update(mm, addr, pte_offset_kernel(pmdp + 1, 0), ~0UL, 0, 
1);
+   } else {
+   pte_update(mm, addr, ptep, ~0UL, 0, 1);
+   }
 }
 
 #define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
@@ -33,7 +50,14 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct 
*mm,
unsigned long clr = ~pte_val(pte_wrprotect(__pte(~0)));
unsigned long set = pte_val(pte_wrprotect(__pte(0)));
 
-   pte_update(mm, addr, ptep, clr, set, 1);
+   pmd_t *pmdp = (pmd_t *)ptep;
+
+   if (pmdp == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M))) {
+   pte_update(mm, addr, pte_offset_kernel(pmdp, 0), clr, set, 1);
+   pte_update(mm, addr, pte_offset_kernel(pmdp + 1, 0), clr, set, 
1);
+   } else {
+   pte_update(mm, addr, ptep, clr, set, 1);
+   }
 }
 
 #ifdef CONFIG_PPC_4K_PAGES
diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h 
b/arch/powerpc/include/asm/nohash/32/pgalloc.h
index 11eac371e7e0..ff4f90cfb461 100644
--- a/arch/powerpc/include/asm/nohash/32/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/32/pgalloc.h
@@ -14,6 +14,7 @@
 #define __pmd_free_tlb(tlb,x,a)do { } while (0)
 /* #define pgd_populate(mm, pmd, pte)  BUG() */
 
+#ifndef CONFIG_PPC_8xx
 static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmdp,
   pte_t *pte)
 {
@@ -31,5 +32,6 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t 
*pmdp,
else
*pmdp = __pmd(__pa(pte_page) | _PMD_USER | _PMD_PRESENT

[RFC PATCH 7/8] powerpc/8xx: Remove support for 8M pages

2024-03-25 Thread Christophe Leroy
Remove support for 8M pages in order to stop using hugepd.

Support for 8M pages will be added back later using the same
approach as for 512k pages, that is, using contiguous page
entries in the regular page table.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/Kconfig  |  1 -
 .../include/asm/nohash/32/hugetlb-8xx.h   | 30 ---
 arch/powerpc/include/asm/nohash/32/pte-8xx.h  | 14 +
 arch/powerpc/include/asm/nohash/pgtable.h |  4 ---
 arch/powerpc/include/asm/page.h   |  5 
 arch/powerpc/kernel/head_8xx.S|  9 +-
 arch/powerpc/mm/hugetlbpage.c |  3 --
 arch/powerpc/mm/nohash/8xx.c  | 28 ++---
 arch/powerpc/mm/nohash/tlb.c  |  3 --
 arch/powerpc/platforms/Kconfig.cputype|  2 ++
 10 files changed, 7 insertions(+), 92 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index a68b9e637eda..74c038cf770c 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -135,7 +135,6 @@ config PPC
select ARCH_HAS_DMA_MAP_DIRECT  if PPC_PSERIES
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
-   select ARCH_HAS_HUGEPD  if HUGETLB_PAGE
select ARCH_HAS_KCOV
select ARCH_HAS_MEMBARRIER_CALLBACKS
select ARCH_HAS_MEMBARRIER_SYNC_CORE
diff --git a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h 
b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
index 92df40c6cc6b..178ed9fdd353 100644
--- a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
@@ -4,42 +4,12 @@
 
 #define PAGE_SHIFT_8M  23
 
-static inline pte_t *hugepd_page(hugepd_t hpd)
-{
-   BUG_ON(!hugepd_ok(hpd));
-
-   return (pte_t *)__va(hpd_val(hpd) & ~HUGEPD_SHIFT_MASK);
-}
-
-static inline unsigned int hugepd_shift(hugepd_t hpd)
-{
-   return PAGE_SHIFT_8M;
-}
-
-static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned long addr,
-   unsigned int pdshift)
-{
-   unsigned long idx = (addr & (SZ_4M - 1)) >> PAGE_SHIFT;
-
-   return hugepd_page(hpd) + idx;
-}
-
 static inline void flush_hugetlb_page(struct vm_area_struct *vma,
  unsigned long vmaddr)
 {
flush_tlb_page(vma, vmaddr);
 }
 
-static inline void hugepd_populate(hugepd_t *hpdp, pte_t *new, unsigned int 
pshift)
-{
-   *hpdp = __hugepd(__pa(new) | _PMD_USER | _PMD_PRESENT | _PMD_PAGE_8M);
-}
-
-static inline void hugepd_populate_kernel(hugepd_t *hpdp, pte_t *new, unsigned 
int pshift)
-{
-   *hpdp = __hugepd(__pa(new) | _PMD_PRESENT | _PMD_PAGE_8M);
-}
-
 static inline int check_and_get_huge_psize(int shift)
 {
return shift_to_mmu_psize(shift);
diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h 
b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
index 07df6b664861..004d7e825af2 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
@@ -142,15 +142,6 @@ static inline void __ptep_set_access_flags(struct 
vm_area_struct *vma, pte_t *pt
 }
 #define __ptep_set_access_flags __ptep_set_access_flags
 
-static inline unsigned long pgd_leaf_size(pgd_t pgd)
-{
-   if (pgd_val(pgd) & _PMD_PAGE_8M)
-   return SZ_8M;
-   return SZ_4M;
-}
-
-#define pgd_leaf_size pgd_leaf_size
-
 static inline unsigned long pte_leaf_size(pmd_t pmd, pte_t pte)
 {
pte_basic_t val = pte_val(pte);
@@ -171,14 +162,11 @@ static inline unsigned long pte_leaf_size(pmd_t pmd, 
pte_t pte)
  * For other page sizes, we have a single entry in the table.
  */
 static pmd_t *pmd_off(struct mm_struct *mm, unsigned long addr);
-static int hugepd_ok(hugepd_t hpd);
 
 static inline int number_of_cells_per_pte(pmd_t *pmd, pte_basic_t val, int 
huge)
 {
if (!huge)
return PAGE_SIZE / SZ_4K;
-   else if (hugepd_ok(*((hugepd_t *)pmd)))
-   return 1;
else if (IS_ENABLED(CONFIG_PPC_4K_PAGES) && !(val & _PAGE_HUGE))
return SZ_16K / SZ_4K;
else
@@ -198,7 +186,7 @@ static inline pte_basic_t pte_update(struct mm_struct *mm, 
unsigned long addr, p
 
for (i = 0; i < num; i += PAGE_SIZE / SZ_4K, new += PAGE_SIZE) {
*entry++ = new;
-   if (IS_ENABLED(CONFIG_PPC_16K_PAGES) && num != 1) {
+   if (IS_ENABLED(CONFIG_PPC_16K_PAGES)) {
*entry++ = new;
*entry++ = new;
*entry++ = new;
diff --git a/arch/powerpc/include/asm/nohash/pgtable.h 
b/arch/powerpc/include/asm/nohash/pgtable.h
index ac3353f7f2ac..c4be7754e96f 100644
--- a/arch/powerpc/include/asm/nohash/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/pgtable.h
@@ -343,12 +343,8 @@ static inline void __set_pte_at(struct mm_struct *mm, 
unsigned long addr,
 #ifdef 

Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

2024-03-25 Thread Christophe Leroy


Le 21/03/2024 à 23:07, pet...@redhat.com a écrit :
> From: Peter Xu 
> 
> v3:
> - Rebased to latest mm-unstable (a824831a082f, of March 21st)
> - Dropped patch to introduce pmd_thp_or_huge(), replace such uses (and also
>pXd_huge() users) with pXd_leaf() [Jason]
> - Add a comment for CONFIG_PGTABLE_HAS_HUGE_LEAVES [Jason]
> - Use IS_ENABLED() in follow_huge_pud() [Jason]
> - Remove redundant none pud check in follow_pud_mask() [Jason]
> 
> rfc: https://lore.kernel.org/r/20231116012908.392077-1-pet...@redhat.com
> v1:  https://lore.kernel.org/r/20231219075538.414708-1-pet...@redhat.com
> v2:  https://lore.kernel.org/r/20240103091423.400294-1-pet...@redhat.com
> 
> The series removes the hugetlb slow gup path after a previous refactor work
> [1], so that slow gup now uses the exact same path to process all kinds of
> memory including hugetlb.
> 
> For the long term, we may want to remove most, if not all, call sites of
> huge_pte_offset().  It'll be ideal if that API can be completely dropped
> from arch hugetlb API.  This series is one small step towards merging
> hugetlb specific codes into generic mm paths.  From that POV, this series
> removes one reference to huge_pte_offset() out of many others.
> 
> One goal of such a route is that we can reconsider merging hugetlb features
> like High Granularity Mapping (HGM).  It was not accepted in the past
> because it may add lots of hugetlb specific codes and make the mm code even
> harder to maintain.  With a merged codeset, features like HGM can hopefully
> share some code with THP, legacy (PMD+) or modern (continuous PTEs).
> 
> To make it work, the generic slow gup code will need to at least understand
> hugepd, which is already done like so in fast-gup.  Due to the specialty of
> hugepd to be software-only solution (no hardware recognizes the hugepd
> format, so it's purely artificial structures), there's chance we can merge
> some or all hugepd formats with cont_pte in the future.  That question is
> yet unsettled from Power side to have an acknowledgement.  As of now for
> this series, I kept the hugepd handling because we may still need to do so
> before getting a clearer picture of the future of hugepd.  The other reason
> is simply that we did it already for fast-gup and most codes are still
> around to be reused.  It'll make more sense to keep slow/fast gup behave
> the same before a decision is made to remove hugepd.
> 

It is not true that hugepd is a software-only solution. Powerpc 8xx HW 
matches the hugepd topology for 8M pages.

Christophe





[RFC PATCH 6/8] powerpc/8xx: Fix size given to set_huge_pte_at()

2024-03-25 Thread Christophe Leroy
set_huge_pte_at() expects the real page size, not the psize, which is
the index of the page definition in the mmu_psize_defs[] table.

Fixes: 935d4f0c6dc8 ("mm: hugetlb: add huge page size param to 
set_huge_pte_at()")
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/nohash/8xx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c
index 6be6421086ed..70b4d807fda5 100644
--- a/arch/powerpc/mm/nohash/8xx.c
+++ b/arch/powerpc/mm/nohash/8xx.c
@@ -94,7 +94,8 @@ static int __ref __early_map_kernel_hugepage(unsigned long 
va, phys_addr_t pa,
return -EINVAL;
 
	set_huge_pte_at(&init_mm, va, ptep,
-   pte_mkhuge(pfn_pte(pa >> PAGE_SHIFT, prot)), psize);
+   pte_mkhuge(pfn_pte(pa >> PAGE_SHIFT, prot)),
+   1UL << mmu_psize_to_shift(psize));
 
return 0;
 }
-- 
2.43.0



[RFC PATCH 5/8] powerpc/mm: Allow hugepages without hugepd

2024-03-25 Thread Christophe Leroy
In preparation for implementing huge pages on powerpc 8xx
without hugepd, enclose hugepd-related code inside an
#ifdef CONFIG_ARCH_HAS_HUGEPD.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/hugetlb.h|  2 ++
 arch/powerpc/include/asm/nohash/pgtable.h |  8 +---
 arch/powerpc/mm/hugetlbpage.c | 10 ++
 arch/powerpc/mm/pgtable.c |  2 ++
 4 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/hugetlb.h 
b/arch/powerpc/include/asm/hugetlb.h
index ea71f7245a63..a05657e5701b 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -30,10 +30,12 @@ static inline int is_hugepage_only_range(struct mm_struct 
*mm,
 }
 #define is_hugepage_only_range is_hugepage_only_range
 
+#ifdef CONFIG_ARCH_HAS_HUGEPD
 #define __HAVE_ARCH_HUGETLB_FREE_PGD_RANGE
 void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
unsigned long end, unsigned long floor,
unsigned long ceiling);
+#endif
 
 #define __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR
 static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
diff --git a/arch/powerpc/include/asm/nohash/pgtable.h 
b/arch/powerpc/include/asm/nohash/pgtable.h
index 427db14292c9..ac3353f7f2ac 100644
--- a/arch/powerpc/include/asm/nohash/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/pgtable.h
@@ -340,7 +340,7 @@ static inline void __set_pte_at(struct mm_struct *mm, 
unsigned long addr,
 
 #define pgprot_writecombine pgprot_noncached_wc
 
-#ifdef CONFIG_HUGETLB_PAGE
+#ifdef CONFIG_ARCH_HAS_HUGEPD
 static inline int hugepd_ok(hugepd_t hpd)
 {
 #ifdef CONFIG_PPC_8xx
@@ -351,6 +351,10 @@ static inline int hugepd_ok(hugepd_t hpd)
 #endif
 }
 
+#define is_hugepd(hpd) (hugepd_ok(hpd))
+#endif
+
+#ifdef CONFIG_HUGETLB_PAGE
 static inline int pmd_huge(pmd_t pmd)
 {
return 0;
@@ -360,8 +364,6 @@ static inline int pud_huge(pud_t pud)
 {
return 0;
 }
-
-#define is_hugepd(hpd) (hugepd_ok(hpd))
 #endif
 
 int map_kernel_page(unsigned long va, phys_addr_t pa, pgprot_t prot);
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 66ac56b26007..db73ad845a2a 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -42,6 +42,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long 
addr, unsigned long s
return __find_linux_pte(mm->pgd, addr, NULL, NULL);
 }
 
+#ifdef CONFIG_ARCH_HAS_HUGEPD
 static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
   unsigned long address, unsigned int pdshift,
   unsigned int pshift, spinlock_t *ptl)
@@ -193,6 +194,13 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct 
vm_area_struct *vma,
 
return hugepte_offset(*hpdp, addr, pdshift);
 }
+#else
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, unsigned long sz)
+{
+   return pte_alloc_huge(mm, pmd_off(mm, addr), addr, sz);
+}
+#endif
 
 #ifdef CONFIG_PPC_BOOK3S_64
 /*
@@ -248,6 +256,7 @@ int __init alloc_bootmem_huge_page(struct hstate *h, int 
nid)
return __alloc_bootmem_huge_page(h, nid);
 }
 
+#ifdef CONFIG_ARCH_HAS_HUGEPD
 #ifndef CONFIG_PPC_BOOK3S_64
 #define HUGEPD_FREELIST_SIZE \
((PAGE_SIZE - sizeof(struct hugepd_freelist)) / sizeof(pte_t))
@@ -505,6 +514,7 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
}
} while (addr = next, addr != end);
 }
+#endif
 
 bool __init arch_hugetlb_valid_size(unsigned long size)
 {
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 9e7ba9c3851f..acdf64c9b93e 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -487,8 +487,10 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
if (!hpdp)
return NULL;
 
+#ifdef CONFIG_ARCH_HAS_HUGEPD
ret_pte = hugepte_offset(*hpdp, ea, pdshift);
pdshift = hugepd_shift(*hpdp);
+#endif
 out:
if (hpage_shift)
*hpage_shift = pdshift;
-- 
2.43.0



[RFC PATCH 4/8] mm: Provide mm_struct and address to huge_ptep_get()

2024-03-25 Thread Christophe Leroy
On powerpc 8xx huge_ptep_get() will need to know whether the given
ptep is a PTE entry or a PMD entry. This cannot be known with the
PMD entry itself because there is no easy way to know it from the
content of the entry.

So huge_ptep_get() will need to know either the size of the page
or get the pmd.

In order to be consistent with huge_ptep_get_and_clear(), give
mm and address to huge_ptep_get().

Signed-off-by: Christophe Leroy 
---
 arch/arm64/include/asm/hugetlb.h |  2 +-
 fs/hugetlbfs/inode.c |  2 +-
 fs/proc/task_mmu.c   |  8 +++---
 fs/userfaultfd.c |  2 +-
 include/asm-generic/hugetlb.h|  2 +-
 include/linux/swapops.h  |  2 +-
 mm/damon/vaddr.c |  6 ++---
 mm/gup.c |  2 +-
 mm/hmm.c |  2 +-
 mm/hugetlb.c | 46 
 mm/memory-failure.c  |  2 +-
 mm/mempolicy.c   |  2 +-
 mm/migrate.c |  4 +--
 mm/mincore.c |  2 +-
 mm/userfaultfd.c |  2 +-
 15 files changed, 43 insertions(+), 43 deletions(-)

diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
index 2ddc33d93b13..1af39a74e791 100644
--- a/arch/arm64/include/asm/hugetlb.h
+++ b/arch/arm64/include/asm/hugetlb.h
@@ -46,7 +46,7 @@ extern pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
 extern void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
   pte_t *ptep, unsigned long sz);
 #define __HAVE_ARCH_HUGE_PTEP_GET
-extern pte_t huge_ptep_get(pte_t *ptep);
+extern pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t 
*ptep);
 
 void __init arm64_hugetlb_cma_reserve(void);
 
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 6502c7e776d1..ec3ec87d29e7 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -425,7 +425,7 @@ static bool hugetlb_vma_maps_page(struct vm_area_struct 
*vma,
if (!ptep)
return false;
 
-   pte = huge_ptep_get(ptep);
+   pte = huge_ptep_get(vma->vm_mm, addr, ptep);
if (huge_pte_none(pte) || !pte_present(pte))
return false;
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 23fbab954c20..b14081bcdafe 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1572,7 +1572,7 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned 
long hmask,
if (vma->vm_flags & VM_SOFTDIRTY)
flags |= PM_SOFT_DIRTY;
 
-   pte = huge_ptep_get(ptep);
+   pte = huge_ptep_get(walk->mm, addr, ptep);
if (pte_present(pte)) {
struct page *page = pte_page(pte);
 
@@ -2256,7 +2256,7 @@ static int pagemap_scan_hugetlb_entry(pte_t *ptep, 
unsigned long hmask,
if (~p->arg.flags & PM_SCAN_WP_MATCHING) {
/* Go the short route when not write-protecting pages. */
 
-   pte = huge_ptep_get(ptep);
+   pte = huge_ptep_get(walk->mm, start, ptep);
categories = p->cur_vma_category | 
pagemap_hugetlb_category(pte);
 
if (!pagemap_scan_is_interesting_page(categories, p))
@@ -2268,7 +2268,7 @@ static int pagemap_scan_hugetlb_entry(pte_t *ptep, 
unsigned long hmask,
i_mmap_lock_write(vma->vm_file->f_mapping);
ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);
 
-   pte = huge_ptep_get(ptep);
+   pte = huge_ptep_get(walk->mm, start, ptep);
categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
 
if (!pagemap_scan_is_interesting_page(categories, p))
@@ -2663,7 +2663,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long 
addr,
 static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
unsigned long addr, unsigned long end, struct mm_walk *walk)
 {
-   pte_t huge_pte = huge_ptep_get(pte);
+   pte_t huge_pte = huge_ptep_get(walk->mm, addr, pte);
struct numa_maps *md;
struct page *page;
 
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 60dcfafdc11a..177fe1ff14d7 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -256,7 +256,7 @@ static inline bool userfaultfd_huge_must_wait(struct 
userfaultfd_ctx *ctx,
goto out;
 
ret = false;
-   pte = huge_ptep_get(ptep);
+   pte = huge_ptep_get(vma->vm_mm, vmf->address, ptep);
 
/*
 * Lockless access: we're in a wait_event so it's ok if it
diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index 6dcf4d576970..594d5905f615 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -144,7 +144,7 @@ static inline int huge_ptep_set_access_flags(struct 
vm_area_struct *vma,
 #endif
 
 #ifndef __HAVE_ARCH_HUGE_PTEP_GET
-static inline pte_t huge_ptep_get(pte_t *ptep)
+static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long a

[RFC PATCH 3/8] mm: Provide pmd to pte_leaf_size()

2024-03-25 Thread Christophe Leroy
On powerpc 8xx, when a page is 8M in size, the information is in the PMD
entry. So provide it to pte_leaf_size().

Signed-off-by: Christophe Leroy 
---
 arch/arm64/include/asm/pgtable.h | 2 +-
 arch/powerpc/include/asm/nohash/32/pte-8xx.h | 2 +-
 arch/riscv/include/asm/pgtable.h | 2 +-
 arch/sparc/include/asm/pgtable_64.h  | 2 +-
 arch/sparc/mm/hugetlbpage.c  | 2 +-
 include/linux/pgtable.h  | 2 +-
 kernel/events/core.c | 2 +-
 7 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index afdd56d26ad7..57c40f2498ab 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -624,7 +624,7 @@ extern pgprot_t phys_mem_access_prot(struct file *file, 
unsigned long pfn,
 #define pmd_bad(pmd)   (!pmd_table(pmd))
 
 #define pmd_leaf_size(pmd) (pmd_cont(pmd) ? CONT_PMD_SIZE : PMD_SIZE)
-#define pte_leaf_size(pte) (pte_cont(pte) ? CONT_PTE_SIZE : PAGE_SIZE)
+#define pte_leaf_size(pmd, pte)(pte_cont(pte) ? CONT_PTE_SIZE : 
PAGE_SIZE)
 
 #if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS < 3
 static inline bool pud_sect(pud_t pud) { return false; }
diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h 
b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
index 137dc3c84e45..07df6b664861 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
@@ -151,7 +151,7 @@ static inline unsigned long pgd_leaf_size(pgd_t pgd)
 
 #define pgd_leaf_size pgd_leaf_size
 
-static inline unsigned long pte_leaf_size(pte_t pte)
+static inline unsigned long pte_leaf_size(pmd_t pmd, pte_t pte)
 {
pte_basic_t val = pte_val(pte);
 
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 20242402fc11..45fa27810f25 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -439,7 +439,7 @@ static inline pte_t pte_mkhuge(pte_t pte)
return pte;
 }
 
-#define pte_leaf_size(pte) (pte_napot(pte) ?   
\
+#define pte_leaf_size(pmd, pte)(pte_napot(pte) ?   
\
napot_cont_size(napot_cont_order(pte)) 
:\
PAGE_SIZE)
 
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 4d1bafaba942..67063af2ff8f 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -1175,7 +1175,7 @@ extern unsigned long pud_leaf_size(pud_t pud);
 extern unsigned long pmd_leaf_size(pmd_t pmd);
 
 #define pte_leaf_size pte_leaf_size
-extern unsigned long pte_leaf_size(pte_t pte);
+extern unsigned long pte_leaf_size(pmd_t pmd, pte_t pte);
 
 #endif /* CONFIG_HUGETLB_PAGE */
 
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 5a342199e837..60c845a15bee 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -276,7 +276,7 @@ static unsigned long huge_tte_to_size(pte_t pte)
 
unsigned long pud_leaf_size(pud_t pud) { return 1UL << tte_to_shift(*(pte_t *)&pud); }
unsigned long pmd_leaf_size(pmd_t pmd) { return 1UL << tte_to_shift(*(pte_t *)&pmd); }
-unsigned long pte_leaf_size(pte_t pte) { return 1UL << tte_to_shift(pte); }
+unsigned long pte_leaf_size(pmd_t pmd, pte_t pte) { return 1UL << tte_to_shift(pte); }
 
 pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, unsigned long sz)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 85fc7554cd52..e605a4149fc7 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1802,7 +1802,7 @@ typedef unsigned int pgtbl_mod_mask;
 #define pmd_leaf_size(x) PMD_SIZE
 #endif
 #ifndef pte_leaf_size
-#define pte_leaf_size(x) PAGE_SIZE
+#define pte_leaf_size(x, y) PAGE_SIZE
 #endif
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 724e6d7e128f..5c1c083222b2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7585,7 +7585,7 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, 
unsigned long addr)
 
pte = ptep_get_lockless(ptep);
if (pte_present(pte))
-   size = pte_leaf_size(pte);
+   size = pte_leaf_size(pmd, pte);
pte_unmap(ptep);
 #endif /* CONFIG_HAVE_FAST_GUP */
 
-- 
2.43.0



[RFC PATCH 2/8] mm: Provide page size to pte_alloc_huge()

2024-03-25 Thread Christophe Leroy
In order to be able to flag the PMD entry with _PMD_HUGE_8M on
powerpc 8xx, provide page size to pte_alloc_huge() and use it
through the newly introduced pte_alloc_size().

Signed-off-by: Christophe Leroy 
---
 arch/arm64/mm/hugetlbpage.c   | 2 +-
 arch/parisc/mm/hugetlbpage.c  | 2 +-
 arch/powerpc/mm/hugetlbpage.c | 2 +-
 arch/riscv/mm/hugetlbpage.c   | 2 +-
 arch/sh/mm/hugetlbpage.c  | 2 +-
 arch/sparc/mm/hugetlbpage.c   | 2 +-
 include/linux/hugetlb.h   | 4 ++--
 7 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 0f0e10bb0a95..71161c655fd6 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -289,7 +289,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct 
vm_area_struct *vma,
return NULL;
 
WARN_ON(addr & (sz - 1));
-   ptep = pte_alloc_huge(mm, pmdp, addr);
+   ptep = pte_alloc_huge(mm, pmdp, addr, sz);
} else if (sz == PMD_SIZE) {
if (want_pmd_share(vma, addr) && pud_none(READ_ONCE(*pudp)))
ptep = huge_pmd_share(mm, vma, addr, pudp);
diff --git a/arch/parisc/mm/hugetlbpage.c b/arch/parisc/mm/hugetlbpage.c
index a9f7e21f6656..2f4c6b440710 100644
--- a/arch/parisc/mm/hugetlbpage.c
+++ b/arch/parisc/mm/hugetlbpage.c
@@ -66,7 +66,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct 
vm_area_struct *vma,
if (pud) {
pmd = pmd_alloc(mm, pud, addr);
if (pmd)
-   pte = pte_alloc_huge(mm, pmd, addr);
+   pte = pte_alloc_huge(mm, pmd, addr, sz);
}
return pte;
 }
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 594a4b7b2ca2..66ac56b26007 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -183,7 +183,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct 
vm_area_struct *vma,
return NULL;
 
if (IS_ENABLED(CONFIG_PPC_8xx) && pshift < PMD_SHIFT)
-   return pte_alloc_huge(mm, (pmd_t *)hpdp, addr);
+   return pte_alloc_huge(mm, (pmd_t *)hpdp, addr, sz);
 
BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));
 
diff --git a/arch/riscv/mm/hugetlbpage.c b/arch/riscv/mm/hugetlbpage.c
index 5ef2a6891158..dc77a58c6321 100644
--- a/arch/riscv/mm/hugetlbpage.c
+++ b/arch/riscv/mm/hugetlbpage.c
@@ -67,7 +67,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 
for_each_napot_order(order) {
if (napot_cont_size(order) == sz) {
-   pte = pte_alloc_huge(mm, pmd, addr & 
napot_cont_mask(order));
+   pte = pte_alloc_huge(mm, pmd, addr & 
napot_cont_mask(order), sz);
break;
}
}
diff --git a/arch/sh/mm/hugetlbpage.c b/arch/sh/mm/hugetlbpage.c
index 6cb0ad73dbb9..26579429e5ed 100644
--- a/arch/sh/mm/hugetlbpage.c
+++ b/arch/sh/mm/hugetlbpage.c
@@ -38,7 +38,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct 
vm_area_struct *vma,
if (pud) {
pmd = pmd_alloc(mm, pud, addr);
if (pmd)
-   pte = pte_alloc_huge(mm, pmd, addr);
+   pte = pte_alloc_huge(mm, pmd, addr, sz);
}
}
}
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index b432500c13a5..5a342199e837 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -298,7 +298,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct 
vm_area_struct *vma,
return NULL;
if (sz >= PMD_SIZE)
return (pte_t *)pmd;
-   return pte_alloc_huge(mm, pmd, addr);
+   return pte_alloc_huge(mm, pmd, addr, sz);
 }
 
 pte_t *huge_pte_offset(struct mm_struct *mm,
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 77b30a8c6076..d9c5d9daadc5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -193,9 +193,9 @@ static inline pte_t *pte_offset_huge(pmd_t *pmd, unsigned 
long address)
return pte_offset_kernel(pmd, address);
 }
 static inline pte_t *pte_alloc_huge(struct mm_struct *mm, pmd_t *pmd,
-   unsigned long address)
+   unsigned long address, unsigned long sz)
 {
-   return pte_alloc(mm, pmd) ? NULL : pte_offset_huge(pmd, address);
+   return pte_alloc_size(mm, pmd, sz) ? NULL : pte_offset_huge(pmd, 
address);
 }
 #endif
 
-- 
2.43.0



[RFC PATCH 1/8] mm: Provide pagesize to pmd_populate()

2024-03-25 Thread Christophe Leroy
Unlike many architectures, the powerpc 8xx hardware tablewalk requires
a two-level process for all page sizes, although the second level only
has one entry when the page size is 8M.

To fit with Linux page table topology and without requiring special
page directory layout like hugepd, the page entry will be replicated
1024 times in the standard page table. However for large pages it is
necessary to set bits in the level-1 (PMD) entry. For the time being,
for 512k pages the flag is kept in the PTE and inserted in the PMD
entry at TLB miss exception time; that is necessary because we can have
pages of different sizes in a page table. However the 12 PTE bits are
fully used and there is no room for an additional page size bit.

For 8M pages, there will be only one page per PMD entry, it is
therefore possible to flag the pagesize in the PMD entry, with the
advantage that the information will already be at the right place for
the hardware.

To do so, add a new helper called pmd_populate_size() which takes the
page size as an additional argument, and modify __pte_alloc() to also
take that argument. pte_alloc() is left unmodified in order to
reduce churn on callers, and a pte_alloc_size() is added for use by
pte_alloc_huge().

When an architecture doesn't provide pmd_populate_size(),
pmd_populate() is used as a fallback.
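
For illustration, here is a hedged sketch of what an arch-specific
override could look like (a hypothetical 8xx-style variant; the flag
and constant names are assumptions, not part of this patch):

	/* Hypothetical override: flag the 8M size directly in the PMD. */
	#define pmd_populate_size pmd_populate_size
	static inline void pmd_populate_size(struct mm_struct *mm, pmd_t *pmdp,
					     pgtable_t pte, unsigned long sz)
	{
		if (sz == SZ_8M)
			*pmdp = __pmd(__pa(pte) | _PMD_PRESENT | _PMD_PAGE_8M);
		else
			pmd_populate(mm, pmdp, pte);
	}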

Signed-off-by: Christophe Leroy 
---
 include/linux/mm.h | 12 +++-
 mm/filemap.c   |  2 +-
 mm/internal.h  |  2 +-
 mm/memory.c| 19 ---
 mm/pgalloc-track.h |  2 +-
 mm/userfaultfd.c   |  4 ++--
 6 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2c0910bc3e4a..6c5c15955d4e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2801,8 +2801,8 @@ static inline void mm_inc_nr_ptes(struct mm_struct *mm) {}
 static inline void mm_dec_nr_ptes(struct mm_struct *mm) {}
 #endif
 
-int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);
-int __pte_alloc_kernel(pmd_t *pmd);
+int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long sz);
+int __pte_alloc_kernel(pmd_t *pmd, unsigned long sz);
 
 #if defined(CONFIG_MMU)
 
@@ -2987,7 +2987,8 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t 
*pmd,
pte_unmap(pte); \
 } while (0)
 
-#define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
+#define pte_alloc_size(mm, pmd, sz) (unlikely(pmd_none(*(pmd))) && 
__pte_alloc(mm, pmd, sz))
+#define pte_alloc(mm, pmd) pte_alloc_size(mm, pmd, PAGE_SIZE)
 
 #define pte_alloc_map(mm, pmd, address)\
(pte_alloc(mm, pmd) ? NULL : pte_offset_map(pmd, address))
@@ -2996,9 +2997,10 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t 
*pmd,
(pte_alloc(mm, pmd) ?   \
 NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
 
-#define pte_alloc_kernel(pmd, address) \
-   ((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
+#define pte_alloc_kernel_size(pmd, address, sz)\
+   ((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd, sz))? \
NULL: pte_offset_kernel(pmd, address))
+#define pte_alloc_kernel(pmd, address) pte_alloc_kernel_size(pmd, address, 
PAGE_SIZE)
 
 #if USE_SPLIT_PMD_PTLOCKS
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 7437b2bd75c1..b013000ea84f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3428,7 +3428,7 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct 
folio *folio,
}
 
if (pmd_none(*vmf->pmd) && vmf->prealloc_pte)
-   pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
+   pmd_install(mm, vmf->pmd, &vmf->prealloc_pte, PAGE_SIZE);
 
return false;
 }
diff --git a/mm/internal.h b/mm/internal.h
index 7e486f2c502c..b81c3ca59f45 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -206,7 +206,7 @@ void folio_activate(struct folio *folio);
 void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
   struct vm_area_struct *start_vma, unsigned long floor,
   unsigned long ceiling, bool mm_wr_locked);
-void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
+void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte, unsigned 
long sz);
 
 struct zap_details;
 void unmap_page_range(struct mmu_gather *tlb,
diff --git a/mm/memory.c b/mm/memory.c
index f2bc6dd15eb8..c846bb75746b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -409,7 +409,12 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state 
*mas,
} while (vma);
 }
 
-void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
+#ifndef pmd_populate_size
+#define pmd_populate_size(mm, pmdp, pte, sz) pmd_populate(mm, pmdp, pte)
+#define pmd_populate_kernel_size(mm, pmdp, pte, sz) pmd_populate_kernel(mm, 
pmdp, pte)
+#endif
+
+void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgt

[RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx

2024-03-25 Thread Christophe Leroy
This series reimplements huge pages without hugepd on powerpc 8xx.

Unlike most architectures, powerpc 8xx HW requires a two-level
pagetable topology for all page sizes. So a leaf PMD-contig approach
is not feasible as such.

Possible sizes are 4k, 16k, 512k and 8M.

First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
must point to a single entry level-2 page table. Until now that was
done using hugepd. This series changes it to use standard page tables
where the entry is replicated 1024 times on each of the two page tables
referred to by the two associated PMD entries for that 8M page.
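
As an illustrative sketch only (the function name is invented here and
this is not the series' actual code), the replication amounts to
something like the following, assuming the two level-2 tables are
allocated contiguously:

	/* Mirror one 8M leaf entry over both sibling page tables. */
	static void replicate_8m_entry(pte_t *ptep, pte_t entry)
	{
		int i;

		for (i = 0; i < 2 * PTRS_PER_PTE; i++)	/* 2 x 1024 entries */
			ptep[i] = entry;
	}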

At the moment each helper has to look at the hugepage ptep to know
whether it is a PTE or a PMD, in order to tell whether it is an 8M page
or a smaller size. I hope this can be handled by core-mm in the future.

There are probably several ways to implement stuff, so feedback is
very welcome.

Christophe Leroy (8):
  mm: Provide pagesize to pmd_populate()
  mm: Provide page size to pte_alloc_huge()
  mm: Provide pmd to pte_leaf_size()
  mm: Provide mm_struct and address to huge_ptep_get()
  powerpc/mm: Allow hugepages without hugepd
  powerpc/8xx: Fix size given to set_huge_pte_at()
  powerpc/8xx: Remove support for 8M pages
  powerpc/8xx: Add back support for 8M pages using contiguous PTE
entries

 arch/arm64/include/asm/hugetlb.h  |  2 +-
 arch/arm64/include/asm/pgtable.h  |  2 +-
 arch/arm64/mm/hugetlbpage.c   |  2 +-
 arch/parisc/mm/hugetlbpage.c  |  2 +-
 arch/powerpc/Kconfig  |  1 -
 arch/powerpc/include/asm/hugetlb.h| 13 +++-
 .../include/asm/nohash/32/hugetlb-8xx.h   | 54 -
 arch/powerpc/include/asm/nohash/32/pgalloc.h  |  2 +
 arch/powerpc/include/asm/nohash/32/pte-8xx.h  | 59 +--
 arch/powerpc/include/asm/nohash/pgtable.h | 12 ++--
 arch/powerpc/include/asm/page.h   |  5 --
 arch/powerpc/include/asm/pgtable.h|  1 +
 arch/powerpc/kernel/head_8xx.S| 10 +---
 arch/powerpc/mm/hugetlbpage.c | 23 +++-
 arch/powerpc/mm/nohash/8xx.c  | 46 +++
 arch/powerpc/mm/pgtable.c | 26 +---
 arch/powerpc/mm/pgtable_32.c  |  2 +-
 arch/powerpc/platforms/Kconfig.cputype|  2 +
 arch/riscv/include/asm/pgtable.h  |  2 +-
 arch/riscv/mm/hugetlbpage.c   |  2 +-
 arch/sh/mm/hugetlbpage.c  |  2 +-
 arch/sparc/include/asm/pgtable_64.h   |  2 +-
 arch/sparc/mm/hugetlbpage.c   |  4 +-
 fs/hugetlbfs/inode.c  |  2 +-
 fs/proc/task_mmu.c|  8 +--
 fs/userfaultfd.c  |  2 +-
 include/asm-generic/hugetlb.h |  2 +-
 include/linux/hugetlb.h   |  4 +-
 include/linux/mm.h| 12 ++--
 include/linux/pgtable.h   |  2 +-
 include/linux/swapops.h   |  2 +-
 kernel/events/core.c  |  2 +-
 mm/damon/vaddr.c  |  6 +-
 mm/filemap.c  |  2 +-
 mm/gup.c  |  2 +-
 mm/hmm.c  |  2 +-
 mm/hugetlb.c  | 46 +++
 mm/internal.h |  2 +-
 mm/memory-failure.c   |  2 +-
 mm/memory.c   | 19 +++---
 mm/mempolicy.c|  2 +-
 mm/migrate.c  |  4 +-
 mm/mincore.c  |  2 +-
 mm/pgalloc-track.h|  2 +-
 mm/userfaultfd.c  |  6 +-
 45 files changed, 229 insertions(+), 180 deletions(-)

-- 
2.43.0



Re: [FSL P50x0] Kernel 6.9-rc1 compiling issue

2024-03-25 Thread Christophe Leroy
Hi,

Le 25/03/2024 à 06:18, Christian Zigotzky a écrit :
> I have created a patch:
> 
> --- a/arch/powerpc/platforms/85xx/smp.c 2024-03-25 06:14:02.201209476 +0100
> +++ b/arch/powerpc/platforms/85xx/smp.c 2024-03-25 06:10:04.421425931 +0100
> @@ -393,6 +393,7 @@ static void mpc85xx_smp_kexec_cpu_down(i
>      int disable_threadbit = 0;
>      long start = mftb();
>      long now;
> +   int crashing_cpu = -1;

crashing_cpu is a global variable defined in 
arch/powerpc/kernel/setup-common.c and declared in 
arch/powerpc/include/asm/kexec.h

So you can't redefine crashing_cpu as a local stub.

All you need to do is to add #include <asm/kexec.h> just like 
arch/powerpc/platforms/powernv/smp.c I guess.
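
Something like this untested sketch (the context line is an assumption
about the existing include list in that file):

	--- a/arch/powerpc/platforms/85xx/smp.c
	+++ b/arch/powerpc/platforms/85xx/smp.c
	@@
	 #include <asm/mpic.h>
	+#include <asm/kexec.h>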

Christophe



> 
>      local_irq_disable();
>      hard_irq_disable();
> 
> ---
> 
> -- Christian
> 
> 
> On 25 March 2024 at 05:48 am, Christian Zigotzky wrote:
>> Hi All,
>>
>> Compiling of the RC1 of kernel 6.9 doesn’t work anymore for our FSL 
>> P5020/P5040 boards [1] since the PowerPC updates 6.9-2 [2].
>>
>> Error messages:
>>
>> arch/powerpc/platforms/85xx/smp.c: In function 
>> 'mpc85xx_smp_kexec_cpu_down':
>> arch/powerpc/platforms/85xx/smp.c:401:13: error: 'crashing_cpu' 
>> undeclared (first use in this function); did you mean 'crash_save_cpu'?
>>   401 |  if (cpu == crashing_cpu && cpu_thread_in_core(cpu) != 0) {
>>   | ^~~~
>>   | crash_save_cpu
>> arch/powerpc/platforms/85xx/smp.c:401:13: note: each undeclared 
>> identifier is reported only once for each function it appears in
>> make[5]: *** [scripts/Makefile.build:244: 
>> arch/powerpc/platforms/85xx/smp.o] Error 1
>> make[4]: *** [scripts/Makefile.build:485: arch/powerpc/platforms/85xx] 
>> Error 2
>> make[3]: *** [scripts/Makefile.build:485: arch/powerpc/platforms] Error 2
>> make[2]: *** [scripts/Makefile.build:485: arch/powerpc] Error 2
>>
>> ---
>>
>> I was able to revert it. After that the compiling works again.
>>
>> Could you please check the PowerPC updates 6.9-2? [2]
>>
>> Thanks,
>> Christian
>>
>> [1] http://wiki.amiga.org/index.php?title=X5000
>> [2] 
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.9-rc1=484193fecd2b6349a6fd1554d306aec646ae1a6a
> 


Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()

2024-03-20 Thread Christophe Leroy


Le 20/03/2024 à 17:09, Peter Xu a écrit :
> On Wed, Mar 20, 2024 at 06:16:43AM +0000, Christophe Leroy wrote:
>> At the first place that was to get a close fit between hardware
>> pagetable topology and linux pagetable topology. But obviously we
>> already stepped back for 512k pages, so let's go one more step aside and
>> do similar with 8M pages.
>>
>> I'll give it a try and see how it goes.
> 
> So you're talking about 8M only for 8xx, am I right?

Yes I am.

> 
> There seem to be other PowerPC systems use hugepd.  Is it possible that we
> convert all hugepd into cont_pte form?

Indeed.

Seems like we have hugepd for book3s/64 and for nohash.

For book3s I don't know, may Aneesh can answer.

For nohash I think it should be possible because TLB misses are handled 
by software. Even the e6500, which has a hardware tablewalk, falls back 
on a software walk for hugepages IIUC.

Christophe


Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()

2024-03-20 Thread Christophe Leroy


Le 20/03/2024 à 00:26, Jason Gunthorpe a écrit :
> On Tue, Mar 19, 2024 at 11:07:08PM +0000, Christophe Leroy wrote:
>>
>>
>> Le 18/03/2024 à 17:15, Jason Gunthorpe a écrit :
>>> On Thu, Mar 14, 2024 at 01:11:59PM +, Christophe Leroy wrote:
>>>>
>>>>
>>>> Le 14/03/2024 à 13:53, Peter Xu a écrit :
>>>>> On Thu, Mar 14, 2024 at 08:45:34AM +, Christophe Leroy wrote:
>>>>>>
>>>>>>
>>>>>> Le 13/03/2024 à 22:47, pet...@redhat.com a écrit :
>>>>>>> From: Peter Xu 
>>>>>>>
>>>>>>> PowerPC book3s 4K mostly has the same definition on both, except 
>>>>>>> pXd_huge()
>>>>>>> constantly returns 0 for hash MMUs.  As Michael Ellerman pointed out 
>>>>>>> [1],
>>>>>>> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be 
>>>>>>> set so
>>>>>>> it will keep returning false.
>>>>>>>
>>>>>>> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
>>>>>>> such huge mappings for 4K hash MMUs.  Meanwhile, the major powerpc 
>>>>>>> hugetlb
>>>>>>> pgtable walker __find_linux_pte() already used pXd_leaf() to check 
>>>>>>> hugetlb
>>>>>>> mappings.
>>>>>>>
>>>>>>> The goal should be that we will have one API pXd_leaf() to detect all 
>>>>>>> kinds
>>>>>>> of huge mappings.  AFAICT we need to use the pXd_leaf() impl (rather 
>>>>>>> than
>>>>>>> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return 
>>>>>>> true.
>>>>>>
>>>>>> All kinds of huge mappings ?
>>>>>>
>>>>>> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are
>>>>>> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages
>>>>>> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report
>>>>>> those huge pages.
>>>>>
>>>>> Ah yes, I should always mention this is in the context of leaf huge pages
>>>>> only.  Are the examples you provided all fall into hugepd category?  If so
>>>>> I can reword the commit message, as:
>>>>
>>>> On powerpc 8xx, only the 8M huge pages fall into the hugepd case.
>>>>
>>>> The 512k hugepages are at PTE level, they are handled more or less like
>>>> CONT_PTE on ARM. see function set_huge_pte_at() for more context.
>>>>
>>>> You can also look at pte_leaf_size() and pgd_leaf_size().
>>>
>>> IMHO leaf should return false if the thing is pointing to a next level
>>> page table, even if that next level is fully populated with contiguous
>>> pages.
>>>
>>> This seems more aligned with the contig page direction that hugepd
>>> should be moved over to..
>>
>> Should hugepd be moved to the contig page direction, really ?
> 
> Sure? Is there any downside for the reading side to do so?

Probably not.

> 
>> Would it be acceptable that a 8M hugepage requires 2048 contig entries
>> in 2 page tables, when the hugepd allows a single entry ?
> 
> ? I thought we agreed the only difference would be that something new
> is needed to merge the two identical sibling page tables into one, ie
> you pay 2x the page table memory if that isn't fixed. That is write
> side only change and I imagine it could be done with a single PPC
> special API.
> 
> Honestly not totally sure that is a big deal, it is already really
> memory inefficient compared to every other arch's huge page by needing
> the child page table in the first place.
> 
>> Would it be acceptable performancewise ?
> 
> Isn't this particular PPC sub platform ancient? Are there current real
> users that are going to have hugetlbfs special code and care about
> this performance detail on a 6.20 era kernel?

Ancient yes, but still widely in use, and with the emergence of voice 
over IP in Air Traffic Control, performance becomes more and more of a 
challenge with those old boards that have another 10 years ahead of them.

> 
> In today's world wouldn't it be performance better if these platforms
> could support THP by aligning to the contig API instead of being
> special?

Indeed, if we can promote THP that'd be even better.

> 
> Am I wrong to question why we are polluting the core code for this
> special optimization?

At the first place that was to get a close fit between hardware 
pagetable topology and linux pagetable topology. But obviously we 
already stepped back for 512k pages, so let's go one more step aside and 
do similar with 8M pages.

I'll give it a try and see how it goes.

Christophe


Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()

2024-03-19 Thread Christophe Leroy


Le 18/03/2024 à 17:15, Jason Gunthorpe a écrit :
> On Thu, Mar 14, 2024 at 01:11:59PM +0000, Christophe Leroy wrote:
>>
>>
>> Le 14/03/2024 à 13:53, Peter Xu a écrit :
>>> On Thu, Mar 14, 2024 at 08:45:34AM +, Christophe Leroy wrote:
>>>>
>>>>
>>>> Le 13/03/2024 à 22:47, pet...@redhat.com a écrit :
>>>>> From: Peter Xu 
>>>>>
>>>>> PowerPC book3s 4K mostly has the same definition on both, except 
>>>>> pXd_huge()
>>>>> constantly returns 0 for hash MMUs.  As Michael Ellerman pointed out [1],
>>>>> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set 
>>>>> so
>>>>> it will keep returning false.
>>>>>
>>>>> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
>>>>> such huge mappings for 4K hash MMUs.  Meanwhile, the major powerpc hugetlb
>>>>> pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb
>>>>> mappings.
>>>>>
>>>>> The goal should be that we will have one API pXd_leaf() to detect all 
>>>>> kinds
>>>>> of huge mappings.  AFAICT we need to use the pXd_leaf() impl (rather than
>>>>> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true.
>>>>
>>>> All kinds of huge mappings ?
>>>>
>>>> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are
>>>> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages
>>>> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report
>>>> those huge pages.
>>>
>>> Ah yes, I should always mention this is in the context of leaf huge pages
>>> only.  Are the examples you provided all fall into hugepd category?  If so
>>> I can reword the commit message, as:
>>
>> On powerpc 8xx, only the 8M huge pages fall into the hugepd case.
>>
>> The 512k hugepages are at PTE level, they are handled more or less like
>> CONT_PTE on ARM. see function set_huge_pte_at() for more context.
>>
>> You can also look at pte_leaf_size() and pgd_leaf_size().
> 
> IMHO leaf should return false if the thing is pointing to a next level
> page table, even if that next level is fully populated with contiguous
> pages.
> 
> This seems more aligned with the contig page direction that hugepd
> should be moved over to..

Should hugepd be moved to the contig page direction, really ?

Would it be acceptable that an 8M hugepage requires 2048 contig entries 
in 2 page tables, when the hugepd allows a single entry ? Would it be 
acceptable performance-wise ?

> 
>> By the way pgd_leaf_size() looks odd because it is called only when
>> pgd_leaf() returns true, which never happens for 8M pages.
> 
> Like this, you should reach the actual final leaf that the HW will
> load and leaf_size() should say it is greater size than the current
> table level. Other levels should return 0.
> 
> If necessary the core MM code should deal with this by iterating over
> adjacent tables.
> 
> Jason


Re: [PATCH] powerpc: Use swapper_pg_dir instead of init_mm->pgd

2024-03-16 Thread Christophe Leroy


Le 09/10/2022 à 19:31, Christophe Leroy a écrit :
> init_mm->pgd is always swapper_pg_dir[] which is known
> at build time.
> 
> Directly use the latter instead of loading it from the init_mm
> struct every time.
> 
> Signed-off-by: Christophe Leroy 

Dropping this patch after feedback from Michael:

no other arches do it. (swapper_pg_dir)

It would also make us the only arch other than ia64 (which is old and
probably going to get removed soon) defining pgd_offset_k().


> ---
>   arch/powerpc/include/asm/pgtable.h | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/pgtable.h 
> b/arch/powerpc/include/asm/pgtable.h
> index 283f40d05a4d..f6843e6294d9 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -48,6 +48,9 @@ struct mm_struct;
>   /* Keep these as a macros to avoid include dependency mess */
>   #define pte_page(x) pfn_to_page(pte_pfn(x))
>   #define mk_pte(page, pgprot)	pfn_pte(page_to_pfn(page), (pgprot))
> +
> +#define pgd_offset_k(address)	pgd_offset_pgd(swapper_pg_dir, (address))
> +
>   /*
>* Select all bits except the pfn
>*/


[PATCH v2] powerpc: Handle error in mark_rodata_ro() and mark_initmem_nx()

2024-03-16 Thread Christophe Leroy
mark_rodata_ro() and mark_initmem_nx() use functions that can
fail, like set_memory_nx() and set_memory_ro(), which can leave the
kernel not properly protected.

In case of failure, panic.
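
For reference, the caller side this enables looks roughly like the
sketch below (abridged idea, not a verbatim quote of the patch):

	void mark_rodata_ro(void)
	{
		int err;

		/* ... existing setup elided ... */
		err = mmu_mark_rodata_ro();
		if (err)
			panic("%s() failed, err = %d\n", __func__, err);
	}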

Link: https://github.com/KSPP/linux/issues/7
Signed-off-by: Christophe Leroy 
Signed-off-by: Michael Ellerman 
Link: 
https://msgid.link/836f75710daef12dfea55f8fb6055d7fdaf716e3.1708078577.git.christophe.le...@csgroup.eu
---
v2: Rebased on top of 6388eaa7f116 ("Automatic merge of 'master' into merge 
(2024-03-16 10:18)")
---
 arch/powerpc/mm/book3s32/mmu.c |  7 +--
 arch/powerpc/mm/mmu_decl.h |  8 +++
 arch/powerpc/mm/nohash/8xx.c   | 33 ++---
 arch/powerpc/mm/nohash/e500.c  | 10 ++---
 arch/powerpc/mm/pgtable_32.c   | 38 +-
 5 files changed, 65 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/mm/book3s32/mmu.c b/arch/powerpc/mm/book3s32/mmu.c
index 5445587bfe84..100f999871bc 100644
--- a/arch/powerpc/mm/book3s32/mmu.c
+++ b/arch/powerpc/mm/book3s32/mmu.c
@@ -193,7 +193,7 @@ static bool is_module_segment(unsigned long addr)
return true;
 }
 
-void mmu_mark_initmem_nx(void)
+int mmu_mark_initmem_nx(void)
 {
int nb = mmu_has_feature(MMU_FTR_USE_HIGH_BATS) ? 8 : 4;
int i;
@@ -230,9 +230,10 @@ void mmu_mark_initmem_nx(void)
 
mtsr(mfsr(i << 28) | 0x1000, i << 28);
}
+   return 0;
 }
 
-void mmu_mark_rodata_ro(void)
+int mmu_mark_rodata_ro(void)
 {
int nb = mmu_has_feature(MMU_FTR_USE_HIGH_BATS) ? 8 : 4;
int i;
@@ -245,6 +246,8 @@ void mmu_mark_rodata_ro(void)
}
 
update_bats();
+
+   return 0;
 }
 
 /*
diff --git a/arch/powerpc/mm/mmu_decl.h b/arch/powerpc/mm/mmu_decl.h
index 8e84bc214d13..6949c2c937e7 100644
--- a/arch/powerpc/mm/mmu_decl.h
+++ b/arch/powerpc/mm/mmu_decl.h
@@ -160,11 +160,11 @@ static inline unsigned long p_block_mapped(phys_addr_t 
pa) { return 0; }
 #endif
 
 #if defined(CONFIG_PPC_BOOK3S_32) || defined(CONFIG_PPC_8xx) || 
defined(CONFIG_PPC_E500)
-void mmu_mark_initmem_nx(void);
-void mmu_mark_rodata_ro(void);
+int mmu_mark_initmem_nx(void);
+int mmu_mark_rodata_ro(void);
 #else
-static inline void mmu_mark_initmem_nx(void) { }
-static inline void mmu_mark_rodata_ro(void) { }
+static inline int mmu_mark_initmem_nx(void) { return 0; }
+static inline int mmu_mark_rodata_ro(void) { return 0; }
 #endif
 
 #ifdef CONFIG_PPC_8xx
diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c
index 6be6421086ed..43d4842bb1c7 100644
--- a/arch/powerpc/mm/nohash/8xx.c
+++ b/arch/powerpc/mm/nohash/8xx.c
@@ -119,23 +119,26 @@ void __init mmu_mapin_immr(void)
PAGE_KERNEL_NCG, MMU_PAGE_512K, true);
 }
 
-static void mmu_mapin_ram_chunk(unsigned long offset, unsigned long top,
-   pgprot_t prot, bool new)
+static int mmu_mapin_ram_chunk(unsigned long offset, unsigned long top,
+  pgprot_t prot, bool new)
 {
unsigned long v = PAGE_OFFSET + offset;
unsigned long p = offset;
+   int err = 0;
 
WARN_ON(!IS_ALIGNED(offset, SZ_512K) || !IS_ALIGNED(top, SZ_512K));
 
-   for (; p < ALIGN(p, SZ_8M) && p < top; p += SZ_512K, v += SZ_512K)
-   __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_512K, new);
-   for (; p < ALIGN_DOWN(top, SZ_8M) && p < top; p += SZ_8M, v += SZ_8M)
-   __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_8M, new);
-   for (; p < ALIGN_DOWN(top, SZ_512K) && p < top; p += SZ_512K, v += 
SZ_512K)
-   __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_512K, new);
+   for (; p < ALIGN(p, SZ_8M) && p < top && !err; p += SZ_512K, v += 
SZ_512K)
+   err = __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_512K, 
new);
+   for (; p < ALIGN_DOWN(top, SZ_8M) && p < top && !err; p += SZ_8M, v += 
SZ_8M)
+   err = __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_8M, new);
+   for (; p < ALIGN_DOWN(top, SZ_512K) && p < top && !err; p += SZ_512K, v 
+= SZ_512K)
+   err = __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_512K, 
new);
 
if (!new)
flush_tlb_kernel_range(PAGE_OFFSET + v, PAGE_OFFSET + top);
+
+   return err;
 }
 
 unsigned long __init mmu_mapin_ram(unsigned long base, unsigned long top)
@@ -166,27 +169,33 @@ unsigned long __init mmu_mapin_ram(unsigned long base, 
unsigned long top)
return top;
 }
 
-void mmu_mark_initmem_nx(void)
+int mmu_mark_initmem_nx(void)
 {
unsigned long etext8 = ALIGN(__pa(_etext), SZ_8M);
unsigned long sinittext = __pa(_sinittext);
unsigned long boundary = strict_kernel_rwx_enabled() ? sinittext : 
etext8;
unsigned long einittext8 = ALIGN(__pa(_einittext), SZ_8M);
+   int err = 0;
 
if (!debug

Re: [PATCH v1 2/2] powerpc/code-patching: Convert to open_patch_window()/close_patch_window()

2024-03-16 Thread Christophe Leroy


Le 15/03/2024 à 09:38, Christophe Leroy a écrit :
> 
> 
> Le 15/03/2024 à 03:59, Benjamin Gray a écrit :
>> The existing patching alias page setup and teardown sections can be
>> simplified to make use of the new open_patch_window() abstraction.
>>
>> This eliminates the _mm variants of the helpers, consumers no longer
>> need to check mm_patch_enabled(), and consumers no longer need to worry
>> about synchronization and flushing beyond the changes they make in the
>> patching window.
> 
> With this patch, the time needed to activate or de-activate function 
> tracer is approx 10% longer on powerpc 8xx.

With the following changes, the performance is restored:

diff --git a/arch/powerpc/lib/code-patching.c 
b/arch/powerpc/lib/code-patching.c
index fd6f8576033a..bc92b85913d8 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -282,13 +282,13 @@ struct patch_window {
   * Interrupts must be disabled for the entire duration of the 
patching. The PIDR
   * is potentially changed during this time.
   */
-static int open_patch_window(void *addr, struct patch_window *ctx)
+static __always_inline int open_patch_window(void *addr, struct 
patch_window *ctx)
  {
unsigned long pfn = get_patch_pfn(addr);

lockdep_assert_irqs_disabled();

-   ctx->text_poke_addr = (unsigned 
long)__this_cpu_read(cpu_patching_context.addr);
+   ctx->text_poke_addr = (unsigned 
long)__this_cpu_read(cpu_patching_context.addr) & PAGE_MASK;

if (!mm_patch_enabled()) {
ctx->ptep = __this_cpu_read(cpu_patching_context.pte);
@@ -331,7 +331,7 @@ static int open_patch_window(void *addr, struct 
patch_window *ctx)
return 0;
  }

-static void close_patch_window(struct patch_window *ctx)
+static __always_inline void close_patch_window(struct patch_window *ctx)
  {
lockdep_assert_irqs_disabled();




Re: [PATCH v1 2/2] powerpc/code-patching: Convert to open_patch_window()/close_patch_window()

2024-03-16 Thread Christophe Leroy


Le 15/03/2024 à 09:38, Christophe Leroy a écrit :
> 
> 
> Le 15/03/2024 à 03:59, Benjamin Gray a écrit :
>> The existing patching alias page setup and teardown sections can be
>> simplified to make use of the new open_patch_window() abstraction.
>>
>> This eliminates the _mm variants of the helpers, consumers no longer
>> need to check mm_patch_enabled(), and consumers no longer need to worry
>> about synchronization and flushing beyond the changes they make in the
>> patching window.
> 
> With this patch, the time needed to activate or de-activate function 
> tracer is approx 10% longer on powerpc 8xx.

See below the difference in patch_instruction() before and after your 
patch, both for 4k pages and 16k pages:

16k pages, before your patch:

0278 :
  278:  48 00 00 84 nop
  27c:  7c e0 00 a6 mfmsr   r7
  280:  7c 51 13 a6 mtspr   81,r2
  284:  3d 00 00 00 lis r8,0
286: R_PPC_ADDR16_HA.data
  288:  39 08 00 00 addi    r8,r8,0
28a: R_PPC_ADDR16_LO.data
  28c:  7c 69 1b 78 mr  r9,r3
  290:  3d 29 40 00 addis   r9,r9,16384
  294:  81 48 00 08 lwz r10,8(r8)
  298:  55 29 00 22 clrrwi  r9,r9,14
  29c:  81 08 00 04 lwz r8,4(r8)
  2a0:  61 29 01 2d ori r9,r9,301
  2a4:  55 06 00 22 clrrwi  r6,r8,14
  2a8:  91 2a 00 00 stw r9,0(r10)
  2ac:  91 2a 00 04 stw r9,4(r10)
  2b0:  91 2a 00 08 stw r9,8(r10)
  2b4:  91 2a 00 0c stw r9,12(r10)
  2b8:  50 68 04 be rlwimi  r8,r3,0,18,31
  2bc:  90 88 00 00 stw r4,0(r8)
  2c0:  7c 00 40 6c dcbst   0,r8
  2c4:  7c 00 04 ac hwsync
  2c8:  7c 00 1f ac icbi0,r3
  2cc:  7c 00 04 ac hwsync
  2d0:  4c 00 01 2c isync
  2d4:  38 60 00 00 li  r3,0
  2d8:  39 20 00 00 li  r9,0
  2dc:  91 2a 00 00 stw r9,0(r10)
  2e0:  91 2a 00 04 stw r9,4(r10)
  2e4:  91 2a 00 08 stw r9,8(r10)
  2e8:  91 2a 00 0c stw r9,12(r10)
  2ec:  7c 00 32 64 tlbie   r6,r0
  2f0:  7c 00 04 ac hwsync
  2f4:  7c e0 01 24 mtmsr   r7
  2f8:  4e 80 00 20 blr

16k pages, after your patch. Now we have a stack frame for the call to 
close_patch_window(). And the branch in close_patch_window() is 
unexpected as patch_instruction() works on single pages.

024c :
  24c:  81 23 00 04 lwz r9,4(r3)
  250:  39 40 00 00 li  r10,0
  254:  91 49 00 00 stw r10,0(r9)
  258:  91 49 00 04 stw r10,4(r9)
  25c:  91 49 00 08 stw r10,8(r9)
  260:  91 49 00 0c stw r10,12(r9)
  264:  81 23 00 00 lwz r9,0(r3)
  268:  55 2a 00 22 clrrwi  r10,r9,14
  26c:  39 29 40 00 addi    r9,r9,16384
  270:  7d 2a 48 50 subfr9,r10,r9
  274:  28 09 40 00 cmplwi  r9,16384
  278:  41 81 00 10 bgt 288 
  27c:  7c 00 52 64 tlbie   r10,r0
  280:  7c 00 04 ac hwsync
  284:  4e 80 00 20 blr
  288:  7c 00 04 ac hwsync
  28c:  7c 00 02 e4 tlbia
  290:  4c 00 01 2c isync
  294:  4e 80 00 20 blr

02c4 :
  2c4:  94 21 ff d0 stwu    r1,-48(r1)
  2c8:  93 c1 00 28 stw r30,40(r1)
  2cc:  48 00 00 ac nop
  2d0:  7c 08 02 a6 mflr    r0
  2d4:  90 01 00 34 stw r0,52(r1)
  2d8:  93 e1 00 2c stw r31,44(r1)
  2dc:  7f e0 00 a6 mfmsr   r31
  2e0:  7c 51 13 a6 mtspr   81,r2
  2e4:  3d 40 00 00 lis r10,0
2e6: R_PPC_ADDR16_HA.data
  2e8:  39 4a 00 00 addi    r10,r10,0
2ea: R_PPC_ADDR16_LO.data
  2ec:  7c 69 1b 78 mr  r9,r3
  2f0:  3d 29 40 00 addis   r9,r9,16384
  2f4:  81 0a 00 08 lwz r8,8(r10)
  2f8:  80 ca 00 04 lwz r6,4(r10)
  2fc:  55 29 00 22 clrrwi  r9,r9,14
  300:  61 29 01 2d ori r9,r9,301
  304:  38 e0 00 00 li  r7,0
  308:  54 6a 04 be clrlwi  r10,r3,18
  30c:  91 28 00 00 stw r9,0(r8)
  310:  91 28 00 04 stw r9,4(r8)
  314:  91 28 00 08 stw r9,8(r8)
  318:  91 28 00 0c stw r9,12(r8)
  31c:  91 01 00 0c stw r8,12(r1)
  320:  90 c1 00 08 stw r6,8(r1)
  324:  7d 4a 32 14 add r10,r10,r6
  328:  90 e1 00 10 stw r7,16(r1)
  32c:  90 e1 00 14 stw r7,20(r1)
  330:  90 e1 00 18 stw r7,24(r1)
  334:  90 8a 00 00 stw r4,0(r10)
  338:  7c 00 50 6c dcbst   0,r10
  33c:  7c 00 04 ac hwsync
  340:  7c 00 1f ac icbi0,r3
  344:  7c 00 04 ac hwsync
  348:  4c 00 01 2c isync
  34c:  3b c0 00 00 li  r30,0
  350:  38 61 00 08 addi    r3,r1,8
  354:  4b ff fe f9 bl  24c 
  358:  7f e0 01 24 mtmsr   r31
  35c:  80 01 00 34 lwz r0,52(r1)
  360:  83 e1 00 2c lwz r31,44(r1)
  364:  7c 08 03 a6 mtlr    r0
  368:  7f c3 f3 78 mr  r3,r30
  36c:  83 c1 00 28 lwz r30,40(r1)
  370:  38 21 00 30 addi    r1,r1,48
  374:  4e 80 00 20 blr

Re: Cannot load wireguard module

2024-03-15 Thread Christophe Leroy
Hi,

Le 15/03/2024 à 13:20, Michal Suchánek a écrit :
> Hello,
> 
> I cannot load the wireguard module.
> 
> Loading the module provides no diagnostic other than 'No such device'.
> 
> Please provide maningful diagnostics for loading software-only driver,
> clearly there is no particular device needed.

Can you tell us more ? Were you able to load it before ?
Can you provide your .config ?

I just gave it a try on my powerpc 8xx (ppc32) as built-in (I don't use 
modules) and it seems to probe properly:

[7.547390] wireguard: allowedips self-tests: pass
[7.607224] wireguard: nonce counter self-tests: pass
[7.776594] wireguard: ratelimiter self-tests: pass
[7.781723] wireguard: WireGuard 1.0.0 loaded. See www.wireguard.com for information.
[7.789570] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld . All Rights Reserved.


Christophe

> 
> Thanks
> 
> Michal
> 
> jostaberry-1:~ # uname -a
> Linux jostaberry-1 6.8.0-lp155.8.g7e0e887-default #1 SMP Wed Mar 13 09:02:21 
> UTC 2024 (7e0e887) ppc64le ppc64le ppc64le GNU/Linux
> jostaberry-1:~ # modprobe wireguard
> modprobe: ERROR: could not insert 'wireguard': No such device
> jostaberry-1:~ # modprobe -v wireguard
> insmod 
> /lib/modules/6.8.0-lp155.8.g7e0e887-default/kernel/arch/powerpc/crypto/chacha-p10-crypto.ko.zst
> modprobe: ERROR: could not insert 'wireguard': No such device
> jostaberry-1:~ # modprobe chacha-generic
> jostaberry-1:~ # modprobe -v wireguard
> insmod 
> /lib/modules/6.8.0-lp155.8.g7e0e887-default/kernel/arch/powerpc/crypto/chacha-p10-crypto.ko.zst
> modprobe: ERROR: could not insert 'wireguard': No such device
> jostaberry-1:~ #
> 


Re: [PATCH v1 2/2] powerpc/code-patching: Convert to open_patch_window()/close_patch_window()

2024-03-15 Thread Christophe Leroy


Le 15/03/2024 à 03:59, Benjamin Gray a écrit :
> The existing patching alias page setup and teardown sections can be
> simplified to make use of the new open_patch_window() abstraction.
> 
> This eliminates the _mm variants of the helpers, consumers no longer
> need to check mm_patch_enabled(), and consumers no longer need to worry
> about synchronization and flushing beyond the changes they make in the
> patching window.

With this patch, the time needed to activate or de-activate the 
function tracer is approx 10% longer on powerpc 8xx.

Christophe


Re: linux-next: manual merge of the powerpc tree with the mm-stable tree

2024-03-15 Thread Christophe Leroy


Le 29/02/2024 à 07:37, Michael Ellerman a écrit :
> Stephen Rothwell  writes:
>> Hi all,
>>
>> Today's linux-next merge of the powerpc tree got a conflict in:
>>
>>arch/powerpc/mm/pgtable_32.c
>>
>> between commit:
>>
>>a5e8131a0329 ("arm64, powerpc, riscv, s390, x86: ptdump: refactor 
>> CONFIG_DEBUG_WX")
>>
>> from the mm-stable tree and commit:
>>
>>8f17bd2f4196 ("powerpc: Handle error in mark_rodata_ro() and 
>> mark_initmem_nx()")
>>
>> from the powerpc tree.
> 
> Thanks. That's a fairly ugly conflict.
> 
> Maybe I'll drop that patch until the generic change has gone in.
> 

The change is now in linus tree.

Christophe


Re: [PATCH v1 1/3] powerpc/code-patching: Test patch_instructions() during boot

2024-03-15 Thread Christophe Leroy


Le 15/03/2024 à 03:57, Benjamin Gray a écrit :
> patch_instructions() introduces new behaviour with a couple of
> variations. Test each case of
> 
>* a repeated 32-bit instruction,
>* a repeated 64-bit instruction (ppc64), and
>* a copied sequence of instructions
> 
> for both on a single page and when it crosses a page boundary.
> 
> Signed-off-by: Benjamin Gray 
> ---
>   arch/powerpc/lib/test-code-patching.c | 92 +++
>   1 file changed, 92 insertions(+)
> 
> diff --git a/arch/powerpc/lib/test-code-patching.c 
> b/arch/powerpc/lib/test-code-patching.c
> index c44823292f73..35a3756272df 100644
> --- a/arch/powerpc/lib/test-code-patching.c
> +++ b/arch/powerpc/lib/test-code-patching.c
> @@ -347,6 +347,97 @@ static void __init test_prefixed_patching(void)
>   check(!memcmp(iptr, expected, sizeof(expected)));
>   }
>   
> +static void __init test_multi_instruction_patching(void)
> +{
> + u32 code[256];

Build failure:

   CC  arch/powerpc/lib/test-code-patching.o
arch/powerpc/lib/test-code-patching.c: In function 
'test_multi_instruction_patching':
arch/powerpc/lib/test-code-patching.c:439:1: error: the frame size of 
1040 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
   439 | }
   | ^
cc1: all warnings being treated as errors
make[4]: *** [scripts/Makefile.build:243: 
arch/powerpc/lib/test-code-patching.o] Error 1


You have to avoid big arrays on the stack.
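
A hedged sketch of one way to do that, moving the buffer to the heap:

	u32 *code = kmalloc_array(256, sizeof(*code), GFP_KERNEL);

	check(code);
	if (!code)
		return;
	/* ... run the tests using code[] instead of the stack array ... */
	kfree(code);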


> + void *buf;
> + u32 *addr32;
> + u64 *addr64;
> + ppc_inst_t inst64 = ppc_inst_prefix(OP_PREFIX << 26 | 3UL << 24, 
> PPC_RAW_TRAP());
> + u32 inst32 = PPC_RAW_NOP();
> +
> + buf = vzalloc(PAGE_SIZE * 8);
> + check(buf);
> + if (!buf)
> + return;
> +
> + /* Test single page 32-bit repeated instruction */
> + addr32 = buf + PAGE_SIZE;
> + check(!patch_instructions(addr32 + 1, &inst32, 12, true));
> +
> + check(addr32[0] == 0);
> + check(addr32[1] == inst32);
> + check(addr32[2] == inst32);
> + check(addr32[3] == inst32);
> + check(addr32[4] == 0);
> +
> + /* Test single page 64-bit repeated instruction */
> + if (IS_ENABLED(CONFIG_PPC64)) {
> + check(ppc_inst_prefixed(inst64));
> +
> + addr64 = buf + PAGE_SIZE * 2;
> + ppc_inst_write(code, inst64);
> + check(!patch_instructions((u32 *)(addr64 + 1), code, 24, true));
> +
> + check(addr64[0] == 0);
> + check(ppc_inst_equal(ppc_inst_read((u32 *)[1]), inst64));
> + check(ppc_inst_equal(ppc_inst_read((u32 *)&addr64[1]), inst64));
> + check(ppc_inst_equal(ppc_inst_read((u32 *)&addr64[2]), inst64));
> + check(ppc_inst_equal(ppc_inst_read((u32 *)&addr64[3]), inst64));
> + }
> +
> + /* Test single page memcpy */
> + addr32 = buf + PAGE_SIZE * 3;
> +
> + for (int i = 0; i < ARRAY_SIZE(code); i++)
> + code[i] = i + 1;
> +
> + check(!patch_instructions(addr32 + 1, code, sizeof(code), false));
> +
> + check(addr32[0] == 0);
> + check(!memcmp(&addr32[1], code, sizeof(code)));
> + check(addr32[ARRAY_SIZE(code) + 1] == 0);
> +
> + /* Test multipage 32-bit repeated instruction */
> + addr32 = buf + PAGE_SIZE * 4 - 8;
> + check(!patch_instructions(addr32 + 1, &inst32, 12, true));
> +
> + check(addr32[0] == 0);
> + check(addr32[1] == inst32);
> + check(addr32[2] == inst32);
> + check(addr32[3] == inst32);
> + check(addr32[4] == 0);
> +
> + /* Test multipage 64-bit repeated instruction */
> + if (IS_ENABLED(CONFIG_PPC64)) {
> + check(ppc_inst_prefixed(inst64));
> +
> + addr64 = buf + PAGE_SIZE * 5 - 8;
> + ppc_inst_write(code, inst64);
> + check(!patch_instructions((u32 *)(addr64 + 1), code, 24, true));
> +
> + check(addr64[0] == 0);
> + check(ppc_inst_equal(ppc_inst_read((u32 *)[1]), inst64));
> + check(ppc_inst_equal(ppc_inst_read((u32 *)&addr64[1]), inst64));
> + check(ppc_inst_equal(ppc_inst_read((u32 *)&addr64[2]), inst64));
> + check(ppc_inst_equal(ppc_inst_read((u32 *)&addr64[3]), inst64));
> + }
> +
> + /* Test multipage memcpy */
> + addr32 = buf + PAGE_SIZE * 6 - 12;
> +
> + for (int i = 0; i < ARRAY_SIZE(code); i++)
> + code[i] = i + 1;
> +
> + check(!patch_instructions(addr32 + 1, code, sizeof(code), false));
> +
> + check(addr32[0] == 0);
> + check(!memcmp(&addr32[1], code, sizeof(code)));
> + check(addr32[ARRAY_SIZE(code) + 1] == 0);
> +
> + vfree(buf);
> +}
> +
>   static int __init test_code_patching(void)
>   {
>   pr_info("Running code patching self-tests ...\n");
> @@ -356,6 +447,7 @@ static int __init test_code_patching(void)
>   test_create_function_call();
>   test_translate_branch();
>   test_prefixed_patching();
> + test_multi_instruction_patching();
>   
>   return 0;
>   }


Re: [PATCH v1 3/3] powerpc/code-patching: Optimise patch_memcpy() to 4 byte chunks

2024-03-15 Thread Christophe Leroy


Le 15/03/2024 à 03:57, Benjamin Gray a écrit :
> As we are patching instructions, we can assume the length is a multiple
> of 4 and the destination address is aligned.
> 
> Atomicity of patching a prefixed instruction is not a concern, as the
> original implementation doesn't provide it anyway.

This patch looks unnecessary.

copy_to_kernel_nofault() is what you want to use instead.

> 
> Signed-off-by: Benjamin Gray 
> ---
>   arch/powerpc/lib/code-patching.c | 8 
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/lib/code-patching.c 
> b/arch/powerpc/lib/code-patching.c
> index c6633759b509..ed450a32918c 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c
> @@ -394,10 +394,10 @@ static int patch_memset32(u32 *addr, u32 val, size_t 
> count)
>   return -EPERM;
>   }
>   
> -static int patch_memcpy(void *dst, void *src, size_t len)
> +static int patch_memcpy32(u32 *dst, u32 *src, size_t count)
>   {
> - for (void *end = src + len; src < end; dst++, src++)
> - __put_kernel_nofault(dst, src, u8, failed);
> + for (u32 *end = src + count; src < end; dst++, src++)
> + __put_kernel_nofault(dst, src, u32, failed);
>   
>   return 0;
>   
> @@ -424,7 +424,7 @@ static int __patch_instructions(u32 *patch_addr, u32 
> *code, size_t len, bool rep
>   err = patch_memset32(patch_addr, val, len / 4);
>   }
>   } else {
> - err = patch_memcpy(patch_addr, code, len);
> + err = patch_memcpy32(patch_addr, code, len / 4);
>   }
>   
>   smp_wmb();  /* smp write barrier */


Re: [PATCH v1 2/3] powerpc/code-patching: Use dedicated memory routines for patching

2024-03-15 Thread Christophe Leroy


Le 15/03/2024 à 03:57, Benjamin Gray a écrit :
> The patching page set up as a writable alias may be in quadrant 1
> (userspace) if the temporary mm path is used. This causes sanitiser
> failures if so. Sanitiser failures also occur on the non-mm path
> because the plain memset family is instrumented, and KASAN treats the
> patching window as poisoned.
> 
> Introduce locally defined patch_* variants of memset that perform an
> uninstrumented lower level set, as well as detecting write errors like
> the original single patch variant does.
> 
> copy_to_user() is not correct here, as the PTE makes it a proper kernel
> page (the EEA is privileged access only, RW). It just happens to be in
> quadrant 1 because that's the hardware's mechanism for using the current
> PID vs PID 0 in translations. Importantly, it's incorrect to allow user
> page accesses.
> 
> Now that the patching memsets are used, we also propagate a failure up
> to the caller as the single patch variant does.
> 
> Signed-off-by: Benjamin Gray 
> 
> ---
> 
> The patch_memcpy() can be optimised to 4 bytes at a time assuming the
> same requirements as regular instruction patching are being followed
> for the 'copy sequence of instructions' mode (i.e., they actually are
> instructions following instruction alignment rules).

Why not use copy_to_kernel_nofault() ?
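
A minimal sketch of that suggestion in the memcpy path, for reference
(copy_to_kernel_nofault() returns 0 on success and -EFAULT on a
faulting access):

	/* Replaces the hand-rolled byte loop. */
	err = copy_to_kernel_nofault(patch_addr, code, len);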


> ---
>   arch/powerpc/lib/code-patching.c | 42 +---
>   1 file changed, 38 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/lib/code-patching.c 
> b/arch/powerpc/lib/code-patching.c
> index c6ab46156cda..c6633759b509 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c
> @@ -372,9 +372,43 @@ int patch_instruction(u32 *addr, ppc_inst_t instr)
>   }
>   NOKPROBE_SYMBOL(patch_instruction);
>   
> +static int patch_memset64(u64 *addr, u64 val, size_t count)
> +{
> + for (u64 *end = addr + count; addr < end; addr++)
> + __put_kernel_nofault(addr, &val, u64, failed);
> +
> + return 0;
> +
> +failed:
> + return -EPERM;

Is it correct ? Shouldn't it be -EFAULT ?

> +}
> +
> +static int patch_memset32(u32 *addr, u32 val, size_t count)
> +{
> + for (u32 *end = addr + count; addr < end; addr++)
> + __put_kernel_nofault(addr, &val, u32, failed);
> +
> + return 0;
> +
> +failed:
> + return -EPERM;
> +}
> +
> +static int patch_memcpy(void *dst, void *src, size_t len)
> +{
> + for (void *end = src + len; src < end; dst++, src++)
> + __put_kernel_nofault(dst, src, u8, failed);
> +
> + return 0;
> +
> +failed:
> + return -EPERM;
> +}
> +
>   static int __patch_instructions(u32 *patch_addr, u32 *code, size_t len, 
> bool repeat_instr)
>   {
>   unsigned long start = (unsigned long)patch_addr;
> + int err;
>   
>   /* Repeat instruction */
>   if (repeat_instr) {
> @@ -383,19 +417,19 @@ static int __patch_instructions(u32 *patch_addr, u32 
> *code, size_t len, bool rep
>   if (ppc_inst_prefixed(instr)) {
>   u64 val = ppc_inst_as_ulong(instr);
>   
> - memset64((u64 *)patch_addr, val, len / 8);
> + err = patch_memset64((u64 *)patch_addr, val, len / 8);
>   } else {
>   u32 val = ppc_inst_val(instr);
>   
> - memset32(patch_addr, val, len / 4);
> + err = patch_memset32(patch_addr, val, len / 4);
>   }
>   } else {
> - memcpy(patch_addr, code, len);
> + err = patch_memcpy(patch_addr, code, len);

Use copy_to_kernel_nofault() instead of open coding a new less optimised 
version of it.

>   }
>   
>   smp_wmb();  /* smp write barrier */
>   flush_icache_range(start, start + len);
> - return 0;
> + return err;
>   }
>   
>   /*


Re: [PATCH v9 07/27] net: wan: Add support for QMC HDLC

2024-03-14 Thread Christophe Leroy


Le 14/03/2024 à 16:21, Guenter Roeck a écrit :
> On Wed, Nov 15, 2023 at 03:39:43PM +0100, Herve Codina wrote:
>> The QMC HDLC driver provides support for HDLC using the QMC (QUICC
>> Multichannel Controller) to transfer the HDLC data.
>>
>> Signed-off-by: Herve Codina 
>> Reviewed-by: Christophe Leroy 
>> Acked-by: Jakub Kicinski 
>> ---
> [ ... ]
> 
>> +
>> +static const struct of_device_id qmc_hdlc_id_table[] = {
>> +{ .compatible = "fsl,qmc-hdlc" },
>> +{} /* sentinel */
>> +};
>> +MODULE_DEVICE_TABLE(of, qmc_hdlc_driver);
> 
> I am a bit puzzled. How does this even compile ?

Because

#else  /* !MODULE */
#define MODULE_DEVICE_TABLE(type, name)
#endif


We should probably try to catch those errors when CONFIG_MODULES is not set.
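
One possible shape for such a check, sketched here purely as an
assumption (this is not what module.h does today): keep referencing the
table even in the !MODULE case so that a typo in the name breaks the
build:

	#else  /* !MODULE */
	#define MODULE_DEVICE_TABLE(type, name)				\
	static typeof(name) *__mod_##type##__##name##_unused		\
		__attribute__((unused)) = &(name)
	#endif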

By the way, a fix is available at 
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20240314123346.461350-1-herve.cod...@bootlin.com/

Christophe


Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()

2024-03-14 Thread Christophe Leroy


Le 14/03/2024 à 13:53, Peter Xu a écrit :
> On Thu, Mar 14, 2024 at 08:45:34AM +0000, Christophe Leroy wrote:
>>
>>
>> Le 13/03/2024 à 22:47, pet...@redhat.com a écrit :
>>> From: Peter Xu 
>>>
>>> PowerPC book3s 4K mostly has the same definition on both, except pXd_huge()
>>> constantly returns 0 for hash MMUs.  As Michael Ellerman pointed out [1],
>>> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so
>>> it will keep returning false.
>>>
>>> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
>>> such huge mappings for 4K hash MMUs.  Meanwhile, the major powerpc hugetlb
>>> pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb
>>> mappings.
>>>
>>> The goal should be that we will have one API pXd_leaf() to detect all kinds
>>> of huge mappings.  AFAICT we need to use the pXd_leaf() impl (rather than
>>> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true.
>>
>> All kinds of huge mappings ?
>>
>> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are
>> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages
>> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report
>> those huge pages.
> 
> Ah yes, I should always mention this is in the context of leaf huge pages
> only.  Are the examples you provided all fall into hugepd category?  If so
> I can reword the commit message, as:

On powerpc 8xx, only the 8M huge pages fall into the hugepd case.

The 512k hugepages are at PTE level; they are handled more or less like 
CONT_PTE on ARM. See function set_huge_pte_at() for more context.

You can also look at pte_leaf_size() and pgd_leaf_size().
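
For context, a hedged sketch of how a generic walker consumes those
helpers (modelled loosely on perf_get_pgtable_size(), with the p4d/pud
levels elided):

	/* Return the mapping size of whichever level is the leaf. */
	if (pgd_leaf(pgd))
		size = pgd_leaf_size(pgd);
	else if (pmd_leaf(pmd))
		size = pmd_leaf_size(pmd);
	else
		size = pte_leaf_size(pte);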

By the way, pgd_leaf_size() looks odd because it is called only when 
pgd_leaf() returns true, which never happens for 8M pages.

> 
>  As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to
>  create such huge mappings for 4K hash MMUs.  Meanwhile, the major
>  powerpc hugetlb pgtable walker __find_linux_pte() already used
>  pXd_leaf() to check leaf hugetlb mappings.
> 
>  The goal should be that we will have one API pXd_leaf() to detect
>  all kinds of huge mappings except hugepd.  AFAICT we need to use
>  the pXd_leaf() impl (rather than pXd_huge() ones) to make sure
>  ie. THPs on hash MMU will also return true.
> 
> Does this look good to you?
> 
> Thanks,
> 


Re: [PATCH v6 1/9] locking/mutex: introduce devm_mutex_init

2024-03-14 Thread Christophe Leroy


Le 14/03/2024 à 09:45, George Stark a écrit :
> Using the devm API leads to a certain order of releasing resources.
> So all dependent resources which are not devm-wrapped should be deleted
> with respect to the devm-release order. A mutex is one such object that
> is often bound to other resources and has no devm wrapping of its own.
> Since mutex_destroy() actually does nothing in non-debug builds,
> calls to mutex_destroy() are frequently just omitted, which is safe for
> now but formally wrong and can lead to problems if mutex_destroy() is
> ever extended, so introduce devm_mutex_init().
> 
> Signed-off-by: George Stark 
> Suggested by-by: Christophe Leroy 

s/Suggested by-by/Suggested-by:

Reviewed-by: Christophe Leroy 
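
For context, a minimal usage sketch (hypothetical driver; struct foo is
assumed to embed a struct mutex named lock):

	static int foo_probe(struct platform_device *pdev)
	{
		struct foo *foo;
		int ret;

		foo = devm_kzalloc(&pdev->dev, sizeof(*foo), GFP_KERNEL);
		if (!foo)
			return -ENOMEM;

		/* The mutex is destroyed automatically on driver unbind. */
		ret = devm_mutex_init(&pdev->dev, &foo->lock);
		if (ret)
			return ret;

		return 0;
	}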

> ---
>   include/linux/mutex.h| 27 +++
>   kernel/locking/mutex-debug.c | 11 +++
>   2 files changed, 38 insertions(+)
> 
> diff --git a/include/linux/mutex.h b/include/linux/mutex.h
> index 67edc4ca2bee..f57e005ded24 100644
> --- a/include/linux/mutex.h
> +++ b/include/linux/mutex.h
> @@ -22,6 +22,8 @@
>   #include 
>   #include 
>   
> +struct device;
> +
>   #ifdef CONFIG_DEBUG_LOCK_ALLOC
>   # define __DEP_MAP_MUTEX_INITIALIZER(lockname)  \
>   , .dep_map = {  \
> @@ -117,6 +119,31 @@ do { 
> \
>   } while (0)
>   #endif /* CONFIG_PREEMPT_RT */
>   
> +#ifdef CONFIG_DEBUG_MUTEXES
> +
> +int __devm_mutex_init(struct device *dev, struct mutex *lock);
> +
> +#else
> +
> +static inline int __devm_mutex_init(struct device *dev, struct mutex *lock)
> +{
> + /*
> +  * When CONFIG_DEBUG_MUTEXES is off mutex_destroy is just a nop so
> +  * no really need to register it in devm subsystem.
> +  */
> + return 0;
> +}
> +
> +#endif
> +
> +#define devm_mutex_init(dev, mutex)  \
> +({   \
> + typeof(mutex) mutex_ = (mutex); \
> + \
> + mutex_init(mutex_); \
> + __devm_mutex_init(dev, mutex_); \
> +})
> +
>   /*
>* See kernel/locking/mutex.c for detailed documentation of these APIs.
>* Also see Documentation/locking/mutex-design.rst.
> diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
> index bc8abb8549d2..6aa77e3dc82e 100644
> --- a/kernel/locking/mutex-debug.c
> +++ b/kernel/locking/mutex-debug.c
> @@ -19,6 +19,7 @@
>   #include 
>   #include 
>   #include 
> +#include 
>   
>   #include "mutex.h"
>   
> @@ -89,6 +90,16 @@ void debug_mutex_init(struct mutex *lock, const char *name,
>   lock->magic = lock;
>   }
>   
> +static void devm_mutex_release(void *res)
> +{
> + mutex_destroy(res);
> +}
> +
> +int __devm_mutex_init(struct device *dev, struct mutex *lock)
> +{
> + return devm_add_action_or_reset(dev, devm_mutex_release, lock);
> +}
> +
>   /***
>* mutex_destroy - mark a mutex unusable
>* @lock: the mutex to be destroyed


Re: [PATCH 12/13] mm/treewide: Remove pXd_huge()

2024-03-14 Thread Christophe Leroy


Le 13/03/2024 à 22:47, pet...@redhat.com a écrit :
> From: Peter Xu 
> 
> This API is not used anymore, drop it for the whole tree.
> 
> Signed-off-by: Peter Xu 
> ---
>   arch/arm/mm/Makefile  |  1 -
>   arch/arm/mm/hugetlbpage.c | 29 ---
>   arch/arm64/mm/hugetlbpage.c   | 10 ---
>   arch/loongarch/mm/hugetlbpage.c   | 10 ---
>   arch/mips/include/asm/pgtable-32.h|  2 +-
>   arch/mips/include/asm/pgtable-64.h|  2 +-
>   arch/mips/mm/hugetlbpage.c| 10 ---
>   arch/parisc/mm/hugetlbpage.c  | 11 ---
>   .../include/asm/book3s/64/pgtable-4k.h| 10 ---
>   .../include/asm/book3s/64/pgtable-64k.h   | 25 
>   arch/powerpc/include/asm/nohash/pgtable.h | 10 ---
>   arch/riscv/mm/hugetlbpage.c   | 10 ---
>   arch/s390/mm/hugetlbpage.c| 10 ---
>   arch/sh/mm/hugetlbpage.c  | 10 ---
>   arch/sparc/mm/hugetlbpage.c   | 10 ---
>   arch/x86/mm/hugetlbpage.c | 16 --
>   include/linux/hugetlb.h   | 24 ---
>   17 files changed, 2 insertions(+), 198 deletions(-)
>   delete mode 100644 arch/arm/mm/hugetlbpage.c
> 

> diff --git a/arch/mips/include/asm/pgtable-32.h 
> b/arch/mips/include/asm/pgtable-32.h
> index 0e196650f4f4..92b7591aac2a 100644
> --- a/arch/mips/include/asm/pgtable-32.h
> +++ b/arch/mips/include/asm/pgtable-32.h
> @@ -129,7 +129,7 @@ static inline int pmd_none(pmd_t pmd)
>   static inline int pmd_bad(pmd_t pmd)
>   {
>   #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
> - /* pmd_huge(pmd) but inline */
> + /* pmd_leaf(pmd) but inline */

Shouldn't this comment have been changed in patch 11 ?

>   if (unlikely(pmd_val(pmd) & _PAGE_HUGE))

Unlike pmd_huge() which is an outline function, pmd_leaf() is a macro so 
it could be used here instead of open coding.
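
i.e. the open-coded check could simply become (untested):

	if (unlikely(pmd_leaf(pmd)))
		return 0;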

>   return 0;
>   #endif
> diff --git a/arch/mips/include/asm/pgtable-64.h 
> b/arch/mips/include/asm/pgtable-64.h
> index 20ca48c1b606..7c28510b3768 100644
> --- a/arch/mips/include/asm/pgtable-64.h
> +++ b/arch/mips/include/asm/pgtable-64.h
> @@ -245,7 +245,7 @@ static inline int pmd_none(pmd_t pmd)
>   static inline int pmd_bad(pmd_t pmd)
>   {
>   #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
> - /* pmd_huge(pmd) but inline */
> + /* pmd_leaf(pmd) but inline */

Same

>   if (unlikely(pmd_val(pmd) & _PAGE_HUGE))

Same

>   return 0;
>   #endif

> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable-64k.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
> index 2fce3498b000..579a7153857f 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
> @@ -4,31 +4,6 @@
>   
>   #ifndef __ASSEMBLY__
>   #ifdef CONFIG_HUGETLB_PAGE
> -/*
> - * We have PGD_INDEX_SIZ = 12 and PTE_INDEX_SIZE = 8, so that we can have
> - * 16GB hugepage pte in PGD and 16MB hugepage pte at PMD;
> - *
> - * Defined in such a way that we can optimize away code block at build time
> - * if CONFIG_HUGETLB_PAGE=n.
> - *
> - * returns true for pmd migration entries, THP, devmap, hugetlb
> - * But compile time dependent on CONFIG_HUGETLB_PAGE
> - */

Should we keep this comment somewhere for documentation ?

> -static inline int pmd_huge(pmd_t pmd)
> -{
> - /*
> -  * leaf pte for huge page
> -  */
> - return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
> -}
> -
> -static inline int pud_huge(pud_t pud)
> -{
> - /*
> -  * leaf pte for huge page
> -  */
> - return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
> -}
>   
>   /*
>* With 64k page size, we have hugepage ptes in the pgd and pmd entries. We 
> don't


Re: [PATCH 11/13] mm/treewide: Replace pXd_huge() with pXd_leaf()

2024-03-14 Thread Christophe Leroy


Le 13/03/2024 à 22:47, pet...@redhat.com a écrit :
> From: Peter Xu 
> 
> Now after we're sure all pXd_huge() definitions are the same as pXd_leaf(),
> reuse it.  Luckily, pXd_huge() isn't widely used.
> 
> Signed-off-by: Peter Xu 
> ---
>   arch/arm/include/asm/pgtable-3level.h | 2 +-
>   arch/arm64/include/asm/pgtable.h  | 2 +-
>   arch/arm64/mm/hugetlbpage.c   | 4 ++--
>   arch/loongarch/mm/hugetlbpage.c   | 2 +-
>   arch/mips/mm/tlb-r4k.c| 2 +-
>   arch/powerpc/mm/pgtable_64.c  | 6 +++---
>   arch/x86/mm/pgtable.c | 4 ++--
>   mm/gup.c  | 4 ++--
>   mm/hmm.c  | 2 +-
>   mm/memory.c   | 2 +-
>   10 files changed, 15 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/arm/include/asm/pgtable-3level.h 
> b/arch/arm/include/asm/pgtable-3level.h
> index e7aecbef75c9..9e3c44f0aea2 100644
> --- a/arch/arm/include/asm/pgtable-3level.h
> +++ b/arch/arm/include/asm/pgtable-3level.h
> @@ -190,7 +190,7 @@ static inline pte_t pte_mkspecial(pte_t pte)
>   #define pmd_dirty(pmd)  (pmd_isset((pmd), L_PMD_SECT_DIRTY))
>   
>   #define pmd_hugewillfault(pmd)  (!pmd_young(pmd) || !pmd_write(pmd))
> -#define pmd_thp_or_huge(pmd) (pmd_huge(pmd) || pmd_trans_huge(pmd))
> +#define pmd_thp_or_huge(pmd) (pmd_leaf(pmd) || pmd_trans_huge(pmd))

Previous patch said pmd_trans_huge() implies pmd_leaf().

Or is that only for GUP ?

>   
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>   #define pmd_trans_huge(pmd) (pmd_val(pmd) && !pmd_table(pmd))


> diff --git a/mm/hmm.c b/mm/hmm.c
> index c95b9ec5d95f..93aebd9cc130 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -429,7 +429,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long 
> start, unsigned long end,
>   return hmm_vma_walk_hole(start, end, -1, walk);
>   }
>   
> - if (pud_huge(pud) && pud_devmap(pud)) {
> + if (pud_leaf(pud) && pud_devmap(pud)) {

Didn't previous patch say devmap implies leaf ? Or is it only for GUP ?

>   unsigned long i, npages, pfn;
>   unsigned int required_fault;
>   unsigned long *hmm_pfns;




Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()

2024-03-14 Thread Christophe Leroy


Le 13/03/2024 à 22:47, pet...@redhat.com a écrit :
> From: Peter Xu 
> 
> PowerPC book3s 4K mostly has the same definition on both, except pXd_huge()
> constantly returns 0 for hash MMUs.  As Michael Ellerman pointed out [1],
> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so
> it will keep returning false.
> 
> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
> such huge mappings for 4K hash MMUs.  Meanwhile, the major powerpc hugetlb
> pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb
> mappings.
> 
> The goal should be that we will have one API pXd_leaf() to detect all kinds
> of huge mappings.  AFAICT we need to use the pXd_leaf() impl (rather than
> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true.

All kinds of huge mappings ?

pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are 
also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages 
and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report 
those huge pages.

> 
> This helps to simplify a follow up patch to drop pXd_huge() treewide.
> 
> NOTE: *_leaf() definition need to be moved before the inclusion of
> asm/book3s/64/pgtable-4k.h, which defines pXd_huge() with it.
> 
> [1] https://lore.kernel.org/r/87v85zo6w7.fsf@mail.lhotse
> 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Christophe Leroy 
> Cc: "Aneesh Kumar K.V" 
> Cc: "Naveen N. Rao" 
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Peter Xu 
> ---
>   .../include/asm/book3s/64/pgtable-4k.h| 14 ++
>   arch/powerpc/include/asm/book3s/64/pgtable.h  | 27 +--
>   2 files changed, 14 insertions(+), 27 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable-4k.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
> index 48f21820afe2..92545981bb49 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
> @@ -8,22 +8,12 @@
>   #ifdef CONFIG_HUGETLB_PAGE
>   static inline int pmd_huge(pmd_t pmd)
>   {
> - /*
> -  * leaf pte for huge page
> -  */
> - if (radix_enabled())
> - return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
> - return 0;
> + return pmd_leaf(pmd);
>   }
>   
>   static inline int pud_huge(pud_t pud)
>   {
> - /*
> -  * leaf pte for huge page
> -  */
> - if (radix_enabled())
> - return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
> - return 0;
> + return pud_leaf(pud);
>   }
>   
>   /*
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index df66dce8306f..fd7180fded75 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -262,6 +262,18 @@ extern unsigned long __kernel_io_end;
>   
>   extern struct page *vmemmap;
>   extern unsigned long pci_io_base;
> +
> +#define pmd_leaf pmd_leaf
> +static inline bool pmd_leaf(pmd_t pmd)
> +{
> + return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
> +}
> +
> +#define pud_leaf pud_leaf
> +static inline bool pud_leaf(pud_t pud)
> +{
> + return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
> +}
>   #endif /* __ASSEMBLY__ */
>   
>   #include 
> @@ -1436,20 +1448,5 @@ static inline bool is_pte_rw_upgrade(unsigned long 
> old_val, unsigned long new_va
>   return false;
>   }
>   
> -/*
> - * Like pmd_huge(), but works regardless of config options
> - */
> -#define pmd_leaf pmd_leaf
> -static inline bool pmd_leaf(pmd_t pmd)
> -{
> - return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
> -}
> -
> -#define pud_leaf pud_leaf
> -static inline bool pud_leaf(pud_t pud)
> -{
> - return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
> -}
> -
>   #endif /* __ASSEMBLY__ */
>   #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */


Re: [PATCH v10 11/12] powerpc: mm: Use set_pte_at_unchecked() for early-boot / internal usages

2024-03-13 Thread Christophe Leroy


Le 13/03/2024 à 05:21, Rohan McLure a écrit :
> In the new set_ptes() API, set_pte_at() (a special case of set_ptes())
> is intended to be instrumented by the page table check facility. There
> are however several other routines that constitute the API for setting
> page table entries, including set_pmd_at() among others. Such routines
> are themselves implemented in terms of set_ptes_at().
> 
> A future patch providing support for page table checking on powerpc
> must take care to avoid duplicate calls to
> page_table_check_p{te,md,ud}_set(). Allow for assignment of pte entries
> without instrumentation through the set_pte_at_unchecked() routine
> introduced in this patch.
> 
> Cause API-facing routines that call set_pte_at() to instead call
> set_pte_at_unchecked(), which will remain uninstrumented by page
> table check. set_ptes() is itself implemented by calls to
> __set_pte_at(), so this eliminates redundant code.
> 
> Also prefer set_pte_at_unchecked() in early-boot usages which should not be
> instrumented.
> 
> Signed-off-by: Rohan McLure 
> ---
> v9: New patch
> v10: don't reuse __set_pte_at(), as that will not apply filters. Instead
> use new set_pte_at_unchecked().

Are filters needed at all in those use cases ?

> ---
>   arch/powerpc/include/asm/pgtable.h   | 2 ++
>   arch/powerpc/mm/book3s64/hash_pgtable.c  | 2 +-
>   arch/powerpc/mm/book3s64/pgtable.c   | 6 +++---
>   arch/powerpc/mm/book3s64/radix_pgtable.c | 8 
>   arch/powerpc/mm/nohash/book3e_pgtable.c  | 2 +-
>   arch/powerpc/mm/pgtable.c| 7 +++
>   arch/powerpc/mm/pgtable_32.c | 2 +-
>   7 files changed, 19 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pgtable.h 
> b/arch/powerpc/include/asm/pgtable.h
> index 3741a63fb82e..6ff1d8cfa216 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -44,6 +44,8 @@ struct mm_struct;
>   void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
>   pte_t pte, unsigned int nr);
>   #define set_ptes set_ptes
> +void set_pte_at_unchecked(struct mm_struct *mm, unsigned long addr,
> +   pte_t *ptep, pte_t pte);
>   #define update_mmu_cache(vma, addr, ptep) \
>   update_mmu_cache_range(NULL, vma, addr, ptep, 1)
>   
> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c 
> b/arch/powerpc/mm/book3s64/hash_pgtable.c
> index 988948d69bc1..871472f99a01 100644
> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
> @@ -165,7 +165,7 @@ int hash__map_kernel_page(unsigned long ea, unsigned long 
> pa, pgprot_t prot)
>   ptep = pte_alloc_kernel(pmdp, ea);
>   if (!ptep)
>   return -ENOMEM;
> - set_pte_at(&init_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT, prot));
> + set_pte_at_unchecked(&init_mm, ea, ptep, pfn_pte(pa >> 
> PAGE_SHIFT, prot));
>   } else {
>   /*
>* If the mm subsystem is not fully up, we cannot create a
> diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
> b/arch/powerpc/mm/book3s64/pgtable.c
> index 3438ab72c346..25082ab6018b 100644
> --- a/arch/powerpc/mm/book3s64/pgtable.c
> +++ b/arch/powerpc/mm/book3s64/pgtable.c
> @@ -116,7 +116,7 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
>   WARN_ON(!(pmd_large(pmd)));
>   #endif
>   trace_hugepage_set_pmd(addr, pmd_val(pmd));
> - return set_pte_at(mm, addr, pmdp_ptep(pmdp), pmd_pte(pmd));
> + return set_pte_at_unchecked(mm, addr, pmdp_ptep(pmdp), pmd_pte(pmd));
>   }
>   
>   void set_pud_at(struct mm_struct *mm, unsigned long addr,
> @@ -133,7 +133,7 @@ void set_pud_at(struct mm_struct *mm, unsigned long addr,
>   WARN_ON(!(pud_large(pud)));
>   #endif
>   trace_hugepage_set_pud(addr, pud_val(pud));
> - return set_pte_at(mm, addr, pudp_ptep(pudp), pud_pte(pud));
> + return set_pte_at_unchecked(mm, addr, pudp_ptep(pudp), pud_pte(pud));
>   }
>   
>   static void do_serialize(void *arg)
> @@ -539,7 +539,7 @@ void ptep_modify_prot_commit(struct vm_area_struct *vma, 
> unsigned long addr,
>   if (radix_enabled())
>   return radix__ptep_modify_prot_commit(vma, addr,
> ptep, old_pte, pte);
> - set_pte_at(vma->vm_mm, addr, ptep, pte);
> + set_pte_at_unchecked(vma->vm_mm, addr, ptep, pte);
>   }
>   
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
> b/arch/powerpc/mm/book3s64/radix_pgtable.c
> index 46fa46ce6526..c661e42bb2f1 100644
> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
> @@ -109,7 +109,7 @@ static int early_map_kernel_page(unsigned long ea, 
> unsigned long pa,
>   ptep = pte_offset_kernel(pmdp, ea);
>   
>   set_the_pte:
> - set_pte_at(&init_mm, ea, ptep, pfn_pte(pfn, flags));
> + set_pte_at_unchecked(&init_mm, ea, ptep, pfn_pte(pfn, 

Re: [PATCH v10 10/12] powerpc: mm: Implement *_user_accessible_page() for ptes

2024-03-13 Thread Christophe Leroy


Le 13/03/2024 à 05:21, Rohan McLure a écrit :
> Page table checking depends on architectures providing an
> implementation of p{te,md,ud}_user_accessible_page. With
> refactorisations made on powerpc/mm, the pte_access_permitted() and
> similar methods verify whether a userland page is accessible with the
> required permissions.
> 
> Since page table checking is the only user of
> p{te,md,ud}_user_accessible_page(), implement these for all platforms,
> using some of the same preliminay checks taken by pte_access_permitted()
> on that platform.
> 
> Since Commit 8e9bd41e4ce1 ("powerpc/nohash: Replace pte_user() by pte_read()")
> pte_user() is no longer required to be present on all platforms as it
> may be equivalent to or implied by pte_read(). Hence implementations are
> specialised.
> 
> Signed-off-by: Rohan McLure 
> ---
> v9: New implementation
> v10: Let book3s/64 use pte_user(), but otherwise default other platforms
> to using the address provided with the call to infer whether it is a
> user page or not. pmd/pud variants will warn on all other platforms, as
> they should not be used for user page mappings
> ---
>   arch/powerpc/include/asm/book3s/64/pgtable.h | 19 ++
>   arch/powerpc/include/asm/pgtable.h   | 26 
>   2 files changed, 45 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index 382724c5e872..ca765331e21d 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -538,6 +538,12 @@ static inline bool pte_access_permitted(pte_t pte, bool 
> write)
>   return arch_pte_access_permitted(pte_val(pte), write, 0);
>   }
>   
> +#define pte_user_accessible_page pte_user_accessible_page
> +static inline bool pte_user_accessible_page(pte_t pte, unsigned long addr)
> +{
> + return pte_present(pte) && pte_user(pte);
> +}
> +
>   /*
>* Conversion functions: convert a page and protection to a page entry,
>* and a page entry and page directory to the page they refer to.
> @@ -881,6 +887,7 @@ static inline int pud_present(pud_t pud)
>   
>   extern struct page *pud_page(pud_t pud);
>   extern struct page *pmd_page(pmd_t pmd);
> +

Garbage ?

>   static inline pte_t pud_pte(pud_t pud)
>   {
>   return __pte_raw(pud_raw(pud));
> @@ -926,6 +933,12 @@ static inline bool pud_access_permitted(pud_t pud, bool 
> write)
>   return pte_access_permitted(pud_pte(pud), write);
>   }
>   
> +#define pud_user_accessible_page pud_user_accessible_page
> +static inline bool pud_user_accessible_page(pud_t pud, unsigned long addr)
> +{
> + return pte_user_accessible_page(pud_pte(pud), addr);
> +}
> +

If I understand what is done on arm64, you should first check 
pud_leaf(). Then this function could be common to all powerpc platforms, 
only pte_user_accessible_page() would be platform specific.
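
Something like this sketch (on top of this series, untested):

	#define pud_user_accessible_page pud_user_accessible_page
	static inline bool pud_user_accessible_page(pud_t pud, unsigned long addr)
	{
		return pud_leaf(pud) && pte_user_accessible_page(pud_pte(pud), addr);
	}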

>   #define __p4d_raw(x)((p4d_t) { __pgd_raw(x) })
>   static inline __be64 p4d_raw(p4d_t x)
>   {
> @@ -1091,6 +1104,12 @@ static inline bool pmd_access_permitted(pmd_t pmd, 
> bool write)
>   return pte_access_permitted(pmd_pte(pmd), write);
>   }
>   
> +#define pmd_user_accessible_page pmd_user_accessible_page
> +static inline bool pmd_user_accessible_page(pmd_t pmd, unsigned long addr)
> +{
> + return pte_user_accessible_page(pmd_pte(pmd), addr);
> +}

Same, pmd_leaf() should be checked.

> +
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>   extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
>   extern pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot);
> diff --git a/arch/powerpc/include/asm/pgtable.h 
> b/arch/powerpc/include/asm/pgtable.h
> index 13f661831333..3741a63fb82e 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -227,6 +227,32 @@ static inline int pud_pfn(pud_t pud)
>   }
>   #endif
>   
> +#ifndef pte_user_accessible_page
> +#define pte_user_accessible_page pte_user_accessible_page
> +static inline bool pte_user_accessible_page(pte_t pte, unsigned long addr)
> +{
> + return pte_present(pte) && !is_kernel_addr(addr);
> +}
> +#endif

I would prefer to see one version in asm/book3s/32/pgtable.h and one in 
asm/nohash/pgtable.h and then avoid this game with ifdefs.

> +
> +#ifndef pmd_user_accessible_page
> +#define pmd_user_accessible_page pmd_user_accessible_page
> +static inline bool pmd_user_accessible_page(pmd_t pmd, unsigned long addr)
> +{
> + WARN_ONCE(1, "pmd: platform does not use pmd entries directly");
> + return false;
> +}
> +#endif

Also check pmd_leaf() and this function on all platforms.

> +
> +#ifndef pud_user_accessible_page
> +#define pud_user_accessible_page pud_user_accessible_page
> +static inline bool pud_user_accessible_page(pud_t pud, unsigned long addr)
> +{
> + WARN_ONCE(1, "pud: platform does not use pud entries directly");
> + return false;
> +}

Also check pud_leaf() and this function on all platforms.

Re: [PATCH v10 09/12] powerpc: mm: Add common pud_pfn stub for all platforms

2024-03-13 Thread Christophe Leroy


Le 13/03/2024 à 05:21, Rohan McLure a écrit :
> Prior to this commit, pud_pfn was implemented with BUILD_BUG as the inline
> function for 64-bit Book3S systems but is never included, as its
> invocations in generic code are guarded by calls to pud_devmap which return
> zero on such systems. A future patch will provide support for page table
> checks, the generic code for which depends on a pud_pfn stub being
> implemented, even while the patch will not interact with puds directly.
> 
> Remove the 64-bit Book3S stub and define pud_pfn to warn on all
> platforms. pud_pfn may be defined properly on a per-platform basis
> should it grow real usages in future.

Can you please re-explain why that's needed ? I remember we discussed it 
already in the past, but I checked again today and can't see the need:

In mm/page_table_check.c, the call to pud_pfn() is gated by a call to 
pud_user_accessible_page(pud). If I look into arm64 version of 
pud_user_accessible_page(), it depends on pud_leaf(). When pud_leaf() is 
constant 0, pud_user_accessible_page() is always false and the call to 
pud_pfn() should be folded away.
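
Schematically, the generic caller has this shape (paraphrased, not the
exact mm/page_table_check.c code):

	if (pud_user_accessible_page(pud, addr)) {
		/* dead code when pud_leaf() is constant 0, so the
		 * reference to pud_pfn() gets compiled out */
		page_table_check_set(mm, addr, pud_pfn(pud), ...);
	}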

> 
> Signed-off-by: Rohan McLure 
> ---
>   arch/powerpc/include/asm/pgtable.h | 14 ++
>   1 file changed, 14 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/pgtable.h 
> b/arch/powerpc/include/asm/pgtable.h
> index 0c0ffbe7a3b5..13f661831333 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -213,6 +213,20 @@ static inline bool 
> arch_supports_memmap_on_memory(unsigned long vmemmap_size)
>   
>   #endif /* CONFIG_PPC64 */
>   
> +/*
> + * Currently only consumed by page_table_check_pud_{set,clear}. Since clears
> + * and sets to page table entries at any level are done through
> + * page_table_check_pte_{set,clear}, provide stub implementation.
> + */
> +#ifndef pud_pfn
> +#define pud_pfn pud_pfn
> +static inline int pud_pfn(pud_t pud)
> +{
> + WARN_ONCE(1, "pud: platform does not use pud entries directly");
> + return 0;
> +}
> +#endif
> +
>   #endif /* __ASSEMBLY__ */
>   
>   #endif /* _ASM_POWERPC_PGTABLE_H */


Re: [PATCH v10 08/12] powerpc: mm: Replace p{u,m,4}d_is_leaf with p{u,m,4}_leaf

2024-03-13 Thread Christophe Leroy
Hi,

Le 13/03/2024 à 05:21, Rohan McLure a écrit :
> Replace occurrences of p{u,m,4}d_is_leaf with p{u,m,4}_leaf, as the
> latter is the name given to checking that a higher-level entry in
> multi-level paging contains a page translation entry (pte) throughout
> all other archs.

There's already an equivalent commit in mm-stable, that will likely go 
into v6.9:

https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-stable=bd18b688220c7225fb50498dabd9f9d0c9988e67



> 
> Reviewed-by: Christophe Leroy 
> Signed-off-by: Rohan McLure 
> ---
> v9: No longer required in order to implement page table check, just a
> refactor.
> v10: Fix more occurrences, and just delete p{u,m,4}_is_leaf() stubs as
> equivalent p{u,m,4}_leaf() stubs already exist.
> ---
>   arch/powerpc/include/asm/book3s/64/pgtable.h | 10 
>   arch/powerpc/include/asm/pgtable.h   | 24 
>   arch/powerpc/kvm/book3s_64_mmu_radix.c   | 12 +-
>   arch/powerpc/mm/book3s64/radix_pgtable.c | 14 ++--
>   arch/powerpc/mm/pgtable.c|  6 ++---
>   arch/powerpc/mm/pgtable_64.c |  6 ++---
>   arch/powerpc/xmon/xmon.c |  6 ++---
>   7 files changed, 26 insertions(+), 52 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index 62c43d3d80ec..382724c5e872 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -1443,16 +1443,14 @@ static inline bool is_pte_rw_upgrade(unsigned long 
> old_val, unsigned long new_va
>   /*
>* Like pmd_huge() and pmd_large(), but works regardless of config options
>*/
> -#define pmd_is_leaf pmd_is_leaf
> -#define pmd_leaf pmd_is_leaf
> -static inline bool pmd_is_leaf(pmd_t pmd)
> +#define pmd_leaf pmd_leaf
> +static inline bool pmd_leaf(pmd_t pmd)
>   {
>   return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
>   }
>   
> -#define pud_is_leaf pud_is_leaf
> -#define pud_leaf pud_is_leaf
> -static inline bool pud_is_leaf(pud_t pud)
> +#define pud_leaf pud_leaf
> +static inline bool pud_leaf(pud_t pud)
>   {
>   return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
>   }
> diff --git a/arch/powerpc/include/asm/pgtable.h 
> b/arch/powerpc/include/asm/pgtable.h
> index 9224f23065ff..0c0ffbe7a3b5 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -180,30 +180,6 @@ static inline void pte_frag_set(mm_context_t *ctx, void 
> *p)
>   }
>   #endif
>   
> -#ifndef pmd_is_leaf
> -#define pmd_is_leaf pmd_is_leaf
> -static inline bool pmd_is_leaf(pmd_t pmd)
> -{
> - return false;
> -}
> -#endif
> -
> -#ifndef pud_is_leaf
> -#define pud_is_leaf pud_is_leaf
> -static inline bool pud_is_leaf(pud_t pud)
> -{
> - return false;
> -}
> -#endif
> -
> -#ifndef p4d_is_leaf
> -#define p4d_is_leaf p4d_is_leaf
> -static inline bool p4d_is_leaf(p4d_t p4d)
> -{
> - return false;
> -}
> -#endif
> -
>   #define pmd_pgtable pmd_pgtable
>   static inline pgtable_t pmd_pgtable(pmd_t pmd)
>   {
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
> b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> index 4a1abb9f7c05..408d98f8a514 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> @@ -503,7 +503,7 @@ static void kvmppc_unmap_free_pmd(struct kvm *kvm, pmd_t 
> *pmd, bool full,
>   for (im = 0; im < PTRS_PER_PMD; ++im, ++p) {
>   if (!pmd_present(*p))
>   continue;
> - if (pmd_is_leaf(*p)) {
> + if (pmd_leaf(*p)) {
>   if (full) {
>   pmd_clear(p);
>   } else {
> @@ -532,7 +532,7 @@ static void kvmppc_unmap_free_pud(struct kvm *kvm, pud_t 
> *pud,
>   for (iu = 0; iu < PTRS_PER_PUD; ++iu, ++p) {
>   if (!pud_present(*p))
>   continue;
> - if (pud_is_leaf(*p)) {
> + if (pud_leaf(*p)) {
>   pud_clear(p);
>   } else {
>   pmd_t *pmd;
> @@ -635,12 +635,12 @@ int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, 
> pte_t pte,
>   new_pud = pud_alloc_one(kvm->mm, gpa);
>   
>   pmd = NULL;
> - if (pud && pud_present(*pud) && !pud_is_leaf(*pud))
> + if (pud && pud_present(*pud) && !pud_leaf(*pud))
>   pmd = pmd_offset(pud, gpa);
>   else if (level <= 1)
>   new_pmd = kvmppc_pmd_alloc();
>   
> -

Re: [PATCH v3 07/12] powerpc: Use initializer for struct vm_unmapped_area_info

2024-03-13 Thread Christophe Leroy


Le 12/03/2024 à 23:28, Rick Edgecombe a écrit :
> Future changes will need to add a new member to struct
> vm_unmapped_area_info. This would cause trouble for any call site that
> doesn't initialize the struct. Currently every caller sets each member
> manually, so if new members are added they will be uninitialized and the
> core code parsing the struct will see garbage in the new member.
> 
> It could be possible to initialize the new member manually to 0 at each
> call site. This and a couple other options were discussed, and a working
> consensus (see links) was that in general the best way to accomplish this
> would be via static initialization with designated member initiators.
> Having some struct vm_unmapped_area_info instances not zero initialized
> will put those sites at risk of feeding garbage into vm_unmapped_area() if
> the convention is to zero initialize the struct and any new member addition
> misses a call site that initializes each member manually.
> 
> It could be possible to leave the code mostly untouched, and just change
> the line:
> struct vm_unmapped_area_info info
> to:
> struct vm_unmapped_area_info info = {};
> 
> However, that would leave cleanup for the members that are manually set
> to zero, as it would no longer be required.
> 
> So to be reduce the chance of bugs via uninitialized members, instead
> simply continue the process to initialize the struct this way tree wide.
> This will zero any unspecified members. Move the member initializers to the
> struct declaration when they are known at that time. Leave the members out
> that were manually initialized to zero, as this would be redundant for
> designated initializers.

I understand from this text that, as agreed, this patch removes the 
pointless/redundant zero-init of individual members. But it is not what 
is done, see below ?

> 
> Signed-off-by: Rick Edgecombe 
> Acked-by: Michael Ellerman 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Christophe Leroy 
> Cc: Aneesh Kumar K.V 
> Cc: Naveen N. Rao 
> Cc: linuxppc-dev@lists.ozlabs.org
> Link: https://lore.kernel.org/lkml/202402280912.33AEE7A9CF@keescook/#t
> Link: 
> https://lore.kernel.org/lkml/j7bfvig3gew3qruouxrh7z7ehjjafrgkbcmg6tcghhfh3rhmzi@wzlcoecgy5rs/
> ---
> v3:
>   - Fixed spelling errors in log
>   - Be consistent about field vs member in log
> 
> Hi,
> 
> This patch was split and refactored out of a tree-wide change [0] to just
> zero-init each struct vm_unmapped_area_info. The overall goal of the
> series is to help shadow stack guard gaps. Currently, there is only one
> arch with shadow stacks, but two more are in progress. It is compile tested
> only.
> 
> There was further discussion that this method of initializing the structs
> while nice in some ways has a greater risk of introducing bugs in some of
> the more complicated callers. Since this version was reviewed my arch
> maintainers already, leave it as was already acknowledged.
> 
> Thanks,
> 
> Rick
> 
> [0] 
> https://lore.kernel.org/lkml/20240226190951.3240433-6-rick.p.edgeco...@intel.com/
> ---
>   arch/powerpc/mm/book3s64/slice.c | 23 ---
>   1 file changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/mm/book3s64/slice.c 
> b/arch/powerpc/mm/book3s64/slice.c
> index c0b58afb9a47..6c7ac8c73a6c 100644
> --- a/arch/powerpc/mm/book3s64/slice.c
> +++ b/arch/powerpc/mm/book3s64/slice.c
> @@ -282,12 +282,12 @@ static unsigned long slice_find_area_bottomup(struct 
> mm_struct *mm,
>   {
>   int pshift = max_t(int, mmu_psize_defs[psize].shift, PAGE_SHIFT);
>   unsigned long found, next_end;
> - struct vm_unmapped_area_info info;
> -
> - info.flags = 0;
> - info.length = len;
> - info.align_mask = PAGE_MASK & ((1ul << pshift) - 1);
> - info.align_offset = 0;
> + struct vm_unmapped_area_info info = {
> + .flags = 0,

Please remove zero-init as agreed and explained in the commit message

> + .length = len,
> + .align_mask = PAGE_MASK & ((1ul << pshift) - 1),
> + .align_offset = 0

Same here.

> + };
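
i.e. simply:

	struct vm_unmapped_area_info info = {
		.length = len,
		.align_mask = PAGE_MASK & ((1ul << pshift) - 1),
	};

with the remaining members left to the implicit zero-initialisation.
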
>   /*
>* Check till the allow max value for this mmap request
>*/
> @@ -326,13 +326,14 @@ static unsigned long slice_find_area_topdown(struct 
> mm_struct *mm,
>   {
>   int pshift = max_t(int, mmu_psize_defs[psize].shift, PAGE_SHIFT);
>   unsigned long found, prev;
> - struct vm_unmapped_area_info info;
> + struct vm_unmapped_area_info info = {
> + .flags = VM_UNMAPPED_AREA_TOPDOWN,
> + .length = len,
> + .align_mask = PAGE_MASK & ((1ul << pshift) - 1),
>

Re: [PATCH v5 02/10] locking/mutex: introduce devm_mutex_init

2024-03-12 Thread Christophe Leroy


Le 12/03/2024 à 16:30, George Stark a écrit :
> 
> Hello Christophe
> 
> On 3/12/24 14:51, Christophe Leroy wrote:
>>
>>
>> Le 12/03/2024 à 12:39, George Stark a écrit :
> 
> ...
> 
>> You don't need that inline function, just change debug_devm_mutex_init()
>> to __devm_mutex_init().
> 
> I stuck to debug_* name because mutex-debug.c already exports a set
> of debug_ calls so...

Ah yes you are right I didn't see that. On the other hand all those 
debug_mutex_* are used by kernel/locking/mutex.c.
Here we really don't want our new function to be called by anything else 
than devm_mutex_init so by calling it __devm_mutex_init() you kind of 
tie them together.

> Well it's not essential anyway. Here's the next try:

Looks good to me.

> 
> diff --git a/include/linux/mutex.h b/include/linux/mutex.h
> index 67edc4ca2bee..537b5ea18ceb 100644
> --- a/include/linux/mutex.h
> +++ b/include/linux/mutex.h
> @@ -22,6 +22,8 @@
>   #include 
>   #include 
> 
> +struct device;
> +
>   #ifdef CONFIG_DEBUG_LOCK_ALLOC
>   # define __DEP_MAP_MUTEX_INITIALIZER(lockname)    \
>     , .dep_map = {  \
> @@ -117,6 +119,29 @@ do 
> {   \
>   } while (0)
>   #endif /* CONFIG_PREEMPT_RT */
> 
> +#ifdef CONFIG_DEBUG_MUTEXES
> +
> +int __devm_mutex_init(struct device *dev, struct mutex *lock);
> +
> +#else
> +
> +static inline int __devm_mutex_init(struct device *dev, struct mutex 
> *lock)
> +{
> +   /*
> +    * When CONFIG_DEBUG_MUTEXES is off mutex_destroy is just a nop so
> +    * no really need to register it in devm subsystem.
> +    */
> +   return 0;
> +}
> +
> +#endif
> +
> +#define devm_mutex_init(dev, mutex)    \
> +({ \
> +   mutex_init(mutex);  \
> +   __devm_mutex_init(dev, mutex);  \
> +})
> +
>   /*
>    * See kernel/locking/mutex.c for detailed documentation of these APIs.
>    * Also see Documentation/locking/mutex-design.rst.
> diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
> index bc8abb8549d2..6aa77e3dc82e 100644
> --- a/kernel/locking/mutex-debug.c
> +++ b/kernel/locking/mutex-debug.c
> @@ -19,6 +19,7 @@
>   #include 
>   #include 
>   #include 
> +#include <linux/device.h>
> 
>   #include "mutex.h"
> 
> @@ -89,6 +90,16 @@ void debug_mutex_init(struct mutex *lock, const char
> *name,
>     lock->magic = lock;
>   }
> 
> +static void devm_mutex_release(void *res)
> +{
> +   mutex_destroy(res);
> +}
> +
> +int __devm_mutex_init(struct device *dev, struct mutex *lock)
> +{
> +   return devm_add_action_or_reset(dev, devm_mutex_release, lock);
> +}
> +
>   /***
>    * mutex_destroy - mark a mutex unusable
>    * @lock: the mutex to be destroyed
> -- 
> 2.25.1
> 
> 
> 
>>> +
>>> +#else
>>> +
>>> +static inline int __devm_mutex_init(struct device *dev, struct mutex
>>> *lock)
>>> +{
>>> +   /*
>>> +   * When CONFIG_DEBUG_MUTEXES is off mutex_destroy is just a 
>>> nop so
>>> +   * no really need to register it in devm subsystem.
>>> +   */
>>
>> Don't know if it is because tabs are replaced by blanks in you email,
>> but the stars should be aligned
> 
> Ack
> 
> 
> -- 
> Best regards
> George


Re: [PATCH v5 02/10] locking/mutex: introduce devm_mutex_init

2024-03-12 Thread Christophe Leroy


Le 12/03/2024 à 12:39, George Stark a écrit :
> 
> Hello Christophe
> 
> Thanks for the review
> You were right about typecheck - it was meant to check errors even if
> CONFIG_DEBUG_MUTEXES was off.

Yes that's current practice in order to catch problems as soon as possible.

> 
> Here's new version based on the comments:
> 
> diff --git a/include/linux/mutex.h b/include/linux/mutex.h
> index 67edc4ca2bee..9193b163038f 100644
> --- a/include/linux/mutex.h
> +++ b/include/linux/mutex.h
> @@ -22,6 +22,8 @@
>   #include 
>   #include 
> 
> +struct device;
> +
>   #ifdef CONFIG_DEBUG_LOCK_ALLOC
>   # define __DEP_MAP_MUTEX_INITIALIZER(lockname)    \
>     , .dep_map = {  \
> @@ -117,6 +119,34 @@ do 
> {   \
>   } while (0)
>   #endif /* CONFIG_PREEMPT_RT */
> 
> +#ifdef CONFIG_DEBUG_MUTEXES
> +
> +int debug_devm_mutex_init(struct device *dev, struct mutex *lock);
> +
> +static inline int __devm_mutex_init(struct device *dev, struct mutex 
> *lock)
> +{
> +   return debug_devm_mutex_init(dev, lock);
> +}

You don't need that inline function, just change debug_devm_mutex_init() 
to __devm_mutex_init().

> +
> +#else
> +
> +static inline int __devm_mutex_init(struct device *dev, struct mutex 
> *lock)
> +{
> +   /*
> +   * When CONFIG_DEBUG_MUTEXES is off mutex_destroy is just a nop so
> +   * no really need to register it in devm subsystem.
> +   */

Don't know if it is because tabs are replaced by blanks in your email, 
but the stars should be aligned

/* ...
  * ...
  */

> +   return 0;
> +}
> +
> +#endif
> +
> +#define devm_mutex_init(dev, mutex)    \
> +({ \
> +   mutex_init(mutex);  \
> +   __devm_mutex_init(dev, mutex);  \
> +})
> +
>   /*
>    * See kernel/locking/mutex.c for detailed documentation of these APIs.
>    * Also see Documentation/locking/mutex-design.rst.
> diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
> index bc8abb8549d2..967a5367c79a 100644
> --- a/kernel/locking/mutex-debug.c
> +++ b/kernel/locking/mutex-debug.c
> @@ -19,6 +19,7 @@
>   #include 
>   #include 
>   #include 
> +#include <linux/device.h>
> 
>   #include "mutex.h"
> 
> @@ -89,6 +90,16 @@ void debug_mutex_init(struct mutex *lock, const char
> *name,
>     lock->magic = lock;
>   }
> 
> +static void devm_mutex_release(void *res)
> +{
> +   mutex_destroy(res);
> +}
> +
> +int debug_devm_mutex_init(struct device *dev, struct mutex *lock)

Rename __devm_mutex_init();

It makes it more clear that nobody is expected to call it directly.

> +{
> +   return devm_add_action_or_reset(dev, devm_mutex_release, lock);
> +}
> +
>   /***
>    * mutex_destroy - mark a mutex unusable
>    * @lock: the mutex to be destroyed
> -- 
> 2.25.1
> 
> 



Re: [PATCH] powerpc/kernel: Fix potential spectre v1 in syscall

2024-03-12 Thread Christophe Leroy
+Nathan as this is RTAS related.

Le 21/08/2018 à 20:42, Breno Leitao a écrit :
> The rtas syscall reads a value from a user-provided structure and uses it
> to index an array, being a possible area for a potential spectre v1 attack.
> This is the code that exposes this problem.
> 
>   args.rets = &args.args[nargs];
> 
> The nargs is an user provided value, and the below code is an example where
> the 'nargs' value would be set to XX.
> 
>   struct rtas_args ra;
>   ra.nargs = htobe32(XX);
>   syscall(__NR_rtas, &ra);


This patch has been hanging around in patchwork since 2018 and doesn't 
apply anymore. Is it still relevant ? If so, can you rebase and resubmit ?

Thanks
Christophe


> 
> Signed-off-by: Breno Leitao 
> ---
>   arch/powerpc/kernel/rtas.c | 6 --
>   1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index 8afd146bc9c7..5ef3c863003d 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -27,6 +27,7 @@
>   #include 
>   #include 
>   #include 
> +#include <linux/nospec.h>
>   
>   #include 
>   #include 
> @@ -1056,7 +1057,7 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
>   struct rtas_args args;
>   unsigned long flags;
>   char *buff_copy, *errbuf = NULL;
> - int nargs, nret, token;
> + int index, nargs, nret, token;
>   
>   if (!capable(CAP_SYS_ADMIN))
>   return -EPERM;
> @@ -1084,7 +1085,8 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
>   if (token == RTAS_UNKNOWN_SERVICE)
>   return -EINVAL;
>   
> - args.rets = &args.args[nargs];
> + index = array_index_nospec(nargs, ARRAY_SIZE(args.args));
> + args.rets = &args.args[index];
>   memset(args.rets, 0, nret * sizeof(rtas_arg_t));
>   
>   /* Need to handle ibm,suspend_me call specially */
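
For context: array_index_nospec(nargs, ARRAY_SIZE(args.args)) behaves
roughly like

	index = nargs < ARRAY_SIZE(args.args) ? nargs : 0;

except that it is computed branchlessly with a mask, so the clamping also
holds under speculative execution and the user-controlled index can no
longer serve as a spectre-v1 gadget.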


Re: [PATCH 1/2] powerpc: Flush checkpointed gpr state for 32-bit processes in ptrace

2024-03-12 Thread Christophe Leroy


Le 19/06/2018 à 21:54, Pedro Franco de Carvalho a écrit :
> Would something like this be ok?
> 
> I factored out the calls to flush_fp_to_thread and flush_altivec_to_thread,
> although I don't really understand why these are necessary. The 
> tm_cvsx_get/set
> functions also calls flush_vsx_to_thread (outside of the helper function).
> 
> I also noticed that tm_ppr/dscr/tar_get/set functions don't flush the tm
> state. Should they do it, and if so, should they also flush the fp and altivec
> state?
> 
> Thanks!
> Pedro
> 
> -- >8 --
> Currently ptrace doesn't flush the register state when the
> checkpointed GPRs of a 32-bit thread are accessed. This can cause core
> dumps to have stale data in the checkpointed GPR note.
> 
> This patch adds a helper function to flush the TM, fpu and altivec
> state and calls it from the tm_cgpr32_get/set functions.

This patch is almost 6 years old and doesn't apply anymore.

If someone thinks it is still relevant, please rebase and resubmit.

Thanks
Christophe

> 
> Signed-off-by: Pedro Franco de Carvalho 
> ---
>   arch/powerpc/kernel/ptrace.c | 33 +
>   1 file changed, 33 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
> index 9667666eb18e..0d56857e1e89 100644
> --- a/arch/powerpc/kernel/ptrace.c
> +++ b/arch/powerpc/kernel/ptrace.c
> @@ -778,6 +778,29 @@ static int evr_set(struct task_struct *target, const 
> struct user_regset *regset,
>   
>   #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
>   /**
> + * tm_flush_if_active - flush TM, fpu and altivec state if TM active
> + * @target:  The target task.
> + *
> + * This function flushes the TM, fpu and altivec state to the target
> + * task and returns 0 if TM is available and active in the target, and
> + * returns an error code suitable for ptrace otherwise.
> + */
> +static int tm_flush_if_active (struct task_struct *target)
> +{
> + if (!cpu_has_feature(CPU_FTR_TM))
> + return -ENODEV;
> +
> + if (!MSR_TM_ACTIVE(target->thread.regs->msr))
> + return -ENODATA;
> +
> + flush_tmregs_to_thread(target);
> + flush_fp_to_thread(target);
> + flush_altivec_to_thread(target);
> +
> + return 0;
> +}
> +
> +/**
>* tm_cgpr_active - get active number of registers in CGPR
>* @target: The target task.
>* @regset: The user regset structure.
> @@ -2124,6 +2147,11 @@ static int tm_cgpr32_get(struct task_struct *target,
>unsigned int pos, unsigned int count,
>void *kbuf, void __user *ubuf)
>   {
> + int ret = tm_flush_if_active(target);
> +
> + if (ret)
> + return ret;
> +
>   return gpr32_get_common(target, regset, pos, count, kbuf, ubuf,
>   &target->thread.ckpt_regs.gpr[0]);
>   }
> @@ -2133,6 +2161,11 @@ static int tm_cgpr32_set(struct task_struct *target,
>unsigned int pos, unsigned int count,
>const void *kbuf, const void __user *ubuf)
>   {
> + int ret = tm_flush_if_active(target);
> +
> + if (ret)
> + return ret;
> +
>   return gpr32_set_common(target, regset, pos, count, kbuf, ubuf,
>   &target->thread.ckpt_regs.gpr[0]);
>   }


Re: [PATCH] powerpc: build-time fixup alternate feature relative addresses

2024-03-12 Thread Christophe Leroy


Le 29/01/2024 à 07:25, Sathvika Vasireddy a écrit :
> Hi Christophe, Nick
> 
> On 1/26/24 12:32 AM, Christophe Leroy wrote:
>> Hi Nic,
>>
>> Le 21/05/2017 à 03:01, Nicholas Piggin a écrit :
>>> Implement build-time fixup of alternate feature relative addresses for
>>> the out-of-line ("else") patch code. This is done post-link with a new
>>> powerpc build tool that parses relocations and fixup structures, and
>>> adjusts branch instructions.
>>>
>>> This gives us the ability to link patch code anywhere in the kernel,
>>> without branches to targets outside the patch code having to be
>>> reached directly (without a linker stub). This allows patch code to be
>>> moved out from the head section, and avoids build failures with
>>> unresolvable branches.
>>
>> Is it worth keeping this hanging in patchwork ? It seems outdated and 
>> doesn't apply. Could this be done with objtool instead ?
>>
>> Christophe
> 
> Yes, this can be done with objtool. I am working on this and will post 
> an RFC this week.
> 

Nice.

I've opened an issue to track this at 
https://github.com/linuxppc/issues/issues/479 and have retired Nic's 
patch in patchwork.

Christophe


Re: [PATCH v5 02/10] locking/mutex: introduce devm_mutex_init

2024-03-12 Thread Christophe Leroy


Le 12/03/2024 à 00:47, George Stark a écrit :
> 
> Hello Waiman, Marek
> 
> Thanks for the review.
> 
> I've never used lockdep for debug but it seems preferable to
> keep that feature working. It could be look like this:

For sure it is a must. I'm not used to it either, hence my oversight.

> 
> 
> diff --git a/include/linux/mutex.h b/include/linux/mutex.h
> index f7611c092db7..574f6de6084d 100644
> --- a/include/linux/mutex.h
> +++ b/include/linux/mutex.h
> @@ -22,6 +22,8 @@
>   #include 
>   #include 
> 
> +struct device;
> +
>   #ifdef CONFIG_DEBUG_LOCK_ALLOC
>   # define __DEP_MAP_MUTEX_INITIALIZER(lockname)    \
>     , .dep_map = {  \
> @@ -115,10 +117,31 @@ do 
> {  \
> 
>   #ifdef CONFIG_DEBUG_MUTEXES
> 
> +int debug_devm_mutex_init(struct device *dev, struct mutex *lock);
> +
> +#define devm_mutex_init(dev, mutex)    \
> +({ \
> +   int ret;    \
> +   mutex_init(mutex);  \
> +   ret = debug_devm_mutex_init(dev, mutex);    \
> +   ret;    \
> +})
> +

I think it would be preferable to minimise the number of macros.

If I were you I would keep your devm_mutex_init() as is but rename it 
__devm_mutex_init() and just remove the mutex_init() from it, then add 
only one macro that works independently of CONFIG_DEBUG_MUTEXES:

#define devm_mutex_init(dev, mutex)	\
({					\
	mutex_init(mutex);		\
	__devm_mutex_init(dev, mutex);	\
})

With that, there is no need for a second version of the macro and no need 
for the typecheck either.

Note the __ which is a clear indication that although that function is 
declared in public mutex.h, it is not meant to be used outside of it.



>   void mutex_destroy(struct mutex *lock);
> 
>   #else
> 
> +/*
> +* When CONFIG_DEBUG_MUTEXES is off mutex_destroy is just a nop so
> +* there's no really need to register it in devm subsystem.
> +*/
> +#define devm_mutex_init(dev, mutex)    \
> +({ \
> +   typecheck(struct device *, dev);    \
> +   mutex_init(mutex);  \
> +   0;  \
> +})
> +
>   static inline void mutex_destroy(struct mutex *lock) {}
> 
>   #endif
> diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
> index bc8abb8549d2..967a5367c79a 100644
> --- a/kernel/locking/mutex-debug.c
> +++ b/kernel/locking/mutex-debug.c
> @@ -19,6 +19,7 @@
>   #include 
>   #include 
>   #include 
> +#include <linux/device.h>
> 
>   #include "mutex.h"
> 
> @@ -89,6 +90,16 @@ void debug_mutex_init(struct mutex *lock, const char
> *name,
>     lock->magic = lock;
>   }
> 
> +static void devm_mutex_release(void *res)
> +{
> +   mutex_destroy(res);
> +}
> +
> +int debug_devm_mutex_init(struct device *dev, struct mutex *lock)
> +{
> +   return devm_add_action_or_reset(dev, devm_mutex_release, lock);
> +}
> +
>   /***
>    * mutex_destroy - mark a mutex unusable
>    * @lock: the mutex to be destroyed
> -- 
> 2.25.1
> 
> 
> 
> And now I would drop the the refactoring patch with moving down
> mutex_destroy. devm block is big enough to be declared standalone.
> 
> 
> On 3/7/24 19:44, Marek Behún wrote:
>> On Thu, 7 Mar 2024 08:39:46 -0500
>> Waiman Long  wrote:
>>
>>> On 3/7/24 04:56, Marek Behún wrote:
>>>> On Thu, Mar 07, 2024 at 05:40:26AM +0300, George Stark wrote:
>>>>> Using of devm API leads to a certain order of releasing resources.
>>>>> So all dependent resources which are not devm-wrapped should be 
>>>>> deleted
>>>>> with respect to devm-release order. Mutex is one of such objects that
>>>>> often is bound to other resources and has no own devm wrapping.
>>>>> Since mutex_destroy() actually does nothing in non-debug builds
>>>>> frequently calling mutex_destroy() is just ignored which is safe 
>>>>> for now
>>>>> but wrong formally and can lead to a problem if mutex_destroy() 
>>>>> will be
>>>>> extended so introduce devm_mutex_init()
>>>>>
>>>

Re: [PATCH v5 02/10] locking/mutex: introduce devm_mutex_init

2024-03-11 Thread Christophe Leroy


Le 12/03/2024 à 02:10, Waiman Long a écrit :
> On 3/11/24 19:47, George Stark wrote:
>> Hello Waiman, Marek
>>
>> Thanks for the review.
>>
>> I've never used lockdep for debug but it seems preferable to
>> keep that feature working. It could be look like this:
>>
>> diff --git a/include/linux/mutex.h b/include/linux/mutex.h
>> index f7611c092db7..574f6de6084d 100644
>> --- a/include/linux/mutex.h
>> +++ b/include/linux/mutex.h
>> @@ -22,6 +22,8 @@
>>  #include 
>>  #include 
>>
>> +struct device;
>> +
>>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>>  # define __DEP_MAP_MUTEX_INITIALIZER(lockname)    \
>>  , .dep_map = {    \
>> @@ -115,10 +117,31 @@ do {    \
>>
>>  #ifdef CONFIG_DEBUG_MUTEXES
>>
>> +int debug_devm_mutex_init(struct device *dev, struct mutex *lock);
>> +
>> +#define devm_mutex_init(dev, mutex)    \
>> +({    \
>> +    int ret;    \
>> +    mutex_init(mutex);    \
>> +    ret = debug_devm_mutex_init(dev, mutex);    \
>> +    ret;    \
>> +})
> 
> The int ret variable is not needed. The macro can just end with 
> debug_devm_mutex_init().
> 
> 
>> +
>>  void mutex_destroy(struct mutex *lock);
>>
>>  #else
>>
>> +/*
>> +* When CONFIG_DEBUG_MUTEXES is off mutex_destroy is just a nop so
>> +* there's no really need to register it in devm subsystem.
> "no really need"?
>> +*/
>> +#define devm_mutex_init(dev, mutex)    \
>> +({    \
>> +    typecheck(struct device *, dev);    \
>> +    mutex_init(mutex);    \
>> +    0;    \
>> +})
> 
> Do we need a typecheck() here? Compilation will fail with 
> CONFIG_DEBUG_MUTEXES if dev is not a device pointer.

I guess the idea is to have it fail _also_ when CONFIG_DEBUG_MUTEXES is 
not selected, in order to discover errors as soon as possible.
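
For example, without the typecheck() a CONFIG_DEBUG_MUTEXES=n build would
silently accept a misuse like this (hypothetical):

	struct platform_device *pdev = ...;
	struct mutex lock;

	devm_mutex_init(pdev, &lock);	/* wrong: should be &pdev->dev */

because the non-debug variant never evaluates 'dev' otherwise.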

> 
> 
>> +
>>  static inline void mutex_destroy(struct mutex *lock) {}
>>
>>  #endif
>> diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
>> index bc8abb8549d2..967a5367c79a 100644
>> --- a/kernel/locking/mutex-debug.c
>> +++ b/kernel/locking/mutex-debug.c
>> @@ -19,6 +19,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include <linux/device.h>
>>
>>  #include "mutex.h"
>>
>> @@ -89,6 +90,16 @@ void debug_mutex_init(struct mutex *lock, const 
>> char *name,
>>  lock->magic = lock;
>>  }
>>
>> +static void devm_mutex_release(void *res)
>> +{
>> +    mutex_destroy(res);
>> +}
>> +
>> +int debug_devm_mutex_init(struct device *dev, struct mutex *lock)
>> +{
>> +    return devm_add_action_or_reset(dev, devm_mutex_release, lock);
>> +}
>> +
>>  /***
>>   * mutex_destroy - mark a mutex unusable
>>   * @lock: the mutex to be destroyed
> 


Re: [PATCH v5 02/10] locking/mutex: introduce devm_mutex_init

2024-03-11 Thread Christophe Leroy


Le 12/03/2024 à 01:01, George Stark a écrit :
> 
> Hello Andy
> 
> On 3/7/24 13:34, Andy Shevchenko wrote:
>> On Thu, Mar 7, 2024 at 4:40 AM George Stark 
>>  wrote:
>>>
>>> Using of devm API leads to a certain order of releasing resources.
>>> So all dependent resources which are not devm-wrapped should be deleted
>>> with respect to devm-release order. Mutex is one of such objects that
>>> often is bound to other resources and has no own devm wrapping.
>>> Since mutex_destroy() actually does nothing in non-debug builds
>>> frequently calling mutex_destroy() is just ignored which is safe for now
>>> but wrong formally and can lead to a problem if mutex_destroy() will be
>>> extended so introduce devm_mutex_init()
>>>
>>> Signed-off-by: George Stark 
>>> Signed-off-by: Christophe Leroy 
>>
>>>   Hello Christophe. Hope you don't mind I put your SoB tag because you 
>>> helped a lot
>>>   to make this patch happen.
>>
>> You also need to figure out who should be the author of the patch and
>> probably add a (missing) Co-developed-by. After all you should also
>> follow the correct order of SoBs.
>>
> 
> Thanks for the review.
> I explained in the other letter as I see it. So I'd leave myself
> as author and add appropriate tag with Christophe's name.
> BTW what do you mean by correct SoB order?
> Is it alphabetical order or order of importance?
> 

The correct order is to first have the Author's SoB.


Re: [RFC PATCH v2 1/3] powerpc/prom_init: Replace linux,sml-base/sml-size with linux,sml-log

2024-03-11 Thread Christophe Leroy


Le 11/03/2024 à 14:20, Stefan Berger a écrit :
> linux,sml-base holds the address of a buffer with the TPM log. This
> buffer may become invalid after a kexec. To avoid accessing an invalid
> address or corrupted buffer, embed the whole TPM log in the device tree
> property linux,sml-log. This helps to protect the log since it is
> properly carried across a kexec soft reboot with both of the kexec
> syscalls.
> 
> Avoid having the firmware ingest the whole TPM log when calling
> prom_setprop but only create the linux,sml-log property as a place holder.
> Insert the actual TPM log during the tree flattening phase.
> 
> Fixes: 4a727429abec ("PPC64: Add support for instantiating SML from Open 
> Firmware")
> Suggested-by: Michael Ellerman 
> Signed-off-by: Stefan Berger 
> ---
>   arch/powerpc/kernel/prom_init.c | 27 +++
>   1 file changed, 19 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
> index e67effdba85c..6f7ca72013c2 100644
> --- a/arch/powerpc/kernel/prom_init.c
> +++ b/arch/powerpc/kernel/prom_init.c
> @@ -211,6 +211,8 @@ static cell_t __prombss regbuf[1024];
>   
>   static bool  __prombss rtas_has_query_cpu_stopped;
>   
> +static u64 __prombss sml_base;
> +static u32 __prombss sml_size;
>   
>   /*
>* Error results ... some OF calls will return "-1" on error, some
> @@ -1954,17 +1956,15 @@ static void __init prom_instantiate_sml(void)
>   }
>   prom_printf(" done\n");
>   
> - reserve_mem(base, size);
> -
> - prom_setprop(ibmvtpm_node, "/vdevice/vtpm", "linux,sml-base",
> -  &base, sizeof(base));
> - prom_setprop(ibmvtpm_node, "/vdevice/vtpm", "linux,sml-size",
> -  &size, sizeof(size));
> -
> - prom_debug("sml base = 0x%llx\n", base);
> + /* Add property now, defer adding log to tree flattening phase */
> + prom_setprop(ibmvtpm_node, "/vdevice/vtpm", "linux,sml-log",
> +  NULL, 0);
>   prom_debug("sml size = 0x%x\n", size);
>   
>   prom_debug("prom_instantiate_sml: end...\n");
> +
> + sml_base = base;
> + sml_size = size;
>   }
>   
>   /*
> @@ -2645,6 +2645,17 @@ static void __init scan_dt_build_struct(phandle node, 
> unsigned long *mem_start,
>   }
>   prev_name = sstart + soff;
>   
> + if (!prom_strcmp("linux,sml-log", pname)) {
> + /* push property head */
> + dt_push_token(OF_DT_PROP, mem_start, mem_end);
> + dt_push_token(sml_size, mem_start, mem_end);
> + dt_push_token(soff, mem_start, mem_end);
> + /* push property content */
> + valp = make_room(mem_start, mem_end, sml_size, 1);
> + memcpy(valp, (void *)sml_base, sml_size);

You can't cast a u64 into a pointer. If sml_base is an address, it must 
be declared as an unsigned long.
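
i.e. something like:

	static unsigned long __prombss sml_base;

so that the later (void *)sml_base cast is valid on 32-bit as well.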

Build with pmac32_defconfig :

   CC  arch/powerpc/kernel/prom_init.o
arch/powerpc/kernel/prom_init.c: In function 'scan_dt_build_struct':
arch/powerpc/kernel/prom_init.c:2663:38: error: cast to pointer from 
integer of different size [-Werror=int-to-pointer-cast]
  2663 | memcpy(valp, (void *)sml_base, sml_size);
   |  ^
cc1: all warnings being treated as errors
make[4]: *** [scripts/Makefile.build:243: 
arch/powerpc/kernel/prom_init.o] Error 1


> + *mem_start = ALIGN(*mem_start, 4);
> + continue;
> + }
>   /* get length */
>   l = call_prom("getproplen", 2, 1, node, pname);
>   


Re: [PATCH RFC 00/13] mm/treewide: Remove pXd_huge() API

2024-03-11 Thread Christophe Leroy


Le 06/03/2024 à 11:41, pet...@redhat.com a écrit :
> From: Peter Xu 
> 
> [based on akpm/mm-unstable latest commit a7f399ae964e]
> 
> In previous work [1], we removed the pXd_large() API, which is arch
> specific.  This patchset further removes the hugetlb pXd_huge() API.
> 
> Hugetlb was never special on creating huge mappings when compared with
> other huge mappings.  Having a standalone API just to detect such pgtable
> entries is more or less redundant, especially after the pXd_leaf() API set
> is introduced with/without CONFIG_HUGETLB_PAGE.
> 
> When looking at this problem, a few issues are also exposed that we don't
> have a clear definition of the *_huge() variance API.  This patchset
> started by cleaning these issues first, then replace all *_huge() users to
> use *_leaf(), then drop all *_huge() code.
> 
> On x86/sparc, swap entries will be reported "true" in pXd_huge(), while for
> all the rest archs they're reported "false" instead.  This part is done in
> patch 1-5, in which I suspect patch 1 can be seen as a bug fix, but I'll
> leave that to hmm experts to decide.
> 
> Besides, there are three archs (arm, arm64, powerpc) that have slightly
> different definitions between the *_huge() v.s. *_leaf() variances.  I
> tackled them separately so that it'll be easier for arch experts to chime in
> when necessary.  This part is done in patch 6-9.
> 
> The final patches 10-13 do the rest on the final removal, since *_leaf()
> will be the ultimate API in the future, and we seem to have quite some
> confusions on how *_huge() APIs can be defined, provide a rich comment for
> *_leaf() API set to define them properly to avoid future misuse, and
> hopefully that'll also help new archs to start support huge mappings and
> avoid traps (like either swap entries, or PROT_NONE entry checks).
> 
> The whole series is only lightly tested on x86, while as usual I don't have
> the capability to test all archs that it touches.
> 
> Marking this series RFC as of now.
> 
> [1] https://lore.kernel.org/r/20240305043750.93762-1-pet...@redhat.com
> 

Hi Peter, and nice job you are doing in cleaning up things around _huge 
stuff.

One thing that might be worth looking at also at some point is the mess 
around pmd_clear_huge() and pud_clear_huge().

I tried to clean things up with commit c742199a014d ("mm/pgtable: add 
stubs for {pmd/pub}_{set/clear}_huge") but it was reverted because of 
arm64 by commit d8a719059b9d ("Revert "mm/pgtable: add stubs for 
{pmd/pub}_{set/clear}_huge"")

So now powerpc/8xx has to implement pmd_clear_huge() and 
pud_clear_huge() although the 8xx page hierarchy only has 2 levels.

Christophe


Re: [PATCH 3/3] tools/perf/arch/powerc: Add get_arch_regnum for powerpc

2024-03-09 Thread Christophe Leroy


Le 09/03/2024 à 08:25, Athira Rajeev a écrit :
> The function get_dwarf_regnum() returns a DWARF register number
> from a register name string. This calls arch specific function
> get_arch_regnum to return register number for corresponding arch.
> Add mappings for register name to register number in powerpc code:
> arch/powerpc/util/dwarf-regs.c
> 
> Signed-off-by: Athira Rajeev 
> ---
>   tools/perf/arch/powerpc/util/dwarf-regs.c | 29 +++
>   1 file changed, 29 insertions(+)
> 
> diff --git a/tools/perf/arch/powerpc/util/dwarf-regs.c 
> b/tools/perf/arch/powerpc/util/dwarf-regs.c
> index 0c4f4caf53ac..d955e3e577ea 100644
> --- a/tools/perf/arch/powerpc/util/dwarf-regs.c
> +++ b/tools/perf/arch/powerpc/util/dwarf-regs.c
> @@ -98,3 +98,32 @@ int regs_query_register_offset(const char *name)
>   return roff->ptregs_offset;
>   return -EINVAL;
>   }
> +
> +struct dwarf_regs_idx {
> + const char *name;
> + int idx;
> +};
> +
> +static const struct dwarf_regs_idx powerpc_regidx_table[] = {
> + { "r0", 0 }, { "r1", 1 }, { "r2", 2 }, { "r3", 3 }, { "r4", 4 },
> + { "r5", 5 }, { "r6", 6 }, { "r7", 7 }, { "r8", 8 }, { "r9", 9 },
> + { "r10", 10 }, { "r11", 11 }, { "r12", 12 }, { "r13", 13 }, { "r14", 14 
> },
> + { "r15", 15 }, { "r16", 16 }, { "r17", 17 }, { "r18", 18 }, { "r19", 19 
> },
> + { "r20", 20 }, { "r21", 21 }, { "r22", 22 }, { "r23", 23 }, { "r24", 24 
> },
> + { "r25", 25 }, { "r26", 26 }, { "r27", 27 }, { "r27", 27 }, { "r28", 28 
> },
> + { "r29", 29 }, { "r30", 30 }, { "r31", 31 },
> +};
> +
> +int get_arch_regnum(const char *name)
> +{
> + unsigned int i;
> +
> + if (*name != 'r')
> + return -EINVAL;
> +
> + for (i = 0; i < ARRAY_SIZE(powerpc_regidx_table); i++)
> + if (!strcmp(powerpc_regidx_table[i].name, name))
> + return powerpc_regidx_table[i].idx;

Can you do something simpler?

Something like:

int n;

if (*name != 'r')
return -EINVAL;
n = atoi(name + 1);
return n >= 0 && n < 32 ? n : -ENOENT;

> +
> + return -ENOENT;
> +}
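
A stricter variant of the suggestion above (just a sketch, not the 
patch; strtol() with an end pointer also rejects trailing characters 
that atoi() silently ignores, e.g. "r1x"):

int get_arch_regnum(const char *name)
{
	char *end;
	long n;

	if (*name != 'r')
		return -EINVAL;

	/* Parse the decimal register number following 'r' */
	n = strtol(name + 1, &end, 10);

	/* Reject "r", "r1x" and anything outside r0..r31 */
	return (end != name + 1 && *end == '\0' && n >= 0 && n < 32) ? n : -ENOENT;
}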


Re: [PATCH 1/3] tools/perf/arch/powerpc: Add load/store in powerpc annotate instructions for data type profling

2024-03-09 Thread Christophe Leroy


On 09/03/2024 at 08:25, Athira Rajeev wrote:
> Add powerpc instruction nmemonic table to associate load/store
> instructions with move_ops. mov_ops is used to identify mem_type
> to associate instruction with data type and offset. Also initialize
> and allocate arch specific fields for nr_instructions, instructions and
> nr_instructions_allocate.
> 
> Signed-off-by: Athira Rajeev 
> ---
>   .../perf/arch/powerpc/annotate/instructions.c | 66 +++
>   1 file changed, 66 insertions(+)
> 
> diff --git a/tools/perf/arch/powerpc/annotate/instructions.c 
> b/tools/perf/arch/powerpc/annotate/instructions.c
> index a3f423c27cae..07af4442be38 100644
> --- a/tools/perf/arch/powerpc/annotate/instructions.c
> +++ b/tools/perf/arch/powerpc/annotate/instructions.c
> @@ -1,6 +1,65 @@
>   // SPDX-License-Identifier: GPL-2.0
>   #include 
>   
> +/*
> + * powerpc instruction nmemonic table to associate load/store instructions 
> with
> + * move_ops. mov_ops is used to identify mem_type to associate instruction 
> with
> + * data type and offset.
> + */
> +static struct ins powerpc__instructions[] = {
> + { .name = "lbz",.ops = _ops,  },
> + { .name = "lbzx",   .ops = _ops,  },
> + { .name = "lbzu",   .ops = _ops,  },
> + { .name = "lbzux",  .ops = _ops,  },
> + { .name = "lhz",.ops = _ops,  },
> + { .name = "lhzx",   .ops = _ops,  },
> + { .name = "lhzu",   .ops = _ops,  },
> + { .name = "lhzux",  .ops = _ops,  },
> + { .name = "lha",.ops = _ops,  },
> + { .name = "lhax",   .ops = _ops,  },
> + { .name = "lhau",   .ops = _ops,  },
> + { .name = "lhaux",  .ops = _ops,  },
> + { .name = "lwz",.ops = _ops,  },
> + { .name = "lwzx",   .ops = _ops,  },
> + { .name = "lwzu",   .ops = _ops,  },
> + { .name = "lwzux",  .ops = _ops,  },
> + { .name = "lwa",.ops = _ops,  },
> + { .name = "lwax",   .ops = _ops,  },
> + { .name = "lwaux",  .ops = _ops,  },
> + { .name = "ld", .ops = _ops,  },
> + { .name = "ldx",.ops = _ops,  },
> + { .name = "ldu",.ops = _ops,  },
> + { .name = "ldux",   .ops = _ops,  },
> + { .name = "stb",.ops = _ops,  },
> + { .name = "stbx",   .ops = _ops,  },
> + { .name = "stbu",   .ops = _ops,  },
> + { .name = "stbux",  .ops = _ops,  },
> + { .name = "sth",.ops = _ops,  },
> + { .name = "sthx",   .ops = _ops,  },
> + { .name = "sthu",   .ops = _ops,  },
> + { .name = "sthux",  .ops = _ops,  },
> + { .name = "stw",.ops = _ops,  },
> + { .name = "stwx",   .ops = _ops,  },
> + { .name = "stwu",   .ops = _ops,  },
> + { .name = "stwux",  .ops = _ops,  },
> + { .name = "std",.ops = _ops,  },
> + { .name = "stdx",   .ops = _ops,  },
> + { .name = "stdu",   .ops = _ops,  },
> + { .name = "stdux",  .ops = _ops,  },
> + { .name = "lhbrx",  .ops = _ops,  },
> + { .name = "sthbrx", .ops = _ops,  },
> + { .name = "lwbrx",  .ops = _ops,  },
> + { .name = "stwbrx", .ops = _ops,  },
> + { .name = "ldbrx",  .ops = _ops,  },
> + { .name = "stdbrx", .ops = _ops,  },
> + { .name = "lmw",.ops = _ops,  },
> + { .name = "stmw",   .ops = _ops,  },
> + { .name = "lswi",   .ops = _ops,  },
> + { .name = "lswx",   .ops = _ops,  },
> + { .name = "stswi",  .ops = _ops,  },
> + { .name = "stswx",  .ops = _ops,  },
> +};

What about lwarx and stwcx?

> +
>   static struct ins_ops *powerpc__associate_instruction_ops(struct arch 
> *arch, const char *name)
>   {
>   int i;
> @@ -52,6 +111,13 @@ static struct ins_ops 
> *powerpc__associate_instruction_ops(struct arch *arch, con
>   static int powerpc__annotate_init(struct arch *arch, char *cpuid 
> __maybe_unused)
>   {
>   if (!arch->initialized) {
> + arch->nr_instructions = ARRAY_SIZE(powerpc__instructions);
> + arch->instructions = calloc(arch->nr_instructions, 
> sizeof(struct ins));
> + if (arch->instructions == NULL)

Preferred form is

if (!arch->instructions)

> + return -ENOMEM;
> +
> + memcpy(arch->instructions, (struct ins *)powerpc__instructions, 
> sizeof(struct ins) * arch->nr_instructions);

No need to cast powerpc__instructions, it is already a pointer.
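
Putting both remarks together, the hunk would look like this (a sketch 
of the suggested form, not a tested patch):

	arch->instructions = calloc(arch->nr_instructions, sizeof(struct ins));
	if (!arch->instructions)
		return -ENOMEM;

	/* The array name decays to a pointer, so no cast is needed. */
	memcpy(arch->instructions, powerpc__instructions,
	       sizeof(struct ins) * arch->nr_instructions);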


> + arch->nr_instructions_allocated = arch->nr_instructions;
>   arch->initialized = true;
>   arch->associate_instruction_ops = 
> powerpc__associate_instruction_ops;
>   arch->objdump.comment_char  = '#';


Re: [PATCH 01/19] vdso: Consolidate vdso_calc_delta()

2024-03-08 Thread Christophe Leroy


On 08/03/2024 at 14:14, Adrian Hunter wrote:
> 
> Consolidate vdso_calc_delta(), in preparation for further simplification.
> 
> Suggested-by: Thomas Gleixner 
> Signed-off-by: Adrian Hunter 
> ---
>   arch/powerpc/include/asm/vdso/gettimeofday.h | 17 ++---
>   arch/s390/include/asm/vdso/gettimeofday.h|  7 ++-
>   lib/vdso/gettimeofday.c  |  4 
>   3 files changed, 8 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/vdso/gettimeofday.h 
> b/arch/powerpc/include/asm/vdso/gettimeofday.h
> index f0a4cf01e85c..f4da8e18cdf3 100644
> --- a/arch/powerpc/include/asm/vdso/gettimeofday.h
> +++ b/arch/powerpc/include/asm/vdso/gettimeofday.h
> @@ -14,6 +14,8 @@
> 
>   #define VDSO_HAS_TIME  1
> 
> +#define VDSO_DELTA_NOMASK  1
> +
>   static __always_inline int do_syscall_2(const unsigned long _r0, const 
> unsigned long _r3,
>  const unsigned long _r4)
>   {
> @@ -105,21 +107,6 @@ static inline bool vdso_clocksource_ok(const struct 
> vdso_data *vd)
>   }
>   #define vdso_clocksource_ok vdso_clocksource_ok
> 
> -/*
> - * powerpc specific delta calculation.
> - *
> - * This variant removes the masking of the subtraction because the
> - * clocksource mask of all VDSO capable clocksources on powerpc is U64_MAX
> - * which would result in a pointless operation. The compiler cannot
> - * optimize it away as the mask comes from the vdso data and is not compile
> - * time constant.
> - */

Please keep the comment. You can move it close to VDSO_DELTA_NOMASK

> -static __always_inline u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, 
> u32 mult)
> -{
> -   return (cycles - last) * mult;
> -}
> -#define vdso_calc_delta vdso_calc_delta
> -
>   #ifndef __powerpc64__
>   static __always_inline u64 vdso_shift_ns(u64 ns, unsigned long shift)
>   {
> diff --git a/arch/s390/include/asm/vdso/gettimeofday.h 
> b/arch/s390/include/asm/vdso/gettimeofday.h
> index db84942eb78f..7937765ccfa5 100644
> --- a/arch/s390/include/asm/vdso/gettimeofday.h
> +++ b/arch/s390/include/asm/vdso/gettimeofday.h
> @@ -6,16 +6,13 @@
> 
>   #define VDSO_HAS_CLOCK_GETRES 1
> 
> +#define VDSO_DELTA_NOMASK 1
> +
>   #include 
>   #include 
>   #include 
>   #include 
> 
> -#define vdso_calc_delta __arch_vdso_calc_delta
> -static __always_inline u64 __arch_vdso_calc_delta(u64 cycles, u64 last, u64 
> mask, u32 mult)
> -{
> -   return (cycles - last) * mult;
> -}
> 
>   static __always_inline const struct vdso_data *__arch_get_vdso_data(void)
>   {
> diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
> index ce2f69552003..042b95e8164d 100644
> --- a/lib/vdso/gettimeofday.c
> +++ b/lib/vdso/gettimeofday.c
> @@ -13,7 +13,11 @@
>   static __always_inline
>   u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
>   {
> +#ifdef VDSO_DELTA_NOMASK
> +   return (cycles - last) * mult;
> +#else
>  return ((cycles - last) & mask) * mult;
> +#endif

See 
https://docs.kernel.org/process/coding-style.html#conditional-compilation

You don't need #ifdefs here.

One solution is to define VDSO_DELTA_NOMASK to 0 in 
include/vdso/datapage.h after including asm/vdso/gettimeofday.h :

#ifndef VDSO_DELTA_NOMASK
#define VDSO_DELTA_NOMASK 0
#endif

Then

u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
{
if (VDSO_DELTA_NOMASK)
mask = ~0ULL;

return ((cycles - last) & mask) * mult;
}

or

u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
{
if (VDSO_DELTA_NOMASK)
return (cycles - last) * mult;

return ((cycles - last) & mask) * mult;
}




>   }
>   #endif
> 
> --
> 2.34.1
> 


Re: [PATCH v5 02/10] locking/mutex: introduce devm_mutex_init

2024-03-07 Thread Christophe Leroy


On 07/03/2024 at 03:40, George Stark wrote:
> 
> Using of devm API leads to a certain order of releasing resources.
> So all dependent resources which are not devm-wrapped should be deleted
> with respect to devm-release order. Mutex is one of such objects that
> often is bound to other resources and has no own devm wrapping.
> Since mutex_destroy() actually does nothing in non-debug builds
> frequently calling mutex_destroy() is just ignored which is safe for now
> but wrong formally and can lead to a problem if mutex_destroy() will be
> extended so introduce devm_mutex_init()
> 
> Signed-off-by: George Stark 
> Signed-off-by: Christophe Leroy 
> ---
>   Hello Christophe. Hope you don't mind I put your SoB tag because you helped 
> a lot
>   to make this patch happen.

Up to you. I sent an RFC patch based on yours with my ideas included, 
because an example is easier to understand than a lot of words, and 
my scripts automatically set the Signed-off-by: but feel free to change 
it to Suggested-by:

Christophe

> 
>   include/linux/mutex.h| 13 +
>   kernel/locking/mutex-debug.c | 22 ++
>   2 files changed, 35 insertions(+)
> 
> diff --git a/include/linux/mutex.h b/include/linux/mutex.h
> index f7611c092db7..9bcf72cb941a 100644
> --- a/include/linux/mutex.h
> +++ b/include/linux/mutex.h
> @@ -22,6 +22,8 @@
>   #include 
>   #include 
> 
> +struct device;
> +
>   #ifdef CONFIG_DEBUG_LOCK_ALLOC
>   # define __DEP_MAP_MUTEX_INITIALIZER(lockname) \
>  , .dep_map = {  \
> @@ -115,10 +117,21 @@ do {
>   \
> 
>   #ifdef CONFIG_DEBUG_MUTEXES
> 
> +int devm_mutex_init(struct device *dev, struct mutex *lock);
>   void mutex_destroy(struct mutex *lock);
> 
>   #else
> 
> +static inline int devm_mutex_init(struct device *dev, struct mutex *lock)
> +{
> +   /*
> +* since mutex_destroy is nop actually there's no need to register it
> +* in devm subsystem.
> +*/
> +   mutex_init(lock);
> +   return 0;
> +}
> +
>   static inline void mutex_destroy(struct mutex *lock) {}
> 
>   #endif
> diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
> index bc8abb8549d2..c9efab1a8026 100644
> --- a/kernel/locking/mutex-debug.c
> +++ b/kernel/locking/mutex-debug.c
> @@ -19,6 +19,7 @@
>   #include 
>   #include 
>   #include 
> +#include 
> 
>   #include "mutex.h"
> 
> @@ -104,3 +105,24 @@ void mutex_destroy(struct mutex *lock)
>   }
> 
>   EXPORT_SYMBOL_GPL(mutex_destroy);
> +
> +static void devm_mutex_release(void *res)
> +{
> +   mutex_destroy(res);
> +}
> +
> +/**
> + * devm_mutex_init - Resource-managed mutex initialization
> + * @dev:   Device which lifetime mutex is bound to
> + * @lock:  Pointer to a mutex
> + *
> + * Initialize mutex which is automatically destroyed when the driver is 
> detached.
> + *
> + * Returns: 0 on success or a negative error code on failure.
> + */
> +int devm_mutex_init(struct device *dev, struct mutex *lock)
> +{
> +   mutex_init(lock);
> +   return devm_add_action_or_reset(dev, devm_mutex_release, lock);
> +}
> +EXPORT_SYMBOL_GPL(devm_mutex_init);
> --
> 2.25.1
> 


Re: [PATCH 3/3] macintosh/ams: Fix unused variable warning

2024-03-06 Thread Christophe Leroy


On 07/03/2024 at 06:32, Michael Ellerman wrote:
> Christophe Leroy  writes:
>> On 06/03/2024 at 13:58, Michael Ellerman wrote:
>>> If both CONFIG_SENSORS_AMS_PMU and CONFIG_SENSORS_AMS_I2C are unset,
>>> there is an unused variable warning in the ams driver:
>>>
>>> drivers/macintosh/ams/ams-core.c: In function 'ams_init':
>>> drivers/macintosh/ams/ams-core.c:181:29: warning: unused variable 'np'
>>>   181 | struct device_node *np;
>>>
>>> Fix it by using IS_ENABLED() to create a block for each case, and move
>>> the variable declaration in there.
>>>
>>> Probably the dependencies should be changed so that the driver can't be
>>> built with both variants disabled, but that would be a larger change.
>>
>> Can be done easily that way I think:
>>
>> diff --git a/drivers/macintosh/Kconfig b/drivers/macintosh/Kconfig
>> index a0e717a986dc..fb38f68f 100644
>> --- a/drivers/macintosh/Kconfig
>> +++ b/drivers/macintosh/Kconfig
>> @@ -262,7 +262,7 @@ config SENSORS_AMS
>>will be called ams.
>>
>>config SENSORS_AMS_PMU
>> -bool "PMU variant"
>> +bool "PMU variant" if SENSORS_AMS_I2C
>>  depends on SENSORS_AMS && ADB_PMU
>>  default y
>>  help
> 
> Thanks. It's a little clunky. For example if you answer no to both
> prompts, it still selects SENSORS_AMS_PMU, but I guess it doesn't really
> matter.
> 
>$ make oldconfig
>...
>  Apple Motion Sensor driver (SENSORS_AMS) [N/m/y/?] (NEW) y
>PMU variant (SENSORS_AMS_PMU) [Y/n/?] (NEW) n
>I2C variant (SENSORS_AMS_I2C) [Y/n/?] (NEW) n
>#
># configuration written to .config
>#
>make[1]: Leaving directory '/home/michael/linux/.build'
>
>$ grep SENSORS_AMS .build/.config
>CONFIG_SENSORS_AMS=y
>CONFIG_SENSORS_AMS_PMU=y
># CONFIG_SENSORS_AMS_I2C is not set
> 
> 
> I'll turn this into a patch and add your SoB?

That's fine for me.

You can alternatively use Suggested-by: , I don't really mind.

Thanks
Christophe


Re: [PATCH 3/3] macintosh/ams: Fix unused variable warning

2024-03-06 Thread Christophe Leroy


On 06/03/2024 at 13:58, Michael Ellerman wrote:
> If both CONFIG_SENSORS_AMS_PMU and CONFIG_SENSORS_AMS_I2C are unset,
> there is an unused variable warning in the ams driver:
> 
>drivers/macintosh/ams/ams-core.c: In function 'ams_init':
>drivers/macintosh/ams/ams-core.c:181:29: warning: unused variable 'np'
>  181 | struct device_node *np;
> 
> Fix it by using IS_ENABLED() to create a block for each case, and move
> the variable declaration in there.
> 
> Probably the dependencies should be changed so that the driver can't be
> built with both variants disabled, but that would be a larger change.

It can be done easily that way, I think:

diff --git a/drivers/macintosh/Kconfig b/drivers/macintosh/Kconfig
index a0e717a986dc..fb38f68f 100644
--- a/drivers/macintosh/Kconfig
+++ b/drivers/macintosh/Kconfig
@@ -262,7 +262,7 @@ config SENSORS_AMS
  will be called ams.

  config SENSORS_AMS_PMU
-   bool "PMU variant"
+   bool "PMU variant" if SENSORS_AMS_I2C
depends on SENSORS_AMS && ADB_PMU
default y
help


> 
> Signed-off-by: Michael Ellerman 
> ---
>   drivers/macintosh/ams/ams-core.c | 29 ++---
>   1 file changed, 14 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/macintosh/ams/ams-core.c 
> b/drivers/macintosh/ams/ams-core.c
> index c978b4272daa..22d3e6605287 100644
> --- a/drivers/macintosh/ams/ams-core.c
> +++ b/drivers/macintosh/ams/ams-core.c
> @@ -178,25 +178,24 @@ int ams_sensor_attach(void)
>   
>   static int __init ams_init(void)
>   {
> - struct device_node *np;
> -
>   spin_lock_init(&ams_info.irq_lock);
>   mutex_init(&ams_info.lock);
>   INIT_WORK(&ams_info.worker, ams_worker);
>   
> -#ifdef CONFIG_SENSORS_AMS_I2C
> - np = of_find_node_by_name(NULL, "accelerometer");
> - if (np && of_device_is_compatible(np, "AAPL,accelerometer_1"))
> - /* Found I2C motion sensor */
> - return ams_i2c_init(np);
> -#endif
> -
> -#ifdef CONFIG_SENSORS_AMS_PMU
> - np = of_find_node_by_name(NULL, "sms");
> - if (np && of_device_is_compatible(np, "sms"))
> - /* Found PMU motion sensor */
> - return ams_pmu_init(np);
> -#endif
> + if (IS_ENABLED(CONFIG_SENSORS_AMS_I2C)) {
> + struct device_node *np = of_find_node_by_name(NULL, 
> "accelerometer");
> + if (np && of_device_is_compatible(np, "AAPL,accelerometer_1"))
> + /* Found I2C motion sensor */
> + return ams_i2c_init(np);
> + }
> +
> + if (IS_ENABLED(CONFIG_SENSORS_AMS_PMU)) {
> + struct device_node *np = of_find_node_by_name(NULL, "sms");
> + if (np && of_device_is_compatible(np, "sms"))
> + /* Found PMU motion sensor */
> + return ams_pmu_init(np);
> + }
> +
>   return -ENODEV;
>   }
>   


Re: [RESEND2 PATCH net v4 2/2] soc: fsl: qbman: Use raw spinlock for cgr_lock

2024-03-05 Thread Christophe Leroy


On 05/03/2024 at 19:14, Sean Anderson wrote:
> 
> Hi,
> 
> On 2/23/24 11:02, Sean Anderson wrote:
>> On 2/23/24 00:38, Christophe Leroy wrote:
>>> On 22/02/2024 at 18:07, Sean Anderson wrote:
>>>>
>>>> cgr_lock may be locked with interrupts already disabled by
>>>> smp_call_function_single. As such, we must use a raw spinlock to avoid
>>>> problems on PREEMPT_RT kernels. Although this bug has existed for a
>>>> while, it was not apparent until commit ef2a8d5478b9 ("net: dpaa: Adjust
>>>> queue depth on rate change") which invokes smp_call_function_single via
>>>> qman_update_cgr_safe every time a link goes up or down.
>>>
>>> Why a raw spinlock to avoid problems on PREEMPT_RT, can you elaborate ?
>>
>> smp_call_function always runs its callback in hard IRQ context, even on
>> PREEMPT_RT, where spinlocks can sleep. So we need to use raw spinlocks
>> to ensure we aren't waiting on a sleeping task. See the first bug report
>> for more discussion.
>>
>> In the longer term it would be better to switch to some other
>> abstraction.
> 
> Does this make sense to you?

Yes, that's fine, thanks for the clarification. Maybe you can explain that 
in the patch description in case you send a v5.

Christophe
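
For readers following along, the pattern under discussion is roughly 
this (a sketch with a hypothetical callback name, not the actual qbman 
patch):

static DEFINE_RAW_SPINLOCK(cgr_lock);

static void cgr_update_cb(void *arg)
{
	unsigned long irqflags;

	/*
	 * Callbacks invoked through smp_call_function_single() run in
	 * hard IRQ context even on PREEMPT_RT, where a plain spinlock_t
	 * may sleep; a raw_spinlock_t never does.
	 */
	raw_spin_lock_irqsave(&cgr_lock, irqflags);
	/* ... walk and update the CGR state ... */
	raw_spin_unlock_irqrestore(&cgr_lock, irqflags);
}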


Re: [PATCH v3 02/10] mm/ppc: Replace pXd_is_leaf() with pXd_leaf()

2024-03-05 Thread Christophe Leroy


On 05/03/2024 at 05:37, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> They're the same macros underneath.  Drop pXd_is_leaf(), instead always use
> pXd_leaf().
> 
> At the meantime, instead of renames, drop the pXd_is_leaf() fallback
> definitions directly in arch/powerpc/include/asm/pgtable.h. because similar
> fallback macros for pXd_leaf() are already defined in
> include/linux/pgtable.h.
> 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: "Aneesh Kumar K.V" 
> Cc: "Naveen N. Rao" 
> Cc: linuxppc-dev@lists.ozlabs.org
> Suggested-by: Christophe Leroy 
> Reviewed-by: Jason Gunthorpe 
> Signed-off-by: Peter Xu 

Reviewed-by: Christophe Leroy 

In case you post a new version: in the subject we usually use "powerpc", 
not "ppc".

> ---
>   arch/powerpc/include/asm/book3s/64/pgtable.h | 10 
>   arch/powerpc/include/asm/pgtable.h   | 24 
>   arch/powerpc/kvm/book3s_64_mmu_radix.c   | 12 +-
>   arch/powerpc/mm/book3s64/radix_pgtable.c | 14 ++--
>   arch/powerpc/mm/pgtable.c|  6 ++---
>   arch/powerpc/mm/pgtable_64.c |  6 ++---
>   arch/powerpc/xmon/xmon.c |  6 ++---
>   7 files changed, 26 insertions(+), 52 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index d1318e8582ac..3e99e409774a 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -1439,18 +1439,16 @@ static inline bool is_pte_rw_upgrade(unsigned long 
> old_val, unsigned long new_va
>   /*
>* Like pmd_huge() and pmd_large(), but works regardless of config options
>*/
> -#define pmd_is_leaf pmd_is_leaf
> -#define pmd_leaf pmd_is_leaf
> +#define pmd_leaf pmd_leaf
>   #define pmd_large pmd_leaf
> -static inline bool pmd_is_leaf(pmd_t pmd)
> +static inline bool pmd_leaf(pmd_t pmd)
>   {
>   return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
>   }
>   
> -#define pud_is_leaf pud_is_leaf
> -#define pud_leaf pud_is_leaf
> +#define pud_leaf pud_leaf
>   #define pud_large pud_leaf
> -static inline bool pud_is_leaf(pud_t pud)
> +static inline bool pud_leaf(pud_t pud)
>   {
>   return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
>   }
> diff --git a/arch/powerpc/include/asm/pgtable.h 
> b/arch/powerpc/include/asm/pgtable.h
> index 5928b3c1458d..e6edf1cdbc5b 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -182,30 +182,6 @@ static inline void pte_frag_set(mm_context_t *ctx, void 
> *p)
>   }
>   #endif
>   
> -#ifndef pmd_is_leaf
> -#define pmd_is_leaf pmd_is_leaf
> -static inline bool pmd_is_leaf(pmd_t pmd)
> -{
> - return false;
> -}
> -#endif
> -
> -#ifndef pud_is_leaf
> -#define pud_is_leaf pud_is_leaf
> -static inline bool pud_is_leaf(pud_t pud)
> -{
> - return false;
> -}
> -#endif
> -
> -#ifndef p4d_is_leaf
> -#define p4d_is_leaf p4d_is_leaf
> -static inline bool p4d_is_leaf(p4d_t p4d)
> -{
> - return false;
> -}
> -#endif
> -
>   #define pmd_pgtable pmd_pgtable
>   static inline pgtable_t pmd_pgtable(pmd_t pmd)
>   {
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
> b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> index 4a1abb9f7c05..408d98f8a514 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> @@ -503,7 +503,7 @@ static void kvmppc_unmap_free_pmd(struct kvm *kvm, pmd_t 
> *pmd, bool full,
>   for (im = 0; im < PTRS_PER_PMD; ++im, ++p) {
>   if (!pmd_present(*p))
>   continue;
> - if (pmd_is_leaf(*p)) {
> + if (pmd_leaf(*p)) {
>   if (full) {
>   pmd_clear(p);
>   } else {
> @@ -532,7 +532,7 @@ static void kvmppc_unmap_free_pud(struct kvm *kvm, pud_t 
> *pud,
>   for (iu = 0; iu < PTRS_PER_PUD; ++iu, ++p) {
>   if (!pud_present(*p))
>   continue;
> - if (pud_is_leaf(*p)) {
> + if (pud_leaf(*p)) {
>   pud_clear(p);
>   } else {
>   pmd_t *pmd;
> @@ -635,12 +635,12 @@ int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, 
> pte_t pte,
>   new_pud = pud_alloc_one(kvm->mm, gpa);
>   
>   pmd = NULL;
> - if (pud && pud_present(*pud) && !pud_is_leaf(*pud))
> + if (pud && pud_present(*pud) && !pud_leaf(*pud))
>   pmd = pmd_offset(pud, gpa);
>

Re: [PATCH v3 01/10] mm/ppc: Define pXd_large() with pXd_leaf()

2024-03-05 Thread Christophe Leroy


On 05/03/2024 at 05:37, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> The two definitions are the same.  The only difference is that pXd_large()
> is only defined with THP selected, and only on book3s 64bits.
> 
> Instead of implementing it twice, make pXd_large() a macro to pXd_leaf().
> Define it unconditionally just like pXd_leaf().  This helps to prepare
> merging the two APIs.
> 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Christophe Leroy 
> Cc: "Aneesh Kumar K.V" 
> Cc: "Naveen N. Rao" 
> Cc: linuxppc-dev@lists.ozlabs.org
> Reviewed-by: Jason Gunthorpe 
> Signed-off-by: Peter Xu 

Reviewed-by: Christophe Leroy 

> ---
>   arch/powerpc/include/asm/book3s/64/pgtable.h | 16 ++--
>   arch/powerpc/include/asm/pgtable.h   |  2 +-
>   2 files changed, 3 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index 927d585652bc..d1318e8582ac 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -1157,20 +1157,6 @@ pud_hugepage_update(struct mm_struct *mm, unsigned 
> long addr, pud_t *pudp,
>   return pud_val(*pudp);
>   }
>   
> -/*
> - * returns true for pmd migration entries, THP, devmap, hugetlb
> - * But compile time dependent on THP config
> - */
> -static inline int pmd_large(pmd_t pmd)
> -{
> - return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
> -}
> -
> -static inline int pud_large(pud_t pud)
> -{
> - return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
> -}
> -
>   /*
>* For radix we should always find H_PAGE_HASHPTE zero. Hence
>* the below will work for radix too
> @@ -1455,6 +1441,7 @@ static inline bool is_pte_rw_upgrade(unsigned long 
> old_val, unsigned long new_va
>*/
>   #define pmd_is_leaf pmd_is_leaf
>   #define pmd_leaf pmd_is_leaf
> +#define pmd_large pmd_leaf
>   static inline bool pmd_is_leaf(pmd_t pmd)
>   {
>   return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
> @@ -1462,6 +1449,7 @@ static inline bool pmd_is_leaf(pmd_t pmd)
>   
>   #define pud_is_leaf pud_is_leaf
>   #define pud_leaf pud_is_leaf
> +#define pud_large pud_leaf
>   static inline bool pud_is_leaf(pud_t pud)
>   {
>   return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
> diff --git a/arch/powerpc/include/asm/pgtable.h 
> b/arch/powerpc/include/asm/pgtable.h
> index 7a1ba8889aea..5928b3c1458d 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -101,7 +101,7 @@ void poking_init(void);
>   extern unsigned long ioremap_bot;
>   extern const pgprot_t protection_map[16];
>   
> -#ifndef CONFIG_TRANSPARENT_HUGEPAGE
> +#ifndef pmd_large
>   #define pmd_large(pmd)  0
>   #endif
>   


[PATCH] powerpc/bpf/32: Fix failing test_bpf tests

2024-03-05 Thread Christophe Leroy
With recent BPF additions like the cpu v4 instructions, the test_bpf
module exhibits the following failures:

test_bpf: #82 ALU_MOVSX | BPF_B jited:1 ret 2 != 1 (0x2 != 0x1)FAIL (1 
times)
test_bpf: #83 ALU_MOVSX | BPF_H jited:1 ret 2 != 1 (0x2 != 0x1)FAIL (1 
times)
test_bpf: #84 ALU64_MOVSX | BPF_B jited:1 ret 2 != 1 (0x2 != 0x1)FAIL 
(1 times)
test_bpf: #85 ALU64_MOVSX | BPF_H jited:1 ret 2 != 1 (0x2 != 0x1)FAIL 
(1 times)
test_bpf: #86 ALU64_MOVSX | BPF_W jited:1 ret 2 != 1 (0x2 != 0x1)FAIL 
(1 times)

test_bpf: #165 ALU_SDIV_X: -6 / 2 = -3 jited:1 ret 2147483645 != -3 
(0x7ffffffd != 0xfffffffd)FAIL (1 times)
test_bpf: #166 ALU_SDIV_K: -6 / 2 = -3 jited:1 ret 2147483645 != -3 
(0x7ffffffd != 0xfffffffd)FAIL (1 times)

test_bpf: #169 ALU_SMOD_X: -7 % 2 = -1 jited:1 ret 1 != -1 (0x1 != 
0xffffffff)FAIL (1 times)
test_bpf: #170 ALU_SMOD_K: -7 % 2 = -1 jited:1 ret 1 != -1 (0x1 != 
0xffffffff)FAIL (1 times)

test_bpf: #172 ALU64_SMOD_K: -7 % 2 = -1 jited:1 ret 1 != -1 (0x1 != 
0xffffffff)FAIL (1 times)

test_bpf: #313 BSWAP 16: 0x0123456789abcdef -> 0xefcd
eBPF filter opcode 00d7 (@2) unsupported
jited:0 301 PASS
test_bpf: #314 BSWAP 32: 0x0123456789abcdef -> 0xefcdab89
eBPF filter opcode 00d7 (@2) unsupported
jited:0 555 PASS
test_bpf: #315 BSWAP 64: 0x0123456789abcdef -> 0x67452301
eBPF filter opcode 00d7 (@2) unsupported
jited:0 268 PASS
test_bpf: #316 BSWAP 64: 0x0123456789abcdef >> 32 -> 0xefcdab89
eBPF filter opcode 00d7 (@2) unsupported
jited:0 269 PASS
test_bpf: #317 BSWAP 16: 0xfedcba9876543210 -> 0x1032
eBPF filter opcode 00d7 (@2) unsupported
jited:0 460 PASS
test_bpf: #318 BSWAP 32: 0xfedcba9876543210 -> 0x10325476
eBPF filter opcode 00d7 (@2) unsupported
jited:0 320 PASS
test_bpf: #319 BSWAP 64: 0xfedcba9876543210 -> 0x98badcfe
eBPF filter opcode 00d7 (@2) unsupported
jited:0 222 PASS
test_bpf: #320 BSWAP 64: 0xfedcba9876543210 >> 32 -> 0x10325476
eBPF filter opcode 00d7 (@2) unsupported
jited:0 273 PASS

test_bpf: #344 BPF_LDX_MEMSX | BPF_B
eBPF filter opcode 0091 (@5) unsupported
jited:0 432 PASS
test_bpf: #345 BPF_LDX_MEMSX | BPF_H
eBPF filter opcode 0089 (@5) unsupported
jited:0 381 PASS
test_bpf: #346 BPF_LDX_MEMSX | BPF_W
eBPF filter opcode 0081 (@5) unsupported
jited:0 505 PASS

test_bpf: #490 JMP32_JA: Unconditional jump: if (true) return 1
eBPF filter opcode 0006 (@1) unsupported
jited:0 261 PASS

test_bpf: Summary: 1040 PASSED, 10 FAILED, [924/1038 JIT'ed]

Fix them by adding the missing processing.

Fixes: daabb2b098e0 ("bpf/tests: add tests for cpuv4 instructions")
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/ppc-opcode.h |   4 +
 arch/powerpc/net/bpf_jit_comp32.c | 137 --
 2 files changed, 110 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
b/arch/powerpc/include/asm/ppc-opcode.h
index 005601243dda..076ae60b4a55 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -510,6 +510,7 @@
 #define PPC_RAW_STB(r, base, i)	(0x98000000 | ___PPC_RS(r) | ___PPC_RA(base) | IMM_L(i))
 #define PPC_RAW_LBZ(r, base, i)	(0x88000000 | ___PPC_RT(r) | ___PPC_RA(base) | IMM_L(i))
 #define PPC_RAW_LDX(r, base, b)	(0x7c00002a | ___PPC_RT(r) | ___PPC_RA(base) | ___PPC_RB(b))
+#define PPC_RAW_LHA(r, base, i)	(0xa8000000 | ___PPC_RT(r) | ___PPC_RA(base) | IMM_L(i))
 #define PPC_RAW_LHZ(r, base, i)	(0xa0000000 | ___PPC_RT(r) | ___PPC_RA(base) | IMM_L(i))
 #define PPC_RAW_LHBRX(r, base, b)	(0x7c00062c | ___PPC_RT(r) | ___PPC_RA(base) | ___PPC_RB(b))
 #define PPC_RAW_LWBRX(r, base, b)	(0x7c00042c | ___PPC_RT(r) | ___PPC_RA(base) | ___PPC_RB(b))
@@ -532,6 +533,7 @@
 #define PPC_RAW_MULW(d, a, b)	(0x7c0001d6 | ___PPC_RT(d) | ___PPC_RA(a) | ___PPC_RB(b))
 #define PPC_RAW_MULHWU(d, a, b)	(0x7c000016 | ___PPC_RT(d) | ___PPC_RA(a) | ___PPC_RB(b))
 #define PPC_RAW_MULI(d, a, i)	(0x1c000000 | ___PPC_RT(d) | ___PPC_RA(a) | IMM_L(i))
+#define PPC_RAW_DIVW(d, a, b)	(0x7c0003d6 | ___PPC_RT(d) | ___PPC_RA(a) | ___PPC_RB(b))
 #define PPC_RAW_DIVWU(d, a, b)	(0x7c000396 | ___PPC_RT(d) | ___PPC_RA(a) | ___PPC_RB(b))
 #define PPC_RAW_DIVDU(d, a, b)	(0x7c000392 | ___PPC_RT(d) | ___PPC_RA(a) | ___PPC_RB(b))
 #define PPC_RAW_DIVDE(t, a, b)	(0x7c000352 | ___PPC_RT(t) | ___PPC_RA(a) | ___PPC_RB(b))
@@ -550,6 +552,8 @@
 #define PPC_RAW_XOR(d, a, b)	(0x7c000278 | ___PPC_RA(d) | ___PPC_RS(a) | ___PPC_RB(b))
 #define PPC_RAW_XO

Re: [PATCH v2 3/3] arch/powerpc: Remove <linux/fb.h> from backlight code

2024-03-05 Thread Christophe Leroy


On 05/03/2024 at 11:04, Thomas Zimmermann wrote:
> Hi
> 
> On 05.03.24 at 10:25, Christophe Leroy wrote:
>>
>> On 05/03/2024 at 10:01, Thomas Zimmermann wrote:
>>> Replace <linux/fb.h> with a forward declaration in <asm/backlight.h> to
>>> resolves an unnecessary dependency. Remove pmac_backlight_curve_lookup()
>>> and struct fb_info from source and header files. The function and the
>>> framebuffer struct is unused. No functional changes.
>> When you remove pmac_backlight_curve_lookup() prototype you'll then get
>> a warning/error about missing prototype when building
>> arch/powerpc/platforms/powermac/backlight.c
>>
>> The function is not used outside of that file so it should be static.
>> And then it is not used in that file either so it should be removed
>> completely. Indeed last use of that function was removed by commit
>> d565dd3b0824 ("[PATCH] powerpc: More via-pmu backlight fixes") so the
>> function can safely be removed.
> 
> Isn't that what my patch is doing? I have no callers of the function in 
> my tree (drm-tip), so I removed it entirely. Should I add a Fixes tag 
> against commit d565dd3b0824? Best regards Thomas

Sorry I overlooked your patch and focussed on the removal of the 
prototype and missed the removal of the function.

Christophe

>>
>> Christophe
>>
>>> Signed-off-by: Thomas Zimmermann 
>>> ---
>>>    arch/powerpc/include/asm/backlight.h    |  5 ++--
>>>    arch/powerpc/platforms/powermac/backlight.c | 26 
>>> -
>>>    2 files changed, 2 insertions(+), 29 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/backlight.h 
>>> b/arch/powerpc/include/asm/backlight.h
>>> index 1b5eab62ed047..061a910d74929 100644
>>> --- a/arch/powerpc/include/asm/backlight.h
>>> +++ b/arch/powerpc/include/asm/backlight.h
>>> @@ -10,15 +10,14 @@
>>>    #define __ASM_POWERPC_BACKLIGHT_H
>>>    #ifdef __KERNEL__
>>> -#include <linux/fb.h>
>>>    #include <linux/mutex.h>
>>> +struct backlight_device;
>>> +
>>>    /* For locking instructions, see the implementation file */
>>>    extern struct backlight_device *pmac_backlight;
>>>    extern struct mutex pmac_backlight_mutex;
>>> -extern int pmac_backlight_curve_lookup(struct fb_info *info, int 
>>> value);
>>> -
>>>    extern int pmac_has_backlight_type(const char *type);
>>>    extern void pmac_backlight_key(int direction);
>>> diff --git a/arch/powerpc/platforms/powermac/backlight.c 
>>> b/arch/powerpc/platforms/powermac/backlight.c
>>> index aeb79a8b3e109..12bc01353bd3c 100644
>>> --- a/arch/powerpc/platforms/powermac/backlight.c
>>> +++ b/arch/powerpc/platforms/powermac/backlight.c
>>> @@ -9,7 +9,6 @@
>>>     */
>>>    #include <linux/kernel.h>
>>> -#include <linux/fb.h>
>>>    #include <linux/backlight.h>
>>>    #include <linux/adb.h>
>>>    #include 
>>> @@ -72,31 +71,6 @@ int pmac_has_backlight_type(const char *type)
>>>    return 0;
>>>    }
>>> -int pmac_backlight_curve_lookup(struct fb_info *info, int value)
>>> -{
>>> -    int level = (FB_BACKLIGHT_LEVELS - 1);
>>> -
>>> -    if (info && info->bl_dev) {
>>> -    int i, max = 0;
>>> -
>>> -    /* Look for biggest value */
>>> -    for (i = 0; i < FB_BACKLIGHT_LEVELS; i++)
>>> -    max = max((int)info->bl_curve[i], max);
>>> -
>>> -    /* Look for nearest value */
>>> -    for (i = 0; i < FB_BACKLIGHT_LEVELS; i++) {
>>> -    int diff = abs(info->bl_curve[i] - value);
>>> -    if (diff < max) {
>>> -    max = diff;
>>> -    level = i;
>>> -    }
>>> -    }
>>> -
>>> -    }
>>> -
>>> -    return level;
>>> -}
>>> -
>>>    static void pmac_backlight_key_worker(struct work_struct *work)
>>>    {
>>>    if (atomic_read(&kernel_backlight_disabled))
> 


Re: [PATCH v2 3/3] arch/powerpc: Remove <linux/fb.h> from backlight code

2024-03-05 Thread Christophe Leroy


On 05/03/2024 at 10:01, Thomas Zimmermann wrote:
> Replace <linux/fb.h> with a forward declaration in <asm/backlight.h> to
> resolves an unnecessary dependency. Remove pmac_backlight_curve_lookup()
> and struct fb_info from source and header files. The function and the
> framebuffer struct is unused. No functional changes.

When you remove the pmac_backlight_curve_lookup() prototype you'll then get 
a warning/error about a missing prototype when building 
arch/powerpc/platforms/powermac/backlight.c

The function is not used outside of that file so it should be static. 
And then it is not used in that file either, so it should be removed 
completely. Indeed the last use of that function was removed by commit 
d565dd3b0824 ("[PATCH] powerpc: More via-pmu backlight fixes") so the 
function can safely be removed.

Christophe

> 
> Signed-off-by: Thomas Zimmermann 
> ---
>   arch/powerpc/include/asm/backlight.h|  5 ++--
>   arch/powerpc/platforms/powermac/backlight.c | 26 -
>   2 files changed, 2 insertions(+), 29 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/backlight.h 
> b/arch/powerpc/include/asm/backlight.h
> index 1b5eab62ed047..061a910d74929 100644
> --- a/arch/powerpc/include/asm/backlight.h
> +++ b/arch/powerpc/include/asm/backlight.h
> @@ -10,15 +10,14 @@
>   #define __ASM_POWERPC_BACKLIGHT_H
>   #ifdef __KERNEL__
>   
> -#include <linux/fb.h>
>   #include <linux/mutex.h>
>   
> +struct backlight_device;
> +
>   /* For locking instructions, see the implementation file */
>   extern struct backlight_device *pmac_backlight;
>   extern struct mutex pmac_backlight_mutex;
>   
> -extern int pmac_backlight_curve_lookup(struct fb_info *info, int value);
> -
>   extern int pmac_has_backlight_type(const char *type);
>   
>   extern void pmac_backlight_key(int direction);
> diff --git a/arch/powerpc/platforms/powermac/backlight.c 
> b/arch/powerpc/platforms/powermac/backlight.c
> index aeb79a8b3e109..12bc01353bd3c 100644
> --- a/arch/powerpc/platforms/powermac/backlight.c
> +++ b/arch/powerpc/platforms/powermac/backlight.c
> @@ -9,7 +9,6 @@
>*/
>   
>   #include <linux/kernel.h>
> -#include <linux/fb.h>
>   #include <linux/backlight.h>
>   #include <linux/adb.h>
>   #include 
> @@ -72,31 +71,6 @@ int pmac_has_backlight_type(const char *type)
>   return 0;
>   }
>   
> -int pmac_backlight_curve_lookup(struct fb_info *info, int value)
> -{
> - int level = (FB_BACKLIGHT_LEVELS - 1);
> -
> - if (info && info->bl_dev) {
> - int i, max = 0;
> -
> - /* Look for biggest value */
> - for (i = 0; i < FB_BACKLIGHT_LEVELS; i++)
> - max = max((int)info->bl_curve[i], max);
> -
> - /* Look for nearest value */
> - for (i = 0; i < FB_BACKLIGHT_LEVELS; i++) {
> - int diff = abs(info->bl_curve[i] - value);
> - if (diff < max) {
> - max = diff;
> - level = i;
> - }
> - }
> -
> - }
> -
> - return level;
> -}
> -
>   static void pmac_backlight_key_worker(struct work_struct *work)
>   {
>   if (atomic_read(&kernel_backlight_disabled))


Re: [PATCH] powerpc/pseries: fix max polling time in plpks_confirm_object_flushed() function

2024-03-04 Thread Christophe Leroy


On 04/03/2024 at 07:53, Nayna Jain wrote:
> 
> usleep_range() function takes input time and range in usec. However,
> currently it is assumed in msec in the function
> plpks_confirm_object_flushed().
> 
> Fix the total polling time for the object flushing from 5msec to 5sec.

I understand when 5000 msec becomes 5000000 usec.

But why does 10 msec become 5000 usec?

Why does 400 become 5000?

Christophe

> 
> Reported-by: Nageswara R Sastry 
> Fixes: 2454a7af0f2a ("powerpc/pseries: define driver for Platform KeyStore")
> Signed-off-by: Nayna Jain 
> Tested-by: Nageswara R Sastry 
> ---
>   arch/powerpc/include/asm/plpks.h | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/plpks.h 
> b/arch/powerpc/include/asm/plpks.h
> index 23b77027c916..8721d97f32c1 100644
> --- a/arch/powerpc/include/asm/plpks.h
> +++ b/arch/powerpc/include/asm/plpks.h
> @@ -44,9 +44,9 @@
>   #define PLPKS_MAX_DATA_SIZE	4000
> 
>   // Timeouts for PLPKS operations
> -#define PLPKS_MAX_TIMEOUT	5000 // msec
> -#define PLPKS_FLUSH_SLEEP	10 // msec
> -#define PLPKS_FLUSH_SLEEP_RANGE	400
> +#define PLPKS_MAX_TIMEOUT	5000000 // usec
> +#define PLPKS_FLUSH_SLEEP	5000 // usec
> +#define PLPKS_FLUSH_SLEEP_RANGE	5000
> 
>   struct plpks_var {
>  char *component;
> --
> 2.31.1
> 
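
For context, these constants feed a polling loop of roughly this shape 
(a sketch of the pattern, not the exact pseries code; note that 
usleep_range() takes both of its bounds in microseconds):

	u64 timeout = 0;

	do {
		rc = poll_object_status();	/* hypothetical helper */
		if (rc != H_BUSY)
			break;
		usleep_range(PLPKS_FLUSH_SLEEP,
			     PLPKS_FLUSH_SLEEP + PLPKS_FLUSH_SLEEP_RANGE);
		timeout += PLPKS_FLUSH_SLEEP;
	} while (timeout < PLPKS_MAX_TIMEOUT);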


Re: [PATCH] powerpc: include linux/backlight.h from asm/backlight.h

2024-03-04 Thread Christophe Leroy


On 04/03/2024 at 11:32, Thomas Zimmermann wrote:
> Hi
> 
> On 04.03.24 at 10:55, Jani Nikula wrote:
>> Removal of the backlight include from fb.h uncovered an implicit
>> dependency in powerpc asm/backlight.h. Add the explicit include.
>>
>> Reported-by: Naresh Kamboju 
>> Closes: 
>> https://lore.kernel.org/r/ca+g9fysak5tbqqxfc2w4ohlga0cbthmxbeq8qayfxtu75yi...@mail.gmail.com
>> Fixes: 11b4eedfc87d ("fbdev: Do not include <linux/backlight.h> in 
>> header")
>> Cc: Thomas Zimmermann 
>> Cc: Helge Deller 
>> Cc: linux-fb...@vger.kernel.org
>> Signed-off-by: Jani Nikula 
>>
>> ---
>>
>> Not even compile tested!
> 
> That's one of the cases that's hard to catch unless you get the config 
> right.
> 
>> ---
>>   arch/powerpc/include/asm/backlight.h | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/powerpc/include/asm/backlight.h 
>> b/arch/powerpc/include/asm/backlight.h
>> index 1b5eab62ed04..275d5bb9aa04 100644
>> --- a/arch/powerpc/include/asm/backlight.h
>> +++ b/arch/powerpc/include/asm/backlight.h
>> @@ -10,6 +10,7 @@
>>   #define __ASM_POWERPC_BACKLIGHT_H
>>   #ifdef __KERNEL__
>> +#include <linux/backlight.h>
> 
> Thanks, but I think this should go directly into chipsfb.c. I would have 
> provided a patch already, if our mail server didn't have issues this 
> morning. Let me try again.

asm/backlight.h needs it for struct backlight_device

At least if you don't want to include linux/backlight.h in 
asm/backlight.h, then you need a forward declaration of struct 
backlight_device;

> 
> Best regards
> Thomas
> 
>>   #include <linux/fb.h>
>>   #include <linux/mutex.h>
> 


Re: [PATCH v2 5/9] mm: Initialize struct vm_unmapped_area_info

2024-03-04 Thread Christophe Leroy


On 02/03/2024 at 02:51, Kees Cook wrote:
> On Sat, Mar 02, 2024 at 12:47:08AM +, Edgecombe, Rick P wrote:
>> On Wed, 2024-02-28 at 09:21 -0800, Kees Cook wrote:
>>> I totally understand. If the "uninitialized" warnings were actually
>>> reliable, I would agree. I look at it this way:
>>>
>>> - initializations can be missed either in static initializers or via
>>>    run time initializers. (So the risk of mistake here is matched --
>>>    though I'd argue it's easier to *find* static initializers when
>>> adding
>>>    new struct members.)
>>> - uninitialized warnings are inconsistent (this becomes an unknown
>>> risk)
>>> - when a run time initializer is missed, the contents are whatever
>>> was
>>>    on the stack (high risk)
>>> - what a static initializer is missed, the content is 0 (low risk)
>>>
>>> I think unambiguous state (always 0) is significantly more important
>>> for
>>> the safety of the system as a whole. Yes, individual cases maybe bad
>>> ("what uid should this be? root?!") but from a general memory safety
>>> perspective the value doesn't become potentially influenced by order
>>> of
>>> operations, leftover stack memory, etc.
>>>
>>> I'd agree, lifting everything into a static initializer does seem
>>> cleanest of all the choices.
>>
>> Hi Kees,
>>
>> Well, I just gave this a try. It is giving me flashbacks of when I last
>> had to do a tree wide change that I couldn't fully test and the
>> breakage was caught by Linus.
> 
> Yeah, testing isn't fun for these kinds of things. This is traditionally
> why the "obviously correct" changes tend to have an easier time landing
> (i.e. adding "= {}" to all of them).
> 
>> Could you let me know if you think this is additionally worthwhile
>> cleanup outside of the guard gap improvements of this series? Because I
>> was thinking a more cowardly approach could be a new vm_unmapped_area()
>> variant that takes the new start gap member as a separate argument
>> outside of struct vm_unmapped_area_info. It would be kind of strange to
>> keep them separate, but it would be less likely to bump something.
> 
> I think you want a new member -- AIUI, that's what that struct is for.
> 
> Looking at this resulting set of patches, I do kinda think just adding
> the "= {}" in a single patch is more sensible. Having to split things
> that are know at the top of the function from the stuff known at the
> existing initialization time is rather awkward.
> 
> Personally, I think a single patch that sets "= {}" for all of them and
> drop the all the "= 0" or "= NULL" assignments would be the cleanest way
> to go.

I agree with Kees, set = {} and drop all the "something = 0;" stuff.

Christophe
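
Spelled out, the two styles discussed above look like this (a sketch; 
field names as in the current struct vm_unmapped_area_info):

	/* Today: every field assigned by hand. A field later added to the
	 * struct is silently left as stack garbage in any caller that is
	 * not updated. */
	struct vm_unmapped_area_info info;

	info.flags = 0;
	info.length = len;
	info.low_limit = mm->mmap_base;
	info.high_limit = mmap_end;
	info.align_mask = 0;
	info.align_offset = 0;

	/* Proposed: zero-init at declaration. New fields default to 0 and
	 * the redundant "= 0" assignments can be dropped. */
	struct vm_unmapped_area_info info = {};

	info.length = len;
	info.low_limit = mm->mmap_base;
	info.high_limit = mmap_end;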


Re: BUG: Bad page map in process init pte:c0ab684c pmd:01182000 (on a PowerMac G4 DP)

2024-02-29 Thread Christophe Leroy


On 29/02/2024 at 02:09, Erhard Furtner wrote:
> On Mon, 12 Dec 2022 14:31:35 +1000
> "Nicholas Piggin"  wrote:
> 
>> On Thu Dec 1, 2022 at 7:44 AM AEST, Erhard F. wrote:
>>> Getting this at boot sometimes, but not always (PowerMac G4 DP, kernel 
>>> 6.0.9):
>>>
>>> [...]
>>> Freeing unused kernel image (initmem) memory: 1328K
>>> Checked W+X mappings: passed, no W+X pages found
>>> rodata_test: all tests were successful
>>> Run /sbin/init as init process
>>> _swap_info_get: Bad swap file entry 24c0ab68
>>> BUG: Bad page map in process init  pte:c0ab684c pmd:01182000
>>
>> Have you run memtest on the system? Are the messages related to a
>> kernel upgrade? This and your KASAN bugs look possibly like random
>> corruption.
>>
>> Although with that KASAN one it's strange that kernfs_node_cache
>> was involved both times, it's strange that page tables are pointing
>> to that same slab memory. It could be a page table page use-after
>> -free maybe? Maybe with the page table fragment code. I'm sure other
>> people would have hit that before though, so I don't know what to
>> suggest.
>>
>> Thanks,
>> Nick
> 
> Revisited the issue on kernel v6.8-rc6 and I can still reproduce it.
> 
> Short summary as my last post was over a year ago:
>   (x) I get this memory corruption only when CONFIG_VMAP_STACK=y and 
> CONFIG_SMP=y is enabled.
>   (x) I don't get this memory corruption when only one of the above is 
> enabled. ^^
>   (x) memtester says the 2 GiB RAM in my G4 DP are fine.
>   (x) I don't get this issue on my G5 11,2 or Talos II.
>   (x) "stress -m 2 --vm-bytes 965M" provokes the issue in < 10 secs. 
> (https://salsa.debian.org/debian/stress)
> 
> For the test I used CONFIG_KASAN_INLINE=y for v6.8-rc6 and 
> debug_pagealloc=on, page_owner=on and got this dmesg:
> 
> [...]
> pagealloc: memory corruption
> f5fcfff0: 00 00 00 00  
> CPU: 1 PID: 1788 Comm: stress Tainted: GB  6.8.0-rc6-PMacG4 
> #15
> Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
> Call Trace:
> [f3bfbac0] [c162a8e8] dump_stack_lvl+0x60/0x94 (unreliable)
> [f3bfbae0] [c04edf9c] __kernel_unpoison_pages+0x1e0/0x1f0
> [f3bfbb30] [c04a8aa0] post_alloc_hook+0xe0/0x174
> [f3bfbb60] [c04a8b58] prep_new_page+0x24/0xbc
> [f3bfbb80] [c04abcc4] get_page_from_freelist+0xcd0/0xf10
> [f3bfbc50] [c04aecd8] __alloc_pages+0x204/0xe2c
> [f3bfbda0] [c04b07a8] __folio_alloc+0x18/0x88
> [f3bfbdc0] [c0461a10] vma_alloc_zeroed_movable_folio.isra.0+0x2c/0x6c
> [f3bfbde0] [c046bb90] handle_mm_fault+0x91c/0x19ac
> [f3bfbec0] [c0047b8c] ___do_page_fault+0x93c/0xc14
> [f3bfbf10] [c0048278] do_page_fault+0x28/0x60
> [f3bfbf30] [c000433c] DataAccess_virt+0x124/0x17c
> --- interrupt: 300 at 0xbe30d8
> NIP:  00be30d8 LR: 00be30b4 CTR: 
> REGS: f3bfbf40 TRAP: 0300   Tainted: GB   (6.8.0-rc6-PMacG4)
> MSR:  d032   CR: 20882464  XER: 
> DAR: 88c7a010 DSISR: 4200
> GPR00: 00be30b4 af8397d0 a78436c0 6b2ee010 3c50 20224462 fe77f7e1 00b00264
> GPR08: 1d98d000 1d98c000  40ae256a 20882262 00b4  
> GPR16:  0002  005a 40802262 80002262 40002262 00c000a4
> GPR24:   3c50   6b2ee010 00c07d64 1000
> NIP [00be30d8] 0xbe30d8
> LR [00be30b4] 0xbe30b4
> --- interrupt: 300
> page:ef4bd92c refcount:1 mapcount:0 mapping: index:0x1 pfn:0x310b3
> flags: 0x8000(zone=2)
> page_type: 0x()
> raw: 8000 0100 0122  0001   0001
> raw: 
> page dumped because: pagealloc: corrupted page details
> page_owner info is not present (never set?)
> swapper/1: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), 
> nodemask=(null),cpuset=/,mems_allowed=0
> CPU: 1 PID: 0 Comm: swapper/1 Tainted: GB  6.8.0-rc6-PMacG4 
> #15
> Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
> Call Trace:
> [f101b9d0] [c162a8e8] dump_stack_lvl+0x60/0x94 (unreliable)
> [f101b9f0] [c04ae948] warn_alloc+0x154/0x2e0
> [f101bab0] [c04af030] __alloc_pages+0x55c/0xe2c
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>cache: skbuff_head_cache, object size: 176, buffer size: 288, default 
> order: 0, min order: 0
>node 0: slabs: 509, objs: 7126, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>cache: skbuff_head_cache, object size: 176, buffer size: 288, default 
> order: 0, min order: 0
>node 0: slabs: 509, objs: 7126, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>cache: skbuff_head_cache, object size: 176, buffer size: 288, default 
> order: 0, min order: 0
>node 0: slabs: 509, objs: 7126, free: 0
> SLUB: Unable to allocate memory on node -1, gfp=0x820(GFP_ATOMIC)
>cache: skbuff_head_cache, object size: 176, buffer size: 288, default 
> order: 0, min order: 0
>node 0: slabs: 509, objs: 7126, free: 0
> SLUB: Unable to allocate memory on node -1, 

Re: [PATCH v2 5/9] mm: Initialize struct vm_unmapped_area_info

2024-02-28 Thread Christophe Leroy


On 28/02/2024 at 18:01, Edgecombe, Rick P wrote:
> On Wed, 2024-02-28 at 13:22 +0000, Christophe Leroy wrote:
>>> Any preference? Or maybe am I missing your point and talking
>>> nonsense?
>>>
>>
>> So my preference would go to the addition of:
>>
>>  info.new_field = 0;
>>
>> But that's very minor and if you think it is easier to manage and
>> maintain by performing {} initialisation at declaration, lets go for
>> that.
> 
> Appreciate the clarification and help getting this right. I'm thinking
> Kees' and now Kirill's point about this patch resulting in unnecessary
> manual zero initialization of the structs is probably something that
> needs to be addressed.
> 
> If I created a bunch of patches to change each call site, I think the
> the best is probably to do the designated field zero initialization
> way.
> 
> But I can do something for powerpc special if you want. I'll first try
> with powerpc matching the others, and if it seems objectionable, please
> let me know.
> 

My comments were generic, they were not powerpc oriented. Please keep 
powerpc as similar as possible to the others.

Christophe


Re: [revert 0d60d8df6f49] [net/net-next] [6.8-rc5] Build Failure

2024-02-28 Thread Christophe Leroy


On 28/02/2024 at 15:43, Eric Dumazet wrote:
> On Wed, Feb 28, 2024 at 3:07 PM Vadim Fedorenko
>  wrote:
>>
>> On 28/02/2024 11:09, Tasmiya Nalatwad wrote:
>>> Greetings,
>>>
>>> [revert 0d60d8df6f49] [net/net-next] [6.8-rc5] Build Failure
>>>
>>> Reverting below commit fixes the issue
>>>
>>> commit 0d60d8df6f493bb46bf5db40d39dd60a1bafdd4e
>>>   dpll: rely on rcu for netdev_dpll_pin()
>>>
>>> --- Traces ---
>>>
>>> ./include/linux/dpll.h: In function ‘netdev_dpll_pin’:
>>> ./include/linux/rcupdate.h:439:9: error: dereferencing pointer to
>>> incomplete type ‘struct dpll_pin’
>>> typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
>>>^
>>> ./include/linux/rcupdate.h:587:2: note: in expansion of macro
>>> ‘__rcu_dereference_check’
>>> __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
>>> ^~~
>>> ./include/linux/rtnetlink.h:70:2: note: in expansion of macro
>>> ‘rcu_dereference_check’
>>> rcu_dereference_check(p, lockdep_rtnl_is_held())
>>> ^
>>> ./include/linux/dpll.h:175:9: note: in expansion of macro
>>> ‘rcu_dereference_rtnl’
>>> return rcu_dereference_rtnl(dev->dpll_pin);
>>>^~~~
>>> make[4]: *** [scripts/Makefile.build:243: drivers/dpll/dpll_core.o] Error 1
>>> make[4]: *** Waiting for unfinished jobs
>>> AR  net/mpls/built-in.a
>>> AR  net/l3mdev/built-in.a
>>> In file included from ./include/linux/rbtree.h:24,
>>>from ./include/linux/mm_types.h:11,
>>>from ./include/linux/mmzone.h:22,
>>>from ./include/linux/gfp.h:7,
>>>from ./include/linux/umh.h:4,
>>>from ./include/linux/kmod.h:9,
>>>from ./include/linux/module.h:17,
>>>from drivers/dpll/dpll_netlink.c:9:
>>> ./include/linux/dpll.h: In function ‘netdev_dpll_pin’:
>>> ./include/linux/rcupdate.h:439:9: error: dereferencing pointer to
>>> incomplete type ‘struct dpll_pin’
>>> typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
>>>^
>>> ./include/linux/rcupdate.h:587:2: note: in expansion of macro
>>> ‘__rcu_dereference_check’
>>> __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
>>> ^~~
>>> ./include/linux/rtnetlink.h:70:2: note: in expansion of macro
>>> ‘rcu_dereference_check’
>>> rcu_dereference_check(p, lockdep_rtnl_is_held())
>>> ^
>>> ./include/linux/dpll.h:175:9: note: in expansion of macro
>>> ‘rcu_dereference_rtnl’
>>> return rcu_dereference_rtnl(dev->dpll_pin);
>>>^~~~
>>> make[4]: *** [scripts/Makefile.build:243: drivers/dpll/dpll_netlink.o]
>>> Error 1
>>> make[3]: *** [scripts/Makefile.build:481: drivers/dpll] Error 2
>>> make[3]: *** Waiting for unfinished jobs
>>> In file included from ./arch/powerpc/include/generated/asm/rwonce.h:1,
>>>from ./include/linux/compiler.h:251,
>>>from ./include/linux/instrumented.h:10,
>>>from ./include/linux/uaccess.h:6,
>>>from net/core/dev.c:71:
>>> net/core/dev.c: In function ‘netdev_dpll_pin_assign’:
>>> ./include/linux/rcupdate.h:462:36: error: dereferencing pointer to
>>> incomplete type ‘struct dpll_pin’
>>>#define RCU_INITIALIZER(v) (typeof(*(v)) __force __rcu *)(v)
>>>   ^~~~
>>> ./include/asm-generic/rwonce.h:55:33: note: in definition of macro
>>> ‘__WRITE_ONCE’
>>> *(volatile typeof(x) *)&(x) = (val);\
>>>^~~
>>> ./arch/powerpc/include/asm/barrier.h:76:2: note: in expansion of macro
>>> ‘WRITE_ONCE’
>>> WRITE_ONCE(*p, v);  \
>>> ^~
>>> ./include/asm-generic/barrier.h:172:55: note: in expansion of macro
>>> ‘__smp_store_release’
>>>#define smp_store_release(p, v) do { kcsan_release();
>>> __smp_store_release(p, v); } while (0)
>>> ^~~
>>> ./include/linux/rcupdate.h:503:3: note: in expansion of macro
>>> ‘smp_store_release’
>>>  smp_store_release(&p, RCU_INITIALIZER((typeof(p))_r_a_p__v)); \
>>>  ^
>>> ./include/linux/rcupdate.h:503:25: note: in expansion of macro
>>> ‘RCU_INITIALIZER’
>>>  smp_store_release(&p, RCU_INITIALIZER((typeof(p))_r_a_p__v)); \
>>>^~~
>>> net/core/dev.c:9081:2: note: in expansion of macro ‘rcu_assign_pointer’
>>> rcu_assign_pointer(dev->dpll_pin, dpll_pin);
>>> ^~
>>> make[4]: *** [scripts/Makefile.build:243: net/core/dev.o] Error 1
>>> make[4]: *** Waiting for unfinished jobs
>>> AR  drivers/net/ethernet/built-in.a
>>> AR  drivers/net/built-in.a
>>> AR  net/dcb/built-in.a
>>> AR  net/netlabel/built-in.a
>>> AR  net/strparser/built-in.a
>>> AR  net/handshake/built-in.a
>>> GEN lib/test_fortify.log
>>> AR  net/8021q/built-in.a
>>> AR  
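
The errors above share one root cause: rcu_dereference_rtnl() and 
rcu_assign_pointer() expand typeof(*p), and with only a forward 
declaration in scope the compiler cannot complete struct dpll_pin. A 
minimal illustration of the failure mode (a sketch, not the fix):

struct dpll_pin;	/* forward declaration only: incomplete type */

static inline struct dpll_pin *get_pin(struct net_device *dev)
{
	/*
	 * Expands to code containing typeof(*(dev->dpll_pin)), i.e. it
	 * "dereferences" the pointee type; with an incomplete struct
	 * dpll_pin this is exactly the error the build log shows.
	 */
	return rcu_dereference_rtnl(dev->dpll_pin);
}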

Re: [PATCH v2 5/9] mm: Initialize struct vm_unmapped_area_info

2024-02-28 Thread Christophe Leroy


On 27/02/2024 at 21:25, Edgecombe, Rick P wrote:
> On Tue, 2024-02-27 at 18:16 +0000, Christophe Leroy wrote:
>>>> Why doing a full init of the struct when all fields are re-
>>>> written a few
>>>> lines after ?
>>>
>>> It's a nice change for robustness and makes future changes easier.
>>> It's
>>> not actually wasteful since the compiler will throw away all
>>> redundant
>>> stores.
>>
>> Well, I tend to dislike default init at declaration because it often
>> hides missed real init. When a field is not initialized GCC should
>> emit
>> a Warning, at least when built with W=2 which sets
>> -Wmissing-field-initializers ?
> 
> Sorry, I'm not following where you are going with this. There aren't
> any struct vm_unmapped_area_info users that use initializers today, so
> that warning won't apply in this case. Meanwhile, designated style
> struct initialization (which would zero new members) is very common, as
> well as not get anything checked by that warning. Anything with this
> many members is probably going to use the designated style.
> 
> If we are optimizing to avoid bugs, the way this struct is used today
> is not great. It is essentially being used as an argument passer.
> Normally when a function signature changes, but a caller is missed, of
> course the compiler will notice loudly. But not here. So I think
> probably zero initializing it is safer than being setup to pass
> garbage.

No worry. If everybody thinks that init at declaration is worth it in 
that case, it is OK for me and I'm not going to ask for something special 
on powerpc; my comment was more general, although I used powerpc as an 
example.

My worry with initialisation at declaration is that it often hides missing 
assignments. Let's take the following simple example:

char *colour(int num)
{
char *name;

if (num == 0) {
name = "black";
} else if (num == 1) {
name = "white";
} else if (num == 2) {
} else {
name = "no colour";
}

return name;
}


Here, GCC warns about a missing initialisation of variable 'name'.

But if I declare it as

char *name = "no colour";

Then GCC no longer warns that we are missing a value for the case where num is 2.

Over the years I have many times spent huge amounts of time 
investigating issues and bugs caused by missing assignments that went 
undetected because of default initialisation at declaration.

> 
> I'm trying to figure out what to do here. If I changed it so that just
> powerpc set the new field manually, then the convention across the
> kernel would be for everything to be default zero, and future other new
> parameters could have a greater chance of turning into garbage on
> powerpc. Since it could be easy to miss that powerpc was special. Would
> you prefer it?
> 
> Or maybe I could try a new vm_unmapped_area() that takes the extra
> argument separately? The old callers could call the old function and
> not need any arch updates. It all seems strange though, because
> automatic zero initializing struct members is so common in the kernel.
> But it also wouldn't add the cleanup Kees was pointing out. Hmm.
> 
> Any preference? Or maybe am I missing your point and talking nonsense?
> 

So my preference would go to the addition of:

info.new_field = 0;

But that's very minor and if you think it is easier to manage and 
maintain by performing {} initialisation at declaration, let's go for that.

Christophe


Re: [PATCH 0/5] mm/treewide: Replace pXd_large() with pXd_leaf()

2024-02-28 Thread Christophe Leroy


On 28/02/2024 at 09:53, pet...@redhat.com wrote:
> From: Peter Xu 
> 
> [based on latest akpm/mm-unstable, commit 1274e7646240]
> 
> These two APIs are mostly always the same.  It's confusing to have both of
> them.  Merge them into one.  Here I used pXd_leaf() only because pXd_leaf()
> is a global API which is always defined, while pXd_large() is not.
> 
> We have yet one more API that is similar which is pXd_huge(), but that's
> even trickier, so let's do it step by step.
> 
> Some cautions are needed on either x86 or ppc: x86 is currently the only
> user of p4d_large(), while ppc used to define pXd_large() only with THP,
> while it is not the case for pXd_leaf().  For the rest archs, afaict
> they're 100% identical.

Maybe this would also be a good opportunity to replace pmd_is_leaf() with 
pmd_leaf(), and the same for pud_is_leaf().
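
Something like this at the call sites (a hypothetical sketch, not taken 
from the series):

	/* before: local powerpc helper */
	if (pmd_is_leaf(*pmdp))
		return true;

	/* after: the generic predicate */
	if (pmd_leaf(*pmdp))
		return true;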

Christophe

> 
> Only lightly tested on x86.
> 
> Please have a look, thanks.
> 
> Peter Xu (5):
>mm/ppc: Define pXd_large() with pXd_leaf()
>mm/x86: Replace p4d_large() with p4d_leaf()
>mm/treewide: Replace pmd_large() with pmd_leaf()
>mm/treewide: Replace pud_large() with pud_leaf()
>mm/treewide: Drop pXd_large()
> 
>   arch/arm/include/asm/pgtable-2level.h|  1 -
>   arch/arm/include/asm/pgtable-3level.h|  1 -
>   arch/arm/mm/dump.c   |  4 ++--
>   arch/powerpc/include/asm/book3s/64/pgtable.h | 14 --
>   arch/powerpc/include/asm/pgtable.h   |  4 
>   arch/powerpc/mm/book3s64/pgtable.c   |  4 ++--
>   arch/powerpc/mm/book3s64/radix_pgtable.c |  2 +-
>   arch/powerpc/mm/pgtable_64.c |  2 +-
>   arch/s390/boot/vmem.c|  4 ++--
>   arch/s390/include/asm/pgtable.h  | 20 ++--
>   arch/s390/mm/gmap.c  | 14 +++---
>   arch/s390/mm/hugetlbpage.c   |  6 +++---
>   arch/s390/mm/pageattr.c  |  4 ++--
>   arch/s390/mm/pgtable.c   |  8 
>   arch/s390/mm/vmem.c  | 12 ++--
>   arch/sparc/include/asm/pgtable_64.h  |  8 
>   arch/sparc/mm/init_64.c  |  6 +++---
>   arch/x86/boot/compressed/ident_map_64.c  |  2 +-
>   arch/x86/include/asm/pgtable.h   | 15 +++
>   arch/x86/kvm/mmu/mmu.c   |  4 ++--
>   arch/x86/mm/fault.c  | 16 
>   arch/x86/mm/ident_map.c  |  2 +-
>   arch/x86/mm/init_32.c|  2 +-
>   arch/x86/mm/init_64.c| 14 +++---
>   arch/x86/mm/kasan_init_64.c  |  4 ++--
>   arch/x86/mm/mem_encrypt_identity.c   |  6 +++---
>   arch/x86/mm/pat/set_memory.c | 14 +++---
>   arch/x86/mm/pgtable.c|  4 ++--
>   arch/x86/mm/pti.c|  8 
>   arch/x86/power/hibernate.c   |  6 +++---
>   arch/x86/xen/mmu_pv.c| 10 +-
>   drivers/misc/sgi-gru/grufault.c  |  2 +-
>   32 files changed, 101 insertions(+), 122 deletions(-)
> 


Re: [PATCH v2 5/9] mm: Initialize struct vm_unmapped_area_info

2024-02-27 Thread Christophe Leroy


On 27/02/2024 at 19:07, Kees Cook wrote:
> On Tue, Feb 27, 2024 at 07:02:59AM +0000, Christophe Leroy wrote:
>>
>>
>> Le 26/02/2024 à 20:09, Rick Edgecombe a écrit :
>>> Future changes will need to add a field to struct vm_unmapped_area_info.
>>> This would cause trouble for any archs that don't initialize the
>>> struct. Currently every user sets each field, so if new fields are
>>> added, the core code parsing the struct will see garbage in the new
>>> field.
>>>
>>> It could be possible to initialize the new field for each arch to 0, but
>>> instead simply inialize the field with a C99 struct inializing syntax.
>>
>> Why doing a full init of the struct when all fields are re-written a few
>> lines after ?
> 
> It's a nice change for robustness and makes future changes easier. It's
> not actually wasteful since the compiler will throw away all redundant
> stores.

Well, I tend to dislike default init at declaration because it often 
hides a missed real init. When a field is not initialized, GCC should 
emit a warning, at least when built with W=2, which sets 
-Wmissing-field-initializers?

> 
>> If I take the exemple of powerpc function slice_find_area_bottomup():
>>
>>  struct vm_unmapped_area_info info;
>>
>>  info.flags = 0;
>>  info.length = len;
>>  info.align_mask = PAGE_MASK & ((1ul << pshift) - 1);
>>  info.align_offset = 0;
> 
> But one cleanup that is possible from explicitly zero-initializing the
> whole structure would be dropping all the individual "= 0" assignments.
> :)
> 

Sure, if we decide to go in that direction, all those 0 assignments become void.
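
E.g. the slice_find_area_bottomup() snippet above would then reduce to 
something like this (a sketch only, keeping the values shown; the 
unnamed fields are implicitly zeroed):

	struct vm_unmapped_area_info info = {
		.length = len,
		.align_mask = PAGE_MASK & ((1ul << pshift) - 1),
	};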

