Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-13 Thread Catalin Marinas
On Fri, Sep 13, 2024 at 11:08:23AM +0100, Catalin Marinas wrote:
> On Thu, Sep 12, 2024 at 02:15:59PM -0700, Charlie Jenkins wrote:
> > On Thu, Sep 12, 2024 at 11:53:49AM +0100, Catalin Marinas wrote:
> > > On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
> > > > Opting-in to the higher address space is reasonable. However, it is not
> > > > my preference, because the purpose of this flag is to ensure that
> > > > allocations do not exceed 47-bits, so it is a clearer ABI to have the
> > > > applications that want this guarantee to be the ones setting the flag,
> > > > rather than the applications that want the higher bits setting the flag.
[...]
> Anyway, the prctl() can go both ways, either expanding or limiting the
> default address space. So I'd be fine with such an interface.

Ah, I just realised (while reading Lorenzo's reply) that we can't really
restrict the space via a prctl() as we have the main thread stack
already allocated by the kernel before the user code starts. You may
need to limit this stack as well, not just the later heap allocations
(anonymous mmap()).

-- 
Catalin



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-13 Thread Catalin Marinas
On Thu, Sep 12, 2024 at 02:15:59PM -0700, Charlie Jenkins wrote:
> On Thu, Sep 12, 2024 at 11:53:49AM +0100, Catalin Marinas wrote:
> > On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
> > > Opting-in to the higher address space is reasonable. However, it is not
> > > my preference, because the purpose of this flag is to ensure that
> > > allocations do not exceed 47-bits, so it is a clearer ABI to have the
> > > applications that want this guarantee to be the ones setting the flag,
> > > rather than the applications that want the higher bits setting the flag.
> > 
> > Yes, this would be ideal. Unfortunately those applications don't know
> > they need to set a flag in order to work.
> 
> It's not a regression, the applications never worked (on platforms that
> do not have this default). The 47-bit default would allow applications
> that didn't work to start working at the cost of a non-ideal ABI. That
> doesn't seem like a reasonable tradeoff to me.  If applications want to
> run on new hardware that has different requirements, shouldn't they be
> required to update rather than expect the kernel will solve their
> problems for them?

That's a valid point but it depends on the application and how much
effort you want to spend updating user-space. OpenJDK is fine: if you
need a JIT you'll have to add support for that architecture anyway. But
others are arch-agnostic, you just recompile for your target. It's not
an ABI problem, more of an API one.

The x86 case (and powerpc/arm64) was different: the 47-bit limit worked
for a long time before it was expanded. So it made a lot of sense to keep the
same default.

Anyway, the prctl() can go both ways, either expanding or limiting the
default address space. So I'd be fine with such an interface.

-- 
Catalin



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-12 Thread Catalin Marinas
On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
> Opting-in to the higher address space is reasonable. However, it is not
> my preference, because the purpose of this flag is to ensure that
> allocations do not exceed 47-bits, so it is a clearer ABI to have the
> applications that want this guarantee to be the ones setting the flag,
> rather than the applications that want the higher bits setting the flag.

Yes, this would be ideal. Unfortunately those applications don't know
they need to set a flag in order to work.

A slightly better option is to leave the default 47-bit limit at the kernel
ABI level and have the libc/dynamic loader issue the prctl(). You can
control the default with environment variables if needed.
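
To make that concrete, a rough sketch of the loader side (the prctl()
command number and the environment variable name below are placeholders
invented for illustration, no such interface exists yet):

#include <stdlib.h>
#include <sys/prctl.h>

/* placeholder command number, not a real prctl() */
#define PR_SET_MAX_VADDR_BITS   1000

static void loader_set_va_limit(void)
{
        /* e.g. set by the distro or a per-application wrapper */
        const char *bits = getenv("LD_MAX_VADDR_BITS");

        if (bits)
                prctl(PR_SET_MAX_VADDR_BITS, atoi(bits), 0, 0, 0);
}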

We do something similar in glibc for arm64 MTE. When MTE is enabled, the
top byte of an allocated pointer contains the tag that must not be
corrupted. We left the decision to the C library via the
glibc.mem.tagging tunable (Android has something similar via the app
manifest). An app can change the default if it wants but if you run with
old glibc or no environment variable to say otherwise, the default would
be safe. Distros can set the environment to be the maximum range by
default if they know the apps included have been upgraded and tested.

-- 
Catalin



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-11 Thread Catalin Marinas
On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote:
> On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
> > * Catalin Marinas  [240906 07:44]:
> > > On Fri, Sep 06, 2024 at 09:55:42AM +, Arnd Bergmann wrote:
> > > > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> > > > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann  wrote:
> > > > >> It's also unclear to me how we want this flag to interact with
> > > > >> the existing logic in arch_get_mmap_end(), which attempts to
> > > > >> limit the default mapping to a 47-bit address space already.
> > > > >
> > > > > To optimize RISC-V progress, I recommend:
> > > > >
> > > > > Step 1: Approve the patch.
> > > > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> > > > > Step 3: Wait approximately several iterations for Go & OpenJDK
> > > > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()

Point 4 is an ABI change. What guarantees that there isn't still
software out there that relies on the old behaviour?

> > > > I really want to first see a plausible explanation about why
> > > > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW
> > > > like all the other major architectures (x86, arm64, powerpc64),
> > > 
> > > FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default
> > > configuration. We end up with a 47-bit limit with 16K pages but for a
> > > different reason that has to do with LPA2 support (I doubt we need this
> > > for the user mapping but we need to untangle some of the macros there;
> > > that's for a separate discussion).
> > > 
> > > That said, we haven't encountered any user space problems with a 48-bit
> > > DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar
> > > approach (47 or 48 bit default limit). Better to have some ABI
> > > consistency between architectures. One can still ask for addresses above
> > > this default limit via mmap().
> > 
> > I think that is best as well.
> > 
> > Can we please just do what x86 and arm64 does?
> 
> I responded to Arnd in the other thread, but I am still not convinced
> that the solution that x86 and arm64 have selected is the best solution.
> The solution of defaulting to 47 bits does allow applications the
> ability to get addresses that are below 47 bits. However, due to
> differences across architectures it doesn't seem possible to have all
> architectures default to the same value. Additionally, this flag will be
> able to help users avoid potential bugs where a hint address is passed
> that causes upper bits of a VA to be used.

The reason we added this limit on arm64 is that we noticed programs
using the top 8 bits of a 64-bit pointer for additional information.
IIRC, it wasn't even OpenJDK but some JavaScript JIT. We could have
taught those programs about a new flag but, since we couldn't tell how
many are out there, it was safest to default to a smaller limit and opt
in to the higher one. Such opt-in is via mmap() but if you prefer a
prctl() flag, that's fine by me as well (though I think this should be
opt-in to higher addresses rather than opt-out of the higher addresses).
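
For reference, the existing mmap() opt-in from user space looks roughly
like this: a hint address above the default window tells the kernel that
this particular mapping may be placed at a high address (the hint value
and missing error handling here are only illustrative):

#include <stddef.h>
#include <sys/mman.h>

static void *map_high(size_t len)
{
        /* a hint above the 47/48-bit default window opts this mapping in */
        return mmap((void *)(1UL << 52), len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}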

-- 
Catalin



Re: [PATCH RFC v3 1/2] mm: Add personality flag to limit address to 47 bits

2024-09-06 Thread Catalin Marinas
On Fri, Sep 06, 2024 at 09:55:42AM +, Arnd Bergmann wrote:
> On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann  wrote:
> >> It's also unclear to me how we want this flag to interact with
> >> the existing logic in arch_get_mmap_end(), which attempts to
> >> limit the default mapping to a 47-bit address space already.
> >
> > To optimize RISC-V progress, I recommend:
> >
> > Step 1: Approve the patch.
> > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> > Step 3: Wait approximately several iterations for Go & OpenJDK
> > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
> 
> I really want to first see a plausible explanation about why
> RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW
> like all the other major architectures (x86, arm64, powerpc64),

FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default
configuration. We end up with a 47-bit limit with 16K pages but for a
different reason that has to do with LPA2 support (I doubt we need this
for the user mapping but we need to untangle some of the macros there;
that's for a separate discussion).

That said, we haven't encountered any user space problems with a 48-bit
DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar
approach (47 or 48 bit default limit). Better to have some ABI
consistency between architectures. One can still ask for addresses above
this default limit via mmap().

-- 
Catalin



Re: [PATCH v5 06/30] arm64: context switch POR_EL0 register

2024-09-04 Thread Catalin Marinas
On Wed, Sep 04, 2024 at 11:22:54AM +0100, Will Deacon wrote:
> On Tue, Sep 03, 2024 at 03:54:13PM +0100, Joey Gouly wrote:
> > commit 3141fb86bee8d48ae47cab1594dad54f974a8899
> > Author: Joey Gouly 
> > Date:   Tue Sep 3 15:47:26 2024 +0100
> > 
> > fixup! arm64: context switch POR_EL0 register
> > 
> > diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> > index a3a61ecdb165..c224b0955f1a 100644
> > --- a/arch/arm64/kernel/process.c
> > +++ b/arch/arm64/kernel/process.c
> > @@ -515,11 +515,8 @@ static void permission_overlay_switch(struct 
> > task_struct *next)
> > return;
> > 
> > current->thread.por_el0 = read_sysreg_s(SYS_POR_EL0);
> > -   if (current->thread.por_el0 != next->thread.por_el0) {
> > +   if (current->thread.por_el0 != next->thread.por_el0)
> > write_sysreg_s(next->thread.por_el0, SYS_POR_EL0);
> > -   /* ISB required for kernel uaccess routines when chaning 
> > POR_EL0 */
> > -   isb();
> > -   }
> >  }
> 
> What about the one in flush_poe()? I'm inclined to drop that as well.

Yes, that's a similar thing.

-- 
Catalin



Re: [PATCH v5 06/30] arm64: context switch POR_EL0 register

2024-09-02 Thread Catalin Marinas
On Tue, Aug 27, 2024 at 12:38:04PM +0100, Will Deacon wrote:
> On Fri, Aug 23, 2024 at 07:40:52PM +0100, Catalin Marinas wrote:
> > On Fri, Aug 23, 2024 at 06:08:36PM +0100, Will Deacon wrote:
> > > On Fri, Aug 23, 2024 at 05:41:06PM +0100, Catalin Marinas wrote:
> > > > On Fri, Aug 23, 2024 at 03:45:32PM +0100, Will Deacon wrote:
> > > > > On Thu, Aug 22, 2024 at 04:10:49PM +0100, Joey Gouly wrote:
> > > > > > +static void permission_overlay_switch(struct task_struct *next)
> > > > > > +{
> > > > > > +   if (!system_supports_poe())
> > > > > > +   return;
> > > > > > +
> > > > > > +   current->thread.por_el0 = read_sysreg_s(SYS_POR_EL0);
> > > > > > +   if (current->thread.por_el0 != next->thread.por_el0) {
> > > > > > +   write_sysreg_s(next->thread.por_el0, SYS_POR_EL0);
> > > > > > +   /* ISB required for kernel uaccess routines when 
> > > > > > chaning POR_EL0 */
> > > > > 
> > > > > nit: typo "chaning".
> > > > > 
> > > > > But more substantially, is this just to prevent spurious faults in the
> > > > > context of a new thread using a stale value for POR_EL0?
> > > > 
> > > > Not just prevent faults but enforce the permissions from the new
> > > > thread's POR_EL0. The kernel may continue with a uaccess routine from
> > > > here, we can't tell.
[...]
> > > So what do we actually gain by having the uaccess routines honour this?
> > 
> > I guess where it matters is more like not accidentally faulting because
> > the previous thread had more restrictive permissions.
> 
> That's what I wondered initially, but won't the fault handler retry in
> that case?

Yes, it will retry and this should be fine (I assume you are only
talking about dropping the ISB in the context switch).

For the case of running with a more permissive stale POR_EL0, arguably it's
slightly more predictable for the user but, OTOH, some syscalls like
readv() could be routed through GUP with no checks. As with MTE, we
don't guarantee uaccesses honour the user permissions.

That said, at some point we should sanitise this path anyway and have a
single ISB at the end. In the meantime, I'm fine with dropping the ISB
here.

-- 
Catalin



Re: [PATCH v5 06/30] arm64: context switch POR_EL0 register

2024-08-23 Thread Catalin Marinas
On Fri, Aug 23, 2024 at 06:08:36PM +0100, Will Deacon wrote:
> On Fri, Aug 23, 2024 at 05:41:06PM +0100, Catalin Marinas wrote:
> > On Fri, Aug 23, 2024 at 03:45:32PM +0100, Will Deacon wrote:
> > > On Thu, Aug 22, 2024 at 04:10:49PM +0100, Joey Gouly wrote:
> > > > +static void permission_overlay_switch(struct task_struct *next)
> > > > +{
> > > > +   if (!system_supports_poe())
> > > > +   return;
> > > > +
> > > > +   current->thread.por_el0 = read_sysreg_s(SYS_POR_EL0);
> > > > +   if (current->thread.por_el0 != next->thread.por_el0) {
> > > > +   write_sysreg_s(next->thread.por_el0, SYS_POR_EL0);
> > > > +   /* ISB required for kernel uaccess routines when 
> > > > chaning POR_EL0 */
> > > 
> > > nit: typo "chaning".
> > > 
> > > But more substantially, is this just to prevent spurious faults in the
> > > context of a new thread using a stale value for POR_EL0?
> > 
> > Not just prevent faults but enforce the permissions from the new
> > thread's POR_EL0. The kernel may continue with a uaccess routine from
> > here, we can't tell.
> 
> Hmm, I wondered if that was the case. It's a bit weird though, because:
> 
>   - There's a window between switch_mm() and switch_to() where you might
> reasonably expect to be able to execute uaccess routines

I don't think we can have any uaccess between these two switches (a
uaccess could fault, that's a pretty weird state between these two).

>   - kthread_use_mm() doesn't/can't look at this at all

No, but a kthread would have its own, most permissive, POR_EL0.

>   - GUP obviously doesn't care
> 
> So what do we actually gain by having the uaccess routines honour this?

I guess where it matters is more like not accidentally faulting because
the previous thread had more restrictive permissions.

-- 
Catalin



Re: [PATCH v5 06/30] arm64: context switch POR_EL0 register

2024-08-23 Thread Catalin Marinas
On Fri, Aug 23, 2024 at 03:45:32PM +0100, Will Deacon wrote:
> On Thu, Aug 22, 2024 at 04:10:49PM +0100, Joey Gouly wrote:
> > +static void permission_overlay_switch(struct task_struct *next)
> > +{
> > +   if (!system_supports_poe())
> > +   return;
> > +
> > +   current->thread.por_el0 = read_sysreg_s(SYS_POR_EL0);
> > +   if (current->thread.por_el0 != next->thread.por_el0) {
> > +   write_sysreg_s(next->thread.por_el0, SYS_POR_EL0);
> > +   /* ISB required for kernel uaccess routines when chaning 
> > POR_EL0 */
> 
> nit: typo "chaning".
> 
> But more substantially, is this just to prevent spurious faults in the
> context of a new thread using a stale value for POR_EL0?

Not just prevent faults but enforce the permissions from the new
thread's POR_EL0. The kernel may continue with a uaccess routine from
here, we can't tell.

-- 
Catalin



Re: [PATCH v4 18/29] arm64: add POE signal support

2024-08-19 Thread Catalin Marinas
On Thu, Aug 15, 2024 at 04:09:26PM +0100, Dave P Martin wrote:
> On Thu, Aug 15, 2024 at 02:18:15PM +0100, Joey Gouly wrote:
> > That's a lot of words to say, or ask, do you agree with the approach of only
> > saving POR_EL0 in the signal frame if num_allocated_pkeys() > 1?
> > 
> > Thanks,
> > Joey
> 
> ...So..., given all the above, it is perhaps best to go back to
> dumping POR_EL0 unconditionally after all, unless we have a mechanism
> to determine whether pkeys are in use at all.

Ah, I can see why checking for POR_EL0_INIT is useful. Only checking for
the allocated keys gets confusing with pkey 0.

Not sure what the deal is with pkey 0. Is it considered allocated by
default or unallocatable? If the former, it implies that pkeys are
already in use (hence the additional check for POR_EL0_INIT). In
principle the hardware allows us to use permissions where the pkeys do
not apply but we'd run out of indices and PTE bits to encode them, so I
think by default we should assume that pkey 0 is pre-allocated.

So I agree that it's probably best to save it unconditionally.

-- 
Catalin



Re: [PATCH v4 18/29] arm64: add POE signal support

2024-08-14 Thread Catalin Marinas
Hi Joey,

On Tue, Aug 06, 2024 at 03:31:03PM +0100, Joey Gouly wrote:
> diff --git arch/arm64/kernel/signal.c arch/arm64/kernel/signal.c
> index 561986947530..ca7d4e0be275 100644
> --- arch/arm64/kernel/signal.c
> +++ arch/arm64/kernel/signal.c
> @@ -1024,7 +1025,10 @@ static int setup_sigframe_layout(struct 
> rt_sigframe_user_layout *user,
> return err;
> }
>  
> -   if (system_supports_poe()) {
> +   if (system_supports_poe() &&
> +   (add_all ||
> +mm_pkey_allocation_map(current->mm) != 0x1 ||
> +read_sysreg_s(SYS_POR_EL0) != POR_EL0_INIT)) {
> err = sigframe_alloc(user, &user->poe_offset,
>  sizeof(struct poe_context));
> if (err)
> 
> 
> That is, we only save the POR_EL0 value if any pkeys have been allocated 
> (other
> than pkey 0) *or* if POR_EL0 is a non-default value.

I had a chat with Dave as well on this and, in principle, we don't want
to add stuff to the signal frame unnecessarily, especially for old
binaries that have no clue about pkeys. OTOH, it seems too complicated
for just 16 bytes. Also, POR_EL0 being all RWX is a valid combination; I don't
think we should exclude it.

If no pkey has been allocated, I guess we could skip this and it also
matches the x86 description of the PKRU being guaranteed to be preserved
only for the allocated keys. Do we reserve pkey 0 for arm64? I thought
that's only an x86 thing to emulate execute-only mappings.

Another corner case would be the signal handler doing a pkey_alloc() and
wanting to populate POR_EL0 on sigreturn. It will have to find room in
the signal frame, though I don't think that's a problem.

-- 
Catalin



Re: [PATCH v6 RESED 1/2] dma: replace zone_dma_bits by zone_dma_limit

2024-08-12 Thread Catalin Marinas
On Sun, Aug 11, 2024 at 10:09:35AM +0300, Baruch Siach wrote:
> From: Catalin Marinas 
> 
> Hardware DMA limit might not be power of 2. When RAM range starts above
> 0, say 4GB, DMA limit of 30 bits should end at 5GB. A single high bit
> can not encode this limit.
> 
> Use plain address for DMA zone limit.
> 
> Since DMA zone can now potentially span beyond 4GB physical limit of
> DMA32, make sure to use DMA zone for GFP_DMA32 allocations in that case.
> 
> Signed-off-by: Catalin Marinas 
> Co-developed-by: Baruch Siach 
> Signed-off-by: Baruch Siach 

You might want to say that no functional change is expected with this
patch. The patch looks fine.

Reviewed-by: Catalin Marinas 



Re: [PATCH v5 2/3] dma: replace zone_dma_bits by zone_dma_limit

2024-08-08 Thread Catalin Marinas
On Thu, Aug 08, 2024 at 11:35:01AM +0200, Petr Tesařík wrote:
> On Wed, 7 Aug 2024 19:14:58 +0100
> Catalin Marinas  wrote:
> > With ZONE_DMA32, since all the DMA code assumes that ZONE_DMA32 ends at
> > 4GB CPU address, it doesn't really work for such platforms. If there are
> > 32-bit devices with a corresponding CPU address offset, ZONE_DMA32
> > should end at 36GB on Baruch's platform. But to simplify things, we just
> > ignore this on arm64 and make ZONE_DMA32 empty.
> 
> Ah. That makes sense. It also seems to support my theory that Linux
> memory zones are an obsolete concept and should be replaced by a
> different mechanism.

I agree, they are too coarse-grained. From an API perspective, what we
need is an alloc_pages() that takes a DMA mask or phys address limit,
maybe something similar to memblock_alloc_range_nid(). OTOH, an
advantage of the zones is that, by using ZONE_NORMAL as the default, you
keep the lower memory free, and you get free lists per zone. Maybe
with some alternative data structures we could efficiently search free
pages based on phys ranges or bitmasks and get rid of the zones but I
haven't put any thoughts into it.
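
Something along these lines is the kind of interface I mean, purely as a
sketch (the helper name is made up; it only approximates the idea by
mapping an explicit physical limit onto today's coarse zones, roughly
what dma_direct_optimal_gfp_mask() already does):

#include <linux/dma-direct.h>
#include <linux/dma-mapping.h>
#include <linux/gfp.h>

static struct page *alloc_pages_below(gfp_t gfp, unsigned int order,
                                      phys_addr_t limit)
{
        if (IS_ENABLED(CONFIG_ZONE_DMA) && limit <= zone_dma_limit)
                gfp |= GFP_DMA;
        else if (IS_ENABLED(CONFIG_ZONE_DMA32) && limit <= DMA_BIT_MASK(32))
                gfp |= GFP_DMA32;

        return alloc_pages(gfp, order);
}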

We'd still need some boundaries like *_dma_get_max_cpu_address() to at
least allocate a swiotlb buffer that's suitable for all devices.

> > In some cases where we have the device structure we could instead do a
> > dma_to_phys(DMA_BIT_MASK(32)) but not in the two cases above. I guess if
> > we really want to address this properly, we'd need to introduce a
> > zone_dma32_limit that's initialised by the arch code. For arm64, I'm
> > happy with just having an empty ZONE_DMA32 on such platforms.
> 
> The obvious caveat is that zone boundaries are system-wide, but the
> mapping between bus addresses and CPU addresses depends on the device
> structure. After all, that's why dma_to_phys takes the device as a
> parameter... In fact, a system may have multiple busses behind
> different bridges with a different offset applied by each.

Indeed, and as Robin mentioned, the ACPI/DT code already handles this.

> FYI I want to make more people aware of these issues at this year's
> Plumbers, see https://lpc.events/event/18/contributions/1776/

Looking forward to this. I'll dial in, unfortunately can't make Plumbers
in person this year.

In the meantime, I think this series is a good compromise ;).

-- 
Catalin


Re: [PATCH v5 2/3] dma: replace zone_dma_bits by zone_dma_limit

2024-08-07 Thread Catalin Marinas
On Wed, Aug 07, 2024 at 04:19:38PM +0200, Petr Tesařík wrote:
> On Fri, 2 Aug 2024 10:37:38 +0100
> Catalin Marinas  wrote:
> > On Fri, Aug 02, 2024 at 09:03:47AM +0300, Baruch Siach wrote:
> > > diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> > > index 3b4be4ca3b08..62b36fda44c9 100644
> > > --- a/kernel/dma/direct.c
> > > +++ b/kernel/dma/direct.c
> > > @@ -20,7 +20,7 @@
> > >   * it for entirely different regions. In that case the arch code needs to
> > >   * override the variable below for dma-direct to work properly.
> > >   */
> > > -unsigned int zone_dma_bits __ro_after_init = 24;
> > > +u64 zone_dma_limit __ro_after_init = DMA_BIT_MASK(24);  
> > 
> > u64 here makes sense even if it may be larger than phys_addr_t. It
> > matches the phys_limit type in the swiotlb code. The compilers should no
> > longer complain.
> 
> FTR I have never quite understood why phys_limit is u64, but u64 was
> already used all around the place when I first looked into swiotlb.
> 
> > > diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
> > > index d10613eb0f63..7b04f7575796 100644
> > > --- a/kernel/dma/pool.c
> > > +++ b/kernel/dma/pool.c
> > > @@ -70,9 +70,9 @@ static bool cma_in_zone(gfp_t gfp)
> > >   /* CMA can't cross zone boundaries, see cma_activate_area() */
> > >   end = cma_get_base(cma) + size - 1;
> > >   if (IS_ENABLED(CONFIG_ZONE_DMA) && (gfp & GFP_DMA))
> > > - return end <= DMA_BIT_MASK(zone_dma_bits);
> > > + return end <= zone_dma_limit;
> > >   if (IS_ENABLED(CONFIG_ZONE_DMA32) && (gfp & GFP_DMA32))
> > > - return end <= DMA_BIT_MASK(32);
> > > + return end <= max(DMA_BIT_MASK(32), zone_dma_limit);
> > >   return true;
> > >  }
> > >  
> > > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> > > index 043b0ecd3e8d..bb51bd5335ad 100644
> > > --- a/kernel/dma/swiotlb.c
> > > +++ b/kernel/dma/swiotlb.c
> > > @@ -450,9 +450,9 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
> > >   if (!remap)
> > >   io_tlb_default_mem.can_grow = true;
> > >   if (IS_ENABLED(CONFIG_ZONE_DMA) && (gfp_mask & __GFP_DMA))
> > > - io_tlb_default_mem.phys_limit = DMA_BIT_MASK(zone_dma_bits);
> > > + io_tlb_default_mem.phys_limit = zone_dma_limit;
> > >   else if (IS_ENABLED(CONFIG_ZONE_DMA32) && (gfp_mask & __GFP_DMA32))
> > > - io_tlb_default_mem.phys_limit = DMA_BIT_MASK(32);
> > > + io_tlb_default_mem.phys_limit = max(DMA_BIT_MASK(32), 
> > > zone_dma_limit);
> > >   else
> > >   io_tlb_default_mem.phys_limit = virt_to_phys(high_memory - 1);
> > >  #endif  
> > 
> > These two look correct to me now and it's the least intrusive (the
> > alternative would have been a zone_dma32_limit). The arch code, however,
> > needs to ensure that zone_dma_limit can always support 32-bit devices
> > even if it is above 4GB (with the relevant dma offsets in place for such
> > devices).
> 
> Just to make sure, the DMA zone (if present) must map to at most 32-bit
> bus address space (possibly behind a bridge). Is that what you're
> saying?

Not exactly. What I'm trying to say is that on arm64 zone_dma_limit can
go beyond DMA_BIT_MASK(32) when the latter is treated as a CPU address.
In such cases, ZONE_DMA32 is empty.

TBH, this code is confusing and not entirely suitable for a system where
the CPU address offsets are not 0. The device::dma_coherent_mask is
about the bus address range and phys_limit is calculated correctly in
functions like dma_direct_optimal_gfp_mask(). But that's about it w.r.t.
DMA bit masks because zone_dma_bits and DMA_BIT_MASK(32) are assumed to
be about the CPU address ranges in some cases (in other cases
DMA_BIT_MASK() is used to initialise dma_coherent_mask, so more of a bus
address).

On the platform Baruch is trying to fix, RAM starts at 32GB and ZONE_DMA
should end at 33GB. That's a 30-bit mask in bus address terms but
something not a power of two for the CPU address, hence the
zone_dma_limit introduced here.
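
To put numbers on it (values taken from this thread, written out only as
an illustration):

/*
 * RAM starts at 32GB and the limited devices have a 30-bit bus mask
 * behind a +32GB offset, so the DMA-able range ends at ~33GB in CPU
 * address terms.
 */
#define RAM_START       (32ULL << 30)           /* 0x8_0000_0000 */
#define DEV_BUS_MASK    ((1ULL << 30) - 1)      /* 30-bit device limit */
#define DMA_OFFSET      RAM_START               /* bus address 0 == CPU 32GB */

#define ZONE_DMA_LIMIT  (DMA_OFFSET + DEV_BUS_MASK)     /* ~33GB, not a power of two */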

With ZONE_DMA32, since all the DMA code assumes that ZONE_DMA32 ends at
4GB CPU address, it doesn't really work for such platforms. If there are
32-bit devices with a corresponding CPU address offset, ZONE_DMA32
should end at 36GB on Baruch's platform. But to simplify things, we just
ignore this on arm64 and make ZONE_DMA32 empty.

In some cases where we have the device structure we could instead do a
dma_to_phys(DMA_BIT_MASK(32)) but not in the two cases above. I guess if
we really want to address this properly, we'd need to introduce a
zone_dma32_limit that's initialised by the arch code. For arm64, I'm
happy with just having an empty ZONE_DMA32 on such platforms.

-- 
Catalin


Re: [PATCH v12 02/84] KVM: arm64: Disallow copying MTE to guest memory while KVM is dirty logging

2024-08-07 Thread Catalin Marinas
On Fri, Jul 26, 2024 at 04:51:11PM -0700, Sean Christopherson wrote:
> Disallow copying MTE tags to guest memory while KVM is dirty logging, as
> writing guest memory without marking the gfn as dirty in the memslot could
> result in userspace failing to migrate the updated page.  Ideally (maybe?),
> KVM would simply mark the gfn as dirty, but there is no vCPU to work with,
> and presumably the only use case for copy MTE tags _to_ the guest is when
> restoring state on the target.
> 
> Fixes: f0376edb1ddc ("KVM: arm64: Add ioctl to fetch/store tags in a guest")
> Signed-off-by: Sean Christopherson 
> ---
>  arch/arm64/kvm/guest.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
> index e1f0ff08836a..962f985977c2 100644
> --- a/arch/arm64/kvm/guest.c
> +++ b/arch/arm64/kvm/guest.c
> @@ -1045,6 +1045,11 @@ int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
>  
>   mutex_lock(&kvm->slots_lock);
>  
> + if (write && atomic_read(&kvm->nr_memslots_dirty_logging)) {
> + ret = -EBUSY;
> + goto out;
> + }

There are ways to actually log the page dirtying but I don't think
it's worth it. AFAICT, reading the tags still works and that's what's
used during migration (on the VM where dirty tracking takes place).

Reviewed-by: Catalin Marinas 


Re: [PATCH v12 01/84] KVM: arm64: Release pfn, i.e. put page, if copying MTE tags hits ZONE_DEVICE

2024-08-07 Thread Catalin Marinas
On Fri, Jul 26, 2024 at 04:51:10PM -0700, Sean Christopherson wrote:
> Put the page reference acquired by gfn_to_pfn_prot() if
> kvm_vm_ioctl_mte_copy_tags() runs into ZONE_DEVICE memory.  KVM's less-
> than-stellar heuristics for dealing with pfn-mapped memory means that KVM
> can get a page reference to ZONE_DEVICE memory.
> 
> Fixes: f0376edb1ddc ("KVM: arm64: Add ioctl to fetch/store tags in a guest")
> Signed-off-by: Sean Christopherson 
> ---
>  arch/arm64/kvm/guest.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
> index 11098eb7eb44..e1f0ff08836a 100644
> --- a/arch/arm64/kvm/guest.c
> +++ b/arch/arm64/kvm/guest.c
> @@ -1059,6 +1059,7 @@ int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
>   page = pfn_to_online_page(pfn);
>   if (!page) {
>   /* Reject ZONE_DEVICE memory */
> + kvm_release_pfn_clean(pfn);
>   ret = -EFAULT;
>   goto out;
>   }

This patch makes sense irrespective of whether the above pfn is
ZONE_DEVICE memory or not. gfn_to_pfn_prot() increased the page refcount via
GUP, so it must be released before bailing out of this loop.

Reviewed-by: Catalin Marinas 


Re: [PATCH v5 1/3] dma: improve DMA zone selection

2024-08-07 Thread Catalin Marinas
Thanks Robin for having a look.

On Wed, Aug 07, 2024 at 02:13:06PM +0100, Robin Murphy wrote:
> On 2024-08-02 7:03 am, Baruch Siach wrote:
> > When device DMA limit does not fit in DMA32 zone it should use DMA zone,
> > even when DMA zone is stricter than needed.
> > 
> > Same goes for devices that can't allocate from the entire normal zone.
> > Limit to DMA32 in that case.
> 
> Per the bot report this only works for CONFIG_ARCH_KEEP_MEMBLOCK,

Yeah, I just noticed.

> however
> the whole concept looks wrong anyway. The logic here is that we're only
> forcing a particular zone if there's *no* chance of the higher zone being
> usable. For example, ignoring offsets for simplicity, if we have a 40-bit
> DMA mask then we *do* want to initially try allocating from ZONE_NORMAL even
> if max_pfn is above 40 bits, since we still might get a usable allocation
> from between 32 and 40 bits, and if we don't, then we'll fall back to
> retrying from the DMA zone(s) anyway.

Ah, I did not read the code further down in __dma_direct_alloc_pages();
it does fall back to a GFP_DMA allocation if !dma_coherent_ok().
Similarly with swiotlb_alloc_tlb(), it keeps retrying until the
allocation fails.

So yes, this patch can be dropped.

-- 
Catalin


Re: [PATCH v5 2/3] dma: replace zone_dma_bits by zone_dma_limit

2024-08-02 Thread Catalin Marinas
On Fri, Aug 02, 2024 at 09:03:47AM +0300, Baruch Siach wrote:
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 3b4be4ca3b08..62b36fda44c9 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -20,7 +20,7 @@
>   * it for entirely different regions. In that case the arch code needs to
>   * override the variable below for dma-direct to work properly.
>   */
> -unsigned int zone_dma_bits __ro_after_init = 24;
> +u64 zone_dma_limit __ro_after_init = DMA_BIT_MASK(24);

u64 here makes sense even if it may be larger than phys_addr_t. It
matches the phys_limit type in the swiotlb code. The compilers should no
longer complain.

> diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
> index d10613eb0f63..7b04f7575796 100644
> --- a/kernel/dma/pool.c
> +++ b/kernel/dma/pool.c
> @@ -70,9 +70,9 @@ static bool cma_in_zone(gfp_t gfp)
>   /* CMA can't cross zone boundaries, see cma_activate_area() */
>   end = cma_get_base(cma) + size - 1;
>   if (IS_ENABLED(CONFIG_ZONE_DMA) && (gfp & GFP_DMA))
> - return end <= DMA_BIT_MASK(zone_dma_bits);
> + return end <= zone_dma_limit;
>   if (IS_ENABLED(CONFIG_ZONE_DMA32) && (gfp & GFP_DMA32))
> - return end <= DMA_BIT_MASK(32);
> + return end <= max(DMA_BIT_MASK(32), zone_dma_limit);
>   return true;
>  }
>  
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 043b0ecd3e8d..bb51bd5335ad 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -450,9 +450,9 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
>   if (!remap)
>   io_tlb_default_mem.can_grow = true;
>   if (IS_ENABLED(CONFIG_ZONE_DMA) && (gfp_mask & __GFP_DMA))
> - io_tlb_default_mem.phys_limit = DMA_BIT_MASK(zone_dma_bits);
> + io_tlb_default_mem.phys_limit = zone_dma_limit;
>   else if (IS_ENABLED(CONFIG_ZONE_DMA32) && (gfp_mask & __GFP_DMA32))
> - io_tlb_default_mem.phys_limit = DMA_BIT_MASK(32);
> + io_tlb_default_mem.phys_limit = max(DMA_BIT_MASK(32), 
> zone_dma_limit);
>   else
>   io_tlb_default_mem.phys_limit = virt_to_phys(high_memory - 1);
>  #endif

These two look correct to me now and it's the least intrusive (the
alternative would have been a zone_dma32_limit). The arch code, however,
needs to ensure that zone_dma_limit can always support 32-bit devices
even if it is above 4GB (with the relevant dma offsets in place for such
devices).

-- 
Catalin


Re: [PATCH v4 2/2] dma: replace zone_dma_bits by zone_dma_limit

2024-08-01 Thread Catalin Marinas
On Thu, Aug 01, 2024 at 11:25:07AM +0300, Baruch Siach wrote:
> diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
> index d10613eb0f63..a6e15db9d1e7 100644
> --- a/kernel/dma/pool.c
> +++ b/kernel/dma/pool.c
> @@ -70,9 +70,10 @@ static bool cma_in_zone(gfp_t gfp)
>   /* CMA can't cross zone boundaries, see cma_activate_area() */
>   end = cma_get_base(cma) + size - 1;
>   if (IS_ENABLED(CONFIG_ZONE_DMA) && (gfp & GFP_DMA))
> - return end <= DMA_BIT_MASK(zone_dma_bits);
> + return end <= zone_dma_limit;
> + /* Account for possible zone_dma_limit > DMA_BIT_MASK(32) */
>   if (IS_ENABLED(CONFIG_ZONE_DMA32) && (gfp & GFP_DMA32))
> - return end <= DMA_BIT_MASK(32);
> + return end <= DMA_BIT_MASK(32) || end <= zone_dma_limit;
>   return true;
>  }
>  
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 043b0ecd3e8d..53595eb41922 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -450,7 +450,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
>   if (!remap)
>   io_tlb_default_mem.can_grow = true;
>   if (IS_ENABLED(CONFIG_ZONE_DMA) && (gfp_mask & __GFP_DMA))
> - io_tlb_default_mem.phys_limit = DMA_BIT_MASK(zone_dma_bits);
> + io_tlb_default_mem.phys_limit = zone_dma_limit;
>   else if (IS_ENABLED(CONFIG_ZONE_DMA32) && (gfp_mask & __GFP_DMA32))
>   io_tlb_default_mem.phys_limit = DMA_BIT_MASK(32);

I think this needs the same adjustment as the cma_in_zone() case. Maybe just
use max(DMA_BIT_MASK(32), zone_dma_limit) in both cases for consistency.

-- 
Catalin


Re: [PATCH v4 2/2] dma: replace zone_dma_bits by zone_dma_limit

2024-08-01 Thread Catalin Marinas
On Thu, Aug 01, 2024 at 11:25:07AM +0300, Baruch Siach wrote:
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 9b5ab6818f7f..c45e2152ca9e 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -115,35 +115,35 @@ static void __init arch_reserve_crashkernel(void)
>  }
>  
>  /*
> - * Return the maximum physical address for a zone accessible by the given 
> bits
> - * limit. If DRAM starts above 32-bit, expand the zone to the maximum
> + * Return the maximum physical address for a zone given its limit.
> + * If DRAM starts above 32-bit, expand the zone to the maximum
>   * available memory, otherwise cap it at 32-bit.
>   */
> -static phys_addr_t __init max_zone_phys(unsigned int zone_bits)
> +static phys_addr_t __init max_zone_phys(phys_addr_t zone_limit)
>  {
> - phys_addr_t zone_mask = DMA_BIT_MASK(zone_bits);
>   phys_addr_t phys_start = memblock_start_of_DRAM();
>  
>   if (phys_start > U32_MAX)
> - zone_mask = PHYS_ADDR_MAX;
> - else if (phys_start > zone_mask)
> - zone_mask = U32_MAX;
> + zone_limit = PHYS_ADDR_MAX;
> + else if (phys_start > zone_limit)
> + zone_limit = U32_MAX;
>  
> - return min(zone_mask, memblock_end_of_DRAM() - 1) + 1;
> + return min(zone_limit, memblock_end_of_DRAM() - 1) + 1;
>  }

OK, so no functional change here, which is good. But isn't this series
missing some additional patches to limit ZONE_DMA? For your platform, the
above function expands ZONE_DMA to the whole RAM which, IIUC, is not
what you want eventually.

-- 
Catalin


Re: [PATCH v4 1/2] dma: improve DMA zone selection

2024-08-01 Thread Catalin Marinas
On Thu, Aug 01, 2024 at 11:25:06AM +0300, Baruch Siach wrote:
> When device DMA limit does not fit in DMA32 zone it should use DMA zone,
> even when DMA zone is stricter than needed.
> 
> Same goes for devices that can't allocate from the entire normal zone.
> Limit to DMA32 in that case.
> 
> Reported-by: Catalin Marinas 
> Signed-off-by: Baruch Siach 

This looks fine to me now.

Reviewed-by: Catalin Marinas 


Re: [PATCH v3 3/3] dma-direct: use RAM start to offset zone_dma_limit

2024-07-31 Thread Catalin Marinas
On Mon, Jul 29, 2024 at 01:51:26PM +0300, Baruch Siach wrote:
> diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
> index 410a7b40e496..ded3d841c88c 100644
> --- a/kernel/dma/pool.c
> +++ b/kernel/dma/pool.c
> @@ -12,6 +12,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  static struct gen_pool *atomic_pool_dma __ro_after_init;
>  static unsigned long pool_size_dma;
> @@ -70,7 +71,7 @@ static bool cma_in_zone(gfp_t gfp)
>   /* CMA can't cross zone boundaries, see cma_activate_area() */
>   end = cma_get_base(cma) + size - 1;
>   if (IS_ENABLED(CONFIG_ZONE_DMA) && (gfp & GFP_DMA))
> - return end <= zone_dma_limit;
> + return end <= memblock_start_of_DRAM() + zone_dma_limit;

I think this patch is entirely wrong. After the previous patch,
zone_dma_limit is already a physical/CPU address, not some offset or
range - of_dma_get_max_cpu_address() returns the absolute physical
address. Adding memblock_start_of_DRAM() to it does not make any sense.
It made sense when we had zone_dma_bits but since we are trying to move
away from bitmasks to absolute CPU addresses, zone_dma_limit already
includes the start of DRAM.

What problems do you see without this patch? Is it because
DMA_BIT_MASK(32) can be lower than zone_dma_limit as I mentioned on the
previous patch?

-- 
Catalin


Re: [PATCH v3 2/3] dma-mapping: replace zone_dma_bits by zone_dma_limit

2024-07-31 Thread Catalin Marinas
On Mon, Jul 29, 2024 at 01:51:25PM +0300, Baruch Siach wrote:
> From: Catalin Marinas 
> 
> Hardware DMA limit might not be power of 2. When RAM range starts above
> 0, say 4GB, DMA limit of 30 bits should end at 5GB. A single high bit
> can not encode this limit.
> 
> Use direct phys_addr_t limit address for DMA zone limit.
> 
> Signed-off-by: Catalin Marinas 
> Signed-off-by: Baruch Siach 

You should add your Co-developed-by line; the patch evolved a bit from
my initial partial diff.

> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 9b5ab6818f7f..870fd967c610 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -114,36 +114,28 @@ static void __init arch_reserve_crashkernel(void)
>   low_size, high);
>  }
>  
> -/*
> - * Return the maximum physical address for a zone accessible by the given 
> bits
> - * limit. If DRAM starts above 32-bit, expand the zone to the maximum
> - * available memory, otherwise cap it at 32-bit.
> - */
> -static phys_addr_t __init max_zone_phys(unsigned int zone_bits)
> +static phys_addr_t __init max_zone_phys(phys_addr_t zone_limit)
>  {
> - phys_addr_t zone_mask = DMA_BIT_MASK(zone_bits);
> - phys_addr_t phys_start = memblock_start_of_DRAM();
> -
> - if (phys_start > U32_MAX)
> - zone_mask = PHYS_ADDR_MAX;
> - else if (phys_start > zone_mask)
> - zone_mask = U32_MAX;
> + /* We have RAM in low 32-bit area, keep DMA zone there */
> + if (memblock_start_of_DRAM() < U32_MAX)
> + zone_limit = min(U32_MAX, zone_limit);

Does this matter anymore? Or is it to keep ZONE_DMA below (or equal to)
the ZONE_DMA32 limit?

Anyway, since this patch is about replacing zone_dma_bits with
zone_dma_limit, we should not introduce functional changes. AFAICT, we
need zone_limit to be set to memblock_end_of_DRAM() when phys_start is
above U32_MAX. You can do the functional change in a subsequent patch
once all the other refactoring has been handled.

>  
> - return min(zone_mask, memblock_end_of_DRAM() - 1) + 1;
> + return min(zone_limit, memblock_end_of_DRAM() - 1) + 1;
>  }
[...]
> diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
> index edbe13d00776..98b7d8015043 100644
> --- a/include/linux/dma-direct.h
> +++ b/include/linux/dma-direct.h
> @@ -12,7 +12,7 @@
>  #include 
>  #include 
>  
> -extern unsigned int zone_dma_bits;
> +extern phys_addr_t zone_dma_limit;
>  
>  /*
>   * Record the mapping of CPU physical to DMA addresses for a given region.
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 3b4be4ca3b08..3dbc0b89d6fb 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -20,7 +20,7 @@
>   * it for entirely different regions. In that case the arch code needs to
>   * override the variable below for dma-direct to work properly.
>   */
> -unsigned int zone_dma_bits __ro_after_init = 24;
> +phys_addr_t zone_dma_limit __ro_after_init = DMA_BIT_MASK(24);
>  
>  static inline dma_addr_t phys_to_dma_direct(struct device *dev,
>   phys_addr_t phys)
> @@ -580,7 +580,7 @@ int dma_direct_supported(struct device *dev, u64 mask)
>* part of the check.
>*/
>   if (IS_ENABLED(CONFIG_ZONE_DMA))
> - min_mask = min_t(u64, min_mask, DMA_BIT_MASK(zone_dma_bits));
> + min_mask = min_t(u64, min_mask, zone_dma_limit);
>   return mask >= phys_to_dma_unencrypted(dev, min_mask);
>  }
>  
> diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
> index d10613eb0f63..410a7b40e496 100644
> --- a/kernel/dma/pool.c
> +++ b/kernel/dma/pool.c
> @@ -70,7 +70,7 @@ static bool cma_in_zone(gfp_t gfp)
>   /* CMA can't cross zone boundaries, see cma_activate_area() */
>   end = cma_get_base(cma) + size - 1;
>   if (IS_ENABLED(CONFIG_ZONE_DMA) && (gfp & GFP_DMA))
> - return end <= DMA_BIT_MASK(zone_dma_bits);
> + return end <= zone_dma_limit;
>   if (IS_ENABLED(CONFIG_ZONE_DMA32) && (gfp & GFP_DMA32))
>   return end <= DMA_BIT_MASK(32);
>   return true;

I haven't got to the third patch yet but with this series we can have
zone_dma_limit above DMA_BIT_MASK(32). The above function could return
false for GFP_DMA32 when 'end' is perfectly valid within ZONE_DMA (which
implies safe for GFP_DMA32).

> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index df68d29740a0..dfd83e5ee0b3 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -450,7 +450,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
>   if (!remap)
>   io_tlb_default_mem.can_grow = 

Re: [PATCH RFC v2 2/5] of: get dma area lower limit

2024-07-30 Thread Catalin Marinas
On Thu, Jul 25, 2024 at 02:49:01PM +0300, Baruch Siach wrote:
> Hi Catalin,
> 
> On Tue, Jun 18 2024, Catalin Marinas wrote:
> > On Tue, Apr 09, 2024 at 09:17:55AM +0300, Baruch Siach wrote:
> >> of_dma_get_max_cpu_address() returns the highest CPU address that
> >> devices can use for DMA. The implicit assumption is that all CPU
> >> addresses below that limit are suitable for DMA. However the
> >> 'dma-ranges' property this code uses also encodes a lower limit for DMA
> >> that is potentially non zero.
> >> 
> >> Rename to of_dma_get_cpu_limits(), and extend to retrieve also the lower
> >> limit for the same 'dma-ranges' property describing the high limit.
> >
> > I don't understand the reason for the lower limit. The way the Linux
> > zones work is that ZONE_DMA always starts from the start of the RAM. It
> > doesn't matter whether it's 0 or not, you'd not allocate below the start
> > of RAM anyway. If you have a device that cannot use the bottom of the
> > RAM, it is pretty broken and not supported by Linux.
> 
> I won't argue with that assertion. My target system RAM happens to start
> at the lower end of the devices' DMA zone, so I'm fine with skipping
> this patch.
> 
> Just curious. What is the inherent limitation that prevents Linux from
> supporting DMA zone with lower limit above RAM start?

It's the way the zone allocation fallback mechanism works. Let's say a
ZONE_DMA32 allocation fails: it falls back to ZONE_DMA, which is supposed
to be compatible with the GFP_DMA32 request. If you have some other zone
below ZONE_DMA, it should also be compatible with GFP_DMA allocations.
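
A simplified way to picture it (not the real allocator code, just the
constraint it imposes; assumes both DMA zones are configured):

#include <linux/mmzone.h>

/*
 * Allocations fall back from their preferred zone downwards, so every
 * lower zone must satisfy the original request. Anything sitting below
 * ZONE_DMA would therefore have to be usable for GFP_DMA (and
 * GFP_DMA32) allocations as well.
 */
static const enum zone_type fallback_order[] = {
        ZONE_NORMAL,    /* GFP_KERNEL starts here */
        ZONE_DMA32,     /* GFP_DMA32 starts here */
        ZONE_DMA,       /* both may end up here */
};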

-- 
Catalin


Re: [PATCH v4 17/29] arm64: implement PKEYS support

2024-07-08 Thread Catalin Marinas
Hi Szabolcs,

On Mon, Jun 17, 2024 at 03:51:35PM +0100, Szabolcs Nagy wrote:
> The 06/17/2024 15:40, Florian Weimer wrote:
> > >> A user can still set it by interacting with the register directly, but I 
> > >> guess
> > >> we want something for the glibc interface..
> > >> 
> > >> Dave, any thoughts here?
> > >
> > > adding Florian too, since i found an old thread of his that tried
> > > to add separate PKEY_DISABLE_READ and PKEY_DISABLE_EXECUTE, but
> > > it did not seem to end up upstream. (this makes more sense to me
> > > as libc api than the weird disable access semantics)
> > 
> > I still think it makes sense to have a full complement of PKEY_* flags
> > complementing the PROT_* flags, in a somewhat abstract fashion for
> > pkey_alloc only.  The internal protection mask register encoding will
> > differ from architecture to architecture, but the abstract glibc
> > functions pkey_set and pkey_get could use them (if we are a bit
> > careful).
> 
> to me it makes sense to have abstract
> 
> PKEY_DISABLE_READ
> PKEY_DISABLE_WRITE
> PKEY_DISABLE_EXECUTE
> PKEY_DISABLE_ACCESS
> 
> where access is handled like
> 
> if (flags&PKEY_DISABLE_ACCESS)
>   flags |= PKEY_DISABLE_READ|PKEY_DISABLE_WRITE;
> disable_read = flags&PKEY_DISABLE_READ;
> disable_write = flags&PKEY_DISABLE_WRITE;
> disable_exec = flags&PKEY_DISABLE_EXECUTE;
> 
> if there are unsupported combinations like
> disable_read&&!disable_write then those are rejected
> by pkey_alloc and pkey_set.
> 
> this allows portable use of pkey apis.
> (the flags could be target specific, but don't have to be)

On powerpc, PKEY_DISABLE_ACCESS also disables execution. AFAICT, the
kernel doesn't define a PKEY_DISABLE_READ, only PKEY_DISABLE_ACCESS, so
for powerpc there's no way to set an execute-only permission via this
interface. I wouldn't like to diverge from powerpc.

However, does it matter much? That's only for the initial setup, the
user can then change the permissions directly via the sysreg. So maybe
we don't need all those combinations upfront. A PKEY_DISABLE_EXECUTE
together with the full PKEY_DISABLE_ACCESS would probably suffice.

Given that on x86 the PKEY_ACCESS_MASK will have to stay as
PKEY_DISABLE_ACCESS|PKEY_DISABLE_WRITE, we'll probably do the same as
powerpc and define an arm64 specific PKEY_DISABLE_EXECUTE with the
corresponding PKEY_ACCESS_MASK including it. We can generalise the masks
with some ARCH_HAS_PKEY_DISABLE_EXECUTE but it's probably more hassle
than just defining the arm64 PKEY_DISABLE_EXECUTE.
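
Concretely, I'd expect the arm64 uapi addition to look something like
the sketch below (the value simply mirrors powerpc; exact numbers to be
decided when the patch is written):

/* arch/arm64/include/uapi/asm/mman.h, sketch only */
#define PKEY_DISABLE_EXECUTE    0x4

#undef  PKEY_ACCESS_MASK
#define PKEY_ACCESS_MASK        (PKEY_DISABLE_ACCESS | \
                                 PKEY_DISABLE_WRITE | \
                                 PKEY_DISABLE_EXECUTE)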

I assume you'd like PKEY_DISABLE_EXECUTE to be part of this series,
otherwise changing PKEY_ACCESS_MASK later will cause potential ABI
issues.

-- 
Catalin


Re: [PATCH v4 13/29] arm64: convert protection key into vm_flags and pgprot values

2024-07-08 Thread Catalin Marinas
On Thu, Jul 04, 2024 at 01:47:04PM +0100, Joey Gouly wrote:
> On Wed, Jun 19, 2024 at 05:45:29PM +0100, Catalin Marinas wrote:
> > On Tue, May 28, 2024 at 12:24:57PM +0530, Amit Daniel Kachhap wrote:
> > > On 5/3/24 18:31, Joey Gouly wrote:
> > > > diff --git a/arch/arm64/include/asm/mman.h 
> > > > b/arch/arm64/include/asm/mman.h
> > > > index 5966ee4a6154..ecb2d18dc4d7 100644
> > > > --- a/arch/arm64/include/asm/mman.h
> > > > +++ b/arch/arm64/include/asm/mman.h
> > > > @@ -7,7 +7,7 @@
> > > >   #include 
> > > >   static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
> > > > -   unsigned long pkey __always_unused)
> > > > +   unsigned long pkey)
> > > >   {
> > > > unsigned long ret = 0;
> > > > @@ -17,6 +17,12 @@ static inline unsigned long 
> > > > arch_calc_vm_prot_bits(unsigned long prot,
> > > > if (system_supports_mte() && (prot & PROT_MTE))
> > > > ret |= VM_MTE;
> > > > +#if defined(CONFIG_ARCH_HAS_PKEYS)
> > > 
> > > Should there be system_supports_poe() check like above?
> > 
> > I think it should, otherwise we end up with these bits in the pte even
> > when POE is not supported.
> 
> I think it can't get here due to the flow of the code, but I will add it to be
> defensive (since it's just an alternative that gets patched).

You are probably right, the mprotect_pkey() will reject the call if we
don't support POE. So you could add a comment instead (but a
system_supports_poe() check seems safer).

> I still need the defined(CONFIG_ARCH_HAS_PKEYS) check, since the VM_PKEY_BIT*
> are only defined then.

Yes, the ifdef will stay.

-- 
Catalin


Re: [PATCH v4 22/29] arm64: add Permission Overlay Extension Kconfig

2024-07-05 Thread Catalin Marinas
On Fri, May 03, 2024 at 02:01:40PM +0100, Joey Gouly wrote:
> Now that support for POE and Protection Keys has been implemented, add a
> config to allow users to actually enable it.
> 
> Signed-off-by: Joey Gouly 
> Cc: Catalin Marinas 
> Cc: Will Deacon 

Acked-by: Catalin Marinas 


Re: [PATCH v4 18/29] arm64: add POE signal support

2024-07-05 Thread Catalin Marinas
On Fri, May 03, 2024 at 02:01:36PM +0100, Joey Gouly wrote:
> Add PKEY support to signals, by saving and restoring POR_EL0 from the 
> stackframe.
> 
> Signed-off-by: Joey Gouly 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Reviewed-by: Mark Brown 
> Acked-by: Szabolcs Nagy 

Reviewed-by: Catalin Marinas 


Re: [PATCH v4 17/29] arm64: implement PKEYS support

2024-07-05 Thread Catalin Marinas
On Fri, May 03, 2024 at 02:01:35PM +0100, Joey Gouly wrote:
> @@ -163,7 +182,8 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
>  #define pte_access_permitted_no_overlay(pte, write) \
>   (((pte_val(pte) & (PTE_VALID | PTE_USER)) == (PTE_VALID | PTE_USER)) && 
> (!(write) || pte_write(pte)))
>  #define pte_access_permitted(pte, write) \
> - pte_access_permitted_no_overlay(pte, write)
> + (pte_access_permitted_no_overlay(pte, write) && \
> + por_el0_allows_pkey(FIELD_GET(PTE_PO_IDX_MASK, pte_val(pte)), write, 
> false))

I'm still not entirely convinced on checking the keys during fast GUP
but that's what x86 and powerpc do already, so I guess we'll follow the
same ABI.

Reviewed-by: Catalin Marinas 


Re: [PATCH v4 20/29] arm64: enable POE and PIE to coexist

2024-06-21 Thread Catalin Marinas
On Fri, May 03, 2024 at 02:01:38PM +0100, Joey Gouly wrote:
> Set the EL0/userspace indirection encodings to be the overlay enabled
> variants of the permissions.
> 
> Signed-off-by: Joey Gouly 
> Cc: Catalin Marinas 
> Cc: Will Deacon 

Reviewed-by: Catalin Marinas 


Re: [PATCH v4 16/29] arm64: add pte_access_permitted_no_overlay()

2024-06-21 Thread Catalin Marinas
On Fri, May 03, 2024 at 02:01:34PM +0100, Joey Gouly wrote:
> We do not want take POE into account when clearing the MTE tags.
> 
> Signed-off-by: Joey Gouly 
> Cc: Catalin Marinas 
> Cc: Will Deacon 

Reviewed-by: Catalin Marinas 


Re: [PATCH v4 06/29] arm64: context switch POR_EL0 register

2024-06-21 Thread Catalin Marinas
On Fri, May 03, 2024 at 02:01:24PM +0100, Joey Gouly wrote:
> +static void flush_poe(void)
> +{
> + if (!system_supports_poe())
> + return;
> +
> + write_sysreg_s(POR_EL0_INIT, SYS_POR_EL0);
> + /* ISB required for kernel uaccess routines when chaning POR_EL0 */

Nit: s/chaning/changing/

-- 
Catalin


Re: [PATCH v4 12/29] arm64: add POIndex defines

2024-06-21 Thread Catalin Marinas
On Fri, May 03, 2024 at 02:01:30PM +0100, Joey Gouly wrote:
> The 3-bit POIndex is stored in the PTE at bits 60..62.
> 
> Signed-off-by: Joey Gouly 
> Cc: Catalin Marinas 
> Cc: Will Deacon 

Acked-by: Catalin Marinas 


Re: [PATCH v4 11/29] arm64: re-order MTE VM_ flags

2024-06-21 Thread Catalin Marinas
On Fri, May 03, 2024 at 02:01:29PM +0100, Joey Gouly wrote:
> To make it easier to share the generic PKEYs flags, move the MTE flag.
> 
> Signed-off-by: Joey Gouly 
> Cc: Catalin Marinas 
> Cc: Will Deacon 

Acked-by: Catalin Marinas 


Re: [PATCH v4 10/29] arm64: enable the Permission Overlay Extension for EL0

2024-06-21 Thread Catalin Marinas
On Fri, May 03, 2024 at 02:01:28PM +0100, Joey Gouly wrote:
> Expose a HWCAP and ID_AA64MMFR3_EL1_S1POE to userspace, so they can be used to
> check if the CPU supports the feature.
> 
> Signed-off-by: Joey Gouly 
> Cc: Catalin Marinas 
> Cc: Will Deacon 

Reviewed-by: Catalin Marinas 


Re: [PATCH v4 06/29] arm64: context switch POR_EL0 register

2024-06-21 Thread Catalin Marinas
On Fri, May 03, 2024 at 02:01:24PM +0100, Joey Gouly wrote:
> POR_EL0 is a register that can be modified by userspace directly,
> so it must be context switched.
> 
> Signed-off-by: Joey Gouly 
> Cc: Catalin Marinas 
> Cc: Will Deacon 

Reviewed-by: Catalin Marinas 


Re: [PATCH v4 05/29] arm64: cpufeature: add Permission Overlay Extension cpucap

2024-06-21 Thread Catalin Marinas
On Fri, Jun 21, 2024 at 06:01:52PM +0100, Catalin Marinas wrote:
> On Fri, May 03, 2024 at 02:01:23PM +0100, Joey Gouly wrote:
> > This indicates if the system supports POE. This is a CPUCAP_BOOT_CPU_FEATURE
> > as the boot CPU will enable POE if it has it, so secondary CPUs must also
> > have this feature.
> > 
> > Signed-off-by: Joey Gouly 
> > Cc: Catalin Marinas 
> > Cc: Will Deacon 
> 
> Acked-by: Catalin Marinas 

One ack is sufficient, ignore this one ;)

-- 
Catalin


Re: [PATCH v4 05/29] arm64: cpufeature: add Permission Overlay Extension cpucap

2024-06-21 Thread Catalin Marinas
On Fri, May 03, 2024 at 02:01:23PM +0100, Joey Gouly wrote:
> This indicates if the system supports POE. This is a CPUCAP_BOOT_CPU_FEATURE
> as the boot CPU will enable POE if it has it, so secondary CPUs must also
> have this feature.
> 
> Signed-off-by: Joey Gouly 
> Cc: Catalin Marinas 
> Cc: Will Deacon 

Acked-by: Catalin Marinas 


Re: [PATCH v4 05/29] arm64: cpufeature: add Permission Overlay Extension cpucap

2024-06-21 Thread Catalin Marinas
On Fri, May 03, 2024 at 02:01:23PM +0100, Joey Gouly wrote:
> This indicates if the system supports POE. This is a CPUCAP_BOOT_CPU_FEATURE
> as the boot CPU will enable POE if it has it, so secondary CPUs must also
> have this feature.
> 
> Signed-off-by: Joey Gouly 
> Cc: Catalin Marinas 
> Cc: Will Deacon 

Adding some acks otherwise I'll forget what I reviewed.

Acked-by: Catalin Marinas 


Re: [PATCH v4 15/29] arm64: handle PKEY/POE faults

2024-06-21 Thread Catalin Marinas
On Fri, May 03, 2024 at 02:01:33PM +0100, Joey Gouly wrote:
> @@ -529,6 +547,8 @@ static int __kprobes do_page_fault(unsigned long far, 
> unsigned long esr,
>   unsigned int mm_flags = FAULT_FLAG_DEFAULT;
>   unsigned long addr = untagged_addr(far);
>   struct vm_area_struct *vma;
> + bool pkey_fault = false;
> + int pkey = -1;
>  
>   if (kprobe_page_fault(regs, esr))
>   return 0;
> @@ -590,6 +610,12 @@ static int __kprobes do_page_fault(unsigned long far, 
> unsigned long esr,
>   vma_end_read(vma);
>   goto lock_mmap;
>   }
> +
> + if (fault_from_pkey(esr, vma, mm_flags)) {
> + vma_end_read(vma);
> + goto lock_mmap;
> + }
> +
>   fault = handle_mm_fault(vma, addr, mm_flags | FAULT_FLAG_VMA_LOCK, 
> regs);
>   if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
>   vma_end_read(vma);
> @@ -617,6 +643,11 @@ static int __kprobes do_page_fault(unsigned long far, 
> unsigned long esr,
>   goto done;
>   }
>  
> + if (fault_from_pkey(esr, vma, mm_flags)) {
> + pkey_fault = true;
> + pkey = vma_pkey(vma);
> + }

I was wondering if we actually need to test this again. We know the
fault was from a pkey from the check above, but I guess it matches what we do
with the vma->vm_flags check in case it races with some mprotect() call.

> +
>   fault = __do_page_fault(mm, vma, addr, mm_flags, vm_flags, regs);

You'll need to rebase this on 6.10-rcX since this function disappeared.

Otherwise the patch looks fine.

Reviewed-by: Catalin Marinas 


Re: [PATCH v4 13/29] arm64: convert protection key into vm_flags and pgprot values

2024-06-19 Thread Catalin Marinas
On Tue, May 28, 2024 at 12:24:57PM +0530, Amit Daniel Kachhap wrote:
> On 5/3/24 18:31, Joey Gouly wrote:
> > diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h
> > index 5966ee4a6154..ecb2d18dc4d7 100644
> > --- a/arch/arm64/include/asm/mman.h
> > +++ b/arch/arm64/include/asm/mman.h
> > @@ -7,7 +7,7 @@
> >   #include 
> >   static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
> > -   unsigned long pkey __always_unused)
> > +   unsigned long pkey)
> >   {
> > unsigned long ret = 0;
> > @@ -17,6 +17,12 @@ static inline unsigned long 
> > arch_calc_vm_prot_bits(unsigned long prot,
> > if (system_supports_mte() && (prot & PROT_MTE))
> > ret |= VM_MTE;
> > +#if defined(CONFIG_ARCH_HAS_PKEYS)
> 
> Should there be system_supports_poe() check like above?

I think it should, otherwise we end up with these bits in the pte even
when POE is not supported.
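
Something along these lines, roughly (untested, only to illustrate the
suggested system_supports_poe() guard):

#if defined(CONFIG_ARCH_HAS_PKEYS)
	/* Only set the VM_PKEY_* bits when the hardware has POE. */
	if (system_supports_poe()) {
		ret |= pkey & 0x1 ? VM_PKEY_BIT0 : 0;
		ret |= pkey & 0x2 ? VM_PKEY_BIT1 : 0;
		ret |= pkey & 0x4 ? VM_PKEY_BIT2 : 0;
	}
#endif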

> > +   ret |= pkey & 0x1 ? VM_PKEY_BIT0 : 0;
> > +   ret |= pkey & 0x2 ? VM_PKEY_BIT1 : 0;
> > +   ret |= pkey & 0x4 ? VM_PKEY_BIT2 : 0;
> > +#endif
> > +
> > return ret;
> >   }
> >   #define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, 
> > pkey)

-- 
Catalin


Re: [PATCH RFC v2 4/5] dma-direct: add base offset to zone_dma_bits

2024-06-18 Thread Catalin Marinas
On Tue, Apr 09, 2024 at 09:17:57AM +0300, Baruch Siach wrote:
> Current code using zone_dma_bits assume that all addresses range in the
> bits mask are suitable for DMA. For some existing platforms this
> assumption is not correct. DMA range might have non zero lower limit.
[...]
> @@ -59,7 +60,7 @@ static gfp_t dma_direct_optimal_gfp_mask(struct device 
> *dev, u64 *phys_limit)
>* zones.
>*/
>   *phys_limit = dma_to_phys(dev, dma_limit);
> - if (*phys_limit <= zone_dma_limit)
> + if (*phys_limit <= zone_dma_base + zone_dma_limit)
>   return GFP_DMA;
>   if (*phys_limit <= DMA_BIT_MASK(32))
>   return GFP_DMA32;

As I said previously, we no longer have zone_dma_bits after the first
patch, so adding this limit no longer makes sense. In v1, you wanted a
limit like 32G to be added to the 30-bit zone_dma_bits to give you a 33G
upper limit for ZONE_DMA. But since the first patch sets zone_dma_limit
to 33G already, this is no longer needed.

-- 
Catalin


Re: [PATCH RFC v2 2/5] of: get dma area lower limit

2024-06-18 Thread Catalin Marinas
On Tue, Apr 09, 2024 at 09:17:55AM +0300, Baruch Siach wrote:
> of_dma_get_max_cpu_address() returns the highest CPU address that
> devices can use for DMA. The implicit assumption is that all CPU
> addresses below that limit are suitable for DMA. However the
> 'dma-ranges' property this code uses also encodes a lower limit for DMA
> that is potentially non zero.
> 
> Rename to of_dma_get_cpu_limits(), and extend to retrieve also the lower
> limit for the same 'dma-ranges' property describing the high limit.

I don't understand the reason for the lower limit. The way the Linux
zones work is that ZONE_DMA always starts from the start of the RAM. It
doesn't matter whether it's 0 or not, you'd not allocate below the start
of RAM anyway. If you have a device that cannot use the bottom of the
RAM, it is pretty broken and not supported by Linux.

I think you added this limit before we tried to move away from
zone_dma_bits to a non-power-of-two limit (zone_dma_limit). With the
latter, we no longer need tricks with the lower limit,
of_dma_get_max_cpu_address() should capture the smallest upper CPU
address limit supported by all devices (and that's where ZONE_DMA should
end).

-- 
Catalin


Re: [PATCH RFC v2 1/5] dma-mapping: replace zone_dma_bits by zone_dma_limit

2024-06-18 Thread Catalin Marinas
(finally getting around to looking at this series, sorry for the delay)

On Tue, Apr 09, 2024 at 09:17:54AM +0300, Baruch Siach wrote:
> From: Catalin Marinas 
> 
> Hardware DMA limit might not be power of 2. When RAM range starts above
> 0, say 4GB, DMA limit of 30 bits should end at 5GB. A single high bit
> can not encode this limit.
> 
> Use direct phys_addr_t limit address for DMA zone limit.
> 
> Following commits will add explicit base address to DMA zone.
> 
> ---
> Catalin,
> 
> This is taken almost verbatim from your email:
> 
>   https://lore.kernel.org/all/zz2hnhjv3gdzu...@arm.com/
> 
> Would you provide your sign-off?

Signed-off-by: Catalin Marinas 

Thanks for writing a commit log. However, I think more work is needed.
See below.

> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 03efd86dce0a..00508c69ca9e 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -113,36 +113,24 @@ static void __init arch_reserve_crashkernel(void)
>   low_size, high);
>  }
>  
> -/*
> - * Return the maximum physical address for a zone accessible by the given 
> bits
> - * limit. If DRAM starts above 32-bit, expand the zone to the maximum
> - * available memory, otherwise cap it at 32-bit.
> - */
> -static phys_addr_t __init max_zone_phys(unsigned int zone_bits)
> +static phys_addr_t __init max_zone_phys(phys_addr_t zone_limit)
>  {
> - phys_addr_t zone_mask = DMA_BIT_MASK(zone_bits);
> - phys_addr_t phys_start = memblock_start_of_DRAM();
> -
> - if (phys_start > U32_MAX)
> - zone_mask = PHYS_ADDR_MAX;
> - else if (phys_start > zone_mask)
> - zone_mask = U32_MAX;
> -
> - return min(zone_mask, memblock_end_of_DRAM() - 1) + 1;
> + return min(zone_limit, memblock_end_of_DRAM() - 1) + 1;
>  }
>  
>  static void __init zone_sizes_init(void)
>  {
>   unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};
> - unsigned int __maybe_unused acpi_zone_dma_bits;
> - unsigned int __maybe_unused dt_zone_dma_bits;
> - phys_addr_t __maybe_unused dma32_phys_limit = max_zone_phys(32);
> + phys_addr_t __maybe_unused acpi_zone_dma_limit;
> + phys_addr_t __maybe_unused dt_zone_dma_limit;
> + phys_addr_t __maybe_unused dma32_phys_limit =
> + max_zone_phys(DMA_BIT_MASK(32));
>  
>  #ifdef CONFIG_ZONE_DMA
> - acpi_zone_dma_bits = fls64(acpi_iort_dma_get_max_cpu_address());
> - dt_zone_dma_bits = fls64(of_dma_get_max_cpu_address(NULL));
> - zone_dma_bits = min3(32U, dt_zone_dma_bits, acpi_zone_dma_bits);
> - arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
> + acpi_zone_dma_limit = acpi_iort_dma_get_max_cpu_address();
> + dt_zone_dma_limit = of_dma_get_max_cpu_address(NULL);
> + zone_dma_limit = min(dt_zone_dma_limit, acpi_zone_dma_limit);
> + arm64_dma_phys_limit = max_zone_phys(zone_dma_limit);
>   max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
>  #endif
>  #ifdef CONFIG_ZONE_DMA32

I think this goes wrong if zone_dma_limit ends up above 32-bit (e.g. no
restrictive dma-ranges properties) but the start of RAM is below 4G.
We'd simply reduce ZONE_DMA32 to zero, with ZONE_DMA potentially covering
the whole RAM. Prior to this change, we capped zone_dma_bits to 32 via
min3(). I think we should maintain this cap if memblock_start_of_DRAM()
is below 4G.

We could fix this up in max_zone_phys() above:

if (memblock_start_of_DRAM() < U32_MAX)
zone_limit = min(U32_MAX, zone_limit);

return min(zone_limit, memblock_end_of_DRAM() - 1) + 1;
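
For completeness, a quick sketch of how the whole helper might look with
that cap folded in (untested; min_t() used here only to keep the mixed
phys_addr_t/U32_MAX comparison tidy):

static phys_addr_t __init max_zone_phys(phys_addr_t zone_limit)
{
	/*
	 * Keep ZONE_DMA within 32 bits when RAM starts below 4GB so that
	 * ZONE_DMA32 is not reduced to zero.
	 */
	if (memblock_start_of_DRAM() < U32_MAX)
		zone_limit = min_t(phys_addr_t, zone_limit, U32_MAX);

	return min(zone_limit, memblock_end_of_DRAM() - 1) + 1;
}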

> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 4d543b1e9d57..3b2ebcd4f576 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -20,7 +20,7 @@
>   * it for entirely different regions. In that case the arch code needs to
>   * override the variable below for dma-direct to work properly.
>   */
> -unsigned int zone_dma_bits __ro_after_init = 24;
> +phys_addr_t zone_dma_limit __ro_after_init = DMA_BIT_MASK(24);
>  
>  static inline dma_addr_t phys_to_dma_direct(struct device *dev,
>   phys_addr_t phys)
> @@ -59,7 +59,7 @@ static gfp_t dma_direct_optimal_gfp_mask(struct device 
> *dev, u64 *phys_limit)
>* zones.
>*/
>   *phys_limit = dma_to_phys(dev, dma_limit);
> - if (*phys_limit <= DMA_BIT_MASK(zone_dma_bits))
> + if (*phys_limit <= zone_dma_limit)
>   return GFP_DMA;
>   if (*phys_limit <= DMA_BIT_MASK(32))
>   return GFP_DMA32;

It's worth noting that if ZONE_DMA ends up entirely above 32-bit, there
won't be any ZONE_DMA32. Thinking about it, this could be a potential
problem. For

Re: [PATCH v2 2/7] arm64: mm: accelerate pagefault when VM_FAULT_BADACCESS

2024-04-09 Thread Catalin Marinas
On Wed, Apr 03, 2024 at 04:38:00PM +0800, Kefeng Wang wrote:
> The vm_flags of vma already checked under per-VMA lock, if it is a
> bad access, directly set fault to VM_FAULT_BADACCESS and handle error,
> no need to retry with mmap_lock again, the latency time reduces 34% in
> 'lat_sig -P 1 prot lat_sig' from lmbench testcase.
> 
> Since the page faut is handled under per-VMA lock, count it as a vma lock
> event with VMA_LOCK_SUCCESS.
> 
> Reviewed-by: Suren Baghdasaryan 
> Signed-off-by: Kefeng Wang 

Reviewed-by: Catalin Marinas 


Re: [PATCH v2 1/7] arm64: mm: cleanup __do_page_fault()

2024-04-09 Thread Catalin Marinas
On Wed, Apr 03, 2024 at 04:37:59PM +0800, Kefeng Wang wrote:
> The __do_page_fault() only calls handle_mm_fault() after vm_flags
> checked, and it is only called by do_page_fault(), let's squash
> it into do_page_fault() to cleanup code.
> 
> Reviewed-by: Suren Baghdasaryan 
> Signed-off-by: Kefeng Wang 

As I reviewed v1 and the changes are minimal:

Reviewed-by: Catalin Marinas 


Re: [PATCH 2/7] arm64: mm: accelerate pagefault when VM_FAULT_BADACCESS

2024-04-03 Thread Catalin Marinas
On Tue, Apr 02, 2024 at 03:51:37PM +0800, Kefeng Wang wrote:
> The vm_flags of vma already checked under per-VMA lock, if it is a
> bad access, directly set fault to VM_FAULT_BADACCESS and handle error,
> no need to lock_mm_and_find_vma() and check vm_flags again, the latency
> time reduce 34% in lmbench 'lat_sig -P 1 prot lat_sig'.
> 
> Signed-off-by: Kefeng Wang 
> ---
>  arch/arm64/mm/fault.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 9bb9f395351a..405f9aa831bd 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -572,7 +572,9 @@ static int __kprobes do_page_fault(unsigned long far, 
> unsigned long esr,
>  
>   if (!(vma->vm_flags & vm_flags)) {
>   vma_end_read(vma);
> - goto lock_mmap;
> + fault = VM_FAULT_BADACCESS;
> + count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
> + goto done;
>   }
>   fault = handle_mm_fault(vma, addr, mm_flags | FAULT_FLAG_VMA_LOCK, 
> regs);
>   if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))

I think this makes sense. A concurrent modification of vma->vm_flags
(e.g. mprotect()) would do a vma_start_write(), so no need to recheck
with the mmap lock held.

Reviewed-by: Catalin Marinas 


Re: [PATCH 1/7] arm64: mm: cleanup __do_page_fault()

2024-04-03 Thread Catalin Marinas
On Tue, Apr 02, 2024 at 03:51:36PM +0800, Kefeng Wang wrote:
> The __do_page_fault() only check vma->flags and call handle_mm_fault(),
> and only called by do_page_fault(), let's squash it into do_page_fault()
> to cleanup code.
> 
> Signed-off-by: Kefeng Wang 

Reviewed-by: Catalin Marinas 


Re: [PATCH 4/4] vdso: avoid including asm/page.h

2024-02-27 Thread Catalin Marinas
On Mon, Feb 26, 2024 at 05:14:14PM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> The recent change to the vdso_data_store broke building compat VDSO
> on at least arm64 because it includes headers outside of the include/vdso/
> namespace:
> 
> In file included from arch/arm64/include/asm/lse.h:5,
>  from arch/arm64/include/asm/cmpxchg.h:14,
>  from arch/arm64/include/asm/atomic.h:16,
>  from include/linux/atomic.h:7,
>  from include/asm-generic/bitops/atomic.h:5,
>  from arch/arm64/include/asm/bitops.h:25,
>  from include/linux/bitops.h:68,
>  from arch/arm64/include/asm/memory.h:209,
>  from arch/arm64/include/asm/page.h:46,
>  from include/vdso/datapage.h:22,
>  from lib/vdso/gettimeofday.c:5,
>  from :
> arch/arm64/include/asm/atomic_ll_sc.h:298:9: error: unknown type name 'u128'
>   298 | u128 full;
> 
> Use an open-coded page size calculation based on the new CONFIG_PAGE_SHIFT
> Kconfig symbol instead.
> 
> Reported-by: Linux Kernel Functional Testing 
> Fixes: a0d2fcd62ac2 ("vdso/ARM: Make union vdso_data_store available for all 
> architectures")
> Link: 
> https://lore.kernel.org/lkml/ca+g9fytrxxm_ko9fnpz3xarxhv7ud_yqp-teupqrnrhu+_0...@mail.gmail.com/
> Signed-off-by: Arnd Bergmann 

Acked-by: Catalin Marinas 


Re: [PATCH 2/4] arch: simplify architecture specific page size configuration

2024-02-27 Thread Catalin Marinas
On Mon, Feb 26, 2024 at 05:14:12PM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> arc, arm64, parisc and powerpc all have their own Kconfig symbols
> in place of the common CONFIG_PAGE_SIZE_4KB symbols. Change these
> so the common symbols are the ones that are actually used, while
> leaving the arhcitecture specific ones as the user visible
> place for configuring it, to avoid breaking user configs.
> 
> Signed-off-by: Arnd Bergmann 

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-19 Thread Catalin Marinas
On Fri, Feb 16, 2024 at 12:53:43PM +, Ryan Roberts wrote:
> On 16/02/2024 12:25, Catalin Marinas wrote:
> > On Thu, Feb 15, 2024 at 10:31:59AM +, Ryan Roberts wrote:
> >> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
> >> +{
> >> +  /*
> >> +   * Gather access/dirty bits, which may be populated in any of the ptes
> >> +   * of the contig range. We may not be holding the PTL, so any contiguous
> >> +   * range may be unfolded/modified/refolded under our feet. Therefore we
> >> +   * ensure we read a _consistent_ contpte range by checking that all ptes
> >> +   * in the range are valid and have CONT_PTE set, that all pfns are
> >> +   * contiguous and that all pgprots are the same (ignoring access/dirty).
> >> +   * If we find a pte that is not consistent, then we must be racing with
> >> +   * an update so start again. If the target pte does not have CONT_PTE
> >> +   * set then that is considered consistent on its own because it is not
> >> +   * part of a contpte range.
> >> +*/
[...]
> > After writing the comments above, I think I figured out that the whole
> > point of this loop is to check that the ptes in the contig range are
> > still consistent and the only variation allowed is the dirty/young
> > state to be passed to the orig_pte returned. The original pte may have
> > been updated by the time this loop finishes but I don't think it
> > matters, it wouldn't be any different than reading a single pte and
> > returning it while it is being updated.
> 
> Correct. The pte can be updated at any time, before after or during the reads.
> That was always the case. But now we have to cope with a whole contpte block
> being repainted while we are reading it. So we are just checking to make sure
> that all the ptes that we read from the contpte block are consistent with
> eachother and therefore we can trust that the access/dirty bits we gathered 
> are
> consistent.

I've been thinking a bit more about this - do any of the callers of
ptep_get_lockless() check the dirty/access bits? The only one that seems
to care is ptdump but in that case I'd rather see the raw bits for
debugging rather than propagating the dirty/access bits to the rest of
the contig range.

So with some clearer documentation on the requirements, I think we don't
need an arm64-specific ptep_get_lockless() (unless I missed something).

-- 
Catalin


Re: [PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-16 Thread Catalin Marinas
On Fri, Feb 16, 2024 at 12:53:43PM +, Ryan Roberts wrote:
> On 16/02/2024 12:25, Catalin Marinas wrote:
> > On Thu, Feb 15, 2024 at 10:31:59AM +, Ryan Roberts wrote:
> >>  arch/arm64/mm/contpte.c  | 285 +++
> > 
> > Nitpick: I think most symbols in contpte.c can be EXPORT_SYMBOL_GPL().
> > We don't expect them to be used by random out of tree modules. In fact,
> > do we expect them to end up in modules at all? Most seem to be called
> > from the core mm code.
> 
> The problem is that the contpte_* symbols are called from the ptep_* inline
> functions. So where those inlines are called from modules, we need to make 
> sure
> the contpte_* symbols are available.
> 
> John Hubbard originally reported this problem against v1 and I enumerated all
> the drivers that call into the ptep_* inlines here:
> https://lore.kernel.org/linux-arm-kernel/b994ff89-1a1f-26ca-9479-b08c77f94...@arm.com/#t
> 
> So they definitely need to be exported. Perhaps we can tighten it to
> EXPORT_SYMBOL_GPL(), but I was being cautious as I didn't want to break 
> anything
> out-of-tree. I'm not sure what the normal policy is? arm64 seems to use ~equal
> amounts of both.

I don't think we are consistent here. For example set_pte_at() can't be
called from non-GPL modules because of __sync_icache_dcache. OTOH, such
a driver is probably doing something dodgy. Same with
apply_to_page_range(), it's GPL-only (called from i915).

Let's see if others have any view over the next week or so, otherwise
I'd go for _GPL and relax it later if someone has a good use-case (can
be a patch on top adding _GPL).

> > If you can make this easier to parse (in a few years time) with an
> > additional patch adding some more comments, that would be great. For
> > this patch:
> 
> I already have a big block comment at the top, which was trying to explain it.
> Clearly not well enough though. I'll add more comments as a follow up patch 
> when
> I get back from holiday.

I read that comment but it wasn't immediately obvious what the atomicity
requirements are - basically we require a single PTE to be atomically
read (which it is), the rest is the dirty/young state being added on
top. I guess a sentence along these lines would do.

Enjoy your holiday!

-- 
Catalin


Re: [PATCH v6 18/18] arm64/mm: Automatically fold contpte mappings

2024-02-16 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:32:05AM +, Ryan Roberts wrote:
> There are situations where a change to a single PTE could cause the
> contpte block in which it resides to become foldable (i.e. could be
> repainted with the contiguous bit). Such situations arise, for example,
> when user space temporarily changes protections, via mprotect, for
> individual pages, such can be the case for certain garbage collectors.
> 
> We would like to detect when such a PTE change occurs. However this can
> be expensive due to the amount of checking required. Therefore only
> perform the checks when an indiviual PTE is modified via mprotect
> (ptep_modify_prot_commit() -> set_pte_at() -> set_ptes(nr=1)) and only
> when we are setting the final PTE in a contpte-aligned block.
> 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: [PATCH v6 17/18] arm64/mm: __always_inline to improve fork() perf

2024-02-16 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:32:04AM +, Ryan Roberts wrote:
> As set_ptes() and wrprotect_ptes() become a bit more complex, the
> compiler may choose not to inline them. But this is critical for fork()
> performance. So mark the functions, along with contpte_try_unfold()
> which is called by them, as __always_inline. This is worth ~1% on the
> fork() microbenchmark with order-0 folios (the common case).
> 
> Acked-by: Mark Rutland 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: [PATCH v6 16/18] arm64/mm: Implement pte_batch_hint()

2024-02-16 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:32:03AM +, Ryan Roberts wrote:
> When core code iterates over a range of ptes and calls ptep_get() for
> each of them, if the range happens to cover contpte mappings, the number
> of pte reads becomes amplified by a factor of the number of PTEs in a
> contpte block. This is because for each call to ptep_get(), the
> implementation must read all of the ptes in the contpte block to which
> it belongs to gather the access and dirty bits.
> 
> This causes a hotspot for fork(), as well as operations that unmap
> memory such as munmap(), exit and madvise(MADV_DONTNEED). Fortunately we
> can fix this by implementing pte_batch_hint() which allows their
> iterators to skip getting the contpte tail ptes when gathering the batch
> of ptes to operate on. This results in the number of PTE reads returning
> to 1 per pte.
> 
> Acked-by: Mark Rutland 
> Reviewed-by: David Hildenbrand 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: [PATCH v6 14/18] arm64/mm: Implement new [get_and_]clear_full_ptes() batch APIs

2024-02-16 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:32:01AM +, Ryan Roberts wrote:
> Optimize the contpte implementation to fix some of the
> exit/munmap/dontneed performance regression introduced by the initial
> contpte commit. Subsequent patches will solve it entirely.
> 
> During exit(), munmap() or madvise(MADV_DONTNEED), mappings must be
> cleared. Previously this was done 1 PTE at a time. But the core-mm
> supports batched clear via the new [get_and_]clear_full_ptes() APIs. So
> let's implement those APIs and for fully covered contpte mappings, we no
> longer need to unfold the contpte. This significantly reduces unfolding
> operations, reducing the number of tlbis that must be issued.
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: [PATCH v6 13/18] arm64/mm: Implement new wrprotect_ptes() batch API

2024-02-16 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:32:00AM +, Ryan Roberts wrote:
> Optimize the contpte implementation to fix some of the fork performance
> regression introduced by the initial contpte commit. Subsequent patches
> will solve it entirely.
> 
> During fork(), any private memory in the parent must be write-protected.
> Previously this was done 1 PTE at a time. But the core-mm supports
> batched wrprotect via the new wrprotect_ptes() API. So let's implement
> that API and for fully covered contpte mappings, we no longer need to
> unfold the contpte. This has 2 benefits:
> 
>   - reduced unfolding, reduces the number of tlbis that must be issued.
>   - The memory remains contpte-mapped ("folded") in the parent, so it
> continues to benefit from the more efficient use of the TLB after
> the fork.
> 
> The optimization to wrprotect a whole contpte block without unfolding is
> possible thanks to the tightening of the Arm ARM in respect to the
> definition and behaviour when 'Misprogramming the Contiguous bit'. See
> section D21194 at https://developer.arm.com/documentation/102105/ja-07/
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: [PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-16 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:31:59AM +, Ryan Roberts wrote:
>  arch/arm64/mm/contpte.c  | 285 +++

Nitpick: I think most symbols in contpte.c can be EXPORT_SYMBOL_GPL().
We don't expect them to be used by random out of tree modules. In fact,
do we expect them to end up in modules at all? Most seem to be called
from the core mm code.

> +#define ptep_get_lockless ptep_get_lockless
> +static inline pte_t ptep_get_lockless(pte_t *ptep)
> +{
> + pte_t pte = __ptep_get(ptep);
> +
> + if (likely(!pte_valid_cont(pte)))
> + return pte;
> +
> + return contpte_ptep_get_lockless(ptep);
> +}
[...]
> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
> +{
> + /*
> +  * Gather access/dirty bits, which may be populated in any of the ptes
> +  * of the contig range. We may not be holding the PTL, so any contiguous
> +  * range may be unfolded/modified/refolded under our feet. Therefore we
> +  * ensure we read a _consistent_ contpte range by checking that all ptes
> +  * in the range are valid and have CONT_PTE set, that all pfns are
> +  * contiguous and that all pgprots are the same (ignoring access/dirty).
> +  * If we find a pte that is not consistent, then we must be racing with
> +  * an update so start again. If the target pte does not have CONT_PTE
> +  * set then that is considered consistent on its own because it is not
> +  * part of a contpte range.
> +*/

I can't get my head around this lockless API. Maybe it works fine (and
may have been discussed already) but we should document what the races
are, why it works, what the memory ordering requirements are. For
example, the generic (well, x86 PAE) ptep_get_lockless() only needs to
ensure that the low/high 32 bits of a pte are consistent and there are
some ordering rules on how these are updated.

Does the arm64 implementation only need to be correct w.r.t. the
access/dirty bits? Since we can read orig_ptep atomically, I assume the
only other updates from unfolding would set the dirty/access bits.

> +
> + pgprot_t orig_prot;
> + unsigned long pfn;
> + pte_t orig_pte;
> + pgprot_t prot;
> + pte_t *ptep;
> + pte_t pte;
> + int i;
> +
> +retry:
> + orig_pte = __ptep_get(orig_ptep);
> +
> + if (!pte_valid_cont(orig_pte))
> + return orig_pte;
> +
> + orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
> + ptep = contpte_align_down(orig_ptep);
> + pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> + pte = __ptep_get(ptep);
> + prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));

We don't have any ordering guarantees in how the ptes in this range are
read or written in the contpte_set_ptes() and the fold/unfold functions.
We might not need them given all the other checks below but it's worth
adding a comment.

> +
> + if (!pte_valid_cont(pte) ||
> +pte_pfn(pte) != pfn ||
> +pgprot_val(prot) != pgprot_val(orig_prot))
> + goto retry;

I think this also needs some comment. I get the !pte_valid_cont() check
to attempt retrying when racing with unfolding. Are the other checks
needed to detect re-folding with different protection or pfn?

> +
> + if (pte_dirty(pte))
> + orig_pte = pte_mkdirty(orig_pte);
> +
> + if (pte_young(pte))
> + orig_pte = pte_mkyoung(orig_pte);
> + }

After writing the comments above, I think I figured out that the whole
point of this loop is to check that the ptes in the contig range are
still consistent and the only variation allowed is the dirty/young
state to be passed to the orig_pte returned. The original pte may have
been updated by the time this loop finishes but I don't think it
matters, it wouldn't be any different than reading a single pte and
returning it while it is being updated.

If you can make this easier to parse (in a few years time) with an
additional patch adding some more comments, that would be great. For
this patch:

Reviewed-by: Catalin Marinas 

-- 
Catalin


Re: [PATCH v6 11/18] arm64/mm: Split __flush_tlb_range() to elide trailing DSB

2024-02-15 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:31:58AM +, Ryan Roberts wrote:
> Split __flush_tlb_range() into __flush_tlb_range_nosync() +
> __flush_tlb_range(), in the same way as the existing flush_tlb_page()
> arrangement. This allows calling __flush_tlb_range_nosync() to elide the
> trailing DSB. Forthcoming "contpte" code will take advantage of this
> when clearing the young bit from a contiguous range of ptes.
> 
> Ordering between dsb and mmu_notifier_arch_invalidate_secondary_tlbs()
> has changed, but now aligns with the ordering of __flush_tlb_page(). It
> has been discussed that __flush_tlb_page() may be wrong though.
> Regardless, both will be resolved separately if needed.
> 
> Reviewed-by: David Hildenbrand 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: [PATCH v6 10/18] arm64/mm: New ptep layer to manage contig bit

2024-02-15 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:31:57AM +, Ryan Roberts wrote:
> Create a new layer for the in-table PTE manipulation APIs. For now, The
> existing API is prefixed with double underscore to become the
> arch-private API and the public API is just a simple wrapper that calls
> the private API.
> 
> The public API implementation will subsequently be used to transparently
> manipulate the contiguous bit where appropriate. But since there are
> already some contig-aware users (e.g. hugetlb, kernel mapper), we must
> first ensure those users use the private API directly so that the future
> contig-bit manipulations in the public API do not interfere with those
> existing uses.
> 
> The following APIs are treated this way:
> 
>  - ptep_get
>  - set_pte
>  - set_ptes
>  - pte_clear
>  - ptep_get_and_clear
>  - ptep_test_and_clear_young
>  - ptep_clear_flush_young
>  - ptep_set_wrprotect
>  - ptep_set_access_flags
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: [PATCH v6 09/18] arm64/mm: Convert ptep_clear() to ptep_get_and_clear()

2024-02-15 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:31:56AM +, Ryan Roberts wrote:
> ptep_clear() is a generic wrapper around the arch-implemented
> ptep_get_and_clear(). We are about to convert ptep_get_and_clear() into
> a public version and private version (__ptep_get_and_clear()) to support
> the transparent contpte work. We won't have a private version of
> ptep_clear() so let's convert it to directly call ptep_get_and_clear().
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: [PATCH v6 08/18] arm64/mm: Convert set_pte_at() to set_ptes(..., 1)

2024-02-15 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:31:55AM +, Ryan Roberts wrote:
> Since set_ptes() was introduced, set_pte_at() has been implemented as a
> generic macro around set_ptes(..., 1). So this change should continue to
> generate the same code. However, making this change prepares us for the
> transparent contpte support. It means we can reroute set_ptes() to
> __set_ptes(). Since set_pte_at() is a generic macro, there will be no
> equivalent __set_pte_at() to reroute to.
> 
> Note that a couple of calls to set_pte_at() remain in the arch code.
> This is intentional, since those call sites are acting on behalf of
> core-mm and should continue to call into the public set_ptes() rather
> than the arch-private __set_ptes().
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: [PATCH v6 07/18] arm64/mm: Convert READ_ONCE(*ptep) to ptep_get(ptep)

2024-02-15 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:31:54AM +, Ryan Roberts wrote:
> There are a number of places in the arch code that read a pte by using
> the READ_ONCE() macro. Refactor these call sites to instead use the
> ptep_get() helper, which itself is a READ_ONCE(). Generated code should
> be the same.
> 
> This will benefit us when we shortly introduce the transparent contpte
> support. In this case, ptep_get() will become more complex so we now
> have all the code abstracted through it.
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: [PATCH v6 04/18] arm64/mm: Convert pte_next_pfn() to pte_advance_pfn()

2024-02-15 Thread Catalin Marinas
On Thu, Feb 15, 2024 at 10:31:51AM +, Ryan Roberts wrote:
> Core-mm needs to be able to advance the pfn by an arbitrary amount, so
> override the new pte_advance_pfn() API to do so.
> 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas 


Re: get_user_pages() and EXEC_ONLY mapping.

2023-11-10 Thread Catalin Marinas
On Fri, Nov 10, 2023 at 08:19:23PM +0530, Aneesh Kumar K.V wrote:
> Some architectures can now support EXEC_ONLY mappings and I am wondering
> what get_user_pages() on those addresses should return. Earlier
> PROT_EXEC implied PROT_READ and pte_access_permitted() returned true for
> that. But arm64 does have this explicit comment that says
> 
>  /*
>  * p??_access_permitted() is true for valid user mappings (PTE_USER
>  * bit set, subject to the write permission check). For execute-only
>  * mappings, like PROT_EXEC with EPAN (both PTE_USER and PTE_UXN bits
>  * not set) must return false. PROT_NONE mappings do not have the
>  * PTE_VALID bit set.
>  */
> 
> Is that correct? We should be able to get struct page for PROT_EXEC
> mappings?

I don't remember why we ended up with this. Briefly looking at the code,
pte_access_permitted() is only used on the fast GUP path. On the slow
path, there is a check_vma_flags() call which returns -EFAULT if the vma
is not readable. So the pte_access_permitted() on the fast path matches
the semantics of the slow path.

If one wants the page structure, FOLL_FORCE ignores the read check (on
the slow path), though I think it still fails if VM_MAYREAD is not set.
Unless you have a real use-case where this is not sufficient, I'd leave
the behaviour as is on arm64 (and maybe update other architectures that
support exec-only to do the same).
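
As a purely illustrative sketch (not from the thread), assuming the
post-6.5 get_user_pages() prototype without the vmas argument, and with
addr being a hypothetical user address inside the exec-only mapping,
grabbing the page via the slow path would look roughly like:

	struct page *page;
	long ret;

	mmap_read_lock(current->mm);
	/*
	 * FOLL_FORCE skips the VM_READ check in check_vma_flags(), but
	 * the vma still needs VM_MAYREAD.
	 */
	ret = get_user_pages(addr, 1, FOLL_FORCE, &page);
	mmap_read_unlock(current->mm);
	if (ret == 1)
		put_page(page);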

-- 
Catalin


Re: [PATCH v3 9/9] efi: move screen_info into efi init code

2023-10-10 Thread Catalin Marinas
On Mon, Oct 09, 2023 at 11:18:45PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> After the vga console no longer relies on global screen_info, there are
> only two remaining use cases:
> 
>  - on the x86 architecture, it is used for multiple boot methods
>(bzImage, EFI, Xen, kexec) to commucate the initial VGA or framebuffer
>settings to a number of device drivers.
> 
>  - on other architectures, it is only used as part of the EFI stub,
>and only for the three sysfb framebuffers (simpledrm, simplefb, efifb).
> 
> Remove the duplicate data structure definitions by moving it into the
> efi-init.c file that sets it up initially for the EFI case, leaving x86
> as an exception that retains its own definition for non-EFI boots.
> 
> The added #ifdefs here are optional, I added them to further limit the
> reach of screen_info to configurations that have at least one of the
> users enabled.
> 
> Reviewed-by: Ard Biesheuvel 
> Reviewed-by: Javier Martinez Canillas 
> Acked-by: Helge Deller 
> Signed-off-by: Arnd Bergmann 
> ---
>  arch/arm/kernel/setup.c   |  4 
>  arch/arm64/kernel/efi.c   |  4 
>  arch/arm64/kernel/image-vars.h|  2 ++

It's more Ard's thing and he reviewed it already but if you need another
ack:

Acked-by: Catalin Marinas 


Re: [PATCH v2] arch: Reserve map_shadow_stack() syscall number for all architectures

2023-10-04 Thread Catalin Marinas
On Thu, Sep 14, 2023 at 06:58:03PM +, Sohil Mehta wrote:
> commit c35559f94ebc ("x86/shstk: Introduce map_shadow_stack syscall")
> recently added support for map_shadow_stack() but it is limited to x86
> only for now. There is a possibility that other architectures (namely,
> arm64 and RISC-V), that are implementing equivalent support for shadow
> stacks, might need to add support for it.
> 
> Independent of that, reserving arch-specific syscall numbers in the
> syscall tables of all architectures is good practice and would help
> avoid future conflicts. map_shadow_stack() is marked as a conditional
> syscall in sys_ni.c. Adding it to the syscall tables of other
> architectures is harmless and would return ENOSYS when exercised.
> 
> Note, map_shadow_stack() was assigned #453 during the merge process
> since #452 was taken by fchmodat2().
> 
> For Powerpc, map it to sys_ni_syscall() as is the norm for Powerpc
> syscall tables.
> 
> For Alpha, map_shadow_stack() takes up #563 as Alpha still diverges from
> the common syscall numbering system in the other architectures.
> 
> Link: 
> https://lore.kernel.org/lkml/20230515212255.ga562...@debug.ba.rivosinc.com/
> Link: 
> https://lore.kernel.org/lkml/b402b80b-a7c6-4ef0-b977-c0f5f582b...@sirena.org.uk/
> 
> Signed-off-by: Sohil Mehta 
> ---
> v2:
> - Skip syscall table changes to tools/. They will be handled separetely by the
>   perf folks.
> - Map Powerpc to sys_ni_syscall (Rick Edgecombe)
> ---
>  arch/alpha/kernel/syscalls/syscall.tbl  | 1 +
>  arch/arm/tools/syscall.tbl  | 1 +
>  arch/arm64/include/asm/unistd.h     | 2 +-
>  arch/arm64/include/asm/unistd32.h   | 2 ++

For arm64 (compat):

Acked-by: Catalin Marinas 


Re: [PATCH v1 0/8] Fix set_huge_pte_at() panic on arm64

2023-09-21 Thread Catalin Marinas
On Thu, Sep 21, 2023 at 05:35:54PM +0100, Ryan Roberts wrote:
> On 21/09/2023 17:30, Andrew Morton wrote:
> > On Thu, 21 Sep 2023 17:19:59 +0100 Ryan Roberts  
> > wrote:
> >> Ryan Roberts (8):
> >>   parisc: hugetlb: Convert set_huge_pte_at() to take vma
> >>   powerpc: hugetlb: Convert set_huge_pte_at() to take vma
> >>   riscv: hugetlb: Convert set_huge_pte_at() to take vma
> >>   s390: hugetlb: Convert set_huge_pte_at() to take vma
> >>   sparc: hugetlb: Convert set_huge_pte_at() to take vma
> >>   mm: hugetlb: Convert set_huge_pte_at() to take vma
> >>   arm64: hugetlb: Convert set_huge_pte_at() to take vma
> >>   arm64: hugetlb: Fix set_huge_pte_at() to work with all swap entries
> >>
> >>  arch/arm64/include/asm/hugetlb.h  |  2 +-
> >>  arch/arm64/mm/hugetlbpage.c   | 22 --
> >>  arch/parisc/include/asm/hugetlb.h |  2 +-
> >>  arch/parisc/mm/hugetlbpage.c  |  4 +--
> >>  .../include/asm/nohash/32/hugetlb-8xx.h   |  3 +-
> >>  arch/powerpc/mm/book3s64/hugetlbpage.c|  2 +-
> >>  arch/powerpc/mm/book3s64/radix_hugetlbpage.c  |  2 +-
> >>  arch/powerpc/mm/nohash/8xx.c  |  2 +-
> >>  arch/powerpc/mm/pgtable.c |  7 -
> >>  arch/riscv/include/asm/hugetlb.h  |  2 +-
> >>  arch/riscv/mm/hugetlbpage.c   |  3 +-
> >>  arch/s390/include/asm/hugetlb.h   |  8 +++--
> >>  arch/s390/mm/hugetlbpage.c|  8 -
> >>  arch/sparc/include/asm/hugetlb.h  |  8 +++--
> >>  arch/sparc/mm/hugetlbpage.c   |  8 -
> >>  include/asm-generic/hugetlb.h |  6 ++--
> >>  include/linux/hugetlb.h   |  6 ++--
> >>  mm/damon/vaddr.c  |  2 +-
> >>  mm/hugetlb.c  | 30 +--
> >>  mm/migrate.c  |  2 +-
> >>  mm/rmap.c | 10 +++
> >>  mm/vmalloc.c  |  5 +++-
> >>  22 files changed, 80 insertions(+), 64 deletions(-)
> > 
> > Looks scary but it's actually a fairly modest patchset.  It could
> > easily be all rolled into a single patch for ease of backporting. 
> > Maybe Greg has an opinion?
> 
> Yes, I thought about doing that; or perhaps 2 patches - one for the interface
> change across all arches and core code, and one for the actual bug fix?

I think this would make more sense, especially if we want to backport
it. The first patch would have no functional change, only an interface
change, followed by the arm64 fix.

-- 
Catalin


Re: [PATCH v3 5/5] mmu_notifiers: Rename invalidate_range notifier

2023-07-21 Thread Catalin Marinas
On Thu, Jul 20, 2023 at 06:39:27PM +1000, Alistair Popple wrote:
> diff --git a/arch/arm64/include/asm/tlbflush.h 
> b/arch/arm64/include/asm/tlbflush.h
> index a99349d..84a05a0 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -253,7 +253,7 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>   __tlbi(aside1is, asid);
>   __tlbi_user(aside1is, asid);
>   dsb(ish);
> - mmu_notifier_invalidate_range(mm, 0, -1UL);
> + mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
>  }
>  
>  static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> @@ -265,7 +265,7 @@ static inline void __flush_tlb_page_nosync(struct 
> mm_struct *mm,
>   addr = __TLBI_VADDR(uaddr, ASID(mm));
>   __tlbi(vale1is, addr);
>   __tlbi_user(vale1is, addr);
> - mmu_notifier_invalidate_range(mm, uaddr & PAGE_MASK,
> + mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
>   (uaddr & PAGE_MASK) + 
> PAGE_SIZE);
>  }
>  
> @@ -400,7 +400,7 @@ static inline void __flush_tlb_range(struct 
> vm_area_struct *vma,
>   scale++;
>   }
>   dsb(ish);
> - mmu_notifier_invalidate_range(vma->vm_mm, start, end);
> + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
>  }

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH v3 3/5] mmu_notifiers: Call invalidate_range() when invalidating TLBs

2023-07-21 Thread Catalin Marinas
On Thu, Jul 20, 2023 at 06:39:25PM +1000, Alistair Popple wrote:
> diff --git a/arch/arm64/include/asm/tlbflush.h 
> b/arch/arm64/include/asm/tlbflush.h
> index 3456866..a99349d 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -13,6 +13,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -252,6 +253,7 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>   __tlbi(aside1is, asid);
>   __tlbi_user(aside1is, asid);
>   dsb(ish);
> + mmu_notifier_invalidate_range(mm, 0, -1UL);
>  }
>  
>  static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> @@ -263,6 +265,8 @@ static inline void __flush_tlb_page_nosync(struct 
> mm_struct *mm,
>   addr = __TLBI_VADDR(uaddr, ASID(mm));
>   __tlbi(vale1is, addr);
>   __tlbi_user(vale1is, addr);
> + mmu_notifier_invalidate_range(mm, uaddr & PAGE_MASK,
> + (uaddr & PAGE_MASK) + 
> PAGE_SIZE);

Nitpick: we have PAGE_ALIGN() for this.
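
i.e. presumably something along the lines of (equivalent to the
open-coded version above):

	mmu_notifier_invalidate_range(mm, uaddr & PAGE_MASK,
				      PAGE_ALIGN(uaddr + 1));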

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH v11 4/4] arm64: support batched/deferred tlb shootdown during page reclamation/migration

2023-07-21 Thread Catalin Marinas
On Mon, Jul 17, 2023 at 09:10:04PM +0800, Yicong Yang wrote:
> +static inline void arch_tlbbatch_add_pending(struct 
> arch_tlbflush_unmap_batch *batch,
> +  struct mm_struct *mm,
> +  unsigned long uaddr)
> +{
> + __flush_tlb_page_nosync(mm, uaddr);
> +}
> +
> +static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
> +{
> + dsb(ish);
> +}
> +
> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch 
> *batch)
> +{
> + dsb(ish);
> +}

Nitpick: as an additional patch, I'd add a comment to these two
functions noting that the TLBI has already been issued and only a DSB is
needed to synchronise its effect on the other CPUs.
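
For illustration only, a rough sketch (not the final wording) of what
such comments could look like:

static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
{
	/*
	 * The TLBIs for the unmapped pages were already issued by
	 * arch_tlbbatch_add_pending(); a DSB is enough to make their
	 * effect visible to the other CPUs.
	 */
	dsb(ish);
}

static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
	/* Likewise: only synchronise the TLBIs already issued. */
	dsb(ish);
}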

Reviewed-by: Catalin Marinas 


Re: [PATCH v11 3/4] mm/tlbbatch: Introduce arch_flush_tlb_batched_pending()

2023-07-21 Thread Catalin Marinas
On Mon, Jul 17, 2023 at 09:10:03PM +0800, Yicong Yang wrote:
> From: Yicong Yang 
> 
> Currently we'll flush the mm in flush_tlb_batched_pending() to
> avoid race between reclaim unmaps pages by batched TLB flush
> and mprotect/munmap/etc. Other architectures like arm64 may
> only need a synchronization barrier(dsb) here rather than
> a full mm flush. So add arch_flush_tlb_batched_pending() to
> allow an arch-specific implementation here. This intends no
> functional changes on x86 since still a full mm flush for
> x86.
> 
> Signed-off-by: Yicong Yang 

Reviewed-by: Catalin Marinas 


Re: [PATCH v11 2/4] mm/tlbbatch: Rename and extend some functions

2023-07-21 Thread Catalin Marinas
On Mon, Jul 17, 2023 at 09:10:02PM +0800, Yicong Yang wrote:
> From: Barry Song 
> 
> This patch does some preparation works to extend batched TLB flush to
> arm64. Including:
> - Extend set_tlb_ubc_flush_pending() and arch_tlbbatch_add_mm()
>   to accept an additional argument for address, architectures
>   like arm64 may need this for tlbi.
> - Rename arch_tlbbatch_add_mm() to arch_tlbbatch_add_pending()
>   to match its current function since we don't need to handle
>   mm on architectures like arm64 and add_mm is not proper,
>   add_pending will make sense to both as on x86 we're pending the
>   TLB flush operations while on arm64 we're pending the synchronize
>   operations.
> 
> This intends no functional changes on x86.
> 
> Cc: Anshuman Khandual 
> Cc: Jonathan Corbet 
> Cc: Nadav Amit 
> Cc: Mel Gorman 
> Tested-by: Yicong Yang 
> Tested-by: Xin Hao 
> Tested-by: Punit Agrawal 
> Signed-off-by: Barry Song 
> Signed-off-by: Yicong Yang 
> Reviewed-by: Kefeng Wang 
> Reviewed-by: Xin Hao 
> Reviewed-by: Anshuman Khandual 

Reviewed-by: Catalin Marinas 


Re: [PATCH v11 1/4] mm/tlbbatch: Introduce arch_tlbbatch_should_defer()

2023-07-21 Thread Catalin Marinas
On Mon, Jul 17, 2023 at 09:10:01PM +0800, Yicong Yang wrote:
> From: Anshuman Khandual 
> 
> The entire scheme of deferred TLB flush in reclaim path rests on the
> fact that the cost to refill TLB entries is less than flushing out
> individual entries by sending IPI to remote CPUs. But architecture
> can have different ways to evaluate that. Hence apart from checking
> TTU_BATCH_FLUSH in the TTU flags, rest of the decision should be
> architecture specific.
> 
> Signed-off-by: Anshuman Khandual 
> [https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khand...@linux.vnet.ibm.com/]
> Signed-off-by: Yicong Yang 
> [Rebase and fix incorrect return value type]
> Reviewed-by: Kefeng Wang 
> Reviewed-by: Anshuman Khandual 
> Reviewed-by: Barry Song 
> Reviewed-by: Xin Hao 
> Tested-by: Punit Agrawal 

Reviewed-by: Catalin Marinas 


Re: [PATCH v10 4/4] arm64: support batched/deferred tlb shootdown during page reclamation/migration

2023-07-16 Thread Catalin Marinas
On Mon, Jul 10, 2023 at 04:39:14PM +0800, Yicong Yang wrote:
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 7856c3a3e35a..f0ce8208c57f 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -96,6 +96,7 @@ config ARM64
>   select ARCH_SUPPORTS_NUMA_BALANCING
>   select ARCH_SUPPORTS_PAGE_TABLE_CHECK
>   select ARCH_SUPPORTS_PER_VMA_LOCK
> + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if EXPERT

I don't want EXPERT to turn on a feature that's not selectable by the
user. This would lead to different performance behaviour based on
EXPERT. Just select it unconditionally.

> diff --git a/arch/arm64/include/asm/tlbflush.h 
> b/arch/arm64/include/asm/tlbflush.h
> index 412a3b9a3c25..4bb9cec62e26 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -254,17 +254,23 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>   dsb(ish);
>  }
>  
> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> -  unsigned long uaddr)
> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> +unsigned long uaddr)
>  {
>   unsigned long addr;
>  
>   dsb(ishst);
> - addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
> + addr = __TLBI_VADDR(uaddr, ASID(mm));
>   __tlbi(vale1is, addr);
>   __tlbi_user(vale1is, addr);
>  }
>  
> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> +  unsigned long uaddr)
> +{
> + return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
> +}
> +
>  static inline void flush_tlb_page(struct vm_area_struct *vma,
> unsigned long uaddr)
>  {
> @@ -272,6 +278,42 @@ static inline void flush_tlb_page(struct vm_area_struct 
> *vma,
>   dsb(ish);
>  }
>  
> +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH

If it's selected unconditionally, we won't need this #ifdef here.

> +
> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> +{
> +#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
> + /*
> +  * TLB flush deferral is not required on systems, which are affected 
> with

"affected by" and drop the comma before "which".

-- 
Catalin


Re: [RESEND PATCH v9 2/2] arm64: support batched/deferred tlb shootdown during page reclamation/migration

2023-06-29 Thread Catalin Marinas
On Thu, Jun 29, 2023 at 05:31:36PM +0100, Catalin Marinas wrote:
> On Thu, May 18, 2023 at 02:59:34PM +0800, Yicong Yang wrote:
> > From: Barry Song 
> > 
> > on x86, batched and deferred tlb shootdown has lead to 90%
> > performance increase on tlb shootdown. on arm64, HW can do
> > tlb shootdown without software IPI. But sync tlbi is still
> > quite expensive.
> [...]
> >  .../features/vm/TLB/arch-support.txt  |  2 +-
> >  arch/arm64/Kconfig|  1 +
> >  arch/arm64/include/asm/tlbbatch.h | 12 
> >  arch/arm64/include/asm/tlbflush.h | 33 -
> >  arch/arm64/mm/flush.c | 69 +++
> >  arch/x86/include/asm/tlbflush.h   |  5 +-
> >  include/linux/mm_types_task.h |  4 +-
> >  mm/rmap.c | 12 ++--
> 
> First of all, this patch needs to be split into some preparatory patches
> introducing/renaming functions with no functional change for x86. Once
> done, you can add the arm64-only changes.
> 
> Now, on the implementation, I had some comments on v7 but we didn't get
> to a conclusion and the thread eventually died:
> 
> https://lore.kernel.org/linux-mm/y7ctoj5mwd1zb...@arm.com/
> 
> I know I said a command line argument is better than Kconfig or some
> random number of CPUs heuristics but it would be even better if we don't
> bother with any, just make this always on. Barry had some comments
> around mprotect() being racy and that's why we have
> flush_tlb_batched_pending() but I don't think it's needed (or, for
> arm64, it can be a DSB since this patch issues the TLBIs but without the
> DVM Sync). So we need to clarify this (see Barry's last email on the
> above thread) before attempting new versions of this patchset. With
> flush_tlb_batched_pending() removed (or DSB), I have a suspicion such
> implementation would be faster on any SoC irrespective of the number of
> CPUs.

I think I got the need for flush_tlb_batched_pending(). If
try_to_unmap() marks the pte !present and we have a pending TLBI,
change_pte_range() will skip the TLB maintenance altogether since it did
not change the pte. So we could be left with stale TLB entries after
mprotect() before TTU does the batch flushing.

We can have an arch-specific flush_tlb_batched_pending() that can be a
DSB only on arm64 and a full mm flush on x86.

-- 
Catalin


Re: [RESEND PATCH v9 2/2] arm64: support batched/deferred tlb shootdown during page reclamation/migration

2023-06-29 Thread Catalin Marinas
On Thu, May 18, 2023 at 02:59:34PM +0800, Yicong Yang wrote:
> From: Barry Song 
> 
> on x86, batched and deferred tlb shootdown has lead to 90%
> performance increase on tlb shootdown. on arm64, HW can do
> tlb shootdown without software IPI. But sync tlbi is still
> quite expensive.
[...]
>  .../features/vm/TLB/arch-support.txt  |  2 +-
>  arch/arm64/Kconfig|  1 +
>  arch/arm64/include/asm/tlbbatch.h | 12 
>  arch/arm64/include/asm/tlbflush.h | 33 -
>  arch/arm64/mm/flush.c | 69 +++
>  arch/x86/include/asm/tlbflush.h   |  5 +-
>  include/linux/mm_types_task.h |  4 +-
>  mm/rmap.c | 12 ++--

First of all, this patch needs to be split into some preparatory patches
introducing/renaming functions with no functional change for x86. Once
done, you can add the arm64-only changes.

Now, on the implementation, I had some comments on v7 but we didn't get
to a conclusion and the thread eventually died:

https://lore.kernel.org/linux-mm/y7ctoj5mwd1zb...@arm.com/

I know I said a command line argument is better than Kconfig or some
random number of CPUs heuristics but it would be even better if we don't
bother with any, just make this always on. Barry had some comments
around mprotect() being racy and that's why we have
flush_tlb_batched_pending() but I don't think it's needed (or, for
arm64, it can be a DSB since this patch issues the TLBIs but without the
DVM Sync). So we need to clarify this (see Barry's last email on the
above thread) and before attempting new versions of this patchset. With
flush_tlb_batched_pending() removed (or DSB), I have a suspicion such
implementation would be faster on any SoC irrespective of the number of
CPUs.

-- 
Catalin


Re: [PATCH v4 21/34] arm64: Convert various functions to use ptdescs

2023-06-14 Thread Catalin Marinas
On Mon, Jun 12, 2023 at 02:04:10PM -0700, Vishal Moola (Oracle) wrote:
> As part of the conversions to replace pgtable constructor/destructors with
> ptdesc equivalents, convert various page table functions to use ptdescs.
> 
> Signed-off-by: Vishal Moola (Oracle) 

Acked-by: Catalin Marinas 


Re: [PATCH 0/3] Move the ARCH_DMA_MINALIGN definition to asm/cache.h

2023-06-13 Thread Catalin Marinas
On Tue, Jun 13, 2023 at 04:42:40PM +, Christophe Leroy wrote:
> 
> 
> Le 13/06/2023 à 17:52, Catalin Marinas a écrit :
> > Hi,
> > 
> > The ARCH_KMALLOC_MINALIGN reduction series defines a generic
> > ARCH_DMA_MINALIGN in linux/cache.h:
> > 
> > https://lore.kernel.org/r/20230612153201.554742-2-catalin.mari...@arm.com/
> > 
> > Unfortunately, this causes a duplicate definition warning for
> > microblaze, powerpc (32-bit only) and sh as these architectures define
> > ARCH_DMA_MINALIGN in a different file than asm/cache.h. Move the macro
> > to asm/cache.h to avoid this issue and also bring them in line with the
> > other architectures.
> 
> What about mips ?
> 
> arch/mips/include/asm/mach-generic/kmalloc.h:#define ARCH_DMA_MINALIGN
> 128
> arch/mips/include/asm/mach-ip32/kmalloc.h:#define ARCH_DMA_MINALIGN   32
> arch/mips/include/asm/mach-ip32/kmalloc.h:#define ARCH_DMA_MINALIGN   128
> arch/mips/include/asm/mach-n64/kmalloc.h:#define ARCH_DMA_MINALIGN
> L1_CACHE_BYTES
> arch/mips/include/asm/mach-tx49xx/kmalloc.h:#define ARCH_DMA_MINALIGN 
> L1_CACHE_BYTES

Sorry, I should have mentioned it in the cover letter (discussed here -
https://lore.kernel.org/r/zihpaixb%2f0ve7...@arm.com/). These kmalloc.h
files are included in asm/cache.h, based on which machine is enabled, so
there's no problem for mips. It makes more sense to keep them in those
mach-*/kmalloc.h files instead of having lots of #ifdefs in cache.h.

-- 
Catalin


[PATCH 2/3] microblaze: Move the ARCH_{DMA,SLAB}_MINALIGN definitions to asm/cache.h

2023-06-13 Thread Catalin Marinas
The microblaze architecture defines ARCH_DMA_MINALIGN in asm/page.h.
Move it to asm/cache.h to allow a generic ARCH_DMA_MINALIGN definition
in linux/cache.h without redefine errors/warnings.

While at it, also move ARCH_SLAB_MINALIGN to asm/cache.h for
consistency.

Signed-off-by: Catalin Marinas 
Cc: Michal Simek 
---
 arch/microblaze/include/asm/cache.h | 5 +
 arch/microblaze/include/asm/page.h  | 5 -
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/microblaze/include/asm/cache.h 
b/arch/microblaze/include/asm/cache.h
index a149b3e711ec..1903988b9e23 100644
--- a/arch/microblaze/include/asm/cache.h
+++ b/arch/microblaze/include/asm/cache.h
@@ -18,4 +18,9 @@
 
 #define SMP_CACHE_BYTESL1_CACHE_BYTES
 
+/* MS be sure that SLAB allocates aligned objects */
+#define ARCH_DMA_MINALIGN  L1_CACHE_BYTES
+
+#define ARCH_SLAB_MINALIGN L1_CACHE_BYTES
+
 #endif /* _ASM_MICROBLAZE_CACHE_H */
diff --git a/arch/microblaze/include/asm/page.h 
b/arch/microblaze/include/asm/page.h
index 7b9861bcd458..337f23eabc71 100644
--- a/arch/microblaze/include/asm/page.h
+++ b/arch/microblaze/include/asm/page.h
@@ -30,11 +30,6 @@
 
 #ifndef __ASSEMBLY__
 
-/* MS be sure that SLAB allocates aligned objects */
-#define ARCH_DMA_MINALIGN  L1_CACHE_BYTES
-
-#define ARCH_SLAB_MINALIGN L1_CACHE_BYTES
-
 /*
  * PAGE_OFFSET -- the first address of the first page of memory. With MMU
  * it is set to the kernel start address (aligned on a page boundary).


[PATCH 3/3] sh: Move the ARCH_DMA_MINALIGN definition to asm/cache.h

2023-06-13 Thread Catalin Marinas
The sh architecture defines ARCH_DMA_MINALIGN in asm/page.h. Move it to
asm/cache.h to allow a generic ARCH_DMA_MINALIGN definition in
linux/cache.h without redefine errors/warnings.

Signed-off-by: Catalin Marinas 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: John Paul Adrian Glaubitz 
Cc: linux...@vger.kernel.org
---
 arch/sh/include/asm/cache.h | 6 ++
 arch/sh/include/asm/page.h  | 6 --
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/sh/include/asm/cache.h b/arch/sh/include/asm/cache.h
index 32dfa6b82ec6..b38dbc975581 100644
--- a/arch/sh/include/asm/cache.h
+++ b/arch/sh/include/asm/cache.h
@@ -14,6 +14,12 @@
 
 #define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
 
+/*
+ * Some drivers need to perform DMA into kmalloc'ed buffers
+ * and so we have to increase the kmalloc minalign for this.
+ */
+#define ARCH_DMA_MINALIGN  L1_CACHE_BYTES
+
 #define __read_mostly __section(".data..read_mostly")
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sh/include/asm/page.h b/arch/sh/include/asm/page.h
index 09ac6c7faee0..62f4b9edcb98 100644
--- a/arch/sh/include/asm/page.h
+++ b/arch/sh/include/asm/page.h
@@ -174,10 +174,4 @@ typedef struct page *pgtable_t;
 #include 
 #include 
 
-/*
- * Some drivers need to perform DMA into kmalloc'ed buffers
- * and so we have to increase the kmalloc minalign for this.
- */
-#define ARCH_DMA_MINALIGN  L1_CACHE_BYTES
-
 #endif /* __ASM_SH_PAGE_H */


[PATCH 1/3] powerpc: Move the ARCH_DMA_MINALIGN definition to asm/cache.h

2023-06-13 Thread Catalin Marinas
The powerpc architecture defines ARCH_DMA_MINALIGN in asm/page_32.h and
only if CONFIG_NOT_COHERENT_CACHE is enabled (32-bit platforms only).
Move this macro to asm/cache.h to allow a generic ARCH_DMA_MINALIGN
definition in linux/cache.h without redefine errors/warnings.

Signed-off-by: Catalin Marinas 
Reported-by: kernel test robot 
Closes: 
https://lore.kernel.org/oe-kbuild-all/202306131053.1ybvrrho-...@intel.com/
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/cache.h   | 4 
 arch/powerpc/include/asm/page_32.h | 4 
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/cache.h b/arch/powerpc/include/asm/cache.h
index ae0a68a838e8..69232231d270 100644
--- a/arch/powerpc/include/asm/cache.h
+++ b/arch/powerpc/include/asm/cache.h
@@ -33,6 +33,10 @@
 
 #define IFETCH_ALIGN_BYTES (1 << IFETCH_ALIGN_SHIFT)
 
+#ifdef CONFIG_NOT_COHERENT_CACHE
+#define ARCH_DMA_MINALIGN  L1_CACHE_BYTES
+#endif
+
 #if !defined(__ASSEMBLY__)
 #ifdef CONFIG_PPC64
 
diff --git a/arch/powerpc/include/asm/page_32.h 
b/arch/powerpc/include/asm/page_32.h
index 56f217606327..b9ac9e3a771c 100644
--- a/arch/powerpc/include/asm/page_32.h
+++ b/arch/powerpc/include/asm/page_32.h
@@ -12,10 +12,6 @@
 
 #define VM_DATA_DEFAULT_FLAGS  VM_DATA_DEFAULT_FLAGS32
 
-#ifdef CONFIG_NOT_COHERENT_CACHE
-#define ARCH_DMA_MINALIGN  L1_CACHE_BYTES
-#endif
-
 #if defined(CONFIG_PPC_256K_PAGES) || \
 (defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES))
 #define PTE_SHIFT  (PAGE_SHIFT - PTE_T_LOG2 - 2)   /* 1/4 of a page */


[PATCH 0/3] Move the ARCH_DMA_MINALIGN definition to asm/cache.h

2023-06-13 Thread Catalin Marinas
Hi,

The ARCH_KMALLOC_MINALIGN reduction series defines a generic
ARCH_DMA_MINALIGN in linux/cache.h:

https://lore.kernel.org/r/20230612153201.554742-2-catalin.mari...@arm.com/

Unfortunately, this causes a duplicate definition warning for
microblaze, powerpc (32-bit only) and sh as these architectures define
ARCH_DMA_MINALIGN in a different file than asm/cache.h. Move the macro
to asm/cache.h to avoid this issue and also bring them in line with the
other architectures.
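
To illustrate the layering (rough sketch only; the guard below is
illustrative, not the exact linux/cache.h hunk): linux/cache.h includes
asm/cache.h first, so an architecture override defined there is seen before
the generic fallback, whereas a definition left in asm/page.h is included
too late and triggers the redefinition warning:

/* linux/cache.h, simplified sketch */
#include <asm/cache.h>		/* arch may define ARCH_DMA_MINALIGN here */

#ifndef ARCH_DMA_MINALIGN
#define ARCH_DMA_MINALIGN	__alignof__(unsigned long long)
#endif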

Andrew, if the arch maintainers cc'ed are fine with such change, could
you please take these three patches together with the
ARCH_KMALLOC_MINALIGN series?

Thank you.

Catalin Marinas (3):
  powerpc: Move the ARCH_DMA_MINALIGN definition to asm/cache.h
  microblaze: Move the ARCH_{DMA,SLAB}_MINALIGN definitions to
asm/cache.h
  sh: Move the ARCH_DMA_MINALIGN definition to asm/cache.h

 arch/microblaze/include/asm/cache.h | 5 +
 arch/microblaze/include/asm/page.h  | 5 -
 arch/powerpc/include/asm/cache.h| 4 
 arch/powerpc/include/asm/page_32.h  | 4 
 arch/sh/include/asm/cache.h | 6 ++
 arch/sh/include/asm/page.h  | 6 --
 6 files changed, 15 insertions(+), 15 deletions(-)



Re: linux-next: build failure after merge of the mm tree

2023-06-13 Thread Catalin Marinas
Hi Stephen,

On Tue, Jun 13, 2023 at 04:21:19PM +1000, Stephen Rothwell wrote:
> After merging the mm tree, today's linux-next build (powerpc
> ppc44x_defconfig) failed like this:
> 
> In file included from arch/powerpc/include/asm/page.h:247,
>  from arch/powerpc/include/asm/thread_info.h:13,
>  from include/linux/thread_info.h:60,
>  from include/asm-generic/preempt.h:5,
>  from ./arch/powerpc/include/generated/asm/preempt.h:1,
>  from include/linux/preempt.h:78,
>  from include/linux/spinlock.h:56,
>  from include/linux/ipc.h:5,
>  from include/uapi/linux/sem.h:5,
>  from include/linux/sem.h:5,
>  from include/linux/compat.h:14,
>  from arch/powerpc/kernel/asm-offsets.c:12:
> arch/powerpc/include/asm/page_32.h:16: warning: "ARCH_DMA_MINALIGN" redefined
>16 | #define ARCH_DMA_MINALIGN   L1_CACHE_BYTES
>   | 
> In file included from include/linux/time.h:5,
>  from include/linux/compat.h:10:
> include/linux/cache.h:104: note: this is the location of the previous 
> definition
>   104 | #define ARCH_DMA_MINALIGN __alignof__(unsigned long long)
>   | 
> 
> (lots of these)
> 
> Caused by commit
> 
>   cc7335787e73 ("mm/slab: decouple ARCH_KMALLOC_MINALIGN from 
> ARCH_DMA_MINALIGN")
> 
> I have applied the following hack for today - we need something better.

I just posted this series fixing it for powerpc, microblaze and sh. I
did not add the #ifndef __powerpc64__ line since
CONFIG_NOT_COHERENT_CACHE should not be enabled for those builds.

https://lore.kernel.org/r/20230613155245.1228274-1-catalin.mari...@arm.com

> From: Stephen Rothwell 
> Date: Tue, 13 Jun 2023 16:07:16 +1000
> Subject: [PATCH] fix up for "mm/slab: decouple ARCH_KMALLOC_MINALIGN from 
> ARCH_DMA_MINALIGN"
> 
> Signed-off-by: Stephen Rothwell 
> ---
>  arch/powerpc/include/asm/cache.h | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/cache.h b/arch/powerpc/include/asm/cache.h
> index ae0a68a838e8..e9be1396dfd1 100644
> --- a/arch/powerpc/include/asm/cache.h
> +++ b/arch/powerpc/include/asm/cache.h
> @@ -142,5 +142,14 @@ static inline void iccci(void *addr)
>  }
>  
>  #endif /* !__ASSEMBLY__ */
> +
> +#ifndef __powerpc64__
> +#ifdef CONFIG_NOT_COHERENT_CACHE
> +#ifndef ARCH_DMA_MINALIGN
> +#define ARCH_DMA_MINALIGNL1_CACHE_BYTES
> +#endif
> +#endif
> +#endif
> +
>  #endif /* __KERNEL__ */
>  #endif /* _ASM_POWERPC_CACHE_H */

I think it should also remove the ARCH_DMA_MINALIGN from asm/page.h (as
I did in my series; sorry I did not cc you, only noticed now that you
reported it as well).

-- 
Catalin


Re: [PATCH] irq_work: consolidate arch_irq_work_raise prototypes

2023-05-25 Thread Catalin Marinas
On Tue, May 16, 2023 at 10:02:31PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> The prototype was hidden on x86, which causes a warning:
> 
> kernel/irq_work.c:72:13: error: no previous prototype for 
> 'arch_irq_work_raise' [-Werror=missing-prototypes]
> 
> Fix this by providing it in only one place that is always visible.
> 
> Signed-off-by: Arnd Bergmann 

Acked-by: Catalin Marinas 


Re: [PATCH 03/23] arm64/hugetlb: pte_alloc_huge() pte_offset_huge()

2023-05-25 Thread Catalin Marinas
On Tue, May 09, 2023 at 09:45:57PM -0700, Hugh Dickins wrote:
> pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
> that: to keep balance in future, use the recently added pte_alloc_huge()
> instead; with pte_offset_huge() a better name for pte_offset_kernel().
> 
> Signed-off-by: Hugh Dickins 

Acked-by: Catalin Marinas 


Re: [PATCH 02/23] arm64: allow pte_offset_map() to fail

2023-05-25 Thread Catalin Marinas
On Tue, May 09, 2023 at 09:43:47PM -0700, Hugh Dickins wrote:
> In rare transient cases, not yet made possible, pte_offset_map() and
> pte_offset_map_lock() may not find a page table: handle appropriately.
> 
> Signed-off-by: Hugh Dickins 

Acked-by: Catalin Marinas 


Re: [PATCH 13/14] thread_info: move function declarations to linux/thread_info.h

2023-05-24 Thread Catalin Marinas
On Wed, May 17, 2023 at 03:11:01PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> There are a few __weak functions in kernel/fork.c, which architectures
> can override. If there is no prototype, the compiler warns about them:
> 
> kernel/fork.c:164:13: error: no previous prototype for 
> 'arch_release_task_struct' [-Werror=missing-prototypes]
> kernel/fork.c:991:20: error: no previous prototype for 'arch_task_cache_init' 
> [-Werror=missing-prototypes]
> kernel/fork.c:1086:12: error: no previous prototype for 
> 'arch_dup_task_struct' [-Werror=missing-prototypes]
> 
> There are already prototypes in a number of architecture specific headers
> that have addressed those warnings before, but it's much better to have
> these in a single place so the warning no longer shows up anywhere.
> 
> Signed-off-by: Arnd Bergmann 

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH v3 02/14] arm64: drop ranges in definition of ARCH_FORCE_MAX_ORDER

2023-04-27 Thread Catalin Marinas
On Tue, Apr 25, 2023 at 11:09:58AM -0500, Justin Forbes wrote:
> On Tue, Apr 18, 2023 at 5:22 PM Andrew Morton  
> wrote:
> > On Wed, 12 Apr 2023 18:27:08 +0100 Catalin Marinas 
> >  wrote:
> > > > It sounds nice in theory. In practice, EXPERT hides too much. When you
> > > > flip EXPERT, you expose over 175-ish new config options which are
> > > > hidden behind it.  You have to know what you are doing not just
> > > > with MAX_ORDER, but with a whole bunch more as well.  If everyone were
> > > > already running 10, this might be less of a problem. At least Fedora
> > > > and RHEL are running 13 for 4K pages on aarch64. This was not some
> > > > accidental choice, we had to carry a patch to even allow it for a
> > > > while.  If this does go in as is, we will likely just carry a patch to
> > > > remove the "if EXPERT", but that is a bit of a disservice to users who
> > > > might be trying to debug something else upstream, bisecting upstream
> > > > kernels or testing a patch.  In those cases, people tend to use
> > > > pristine upstream sources without distro patches to verify, and they
> > > > tend to use their existing configs. With this change, their MAX_ORDER
> > > > will drop to 10 from 13 silently.  That can look enough like a different
> > > > issue to ruin a bisect or have them give bad feedback on a
> > > > patch because it introduces a "regression" which is not a regression
> > > > at all, but a config change they couldn't see.
> > >
> > > If we remove EXPERT (as prior to this patch), I'd rather keep the ranges
> > > and avoid having to explain to people why some random MAX_ORDER doesn't
> > > build (keeping the range would also make sense for randconfig, not sure
> > > we got to any conclusion there).
> >
> > Well this doesn't seem to have got anywhere.  I think I'll send the
> > patchset into Linus for the next merge window as-is.  Please let's take
> > a look at this Kconfig presentation issue during the following -rc
> > cycle.
> 
> Well, I am very sorry to see this going in as is.  It will silently
> change the config for people building with oldconfig, and anyone not paying attention
> will not notice until an issue is hit where "it worked before, and my
> config hasn't changed".  If EXPERT is unset, there is no notification,
> just a changed behavior.  While it would be easy for me to carry a
> patch dropping the if EXPERT, it will not help any users building on
> upstream with our configs, whether for their own regular use, or while
> trying to debug other issues.  I expect it will result in a reasonable
> amount of frustration from users trying to do the right thing and
> bisect or test patches upstream.

As I said in a previous reply, I'm fine with reverting this commit if it
breaks existing configs. It's only that Andrew had already queued it in
his tree but we have time until the final 6.4 kernel is released.

That said, would you mind sending a patch reverting it (if removing
EXPERT, I'd like to keep the ranges)? ;)

Thanks.

-- 
Catalin


Re: [PATCH v3 02/14] arm64: drop ranges in definition of ARCH_FORCE_MAX_ORDER

2023-04-19 Thread Catalin Marinas
On Tue, Apr 18, 2023 at 03:05:57PM -0700, Andrew Morton wrote:
> On Wed, 12 Apr 2023 18:27:08 +0100 Catalin Marinas  
> wrote:
> > > It sounds nice in theory. In practice, EXPERT hides too much. When you
> > > flip EXPERT, you expose over 175-ish new config options which are
> > > hidden behind it.  You have to know what you are doing not just
> > > with MAX_ORDER, but with a whole bunch more as well.  If everyone were
> > > already running 10, this might be less of a problem. At least Fedora
> > > and RHEL are running 13 for 4K pages on aarch64. This was not some
> > > accidental choice, we had to carry a patch to even allow it for a
> > > while.  If this does go in as is, we will likely just carry a patch to
> > > remove the "if EXPERT", but that is a bit of a disservice to users who
> > > might be trying to debug something else upstream, bisecting upstream
> > > kernels or testing a patch.  In those cases, people tend to use
> > > pristine upstream sources without distro patches to verify, and they
> > > tend to use their existing configs. With this change, their MAX_ORDER
> > > will drop to 10 from 13 silently.  That can look enough like a different
> > > issue to ruin a bisect or have them give bad feedback on a
> > > patch because it introduces a "regression" which is not a regression
> > > at all, but a config change they couldn't see.
> > 
> > If we remove EXPERT (as prior to this patch), I'd rather keep the ranges
> > and avoid having to explain to people why some random MAX_ORDER doesn't
> > build (keeping the range would also make sense for randconfig, not sure
> > we got to any conclusion there).
> 
> Well this doesn't seem to have got anywhere.  I think I'll send the
> patchset into Linus for the next merge window as-is.  Please let's take
> a look at this Kconfig presentation issue during the following -rc
> cycle.

That's fine by me. I have a slight preference to drop EXPERT and keep
the ranges in, especially if it affects current distro kernels. Debian
seems to enable EXPERT already in their arm64 kernel config but I'm not
sure about Fedora or other distro kernels. If they don't, we can
fix/revert this Kconfig entry once the merge window has closed.

-- 
Catalin


Re: [PATCH v3 02/14] arm64: drop ranges in definition of ARCH_FORCE_MAX_ORDER

2023-04-12 Thread Catalin Marinas
On Tue, Apr 04, 2023 at 06:50:01AM -0500, Justin Forbes wrote:
> On Tue, Apr 4, 2023 at 2:22 AM Mike Rapoport  wrote:
> > On Wed, Mar 29, 2023 at 10:55:37AM -0500, Justin Forbes wrote:
> > > On Sat, Mar 25, 2023 at 1:09 AM Mike Rapoport  wrote:
> > > >
> > > > From: "Mike Rapoport (IBM)" 
> > > >
> > > > It is not a good idea to change fundamental parameters of core memory
> > > > management. Having predefined ranges suggests that the values within
> > > > those ranges are sensible, but one has to *really* understand
> > > > implications of changing MAX_ORDER before actually amending it and
> > > > ranges don't help here.
> > > >
> > > > Drop ranges in definition of ARCH_FORCE_MAX_ORDER and make its prompt
> > > > visible only if EXPERT=y
> > >
> > > I do not like suddenly hiding this behind EXPERT for a couple of
> > > reasons.  Most importantly, it will silently change the config for
> > > users building with an old kernel config.  If a user has for instance
> > > "13" set and building with 4K pages, as is the current configuration
> > > for Fedora and RHEL aarch64 builds, an oldconfig build will now set it
> > > to 10 with no indication that it is doing so.  And while I think that
> > > 10 is a fine default for many aarch64 users, there are valid reasons
> > > for choosing other values. Putting this behind expert makes it much
> > > less obvious that this is an option.
> >
> > That's the idea of EXPERT, no?
> >
> > This option was intended to allow allocation of huge pages for
> > architectures that had PMD_ORDER > MAX_ORDER and not to allow user to
> > select size of maximal physically contiguous allocation.
> >
> > Changes to MAX_ORDER fundamentally change the behaviour of core mm and
> > unless users *really* know what they are doing there is no reason to choose
> > non-default values so hiding this option behind EXPERT seems totally
> > appropriate to me.
> 
> It sounds nice in theory. In practice, EXPERT hides too much. When you
> flip EXPERT, you expose over 175-ish new config options which are
> hidden behind it.  You have to know what you are doing not just
> with MAX_ORDER, but with a whole bunch more as well.  If everyone were
> already running 10, this might be less of a problem. At least Fedora
> and RHEL are running 13 for 4K pages on aarch64. This was not some
> accidental choice, we had to carry a patch to even allow it for a
> while.  If this does go in as is, we will likely just carry a patch to
> remove the "if EXPERT", but that is a bit of a disservice to users who
> might be trying to debug something else upstream, bisecting upstream
> kernels or testing a patch.  In those cases, people tend to use
> pristine upstream sources without distro patches to verify, and they
> tend to use their existing configs. With this change, their MAX_ORDER
> will drop to 10 from 13 silently.  That can look enough like a different
> issue to ruin a bisect or have them give bad feedback on a
> patch because it introduces a "regression" which is not a regression
> at all, but a config change they couldn't see.

If we remove EXPERT (as prior to this patch), I'd rather keep the ranges
and avoid having to explain to people why some random MAX_ORDER doesn't
build (keeping the range would also make sense for randconfig, not sure
we got to any conclusion there).

-- 
Catalin


Re: [PATCH 18/21] ARM: drop SMP support for ARM11MPCore

2023-03-31 Thread Catalin Marinas
On Mon, Mar 27, 2023 at 02:13:14PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> The cache management operations for noncoherent DMA on ARMv6 work
> in two different ways:
> 
>  * When CONFIG_DMA_CACHE_RWFO is set, speculative prefetches on in-flight
>    DMA buffers lead to data corruption when the prefetched data is written
>    back on top of data from the device.
> 
>  * When CONFIG_DMA_CACHE_RWFO is disabled, a cache flush on one CPU
>    is not seen by the other core(s), leading to inconsistent contents
>    across the system.
> 
> As a consequence, neither configuration is actually safe to use in a
> general-purpose kernel that is used on both MPCore systems and ARM1176
> with prefetching enabled.

As the author of this terrible hack (created under duress ;)):

Acked-by: Catalin Marinas 

IIRC, RWFO works in combination with the cache operations. Because the
cache maintenance was not broadcast, we forced the cache lines to migrate
to a CPU via a write (for ownership) and then did the cache maintenance on
that CPU (that was the FROM_DEVICE case). For the TO_DEVICE case, reading
on a CPU would cause dirty lines on another CPU to be evicted (or migrated
as dirty to the current CPU IIRC), followed by the cache maintenance to
clean them to PoC on the local CPU.

But there's always a small window between read/write for ownership and
the actual cache maintenance which can cause a cache line to migrate to
other CPUs if they do speculative prefetches. At the time ARM11MPCore
was deemed safe-ish but I haven't followed what later implementations
actually did (luckily we fixed the architecture in ARMv7).
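Roughly, the idea was something like this (purely illustrative sketch,
helper names invented, not the actual arch/arm code; assumes the usual
kernel context for L1_CACHE_BYTES and READ_ONCE/WRITE_ONCE):

/* TO_DEVICE: read for ownership, then clean to PoC locally */
static void rwfo_clean_for_device(const char *buf, size_t size)
{
        size_t off;

        /* pull each line to this CPU; dirty copies elsewhere get evicted */
        for (off = 0; off < size; off += L1_CACHE_BYTES)
                (void)READ_ONCE(buf[off]);
        local_clean_dcache_to_poc(buf, size);   /* non-broadcast clean (invented helper) */
}

/* FROM_DEVICE: write for ownership, then invalidate locally */
static void rwfo_inval_from_device(char *buf, size_t size)
{
        size_t off;

        /* force the lines to migrate here so the local op covers them */
        for (off = 0; off < size; off += L1_CACHE_BYTES)
                WRITE_ONCE(buf[off], READ_ONCE(buf[off]));
        local_inval_dcache(buf, size);          /* non-broadcast invalidate (invented helper) */
}

The gap between the ownership access and the maintenance operation is
exactly the window where a speculative prefetch on another CPU can pull the
line back.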

-- 
Catalin


Re: [PATCH 00/21] dma-mapping: unify support for cache flushes

2023-03-31 Thread Catalin Marinas
On Mon, Mar 27, 2023 at 02:12:56PM +0200, Arnd Bergmann wrote:
> Another difference that I do not address here is what cache invalidation
> does for partial cache lines. On arm32, arm64 and powerpc, a partial
> cache line always gets written back before invalidation in order to
> ensure that data before or after the buffer is not discarded. On all
> other architectures, the assumption is cache lines are never shared
> between DMA buffer and data that is accessed by the CPU.

I don't think sharing the DMA buffer with other data is safe even with
this clean+invalidate on the unaligned cache lines. Mapping the DMA buffer
as FROM_DEVICE or BIDIRECTIONAL can cause the shared cache line to be
evicted and overwrite the device-written data. This sharing only works if
the CPU guarantees not to dirty the corresponding cache line.

I'm fine with removing this partial cache line hack from arm64 as it's
not safe anyway. We'll see if any driver stops working. If there's some
benign sharing (I wouldn't trust it), the cache cleaning prior to
mapping and invalidate on unmap would not lose any data.
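
To make the hazard concrete, a hypothetical layout (type and field names
invented):

struct foo {
        unsigned long   counter;        /* updated by the CPU at any time */
        char            dma_buf[8];     /* mapped DMA_FROM_DEVICE */
};                                      /* both fields share one cache line */

While dma_buf is mapped, any store to 'counter' dirties the shared line; a
later eviction (or the clean half of a clean+invalidate) writes the whole
line back and can overwrite whatever the device has placed in dma_buf.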

-- 
Catalin


Re: [PATCH 03/14] arm64: reword ARCH_FORCE_MAX_ORDER prompt and help text

2023-03-23 Thread Catalin Marinas
On Thu, Mar 23, 2023 at 11:21:45AM +0200, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> The prompt and help text of ARCH_FORCE_MAX_ORDER are not even close to
> describe this configuration option.
> 
> Update both to actually describe what this option does.
> 
> Signed-off-by: Mike Rapoport (IBM) 

Acked-by: Catalin Marinas 


Re: [PATCH 02/14] arm64: drop ranges in definition of ARCH_FORCE_MAX_ORDER

2023-03-23 Thread Catalin Marinas
On Thu, Mar 23, 2023 at 11:21:44AM +0200, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> It is not a good idea to change fundamental parameters of core memory
> management. Having predefined ranges suggests that the values within
> those ranges are sensible, but one has to *really* understand
> implications of changing MAX_ORDER before actually amending it and
> ranges don't help here.
> 
> Drop ranges in definition of ARCH_FORCE_MAX_ORDER
> 
> Signed-off-by: Mike Rapoport (IBM) 
> ---
>  arch/arm64/Kconfig | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index e60baf7859d1..bab6483e4317 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -1489,9 +1489,7 @@ config XEN
>  config ARCH_FORCE_MAX_ORDER
>   int "Maximum zone order" if ARM64_4K_PAGES || ARM64_16K_PAGES
>   default "13" if ARM64_64K_PAGES
> - range 11 13 if ARM64_16K_PAGES
>   default "11" if ARM64_16K_PAGES
> - range 10 15 if ARM64_4K_PAGES
>   default "10"

I don't mind rewriting the help text as in the subsequent patch but I'd
keep the ranges as a safety measure. It's less wasted time explaining to
people why some random max order doesn't work. Alternatively, we can
drop the ranges but make this option configurable only if EXPERT.
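
i.e. something like the below for the second alternative (rough sketch, not
a tested hunk):

config ARCH_FORCE_MAX_ORDER
	int "Maximum zone order" if EXPERT && (ARM64_4K_PAGES || ARM64_16K_PAGES)
	default "13" if ARM64_64K_PAGES
	default "11" if ARM64_16K_PAGES
	default "10"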

-- 
Catalin


Re: [PATCH] mm: add PTE pointer parameter to flush_tlb_fix_spurious_fault()

2023-03-06 Thread Catalin Marinas
On Mon, Mar 06, 2023 at 05:15:48PM +0100, Gerald Schaefer wrote:
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index b6ba466e2e8a..0bd18de9fd97 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -57,7 +57,7 @@ static inline bool arch_thp_swp_supported(void)
>   * fault on one CPU which has been handled concurrently by another CPU
>   * does not need to perform additional invalidation.
>   */
> -#define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
> +#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)

For arm64:

Acked-by: Catalin Marinas 

> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 2c70b4d1263d..c1f6b46ec555 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1239,7 +1239,8 @@ static inline int pte_allow_rdp(pte_t old, pte_t new)
>  }
>  
>  static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
> - unsigned long address)
> + unsigned long address,
> + pte_t *ptep)
>  {
>   /*
>* RDP might not have propagated the PTE protection reset to all CPUs,
> @@ -1247,11 +1248,12 @@ static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
>* NOTE: This will also be called when a racing pagetable update on
>* another thread already installed the correct PTE. Both cases cannot
>* really be distinguished.
> -  * Therefore, only do the local TLB flush when RDP can be used, to avoid
> -  * unnecessary overhead.
> +  * Therefore, only do the local TLB flush when RDP can be used, and the
> +  * PTE does not have _PAGE_PROTECT set, to avoid unnecessary overhead.
> +  * A local RDP can be used to do the flush.
>*/
> - if (MACHINE_HAS_RDP)
> - asm volatile("ptlb" : : : "memory");
> + if (MACHINE_HAS_RDP && !(pte_val(*ptep) & _PAGE_PROTECT))
> + __ptep_rdp(address, ptep, 0, 0, 1);

I wonder whether passing the actual entry is somewhat quicker as it
avoids another memory access (though it might already be in the cache).
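
Something along these lines (untested sketch, just to illustrate the
suggestion of passing the entry the caller already holds):

static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
						unsigned long address,
						pte_t *ptep, pte_t entry)
{
	/* 'entry' is what the caller already read, so no second *ptep load */
	if (MACHINE_HAS_RDP && !(pte_val(entry) & _PAGE_PROTECT))
		__ptep_rdp(address, ptep, 0, 0, 1);
}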

-- 
Catalin


Re: [PATCH v3 02/24] arm64: Remove COMMAND_LINE_SIZE from uapi

2023-02-14 Thread Catalin Marinas
On Tue, Feb 14, 2023 at 08:49:03AM +0100, Alexandre Ghiti wrote:
> From: Palmer Dabbelt 
> 
> As far as I can tell this is not used by userspace and thus should not
> be part of the user-visible API.
> 
> Signed-off-by: Palmer Dabbelt 

Acked-by: Catalin Marinas 


Re: (subset) [PATCH 00/35] Documentation: correct lots of spelling errors (series 1)

2023-01-31 Thread Catalin Marinas
On Thu, 26 Jan 2023 22:39:30 -0800, Randy Dunlap wrote:
> Correct many spelling errors in Documentation/ as reported by codespell.
> 
> Maintainers of specific kernel subsystems are only Cc-ed on their
> respective patches, not the entire series. [if all goes well]
> 
> These patches are based on linux-next-20230125.
> 
> [...]

Applied to arm64 (for-next/misc), thanks!

[01/35] Documentation: arm64: correct spelling
https://git.kernel.org/arm64/c/a70f00e7f1a3

-- 
Catalin



Re: [PATCH v7 2/2] arm64: support batched/deferred tlb shootdown during page reclamation

2023-01-09 Thread Catalin Marinas
On Sun, Jan 08, 2023 at 06:48:41PM +0800, Barry Song wrote:
> On Fri, Jan 6, 2023 at 2:15 AM Catalin Marinas  
> wrote:
> > On Thu, Nov 17, 2022 at 04:26:48PM +0800, Yicong Yang wrote:
> > It has been tested on 4-, 8- and 128-CPU platforms and shows benefits on
> > large systems but may not show any improvement on small systems such as
> > a 4-CPU platform. So make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depend
> > on CONFIG_EXPERT for this stage and disable it on systems
> > with fewer than 8 CPUs. Users can modify this threshold according to
> > their own platforms via CONFIG_NR_CPUS_FOR_BATCHED_TLB.
> >
> > What's the overhead of such batching on systems with 4 or fewer CPUs? If
> > it isn't noticeable, I'd rather have it always on than some number
> > chosen on whichever SoC you tested.
> 
> On the one hand, TLB flush is cheap on a small system, so batching TLB flushes
> helps only marginally.

Yes, it probably won't help on small systems but I don't like config
options choosing the threshold, which may be different from system to
system even if they have the same number of CPUs. A run-time tunable
would be a better option.

> On the other hand, since we have batched the tlb flush, new PTEs might be
> invisible to others before the final broadcast is done and Ack-ed.

The new PTEs could indeed be invisible at the TLB level but not at the
memory (page table) level since this is done under the PTL IIUC.

> thus, there
> is a risk that someone else might do mprotect or similar things on those deferred
> pages, which will require a read-modify-write on those deferred PTEs.

And this should be fine, we have things like the PTL in place for the
actual memory access to the page table.

> in this
> case, mm will do an explicit flush via flush_tlb_batched_pending(), which is
> not required if the TLB flush is not deferred.

I don't fully understand why it's needed, or at least why it would be
needed on arm64. At the end of an mprotect(), we have the final PTEs in
place and we just need to issue a TLBI for that range.
change_pte_range() for example has a tlb_flush_pte_range() if the PTE
was present and that won't be done lazily. If there are other TLBIs
pending for the same range, they'll be done later though likely
unnecessarily but still cheaper than issuing a flush_tlb_mm().

> void flush_tlb_batched_pending(struct mm_struct *mm)
> {
>         int batch = atomic_read(&mm->tlb_flush_batched);
>         int pending = batch & TLB_FLUSH_BATCH_PENDING_MASK;
>         int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;
> 
>         if (pending != flushed) {
>                 flush_tlb_mm(mm);
>                 /*
>                  * If the new TLB flushing is pending during flushing, leave
>                  * mm->tlb_flush_batched as is, to avoid losing flushing.
>                  */
>                 atomic_cmpxchg(&mm->tlb_flush_batched, batch,
>                                pending | (pending << TLB_FLUSH_BATCH_FLUSHED_SHIFT));
>         }
> }

I guess this works better on x86 as it avoids the IPIs if this flush
already happened. But on arm64 we already issued the TLBI, we just
didn't wait for it to complete via a DSB.

> I believe Anshuman has contributed many points on this in those previous
> discussions.

Yeah, I should re-read the old threads.

-- 
Catalin


Re: [PATCH v7 2/2] arm64: support batched/deferred tlb shootdown during page reclamation

2023-01-05 Thread Catalin Marinas
On Thu, Nov 17, 2022 at 04:26:48PM +0800, Yicong Yang wrote:
> It has been tested on 4-, 8- and 128-CPU platforms and shows benefits on
> large systems but may not show any improvement on small systems such as
> a 4-CPU platform. So make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depend
> on CONFIG_EXPERT for this stage and disable it on systems
> with fewer than 8 CPUs. Users can modify this threshold according to
> their own platforms via CONFIG_NR_CPUS_FOR_BATCHED_TLB.

What's the overhead of such batching on systems with 4 or fewer CPUs? If
it isn't noticeable, I'd rather have it always on than some number
chosen on whichever SoC you tested.

Another option would be to make this a sysctl tunable.
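
For example (rough sketch; the sysctl name, default and placement are
invented for illustration, and the pieces would live in arch code rather
than a single file):

static unsigned int tlb_batch_min_cpus __read_mostly = 8;

static struct ctl_table tlb_batch_sysctl_table[] = {
	{
		.procname	= "tlb_batch_min_cpus",
		.data		= &tlb_batch_min_cpus,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_douintvec,
	},
	{ }
};

static int __init tlb_batch_sysctl_init(void)
{
	register_sysctl("vm", tlb_batch_sysctl_table);
	return 0;
}
late_initcall(tlb_batch_sysctl_init);

static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
{
	/* run-time knob instead of the Kconfig constant */
	return num_online_cpus() >= READ_ONCE(tlb_batch_min_cpus);
}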

>  .../features/vm/TLB/arch-support.txt  |  2 +-
>  arch/arm64/Kconfig|  6 +++
>  arch/arm64/include/asm/tlbbatch.h | 12 +
>  arch/arm64/include/asm/tlbflush.h | 52 ++-
>  arch/x86/include/asm/tlbflush.h   |  5 +-
>  include/linux/mm_types_task.h |  4 +-
>  mm/rmap.c | 10 ++--

Please keep any function prototype changes in a preparatory patch so
that the arm64 one only introduces the arch specific changes. Easier to
review.

> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> +{
> + /*
> +  * Batched TLB flush has proven to be beneficial for systems with a large
> +  * number of CPUs, especially systems with more than 8 CPUs. TLB shootdown
> +  * is cheap on small systems which may not need this feature. So use
> +  * a threshold for enabling this to avoid potential side effects on
> +  * these platforms.
> +  */
> + if (num_online_cpus() < CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB)
> + return false;

The x86 implementation tracks the cpumask of where a task has run. We
don't have such tracking on arm64 and I don't think it matters. As
noticed/described in this series, the bottleneck is the actual DSB
synchronisation (which sends a DVM Sync message to all the other CPUs
and waits for a DVM Complete response). So I think it makes sense not to
bother with an mm_cpumask(). What this patch aims to optimise is
actually the number of DSBs issued on an SMP system by
ptep_clear_flush().
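
Conceptually, the batching boils down to something like this (names are
approximate and the no-DSB helper is hypothetical; this is a sketch of the
idea, not the actual patch): each page added to the batch gets its TLBI
issued immediately but without the trailing DSB, and a single DSB is issued
when the whole batch is flushed:

static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
					struct mm_struct *mm,
					unsigned long uaddr)
{
	/* issue the TLBI for this address, no DSB yet */
	__flush_tlb_page_nosync(mm, uaddr);	/* hypothetical no-DSB variant */
}

static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
	/* one barrier (DVM Sync/Complete round trip) for the whole batch */
	dsb(ish);
}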

The DVM is not an architected concept (well, it's part of AMBA AXI). I'd
be curious to know how such patch behaves on Apple's M1/M2 hardware. My
preference would be to have this always on for num_online_cpus() > 1 if
there's no overhead.

-- 
Catalin

