Re: [PATCH] tracing: Inform kmemleak of saved_cmdlines allocation

2024-02-19 Thread Catalin Marinas
On Wed, Feb 14, 2024 at 10:26:14AM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> The allocation of the struct saved_cmdlines_buffer structure changed from:
> 
> 	s = kmalloc(sizeof(*s), GFP_KERNEL);
> 	s->saved_cmdlines = kmalloc_array(TASK_COMM_LEN, val, GFP_KERNEL);
> 
> to:
> 
> 	orig_size = sizeof(*s) + val * TASK_COMM_LEN;
> 	order = get_order(orig_size);
> 	size = 1 << (order + PAGE_SHIFT);
> 	page = alloc_pages(GFP_KERNEL, order);
> 	if (!page)
> 		return NULL;
> 
> 	s = page_address(page);
> 	memset(s, 0, sizeof(*s));
> 
> 	s->saved_cmdlines = kmalloc_array(TASK_COMM_LEN, val, GFP_KERNEL);
> 
> That s->saved_cmdlines allocation looks like a dangling allocation
> to kmemleak. That's because kmemleak only keeps track of kmalloc()
> allocations. For allocations that use alloc_pages() directly, kmemleak
> needs to be explicitly informed about them.
> 
> Add kmemleak_alloc() and kmemleak_free() around the page allocation so
> that it doesn't give the following false positive:
> 
> unreferenced object 0x8881010c8000 (size 32760):
>   comm "swapper", pid 0, jiffies 4294667296
>   hex dump (first 32 bytes):
> ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  
> ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  
>   backtrace (crc ae6ec1b9):
> [] kmemleak_alloc+0x45/0x80
> [] __kmalloc_large_node+0x10d/0x190
> [] __kmalloc+0x3b1/0x4c0
> [] allocate_cmdlines_buffer+0x113/0x230
> [] tracer_alloc_buffers.isra.0+0x124/0x460
> [] early_trace_init+0x14/0xa0
> [] start_kernel+0x12e/0x3c0
> [] x86_64_start_reservations+0x18/0x30
> [] x86_64_start_kernel+0x7b/0x80
> [] secondary_startup_64_no_verify+0x15e/0x16b
> 
> Link: https://lore.kernel.org/linux-trace-kernel/87r0hfnr9r@kernel.org/
> 
> Fixes: 44dc5c41b5b1 ("tracing: Fix wasted memory in saved_cmdlines logic")
> Reported-by: Kalle Valo 
> Signed-off-by: Steven Rostedt (Google) 

Reviewed-by: Catalin Marinas 
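
The pattern the changelog describes, as a minimal sketch (the struct
layout, helper names and error handling here are illustrative only, not
the actual kernel/trace/trace.c hunk):

#include <linux/kmemleak.h>
#include <linux/mm.h>
#include <linux/sched.h>	/* TASK_COMM_LEN */
#include <linux/slab.h>

struct saved_cmdlines_buffer {
	char *saved_cmdlines;
	/* other fields omitted in this sketch */
};

static struct saved_cmdlines_buffer *alloc_cmdlines(unsigned int val, int *order)
{
	struct saved_cmdlines_buffer *s;
	struct page *page;
	unsigned long size;

	*order = get_order(sizeof(*s) + val * TASK_COMM_LEN);
	size = 1UL << (*order + PAGE_SHIFT);
	page = alloc_pages(GFP_KERNEL, *order);
	if (!page)
		return NULL;

	s = page_address(page);
	/*
	 * Raw page allocations are invisible to kmemleak; register the
	 * object so the kmalloc() pointer stored in s->saved_cmdlines is
	 * scanned and not reported as unreferenced.
	 */
	kmemleak_alloc(s, size, 1, GFP_KERNEL);
	memset(s, 0, sizeof(*s));

	s->saved_cmdlines = kmalloc_array(TASK_COMM_LEN, val, GFP_KERNEL);
	if (!s->saved_cmdlines) {
		kmemleak_free(s);
		free_pages((unsigned long)s, *order);
		return NULL;
	}
	return s;
}

static void free_cmdlines(struct saved_cmdlines_buffer *s, int order)
{
	kfree(s->saved_cmdlines);
	kmemleak_free(s);	/* mirror the kmemleak_alloc() above */
	free_pages((unsigned long)s, order);
}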



Re: [PATCH 4/5] kbuild: unify vdso_install rules

2023-10-27 Thread Catalin Marinas
On Mon, Oct 09, 2023 at 09:42:09PM +0900, Masahiro Yamada wrote:
> Currently, there is no standard implementation for vdso_install,
> leading to various issues:
[...]
>  arch/arm64/Makefile|  9 ++
>  arch/arm64/kernel/vdso/Makefile| 10 --
>  arch/arm64/kernel/vdso32/Makefile  | 10 --

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse

2023-10-27 Thread Catalin Marinas
On Wed, Oct 25, 2023 at 05:52:58PM +0900, Hyesoo Yu wrote:
> If we only avoid using ALLOC_CMA for __GFP_TAGGED, would we still be
> able to use the next iteration even if the hardware does not support
> "tag of tag"?

It depends on what the next iteration looks like. The plan was not to
support this so that we avoid another complication where a non-tagged
page is mprotect'ed to become tagged and it would need to be migrated
out of the CMA range. Not sure how much code it would save.

> I am not sure every vendor will support tag of tag, since there is no
> information related to that feature, like in the Google spec document.

If you are aware of any vendors not supporting this, please direct them
to the Arm support team, it would be very useful information for us.

Thanks.

-- 
Catalin



Re: [PATCH RFC 06/37] mm: page_alloc: Allocate from movable pcp lists only if ALLOC_FROM_METADATA

2023-10-17 Thread Catalin Marinas
On Mon, Oct 16, 2023 at 01:41:15PM +0100, Alexandru Elisei wrote:
> On Thu, Oct 12, 2023 at 10:25:11AM +0900, Hyesoo Yu wrote:
> > I don't think it would be efficient when the majority of movable pages
> > do not use GFP_TAGGED.
> > 
> > Metadata pages have a low probability of being in the pcp list
> > because metadata pages are bypassed when freeing pages.
> > 
> > The allocation performance of most movable pages is likely to decrease
> > if only requests with ALLOC_FROM_METADATA can allocate from them.
> 
> You're right, I hadn't considered that.
> 
> > 
> > How about not including metadata pages in the pcp list at all ?
> 
> Sounds reasonable, I will keep it in mind for the next iteration of the
> series.

BTW, I suggest for the next iteration we drop MIGRATE_METADATA, only use
CMA and assume that the tag storage itself supports tagging. Hopefully
it makes the patches a bit simpler.

-- 
Catalin



Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse

2023-09-14 Thread Catalin Marinas
Hi Kuan-Ying,

On Wed, Sep 13, 2023 at 08:11:40AM +, Kuan-Ying Lee (李冠穎) wrote:
> On Wed, 2023-08-23 at 14:13 +0100, Alexandru Elisei wrote:
> > diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> > b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> > index 60472d65a355..bd050373d6cf 100644
> > --- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> > +++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
> > @@ -165,10 +165,28 @@ C1_L2: l2-cache1 {
> > };
> > };
> >  
> > -	memory@80000000 {
> > +	memory0: memory@80000000 {
> > 		device_type = "memory";
> > -		reg = <0x00000000 0x80000000 0 0x80000000>,
> > -		      <0x00000008 0x80000000 0 0x80000000>;
> > +		reg = <0x00 0x80000000 0x00 0x7c000000>;
> > +	};
> > +
> > +	metadata0: metadata@c0000000 {
> > +		compatible = "arm,mte-tag-storage";
> > +		reg = <0x00 0xfc000000 0x00 0x3e00000>;
> > +		block-size = <0x1000>;
> > +		memory = <&memory0>;
> > +	};
> > +
> > +	memory1: memory@880000000 {
> > +		device_type = "memory";
> > +		reg = <0x08 0x80000000 0x00 0x7c000000>;
> > +	};
> > +
> > +	metadata1: metadata@8c0000000 {
> > +		compatible = "arm,mte-tag-storage";
> > +		reg = <0x08 0xfc000000 0x00 0x3e00000>;
> > +		block-size = <0x1000>;
> > +		memory = <&memory1>;
> > 	};
> >  
> 
> AFAIK, the above memory configuration means that there are two regions
> of DRAM (0x80000000-0xfc000000 and 0x8_80000000-0x8_fc000000) and this
> is called the PDD memory map.
> 
> Document [1] says there are some constraints on tag memory, as below.
> 
> | The following constraints apply to the tag regions in DRAM:
> | 1. The tag region cannot be interleaved with the data region.
> | The tag region must also be above the data region within DRAM.
> |
> | 2. The tag region in the physical address space cannot straddle
> | multiple regions of a memory map.
> |
> | PDD memory map is not allowed to have part of the tag region between
> | 2GB-4GB and another part between 34GB-64GB.
> 
> I'm not sure if we can separate tag memory with the above
> configuration. Or am I missing something?
> 
> [1] https://developer.arm.com/documentation/101569/0300/?lang=en
> (Section 5.4.6.1)

Good point, thanks. The above dts is some random layout we picked as an
example; it doesn't match any real hardware and we didn't pay attention
to the interconnect limitations (we fake the tag storage on the model).

I'll try to dig out how the mtu_tag_addr_shutter registers work and how
the sparse DRAM space is compressed to a smaller tag range. But that's
something done by firmware and the kernel only learns the tag storage
location from the DT (provided by firmware). We also don't need to know
the fine-grained mapping between 32 bytes of data and 1 byte (2 tags) in
the tag storage, only the block size in the tag storage space that
covers all interleaving done by the interconnect (it can be from 1 byte
to something larger like a page; the kernel will then use the lowest
common multiple between a page size and this tag block size to figure
out how many pages to reserve).
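
As a rough sketch of that last point (the helper names are made up for
illustration, this is not code from the series):

#include <linux/lcm.h>
#include <linux/mm.h>

/*
 * The reservation unit in the tag storage region: the lowest common
 * multiple of the kernel page size and the DT "block-size" that covers
 * the interconnect interleaving.
 */
static unsigned long tag_block_granule(unsigned long dt_block_size)
{
	return lcm(PAGE_SIZE, dt_block_size);
}

static unsigned long tag_pages_to_reserve(unsigned long dt_block_size)
{
	/* e.g. 4K pages and block-size = <0x1000> -> one page at a time */
	return tag_block_granule(dt_block_size) / PAGE_SIZE;
}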

-- 
Catalin


Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse

2023-09-13 Thread Catalin Marinas
On Mon, Sep 11, 2023 at 02:29:03PM +0200, David Hildenbrand wrote:
> On 11.09.23 13:52, Catalin Marinas wrote:
> > On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote:
> > > On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote:
> > > > On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote:
> > > > > On 24.08.23 13:06, David Hildenbrand wrote:
> > > > > > Regarding one complication: "The kernel needs to know where to 
> > > > > > allocate
> > > > > > a PROT_MTE page from or migrate a current page if it becomes 
> > > > > > PROT_MTE
> > > > > > (mprotect()) and the range it is in does not support tagging.",
> > > > > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it
> > > > > > doesn't support tagging. You have to migrate to a !CMA page (for
> > > > > > example, not specifying GFP_MOVABLE as a quick way to achieve that).
> > > > > 
> > > > > Okay, I now realize that this patch set effectively duplicates some 
> > > > > CMA
> > > > > behavior using a new migrate-type.
> > [...]
> > > I considered mixing the tag storage memory with normal memory and
> > > adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged,
> > > this means that it's not enough anymore to have a __GFP_MOVABLE allocation
> > > request to use MIGRATE_CMA.
> > > 
> > > I considered two solutions to this problem:
> > > 
> > > 1. Only allocate from MIGRATE_CMA if the requested memory is not tagged =>
> > > this effectively means transforming all memory from MIGRATE_CMA into the
> > > MIGRATE_METADATA migratetype that the series introduces. Not very
> > > appealing, because that means treating normal memory that is also on the
> > > MIGRATE_CMA lists as tagged memory.
> > 
> > That's indeed not ideal. We could try this if it makes the patches
> > significantly simpler, though I'm not so sure.
> > 
> > Allocating metadata is the easier part as we know the correspondence
> > from the tagged pages (32 PROT_MTE pages) to the metadata page (1 tag
> > storage page), so alloc_contig_range() does this for us. Just adding it
> > to the CMA range is sufficient.
> > 
> > However, making sure that we don't allocate PROT_MTE pages from the
> > metadata range is what led us to another migrate type. I guess we could
> > achieve something similar with a new zone or a CPU-less NUMA node,
> 
> Ideally, no significant core-mm changes to optimize for an architecture
> oddity. That implies, no new zones and no new migratetypes -- unless it is
> unavoidable and you are confident that you can convince core-MM people that
> the use case (giving back 3% of system RAM at max in some setups) is worth
> the trouble.

If I was an mm maintainer, I'd also question this ;). But vendors seem
pretty picky about the amount of RAM reserved for MTE (e.g. 0.5G for a
16G platform does look somewhat big). As more and more apps adopt MTE,
the wastage would be smaller but the first step is getting vendors to
enable it.

> I also had CPU-less NUMA nodes in mind when thinking about that, but not
> sure how easy it would be to integrate it. If the tag memory has actually
> different performance characteristics as well, a NUMA node would be the
> right choice.

In general I'd expect the same characteristics. However, changing the
memory designation from tag to data (and vice-versa) requires some cache
maintenance. The allocation cost is slightly higher (not the runtime
one), so it would help if the page allocator does not favour this range.
Anyway, that's an optimisation to worry about later.

> If we could find some way to easily support this either via CMA or CPU-less
> NUMA nodes, that would be much preferable; even if we cannot cover each and
> every future use case right now. I expect some issues with CXL+MTE either
> way, but am happy to be taught otherwise :)

I think CXL+MTE is rather theoretical at the moment. Given that PCIe
doesn't have any notion of MTE, more likely there would be some piece of
interconnect that generates two memory accesses: one for data and the
other for tags at a configurable offset (which may or may not be in the
same CXL range).

> Another thought I had was adding something like CMA memory characteristics.
> Like, asking if a given CMA area/page supports tagging (i.e., flag for the
> CMA area set?)?

I don't think adding CMA memory characteristics helps much. The metadata
allocation wouldn't go through cma_alloc() but rather
alloc_contig_range() directly for a specific pfn corresponding to the tagged pages.

Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse

2023-09-11 Thread Catalin Marinas
On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote:
> On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote:
> > On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote:
> > > On 24.08.23 13:06, David Hildenbrand wrote:
> > > > Regarding one complication: "The kernel needs to know where to allocate
> > > > a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
> > > > (mprotect()) and the range it is in does not support tagging.",
> > > > simplified handling would be if it's in a MIGRATE_CMA pageblock, it
> > > > doesn't support tagging. You have to migrate to a !CMA page (for
> > > > example, not specifying GFP_MOVABLE as a quick way to achieve that).
> > > 
> > > Okay, I now realize that this patch set effectively duplicates some CMA
> > > behavior using a new migrate-type.
[...]
> I considered mixing the tag storage memory with normal memory and
> adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged,
> this means that it's not enough anymore to have a __GFP_MOVABLE allocation
> request to use MIGRATE_CMA.
> 
> I considered two solutions to this problem:
> 
> 1. Only allocate from MIGRATE_CMA if the requested memory is not tagged =>
> this effectively means transforming all memory from MIGRATE_CMA into the
> MIGRATE_METADATA migratetype that the series introduces. Not very
> appealing, because that means treating normal memory that is also on the
> MIGRATE_CMA lists as tagged memory.

That's indeed not ideal. We could try this if it makes the patches
significantly simpler, though I'm not so sure.

Allocating metadata is the easier part as we know the correspondence
from the tagged pages (32 PROT_MTE pages) to the metadata page (1 tag
storage page), so alloc_contig_range() does this for us. Just adding it
to the CMA range is sufficient.

However, making sure that we don't allocate PROT_MTE pages from the
metadata range is what led us to another migrate type. I guess we could
achieve something similar with a new zone or a CPU-less NUMA node,
though the latter is not guaranteed not to allocate memory from the
range, only make it less likely. Both these options are less flexible in
terms of size/alignment/placement.

Maybe as a quick hack - only allow PROT_MTE from ZONE_NORMAL and
configure the metadata range in ZONE_MOVABLE but at some point I'd
expect some CXL-attached memory to support MTE with additional carveout
reserved.

To recap, in this series, a PROT_MTE page allocation starts with a
typical allocation from anywhere other than MIGRATE_METADATA followed by
the hooks to reserve the corresponding metadata range at (pfn * 128 +
offset) for a 4K page. The whole metadata page is reserved, so the
adjacent 31 pages around the original allocation can also be mapped as
PROT_MTE.
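
In other words (an illustrative helper, not code from the series),
assuming 4K pages and the 32:1 data-to-tag ratio:

#include <linux/mm.h>

#define MTE_TAG_BYTES_PER_PAGE	(PAGE_SIZE / 32)	/* 128 bytes per 4K page */

/* Hypothetical helper: which tag storage page covers a given data pfn. */
static unsigned long metadata_pfn(unsigned long data_pfn,
				  unsigned long tag_region_start_pfn)
{
	unsigned long byte_off = data_pfn * MTE_TAG_BYTES_PER_PAGE;

	/*
	 * One tag storage page covers 32 data pages, hence reserving the
	 * whole metadata page also covers the 31 neighbouring data pages.
	 */
	return tag_region_start_pfn + (byte_off >> PAGE_SHIFT);
}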

(Peter and Evgenii @ Google had a slightly different approach in their
prototype: separate free_area[] array for PROT_MTE pages; while it has
some advantages, I found it more intrusive since the same page can be on
a free_area/free_list or another)

> 2. Keep track of which pages are tag storage at page granularity (either by
> a page flag, or by checking that the pfn falls in one of the tag storage
> region, or by some other mechanism). When the page allocator takes free
> pages from the MIGRATE_METADATA list to satisfy an allocation, compare the
> gfp mask with the page type, and if the allocation is tagged and the page
> is a tag storage page, put it back at the tail of the free list and choose
> the next page. Repeat until the page allocator finds a normal memory page
> that can be tagged (some refinements obviously needed to avoid
> infinite loops).

With large enough CMA areas, there's a real risk of latency spikes, RCU
stalls etc. Not really keen on such heuristics.

-- 
Catalin


Re: [PATCH] mm: Move HOLES_IN_ZONE into mm

2021-04-19 Thread Catalin Marinas
On Sat, Apr 17, 2021 at 03:59:46PM +0800, Kefeng Wang wrote:
> commit a55749639dc1 ("ia64: drop marked broken DISCONTIGMEM and
> VIRTUAL_MEM_MAP") dropped VIRTUAL_MEM_MAP, so there is no need for
> HOLES_IN_ZONE on ia64.
> 
> Also move HOLES_IN_ZONE into mm/Kconfig and select it if the
> architecture needs this feature.
> 
> Signed-off-by: Kefeng Wang 
> ---
>  arch/arm64/Kconfig | 4 +---
>  arch/ia64/Kconfig  | 3 ---
>  arch/mips/Kconfig  | 3 ---
>  mm/Kconfig | 3 +++
>  4 files changed, 4 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index f0b17d758912..3c5a53e0db91 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -202,6 +202,7 @@ config ARM64
>   select HAVE_KPROBES
>   select HAVE_KRETPROBES
>   select HAVE_GENERIC_VDSO
> + select HOLES_IN_ZONE
>   select IOMMU_DMA if IOMMU_SUPPORT
>   select IRQ_DOMAIN
>   select IRQ_FORCED_THREADING
> @@ -1053,9 +1054,6 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
>   def_bool y
>   depends on NUMA
>  
> -config HOLES_IN_ZONE
> - def_bool y
> -
>  source "kernel/Kconfig.hz"

For arm64:

Acked-by: Catalin Marinas 


[GIT PULL] arm64 fix for 5.12-rc8/final

2021-04-16 Thread Catalin Marinas
Hi Linus,

A recent commit (2decad92f473, "arm64: mte: Ensure TIF_MTE_ASYNC_FAULT
is set atomically") broke the kernel build when using the LLVM
integrated assembly (only noticeable with clang-12 as MTE is not
supported by earlier versions and the code in question is not compiled).
The Fixes: tag in the commit refers to the original patch introducing
subsections for the alternative code sequences.

Please pull the fix below for -rc8/final (Will's off today). Thanks.

The following changes since commit 738fa58ee1328481d1d7889e7c430b3401c571b9:

  arm64: kprobes: Restore local irqflag if kprobes is cancelled (2021-04-13 
09:30:16 +0100)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux tags/arm64-fixes

for you to fetch changes up to 22315a2296f4c251fa92aec45fbbae37e9301b6c:

  arm64: alternatives: Move length validation in alternative_{insn, endif} 
(2021-04-15 18:33:25 +0100)


Fix kernel compilation when using the LLVM integrated assembly.


Nathan Chancellor (1):
  arm64: alternatives: Move length validation in alternative_{insn, endif}

 arch/arm64/include/asm/alternative-macros.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

-- 
Catalin


Re: [PATCH 1/3] arm64: ptrace: Add is_syscall_success to handle compat

2021-04-16 Thread Catalin Marinas
On Fri, Apr 16, 2021 at 03:55:31PM +0800, He Zhe wrote:
> The generic version of is_syscall_success() does not handle the 32-bit
> compat case, which would cause a 32-bit negative return code to be
> recognized as a positive number later and seen as a "success".
> 
> Since is_compat_thread() is defined in compat.h, implementing
> is_syscall_success() in ptrace.h would introduce a build failure due to
> recursive inclusion of some basic headers like mutex.h. We put the
> implementation in ptrace.c.
> 
> Signed-off-by: He Zhe 
> ---
>  arch/arm64/include/asm/ptrace.h |  3 +++
>  arch/arm64/kernel/ptrace.c  | 10 ++
>  2 files changed, 13 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
> index e58bca832dff..3c415e9e5d85 100644
> --- a/arch/arm64/include/asm/ptrace.h
> +++ b/arch/arm64/include/asm/ptrace.h
> @@ -328,6 +328,9 @@ static inline void regs_set_return_value(struct pt_regs 
> *regs, unsigned long rc)
>   regs->regs[0] = rc;
>  }
>  
> +extern inline int is_syscall_success(struct pt_regs *regs);
> +#define is_syscall_success(regs) is_syscall_success(regs)
> +
>  /**
>   * regs_get_kernel_argument() - get Nth function argument in kernel
>   * @regs:pt_regs of that context
> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> index 170f42fd6101..3266201f8c60 100644
> --- a/arch/arm64/kernel/ptrace.c
> +++ b/arch/arm64/kernel/ptrace.c
> @@ -1909,3 +1909,13 @@ int valid_user_regs(struct user_pt_regs *regs, struct 
> task_struct *task)
>   else
>   return valid_native_regs(regs);
>  }
> +
> +inline int is_syscall_success(struct pt_regs *regs)
> +{
> + unsigned long val = regs->regs[0];
> +
> + if (is_compat_thread(task_thread_info(current)))
> + val = sign_extend64(val, 31);
> +
> + return !IS_ERR_VALUE(val);
> +}

It's better to use compat_user_mode(regs) here instead of
is_compat_thread(). It saves us from worrying whether regs are for the
current context.

I think we should change regs_return_value() instead. This function
seems to be called from several other places and it has the same
potential problems if called on compat pt_regs.
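
Something along these lines (an untested sketch of the suggestion, not a
patch; the generic is_syscall_success() is defined in terms of
regs_return_value(), so fixing it here covers the other callers too):

#include <linux/bitops.h>	/* sign_extend64() */
#include <asm/ptrace.h>		/* compat_user_mode() */

static inline unsigned long regs_return_value(struct pt_regs *regs)
{
	unsigned long val = regs->regs[0];

	/*
	 * compat_user_mode() looks at the saved PSTATE in regs, so it does
	 * not depend on 'regs' belonging to the current task.
	 */
	if (compat_user_mode(regs))
		val = sign_extend64(val, 31);

	return val;
}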

-- 
Catalin


Re: [PATCH] arm64: alternatives: Move length validation in alternative_{insn, endif}

2021-04-15 Thread Catalin Marinas
On Tue, 13 Apr 2021 17:08:04 -0700, Nathan Chancellor wrote:
> After commit 2decad92f473 ("arm64: mte: Ensure TIF_MTE_ASYNC_FAULT is
> set atomically"), LLVM's integrated assembler fails to build entry.S:
> 
> <instantiation>:5:7: error: expected assembly-time absolute expression
>  .org . - (664b-663b) + (662b-661b)
>   ^
> <instantiation>:6:7: error: expected assembly-time absolute expression
>  .org . - (662b-661b) + (664b-663b)
>   ^
> 
> [...]

Applied to arm64 (for-next/fixes), thanks!

[1/1] arm64: alternatives: Move length validation in alternative_{insn, endif}
  https://git.kernel.org/arm64/c/22315a2296f4

-- 
Catalin



Re: [PATCH] arm64: alternatives: Move length validation in alternative_{insn,endif}

2021-04-15 Thread Catalin Marinas
On Thu, Apr 15, 2021 at 08:50:25AM -0700, Sami Tolvanen wrote:
> On Thu, Apr 15, 2021 at 7:02 AM Catalin Marinas  
> wrote:
> >
> > On Thu, Apr 15, 2021 at 06:25:57AM -0700, Nathan Chancellor wrote:
> > > On Thu, Apr 15, 2021 at 10:17:43AM +0100, Catalin Marinas wrote:
> > > > On Tue, Apr 13, 2021 at 05:08:04PM -0700, Nathan Chancellor wrote:
> > > > > After commit 2decad92f473 ("arm64: mte: Ensure TIF_MTE_ASYNC_FAULT is
> > > > > set atomically"), LLVM's integrated assembler fails to build entry.S:
> > > > >
> > > > > <instantiation>:5:7: error: expected assembly-time absolute expression
> > > > >  .org . - (664b-663b) + (662b-661b)
> > > > >   ^
> > > > > <instantiation>:6:7: error: expected assembly-time absolute expression
> > > > >  .org . - (662b-661b) + (664b-663b)
> > > > >   ^
> > > >
> > > > I tried the latest Linus' tree and linux-next (defconfig) with this
> > > > commit in and I can't get your build error. I used both clang-10 from
> > > > Debian stable and clang-11 from Debian sid. So, which clang version did
> > > > you use or which kernel config options?
> > >
> > > Interesting, this reproduces for me with LLVM 12 or newer with just
> > > defconfig.
> >
> > It fails for me as well with clang-12. Do you happen to know why it
> > works fine with previous clang versions?
> 
> It looks like CONFIG_ARM64_AS_HAS_MTE is not set when we use the
> integrated assembler with LLVM 11, and the code that breaks later
> versions is gated behind CONFIG_ARM64_MTE.

That explains it, thanks.

-- 
Catalin


Re: [PATCH] locking/qrwlock: Fix ordering in queued_write_lock_slowpath

2021-04-15 Thread Catalin Marinas
On Thu, Apr 15, 2021 at 04:28:21PM +0100, Will Deacon wrote:
> On Thu, Apr 15, 2021 at 05:03:58PM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 15, 2021 at 02:25:52PM +, Ali Saidi wrote:
> > > While this code is executed with the wait_lock held, a reader can
> > > acquire the lock without holding wait_lock.  The writer side loops
> > > checking the value with the atomic_cond_read_acquire(), but only truly
> > > acquires the lock when the compare-and-exchange is completed
> > > successfully which isn’t ordered. The other atomic operations from this
> > > point are release-ordered and thus reads after the lock acquisition can
> > > be completed before the lock is truly acquired which violates the
> > > guarantees the lock should be making.
[...]
> > > Fixes: b519b56e378ee ("locking/qrwlock: Use atomic_cond_read_acquire() 
> > > when spinning in qrwloc")
> > > Signed-off-by: Ali Saidi 
> > > Cc: sta...@vger.kernel.org
> > > ---
> > >  kernel/locking/qrwlock.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> > > index 4786dd271b45..10770f6ac4d9 100644
> > > --- a/kernel/locking/qrwlock.c
> > > +++ b/kernel/locking/qrwlock.c
> > > @@ -73,8 +73,8 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
> > >  
> > >   /* When no more readers or writers, set the locked flag */
> > >   do {
> > > - atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
> > > - } while (atomic_cmpxchg_relaxed(&lock->cnts, _QW_WAITING,
> > > + atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
> > > + } while (atomic_cmpxchg_acquire(&lock->cnts, _QW_WAITING,
> > >   _QW_LOCKED) != _QW_WAITING);
> > >  unlock:
> > >   arch_spin_unlock(&lock->wait_lock);
> > 
> > This doesn't make sense, there is no such thing as a store-acquire. What
> > you're doing here is moving the acquire from one load to the next. A
> > load we know will load the exact same value.
> > 
> > Also see Documentation/atomic_t.txt:
> > 
> >   {}_acquire: the R of the RMW (or atomic_read) is an ACQUIRE
> > 
> > 
> > If anything this code wants to be written like so.
> > 
> > ---
> > 
> > diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> > index 4786dd271b45..22aeccc363ca 100644
> > --- a/kernel/locking/qrwlock.c
> > +++ b/kernel/locking/qrwlock.c
> > @@ -60,6 +60,8 @@ EXPORT_SYMBOL(queued_read_lock_slowpath);
> >   */
> >  void queued_write_lock_slowpath(struct qrwlock *lock)
> >  {
> > +   u32 cnt;
> > +
> > /* Put the writer into the wait queue */
> > arch_spin_lock(&lock->wait_lock);
> >  
> > @@ -73,9 +75,8 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
> >  
> > /* When no more readers or writers, set the locked flag */
> > do {
> > -   atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
> > -   } while (atomic_cmpxchg_relaxed(&lock->cnts, _QW_WAITING,
> > -   _QW_LOCKED) != _QW_WAITING);
> > +   cnt = atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
> 
> I think the issue is that >here< a concurrent reader in interrupt context
> can take the lock and release it again, but we could speculate reads from
> the critical section up over the later release and up before the control
> dependency here...
> 
> > +   } while (!atomic_try_cmpxchg_relaxed(&lock->cnts, &cnt, _QW_LOCKED));
> 
> ... and then this cmpxchg() will succeed, so our speculated stale reads
> could be used.
> 
> *HOWEVER*
> 
> Speculating a read should be fine in the face of a concurrent _reader_,
> so for this to be an issue it implies that the reader is also doing some
> (atomic?) updates.

There's at least one such case: see chain_epi_lockless() updating
epi->next, called from ep_poll_callback() with a read_lock held. This
races with ep_done_scan() which has the write_lock held.

I think the authors of the above code interpreted the read_lock as
something that multiple threads can own disregarding the _read_ part.

-- 
Catalin


Re: [PATCH] arm64: alternatives: Move length validation in alternative_{insn,endif}

2021-04-15 Thread Catalin Marinas
On Thu, Apr 15, 2021 at 06:25:57AM -0700, Nathan Chancellor wrote:
> On Thu, Apr 15, 2021 at 10:17:43AM +0100, Catalin Marinas wrote:
> > On Tue, Apr 13, 2021 at 05:08:04PM -0700, Nathan Chancellor wrote:
> > > After commit 2decad92f473 ("arm64: mte: Ensure TIF_MTE_ASYNC_FAULT is
> > > set atomically"), LLVM's integrated assembler fails to build entry.S:
> > > 
> > > <instantiation>:5:7: error: expected assembly-time absolute expression
> > >  .org . - (664b-663b) + (662b-661b)
> > >   ^
> > > <instantiation>:6:7: error: expected assembly-time absolute expression
> > >  .org . - (662b-661b) + (664b-663b)
> > >   ^
> > 
> > I tried the latest Linus' tree and linux-next (defconfig) with this
> > commit in and I can't get your build error. I used both clang-10 from
> > Debian stable and clang-11 from Debian sid. So, which clang version did
> > you use or which kernel config options?
> 
> Interesting, this reproduces for me with LLVM 12 or newer with just
> defconfig.

It fails for me as well with clang-12. Do you happen to know why it
works fine with previous clang versions?

-- 
Catalin


Re: [PATCH] arm64: alternatives: Move length validation in alternative_{insn,endif}

2021-04-15 Thread Catalin Marinas
Hi Nathan,

On Tue, Apr 13, 2021 at 05:08:04PM -0700, Nathan Chancellor wrote:
> After commit 2decad92f473 ("arm64: mte: Ensure TIF_MTE_ASYNC_FAULT is
> set atomically"), LLVM's integrated assembler fails to build entry.S:
> 
> <instantiation>:5:7: error: expected assembly-time absolute expression
>  .org . - (664b-663b) + (662b-661b)
>   ^
> <instantiation>:6:7: error: expected assembly-time absolute expression
>  .org . - (662b-661b) + (664b-663b)
>   ^

I tried the latest Linus' tree and linux-next (defconfig) with this
commit in and I can't get your build error. I used both clang-10 from
Debian stable and clang-11 from Debian sid. So, which clang version did
you use or which kernel config options?

-- 
Catalin


Re: [RFC][PATCH] locking: Generic ticket-lock

2021-04-15 Thread Catalin Marinas
(fixed Will's email address)

On Thu, Apr 15, 2021 at 10:09:54AM +0200, Peter Zijlstra wrote:
> On Thu, Apr 15, 2021 at 05:47:34AM +0900, Stafford Horne wrote:
> > > How's this then? Compile tested only on openrisc/simple_smp_defconfig.
> > 
> > I did my testing with this FPGA build SoC:
> > 
> >  https://github.com/stffrdhrn/de0_nano-multicore
> > 
> > Note, the CPU timer sync logic uses mb() and is a bit flaky. So a
> > missing mb() might be a reason. I thought we had defined mb() and
> > l.msync, but it seems to have gotten lost.
> > 
> > With that said, I could test out this ticket-lock implementation. How
> > would I tell if it's better than qspinlock?
> 
> Mostly if it isn't worse, it's better for being *much* simpler. As you
> can see, the guts of ticket is like 16 lines of C (lock+unlock) and you
> only need the behaviour of atomic_fetch_add() to reason about behaviour
> of the whole thing. qspinlock OTOH is mind bending painful to reason
> about.
> 
> There are some spinlock tests in locktorture; but back when I had a
> userspace copy of the lot and would measure min,avg,max acquire times
> under various contention loads (making sure to only run a single task
> per CPU etc.. to avoid lock holder preemption and other such 'fun'
> things).
> 
> It took us a fair amount of work to get qspinlock to compete with ticket
> for low contention cases (by far the most common in the kernel), and it
> took a fairly large amount of CPUs for qspinlock to really win from
> ticket on the contended case. Your hardware may vary. In particular the
> access to the external cacheline (for queueing, see the queue: label in
> queued_spin_lock_slowpath) is a pain-point and the relative cost of
> cacheline misses for your arch determines where (and if) low contention
> behaviour is competitive.
> 
> Also, less variance (the reason for the min/max measure) is better.
> Large variance is typically a sign of fwd progress trouble.

IIRC, one issue we had with ticket spinlocks on arm64 was on big.LITTLE
systems where the little CPUs were always last to get a ticket when
racing with the big cores. That was with load/store exclusives (LR/SC
style) and would have probably got better with atomics but we moved to
qspinlocks eventually (the Juno board didn't have atomics).

(leaving the rest of the text below for Will's convenience)

> That's not saying that qspinlock isn't awesome, but I'm arguing that you
> should get there by first trying all the simpler things. By gradually
> increasing complexity you can also find the problem spots (for your
> architecture) and you have something to fall back to in case of trouble.
> 
> Now, the obvious selling point of qspinlock is that due to the MCS style
> nature of the thing it doesn't bounce the lock around, but that comes at
> a cost of having to use that extra cacheline (due to the kernel liking
> sizeof(spinlock_t) == sizeof(u32)). But things like ARM64's WFE (see
> smp_cond_load_acquire()) can shift the balance quite a bit on that front
> as well (ARM has a similar thing but less useful, see it's spinlock.h
> and look for wfe() and dsb_sev()).
> 
> Once your arch hits NUMA, qspinlock is probably a win. However, low
> contention performance is still king for most workloads. Better high
> contention behaviour is nice.

-- 
Catalin


Re: [PATCH] arm64:align function __arch_clear_user

2021-04-14 Thread Catalin Marinas
On Wed, Apr 14, 2021 at 05:25:43PM +0800, Kai Shen wrote:
> Performance decreases happen in __arch_clear_user when this
> function is not correctly aligned on the HISI-HIP08 arm64 SoC, which
> fetches 32 bytes (8 instructions) from the icache with a 32-byte
> aligned end address. As a result, if the hot loop is not 32-byte
> aligned, it may take more icache fetches, which leads to a decrease
> in performance.
> Dump of assembler code for function __arch_clear_user:
>0x809e3f10 :nop
>0x809e3f14 :mov x2, x1
>0x809e3f18 :subs x1, x1, #0x8
>0x809e3f1c :b.mi 0x809e3f30 <__arch_clear_user+3
> -  0x809e3f20 :str xzr, [x0],#8
> hot0x809e3f24 :nop
> loop   0x809e3f28 :subs x1, x1, #0x8
> -  0x809e3f2c :b.pl  0x809e3f20 <__arch_clear_user+1
> The hot loop above takes one icache fetch as the code is in one
> 32-bytes aligned area and the loop takes one more icache fetch
> when it is not aligned like below.
>0x809e4178 :   str xzr, [x0],#8
>0x809e417c :   nop
>0x809e4180 :   subs x1, x1, #0x8
>0x809e4184 :   b.pl  0x809e4178 <__arch_clear_user+
> Data collected by perf:
>                          aligned     not aligned
>   instructions           57733790    57739065
>   L1-dcache-store        14938070    13718242
>   L1-dcache-store-misses 349280      349869
>   L1-icache-loads        15380895    28500665
> As we can see, L1-icache-loads almost double when the loop is not
> aligned.
> This problem was found in Linux 4.19 on a HISI-HIP08 arm64 SoC.
> Not sure what the case is on other arm64 SoCs, but it should do
> no harm.
> Signed-off-by: Kai Shen 

Do you have a real world workload that's affected by this function?

I'm against adding alignments and nops for specific hardware
implementations. What about lots of other loops that the compiler may
generate or that we wrote in asm?

-- 
Catalin


Re: [PATCH] arm64/kernel/probes: Use BUG_ON instead of if condition followed by BUG.

2021-04-14 Thread Catalin Marinas
On Tue, 30 Mar 2021 04:57:50 -0700, zhouchuangao wrote:
> It can be optimized at compile time.

Applied to arm64 (for-next/misc), it saves one line ;). Thanks!

[1/1] arm64/kernel/probes: Use BUG_ON instead of if condition followed by BUG.
  https://git.kernel.org/arm64/c/839157876f97

-- 
Catalin



Re: [PATCH] riscv: locks: introduce ticket-based spinlock implementation

2021-04-14 Thread Catalin Marinas
On Wed, Apr 14, 2021 at 08:23:51AM +0800, Guo Ren wrote:
> On Tue, Apr 13, 2021 at 5:31 PM Catalin Marinas  
> wrote:
> > On Tue, Apr 13, 2021 at 11:22:40AM +0200, Christoph Müllner wrote:
> > > On Tue, Apr 13, 2021 at 10:03 AM Peter Zijlstra  
> > > wrote:
> > > > On Mon, Apr 12, 2021 at 11:54:55PM +0200, Christoph Müllner wrote:
> > > > > On Mon, Apr 12, 2021 at 7:33 PM Palmer Dabbelt  
> > > > > wrote:
> > > > > > My plan is to add a generic ticket-based lock, which can be 
> > > > > > selected at
> > > > > > compile time.  It'll have no architecture dependencies (though it'll
> > > > > > likely have some hooks for architectures that can make this go 
> > > > > > faster).
> > > > > > Users can then just pick which spinlock flavor they want, with the 
> > > > > > idea
> > > > > > being that smaller systems will perform better with ticket locks and
> > > > > > larger systems will perform better with queued locks.  The main goal
> > > > > > here is to give the less widely used architectures an easy way to 
> > > > > > have
> > > > > > fair locks, as right now we've got a lot of code duplication 
> > > > > > because any
> > > > > > architecture that wants ticket locks has to do it themselves.
> > > > >
> > > > > In the case of LL/SC sequences, we have a maximum of 16 instructions
> > > > > on RISC-V. My concern with a pure-C implementation would be that
> > > > > we cannot guarantee this (e.g. somebody wants to compile with -O0)
> > > > > and I don't know of a way to abort the build in case this limit is
> > > > > exceeded.
> > > > > Therefore I have preferred inline assembly for OpenSBI (my initial 
> > > > > idea
> > > > > was to use closure-like LL/SC macros, where you can write the loop
> > > > > in form of C code).
> > > >
> > > > For ticket locks you really only needs atomic_fetch_add() and
> > > > smp_store_release() and an architectural guarantees that the
> > > > atomic_fetch_add() has fwd progress under contention and that a sub-word
> > > > store (through smp_store_release()) will fail the SC.
> > > >
> > > > Then you can do something like:
> > > >
> > > > void lock(atomic_t *lock)
> > > > {
> > > > u32 val = atomic_fetch_add(1<<16, lock); /* SC, gives us RCsc */
> > > > u16 ticket = val >> 16;
> > > >
> > > > for (;;) {
> > > > if (ticket == (u16)val)
> > > > break;
> > > > cpu_relax();
> > > > val = atomic_read_acquire(lock);
> > > > }
> > > > }
> > > >
> > > > void unlock(atomic_t *lock)
> > > > {
> > > > u16 *ptr = (u16 *)lock + (!!__BIG_ENDIAN__);
> > > > u32 val = atomic_read(lock);
> > > >
> > > > smp_store_release(ptr, (u16)val + 1);
> > > > }
> > > >
> > > > That's _almost_ as simple as a test-and-set :-) It isn't quite optimal
> > > > on x86 for not being allowed to use a memop on unlock, since its being
> > > > forced into a load-store because of all the volatile, but whatever.
> > >
> > > What about trylock()?
> > > I.e. one could implement trylock() without a loop, by letting
> > > trylock() fail if the SC fails.
> > > That looks safe on first view, but nobody does this right now.
> 
> I think it's safe for riscv LR/SC, because in spec A 8.3 section:
> "As a consequence of the eventuality guarantee, if some harts in an
> execution environment are executing constrained LR/SC loops, and no
> other harts or devices in the execution environment execute an
> unconditional store or AMO to that reservation set, then at least one
> hart will eventually exit its constrained LR/SC loop."

This is clearly talking about _loops_ and that one hart will
_eventually_ exit the loop. It does not say that there is a guaranteed
LR/SC successful sequence or single iteration of the loop.

> So it guarantees LR/SC pair:
> 
> CPU0   CPU1
> === ===
> LR addr1
> LR addr1
> SC addr1 // guarantee success.
> SC addr1

I don't see the RISC-V spec 

Re: [PATCH] arch/arm64/kernel/traps: Use find_vma_intersection() in traps for setting si_code

2021-04-13 Thread Catalin Marinas
On Tue, Apr 13, 2021 at 04:52:34PM +, Liam Howlett wrote:
> * Catalin Marinas  [210412 13:44]:
> > On Wed, Apr 07, 2021 at 03:11:06PM +, Liam Howlett wrote:
> > > find_vma() will continue to search upwards until the end of the virtual
> > > memory space.  This means the si_code would almost never be set to
> > > SEGV_MAPERR even when the address falls outside of any VMA.  The result
> > > is that the si_code is not reliable as it may or may not be set to the
> > > correct result, depending on where the address falls in the address
> > > space.
> > > 
> > > Using find_vma_intersection() allows for what is intended by only
> > > returning a VMA if it falls within the range provided, in this case a
> > > window of 1.
> > > 
> > > Signed-off-by: Liam R. Howlett 
> > > ---
> > >  arch/arm64/kernel/traps.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> > > index a05d34f0e82a..a44007904a64 100644
> > > --- a/arch/arm64/kernel/traps.c
> > > +++ b/arch/arm64/kernel/traps.c
> > > @@ -383,9 +383,10 @@ void force_signal_inject(int signal, int code, 
> > > unsigned long address, unsigned i
> > >  void arm64_notify_segfault(unsigned long addr)
> > >  {
> > >   int code;
> > > + unsigned long ut_addr = untagged_addr(addr);
> > >  
> > >   mmap_read_lock(current->mm);
> > > - if (find_vma(current->mm, untagged_addr(addr)) == NULL)
> > > + if (find_vma_intersection(current->mm, ut_addr, ut_addr + 1) == NULL)
> > >   code = SEGV_MAPERR;
> > >   else
> > >   code = SEGV_ACCERR;
[...]
> > I don't think your change is entirely correct either. We can have a
> > fault below the vma of a stack (with VM_GROWSDOWN) and
> > find_vma_intersection() would return NULL but it should be a SEGV_ACCERR
> > instead.
> 
> I'm pretty sure I am missing something.  From what you said above, I
> think this means that there can be a user cache fault below the stack
> which should notify the user application that they are not allowed to
> expand the stack by sending a SEGV_ACCERR in the si_code?  Is this
> expected behaviour or am I missing a code path to this function?

My point was that find_vma() may return a valid vma where addr < vm_end
but also addr < vm_start. It's the responsibility of the caller to check
whether that vma can be expanded (VM_GROWSDOWN) and we do something like
this in __do_page_fault(). find_vma_intersection(), OTOH, requires
addr >= vm_start.

If we hit this case (addr < vm_start), normally we'd first need to check
whether it's expandable and, if not, return MAPERR. If it's expandable,
it should be ACCERR since something else caused the fault.
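
Roughly (an illustrative sketch of the above, not the final fix):

#include <linux/mm.h>
#include <linux/signal.h>

static int segfault_code(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma = find_vma(mm, addr);

	if (!vma)
		return SEGV_MAPERR;	/* nothing at or above addr */
	if (addr >= vma->vm_start)
		return SEGV_ACCERR;	/* addr is covered by this vma */
	/* addr < vm_start: only an expandable stack makes it an access error */
	return (vma->vm_flags & VM_GROWSDOWN) ? SEGV_ACCERR : SEGV_MAPERR;
}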

Now, I think at least for user_cache_maint_handler(), we can assume that
__do_page_fault() handled any expansion already, so we don't need to
check it here. In this case, your find_vma_intersection() check should
work.

Are there other cases where we invoke arm64_notify_segfault() without a
prior fault? I think in swp_handler() we can bail out early before we
even attempted the access so we may report MAPERR but ACCERR is a better
indication. Also in sys_rt_sigreturn() we always call it as
arm64_notify_segfault(regs->sp). I'm not sure that's correct in all
cases, see restore_altstack().

I guess this code needs some tidying up.

> > Maybe this should employ similar checks as __do_page_fault() (with
> > expand_stack() and VM_GROWSDOWN).
> 
> You mean the code needs to detect endianness and to check if this is an
> attempt to expand the stack for both cases?

Nothing to do with endianness, just the relation between the address and
the vma->vm_start and whether the vma can be expanded down.

-- 
Catalin


Re: [PATCH] riscv: locks: introduce ticket-based spinlock implementation

2021-04-13 Thread Catalin Marinas
On Tue, Apr 13, 2021 at 12:25:00PM +0200, Christoph Müllner wrote:
> On Tue, Apr 13, 2021 at 11:37 AM Peter Zijlstra  wrote:
> > On Tue, Apr 13, 2021 at 11:22:40AM +0200, Christoph Müllner wrote:
> > > What about trylock()?
> > > I.e. one could implement trylock() without a loop, by letting
> > > trylock() fail if the SC fails.
> > > That looks safe on first view, but nobody does this right now.
> >
> > Generic code has to use cmpxchg(), and then you get something like this:
> >
> > bool trylock(atomic_t *lock)
> > {
> > u32 old = atomic_read(lock);
> >
> > if ((old >> 16) != (old & 0xffff))
> > return false;
> >
> > return atomic_try_cmpxchg(lock, &old, old + (1<<16)); /* SC, for
> > RCsc */
> > }
> 
> This approach requires two loads (atomic_read() and cmpxchg()), which
> is not required.
> Detecting this pattern and optimizing it in a compiler is quite unlikely.
> 
> A bit less generic solution would be to wrap the LL/SC (would be
> mandatory in this case)
> instructions and do something like this:
> 
> uint32_t __smp_load_acquire_reserved(void*);
> int __smp_store_release_conditional(void*, uint32_t);
> 
> typedef union {
> uint32_t v32;
> struct {
> uint16_t owner;
> uint16_t next;
> };
> } arch_spinlock_t;
> 
> int trylock(arch_spinlock_t *lock)
> {
> arch_spinlock_t l;
> int success;
> do {
> l.v32 = __smp_load_acquire_reserved(lock);
> if (l.owner != l.next)
> return 0;
> l.next++;
> success = __smp_store_release_conditional(lock, l.v32);
> } while (!success);
> return success;
> }
> 
> But here we can't tell the compiler to optimize the code between LL and SC...

This indeed needs some care. IIUC RISC-V has similar restrictions as arm
here, no load/store instructions are allowed between LR and SC. You
can't guarantee that the compiler won't spill some variable onto the
stack.

BTW, I think the SC doesn't need release semantics above, only the LR
needs acquire semantics.

-- 
Catalin


Re: [PATCH v16 0/9] arm64: ARMv8.5-A: MTE: Add async mode support

2021-04-13 Thread Catalin Marinas
On Mon, 15 Mar 2021 13:20:10 +, Vincenzo Frascino wrote:
> This patchset implements the asynchronous mode support for ARMv8.5-A
> Memory Tagging Extension (MTE), which is a debugging feature that allows
> detecting, with the help of the architecture, C and C++ programmatic
> memory errors like buffer overflow, use-after-free, use-after-return, etc.
> 
> MTE is built on top of the AArch64 v8.0 virtual address tagging TBI
> (Top Byte Ignore) feature and allows a task to set a 4 bit tag on any
> subset of its address space that is a multiple of a 16-byte granule. MTE
> is based on a lock-key mechanism where the lock is the tag associated to
> the physical memory and the key is the tag associated to the virtual
> address.
> When MTE is enabled and tags are set for ranges of address space of a task,
> the PE will compare the tag related to the physical memory with the tag
> related to the virtual address (tag check operation). Access to the memory
> is granted only if the two tags match. In case of mismatch the PE will raise
> an exception.
> 
> [...]

Applied to arm64 (for-next/mte-async-kernel-mode) but with a note that
I'll drop them if Andrew prefers to take the patches via the mm tree.
Thanks!

[1/9] arm64: mte: Add asynchronous mode support
  https://git.kernel.org/arm64/c/f3b7deef8dca
[2/9] kasan: Add KASAN mode kernel parameter
  https://git.kernel.org/arm64/c/2603f8a78dfb
[3/9] arm64: mte: Drop arch_enable_tagging()
  https://git.kernel.org/arm64/c/c137c6145b11
[4/9] kasan: Add report for async mode
  https://git.kernel.org/arm64/c/8f7b5054755e
[5/9] arm64: mte: Enable TCO in functions that can read beyond buffer limits
  https://git.kernel.org/arm64/c/e60beb95c08b
[6/9] arm64: mte: Conditionally compile mte_enable_kernel_*()
  https://git.kernel.org/arm64/c/d8969752cc4e
[7/9] arm64: mte: Enable async tag check fault
  https://git.kernel.org/arm64/c/65812c6921cc
[8/9] arm64: mte: Report async tag faults before suspend
  https://git.kernel.org/arm64/c/eab0e6e17d87
[9/9] kasan, arm64: tests supports for HW_TAGS async mode
  https://git.kernel.org/arm64/c/e80a76aa1a91
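
As a conceptual aid, the lock-key check described in the quoted cover
letter amounts to the following (a plain C model of the hardware
behaviour, not kernel code):

#include <stdbool.h>
#include <stdint.h>

#define MTE_GRANULE_SIZE	16	/* bytes covered by one allocation tag */
#define MTE_TAG_SHIFT		56	/* logical tag lives in VA bits 59:56 */

/* The "key": the tag carried in the top byte of the virtual address. */
static inline uint8_t ptr_key(uint64_t vaddr)
{
	return (vaddr >> MTE_TAG_SHIFT) & 0xf;
}

/*
 * The check the PE performs on a checked access: compare the pointer's
 * key with the "lock", i.e. the allocation tag of the 16-byte granule
 * being accessed. In async mode a mismatch is recorded in TFSR_EL1
 * rather than raising a synchronous exception.
 */
static inline bool access_allowed(uint64_t vaddr, uint8_t granule_tag)
{
	return ptr_key(vaddr) == granule_tag;
}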

-- 
Catalin



Re: [PATCH] riscv: locks: introduce ticket-based spinlock implementation

2021-04-13 Thread Catalin Marinas
On Tue, Apr 13, 2021 at 11:22:40AM +0200, Christoph Müllner wrote:
> On Tue, Apr 13, 2021 at 10:03 AM Peter Zijlstra  wrote:
> > On Mon, Apr 12, 2021 at 11:54:55PM +0200, Christoph Müllner wrote:
> > > On Mon, Apr 12, 2021 at 7:33 PM Palmer Dabbelt  wrote:
> > > > My plan is to add a generic ticket-based lock, which can be selected at
> > > > compile time.  It'll have no architecture dependencies (though it'll
> > > > likely have some hooks for architectures that can make this go faster).
> > > > Users can then just pick which spinlock flavor they want, with the idea
> > > > being that smaller systems will perform better with ticket locks and
> > > > larger systems will perform better with queued locks.  The main goal
> > > > here is to give the less widely used architectures an easy way to have
> > > > fair locks, as right now we've got a lot of code duplication because any
> > > > architecture that wants ticket locks has to do it themselves.
> > >
> > > In the case of LL/SC sequences, we have a maximum of 16 instructions
> > > on RISC-V. My concern with a pure-C implementation would be that
> > > we cannot guarantee this (e.g. somebody wants to compile with -O0)
> > > and I don't know of a way to abort the build in case this limit is exceeded.
> > > Therefore I have preferred inline assembly for OpenSBI (my initial idea
> > > was to use closure-like LL/SC macros, where you can write the loop
> > > in form of C code).
> >
> > For ticket locks you really only needs atomic_fetch_add() and
> > smp_store_release() and an architectural guarantees that the
> > atomic_fetch_add() has fwd progress under contention and that a sub-word
> > store (through smp_store_release()) will fail the SC.
> >
> > Then you can do something like:
> >
> > void lock(atomic_t *lock)
> > {
> > u32 val = atomic_fetch_add(1<<16, lock); /* SC, gives us RCsc */
> > u16 ticket = val >> 16;
> >
> > for (;;) {
> > if (ticket == (u16)val)
> > break;
> > cpu_relax();
> > val = atomic_read_acquire(lock);
> > }
> > }
> >
> > void unlock(atomic_t *lock)
> > {
> > u16 *ptr = (u16 *)lock + (!!__BIG_ENDIAN__);
> > u32 val = atomic_read(lock);
> >
> > smp_store_release(ptr, (u16)val + 1);
> > }
> >
> > That's _almost_ as simple as a test-and-set :-) It isn't quite optimal
> > on x86 for not being allowed to use a memop on unlock, since its being
> > forced into a load-store because of all the volatile, but whatever.
> 
> What about trylock()?
> I.e. one could implement trylock() without a loop, by letting
> trylock() fail if the SC fails.
> That looks safe on first view, but nobody does this right now.

Not familiar with RISC-V but I'd recommend that a trylock only fails if
the lock is locked (after LR). A SC may fail for other reasons
(cacheline eviction; depending on the microarchitecture) even if the
lock is unlocked. At least on arm64 we had this issue with an
implementation having a tendency to always fail the first STXR.
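
A sketch of such a trylock() on top of the earlier ticket encoding in
this thread (untested, just to illustrate the point about spurious
SC/cmpxchg failures):

#include <linux/atomic.h>
#include <linux/types.h>

static inline bool ticket_trylock(atomic_t *lock)
{
	int old = atomic_read(lock);

	/* The lock is free while next (high half) == owner (low half). */
	while ((u16)(old >> 16) == (u16)old) {
		/*
		 * A failed cmpxchg may be spurious (e.g. cacheline eviction
		 * on LL/SC machines) or mean another CPU took a ticket;
		 * 'old' is updated on failure, so only give up once the
		 * lock is actually observed as held.
		 */
		if (atomic_try_cmpxchg_acquire(lock, &old, old + (1 << 16)))
			return true;
	}
	return false;
}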

-- 
Catalin


Re: linux-next: manual merge of the akpm-current tree with the arm64 tree

2021-04-13 Thread Catalin Marinas
On Tue, Apr 13, 2021 at 06:59:36PM +1000, Stephen Rothwell wrote:
> diff --cc lib/test_kasan.c
> index 785e724ce0d8,bf9225002a7e..
> --- a/lib/test_kasan.c
> +++ b/lib/test_kasan.c
> @@@ -78,33 -83,30 +83,35 @@@ static void kasan_test_exit(struct kuni
>* fields, it can reorder or optimize away the accesses to those fields.
>* Use READ/WRITE_ONCE() for the accesses and compiler barriers around the
>* expression to prevent that.
> +  *
> +  * In between KUNIT_EXPECT_KASAN_FAIL checks, fail_data.report_found is 
> kept as
> +  * false. This allows detecting KASAN reports that happen outside of the 
> checks
> +  * by asserting !fail_data.report_found at the start of 
> KUNIT_EXPECT_KASAN_FAIL
> +  * and in kasan_test_exit.
>*/
> - #define KUNIT_EXPECT_KASAN_FAIL(test, expression) do {  \
> - if (IS_ENABLED(CONFIG_KASAN_HW_TAGS) && \
> - !kasan_async_mode_enabled())\
> - migrate_disable();  \
> - WRITE_ONCE(fail_data.report_expected, true);\
> - WRITE_ONCE(fail_data.report_found, false);  \
> - kunit_add_named_resource(test,  \
> - NULL,   \
> - NULL,   \
> - &resource,  \
> - "kasan_data", &fail_data);  \
> - barrier();  \
> - expression; \
> - barrier();  \
> - if (kasan_async_mode_enabled()) \
> - kasan_force_async_fault();  \
> - barrier();  \
> - KUNIT_EXPECT_EQ(test,   \
> - READ_ONCE(fail_data.report_expected),   \
> - READ_ONCE(fail_data.report_found)); \
> - if (IS_ENABLED(CONFIG_KASAN_HW_TAGS) && \
> - !kasan_async_mode_enabled()) {  \
> - if (READ_ONCE(fail_data.report_found))  \
> - kasan_enable_tagging_sync();\
> - migrate_enable();   \
> - }   \
> + #define KUNIT_EXPECT_KASAN_FAIL(test, expression) do {  
> \
>  -if (IS_ENABLED(CONFIG_KASAN_HW_TAGS))   \
> ++if (IS_ENABLED(CONFIG_KASAN_HW_TAGS) && \
> ++!kasan_async_mode_enabled())\
> + migrate_disable();  \
> + KUNIT_EXPECT_FALSE(test, READ_ONCE(fail_data.report_found));\
> + WRITE_ONCE(fail_data.report_expected, true);\
> + barrier();  \
> + expression; \
> + barrier();  \
> ++if (kasan_async_mode_enabled()) \
> ++kasan_force_async_fault();  \
> ++barrier();  \
> + KUNIT_EXPECT_EQ(test,   \
> + READ_ONCE(fail_data.report_expected),   \
> + READ_ONCE(fail_data.report_found)); \
>  -if (IS_ENABLED(CONFIG_KASAN_HW_TAGS)) { \
> ++if (IS_ENABLED(CONFIG_KASAN_HW_TAGS) && \
> ++!kasan_async_mode_enabled()) {  \
> + if (READ_ONCE(fail_data.report_found))  \
>  -kasan_enable_tagging(); \
> ++kasan_enable_tagging_sync();\
> + migrate_enable();   \
> + }   \
> + WRITE_ONCE(fail_data.report_found, false);  \
> + WRITE_ONCE(fail_data.report_expected, false);   \
>   } while (0)
>   
>   #define KASAN_TEST_NEEDS_CONFIG_ON(test, config) do {   
> \

Thanks Stephen. The resolution looks correct.

Andrew, if you'd rather I dropped the MTE async mode support from the
arm64 tree please let me know. Thanks.

https://lore.kernel.org/r/20210315132019.33202-1-vincenzo.frasc...@arm.com/

-- 
Catalin


Re: [PATCH] arch/arm64/kernel/traps: Use find_vma_intersection() in traps for setting si_code

2021-04-12 Thread Catalin Marinas
On Wed, Apr 07, 2021 at 03:11:06PM +, Liam Howlett wrote:
> find_vma() will continue to search upwards until the end of the virtual
> memory space.  This means the si_code would almost never be set to
> SEGV_MAPERR even when the address falls outside of any VMA.  The result
> is that the si_code is not reliable as it may or may not be set to the
> correct result, depending on where the address falls in the address
> space.
> 
> Using find_vma_intersection() allows for what is intended by only
> returning a VMA if it falls within the range provided, in this case a
> window of 1.
> 
> Signed-off-by: Liam R. Howlett 
> ---
>  arch/arm64/kernel/traps.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> index a05d34f0e82a..a44007904a64 100644
> --- a/arch/arm64/kernel/traps.c
> +++ b/arch/arm64/kernel/traps.c
> @@ -383,9 +383,10 @@ void force_signal_inject(int signal, int code, unsigned 
> long address, unsigned i
>  void arm64_notify_segfault(unsigned long addr)
>  {
>   int code;
> + unsigned long ut_addr = untagged_addr(addr);
>  
>   mmap_read_lock(current->mm);
> - if (find_vma(current->mm, untagged_addr(addr)) == NULL)
> + if (find_vma_intersection(current->mm, ut_addr, ut_addr + 1) == NULL)
>   code = SEGV_MAPERR;
>   else
>   code = SEGV_ACCERR;

I don't think your change is entirely correct either. We can have a
fault below the vma of a stack (with VM_GROWSDOWN) and
find_vma_intersection() would return NULL but it should be a SEGV_ACCERR
instead.

Maybe this should employ similar checks as __do_page_fault() (with
expand_stack() and VM_GROWSDOWN).

-- 
Catalin


Re: [PATCH v4] kasan: remove redundant config option

2021-04-12 Thread Catalin Marinas
On Sun, Apr 11, 2021 at 03:03:16PM -0700, Andrew Morton wrote:
> On Sun, 11 Apr 2021 11:53:33 +0100 Catalin Marinas  
> wrote:
> > On Tue, Mar 30, 2021 at 10:36:37PM -0700, Andrew Morton wrote:
> > > On Mon, 29 Mar 2021 16:54:26 +0200 Andrey Konovalov 
> > >  wrote:
> > > > Looks like my patch "kasan: fix KASAN_STACK dependency for HW_TAGS"
> > > > that was merged into 5.12-rc causes a build time warning:
> > > > 
> > > > include/linux/kasan.h:333:30: warning: 'CONFIG_KASAN_STACK' is not
> > > > defined, evaluates to 0 [-Wundef]
> > > > #if defined(CONFIG_KASAN) && CONFIG_KASAN_STACK
> > > > 
> > > > The fix for it would either be reverting the patch (which would leave
> > > > the initial issue unfixed) or applying this "kasan: remove redundant
> > > > config option" patch.
> > > > 
> > > > Would it be possible to send this patch (with the fix-up you have in
> > > > mm) for the next 5.12-rc?
> > > > 
> > > > Here are the required tags:
> > > > 
> > > > Fixes: d9b571c885a8 ("kasan: fix KASAN_STACK dependency for HW_TAGS")
> > > > Cc: sta...@vger.kernel.org
> > > 
> > > Got it, thanks.  I updated the changelog to mention the warning fix and
> > > moved these ahead for a -rc merge.
> > 
> > Is there a chance this patch makes it into 5.12? I still get the warning
> > with the latest Linus' tree (v5.12-rc6-408-g52e44129fba5) when enabling
> > KASAN_HW_TAGS.
> 
> Trying.   We're still awaiting a tested fix for
> https://lkml.kernel.org/r/ca+fcnzf1abrqg0dsxtoza9zm1bsblyq_xbu+xi9cv8wazxd...@mail.gmail.com

Thanks Andrew. I didn't realise it was sent and then dropped.

However, we should decouple (or rather reorder) the two patches. There's
no functional dependency between removing the redundant config option (a
fix for an existing commit) and adding support for KASAN_SW_TAGS with
gcc-11, only a conflict in scripts/Makefile.kasan. Walter's original
patch applies on top of vanilla 5.12-rc3:

https://lkml.kernel.org/r/20210226012531.29231-1-walter-zh...@mediatek.com

-- 
Catalin


Re: [PATCH v16 0/9] arm64: ARMv8.5-A: MTE: Add async mode support

2021-04-11 Thread Catalin Marinas
On Thu, Mar 18, 2021 at 06:56:07PM +, Catalin Marinas wrote:
> On Mon, Mar 15, 2021 at 01:20:10PM +, Vincenzo Frascino wrote:
> > This patchset implements the asynchronous mode support for ARMv8.5-A
> > Memory Tagging Extension (MTE), which is a debugging feature that allows
> > to detect with the help of the architecture the C and C++ programmatic
> > memory errors like buffer overflow, use-after-free, use-after-return, etc.
> > 
> > MTE is built on top of the AArch64 v8.0 virtual address tagging TBI
> > (Top Byte Ignore) feature and allows a task to set a 4 bit tag on any
> > subset of its address space that is multiple of a 16 bytes granule. MTE
> > is based on a lock-key mechanism where the lock is the tag associated to
> > the physical memory and the key is the tag associated to the virtual
> > address.
> > When MTE is enabled and tags are set for ranges of address space of a task,
> > the PE will compare the tag related to the physical memory with the tag
> > related to the virtual address (tag check operation). Access to the memory
> > is granted only if the two tags match. In case of mismatch the PE will raise
> > an exception.
> > 
> > The exception can be handled synchronously or asynchronously. When the
> > asynchronous mode is enabled:
> >   - Upon fault the PE updates the TFSR_EL1 register.
> >   - The kernel detects the change during one of the following:
> > - Context switching
> > - Return to user/EL0
> > - Kernel entry from EL1
> > - Kernel exit to EL1
> >   - If the register has been updated by the PE the kernel clears it and
> > reports the error.
> > 
> > The series is based on linux-next/akpm.
> 
> Andrew, could you please pick these patches up via the mm tree? They
> depend on kasan patches already queued.

Andrew, are you ok for me to queue these patches via the arm64 tree for
5.13 or you'd rather take them in the mm tree? There is a conflict with
the mm tree in lib/test_kasan.c, I think commit ce816b430b5a ("kasan:
detect false-positives in tests").

Thanks.

-- 
Catalin


Re: [PATCH v4] kasan: remove redundant config option

2021-04-11 Thread Catalin Marinas
Hi Andrew,

On Tue, Mar 30, 2021 at 10:36:37PM -0700, Andrew Morton wrote:
> On Mon, 29 Mar 2021 16:54:26 +0200 Andrey Konovalov  
> wrote:
> > Looks like my patch "kasan: fix KASAN_STACK dependency for HW_TAGS"
> > that was merged into 5.12-rc causes a build time warning:
> > 
> > include/linux/kasan.h:333:30: warning: 'CONFIG_KASAN_STACK' is not
> > defined, evaluates to 0 [-Wundef]
> > #if defined(CONFIG_KASAN) && CONFIG_KASAN_STACK
> > 
> > The fix for it would either be reverting the patch (which would leave
> > the initial issue unfixed) or applying this "kasan: remove redundant
> > config option" patch.
> > 
> > Would it be possible to send this patch (with the fix-up you have in
> > mm) for the next 5.12-rc?
> > 
> > Here are the required tags:
> > 
> > Fixes: d9b571c885a8 ("kasan: fix KASAN_STACK dependency for HW_TAGS")
> > Cc: sta...@vger.kernel.org
> 
> Got it, thanks.  I updated the changelog to mention the warning fix and
> moved these ahead for a -rc merge.

Is there a chance this patch makes it into 5.12? I still get the warning
with the latest Linus' tree (v5.12-rc6-408-g52e44129fba5) when enabling
KASAN_HW_TAGS.

Thanks.

-- 
Catalin


Re: [PATCH v3] arm64: mte: Move MTE TCF0 check in entry-common

2021-04-09 Thread Catalin Marinas
On Fri, Apr 09, 2021 at 05:18:45PM +0100, Mark Rutland wrote:
> On Fri, Apr 09, 2021 at 03:32:47PM +0100, Mark Rutland wrote:
> > Hi Vincenzo,
> > 
> > On Fri, Apr 09, 2021 at 02:24:19PM +0100, Vincenzo Frascino wrote:
> > > The check_mte_async_tcf macro sets the TIF flag non-atomically. This can
> > > race with another CPU doing a set_tsk_thread_flag() and all the other 
> > > flags
> > > can be lost in the process.
> > > 
> > > Move the tcf0 check to enter_from_user_mode() and clear tcf0 in
> > > exit_to_user_mode() to address the problem.
> > > 
> > > > Note: Moving the check to entry-common allows using set_thread_flag(),
> > > > which is safe.
> 
> I've dug into this a bit more, and as set_thread_flag() calls some
> potentially-instrumented helpers I don't think this is safe after all
> (as e.g. those might cause an EL1 exception and clobber the ESR/FAR/etc
> before the EL0 exception handler reads it).
> 
> Making that watertight is pretty hairy, as we either need to open-code
> set_thread_flag() or go rework a load of core code. If we can use STSET
> in the entry asm that'd be simpler, otherwise we'll need something more
> involved.

I hacked this up quickly:

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 9b4d629f7628..25efe83d68a4 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1646,6 +1646,7 @@ config ARM64_AS_HAS_MTE
 config ARM64_MTE
bool "Memory Tagging Extension support"
default y
+   depends on ARM64_LSE_ATOMICS
depends on ARM64_AS_HAS_MTE && ARM64_TAGGED_ADDR_ABI
depends on AS_HAS_ARMV8_5
# Required for tag checking in the uaccess routines
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index a45b4ebbfe7d..ad29892f2974 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -148,16 +148,18 @@ alternative_cb_end
.endm
 
/* Check for MTE asynchronous tag check faults */
-   .macro check_mte_async_tcf, flgs, tmp
+   .macro check_mte_async_tcf, tmp, ti_flags
 #ifdef CONFIG_ARM64_MTE
+   .arch_extension lse
 alternative_if_not ARM64_MTE
b   1f
 alternative_else_nop_endif
mrs_s   \tmp, SYS_TFSRE0_EL1
tbz \tmp, #SYS_TFSR_EL1_TF0_SHIFT, 1f
/* Asynchronous TCF occurred for TTBR0 access, set the TI flag */
-   orr \flgs, \flgs, #_TIF_MTE_ASYNC_FAULT
-   str \flgs, [tsk, #TSK_TI_FLAGS]
+   mov \tmp, #_TIF_MTE_ASYNC_FAULT
+   add \ti_flags, tsk, #TSK_TI_FLAGS
+   stset   \tmp, [\ti_flags]
msr_s   SYS_TFSRE0_EL1, xzr
 1:
 #endif
@@ -244,7 +246,7 @@ alternative_else_nop_endif
disable_step_tsk x19, x20
 
/* Check for asynchronous tag check faults in user space */
-   check_mte_async_tcf x19, x22
+   check_mte_async_tcf x22, x23
apply_ssbd 1, x22, x23
 
ptrauth_keys_install_kernel tsk, x20, x22, x23

-- 
Catalin


Re: [PATCH v3] arm64: mte: Move MTE TCF0 check in entry-common

2021-04-09 Thread Catalin Marinas
On Fri, Apr 09, 2021 at 02:24:19PM +0100, Vincenzo Frascino wrote:
> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index b3c70a612c7a..84a942c25870 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -166,14 +166,43 @@ static void set_gcr_el1_excl(u64 excl)
>*/
>  }
>  
> -void flush_mte_state(void)
> +void noinstr check_mte_async_tcf0(void)

Nitpick: it looks like the naming isn't entirely consistent with your
kernel async patches:

https://lore.kernel.org/linux-arm-kernel/20210315132019.33202-8-vincenzo.frasc...@arm.com/

You could name them mte_check_tfsre0_el1() etc. Also make sure they are
called in similar places in both series.

> +{
> + u64 tcf0;
> +
> + if (!system_supports_mte())
> + return;
> +
> + /*
> +  * dsb(ish) is not required before the register read
> +  * because the TFSRE0_EL1 is automatically synchronized
> +  * by the hardware on exception entry as SCTLR_EL1.ITFSB
> +  * is set.
> +  */
> + tcf0 = read_sysreg_s(SYS_TFSRE0_EL1);
> +
> + if (tcf0 & SYS_TFSR_EL1_TF0)
> + set_thread_flag(TIF_MTE_ASYNC_FAULT);
> +
> + write_sysreg_s(0, SYS_TFSRE0_EL1);

Please move the write_sysreg() inside the 'if' block. If it was 0,
there's no point in a potentially more expensive write.

That said, we only check TFSRE0_EL1 on entry from EL0. Is there a point
in clearing it before we return to EL0? Uaccess routines may set it
anyway.

> +}
> +
> +void noinstr clear_mte_async_tcf0(void)
>  {
>   if (!system_supports_mte())
>   return;
>  
> - /* clear any pending asynchronous tag fault */
>   dsb(ish);
>   write_sysreg_s(0, SYS_TFSRE0_EL1);
> +}

I think Mark suggested on your first version that we should keep these
functions in mte.h so that they can be inlined. They are small and only
called in one or two places.

-- 
Catalin


Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature

2021-04-08 Thread Catalin Marinas
On Thu, Apr 08, 2021 at 08:16:17PM +0200, David Hildenbrand wrote:
> On 08.04.21 16:18, Catalin Marinas wrote:
> > On Wed, Apr 07, 2021 at 04:52:54PM +0100, Steven Price wrote:
> > > On 07/04/2021 16:14, Catalin Marinas wrote:
> > > > On Wed, Apr 07, 2021 at 11:20:18AM +0100, Steven Price wrote:
> > > > > On 31/03/2021 19:43, Catalin Marinas wrote:
> > > > > > When a slot is added by the VMM, if it asked for MTE in guest (I 
> > > > > > guess
> > > > > > that's an opt-in by the VMM, haven't checked the other patches), 
> > > > > > can we
> > > > > > reject it if it is going to be mapped as Normal Cacheable but it 
> > > > > > is a
> > > > > > ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of David's suggestions 
> > > > > > to
> > > > > > check for ZONE_DEVICE)? This way we don't need to do more expensive
> > > > > > checks in set_pte_at().
> > > > > 
> > > > > The problem is that KVM allows the VMM to change the memory backing a 
> > > > > slot
> > > > > while the guest is running. This is obviously useful for the likes of
> > > > > migration, but ultimately means that even if you were to do checks at 
> > > > > the
> > > > > time of slot creation, you would need to repeat the checks at 
> > > > > set_pte_at()
> > > > > time to ensure a mischievous VMM didn't swap the page for a 
> > > > > problematic one.
> > > > 
> > > > Does changing the slot require some KVM API call? Can we intercept it
> > > > and do the checks there?
> > > 
> > > As David has already replied - KVM uses MMU notifiers, so there's not 
> > > really
> > > a good place to intercept this before the fault.
> > > 
> > > > Maybe a better alternative for the time being is to add a new
> > > > kvm_is_zone_device_pfn() and force KVM_PGTABLE_PROT_DEVICE if it returns
> > > > true _and_ the VMM asked for MTE in guest. We can then only set
> > > > PG_mte_tagged if !device.
> > > 
> > > KVM already has a kvm_is_device_pfn(), and yes I agree restricting the MTE
> > > checks to only !kvm_is_device_pfn() makes sense (I have the fix in my 
> > > branch
> > > locally).
> > 
> > Indeed, you can skip it if kvm_is_device_pfn(). In addition, with MTE,
> > I'd also mark a pfn as 'device' in user_mem_abort() if
> > pfn_to_online_page() is NULL as we don't want to map it as Cacheable in
> > Stage 2. It's unlikely that we'll trip over this path but just in case.
> > 
> > (can we have a ZONE_DEVICE _online_ pfn or by definition they are
> > considered offline?)
> 
> By definition (and implementation) offline. When you get a page =
> pfn_to_online_page() with page != NULL, that one should never be ZONE_DEVICE
> (otherwise it would be a BUG).
> 
> As I said, things are different when exposing dax memory via dax/kmem to the
> buddy. But then, we are no longer talking about ZONE_DEVICE.

Thanks David, it's clear now.

-- 
Catalin


Re: [PATCH] arm64: mte: Remove unused mte_assign_mem_tag_range()

2021-04-08 Thread Catalin Marinas
On Wed, 7 Apr 2021 14:38:17 +0100, Vincenzo Frascino wrote:
> mte_assign_mem_tag_range() was added in commit 85f49cae4dfc
> ("arm64: mte: add in-kernel MTE helpers") in 5.11 but moved out of
> mte.S by commit 2cb34276427a ("arm64: kasan: simplify and inline
> MTE functions") in 5.12 and renamed to mte_set_mem_tag_range().
> 2cb34276427a did not delete the old function prototypes in mte.h.
> 
> Remove the unused prototype from mte.h.

Applied to arm64 (for-next/misc), thanks!

[1/1] arm64: mte: Remove unused mte_assign_mem_tag_range()
  https://git.kernel.org/arm64/c/df652a16a657

-- 
Catalin



Re: [PATCH v2] arm64: Add __init section marker to some functions

2021-04-08 Thread Catalin Marinas
On Tue, 30 Mar 2021 13:54:49 +0800, Jisheng Zhang wrote:
> They are not needed after booting, so mark them as __init to move them
> to the .init section.

Applied to arm64 (for-next/misc), thanks!

[1/1] arm64: Add __init section marker to some functions
  https://git.kernel.org/arm64/c/a7dcf58ae5d2

-- 
Catalin



Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature

2021-04-08 Thread Catalin Marinas
On Wed, Apr 07, 2021 at 04:52:54PM +0100, Steven Price wrote:
> On 07/04/2021 16:14, Catalin Marinas wrote:
> > On Wed, Apr 07, 2021 at 11:20:18AM +0100, Steven Price wrote:
> > > On 31/03/2021 19:43, Catalin Marinas wrote:
> > > > When a slot is added by the VMM, if it asked for MTE in guest (I guess
> > > > that's an opt-in by the VMM, haven't checked the other patches), can we
> > > > reject it if it is going to be mapped as Normal Cacheable but it is a
> > > > ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of David's suggestions to
> > > > check for ZONE_DEVICE)? This way we don't need to do more expensive
> > > > checks in set_pte_at().
> > > 
> > > The problem is that KVM allows the VMM to change the memory backing a slot
> > > while the guest is running. This is obviously useful for the likes of
> > > migration, but ultimately means that even if you were to do checks at the
> > > time of slot creation, you would need to repeat the checks at set_pte_at()
> > > time to ensure a mischievous VMM didn't swap the page for a problematic 
> > > one.
> > 
> > Does changing the slot require some KVM API call? Can we intercept it
> > and do the checks there?
> 
> As David has already replied - KVM uses MMU notifiers, so there's not really
> a good place to intercept this before the fault.
> 
> > Maybe a better alternative for the time being is to add a new
> > kvm_is_zone_device_pfn() and force KVM_PGTABLE_PROT_DEVICE if it returns
> > true _and_ the VMM asked for MTE in guest. We can then only set
> > PG_mte_tagged if !device.
> 
> KVM already has a kvm_is_device_pfn(), and yes I agree restricting the MTE
> checks to only !kvm_is_device_pfn() makes sense (I have the fix in my branch
> locally).

Indeed, you can skip it if kvm_is_device_pfn(). In addition, with MTE,
I'd also mark a pfn as 'device' in user_mem_abort() if
pfn_to_online_page() is NULL as we don't want to map it as Cacheable in
Stage 2. It's unlikely that we'll trip over this path but just in case.
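
Roughly (untested, and assuming the existing 'device' local in
user_mem_abort()):

	if (kvm_has_mte(kvm) && !pfn_to_online_page(pfn))
		device = true;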

(can we have a ZONE_DEVICE _online_ pfn or by definition they are
considered offline?)

> > BTW, after a page is restored from swap, how long do we keep the
> > metadata around? I think we can delete it as soon as it was restored and
> > PG_mte_tagged was set. Currently it looks like we only do this when the
> > actual page was freed or swapoff. I haven't convinced myself that it's
> > safe to do this for swapoff unless it guarantees that all the ptes
> > sharing a page have been restored.
> 
> My initial thought was to free the metadata immediately. However it turns
> out that the following sequence can happen:
> 
>  1. Swap out a page
>  2. Swap the page in *read only*
>  3. Discard the page
>  4. Swap the page in again
> 
> So there's no writing of the swap data again before (3). This works nicely
> with a swap device because after writing a page it stays there forever, so
> if you know it hasn't been modified it's pointless rewriting it. Sadly it's
> not quite so ideal with the MTE tags which are currently kept in RAM.

I missed this scenario. So we need to keep it around as long as the
corresponding swap storage is still valid.

-- 
Catalin


Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature

2021-04-07 Thread Catalin Marinas
On Wed, Apr 07, 2021 at 11:20:18AM +0100, Steven Price wrote:
> On 31/03/2021 19:43, Catalin Marinas wrote:
> > When a slot is added by the VMM, if it asked for MTE in guest (I guess
> > that's an opt-in by the VMM, haven't checked the other patches), can we
> > reject it if it is going to be mapped as Normal Cacheable but it is a
> > ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of David's suggestions to
> > check for ZONE_DEVICE)? This way we don't need to do more expensive
> > checks in set_pte_at().
> 
> The problem is that KVM allows the VMM to change the memory backing a slot
> while the guest is running. This is obviously useful for the likes of
> migration, but ultimately means that even if you were to do checks at the
> time of slot creation, you would need to repeat the checks at set_pte_at()
> time to ensure a mischievous VMM didn't swap the page for a problematic one.

Does changing the slot require some KVM API call? Can we intercept it
and do the checks there?

Maybe a better alternative for the time being is to add a new
kvm_is_zone_device_pfn() and force KVM_PGTABLE_PROT_DEVICE if it returns
true _and_ the VMM asked for MTE in guest. We can then only set
PG_mte_tagged if !device.
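
Only to show the idea (completely untested, the helper body is just a
guess):

	static bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
	{
		return pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn));
	}

and in user_mem_abort():

	if (kvm_has_mte(kvm) && kvm_is_zone_device_pfn(pfn))
		device = true;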

We can later relax this further to Normal Non-cacheable for ZONE_DEVICE
memory (via a new KVM_PGTABLE_PROT_NORMAL_NC) or even Normal Cacheable
if we manage to change the behaviour of the architecture.

> > We could add another PROT_TAGGED or something which means PG_mte_tagged
> > set but still mapped as Normal Untagged. It's just that we are short of
> > pte bits for another flag.
> 
> That could help here - although it's slightly odd as you're asking the
> kernel to track the tags, but not allowing user space (direct) access to
> them. Like you say using us the precious bits for this seems like it might
> be short-sighted.

Yeah, let's scrap this idea. We set PG_mte_tagged in user_mem_abort(),
so we already know it's a page potentially containing tags. On
restoring from swap, we need to check for MTE metadata irrespective of
whether the user pte is tagged or not, as you already did in patch 1.
I'll get back to that and look at the potential races.

BTW, after a page is restored from swap, how long do we keep the
metadata around? I think we can delete it as soon as it was restored and
PG_mte_tagged was set. Currently it looks like we only do this when the
actual page was freed or swapoff. I haven't convinced myself that it's
safe to do this for swapoff unless it guarantees that all the ptes
sharing a page have been restored.

-- 
Catalin


Re: [PATCH 09/20] kbuild: arm64: use common install script

2021-04-07 Thread Catalin Marinas
On Wed, Apr 07, 2021 at 07:34:08AM +0200, Greg Kroah-Hartman wrote:
> The common scripts/install.sh script will now work for arm64, no changes
> are needed, so convert the arm64 boot Makefile to call it instead of the
> arm64-only version of the file and remove the now unused file.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: linux-arm-ker...@lists.infradead.org
> Signed-off-by: Greg Kroah-Hartman 

Assuming that it does the same thing (it seems to from a quick look):

Acked-by: Catalin Marinas 


Re: [PATCH] arm64/mm: Fix mapping_dirty_helpers with arm64

2021-04-02 Thread Catalin Marinas
On Wed, Mar 31, 2021 at 02:23:44PM -0400, Zack Rusin wrote:
> The pagetable walk callbacks in mm/mapping_dirty_helpers.c depend
> on a set of helpers from which pud_dirty(pud) was missing. I'm
> assuming mapping_dirty_helpers weren't used on ARM64 before
> because the missing pud_dirty is causing a compilation error.
> 
> The drivers/gpu/drm/vmwgfx code uses mapping_dirty_helpers and
> has been ported to ARM64 but it depends on this code getting in
> first in order to compile/work.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Peter Zijlstra 
> Cc: linux-arm-ker...@lists.infradead.org
> Signed-off-by: Zack Rusin 
> ---
>  arch/arm64/include/asm/pgtable.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 47027796c2f9..ecd80f87a996 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -451,6 +451,7 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
>  #define pfn_pmd(pfn,prot)  __pmd(__phys_to_pmd_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
>  #define mk_pmd(page,prot)  pfn_pmd(page_to_pfn(page),prot)
>  
> +#define pud_dirty(pud)   pte_dirty(pud_pte(pud))
>  #define pud_young(pud)   pte_young(pud_pte(pud))
>  #define pud_mkyoung(pud) pte_pud(pte_mkyoung(pud_pte(pud)))
>  #define pud_write(pud)   pte_write(pud_pte(pud))

I think pud_dirty() should only be defined if
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD which is not the case on
arm64. Well, the pud_young/mkyoung/write should probably also be removed
until we have pud pages on arm64.

The wp_clean_pud_entry() function bails out early since
pud_trans_unstable() returns 0 on arm64. So we either bracket this
function by some #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD or we
make some generic definitions for pud_* accessors when we don't have
this config.
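
I.e. something along these lines on the arm64 side (sketch only):

#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
#define pud_dirty(pud)     pte_dirty(pud_pte(pud))
#define pud_young(pud)     pte_young(pud_pte(pud))
#define pud_mkyoung(pud)   pte_pud(pte_mkyoung(pud_pte(pud)))
#define pud_write(pud)     pte_write(pud_pte(pud))
#endif

with the core code growing either the equivalent #ifdef around
wp_clean_pud_entry() or generic fallbacks for the pud_* accessors.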

-- 
Catalin


Re: [PATCH -next] docs: arm64: Fix a grammar error

2021-04-01 Thread Catalin Marinas
On Tue, 30 Mar 2021 04:58:17 -0400, He Ying wrote:
> depending -> depending on

Applied to arm64 (for-next/misc), thanks!

[1/1] docs: arm64: Fix a grammar error
  https://git.kernel.org/arm64/c/68f638a432df

-- 
Catalin



Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature

2021-03-31 Thread Catalin Marinas
On Wed, Mar 31, 2021 at 11:41:20AM +0100, Steven Price wrote:
> On 31/03/2021 10:32, David Hildenbrand wrote:
> > On 31.03.21 11:21, Catalin Marinas wrote:
> > > On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote:
> > > > On 30.03.21 12:30, Catalin Marinas wrote:
> > > > > On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote:
> > > > > > On 28/03/2021 13:21, Catalin Marinas wrote:
> > > > > > > However, the bigger issue is that Stage 2 cannot disable
> > > > > > > tagging for Stage 1 unless the memory is Non-cacheable or
> > > > > > > Device at S2. Is there a way to detect what gets mapped in
> > > > > > > the guest as Normal Cacheable memory and make sure it's
> > > > > > > only early memory or hotplug but no ZONE_DEVICE (or
> > > > > > > something else like on-chip memory)?  If we can't
> > > > > > > guarantee that all Cacheable memory given to a guest
> > > > > > > supports tags, we should disable the feature altogether.
> > > > > > 
> > > > > > In stage 2 I believe we only have two types of mapping -
> > > > > > 'normal' or DEVICE_nGnRE (see stage2_map_set_prot_attr()).
> > > > > > Filtering out the latter is a case of checking the 'device'
> > > > > > variable, and makes sense to avoid the overhead you
> > > > > > describe.
> > > > > > 
> > > > > > This should also guarantee that all stage-2 cacheable
> > > > > > memory supports tags,
> > > > > > as kvm_is_device_pfn() is simply !pfn_valid(), and
> > > > > > pfn_valid() should only
> > > > > > be true for memory that Linux considers "normal".
> > > > 
> > > > If you think "normal" == "normal System RAM", that's wrong; see
> > > > below.
> > > 
> > > By "normal" I think both Steven and I meant the Normal Cacheable memory
> > > attribute (another being the Device memory attribute).
> 
> Sadly there's no good standardised terminology here. Aarch64 provides the
> "normal (cacheable)" definition. Memory which is mapped as "Normal
> Cacheable" is implicitly MTE capable when shared with a guest (because the
> stage 2 mappings don't allow restricting MTE other than mapping it as Device
> memory).
> 
> So MTE also forces us to have a definition of memory which is "bog standard
> memory"[1] separate from the mapping attributes. This is the main memory
> which fully supports MTE.
> 
> Separate from the "bog standard" we have the "special"[1] memory, e.g.
> ZONE_DEVICE memory may be mapped as "Normal Cacheable" at stage 1 but that
> memory may not support MTE tags. This memory can only be safely shared with
> a guest in the following situations:
> 
>  1. MTE is completely disabled for the guest
> 
>  2. The stage 2 mappings are 'device' (e.g. DEVICE_nGnRE)
> 
>  3. We have some guarantee that guest MTE access are in some way safe.
> 
> (1) is the situation today (without this patch series). But it prevents the
> guest from using MTE in any form.
> 
> (2) is pretty terrible for general memory, but is the get-out clause for
> mapping devices into the guest.
> 
> (3) isn't something we have any architectural way of discovering. We'd need
> to know what the device did with the MTE accesses (and any caches between
> the CPU and the device) to ensure there aren't any side-channels or h/w
> lockup issues. We'd also need some way of describing this memory to the
> guest.
> 
> So at least for the time being the approach is to avoid letting a guest with
> MTE enabled have access to this sort of memory.

When a slot is added by the VMM, if it asked for MTE in guest (I guess
that's an opt-in by the VMM, haven't checked the other patches), can we
reject it if it is going to be mapped as Normal Cacheable but it is a
ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of David's suggestions to
check for ZONE_DEVICE)? This way we don't need to do more expensive
checks in set_pte_at().

We could simplify the set_pte_at() further if we require that the VMM
has a PROT_MTE mapping. This does not mean it cannot have two mappings,
the other without PROT_MTE. But at least we get a set_pte_at() when
swapping in which has PROT_MTE.

We could add another PROT_TAGGED or something which means PG_mte_tagged
set but still mapped as Normal Untagged. It's just that we are short of
pte bits for another flag.

Can we somehow identify when the S2 pte is set and can we get access to
the prior swap pte? This way we could avoid changes to set_pte_at() for
S2 faults.

-- 
Catalin


Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature

2021-03-31 Thread Catalin Marinas
On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote:
> On 30.03.21 12:30, Catalin Marinas wrote:
> > On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote:
> > > On 28/03/2021 13:21, Catalin Marinas wrote:
> > > > On Sat, Mar 27, 2021 at 03:23:24PM +, Catalin Marinas wrote:
> > > > > On Fri, Mar 12, 2021 at 03:18:58PM +, Steven Price wrote:
> > > > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > > > > index 77cb2d28f2a4..b31b7a821f90 100644
> > > > > > --- a/arch/arm64/kvm/mmu.c
> > > > > > +++ b/arch/arm64/kvm/mmu.c
> > > > > > @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu 
> > > > > > *vcpu, phys_addr_t fault_ipa,
> > > > > > if (vma_pagesize == PAGE_SIZE && !force_pte)
> > > > > > vma_pagesize = transparent_hugepage_adjust(memslot, hva,
> > > > > >    &pfn, &fault_ipa);
> > > > > > +
> > > > > > +   if (fault_status != FSC_PERM && kvm_has_mte(kvm) && 
> > > > > > pfn_valid(pfn)) {
> > > > > > +   /*
> > > > > > +* VM will be able to see the page's tags, so we must 
> > > > > > ensure
> > > > > > +* they have been initialised. if PG_mte_tagged is set, 
> > > > > > tags
> > > > > > +* have already been initialised.
> > > > > > +*/
> > > > > > +   struct page *page = pfn_to_page(pfn);
> > > > > > +   unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
> > > > > > +
> > > > > > +   for (i = 0; i < nr_pages; i++, page++) {
> > > > > > +   if (!test_and_set_bit(PG_mte_tagged, 
> > > > > > >flags))
> > > > > > +   mte_clear_page_tags(page_address(page));
> > > > > > +   }
> > > > > > +   }
> > > > > 
> > > > > This pfn_valid() check may be problematic. Following commit 
> > > > > eeb0753ba27b
> > > > > ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns
> > > > > true for ZONE_DEVICE memory but such memory is allowed not to support
> > > > > MTE.
> > > > 
> > > > Some more thinking, this should be safe as any ZONE_DEVICE would be
> > > > mapped as untagged memory in the kernel linear map. It could be slightly
> > > > inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE,
> > > > untagged memory. Another overhead is pfn_valid() which will likely end
> > > > up calling memblock_is_map_memory().
> > > > 
> > > > However, the bigger issue is that Stage 2 cannot disable tagging for
> > > > Stage 1 unless the memory is Non-cacheable or Device at S2. Is there a
> > > > way to detect what gets mapped in the guest as Normal Cacheable memory
> > > > and make sure it's only early memory or hotplug but no ZONE_DEVICE (or
> > > > something else like on-chip memory)?  If we can't guarantee that all
> > > > Cacheable memory given to a guest supports tags, we should disable the
> > > > feature altogether.
> > > 
> > > In stage 2 I believe we only have two types of mapping - 'normal' or
> > > DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the latter 
> > > is a
> > > case of checking the 'device' variable, and makes sense to avoid the
> > > overhead you describe.
> > > 
> > > This should also guarantee that all stage-2 cacheable memory supports 
> > > tags,
> > > as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() should only
> > > be true for memory that Linux considers "normal".
> 
> If you think "normal" == "normal System RAM", that's wrong; see below.

By "normal" I think both Steven and I meant the Normal Cacheable memory
attribute (another being the Device memory attribute).

> > That's the problem. With Anshuman's commit I mentioned above,
> > pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent
> > memory, not talking about some I/O mapping that requires Device_nGnRE).
> > So kvm_is_device_pfn() is false for such memory and it may be mapped as
> > Normal but it is not guaranteed to support tagging.
> 
> pfn_valid() means "there is a struct page"; if you do pfn_to_page() and
> touch the page, you won't fault. So Anshuman's commit is correct.

I agree.

> pfn_to_online_page() means, "there is a struct page and it's system RAM
> that's in use; the memmap has a sane content"

Does pfn_to_online_page() return a valid struct page pointer for
ZONE_DEVICE pages? IIUC, these are not guaranteed to be system RAM, for
some definition of system RAM (I assume NVDIMM != system RAM). For
example, pmem_attach_disk() calls devm_memremap_pages() and this would
use the Normal Cacheable memory attribute without necessarily being
system RAM.

So if pfn_valid() is not equivalent to system RAM, we have a potential
issue with MTE. Even if "system RAM" includes NVDIMMs, we still have
this issue and we may need a new term to describe MTE-safe memory. In
the kernel we assume MTE-safe all pages that can be mapped as
MAP_ANONYMOUS and I don't think these include ZONE_DEVICE pages.

Thanks.

-- 
Catalin


Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature

2021-03-30 Thread Catalin Marinas
On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote:
> On 28/03/2021 13:21, Catalin Marinas wrote:
> > On Sat, Mar 27, 2021 at 03:23:24PM +, Catalin Marinas wrote:
> > > On Fri, Mar 12, 2021 at 03:18:58PM +, Steven Price wrote:
> > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > > index 77cb2d28f2a4..b31b7a821f90 100644
> > > > --- a/arch/arm64/kvm/mmu.c
> > > > +++ b/arch/arm64/kvm/mmu.c
> > > > @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> > > > phys_addr_t fault_ipa,
> > > > if (vma_pagesize == PAGE_SIZE && !force_pte)
> > > > vma_pagesize = transparent_hugepage_adjust(memslot, hva,
> > > >    &pfn, &fault_ipa);
> > > > +
> > > > +   if (fault_status != FSC_PERM && kvm_has_mte(kvm) && 
> > > > pfn_valid(pfn)) {
> > > > +   /*
> > > > +* VM will be able to see the page's tags, so we must 
> > > > ensure
> > > > +* they have been initialised. if PG_mte_tagged is set, 
> > > > tags
> > > > +* have already been initialised.
> > > > +*/
> > > > +   struct page *page = pfn_to_page(pfn);
> > > > +   unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
> > > > +
> > > > +   for (i = 0; i < nr_pages; i++, page++) {
> > > > +   if (!test_and_set_bit(PG_mte_tagged, &page->flags))
> > > > +   mte_clear_page_tags(page_address(page));
> > > > +   }
> > > > +   }
> > > 
> > > This pfn_valid() check may be problematic. Following commit eeb0753ba27b
> > > ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns
> > > true for ZONE_DEVICE memory but such memory is allowed not to support
> > > MTE.
> > 
> > Some more thinking, this should be safe as any ZONE_DEVICE would be
> > mapped as untagged memory in the kernel linear map. It could be slightly
> > inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE,
> > untagged memory. Another overhead is pfn_valid() which will likely end
> > up calling memblock_is_map_memory().
> > 
> > However, the bigger issue is that Stage 2 cannot disable tagging for
> > Stage 1 unless the memory is Non-cacheable or Device at S2. Is there a
> > way to detect what gets mapped in the guest as Normal Cacheable memory
> > and make sure it's only early memory or hotplug but no ZONE_DEVICE (or
> > something else like on-chip memory)?  If we can't guarantee that all
> > Cacheable memory given to a guest supports tags, we should disable the
> > feature altogether.
> 
> In stage 2 I believe we only have two types of mapping - 'normal' or
> DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the latter is a
> case of checking the 'device' variable, and makes sense to avoid the
> overhead you describe.
> 
> This should also guarantee that all stage-2 cacheable memory supports tags,
> as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() should only
> be true for memory that Linux considers "normal".

That's the problem. With Anshuman's commit I mentioned above,
pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent
memory, not talking about some I/O mapping that requires Device_nGnRE).
So kvm_is_device_pfn() is false for such memory and it may be mapped as
Normal but it is not guaranteed to support tagging.

For user MTE, we get away with this as the MAP_ANONYMOUS requirement
would filter it out while arch_add_memory() will ensure it's mapped as
untagged in the linear map. See another recent fix for hotplugged
memory: d15dfd31384b ("arm64: mte: Map hotplugged memory as Normal
Tagged"). We needed to ensure that ZONE_DEVICE doesn't end up as tagged,
only hoplugged memory. Both handled via arch_add_memory() in the arch
code with ZONE_DEVICE starting at devm_memremap_pages().

> > > I now wonder if we can get a MAP_ANONYMOUS mapping of ZONE_DEVICE pfn
> > > even without virtualisation.
> > 
> > I haven't checked all the code paths but I don't think we can get a
> > MAP_ANONYMOUS mapping of ZONE_DEVICE memory as we normally need a file
> > descriptor.
> 
> I certainly hope this is the case - it's the weird corner cases of device
> drivers that worry me. E.g. I know i915 has a "hidden" mmap behind an ioctl
> (see i915_gem_mmap_ioctl(), although this case is fine - it's MAP_SHARED).
> Mali's kbase did something similar in the past.

I think this should be fine since it's not a MAP_ANONYMOUS (we do allow
MAP_SHARED to be tagged).

-- 
Catalin


Re: [PATCH v10 1/6] arm64: mte: Sync tags for pages where PTE is untagged

2021-03-30 Thread Catalin Marinas
On Mon, Mar 29, 2021 at 04:55:29PM +0100, Steven Price wrote:
> On 26/03/2021 18:56, Catalin Marinas wrote:
> > On Fri, Mar 12, 2021 at 03:18:57PM +, Steven Price wrote:
> > > A KVM guest could store tags in a page even if the VMM hasn't mapped
> > > the page with PROT_MTE. So when restoring pages from swap we will
> > > need to check to see if there are any saved tags even if !pte_tagged().
> > > 
> > > However don't check pages which are !pte_valid_user() as these will
> > > not have been swapped out.
> > > 
> > > Signed-off-by: Steven Price 
> > > ---
> > >   arch/arm64/include/asm/pgtable.h |  2 +-
> > >   arch/arm64/kernel/mte.c  | 16 
> > >   2 files changed, 13 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/arch/arm64/include/asm/pgtable.h 
> > > b/arch/arm64/include/asm/pgtable.h
> > > index e17b96d0e4b5..84166625c989 100644
> > > --- a/arch/arm64/include/asm/pgtable.h
> > > +++ b/arch/arm64/include/asm/pgtable.h
> > > @@ -312,7 +312,7 @@ static inline void set_pte_at(struct mm_struct *mm, 
> > > unsigned long addr,
> > >   __sync_icache_dcache(pte);
> > >   if (system_supports_mte() &&
> > > - pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
> > > + pte_present(pte) && pte_valid_user(pte) && !pte_special(pte))
> > >   mte_sync_tags(ptep, pte);
> > 
> > With the EPAN patches queued in for-next/epan, pte_valid_user()
> > disappeared as its semantics weren't very clear.
> 
> Thanks for pointing that out.
> 
> > So this relies on the set_pte_at() being done on the VMM address space.
> > I wonder, if the VMM did an mprotect(PROT_NONE), can the VM still access
> > it via stage 2? If yes, the pte_valid_user() test wouldn't work. We need
> > something like pte_present() && addr <= user_addr_max().
> 
> AFAIUI the stage 2 matches the VMM's address space (for the subset that has
> memslots). So mprotect(PROT_NONE) would cause the stage 2 mapping to be
> invalidated and a subsequent fault would exit to the VMM to sort out. This
> sort of thing is done for the lazy migration use case (i.e. pages are
> fetched as the VM tries to access them).

There's also the protected KVM case which IIUC wouldn't provide any
mapping of the guest memory to the host (or maybe the host still thinks
it's there but cannot access it without a Stage 2 fault). At least in
this case it wouldn't swap pages out and it would be the responsibility
of the EL2 code to clear the tags when giving pages to the guest
(user_mem_abort() must not touch the page).

So basically we either have a valid, accessible mapping in the VMM and
we can handle the tags via set_pte_at() or we leave it to whatever is
running at EL2 in the pKVM case.

I don't remember whether we had a clear conclusion in the past: have we
ruled out requiring the VMM to map the guest memory with PROT_MTE
entirely? IIRC a potential problem was the VMM using MTE itself and
having to disable it when accessing the guest memory.

Another potential issue (I haven't got my head around it yet) is a race
in mte_sync_tags() as we now defer the PG_mte_tagged bit setting until
after the tags had been restored. Can we have the same page mapped by
two ptes, each attempting to restore it from swap and one gets it first
and starts modifying it? Given that we set the actual pte after setting
PG_mte_tagged, it's probably alright but I think we miss some barriers.

Also, if a page is not a swap one, we currently clear the tags if mapped
as pte_tagged() (prior to this patch). We'd need something similar when
mapping it in the guest so that we don't leak tags but to avoid any page
ending up with PG_mte_tagged, I think you moved the tag clearing to
user_mem_abort() in the KVM code. I presume set_pte_at() in the VMM
would be called first and then set in Stage 2.

> > BTW, ignoring virtualisation, can we ever bring a page in from swap on a
> > PROT_NONE mapping (say fault-around)? It's not too bad if we keep the
> > metadata around for when the pte becomes accessible but I suspect we
> > remove it if the page is removed from swap.
> 
> There are two stages of bringing data from swap. First is populating the
> swap cache by doing the physical read from swap. The second is actually
> restoring the page table entries.

When is the page metadata removed? I want to make sure we don't drop it
for some pte attributes.

-- 
Catalin


Re: [PATCH v4 0/5] arm64: kasan: support CONFIG_KASAN_VMALLOC

2021-03-29 Thread Catalin Marinas
On Wed, 24 Mar 2021 12:05:17 +0800, Lecopzer Chen wrote:
> Linux supports KAsan for VMALLOC since commit 3c5c3cfb9ef4da9
> ("kasan: support backing vmalloc space with real shadow memory")
> 
> According to how x86 ported it [1], they allocated the p4d and pgd early,
> but for arm64 I just simulate how KAsan supports MODULES_VADDR by not
> populating the vmalloc area except for the kimg address.
> 
> [...]

Applied to arm64 (for-next/kasan-vmalloc), thanks!

[1/5] arm64: kasan: don't populate vmalloc area for CONFIG_KASAN_VMALLOC
  https://git.kernel.org/arm64/c/9a0732efa774
[2/5] arm64: kasan: abstract _text and _end to KERNEL_START/END
  https://git.kernel.org/arm64/c/7d7b88ff5f8f
[3/5] arm64: Kconfig: support CONFIG_KASAN_VMALLOC
  https://git.kernel.org/arm64/c/71b613fc0c69
[4/5] arm64: kaslr: support randomized module area with KASAN_VMALLOC
  https://git.kernel.org/arm64/c/31d02e7ab008
[5/5] arm64: Kconfig: select KASAN_VMALLOC if KANSAN_GENERIC is enabled
  https://git.kernel.org/arm64/c/acc3042d62cb

-- 
Catalin



Re: [PATCH v2 -next] arm64: smp: Add missing prototype for some smp.c functions

2021-03-29 Thread Catalin Marinas
On Mon, 29 Mar 2021 11:43:43 +0800, Chen Lifu wrote:
> In commit eb631bb5bf5b
> ("arm64: Support arch_irq_work_raise() via self IPIs") a new
> function "arch_irq_work_raise" was added without a prototype.
> 
> In commit d914d4d49745
> ("arm64: Implement panic_smp_self_stop()") a new
> function "panic_smp_self_stop" was added without a prototype.
> 
> [...]

Applied to arm64 (for-next/misc), thanks!

[1/1] arm64: smp: Add missing prototype for some smp.c functions
  https://git.kernel.org/arm64/c/a52ef778ff28

-- 
Catalin



Re: linux-next: manual merge of the arm64 tree with Linus' tree

2021-03-29 Thread Catalin Marinas
On Mon, Mar 29, 2021 at 09:29:40AM +1100, Stephen Rothwell wrote:
> diff --cc arch/arm64/include/asm/cpucaps.h
> index c40f2490cd7b,9e3ec4dd56d8..
> --- a/arch/arm64/include/asm/cpucaps.h
> +++ b/arch/arm64/include/asm/cpucaps.h
> @@@ -66,8 -66,8 +66,9 @@@
>   #define ARM64_WORKAROUND_1508412  58
>   #define ARM64_HAS_LDAPR 59
>   #define ARM64_KVM_PROTECTED_MODE  60
>  -#define ARM64_HAS_EPAN  61
>  +#define ARM64_WORKAROUND_NVIDIA_CARMEL_CNP  61
> ++#define ARM64_HAS_EPAN  62
>   
> --#define ARM64_NCAPS 62
> ++#define ARM64_NCAPS 63
>   
>   #endif /* __ASM_CPUCAPS_H */

Thanks Stephen, it looks fine.

-- 
Catalin


Re: [PATCH -next] arm64: smp: Add missing prototype for some smp.c functions

2021-03-28 Thread Catalin Marinas
On Sat, Mar 27, 2021 at 03:06:51PM +0800, Chen Lifu wrote:
> In commit eb631bb5bf5b
> ("arm64: Support arch_irq_work_raise() via self IPIs") a new
> function "arch_irq_work_raise" was added without a prototype
> in header irq_work.h
> 
> In commit d914d4d49745
> ("arm64: Implement panic_smp_self_stop()") a new
> function "panic_smp_self_stop" was added without a prototype
> in header irq_work.h
> 
> We get the following warnings on W=1:
> arch/arm64/kernel/smp.c:842:6: warning: no previous prototype
> for ‘arch_irq_work_raise’ [-Wmissing-prototypes]
> arch/arm64/kernel/smp.c:862:6: warning: no previous prototype
> for ‘panic_smp_self_stop’ [-Wmissing-prototypes]
> 
> Fix the same by adding the missing prototype in header irq_work.h
> 
> Signed-off-by: Chen Lifu 
> ---
>  arch/arm64/include/asm/irq_work.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/irq_work.h 
> b/arch/arm64/include/asm/irq_work.h
> index a1020285ea75..f766e2190e7c 100644
> --- a/arch/arm64/include/asm/irq_work.h
> +++ b/arch/arm64/include/asm/irq_work.h
> @@ -2,6 +2,9 @@
>  #ifndef __ASM_IRQ_WORK_H
>  #define __ASM_IRQ_WORK_H
>  
> +extern void arch_irq_work_raise(void);
> +extern void panic_smp_self_stop(void);

The second prototype makes more sense in arch/arm64/include/asm/smp.h
where we already have crash_smp_send_stop().
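
I.e. something like this (untested, hunk context elided):

--- a/arch/arm64/include/asm/smp.h
+++ b/arch/arm64/include/asm/smp.h
@@ ... @@
 extern void crash_smp_send_stop(void);
+extern void panic_smp_self_stop(void);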

-- 
Catalin


Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature

2021-03-28 Thread Catalin Marinas
On Sat, Mar 27, 2021 at 03:23:24PM +, Catalin Marinas wrote:
> On Fri, Mar 12, 2021 at 03:18:58PM +, Steven Price wrote:
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 77cb2d28f2a4..b31b7a821f90 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> > phys_addr_t fault_ipa,
> > if (vma_pagesize == PAGE_SIZE && !force_pte)
> > vma_pagesize = transparent_hugepage_adjust(memslot, hva,
> >    &pfn, &fault_ipa);
> > +
> > +   if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) {
> > +   /*
> > +* VM will be able to see the page's tags, so we must ensure
> > +* they have been initialised. if PG_mte_tagged is set, tags
> > +* have already been initialised.
> > +*/
> > +   struct page *page = pfn_to_page(pfn);
> > +   unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
> > +
> > +   for (i = 0; i < nr_pages; i++, page++) {
> > +   if (!test_and_set_bit(PG_mte_tagged, &page->flags))
> > +   mte_clear_page_tags(page_address(page));
> > +   }
> > +   }
> 
> This pfn_valid() check may be problematic. Following commit eeb0753ba27b
> ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns
> true for ZONE_DEVICE memory but such memory is allowed not to support
> MTE.

Some more thinking, this should be safe as any ZONE_DEVICE would be
mapped as untagged memory in the kernel linear map. It could be slightly
inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE,
untagged memory. Another overhead is pfn_valid() which will likely end
up calling memblock_is_map_memory().

However, the bigger issue is that Stage 2 cannot disable tagging for
Stage 1 unless the memory is Non-cacheable or Device at S2. Is there a
way to detect what gets mapped in the guest as Normal Cacheable memory
and make sure it's only early memory or hotplug but no ZONE_DEVICE (or
something else like on-chip memory)?  If we can't guarantee that all
Cacheable memory given to a guest supports tags, we should disable the
feature altogether.

> I now wonder if we can get a MAP_ANONYMOUS mapping of ZONE_DEVICE pfn
> even without virtualisation.

I haven't checked all the code paths but I don't think we can get a
MAP_ANONYMOUS mapping of ZONE_DEVICE memory as we normally need a file
descriptor.

-- 
Catalin


Re: [PATCH] arm64: move --fix-cortex-a53-843419 linker test to Kconfig

2021-03-28 Thread Catalin Marinas
On Sun, Mar 28, 2021 at 03:59:29PM +0900, Masahiro Yamada wrote:
> On Fri, Mar 26, 2021 at 11:36 PM Catalin Marinas
>  wrote:
> > On Wed, Mar 24, 2021 at 04:11:28PM +0900, Masahiro Yamada wrote:
> > > $(call ld-option, --fix-cortex-a53-843419) in arch/arm64/Makefile is
> > > evaluated every time even for Make targets that do not need the linker,
> > > such as "make ARCH=arm64 install".
> > >
> > > Recently, the Kbuild tree queued up a patch to avoid needless
> > > compiler/linker flag evaluation. I believe it is a good improvement
> > > in itself, but it causes a false-positive warning for arm64 installation
> > > in linux-next. (Thanks to Nathan for the report)
> > >
> > > Kconfig can test the linker capability just once, and store it in the
> > > .config file. The build and installation steps that follow do not need
> > > to test the linker over again.
> > >
> > > Reported-by: Nathan Chancellor 
> > > Signed-off-by: Masahiro Yamada 
> > > ---
> > >
> > > I was not sure what the preferred CONFIG option name is.
> > > Please suggest one if you have a better idea.
> > >
> > >
> > >  arch/arm64/Kconfig  | 3 +++
> > >  arch/arm64/Makefile | 2 +-
> > >  2 files changed, 4 insertions(+), 1 deletion(-)
> >
> > Would you like this patch to go in via the arm64 tree or you will queue
> > it via the kbuild tree?
> 
> I applied this to linux-kbuild with Will's Ack.

Great, I'll cross it off my list. Thanks.

-- 
Catalin


Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature

2021-03-27 Thread Catalin Marinas
On Fri, Mar 12, 2021 at 03:18:58PM +, Steven Price wrote:
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 77cb2d28f2a4..b31b7a821f90 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   if (vma_pagesize == PAGE_SIZE && !force_pte)
>   vma_pagesize = transparent_hugepage_adjust(memslot, hva,
>     &pfn, &fault_ipa);
> +
> + if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) {

This pfn_valid() check may be problematic. Following commit eeb0753ba27b
("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns
true for ZONE_DEVICE memory but such memory is allowed not to support
MTE.

I now wonder if we can get a MAP_ANONYMOUS mapping of ZONE_DEVICE pfn
even without virtualisation.

> + /*
> +  * VM will be able to see the page's tags, so we must ensure
> +  * they have been initialised. if PG_mte_tagged is set, tags
> +  * have already been initialised.
> +  */
> + struct page *page = pfn_to_page(pfn);
> + unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
> +
> + for (i = 0; i < nr_pages; i++, page++) {
> + if (!test_and_set_bit(PG_mte_tagged, &page->flags))
> + mte_clear_page_tags(page_address(page));
> + }
> + }
> +
>   if (writable)
>   prot |= KVM_PGTABLE_PROT_W;
>  

-- 
Catalin


Re: [PATCH 1/4] arm64: expose orig_x0 in the user_pt_regs structure

2021-03-27 Thread Catalin Marinas
On Fri, Mar 26, 2021 at 05:35:19PM -0700, Andrei Vagin wrote:
> On Fri, Mar 26, 2021 at 11:28 AM Catalin Marinas
>  wrote:
> >
> > On Mon, Mar 22, 2021 at 03:50:50PM -0700, Andrei Vagin wrote:
> > > diff --git a/arch/arm64/include/uapi/asm/ptrace.h 
> > > b/arch/arm64/include/uapi/asm/ptrace.h
> > > index 758ae984ff97..3c118c5b0893 100644
> > > --- a/arch/arm64/include/uapi/asm/ptrace.h
> > > +++ b/arch/arm64/include/uapi/asm/ptrace.h
> > > @@ -90,6 +90,7 @@ struct user_pt_regs {
> > >   __u64   sp;
> > >   __u64   pc;
> > >   __u64   pstate;
> > > + __u64   orig_x0;
> > >  };
> >
> > That's a UAPI change, likely to go wrong. For example, a
> > ptrace(PTRACE_GETREGSET, pid, REGSET_GPR, data) would write past the end
> > of an old struct user_pt_regs in the debugger.
> 
> ptrace(PTRACE_GETREGSET, ...) receives iovec:
> ptrace(PTRACE_GETREGSET, pid, NT_PRSTATUS, &iov)
> 
> iov contains a pointer to a buffer and its size and the kernel fills
> only the part that fits the buffer.
> I think this interface was invented to allow extending structures
> without breaking backward compatibility.

You are right here, it doesn't write past the end of the iov buffer.
However, it's still an ABI change. An unaware program using a newer
user_pt_regs but running on an older kernel may be surprised that the
updated iov.len is smaller than sizeof (struct user_pt_regs).

Changing this structure also changes the core dump format, see ELF_NGREG
and ELF_CORE_COPY_REGS. Maybe this doesn't matter much either since the
ELF note would have size information but I'd prefer if we didn't modify
this structure.

-- 
Catalin


Re: [PATCH v10 1/6] arm64: mte: Sync tags for pages where PTE is untagged

2021-03-26 Thread Catalin Marinas
Hi Steven,

On Fri, Mar 12, 2021 at 03:18:57PM +, Steven Price wrote:
> A KVM guest could store tags in a page even if the VMM hasn't mapped
> the page with PROT_MTE. So when restoring pages from swap we will
> need to check to see if there are any saved tags even if !pte_tagged().
> 
> However don't check pages which are !pte_valid_user() as these will
> not have been swapped out.
> 
> Signed-off-by: Steven Price 
> ---
>  arch/arm64/include/asm/pgtable.h |  2 +-
>  arch/arm64/kernel/mte.c  | 16 
>  2 files changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index e17b96d0e4b5..84166625c989 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -312,7 +312,7 @@ static inline void set_pte_at(struct mm_struct *mm, 
> unsigned long addr,
>   __sync_icache_dcache(pte);
>  
>   if (system_supports_mte() &&
> - pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
> + pte_present(pte) && pte_valid_user(pte) && !pte_special(pte))
>   mte_sync_tags(ptep, pte);

With the EPAN patches queued in for-next/epan, pte_valid_user()
disappeared as its semantics weren't very clear.

So this relies on the set_pte_at() being done on the VMM address space.
I wonder, if the VMM did an mprotect(PROT_NONE), can the VM still access
it via stage 2? If yes, the pte_valid_user() test wouldn't work. We need
something like pte_present() && addr <= user_addr_max().
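
I.e. something like this in set_pte_at() (untested):

	if (system_supports_mte() &&
	    pte_present(pte) && !pte_special(pte) && addr <= user_addr_max())
		mte_sync_tags(ptep, pte);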

BTW, ignoring virtualisation, can we ever bring a page in from swap on a
PROT_NONE mapping (say fault-around)? It's not too bad if we keep the
metadata around for when the pte becomes accessible but I suspect we
remove it if the page is removed from swap.

-- 
Catalin


Re: [PATCH 2/4] arm64/ptrace: introduce orig_x7 in the user_pt_regs structure

2021-03-26 Thread Catalin Marinas
On Mon, Mar 22, 2021 at 03:50:51PM -0700, Andrei Vagin wrote:
> diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
> index d4cdf98ac003..1008f0fbc5ea 100644
> --- a/arch/arm64/include/asm/ptrace.h
> +++ b/arch/arm64/include/asm/ptrace.h
> @@ -184,6 +184,7 @@ struct pt_regs {
>   u64 pc;
>   u64 pstate;
>   u64 orig_x0;
> + u64 orig_x7;
>   };
>   };
>  #ifdef __AARCH64EB__
> diff --git a/arch/arm64/include/uapi/asm/ptrace.h 
> b/arch/arm64/include/uapi/asm/ptrace.h
> index 3c118c5b0893..be7583ff5f4d 100644
> --- a/arch/arm64/include/uapi/asm/ptrace.h
> +++ b/arch/arm64/include/uapi/asm/ptrace.h
> @@ -91,6 +91,7 @@ struct user_pt_regs {
>   __u64   pc;
>   __u64   pstate;
>   __u64   orig_x0;
> + __u64   orig_x7;
>  };

Same here. So unless I miss something, we better have a separate
NT_ORIGREG (or some better name) regset to retrieve the additional
registers. Or, if you want to get all of them in one go, just add a
new one similar to NT_PRSTATUS but which restores x0 to orig_x0 and x7
to orig_x7.
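
E.g. with a layout along these lines (names purely illustrative):

	struct user_orig_regs {
		__u64  orig_x0;
		__u64  orig_x7;
	};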

Sorry if this was already discussed. I had a brief look at the previous
versions and couldn't see a user_pt_regs structure change, nor a
suggestion to do so.

-- 
Catalin


Re: [PATCH 1/4] arm64: expose orig_x0 in the user_pt_regs structure

2021-03-26 Thread Catalin Marinas
On Mon, Mar 22, 2021 at 03:50:50PM -0700, Andrei Vagin wrote:
> diff --git a/arch/arm64/include/uapi/asm/ptrace.h 
> b/arch/arm64/include/uapi/asm/ptrace.h
> index 758ae984ff97..3c118c5b0893 100644
> --- a/arch/arm64/include/uapi/asm/ptrace.h
> +++ b/arch/arm64/include/uapi/asm/ptrace.h
> @@ -90,6 +90,7 @@ struct user_pt_regs {
>   __u64   sp;
>   __u64   pc;
>   __u64   pstate;
> + __u64   orig_x0;
>  };

That's a UAPI change, likely to go wrong. For example, a
ptrace(PTRACE_GETREGSET, pid, REGSET_GPR, data) would write past the end
of an old struct user_pt_regs in the debugger.

-- 
Catalin


Re: [PATCH] arm64: move --fix-cortex-a53-843419 linker test to Kconfig

2021-03-26 Thread Catalin Marinas
Hi Masahiro,

On Wed, Mar 24, 2021 at 04:11:28PM +0900, Masahiro Yamada wrote:
> $(call ld-option, --fix-cortex-a53-843419) in arch/arm64/Makefile is
> evaluated every time even for Make targets that do not need the linker,
> such as "make ARCH=arm64 install".
> 
> Recently, the Kbuild tree queued up a patch to avoid needless
> compiler/linker flag evaluation. I believe it is a good improvement
> in itself, but it causes a false-positive warning for arm64 installation
> in linux-next. (Thanks to Nathan for the report)
> 
> Kconfig can test the linker capability just once, and store it in the
> .config file. The build and installation steps that follow do not need
> to test the linker over again.
> 
> Reported-by: Nathan Chancellor 
> Signed-off-by: Masahiro Yamada 
> ---
> 
> I was not sure what the preferred CONFIG option name is.
> Please suggest one if you have a better idea.
> 
> 
>  arch/arm64/Kconfig  | 3 +++
>  arch/arm64/Makefile | 2 +-
>  2 files changed, 4 insertions(+), 1 deletion(-)

Would you like this patch to go in via the arm64 tree or you will queue
it via the kbuild tree?

Thanks.

-- 
Catalin


Re: [PATCHv3 0/6] arm64: Support FIQ controller registration

2021-03-24 Thread Catalin Marinas
On Mon, 15 Mar 2021 11:56:23 +, Mark Rutland wrote:
> Hector's M1 support series [1] shows that some platforms have critical
> interrupts wired to FIQ, and to support these platforms we need to support
> handling FIQ exceptions. Other contemporary platforms don't use FIQ (since 
> e.g.
> this is usually routed to EL3), and as we never expect to take an FIQ, we have
> the FIQ vector cause a panic.
> 
> Since the use of FIQ is a platform integration detail (which can differ across
> bare-metal and virtualized environments), we need to be able to explicitly opt in
> to handling FIQs while retaining the existing behaviour otherwise. This series
> adds a new set_handle_fiq() hook so that the FIQ controller can do so, and
> where no controller is registered the default handler will panic(). For
> consistency the set_handle_irq() code is made to do the same.
> 
> [...]

Applied to arm64 (for-next/fiq), thanks!

[1/6] genirq: Allow architectures to override set_handle_irq() fallback
  https://git.kernel.org/arm64/c/b0b8b689d78c
[2/6] arm64: don't use GENERIC_IRQ_MULTI_HANDLER
  https://git.kernel.org/arm64/c/338a743640e9
[3/6] arm64: irq: rework root IRQ handler registration
  https://git.kernel.org/arm64/c/8ff443cebffa
[4/6] arm64: entry: factor irq triage logic into macros
  https://git.kernel.org/arm64/c/9eb563cdabe1
[5/6] arm64: Always keep DAIF.[IF] in sync
  https://git.kernel.org/arm64/c/f0098155d337
[6/6] arm64: irq: allow FIQs to be handled
  https://git.kernel.org/arm64/c/3889ba70102e

-- 
Catalin



Re: [PATCH v5 05/19] arm64: Add support for trace synchronization barrier

2021-03-24 Thread Catalin Marinas
On Wed, Mar 24, 2021 at 05:06:58PM +, Suzuki K Poulose wrote:
> On 24/03/2021 16:30, Marc Zyngier wrote:
> > On Wed, 24 Mar 2021 16:25:12 +,
> > Suzuki K Poulose  wrote:
> > > 
> > > On 24/03/2021 16:16, Marc Zyngier wrote:
> > > > On Wed, 24 Mar 2021 15:51:14 +,
> > > > Suzuki K Poulose  wrote:
> > > > > 
> > > > > On 24/03/2021 13:49, Marc Zyngier wrote:
> > > > > > On Wed, 24 Mar 2021 09:39:13 +0000,
> > > > > > Suzuki K Poulose  wrote:
> > > > > > > 
> > > > > > > On 23/03/2021 18:21, Catalin Marinas wrote:
> > > > > > > > Hi Suzuki?
> > > > > > > > 
> > > > > > > > On Tue, Mar 23, 2021 at 12:06:33PM +, Suzuki K Poulose 
> > > > > > > > wrote:
> > > > > > > > > tsb csync synchronizes the trace operation of instructions.
> > > > > > > > > The instruction is a nop when FEAT_TRF is not implemented.
> > > > > > > > > 
> > > > > > > > > Cc: Mathieu Poirier 
> > > > > > > > > Cc: Mike Leach 
> > > > > > > > > Cc: Catalin Marinas 
> > > > > > > > > Cc: Will Deacon 
> > > > > > > > > Signed-off-by: Suzuki K Poulose 
> > > > > > > > 
> > > > > > > > How do you plan to merge these patches? If they go via the 
> > > > > > > > coresight
> > > > > > > > tree:
> > > > > > > > 
> > > > > > > 
> > > > > > > Ideally all of this should go via the CoreSight tree to have the
> > > > > > > dependencies solved at one place. But there are some issues :
> > > > > > > 
> > > > > > > If this makes to 5.13 queue for CoreSight,
> > > > > > > 
> > > > > > > 1) CoreSight next is based on rc2 at the moment and we have fixes 
> > > > > > > gone
> > > > > > > into rc3 and later, which this series will depend on. (We could 
> > > > > > > move
> > > > > > > the next tree forward to a later rc to solve this).
> > > > > > > 
> > > > > > > 2) There could be conflicts with the kvmarm tree for the KVM host
> > > > > > > changes (That has dependency on the TRBE definitions patch).
> > > > > > > 
> > > > > > > If it doesn't make to 5.13 queue, it would be good to have this 
> > > > > > > patch,
> > > > > > > the TRBE defintions and the KVM host patches queued for 5.13 (not 
> > > > > > > sure
> > > > > > > if this is acceptable) and we could rebase the CoreSight changes 
> > > > > > > on 5.13
> > > > > > > and push it to next release.
> > > > > > > 
> > > > > > > I am open for other suggestions.
> > > > > > > 
> > > > > > > Marc, Mathieu,
> > > > > > > 
> > > > > > > Thoughts ?
> > > > > > 
> > > > > > I was planning to take the first two patches in 5.12 as fixes (they
> > > > > > are queued already, and would hopefully land in -rc5). If that 
> > > > > > doesn't
> > > > > > fit with the plan, please let me know ASAP.
> > > > > 
> > > > > Marc,
> > > > > 
> > > > > I think it would be better to hold on pushing those patches until we
> > > > > have a clarity on how things will go.
> > > > 
> > > > OK. I thought there was a need for these patches to prevent guest
> > > > access to the v8.4 self hosted tracing feature that went in 5.12
> > > > though[1]... Did I get it wrong?
> > > 
> > > Yes, that is correct. The guest could access the Trace Filter Control
> > > register and fiddle with the host settings, without this patch.
> > > e.g, it could disable tracing at EL0/EL1, without the host being
> > > aware on nVHE host.
> > 
> > OK, so we definitely do need these patches, don't we? Both? Just one?
> > Please have a look at kvmarm/fixes and tell me what I must keep.
> 
> Both of them are fixes.
> 
> commit "KVM: arm64: Disable guest access to trace filter controls"
>  - This fixes guest fiddling with the trace filter control as described
> above.
> 
> commit "KVM: arm64: Hide system instruction access to Trace registers"
>  - Fixes the Hypervisor to advertise what it doesn't support. i.e
>stop advertising trace system instruction access to a guest.
>Otherwise a guest which trusts the ID registers
>(ID_AA64DFR0_EL1.TRACEVER == 1) can crash while trying to access the
>trace register as we trap the accesses (CPTR_EL2.TTA == 1). On Linux,
>the ETM drivers need a DT explicitly advertising the support. So,
>this is not immediately impacted. And this fix goes a long way back
>in the history, when the CPTR_EL2.TTA was added.
> 
> Now, the reason for asking you to hold on is the way this could create
> conflicts in merging the rest of the series.

The way we normally work around this is to either rebase your series on
top of -rc5 when the fixes go in or, if you want an earlier -rc base,
Marc can put them on a stable branch somewhere that you can use.

In the worst case you can merge the patches twice but that's rarely
needed.

-- 
Catalin


Re: linux-next: Signed-off-by missing for commit in the arm64 tree

2021-03-24 Thread Catalin Marinas
On Wed, Mar 24, 2021 at 08:14:45AM +1100, Stephen Rothwell wrote:
> Commits
> 
>   b17f265bb4cc ("kselftest/arm64: mte: Fix MTE feature detection")
>   4dfc9d30a8ab ("kselftest/arm64: mte: common: Fix write() warnings")
> 
> are missing a Signed-off-by from their author.

Thanks Stephen. Now fixed.

-- 
Catalin


Re: [PATCH v5 05/19] arm64: Add support for trace synchronization barrier

2021-03-23 Thread Catalin Marinas
Hi Suzuki,

On Tue, Mar 23, 2021 at 12:06:33PM +, Suzuki K Poulose wrote:
> tsb csync synchronizes the trace operation of instructions.
> The instruction is a nop when FEAT_TRF is not implemented.
> 
> Cc: Mathieu Poirier 
> Cc: Mike Leach 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Signed-off-by: Suzuki K Poulose 

How do you plan to merge these patches? If they go via the coresight
tree:

Acked-by: Catalin Marinas 


Re: [PATCH 2/3] arm64: lib: improve copy performance when size is ge 128 bytes

2021-03-23 Thread Catalin Marinas
On Tue, Mar 23, 2021 at 01:32:18PM +, Will Deacon wrote:
> On Tue, Mar 23, 2021 at 12:08:56PM +, Robin Murphy wrote:
> > On 2021-03-23 07:34, Yang Yingliang wrote:
> > > When copy over 128 bytes, src/dst is added after
> > > each ldp/stp instruction, it will cost more time.
> > > To improve this, we only add src/dst after load
> > > or store 64 bytes.
> > 
> > This breaks the required behaviour for copy_*_user(), since the fault
> > handler expects the base address to be up-to-date at all times. Say you're
> > copying 128 bytes and fault on the 4th store, it should return 80 bytes not
> > copied; the code below would return 128 bytes not copied, even though 48
> > bytes have actually been written to the destination.
> > 
> > We've had a couple of tries at updating this code (because the whole
> > template is frankly a bit terrible, and a long way from the well-optimised
> > code it was derived from), but getting the fault-handling behaviour right
> > without making the handler itself ludicrously complex has proven tricky. And
> > then it got bumped down the priority list while the uaccess behaviour in
> > general was in flux - now that the dust has largely settled on that I should
> > probably try to find time to pick this up again...
> 
> I think the v5 from Oli was pretty close, but it didn't get any review:
> 
> https://lore.kernel.org/r/20200914151800.2270-1-oli.sw...@arm.com

These are still unread in my inbox as I was planning to look at them
again. However, I think we discussed a few options on how to proceed but
no concrete plans:

1. Merge Oli's patches as they are, with some potential complexity
   issues as fixing the user copy accuracy was non-trivial. I think the
   latest version uses a two-stage approach: when taking a fault, it
   falls back to byte-by-byte with the expectation that it faults
   again and we can then report the correct fault address.

2. Only use Cortex Strings for in-kernel memcpy() while the uaccess
   routines are some simple loops that align the uaccess part only
   (unlike Cortex Strings, which usually aligns the source).

3. Similar to 2 but with Cortex Strings imported automatically with some
   script to make it easier to keep the routines up to date.

If having non-optimal (but good enough) uaccess routines is acceptable,
I'd go for (2) with a plan to move to (3) at the next Cortex Strings
update.

I also need to look again at option (1) to see how complex it is but
given the time one spends on importing a new Cortex Strings library, I
don't think (1) scales well in the long term. We could, however, go for
(1) now and look at (3) with the next update.

-- 
Catalin


Re: linux-next: manual merge of the akpm tree with the arm64 tree

2021-03-22 Thread Catalin Marinas
On Mon, Mar 22, 2021 at 05:40:23PM +1100, Stephen Rothwell wrote:
> Hi all,
> 
> Today's linux-next merge of the akpm tree got a conflict in:
> 
>   arch/arm64/mm/mmu.c
> 
> between commit:
> 
>   87143f404f33 ("arm64: mm: use XN table mapping attributes for the linear 
> region")
> 
> from the arm64 tree and commit:
> 
>   0a2634348ef8 ("set_memory: allow querying whether set_direct_map_*() is 
> actually enabled")
> 
> from the akpm tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

The resolution looks fine. Thanks Stephen.

-- 
Catalin


Re: [PATCH] arm64: stacktrace: don't trace arch_stack_walk()

2021-03-19 Thread Catalin Marinas
On Fri, Mar 19, 2021 at 06:41:06PM +, Mark Rutland wrote:
> We recently converted arm64 to use arch_stack_walk() in commit:
> 
>   5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
> 
> The core stacktrace code expects that (when tracing the current task)
> arch_stack_walk() starts a trace at its caller, and does not include
> itself in the trace. However, arm64's arch_stack_walk() includes itself,
> and so traces include one more entry than callers expect. The core
> stacktrace code which calls arch_stack_walk() tries to skip a number of
> entries to prevent itself appearing in a trace, and the additional entry
> prevents skipping one of the core stacktrace functions, leaving this in
> the trace unexpectedly.
> 
> We can fix this by having arm64's arch_stack_walk() begin the trace with
> its caller. The first value returned by the trace will be
> __builtin_return_address(0), i.e. the caller of arch_stack_walk(). The
> first frame record to be unwound will be __builtin_frame_address(1),
> i.e. the caller's frame record. To prevent surprises, arch_stack_walk()
> is also marked noinline.
> 
> While __builtin_frame_address(1) is not safe in portable code, local GCC
> developers have confirmed that it is safe on arm64. To find the caller's
> frame record, the builtin can safely dereference the current function's
> frame record or (in theory) could stash the original FP into another GPR
> at function entry time, neither of which are problematic.
> 
> Prior to this patch, the tracing code would unexpectedly show up in
> traces of the current task, e.g.
> 
> | # cat /proc/self/stack
> | [<0>] stack_trace_save_tsk+0x98/0x100
> | [<0>] proc_pid_stack+0xb4/0x130
> | [<0>] proc_single_show+0x60/0x110
> | [<0>] seq_read_iter+0x230/0x4d0
> | [<0>] seq_read+0xdc/0x130
> | [<0>] vfs_read+0xac/0x1e0
> | [<0>] ksys_read+0x6c/0xfc
> | [<0>] __arm64_sys_read+0x20/0x30
> | [<0>] el0_svc_common.constprop.0+0x60/0x120
> | [<0>] do_el0_svc+0x24/0x90
> | [<0>] el0_svc+0x2c/0x54
> | [<0>] el0_sync_handler+0x1a4/0x1b0
> | [<0>] el0_sync+0x170/0x180
> 
> After this patch, the tracing code will not show up in such traces:
> 
> | # cat /proc/self/stack
> | [<0>] proc_pid_stack+0xb4/0x130
> | [<0>] proc_single_show+0x60/0x110
> | [<0>] seq_read_iter+0x230/0x4d0
> | [<0>] seq_read+0xdc/0x130
> | [<0>] vfs_read+0xac/0x1e0
> | [<0>] ksys_read+0x6c/0xfc
> | [<0>] __arm64_sys_read+0x20/0x30
> | [<0>] el0_svc_common.constprop.0+0x60/0x120
> | [<0>] do_el0_svc+0x24/0x90
> | [<0>] el0_svc+0x2c/0x54
> | [<0>] el0_sync_handler+0x1a4/0x1b0
> | [<0>] el0_sync+0x170/0x180
> 
> Erring on the side of caution, I've given this a spin with a bunch of
> toolchains, verifying the output of /proc/self/stack and checking that
> the assembly looked sound. For GCC (where we require version 5.1.0 or
> later) I tested with the kernel.org crosstool binares for versions
> 5.5.0, 6.4.0, 6.5.0, 7.3.0, 7.5.0, 8.1.0, 8.3.0, 8.4.0, 9.2.0, and
> 10.1.0. For clang (where we require version 10.0.1 or later) I tested
> with the llvm.org binary releases of 11.0.0, and 11.0.1.
> 
> Fixes: 5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
> Signed-off-by: Mark Rutland 
> Cc: Catalin Marinas 
> Cc: Chen Jun 
> Cc: Marco Elver 
> Cc: Mark Brown 
> Cc: Will Deacon 

Thanks Mark. I think we should add a cc stable; a patch with just a Fixes
tag doesn't always seem to end up in a stable kernel:

Cc:  # 5.10.x

With that:

Reviewed-by: Catalin Marinas 


Re: [PATCH 0/6] mm: some config cleanups

2021-03-19 Thread Catalin Marinas
On Tue, Mar 09, 2021 at 02:03:04PM +0530, Anshuman Khandual wrote:
> This series contains config cleanup patches which reduces code duplication
> across platforms and also improves maintainability. There is no functional
> change intended with this series. This has been boot tested on arm64 but
> only build tested on some other platforms.
> 
> This applies on 5.12-rc2
> 
> Cc: x...@kernel.org
> Cc: linux-i...@vger.kernel.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux-snps-...@lists.infradead.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-m...@vger.kernel.org
> Cc: linux-par...@vger.kernel.org
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: linux-ri...@lists.infradead.org
> Cc: linux...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: linux-kernel@vger.kernel.org
> 
> Anshuman Khandual (6):
>   mm: Generalize ARCH_HAS_CACHE_LINE_SIZE
>   mm: Generalize SYS_SUPPORTS_HUGETLBFS (rename as ARCH_SUPPORTS_HUGETLBFS)
>   mm: Generalize ARCH_ENABLE_MEMORY_[HOTPLUG|HOTREMOVE]
>   mm: Drop redundant ARCH_ENABLE_[HUGEPAGE|THP]_MIGRATION
>   mm: Drop redundant ARCH_ENABLE_SPLIT_PMD_PTLOCK
>   mm: Drop redundant HAVE_ARCH_TRANSPARENT_HUGEPAGE
> 
>  arch/arc/Kconfig   |  9 ++--
>  arch/arm/Kconfig   | 10 ++---
>  arch/arm64/Kconfig | 30 ++

For arm64:

Acked-by: Catalin Marinas 


Re: [RESEND PATCH] Kconfig: Move CONFIG_DEBUG_KMEMLEAK_TEST to samples/Kconfig

2021-03-19 Thread Catalin Marinas
On Thu, Mar 18, 2021 at 02:51:07PM +, chenjun (AM) wrote:
> From: Chen Jun 
> 
> commit 1abbef4f51724fb11f09adf0e75275f7cb422a8a
> ("mm,kmemleak-test.c: move kmemleak-test.c to samples dir")
> make CONFIG_DEBUG_KMEMLEAK_TEST depend on CONFIG_SAMPLES implicitly.
> And the dependency cannot be guaranteed by Kconfig.
> 
> move the definition of CONFIG_DEBUG_KMEMLEAK_TEST from lib/Kconfig.debug
> to samples/Kconfig.
> 
> Signed-off-by: Chen Jun 
> Acked-by: Catalin Marinas 

Cc'ing Andrew, I don't think he's seen the patch.

>   lib/Kconfig.debug | 8 
>   samples/Kconfig   | 8 
>   2 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 66d44d3..debacdc 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -678,14 +678,6 @@ config DEBUG_KMEMLEAK_MEM_POOL_SIZE
> fully initialised, this memory pool acts as an emergency one
> if slab allocations fail.
> 
> -config DEBUG_KMEMLEAK_TEST
> - tristate "Simple test for the kernel memory leak detector"
> - depends on DEBUG_KMEMLEAK && m
> - help
> -   This option enables a module that explicitly leaks memory.
> -
> -   If unsure, say N.
> -
>   config DEBUG_KMEMLEAK_DEFAULT_OFF
>   bool "Default kmemleak to off"
>   depends on DEBUG_KMEMLEAK
> diff --git a/samples/Kconfig b/samples/Kconfig
> index 0ed6e4d..15978dd 100644
> --- a/samples/Kconfig
> +++ b/samples/Kconfig
> @@ -216,4 +216,12 @@ config SAMPLE_WATCH_QUEUE
> Build example userspace program to use the new mount_notify(),
> sb_notify() syscalls and the KEYCTL_WATCH_KEY keyctl() function.
> 
> +config DEBUG_KMEMLEAK_TEST
> + tristate "Simple test for the kernel memory leak detector"
> + depends on DEBUG_KMEMLEAK && m
> + help
> +   This option enables a module that explicitly leaks memory.
> +
> +   If unsure, say N.
> +
>   endif # SAMPLES
> -- 
> 2.7.4


Re: arm64: compiler_types.h:320:38: error: call to __compiletime_assert_417

2021-03-19 Thread Catalin Marinas
On Fri, Mar 19, 2021 at 08:49:07PM +0530, Naresh Kamboju wrote:
> [This email landed to Spam for some reason, sending it again with modified
> subject]
> 
> While building arm64 kernel modules the following kernel warnings /
> errors noticed on linux next 20210318 tag the gcc version is 7.3.0.
> Build PASS with gcc-8, gcc-9 and gcc-10.
> 
>  In file included from :0:0:
>  In function 'resiliency_test',
>  inlined from 'test_slub_init' at   lib/test_slub.c:120:2:
>include/linux/compiler_types.h:320:38: error: call to
> '__compiletime_assert_417' declared with attribute error: BUILD_BUG_ON
> failed: KMALLOC_MIN_SIZE > 16 | KMALLOC_SHIFT_HIGH < 10

KMALLOC_MIN_SIZE is 128 on arm64, so commit 1a58eef5def9 ("selftests:
add a kselftest for SLUB debugging functionality") breaks the build. The
test was previously in mm/slub.c hidden behind a macro that no-one
enabled.
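
For reference, roughly why that minimum ends up at 128 (a simplified
sketch; the exact definitions live in arch/arm64/include/asm/cache.h and
include/linux/slab.h):

	#define ARCH_DMA_MINALIGN	(128)		/* arm64 */
	#define KMALLOC_MIN_SIZE	ARCH_DMA_MINALIGN	/* simplified */

	/* the assertion from the moved resiliency test */
	BUILD_BUG_ON(KMALLOC_MIN_SIZE > 16 | KMALLOC_SHIFT_HIGH < 10);

so the first half of the condition is always true on arm64.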

-- 
Catalin


Re: [PATCH v3 0/5] arm64: kasan: support CONFIG_KASAN_VMALLOC

2021-03-19 Thread Catalin Marinas
Hi Lecopzer,

On Sat, Feb 06, 2021 at 04:35:47PM +0800, Lecopzer Chen wrote:
> Linux supports KAsan for VMALLOC since commit 3c5c3cfb9ef4da9
> ("kasan: support backing vmalloc space with real shadow memory")
> 
> Acroding to how x86 ported it [1], they early allocated p4d and pgd,
> but in arm64 I just simulate how KAsan supports MODULES_VADDR in arm64
> by not to populate the vmalloc area except for kimg address.

Do you plan an update to a newer kernel like 5.12-rc3?

> Signed-off-by: Lecopzer Chen 
> Acked-by: Andrey Konovalov 
> Tested-by: Andrey Konovalov 
> Tested-by: Ard Biesheuvel 

You could move these to individual patches rather than the cover letter,
assuming that they still stand after the changes you've made. Also note
that Andrey K no longer has the @google.com email address; if you cc him
on future patches, replace it with his @gmail.com one.

Thanks.

-- 
Catalin


Re: [PATCH v3 1/5] arm64: kasan: don't populate vmalloc area for CONFIG_KASAN_VMALLOC

2021-03-19 Thread Catalin Marinas
On Sat, Feb 06, 2021 at 04:35:48PM +0800, Lecopzer Chen wrote:
> Linux support KAsan for VMALLOC since commit 3c5c3cfb9ef4da9
> ("kasan: support backing vmalloc space with real shadow memory")
> 
> Like how the MODULES_VADDR does now, just not to early populate
> the VMALLOC_START between VMALLOC_END.
> 
> Before:
> 
> MODULE_VADDR: no mapping, no zoreo shadow at init
> VMALLOC_VADDR: backed with zero shadow at init
> 
> After:
> 
> MODULE_VADDR: no mapping, no zoreo shadow at init
> VMALLOC_VADDR: no mapping, no zoreo shadow at init

s/zoreo/zero/

> Thus the mapping will get allocated on demand by the core function
> of KASAN_VMALLOC.
> 
>   ---  vmalloc_shadow_start
>  |   |
>  |   |
>  |   | <= non-mapping
>  |   |
>  |   |
>  |---|
>  |///|<- kimage shadow with page table mapping.
>  |---|
>  |   |
>  |   | <= non-mapping
>  |   |
>  - vmalloc_shadow_end
>  |000|
>  |000| <= Zero shadow
>  |000|
>  - KASAN_SHADOW_END
> 
> Signed-off-by: Lecopzer Chen 
> ---
>  arch/arm64/mm/kasan_init.c | 18 +-
>  1 file changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
> index d8e66c78440e..20d06008785f 100644
> --- a/arch/arm64/mm/kasan_init.c
> +++ b/arch/arm64/mm/kasan_init.c
> @@ -214,6 +214,7 @@ static void __init kasan_init_shadow(void)
>  {
>   u64 kimg_shadow_start, kimg_shadow_end;
>   u64 mod_shadow_start, mod_shadow_end;
> + u64 vmalloc_shadow_end;
>   phys_addr_t pa_start, pa_end;
>   u64 i;
>  
> @@ -223,6 +224,8 @@ static void __init kasan_init_shadow(void)
>   mod_shadow_start = (u64)kasan_mem_to_shadow((void *)MODULES_VADDR);
>   mod_shadow_end = (u64)kasan_mem_to_shadow((void *)MODULES_END);
>  
> + vmalloc_shadow_end = (u64)kasan_mem_to_shadow((void *)VMALLOC_END);
> +
>   /*
>* We are going to perform proper setup of shadow memory.
>* At first we should unmap early shadow (clear_pgds() call below).
> @@ -241,12 +244,17 @@ static void __init kasan_init_shadow(void)
>  
>   kasan_populate_early_shadow(kasan_mem_to_shadow((void *)PAGE_END),
>  (void *)mod_shadow_start);
> - kasan_populate_early_shadow((void *)kimg_shadow_end,
> -(void *)KASAN_SHADOW_END);
>  
> - if (kimg_shadow_start > mod_shadow_end)
> - kasan_populate_early_shadow((void *)mod_shadow_end,
> - (void *)kimg_shadow_start);

Not something introduced by this patch but what happens if this
condition is false? It means that kimg_shadow_end < mod_shadow_start and
the above kasan_populate_early_shadow(PAGE_END, mod_shadow_start)
overlaps with the earlier kasan_map_populate(kimg_shadow_start,
kimg_shadow_end).

> + if (IS_ENABLED(CONFIG_KASAN_VMALLOC))
> + kasan_populate_early_shadow((void *)vmalloc_shadow_end,
> + (void *)KASAN_SHADOW_END);
> + else {
> + kasan_populate_early_shadow((void *)kimg_shadow_end,
> + (void *)KASAN_SHADOW_END);
> + if (kimg_shadow_start > mod_shadow_end)
> + kasan_populate_early_shadow((void *)mod_shadow_end,
> + (void *)kimg_shadow_start);
> + }
>  
>   for_each_mem_range(i, _start, _end) {
>   void *start = (void *)__phys_to_virt(pa_start);
> -- 
> 2.25.1
> 

-- 
Catalin


Re: [PATCH V2] mm/memtest: Add ARCH_USE_MEMTEST

2021-03-19 Thread Catalin Marinas
On Mon, Mar 01, 2021 at 10:02:06AM +0530, Anshuman Khandual wrote:
> early_memtest() does not get called from all architectures. Hence enabling
> CONFIG_MEMTEST and providing a valid memtest=[1..N] kernel command line
> option might not trigger the memory pattern tests as would be expected in
> normal circumstances. This situation is misleading.
> 
> The change here prevents the above mentioned problem after introducing a
> new config option ARCH_USE_MEMTEST that should be subscribed on platforms
> that call early_memtest(), in order to enable the config CONFIG_MEMTEST.
> Conversely CONFIG_MEMTEST cannot be enabled on platforms where it would
> not be tested anyway.
> 
> Cc: Russell King 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Thomas Bogendoerfer 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Chris Zankel 
> Cc: Max Filippov 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-m...@vger.kernel.org
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: linux-xte...@linux-xtensa.org
> Cc: linux...@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Reviewed-by: Max Filippov 
> Signed-off-by: Anshuman Khandual 

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH] [RFC] arm64: enable HAVE_LD_DEAD_CODE_DATA_ELIMINATION

2021-03-19 Thread Catalin Marinas
On Thu, Mar 18, 2021 at 09:41:54AM +0100, Arnd Bergmann wrote:
> On Wed, Mar 17, 2021 at 5:18 PM Catalin Marinas  
> wrote:
> >
> > On Wed, Mar 17, 2021 at 02:37:57PM +0000, Catalin Marinas wrote:
> > > On Thu, Feb 25, 2021 at 12:20:56PM +0100, Arnd Bergmann wrote:
> > > > diff --git a/arch/arm64/kernel/vmlinux.lds.S 
> > > > b/arch/arm64/kernel/vmlinux.lds.S
> > > > index bad2b9eaab22..926cdb597a45 100644
> > > > --- a/arch/arm64/kernel/vmlinux.lds.S
> > > > +++ b/arch/arm64/kernel/vmlinux.lds.S
> > > > @@ -217,7 +217,7 @@ SECTIONS
> > > > INIT_CALLS
> > > > CON_INITCALL
> > > > INIT_RAM_FS
> > > > -   *(.init.altinstructions .init.bss .init.bss.*)  /* from the 
> > > > EFI stub */
> > > > +   *(.init.altinstructions .init.data.* .init.bss .init.bss.*) 
> > > > /* from the EFI stub */
> > >
> > > INIT_DATA already covers .init.data and .init.data.*, so I don't think
> > > we need this change.
> >
> > Ah, INIT_DATA only covers init.data.* (so no dot in front). The above
> > is needed for the EFI stub.
> 
> I wonder if that is just a typo in INIT_DATA. Nico introduced it as part of
> 266ff2a8f51f ("kbuild: Fix asm-generic/vmlinux.lds.h for
> LD_DEAD_CODE_DATA_ELIMINATION"), so perhaps that should have
> been .init.data.* instead.

I think it was the other Nicholas ;) (with an 'h'). The vmlinux.lds.h
change indeed looks like a typo (it's been around since 4.18).
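
For reference, the INIT_DATA line in question reads roughly:

	*(.init.data init.data.*)

i.e. the wildcard pattern has no leading dot, so .init.data.* sections
(the renamed EFI stub ones, for example) are not matched; presumably
".init.data.*" was intended.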

> > However, I gave this a quick try and under Qemu with -cpu max and -smp 2
> > (or more) it fails as below. I haven't debugged but the lr points to
> > just after the switch_to() call. Maybe some section got discarded and we
> > patched in the wrong instructions. It is fine with -cpu host or -smp 1.
> 
> Ah, interesting.
> 
> > ---8<
> > smp: Bringing up secondary CPUs ...
> > Detected PIPT I-cache on CPU1
> > CPU1: Booted secondary processor 0x01 [0x000f0510]
> > Unable to handle kernel paging request at virtual address eb91d81ad2971160
> > Mem abort info:
> >   ESR = 0x8604
> >   EC = 0x21: IABT (current EL), IL = 32 bits
> >   SET = 0, FnV = 0
> >   EA = 0, S1PTW = 0
> > [eb91d81ad2971160] address between user and kernel address ranges
> > Internal error: Oops: 8604 [#1] PREEMPT SMP
> > Modules linked in:
> > CPU: 1 PID: 16 Comm: migration/1 Not tainted 5.12.0-rc3-2-g128e977c1322 
> > #1
> > Stopper: 0x0 <- 0x0
> > pstate: 6085 (nZCv daIf -PAN -UAO -TCO BTYPE=--)
> > pc : 0xeb91d81ad2971160
> > lr : __schedule+0x230/0x6b8
> > sp : 80001009bd60
> > x29: 80001009bd60 x28: 
> > x27: 000a6760 x26: 000b7540
> > x25: 0080 x24: d81ad3969000
> > x23: 000a6200 x22: 6ee0d81ad2971658
> > x21: 000a6200 x20: 0008
> > x19: 7fbc6bc0 x18: 0030
> > x17:  x16: 
> > x15: 8952b30a9a9e x14: 0366
> > x13: 0192 x12: 
> > x11: 0003 x10: 09b0
> > x9 : 80001009bd30 x8 : 000a6c10
> > x7 : 7fbc6cc0 x6 : fffedb30
> > x5 :  x4 : 
> > x3 : 0008 x2 : 
> > x1 : 000a6200 x0 : 000a3800
> > Call trace:
> >  0xeb91d81ad2971160
> >  schedule+0x70/0x108
> >  schedule_preempt_disabled+0x24/0x40
> >  __kthread_parkme+0x68/0xd0
> >  kthread+0x138/0x170
> >  ret_from_fork+0x10/0x30
> > Code: bad PC value
> > ---[ end trace af3481062ecef3e7 ]---
> 
> This looks like it has just returned from __schedule() to schedule()
> and is trying to return from that as well, through code like this:
> 
> .L562:
> // /git/arm-soc/kernel/sched/core.c:5159: }
> ldp x19, x20, [sp, 16]  //,,
> ldp x29, x30, [sp], 32  //,,,
> hint29 // autiasp
> ret
> 
> It looks like pointer authentication gone wrong, which ended up
> with dereferencing the broken pointer in x22, and it explains why
> it only happens with -cpu max. Presumably this also only happens
> on secondary CPUs, so maybe the bit that initializes PAC on
> secondary CPUs got discarded?

It seems that the whole alternative instructions section is gone, so any
run-time code patching that the kernel does won't work. The kernel boots
with the diff below but I'm not convinced we don't miss a

Re: [PATCH v16 0/9] arm64: ARMv8.5-A: MTE: Add async mode support

2021-03-18 Thread Catalin Marinas
On Mon, Mar 15, 2021 at 01:20:10PM +, Vincenzo Frascino wrote:
> This patchset implements the asynchronous mode support for ARMv8.5-A
> Memory Tagging Extension (MTE), which is a debugging feature that allows
> to detect with the help of the architecture the C and C++ programmatic
> memory errors like buffer overflow, use-after-free, use-after-return, etc.
> 
> MTE is built on top of the AArch64 v8.0 virtual address tagging TBI
> (Top Byte Ignore) feature and allows a task to set a 4 bit tag on any
> subset of its address space that is multiple of a 16 bytes granule. MTE
> is based on a lock-key mechanism where the lock is the tag associated to
> the physical memory and the key is the tag associated to the virtual
> address.
> When MTE is enabled and tags are set for ranges of address space of a task,
> the PE will compare the tag related to the physical memory with the tag
> related to the virtual address (tag check operation). Access to the memory
> is granted only if the two tags match. In case of mismatch the PE will raise
> an exception.
> 
> The exception can be handled synchronously or asynchronously. When the
> asynchronous mode is enabled:
>   - Upon fault the PE updates the TFSR_EL1 register.
>   - The kernel detects the change during one of the following:
> - Context switching
> - Return to user/EL0
> - Kernel entry from EL1
> - Kernel exit to EL1
>   - If the register has been updated by the PE the kernel clears it and
> reports the error.
> 
> The series is based on linux-next/akpm.

Andrew, could you please pick these patches up via the mm tree? They
depend on kasan patches already queued.

Andrey, all the kasan patches have your acked-by with the google.com
address and you've been cc'ed on that. You may want to update the
.mailmap file in the kernel.

Thanks.

-- 
Catalin


Re: [PATCH 2/2] arm64: stacktrace: Add skip when task == current

2021-03-18 Thread Catalin Marinas
On Thu, Mar 18, 2021 at 05:12:07PM +, Mark Rutland wrote:
> On Thu, Mar 18, 2021 at 04:17:24PM +0000, Catalin Marinas wrote:
> > On Wed, Mar 17, 2021 at 07:34:16PM +, Mark Rutland wrote:
> > > On Wed, Mar 17, 2021 at 06:36:36PM +0000, Catalin Marinas wrote:
> > > > On Wed, Mar 17, 2021 at 02:20:50PM +, Chen Jun wrote:
> > > > > On ARM64, cat /sys/kernel/debug/page_owner, all pages return the same
> > > > > stack:
> > > > >  stack_trace_save+0x4c/0x78
> > > > >  register_early_stack+0x34/0x70
> > > > >  init_page_owner+0x34/0x230
> > > > >  page_ext_init+0x1bc/0x1dc
> > > > > 
> > > > > The reason is that:
> > > > > check_recursive_alloc always return 1 because that
> > > > > entries[0] is always equal to ip (__set_page_owner+0x3c/0x60).
> > > > > 
> > > > > The root cause is that:
> > > > > commit 5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
> > > > > make the save_trace save 2 more entries.
> > > > > 
> > > > > Add skip in arch_stack_walk when task == current.
> > > > > 
> > > > > Fixes: 5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
> > > > > Signed-off-by: Chen Jun 
> > > > > ---
> > > > >  arch/arm64/kernel/stacktrace.c | 5 +++--
> > > > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/arch/arm64/kernel/stacktrace.c 
> > > > > b/arch/arm64/kernel/stacktrace.c
> > > > > index ad20981..c26b0ac 100644
> > > > > --- a/arch/arm64/kernel/stacktrace.c
> > > > > +++ b/arch/arm64/kernel/stacktrace.c
> > > > > @@ -201,11 +201,12 @@ void arch_stack_walk(stack_trace_consume_fn 
> > > > > consume_entry, void *cookie,
> > > > >  
> > > > >   if (regs)
> > > > >   start_backtrace(, regs->regs[29], regs->pc);
> > > > > - else if (task == current)
> > > > > + else if (task == current) {
> > > > > + ((struct stacktrace_cookie *)cookie)->skip += 2;
> > > > >   start_backtrace(,
> > > > >   (unsigned 
> > > > > long)__builtin_frame_address(0),
> > > > >   (unsigned long)arch_stack_walk);
> > > > > - else
> > > > > + } else
> > > > >   start_backtrace(, thread_saved_fp(task),
> > > > >   thread_saved_pc(task));
> > > > 
> > > > I don't like abusing the cookie here. It's void * as it's meant to be an
> > > > opaque type. I'd rather skip the first two frames in walk_stackframe()
> > > > instead before invoking fn().
> > > 
> > > I agree that we shouldn't touch cookie here.
> > > 
> > > I don't think that it's right to bodge this inside walk_stackframe(),
> > > since that'll add bogus skipping for the case starting with regs in the
> > > current task. If we need a bodge, it has to live in arch_stack_walk()
> > > where we set up the initial unwinding state.
> > 
> > Good point. However, instead of relying on __builtin_frame_address(1),
> > can we add a 'skip' value to struct stackframe via arch_stack_walk() ->
> > start_backtrace() that is consumed by walk_stackframe()?
> 
> We could, but I'd strongly prefer to use __builtin_frame_address(1) if
> we can, as it's much simpler to read and keeps the logic constrained to
> the starting function. I'd already hacked that up at:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/unwind=5811a76c1be1dcea7104a9a771fc2604bc2a90ef
> 
> ... and I'm fairly confident that this works on arm64.

If it works with both clang and gcc (and various versions), it's cleaner
this way.

> If __builtin_frame_address(1) is truly unreliable, then we could just
> manually unwind one step within arch_stack_walk() when unwinding
> current, which I think is cleaner than spreading this within
> walk_stackframe().
> 
> I can clean up the commit message and post that as a real patch, if you
> like?

Yes, please. Either variant is fine by me, with a preference for
__builtin_frame_address(1) (if we know it works).

> > > In another thread, we came to the conclusion that arch_stack_walk()
> > > should start at its parent, and its parent should add any

Re: [PATCH 2/2] arm64: stacktrace: Add skip when task == current

2021-03-18 Thread Catalin Marinas
On Wed, Mar 17, 2021 at 07:34:16PM +, Mark Rutland wrote:
> On Wed, Mar 17, 2021 at 06:36:36PM +0000, Catalin Marinas wrote:
> > On Wed, Mar 17, 2021 at 02:20:50PM +, Chen Jun wrote:
> > > On ARM64, cat /sys/kernel/debug/page_owner, all pages return the same
> > > stack:
> > >  stack_trace_save+0x4c/0x78
> > >  register_early_stack+0x34/0x70
> > >  init_page_owner+0x34/0x230
> > >  page_ext_init+0x1bc/0x1dc
> > > 
> > > The reason is that:
> > > check_recursive_alloc always return 1 because that
> > > entries[0] is always equal to ip (__set_page_owner+0x3c/0x60).
> > > 
> > > The root cause is that:
> > > commit 5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
> > > make the save_trace save 2 more entries.
> > > 
> > > Add skip in arch_stack_walk when task == current.
> > > 
> > > Fixes: 5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
> > > Signed-off-by: Chen Jun 
> > > ---
> > >  arch/arm64/kernel/stacktrace.c | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/arch/arm64/kernel/stacktrace.c 
> > > b/arch/arm64/kernel/stacktrace.c
> > > index ad20981..c26b0ac 100644
> > > --- a/arch/arm64/kernel/stacktrace.c
> > > +++ b/arch/arm64/kernel/stacktrace.c
> > > @@ -201,11 +201,12 @@ void arch_stack_walk(stack_trace_consume_fn 
> > > consume_entry, void *cookie,
> > >  
> > >   if (regs)
> > >   start_backtrace(, regs->regs[29], regs->pc);
> > > - else if (task == current)
> > > + else if (task == current) {
> > > + ((struct stacktrace_cookie *)cookie)->skip += 2;
> > >   start_backtrace(,
> > >   (unsigned long)__builtin_frame_address(0),
> > >   (unsigned long)arch_stack_walk);
> > > - else
> > > + } else
> > >   start_backtrace(, thread_saved_fp(task),
> > >   thread_saved_pc(task));
> > 
> > I don't like abusing the cookie here. It's void * as it's meant to be an
> > opaque type. I'd rather skip the first two frames in walk_stackframe()
> > instead before invoking fn().
> 
> I agree that we shouldn't touch cookie here.
> 
> I don't think that it's right to bodge this inside walk_stackframe(),
> since that'll add bogus skipping for the case starting with regs in the
> current task. If we need a bodge, it has to live in arch_stack_walk()
> where we set up the initial unwinding state.

Good point. However, instead of relying on __builtin_frame_address(1),
can we add a 'skip' value to struct stackframe via arch_stack_walk() ->
start_backtrace() that is consumed by walk_stackframe()?

> In another thread, we came to the conclusion that arch_stack_walk()
> should start at its parent, and its parent should add any skipping it
> requires.

This makes sense.

> Currently, arch_stack_walk() is off-by-one, and we can bodge that by
> using __builtin_frame_address(1), though I'm waiting for some compiler
> folk to confirm that's sound. Otherwise we need to add an assembly
> trampoline to snapshot the FP, which is unfortunastely convoluted.
> 
> This report suggests that a caller of arch_stack_walk() is off-by-one
> too, which suggests a larger cross-architecture semantic issue. I'll try
> to take a look tomorrow.

I don't think the caller is off by one, at least not by the final skip
value. __set_page_owner() wants the trace to start at its caller. The
callee save_stack() in the same file adds a skip of 2.
save_stack_trace() increments the skip before invoking
arch_stack_walk(). So far, this assumes that arch_stack_walk() starts at
its parent, i.e. save_stack_trace().

So save_stack_trace() only needs to skip 1 and I think that's in line
with the original report where the entries[0] is __set_page_owner(). We
only need to skip one. Another untested quick hack (we should probably
add the skip argument to start_backtrace()):

diff --git a/arch/arm64/include/asm/stacktrace.h 
b/arch/arm64/include/asm/stacktrace.h
index eb29b1fe8255..0d32d932ac89 100644
--- a/arch/arm64/include/asm/stacktrace.h
+++ b/arch/arm64/include/asm/stacktrace.h
@@ -56,6 +56,7 @@ struct stackframe {
DECLARE_BITMAP(stacks_done, __NR_STACK_TYPES);
unsigned long prev_fp;
enum stack_type prev_type;
+   int skip;
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
int graph;
 #endif
@@ -153,6 +154,7 @@ static inline void start_backtrace(struct stackframe *frame,
 {
frame->fp = fp;
frame->pc = pc;
+   frame->skip = 0;
 #ifdef C

Re: [PATCH 2/2] arm64: stacktrace: Add skip when task == current

2021-03-17 Thread Catalin Marinas
On Wed, Mar 17, 2021 at 02:20:50PM +, Chen Jun wrote:
> On ARM64, cat /sys/kernel/debug/page_owner, all pages return the same
> stack:
>  stack_trace_save+0x4c/0x78
>  register_early_stack+0x34/0x70
>  init_page_owner+0x34/0x230
>  page_ext_init+0x1bc/0x1dc
> 
> The reason is that:
> check_recursive_alloc always return 1 because that
> entries[0] is always equal to ip (__set_page_owner+0x3c/0x60).
> 
> The root cause is that:
> commit 5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
> make the save_trace save 2 more entries.
> 
> Add skip in arch_stack_walk when task == current.
> 
> Fixes: 5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
> Signed-off-by: Chen Jun 
> ---
>  arch/arm64/kernel/stacktrace.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> index ad20981..c26b0ac 100644
> --- a/arch/arm64/kernel/stacktrace.c
> +++ b/arch/arm64/kernel/stacktrace.c
> @@ -201,11 +201,12 @@ void arch_stack_walk(stack_trace_consume_fn 
> consume_entry, void *cookie,
>  
>   if (regs)
>   start_backtrace(, regs->regs[29], regs->pc);
> - else if (task == current)
> + else if (task == current) {
> + ((struct stacktrace_cookie *)cookie)->skip += 2;
>   start_backtrace(,
>   (unsigned long)__builtin_frame_address(0),
>   (unsigned long)arch_stack_walk);
> - else
> + } else
>   start_backtrace(, thread_saved_fp(task),
>   thread_saved_pc(task));

I don't like abusing the cookie here. It's void * as it's meant to be an
opaque type. I'd rather skip the first two frames in walk_stackframe()
instead before invoking fn().

Prior to the conversion to ARCH_STACKWALK, we were indeed skipping two
more entries in __save_stack_trace() if tsk == current. Something like
below, completely untested:

diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
index ad20981dfda4..2a9f759aa41a 100644
--- a/arch/arm64/kernel/stacktrace.c
+++ b/arch/arm64/kernel/stacktrace.c
@@ -115,10 +115,15 @@ NOKPROBE_SYMBOL(unwind_frame);
 void notrace walk_stackframe(struct task_struct *tsk, struct stackframe *frame,
 bool (*fn)(void *, unsigned long), void *data)
 {
+   /* for the current task, we don't want this function nor its caller */
+   int skip = tsk == current ? 2 : 0;
+
while (1) {
int ret;
 
-   if (!fn(data, frame->pc))
+   if (skip)
+   skip--;
+   else if (!fn(data, frame->pc))
break;
ret = unwind_frame(tsk, frame);
if (ret < 0)


-- 
Catalin


Re: [PATCH] [RFC] arm64: enable HAVE_LD_DEAD_CODE_DATA_ELIMINATION

2021-03-17 Thread Catalin Marinas
On Wed, Mar 17, 2021 at 02:37:57PM +, Catalin Marinas wrote:
> On Thu, Feb 25, 2021 at 12:20:56PM +0100, Arnd Bergmann wrote:
> > diff --git a/arch/arm64/kernel/vmlinux.lds.S 
> > b/arch/arm64/kernel/vmlinux.lds.S
> > index bad2b9eaab22..926cdb597a45 100644
> > --- a/arch/arm64/kernel/vmlinux.lds.S
> > +++ b/arch/arm64/kernel/vmlinux.lds.S
> > @@ -217,7 +217,7 @@ SECTIONS
> > INIT_CALLS
> > CON_INITCALL
> > INIT_RAM_FS
> > -   *(.init.altinstructions .init.bss .init.bss.*)  /* from the EFI 
> > stub */
> > +   *(.init.altinstructions .init.data.* .init.bss .init.bss.*) 
> > /* from the EFI stub */
> 
> INIT_DATA already covers .init.data and .init.data.*, so I don't think
> we need this change.

Ah, INIT_DATA only covers init.data.* (so no dot in front). The above
is needed for the EFI stub.

However, I gave this a quick try and under Qemu with -cpu max and -smp 2
(or more) it fails as below. I haven't debugged but the lr points to
just after the switch_to() call. Maybe some section got discarded and we
patched in the wrong instructions. It is fine with -cpu host or -smp 1.

---8<
smp: Bringing up secondary CPUs ...
Detected PIPT I-cache on CPU1
CPU1: Booted secondary processor 0x01 [0x000f0510]
Unable to handle kernel paging request at virtual address eb91d81ad2971160
Mem abort info:
  ESR = 0x8604
  EC = 0x21: IABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
[eb91d81ad2971160] address between user and kernel address ranges
Internal error: Oops: 8604 [#1] PREEMPT SMP
Modules linked in:
CPU: 1 PID: 16 Comm: migration/1 Not tainted 5.12.0-rc3-2-g128e977c1322 #1
Stopper: 0x0 <- 0x0
pstate: 6085 (nZCv daIf -PAN -UAO -TCO BTYPE=--)
pc : 0xeb91d81ad2971160
lr : __schedule+0x230/0x6b8
sp : 80001009bd60
x29: 80001009bd60 x28:  
x27: 000a6760 x26: 000b7540 
x25: 0080 x24: d81ad3969000 
x23: 000a6200 x22: 6ee0d81ad2971658 
x21: 000a6200 x20: 0008 
x19: 7fbc6bc0 x18: 0030 
x17:  x16:  
x15: 8952b30a9a9e x14: 0366 
x13: 0192 x12:  
x11: 0003 x10: 09b0 
x9 : 80001009bd30 x8 : 000a6c10 
x7 : 7fbc6cc0 x6 : fffedb30 
x5 :  x4 :  
x3 : 0008 x2 :  
x1 : 000a6200 x0 : 000a3800 
Call trace:
 0xeb91d81ad2971160
 schedule+0x70/0x108
 schedule_preempt_disabled+0x24/0x40
 __kthread_parkme+0x68/0xd0
 kthread+0x138/0x170
 ret_from_fork+0x10/0x30
Code: bad PC value
---[ end trace af3481062ecef3e7 ]---

-- 
Catalin


Re: [PATCH] [RFC] arm64: enable HAVE_LD_DEAD_CODE_DATA_ELIMINATION

2021-03-17 Thread Catalin Marinas
On Thu, Feb 25, 2021 at 12:20:56PM +0100, Arnd Bergmann wrote:
> diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
> index bad2b9eaab22..926cdb597a45 100644
> --- a/arch/arm64/kernel/vmlinux.lds.S
> +++ b/arch/arm64/kernel/vmlinux.lds.S
> @@ -217,7 +217,7 @@ SECTIONS
>   INIT_CALLS
>   CON_INITCALL
>   INIT_RAM_FS
> - *(.init.altinstructions .init.bss .init.bss.*)  /* from the EFI 
> stub */
> + *(.init.altinstructions .init.data.* .init.bss .init.bss.*) 
> /* from the EFI stub */

INIT_DATA already covers .init.data and .init.data.*, so I don't think
we need this change.

Also, mainline doesn't have .init.bss.*, do you know where this came
from? I can't find it in -next either.

-- 
Catalin


Re: [PATCH mm] kfence: make compatible with kmemleak

2021-03-17 Thread Catalin Marinas
On Wed, Mar 17, 2021 at 09:47:40AM +0100, Marco Elver wrote:
> Because memblock allocations are registered with kmemleak, the KFENCE
> pool was seen by kmemleak as one large object. Later allocations through
> kfence_alloc() that were registered with kmemleak via
> slab_post_alloc_hook() would then overlap and trigger a warning.
> Therefore, once the pool is initialized, we can remove (free) it from
> kmemleak again, since it should be treated as allocator-internal and be
> seen as "free memory".
> 
> The second problem is that kmemleak is passed the rounded size, and not
> the originally requested size, which is also the size of KFENCE objects.
> To avoid kmemleak scanning past the end of an object and trigger a
> KFENCE out-of-bounds error, fix the size if it is a KFENCE object.
> 
> For simplicity, to avoid a call to kfence_ksize() in
> slab_post_alloc_hook() (and avoid new IS_ENABLED(CONFIG_DEBUG_KMEMLEAK)
> guard), just call kfence_ksize() in mm/kmemleak.c:create_object().
> 
> Reported-by: Luis Henriques 
> Cc: Catalin Marinas 
> Signed-off-by: Marco Elver 

Reviewed-by: Catalin Marinas 


Re: Issue with kfence and kmemleak

2021-03-16 Thread Catalin Marinas
On Tue, Mar 16, 2021 at 07:47:00PM +0100, Marco Elver wrote:
> One thing I've just run into: "BUG: KFENCE: out-of-bounds read in
> scan_block+0x6b/0x170 mm/kmemleak.c:1244"
> 
> Probably because kmemleak is passed the rounded size for the size-class,
> and not the real allocation size. Can this be fixed with
> kmemleak_ignore() only called on the KFENCE guard pages?

If it's only on the occasional object, you can do a
kmemleak_scan_area(), but some care is needed as this in turn allocates
memory for kmemleak internal metadata.
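
Something like this (sketch; "obj" and "valid_size" are placeholders):

	/* restrict scanning of "obj" to its first valid_size bytes */
	kmemleak_scan_area(obj, valid_size, GFP_KERNEL);

Once an object has at least one scan area registered, only those areas
are scanned rather than the whole object.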

> I'd like kmemleak to scan the valid portion of an object allocated
> through KFENCE, but no further than that.
> 
> Or do we need to fix the size if it's a kfence object:
> 
> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
> index c0014d3b91c1..fe6e3ae8e8c6 100644
> --- a/mm/kmemleak.c
> +++ b/mm/kmemleak.c
> @@ -97,6 +97,7 @@
>  #include 
>  
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -589,7 +590,7 @@ static struct kmemleak_object *create_object(unsigned 
> long ptr, size_t size,
>   atomic_set(>use_count, 1);
>   object->flags = OBJECT_ALLOCATED;
>   object->pointer = ptr;
> - object->size = size;
> + object->size = kfence_ksize((void *)ptr) ?: size;
>   object->excess_ref = 0;
>   object->min_count = min_count;
>   object->count = 0;  /* white color initially */
> 
> The alternative is to call kfence_ksize() in slab_post_alloc_hook() when
> calling kmemleak_alloc.

One of these is probably the easiest. If kfence only works on slab
objects, it's better to pass the right size from slab_post_alloc_hook(). If
you plan to expand it later to vmalloc(), just fix the size in
create_object().
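
The slab_post_alloc_hook() variant would be something along these lines
(untested sketch, assuming the hook still registers each object via
kmemleak_alloc_recursive()):

	kmemleak_alloc_recursive(p[i],
				 kfence_ksize(p[i]) ?: s->object_size,
				 1, s->flags, flags);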

-- 
Catalin


Re: [PATCH] arm64: vmlinux.lds.S: keep .entry.tramp.text section

2021-03-16 Thread Catalin Marinas
On Tue, Mar 16, 2021 at 05:39:27PM +0100, Arnd Bergmann wrote:
> On Tue, Mar 16, 2021 at 5:27 PM Catalin Marinas  
> wrote:
> > On Tue, Mar 16, 2021 at 10:45:32AM +0000, Catalin Marinas wrote:
> > > On Fri, Feb 26, 2021 at 08:32:57PM -0800, Fangrui Song wrote:
> > > > On 2021-02-26, Kees Cook wrote:
> > > > > On Fri, Feb 26, 2021 at 03:03:39PM +0100, Arnd Bergmann wrote:
> > > > > > From: Arnd Bergmann 
> > > > > >
> > > > > > When building with CONFIG_LD_DEAD_CODE_DATA_ELIMINATION,
> > > > > > I sometimes see an assertion
> > > > > >
> > > > > >  ld.lld: error: Entry trampoline text too big
> > > > >
> > > > > Heh, "too big" seems a weird report for having it discarded. :)
> > > > >
> > > > > Any idea on this Fangrui?
> > > > >
> > > > > ( I see this is https://github.com/ClangBuiltLinux/linux/issues/1311 )
> > > >
> > > > This diagnostic is from an ASSERT in arch/arm64/kernel/vmlinux.lds
> > > >
> > > >   ASSERT((__entry_tramp_text_end - __entry_tramp_text_start) == (1 << 
> > > > 16),
> > > >"Entry trampoline text too big")
> > >
> > > Can we not change the ASSERT to be <= PAGE_SIZE instead?
> >
> > Ah, that won't work as I suspect we still need the trampoline section.
> >
> > Arnd, do you know why this section disappears? I did a simple test with
> > defconfig + LD_DEAD_CODE_DATA_ELIMINATION and the trampoline section is
> > still around.
> 
> If I remember correctly, this showed up when CONFIG_ARM_SDE_INTERFACE
> is disabled, which dropped the only reference into this section.
> If that doesn't make sense, I can try digging through the old build logs to
> reproduce the problem.

I suspected this as well, but it still worked for me when disabling it.

Anyway, I don't think identifying the exact option is necessary. With
CONFIG_UNMAP_KERNEL_AT_EL0=y we need this section around even if only
__entry_tramp_text_start/end are referenced.
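
Presumably that means a KEEP() on the input section so --gc-sections
cannot discard it even though only the symbols are referenced, roughly
(a sketch against the TRAMP_TEXT definition in vmlinux.lds.S):

	#define TRAMP_TEXT					\
		. = ALIGN(PAGE_SIZE);				\
		__entry_tramp_text_start = .;			\
		KEEP(*(.entry.tramp.text))			\
		. = ALIGN(PAGE_SIZE);				\
		__entry_tramp_text_end = .;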

In this case we happened to detect this issue because of the ASSERT in
vmlinux.lds.S but I wonder what else the linker drops with this dead
code elimination that we may not notice (it seems to remove about 500KB
from the resulting image in my test).

I'll push these two patches to -next for wider coverage before deciding
on mainline (though the option may not get much testing as it's hidden
behind EXPERT and default n).

-- 
Catalin


Re: Issue with kfence and kmemleak

2021-03-16 Thread Catalin Marinas
On Tue, Mar 16, 2021 at 06:30:00PM +0100, Marco Elver wrote:
> On Tue, Mar 16, 2021 at 04:42PM +, Luis Henriques wrote:
> > This is probably a known issue, but just in case: looks like it's not
> > possible to use kmemleak when kfence is enabled:
> > 
> > [0.272136] kmemleak: Cannot insert 0x888236e02f00 into the object 
> > search tree (overlaps existing)
> > [0.272136] CPU: 0 PID: 8 Comm: kthreadd Not tainted 5.12.0-rc3+ #92
> > [0.272136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
> > [0.272136] Call Trace:
> > [0.272136]  dump_stack+0x6d/0x89
> > [0.272136]  create_object.isra.0.cold+0x40/0x62
> > [0.272136]  ? process_one_work+0x5a0/0x5a0
> > [0.272136]  ? process_one_work+0x5a0/0x5a0
> > [0.272136]  kmem_cache_alloc_trace+0x110/0x2f0
> > [0.272136]  ? process_one_work+0x5a0/0x5a0
> > [0.272136]  kthread+0x3f/0x150
> > [0.272136]  ? lockdep_hardirqs_on_prepare+0xd4/0x170
> > [0.272136]  ? __kthread_bind_mask+0x60/0x60
> > [0.272136]  ret_from_fork+0x22/0x30
> > [0.272136] kmemleak: Kernel memory leak detector disabled
> > [0.272136] kmemleak: Object 0x888236e0 (size 2097152):
> > [0.272136] kmemleak:   comm "swapper", pid 0, jiffies 4294892296
> > [0.272136] kmemleak:   min_count = 0
> > [0.272136] kmemleak:   count = 0
> > [0.272136] kmemleak:   flags = 0x1
> > [0.272136] kmemleak:   checksum = 0
> > [0.272136] kmemleak:   backtrace:
> > [0.272136]  memblock_alloc_internal+0x6d/0xb0
> > [0.272136]  memblock_alloc_try_nid+0x6c/0x8a
> > [0.272136]  kfence_alloc_pool+0x26/0x3f
> > [0.272136]  start_kernel+0x242/0x548
> > [0.272136]  secondary_startup_64_no_verify+0xb0/0xbb
> > 
> > I've tried the hack below but it didn't really helped.  Obviously I don't
> > really understand what's going on ;-)  But I think the reason for this
> > patch not working as (I) expected is because kfence is initialised
> > *before* kmemleak.
> > 
> > diff --git a/mm/kfence/core.c b/mm/kfence/core.c
> > index 3b8ec938470a..b4ffd7695268 100644
> > --- a/mm/kfence/core.c
> > +++ b/mm/kfence/core.c
> > @@ -631,6 +631,9 @@ void __init kfence_alloc_pool(void)
> >  
> > if (!__kfence_pool)
> > pr_err("failed to allocate pool\n");
> > +   kmemleak_no_scan(__kfence_pool);
> >  }
> 
> Can you try the below patch?
> 
> Thanks,
> -- Marco
> 
> -- >8 --
> 
> diff --git a/mm/kfence/core.c b/mm/kfence/core.c
> index f7106f28443d..5891019721f6 100644
> --- a/mm/kfence/core.c
> +++ b/mm/kfence/core.c
> @@ -12,6 +12,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -481,6 +482,13 @@ static bool __init kfence_init_pool(void)
>   addr += 2 * PAGE_SIZE;
>   }
>  
> + /*
> +  * The pool is live and will never be deallocated from this point on;
> +  * tell kmemleak this is now free memory, so that later allocations can
> +  * correctly be tracked.
> +  */
> + kmemleak_free_part_phys(__pa(__kfence_pool), KFENCE_POOL_SIZE);

I presume this pool does not refer to any objects that are only tracked
through pool pointers.

kmemleak_free() (or *_free_part) should work, no need for the _phys
variant (which converts it back with __va).

Since we normally use kmemleak_ignore() (or no_scan) for objects we
don't care about, I'd expand the comment that this object needs to be
removed from the kmemleak object tree as it will overlap with subsequent
allocations handled by kfence which return pointers within this range.
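
i.e. something like this for the hunk above (sketch):

	/*
	 * The pool is live and will never be deallocated from this point on.
	 * Remove its kmemleak object, as it would otherwise overlap with
	 * subsequent allocations returned by kfence, which fall inside this
	 * range and are registered with kmemleak separately.
	 */
	kmemleak_free(__kfence_pool);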

-- 
Catalin


Re: [PATCH] arm64: vmlinux.lds.S: keep .entry.tramp.text section

2021-03-16 Thread Catalin Marinas
On Tue, Mar 16, 2021 at 10:45:32AM +, Catalin Marinas wrote:
> On Fri, Feb 26, 2021 at 08:32:57PM -0800, Fangrui Song wrote:
> > On 2021-02-26, Kees Cook wrote:
> > > On Fri, Feb 26, 2021 at 03:03:39PM +0100, Arnd Bergmann wrote:
> > > > From: Arnd Bergmann 
> > > > 
> > > > When building with CONFIG_LD_DEAD_CODE_DATA_ELIMINATION,
> > > > I sometimes see an assertion
> > > > 
> > > >  ld.lld: error: Entry trampoline text too big
> > > 
> > > Heh, "too big" seems a weird report for having it discarded. :)
> > > 
> > > Any idea on this Fangrui?
> > > 
> > > ( I see this is https://github.com/ClangBuiltLinux/linux/issues/1311 )
> > 
> > This diagnostic is from an ASSERT in arch/arm64/kernel/vmlinux.lds
> > 
> >   ASSERT((__entry_tramp_text_end - __entry_tramp_text_start) == (1 << 16),
> >"Entry trampoline text too big")
> 
> Can we not change the ASSERT to be <= PAGE_SIZE instead?

Ah, that won't work as I suspect we still need the trampoline section.

Arnd, do you know why this section disappears? I did a simple test with
defconfig + LD_DEAD_CODE_DATA_ELIMINATION and the trampoline section is
still around.

-- 
Catalin


Re: [PATCH] arm64: vmlinux.lds.S: keep .entry.tramp.text section

2021-03-16 Thread Catalin Marinas
On Fri, Feb 26, 2021 at 08:32:57PM -0800, Fangrui Song wrote:
> On 2021-02-26, Kees Cook wrote:
> > On Fri, Feb 26, 2021 at 03:03:39PM +0100, Arnd Bergmann wrote:
> > > From: Arnd Bergmann 
> > > 
> > > When building with CONFIG_LD_DEAD_CODE_DATA_ELIMINATION,
> > > I sometimes see an assertion
> > > 
> > >  ld.lld: error: Entry trampoline text too big
> > 
> > Heh, "too big" seems a weird report for having it discarded. :)
> > 
> > Any idea on this Fangrui?
> > 
> > ( I see this is https://github.com/ClangBuiltLinux/linux/issues/1311 )
> 
> This diagnostic is from an ASSERT in arch/arm64/kernel/vmlinux.lds
> 
>   ASSERT((__entry_tramp_text_end - __entry_tramp_text_start) == (1 << 16),
>"Entry trampoline text too big")

Can we not change the ASSERT to be <= PAGE_SIZE instead?

-- 
Catalin


Re: (subset) [PATCH 0/2] arch: enable GENERIC_FIND_FIRST_BIT for MIPS and ARM64

2021-03-15 Thread Catalin Marinas
On Thu, 25 Feb 2021 05:56:58 -0800, Yury Norov wrote:
> MIPS and ARM64 don't implement find_first_{zero}_bit in arch code and
> don't enable it in config. It leads to using find_next_bit() which is
> less efficient:
> 
> It's beneficial to enable GENERIC_FIND_FIRST_BIT as this functionality
> is not new at all and well-tested. It provides more optimized code and
> saves .text memory:
> 
> [...]

Applied to arm64 (for-next/misc), thanks!

[1/2] ARM64: enable GENERIC_FIND_FIRST_BIT
  https://git.kernel.org/arm64/c/98c5ec77c7c5

-- 
Catalin



Re: [PATCH v16 6/9] arm64: mte: Conditionally compile mte_enable_kernel_*()

2021-03-15 Thread Catalin Marinas
On Mon, Mar 15, 2021 at 01:20:16PM +, Vincenzo Frascino wrote:
> mte_enable_kernel_*() are not needed if KASAN_HW is disabled.
> 
> Add hash defines around the functions to conditionally compile the
> functions.
> 
> Signed-off-by: Vincenzo Frascino 

Acked-by: Catalin Marinas 

(BTW, Andrey now has a different email address; use the one in the
MAINTAINERS file)


Re: kernel BUG in memory_bm_free

2021-03-15 Thread Catalin Marinas
On Mon, Mar 15, 2021 at 08:08:06AM +0100, Dmitry Vyukov wrote:
> On Wed, Feb 3, 2021 at 6:59 AM syzbot
>  wrote:
> > syzbot found the following issue on:
> >
> > HEAD commit:3aaf0a27 Merge tag 'clang-format-for-linux-v5.11-rc7' of g..
> > git tree:   upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=17ef6108d0
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=10152c2ea16351e7
> > dashboard link: https://syzkaller.appspot.com/bug?extid=5ecbe63baca437585bd4
> > userspace arch: arm64
> >
> > Unfortunately, I don't have any reproducer for this issue yet.
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+5ecbe63baca437585...@syzkaller.appspotmail.com
> 
> The BUG is:
> BUG_ON(!virt_addr_valid(addr));
> 
> #syz fix: arm64: Do not pass tagged addresses to __is_lm_address()

Does this mean that commit 91cb2c8b072e ("arm64: Do not pass tagged
addresses to __is_lm_address()") fixes the regression? The patch was
merged in 5.11-rc7, I think.

-- 
Catalin


Re: arm64: kernel/sys.c - silence initialization warnings.

2021-03-15 Thread Catalin Marinas
On Fri, Mar 12, 2021 at 04:55:46AM -0500, Valdis Klētnieks wrote:
> Building arch/arm64/kernel/sys.o with W=1 throws over 300 warnings:
> 
> /usr/src/linux-next/arch/arm64/kernel/sys.c:56:40: warning: initialized field 
> overwritten [-Woverride-init]
>56 | #define __SYSCALL(nr, sym)  [nr] = __arm64_##sym,
>   |^~~~
> /usr/src/linux-next/include/uapi/asm-generic/unistd.h:29:37: note: in 
> expansion of macro '__SYSCALL'
>29 | #define __SC_COMP(_nr, _sys, _comp) __SYSCALL(_nr, _sys)
>   | ^
> /usr/src/linux-next/include/uapi/asm-generic/unistd.h:34:1: note: in 
> expansion of macro '__SC_COMP'
>34 | __SC_COMP(__NR_io_setup, sys_io_setup, compat_sys_io_setup)
>   | ^
> 
> We know that's pretty much the file's purpose in life, so tell the
> build system to not remind us.  This makes the 1 other warning a
> lot more noticeable. 
> 
> Signed-off-by: Valdis Kletnieks 
> 
> diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> index ed65576ce710..916b21d2b35b 100644
> --- a/arch/arm64/kernel/Makefile
> +++ b/arch/arm64/kernel/Makefile
> @@ -8,6 +8,7 @@ CFLAGS_armv8_deprecated.o := -I$(src)
>  CFLAGS_REMOVE_ftrace.o = $(CC_FLAGS_FTRACE)
>  CFLAGS_REMOVE_insn.o = $(CC_FLAGS_FTRACE)
>  CFLAGS_REMOVE_return_address.o = $(CC_FLAGS_FTRACE)
> +CFLAGS_sys.o += $(call cc-disable-warning, override-init)

We do similar initialisation in arch/arm64/kernel/sys32.c and
arch/arm64/kernel/traps.c for example. It's a pretty common pattern
throughout the kernel.

So we either treat W=1 output as diff against the vanilla kernel when
checking new patches or we remove override-init altogether from W=1.
Mark Rutland pointed me to an older thread:

https://lore.kernel.org/linux-arm-kernel/20190809083251.ga48...@lakrids.cambridge.arm.com/

-- 
Catalin


Re: [PATCH v15 5/8] arm64: mte: Enable TCO in functions that can read beyond buffer limits

2021-03-12 Thread Catalin Marinas
On Fri, Mar 12, 2021 at 03:23:44PM +, Vincenzo Frascino wrote:
> On 3/12/21 3:13 PM, Catalin Marinas wrote:
> > On Fri, Mar 12, 2021 at 02:22:07PM +, Vincenzo Frascino wrote:
> >> diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> >> index 9b557a457f24..8603c6636a7d 100644
> >> --- a/arch/arm64/include/asm/mte.h
> >> +++ b/arch/arm64/include/asm/mte.h
> >> @@ -90,5 +90,20 @@ static inline void mte_assign_mem_tag_range(void *addr, 
> >> size_t size)
> >>  
> >>  #endif /* CONFIG_ARM64_MTE */
> >>  
> >> +#ifdef CONFIG_KASAN_HW_TAGS
> >> +/* Whether the MTE asynchronous mode is enabled. */
> >> +DECLARE_STATIC_KEY_FALSE(mte_async_mode);
> >> +
> >> +static inline bool system_uses_mte_async_mode(void)
> >> +{
> >> +  return static_branch_unlikely(&mte_async_mode);
> >> +}
> >> +#else
> >> +static inline bool system_uses_mte_async_mode(void)
> >> +{
> >> +  return false;
> >> +}
> >> +#endif /* CONFIG_KASAN_HW_TAGS */
> > 
> > You can write this with fewer lines:
> > 
> > DECLARE_STATIC_KEY_FALSE(mte_async_mode);
> > 
> > static inline bool system_uses_mte_async_mode(void)
> > {
> > return IS_ENABLED(CONFIG_KASAN_HW_TAGS) &&
> > static_branch_unlikely(&mte_async_mode);
> > }
> > 
> > The compiler will ensure that mte_async_mode is not referred when
> > !CONFIG_KASAN_HW_TAGS and therefore doesn't need to be defined.
> 
> Yes, I agree, but I introduce "#ifdef CONFIG_KASAN_HW_TAGS" in the successive
> patch anyway, according to me the overall code looks more uniform like this. 
> But
> I do not have a strong opinion or preference on this.

Ah, yes, I didn't look at patch 6 again as it was already reviewed and I
forgot the context. Leave it as it is then, my reviewed-by still stands.

-- 
Catalin


Re: [PATCH v15 0/8] arm64: ARMv8.5-A: MTE: Add async mode support

2021-03-12 Thread Catalin Marinas
On Fri, Mar 12, 2021 at 02:22:02PM +, Vincenzo Frascino wrote:
> Andrey Konovalov (1):
>   kasan, arm64: tests supports for HW_TAGS async mode
> 
> Vincenzo Frascino (7):
>   arm64: mte: Add asynchronous mode support
>   kasan: Add KASAN mode kernel parameter
>   arm64: mte: Drop arch_enable_tagging()
>   kasan: Add report for async mode
>   arm64: mte: Enable TCO in functions that can read beyond buffer limits
>   arm64: mte: Enable async tag check fault
>   arm64: mte: Report async tag faults before suspend

Other than the comments I gave already, feel free to add my ack on the
rest of the patches:

Acked-by: Catalin Marinas 


Re: [PATCH v15 5/8] arm64: mte: Enable TCO in functions that can read beyond buffer limits

2021-03-12 Thread Catalin Marinas
On Fri, Mar 12, 2021 at 02:22:07PM +, Vincenzo Frascino wrote:
> diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> index 9b557a457f24..8603c6636a7d 100644
> --- a/arch/arm64/include/asm/mte.h
> +++ b/arch/arm64/include/asm/mte.h
> @@ -90,5 +90,20 @@ static inline void mte_assign_mem_tag_range(void *addr, 
> size_t size)
>  
>  #endif /* CONFIG_ARM64_MTE */
>  
> +#ifdef CONFIG_KASAN_HW_TAGS
> +/* Whether the MTE asynchronous mode is enabled. */
> +DECLARE_STATIC_KEY_FALSE(mte_async_mode);
> +
> +static inline bool system_uses_mte_async_mode(void)
> +{
> + return static_branch_unlikely(&mte_async_mode);
> +}
> +#else
> +static inline bool system_uses_mte_async_mode(void)
> +{
> + return false;
> +}
> +#endif /* CONFIG_KASAN_HW_TAGS */

You can write this with fewer lines:

DECLARE_STATIC_KEY_FALSE(mte_async_mode);

static inline bool system_uses_mte_async_mode(void)
{
return IS_ENABLED(CONFIG_KASAN_HW_TAGS) &&
       static_branch_unlikely(&mte_async_mode);
}

The compiler will ensure that mte_async_mode is not referred when
!CONFIG_KASAN_HW_TAGS and therefore doesn't need to be defined.

> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index fa755cf94e01..9362928ba0d5 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -26,6 +26,10 @@ u64 gcr_kernel_excl __ro_after_init;
>  
>  static bool report_fault_once = true;
>  
> +/* Whether the MTE asynchronous mode is enabled. */
> +DEFINE_STATIC_KEY_FALSE(mte_async_mode);
> +EXPORT_SYMBOL_GPL(mte_async_mode);

Maybe keep these bracketed by #ifdef CONFIG_KASAN_HW_TAGS. I think the
mte_enable_kernel_*() aren't needed either if KASAN_HW is disabled (you
can do it with an additional patch).

With these, you can add:

Reviewed-by: Catalin Marinas 


Re: [PATCH v14 8/8] kselftest/arm64: Verify that TCO is enabled in load_unaligned_zeropad()

2021-03-11 Thread Catalin Marinas
On Thu, Mar 11, 2021 at 03:00:26PM +, Vincenzo Frascino wrote:
> On 3/11/21 1:25 PM, Catalin Marinas wrote:
> > On Mon, Mar 08, 2021 at 04:14:34PM +, Vincenzo Frascino wrote:
> >> load_unaligned_zeropad() and __get/put_kernel_nofault() functions can
> >> read past some buffer limits, which may include an MTE granule with a
> >> different tag.
> >>
> >> When MTE async mode is enabled, if the load operation crosses the boundary
> >> and the next granule has a different tag, the PE sets the TFSR_EL1.TF1
> >> bit as if an asynchronous tag fault had happened:
> >>
> >>  ==
> >>  BUG: KASAN: invalid-access
> >>  Asynchronous mode enabled: no access details available
> >>
> >>  CPU: 0 PID: 1 Comm: init Not tainted 5.12.0-rc1-ge1045c86620d-dirty #8
> >>  Hardware name: FVP Base RevC (DT)
> >>  Call trace:
> >>dump_backtrace+0x0/0x1c0
> >>show_stack+0x18/0x24
> >>dump_stack+0xcc/0x14c
> >>kasan_report_async+0x54/0x70
> >>mte_check_tfsr_el1+0x48/0x4c
> >>exit_to_user_mode+0x18/0x38
> >>finish_ret_to_user+0x4/0x15c
> >>  ==
> >>
> >> Verify that Tag Check Override (TCO) is enabled in these functions before
> >> the load and disable it afterwards to prevent this from happening.
> >>
> >> Note: The issue has been observed only with an MTE enabled userspace.
> > 
> > The above bug is all about kernel buffers. While userspace can trigger
> > the relevant code paths, it should not matter whether the user has MTE
> > enabled or not. Can you please confirm that you can still trigger the
> > fault with kernel-mode MTE but non-MTE user-space? If not, we may have a
> > bug somewhere as the two are unrelated: load_unaligned_zeropad() only
> > acts on kernel buffers and is subject to the kernel MTE tag check fault
> > mode.
> 
> I retried and you are right, it does not matter if it is an MTE or non-MTE
> user-space. The issue seems to be that this test does not trigger the problem
> every time, which probably led me to the wrong conclusions.

Keep the test around for some quick checks before you get the kasan
test support.

> > I don't think we should have a user-space selftest for this. The bug is
> > not about a user-kernel interface, so an in-kernel test is more
> > appropriate. Could we instead add this to the kasan tests and calling
> > load_unaligned_zeropad() and other functions directly?
> 
> I agree with you we should abandon this strategy of triggering the issue due 
> to
> my comment above. I will investigate the option of having a kasan test and try
> to come up with one that calls the relevant functions directly. I would prefer
> though, since the rest of the series is almost ready, to post it in a future
> series. What do you think?

That's fine by me.

-- 
Catalin


Re: [PATCH v14 8/8] kselftest/arm64: Verify that TCO is enabled in load_unaligned_zeropad()

2021-03-11 Thread Catalin Marinas
On Mon, Mar 08, 2021 at 04:14:34PM +, Vincenzo Frascino wrote:
> load_unaligned_zeropad() and __get/put_kernel_nofault() functions can
> read past some buffer limits, which may include an MTE granule with a
> different tag.
> 
> When MTE async mode is enabled, if the load operation crosses the boundary
> and the next granule has a different tag, the PE sets the TFSR_EL1.TF1
> bit as if an asynchronous tag fault had happened:
> 
>  ==
>  BUG: KASAN: invalid-access
>  Asynchronous mode enabled: no access details available
> 
>  CPU: 0 PID: 1 Comm: init Not tainted 5.12.0-rc1-ge1045c86620d-dirty #8
>  Hardware name: FVP Base RevC (DT)
>  Call trace:
>dump_backtrace+0x0/0x1c0
>show_stack+0x18/0x24
>dump_stack+0xcc/0x14c
>kasan_report_async+0x54/0x70
>mte_check_tfsr_el1+0x48/0x4c
>exit_to_user_mode+0x18/0x38
>finish_ret_to_user+0x4/0x15c
>  ==
> 
> Verify that Tag Check Override (TCO) is enabled in these functions before
> the load and disable it afterwards to prevent this from happening.
> 
> Note: The issue has been observed only with an MTE enabled userspace.

The above bug is all about kernel buffers. While userspace can trigger
the relevant code paths, it should not matter whether the user has MTE
enabled or not. Can you please confirm that you can still triggered the
fault with kernel-mode MTE but non-MTE user-space? If not, we may have a
bug somewhere as the two are unrelated: load_unaligned_zeropad() only
acts on kernel buffers and are subject to the kernel MTE tag check fault
mode.

I don't think we should have a user-space selftest for this. The bug is
not about a user-kernel interface, so an in-kernel test is more
appropriate. Could we instead add this to the kasan tests and call
load_unaligned_zeropad() and other functions directly?
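
Something along these lines as a KASAN KUnit test case, maybe (very rough
sketch, the test name is made up and the boilerplate/includes are left
out):

	static void mte_async_load_unaligned_zeropad(struct kunit *test)
	{
		char *buf = kmalloc(128, GFP_KERNEL);

		KUNIT_ASSERT_NOT_ERR_OR_NULL(test, buf);
		/*
		 * An 8-byte load near the end of the buffer crosses into
		 * the next MTE granule with a different tag; with TCO
		 * handled correctly this must not latch an async tag
		 * check fault.
		 */
		load_unaligned_zeropad(buf + 122);
		kfree(buf);
	}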

-- 
Catalin


Re: [PATCH v14 3/8] arm64: mte: Drop arch_enable_tagging()

2021-03-11 Thread Catalin Marinas
On Mon, Mar 08, 2021 at 04:14:29PM +, Vincenzo Frascino wrote:
> arch_enable_tagging() was left in memory.h after the introduction of
> async mode to not break the bisectability of the KASAN KUNIT tests.
> 
> Remove the function now that KASAN has been fully converted.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Signed-off-by: Vincenzo Frascino 

Acked-by: Catalin Marinas 


Re: [PATCH] arm64: kasan: fix page_alloc tagging with DEBUG_VIRTUAL

2021-03-08 Thread Catalin Marinas
On Mon, Mar 08, 2021 at 05:10:23PM +0100, Andrey Konovalov wrote:
> When CONFIG_DEBUG_VIRTUAL is enabled, the default page_to_virt() macro
> implementation from include/linux/mm.h is used. That definition doesn't
> account for KASAN tags, which leads to no tags on page_alloc allocations.
> 
> Provide an arm64-specific definition for page_to_virt() when
> CONFIG_DEBUG_VIRTUAL is enabled that takes care of KASAN tags.
> 
> Fixes: 2813b9c02962 ("kasan, mm, arm64: tag non slab memory allocated via 
> pagealloc")
> Cc: 
> Signed-off-by: Andrey Konovalov 
> ---
>  arch/arm64/include/asm/memory.h | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> index c759faf7a1ff..0aabc3be9a75 100644
> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -328,6 +328,11 @@ static inline void *phys_to_virt(phys_addr_t x)
>  #define ARCH_PFN_OFFSET  ((unsigned long)PHYS_PFN_OFFSET)
>  
>  #if !defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_DEBUG_VIRTUAL)
> +#define page_to_virt(x)  ({ \
> + __typeof__(x) __page = x;   \
> + void *__addr = __va(page_to_phys(__page));  \
> + (void *)__tag_set((const void *)__addr, page_kasan_tag(__page));\
> +})
>  #define virt_to_page(x)  pfn_to_page(virt_to_pfn(x))

Reviewed-by: Catalin Marinas 


Re: [PATCH V3 2/2] arm64/mm: Reorganize pfn_valid()

2021-03-08 Thread Catalin Marinas
On Fri, Mar 05, 2021 at 10:54:58AM +0530, Anshuman Khandual wrote:
> There are multiple instances of pfn_to_section_nr() and __pfn_to_section()
> when CONFIG_SPARSEMEM is enabled. This can be optimized if memory section
> is fetched earlier. This replaces the open coded PFN and ADDR conversion
> with PFN_PHYS() and PHYS_PFN() helpers. While there, also add a comment.
> This does not cause any functional change.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Ard Biesheuvel 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Reviewed-by: David Hildenbrand 
> Signed-off-by: Anshuman Khandual 

Acked-by: Catalin Marinas 


Re: [PATCH V3 1/2] arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory

2021-03-08 Thread Catalin Marinas
On Fri, Mar 05, 2021 at 10:54:57AM +0530, Anshuman Khandual wrote:
> pfn_valid() validates a pfn but basically it checks for a valid struct page
> backing for that pfn. It should always return positive for memory ranges
> backed with struct page mapping. But currently pfn_valid() fails for all
> ZONE_DEVICE based memory types even though they have struct page mapping.
> 
> pfn_valid() asserts that there is a memblock entry for a given pfn without
> MEMBLOCK_NOMAP flag being set. The problem with ZONE_DEVICE based memory is
> that they do not have memblock entries. Hence memblock_is_map_memory() will
> invariably fail via memblock_search() for a ZONE_DEVICE based address. This
> eventually fails pfn_valid() which is wrong. memblock_is_map_memory() needs
> to be skipped for such memory ranges. As ZONE_DEVICE memory gets hotplugged
> into the system via memremap_pages() called from a driver, their respective
> memory sections will not have SECTION_IS_EARLY set.
> 
> Normal hotplug memory will never have MEMBLOCK_NOMAP set in their memblock
> regions. Because the flag MEMBLOCK_NOMAP was specifically designed and set
> for firmware reserved memory regions. memblock_is_map_memory() can just be
> skipped as it's always going to be positive and that will be an optimization
> for the normal hotplug memory. Like ZONE_DEVICE based memory, all normal
> hotplugged memory too will not have SECTION_IS_EARLY set for their sections
> 
> Skipping memblock_is_map_memory() for all non early memory sections would
> fix pfn_valid() problem for ZONE_DEVICE based memory and also improve its
> performance for normal hotplug memory as well.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Ard Biesheuvel 
> Cc: Robin Murphy 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Acked-by: David Hildenbrand 
> Fixes: 73b20c84d42d ("arm64: mm: implement pte_devmap support")
> Signed-off-by: Anshuman Khandual 
> ---
>  arch/arm64/mm/init.c | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 0ace5e68efba..5920c527845a 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -230,6 +230,18 @@ int pfn_valid(unsigned long pfn)
>  
>   if (!valid_section(__pfn_to_section(pfn)))
>   return 0;
> +
> + /*
> +  * ZONE_DEVICE memory does not have the memblock entries.
> +  * memblock_is_map_memory() check for ZONE_DEVICE based
> +  * addresses will always fail. Even the normal hotplugged
> +  * memory will never have MEMBLOCK_NOMAP flag set in their
> +  * memblock entries. Skip memblock search for all non early
> +  * memory sections covering all of hotplug memory including
> +  * both normal and ZONE_DEVICE based.
> +  */
> + if (!early_section(__pfn_to_section(pfn)))
> + return pfn_section_valid(__pfn_to_section(pfn), pfn);
>  #endif
>   return memblock_is_map_memory(addr);
>  }

Acked-by: Catalin Marinas 


Re: [PATCH v2 1/5] arm64: kasan: allow to init memory when setting tags

2021-03-08 Thread Catalin Marinas
On Mon, Mar 08, 2021 at 04:55:14PM +0100, Andrey Konovalov wrote:
> @@ -68,10 +69,16 @@ static inline void mte_set_mem_tag_range(void *addr, 
> size_t size, u8 tag)
> >* 'asm volatile' is required to prevent the compiler from moving
>* the statement outside of the loop.
>*/
> - asm volatile(__MTE_PREAMBLE "stg %0, [%0]"
> -  :
> -  : "r" (curr)
> -  : "memory");
> + if (init)
> + asm volatile(__MTE_PREAMBLE "stzg %0, [%0]"
> +  :
> +  : "r" (curr)
> +  : "memory");
> + else
> + asm volatile(__MTE_PREAMBLE "stg %0, [%0]"
> +  :
> +  : "r" (curr)
> +  : "memory");
>  
>   curr += MTE_GRANULE_SIZE;
>   } while (curr != end);

Is 'init' always a built-in constant here? If not, checking it once
outside the loop may be better (or check the code generation, maybe the
compiler is smart enough).
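
Checking it once outside the loop would look roughly like this (sketch
only, just duplicating the existing loop per case):

	if (init) {
		do {
			asm volatile(__MTE_PREAMBLE "stzg %0, [%0]"
				     :
				     : "r" (curr)
				     : "memory");
			curr += MTE_GRANULE_SIZE;
		} while (curr != end);
	} else {
		do {
			asm volatile(__MTE_PREAMBLE "stg %0, [%0]"
				     :
				     : "r" (curr)
				     : "memory");
			curr += MTE_GRANULE_SIZE;
		} while (curr != end);
	}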

-- 
Catalin


Re: [PATCH V3 1/2] arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory

2021-03-08 Thread Catalin Marinas
On Fri, Mar 05, 2021 at 07:24:21PM +0100, David Hildenbrand wrote:
> On 05.03.21 19:13, Catalin Marinas wrote:
> > On Fri, Mar 05, 2021 at 10:54:57AM +0530, Anshuman Khandual wrote:
> > > pfn_valid() validates a pfn but basically it checks for a valid struct 
> > > page
> > > backing for that pfn. It should always return positive for memory ranges
> > > backed with struct page mapping. But currently pfn_valid() fails for all
> > > ZONE_DEVICE based memory types even though they have struct page mapping.
> > > 
> > > pfn_valid() asserts that there is a memblock entry for a given pfn without
> > > MEMBLOCK_NOMAP flag being set. The problem with ZONE_DEVICE based memory 
> > > is
> > > that they do not have memblock entries. Hence memblock_is_map_memory() 
> > > will
> > > invariably fail via memblock_search() for a ZONE_DEVICE based address. 
> > > This
> > > eventually fails pfn_valid() which is wrong. memblock_is_map_memory() 
> > > needs
> > > to be skipped for such memory ranges. As ZONE_DEVICE memory gets 
> > > hotplugged
> > > into the system via memremap_pages() called from a driver, their 
> > > respective
> > > memory sections will not have SECTION_IS_EARLY set.
> > > 
> > > Normal hotplug memory will never have MEMBLOCK_NOMAP set in their memblock
> > > regions. Because the flag MEMBLOCK_NOMAP was specifically designed and set
> > > for firmware reserved memory regions. memblock_is_map_memory() can just be
> > > skipped as it's always going to be positive and that will be an 
> > > optimization
> > > for the normal hotplug memory. Like ZONE_DEVICE based memory, all normal
> > > hotplugged memory too will not have SECTION_IS_EARLY set for their 
> > > sections
> > > 
> > > Skipping memblock_is_map_memory() for all non early memory sections would
> > > fix pfn_valid() problem for ZONE_DEVICE based memory and also improve its
> > > performance for normal hotplug memory as well.
> > > 
> > > Cc: Catalin Marinas 
> > > Cc: Will Deacon 
> > > Cc: Ard Biesheuvel 
> > > Cc: Robin Murphy 
> > > Cc: linux-arm-ker...@lists.infradead.org
> > > Cc: linux-kernel@vger.kernel.org
> > > Acked-by: David Hildenbrand 
> > > Fixes: 73b20c84d42d ("arm64: mm: implement pte_devmap support")
> > > Signed-off-by: Anshuman Khandual 
> > > ---
> > >   arch/arm64/mm/init.c | 12 
> > >   1 file changed, 12 insertions(+)
> > > 
> > > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > > index 0ace5e68efba..5920c527845a 100644
> > > --- a/arch/arm64/mm/init.c
> > > +++ b/arch/arm64/mm/init.c
> > > @@ -230,6 +230,18 @@ int pfn_valid(unsigned long pfn)
> > >   if (!valid_section(__pfn_to_section(pfn)))
> > >   return 0;
> > > +
> > > + /*
> > > +  * ZONE_DEVICE memory does not have the memblock entries.
> > > +  * memblock_is_map_memory() check for ZONE_DEVICE based
> > > +  * addresses will always fail. Even the normal hotplugged
> > > +  * memory will never have MEMBLOCK_NOMAP flag set in their
> > > +  * memblock entries. Skip memblock search for all non early
> > > +  * memory sections covering all of hotplug memory including
> > > +  * both normal and ZONE_DEVICE based.
> > > +  */
> > > + if (!early_section(__pfn_to_section(pfn)))
> > > + return pfn_section_valid(__pfn_to_section(pfn), pfn);
> > 
> > Would something like this work instead:
> > 
> > if (online_device_section(ms))
> > return 1;
> > 
> > to avoid the assumptions around early_section()?
> 
> Please keep online section logic out of pfn valid logic. Tow different
> things. (and rather not diverge too much from generic pfn_valid() - we want
> to achieve the opposite in the long term, merging both implementations)

I think I misread the code. I was looking for a new flag to check like
SECTION_IS_DEVICE instead of assuming that !SECTION_IS_EARLY means
device or mhp.

Anyway, staring at this code for a bit more, I came to the conclusion
that the logic in Anshuman's patches is fairly robust - we only need to
check for memblock_is_map_memory() if early_section() as that's the only
case where we can have MEMBLOCK_NOMAP. Maybe the comment above should be
re-written a bit and avoid the ZONE_DEVICE and hotplugged memory
details altogether.

-- 
Catalin


Re: [PATCH V3 0/2] arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory

2021-03-05 Thread Catalin Marinas
On Fri, Mar 05, 2021 at 01:16:28PM -0500, Veronika Kabatova wrote:
> > > On 3/5/21 10:54 AM, Anshuman Khandual wrote:
> > > > This series fixes pfn_valid() for ZONE_DEVICE based memory and
> > > > also improves its performance for normal hotplug memory. While
> > > > here, it also reorganizes pfn_valid() on CONFIG_SPARSEMEM. This
> > > > series is based on v5.12-rc1.
[...]
> > > Could you please help recreate the earlier failure [1] but with this
> > > series applies on v5.12-rc1. Thank you.
> > 
> > the machine in question is currently loaned to a developer. I'll reach
> > out to them and will let you know once I have any results.
> 
> I'm happy to report the kernel boots with these new patches. I used the
> 5.12.0-rc1 kernel (commit 280d542f6ffac0) as a base. The full console log
> from the boot process is available at
> 
> https://gitlab.com/-/snippets/2086833

That's great Veronika. Thanks for confirming.

-- 
Catalin


Re: [PATCH V3 1/2] arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory

2021-03-05 Thread Catalin Marinas
On Fri, Mar 05, 2021 at 10:54:57AM +0530, Anshuman Khandual wrote:
> pfn_valid() validates a pfn but basically it checks for a valid struct page
> backing for that pfn. It should always return positive for memory ranges
> backed with struct page mapping. But currently pfn_valid() fails for all
> ZONE_DEVICE based memory types even though they have struct page mapping.
> 
> pfn_valid() asserts that there is a memblock entry for a given pfn without
> MEMBLOCK_NOMAP flag being set. The problem with ZONE_DEVICE based memory is
> that they do not have memblock entries. Hence memblock_is_map_memory() will
> invariably fail via memblock_search() for a ZONE_DEVICE based address. This
> eventually fails pfn_valid() which is wrong. memblock_is_map_memory() needs
> to be skipped for such memory ranges. As ZONE_DEVICE memory gets hotplugged
> into the system via memremap_pages() called from a driver, their respective
> memory sections will not have SECTION_IS_EARLY set.
> 
> Normal hotplug memory will never have MEMBLOCK_NOMAP set in their memblock
> regions. Because the flag MEMBLOCK_NOMAP was specifically designed and set
> for firmware reserved memory regions. memblock_is_map_memory() can just be
> skipped as it's always going to be positive and that will be an optimization
> for the normal hotplug memory. Like ZONE_DEVICE based memory, all normal
> hotplugged memory too will not have SECTION_IS_EARLY set for their sections
> 
> Skipping memblock_is_map_memory() for all non early memory sections would
> fix pfn_valid() problem for ZONE_DEVICE based memory and also improve its
> performance for normal hotplug memory as well.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Ard Biesheuvel 
> Cc: Robin Murphy 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Acked-by: David Hildenbrand 
> Fixes: 73b20c84d42d ("arm64: mm: implement pte_devmap support")
> Signed-off-by: Anshuman Khandual 
> ---
>  arch/arm64/mm/init.c | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 0ace5e68efba..5920c527845a 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -230,6 +230,18 @@ int pfn_valid(unsigned long pfn)
>  
>   if (!valid_section(__pfn_to_section(pfn)))
>   return 0;
> +
> + /*
> +  * ZONE_DEVICE memory does not have the memblock entries.
> +  * memblock_is_map_memory() check for ZONE_DEVICE based
> +  * addresses will always fail. Even the normal hotplugged
> +  * memory will never have MEMBLOCK_NOMAP flag set in their
> +  * memblock entries. Skip memblock search for all non early
> +  * memory sections covering all of hotplug memory including
> +  * both normal and ZONE_DEVICE based.
> +  */
> + if (!early_section(__pfn_to_section(pfn)))
> + return pfn_section_valid(__pfn_to_section(pfn), pfn);

Would something like this work instead:

if (online_device_section(ms))
return 1;

to avoid the assumptions around early_section()?

-- 
Catalin


Re: [RESEND PATCH 1/2] ARM64: enable GENERIC_FIND_FIRST_BIT

2021-03-05 Thread Catalin Marinas
On Wed, Mar 03, 2021 at 02:17:41PM -0800, Yury Norov wrote:
> On Thu, Feb 25, 2021 at 02:02:06PM +, Will Deacon wrote:
> > On Thu, Feb 25, 2021 at 05:56:59AM -0800, Yury Norov wrote:
> > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > index 31bd885b79eb..5596eab04092 100644
> > > --- a/arch/arm64/Kconfig
> > > +++ b/arch/arm64/Kconfig
> > > @@ -108,6 +108,7 @@ config ARM64
> > >   select GENERIC_CPU_AUTOPROBE
> > >   select GENERIC_CPU_VULNERABILITIES
> > >   select GENERIC_EARLY_IOREMAP
> > > + select GENERIC_FIND_FIRST_BIT
> > >   select GENERIC_IDLE_POLL_SETUP
> > >   select GENERIC_IRQ_IPI
> > >   select GENERIC_IRQ_MULTI_HANDLER
> > 
> > Acked-by: Will Deacon 
> > 
> > Catalin can pick this up later in the cycle.
> > 
> > Will
> 
> Ping?

It's on my list for 5.13 but I'll only start queuing patches after -rc3.

-- 
Catalin


Re: [PATCH V2 1/2] arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory

2021-03-03 Thread Catalin Marinas
On Thu, Feb 11, 2021 at 01:35:56PM +0100, David Hildenbrand wrote:
> On 11.02.21 13:10, Anshuman Khandual wrote:
> > On 2/11/21 5:23 PM, Will Deacon wrote:
> > > ... and dropped. These patches appear to be responsible for a boot
> > > regression reported by CKI:
> > 
> > Ahh, boot regression? These patches only change the behaviour
> > for non-boot memory.
> > 
> > > https://lore.kernel.org/r/cki.8d1cb60fec.k6njmef...@redhat.com
> > 
> > Will look into the logs and see if there is something pointing to
> > the problem.
> 
> It's strange. One thing I can imagine is a mis-detection of early sections.
> However, I don't see that happening:
> 
> In sparse_init_nid(), we:
> 1. Initialize the memmap
> 2. Set SECTION_IS_EARLY | SECTION_HAS_MEM_MAP via
>sparse_init_one_section()
> 
> Only hotplugged sections (DIMMs, dax/kmem) set SECTION_HAS_MEM_MAP without
> SECTION_IS_EARLY - which is correct, because these are not early.
> 
> So once we know that we have valid_section() -- SECTION_HAS_MEM_MAP is set
> -- early_section() should be correct.
> 
> Even if someone would be doing a pfn_valid() after
> memblocks_present()->memory_present() but before
> sparse_init_nid(), we should be fine (!valid_section() -> return 0).

I couldn't figure out how this could fail with Anshuman's patches.
Will's suspicion is that some invalid/null pointer gets dereferenced
before being initialised but the only case I see is somewhere in
pfn_section_valid() (ms->usage) if valid_section() && !early_section().

Assuming that we do get a valid_section(ms) && !early_section(ms), is
there a case where ms->usage is not initialised? I guess races with
section_deactivate() are not possible this early.

Another situation could be that pfn_valid() returns true when no memory
is mapped for that pfn.

> As it happens early during boot, I doubt that some NVDIMMs that get detected
> and added early during boot as system RAM (via dax/kmem) are the problem.

It is indeed very early, we can't even get the early console output.
Debugging this is even harder as it's only misbehaving on a board we
don't have access to.

On the logic in this patch, is the hot-added memory always covering a
full subsection? The arm64 pfn_valid() currently relies on
memblock_is_map_memory() but the patch changes it to
pfn_section_valid(). So if hot-added memory doesn't cover the full
subsection, it may return true even if the pfn is not mapped.

Regarding the robustness of the pfn_valid for ZONE_DEVICE memory, could
we instead have a SECTION_IS_DEVICE flag and only check for that as not
to disturb the hotplugged memory check via memblock_is_map_memory()?
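
Something like the below in the arm64 pfn_valid() (sketch only; a
SECTION_IS_DEVICE flag and a device_section() helper do not exist yet,
they would have to be added to the sparsemem code):

	struct mem_section *ms = __pfn_to_section(pfn);

	if (!valid_section(ms))
		return 0;
	/* ZONE_DEVICE memory has no memblock entry, trust the section map */
	if (device_section(ms))
		return pfn_section_valid(ms, pfn);

	return memblock_is_map_memory(addr);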

-- 
Catalin


Re: [PATCH] arm64/mm: Drop THP conditionality from FORCE_MAX_ZONEORDER

2021-03-01 Thread Catalin Marinas
On Mon, Mar 01, 2021 at 04:55:14PM +0530, Anshuman Khandual wrote:
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 9cd33c7be429..d4690326274a 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -1156,8 +1156,8 @@ config XEN
>  
>  config FORCE_MAX_ZONEORDER
>   int
> - default "14" if (ARM64_64K_PAGES && TRANSPARENT_HUGEPAGE)
> - default "12" if (ARM64_16K_PAGES && TRANSPARENT_HUGEPAGE)
> + default "14" if ARM64_64K_PAGES
> + default "12" if ARM64_16K_PAGES
>   default "11"
>   help
> The kernel memory allocator divides physically contiguous memory

I think this makes sense. The original "14" was added by Steve C in
commit d03bb1455f3a ("ARM64: mm: Raise MAX_ORDER for 64KB pages and
THP.") back in 3.11. It looks like hugetlbfs (and the HUGETLB_PAGE_ORDER
definition) was added in the same kernel but we somehow missed the
!TRANSPARENT_HUGEPAGE case and smaller page order. The warning in
__fragmentation_index() was added much later in 4.14.

Anyway, the patch looks fine to me, we could apply it to some past
stable kernels:

Acked-by: Catalin Marinas 

An alternative would have been to add a dependency on both
TRANSPARENT_HUGEPAGE and HUGETLB_PAGE but I'm not sure it's worth it.
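
For reference, that alternative would have looked roughly like this
(sketch only, keying the higher orders off either option being enabled):

  config FORCE_MAX_ZONEORDER
	int
	default "14" if (ARM64_64K_PAGES && (TRANSPARENT_HUGEPAGE || HUGETLB_PAGE))
	default "12" if (ARM64_16K_PAGES && (TRANSPARENT_HUGEPAGE || HUGETLB_PAGE))
	default "11"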

-- 
Catalin

