Re: [PATCH 2/3] kernel/pid: Remove default pid_max value

2024-04-11 Thread Andrew Morton
On Thu, 11 Apr 2024 17:40:02 +0200 Michal Koutný  wrote:

> Hello.
> 
> On Mon, Apr 08, 2024 at 01:29:55PM -0700, Andrew Morton 
>  wrote:
> > That seems like a large change.
> 
> In what sense is it large?

A large increase in the maximum number of processes.  Or did I misinterpret?





Re: [PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify()

2024-04-10 Thread Andrew Morton
On Fri,  5 Apr 2024 07:58:11 -0400 Paolo Bonzini  wrote:

> Please review!  Also feel free to take the KVM patches through the mm
> tree, as I don't expect any conflicts.

It's mainly a KVM thing and the MM changes are small and simple.
I'd say that the KVM tree would be a better home?



Re: [External] Re: [PATCH v11 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

2024-04-09 Thread Andrew Morton
On Tue, 9 Apr 2024 12:00:06 -0700 "Ho-Ren (Jack) Chuang" 
 wrote:

> Hi Jonathan,
> 
> On Fri, Apr 5, 2024 at 6:56 AM Jonathan Cameron
>  wrote:
> >
> > On Fri,  5 Apr 2024 00:07:05 +
> > "Ho-Ren (Jack) Chuang"  wrote:
> >
> > > Since different memory devices require finding, allocating, and putting
> > > memory types, these common steps are abstracted in this patch,
> > > enhancing the scalability and conciseness of the code.
> > >
> > > Signed-off-by: Ho-Ren (Jack) Chuang 
> > > Reviewed-by: "Huang, Ying" 
> > Reviewed-by: Jonathan Cameron 
> >
> Thank you for reviewing and for adding your "Reviewed-by"!
> I was wondering if I need to send a v12 and manually add
> this to the commit description, or if this is sufficient.

I had added Jonathan's r-b to the mm.git copy of this patch.



Re: [PATCH 2/3] kernel/pid: Remove default pid_max value

2024-04-08 Thread Andrew Morton
On Mon,  8 Apr 2024 16:58:18 +0200 Michal Koutný  wrote:

> The kernel provides mechanisms, while it should not imply policies --
> default pid_max seems to be an example of the policy that does not fit
> all. At the same time pid_max must have some value assigned, so use the
> end of the allowed range -- pid_max_max.
> 
> This change thus increases initial pid_max from 32k to 4M (x86_64
> defconfig).

That seems like a large change.

It isn't clear why we'd want to merge this patchset.  Does it improve
anyone's life and if so, how?
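For context on the magnitude being discussed: the values involved come from the kernel's PID_MAX_DEFAULT / PID_MAX_LIMIT definitions in include/linux/threads.h. A small sketch (plain Python, with the constants transcribed from the kernel source; the PAGE_SIZE value is an assumption for x86_64):

```python
PAGE_SIZE = 4096           # assumed 4K pages, as on x86_64 defconfig
PID_MAX_DEFAULT = 0x8000   # the historical boot-time default: 32768

def pid_max_limit(sizeof_long, base_small=False):
    """Sketch of PID_MAX_LIMIT from include/linux/threads.h:
    4M on 64-bit, PID_MAX_DEFAULT on 32-bit, 32k on CONFIG_BASE_SMALL."""
    if base_small:
        return PAGE_SIZE * 8
    return 4 * 1024 * 1024 if sizeof_long > 4 else PID_MAX_DEFAULT

# The patch switches the boot-time default from PID_MAX_DEFAULT to the
# limit itself -- a 128x jump on x86_64:
print(pid_max_limit(8) // PID_MAX_DEFAULT)   # -> 128
```

This is why the change reads as "large": the initial ceiling goes from 32k to 4M in one step on 64-bit configs.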




Re: [PATCH] dax/bus.c: replace WARN_ON_ONCE() with lockdep asserts

2024-04-04 Thread Andrew Morton
On Tue, 02 Apr 2024 00:24:28 -0600 Vishal Verma  
wrote:

> In [1], Dan points out that all of the WARN_ON_ONCE() usage in the
> referenced patch should be replaced with lockdep_assert_held(_write)().
> 
> Replace those, and additionally, replace a couple of other
> WARN_ON_ONCE() introduced in the same patch for actual failure
> cases (i.e. when acquiring a semaphore fails in a remove / unregister
> path) with dev_WARN_ONCE() as is the precedent here.
> 
> Recall that previously, unregistration paths were implicitly protected by
> overloading the device lock, which the patch in [1] sought to remove.
> This meant adding a semaphore acquisition in these unregistration paths.
> Since that can fail, and it doesn't make sense to return errors from
> these paths, retain the two instances of (now) dev_WARN_ONCE().
> 
> ...
>
> @@ -471,6 +471,7 @@ static void __unregister_dev_dax(void *dev)
>  
>   dev_dbg(dev, "%s\n", __func__);
>  
> + lockdep_assert_held_write(&dax_region_rwsem);
>   kill_dev_dax(dev_dax);
>   device_del(dev);
>   free_dev_dax_ranges(dev_dax);

This is new and unchangelogged?

I'm taking Dan's reply to your patch as Not-A-Nack ;)
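The distinction being discussed -- a one-time warning versus a hard locking assertion -- can be illustrated with a small user-space analogue. This is a toy model, not the dax/bus.c code; the class and method names are invented for illustration:

```python
import threading

class RWSem:
    """Toy rwsem that records its write owner, so we can mimic
    lockdep_assert_held_write(): a hard assertion that the current
    thread holds the lock for writing, rather than a WARN_ON_ONCE."""
    def __init__(self):
        self._lock = threading.Lock()
        self._write_owner = None

    def down_write(self):
        self._lock.acquire()
        self._write_owner = threading.get_ident()

    def up_write(self):
        self._write_owner = None
        self._lock.release()

    def assert_held_write(self):
        # Analogue of lockdep_assert_held_write(): fires on the
        # *calling context*, not on a runtime failure path.
        assert self._write_owner == threading.get_ident(), \
            "rwsem not write-held by this thread"

sem = RWSem()
sem.down_write()
sem.assert_held_write()   # passes while the lock is write-held
sem.up_write()
```

The kernel's version additionally compiles away without CONFIG_LOCKDEP, which is why it is preferred over WARN_ON_ONCE() for pure locking-contract checks, while dev_WARN_ONCE() remains for genuine runtime failures.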



Re: [PATCH v7 1/5] dax/bus.c: replace driver-core lock usage by a local rwsem

2024-03-12 Thread Andrew Morton
On Tue, 12 Mar 2024 20:20:17 + "Verma, Vishal L"  
wrote:

> > All of these WARN_ON_ONCE() usages should be replaced with
> > lockdep_assert_held() and lockdep_assert_held_write() where appropriate.
> 
> Makes sense - I can send a patch post -rc1 to change these if that's okay 
> Andrew?

Please do.



Re: [syzbot] [virtualization?] linux-next boot error: WARNING: refcount bug in __free_pages_ok

2024-02-19 Thread Andrew Morton
On Mon, 19 Feb 2024 02:35:20 -0500 "Michael S. Tsirkin"  wrote:

> On Sun, Feb 18, 2024 at 09:06:18PM -0800, syzbot wrote:
> > Hello,
> > 
> > syzbot found the following issue on:
> > 
> > HEAD commit:    d37e1e4c52bc Add linux-next specific files for 20240216
> > git tree:       linux-next
> > console output: https://syzkaller.appspot.com/x/log.txt?x=171ca65218
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=4bc446d42a7d56c0
> > dashboard link: https://syzkaller.appspot.com/bug?extid=6f3c38e8a6a0297caa5a
> > compiler:       Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
> > 
> > Downloadable assets:
> > disk image: https://storage.googleapis.com/syzbot-assets/14d0894504b9/disk-d37e1e4c.raw.xz
> > vmlinux: https://storage.googleapis.com/syzbot-assets/6cda61e084ee/vmlinux-d37e1e4c.xz
> > kernel image: https://storage.googleapis.com/syzbot-assets/720c85283c05/bzImage-d37e1e4c.xz
> > 
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+6f3c38e8a6a0297ca...@syzkaller.appspotmail.com
> > 
>
> ...
>
> > usbcore: registered new interface driver port100
> > usbcore: registered new interface driver nfcmrvl
> > Loading iSCSI transport class v2.0-870.
> > virtio_scsi virtio0: 1/0/0 default/read/poll queues
> > [ cut here ]
> > refcount_t: decrement hit 0; leaking memory.
> > WARNING: CPU: 0 PID: 1 at lib/refcount.c:31 
> > refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
> > Modules linked in:
> > CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.8.0-rc4-next-20240216-syzkaller 
> > #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > Google 01/25/2024
> > RIP: 0010:refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
> > Code: b2 00 00 00 e8 b7 94 f0 fc 5b 5d c3 cc cc cc cc e8 ab 94 f0 fc c6 05 
> > c6 16 ce 0a 01 90 48 c7 c7 a0 5a fe 8b e8 67 69 b4 fc 90 <0f> 0b 90 90 eb 
> > d9 e8 8b 94 f0 fc c6 05 a3 16 ce 0a 01 90 48 c7 c7
> > RSP: :c9066e10 EFLAGS: 00010246
> > RAX: 15c2c224c9b50400 RBX: 888020827d2c RCX: 8880162d8000
> > RDX:  RSI:  RDI: 
> > RBP: 0004 R08: 8157b942 R09: fbfff1bf95cc
> > R10: dc00 R11: fbfff1bf95cc R12: ea000502fdc0
> > R13: ea000502fdc8 R14: 1d4000a05fb9 R15: 
> > FS:  () GS:8880b940() knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2: 88823000 CR3: 0df32000 CR4: 003506f0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: fffe0ff0 DR7: 0400
> > Call Trace:
> >  
> >  reset_page_owner include/linux/page_owner.h:24 [inline]
> >  free_pages_prepare mm/page_alloc.c:1140 [inline]
> >  __free_pages_ok+0xc42/0xd70 mm/page_alloc.c:1269
> >  make_alloc_exact+0xc4/0x140 mm/page_alloc.c:4847
> >  vring_alloc_queue drivers/virtio/virtio_ring.c:319 [inline]
> 
> Wow this seems to be breakage deep in mm/ - all virtio does is
> call alloc_pages_exact and that corrupts the refcounts?
> 

Will the bot be performing a bisection?  If not, can one please be
performed?
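The splat above comes from refcount_t's defensive behavior: instead of letting an underflowed count free memory that may still be in use, lib/refcount.c warns once and saturates the counter, deliberately leaking the object. A toy user-space model of that idea (the saturation constant is transcribed from the kernel; everything else is a simplified sketch, not the real API):

```python
REFCOUNT_SATURATED = 0xC0000000   # kernel's poison value, for illustration

class RefCount:
    """Toy refcount_t: a decrement that hits an already-zero count warns
    once and pins the counter at a saturation value rather than freeing."""
    def __init__(self, n=1):
        self.val = n
        self.warned = False

    def dec_and_test(self):
        """Returns True when this decrement drops the count to zero."""
        if self.val == 0:
            if not self.warned:
                print("refcount_t: decrement hit 0; leaking memory.")
                self.warned = True
            self.val = REFCOUNT_SATURATED   # leak on purpose, never free
            return False
        self.val -= 1
        return self.val == 0

r = RefCount(1)
print(r.dec_and_test())   # True: legitimate last reference dropped
print(r.dec_and_test())   # underflow: warns, saturates, returns False
```

So the WARNING in the report means some path under __free_pages_ok() dropped a page reference that was never taken -- the leak is the symptom, not the bug itself, which is why a bisection of the refcounting change is the useful next step.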



Re: [PATCH v3] modules: wait do_free_init correctly

2024-02-18 Thread Andrew Morton
On Sat, 17 Feb 2024 16:18:10 +0800 Changbin Du  wrote:

> The synchronization here is just to ensure the module init section has
> been freed before doing W+X checking.  But commit 1a7b7d922081 ("modules:
> Use vmalloc special flag") moved do_free_init() into a global workqueue
> instead of call_rcu(), so now rcu_barrier() cannot ensure that
> do_free_init() has completed.  We should wait for it via flush_work().
> 
> Without this fix, we could still encounter false-positive reports from
> the W+X check, and the now-unnecessary rcu synchronization can
> introduce a significant delay.
> 
> Eric Chanudet reports that the rcu_barrier introduces ~0.1s delay on a
> PREEMPT_RT kernel.
>   [0.291444] Freeing unused kernel memory: 5568K
>   [0.402442] Run /sbin/init as init process
> 
> With this fix, the above delay can be eliminated.

Thanks, I'll queue this as a delta, to be folded into the base patch
prior to upstreaming.

I added a Tested-by: Eric, if that's OK by him?



Re: [PATCH v7 0/5] Add DAX ABI for memmap_on_memory

2024-01-25 Thread Andrew Morton
On Wed, 24 Jan 2024 12:03:45 -0800 Vishal Verma  
wrote:

> This series adds sysfs ABI to control memmap_on_memory behavior for DAX
> devices.

Thanks.  I'll add this to mm-unstable for some additional testing, but
I do think we should have the evidence of additional review on this
series's four preparatory patches before proceeding further.




Re: [PATCH] kprobes: Use synchronize_rcu_tasks_rude in kprobe_optimizer

2024-01-17 Thread Andrew Morton
On Wed, 17 Jan 2024 06:16:36 + Chen Zhongjin  
wrote:

> There is a deadlock scenario in kprobe_optimizer():
> 
> pid A                       pid B                   pid C
> kprobe_optimizer()          do_exit()               perf_kprobe_init()
> mutex_lock(&kprobe_mutex)   exit_tasks_rcu_start()  mutex_lock(&kprobe_mutex)
> synchronize_rcu_tasks()     zap_pid_ns_processes()  // waiting kprobe_mutex
> // waiting tasks_rcu_exit_srcu  kernel_wait4()
>                             // waiting pid C exit
> 
> To avoid this deadlock loop, use synchronize_rcu_tasks_rude() in
> kprobe_optimizer() rather than synchronize_rcu_tasks().
> synchronize_rcu_tasks_rude() can also guarantee that all preempted tasks
> have scheduled, but it will not wait for tasks_rcu_exit_srcu.
> 
> Signed-off-by: Chen Zhongjin 

Thanks.  Should we backport this fix into earlier kernels?  If so, are
we able to identify a suitable Fixes: target?
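The scenario in the changelog is a classic circular wait: each task blocks on a resource the next one holds. Encoding the three wait-for edges from the changelog and checking for a cycle makes the loop explicit (a sketch for illustration only):

```python
# Wait-for edges from the scenario above (who waits on whom):
#   A (kprobe_optimizer)  waits for B to pass its tasks-RCU exit window,
#   B (do_exit)           waits for its child C to exit,
#   C (perf_kprobe_init)  waits for kprobe_mutex, held by A.
waits_on = {"A": "B", "B": "C", "C": "A"}

def has_cycle(graph, start):
    """Follow single wait-for edges from 'start'; a revisit means deadlock."""
    seen, node = set(), start
    while node in graph:
        if node in seen:
            return True
        seen.add(node)
        node = graph[node]
    return False

print(has_cycle(waits_on, "A"))   # -> True: the deadlock loop closes
```

Removing any one edge breaks the cycle; the patch removes A's edge by switching to a synchronization primitive that does not wait on tasks_rcu_exit_srcu.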



Re: [PATCH] modules: wait do_free_init correctly

2023-12-19 Thread Andrew Morton
On Tue, 19 Dec 2023 22:12:31 +0800 Changbin Du  wrote:

> Commit 1a7b7d922081 ("modules: Use vmalloc special flag") moved
> do_free_init() into a global workqueue instead of call_rcu().  So now
> we should wait for it via flush_work().

What are the runtime effects of this change?



Re: [PATCH -next 2/2] mm: vmscan: add new event to trace shrink lru

2023-12-12 Thread Andrew Morton
On Mon, 11 Dec 2023 19:26:40 -0800 Bixuan Cui  wrote:

> -TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
> +TRACE_EVENT(mm_vmscan_lru_shrink_inactive_start,

Current kernels have a call to trace_mm_vmscan_lru_shrink_inactive() in
evict_folios(), so this renaming broke the build.




Re: [PATCH v2 1/8] scripts/gdb/symbols: add specific ko module load command

2023-09-12 Thread Andrew Morton
On Tue, 12 Sep 2023 11:41:29 +0200 Johannes Berg  
wrote:

> On Tue, 2023-08-08 at 16:30 +0800, Kuan-Ying Lee wrote:
> > Add lx-symbols  command to support add specific
> > ko module.
> 
> I'm not sure how this was supposed to work? It should have updated the
> documentation, but more importantly, it shouldn't have broken the
> documented usage of this command:
> 
>   The kernel (vmlinux) is taken from the current working directory.
>   Modules (.ko) are scanned recursively, starting in the same directory.
>   Optionally, the module search path can be extended by a space separated
>   list of paths passed to the lx-symbols command.
> 
> Note how that talks about a "space separated list of paths" for which
> clearly a single path seems like it should be accepted?
> 
> > @@ -138,6 +139,19 @@ lx-symbols command."""

Thanks, I queued a revert.

From: Andrew Morton 
Subject: revert "scripts/gdb/symbols: add specific ko module load command"
Date: Tue Sep 12 09:19:10 AM PDT 2023

Revert 11f956538c07 ("scripts/gdb/symbols: add specific ko module load
command") due to breakage identified by Johannes Berg in [1].

Fixes: 11f956538c07 ("scripts/gdb/symbols: add specific ko module load command")
Reported-by: Johannes Berg 
Closes: 
https://lkml.kernel.org/r/c44b748307a074d0c250002cdcfe209b8cce93c9.ca...@sipsolutions.net
 [1]
Cc: AngeloGioacchino Del Regno 
Cc: Chinwen Chang 
Cc: Jan Kiszka 
Cc: Kieran Bingham 
Cc: Kuan-Ying Lee 
Cc: Matthias Brugger 
Cc: Qun-Wei Lin 
Signed-off-by: Andrew Morton 
---

 scripts/gdb/linux/symbols.py |   23 ++-
 1 file changed, 2 insertions(+), 21 deletions(-)

--- a/scripts/gdb/linux/symbols.py~revert-scripts-gdb-symbols-add-specific-ko-module-load-command
+++ a/scripts/gdb/linux/symbols.py
@@ -111,12 +111,11 @@ lx-symbols command."""
 return "{textaddr} {sections}".format(
 textaddr=textaddr, sections="".join(args))
 
-def load_module_symbols(self, module, module_file=None):
+def load_module_symbols(self, module):
 module_name = module['name'].string()
module_addr = str(module['mem'][constants.LX_MOD_TEXT]['base']).split()[0]
 
-if not module_file:
-module_file = self._get_module_file(module_name)
+module_file = self._get_module_file(module_name)
 if not module_file and not self.module_files_updated:
 self._update_module_files()
 module_file = self._get_module_file(module_name)
@@ -139,19 +138,6 @@ lx-symbols command."""
 else:
 gdb.write("no module object found for '{0}'\n".format(module_name))
 
-def load_ko_symbols(self, mod_path):
-self.loaded_modules = []
-module_list = modules.module_list()
-
-for module in module_list:
-module_name = module['name'].string()
-module_pattern = ".*/{0}\.ko(?:.debug)?$".format(
-module_name.replace("_", r"[_\-]"))
-if re.match(module_pattern, mod_path) and os.path.exists(mod_path):
-self.load_module_symbols(module, mod_path)
-return
-raise gdb.GdbError("%s is not a valid .ko\n" % mod_path)
-
 def load_all_symbols(self):
 gdb.write("loading vmlinux\n")
 
@@ -190,11 +176,6 @@ lx-symbols command."""
 self.module_files = []
 self.module_files_updated = False
 
-argv = gdb.string_to_argv(arg)
-if len(argv) == 1:
-self.load_ko_symbols(argv[0])
-return
-
 self.load_all_symbols()
 
 if hasattr(gdb, 'Breakpoint'):
_
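For reference, the path-matching logic in the reverted load_ko_symbols() hunk can be exercised on its own. The regex below is taken from the removed code (note the unescaped dot before "debug" is in the original); the wrapper function and sample paths are invented for illustration:

```python
import re

def matches_module(module_name, mod_path):
    """Pattern from the removed load_ko_symbols(): accept 'foo.ko' or
    'foo.ko.debug', treating '_' and '-' in the name as interchangeable
    (module names use '_' while file names often use '-')."""
    module_pattern = ".*/{0}\\.ko(?:.debug)?$".format(
        module_name.replace("_", r"[_\-]"))
    return re.match(module_pattern, mod_path) is not None

print(matches_module("snd_hda_intel", "/lib/modules/snd-hda-intel.ko"))  # True
print(matches_module("snd_hda_intel", "/tmp/other.ko"))                  # False
```

The revert drops this single-path mode entirely, restoring the documented "space separated list of paths" behavior of lx-symbols.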



Re: [PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT

2023-07-24 Thread Andrew Morton
On Fri, 21 Jul 2023 14:15:31 +1000 Alistair Popple  wrote:

> Thanks for this Huang, I had been hoping to take a look at it this week
> but have run out of time. I'm keen to do some testing with it as well.

Thanks.  I'll queue this in mm-unstable for some testing.  Detailed
review and testing would be appreciated.

I made some adjustments to handle the renaming of destroy_memory_type()
to put_memory_type()
(https://lkml.kernel.org/r/20230706063905.543800-1-linmia...@huawei.com)



Re: [PATCH v2.1 1/8] fsdax: introduce page->share for fsdax in reflink mode

2022-12-02 Thread Andrew Morton
On Fri, 2 Dec 2022 09:23:11 + Shiyang Ruan  wrote:

> An fsdax page is shared not only for CoW, but also for mapread.  To make
> this easier to understand, use 'share' to indicate that the dax page is
> shared by more than one extent, and add helper functions to use it.
> 
> Also, the flag needs to be renamed to PAGE_MAPPING_DAX_SHARED.
> 

For those who are wondering what changed, I queued the below incremental
patch.


From: Shiyang Ruan 
Subject: fsdax: introduce page->share for fsdax in reflink mode
Date: Fri, 2 Dec 2022 09:23:11 +

rename several functions

Link: 
https://lkml.kernel.org/r/1669972991-246-1-git-send-email-ruansy.f...@fujitsu.com
Signed-off-by: Shiyang Ruan 
Cc: Alistair Popple 
Cc: Dan Williams 
Cc: Darrick J. Wong 
Cc: Dave Chinner 
Cc: Jason Gunthorpe 
Cc: John Hubbard 
Signed-off-by: Andrew Morton 
---

 fs/dax.c |   12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/fs/dax.c~fsdax-introduce-page-share-for-fsdax-in-reflink-mode-fix
+++ a/fs/dax.c
@@ -334,7 +334,7 @@ static unsigned long dax_end_pfn(void *e
for (pfn = dax_to_pfn(entry); \
pfn < dax_end_pfn(entry); pfn++)
 
-static inline bool dax_mapping_is_shared(struct page *page)
+static inline bool dax_page_is_shared(struct page *page)
 {
return (unsigned long)page->mapping == PAGE_MAPPING_DAX_SHARED;
 }
@@ -343,7 +343,7 @@ static inline bool dax_mapping_is_shared
  * Set the page->mapping with PAGE_MAPPING_DAX_SHARED flag, increase the
  * refcount.
  */
-static inline void dax_mapping_set_shared(struct page *page)
+static inline void dax_page_bump_sharing(struct page *page)
 {
if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_SHARED) {
/*
@@ -357,7 +357,7 @@ static inline void dax_mapping_set_share
page->share++;
 }
 
-static inline unsigned long dax_mapping_decrease_shared(struct page *page)
+static inline unsigned long dax_page_drop_sharing(struct page *page)
 {
return --page->share;
 }
@@ -381,7 +381,7 @@ static void dax_associate_entry(void *en
struct page *page = pfn_to_page(pfn);
 
if (shared) {
-   dax_mapping_set_shared(page);
+   dax_page_bump_sharing(page);
} else {
WARN_ON_ONCE(page->mapping);
page->mapping = mapping;
@@ -402,9 +402,9 @@ static void dax_disassociate_entry(void
struct page *page = pfn_to_page(pfn);
 
WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
-   if (dax_mapping_is_shared(page)) {
+   if (dax_page_is_shared(page)) {
/* keep the shared flag if this page is still shared */
-   if (dax_mapping_decrease_shared(page) > 0)
+   if (dax_page_drop_sharing(page) > 0)
continue;
} else
WARN_ON_ONCE(page->mapping && page->mapping != mapping);
_
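The renamed helpers overlay a sharing count on the page: page->mapping doubles as a "shared" flag, and page->share counts the extents mapping the page. A toy model of the bump/drop semantics (simplified from the patch above; the sentinel and field handling are illustrative, not the kernel layout):

```python
PAGE_MAPPING_DAX_SHARED = object()   # sentinel standing in for the flag value

class Page:
    """Toy fsdax page: 'mapping' doubles as the shared flag,
    'share' counts how many extents currently map the page."""
    def __init__(self):
        self.mapping = None
        self.share = 0

def dax_page_is_shared(page):
    return page.mapping is PAGE_MAPPING_DAX_SHARED

def dax_page_bump_sharing(page):
    if page.mapping is not PAGE_MAPPING_DAX_SHARED:
        page.mapping = PAGE_MAPPING_DAX_SHARED  # first sharer sets the flag
        page.share = 0
    page.share += 1

def dax_page_drop_sharing(page):
    """Returns the remaining share count; 0 means the last sharer is gone."""
    page.share -= 1
    return page.share

p = Page()
dax_page_bump_sharing(p)
dax_page_bump_sharing(p)                 # two extents now share the page
print(dax_page_is_shared(p), p.share)    # True 2
print(dax_page_drop_sharing(p))          # 1: still shared by one extent
```

This matches the disassociate logic in the diff: the shared flag is kept as long as dax_page_drop_sharing() returns a positive count.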




Re: [PATCH v2 3/8] fsdax: zero the edges if source is HOLE or UNWRITTEN

2022-12-01 Thread Andrew Morton
On Thu, 1 Dec 2022 15:58:11 -0800 "Darrick J. Wong"  wrote:

> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -1092,7 +1092,7 @@ static int dax_iomap_direct_access(const struct iomap 
> > *iomap, loff_t pos,
> >  }
> >  
> >  /**
> > - * dax_iomap_cow_copy - Copy the data from source to destination before write
> > + * dax_iomap_copy_around - Copy the data from source to destination before write
> 
>  * dax_iomap_copy_around - Prepare for an unaligned write to a
>  * shared/cow page by copying the data before and after the range to be
>  * written.

Thanks, I added this:

--- a/fs/dax.c~fsdax-zero-the-edges-if-source-is-hole-or-unwritten-fix
+++ a/fs/dax.c
@@ -1092,7 +1092,8 @@ out:
 }
 
 /**
- * dax_iomap_copy_around - Copy the data from source to destination before write
+ * dax_iomap_copy_around - Prepare for an unaligned write to a shared/cow page
+ * by copying the data before and after the range to be written.
  * @pos:        address to do copy from.
  * @length:     size of copy operation.
  * @align_size: aligned w.r.t. align_size (either PMD_SIZE or PAGE_SIZE)
_




Re: [PATCH 0/2] fsdax,xfs: fix warning messages

2022-11-30 Thread Andrew Morton
On Tue, 29 Nov 2022 19:59:14 -0800 Dan Williams  
wrote:

> [ add Andrew ]
> 
> Shiyang Ruan wrote:
> > Many testcases failed in dax+reflink mode with warning messages in dmesg.
> > This also affects dax+noreflink mode if we run the test after a
> > dax+reflink test.  So, the most urgent thing is solving the warning
> > messages.
> > 
> > Patch 1 fixes some mistakes and adds handling of CoW cases not
> > previously considered (srcmap is HOLE or UNWRITTEN).
> > Patch 2 adds the implementation of unshare for fsdax.
> > 
> > With these fixes, most warning messages in dax_associate_entry() are
> > gone.  But honestly, generic/388 still fails randomly with the warning.
> > That test case shuts down the xfs while fsstress is running, and does so
> > many times.  I think the reason is that dax pages in use cannot be
> > invalidated in time when the fs is shut down.  The next time a dax page
> > is to be associated, it still retains the mapping value set last time.
> > I'll keep working on it.
> > 
> > The warning message in dax_writeback_one() can also be fixed because of
> > the dax unshare.
> 
> Thank you for digging in on this, I had been pinned down on CXL tasks
> and worried that we would need to mark FS_DAX broken for a cycle, so
> this is timely.
> 
> My only concern is that these patches look to have significant collisions with
> the fsdax page reference counting reworks pending in linux-next. Although,
> those are still sitting in mm-unstable:
> 
> http://lore.kernel.org/r/20221108162059.2ee440d5244657c4f16bd...@linux-foundation.org

As far as I know, Dan's "Fix the DAX-gup mistake" series is somewhat
stuck.  Jan pointed out:

https://lore.kernel.org/all/20221109113849.p7pwob533ijgrytu@quack3/T/#u

or have Jason's issues since been addressed?

> My preference would be to move ahead with both in which case I can help
> rebase these fixes on top. In that scenario everything would go through
> Andrew.
> 
> However, if we are getting too late in the cycle for that path I think
> these dax-fixes take precedence, and one more cycle to let the page
> reference count reworks sit is ok.

That sounds like a decent approach.  So we go with this series ("fsdax,xfs:
fix warning messages") and aim at 6.3-rc1 with "Fix the DAX-gup
mistake"?



Re: [PATCH -next] memregion: Add arch_flush_memregion() interface

2022-09-08 Thread Andrew Morton
On Thu, 8 Sep 2022 15:51:50 -0700 Dan Williams  wrote:

> Jonathan Cameron wrote:
> > On Wed, 7 Sep 2022 18:07:31 -0700
> > Dan Williams  wrote:
> > 
> > > Andrew Morton wrote:
> > > > I really dislike the term "flush".  Sometimes it means writeback,
> > > > sometimes it means invalidate.  Perhaps at other times it means
> > > > both.
> > > > 
> > > > Can we please be very clear in comments and changelogs about exactly
> > > > what this "flush" does.   With bonus points for being more specific in 
> > > > the 
> > > > function naming?
> > > >   
> > > 
> > > That's a good point, "flush" has been cargo-culted along in Linux's
> > > cache management APIs to mean write-back-and-invalidate. In this case I
> > > think this API is purely about invalidate. It just so happens that x86
> > > has not historically had a global invalidate instruction readily
> > > available which leads to the overuse of wbinvd.
> > > 
> > > It would be nice to make clear that this API is purely about
> > > invalidating any data cached for a physical address impacted by address
> > > space management event (secure erase / new region provision). Write-back
> > > is an unnecessary side-effect.
> > > 
> > > So how about:
> > > 
> > > s/arch_flush_memregion/cpu_cache_invalidate_memregion/?
> > 
> > Want to indicate it 'might' write back perhaps?
> > So could be invalidate or clean and invalidate (using arm ARM terms just to 
> > add
> > to the confusion ;)
> > 
> > Feels like there will be potential race conditions where that matters as we 
> > might
> > force stale data to be written back.
> > 
> > Perhaps a comment is enough for that. Anyone have the "famous last words" 
> > feeling?
> 
> Is "invalidate" not clear that write-back is optional? Maybe not.

Yes, I'd say that "invalidate" means "dirty stuff may or may not have
been written back".  Ditto for invalidate_inode_pages2().

> Also, I realized that we tried to include the address range to allow for
> the possibility of flushing by virtual address range, but that
> overcomplicates the use. I.e. if someone issues a secure erase and the
> region association is not established, does that mean that the cache
> invalidation is not needed? It could be the case that someone
> disables a device, does the secure erase, and then reattaches to the
> same region. The cache invalidation is needed, but at the time of the
> secure erase the HPA was unknown.
> 
> All this to say that I feel the bikeshedding will need to continue until
> morale improves.
> 
> I notice that the DMA API uses 'sync' to indicate, "make this memory
> consistent/coherent for the CPU or the device", so how about an API like
> 
> memregion_sync_for_cpu(int res_desc)
> 
> ...where the @res_desc would be IORES_DESC_CXL for all CXL and
> IORES_DESC_PERSISTENT_MEMORY for the current nvdimm use case.

"sync" is another of my pet peeves ;) In filesystem land, at least. 
Does it mean "start writeback and return" or does it mean "start
writeback, wait for its completion then return".  




Re: [PATCH -next] memregion: Add arch_flush_memregion() interface

2022-09-07 Thread Andrew Morton
I really dislike the term "flush".  Sometimes it means writeback,
sometimes it means invalidate.  Perhaps at other times it means
both.

Can we please be very clear in comments and changelogs about exactly
what this "flush" does.   With bonus points for being more specific in the 
function naming?



Re: [PATCH v2] mm: fix missing wake-up event for FSDAX pages

2022-07-05 Thread Andrew Morton
On Wed, 6 Jul 2022 10:47:32 +0800 Muchun Song  wrote:

> > If this wakeup is not one of these, then are there reports from the
> > softlockup detector?
> > 
> > Do we have reports of processes permanently stuck in D state?
> >
> 
> No. The task is in TASK_INTERRUPTIBLE state (see
> __fuse_dax_break_layouts()).
> The hung task reporter only reports D-state (TASK_UNINTERRUPTIBLE) tasks.

Thanks, I updated the changelog a bit.

: FSDAX page refcounts are 1-based, rather than 0-based: if refcount is
: 1, then the page is freed.  The FSDAX pages can be pinned through GUP,
: then they will be unpinned via unpin_user_page() using a folio variant
: to put the page, however, folio variants did not consider this special
: case, the result will be to miss a wakeup event (like the user of
: __fuse_dax_break_layouts()).  This results in a task being permanently
: stuck in TASK_INTERRUPTIBLE state.
: 
: Since FSDAX pages are only possibly obtained by GUP users, so fix GUP
: instead of folio_put() to lower overhead.

I believe these details are helpful for -stable maintainers who are
wondering why they were sent stuff.  Also for maintainers of
downstream older kernels who are scratching their heads over some user bug
report, trying to find a patch which might fix it - for this they want
to see a description of the user-visible effects, for matching with
that bug report.
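The mechanics described in the changelog -- a 1-based refcount where "count == 1" means idle, plus a wakeup that the generic folio put path skipped -- can be modeled in a few lines. A hedged sketch; the class and method names are invented, and the Event stands in for the wait queue that __fuse_dax_break_layouts() sleeps on:

```python
import threading

class FsdaxPage:
    """Toy FSDAX page: refcount is 1-based (1 == free/idle), pins raise it.
    A waiter sleeps until the count drops back to 1."""
    def __init__(self):
        self.refcount = 1
        self.idle = threading.Event()
        self.idle.set()                 # starts idle

    def pin(self):                      # GUP pin analogue
        self.refcount += 1
        self.idle.clear()

    def unpin_generic(self):
        self.refcount -= 1              # buggy path: forgets the wakeup

    def unpin_dax_aware(self):
        self.refcount -= 1
        if self.refcount == 1:          # fixed path: idle again -> wake waiters
            self.idle.set()

page = FsdaxPage()
page.pin()
page.unpin_dax_aware()
print(page.idle.is_set())   # True: a sleeping waiter would be woken
```

With unpin_generic() the count returns to 1 but the event never fires -- exactly the "task permanently stuck in TASK_INTERRUPTIBLE" symptom, which is why the fix targets the GUP unpin path rather than folio_put().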



Re: [PATCH v2] mm: fix missing wake-up event for FSDAX pages

2022-07-05 Thread Andrew Morton
On Wed, 6 Jul 2022 00:38:41 +0100 Matthew Wilcox  wrote:

> On Tue, Jul 05, 2022 at 02:18:19PM -0700, Andrew Morton wrote:
> > On Tue,  5 Jul 2022 20:35:32 +0800 Muchun Song  
> > wrote:
> > 
> > > FSDAX page refcounts are 1-based, rather than 0-based: if refcount is
> > > 1, then the page is freed.  The FSDAX pages can be pinned through GUP,
> > > then they will be unpinned via unpin_user_page() using a folio variant
> > > to put the page, however, folio variants did not consider this special
> > > case, the result will be to miss a wakeup event (like the user of
> > > __fuse_dax_break_layouts()).  Since FSDAX pages can only be obtained
> > > by GUP users, fix GUP instead of folio_put() to lower overhead.
> > > 
> > 
> > What are the user visible runtime effects of this bug?
> 
> "missing wake up event" seems pretty obvious to me?  Something goes to
> sleep waiting for a page to become unused, and is never woken.

No, missed wakeups are often obscured by another wakeup coming in
shortly afterwards.

If this wakeup is not one of these, then are there reports from the
softlockup detector?

Do we have reports of processes permanently stuck in D state?




Re: [PATCH v2] mm: fix missing wake-up event for FSDAX pages

2022-07-05 Thread Andrew Morton
On Tue,  5 Jul 2022 20:35:32 +0800 Muchun Song  wrote:

> FSDAX page refcounts are 1-based, rather than 0-based: if refcount is
> 1, then the page is freed.  The FSDAX pages can be pinned through GUP,
> then they will be unpinned via unpin_user_page() using a folio variant
> to put the page, however, folio variants did not consider this special
> case, the result will be to miss a wakeup event (like the user of
> __fuse_dax_break_layouts()).  Since FSDAX pages can only be obtained
> by GUP users, fix GUP instead of folio_put() to lower overhead.
> 

What are the user visible runtime effects of this bug?



Re: [PATCHSETS v2] v14 fsdax-rmap + v11 fsdax-reflink

2022-06-16 Thread Andrew Morton


Unless there be last-minute objections, I plan to move this series into
the non-rebasing mm-stable branch a few days from now.



Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink

2022-06-02 Thread Andrew Morton
On Sun, 8 May 2022 22:36:06 +0800 Shiyang Ruan  wrote:

> This is a combination of two patchsets:
>  1.fsdax-rmap: 
> https://lore.kernel.org/linux-xfs/20220419045045.1664996-1-ruansy.f...@fujitsu.com/
>  2.fsdax-reflink: 
> https://lore.kernel.org/linux-xfs/20210928062311.4012070-1-ruansy.f...@fujitsu.com/

I'm getting lost in conflicts trying to get this merged up.  Mainly
memory-failure.c due to patch series "mm, hwpoison: enable 1GB hugepage
support".

Could you please take a look at what's in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm a few hours from
now?  Or the next linux-next.

And I suggest that converting it all into a single 14-patch series
would be more straightforward.

Thanks.



Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink

2022-06-02 Thread Andrew Morton
On Thu, 2 Jun 2022 10:18:09 -0700 "Darrick J. Wong"  wrote:

> On Thu, Jun 02, 2022 at 05:42:13PM +0800, Shiyang Ruan wrote:
> > Hi,
> > 
> > Is there any other work I should do with these two patchsets?  I think they
> > are good for now.  So... since the 5.19-rc1 is coming, could the
> > notify_failure() part be merged as your plan?
> 
> Hmm.  I don't see any of the patches 1-5,7-13 in current upstream, so
> I'm guessing this means Andrew isn't taking it for 5.19?

Sorry, the volume of commentary led me to believe that it wasn't
considered finalized.  Shall take a look now.






Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink

2022-05-10 Thread Andrew Morton
On Tue, 10 May 2022 19:43:01 -0700 "Darrick J. Wong"  wrote:

> On Tue, May 10, 2022 at 07:28:53PM -0700, Andrew Morton wrote:
> > On Tue, 10 May 2022 18:55:50 -0700 Dan Williams  
> > wrote:
> > 
> > > > It'll need to be a stable branch somewhere, but I don't think it
> > > > really matters where as long as it's merged into the xfs for-next
> > > > tree so it gets filesystem test coverage...
> > > 
> > > So how about let the notify_failure() bits go through -mm this cycle,
> > > if Andrew will have it, and then the reflnk work has a clean v5.19-rc1
> > > baseline to build from?
> > 
> > What are we referring to here?  I think a minimal thing would be the
> > memremap.h and memory-failure.c changes from
> > https://lkml.kernel.org/r/20220508143620.1775214-4-ruansy.f...@fujitsu.com ?
> > 
> > Sure, I can scoot that into 5.19-rc1 if you think that's best.  It
> > would probably be straining things to slip it into 5.19.
> > 
> > The use of EOPNOTSUPP is a bit suspect, btw.  It *sounds* like the
> > right thing, but it's a networking errno.  I suppose livable with if it
> > never escapes the kernel, but if it can get back to userspace then a
> > user would be justified in wondering how the heck a filesystem
> > operation generated a networking errno?
> 
>  most filesystems return EOPNOTSUPP rather enthusiastically when
> they don't know how to do something...

Can it propagate back to userspace?



Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink

2022-05-10 Thread Andrew Morton
On Tue, 10 May 2022 18:55:50 -0700 Dan Williams  
wrote:

> > It'll need to be a stable branch somewhere, but I don't think it
> > really matters where as long as it's merged into the xfs for-next
> > tree so it gets filesystem test coverage...
> 
> So how about let the notify_failure() bits go through -mm this cycle,
> if Andrew will have it, and then the reflnk work has a clean v5.19-rc1
> baseline to build from?

What are we referring to here?  I think a minimal thing would be the
memremap.h and memory-failure.c changes from
https://lkml.kernel.org/r/20220508143620.1775214-4-ruansy.f...@fujitsu.com ?

Sure, I can scoot that into 5.19-rc1 if you think that's best.  It
would probably be straining things to slip it into 5.19.

The use of EOPNOTSUPP is a bit suspect, btw.  It *sounds* like the
right thing, but it's a networking errno.  I suppose livable with if it
never escapes the kernel, but if it can get back to userspace then a
user would be justified in wondering how the heck a filesystem
operation generated a networking errno?




Re: [PATCH v5 0/6] Fix some bugs related to ramp and dax

2022-03-31 Thread Andrew Morton
On Thu, 31 Mar 2022 11:55:47 -0400 Qian Cai  wrote:

> On Fri, Mar 18, 2022 at 03:45:23PM +0800, Muchun Song wrote:
> > This series is based on next-20220225.
> > 
> > Patch 1-2 fix a cache flush bug, because subsequent patches depend on
> > those on those changes, there are placed in this series.  Patch 3-4
> > are preparation for fixing a dax bug in patch 5.  Patch 6 is code cleanup
> > since the previous patch remove the usage of follow_invalidate_pte().
> 
> Reverting this series fixed boot crashes.
> 

Thanks.  I'll drop

mm-rmap-fix-cache-flush-on-thp-pages.patch
dax-fix-cache-flush-on-pmd-mapped-pages.patch
mm-rmap-introduce-pfn_mkclean_range-to-cleans-ptes.patch
mm-rmap-introduce-pfn_mkclean_range-to-cleans-ptes-fix.patch
mm-pvmw-add-support-for-walking-devmap-pages.patch
dax-fix-missing-writeprotect-the-pte-entry.patch
dax-fix-missing-writeprotect-the-pte-entry-v6.patch
mm-simplify-follow_invalidate_pte.patch




Re: [PATCH v3 4/6] mm: pvmw: add support for walking devmap pages

2022-02-28 Thread Andrew Morton
On Mon, 28 Feb 2022 14:35:34 +0800 Muchun Song  wrote:

> The devmap pages can not use page_vma_mapped_walk() to check if a huge
> devmap page is mapped into a vma.  Add support for walking huge devmap
> pages so that DAX can use it in the next patch.
> 

x86_64 allnoconfig:

In file included from :
In function 'check_pmd',
inlined from 'page_vma_mapped_walk' at mm/page_vma_mapped.c:219:10:
././include/linux/compiler_types.h:347:45: error: call to 
'__compiletime_assert_232' declared with attribute error: BUILD_BUG failed
  347 | _compiletime_assert(condition, msg, __compiletime_assert_, 
__COUNTER__)
  | ^
././include/linux/compiler_types.h:328:25: note: in definition of macro 
'__compiletime_assert'
  328 | prefix ## suffix(); 
\
  | ^~
././include/linux/compiler_types.h:347:9: note: in expansion of macro 
'_compiletime_assert'
  347 | _compiletime_assert(condition, msg, __compiletime_assert_, 
__COUNTER__)
  | ^~~
./include/linux/build_bug.h:39:37: note: in expansion of macro 
'compiletime_assert'
   39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
  | ^~
./include/linux/build_bug.h:59:21: note: in expansion of macro 
'BUILD_BUG_ON_MSG'
   59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
  | ^~~~
./include/linux/huge_mm.h:307:28: note: in expansion of macro 'BUILD_BUG'
  307 | #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
  |^
./include/linux/huge_mm.h:104:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
  104 | #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
  |  ^~~
./include/linux/huge_mm.h:105:26: note: in expansion of macro 'HPAGE_PMD_ORDER'
  105 | #define HPAGE_PMD_NR (1

Re: [PATCH 2/2] drm/ttm: optimize the pool shrinker a bit v2

2021-04-15 Thread Andrew Morton
On Thu, 15 Apr 2021 13:56:24 +0200 "Christian König" 
 wrote:

> @@ -530,6 +525,11 @@ void ttm_pool_fini(struct ttm_pool *pool)
>   for (j = 0; j < MAX_ORDER; ++j)
>   ttm_pool_type_fini(&pool->caching[i].orders[j]);
>   }
> +
> + /* We removed the pool types from the LRU, but we need to also make sure
> +  * that no shrinker is concurrently freeing pages from the pool.
> +  */
> + sync_shrinkers();

It isn't immediately clear to me how this works.  ttm_pool_fini() has
already freed all the pages hasn't it?  So why would it care if some
shrinkers are still playing with the pages?

Or is it the case that ttm_pool_fini() is assuming that there will be
some further action against these pages, which requires that shrinkers
no longer be accessing the pages and which further assumes that future
shrinker invocations will not be able to look up these pages?

IOW, a bit more explanation about the dynamics here would help!


Re: [PATCH v13 14/14] powerpc/64s/radix: Enable huge vmalloc mappings

2021-04-15 Thread Andrew Morton
On Thu, 15 Apr 2021 12:23:55 +0200 Christophe Leroy 
 wrote:
> > +* is done. STRICT_MODULE_RWX may require extra work to support this
> > +* too.
> > +*/
> >   
> > -   return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, 
> > GFP_KERNEL,
> > -   PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS, 
> > NUMA_NO_NODE,
> 
> 
> I think you should add the following in 
> 
> #ifndef MODULES_VADDR
> #define MODULES_VADDR VMALLOC_START
> #define MODULES_END VMALLOC_END
> #endif
> 
> And leave module_alloc() as is (just removing the enclosing #ifdef 
> MODULES_VADDR and adding the 
> VM_NO_HUGE_VMAP  flag)
> 
> This would minimise the conflits with the changes I did in powerpc/next 
> reported by Stephen R.
> 

I'll drop powerpc-64s-radix-enable-huge-vmalloc-mappings.patch for now,
make life simpler.

Nick, a redo on top of Christophe's changes in linux-next would be best
please.



Re: [PATCH] lib: remove "expecting prototype" kernel-doc warnings

2021-04-13 Thread Andrew Morton
On Sun, 11 Apr 2021 15:17:56 -0700 Randy Dunlap  wrote:

> Fix various kernel-doc warnings in lib/ due to missing or
> erroneous function names.
> Add kernel-doc for some function parameters that were missing.
> Use kernel-doc "Return:" notation in earlycpio.c.
> 
> Quietens the following warnings:
> 
> ../lib/earlycpio.c:61: warning: expecting prototype for cpio_data 
> find_cpio_data(). Prototype was for find_cpio_data() instead
> 
> ../lib/lru_cache.c:640: warning: expecting prototype for lc_dump(). Prototype 
> was for lc_seq_dump_details() instead
> lru_cache.c:90: warning: Function parameter or member 'cache' not described 
> in 'lc_create'

I'm still seeing this.

> lru_cache.c:90: warning: Function parameter or member 'cache' not described 
> in 'lc_create'

But it looks OK:

/**
 * lc_create - prepares to track objects in an active set
 * @name: descriptive name only used in lc_seq_printf_stats and 
lc_seq_dump_details
 * @cache: cache root pointer
 * @max_pending_changes: maximum changes to accumulate until a transaction is 
required
 * @e_count: number of elements allowed to be active simultaneously
 * @e_size: size of the tracked objects
 * @e_off: offset to the  lc_element member in a tracked object
 *
 * Returns a pointer to a newly initialized struct lru_cache on success,
 * or NULL on (allocation) failure.
 */
struct lru_cache *lc_create(const char *name, struct kmem_cache *cache,
unsigned max_pending_changes,
unsigned e_count, size_t e_size, size_t e_off)
{



Re: [PATCH v4] kasan: remove redundant config option

2021-04-11 Thread Andrew Morton
On Sun, 11 Apr 2021 11:53:33 +0100 Catalin Marinas  
wrote:

> Hi Andrew,
> 
> On Tue, Mar 30, 2021 at 10:36:37PM -0700, Andrew Morton wrote:
> > On Mon, 29 Mar 2021 16:54:26 +0200 Andrey Konovalov  
> > wrote:
> > > Looks like my patch "kasan: fix KASAN_STACK dependency for HW_TAGS"
> > > that was merged into 5.12-rc causes a build time warning:
> > > 
> > > include/linux/kasan.h:333:30: warning: 'CONFIG_KASAN_STACK' is not
> > > defined, evaluates to 0 [-Wundef]
> > > #if defined(CONFIG_KASAN) && CONFIG_KASAN_STACK
> > > 
> > > The fix for it would either be reverting the patch (which would leave
> > > the initial issue unfixed) or applying this "kasan: remove redundant
> > > config option" patch.
> > > 
> > > Would it be possible to send this patch (with the fix-up you have in
> > > mm) for the next 5.12-rc?
> > > 
> > > Here are the required tags:
> > > 
> > > Fixes: d9b571c885a8 ("kasan: fix KASAN_STACK dependency for HW_TAGS")
> > > Cc: sta...@vger.kernel.org
> > 
> > Got it, thanks.  I updated the changelog to mention the warning fix and
> > moved these ahead for a -rc merge.
> 
> Is there a chance this patch makes it into 5.12? I still get the warning
> with the latest Linus' tree (v5.12-rc6-408-g52e44129fba5) when enabling
> KASAN_HW_TAGS.

Trying.   We're still awaiting a tested fix for
https://lkml.kernel.org/r/ca+fcnzf1abrqg0dsxtoza9zm1bsblyq_xbu+xi9cv8wazxd...@mail.gmail.com


Re: [PATCH v4] lib/string: Introduce sysfs_streqcase

2021-04-09 Thread Andrew Morton
On Fri, 9 Apr 2021 08:44:39 +0200 Greg Kroah-Hartman 
 wrote:

> On Thu, Apr 08, 2021 at 10:05:02PM -0700, Andrew Morton wrote:
> > On Thu,  8 Apr 2021 15:06:05 +0200 Gioh Kim  wrote:
> > 
> > > As the name shows, it checks if strings are equal in a case-insensitive
> > > manner.
> > 
> > Peh.  Who would die if we simply made sysfs_streq() case-insensitive?
> 
> I doubt anyone, let's do that instead.

There's a risk that people will write scripts/config/etc on a 5.13+
kernel and then find that they malfunction on earlier kernels...



Re: [PATCH v1 1/1] kernel.h: Split out panic and oops helpers

2021-04-09 Thread Andrew Morton
On Wed, 7 Apr 2021 11:46:37 +0300 Andy Shevchenko  
wrote:

> On Wed, Apr 7, 2021 at 11:17 AM Kees Cook  wrote:
> >
> > On Tue, Apr 06, 2021 at 04:31:58PM +0300, Andy Shevchenko wrote:
> > > kernel.h is being used as a dump for all kinds of stuff for a long time.
> > > Here is the attempt to start cleaning it up by splitting out panic and
> > > oops helpers.
> > >
> > > At the same time convert users in header and lib folder to use new header.
> > > Though for time being include new header back to kernel.h to avoid twisted
> > > indirected includes for existing users.
> > >
> > > Signed-off-by: Andy Shevchenko 
> >
> > I like it! Do you have a multi-arch CI to do allmodconfig builds to
> > double-check this?
> 
> Unfortunately no, I rely on plenty of bots that are harvesting mailing lists.
> 
> But I will appreciate it if somebody can run this through various build tests.
> 

um, did you try x86_64 allmodconfig?

I'm up to
kernelh-split-out-panic-and-oops-helpers-fix-fix-fix-fix-fix-fix-fix.patch
and counting.

From: Andrew Morton 
Subject: kernelh-split-out-panic-and-oops-helpers-fix

more files need panic_notifier.h

Cc: Andy Shevchenko 
Signed-off-by: Andrew Morton 
---

 arch/x86/xen/enlighten.c|1 +
 drivers/video/fbdev/hyperv_fb.c |1 +
 2 files changed, 2 insertions(+)

--- a/arch/x86/xen/enlighten.c~kernelh-split-out-panic-and-oops-helpers-fix
+++ a/arch/x86/xen/enlighten.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
--- 
a/drivers/video/fbdev/hyperv_fb.c~kernelh-split-out-panic-and-oops-helpers-fix
+++ a/drivers/video/fbdev/hyperv_fb.c
@@ -52,6 +52,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
_


From: Andrew Morton 
Subject: kernelh-split-out-panic-and-oops-helpers-fix-fix

arch/x86/purgatory/purgatory.c needs kernel.h

Cc: Andy Shevchenko 
Signed-off-by: Andrew Morton 
---

 arch/x86/purgatory/purgatory.c |1 +
 1 file changed, 1 insertion(+)

--- 
a/arch/x86/purgatory/purgatory.c~kernelh-split-out-panic-and-oops-helpers-fix-fix
+++ a/arch/x86/purgatory/purgatory.c
@@ -8,6 +8,7 @@
  *   Vivek Goyal 
  */
 
+#include 
 #include 
 #include 
 #include 
_

From: Andrew Morton 
Subject: kernelh-split-out-panic-and-oops-helpers-fix-fix-fix

drivers/clk/analogbits/wrpll-cln28hpc.c needs minmax.h, math.h and limits.h

Cc: Andy Shevchenko 
Signed-off-by: Andrew Morton 
---

 drivers/clk/analogbits/wrpll-cln28hpc.c |4 
 1 file changed, 4 insertions(+)

--- 
a/drivers/clk/analogbits/wrpll-cln28hpc.c~kernelh-split-out-panic-and-oops-helpers-fix-fix-fix
+++ a/drivers/clk/analogbits/wrpll-cln28hpc.c
@@ -25,6 +25,10 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+
 #include 
 
 /* MIN_INPUT_FREQ: minimum input clock frequency, in Hz (Fref_min) */
_

From: Andrew Morton 
Subject: kernelh-split-out-panic-and-oops-helpers-fix-fix-fix-fix

drivers/misc/pvpanic/pvpanic.c needs panic_notifier.h

Cc: Andy Shevchenko 
Signed-off-by: Andrew Morton 
---

 drivers/misc/pvpanic/pvpanic.c |1 +
 1 file changed, 1 insertion(+)

--- 
a/drivers/misc/pvpanic/pvpanic.c~kernelh-split-out-panic-and-oops-helpers-fix-fix-fix-fix
+++ a/drivers/misc/pvpanic/pvpanic.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
_
From: Andrew Morton 
Subject: kernelh-split-out-panic-and-oops-helpers-fix-fix-fix-fix-fix

fix drivers/misc/pvpanic/pvpanic.c and drivers/net/ipa/ipa_smp2p.c

Cc: Andy Shevchenko 
Signed-off-by: Andrew Morton 
---

 drivers/net/ipa/ipa_smp2p.c |1 +
 1 file changed, 1 insertion(+)

--- 
a/drivers/net/ipa/ipa_smp2p.c~kernelh-split-out-panic-and-oops-helpers-fix-fix-fix-fix-fix
+++ a/drivers/net/ipa/ipa_smp2p.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
_

From: Andrew Morton 
Subject: kernelh-split-out-panic-and-oops-helpers-fix-fix-fix-fix-fix-fix

fix drivers/power/reset/ltc2952-poweroff.c and drivers/misc/bcm-vk/bcm_vk_dev.c

Cc: Andy Shevchenko 
Signed-off-by: Andrew Morton 
---

 drivers/misc/bcm-vk/bcm_vk_dev.c   |1 +
 drivers/power/reset/ltc2952-poweroff.c |1 +
 2 files changed, 2 insertions(+)

--- 
a/drivers/power/reset/ltc2952-poweroff.c~kernelh-split-out-panic-and-oops-helpers-fix-fix-fix-fix-fix-fix
+++ a/drivers/power/reset/ltc2952-poweroff.c
@@ -52,6 +52,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
--- 
a/drivers/misc/bcm-vk/bcm_vk_dev.c~kernelh-split-out-panic-and-oops-helpers-fix-fix-fix-fix-fix-fix
+++ a/drivers/misc/bcm-vk/bcm_vk_dev.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
_

From: Andrew Morton 
Subject: kernelh-split-out-panic-and-oops-helpers-fix-fix-fix-fix-fix-fix-fix

fix drivers/leds/trigger/ledtrig-panic.c and drivers/firmware/google/gsmi.c

Cc: Andy Sh

Re: [PATCH v6 4/8] mm,memory_hotplug: Allocate memmap from the added memory range

2021-04-08 Thread Andrew Morton
On Wed, 07 Apr 2021 22:38:37 +0200 Oscar Salvador  wrote:

> On 2021-04-06 22:28, Oscar Salvador wrote:
> > Heh, it seems I spaced out today.
> > 
> > We need a few things on top:
> > 
> 

Yes please.


Re: [PATCH v4 0/8] make hugetlb put_page safe for all calling contexts

2021-04-08 Thread Andrew Morton
On Thu, 8 Apr 2021 09:11:30 +0200 Oscar Salvador  wrote:

> But if It is going to be easier for Andrew, just pull them all out and I
> will resend the whole series once this work goes in.

I think so.

I shall drop these:

mmpage_alloc-bail-out-earlier-on-enomem-in-alloc_contig_migrate_range.patch
mmcompaction-let-isolate_migratepages_rangeblock-return-error-codes.patch
mmcompaction-let-isolate_migratepages_rangeblock-return-error-codes-fix.patch
mm-make-alloc_contig_range-handle-free-hugetlb-pages.patch
mm-make-alloc_contig_range-handle-in-use-hugetlb-pages.patch
mmpage_alloc-drop-unnecessary-checks-from-pfn_range_valid_contig.patch

and these:

mm-cma-change-cma-mutex-to-irq-safe-spinlock.patch
hugetlb-no-need-to-drop-hugetlb_lock-to-call-cma_release.patch
hugetlb-add-per-hstate-mutex-to-synchronize-user-adjustments.patch
hugetlb-create-remove_hugetlb_page-to-separate-functionality.patch
hugetlb-call-update_and_free_page-without-hugetlb_lock.patch
hugetlb-change-free_pool_huge_page-to-remove_pool_huge_page.patch
hugetlb-make-free_huge_page-irq-safe.patch
hugetlb-make-free_huge_page-irq-safe-fix.patch
hugetlb-add-lockdep_assert_held-calls-for-hugetlb_lock.patch

Along with notes-to-self that this:

  https://lkml.kernel.org/r/YGwnPCPaq1xKh/8...@hirez.programming.kicks-ass.net

might need attention and that this:

  hugetlb-make-free_huge_page-irq-safe.patch

might need updating.


Re: [PATCH v4] lib/string: Introduce sysfs_streqcase

2021-04-08 Thread Andrew Morton
On Thu,  8 Apr 2021 15:06:05 +0200 Gioh Kim  wrote:

> As the name shows, it checks if strings are equal in a case-insensitive
> manner.

Peh.  Who would die if we simply made sysfs_streq() case-insensitive?


Re: [PATCH] init/version.c: remove unused including

2021-04-08 Thread Andrew Morton
On Thu, 8 Apr 2021 14:26:58 +0800 Tian Tao  wrote:

> Remove including <linux/version.h> that don't need it.
> 

Um, how can version.c possibly not include version.h?

Sure, it may obtain access to version.h via some other include, but
that's plain luck and nonsense.  And it's unreliable and it requires
whichever-header-is-doing-this to continue to include version.h on
behalf of some .c file which should be including it directly!




Re: [PATCH 0/9] userfaultfd: add minor fault handling for shmem

2021-04-08 Thread Andrew Morton
On Thu,  8 Apr 2021 16:43:18 -0700 Axel Rasmussen  
wrote:

> The idea is that it will apply cleanly to akpm's tree, *replacing* the 
> following
> patches (i.e., drop these first, and then apply this series):
> 
> userfaultfd-support-minor-fault-handling-for-shmem.patch
> userfaultfd-support-minor-fault-handling-for-shmem-fix.patch
> userfaultfd-support-minor-fault-handling-for-shmem-fix-2.patch
> userfaultfd-support-minor-fault-handling-for-shmem-fix-3.patch
> userfaultfd-support-minor-fault-handling-for-shmem-fix-4.patch
> userfaultfd-selftests-use-memfd_create-for-shmem-test-type.patch
> userfaultfd-selftests-create-alias-mappings-in-the-shmem-test.patch
> userfaultfd-selftests-reinitialize-test-context-in-each-test.patch
> userfaultfd-selftests-exercise-minor-fault-handling-shmem-support.patch

Well.  the problem is,

> + if (area_alias == MAP_FAILED)
> + err("mmap of memfd alias failed");

`err' doesn't exist until eleventy patches later, in Peter's
"userfaultfd/selftests: unify error handling".  I got tired of (and
lost confidence in) replacing "err(...)" with "fprintf(stderr, ...);
exit(1)" everywhere then fixing up the fallout when Peter's patch came
along.  Shudder.

Sorry, all this material pretty clearly isn't going to make 5.12
(potentially nine days hence), so I shall drop all the userfaultfd
patches.  Let's take a fresh run at all of this after -rc1.


I have tentatively retained the first series:

userfaultfd-add-minor-fault-registration-mode.patch
userfaultfd-add-minor-fault-registration-mode-fix.patch
userfaultfd-disable-huge-pmd-sharing-for-minor-registered-vmas.patch
userfaultfd-hugetlbfs-only-compile-uffd-helpers-if-config-enabled.patch
userfaultfd-add-uffdio_continue-ioctl.patch
userfaultfd-update-documentation-to-describe-minor-fault-handling.patch
userfaultfd-selftests-add-test-exercising-minor-fault-handling.patch

but I don't believe they have had much testing standalone, without the
other userfaultfd patches present.  So I don't think it's smart to
upstream these in this cycle.  Or I could drop them so you and Peter
can have a clean shot at redoing the whole thing.  Please let me know.



Re: [PATCH 4/4] mm/hugeltb: handle the error case in hugetlb_fix_reserve_counts()

2021-04-08 Thread Andrew Morton
On Fri, 9 Apr 2021 11:17:49 +0800 Miaohe Lin  wrote:

> On 2021/4/9 7:25, Mike Kravetz wrote:
> > On 4/2/21 2:32 AM, Miaohe Lin wrote:
> >> A rare out of memory error would prevent removal of the reserve map region
> >> for a page. hugetlb_fix_reserve_counts() handles this rare case to avoid
> >> dangling with incorrect counts. Unfortunately, hugepage_subpool_get_pages
> >> and hugetlb_acct_memory could possibly fail too. We should correctly handle
> >> these cases.
> > 
> > Yes, this is a potential issue.
> > 
> > The 'good news' is that hugetlb_fix_reserve_counts() is unlikely to ever
> > be called.  To do so would imply we could not allocate a region entry
> > which is only 6 words in size.  We also keep a 'cache' of entries so we
> > may not even need to allocate.
> > 
> > But, as mentioned it is a potential issue.
> 
> Yes, a potential *theoretical* issue.
> 
> > 
> >> Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of 
> >> pages")
> > 
> > This is likely going to make this get picked by by stable releases.
> > That is unfortunate as mentioned above this is mostly theoretical.
> > 
> 
> I will drop this. This does not worth backport.
> 

-stable have been asked not to backport MM patches unless MM patches
include "cc:stable".  ie, no making our backporting decisions for us,
please.



Re: [PATCH 1/2] gcov: re-fix clang-11+ support

2021-04-07 Thread Andrew Morton
On Wed, 7 Apr 2021 14:28:21 -0700 Nick Desaulniers  
wrote:

> On Wed, Apr 7, 2021 at 2:21 PM Andrew Morton  
> wrote:
> >
> > On Wed,  7 Apr 2021 11:54:55 -0700 Nick Desaulniers 
> >  wrote:
> >
> > > LLVM changed the expected function signature for
> > > llvm_gcda_emit_function() in the clang-11 release.  Users of clang-11 or
> > > newer may have noticed their kernels producing invalid coverage
> > > information:
> > >
> > > $ llvm-cov gcov -a -c -u -f -b .gcda -- gcno=.gcno
> > > 1 : checksum mismatch, \
> > >   (, ) != (,  > > C>)
> > > 2 Invalid .gcda File!
> > > ...
> > >
> > > Fix up the function signatures so calling this function interprets its
> > > parameters correctly and computes the correct cfg checksum. In
> > > particular, in clang-11, the additional checksum is no longer optional.
> >
> > Which tree is this against?  I'm seeing quite a lot of rejects against
> > Linus's current.
> 
> Today's linux-next; the only recent changes to this single source file
> since my last patches were:
> 
> commit b3c4e66c908b ("gcov: combine common code")
> commit 17d0508a080d ("gcov: use kvmalloc()")
> 
> both have your sign off, so I assume those are in your tree?

Yes, I presently have

gcov-clang-drop-support-for-clang-10-and-older.patch
gcov-combine-common-code.patch
gcov-simplify-buffer-allocation.patch
gcov-use-kvmalloc.patch

But this patch ("gcov: re-fix clang-11+ support") has cc:stable, so it
should be against Linus's tree, to give the -stable trees something
more mergeable.



Re: [PATCH 1/2] gcov: re-fix clang-11+ support

2021-04-07 Thread Andrew Morton
On Wed,  7 Apr 2021 11:54:55 -0700 Nick Desaulniers  
wrote:

> LLVM changed the expected function signature for
> llvm_gcda_emit_function() in the clang-11 release.  Users of clang-11 or
> newer may have noticed their kernels producing invalid coverage
> information:
> 
> $ llvm-cov gcov -a -c -u -f -b .gcda -- gcno=.gcno
> 1 : checksum mismatch, \
>   (, ) != (, )
> 2 Invalid .gcda File!
> ...
> 
> Fix up the function signatures so calling this function interprets its
> parameters correctly and computes the correct cfg checksum. In
> particular, in clang-11, the additional checksum is no longer optional.

Which tree is this against?  I'm seeing quite a lot of rejects against
Linus's current.



Re: [PATCH 2/2] fs/epoll: restore waking from ep_done_scan()

2021-04-05 Thread Andrew Morton
On Mon, 5 Apr 2021 20:22:26 -0700 Davidlohr Bueso  wrote:

> On Mon, 05 Apr 2021, Andrew Morton wrote:
> 
> >Tricky.  339ddb53d373 was merged in December 2019.  So do we backport
> >this fix?  Could any userspace code be depending upon the
> >post-339ddb53d373 behavior?
> 
> As with previous trouble caused by this commit, I vote for restoring the 
> behavior
> backporting the fix, basically the equivalent of adding (which was my 
> intention):
> 
> Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")

OK, I added the Fixes: line and the cc:stable line.


Re: [PATCH-next 2/5] lib/test_vmalloc.c: add a new 'nr_threads' parameter

2021-04-05 Thread Andrew Morton
On Sat, 3 Apr 2021 14:31:43 +0200 Uladzislau Rezki  wrote:

> > 
> > We may need to replaced that kcalloc() with kmvalloc() though...
> >
> Yep. If we limit to USHRT_MAX, the maximum amount of memory for
> internal data would be ~12MB. Something like below:
> 
> diff --git a/lib/test_vmalloc.c b/lib/test_vmalloc.c
> index d337985e4c5e..a5103e3461bf 100644
> --- a/lib/test_vmalloc.c
> +++ b/lib/test_vmalloc.c
> @@ -24,7 +24,7 @@
> MODULE_PARM_DESC(name, msg) \
> 
>  __param(int, nr_threads, 0,
> -   "Number of workers to perform tests(min: 1 max: 1024)");
> +   "Number of workers to perform tests(min: 1 max: 65536)");
> 
>  __param(bool, sequential_test_order, false,
> "Use sequential stress tests order");
> @@ -469,13 +469,13 @@ init_test_configurtion(void)
>  {
> /*
>  * A maximum number of workers is defined as hard-coded
> -* value and set to 1024. We add such gap just in case
> +* value and set to 65536. We add such gap just in case
>  * and for potential heavy stressing.
>  */
> -   nr_threads = clamp(nr_threads, 1, 1024);
> +   nr_threads = clamp(nr_threads, 1, 65536);
> 
> /* Allocate the space for test instances. */
> -   tdriver = kcalloc(nr_threads, sizeof(*tdriver), GFP_KERNEL);
> +   tdriver = kvcalloc(nr_threads, sizeof(*tdriver), GFP_KERNEL);
> if (tdriver == NULL)
> return -1;
> 
> @@ -555,7 +555,7 @@ static void do_concurrent_test(void)
> i, t->stop - t->start);
> }
> 
> -   kfree(tdriver);
> +   kvfree(tdriver);
>  }
> 
>  static int vmalloc_test_init(void)
> 
> Does it sound reasonable for you?

I think so.  It's a test thing so let's give testers more flexibility,
remembering that they don't need as much protection from their own
mistakes.



Re: [PATCH 2/2] fs/epoll: restore waking from ep_done_scan()

2021-04-05 Thread Andrew Morton
On Mon,  5 Apr 2021 16:10:25 -0700 Davidlohr Bueso  wrote:

> 339ddb53d373 (fs/epoll: remove unnecessary wakeups of nested epoll) changed
> the userspace visible behavior of exclusive waiters blocked on a common
> epoll descriptor upon a single event becoming ready. Previously, all tasks
> doing epoll_wait would awake, and now only one is awoken, potentially causing
> missed wakeups on applications that rely on this behavior, such as Apache 
> Qpid.
> 
> While the aforementioned commit aims at having only a wakeup single path in
> ep_poll_callback (with the exceptions of epoll_ctl cases), we need to restore
> the wakeup in what was the old ep_scan_ready_list() such that the next thread
> can be awoken, in a cascading style, after the waker's corresponding 
> ep_send_events().
> 

Tricky.  339ddb53d373 was merged in December 2019.  So do we backport
this fix?  Could any userspace code be depending upon the
post-339ddb53d373 behaviour?

Or do we just leave the post-339ddb53d373 code as-is?  Presumably the
issue is very rarely encountered, and changing it back has its own
risks.



Re: lib/test_kasan_module.c:25:6: warning: variable 'unused' set but not used

2021-04-05 Thread Andrew Morton
On Mon, 5 Apr 2021 02:16:25 +0800 kernel test robot  wrote:

> Hi Andrey,
> 
> First bad commit (maybe != root cause):
> 
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
> master
> head:   2023a53bdf41b7646b1d384b6816af06309f73a5
> commit: 5d92bdffd2d53f98de683229c0ad7d028703fdba kasan: rename 
> CONFIG_TEST_KASAN_MODULE
> date:   6 weeks ago
> config: arm-randconfig-r024-20210404 (attached as .config)
> compiler: arm-linux-gnueabi-gcc (GCC) 9.3.0
> reproduce (this is a W=1 build):
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5d92bdffd2d53f98de683229c0ad7d028703fdba
> git remote add linus 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> git fetch --no-tags linus master
> git checkout 5d92bdffd2d53f98de683229c0ad7d028703fdba
> # save the attached .config to linux build tree
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross 
> ARCH=arm 
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 
> 
> All warnings (new ones prefixed by >>):
> 
>lib/test_kasan_module.c: In function 'copy_user_test':
> >> lib/test_kasan_module.c:25:6: warning: variable 'unused' set but not used 
> >> [-Wunused-but-set-variable]
>   25 |  int unused;
>  |  ^~

Fair enough ;)

--- a/lib/test_kasan_module.c~a
+++ a/lib/test_kasan_module.c
@@ -22,7 +22,7 @@ static noinline void __init copy_user_te
char *kmem;
char __user *usermem;
size_t size = 10;
-   int unused;
+   int __maybe_unused unused;
 
kmem = kmalloc(size, GFP_KERNEL);
if (!kmem)
_

I guess we could test the copy_*_user return values are as expected,
but that isn't the point?


Re: [PATCH] mm: memcontrol: fix forget to obtain the ref to objcg in split_page_memcg

2021-04-02 Thread Andrew Morton
On Wed, 31 Mar 2021 20:35:02 -0700 Roman Gushchin  wrote:

> On Thu, Apr 01, 2021 at 11:31:16AM +0800, Miaohe Lin wrote:
> > On 2021/4/1 11:01, Muchun Song wrote:
> > > Christian Borntraeger reported a warning about "percpu ref
> > > (obj_cgroup_release) <= 0 (-1) after switching to atomic".
> > > Because we forgot to obtain the reference to the objcg and
> > > wrongly obtain the reference of memcg.
> > > 
> > > Reported-by: Christian Borntraeger 
> > > Signed-off-by: Muchun Song 
> > 
> > Thanks for the patch.
> > Is a Fixes tag needed?
> 
> No, as the original patch hasn't been merged into the Linus's tree yet.
> So the fix can be simply squashed.

Help.  Which is "the original patch"?

> Btw, the fix looks good to me.
> 
> Acked-by: Roman Gushchin 



Re: [RFC PATCH] mm/swap: fix system stuck due to infinite loop

2021-04-02 Thread Andrew Morton
On Fri, 2 Apr 2021 15:03:37 +0800 Stillinux  wrote:

> Under high memory and load pressure, we ran the ltp test and found that
> the system got stuck: the direct memory reclaimers were all stuck in
> io_schedule, the requests they were waiting on were stuck in the blk_plug
> of one process, and that process had fallen into an infinite loop without
> ever flushing out the requests.
> 
> The call flow of this process is swap_cluster_readahead, which brackets
> the work with blk_start/finish_plug and goes through
> swap_cluster_readahead->__read_swap_cache_async->swapcache_prepare.
> When swapcache_prepare returns -EEXIST, the caller loops forever; even
> though cond_resched is called, sched_submit_work decides based on
> tsk->state and does not flush out the blk_plug requests, so the io hangs
> and the whole system hangs with it.
> 
> As this is our first time touching the swap code, we found no good way
> to fix the problem at its root. To address the practical situation, we
> chose to make swap_cluster_readahead aware of memory pressure as early
> as possible and do io_schedule to flush out the blk_plug requests, by
> changing the allocation flag in swap_readpage to GFP_NOIO so that it no
> longer does memory reclaim that flushes io. The system then operates
> normally, although this is not the most fundamental fix.
> 

Thanks.

I'm not understanding why swapcache_prepare() repeatedly returns
-EEXIST in this situation?

And how does the switch to GFP_NOIO fix this?  Simply by avoiding
direct reclaim altogether?

> ---
>  mm/page_io.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_io.c b/mm/page_io.c
> index c493ce9ebcf5..87392ffabb12 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -403,7 +403,7 @@ int swap_readpage(struct page *page, bool synchronous)
>   }
> 
>   ret = 0;
> - bio = bio_alloc(GFP_KERNEL, 1);
> + bio = bio_alloc(GFP_NOIO, 1);
>   bio_set_dev(bio, sis->bdev);
>   bio->bi_opf = REQ_OP_READ;
>   bio->bi_iter.bi_sector = swap_page_sector(page);



Re: [PATCH-next 2/5] lib/test_vmalloc.c: add a new 'nr_threads' parameter

2021-04-02 Thread Andrew Morton
On Fri,  2 Apr 2021 22:22:34 +0200 "Uladzislau Rezki (Sony)"  
wrote:

> By using this parameter we can specify how many workers are
> created to perform vmalloc tests. By default it is one CPU.
> The maximum value is set to 1024.
> 
> As a result of this change, the 'single_cpu_test' parameter becomes
> obsolete, therefore it is no longer needed.
> 

Why limit to 1024?  Maybe testers want more - what's the downside to
permitting that?

We may need to replaced that kcalloc() with kmvalloc() though...


Re: [PATCH v6 00/12] lib/find_bit: fast path for small bitmaps

2021-04-01 Thread Andrew Morton
On Thu, 1 Apr 2021 12:50:31 +0300 Andy Shevchenko  
wrote:

> > I normally don't have a lot of material for asm-generic either, half
> > the time there are no pull requests at all for a given release. I would
> > expect future changes to the bitmap implementation to only need
> > an occasional bugfix, which could go through either the asm-generic
> > tree or through mm and doesn't need another separate pull request.
> >
> > If it turns out to be a tree that needs regular updates every time,
> > then having a top level repository in linux-next would be appropriate.
> 
> Agree. asm-generic may serve for this. My worries are solely about how
> much burden we add on Andrew's shoulders.

Is fine.  Saving other developers from having to maintain tiny trees is
a thing I do.



Re: [PATCH] scripts: A new script for checking duplicate struct declaration

2021-04-01 Thread Andrew Morton
On Thu,  1 Apr 2021 19:09:43 +0800 Wan Jiabing  wrote:

> checkdeclares: find struct declared more than once.
> Inspired by checkincludes.pl.
> This script checks for duplicate struct declares.
> Note that this will not take into consideration macros, so
> you should run this only if you know you do have real dups
> and do not have them under #ifdef's.
> You could also just review the results.

include/linux/bpf-cgroup.h: struct bpf_prog is declared more than once.
include/linux/bpf.h: struct xdp_buff is declared more than once.
include/linux/bpf.h: struct sk_buff is declared more than once.
include/linux/bpf.h: struct btf_type is declared more than once.
include/linux/debug_locks.h: struct task_struct is declared more than once.
include/linux/fs.h: struct iov_iter is declared more than once.
include/linux/fs.h: struct files_struct is declared more than once.
include/linux/gpio.h: struct device is declared more than once.
include/linux/host1x.h: struct host1x is declared more than once.
include/linux/intel_rapl.h: struct rapl_package is declared more than once.
include/linux/libnvdimm.h: struct device is declared more than once.
include/linux/lightnvm.h: struct nvm_rq is declared more than once.
include/linux/memcontrol.h: struct mem_cgroup is declared more than once.
include/linux/mount.h: struct path is declared more than once.
include/linux/mutex.h: struct ww_acquire_ctx is declared more than once.
include/linux/netfilter.h: struct flowi is declared more than once.
include/linux/netfilter.h: struct nf_conntrack_tuple is declared more than once.
include/linux/netfilter.h: struct nlattr is declared more than once.
include/linux/netfilter.h: struct nf_conn is declared more than once.
include/linux/profile.h: struct pt_regs is declared more than once.
include/linux/trace_events.h: struct trace_array is declared more than once.

Sigh.  I keep telling them - "put the forward declaration at
top-of-file so it doesn't get duplicated later on".  Nobody listens to
Andrew.

> --- /dev/null
> +++ b/scripts/checkdeclares.pl
> @@ -0,0 +1,53 @@
> +#!/usr/bin/env perl
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# checkdeclares: find struct declared more than once
> +#
> +# Copyright 2021 Wan Jiabing
> +# Inspired by checkincludes.pl
> +#
> +# This script checks for duplicate struct declares.
> +# Note that this will not take into consideration macros so
> +# you should run this only if you know you do have real dups
> +# and do not have them under #ifdef's.
> +# You could also just review the results.
> +
> +use strict;
> +
> +sub usage {
> + print "Usage: checkdeclares.pl \n";
> + print "We just warn of struct declaration duplicates\n";

Not quite accurate.  I added this fixup:

--- 
a/scripts/checkdeclares.pl~scripts-a-new-script-for-checking-duplicate-struct-declaration-fix
+++ a/scripts/checkdeclares.pl
@@ -15,8 +15,8 @@
 use strict;
 
 sub usage {
-   print "Usage: checkdeclares.pl \n";
-   print "We just warn of struct declaration duplicates\n";
+   print "Usage: checkdeclares.pl file1.h ...\n";
+   print "Warns of struct declaration duplicates\n";
exit 1;
 }
 



Re: [PATCH -next] mm/vmalloc: Fix non-conforming function headers

2021-04-01 Thread Andrew Morton
On Thu, 1 Apr 2021 21:22:48 +0800 Qiheng Lin  wrote:

> Fix the following W=1 kernel build warning(s):
>  mm/vmalloc.c:425: warning: expecting prototype for vunmap_range_noflush(). 
> Prototype was for vunmap_range() instead
> 
> ...
>
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -413,7 +413,7 @@ void vunmap_range_noflush(unsigned long start, unsigned 
> long end)
>  }
>  
>  /**
> - * vunmap_range_noflush - unmap kernel virtual addresses
> + * vunmap_range - unmap kernel virtual addresses
>   * @addr: start of the VM area to unmap
>   * @end: end of the VM area to unmap (non-inclusive)
>   *

Thanks.  Nick has just sent along a patch which fixed this.


Re: [PATCH] mm/hugeltb: fix renaming of PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN

2021-04-01 Thread Andrew Morton
On Thu, 1 Apr 2021 10:26:36 -0400 Pavel Tatashin  
wrote:

> On Wed, Mar 31, 2021 at 12:38 PM Mike Rapoport  wrote:
> >
> > From: Mike Rapoport 
> >
> > The renaming of PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN missed one occurrence
> > in mm/hugetlb.c which causes build error:
> >
> >   CC  mm/hugetlb.o
> > mm/hugetlb.c: In function ‘dequeue_huge_page_node_exact’:
> > mm/hugetlb.c:1081:33: error: ‘PF_MEMALLOC_NOCMA’ undeclared (first use in 
> > this function); did you mean ‘PF_MEMALLOC_NOFS’?
> >   bool pin = !!(current->flags & PF_MEMALLOC_NOCMA);
> >  ^
> >  PF_MEMALLOC_NOFS
> > mm/hugetlb.c:1081:33: note: each undeclared identifier is reported only 
> > once for each function it appears in
> > scripts/Makefile.build:273: recipe for target 'mm/hugetlb.o' failed
> > make[2]: *** [mm/hugetlb.o] Error 1
> >
> > Signed-off-by: Mike Rapoport 
> > ---
> >  mm/hugetlb.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index a5236c2f7bb2..c22111f3da20 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1078,7 +1078,7 @@ static void enqueue_huge_page(struct hstate *h, 
> > struct page *page)
> >  static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
> >  {
> > struct page *page;
> > -   bool pin = !!(current->flags & PF_MEMALLOC_NOCMA);
> > +   bool pin = !!(current->flags & PF_MEMALLOC_PIN);
> 
> Thank you Mike!
> 
> Andrew, since "mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN" is
> not yet in the mainline, should I send a new version of this patch so
> we won't have bisecting problems in the future?

I've already added Mike's fix, as
mm-cma-rename-pf_memalloc_nocma-to-pf_memalloc_pin-fix.patch.  It shall
fold it into mm-cma-rename-pf_memalloc_nocma-to-pf_memalloc_pin.patch
prior to upstreaming, so no bisection issue.



Re: [PATCH] mm: page_owner: detect page_owner recursion via task_struct

2021-04-01 Thread Andrew Morton
On Thu,  1 Apr 2021 23:30:10 +0100 Sergei Trofimovich  wrote:

> Before the change page_owner recursion was detected via fetching
> backtrace and inspecting it for current instruction pointer.
> It has a few problems:
> - it is slightly slow as it requires extra backtrace and a linear
>   stack scan of the result
> - it is too late to check if backtrace fetching required memory
>   allocation itself (ia64's unwinder requires it).
> 
> To simplify recursion tracking let's use page_owner recursion depth
> as a counter in 'struct task_struct'.

Seems like a better approach.

> The change makes page_owner=on work on ia64 by avoiding
> recursion in:
>   kmalloc()
>   -> __set_page_owner()
>   -> save_stack()
>   -> unwind() [ia64-specific]
>   -> build_script()
>   -> kmalloc()
>   -> __set_page_owner() [we short-circuit here]
>   -> save_stack()
>   -> unwind() [recursion]
> 
> ...
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1371,6 +1371,15 @@ struct task_struct {
>   struct llist_head   kretprobe_instances;
>  #endif
>  
> +#ifdef CONFIG_PAGE_OWNER
> + /*
> +  * Used by page_owner=on to detect recursion in page tracking.
> +  * Is it fine to have non-atomic ops here if we ever access
> +  * this variable via current->page_owner_depth?

Yes, it is fine.  This part of the comment can be removed.

> +  */
> + unsigned int page_owner_depth;
> +#endif

Adding to the task_struct has a cost.  But I don't expect that
PAGE_OWNER is commonly used in production builds (correct?).

> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -213,6 +213,9 @@ struct task_struct init_task
>  #ifdef CONFIG_SECCOMP
>   .seccomp= { .filter_count = ATOMIC_INIT(0) },
>  #endif
> +#ifdef CONFIG_PAGE_OWNER
> + .page_owner_depth   = 0,
> +#endif
>  };
>  EXPORT_SYMBOL(init_task);

It will be initialized to zero by the compiler.  We can omit this hunk
entirely.

> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -20,6 +20,16 @@
>   */
>  #define PAGE_OWNER_STACK_DEPTH (16)
>  
> +/*
> + * How many reenters we allow to page_owner.
> + *
> + * Sometimes metadata allocation tracking requires more memory to be 
> allocated:
> + * - when new stack trace is saved to stack depot
> + * - when backtrace itself is calculated (ia64)
> + * Instead of falling to infinite recursion give it a chance to recover.
> + */
> +#define PAGE_OWNER_MAX_RECURSION_DEPTH (1)

So this is presently a boolean.  Is there any expectation that
PAGE_OWNER_MAX_RECURSION_DEPTH will ever be greater than 1?  If not, we
could use a single bit in the task_struct.  Add it to the
"Unserialized, strictly 'current'" bitfields.  Could make it a 2-bit field if 
we want
to permit PAGE_OWNER_MAX_RECURSION_DEPTH=larger.




Re: [PATCH] ia64: fix user_stack_pointer() for ptrace()

2021-03-31 Thread Andrew Morton
On Wed, 31 Mar 2021 09:44:47 +0100 Sergei Trofimovich  wrote:

> ia64 has two stacks:
> - memory stack (or stack), pointed at by r12
> - register backing store (register stack), pointed at
>   ar.bsp/ar.bspstore with complications around dirty
>   register frame on CPU.
> 
> In https://bugs.gentoo.org/769614 Dmitry noticed that
> PTRACE_GET_SYSCALL_INFO returns the register stack instead
> of the memory stack.
> 
> The bug comes from the fact that user_stack_pointer() and
> current_user_stack_pointer() don't return the same register:
> 
>   ulong user_stack_pointer(struct pt_regs *regs) { return regs->ar_bspstore; }
>   #define current_user_stack_pointer() (current_pt_regs()->r12)
> 
> The change gets both back in sync.
> 
> I think ptrace(PTRACE_GET_SYSCALL_INFO) is the only affected user
> by this bug on ia64.
> 
> The change fixes 'rt_sigreturn.gen.test' strace test where
> it was observed initially.
> 

I assume a cc:stable is justified here?

The bug seems to have been there for 10+ years, so there isn't a lot of
point in looking for the Fixes: reference.


Re: [External] RE: kernel warning percpu ref in obj_cgroup_release

2021-03-31 Thread Andrew Morton
On Wed, 31 Mar 2021 22:45:12 +0800 Muchun Song  wrote:

> 
> Hi Andrew,
> 
> Now we have two choices to fix this issue.
> 
> 1) Send a v6 patchset (Use obj_cgroup APIs to charge kmem pages)
> to fix this issue.
> 2) Send a separate fix patch (Just like above).
> 
> Both ways are ok for me. But I want to know which one is more
> convenient for you.

Either is OK.  2) is easier.


Re: [PATCH] userfaultfd: Write protect when virtual memory range has no page table entry

2021-03-31 Thread Andrew Morton
On Wed, 31 Mar 2021 16:49:27 +0200 Michal Hocko  wrote:

> > Thanks for the clarification! I have suspected this to be the case but
> > I am not really familiar with the interface to have any strong statement
> > here. Maybe we want to document this explicitly.
> 
> Btw. Andrew the patch still seems to be in mmotm. Do you plan to keep it
> there?

Dropped, thanks.


Re: [PATCH V2 1/1] mm:improve the performance during fork

2021-03-30 Thread Andrew Morton
On Mon, 29 Mar 2021 20:36:35 +0800 qianjun.ker...@gmail.com wrote:

> From: jun qian 
> 
> In our project, many business delays come from fork, so
> we started looking for the reason why fork is time-consuming.
> I used ftrace with function_graph to trace the fork, and found
> that vm_normal_page will be called tens of thousands of times and
> the execution time of this vm_normal_page function is only a
> few nanoseconds. And vm_normal_page is not an inline function.
> So I think if the function is inline style, it may reduce the
> call time overhead.
> 
> I did the following experiment:
> 
> use the bpftrace tool to trace the fork time :
> 
> bpftrace -e 'kprobe:_do_fork/comm=="redis-server"/ {@st=nsecs;} \
> kretprobe:_do_fork /comm=="redis-server"/{printf("the fork time \
> is %d us\n", (nsecs-@st)/1000)}'
> 
> no inline vm_normal_page:
> result:
> the fork time is 40743 us
> the fork time is 41746 us
> the fork time is 41336 us
> the fork time is 42417 us
> the fork time is 40612 us
> the fork time is 40930 us
> the fork time is 41910 us
> 
> inline vm_normal_page:
> result:
> the fork time is 39276 us
> the fork time is 38974 us
> the fork time is 39436 us
> the fork time is 38815 us
> the fork time is 39878 us
> the fork time is 39176 us
> 
> In the same test environment, we can get 3% to 4% of
> performance improvement.
> 
> note:the test data is from the 4.18.0-193.6.3.el8_2.v1.1.x86_64,
> because my product use this version kernel to test the redis
> server, If you need to compare the latest version of the kernel
> test data, you can refer to the version 1 Patch.
> 
> We need to compare the changes in the size of vmlinux:
>   inline   non-inline   diff
> vmlinux size  9709248 bytes   9709824 bytes   -576 bytes
> 

I get very different results with gcc-7.2.0:

q:/usr/src/25> size mm/memory.o
   textdata bss dec hex filename
  748983375  64   78337   13201 mm/memory.o-before
  751193363  64   78546   132d2 mm/memory.o-after

That's a somewhat significant increase in code size, and larger code
size has a worsened cache footprint.

Not that this is necessarily a bad thing for a function which is
tightly called many times in succession as is vm_normal_page().

> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -592,7 +592,7 @@ static void print_bad_pte(struct vm_area_struct *vma, 
> unsigned long addr,
>   * PFNMAP mappings in order to support COWable mappings.
>   *
>   */
> -struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> +inline struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long 
> addr,
>   pte_t pte)
>  {
>   unsigned long pfn = pte_pfn(pte);

I'm a bit surprised this made any difference - rumour has it that
modern gcc just ignores `inline' and makes up its own mind.  Which is
why we added __always_inline.



Re: [PATCH v4] kasan: remove redundant config option

2021-03-30 Thread Andrew Morton
On Mon, 29 Mar 2021 16:54:26 +0200 Andrey Konovalov  
wrote:

> Looks like my patch "kasan: fix KASAN_STACK dependency for HW_TAGS"
> that was merged into 5.12-rc causes a build time warning:
> 
> include/linux/kasan.h:333:30: warning: 'CONFIG_KASAN_STACK' is not
> defined, evaluates to 0 [-Wundef]
> #if defined(CONFIG_KASAN) && CONFIG_KASAN_STACK
> 
> The fix for it would either be reverting the patch (which would leave
> the initial issue unfixed) or applying this "kasan: remove redundant
> config option" patch.
> 
> Would it be possible to send this patch (with the fix-up you have in
> mm) for the next 5.12-rc?
> 
> Here are the required tags:
> 
> Fixes: d9b571c885a8 ("kasan: fix KASAN_STACK dependency for HW_TAGS")
> Cc: sta...@vger.kernel.org

Got it, thanks.  I updated the changelog to mention the warning fix and
moved these ahead for a -rc merge.



Re: [PATCH v1 0/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory

2021-03-30 Thread Andrew Morton
On Tue, 30 Mar 2021 10:58:43 +0200 David Hildenbrand  wrote:

> > 
> >   MAINTAINERS|   1 +
> >   arch/alpha/include/uapi/asm/mman.h |   3 +
> >   arch/mips/include/uapi/asm/mman.h  |   3 +
> >   arch/parisc/include/uapi/asm/mman.h|   3 +
> >   arch/xtensa/include/uapi/asm/mman.h|   3 +
> >   include/uapi/asm-generic/mman-common.h |   3 +
> >   mm/gup.c   |  54 
> >   mm/internal.h  |   5 +-
> >   mm/madvise.c   |  69 +
> >   tools/testing/selftests/vm/.gitignore  |   3 +
> >   tools/testing/selftests/vm/Makefile|   1 +
> >   tools/testing/selftests/vm/madv_populate.c | 342 +
> >   tools/testing/selftests/vm/run_vmtests.sh  |  16 +
> >   13 files changed, 505 insertions(+), 1 deletion(-)
> >   create mode 100644 tools/testing/selftests/vm/madv_populate.c
> > 
> 
> Gentle ping.

Ping who ;)

I was hoping for more review activity.  Were no changes expected from
the [2/5] discussion with Jann?



Re: [PATCH][next] hfsplus: Fix out-of-bounds warnings in __hfsplus_setxattr

2021-03-30 Thread Andrew Morton
On Tue, 30 Mar 2021 09:52:26 -0500 "Gustavo A. R. Silva" 
 wrote:

> Fix the following out-of-bounds warnings by enclosing
> structure members file and finder into new struct info:
> 
> fs/hfsplus/xattr.c:300:5: warning: 'memcpy' offset [65, 80] from the object 
> at 'entry' is out of the bounds of referenced subobject 'user_info' with type 
> 'struct DInfo' at offset 48 [-Warray-bounds]
> fs/hfsplus/xattr.c:313:5: warning: 'memcpy' offset [65, 80] from the object 
> at 'entry' is out of the bounds of referenced subobject 'user_info' with type 
> 'struct FInfo' at offset 48 [-Warray-bounds]
> 
> Refactor the code by making it more "structured."
> 
> Also, this helps with the ongoing efforts to enable -Warray-bounds and
> makes the code clearer and avoid confusing the compiler.

Confused.  What was wrong with the old code?  Was this warning
legitimate and if so, why?  Or is this patch a workaround for a
compiler shortcoming?


Re: WARNING: at fs/proc/generic.c:717 remove_proc_entry

2021-03-30 Thread Andrew Morton
On Tue, 30 Mar 2021 22:19:04 +0530 Naresh Kamboju  
wrote:

> While running kselftest gpio on x86_64 and i386 the following warnings were
> noticed and the device did not recover.

I doubt if changes in the core code caused this.  Let's cc the
gpio-mockup developers.

> GOOD: next-20210329
> BAD: next-20210330
> 
> steps to reproduce:
> ---
> # Boot linux next tag 20210330 on x86_64 machine
> # cd /opt/kselftests/default-in-kernel/
> # ./gpio-mockup.sh --> which is doing modprobe gpio-mockup
> 
> Test log:
> 
> # selftests: gpio: gpio-mockup.sh
> # 1.  Module load tests
> # 1.1.  dynamic allocation of gpio
> # ./gpio-mockup.sh: line 106: ./gpio-mockup-cdev: No such file or directory
> # test failed: line value is 127 when 1 was[   46.136044]
> [ cut here ]
> [   46.141916] remove_proc_entry: removing non-empty directory
> 'irq/3', leaking at least 'ttyS1'
>  expected
> # GPI[   46.150471] WARNING: CPU: 2 PID: 603 at
> /usr/src/kernel/fs/proc/generic.c:717 remove_proc_entry+0x1a8/0x1c0
> [   46.161566] Modules linked in: gpio_mockup(-) x86_pkg_temp_thermal fuse
> [   46.168195] CPU: 2 PID: 603 Comm: modprobe Not tainted
> 5.12.0-rc5-next-20210330 #1
> [   46.175793] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> 2.2 05/23/2018
> [   46.183217] RIP: 0010:remove_proc_entry+0x1a8/0x1c0
> O gpio-mockup te[   46.188128] Code: 40 bc 24 a8 48 c7 c7 98 4a 6a a8
> 48 0f 45 c2 49 8b 94 24 b0 00 00 00 4c 8b 80 d8 00 00 00 48 8b 92 d8
> 00 00 00 e8 38 36 cb ff <0f> 0b e9 5b ff ff ff e8 0c f4 c5 00 66 90 66
> 2e 0f 1f 84 00 00 00
> [   46.208252] RSP: 0018:9da080293c58 EFLAGS: 00010286
> [   46.213511] RAX:  RBX: 93e8403bcbb8 RCX: 
> 
> [   46.220650] RDX: 0001 RSI: 93e9afb177f0 RDI: 
> 93e9afb177f0
> [   46.227820] RBP: 9da080293c88 R08: 0001 R09: 
> 0001
> [   46.235018] R10: 9da080293a80 R11: 9da080293a08 R12: 
> 93e8403bcb00
> [   46.242165] R13: 93e840263100 R14: 0001 R15: 
> 
> st FAIL
> [   46.249303] FS:  7f7dc2501740() GS:93e9afb0()
> knlGS:
> [   46.258164] CS:  0010 DS:  ES:  CR0: 80050033
> [   46.263977] CR2: 7f7dc1de67f0 CR3: 00014f292003 CR4: 
> 003706e0
> [   46.271124] DR0:  DR1:  DR2: 
> 
> [   46.278292] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [   46.285459] Call Trace:
> [   46.287926]  unregister_irq_proc+0xf8/0x110
> [   46.292128]  free_desc+0x2e/0x70
> [   46.295393]  irq_free_descs+0x54/0x80
> [   46.299102]  irq_domain_free_irqs+0x11b/0x150
> [   46.303507]  irq_dispose_mapping+0x77/0x120
> [   46.307711]  gpio_mockup_dispose_mappings+0x4c/0x60 [gpio_mockup]
> [   46.313817]  devm_action_release+0x15/0x20
> [   46.317933]  release_nodes+0x11e/0x220
> [   46.321707]  devres_release_all+0x3c/0x50
> [   46.325733]  device_release_driver_internal+0x10e/0x1d0
> [   46.331016]  driver_detach+0x4d/0xa0
> [   46.334609]  bus_remove_driver+0x5f/0xe0
> [   46.338553]  driver_unregister+0x2f/0x50
> [   46.342495]  platform_driver_unregister+0x12/0x20
> [   46.347218]  gpio_mockup_exit+0x1f/0x5cc [gpio_mockup]
> [   46.352374]  __x64_sys_delete_module+0x15b/0x260
> [   46.357082]  do_syscall_64+0x37/0x50
> [   46.360676]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [   46.365748] RIP: 0033:0x7f7dc1e0f997
> [   46.369342] Code: 73 01 c3 48 8b 0d 01 a5 2b 00 f7 d8 64 89 01 48
> 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00
> 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d1 a4 2b 00 f7 d8 64 89
> 01 48
> [   46.388108] RSP: 002b:7ffe35506e78 EFLAGS: 0206 ORIG_RAX:
> 00b0
> [   46.395708] RAX: ffda RBX: 0134ebd0 RCX: 
> 7f7dc1e0f997
> [   46.402883] RDX:  RSI: 0800 RDI: 
> 0134ec38
> [   46.410031] RBP:  R08: 7ffe35505e41 R09: 
> 
> [   46.417183] R10: 08da R11: 0206 R12: 
> 0134ec38
> [   46.424331] R13: 0001 R14: 0134ec38 R15: 
> 7ffe35508238
> [   46.431507] irq event stamp: 6645
> [   46.434845] hardirqs last  enabled at (6655): []
> console_unlock+0x34e/0x550
> [   46.443380] hardirqs last disabled at (6664): []
> console_unlock+0x3b9/0x550
> [   46.451915] softirqs last  enabled at (6564): []
> __do_softirq+0x30e/0x421
> [   46.460333] softirqs last disabled at (6559): []
> irq_exit_rcu+0xb3/0xc0
> [   46.468548] ---[ end trace 89d0119dab1b1498 ]---
> [ 46.477938] remove_proc_entry: removing non-empty directory 'irq/4',
> leaking at least 'ttyS0'
> [ 46.486512] WARNING: CPU: 0 PID: 603 at
> /usr/src/kernel/fs/proc/generic.c:717 remove_proc_entry+0x1a8/0x1c0
> 
> Reported-by: Naresh Kamboju 
> 
> metadata:
>   git branch: master
>   git repo: 
> 

Re: [PATCH, -v3] mm: Fix typos in comments

2021-03-28 Thread Andrew Morton
On Mon, 22 Mar 2021 22:26:24 +0100 Ingo Molnar  wrote:

> * Randy Dunlap  wrote:
> 
> > > New version attached. Can I add your Reviewed-by?
> > 
> > Sure.
> > Reviewed-by: Randy Dunlap 
> 
> -v3 attached, the only change is the addition of your Reviewed-by.
> 
> This would be for -mm I suppose, if Andrew agrees too?

Yup, thanks.  This one-fix-per-email thing was killing me ;)

I just queued this behind everything else and will send to Linus
whichever fixes are still applicable, easy.



Re: [PATCH] userfaultfd/shmem: fix minor fault page leak

2021-03-24 Thread Andrew Morton
On Mon, 22 Mar 2021 13:48:35 -0700 Axel Rasmussen  
wrote:

> This fix is analogous to Peter Xu's fix for hugetlb [0]. If we don't
> put_page() after getting the page out of the page cache, we leak the
> reference.
> 
> The fix can be verified by checking /proc/meminfo and running the
> userfaultfd selftest in shmem mode. Without the fix, we see MemFree /
> MemAvailable steadily decreasing with each run of the test. With the
> fix, memory is correctly freed after the test program exits.
> 
> Fixes: 00da60b9d0a0 ("userfaultfd: support minor fault handling for shmem")

Confused.  The affected code:

> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1831,6 +1831,7 @@ static int shmem_getpage_gfp(struct inode *inode, 
> pgoff_t index,
>  
>   if (page && vma && userfaultfd_minor(vma)) {
>   unlock_page(page);
> + put_page(page);
>   *fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
>   return 0;
>   }

Is added by Peter's "page && vma && userfaultfd_minor".  I assume that
"Fixes:" is incorrect?


Re: [PATCH] tools: testing: Remove duplicate include of string.h

2021-03-24 Thread Andrew Morton
On Tue, 23 Mar 2021 11:29:56 +0800 Wan Jiabing  wrote:

> string.h has been included at line 15. So we remove the
> duplicate one at line 17.

Thanks.  But we already have
https://lkml.kernel.org/r/20210316073336.426255-1-zhang.yun...@zte.com.cn.


Re: [PATCH] kernel: kcov: fix a typo in comment

2021-03-24 Thread Andrew Morton
On Tue, 23 Mar 2021 23:32:57 +0100 Marco Elver  wrote:

> On Tue, 23 Mar 2021 at 07:45, 'Dmitry Vyukov' via kasan-dev
>  wrote:
> > On Tue, Mar 23, 2021 at 7:24 AM tl455047  wrote:
> > >
> > > Fixed a typo in comment.
> > >
> > > Signed-off-by: tl455047 
> >
> > Reviewed-by: Dmitry Vyukov 
> >
> > +Andrew, linux-mm as KCOV patches are generally merged into mm.
> >
> > Thanks for the fix
> 
> FYI, I believe this code may not be accepted due to this:
> 
> "[...] It is imperative that all code contributed to the kernel be 
> legitimately
> free software.  For that reason, code from anonymous (or pseudonymous)
> contributors will not be accepted."
> 
> See Documentation/process/1.Intro.rst

Correct.  I let this one pass because the patch is so minor.  But yes,
a real name would be preferred, please.



Re: [PATCH] kasan: fix hwasan build for gcc

2021-03-24 Thread Andrew Morton
On Tue, 23 Mar 2021 13:51:32 +0100 Marco Elver  wrote:

> On Tue, 23 Mar 2021 at 13:41, Arnd Bergmann  wrote:
> >
> > From: Arnd Bergmann 
> >
> > gcc-11 adds support for -fsanitize=kernel-hwaddress, so it becomes
> > possible to enable CONFIG_KASAN_SW_TAGS.
> >
> > Unfortunately this fails to build at the moment, because the
> > corresponding command line arguments use llvm specific syntax.
> >
> > Change it to use the cc-param macro instead, which works on both
> > clang and gcc.
> >
> > Signed-off-by: Arnd Bergmann 
> 
> Reviewed-by: Marco Elver 
> 
> Although I think you need to rebase against either -mm or -next,
> because there have been changes to the CONFIG_KASAN_STACK variable.

This fix is applicable to 5.12, so it's better that the 5.13 patches in
-mm be changed to accommodate this patch.

afaict the only needed change was to update
kasan-remove-redundant-config-option.patch as below.  The
scripts/Makefile.kasan part was changed:

@@ -42,7 +48,7 @@ else
 endif
 
 CFLAGS_KASAN := -fsanitize=kernel-hwaddress \
-   $(call cc-param,hwasan-instrument-stack=$(CONFIG_KASAN_STACK)) \
+   $(call cc-param,hwasan-instrument-stack=$(stack_enable)) \
$(call cc-param,hwasan-use-short-granules=0) \
$(instrumentation_flags)
 


Whole patch:

--- a/arch/arm64/kernel/sleep.S~kasan-remove-redundant-config-option
+++ a/arch/arm64/kernel/sleep.S
@@ -134,7 +134,7 @@ SYM_FUNC_START(_cpu_resume)
 */
bl  cpu_do_resume
 
-#if defined(CONFIG_KASAN) && CONFIG_KASAN_STACK
+#if defined(CONFIG_KASAN) && defined(CONFIG_KASAN_STACK)
mov x0, sp
bl  kasan_unpoison_task_stack_below
 #endif
--- a/arch/x86/kernel/acpi/wakeup_64.S~kasan-remove-redundant-config-option
+++ a/arch/x86/kernel/acpi/wakeup_64.S
@@ -115,7 +115,7 @@ SYM_FUNC_START(do_suspend_lowlevel)
movqpt_regs_r14(%rax), %r14
movqpt_regs_r15(%rax), %r15
 
-#if defined(CONFIG_KASAN) && CONFIG_KASAN_STACK
+#if defined(CONFIG_KASAN) && defined(CONFIG_KASAN_STACK)
/*
 * The suspend path may have poisoned some areas deeper in the stack,
 * which we now need to unpoison.
--- a/include/linux/kasan.h~kasan-remove-redundant-config-option
+++ a/include/linux/kasan.h
@@ -330,7 +330,7 @@ static inline bool kasan_check_byte(cons
 
 #endif /* CONFIG_KASAN */
 
-#if defined(CONFIG_KASAN) && CONFIG_KASAN_STACK
+#if defined(CONFIG_KASAN) && defined(CONFIG_KASAN_STACK)
 void kasan_unpoison_task_stack(struct task_struct *task);
 #else
 static inline void kasan_unpoison_task_stack(struct task_struct *task) {}
--- a/lib/Kconfig.kasan~kasan-remove-redundant-config-option
+++ a/lib/Kconfig.kasan
@@ -138,9 +138,10 @@ config KASAN_INLINE
 
 endchoice
 
-config KASAN_STACK_ENABLE
+config KASAN_STACK
bool "Enable stack instrumentation (unsafe)" if CC_IS_CLANG && 
!COMPILE_TEST
depends on KASAN_GENERIC || KASAN_SW_TAGS
+   default y if CC_IS_GCC
help
  The LLVM stack address sanitizer has a know problem that
  causes excessive stack usage in a lot of functions, see
@@ -155,7 +156,7 @@ config KASAN_STACK_ENABLE
  to use and enabled by default.
 
 config KASAN_STACK
-   int
+   bool
depends on KASAN_GENERIC || KASAN_SW_TAGS
default 1 if KASAN_STACK_ENABLE || CC_IS_GCC
default 0
--- a/mm/kasan/common.c~kasan-remove-redundant-config-option
+++ a/mm/kasan/common.c
@@ -63,7 +63,7 @@ void __kasan_unpoison_range(const void *
kasan_unpoison(address, size);
 }
 
-#if CONFIG_KASAN_STACK
+#ifdef CONFIG_KASAN_STACK
 /* Unpoison the entire stack for a task. */
 void kasan_unpoison_task_stack(struct task_struct *task)
 {
--- a/mm/kasan/kasan.h~kasan-remove-redundant-config-option
+++ a/mm/kasan/kasan.h
@@ -231,7 +231,7 @@ void *kasan_find_first_bad_addr(void *ad
 const char *kasan_get_bug_type(struct kasan_access_info *info);
 void kasan_metadata_fetch_row(char *buffer, void *row);
 
-#if defined(CONFIG_KASAN_GENERIC) && CONFIG_KASAN_STACK
+#if defined(CONFIG_KASAN_GENERIC) && defined(CONFIG_KASAN_STACK)
 void kasan_print_address_stack_frame(const void *addr);
 #else
 static inline void kasan_print_address_stack_frame(const void *addr) { }
--- a/mm/kasan/report_generic.c~kasan-remove-redundant-config-option
+++ a/mm/kasan/report_generic.c
@@ -128,7 +128,7 @@ void kasan_metadata_fetch_row(char *buff
memcpy(buffer, kasan_mem_to_shadow(row), META_BYTES_PER_ROW);
 }
 
-#if CONFIG_KASAN_STACK
+#ifdef CONFIG_KASAN_STACK
 static bool __must_check tokenize_frame_descr(const char **frame_descr,
  char *token, size_t max_tok_len,
  unsigned long *value)
--- a/scripts/Makefile.kasan~kasan-remove-redundant-config-option
+++ a/scripts/Makefile.kasan
@@ -2,6 +2,12 @@
 CFLAGS_KASAN_NOSANITIZE := -fno-builtin
 KASAN_SHADOW_OFFSET ?= $(CONFIG_KASAN_SHADOW_OFFSET)
 
+ifdef CONFIG_KASAN_STACK
+   

Re: [PATCH] ia64: mca: allocate early mca with GFP_ATOMIC

2021-03-24 Thread Andrew Morton
On Wed, 24 Mar 2021 11:20:45 +0100 John Paul Adrian Glaubitz 
 wrote:

> >> #NEXT_PATCHES_START mainline-later (next week, approximately)
> >> ia64-mca-allocate-early-mca-with-gfp_atomic.patch
> 
> Great, thanks. We're still missing Valentin's patch for the NUMA enumeration 
> issue
> though. Should Valentin send the patch again with Andrew CC'ed?

I subscribed to linux-ia64 today, so I can go in there to find things. 
But if there's anything presently outstanding, please do resend.

I presently have

module-remove-duplicate-include-in-arch-ia64-kernel-heads.patch
ia64-kernel-few-typos-fixed-in-the-file-fsyss.patch
ia64-include-asm-minor-typo-fixes-in-the-file-pgtableh.patch
ia64-ensure-proper-numa-distance-and-possible-map-initialization.patch
ia64-drop-unused-ia64_fw_emu-ifdef.patch



Re: [PATCH v1 0/3] drivers/char: remove /dev/kmem for good

2021-03-24 Thread Andrew Morton


> Let's remove /dev/kmem, which is unused and obsolete.

I grabbed these.  Silently - the cc list is amazing ;)

I was wondering if it would be better to permanently disable /dev/kmem
in Kconfig along with a comment "if you really want this thing then
email peeps@places with a very good reason why".  Let that ride for a
year or three then blam.

But this is so much more attractive, and it certainly sounds like it's
worth any damage it might cause.

We do tend to think about distros.  I bet there are a number of weird
embedded type systems using /dev/kmem - it's amazing what sorts of
hacks those people will put up with to get something out the door.
But those systems tend to carry a lot of specialized changes anyway, so
they can just add "revert David's patch" to their pile.




Re: [PATCH] ia64: Ensure proper NUMA distance and possible map initialization

2021-03-24 Thread Andrew Morton
On Thu, 18 Mar 2021 13:06:17 + Valentin Schneider 
 wrote:

> John Paul reported a warning about bogus NUMA distance values spurred by
> commit:
> 
>   620a6dc40754 ("sched/topology: Make sched_init_numa() use a set for the 
> deduplicating sort")
> 
> In this case, the afflicted machine comes up with a reported 256 possible
> nodes, all of which are 0 distance away from one another. This was
> previously silently ignored, but is now caught by the aforementioned
> commit.
> 
> The culprit is ia64's node_possible_map which remains unchanged from its
> initialization value of NODE_MASK_ALL. In John's case, the machine doesn't
> have any SRAT nor SLIT table, but AIUI the possible map remains untouched
> regardless of what ACPI tables end up being parsed. Thus, !online &&
> possible nodes remain with a bogus distance of 0 (distances \in [0, 9] are
> "reserved and have no meaning" as per the ACPI spec).
> 
> Follow x86 / drivers/base/arch_numa's example and set the possible map to
> the parsed map, which in this case seems to be the online map.
> 
> Link: 
> http://lore.kernel.org/r/255d6b5d-194e-eb0e-ecdd-97477a534...@physik.fu-berlin.de
> Fixes: 620a6dc40754 ("sched/topology: Make sched_init_numa() use a set for 
> the deduplicating sort")
> Reported-by: John Paul Adrian Glaubitz 
> Signed-off-by: Valentin Schneider 
> ---
> This might need an earlier Fixes: tag, but all of this is quite old and
> dusty (the git blame rabbit hole leads me to ~2008/2007)
> 

Thanks.  Is this worth a cc:stable tag?


Re: [PATCH] userfaultfd: Write protect when virtual memory range has no page table entry

2021-03-20 Thread Andrew Morton
On Fri, 19 Mar 2021 22:24:28 +0700 Bui Quang Minh  
wrote:

> userfaultfd_writeprotect() use change_protection() to clear write bit in
> page table entries (pte/pmd). So, later write to this virtual address
> range causes a page fault, which is then handled by userspace program.
> However, change_protection() has no effect when there is no page table
> entries associated with that virtual memory range (a newly mapped memory
> range). As a result, later access to that memory range causes allocating a
> page table entry with write bit still set (due to VM_WRITE flag in
> vma->vm_flags).
> 
> Add checks for VM_UFFD_WP in vma->vm_flags when allocating new page table
> entry in missing page table entry page fault path.

This sounds like a pretty significant bug?

Would it be possible to add a test to
tools/testing/selftests/vm/userfaultfd.c to check for this?  It should
fail without your patch and succeed with it.

Thanks.


Re: [PATCH v4 3/3] mm: fs: Invalidate BH LRU during page migration

2021-03-20 Thread Andrew Morton
On Fri, 19 Mar 2021 10:51:27 -0700 Minchan Kim  wrote:

> Pages containing buffer_heads that are in one of the per-CPU
> buffer_head LRU caches will be pinned and thus cannot be migrated.
> This can prevent CMA allocations from succeeding, which are often used
> on platforms with co-processors (such as a DSP) that can only use
> physically contiguous memory. It can also prevent memory
> hot-unplugging from succeeding, which involves migrating at least
> MIN_MEMORY_BLOCK_SIZE bytes of memory, which ranges from 8 MiB to 1
> GiB based on the architecture in use.
> 
> Correspondingly, invalidate the BH LRU caches before a migration
> starts and stop any buffer_head from being cached in the LRU caches,
> until migration has finished.
> 
> Tested-by: Oliver Sang 
> Reported-by: kernel test robot 
> Signed-off-by: Chris Goldsworthy 
> Signed-off-by: Minchan Kim 

The signoff chain ordering might mean that Chris was the primary author, but
there is no From:him.  Please clarify?


Re: [PATCH v10 0/3] Charge loop device i/o to issuing cgroup

2021-03-18 Thread Andrew Morton
On Thu, 18 Mar 2021 10:00:17 -0600 Jens Axboe  wrote:

> On 3/18/21 9:53 AM, Shakeel Butt wrote:
> > On Wed, Mar 17, 2021 at 3:30 PM Jens Axboe  wrote:
> >>
> >> On 3/16/21 9:36 AM, Dan Schatzberg wrote:
> >>> No major changes, just rebasing and resubmitting
> >>
> >> Applied for 5.13, thanks.
> >>
> > 
> > I have requested a couple of changes in the patch series. Can this
> > applied series still be changed or new patches are required?
> 
> I have nothing sitting on top of it for now, so as far as I'm concerned
> we can apply a new series instead. Then we can also fold in that fix
> from Colin that he posted this morning...

The collision in memcontrol.c is a pain, but I guess as this is mainly
a loop patch, the block tree is an appropriate route.

Here's the collision between "mm: Charge active memcg when no mm is
set" and Shakeels's
https://lkml.kernel.org/r/20210305212639.775498-1-shake...@google.com


--- mm/memcontrol.c
+++ mm/memcontrol.c
@@ -6728,8 +6730,15 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
rcu_read_unlock();
}
 
-   if (!memcg)
-   memcg = get_mem_cgroup_from_mm(mm);
+   if (!memcg) {
+   if (!mm) {
+   memcg = get_mem_cgroup_from_current();
+   if (!memcg)
+   memcg = get_mem_cgroup_from_mm(current->mm);
+   } else {
+   memcg = get_mem_cgroup_from_mm(mm);
+   }
+   }
 
ret = try_charge(memcg, gfp_mask, nr_pages);
if (ret)


Which I resolved thusly:

int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
{
struct mem_cgroup *memcg;
int ret;

if (mem_cgroup_disabled())
return 0;

if (!mm) {
memcg = get_mem_cgroup_from_current();
		if (!memcg)
memcg = get_mem_cgroup_from_mm(current->mm);
} else {
memcg = get_mem_cgroup_from_mm(mm);
}

ret = __mem_cgroup_charge(page, memcg, gfp_mask);
	css_put(&memcg->css);

return ret;
}
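Stripped of the charging itself, the resolved lookup order reduces to a three-way fallback. A minimal userspace sketch of that precedence, with stand-in names ("active" for get_mem_cgroup_from_current(), "task_memcg" for the current task's mm) rather than kernel symbols:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the merged lookup order. None of these names are
 * kernel symbols; the struct is a placeholder. */
struct memcg { int id; };

static struct memcg *pick_memcg(struct memcg *active,
				struct memcg *task_memcg,
				struct memcg *mm_memcg /* NULL when !mm */)
{
	if (!mm_memcg)
		/* no mm passed in: prefer an explicitly set memcg,
		 * then fall back to the current task's */
		return active ? active : task_memcg;
	return mm_memcg;
}
```

The design point the resolution preserves: an mm argument, when given, always wins; the "active memcg" override only applies on the no-mm path.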




Re: [mm, net-next v2] mm: net: memcg accounting for TCP rx zerocopy

2021-03-17 Thread Andrew Morton
On Mon, 15 Mar 2021 18:30:03 -0700 Arjun Roy  wrote:

> From: Arjun Roy 
> 
> TCP zerocopy receive is used by high performance network applications
> to further scale. For RX zerocopy, the memory containing the network
> data filled by the network driver is directly mapped into the address
> space of high performance applications. To keep the TLB cost low,
> these applications unmap the network memory in big batches. So, this
> memory can remain mapped for long time. This can cause a memory
> isolation issue as this memory becomes unaccounted after getting
> mapped into the application address space. This patch adds the memcg
> accounting for such memory.
> 
> Accounting the network memory comes with its own unique challenges.
> The high performance NIC drivers use page pooling to reuse the pages
> to eliminate/reduce expensive setup steps like IOMMU. These drivers
> keep an extra reference on the pages and thus we can not depend on the
> page reference for the uncharging. The page in the pool may keep a
> memcg pinned for arbitrary long time or may get used by other memcg.
> 
> This patch decouples the uncharging of the page from the refcnt and
> associates it with the map count i.e. the page gets uncharged when the
> last address space unmaps it. Now the question is, what if the driver
> drops its reference while the page is still mapped? That is fine as
> the address space also holds a reference to the page i.e. the
> reference count can not drop to zero before the map count.

What tree were you hoping to get this merged through?  I'd suggest net
- it's more likely to get tested over there.

>
> ...
>
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c

These changes could be inside #ifdef CONFIG_NET.  Although I expect
MEMCG=y && NET=n is pretty damn rare.



Re: [PATCH v3 2/3] mm: disable LRU pagevec during the migration temporarily

2021-03-17 Thread Andrew Morton
On Wed, 10 Mar 2021 08:14:28 -0800 Minchan Kim  wrote:

> LRU pagevec holds refcount of pages until the pagevec are drained.
> It could prevent migration since the refcount of the page is greater
> than the expectation in migration logic. To mitigate the issue,
> callers of migrate_pages drains LRU pagevec via migrate_prep or
> lru_add_drain_all before migrate_pages call.
> 
> However, it's not enough because pages coming into pagevec after the
> draining call still could stay at the pagevec so it could keep
> preventing page migration. Since some callers of migrate_pages have
> retrial logic with LRU draining, the page would migrate at next trial
> but it is still fragile in that it doesn't close the fundamental race
> between upcoming LRU pages into pagevec and migration so the migration
> failure could cause contiguous memory allocation failure in the end.
> 
> To close the race, this patch disables lru caches(i.e, pagevec)
> during ongoing migration until migrate is done.
> 
> Since it's really hard to reproduce, I measured how many times
> migrate_pages retried with force mode(it is about a fallback to a
> sync migration) with below debug code.
> 
> int migrate_pages(struct list_head *from, new_page_t get_new_page,
>   ..
>   ..
> 
> if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
>printk(KERN_ERR, "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
>dump_page(page, "fail to migrate");
> }
> 
> The test was repeating android apps launching with cma allocation
> in background every five seconds. Total cma allocation count was
> about 500 during the testing. With this patch, the dump_page count
> was reduced from 400 to 30.
> 
> The new interface is also useful for memory hotplug which currently
> drains lru pcp caches after each migration failure. This is rather
> suboptimal as it has to disrupt others running during the operation.
> With the new interface the operation happens only once. This is also in
> line with pcp allocator cache which are disabled for the offlining as
> well.
> 

This is really a rather ugly thing, particularly from a maintainability
point of view.  Are you sure you found all the sites which need the
enable/disable?  How do we prevent new ones from creeping in which need
the same treatment?  Is there some way of adding a runtime check which
will trip if a conversion was missed?

> ...
>
> +bool lru_cache_disabled(void)
> +{
> + return atomic_read(&lru_disable_count);
> +}
> +
> +void lru_cache_enable(void)
> +{
> + atomic_dec(&lru_disable_count);
> +}
> +
> +/*
> + * lru_cache_disable() needs to be called before we start compiling
> + * a list of pages to be migrated using isolate_lru_page().
> + * It drains pages on LRU cache and then disable on all cpus until
> + * lru_cache_enable is called.
> + *
> + * Must be paired with a call to lru_cache_enable().
> + */
> +void lru_cache_disable(void)
> +{
> + atomic_inc(&lru_disable_count);
> +#ifdef CONFIG_SMP
> + /*
> +  * lru_add_drain_all in the force mode will schedule draining on
> +  * all online CPUs so any calls of lru_cache_disabled wrapped by
> +  * local_lock or preemption disabled would be ordered by that.
> +  * The atomic operation doesn't need to have stronger ordering
> +  * requirements because that is enforced by the scheduling
> +  * guarantees.
> +  */
> + __lru_add_drain_all(true);
> +#else
> + lru_add_drain();
> +#endif
> +}

I guess at least the first two of these functions should be inlined.
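The mechanism being reviewed is just a counted gate in front of a batching fast path. A userspace sketch of that pattern, illustrative only (the kernel additionally forces a drain on all CPUs when the count is raised):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Counted gate: while the count is non-zero, callers that would
 * normally batch into a per-CPU cache take the slow path instead.
 * Nesting works because disable/enable are a counted pair. */
static atomic_int cache_disable_count;

static bool cache_disabled(void)
{
	return atomic_load(&cache_disable_count) != 0;
}

static void cache_disable(void)
{
	atomic_fetch_add(&cache_disable_count, 1);
	/* ...drain already-batched entries here... */
}

static void cache_enable(void)
{
	atomic_fetch_sub(&cache_disable_count, 1);
}
```

This also shows why missed call sites are dangerous: any fast path that forgets to consult cache_disabled() silently reopens the race the gate exists to close.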


Re: [PATCH v13 00/14] huge vmalloc mappings

2021-03-17 Thread Andrew Morton
On Wed, 17 Mar 2021 16:23:48 +1000 Nicholas Piggin  wrote:

> 
> *** BLURB HERE ***
> 

That's really not what it means ;)

Could we please get a nice description for the [0/n]?  What's it all
about, what's the benefit, what are potential downsides.

And performance testing results!  Because if it ain't faster, there's
no point in merging it?


Re: [PATCH 0/2] Fix page_owner broken on arm64

2021-03-17 Thread Andrew Morton
On Wed, 17 Mar 2021 14:20:48 + Chen Jun  wrote:

> On arm64, cat /sys/kernel/debug/page_owner
> All pages return the same stack
>  stack_trace_save+0x4c/0x78
>  register_early_stack+0x34/0x70
>  init_page_owner+0x34/0x230
>  page_ext_init+0x1bc/0x1dc
> 
> The reason is arch_stack_walk save 2 more entries than before.
> 
> To fix it, add skip in arch_stack_walk
> 
> *** BLURB HERE ***
> 
> 1. Prepare for 2, move stacktrace_cookie to .h
> 2. Fix the problem
> 

5fc57df2f6fd was September 2020, so I assume we'll be needing cc:stable
on these?

(I'll also assume that the arm folks will be handling these)


Re: [PATCH v3 1/2] mm: Allow non-VM_DONTEXPAND and VM_PFNMAP mappings with MREMAP_DONTUNMAP

2021-03-17 Thread Andrew Morton
On Wed, 17 Mar 2021 14:41:46 -0700 Brian Geffon  wrote:

> Currently MREMAP_DONTUNMAP only accepts private anonymous mappings. This
> change will widen the support to include any mappings which are not
> VM_DONTEXPAND or VM_PFNMAP.

Please update changelog to explain why these two were omitted?

> The primary use case is to support
> MREMAP_DONTUNMAP on mappings which may have been created from a memfd.
> 
> This change will result in mremap(MREMAP_DONTUNMAP) returning -EINVAL
> if VM_DONTEXPAND or VM_PFNMAP mappings are specified.
> 
> Lokesh Gidra who works on the Android JVM, provided an explanation of how
> such a feature will improve Android JVM garbage collection:
> "Android is developing a new garbage collector (GC), based on userfaultfd.
> The garbage collector will use userfaultfd (uffd) on the java heap during
> compaction. On accessing any uncompacted page, the application threads will
> find it missing, at which point the thread will create the compacted page
> and then use UFFDIO_COPY ioctl to get it mapped and then resume execution.
> Before starting this compaction, in a stop-the-world pause the heap will be
> mremap(MREMAP_DONTUNMAP) so that the java heap is ready to receive
> UFFD_EVENT_PAGEFAULT events after resuming execution.
> 
> To speedup mremap operations, pagetable movement was optimized by moving
> PUD entries instead of PTE entries [1]. It was necessary as mremap of even
> modest sized memory ranges also took several milliseconds, and stopping the
> application for that long isn't acceptable in response-time sensitive
> cases.
> 
> With UFFDIO_CONTINUE feature [2], it will be even more efficient to
> implement this GC, particularly the 'non-moveable' portions of the heap.
> It will also help in reducing the need to copy (UFFDIO_COPY) the pages.
> However, for this to work, the java heap has to be on a 'shared' vma.
> Currently MREMAP_DONTUNMAP only supports private anonymous mappings, this
> patch will enable using UFFDIO_CONTINUE for the new userfaultfd-based heap
> compaction."

Is a manpage update planned?  It's appropriate to add this to the
series so folks can check it over.

Can we please get the appropriate updates into
tools/testing/selftests/vm/mremap_test.c for this?

> Signed-off-by: Brian Geffon 

v3 is getting up there.  Has there been much review activity?



Re: [PATCH 2/2] Bug Fix for last patch

2021-03-15 Thread Andrew Morton
On Tue, 16 Mar 2021 03:15:05 + Yixun Lan  wrote:

> This patch title is too obscure to parse; it should clearly reflect
> what the changes here are doing

Yes please ;)  Otherwise Andrew has to madly grep around to try to figure out
what was Jiuyang Liu's "last patch"!



Re: [PATCH v3] fs/buffer.c: Revoke LRU when trying to drop buffers

2021-03-15 Thread Andrew Morton
On Wed, 13 Jan 2021 13:17:30 -0800 Chris Goldsworthy  
wrote:

> From: Laura Abbott 
> 
> When a buffer is added to the LRU list, a reference is taken which is
> not dropped until the buffer is evicted from the LRU list. This is the
> correct behavior, however this LRU reference will prevent the buffer
> from being dropped. This means that the buffer can't actually be dropped
> until it is selected for eviction. There's no bound on the time spent
> on the LRU list, which means that the buffer may be undroppable for
> very long periods of time. Given that migration involves dropping
> buffers, the associated page is now unmigratible for long periods of
> time as well. CMA relies on being able to migrate a specific range
> of pages, so these types of failures make CMA significantly
> less reliable, especially under high filesystem usage.
> 
> Rather than waiting for the LRU algorithm to eventually kick out
> the buffer, explicitly remove the buffer from the LRU list when trying
> to drop it. There is still the possibility that the buffer
> could be added back on the list, but that indicates the buffer is
> still in use and would probably have other 'in use' indicators to
> prevent dropping.
> 
> Note: a bug reported by "kernel test robot" lead to a switch from
> using xas_for_each() to xa_for_each().

(hm, why isn't drop_buffers() static to fs/buffer.c??)

It looks like this patch turns drop_buffers() into a very expensive
operation.  And that expensive operation occurs under the
address_space-wide private_lock, which is more ouch.

How carefully has this been tested for performance?  In pathological
circumstances (which are always someone's common case :()


Just thinking out loud...

If a buffer_head* is sitting in one or more of the LRUs, what is
stopping us from stripping it from the page anyway?  Then
try_to_free_buffers() can mark the bh as buffer_dead(), declare success
and leave the bh sitting in the LRU, with the LRU as the only reference
to that buffer.  Teach lookup_bh_lru() to skip over buffer_dead()
buffers and our now-dead buffer will eventually reach the tail of the
lru and get freed for real.
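The "mark dead, skip on lookup" idea above can be sketched in a few lines. This is a toy model with a fixed-size array, not fs/buffer.c code: eviction is deferred, lookups simply ignore dead entries until they age out.

```c
#include <assert.h>
#include <stddef.h>

/* Toy per-CPU-style LRU: instead of eagerly stripping a cached entry,
 * mark it dead; lookups skip dead entries, so the only remaining
 * reference drains away as the entry reaches the tail. */
#define LRU_SLOTS 4

struct bh { int key; int dead; int valid; };
static struct bh lru[LRU_SLOTS];

static struct bh *lookup_bh_lru(int key)
{
	for (int i = 0; i < LRU_SLOTS; i++) {
		struct bh *b = &lru[i];
		if (b->valid && !b->dead && b->key == key)
			return b;	/* live hit */
	}
	return NULL;			/* misses and dead entries fall through */
}

static void kill_bh(int key)		/* the try_to_free_buffers() side */
{
	for (int i = 0; i < LRU_SLOTS; i++)
		if (lru[i].valid && lru[i].key == key)
			lru[i].dead = 1;
}
```

The attraction of the scheme is exactly what the email suggests: kill_bh() succeeds immediately without waiting for LRU eviction, at the cost of dead entries occupying slots until they rotate out.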



Re: [PATCH v3 1/2] init/initramfs.c: do unpacking asynchronously

2021-03-15 Thread Andrew Morton
On Sat, 13 Mar 2021 22:25:27 +0100 Rasmus Villemoes  
wrote:

> Most of the boot process doesn't actually need anything from the
> initramfs, until of course PID1 is to be executed. So instead of doing
> the decompressing and populating of the initramfs synchronously in
> populate_rootfs() itself, push that off to a worker thread.
> 
> This is primarily motivated by an embedded ppc target, where unpacking
> even the rather modest sized initramfs takes 0.6 seconds, which is
> long enough that the external watchdog becomes unhappy that it doesn't
> get attention soon enough. By doing the initramfs decompression in a
> worker thread, we get to do the device_initcalls and hence start
> petting the watchdog much sooner.
> 
> Normal desktops might benefit as well. On my mostly stock Ubuntu
> kernel, my initramfs is a 26M xz-compressed blob, decompressing to
> around 126M. That takes almost two seconds:
> 
> [0.201454] Trying to unpack rootfs image as initramfs...
> [1.976633] Freeing initrd memory: 29416K
> 
> Before this patch, these lines occur consecutively in dmesg. With this
> patch, the timestamps on these two lines is roughly the same as above,
> but with 172 lines inbetween - so more than one cpu has been kept busy
> doing work that would otherwise only happen after the
> populate_rootfs() finished.
> 
> Should one of the initcalls done after rootfs_initcall time (i.e.,
> device_ and late_ initcalls) need something from the initramfs (say, a
> kernel module or a firmware blob), it will simply wait for the
> initramfs unpacking to be done before proceeding, which should in
> theory make this completely safe.
> 
> But if some driver pokes around in the filesystem directly and not via
> one of the official kernel interfaces (i.e. request_firmware*(),
> call_usermodehelper*) that theory may not hold - also, I certainly
> might have missed a spot when sprinkling wait_for_initramfs(). So
> there is an escape hatch in the form of an initramfs_async= command
> line parameter.

This seems sensible.  And nice.

Are you sure that you've found all the code paths that require that
initramfs be ready?  You have one in init/main, one in the bowels of
the firmware loader and one in UML.  How do we know that there are no
other such places?

Also, all this doesn't buy anything for uniprocessor machines.  Is
there a simple way of making it all go away if !CONFIG_SMP?
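The structure being discussed is a worker plus a completion gate. A minimal pthread model, purely illustrative (the kernel gates on an async cookie/completion, not on pthreads):

```c
#include <assert.h>
#include <pthread.h>

/* "Unpack in a worker, gate the consumers": the worker flags
 * completion; wait_for_unpack() blocks until it does. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t done_cv = PTHREAD_COND_INITIALIZER;
static int unpack_done;

static void *unpack_worker(void *arg)
{
	/* ...decompress and populate the rootfs here... */
	pthread_mutex_lock(&lock);
	unpack_done = 1;
	pthread_cond_broadcast(&done_cv);
	pthread_mutex_unlock(&lock);
	return arg;
}

static void wait_for_unpack(void)
{
	pthread_mutex_lock(&lock);
	while (!unpack_done)
		pthread_cond_wait(&done_cv, &lock);
	pthread_mutex_unlock(&lock);
}
```

The correctness question in the email maps directly onto this sketch: safety depends on every consumer going through wait_for_unpack() rather than poking at the filesystem directly.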



Re: [PATCH v4 03/25] mm/vmstat: Add functions to account folio statistics

2021-03-13 Thread Andrew Morton
On Sun, 14 Mar 2021 04:11:55 + Matthew Wilcox  wrote:

> On Sat, Mar 13, 2021 at 12:37:07PM -0800, Andrew Morton wrote:
> > On Fri,  5 Mar 2021 04:18:39 + "Matthew Wilcox (Oracle)" 
> >  wrote:
> > 
> > > Allow page counters to be more readily modified by callers which have
> > > a folio.  Name these wrappers with 'stat' instead of 'state' as requested
> > > by Linus here:
> > > https://lore.kernel.org/linux-mm/CAHk-=wj847sudr-kt+46ft3+xffgiwpgthvm7djwgdi4cvr...@mail.gmail.com/
> > > 
> > > --- a/include/linux/vmstat.h
> > > +++ b/include/linux/vmstat.h
> > > @@ -402,6 +402,54 @@ static inline void drain_zonestat(struct zone *zone,
> > >   struct per_cpu_pageset *pset) { }
> > >  #endif   /* CONFIG_SMP */
> > >  
> > > +static inline
> > > +void __inc_zone_folio_stat(struct folio *folio, enum zone_stat_item item)
> > > +{
> > > + __mod_zone_page_state(folio_zone(folio), item, folio_nr_pages(folio));
> > > +}
> > 
> > The naming is unfortunate.  We expect
> > 
> > inc: add one to
> > dec: subtract one from
> > mod: modify by signed quantity
> > 
> > So these are inconsistent.  Perhaps use "add" and "sub" instead.  At
> > least to alert people to the fact that these are different.
> > 
> > And, again, it's nice to see the subsystem's name leading the
> > identifiers, so "zone_folio_stat_add()".
> 
> I thought this was a 'zone stat', so __zone_stat_add_folio()?
> I'm not necessarily signing up to change the existing
> __inc_node_page_state(), but I might.  If so, __node_stat_add_page()?

That works.  It's the "inc means +1" and "dec means -1" whiplash that
struck me the most.



Re: [PATCH v4 01/25] mm: Introduce struct folio

2021-03-13 Thread Andrew Morton
On Fri,  5 Mar 2021 04:18:37 + "Matthew Wilcox (Oracle)" 
 wrote:

> A struct folio refers to an entire (possibly compound) page.  A function
> which takes a struct folio argument declares that it will operate on the
> entire compound page, not just PAGE_SIZE bytes.  In return, the caller
> guarantees that the pointer it is passing does not point to a tail page.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  include/linux/mm.h   | 30 ++
>  include/linux/mm_types.h | 17 +

Perhaps a new folio.h would be neater.

> @@ -1518,6 +1523,30 @@ static inline void set_page_links(struct page *page, 
> enum zone_type zone,
>  #endif
>  }
>  
> +static inline unsigned long folio_nr_pages(struct folio *folio)
> +{
> + return compound_nr(&folio->page);
> +}
> +
> +static inline struct folio *next_folio(struct folio *folio)
> +{
> +#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
> + return (struct folio *)nth_page(&folio->page, folio_nr_pages(folio));
> +#else
> + return folio + folio_nr_pages(folio);
> +#endif
> +}

It's a shame this isn't called folio_something(), like the rest of the API.

Unclear what this does.  Some comments would help.

> +static inline unsigned int folio_shift(struct folio *folio)
> +{
> + return PAGE_SHIFT + folio_order(folio);
> +}
> +
> +static inline size_t folio_size(struct folio *folio)
> +{
> + return PAGE_SIZE << folio_order(folio);
> +}

Why size_t?  That's pretty rare in this space and I'd have expected
unsigned long.
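The shift/size relation in the quoted hunk is simple enough to pin down numerically. A standalone sketch with PAGE_SHIFT fixed at 12 (4 KiB pages) for illustration:

```c
#include <assert.h>

/* An order-N folio is 2^N pages, so its byte size is one left-shift
 * away from its order. DEMO_PAGE_SHIFT stands in for PAGE_SHIFT. */
#define DEMO_PAGE_SHIFT 12U

static unsigned int demo_folio_shift(unsigned int order)
{
	return DEMO_PAGE_SHIFT + order;
}

static unsigned long demo_folio_size(unsigned int order)
{
	return 1UL << demo_folio_shift(order);
}
```

On a 64-bit kernel the shifted result always fits an unsigned long, which is the basis of the `size_t` vs `unsigned long` question above.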


> @@ -1623,6 +1652,7 @@ extern void pagefault_out_of_memory(void);
>  
>  #define offset_in_page(p)	((unsigned long)(p) & ~PAGE_MASK)
>  #define offset_in_thp(page, p)	((unsigned long)(p) & (thp_size(page) - 1))
> +#define offset_in_folio(folio, p) ((unsigned long)(p) & (folio_size(folio) - 1))
>  
>  /*
>   * Flags passed to show_mem() and show_free_areas() to suppress output in
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 0974ad501a47..a311cb48526f 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -223,6 +223,23 @@ struct page {
>  #endif
>  } _struct_page_alignment;
>  
> +/*
> + * A struct folio is either a base (order-0) page or the head page of
> + * a compound page.
> + */
> +struct folio {
> + struct page page;
> +};
> +
> +static inline struct folio *page_folio(struct page *page)
> +{
> + unsigned long head = READ_ONCE(page->compound_head);
> +
> + if (unlikely(head & 1))
> + return (struct folio *)(head - 1);
> + return (struct folio *)page;
> +}

What purpose does the READ_ONCE() serve?
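The tag-bit decode in the quoted page_folio() can be modelled in userspace: a tail page stores the head's address with bit 0 set, a head or order-0 page stores 0, so "odd" means "follow the pointer". The struct below is a stand-in, not the kernel's struct page:

```c
#include <assert.h>
#include <stdint.h>

/* Model of the compound_head encoding page_folio() decodes. */
struct fake_page { uintptr_t compound_head; };

static struct fake_page *decode_folio(struct fake_page *page)
{
	uintptr_t head = page->compound_head;

	if (head & 1)
		return (struct fake_page *)(head - 1);	/* tail -> head */
	return page;					/* already a head */
}
```

Pointer alignment is what makes the trick work: struct addresses are at least word-aligned, so bit 0 is free to carry the "this is a tail" flag.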




Re: [PATCH v4 03/25] mm/vmstat: Add functions to account folio statistics

2021-03-13 Thread Andrew Morton
On Fri,  5 Mar 2021 04:18:39 + "Matthew Wilcox (Oracle)" 
 wrote:

> Allow page counters to be more readily modified by callers which have
> a folio.  Name these wrappers with 'stat' instead of 'state' as requested
> by Linus here:
> https://lore.kernel.org/linux-mm/CAHk-=wj847sudr-kt+46ft3+xffgiwpgthvm7djwgdi4cvr...@mail.gmail.com/
> 
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -402,6 +402,54 @@ static inline void drain_zonestat(struct zone *zone,
>   struct per_cpu_pageset *pset) { }
>  #endif   /* CONFIG_SMP */
>  
> +static inline
> +void __inc_zone_folio_stat(struct folio *folio, enum zone_stat_item item)
> +{
> + __mod_zone_page_state(folio_zone(folio), item, folio_nr_pages(folio));
> +}

The naming is unfortunate.  We expect

inc: add one to
dec: subtract one from
mod: modify by signed quantity

So these are inconsistent.  Perhaps use "add" and "sub" instead.  At
least to alert people to the fact that these are different.

And, again, it's nice to see the subsystem's name leading the
identifiers, so "zone_folio_stat_add()".




Re: [PATCH v4 09/25] mm: Add folio_index, folio_page and folio_contains

2021-03-13 Thread Andrew Morton
On Fri,  5 Mar 2021 04:18:45 + "Matthew Wilcox (Oracle)" 
 wrote:

> folio_index() is the equivalent of page_index() for folios.  folio_page()
> finds the page in a folio for a page cache index.  folio_contains()
> tells you whether a folio contains a particular page cache index.
> 

copy-paste changelog into each function's covering comment?

> ---
>  include/linux/pagemap.h | 23 +++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index f07c03da83f6..5094b50f7680 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -447,6 +447,29 @@ static inline bool thp_contains(struct page *head, 
> pgoff_t index)
>   return page_index(head) == (index & ~(thp_nr_pages(head) - 1UL));
>  }
>  
> +static inline pgoff_t folio_index(struct folio *folio)
> +{
> +	if (unlikely(FolioSwapCache(folio)))
> +		return __page_file_index(&folio->page);
> +	return folio->page.index;
> +}
> +
> +static inline struct page *folio_page(struct folio *folio, pgoff_t index)
> +{
> + index -= folio_index(folio);
> + VM_BUG_ON_FOLIO(index >= folio_nr_pages(folio), folio);
> + return &folio->page + index;
> +}

One would expect folio_page() to be the reverse of page_folio(), only
it isn't anything like that.

> +/* Does this folio contain this index? */
> +static inline bool folio_contains(struct folio *folio, pgoff_t index)
> +{
> + /* HugeTLBfs indexes the page cache in units of hpage_size */
> + if (PageHuge(&folio->page))
> + return folio->page.index == index;
> + return index - folio_index(folio) < folio_nr_pages(folio);
> +}
> +
>  /*
>   * Given the page we found in the page cache, return the page corresponding
>   * to this index in the file
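The non-huge branch of the quoted folio_contains() reduces to one unsigned compare. A standalone sketch of that arithmetic:

```c
#include <assert.h>
#include <stdbool.h>

/* An order-N folio starting at page-cache index "base" covers
 * [base, base + nr). Because the subtraction is unsigned, an index
 * below base wraps to a huge value and fails the same compare, so no
 * separate "index < base" test is needed. */
static bool range_contains(unsigned long base, unsigned long nr,
			   unsigned long index)
{
	return index - base < nr;
}
```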



Re: [PATCH v4 00/25] Page folios

2021-03-13 Thread Andrew Morton
On Fri,  5 Mar 2021 04:18:36 + "Matthew Wilcox (Oracle)" 
 wrote:

> Our type system does not currently distinguish between tail pages and
> head or single pages.  This is a problem because we call compound_head()
> multiple times (and the compiler cannot optimise it out), bloating the
> kernel.  It also makes programming hard as it is often unclear whether
> a function operates on an individual page, or an entire compound page.
> 
> This patch series introduces the struct folio, which is a type that
> represents an entire compound page.  This initial set reduces the kernel
> size by approximately 6kB, although its real purpose is adding
> infrastructure to enable further use of the folio.

Geeze it's a lot of noise.  More things to remember and we'll forever
have a mishmash of `page' and `folio' and code everywhere converting
from one to the other.  Ongoing addition of folio
accessors/manipulators to overlay the existing page
accessors/manipulators, etc.

It's unclear to me that it's all really worth it.  What feedback have
you seen from others?



Re: [PATCH v5] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

2021-03-12 Thread Andrew Morton
On Fri, 12 Mar 2021 12:39:12 -0600 Jim Newsome  wrote:

> On 3/12/21 12:22, Andrew Morton wrote:
> > 
> > Could we please see some performance testing results to permit us to
> > evaluate the value of this change?
> 
> Sure. I've been doing some ad-hoc measurements with the code below. It
> forks 8k children and then waits for them in reverse order (forcing a
> full list traversal each time). I'll need to reboot a couple times to
> get apples-to-apples measurements on bare metal, though. I'll plan to
> run with NUMCHILDREN = 0 -> 8000, by 100.
> 
> Does this look like it'd be sufficient, or is there more you'd like to
> see? The current form doesn't use ptrace, but I expect the results to be
> similar; (maybe more pronounced when tracing threaded children, since
> every thread is in the tracee list instead of just the group leaders).

A very specific microbenchmark which tickles a particular corner case
is useful, as long as there's some realistic chance that someone's
workload is adequately modeled by that test.

Also very useful would be some words which describe what led you to do
this work (presumably some real-world was being impacted) and a description
of how the patch improves that workload (or is expected to improve it).

IOW, please spend a bit of time selling the patch!  What is the case
for including it in Linux?  What benefit does it provide our users?


Re: [PATCH v5] do_wait: make PIDTYPE_PID case O(1) instead of O(n)

2021-03-12 Thread Andrew Morton
On Fri, 12 Mar 2021 11:38:55 -0600 Jim Newsome  wrote:

> do_wait is an internal function used to implement waitpid, waitid,
> wait4, etc. To handle the general case, it does an O(n) linear scan of
> the thread group's children and tracees.
> 
> This patch adds a special-case when waiting on a pid to skip these scans
> and instead do an O(1) lookup. This improves performance when waiting on
> a pid from a thread group with many children and/or tracees.

Could we please see some performance testing results to permit us to
evaluate the value of this change?

Thanks.
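The change being reviewed swaps a linear scan for a direct lookup. A toy contrast of the two shapes, with an array-indexed "table" standing in for the kernel's pid hash (not its real data structure):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PID 64

struct task { int pid; };

/* O(n): the general do_wait() path walks every child/tracee. */
static struct task *scan_children(struct task *children, int n, int pid)
{
	for (int i = 0; i < n; i++)
		if (children[i].pid == pid)
			return &children[i];
	return NULL;
}

static struct task *pid_table[MAX_PID];

/* O(1): waiting on one specific pid can hit a table directly. */
static struct task *lookup_pid(int pid)
{
	return (pid > 0 && pid < MAX_PID) ? pid_table[pid] : NULL;
}
```

The benchmark question in the email is exactly about when the constant factors of the second shape beat the first: the win grows with the number of children/tracees per waiter.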



Re: [PATCH] mm/rmap: convert anon_vma.refcount from atomic_t to refcount_t

2021-03-11 Thread Andrew Morton
On Thu, 11 Mar 2021 13:56:15 +0800 Yejune Deng  wrote:

> refcount_t type should be used instead of atomic_t when the variable
> is used as a reference counter. This is because the implementation of
> refcount_t can prevent overflows and detect possible use-after-free.

The use of refcount_t comes at a cost:

q:/usr/src/25> size mm/rmap.o
   text    data     bss     dec     hex filename
  31114    2147      32   33293    820d mm/rmap.o
  31558    2147      32   33737    83c9 mm/rmap.o

That's a bunch more instructions to execute on some fairly hot
code paths.

I guess the debugging/checking features are nice, but this is pretty
mature code.  I'm quite unsure that this tradeoff is a favorable one.



Re: [EXTERNAL] Re: [PATCH] kexec: Add kexec reboot string

2021-03-11 Thread Andrew Morton
On Thu, 11 Mar 2021 18:14:19 + Joe LeVeque  wrote:

> Is this all you're looking for? If not, please let me know.
> 
> > Signed-off-by: Joe LeVeque 

Yes, thanks.


Re: [PATCH v4] kasan: remove redundant config option

2021-03-10 Thread Andrew Morton
On Thu, 11 Mar 2021 09:32:45 +0800 Walter Wu  wrote:

> 
> Hi Andrew,
> 
> I see my v4 patch is different in the next tree now. please see below
> information.
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=ebced5fb0ef969620ecdc4011f600f9e7c229a3c
> The different is in lib/Kconfig.kasan.
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/diff/lib/Kconfig.kasan?id=ebced5fb0ef969620ecdc4011f600f9e7c229a3c
> 

They look the same to me.  I did have `int' for KASAN_STACK due to a
merging mess, but I changed that to bool quite quickly.



Re: [PATCH] hugetlb: select PREEMPT_COUNT if HUGETLB_PAGE for in_atomic use

2021-03-10 Thread Andrew Morton
On Wed, 10 Mar 2021 18:13:21 -0800 Mike Kravetz  wrote:

> put_page does not correctly handle all calling contexts for hugetlb
> pages.  This was recently discussed in the threads [1] and [2].
> 
> free_huge_page is the routine called for the final put_page of huegtlb
> pages.  Since at least the beginning of git history, free_huge_page has
> acquired the hugetlb_lock to move the page to a free list and possibly
> perform other processing. When this code was originally written, the
> hugetlb_lock should have been made irq safe.
> 
> For many years, nobody noticed this situation until lockdep code caught
> free_huge_page being called from irq context.  By this time, another
> lock (hugetlb subpool) was also taken in the free_huge_page path.  In
> addition, hugetlb cgroup code had been added which could hold
> hugetlb_lock for a considerable period of time.  Because of this, commit
> c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task
> context") was added to address the issue of free_huge_page being called
> from irq context.  That commit hands off free_huge_page processing to a
> workqueue if !in_task.
> 
> The !in_task check handles the case of being called from irq context.
> However, it does not take into account the case when called with irqs
> disabled as in [1].
> 
> To complicate matters, functionality has been added to hugetlb
> such that free_huge_page may block/sleep in certain situations.  The
> hugetlb_lock is of course dropped before potentially blocking.
> 
> One way to handle all calling contexts is to have free_huge_page always
> send pages to the workqueue for processing.  This idea was briefly
> discussed here [3], but has some undesirable side effects.
> 
> Ideally, the hugetlb_lock should have been irq safe from the beginning
> and any code added to the free_huge_page path should have taken this
> into account.  However, this has not happened.  The code today does have
> the ability to hand off requests to a workqueue.  It does this for calls
> from irq context.  Changing the check in the code from !in_task to
> in_atomic would handle the situations when called with irqs disabled.
> However, it does not handle the case when called with a spinlock
> held.  This is needed because the code could block/sleep.
> 
> Select PREEMPT_COUNT if HUGETLB_PAGE is enabled so that in_atomic can be
> used to detect all atomic contexts where sleeping is not possible.
> 
> [1] https://lore.kernel.org/linux-mm/f1c03b05bc43a...@google.com/
> [2] https://lore.kernel.org/linux-mm/yejji9oawhuza...@dhcp22.suse.cz/
> [3] https://lore.kernel.org/linux-mm/ydzaawk41k4gd...@dhcp22.suse.cz/
> 
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -235,6 +235,7 @@ config HUGETLBFS
>  
>  config HUGETLB_PAGE
>   def_bool HUGETLBFS
> + select PREEMPT_COUNT
>  

Well this is unfortunate.  hugetlb is forcing PREEMPT_COUNT because we
screwed things up.

Did we consider changing the networking code to call a new
free_huge_tlb_from_irq()?  So the callee doesn't need to guess.

Or something else?

Is anyone looking into fixing this for real?



Re: [PATCH] kexec: Add kexec reboot string

2021-03-10 Thread Andrew Morton
On Thu,  4 Mar 2021 13:46:26 +0100 Paul Menzel  wrote:

> From: Joe LeVeque 
> 
> The purpose is to notify the kernel module for fast reboot.
> 
> Upstream a patch from the SONiC network operating system [1].
> 
> [1]: https://github.com/Azure/sonic-linux-kernel/pull/46
> 
> Signed-off-by: Paul Menzel 

We should have Joe's signed-off-by: for this.  Joe, can you please send
it?



Re: [PATCH] mm/oom_kill: Ensure MMU notifier range_end() is paired with range_start()

2021-03-10 Thread Andrew Morton
On Wed, 10 Mar 2021 13:31:17 -0800 Sean Christopherson  wrote:

> Invoke the MMU notifier's .invalidate_range_end() callbacks even if one
> of the .invalidate_range_start() callbacks failed.  If there are multiple
> notifiers, the notifier that did not fail may have performed actions in
> its ...start() that it expects to unwind via ...end().  Per the
> mmu_notifier_ops documentation, ...start() and ...end() must be paired.
> 
> The only in-kernel usage that is fatally broken is the SGI UV GRU driver,
> which effectively blocks and sleeps fault handlers during ...start(), and
> unblocks/wakes the handlers during ...end().  But, the only users that
> can fail ...start() are the i915 and Nouveau drivers, which are unlikely
> to collide with the SGI driver.
> 
> KVM is the only other user of ...end(), and while KVM also blocks fault
> handlers in ...start(), the fault handlers do not sleep and originate in
> killable ioctl() calls.  So while it's possible for the i915 and Nouveau
> drivers to collide with KVM, the bug is benign for KVM since the process
> is dying and KVM's guest is about to be terminated.
> 
> So, as of today, the bug is likely benign.  But, that may not always be
> true, e.g. there is a potential use case for blocking memslot updates in
> KVM while an invalidation is in-progress, and failure to unblock would
> result in said updates being blocked indefinitely and hanging.
> 
> Found by inspection.  Verified by adding a second notifier in KVM that
> periodically returns -EAGAIN on non-blockable ranges, triggering OOM,
> and observing that KVM exits with an elevated notifier count.

Given that there's no way known of triggering this in 5.11 or earlier,
I'd normally question the need to backport into -stable kernels.

But I guess that people might later develop modules which they want to
load into earlier kernels, in which case this might bite them.  So I
agree on the cc:stable thing.




Re: [PATCH 0/3] btrfs: Convert kmap/memset/kunmap to memzero_user()

2021-03-10 Thread Andrew Morton
On Tue,  9 Mar 2021 13:21:34 -0800 ira.we...@intel.com wrote:

> Previously this was submitted to convert to zero_user()[1].  zero_user() is not
> the same as memzero_user() and in fact some zero_user() calls may be better off
> as memzero_user().  Regardless it was incorrect to convert btrfs to
> zero_user().
> 
> This series corrects this by lifting memzero_user(), converting it to
> kmap_local_page(), and then using it in btrfs.

This impacts btrfs more than MM.  I suggest the btrfs developers grab
it, with my

Acked-by: Andrew Morton 



Re: [PATCH v5 1/7] mm: Restore init_on_* static branch defaults

2021-03-10 Thread Andrew Morton
On Tue,  9 Mar 2021 13:42:55 -0800 Kees Cook  wrote:

> Choosing the initial state of static branches changes the assembly layout
> (if the condition is expected to be likely, inline, or unlikely, out of
> line via a jump). The _TRUE/_FALSE defines for CONFIG_INIT_ON_*_DEFAULT_ON
> were accidentally removed. These need to stay so that the CONFIG controls
> the pessimization of the resulting static branch NOP/JMP locations.

Changelog doesn't really explain why anyone would want to apply this
patch.  This is especially important for -stable patches.

IOW, what is the user visible effect of the bug?


Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-10 Thread Andrew Morton
On Wed, 10 Mar 2021 10:46:15 + Mel Gorman  wrote:

> This patch adds a new page allocator interface via alloc_pages_bulk,
> and __alloc_pages_bulk_nodemask. A caller requests a number of pages
> to be allocated and added to a list. They can be freed in bulk using
> free_pages_bulk().

Why am I surprised we don't already have this.

> The API is not guaranteed to return the requested number of pages and
> may fail if the preferred allocation zone has limited free memory, the
> cpuset changes during the allocation or page debugging decides to fail
> an allocation. It's up to the caller to request more pages in batch
> if necessary.
> 
> Note that this implementation is not very efficient and could be improved
> but it would require refactoring. The intent is to make it available early
> to determine what semantics are required by different callers. Once the
> full semantics are nailed down, it can be refactored.
> 
> ...
>
> +/* Drop reference counts and free order-0 pages from a list. */
> +void free_pages_bulk(struct list_head *list)
> +{
> + struct page *page, *next;
> +
> + list_for_each_entry_safe(page, next, list, lru) {
> + trace_mm_page_free_batched(page);
> + if (put_page_testzero(page)) {
> + list_del(&page->lru);
> + __free_pages_ok(page, 0, FPI_NONE);
> + }
> + }
> +}
> +EXPORT_SYMBOL_GPL(free_pages_bulk);

I expect that batching games are planned in here as well?

>  static inline unsigned int
>  gfp_to_alloc_flags(gfp_t gfp_mask)
>  {
> @@ -4919,6 +4934,9 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, 
> unsigned int order,
>   struct alloc_context *ac, gfp_t *alloc_mask,
>   unsigned int *alloc_flags)
>  {
> + gfp_mask &= gfp_allowed_mask;
> + *alloc_mask = gfp_mask;
> +
>   ac->highest_zoneidx = gfp_zone(gfp_mask);
>   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
>   ac->nodemask = nodemask;
> @@ -4960,6 +4978,99 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, 
> unsigned int order,
>   return true;
>  }
>  
> +/*
> + * This is a batched version of the page allocator that attempts to
> + * allocate nr_pages quickly from the preferred zone and add them to list.
> + */

Documentation is rather lame.  Returns number of pages allocated...

> +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
> + nodemask_t *nodemask, int nr_pages,
> + struct list_head *alloc_list)
> +{
> + struct page *page;
> + unsigned long flags;
> + struct zone *zone;
> + struct zoneref *z;
> + struct per_cpu_pages *pcp;
> + struct list_head *pcp_list;
> + struct alloc_context ac;
> + gfp_t alloc_mask;
> + unsigned int alloc_flags;
> + int alloced = 0;
> +
> + if (nr_pages == 1)
> + goto failed;
> +
> + /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> + if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, nodemask, &ac, 
> &alloc_mask, &alloc_flags))
> + return 0;
> + gfp_mask = alloc_mask;
> +
> + /* Find an allowed local zone that meets the high watermark. */
> + for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, 
> ac.highest_zoneidx, ac.nodemask) {
> + unsigned long mark;
> +
> + if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
> + !__cpuset_zone_allowed(zone, gfp_mask)) {
> + continue;
> + }
> +
> + if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
> + zone_to_nid(zone) != 
> zone_to_nid(ac.preferred_zoneref->zone)) {
> + goto failed;
> + }
> +
> + mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + 
> nr_pages;
> + if (zone_watermark_fast(zone, 0,  mark,
> + zonelist_zone_idx(ac.preferred_zoneref),
> + alloc_flags, gfp_mask)) {
> + break;
> + }
> + }

I suspect the above was stolen from elsewhere and that some code
commonification is planned.


> + if (!zone)
> + return 0;
> +
> + /* Attempt the batch allocation */
> + local_irq_save(flags);
> + pcp = &this_cpu_ptr(zone->pageset)->pcp;
> + pcp_list = &pcp->lists[ac.migratetype];
> +
> + while (alloced < nr_pages) {
> + page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
> + pcp, pcp_list);
> + if (!page)
> + break;
> +
> + prep_new_page(page, 0, gfp_mask, 0);

I wonder if it would be worth running prep_new_page() in a second pass,
after reenabling interrupts.

Speaking of which, will the realtime people get upset about the
irqs-off latency?  How many pages are we talking about here?

> + list_add(&page->lru, alloc_list);
> +

Re: [PATCH 0/5] Introduce a bulk order-0 page allocator with two in-tree users

2021-03-10 Thread Andrew Morton
On Wed, 10 Mar 2021 10:46:13 + Mel Gorman  wrote:

> This series introduces a bulk order-0 page allocator with sunrpc and
> the network page pool being the first users.



Right now, the [0/n] doesn't even tell us that it's a performance
patchset!

The whole point of this patchset appears only in the final paragraph
of the final patch's changelog.

: For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
: redirecting xdp_frame packets into a veth, that does XDP_PASS to create
: an SKB from the xdp_frame, which then cannot return the page to the
: page_pool.  In this case, we saw[1] an improvement of 18.8% from using
: the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).

Much more detail on the overall objective and the observed results,
please?

Also, that workload looks awfully corner-casey.  How beneficial is this
work for more general and widely-used operations?

> The implementation is not
> particularly efficient and the intention is to iron out what the semantics
> of the API should have for users. Once the semantics are ironed out, it can
> be made more efficient.

And some guesstimates about how much benefit remains to be realized
would be helpful.



Re: [PATCH v3] mm: page_alloc: dump migrate-failed pages

2021-03-10 Thread Andrew Morton
On Wed, 10 Mar 2021 10:01:04 -0800 Minchan Kim  wrote:

> Currently, debugging CMA allocation failures is quite limited.
> The most common source of these failures seems to be page
> migration which doesn't provide any useful information on the
> reason of the failure by itself. alloc_contig_range can report
> those failures as it holds a list of migrate-failed pages.
> 
> page refcount, mapcount with page flags on dump_page are
> helpful information to deduce the culprit. Furthermore,
> dump_page_owner was super helpful to find long term pinner
> who initiated the page allocation.
> 
> The reason this takes the dynamic debug approach is that the debug
> messages could emit a lot of noise, since alloc_contig_range is a
> best-effort allocator and is called frequently.
> 
> There are two ifdefery conditions to support common dyndbg options:
> 
> - CONFIG_DYNAMIC_DEBUG_CORE && DYNAMIC_DEBUG_MODULE
> It aims to support the feature for specific files only,
> by adding ccflags.
> 
> - CONFIG_DYNAMIC_DEBUG
> It aims to support the feature globally, system-wide.
> 
> A simple example to enable the feature:
> 
> Admin could enable the dump like this(by default, disabled)
> 
>   echo "func dump_migrate_failure_pages +p" > control
> 
> Admin could disable it.
> 
>   echo "func dump_migrate_failure_pages =_" > control

I think the changelog is out of sync.  Did you mean
"alloc_contig_dump_pages" here?

> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8453,6 +8453,27 @@ static unsigned long pfn_max_align_up(unsigned long pfn)
>   pageblock_nr_pages));
>  }

> +#if defined(CONFIG_DYNAMIC_DEBUG) || \
> + (defined(CONFIG_DYNAMIC_DEBUG_CORE) && defined(DYNAMIC_DEBUG_MODULE))
> +static void alloc_contig_dump_pages(struct list_head *page_list)
> +{
> + DEFINE_DYNAMIC_DEBUG_METADATA(descriptor,
> + "migrate failure");
> +
> + if (DYNAMIC_DEBUG_BRANCH(descriptor)) {
> + struct page *page;
> +
> + WARN(1, "failed callstack");
> + list_for_each_entry(page, page_list, lru)
> + dump_page(page, "migration failure");
> + }
> +}

I doubt if everyone is familiar with dynamic debug.  It might be kind
to add a little comment over this, telling people how to turn it on and
off.


