Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse

2021-01-09 Thread Andrea Arcangeli
On Sat, Jan 09, 2021 at 05:37:09PM -0800, Linus Torvalds wrote:
> On Sat, Jan 9, 2021 at 5:19 PM Linus Torvalds
>  wrote:
> >
> > And no, I didn't make the UFFDIO_WRITEPROTECT code take the mmap_sem
> > for writing. For whoever wants to look at that, it's
> > mwriteprotect_range() in mm/userfaultfd.c and the fix is literally to
> > turn the read-lock (and unlock) into a write-lock (and unlock).
> 
> Oh, and if it wasn't obvious, we'll have to debate what to do with
> trying to mprotect() a pinned page. Do we just ignore the pinned page
> (the way my clear_refs patch did)? Or do we turn it into -EBUSY? Or
> what?

Agreed, I assume mprotect would have the same effect.

mprotect in parallel with a read or recvmsg may be undefined, so I
didn't bring it up, but it was pretty clear.

The moment the write bit is cleared (no matter why or by whom) and
the PG lock is released, if there's any GUP pin, GUP currently loses
synchrony.

In any case I intended the testcase to help exercise the new
page_count logic, and possibly to make it behave better somehow, no
matter how.

I admit I'm also wondering myself about the exact semantics of
O_DIRECT under clear_refs or uffd-wp tracking, but the point is that
losing reads and getting unexpected data in the page still doesn't
look like good behavior, and it had to be at least checked.

To me, ultimately, the useful use case that has become impossible
with page_count isn't even clear_refs nor uffd-wp.

The useful case that I can see zero fundamental flaws in is an RDMA
or some other device computing in pure readonly DMA on the data while
a program runs normally and produces it. It could even be a
framebuffer that doesn't care about coherency. You may want to
occasionally wrprotect the memory under a readonly long term GUP pin
for consistency, even against bugs of the program itself. Why should
wrprotecting make the device lose synchrony? And what kind of
performance do we gain for the normal useful cases by breaking the
special case? Is there a benchmark showing it?

> So it's not *just* the locking that needs to be fixed. But just take a
> look at that suggested clear_refs patch of mine - it sure isn't
> complicated.

If we can skip the wrprotection it's fairly easy, I fully agree; even
then it still looks more difficult than using page_mapcount in
do_wp_page in my view, so I also don't see the simplification. And
overall the amount of kernel code had a net increase as a result.

Thanks,
Andrea



Re: [PATCH 1/1] mm: restore full accuracy in COW page reuse

2021-01-09 Thread Andrea Arcangeli
Hello,

On Sat, Jan 09, 2021 at 07:44:35PM -0500, Andrea Arcangeli wrote:
> allowing a child to corrupt memory in the parent. That's a problem
> that could happen not-maliciously too. So the scenario described

I updated the above partly quoted sentence since in the previous
version it didn't have full accuracy:

https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=fc5a76b1c14e5e6cdc64ece306fc03773662d98a

"However since a single transient GUP pin on a tail page, would elevate
the page_count for all other tail pages (unlike the mapcount which is
subpage granular), the COW page reuse inaccuracy would then cross
different vmas and the effect would happen at a distance in vma of
different processes. A single GUP pin taken on a subpage mapped in a
different process could trigger 511 false positive COWs copies in the
local process, after a fork()."

This is a best effort to try to document all side effects, but it'd
be great to hear from Kirill too on the above detail to have
confirmation.

Thanks and have a great weekend everyone,
Andrea



Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse

2021-01-09 Thread Andrea Arcangeli
Hello Linus,

On Sat, Jan 09, 2021 at 05:19:51PM -0800, Linus Torvalds wrote:
> +#define is_cow_mapping(flags) (((flags) & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE)
> +
> +static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr, pte_t pte)
> +{
> +	struct page *page;
> +
> +	if (!is_cow_mapping(vma->vm_flags))
> +		return false;
> +	if (likely(!atomic_read(&vma->vm_mm->has_pinned)))
> +		return false;
> +	page = vm_normal_page(vma, addr, pte);
> +	if (!page)
> +		return false;
> +	if (page_mapcount(page) != 1)
> +		return false;
> +	return page_maybe_dma_pinned(page);
> +}

I just don't see the simplification coming from
09854ba94c6aad7886996bfbee2530b3d8a7f4f4. Instead of checking
page_mapcount above as an optimization, to me it looks much simpler to
check it in a single place, in do_wp_page, which, in addition to
optimizing away the superfluous copy, would optimize away the above
complexity as well.

And I won't comment if it's actually safe to skip random pages or
not. All I know is for mprotect and uffd-wp, definitely the above
approach wouldn't work.

In addition I dislike has_pinned and FOLL_PIN.

has_pinned    450 include/linux/mm_types.h  * for instance during page table copying for fork().
has_pinned   1021 kernel/fork.c             atomic64_set(&mm->pinned_vm, 0);
has_pinned   1239 mm/gup.c                  atomic_set(&mm->has_pinned, 1);
has_pinned   2579 mm/gup.c                  atomic_set(&current->mm->has_pinned, 1);
has_pinned   1099 mm/huge_memory.c          atomic_read(&src_mm->has_pinned) &&
has_pinned   1213 mm/huge_memory.c          atomic_read(&src_mm->has_pinned) &&
has_pinned    819 mm/memory.c               if (likely(!atomic_read(&src_mm->has_pinned)))

Are we spending 32 bits of an atomic_t in mm_struct just to call
atomic_set(1) on it? Why isn't it an MMF_HAS_PINNED bit that can
already be set atomically under mmap_read_lock too? There are bits
left free there; we didn't run out yet to justify wasting another 31
bits. I hope I'm overlooking something.
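
To illustrate, a minimal sketch of what I mean (hypothetical, not a
patch; MMF_HAS_PINNED is a made-up name for one of the free MMF_*
bits): the same hint could live in mm->flags and still be set
atomically under mmap_read_lock, since set_bit()/test_bit() are
atomic.

#define MMF_HAS_PINNED 28	/* made-up bit number, any free MMF_ bit */

static inline void mm_set_has_pinned(struct mm_struct *mm)
{
	/* set once, from the GUP path, possibly under mmap_read_lock */
	if (!test_bit(MMF_HAS_PINNED, &mm->flags))
		set_bit(MMF_HAS_PINNED, &mm->flags);
}

static inline bool mm_has_pinned(struct mm_struct *mm)
{
	return test_bit(MMF_HAS_PINNED, &mm->flags);
}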

The existence of FOLL_LONGTERM is good and makes a difference at times
for writeback if it's on a MAP_SHARED, or it makes a difference during
GUP to do a page migration before taking the pin, but for the whole
rest of the VM it's irrelevant whether it's long or short term, so I'm
also concerned by what Jason mentioned about long term pins being
treated differently within the VM. That to me looks as fundamentally
flawed as introducing inaccuracies in do_wp_page, which, even ignoring
the performance implications caused by the inaccuracy, simply prevent
doing some useful things.

I obviously agree all common workloads with no GUP pins and no
clear_refs and no uffd, are way more important to be optimal, but I
haven't seen a single benchmark showing the benefit of not taking the
PG_lock before a page copy or any other runtime benefit coming from
page_count in do_wp_page.

To the contrary now I see additional branches in fork and in various
other paths.

The only thing again where I see page_count provides a tangible
benefit is the vmsplice attack from child to parent, but that has not
been fully fixed, and if page_count is added to fix it in all COW
faults, it'll introduce extra inefficiency to the very common
important workloads, not only to the special GUP/clear_refs/uffd-wp
workloads as your patch above shows.

Thanks,
Andrea



[PATCH 0/1] mm: restore full accuracy in COW page reuse

2021-01-09 Thread Andrea Arcangeli
Hello Andrew and everyone,

Once we agree that COW page reuse requires full accuracy, the next
step is to re-apply 17839856fd588f4ab6b789f482ed3ffd7c403e1f and to
resume going in that direction.

Who is going to orthogonally secure vmsplice, Andy, Jason, Jens?  Once
vmsplice is secured from taking unconstrained unprivileged long term
GUP pins, the attack from child to parent can still happen in theory,
but statistically speaking, once that huge window is closed, it won't
be a practical concern, so it'll give us time to perfect the full
solution by closing all windows in the VM core. vmsplice has to be
orthogonally fixed anyway, even if all windows were closed in the VM
core first.

Unfortunately it's still not clear exactly what failed with
17839856fd588f4ab6b789f482ed3ffd7c403e1f but the whole point is that
we need to discuss that together.

Thanks,
Andrea

// SPDX-License-Identifier: GPL-3.0-or-later
/*
 *  reproducer for v5.11 O_DIRECT mm corruption with page_count
 *  instead of mapcount in do_wp_page.
 *
 *  Copyright (C) 2021  Red Hat, Inc.
 *
 *  gcc -O2 -o page_count_do_wp_page page_count_do_wp_page.c -lpthread
 *  page_count_do_wp_page ./whateverfile
 *
 *  NOTE: CONFIG_SOFT_DIRTY=y is required in the kernel config.
 */

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include <stdbool.h>
#include <unistd.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>

#define PAGE_SIZE (1UL<<12)
#define HARDBLKSIZE 512

static void* writer(void *_mem)
{
	char *mem = (char *)_mem;
	for(;;) {
		usleep(random() % 1000);
		mem[PAGE_SIZE-1] = 0;
	}
}

static void* background_soft_dirty(void *data)
{
	long fd = (long) data;
	/* writing "4" to /proc/<pid>/clear_refs clears the soft-dirty
	   bits, wrprotecting all the ptes of the mm */
	for (;;)
		if (write(fd, "4", 1) != 1)
			perror("write soft dirty"), exit(1);
}

int main(int argc, char *argv[])
{
	if (argc < 2)
		printf("%s <file>\n", argv[0]), exit(1);

	char path[PAGE_SIZE];
	strcpy(path, "/proc/");
	sprintf(path + strlen(path), "%d", getpid());
	strcat(path, "/clear_refs");
	long soft_dirty_fd = open(path, O_WRONLY);
	if (soft_dirty_fd < 0)
		perror("open clear_refs"), exit(1);

	char *mem;
	if (posix_memalign((void **)&mem, PAGE_SIZE, PAGE_SIZE*3))
		perror("posix_memalign"), exit(1);
	/* THP is not using page_count so it would not corrupt memory */
	if (madvise(mem, PAGE_SIZE, MADV_NOHUGEPAGE))
		perror("madvise"), exit(1);
	bzero(mem, PAGE_SIZE * 3);
	memset(mem + PAGE_SIZE * 2, 0xff, HARDBLKSIZE);

	/*
	 * This is not specific to O_DIRECT. Even if O_DIRECT was
	 * forced to use PAGE_SIZE minimum granularity for reads, a
	 * recvmsg would create the same issue since it also uses
	 * iov_iter_get_pages internally to create transient GUP pins
	 * on anon memory.
	 */
	int fd = open(argv[1], O_DIRECT|O_CREAT|O_RDWR|O_TRUNC, 0600);
	if (fd < 0)
		perror("open"), exit(1);
	if (write(fd, mem, PAGE_SIZE) != PAGE_SIZE)
		perror("write"), exit(1);

	pthread_t soft_dirty;
	if (pthread_create(&soft_dirty, NULL,
			   background_soft_dirty, (void *)soft_dirty_fd))
		perror("soft_dirty"), exit(1);

	pthread_t thread;
	if (pthread_create(&thread, NULL, writer, mem))
		perror("pthread_create"), exit(1);

	bool skip_memset = true;
	while (1) {
		if (pread(fd, mem, HARDBLKSIZE, 0) != HARDBLKSIZE)
			perror("read"), exit(1);
		if (memcmp(mem, mem+PAGE_SIZE, HARDBLKSIZE)) {
			if (memcmp(mem, mem+PAGE_SIZE*2, PAGE_SIZE)) {
				if (skip_memset)
					printf("unexpected memory "
					       "corruption detected\n");
				else
					printf("memory corruption detected, "
					       "dumping page\n");
				int end = PAGE_SIZE;
				if (!memcmp(mem+HARDBLKSIZE, mem+PAGE_SIZE,
					    PAGE_SIZE-HARDBLKSIZE))
					end = HARDBLKSIZE;
				for (int i = 0; i < end; i++)
					printf("%x", mem[i]);
				printf("\n");
			} else
				printf("memory corruption detected\n");
		}
		skip_memset = !skip_memset;
		if (!skip_memset)
memset(me

[PATCH 1/1] mm: restore full accuracy in COW page reuse

2021-01-09 Thread Andrea Arcangeli
This reverts commit 09854ba94c6aad7886996bfbee2530b3d8a7f4f4.
This reverts commit be068f29034fb00530a053d18b8cf140c32b12b3.
This reverts commit 1a0cf26323c80e2f1c58fc04f15686de61bfab0c.

This example below uses O_DIRECT, on a process under
clear_refs. There's no FOLL_LONGTERM involvement. All GUP pins are
transient. fork is never called, and even when clear_refs is written
by an external program, fork() would not be involved.

thread0                        thread1                 other process
                                                       (or thread 3)
=======                        =======                 =============
read syscall
O_DIRECT
read DMA to vaddr+0
len = 512 bytes
GUP(FOLL_WRITE)
DMA writing to RAM started
                                                       clear_refs
                                                       pte_wrprotect
                               write vaddrA+512
                               page_count == 2
                               wp_copy_page
read syscall returns
read lost

Notwithstanding the fact that failing O_DIRECT at sub-PAGE_SIZE
granularity is an ABI break by itself, this is of course not just
about O_DIRECT. recvmsg, kernel TLS and plenty of other GUP FOLL_WRITE
iov_iter_get_pages users would write to the memory with sub-PAGE_SIZE
granularity, depending on the msghdr iov tiny buffers. Those recvmsg
writes are already silently lost too, just as the above O_DIRECT reads
already get lost.

The fact COW must not happen too much is documented in commit
6d0a07edd17cfc12fdc1f36de8072fa17cc3666f:

==
This will provide fully accuracy to the mapcount calculation in the
write protect faults, so page pinning will not get broken by false
positive copy-on-writes.
==

And in the current comment above the THP mapcount:

==
 * [..] we only use
 * page_trans_huge_mapcount() in the copy-on-write faults where we
 * need full accuracy to avoid breaking page pinning, [..]
==

Quote from the Link tagged:

==
In other words, the page_count check in do_wp_page, extends the old
fork vs gup vs threads race, so that it also happens within a single
thread without fork() involvement.
===

In addition FOLL_LONGTERM breaks for readonly DMA, despite
FOLL_LONGTERM pages now being treated specially and copied in fork().

FOLL_LONGTERM is in no way special with regard to the VM core and it
can't be. From the VM standpoint all GUP pins are equal, and breaking
one means breaking them all the same.

It is not possible to resolve this by looking only at FOLL_LONGTERM:
that only hides the problem, but it doesn't fully resolve it, as shown
above.

In addition this avoids storms of extra false positive COWs if the THP
are shared by different processes, which causes extra overhead, but
the several performance considerations are only a secondary concern.

After this commit, the copy in fork() for FOLL_LONGTERM also must be
extended to all GUP pins including transient pins, or we shouldn't
even copy FOLL_LONGTERM pinned pages. Treating FOLL_LONGTERM any
differently from a transient GUP pin is a sign that something is
fundamentally flawed.

FOLL_LONGTERM can be treated different in writeback and during GUP
runtime itself, but not in the VM-core when deciding when to COW or
not to COW an anonymous page.

This revert causes no theoretical security regression because THP is
not yet using page_count in do_huge_pmd_wp_page.

The alternative of this patch would be to replace the mapcount with
the page_count in do_huge_pmd_wp_page too in order to really close the
vmsplice attack from child to parent that way.

However since a single transient GUP pin on a tail page would elevate
the page_count for all other tail pages (unlike the mapcount which is
subpage granular), the above "fork-less" race would span over a
HPAGE_PMD_SIZE range; it would even cross different vmas and the
effect would happen at a distance in vmas of different processes,
allowing a child to corrupt memory in the parent. That's a problem
that could happen not-maliciously too. So the scenario described
above, if extended to the THP new refcounting, would be more
concerning than the vmsplice issue, which can be tackled first in
vmsplice itself, and only at a second stage in the COW code.

Link: https://lkml.kernel.org/r/20210107200402.31095-1-aarca...@redhat.com
Cc: sta...@kernel.org
Fixes: 09854ba94c6a ("mm: do_wp_page() simplification")
Signed-off-by: Andrea Arcangeli 
---
 include/linux/ksm.h |  7 ++
 mm/ksm.c| 25 +++
 mm/memory.c | 59 -
 3 files changed, 74 insertions(+), 17 deletions(-)

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 161e8164abcf..e48b1e453ff5 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -53,6 +53,8 @@ struct page *ksm_might_need_to_copy(struct page *page,
 
 void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc);
 void ksm_migrate_page(struct page *newpage, struct page *oldpage);

Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy

2021-01-08 Thread Andrea Arcangeli
Hello Jason,

On Fri, Jan 08, 2021 at 08:42:55PM -0400, Jason Gunthorpe wrote:
> There is already a patch series floating about to do exactly that for
> FOLL_LONGTERM pins based on the existing code in GUP for CMA migration

Sounds great.

> The ship sailed on this a decade ago, it is completely infeasible to
> go back now, it would completely break widely used things like GPU,
> RDMA and more.

For all those that aren't using mmu notifier and that rely solely on
page pins, they still require privilege, except they get it through
/dev/ permissions.

Just the fact there's no capability check in the read/write/ioctl
doesn't mean those device inodes can be opened by any luser: the fact
the kernel allows it doesn't mean the /dev/ permissions do too. The
same applies to /dev/kvm too, not just PCI device drivers.

Device drivers that you need to open in /dev/ before you can take a
GUP pin require whole different checks than syscalls like vmsplice and
io_uring that are universally available.

The very same GUP long term pinning kernel code can be perfectly safe
to use without any permission check for a device driver of an iommu in
/dev/, but completely unsafe for a syscall.

> If we want to have a high speed copy_from_user like thing that is not
> based on page pins but on mmu notifiers, then we should make that
> infrastructure and the various places that need it should use common
> code. At least vhost and io_uring are good candidates.

Actually the mmu notifier doesn't strictly require pins, it only
requires GUP. All users tend to use FOLL_GET just as a safety
precaution (I already tried to optimize away the two atomics per GUP,
but we were nacked by the KVM maintainer who didn't want to take the
risk; I would have, but it's a fair point indeed, obviously it's safer
with the pin plus the mmu notifier, two is safer than one).

I'm not sure how any copy-user could obviate the need for a secondary
MMU mapping; mappings and copies are mutually exclusive. Any copy
would be breaking memory coherency in this environment.
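
For reference, the shape of the mmu notifier approach I have in mind
is roughly the sketch below (not taken from any specific driver;
my_secondary_mmu and my_device_unmap are made-up names): the secondary
MMU tears down its mappings from invalidate_range_start, so it stays
coherent without holding any long term pin.

static int my_invalidate_range_start(struct mmu_notifier *mn,
				     const struct mmu_notifier_range *range)
{
	struct my_secondary_mmu *smmu = container_of(mn, struct my_secondary_mmu, mn);

	/* shoot down device mappings covering [range->start, range->end) */
	my_device_unmap(smmu, range->start, range->end);
	return 0;
}

static const struct mmu_notifier_ops my_mmu_notifier_ops = {
	.invalidate_range_start	= my_invalidate_range_start,
};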

> Otherwise, we are pretending that they are DMA and using the DMA
> centric pin_user_pages() interface, which we still have to support and
> make work.

vhost and io_uring would be pure software constructs, but there are
hardware users of the GUP pin that don't use any DMA.

The long term GUP pin is not only about PCI devices doing DMA. KVM is
not ever using any DMA, despite taking terabytes worth of very long
term GUP pins.

> > In any case, the extra flags required in FOLL_LONGTERM should be
> > implied by FOLL_LONGTERM itself, once it enters the gup code, because
> > it's not cool having to FOLL_WRITE in all drivers for a GUP(write=0),
> > let alone having to specify FOLL_FORCE for just a read. But this is
> > going offtopic.
> 
> We really should revise this, I've been thinking for a while we need
> to internalize into gup.c the FOLL_FORCE|FOLL_WRITE|FOLL_LONGTERM
> idiom at least..

100% agreed.

> > > simply because it is using the CPU to memory copy as its "DMA".
> > 
> > vmsplice can't find all put_pages that may release the pages when the
> > pipe is read, or it'd be at least be able to do the unreliable
> > RLIMIT_MEMLOCK accounting.
> 
> Yikes! So it can't even use pin_user_pages FOLL_LONGTERM properly
> because that requires unpin_user_pages(), which means finding all the
> unpin sites too :\

Exactly.

> > To make another example a single unprivileged pin on the movable zone,
> > can break memhotunplug unless you use the mmu notifier. Every other
> > advanced feature falls apart.
> 
> As above FOLL_LONGTERM will someday migrate from movable zones.

Something like:

1) migrate from movable zones contextually to GUP

2) be accounted on the compound_order not on the number of GUP
   (io_uring needs fixing here)

3) maybe account not only in rlimit, but also expose the total worth
   of GUP pins in page_order units (not pins) to the OOM killer to be
   added to the rss (will double count though).

Maybe 3 is overkill, but without it the OOM killer won't even see
those GUP pins coming, so if not done it's still kind of unsafe; if
done, it'll risk double counting.
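
As a sketch of what 1) and 2) above could look like (hypothetical
helpers, not the actual gup.c code; the existing CMA migration logic
is the model and migrate_out_of_movable_zone() is a made-up name):

static struct page *longterm_pin_prepare(struct page *page)
{
	/* before granting a FOLL_LONGTERM pin, leave the movable pageblocks */
	if (is_migrate_movable(get_pageblock_migratetype(page)))
		page = migrate_out_of_movable_zone(page); /* hypothetical */
	return page;
}

static unsigned long longterm_pin_account(struct page *page)
{
	/* charge the rlimit by compound size, not one PAGE_SIZE per pin */
	return PAGE_SIZE << compound_order(compound_head(page));
}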

Even then a GUP pin still prevents optimizations: the io ring can't
converge to the right NUMA node, just to make an example, but that's
a secondary performance concern.

The primary concern with the mmu notifier in io_uring is the
take_all_locks latency.

Longlived apps like qemu would be fine with mmu notifier. The main
question is also if there's any short-lived latency io_uring
usage... that wouldn't fly with take_all_locks.

The problem with the mmu notifier as a universal solution is, for
example, that it can't wait for I/O completion of O_DIRECT, since it
has no clue where the put_page is to wait for it; otherwise we could
avoid even the FOLL_GET for O_DIRECT and guarantee the I/O has to
complete before paging or anything else can unmap the page under I/O
from the pagetable.

Even if we could reliably identify all the put_page of transient pins

Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending

2021-01-08 Thread Andrea Arcangeli
On Fri, Jan 08, 2021 at 11:25:21AM -0800, Linus Torvalds wrote:
> On Fri, Jan 8, 2021 at 9:53 AM Andrea Arcangeli  wrote:
> >
> > Do you intend to eventually fix the zygote vmsplice case or not?
> > Because in current upstream it's not fixed currently using the
> > enterprise default config.
> 
> Is this the hugepage case? Neither of your patches actually touched
> that, so I've forgotten the details.

The two patches only fixed the TLB flushing deferral in clear_refs and
uffd-wp.

So I didn't actually try to fix the hugepage case by adding the
page_count checks there too. I could try to do that, at least it'd be
consistent, but I would still try to find an alternate solution later.

> > Irrelevant special case as in: long term GUP pin on the memory?
> 
> Irrelevant special case in that
> 
>  (a) an extra COW shouldn't be a correctness issue unless somebody
> does something horribly wrong (and obviously the code that hasn't
> taken the mmap_lock for writing are then examples of that)
> 
> and
> 
>  (b) it's not a performance issue either unless you can find a real
> load that does it.
> 
> Hmm?

For b) I don't have a hard time imagining `ps` hanging for seconds
if clear_refs is touched on a 4T mm, but b) is not the main
concern.

Having to rely on a) is the main concern and it's not about tlb
flushes but the long term GUP pins.

Thanks,
Andrea



Re: [PATCH] x86/vm86/32: Remove VM86_SCREEN_BITMAP support

2021-01-08 Thread Andrea Arcangeli
On Fri, Jan 08, 2021 at 10:55:05AM -0800, Andy Lutomirski wrote:
> The implementation was rather buggy.  It unconditionally marked PTEs
> read-only and released the mmap lock before flushing the TLB, which
> could allow a racing CoW operation to falsely believe that the
> underlying memory was not writable.
> 
> I can't find any users at all of this mechanism, so just remove it.

Reviewed-by: Andrea Arcangeli 



Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy

2021-01-08 Thread Andrea Arcangeli
On Fri, Jan 08, 2021 at 10:31:24AM -0800, Andy Lutomirski wrote:
> Can we just remove vmsplice() support?  We could make it do a normal

The single case I've seen vmsplice used for so far that was really
cool is localhost live migration of qemu. However, despite being
really cool, it wasn't merged in the end, and I don't recall exactly
why.

There are even more efficient (but slightly more complex) ways to do
that than vmsplice: using MAP_SHARED gigapages or MAP_SHARED tmpfs
with THP opted-in in the tmpfs mount as guest physical memory instead
of anon memory, and finding a way not to have it cleared by kexec, so
you can also upgrade the host kernel and not just qemu... that is a
more optimal way than pinning and moving all pages through the pipe
and still having to pay a superfluous copy at the destination.

My guess why it's not popular, and I may be completely wrong on this
since I basically never used vmsplice (other than to proof of concept
DoS my phone to verify the long term GUP pin exploit works), is that
vmsplice is a more efficient, but not the most efficient option.

Exactly like in the live migration in place, it's always more
efficient to share a tmpfs THP backed region and have true zero copy,
than going through a pipe that still does one copy at the receiving
end. It may also be simpler and it's not dependent on F_SETPIPE_SIZE
obscure tunings. So in the end it's still too slow for apps that
requires maximum performance, and not worth the extra work for those
that don't.

I love vmsplice conceptually, I'd just prefer that a luser cannot
run it.

> copy, thereby getting rid of a fair amount of nastiness and potential
> attacks.  Even ignoring issues relating to the length of time that the
> vmsplice reference is alive, we also have whatever problems could be
> caused by a malicious or misguided user vmsplice()ing some memory and
> then modifying it.

Sorry to ask but I'm curious, what also goes wrong if the user
modifies memory under GUP pin from vmsplice? That's not obvious to
see.

Thanks,
Andrea



Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy

2021-01-08 Thread Andrea Arcangeli
On Fri, Jan 08, 2021 at 02:19:45PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 08, 2021 at 12:00:36PM -0500, Andrea Arcangeli wrote:
> > > The majority cannot be converted to notifiers because they are DMA
> > > based. Every one of those is an ABI for something, and does not expect
> > > extra privilege to function. It would be a major breaking change to
> > > have pin_user_pages require some cap.
> > 
> > ... what makes them safe is to be transient GUP pin and not long
> > term.
> > 
> > Please note the "long term" in the underlined line.
> 
> Many of them are long term, though only 50 or so have been marked
> specifically with FOLL_LONGTERM. I don't see how we can make such a
> major ABI break.

io_uring is one of those indeed and I already flagged it.

This isn't a black and white issue, kernel memory is also pinned but
it's not in movable pageblocks... How do you tell the VM in GUP to
migrate memory to a non movable pageblock before pinning it? Because
that's what it should do to create less breakage.

For example iommu obviously needs to be privileged; if your argument
is that it's enough to use the right API to take unconstrained long
term pins, that's not the case. Pins are pins and prevent moving or
freeing the memory; their effect is the same and again worse than
mlock on many levels.

(then I know on preempt-rt should behave like a pin, and that's fine,
you disable all features for such purpose there)

io_uring is fine in comparison to vmsplice but still not perfect,
because it does the RLIMIT_MEMLOCK accounting but unfortunately it is
tangibly unreliable, since a pin can cost 2MB or 1GB (now 1GB is
basically privileged so it doesn't hurt to get the accounting wrong in
such a case, but it's still technically counting apples as oranges).

Maybe io_uring could keep not doing mmu notifier, I'd need more
investigation to be sure, but what's the point of keeping it
VM-breaking when it doesn't need to? Is io_uring required to setup the
ring at high frequency?

> Looking at it, vmsplice() is simply wrong. A long term page pin must
> use pin_user_pages(), and either FOLL_LONGTERM|FOLL_WRITE (write mode)
> FOLL_LONGTERM|FOLL_FORCE|FOLL_WRITE (read mode)
> 
> ie it must COW and it must reject cases that are not longterm safe,
> like DAX and CMA and so on.
>
> These are the well established rules, vmsplice does not get a pass

Where are the established rules written down? pin_user_pages.rst
doesn't even make a mention of FOLL_FORCE or FOLL_WRITE at all, mm.h
same thing.

In any case, the extra flags required in FOLL_LONGTERM should be
implied by FOLL_LONGTERM itself, once it enters the gup code, because
it's not cool having to FOLL_WRITE in all drivers for a GUP(write=0),
let alone having to specify FOLL_FORCE for just a read. But this is
going offtopic.
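
For context, the idiom being discussed looks roughly like this sketch
(not copied from any particular driver; addr, npages, pages and ret
are assumed to be set up by the caller): even a device that only reads
the memory passes FOLL_WRITE|FOLL_FORCE together with FOLL_LONGTERM so
the COW is forced up front.

	ret = pin_user_pages(addr, npages,
			     FOLL_LONGTERM | FOLL_FORCE | FOLL_WRITE,
			     pages, NULL);
	if (ret > 0) {
		/* ... the device reads the pinned pages ... */
		unpin_user_pages(pages, ret);
	}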

> simply because it is using the CPU to memory copy as its "DMA".

vmsplice can't find all put_pages that may release the pages when the
pipe is read, or it'd be at least be able to do the unreliable
RLIMIT_MEMLOCK accounting.

I'm glad we agree vmsplice is unsafe. The PR for the seccomp filter is
open so if you don't mind, I'll link your review as confirmation.

> > speaking in practice. io_uring has similar concern but it can use mmu
> > notifier, so it can totally fix it and be 100% safe from this.
> 
> IIRC io_uring does use FOLL_LONGTERM and FOLL_WRITE..

Right it's one of those 50. FOLL_WRITE won't magically allow the
memory to be swapped or migrated.

To make another example a single unprivileged pin on the movable zone,
can break memhotunplug unless you use the mmu notifier. Every other
advanced feature falls apart.

So again, if an unprivileged syscall allows a very limited number of
pages, maybe it also checks if it got a THP or a gigapage from the
pin and sets its own limit, then maybe again it's not a big
concern.

vmsplice currently with zero privilege allows this:

 2  0 1074432 9589344  13548 132186040 4   172 2052 9997  5  2 93  0  0
-> vmsplice reproducer started here
 1  0 1074432 8538184  13548 132582000 0 0 1973 8838  4  3 93  0  0
 1  0 1074432 8538436  13548 132552400 0 0 1730 8168  4  2 94  0  0
 1  0 1074432 8539096  13556 132188000 072 1811 8640  3  2 95  0  0
 0  0 1074432 8539348  13564 132202800 036 1936 8684  4  2 95  0  0
-> vmsplice killed here
 1  0 1074432 9586720  13564 132224800 0 0 1893 8514  4  2 94  0  0
That's ~1G that goes away for each task and I didn't even check if
it's all THP pages getting in there; the rss is 3MB despite 1G being
taken down in GUP pins with zero privilege:

 1512 pts/25   S  0:00  0 0 133147 3044  0.0 ./vmsplice

Again memcg is robust so it's not a concern for the host, the attack
remains contained in the per-memcg OOM kil

Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending

2021-01-08 Thread Andrea Arcangeli
On Fri, Jan 08, 2021 at 09:39:56AM -0800, Linus Torvalds wrote:
> page_count() is simply the right and efficient thing to do.
> 
> You talk about all these theoretical inefficiencies for cases like
> zygote and page pinning, which have never ever been seen except as a
> possible attack vector.

Do you intend to eventually fix the zygote vmsplice case or not?
Because in current upstream it's not fixed currently using the
enterprise default config.

> Stop talking about irrelevant things. Stop trying to "optimize" things
> that never happen and don't matter.
> 
> Instead, what matters is the *NORMAL* VM flow.
> 
> Things like COW.
> 
> Things like "oh, now that we check just the page count, we don't even
> need the page lock for the common case any more".
> 
> > For the long term, I can't see how using page_count in do_wp_page is a
> > tenable proposition,
> 
> I think you should re-calibrate your expectations, and accept that
> page_count() is the right thing to do, and live with it.
> 
> And instead of worrying about irrelevant special-case code, start

Irrelevant special case as in: long term GUP pin on the memory?

Or irrelevant special case as in: causing a secondary MMU to hit
silent data loss if a pte is ever wrprotected (arch code, vm86,
whatever, all under mmap_write_lock of course)?

> worrying about the code that gets triggered tens of thousands of times
> a second, on regular loads, without anybody doing anything odd or
> special at all, just running plain and normal shell scripts or any
> other normal traditional load.
> 
> Those irrelevant special cases should be simple and work, not badly
> optimized to the point where they are buggy. And they are MUCH LESS
> IMPORTANT than the normal VM code, so if somebody does something odd,
> and it's slow, then that is the problem for the _odd_ case, not for
> the normal codepaths.
> 
> This is why I refuse to add crazy new special cases to core code. Make
> the rules simple and straightforward, and make the code VM work well.

New special cases? Which new cases?

There's nothing new here besides the zygote, which wasn't fully fixed
with 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 and is actually the only
new case I can imagine where page_count actually isn't a regression.

All the old cases that you seem to refer to as irrelevant are in
production in v4.18; I don't see anything new here.

Even for the pure COW case with zero GUP involvement, a hugepage with
COWs happening in different processes would forever hit wp_copy_page,
since the count is always > 1 despite the mapcount being 1 for all
subpages. A simple app doing fork/exec would forever copy all memory
in the parent even after the exec is finished.

Thanks,
Andrea



Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy

2021-01-08 Thread Andrea Arcangeli
On Fri, Jan 08, 2021 at 09:36:49AM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 07, 2021 at 04:45:33PM -0500, Andrea Arcangeli wrote:
> > On Thu, Jan 07, 2021 at 04:25:25PM -0400, Jason Gunthorpe wrote:
> > > On Thu, Jan 07, 2021 at 03:04:00PM -0500, Andrea Arcangeli wrote:
> > > 
> > > > vmsplice syscall API is insecure allowing long term GUP PINs without
  ^
> > > > privilege.
> > > 
> > > Lots of places are relying on pin_user_pages long term pins of memory,
> > > and cannot be converted to notifiers.
> > > 
> > > I don't think it is reasonable to just declare that insecure and
> > > requires privileges, it is a huge ABI break.
> > 
> > Where's that ABI? Are there specs or a code example in kernel besides
> > vmsplice itself?
> 
> If I understand you right, you are trying to say that the 193
> pin_user_pages() callers cannot exist as unpriv any more?

193, 1k, 1m or their number in general won't just make them safe...

> The majority cannot be converted to notifiers because they are DMA
> based. Every one of those is an ABI for something, and does not expect
> extra privilege to function. It would be a major breaking change to
> have pin_user_pages require some cap.

... what makes them safe is to be transient GUP pin and not long
term.

Please note the "long term" in the underlined line.

O_DIRECT is perfectly ok to be unprivileged obviously. The VM can
wait, eventually it goes away.

Even a swapout is not an instant event and can be held off by any
number of other things besides a transient GUP pin. It can be held
off by the PG_lock just to make an example.

mlock however is long term and persistent; vmsplice pins are
persistent too and can pin way too much memory for each mm, which
doesn't feel safe. The more places doing stuff like that, the more
likely one causes a safety issue, not the other way around in fact.

> > The whole zygote issue wouldn't even register if the child had the
> > exact same credentials of the parent. Problem is the child dropped
> > privileges and went with a luser id, that clearly cannot ptrace the
> > parent, and so if long term unprivileged GUP pins are gone from the
> > equation, what remains that the child can do is purely theoretical
> > even before commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f.
> 
> Sorry, I'm not sure I've found a good explanation how ptrace and GUP
> are interacting to become a security problem.

ptrace is not involved. What I meant by mentioning ptrace, is that if
the child can ptrace the parent, then it doesn't matter if it can also
do the below, so the security concern is zero in such case.

With O_DIRECT or any transient pin you will never munmap while
O_DIRECT is in flight, if you munmap it's undefined what happens in
such case anyway.

It is a theoretical security issue made practical by vmsplice API that
allows to enlarge the window to years of time (not guaranteed
milliseconds), to wait for the parent to trigger the
wp_page_reuse. Remove vmsplice and the security issue in theory
remains, but removed vmsplice it becomes irrelevant statistically
speaking in practice. io_uring has similar concern but it can use mmu
notifier, so it can totally fix it and be 100% safe from this.

The scheduler disclosure date was 2020-08-25 so I can freely explain
the case that motivated all these changes.

case A)

if !fork() {
   // in child
   mmap one page
   vmsplice takes gup pin long term on such page
   munmap one page
   // mapcount == 1 (parent mm)
   // page_count == 2 (gup in child, and parent mm)
} else {
   parent writes to the page
   // mapcount == 1, wp_page_reuse
}

parent did a COW with mapcount == 1 so the parent will take over a
page that is still GUP pinned in the child. That's the security issue
because in this case the GUP pin is malicious.

Now imagine this case B)

   mmap one page
   RDMA or any secondary MMU takes a long term GUP pin
   munmap one page
   // mapcount == 1 (parent mm)
   // page_count == 2 (gup in RDMA, and parent mm)

How does the VM can tell between the two different cases? It can't.

The current page_count in do_wp_page treats both cases the same and
because page_count is 2 in both cases, it calls wp_page_copy in both
cases breaking-COW in both cases.
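
To make the difference concrete, this is roughly the decision the two
approaches take in do_wp_page (a simplified sketch, not the literal
upstream code):

/* page_count policy (09854ba94c6a): any extra reference forces a copy.
   A GUP pin, or even a transient speculative pagecache lookup, makes
   this false for both case A) and case B) above. */
static bool can_reuse_page_count(struct page *page)
{
	return page_count(page) == 1;
}

/* mapcount policy (pre-09854ba94c6a): only other ptes force a copy.
   A GUP pin doesn't raise the mapcount, so case B) keeps reusing the
   page and the pin stays coherent. */
static bool can_reuse_mapcount(struct page *page)
{
	return page_mapcount(page) == 1;
}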

However, you know full well in the second case it is a feature and not
a bug, that wp_page_reuse is called instead, and in fact it has to be
called or it's a bug (and that's the bug page_count in do_wp_page
introduces).

So page_count in do_wp_page is breaking all valid users to take care
of a purely theoretical security issue that isn't a practical concern
if only vmsplice is secured at least as well as mlock.

page_count in do_wp_page is fundamentally flawed for all long term GUP
pin done by secondary MMUs attach

Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending

2021-01-08 Thread Andrea Arcangeli
Hello everyone,

On Fri, Jan 08, 2021 at 12:48:16PM +, Will Deacon wrote:
> On Thu, Jan 07, 2021 at 04:25:54PM -0800, Linus Torvalds wrote:
> > Please. Why is the correct patch not the attached one (apart from the
> > obvious fact that I haven't tested it and maybe just screwed up
> > completely - but you get the idea)?
> 
> It certainly looks simple and correct to me, although it means we're now
> taking the mmap sem for write in the case where we only want to clear the
> access flag, which should be fine with the thing only held for read, no?

I'm curious, would you also suggest that fixing just the TLB flushing
symptom is enough and we can forget about the ABI break coming from
page_count used in do_wp_page?

One random example: clear_refs will still break all long term GUP
pins, are you ok with that too?

page_count in do_wp_page is a fix for the original security issue
from vmsplice (where the child fools the parent into taking the
exclusive page in do_wp_page), but it appears worse than the bug
itself.

page_count in do_wp_page, instead of isolating as malicious when the
parent is reusing the page queued in the vmsplice pipe, is treating as
malicious also all legit cases that had to reliably reuse the page to
avoid the secondary MMUs to go out of sync.

page_count in do_wp_page is like a credit card provider blocking all
credit cards of all customers, because one credit card may have been
cloned (by vmsplice), but nobody can know which one it was. Of course
this technique will work perfectly as a security fix because it will
treat all credit card users as malicious and it'll block them all
("block as in preventing re-use of the anon page").

The problem is the other credit card users that weren't malicious
and that get their COW broken too. Those are the very long term GUP
pins, if any anon page can still be wrprotected anywhere in the VM.

At the same time the real low hanging fruit (vmsplice) that, if taken
care of, would have rendered the bug purely theoretical in security
terms, hasn't been fixed yet, despite those unprivileged long term GUP
pins causing more reproducible security issues than just the COW,
since they can still DoS the OOM killer and they bypass at least the
mlock enforcement, even for non compound pages.

Of course just fixing vmsplice to require some privilege won't fix
the bug in full, so it's not a suitable long term solution, but it has
to happen orthogonally for other reasons, and it'd at least remove the
short term security concern.

In addition you're not experiencing the full fallout of the side
effects of page_count used to decide whether to re-use all anon COW
pages, because the bug is still there (with enterprise default config
options at least). Not all credit cards are blocked yet with only
09854ba94c6aad7886996bfbee2530b3d8a7f4f4 applied. Only after you block
them all will you experience all the side effects of replacing the
per-subpage finegrained mapcount with the compound-wide page count.

The two statements above combined result in my recommendation, at
this point, to resolve this in userland by rendering the security
issue theoretical: by removing vmsplice from the OCI schema allowlist,
or by fixing it in userland by always using execve after dropping
privs (as crun always does when it starts the container of course).

For the long term, I can't see how using page_count in do_wp_page is a
tenable proposition, unless we either drop all secondary MMUs from the
kernel or VM features like clear_refs are dropped or unless the
page_count is magically stabilized and the speculative pagecache
lookups are also dropped.

If trying to manage the fallout by enforcing that no anon page can
ever be wrprotected in place (i.e. dropping the clear_refs feature or
rendering it unreliable by skipping elevated counts caused by spurious
pagecache lookups), it still sounds like a design too fragile and too
prone to break to rely on. There's random arch stuff wrprotecting
memory; even vm86 does it under the hood (vm86 is unlikely to have a
long term GUP pin on it of course, but still, who knows?). I mean the
VM core cannot make assumptions like: "this vm86 case can still
wrprotect without worry because probably vm86 isn't used anymore with
any advanced secondary MMU, so if there's a GUP pin it probably is a
malicious vmsplice and not an RDMA or GPU or virtualization secondary
MMU".

Then there's the secondary concern of the inefficiency it introduces
with extra unnecessary copies, when a single GUP pin will prevent
reuse of 512 or 262144 subpages, in the 512 case also potentially
mapped in different processes. The TLB flushing discussion registers
as the last concern in my view.

Thanks,
Andrea



Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending

2021-01-07 Thread Andrea Arcangeli
On Thu, Jan 07, 2021 at 02:51:24PM -0800, Linus Torvalds wrote:
> Ho humm. I had obviously not looked very much at that code. I had done
> a quick git grep, but now that I look closer, it *does* get the
> mmap_sem for writing, but only for that VM_SOFTDIRTY bit clearing, and
> then it does a mmap_write_downgrade().
> 
> So that's why I had looked more at the UFFD code, because that one was
> the one I was aware of doing this all with just the read lock. I
> _thought_ the softdirty code already took the write lock and wouldn't
> race with page faults.
> 
> But I had missed that write_downgrade. So yeah, this code has the same issue.

I overlooked the same thing initially. It's only when I noticed it
also used mmap_read_lock that I figured the group lock thingy, the
uffd-wp ad-hoc solution, despite being fully self contained thanks to
the handle_userfault() catcher for the uffd-wp bit in the pagetable,
wasn't worth it, since uffd-wp could always use whatever clear_refs
used to solve it.

> Anyway, the fix is - I think - the same I outlined earlier when I was
> talking about UFFD: take the thing for writing, so that you can't
> race.

Sure.

> The alternate fix remains "make sure we always flush the TLB before
> releasing the page table lock, and make COW do the copy under the page
> table lock". But I really would prefer to just have this code follow

The copy under PT lock isn't enough.

Flush TLB before releasing is enough of course.

Note also the patch in 2/2 patch that I sent:

 https://lkml.kernel.org/r/20210107200402.31095-3-aarca...@redhat.com

2/2 would have been my preferred solution for both and it works
fine. All the corruption that was trivially reproducible with a heavy
selftest program in the kernel is gone.

If the pending TLB flush issue were the only regression page_count in
do_wp_page introduced, I would have never suggested we should
re-evaluate it. It'd be a good tradeoff in such a case, even if it'd
penalize the soft-dirty runtime, especially if we were allowed to
deploy 2/2 as a non-blocking solution.

Until yesterday I fully intended to just propose 1/2 and 2/2, with a
whole different cover letter, CC stable and close this case.

> all the usual rules, and if it does a write protect, then it should
> take the mmap_sem for writing.

The problem isn't about performance anymore; the problem is a silent
ABI break to long term PIN users attached to an mm under clear_refs.

> Why is that very simple rule so bad?
> 
> (And see my unrelated but incidental note on it being a good idea to
> try to minimize latency by making surfe we don't do any IO under the
> mmap lock - whether held for reading _or_ writing. Because I do think
> we can improve in that area, if you have some good test-case).

That would be great indeed.

Thanks,
Andrea



Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending

2021-01-07 Thread Andrea Arcangeli
On Thu, Jan 07, 2021 at 02:42:17PM -0800, Linus Torvalds wrote:
> On Thu, Jan 7, 2021 at 2:31 PM Andrea Arcangeli  wrote:
> >
> > Random memory corruption will still silently materialize as result of
> > the speculative lookups in the above scenario.
> 
> Explain.
> 
> Yes, you'll get random memory corruption if you keep doing wrprotect()
> without mmap_sem held for writing.

I didn't meant that.

> But I thought we agreed earlier that that is a bug. And I thought the
> softdirty code already got it for writing.

softdirty used mmap_read_lock too but this again isn't relevant here
and for the sake of discussion we can safely assume mmap_read_lock
doesn't exist in the kernel, and everything takes the mmap_write_lock
whenever a mmap_lock is taken at all.

I mean, something bad will happen if a write happens but soft dirty
cannot register it because we didn't wrprotect the pte? Some dirty
page won't be transferred to the destination and it will be assumed
there was no soft dirty event for such a page? Otherwise it would
mean that wrprotecting is simply optional for all pages under
clear_refs?

Not doing the final TLB flush in softdirty caused some issue even when
there was no COW and the deferred flush only would delay the wrprotect
fault:

   
https://lore.kernel.org/linux-mm/CA+32v5zzFYJQ7eHfJP-2OHeR+6p5PZsX=rdjnu6vgf3hlo+...@mail.gmail.com/
   https://lore.kernel.org/linux-mm/20210105221628.GA12854@willie-the-truck/

Skipping the wrprotection of the pte because of a speculative
pagecache lookup elevating a random page_count would, I guessed,
behave from the userland point of view like missing the final TLB
flush before clear_refs returns to userland, just worse.

Thanks,
Andrea



Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy

2021-01-07 Thread Andrea Arcangeli
On Thu, Jan 07, 2021 at 02:17:50PM -0800, Linus Torvalds wrote:
> So I think we can agree that even that softdirty case we can just
> handle by "don't do that then".

Absolutely. The question is if somebody was happily running clear_refs
with a RDMA attached to the process, by the time they update and
reboot they'll find it the hard way with silent mm corruption
currently.

So I was obliged to report this issue and the fact there was a very
strong reason why page_count was not used there; it's even documented
explicitly in the source:

 * [..] however we only use
 * page_trans_huge_mapcount() in the copy-on-write faults where we
 * need full accuracy to avoid breaking page pinning, [..]

I didn't entirely forget the comment when I reiterated it in fact also
in Message-ID: <20200527212005.gc31...@redhat.com> on May 27 2020
since I recalled there was a very explicit reason why we had to use
page_mapcount in do_wp_page and deliver full accuracy.

Now I cannot prove there's any such user that will break, but we'll
find those with a year or more of delay.

Even the tlb flushing deferral that caused clear_refs_write to also
corrupt userland memory and literally lose userland writes even
without any special secondary MMU hardware being attached to the
memory, took 6 months to find.

> if a page is pinned, the dirty bit of it makes no sense, because it
> might be dirtied complately asynchronously by the pinner.
>
> So I think none of the softdirty stuff should touch pinned pages. I
> think it falls solidly under that "don't do it then".
> 
> Just skipping over them in clear_soft_dirty[_pmd]() doesn't look that
> hard, does it?

1) How do you know again if it's not speculative lookup or an O_DIRECT
   that made them look pinned?

2) if it's a hugepage, a 4k pin will make soft dirty then skip 511
   dirty bits by mistake, including those of pages that weren't pinned
   and that userland is still writing through the transhuge pmd.

   Until v4.x we had a page pin counter for each subpage so the above
   wouldn't have happened, but not anymore since it was simplified and
   improved; now the page_count is pretty inefficient there, even
   if it'd be possible to make it safe.

3) the GUP(write=0) may be just reading from RAM and sending to the
   other end, so clear_refs may currently be tracking CPU writes just
   fine, with no interference whatsoever from the secondary MMU
   working readonly on the memory.

Thanks,
Andrea



Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending

2021-01-07 Thread Andrea Arcangeli
On Thu, Jan 07, 2021 at 01:29:43PM -0800, Linus Torvalds wrote:
> On Thu, Jan 7, 2021 at 12:59 PM Andrea Arcangeli  wrote:
> >
> > The problem is it's not even possible to detect reliably if there's
> > really a long term GUP pin because of speculative pagecache lookups.
> 
> So none of the normal code _needs_ that any more these days, which is
> what I think is so nice. Any pinning will do the COW, and then we have
> the logic to make sure it stays writable, and that keeps everything
> nicely coherent and is all fairly simple.
> 
> And yes, it does mean that if somebody then explicitly write-protects
> a page, it may end up being COW'ed after all, but if you first pinned
> it, and then started playing with the protections of that page, why
> should you be surprised?
> 
> So to me, this sounds like a "don't do that then" situation.
> 
> Anybody who does page pinning and wants coherency should NOT TOUCH THE
> MAPPING IT PINNED.
> 
> (And if you do touch it, it's your own fault, and you get to keep both
> of the broken pieces)
> 
> Now, I do agree that from a QoI standpoint, it would be really lovely
> if we actually enforced it. I'm not entirely sure we can, but maybe it
> would be reasonable to use that
> 
>   mm->has_pinned && page_maybe_dma_pinned(page)
> 
> at least as the beginning of a heuristic.
> 
> In fact, I do think that "page_maybe_dma_pinned()" could possibly be
> made stronger than it is. Because at *THAT* point, we might say "we
> know a pinned page always must have a page_mapcount() of 1" - since as
> part of pinning it and doing the GUP_PIN, we forced the COW, and then
> subsequent fork() operations enforce it too.
> 
> So I do think that it might be possible to make that clear_refs code
> notice "this page is pinned, I can't mark it WP without the pinning
> coherency breaking".
> 
> It might not even be hard. But admittedly I'm somewhat handwaving
> here, and I might not have thought of some situation.

I suppose the objective would be to detect when it's a transient pin
(as an O_DIRECT write) and fail clear_refs with an error for all other
cases of real long term pins that need to keep reading at full PCI
bandwidth, without extra GUP invocations after the wp_copy_page run.

Because the speculative lookups are making the count unstable, it's
not even enough to use mmu notifier and never use FOLL_GET in GUP to
make it safe again (unlike what I mentioned in a previous email).

Random memory corruption will still silently materialize as result of
the speculative lookups in the above scenario.

My whole point here in starting this new thread to suggest page_count
in do_wp_page is an untenable solution, is that such a commit silently
broke every single long term PIN user that may be used in combination
with clear_refs, since 2013.

Silent memory corruption left undetected, or a detectable error out
of clear_refs, are both different side effects of the same technical
ABI break that rendered long term GUP pins fundamentally incompatible
with clear_refs; detected or not, it is still an ABI break.

I felt obliged to report there's something much deeper and
fundamentally incompatible between the page_count in do_wp_page and
any wrprotection of exclusive memory in place, as when used in
combination with any RDMA for example. The TLB flushing and the
mmap_read/write_lock were just the tip of the iceberg and they're not
the main concern anymore.

In addition, the inefficiency caused by the fact that the page_count
effect is multiplied by 512 or 262144 times while the mapcount remains
4k granular makes me think page_count is unsuitable to be used there,
and this specific cure with page_count in do_wp_page looks worse than
the initial zygote disease.

Thanks,
Andrea



Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy

2021-01-07 Thread Andrea Arcangeli
On Thu, Jan 07, 2021 at 01:05:19PM -0800, Linus Torvalds wrote:
> I think those would very much be worth fixing, so that if
> UFFDIO_WRITEPROTECT taking the mmapo_sem for writing causes problems,
> we can _fix_ those problems.
> 
> But I think it's entirely wrong to treat UFFDIO_WRITEPROTECT as
> specially as Andrea seems to want to treat it. Particularly with
> absolutely zero use cases to back it up.

Again for the record: there's nothing at all special in
UFFDIO_WRITEPROTECT in this respect.

If you could stop mentioning UFFDIO_WRITEPROTECT and only focus on
softdirty/clear_refs, maybe you wouldn't think my judgment is biased
towards clear_refs/softdirty too.

You can imagine the side effects of page_count doing a COW
erroneously, as a corollary of the fact that KSM won't ever allow
merging two pages if one of them is under a GUP pin. Why is that?
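
(For reference, the KSM check I'm alluding to is along these lines,
quoted from memory from mm/ksm.c write_protect_page() and simplified
rather than verbatim: if anything besides the ptes, e.g. a GUP pin or
a transient O_DIRECT, holds a reference, the page is not wrprotected
and not merged.)

	if (page_mapcount(page) + 1 + swapped != page_count(page)) {
		/* possibly pinned: don't wrprotect, don't merge */
		goto out_unlock;
	}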

Thanks,
Andrea



Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy

2021-01-07 Thread Andrea Arcangeli
On Thu, Jan 07, 2021 at 12:32:09PM -0800, Linus Torvalds wrote:
> I think Andrea is blinded by his own love for UFFDIO: when I do a
> debian codesearch for UFFDIO_WRITEPROTECT, all it finds is the kernel
> and strace (and the qemu copies of the kernel headers).

For the record, I feel obliged to reiterate I'm thinking purely in
clear_refs terms here.

It'd be great if we can only focus on clear_refs_write and nothing
else.

Like I said earlier, whatever way clear_refs/softdirty copes with
do_wp_page, uffd-wp can do the identical thing, so uffd-wp is
effectively irrelevant in this whole discussion. Clear-refs/softdirty
predates uffd-wp by several years too.

Thanks,
Andrea



Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy

2021-01-07 Thread Andrea Arcangeli
On Thu, Jan 07, 2021 at 04:25:25PM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 07, 2021 at 03:04:00PM -0500, Andrea Arcangeli wrote:
> 
> > vmsplice syscall API is insecure allowing long term GUP PINs without
> > privilege.
> 
> Lots of places are relying on pin_user_pages long term pins of memory,
> and cannot be converted to notifiers.
> 
> I don't think it is reasonable to just declare that insecure and
> requires privileges, it is a huge ABI break.

Where's that ABI? Are there specs or a code example in kernel besides
vmsplice itself?

I don't see how it's possible to consider long term GUP pins
completely unprivileged if not using mmu notifier. vmsplice doesn't
even account them in rlimit (it cannot because it cannot identify all
put_pages either).

Long term GUP pins not using mmu notifier and not accounted in rlimit
are an order of magnitude more VM-intrusive than mlock.

The reason it's worse than mlock, even ignoring all the performance
features they break (including NUMA bindings) that mlock doesn't risk
breaking, comes from the fact that you can unmap the memory after
taking those rlimit-unaccounted GUP pins. So the OOM killer won't even
have a chance to see the GUP pins coming.

So it can't be that mlock has to be privileged while unconstrained,
unaccounted long term GUP pins as in vmsplice are ok to stay
unprivileged.

Now io_uring does account the GUP pins in the mlock rlimit, but after
the vma is unmapped it'd still cause the same confusion to the OOM
killer, and in addition the assumption that each GUP pin costs 4k is
also flawed. However the io_uring model can use the mmu notifier
without slowdown to the fast paths, so it's not going to cause any ABI
break to fix it.

Or to see it another way, it'd be fine to declare all mlock rlimits
are obsolete and memcg is the only way to constrain RAM usage, but
then mlock should stop being privileged, because mlock is a lesser
concern and it won't risk confusing the OOM killer at least.

The good thing is the GUP pins won't escape memcg accounting but that
accounting also doesn't come entirely free.

> FWIW, vhost tries to use notifiers as a replacement for GUP, and I
> think it ended up quite strange and complicated. It is hard to
> maintain performance when every access to the pages needs to hold some
> protection against parallel invalidation.

And that's fine, this is all about whether it should require a
one-liner change to add the username to the realtime group in
/etc/group or not.

You're focusing on your use case, but we have to put things in the
perspective of why all these changes started.

The whole zygote issue wouldn't even register if the child had the
exact same credentials as the parent. The problem is that the child
dropped privileges and went with an unprivileged uid, which clearly
cannot ptrace the parent, and so if long term unprivileged GUP pins are
gone from the equation, what remains that the child can do is purely
theoretical even before commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f.

NOTE: I'm all for fixing the COW for good, but vmsplice, or any long
term GUP pin that is absolutely required to make such an attack
practical, looks like the real low hanging fruit to fix here.

However fixing it so clear_refs becomes fundamentally incompatible
with mmu notifier users unless they all convert to pure !FOLL_GET
GUPs, let alone long term GUP pins not using mmu notifier, doesn't
look great. For vmsplice that new break-COW is the fix because it
happens in the other process.

For every legit long term GUP, where the break-COW happens in the
single and only process, it's silent MM corruption.

Thanks,
Andrea



Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending

2021-01-07 Thread Andrea Arcangeli
Hi Linus,

On Thu, Jan 07, 2021 at 12:17:40PM -0800, Linus Torvalds wrote:
> On Thu, Jan 7, 2021 at 12:04 PM Andrea Arcangeli  wrote:
> >
> > However there are two cases that could wrprotecting exclusive anon
> > pages with only the mmap_read_lock:
> 
> I still think the real fix is "Don't do that then", and just take the
> write lock.
> 
> The UFFDIO_WRITEPROTECT case simply isn't that critical. It's not a
> normal operation. Same goes for softdirty.
> 
> Why have those become _so_ magical that they can break the VM for
> everybody else?

I see what you mean above and I agree. Like said below:

==
In simple terms: the page_count check in do_wp_page makes it
impossible to wrprotect memory, if such memory is under a !FOLL_WRITE
GUP pin.
==

So to simplify let's ignore UFFDIO_WRITEPROTECT here, which is new and
adds no dependency on top of clear_refs in this respect.

So yes, if we drop any code that has to wrprotect memory in place in
the kernel (since all userland memory can be under GUP pin in read
mode) and we make such an operation illegal, then it's fine, but that
means clear_refs has to go or it has to fail if there's a GUP pin
during the wrprotection.

The problem is it's not even possible to detect reliably if there's
really a long term GUP pin because of speculative pagecache lookups.

We would need to declare that any secondary MMU which is supposed to
be VM-neutral using mmu notifier, like a GPU or a RDMA device, cannot
be used in combination with clear_refs, and it would need to be
enforced fully in userland. Most mmu notifier users drop the GUP pin
during the invalidate for extra safety in case an invalidate goes
missing: they would all need to drop FOLL_GET to be compliant and stop
causing memory corruption if clear_refs shall still be allowed to
happen on mmu notifier capable secondary MMUs. Even then, how does
userland know which devices attach to the memory with mmu notifier and
never use FOLL_GET and which don't? It doesn't sound reliable to
enforce this in userland. So I don't see how clear_refs can be saved.
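
(To illustrate what "compliant" would mean here: a secondary MMU
driver that never relies on FOLL_GET across an invalidation would look
roughly like the sketch below. Only the notifier hook signature is
real; the driver type and helpers are made up for the example:)

	static int drv_invalidate_range_start(struct mmu_notifier *mn,
					      const struct mmu_notifier_range *range)
	{
		struct drv_mirror *mirror = container_of(mn, struct drv_mirror, mn);

		/* quiesce DMA on the range, then drop any transient pins */
		drv_stop_dma(mirror, range->start, range->end);
		drv_unpin_range(mirror, range->start, range->end); /* unpin_user_page() */
		return 0;
	}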

Now let's make another example that still shows at least some
fundamental inefficiency that has nothing to do with clear_refs.

Let's suppose a GUP pin is taken on a subpageA by a RDMA device by
process A (parent). Let's now assume subpageB is mapped in process B
(child of process A). Both subpageA and subpageB are exclusive
(mapcount == 1) and read-write but they share the same page_count
atomic counter (only the page_mapcounts are subpage granular). To
still tame the zygote concern with the page_count in do_wp_page,
process B, when it forks a child (process C), would forever have to do
an extra superfluous COW even after process C exits. Is that what we
want on top of the fundamental unsafety added to clear_refs?

Thanks,
Andrea



[PATCH 0/2] page_count can't be used to decide when wp_page_copy

2021-01-07 Thread Andrea Arcangeli
main. Only RAM constrained embedded
devices are justified to take shortcuts with the ensuing side channel
security concern that emerges from it.

For all the above reasons, because so far the cure has been worse than
the disease itself, I'd recommend enterprise distro kernels to ignore
the zygote embedded model attack for the time being and not to
backport anything in this regard. What should not be backported,
specifically starts in 17839856fd588f4ab6b789f482ed3ffd7c403e1f.

17839856fd588f4ab6b789f482ed3ffd7c403e1f was supposed to be fine and
not break anything and unfortunately I was too busy while
17839856fd588f4ab6b789f482ed3ffd7c403e1f morphed into
09854ba94c6aad7886996bfbee2530b3d8a7f4f4, so I still miss a whole huge
discussion in that transition. I don't know what was fundamentally
flawed in 17839856fd588f4ab6b789f482ed3ffd7c403e1f yet.

All I comment about here is purely the current end result:
09854ba94c6aad7886996bfbee2530b3d8a7f4f4 not any of the intermediate
steps that brought us there.

I already mentioned the page_count wasn't workable in do_wp_page in
Message-ID: <20200527212005.gc31...@redhat.com> on May 27 2020, quote:

"I don't see how we can check the page_count to decide if to COW or not
in the wrprotect fault, given [..] introduced in >="

Then Hugh said he wasn't happy about it on 1 Sep 2020 in Message-ID:
.

In https://lkml.kernel.org/r/x+o49hrck1fbd...@redhat.com I suggested
"I hope we can find a way put the page_mapcount back where there" and
now I have to double down and suggest that the page_count is
fundamentally unsafe in do_wp_page.
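
(For readers without both trees in front of them, the difference under
discussion is roughly the following, heavily simplified and from
memory, so don't take it as an exact quote of either tree:)

	/* do_wp_page() reuse decision before 09854ba94c6a (simplified): */
	if (reuse_swap_page(page, NULL))	/* ~ "is the mapcount 1?" */
		wp_page_reuse(vmf);		/* reuse in place, no copy */

	/* after 09854ba94c6a (simplified): */
	if (page_count(page) == 1)		/* any extra reference forces a copy */
		wp_page_reuse(vmf);
	else
		wp_page_copy(vmf);		/* breaks COW even against read-only
						   GUP pins / speculative refs */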

I see how the page_count would also fix the specific zygote
child->parent attack and I kept an open mind hoping it would just
solve all problems magically. So I tried to fix even clear_refs to
cope with it, but this is only the tip of the iceberg of what really
breaks.

So in short I contextually self-NAK 2/2 of this patchset and we need
to somehow reverse 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 instead.

Thanks,
Andrea

Andrea Arcangeli (1):
  mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending

Will Deacon (1):
  mm: proc: Invalidate TLB after clearing soft-dirty page state

 fs/proc/task_mmu.c   | 26 ---
 include/linux/mm.h   | 46 +++
 include/linux/mm_types.h |  5 +++
 kernel/fork.c|  1 +
 mm/memory.c  | 69 
 mm/mprotect.c|  4 +++
 6 files changed, 146 insertions(+), 5 deletions(-)



[PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending

2021-01-07 Thread Andrea Arcangeli
Reported-by: Nadav Amit 
Suggested-by: Yu Zhao 
Signed-off-by: Andrea Arcangeli 
---
 fs/proc/task_mmu.c   | 17 +-
 include/linux/mm.h   | 46 +++
 include/linux/mm_types.h |  5 +++
 kernel/fork.c|  1 +
 mm/memory.c  | 69 
 mm/mprotect.c|  4 +++
 6 files changed, 141 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index a127262ba517..e75cb135db02 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1235,8 +1235,20 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
}
if (type == CLEAR_REFS_SOFT_DIRTY) {
for (vma = mm->mmap; vma; vma = vma->vm_next) {
-   if (!(vma->vm_flags & VM_SOFTDIRTY))
+   struct vm_area_struct *tmp;
+   if (!(vma->vm_flags & VM_SOFTDIRTY)) {
+   inc_wrprotect_tlb_flush_pending(vma);
continue;
+   }
+
+   /*
+* Rollback wrprotect_tlb_flush_pending before
+* releasing the mmap_lock.
+*/
+   tmp = vma;
+   for (vma = mm->mmap; vma != tmp;
+vma = vma->vm_next)
+   dec_wrprotect_tlb_flush_pending(vma);
+
mmap_read_unlock(mm);
if (mmap_write_lock_killable(mm)) {
count = -EINTR;
@@ -1245,6 +1257,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
for (vma = mm->mmap; vma; vma = vma->vm_next) {
vma->vm_flags &= ~VM_SOFTDIRTY;
vma_set_page_prot(vma);
+   inc_wrprotect_tlb_flush_pending(vma);
}
mmap_write_downgrade(mm);
break;
@@ -1260,6 +1273,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
if (type == CLEAR_REFS_SOFT_DIRTY) {
mmu_notifier_invalidate_range_end();
flush_tlb_mm(mm);
+   for (vma = mm->mmap; vma; vma = vma->vm_next)
+   dec_wrprotect_tlb_flush_pending(vma);
dec_tlb_flush_pending(mm);
}
mmap_read_unlock(mm);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..caa1d9a71cb2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3177,5 +3177,51 @@ unsigned long wp_shared_mapping_range(struct 
address_space *mapping,
 
 extern int sysctl_nr_trim_pages;
 
+/*
+ * NOTE: the mmap_lock must be held and it cannot be released at any
+ * time in between inc_wrprotect_tlb_flush_pending() and
+ * dec_wrprotect_tlb_flush_pending().
+ *
+ * This counter has to be elevated before taking the PT-lock to
+ * wrprotect pagetables, if the TLB isn't flushed before the
+ * PT-unlock.
+ *
+ * The only reader is the page fault so this has to be elevated (in
+ * addition to the mm->tlb_flush_pending) only when the mmap_read_lock
+ * is taken instead of the mmap_write_lock (otherwise the page fault
+ * couldn't run concurrently in the first place).
+ *
+ * This doesn't need to be elevated when clearing pagetables even if
+ * only holding the mmap_read_lock (as in MADV_DONTNEED). The page
+ * fault doesn't risk accessing the data of the page that is still
+ * under tlb-gather deferred flushing, if the pagetable is none,
+ * because the pagetable doesn't point to it anymore.
+ *
+ * This counter is read more specifically by the page fault when it
+ * has to issue a COW that doesn't result in page re-use because of
+ * the lack of stability of the page_count (vs speculative pagecache
+ * lookups) or because a GUP pin exists on an exclusive and writable
+ * anon page.
+ *
+ * If this counter is elevated and the pagetable is wrprotected (as
+ * in !pte/pmd_write) and present, it means the page may be still
+ * modified by userland through a stale TLB entry that was
+ * instantiated before the wrprotection. In such a case the COW fault,
+ * if it decides not to re-use the page, will have to either wait for
+ * this counter to return zero, or flush the TLB before proceeding to
+ * copy the page.
+ */
+static inline void inc_wrprotect_tlb_flush_pending(struct vm_area_struct *vma)
+{
+   atomic_inc(&vma->wrprotect_tlb_flush_pending);
+   VM_WARN_ON_ONCE(atomic_read(&vma->wrprotect_tlb_flush_pending) <= 0);
+}
+
+static inline void dec_w

[PATCH 1/2] mm: proc: Invalidate TLB after clearing soft-dirty page state

2021-01-07 Thread Andrea Arcangeli
From: Will Deacon 

Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double
flush"), TLB invalidation is elided in tlb_finish_mmu() if no entries
were batched via the tlb_remove_*() functions. Consequently, the
page-table modifications performed by clear_refs_write() in response to
a write to /proc//clear_refs do not perform TLB invalidation.
Although this is fine when simply aging the ptes, in the case of
clearing the "soft-dirty" state we can end up with entries where
pte_write() is false, yet a writable mapping remains in the TLB.

Fix this by avoiding the mmu_gather API altogether: managing both the
'tlb_flush_pending' flag on the 'mm_struct' and explicit TLB
invalidation for the soft-dirty path, much like mprotect() does already.

Fixes: 0758cd830494 ("asm-generic/tlb: avoid potential double flush")
Signed-off-by: Will Deacon 
Signed-off-by: Andrea Arcangeli 
---
 fs/proc/task_mmu.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ee5a235b3056..a127262ba517 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1189,7 +1189,6 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
struct mm_struct *mm;
struct vm_area_struct *vma;
enum clear_refs_types type;
-   struct mmu_gather tlb;
int itype;
int rv;
 
@@ -1234,7 +1233,6 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
count = -EINTR;
goto out_mm;
}
-   tlb_gather_mmu(&tlb, mm, 0, -1);
if (type == CLEAR_REFS_SOFT_DIRTY) {
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (!(vma->vm_flags & VM_SOFTDIRTY))
@@ -1252,15 +1250,18 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
break;
}
 
+   inc_tlb_flush_pending(mm);
mmu_notifier_range_init(&range, MMU_NOTIFY_SOFT_DIRTY,
0, NULL, mm, 0, -1UL);
mmu_notifier_invalidate_range_start(&range);
}
walk_page_range(mm, 0, mm->highest_vm_end, &clear_refs_walk_ops,
&cp);
-   if (type == CLEAR_REFS_SOFT_DIRTY)
+   if (type == CLEAR_REFS_SOFT_DIRTY) {
mmu_notifier_invalidate_range_end(&range);
-   tlb_finish_mmu(&tlb, 0, -1);
+   flush_tlb_mm(mm);
+   dec_tlb_flush_pending(mm);
+   }
mmap_read_unlock(mm);
 out_mm:
mmput(mm);



Re: [PATCH] mm/mmap: replace if (cond) BUG() with BUG_ON()

2021-01-07 Thread Andrea Arcangeli
On Thu, Jan 07, 2021 at 06:28:29PM +0100, Vlastimil Babka wrote:
> On 1/6/21 9:18 PM, Hugh Dickins wrote:
> > On Wed, 6 Jan 2021, Andrea Arcangeli wrote:
> >> 
> >> I'd be surprised if the kernel can boot with BUG_ON() defined as "do
> >> {}while(0)" so I guess it doesn't make any difference.
> > 
> > I had been afraid of that too, when CONFIG_BUG is not set:
> > but I think it's actually "if (cond) do {} while (0)".
> 
> It's a maze of configs and arch-specific vs generic headers, but I do see this
> in include/asm-generic/bug.h:
> 
> #else /* !CONFIG_BUG */
> #ifndef HAVE_ARCH_BUG
> #define BUG() do {} while (1)
> #endif
> 
> So seems to me there *are* configurations possible where side-effects are 
> indeed
> thrown away, right?

But this is not BUG_ON, and that is an infinite loop, while(1), not an
optimization away as in while(0): I was suggesting to just throw away
cond and make it a noop. BUG() is actually the thing to use when
moving functional stuff out of BUG_ON, so it's not going to cause
issues if it just loops.

This overall feels mostly like an aesthetic issue.

Thanks,
Andrea



Re: [PATCH] mm/mmap: replace if (cond) BUG() with BUG_ON()

2021-01-06 Thread Andrea Arcangeli
Hello,

On Wed, Jan 06, 2021 at 11:46:20AM -0800, Andrew Morton wrote:
> On Tue, 5 Jan 2021 20:28:27 -0800 (PST) Hugh Dickins  wrote:
> 
> > Alex, please consider why the authors of these lines (whom you
> > did not Cc) chose to write them without BUG_ON(): it has always
> > been preferred practice to use BUG_ON() on predicates, but not on
> > functionally effective statements (sorry, I've forgotten the proper
> > term: I'd say statements with side-effects, but here they are not
> > just side-effects: they are their main purpose).
> > 
> > We prefer not to hide those away inside BUG macros
> 
> Should we change that?  I find BUG_ON(something_which_shouldnt_fail())
> to be quite natural and readable.
> 
> As are things like the existing
> 
> BUG_ON(mmap_read_trylock(mm));
> BUG_ON(wb_domain_init(&global_wb_domain, GFP_KERNEL));
> 
> etc.
> 
> 
> No strong opinion here, but is current mostly-practice really
> useful?

I'd be surprised if the kernel can boot with BUG_ON() defined as "do
{}while(0)" so I guess it doesn't make any difference.

I've no strong opinion either, but personally my views match Hugh's
views on this. I certainly tried to stick to that in the past since I
find it cleaner if a bugcheck just "checks" and can be deleted at any
time without sudden breakage.

Said that I also guess we're in the minority.

Thanks,
Andrea



Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup

2021-01-05 Thread Andrea Arcangeli
On Tue, Jan 05, 2021 at 10:16:29PM +, Will Deacon wrote:
> On Tue, Jan 05, 2021 at 09:22:51PM +, Nadav Amit wrote:
> > > On Jan 5, 2021, at 12:39 PM, Andrea Arcangeli  wrote:
> > > 
> > > On Tue, Jan 05, 2021 at 07:26:43PM +, Nadav Amit wrote:
> > >>> On Jan 5, 2021, at 10:20 AM, Andrea Arcangeli  
> > >>> wrote:
> > >>> 
> > >>> On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote:
> > >>>> Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes 
> > >>>> tracking")
> > >>> 
> > >>> Targeting a backport down to 2013 when nothing could wrong in practice
> > >>> with page_mapcount sounds backwards and unnecessarily risky.
> > >>> 
> > >>> In theory it was already broken and in theory
> > >>> 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 is absolutely perfect and the
> > >>> previous code of 2013 is completely wrong, but in practice the code
> > >>> from 2013 worked perfectly until Aug 21 2020.
> > >> 
> > >> Well… If you consider the bug that Will recently fixed [1], then 
> > >> soft-dirty
> > >> was broken (for a different, yet related reason) since 0758cd830494
> > >> ("asm-generic/tlb: avoid potential double flush”).
> > >> 
> > >> This is not to say that I argue that the patch should be backported to 
> > >> 2013,
> > >> just to say that memory corruption bugs can be unnoticed.
> > >> 
> > >> [1] 
> > >> https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-w...@kernel.org/
> > > 
> > > Is this a fix or a cleanup?
> > > 
> > > The above is precisely what I said earlier that tlb_gather had no
> > > reason to stay in clear_refs and it had to use inc_tlb_flush_pending
> > > as mprotect, but it's not a fix? Is it? I suggested it as a pure
> > > cleanup. So again no backport required. The commit says fix this but
> > > it means "clean this up".
> > 
> > It is actually a fix. I think the commit log is not entirely correct and
> > should include:
> > 
> >   Fixes: 0758cd830494 ("asm-generic/tlb: avoid potential double flush”).

Agreed.

> > 
> > Since 0758cd830494, calling tlb_finish_mmu() without any previous call to
> > pte_free_tlb() and friends does not flush the TLB. The soft-dirty bug
> > producer that I sent fails without this patch of Will.
> 
> Yes, it's a fix, but I didn't rush it for 5.10 because I don't think rushing
> this sort of thing does anybody any favours. I agree that the commit log
> should be updated; I mentioned this report in the cover letter:
> 
> https://lore.kernel.org/linux-mm/CA+32v5zzFYJQ7eHfJP-2OHeR+6p5PZsX=rdjnu6vgf3hlo+...@mail.gmail.com/
> 
> demonstrating that somebody has independently stumbled over the missing TLB
> invalidation in userspace, but it's not as bad as the other issues we've been
> discussing in this thread.

Thanks for the explanation Nadav and Will.

The fact the code was a 100% match to the cleanup I independently
suggested a few weeks ago to reduce the confusion in clear_refs, made
me overlook the difference 0758cd830494 made. I didn't realize the
flush got optimized away if no gathering happened.

Backporting this sort of thing with Fixes I guess tends to give the
same kind of favors as rushing it for 5.10, but then in general the
Fixes is accurate here.

Overall it looks obviously safe cleanup and it is also a fix starting
in v5.6, so I don't think this can cause more issues than what it
sure fixes at least.

The cleanup was needed anyway, even before it become a fix, since if
it was mandatory to use tlb_gather when you purely need
inc_tlb_flush_pending, then mprotect couldn't get away with it.

I guess the the optimization in 0758cd830494 just made it more
explicit that no code should use tlb_gather if it doesn't need to
gather any page. Maybe adding some commentary in the comment on top of
tlb_gather_mmu about the new behavior wouldn't hurt.



Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup

2021-01-05 Thread Andrea Arcangeli
On Tue, Jan 05, 2021 at 09:22:51PM +, Nadav Amit wrote:
> It is also about performance due to unwarranted TLB flushes.

If there turns out to be a problem, switching to the
wait_flush_pending() model suggested by Peter may not even require
changes to the common code in memory.c, since I'm thinking it may not
even need a failure path if we plug it in the same place as the tlb
flush.

So instead of the flush we could always block there until we read zero
in the atomic, then smp_rmb() and we're ready to start the copy.

So either we flush IPI if we didn't read zero, or we block until we
read zero, the difference is going to be hidden to do_wp_page. All
do_wp_page cares about is that by the time the abstract call returns,
there's no stale TLB left for such pte. If it is achieved by blocking
and waiting or flushing the TLB it doesn't matter too much.
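
(In pseudo code the abstraction I have in mind would be along these
lines; the names are invented and the counter is the per-vma one
discussed elsewhere in the thread, so take it purely as a sketch of
where it would plug in:)

	/* called in the COW fault path right before cow_user_page()
	 * reads the old page */
	static void cow_sync_against_pending_wrprotect(struct vm_area_struct *vma,
						       unsigned long address)
	{
		/* no deferred wrprotect flush in flight: nothing to do */
		if (!atomic_read(&vma->wrprotect_tlb_flush_pending))
			return;

		/*
		 * Either flush the potentially stale TLB entry ourselves,
		 * or block until the wrprotector completes its deferred
		 * flush; do_wp_page only cares that no stale writable TLB
		 * entry survives past this point.
		 */
		flush_tlb_page(vma, address);
	}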

So, thinking of how badly the IPI will do: with the improved arm64 tlb
flushing code in production we keep track of how many simultaneous mm
contexts there are, specifically to skip the SMP-unscalable TLBI
broadcast on arm64, like we already avoid IPIs on lazy tlbs on x86 (see
x86 tlb_is_not_lazy in native_flush_tlb_others). In other words the
IPI will materialize only if there's more than one thread running
while clear_refs runs. Lazy tlbs won't get IPIs on either x86 upstream
or arm64 enterprise.
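
(The x86 filter I'm referring to is roughly the below, quoted from
memory from arch/x86/mm/tlb.c; only CPUs not in lazy tlb mode for the
mm are targeted by the flush IPI:)

	static bool tlb_is_not_lazy(int cpu, void *data)
	{
		return !per_cpu(cpu_tlbstate.is_lazy, cpu);
	}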

This won't help multithreaded processes that compute from all CPUs at
all times but even multiple vcpu threads aren't always guaranteed to
be running at all times.

My main concern would be an IPI flood that slows down clear_refs and
UFFDIO_WRITEPROTECT, but an incremental optimization (not required for
correctness) is to have UFFDIO_WRITEPROTECT and clear_refs switch to
lazy tlb mode before they call inc_tlb_flush_pending() and unlazy only
after dec_tlb_flush_pending. So it's possible to at least guarantee
the IPI won't slow them down.

> In addition, as I stated before, having some clean interfaces that tell
> whether a TLB flush is needed or not would be helpful and simpler to follow.
> For instance, we can have is_pte_prot_demotion(oldprot, newprot) to figure
> out whether a TLB flush is needed in change_pte_range() and avoid
> unnecessary flushes when unprotecting pages with either mprotect() or
> userfaultfd.

When you mentioned this earlier I was thinking what happens then with
flush_tlb_fix_spurious_fault(). The fact it's safe doesn't guarantee
it's a performance win if there's a stream of spurious faults as a
result. So it'd need to be checked, especially since in the case of
mprotect the flush can be deferred and coalesced in a single IPI at the
end, so there's not much to gain from it there anyway.

If you can guarantee there won't be a flood of spurious wrprotect
faults, then it'll be a nice optimization.
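
(For clarity, the helper being suggested would look something like the
sketch below; the name and the exact semantics are Nadav's suggestion,
not existing code, and the x86 _PAGE_RW bit is only used as an
example:)

	/* flush only when permissions are being reduced (demoted) */
	static inline bool is_pte_prot_demotion(pgprot_t oldprot, pgprot_t newprot)
	{
		return (pgprot_val(oldprot) & ~pgprot_val(newprot)) & _PAGE_RW;
	}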

Thanks,
Andrea



Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-05 Thread Andrea Arcangeli
On Tue, Jan 05, 2021 at 08:06:22PM +, Nadav Amit wrote:
> I just thought that there might be some insinuation, as you mentioned VMware
> by name. My response was half-jokingly and should have had a smiley to
> prevent you from wasting your time on the explanation.

No problem, actually I appreciate you pointed out to give me the extra
opportunity to further clarify I wasn't implying anything like that,
sorry again for any confusion I may have generated.

I mentioned vmware because I'd be shocked if for the whole duration of
the wrprotect on the guest physical memory it'd have to halt all minor
faults and all memory freeing like it would happen to rust-vmm/qemu if
we take the mmap_write_lock, that's all. Or am I wrong about this?

For uffd-wp avoiding the mmap_write_lock isn't an immediate concern
(obviously so in the rust-vmm case which won't even do postcopy live
migration), but the above concern applies for the long term and maybe
mid term for qemu.

The postcopy live snapshotting was the #1 use case so it's hard not
to mention it, but there's still other interesting userland use cases
of uffd-wp with various users already testing it in their apps, that
may ultimately become more prevalent, who knows.

The point is that those that will experiment with uffd-wp will run a
benchmark, post a blog, others will see the blog, they will test too
in their app and post their blog. It needs to deliver the full
acceleration immediately, otherwise the evaluation may show it as a
fail or not worth it.

In theory we could just say, we'll optimize it later if significant
userbase emerge, but in my view it's bit of a chicken egg problem and
I'm afraid that such theory may not work well in practice.

Still, for the initial fix, avoiding the mmap_write_lock seems more
important actually for clear_refs than for uffd-wp. uffd-wp is
somewhat lucky and will just share any solution to keep clear_refs
scalable, since the issue is identical.

Thanks,
Andrea



Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup

2021-01-05 Thread Andrea Arcangeli
On Tue, Jan 05, 2021 at 07:26:43PM +, Nadav Amit wrote:
> > On Jan 5, 2021, at 10:20 AM, Andrea Arcangeli  wrote:
> > 
> > On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote:
> >> Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes 
> >> tracking")
> > 
> > Targeting a backport down to 2013 when nothing could wrong in practice
> > with page_mapcount sounds backwards and unnecessarily risky.
> > 
> > In theory it was already broken and in theory
> > 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 is absolutely perfect and the
> > previous code of 2013 is completely wrong, but in practice the code
> > from 2013 worked perfectly until Aug 21 2020.
> 
> Well… If you consider the bug that Will recently fixed [1], then soft-dirty
> was broken (for a different, yet related reason) since 0758cd830494
> ("asm-generic/tlb: avoid potential double flush”).
> 
> This is not to say that I argue that the patch should be backported to 2013,
> just to say that memory corruption bugs can be unnoticed.
> 
> [1] 
> https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-w...@kernel.org/

Is this a fix or a cleanup?

The above is precisely what I said earlier that tlb_gather had no
reason to stay in clear_refs and it had to use inc_tlb_flush_pending
as mprotect, but it's not a fix? Is it? I suggested it as a pure
cleanup. So again no backport required. The commit says fix this but
it means "clean this up".

Now there are plenty of bugs can go unnoticed for decades, including
dirtycow and the very bug that allowed the fork child to attack the
parent with vmsplice that ultimately motivated the
page_mapcount->page_count in do_wp_page in Aug 2020.

Now let's take another example:
7066f0f933a1fd707bb38781866657769cff7efc which also was found by
source review only and never happened in practice, and unlike dirtycow
and the vmsplice attack on the parent, it was not reproducible even at
will after it was found (and even then it wouldn't have been
reproducibly exploitable). So you can take 7066f0f933a1fd707bb38781866657769cff7efc
as the example of theoretical issue that might still crash the kernel
if unlucky. So before 7066f0f933a1fd707bb38781866657769cff7efc, things
worked by luck but not reliably so.

How are all those above relevant here?

In my view none of the above is relevant.

As I already stated, this specific issue, for both uffd-wp and
clear_refs, wasn't even a theoretical bug before Aug 2020; it is not
like 7066f0f933a1fd707bb38781866657769cff7efc, and it's not like
https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-w...@kernel.org/
which appears a pure cleanup and doesn't need backporting to any
tree.

The uffd-wp clear_refs corruption mathematically could not happen
before Aug 2020, it worked by luck too, but unlike
7066f0f933a1fd707bb38781866657769cff7efc reliably so.

Philosophically I obviously agree the bug originated in 2013, but the
stable trees aren't research material that would care about such an
intellectual detail.

So setting a Fixes tag back to 2013, which would go mess with all the
stable trees by actively backporting to clear_refs a performance
regression that can break runtime performance, in order to fix a
philosophical issue that isn't even a theoretical one, doesn't sound
ideal to me.

> To summarize my action items based your (and others) feedback on both
> patches:
> 
> 1. I will break the first patch into two different patches, one with the
> “optimization” for write-unprotect, based on your feedback. It will not
> be backported.
>
> 2. I will try to add a patch to avoid TLB flushes on
> userfaultfd-writeunprotect. It will also not be backported.

I think 1 and 2 above could be in the same patch. Mixing an uffd-wp
optimization with the actual fix for the memory corruption wasn't
ideal, but doing the same optimization to both wrprotect and
un-wrprotect in the same patch sounds ideal. The commit explanation
would be identical and it can be de-duplicated this way.

I'd suggest to coordinate with Peter on that, since I wasn't planning
to work on this if somebody else offered to do it.

> 3. Let me know if you want me to use your version of testing
> mm_tlb_flush_pending() and conditionally flushing, wait for new version fro
> you or Peter or to go with taking mmap_lock for write.

Yes, as you suggested, I'm trying to clean it up and send a new
version.

Ultimately my view is there are a huge number of cases where it is
simply impossible not to take the mmap_write_lock or some other heavy
lock that will occasionally have to block on I/O. Even
speculative page faults only attack the low hanging anon memory and
there's still MADV_DONTNEED/FREE and other stuff that may have to run
in parallel with UFFDIO_WRITEPROTECT and clear_refs, not just page
faults.

As a reminder: the only case when modifying the vmas

Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-05 Thread Andrea Arcangeli
On Tue, Jan 05, 2021 at 07:05:22PM +, Nadav Amit wrote:
> > On Jan 5, 2021, at 10:45 AM, Andrea Arcangeli  wrote:
> > I just don't like to slow down a feature required in the future for
> > implementing postcopy live snapshotting or other snapshots to userland
> > processes (for the non-KVM case, also unprivileged by default if using
> > bounce buffers to feed the syscalls) that can be used by open source
> > hypervisors to beat proprietary hypervisors like vmware.
> 
> Ouch, that’s uncalled for. I am sure that you understand that I have no
> hidden agenda and we all have the same goal.

Ehm, I never said you had a hidden agenda, so I'm not exactly sure why
you're accusing me of something I never said.

I merely pointed out that one relevant justification for increasing
kernel complexity here, by not slowing down the longstanding O(N)
clear_refs/softdirty feature and the recent O(1) uffd-wp feature, is
to be more competitive with proprietary software solutions, since at
least for uffd-wp postcopy live snapshotting is the #1 use case.

I never questioned your contribution or your preference; to be even
more explicit, it never crossed my mind that you have a hidden
agenda.

However since everyone already acked your patches and I'm not ok with
your patches because they will give a hit to KVM postcopy live
snapshotting and other container clear_refs users, I have to justify
why I NAK your patches and remaining competitive with proprietary
hypervisors is one of them, so I don't see what is wrong to state a
tangible end goal here.

Thanks,
Andrea



Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-05 Thread Andrea Arcangeli
On Tue, Jan 05, 2021 at 01:41:34PM -0500, Peter Xu wrote:
> Agreed. I didn't mention uffd_wp check (which I actually mentioned in the 
> reply
> to v1 patchset) here only because the uffd_wp check is pure optimization; 
> while

Agreed it's a pure optimization.

Only if we had used the group lock to fix this (which we didn't, since
it wouldn't help clear_refs avoid the performance regression) would
the optimization have stopped being an optimization.

> the uffd_wp_resolve check is more critical because it is potentially a fix of
> similar tlb flushing issue where we could have demoted the pte without being
> noticed, so I think it's indeed more important as Nadav wanted to fix in the
> same patch.

I didn't get why that was touched in the same patch, I already
suggested to remove that optimization...

> It would be even nicer if we have both covered (all of them can be in
> unlikely() as Andrea suggested in the other email), then maybe nicer as a
> standalone patch, then mention about the difference of the two in the commit
> log (majorly, the resolving change will be more than optimization).

Yes, if you want to go ahead optimizing both cases of the
UFFDIO_WRITEPROTECT, I don't think there's any dependency on this. The
huge_memory.c also needs covering but I didn't look at it, hopefully
the code will result as clean as in the pte case.

I'll try to cleanup the tlb flush in the meantime to see if it look
maintainable after the cleanups.

Then we can change it to wait_pending_flush(); return VM_FAULT_RETRY
model if we want to or if the IPI is slower, at least clear_refs will
still not block on random pagein or swapin from disk, but only anon
memory write access will block while clear_refs run.

Thanks,
Andrea



Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-05 Thread Andrea Arcangeli
On Mon, Jan 04, 2021 at 09:26:33PM +, Nadav Amit wrote:
> I would feel more comfortable if you provide patches for uffd-wp. If you
> want, I will do it, but I restate that I do not feel comfortable with this
> solution (worried as it seems a bit ad-hoc and might leave out a scenario
> we all missed or cause a TLB shootdown storm).
>
> As for soft-dirty, I thought that you said that you do not see a better
> (backportable) solution for soft-dirty. Correct me if I am wrong.

I think they should use the same technique, since they deal with the
exact same challenge. I will try to cleanup the patch in the meantime.

I can also try to do the additional cleanups to clear_refs to
eliminate the tlb_gather completely since it doesn't gather any page
and it has no point in using it.

> Anyhow, I will add your comments regarding the stale TLB window to make the
> description clearer.

Having the mmap_write_lock solution as backup won't hurt, but I think
it's only for planB if planA doesn't work and the only stable tree
that will have to apply this is v5.9.x. All previous don't need any
change in this respect. So there's no worry of rejects.

It worked by luck until Aug 2020, but it did so reliably or somebody
would have noticed already. And it's not exploitable either, it just
works stable, but it was prone to break if the kernel changed in some
other way, and it eventually changed in Aug 2020 when an unrelated
patch happened to the reuse logic.

If you want to maintain the mmap_write_lock patch if you could drop
the preserved_write and adjust the Fixes to target Aug 2020 it'd be
ideal. The uffd-wp needs a different optimization that maybe Peter is
already working on or I can include in the patchset for this, but
definitely in a separate commit because it's orthogonal.

It's great you noticed the W->RO transition of un-wrprotect so we can
optimize that too (it will have a positive runtime effect, it's not
just theoretical since it's normal to unwrprotect a huge range once
the postcopy snapshotting of the virtual machine is complete); I was
thinking of the previous case discussed in the other thread.

I just don't like to slow down a feature required in the future for
implementing postcopy live snapshotting or other snapshots to userland
processes (for the non-KVM case, also unprivileged by default if using
bounce buffers to feed the syscalls) that can be used by open source
hypervisors to beat proprietary hypervisors like vmware.

The security concern of uffd-wp, that it allows enlarging the window
of use-after-free kernel bugs, is not as much of a concern as it is for
regular processes. First, the jailer model can obtain the uffd before
dropping all caps and before firing up seccomp in the child, so it
won't even require lifting unprivileged_userfaultfd in the superior
and cleaner monolithic jailer model.
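
(Concretely the jailer sequence is just the userland sketch below;
drop_all_capabilities() and load_seccomp_filter() are placeholders for
whatever the jailer already does:)

	/* obtain the uffd while still privileged */
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0)
		err(1, "userfaultfd");

	/* register the guest memory ranges, hand uffd to the monitor, then: */
	drop_all_capabilities();
	load_seccomp_filter();	/* the sandboxed child can't create new uffds */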

If the uffd and uffd-wp can only run in rust-vmm and qemu, that
userland is system software to be trusted as the kernel from the guest
point of view. It's similar to fuse, if somebody gets into the fuse
process it can also stop the kernel initiated faults. From that
respect fuse is also system software despite it runs in userland.

In other words I think if there's a vm-escape that takes control of
rust-vmm userland, the last worry is the fact it can stop kernel
initiated page faults because the jailer took an uffd before drop privs.

Thanks,
Andrea



Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup

2021-01-05 Thread Andrea Arcangeli
On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote:
> Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes tracking")

Targeting a backport down to 2013 when nothing could go wrong in
practice with page_mapcount sounds backwards and unnecessarily risky.

In theory it was already broken and in theory
09854ba94c6aad7886996bfbee2530b3d8a7f4f4 is absolutely perfect and the
previous code of 2013 is completely wrong, but in practice the code
from 2013 worked perfectly until Aug 21 2020.

Since nothing at all could go wrong in soft dirty and uffd-wp until
09854ba94c6aad7886996bfbee2530b3d8a7f4f4, the Fixes tag needs to
target that, definitely not a patch from 2013.

This means the backports will apply clean, they don't need a simple
solution but one that doesn't regress the performance of open source
virtual machines and open source products using clear_refs and uffd-wp
in general.

Thanks,
Andrea



Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-05 Thread Andrea Arcangeli
On Tue, Jan 05, 2021 at 10:08:13AM -0500, Peter Xu wrote:
> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index ab709023e9aa..c08c4055b051 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > oldpte = *pte;
> > if (pte_present(oldpte)) {
> > pte_t ptent;
> > -   bool preserve_write = prot_numa && pte_write(oldpte);
> > +   bool preserve_write = (prot_numa || uffd_wp_resolve) &&
> > + pte_write(oldpte);
> 
> Irrelevant of the other tlb issue, this is a standalone one and I commented in
> v1 about simply ignore the change if necessary; unluckily that seems to be
> ignored..  so I'll try again - would below be slightly better?
> 
> if (uffd_wp_resolve && !pte_uffd_wp(oldpte))
> continue;

I posted the exact same code before seeing the above so I take it as a good
sign :). I'd suggest to add the reverse check to the uffd_wp too.

> Firstly, current patch is confusing at least to me, because "uffd_wp_resolve"
> means "unprotect the pte", whose write bit should mostly be cleared already
> when uffd_wp_resolve is applicable.  Then "preserve_write" for that pte looks
> odd already.
> 
> Meanwhile, if that really happens (when pte write bit set, but during a
> uffd_wp_resolve request) imho there is really nothing we can do, so we should
> simply avoid touching that at all, and also avoid ptep_modify_prot_start,
> pte_modify, ptep_modify_prot_commit, calls etc., which takes extra cost.

Agreed. It should not just defer the flush, by doing continue we will
not flush anything.

So ultimately the above will be an orthogonal optimization, but now I
get why the deferred tlb flush on cpu0 of the v2 patch was the
problematic one. I didn't see that we lacked the above optimization and I
thought we were discussing still the regular case where un-wrprotect
is called on a pte with uffd-wp set.

thanks,
Andrea



Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-05 Thread Andrea Arcangeli
On Tue, Jan 05, 2021 at 04:37:27PM +0100, Peter Zijlstra wrote:
> (your other email clarified this point; the COW needs to copy while
> holding the PTL and we need TLBI under PTL if we're to change this)

The COW doesn't need to hold the PT lock, the TLBI broadcast doesn't
need to be delivered under PT lock either.

Simply there need to be a TLBI broadcast before the copy. The patch I
sent here https://lkml.kernel.org/r/x+qlr1wmgxms3...@redhat.com that
needs to be cleaned up with some abstraction and better commentary
also misses a smp_mb() in the case flush_tlb_page is not called, but
that's a small detail.

> And I'm thinking the speculative page fault series steps right into all
> this, it fundamentally avoids mmap_sem and entirely relies on the PTL.

I thought about that but that only applies to some kind of "anon" page
fault.

Here the problem isn't just the page fault, the problem is not to
regress clear_refs to block on page fault I/O, and all
MAP_PRIVATE/MAP_SHARED filebacked faults bitting the disk to read
/usr/ will still prevent clear_refs from running (and the other way
around) if it has to take the mmap_sem for writing.

I haven't looked at the speculative page fault series for a while, but
last I checked there was nothing there that can tame the above major
regression from CPU speed to disk I/O speed that would be inflicted on
both clear_refs on huge mm and on uffd-wp.

Thanks,
Andrea



Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-05 Thread Andrea Arcangeli
On Tue, Jan 05, 2021 at 09:58:57AM +0100, Peter Zijlstra wrote:
> On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote:
> > On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> > > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> > > 
> > > > The scenario that happens in selftests/vm/userfaultfd is as follows:
> > > > 
> > > > cpu0cpu1cpu2
> > > > 
> > > > [ Writable PTE
> > > >   cached in TLB 
> > > > ]
> > > > userfaultfd_writeprotect()
> > > > [ write-*unprotect* ]
> > > > mwriteprotect_range()
> > > > mmap_read_lock()
> > > > change_protection()
> > > > 
> > > > change_protection_range()
> > > > ...
> > > > change_pte_range()
> > > > [ *clear* “write”-bit ]
> > > > [ defer TLB flushes ]
> > > > [ page-fault ]
> > > > ...
> > > > wp_page_copy()
> > > >  cow_user_page()
> > > >   [ copy page ]
> > > > [ write to old
> > > >   page ]
> > > > ...
> > > >  set_pte_at_notify()
> > > 
> > > Yuck!
> > > 
> > 
> > Note, the above was posted before we figured out the details so it
> > wasn't showing the real deferred tlb flush that caused problems (the
> > one showed on the left causes zero issues).
> > 
> > The problematic one not pictured is the one of the wrprotect that has
> > to be running in another CPU which is also isn't picture above. More
> > accurate traces are posted later in the thread.
> 
> Lets assume CPU0 does a read-lock, W -> RO with deferred flush.

I was mistaken saying the deferred tlb flush was not shown in the v2
trace, just this appears a new different case we didn't happen to
consider before.

In the previous case we discussed earlier, when un-wrprotect above is
called it never should have been a W->RO since a wrprotect ran first.

Doesn't it ring a bell that if an un-wrprotect does a W->RO
transition, something is a bit going backwards?

I don't recall from previous discussion that un-wrprotect was
considered as called on read-write memory. I think we need the below
change to fix this new case.

if (uffd_wp) {
+   if (unlikely(pte_uffd_wp(oldpte)))
+   continue;
ptent = pte_wrprotect(ptent);
ptent = pte_mkuffd_wp(ptent);
} else if (uffd_wp_resolve) {
+   if (unlikely(!pte_uffd_wp(oldpte)))
+   continue;
/*
 * Leave the write bit to be handled
 * by PF interrupt handler, then
 * things like COW could be properly
 * handled.
 */
ptent = pte_clear_uffd_wp(ptent);
}

I now get why the v2 patch touches preserved_write, but this is not
about preserve_write, it's not about leaving the write bit alone. This
is about leaving the whole pte alone if the uffd-wp bit doesn't
actually change.

We shouldn't just defer the tlb flush if un-wrprotect is called on
read-write memory: we should not have flushed the tlb at all in such
a case.

Same for hugepmd in huge_memory.c which will be somewhere else.

Once the above is optimized, then un-wrprotect as in
MM_CP_UFFD_WP_RESOLVE is usually preceded by wrprotect as in
MM_CP_UFFD_WP, and so it'll never be a W->RO but a RO->RO transition
that just clears the uffd_wp flag and nothing else and whose tlb flush
is in turn irrelevant.

The fix discussed still works for this new case too: I'm not
suggesting we should rely on the above optimization for the tlb
safety. The above is just a missing optimization.

> > > Isn't this all rather similar to the problem that resulted in the
> > > tlb_flush_pending mess?
> > > 
> > > I still think that's all fundamentally buggered, the much saner solution
> > > (IMO) would've been to 

Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-04 Thread Andrea Arcangeli
On Mon, Jan 04, 2021 at 08:39:37PM +, Nadav Amit wrote:
> > On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli  wrote:
> > 
> > On Mon, Jan 04, 2021 at 07:35:06PM +, Nadav Amit wrote:
> >>> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli  wrote:
> >>> 
> >>> Hello,
> >>> 
> >>> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> >>>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> >>>> 
> >>>>> The scenario that happens in selftests/vm/userfaultfd is as follows:
> >>>>> 
> >>>>> cpu0cpu1cpu2
> >>>>> 
> >>>>> [ Writable PTE
> >>>>>   cached in TLB 
> >>>>> ]
> >>>>> userfaultfd_writeprotect()
> >>>>> [ write-*unprotect* ]
> >>>>> mwriteprotect_range()
> >>>>> mmap_read_lock()
> >>>>> change_protection()
> >>>>> 
> >>>>> change_protection_range()
> >>>>> ...
> >>>>> change_pte_range()
> >>>>> [ *clear* “write”-bit ]
> >>>>> [ defer TLB flushes ]
> >>>>> [ page-fault ]
> >>>>> ...
> >>>>> wp_page_copy()
> >>>>>  cow_user_page()
> >>>>>   [ copy page ]
> >>>>> [ write to old
> >>>>>   page ]
> >>>>> ...
> >>>>>  set_pte_at_notify()
> >>>> 
> >>>> Yuck!
> >>> 
> >>> Note, the above was posted before we figured out the details so it
> >>> wasn't showing the real deferred tlb flush that caused problems (the
> >>> one showed on the left causes zero issues).
> >> 
> >> Actually it was posted after (note that this is v2). The aforementioned
> >> scenario that Peter regards to is the one that I actually encountered (not
> >> the second scenario that is “theoretical”). This scenario that Peter 
> >> regards
> >> is indeed more “stupid” in the sense that we should just not write-protect
> >> the PTE on userfaultfd write-unprotect.
> >> 
> >> Let me know if I made any mistake in the description.
> > 
> > I didn't say there is a mistake. I said it is not showing the real
> > deferred tlb flush that cause problems.
> > 
> > The issue here is that we have a "defer tlb flush" that runs after
> > "write to old page".
> > 
> > If you look at the above, you're induced to think the "defer tlb
> > flush" that causes issues is the one in cpu0. It's not. That is
> > totally harmless.
> 
> I do not understand what you say. The deferred TLB flush on cpu0 *is* the
> the one that causes the problem. The PTE is write-protected (although it is
> a userfaultfd unprotect operation), causing cpu1 to encounter a #PF, handle
> the page-fault (and copy), while cpu2 keeps writing to the source page. If
> cpu0 did not defer the TLB flush, this problem would not happen.

Your argument "If cpu0 did not defer the TLB flush, this problem would
not happen" is identical to "if the cpu0 has a small TLB size and the
tlb entry is recycled, the problem would not happen".

There are a multitude of factors that are unrelated to the real
problematic deferred tlb flush that may happen and still won't cause
the issue, including a parallel IPI.

The point is that we don't need to worry about the "defer TLB flushes"
of the un-wrprotect, because you said earlier that deferring tlb
flushes when you're doing "permission promotions" does not cause
problems.

The only "deferred tlb flush" we need to worry about, not in the
picture, is the one following the actual permission removal (the
wrprotection).

> it shows the write that triggers the corruption instead of discussing
> “windows”, which might be less clear. Running copy_user_page() with stale

I think showing exactly where the race window opens is key to
understand the issue, but then that's the way I work and feel free to
think it in a

Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-04 Thread Andrea Arcangeli
On Mon, Jan 04, 2021 at 07:35:06PM +, Nadav Amit wrote:
> > On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli  wrote:
> > 
> > Hello,
> > 
> > On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> >> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> >> 
> >>> The scenario that happens in selftests/vm/userfaultfd is as follows:
> >>> 
> >>> cpu0  cpu1cpu2
> >>>   
> >>>   [ Writable PTE
> >>> cached in TLB ]
> >>> userfaultfd_writeprotect()
> >>> [ write-*unprotect* ]
> >>> mwriteprotect_range()
> >>> mmap_read_lock()
> >>> change_protection()
> >>> 
> >>> change_protection_range()
> >>> ...
> >>> change_pte_range()
> >>> [ *clear* “write”-bit ]
> >>> [ defer TLB flushes ]
> >>>   [ page-fault ]
> >>>   ...
> >>>   wp_page_copy()
> >>>cow_user_page()
> >>> [ copy page ]
> >>>   [ write to old
> >>> page ]
> >>>   ...
> >>>set_pte_at_notify()
> >> 
> >> Yuck!
> > 
> > Note, the above was posted before we figured out the details so it
> > wasn't showing the real deferred tlb flush that caused problems (the
> > one showed on the left causes zero issues).
> 
> Actually it was posted after (note that this is v2). The aforementioned
> scenario that Peter regards to is the one that I actually encountered (not
> the second scenario that is “theoretical”). This scenario that Peter regards
> is indeed more “stupid” in the sense that we should just not write-protect
> the PTE on userfaultfd write-unprotect.
> 
> Let me know if I made any mistake in the description.

I didn't say there is a mistake. I said it is not showing the real
deferred tlb flush that causes problems.

The issue here is that we have a "defer tlb flush" that runs after
"write to old page".

If you look at the above, you're induced to think the "defer tlb
flush" that causes issues is the one in cpu0. It's not. That is
totally harmless.


> 
> > The problematic one not pictured is the one of the wrprotect that has
> > to be running in another CPU which is also isn't picture above. More
> > accurate traces are posted later in the thread.
> 
> I think I included this scenario as well in the commit log (of v2). Let me
> know if I screwed up and the description is not clear.

Instead of not showing the real "defer tlb flush" in the trace and
then fixing it up in the comment, why don't you take the trace showing
the real problematic "defer tlb flush"? No need to reinvent it.

https://lkml.kernel.org/r/x+jjqk91plkbv...@redhat.com

See here the detail underlined:

deferred tlb flush <- too late
XX BUG RACE window close here

This shows the real deferred tlb flush, which your v2 instead does not
include.



Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-04 Thread Andrea Arcangeli
Hello,

On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> 
> > The scenario that happens in selftests/vm/userfaultfd is as follows:
> > 
> > cpu0cpu1cpu2
> > 
> > [ Writable PTE
> >   cached in TLB ]
> > userfaultfd_writeprotect()
> > [ write-*unprotect* ]
> > mwriteprotect_range()
> > mmap_read_lock()
> > change_protection()
> > 
> > change_protection_range()
> > ...
> > change_pte_range()
> > [ *clear* “write”-bit ]
> > [ defer TLB flushes ]
> > [ page-fault ]
> > ...
> > wp_page_copy()
> >  cow_user_page()
> >   [ copy page ]
> > [ write to old
> >   page ]
> > ...
> >  set_pte_at_notify()
> 
> Yuck!
> 

Note, the above was posted before we figured out the details, so it
wasn't showing the real deferred tlb flush that caused problems (the
one shown on the left causes zero issues).

The problematic one, not pictured, is the one of the wrprotect that
has to be running on another CPU, which also isn't pictured above.
More accurate traces are posted later in the thread.

> Isn't this all rather similar to the problem that resulted in the
> tlb_flush_pending mess?
> 
> I still think that's all fundamentally buggered, the much saner solution
> (IMO) would've been to make things wait for the pending flush, instead

How do you intend to wait under the PT lock while the writer also has
to take the PT lock repeatedly before it can do wake_up_var?

If you release the PT lock before calling wait_tlb_flush_pending it
all falls apart again.

This I guess explains why a local pte/hugepmd smp local invlpg is the
only working solution for this issue, similarly to how it's done in rmap.

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 07d9acb5b19c..0210547ac424 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -649,7 +649,8 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
>*
>* Therefore we must rely on tlb_flush_*() to guarantee order.
>*/
> - atomic_dec(&mm->tlb_flush_pending);
> + if (atomic_dec_and_test(&mm->tlb_flush_pending))
> +  wake_up_var(&mm->tlb_flush_pending);
>  }
>  
>  static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> @@ -677,6 +678,12 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
>   return atomic_read(&mm->tlb_flush_pending) > 1;
>  }
>  
> +static inline void wait_tlb_flush_pending(struct mm_struct *mm)
> +{
> + wait_var_event(&mm->tlb_flush_pending,
> +atomic_read(&mm->tlb_flush_pending) == 0);
> +}

I appreciate the effort in not regressing soft dirty and uffd-wp
writeprotect to disk-I/O spindle bandwidth and not using mmap_sem for
writing.

At the same time what was posted so far wasn't clean enough, and it
wasn't even tested... if we abstract it in some clean way and we mark
all connected points (soft dirty, uffd-wp, the wrprotect page fault),
then I can be optimistic it will remain understandable when we look at
it again a few years down the road.

Or at the very least it can't get worse than the "tlb_flush_pending
mess" you mentioned above.

flush_tlb_batched_pending() has to be orthogonally re-reviewed for
those things Nadav pointed out. But I'd rather keep that review in a
separate thread since any bug in that code has zero connection to this
issue. The basic idea is similar but the methods and logic are
different and our flush here will be granular and it's going to be
only run if VM_SOFTDIRTY isn't set and soft dirty is compiled in, or
if VM_UFFD_WP is set. The flush_tlb_batched_pending is mm wide,
unconditional etc.. Pretty much all different.

Thanks,
Andrea



Re: kernelci/staging-next bisection: sleep.login on rk3288-rock2-square #2286-staging

2021-01-03 Thread Andrea Arcangeli
Hello Mike,

On Sun, Jan 03, 2021 at 03:47:53PM +0200, Mike Rapoport wrote:
> Thanks for the logs, it seems that implicitly adding reserved regions to
> memblock.memory wasn't that bright idea :)

Would it be possible to somehow clean up the hack then?

The only difference between the clean solution and the hack is that
the hack intended to achieve the exact same thing, but without adding the
reserved regions to memblock.memory.

The comment on that problematic area says the reserved area cannot be
used for DMA because of some unexplained hw issue, and that doing so
prevents booting, but since the area got reserved, even with the clean
solution, it should never have been used for DMA?

So I can only imagine that the physical memory region is way more
problematic than just for DMA. It sounds like that anything that
touches it, including the CPU, will hang the system, not just DMA. It
sounds somewhat similar to the other e820 direct mapping issue on x86?

If you want to test the hack on the arm board to check if it boots you
can use the below commit:

https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=c3ea2633015104ce0df33dcddbc36f57de1392bc

Thanks,
Andrea



Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-24 Thread Andrea Arcangeli
On Thu, Dec 24, 2020 at 01:49:45PM -0500, Andrea Arcangeli wrote:
> Without the above, can't the CPU decrement the tlb_flush_pending while
> the IPI to ack didn't arrive yet in csd_lock_wait?

Ehm: csd_lock_wait has smp_acquire__after_ctrl_dep() so the write side
looks ok after all, sorry.



Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-24 Thread Andrea Arcangeli
On Wed, Dec 23, 2020 at 09:18:09PM -0800, Nadav Amit wrote:
> I am not trying to be argumentative, and I did not think through about an
> alternative solution. It sounds to me that your proposed solution is correct
> and would probably be eventually (slightly) more efficient than anything
> that I can propose.

On a side note, on the last proposed solution, I've been wondering
about the memory ordering mm_tlb_flush_pending.

There's plenty of locking before the point where the actual data is
read, but it's not a release/acquire full barrier (or more accurately
it is only on x86), so smp_rmb() seems needed before cow_user_page to
be sure no data can be read before we read the value of
mm_tlb_flush_pending.

To avoid an explicit smp_rmb() and to inherit the implicit smp_mb()
from the release/acquire of PT unlock/PT lock, we'd need to put the
flush before the previous PT unlock. (note, not after the next PT lock
or it'd be even worse).

So this made me look also the inc/dec:

+   smp_mb__before_atomic();
	atomic_dec(&mm->tlb_flush_pending);

Without the above, can't the CPU decrement the tlb_flush_pending while
the IPI to ack didn't arrive yet in csd_lock_wait?

The smp_rmb() is still needed in the page fault (implicit or explicit
doesn't matter), but we also need a smp_mb__before_atomic() above to
really make this work.

> Yet, I do want to explain my position. Reasoning on TLB flushes is hard, as
> this long thread shows. The question is whether it has to be so hard. In
> theory, we can only think about architectural considerations - whether a PTE
> permissions are promoted/demoted and whether the PTE was changed/cleared.
> 
> Obviously, it is more complex than that. Yet, once you add into the equation
> various parameters such as the VMA flags or whether a page is locked (which
> Mel told me was once a consideration), things become much more complicated.
> If all the logic of TLB flushes had been concentrated in a single point and
> maintenance of this code did not require thought about users and use-cases,
> I think things would have been much simpler, at least for me.

In your previous email you also suggested to add range invalidates and
bloom filter to index them by the address in the page fault since it'd
help MADV_PAGEOUT. That would increase complexity and it won't go
exactly in the above direction.

I assume that here nobody wants to add gratuitous complexity, and I
never suggested the code shouldn't become simpler and easier to
understand and maintain. But we can't solve everything in a single
thread in terms of cleaning up and auditing the entirety of the TLB
code.

To refocus: the only short term objective in this thread, is to fix
the data corruption in uffd-wp and clear_refs_write without
introducing new performance regressions compared to the previous
clear_refs_write runtime.

Once that is fixed and we didn't introduce a performance regression
while fixing an userland memory corruption regression (i.e. not really
a regression in theory, but in practice it is because it worked by
luck), then we can look at all the rest without hurry.

So if we already want to start the cleanups like I mentioned in a
previous email, I'll say it more explicitly: the tlb gather is for
freeing memory, here it's just pure overhead and gives a false sense of
security, as if it could make any difference, when just changing
protection with mmap_read_lock. It wouldn't be needed with the write
lock and it can't help solve the races that trigger with
mmap_read_lock either.

It is needed when you have to store the page pointer outside the pte,
clear a pte, flush the tlb and only then put_page. So it is needed to
keep tracks of which pages got cleared in the ptes so you don't have
to issue a tlb flush for each single pte that gets cleared.

So the only case when to use the tlb_gather is when you must make the
pte stop pointing to the page and you need an external storage that
will keep track of those pages that we cannot yet lose track of since
we didn't do put_page yet. That kind of external storage to keep track
of the pages that have pending tlb flushes, is never needed in
change_protection and clear_refs_write.

change_protection is already correct in that respect, it doesn't use
the tlb_gather. clear_refs_write is not ok and it needs to stop using
tlb_gather_mmu/tlb_finish_mmu.

* tlb_gather_mmu - initialize an mmu_gather structure for page-table tear-down

See also the tear-down mention which doesn't apply to clear_refs_write.

clear_refs_write needs to become identical to
change_protection_range. Just increasing inc_tlb_flush_pending and
then doing a flush of the range right before the
dec_tlb_flush_pending.
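
The end result in clear_refs_write would then have roughly this
structure (only a sketch of the flow, not a tested diff, reusing the
existing clear_refs_walk_ops and clear_refs_private from
fs/proc/task_mmu.c):

	mmap_read_lock(mm);
	inc_tlb_flush_pending(mm);

	/* clear_soft_dirty() wrprotects the ptes under the PT lock */
	walk_page_range(mm, 0, -1UL, &clear_refs_walk_ops, &cp);

	flush_tlb_mm(mm);
	dec_tlb_flush_pending(mm);
	mmap_read_unlock(mm);

No mmu_gather, no page tracking, just the pending counter around the
walk and one flush of the range before dropping it.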

That change is actually orthogonal to the regression fix to teach the
COW to deal with stale TLB entries and it will clean up the code.

So to pursue your simplification objective you could already go ahead
doing this improvement since it's very confusing to see the
clear_refs_write use something that it 

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-23 Thread Andrea Arcangeli
> On Wed, Dec 23, 2020 at 07:09:10PM -0800, Nadav Amit wrote:
> > I think there are other cases in which Andy’s concern is relevant
> > (MADV_PAGEOUT).

I didn't try to figure out how it would help MADV_PAGEOUT. MADV_PAGEOUT
sounds like a cool feature, but maybe it'd need a way to flush the
invalidates out and take a static key to enable the buildup of those
ranges?

I wouldn't like to slow down the fast paths even for MADV_PAGEOUT, and
the same applies to uffd-wp and softdirty in fact.

On Wed, Dec 23, 2020 at 08:34:00PM -0700, Yu Zhao wrote:
> That patch only demonstrate a rough idea and I should have been
> elaborate: if we ever decide to go that direction, we only need to
> worry about "jumping through hoops", because the final patch (set)
> I have in mind would not only have the build time optimization Andrea
> suggested but also include runtime optimizations like skipping
> do_swap_page() path and (!PageAnon() || page_mapcount > 1). Rest
> assured, the performance impact on do_wp_page() from occasionally an
> additional TLB flush on top of a page copy is negligible.

Agreed, I just posted something in that direction. Feel free to
refactor it, it's just a tentative implementation to show something
that may deliver all the performance we need in all cases involved
without slowing down the fast paths of non-uffd and non-softdirty
(well 1 branch).

> On Wed, Dec 23, 2020 at 07:09:10PM -0800, Nadav Amit wrote:
> > Perhaps holding some small bitmap based on part of the deferred flushed
> > pages (e.g., bits 12-17 of the address or some other kind of a single
> > hash-function bloom-filter) would be more performant to avoid (most)

The concern here isn't only the page faults having to run the bloom
filter, but how to manage the RAM storage pointed to by the bloom filter
or whatever index into the storage, which would slow down mprotect.

Granted mprotect is slow to begin with, but the idea is that we can't
make it any slower just to make MADV_PAGEOUT or uffd-wp or clear_refs run
faster, since it's too important and too frequent in comparison.

Just to restrict the potential false positive IPI caused by page_count
inevitable inaccuracies to uffd-wp and softdirty runtimes, a simple
check on vm_flags should be enough.

Thanks,
Andrea



Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-23 Thread Andrea Arcangeli
On Wed, Dec 23, 2020 at 09:00:26PM -0500, Andrea Arcangeli wrote:
> One other near zero cost improvement easy to add if this would be "if
> (vma->vm_flags & (VM_SOFTDIRTY|VM_UFFD_WP))" and it could be made

The next worry then is that if the UFFDIO_WRITEPROTECT range is very
large, there would be a flood of wrprotect faults, and they'd all end up
issuing a tlb flush during the UFFDIO_WRITEPROTECT itself, which again
is a performance concern for certain uffd-wp use cases.

Those use cases would be for example redis and postcopy live
snapshotting, to use it for async snapshots, unprivileged too in the
case of redis if it temporarily uses bounce buffers for the syscall
I/O for the duration of the snapshot. Hypervisor tuned profiles need
to manually lift unprivileged_userfaultfd to 1 unless their jailer
leaves one capability in the snapshot thread.

Moving the check after userfaultfd_pte_wp would solve
userfaultfd_writeprotect(mode_wp=true), but that still wouldn't avoid
a flood of tlb flushes during userfaultfd_writeprotect(mode_wp=false)
because change_protection doesn't restore the pte_write:

} else if (uffd_wp_resolve) {
/*
 * Leave the write bit to be handled
 * by PF interrupt handler, then
 * things like COW could be properly
 * handled.
 */
ptent = pte_clear_uffd_wp(ptent);
}

When the snapshot is complete userfaultfd_writeprotect(mode_wp=false)
would need to run again on the whole range which can be very big
again.

Orthogonally I think we should also look into restoring the pte_write
above in uffd-wp, so it'll get yet an extra boost compared to current
redis snapshotting with fork(), which cannot restore pte_write after the
snapshot child quits and forces a flood of spurious wrprotect faults
(uffd-wp can solve that too).
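
Only to illustrate where that restore would go (this check alone is not
sufficient, the real condition to prove the page cannot be COW material
would need to be much stricter), the uffd_wp_resolve branch in
change_pte_range() could do something along these lines:

	} else if (uffd_wp_resolve) {
		ptent = pte_clear_uffd_wp(ptent);
		/*
		 * Hypothetical: restore the write bit right away for
		 * private writable mappings when the pte is already
		 * dirty, instead of paying a spurious wrprotect fault
		 * later. NOTE: dirty alone doesn't rule out COW, the
		 * real check would have to be stricter.
		 */
		if (vma_is_anonymous(vma) && (vma->vm_flags & VM_WRITE) &&
		    pte_dirty(ptent))
			ptent = pte_mkwrite(ptent);
	}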

However, even if uffd-wp restored the pte_write, things would remain
suboptimal for a terabyte process under clear_refs, since softdirty
wrprotect faults that start happening while softdirty is still running
on the mm, won't be caught in userfaultfd_pte_wp.

Something like below, if cleaned up, abstracted properly and
documented well in the two places involved, will have a better chance
to perform optimally for softdirty too.

And on a side note the CONFIG_MEM_SOFT_DIRTY compile time check is
compulsory because VM_SOFTDIRTY is defined to zero if softdirty is not
built in. (for VM_UFFD_WP the CONFIG_HAVE_ARCH_USERFAULTFD_WP can be
removed and it won't make any measurable difference even when
USERFAULTFD=n)

RFC untested below, it's supposed to fix the softdirty testcase too,
even without the incremental fix, since it already does tlb_gather_mmu
before walk_page_range and tlb_finish_mmu after it and that appears
enough to define the inc/dec_tlb_flush_pending.

diff --git a/mm/memory.c b/mm/memory.c
index 7d608765932b..66fd6d070c47 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2844,11 +2844,26 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
if (!new_page)
goto oom;
} else {
+   bool in_uffd_wp, in_softdirty;
new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
vmf->address);
if (!new_page)
goto oom;
 
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+   in_uffd_wp = !!(vma->vm_flags & VM_UFFD_WP);
+#else
+   in_uffd_wp = false;
+#endif
+#ifdef CONFIG_MEM_SOFT_DIRTY
+   in_softdirty = !(vma->vm_flags & VM_SOFTDIRTY);
+#else
+   in_softdirty = false;
+#endif
+   if ((in_uffd_wp || in_softdirty) &&
+   mm_tlb_flush_pending(mm))
+   flush_tlb_page(vma, vmf->address);
+
if (!cow_user_page(new_page, old_page, vmf)) {
/*
 * COW failed, if the fault was solved by other,



Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-23 Thread Andrea Arcangeli
On Wed, Dec 23, 2020 at 05:21:43PM -0800, Andy Lutomirski wrote:
> I don’t love this as a long term fix. AFAICT we can have mm_tlb_flush_pending 
> set for quite a while — mprotect seems like it can wait in IO while splitting 
> a huge page, for example. That gives us a window in which every write fault 
> turns into a TLB flush.

mprotect can't run concurrently with a page fault in the first place.

One other near zero cost improvement that would be easy to add is an "if
(vma->vm_flags & (VM_SOFTDIRTY|VM_UFFD_WP))" check, and it could be made
conditional to the two config options too.

Still I don't mind doing it in some other way, uffd-wp has much easier
time doing it in another way in fact.

Whatever performs better is fine, but queuing up pending invalidate
ranges don't look very attractive since it'd be a fixed cost that we'd
always have to pay even when there's no fault (and there can't be any
fault at least for mprotect).

Thanks,
Andrea



Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-23 Thread Andrea Arcangeli
Hello Linus,

On Wed, Dec 23, 2020 at 03:39:53PM -0800, Linus Torvalds wrote:
> On Wed, Dec 23, 2020 at 1:39 PM Andrea Arcangeli  wrote:
> >
> > On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
> > > Thanks for the details.
> >
> > I hope we can find a way put the page_mapcount back where there's a
> > page_count right now.
> 
> I really don't think that's ever going to happen - at least if you're
> talking about do_wp_page().

Yes, I was referring to do_wp_page().

> I refuse to make the *simple* core operations VM have to jump through
> hoops, and the old COW mapcount logic was that. I *much* prefer the
> newer COW code, because the rules are more straightforward.

Agreed. I don't have any practical alternative proposal in fact, it
was only a hypothetical wish/hope, echoing Hugh's view on this
matter.

> > page_count is far from optimal, but it is a feature it finally allowed
> > us to notice that various code (clear_refs_write included apparently
> > even after the fix) leaves stale too permissive TLB entries when it
> > shouldn't.
> 
> I absolutely agree that page_count isn't exactly optimal, but
> "mapcount" is just so much worse.

Agreed. The "isn't exactly optimal" part is the only explanation for
wish/hope.

> page_count() is at least _logical_, and has a very clear meaning:
> "this page has other users". mapcount() means something else, and is
> simply not sufficient or relevant wrt COW.
>
> That doesn't mean that page_mapcount() is wrong - it's just that it's
> wrong for COW. page_mapcount() is great for rmap, so that we can see
> when we need to shoot down a memory mapping of a page that gets
> released (truncate being the classic example).
> 
> I think that the mapcount games we used to have were horrible. I
> absolutely much prefer where we are now wrt COW.
> 
> The modern rules for COW handling are:
> 
>  - if we got a COW fault there might be another user, we copy (and
> this is where the page count makes so much logical sense).
> 
>  - if somebody needs to pin the page in the VM, we either make sure
> that it is pre-COWed and we
> 
>(a) either never turn it back into a COW page (ie the fork()-time
> stuff we have for pinned pages)
> 
>(b) or there is some explicit marker on the page or in the page
> table (ie the userfaultfd_pte_wp thing).
> 
> those are _so_ much more straightforward than the very complex rules
> we used to have that were actively buggy, in addition to requiring the
> page lock. So they were buggy and slow.

Agreed, this is a very solid solution that solves the problem the
mapcount had with GUP pins.

The only alternative but very vapourware thought I had on this matter,
is that all troubles start when we let a GUP-pinned page be unmapped
from the pgtables.

I already told Jens that io_uring could use the mmu notifier (that
would make any io_uring GUP pin not count anymore with regard to the
page_mapcount vs page_count difference) and vmsplice also needs some
handling or maybe should become privileged.

The above two improvements are orthogonal with the COW issue since
long term GUP pins do more than just mlock and they break most
advanced VM features. It's not ok just to account GUP pins in rlimit
mlock.

The above two improvements however would go into the direction to make
mapcount more suitable for COW, but it'd still not be enough.

The transient GUP pins would also need to be taken care of, but we
could wait for those to be released, since those are guaranteed to be
released (or something else, not just munmap()/MADV_DONTNEED, would
remain stuck in D state).

Anyway.. until io_uring and vmsplice are solved first, it's pointless
to even wonder about transient GUP pins.

> And yes, I had forgotten about that userfaultfd_pte_wp() because I was
> being myopic and only looking at wp_page_copy(). So using that as a
> way to make sure that a page doesn't go through COW is a good way to
> avoid the COW race, but I think that thing requires a bit in the page
> table which might be a problem on some architectures?

Correct, it requires a bit in the pgtable.

The bit is required in order to disambiguate which regions have been
marked for wrprotect faults and which didn't.

A practical example would be: fork() called by an app with a very
large vma with VM_UFFD_WP set.

There would be a storm of WP userfaults if we didn't have the software
bit to detect exactly which pte were wrprotected by uffd-wp and which
ones were wrprotected by fork.

That bit then sends the COW fault to a safe place if wrprotect is
running (the problem is we didn't see that the un-wrprotect would clear
the bit while the TLB flush of the wrprotect could still be pending).

The page_mapcount I guess hid that race to begin with, just like it
did in clear_refs_write.

uffd-wp is similar to

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-23 Thread Andrea Arcangeli
On Wed, Dec 23, 2020 at 02:45:59PM -0800, Nadav Amit wrote:
> I think it may be reasonable.

Whatever solution used, there will be 2 users of it: uffd-wp will use
whatever technique used by clear_refs_write to avoid the
mmap_write_lock.

My favorite is Yu's patch and not the group lock anymore. The cons is
it changes the VM rules (which kind of reminds me my initial proposal
of adding a spurious tlb flush if mm_tlb_flush_pending is set, except
I didn't correctly specify it'd need to go in the page fault), but it
still appears the simplest.

> Just a proposal: At some point we can also ask ourselves whether the
> “artificial" limitation of the number of software bits per PTE should really
> limit us, or do we want to hold some additional metadata per-PTE by either
> putting it in an adjacent page (holding 64-bits of additional software-bits
> per PTE) or by finding some place in the page-struct to link to this
> metadata (and have the liberty of number of bits per PTE). One of the PTE
> software-bits can be repurposed to say whether there is “extra-metadata”
> associated with the PTE.
> 
> I am fully aware that there will be some overhead associated, but it
> can be limited to less-common use-cases.

That's a good point. So far we didn't run out so it's not an
immediate concern (as opposed to page->flags, where we did run out and
the PG_tail went to some LSB).

In general kicking the can down the road sounds like the best thing to
do for those bit shortage matters, until we can't anymore at
least. There's no gain to the kernel runtime in doing something
generically good here (again see where PG_tail rightfully went).

So before spending RAM and CPU, we'd need to find a more compact
encoding with the bits we already have available.

This reminds me again we could double check if we could make
VM_UFFD_WP mutually exclusive with VM_SOFTDIRTY.

I wasn't sure if it could ever happen in a legit way to use both at
the same time (CRIU on an app using uffd-wp for its own internal mm
management?).

Theoretically it's too late already for it, but VM_UFFD_WP is
relatively new, if we're sure it cannot ever happen in a legit way, it
would be possible to at least evaluate/discuss it. This is an
immediate matter.
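
If we decided to enforce the mutual exclusion, the registration side
could be as simple as this (hypothetical sketch, the exact place in
userfaultfd_register() and the errno are just examples; the other
direction, refusing the softdirty clear on VM_UFFD_WP vmas, would be
needed too):

	/* in userfaultfd_register(), while validating each vma */
	if ((vm_flags & VM_UFFD_WP) && (cur->vm_flags & VM_SOFTDIRTY)) {
		ret = -EINVAL;
		goto out_unlock;
	}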

What we'll do if we later run out, is not an immediate matter instead,
because it won't make our life any simpler to resolve it now.

Thanks,
Andrea



Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-23 Thread Andrea Arcangeli
On Wed, Dec 23, 2020 at 03:29:51PM -0700, Yu Zhao wrote:
> I was hesitant to suggest the following because it isn't that straight
> forward. But since you seem to be less concerned with the complexity,
> I'll just bring it on the table -- it would take care of both ufd and
> clear_refs_write, wouldn't it?

It certainly would since this is basically declaring "leaving stale
TLB entries past mmap_read_lock" is now permitted as long as the
pending flush counter is elevated under mmap_sem for reading.

Anything that prevents uffd-wp from taking the mmap_write_lock looks
great to me, anything, the below included, as long as it looks easy to
enforce and understand. And the below certainly is.

My view is that the below is at the very least an improvement in terms
of total complexity, compared to v5.10. At least it'll be documented.

So what would be not ok to me is to depend on undocumented not
guaranteed behavior in do_wp_page like the page_mapcount, which is
what we had until now in clear_refs_write and in uffd-wp (but only if
wrprotect raced against un-wrprotect, a tiny window if compared to
clear_refs_write).

Documenting that clearing pte_write and deferring the flush is allowed
if mm_tlb_flush_pending was elevated before taking the PT lock is a less
complex and very well defined rule, compared to what we had before
with the page_mapcount dependency of clear_refs_write, which was prone
to break, and in fact it just did.

We'll need a commentary in both userfaultfd_writeprotect and
clear_refs_write that links to the below snippet.

If we abstract it in a header we can hide there also a #ifdef
CONFIG_HAVE_ARCH_SOFT_DIRTY=y && CONFIG_HAVE_ARCH_USERFAULTFD_WP=y &&
CONFIG_USERFAULTFD=y to make it even more explicit.

However it may be simpler to keep it unconditional, I don't mind
either way. If it was up to me I'd tie it to those 3 config options
to be sure I remember where it comes from and to force any other user
to register, to be explicit that they depend on that.
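
Something like the below is the kind of abstraction I mean (untested
sketch, the helper name is invented and the exact config combination is
to be decided, this only shows the shape):

/*
 * A wrprotect fault cannot assume the TLB is already in sync with a
 * !pte_write pte if soft dirty or uffd-wp deferred the flush past the
 * PT unlock with only the mmap_read_lock held.
 */
#if defined(CONFIG_MEM_SOFT_DIRTY) || \
	(defined(CONFIG_HAVE_ARCH_USERFAULTFD_WP) && defined(CONFIG_USERFAULTFD))
static inline void flush_stale_tlb_before_wp(struct vm_area_struct *vma,
					     unsigned long address)
{
	if (mm_tlb_flush_pending(vma->vm_mm))
		flush_tlb_page(vma, address);
}
#else
static inline void flush_stale_tlb_before_wp(struct vm_area_struct *vma,
					     unsigned long address)
{
}
#endif

Then the check you put in handle_pte_fault below would become a single
call to it.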

> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 5e9ca612d7d7..af38c5ee327e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>   goto unlock;
>   }
>   if (vmf->flags & FAULT_FLAG_WRITE) {
> - if (!pte_write(entry))
> + if (!pte_write(entry)) {
> + if (mm_tlb_flush_pending(vmf->vma->vm_mm))
> + flush_tlb_page(vmf->vma, vmf->address);
>   return do_wp_page(vmf);
> + }
>   entry = pte_mkdirty(entry);
>   }
>   entry = pte_mkyoung(entry);
> 



Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-23 Thread Andrea Arcangeli
On Tue, Dec 22, 2020 at 04:40:32AM -0800, Nadav Amit wrote:
> > On Dec 21, 2020, at 1:24 PM, Yu Zhao  wrote:
> > 
> > On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote:
> >> On Mon, Dec 21, 2020 at 12:23 PM Nadav Amit  wrote:
> >>> Using mmap_write_lock() was my initial fix and there was a strong pushback
> >>> on this approach due to its potential impact on performance.
> >> 
> >> From whom?
> >> 
> >> Somebody who doesn't understand that correctness is more important
> >> than performance? And that userfaultfd is not the most important part
> >> of the system?
> >> 
> >> The fact is, userfaultfd is CLEARLY BUGGY.
> >> 
> >>  Linus
> > 
> > Fair enough.
> > 
> > Nadav, for your patch (you might want to update the commit message).
> > 
> > Reviewed-by: Yu Zhao 
> > 
> > While we are all here, there is also clear_soft_dirty() that could
> > use a similar fix…
> 
> Just an update as for why I have still not sent v2: I fixed
> clear_soft_dirty(), created a reproducer, and the reproducer kept failing.
> 
> So after some debugging, it appears that clear_refs_write() does not flush
> the TLB. It indeed calls tlb_finish_mmu() but since 0758cd830494
> ("asm-generic/tlb: avoid potential double flush”), tlb_finish_mmu() does not
> flush the TLB since there is clear_refs_write() does not call to
> __tlb_adjust_range() (unless there are nested TLBs are pending).
> 
> So I have a patch for this issue too: arguably the tlb_gather interface is
> not the right one for clear_refs_write() that does not clear PTEs but
> changes them.
> 
> Yet, sadly, my reproducer keeps falling (less frequently, but still). So I
> will keep debugging to see what goes wrong. I will send v2 once I figure out
> what the heck is wrong in the code or my reproducer.

If you put the page_mapcount check back in do_wp_page instead of
page_count, it'll stop reproducing but the bug is still very much
there...

It's a feature that page_count finally shows you the corruption now, by
virtue of the page_count being totally unreliable with all the speculative
pagecache lookups randomly elevating it in the background.

The proof it worked by luck is that an unrelated change
(s/page_mapcount/page_count/) made the page fault behave slightly
different and broke clear_refs_write.

Even before page_mapcount was replaced with page_count, it has always
been forbidden to leave too permissive stale TLB entries out of sync
with a more restrictive pte/hugepmd permission past the PT unlock,
unless you're holding the mmap_write_lock.

So for example all rmap code has to flush before PT unlock release
too, usually it clears the pte as a whole but it's still a
downgrade.

The rmap_lock and the mmap_read_lock achieve the same: they keep the
vma stable but they can't stop the page fault from running (that's a
feature) so they have to flush inside the PT lock.

The tlb gather deals with preventing use after free (where userland
can modify kernel memory), but it cannot deal with the guarantee the
page fault requires.

So the clear_refs_write patch linked that alters the tlb flushing
appears a noop with respect to this bug. It cannot do anything to
prevent the page fault run with writable stale TLB entries with the
!pte_write.

If you don't add a marker here (it clears it, the exact opposite of
what should be happening), there's no way to avoid the mmap_write_lock
in my view.

static inline void clear_soft_dirty(struct vm_area_struct *vma,
unsigned long addr, pte_t *pte)
{
/*
 * The soft-dirty tracker uses #PF-s to catch writes
 * to pages, so write-protect the pte as well. See the
 * Documentation/admin-guide/mm/soft-dirty.rst for full description
 * of how soft-dirty works.
 */
pte_t ptent = *pte;

if (pte_present(ptent)) {
pte_t old_pte;

old_pte = ptep_modify_prot_start(vma, addr, pte);
ptent = pte_wrprotect(old_pte);
ptent = pte_clear_soft_dirty(ptent);
+   ptent = pte_mkuffd_wp(ptent);
ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);

One solution that would fix the userland mm corruption in
clear_refs_write is to take the mmap_read_lock, take some mutex
somewhere (vma/mm whatever), then in clear_soft_dirty make the above
modification adding the _PAGE_UFFD_WP, then flush tlb, release the
mutex and then release the mmap_read_lock.
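
In sequence that would be roughly (sketch only, the mutex is
hypothetical and its placement is just an example):

	/* clear_refs_write(), CLEAR_REFS_SOFT_DIRTY case */
	mmap_read_lock(mm);
	mutex_lock(&mm->softdirty_mutex);	/* hypothetical per-mm mutex */
	inc_tlb_flush_pending(mm);

	/* clear_soft_dirty() wrprotects and sets _PAGE_UFFD_WP as above */
	walk_page_range(mm, 0, -1UL, &clear_refs_walk_ops, &cp);

	flush_tlb_mm(mm);
	dec_tlb_flush_pending(mm);
	mutex_unlock(&mm->softdirty_mutex);
	mmap_read_unlock(mm);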

Then here:

if (userfaultfd_pte_wp(vma, *vmf->pte)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
+   if (vma->vm_flags & VM_SOFTDIRTY)
+  return handle_soft_dirty(vma);
return handle_userfault(vmf, VM_UFFD_WP);

Of course handle_soft_dirty will have to take the mutex once (1
mutex_lock/unlock cycle to run after any pending flush).

And then we'll have to enforce uffd-wp cannot be registered if
VM_SOFTDIRTY is set or the other way around so that VM_UFFD* is
mutually exclusive with VM_SOFTDIRTY. 

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-23 Thread Andrea Arcangeli
On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
> Thanks for the details.

I hope we can find a way to put the page_mapcount back where there's a
page_count right now.

If you're so worried about having to maintain a well defined, well
documented (or to be documented even better if you ACK it)
marker/catcher for userfaultfd_writeprotect, I can't see how you could
consider keeping the page fault safe against any random code
leaving too permissive TLB entries out of sync with the more restrictive
pte permissions, as was happening with clear_refs_write, which
worked by luck until page_mapcount was changed to page_count.

page_count is far from optimal, but it is a feature it finally allowed
us to notice that various code (clear_refs_write included apparently
even after the fix) leaves stale too permissive TLB entries when it
shouldn't.

The question is only which way you prefer to fix clear_refs_write and
I don't think we can deviate from those 3 methods that already exist
today. So clear_refs_write will have to pick one of those and
currently it's not falling in the same category with mprotect even
after the fix.

I think if clear_refs_write starts to take the mmap_write_lock and
really start to operate like mprotect, only then we can consider to
make userfaultfd_writeprotect also operate like mprotect.

Even then I'd hope we can at least be allowed to make it operate like
KSM write_protect_page for len <= HPAGE_PMD_SIZE, something that
clear_refs_write cannot do since it works in O(N) and tends to scan
everything at once, so for it there would be no point in the
optimization of not deferring the flush for a process with a tiny
amount of virtual memory mapped.

vm86 also should be fixed to fall in the same category with mprotect,
since performance there is irrelevant.

Thanks,
Andrea



Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-23 Thread Andrea Arcangeli
On Wed, Dec 23, 2020 at 10:52:35AM -0500, Peter Xu wrote:
> On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
> > In your patch, do we need to take wrprotect_rwsem in
> > handle_userfault() as well? Otherwise, it seems userspace would have
> > to synchronize between its wrprotect ioctl and fault handler? i.e.,
> > the fault hander needs to be aware that the content of write-
> > protected pages can actually change before the iotcl returns.
> 
> The handle_userfault() thread should be sleeping until another uffd_wp_resolve
> fixes the page fault for it.  However when the uffd_wp_resolve ioctl comes,
> then rwsem (either the group rwsem lock as Andrea proposed, or the mmap_sem, 
> or
> any new rwsem lock we'd like to introduce, maybe per-uffd rather than per-mm)
> should have guaranteed the previous wr-protect ioctls are finished and tlb 
> must
> have been flushed until this thread continues.
> 
> And I don't know why it matters even if the data changed - IMHO what uffd-wp

The data will change indeed and it's fine.

> wants to do is simply to make sure after wr-protect ioctl returns to 
> userspace,
> no change on the page should ever happen anymore.  So "whether data changed"
> seems matter more on the ioctl thread rather than the handle_userfault()
> thread.  IOW, I think data changes before tlb flush but after pte wr-protect 
> is
> always fine - but that's not fine anymore if the syscall returns.

Agreed.

From the userland point of view all that matters is that the writes
through the stale TLB entries will stop in both the two cases:

1) before returning from the UFFDIO_WRITEPROTECT(mode_wp = true) ioctl syscall

2) before a parallel UFFDIO_WRITEPROTECT(mode_wp = false) can clear
   the _PAGE_UFFD_WP marker in the pte/hugepmd under the PT lock,
   assuming the syscall at point 1) is still in flight

Both points are guaranteed at all times by the group lock now, so
userland cannot even measure or perceive the existence of any stale
TLB at any given time in the whole uffd-wp workload.

So it's perfectly safe and identical as NUMA balancing and requires
zero extra locking in handle_userfault().

handle_userfault() is a dead end that simply waits and when it's the
right time it restarts the page fault. It can have occasional false positives
after f9bf352224d7d4612b55b8d0cd0eaa981a3246cf, false positive as in
restarting too soon, but even then it's perfectly safe since it's
equivalent of one more CPU hitting the page fault path. As long as the
marker is there, any spurious userfault will re-enter
handle_userfault().

handle_userfault() doesn't care about the data and in turn it cannot
care less about any stale TLB either. Userland cares, but userland
cannot make any assumption about writes being fully stopped until the
ioctl has returned anyway, and by that time the pending flush will be
done; in fact by the time userland can make any assumption, the
mmap_write_lock would also have been released with the first proposed patch.

Thanks,
Andrea



Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-23 Thread Andrea Arcangeli
On Wed, Dec 23, 2020 at 01:51:59PM -0500, Andrea Arcangeli wrote:
> NOTE: about the above comment, that mprotect takes
> mmap_read_lock. Your above code change in the commit above, still has
    write

Correction to avoid any confusion.



Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-23 Thread Andrea Arcangeli
On Wed, Dec 23, 2020 at 11:24:16AM -0500, Peter Xu wrote:
> I think this is not against Linus's example - where cpu2 does not have tlb
> cached so it sees RO while cpu3 does have tlb cached so cpu3 can still modify
> it.  So IMHO there's no problem here.
> 
> But I do think in step 2 here we overlooked _PAGE_UFFD_WP bit.  Note that if
> it's uffd-wp wr-protection it's always applied along with removing of the 
> write
> bit in change_pte_range():
> 
> if (uffd_wp) {
> ptent = pte_wrprotect(ptent);
> ptent = pte_mkuffd_wp(ptent);
> }
> 
> So instead of being handled as COW page do_wp_page() will always trap
> userfaultfd-wp first, hence no chance to race with COW.
> 
> COW could only trigger after another uffd-wp-resolve ioctl which could remove
> the _PAGE_UFFD_WP bit, but with Andrea's theory unprotect will only happen
> after all wr-protect completes, which guarantees that when reaching the COW
> path the tlb must has been flushed anyways.  Then no one should be modifying
> the page anymore even without pgtable lock in COW path.
> 
> So IIUC what Linus proposed on "flushing tlb within pgtable lock" seems to
> work, but it just may cause more tlb flush than Andrea's proposal especially
> when the protection range is large (it's common to have a huge protection 
> range
> for e.g. VM live snapshotting, where we'll protect all guest rw ram).
> 
> My understanding of current issue is that either we can take Andrea's proposal
> (although I think the group rwsem may not be extremely better than a per-mm
> rwsem, which I don't know... at least not worst than that?), or we can also go
> the other way (also as Andrea mentioned) so that when wr-protect:
> 
>   - for <=2M range (pmd or less), we take read rwsem, but flush tlb within
> pgtable lock
> 
>   - for >2M range, we take write rwsem directly but flush tlb once

I fully agree indeed.

I still have to fully understand how the clear_refs_write was fixed.

clear_refs_write doesn't even have a "catcher" in the page fault, yet it
clears the pte_write under mmap_read_lock and it doesn't flush before
releasing the PT lock. It appears way more broken than
userfaultfd_writeprotect ever was.

static inline void clear_soft_dirty(struct vm_area_struct *vma,
unsigned long addr, pte_t *pte)
{
/*
 * The soft-dirty tracker uses #PF-s to catch writes
 * to pages, so write-protect the pte as well. See the
 * Documentation/admin-guide/mm/soft-dirty.rst for full description
 * of how soft-dirty works.
 */

   
https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-w...@kernel.org/

"Although this is fine when simply aging the ptes, in the case of
clearing the "soft-dirty" state we can end up with entries where
pte_write() is false, yet a writable mapping remains in the TLB.

Fix this by avoiding the mmu_gather API altogether: managing both the
'tlb_flush_pending' flag on the 'mm_struct' and explicit TLB
invalidation for the sort-dirty path, much like mprotect() does
already"

NOTE: about the above comment, that mprotect takes
mmap_read_lock. Your above code change in the commit above, still has
mmap_read_lock, there's still no similarity with mprotect whatsoever.

Did you test what happens when clear_refs_write leaves writable stale
TLB entries and do_wp_page copies the page while the other CPU is still
writing to the page? Has clear_refs_write a selftest as aggressive
and racy as the uffd selftest to reproduce that workload?

Fixing the race internally to uffd with the group lock is possible
thanks to the marker+catcher in NUMA balancing style. But that's not
there for clear_refs_write even after the above commit.

It doesn't appear that this patch is adding a synchronous tlb
flush inside the PT lock either.

Overall, it would be unfair if clear_refs_write is allowed to do the
same thing that userfaultfd_writeprotect has to do, but with the
mmap_read_lock, if userfaultfd_writeprotect will be forced to take
mmap_write_lock.

In my view there are 3 official ways to deal with this:

1) set a marker in the pte/hugepmd inside the PT lock, and add a
   catcher for the marker in the page fault to send the page fault to
   a dead end while there are stale TLB entries

   cases: userfaultfd_writeprotect() and NUMA balancing

2) take mmap_write_lock

   case: mprotect

3) flush the TLB before the PT lock release

   case: KSM write_protect_page (roughly sketched below)
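
For reference this is more or less the pattern of case 3), paraphrased
and simplified from write_protect_page() in mm/ksm.c (it is not the
patch discussed next):

	entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
	/*
	 * After the flush above no CPU can still be writing through a
	 * stale writable TLB entry, so wrprotecting and checking the
	 * dirty bit here is race free.
	 */
	if (pte_dirty(entry))
		set_page_dirty(page);
	entry = pte_mkclean(pte_wrprotect(entry));
	set_pte_at_notify(mm, pvmw.address, pvmw.pte, entry);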

Where does the below patch land in the 3 cases? It doesn't fit any of
them and what it does looks still not sufficient to fix the userfault
memory corruption.

 
https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-w...@kernel.org/

It'd be backwards if the complexity of the kernel was increased for
clear_refs_write, but forbidden for the O(1) API that delivers the
exact same information (clear_refs_write API delivers that info in
O(N)).

We went the last mile to guarantee uffd can be 

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-22 Thread Andrea Arcangeli
On Tue, Dec 22, 2020 at 05:23:39PM -0700, Yu Zhao wrote:
> and 2) people are spearheading multiple efforts to reduce the mmap_lock
> contention, which hopefully would make ufd users suffer less soon.

In my view UFFD is an already deployed working solution that
eliminates the mmap_lock_write contention to allocate and free memory.

We need to add a UFFDIO_POPULATE to use in combination with
UFFD_FEATURE_SIGBUS (UFFDIO_POPULATE just needs to zero out a page or
THP and map it, it'll be indistinguishable from UFFDIO_ZEROPAGE, but it
will solve the last performance bottleneck by avoiding a spurious
wrprotect fault after the allocation).
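
Just to give an idea of the shape of it, the uAPI could simply mirror
UFFDIO_ZEROPAGE (entirely hypothetical, nothing of this exists today and
the ioctl number would have to be newly assigned):

	/* hypothetical, modeled on struct uffdio_zeropage */
	struct uffdio_populate {
		struct uffdio_range range;
		__u64 mode;
		/* bytes populated, or negated error, filled in by the kernel */
		__s64 populated;
	};

	/* hypothetical nr, whatever slot is free at merge time */
	#define _UFFDIO_POPULATE	(0x08)
	#define UFFDIO_POPULATE		_IOWR(UFFDIO, _UFFDIO_POPULATE, \
					      struct uffdio_populate)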

After that, malloc based on uffd should become competitive single
threaded and it won't ever require the mmap_lock_write, so allocations
and freeing of memory can continue indefinitely from all threads in
parallel. There will never be another mmap or munmap stalling all
threads.

This is not why uffd was created, it's just a secondary performance
benefit of uffd, but it's still a relevant benefit in my view.

Every time I hear people with major mmap_lock_write issues I recommend
uffd, but you know, until we add the UFFDIO_POPULATE, it will still
have a higher fixed allocation overhead because of the wrprotect fault
after UFFDIO_ZEROPAGE. UFFDIO_COPY also would not be as optimal as a
clear_page and currently it's not even THP capable.

In addition you'll get a SIGBUS after a use-after-free. It's not
like when you have a malloc lib doing MADV_DONTNEED at PAGE_SIZE
granularity to rate limit the costly munmap, and then the app does a
use-after-free and reads zero or writes to a newly faulted in page.

The above will not require any special privilege and all allocated
virtual memory remains fully swappable, because SIGBUS mode will never
have to block any kernel initiated faults.

uffd-wp also is totally usable unprivileged by default to replace
various software dirty bits with the info provided in O(1) instead of
O(N), as long as the writes are done in userland: also unprivileged by
default, without tweaking any sysctl, and with zero risk of increasing
the reproducibility of any exploit against unrelated random kernel bugs.

So if we're forced to take the mmap_lock_write it'd be cool if at
least we can avoid it for 1 single pte or hugepmd wrprotection, as it
happens in write_protect_page() KSM.

Thanks,
Andrea



Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-22 Thread Andrea Arcangeli
On Tue, Dec 22, 2020 at 04:39:46PM -0700, Yu Zhao wrote:
> We are talking about non-COW anon pages here -- they can't be mapped
> more than once. So why not just identify them by checking
> page_mapcount == 1 and then unconditionally reuse them? (This is
> probably where I've missed things.)

The problem in depending on page_mapcount to decide if it's COW or
non-COW (respectively wp_page_copy or wp_page_reuse) is that GUP
may elevate the count of a COW anon page that later becomes a non-COW
anon page.

This is Jann's idea not mine.

The problem is we have an unprivileged long term GUP like vmsplice
that facilitates elevating the page count indefinitely, until the
parent finally writes a secret to it. Theoretically a short term pin
would do it too so it's not just vmsplice, but the short term pin
would be incredibly more challenging to turn into a real concern since
it'd kill a phone battery and flash before it can read any data.

So what happens with your page_mapcount == 1 check is that it doesn't
mean non-COW (we thought it did until it didn't for the long term gup
pin in vmsplice).

Jann's testcase does fork(), setting page_mapcount to 2 and page_count to
2, then vmsplice takes an unprivileged, indefinitely long GUP pin raising
page_count to 3 and queueing the page in the pipe with page_count
elevated, then munmap drops page_count to 2 and page_mapcount to 1.

page_mapcount is 1, so you'd think the page is non-COW and owned by
the parent, but the child can still read it so it's very much still
wp_page_copy material if the parent tries to modify it. Otherwise the
child can read the content.
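
For reference, the skeleton of that sequence in userland looks more or
less like this (just a sketch of the idea, not Jann's actual testcase,
error handling omitted):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/uio.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(void)
	{
		char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		struct iovec iov = { .iov_base = page, .iov_len = 4096 };
		char buf[4096];
		int pipefd[2];

		memset(page, 'a', 4096);
		pipe(pipefd);

		if (fork() == 0) {
			/* child: GUP pin the COW-shared page into the pipe */
			vmsplice(pipefd[1], &iov, 1, 0);
			/* mapcount drops back to 1, the pin stays */
			munmap(page, 4096);
			_exit(0);
		}
		wait(NULL);

		/*
		 * parent: page_mapcount == 1 here, so a mapcount based
		 * check would wrongly reuse the page in place...
		 */
		strcpy(page, "secret");

		/* ...and the pinned pipe buffer now leaks the secret */
		read(pipefd[0], buf, sizeof(buf));
		return 0;
	}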

This was supposed to be solvable by just doing the COW in the gup(write=0)
case if page_mapcount > 1 with commit 17839856fd58. I'm not exactly
sure why that didn't fly and it had to be reverted by Peter in
a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a but at the time this was
happening I was sidetracked by urgent issues and I didn't manage to
look back at how we ended up with the big hammer page_count == 1 check
instead to decide whether to call wp_page_reuse or wp_page_shared.

So anyway, the only thing that is clear to me is that keeping the
child from reading the page_mapcount == 1 pages of the parent, is the
only reason why wp_page_reuse(vmf) will only be called on
page_count(page) == 1 and not on page_mapcount(page) == 1.

It's also the reason why your page_mapcount assumption risks
reintroducing the issue, and I only wish we could put page_mapcount
== 1 back there.

Still even if we put back page_mapcount there, it is not ok to leave
the page fault with stale TLB entries and to rely on the fact
wp_page_shared won't run. It'd also avoid the problem but I think if
you leave stale TLB entries in change_protection just like NUMA
balancing does, it also requires a catcher just like NUMA balancing
has, or it'd truly work by luck.

So until we can put a page_mapcount == 1 check back there, the
page_count will be by definition unreliable because of the speculative
lookups randomly elevating all non zero page_counts at any time in the
background on all pages, so you will never be able to tell if a page
is true COW or if it's just a spurious COW because of a speculative
lookup. It is impossible to differentiate a speculative lookup from a
vmsplice ref in a child.

Thanks,
Andrea



Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-22 Thread Andrea Arcangeli
On Tue, Dec 22, 2020 at 02:14:41PM -0700, Yu Zhao wrote:
> This works but I don't prefer this option because 1) this is new
> way of making pte_wrprotect safe and 2) it's specific to ufd and
> can't be applied to clear_soft_dirty() which has no "catcher". No

I didn't look into clear_soft_dirty issue, I can look into that.

To avoid having to deal with a 3rd solution it will have to use one of
the two:

1) avoid deferring tlb flush and enforce a sync flush before dropping
  the PT lock even if mm_mm_tlb_flush_pending is true ->
  write_protect_page in KSM

2) add its own new catcher for its own specific marker (for UFFD-WP
   the marker is _PAGE_UFFD_WP, for change_prot_numa is PROT_NONE on a
   vma->vm_pgprot not PROT_NONE, for soft dirty it could be
   _PAGE_SOFT_DIRTY) to send the page fault to a dead end before the
   pte value is interpreted.

> matter how good the documentation about this new way is now, it
> will be out of date, speaking from my personal experience.

A common catcher for all 3 is not possible because each catcher
depends on whatever marker and whatever pte value they set that may
lead to a different deterministic path where to put the catcher or
multiple paths even. do_numa_page requires a catcher in a different
place already.

Or well, a common catcher for all 3 is technically possible but it'd
perform slower requiring to check things twice.

But perhaps the soft_dirty can use the same catcher of uffd-wp given
the similarity?

> I'd go with what Nadav has -- the memory corruption problem has been
> there for a while and nobody has complained except Nadav. This makes
> me think people is less likely to notice any performance issues from
> holding mmap_lock for write in his patch either.

uffd-wp is a fairly new feature, the current users are irrelevant,
keeping it optimal is just for the future potential.

> But I can't say I have zero concern with the potential performance
> impact given that I have been expecting the fix to go to stable,
> which I care most. So the next option on my list is to have a

Actually stable would be very fine to go with Nadav patch and use the
mmap_lock_write unconditionally. The optimal fix is only relevant for
the future potential, so it's only relevant for Linus's tree.

However the feature is recent enough that it won't require a deep
backport so the optimal fix is likely fine for stable as well,
generally stable prefers the same fix as in the upstream when there's
no major backport rejection issue.

The alternative solution for uffd is to do the deferred flush under
mmap_lock_write if len is > HPAGE_PMD_SIZE, or to tell
change_protection not to defer the flush and to take the
mmap_lock_read for <= HPAGE_PMD_SIZE. That will avoid depending on the
catcher and then userfaultfd_writeprotect(mode_wp=true)
userfaultfd_writeprotect(mode_wp=false) can even run in parallel at
all times. The cons is large userfaultfd_writeprotect will block for
I/O and those would happen at least in the postcopy live snapshotting
use case.

The main cons is that it'd require modification to change_protection
so it actually looks more intrusive, not less.

Overall anything that allows to wrprotect 1 pte with only the
mmap_lock_read exactly like KSM write_protect_page, would be enough for
uffd-wp.

What isn't ok in terms of future potential is the unconditional
mmap_lock_write as in the originally suggested patch, in my view. It
doesn't mean we can't take mmap_lock_write when the operation is large
and there is actually more benefit from deferring the flush.

> common "catcher" in do_wp_page() which singles out pages that have
> page_mapcount equal to one and reuse them instead of trying to

I don't think the page_mapcount matters here. If the wp page reuse was
more accurate (as it was before) we wouldn't notice this issue, but it
still would be a bug that there were stale TLB entries. It worked by
luck.

Thanks,
Andrea



Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-22 Thread Andrea Arcangeli
On Tue, Dec 22, 2020 at 12:58:18PM -0800, Nadav Amit wrote:
> I had somewhat similar ideas - saving in each page-struct the generation,
> which would allow to: (1) extend pte_same() to detect interim changes
> that were reverted (RO->RW->RO) and (2) per-PTE pending flushes.

What don't you feel safe about, what's the problem with RO->RW->RO? I
don't get it.

The pte_same is perfectly ok without a sequence counter in my view, I've
never seen anything that would not be ok with pte_same given all the
invariants are respected. It's actually a great optimization compared
to any unscalable sequence counter.

The counter would slow down everything: having to increase a counter
every time you change a pte, no matter if it's a counter per pgtable
or per-vma or per-mm, sounds very bad.

I'd rather prefer to take mmap_lock_write across the whole userfaultfd
ioctl, than having to deal with a new sequence counter increase for
every pte modification on a heavily contended cacheline.

Also note the counter would have solved nothing for
userfaultfd_writeprotect, it's useless to detect stale TLB entries.

See how !pte_write check happens after the counter was already increased:

CPU 0                   CPU 1                   CPU 2
-----                   -----                   -----
userfaultfd_wrprotect(mode_wp = true)
PT lock
atomic set _PAGE_UFFD_WP and clear _PAGE_WRITE
false_shared_counter_counter++
                 ^^
PT unlock

                        do_page_fault FAULT_FLAG_WRITE
                                                userfaultfd_wrprotect(mode_wp = false)
                                                PT lock
                                                ATOMIC clear _PAGE_UFFD_WP <- problem
                                                /* _PAGE_WRITE not set */
                                                false_shared_counter_counter++
                                                                 ^^
                                                PT unlock
                                                XX BUG RACE window open here

                        PT lock
                        counter = false_shared_counter_counter
                                            ^^
                        FAULT_FLAG_WRITE is set by CPU
                        _PAGE_WRITE is still clear in pte
                        PT unlock

                        wp_page_copy
                        copy_user_page runs with stale TLB

                        pte_same(counter, orig_pte, pte) -> PASS
                                 ^^^
                        commit the copy to the pte with the lost writes

                                                deferred tlb flush <- too late
                                                XX BUG RACE window close here




Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-22 Thread Andrea Arcangeli
On Tue, Dec 22, 2020 at 12:19:49PM -0800, Nadav Amit wrote:
> Perhaps any change to PTE in a page-table should increase a page-table
> generation that we would save in the page-table page-struct (next to the

The current rule is that by the time in the page fault we find a
pte/hugepmd in certain state under the "PT lock", the TLB cannot be
left stale with a more permissive state at that point.

So if under "PT lock" the page fault sees !pte_write, it means no
other CPU can possibly modify the page anymore.

That must be enforced at all times, no sequence number is required if
you enforce that and the whole point of the fix is to enforce that.

This is how things always worked in the page fault and it's perfectly
fine.

For the record I never suggested to change anything in the page fault
code.

So the only way we can leave stale TLB after dropping PT lock with the
mmap_read_lock and concurrent page faults is with a marker:

1) take PT lock
2) leave in the pte/hugepmd a unique identifiable marker
3) change the pte to downgrade permissions
4) drop PT lock and leave stale TLB entries with the upgraded permissions

In point 3) the pte is built in a way that is guaranteed to trigger a
deterministic path in the page fault. And in the way of that
deterministic path you put the marker check for 2) to send the fault
to a dead-end where the stale TLB is actually irrelevant and harmless.

No change to the page fault is needed if you only enforce the above.

Even if we'd take the mmap_write_lock in userfaultfd_writeprotect, you
will still have to deal with the above technique because of
change_prot_numa().

Would you suggest to add a sequence counter and to change all pte_same
in order to make change_prot_numa safer or is it safe enough already
using the above technique that checks the marker in the page fault?

To be safe NUMA balancing is using the same mechanism above of sending
the page fault into a dead end, in order to call the very same
function (change_protection) with the very same lock (mmap_read_lock)
as userfaultfd_writeprotect.

What happens then in do_numa_page then is different than what happens
in handle_userfault, but both are a dead ends as far as the page fault
code is concerned and they will never come back to the page fault
code. That's how they avoid breaking the page fault.

The final rule for the above logic to work safe for uffd, is that the
marker cannot be cleared until after the deferred TLB flush is
executed (that's where the window for the race existed, and the fix
closed such window).

do_numa_page differs in clearing the marker while the TLB flush is still
pending, but it can do that because it puts back the same exact pte value
(with the upgraded permissions) that was there before it put the
marker in the pte. In turn do_numa_page makes the pending TLB flush
become a noop and it doesn't need to wait for it before removing the
marker. Does that last difference in how the marker is cleared warrant
considering NUMA balancing so radically different as to forbid
userfaultfd_writeprotect from using the same logic in the very same
function with the very same lock, in order to decrease the VM
complexity? In my view no.

Thanks,
Andrea



Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-22 Thread Andrea Arcangeli
On Tue, Dec 22, 2020 at 08:15:53PM +, Matthew Wilcox wrote:
> On Tue, Dec 22, 2020 at 02:31:52PM -0500, Andrea Arcangeli wrote:
> > My previous suggestion to use a mutex to serialize
> > userfaultfd_writeprotect with a mutex will still work, but we can run
> > as many wrprotect and un-wrprotect as we want in parallel, as long as
> > they're not simultaneous, we can do much better than a mutex.
> > 
> > Ideally we would need a new two_group_semaphore, where each group can
> > run as many parallel instances as it wants, but no instance of one
> > group can run in parallel with any instance of the other group. AFIK
> > such a kind of lock doesn't exist right now.
> 
> Kent and I worked on one for a bit, and we called it a red-black mutex.
> If team red had the lock, more members of team red could join in.
> If team black had the lock, more members of team black could join in.
> I forget what our rule was around fairness (if team red has the lock,
> and somebody from team black is waiting, can another member of team red
> take the lock, or must they block?)

In this case they would need to block and provide full fairness.

Well maybe just a bit of unfairness (to let a few more through the
door before it shuts) wouldn't be a deal breaker but it would need to
be bounded or it'd starve the other color/side indefinitely. Otherwise
an ioctl mode_wp = true would block forever, if more mode_wp = false
ioctls keep coming in on other CPUs (or the other way around).

The approximation with an rwsem and two atomics provides full fairness in
both read and write mode (originally the read would starve the write
IIRC, which was an issue for all mprotect etc.. not anymore thankfully).
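
For the sake of discussion this is the shape of such a two-group lock
(pure sketch: it has no fairness at all, one group can starve the other
forever, which is exactly the problem described above):

	struct two_group_lock {
		spinlock_t		lock;
		int			active_group;	/* -1 when idle */
		unsigned long		holders;
		wait_queue_head_t	wait;
	};

	static void two_group_lock_acquire(struct two_group_lock *tgl, int group)
	{
		spin_lock(&tgl->lock);
		/* wait until the lock is idle or already owned by our group */
		while (tgl->active_group != -1 && tgl->active_group != group) {
			spin_unlock(&tgl->lock);
			wait_event(tgl->wait,
				   READ_ONCE(tgl->active_group) == -1 ||
				   READ_ONCE(tgl->active_group) == group);
			spin_lock(&tgl->lock);
		}
		tgl->active_group = group;
		tgl->holders++;
		spin_unlock(&tgl->lock);
	}

	static void two_group_lock_release(struct two_group_lock *tgl)
	{
		spin_lock(&tgl->lock);
		if (!--tgl->holders) {
			tgl->active_group = -1;
			wake_up_all(&tgl->wait);
		}
		spin_unlock(&tgl->lock);
	}

Making it fair (so a waiting group eventually shuts the door on the
running one, but only after a bounded number of late joiners) is where
all the complexity would be.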

> It was to solve the direct-IO vs buffered-IO problem (you can have as many
> direct-IO readers/writers at once or you can have as many buffered-IO
> readers/writers at once, but exclude a mix of direct and buffered I/O).
> In the end, we decided it didn't work all that well.

Well mixing buffered and direct-IO is certainly not a good practice so
it's reasonable to leave it up to userland to serialize if such a mix is
needed; the kernel behavior is undefined if the mix is concurrent and out
of order.



Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-22 Thread Andrea Arcangeli
On Mon, Dec 21, 2020 at 02:55:12PM -0800, Nadav Amit wrote:
> wouldn’t mmap_write_downgrade() be executed before mprotect_fixup() (so

I assume you mean "in" mprotect_fixup, after change_protection.

If you downgraded the mmap_lock to read there, it'd severely slow down
the non-contention case if there's more than one vma that needs
change_protection.

You'd need to throw away the prev->vm_next info and you'd need to do a
new find_vma after dropping the mmap_lock for reading and re-taking the
mmap_lock for writing at every iteration of the loop.

To do less harm to the non-contention case you could perhaps walk
vma->vm_next and check if it's outside the mprotect range and only
downgrade in such case. So let's assume we intend to optimize with
mmap_write_downgrade only the last vma.

The problem is once you had to take the mmap_lock for writing, you already
stalled for I/O and waited for all concurrent page faults and blocked them
as well during the vma allocations in split_vma, so that extra boost in
SMP scalability you get is lost in the noise there at best.

And the risk is that at worst that extra locked op of
mmap_write_downgrade() will hurt SMP scalability because it would
increase the locked ops of mprotect on the hottest false-shared
cacheline by 50% and that may outweigh the benefit from unblocking
the page faults half a usec sooner on large systems.

But the ultimate reason why mprotect cannot do mmap_write_downgrade()
while userfaultfd_writeprotect can do mmap_read_lock and avoid the
mmap_write_lock altogether, is that mprotect leaves no mark in the
pte/hugepmd that allows to detect when the TLB is stale in order to
redirect the page fault in a dead end (handle_userfault() or
do_numa_page) until after the TLB has been flushed as it happens in
the the 4 cases below:

/*
 * STALE_TLB_WARNING: while the uffd_wp bit is set, the TLB
 * can be stale. We cannot allow do_wp_page to proceed or
 * it'll wrongly assume that nobody can still be writing to
 * the page if !pte_write.
 */
if (userfaultfd_pte_wp(vma, *vmf->pte)) {
/*
 * STALE_TLB_WARNING: while the uffd_wp bit is set,
 * the TLB can be stale. We cannot allow wp_huge_pmd()
 * to proceed or it'll wrongly assume that nobody can
 * still be writing to the page if !pmd_write.
 */
if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
/*
 * STALE_TLB_WARNING: if the pte is NUMA protnone the TLB can
 * be stale.
 */
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
/*
 * STALE_TLB_WARNING: if the pmd is NUMA
 * protnone the TLB can be stale.
 */
if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))

Thanks,
Andrea



Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-22 Thread Andrea Arcangeli
. This is one of the potential use cases, but there's plenty
more.

So the alternative would be to take mmap_read_lock for "len <=
HPAGE_PMD_SIZE" and the mmap_write_lock otherwise, and to add a
parameter in change_protection to tell it to flush before dropping the
PT lock and not defer the flush. So this is going to be more intrusive
in the VM and it's overall unnecessary.

The below is mostly untested... but it'd be good to hear some feedback
before doing more work in this direction.

From 4ace4d1b53f5cb3b22a5c2dc33facc4150b112d6 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli 
Date: Tue, 22 Dec 2020 14:30:16 -0500
Subject: [PATCH 1/1] mm: userfaultfd: avoid leaving stale TLB after
 userfaultfd_writeprotect()

change_protection() is called by userfaultfd_writeprotect() with the
mmap_lock_read like in change_prot_numa().

The page fault code in wp_page_copy() rightfully assumes that if the
CPU issued a write fault and the write bit in the pagetable is not set,
no CPU can write to the page. That's a wrong assumption after
userfaultfd_writeprotect(). It's also a wrong assumption after
change_prot_numa(), where the regular page fault code would likewise
assume that if the present bit is not set and the page fault is
running, there can be no stale TLB entry, but there still is.

So to stay safe, the page fault code must be prevented from running as
long as the TLB flush remains pending. That is already achieved by the
do_numa_page() path for change_prot_numa() and by the
userfaultfd_pte_wp() path for userfaultfd_writeprotect().

The problem that needs fixing is that an un-wrprotect
(i.e. userfaultfd_writeprotect() with UFFDIO_WRITEPROTECT_MODE_WP not
set) could run in between the original wrprotect
(i.e. userfaultfd_writeprotect() with UFFDIO_WRITEPROTECT_MODE_WP set)
and wp_copy_page, while the TLB flush remains pending.

In such case the userfaultfd_pte_wp() check will stop being effective
to prevent the regular page fault code to run.

CPU0                            CPU 1                           CPU 2
----                            -----                           -----
userfaultfd_wrprotect(mode_wp = true)
PT lock
atomic set _PAGE_UFFD_WP and clear _PAGE_WRITE
PT unlock

                                do_page_fault FAULT_FLAG_WRITE
                                                                userfaultfd_wrprotect(mode_wp = false)
                                                                PT lock
                                                                ATOMIC clear _PAGE_UFFD_WP <- problem
                                                                /* _PAGE_WRITE not set */
                                                                PT unlock
                                                                XX BUG RACE window open here

                                PT lock
                                FAULT_FLAG_WRITE is set by CPU
                                _PAGE_WRITE is still clear in pte
                                PT unlock

                                wp_page_copy
                                copy_user_page runs with stale TLB

deferred tlb flush <- too late
XX BUG RACE window close here


If the userfaultfd_wrprotect(mode_wp = false) can only run after the
deferred TLB flush the above cannot happen either, because the uffd_wp
bit will remain set until after the TLB flush and the page fault would
reliably hit the handle_userfault() dead end as long as the TLB is
stale.

So to fix this we need to prevent any un-wrprotect to start until the
last outstanding wrprotect completed and to prevent any further
wrprotect until the last outstanding un-wrprotect completed. Then
wp_page_copy can still run in parallel but only with the un-wrprotect,
and that's fine since it's a permission promotion.

We would need a new two_group_semaphore, where each group can run as
many parallel instances as it wants, but no instance of one group can
run in parallel with any instance of the other group. The below rwsem
with two atomics approximates that kind of lock.

Signed-off-by: Andrea Arcangeli 
---
 fs/userfaultfd.c | 39 +++
 mm/memory.c  | 20 
 2 files changed, 59 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 894cc28142e7..3729ca99dae5 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -76,6 +76,20 @@ struct userfaultfd_ctx {
bool mmap_changing;
/* mm with one ore more vmas attached to this userfaultfd_ctx */
struct mm_struct *mm;
+   /*
+* We cannot let userfaultfd_writeprotect(mode_wp = false)
+* clear _PAGE_UFFD_WP from the pgtable, until the last
+* outstanding userfaultfd_writeprotect(mode_wp = true)
+* completes, because the TLB flush is deferred. The page
+* fault must be forced to enter handle_userfault() while the
+* TLB flush is deferred, and that's achieved with the
+* _PAGE_UFFD_WP bit. The _PAGE_UFFD_WP can only be cleared
+* after there are no stale TLB entries left, only then it'll
+* be safe again for th

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-21 Thread Andrea Arcangeli
Hello,

On Sat, Dec 19, 2020 at 09:08:55PM -0800, Andy Lutomirski wrote:
> On Sat, Dec 19, 2020 at 6:49 PM Andrea Arcangeli  wrote:
> > The ptes are changed always with the PT lock, in fact there's no
> > problem with the PTE updates. The only difference with mprotect
> > runtime is that the mmap_lock is taken for reading. And the effect
> > contested for this change doesn't affect the PTE, but supposedly the
> > tlb flushing deferral.
> 
> Can you point me at where the lock ends up being taken in this path?

pte_offset_map_lock in change_pte_range, as in mprotect, no difference.

As I suspected on my follow up, the bug described wasn't there, but
I'll look at the new theory posted.

Thanks,
Andrea



Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-19 Thread Andrea Arcangeli
On Sat, Dec 19, 2020 at 06:01:39PM -0800, Andy Lutomirski wrote:
> I missed the beginning of this thread, but it looks to me like
> userfaultfd changes PTEs with not locking except mmap_read_lock().  It

There's no mmap_read_lock, I assume you mean mmap_lock for reading.

The ptes are changed always with the PT lock, in fact there's no
problem with the PTE updates. The only difference with mprotect
runtime is that the mmap_lock is taken for reading. And the effect
contested for this change doesn't affect the PTE, but supposedly the
tlb flushing deferral.

The change_protection_range is identical to what already happens with
zap_page_range. zap_page_range is called with mmap_lock for reading
in MADV_DONTNEED, and by munmap with mmap_lock for
writing. change_protection_range is called with mmap_lock for writing
by mprotect, and mmap_lock for reading by UFFDIO_WRITEPROTECT.

> also calls inc_tlb_flush_pending(), which is very explicitly
> documented as requiring the pagetable lock.  Those docs must be wrong,

The comment in inc_tlb_flush_pending() shows the pagetable lock is
taken after inc_tlb_flush_pending():

 *  atomic_inc(&mm->tlb_flush_pending);
 *  spin_lock(&ptl);

> because mprotect() uses the mmap_sem write lock, which is just fine,
> but ISTM some kind of mutual exclusion with proper acquire/release
> ordering is indeed needed.  So the userfaultfd code seems bogus.

If there's a bug, it'd be nice to fix without taking the mmap_lock for
writing.

The vma is guaranteed not to be modified, so I think it'd be pretty bad
if we had to give in and take the mmap_lock for writing just to wait
for a tlb flush that is issued deferred in the context of
userfaultfd_writeprotect.

> I think userfaultfd either needs to take a real lock (probably doesn't
> matter which) or the core rules about PTEs need to be rewritten.

It's not exactly clear how the do_wp_page could run on cpu1 before the
unprotect did an extra flush, I guess the trace needs one more
cpu/thread?

Anyway, to make the unprotect wait for the wrprotect's deferred flush
before it can even start, one more mutex in the mm, taken in all
callers of change_protection_range that run with the mmap_lock for
reading, may be enough.
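
A minimal sketch of that, with a hypothetical per-mm mutex (the field
name is made up) wrapped around the locking already done in
mwriteprotect_range(); illustrative only, not a tested patch:

	mmap_read_lock(dst_mm);
	/* serialize wrprotect vs un-wrprotect without mmap_write_lock */
	mutex_lock(&dst_mm->uffd_wp_serialize);		/* hypothetical field */
	change_protection(dst_vma, start, start + len, newprot,
			  enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
	/* no stale TLB entry may survive before the next caller runs */
	flush_tlb_range(dst_vma, start, start + len);
	mutex_unlock(&dst_mm->uffd_wp_serialize);
	mmap_read_unlock(dst_mm);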

Thanks,
Andrea



Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-19 Thread Andrea Arcangeli
On Sat, Dec 19, 2020 at 02:06:02PM -0800, Nadav Amit wrote:
> > On Dec 19, 2020, at 1:34 PM, Nadav Amit  wrote:
> > 
> > [ cc’ing some more people who have experience with similar problems ]
> > 
> >> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli  wrote:
> >> 
> >> Hello,
> >> 
> >> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
> >>> Analyzing this problem indicates that there is a real bug since
> >>> mmap_lock is only taken for read in mwriteprotect_range(). This might
> >> 
> >> Never having to take the mmap_sem for writing, and in turn never
> >> blocking, in order to modify the pagetables is quite an important
> >> feature in uffd that justifies uffd instead of mprotect. It's not the
> >> most important reason to use uffd, but it'd be nice if that guarantee
> >> would remain also for the UFFDIO_WRITEPROTECT API, not only for the
> >> other pgtable manipulations.
> >> 
> >>> Consider the following scenario with 3 CPUs (cpu2 is not shown):
> >>> 
> >>> cpu0  cpu1
> >>>   
> >>> userfaultfd_writeprotect()
> >>> [ write-protecting ]
> >>> mwriteprotect_range()
> >>> mmap_read_lock()
> >>> change_protection()
> >>> change_protection_range()
> >>>  ...
> >>>  change_pte_range()
> >>>  [ defer TLB flushes]
> >>>   userfaultfd_writeprotect()
> >>>mmap_read_lock()
> >>>change_protection()
> >>>[ write-unprotect ]
> >>>...
> >>> [ unprotect PTE logically ]

Is the uffd selftest failing with upstream or after your kernel
modification that removes the tlb flush from unprotect?

} else if (uffd_wp_resolve) {
/*
 * Leave the write bit to be handled
 * by PF interrupt handler, then
 * things like COW could be properly
 * handled.
 */
ptent = pte_clear_uffd_wp(ptent);
}

Upstream this will still do pages++, and there's a tlb flush before
change_protection can return here, so I'm confused.


> >>>   ...
> >>>   [ page-fault]
> >>>   ...
> >>>   wp_page_copy()
> >>>   [ set new writable page in PTE]
> >> 
> >> Can't we check mm_tlb_flush_pending(vma->vm_mm) if MM_CP_UFFD_WP_ALL
> >> is set and do an explicit (potentially spurious) tlb flush before
> >> write-unprotect?
> > 
> > There is a concrete scenario that I actually encountered and then there is a
> > general problem.
> > 
> > In general, the kernel code assumes that PTEs that are read from the
> > page-tables are coherent across all the TLBs, excluding permission promotion
> > (i.e., the PTE may have higher permissions in the page-tables than those
> > that are cached in the TLBs).
> > 
> > We therefore need to both: (a) protect change_protection_range() from the
> > changes of others who might defer TLB flushes without taking mmap_sem for
> > write (e.g., try_to_unmap_one()); and (b) to protect others (e.g.,
> > page-fault handlers) from concurrent changes of change_protection().
> > 
> > We have already encountered several similar bugs, and debugging such issues
> > s time consuming and these bugs impact is substantial (memory corruption,
> > security). So I think we should only stick to general solutions.
> > 
> > So perhaps your the approach of your proposed solution is feasible, but it
> > would have to be applied all over the place: we will need to add a check for
> > mm_tlb_flush_pending() and conditionally flush the TLB in every case in
> > which PTEs are read and there might be an assumption that the
> > access-permission reflect what the TLBs hold. This includes page-fault
> > handlers, but also NUMA migration code in change_protection(), softdirty
> > cleanup in clear_refs_write() and maybe others.
> > 
> > [ I have in mind another solution, such as keeping in each page-table a 
> > “table-generation” which is the mm-generation at the time of the change,
> > and only fl

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-19 Thread Andrea Arcangeli
Hello,

On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
> Analyzing this problem indicates that there is a real bug since
> mmap_lock is only taken for read in mwriteprotect_range(). This might

Never having to take the mmap_sem for writing, and in turn never
blocking, in order to modify the pagetables is quite an important
feature in uffd that justifies uffd instead of mprotect. It's not the
most important reason to use uffd, but it'd be nice if that guarantee
would remain also for the UFFDIO_WRITEPROTECT API, not only for the
other pgtable manipulations.

> Consider the following scenario with 3 CPUs (cpu2 is not shown):
> 
> cpu0  cpu1
>   
> userfaultfd_writeprotect()
> [ write-protecting ]
> mwriteprotect_range()
>  mmap_read_lock()
>  change_protection()
>   change_protection_range()
>...
>change_pte_range()
>[ defer TLB flushes]
>   userfaultfd_writeprotect()
>mmap_read_lock()
>change_protection()
>[ write-unprotect ]
>...
> [ unprotect PTE logically ]
>   ...
>   [ page-fault]
>   ...
>   wp_page_copy()
>   [ set new writable page in PTE]

Can't we check mm_tlb_flush_pending(vma->vm_mm) if MM_CP_UFFD_WP_ALL
is set and do an explicit (potentially spurious) tlb flush before
write-unprotect?
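
A minimal sketch of what such a check could look like, assuming it
would sit in the uffd-wp resolve path of change_pte_range()
(illustrative only, not a tested patch):

	/*
	 * Hypothetical: before clearing the uffd-wp bit, flush any TLB
	 * entries a previous, still-pending deferred flush left stale,
	 * so do_wp_page() cannot observe a !pte_write pte while a
	 * stale writable TLB entry is still live on another CPU.
	 */
	if ((cp_flags & MM_CP_UFFD_WP_RESOLVE) &&
	    mm_tlb_flush_pending(vma->vm_mm))
		flush_tlb_range(vma, addr, end);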

Thanks,
Andrea



Re: [PATCH v2 1/2] mm: memblock: enforce overlap of memory.memblock and memory.reserved

2020-12-14 Thread Andrea Arcangeli
On Mon, Dec 14, 2020 at 12:18:07PM +0100, David Hildenbrand wrote:
> On 14.12.20 12:12, Mike Rapoport wrote:
> > On Mon, Dec 14, 2020 at 11:11:35AM +0100, David Hildenbrand wrote:
> >> On 09.12.20 22:43, Mike Rapoport wrote:
> >>> From: Mike Rapoport 
> >>>
> >>> memblock does not require that the reserved memory ranges will be a subset
> >>> of memblock.memory.
> >>>
> >>> As the result there maybe reserved pages that are not in the range of any
> >>> zone or node because zone and node boundaries are detected based on
> >>> memblock.memory and pages that only present in memblock.reserved are not
> >>> taken into account during zone/node size detection.
> >>>
> >>> Make sure that all ranges in memblock.reserved are added to 
> >>> memblock.memory
> >>> before calculating node and zone boundaries.
> >>>
> >>> Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions 
> >>> rather that check each PFN")
> >>> Reported-by: Andrea Arcangeli 
> >>> Signed-off-by: Mike Rapoport 
> >>> ---
> >>>  include/linux/memblock.h |  1 +
> >>>  mm/memblock.c| 24 
> >>>  mm/page_alloc.c  |  7 +++
> >>>  3 files changed, 32 insertions(+)
> >>>
> >>> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> >>> index ef131255cedc..e64dae2dd1ce 100644
> >>> --- a/include/linux/memblock.h
> >>> +++ b/include/linux/memblock.h
> >>> @@ -120,6 +120,7 @@ int memblock_clear_nomap(phys_addr_t base, 
> >>> phys_addr_t size);
> >>>  unsigned long memblock_free_all(void);
> >>>  void reset_node_managed_pages(pg_data_t *pgdat);
> >>>  void reset_all_zones_managed_pages(void);
> >>> +void memblock_enforce_memory_reserved_overlap(void);
> >>>  
> >>>  /* Low level functions */
> >>>  void __next_mem_range(u64 *idx, int nid, enum memblock_flags flags,
> >>> diff --git a/mm/memblock.c b/mm/memblock.c
> >>> index b68ee86788af..9277aca642b2 100644
> >>> --- a/mm/memblock.c
> >>> +++ b/mm/memblock.c
> >>> @@ -1857,6 +1857,30 @@ void __init_memblock 
> >>> memblock_trim_memory(phys_addr_t align)
> >>>   }
> >>>  }
> >>>  
> >>> +/**
> >>> + * memblock_enforce_memory_reserved_overlap - make sure every range in
> >>> + * @memblock.reserved is covered by @memblock.memory
> >>> + *
> >>> + * The data in @memblock.memory is used to detect zone and node 
> >>> boundaries
> >>> + * during initialization of the memory map and the page allocator. Make
> >>> + * sure that every memory range present in @memblock.reserved is also 
> >>> added
> >>> + * to @memblock.memory even if the architecture specific memory
> >>> + * initialization failed to do so
> >>> + */
> >>> +void __init memblock_enforce_memory_reserved_overlap(void)
> >>> +{
> >>> + phys_addr_t start, end;
> >>> + int nid;
> >>> + u64 i;
> >>> +
> >>> + __for_each_mem_range(i, &memblock.reserved, &memblock.memory,
> >>> +  NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, &nid) {
> >>> + pr_warn("memblock: reserved range [%pa-%pa] is not in memory\n",
> >>> + &start, &end);
> >>> + memblock_add_node(start, (end - start), nid);
> >>> + }
> >>> +}
> >>> +
> >>>  void __init_memblock memblock_set_current_limit(phys_addr_t limit)
> >>>  {
> >>>   memblock.current_limit = limit;
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index eaa227a479e4..dbc57dbbacd8 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -7436,6 +7436,13 @@ void __init free_area_init(unsigned long 
> >>> *max_zone_pfn)
> >>>   memset(arch_zone_highest_possible_pfn, 0,
> >>>   sizeof(arch_zone_highest_possible_pfn));
> >>>  
> >>> + /*
> >>> +  * Some architectures (e.g. x86) have reserved pages outside of
> >>> +  * memblock.memory. Make sure these pages are taken into account
> >>> +  * when detecting zone and node boundaries
> >>> +  */
> >>> + memblock_enforce_memory_reserved_overlap();
> >>> +
> 

Re: [PATCH v2 2/2] mm: fix initialization of struct page for holes in memory layout

2020-12-09 Thread Andrea Arcangeli
Hello,

On Wed, Dec 09, 2020 at 11:43:04PM +0200, Mike Rapoport wrote:
> +void __init __weak memmap_init(unsigned long size, int nid,
> +unsigned long zone,
> +unsigned long range_start_pfn)
> +{
> + unsigned long start_pfn, end_pfn, hole_start_pfn = 0;
>   unsigned long range_end_pfn = range_start_pfn + size;
> + u64 pgcnt = 0;
>   int i;
>  
>   for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
>   start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
>   end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> + hole_start_pfn = clamp(hole_start_pfn, range_start_pfn,
> +range_end_pfn);
>  
>   if (end_pfn > start_pfn) {
>   size = end_pfn - start_pfn;
>   memmap_init_zone(size, nid, zone, start_pfn,
>MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
>   }
> +
> + if (hole_start_pfn < start_pfn)
> + pgcnt += init_unavailable_range(hole_start_pfn,
> + start_pfn, zone, nid);
> + hole_start_pfn = end_pfn;
>   }

After applying the new 1/2, the above loop seems to be functionally a
noop compared to what was in -mm yesterday, so the above looks great
as far as I'm concerned.

Unlike the simple fix this will not loop over holes that aren't part
of memblock.memory nor memblock.reserved and it drops the static
variable which would have required ordering and serialization.

By being functionally equivalent, it looks like it also suffers from
the same dependency on pfn 0 (and not just pfn 0) being reserved that
you pointed out earlier.

I suppose that to drop that further dependency we need a further round
down in this logic to the start of the pageblock_order or max-order
block, like mentioned yesterday?

If the first pfn of a pageblock (or maybe better a max-order block) is
valid, but is in neither memblock.reserved nor memblock.memory, and any
other page in such a pageblock is freed to the buddy allocator, we
should make sure the whole pageblock gets initialized (or at least the
pages with a pfn lower than the one that was added to the buddy). So
applying a round down in the above loop might just do the trick.

Since the removal of that extra dependency was mostly orthogonal with
the above, I guess it's actually cleaner to do it incrementally.

I'd suggest to also document why we're doing it, in the code (not just
the commit header) of the incremental patch, by mentioning which
specific VM invariants we're enforcing, that the VM code always
depended upon and that required the round down, etc...

In the meantime I'll try to update all systems again with this
implementation to test it.

Thanks!
Andrea



Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-12-05 Thread Andrea Arcangeli
Hi Mel,

On Thu, Nov 26, 2020 at 10:47:20AM +, Mel Gorman wrote:
> Agreed. This thread has a lot of different directions in it at this
> point so what I'd hope for is first, a patch that initialises holes with
> zone/node linkages within a 1<<(MAX_ORDER-1) alignment. If there is a
> hole, it would be expected the pages are PageReserved. Second, a fix to
> fast_isolate that forces PFNs returned to always be within the stated
> zone boundaries.

The last two patches in
https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/ should
resolve the struct page initialization, and the VM_BUG_ON_PAGE never
happened again as expected.

So I looked back to check whether the "distance" logic is accurate. I
added those checks and I get negative hits:

diff --git a/mm/compaction.c b/mm/compaction.c
index cc1a7f600a86..844a90b0acdf 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1331,6 +1331,12 @@ fast_isolate_freepages(struct compact_control *cc)
low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));
 
+   WARN_ON_ONCE((long) low_pfn < 0);
+   WARN_ON_ONCE((long) min_pfn < 0);
+   if ((long) low_pfn < 0)
+   return cc->free_pfn;
+   if ((long) min_pfn < 0)
+   return cc->free_pfn;
if (WARN_ON_ONCE(min_pfn > low_pfn))
low_pfn = min_pfn;

Both warn-on-onces trigger, so it does go negative. While it appears
not to crash the kernel, since pfn_valid won't succeed on negative
entries and they'll always be higher than any pfn in the freelist, is
this a sign that there's further room for improvement here?

Thanks,
Andrea



[PATCH 0/1] mm: initialize struct pages in reserved regions outside of the zone ranges

2020-12-04 Thread Andrea Arcangeli
I'm running with this patch applied on all instances as a solution to
the compaction crashes that started to trigger after the v5.9
upgrade. It's applied on top of
https://lore.kernel.org/lkml/20201201181502.2340-1-r...@kernel.org so
that it boots and works fine even when there are memblock.reserved
regions that don't overlap with memblock.memory (such as pfn 0, which
is in memblock.reserved but not added to memblock.memory).

To facilitate any testing:
https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

===
# grep -A1  -i e820  /proc/iomem
39619000-397f4fff : Unknown E820 type
397f5000-3be55fff : System RAM

DMA      zone_start_pfn 0        zone_end_pfn() 4096     contiguous 1
DMA32    zone_start_pfn 4096     zone_end_pfn() 1048576  contiguous 0
Normal   zone_start_pfn 1048576  zone_end_pfn() 4909056  contiguous 1
Movable  zone_start_pfn 0        zone_end_pfn() 0        contiguous 0

235508 0x397f4000 0x40001000 reserved True pfn_valid True
235509 0x397f5000 0x4000 reserved False pfn_valid True
===
# grep -A1  -i e820  /proc/iomem 
7a17b000-7a216fff : Unknown E820 type
7a217000-7bff : System RAM

DMA      zone_start_pfn 0        zone_end_pfn() 4096     contiguous 1
DMA32    zone_start_pfn 4096     zone_end_pfn() 1048576  contiguous 0
Normal   zone_start_pfn 1048576  zone_end_pfn() 4715392  contiguous 1
Movable  zone_start_pfn 0        zone_end_pfn() 0        contiguous 0

500246 0x7a216000 0x3fff1000 reserved True pfn_valid True
500247 0x7a217000 0x3fff reserved False pfn_valid True
===

I dislike the fix for pfn 0 merged in -mm so that it will still boot
with DEBUG_VM=y, posted in
https://lkml.kernel.org/r/20201203105107.gr123...@linux.ibm.com,
because in addition to looping every 2m from the start at every
memblock for every zone in a O(N*2) way, it'll keep assigning a
"wrong" zoneid to pfn 0 and it literally reintroduces the problem that
https://lore.kernel.org/lkml/20201201181502.2340-1-r...@kernel.org was
meant to solve.

This patch implements a more significant change since it alters the
zone boundaries.

If there's any problem with extending the zone spans to include
memblock.reserved regions it'd be good to know now, because if there's
no problem with that, I only see benefits in being able to guarantee a
proper zoneid and nid of all struct page that are in reserved
memblocks too and where reserve_bootmem_region will attempt to call
__SetPageReserved later expecting to deal with a "functional" page
structure.

An alternative to this patch, which still sounds preferable to
https://lkml.kernel.org/r/20201203105107.gr123...@linux.ibm.com, is to
just teach reserve_bootmem_region not to try to set PageReserved on
uninitialized PagePoison pages and to learn how to deal with "non
functional" page structures, and to just leave them alone. PagePoison
conceptually would act as a NO_ZONEID setting in such a case and,
orthogonally to all the rest, it'd be nice if it was set also with
DEBUG_VM=n.

Overall anything looks preferable than once again assigning a wrong
zoneid to a page struct that belongs to a pfn that is not part of the
zone in order to boot with the very fix that was meant to prevent such
invariant to be broken in the first place.

I don't think pfn 0 deserves a magic exception and a pass to break the
invariant (even ignoring it may happen that the first pfn in nid > 0
might then also get the pass).

Andrea Arcangeli (1):
  mm: initialize struct pages in reserved regions outside of the zone
ranges

 include/linux/memblock.h | 17 +---
 mm/debug.c   |  3 ++-
 mm/memblock.c|  4 +--
 mm/page_alloc.c  | 57 
 4 files changed, 63 insertions(+), 18 deletions(-)



[PATCH 1/1] mm: initialize struct pages in reserved regions outside of the zone ranges

2020-12-04 Thread Andrea Arcangeli
Without this change, pfn 0 isn't in any zone spanned range, and it's
also not in any memblock.memory range, so the struct page of pfn 0
wasn't initialized and the PagePoison remained set when
reserve_bootmem_region called __SetPageReserved, inducing a silent
boot failure with DEBUG_VM (and correctly so, because the crash
signaled the nodeid/nid of pfn 0 would be again wrong).

There's no enforcement that all memblock.reserved ranges must overlap
memblock.memory ranges, so the memblock.reserved ranges also require
an explicit initialization and the zones ranges need to be extended to
include all memblock.reserved ranges with struct pages too or they'll
be left uninitialized with PagePoison as it happened to pfn 0.

Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather 
that check each PFN")
Signed-off-by: Andrea Arcangeli 
---
 include/linux/memblock.h | 17 +---
 mm/debug.c   |  3 ++-
 mm/memblock.c|  4 +--
 mm/page_alloc.c  | 57 
 4 files changed, 63 insertions(+), 18 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index ef131255cedc..c8e30cd69564 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -251,7 +251,8 @@ static inline bool memblock_is_nomap(struct memblock_region 
*m)
 int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
unsigned long  *end_pfn);
 void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
- unsigned long *out_end_pfn, int *out_nid);
+ unsigned long *out_end_pfn, int *out_nid,
+ struct memblock_type *type);
 
 /**
  * for_each_mem_pfn_range - early memory pfn range iterator
@@ -263,9 +264,17 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long 
*out_start_pfn,
  *
  * Walks over configured memory ranges.
  */
-#define for_each_mem_pfn_range(i, nid, p_start, p_end, p_nid)  \
-   for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid); \
-i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
+#define for_each_mem_pfn_range(i, nid, p_start, p_end, p_nid)\
+   for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid, \
+ &memblock.memory);  \
+i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid, \
+ &memblock.memory))
+
+#define for_each_res_pfn_range(i, nid, p_start, p_end, p_nid)\
+   for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid, \
+ &memblock.reserved);\
+i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid, \
+ &memblock.reserved))
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
diff --git a/mm/debug.c b/mm/debug.c
index ccca576b2899..6a1d534f5ffc 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -64,7 +64,8 @@ void __dump_page(struct page *page, const char *reason)
 * dump_page() when detected.
 */
if (page_poisoned) {
-   pr_warn("page:%px is uninitialized and poisoned", page);
+   pr_warn("page:%px pfn:%ld is uninitialized and poisoned",
+   page, page_to_pfn(page));
goto hex_only;
}
 
diff --git a/mm/memblock.c b/mm/memblock.c
index b68ee86788af..3964a5e8914f 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1198,9 +1198,9 @@ void __init_memblock __next_mem_range_rev(u64 *idx, int 
nid,
  */
 void __init_memblock __next_mem_pfn_range(int *idx, int nid,
unsigned long *out_start_pfn,
-   unsigned long *out_end_pfn, int *out_nid)
+   unsigned long *out_end_pfn, int *out_nid,
+   struct memblock_type *type)
 {
-   struct memblock_type *type = &memblock.memory;
struct memblock_region *r;
int r_nid;
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ce2bdaabdf96..3eed49598d66 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1458,6 +1458,7 @@ static void __meminit init_reserved_page(unsigned long 
pfn)
 {
pg_data_t *pgdat;
int nid, zid;
+   bool found = false;
 
if (!early_page_uninitialised(pfn))
return;
@@ -1468,10 +1469,15 @@ static void __meminit init_reserved_page(unsigned long 
pfn)
for (zid = 0; zid < MAX_NR_ZONES; zid++) {
struct zone *zone = >node_zones[zid];
 
-   if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone))
+   if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone)) {
+   found = true;
break;
+   }
}
- 

Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads

2020-12-04 Thread Andrea Arcangeli
On Fri, Dec 04, 2020 at 02:23:29PM -0500, Peter Xu wrote:
> If we see [1]:
> 
>   if (!pte_present && !pte_none && pte_swp_uffd_wp && not_anonymous_vma && 
> !is_migration_entry)
> 
> Then it's fundamentally the same as:
> 
>   swp_entry(0, _UFFD_SWP_UFFD_WP) && !vma_is_anonymous(vma)

Yes, conceptually it's the same, but in practice it's a different value
in the raw pte if you set swp_offset to _UFFD_SWP_UFFD_WP.

Setting swp_offset to _UFFD_SWP_UFFD_WP is just confusing; it's better
to use magic type 1, offset 0, than to even touch _UFFD_SWP_UFFD_WP if
you reserve the magic value in the swap entry.

pte_swp_uffd_wp or _UFFD_SWP_UFFD_WP are the raw pte value.

swp_entry(0, _UFFD_SWP_UFFD_WP) is not the raw pte value, and it has
to be first converted to the raw pte value with swp_entry_to_pte, so
the _UFFD_SWP_UFFD_WP gets shifted left.

If we use _UFFD_SWP_UFFD_WP it looks much cleaner to keep it in the
pte, not in the swp entry, since then you can use the already existing
methods that can only take a pte_t (not a swp_entry_t) as input.

Thanks,
Andrea



Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads

2020-12-04 Thread Andrea Arcangeli
On Thu, Dec 03, 2020 at 11:10:18PM -0500, Andrea Arcangeli wrote:
> from the pte, one that cannot ever be set in any swp entry today. I
> assume it can't be _PAGE_SWP_UFFD_WP since that already can be set but
> you may want to verify it...

I thought more about the above, and I think the already existing
pte_swp_mkuffd_wp will just be enough without having to reserve an
extra bitflag if we encode it as a non migration entry.

The check:

if (!pte_present && !pte_none && pte_swp_uffd_wp && not_anonymous_vma && 
!is_migration_entry)

should be enough to disambiguate it. When setting it, it'd be enough
to set the pte to the value _PAGE_SWP_UFFD_WP.

Although if you prefer to check for:

if (!pte_present && !pte_none && swp_type == 1 && swp_offset == 0 && 
not_anonymous_vma && !is_migration_entry)

that would do as well.

It's up to you; my preference is just to reuse _PAGE_SWP_UFFD_WP since
it already has to exist, and all the pte_swp_*uffd* methods are already
available or uffd-wp couldn't work.
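
Put together, a minimal sketch of the disambiguation check discussed
above could look like the helper below; the helper name is made up and
this is not what was eventually merged upstream (which later grew a
generic pte marker mechanism instead):

#include <linux/mm.h>
#include <linux/swapops.h>

/*
 * Hypothetical: a non-present, non-migration "swap" pte in a
 * file-backed vma with the swap uffd-wp software bit set is read as
 * "this vaddr was uffd-wp protected before its page got unmapped".
 */
static bool pte_is_uffd_wp_unmap_marker(struct vm_area_struct *vma,
					pte_t pte)
{
	if (pte_present(pte) || pte_none(pte))
		return false;
	if (vma_is_anonymous(vma))
		return false;
	if (!pte_swp_uffd_wp(pte))
		return false;
	return !is_migration_entry(pte_to_swp_entry(pte));
}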

Thanks,
Andrea



Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads

2020-12-03 Thread Andrea Arcangeli
Hi Peter,

On Thu, Dec 03, 2020 at 09:30:51PM -0500, Peter Xu wrote:
> I'm just afraid there's no space left for a migration entry, because migration
> entries fills in the pfn information into swp offset field rather than a real
> offset (please refer to make_migration_entry())?  I assume PFN can use any 
> bit.
> Or did I miss anything?
> 
> I went back to see the original proposal from Hugh:
> 
>   IIUC you only need a single value, no need to carve out another whole
>   swp_type: could probably be swp_offset 0 of any swp_type other than 0.
> 
> Hugh/Andrea, sorry if this is a stupid swap question: could you help explain
> why swp_offset=0 won't be used by any swap device?  I believe it's correct,
> it's just that I failed to figure out the reason myself. :(
> 

Hugh may want to review in case I got it wrong, but there are basically
three ways.

swp_type would mean adding one more reserved value in addition to
SWP_MIGRATION_READ and SWP_MIGRATION_WRITE (kind of increasing
SWP_MIGRATION_NUM to 3).

swp_offset = 0 works in combination with SWP_MIGRATION_WRITE and
SWP_MIGRATION_READ if we enforce that pfn 0 is never used by the kernel
(I'd feel safer with the pfn value -1UL truncated to the bits of the
swp offset, since the swp_entry format is common code).

The bit I was suggesting is just one more bit like _PAGE_SWP_UFFD_WP
from the pte, one that cannot ever be set in any swp entry today. I
assume it can't be _PAGE_SWP_UFFD_WP since that already can be set but
you may want to verify it...

It'd be set on the pte (not in the swap entry), then it doesn't matter
much what's inside the swp_entry anymore. The pte value would be
generated with this:

 pte_swp_uffd_wp_unmap(swp_entry_to_pte(swp_entry(SWP_MIGRATION_READ, 0)))

(maybe SWP_MIGRATION_READ could also be 0 and then it can be just
enough to set that single bit in the pte and nothing else, all other
bits zero)

We never store a raw swp entry in the pte (the raw swp entry is stored
in the xarray, it's the index of the swapcache).

To solve our unmap issue we only deal with pte storage (no xarray
index storage). This is why it can also be in the arch specific pte
representation of the swp entry, it doesn't need to be a special value
defined in the swp entry common code.

Being the swap entry to pte conversion arch dependent, such bit needs
to be defined by each arch (reserving a offset or type value in swp
entry would solve it in the common code).

#define SWP_OFFSET_FIRST_BIT  (_PAGE_BIT_PROTNONE + 1)

All bits below PROTNONE are available for software use and we use bit
1 (soft dirty) and bit 2 (uffd_wp). The protnone bit 8 itself (global
bit) must not be set or it'll look protnone and pte_present will be
true. Bit 7 is PSE so it's also not available, because pte_present
checks that too.

It appears you can pick among bits 3, 4, 5 and 6 at your own choice and
it doesn't look like we're running out of those yet (if we were, there
would be a bigger incentive to encode it as part of the swp entry
format). Example:

#define _PAGE_SWP_UFFD_WP_UNMAP _PAGE_PWT

If that bit is set and pte_present is false, then everything else in
that pte is meaningless and it means a uffd-wrprotected pte_none.

So in the migration-entry/swapin page fault path, you could go one
step back and check the pte for such bit, if it's set it's not a
migration entry.

If there's a read access it should fill the page with shmem_fault,
keep the pte wrprotected and then set _PAGE_UFFD_WP on the pte. If
there's a write access it should invoke handle_userfault.

If there's any reason why the swp_entry reservation is simpler, that's
ok too; you'll see a huge lot more details once you try to implement
it, so you'll be better able to judge later. I'm greatly simplifying
everything, but this is no simple feat...

Thanks,
Andrea



Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads

2020-12-03 Thread Andrea Arcangeli
On Thu, Dec 03, 2020 at 01:02:34PM -0500, Peter Xu wrote:
> On Wed, Dec 02, 2020 at 09:36:45PM -0800, Hugh Dickins wrote:
> > On Wed, 2 Dec 2020, Peter Xu wrote:
> > > On Wed, Dec 02, 2020 at 02:37:33PM -0800, Hugh Dickins wrote:
> > > > On Tue, 1 Dec 2020, Andrea Arcangeli wrote:
> > > > > 
> > > > > Any suggestions on how to have the per-vaddr per-mm _PAGE_UFFD_WP bit
> > > > > survive the pte invalidates in a way that remains associated to a
> > > > > certain vaddr in a single mm (so it can shoot itself in the foot if it
> > > > > wants, but it can't interfere with all other mm sharing the shmem
> > > > > file) would be welcome...
> > > > 
> > > > I think it has to be a new variety of swap-like non_swap_entry() pte,
> > > > see include/linux/swapops.h.  Anything else would be more troublesome.

Agreed. Solving it by changing the unmapping of the ptes is also some
trouble but less troublesome than adding new bitmaps to the vma to
handle in vma_merge/vma_split.

> > But those ptes will be pte_present(), so you must provide a pfn,
> 
> Could I ask why?

_PAGE_PROTNONE exists only for one reason, so pte_present returns true
and the page is as good as mapped, the pfn is real and mapped,
everything is up and running fine except _PAGE_PRESENT is not set in
the pte. _PAGE_PROTNONE doesn't unmap the pte, it only triggers faults
on a mapped pte.

When we set _PAGE_PROTNONE and clear _PAGE_PRESENT atomically, nothing
changes at all for all pte_present and all other VM common code,
except now you get page faults.

So numa hinting faults use that and it's a perfect fit for that,
because you just want to change nothing, but still be notified on
access.

Here instead you need to really unmap the page and lose any pfn or
page reference and everything along with it; just one bit needs to
survive the unmap: the _PAGE_UFFD_WP bit.

I tend to agree this needs to work more similarly to the migration
entry like Hugh suggested or an entirely new mechanism to keep "vma
specific state" alive, for filebacked mappings that get zapped but
that have still a vma intact.

The vma removal in the munmap() syscall is then the only point where
the pte is allowed to be cleared for good, and only then can the pte
be freed.

Not even MADV_DONTNEED should be allowed to zero out the pte, it can
drop everything but that single bit.

> Meanwhile, this reminded me another option - besides _PAGE_PROTNONE, can we 
> use
> _PAGE_SPECIAL?  That sounds better at least from its naming that it tells it's
> a special page already. We can also leverage existing pte_special() checks 
> here
> and there, then mimic what we do with pte_devmap(), maybe?

That's also for mapped pages, VM_PFNMAP or similar, I think.

From memory, the !pte_present case for filebacked vmas only exists as a
migration entry, so I think we'll just add a branch to that case so
that it can disambiguate the migration entry from the _PAGE_UFFD_WP
bit.

So we can reserve one bit in the migration entry that is always
enforced zero when it is a migration entry.

When that bit is set on a non-present page in a filebacked vma, it
will mean _PAGE_UFFD_WP is set for that vaddr in that mm.

Then we need to pass a parameter in all pte zapping operations to tell
if the unmap event is happening because the vma has been truncated, or
if it's happening for any other reason.

If it's happening for any other reason (page truncate, MADV_DONTNEED,
memory pressure to swap/writepage) we just convert any present pte
with _PAGE_UFFD_WP set to the non-migration-entry non-present pte with
the reserved migration entry bit set.

If the present pte has no _PAGE_UFFD_WP then it'll be zapped as usual,
regardless of whether VM_UFFD_WP is set in the vma or not.

If the pte zapping is instead invoked because of a vma truncation,
which means it's the last unmap operation on that vaddr, the pte zap
logic will be told to ignore the _PAGE_UFFD_WP in any present pte, so
the ptes will be zeroed out and later freed as needed.

Thanks,
Andrea



Re: [PATCH] mm: refactor initialization of stuct page for holes in memory layout

2020-12-03 Thread Andrea Arcangeli
Hello,

On Thu, Dec 03, 2020 at 08:25:49AM +0200, Mike Rapoport wrote:
> On Wed, Dec 02, 2020 at 03:47:36PM -0800, Andrew Morton wrote:
> > On Tue,  1 Dec 2020 20:15:02 +0200 Mike Rapoport  wrote:
> > 
> > > From: Mike Rapoport 
> > > 
> > > There could be struct pages that are not backed by actual physical memory.
> > > This can happen when the actual memory bank is not a multiple of
> > > SECTION_SIZE or when an architecture does not register memory holes
> > > reserved by the firmware as memblock.memory.
> > > 
> > > Such pages are currently initialized using init_unavailable_mem() function
> > > that iterated through PFNs in holes in memblock.memory and if there is a
> > > struct page corresponding to a PFN, the fields if this page are set to
> > > default values and it is marked as Reserved.
> > > 
> > > init_unavailable_mem() does not take into account zone and node the page
> > > belongs to and sets both zone and node links in struct page to zero.
> > > 
> > > On a system that has firmware reserved holes in a zone above ZONE_DMA, for
> > > instance in a configuration below:
> > > 
> > >   # grep -A1 E820 /proc/iomem
> > >   7a17b000-7a216fff : Unknown E820 type
> > >   7a217000-7bff : System RAM
> > > 
> > > unset zone link in struct page will trigger
> > > 
> > >   VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
> > 
> > That sounds pretty serious.

My understanding is that with DEBUG_VM=n the broken invariant won't
cause trouble, but Fedora is helping the upstream testing by keeping
DEBUG_VM=y and it has been shipping v5.8 and v5.9 for a while, so those
kernels could very well crash if they have that type 20 thing in the
e820 map.

> > 
> > > because there are pages in both ZONE_DMA32 and ZONE_DMA (unset zone link 
> > > in
> > > struct page) in the same pageblock.
> > > 
> > > Interleave initialization of pages that correspond to holes with the
> > > initialization of memory map, so that zone and node information will be
> > > properly set on such pages.
> > > 
> > 
> > Should this be backported to -stable?  If so, do we have a suitable Fixes:?
> 
> Sorry, I forgot to add
> 
> Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather 
> that check each PFN")

I've been wondering already why I'm the only one getting a crash every
two weeks. Once it crashed in MADV_HUGEPAGE of qemu; that would
definitely have happened even with Fedora despite THP enabled =
madvise, and it hung qemu for good, so it was noticeable since it was
in direct compaction.

Other times it was in kcompactd, so it just killed the kernel thread
and was only noticeable in the kernel logs, and probably it doesn't
happen that frequently unless THP enabled = always, although it could
still happen, since compaction isn't used just for THP.

Thanks,
Andrea



Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-12-03 Thread Andrea Arcangeli
On Thu, Dec 03, 2020 at 12:51:07PM +0200, Mike Rapoport wrote:
> On Thu, Dec 03, 2020 at 01:23:02AM -0500, Andrea Arcangeli wrote:
> > 5) pfn 0 is the classical case where pfn 0 is in a reserved zone in
> >memblock.reserve that doesn't overlap any memblock.memory zone.
> 
> And, IMHO, this is the fundamental BUG.

We have a dozen arches that change all the time, new efi/e820 stuff,
always new bios configs; the memblock code must be able to cope with
the caller being wrong or changing behavior if the e820 map changes
behaviour.

Trying to work around memblock core deficiencies in the caller is what
124049decbb121ec32742c94fb5d9d6bed8f24d8 already did, and it doesn't
work and it's not even a clean thing to do.

In fact here there's really nothing to blame the caller for (i.e. the
e820 init code), unless you declare that there must be always overlap
(which I understood would break stuff for regions that must not have a
direct mapping).

The caller here is correct and it can be summarized as below:

if (e820 map type != system_ram)
   memblock_reserve(range)
else
   memblock_add(range)

Then:

   zone_boundaries = [ 16M, 4G, 0 ]
   free_area_init(zone_boundaries)

It's not the caller fault if the e820 map in the bios starts with:

pfn 0-1 reserved
pfn 1-N system RAM

This is better fixed in a way that must not require any change in the
caller...

> There is RAM at address 0, there is a DIMM there, so why on earth this
> should not be a part of memblock.memory?

How can you blame the caller if you explicitly didn't require that
.reserved ranges must always overlap .memory ranges? In fact it was
explained as a feature to avoid the direct mapping.

Declaring that there must always be overlap would be one way to cope
with this issue, but even then memblock must detect a wrong caller,
work around it by doing a memblock_add call inside memblock_reserve()
and by printing a warning to improve the caller. It shouldn't crash at
boot with the console off if the overlap is not there.

In my view the caller here is not to blame, all these issues are
because memblock needs improvement and must cope with any caller.

> >The memblock_start_of_DRAM moves the start of the DMA zone above
> >the pfn 0, so then pfn 0 already ends up in the no zone land David
> >mentioned where it's not trivial to decide to give it a zoneid if
> >it's not even spanned in the zone.
> > 
> >Not just the zoneid, there's not even a nid.
> > 
> >So we have a pfn with no zoneid, and no nid, and not part of the
> >zone ranges, not part of the nid ranges not part of the
> >memory.memblock.
> 
> We have a pfn that should have been in memblock.memory, in ZONE_DMA and
> in the first node with memory. If it was not trimmed from there 

My incremental patch already solves how to extend the zone spans and
the nid spans by taking memblock.reserved into account, not just
memblock.memory, but still without actually messing with the direct
mapping, otherwise I would have called memblock_add instead of walking
memblock.reserved when detecting the nid and zoneid ranges.

> 
> >We can't really skip the initialization of the the pfn 0, it has to
> >get the zoneid 0 treatment or if pfn 1 ends up in compaction code,
> >we'd crash again. (although we'd crash with a nice PagePoison(page)
> >== true behavior)
> 
> Agree. Struct page for pfn should get the same zoneid and nodeid as pfn 1.

How are you sure there isn't a zone ID 0 that starts at pfn 0 and ends
at pfn 1, while the DMA zone starts at pfn 1? In such a case pfn 0
would have a zoneid different from pfn 1.

I'm not exactly sure why we should initialize pfn 0 with the zoneid of
a zone that pfn 0 is not part of; that looks wrong.

I'm not exactly sure why you don't fix the zone->zone_start_pfn and
the respective nid node_start_pfn to start from pfn 0 instead, just
like I did in my patch.

> From 84a1c2531374706f3592a638523278aa29aaa448 Mon Sep 17 00:00:00 2001
> From: Mike Rapoport 
> Date: Thu, 3 Dec 2020 11:40:17 +0200
> Subject: [PATCH] fixup for "mm: refactor initialization of stuct page for 
> holes"
> 
> Signed-off-by: Mike Rapoport 
> ---
>  mm/page_alloc.c | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ce2bdaabdf96..86fde4424e87 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6227,7 +6227,8 @@ void __init __weak memmap_init(unsigned long size, int 
> nid,
>  unsigned long zone,
>  unsigned long range_start_pfn)
>  {
> - unsigned long start_pfn, end_pfn, next_pfn = 0;
> + static unsigned long hole_start_pfn;
> + unsigned long start_pfn, end_pfn;
>   unsigned

Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-12-02 Thread Andrea Arcangeli
On Wed, Dec 02, 2020 at 07:39:23PM +0200, Mike Rapoport wrote:
> Hmm, do you have any logs? It worked on my box and with various memory
> configurations in qemu.

It crashes in reserve_bootmem_region so no logs at all since the
serial console isn't active yet.

> I believe that the proper solution would be to have all the memory
> reserved by the firmware added to memblock.memory, like all other
> architectures do, but I'm not sure if we can easily and reliably
> determine what of E820 reserved types are actually backed by RAM.
> So this is not feasible as a short term solution.
> 
> My patch [1], though, is an immediate improvement and I think it's worth
> trying to fix off-by-one's that prevent your system from booting.
> 
> [1] https://lore.kernel.org/lkml/20201201181502.2340-1-r...@kernel.org

Yes that's the patch I applied.

Relevant points after debugging:

1) can you try again with DEBUG_VM=y?

2) DEBUG_VM=y already enforces the memset(page, 0xff, sizeof(struct
   page)) on all struct pages allocated, exactly what I was suggesting to
   do in the previous email. I wonder why we're not doing that even
   with DEBUG_VM=n, perhaps it's too slow for TiB systems. See
   page_init_poison().

3) given 2), your patch below by deleting "0,0" initialization
   achieves the debug feature to keep PagePoison forever on all
   uninitialized pages, imagine PagePoison is really
   NO_NODEID/NO_ZONEID and doesn't need handling other than a crash.
  
-   __init_single_page(pfn_to_page(pfn), pfn, 0, 0);

4) because of the feature in 3) I now hit the PagePoison VM_BUG_ON
   because pfn 0 wasn't initialized when reserve_bootmem_region tries
   to call __SetPageReserved on a PagePoison page.

5) pfn 0 is the classical case where pfn 0 is in a reserved zone in
   memblock.reserve that doesn't overlap any memblock.memory zone.

   The memblock_start_of_DRAM moves the start of the DMA zone above
   the pfn 0, so then pfn 0 already ends up in the no zone land David
   mentioned where it's not trivial to decide to give it a zoneid if
   it's not even spanned in the zone.

   Not just the zoneid, there's not even a nid.

   So we have a pfn with no zoneid, and no nid, and not part of the
   zone ranges, not part of the nid ranges not part of the
   memory.memblock.

   We can't really skip the initialization of the pfn 0, it has to
   get the zoneid 0 treatment or if pfn 1 ends up in compaction code,
   we'd crash again. (although we'd crash with a nice PagePoison(page)
   == true behavior)

The below fixes the issue and it boots fine again and the compaction
crash should be solved with both patches applied.

DMA      zone_start_pfn 0        zone_end_pfn() 4096     contiguous 1
DMA32    zone_start_pfn 4096     zone_end_pfn() 524252   contiguous 1
Normal   zone_start_pfn 0        zone_end_pfn() 0        contiguous 0
Movable  zone_start_pfn 0        zone_end_pfn() 0        contiguous 0

500246 0x7a216000 0x1fff1000 reserved True pfn_valid True
500247 0x7a217000 0x1fff reserved False pfn_valid True
500248 0x7a218000 0x1fff reserved False pfn_valid True

 quote from previous email 
73a6e474cb376921a311786652782155eac2fdf0 this changed it:

DMA      zone_start_pfn 1        zone_end_pfn() 4096     contiguous 1
DMA32    zone_start_pfn 4096     zone_end_pfn() 524252   contiguous 0
Normal   zone_start_pfn 0        zone_end_pfn() 0        contiguous 0
Movable  zone_start_pfn 0        zone_end_pfn() 0        contiguous 0

500246 0x7a216000 0xfff1000 reserved True
500247 0x7a217000 0x1fff reserved False
500248 0x7a218000 0x1fff00010200 reserved False ]
 quote from previous email 

From 89469f063c192ae70aea0bd6e1e2a7e99894e5b6 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli 
Date: Wed, 2 Dec 2020 23:23:26 -0500
Subject: [PATCH 1/1] mm: initialize struct pages in reserved regions outside
 of the zone ranges

pfn 0 wasn't initialized and the PagePoison remained set when
reserve_bootmem_region() called __SetPageReserved, inducing a silent
boot failure with DEBUG_VM (and correctly so, because the crash
signaled the nodeid/nid of pfn 0 would be again wrong).

Without this change, pfn 0, which is part of a memblock.reserved range,
isn't in any zone spanned range nor in any nid spanned range; when we
initialize the memblock.memory holes, pfn 0 won't be included because
it has no nid and no zoneid.

There's no enforcement that all memblock.reserved ranges must overlap
memblock.memory ranges, so the memblock.reserved ranges also require
an explicit initialization and the zones and nid ranges need to be
extended to include all memblock.reserved ranges with struct pages, or
they'll be left uninitialized with PagePoison as it happened to pfn 0.

Signed-off-by: Andrea Arcangeli 
---
 include/linux/memblock.h | 17 +---

Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-12-01 Thread Andrea Arcangeli
Hello Mike,

On Sun, Nov 29, 2020 at 02:32:57PM +0200, Mike Rapoport wrote:
> Hello Andrea,
> 
> On Thu, Nov 26, 2020 at 07:46:05PM +0200, Mike Rapoport wrote:
> > On Thu, Nov 26, 2020 at 11:05:14AM +0100, David Hildenbrand wrote:
> > 
> > Let's try to merge init_unavailable_memory() into memmap_init().
> > Than it'll be able to set zone/nid for those nasty pfns that BIOS
> > decided to keep to itself, like in Andrea's case and will also take care
> > of struct pages that do not really have a frame in DRAM, but are there
> > because of arbitrary section size.
> 
> Did you have a chance to check if this solves your issue?
> If yes, I'll resend this as a formal patch.

I tested the patch you sent, but it doesn't seem to boot. Reverting it
booted.

Also I noticed that leaving pages uninitialized already happened before
and was fixed in 124049decbb121ec32742c94fb5d9d6bed8f24d8, where
literally all holes were registered in memblock_reserve() by hand to
work around this very same defect in the memblock callee we're fixing
again.

Then it was lifted in 9fd61bc95130d4971568b89c9548b5e0a4e18e0e since
the memblock was supposed to be able to initialize all pages.

Since this seems to be the second time this happens, I'd suggest
changing the graceful (pfn, 0, 0) initialization to memset(page, 0xff,
sizeof(struct page)) too, like we mentioned earlier, and then having at
least a DEBUG_SOMETHING=y option that searches for struct pages with
all 1s in page->flags, to ensure the boot stage worked right, so
perhaps there's a graceful notification at boot before a VM_BUG_ON
triggers later. The page struct validation could be done based on
DEBUG_VM=y too since it won't cause any runtime cost post boot.

Thanks,
Andrea



Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads

2020-12-01 Thread Andrea Arcangeli
On Tue, Dec 01, 2020 at 05:30:33PM -0500, Peter Xu wrote:
> On Tue, Dec 01, 2020 at 12:59:27PM +, Matthew Wilcox wrote:
> > On Mon, Nov 30, 2020 at 06:06:03PM -0500, Peter Xu wrote:
> > > Faulting around for reads are in most cases helpful for the performance 
> > > so that
> > > continuous memory accesses may avoid another trip of page fault.  However 
> > > it
> > > may not always work as expected.
> > > 
> > > For example, userfaultfd registered regions may not be the best candidate 
> > > for
> > > pre-faults around the reads.
> > > 
> > > For missing mode uffds, fault around does not help because if the page 
> > > cache
> > > existed, then the page should be there already.  If the page cache is not
> > > there, nothing else we can do, either.  If the fault-around code is 
> > > destined to
> > > be helpless for userfault-missing vmas, then ideally we can skip it.
> > 
> > This sounds like you're thinking of a file which has exactly one user.
> > If there are multiple processes mapping the same file, then no, there's
> > no reason to expect a page to be already present in the page table,
> > just because it's present in the page cache.
> > 
> > > For wr-protected mode uffds, errornously fault in those pages around 
> > > could lead
> > > to threads accessing the pages without uffd server's awareness.  For 
> > > example,
> > > when punching holes on uffd-wp registered shmem regions, we'll first try 
> > > to
> > > unmap all the pages before evicting the page cache but without locking the
> > > page (please refer to shmem_fallocate(), where unmap_mapping_range() is 
> > > called
> > > before shmem_truncate_range()).  When fault-around happens near a hole 
> > > being
> > > punched, we might errornously fault in the "holes" right before it will be
> > > punched.  Then there's a small window before the page cache was finally
> > > dropped, and after the page will be writable again (NOTE: the uffd-wp 
> > > protect
> > > information is totally lost due to the pre-unmap in shmem_fallocate(), so 
> > > the
> > > page can be writable within the small window).  That's severe data loss.
> > 
> > This still doesn't make sense.  If the page is Uptodate in the page
> > cache, then userspace gets to access it.  If you don't want the page to
> > be accessible, ClearPageUptodate().  read() can also access it if it's
> > marked Uptodate.  A write fault on a page will call the filesystem's
> > page_mkwrite() and you can block it there.
> 
> I still don't think the page_mkwrite() could help here... Though Andrea 
> pointed

I tend to agree page_mkwrite won't help: there's no I/O nor dirty
memory pressure involved, and not even VM_FAULT_MISSING can work on real
filesystems yet. The uffd context is associated with certain virtual
addresses in the "mm"; read/write syscalls shouldn't notice any
difference, and all other mm shouldn't notice anything either.

It should be enough to check the bit in shmem_fault, invoked through
->fault, for this purpose; the problem is that we need the bit to survive
the invalidate.

> out an more important issue against swap cache (in the v1 thread [1]).  Indeed
> if we have those figured out maybe we'll also rethink this patch then it could
> become optional; while that seems to be required to allow shmem swap in/out
> with uffd-wp which I haven't yet tested.  As Hugh pointed out, purely reuse 
> the
> _PAGE_SWP_UFFD_WP in swap cache may not work trivially since uffd-wp is 
> per-pte
> rather than per-page, so I probably need to think a bit more on how to do
> that...
> 
> I don't know whether a patch like this could still be good in the future.  For
> now, let's drop this patch until we solve all the rest of the puzzle.
>
> My thanks to all the reviewers, and sorry for the noise!

Thanks to you Peter! No noise here, it's great progress to have found
the next piece of the puzzle.

Any suggestions on how to have the per-vaddr per-mm _PAGE_UFFD_WP bit
survive the pte invalidates in a way that remains associated to a
certain vaddr in a single mm (so it can shoot itself in the foot if it
wants, but it can't interfere with all other mm sharing the shmem
file) would be welcome...

Andrea



Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads

2020-12-01 Thread Andrea Arcangeli
Hi Hugh,

On Tue, Dec 01, 2020 at 01:31:21PM -0800, Hugh Dickins wrote:
> Please don't ever rely on that i_private business for correctness: as

The out of order and lockless "if (inode->i_private)" branch didn't
inspire much confidence in terms of being able to rely on it for
locking correctness indeed.

> more of the comment around "So refrain from" explains, it was added
> to avoid a livelock with the trinity fuzzer, and I'd gladly delete
> it all, if I could ever be confident that enough has changed in the
> intervening years that it's no longer necessary.

I was wondering if it's still needed, so now I checked the old code
and it also did an unconditional lock_page in shmem_fault, so I assume
it's still necessary.

> It tends to prevent shmem faults racing hole punches in the same area
> (without quite guaranteeing no race at all).  But without the livelock
> issue, I'd just have gone on letting them race.  "Punch hole" ==
> "Lose data" - though I can imagine that UFFD would like more control
> over that.  Since map_pages is driven by faulting, it should already
> be throttled too.

Yes I see now it just needs to "eventually" stop the shmem_fault
activity, it doesn't need to catch those faults already in flight, so
it cannot be relied upon as the form of serialization to zap the
pagetables while truncating the page.

> But Andrea in next mail goes on to see other issues with UFFD_WP
> in relation to shmem swap, so this thread is probably dead now.
> If there's a bit to spare for UFFD_WP in the anonymous swap entry,
> then that bit should be usable for shmem too: but a shmem page
> (and its swap entry) can be shared between many different users,
> so I don't know whether that will make sense.

An UFFD_WP software bit exists both for the present pte and for the
swap entry:

#define _PAGE_UFFD_WP   (_AT(pteval_t, 1) << _PAGE_BIT_UFFD_WP)
#define _PAGE_SWP_UFFD_WP   _PAGE_USER

It works similarly to soft dirty: when the page is swapped out,
_PAGE_UFFD_WP is translated to _PAGE_SWP_UFFD_WP, and the other way
around during swapin.
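
In code terms, a simplified sketch of how the anon path carries the bit
across swap (mirroring what the swap-out and swap-in paths do; not the
literal kernel code):

/* swap-out: present pte -> swap entry pte */
static pte_t uffd_wp_to_swap(pte_t pteval, swp_entry_t entry)
{
	pte_t swp_pte = swp_entry_to_pte(entry);

	if (pte_uffd_wp(pteval))			/* _PAGE_UFFD_WP set? */
		swp_pte = pte_swp_mkuffd_wp(swp_pte);	/* carry _PAGE_SWP_UFFD_WP */
	return swp_pte;
}

/* swap-in: swap entry pte -> present pte */
static pte_t uffd_wp_from_swap(pte_t swp_pte, pte_t new_pte)
{
	if (pte_swp_uffd_wp(swp_pte)) {
		new_pte = pte_mkuffd_wp(new_pte);	/* restore _PAGE_UFFD_WP */
		new_pte = pte_wrprotect(new_pte);	/* keep it write-protected */
	}
	return new_pte;
}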

The problem is that the bit represents an information that is specific
to an mm and a single mapping.

If you punch an hole in a shmem, you map the shmem file in two
processes A and B, you register the mapped range in a uffd opened by
process A using VM_UFFD_MISSING, only process A will generate
userfaults if you access the virtual address that corresponds to the
hole in the file, process B will still fill the hole normally and it
won't trigger userfaults in process A.

The uffd context is attached to an mm, and the notification only has
effect on that mm, the mm that created the context. Only the UFFDIO_
operations that resolve the fault (like UFFDIO_COPY) have effect
visible by all file sharers.

So VM_UFFD_WP shall work the same, and the _PAGE_SWP_UFFD_WP
information of a swapped out shmem page shouldn't go in the xarray
because doing so would share it with all "mm" sharing the mapping.

Currently uffd-wp only works on anon memory so this is a new challenge
in extending uffd-wp to shmem it seems.

If any pagetable of an anonymous memory mapping is zapped, then not
just the _PAGE_SWP_UFFD_WP bit is lost, the whole data is lost too, so
it'd break more than just uffd-wp. With shmem, in contrast, the ptes can
be zapped at any time by memory pressure alone, even before the final
->writepage swap out is invoked.

It's always possible to make uffd-wp work without those bits (anon
memory would also work without them), but the reason those bits
exist is to avoid a flood of UFFDIO_WRITEPROTECT ioctl false-negatives
after fork or after swapping (in the anon case). In the shmem case, if
we did it without a bitflag tracking which page was wrprotected in which
vma, the uffd handler would be required to mark each pte writable with
a syscall every time a new pte has been established on the range and
there's a write to the page.
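
For reference, this is the kind of call the manager would have to keep
repeating, using the existing anon uffd-wp API (minimal userland sketch,
error handling omitted):

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

static int wrprotect_range(int uffd, unsigned long start, unsigned long len)
{
	struct uffdio_writeprotect wp = {
		.range = { .start = start, .len = len },
		.mode  = UFFDIO_WRITEPROTECT_MODE_WP,
	};

	/* re-arm write protection on [start, start+len) */
	return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}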

One possibility could be to store the bit set by UFFDIO_WRITEPROTECT
in a vmalloc bitmap associated with the shmem vma at uffd registration
time, so it can be checked during the shmem_fault() if VM_UFFD_WP is
set on the vma? The vma_splits and vma_merge would be the hard part.

Another possibility would be to change how the pte invalidate work for
vmas with VM_UFFD_WP set, the pte and the _PAGE_UFFD_WP would need to
survive all invalidates... only the final truncation under page lock
would be really allowed to clear the whole pte and destroy the
_PAGE_UFFD_WP information in the pte and then to free the pgtables on
those ranges.

Once we find a way to make that bit survive the invalidates, we can
check it in shmem_fault to decide if to invoke handle_userfault if the
bit is set and FAULT_FLAG_WRITE is set in vmf->flags, or if to map the
page wrprotected in that vaddr if the bit is set and FAULT_FLAG_WRITE
is not set.
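
As a rough sketch of that check (shmem_uffd_wp_marker() is a hypothetical
helper standing for however the surviving bit ends up being looked up):

static vm_fault_t shmem_fault_uffd_wp(struct vm_fault *vmf, pte_t *entry)
{
	if (!userfaultfd_wp(vmf->vma) || !shmem_uffd_wp_marker(vmf))
		return 0;				/* not wrprotected */

	if (vmf->flags & FAULT_FLAG_WRITE)
		return handle_userfault(vmf, VM_UFFD_WP);	/* wake the manager */

	/* read fault: install the pte wrprotected and keep the marker */
	*entry = pte_mkuffd_wp(pte_wrprotect(*entry));
	return 0;
}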

Thanks,
Andrea



Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads

2020-12-01 Thread Andrea Arcangeli
Hi Peter,

On Sat, Nov 28, 2020 at 10:29:03AM -0500, Peter Xu wrote:
> Yes.  A trivial supplementary detail is that filemap_map_pages() will only set
> it read-only since alloc_set_pte() will only set write bit if it's a write.  
> In
> our case it's read fault-around so without it.  However it'll be quickly 
> marked
> as writable as long as the thread quickly writes to it again.

The problem here is that all the information injected into the kernel
through the UFFDIO_WRITEPROTECT syscalls is wiped by the quoted
unmap_mapping_range.

We can later discuss where we actually check _PAGE_UFFD_WP, but that's
not the first thing to solve here.

Here we need to find a way to retain _PAGE_UFFD_WP until the page is
actually locked and is atomically disappearing from the xarray index
as far as further shmem page faults are concerned.

Adding that information to the xarray isn't ideal because it's mm
specific and not a property of the "pagecache", but keeping that
information in the pagetables as we do for anon memory is also causing
issue as shown below, because the shmem pagetables are supposed to be
zapped at any time and be re-created on the fly based on the
xarray. As opposed this never happens with anon memory.

For example, when shmem swaps, the swap entry is written in the xarray,
not in the pagetable; how are we going to let _PAGE_UFFD_WP survive
swapping of a shmem page if we store _PAGE_UFFD_WP in the pte? Did you
test swap?

So here I'm afraid there's something more fundamental in how to have
_PAGE_UFFD_WP survive swapping and a random pte zap, and we need more
infrastructure in the unmap_mapping_range to retain that information
so that swapping also works.

So I don't think we can evaluate this patch without seeing the full
patchset and how the rest is being handled. Still it's great you found
the source of the corruption for the current shmem uffd-wp codebase so
we can move forward!

> > However filemap_map_pages isn't capable to fill a hole and to undo the
> > hole punch, all it can do are minor faults to re-fill the ptes from a
> > not-yet-truncated inode page.
> > 
> > Now it would be ok if filemap_map_pages re-filled the ptes, after they
> > were zapped the first time by fallocate, and then the fallocate would
> > zap them again before truncating the page, but I don't see a second
> > pte zap... there's just the below single call of unmap_mapping_range:
> > 
> > if ((u64)unmap_end > (u64)unmap_start)
> > unmap_mapping_range(mapping, unmap_start,
> > 1 + unmap_end - unmap_start, 0);
> > shmem_truncate_range(inode, offset, offset + len - 1);
> 
> IMHO the 2nd pte zap is inside shmem_truncate_range(), where we will call
> truncate_inode_page() then onto truncate_cleanup_page().

I wrongly assumed there was only one invalidate, and I guessed that was
what caused the problem since it ran outside the page lock and so wasn't
"atomic" with respect to the page truncation in the xarray.

The comment "refrain from faulting pages into the hole while it's
being punched" and the "inode->i_private waitqueue" logic is just to
slowly throttle the faulting so the truncation is faster. It's still
not entirely clear why it's needed since shmem_getpage_gfp uses
find_lock_entry and the page had to be locked. I didn't look at the
history to see the order of where those changes were added to see if
it's still needed.

Anyway it already looked suspect that "unlikely(inode->i_private)" would
be enough as an out of order check to fully serialize things, but for a
performance "throttling" logic it's fine and not concerning (worst
case it is not needed and the page lock waitqueue would have
serialized the faults against the slow invalidate loops already).

> Since we're at it, some more context: this is actually where I started to
> notice the issue, that we'll try to pre-unmap the whole region first before
> shmem_truncate_range().  The thing is shmem_truncate_range() will zap the ptes
> too, in an even safer way (with page lock held).  So before I came up with the
> current patch, I also tried below patch, and it also fixes the data corrupt
> issue:
> 
> -8<-
> diff --git a/mm/shmem.c b/mm/shmem.c
> index efa19e33e470..b275f401fe1f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2777,7 +2777,6 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> inode_lock(inode);
>  
> if (mode & FALLOC_FL_PUNCH_HOLE) {
> -   struct address_space *mapping = file->f_mapping;
> loff_t unmap_start = round_up(offset, PAGE_SIZE);
> loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
> DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);
> @@ -2795,9 +2794,6 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> inode->i_private = &shmem_falloc;
> spin_unlock(&inode->i_lock);
>  
> -   if 

Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads

2020-11-27 Thread Andrea Arcangeli
Hello,

On Fri, Nov 27, 2020 at 12:22:24PM +, Matthew Wilcox wrote:
> On Thu, Nov 26, 2020 at 05:23:59PM -0500, Peter Xu wrote:
> > For wr-protected mode uffds, errornously fault in those pages around could 
> > lead
> > to threads accessing the pages without uffd server's awareness.  For 
> > example,
> > when punching holes on uffd-wp registered shmem regions, we'll first try to
> > unmap all the pages before evicting the page cache but without locking the
> > page (please refer to shmem_fallocate(), where unmap_mapping_range() is 
> > called
> > before shmem_truncate_range()).  When fault-around happens near a hole being
> > punched, we might errornously fault in the "holes" right before it will be
> > punched.  Then there's a small window before the page cache was finally
> > dropped, and after the page will be writable again (NOTE: the uffd-wp 
> > protect
> > information is totally lost due to the pre-unmap in shmem_fallocate(), so 
> > the
> > page can be writable within the small window).  That's severe data loss.
> 
> Sounds like you have a missing page_mkwrite implementation.

If the real fault happened through the pagetable (which got dropped by
the hole punching), a "missing type" userfault would be delivered to
userland (because the pte would be none). Userland would invoke
UFFDIO_COPY with the UFFDIO_COPY_MODE_WP flag. Such flag would then
map the filled shmem page (not necessarily all zero and not
necessarily the old content before the hole punch) with _PAGE_RW not
set and _PAGE_UFFD_WP set, so the next write would also trigger a
wrprotect userfault (this is what the uffd-wp app expects).
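
For completeness, the userland side of that flow could look like this (a
sketch only; uffd, fault_addr and src are assumed to come from the
surrounding manager code):

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

static int resolve_missing_wp(int uffd, unsigned long fault_addr,
			      void *src, unsigned long page_size)
{
	struct uffdio_copy copy = {
		.dst  = fault_addr & ~(page_size - 1),
		.src  = (unsigned long)src,
		.len  = page_size,
		.mode = UFFDIO_COPY_MODE_WP, /* map wrprotected + _PAGE_UFFD_WP */
	};

	return ioctl(uffd, UFFDIO_COPY, &copy);
}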

filemap_map_pages doesn't notify userland when it fills a pte and it
will map again the page read-write.

However filemap_map_pages isn't capable to fill a hole and to undo the
hole punch, all it can do are minor faults to re-fill the ptes from a
not-yet-truncated inode page.

Now it would be ok if filemap_map_pages re-filled the ptes, after they
were zapped the first time by fallocate, and then the fallocate would
zap them again before truncating the page, but I don't see a second
pte zap... there's just the below single call of unmap_mapping_range:

	if ((u64)unmap_end > (u64)unmap_start)
		unmap_mapping_range(mapping, unmap_start,
				    1 + unmap_end - unmap_start, 0);
	shmem_truncate_range(inode, offset, offset + len - 1);

So looking at filemap_map_pages in shmem, I'm really wondering if the
non-uffd case is correct or not.

Do we end up with ptes pointing to non-pagecache pages, so that the
virtual mapping goes out of sync also with read/write syscalls, which
will then allocate another shmem page out of sync with the ptes?

If a real page fault happened instead of filemap_map_pages, the
shmem_fault() would block during fallocate punch hole by checking
inode->i_private, as the comment says:

 * So refrain from
 * faulting pages into the hole while it's being punched.

It's not immediately clear where filemap_map_pages refrains from
faulting pages into the hole while it's being punched... given it's
ignoring inode->i_private.

So I'm not exactly sure how shmem can safely be faulted in through
filemap_map_pages without going through shmem_fault... I suppose
shmem is simply unsafe to use with filemap_map_pages and it'd require
a specific shmem_map_pages?

If only filemap_map_pages refrained from re-faulting ptes of a shmem
page that is about to be truncated (whose original ptes had _PAGE_RW
unset and _PAGE_UFFD_WP set), there would be no problem with the uffd
interaction. So a proper shmem_map_pages could co-exist with uffd; the
userfaultfd_armed check would only be an optimization and wouldn't
be required to avoid userland memory corruption?

From 8c6fb1b7dde309f0c8b5666a8e13557ae46369b4 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli 
Date: Fri, 27 Nov 2020 19:12:44 -0500
Subject: [PATCH 1/1] shmem: stop faulting in pages without checking
 inode->i_private

Per shmem_fault comment shmem need to "refrain from faulting pages
into the hole while it's being punched" and to do so it must check
inode->i_private, which filemap_map_pages won't so it's unsafe to use
in shmem because it can leave ptes pointing to non-pagecache pages in
shmem backed vmas.

Signed-off-by: Andrea Arcangeli 
---
 mm/shmem.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 8e2b35ba93ad..f6f29b3e67c6 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3942,7 +3942,6 @@ static const struct super_operations shmem_ops = {
 
 static const struct vm_operations_struct shmem_vm_ops = {
.fault  = shmem_fault,
-   .map_pages  = filemap_map_pages,
 #ifdef CONFIG_NUMA
.set_policy = shmem_set_policy,
.get_policy = shmem_get_policy,


Thanks,
Andrea



Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-11-26 Thread Andrea Arcangeli
On Thu, Nov 26, 2020 at 09:44:26PM +0200, Mike Rapoport wrote:
> TBH, the whole interaction between e820 and memblock keeps me puzzled
> and I can only make educated guesses why some ranges here are
> memblock_reserve()'d and some memblock_add()ed.

What makes the current code messy is the mixed usage in that interaction
between memblock.reserved and memblock.memory: sometimes memblock_reserve()
is used to reserve overlapping memblock.memory ranges (clearly ok), and
sometimes it is used on ranges with no overlap (where it's not even clear
why it's being called).

When there is no overlap, we're basically passing down the exact same
information (inverted) through two different APIs.

> I think what should be there is that e820 entries that are essentially
> RAM, used by BIOS or not, should be listed in memblock.memory. Then
> using memblock_reserve() for parts that BIOS claimed for itself would
> have the same semantics as for memory allocated by kernel.
>
> I.e. if there is a DIMM from 0 to, say 512M, memblock.memory will have a
> range [0, 512M]. And areas such as 0x000-0xfff, 0x9d000-0x9 will be
> in memblock.reserved.
> 
> Than in page_alloc.c we'll know that we have a physical memory bank from
> 0 to 512M but there are some ranges that we cannot use.
>
> I suggested it back then when the issue with compaction was reported at
> the first time, but Baoquan mentioned that there are systems that cannot
> even tolerate having BIOS reserved areas in the page tables and I didn't
> continue to pursue this.

That explains why we can't add the e820 non-RAM regions with
memblock_add; what's not clear is why it should be required to call
memblock_reserve on a region that has no overlap with any memblock_add.

Instead of the patch that adds a walk over memblock.reserved and that
requires adding even more memblock_reserve calls to e820__memblock_setup
for type 20, we can add a walk over the memblock.memory holes, and then
we can remove the memblock_reserve for E820_TYPE_SOFT_RESERVED too.
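
Something along these lines is what I mean by a hole walk (a sketch only;
init_unavailable_page() is a hypothetical name for whatever per-page
initialization we want):

static void __init init_memory_holes(void)
{
	unsigned long start_pfn, end_pfn, prev_end = 0;
	int i, nid;

	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
		unsigned long pfn;

		/* initialize every valid pfn not covered by memblock.memory */
		for (pfn = prev_end; pfn < start_pfn; pfn++)
			if (pfn_valid(pfn))
				/* one possible choice: use the nid of the following RAM range */
				init_unavailable_page(pfn, nid);
		prev_end = end_pfn;
	}
}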

Thanks,
Andrea



Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-11-26 Thread Andrea Arcangeli
On Thu, Nov 26, 2020 at 11:36:02AM +0200, Mike Rapoport wrote:
> I think it's inveneted by your BIOS vendor :)

BTW, all systems I use on a daily basis have that type 20... Only two
of them are reproducing the VM_BUG_ON on a weekly basis on v5.9.

If you search for 'E820 "type 20"' you'll get plenty of hits, so it's not
just me at least :). In fact my guess is there are probably more
workstations/laptops with that type 20 than without. Maybe it only
shows up when booting with EFI?

Easy to check with `dmesg | grep "type 20"` after boot.

One guess why this wasn't frequently reproduced is that some desktop
distros are making the mistake of keeping THP enabled=madvise by default,
which reduces the overall compaction testing. Or maybe they're not
all setting DEBUG_VM=y (but I'm sure some other distro ships v5.9 with
DEBUG_VM=y). I often hit this bug in kcompactd0 for example, which
wouldn't happen with THP enabled=madvise.

The two bpf tracing tools below can show how the current
defrag=madvise default only increases the allocation latency from a few
usec to a dozen usec. Only when setting defrag=always does the latency go
up to single digit milliseconds, because of the cost of direct
compaction, which is only worth paying for MADV_HUGEPAGE ranges
doing long-lived allocations (we know by now that defrag=always was
a suboptimal default).

https://www.kernel.org/pub/linux/kernel/people/andrea/ebpf/thp-comm.bp
https://www.kernel.org/pub/linux/kernel/people/andrea/ebpf/thp.bp

Since 3917c80280c93a7123f1a3a6dcdb10a3ea19737d even apps like Redis
using fork for snapshotting purposes should prefer THP enabled
(besides, it would be better if they used uffd-wp as an alternative
to fork).

3917c80280c93a7123f1a3a6dcdb10a3ea19737d also resolved another concern,
because the decade old "fork() vs gup/O_DIRECT vs thread" race was
supposed to be unnoticeable from userland if the O_DIRECT minimum I/O
granularity was enforced to be >= PAGE_SIZE. However with THP backed
anon memory, that minimum granularity requirement increases to
HPAGE_PMD_SIZE. Recent kernels are going in the direction of solving
that race by doing cow during fork, as was originally proposed a long
time ago
(https://lkml.kernel.org/r/20090311165833.GI27823@random.random), which
I suppose will solve the race with sub-PAGE_SIZE granularity too, but
3917c80280c93a7123f1a3a6dcdb10a3ea19737d alone is enough to reduce the
minimum I/O granularity requirement from HPAGE_PMD_SIZE to PAGE_SIZE,
as some userland may have expected. The best of course is to fully
prevent the race by setting MADV_DONTFORK on the regions
under O_DIRECT (as qemu does for example).

Overall the only tangible concern left is potential higher memory
usage for servers handling tiny object storage freed at PAGE_SIZE
granularity with MADV_DONTNEED (instead of having a way to copy and
defrag the virtual space of small objects through a callback that
updates the pointer to the object...).

Small object storage relying on jemalloc/tcmalloc for tiny object
management simply needs to selectively disable THP to avoid wasting
memory, either with MADV_NOHUGEPAGE or with the prctl
PR_SET_THP_DISABLE. Flipping a switch in the OCI schema also allows
disabling THP for those object storage apps making heavy use of
MADV_DONTNEED; not even a single line of code needs changing in the app
if it is deployed through the OCI container runtime.
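
From the application side the two existing knobs look like this (sketch
only; the arena pointer and size are placeholders):

#include <stddef.h>
#include <sys/mman.h>
#include <sys/prctl.h>

static void disable_thp_for_small_objects(void *arena, size_t size)
{
	/* per-range: keep THP off only for the small-object arena */
	madvise(arena, size, MADV_NOHUGEPAGE);

	/* or per-process: disable THP for all future mappings of this task */
	prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
}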

Recent jemalloc uses MADV_NOHUGEPAGE. I didn't check exactly how it's
being used, but I hope it already does the right thing and separates
the small object arenas, zapped with MADV_DONTNEED at PAGE_SIZE
granularity, from the large object arenas where THP shall remain
enabled. glibc should also learn to separate small objects and big
objects into different arenas. This has to be handled by the app, as
it is already handled optimally in scylladb, which in fact invokes
MADV_HUGEPAGE; plenty of other databases use not just THP but
also hugetlbfs, which certainly won't fly if MADV_DONTNEED is attempted
at PAGE_SIZE granularity; and elasticsearch also gets a
significant boost from THP, etc.

Thanks,
Andrea



Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-11-26 Thread Andrea Arcangeli
On Thu, Nov 26, 2020 at 11:36:02AM +0200, Mike Rapoport wrote:
> memory.reserved cannot be calculated automatically. It represents all
> the memory allocations made before page allocator is up. And as
> memblock_reserve() is the most basic to allocate memory early at boot we
> cannot really delete it ;-)

Well this explanation totally covers "memory allocated early at
boot" that overlaps with memblock.memory.

Does the E820_TYPE_SOFT_RESERVED range added to memblock.reserve
define as "memory allocated early at boot"?

Does it overlap ranges added with any RAM added to memblock.memory?

	if (entry->type == E820_TYPE_SOFT_RESERVED)
		memblock_reserve(entry->addr, entry->size);

	if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
		continue;

	memblock_add(entry->addr, entry->size);

To me the above looks it's being used for something completely
different than from reserving "memory allocated early at boot".

Why is there no warning at boot if there's no overlap between
memblock.reserved and memblock.memory?

My question about memblock.reserved is really about the non-overlapping
ranges: why are ranges that don't overlap any memblock.memory region
added to memblock.reserved at all, and why aren't those calculated
automatically as the reverse of memblock.memory?

It's easy to see that when memblock.reserved overlaps fully, it makes
perfect sense and has to stay for that purpose. I was really only
thinking of usages of memblock_reserve like the one above, which looks
like it should be turned into a noop and deleted.

Thanks,
Andrea



Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-11-26 Thread Andrea Arcangeli
On Thu, Nov 26, 2020 at 11:05:14AM +0100, David Hildenbrand wrote:
> I agree that this is sub-optimal, as such pages are impossible to detect
> (PageReserved is just not clear as discussed with Andrea). The basic
> question is how we want to proceed:
> 
> a) Make sure any online struct page has a valid nid/zid, and is spanned
> by the nid/zid.
> b) Use a fake nid that will bail out when used for page_zone() and
> page_pgdat(), and make pfn walkers detect that.
> 
> AFAIU, Andrea seems to prefer a). I thing b) might be easier in corner
> cases. Thoughts?

Yes that's a good summary.

The reason I slightly prefer a) is that it will perform faster at
runtime.

I also saw the proposed patch that embeds a pfn_to_page in
pfn_valid to report those pfn as invalid; that is especially slow and I
don't like it for that reason. Additional bugchecks for NO_NODEID would
slow things down too, so they should be justified by some benefit. It
concerns me if pfn_valid becomes slower.

b) will also make compaction bail out on that pageblock which it
doesn't need to so it'll provide a worse runtime to compaction as
well.

a) is what the code already does, provided the e820 map range is
reserved with memblock_reserve, after applying Mike's patch to
initialize reserved memblock regions too.

It turns out the VM_BUG_ON_PAGE, as far as compaction is concerned, is
just a false positive, it detected a broken VM invariant, but it was
fine to proceed in the DEBUG_VM=n case that wouldn't crash.

Before da50c57bdbb8e37ec6f8c934a2f8acbf4e4fdfb9 the struct pages
corresponding to the e820 unknown type range wouldn't be marked
PageReserved; that also looked wrong but it was basically harmless.

Neither of the two defects is too bad: they could be ignored if we just
removed the bugcheck, but it's just nice if the bugcheck turns out to be
correct for the pageblock.

If we initialize all RAM and non-RAM ranges in the same way, then
there's no discrepancy. Mike's patch seems to do just that by walking
the memblock.reserved memblocks in the same way as the
memblock.memory.

NOTE: I would also much prefer b) if only it guaranteed zero
runtime slowdown. (Which is why I especially dislike pfn_valid
internally calling pfn_to_page.)

Thanks,
Andrea



Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-11-25 Thread Andrea Arcangeli
On Wed, Nov 25, 2020 at 12:34:41AM -0500, Andrea Arcangeli wrote:
> pfn    physaddr   page->flags
> 500224 0x7a20 0x1fff1000 reserved True
> 500225 0x7a201000 0x1fff1000 reserved True
> *snip*
> 500245 0x7a215000 0x1fff1000 reserved True
> 500246 0x7a216000 0x1fff1000 reserved True
> 500247 0x7a217000 0x3fff reserved False
> 500248 0x7a218000 0x3fff reserved False

The range is still this:

7a17b000-7a216fff : Unknown E820 type
7a217000-7ffdbfff : System RAM

This is with v5.3:

Interestingly the pages below 0x7a217000 weren't even marked reserved.

DMA      zone_start_pfn 1        zone_end_pfn() 4096     contiguous 1
DMA32    zone_start_pfn 4096     zone_end_pfn() 524252   contiguous 1
Normal   zone_start_pfn 0        zone_end_pfn() 0        contiguous 0
Movable  zone_start_pfn 0        zone_end_pfn() 0        contiguous 0

500246 0x7a216000 0x1fff reserved False
500247 0x7a217000 0x1fff00022036 reserved False
500248 0x7a218000 0x1fff2032 reserved False

4b094b7851bf4bf551ad456195d3f26e1c03bd74 no change (because the
unknown e820 type 20 was still initialized by memmap_init_zone despite
not being part of memblock.memory):

DMA      zone_start_pfn 1        zone_end_pfn() 4096     contiguous 1
DMA32    zone_start_pfn 4096     zone_end_pfn() 524252   contiguous 1
Normal   zone_start_pfn 0        zone_end_pfn() 0        contiguous 0
Movable  zone_start_pfn 0        zone_end_pfn() 0        contiguous 0

500246 0x7a216000 0x1fff reserved False
500247 0x7a217000 0x1fff00022036 reserved False
500248 0x7a218000 0x1fff2032 reserved False

73a6e474cb376921a311786652782155eac2fdf0 this changed it:

DMA      zone_start_pfn 1        zone_end_pfn() 4096     contiguous 1
DMA32    zone_start_pfn 4096     zone_end_pfn() 524252   contiguous 0
Normal   zone_start_pfn 0        zone_end_pfn() 0        contiguous 0
Movable  zone_start_pfn 0        zone_end_pfn() 0        contiguous 0

500246 0x7a216000 0xfff1000 reserved True
500247 0x7a217000 0x1fff reserved False
500248 0x7a218000 0x1fff00010200 reserved False

da50c57bdbb8e37ec6f8c934a2f8acbf4e4fdfb9
(73a6e474cb376921a311786652782155eac2fdf0^) no change:

DMA      zone_start_pfn 1        zone_end_pfn() 4096     contiguous 1
DMA32    zone_start_pfn 4096     zone_end_pfn() 524252   contiguous 1
Normal   zone_start_pfn 0        zone_end_pfn() 0        contiguous 0
Movable  zone_start_pfn 0        zone_end_pfn() 0        contiguous 0

500246 0x7a216000 0x1fff reserved False
500247 0x7a217000 0x1fff2032 reserved False
500248 0x7a218000 0x1fff2032 reserved False

So like Mike suggested this started in
73a6e474cb376921a311786652782155eac2fdf0, although the previous code
looked buggy too by not setting PageReserved when it should have.

It appears the pfn in the e820 type 20 were initialized by
memmap_init_zone before commit
73a6e474cb376921a311786652782155eac2fdf0 and they're not initialized
anymore after 73a6e474cb376921a311786652782155eac2fdf0 because the
memblock.memory didn't include the e820 type 20 so it couldn't call
memmap_init_zone anymore.

So after 4b094b7851bf4bf551ad456195d3f26e1c03bd74 uninitialized pfn
gets the 0,0 assignment (but even before
4b094b7851bf4bf551ad456195d3f26e1c03bd74 they would have gotten a
completely zero page->flags, 4b094b7851bf4bf551ad456195d3f26e1c03bd74
is completely unrelated to this issue).

Mike, your patch alone doesn't help because nothing calls
memblock_reserve on the range starting at 0x7a17b000. I appended at
the end the output of memblock=debug.

Only after I apply the below hack on top of your patch do all signs of
bugs finally go away: PG_reserved is finally set (the old code was wrong
in not setting it) and the zoneid/nid is right (the new code leaves it
uninitialized and so it's wrong):

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 983cd53ed4c9..bbb5b4d7a117 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1303,6 +1303,8 @@ void __init e820__memblock_setup(void)
 
if (entry->type == E820_TYPE_SOFT_RESERVED)
memblock_reserve(entry->addr, entry->size);
+   if (entry->type == 20)
+   memblock_reserve(entry->addr, entry->size);
 
	if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
		continue;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3fb35fe6a9e4..2e2047282ff3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6145,7 +6145,9 @@ void __meminit __weak memmap_init(unsigned long size, int nid,
 {
unsigne

Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-11-25 Thread Andrea Arcangeli
On Wed, Nov 25, 2020 at 11:04:14PM +0200, Mike Rapoport wrote:
> I think the very root cause is how e820__memblock_setup() registers
> memory with memblock:
> 
>   if (entry->type == E820_TYPE_SOFT_RESERVED)
>   memblock_reserve(entry->addr, entry->size);
> 
>   if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
>   continue;
> 
>   memblock_add(entry->addr, entry->size);
> 
> From that point the system has inconsistent view of RAM in both
> memblock.memory and memblock.reserved and, which is then translated to
> memmap etc.
> 
> Unfortunately, simply adding all RAM to memblock is not possible as
> there are systems that for them "the addresses listed in the reserved
> range must never be accessed, or (as we discovered) even be reachable by
> an active page table entry" [1].
> 
> [1] https://lore.kernel.org/lkml/20200528151510.GA6154@raspberrypi/

It looks like what's missing is a memblock_reserve, which I don't think
would interfere at all with the issue above since it won't create a
direct mapping and will simply invoke the second stage that wasn't
invoked here.

I guess this would have a better chance of getting the second
initialization stage to run in reserve_bootmem_region, and it would
likely solve the problem without breaking E820_TYPE_RESERVED, which is
known by the kernel:

>   if (entry->type == E820_TYPE_SOFT_RESERVED)
>   memblock_reserve(entry->addr, entry->size);
> 

+   if (entry->type == 20)
+   memblock_reserve(entry->addr, entry->size);

>   if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
>   continue;
> 

This is however just to show the problem, I didn't check what type 20
is.

To me it doesn't look like the root cause though; the root cause is that
if you don't call memblock_reserve the page->flags remain uninitialized.

I think page_alloc.c needs to be more robust and detect at least whether
holes within zones (but ideally all pfn_valid struct pages in the
system, even beyond the end of a zone) aren't being initialized
in the second stage, without relying on the arch code to remember to
call memblock_reserve.
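
A sketch of such a robustness check (DEBUG_VM material only; the helper
name is hypothetical):

static void __init check_memmap_init(void)
{
	struct zone *zone;

	for_each_populated_zone(zone) {
		unsigned long pfn;

		for (pfn = zone->zone_start_pfn; pfn < zone_end_pfn(zone); pfn++) {
			struct page *page;

			if (!pfn_valid(pfn))
				continue;
			page = pfn_to_page(pfn);
			/* complain if the zone/node linkage doesn't span the pfn */
			if (!zone_spans_pfn(page_zone(page), pfn))
				pr_warn("memmap: pfn %lx has bogus zone/node\n", pfn);
		}
	}
}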

In fact it's not clear why memblock_reserve even exists: that
information can be calculated reliably by page_alloc as a function of
memblock.memory alone, by walking all nodes and all zones. It doesn't
even seem to help in destroying the direct mapping;
reserve_bootmem_region just initializes the struct pages, so it doesn't
need a special memblock.reserved to find those ranges.

In fact it's scary that code then does stuff like this, trusting that
memblock_reserve is nearly complete information (which it obviously
isn't, given type 20 doesn't get queued and I got that type 20 in all my
systems):

	for_each_reserved_mem_region(i, &start, &end) {
		if (addr >= start && addr_end <= end)
			return true;
	}

That code in irq-gic-v3-its.c should stop using
for_each_reserved_mem_region and instead check
pfn_valid(addr >> PAGE_SHIFT) and then
PageReserved(pfn_to_page(addr >> PAGE_SHIFT)).
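
Something like the below is what I mean (a sketch, not tested;
range_is_reserved() is a hypothetical name):

static bool range_is_reserved(phys_addr_t addr, unsigned long size)
{
	unsigned long pfn, end_pfn = PHYS_PFN(addr + size - 1);

	/* rely on the struct page state instead of memblock.reserved */
	for (pfn = PHYS_PFN(addr); pfn <= end_pfn; pfn++) {
		if (!pfn_valid(pfn) || !PageReserved(pfn_to_page(pfn)))
			return false;
	}
	return true;
}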

At best memblock.reserved should be calculated automatically by
page_alloc.c based on the zone_start_pfn/zone_end_pfn and not passed
by the e820 caller; instead of adding the memblock_reserve call for type
20 we should delete the memblock_reserve function.

Thanks,
Andrea



Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-11-25 Thread Andrea Arcangeli
On Wed, Nov 25, 2020 at 08:27:21PM +0100, David Hildenbrand wrote:
> On 25.11.20 19:28, Andrea Arcangeli wrote:
> > On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
> >> Before that change, the memmap of memory holes were only zeroed
> >> out. So the zones/nid was 0, however, pages were not reserved and
> >> had a refcount of zero - resulting in other issues.
> > 
> > So maybe that "0,0" zoneid/nid was not actually the thing that
> > introduced the regression? Note: I didn't bisect anything yet, it was
> > just a guess.
> 
> I guess 0/0 is the issue, but that existed before when we had a simple
> memmset(0). The root issue should be what Mike said:

Yes, the second stage must have stopped running somehow.

Is there anything we can do to induce a deterministically reproducible
kernel crashing behavior if the second stage doesn't run?

Why did we start doing a more graceful initialization in the first
stage, instead of making it less graceful by setting it to 0xff instead
of 0x00?

> 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather
> that check each PFN")

So if that's not intentional, are you suggesting nodeid/nid was a bug
if it was set to 0,0 for a non-RAM valid pfn?

> "correct" is problematic. If you have an actual memory hole, there is
> not always a right answer - unless I am missing something important.
> 
> 
> Assume you have a layout like this
> 
> [  zone X ] [ hole ] [ zone Y ]
> 
> If either X and or Y starts within a memory section, you have a valid
> memmap for X - but what would be the right node/zone?
> 
> 
> Assume you have a layout like this
> 
> [ zone X ]
> 
> whereby X ends inside a memory section. The you hotplug memory. Assume
> it goes to X
> 
> [ zone X ][ hole in X ][ zone X]
> 
> or it goes to y
> 
> [ zone X ][ hole ][ zone Y ]
> 
> This can easily be reproduced by starting a VM in qemu with a memory
> size not aligned to 128 MB (e.g., -M 4000) and hotplugging memory.

I don't get what the problem is, sorry.

You have a pfn; if pfn_valid() is true, pfn_to_page returns a page
deterministically.

It's up to the kernel to decide, in the pfn_to_page function, which page
structure belongs to any given pfn.

Now if the pfn_to_page(pfn) function returns a page whose nid/zone_id
in page->flags points to a node->zone whose zone_start_pfn -
end_zone_pfn range doesn't contain "pfn" that is a bug in
page_alloc.c.

I don't see how is it not possible to deterministically enforce the
above never happens. Only then it would be true that there's not
always a right answer.

zone can overlap, but it can't be that you do pfn_to_page of a
pfn_valid and you obtain a page whose zone doesn't contain that
pfn. Which is what is currently crashing compaction.

I don't see how this is an unsolvable problem and why we should accept
to live with a bogus page->flags for reserved pages.

> We can't. The general rule is (as I was once told by Michal IIRC) that

The fact that we can't reliably crash the kernel when somebody uses the
wrong 0,0 uninitialized value by not adding an explicit PageReserved
check is my primary concern with keeping those zoneid/nid uninitialized
but non-kernel-crashing, since it already created this unreproducible bug.

> I'm not rooting for "keep this at 0/0" - I'm saying that I think there
> are corner cases where it might not be that easy.

I'm not saying it's easy. What I don't see is how you don't always
have the right answer and why it would be an unsolvable problem.

It is certainly problematic and difficult to solve in the mem_map
initialization logic, but to me having pfn_valid() &&
page_zone(pfn_to_page(pfn)) randomly returning the DMA zone on the first
node also looks problematic and difficult to handle across all VM
code, so overall it looks preferable to keep the complexity of the
mem_map initialization self contained and not spilling over the rest
of the VM.

> Yes, but there is a "Some of these" :)
> 
> Boot a VM with "-M 4000" and observe the memmap in the last section -
> they won't get initialized a second time.

Is beyond the end of the zone yet another case? I guess that's
less likely to give us problems because it's beyond the end of the
zone. Would pfn_valid return true for those pfn? If pfn_valid is not
true it's not really a concern, but then again I'd rather those struct
pages beyond the end of the zone were set to the kernel-crashing 0xff.

In other words I just don't see why we should ever prefer to leave
some pages at a graceful and erroneous nid 0 nodeid 0 that wouldn't
easily induce a crash if used.

> AFAIK, the mem_map array might have multiple NIDs - and it's set when
> initializing the zones.

Well because there's no mem_map array with SPARSEMEM, but it's not

Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-11-25 Thread Andrea Arcangeli
On Wed, Nov 25, 2020 at 04:13:25PM +0200, Mike Rapoport wrote:
> I suspect that memmap for the reserved pages is not properly initialized
> after recent changes in free_area_init(). They are cleared at
> init_unavailable_mem() to have zone=0 and node=0, but they seem to be

I'd really prefer that we not leave those at 0,0 and that we instead set
the whole struct page to 0xff if we miss the second stage that corrects
the uninitialized value. The hope is that it'll crash faster and more
reproducibly that way.

> never re-initialized with proper zone and node links which was not the
> case before commit 73a6e474cb37 ("mm: memmap_init: iterate over memblock
> regions rather that check each PFN").

What's strange is that 73a6e474cb37 was suggested as fix for this
bug...

https://lkml.kernel.org/r/20200505124314.GA5029@MiWiFi-R3L-srv

The addition of "pageblock_pfn_to_page" to validate min_pfn was added
in commit 73a6e474cb37, so I assumed that the first report below
didn't have commit 73a6e474cb37 already applied.

https://lkml.kernel.org/r/8c537eb7-85ee-4dcf-943e-3cc0ed0df...@lca.pw

However if you're correct perhaps the patch was already applied in
5.7.0-rc2-next-20200423+, it landed upstream in v5.8 after all.

> Back then, memmap_init_zone() looped from zone_start_pfn till
> zone_end_pfn and struct page for reserved pages with pfns inside the
> zone would be initialized.
> 
> Now the loop is for interesection of [zone_start_pfn, zone_end_pfn] with
> memblock.memory and for x86 reserved ranges are not in memblock.memory,
> so the memmap for them remains semi-initialized.

That would match the symptoms. I'll test it first thing after
confirming older kernels had the right zoneid/nid on the reserved
pages.

Thanks,
Andrea



Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-11-25 Thread Andrea Arcangeli
On Wed, Nov 25, 2020 at 01:08:54PM +0100, Vlastimil Babka wrote:
> Yeah I guess it would be simpler if zoneid/nid was correct for 
> pfn_valid() pfns within a zone's range, even if they are reserved due 
> not not being really usable memory.
> 
> I don't think we want to introduce CONFIG_HOLES_IN_ZONE to x86. If the 
> chosen solution is to make this to a real hole, the hole should be 
> extended to MAX_ORDER_NR_PAGES aligned boundaries.

The way pfn_valid works, it's not possible to render all non-RAM pfn as
!pfn_valid; CONFIG_HOLES_IN_ZONE would not achieve it 100% either. So
I don't think we can rely on that to eliminate all non-RAM reserved
pages from the mem_map and avoid having to initialize them in the
first place. Some could remain, as in this case, since in the same
pageblock there's non-RAM followed by RAM and all pfn are valid.
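
For reference, this is roughly the shape of the SPARSEMEM pfn_valid()
(simplified, subsection handling omitted), which is why any non-RAM range
sharing a section with RAM still reports pfn_valid() == true and so has a
struct page that must be initialized:

static inline int pfn_valid_simplified(unsigned long pfn)
{
	unsigned long sec = pfn_to_section_nr(pfn);

	if (sec >= NR_MEM_SECTIONS)
		return 0;
	/* section granular: the whole section is valid or not */
	return valid_section(__nr_to_section(sec));
}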

> In any case, compaction code can't fix this with better range checks.

David's correct that it can, by adding enough PageReserved (I'm
running all systems reproducing this with plenty of PageReserved
checks in all places to work around it until we do a proper fix).

My problem with that is that 1) it's simply not enforceable at runtime
that there is no missing PageReserved check, and 2) what benefit would it
provide to leave a wrong zoneid in reserved pages and have to add extra
PageReserved checks?

A struct page has a deterministic zoneid/nid. If it's pointed to by a
valid pfn (as in pfn_valid()), the simplest rule is that the zoneid/nid
in the page remains correct no matter whether it's reserved at boot or
was marked reserved by a driver that swaps the page somewhere else with
the GART or EFI or something else. All reserved pages should work the
same, RAM and non-RAM, since the non-RAM status can basically change
at runtime if a driver assigns the page to hw somehow.

NOTE: on the compaction side, we still need to add the
pageblock_pfn_to_page check to validate the "highest" pfn, because the
pfn_valid() check is missing on the first pfn of the pageblock, as is
the check for a pageblock that spans two different zones.

Thanks,
Andrea



Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-11-25 Thread Andrea Arcangeli
On Wed, Nov 25, 2020 at 12:41:55PM +0100, David Hildenbrand wrote:
> On 25.11.20 12:04, David Hildenbrand wrote:
> > On 25.11.20 11:39, Mel Gorman wrote:
> >> On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
>  Something must have changed more recently than v5.1 that caused the
>  zoneid of reserved pages to be wrong, a possible candidate for the
>  real would be this change below:
> 
>  +   __init_single_page(pfn_to_page(pfn), pfn, 0, 0);
> 
> >>>
> >>> Before that change, the memmap of memory holes were only zeroed out. So 
> >>> the zones/nid was 0, however, pages were not reserved and had a refcount 
> >>> of zero - resulting in other issues.
> >>>
> >>> Most pfn walkers shouldn't mess with reserved pages and simply skip 
> >>> them. That would be the right fix here.
> >>>
> >>
> >> Ordinarily yes, pfn walkers should not care about reserved pages but it's
> >> still surprising that the node/zone linkages would be wrong for memory
> >> holes. If they are in the middle of a zone, it means that a hole with
> >> valid struct pages could be mistaken for overlapping nodes (if the hole
> >> was in node 1 for example) or overlapping zones which is just broken.
> > 
> > I agree within zones - but AFAIU, the issue is reserved memory between
> > zones, right?
> 
> Double checking, I was confused. This applies also to memory holes
> within zones in x86.

Yes, this is a memory hole within the DMA32 zone.

Still, why should there be any difference?

As long as a page struct exists, it's in a well defined mem_map array
which corresponds to one and only one zoneid/nid combination.

So what would be the benefit of treating memory holes within zones or
in between zones differently and leave one or the other with a
zoneid/nid uninitialized?



Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-11-25 Thread Andrea Arcangeli
On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
> Before that change, the memmap of memory holes were only zeroed
> out. So the zones/nid was 0, however, pages were not reserved and
> had a refcount of zero - resulting in other issues.

So maybe that "0,0" zoneid/nid was not actually the thing that
introduced the regression? Note: I didn't bisect anything yet, it was
just a guess.

In fact, the first thing I need is to boot with an old v5.3 kernel and
dump the page->flags around the "Unknown E820 type" region to be 100%
certain that the older kernels had a correct page->flags for reserved
pages. I will clear this up before the end of the day.

> Most pfn walkers shouldn't mess with reserved pages and simply skip
> them. That would be the right fix here.

What would then be the advantage of keeping wrong page->flags in
reserved pages compared to actually setting them correctly?

How do you plan to enforce that every bit of VM code will call
PageReserved (as would have to be introduced in pageblock_pfn_to_page in
that case) if it's not possible to disambiguate a real DMA zone and nid 0
from the uninitialized value? How can we patch page_nodenum to enforce
that it won't read an uninitialized value when a PageReserved check was
missing in the caller?

I don't see this being easily enforceable by code review; it'd be pretty
flaky to leave a wrong 0,0 value in there with no way to verify whether a
PageReserved check in the caller was missing.
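
To illustrate, if the first stage poisoned the memmap with 0xff instead
of 0,0, the enforcement for a page_nodenum-style helper could look like
this (purely illustrative, not existing code):

static inline int page_to_nid_checked(struct page *page)
{
	/* any caller that skipped the PageReserved check hits this */
	VM_BUG_ON_PAGE(page->flags == ~0UL, page);
	return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
}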

> > It'd be preferable if the pfn_valid fails and the
> > pfn_to_section_nr(pfn) returns an invalid section for the intermediate
> > step. Even better memset 0xff over the whole page struct until the
> > second stage comes around.
> 
> I recently discussed with Baoquan to
> 1. Using a new pagetype to mark memory holes
> 2. Using special nid/zid to mark memory holes
> 
> Setting the memmap to 0xff would be even more dangerous - pfn_zone() might 
> simply BUG_ON.

What would need to call pfn_zone in between first and second stage?

If something calls pfn_zone in between first and second stage isn't it
a feature if it crashes the kernel at boot?

Note: I suggested 0xff kernel crashing "until the second stage comes
around" during meminit at boot, not permanently.

/*
 * Use a fake node/zone (0) for now. Some of these pages
 * (in memblock.reserved but not in memblock.memory) will
 * get re-initialized via reserve_bootmem_region() later.
 */

Specifically I relied on the comment "get re-initialized via
reserve_bootmem_region() later".

I assumed the second stage overwrites the 0,0 to the real zoneid/nid
value, which is clearly not happening, hence it'd be preferable to get
a crash at boot reliably.

Now I have CONFIG_DEFERRED_STRUCT_PAGE_INIT=n so the second stage
calling init_reserved_page(start_pfn) won't do much with
CONFIG_DEFERRED_STRUCT_PAGE_INIT=n but I already tried to enable
CONFIG_DEFERRED_STRUCT_PAGE_INIT=y yesterday and it didn't help, the
page->flags were still wrong for reserved pages in the "Unknown E820
type" region.

> > Whenever pfn_valid is true, it's better that the zoneid/nid is correct
> > all times, otherwise if the second stage fails we end up in a bug with
> > weird side effects.
> 
> Memory holes with a valid memmap might not have a zone/nid. For now, skipping 
> reserved pages should be good enough, no?

The zoneid is always a purely abstract concept: not just the RAM holes,
but the RAM itself doesn't have a zoneid at all in hardware.

The nid is hardware defined for RAM, but for reserved RAM holes it
becomes a pure software construct like the zoneid.

So while it's right that they have no zone/nid in the hardware, they
still have a very well defined zoneid/nid in the software implementation.

The page struct belongs to one and only one mem_map array, which has a
specific zoneid and nid. So it can always be initialized right, even
for non-RAM.

Only if the pfn is !pfn_valid does it have no page struct and therefore
no zone/nid association, but in that case there's no reason to
even worry about it, since nobody can possibly get a wrong value out of
the page struct when there is no page struct at all.

Last but not least, RAM pages can be marked reserved and assigned
to hardware, so it'd be really messy if real reserved RAM pages
given to hw behaved differently from non-RAM that accidentally got a
struct page because of section alignment issues.

Thanks,
Andrea



Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-11-25 Thread Andrea Arcangeli
On Wed, Nov 25, 2020 at 10:30:53AM +, Mel Gorman wrote:
> On Tue, Nov 24, 2020 at 03:56:22PM -0500, Andrea Arcangeli wrote:
> > Hello,
> > 
> > On Tue, Nov 24, 2020 at 01:32:05PM +, Mel Gorman wrote:
> > > I would hope that is not the case because they are not meant to overlap.
> > > However, if the beginning of the pageblock was not the start of a zone
> > > then the pages would be valid but the pfn would still be outside the
> > > zone boundary. If it was reserved, the struct page is valid but not
> > > suitable for set_pfnblock_flags_mask. However, it is a concern in
> > > general because the potential is there that pages are isolated from the
> > > wrong zone.
> > 
> > I guess we have more than one issue to correct in that function
> > because the same BUG_ON reproduced again even with the tentative patch
> > I posted earlier.
> > 
> > So my guess is that the problematic reserved page isn't pointed by the
> > min_pfn, but it must have been pointed by the "highest" variable
> > calculated below?
> > 
> > if (pfn >= highest)
> > highest = pageblock_start_pfn(pfn);
> > 
> > When I looked at where "highest" comes from, it lacks
> > pageblock_pfn_to_page check (which was added around v5.7 to min_pfn).
> > 
> > Is that the real bug, which may be fixed by something like this? (untested)
> > 
> 
> It's plausible as it is a potential source of leaking but as you note
> in another mail, it's surprising to me that valid struct pages, even if
> within memory holes and reserved would have broken node/zone information
> in the page flags.

I think the patch to add pageblock_pfn_to_page is still needed to cope
with a !pfn_valid pfn or a pageblock in between zones, but neither a
!pfn_valid pfn nor a pageblock in between zones is what happens here.

So the patch adding pageblock_pfn_to_page would have had the undesired
side effect of hiding the bug, so it's best to deal with the other bug
first.



Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-11-24 Thread Andrea Arcangeli
Hello,

On Mon, Nov 23, 2020 at 02:01:16PM +0100, Vlastimil Babka wrote:
> On 11/21/20 8:45 PM, Andrea Arcangeli wrote:
> > A corollary issue was fixed in
> > 39639000-39814fff : Unknown E820 type
> > 
> > pfn 0x7a200 -> 0x7a20 min_pfn hit non-RAM:
> > 
> > 7a17b000-7a216fff : Unknown E820 type
> 
> It would be nice to also provide a /proc/zoneinfo and how exactly the 
> "zone_spans_pfn" was violated. I assume we end up below zone's 
> start_pfn, but is it true?

Agreed, I was about to grab that info along with all page struct
around the pfn 0x7a200 and phys address 0x7a216fff.

# grep -A1 E820 /proc/iomem
7a17b000-7a216fff : Unknown E820 type
7a217000-7bff : System RAM

DMA      zone_start_pfn 1        zone_end_pfn() 4096     contiguous 1
DMA32    zone_start_pfn 4096     zone_end_pfn() 1048576  contiguous 0
Normal   zone_start_pfn 1048576  zone_end_pfn() 4715392  contiguous 1
Movable  zone_start_pfn 0        zone_end_pfn() 0        contiguous 0

500222 0x7a1fe000 0x1fff1000 reserved True
500223 0x7a1ff000 0x1fff1000 reserved True

# I suspect "highest pfn" was somewhere in the RAM range
# 0x7a217000-0x7a40 and the pageblock_start_pfn(pfn)
# made highest point to pfn 0x7a200 physaddr 0x7a20
# below, which is reserved indeed since it's non-RAM
# first number is pfn hex(500224) == 0x7a200

pfn    physaddr   page->flags
500224 0x7a20 0x1fff1000 reserved True
500225 0x7a201000 0x1fff1000 reserved True
*snip*
500245 0x7a215000 0x1fff1000 reserved True
500246 0x7a216000 0x1fff1000 reserved True
500247 0x7a217000 0x3fff reserved False
500248 0x7a218000 0x3fff reserved False

All RAM pages non-reserved are starting at 0x7a217000 as expected.

For the non-RAM pfn, page_zonenum(pfn_to_page(0x7a200)) points to
ZONE_DMA and page_zone(page) below was the DMA zone, despite the pfn
0x7a200 being in DMA32.

VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

So the patch I sent earlier should prevent the above BUG_ON by not
setting highest to 0x7a200 when pfn is in the phys RAM range
0x7a217000-0x7a40, because pageblock_pfn_to_page will notice that
the zone is the wrong one.

if (page_zone(start_page) != zone)
return NULL;

However the real bug seems to be that reserved pages have a zero zone_id
in page->flags when they should have the real zone id/nid. The patch I
sent earlier to validate "highest" would only be needed to deal with
!pfn_valid pfns.

Something must have changed more recently than v5.1 that caused the
zoneid of reserved pages to be wrong, a possible candidate for the
real would be this change below:

+   __init_single_page(pfn_to_page(pfn), pfn, 0, 0);

Even if it may not be the culprit, in light of how the reserved page
zoneid/nid initialization went wrong, the above line looks too flaky
to stay.

It'd be preferable if the pfn_valid fails and the
pfn_to_section_nr(pfn) returns an invalid section for the intermediate
step. Even better memset 0xff over the whole page struct until the
second stage comes around.

Whenever pfn_valid is true, it's better that the zoneid/nid is correct
all times, otherwise if the second stage fails we end up in a bug with
weird side effects.

Maybe it's not the above that left a zero zoneid though; I haven't
tried to bisect it yet to see how the page->flags looked on an
older kernel that didn't seem to reproduce this crash. I'm just
guessing.

Thanks,
Andrea



Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-11-24 Thread Andrea Arcangeli
Hello,

On Tue, Nov 24, 2020 at 01:32:05PM +, Mel Gorman wrote:
> I would hope that is not the case because they are not meant to overlap.
> However, if the beginning of the pageblock was not the start of a zone
> then the pages would be valid but the pfn would still be outside the
> zone boundary. If it was reserved, the struct page is valid but not
> suitable for set_pfnblock_flags_mask. However, it is a concern in
> general because the potential is there that pages are isolated from the
> wrong zone.

I guess we have more than one issue to correct in that function
because the same BUG_ON reproduced again even with the tentative patch
I posted earlier.

So my guess is that the problematic reserved page isn't pointed by the
min_pfn, but it must have been pointed by the "highest" variable
calculated below?

if (pfn >= highest)
highest = pageblock_start_pfn(pfn);

When I looked at where "highest" comes from, it lacks
pageblock_pfn_to_page check (which was added around v5.7 to min_pfn).

Is that the real bug, which may be fixed by something like this? (untested)

==
From 262671e88723b3074251189004ceae39dcd1689d Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli 
Date: Sat, 21 Nov 2020 12:55:58 -0500
Subject: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set
 pageblock_skip on reserved pages

A corollary issue was fixed in
e577c8b64d58fe307ea4d5149d31615df2d90861. A second issue remained in
v5.7:

https://lkml.kernel.org/r/8c537eb7-85ee-4dcf-943e-3cc0ed0df...@lca.pw

==
page:eaaa refcount:1 mapcount:0 mapping:2243743b index:0x0
flags: 0x1fffe01000(reserved)
==

73a6e474cb376921a311786652782155eac2fdf0 was applied to supposedly fix
the second issue, but it still reproduces with v5.9 on two different
systems:

==
page:62b3e92f refcount:1 mapcount:0 mapping: index:0x0 
pfn:0x39800
flags: 0x1000(reserved)
raw: 1000 f5b880e60008 f5b880e60008 
raw:   0001 
page dumped because: VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn))
==
page:4d32675d refcount:1 mapcount:0 mapping: index:0x0 
pfn:0x7a200
flags: 0x1fff1000(reserved)
raw: 1fff1000 e6c5c1e88008 e6c5c1e88008 
raw:   0001 
page dumped because: VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn))
==

The page is "reserved" in all cases. In the last two crashes with the
pfn:

pfn 0x39800 -> 0x3980 min_pfn hit non-RAM:

39639000-39814fff : Unknown E820 type

pfn 0x7a200 -> 0x7a20 min_pfn hit non-RAM:

7a17b000-7a216fff : Unknown E820 type

The pageblock_pfn_to_page check was missing when rewinding pfn to the
start of the pageblock to calculate the "highest" value.

So the "highest" pfn could point to a non valid pfn or not within the
zone, checking the pageblock_pfn_to_page fixes it.

Fixes: 5a811889de10 ("mm, compaction: use free lists to quickly locate a 
migration target")
Signed-off-by: Andrea Arcangeli 
---
 mm/compaction.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 653862aba266..76f061af8f22 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1418,8 +1418,14 @@ fast_isolate_freepages(struct compact_control *cc)
 			nr_scanned++;
 			pfn = page_to_pfn(freepage);
 
-			if (pfn >= highest)
-				highest = pageblock_start_pfn(pfn);
+			if (pfn >= highest) {
+				unsigned long start_pfn, end_pfn;
+				start_pfn = pageblock_start_pfn(pfn);
+				end_pfn = pageblock_end_pfn(start_pfn);
+				if (pageblock_pfn_to_page(start_pfn, end_pfn,
+							  cc->zone))
+					highest = start_pfn;
+			}
 
 			if (pfn >= low_pfn) {
 				cc->fast_search_fail = 0;
==


Can't we also try to scan between start_pfn and "pfn" to see if
there's one pfn that passes the pageblock_pfn_to_page test, or is it
not worth it for the fast isolate variant?
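
Something like this untested sketch is what I mean (the helper name
and signature are made up for illustration only):

static unsigned long first_valid_highest(unsigned long start_pfn,
					 unsigned long pfn,
					 unsigned long end_pfn,
					 struct zone *zone)
{
	unsigned long p;

	/*
	 * Probe forward from the pageblock start towards pfn until a
	 * start passes the pageblock_pfn_to_page() check for the zone.
	 */
	for (p = start_pfn; p <= pfn; p++) {
		if (pageblock_pfn_to_page(p, end_pfn, zone))
			return p;
	}

	return pfn;	/* nothing better found, keep the unrewound pfn */
}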

> > Then compact_finished() detects that in
> > compact_zone(), but only after migrate_pages() and thus
> > fast_isolate_freepages() is called.
> > 
> > That would mean distance can be negative, or rather a large unsigned number
> > and low_pfn and min_pfn end up away from the zone?
> > 
> > Or maybe the above doesn't happen, but cc->free_pfn gets so close to start
> > of the zone, that the 

Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-11-21 Thread Andrea Arcangeli
On Sat, Nov 21, 2020 at 02:45:06PM -0500, Andrea Arcangeli wrote:
> + if (likely(!PageReserved(page)))

NOTE: this line will have to become "likely(page &&
!PageReserved(page))" to handle the case of non-contiguous zones,
since pageblock_pfn_to_page might return NULL in that case. I noticed
right after sending, but I'll wait for some feedback before resending
the correction in case it gets fixed in another way.
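
For clarity, the corrected hunk would read like this (only the
condition changes from the patch):

	page = pageblock_pfn_to_page(min_pfn,
			pageblock_end_pfn(min_pfn),
			cc->zone);
	/*
	 * pageblock_pfn_to_page() can return NULL on non-contiguous
	 * zones, so check the pointer before PageReserved().
	 */
	if (likely(page && !PageReserved(page)))
		cc->free_pfn = min_pfn;
	else
		page = NULL;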



[PATCH 0/1] VM_BUG_ON_PAGE(!zone_spans_pfn) in set_pfnblock_flags_mask

2020-11-21 Thread Andrea Arcangeli
Hello,

After hitting this twice on two different systems, I'm now running
with the tentative fix applied, but it's not a meaningful test since
it's non reproducible.

However it is possible to inject this bug if you do "grep E820
/proc/iomem" and then find a phys addr there with a struct page
(i.e. pfn_valid) in a zone, with this change:

min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));
+   if (cc->zone is the zone where the e820 physaddr has a pfn_valid)
+  min_pfn = physaddr_of_E820_non_RAM_page_with_valid_pfn >> PAGE_SHIFT;

I didn't try to inject the bug to validate the fix and it'd be great
if someone can try that to validate this or any other fix.
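
Concretely the injection could look something like this (untested
sketch; 0x7a200000 is simply a phys addr inside the "Unknown E820
type" range from the crash report, it has to be adapted to whatever
"grep E820 /proc/iomem" returns locally):

	min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));
	/* force min_pfn onto a pfn_valid, reserved, non-RAM page */
	if (zone_spans_pfn(cc->zone, 0x7a200000UL >> PAGE_SHIFT))
		min_pfn = 0x7a200000UL >> PAGE_SHIFT;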

Andrea Arcangeli (1):
  mm: compaction: avoid fast_isolate_around() to set pageblock_skip on
reserved pages

 mm/compaction.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)



[PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-11-21 Thread Andrea Arcangeli
A corollary issue was fixed in
e577c8b64d58fe307ea4d5149d31615df2d90861. A second issue remained in
v5.7:

https://lkml.kernel.org/r/8c537eb7-85ee-4dcf-943e-3cc0ed0df...@lca.pw

==
page:eaaa refcount:1 mapcount:0 mapping:2243743b index:0x0
flags: 0x1fffe01000(reserved)
==

73a6e474cb376921a311786652782155eac2fdf0 was applied to supposedly fix
the second issue, but I still reproduced it twice with v5.9 on two
different systems:

==
page:62b3e92f refcount:1 mapcount:0 mapping: index:0x0 
pfn:0x39800
flags: 0x1000(reserved)
==
page:2a7114f8 refcount:1 mapcount:0 mapping: index:0x0 
pfn:0x7a200
flags: 0x1fff1000(reserved)
==

I actually never reproduced it until v5.9, but it's still the same bug
that was first reported against v5.7.

Note that the page is "reserved" in all 3 cases. In the last two crashes
with the pfn:

pfn 0x39800 -> 0x3980 min_pfn hit non-RAM:

39639000-39814fff : Unknown E820 type

pfn 0x7a200 -> 0x7a20 min_pfn hit non-RAM:

7a17b000-7a216fff : Unknown E820 type

This actually seems a false positive bugcheck: the page structures are
valid and the zones are correct, it's just non-RAM, and setting
pageblock_skip should do no harm. However it's possible to solve the
crash without lifting the bugcheck, by enforcing the invariant that
the free_pfn cursor doesn't point to reserved pages (which would
otherwise be implicitly achieved through the PageBuddy check, except
in the new fast_isolate_around() path).

Fixes: 5a811889de10 ("mm, compaction: use free lists to quickly locate a 
migration target")
Signed-off-by: Andrea Arcangeli 
---
 mm/compaction.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 13cb7a961b31..d17e69549d34 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1433,7 +1433,10 @@ fast_isolate_freepages(struct compact_control *cc)
 				page = pageblock_pfn_to_page(min_pfn,
 					pageblock_end_pfn(min_pfn),
 					cc->zone);
-				cc->free_pfn = min_pfn;
+				if (likely(!PageReserved(page)))
+					cc->free_pfn = min_pfn;
+				else
+					page = NULL;
 			}
 		}
 	}


