Re: [PATCH v3 2/2] sysctl: handle overflow for file-max

2018-10-18 Thread Andrea Arcangeli
Hi Al,

On Wed, Oct 17, 2018 at 01:35:48AM +0100, Al Viro wrote:
> On Wed, Oct 17, 2018 at 12:33:22AM +0200, Christian Brauner wrote:
> > Currently, when writing
> > 
> > echo 18446744073709551616 > /proc/sys/fs/file-max
> > 
> > /proc/sys/fs/file-max will overflow and be set to 0. That quickly
> > crashes the system.
> > This commit sets the max and min value for file-max and returns -EINVAL
> > when a long int is exceeded. Any higher value cannot currently be used as
> > the percpu counters are long ints and not unsigned integers. This behavior
> > also aligns with other tuneables that return -EINVAL when their range is
> > exceeded. See e.g. [1], [2] and others.
> 
> Mostly sane, but...  get_max_files() users are bloody odd.  The one in
> file-max limit reporting looks like a half-arsed attempt in "[PATCH] fix
> file counting".  The one in af_unix.c, though...  I don't remember how
> that check had come to be - IIRC that was a strange fallout of a thread
> with me, Andrea and ANK involved, circa 1999, but I don't remember details;
> Andrea, any memories?  It might be worth reconsidering...  The change in
> question is in 2.2.4pre6; what do we use unix_nr_socks for?  We try to
> limit the number of PF_UNIX socks by 2 * max_files, but max_files can be
> huge *and* non-constant (i.e. it can decrease).  What's more, 
> unix_tot_inflight
> is unsigned int and max_files might exceed 2^31 just fine since "fs: allow
> for more than 2^31 files" back in 2010...  Something's fishy there...

Feels like a lifetime ago :), but looking into it I remembered some of
it. That thread was about some instability in unix sockets caused by an
unrelated bug in the garbage collector. While reviewing your fix, I
probably incidentally found a resource exhaustion problem in doing a
connect();close() loop against a listening stream af_unix socket. I
found an exploit somewhere in my home directory dated '99 according to
ls -l. Then ANK found another resource exhaustion by sending datagram
sockets, which I also found an exploit for in my home directory.

ANK pointed out that a connect syscall allocates two sockets, one to
be accepted by the listening process, the other being the connecting
one itself. That must be the explanation for the "*2".

The "max_files*2" check probably came from a patch of yours (which was
not overflowing back then), in an attempt to fix the garbage collector
issue that initially looked like resource exhaustion.

I may have suggested checking sk_max_ack_backlog and failing connect()
in that case to solve the resource exhaustion, but my proposal was
obviously broken because connect() would return an error when the
backlog was full, and I suppose I didn't implement anything like
unix_wait_for_peer. So I guess (not 100% sure) the get_max_files()*2
check stayed, not because of the bug in the garbage collector, which
was fixed independently, but as a stop-gap measure for the
connect();close() loop resource exhaustion.

I tried the exploit that does a connect();close() in a loop and it
gracefully hangs in unix_wait_for_peer() after sk_max_ack_backlog
connects.
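
For reference, a minimal sketch of that kind of loop (the socket path
below is made up, and the real '99 exploit obviously differs in
detail):

        /* connect();close() in a loop against a listening AF_UNIX
         * stream socket; "/tmp/victim.sock" is a placeholder path */
        #include <string.h>
        #include <unistd.h>
        #include <sys/socket.h>
        #include <sys/un.h>

        int main(void)
        {
                struct sockaddr_un addr;

                memset(&addr, 0, sizeof(addr));
                addr.sun_family = AF_UNIX;
                strcpy(addr.sun_path, "/tmp/victim.sock");

                for (;;) {
                        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

                        if (fd < 0)
                                return 1;
                        /* each connect() creates the embryo socket
                         * queued on the listener in addition to this
                         * one; close() alone doesn't free the queued
                         * embryo */
                        connect(fd, (struct sockaddr *)&addr,
                                sizeof(addr));
                        close(fd);
                }
        }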

Out of curiosity I also tried the dgram exploit and it hangs in
sock_alloc_send_pskb() on the sk_wmem_alloc_get(sk) < sk->sk_sndbuf
check. The file*2 limit couldn't have helped that one anyway.

If I set /proc/sys/net/core/somaxconn to 100 the exploit works again
and the connect();close() loop once more allocates an unbounded amount
of kernel RAM in a tiny-RSS process, and it triggered the OOM (there
was no OOM killer in v2.2 I suppose). By default it's 128. There's
also sysctl_max_dgram_qlen for dgram, which on Android is set to 600
(by default 10).

I tend to think these resources are now capped by other means (notably
somaxconn, sysctl_max_dgram_qlen, sk_wmem_alloc_get) and unix_nr_socks
can be dropped. Or, if that atomic counter is still needed, it's no
longer to stop a trivial exploit that just does listen(); SIGSTOP()
from one process and a connect();close() loop in another process. It'd
require more than a single listening socket: either heavy forking or a
large increase in the max number of file descriptors (a privileged op)
to do a ton of listens, but forking has its own memory footprint in
userland too. At the very least it should be a per-cpu counter synced
to the global atomic only after a threshold.
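
Something in the spirit of the in-tree percpu_counter would do; a
rough sketch of the idea (the helper names below are invented, this is
not a patch):

        /* sketch: turn unix_nr_socks into a percpu counter so the
         * common path only touches a per-cpu variable and the global
         * sum is synced in batches; the approximate read is good
         * enough for a sanity limit */
        #include <linux/fs.h>
        #include <linux/percpu_counter.h>

        static struct percpu_counter unix_nr_socks;

        static int __init unix_nr_socks_init(void)
        {
                return percpu_counter_init(&unix_nr_socks, 0,
                                           GFP_KERNEL);
        }

        static bool unix_sock_count_inc(void)
        {
                percpu_counter_inc(&unix_nr_socks);
                if (percpu_counter_read_positive(&unix_nr_socks) >
                    2 * get_max_files()) {
                        percpu_counter_dec(&unix_nr_socks);
                        return false;
                }
                return true;
        }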

The other reason for dropping it is that it wasn't ideal that the
trivial exploit could still allocate max_files*2 SYN skbs with a
connect();close() loop; max_files*2 is too much already, so I suppose
it was only a stop-gap measure to begin with.

Thanks,
Andrea


Re: possible deadlock in aio_poll

2018-10-17 Thread Andrea Arcangeli
On Mon, Sep 10, 2018 at 09:53:17AM -0700, Christoph Hellwig wrote:
> On Mon, Sep 10, 2018 at 12:41:05AM -0700, syzbot wrote:
> > =
> > WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
> > 4.19.0-rc2+ #229 Not tainted
> > -
> > syz-executor2/9399 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> > 126506e0 (&ctx->fd_wqh){+.+.}, at: spin_lock
> > include/linux/spinlock.h:329 [inline]
> > 126506e0 (&ctx->fd_wqh){+.+.}, at: aio_poll+0x760/0x1420
> > fs/aio.c:1747
> > 
> > and this task is already holding:
> > 2bed6bf6 (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq
> > include/linux/spinlock.h:354 [inline]
> > 2bed6bf6 (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll+0x738/0x1420
> > fs/aio.c:1746
> > which would create a new lock dependency:
> >  (&(&ctx->ctx_lock)->rlock){..-.} -> (&ctx->fd_wqh){+.+.}
> 
> ctx->fd_wqh seems to only exist in userfaultfd, which indeed seems
> to do strange open coded waitqueue locking, and seems to fail to disable
> irqs.  Something like this should fix it:
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index bfa0ec69f924..356d2b8568c1 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1026,7 +1026,7 @@ static ssize_t userfaultfd_ctx_read(struct 
> userfaultfd_ctx *ctx, int no_wait,
>   struct userfaultfd_ctx *fork_nctx = NULL;
>  
>   /* always take the fd_wqh lock before the fault_pending_wqh lock */
> - spin_lock(&ctx->fd_wqh.lock);
> + spin_lock_irq(&ctx->fd_wqh.lock);
>   __add_wait_queue(&ctx->fd_wqh, &wait);
>   for (;;) {
>   set_current_state(TASK_INTERRUPTIBLE);
> @@ -1112,13 +1112,13 @@ static ssize_t userfaultfd_ctx_read(struct 
> userfaultfd_ctx *ctx, int no_wait,
>   ret = -EAGAIN;
>   break;
>   }
> - spin_unlock(&ctx->fd_wqh.lock);
> + spin_unlock_irq(&ctx->fd_wqh.lock);
>   schedule();
> - spin_lock(&ctx->fd_wqh.lock);
> + spin_lock_irq(&ctx->fd_wqh.lock);
>   }
>   __remove_wait_queue(&ctx->fd_wqh, &wait);
>   __set_current_state(TASK_RUNNING);
> - spin_unlock(&ctx->fd_wqh.lock);
> + spin_unlock_irq(&ctx->fd_wqh.lock);
>  
>   if (!ret && msg->event == UFFD_EVENT_FORK) {
>   ret = resolve_userfault_fork(ctx, fork_nctx, msg);
> 

Reviewed-by: Andrea Arcangeli 

This is a lock inversion with userfaultfd_poll, which takes the fd_wqh
lock after the irqsafe aio lock. And the aio lock can be taken from
softirq (so potentially from interrupts), leading to a potential lock
inversion deadlock.

I suggest adding a comment about the above in the code before the
first spin_lock_irq to explain why it needs to be _irq, or it's not
obvious.
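
For example (the exact wording of the comment is only a suggestion):

        /*
         * aio_poll (through userfaultfd_poll) takes fd_wqh.lock while
         * already holding the irqsafe aio ctx_lock, and the aio lock
         * can be taken from softirq, so fd_wqh.lock must always be
         * taken with irqs disabled here or lockdep will (rightfully)
         * report the lock inversion.
         */
        spin_lock_irq(&ctx->fd_wqh.lock);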

c430d1e848ff1240d126e79780f3c26208b8aed9 was just a false positive
instead.

I didn't comment on c430d1e848ff1240d126e79780f3c26208b8aed9 because I
was too busy with the speculative execution issues at the time and it
was just fine to drop the micro-optimization, but while at it, can we
look into how to add a spin_acquire, or find another way to teach
lockdep, so it's happy even if we restore the micro-optimization?

If we do that, in addition we should also initialize the
ctx->fault_wqh spinlock to locked in the same patch (a spin_lock
during uffd ctx creation will do) to be sure nobody takes it, as a
further robustness feature against future modifications (it also gets
more self-documenting that it's not supposed to be taken and that
fault_pending_wqh.lock has to be taken instead).
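
i.e. something along these lines at uffd ctx creation time (only a
sketch of the idea above, not a tested patch):

        /* sketch: after initializing the waitqueue at ctx creation,
         * leave the unused fault_wqh internal lock permanently taken,
         * so any future spin_lock(&ctx->fault_wqh.lock) deadlocks
         * loudly instead of silently racing; fault_pending_wqh.lock
         * is what protects both waitqueues */
        init_waitqueue_head(&ctx->fault_wqh);
        spin_lock(&ctx->fault_wqh.lock);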

Thanks,
Andrea


Re: [PATCH] mm/thp: Correctly differentiate between mapped THP and PMD migration entry

2018-10-16 Thread Andrea Arcangeli
Hello Zi,

On Sun, Oct 14, 2018 at 08:53:55PM -0400, Zi Yan wrote:
> Hi Andrea, what is the purpose/benefit of making x86’s pmd_present() returns 
> true
> for a THP under splitting? Does it cause problems when ARM64’s pmd_present()
> returns false in the same situation?

!pmd_present means it's a migration entry or swap entry and doesn't
point to RAM. It means if you do pmd_to_page(*pmd) it will return you
an undefined result.

During splitting the physical page is still very much pointed to by
the pmd, as long as pmd_trans_huge returns true and you hold the
pmd_lock.

pmd_trans_huge must be true at all times for a transhuge pmd that
points to a hugepage, or all VM fast paths won't serialize with the
pmd_lock; that is the only reason why, and it's a very good reason
because it avoids taking the pmd_lock when walking over non-transhuge
pmds (i.e. when there are no THPs allocated).

Now, if we have to keep _PAGE_PSE set and return true from
pmd_trans_huge at all times, why would you want to make pmd_present
return false? How could it help if pmd_trans_huge returns true, but
pmd_present returns false, despite pmd_to_page working fine and the
pmd really still pointing to the page?

When userland faults on such a !pmd_present pmd, the page fault will
take a swap or migration path, but that's the wrong path if the pmd
points to RAM.

What we need to do during split is an invalidate of the huge TLB.
There's no pmd_trans_splitting anymore, so we only clear the present
bit in the pmd while pmd_present still returns true (just like
PROT_NONE, nothing new in this respect). pmd_present never meant the
real present bit in the pte was set; it just means the pmd points to
RAM. It means it doesn't point to a swap or migration entry, and you
can do pmd_to_page and it works fine.

We need to invalidate the TLB by clearing the present bit and by
flushing the TLB before overwriting the transhuge pmd with the regular
pte (i.e. to make it non huge). That is actually required by an erratum
(L1 cache aliasing of the same mapping through two different TLB
entries of two different sizes broke some old CPUs and triggered
machine checks). It's not something fundamentally necessary from a
common code point of view. It's more risky from a hardware (not
software) standpoint, and before you can get rid of the pmd you need
to do a TLB flush anyway to be sure CPUs stop using it, so it's better
to clear the present bit before doing the really costly thing (the TLB
flush with IPIs). Clearing the present bit during the TLB flush is a
cost that gets lost in the noise.

The clearing of the real present bit during the pmd (virtual) split is
done with pmdp_invalidate, which was created specifically to keep
pmd_trans_huge=true and pmd_present=true despite the present bit not
being set. So you could imagine _PAGE_PSE as the real present bit.
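
For reference, the x86 check looks roughly like this (simplified from
arch/x86/include/asm/pgtable.h; the comment is mine):

        static inline int pmd_present(pmd_t pmd)
        {
                /*
                 * _PAGE_PSE acts as the "real" present bit for a
                 * transhuge pmd: pmdp_invalidate() clears
                 * _PAGE_PRESENT to flush the huge TLB, but the pmd
                 * still points to RAM, so pmd_present() must keep
                 * returning true (same trick as _PAGE_PROTNONE).
                 */
                return pmd_flags(pmd) &
                       (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
        }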

Before the physical split was deferred and decoupled from the virtual
memory pmd split, pmd_trans_splitting allowed waiting for the split to
finish and keeping all gup_fast at bay during it (while the page was
still mapped readable and writable in userland by other CPUs). Now the
physical split is deferred, so you just split the pmd locally, and only
a physical split invoked on the page (not the virtual split invoked on
the pmd with split_huge_pmd) has to keep gup at bay. It does so by
freezing the refcount, so all gup_fast calls fail in
page_cache_get_speculative during the freeze. This removed the need
for the pmd_splitting flag in gup_fast (when pmd_splitting was set,
gup fast had to go through the non-fast gup), but it means that now a
hugepage cannot be physically split if it's gup-pinned. The main
difference is that freezing the refcount can fail, so the code must
learn to cope with such a failure and defer the split. Decoupling the
physical and virtual splits introduced the need to track the doublemap
case with a new PG_double_map flag too. It makes the refcounting of
hugepages trivial in comparison (identical to hugetlbfs in fact), but
it requires total_mapcount to account for all those huge and non-huge
mappings. It primarily pays off for adding THP to tmpfs, where the
physical split may have to be deferred for pagecache reasons anyway.
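
A stripped-down sketch of that freeze step (the real code in
__split_huge_page() also has to account for the expected extra pins
from the mapping; the helper name here is invented):

        /* sketch: the physical split keeps gup_fast at bay by
         * freezing the compound head's refcount; if there are
         * unexpected extra pins (e.g. gup) the freeze fails and the
         * split has to be deferred/retried */
        static bool freeze_for_split(struct page *head, int extra_pins)
        {
                /* gup_fast's page_cache_get_speculative() fails on a
                 * frozen (zero) refcount, so no new pins can appear
                 * after this */
                return page_ref_freeze(head, 1 + extra_pins);
        }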

Thanks,
Andrea


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-16 Thread Andrea Arcangeli
Hello,

On Tue, Oct 16, 2018 at 03:37:15PM -0700, Andrew Morton wrote:
> we'll still make it into 4.19.1.  Am reluctant to merge this while
> discussion, testing and possibly more development are ongoing.

I think there can definitely be more development, primarily to make
the compaction deferred logic NUMA aware. Instead of a global deferred
logic, we should split it per zone per node so that it backs off
exponentially with a higher cap for remote nodes. The current global
"backoff" limit will still apply to the "local" zone compaction. Who
would like to work on that?
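
Purely as a sketch of the direction (the constant names and the helper
below are invented, this is not a patch): each zone keeps its own
defer counters, and the exponential backoff cap is higher when the
zone belongs to a remote node:

        /* sketch: per-zone deferred compaction with a higher backoff
         * cap for zones on remote nodes; names and thresholds are
         * invented here */
        #define COMPACT_MAX_DEFER_SHIFT_LOCAL   6
        #define COMPACT_MAX_DEFER_SHIFT_REMOTE  10

        static bool compaction_deferred_numa(struct zone *zone,
                                             int order, int alloc_nid)
        {
                unsigned int max_shift =
                        zone_to_nid(zone) == alloc_nid ?
                        COMPACT_MAX_DEFER_SHIFT_LOCAL :
                        COMPACT_MAX_DEFER_SHIFT_REMOTE;
                unsigned long limit =
                        1UL << min(zone->compact_defer_shift, max_shift);

                if (order < zone->compact_order_failed)
                        return false;
                /* defer (skip) the next 2^compact_defer_shift attempts */
                return ++zone->compact_considered < limit;
        }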

However I don't think it's worth waiting for that, because it's not a
trivial change.

Certainly we can't ship upstream code in production with this bug, so
if it doesn't get fixed upstream we'll fix it downstream first, until
the further developments are production ready. This was a severe
regression compared to previous kernels that made important workloads
unusable, and it started when __GFP_THISNODE was added to THP
allocations under MADV_HUGEPAGE. It is not a significant risk to go
back to the previous behavior before __GFP_THISNODE was added; it
worked like that for years.

This was simply an optimization for some lucky workloads that can fit
in a single node, but it ended up breaking the VM for others that
can't possibly fit in a single node, so going back is safe.

Thanks,
Andrea


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-15 Thread Andrea Arcangeli
Hello Andrew,

On Mon, Oct 15, 2018 at 03:44:59PM -0700, Andrew Morton wrote:
> On Mon, 15 Oct 2018 15:30:17 -0700 (PDT) David Rientjes  
> wrote:
> >  Would it be possible to test with my 
> > patch[*] that does not try reclaim to address the thrashing issue?
> 
> Yes please.

It'd also be great if a testcase reproducing the 40% higher access
latency (with the original one-liner fix) was available.

We don't have a testcase for David's 40% latency increase problem, but
that's likely to only happen when the system is somewhat low on memory
globally. So the measurement must be done when compaction starts
failing globally on all zones, but before the system starts
swapping. The more global fragmentation there is, the larger the
window between "compaction fails because all zones are too fragmented"
and "there is still free PAGE_SIZEd memory available to reclaim
without swapping it out". If I understood correctly, that is precisely
the window where the 40% higher latency should materialize.

The workload that shows the badness in the upstream code is fairly
trivial. Mel and Zi reproduced it too, and I have two testcases that
can reproduce it, one with device assignment and the other just
memhog. That window is massively larger than the one where the 40%
higher latency materializes.
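
The memhog side of it is as simple as it sounds, roughly like this
(the 64GiB size is only a placeholder; the point is to fault in more
anonymous memory than fits in the local node):

        /* trivial memhog-style skeleton: map a large anonymous
         * region, ask for THP with MADV_HUGEPAGE and touch every
         * page */
        #include <string.h>
        #include <sys/mman.h>

        int main(void)
        {
                size_t size = 64UL << 30;       /* placeholder size */
                char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (p == MAP_FAILED)
                        return 1;
                madvise(p, size, MADV_HUGEPAGE);
                memset(p, 1, size);             /* fault it all in */
                return 0;
        }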

When there's 75% or more of the RAM free (not even allocated as easily
reclaimable pagecache) globally, you don't expect to hit heavy
swapping.

The 40% THP allocation latency increase, if you use MADV_HUGEPAGE in
such a window where all remote zones are fully fragmented, is somewhat
less of a concern in my view (plus there's the compaction deferred
logic that should mitigate that scenario). Furthermore it is only a
concern for page faults in MADV_HUGEPAGE ranges. If MADV_HUGEPAGE is
set the userland allocation is long lived, so such a higher allocation
latency won't risk hitting short lived allocations that don't set
MADV_HUGEPAGE (unless madvise=always, but that's not the default
precisely because not all allocations are long lived).

If the MADV_HUGEPAGE-using library were freely available, that would
also be nice.


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-15 Thread Andrea Arcangeli
On Mon, Oct 15, 2018 at 03:30:17PM -0700, David Rientjes wrote:
> At the risk of beating a dead horse that has already been beaten, what are 
> the plans for this patch when the merge window opens?  It would be rather 
> unfortunate for us to start incurring a 14% increase in access latency and 
> 40% increase in fault latency.  Would it be possible to test with my 
> patch[*] that does not try reclaim to address the thrashing issue?  If 
> that is satisfactory, I don't have a strong preference if it is done with 
> a hardcoded pageblock_order and __GFP_NORETRY check or a new 
> __GFP_COMPACT_ONLY flag.

I don't like the pageblock size hardcoding inside the page
allocator. __GFP_COMPACT_ONLY is fully equivalent at runtime, but it
at least lets the caller choose the behavior, so it looks more
flexible.

To fix your 40% fault latency concern in the short term we could use
__GFP_COMPACT_ONLY, but I think we should get rid of
__GFP_COMPACT_ONLY later: we need more statistical data in the zone
structure to track remote compaction failures that trigger because the
zone is fully fragmented.

Once the zone is fully fragmented we need to do a special exponential
backoff on that zone when the zone is from a remote node.

Furthermore at the first remote NUMA node zone failure caused by full
fragmentation we need to interrupt compaction and stop trying with all
remote nodes.

As long as compaction returns COMPACT_SKIPPED it's ok to keep doing
reclaim and keep doing compaction, as long as compaction succeeds.

What is causing the higher latency is the fact that we try all zones
of all remote nodes even when there's a failure caused by full
fragmentation of all remote zones, and we're unable to skip (with
exponential backoff) only those zones where compaction cannot succeed
because of fragmentation.

Once we achieve the above deleting __GFP_COMPACT_ONLY will be a
trivial patch.

> I think the second issue of faulting remote thp by removing __GFP_THISNODE 
> needs supporting evidence that shows some platforms benefit from this (and 
> not with numa=fake on the command line :).
> 
>  [*] https://marc.info/?l=linux-kernel=153903127717471

That is needed to compare the current one-liner fix with
__GFP_COMPACT_ONLY, but I don't think it's needed to compare v4.18
with the current fix. The badness of v4.18 was too bad to keep;
getting local PAGE_SIZEd memory or remote THPs is a secondary issue.

In fact the main reason for __GFP_COMPACT_ONLY is no longer that
tradeoff, but rather not spending too much CPU in compaction when all
nodes are fragmented, to avoid increasing the allocation latency too
much.

Thanks,
Andrea


Re: [PATCH] mm/thp: fix call to mmu_notifier in set_pmd_migration_entry()

2018-10-12 Thread Andrea Arcangeli
On Fri, Oct 12, 2018 at 01:35:19PM -0400, Jerome Glisse wrote:
> On Fri, Oct 12, 2018 at 01:24:22PM -0400, Andrea Arcangeli wrote:
> > Hello,
> > 
> > On Fri, Oct 12, 2018 at 12:20:54PM -0400, Zi Yan wrote:
> > > On 12 Oct 2018, at 12:09, jgli...@redhat.com wrote:
> > > 
> > > > From: Jérôme Glisse 
> > > >
> > > > Inside set_pmd_migration_entry() we are holding page table locks and
> > > > thus we can not sleep so we can not call invalidate_range_start/end()
> > > >
> > > > So remove call to mmu_notifier_invalidate_range_start/end() and add
> > > > call to mmu_notifier_invalidate_range(). Note that we are already
> > 
> > Why the call to mmu_notifier_invalidate_range if we're under
> > range_start and followed by range_end? (it's not _range_only_end, if
> > it was _range_only_end the above would be needed)
> 
> I wanted to be extra safe and accept to over invalidate. You are right
> that it is not strictly necessary. I am fine with removing it.

If it's superfluous, I'd generally prefer strict code unless there's a
very explicit comment about it that says it's actually superfluous.
Otherwise after a while we don't know why it was added there.

> We can remove it. Should i post a v2 without it ?

That's fine with me yes.

Thanks,
Andrea


Re: [PATCH] mm/thp: fix call to mmu_notifier in set_pmd_migration_entry()

2018-10-12 Thread Andrea Arcangeli
Hello,

On Fri, Oct 12, 2018 at 12:20:54PM -0400, Zi Yan wrote:
> On 12 Oct 2018, at 12:09, jgli...@redhat.com wrote:
> 
> > From: Jérôme Glisse 
> >
> > Inside set_pmd_migration_entry() we are holding page table locks and
> > thus we can not sleep so we can not call invalidate_range_start/end()
> >
> > So remove call to mmu_notifier_invalidate_range_start/end() and add
> > call to mmu_notifier_invalidate_range(). Note that we are already

Why the call to mmu_notifier_invalidate_range if we're under
range_start and followed by range_end? (it's not _range_only_end, if
it was _range_only_end the above would be needed)

> > calling mmu_notifier_invalidate_range_start/end() inside the function
> > calling set_pmd_migration_entry() (see try_to_unmap_one()).
> >
> > Signed-off-by: Jérôme Glisse 
> > Reported-by: Andrea Arcangeli 
> > Cc: Andrew Morton 
> > Cc: Greg Kroah-Hartman 
> > Cc: Zi Yan 
> > Cc: Kirill A. Shutemov 
> > Cc: "H. Peter Anvin" 
> > Cc: Anshuman Khandual 
> > Cc: Dave Hansen 
> > Cc: David Nellans 
> > Cc: Ingo Molnar 
> > Cc: Mel Gorman 
> > Cc: Minchan Kim 
> > Cc: Naoya Horiguchi 
> > Cc: Thomas Gleixner 
> > Cc: Vlastimil Babka 
> > Cc: Michal Hocko 
> > Cc: Andrea Arcangeli 
> > ---
> >  mm/huge_memory.c | 7 +--
> >  1 file changed, 1 insertion(+), 6 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 533f9b00147d..93cb80fe12cb 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2885,9 +2885,6 @@ void set_pmd_migration_entry(struct 
> > page_vma_mapped_walk *pvmw,
> > if (!(pvmw->pmd && !pvmw->pte))
> > return;
> >
> > -   mmu_notifier_invalidate_range_start(mm, address,
> > -   address + HPAGE_PMD_SIZE);
> > -
> > flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
> > pmdval = *pvmw->pmd;
> > pmdp_invalidate(vma, address, pvmw->pmd);
> > @@ -2898,11 +2895,9 @@ void set_pmd_migration_entry(struct 
> > page_vma_mapped_walk *pvmw,
> > if (pmd_soft_dirty(pmdval))
> > pmdswp = pmd_swp_mksoft_dirty(pmdswp);
> > set_pmd_at(mm, address, pvmw->pmd, pmdswp);
> > +   mmu_notifier_invalidate_range(mm, address, address + HPAGE_PMD_SIZE);

It's not obvious why it's needed, if it's needed maybe a comment can
be added.

> > page_remove_rmap(page, true);
> > put_page(page);
> > -
> > -   mmu_notifier_invalidate_range_end(mm, address,
> > -   address + HPAGE_PMD_SIZE);
> >  }
> >
> >  void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page 
> > *new)
> > -- 
> > 2.17.2
> 
> Yes, these are the redundant calls to 
> mmu_notifier_invalidate_range_start/end()
> in set_pmd_migration_entry(). Thanks for the patch.

They're not just redundant: the call happens in a path that cannot
block, yet it is __mmu_notifier_invalidate_range_start(blockable=true).

Furthermore mmu notifier API doesn't support nesting.

KVM is actually robust against the nesting:

kvm->mmu_notifier_count++;

kvm->mmu_notifier_count--;

and KVM is always fine with non blockable calls, but that's not
universally true for all mmu notifier users.

Thanks,
Andrea


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-09 Thread Andrea Arcangeli
On Tue, Oct 09, 2018 at 04:25:10PM +0200, Michal Hocko wrote:
> On Tue 09-10-18 14:00:34, Mel Gorman wrote:
> > On Tue, Oct 09, 2018 at 02:27:45PM +0200, Michal Hocko wrote:
> > > [Sorry for being slow in responding but I was mostly offline last few
> > >  days]
> > > 
> > > On Tue 09-10-18 10:48:25, Mel Gorman wrote:
> > > [...]
> > > > This goes back to my point that the MADV_HUGEPAGE hint should not make
> > > > promises about locality and that introducing MADV_LOCAL for specialised
> > > > libraries may be more appropriate with the initial semantic being how it
> > > > treats MADV_HUGEPAGE regions.
> > > 
> > > I agree with your other points and not going to repeat them. I am not
> > > sure madvise is the best API for the purpose though. We are talking about
> > > memory policy here and there is an existing api for that so I would
> > > _prefer_ to reuse it for this purpose.
> > > 
> > 
> > I flip-flopped on that one in my head multiple times on the basis of
> > how strict it should be. Memory policies tend to be black or white --
> > bind here, interleave there, etc. It wasn't clear to me what the best
> > policy would be to describe "allocate local as best as you can but allow
> > fallbacks if necessary".

MPOL_PREFERRED is not black and white. In fact I asked David earlier
to check if MPOL_PREFERRED would already be a good fit for this.
Still, the point is that it requires privilege (and for a good
reason).

> I was thinking about MPOL_NODE_PROXIMITY with the following semantic:
> - try hard to allocate from a local or very close numa node(s) even when
> that requires expensive operations like the memory reclaim/compaction
> before falling back to other more distant numa nodes.

If MPOL_PREFERRED can't work something like this could be added.

I think "madvise vs mbind" is more an issue of "no permission vs
permission required". And if the process ends up swapping out all
other processes with their memory already allocated in the node, I
think it's correct to require some permission, in which case an mbind
looks like a better fit. MPOL_PREFERRED also looks like a first
candidate for investigation, as it's already not black and white,
allows spillover, and may in fact already do the right thing if set on
top of MADV_HUGEPAGE.
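
From the library's point of view the combination would look roughly
like this (a sketch; the node number is just an example, and
MPOL_PREFERRED allows spillover to other nodes if the preferred one
can't satisfy the allocation):

        /* sketch: combine MADV_HUGEPAGE with an MPOL_PREFERRED
         * binding on the same already-mmap'd range; needs the libnuma
         * headers for mbind() */
        #include <numaif.h>
        #include <sys/mman.h>

        static int prefer_node_hugepage(void *addr, size_t len, int node)
        {
                unsigned long nodemask = 1UL << node;

                if (madvise(addr, len, MADV_HUGEPAGE))
                        return -1;
                return mbind(addr, len, MPOL_PREFERRED, &nodemask,
                             sizeof(nodemask) * 8, 0);
        }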

Thanks,
Andrea


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-09 Thread Andrea Arcangeli
On Tue, Oct 09, 2018 at 03:17:30PM -0700, David Rientjes wrote:
> causes workloads to severely regress both in fault and access latency when 
> we know that direct reclaim is unlikely to make direct compaction free an 
> entire pageblock.  It's more likely than not that the reclaim was 
> pointless and the allocation will still fail.

How do you know that? If all RAM is full of filesystem cache, but it's
not heavily fragmented by slab or other unmovable objects, compaction
will succeed every single time after reclaim frees 2M of cache like
it's asked to do.

reclaim succeeds every time, compaction then succeeds every time.

Not doing reclaim after COMPACT_SKIPPED is returned simply makes
compaction unable to compact memory once all nodes are filled by
filesystem cache.

Certainly it's better not to invoke reclaim at all if __GFP_THISNODE
is set than to swap out heavily over the local node. Doing so however
has the drawback of reducing direct compaction effectiveness. I don't
think it's true that reclaim is generally "pointless"; it's just that
invoking any reclaim backfired so badly when __GFP_THISNODE was set
that anything else (including weakening compaction effectiveness) was
better.

> If memory compaction were patched such that it can report that it could 
> successfully free a page of the specified order if there were free pages 
> at the end of the zone it could migrate to, reclaim might be helpful.  But 
> with the current implementation, I don't think that is reliably possible.  
> These free pages could easily be skipped over by the migration scanner 
> because of the presence of slab pages, for example, and unavailable to the 
> freeing scanner.

Yes, there's one case where reclaim is "pointless", but it happens
once, then COMPACT_DEFERRED is returned and __GFP_NORETRY will skip
further reclaim.

So you're right: when we hit full fragmentation there's one and only
one "pointless" reclaim invocation. And immediately after, we also
exponentially back off on the compaction invocations with the
compaction deferred logic.

We could certainly try to optimize away such a "pointless" reclaim
event, but it's probably an optimization that would just get lost in
the noise and may not be measurable, because it only happens once,
when full fragmentation is first encountered.

> I really think that for this patch to be merged over my proposed change 
> that it needs to be clearly demonstrated that reclaim was successful in 
> that it freed memory that was subsequently available to the compaction 
> freeing scanner and that enabled entire pageblocks to become free.  That, 
> in my experience, will very very seldom be successful because of internal 
> slab fragmentation, compaction_alloc() cannot soak up pages from the 
> reclaimed memory, and potentially thrash the zone completely pointlessly.  
> The last point is the problem being reported here, but the other two are 
> as legitimate.

I think the demonstration can already be inferred, because if we hit
full memory fragmentation after every reclaim, __GFP_NORETRY would
have solved the "pathological THP allocation behavior" without
requiring your change to __GFP_NORETRY that makes it behave like
__GFP_COMPACT_ONLY for order == HPAGE_PMD_ORDER.

Anyway, you can add a few statistic counters and verify more
accurately how often a COMPACT_SKIPPED + reclaim cycle is followed
by a COMPACT_DEFERRED.
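
For instance something like the sketch below, called from the
allocator slowpath; the two event names are invented and would have to
be added to enum vm_event_item (and vmstat_text) for this to actually
build:

        /* sketch: count how often a COMPACT_SKIPPED + reclaim cycle
         * is immediately followed by COMPACT_DEFERRED; the two events
         * are hypothetical and not in the tree */
        #include <linux/compaction.h>
        #include <linux/vmstat.h>

        static void thp_compact_stats(enum compact_result first,
                                      enum compact_result after_reclaim)
        {
                if (first != COMPACT_SKIPPED)
                        return;
                count_vm_event(THP_RECLAIM_AFTER_COMPACT_SKIPPED);
                if (after_reclaim == COMPACT_DEFERRED)
                        count_vm_event(THP_DEFERRED_AFTER_COMPACT_SKIPPED);
        }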

> I'd appreciate if Andrea can test this patch, have a rebuttal that we 
> should still remove __GFP_THISNODE because we don't care about locality as 
> much as forming a hugepage, we can make that change, and then merge this 
> instead of causing such massive fault and access latencies.

I can certainly test, but from source review I'm already convinced
it'll solve the "pathological THP allocation behavior" fine, no
argument about that. Your patch is certainly better and more correct
than the current upstream (no security issues with lack of permissions
for __GFP_THISNODE anymore either).

I expect your patch will run 100% equivalent to __GFP_COMPACT_ONLY
alternative I posted, for our testcase that hit into the "pathological
THP allocation behavior".

Your patch encodes __GFP_COMPACT_ONLY into the __GFP_NORETRY semantics
and hardcodes the __GFP_COMPACT_ONLY behavior for all orders =
HPAGE_PMD_SIZE no matter who the caller is.

By contrast, I let the caller choose, and left the __GFP_NORETRY
semantics alone and orthogonal to the __GFP_COMPACT_ONLY semantics. I
think letting the caller decide instead of hardcoding it for order 9
is better, because __GFP_COMPACT_ONLY only made sense to be set if
__GFP_THISNODE was also set by the caller.

If a driver does an order 9 allocation with __GFP_THISNODE not set,
your patch will prevent it from allocating remote THPs if all remote
nodes are full of cache (which is a reasonably common situation, as
more THPs get allocated over time eating into all the free memory). My
patch didn't alter that, so I tend to prefer __GFP_COMPACT_ONLY over
the hardcoding 


Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-09 Thread Andrea Arcangeli
On Mon, Oct 08, 2018 at 01:41:09PM -0700, David Rientjes wrote:
> The page allocator is expecting __GFP_NORETRY for thp allocations per its 
> comment:
> 
>   /*
>* Checks for costly allocations with __GFP_NORETRY, which
>* includes THP page fault allocations
>*/
>   if (costly_order && (gfp_mask & __GFP_NORETRY)) {
> 
> And that enables us to check compact_result to determine whether thp 
> allocation should fail or continue to reclaim.  I don't think it helps 
> that some thp fault allocations use __GFP_NORETRY and others do not.  I 
> think that deserves a fix to alloc_hugepage_direct_gfpmask() or 
> GFP_TRANSHUGE_LIGHT.

With your patch that changes how __GFP_NORETRY works for order ==
pageblock_order, this can help. However it solves nothing for large
skbs and all other "costly_order" allocations. So if you think it's so
bad to have a 40% increased allocation latency if all nodes are heavily
fragmented for THP under MADV_HUGEPAGE (which are guaranteed to be long
lived allocations or they wouldn't set MADV_HUGEPAGE in the first
place), I'm not sure why you ignore the same exact overhead for all
other "costly_order" allocations that would never set __GFP_THISNODE
and will not even use GFP_TRANSHUGE_LIGHT, so they may not set
__GFP_NORETRY at all.

I think it would be better to view the problem from a generic
"costly_order" allocation perspective without so much "hardcoded" focus
on THP MADV_HUGEPAGE only (which in fact is the least concern of all in
terms of that 40% latency increase if all nodes are totally fragmented
and compaction fails on all nodes, because those allocations are long
lived, so the impact of the increased latency is more likely to be lost
in the noise).

> Our library that uses MADV_HUGEPAGE only invokes direct compaction because 
> we're on an older kernel: it does not attempt to do reclaim to make 
> compaction happy so that it finds memory that it can migrate memory to.  
> For reference, we use defrag setting of "defer+madvise".  Everybody who 
> does not use MADV_HUGEPAGE kicks off kswapd/kcompactd and fails, 
> MADV_HUGEPAGE users do the same but also try direct compaction.  That 
> direct compaction uses a mode of MIGRATE_ASYNC so it normally fails 
> because of need_resched() or spinlock contention.
> These allocations always fail, MADV_HUGEPAGE or otherwise, without 
> invoking direct reclaim.

Even older kernels (before the "defer+madvise" option was available)
always invoked reclaim if COMPACT_SKIPPED was returned. The
COMPACT_SKIPPED behavior I referred to, which explains why
__GFP_NORETRY doesn't prevent reclaim from being invoked, is nothing
new; it has worked that way from the day compaction was introduced.

So it's fairly strange that your kernel doesn't call reclaim at all if
COMPACT_SKIPPED is returned.

> I am agreeing with both you and Mel that it makes no sense to thrash the 
> local node to make compaction happy and then hugepage-order memory 
> available.  I'm only differing with you on the mechanism to fail early: we 
> never want to do attempt reclaim on thp allocations specifically because 
> it leads to the behavior you are addressing.

Not sure I can agree with the above.

If all nodes (not only the local one) are already below the watermarks
and compaction returns COMPACT_SKIPPED, there is zero global free
memory available, and we would need to swap anyway to succeed the 2M
allocation. So it's better to reclaim 2M from one node and then retry
compaction on the same node, if compaction is failing on that node
because of COMPACT_SKIPPED and the global free memory is zero. If it
swaps out, it would have swapped out anyway, but this way THP will be
returned.

It's just __GFP_THISNODE that broke the above logic. __GFP_THISNODE
may be safe to use only if the global amount of free memory (in all
nodes) is zero (or below HPAGE_PMD_SIZE generally speaking).
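
To spell out that decision, a rough sketch with hypothetical helpers
(global_free_pages(), reclaim_local_2m() and compact_local_node() are
placeholders for illustration, not kernel APIs):

#include <stdbool.h>
#include <stdio.h>

enum compact_result { COMPACT_SKIPPED, COMPACT_DEFERRED, COMPACT_SUCCESS };

#define HPAGE_PMD_NR 512	/* 2M worth of 4k pages */

/* hypothetical placeholders standing in for allocator internals */
static unsigned long global_free_pages(void) { return 0; }
static enum compact_result compact_local_node(void) { return COMPACT_SKIPPED; }
static bool reclaim_local_2m(void) { return true; }

static bool try_thp_on_local_node(void)
{
	int tries;

	for (tries = 0; tries < 2; tries++) {
		if (compact_local_node() == COMPACT_SUCCESS)
			return true;
		/*
		 * COMPACT_SKIPPED with ~zero free memory anywhere: the 2M
		 * would require swapping no matter which node we picked, so
		 * reclaiming locally and retrying compaction is not wasted
		 * work, it's the only way to get a THP at all.
		 */
		if (global_free_pages() >= HPAGE_PMD_NR)
			return false;	/* remote nodes have memory: fall back there */
		if (!reclaim_local_2m())
			return false;
	}
	return false;
}

int main(void)
{
	printf("local THP %s\n",
	       try_thp_on_local_node() ? "allocated" : "not allocated");
	return 0;
}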

> My contention is that removing __GFP_THISNODE papers over the problem, 
> especially in cases where remote memory is also fragmnented. It incurs a 
> much higher (40%) fault latency and then incurs 13.9% greater access 
> latency.  It is not a good result, at least for Haswell, Naples, and Rome.
> 
> To make a case that we should fault hugepages remotely as fallback, either 
> for non-MADV_HUGEPAGE users who do not use direct compaction, or 
> MADV_HUGEPAGE users who use direct compaction, we need numbers that 
> suggest there is a performance benefit in terms of access latency to 
> suggest that it is better; this is especially the case when the fault 
> latency is 40% higher.  On Haswell, Naples, and Rome, it is quite obvious 
> that this patch works much harder to fault memory remotely that incurs a 
> substantial performance penalty when it fails and in the cases where it 
> succeeds we have much worse access latency.

What we're fixing is effectively a breakage and insecure behavior
too and that must be fixed first. A heavily multithreaded processes

Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-05 Thread Andrea Arcangeli
Hello,

On Thu, Oct 04, 2018 at 04:05:26PM -0700, David Rientjes wrote:
> The source of the problem needs to be addressed: memory compaction.  We 
> regress because we lose __GFP_NORETRY and pointlessly try reclaim, but 

I commented in detail about the __GFP_NORETRY topic in the other email
so I will skip the discussion about __GFP_NORETRY in the context of
this answer except for the comment at the end of the email to the
actual code that implements __GFP_NORETRY.

> But that's a memory compaction issue, not a thp gfp mask issue; the 
> reclaim issue is responded to below.

Actually memory compaction has no issues whatsoever with
__GFP_THISNODE regardless of __GFP_NORETRY.

> This patch causes an even worse regression if all system memory is 
> fragmented such that thp cannot be allocated because it tries to stress 
> compaction on remote nodes, which ends up unsuccessfully, not just the 
> local node.
> 
> On Haswell, when all memory is fragmented (not just the local node as I 
> obtained by 13.9% regression result), the patch results in a fault latency 
> regression of 40.9% for MADV_HUGEPAGE region of 8GB.  This is because it 
> is thrashing both nodes pointlessly instead of just failing for 
> __GFP_THISNODE.

There's no I/O involved, at the very least on compaction, nor do we
drop any cache or shrink any slab by mistake just by invoking
compaction. Even when you hit the worst case "all nodes are 100%
fragmented" scenario that generates the 40% increased allocation
latency, all other tasks running in the local node will keep running
fine, and they won't be pushed away forcefully into swap with all their
kernel cache depleted, which is a mlock/mbind privileged behavior that
the app using the MADV_HUGEPAGE lib should never have been able to
inflict on other processes running in the node from different users
(users as in uid).

Furthermore, when you incur the worst case latency, after that the
compact deferred logic skips compaction the next time around if all
nodes were so fragmented as to guarantee failure. Whereas there's
nothing stopping reclaim from running every time COMPACT_SKIPPED is
returned, just because compaction keeps succeeding as reclaim keeps
pushing more 2M amounts into swap from the local nodes.

I don't doubt that with 1024 nodes things can get pretty bad when
they're all 100% fragmented, and __GFP_THISNODE would win in such a
case, but then what you're asking for is the __GFP_COMPACT_ONLY
behavior. That will solve it.

What we'd probably need regardless of how we solve this bug (because
not all compaction invocations are THP invocations... and we can't keep
making special cases and optimizations tailored for THP or we end up
with that same 40% higher latency for large skbs and other stuff) is a
more sophisticated COMPACT_DEFERRED logic where you can track when
remote compaction failed. Then you wait many more times before trying a
global compaction. It could be achieved with just a compact_deferred
counter in the zone/pgdat (wherever it fits best).
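
Something along these lines, as a toy userspace model loosely inspired
by the kernel's existing per-zone deferral (a considered counter plus a
defer shift); the struct and function names here are illustrative only:

#include <stdbool.h>
#include <stdio.h>

#define MAX_DEFER_SHIFT 6

/* illustrative per-node state; a real version would live in the pgdat */
struct node_compact_state {
	unsigned int considered;	/* attempts skipped since last failure */
	unsigned int defer_shift;	/* backoff exponent, grows on failure */
};

/* called when compaction on this (possibly remote) node failed badly */
static void defer_node_compaction(struct node_compact_state *ns)
{
	ns->considered = 0;
	if (ns->defer_shift < MAX_DEFER_SHIFT)
		ns->defer_shift++;
}

/* true if compaction on this node should be skipped for now */
static bool node_compaction_deferred(struct node_compact_state *ns)
{
	return ++ns->considered < (1U << ns->defer_shift);
}

int main(void)
{
	struct node_compact_state remote = { 0, 0 };
	int i;

	defer_node_compaction(&remote);
	defer_node_compaction(&remote);	/* two failures: wait 4 attempts */
	for (i = 0; i < 5; i++)
		printf("attempt %d: %s\n", i,
		       node_compaction_deferred(&remote) ? "skip" : "compact");
	return 0;
}

A global compaction would then only be retried once every node's
deferral window expired, instead of hammering all remote nodes on every
THP fault.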

Overall I don't think the bug we're dealing with and the slowdown of
compaction on the remote nodes are comparable, also considering the
latter will still happen anyway if you have large skbs or other drivers
allocating large amounts of memory as an optimization.

> So the end result is that the patch regresses access latency forever by 
> 13.9% when the local node is fragmented because it is accessing remote thp 
> vs local pages of the native page size, and regresses fault latency of 
> 40.9% when the system is fully fragmented.  The only time that fault 
> latency is improved is when remote memory is not fully fragmented, but 
> then you must incur the remote access latency.

You get THP however, which will reduce the TLB miss cost and maximize
TLB usage, so it depends on the app whether that 13.9% cost is actually
offset by the THP benefit or not.

It entirely depends on whether a large part of the workload mostly fits
in the in-socket CPU cache. The more the in-socket/node CPU cache pays
off, the more remote THP also pays off. There would definitely be
workloads that would run faster, not slower, with remote THP instead of
local PAGE_SIZEd memory. The benefit of THP is also larger for guest
loads than for host loads, so it depends on that too.

We agree about the latency issue with a ton of RAM and thousands of
nodes, but again that can be mitigated with a NUMA-aware
COMPACT_DEFERRED logic. Even without such a NUMA-aware compact_deferred
logic improvement, the worst case of the remote compaction behavior
still doesn't look nearly as bad as this bug when you think about it.
And it is only a concern for extremely large NUMA systems (which may
run the risk of running into other scalability issues elsewhere if
random workloads are applied to them and all nodes are low on memory
and fully fragmented, which is far from a common scenario on those
large systems), while the bug we fixed was badly hurting all very
common 2-node installs with workloads that

Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-05 Thread Andrea Arcangeli
Hi,

On Fri, Oct 05, 2018 at 01:35:15PM -0700, David Rientjes wrote:
> Why is it ever appropriate to do heavy reclaim and swap activity to 
> allocate a transparent hugepage?  This is exactly what the __GFP_NORETRY 
> check for high-order allocations is attempting to avoid, and it explicitly 
> states that it is for thp faults.  The fact that we lost __GFP_NORERY for 
> thp allocations for all settings, including the default setting, other 
> than yours (setting of "always") is what I'm focusing on.  There is no 
> guarantee that this activity will free an entire pageblock or that it is 
> even worthwhile.

I tried to add just __GFP_NORETRY but it changes nothing. Try it
yourself if you think that can resolve the swap storm and excessive
reclaim CPU overhead... and see if it works. I didn't intend to
reinvent the wheel with __GFP_COMPACT_ONLY, if __GFP_NORETRY would
have worked. I tried adding __GFP_NORETRY first of course.

The reason why it doesn't help is: compaction fails because there's not
enough free RAM, reclaim is invoked, compaction succeeds, THP is
allocated to your lib user, compaction fails because there's not enough
free RAM, reclaim is invoked, etc. The compact_result is not
COMPACT_DEFERRED, but COMPACT_SKIPPED.

See, the "reclaim is invoked" part (with __GFP_THISNODE) is enough to
still create the same heavy swap storm and unfairly penalize all apps
with memory allocated in the local node, as if your library actually
had the kernel privilege to run mbind or mlock, which is not ok.

Only __GFP_COMPACT_ONLY can truly avoid reclaim; the moment reclaim can
run with __GFP_THISNODE set, all bets are off and we're back to square
one, with no difference (at best a marginal difference) from
__GFP_NORETRY being set.
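
For clarity, this is the distinction in a much simplified, paraphrased
form (not the literal upstream code): the __GFP_NORETRY early exit in
the slow path keys off COMPACT_DEFERRED, so when every cycle ends in
COMPACT_SKIPPED, reclaim still runs each time:

#include <stdbool.h>
#include <stdio.h>

enum compact_result { COMPACT_SKIPPED, COMPACT_DEFERRED, COMPACT_SUCCESS };

/* one iteration of the costly-order slow path, heavily simplified */
static bool reclaim_runs(enum compact_result cr, bool noretry)
{
	if (noretry && cr == COMPACT_DEFERRED)
		return false;	/* bail out early, no reclaim */
	/* COMPACT_SKIPPED falls through: direct reclaim, then retry compaction */
	return true;
}

int main(void)
{
	printf("SKIPPED  -> reclaim runs: %d\n",
	       reclaim_runs(COMPACT_SKIPPED, true));
	printf("DEFERRED -> reclaim runs: %d\n",
	       reclaim_runs(COMPACT_DEFERRED, true));
	return 0;
}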

> That aside, removing __GFP_THISNODE can make the fault latency much worse 
> if remote notes are fragmented and/or reclaim has the inability to free 
> contiguous memory, which it likely cannot.  This is where I measured over 
> 40% fault latency regression from Linus's tree with this patch on a 
> fragmnented system where order-9 memory is neither available from node 0 
> or node 1 on Haswell.

Discussing the drawbacks of removing __GFP_THISNODE is an orthogonal
topic. The __GFP_COMPACT_ONLY approach didn't have any of those
drawbacks about the remote latency because __GFP_THISNODE was still set
at all times, just as you like it. You seem to think __GFP_NORETRY will
work as well as __GFP_COMPACT_ONLY, but it doesn't.

Calling compaction (and only compaction!) with __GFP_THISNODE set
doesn't break anything and that was what __GFP_COMPACT_ONLY was about.

> The behavior that MADV_HUGEPAGE specifies is certainly not clearly 
> defined, unfortunately.  The way that an application writer may read it, 
> as we have, is that it will make a stronger attempt at allocating a 
> hugepage at fault.  This actually works quite well when the allocation 
> correctly has __GFP_NORETRY, as it's supposed to, and compaction is 
> MIGRATE_ASYNC.

Like Mel said, your app just happens to fit in a local node. If the
user of the lib is slightly different and allocates 16G on a system
where each node is 4G, the post-fix MADV_HUGEPAGE will perform much
better for the lib user too.

And you know, if the lib user fits in one node, it can use mbind and it
won't hit OOM... and you'd need some capability giving the app
privilege anyway to keep MADV_HUGEPAGE as deep and as unfair to the
rest of the processes running in the local node (like mbind and mlock
require too).

Could you just run a test with the special lib and allocate 4 times
the size of a node, and see how the lib performs with upstream and
upstream+fix? Feel free to add __GFP_NORETRY anywhere you like in the
test of the upstream without fix.

The only constraint I would ask for the test (if the app using the lib
is not a massively multithreaded app, like qemu is, and you just intend
to run malloc(SIZEOFNODE*4); memset) is to run the app-lib under
"taskset -c 0". Otherwise NUMA balancing could move the CPU next to the
last memory touched, which couldn't be done if each thread accesses all
ram at random from all 4 nodes at the same time (which is a totally
legitimate workload too and must not hit the "pathological THP
allocation performance").
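
For reference, a minimal standalone test along those lines (the 16G
size is just an example for 4G nodes; adjust it to roughly 4x a node on
the machine under test and run the binary under "taskset -c 0" as
described above):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define SIZE (16UL << 30)	/* ~4x a 4G node; adjust to the machine */

int main(void)
{
	struct timespec t0, t1;
	char *p;

	p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	if (madvise(p, SIZE, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");	/* non-fatal */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	memset(p, 0xff, SIZE);	/* fault in the whole range */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("fault-in: %.2f s\n",
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	munmap(p, SIZE);
	return 0;
}

Comparing the fault-in time and the swap activity with upstream and
upstream+fix (and with __GFP_NORETRY added wherever you like) should
show the difference directly.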

> removed in a thp allocation.  I don't think anybody in this thread wants 
> 14% remote access latency regression if we allocate remotely or 40% fault 
> latency regression when remote nodes are fragmented as well.

Did you try the __GFP_COMPACT_ONLY patch? That won't have the 40%
fault latency already.

Also you're underestimating the benefit of THP given from remote nodes
for virt a bit: the 40% fault latency is not an issue when the
allocation is long lived, which is what MADV_HUGEPAGE is telling the
kernel, and the benefit of THP for the guest is multiplied. It's more a
feature than a bug to pay that 40% fault latency with MADV_HUGEPAGE
set, at least for all long lived allocations (but if the allocations
aren't long lived, why should MADV_HUGEPAGE

Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-10-04 Thread Andrea Arcangeli
Hello David,

On Thu, Oct 04, 2018 at 01:16:32PM -0700, David Rientjes wrote:
> There are ways to address this without introducing regressions for 
> existing users of MADV_HUGEPAGE: introduce an madvise() mode to accept 
> remote thp allocations, which users of this library would never set, or 
> fix memory compaction so that it does not incur substantial allocation 
> latency when it will likely fail.

These library users need to call a new MADV_ flag, and the current
MADV_HUGEPAGE should not be affected, because the new MADV_ will
require some capability (i.e. root privilege).

qemu was the first user of MADV_HUGEPAGE and I don't think it's fair to
break it and require changing it to run at higher privilege to retain
the direct compaction behavior of MADV_HUGEPAGE.

The new behavior you ask to retain in MADV_HUGEPAGE generated the same
misbehavior in the VM as mlock could have done too, so it can't just be
given by default without any privilege whatsoever.

Ok, you could mitigate the breakage that MADV_HUGEPAGE could have
generated (before the recent fix) by isolating malicious or inefficient
programs with memcg, but by default in a multiuser system without
cgroups the global disruption provided before the fix (i.e. the
pathological THP behavior) is not warranted. memcg shouldn't be
mandatory to prevent a process from affecting the VM in such a strong
way (i.e. all other processes that happened to be allocated in the node
where the THP allocation triggered being thrashed into swap, as if all
the memory of all other nodes wasn't completely free).

Not only that: it's not only about malicious processes, it's also
excessively inefficient for processes that just don't fit in a local
node and use MADV_HUGEPAGE. Your processes surely all fit in the local
node if they're happy about it. This was reported as a "pathological
THP regression" after all, in a workload that couldn't swap at all
because of the iommu gup persistent refcount pins.

The alternative patch I posted, which still invoked direct reclaim
locally and fell back to NUMA-local PAGE_SIZEd allocations instead of
falling back to NUMA-remote THP, could have been given without
privilege, but it still won't swap out as this patch would have done,
so it still won't go as far as the MADV_HUGEPAGE behavior went before
the fix (for the lib users that strongly needed local memory).

Overall I think the call about the default behavior of MADV_HUGEPAGE is
still between removing __GFP_THISNODE if gfp_flags can reclaim (the fix
in -mm), or changing direct compaction to only call compaction and not
reclaim (i.e. __GFP_COMPACT_ONLY) when __GFP_THISNODE is set.

To go beyond that some privilege is needed, and a new MADV_ flag can
require privilege or return an error if there's not enough privilege.
So the lib with hundreds of users can try to use that new flag first,
show an error in stderr (maybe under debug), and fall back to
MADV_HUGEPAGE if the app hasn't enough privilege. The alternative is to
add a new mem policy less strict than MPOL_BIND to achieve what you
need on top of MADV_HUGEPAGE (which would also require some privilege
of course, as all mbinds do). I assume you already evaluated the
preferred and local mbinds and they're not a perfect fit?
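
The fallback in the lib could then look roughly like this; note that
MADV_HUGEPAGE_FORCE_LOCAL is purely hypothetical (no such flag exists
today), only MADV_HUGEPAGE is real:

#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* hypothetical new privileged flag; NOT a real kernel constant */
#ifndef MADV_HUGEPAGE_FORCE_LOCAL
#define MADV_HUGEPAGE_FORCE_LOCAL 1000
#endif

static void lib_want_hugepages(void *addr, size_t len)
{
	/* try the (hypothetical) privileged, node-local variant first */
	if (madvise(addr, len, MADV_HUGEPAGE_FORCE_LOCAL) == 0)
		return;
	if (errno == EPERM || errno == EINVAL)
		fprintf(stderr, "force-local THP unavailable (errno %d), "
			"falling back to MADV_HUGEPAGE\n", errno);
	/* unprivileged fallback: the existing behavior */
	if (madvise(addr, len, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");
}

int main(void)
{
	size_t len = 2UL << 20;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	lib_want_hugepages(p, len);
	munmap(p, len);
	return 0;
}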

If we keep this as a new MADV_HUGEPAGE_FORCE_LOCAL flag, you could
still add a THP sysfs/sysctl control to lift the privilege requirement,
marking it as an insecure setting in the docs
(mm/transparent_hugepage/madv_hugepage_force_local=0|1, forced to 0 by
default). This would be along the same lines as other sysctls that
increase the max number of open files and such things (perhaps a sysctl
would in fact be better, for tuning in /etc/sysctl.conf).

Note there was still some improvement left possible in my
__GFP_COMPACT_ONLY patch alternative. Notably, if the watermarks for
the local node showed the local node not to have enough real "free"
PAGE_SIZEd pages to succeed the local THP allocation after compaction
failed, we should have relaxed __GFP_THISNODE and tried to allocate THP
from the NUMA-remote nodes before falling back to PAGE_SIZEd
allocations. That also won't require any new privilege.

To get the same behavior as before, though, you would need to drop all
caches with echo 3 >drop_caches before the app starts (and it still
won't swap anon memory, which the previous MADV_HUGEPAGE behavior would
have, but the whole point is that the previous MADV_HUGEPAGE behavior
would have backfired for the lib too if the app was allocating so much
RAM from the local node that it required swapouts of local anon
memory).

Thanks,
Andrea


Re: [PATCH] mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-09-12 Thread Andrea Arcangeli
Hello,

On Tue, Sep 11, 2018 at 01:56:13PM +0200, Michal Hocko wrote:
> Well, it seems that expectations differ for users. It seems that kvm
> users do not really agree with your interpretation.

Like David also mentioned here:

lkml.kernel.org/r/alpine.deb.2.21.1808211021110.258...@chino.kir.corp.google.com

it depends on the hardware what is a win, so there's no
one-size-fits-all.

For two sockets providing remote THP to KVM is likely a win, but
changing the defaults depending on boot-time NUMA topology makes
things less deterministic and it's also impossible to define an exact
break even point.

> I do realize that this is a gray zone because nobody bothered to define
> the semantic since the MADV_HUGEPAGE has been introduced (a826e422420b4
> is exceptionaly short of information). So we are left with more or less
> undefined behavior and define it properly now. As we can see this might
> regress in some workloads but I strongly suspect that an explicit
> binding sounds more logical approach than a thp specific mpol mode. If
> anything this should be a more generic memory policy basically saying
> that a zone/node reclaim mode should be enabled for the particular
> allocation.

MADV_HUGEPAGE means the allocation is long lived, so the cost of
compaction is worth it in direct reclaim. Not much else. That is not
the problem.

The problem is that even if you ignore the breakage and regression to
real life workloads, what is happening right now would obviously
require root privilege, but MADV_HUGEPAGE requires no root privilege.

Swapping heavily because of MADV_HUGEPAGE, when there are gigabytes
free on other nodes and not even 4k would be swapped out with THP
turned off in sysfs, simply cannot be what MADV_HUGEPAGE was ever
about, and it's a kernel regression that never existed until the commit
that added __GFP_THISNODE to the default THP heuristic in mempolicy.

I think we should defer the problem of what is better by default
between 4k NUMA-local and remote THP for later; I provided two options
myself because it didn't matter so much which option we picked in the
short term, as long as the bug was fixed.

I wasn't particularly happy about your patch because it still swaps
with certain defrag settings which is still allowing things that
shouldn't happen without some kind of privileged capability.

If you can update the patch to prevent swapping in all cases, it's a go
as far as I'm concerned. The main difference is that you're dropping
the THP logic in the mempolicy code, which will make it worse for some
cases; I was trying to retain it for all cases where swapping wouldn't
happen anyway, and such logic would have still provided the behavior
David prefers in those cases.

Adding the new feature to create a THP specific mempolicy can be done
later. In the meanwhile the current mempolicy code can always override
whatever THP default behavior comes out of this; it will just require
the admin to set up a mempolicy to enforce the preferred behavior for
4k and THP allocations alike.

Thanks,
Andrea


Re: [PATCH v3 1/3] ptrace: Provide ___ptrace_may_access() that can be applied on arbitrary tasks

2018-09-05 Thread Andrea Arcangeli
On Wed, Sep 05, 2018 at 08:29:07PM +0200, Jiri Kosina wrote:
> (and no, my testing of the patch I sent on current tree didn't produce any 
> hangs -- was there a reliable way to trigger it on 3.10?).

Only a very specific libvirt acceptance test found this after a while,
and it wasn't a customer; it was caught by QA. The reporter wasn't sure
how to reproduce this issue either; it happened once in a while. The
backtrace was still enough to fix it for sure, and then it never
happened again.

It's not because of virt but probably because of selinux+audit. This is
precisely why I thought that once you enter LSM from the scheduler
atomic path the trouble starts, as each LSM implementation of those
calls may or may not crash.

Perhaps you didn't sandbox KVM inside selinux by default?

This is the lockup the patch I posted fixed for 3.10.

[ 1838.621010] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 
6
[ 1838.629070] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 
3.10.0-327.62.4.el7.x86_64 #1
[ 1838.637610] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.4.2 
01/09/2017
[ 1838.645954] Call Trace:
[ 1838.648680][] dump_stack+0x19/0x1b
[ 1838.655113]  [] panic+0xd8/0x1e7
[ 1838.660460]  [] ? restart_watchdog_hrtimer+0x50/0x50
[ 1838.667742]  [] watchdog_overflow_callback+0xc2/0xd0
[ 1838.675024]  [] __perf_event_overflow+0xa1/0x250
[ 1838.681920]  [] perf_event_overflow+0x14/0x20
[ 1838.688526]  [] intel_pmu_handle_irq+0x1e8/0x470
[ 1838.695423]  [] ? ioremap_page_range+0x24c/0x330
[ 1838.702320]  [] ? unmap_kernel_range_noflush+0x11/0x20
[ 1838.709797]  [] ? ghes_copy_tofrom_phys+0x124/0x210
[ 1838.716984]  [] ? ghes_read_estatus+0xa0/0x190
[ 1838.723687]  [] perf_event_nmi_handler+0x2b/0x50
[ 1838.730582]  [] nmi_handle.isra.0+0x69/0xb0
[ 1838.736992]  [] do_nmi+0x169/0x340
[ 1838.742532]  [] end_repeat_nmi+0x1e/0x7e
[ 1838.748653]  [] ? _raw_spin_lock_irqsave+0x3d/0x60
[ 1838.755742]  [] ? _raw_spin_lock_irqsave+0x3d/0x60
[ 1838.762831]  [] ? _raw_spin_lock_irqsave+0x3d/0x60
[ 1838.769917]  <>  [] avc_compute_av+0x126/0x1b5
[ 1838.777125]  [] ? walk_tg_tree_from+0xbe/0x110
[ 1838.783828]  [] avc_has_perm_noaudit+0xc4/0x110
[ 1838.790628]  [] cred_has_capability+0x6b/0x120
[ 1838.797331]  [] ? ktime_get+0x4c/0xd0
[ 1838.803160]  [] ? clockevents_program_event+0x6b/0xf0
[ 1838.810532]  [] selinux_capable+0x2e/0x40
[ 1838.816748]  [] security_capable_noaudit+0x15/0x20
[ 1838.823829]  [] has_ns_capability_noaudit+0x15/0x20
[ 1838.831014]  [] ptrace_has_cap+0x35/0x40
[ 1838.837126]  [] ___ptrace_may_access+0xa7/0x1e0
[ 1838.843925]  [] __schedule+0x26e/0xa00
[ 1838.849855]  [] schedule_preempt_disabled+0x29/0x70
[ 1838.857041]  [] cpu_startup_entry+0x184/0x290
[ 1838.863637]  [] start_secondary+0x1da/0x250


Re: [PATCH v3 1/3] ptrace: Provide ___ptrace_may_access() that can be applied on arbitrary tasks

2018-09-05 Thread Andrea Arcangeli
On Wed, Sep 05, 2018 at 08:58:23AM -0700, Andi Kleen wrote:
> > So, after giving it a bit more thought, I still believe "I want spectre V2 
> > protection" vs. "I do not care about spectre V2 on my system 
> > (=nospectre_v2)" are the sane options we should provide; so I'll respin v4 
> > of my patchset, including the ptrace check in switch_mm() (statically 
> > patched out on !IBPB-capable systems), and we can then later see whether 
> > the LSM implementation, once it exists, should be used instead.
> 
> Please if you repost include plenty of performance numbers for multi threaded
> workloads.  It's ridiculous to even discuss this without them.

Multi threaded workloads won't be affected because they share the
memory in the first place... the check itself is lost in the noise
too. Maybe you meant to ask for multiple parallel processes
(multithreaded or not, zero difference) all with a different user id?

What is weirder to me is to attempt to discuss the STIBP part of the
patch without knowing exactly which microcodes implement STIBP in a way
that is slow. Tim already said it's a measurable performance hit, but
on some CPUs it's a zero performance hit. We don't even know if STIBP
is actually useful or if it's a noop on those CPUs where it won't
affect performance.

Back to the IBPB: from an implementation standpoint, at least on 3.10
the code posted would eventually lock up hard and we got complaints.

ptrace_has_cap(tcred->user_ns, mode) is liable to eventually lock up
hard if called from the scheduler as it does some locking, and we fixed
that already half a year ago.

Not sure how it's still unfixed in Jiri's codebase after so long, or if
it's an issue specific to 3.10 and upstream gets away without this.

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index eb7862f185ff..4a8d0dd73c93 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -285,7 +285,8 @@ int ___ptrace_may_access(struct task_struct *tracer,
gid_eq(caller_gid, tcred->sgid) &&
gid_eq(caller_gid, tcred->gid))
goto ok;
-   if (ptrace_has_cap(tcred->user_ns, mode))
+   if (!(mode & PTRACE_MODE_NOACCESS_CHK) &&
+   ptrace_has_cap(tcred->user_ns, mode))
goto ok;
rcu_read_unlock();
return -EPERM;
@@ -296,7 +297,8 @@ ok:
dumpable = get_dumpable(task->mm);
rcu_read_lock();
if (dumpable != SUID_DUMP_USER &&
-   !ptrace_has_cap(__task_cred(task)->user_ns, mode)) {
+   ((mode & PTRACE_MODE_NOACCESS_CHK) ||
+!ptrace_has_cap(__task_cred(task)->user_ns, mode))) {
rcu_read_unlock();
return -EPERM;
}


Re: [PATCH v3 1/3] ptrace: Provide ___ptrace_may_access() that can be applied on arbitrary tasks

2018-09-04 Thread Andrea Arcangeli
On Wed, Sep 05, 2018 at 01:00:37AM +, Schaufler, Casey wrote:
> Sorry, I've been working in security too long for my
> optimistic streak to be very wide.

Eheh. I was simply trying to follow it in context, but it wasn't
entirely clear, so I tried to take it out of context and then it was
even less clear; never mind, I was only looking for some more
explanation.

> Not especially, no. I have gotten considerable feedback that
> it should be configurable, and while I may not see the value myself,
> I do respect the people who've requested the configurability.

Correct me if I'm wrong, but the last word in LSM, "module", implies
making this logic modular.

My wondering is because:

1) why not just a single version of this logic in the scheduler?
   (i.e. can only be on or off or perhaps a only-ibpb-dumpable tweak
   to retain current slightly higher perf but less secure behavior)

2) even if you want a myriad of versions of this IBPB logic, how can
   the different versions possibly fit in a whole "module" whose
   semantics have pretty much nothing to do with the other methods in
   the module, like LSM_HOOK_INIT(capable, cap_capable),
   LSM_HOOK_INIT(vm_enough_memory, cap_vm_enough_memory), and
   LSM_HOOK_INIT(mmap_addr, cap_mmap_addr), or even
   LSM_HOOK_INIT(inode_follow_link, selinux_inode_follow_link),
   LSM_HOOK_INIT(inode_permission, selinux_inode_permission)? I mean
   it looks like an apples-to-oranges monolithic link in the same
   module. Or are you going to load this method in a separate module
   with only a single method implemented and then load multiple LSM
   modules at the same time?

3) if you need so much tweaking that even a single on/off switch
   independent of any module loaded through LSM is not enough, how is
   it ok that all the rest isn't configurable? All the rest has more
   performance impact than this one, so if anything you'd start from
   there.

I understand how configurability is valuable (this is why we kept
everything dynamically tunable at runtime, including the PTI stop
machine, to allow even the PTI TLB flushes to be turned off).

I'm just trying to understand how IBPB fits in a LSM module
implementation.

Even if you build with CONFIG_SECURITY=n, PTI won't go away, retpoline
won't go away, the l1flush in vmenter won't go away, the
pte/transhugepmd inversion won't go away; so why should only the
runtime tunability or tweaking of IBPB fit in an LSM module?

> If they provide different answers there should be different
> functions. It's a question of viewpoint. If you don't care about
> the SELinux viewpoint you shouldn't have to include it in your
> checks, any more than you care about x86 issues when you're
> running on a MIPS processor.

I don't see how selinux fits into this. Selinux doesn't affect virtual
memory protection. Not having selinux even built in doesn't mean you
are ok with virtual memory protection not being provided by the CPU.

Or in other words selinux fits into this as much as seccomp or
apparmor fits into this.

This is just like a memory barrier mb() to be executed in the context
switch code. Security issues can emerge with the lack of it regardless
if selinux is built in and enabled or CONFIG_SECURITY=n.

SELinux matters after an exploit has already happened; this, by
contrast, is needed to prevent the exploit in the first place.

Also, the correct fully secure version provides a single answer
(in theory, assuming a perfect implementation that didn't forget
anything); having a single implementation also increases the chances
that we get as close as possible to the theoretically correct
implementation.

If it provides different answers in this case, it is because somebody
wants less than perfect security in exchange for higher performance,
i.e. the "configurability", which is a valuable and fine feature to
provide. An LSM module just doesn't seem the most flexible way to
provide configurability, by far.

> Yes, even security people have to worry about locking.

Yes, it was some lock that, if contended, would lock up when taken
from inside the scheduler.

> What *is* the most robust way?

The below pretty much explains it.

> Yes, there are locking issues. The code in the LSM infrastructure and in
> the security modules needs to be aware of that and deal with it. The SELinux
> code I proposed is more complex than it could be because the audit code
> requires locking that doesn't work in the switching context.

Or in other words, having multiple versions of this called from within
the scheduler is a bit like making the kernel modular and then,
because the locking is subtle, ending up with scheduler deadlocks that
only happen with some kernel modules loaded and not others. The real
question is what the payoff is compared to just allowing the scheduler
code to be tuned with x86 debugfs or sysctl the normal
way.
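
As a point of comparison, "the normal way" could be as small as the
sketch below; the ibpb_enabled knob and its placement are hypothetical,
only debugfs_create_bool() and arch_debugfs_dir are existing kernel
facilities:

/*
 * Hypothetical sketch of a plain x86 debugfs switch read by the context
 * switch code, instead of an LSM hook; names are made up for illustration.
 */
static bool ibpb_enabled __read_mostly = true;

static int __init ibpb_debugfs_init(void)
{
	debugfs_create_bool("ibpb_enabled", 0600, arch_debugfs_dir,
			    &ibpb_enabled);
	return 0;
}
late_initcall(ibpb_debugfs_init);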

Also how would a new common code method in LSM fit for CPUs that are
so slow that they can't possibly need anything like IBPB and
flush_RSB()?

Thanks!
Andrea


Re: [PATCH v3 1/3] ptrace: Provide ___ptrace_may_access() that can be applied on arbitrary tasks

2018-09-04 Thread Andrea Arcangeli
Hello,

On Tue, Sep 04, 2018 at 06:10:47PM +, Schaufler, Casey wrote:
> The real reason to use an LSM based approach is that overloading ptrace
> checks is a Really Bad Idea. Ptrace is a user interface. Side-channel is a
> processor interface. Even if ptrace_may_access() does exactly what you

"Side channel is a processor interface" doesn't make me optimistic,
but I assume you're not speaking for Intel.

> want it to for side-channel mitigation today it would be incredibly
> inappropriate to tie the two together for eternity. You don't want to restrict
> the ptrace checks to only those things that are also good for side-channel,
> and vice versa. 

It seems like you want to make this more configurable; we have
x86-specific debugfs tweaks to disable IBPB at runtime, but we don't
allow a runtime opt-out of IBPB alone.

If you shutdown either IBRS or retpolines at runtime with debugfs,
then IBPB goes away too.

Giving a fine-grained way to disable only IBPB was, we found, overkill,
because IBPB has never been measurable if done only when the prev task
cannot ptrace the next task (which is a superset of the too-weak
upstream not-dumpable check, but still not a performance issue).

Allowing an IBPB runtime opt-out doesn't make much sense if you don't
also allow disabling retpolines at runtime. And LSM doesn't sound like
the right place to disable retpolines from; it's a temporary x86
quirk only relevant for current buggy CPUs.

There should be a function that decides when IBPB and flush_RSB should
be run (flush_RSB has to be run if SMEP, because with SMEP there's no
need to run flush_RSB at every kernel entry anymore), and that
function happens to check ptrace today, but it would check other stuff
too if we had other interfaces besides ptrace that allowed the
prev task to read the memory of the next task. So it's not so much
about ptrace nor about IBPB, it's about "IBPB_RSB" vs "prev task
can read memory of next task". Then each arch can implement the
"IBPB_RSB" method its own way, but the check is common
code, it should be in the scheduler, and there's just 1 version of
this check needed.
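
A rough sketch of that split, purely illustrative (the helper names
below are made up; only the idea of one common check plus an arch hook
comes from the paragraph above):

/* Common code decides based on "can prev read next's memory?"; the arch
 * decides what barrier that implies (IBPB, RSB fill, both, or nothing). */
static void cond_mitigation_on_switch(struct task_struct *prev,
				      struct task_struct *next)
{
	if (prev_can_read_next_memory(prev, next))	/* hypothetical helper */
		return;

	arch_branch_prediction_flush();			/* hypothetical arch hook */
}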

I don't think there should be a ton of different versions of this
function each providing a different answer, which is what LSM would
provide.

At most you can add a x86 debugfs tweak to shut off the logic but
again it would make more sense if retpolines could be shut off at
runtime too, doing it just for IBPB sounds backwards because it has
such an unmeasurable overhead.

> Yes. That would be me. 

Even the attempt to make an innocuous call like
ptrace_has_cap(tcred->user_ns, mode) will eventually
deadlock there.

I used a PTRACE_MODE_ check that the ptrace check code uses to filter
out specific calls that may eventually enter LSM and hard lockup in
non reproducible workloads (some false positive IBPB is ok, especially
if it avoids a deadlock).

Everything can be fixed one way or another, but calling LSM from
scheduler code doesn't sound like the most robust thing to do in
general, because what works outside the scheduler doesn't work from
within those semi-atomic regions (it tends to work initially until it
eventually deadlocks). The original version of this check had all
sorts of deadlocks because of that.

Thanks,
Andrea


Re: [PATCH] KVM: try __get_user_pages_fast even if not in atomic context

2018-08-06 Thread Andrea Arcangeli
Hello,

On Mon, Aug 06, 2018 at 01:44:49PM +0200, Paolo Bonzini wrote:
> On 06/08/2018 09:51, Xiao Guangrong wrote:
> > 
> > 
> > On 07/27/2018 11:46 PM, Paolo Bonzini wrote:
> >> We are currently cutting hva_to_pfn_fast short if we do not want an
> >> immediate exit, which is represented by !async && !atomic.  However,
> >> this is unnecessary, and __get_user_pages_fast is *much* faster
> >> because the regular get_user_pages takes pmd_lock/pte_lock.
> >> In fact, when many CPUs take a nested vmexit at the same time
> >> the contention on those locks is visible, and this patch removes
> >> about 25% (compared to 4.18) from vmexit.flat on a 16 vCPU
> >> nested guest.
> >>
> > 
> > Nice improvement.
> > 
> > Then after that, we will unconditionally try hva_to_pfn_fast(), does
> > it hurt the case that the mappings in the host's page tables have not
> > been present yet?
> 
> I don't think so, because that's quite slow anyway.

There will be a minimal impact, but it's worth it.

The reason it's worth it is that we shouldn't be calling
get_user_pages_unlocked in hva_to_pfn_slow if we could pass
FOLL_HWPOISON to get_user_pages_fast.

And get_user_pages_fast is really just __get_user_pages_fast +
get_user_pages_unlocked with just a difference (see below).

Reviewed-by: Andrea Arcangeli 

> 
> > Can we apply this tech to other places using gup or even squash it
> > into  get_user_pages()?
> 
> That may make sense.  Andrea, do you have an idea?

About further improvements looking at commit
5b65c4677a57a1d4414212f9995aa0e46a21ff80 it looks like it may be worth
adding a new gup variant __get_user_pages_fast_irq_enabled to make our
slow path "__get_user_pages_fast_irq_enabled +
get_user_pages_unlocked" really as fast as get_user_pages_fast (which
we can't call in the atomic case and can't take the foll flags, making
it take the foll flags would also make it somewhat slower by adding
branches).

If I understand correctly the commit header Before refers to when
get_user_pages_fast was calling __get_user_pages_fast, and After is
the optimized version without local_irq_save/restore but instead using
local_irq_disable/enable.

So we'd need to call a new __get_user_pages_fast_irq_enabled instead
of __get_user_pages_fast, one that would only be safe to call when irqs
are enabled, and that's always the case for KVM, also for the atomic
case (KVM's atomic case is atomic only because of the spinlock, not
because irqs are disabled). Such a new method would then also be ok to
call from interrupts as long as irqs are enabled when it is being called.

Such a change would also help reduce the minimal impact on the
_slow case. x86 would surely be fine with the generic version and it's
trivial to implement; I haven't checked other arch details.
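
A minimal sketch of what such a variant could look like, loosely
modeled on the mm/gup.c of that era; the helper names
(gup_fast_permitted()/gup_pgd_range()) and the exact checks are
simplified assumptions, not the actual proposal:

int __get_user_pages_fast_irq_enabled(unsigned long start, int nr_pages,
				      int write, struct page **pages)
{
	unsigned long end = start + ((unsigned long)nr_pages << PAGE_SHIFT);
	int nr = 0;

	/* caller contract: only valid with irqs enabled */
	if (WARN_ON_ONCE(irqs_disabled()))
		return 0;

	if (gup_fast_permitted(start, nr_pages, write)) {
		/* cheaper than local_irq_save()/local_irq_restore() */
		local_irq_disable();
		gup_pgd_range(start, end, write, pages, &nr);
		local_irq_enable();
	}

	return nr;
}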

Thanks,
Andrea


Re: [PATCH] userfaultfd: hugetlbfs: Fix userfaultfd_huge_must_wait pte access

2018-07-03 Thread Andrea Arcangeli
Hello,

On Wed, Jun 27, 2018 at 10:47:44AM +0200, Janosch Frank wrote:
> On 26.06.2018 19:00, Mike Kravetz wrote:
> > On 06/26/2018 06:24 AM, Janosch Frank wrote:
> >> Use huge_ptep_get to translate huge ptes to normal ptes so we can
> >> check them with the huge_pte_* functions. Otherwise some architectures
> >> will check the wrong values and will not wait for userspace to bring
> >> in the memory.
> >>
> >> Signed-off-by: Janosch Frank 
> >> Fixes: 369cd2121be4 ("userfaultfd: hugetlbfs: userfaultfd_huge_must_wait 
> >> for hugepmd ranges")
> > Adding linux-mm and Andrew on Cc:
> > 
> > Thanks for catching and fixing this.
> 
> Sure
> I'd be happy if we get less of these problems with time, this one was
> rather painful to debug. :)

What I thought when I read the fix is that it would be more robust and we
could catch any further error like this at build time by having
huge_pte_offset return a new type "hugepte_t *" instead of the current
"pte_t *". Of course then huge_ptep_get() would take a "hugepte_t *" as
parameter. The x86 implementation would then become:

static inline pte_t huge_ptep_get(hugepte_t *ptep)
{
	return *(pte_t *)ptep;
}
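
The missing piece would be the type itself; one way it could be
defined, following the same wrapper-struct pattern pte_t and pgprot_t
already use (hugepte_t is hypothetical, it doesn't exist in the tree):

/* Hypothetical sketch: a distinct wrapper type, so passing a huge pte
 * pointer where a plain pte_t * is expected fails at compile time. */
typedef struct { pte_t pte; } hugepte_t;

/* huge_pte_offset() would then return hugepte_t * instead of pte_t *,
 * and the x86 accessor could avoid the cast entirely: */
static inline pte_t huge_ptep_get(hugepte_t *ptep)
{
	return ptep->pte;
}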

I haven't tried it; perhaps it's not feasible for other reasons,
because there's significant fallout from such a change (i.e. a lot
of hugetlbfs methods need to change input type), but you said you're
actively looking to get fewer of these problems, so this could be a way
if it can be done, and I should mention it.

The need for huge_ptep_get() is of course very apparent when reading the
fix, but it was anything but apparent when reading the previous code,
and the previous code was correct for x86 because huge_ptep_get is of
course implemented as *ptep on x86.

For now the current fix is certainly good, any robustness cleanup is
cleaner if done orthogonal anyway.

Thanks!
Andrea


Re: [PATCH v2] mm/ksm: ignore STABLE_FLAG of rmap_item->address in rmap_walk_ksm

2018-06-07 Thread Andrea Arcangeli
On Thu, Jun 07, 2018 at 03:13:44PM -0700, Andrew Morton wrote:
> This patch is quite urgent and is tagged for -stable backporting, yet
> it remains in an unreviewed state.  Any takers?

It looks like a straightforward, safe fix. On x86, hva_to_gfn_memslot
would zap those bits and hide the misalignment caused by the low
metadata bits being erroneously left set in the address, but the arm
code notices when that's the last page in the memslot, the hva_end
gets aligned, and the size ends up below one page.

> [35380.933345] [] dump_backtrace+0x0/0x22c
> [35380.938723] [] show_stack+0x24/0x2c
> [35380.943759] [] dump_stack+0x8c/0xb0
> [35380.948794] [] bad_page+0xf4/0x154
> [35380.953740] [] free_pages_check_bad+0x90/0x9c
> [35380.959642] [] free_pcppages_bulk+0x464/0x518
> [35380.965545] [] free_hot_cold_page+0x22c/0x300
> [35380.971448] [] __put_page+0x54/0x60
> [35380.976484] [] unmap_stage2_range+0x170/0x2b4
> [35380.982385] [] kvm_unmap_hva_handler+0x30/0x40
> [35380.988375] [] handle_hva_to_gpa+0xb0/0xec
> [35380.994016] [] kvm_unmap_hva_range+0x5c/0xd0
> [35380.999833] [] 
> 
> I even injected a fault on purpose in kvm_unmap_hva_range by seting
> size=size-0x200, the call trace is similar as above.  So I thought the
> panic is similarly caused by the root cause of WARN_ON.

I think the problem triggers in the addr += PAGE_SIZE of
unmap_stage2_ptes that never matches end because end is aligned but
addr is not.

} while (pte++, addr += PAGE_SIZE, addr != end);

x86 again only works on hva_start/hva_end after converting them to
gfn_start/end, and those being in pfn units, the bits are zapped before
they risk causing trouble.
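
A made-up numeric example of that failure mode (the addresses are
purely illustrative): once a stale low metadata bit is left in the hva,
addr can never become equal to the aligned end:

unsigned long addr = 0x40000e00;	/* hva with a stale KSM low bit set */
unsigned long end  = 0x40001000;	/* page-aligned end of the range */

do {
	/* ... unmap one pte ... */
	addr += PAGE_SIZE;		/* 0x40001e00, 0x40002e00, ... */
} while (addr != end);			/* never true: addr stays misaligned */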

> 
> Link: 
> http://lkml.kernel.org/r/1525403506-6750-1-git-send-email-hejia...@gmail.com
> Signed-off-by: Jia He 
> Cc: Suzuki K Poulose 
> Cc: Andrea Arcangeli 
> Cc: Minchan Kim 
> Cc: Claudio Imbrenda 
> Cc: Arvind Yadav 
> Cc: Mike Rapoport 
> Cc: Jia He 
> Cc: 
> Signed-off-by: Andrew Morton 
> ---
> 

Reviewed-by: Andrea Arcangeli 


Re: [patch V2 1/2] sysfs/cpu: Add vulnerability folder

2018-01-26 Thread Andrea Arcangeli
Hello,

On Sun, Jan 07, 2018 at 10:48:00PM +0100, Thomas Gleixner wrote:
> +static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
> +static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL);
> +static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);

This sysfs feature as implemented above weakens kernel security;
the files should be 0400.

It doesn't make sense to expose to the luser whether a software fix (or
even just a software mitigation) has been disabled at build time to
gain all the performance back (see the CONFIG_RETPOLINE=n config option).

$ cat /boot/kernel-`uname -r`
cat: /boot/kernel-4.15.0-rc9+: Permission denied
$ cat /lib/modules/`uname -r`/kernel/arch/x86/kvm/kvm.ko 
cat: /lib/modules/4.15.0-rc9+/kernel/arch/x86/kvm/kvm.ko: Permission denied
$ dmesg
dmesg: read kernel buffer failed: Operation not permitted

Noticing when cr3 is flipped in kernel/exit is easy, but whether
the syscall table or the whole kernel has been built with retpolines
is not trivial to detect. Same for variant#1.

Exploiting spectre variant#2 for an attacker is not without risk of
being detected while the setup is being mounted, as the CPU load would
spike for hours.

I may notice if a random background network daemon suddenly starts
running at 100% CPU load for hours (especially on mobile devices that
would be physically noticeable).

Containers shouldn't have sysfs, and you can work around the above if
you run all network daemons behind mount namespaces, but in general
leaving this directory readable by the luser weakens security because
it exposes whether mounting a variant#2 attack can succeed.

It even tells when it is worth to focus on the syscall_table indirect
call or if the attack needs to dig deeper because asm retpolines were
used, but the kernel was built with a gcc without retpolines.

The only case where the above isn't weakening security is when the
full fix for all the variants is enabled (and variant#1 for now
just shows vulnerable..).

For the same reasons the whole directory, not just the files, should be
0500, especially if this will be used for any other equivalent issue
in the future and won't stick to these 3 files. I didn't implement
that yet, because it's less urgent if nobody adds any more files soon.

From 578b411c8dcb1435dd1f94a6cd062f4eedb70fb5 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarca...@redhat.com>
Date: Wed, 24 Jan 2018 19:19:36 +0100
Subject: [PATCH 1/1] x86/spectre/meltdown: avoid the vulnerability directory
 to weaken kernel security

If any of the fixes is disabled to gain some performance back at
runtime or build time, should not be exposed to unprivileged userland.

Signed-off-by: Andrea Arcangeli <aarca...@redhat.com>
---
 drivers/base/cpu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index d99038487a0d..a3a8e008f957 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -531,9 +531,9 @@ ssize_t __weak cpu_show_spectre_v2(struct device *dev,
 	return sprintf(buf, "Not affected\n");
 }
 
-static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
-static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL);
-static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);
+static DEVICE_ATTR(meltdown, 0400, cpu_show_meltdown, NULL);
+static DEVICE_ATTR(spectre_v1, 0400, cpu_show_spectre_v1, NULL);
+static DEVICE_ATTR(spectre_v2, 0400, cpu_show_spectre_v2, NULL);
 
 static struct attribute *cpu_root_vulnerabilities_attrs[] = {
 	&dev_attr_meltdown.attr,



Re: [PATCH v2] mm: Reduce memory bloat with THP

2018-01-25 Thread Andrea Arcangeli
On Thu, Jan 25, 2018 at 10:58:32AM +0100, Michal Hocko wrote:
> Ohh, absolutely. And that is why we have changed the default in upstream
> 444eb2a449ef ("mm: thp: set THP defrag by default to madvise and add a
> stall-free defrag option")

Agreed, that direct compaction change should already address the cases
quoted in the other URLs.

One of the URL is about using fork() to snapshot a nosql db state,
that one can't be helped by the above commit but it's still unrelated
to MADV_DONTNEED or memory bloat.

It would be possible to fully fix the use of fork() for snapshotting
without userfaultfd WP mode, by just adding an madvise that forces 4k
CoWs on top of 2M THP and calling it in the parent that keeps writing
to memory while the child is writing the readonly copy to disk, but I
believe userfaultfd WP will be way more optimal as it provides so many
other advantages (i.e. avoid fork() in the first place and use
pthread_create and be able to throttle on I/O and limit the max memory
usage to something less than x2 RAM without the risk of triggering the
OOM killer and have a ring that is written immediately to keep the mem
utilization low etc..).

Thanks,
Andrea


Re: [PATCH v2] mm: Reduce memory bloat with THP

2018-01-25 Thread Andrea Arcangeli
On Thu, Jan 25, 2018 at 11:41:03AM -0800, Nitin Gupta wrote:
> I'm trying to address many different THP issues and memory bloat is
> first among them.

You quoted redis in an earlier email, the redis issue has nothing to
do with MADV_DONTNEED.

I can quickly explain the redis issue.

Redis uses fork() to create a readonly copy of the memory to do
snapshotting in the child, while parent still writes to the memory.

THP CoWs in the parent are higher latency than 4k CoWs, and they also
take more memory, but that's secondary; in fact the maximum waste of
memory in this model will reach the same worst case (x2) with 4k CoWs
too, no difference.

The problem is the copy-user there, it adds latency and wastes CPU.

Redis can simply use userfaultfd WP mode once it'll be upstream and
then it will use 4k granularity as the granularity of the writeprotect
userfaults is up to userland to decide.

The main benefit is it can avoid the worst case degradation of using
x2 physical memory (disabling THP makes zero difference in that
regard, if storage is very slow x2 physical memory can still be used
if very unlucky), it can throttle the WP writes (anon COW cannot
throttle), it can avoid to fork altogether so it shares the same
pagetables. It can also put the "user-CoWed" pages (in the fault
handler) in front of the write queue, to be written first, using a
ring buffer for the CoWed 4k pages, to keep memory utilization even
lower despite THP stays on at all times for all pages that didn't get
a CoW yet. This will be an optimal snapshot method, much better than
fork() no matter if 4k or THP are backing the memory.
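
To make the userland side concrete: the WP mode wasn't upstream when
this was written, so the sketch below uses the uffd ioctls as they
eventually landed (UFFDIO_REGISTER_MODE_WP / UFFDIO_WRITEPROTECT)
purely as an illustration of the registration step; the fault-reading
thread, the CoW copies and the error paths are omitted.

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Illustration only: register a region for write-protect userfaults and
 * arm WP on it, so each first write is reported at 4k granularity and
 * userland decides how to copy and how to throttle. */
static int uffd_wp_arm(void *addr, size_t len)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_WP,
	};
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_WRITEPROTECT_MODE_WP,
	};
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	if (uffd < 0 ||
	    ioctl(uffd, UFFDIO_API, &api) ||
	    ioctl(uffd, UFFDIO_REGISTER, &reg) ||
	    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
		return -1;

	return uffd;	/* read uffd_msg events from this fd to see the writes */
}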

In short MADV_DONTNEED has nothing to do with redis; if mysql gets an
improvement, surely you can post a benchmark instead of URLs.

If you want low memory usage at the cost of potentially slower
performance overall you should use transparent_hugepage=madvise .

The cases where THP is not a good tradeoff are generally related to
lower performance in copy-user or the higher cost of compaction if the
app is only ever doing short-lived allocations.

If you post a reproducible benchmark with a real-life app that gets an
improvement with whatever change you're doing, it'll be possible to
evaluate it.

Thanks,
Andrea


Re: [RH72 Spectre] ibpb_enabled = 1 leads to hard LOCKUP under x86_64 host machine

2018-01-20 Thread Andrea Arcangeli
Hello everyone,

On Sat, Jan 20, 2018 at 01:56:08PM +, Van De Ven, Arjan wrote:
> well first of all don't use IBRS, use retpoline

This issue triggers in the IBPB code during user to user context
switch and IBPB is still needed there no matter if kernel is using
retpolines or if it uses kernel IBRS. In fact IBPB is still needed
there even if retpolines+user_ibrs is used or if
always_ibrs/ibrs_enabled=2 is used (IBRS doesn't protect from the
poison generated in the same predictor mode, "especially" in future
CPUs).

Only retpolining all userland would avoid IBPB here, but I doubt you
suggest that.

Kernel retpolines or kernel IBRS would make zero difference for
this specific issue.

> and if Andrea says this was a known issue in their code then I think that 
> closes the issue.
> 

It's an implementation bug we inherited from the merge of a CPU vendor
patch and I can confirm it's already closed. The fix has been already
shipped with the wave 2 update in fact and some other versions even
had the bug fixed since the very first wave on 0day.

That deadlock nuisance only ever triggered in artificial QA testcases
and even then it wasn't easily reproducible.

We already moved the follow-ups to the vendor BZ to avoid using
bandwidth here.

Thank you!
Andrea


Re: [PATCH 23/35] x86/speculation: Add basic speculation control code

2018-01-19 Thread Andrea Arcangeli
On Fri, Jan 19, 2018 at 04:15:33AM +, Van De Ven, Arjan wrote:
> there is no such guarantee. Some of the IBRS implementations will
> actually flush rather than disable, or flush parts and disable other
> parts.

For me, understanding why the spec is the way it is helps in
memorizing it.

I tried to help explain some of that, but I notice that I created
more confusion... I never meant that IBPB can be skipped in
user-to-user switches by leaving IBRS set in userland; that's not what
we do and it wouldn't be ok with certain smarter CPUs.

> yes the wording is a bit cryptic, but it's also very explicit about
> what it covers (and the rest is not covered!) and had to allow a few
> different implementations unfortunately.

We already follow the spec to the letter and we only depend on what is
covered there.

Surely the specs already explain everything better than I could ever
do, so if anything wasn't clear in the two previous emails where I
failed to explain the difference between setting or leaving IBRS set
in userland (ibrs_user) and setting or leaving STIBP set in userland
(stibp_user) you'll find all answers in the very explicit spec per
above quote.

Thanks,
Andrea


Re: [PATCH 23/35] x86/speculation: Add basic speculation control code

2018-01-18 Thread Andrea Arcangeli
On Thu, Jan 18, 2018 at 12:24:31PM -0600, Josh Poimboeuf wrote:
> On Thu, Jan 18, 2018 at 06:12:36PM +0100, Paolo Bonzini wrote:
> > On 18/01/2018 18:08, Dave Hansen wrote:
> > > On 01/18/2018 08:37 AM, Josh Poimboeuf wrote:
> > >>>
> > >>> --- a/Documentation/admin-guide/kernel-parameters.txt
> > >>> +++ b/Documentation/admin-guide/kernel-parameters.txt
> > >>> @@ -3932,6 +3932,7 @@
> > >>> retpoline - replace indirect branches
> > >>> retpoline,generic - google's original retpoline
> > >>> retpoline,amd - AMD-specific minimal thunk
> > >>> +   ibrs  - Intel: Indirect Branch 
> > >>> Restricted Speculation
> > >> Are there plans to add spectre_v2=ibrs_always to prevent SMT-based
> > >> attacks?
> > > 
> > > What does "ibrs_always" mean to you?
> 
> Maybe ibrs_always isn't the best name.  Basically we need an option to
> protect user-user attacks via SMT.
> 
> It could be implemented with IBRS=1, or STIBP, or as part of the
> mythical IBRS_ATT.

User stibp or user ibrs would be different things, both would be valid
for different use cases, and the user stibp should perform better.

Leaving ibrs on when returning from kernel to userland (or setting
ibrs if kernel used retpolines instead of ibrs) achieves stronger
semantics than just setting SPEC_CTRL with stibp when returning to
userland.

That is true no matter if kernel is using retpolines or ibrs.

IBRS is semantically equivalent to "STIBP; IBPB", so user_ibrs is
always inclusive of user_stibp.

That said, the CPU should better achieve such semantics without really
internally issuing an IBPB of course, but you can think of the current
IBRS as "STIBP; IBPB". That IBPB immediately after the STIBP makes a
difference to the non-HT attacks possible on host userland.
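
As a minimal sketch of the difference on the exit-to-userland path
(user_ibrs/user_stibp as C variables are illustrative; the SPEC_CTRL
bits are the architectural ones, bit 0 IBRS and bit 1 STIBP):

	if (user_ibrs)
		native_wrmsrl(MSR_IA32_SPEC_CTRL, SPEC_CTRL_IBRS);	/* leave IBRS on */
	else if (user_stibp)
		native_wrmsrl(MSR_IA32_SPEC_CTRL, SPEC_CTRL_STIBP);	/* HT/SMT only */
	else
		native_wrmsrl(MSR_IA32_SPEC_CTRL, 0);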

user_stibp wouldn't solve all cases that user_ibrs solves, but it'd be
ideal if critical user apps are built with retpolines: the only
concern left for those is an HT/SMT attack, so they only need to care
about HT/SMT.

To begin with, user_ibrs would be more important than user_stibp.

On a side note: STIBP isn't always available, it requires a new CPUID
check on bit 27 too. You can still write to it and it won't #GP; on
some CPUs it's simply implicit and you can write to it, but it's a
noop. I haven't figured out exactly how to differentiate whether it's
disabled or implicitly enabled when it's not enumerated in CPUID.
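
For reference, here is a small userspace sketch of that CPUID check
(CPUID leaf 7, subleaf 0: EDX bit 26 enumerates SPEC_CTRL/IBRS+IBPB,
EDX bit 27 enumerates STIBP); it only shows the enumeration, it cannot
tell apart the "implicit but noop" case either:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	__cpuid(0, eax, ebx, ecx, edx);
	if (eax < 7)
		return 1;

	__cpuid_count(7, 0, eax, ebx, ecx, edx);
	printf("SPEC_CTRL enumerated: %u\n", (edx >> 26) & 1);
	printf("STIBP enumerated:     %u\n", (edx >> 27) & 1);
	return 0;
}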


Re: [mm 4.15-rc8] Random oopses under memory pressure.

2018-01-18 Thread Andrea Arcangeli
On Thu, Jan 18, 2018 at 06:45:00AM -0800, Dave Hansen wrote:
> On 01/18/2018 04:25 AM, Kirill A. Shutemov wrote:
> > [   10.084024] diff: -858690919
> > [   10.084258] hpage_nr_pages: 1
> > [   10.084386] check1: 0
> > [   10.084478] check2: 0
> ...
> > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > index d22b84310f6d..57b4397f1ea5 100644
> > --- a/mm/page_vma_mapped.c
> > +++ b/mm/page_vma_mapped.c
> > @@ -70,6 +70,14 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
> > }
> > if (pte_page(*pvmw->pte) < pvmw->page)
> > return false;
> > +
> > +   if (pte_page(*pvmw->pte) - pvmw->page) {
> > +   printk("diff: %d\n", pte_page(*pvmw->pte) - pvmw->page);
> > +   printk("hpage_nr_pages: %d\n", 
> > hpage_nr_pages(pvmw->page));
> > +   printk("check1: %d\n", pte_page(*pvmw->pte) - 
> > pvmw->page < 0);
> > +   printk("check2: %d\n", pte_page(*pvmw->pte) - 
> > pvmw->page >= hpage_nr_pages(pvmw->page));
> > +   BUG();
> > +   }
> 
> This says that pte_page(*pvmw->pte) and pvmw->page are roughly 4GB away
> from each other (858690919*4=0xccba559c0).  That's not the compiler
> being wonky, it just means that the virtual addresses of the memory
> sections are that far apart.
> 
> This won't happen when you have vmemmap or flatmem because the mem_map[]
> is virtually contiguous and pointer arithmetic just works against all
> 'struct page' pointers.  But with classic sparsemem, it doesn't.
> 
> You need to make sure that the PFNs are in the same section before you
> can do the math that you want to do here.

Isn't it simply that pvmw->page isn't a page or pte_page(*pvmw->pte)
isn't a page?

The distance cannot matter, the MMU isn't involved, this is pure 64bit
arithmetic; 1 gigabyte, 1 terabyte, 48 bits, 5-level pagetables are all
meaningless in this comparison.

#include <stdio.h>

int main()
{
	volatile long i;
	struct x { char a[40]; };

	/* pointer subtraction divides the raw 64bit distance by sizeof(struct x) */
	for (i = 0; i < 40*3; i += 40) {
		printf("%ld\n", (long)((struct x *)0 - (struct x *)i));
	}
	printf("\n");
	/* same, with distances that aren't a multiple of sizeof(struct x) */
	for (i = 0; i < 40; i += 1) {
		if (i == 4)
			i = 40;
		printf("%ld\n", (long)((struct x *)0 - (struct x *)i));
	}
	return 0;
}

You need to add two debug checks on "pte_page(*pvmw->pte) % 64", and
the same for pvmw->page, to find out which of the two isn't a page.
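
Something along these lines, just a sketch of those two checks to drop
into check_pte() next to the debugging printk above (the % 64 assumes
sizeof(struct page) is 64, as implied above):

	if ((unsigned long)pte_page(*pvmw->pte) % 64)
		printk("pte_page misaligned: %lx\n",
		       (unsigned long)pte_page(*pvmw->pte));
	if ((unsigned long)pvmw->page % 64)
		printk("pvmw->page misaligned: %lx\n",
		       (unsigned long)pvmw->page);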

If both are real pages there's a bug that allocates page structs not
naturally aligned.


Re: [PATCH] x86/pti: unpoison pgd for trusted boot

2018-01-10 Thread Andrea Arcangeli
On Wed, Jan 10, 2018 at 02:49:39PM -0800, Dave Hansen wrote:
> 
> Updated to make this on top of x86/pti.

Reviewed-by: Andrea Arcangeli <aarca...@redhat.com>


Re: [patch RFC 5/5] x86/speculation: Add basic speculation control code

2018-01-10 Thread Andrea Arcangeli
On Wed, Jan 10, 2018 at 01:35:45PM -0800, Tim Chen wrote:
> time may not provide full protection on all cpu models.

All right no problem at all, it's fixed up.

Until very recently the majority of microcodes wasn't available in the
first place, so I guess it's no big issue if in a subset of those the
IBRS barrier-like behavior wasn't immediately fully leveraged in all
cases. I'm just glad this detail was clarified sooner rather than later.

The IBRS barrier-like behavior, and the need to set it even when it's
already set when changing to a higher privilege mode, is crystal
clear now.

Thanks,
Andrea


Re: [patch RFC 5/5] x86/speculation: Add basic speculation control code

2018-01-10 Thread Andrea Arcangeli
On Wed, Jan 10, 2018 at 03:24:17PM +, David Woodhouse wrote:
> Since it achieves nothing¹ but to make userspace run slower, there's no
> need to write it again on returning to userspace. It will perform that
> function just fine without doing so.

Ok, very glad we are on the same page now.

Note that as far as I can tell there was no way to answer the above
question by reading the spec.

You also explicitly used the word barrier in association with IBRS
before, but the word barrier never appeared in the aforementioned
specs in association with IBRS (it was always and only used in
association with IBPB).

I hope this discussion helped make the additional barrier semantics
of IBRS more understandable for the current/future upstream code.

Thanks,
Andrea


Re: [patch RFC 5/5] x86/speculation: Add basic speculation control code

2018-01-10 Thread Andrea Arcangeli
On Wed, Jan 10, 2018 at 06:59:54AM -0800, Dave Hansen wrote:
> On 01/10/2018 06:10 AM, Andrea Arcangeli wrote:
> > Tim and Dave please comment too, Tim you originally wrote that code
> > that leaves IBRS always on and never toggles it in the kernel entry
> > point so you must know full well if Arjan is correct that you must
> > toggle IBRS every time you enter kernel and in turn ibrs_enabled 2
> > isn't valid mode.
> 
> Hi Andrea,
> 
> The "writing IBRS=1 acts as a barrier when it is already IBRS=1"
> behavior is something which I misunderstood in the past.  Thanks, Arjan,
> for clearing it up.

"writing IBRS=1 acts as a barrier when it is already IBRS=1" would
have been much clearer wording frankly. IBPB is IBP "Barrier", but
also IBRS is a barrier, no problem :).

So we'll add a dummy IBRS write to SPEC_CTRL in kernel entry and
vmexit so that it is compliant with all released microcodes that may
require it, also when ibrs_enabled is 2. Can you confirm?

Can you also tell if IBRS must be written as a barrier to SPEC_CTRL on
return to userland (kernel exit) when ibrs_enabled is 2? Generally we
wouldn't run a barrier there with ibrs_enabled 2, but absolutely
nothing is intuitive here so I need to ask explicitly.


Re: [patch RFC 5/5] x86/speculation: Add basic speculation control code

2018-01-10 Thread Andrea Arcangeli
Hello,

On Wed, Jan 10, 2018 at 02:46:22PM +0100, Thomas Gleixner wrote:
> So here is the simple list of questions all to be answered with YES or
> NO. I don't want to see any of the 'but, though ...'. We all know by now
> that it's CPU dependent and slow and whatever and that IBRS_ATT will be in
> future CPUs. So get your act together and tell a clear YES or NO.

Other comments/code from Tim Chen and Dave Hansen, and most importantly
the ibrs_enabled 2 description and implementation on lkml, make me
still wonder whether even Arjan may have misunderstood some detail
about the IBRS semantics.

Tim and Dave, please comment too. Tim, you originally wrote the code
that leaves IBRS always on and never toggles it in the kernel entry
point, so you must know full well whether Arjan is correct that you
must toggle IBRS every time you enter the kernel and in turn that
ibrs_enabled 2 isn't a valid mode.

I can answer with my understanding of IBRS so far, which would make
perfect sense if the code posted made sense in the first place...

https://marc.info/?l=linux-kernel=151520606320646

> 
> 1) Does IBRS=1 when set once act as a set-and-forget option ?

Yes it can. Once set, it makes spectre variant#2 go away, but only for
the IBP/BTB; stuff_RSB is still needed and register hygiene too.

> 
>  1a) If the answer to #1 is yes, is it more secure than toggling it?

It's equivalent, not more secure nor less secure.

Simply, IBRS only protects the code that runs with IBRS set. So if you
leave it always set, you protect more code.

IBRS set is equivalent to the code having been compiled with retpoline,
on CPUs where retpoline is equivalent to IBRS.

The act of setting IBRS does nothing, it's not a barrier (IBPB is a
barrier and flushes the CPU internal state). After IBRS is set the
code behaves as if it were built with retpoline.

IBRS may reduce the performance measurably, and this is why it has to
be disabled in userland; that is the motivation for making
ibrs_enabled 1 the default.

The toggling at kernel entry is purely required so userland keeps
running fast with IBRS not set by default. But it would be more secure
to leave IBRS always on, which is what ibrs_enabled 2 does; it's just
incredibly slower with current silicon in some microcode
implementations, which is why it's not the default.

> 
>  1b) If the answer to #1 is yes, is retpoline required ?

Retpoline is definitely not required whenever IBRS is set. IBRS set is,
if anything, more secure than retpoline, as it's a sure thing that
variant#2 is fixed and that the microcode was applied if IBRS could be set.

> 
>  1c) If the answer to #1 is yes, is RSB stuffing required ?

The RSB is a totally different piece of the CPU, not covered by IBRS
AFAIK. I'm not sure about this though, but also doing stuff_RSB
wouldn't hurt if the CPU has no microcode update applied.

SMEP set in cr4 obviates the need of stuff_RSB in kernel entry but not
in vmexit.
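
A minimal sketch of that rule (fill_return_buffer() is a hypothetical
name standing in for the real RSB stuffing sequence, it doesn't exist
under that name):

/* kernel entry: SMEP in cr4 makes the RSB refill unnecessary;
 * vmexit: refill regardless of SMEP */
static void maybe_stuff_rsb(bool vmexit)
{
	if (vmexit || !boot_cpu_has(X86_FEATURE_SMEP))
		fill_return_buffer();	/* hypothetical helper */
}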

> 2) Does toggle mode of IBRS require retpoline ?

Toggle mode means nothing in terms of added security. Toggle mode
simply protects only kernel space.

If you use the toggle mode (ibrs_enabled 1, the default) only the
kernel is protected by IBRS, and in turn you don't have to build the
kernel with retpoline.

If you leave IBRS always on it's effectively like having built
everything (kernel and all userland) with retpoline.

> 3) Does toggle mode of IBRS require RSB stuffing ?

Yes; again, IBRS does nothing to avoid the need for RSB stuffing. Only
SMEP helps there, if set in cr4, but only for kernel entry; vmexit
still needs stuff_RSB.

I'm not certain about the stuff_RSB part, but again it won't hurt to do
it and we do it even with IBRS on. It's not measurable though.

> 4) Exist CPUs which require IBRS to be selected automatically ?
> 
>4b) If yes, provide the list as a separate answer please

Not sure if I get the question, so I'm not answering this one.

Future CPUs will have a new CPUID bit with which enabling IBRS will
perform better than toggling it at kernel entry, so on those CPUs the
most secure mode (ibrs_enabled 2), which leaves IBRS always on, will
perform even better than the current default (ibrs_enabled 1). So in
the presence of such a future CPUID bit we should prefer ibrs_enabled 2
and leave IBRS always on by default.

Thanks,
Andrea


Re: [patch RFC 5/5] x86/speculation: Add basic speculation control code

2018-01-10 Thread Andrea Arcangeli
On Wed, Jan 10, 2018 at 01:45:52PM +, Van De Ven, Arjan wrote:
>  
> > Andrea, what you're saying is directly contradicting what I've heard
> > from Intel.
> > 
> > The documentation already distinguishes between IBRS on current
> > hardware, and IBRS_ATT on future hardware. If it was the case that IBRS
> > on current hardware is a set-and-forget option and completely disables
> > branch prediction, then they would say that. Rather than explicitly
> > saying the *opposite*, specifically for the case of current hardware,
> > as they do.
> > 
> > Rather than continuing to debate it, perhaps it's best just to wake for
> > the US to wake up, and Intel to give a definitive answer.
> 
> On current hardware, you cannot just set IBRS always.

ibrs_enabled 1:

sets IBRS at vmexit and at kernel entry.
clears IBRS at kernel exit (return to usermode)
restores whatever IBRS value the guest was using at vmenter

ibrs_enabled 2:

sets IBRS always in host
sets IBRS if it wasn't already set by the guest in vmexit
restores whatever IBRS value the guest was using at vmenter
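
A rough pseudo-C sketch of the two modes just listed (the variable
names are illustrative, only the MSR and the IBRS bit are real):

	/* vmexit: mode 1 always writes IBRS, mode 2 only if the guest
	 * wasn't already running with it set */
	if (ibrs_enabled == 1 || guest_spec_ctrl != SPEC_CTRL_IBRS)
		native_wrmsrl(MSR_IA32_SPEC_CTRL, SPEC_CTRL_IBRS);

	/* return to host userland: mode 1 clears IBRS, mode 2 leaves it set */
	if (ibrs_enabled == 1)
		native_wrmsrl(MSR_IA32_SPEC_CTRL, 0);

	/* vmenter: both modes restore whatever value the guest was using */
	native_wrmsrl(MSR_IA32_SPEC_CTRL, guest_spec_ctrl);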

This matches the semantics described in Tim's patchset here on lkml:

https://marc.info/?l=linux-kernel=151520606320646

If what you say is correct, then you should discuss with Tim what
"echo 2 > /sys/kernel/debug/x86/ibrs_enabled will turn on IBRS in both
userspace and kernel" is doing or is supposed to achieve.

> (In practice, on some you might get lucky if you try. Intel does not 
> guarantee it. Intel does not test it. The model is to write the msr on 
> privilege change, e.g. ring transition)

ibrs_enabled 1 is always the default when SPEC_CTRL is in cpuid.

The question is about the optional ibrs_enabled 2 mode, but what is
implemented matches the semantics of the above patchset linked from
Tim, so again you should talk with Tim and agree on the status of
leaving IBRS always on with current silicon.

I can tell that in practice it works as I described with all the
microcodes I tested.

Thanks,
Andrea


Re: [patch RFC 5/5] x86/speculation: Add basic speculation control code

2018-01-10 Thread Andrea Arcangeli
On Wed, Jan 10, 2018 at 02:10:10PM +0100, Andrea Arcangeli wrote:
> It's still incredibly faster to shutdown part of the CPU temporarily
> than to flush its internal state as a whole with IBPB. If it wouldn't
> be the case ibrs_enabled 0 ibpb_enabled 2 special mode would perform
> better (but that's only enabled by default if SPEC_CTRL is not
> available and only IBPB_SUPPORT is).

Yet another bit of perhaps useful info: if you run a 100% kernel
intensive computation (i.e. only interrupts on the idle task, or a
kernel driver doing everything in kernel) ibrs_enabled 0 ibpb_enabled 2
of course is much faster than ibrs_enabled 1 ibpb_enabled 1. As long as
you don't have frequent ring changes, ibrs_enabled 0 ibpb_enabled 2 is
faster.


Re: [patch RFC 5/5] x86/speculation: Add basic speculation control code

2018-01-10 Thread Andrea Arcangeli
On Wed, Jan 10, 2018 at 02:05:51PM +0100, Andrea Arcangeli wrote:
> Also note, the slowdown of setting IBRS varies with older CPUs being

To give a further detail: to provide the IBRS semantics, older CPUs
have to do something even less fine-grained, something that doesn't
just restrict speculation across the IBRS less-privileged prediction
level, that doesn't just restrict speculation across indirect branches,
that doesn't just restrict indirect branch prediction, but that shuts
down a whole block of the CPU to achieve those super fine-grained
theoretical IBRS semantics that match future silicon.

IBRS is doing a lot more than what would be needed because there was no
other way to achieve it in some microcode implementations. That is why
ibrs_enabled 1 is needed to achieve better performance: you don't want
to run slower while in user mode, it's much better to pay the cost of
setting IBRS and shutting down that part of the CPU.

It's still incredibly faster to shut down part of the CPU temporarily
than to flush its internal state as a whole with IBPB. If that weren't
the case, the ibrs_enabled 0 ibpb_enabled 2 special mode would perform
better (but that's only enabled by default if SPEC_CTRL is not
available and only IBPB_SUPPORT is).

Thanks,
Andrea


Re: [patch RFC 5/5] x86/speculation: Add basic speculation control code

2018-01-10 Thread Andrea Arcangeli
On Wed, Jan 10, 2018 at 02:02:02PM +0100, Andrea Arcangeli wrote:
> On Wed, Jan 10, 2018 at 12:51:13PM +, David Woodhouse wrote:
> > If it worked as Andrea suggests, then there would be absolutely no
> > point in the patches we've seen which add the IBRS-frobbing on syscall
> > entry and vmexit.
> 
> This is perhaps what you're missing I think, there is a huge point:
> performance.

And note, if you run getppid in a loop you're perfectly right that
ibrs_enabled 2 will perform better than ibrs_enabled 1, but we ran
multiple sets of macro benchmarks and ibrs_enabled 1 ibpb_enabled 1
always wins. So ibrs_enabled 2 ibpb_enabled 1 is left as an optional
"paranoid" security mode, more secure but even slower.

We draw the line for the default at stopping the PoC which we know can
be mounted (at least in theory :) and is a practical attack.

Also note, the slowdown of setting IBRS varies, with older CPUs being
more affected, but for all current silicon ibrs_enabled 1 ibpb_enabled
1 is faster in 99% of real life workloads.

Future silicon will set the CPUID bit that will tell us to set
"ibrs_enabled 2" by default, but you can do it right now and it's even
more secure, as it completely solves any HT effect in userland and it
prevents the KVM guest kernel from being attacked by guest userland
through a spectre variant#2 attack from guest userland to qemu host
userland.

Hope this helps,
Andrea


Re: [patch RFC 5/5] x86/speculation: Add basic speculation control code

2018-01-10 Thread Andrea Arcangeli
On Wed, Jan 10, 2018 at 12:51:13PM +, David Woodhouse wrote:
> If it worked as Andrea suggests, then there would be absolutely no
> point in the patches we've seen which add the IBRS-frobbing on syscall
> entry and vmexit.

This is perhaps what you're missing, I think: there is a huge point,
performance.

ibrs_enabled 1 ibpb_enabled 1 is much faster than ibrs_enabled 2
ibpb_enabled 1.

The current IBRS is disabling the IBP, it can't do the "fine-grained"
thing that leaves IBP speculation enabled.

The kernel is not 100% C++ virtual templates, so IBRS enabled is not as
big a deal for the kernel as it is for userland with IBRS enabled.

Once the new CPUID bit is set it'll tell us IBRS is really restricting
speculation in a fine-grained way with all privilege levels ordered, so
it simply tells us that ibrs_enabled 2 ibpb_enabled 1 will be faster
than ibrs_enabled 1 ibpb_enabled 1. So the only thing we have to do
with the new CPUID bit you're talking about from the future is to
change the default to ibrs_enabled 2, and it'll perform even better,
with zero overhead in kernel entry points.

We already optimized for the future silicon, we only have to add a
tweak to detect the cpuid and set the default ibrs_enabled to 2.
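
Something like this would be the whole tweak (a sketch only: the
ARCH_CAPABILITIES MSR and its IBRS_ALL bit 1 come from the published
docs for future parts, ibrs_enabled is the patchset's tunable, the
rest is illustrative):

	if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES)) {
		u64 caps;

		rdmsrl(MSR_IA32_ARCH_CAPABILITIES, caps);
		if (caps & (1ULL << 1))		/* IBRS_ALL / enhanced IBRS */
			ibrs_enabled = 2;	/* leave IBRS always set */
	}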

> The "IBRS all the time" feature is something we get on *future*
> hardware, not current hardware.

No, for current hardware it's the most secure option available, and in
fact required if you want to stop a guest userland attack on host
userland with mathematical certainty, which can matter for some
customers.


Re: [patch RFC 5/5] x86/speculation: Add basic speculation control code

2018-01-10 Thread Andrea Arcangeli
On Wed, Jan 10, 2018 at 01:47:22PM +0100, Jiri Kosina wrote:
> On Wed, 10 Jan 2018, Andrea Arcangeli wrote:
> 
> > Perhaps the confusing come from "less privileged prediction mode" and
> > you thought that meant "less privileged ring mode". It says "predction
> > mode" not ring 3.
> 
> Well, prediction mode is defined by "CPL3 vs CPL0-2" and "VMX root vs VMX 
> non-root", with obvious ordering of privileges.
> 
> So if IBRS is set, branch predictor will not allow the predicted target to 
> be influenced by code that executed in less privileged prediction mode 
> before value of '1' IBRS mode was last written to, and that's pretty much 
> it.

Which, in current silicon, means IBP speculation is always turned off;
the above specification really is there to provide more fine-grained
semantics for future silicon, where it'll perform best to leave it
always on and it'll still be as secure as it is now even though IBP
speculation may not always be turned off like it is right now.

All the prediction modes will be ordered right for the respective
guest/ring, and CPUID will tell us when it's higher performance to
enable ibrs_enabled 2 ibpb_enabled 1 by default.

Again, I see zero issues with leaving IBRS always on in current and
future silicon, and I see absolutely zero problems in setting IBRS at
vmexit to prevent the whole guest mode from attacking the kernel
memory; in fact ibrs_enabled 2 will be even more secure and it'll even
prevent guest mode userland from attacking the host qemu userland
through spectre variant#2.

As long as the "obvious ordering of privileges" is maintained when IBRS
is not a total turn-off of IBP speculation, everything works as
intended.


Re: [patch RFC 5/5] x86/speculation: Add basic speculation control code

2018-01-10 Thread Andrea Arcangeli
On Wed, Jan 10, 2018 at 12:29:44PM +, David Woodhouse wrote:
> On Wed, 2018-01-10 at 13:17 +0100, Andrea Arcangeli wrote:
> > On Wed, Jan 10, 2018 at 12:09:34PM +, David Woodhouse wrote:
> > > That is not consistent with the documentation I've seen, which Intel
> > > have so far utterly failed to publish AFAICT.
> > > 
> > > "a near indirect jump/call/return may be affected by code in a less 
> > > privileged
> > > prediction mode that executed AFTER IBRS mode was last written with a 
> > > value of 1"
> > 
> > You must have misunderstood the context there, or the above text is
> > wrong to begin with.
> 
> That's a quote from the Intel documentation for the IBRS feature.
> Go read it, please.

Perhaps the confusion comes from "less privileged prediction mode": you
thought that meant "less privileged ring mode". It says "prediction
mode", not ring 3.

Frankly I found that documentation very confusing, like the inversion
of IBRS enabled really meaning IBP speculation disabled, as Ingo
pointed out.

It was clearly all done in a rush, but we have to live with it. After
all, the electric current also really flows in the opposite direction
of the arrow...

> Andrea, what part of this whole fucking mess isn't entirely batshit
> insane to start with? :)

Absolutely :).

> I think you are confused with the future IBRS_ATT option which will
> exist on new hardware. 

No, the only difference is that such an option will perform best.

This is why the current default is ibrs_enabled 1 ibpb_enabled 1.

That future CPUID bit will simply make the kernel boot by default with
ibrs_enabled 2 ibpb_enabled 1, and it'll perform best as it won't have
to write to SPEC_CTRL at kernel entry/kernel exit/vmenter/vmexit like
it commonly has to do with ibrs_enabled 1.

The only difference is that ibrs_enabled 2 will perform best, while
currently ibrs_enabled 1 performs best.

> Right now, IBRS works as I described it because that's the best they
> could do with microcode.

It works as I described, but instead of arguing about the specs above,
Intel should clarify that IBRS can already be set 100% of the time,
left alone and always set, on all CPUs with SPEC_CTRL, and that it's
the most secure mode available and matches the IBRS patchset with
ibrs_enabled 2.

Hope this helps clarify; of course this is so complex it's perfectly
normal to misunderstand something, but I'm confident it's not me who
misunderstood, because if I did, the whole ibrs_enabled 2 mode that was
just posted would make zero sense. That mode is for current CPUs and
it's the most secure option available (not less secure as you seem to
think).

Thanks,
Andrea


Re: [patch RFC 5/5] x86/speculation: Add basic speculation control code

2018-01-10 Thread Andrea Arcangeli
On Wed, Jan 10, 2018 at 01:20:45PM +0100, Andrea Arcangeli wrote:
> On Wed, Jan 10, 2018 at 12:12:53PM +, David Woodhouse wrote:
> > IBRS is like a barrier. You must write it between the 'problematic'
> > loading of the branch targets, and the kernel code which might be
> > affected.
> > 
> > You cannot, on current hardware, merely set it once and forget about
> > it. That is not sufficient.
> 
> I think you've got it all wrong...

On a side note, I hope once you get it right, the debugfs tunable will
suddenly look less horrid.


Re: [patch RFC 5/5] x86/speculation: Add basic speculation control code

2018-01-10 Thread Andrea Arcangeli
On Wed, Jan 10, 2018 at 12:12:53PM +, David Woodhouse wrote:
> IBRS is like a barrier. You must write it between the 'problematic'
> loading of the branch targets, and the kernel code which might be
> affected.
> 
> You cannot, on current hardware, merely set it once and forget about
> it. That is not sufficient.

I think you've got it all wrong...


Re: [patch RFC 5/5] x86/speculation: Add basic speculation control code

2018-01-10 Thread Andrea Arcangeli
On Wed, Jan 10, 2018 at 12:09:34PM +, David Woodhouse wrote:
> That is not consistent with the documentation I've seen, which Intel
> have so far utterly failed to publish AFAICT.
> 
> "a near indirect jump/call/return may be affected by code in a less privileged
> prediction mode that executed AFTER IBRS mode was last written with a value 
> of 1"

You must have misunderstood the context there, or the above text is
wrong to begin with.

> The kernel is only protected from branch targets set in userspace
> *BEFORE* the IBRS mode was last set to 1. If you set it to 1, then
> leave it like that while you run userspace and then kernel code again,
> you are not protected.

I'm sure you've got it wrong; it would be crazy if that were the case.

Even from an implementation standpoint, IBRS just means stop
speculating through branch prediction; there's no way they can add this
separation between ring 0 and ring 3 when it didn't exist to begin
with.

The RSB had it with SMEP, and in fact there's no need to do stuff_RSB
with SMEP enabled in cr4, and we alternative it out in the kernel entry
on host and guest, but we still have to do it at vmexit.

The IBP/BTB had nothing like that, which is why user ring 3 can attack
host ring 0.

It will have it, also separating guest from host, but only in the future.

Plus, IBRS can be set at will by guest mode; if the guest kernel is
malicious it will let guest userland run with IBRS on. You can't trust
the guest kernel to begin with. So it would make it totally useless for
KVM to set IBRS at vmexit, if you were right about that.

In any case, we've an ibrs_enabled 0 ibpb_enabled 2 mode that would
achieve full security if you were right, but I think you got it all
wrong about IBRS and rings.

Second: IBRS means Indirect Branch Restricted Speculation; the
speculation through branch prediction is restricted while it's set.
It's inconceivable that they added ring knowledge to a part that
couldn't even be helped by SMEP in the first place, which at least
provided some ring separation when enabled.


Re: [patch RFC 5/5] x86/speculation: Add basic speculation control code

2018-01-10 Thread Andrea Arcangeli
On Wed, Jan 10, 2018 at 01:01:58PM +0100, Andrea Arcangeli wrote:
> On Wed, Jan 10, 2018 at 11:58:54AM +, David Woodhouse wrote:
> > On Wed, 2018-01-10 at 12:54 +0100, Andrea Arcangeli wrote:
> > > On Wed, Jan 10, 2018 at 09:27:59AM +, David Woodhouse wrote:
> > > > I don't know why you're calling that 'IBRS=2'; are you getting
> > > confused
> > > > by Andrea's distro horridness?
> > > 
> > > Eh, yes he's got confused. ibrs_enabled 2 simply means to leave IBRS
> > > set in SPEC_CTLR 100% of the time, except in guest mode.
> > 
> > On all current hardware, if you only set IBRS when you exit a guest,
> > then you are not protecting yourself from userspace at all. IBRS acts
> > as a *barrier* in all current hardware.
> 
> Kernel memory is 100% protected if you set only IBRS at vmexit.
> 
> Once IBRS is set, there's no way any userland (nor host nor guest) can
> attack the kernel memory through spectre variant#2.
> 
> What is not protected is host userland from guest userland which is
> point 3 in the email I posted earlier and I already provided all
> details there on how to fix that purely theoretical issue not part of
> the PoC with the provided debugfs tunables, so I won't repeat here.

Also, I read in another email that you thought IBRS is a barrier or
something: it's not. It purely and temporarily prevents the CPU from
speculating through the IBP/BTB/whatever; it will do the least
possible amount of work, without altering the internal state that can
still be poisoned by the attacker, but that's no problem because
kernel mode won't make any use of that "poison". IBRS won't flush
anything, the poison stays there, it's not a barrier; IBPB is the only
barrier that flushes the content.

IBPB is expensive, but after IBPB you return to running at full core
speed and you can't be influenced by any poison somebody put in the
CPU to influence your runtime. IBRS, as opposed, is not expensive to
execute (relatively speaking), but then the CPU runs slower as it
can't use those parts of the CPU that may have been poisoned by the
attacker.

So you can also flush the whole content of the IBP/BTB at vmexit with
IBPB instead of setting SPEC_CTRL to IBRS. Same thing at kernel entry.
If you do that and you don't care about the SMT/HT effect, it's
equivalent to setting IBRS at vmexit/kernel entry and in fact it'll
also protect host userland from guest userland.
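
To put the difference in code terms, here is a rough sketch of what a
vmexit (or kernel entry) path can do in either mode. It's only an
illustration: the helper and the two knob parameters are made up, only
the MSR numbers (0x48 and 0x49) and the bit 0 semantics are real.

#define MSR_IA32_SPEC_CTRL	0x00000048
#define SPEC_CTRL_IBRS		(1UL << 0)	/* restricts, doesn't flush */
#define MSR_IA32_PRED_CMD	0x00000049
#define PRED_CMD_IBPB		(1UL << 0)	/* flushes, the real barrier */

static inline void vmexit_spectre_v2_protect(int ibrs_enabled, int ibpb_enabled)
{
	if (ibrs_enabled) {
		/* cheap write, the CPU then runs with restricted
		 * indirect branch speculation until it's cleared */
		wrmsrl(MSR_IA32_SPEC_CTRL, SPEC_CTRL_IBRS);
	} else if (ibpb_enabled == 2) {
		/* expensive one-off flush of the poisoned predictions,
		 * the CPU runs at full speed afterwards */
		wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
	}
}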

Hope this clarifies what those IBRS IBPB features achieve.

Thanks,
Andrea


Re: [patch RFC 5/5] x86/speculation: Add basic speculation control code

2018-01-10 Thread Andrea Arcangeli
On Wed, Jan 10, 2018 at 11:58:54AM +, David Woodhouse wrote:
> On Wed, 2018-01-10 at 12:54 +0100, Andrea Arcangeli wrote:
> > On Wed, Jan 10, 2018 at 09:27:59AM +, David Woodhouse wrote:
> > > I don't know why you're calling that 'IBRS=2'; are you getting
> > confused
> > > by Andrea's distro horridness?
> > 
> > Eh, yes he's got confused. ibrs_enabled 2 simply means to leave IBRS
> > set in SPEC_CTLR 100% of the time, except in guest mode.
> 
> On all current hardware, if you only set IBRS when you exit a guest,
> then you are not protecting yourself from userspace at all. IBRS acts
> as a *barrier* in all current hardware.

Kernel memory is 100% protected if you set only IBRS at vmexit.

Once IBRS is set, there's no way any userland (neither host nor guest)
can attack the kernel memory through spectre variant#2.

What is not protected is host userland from guest userland, which is
point 3 in the email I posted earlier; I already provided all the
details there on how to fix that purely theoretical issue (not part of
the PoC) with the provided debugfs tunables, so I won't repeat them
here.


Re: [patch RFC 5/5] x86/speculation: Add basic speculation control code

2018-01-10 Thread Andrea Arcangeli
On Wed, Jan 10, 2018 at 09:27:59AM +, David Woodhouse wrote:
> I don't know why you're calling that 'IBRS=2'; are you getting confused
> by Andrea's distro horridness?

Eh, yes he's got confused. ibrs_enabled 2 simply means to leave IBRS
set in SPEC_CTRL 100% of the time, except in guest mode.

IBRS is the "1" value to write into the SPEC_CTRL MSR; SPEC_CTRL = 2
is not IBRS.

"ibrs_enabled ibpb_enabled" semantics are practically the only thing we
didn't change from the code that we've been presented with (aside from
adding ibrs_enable 0 ibpb_enabled 2 new mode to fix spectre variant#2
on CPUs with no SPEC_CTRL/IBRS and only with IBPB_SUPPORT, that
otherwise would have no kernel protection at all [of course modulo
HT/SMT but that's not the PoC and ibrs_enabled 1 ibpb_enabled 1 also
won't protect guest/user vs guest/user for HT/SMT and CPU
pinning/isolation or ibrs 2 or HT/SMT disablement is needed for that]).

NOTE: we moved ibrs_enabled and ibpb_enabled to debugfs; they were in
a more problematic location before we moved them there.

We could have done something entirely different, but the ibrs_enabled
ibpb_enabled semantics, as they were presented, looked as good as it
could get to achieve all of the below on 0day:

1) ship a full fix for all CPUs (including those CPUs that cannot be
   fixed with retpolines) that protected the host kernel memory 100%
   from spectre variant#2 by default

2) to be able to get as much performance back as possible for all
   those cases where these attacks are irrelevant and performance is
   critical, by setting ibrs_enabled 0 ibpb_enabled 0 (in fact
   pti_enabled 0 is available too, in the same debug/x86 location).

   If you see complaints about ibrs_enabled 1 slowing down ffmpeg or
   something compared to ibrs_enabled 0, well, it's because we shipped
   those tunables in the first place, which precisely allow such an
   easy comparison and make it so easy to set ibrs_enabled 0 and run
   at full CPU power in kernel mode too if spectre variant#2 is not a
   concern.

   Compare that to having to reboot the system and modify grub every
   time before running a test... that would have been truly horrid as
   a solution to get all the performance back where spectre variant#2
   is of no concern.

   If you're rendering a movie or transcoding with ffmpeg and nothing
   else, and you're behind several levels of firewalls or even
   disconnected from the network, it can all be disabled (including
   setting pti_enabled 0 if that's the only app in the guest or host,
   i.e. a single memcached microservice running in a KVM guest).

3) allow to optionally make it impossible for KVM guest userland to
   read KVM guest kernel memory by attacking qemu host userland, by
   simply setting ibrs_enabled 2 ibpb_enabled 1 for a subset of
   customers that may be especially concerned about that theoretical
   issue. Also allow fixing other theoretical HT/SMT attacks affecting
   guest/user mode vs guest/user mode with the same ibrs_enabled 2
   ibpb_enabled 1.

   We don't know if those issues are practical, but ibrs_enabled 2
   will provide a mathematical guarantee for those as well.

   For the KVM guest userland attack on qemu userland not even SMEP
   helps, as kindly confirmed by Dave. For that issue ibrs_enabled 0
   ibpb_enabled 2 on the host (or even only in the guest) is also
   enough, but it would likely perform worse than ibrs_enabled 2
   ibpb_enabled 1; then again it depends on the CPU, which is why
   tunables are so handy, so you can optimize for the best tradeoff.

Not exactly sure how you could have accommodated all 3 points above
in an even more flexible way by removing those three debugfs
tunables.

If you don't want to deal with them because it's very complex to
understand exactly what they do, there's a simple solution too: ignore
them, knowing it boots secure by default if either SPEC_CTRL or
IBPB_SUPPORT is present in the boot "dmesg" logs, and to disable it
all and get the performance back: echo 0 >ibrs_enabled; echo 0
>ibpb_enabled.

Even Dave just said that even with the retpolines optimization patched
in by default at boot, he'd still like to be able to enable the
equivalent of ibrs_enabled 1, let alone on those CPUs where retpolines
can't fix spectre variant#2. Furthermore ibrs_enabled 2 is not
obsoleted at all by kernel retpolines unless you rebuild the whole
userland, including all qemu dependencies and glibc, with retpolines,
which would have been impossible to achieve on 0day no matter what.

Retpolines are an optimization for later, for a subset of CPUs, so
that we can stop setting IBRS in SPEC_CTRL while in kernel mode. On
some CPUs IBRS, or disabling the IBP in general, is so fast I wouldn't
even bother to enable retpolines at boot if they don't even work
safely to begin with.

Thanks,
Andrea


Re: Avoid speculative indirect calls in kernel

2018-01-08 Thread Andrea Arcangeli
On Mon, Jan 08, 2018 at 09:53:02PM +0100, Thomas Gleixner wrote:
> Thanks for resending it.

Thanks to you for the PTI improvements!

I did my best to do the cleanest patch for tip, but I now figured
Dave's original comment was spot on: a _PAGE_NX clear then becomes
necessary also after pud_alloc, not only after p4d_alloc.

pmd_alloc would run into the same issue with x86 32-bit non-PAE too.

So there are two choices: either going back to a single _PAGE_NX clear
as in Dave's original patch below, or adding a clear after each level,
which was my objective and is more robust, but may be overkill in this
case. As long as it was one line it looked like a clear improvement.

Considering the caller in both cases is going to abort, I guess we can
use the one-liner approach as Dave and Jiri did originally.

It's up to you; doing it at each level would be more resilient in case
the caller is changed.
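
For reference, the "clear at each level" variant would look roughly
like this in map_tboot_page() (only a sketch of the idea, reusing the
calls from the patches in this thread, not a tested diff):

	p4d = p4d_alloc(&tboot_mm, pgd, vaddr);
	if (!p4d)
		return -1;
	pgd->pgd &= ~_PAGE_NX;	/* pgd entry populated here with 5-level paging */
	pud = pud_alloc(&tboot_mm, p4d, vaddr);
	if (!pud)
		return -1;
	pgd->pgd &= ~_PAGE_NX;	/* ... or only here with 4-level paging */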

For the same issue in efi_64, the current tip patch will work better,
but it can still be cleaned up by using pgd_efi instead of
pgd_offset_k().

I got partly fooled because it worked great with 4 levels, but it
wasn't ok anyway for 32-bit non-PAE. Sometimes it's the simpler stuff
that gets more subtle.

Andrea

From 391517951e904cdd231dda9943c36a25a7bf01b9 Mon Sep 17 00:00:00 2001
From: Dave Hansen <dave.han...@linux.intel.com>
Date: Sat, 6 Jan 2018 18:41:14 +0100
Subject: [PATCH 1/1] x86/kaiser/efi: unbreak tboot

This is another case similar to what EFI does: create a new set of
page tables, map some code at a low address, and jump to it.  PTI
mistakes this low address for userspace and mistakenly marks it
non-executable in an effort to make it unusable for userspace.  Undo
the poison to allow execution.

Signed-off-by: Dave Hansen <dave.han...@linux.intel.com>
Cc: Ning Sun <ning@intel.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: tboot-de...@lists.sourceforge.net
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Andrea Arcangeli <aarca...@redhat.com>
---
 arch/x86/kernel/tboot.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index a4eb27918ceb..a2486f444073 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -138,6 +138,17 @@ static int map_tboot_page(unsigned long vaddr, unsigned long pfn,
 		return -1;
 	set_pte_at(&tboot_mm, vaddr, pte, pfn_pte(pfn, prot));
 	pte_unmap(pte);
+
+	/*
+	 * PTI poisons low addresses in the kernel page tables in the
+	 * name of making them unusable for userspace.  To execute
+	 * code at such a low address, the poison must be cleared.
+	 *
+	 * Note: 'pgd' actually gets set in p4d_alloc() _or_
+	 * pud_alloc() depending on 4/5-level paging.
+	 */
+	pgd->pgd &= ~_PAGE_NX;
+
 	return 0;
 }
 


Re: [PATCH v2 4/8] x86/spec_ctrl: Add sysctl knobs to enable/disable SPEC_CTRL feature

2018-01-08 Thread Andrea Arcangeli
On Mon, Jan 08, 2018 at 08:43:40PM +0300, Alexey Dobriyan wrote:
> > +   len = sprintf(buf, "%d\n", READ_ONCE(*field));
> 
> READ_ONCE isn't necessary as there is only one read being made.

Others might disagree but you shouldn't ever let gcc touch any memory
that can change under gcc without READ_ONCE or volatile.

Ages ago I was told that in a switch() statement gcc could decide to
use a hash, re-read the value and crash.

Not for the simple case above, and we often don't use READ_ONCE and
might use barrier() instead, but it would be better to use READ_ONCE.

READ_ONCE is more correct and there's no point in trying to optimize
it, especially if there's only one read being made in that function.

Either version in practice will work, but READ_ONCE is preferable IMHO.
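
Concretely, for a show routine like the one quoted above the pattern
would simply stay as below (just a sketch; the function and field
names here are made up, the point is only the READ_ONCE):

static ssize_t show_knob(char *buf, unsigned int *field)
{
	/* force gcc to load *field exactly once, no re-reads */
	return sprintf(buf, "%d\n", READ_ONCE(*field));
}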

Thanks,
Andrea


Re: Avoid speculative indirect calls in kernel

2018-01-08 Thread Andrea Arcangeli
On Fri, Jan 05, 2018 at 10:59:28AM +0100, Thomas Gleixner wrote:
> I've seen the insanities which were crammed into the distro kernels, which
> have sysctls and whatever, but at the same time these kernels shipped in a

Debugfs tunables only, there are no sysctls; quoting Greg:

http://lkml.kernel.org/r/20180107082026.ga11...@kroah.com

"It's a debugfs api, it can be changed at any time, to be anything we
want, and all is fine :)"

> haste do not even boot on a specific class of machines. [..]

If you refer to the two efi_64.c and tboot.c corner-case boot failures
found over the last weekend, those affected upstream 4.15-rc, 4.14.12
and all PTI branches in linux-tip too (perhaps less reproducible there
because of differences in old_memmap handling).

I sent you a better version of the efi_64.c fix from Jiri privately,
and you're still missing the tboot fix in linux-tip, so you still have
a boot failure to fix there.

This is incremental with
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=WIP.x86/pti
where the "Unbreak EFI old_memmap" fix is applied.

I respun it after doing the more correct fix in this case too (same
as the efi_64.c improvement), while leaving the attribution of the fix
to Dave as he did the hard part.

From 0c480d1eeabd56379144a4ed6b6fb24f3b84e40e Mon Sep 17 00:00:00 2001
From: Dave Hansen <dave.han...@linux.intel.com>
Date: Sat, 6 Jan 2018 18:41:14 +0100
Subject: [PATCH 1/1] x86/kaiser/efi: unbreak tboot

This is another case similar to what EFI does: create a new set of
page tables, map some code at a low address, and jump to it.  PTI
mistakes this low address for userspace and mistakenly marks it
non-executable in an effort to make it unusable for userspace.  Undo
the poison to allow execution.

Signed-off-by: Dave Hansen <dave.han...@linux.intel.com>
Cc: Ning Sun <ning@intel.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: tboot-de...@lists.sourceforge.net
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Andrea Arcangeli <aarca...@redhat.com>
---
 arch/x86/kernel/tboot.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index a4eb27918ceb..75869a4b6c41 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -127,6 +127,7 @@ static int map_tboot_page(unsigned long vaddr, unsigned long pfn,
 	p4d = p4d_alloc(&tboot_mm, pgd, vaddr);
 	if (!p4d)
 		return -1;
+	pgd->pgd &= ~_PAGE_NX;
 	pud = pud_alloc(&tboot_mm, p4d, vaddr);
 	if (!pud)
 		return -1;

If I can help or assist in any other way, let me know.

Thanks,
Andrea

