Re: [PATCH v4 00/10] PCID and improved laziness

2017-09-13 Thread Ingo Molnar

* Ingo Molnar  wrote:

> 
> * Andy Lutomirski  wrote:
> 
> > I'm on my way to LPC, so I can't easily work on this right this instant.
> > 
> > Can you try this branch, though?
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes&id=cb88ae619b4c3d832d224f2c641849dc02aed864
> 
> Any objections against me applying these fixes directly and getting them to
> Linus later today, to not widen the window of breakage any further?
> 
> I'll also apply:
> 
>    x86/mm/64: Initialize CR4.PCIDE early

So now that I've looked at the branch, I think I'll apply these three:

  cb88ae619b4c: x86/mm/64: Initialize CR4.PCIDE early
  6ec68017784f: x86/hibernate/64: Mask off CR3's PCID bits in the saved CR3
  5d1298206d7b: x86/mm: Get rid of VM_BUG_ON in switch_tlb_irqs_off()

The first one is the one that likely fixes the crash that Sai Praneeth Prakhya reported.

Thanks,

Ingo
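
[Archive note] The first two fixes listed above can be illustrated with a little bit arithmetic. This is only a minimal sketch, not the kernel code: the macro and function names are made up for illustration. It relies on two architectural facts (CR4.PCIDE is bit 17 of CR4, and with PCIDs enabled the low 12 bits of CR3 carry the PCID) and takes the CR4 value printed in Sai's report (000406b0) at face value. That value has bit 17 clear, which is at least consistent with a fix titled "Initialize CR4.PCIDE early".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names; the bit positions come from the x86 architecture. */
#define CR4_PCIDE_BIT  (UINT64_C(1) << 17)   /* CR4.PCIDE */
#define CR3_PCID_MASK  UINT64_C(0xfff)       /* CR3 bits 11:0 hold the PCID */

/* True if the given CR4 value has PCIDs enabled. */
static bool cr4_has_pcide(uint64_t cr4)
{
	return (cr4 & CR4_PCIDE_BIT) != 0;
}

/* Drop the PCID bits from a CR3 value, leaving the page-table address. */
static uint64_t cr3_strip_pcid(uint64_t cr3)
{
	return cr3 & ~CR3_PCID_MASK;
}

/* Self-check using the CR4 value as printed in the report. */
static void self_check(void)
{
	assert(!cr4_has_pcide(UINT64_C(0x406b0)));
	assert(cr4_has_pcide(UINT64_C(0x406b0) | CR4_PCIDE_BIT));
	/* The PCID 5 here is an arbitrary example value. */
	assert(cr3_strip_pcid(UINT64_C(0x8383a0a005)) == UINT64_C(0x8383a0a000));
}
```

Masking as in cr3_strip_pcid() is the kind of operation the hibernate fix title describes; whether the real patch is written this way is not shown in this thread.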


Re: [PATCH v4 00/10] PCID and improved laziness

2017-09-13 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> I'm on my way to LPC, so I can't easily work on this right this instant.
> 
> Can you try this branch, though?
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes&id=cb88ae619b4c3d832d224f2c641849dc02aed864

Any objections against me applying these fixes directly and getting them to
Linus later today, to not widen the window of breakage any further?

I'll also apply:

   x86/mm/64: Initialize CR4.PCIDE early

Thanks,

Ingo


Re: [PATCH v4 00/10] PCID and improved laziness

2017-09-12 Thread Sai Praneeth Prakhya
> 
> 
> Hi Andy,
> 
> I have booted Linus's tree (8fac2f96ab86b0e14ec4e42851e21e9b518bdc55) on
> Skylake server and noticed that it reboots automatically.
> 
> When I booted the same kernel with command line arg "nopcid" it works
> fine. Please find below a snippet of dmesg. Please let me know if you
> need more info to debug.
> 
> [0.00] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.13.0+
> root=UUID=3b8e9636-6e23-4785-a4e2-5954bfe86fd9 ro console=tty0
> console=ttyS0,115200n8
> [0.00] log_buf_len individual max cpu contribution: 4096 bytes
> [0.00] log_buf_len total cpu_extra contributions: 258048 bytes
> [0.00] log_buf_len min size: 262144 bytes
> [0.00] log_buf_len: 524288 bytes
> [0.00] early log buf free: 212560(81%)
> [0.00] PID hash table entries: 4096 (order: 3, 32768 bytes)
> [0.00] [ cut here ]
> [0.00] WARNING: CPU: 0 PID: 0 at arch/x86/mm/tlb.c:245
> initialize_tlbstate_and_flush+0x6c/0xf0
> [0.00] Modules linked in:
> [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 4.13.0+ #5
> [0.00] task: 8960f480 task.stack: 8960
> [0.00] RIP: 0010:initialize_tlbstate_and_flush+0x6c/0xf0
> [0.00] RSP: :89603e60 EFLAGS: 00010046
> [0.00] RAX: 000406b0 RBX: 9f1700a17880 RCX:
> 8965de60
> [0.00] RDX: 008383a0a000 RSI: 0960a000 RDI:
> 008383a0a000
> [0.00] RBP: 89603e60 R08:  R09:
> 
> [0.00] R10: 89603ee8 R11:  R12:
> 
> [0.00] R13: 9f1700a0c3e0 R14: 8960f480 R15:
> 
> [0.00] FS:  () GS:9f1700a0()
> knlGS:
> [0.00] CS:  0010 DS:  ES:  CR0: 80050033
> [0.00] CR2: 9fa7b000 CR3: 008383a0a000 CR4:
> 000406b0
> [0.00] Call Trace:
> [0.00]  cpu_init+0x206/0x4f0
> [0.00]  ? __set_pte_vaddr+0x1d/0x30
> [0.00]  trap_init+0x3e/0x50
> [0.00]  ? trap_init+0x3e/0x50
> [0.00]  start_kernel+0x1e2/0x3f2
> [0.00]  x86_64_start_reservations+0x24/0x26
> [0.00]  x86_64_start_kernel+0x6f/0x72
> [0.00]  secondary_startup_64+0xa5/0xa5
> [0.00] Code: de 00 48 01 f0 48 39 c7 0f 85 92 00 00 00 48 8b 05
> ee e2 ee 00 a9 00 00 02 00 74 11 65 48 8b 05 8b 9d 7c 77 a9 00 00 02 00
> 75 02 <0f> ff 48 81 e2 00 f0 ff ff 0f 22 da 65 66 c7 05 66 9d 7c 77 00 
> [0.00] ---[ end trace c258f2d278fe031f ]---
> [0.00] Memory: 791050356K/803934656K available (9585K kernel
> code, 1313K rwdata, 3000K rodata, 1176K init, 680K bss, 12884300K
> reserved, 0K cma-reserved)
> [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=64,
> Nodes=4
> [0.00] Hierarchical RCU implementation.
> [0.00]RCU event tracing is enabled.
> [0.00] NR_IRQS: 4352, nr_irqs: 3928, preallocated irqs: 16
> [0.00] Console: colour dummy device 80x25
> [0.00] console [tty0] enabled
> [0.00] console [ttyS0] enabled
> [0.00] clocksource: hpet: mask: 0x max_cycles:
> 0x, max_idle_ns: 79635855245 ns
> [0.001000] tsc: Detected 2000.000 MHz processor
> [0.002000] Calibrating delay loop (skipped), value calculated using
> timer frequency.. 4000.00 BogoMIPS (lpj=200)
> [0.003003] pid_max: default: 65536 minimum: 512
> [0.004030] ACPI: Core revision 20170728
> [0.091853] ACPI: 6 ACPI AML tables successfully acquired and loaded
> [0.094143] Security Framework initialized
> [0.095004] SELinux:  Initializing.
> [0.145612] Dentry cache hash table entries: 33554432 (order: 16,
> 268435456 bytes)
> [0.170544] Inode-cache hash table entries: 16777216 (order: 15,
> 134217728 bytes)
> [0.172699] Mount-cache hash table entries: 524288 (order: 10,
> 4194304 bytes)
> [0.174441] Mountpoint-cache hash table entries: 524288 (order: 10,
> 4194304 bytes)
> [0.176351] CPU: Physical Processor ID: 0
> [0.177003] CPU: Processor Core ID: 0
> [0.178007] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
> [0.179003] ENERGY_PERF_BIAS: View and update with
> x86_energy_perf_policy(8)
> [0.180013] mce: CPU supports 20 MCE banks
> [0.181018] CPU0: Thermal monitoring enabled (TM1)
> [0.182057] process: using mwait in idle threads
> [0.183005] Last level iTLB entries: 4KB 64, 2MB 8, 4MB 8
> [0.184003] Last level dTLB entries: 4KB 64, 2MB 0, 4MB 0, 1GB 4
> [0.185223] Freeing SMP alternatives memory: 36K
> [0.193912] smpboot: Max logical packages: 8
> [0.194017] Switched APIC routing to physical flat.
> [0.196496] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [0.206252] smpboot: CPU0: Intel(R) Xeon(R) Platinum 8164 CPU @
> 2.00GHz (family: 0x6, model: 0x55, stepping: 0x4)
> [   

Re: [PATCH v4 00/10] PCID and improved laziness

2017-09-12 Thread Sai Praneeth Prakhya
> From: Andy Lutomirski 
> Date: Thu, Jun 29, 2017 at 8:53 AM
> Subject: [PATCH v4 00/10] PCID and improved laziness
> To: x...@kernel.org
> Cc: linux-kernel@vger.kernel.org, Borislav Petkov ,
> Linus Torvalds , Andrew Morton
> , Mel Gorman ,
> "linux...@kvack.org" , Nadav Amit
> , Rik van Riel , Dave Hansen
> , Arjan van de Ven ,
> Peter Zijlstra , Andy Lutomirski
> 
> 
> 
> *** Ingo, even if this misses 4.13, please apply the first patch before
> *** the merge window.
> 
> There are three performance benefits here:
> 
> 1. TLB flushing is slow.  (I.e. the flush itself takes a while.)
>    This avoids many of them when switching tasks by using PCID.  In
>    a stupid little benchmark I did, it saves about 100ns on my laptop
>    per context switch.  I'll try to improve that benchmark.
> 
> 2. Mms that have been used recently on a given CPU might get to keep
>    their TLB entries alive across process switches with this patch
>    set.  TLB fills are pretty fast on modern CPUs, but they're even
>    faster when they don't happen.
> 
> 3. Lazy TLB is way better.  We used to do two stupid things when we
>    ran kernel threads: we'd send IPIs to flush user contexts on their
>    CPUs and then we'd write to CR3 for no particular reason as an excuse
>    to stop further IPIs.  With this patch, we do neither.
> 
> This will, in general, perform suboptimally if paravirt TLB flushing
> is in use (currently just Xen, I think, but Hyper-V is in the works).
> The code is structured so we could fix it in one of two ways: we
> could take a spinlock when touching the percpu state so we can update
> it remotely after a paravirt flush, or we could be more careful about
> exactly how we access the state and use cmpxchg16b to do atomic
> remote updates.  (On SMP systems without cmpxchg16b, we'd just skip
> the optimization entirely.)
> 
> This is still missing a final comment-only patch to add overall
> documentation for the whole thing, but I didn't want to block sending
> the maybe-hopefully-final code on that.
> 
> This is based on tip:x86/mm.  The branch is here if you want to play:
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/pcid
> 
> In general, performance seems to exceed my expectations.  Here are
> some performance numbers copy-and-pasted from the changelogs for
> "Rework lazy TLB mode and TLB freshness" and "Try to preserve old
> TLB entries using PCID":
> 
> 

Hi Andy,

I have booted Linus's tree (8fac2f96ab86b0e14ec4e42851e21e9b518bdc55) on
Skylake server and noticed that it reboots automatically.

When I booted the same kernel with command line arg "nopcid" it works
fine. Please find below a snippet of dmesg. Please let me know if you
need more info to debug.

[0.00] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.13.0+
root=UUID=3b8e9636-6e23-4785-a4e2-5954bfe86fd9 ro console=tty0
console=ttyS0,115200n8
[0.00] log_buf_len individual max cpu contribution: 4096 bytes
[0.00] log_buf_len total cpu_extra contributions: 258048 bytes
[0.00] log_buf_len min size: 262144 bytes
[0.00] log_buf_len: 524288 bytes
[0.00] early log buf free: 212560(81%)
[0.00] PID hash table entries: 4096 (order: 3, 32768 bytes)
[0.00] [ cut here ]
[0.00] WARNING: CPU: 0 PID: 0 at arch/x86/mm/tlb.c:245
initialize_tlbstate_and_flush+0x6c/0xf0
[0.00] Modules linked in:
[0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 4.13.0+ #5
[0.00] task: 8960f480 task.stack: 8960
[0.00] RIP: 0010:initialize_tlbstate_and_flush+0x6c/0xf0
[0.00] RSP: :89603e60 EFLAGS: 00010046
[0.00] RAX: 000406b0 RBX: 9f1700a17880 RCX:
8965de60
[0.00] RDX: 008383a0a000 RSI: 0960a000 RDI:
008383a0a000
[0.00] RBP: 89603e60 R08:  R09:

[0.00] R10: 89603ee8 R11:  R12:

[0.00] R13: 9f1700a0c3e0 R14: 8960f480 R15:

[0.00] FS:  () GS:9f1700a0()
knlGS:
[0.00] CS:  0010 DS:  ES:  CR0: 80050033
[0.00] CR2: 9fa7b000 CR3: 008383a0a000 CR4:
000406b0
[0.00] Call Trace:
[0.00]  cpu_init+0x206/0x4f0
[0.00]  ? __set_pte_vaddr+0x1d/0x30
[0.00]  trap_init+0x3e/0x50
[0.00]  ? trap_init+0x3e/0x50
[0.00]  start_kernel+0x1e2/0x3f2
[0.00]  x86_64_start_reservations+0x24/0x26
[0.00]  x86_64_start_kernel+0x6f/0x72
[0.00]  secondary_startup_64+0xa5/0xa5
[0.00] Code: de 00 48 01 f0

Re: [PATCH v4 00/10] PCID and improved laziness

2017-07-17 Thread Mel Gorman
On Mon, Jul 17, 2017 at 05:06:25PM +0200, Ingo Molnar wrote:
> > > I'll push it all out when it passes testing.
> > > 
> > > If it's all super stable I plan to tempt Linus with a late merge window
> > > pull request for all these preparatory patches. (Unless he objects that
> > > is. Hint, hint.)
> > > 
> > > Any objections?
> > > 
> > 
> > What was the final verdict here? I have a patch ready that should be layered
> > on top which will need a backport. PCID support does not appear to have
> > made it in this merge window so I'm wondering if I should send the patch
> > as-is for placement on top of Andy's work or go with the backport and
> > apply a follow-on patch after Andy's work gets merged.
> 
> It's en route for v4.14 - it narrowly missed v4.13.
> 

Grand. I sent out a version that doesn't depend on Andy's work to Andrew
as it's purely a mm patch. If that passes inspection then I'll send a
follow-on patch to apply on top of the PCID work.

Thanks Ingo.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH v4 00/10] PCID and improved laziness

2017-07-17 Thread Ingo Molnar

* Mel Gorman  wrote:

> On Wed, Jul 05, 2017 at 10:56:57AM +0200, Ingo Molnar wrote:
> > 
> > * Andy Lutomirski  wrote:
> > 
> > > *** Ingo, even if this misses 4.13, please apply the first patch before
> > > *** the merge window.
> > 
> > > Andy Lutomirski (10):
> > >   x86/mm: Don't reenter flush_tlb_func_common()
> > >   x86/mm: Delete a big outdated comment about TLB flushing
> > >   x86/mm: Give each mm TLB flush generation a unique ID
> > >   x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
> > >   x86/mm: Rework lazy TLB mode and TLB freshness tracking
> > >   x86/mm: Stop calling leave_mm() in idle code
> > >   x86/mm: Disable PCID on 32-bit kernels
> > >   x86/mm: Add nopcid to turn off PCID
> > >   x86/mm: Enable CR4.PCIDE on supported systems
> > >   x86/mm: Try to preserve old TLB entries using PCID
> > 
> > So this series is really nice, and the first two patches are already
> > upstream, and I've just applied all but the final patch to tip:x86/mm
> > (out of caution - I'm a wimp).
> > 
> > That should already offer some improvements and enables the CR4 bit - but
> > doesn't actually use the PCID hardware yet.
> > 
> > I'll push it all out when it passes testing.
> > 
> > If it's all super stable I plan to tempt Linus with a late merge window
> > pull request for all these preparatory patches. (Unless he objects that is.
> > Hint, hint.)
> > 
> > Any objections?
> > 
> 
> What was the final verdict here? I have a patch ready that should be layered
> on top which will need a backport. PCID support does not appear to have
> made it in this merge window so I'm wondering if I should send the patch
> as-is for placement on top of Andy's work or go with the backport and
> apply a follow-on patch after Andy's work gets merged.

It's en route for v4.14 - it narrowly missed v4.13.

Thanks,

Ingo


Re: [PATCH v4 00/10] PCID and improved laziness

2017-07-17 Thread Mel Gorman
On Wed, Jul 05, 2017 at 10:56:57AM +0200, Ingo Molnar wrote:
> 
> * Andy Lutomirski  wrote:
> 
> > *** Ingo, even if this misses 4.13, please apply the first patch before
> > *** the merge window.
> 
> > Andy Lutomirski (10):
> >   x86/mm: Don't reenter flush_tlb_func_common()
> >   x86/mm: Delete a big outdated comment about TLB flushing
> >   x86/mm: Give each mm TLB flush generation a unique ID
> >   x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
> >   x86/mm: Rework lazy TLB mode and TLB freshness tracking
> >   x86/mm: Stop calling leave_mm() in idle code
> >   x86/mm: Disable PCID on 32-bit kernels
> >   x86/mm: Add nopcid to turn off PCID
> >   x86/mm: Enable CR4.PCIDE on supported systems
> >   x86/mm: Try to preserve old TLB entries using PCID
> 
> So this series is really nice, and the first two patches are already
> upstream, and I've just applied all but the final patch to tip:x86/mm
> (out of caution - I'm a wimp).
> 
> That should already offer some improvements and enables the CR4 bit - but
> doesn't actually use the PCID hardware yet.
> 
> I'll push it all out when it passes testing.
> 
> If it's all super stable I plan to tempt Linus with a late merge window pull
> request for all these preparatory patches. (Unless he objects that is.
> Hint, hint.)
> 
> Any objections?
> 

What was the final verdict here? I have a patch ready that should be layered
on top which will need a backport. PCID support does not appear to have
made it in this merge window so I'm wondering if I should send the patch
as-is for placement on top of Andy's work or go with the backport and
apply a follow-on patch after Andy's work gets merged.

Thanks.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH v4 00/10] PCID and improved laziness

2017-07-13 Thread Matt Fleming
On Tue, 11 Jul, at 08:00:47AM, Andy Lutomirski wrote:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/misc-tests.git/
> 
> I did:
> 
> $ ./context_switch_latency_64 0 process same

Ah, that's better. I see about a 3.3% speedup with your patches when
running the context-switch benchmark.


Re: [PATCH v4 00/10] PCID and improved laziness

2017-07-11 Thread Andy Lutomirski
On Tue, Jul 11, 2017 at 4:32 AM, Matt Fleming  wrote:
> On Fri, 30 Jun, at 01:44:22PM, Matt Fleming wrote:
>> On Thu, 29 Jun, at 08:53:12AM, Andy Lutomirski wrote:
>> > *** Ingo, even if this misses 4.13, please apply the first patch before
>> > *** the merge window.
>> >
>> > There are three performance benefits here:
>> >
>> > 1. TLB flushing is slow.  (I.e. the flush itself takes a while.)
>> >    This avoids many of them when switching tasks by using PCID.  In
>> >    a stupid little benchmark I did, it saves about 100ns on my laptop
>> >    per context switch.  I'll try to improve that benchmark.
>> >
>> > 2. Mms that have been used recently on a given CPU might get to keep
>> >    their TLB entries alive across process switches with this patch
>> >    set.  TLB fills are pretty fast on modern CPUs, but they're even
>> >    faster when they don't happen.
>> >
>> > 3. Lazy TLB is way better.  We used to do two stupid things when we
>> >    ran kernel threads: we'd send IPIs to flush user contexts on their
>> >    CPUs and then we'd write to CR3 for no particular reason as an excuse
>> >    to stop further IPIs.  With this patch, we do neither.
>>
>> Heads up, I'm gonna queue this for a run on SUSE's performance test
>> grid.
>
> FWIW, I didn't see any change in performance with this series on a
> PCID-capable machine. On the plus side, I didn't see any weird-looking
> bugs either.
>
> Are your benchmarks available anywhere?

https://git.kernel.org/pub/scm/linux/kernel/git/luto/misc-tests.git/

I did:

$ ./context_switch_latency_64 0 process same

and

$ ./madvise_bounce_64 10k [IIRC -- it might have been a different loop count]

--Andy


Re: [PATCH v4 00/10] PCID and improved laziness

2017-07-11 Thread Matt Fleming
On Fri, 30 Jun, at 01:44:22PM, Matt Fleming wrote:
> On Thu, 29 Jun, at 08:53:12AM, Andy Lutomirski wrote:
> > *** Ingo, even if this misses 4.13, please apply the first patch before
> > *** the merge window.
> > 
> > There are three performance benefits here:
> > 
> > 1. TLB flushing is slow.  (I.e. the flush itself takes a while.)
> >    This avoids many of them when switching tasks by using PCID.  In
> >    a stupid little benchmark I did, it saves about 100ns on my laptop
> >    per context switch.  I'll try to improve that benchmark.
> > 
> > 2. Mms that have been used recently on a given CPU might get to keep
> >    their TLB entries alive across process switches with this patch
> >    set.  TLB fills are pretty fast on modern CPUs, but they're even
> >    faster when they don't happen.
> > 
> > 3. Lazy TLB is way better.  We used to do two stupid things when we
> >    ran kernel threads: we'd send IPIs to flush user contexts on their
> >    CPUs and then we'd write to CR3 for no particular reason as an excuse
> >    to stop further IPIs.  With this patch, we do neither.
> 
> Heads up, I'm gonna queue this for a run on SUSE's performance test
> grid.

FWIW, I didn't see any change in performance with this series on a
PCID-capable machine. On the plus side, I didn't see any weird-looking
bugs either.

Are your benchmarks available anywhere?


Re: [PATCH v4 00/10] PCID and improved laziness

2017-07-05 Thread Linus Torvalds
On Wed, Jul 5, 2017 at 1:56 AM, Ingo Molnar  wrote:
>
> If it's all super stable I plan to tempt Linus with a late merge window pull
> request for all these preparatory patches. (Unless he objects that is.
> Hint, hint.)

I don't think I'll object. At some point the best testing is "lots of users".

TLB issues are a bitch to debug, but at the same time this is clearly
a "..but at some point we need to bite the bullet" case. I doubt the
series is going to get a lot better.

But yes, please do give it as much testing as humanly possible even
without the wider coverage by random people.

   Linus


Re: [PATCH v4 00/10] PCID and improved laziness

2017-07-05 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> *** Ingo, even if this misses 4.13, please apply the first patch before
> *** the merge window.

> Andy Lutomirski (10):
>   x86/mm: Don't reenter flush_tlb_func_common()
>   x86/mm: Delete a big outdated comment about TLB flushing
>   x86/mm: Give each mm TLB flush generation a unique ID
>   x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
>   x86/mm: Rework lazy TLB mode and TLB freshness tracking
>   x86/mm: Stop calling leave_mm() in idle code
>   x86/mm: Disable PCID on 32-bit kernels
>   x86/mm: Add nopcid to turn off PCID
>   x86/mm: Enable CR4.PCIDE on supported systems
>   x86/mm: Try to preserve old TLB entries using PCID

So this series is really nice, and the first two patches are already upstream,
and I've just applied all but the final patch to tip:x86/mm (out of caution -
I'm a wimp).

That should already offer some improvements and enables the CR4 bit - but
doesn't actually use the PCID hardware yet.

I'll push it all out when it passes testing.

If it's all super stable I plan to tempt Linus with a late merge window pull
request for all these preparatory patches. (Unless he objects that is. Hint,
hint.)

Any objections?

Thanks,

Ingo


Re: [PATCH v4 00/10] PCID and improved laziness

2017-06-30 Thread Matt Fleming
On Thu, 29 Jun, at 08:53:12AM, Andy Lutomirski wrote:
> *** Ingo, even if this misses 4.13, please apply the first patch before
> *** the merge window.
> 
> There are three performance benefits here:
> 
> 1. TLB flushing is slow.  (I.e. the flush itself takes a while.)
>    This avoids many of them when switching tasks by using PCID.  In
>    a stupid little benchmark I did, it saves about 100ns on my laptop
>    per context switch.  I'll try to improve that benchmark.
> 
> 2. Mms that have been used recently on a given CPU might get to keep
>    their TLB entries alive across process switches with this patch
>    set.  TLB fills are pretty fast on modern CPUs, but they're even
>    faster when they don't happen.
> 
> 3. Lazy TLB is way better.  We used to do two stupid things when we
>    ran kernel threads: we'd send IPIs to flush user contexts on their
>    CPUs and then we'd write to CR3 for no particular reason as an excuse
>    to stop further IPIs.  With this patch, we do neither.

Heads up, I'm gonna queue this for a run on SUSE's performance test
grid.


[PATCH v4 00/10] PCID and improved laziness

2017-06-29 Thread Andy Lutomirski
*** Ingo, even if this misses 4.13, please apply the first patch before
*** the merge window.

There are three performance benefits here:

1. TLB flushing is slow.  (I.e. the flush itself takes a while.)
   This avoids many of them when switching tasks by using PCID.  In
   a stupid little benchmark I did, it saves about 100ns on my laptop
   per context switch.  I'll try to improve that benchmark.

2. Mms that have been used recently on a given CPU might get to keep
   their TLB entries alive across process switches with this patch
   set.  TLB fills are pretty fast on modern CPUs, but they're even
   faster when they don't happen.

3. Lazy TLB is way better.  We used to do two stupid things when we
   ran kernel threads: we'd send IPIs to flush user contexts on their
   CPUs and then we'd write to CR3 for no particular reason as an excuse
   to stop further IPIs.  With this patch, we do neither.
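
[Archive note] The first two benefits can be pictured with a small model. The following is a rough sketch only, not the code in this series: the per-CPU slot table, its size (NR_SLOTS), the round-robin replacement and all names are assumptions made for illustration. What is architectural is the CR3 layout it uses: with CR4.PCIDE set, CR3 bits 11:0 select the PCID, and setting bit 63 of the value written to CR3 asks the CPU not to invalidate that PCID's cached translations.

```c
#include <assert.h>
#include <stdint.h>

#define CR3_PCID_MASK  UINT64_C(0xfff)       /* CR3 bits 11:0 hold the PCID */
#define CR3_NOFLUSH    (UINT64_C(1) << 63)   /* keep this PCID's TLB entries */
#define NR_SLOTS       4                     /* hypothetical per-CPU slot count */

/* One slot per PCID: which mm used it last, and how fresh its TLB is. */
struct tlb_slot {
	uint64_t ctx_id;    /* unique mm ID; 0 marks an empty slot here */
	uint64_t tlb_gen;   /* mm flush generation the TLB was in sync with */
};

struct cpu_tlb_state {
	struct tlb_slot slots[NR_SLOTS];
	unsigned int next;  /* next slot to recycle, round-robin */
};

/* Pick a slot for an mm and report whether its TLB entries must be flushed. */
static unsigned int choose_slot(struct cpu_tlb_state *st, uint64_t ctx_id,
				uint64_t mm_tlb_gen, int *need_flush)
{
	unsigned int i;

	assert(ctx_id != 0);
	for (i = 0; i < NR_SLOTS; i++) {
		if (st->slots[i].ctx_id != ctx_id)
			continue;
		/* Seen recently: flush only if the mm changed since then. */
		*need_flush = st->slots[i].tlb_gen < mm_tlb_gen;
		st->slots[i].tlb_gen = mm_tlb_gen;
		return i;
	}
	/* Not cached: recycle a slot and flush whatever it held before. */
	i = st->next;
	st->next = (st->next + 1) % NR_SLOTS;
	st->slots[i].ctx_id = ctx_id;
	st->slots[i].tlb_gen = mm_tlb_gen;
	*need_flush = 1;
	return i;
}

/* Compose the value to load into CR3 for a page table and PCID. */
static uint64_t make_cr3(uint64_t pgd_phys, unsigned int pcid, int need_flush)
{
	assert((pgd_phys & CR3_PCID_MASK) == 0 && pcid <= CR3_PCID_MASK);
	return pgd_phys | pcid | (need_flush ? 0 : CR3_NOFLUSH);
}
```

In this model, switching back to an mm whose slot is still present and up to date yields need_flush == 0, so the CR3 write carries the no-flush bit and the old TLB entries survive the switch.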

This will, in general, perform suboptimally if paravirt TLB flushing
is in use (currently just Xen, I think, but Hyper-V is in the works).
The code is structured so we could fix it in one of two ways: we
could take a spinlock when touching the percpu state so we can update
it remotely after a paravirt flush, or we could be more careful about
exactly how we access the state and use cmpxchg16b to do atomic
remote updates.  (On SMP systems without cmpxchg16b, we'd just skip
the optimization entirely.)

This is still missing a final comment-only patch to add overall
documentation for the whole thing, but I didn't want to block sending
the maybe-hopefully-final code on that.

This is based on tip:x86/mm.  The branch is here if you want to play:
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/pcid

In general, performance seems to exceed my expectations.  Here are
some performance numbers copy-and-pasted from the changelogs for
"Rework lazy TLB mode and TLB freshness" and "Try to preserve old
TLB entries using PCID":

MADV_DONTNEED; touch the page; switch CPUs using sched_setaffinity.  In
an unpatched kernel, MADV_DONTNEED will send an IPI to the previous CPU.
This is intended to be a nearly worst-case test.
patched: 13.4µs
unpatched:   21.6µs
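
[Archive note] For readers who want to try something similar, here is a minimal userspace sketch of the sequence described above (MADV_DONTNEED, touch the page, switch CPUs with sched_setaffinity). It is not the program from the misc-tests repository mentioned elsewhere in this thread; the helper names and the choice of bouncing between the first two allowed CPUs are assumptions, and it is Linux-specific.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Pin the calling thread to a single CPU; returns 0 on success. */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}

/*
 * Hypothetical helper: run the MADV_DONTNEED / touch / migrate loop and
 * return the mean cost of one iteration in nanoseconds, or -1.0 on error.
 */
static double madvise_bounce_ns(long iters)
{
	cpu_set_t allowed;
	struct timespec t0, t1;
	int cpus[2], n = 0, failed = 0;
	long page = sysconf(_SC_PAGESIZE);
	char *p;

	assert(iters > 0);
	if (page <= 0 || sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
		return -1.0;

	/* Use the first two CPUs we are allowed on (or one CPU twice). */
	for (int c = 0; c < CPU_SETSIZE && n < 2; c++)
		if (CPU_ISSET(c, &allowed))
			cpus[n++] = c;
	if (n == 0)
		return -1.0;
	if (n == 1)
		cpus[1] = cpus[0];

	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1.0;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < iters && !failed; i++) {
		if (madvise(p, (size_t)page, MADV_DONTNEED) != 0) /* drop the page */
			failed = 1;
		*(volatile char *)p = 1;                  /* fault it back in */
		if (pin_to_cpu(cpus[i & 1]) != 0)         /* hop to the other CPU */
			failed = 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	sched_setaffinity(0, sizeof(allowed), &allowed);  /* restore affinity */
	munmap(p, (size_t)page);
	if (failed)
		return -1.0;
	return ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / (double)iters;
}
```

Calling madvise_bounce_ns(100000) and printing the result gives a per-iteration figure that can be compared across kernels. On a machine with a single allowed CPU no migration happens, so the result does not exercise the cross-CPU path.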

Vitaly's pthread_mmap microbenchmark with 8 threads (on four cores),
nrounds = 100, 256M data
patched: 1.1 seconds or so
unpatched:   1.9 seconds or so

ping-pong between two mms on the same CPU using eventfd:
  patched: 1.22µs
  patched, nopcid: 1.33µs
  unpatched:   1.34µs

Same ping-pong, but now touch 512 pages (all zero-page to minimize
cache misses) each iteration.  dTLB misses are measured by
dtlb_load_misses.miss_causes_a_walk:
  patched: 1.8µs, 11M dTLB misses
  patched, nopcid: 6.2µs, 207M dTLB misses
  unpatched:   6.1µs, 190M dTLB misses

Changes from v3:
 - Lots more acks.
 - Move comment deletion to the beginning.
 - Misc cleanups from lots of reviewers.

Changes from v2:
 - Add some Acks
 - Move the reentrancy issue to the beginning.
   (I also sent the same patch as a standalone fix -- it's just in here
   so that this series applies to x86/mm.)
 - Fix some comments.

Changes from RFC:
 - flush_tlb_func_common() no longer gets reentered (Nadav)
 - Fix ASID corruption on unlazying (kbuild bot)
 - Move Xen init to the right place
 - Misc cleanups

Andy Lutomirski (10):
  x86/mm: Don't reenter flush_tlb_func_common()
  x86/mm: Delete a big outdated comment about TLB flushing
  x86/mm: Give each mm TLB flush generation a unique ID
  x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
  x86/mm: Rework lazy TLB mode and TLB freshness tracking
  x86/mm: Stop calling leave_mm() in idle code
  x86/mm: Disable PCID on 32-bit kernels
  x86/mm: Add nopcid to turn off PCID
  x86/mm: Enable CR4.PCIDE on supported systems
  x86/mm: Try to preserve old TLB entries using PCID

 Documentation/admin-guide/kernel-parameters.txt |   2 +
 arch/ia64/include/asm/acpi.h                    |   2 -
 arch/x86/include/asm/acpi.h                     |   2 -
 arch/x86/include/asm/disabled-features.h        |   4 +-
 arch/x86/include/asm/mmu.h                      |  25 +-
 arch/x86/include/asm/mmu_context.h              |  15 +-
 arch/x86/include/asm/processor-flags.h          |   2 +
 arch/x86/include/asm/tlbflush.h                 |  87 +-
 arch/x86/kernel/cpu/bugs.c                      |   8 +
 arch/x86/kernel/cpu/common.c                    |  40 +++
 arch/x86/mm/init.c                              |   2 +-
 arch/x86/mm/tlb.c                               | 382
 arch/x86/xen/enlighten_pv.c                     |   6 +
 arch/x86/xen/mmu_pv.c                           |   5 +-
 drivers/acpi/processor_idle.c                   |