Re: [PATCH v4 00/10] PCID and improved laziness
* Ingo Molnar wrote:

> * Andy Lutomirski wrote:
>
> > I'm on my way to LPC, so I can't easily work on this right this instant.
> >
> > Can you try this branch, though?
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes&id=cb88ae619b4c3d832d224f2c641849dc02aed864
>
> Any objections against me applying these fixes directly and getting them to Linus
> later today, to not widen the window of breakage any further?
>
> I'll also apply:
>
>    x86/mm/64: Initialize CR4.PCIDE early

So now that I've looked at the branch, I think I'll apply these three:

  cb88ae619b4c: x86/mm/64: Initialize CR4.PCIDE early
  6ec68017784f: x86/hibernate/64: Mask off CR3's PCID bits in the saved CR3
  5d1298206d7b: x86/mm: Get rid of VM_BUG_ON in switch_tlb_irqs_off()

The first one is the one that likely fixed the crash that Sai Praneeth Prakhya
reported.

Thanks,

	Ingo
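For readers unfamiliar with the second fix in that list: when CR4.PCIDE is set, the low 12 bits of CR3 hold the current PCID instead of being ignored low bits of the page-table address. The thread does not show the patch itself, so the following is only a minimal userspace sketch of what "masking off the PCID bits" of a saved CR3 value means. The names CR3_PCID_MASK_SKETCH, cr3_strip_pcid(), cr3_pcid() and cr3_with_pcid() are made up for illustration and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 11:0 of CR3 carry the PCID when CR4.PCIDE is set (hypothetical macro name). */
#define CR3_PCID_MASK_SKETCH UINT64_C(0xFFF)

/* Drop the PCID bits, leaving only the page-table base part of a saved CR3. */
static inline uint64_t cr3_strip_pcid(uint64_t cr3)
{
	return cr3 & ~CR3_PCID_MASK_SKETCH;
}

/* Extract the PCID that a CR3 value carries. */
static inline uint64_t cr3_pcid(uint64_t cr3)
{
	return cr3 & CR3_PCID_MASK_SKETCH;
}

/* Combine a 4K-aligned page-table base with a 12-bit PCID. */
static inline uint64_t cr3_with_pcid(uint64_t pgd_pa, uint64_t pcid)
{
	assert((pgd_pa & CR3_PCID_MASK_SKETCH) == 0);	/* base must be 4K aligned */
	assert(pcid <= CR3_PCID_MASK_SKETCH);		/* PCID is 12 bits wide */
	return pgd_pa | pcid;
}
```

Taking the CR3 value from the register dump in this thread (0x008383a0a000) and a purely hypothetical PCID of 5, cr3_with_pcid() yields 0x008383a0a005, and cr3_strip_pcid() of that value gives back the original base.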
Re: [PATCH v4 00/10] PCID and improved laziness
* Andy Lutomirski wrote:

> I'm on my way to LPC, so I can't easily work on this right this instant.
>
> Can you try this branch, though?
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes&id=cb88ae619b4c3d832d224f2c641849dc02aed864

Any objections against me applying these fixes directly and getting them to Linus
later today, to not widen the window of breakage any further?

I'll also apply:

   x86/mm/64: Initialize CR4.PCIDE early

Thanks,

	Ingo
Re: [PATCH v4 00/10] PCID and improved laziness
> > > Hi Andy, > > I have booted Linus's tree (8fac2f96ab86b0e14ec4e42851e21e9b518bdc55) on > Skylake server and noticed that it reboots automatically. > > When I booted the same kernel with command line arg "nopcid" it works > fine. Please find below a snippet of dmesg. Please let me know if you > need more info to debug. > > [0.00] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.13.0+ > root=UUID=3b8e9636-6e23-4785-a4e2-5954bfe86fd9 ro console=tty0 > console=ttyS0,115200n8 > [0.00] log_buf_len individual max cpu contribution: 4096 bytes > [0.00] log_buf_len total cpu_extra contributions: 258048 bytes > [0.00] log_buf_len min size: 262144 bytes > [0.00] log_buf_len: 524288 bytes > [0.00] early log buf free: 212560(81%) > [0.00] PID hash table entries: 4096 (order: 3, 32768 bytes) > [0.00] [ cut here ] > [0.00] WARNING: CPU: 0 PID: 0 at arch/x86/mm/tlb.c:245 > initialize_tlbstate_and_flush+0x6c/0xf0 > [0.00] Modules linked in: > [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 4.13.0+ #5 > [0.00] task: 8960f480 task.stack: 8960 > [0.00] RIP: 0010:initialize_tlbstate_and_flush+0x6c/0xf0 > [0.00] RSP: :89603e60 EFLAGS: 00010046 > [0.00] RAX: 000406b0 RBX: 9f1700a17880 RCX: > 8965de60 > [0.00] RDX: 008383a0a000 RSI: 0960a000 RDI: > 008383a0a000 > [0.00] RBP: 89603e60 R08: R09: > > [0.00] R10: 89603ee8 R11: R12: > > [0.00] R13: 9f1700a0c3e0 R14: 8960f480 R15: > > [0.00] FS: () GS:9f1700a0() > knlGS: > [0.00] CS: 0010 DS: ES: CR0: 80050033 > [0.00] CR2: 9fa7b000 CR3: 008383a0a000 CR4: > 000406b0 > [0.00] Call Trace: > [0.00] cpu_init+0x206/0x4f0 > [0.00] ? __set_pte_vaddr+0x1d/0x30 > [0.00] trap_init+0x3e/0x50 > [0.00] ? 
trap_init+0x3e/0x50 > [0.00] start_kernel+0x1e2/0x3f2 > [0.00] x86_64_start_reservations+0x24/0x26 > [0.00] x86_64_start_kernel+0x6f/0x72 > [0.00] secondary_startup_64+0xa5/0xa5 > [0.00] Code: de 00 48 01 f0 48 39 c7 0f 85 92 00 00 00 48 8b 05 > ee e2 ee 00 a9 00 00 02 00 74 11 65 48 8b 05 8b 9d 7c 77 a9 00 00 02 00 > 75 02 <0f> ff 48 81 e2 00 f0 ff ff 0f 22 da 65 66 c7 05 66 9d 7c 77 00 > [0.00] ---[ end trace c258f2d278fe031f ]--- > [0.00] Memory: 791050356K/803934656K available (9585K kernel > code, 1313K rwdata, 3000K rodata, 1176K init, 680K bss, 12884300K > reserved, 0K cma-reserved) > [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=64, > Nodes=4 > [0.00] Hierarchical RCU implementation. > [0.00]RCU event tracing is enabled. > [0.00] NR_IRQS: 4352, nr_irqs: 3928, preallocated irqs: 16 > [0.00] Console: colour dummy device 80x25 > [0.00] console [tty0] enabled > [0.00] console [ttyS0] enabled > [0.00] clocksource: hpet: mask: 0x max_cycles: > 0x, max_idle_ns: 79635855245 ns > [0.001000] tsc: Detected 2000.000 MHz processor > [0.002000] Calibrating delay loop (skipped), value calculated using > timer frequency.. 4000.00 BogoMIPS (lpj=200) > [0.003003] pid_max: default: 65536 minimum: 512 > [0.004030] ACPI: Core revision 20170728 > [0.091853] ACPI: 6 ACPI AML tables successfully acquired and loaded > [0.094143] Security Framework initialized > [0.095004] SELinux: Initializing. 
> [0.145612] Dentry cache hash table entries: 33554432 (order: 16, > 268435456 bytes) > [0.170544] Inode-cache hash table entries: 16777216 (order: 15, > 134217728 bytes) > [0.172699] Mount-cache hash table entries: 524288 (order: 10, > 4194304 bytes) > [0.174441] Mountpoint-cache hash table entries: 524288 (order: 10, > 4194304 bytes) > [0.176351] CPU: Physical Processor ID: 0 > [0.177003] CPU: Processor Core ID: 0 > [0.178007] ENERGY_PERF_BIAS: Set to 'normal', was 'performance' > [0.179003] ENERGY_PERF_BIAS: View and update with > x86_energy_perf_policy(8) > [0.180013] mce: CPU supports 20 MCE banks > [0.181018] CPU0: Thermal monitoring enabled (TM1) > [0.182057] process: using mwait in idle threads > [0.183005] Last level iTLB entries: 4KB 64, 2MB 8, 4MB 8 > [0.184003] Last level dTLB entries: 4KB 64, 2MB 0, 4MB 0, 1GB 4 > [0.185223] Freeing SMP alternatives memory: 36K > [0.193912] smpboot: Max logical packages: 8 > [0.194017] Switched APIC routing to physical flat. > [0.196496] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1 > [0.206252] smpboot: CPU0: Intel(R) Xeon(R) Platinum 8164 CPU @ > 2.00GHz (family: 0x6, model: 0x55, stepping: 0x4) > [
Re: [PATCH v4 00/10] PCID and improved laziness
> From: Andy Lutomirski > Date: Thu, Jun 29, 2017 at 8:53 AM > Subject: [PATCH v4 00/10] PCID and improved laziness > To: x...@kernel.org > Cc: linux-kernel@vger.kernel.org, Borislav Petkov , > Linus Torvalds , Andrew Morton > , Mel Gorman , > "linux...@kvack.org" , Nadav Amit > , Rik van Riel , Dave Hansen > , Arjan van de Ven , > Peter Zijlstra , Andy Lutomirski > > > > *** Ingo, even if this misses 4.13, please apply the first patch > before > *** the merge window. > > There are three performance benefits here: > > 1. TLB flushing is slow. (I.e. the flush itself takes a while.) >This avoids many of them when switching tasks by using PCID. In >a stupid little benchmark I did, it saves about 100ns on my laptop >per context switch. I'll try to improve that benchmark. > > 2. Mms that have been used recently on a given CPU might get to keep >their TLB entries alive across process switches with this patch >set. TLB fills are pretty fast on modern CPUs, but they're even >faster when they don't happen. > > 3. Lazy TLB is way better. We used to do two stupid things when we >ran kernel threads: we'd send IPIs to flush user contexts on their >CPUs and then we'd write to CR3 for no particular reason as an > excuse >to stop further IPIs. With this patch, we do neither. > > This will, in general, perform suboptimally if paravirt TLB flushing > is in use (currently just Xen, I think, but Hyper-V is in the works). > The code is structured so we could fix it in one of two ways: we > could take a spinlock when touching the percpu state so we can update > it remotely after a paravirt flush, or we could be more careful about > our exactly how we access the state and use cmpxchg16b to do atomic > remote updates. (On SMP systems without cmpxchg16b, we'd just skip > the optimization entirely.) > > This is still missing a final comment-only patch to add overall > documentation for the whole thing, but I didn't want to block sending > the maybe-hopefully-final code on that. 
> > This is based on tip:x86/mm. The branch is here if you want to play: > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/pcid > > In general, performance seems to exceed my expectations. Here are > some performance numbers copy-and-pasted from the changelogs for > "Rework lazy TLB mode and TLB freshness" and "Try to preserve old > TLB entries using PCID": > > Hi Andy, I have booted Linus's tree (8fac2f96ab86b0e14ec4e42851e21e9b518bdc55) on Skylake server and noticed that it reboots automatically. When I booted the same kernel with command line arg "nopcid" it works fine. Please find below a snippet of dmesg. Please let me know if you need more info to debug. [0.00] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.13.0+ root=UUID=3b8e9636-6e23-4785-a4e2-5954bfe86fd9 ro console=tty0 console=ttyS0,115200n8 [0.00] log_buf_len individual max cpu contribution: 4096 bytes [0.00] log_buf_len total cpu_extra contributions: 258048 bytes [0.00] log_buf_len min size: 262144 bytes [0.00] log_buf_len: 524288 bytes [0.00] early log buf free: 212560(81%) [0.00] PID hash table entries: 4096 (order: 3, 32768 bytes) [0.00] [ cut here ] [0.00] WARNING: CPU: 0 PID: 0 at arch/x86/mm/tlb.c:245 initialize_tlbstate_and_flush+0x6c/0xf0 [0.00] Modules linked in: [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 4.13.0+ #5 [0.00] task: 8960f480 task.stack: 8960 [0.00] RIP: 0010:initialize_tlbstate_and_flush+0x6c/0xf0 [0.00] RSP: :89603e60 EFLAGS: 00010046 [0.00] RAX: 000406b0 RBX: 9f1700a17880 RCX: 8965de60 [0.00] RDX: 008383a0a000 RSI: 0960a000 RDI: 008383a0a000 [0.00] RBP: 89603e60 R08: R09: [0.00] R10: 89603ee8 R11: R12: [0.00] R13: 9f1700a0c3e0 R14: 8960f480 R15: [0.00] FS: () GS:9f1700a0() knlGS: [0.00] CS: 0010 DS: ES: CR0: 80050033 [0.00] CR2: 9fa7b000 CR3: 008383a0a000 CR4: 000406b0 [0.00] Call Trace: [0.00] cpu_init+0x206/0x4f0 [0.00] ? __set_pte_vaddr+0x1d/0x30 [0.00] trap_init+0x3e/0x50 [0.00] ? 
trap_init+0x3e/0x50 [0.00] start_kernel+0x1e2/0x3f2 [0.00] x86_64_start_reservations+0x24/0x26 [0.00] x86_64_start_kernel+0x6f/0x72 [0.00] secondary_startup_64+0xa5/0xa5 [0.00] Code: de 00 48 01 f0
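The register dump in the report shows CR4: 000406b0. On x86, CR4.PCIDE is bit 17 (0x20000), and that bit is clear in 0x406b0. This would be consistent with the "Initialize CR4.PCIDE early" fix mentioned elsewhere in this thread, although the report itself does not say which check produced the warning. The following is a small userspace sketch for decoding such a value; the macro and function names are chosen for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural bit positions in CR4 (macro names are illustrative). */
#define CR4_PCIDE_BIT_SKETCH   17
#define CR4_OSXSAVE_BIT_SKETCH 18

/* Report whether a given bit is set in a CR4 value copied from a dump. */
static inline bool cr4_bit_set(uint64_t cr4, unsigned int bit)
{
	assert(bit < 64);	/* shifting a 64-bit value by 64 or more is undefined */
	return (cr4 >> bit) & 1;
}
```

Applied to the value from the dump, cr4_bit_set(0x406b0, CR4_PCIDE_BIT_SKETCH) is false while the neighbouring OSXSAVE bit is set, so PCID was not yet enabled in CR4 on the boot CPU when the warning fired.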
Re: [PATCH v4 00/10] PCID and improved laziness
On Mon, Jul 17, 2017 at 05:06:25PM +0200, Ingo Molnar wrote:
> > > I'll push it all out when it passes testing.
> > >
> > > If it's all super stable I plan to tempt Linus with a late merge window pull
> > > request for all these preparatory patches. (Unless he objects that is.
> > > Hint, hint.)
> > >
> > > Any objections?
> > >
> >
> > What was the final verdict here? I have a patch ready that should be layered
> > on top which will need a backport. PCID support does not appear to have
> > made it in this merge window so I'm wondering if I should send the patch
> > as-is for placement on top of Andy's work or go with the backport and
> > apply a follow-on patch after Andy's work gets merged.
>
> It's en route for v4.14 - it narrowly missed v4.13.
>

Grand. I sent out a version that doesn't depend on Andy's work to Andrew
as it's purely a mm patch. If that passes inspection then I'll send a
follow-on patch to apply on top of the PCID work.

Thanks Ingo.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH v4 00/10] PCID and improved laziness
* Mel Gorman wrote:

> On Wed, Jul 05, 2017 at 10:56:57AM +0200, Ingo Molnar wrote:
> >
> > * Andy Lutomirski wrote:
> >
> > > *** Ingo, even if this misses 4.13, please apply the first patch before
> > > *** the merge window.
> >
> > > Andy Lutomirski (10):
> > >   x86/mm: Don't reenter flush_tlb_func_common()
> > >   x86/mm: Delete a big outdated comment about TLB flushing
> > >   x86/mm: Give each mm TLB flush generation a unique ID
> > >   x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
> > >   x86/mm: Rework lazy TLB mode and TLB freshness tracking
> > >   x86/mm: Stop calling leave_mm() in idle code
> > >   x86/mm: Disable PCID on 32-bit kernels
> > >   x86/mm: Add nopcid to turn off PCID
> > >   x86/mm: Enable CR4.PCIDE on supported systems
> > >   x86/mm: Try to preserve old TLB entries using PCID
> >
> > So this series is really nice, and the first two patches are already upstream, and
> > I've just applied all but the final patch to tip:x86/mm (out of caution - I'm a wimp).
> >
> > That should already offer some improvements and enables the CR4 bit - but doesn't
> > actually use the PCID hardware yet.
> >
> > I'll push it all out when it passes testing.
> >
> > If it's all super stable I plan to tempt Linus with a late merge window pull
> > request for all these preparatory patches. (Unless he objects that is.
> > Hint, hint.)
> >
> > Any objections?
> >
>
> What was the final verdict here? I have a patch ready that should be layered
> on top which will need a backport. PCID support does not appear to have
> made it in this merge window so I'm wondering if I should send the patch
> as-is for placement on top of Andy's work or go with the backport and
> apply a follow-on patch after Andy's work gets merged.

It's en route for v4.14 - it narrowly missed v4.13.

Thanks,

	Ingo
Re: [PATCH v4 00/10] PCID and improved laziness
On Wed, Jul 05, 2017 at 10:56:57AM +0200, Ingo Molnar wrote:
>
> * Andy Lutomirski wrote:
>
> > *** Ingo, even if this misses 4.13, please apply the first patch before
> > *** the merge window.
>
> > Andy Lutomirski (10):
> >   x86/mm: Don't reenter flush_tlb_func_common()
> >   x86/mm: Delete a big outdated comment about TLB flushing
> >   x86/mm: Give each mm TLB flush generation a unique ID
> >   x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
> >   x86/mm: Rework lazy TLB mode and TLB freshness tracking
> >   x86/mm: Stop calling leave_mm() in idle code
> >   x86/mm: Disable PCID on 32-bit kernels
> >   x86/mm: Add nopcid to turn off PCID
> >   x86/mm: Enable CR4.PCIDE on supported systems
> >   x86/mm: Try to preserve old TLB entries using PCID
>
> So this series is really nice, and the first two patches are already upstream, and
> I've just applied all but the final patch to tip:x86/mm (out of caution - I'm a wimp).
>
> That should already offer some improvements and enables the CR4 bit - but doesn't
> actually use the PCID hardware yet.
>
> I'll push it all out when it passes testing.
>
> If it's all super stable I plan to tempt Linus with a late merge window pull
> request for all these preparatory patches. (Unless he objects that is. Hint,
> hint.)
>
> Any objections?
>

What was the final verdict here? I have a patch ready that should be layered
on top which will need a backport. PCID support does not appear to have
made it in this merge window so I'm wondering if I should send the patch
as-is for placement on top of Andy's work or go with the backport and
apply a follow-on patch after Andy's work gets merged.

Thanks.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH v4 00/10] PCID and improved laziness
On Tue, 11 Jul, at 08:00:47AM, Andy Lutomirski wrote:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/misc-tests.git/
>
> I did:
>
> $ ./context_switch_latency_64 0 process same

Ah, that's better. I see about a 3.3% speedup with your patches when
running the context-switch benchmark.
Re: [PATCH v4 00/10] PCID and improved laziness
On Tue, Jul 11, 2017 at 4:32 AM, Matt Fleming wrote:
> On Fri, 30 Jun, at 01:44:22PM, Matt Fleming wrote:
>> On Thu, 29 Jun, at 08:53:12AM, Andy Lutomirski wrote:
>> > *** Ingo, even if this misses 4.13, please apply the first patch before
>> > *** the merge window.
>> >
>> > There are three performance benefits here:
>> >
>> > 1. TLB flushing is slow. (I.e. the flush itself takes a while.)
>> >    This avoids many of them when switching tasks by using PCID. In
>> >    a stupid little benchmark I did, it saves about 100ns on my laptop
>> >    per context switch. I'll try to improve that benchmark.
>> >
>> > 2. Mms that have been used recently on a given CPU might get to keep
>> >    their TLB entries alive across process switches with this patch
>> >    set. TLB fills are pretty fast on modern CPUs, but they're even
>> >    faster when they don't happen.
>> >
>> > 3. Lazy TLB is way better. We used to do two stupid things when we
>> >    ran kernel threads: we'd send IPIs to flush user contexts on their
>> >    CPUs and then we'd write to CR3 for no particular reason as an excuse
>> >    to stop further IPIs. With this patch, we do neither.
>>
>> Heads up, I'm gonna queue this for a run on SUSE's performance test
>> grid.
>
> FWIW, I didn't see any change in performance with this series on a
> PCID-capable machine. On the plus side, I didn't see any weird-looking
> bugs either.
>
> Are your benchmarks available anywhere?

https://git.kernel.org/pub/scm/linux/kernel/git/luto/misc-tests.git/

I did:

$ ./context_switch_latency_64 0 process same

and

$ ./madvise_bounce_64 10k [IIRC -- it might have been a different loop count]

--Andy
Re: [PATCH v4 00/10] PCID and improved laziness
On Fri, 30 Jun, at 01:44:22PM, Matt Fleming wrote:
> On Thu, 29 Jun, at 08:53:12AM, Andy Lutomirski wrote:
> > *** Ingo, even if this misses 4.13, please apply the first patch before
> > *** the merge window.
> >
> > There are three performance benefits here:
> >
> > 1. TLB flushing is slow. (I.e. the flush itself takes a while.)
> >    This avoids many of them when switching tasks by using PCID. In
> >    a stupid little benchmark I did, it saves about 100ns on my laptop
> >    per context switch. I'll try to improve that benchmark.
> >
> > 2. Mms that have been used recently on a given CPU might get to keep
> >    their TLB entries alive across process switches with this patch
> >    set. TLB fills are pretty fast on modern CPUs, but they're even
> >    faster when they don't happen.
> >
> > 3. Lazy TLB is way better. We used to do two stupid things when we
> >    ran kernel threads: we'd send IPIs to flush user contexts on their
> >    CPUs and then we'd write to CR3 for no particular reason as an excuse
> >    to stop further IPIs. With this patch, we do neither.
>
> Heads up, I'm gonna queue this for a run on SUSE's performance test
> grid.

FWIW, I didn't see any change in performance with this series on a
PCID-capable machine. On the plus side, I didn't see any weird-looking
bugs either.

Are your benchmarks available anywhere?
Re: [PATCH v4 00/10] PCID and improved laziness
On Wed, Jul 5, 2017 at 1:56 AM, Ingo Molnar wrote:
>
> If it's all super stable I plan to tempt Linus with a late merge window pull
> request for all these preparatory patches. (Unless he objects that is. Hint,
> hint.)

I don't think I'll object. At some point the best testing is "lots of users".

TLB issues are a bitch to debug, but at the same time this is clearly
a "..but at some point we need to bite the bullet" case. I doubt the
series is going to get a lot better.

But yes, please do give it as much testing as humanly possible even
without the wider coverage by random people.

              Linus
Re: [PATCH v4 00/10] PCID and improved laziness
* Andy Lutomirski wrote:

> *** Ingo, even if this misses 4.13, please apply the first patch before
> *** the merge window.

> Andy Lutomirski (10):
>   x86/mm: Don't reenter flush_tlb_func_common()
>   x86/mm: Delete a big outdated comment about TLB flushing
>   x86/mm: Give each mm TLB flush generation a unique ID
>   x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
>   x86/mm: Rework lazy TLB mode and TLB freshness tracking
>   x86/mm: Stop calling leave_mm() in idle code
>   x86/mm: Disable PCID on 32-bit kernels
>   x86/mm: Add nopcid to turn off PCID
>   x86/mm: Enable CR4.PCIDE on supported systems
>   x86/mm: Try to preserve old TLB entries using PCID

So this series is really nice, and the first two patches are already upstream, and
I've just applied all but the final patch to tip:x86/mm (out of caution - I'm a wimp).

That should already offer some improvements and enables the CR4 bit - but doesn't
actually use the PCID hardware yet.

I'll push it all out when it passes testing.

If it's all super stable I plan to tempt Linus with a late merge window pull
request for all these preparatory patches. (Unless he objects that is. Hint, hint.)

Any objections?

Thanks,

	Ingo
Re: [PATCH v4 00/10] PCID and improved laziness
On Thu, 29 Jun, at 08:53:12AM, Andy Lutomirski wrote:
> *** Ingo, even if this misses 4.13, please apply the first patch before
> *** the merge window.
>
> There are three performance benefits here:
>
> 1. TLB flushing is slow. (I.e. the flush itself takes a while.)
>    This avoids many of them when switching tasks by using PCID. In
>    a stupid little benchmark I did, it saves about 100ns on my laptop
>    per context switch. I'll try to improve that benchmark.
>
> 2. Mms that have been used recently on a given CPU might get to keep
>    their TLB entries alive across process switches with this patch
>    set. TLB fills are pretty fast on modern CPUs, but they're even
>    faster when they don't happen.
>
> 3. Lazy TLB is way better. We used to do two stupid things when we
>    ran kernel threads: we'd send IPIs to flush user contexts on their
>    CPUs and then we'd write to CR3 for no particular reason as an excuse
>    to stop further IPIs. With this patch, we do neither.

Heads up, I'm gonna queue this for a run on SUSE's performance test
grid.
[PATCH v4 00/10] PCID and improved laziness
*** Ingo, even if this misses 4.13, please apply the first patch before
*** the merge window.

There are three performance benefits here:

1. TLB flushing is slow. (I.e. the flush itself takes a while.)
   This avoids many of them when switching tasks by using PCID. In
   a stupid little benchmark I did, it saves about 100ns on my laptop
   per context switch. I'll try to improve that benchmark.

2. Mms that have been used recently on a given CPU might get to keep
   their TLB entries alive across process switches with this patch
   set. TLB fills are pretty fast on modern CPUs, but they're even
   faster when they don't happen.

3. Lazy TLB is way better. We used to do two stupid things when we
   ran kernel threads: we'd send IPIs to flush user contexts on their
   CPUs and then we'd write to CR3 for no particular reason as an excuse
   to stop further IPIs. With this patch, we do neither.

This will, in general, perform suboptimally if paravirt TLB flushing
is in use (currently just Xen, I think, but Hyper-V is in the works).
The code is structured so we could fix it in one of two ways: we
could take a spinlock when touching the percpu state so we can update
it remotely after a paravirt flush, or we could be more careful about
exactly how we access the state and use cmpxchg16b to do atomic
remote updates. (On SMP systems without cmpxchg16b, we'd just skip
the optimization entirely.)

This is still missing a final comment-only patch to add overall
documentation for the whole thing, but I didn't want to block sending
the maybe-hopefully-final code on that.

This is based on tip:x86/mm. The branch is here if you want to play:
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/pcid

In general, performance seems to exceed my expectations. Here are
some performance numbers copy-and-pasted from the changelogs for
"Rework lazy TLB mode and TLB freshness" and "Try to preserve old
TLB entries using PCID":

MADV_DONTNEED; touch the page; switch CPUs using sched_setaffinity. In
an unpatched kernel, MADV_DONTNEED will send an IPI to the previous CPU.
This is intended to be a nearly worst-case test.

  patched:   13.4µs
  unpatched: 21.6µs

Vitaly's pthread_mmap microbenchmark with 8 threads (on four cores),
nrounds = 100, 256M data

  patched:   1.1 seconds or so
  unpatched: 1.9 seconds or so

ping-pong between two mms on the same CPU using eventfd:

  patched:         1.22µs
  patched, nopcid: 1.33µs
  unpatched:       1.34µs

Same ping-pong, but now touch 512 pages (all zero-page to minimize
cache misses) each iteration. dTLB misses are measured by
dtlb_load_misses.miss_causes_a_walk:

  patched:         1.8µs, 11M dTLB misses
  patched, nopcid: 6.2µs, 207M dTLB misses
  unpatched:       6.1µs, 190M dTLB misses

Changes from v3:
 - Lots more acks.
 - Move comment deletion to the beginning.
 - Misc cleanups from lots of reviewers.

Changes from v2:
 - Add some Acks
 - Move the reentrancy issue to the beginning.
   (I also sent the same patch as a standalone fix -- it's just in here
   so that this series applies to x86/mm.)
 - Fix some comments.
Changes from RFC:
 - flush_tlb_func_common() no longer gets reentered (Nadav)
 - Fix ASID corruption on unlazying (kbuild bot)
 - Move Xen init to the right place
 - Misc cleanups

Andy Lutomirski (10):
  x86/mm: Don't reenter flush_tlb_func_common()
  x86/mm: Delete a big outdated comment about TLB flushing
  x86/mm: Give each mm TLB flush generation a unique ID
  x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
  x86/mm: Rework lazy TLB mode and TLB freshness tracking
  x86/mm: Stop calling leave_mm() in idle code
  x86/mm: Disable PCID on 32-bit kernels
  x86/mm: Add nopcid to turn off PCID
  x86/mm: Enable CR4.PCIDE on supported systems
  x86/mm: Try to preserve old TLB entries using PCID

 Documentation/admin-guide/kernel-parameters.txt |   2 +
 arch/ia64/include/asm/acpi.h                    |   2 -
 arch/x86/include/asm/acpi.h                     |   2 -
 arch/x86/include/asm/disabled-features.h        |   4 +-
 arch/x86/include/asm/mmu.h                      |  25 +-
 arch/x86/include/asm/mmu_context.h              |  15 +-
 arch/x86/include/asm/processor-flags.h          |   2 +
 arch/x86/include/asm/tlbflush.h                 |  87 +-
 arch/x86/kernel/cpu/bugs.c                      |   8 +
 arch/x86/kernel/cpu/common.c                    |  40 +++
 arch/x86/mm/init.c                              |   2 +-
 arch/x86/mm/tlb.c                               | 382
 arch/x86/xen/enlighten_pv.c                     |   6 +
 arch/x86/xen/mmu_pv.c                           |   5 +-
 drivers/acpi/processor_idle.c                   |
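The cover letter describes the second benefit (recently used mms keeping their TLB entries) only in prose. As a rough aid to reading the patch titles about per-mm unique IDs, tlb_gen tracking and PCID reuse, here is a toy userspace model of the idea. It is an assumption-laden sketch, not the code in arch/x86/mm/tlb.c: the names mm_model, slot_model, cpu_model, NR_SLOTS and choose_slot(), the slot count of 6, and the round-robin replacement are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_SLOTS 6	/* hypothetical number of PCID slots tracked per CPU */

/* Per-address-space state: a unique ID plus a flush generation counter. */
struct mm_model {
	uint64_t ctx_id;	/* never reused; 0 means "no mm" in this model */
	uint64_t tlb_gen;	/* bumped whenever this mm's mappings need flushing */
};

/* What one CPU remembers about the mm cached under each slot. */
struct slot_model {
	uint64_t ctx_id;
	uint64_t tlb_gen;	/* generation this CPU's TLB has caught up to */
};

struct cpu_model {
	struct slot_model slots[NR_SLOTS];
	unsigned int next;	/* round-robin victim used on a miss */
};

/*
 * Pick a slot for @mm on @cpu. *need_flush becomes false only when the
 * slot already belongs to @mm and has seen every flush generation so far.
 */
static inline unsigned int choose_slot(struct cpu_model *cpu,
				       const struct mm_model *mm,
				       bool *need_flush)
{
	unsigned int i;

	assert(mm->ctx_id != 0);

	for (i = 0; i < NR_SLOTS; i++) {
		if (cpu->slots[i].ctx_id != mm->ctx_id)
			continue;
		/* Hit: flush only if the mm moved on since we last ran it. */
		*need_flush = cpu->slots[i].tlb_gen < mm->tlb_gen;
		cpu->slots[i].tlb_gen = mm->tlb_gen;
		return i;
	}

	/* Miss: recycle a slot and treat whatever is cached under it as stale. */
	i = cpu->next;
	cpu->next = (cpu->next + 1) % NR_SLOTS;
	cpu->slots[i].ctx_id = mm->ctx_id;
	cpu->slots[i].tlb_gen = mm->tlb_gen;
	*need_flush = true;
	return i;
}
```

In this model, switching A, then B, then back to A on one CPU needs a flush only for the first two switches; the return to A reuses its slot without flushing unless A's tlb_gen was bumped in the meantime. That matches the intent stated in the cover letter, but how the real series handles eviction, lazy mode and remote flushes is not something this sketch tries to capture.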