Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-26 Thread Aaron Lu
On Fri, Mar 08, 2019 at 11:44:01AM -0800, Subhra Mazumdar wrote:
> 
> On 2/22/19 4:45 AM, Mel Gorman wrote:
> >On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> >>On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  wrote:
> >>>However; whichever way around you turn this cookie; it is expensive and 
> >>>nasty.
> >>Do you (or anybody else) have numbers for real loads?
> >>
> >>Because performance is all that matters. If performance is bad, then
> >>it's pointless, since just turning off SMT is the answer.
> >>
> >I tried to do a comparison between tip/master, ht disabled and this series
> >putting test workloads into a tagged cgroup but unfortunately it failed
> >
> >[  156.978682] BUG: unable to handle kernel NULL pointer dereference at 
> >0058
> >[  156.986597] #PF error: [normal kernel read fault]
> >[  156.991343] PGD 0 P4D 0
> >[  156.993905] Oops:  [#1] SMP PTI
> >[  156.997438] CPU: 15 PID: 0 Comm: swapper/15 Not tainted 
> >5.0.0-rc7-schedcore-v1r1 #1
> >[  157.005161] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 
> >05/09/2016
> >[  157.012896] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
> >[  157.018613] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 
> >c1 e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00
> >  53 48 89 fb <48> 2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 
> > 19 01
> >[  157.037544] RSP: 0018:c9000c5bbde8 EFLAGS: 00010086
> >[  157.042819] RAX: 88810f5f6a00 RBX: 0001547f175c RCX: 
> >0001
> >[  157.050015] RDX: 88bf3bdb0a40 RSI:  RDI: 
> >0001547f175c
> >[  157.057215] RBP: 88bf7fae32c0 R08: 0001e358 R09: 
> >88810fb9f000
> >[  157.064410] R10: c9000c5bbe08 R11: 88810fb9f5c4 R12: 
> >
> >[  157.071611] R13: 88bf4e3ea0c0 R14:  R15: 
> >88bf4e3ea7a8
> >[  157.078814] FS:  () GS:88bf7f5c() 
> >knlGS:
> >[  157.086977] CS:  0010 DS:  ES:  CR0: 80050033
> >[  157.092779] CR2: 0058 CR3: 0220e005 CR4: 
> >003606e0
> >[  157.099979] DR0:  DR1:  DR2: 
> >
> >[  157.109529] DR3:  DR6: fffe0ff0 DR7: 
> >0400
> >[  157.119058] Call Trace:
> >[  157.123865]  pick_next_entity+0x61/0x110
> >[  157.130137]  pick_task_fair+0x4b/0x90
> >[  157.136124]  __schedule+0x365/0x12c0
> >[  157.141985]  schedule_idle+0x1e/0x40
> >[  157.147822]  do_idle+0x166/0x280
> >[  157.153275]  cpu_startup_entry+0x19/0x20
> >[  157.159420]  start_secondary+0x17a/0x1d0
> >[  157.165568]  secondary_startup_64+0xa4/0xb0
> >[  157.171985] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs msr 
> >intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel 
> >kvm ipmi_ssif irqbypass crc32_pclmul ghash_clmulni_intel ixgbe aesni_intel 
> >xfrm_algo iTCO_wdt joydev iTCO_vendor_support libphy igb aes_x86_64 
> >crypto_simd ptp cryptd mei_me mdio pps_core ioatdma glue_helper pcspkr 
> >ipmi_si lpc_ich i2c_i801 mei dca ipmi_devintf ipmi_msghandler acpi_pad 
> >pcc_cpufreq button btrfs libcrc32c xor zstd_decompress zstd_compress 
> >raid6_pq hid_generic usbhid ast i2c_algo_bit drm_kms_helper syscopyarea 
> >sysfillrect sysimgblt fb_sys_fops xhci_pci crc32c_intel ehci_pci ttm 
> >xhci_hcd ehci_hcd drm ahci usbcore mpt3sas libahci raid_class 
> >scsi_transport_sas wmi sg nbd dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc 
> >scsi_dh_alua
> >[  157.258990] CR2: 0058
> >[  157.264961] ---[ end trace a301ac5e3ee86fde ]---
> >[  157.283719] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
> >[  157.291967] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 
> >c1 e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00 53 48 89 fb <48> 
> >2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
> >[  157.316121] RSP: 0018:c9000c5bbde8 EFLAGS: 00010086
> >[  157.324060] RAX: 88810f5f6a00 RBX: 0001547f175c RCX: 
> >0001
> >[  157.333932] RDX: 88bf3bdb0a40 RSI:  RDI: 
> >0001547f175c
> >[  157.343795] RBP: 88bf7fae32c0 R08: 0001e358 R09: 
> >88810fb9f000
> >[  157.353634] R10: c9000c5bbe08 R11: 88810fb9f5c4 R12: 
> >
> >[  157.363506] R13: 88bf4e3ea0c0 R14:  R15: 
> >88bf4e3ea7a8
> >[  157.373395] FS:  () GS:88bf7f5c() 
> >knlGS:
> >[  157.384238] CS:  0010 DS:  ES:  CR0: 80050033
> >[  157.392709] CR2: 0058 CR3: 0220e005 CR4: 
> >003606e0
> >[  157.402601] DR0:  DR1:  DR2: 
> >
> >[  157.412488] DR3:  DR6: fffe0ff0 DR7: 
> >0400
> >[  157.422334] Kernel panic - not syncing: Attempted to kill the idle task!
> >[  158.529804] Shutting down cpus 

Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-18 Thread Aubrey Li
On Tue, Mar 12, 2019 at 7:36 AM Subhra Mazumdar
 wrote:
>
>
> On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
> >
> > On 3/10/19 9:23 PM, Aubrey Li wrote:
> >> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
> >>  wrote:
> >>> expected. Most of the performance recovery happens in patch 15 which,
> >>> unfortunately, is also the one that introduces the hard lockup.
> >>>
> >> After applying Subhra's patch, the following is triggered by enabling
> >> core sched when a cgroup is
> >> under heavy load.
> >>
> > It seems you are facing some other deadlock where printk is involved.
> > Can you
> > drop the last patch (patch 16 sched: Debug bits...) and try?
> >
> > Thanks,
> > Subhra
> >
> Never mind, I am seeing the same lockdep deadlock output even w/o patch
> 16. Btw the NULL fix had something missing; the following works.
>

Okay, here is another one: on my system the booted CPUs don't match the
possible CPU map, so rq->core on CPUs that are never onlined is left
uninitialized, which causes a NULL pointer dereference panic in
online_fair_sched_group():

And here is a quick fix.
-
@@ -10488,7 +10493,8 @@ void online_fair_sched_group(struct task_group *tg)
for_each_possible_cpu(i) {
rq = cpu_rq(i);
se = tg->se[i];
-
+   if (!rq->core)
+   continue;
raw_spin_lock_irq(rq_lockp(rq));
update_rq_clock(rq);
attach_entity_cfs_rq(se);

Thanks,
-Aubrey


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-14 Thread Julien Desfossez
On 2/18/19 8:56 AM, Peter Zijlstra wrote:
> A much 'demanded' feature: core-scheduling :-(
>
> I still hate it with a passion, and that is part of why it took a little
> longer than 'promised'.
>
> While this one doesn't have all the 'features' of the previous (never
> published) version and isn't L1TF 'complete', I tend to like the structure
> better (relatively speaking: I hate it slightly less).
>
> This one is sched class agnostic and therefore, in principle, doesn't horribly
> wreck RT (in fact, RT could 'ab'use this by setting 'task->core_cookie = task'
> to force-idle siblings).
>
> Now, as hinted by that, there are semi sane reasons for actually having this.
> Various hardware features like Intel RDT - Memory Bandwidth Allocation, work
> per core (due to SMT fundamentally sharing caches) and therefore grouping
> related tasks on a core makes it more reliable.
>
> However; whichever way around you turn this cookie; it is expensive and nasty.

We are seeing this hard lockup within 1 hour of testing the patchset with 2
VMs using the core scheduler feature. Here is the full dmesg. We have the
kdump as well if more information is necessary.

[ 1989.647539] core sched enabled
[ 3353.211527] NMI: IOCK error (debug interrupt?) for reason 75 on CPU 0.
[ 3353.211528] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted
5.0-0.coresched-generic #1
[ 3353.211530] RIP: 0010:native_queued_spin_lock_slowpath+0x199/0x1e0
[ 3353.211532] Code: eb e8 c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6
48 05 00 3a 02 00 48 03 04 f5 20 48 bb a6 48 89 10 8b 42 08 85 c0 75 09 
90 8b 42 08 85 c0 74 f7 48 8b 32 48 85 f6 74 8e 0f 18 0e eb 8f
[ 3353.211533] RSP: 0018:97ba3f603e18 EFLAGS: 0046
[ 3353.211535] RAX:  RBX: 0202 RCX:
0004
[ 3353.211535] RDX: 97ba3f623a00 RSI: 0007 RDI:
97dabf822d40
[ 3353.211536] RBP: 97ba3f603e18 R08: 0004 R09:
00018499
[ 3353.211537] R10: 0001 R11:  R12:
0001
[ 3353.211538] R13: a7340740 R14: 000c R15:
000c
[ 3353.211539] FS:  () GS:97ba3f60()
knlGS:
[ 3353.211544] CS:  0010 DS:  ES:  CR0: 80050033
[ 3353.211545] CR2: 7efeac310004 CR3: 001bf4c0e002 CR4:
001626f0
[ 3353.211546] Call Trace:
[ 3353.211546]  
[ 3353.211547]  _raw_spin_lock_irqsave+0x35/0x40
[ 3353.211548]  update_blocked_averages+0x35/0x5d0
[ 3353.211549]  ? rebalance_domains+0x180/0x2c0
[ 3353.211549]  update_nohz_stats+0x48/0x60
[ 3353.211550]  _nohz_idle_balance+0xdf/0x290
[ 3353.211551]  run_rebalance_domains+0x97/0xa0
[ 3353.211551]  __do_softirq+0xe4/0x2f3
[ 3353.211552]  irq_exit+0xb6/0xc0
[ 3353.211553]  scheduler_ipi+0xe4/0x130
[ 3353.211553]  smp_reschedule_interrupt+0x39/0xe0
[ 3353.211554]  reschedule_interrupt+0xf/0x20
[ 3353.211555]  
[ 3353.211556] RIP: 0010:cpuidle_enter_state+0xbc/0x440
[ 3353.211557] Code: ff e8 d8 dd 86 ff 80 7d d3 00 74 17 9c 58 0f 1f 44 00
00 f6 c4 02 0f 85 54 03 00 00 31 ff e8 eb 1d 8d ff fb 66 0f 1f 44 00 00 <45>
85 f6 0f 88 1a 03 00 00 4c 2b 6d c8 48 ba cf f7 53 e3 a5 9b c4
[ 3353.211558] RSP: 0018:a6e03df8 EFLAGS: 0246 ORIG_RAX:
ff02
[ 3353.211560] RAX: 97ba3f622d40 RBX: a6f545e0 RCX:
001f
[ 3353.211561] RDX: 024c9b7d936c RSI: 47318912 RDI:

[ 3353.211562] RBP: a6e03e38 R08: 0002 R09:
00022600
[ 3353.211562] R10: a6e03dc8 R11: 02dc R12:
d6c67f602968
[ 3353.211563] R13: 024c9b7d936c R14: 0004 R15:
a6f54760
[ 3353.211564]  ? cpuidle_enter_state+0x98/0x440
[ 3353.211565]  cpuidle_enter+0x17/0x20
[ 3353.211565]  call_cpuidle+0x23/0x40
[ 3353.211566]  do_idle+0x204/0x280
[ 3353.211567]  cpu_startup_entry+0x1d/0x20
[ 3353.211567]  rest_init+0xae/0xb0
[ 3353.211568]  arch_call_rest_init+0xe/0x1b
[ 3353.211569]  start_kernel+0x4f5/0x516
[ 3353.211569]  x86_64_start_reservations+0x24/0x26
[ 3353.211570]  x86_64_start_kernel+0x74/0x77
[ 3353.211571]  secondary_startup_64+0xa4/0xb0
[ 3353.211571] Kernel panic - not syncing: NMI IOCK error: Not continuing
[ 3353.211572] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted
5.0-0.coresched-generic #1
[ 3353.211574] Call Trace:
[ 3353.211575]  
[ 3353.211575]  dump_stack+0x63/0x85
[ 3353.211576]  panic+0xfe/0x2a4
[ 3353.211576]  nmi_panic+0x39/0x40
[ 3353.211577]  io_check_error+0x92/0xa0
[ 3353.211578]  default_do_nmi+0x9e/0x110
[ 3353.211578]  do_nmi+0x119/0x180
[ 3353.211579]  end_repeat_nmi+0x16/0x50
[ 3353.211580] RIP: 0010:native_queued_spin_lock_slowpath+0x199/0x1e0
[ 3353.211581] Code: eb e8 c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6
48 05 00 3a 02 00 48 03 04 f5 20 48 bb a6 48 89 10 8b 42 08 85 c0 75 09 
90 8b 42 08 85 c0 74 f7 48 8b 32 48 85 f6 74 8e 0f 18 0e eb 8f
[ 3353.211582] RSP: 0018:97ba3f603e18 EFLAGS: 0046
[ 3353.211583] RAX: 

Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-14 Thread Li, Aubrey
The original patch seems to be missing the following change for 32-bit.

Thanks,
-Aubrey

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 9fbb10383434..78de28ebc45d 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -111,7 +111,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int 
cpu,
/*
 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
 */
-   raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+   raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
if (index == CPUACCT_STAT_NSTATS) {
@@ -125,7 +125,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int 
cpu,
}
 
 #ifndef CONFIG_64BIT
-   raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+   raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
return data;
@@ -140,14 +140,14 @@ static void cpuacct_cpuusage_write(struct cpuacct *ca, 
int cpu, u64 val)
/*
 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
 */
-   raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+   raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
cpuusage->usages[i] = val;
 
 #ifndef CONFIG_64BIT
-   raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+   raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 }
 
@@ -252,13 +252,13 @@ static int cpuacct_all_seq_show(struct seq_file *m, void 
*V)
 * Take rq->lock to make 64-bit read safe on 32-bit
 * platforms.
 */
-   raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+   raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
seq_printf(m, " %llu", cpuusage->usages[index]);
 
 #ifndef CONFIG_64BIT
-   raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+   raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
}
seq_puts(m, "\n");


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-13 Thread Aubrey Li
On Thu, Mar 14, 2019 at 8:35 AM Tim Chen  wrote:
> >>
> >> One more NULL pointer dereference:
> >>
> >> Mar 12 02:24:46 aubrey-ivb kernel: [  201.916741] core sched enabled
> >> [  201.950203] BUG: unable to handle kernel NULL pointer dereference
> >> at 0008
> >> [  201.950254] [ cut here ]
> >> [  201.959045] #PF error: [normal kernel read fault]
> >> [  201.964272] !se->on_rq
> >> [  201.964287] WARNING: CPU: 22 PID: 2965 at kernel/sched/fair.c:6849
> >> set_next_buddy+0x52/0x70
> >
> Shouldn't the for_each_sched_entity(se) loop, which skips its body for the
> !se case, have avoided a NULL pointer access of se?
>
> Since
> #define for_each_sched_entity(se) \
> for (; se; se = se->parent)
>
> Scratching my head a bit here on how your changes would have made
> a difference.

This NULL pointer dereference is not reproducible, which made me think the
change works...

>
> In your original log, I wonder if the !se->on_rq warning on CPU 22 is mixed 
> with the actual OOPs?
> I also saw rb_insert_color in your original log. I wonder if that
> was actually the source of the Oops?

No chance to figure this out, I only saw this once, lockup occurs more
frequently.

Thanks,
-Aubrey


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-13 Thread Tim Chen


>>
>> One more NULL pointer dereference:
>>
>> Mar 12 02:24:46 aubrey-ivb kernel: [  201.916741] core sched enabled
>> [  201.950203] BUG: unable to handle kernel NULL pointer dereference
>> at 0008
>> [  201.950254] [ cut here ]
>> [  201.959045] #PF error: [normal kernel read fault]
>> [  201.964272] !se->on_rq
>> [  201.964287] WARNING: CPU: 22 PID: 2965 at kernel/sched/fair.c:6849
>> set_next_buddy+0x52/0x70
> 
> A quick workaround below:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d0dac4fd94f..ef6acfe2cf7d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6834,7 +6834,7 @@ static void set_last_buddy(struct sched_entity *se)
> return;
> 
> for_each_sched_entity(se) {
> -   if (SCHED_WARN_ON(!se->on_rq))
> +   if (SCHED_WARN_ON(!(se && se->on_rq)))
> return;
> cfs_rq_of(se)->last = se;
> }
> @@ -6846,7 +6846,7 @@ static void set_next_buddy(struct sched_entity *se)
> return;
> 
> for_each_sched_entity(se) {
> -   if (SCHED_WARN_ON(!se->on_rq))
> +   if (SCHED_WARN_ON(!(se && se->on_rq)))


Shouldn't the for_each_sched_entity(se) loop, which skips its body for the
!se case, have avoided a NULL pointer access of se?

Since
#define for_each_sched_entity(se) \
for (; se; se = se->parent)

Scratching my head a bit here on how your changes would have made
a difference.

In your original log, I wonder if the !se->on_rq warning on CPU 22 is mixed 
with the actual OOPs?
I also saw rb_insert_color in your original log. I wonder if that
was actually the source of the Oops?


[  202.078674] RIP: 0010:set_next_buddy+0x52/0x70
[  202.090135] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  202.090144] RIP: 0010:rb_insert_color+0x17/0x190
[  202.101623] Code: 48 85 ff 74 10 8b 47 40 85 c0 75 e2 80 3d 9e e5
6a 01 00 74 02 f3 c3 48 c7 c7 5c 05 2c 82 c6 05 8c e5 6a 01 01 e8 2e
bb fb ff <0f> 0b c3 83 bf 04 03 0e
[  202.113216] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00
00 00 00 48 8b 17 48 85 d2 0f 84 4d 01 00 00 48 8b 02 a8 01 0f 85 6d
01 00 00 <48> 8b 48 08 49 89 c0 44
[  202.118263] RSP: 0018:c9000a5cbbb0 EFLAGS: 00010086
[  202.129858] RSP: 0018:c9000a463cc0 EFLAGS: 00010046
[  202.135102] RAX:  RBX: 88980047e800 RCX: 
[  202.135105] RDX: 888be28caa40 RSI: 0001 RDI: 8110c3fa
[  202.156251] RAX:  RBX: 888bfeb8 RCX: 888bfeb80

Thanks.

Tim


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-12 Thread Aubrey Li
On Tue, Mar 12, 2019 at 3:45 PM Aubrey Li  wrote:
>
> On Tue, Mar 12, 2019 at 7:36 AM Subhra Mazumdar
>  wrote:
> >
> >
> > On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
> > >
> > > On 3/10/19 9:23 PM, Aubrey Li wrote:
> > >> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
> > >>  wrote:
> > >>> expected. Most of the performance recovery happens in patch 15 which,
> > >>> unfortunately, is also the one that introduces the hard lockup.
> > >>>
> > >> After applying Subhra's patch, the following is triggered by enabling
> > >> core sched when a cgroup is
> > >> under heavy load.
> > >>
> > > It seems you are facing some other deadlock where printk is involved.
> > > Can you
> > > drop the last patch (patch 16 sched: Debug bits...) and try?
> > >
> > > Thanks,
> > > Subhra
> > >
> > Never mind, I am seeing the same lockdep deadlock output even w/o patch
> > 16. Btw
> > the NULL fix had something missing,
>
> One more NULL pointer dereference:
>
> Mar 12 02:24:46 aubrey-ivb kernel: [  201.916741] core sched enabled
> [  201.950203] BUG: unable to handle kernel NULL pointer dereference
> at 0008
> [  201.950254] [ cut here ]
> [  201.959045] #PF error: [normal kernel read fault]
> [  201.964272] !se->on_rq
> [  201.964287] WARNING: CPU: 22 PID: 2965 at kernel/sched/fair.c:6849
> set_next_buddy+0x52/0x70

A quick workaround below:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d0dac4fd94f..ef6acfe2cf7d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6834,7 +6834,7 @@ static void set_last_buddy(struct sched_entity *se)
return;

for_each_sched_entity(se) {
-   if (SCHED_WARN_ON(!se->on_rq))
+   if (SCHED_WARN_ON(!(se && se->on_rq)))
return;
cfs_rq_of(se)->last = se;
}
@@ -6846,7 +6846,7 @@ static void set_next_buddy(struct sched_entity *se)
return;

for_each_sched_entity(se) {
-   if (SCHED_WARN_ON(!se->on_rq))
+   if (SCHED_WARN_ON(!(se && se->on_rq)))
return;
cfs_rq_of(se)->next = se;
}

And now I'm running into a hard LOCKUP:

[  326.336279] NMI watchdog: Watchdog detected hard LOCKUP on cpu 31
[  326.336280] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  326.336311] irq event stamp: 164460
[  326.336312] hardirqs last  enabled at (164459):
[] sched_core_balance+0x247/0x470
[  326.336312] hardirqs last disabled at (164460):
[] sched_core_balance+0x113/0x470
[  326.336313] softirqs last  enabled at (164250):
[] __do_softirq+0x359/0x40a
[  326.336314] softirqs last disabled at (164213):
[] irq_exit+0xc1/0xd0
[  326.336315] CPU: 31 PID: 0 Comm: swapper/31 Tainted: G  I
5.0.0-rc8-00542-gd697415be692-dirty #15
[  326.336316] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  326.336317] RIP: 0010:native_queued_spin_lock_slowpath+0x18f/0x1c0
[  326.336318] Code: c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6
48 05 80 51 1e 00 48 03 04 f5 40 58 39 82 48 89 10 8b 42 08 85 c0 75
09 f3 90 <8b> 42 08 85 c0 74 f7 4b
[  326.336318] RSP: :c9000643bd58 EFLAGS: 0046
[  326.336319] RAX:  RBX: 888c0ade4400 RCX: 0080
[  326.336320] RDX: 88980bbe5180 RSI: 0019 RDI: 888c0ade4400
[  326.336321] RBP: 888c0ade4400 R08: 0080 R09: 001e3a80
[  326.336321] R10: c9000643bd08 R11:  R12: 
[  326.336322] R13:  R14: 88980bbe4400 R15: 001f
[  326.336323] FS:  () GS:88980ba0()
knlGS:
[  326.336323] CS:  0010 DS:  ES:  CR0: 80050033
[  326.336324] CR2: 7fdcd7fd7728 CR3: 0017e821a001 CR4: 000606e0
[  326.336325] Call Trace:
[  326.336325]  do_raw_spin_lock+0xab/0xb0
[  326.336326]  _raw_spin_lock+0x4b/0x60
[  326.336326]  double_rq_lock+0x99/0x140
[  326.336327]  sched_core_balance+0x11e/0x470
[  326.336327]  __balance_callback+0x49/0xa0
[  326.336328]  __schedule+0x1113/0x1570
[  326.336328]  schedule_idle+0x1e/0x40
[  326.336329]  do_idle+0x16b/0x2a0
[  326.336329]  cpu_startup_entry+0x19/0x20
[  326.336330]  start_secondary+0x17f/0x1d0
[  326.336331]  secondary_startup_64+0xa4/0xb0
[  330.959367] ---[ end Kernel panic - not syncing: Hard LOCKUP ]---


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-12 Thread Pawan Gupta
Hi,

With core scheduling, LTP reports two new failures related to
cgroups (memcg_stat_rss and memcg_move_charge_at_immigrate). I will try to
debug them.

Also "perf sched map" indicates there might be a small window when two
processes in different cgroups run together on one core.
In the case below, B0 and D0 (stress-ng-cpu and sysbench) belong to two
different cgroups with cpu.tag enabled.

$ perf sched map

  *A0 382.266600 secs A0 => kworker/0:1-eve:51
  *B0 382.266612 secs B0 => stress-ng-cpu:7956
  *A0 382.394597 secs 
  *B0 382.394609 secs 
   B0 *C0 382.494459 secs C0 => i915/signal:0:450
   B0 *D0 382.494468 secs D0 => sysbench:8088
  *.   D0 382.494472 secs .  => swapper:0
   .  *C0 383.095787 secs 
  *B0  C0 383.095792 secs 
   B0 *D0 383.095820 secs
  *A0  D0 383.096587 secs

In some cases I don't see an IPI getting sent to the sibling CPU when two
incompatible processes are picked. As in the logs below at timestamp
382.146250, "stress-ng-cpu" is picked while "sysbench" is running on the
sibling CPU.

  kworker/0:1-51[000] d...   382.146246: __schedule: cpu(0): selected: 
stress-ng-cpu/7956 9945bad29200
  kworker/0:1-51[000] d...   382.146246: __schedule: max: 
stress-ng-cpu/7956 9945bad29200
  kworker/0:1-51[000] d...   382.146247: __prio_less: (swapper/4/0;140,0,0) 
?< (sysbench/8088;140,34783671987,0)
  kworker/0:1-51[000] d...   382.146248: __prio_less: 
(stress-ng-cpu/7956;119,34817170203,0) ?< (sysbench/8088;119,34783671987,0)
  kworker/0:1-51[000] d...   382.146249: __schedule: cpu(4): selected: 
sysbench/8088 9945a7405200
  kworker/0:1-51[000] d...   382.146249: __prio_less: 
(stress-ng-cpu/7956;119,34817170203,0) ?< (sysbench/8088;119,34783671987,0)
  kworker/0:1-51[000] d...   382.146250: __schedule: picked: 
stress-ng-cpu/7956 9945bad29200
  kworker/0:1-51[000] d...   382.146251: __switch_to: Pawan: cpu(0) 
switching to stress-ng-cpu
  kworker/0:1-51[000] d...   382.146251: __switch_to: Pawan: cpu(4) running 
sysbench
stress-ng-cpu-7956  [000] dN..   382.274234: __schedule: cpu(0): selected: 
kworker/0:1/51 0
stress-ng-cpu-7956  [000] dN..   382.274235: __schedule: max: kworker/0:1/51 0
stress-ng-cpu-7956  [000] dN..   382.274235: __schedule: cpu(4): selected: 
sysbench/8088 9945a7405200
stress-ng-cpu-7956  [000] dN..   382.274237: __prio_less: 
(kworker/0:1/51;119,50744489595,0) ?< (sysbench/8088;119,34911643157,0)
stress-ng-cpu-7956  [000] dN..   382.274237: __schedule: picked: kworker/0:1/51 0
stress-ng-cpu-7956  [000] d...   382.274239: __switch_to: Pawan: cpu(0) 
switching to kworker/0:1
stress-ng-cpu-7956  [000] d...   382.274239: __switch_to: Pawan: cpu(4) running 
sysbench

-Pawan


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-12 Thread Aubrey Li
On Tue, Mar 12, 2019 at 7:36 AM Subhra Mazumdar
 wrote:
>
>
> On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
> >
> > On 3/10/19 9:23 PM, Aubrey Li wrote:
> >> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
> >>  wrote:
> >>> expected. Most of the performance recovery happens in patch 15 which,
> >>> unfortunately, is also the one that introduces the hard lockup.
> >>>
> >> After applying Subhra's patch, the following is triggered by enabling
> >> core sched when a cgroup is
> >> under heavy load.
> >>
> > It seems you are facing some other deadlock where printk is involved.
> > Can you
> > drop the last patch (patch 16 sched: Debug bits...) and try?
> >
> > Thanks,
> > Subhra
> >
> Never mind, I am seeing the same lockdep deadlock output even w/o patch
> 16. Btw
> the NULL fix had something missing,

One more NULL pointer dereference:

Mar 12 02:24:46 aubrey-ivb kernel: [  201.916741] core sched enabled
[  201.950203] BUG: unable to handle kernel NULL pointer dereference
at 0008
[  201.950254] [ cut here ]
[  201.959045] #PF error: [normal kernel read fault]
[  201.964272] !se->on_rq
[  201.964287] WARNING: CPU: 22 PID: 2965 at kernel/sched/fair.c:6849
set_next_buddy+0x52/0x70
[  201.969596] PGD 800be9ed7067 P4D 800be9ed7067 PUD c00911067 PMD 0
[  201.972300] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  201.981712] Oops:  [#1] SMP PTI
[  201.989463] CPU: 22 PID: 2965 Comm: schbench Tainted: G  I
 5.0.0-rc8-00542-gd697415be692-dirty #13
[  202.074710] CPU: 27 PID: 2947 Comm: schbench Tainted: G  I
 5.0.0-rc8-00542-gd697415be692-dirty #13
[  202.078662] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  202.078674] RIP: 0010:set_next_buddy+0x52/0x70
[  202.090135] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  202.090144] RIP: 0010:rb_insert_color+0x17/0x190
[  202.101623] Code: 48 85 ff 74 10 8b 47 40 85 c0 75 e2 80 3d 9e e5
6a 01 00 74 02 f3 c3 48 c7 c7 5c 05 2c 82 c6 05 8c e5 6a 01 01 e8 2e
bb fb ff <0f> 0b c3 83 bf 04 03 0e
[  202.113216] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00
00 00 00 48 8b 17 48 85 d2 0f 84 4d 01 00 00 48 8b 02 a8 01 0f 85 6d
01 00 00 <48> 8b 48 08 49 89 c0 44
[  202.118263] RSP: 0018:c9000a5cbbb0 EFLAGS: 00010086
[  202.129858] RSP: 0018:c9000a463cc0 EFLAGS: 00010046
[  202.135102] RAX:  RBX: 88980047e800 RCX: 
[  202.135105] RDX: 888be28caa40 RSI: 0001 RDI: 8110c3fa
[  202.156251] RAX:  RBX: 888bfeb8 RCX: 888bfeb8
[  202.156255] RDX: 888be28c8348 RSI: 88980b5e50c8 RDI: 888bfeb80348
[  202.177390] RBP: 88980047ea00 R08:  R09: 001e3a80
[  202.177393] R10: c9000a5cbb28 R11:  R12: 888c0b9e4400
[  202.183317] RBP: 88980b5e4400 R08: 014f R09: 8898049cf000
[  202.183320] R10: 0078 R11: 8898049cfc5c R12: 0004
[  202.189241] R13: 888be28caa40 R14: 0009 R15: 0009
[  202.189245] FS:  7f05f87f8700() GS:888c0b80()
knlGS:
[  202.197310] R13: c9000a463d20 R14: 0246 R15: 001c
[  202.197314] FS:  7f0611cca700() GS:88980b20()
knlGS:
[  202.205373] CS:  0010 DS:  ES:  CR0: 80050033
[  202.205377] CR2: 7f05e9fdb728 CR3: 000be4d0e006 CR4: 000606e0
[  202.213441] CS:  0010 DS:  ES:  CR0: 80050033
[  202.213444] CR2: 0008 CR3: 000be4d0e005 CR4: 000606e0
[  202.221509] Call Trace:
[  202.229574] Call Trace:
[  202.237640]  dequeue_task_fair+0x7e/0x1b0
[  202.245700]  enqueue_task+0x6f/0xb0
[  202.253761]  __schedule+0xcc8/0x1570
[  202.261823]  ttwu_do_activate+0x6a/0xc0
[  202.270985]  schedule+0x28/0x70
[  202.279042]  try_to_wake_up+0x20b/0x510
[  202.288206]  futex_wait_queue_me+0xbf/0x130
[  202.294714]  wake_up_q+0x3f/0x80
[  202.302773]  futex_wait+0xeb/0x240
[  202.309282]  futex_wake+0x157/0x180
[  202.317353]  ? __switch_to_asm+0x40/0x70
[  202.320158]  do_futex+0x451/0xad0
[  202.322970]  ? __switch_to_asm+0x34/0x70
[  202.322980]  ? __switch_to_asm+0x40/0x70
[  202.327541]  ? do_nanosleep+0xcc/0x1a0
[  202.331521]  do_futex+0x479/0xad0
[  202.335599]  ? hrtimer_nanosleep+0xe7/0x230
[  202.339954]  ? lockdep_hardirqs_on+0xf0/0x180
[  202.343548]  __x64_sys_futex+0x134/0x180
[  202.347906]  ? _raw_spin_unlock_irq+0x29/0x40
[  202.352660]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[  202.356343]  ? finish_task_switch+0x9a/0x2c0
[  202.360228]  do_syscall_64+0x60/0x1b0
[  202.364197]  ? __schedule+0xbcd/0x1570
[  202.368663]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  202.372448]  

Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-12 Thread Aaron Lu
On Mon, Mar 11, 2019 at 05:20:19PM -0700, Greg Kerr wrote:
> On Mon, Mar 11, 2019 at 4:36 PM Subhra Mazumdar
>  wrote:
> >
> >
> > On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
> > >
> > > On 3/10/19 9:23 PM, Aubrey Li wrote:
> > >> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
> > >>  wrote:
> > >>> expected. Most of the performance recovery happens in patch 15 which,
> > >>> unfortunately, is also the one that introduces the hard lockup.
> > >>>
> > >> After applying Subhra's patch, the following is triggered by enabling
> > >> core sched when a cgroup is
> > >> under heavy load.
> > >>
> > > It seems you are facing some other deadlock where printk is involved.
> > > Can you
> > > drop the last patch (patch 16 sched: Debug bits...) and try?
> > >
> > > Thanks,
> > > Subhra
> > >
> > Never mind, I am seeing the same lockdep deadlock output even w/o patch
> > 16. BTW,
> > the NULL fix had something missing; the following works.
> 
> Is this panic below, which occurs when I tag the first process,
> related or known? If not, I will debug it tomorrow.
> 
> [   46.831828] BUG: unable to handle kernel NULL pointer dereference
> at 
> [   46.831829] core sched enabled
> [   46.834261] #PF error: [WRITE]
> [   46.834899] PGD 0 P4D 0
> [   46.835438] Oops: 0002 [#1] SMP PTI
> [   46.836158] CPU: 0 PID: 11 Comm: migration/0 Not tainted
> 5.0.0everyday-glory-03949-g2d8fdbb66245-dirty #7
> [   46.838206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.10.2-1 04/01/2014

Probably due to SMT not being enabled for this qemu setup.

rq->core can be NULL for CPU0: sched_cpu_starting() won't be called for
CPU0, and since it doesn't have any siblings, its rq->core remains
uninitialized (NULL).

> [   46.839844] RIP: 0010:_raw_spin_lock+0x7/0x20
> [   46.840448] Code: 00 00 00 65 81 05 25 ca 5c 51 00 02 00 00 31 c0
> ba ff 00 00 00 f0 0f b1 17 74 05 e9 93 80 46 ff f3 c3 90 31 c0 ba 01
> 00 00 00  0f b1 17 74 07 89 c6 e9 1c 6e 46 ff f3 c3 66 2e 0f 1f 84
> 00 00
> [   46.843000] RSP: 0018:b9d300cabe38 EFLAGS: 00010046
> [   46.843744] RAX:  RBX:  RCX: 
> 0004
> [   46.844709] RDX: 0001 RSI: aea435ae RDI: 
> 
> [   46.845689] RBP: b9d300cabed8 R08:  R09: 
> 00020800
> [   46.846651] R10: af603ea0 R11: 0001 R12: 
> af6576c0
> [   46.847619] R13: 9a57366c8000 R14: 9a5737401300 R15: 
> ade868f0
> [   46.848584] FS:  () GS:9a5737a0()
> knlGS:
> [   46.849680] CS:  0010 DS:  ES:  CR0: 80050033
> [   46.850455] CR2:  CR3: 0001d36fa000 CR4: 
> 06f0
> [   46.851415] DR0:  DR1:  DR2: 
> 
> [   46.852371] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [   46.853326] Call Trace:
> [   46.853678]  __schedule+0x139/0x11f0
> [   46.854167]  ? cpumask_next+0x16/0x20
> [   46.854668]  ? cpu_stop_queue_work+0xc0/0xc0
> [   46.855252]  ? sort_range+0x20/0x20
> [   46.855742]  schedule+0x4e/0x60
> [   46.856171]  smpboot_thread_fn+0x12a/0x160
> [   46.856725]  kthread+0x112/0x120
> [   46.857164]  ? kthread_stop+0xf0/0xf0
> [   46.857661]  ret_from_fork+0x35/0x40
> [   46.858146] Modules linked in:
> [   46.858562] CR2: 
> [   46.859022] ---[ end trace e9fff08f17bfd2be ]---


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-11 Thread Subhra Mazumdar



On 3/11/19 5:20 PM, Greg Kerr wrote:

On Mon, Mar 11, 2019 at 4:36 PM Subhra Mazumdar
 wrote:


On 3/11/19 11:34 AM, Subhra Mazumdar wrote:

On 3/10/19 9:23 PM, Aubrey Li wrote:

On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
 wrote:

expected. Most of the performance recovery happens in patch 15 which,
unfortunately, is also the one that introduces the hard lockup.


After applying Subhra's patch, the following is triggered by enabling
core sched when a cgroup is
under heavy load.


It seems you are facing some other deadlock where printk is involved.
Can you
drop the last patch (patch 16 sched: Debug bits...) and try?

Thanks,
Subhra


Never mind, I am seeing the same lockdep deadlock output even w/o patch
16. BTW,
the NULL fix had something missing; the following works.

Is this panic below, which occurs when I tag the first process,
related or known? If not, I will debug it tomorrow.

[   46.831828] BUG: unable to handle kernel NULL pointer dereference
at 
[   46.831829] core sched enabled
[   46.834261] #PF error: [WRITE]
[   46.834899] PGD 0 P4D 0
[   46.835438] Oops: 0002 [#1] SMP PTI
[   46.836158] CPU: 0 PID: 11 Comm: migration/0 Not tainted
5.0.0everyday-glory-03949-g2d8fdbb66245-dirty #7
[   46.838206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.10.2-1 04/01/2014
[   46.839844] RIP: 0010:_raw_spin_lock+0x7/0x20
[   46.840448] Code: 00 00 00 65 81 05 25 ca 5c 51 00 02 00 00 31 c0
ba ff 00 00 00 f0 0f b1 17 74 05 e9 93 80 46 ff f3 c3 90 31 c0 ba 01
00 00 00  0f b1 17 74 07 89 c6 e9 1c 6e 46 ff f3 c3 66 2e 0f 1f 84
00 00
[   46.843000] RSP: 0018:b9d300cabe38 EFLAGS: 00010046
[   46.843744] RAX:  RBX:  RCX: 0004
[   46.844709] RDX: 0001 RSI: aea435ae RDI: 
[   46.845689] RBP: b9d300cabed8 R08:  R09: 00020800
[   46.846651] R10: af603ea0 R11: 0001 R12: af6576c0
[   46.847619] R13: 9a57366c8000 R14: 9a5737401300 R15: ade868f0
[   46.848584] FS:  () GS:9a5737a0()
knlGS:
[   46.849680] CS:  0010 DS:  ES:  CR0: 80050033
[   46.850455] CR2:  CR3: 0001d36fa000 CR4: 06f0
[   46.851415] DR0:  DR1:  DR2: 
[   46.852371] DR3:  DR6: fffe0ff0 DR7: 0400
[   46.853326] Call Trace:
[   46.853678]  __schedule+0x139/0x11f0
[   46.854167]  ? cpumask_next+0x16/0x20
[   46.854668]  ? cpu_stop_queue_work+0xc0/0xc0
[   46.855252]  ? sort_range+0x20/0x20
[   46.855742]  schedule+0x4e/0x60
[   46.856171]  smpboot_thread_fn+0x12a/0x160
[   46.856725]  kthread+0x112/0x120
[   46.857164]  ? kthread_stop+0xf0/0xf0
[   46.857661]  ret_from_fork+0x35/0x40
[   46.858146] Modules linked in:
[   46.858562] CR2: 
[   46.859022] ---[ end trace e9fff08f17bfd2be ]---

- Greg


This seems to be a different issue


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-11 Thread Greg Kerr
On Mon, Mar 11, 2019 at 4:36 PM Subhra Mazumdar
 wrote:
>
>
> On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
> >
> > On 3/10/19 9:23 PM, Aubrey Li wrote:
> >> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
> >>  wrote:
> >>> expected. Most of the performance recovery happens in patch 15 which,
> >>> unfortunately, is also the one that introduces the hard lockup.
> >>>
> >> After applying Subhra's patch, the following is triggered by enabling
> >> core sched when a cgroup is
> >> under heavy load.
> >>
> > It seems you are facing some other deadlock where printk is involved.
> > Can you
> > drop the last patch (patch 16 sched: Debug bits...) and try?
> >
> > Thanks,
> > Subhra
> >
> Never mind, I am seeing the same lockdep deadlock output even w/o patch
> 16. BTW,
> the NULL fix had something missing; the following works.

Is this panic below, which occurs when I tag the first process,
related or known? If not, I will debug it tomorrow.

[   46.831828] BUG: unable to handle kernel NULL pointer dereference
at 
[   46.831829] core sched enabled
[   46.834261] #PF error: [WRITE]
[   46.834899] PGD 0 P4D 0
[   46.835438] Oops: 0002 [#1] SMP PTI
[   46.836158] CPU: 0 PID: 11 Comm: migration/0 Not tainted
5.0.0everyday-glory-03949-g2d8fdbb66245-dirty #7
[   46.838206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.10.2-1 04/01/2014
[   46.839844] RIP: 0010:_raw_spin_lock+0x7/0x20
[   46.840448] Code: 00 00 00 65 81 05 25 ca 5c 51 00 02 00 00 31 c0
ba ff 00 00 00 f0 0f b1 17 74 05 e9 93 80 46 ff f3 c3 90 31 c0 ba 01
00 00 00  0f b1 17 74 07 89 c6 e9 1c 6e 46 ff f3 c3 66 2e 0f 1f 84
00 00
[   46.843000] RSP: 0018:b9d300cabe38 EFLAGS: 00010046
[   46.843744] RAX:  RBX:  RCX: 0004
[   46.844709] RDX: 0001 RSI: aea435ae RDI: 
[   46.845689] RBP: b9d300cabed8 R08:  R09: 00020800
[   46.846651] R10: af603ea0 R11: 0001 R12: af6576c0
[   46.847619] R13: 9a57366c8000 R14: 9a5737401300 R15: ade868f0
[   46.848584] FS:  () GS:9a5737a0()
knlGS:
[   46.849680] CS:  0010 DS:  ES:  CR0: 80050033
[   46.850455] CR2:  CR3: 0001d36fa000 CR4: 06f0
[   46.851415] DR0:  DR1:  DR2: 
[   46.852371] DR3:  DR6: fffe0ff0 DR7: 0400
[   46.853326] Call Trace:
[   46.853678]  __schedule+0x139/0x11f0
[   46.854167]  ? cpumask_next+0x16/0x20
[   46.854668]  ? cpu_stop_queue_work+0xc0/0xc0
[   46.855252]  ? sort_range+0x20/0x20
[   46.855742]  schedule+0x4e/0x60
[   46.856171]  smpboot_thread_fn+0x12a/0x160
[   46.856725]  kthread+0x112/0x120
[   46.857164]  ? kthread_stop+0xf0/0xf0
[   46.857661]  ret_from_fork+0x35/0x40
[   46.858146] Modules linked in:
[   46.858562] CR2: 
[   46.859022] ---[ end trace e9fff08f17bfd2be ]---

- Greg

>
> ->8
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d0dac4..27cbc64 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4131,7 +4131,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  	 * Avoid running the skip buddy, if running something else can
>  	 * be done without getting too unfair.
>  	 */
> -	if (cfs_rq->skip == se) {
> +	if (cfs_rq->skip && cfs_rq->skip == se) {
>  		struct sched_entity *second;
>  
>  		if (se == curr) {
> @@ -4149,13 +4149,15 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  	/*
>  	 * Prefer last buddy, try to return the CPU to a preempted task.
>  	 */
> -	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
> +	if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
>  		se = cfs_rq->last;
>  
>  	/*
>  	 * Someone really wants this to run. If it's not unfair, run it.
>  	 */
> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
> +	if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
> 		se = cfs_rq->next;
>  
>  	clear_buddies(cfs_rq, se);
> @@ -6958,6 +6960,9 @@ pick_task_fair(struct rq *rq)
>  
>  	se = pick_next_entity(cfs_rq, NULL);
>  
> +	if (!(se || curr))
> +		return NULL;
> +
>  	if (curr) {
>  		if (se && curr->on_rq)
>  			update_curr(cfs_rq);
> 


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-11 Thread Subhra Mazumdar



On 3/11/19 11:34 AM, Subhra Mazumdar wrote:


On 3/10/19 9:23 PM, Aubrey Li wrote:

On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
 wrote:

expected. Most of the performance recovery happens in patch 15 which,
unfortunately, is also the one that introduces the hard lockup.


After applying Subhra's patch, the following is triggered by enabling
core sched when a cgroup is
under heavy load.

It seems you are facing some other deadlock where printk is involved.
Can you drop the last patch (patch 16 sched: Debug bits...) and try?

Thanks,
Subhra

Never mind, I am seeing the same lockdep deadlock output even w/o patch
16. BTW, the NULL fix had something missing; the following works.

->8

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d0dac4..27cbc64 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4131,7 +4131,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	 * Avoid running the skip buddy, if running something else can
 	 * be done without getting too unfair.
 	 */
-	if (cfs_rq->skip == se) {
+	if (cfs_rq->skip && cfs_rq->skip == se) {
 		struct sched_entity *second;
 
 		if (se == curr) {
@@ -4149,13 +4149,15 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	/*
 	 * Prefer last buddy, try to return the CPU to a preempted task.
 	 */
-	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
+	if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
 		se = cfs_rq->last;
 
 	/*
 	 * Someone really wants this to run. If it's not unfair, run it.
 	 */
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+	if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
 		se = cfs_rq->next;
 
 	clear_buddies(cfs_rq, se);
@@ -6958,6 +6960,9 @@ pick_task_fair(struct rq *rq)
 
 	se = pick_next_entity(cfs_rq, NULL);
 
+	if (!(se || curr))
+		return NULL;
+
 	if (curr) {
 		if (se && curr->on_rq)
 			update_curr(cfs_rq);



Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-11 Thread Subhra Mazumdar



On 3/10/19 9:23 PM, Aubrey Li wrote:

On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
 wrote:

expected. Most of the performance recovery happens in patch 15 which,
unfortunately, is also the one that introduces the hard lockup.


After applying Subhra's patch, the following is triggered by enabling
core sched when a cgroup is
under heavy load.

It seems you are facing some other deadlock where printk is involved.
Can you drop the last patch (patch 16 sched: Debug bits...) and try?

Thanks,
Subhra



Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-10 Thread Aubrey Li
On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
 wrote:
>
> expected. Most of the performance recovery happens in patch 15 which,
> unfortunately, is also the one that introduces the hard lockup.
>

After applying Subhra's patch, the following is triggered by enabling
core sched when a cgroup is
under heavy load.

Mar 10 22:46:57 aubrey-ivb kernel: [ 2662.973792] core sched enabled
[ 2663.348371] WARNING: CPU: 5 PID: 3087 at kernel/sched/pelt.h:119
update_load_avg+00
[ 2663.357960] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_ni
[ 2663.443269] CPU: 5 PID: 3087 Comm: schbench Tainted: G  I
5.0.0-rc8-7
[ 2663.454520] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.2
[ 2663.466063] RIP: 0010:update_load_avg+0x52/0x5e0
[ 2663.471286] Code: 8b af 70 01 00 00 8b 3d 14 a6 6e 01 85 ff 74 1c
e9 4c 04 00 00 40
[ 2663.492350] RSP: :c9000a6a3dd8 EFLAGS: 00010046
[ 2663.498276] RAX:  RBX: 888be7937600 RCX: 0001
[ 2663.506337] RDX:  RSI: 888c09fe4418 RDI: 0046
[ 2663.514398] RBP: 888bdfb8aac0 R08:  R09: 888bdfb9aad8
[ 2663.522459] R10:  R11:  R12: 
[ 2663.530520] R13: 888c09fe4400 R14: 0001 R15: 888bdfb8aa40
[ 2663.538582] FS:  7f006a7cc700() GS:888c0a60()
knlGS:000
[ 2663.547739] CS:  0010 DS:  ES:  CR0: 80050033
[ 2663.554241] CR2: 00604048 CR3: 000bfdd64006 CR4: 000606e0
[ 2663.562310] Call Trace:
[ 2663.565128]  ? update_load_avg+0xa6/0x5e0
[ 2663.569690]  ? update_load_avg+0xa6/0x5e0
[ 2663.574252]  set_next_entity+0xd9/0x240
[ 2663.578619]  set_next_task_fair+0x6e/0xa0
[ 2663.583182]  __schedule+0x12af/0x1570
[ 2663.587350]  schedule+0x28/0x70
[ 2663.590937]  exit_to_usermode_loop+0x61/0xf0
[ 2663.595791]  prepare_exit_to_usermode+0xbf/0xd0
[ 2663.600936]  retint_user+0x8/0x18
[ 2663.604719] RIP: 0033:0x402057
[ 2663.608209] Code: 24 10 64 48 8b 04 25 28 00 00 00 48 89 44 24 38
31 c0 e8 2c eb ff
[ 2663.629351] RSP: 002b:7f006a7cbe50 EFLAGS: 0246 ORIG_RAX:
ff02
[ 2663.637924] RAX: 0029778f RBX: 002dc6c0 RCX: 0002
[ 2663.645985] RDX: 7f006a7cbe60 RSI:  RDI: 7f006a7cbe50
[ 2663.654046] RBP: 0006 R08: 0001 R09: 7ffe965450a0
[ 2663.662108] R10: 7f006a7cbe30 R11: 0003b368 R12: 7f006a7cbed0
[ 2663.670160] R13: 7f0098c1ce6f R14:  R15: 7f0084a30390
[ 2663.678226] irq event stamp: 27182
[ 2663.682114] hardirqs last  enabled at (27181): []
exit_to_usermo0
[ 2663.692348] hardirqs last disabled at (27182): []
__schedule+0xd0
[ 2663.701716] softirqs last  enabled at (27004): []
__do_softirq+0a
[ 2663.711268] softirqs last disabled at (26999): []
irq_exit+0xc1/0
[ 2663.720247] ---[ end trace d46e59b84bcde977 ]---
[ 2663.725503] BUG: unable to handle kernel paging request at 005df5f0
[ 2663.733377] #PF error: [WRITE]
[ 2663.736875] PGD 800bff037067 P4D 800bff037067 PUD bff0b1067
PMD bfbf02067 0
[ 2663.745954] Oops: 0002 [#1] SMP PTI
[ 2663.749931] CPU: 5 PID: 3078 Comm: schbench Tainted: GW I
5.0.0-rc8-7
[ 2663.761233] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.2
[ 2663.772836] RIP: 0010:native_queued_spin_lock_slowpath+0x183/0x1c0
[ 2663.779827] Code: f3 90 48 8b 32 48 85 f6 74 f6 eb e8 c1 ee 12 83
e0 03 83 ee 01 42
[ 2663.800970] RSP: :c9000a633e18 EFLAGS: 00010006
[ 2663.806892] RAX: 005df5f0 RBX: 888bdfbf2a40 RCX: 0018
[ 2663.814954] RDX: 888c0a7e5180 RSI: 1fff RDI: 888bdfbf2a40
[ 2663.823015] RBP: 888bdfbf2a40 R08: 0018 R09: 0001
[ 2663.831068] R10: c9000a633dc0 R11: 888bdfbf2a58 R12: 0046
[ 2663.839129] R13: 888bdfb8aa40 R14: 888be5b90d80 R15: 888be5b90d80
[ 2663.847182] FS:  7f00797ea700() GS:888c0a60()
knlGS:000
[ 2663.856330] CS:  0010 DS:  ES:  CR0: 80050033
[ 2663.862834] CR2: 005df5f0 CR3: 000bfdd64006 CR4: 000606e0
[ 2663.870895] Call Trace:
[ 2663.873715]  do_raw_spin_lock+0xab/0xb0
[ 2663.878095]  _raw_spin_lock_irqsave+0x63/0x80
[ 2663.883066]  __balance_callback+0x19/0xa0
[ 2663.887626]  __schedule+0x1113/0x1570
[ 2663.891803]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 2663.897142]  ? apic_timer_interrupt+0xa/0x20
[ 2663.901996]  ? interrupt_entry+0x9a/0xe0
[ 2663.906450]  ? apic_timer_interrupt+0xa/0x20
[ 2663.911307] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_ni
[ 2663.996886] CR2: 005df5f0
[ 2664.000686] ---[ end trace d46e59b84bcde978 ]---
[ 2664.011393] RIP: 0010:native_queued_spin_lock_slowpath+0x183/0x1c0
[ 2664.018386] Code: f3 90 48 8b 32 48 85 f6 74 f6 eb e8 c1 ee 12 83
e0 03 83 ee 01 42
[ 2664.039529] RSP: :c9000a633e18 

Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-08 Thread Subhra Mazumdar



On 2/22/19 4:45 AM, Mel Gorman wrote:

On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:

On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  wrote:

However; whichever way around you turn this cookie; it is expensive and nasty.

Do you (or anybody else) have numbers for real loads?

Because performance is all that matters. If performance is bad, then
it's pointless, since just turning off SMT is the answer.


I tried to do a comparison between tip/master, ht disabled and this series
putting test workloads into a tagged cgroup but unfortunately it failed

[  156.978682] BUG: unable to handle kernel NULL pointer dereference at 
0058
[  156.986597] #PF error: [normal kernel read fault]
[  156.991343] PGD 0 P4D 0
[  156.993905] Oops:  [#1] SMP PTI
[  156.997438] CPU: 15 PID: 0 Comm: swapper/15 Not tainted 
5.0.0-rc7-schedcore-v1r1 #1
[  157.005161] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 
05/09/2016
[  157.012896] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
[  157.018613] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 
e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00
  53 48 89 fb <48> 2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 
01
[  157.037544] RSP: 0018:c9000c5bbde8 EFLAGS: 00010086
[  157.042819] RAX: 88810f5f6a00 RBX: 0001547f175c RCX: 0001
[  157.050015] RDX: 88bf3bdb0a40 RSI:  RDI: 0001547f175c
[  157.057215] RBP: 88bf7fae32c0 R08: 0001e358 R09: 88810fb9f000
[  157.064410] R10: c9000c5bbe08 R11: 88810fb9f5c4 R12: 
[  157.071611] R13: 88bf4e3ea0c0 R14:  R15: 88bf4e3ea7a8
[  157.078814] FS:  () GS:88bf7f5c() 
knlGS:
[  157.086977] CS:  0010 DS:  ES:  CR0: 80050033
[  157.092779] CR2: 0058 CR3: 0220e005 CR4: 003606e0
[  157.099979] DR0:  DR1:  DR2: 
[  157.109529] DR3:  DR6: fffe0ff0 DR7: 0400
[  157.119058] Call Trace:
[  157.123865]  pick_next_entity+0x61/0x110
[  157.130137]  pick_task_fair+0x4b/0x90
[  157.136124]  __schedule+0x365/0x12c0
[  157.141985]  schedule_idle+0x1e/0x40
[  157.147822]  do_idle+0x166/0x280
[  157.153275]  cpu_startup_entry+0x19/0x20
[  157.159420]  start_secondary+0x17a/0x1d0
[  157.165568]  secondary_startup_64+0xa4/0xb0
[  157.171985] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs msr 
intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm 
ipmi_ssif irqbypass crc32_pclmul ghash_clmulni_intel ixgbe aesni_intel 
xfrm_algo iTCO_wdt joydev iTCO_vendor_support libphy igb aes_x86_64 crypto_simd 
ptp cryptd mei_me mdio pps_core ioatdma glue_helper pcspkr ipmi_si lpc_ich 
i2c_i801 mei dca ipmi_devintf ipmi_msghandler acpi_pad pcc_cpufreq button btrfs 
libcrc32c xor zstd_decompress zstd_compress raid6_pq hid_generic usbhid ast 
i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops 
xhci_pci crc32c_intel ehci_pci ttm xhci_hcd ehci_hcd drm ahci usbcore mpt3sas 
libahci raid_class scsi_transport_sas wmi sg nbd dm_multipath dm_mod 
scsi_dh_rdac scsi_dh_emc scsi_dh_alua
[  157.258990] CR2: 0058
[  157.264961] ---[ end trace a301ac5e3ee86fde ]---
[  157.283719] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
[  157.291967] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 e7 03 
be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00 53 48 89 fb <48> 2b 5e 58 48 
85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
[  157.316121] RSP: 0018:c9000c5bbde8 EFLAGS: 00010086
[  157.324060] RAX: 88810f5f6a00 RBX: 0001547f175c RCX: 0001
[  157.333932] RDX: 88bf3bdb0a40 RSI:  RDI: 0001547f175c
[  157.343795] RBP: 88bf7fae32c0 R08: 0001e358 R09: 88810fb9f000
[  157.353634] R10: c9000c5bbe08 R11: 88810fb9f5c4 R12: 
[  157.363506] R13: 88bf4e3ea0c0 R14:  R15: 88bf4e3ea7a8
[  157.373395] FS:  () GS:88bf7f5c() 
knlGS:
[  157.384238] CS:  0010 DS:  ES:  CR0: 80050033
[  157.392709] CR2: 0058 CR3: 0220e005 CR4: 003606e0
[  157.402601] DR0:  DR1:  DR2: 
[  157.412488] DR3:  DR6: fffe0ff0 DR7: 0400
[  157.422334] Kernel panic - not syncing: Attempted to kill the idle task!
[  158.529804] Shutting down cpus with NMI
[  158.573249] Kernel Offset: disabled
[  158.586198] ---[ end Kernel panic - not syncing: Attempted to kill the idle 
task! ]---

RIP translates to kernel/sched/fair.c:6819

static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
 s64 gran, vdiff = curr->vruntime - se->vruntime; /* LINE 6819 */

 if (vdiff <= 0)
 

Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-07 Thread Paolo Bonzini
On 22/02/19 15:10, Peter Zijlstra wrote:
>> I agree on not bike shedding about the API, but can we agree on some of
>> the high level properties? For example, who generates the core
>> scheduling ids, what properties about them are enforced, etc.?
> It's an opaque cookie; the scheduler really doesn't care. All it does is
> ensure that tasks match or force idle within a core.
> 
> My previous patches got the cookie from a modified
> preempt_notifier_register/unregister() which passed the vcpu->kvm
> pointer into it from vcpu_load/put.
> 
> This auto-grouped VMs. It was also found to be somewhat annoying because
> apparently KVM does a lot of userspace assist for all sorts of nonsense
> and it would leave/re-join the cookie group for every single assist.
> Causing tons of rescheduling.

KVM doesn't do _that much_ userspace exiting in practice when VMs are
properly configured (if they're not, you probably don't care about core
scheduling).

However, note that KVM needs core scheduling groups to be defined at the
thread level; one group per process is not enough.  A VM has a bunch of
I/O threads and vCPU threads, and we want to set up core scheduling like
this:

+-----------------------------------------+
| VM 1   iothread1  iothread2             |
| +------------------+------------------+ |
| | vCPU0  vCPU1     |   vCPU2  vCPU3   | |
| +------------------+------------------+ |
+-----------------------------------------+

+-----------------------------------------+
| VM 2   iothread1  iothread2             |
| +------------------+------------------+ |
| | vCPU0  vCPU1     |   vCPU2  vCPU3   | |
| +------------------+------------------+ |
| | vCPU4  vCPU5     |   vCPU6  vCPU7   | |
| +------------------+------------------+ |
+-----------------------------------------+

where the iothreads need not be subject to core scheduling but the vCPUs
do.  If you don't place guest-sibling vCPUs in the same core scheduling
group, bad things happen.

The reason is that the guest might also be running a core scheduler, so
you could have:

- guest process 1 registering two threads A and B in the same group

- guest process 2 registering two threads C and D in the same group

- guest core scheduler placing thread A on vCPU0, thread B on vCPU1,
thread C on vCPU2, thread D on vCPU3

- host core scheduler deciding the four threads can be in physical cores
0-1, but physical core 0 gets A+C and physical core 1 gets B+D

- now process 2 shares cache with process 1. :(

Paolo


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-28 Thread Subhra Mazumdar



On 2/18/19 8:56 AM, Peter Zijlstra wrote:

A much 'demanded' feature: core-scheduling :-(

I still hate it with a passion, and that is part of why it took a little
longer than 'promised'.

While this one doesn't have all the 'features' of the previous (never
published) version and isn't L1TF 'complete', I tend to like the structure
better (relatively speaking: I hate it slightly less).

This one is sched class agnostic and therefore, in principle, doesn't horribly
wreck RT (in fact, RT could 'ab'use this by setting 'task->core_cookie = task'
to force-idle siblings).

Now, as hinted by that, there are semi sane reasons for actually having this.
Various hardware features like Intel RDT - Memory Bandwidth Allocation, work
per core (due to SMT fundamentally sharing caches) and therefore grouping
related tasks on a core makes it more reliable.

However; whichever way around you turn this cookie; it is expensive and nasty.


I am seeing the following hard lockup frequently now. Following is the full
kernel output:

[ 5846.412296] drop_caches (8657): drop_caches: 3
[ 5846.624823] drop_caches (8658): drop_caches: 3
[ 5850.604641] hugetlbfs: oracle (8671): Using mlock ulimits for SHM_HUGETLB is deprecated
[ 5962.930812] NMI watchdog: Watchdog detected hard LOCKUP on cpu 32
[ 5962.930814] Modules linked in: drbd lru_cache autofs4 cpufreq_powersave
ipv6 crc_ccitt mxm_wmi iTCO_wdt iTCO_vendor_support btrfs raid6_pq
zstd_compress zstd_decompress xor pcspkr i2c_i801 lpc_ich mfd_core ioatdma
ixgbe dca mdio sg ipmi_ssif i2c_core ipmi_si ipmi_msghandler wmi
pcc_cpufreq acpi_pad ext4 fscrypto jbd2 mbcache sd_mod ahci libahci nvme
nvme_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[ 5962.930828] CPU: 32 PID: 10333 Comm: oracle_10333_tp Not tainted
5.0.0-rc7core_sched #1
[ 5962.930828] Hardware name: Oracle Corporation ORACLE SERVER
X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016
[ 5962.930829] RIP: 0010:try_to_wake_up+0x98/0x470
[ 5962.930830] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 44 00 00 8b 43
3c 8b 73 60 85 f6 0f 85 a6 01 00 00 8b 43 38 85 c0 74 09 f3 90 8b 43 38
<85> c0 75 f7 48 8b 43 10 a8 02 b8 00 00 00 00 0f 85 d5 01 00 00 0f
[ 5962.930831] RSP: 0018:c9000f4dbcb8 EFLAGS: 0002
[ 5962.930832] RAX: 0001 RBX: 88dfb4af1680 RCX:
0041
[ 5962.930832] RDX: 0001 RSI:  RDI:
88dfb4af214c
[ 5962.930833] RBP:  R08: 0001 R09:
c9000f4dbd80
[ 5962.930833] R10: 8880 R11: ea00f0003d80 R12:
88dfb4af214c
[ 5962.930834] R13: 0001 R14: 0046 R15:
0001
[ 5962.930834] FS:  7ff4fabd9ae0() GS:88dfbe28()
knlGS:
[ 5962.930834] CS:  0010 DS:  ES:  CR0: 80050033
[ 5962.930835] CR2: 000f4cc84000 CR3: 003b93d36002 CR4:
003606e0
[ 5962.930835] DR0:  DR1:  DR2:

[ 5962.930836] DR3:  DR6: fffe0ff0 DR7:
0400
[ 5962.930836] Call Trace:
[ 5962.930837]  ? __switch_to_asm+0x34/0x70
[ 5962.930837]  ? __switch_to_asm+0x40/0x70
[ 5962.930838]  ? __switch_to_asm+0x34/0x70
[ 5962.930838]  autoremove_wake_function+0x11/0x50
[ 5962.930838]  __wake_up_common+0x8f/0x160
[ 5962.930839]  ? __switch_to_asm+0x40/0x70
[ 5962.930839]  __wake_up_common_lock+0x7c/0xc0
[ 5962.930840]  pipe_write+0x24e/0x3f0
[ 5962.930840]  __vfs_write+0x127/0x1b0
[ 5962.930840]  vfs_write+0xb3/0x1b0
[ 5962.930841]  ksys_write+0x52/0xc0
[ 5962.930841]  do_syscall_64+0x5b/0x170
[ 5962.930842]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 5962.930842] RIP: 0033:0x3b5900e7b0
[ 5962.930843] Code: 97 20 00 31 d2 48 29 c2 64 89 11 48 83 c8 ff eb ea 90
90 90 90 90 90 90 90 90 83 3d f1 db 20 00 00 75 10 b8 01 00 00 00 0f 05
<48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 5e fa ff ff 48 89 04 24
[ 5962.930843] RSP: 002b:7ffedbcd93a8 EFLAGS: 0246 ORIG_RAX:
0001
[ 5962.930844] RAX: ffda RBX: 7ff4faa86e24 RCX:
003b5900e7b0
[ 5962.930845] RDX: 028f RSI: 7ff4faa9688e RDI:
000a
[ 5962.930845] RBP: 7ffedbcd93c0 R08: 7ffedbcd9458 R09:
0020
[ 5962.930846] R10:  R11: 0246 R12:
7ffedbcd9458
[ 5962.930847] R13: 7ff4faa9688e R14: 7ff4faa89cc8 R15:
7ff4faa86bd0
[ 5962.930847] Kernel panic - not syncing: Hard LOCKUP
[ 5962.930848] CPU: 32 PID: 10333 Comm: oracle_10333_tp Not tainted
5.0.0-rc7core_sched #1
[ 5962.930848] Hardware name: Oracle Corporation ORACLE
SERVER X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016
[ 5962.930849] Call Trace:
[ 5962.930849]  
[ 5962.930849]  dump_stack+0x5c/0x7b
[ 5962.930850]  panic+0xfe/0x2b2
[ 5962.930850]  nmi_panic+0x35/0x40
[ 5962.930851]  watchdog_overflow_callback+0xef/0x100
[ 5962.930851]  __perf_event_overflow+0x5a/0xe0
[ 5962.930852]  handle_pmi_common+0x1d1/0x280
[ 5962.930852]  ? __set_pte_vaddr+0x32/0x50
[ 5962.930852]  ? 

Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-26 Thread Aubrey Li
On Tue, Feb 26, 2019 at 4:26 PM Aubrey Li  wrote:
>
> On Sat, Feb 23, 2019 at 3:27 AM Tim Chen  wrote:
> >
> > On 2/22/19 6:20 AM, Peter Zijlstra wrote:
> > > On Fri, Feb 22, 2019 at 01:17:01PM +0100, Paolo Bonzini wrote:
> > >> On 18/02/19 21:40, Peter Zijlstra wrote:
> > >>> On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> >  On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  
> >  wrote:
> > >
> > > However; whichever way around you turn this cookie; it is expensive 
> > > and nasty.
> > 
> >  Do you (or anybody else) have numbers for real loads?
> > 
> >  Because performance is all that matters. If performance is bad, then
> >  it's pointless, since just turning off SMT is the answer.
> > >>>
> > >>> Not for these patches; they stopped crashing only yesterday and I
> > >>> cleaned them up and send them out.
> > >>>
> > >>> The previous version; which was more horrible; but L1TF complete, was
> > >>> between OK-ish and horrible depending on the number of VMEXITs a
> > >>> workload had.
> > >>>
> > >>> If there were close to no VMEXITs, it beat smt=off, if there were lots
> > >>> of VMEXITs it was far far worse. Supposedly hosting people try their
> > >>> very bestest to have no VMEXITs so it mostly works for them (with the
> > >>> obvious exception of single VCPU guests).
> > >>
> > >> If you are giving access to dedicated cores to guests, you also let them
> > >> do PAUSE/HLT/MWAIT without vmexits and the host just thinks it's a CPU
> > >> bound workload.
> > >>
> > >> In any case, IIUC what you are looking for is:
> > >>
> > >> 1) take a benchmark that *is* helped by SMT, this will be something CPU
> > >> bound.
> > >>
> > >> 2) compare two runs, one without SMT and without core scheduler, and one
> > >> with SMT+core scheduler.
> > >>
> > >> 3) find out whether performance is helped by SMT despite the increased
> > >> overhead of the core scheduler
> > >>
> > >> Do you want some other load in the host, so that the scheduler actually
> > >> does do something?  Or is the point just that you show that the
> > >> performance isn't affected when the scheduler does not have anything to
> > >> do (which should be obvious, but having numbers is always better)?
> > >
> > > Well, what _I_ want is for all this to just go away :-)
> > >
> > > Tim did much of testing last time around; and I don't think he did
> > > core-pinning of VMs much (although I'm sure he did some of that). I'm
> >
> > Yes. The last time around I tested basic scenarios like:
> > 1. single VM pinned on a core
> > 2. 2 VMs pinned on a core
> > 3. system oversubscription (no pinning)
> >
> > In general, CPU bound benchmarks and even things without too much I/O
> > causing lots of VMexits perform better with HT than without for Peter's
> > last patchset.
> >
> > > still a complete virt noob; I can barely boot a VM to save my life.
> > >
> > > (you should be glad to not have heard my cursing at qemu cmdline when
> > > trying to reproduce some of Tim's results -- lets just say that I can
> > > deal with gpg)
> > >
> > > I'm sure he tried some oversubscribed scenarios without pinning.
> >
> > We did try some oversubscribed scenarios like SPECVirt, that tried to
> > squeeze tons of VMs on a single system in over subscription mode.
> >
> > There're two main problems in the last go around:
> >
> > 1. Workloads with a high rate of VMexits (SPECvirt is one)
> > were a major source of pain when we tried Peter's previous patchset.
> > The switch from vcpus to qemu and back in the previous version of Peter's patch
> > requires some coordination between the hyperthread siblings via IPI. And
> > for workloads that do this a lot, the overhead quickly added up.
> >
> > For Peter's new patch, this overhead hopefully would be reduced and give
> > better performance.
> >
> > 2. Load balancing is quite tricky.  Peter's last patchset did not have
> > load balancing for consolidating compatible running threads.
> > I did some non-sophisticated load balancing
> > to pair vcpus up.  But the constant vcpu migrations overhead probably ate up
> > any improvements from better load pairing.  So I didn't get much
> > improvement in the over-subscription case when turning on load balancing
> > to consolidate the VCPUs of the same VM. We'll probably have to try
> > out this incarnation of Peter's patch and see how well the load balancing
> > works.
> >
> > I'll try to line up some benchmarking folks to do some tests.
>
> I can help to do some basic tests.
>
> The cgroup tagging interface looks weird to me. If I have hundreds of cgroups,
> should I turn core scheduling (cpu.tag) on one by one? Or is there a global
> knob I missed?
>

I encountered the following panic when I turned core sched on for a
cgroup while that cgroup was running a best-effort workload with high
CPU utilization.

Feb 27 01:51:53 aubrey-ivb kernel: [  508.981348] core sched enabled
[  508.990627] BUG: unable to handle kernel NULL pointer dereference
at 008
[  

Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-26 Thread Aubrey Li
On Sat, Feb 23, 2019 at 3:27 AM Tim Chen  wrote:
>
> On 2/22/19 6:20 AM, Peter Zijlstra wrote:
> > On Fri, Feb 22, 2019 at 01:17:01PM +0100, Paolo Bonzini wrote:
> >> On 18/02/19 21:40, Peter Zijlstra wrote:
> >>> On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
>  On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  
>  wrote:
> >
> > However; whichever way around you turn this cookie; it is expensive and 
> > nasty.
> 
>  Do you (or anybody else) have numbers for real loads?
> 
>  Because performance is all that matters. If performance is bad, then
>  it's pointless, since just turning off SMT is the answer.
> >>>
> >>> Not for these patches; they stopped crashing only yesterday and I
> > >>> cleaned them up and sent them out.
> >>>
> >>> The previous version; which was more horrible; but L1TF complete, was
> >>> between OK-ish and horrible depending on the number of VMEXITs a
> >>> workload had.
> >>>
> >>> If there were close to no VMEXITs, it beat smt=off, if there were lots
> >>> of VMEXITs it was far far worse. Supposedly hosting people try their
> >>> very bestest to have no VMEXITs so it mostly works for them (with the
> >>> obvious exception of single VCPU guests).
> >>
> >> If you are giving access to dedicated cores to guests, you also let them
> >> do PAUSE/HLT/MWAIT without vmexits and the host just thinks it's a CPU
> >> bound workload.
> >>
> >> In any case, IIUC what you are looking for is:
> >>
> >> 1) take a benchmark that *is* helped by SMT, this will be something CPU
> >> bound.
> >>
> >> 2) compare two runs, one without SMT and without core scheduler, and one
> >> with SMT+core scheduler.
> >>
> >> 3) find out whether performance is helped by SMT despite the increased
> >> overhead of the core scheduler
> >>
> >> Do you want some other load in the host, so that the scheduler actually
> >> does do something?  Or is the point just that you show that the
> >> performance isn't affected when the scheduler does not have anything to
> >> do (which should be obvious, but having numbers is always better)?
> >
> > Well, what _I_ want is for all this to just go away :-)
> >
> > Tim did much of testing last time around; and I don't think he did
> > core-pinning of VMs much (although I'm sure he did some of that). I'm
>
> Yes. The last time around I tested basic scenarios like:
> 1. single VM pinned on a core
> 2. 2 VMs pinned on a core
> 3. system oversubscription (no pinning)
>
> In general, CPU bound benchmarks and even things without too much I/O
> causing lots of VMexits perform better with HT than without for Peter's
> last patchset.
>
> > still a complete virt noob; I can barely boot a VM to save my life.
> >
> > (you should be glad to not have heard my cursing at qemu cmdline when
> > trying to reproduce some of Tim's results -- let's just say that I can
> > deal with gpg)
> >
> > I'm sure he tried some oversubscribed scenarios without pinning.
>
> We did try some oversubscribed scenarios like SPECvirt, which tries to
> squeeze tons of VMs onto a single system in oversubscription mode.
>
> There were two main problems in the last go-around:
>
> 1. Workloads with a high rate of VMexits (SPECvirt is one)
> were a major source of pain when we tried Peter's previous patchset.
> The switch from vcpus to qemu and back in the previous version of Peter's
> patch requires some coordination between the hyperthread siblings via IPI.
> And for workloads that do this a lot, the overhead quickly added up.
>
> For Peter's new patch, this overhead hopefully would be reduced, giving
> better performance.
>
> 2. Load balancing is quite tricky.  Peter's last patchset did not have
> load balancing for consolidating compatible running threads.
> I did some unsophisticated load balancing
> to pair vcpus up.  But the constant vcpu migration overhead probably ate up
> any improvements from better load pairing.  So I didn't get much
> improvement in the oversubscription case when turning on load balancing
> to consolidate the vcpus of the same VM. We'll probably have to try
> out this incarnation of Peter's patch and see how well the load balancing
> works.
>
> I'll try to line up some benchmarking folks to do some tests.

I can help to do some basic tests.

The cgroup tagging interface looks weird to me. If I have hundreds of cgroups,
should I turn core scheduling (cpu.tag) on one by one? Or is there a global
knob I missed?

Thanks,
-Aubrey


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-22 Thread Tim Chen
On 2/22/19 6:20 AM, Peter Zijlstra wrote:
> On Fri, Feb 22, 2019 at 01:17:01PM +0100, Paolo Bonzini wrote:
>> On 18/02/19 21:40, Peter Zijlstra wrote:
>>> On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
 On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  
 wrote:
>
> However; whichever way around you turn this cookie; it is expensive and 
> nasty.

 Do you (or anybody else) have numbers for real loads?

 Because performance is all that matters. If performance is bad, then
 it's pointless, since just turning off SMT is the answer.
>>>
>>> Not for these patches; they stopped crashing only yesterday and I
>>> cleaned them up and sent them out.
>>>
>>> The previous version; which was more horrible; but L1TF complete, was
>>> between OK-ish and horrible depending on the number of VMEXITs a
>>> workload had.
>>>
>>> If there were close to no VMEXITs, it beat smt=off, if there were lots
>>> of VMEXITs it was far far worse. Supposedly hosting people try their
>>> very bestest to have no VMEXITs so it mostly works for them (with the
>>> obvious exception of single VCPU guests).
>>
>> If you are giving access to dedicated cores to guests, you also let them
>> do PAUSE/HLT/MWAIT without vmexits and the host just thinks it's a CPU
>> bound workload.
>>
>> In any case, IIUC what you are looking for is:
>>
>> 1) take a benchmark that *is* helped by SMT, this will be something CPU
>> bound.
>>
>> 2) compare two runs, one without SMT and without core scheduler, and one
>> with SMT+core scheduler.
>>
>> 3) find out whether performance is helped by SMT despite the increased
>> overhead of the core scheduler
>>
>> Do you want some other load in the host, so that the scheduler actually
>> does do something?  Or is the point just that you show that the
>> performance isn't affected when the scheduler does not have anything to
>> do (which should be obvious, but having numbers is always better)?
> 
> Well, what _I_ want is for all this to just go away :-)
> 
> Tim did much of testing last time around; and I don't think he did
> core-pinning of VMs much (although I'm sure he did some of that). I'm

Yes. The last time around I tested basic scenarios like:
1. single VM pinned on a core
2. 2 VMs pinned on a core
3. system oversubscription (no pinning)

In general, CPU bound benchmarks and even things without too much I/O
causing lots of VMexits perform better with HT than without for Peter's
last patchset.

> still a complete virt noob; I can barely boot a VM to save my life.
> 
> (you should be glad to not have heard my cursing at qemu cmdline when
> trying to reproduce some of Tim's results -- let's just say that I can
> deal with gpg)
> 
> I'm sure he tried some oversubscribed scenarios without pinning. 

We did try some oversubscribed scenarios like SPECvirt, which tries to
squeeze tons of VMs onto a single system in oversubscription mode.

There were two main problems in the last go-around:

1. Workloads with a high rate of VMexits (SPECvirt is one)
were a major source of pain when we tried Peter's previous patchset.
The switch from vcpus to qemu and back in the previous version of Peter's
patch requires some coordination between the hyperthread siblings via IPI.
And for workloads that do this a lot, the overhead quickly added up.

For Peter's new patch, this overhead hopefully would be reduced, giving
better performance.

2. Load balancing is quite tricky.  Peter's last patchset did not have
load balancing for consolidating compatible running threads.
I did some unsophisticated load balancing
to pair vcpus up.  But the constant vcpu migration overhead probably ate up
any improvements from better load pairing.  So I didn't get much
improvement in the oversubscription case when turning on load balancing
to consolidate the vcpus of the same VM. We'll probably have to try
out this incarnation of Peter's patch and see how well the load balancing
works.

I'll try to line up some benchmarking folks to do some tests.

Tim



Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-22 Thread Mel Gorman
On Fri, Feb 22, 2019 at 12:45:44PM +, Mel Gorman wrote:
> On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> > On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  wrote:
> > >
> > > However; whichever way around you turn this cookie; it is expensive and 
> > > nasty.
> > 
> > Do you (or anybody else) have numbers for real loads?
> > 
> > Because performance is all that matters. If performance is bad, then
> > it's pointless, since just turning off SMT is the answer.
> > 
> 
> I tried to do a comparison between tip/master, ht disabled and this series
> putting test workloads into a tagged cgroup but unfortunately it failed
> 
> [  156.978682] BUG: unable to handle kernel NULL pointer dereference at 
> 0058
> [  156.986597] #PF error: [normal kernel read fault]
> [  156.991343] PGD 0 P4D 0

When bodged around, one test survived (performance was crucified but the
benchmark is very synthetic). pgbench (test 2) panicked with a hard
lockup. Most of the console log was corrupted (unrelated to the patch)
but the relevant part is

[ 4587.419674] Call Trace:
[ 4587.419674]  _raw_spin_lock+0x1b/0x20
[ 4587.419675]  sched_core_balance+0x155/0x520
[ 4587.419675]  ? __switch_to_asm+0x34/0x70
[ 4587.419675]  __balance_callback+0x49/0xa0
[ 4587.419676]  __schedule+0xf15/0x12c0
[ 4587.419676]  schedule_idle+0x1e/0x40
[ 4587.419677]  do_idle+0x166/0x280
[ 4587.419677]  cpu_startup_entry+0x19/0x20
[ 4587.419678]  start_secondary+0x17a/0x1d0
[ 4587.419678]  secondary_startup_64+0xa4/0xb0
[ 4587.419679] Kernel panic - not syncing: Hard LOCKUP

-- 
Mel Gorman
SUSE Labs


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-22 Thread Peter Zijlstra
On Fri, Feb 22, 2019 at 01:17:01PM +0100, Paolo Bonzini wrote:
> On 18/02/19 21:40, Peter Zijlstra wrote:
> > On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> >> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  
> >> wrote:
> >>>
> >>> However; whichever way around you turn this cookie; it is expensive and 
> >>> nasty.
> >>
> >> Do you (or anybody else) have numbers for real loads?
> >>
> >> Because performance is all that matters. If performance is bad, then
> >> it's pointless, since just turning off SMT is the answer.
> > 
> > Not for these patches; they stopped crashing only yesterday and I
> > cleaned them up and sent them out.
> > 
> > The previous version; which was more horrible; but L1TF complete, was
> > between OK-ish and horrible depending on the number of VMEXITs a
> > workload had.
> >
> > If there were close to no VMEXITs, it beat smt=off, if there were lots
> > of VMEXITs it was far far worse. Supposedly hosting people try their
> > very bestest to have no VMEXITs so it mostly works for them (with the
> > obvious exception of single VCPU guests).
> 
> If you are giving access to dedicated cores to guests, you also let them
> do PAUSE/HLT/MWAIT without vmexits and the host just thinks it's a CPU
> bound workload.
> 
> In any case, IIUC what you are looking for is:
> 
> 1) take a benchmark that *is* helped by SMT, this will be something CPU
> bound.
> 
> 2) compare two runs, one without SMT and without core scheduler, and one
> with SMT+core scheduler.
> 
> 3) find out whether performance is helped by SMT despite the increased
> overhead of the core scheduler
> 
> Do you want some other load in the host, so that the scheduler actually
> does do something?  Or is the point just that you show that the
> performance isn't affected when the scheduler does not have anything to
> do (which should be obvious, but having numbers is always better)?

Well, what _I_ want is for all this to just go away :-)

Tim did much of testing last time around; and I don't think he did
core-pinning of VMs much (although I'm sure he did some of that). I'm
still a complete virt noob; I can barely boot a VM to save my life.

(you should be glad to not have heard my cursing at qemu cmdline when
trying to reproduce some of Tim's results -- let's just say that I can
deal with gpg)

I'm sure he tried some oversubscribed scenarios without pinning. But
even there, when all the vCPU threads are runnable, they don't schedule
that much. Sure we take the preemption tick and thus schedule 100-1000
times a second, but that's manageable.

We spent quite some time tracing workloads and fixing funny behaviour --
none of that has been done for these patches yet.

The moment KVM needed user space assist for things (and thus VMEXITs
happened) things came apart real quick.


Anyway, Tim, can you tell these fine folks what you did and for what
scenarios the last incarnation did show promise?


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-22 Thread Peter Zijlstra
On Wed, Feb 20, 2019 at 10:33:55AM -0800, Greg Kerr wrote:
> > On Tue, Feb 19, 2019 at 02:07:01PM -0800, Greg Kerr wrote:

> Using cgroups could imply that a privileged user is meant to create and
> track all the core scheduling groups. It sounds like you picked cgroups
> out of ease of prototyping and not the specific behavior?

Yep. Where a prctl() patch would've been similarly simple, the userspace
part would've been more annoying. The cgroup thing I can just echo into.

> > As it happens; there is actually a bug in that very cgroup patch that
> > can cause undesired scheduling. Try spotting and fixing that.
> > 
> This is where I think the high level properties of core scheduling are
> relevant. I'm not sure what bug is in the existing patch, but it's hard
> for me to tell if the existing code behaves correctly without answering
> questions, such as, "Should processes from two separate parents be
> allowed to co-execute?"

Sure, why not.

The bug is that we set the cookie and don't force a reschedule. This
then allows the existing task selection to continue; which might not
adhere to the (new) cookie constraints.

It is a transient state though; as soon as we reschedule this gets
corrected automagically.

A second bug is that we leak the cgroup tag state on destroy.

A third bug would be that it is not hierarchical -- but at this point,
meh.

> > Another question is if we want to be L1TF complete (and how strict) or
> > not, and if so, build the missing pieces (for instance we currently
> > don't kick siblings on IRQ/trap/exception entry -- and yes that's nasty
> > and horrible code and missing for that reason).
> >
> I assumed from the beginning that this should be safe across exceptions.
> Is there a mitigating reason that it shouldn't?

I'm not entirely sure what you mean; so let me expound -- L1TF is public
now after all.

So the basic problem is that a malicious guest can read the entire L1,
right? And L1 is shared between the SMT siblings. So if one sibling takes
a host interrupt and populates L1 with host data, the other sibling can
read it from inside the guest.

This is why my old patches (which Tim has on github _somewhere_) also
have hooks in irq_enter/irq_exit.

The big question is of course; if any data touched by interrupts is
worth the pain.

> > So first; does this provide what we need? If that's sorted we can
> > bike-shed on uapi/abi.

> I agree on not bike shedding about the API, but can we agree on some of
> the high level properties? For example, who generates the core
> scheduling ids, what properties about them are enforced, etc.?

It's an opaque cookie; the scheduler really doesn't care. All it does is
ensure that tasks match or force idle within a core.

My previous patches got the cookie from a modified
preempt_notifier_register/unregister() which passed the vcpu->kvm
pointer into it from vcpu_load/put.

This auto-grouped VMs. It was also found to be somewhat annoying because
apparently KVM does a lot of userspace assist for all sorts of nonsense,
and it would leave/re-join the cookie group for every single assist,
causing tons of rescheduling.

I'm fine with having all these interfaces, kvm, prctl and cgroup, and I
don't care about conflict resolution -- that's the tedious part of the
bike-shed :-)

The far more important questions are if there's enough workloads where
this can be made useful or not. If not, none of that interface crud
matters one whit, we can file these here patches in the bit-bucket and
happily go spend our time elsewhere.


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-22 Thread Mel Gorman
On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  wrote:
> >
> > However; whichever way around you turn this cookie; it is expensive and 
> > nasty.
> 
> Do you (or anybody else) have numbers for real loads?
> 
> Because performance is all that matters. If performance is bad, then
> it's pointless, since just turning off SMT is the answer.
> 

I tried to do a comparison between tip/master, ht disabled and this series
putting test workloads into a tagged cgroup but unfortunately it failed

[  156.978682] BUG: unable to handle kernel NULL pointer dereference at 
0058
[  156.986597] #PF error: [normal kernel read fault]
[  156.991343] PGD 0 P4D 0
[  156.993905] Oops:  [#1] SMP PTI
[  156.997438] CPU: 15 PID: 0 Comm: swapper/15 Not tainted 
5.0.0-rc7-schedcore-v1r1 #1
[  157.005161] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 
05/09/2016
[  157.012896] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
[  157.018613] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 
e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00
 53 48 89 fb <48> 2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
[  157.037544] RSP: 0018:c9000c5bbde8 EFLAGS: 00010086
[  157.042819] RAX: 88810f5f6a00 RBX: 0001547f175c RCX: 0001
[  157.050015] RDX: 88bf3bdb0a40 RSI:  RDI: 0001547f175c
[  157.057215] RBP: 88bf7fae32c0 R08: 0001e358 R09: 88810fb9f000
[  157.064410] R10: c9000c5bbe08 R11: 88810fb9f5c4 R12: 
[  157.071611] R13: 88bf4e3ea0c0 R14:  R15: 88bf4e3ea7a8
[  157.078814] FS:  () GS:88bf7f5c() 
knlGS:
[  157.086977] CS:  0010 DS:  ES:  CR0: 80050033
[  157.092779] CR2: 0058 CR3: 0220e005 CR4: 003606e0
[  157.099979] DR0:  DR1:  DR2: 
[  157.109529] DR3:  DR6: fffe0ff0 DR7: 0400
[  157.119058] Call Trace:
[  157.123865]  pick_next_entity+0x61/0x110
[  157.130137]  pick_task_fair+0x4b/0x90
[  157.136124]  __schedule+0x365/0x12c0
[  157.141985]  schedule_idle+0x1e/0x40
[  157.147822]  do_idle+0x166/0x280
[  157.153275]  cpu_startup_entry+0x19/0x20
[  157.159420]  start_secondary+0x17a/0x1d0
[  157.165568]  secondary_startup_64+0xa4/0xb0
[  157.171985] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs msr 
intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm 
ipmi_ssif irqbypass crc32_pclmul ghash_clmulni_intel ixgbe aesni_intel 
xfrm_algo iTCO_wdt joydev iTCO_vendor_support libphy igb aes_x86_64 crypto_simd 
ptp cryptd mei_me mdio pps_core ioatdma glue_helper pcspkr ipmi_si lpc_ich 
i2c_i801 mei dca ipmi_devintf ipmi_msghandler acpi_pad pcc_cpufreq button btrfs 
libcrc32c xor zstd_decompress zstd_compress raid6_pq hid_generic usbhid ast 
i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops 
xhci_pci crc32c_intel ehci_pci ttm xhci_hcd ehci_hcd drm ahci usbcore mpt3sas 
libahci raid_class scsi_transport_sas wmi sg nbd dm_multipath dm_mod 
scsi_dh_rdac scsi_dh_emc scsi_dh_alua
[  157.258990] CR2: 0058
[  157.264961] ---[ end trace a301ac5e3ee86fde ]---
[  157.283719] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
[  157.291967] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 
e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00 53 48 89 fb <48> 2b 5e 
58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
[  157.316121] RSP: 0018:c9000c5bbde8 EFLAGS: 00010086
[  157.324060] RAX: 88810f5f6a00 RBX: 0001547f175c RCX: 0001
[  157.333932] RDX: 88bf3bdb0a40 RSI:  RDI: 0001547f175c
[  157.343795] RBP: 88bf7fae32c0 R08: 0001e358 R09: 88810fb9f000
[  157.353634] R10: c9000c5bbe08 R11: 88810fb9f5c4 R12: 
[  157.363506] R13: 88bf4e3ea0c0 R14:  R15: 88bf4e3ea7a8
[  157.373395] FS:  () GS:88bf7f5c() 
knlGS:
[  157.384238] CS:  0010 DS:  ES:  CR0: 80050033
[  157.392709] CR2: 0058 CR3: 0220e005 CR4: 003606e0
[  157.402601] DR0:  DR1:  DR2: 
[  157.412488] DR3:  DR6: fffe0ff0 DR7: 0400
[  157.422334] Kernel panic - not syncing: Attempted to kill the idle task!
[  158.529804] Shutting down cpus with NMI
[  158.573249] Kernel Offset: disabled
[  158.586198] ---[ end Kernel panic - not syncing: Attempted to kill the idle 
task! ]---

RIP translates to kernel/sched/fair.c:6819

static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
s64 gran, vdiff = curr->vruntime - se->vruntime; /* LINE 6819 */

if (vdiff <= 0)
return -1;

Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-22 Thread Paolo Bonzini
On 18/02/19 21:40, Peter Zijlstra wrote:
> On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
>> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  wrote:
>>>
>>> However; whichever way around you turn this cookie; it is expensive and 
>>> nasty.
>>
>> Do you (or anybody else) have numbers for real loads?
>>
>> Because performance is all that matters. If performance is bad, then
>> it's pointless, since just turning off SMT is the answer.
> 
> Not for these patches; they stopped crashing only yesterday and I
> cleaned them up and sent them out.
> 
> The previous version; which was more horrible; but L1TF complete, was
> between OK-ish and horrible depending on the number of VMEXITs a
> workload had.
>
> If there were close to no VMEXITs, it beat smt=off, if there were lots
> of VMEXITs it was far far worse. Supposedly hosting people try their
> very bestest to have no VMEXITs so it mostly works for them (with the
> obvious exception of single VCPU guests).

If you are giving access to dedicated cores to guests, you also let them
do PAUSE/HLT/MWAIT without vmexits and the host just thinks it's a CPU
bound workload.

In any case, IIUC what you are looking for is:

1) take a benchmark that *is* helped by SMT, this will be something CPU
bound.

2) compare two runs, one without SMT and without core scheduler, and one
with SMT+core scheduler.

3) find out whether performance is helped by SMT despite the increased
overhead of the core scheduler

Do you want some other load in the host, so that the scheduler actually
does do something?  Or is the point just that you show that the
performance isn't affected when the scheduler does not have anything to
do (which should be obvious, but having numbers is always better)?

Paolo


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-21 Thread Subhra Mazumdar



On 2/21/19 6:03 AM, Peter Zijlstra wrote:

On Wed, Feb 20, 2019 at 06:53:08PM -0800, Subhra Mazumdar wrote:

On 2/18/19 9:49 AM, Linus Torvalds wrote:

On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  wrote:

However; whichever way around you turn this cookie; it is expensive and nasty.

Do you (or anybody else) have numbers for real loads?

Because performance is all that matters. If performance is bad, then
it's pointless, since just turning off SMT is the answer.

Linus

I tested 2 Oracle DB instances running OLTP on a 2 socket 44 cores system.
This is on baremetal, no virtualization.

I'm thinking oracle schedules quite a bit, right? Then you get massive
overhead (as shown).


Out of curiosity I ran the patchset from Amazon with the same setup to see
if it was any better performance-wise. But it looks equally bad. At 32
users it performed even worse and the idle time increased much more. The
only good thing about it was that it was fair to both instances, as seen
in the low %stdev.

Users  Baseline  %stdev  %idle  cosched      %stdev  %idle
16     1         2.9     66     0.93 (-7%)   1.1     69
24     1         11.3    53     0.87 (-13%)  11.2    61
32     1         7       41     0.66 (-34%)  5.3     54


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-21 Thread Subhra Mazumdar



On 2/21/19 6:03 AM, Peter Zijlstra wrote:

On Wed, Feb 20, 2019 at 06:53:08PM -0800, Subhra Mazumdar wrote:

On 2/18/19 9:49 AM, Linus Torvalds wrote:

On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  wrote:

However; whichever way around you turn this cookie; it is expensive and nasty.

Do you (or anybody else) have numbers for real loads?

Because performance is all that matters. If performance is bad, then
it's pointless, since just turning off SMT is the answer.

Linus

I tested 2 Oracle DB instances running OLTP on a 2 socket 44 cores system.
This is on baremetal, no virtualization.

I'm thinking oracle schedules quite a bit, right? Then you get massive
overhead (as shown).

Yes. In terms of idleness we have:

Users  baseline  core_sched
16     67%       70%
24     53%       59%
32     41%       49%

So there is more idleness with core sched which is understandable as there
can be forced idleness. The other part contributing to regression is most
likely overhead.


The thing with virt workloads is that if they don't VMEXIT lots, they
also don't schedule lots (the vCPU stays running, nested scheduler
etc..).

I plan to run some VM workloads.


Also; like I wrote, it is quite possible there is some sibling rivalry
here, which can cause excessive rescheduling. Someone would have to
trace a workload and check.

My older patches had a condition that would not preempt a task for a
little while, such that it might make _some_ progress, these patches
don't have that (yet).



Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-21 Thread Peter Zijlstra
On Wed, Feb 20, 2019 at 06:53:08PM -0800, Subhra Mazumdar wrote:
> 
> On 2/18/19 9:49 AM, Linus Torvalds wrote:
> > On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  wrote:
> > > However; whichever way around you turn this cookie; it is expensive and 
> > > nasty.
> > Do you (or anybody else) have numbers for real loads?
> > 
> > Because performance is all that matters. If performance is bad, then
> > it's pointless, since just turning off SMT is the answer.
> > 
> >Linus
> I tested 2 Oracle DB instances running OLTP on a 2 socket 44 cores system.
> This is on baremetal, no virtualization.  

I'm thinking oracle schedules quite a bit, right? Then you get massive
overhead (as shown).

The thing with virt workloads is that if they don't VMEXIT lots, they
also don't schedule lots (the vCPU stays running, nested scheduler
etc..).

Also; like I wrote, it is quite possible there is some sibling rivalry
here, which can cause excessive rescheduling. Someone would have to
trace a workload and check.

My older patches had a condition that would not preempt a task for a
little while, such that it might make _some_ progress, these patches
don't have that (yet).



Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-20 Thread Subhra Mazumdar



On 2/18/19 9:49 AM, Linus Torvalds wrote:

On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  wrote:

However; whichever way around you turn this cookie; it is expensive and nasty.

Do you (or anybody else) have numbers for real loads?

Because performance is all that matters. If performance is bad, then
it's pointless, since just turning off SMT is the answer.

   Linus

I tested 2 Oracle DB instances running OLTP on a 2 socket 44 cores system.
This is on baremetal, no virtualization.  In all cases I put each DB
instance in separate cpu cgroup. Following are the avg throughput numbers
of the 2 instances. %stdev is the standard deviation between the 2
instances.

Baseline = build w/o CONFIG_SCHED_CORE
core_sched = build w/ CONFIG_SCHED_CORE
HT_disable = offlined sibling HT with baseline

Users  Baseline  %stdev  core_sched       %stdev  HT_disable        %stdev
16     997768    3.28    808193 (-19%)    34      1053888 (+5.6%)   2.9
24     1157314   9.4     974555 (-15.8%)  40.5    1197904 (+3.5%)   4.6
32     1693644   6.4     1237195 (-27%)   42.8    1308180 (-22.8%)  5.3

The regressions are substantial. I also noticed that one of the DB instances
had much lower throughput than the other with core scheduling, which
brought down the avg and is also reflected in the very high %stdev. Disabling
HT has an effect at 32 users but is still better than core scheduling in
terms of both avg and %stdev. There are some issues with the DB setup that
kept me from going beyond 32 users.


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-20 Thread Greg Kerr
On Wed, Feb 20, 2019 at 10:42:55AM +0100, Peter Zijlstra wrote:
> 
> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing?
> A: Top-posting.
> Q: What is the most annoying thing in e-mail?
> 
I am relieved to know that when my mail client embeds HTML tags into raw
text, it will only be the second most annoying thing I've done on
e-mail.

Speaking of annoying things to do, sorry for switching e-mail addresses
but this is easier to do from my personal e-mail.

> On Tue, Feb 19, 2019 at 02:07:01PM -0800, Greg Kerr wrote:
> > Thanks for posting this patchset Peter. Based on the patch titled, "sched: A
> > quick and dirty cgroup tagging interface," I believe cgroups are used to
> > define co-scheduling groups in this implementation.
> > 
> > Chrome OS engineers (kerr...@google.com, mpden...@google.com, and
> > pal...@google.com) are considering an interface that is usable by 
> > unprivileged
> > userspace apps. cgroups are a global resource that require privileged 
> > access.
> > Have you considered an interface that is akin to namespaces? Consider the
> > following strawperson API proposal (I understand prctl() is generally
> > used for process
> > specific actions, so we aren't married to using prctl()):
> 
> I don't think we're anywhere near the point where I care about
> interfaces with this stuff.
> 
> Interfaces are a trivial but tedious matter once the rest works to
> satisfaction.
> 
I agree that the API itself is a bit of a bike shed and that's why I
provided a strawperson proposal to highlight the desired properties. I
do think the high-level semantics are important to agree upon.

Using cgroups could imply that a privileged user is meant to create and
track all the core scheduling groups. It sounds like you picked cgroups
out of ease of prototyping and not the specific behavior?

> As it happens; there is actually a bug in that very cgroup patch that
> can cause undesired scheduling. Try spotting and fixing that.
> 
This is where I think the high level properties of core scheduling are
relevant. I'm not sure what bug is in the existing patch, but it's hard
for me to tell if the existing code behaves correctly without answering
questions, such as, "Should processes from two separate parents be
allowed to co-execute?"

> Another question is if we want to be L1TF complete (and how strict) or
> not, and if so, build the missing pieces (for instance we currently
> don't kick siblings on IRQ/trap/exception entry -- and yes that's nasty
> and horrible code and missing for that reason).
>
I assumed from the beginning that this should be safe across exceptions.
Is there a mitigating reason that it shouldn't?

> 
> So first; does this provide what we need? If that's sorted we can
> bike-shed on uapi/abi.
I agree on not bike shedding about the API, but can we agree on some of
the high level properties? For example, who generates the core
scheduling ids, what properties about them are enforced, etc.?

Regards,

Greg Kerr


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-20 Thread Subhra Mazumdar



On 2/20/19 1:42 AM, Peter Zijlstra wrote:

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?

On Tue, Feb 19, 2019 at 02:07:01PM -0800, Greg Kerr wrote:

Thanks for posting this patchset Peter. Based on the patch titled, "sched: A
quick and dirty cgroup tagging interface," I believe cgroups are used to
define co-scheduling groups in this implementation.

Chrome OS engineers (kerr...@google.com, mpden...@google.com, and
pal...@google.com) are considering an interface that is usable by unprivileged
userspace apps. cgroups are a global resource that require privileged access.
Have you considered an interface that is akin to namespaces? Consider the
following strawperson API proposal (I understand prctl() is generally
used for process
specific actions, so we aren't married to using prctl()):

I don't think we're anywhere near the point where I care about
interfaces with this stuff.

Interfaces are a trivial but tedious matter once the rest works to
satisfaction.

As it happens; there is actually a bug in that very cgroup patch that
can cause undesired scheduling. Try spotting and fixing that.

Another question is if we want to be L1TF complete (and how strict) or
not, and if so, build the missing pieces (for instance we currently
don't kick siblings on IRQ/trap/exception entry -- and yes that's nasty
and horrible code and missing for that reason).

I remember asking Paul about this and he mentioned he has an Address Space
Isolation proposal to cover this. So it seems this is out of scope for
core scheduling?


So first; does this provide what we need? If that's sorted we can
bike-shed on uapi/abi.


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-20 Thread Peter Zijlstra


A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?

On Tue, Feb 19, 2019 at 02:07:01PM -0800, Greg Kerr wrote:
> Thanks for posting this patchset Peter. Based on the patch titled, "sched: A
> quick and dirty cgroup tagging interface," I believe cgroups are used to
> define co-scheduling groups in this implementation.
> 
> Chrome OS engineers (kerr...@google.com, mpden...@google.com, and
> pal...@google.com) are considering an interface that is usable by unprivileged
> userspace apps. cgroups are a global resource that require privileged access.
> Have you considered an interface that is akin to namespaces? Consider the
> following strawperson API proposal (I understand prctl() is generally
> used for process
> specific actions, so we aren't married to using prctl()):

I don't think we're anywhere near the point where I care about
interfaces with this stuff.

Interfaces are a trivial but tedious matter once the rest works to
satisfaction.

As it happens; there is actually a bug in that very cgroup patch that
can cause undesired scheduling. Try spotting and fixing that.

Another question is if we want to be L1TF complete (and how strict) or
not, and if so, build the missing pieces (for instance we currently
don't kick siblings on IRQ/trap/exception entry -- and yes that's nasty
and horrible code and missing for that reason).

So first; does this provide what we need? If that's sorted we can
bike-shed on uapi/abi.


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-19 Thread Greg Kerr
Thanks for posting this patchset Peter. Based on the patch titled, "sched: A
quick and dirty cgroup tagging interface," I believe cgroups are used to
define co-scheduling groups in this implementation.

Chrome OS engineers (kerr...@google.com, mpden...@google.com, and
pal...@google.com) are considering an interface that is usable by unprivileged
userspace apps. cgroups are a global resource that require privileged access.
Have you considered an interface that is akin to namespaces? Consider the
following strawperson API proposal (I understand prctl() is generally
used for process
specific actions, so we aren't married to using prctl()):

# API Properties

The kernel introduces coscheduling groups, which specify which processes may
be executed together. An unprivileged process may use prctl() to create a
coscheduling group. The process may then join the coscheduling group, and
place any of its child processes into the coscheduling group. To
provide flexibility for
unrelated processes to join pre-existing groups, an IPC mechanism could send a
coscheduling group handle between processes.

# Strawperson API Proposal
To create a new coscheduling group:
int coscheduling_group = prctl(PR_CREATE_COSCHEDULING_GROUP);

The return value is >= 0 on success and -1 on failure, with the following
possible values for errno:

ENOTSUP: This kernel doesn’t support the PR_CREATE_COSCHEDULING_GROUP
operation.
EMFILE: The process’ kernel-side coscheduling group table is full.

To join a given process to the group:
pid_t process = /* self or child... */;
int status = prctl(PR_JOIN_COSCHEDULING_GROUP, coscheduling_group, process);
if (status) {
        err(errno, NULL);
}

The kernel will check and enforce that the given process ID really is the
caller’s own PID or a PID of one of the caller’s children, and that the given
group ID really exists. The return value is 0 on success and -1 on failure,
with the following possible values for errno:

EPERM: The caller could not join the given process to the coscheduling
   group because it was not the creator of the given coscheduling group.
EPERM: The caller could not join the given process to the coscheduling
   group because the given process was not the caller or one of the
   caller’s children.
EINVAL: The given group ID did not exist in the kernel-side coscheduling
   group table associated with the caller.
ESRCH: The given process did not exist.
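
For concreteness, the checks described above can be modeled in plain
userspace C. This is only a sketch of the proposed semantics -- the group
table, the helper names, and the hard-coded parent/child relation are all
stand-ins, since no such prctl() operations exist in any kernel:

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>

/* Userspace model of the strawperson semantics, not kernel code. It
 * mimics the documented checks: only the group's creator may join
 * processes, and only itself or one of its children. */

#define MAX_GROUPS 4

struct cosched_group {
	bool in_use;
	pid_t creator;
};

static struct cosched_group groups[MAX_GROUPS];

/* Demo stand-in for the kernel's real ancestry check: in this model,
 * pid 101 is the only child of pid 100. */
static bool is_child_of(pid_t child, pid_t parent)
{
	return parent == 100 && child == 101;
}

/* PR_CREATE_COSCHEDULING_GROUP analogue: returns a handle, or -1
 * with errno = EMFILE when the caller's table is full. */
static int cosched_create(pid_t caller)
{
	for (int i = 0; i < MAX_GROUPS; i++) {
		if (!groups[i].in_use) {
			groups[i].in_use = true;
			groups[i].creator = caller;
			return i;
		}
	}
	errno = EMFILE;
	return -1;
}

/* PR_JOIN_COSCHEDULING_GROUP analogue, enforcing the EPERM/EINVAL
 * cases listed above. */
static int cosched_join(pid_t caller, int group, pid_t target)
{
	if (group < 0 || group >= MAX_GROUPS || !groups[group].in_use) {
		errno = EINVAL;
		return -1;
	}
	if (groups[group].creator != caller) {
		errno = EPERM;
		return -1;
	}
	if (target != caller && !is_child_of(target, caller)) {
		errno = EPERM;
		return -1;
	}
	return 0;
}
```

Passing the returned handle over an IPC channel, as the proposal suggests,
would then be a matter of transferring the integer plus whatever
authorization scheme the kernel settles on.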

Regards,

Greg Kerr (kerr...@google.com)

On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  wrote:
>
>
> A much 'demanded' feature: core-scheduling :-(
>
> I still hate it with a passion, and that is part of why it took a little
> longer than 'promised'.
>
> While this one doesn't have all the 'features' of the previous (never
> published) version and isn't L1TF 'complete', I tend to like the structure
> better (relatively speaking: I hate it slightly less).
>
> This one is sched class agnostic and therefore, in principle, doesn't horribly
> wreck RT (in fact, RT could 'ab'use this by setting 'task->core_cookie = task'
> to force-idle siblings).
>
> Now, as hinted by that, there are semi sane reasons for actually having this.
> Various hardware features like Intel RDT - Memory Bandwidth Allocation, work
> per core (due to SMT fundamentally sharing caches) and therefore grouping
> related tasks on a core makes it more reliable.
>
> However; whichever way around you turn this cookie; it is expensive and nasty.
>
> It doesn't help that there are truly bonghit crazy proposals for using this out
> there, and I really hope to never see them in code.
>
> These patches are lightly tested and didn't insta explode, but no promises,
> they might just set your pets on fire.
>
> 'enjoy'
>
> @pjt; I know this isn't quite what we talked about, but this is where I ended
> up after I started typing. There's plenty design decisions to question and my
> changelogs don't even get close to beginning to cover them all. Feel free to ask.
>
> ---
>  include/linux/sched.h|   9 +-
>  kernel/Kconfig.preempt   |   8 +-
>  kernel/sched/core.c  | 762 ---
>  kernel/sched/deadline.c  |  99 +++---
>  kernel/sched/debug.c |   4 +-
>  kernel/sched/fair.c  | 129 +---
>  kernel/sched/idle.c  |  42 ++-
>  kernel/sched/pelt.h  |   2 +-
>  kernel/sched/rt.c|  96 +++---
>  kernel/sched/sched.h | 183 
>  kernel/sched/stop_task.c |  35 ++-
>  kernel/sched/topology.c  |   4 +-
>  kernel/stop_machine.c|   2 +
>  13 files changed, 1096 insertions(+), 279 deletions(-)
>
>


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-19 Thread Ingo Molnar


* Linus Torvalds  wrote:

> On Mon, Feb 18, 2019 at 12:40 PM Peter Zijlstra  wrote:
> >
> > If there were close to no VMEXITs, it beat smt=off, if there were lots
> > of VMEXITs it was far far worse. Supposedly hosting people try their
> > very bestest to have no VMEXITs so it mostly works for them (with the
> > obvious exception of single VCPU guests).
> >
> > It's just that people have been bugging me for this crap; and I figure
> > I'd post it now that it's not exploding anymore and let others have at.
> 
> The patches didn't look disgusting to me, but I admittedly just
> scanned through them quickly.
> 
> Are there downsides (maintenance and/or performance) when core
> scheduling _isn't_ enabled? I guess if it's not a maintenance or
> performance nightmare when off, it's ok to just give people the
> option.

So this bit is the main straight-line performance impact when the 
CONFIG_SCHED_CORE Kconfig feature is present (which I expect distros to 
enable broadly):

  +static inline bool sched_core_enabled(struct rq *rq)
  +{
  +       return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
  +}

   static inline raw_spinlock_t *rq_lockp(struct rq *rq)
   {
  +       if (sched_core_enabled(rq))
  +               return &rq->core->__lock;
  +
           return &rq->__lock;


This should at least in principle keep the runtime overhead down to mere 
NOPs and a slightly bigger instruction cache footprint - modulo compiler 
shenanigans.
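
To make the lock selection concrete, here is a small userspace sketch of
the logic quoted above. The types and the plain boolean standing in for
the static branch are illustrative only; the real code uses
static_branch_unlikely(), which patches the disabled path into NOPs so
the "off" case costs essentially nothing:

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in types: not the kernel's definitions. */
typedef int raw_spinlock_t;

struct rq {
	raw_spinlock_t __lock;      /* this runqueue's own lock */
	bool core_enabled;
	struct rq *core;            /* leader rq shared by SMT siblings */
};

/* Models the __sched_core_enabled static branch with a plain bool. */
static bool sched_core_enabled_global;

static bool sched_core_enabled(struct rq *rq)
{
	return sched_core_enabled_global && rq->core_enabled;
}

/* When core scheduling is off, each rq uses its own lock; when on,
 * all siblings of a core funnel through the shared core lock. */
static raw_spinlock_t *rq_lockp(struct rq *rq)
{
	if (sched_core_enabled(rq))
		return &rq->core->__lock;
	return &rq->__lock;
}
```

The single shared lock per core is precisely where the "expensive and
nasty" serialization between siblings comes from.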

Here's the code generation impact on x86-64 defconfig:

   text    data  bss     dec    hex  filename
    228      48    0     276    114  sched.core.n/cpufreq.o (ex sched.core.n/built-in.a)
    228      48    0     276    114  sched.core.y/cpufreq.o (ex sched.core.y/built-in.a)
   4438      96    0    4534   11b6  sched.core.n/completion.o (ex sched.core.n/built-in.a)
   4438      96    0    4534   11b6  sched.core.y/completion.o (ex sched.core.y/built-in.a)
   2167    2428    0    4595   11f3  sched.core.n/cpuacct.o (ex sched.core.n/built-in.a)
   2167    2428    0    4595   11f3  sched.core.y/cpuacct.o (ex sched.core.y/built-in.a)
  61099   22114  488   83701  146f5  sched.core.n/core.o (ex sched.core.n/built-in.a)
  70541   25370  508   96419  178a3  sched.core.y/core.o (ex sched.core.y/built-in.a)
   3262    6272    0    9534   253e  sched.core.n/wait_bit.o (ex sched.core.n/built-in.a)
   3262    6272    0    9534   253e  sched.core.y/wait_bit.o (ex sched.core.y/built-in.a)
  12235     341   96   12672   3180  sched.core.n/rt.o (ex sched.core.n/built-in.a)
  13073     917   96   14086   3706  sched.core.y/rt.o (ex sched.core.y/built-in.a)
  10293     477 1928   12698   319a  sched.core.n/topology.o (ex sched.core.n/built-in.a)
  10363     509 1928   12800   3200  sched.core.y/topology.o (ex sched.core.y/built-in.a)
    886      24    0     910    38e  sched.core.n/cpupri.o (ex sched.core.n/built-in.a)
    886      24    0     910    38e  sched.core.y/cpupri.o (ex sched.core.y/built-in.a)
   1061      64    0    1125    465  sched.core.n/stop_task.o (ex sched.core.n/built-in.a)
   1077     128    0    1205    4b5  sched.core.y/stop_task.o (ex sched.core.y/built-in.a)
  18443     365   24   18832   4990  sched.core.n/deadline.o (ex sched.core.n/built-in.a)
  20019    2189   24   22232   56d8  sched.core.y/deadline.o (ex sched.core.y/built-in.a)
   1123       8   64    1195    4ab  sched.core.n/loadavg.o (ex sched.core.n/built-in.a)
   1123       8   64    1195    4ab  sched.core.y/loadavg.o (ex sched.core.y/built-in.a)
   1323       8    0    1331    533  sched.core.n/stats.o (ex sched.core.n/built-in.a)
   1323       8    0    1331    533  sched.core.y/stats.o (ex sched.core.y/built-in.a)
   1282     164   32    1478    5c6  sched.core.n/isolation.o (ex sched.core.n/built-in.a)
   1282     164   32    1478    5c6  sched.core.y/isolation.o (ex sched.core.y/built-in.a)
   1564      36    0    1600    640  sched.core.n/cpudeadline.o (ex sched.core.n/built-in.a)
   1564      36    0    1600    640  sched.core.y/cpudeadline.o (ex sched.core.y/built-in.a)
   1640      56    0    1696    6a0  sched.core.n/swait.o (ex sched.core.n/built-in.a)
   1640      56    0    1696    6a0  sched.core.y/swait.o (ex sched.core.y/built-in.a)
   1859     244   32    2135    857  sched.core.n/clock.o (ex sched.core.n/built-in.a)
   1859     244   32    2135    857  sched.core.y/clock.o (ex sched.core.y/built-in.a)
   2339       8    0    2347    92b  sched.core.n/cputime.o (ex sched.core.n/built-in.a)
   2339       8    0    2347    92b  sched.core.y/cputime.o (ex sched.core.y/built-in.a)
   3014      32    0    3046    be6  sched.core.n/membarrier.o (ex sched.core.n/built-in.a)
   3014      32    0    3046    be6  sched.core.y/membarrier.o (ex sched.core.y/built-in.a)
  50027     964   96   51087   c78f  sched.core.n/fair.o 

Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-18 Thread Linus Torvalds
On Mon, Feb 18, 2019 at 12:40 PM Peter Zijlstra  wrote:
>
> If there were close to no VMEXITs, it beat smt=off, if there were lots
> of VMEXITs it was far far worse. Supposedly hosting people try their
> very bestest to have no VMEXITs so it mostly works for them (with the
> obvious exception of single VCPU guests).
>
> It's just that people have been bugging me for this crap; and I figure
> I'd post it now that it's not exploding anymore and let others have at.

The patches didn't look disgusting to me, but I admittedly just
scanned through them quickly.

Are there downsides (maintenance and/or performance) when core
scheduling _isn't_ enabled? I guess if it's not a maintenance or
performance nightmare when off, it's ok to just give people the
option.

That all assumes that it works at all for the people who are clamoring
for this feature, but I guess they can run some loads on it
eventually. It's a holiday in the US right now ("Presidents' Day"),
but maybe we can get some numbers this week?

Linus


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-18 Thread Peter Zijlstra
On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  wrote:
> >
> > However; whichever way around you turn this cookie; it is expensive and 
> > nasty.
> 
> Do you (or anybody else) have numbers for real loads?
> 
> Because performance is all that matters. If performance is bad, then
> it's pointless, since just turning off SMT is the answer.

Not for these patches; they stopped crashing only yesterday and I
cleaned them up and send them out.

The previous version; which was more horrible; but L1TF complete, was
between OK-ish and horrible depending on the number of VMEXITs a
workload had.

If there were close to no VMEXITs, it beat smt=off, if there were lots
of VMEXITs it was far far worse. Supposedly hosting people try their
very bestest to have no VMEXITs so it mostly works for them (with the
obvious exception of single VCPU guests).

It's just that people have been bugging me for this crap; and I figure
I'd post it now that it's not exploding anymore and let others have at.




Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-18 Thread Linus Torvalds
On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  wrote:
>
> However; whichever way around you turn this cookie; it is expensive and nasty.

Do you (or anybody else) have numbers for real loads?

Because performance is all that matters. If performance is bad, then
it's pointless, since just turning off SMT is the answer.

  Linus


[RFC][PATCH 00/16] sched: Core scheduling

2019-02-18 Thread Peter Zijlstra


A much 'demanded' feature: core-scheduling :-(

I still hate it with a passion, and that is part of why it took a little
longer than 'promised'.

While this one doesn't have all the 'features' of the previous (never
published) version and isn't L1TF 'complete', I tend to like the structure
better (relatively speaking: I hate it slightly less).

This one is sched class agnostic and therefore, in principle, doesn't horribly
wreck RT (in fact, RT could 'ab'use this by setting 'task->core_cookie = task'
to force-idle siblings).
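
The cookie rule implied here -- two tasks may occupy SMT siblings of the
same core only if their cookies match, otherwise the sibling is forced
idle -- can be illustrated in a few lines of C. The struct and function
names below are illustrative, not the patchset's exact code:

```c
#include <stdbool.h>
#include <stddef.h>

struct task {
	unsigned long core_cookie;   /* 0 means "untagged" */
};

/* Tasks may co-run on siblings of a core iff their cookies agree. */
static bool cookie_match(const struct task *a, const struct task *b)
{
	return a->core_cookie == b->core_cookie;
}

/* Decide what a sibling may run next to @curr: the candidate if the
 * cookies agree (or the sibling is empty), otherwise the idle task. */
static const struct task *sibling_pick(const struct task *curr,
				       const struct task *candidate,
				       const struct task *idle)
{
	if (!curr || cookie_match(curr, candidate))
		return candidate;
	return idle;
}
```

The RT trick mentioned above falls out directly: a task whose cookie is
its own unique address matches nothing, so scheduling it always forces
the siblings idle.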

Now, as hinted by that, there are semi sane reasons for actually having this.
Various hardware features like Intel RDT - Memory Bandwidth Allocation, work
per core (due to SMT fundamentally sharing caches) and therefore grouping
related tasks on a core makes it more reliable.

However; whichever way around you turn this cookie; it is expensive and nasty.

It doesn't help that there are truly bonghit crazy proposals for using this out
there, and I really hope to never see them in code.

These patches are lightly tested and didn't insta explode, but no promises,
they might just set your pets on fire.

'enjoy'

@pjt; I know this isn't quite what we talked about, but this is where I ended
up after I started typing. There's plenty design decisions to question and my
changelogs don't even get close to beginning to cover them all. Feel free to ask.

---
 include/linux/sched.h|   9 +-
 kernel/Kconfig.preempt   |   8 +-
 kernel/sched/core.c  | 762 ---
 kernel/sched/deadline.c  |  99 +++---
 kernel/sched/debug.c |   4 +-
 kernel/sched/fair.c  | 129 +---
 kernel/sched/idle.c  |  42 ++-
 kernel/sched/pelt.h  |   2 +-
 kernel/sched/rt.c|  96 +++---
 kernel/sched/sched.h | 183 
 kernel/sched/stop_task.c |  35 ++-
 kernel/sched/topology.c  |   4 +-
 kernel/stop_machine.c|   2 +
 13 files changed, 1096 insertions(+), 279 deletions(-)