Re: [ANNOUNCE] v4.11.12-rt9

2017-08-08 Thread Mike Galbraith
On Mon, 2017-08-07 at 10:22 +0200, Mike Galbraith wrote:
> On Mon, 2017-08-07 at 09:33 +0200, Sebastian Andrzej Siewior wrote:
>  
> > what timer is it :)?
> 
> kernel/exit.c:
> 851 hrtimer_cancel(&tsk->signal->real_timer);
> 
> That one.

The timer's cb_entry pointing at base.expired (the deferred list) is the
"what" part.

Freshly captured crash dump:

crash> hrtimer 8801605a7400
struct hrtimer {
  node = {
    node = {
      __rb_parent_color = 18446612138225726464, 
      rb_right = 0x88017f0d4be0, 
      rb_left = 0x0
    }, 
    expires = 323599443206
  }, 
  _softexpires = 323599443206, 
  function = 0x81129f00 , 
  base = 0x88017f0d4580, 
  cb_entry = {
    next = 0x88017f0d45a0,
    prev = 0x88017f0d45a0
  }, 
  irqsafe = 0, 
  state = 0 '\000', 
  is_rel = 0 '\000'
}
crash> hrtimer_clock_base 0x88017f0d4580
struct hrtimer_clock_base {
  cpu_base = 0x88017f0d4440, 
  index = 0, 
  clockid = 1, 
  active = {
    head = {
      rb_node = 0xc900079cfa48
    }, 
    next = 0x88017f0d4be0
  }, 
  expired = {
    next = 0x88017f0d45a0, 
    prev = 0x88017f0d45a0
  }, 
  get_time = 0x8111dc90 , 
  offset = 0
}
crash> hrtimer_clock_base -ox 0x88017f0d4580
struct hrtimer_clock_base {
  [88017f0d4580] struct hrtimer_cpu_base *cpu_base;
  [88017f0d4588] int index;
  [88017f0d458c] clockid_t clockid;
  [88017f0d4590] struct timerqueue_head active;
  [88017f0d45a0] struct list_head expired;
  [88017f0d45b0] ktime_t (*get_time)(void);
  [88017f0d45b8] ktime_t offset;
}
SIZE: 0x40
crash>
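
For anyone following along, the dump can be cross-checked with a bit of arithmetic. The sketch below is only a reading aid: the constants are copied from the crash output above as printed there, and the `ffff` prefix used in the last check is an assumption (the hex addresses above appear without their leading `ffff`, while `__rb_parent_color` is printed in full as a decimal).

```python
# Values copied from the crash session above, exactly as printed there.
TIMER = 0x8801605a7400        # argument given to "crash> hrtimer"
BASE = 0x88017f0d4580         # timer.base
EXPIRED = 0x88017f0d45a0      # address of base.expired per the -ox listing
CB_ENTRY = {"next": 0x88017f0d45a0, "prev": 0x88017f0d45a0}
EXPIRED_HEAD = {"next": 0x88017f0d45a0, "prev": 0x88017f0d45a0}
RB_PARENT_COLOR = 18446612138225726464

# Offset of "expired" inside struct hrtimer_clock_base.
expired_offset = EXPIRED - BASE

# Both cb_entry pointers refer to base.expired ...
timer_points_at_expired = all(p == EXPIRED for p in CB_ENTRY.values())
# ... while base.expired points back at itself, i.e. the list looks empty.
expired_list_empty = all(p == EXPIRED for p in EXPIRED_HEAD.values())

# Assumption: the archive dropped the leading "ffff" from the hex addresses.
rb_node_points_at_itself = RB_PARENT_COLOR == ((0xffff << 48) | TIMER)

print(hex(expired_offset), timer_points_at_expired,
      expired_list_empty, rb_node_points_at_itself)
```

If the last check holds, the rb node's parent pointer is the node itself, which is how RB_CLEAR_NODE() marks a node as not being in any tree. Together with state = 0 that would say the timer is not enqueued, yet its cb_entry still claims membership in an expired list whose head says it is empty.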


Re: [ANNOUNCE] v4.11.12-rt9

2017-08-07 Thread Mike Galbraith
On Mon, 2017-08-07 at 09:52 +0200, Sebastian Andrzej Siewior wrote:
> 
> can you reproduce that one? I don't see where this TASK_UNINTERRUPTIBLE
> is coming from.

Ditto.  The gripe points at where state was set to TASK_RUNNING, which doesn't
look particularly wonderful.

-Mike


Re: [ANNOUNCE] v4.11.12-rt9

2017-08-07 Thread Mike Galbraith
On Mon, 2017-08-07 at 09:52 +0200, Sebastian Andrzej Siewior wrote:
> 
> can you reproduce that one? I don't see where this TASK_UNINTERRUPTIBLE
> is coming from.

Yup, x3550 just reproduced nearly instantly.

-Mike


Re: [ANNOUNCE] v4.11.12-rt9

2017-08-07 Thread Mike Galbraith
On Mon, 2017-08-07 at 09:33 +0200, Sebastian Andrzej Siewior wrote:
> On 2017-08-05 16:57:23 [+0200], Mike Galbraith wrote:
> > > Woohoo!
> > 
> > Box put a small dent in enthusiasm.  After a bit of hotplug flogging,
> > box blew up on shutdown.  x3550 M3 has a serial port, and reproduced.
> > 
> > [  624.216065] list_del corruption. prev->next should be 88015cb31278, but was 88017f0945a0
> > [  624.216077] [ cut here ]
> > [  624.216079] kernel BUG at lib/list_debug.c:53!
> 
> what timer is it :)?

kernel/exit.c:
851 hrtimer_cancel(&tsk->signal->real_timer);

That one.


Re: [ANNOUNCE] v4.11.12-rt9

2017-08-07 Thread Sebastian Andrzej Siewior
On 2017-08-05 08:13:03 [+0200], Mike Galbraith wrote:
> 
> Steven's script annoyed the scheduler here, but woohoo regardless, it
> hasn't yet made boom, or stopped dead in its tracks.  I'll give it some
> exercise on my 64 core box, where death has never (modulo fugly hacks
> that survived 30 hrs of hell.. once) been more than minutes away.
> 
> [  190.589248] [ cut here ]
> [  190.589273] WARNING: CPU: 1 PID: 5679 at kernel/sched/core.c:6346 __might_sleep+0x80/0x90
> [  190.589277] do not call blocking ops when !TASK_RUNNING; state=2 set at [] __finish_swait+0x5/0x60
> [  190.589340] CPU: 1 PID: 5679 Comm: stress-cpu-hotp Tainted: GE   4.11.12-rt9-virgin #11
> [  190.589341] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
> [  190.589341] Call Trace:
> [  190.589355]  __might_sleep+0x80/0x90
> [  190.589358]  rt_mutex_lock_state+0x25/0x60
> [  190.589361]  rt_mutex_lock+0x13/0x20
> [  190.589362]  _mutex_lock+0x39/0x40
> [  190.589365]  stop_cpus+0x23/0x50
> [  190.589367]  stop_machine_cpuslocked+0xed/0x130
> [  190.589370]  takedown_cpu+0x80/0x110
> [  190.589372]  cpuhp_invoke_callback+0x248/0x9d0
> [  190.589376]  cpuhp_down_callbacks+0x42/0x80
> [  190.589378]  _cpu_down+0xc5/0x100
> [  190.589380]  do_cpu_down+0x3c/0x60
> [  190.589381]  cpu_down+0x10/0x20
> [  190.589384]  cpu_subsys_offline+0x14/0x20
> [  190.589385]  device_offline+0x8a/0xb0
> [  190.589387]  online_store+0x40/0x80
> [  190.589389]  dev_attr_store+0x18/0x30
> [  190.589391]  sysfs_kf_write+0x44/0x60
> [  190.589392]  kernfs_fop_write+0x13c/0x1d0
> [  190.589395]  __vfs_write+0x28/0x140

can you reproduce that one? I don't see where this TASK_UNINTERRUPTIBLE
is coming from.

Sebastian
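
For reference, `state=2` in the warning decodes to TASK_UNINTERRUPTIBLE. A tiny sketch of the decoding, using the task state bit values from include/linux/sched.h in kernels of this era; only the first few bits are listed, and `decode_state` is a made-up helper for illustration:

```python
# Task state bit values as defined in include/linux/sched.h (4.11 era).
# Only the lowest bits are listed here; higher bits are not decoded.
TASK_STATES = {
    1: "TASK_INTERRUPTIBLE",
    2: "TASK_UNINTERRUPTIBLE",
    4: "__TASK_STOPPED",
    8: "__TASK_TRACED",
}


def decode_state(state):
    """Return the names of the listed state bits that are set; 0 is TASK_RUNNING."""
    if state == 0:
        return ["TASK_RUNNING"]
    return [name for bit, name in TASK_STATES.items() if state & bit]


print(decode_state(2))
```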


Re: [ANNOUNCE] v4.11.12-rt9

2017-08-07 Thread Sebastian Andrzej Siewior
On 2017-08-05 16:57:23 [+0200], Mike Galbraith wrote:
> > Woohoo!
> 
> Box put a small dent in enthusiasm.  After a bit of hotplug flogging,
> box blew up on shutdown.  x3550 M3 has a serial port, and reproduced.
> 
> [  624.216065] list_del corruption. prev->next should be 88015cb31278, but was 88017f0945a0
> [  624.216077] [ cut here ]
> [  624.216079] kernel BUG at lib/list_debug.c:53!

what timer is it :)?

Sebastian


Re: [ANNOUNCE] v4.11.12-rt9

2017-08-05 Thread Mike Galbraith
On Sat, 2017-08-05 at 08:13 +0200, Mike Galbraith wrote:
> On Fri, 2017-08-04 at 19:38 +0200, Sebastian Andrzej Siewior wrote:
> > Dear RT folks!
> > 
> > I'm pleased to announce the v4.11.12-rt9 patch set. 
> > 
> > Changes since v4.11.12-rt8:
> > 
> >   - CPU hotplug could be rock solid now. Yes. The rewrite of the hotplug
> >     related parts for RT including rwlock's implementation over the last
> >     few weeks looks good. 'good' means that Steven's CPU-hotplug test script
> >     ran on an x86 box with two nodes without hanging for over a week.
> 
> Woohoo!

Box put a small dent in enthusiasm.  After a bit of hotplug flogging,
box blew up on shutdown.  x3550 M3 has a serial port, and reproduced.

[  624.216065] list_del corruption. prev->next should be 88015cb31278, but was 88017f0945a0
[  624.216077] [ cut here ]
[  624.216079] kernel BUG at lib/list_debug.c:53!
[  624.216080] invalid opcode:  [#1] PREEMPT SMP
[  624.216085] Dumping ftrace buffer:
[  624.216090]    (ftrace buffer empty)
[  624.216091] Modules linked in: ebtable_filter(E) ebtables(E) 
ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) x_tables(E) 
af_packet(E) br_netfilter(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) 
iscsi_boot_sysfs(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) 
nls_iso8859_1(E) nls_cp437(E) kvm(E) vfat(E) fat(E) ipmi_ssif(E) irqbypass(E) 
cdc_ether(E) crct10dif_pclmul(E) usbnet(E) crc32_pclmul(E) mii(E) 
crc32c_intel(E) ghash_clmulni_intel(E) ipmi_si(E) pcbc(E) ipmi_devintf(E) 
iTCO_wdt(E) iTCO_vendor_support(E) aesni_intel(E) ipmi_msghandler(E) bnx2(E) 
aes_x86_64(E) i7core_edac(E) i5500_temp(E) crypto_simd(E) ioatdma(E) lpc_ich(E) 
glue_helper(E) edac_core(E) dca(E) mfd_core(E) shpchp(E) cryptd(E) button(E) 
acpi_cpufreq(E) i2c_i801(E) pcspkr(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) 
lockd(E) grace(E)
[  624.216131]  sunrpc(E) ext4(E) crc16(E) jbd2(E) mbcache(E) sd_mod(E) 
ehci_pci(E) uhci_hcd(E) i2c_algo_bit(E) ata_generic(E) drm_kms_helper(E) 
syscopyarea(E) sysfillrect(E) sysimgblt(E) ehci_hcd(E) fb_sys_fops(E) 
ata_piix(E) ttm(E) ahci(E) libahci(E) drm(E) megaraid_sas(E) libata(E) 
usbcore(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) 
scsi_dh_alua(E) scsi_mod(E) efivarfs(E) autofs4(E)
[  624.216155] CPU: 0 PID: 2584 Comm: ntpd Tainted: GW   E   4.11.12-rt9-virgin #1
[  624.216156] Hardware name: IBM System x3550 M3 -[7944K3G]-/69Y5698 , BIOS -[D6E150AUS-1.10]- 12/15/2010
[  624.216158] task: 880143fc6000 task.stack: c90003b18000
[  624.216164] RIP: 0010:__list_del_entry_valid+0x7b/0x90
[  624.216165] RSP: 0018:c90003b1be18 EFLAGS: 00010096
[  624.216167] RAX: 0054 RBX: 88015cb31240 RCX: 
[  624.216168] RDX: 880143fc6000 RSI:  RDI: 810f552b
[  624.216169] RBP: c90003b1be18 R08: 0001 R09: 
[  624.216170] R10: 201f R11:  R12: 88017f094580
[  624.216172] R13: 88017f094440 R14: 88017f094440 R15: 88015cb31278
[  624.216173] FS:  7f63dcde4700() GS:88017f00() knlGS:
[  624.216175] CS:  0010 DS:  ES:  CR0: 80050033
[  624.216176] CR2: 7f63dbd11080 CR3: 00016510c000 CR4: 06f0
[  624.216178] Call Trace:
[  624.216184]  __remove_hrtimer+0xac/0xd0
[  624.216188]  hrtimer_try_to_cancel+0xd2/0x220
[  624.216191]  hrtimer_cancel+0x1f/0x30
[  624.216195]  do_exit+0x742/0xd60
[  624.216200]  ? entry_SYSCALL_64_fastpath+0x5/0xc2
[  624.216205]  ? trace_hardirqs_on_caller+0xf9/0x1c0
[  624.216208]  do_group_exit+0x4c/0xc0
[  624.216211]  SyS_exit_group+0x14/0x20
[  624.216214]  entry_SYSCALL_64_fastpath+0x1f/0xc2
[  624.216216] RIP: 0033:0x7f63dbfd04a9
[  624.216217] RSP: 002b:7ffdb1e96568 EFLAGS: 0246 ORIG_RAX: 00e7
[  624.216219] RAX: ffda RBX: 7f63dcde6000 RCX: 7f63dbfd04a9
[  624.216220] RDX:  RSI: 7f63dc2b4303 RDI: 
[  624.216221] RBP: 7ffdb1e96560 R08: 003c R09: 00e7
[  624.216223] R10: ff90 R11: 0246 R12: 7f63dbd10b70
[  624.216224] R13: 7ffdb1e964d0 R14: 7ffdb1e964e8 R15: 
[  624.216230] Code: e8 82 45 dd ff 0f 0b 48 89 fe 31 c0 48 c7 c7 28 05 a5 81 
e8 6f 45 dd ff 0f 0b 48 89 fe 31 c0 48 c7 c7 e8 04 a5 81 e8 5c 45 dd ff <0f> 0b 
48 89 fe 31 c0 48 c7 c7 b0 04 a5 81 e8 49 45 dd ff 0f 0b 
[  624.216266] RIP: __list_del_entry_valid+0x7b/0x90 RSP: c90003b1be18
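
The first line of the splat comes from the list debugging check that runs before an entry is unlinked. As a reading aid, here is a small model of that check, not the kernel's code: `Node`, `list_add` and `del_entry_valid` are names made up for this sketch, and only the `prev->next` test is modelled. It reproduces the same kind of complaint when an entry still points at a list head that no longer points back at it.

```python
class Node:
    """Minimal stand-in for struct list_head (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.next = self   # an initialised head points at itself
        self.prev = self


def list_add(entry, head):
    """Insert entry right after head, as in a plain doubly linked list."""
    entry.next = head.next
    entry.prev = head
    head.next.prev = entry
    head.next = entry


def del_entry_valid(entry):
    """Model of the prev->next sanity test done before unlinking an entry."""
    if entry.prev.next is not entry:
        return (False,
                "list_del corruption. prev->next should be %s, but was %s"
                % (entry.name, entry.prev.next.name))
    return (True, "")


expired = Node("base.expired")
cb_entry = Node("timer.cb_entry")
list_add(cb_entry, expired)
print(del_entry_valid(cb_entry))   # consistent list, check passes

# Re-initialise the head behind the entry's back: the head now looks
# empty, while the entry still believes it is queued on it.
expired.next = expired.prev = expired
print(del_entry_valid(cb_entry))   # check fails with a list_del complaint
```

In the splat itself, the "but was" value 88017f0945a0 is 0x20 above R12 (88017f094580), which matches the offset of `expired` in the hrtimer_clock_base layout shown in the crash session elsewhere in this thread, so the stale pointer there also looks like a base.expired head.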

config.xz
Description: application/xz


Re: [ANNOUNCE] v4.11.12-rt9

2017-08-04 Thread Mike Galbraith
On Fri, 2017-08-04 at 19:38 +0200, Sebastian Andrzej Siewior wrote:
> Dear RT folks!
> 
> I'm pleased to announce the v4.11.12-rt9 patch set. 
> 
> Changes since v4.11.12-rt8:
> 
>   - CPU hotplug could be rock solid now. Yes. The rewrite of the hotplug
>     related parts for RT including rwlock's implementation over the last
>     few weeks looks good. 'good' means that Steven's CPU-hotplug test script
>     ran on an x86 box with two nodes without hanging for over a week.

Woohoo!

Steven's script annoyed the scheduler here, but woohoo regardless, it
hasn't yet made boom, or stopped dead in its tracks.  I'll give it some
exercise on my 64 core box, where death has never (modulo fugly hacks
that survived 30 hrs of hell.. once) been more than minutes away.

[  190.589248] [ cut here ]
[  190.589273] WARNING: CPU: 1 PID: 5679 at kernel/sched/core.c:6346 __might_sleep+0x80/0x90
[  190.589277] do not call blocking ops when !TASK_RUNNING; state=2 set at [] __finish_swait+0x5/0x60
[  190.589279] Modules linked in: x86_pkg_temp_thermal(E-) fuse(E) 
ebtable_filter(E) ebtables(E) rpcsec_gss_krb5(E) nfsv4(E) dns_resolver(E) 
nfs(E) fscache(E) xt_pkttype(E) xt_physdev(E) af_packet(E) br_netfilter(E) 
bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) ip6t_REJECT(E) 
xt_tcpudp(E) nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ip6table_raw(E) 
ipt_REJECT(E) iptable_raw(E) xt_CT(E) iptable_filter(E) ip6table_mangle(E) 
nf_conntrack_netbios_ns(E) nf_conntrack_broadcast(E) nf_conntrack_ipv4(E) 
nf_defrag_ipv4(E) ip_tables(E) xt_conntrack(E) nf_conntrack(E) libcrc32c(E) 
ip6table_filter(E) ip6_tables(E) x_tables(E) nls_iso8859_1(E) nls_cp437(E) 
intel_rapl(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) 
joydev(E) snd_hda_codec_realtek(E) snd_hda_codec_hdmi(E) 
snd_hda_codec_generic(E)
[  190.589300]  crct10dif_pclmul(E) crc32_pclmul(E) snd_hda_intel(E) 
crc32c_intel(E) snd_hda_codec(E) snd_hda_core(E) intel_spi_platform(E) 
intel_spi(E) spi_nor(E) snd_hwdep(E) ghash_clmulni_intel(E) battery(E) pcbc(E) 
r8169(E) mtd(E) mii(E) snd_pcm(E) iTCO_wdt(E) iTCO_vendor_support(E) 
aesni_intel(E) snd_timer(E) aes_x86_64(E) crypto_simd(E) snd(E) mei_me(E) 
lpc_ich(E) thermal(E) glue_helper(E) tpm_infineon(E) soundcore(E) i2c_i801(E) 
mfd_core(E) cryptd(E) mei(E) shpchp(E) pcspkr(E) fan(E) intel_smartconnect(E) 
nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) sr_mod(E) 
cdrom(E) hid_logitech_hidpp(E) hid_logitech_dj(E) uas(E) usb_storage(E) 
hid_generic(E) usbhid(E) nouveau(E) wmi(E) i2c_algo_bit(E) drm_kms_helper(E) 
syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ehci_pci(E) 
xhci_pci(E)
[  190.589324]  ahci(E) ehci_hcd(E) ttm(E) xhci_hcd(E) libahci(E) drm(E) 
libata(E) usbcore(E) video(E) button(E) sd_mod(E) vfat(E) fat(E) virtio_blk(E) 
virtio_mmio(E) virtio_pci(E) virtio_ring(E) virtio(E) ext4(E) crc16(E) jbd2(E) 
mbcache(E) loop(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) 
scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) efivarfs(E) autofs4(E)
[  190.589340] CPU: 1 PID: 5679 Comm: stress-cpu-hotp Tainted: GE   4.11.12-rt9-virgin #11
[  190.589341] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
[  190.589341] Call Trace:
[  190.589345]  dump_stack+0x85/0xc8
[  190.589348]  __warn+0xec/0x110
[  190.589351]  warn_slowpath_fmt+0x4f/0x60
[  190.589353]  ? __finish_swait+0x5/0x60
[  190.589354]  ? __finish_swait+0x5/0x60
[  190.589355]  __might_sleep+0x80/0x90
[  190.589358]  rt_mutex_lock_state+0x25/0x60
[  190.589360]  ? cpu_stop_queue_work+0xb0/0xb0
[  190.589361]  rt_mutex_lock+0x13/0x20
[  190.589362]  _mutex_lock+0x39/0x40
[  190.589363]  ? stop_cpus+0x23/0x50
[  190.589365]  stop_cpus+0x23/0x50
[  190.589366]  ? cpuhp_invoke_callback+0x9d0/0x9d0
[  190.589367]  stop_machine_cpuslocked+0xed/0x130
[  190.589368]  ? cpuhp_invoke_callback+0x9d0/0x9d0
[  190.589370]  takedown_cpu+0x80/0x110
[  190.589372]  ? cpuhp_complete_idle_dead+0x20/0x20
[  190.589372]  cpuhp_invoke_callback+0x248/0x9d0
[  190.589376]  cpuhp_down_callbacks+0x42/0x80
[  190.589378]  _cpu_down+0xc5/0x100
[  190.589380]  do_cpu_down+0x3c/0x60
[  190.589381]  cpu_down+0x10/0x20
[  190.589384]  cpu_subsys_offline+0x14/0x20
[  190.589385]  device_offline+0x8a/0xb0
[  190.589387]  online_store+0x40/0x80
[  190.589389]  dev_attr_store+0x18/0x30
[  190.589391]  sysfs_kf_write+0x44/0x60
[  190.589392]  kernfs_fop_write+0x13c/0x1d0
[  190.589395]  __vfs_write+0x28/0x140
[  190.589397]  ? rcu_read_lock_sched_held+0x98/0xa0
[  190.589398]  ? rcu_sync_lockdep_assert+0x32/0x60
[  190.589399]  ? __sb_start_write+0x1d2/0x290
[  190.589400]  ? vfs_write+0x196/0x1f0
[  190.589402]  ? security_file_permission+0x3b/0xc0
[  190.589404]  vfs_write+0xc7/0x1f0
[  190.589406]  ? trace_hardirqs_on_caller+0xf9/0x1c0
[  190.589408]  SyS_write+0x49/0xa0
[  190.589410]  entry_SYSCALL_64_fastpath+0x1f/0xc2
[  190.589411] RIP: 0033:0x7fb5065a92d0
[  190.589411] RSP: 002b:7ffe9afe2988 EFLA