Bug#860236: xen pv domU crash with 3.16 kernel and xen 4.8

2017-04-20 Thread Vincent Legout
On Wed, Apr 19, 2017 at 08:39:05PM +0100, Ben Hutchings wrote :
> On Fri, 2017-04-14 at 11:18 +0200, Vincent Legout wrote:
> [...]
> > Could cpu hotplug be buggy in 3.16? And Xen triggers this bug after 5
> > minutes even without doing any 'xl vcpu-set'?
> 
> The MCE polling timer for each CPU runs every 5 minutes, so this is
> presumably the first time it runs.  Perhaps this domain is configured
> such that CPUs are hot-removed shortly after boot?

I didn't explicitly set anything like that, but I guess it could also be
a default configuration in Xen.

> In the first crash, it looks like the timer for CPU x!=0 is being
> called on CPU 0.  In general this can happen if CPU x is hot-removed;
> its timers are migrated to another CPU.  This should *not* be possible
> with the MCE timer, as there is a hotplug callback that removes the
> timer when a CPU is removed.  There is a check for the timer having
> been migrated anyway, which triggers the WARNING.  The timer function
> then tries to re-add the timer for the current CPU, but that's still
> pending, which triggers the BUG.  Either the hotplug callback was not
> called, or the timer was migrated before being removed resulting in a
> race condition.
> 
> > With "maxvcpus" set larger than "vcpus", xl vcpu-set seems to work most of
> > the time (between 1 and 16 vcpus), but after several tries, I got the
> > attached trace.
> 
> I'm not sure what's going on in this crash, but as it's a null
> dereference in migrate_timer_list it seems somewhat related.
> 
> I didn't find any changes that would explain how this was fixed between
> 4.0 and 4.2.  I suggest you work around it by adding 'nomce' to the
> kernel command line as I would expect Xen or dom0 to handle MCEs.

Thanks a lot Ben, I can't reproduce the issue with 'nomce'.

Thanks,
Vincent




Bug#860236: xen pv domU crash with 3.16 kernel and xen 4.8

2017-04-19 Thread Ben Hutchings
On Fri, 2017-04-14 at 11:18 +0200, Vincent Legout wrote:
[...]
> Could cpu hotplug be buggy in 3.16? And Xen triggers this bug after 5
> minutes even without doing any 'xl vcpu-set'?

The MCE polling timer for each CPU runs every 5 minutes, so this is
presumably the first time it runs.  Perhaps this domain is configured
such that CPUs are hot-removed shortly after boot?

In the first crash, it looks like the timer for CPU x!=0 is being
called on CPU 0.  In general this can happen if CPU x is hot-removed;
its timers are migrated to another CPU.  This should *not* be possible
with the MCE timer, as there is a hotplug callback that removes the
timer when a CPU is removed.  There is a check for the timer having
been migrated anyway, which triggers the WARNING.  The timer function
then tries to re-add the timer for the current CPU, but that's still
pending, which triggers the BUG.  Either the hotplug callback was not
called, or the timer was migrated before being removed resulting in a
race condition.
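
To make the sequence above concrete, here is a toy user-space model in
Python. It is not kernel code: the Timer class and the hot_remove,
run_timer and simulate functions are invented for this sketch, and it
assumes CPU 0's own timer fires just before the migrated one.

```python
class Timer:
    """Toy stand-in for a per-CPU polling timer."""

    def __init__(self, owner_cpu):
        self.owner_cpu = owner_cpu  # CPU the timer was set up for
        self.pending = True         # queued and waiting to fire


def hot_remove(cpu, timers, queues, callback_ran):
    """Take `cpu` offline.

    If the hotplug callback ran, the timer is deleted. Otherwise the
    still-pending timer is migrated to CPU 0's queue (hypothetical model).
    """
    timer = timers[cpu]
    queues[cpu].remove(timer)
    if callback_ran:
        timer.pending = False
    else:
        queues[0].append(timer)


def run_timer(timer, current_cpu, timers):
    """Fire `timer` on `current_cpu` and return the events it would log."""
    events = []
    timer.pending = False              # a firing timer is no longer queued
    if timer.owner_cpu != current_cpu:
        events.append("WARNING")       # the migrated-timer check
    rearm = timers[current_cpu]        # re-add the timer of the *current* CPU
    if rearm.pending:
        events.append("BUG")           # adding a timer that is already pending
    else:
        rearm.pending = True
    return events


def simulate(callback_ran):
    """Hot-remove CPU 1, then run everything queued on CPU 0."""
    timers = {0: Timer(0), 1: Timer(1)}
    queues = {0: [timers[0]], 1: [timers[1]]}
    hot_remove(1, timers, queues, callback_ran)
    events = []
    for timer in list(queues[0]):
        events += run_timer(timer, 0, timers)
    return events
```

In this model simulate(True) logs nothing, while simulate(False) logs the
WARNING followed by the BUG, the same order as in the reported traces.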

> With "maxvcpus" set larger than "vcpus", xl vcpu-set seems to work most of
> the time (between 1 and 16 vcpus), but after several tries, I got the
> attached trace.

I'm not sure what's going on in this crash, but as it's a null
dereference in migrate_timer_list it seems somewhat related.

I didn't find any changes that would explain how this was fixed between
4.0 and 4.2.  I suggest you work around it by adding 'nomce' to the
kernel command line as I would expect Xen or dom0 to handle MCEs.
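
After rebooting the guest, one way to confirm that the option actually
reached the kernel is to look at /proc/cmdline. The helper below is a
minimal sketch in Python; the function names and the sample command
lines in the usage note are illustrative assumptions.

```python
def has_kernel_param(cmdline, name):
    """Return True if `name` is present as a bare token or as `name=value`."""
    for token in cmdline.split():
        if token == name or token.startswith(name + "="):
            return True
    return False


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read().strip()
```

For example, has_kernel_param("root=/dev/xvda1 ro nomce", "nomce") is
True, and has_kernel_param(read_cmdline(), "nomce") checks the running
guest.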

Ben.

-- 
Ben Hutchings
Man invented language to satisfy his deep need to complain. - Lily
Tomlin





Bug#860236: xen pv domU crash with 3.16 kernel and xen 4.8

2017-04-14 Thread Vincent Legout
On Fri, Apr 14, 2017 at 09:15:58AM +0200, Vincent Legout wrote :
> On Thu, Apr 13, 2017 at 11:41:37PM +0100, Ben Hutchings wrote :
> > Control: tag -1 moreinfo
> > 
> > On Thu, 2017-04-13 at 11:18 +0200, Vincent Legout wrote:
> > > Package: src:linux
> > > Version: 3.16.39-1+deb8u2
> > > Severity: normal
> > > 
> > > Hi,
> > > 
> > > A xen jessie domU crashes around 5 minutes after the boot with the
> > > attached backtrace (at every boot). dom0 is also a Debian jessie running
> > > Xen 4.8.
> > > 
> > > It only happens when the guest is in pv mode, it works fine with pvhvm.
> > > 
> > > It also crashes with older 3.16 kernels and 4.0.2-1, but not with
> > > 4.2.1-1 (last 2 kernels from snapshot.debian.org).
> > > 
> > > # uname -a
> > > 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1+deb8u2 (2017-03-07) x86_64 GNU/Linux
> > 
> > From the crash log:
> > 
> > > [  300.632389] CPU: 0 PID: 0 Comm: swapper/0 Tainted: GW 3.16.0-4-amd64 #1 Debian 3.16.39-1+deb8u2
> > 
> > This indicates there was an earlier WARNING message; what was that?
> 
> Thanks for the answer.
> 
> I got this WARNING after I increased verbosity in the command line:

The WARNING and BUG disappear if "maxvcpus" is disabled in the guest
configuration (which prevents adding or removing vcpus).

Could cpu hotplug be buggy in 3.16? And Xen triggers this bug after 5
minutes even without doing any 'xl vcpu-set'?

With "maxvcpus" set larger than "vcpus", xl vcpu-set seems to work most of
the time (between 1 and 16 vcpus), but after several tries, I got the
attached trace.
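
For anyone comparing guest configurations, the sketch below (Python)
reports how many vcpus could still be hot-added given the "vcpus" and
"maxvcpus" settings. The parser only understands plain "name = integer"
lines, the function names are made up for this example, and the default
of 1 for a missing "vcpus" is an assumption about xl's behaviour.

```python
import re


def parse_int_options(config_text):
    """Collect `name = integer` assignments, skipping comments and other lines."""
    options = {}
    for line in config_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        match = re.fullmatch(r"\s*(\w+)\s*=\s*(\d+)\s*", line)
        if match:
            options[match.group(1)] = int(match.group(2))
    return options


def vcpu_hotplug_headroom(config_text):
    """Return how many vcpus could be added; 0 when maxvcpus is not set."""
    options = parse_int_options(config_text)
    vcpus = options.get("vcpus", 1)           # assumed default
    maxvcpus = options.get("maxvcpus", vcpus)
    return max(maxvcpus - vcpus, 0)
```

A result of 0 corresponds to the case where the WARNING and BUG
disappeared; any positive value means vcpus can be added or removed.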

Vincent
[   62.000210] BUG: unable to handle kernel NULL pointer dereference at 0008
[   62.000229] IP: [] migrate_timer_list+0x3b/0xc0
[   62.000246] PGD 0
[   62.000251] Oops: 0002 [#1] SMP
[   62.000261] Modules linked in: x86_pkg_temp_thermal thermal_sys intel_rapl coretemp crc32_pclmul evdev aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper pcspkr cryptd autofs4 ext4 crc16 mbcache jbd2 xen_netfront xen_blkfront crct10dif_pclmul crct10dif_common crc32c_intel
[   62.000306] CPU: 9 PID: 89 Comm: xenwatch Not tainted 3.16.0-4-amd64 #1 Debian 3.16.39-1+deb8u2
[   62.000318] task: 88003d597370 ti: 88003d598000 task.ti: 88003d598000
[   62.000326] RIP: e030:[]  [] migrate_timer_list+0x3b/0xc0
[   62.000338] RSP: e02b:88003d59bd70  EFLAGS: 00010087
[   62.000344] RAX: dead0200 RBX:  RCX: 223a
[   62.000351] RDX:  RSI: 88003f96ca00 RDI: 88003daac000
[   62.000357] RBP: 88003f96ca00 R08: 4000 R09: fff8
[   62.000364] R10:  R11:  R12: 88003e3a5430
[   62.000375] R13: 88003daac000 R14: 818e2fa0 R15: 88003e3a5030
[   62.000387] FS:  () GS:88003f92() knlGS:
[   62.000395] CS:  e033 DS:  ES:  CR0: 80050033
[   62.000401] CR2: 0008 CR3: 01813000 CR4: 00042660
[   62.000408] Stack:
[   62.000411]   88003daac000 88003e3a5c30 88003e3a5830
[   62.000422]  88003e3a5430 81075188 88003e3a4000 fff2
[   62.000432]  8184c1a0  0007 0001
[   62.000443] Call Trace:
[   62.000455]  [] ? timer_cpu_notify+0xf8/0x2e0
[   62.000465]  [] ? notifier_call_chain+0x4e/0x70
[   62.000478]  [] ? cpu_notify+0x1f/0x40
[   62.000486]  [] ? cpu_notify_nofail+0xa/0x20
[   62.000499]  [] ? _cpu_down+0x17b/0x290
[   62.000512]  [] ? unregister_xenbus_watch+0x210/0x210
[   62.000520]  [] ? cpu_down+0x2d/0x40
[   62.000530]  [] ? handle_vcpu_hotplug_event+0xa7/0xd0
[   62.000538]  [] ? xenwatch_thread+0x92/0x130
[   62.000550]  [] ? prepare_to_wait_event+0xf0/0xf0
[   62.000565]  [] ? kthread+0xbd/0xe0
[   62.000572]  [] ? kthread_create_on_node+0x180/0x180
[   62.000586]  [] ? ret_from_fork+0x58/0x90
[   62.000594]  [] ? kthread_create_on_node+0x180/0x180
[   62.000600] Code: 49 89 fd 41 54 49 89 f4 55 53 48 8b 2e 48 39 ee 74 4a 66 0f 1f 44 00 00 0f 1f 44 00 00 48 8b 45 08 48 8b 55 00 48 89 ee 4c 89 ef <48> 89 42 08 48 89 10 48 b8 00 02 00 00 00 00 ad de 48 89 45 08
[   62.000680] RIP  [] migrate_timer_list+0x3b/0xc0
[   62.000692]  RSP
[   62.000696] CR2: 0008
[   62.000703] ---[ end trace b62387850d17f99e ]---
[   84.492006] INFO: rcu_sched detected stalls on CPUs/tasks: { 2 8 9} (detected by 4, t=5255 jiffies, g=614, c=613, q=59)
[   84.492039] sending NMI to all CPUs:
[   63.481417] NMI backtrace for cpu 0
[   63.481417] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G  D   3.16.0-4-amd64 #1 Debian 3.16.39-1+deb8u2
[   63.481417] task: 8181a460 ti: 8180 task.ti: 8180
[   63.481417] RIP: e030:[]  [] _raw_spin_lock+0x28/0x30
[   63.481417] RSP: e02b:88003f803b58  EFLAGS: 0093
[   63.481417] RAX: 0198 RBX: 88003af9d3d8 RCX: 019b
[   63.481417] 

Bug#860236: xen pv domU crash with 3.16 kernel and xen 4.8

2017-04-14 Thread Vincent Legout
On Thu, Apr 13, 2017 at 11:41:37PM +0100, Ben Hutchings wrote :
> Control: tag -1 moreinfo
> 
> On Thu, 2017-04-13 at 11:18 +0200, Vincent Legout wrote:
> > Package: src:linux
> > Version: 3.16.39-1+deb8u2
> > Severity: normal
> > 
> > Hi,
> > 
> > A xen jessie domU crashes around 5 minutes after the boot with the
> > attached backtrace (at every boot). dom0 is also a Debian jessie running
> > Xen 4.8.
> > 
> > It only happens when the guest is in pv mode, it works fine with pvhvm.
> > 
> > It also crashes with older 3.16 kernels and 4.0.2-1, but not with
> > 4.2.1-1 (last 2 kernels from snapshot.debian.org).
> > 
> > # uname -a
> > 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1+deb8u2 (2017-03-07) x86_64 GNU/Linux
> 
> From the crash log:
> 
> > [  300.632389] CPU: 0 PID: 0 Comm: swapper/0 Tainted: GW 3.16.0-4-amd64 #1 Debian 3.16.39-1+deb8u2
> 
> This indicates there was an earlier WARNING message; what was that?

Thanks for the answer.

I got this WARNING after I increased verbosity in the command line:

[  300.636063] ------------[ cut here ]------------
[  300.636102] WARNING: CPU: 0 PID: 0 at /build/linux-GSgHvp/linux-3.16.39/arch/x86/kernel/cpu/mcheck/mce.c:1307 mce_timer_fn+0x132/0x140()
[  300.636116] Modules linked in: x86_pkg_temp_thermal thermal_sys intel_rapl coretemp crc32_pclmul evdev aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper pcspkr cryptd autofs4 ext4 crc16 mbcache jbd2 xen_netfront xen_blkfront crct10dif_pclmul crct10dif_common crc32c_intel
[  300.636167] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.16.0-4-amd64 #1 Debian 3.16.39-1+deb8u2
[  300.636178]   81514c81  0009
[  300.636188]  81068867 88003f80ca00 88003f9eca00 0100
[  300.636199]  81038a30 000f 81038b62 81a66e00
[  300.636211] Call Trace:
[  300.636216][] ? dump_stack+0x5d/0x78
[  300.636242]  [] ? warn_slowpath_common+0x77/0x90
[  300.636250]  [] ? mce_cpu_restart+0x40/0x40
[  300.636257]  [] ? mce_timer_fn+0x132/0x140
[  300.636267]  [] ? call_timer_fn+0x31/0x140
[  300.636274]  [] ? mce_cpu_restart+0x40/0x40
[  300.636284]  [] ? run_timer_softirq+0x1e9/0x2f0
[  300.636292]  [] ? __do_softirq+0xf1/0x2d0
[  300.636299]  [] ? irq_exit+0x95/0xa0
[  300.636309]  [] ? xen_evtchn_do_upcall+0x35/0x50
[  300.636319]  [] ? xen_do_hypervisor_callback+0x1e/0x30
[  300.636324][] ? xen_hypercall_sched_op+0xc/0x20
[  300.636339]  [] ? xen_hypercall_sched_op+0xc/0x20
[  300.636349]  [] ? xen_safe_halt+0xc/0x20
[  300.636360]  [] ? default_idle+0x19/0xd0
[  300.636370]  [] ? cpu_startup_entry+0x374/0x470
[  300.636384]  [] ? start_kernel+0x497/0x4a2
[  300.636392]  [] ? set_init_arg+0x4e/0x4e
[  300.636400]  [] ? xen_start_kernel+0x569/0x573
[  300.636413] ---[ end trace 7131ef713ca84161 ]---

Then, the same BUG as before. It always happens after 300 seconds.
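
To check the timing across several boots, a small script can pull the
timestamp of the first WARNING or BUG out of a saved console log. This
is a rough Python sketch; the function names are invented here, and it
assumes every log line starts with the usual "[ seconds.micros]" prefix.

```python
import re

# Matches the "[  300.636102] message" prefix used by kernel log lines.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def parse_kernel_line(line):
    """Return (seconds_since_boot, message), or None if there is no timestamp."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


def first_event_time(lines, keyword):
    """Return the timestamp of the first line whose message contains `keyword`."""
    for line in lines:
        parsed = parse_kernel_line(line)
        if parsed is not None and keyword in parsed[1]:
            return parsed[0]
    return None
```

Calling first_event_time(open("console.log"), "WARNING") on a saved log
(the file name is only an example) returns the time in seconds, which
can be compared against the 300-second mark.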

Vincent




Bug#860236: xen pv domU crash with 3.16 kernel and xen 4.8

2017-04-13 Thread Ben Hutchings
Control: tag -1 moreinfo

On Thu, 2017-04-13 at 11:18 +0200, Vincent Legout wrote:
> Package: src:linux
> Version: 3.16.39-1+deb8u2
> Severity: normal
> 
> Hi,
> 
> A xen jessie domU crashes around 5 minutes after the boot with the
> attached backtrace (at every boot). dom0 is also a Debian jessie running
> Xen 4.8.
> 
> It only happens when the guest is in pv mode, it works fine with pvhvm.
> 
> It also crashes with older 3.16 kernels and 4.0.2-1, but not with
> 4.2.1-1 (last 2 kernels from snapshot.debian.org).
> 
> # uname -a
> 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1+deb8u2 (2017-03-07) x86_64 GNU/Linux

From the crash log:

> [  300.632389] CPU: 0 PID: 0 Comm: swapper/0 Tainted: GW 3.16.0-4-amd64 #1 Debian 3.16.39-1+deb8u2

This indicates there was an earlier WARNING message; what was that?

Ben.

-- 
Ben Hutchings
Any sufficiently advanced bug is indistinguishable from a feature.





Processed: Re: Bug#860236: xen pv domU crash with 3.16 kernel and xen 4.8

2017-04-13 Thread Debian Bug Tracking System
Processing control commands:

> tag -1 moreinfo
Bug #860236 [src:linux] xen pv domU crash with 3.16 kernel and xen 4.8
Added tag(s) moreinfo.

-- 
860236: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=860236
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems



Bug#860236: xen pv domU crash with 3.16 kernel and xen 4.8

2017-04-13 Thread Vincent Legout
Package: src:linux
Version: 3.16.39-1+deb8u2
Severity: normal

Hi,

A xen jessie domU crashes around 5 minutes after the boot with the
attached backtrace (at every boot). dom0 is also a Debian jessie running
Xen 4.8.

It only happens when the guest is in pv mode, it works fine with pvhvm.

It also crashes with older 3.16 kernels and 4.0.2-1, but not with
4.2.1-1 (last 2 kernels from snapshot.debian.org).

# uname -a
3.16.0-4-amd64 #1 SMP Debian 3.16.39-1+deb8u2 (2017-03-07) x86_64 GNU/Linux

Vincent
[  300.632313] kernel BUG at /build/linux-GSgHvp/linux-3.16.39/kernel/timer.c:946!
[  300.632320] invalid opcode:  [#1] SMP
[  300.632326] Modules linked in: fuse btrfs xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs libcrc32c crc32c_generic dm_mod x86_pkg_temp_thermal thermal_sys intel_rapl coretemp crc32_pclmul evdev aesni_intel aes_x86_64 lrw gf128mul glue_helper pcspkr ablk_helper cryptd autofs4 ext4 crc16 mbcache jbd2 crct10dif_pclmul crct10dif_common xen_netfront xen_blkfront crc32c_intel
[  300.632389] CPU: 0 PID: 0 Comm: swapper/0 Tainted: GW 3.16.0-4-amd64 #1 Debian 3.16.39-1+deb8u2
[  300.632396] task: 8181a460 ti: 8180 task.ti: 8180
[  300.632403] RIP: e030:[]  [] add_timer_on+0xea/0x100
[  300.632415] RSP: e02b:88003f603e78  EFLAGS: 00010282
[  300.632422] RAX:  RBX: 81a66e00 RCX: 0001000125c4
[  300.632428] RDX: 88003f60 RSI:  RDI: 88003f60ca00
[  300.632434] RBP: 88003f60ca00 R08: 0001009f R09: 88003f603de0
[  300.632441] R10: 88003f603de4 R11: dfbfefff R12: 81a66e00
[  300.632448] R13: 81038a30 R14:  R15: 
[  300.632462] FS:  () GS:88003f60() knlGS:88003f60
[  300.632469] CS:  e033 DS:  ES:  CR0: 80050033
[  300.632476] CR2: 01bf6808 CR3: 0008 CR4: 00042660
[  300.632483] Stack:
[  300.632487]  81a66e00 88003f7eca00 0100 81038a30
[  300.632499]  000f  81073ea1 81a66e00
[  300.632509]  88003f7eca00 0001 81038a30 000f
[  300.632521] Call Trace:
[  300.632525]  
[  300.632530]  [] ? mce_cpu_restart+0x40/0x40
[  300.632543]  [] ? call_timer_fn+0x31/0x140
[  300.632553]  [] ? mce_cpu_restart+0x40/0x40
[  300.632563]  [] ? run_timer_softirq+0x1e9/0x2f0
[  300.632570]  [] ? __do_softirq+0xf1/0x2d0
[  300.632577]  [] ? irq_exit+0x95/0xa0
[  300.632584]  [] ? xen_evtchn_do_upcall+0x35/0x50
[  300.632595]  [] ? xen_do_hypervisor_callback+0x1e/0x30
[  300.632600]  
[  300.632603]  [] ? xen_hypercall_sched_op+0xc/0x20
[  300.632614]  [] ? xen_hypercall_sched_op+0xc/0x20
[  300.632623]  [] ? xen_safe_halt+0xc/0x20
[  300.632631]  [] ? default_idle+0x19/0xd0
[  300.632640]  [] ? cpu_startup_entry+0x374/0x470
[  300.632650]  [] ? start_kernel+0x497/0x4a2
[  300.632657]  [] ? set_init_arg+0x4e/0x4e
[  300.632665]  [] ? xen_start_kernel+0x569/0x573
[  300.632674] Code: a6 85 00 48 85 db 74 21 48 8b 03 66 0f 1f 44 00 00 48 8b 7b 08 48 83 c3 10 4c 89 ea 48 89 ee ff d0 48 8b 03 48 85 c0 75 e8 eb 87 <0f> 0b 48 8b 74 24 30 e8 3a fe ff ff e9 3e ff ff ff 0f 1f 44 00
[  300.632756] RIP  [] add_timer_on+0xea/0x100
[  300.632766]  RSP
[  300.632779] ---[ end trace 77fe5db1be9d3b29 ]---
[  300.632790] Kernel panic - not syncing: Fatal exception in interrupt
[  300.632803] Kernel Offset: 0x0 from 0x8100 (relocation range: 0x8000-0x9fff)

