Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 08/20/2011 04:16 PM, Brad Campbell wrote:
> Author: Alexander Duyck alexander.h.du...@intel.com
> Date: Thu Jul 1 13:28:27 2010 +0000
>
>     x86: Drop CONFIG_MCORE2 check around setting of NET_IP_ALIGN
>
>     This patch removes the CONFIG_MCORE2 check from around NET_IP_ALIGN. It is based on a suggestion from Andi Kleen. The assumption is that there are not any x86 cores where unaligned access is really slow, and this change would allow for a performance improvement to still exist on configurations that are not necessarily optimized for Core 2.
>
>     Cc: Andi Kleen a...@linux.intel.com
>     Cc: Thomas Gleixner t...@linutronix.de
>     Cc: Ingo Molnar mi...@redhat.com
>     Cc: H. Peter Anvin h...@zytor.com
>     Cc: x...@kernel.org
>     Signed-off-by: Alexander Duyck alexander.h.du...@intel.com
>     Signed-off-by: Jeff Kirsher jeffrey.t.kirs...@intel.com
>     Acked-by: H. Peter Anvin h...@zytor.com
>     Signed-off-by: David S. Miller da...@davemloft.net
>
> :040000 040000 5a15867789080a2f67a74b17c4422f85b7a9fb4a b98769348bd765731ca3ff03b33764257e23226c M arch
>
> I can confirm this bug exists in the 3.0 kernel, however I'm unable to reproduce it on today's git.
>
> So anyone using netfilter, kvm and bridge on kernels between 2.6.36-rc1 and 3.0 may hit this bug, but it looks like it is fixed in the current 3.1-rc kernels.

Thanks for this effort. I don't think this patch is buggy in itself; it merely exposed another bug which was fixed later on.

--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On Monday 22 August 2011 at 09:36 +0300, Avi Kivity wrote:
> On 08/20/2011 04:16 PM, Brad Campbell wrote:
>> Author: Alexander Duyck alexander.h.du...@intel.com
>> Date: Thu Jul 1 13:28:27 2010 +0000
>>
>>     x86: Drop CONFIG_MCORE2 check around setting of NET_IP_ALIGN
>>
>>     This patch removes the CONFIG_MCORE2 check from around NET_IP_ALIGN. It is based on a suggestion from Andi Kleen. The assumption is that there are not any x86 cores where unaligned access is really slow, and this change would allow for a performance improvement to still exist on configurations that are not necessarily optimized for Core 2.
>>
>>     Cc: Andi Kleen a...@linux.intel.com
>>     Cc: Thomas Gleixner t...@linutronix.de
>>     Cc: Ingo Molnar mi...@redhat.com
>>     Cc: H. Peter Anvin h...@zytor.com
>>     Cc: x...@kernel.org
>>     Signed-off-by: Alexander Duyck alexander.h.du...@intel.com
>>     Signed-off-by: Jeff Kirsher jeffrey.t.kirs...@intel.com
>>     Acked-by: H. Peter Anvin h...@zytor.com
>>     Signed-off-by: David S. Miller da...@davemloft.net
>>
>> :040000 040000 5a15867789080a2f67a74b17c4422f85b7a9fb4a b98769348bd765731ca3ff03b33764257e23226c M arch
>>
>> I can confirm this bug exists in the 3.0 kernel, however I'm unable to reproduce it on today's git.
>>
>> So anyone using netfilter, kvm and bridge on kernels between 2.6.36-rc1 and 3.0 may hit this bug, but it looks like it is fixed in the current 3.1-rc kernels.
>
> Thanks for this effort. I don't think this patch is buggy in itself; it merely exposed another bug which was fixed later on.

Some piece of hardware has a 2-byte offset requirement, and a driver incorrectly assumed NET_IP_ALIGN was 2 on x86.

Brad, could you post your config (lsmod, dmesg) again?

The tg3.c code, for example, uses a private value, not related to NET_IP_ALIGN:

#define TG3_RAW_IP_ALIGN 2
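To make the alignment point above concrete, here is a small sketch of the arithmetic behind a 2-byte receive offset. It assumes a plain 14-byte Ethernet header with no VLAN tag; the helper names `ip_header_offset` and `is_aligned` are made up for illustration and are not kernel functions.

```python
ETH_HLEN = 14  # length of an untagged Ethernet header in bytes


def ip_header_offset(net_ip_align):
    """Offset of the IP header from the start of the receive buffer
    when the driver reserves `net_ip_align` bytes before the frame."""
    return net_ip_align + ETH_HLEN


def is_aligned(offset, boundary=4):
    """True if `offset` falls on a multiple of `boundary` bytes."""
    return offset % boundary == 0


# With a 2-byte reserve the IP header lands on a 4-byte boundary;
# with no reserve it ends up 2 bytes off.
for align in (2, 0):
    off = ip_header_offset(align)
    print(f"NET_IP_ALIGN={align}: IP header at offset {off}, "
          f"4-byte aligned: {is_aligned(off)}")
```

A driver or device that hard-codes the 2-byte shift while the rest of the stack reserves a different amount would disagree about where the buffer starts, which is the kind of mismatch suggested above.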
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 07/06/11 21:37, Eric Dumazet wrote:
> On Tuesday 07 June 2011 at 21:27 +0800, Brad Campbell wrote:
>> On 07/06/11 04:22, Eric Dumazet wrote:
>>> Could you please try latest linux-2.6 tree ?
>>>
>>> We fixed many networking bugs that could explain your crash.
>>
>> No good I'm afraid.
>>
>> [ 543.040056] =============================================================================
>> [ 543.040136] BUG ip_dst_cache: Padding overwritten. 0x8803e4217ffe-0x8803e4217fff
>> [ 543.040194] -----------------------------------------------------------------------------
>
> That's pretty strange: these are the last two bytes of a page, set to 0x0000 (a 16-bit value).
>
> There is no way a dst field could actually sit at this location (it's padding), since a dst is a bit less than 256 bytes (0xe8), and each entry is aligned on a 64-byte address.
>
> grep dst /proc/slabinfo
> ip_dst_cache 32823 62944 256 32 2 : tunables 0 0 0 : slabdata 1967 1967 0
>
> sizeof(struct rtable)=0xe8
>
>> [ 543.040198]
>> [ 543.040298] INFO: Slab 0xea000d9e74d0 objects=25 used=25 fp=0x (null) flags=0x80004081
>> [ 543.040364] Pid: 4576, comm: kworker/1:2 Not tainted 3.0.0-rc2 #1
>> [ 543.040415] Call Trace:
>> [ 543.040472] [810b9c1d] ? slab_err+0xad/0xd0
>> [ 543.040528] [8102e034] ? check_preempt_wakeup+0xa4/0x160
>> [ 543.040595] [810ba206] ? slab_pad_check+0x126/0x170
>> [ 543.040650] [8133045b] ? dst_destroy+0x8b/0x110
>> [ 543.040701] [810ba29a] ? check_slab+0x4a/0xc0
>> [ 543.040753] [810baf2d] ? free_debug_processing+0x2d/0x250
>> [ 543.040808] [810bb27b] ? __slab_free+0x12b/0x140
>> [ 543.040862] [810bbe99] ? kmem_cache_free+0x99/0xa0
>> [ 543.040915] [8133045b] ? dst_destroy+0x8b/0x110
>> [ 543.040967] [813307f6] ? dst_gc_task+0x196/0x1f0
>> [ 543.041021] [8104e954] ? queue_delayed_work_on+0x154/0x160
>> [ 543.041081] [813066fe] ? do_dbs_timer+0x20e/0x3d0
>> [ 543.041133] [81330660] ? dst_alloc+0x180/0x180
>> [ 543.041187] [8104f28b] ? process_one_work+0xfb/0x3b0
>> [ 543.041242] [8104f964] ? worker_thread+0x144/0x3d0
>> [ 543.041296] [8102cc10] ? __wake_up_common+0x50/0x80
>> [ 543.041678] [8104f820] ? rescuer_thread+0x2e0/0x2e0
>> [ 543.041729] [8104f820] ? rescuer_thread+0x2e0/0x2e0
>> [ 543.041782] [81053436] ? kthread+0x96/0xa0
>> [ 543.041835] [813e1d14] ? kernel_thread_helper+0x4/0x10
>> [ 543.041890] [810533a0] ? kthread_worker_fn+0x120/0x120
>> [ 543.041944] [813e1d10] ? gs_change+0xb/0xb
>> [ 543.041993] Padding 0x8803e4217f40: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
>> [ 543.042718] Padding 0x8803e4217f50: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
>> [ 543.043433] Padding 0x8803e4217f60: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
>> [ 543.044155] Padding 0x8803e4217f70: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
>> [ 543.044866] Padding 0x8803e4217f80: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
>> [ 543.045590] Padding 0x8803e4217f90: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
>> [ 543.046311] Padding 0x8803e4217fa0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
>> [ 543.047034] Padding 0x8803e4217fb0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
>> [ 543.047755] Padding 0x8803e4217fc0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
>> [ 543.048474] Padding 0x8803e4217fd0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
>> [ 543.049203] Padding 0x8803e4217fe0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
>> [ 543.049909] Padding 0x8803e4217ff0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 00 00 ZZ..
>> [ 543.050021] FIX ip_dst_cache: Restoring 0x8803e4217f40-0x8803e4217fff=0x5a
>> [ 543.050021]
>>
>> Dropped -mm, Hugh and Andrea from CC as this does not appear to be mm or ksm related.
>>
>> I'll pare down the firewall and see if I can make it break easier with a smaller test set.
>
> Hmm, not sure now :(
>
> Could you reproduce another bug please ?

I know this is an old one, but I recently purchased a second system to allow me to test and bisect this off-line (the live system is too much of a headache to bisect on).

brad@test:/raid10/src/linux-2.6$ git bisect log
git bisect start
# good: [9fe6206f400646a2322096b56c59891d530e8d51] Linux 2.6.35
git bisect good 9fe6206f400646a2322096b56c59891d530e8d51
# bad: [da5cabf80e2433131bf0ed8993abc0f7ea618c73] Linux 2.6.36-rc1
git bisect bad da5cabf80e2433131bf0ed8993abc0f7ea618c73
# bad: [0f477dd0851bdcee82923da66a7fc4a44cb1bc3d] Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
git bisect bad 0f477dd0851bdcee82923da66a7fc4a44cb1bc3d
# bad:
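The bisect log above is a binary search between a known-good and a known-bad kernel. As a rough sketch of why that stays tractable even when every test means a build, a boot, and a crash, the following models the search over a simplified linear history. This is an illustration only: real `git bisect` works on a commit graph, and `bisect_first_bad` and the integer "commits" here are hypothetical stand-ins.

```python
def bisect_first_bad(commits, is_bad):
    """Find the first bad commit in a linear history.

    `commits` is ordered oldest to newest; commits[0] is assumed to be
    known good and commits[-1] known bad. `is_bad` stands in for one
    build-and-test cycle. Returns (first_bad_commit, tests_run).
    """
    good, bad = 0, len(commits) - 1
    tests_run = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests_run += 1
        if is_bad(commits[mid]):
            bad = mid   # the regression is at or before mid
        else:
            good = mid  # the regression is after mid
    return commits[bad], tests_run


# Hypothetical history of 1000 commits where commit 637 introduced the bug.
history = list(range(1000))
culprit, tests = bisect_first_bad(history, lambda c: c >= 637)
print(f"first bad commit: {culprit}, found after {tests} tests")
```

The number of test cycles grows with the logarithm of the range, so a range of about a thousand commits needs roughly ten boots rather than hundreds.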
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 06/10/2011 05:52 AM, Simon Horman wrote:
> At one point I would have needed an 8000km long wire to the reset switch :-)

Even more off-topic, there has been a case when a 200,000,000 km long wire to the reset button was needed. IIRC they got away with a watchdog.

--
error compiling committee.c: too many arguments to function
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 11-06-09 10:52 PM, Simon Horman wrote:
> On Thu, Jun 09, 2011 at 01:02:13AM +0800, Brad Campbell wrote:
>> On 08/06/11 11:59, Eric Dumazet wrote:
>>> Well, a bisection definitely should help, but needs a lot of time in your case.
>>
>> Yes. compile, test, crash, walk out to the other building to press reset, lather, rinse, repeat.
>>
>> I need a reset button on the end of a 50M wire, or a hardware watchdog!

Something many of us don't realize is that nearly all Intel chipsets have a built-in hardware watchdog timer. This includes chipsets for consumer desktop boards as well as the big iron server stuff.

It's the i8xx_tco driver in the kernel that enables use of them:

modprobe i8xx_tco

Cheers
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On Fri, 10 Jun 2011, Mark Lord wrote:
> Something many of us don't realize is that nearly all Intel chipsets have a built-in hardware watchdog timer. This includes chipsets for consumer desktop boards as well as the big iron server stuff.
>
> It's the i8xx_tco driver in the kernel that enables use of them:

That's the old module name, but yes, it is very useful in desktops and laptops (when it works). Server-class hardware will have a baseboard management unit that can really power-cycle the system instead of just rebooting.

And test it first before you depend on it triggering at a remote location, as the firmware might cause the Intel chipset watchdog to actually hang the box instead of causing a proper reboot (happens on the IBM thinkpad T43, for example).

--
One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie. -- The Silicon Valley Tarot
Henrique Holschuh
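For anyone wanting to try the watchdog suggestion, here is a minimal sketch of a userspace keepalive loop. It assumes the loaded driver exposes the standard Linux watchdog character device (commonly `/dev/watchdog`), where any write restarts the countdown and, on drivers that support "magic close", writing `V` just before closing disarms the timer. The function name `pet_watchdog` and its parameters are made up for illustration.

```python
import time


def pet_watchdog(device_path, interval_s, rounds):
    """Write a keepalive byte to `device_path` `rounds` times, sleeping
    `interval_s` seconds between writes, then attempt a magic close.

    Opening the device arms the timer. If this loop stops running
    (for example because the kernel has hung), the hardware is expected
    to reset the machine once its timeout expires.
    """
    with open(device_path, "wb", buffering=0) as wd:
        for _ in range(rounds):
            wd.write(b"\0")   # any write counts as a keepalive
            time.sleep(interval_s)
        wd.write(b"V")        # magic close: ask the driver to disarm
    return rounds


# Example (needs root and a loaded watchdog driver, so left commented out):
# pet_watchdog("/dev/watchdog", interval_s=10, rounds=6)
```

As noted above, it is worth confirming on the bench that an expired timer really reboots the box before relying on it remotely. For a bisection run one would typically keep feeding the device indefinitely rather than for a fixed number of rounds.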
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On Thu, Jun 09, 2011 at 01:02:13AM +0800, Brad Campbell wrote:
> On 08/06/11 11:59, Eric Dumazet wrote:
>> Well, a bisection definitely should help, but needs a lot of time in your case.
>
> Yes. compile, test, crash, walk out to the other building to press reset, lather, rinse, repeat.
>
> I need a reset button on the end of a 50M wire, or a hardware watchdog!

Not strictly on-topic, but in situations where I have machines that either don't have lights-out facilities or have broken ones, I find network-controlled power switches very useful.

At one point I would have needed an 8000km long wire to the reset switch :-)
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 08/06/11 11:59, Eric Dumazet wrote:
> Well, a bisection definitely should help, but needs a lot of time in your case.

Yes. compile, test, crash, walk out to the other building to press reset, lather, rinse, repeat.

I need a reset button on the end of a 50M wire, or a hardware watchdog!

Actually it's not so bad. If I turn off slub debugging the kernel panics and reboots itself.

This.. :

[2.913034] netconsole: remote ethernet address 00:16:cb:a7:dd:d1
[2.913066] netconsole: device eth0 not up yet, forcing it
[3.660062] Refined TSC clocksource calibration: 3213.422 MHz.
[3.660118] Switching to clocksource tsc
[ 63.200273] r8169 0000:03:00.0: eth0: unable to load firmware patch rtl_nic/rtl8168e-1.fw (-2)
[ 63.223513] r8169 0000:03:00.0: eth0: link down
[ 63.223556] r8169 0000:03:00.0: eth0: link down

..is slowing down reboots considerably. 3.0-rc does _not_ like some timing hardware in my machine. Having said that, at least it does not randomly panic on SCSI like 2.6.39 does.

>> Ok, I've ruled out TCPMSS. Found out where it was being set and neutered it. I've replicated it with only the single DNAT rule.
>
> Could you try following patch, because this is the 'usual suspect' I had yesterday :
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 46cbd28..9f548f9 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -792,6 +792,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  		fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta;
>  	}
>  
> +#if 0
>  	if (fastpath &&
>  	    size + sizeof(struct skb_shared_info) <= ksize(skb->head)) {
>  		memmove(skb->head + size, skb_shinfo(skb),
> @@ -802,7 +803,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  		off = nhead;
>  		goto adjust_others;
>  	}
> -
> +#endif
>  	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
>  	if (!data)
>  		goto nodata;

Nope.. that's not it. sigh

That might have changed the characteristic of the fault slightly, but unfortunately I got caught with a couple of fsck's, so I only got to test it 3 times tonight.

It's unfortunate that this is a production system, so I can only take it down between about 9pm and 1am. That would normally be pretty productive, except that an fsck of a 14TB ext4 can take 30 minutes if it panics at the wrong time.

I'm out of time tonight, but I'll have a crack at some bisection tomorrow night. Now I just have to go back far enough that it works, and be near enough not to have to futz around with /proc /sys or drivers.

I really, really, really appreciate you guys helping me with this. It has been driving me absolutely bonkers. If I'm ever in the same town as any of you, dinner and drinks are on me.
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On Thursday 09 June 2011 at 01:02 +0800, Brad Campbell wrote:
> On 08/06/11 11:59, Eric Dumazet wrote:
>> Well, a bisection definitely should help, but needs a lot of time in your case.
>
> Yes. compile, test, crash, walk out to the other building to press reset, lather, rinse, repeat.
>
> I need a reset button on the end of a 50M wire, or a hardware watchdog!
>
> Actually it's not so bad. If I turn off slub debugging the kernel panics and reboots itself.
>
> This.. :
>
> [2.913034] netconsole: remote ethernet address 00:16:cb:a7:dd:d1
> [2.913066] netconsole: device eth0 not up yet, forcing it
> [3.660062] Refined TSC clocksource calibration: 3213.422 MHz.
> [3.660118] Switching to clocksource tsc
> [ 63.200273] r8169 0000:03:00.0: eth0: unable to load firmware patch rtl_nic/rtl8168e-1.fw (-2)
> [ 63.223513] r8169 0000:03:00.0: eth0: link down
> [ 63.223556] r8169 0000:03:00.0: eth0: link down
>
> ..is slowing down reboots considerably. 3.0-rc does _not_ like some timing hardware in my machine. Having said that, at least it does not randomly panic on SCSI like 2.6.39 does.
>
>>> Ok, I've ruled out TCPMSS. Found out where it was being set and neutered it. I've replicated it with only the single DNAT rule.
>>
>> Could you try following patch, because this is the 'usual suspect' I had yesterday :
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 46cbd28..9f548f9 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -792,6 +792,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>  		fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta;
>>  	}
>>  
>> +#if 0
>>  	if (fastpath &&
>>  	    size + sizeof(struct skb_shared_info) <= ksize(skb->head)) {
>>  		memmove(skb->head + size, skb_shinfo(skb),
>> @@ -802,7 +803,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>  		off = nhead;
>>  		goto adjust_others;
>>  	}
>> -
>> +#endif
>>  	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
>>  	if (!data)
>>  		goto nodata;
>
> Nope.. that's not it. sigh
>
> That might have changed the characteristic of the fault slightly, but unfortunately I got caught with a couple of fsck's, so I only got to test it 3 times tonight.
>
> It's unfortunate that this is a production system, so I can only take it down between about 9pm and 1am. That would normally be pretty productive, except that an fsck of a 14TB ext4 can take 30 minutes if it panics at the wrong time.
>
> I'm out of time tonight, but I'll have a crack at some bisection tomorrow night. Now I just have to go back far enough that it works, and be near enough not to have to futz around with /proc /sys or drivers.
>
> I really, really, really appreciate you guys helping me with this. It has been driving me absolutely bonkers. If I'm ever in the same town as any of you, dinner and drinks are on me.

Hmm, I wonder if kmemcheck could help you, but it's slow as hell, so not appropriate for production :(
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 07/06/11 04:22, Eric Dumazet wrote:
> Could you please try latest linux-2.6 tree ?
>
> We fixed many networking bugs that could explain your crash.

No good I'm afraid.

[ 543.040056] =============================================================================
[ 543.040136] BUG ip_dst_cache: Padding overwritten. 0x8803e4217ffe-0x8803e4217fff
[ 543.040194] -----------------------------------------------------------------------------
[ 543.040198]
[ 543.040298] INFO: Slab 0xea000d9e74d0 objects=25 used=25 fp=0x (null) flags=0x80004081
[ 543.040364] Pid: 4576, comm: kworker/1:2 Not tainted 3.0.0-rc2 #1
[ 543.040415] Call Trace:
[ 543.040472] [810b9c1d] ? slab_err+0xad/0xd0
[ 543.040528] [8102e034] ? check_preempt_wakeup+0xa4/0x160
[ 543.040595] [810ba206] ? slab_pad_check+0x126/0x170
[ 543.040650] [8133045b] ? dst_destroy+0x8b/0x110
[ 543.040701] [810ba29a] ? check_slab+0x4a/0xc0
[ 543.040753] [810baf2d] ? free_debug_processing+0x2d/0x250
[ 543.040808] [810bb27b] ? __slab_free+0x12b/0x140
[ 543.040862] [810bbe99] ? kmem_cache_free+0x99/0xa0
[ 543.040915] [8133045b] ? dst_destroy+0x8b/0x110
[ 543.040967] [813307f6] ? dst_gc_task+0x196/0x1f0
[ 543.041021] [8104e954] ? queue_delayed_work_on+0x154/0x160
[ 543.041081] [813066fe] ? do_dbs_timer+0x20e/0x3d0
[ 543.041133] [81330660] ? dst_alloc+0x180/0x180
[ 543.041187] [8104f28b] ? process_one_work+0xfb/0x3b0
[ 543.041242] [8104f964] ? worker_thread+0x144/0x3d0
[ 543.041296] [8102cc10] ? __wake_up_common+0x50/0x80
[ 543.041678] [8104f820] ? rescuer_thread+0x2e0/0x2e0
[ 543.041729] [8104f820] ? rescuer_thread+0x2e0/0x2e0
[ 543.041782] [81053436] ? kthread+0x96/0xa0
[ 543.041835] [813e1d14] ? kernel_thread_helper+0x4/0x10
[ 543.041890] [810533a0] ? kthread_worker_fn+0x120/0x120
[ 543.041944] [813e1d10] ? gs_change+0xb/0xb
[ 543.041993] Padding 0x8803e4217f40: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[ 543.042718] Padding 0x8803e4217f50: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[ 543.043433] Padding 0x8803e4217f60: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[ 543.044155] Padding 0x8803e4217f70: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[ 543.044866] Padding 0x8803e4217f80: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[ 543.045590] Padding 0x8803e4217f90: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[ 543.046311] Padding 0x8803e4217fa0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[ 543.047034] Padding 0x8803e4217fb0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[ 543.047755] Padding 0x8803e4217fc0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[ 543.048474] Padding 0x8803e4217fd0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[ 543.049203] Padding 0x8803e4217fe0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[ 543.049909] Padding 0x8803e4217ff0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 00 00 ZZ..
[ 543.050021] FIX ip_dst_cache: Restoring 0x8803e4217f40-0x8803e4217fff=0x5a
[ 543.050021]

Dropped -mm, Hugh and Andrea from CC as this does not appear to be mm or ksm related.

I'll pare down the firewall and see if I can make it break easier with a smaller test set.
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 07.06.2011 05:33, Brad Campbell wrote:
> On 07/06/11 04:10, Bart De Schuymer wrote:
>> Hi Brad,
>>
>> This has probably nothing to do with ebtables, so please rmmod in case it's loaded.
>> A few questions I didn't directly see an answer to in the threads I scanned...
>> I'm assuming you actually use the bridging firewall functionality. So, what iptables modules do you use? Can you reduce your iptables rules to a core that triggers the bug?
>> Or does it get triggered even with an empty set of firewall rules?
>> Are you using a stock .35 kernel or is it patched?
>> Is this something I can trigger on a poor guy's laptop or does it require specialized hardware (I'm catching up on qemu/kvm...)?
>
> Not specialised hardware as such, I've just not been able to reproduce it outside of this specific operating scenario.

The last similar problem we've had was related to the 32/64 bit compat code. Are you running 32 bit userspace on a 64 bit kernel?

> I can't trigger it with empty firewall rules as it relies on a DNAT to occur. If I try it directly to the internal IP address (as I have to without netfilter loaded) then of course nothing fails.
>
> It's a pain in the bum as a fault, but it's one I can easily reproduce as long as I use the same set of circumstances.
>
> I'll try using 3.0-rc2 (current git) tonight, and if I can reproduce it on that then I'll attempt to pare down the IPTABLES rules to a bare minimum.
>
> It is nothing to do with ebtables as I don't compile it. I'm not really sure about bridging firewall functionality. I just use a couple of hand coded bash scripts to set the tables up.

From one of your previous mails:

> # CONFIG_BRIDGE_NF_EBTABLES is not set

How about CONFIG_BRIDGE_NETFILTER?
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On Tuesday 07 June 2011 at 21:27 +0800, Brad Campbell wrote:
> On 07/06/11 04:22, Eric Dumazet wrote:
>> Could you please try latest linux-2.6 tree ?
>>
>> We fixed many networking bugs that could explain your crash.
>
> No good I'm afraid.
>
> [ 543.040056] =============================================================================
> [ 543.040136] BUG ip_dst_cache: Padding overwritten. 0x8803e4217ffe-0x8803e4217fff
> [ 543.040194] -----------------------------------------------------------------------------

That's pretty strange: these are the last two bytes of a page, set to 0x0000 (a 16-bit value).

There is no way a dst field could actually sit at this location (it's padding), since a dst is a bit less than 256 bytes (0xe8), and each entry is aligned on a 64-byte address.

grep dst /proc/slabinfo
ip_dst_cache 32823 62944 256 32 2 : tunables 0 0 0 : slabdata 1967 1967 0

sizeof(struct rtable)=0xe8

> [ 543.040198]
> [ 543.040298] INFO: Slab 0xea000d9e74d0 objects=25 used=25 fp=0x (null) flags=0x80004081
> [ 543.040364] Pid: 4576, comm: kworker/1:2 Not tainted 3.0.0-rc2 #1
> [ 543.040415] Call Trace:
> [ 543.040472] [810b9c1d] ? slab_err+0xad/0xd0
> [ 543.040528] [8102e034] ? check_preempt_wakeup+0xa4/0x160
> [ 543.040595] [810ba206] ? slab_pad_check+0x126/0x170
> [ 543.040650] [8133045b] ? dst_destroy+0x8b/0x110
> [ 543.040701] [810ba29a] ? check_slab+0x4a/0xc0
> [ 543.040753] [810baf2d] ? free_debug_processing+0x2d/0x250
> [ 543.040808] [810bb27b] ? __slab_free+0x12b/0x140
> [ 543.040862] [810bbe99] ? kmem_cache_free+0x99/0xa0
> [ 543.040915] [8133045b] ? dst_destroy+0x8b/0x110
> [ 543.040967] [813307f6] ? dst_gc_task+0x196/0x1f0
> [ 543.041021] [8104e954] ? queue_delayed_work_on+0x154/0x160
> [ 543.041081] [813066fe] ? do_dbs_timer+0x20e/0x3d0
> [ 543.041133] [81330660] ? dst_alloc+0x180/0x180
> [ 543.041187] [8104f28b] ? process_one_work+0xfb/0x3b0
> [ 543.041242] [8104f964] ? worker_thread+0x144/0x3d0
> [ 543.041296] [8102cc10] ? __wake_up_common+0x50/0x80
> [ 543.041678] [8104f820] ? rescuer_thread+0x2e0/0x2e0
> [ 543.041729] [8104f820] ? rescuer_thread+0x2e0/0x2e0
> [ 543.041782] [81053436] ? kthread+0x96/0xa0
> [ 543.041835] [813e1d14] ? kernel_thread_helper+0x4/0x10
> [ 543.041890] [810533a0] ? kthread_worker_fn+0x120/0x120
> [ 543.041944] [813e1d10] ? gs_change+0xb/0xb
> [ 543.041993] Padding 0x8803e4217f40: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> [ 543.042718] Padding 0x8803e4217f50: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> [ 543.043433] Padding 0x8803e4217f60: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> [ 543.044155] Padding 0x8803e4217f70: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> [ 543.044866] Padding 0x8803e4217f80: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> [ 543.045590] Padding 0x8803e4217f90: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> [ 543.046311] Padding 0x8803e4217fa0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> [ 543.047034] Padding 0x8803e4217fb0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> [ 543.047755] Padding 0x8803e4217fc0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> [ 543.048474] Padding 0x8803e4217fd0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> [ 543.049203] Padding 0x8803e4217fe0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> [ 543.049909] Padding 0x8803e4217ff0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 00 00 ZZ..
> [ 543.050021] FIX ip_dst_cache: Restoring 0x8803e4217f40-0x8803e4217fff=0x5a
> [ 543.050021]
>
> Dropped -mm, Hugh and Andrea from CC as this does not appear to be mm or ksm related.
>
> I'll pare down the firewall and see if I can make it break easier with a smaller test set.

Hmm, not sure now :(

Could you reproduce another bug please ?
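The observation about where the corrupted bytes sit can be checked with a little arithmetic on the numbers in the report. The sketch below assumes 4 KiB pages and a two-page slab (the "2" in the slabinfo line above, which comes from a different machine, so it may not match the reporter's debug-enabled kernel). The addresses are used as they appear in the archived log; only their low bits matter here.

```python
PAGE_SIZE = 4096        # assumption: x86-64 with 4 KiB pages
PAGES_PER_SLAB = 2      # assumption: taken from the slabinfo line quoted above

# Values copied from the "BUG", "INFO" and "FIX" lines of the report.
pad_start = 0x8803e4217f40
pad_end = 0x8803e4217fff
bad_start = 0x8803e4217ffe
bad_end = 0x8803e4217fff
objects_per_slab = 25


def page_offset(addr):
    """Offset of `addr` within its page."""
    return addr % PAGE_SIZE


padding_bytes = pad_end - pad_start + 1
slab_bytes = PAGES_PER_SLAB * PAGE_SIZE
object_stride = (slab_bytes - padding_bytes) // objects_per_slab

print(f"corrupted bytes at page offsets {page_offset(bad_start):#x}"
      f"-{page_offset(bad_end):#x}")
print(f"padding at end of slab: {padding_bytes} bytes")
print(f"per-object stride under these assumptions: {object_stride} bytes")
```

Under these assumptions the 25 objects account for everything except the trailing padding, and the two zeroed bytes are the very last bytes of the page, outside every object. That is consistent with the remark that no dst field can live there, and points towards something writing 16 bits just before a page boundary rather than a stray write through a dst pointer.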
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 07/06/11 21:30, Patrick McHardy wrote:
> On 07.06.2011 05:33, Brad Campbell wrote:
>> On 07/06/11 04:10, Bart De Schuymer wrote:
>>> Hi Brad,
>>>
>>> This has probably nothing to do with ebtables, so please rmmod in case it's loaded.
>>> A few questions I didn't directly see an answer to in the threads I scanned...
>>> I'm assuming you actually use the bridging firewall functionality. So, what iptables modules do you use? Can you reduce your iptables rules to a core that triggers the bug?
>>> Or does it get triggered even with an empty set of firewall rules?
>>> Are you using a stock .35 kernel or is it patched?
>>> Is this something I can trigger on a poor guy's laptop or does it require specialized hardware (I'm catching up on qemu/kvm...)?
>>
>> Not specialised hardware as such, I've just not been able to reproduce it outside of this specific operating scenario.
>
> The last similar problem we've had was related to the 32/64 bit compat code. Are you running 32 bit userspace on a 64 bit kernel?

No, 32 bit Guest OS, but a completely 64 bit userspace on a 64 bit kernel.

Userspace is current Debian Stable. Kernel is Vanilla and qemu-kvm is current git.

>> I can't trigger it with empty firewall rules as it relies on a DNAT to occur. If I try it directly to the internal IP address (as I have to without netfilter loaded) then of course nothing fails.
>>
>> It's a pain in the bum as a fault, but it's one I can easily reproduce as long as I use the same set of circumstances.
>>
>> I'll try using 3.0-rc2 (current git) tonight, and if I can reproduce it on that then I'll attempt to pare down the IPTABLES rules to a bare minimum.
>>
>> It is nothing to do with ebtables as I don't compile it. I'm not really sure about bridging firewall functionality. I just use a couple of hand coded bash scripts to set the tables up.
>
> From one of your previous mails:
>
>> # CONFIG_BRIDGE_NF_EBTABLES is not set
>
> How about CONFIG_BRIDGE_NETFILTER?

It was compiled in.

With the following table set I was able to reproduce the problem on 3.0-rc2. Replaced my IP with xxx.xxx.xxx.xxx, but otherwise unmodified.

root@srv:~# iptables-save
# Generated by iptables-save v1.4.10 on Tue Jun 7 22:11:30 2011
*filter
:INPUT ACCEPT [978:107619]
:FORWARD ACCEPT [142:7068]
:OUTPUT ACCEPT [1659:291870]
-A INPUT -i ppp0 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT ! -i ppp0 -m state --state NEW -j ACCEPT
-A INPUT -i ppp0 -j DROP
COMMIT
# Completed on Tue Jun 7 22:11:30 2011
# Generated by iptables-save v1.4.10 on Tue Jun 7 22:11:30 2011
*nat
:PREROUTING ACCEPT [813:49170]
:INPUT ACCEPT [91:7090]
:OUTPUT ACCEPT [267:20731]
:POSTROUTING ACCEPT [296:22281]
-A PREROUTING -d xxx.xxx.xxx.xxx/32 ! -i ppp0 -p tcp -m tcp --dport 443 -j DNAT --to-destination 192.168.253.198
COMMIT
# Completed on Tue Jun 7 22:11:30 2011
# Generated by iptables-save v1.4.10 on Tue Jun 7 22:11:30 2011
*mangle
:PREROUTING ACCEPT [2729:274392]
:INPUT ACCEPT [2508:262976]
:FORWARD ACCEPT [142:7068]
:OUTPUT ACCEPT [1674:293701]
:POSTROUTING ACCEPT [2131:346411]
-A FORWARD -o ppp0 -p tcp -m tcp --tcp-flags SYN,RST SYN -m tcpmss --mss 1400:1536 -j TCPMSS --clamp-mss-to-pmtu
COMMIT
# Completed on Tue Jun 7 22:11:30 2011

I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access the address the way I was doing it, so that's a no-go for me.
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 07.06.2011 16:40, Brad Campbell wrote:
> On 07/06/11 21:30, Patrick McHardy wrote:
>> On 07.06.2011 05:33, Brad Campbell wrote:
>>> On 07/06/11 04:10, Bart De Schuymer wrote:
>>>> Hi Brad,
>>>>
>>>> This has probably nothing to do with ebtables, so please rmmod in case it's loaded.
>>>> A few questions I didn't directly see an answer to in the threads I scanned...
>>>> I'm assuming you actually use the bridging firewall functionality. So, what iptables modules do you use? Can you reduce your iptables rules to a core that triggers the bug?
>>>> Or does it get triggered even with an empty set of firewall rules?
>>>> Are you using a stock .35 kernel or is it patched?
>>>> Is this something I can trigger on a poor guy's laptop or does it require specialized hardware (I'm catching up on qemu/kvm...)?
>>>
>>> Not specialised hardware as such, I've just not been able to reproduce it outside of this specific operating scenario.
>>
>> The last similar problem we've had was related to the 32/64 bit compat code. Are you running 32 bit userspace on a 64 bit kernel?
>
> No, 32 bit Guest OS, but a completely 64 bit userspace on a 64 bit kernel.
>
> Userspace is current Debian Stable. Kernel is Vanilla and qemu-kvm is current git.
>
>>> I can't trigger it with empty firewall rules as it relies on a DNAT to occur. If I try it directly to the internal IP address (as I have to without netfilter loaded) then of course nothing fails.
>>>
>>> It's a pain in the bum as a fault, but it's one I can easily reproduce as long as I use the same set of circumstances.
>>>
>>> I'll try using 3.0-rc2 (current git) tonight, and if I can reproduce it on that then I'll attempt to pare down the IPTABLES rules to a bare minimum.
>>>
>>> It is nothing to do with ebtables as I don't compile it. I'm not really sure about bridging firewall functionality. I just use a couple of hand coded bash scripts to set the tables up.
>>
>> From one of your previous mails:
>>
>>> # CONFIG_BRIDGE_NF_EBTABLES is not set
>>
>> How about CONFIG_BRIDGE_NETFILTER?
>
> It was compiled in.
>
> With the following table set I was able to reproduce the problem on 3.0-rc2. Replaced my IP with xxx.xxx.xxx.xxx, but otherwise unmodified.

Which kernel was the last version without this problem?

> root@srv:~# iptables-save
> # Generated by iptables-save v1.4.10 on Tue Jun 7 22:11:30 2011
> *filter
> :INPUT ACCEPT [978:107619]
> :FORWARD ACCEPT [142:7068]
> :OUTPUT ACCEPT [1659:291870]
> -A INPUT -i ppp0 -m state --state RELATED,ESTABLISHED -j ACCEPT
> -A INPUT ! -i ppp0 -m state --state NEW -j ACCEPT
> -A INPUT -i ppp0 -j DROP
> COMMIT
> # Completed on Tue Jun 7 22:11:30 2011
> # Generated by iptables-save v1.4.10 on Tue Jun 7 22:11:30 2011
> *nat
> :PREROUTING ACCEPT [813:49170]
> :INPUT ACCEPT [91:7090]
> :OUTPUT ACCEPT [267:20731]
> :POSTROUTING ACCEPT [296:22281]
> -A PREROUTING -d xxx.xxx.xxx.xxx/32 ! -i ppp0 -p tcp -m tcp --dport 443 -j DNAT --to-destination 192.168.253.198
> COMMIT
> # Completed on Tue Jun 7 22:11:30 2011
> # Generated by iptables-save v1.4.10 on Tue Jun 7 22:11:30 2011
> *mangle
> :PREROUTING ACCEPT [2729:274392]
> :INPUT ACCEPT [2508:262976]
> :FORWARD ACCEPT [142:7068]
> :OUTPUT ACCEPT [1674:293701]
> :POSTROUTING ACCEPT [2131:346411]
> -A FORWARD -o ppp0 -p tcp -m tcp --tcp-flags SYN,RST SYN -m tcpmss --mss 1400:1536 -j TCPMSS --clamp-mss-to-pmtu
> COMMIT
> # Completed on Tue Jun 7 22:11:30 2011

The main suspects would be NAT and TCPMSS. Did you also try whether the crash occurs with only one of these rules?

> I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access the address the way I was doing it, so that's a no-go for me.

That's really weird since you're apparently not using any bridge netfilter features. It shouldn't have any effect besides changing at which point ip_tables is invoked. How are your network devices configured (specifically any bridges)?
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 7/06/2011 16:40, Brad Campbell wrote:
> On 07/06/11 21:30, Patrick McHardy wrote:
>> On 07.06.2011 05:33, Brad Campbell wrote:
>>> On 07/06/11 04:10, Bart De Schuymer wrote:
>>>> Hi Brad,
>>>> This has probably nothing to do with ebtables, so please rmmod in case it's loaded. A few questions I didn't directly see an answer to in the threads I scanned... I'm assuming you actually use the bridging firewall functionality. So, what iptables modules do you use? Can you reduce your iptables rules to a core that triggers the bug? Or does it get triggered even with an empty set of firewall rules? Are you using a stock .35 kernel or is it patched? Is this something I can trigger on a poor guy's laptop or does it require specialized hardware (I'm catching up on qemu/kvm...)?
>>> Not specialised hardware as such, I've just not been able to reproduce it outside of this specific operating scenario.
>> The last similar problem we've had was related to the 32/64 bit compat code. Are you running 32 bit userspace on a 64 bit kernel?
> No, 32 bit Guest OS, but a completely 64 bit userspace on a 64 bit kernel. Userspace is current Debian Stable. Kernel is Vanilla and qemu-kvm is current git.

If the bug is easily triggered with your guest OS, then you could try to capture the traffic with wireshark (or something else) in a configuration that doesn't crash your system. Save the traffic in a pcap file. Then you can see if resending that traffic in the vulnerable configuration triggers the bug (I don't know if something in Windows exists, but tcpreplay should work for Linux). Once you have such a capture, chances are the bug is even easily reproducible by us (unless it's hardware-specific). Success isn't guaranteed, but I think it's worth a shot...

cheers,
Bart

-- 
Bart De Schuymer
www.artinalgorithms.be
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On Tuesday 07 June 2011 at 17:35 +0200, Patrick McHardy wrote:
> The main suspects would be NAT and TCPMSS. Did you also try whether the crash occurs with only one of these rules?
>> I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access the address the way I was doing it, so that's a no-go for me.
> That's really weird since you're apparently not using any bridge netfilter features. It shouldn't have any effect besides changing at which point ip_tables is invoked. How are your network devices configured (specifically any bridges)?

Something in the kernel does

u16 *ptr = addr (given by kmalloc())
ptr[-1] = 0;

Could be an off-by-one error in a memmove()/memcpy() or loop...

I can't see a network issue here.

I checked arch/x86/lib/memmove_64.S and it seems fine.
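A minimal userspace sketch of the kind of bug hypothesised above, assuming a backwards loop whose bound is off by one. Everything here (fake_slab, fake_kmalloc, the guard pattern) is hypothetical and only models the idea; a single backing array stands in for the slab so the stray store stays inside defined memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { GUARD_WORDS = 1, OBJ_WORDS = 8 };
#define GUARD_PATTERN 0xCCCCu

/* One backing array stands in for the slab: a guard word, then the object. */
struct fake_slab {
    uint16_t words[GUARD_WORDS + OBJ_WORDS];
};

/* Hands out the object part and arms the guard word in front of it. */
static uint16_t *fake_kmalloc(struct fake_slab *s)
{
    assert(s != NULL);
    s->words[0] = GUARD_PATTERN;
    return &s->words[GUARD_WORDS];
}

/* Correct: clears indices n-1 down to 0. */
static void clear_words(uint16_t *obj, ptrdiff_t n)
{
    for (ptrdiff_t i = n - 1; i >= 0; i--)
        obj[i] = 0;
}

/* Buggy: the bound is off by one, so the last store is obj[-1] = 0. */
static void clear_words_off_by_one(uint16_t *obj, ptrdiff_t n)
{
    for (ptrdiff_t i = n - 1; i >= -1; i--)
        obj[i] = 0;
}

/* Non-zero while the word in front of the object is untouched. */
static int guard_intact(const struct fake_slab *s)
{
    return s->words[0] == GUARD_PATTERN;
}
```

After clear_words() the guard word is still intact; after clear_words_off_by_one() it has been zeroed, which is the two-byte underrun pattern described in the message.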
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 07.06.2011 20:31, Eric Dumazet wrote:
> On Tuesday 07 June 2011 at 17:35 +0200, Patrick McHardy wrote:
>> The main suspects would be NAT and TCPMSS. Did you also try whether the crash occurs with only one of these rules?
>>> I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access the address the way I was doing it, so that's a no-go for me.
>> That's really weird since you're apparently not using any bridge netfilter features. It shouldn't have any effect besides changing at which point ip_tables is invoked. How are your network devices configured (specifically any bridges)?
> Something in the kernel does
>
> u16 *ptr = addr (given by kmalloc())
> ptr[-1] = 0;
>
> Could be an off-by-one error in a memmove()/memcpy() or loop...
> I can't see a network issue here.

So far me neither, but netfilter appears to trigger the bug.

> I checked arch/x86/lib/memmove_64.S and it seems fine.

I was thinking it might be a missing skb_make_writable() combined with vhost_net specifics in the netfilter code (TCPMSS and NAT are both suspect), but was unable to find something. I also went through the dst_metrics() conversion to see whether anything could cause problems with the bridge fake_rttable, but also nothing so far.
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 08/06/11 02:04, Bart De Schuymer wrote:
> If the bug is easily triggered with your guest OS, then you could try to capture the traffic with wireshark (or something else) in a configuration that doesn't crash your system. Save the traffic in a pcap file. Then you can see if resending that traffic in the vulnerable configuration triggers the bug (I don't know if something in Windows exists, but tcpreplay should work for Linux). Once you have such a capture, chances are the bug is even easily reproducible by us (unless it's hardware-specific). Success isn't guaranteed, but I think it's worth a shot...

The issue with this is I don't have a configuration that does not crash the system. This only happens under the specific circumstance that traffic from one VM is being DNAT'd to another VM. If I disable CONFIG_BRIDGE_NETFILTER, or I leave out the DNAT, then I can't replicate the problem as I don't seem to be able to get the packets to go where I want them to go.

Let me try and explain it a little more clearly with made up IP addresses to illustrate the problem.

I have VM A (1.1.1.2) and VM B (1.1.1.3) on br1 (1.1.1.1).
I have a public IP on ppp0 (2.2.2.2).

VM B can talk to VM A using its host address (1.1.1.2) and there is no problem.

The DNAT says anything destined for ppp0 that is on port 443 and coming from anywhere other than ppp0 (i.e. inside the network) is to be DNAT'd to 1.1.1.2.

So VM B (1.1.1.3) tries to connect to ppp0 (2.2.2.2) on port 443, and this is redirected to VM A on 1.1.1.2.

Only under this specific circumstance does the problem occur. I can get VM B (1.1.1.3) to talk directly to VM A (1.1.1.2) all day long and there is no problem, it's only when VM B tries to talk to ppp0 that there is an issue (and it happens within seconds of the initial connection).

All these tests have been performed with VM B being a Windows XP guest. Tonight I'll try it with a Linux guest and see if I can make it happen. If that works I might be able to come up with some reproducible test case for you. I have a desktop machine that has Intel VT extensions, so I'll work toward making a portable test case.
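The triggering path described above can be modelled in a few lines. This is only an illustrative userspace sketch of the rule's match logic, using the made-up addresses from the message and assuming the DNAT target is the guest at 1.1.1.2 and that the client's packet arrives on br1. The struct and function names are hypothetical and are not netfilter APIs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Made-up addresses from the message, in host byte order. */
#define IP4(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))
#define PPP0_ADDR   IP4(2, 2, 2, 2)   /* public address on ppp0 */
#define SERVER_ADDR IP4(1, 1, 1, 2)   /* guest the rule redirects to */
#define CLIENT_ADDR IP4(1, 1, 1, 3)   /* guest that opens the connection */

struct pkt {
    const char *in_if;   /* interface the packet arrived on */
    uint32_t saddr;
    uint32_t daddr;
    uint16_t dport;
    int is_tcp;
};

/*
 * Hypothetical model of
 *   -A PREROUTING -d 2.2.2.2/32 ! -i ppp0 -p tcp --dport 443
 *      -j DNAT --to-destination 1.1.1.2
 * Returns 1 if the destination was rewritten, 0 otherwise.
 */
static int dnat_prerouting(struct pkt *p)
{
    assert(p != NULL && p->in_if != NULL);

    if (!p->is_tcp || p->dport != 443)
        return 0;
    if (p->daddr != PPP0_ADDR)
        return 0;
    if (strcmp(p->in_if, "ppp0") == 0)   /* "! -i ppp0" */
        return 0;

    p->daddr = SERVER_ADDR;
    return 1;
}
```

A packet from CLIENT_ADDR to PPP0_ADDR:443 arriving on br1 is rewritten towards SERVER_ADDR, which is the case that crashes. A packet sent directly to SERVER_ADDR never matches the rule, which fits the observation that direct guest-to-guest traffic is fine.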
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 08/06/11 06:57, Patrick McHardy wrote:
> On 07.06.2011 20:31, Eric Dumazet wrote:
>> On Tuesday 07 June 2011 at 17:35 +0200, Patrick McHardy wrote:
>>> The main suspects would be NAT and TCPMSS. Did you also try whether the crash occurs with only one of these rules?
>>>> I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access the address the way I was doing it, so that's a no-go for me.
>>> That's really weird since you're apparently not using any bridge netfilter features. It shouldn't have any effect besides changing at which point ip_tables is invoked. How are your network devices configured (specifically any bridges)?
>> Something in the kernel does
>>
>> u16 *ptr = addr (given by kmalloc())
>> ptr[-1] = 0;
>>
>> Could be an off-by-one error in a memmove()/memcpy() or loop...
>> I can't see a network issue here.
> So far me neither, but netfilter appears to trigger the bug.

Would it help if I tried some older kernels? This issue only surfaced for me recently as I only installed the VM's in question about 12 weeks ago and have only just started really using them in anger. I could try reproducing it on progressively older kernels to see if I can find one that works and then bisect from there.
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 07/06/11 23:35, Patrick McHardy wrote:
> The main suspects would be NAT and TCPMSS. Did you also try whether the crash occurs with only one of these rules?

To be honest I'm actually having trouble finding where TCPMSS is actually set in that ruleset. This is a production machine so I can only take it down after about 9PM at night. I'll have another crack at it tonight.

>> I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access the address the way I was doing it, so that's a no-go for me.
> That's really weird since you're apparently not using any bridge netfilter features. It shouldn't have any effect besides changing at which point ip_tables is invoked. How are your network devices configured (specifically any bridges)?

I have one bridge with all my virtual machines on it.

In this particular instance the packets leave VM A destined for the IP address of ppp0 (the external interface). This is intercepted by the DNAT PREROUTING rule above and shunted back to VM B.

The VM's are on br1 and the external address is ppp0. Without CONFIG_BRIDGE_NETFILTER compiled in I can see the traffic entering and leaving VM B with tcpdump, but the packets never seem to get back to VM A.

VM A is XP 32 bit, VM B is Linux.

I have some other Linux VM's, so I'll do some more testing tonight between those to see where the packets are going without CONFIG_BRIDGE_NETFILTER set.
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On Wednesday 08 June 2011 at 08:18 +0800, Brad Campbell wrote:
> On 08/06/11 06:57, Patrick McHardy wrote:
>> On 07.06.2011 20:31, Eric Dumazet wrote:
>>> On Tuesday 07 June 2011 at 17:35 +0200, Patrick McHardy wrote:
>>>> The main suspects would be NAT and TCPMSS. Did you also try whether the crash occurs with only one of these rules?
>>>>> I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access the address the way I was doing it, so that's a no-go for me.
>>>> That's really weird since you're apparently not using any bridge netfilter features. It shouldn't have any effect besides changing at which point ip_tables is invoked. How are your network devices configured (specifically any bridges)?
>>> Something in the kernel does
>>>
>>> u16 *ptr = addr (given by kmalloc())
>>> ptr[-1] = 0;
>>>
>>> Could be an off-by-one error in a memmove()/memcpy() or loop...
>>> I can't see a network issue here.
>> So far me neither, but netfilter appears to trigger the bug.
> Would it help if I tried some older kernels? This issue only surfaced for me recently as I only installed the VM's in question about 12 weeks ago and have only just started really using them in anger. I could try reproducing it on progressively older kernels to see if I can find one that works and then bisect from there.

Well, a bisection definitely should help, but needs a lot of time in your case.

Could you try the following patch, because this is the 'usual suspect' I had yesterday:

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 46cbd28..9f548f9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -792,6 +792,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 		fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta;
 	}
 
+#if 0
 	if (fastpath &&
 	    size + sizeof(struct skb_shared_info) <= ksize(skb->head)) {
 		memmove(skb->head + size, skb_shinfo(skb),
@@ -802,7 +803,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 		off = nhead;
 		goto adjust_others;
 	}
-
+#endif
 	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
 	if (!data)
 		goto nodata;
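The branch the patch compiles out is the in-place expansion of the existing buffer; with it disabled, pskb_expand_head() always falls through to the kmalloc() path. The condition it tests can be sketched in userspace as follows. This is a hypothetical model, not the kernel function: the parameter names stand in for the kernel expressions noted in the comment, and reading "size" as headroom plus the existing data area plus tailroom is an assumption made here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical userspace model of the test guarded by "#if 0" in the patch.
 *   fastpath    - non-zero when no other user shares the skb data
 *   size        - bytes wanted for headroom + existing data area + tailroom
 *   shinfo_size - stand-in for sizeof(struct skb_shared_info)
 *   ksize_head  - stand-in for ksize(skb->head), the real allocation size
 * Returns non-zero when the existing allocation could be reused in place.
 */
static int can_expand_in_place(int fastpath, size_t size,
                               size_t shinfo_size, size_t ksize_head)
{
    assert(shinfo_size > 0);
    return fastpath && size + shinfo_size <= ksize_head;
}
```

When this returns non-zero, the original code shuffles the shared info and packet data inside the same allocation with two memmove() calls. That is the code under suspicion, and it is why disabling it makes a useful experiment.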
Re: KVM induced panic on 2.6.38[2367] 2.6.39
Hi Brad,

This has probably nothing to do with ebtables, so please rmmod in case it's loaded. A few questions I didn't directly see an answer to in the threads I scanned... I'm assuming you actually use the bridging firewall functionality. So, what iptables modules do you use? Can you reduce your iptables rules to a core that triggers the bug? Or does it get triggered even with an empty set of firewall rules? Are you using a stock .35 kernel or is it patched? Is this something I can trigger on a poor guy's laptop or does it require specialized hardware (I'm catching up on qemu/kvm...)?

cheers,
Bart

PS: I'm not sure if we should keep CC-ing everybody, netfilter-devel together with kvm should probably do fine.

On 3/06/2011 18:07, Brad Campbell wrote:
> On 03/06/11 23:50, Bernhard Held wrote:
>> On 03.06.2011 15:38, Brad Campbell wrote:
>>> On 02/06/11 07:03, CaT wrote:
>>>> On Wed, Jun 01, 2011 at 07:52:33PM +0800, Brad Campbell wrote:
>>>>> Unfortunately the only interface that is mentioned by name anywhere in my firewall is $DMZ (which is ppp0 and not part of any bridge). All of the nat/dnat and other horrible hacks are based on IP addresses.
>>>> Damn. Not referencing the bridge interfaces at all stopped our host from going down in flames when we passed it a few packets. These are two of the oopses we got from it. Whilst the kernel here is .35 we got the same issue from a range of kernels. Seems related.
>>> Well, I tried sending an explanatory message to netdev, netfilter cc'd to kvm, but it appears not to have made it to kvm or netfilter, and the cc to netdev has not elicited a response. My resend to netfilter seems to have dropped into the bit bucket also.
>> Just another reference 3.5 months ago:
>> http://www.spinics.net/lists/netfilter-devel/msg17239.html
> waves hands around shouting "I have a reproducible test case for this and don't mind patching and crashing the machine to get it fixed"
> Attempted to add netfilter-devel to the cc this time.

-- 
Bart De Schuymer
www.artinalgorithms.be
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On Sunday 05 June 2011 at 21:45 +0800, Brad Campbell wrote:
> On 05/06/11 16:14, Avi Kivity wrote:
>> On 06/03/2011 04:38 PM, Brad Campbell wrote:
>>> Is there anyone who can point me at the appropriate cage to rattle? I know it appears to be a netfilter issue, but I don't seem to be able to get a message to the list (and I am subscribed to it and have been getting mail for months) and I'm not sure who to pester. The other alternative is I just stop doing that and wait for it to bite someone else.
>> The mailing list might be set not to send your own mails back to you. Check the list archive.
> Yep, I did that first..
> Given the response to previous issues along the same line, it looks a bit like I just remember not to actually use the system in the way that triggers the bug and be happy that 99% of the time the kernel does not panic, but have that lovely feeling in the back of the skull that says any time now, and without obvious reason the whole machine might just come crashing down..
> I guess it's still better than running Xen or Windows..

Could you please try the latest linux-2.6 tree? We fixed many networking bugs that could explain your crash.
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On Monday 06 June 2011 at 22:10 +0200, Bart De Schuymer wrote:
> Hi Brad,
> This has probably nothing to do with ebtables, so please rmmod in case it's loaded. A few questions I didn't directly see an answer to in the threads I scanned... I'm assuming you actually use the bridging firewall functionality. So, what iptables modules do you use? Can you reduce your iptables rules to a core that triggers the bug? Or does it get triggered even with an empty set of firewall rules? Are you using a stock .35 kernel or is it patched? Is this something I can trigger on a poor guy's laptop or does it require specialized hardware (I'm catching up on qemu/kvm...)?

Keep netdev, as this most probably is a networking bug.
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 07/06/11 04:10, Bart De Schuymer wrote:
> Hi Brad,
> This has probably nothing to do with ebtables, so please rmmod in case it's loaded. A few questions I didn't directly see an answer to in the threads I scanned... I'm assuming you actually use the bridging firewall functionality. So, what iptables modules do you use? Can you reduce your iptables rules to a core that triggers the bug? Or does it get triggered even with an empty set of firewall rules? Are you using a stock .35 kernel or is it patched? Is this something I can trigger on a poor guy's laptop or does it require specialized hardware (I'm catching up on qemu/kvm...)?

Not specialised hardware as such, I've just not been able to reproduce it outside of this specific operating scenario.

I can't trigger it with empty firewall rules as it relies on a DNAT to occur. If I try it directly to the internal IP address (as I have to without netfilter loaded) then of course nothing fails.

It's a pain in the bum as a fault, but it's one I can easily reproduce as long as I use the same set of circumstances.

I'll try using 3.0-rc2 (current git) tonight, and if I can reproduce it on that then I'll attempt to pare down the IPTABLES rules to a bare minimum.

It is nothing to do with ebtables as I don't compile it. I'm not really sure about bridging firewall functionality. I just use a couple of hand coded bash scripts to set the tables up.

brad@srv:~$ lsmod
Module                  Size  Used by
xt_iprange              1637  1
xt_DSCP                 2077  2
xt_length               1216  1
xt_CLASSIFY             1091  26
sch_sfq                 6681  4
xt_CHECKSUM             1229  2
ipt_REJECT              2277  1
ipt_MASQUERADE          1759  7
ipt_REDIRECT            1133  1
xt_recent               8223  2
xt_state                1226  5
iptable_nat             3993  1
nf_nat                 16773  3 ipt_MASQUERADE,ipt_REDIRECT,iptable_nat
nf_conntrack_ipv4      11868  8 iptable_nat,nf_nat
nf_conntrack           60962  5 ipt_MASQUERADE,xt_state,iptable_nat,nf_nat,nf_conntrack_ipv4
nf_defrag_ipv4          1417  1 nf_conntrack_ipv4
xt_TCPMSS               2567  2
xt_tcpmss               1469  0
xt_tcpudp               2467  56
iptable_mangle          1487  1
pppoe                   9574  2
pppox                   2188  1 pppoe
iptable_filter          1442  1
ip_tables              16762  3 iptable_nat,iptable_mangle,iptable_filter
x_tables               20462  17 xt_iprange,xt_DSCP,xt_length,xt_CLASSIFY,xt_CHECKSUM,ipt_REJECT,ipt_MASQUERADE,ipt_REDIRECT,xt_recent,xt_state,iptable_nat,xt_TCPMSS,xt_tcpmss,xt_tcpudp,iptable_mangle,iptable_filter,ip_tables
ppp_generic            24243  6 pppoe,pppox
slhc                    5293  1 ppp_generic
cls_u32                 6468  6
sch_htb                14432  2
deflate                 1937  0
zlib_deflate           21228  1 deflate
des_generic            16135  0
cbc                     2721  0
ecb                     1975  0
crypto_blkcipher       13645  2 cbc,ecb
sha1_generic            2095  0
md5                     4001  0
hmac                    2977  0
crypto_hash            14519  3 sha1_generic,md5,hmac
cryptomgr               2636  0
aead                    6137  1 cryptomgr
crypto_algapi          15289  9 deflate,des_generic,cbc,ecb,crypto_blkcipher,hmac,crypto_hash,cryptomgr,aead
af_key                 27372  0
fuse                   66747  1
w83627ehf              32052  0
hwmon_vid               2867  1 w83627ehf
vhost_net              16802  6
powernow_k8            12932  0
mperf                   1263  1 powernow_k8
kvm_amd                53431  24
kvm                   235155  1 kvm_amd
pl2303                 12732  1
xhci_hcd               62865  0
i2c_piix4               8391  0
k10temp                 3183  0
usbserial              34452  3 pl2303
usb_storage            37887  1
usb_libusual           10999  1 usb_storage
ohci_hcd               18105  0
ehci_hcd               33641  0
ahci                   20748  4
usbcore               130936  7 pl2303,xhci_hcd,usbserial,usb_storage,usb_libusual,ohci_hcd,ehci_hcd
libahci                21202  1 ahci
sata_mv                26939  0
megaraid_sas           71659  14

Nat Table (external ip substituted for xxx.xxx.xxx.xxx)

Chain PREROUTING (policy ACCEPT 1761K packets, 152M bytes)
 pkts bytes target prot opt in    out source     destination
    5   210 DNAT   udp  --  ppp0  *   0.0.0.0/0  0.0.0.0/0        udp dpt:1195 to:192.168.253.199
    6   252 DNAT   udp  --  !ppp0 *   0.0.0.0/0  xxx.xxx.xxx.xxx  udp dpt:1195 to:192.168.253.199
    0     0 DNAT   tcp  --  ppp0  *   0.0.0.0/0  0.0.0.0/0        tcp dpt:25001 to:192.168.253.199:465
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 06/03/2011 04:38 PM, Brad Campbell wrote:
> Is there anyone who can point me at the appropriate cage to rattle? I know it appears to be a netfilter issue, but I don't seem to be able to get a message to the list (and I am subscribed to it and have been getting mail for months) and I'm not sure who to pester. The other alternative is I just stop doing that and wait for it to bite someone else.

The mailing list might be set not to send your own mails back to you. Check the list archive.

-- 
error compiling committee.c: too many arguments to function
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 05/06/11 16:14, Avi Kivity wrote:
> On 06/03/2011 04:38 PM, Brad Campbell wrote:
>> Is there anyone who can point me at the appropriate cage to rattle? I know it appears to be a netfilter issue, but I don't seem to be able to get a message to the list (and I am subscribed to it and have been getting mail for months) and I'm not sure who to pester. The other alternative is I just stop doing that and wait for it to bite someone else.
> The mailing list might be set not to send your own mails back to you. Check the list archive.

Yep, I did that first..

Given the response to previous issues along the same line, it looks a bit like I just remember not to actually use the system in the way that triggers the bug and be happy that 99% of the time the kernel does not panic, but have that lovely feeling in the back of the skull that says any time now, and without obvious reason the whole machine might just come crashing down..

I guess it's still better than running Xen or Windows..
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 06/05/2011 04:45 PM, Brad Campbell wrote:
>> The mailing list might be set not to send your own mails back to you. Check the list archive.
> Yep, I did that first..
> Given the response to previous issues along the same line, it looks a bit like I just remember not to actually use the system in the way that triggers the bug and be happy that 99% of the time the kernel does not panic, but have that lovely feeling in the back of the skull that says any time now, and without obvious reason the whole machine might just come crashing down..
> I guess it's still better than running Xen or Windows..

Not at all. Can some networking/netfilter expert look at this?

Please file a bug with all the relevant information in this thread. If you can look for a previous version that worked, that might increase the chances of the bug being resolved faster.

-- 
error compiling committee.c: too many arguments to function
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 02/06/11 07:03, CaT wrote:
> On Wed, Jun 01, 2011 at 07:52:33PM +0800, Brad Campbell wrote:
>> Unfortunately the only interface that is mentioned by name anywhere in my firewall is $DMZ (which is ppp0 and not part of any bridge). All of the nat/dnat and other horrible hacks are based on IP addresses.
> Damn. Not referencing the bridge interfaces at all stopped our host from going down in flames when we passed it a few packets. These are two of the oopses we got from it. Whilst the kernel here is .35 we got the same issue from a range of kernels. Seems related.

Well, I tried sending an explanatory message to netdev, netfilter cc'd to kvm, but it appears not to have made it to kvm or netfilter, and the cc to netdev has not elicited a response. My resend to netfilter seems to have dropped into the bit bucket also.

Is there anyone who can point me at the appropriate cage to rattle? I know it appears to be a netfilter issue, but I don't seem to be able to get a message to the list (and I am subscribed to it and have been getting mail for months) and I'm not sure who to pester. The other alternative is I just stop doing that and wait for it to bite someone else.

Cheers.
Brad
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 03.06.2011 15:38, Brad Campbell wrote:
> On 02/06/11 07:03, CaT wrote:
>> On Wed, Jun 01, 2011 at 07:52:33PM +0800, Brad Campbell wrote:
>>> Unfortunately the only interface that is mentioned by name anywhere in my firewall is $DMZ (which is ppp0 and not part of any bridge). All of the nat/dnat and other horrible hacks are based on IP addresses.
>> Damn. Not referencing the bridge interfaces at all stopped our host from going down in flames when we passed it a few packets. These are two of the oopses we got from it. Whilst the kernel here is .35 we got the same issue from a range of kernels. Seems related.
> Well, I tried sending an explanatory message to netdev, netfilter cc'd to kvm, but it appears not to have made it to kvm or netfilter, and the cc to netdev has not elicited a response. My resend to netfilter seems to have dropped into the bit bucket also.

Just another reference 3.5 months ago:
http://www.spinics.net/lists/netfilter-devel/msg17239.html

Bernhard
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 03/06/11 23:50, Bernhard Held wrote:
> On 03.06.2011 15:38, Brad Campbell wrote:
>> On 02/06/11 07:03, CaT wrote:
>>> On Wed, Jun 01, 2011 at 07:52:33PM +0800, Brad Campbell wrote:
>>>> Unfortunately the only interface that is mentioned by name anywhere in my firewall is $DMZ (which is ppp0 and not part of any bridge). All of the nat/dnat and other horrible hacks are based on IP addresses.
>>> Damn. Not referencing the bridge interfaces at all stopped our host from going down in flames when we passed it a few packets. These are two of the oopses we got from it. Whilst the kernel here is .35 we got the same issue from a range of kernels. Seems related.
>> Well, I tried sending an explanatory message to netdev, netfilter cc'd to kvm, but it appears not to have made it to kvm or netfilter, and the cc to netdev has not elicited a response. My resend to netfilter seems to have dropped into the bit bucket also.
> Just another reference 3.5 months ago:
> http://www.spinics.net/lists/netfilter-devel/msg17239.html

waves hands around shouting "I have a reproducible test case for this and don't mind patching and crashing the machine to get it fixed"

Attempted to add netfilter-devel to the cc this time.
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 01/06/11 12:52, Hugh Dickins wrote:
> I guess Brad could try SLUB debugging, boot with slub_debug=P for poisoning perhaps; though it might upset alignments and drive the problem underground. Or see if the same happens with SLAB instead of SLUB.

Not much use I'm afraid. This is all I get in the log

[ 3161.300073] =
[ 3161.300147] BUG kmalloc-512: Freechain corrupt

The qemu process is then frozen, unkillable but reported in state R

13881 ? R 3:27 /usr/bin/qemu -S -M pc-0.13 -enable-kvm -m 1024 -smp 2,sockets=2,cores=1,threads=1 -nam

The machine then progressively dies until it's frozen solid with no further error messages.

I stupidly forgot to do an alt-sysrq-t prior to doing an alt-sysrq-b, but at least it responded to that.

On the bright side I can reproduce it at will.
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 06/01/2011 09:31 AM, Brad Campbell wrote:
> On 01/06/11 12:52, Hugh Dickins wrote:
>> I guess Brad could try SLUB debugging, boot with slub_debug=P for poisoning perhaps; though it might upset alignments and drive the problem underground. Or see if the same happens with SLAB instead of SLUB.
> Not much use I'm afraid. This is all I get in the log
>
> [ 3161.300073] =
> [ 3161.300147] BUG kmalloc-512: Freechain corrupt
>
> The qemu process is then frozen, unkillable but reported in state R
>
> 13881 ? R 3:27 /usr/bin/qemu -S -M pc-0.13 -enable-kvm -m 1024 -smp 2,sockets=2,cores=1,threads=1 -nam
>
> The machine then progressively dies until it's frozen solid with no further error messages.
> I stupidly forgot to do an alt-sysrq-t prior to doing an alt-sysrq-b, but at least it responded to that.
> On the bright side I can reproduce it at will.

Please try slub_debug=FZPU; that should point the finger (hopefully at somebody else).

-- 
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
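The letters in slub_debug=FZPU select sanity checks (F), red zoning (Z), poisoning (P) and allocation/free user tracking (U). The sketch below illustrates only the red-zoning idea in userspace: a known byte pattern is placed next to the object and verified later, so a stray write is reported against the slot it damaged. The layout, names and pattern value are assumptions chosen for this illustration and are not SLUB's internal layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { OBJ_SIZE = 16, ZONE_SIZE = 8 };
#define REDZONE_BYTE 0xbb   /* arbitrary pattern chosen for this sketch */

/* Hypothetical slot layout: the object followed by a red zone. */
struct slot {
    unsigned char obj[OBJ_SIZE];
    unsigned char redzone[ZONE_SIZE];
};

/* Zeroes the object and arms the red zone with the known pattern. */
static void slot_init(struct slot *s)
{
    assert(s != NULL);
    memset(s->obj, 0, sizeof s->obj);
    memset(s->redzone, REDZONE_BYTE, sizeof s->redzone);
}

/* Returns the offset of the first damaged red-zone byte, or -1 if intact. */
static int slot_check(const struct slot *s)
{
    assert(s != NULL);
    for (size_t i = 0; i < sizeof s->redzone; i++)
        if (s->redzone[i] != REDZONE_BYTE)
            return (int)i;
    return -1;
}
```

A check of this kind, run when an object is freed or validated, is what lets the allocator name the damaged cache in its report instead of failing later in unrelated code.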
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 06/01/2011 12:29 PM, Brad Campbell wrote:
On 01/06/11 14:56, Avi Kivity wrote:
On 06/01/2011 09:31 AM, Brad Campbell wrote:
On 01/06/11 12:52, Hugh Dickins wrote:

I guess Brad could try SLUB debugging, boot with slub_debug=P for poisoning perhaps; though it might upset alignments and drive the problem underground. Or see if the same happens with SLAB instead of SLUB.

Not much use I'm afraid. This is all I get in the log

[ 3161.300073] =
[ 3161.300147] BUG kmalloc-512: Freechain corrupt

The qemu process is then frozen, unkillable but reported in state R

13881 ? R 3:27 /usr/bin/qemu -S -M pc-0.13 -enable-kvm -m 1024 -smp 2,sockets=2,cores=1,threads=1 -nam

The machine then progressively dies until it's frozen solid with no further error messages. I stupidly forgot to do an alt-sysrq-t prior to doing an alt-sysrq-b, but at least it responded to that.

On the bright side I can reproduce it at will.

Please try slub_debug=FZPU; that should point the finger (hopefully at somebody else).

Well the first attempt locked the machine solid. No network, no console.. I saw == on the console.. nothing after that. Would not respond to sysrq-t or any other sysrq combination other than -b, which rebooted the box. No output on netconsole at all, I had to walk to the other building to look at the monitor and reboot it.

The second attempt jammed netconsole again, but I managed to get this from an ssh session I already had established. The machine died a slow and horrible death, but remained interactive enough for me to reboot it with

echo b > /proc/sysrq-trigger

Nothing else worked.

[ 413.756416] [81318f1c] ? pskb_expand_head+0x15c/0x250
[ 413.756424] [813a6c45] ? nf_bridge_copy_header+0x145/0x160
[ 413.756431] [8139f78d] ? br_dev_queue_push_xmit+0x6d/0x80
[ 413.756439] [813a55a0] ? br_nf_post_routing+0x2a0/0x2f0
[ 413.756447] [81346bc4] ? nf_iterate+0x84/0xb0
[ 413.756453] [8139f720] ? br_flood_deliver+0x20/0x20
[ 413.756459] [81346c64] ? nf_hook_slow+0x74/0x120
[ 413.756465] [8139f720] ? br_flood_deliver+0x20/0x20
[ 413.756472] [8139f7da] ? br_forward_finish+0x3a/0x60
[ 413.756479] [813a5758] ? br_nf_forward_finish+0x168/0x170
[ 413.756487] [813a5c90] ? br_nf_forward_ip+0x360/0x3a0
[ 413.756492] [81346bc4] ? nf_iterate+0x84/0xb0
[ 413.756498] [8139f7a0] ? br_dev_queue_push_xmit+0x80/0x80
[ 413.756504] [81346c64] ? nf_hook_slow+0x74/0x120
[ 413.756510] [8139f7a0] ? br_dev_queue_push_xmit+0x80/0x80
[ 413.756516] [8139f800] ? br_forward_finish+0x60/0x60
[ 413.756522] [8139f800] ? br_forward_finish+0x60/0x60
[ 413.756528] [8139f875] ? __br_forward+0x75/0xc0
[ 413.756534] [8139f426] ? deliver_clone+0x36/0x60
[ 413.756540] [8139f69d] ? br_flood+0xbd/0x100
[ 413.756546] [813a05b0] ? br_handle_local_finish+0x40/0x40
[ 413.756552] [813a080e] ? br_handle_frame_finish+0x25e/0x280
[ 413.756560] [813a60f0] ? br_nf_pre_routing_finish+0x1a0/0x330
[ 413.756568] [813a6958] ? br_nf_pre_routing+0x6d8/0x800
[ 413.756577] [8102d46a] ? enqueue_task+0x3a/0x90
[ 413.756582] [81346bc4] ? nf_iterate+0x84/0xb0
[ 413.756589] [813a05b0] ? br_handle_local_finish+0x40/0x40
[ 413.756594] [81346c64] ? nf_hook_slow+0x74/0x120
[ 413.756600] [813a05b0] ? br_handle_local_finish+0x40/0x40
[ 413.756607] [810339b0] ? try_to_wake_up+0x2c0/0x2c0
[ 413.756613] [813a09d9] ? br_handle_frame+0x1a9/0x280
[ 413.756620] [813a0830] ? br_handle_frame_finish+0x280/0x280
[ 413.756627] [81320ef7] ? __netif_receive_skb+0x157/0x5c0
[ 413.756634] [81321443] ? process_backlog+0xe3/0x1d0
[ 413.756641] [81321da5] ? net_rx_action+0xc5/0x1d0
[ 413.756650] [8103df11] ? __do_softirq+0x91/0x120
[ 413.756657] [813d838c] ? call_softirq+0x1c/0x30
[ 413.756660] EOI [81003cbd] ? do_softirq+0x4d/0x80
[ 413.756673] [81321ece] ? netif_rx_ni+0x1e/0x30
[ 413.756681] [812b3ae2] ? tun_chr_aio_write+0x332/0x4e0
[ 413.756688] [812b37b0] ? tun_sendmsg+0x4d0/0x4d0
[ 413.756697] [810c24e9] ? do_sync_readv_writev+0xa9/0xf0
[ 413.756704] [81063f9c] ? do_futex+0x13c/0xa70
[ 413.756711] [811d6730] ? timerqueue_add+0x60/0xb0
[ 413.756719] [81056ab7] ? __hrtimer_start_range_ns+0x1e7/0x410
[ 413.756726] [810c231b] ? rw_copy_check_uvector+0x7b/0x140
[ 413.756734] [810c2bcf] ? do_readv_writev+0xdf/0x210
[ 413.756742] [810c2e7e] ? sys_writev+0x4e/0xc0
[ 413.756750] [813d753b] ? system_call_fastpath+0x16/0x1b
[ 413.756756] FIX kmalloc-1024:
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 06/01/2011 12:40 PM, Avi Kivity wrote:

bridge and netfilter, IIRC this was also the problem last time. Do you have any ebtables loaded?

Can you try building a kernel without ebtables? Without netfilter at all?

Please run all tests with slub_debug=FZPU.

--
error compiling committee.c: too many arguments to function
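For anyone following along who has not used the option before, each letter in slub_debug=FZPU selects one SLUB debugging feature. The sketch below spells the letters out in C. The descriptions follow the kernel's SLUB documentation as understood here, and both function names are made up for illustration; they are not kernel APIs.

```c
#include <stddef.h>

/* Illustrative only: map one slub_debug option letter to a short
 * description. Hypothetical helper, not part of the kernel. */
const char *slub_debug_flag_name(char flag)
{
    switch (flag) {
    case 'F': return "sanity checks";
    case 'Z': return "red zoning";
    case 'P': return "poisoning";
    case 'U': return "user tracking";
    case 'T': return "tracing";
    default:  return NULL;  /* letter not covered by this sketch */
    }
}

/* Return how many letters opts contains if every one is recognised by
 * the table above, or -1 as soon as an unknown letter is seen. */
int slub_debug_count_known(const char *opts)
{
    int n = 0;

    for (; *opts != '\0'; opts++) {
        if (slub_debug_flag_name(*opts) == NULL)
            return -1;
        n++;
    }
    return n;
}
```

Feeding it "FZPU" walks through sanity checks, red zoning, poisoning and user tracking, which is the combination requested above.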
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 01/06/11 17:41, Avi Kivity wrote:
On 06/01/2011 12:40 PM, Avi Kivity wrote:

bridge and netfilter, IIRC this was also the problem last time. Do you have any ebtables loaded?

Never heard of them, but making a cursory check just in case..

brad@srv:/raid10/src/linux-2.6.39$ grep EBTABLE .config
# CONFIG_BRIDGE_NF_EBTABLES is not set

Can you try building a kernel without ebtables? Without netfilter at all?

Well, without netfilter I can't get it to crash. The problem is that without netfilter I can't actually use it the way I use it to get it to crash. I rebooted into a netfilter kernel, and did all the steps I'd used on the no-netfilter kernel and it ticked along happily. So the result of the experiment is inconclusive. Having said that, the backtraces certainly smell networky.

To get it to crash, I have to start IE in the VM and https to the public address of the machine, which is then redirected by netfilter back into another of the VMs. I can https directly to the other VM's address, but that does not cause it to crash; however, without netfilter loaded I can't bounce off the public IP.

It's all rather confusing really. What next Sherlock?
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 06/01/2011 01:53 PM, Brad Campbell wrote:
On 01/06/11 17:41, Avi Kivity wrote:
On 06/01/2011 12:40 PM, Avi Kivity wrote:

bridge and netfilter, IIRC this was also the problem last time. Do you have any ebtables loaded?

Never heard of them, but making a cursory check just in case..

brad@srv:/raid10/src/linux-2.6.39$ grep EBTABLE .config
# CONFIG_BRIDGE_NF_EBTABLES is not set

Can you try building a kernel without ebtables? Without netfilter at all?

Well, without netfilter I can't get it to crash. The problem is that without netfilter I can't actually use it the way I use it to get it to crash. I rebooted into a netfilter kernel, and did all the steps I'd used on the no-netfilter kernel and it ticked along happily. So the result of the experiment is inconclusive. Having said that, the backtraces certainly smell networky.

To get it to crash, I have to start IE in the VM and https to the public address of the machine, which is then redirected by netfilter back into another of the VMs. I can https directly to the other VM's address, but that does not cause it to crash; however, without netfilter loaded I can't bounce off the public IP.

It's all rather confusing really. What next Sherlock?

Maybe the Sherlocks at netdev@ can tell.

--
error compiling committee.c: too many arguments to function
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On Wed, Jun 01, 2011 at 06:53:31PM +0800, Brad Campbell wrote:

I rebooted into a netfilter kernel, and did all the steps I'd used on the no-netfilter kernel and it ticked along happily. So the result of the experiment is inconclusive. Having said that, the backtraces certainly smell networky.

To get it to crash, I have to start IE in the VM and https to the public address of the machine, which is then redirected by netfilter back into another of the VMs. I can https directly to the other VM's address, but that does not cause it to crash; however, without netfilter loaded I can't bounce off the public IP.

It's all rather confusing really. What next Sherlock?

I think you're hitting something I've seen. Can you try rewriting your firewall rules so that they do not reference any bridge interfaces at all? Instead, reference the real interface names in their place. I'm betting it won't crash.

(netdev added to CC since we're already bouncing there)

--
A search of his car uncovered pornography, a homemade sex aid, women's stockings and a Jack Russell terrier. - http://www.dailytelegraph.com.au/news/wacky/indeed/story-e6frev20-118083480
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 01/06/11 19:18, CaT wrote:
On Wed, Jun 01, 2011 at 06:53:31PM +0800, Brad Campbell wrote:

I rebooted into a netfilter kernel, and did all the steps I'd used on the no-netfilter kernel and it ticked along happily. So the result of the experiment is inconclusive. Having said that, the backtraces certainly smell networky.

To get it to crash, I have to start IE in the VM and https to the public address of the machine, which is then redirected by netfilter back into another of the VM's. I can https directly to the other VM's address, but that does not cause it to crash, however without netfilter loaded I can't bounce off the public IP.

It's all rather confusing really. What next Sherlock?

I think you're hitting something I've seen. Can you try rewriting your firewall rules so that it does not reference any bridge interfaces at all. Instead, reference the real interface names in their place. I'm betting it wont crash.

Unfortunately the only interface that is mentioned by name anywhere in my firewall is $DMZ (which is ppp0 and not part of any bridge). All of the nat/dnat and other horrible hacks are based on IP addresses.
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On Wed, Jun 01, 2011 at 07:52:33PM +0800, Brad Campbell wrote:

Unfortunately the only interface that is mentioned by name anywhere in my firewall is $DMZ (which is ppp0 and not part of any bridge). All of the nat/dnat and other horrible hacks are based on IP addresses.

Damn. Not referencing the bridge interfaces at all stopped our host from going down in flames when we passed it a few packets. These are two of the oopses we got from it. Whilst the kernel here is .35 we got the same issue from a range of kernels. Seems related.

The oopses may be a bit weird. Copy and paste from an ipmi terminal.

slab error in cache_alloc_debugcheck_after(): cache `size-64': double free, or n
Pid: 2431, comm: kvm Tainted: G D 2.6.35.9-local.20110314-141930 #1
Call Trace:
IRQ [810fb8bf] ? __slab_error+0x1f/0x30
[810fc22b] ? cache_alloc_debugcheck_after+0x6b/0x1f0
[81530a00] ? br_nf_pre_routing_finish+0x0/0x370
[8153106b] ? br_nf_pre_routing+0x2fb/0x980
[810fdd3d] ? kmem_cache_alloc_notrace+0x7d/0xf0
[8153106b] ? br_nf_pre_routing+0x2fb/0x980
[81466e66] ? nf_iterate+0x66/0xb0
[8152b9f0] ? br_handle_frame_finish+0x0/0x1c0
[81466f14] ? nf_hook_slow+0x64/0xf0
[8152b9f0] ? br_handle_frame_finish+0x0/0x1c0
[8152bd3c] ? br_handle_frame+0x18c/0x250
[81445459] ? __netif_receive_skb+0x169/0x2a0
[81445673] ? process_backlog+0xe3/0x1d0
[81446347] ? net_rx_action+0x87/0x1c0
[810793f7] ? __do_softirq+0xa7/0x1d0
[81035b8c] ? call_softirq+0x1c/0x30
EOI [81037c6d] ? do_softirq+0x4d/0x80
[81446b4e] ? netif_rx_ni+0x1e/0x30
[8139541a] ? tun_chr_aio_write+0x36a/0x510
[813950b0] ? tun_chr_aio_write+0x0/0x510
[81102859] ? do_sync_readv_writev+0xa9/0xf0
[810973fb] ? ktime_get+0x5b/0xe0
[8104f958] ? lapic_next_event+0x18/0x20
[8109be18] ? tick_dev_program_event+0x38/0x100
[81102697] ? rw_copy_check_uvector+0x77/0x130
[81102f0c] ? do_readv_writev+0xdc/0x200
[8108dfec] ? sys_timer_settime+0x13c/0x2e0
[8110317e] ? sys_writev+0x4e/0x90
[81034d6b] ? system_call_fastpath+0x16/0x1b
8801e7621500: redzone 1:0xbf05bd010006, redzone 2:0x9f911029d74e35b

--

Code: 40 01 00 00 4c 8b a4 24 48 01 00 00 4c 8b ac 24 50 01 00 00 4c 8b b4 24 5
RIP [81652c67] icmp_send+0x297/0x650
RSP 880001c036b8
---[ end trace 9d3f7be7684ac91e ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 0, comm: swapper Tainted: G D 2.6.35.9-local.20110314-144920 #2
Call Trace:
IRQ [8170eada] ? panic+0x94/0x116
[81711326] ? _raw_spin_lock_irqsave+0x26/0x40
[8103a05f] ? oops_end+0xef/0xf0
[81711a15] ? general_protection+0x25/0x30
[81652c2f] ? icmp_send+0x25f/0x650
[81652c67] ? icmp_send+0x297/0x650
[815fd8e6] ? nf_iterate+0x66/0xb0
[816dbfa0] ? br_nf_forward_finish+0x0/0x170
[815fd994] ? nf_hook_slow+0x64/0xf0
[816dbfa0] ? br_nf_forward_finish+0x0/0x170
[816dc461] ? br_nf_forward_ip+0x201/0x3e0
[815fd8e6] ? nf_iterate+0x66/0xb0
[816d6620] ? br_forward_finish+0x0/0x60
[815fd994] ? nf_hook_slow+0x64/0xf0
[816d6620] ? br_forward_finish+0x0/0x60
[816d66e9] ? __br_forward+0x69/0xb0
[816d741a] ? br_handle_frame_finish+0x12a/0x280
[816dcac8] ? br_nf_pre_routing_finish+0x208/0x370
[815fd994] ? nf_hook_slow+0x64/0xf0
[816dc8c0] ? br_nf_pre_routing_finish+0x0/0x370
[816dc538] ? br_nf_forward_ip+0x2d8/0x3e0
[816dd3b5] ? br_nf_pre_routing+0x785/0x980
[815fd8e6] ? nf_iterate+0x66/0xb0
[815fd994] ? nf_hook_slow+0x64/0xf0
[816d72f0] ? br_handle_frame_finish+0x0/0x280
[815fd994] ? nf_hook_slow+0x64/0xf0
[816d72f0] ? br_handle_frame_finish+0x0/0x280
[816d76fc] ? br_handle_frame+0x18c/0x250
[815dec5b] ? __netif_receive_skb+0x1cb/0x350
[8103d115] ? read_tsc+0x5/0x20
[815dfa18] ? netif_receive_skb+0x78/0x80
[815e0217] ? napi_gro_receive+0x27/0x40
[815e01d8] ? napi_skb_finish+0x38/0x50
[8152586d] ? bnx2_poll_work+0xd0d/0x13d0
[8160c950] ? ctnetlink_conntrack_event+0x210/0x7d0
[81092029] ? autoremove_wake_function+0x9/0x30
[8109a71b] ? ktime_get+0x5b/0xe0
[81526051] ? bnx2_poll+0x61/0x230
[81051db8] ? lapic_next_event+0x18/0x20
[815dfbef] ? net_rx_action+0x9f/0x200
[8109636f] ? __hrtimer_start_range_ns+0x22f/0x410
[8107c35f] ? __do_softirq+0xaf/0x1e0
[810ab547] ? handle_IRQ_event+0x47/0x160
[81036d5c] ? call_softirq+0x1c/0x30
[81038c85] ? do_softirq+0x65/0xa0
[8107c235] ? irq_exit+0x85/0x90
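The "redzone 1 / redzone 2" line in the first report can be read against the magic values that the SLAB debug code writes on either side of an object. What follows is a minimal sketch, assuming the RED_INACTIVE and RED_ACTIVE constants from include/linux/poison.h are as reproduced here, and assuming the pasted values are complete (the IPMI copy may have dropped digits). classify_redzone is a made-up helper, not a kernel function.

```c
#include <stdint.h>

/* Assumed to match include/linux/poison.h: pattern written around a
 * free object, and pattern written around an allocated one. */
#define RED_INACTIVE 0x09F911029D74E35BULL
#define RED_ACTIVE   0xD84156C5635688C0ULL

enum redzone_state {
    REDZONE_INACTIVE,   /* guard word intact, object was free      */
    REDZONE_ACTIVE,     /* guard word intact, object was allocated */
    REDZONE_CORRUPT     /* guard word matches neither pattern      */
};

/* Hypothetical helper: classify one 64-bit guard word from a report. */
enum redzone_state classify_redzone(uint64_t word)
{
    if (word == RED_INACTIVE)
        return REDZONE_INACTIVE;
    if (word == RED_ACTIVE)
        return REDZONE_ACTIVE;
    return REDZONE_CORRUPT;
}
```

With the values as pasted, redzone 2 (0x9f911029d74e35b) matches the inactive pattern while redzone 1 (0xbf05bd010006) matches neither, which would suggest that something wrote into the guard word ahead of the object.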
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 31/05/11 13:47, Borislav Petkov wrote:

Looks like a KSM issue. Disabling CONFIG_KSM should at least stop your machine from oopsing. Adding linux-mm.

I initially thought that, so the second panic was produced with KSM disabled from boot.

echo 0 > /sys/kernel/mm/ksm/run

If you still think that compiling ksm out of the kernel will prevent it then I'm willing to give it a go.

It's a production server, so I can only really bounce it around after about 9PM - GMT+8.

Regards,
Brad
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On Tue, May 31, 2011 at 05:26:10PM +0800, Brad Campbell wrote:
On 31/05/11 13:47, Borislav Petkov wrote:

Looks like a KSM issue. Disabling CONFIG_KSM should at least stop your machine from oopsing. Adding linux-mm.

I initially thought that, so the second panic was produced with KSM disabled from boot.

echo 0 > /sys/kernel/mm/ksm/run

If you still think that compiling ksm out of the kernel will prevent it then I'm willing to give it a go.

Ok, from looking at the code, when KSM inits, it starts the ksm kernel thread and it looks like your oops comes from the function that is run in the kernel thread - ksm_scan_thread. So even if you disable it from sysfs, it runs at least once.

Let's add some more people to Cc and see what happens :).

--
Regards/Gruss,
Boris.
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 31/05/11 18:38, Borislav Petkov wrote:
On Tue, May 31, 2011 at 05:26:10PM +0800, Brad Campbell wrote:
On 31/05/11 13:47, Borislav Petkov wrote:

Looks like a KSM issue. Disabling CONFIG_KSM should at least stop your machine from oopsing. Adding linux-mm.

I initially thought that, so the second panic was produced with KSM disabled from boot.

echo 0 > /sys/kernel/mm/ksm/run

If you still think that compiling ksm out of the kernel will prevent it then I'm willing to give it a go.

Ok, from looking at the code, when KSM inits, it starts the ksm kernel thread and it looks like your oops comes from the function that is run in the kernel thread - ksm_scan_thread. So even if you disable it from sysfs, it runs at least once.

Just to confirm, I recompiled 2.6.38.7 without KSM enabled and I've been unable to reproduce the bug, so it looks like you were on the money.

I've moved back to 2.6.38.7 as 2.6.39 has a painful SCSI bug that panics about 75% of boots, and the reboot cycle required to luck my way into a working kernel is just too much hassle.

It would appear that XP zeroes its memory space on bootup, so there would be lots of pages to merge with a couple of relatively freshly booted XP machines running.

Regards,
Brad.
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On Tue, 31 May 2011, Brad Campbell wrote:
On 31/05/11 18:38, Borislav Petkov wrote:
On Tue, May 31, 2011 at 05:26:10PM +0800, Brad Campbell wrote:
On 31/05/11 13:47, Borislav Petkov wrote:

Looks like a KSM issue. Disabling CONFIG_KSM should at least stop your machine from oopsing. Adding linux-mm.

I initially thought that, so the second panic was produced with KSM disabled from boot.

echo 0 > /sys/kernel/mm/ksm/run

If you still think that compiling ksm out of the kernel will prevent it then I'm willing to give it a go.

Ok, from looking at the code, when KSM inits, it starts the ksm kernel thread and it looks like your oops comes from the function that is run in the kernel thread - ksm_scan_thread. So even if you disable it from sysfs, it runs at least once.

Just to confirm, I recompiled 2.6.38.7 without KSM enabled and I've been unable to reproduce the bug, so it looks like you were on the money.

I've moved back to 2.6.38.7 as 2.6.39 has a painful SCSI bug that panics about 75% of boots, and the reboot cycle required to luck my way into a working kernel is just too much hassle.

It would appear that XP zeroes its memory space on bootup, so there would be lots of pages to merge with a couple of relatively freshly booted XP machines running.

Thanks for the Cc, Borislav.

Brad, my suspicion is that in each case the top 16 bits of RDX have been mysteriously corrupted from to , causing the general protection faults. I don't understand what that has to do with KSM.

But it's only a suspicion, because I can't make sense of the Code: lines in your traces, they have more than the expected 64 bytes, and only one of them has a (with no ) to mark faulting instruction.

I did try compiling the 2.6.39 kernel from your config, but of course we have different compilers, so although I got close, it wasn't exact.

Would you mind mailing me privately (it's about 73MB) the objdump -trd output for your original vmlinux (with KSM on)? (Those -trd options are the ones I'm used to typing, I bet they're not all relevant.)

Of course, it's only a tiny fraction of that output that I need, might be better to cut it down to remove_rmap_item_from_tree and dup_fd and ksm_scan_thread, if you have the time to do so.

Thanks,
Hugh
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 01/06/11 06:31, Hugh Dickins wrote: Brad, my suspicion is that in each case the top 16 bits of RDX have been mysteriously corrupted from to , causing the general protection faults. I don't understand what that has to do with KSM. No, nor do I. The panic I reproduced with KSM off was in a completely unrelated code path. To be honest I would not be surprised if it turns out I have dodgy RAM, although it has passed multiple memtests and I've tried clocking it down. Just a gut feeling. But it's only a suspicion, because I can't make sense of the Code: lines in your traces, they have more than the expected 64 bytes, and only one of them has a (with no) to mark faulting instruction. Yeah, with hindsight I must have removed them when I re-formatted the code from the oops. Each byte was one line in the syslog so there was a lot of deleting to get it to a postable format. I did try compiling the 2.6.39 kernel from your config, but of course we have different compilers, so although I got close, it wasn't exact. Would you mind mailing me privately (it's about 73MB) the objdump -trd output for your original vmlinux (with KSM on)? (Those -trd options are the ones I'm used to typing, I bet not they're not all relevant.) Of course, it's only a tiny fraction of that output that I need, might be better to cut it down to remove_rmap_item_from_tree and dup_fd and ksm_scan_thread, if you have the time to do so. Ok, so since my initial posting I've figured out how to get a clean oops out of netconsole, so tonight (after 9PM GMT+8) I'll reproduce the oops a couple of times. What about I upload the oops, plus the vmlinux, plus .config and System.map to a server with a fat pipe and give you a link to it? At least I can reproduce it quickly and easily. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 01/06/11 06:31, Hugh Dickins wrote: Brad, my suspicion is that in each case the top 16 bits of RDX have been mysteriously corrupted from to , causing the general protection faults. I don't understand what that has to do with KSM. But it's only a suspicion, because I can't make sense of the Code: lines in your traces, they have more than the expected 64 bytes, and only one of them has a (with no) to mark faulting instruction. I did try compiling the 2.6.39 kernel from your config, but of course we have different compilers, so although I got close, it wasn't exact. Would you mind mailing me privately (it's about 73MB) the objdump -trd output for your original vmlinux (with KSM on)? (Those -trd options are the ones I'm used to typing, I bet not they're not all relevant.) Of course, it's only a tiny fraction of that output that I need, might be better to cut it down to remove_rmap_item_from_tree and dup_fd and ksm_scan_thread, if you have the time to do so. Would you believe about 20 seconds after I pressed send the kernel oopsed. http://www.fnarfbargle.com/private/003_kernel_oops/ oops reproduced here, but an un-munged version is in that directory alongside the kernel. 
[36542.880228] general protection fault: [#1] SMP
[36542.880271] last sysfs file: /sys/devices/pci:00/:00:18.3/temp1_input
[36542.880290] CPU 4
[36542.880301] Modules linked in: xt_iprange xt_DSCP xt_length xt_CLASSIFY sch_sfq xt_CHECKSUM ipt_REJECT ipt_MASQUERADE ipt_REDIRECT xt_recent xt_state iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 xt_TCPMSS xt_tcpmss xt_tcpudp iptable_mangle ip_tables x_tables pppoe pppox ppp_generic slhc cls_u32 sch_htb deflate zlib_deflate des_generic cbc ecb crypto_blkcipher sha1_generic md5 hmac crypto_hash cryptomgr aead crypto_algapi af_key fuse hwmon_vid netconsole configfs vhost_net powernow_k8 mperf kvm_amd kvm pl2303 usbserial xhci_hcd k10temp i2c_piix4 ahci usb_storage usb_libusual ohci_hcd ehci_hcd r8169 libahci usbcore mii sata_mv megaraid_sas [last unloaded: scsi_wait_scan]
[36542.880842]
[36542.880858] Pid: 13346, comm: bash Not tainted 2.6.38.7 #29 To Be Filled By O.E.M. To Be Filled By O.E.M./880G Extreme3
[36542.880911] RIP: 0010:[810cf0de] [810cf0de] do_vfs_ioctl+0x5e/0x510
[36542.880948] RSP: 0018:8802d25a1ec8 EFLAGS: 00010206
[36542.880965] RAX: fff7 RBX: 88040eb12840 RCX: 7fff4fe4a4c0
[36542.880984] RDX: 5413 RSI: 5413 RDI: 00ff
[36542.881002] RBP: 00ff R08: 7fff4fe4a400 R09:
[36542.881020] R10: 7fff4fe4a380 R11: 0246 R12: 7fff4fe4a4c0
[36542.881038] R13: 7fff4fe4a4c0 R14: R15: 0001
[36542.881058] FS: 7f65f725b700() GS:8800dbd0() knlGS:
[36542.881081] CS: 0010 DS: ES: CR0: 80050033
[36542.881098] CR2: 01f01008 CR3: 0002d25c3000 CR4: 06e0
[36542.881116] DR0: 00a0 DR1: DR2: 0003
[36542.881133] DR3: 00b0 DR6: 0ff0 DR7: 0400
[36542.881152] Process bash (pid: 13346, threadinfo 8802d25a, task 88041df88000)
[36542.881172] Stack:
[36542.881183] 88041df88218 0010 0001
[36542.881225] 0002 7fff4fe4a2c0 7fff4fe4a220 0002
[36542.881268] 81046d6a 88040eb12840 00ff
[36542.881312] Call Trace:
[36542.881333] [81046d6a] ? sys_rt_sigaction+0x8a/0xc0
[36542.881351] [810cf5d9] ? sys_ioctl+0x49/0x80
[36542.881373] [810023fb] ? system_call_fastpath+0x16/0x1b
[36542.881389] Code: 76 7b 81 fa 77 58 04 c0 0f 84 77 01 00 00 0f 1f 80 00 00 00 00 0f 87 a2 00 00 00 81 fa 60 54 00 00 0f 1f 40 00 0f 84 ba 01 00 00 48 8b 43 18 48 8b 50 30 0f b7 02 25 00 f0 00 00 3d 00 80 00 00
[36542.881793] RIP [810cf0de] do_vfs_ioctl+0x5e/0x510
[36542.881818] RSP 8802d25a1ec8
[36542.882082] ---[ end trace 1b8d730cd479e388 ]---
[36542.882126] Kernel panic - not syncing: Fatal exception
[36542.882175] Pid: 13346, comm: bash Tainted: G D 2.6.38.7 #29
[36542.88] Call Trace:
[36542.882269] [813c7f42] ? panic+0x92/0x18a
[36542.882318] [81039a41] ? kmsg_dump+0x41/0xf0
[36542.882366] [810062bd] ? oops_end+0x8d/0xa0
[36542.882414] [813caeef] ? general_protection+0x1f/0x30
[36542.882463] [810cf0de] ? do_vfs_ioctl+0x5e/0x510
[36542.882511] [81046d6a] ? sys_rt_sigaction+0x8a/0xc0
[36542.882560] [810cf5d9] ? sys_ioctl+0x49/0x80
[36542.882608] [810023fb] ? system_call_fastpath+0x16/0x1b
[36542.882688] Rebooting in 60 seconds..
[ 33.104725] fuse init (API version 7.16)
Re: KVM induced panic on 2.6.38[2367] 2.6.39
Hello, On Wed, Jun 01, 2011 at 08:37:25AM +0800, Brad Campbell wrote: On 01/06/11 06:31, Hugh Dickins wrote: Brad, my suspicion is that in each case the top 16 bits of RDX have been mysteriously corrupted from to , causing the general protection faults. I don't understand what that has to do with KSM. But it's only a suspicion, because I can't make sense of the Code: lines in your traces, they have more than the expected 64 bytes, and only one of them has a (with no) to mark faulting instruction. I did try compiling the 2.6.39 kernel from your config, but of course we have different compilers, so although I got close, it wasn't exact. Would you mind mailing me privately (it's about 73MB) the objdump -trd output for your original vmlinux (with KSM on)? (Those -trd options are the ones I'm used to typing, I bet not they're not all relevant.) Of course, it's only a tiny fraction of that output that I need, might be better to cut it down to remove_rmap_item_from_tree and dup_fd and ksm_scan_thread, if you have the time to do so. Would you believe about 20 seconds after I pressed send the kernel oopsed. http://www.fnarfbargle.com/private/003_kernel_oops/ oops reproduced here, but an un-munged version is in that directory alongside the kernel. [36542.880228] general protection fault: [#1] SMP Reminds me of another oops that was reported on the kvm list for 2.6.38.1 with message id 4D8C6110.6090204. There the top 16 bits of rsi were flipped and it was a general protection too because of hitting on the not mappable virtual range. http://www.virtall.com/files/temp/kvm.txt http://www.virtall.com/files/temp/config-2.6.38.1 http://virtall.com/files/temp/mmu-objdump.txt That oops happened in kvm_unmap_rmapp though, but it looked memory corruption (Avi suggested use after free) but it was a production system so we couldn't debug it further. I recommend next thing to reproduce again with 2.6.39 or 3.0.0-rc1. 
Let's fix your scsi trouble if needed but it's better you test with 2.6.39.

We'd need chmod +r vmlinux on private/003_kernel_oops/

Thanks,
Andrea
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On 01/06/11 09:15, Andrea Arcangeli wrote: Hello, On Wed, Jun 01, 2011 at 08:37:25AM +0800, Brad Campbell wrote: On 01/06/11 06:31, Hugh Dickins wrote: Brad, my suspicion is that in each case the top 16 bits of RDX have been mysteriously corrupted from to , causing the general protection faults. I don't understand what that has to do with KSM. But it's only a suspicion, because I can't make sense of the Code: lines in your traces, they have more than the expected 64 bytes, and only one of them has a (with no) to mark faulting instruction. I did try compiling the 2.6.39 kernel from your config, but of course we have different compilers, so although I got close, it wasn't exact. Would you mind mailing me privately (it's about 73MB) the objdump -trd output for your original vmlinux (with KSM on)? (Those -trd options are the ones I'm used to typing, I bet not they're not all relevant.) Of course, it's only a tiny fraction of that output that I need, might be better to cut it down to remove_rmap_item_from_tree and dup_fd and ksm_scan_thread, if you have the time to do so. Would you believe about 20 seconds after I pressed send the kernel oopsed. http://www.fnarfbargle.com/private/003_kernel_oops/ oops reproduced here, but an un-munged version is in that directory alongside the kernel. [36542.880228] general protection fault: [#1] SMP Reminds me of another oops that was reported on the kvm list for 2.6.38.1 with message id 4D8C6110.6090204. There the top 16 bits of rsi were flipped and it was a general protection too because of hitting on the not mappable virtual range. http://www.virtall.com/files/temp/kvm.txt http://www.virtall.com/files/temp/config-2.6.38.1 http://virtall.com/files/temp/mmu-objdump.txt That oops happened in kvm_unmap_rmapp though, but it looked memory corruption (Avi suggested use after free) but it was a production system so we couldn't debug it further. I recommend next thing to reproduce again with 2.6.39 or 3.0.0-rc1. 
Let's fix your scsi trouble if needed but it's better you test with 2.6.39. We'd need chmod +r vmlinux on private/003_kernel_oops/ Ok, here we go then. http://www.fnarfbargle.com/private/004_kernel_oops/ The permissions are right this time. 2.6.39 + KSM [ 694.227866] general protection fault: [#1] SMP [ 694.228001] last sysfs file: /sys/devices/platform/w83627ehf.656/cpu0_vid [ 694.228050] CPU 3 [ 694.228091] Modules linked in: xt_iprange xt_DSCP xt_length xt_CLASSIFY sch_sfq xt_CHECKSUM ipt_REJECT ipt_MASQUERADE ipt_REDIRECT xt_recent xt_state iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 xt_TCPMSS xt_tcpmss xt_tcpudp iptable_mangle ip_tables x_tables pppoe pppox ppp_generic slhc cls_u32 sch_htb deflate zlib_deflate des_generic cbc ecb crypto_blkcipher sha1_generic md5 hmac crypto_hash cryptomgr aead crypto_algapi af_key fuse w83627ehf hwmon_vid netconsole configfs vhost_net powernow_k8 mperf kvm_amd kvm pl2303 usbserial i2c_piix4 k10temp xhci_hcd usb_storage usb_libusual ohci_hcd r8169 ehci_hcd ahci usbcore sata_mv mii libahci megaraid_sas [last unloaded: scsi_wait_scan] [ 694.230897] [ 694.230944] Pid: 11841, comm: keepalive Not tainted 2.6.39 #3 To Be Filled By O.E.M. 
To Be Filled By O.E.M./880G Extreme3 [ 694.23] RIP: 0010:[810db878] [810db878] dup_fd+0x168/0x300 [ 694.231210] RSP: 0018:8802f524fdd0 EFLAGS: 00010206 [ 694.231258] RAX: 07f8 RBX: 8802f5721b80 RCX: bfff [ 694.231308] RDX: 8802f51cacc0 RSI: 00ff RDI: 0800 [ 694.231358] RBP: 8803bf419800 R08: 88030167f6c0 R09: 0003 [ 694.231407] R10: 0001 R11: 4000 R12: 0100 [ 694.231457] R13: 880417aa9800 R14: 88030167f440 R15: 8803bd8c1600 [ 694.231507] FS: 7f02cfc32700() GS:88041fcc() knlGS: [ 694.231560] CS: 0010 DS: ES: CR0: 8005003b [ 694.231609] CR2: 7f02cf5d4810 CR3: 0002f52c3000 CR4: 06e0 [ 694.231657] DR0: 0045 DR1: DR2: [ 694.231707] DR3: 0005 DR6: 0ff0 DR7: 0400 [ 694.231757] Process keepalive (pid: 11841, threadinfo 8802f524e000, task 8802f5143690) [ 694.231809] Stack: [ 694.231852] 8802f5143690 0020 8802f56badc0 8802f5721b90 [ 694.232050] 880417aa54e0 01200011 880417aa54e0 [ 694.232248] 7f02cfc329d0 8802f5143690 81037645 [ 694.232448] Call Trace: [ 694.232499] [81037645] ? copy_process+0xa75/0xfd0 [ 694.232549] [81037c0d] ? do_fork+0x6d/0x2b0 [ 694.232599] [810457a9] ? sigprocmask+0x69/0x100 [ 694.232651] [813d0ca3] ? stub_clone+0x13/0x20 [ 694.232699] [813d0a3b] ? system_call_fastpath+0x16/0x1b [ 694.232745] Code: 4c 89 c2 e8
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On Wed, 1 Jun 2011, Andrea Arcangeli wrote:
> On Wed, Jun 01, 2011 at 08:37:25AM +0800, Brad Campbell wrote:
>> On 01/06/11 06:31, Hugh Dickins wrote:
>>> Brad, my suspicion is that in each case the top 16 bits of RDX have been mysteriously corrupted from ffff to 0000, causing the general protection faults. I don't understand what that has to do with KSM.
>>>
>>> But it's only a suspicion, because I can't make sense of the Code: lines in your traces, they have more than the expected 64 bytes, and only one of them has a (with no) to mark faulting instruction.
>>>
>>> I did try compiling the 2.6.39 kernel from your config, but of course we have different compilers, so although I got close, it wasn't exact.
>>>
>>> Would you mind mailing me privately (it's about 73MB) the objdump -trd output for your original vmlinux (with KSM on)? (Those -trd options are the ones I'm used to typing, I bet they're not all relevant.)
>>>
>>> Of course, it's only a tiny fraction of that output that I need, might be better to cut it down to remove_rmap_item_from_tree and dup_fd and ksm_scan_thread, if you have the time to do so.
>>
>> Would you believe about 20 seconds after I pressed send the kernel oopsed.
>>
>> http://www.fnarfbargle.com/private/003_kernel_oops/
>>
>> oops reproduced here, but an un-munged version is in that directory alongside the kernel.
>>
>> [36542.880228] general protection fault: [#1] SMP
>
> Reminds me of another oops that was reported on the kvm list for 2.6.38.1 with message id 4D8C6110.6090204. There the top 16 bits of rsi were flipped and it was a general protection fault too, because of hitting the non-mappable virtual range.
>
> http://www.virtall.com/files/temp/kvm.txt
> http://www.virtall.com/files/temp/config-2.6.38.1
> http://virtall.com/files/temp/mmu-objdump.txt
>
> That oops happened in kvm_unmap_rmapp though, but it looked like memory corruption (Avi suggested use after free) but it was a production system so we couldn't debug it further.
>
> I recommend, as the next thing, trying to reproduce it again with 2.6.39 or 3.0.0-rc1. Let's fix your scsi trouble if needed but it's better you test with 2.6.39.

Brad, thanks for this and the other further crash, with vmlinux etc: very helpful info.

Andrea, I'm pretty sure you're right to connect Brad's report with the one above.

In four out of five of Brad's reports (cannot tell in the fifth), the bad pointer (with top 16 bits 0000 instead of ffff) had been loaded from SLUB memory at an address offset 0x7f8 (1 case) or 0xff8 (3 cases), i.e. it's the short at 0x7fe or 0xffe that has been zeroed.

No reason to suspect KSM's rmap_item code, or file table handling: they just seem to be the victims of corruption from elsewhere.

I notice %rax and %rsi, the corrupted pointer in your kvm.txt case, is itself a ...7f8 address; and %r13 an ...ff8 address. I've not even glanced at the code, but I wonder if that implies that KVM is close to the origin of the corruption.

I doubt I'll be able to spend more time on this, hope you can take over.

I guess Brad could try SLUB debugging, boot with slub_debug=P for poisoning perhaps; though it might upset alignments and drive the problem underground. Or see if the same happens with SLAB instead of SLUB.

But I rather hope that you or someone will understand the 7fe clue.

Thanks,
Hugh
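The pointer arithmetic behind Hugh's "7fe clue" can be checked with a short script. This is only a sketch under stated assumptions: the helper names are hypothetical, the example pointer is the RDX value from Brad's first trace with an assumed ffff prefix restored, and x86_64 is taken to use 48-bit virtual addresses stored little-endian.

```python
# Sketch only: helper names are hypothetical, and the sample pointer assumes
# the RDX value from the trace originally carried an ffff prefix.

def is_canonical(ptr):
    """True if bits 63..47 are all equal (48-bit x86_64 canonical form)."""
    upper = ptr >> 47  # the 17 bits that must all match
    return upper == 0 or upper == 0x1FFFF


def zero_top_short(ptr):
    """Zero bytes 6 and 7 of the little-endian 8-byte encoding of ptr."""
    raw = bytearray(ptr.to_bytes(8, "little"))
    raw[6:8] = b"\x00\x00"  # these two bytes hold bits 48..63
    return int.from_bytes(raw, "little")


def top_short_offset(load_offset):
    """Offset of the 16-bit word holding bits 48..63 of a pointer
    stored at load_offset."""
    return load_offset + 6


good = 0xFFFF880415418030   # assumed original pointer value
bad = zero_top_short(good)  # what the trace appears to show in RDX

print(hex(bad), is_canonical(good), is_canonical(bad))
print(hex(top_short_offset(0x7F8)), hex(top_short_offset(0xFF8)))
```

Dereferencing a non-canonical address raises a general protection fault rather than a page fault, which matches the reports. An 8-byte pointer loaded from offset 0x7f8 or 0xff8 keeps its top 16 bits at 0x7fe or 0xffe, the last two bytes before a 2KB or 4KB boundary.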
KVM induced panic on 2.6.38[2367] 2.6.39
G'day all,

I'm running a pretty standard home server
x86_64
Phenom-II 6 Core
16GB DDR 3

I run some virtual machines under that. 3 x Debian 64 Bit, 1 x XP 32 Bit. These run at boot.

When I fire up another XP 32 bit instance and play with it for more than about 2 minutes, I get the panics included in this mail.

I've included three of them here. The first and third are as booted. The second was with ksmd disabled just as a data point.

The machine passes every load test and memory test I can throw at it, but I still can't rule out this being a hardware issue.

Provided I don't start this XP VM the machine is quite stable, but running this VM will kill it within minutes.

This was tested with qemu-kvm. The last commit in the git tree was

commit c007db193eb6b2557acb5caf2dc4d7023639e6f3
Author: Avi Kivity a...@redhat.com
Date: Sun May 29 09:00:42 2011 -0400

(I pulled it yesterday)

These panics were captured with netconsole to a remote syslog daemon, and the formatting was ruined, so I've reformatted them by hand prior to posting.

I've tested and reproduced this on 2.6.38.[2,3,6,7] and 2.6.39 obviously.

Can anyone help shed some light on this?

Regards,
Brad

[ 438.632061] general protection fault: [#1] SMP
[ 438.632196] last sysfs file: /sys/module/x_tables/initstate
[ 438.632242] CPU 4
[ 438.632282] Modules linked in: xt_iprange xt_DSCP xt_length xt_CLASSIFY sch_sfq xt_CHECKSUM ipt_REJECT ipt_MASQUERADE ipt_REDIRECT xt_recent xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter xt_TCPMSS xt_tcpmss xt_tcpudp iptable_mangle ip_tables x_tables pppoe pppox ppp_generic slhc cls_u32 sch_htb deflate zlib_deflate des_generic cbc ecb crypto_blkcipher sha1_generic md5 hmac crypto_hash cryptomgr aead crypto_algapi af_key fuse netconsole configfs vhost_net powernow_k8 mperf i2c_nforce2 kvm_amd kvm pl2303 usbserial xhci_hcd k10temp i2c_piix4 usb_storage usb_libusual ohci_hcd ehci_hcd usbcore ahci libahci r8169 mii sata_mv megaraid_sas [last unloaded: scsi_wait_scan]
[ 438.634960]
[ 438.635006] Pid: 551, comm: ksmd Not tainted 2.6.39 #3 To Be Filled By O.E.M. To Be Filled By O.E.M. /880G Extreme3
[ 438.635170] RIP: 0010:[810b4596] [810b4596] remove_rmap_item_from_tree+0x96/0x150
[ 438.635268] RSP: 0018:88041c065e20 EFLAGS: 00010282
[ 438.635314] RAX: 8804153bd8b0 RBX: 8804176c3fc0 RCX: 00057754
[ 438.635362] RDX: 880415418030 RSI: 880414b65003 RDI: ea000dde6030
[ 438.635410] RBP: 880414b65000 R08: 00057755 R09: 1bbde1bd
[ 438.635458] R10: 1c28e0c3 R11: 0002 R12: ea000dde6030
[ 438.635506] R13: 8804176c3f80 R14: 8804151ed7b0 R15: 88041bf23be0
[ 438.63] FS: 7f617772e700() GS:88041fd0() knlGS:
[ 438.635607] CS: 0010 DS: ES: CR0: 8005003b
[ 438.635654] CR2: 00e7 CR3: 01583000 CR4: 06e0
[ 438.635703] DR0: 0045 DR1: DR2:
[ 438.635750] DR3: 0005 DR6: 0ff0 DR7: 0400
[ 438.635799] Process ksmd (pid: 551, threadinfo 88041c064000, task 88041d8caa70)
[ 438.635851] Stack:
[ 438.635893] ea000c43c408 8804176c3fc0 036c 810b58f2
[ 438.636088] 88041c782a00 7fc2d81b5000 88041c064000 003808ff
[ 438.636281] 88041c064000 88041c065e98 88041d8caa70 8804165d4480
[ 438.636479] Call Trace:
[ 438.636528] [810b58f2] ? ksm_scan_thread+0x4e2/0xc20
[ 438.636580] [81052a20] ? wake_up_bit+0x40/0x40
[ 438.636628] [810b5410] ? try_to_merge_with_ksm_page+0x570/0x570
[ 438.636679] [810b5410] ? try_to_merge_with_ksm_page+0x570/0x570
[ 438.636730] [810525b6] ? kthread+0x96/0xa0
[ 438.636781] [813d1794] ? kernel_thread_helper+0x4/0x10
[ 438.636832] [81052520] ? kthread_worker_fn+0x120/0x120
[ 438.636882] [813d1790] ? gs_change+0xb/0xb
[ 438.636926] Code: 28 48 89 ef e8 6c fe ff ff 48 85 c0 49 89 c4 74 d2 f0 0f ba 28 00 19 c0 85 c0 0f 85 ae 00 00 00 00 48 8b 43 30 48 8b 53 38 48 85 c0
[ 438.638504] 89 02 74 04 48 89 50 08 48 ba 00 01 10 00 00 00 00 ad de 48 b8
[ 438.639329] RIP [810b4596] remove_rmap_item_from_tree+0x96/0x150
[ 438.639414] RSP 88041c065e20
[ 438.639460] ---[ end trace c29fb871f6b874e3 ]---
[ 438.639506] Kernel panic - not syncing: Fatal exception
[ 438.639553] Pid: 551, comm: ksmd Tainted: G D 2.6.39 #3
[ 438.639598] Call Trace:
[ 438.639644] [813cd6f5] ? panic+0x92/0x18a
[ 438.639693] [81038b61] ? kmsg_dump+0x41/0xf0
[ 438.639745] [810050ad] ? oops_end+0x8d/0xa0
[ 438.639794] [813d05ef] ? general_protection+0x1f/0x30
[ 438.639842] [810b4596] ? remove_rmap_item_from_tree+0x96/0x150
[ 438.639891] [810b58f2] ? ksm_scan_thread+0x4e2/0xc20
[ 438.639941]
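Brad mentions that the netconsole capture lost its formatting and had to be re-broken by hand. A minimal sketch of doing that mechanically follows, assuming every record begins with a printk-style timestamp such as "[ 438.632061]". The names TIMESTAMP and split_oops are hypothetical, and this is not the procedure Brad actually used.

```python
import re

# Matches a printk timestamp like "[  438.632061]". Bracketed addresses such
# as "[810b4596]" and markers such as "[#1]" contain no decimal point, so
# they are not treated as record boundaries.
TIMESTAMP = re.compile(r"\[\s*\d+\.\d+\]")


def split_oops(flat):
    """Split a flattened oops into one string per timestamped record."""
    starts = [m.start() for m in TIMESTAMP.finditer(flat)]
    if not starts:
        return [flat.strip()] if flat.strip() else []
    records = []
    head = flat[:starts[0]].strip()
    if head:
        records.append(head)  # keep any text before the first timestamp
    ends = starts[1:] + [len(flat)]
    for begin, end in zip(starts, ends):
        records.append(flat[begin:end].strip())
    return records


sample = ("[ 438.632061] general protection fault: [#1] SMP "
          "[ 438.632242] CPU 4 [ 438.634960] "
          "[ 438.635006] Pid: 551, comm: ksmd")
for record in split_oops(sample):
    print(record)
```

Records that netconsole itself split, such as the two-part Code: line, stay as separate records. Rejoining them would need knowledge of the oops layout that this sketch does not attempt.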
Re: KVM induced panic on 2.6.38[2367] 2.6.39
Looks like a KSM issue. Disabling CONFIG_KSM should at least stop your machine from oopsing.

Adding linux-mm.

On Tue, May 31, 2011 at 09:24:03AM +0800, Brad Campbell wrote:
> G'day all,
>
> I'm running a pretty standard home server
> x86_64
> Phenom-II 6 Core
> 16GB DDR 3
>
> I run some virtual machines under that. 3 x Debian 64 Bit, 1 x XP 32 Bit. These run at boot.
>
> When I fire up another XP 32 bit instance and play with it for more than about 2 minutes, I get the panics included in this mail.
>
> I've included three of them here. The first and third are as booted. The second was with ksmd disabled just as a data point.
>
> The machine passes every load test and memory test I can throw at it, but I still can't rule out this being a hardware issue.
>
> Provided I don't start this XP VM the machine is quite stable, but running this VM will kill it within minutes.
>
> This was tested with qemu-kvm. The last commit in the git tree was
>
> commit c007db193eb6b2557acb5caf2dc4d7023639e6f3
> Author: Avi Kivity a...@redhat.com
> Date: Sun May 29 09:00:42 2011 -0400
>
> (I pulled it yesterday)
>
> These panics were captured with netconsole to a remote syslog daemon, and the formatting was ruined, so I've reformatted them by hand prior to posting.
>
> I've tested and reproduced this on 2.6.38.[2,3,6,7] and 2.6.39 obviously.
>
> Can anyone help shed some light on this?
>
> Regards,
> Brad
>
> [ 438.632061] general protection fault: [#1] SMP
> [ 438.632196] last sysfs file: /sys/module/x_tables/initstate
> [ 438.632242] CPU 4
> [ 438.632282] Modules linked in: xt_iprange xt_DSCP xt_length xt_CLASSIFY sch_sfq xt_CHECKSUM ipt_REJECT ipt_MASQUERADE ipt_REDIRECT xt_recent xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter xt_TCPMSS xt_tcpmss xt_tcpudp iptable_mangle ip_tables x_tables pppoe pppox ppp_generic slhc cls_u32 sch_htb deflate zlib_deflate des_generic cbc ecb crypto_blkcipher sha1_generic md5 hmac crypto_hash cryptomgr aead crypto_algapi af_key fuse netconsole configfs vhost_net powernow_k8 mperf i2c_nforce2 kvm_amd kvm pl2303 usbserial xhci_hcd k10temp i2c_piix4 usb_storage usb_libusual ohci_hcd ehci_hcd usbcore ahci libahci r8169 mii sata_mv megaraid_sas [last unloaded: scsi_wait_scan]
> [ 438.634960]
> [ 438.635006] Pid: 551, comm: ksmd Not tainted 2.6.39 #3 To Be Filled By O.E.M. To Be Filled By O.E.M. /880G Extreme3
> [ 438.635170] RIP: 0010:[810b4596] [810b4596] remove_rmap_item_from_tree+0x96/0x150
> [ 438.635268] RSP: 0018:88041c065e20 EFLAGS: 00010282
> [ 438.635314] RAX: 8804153bd8b0 RBX: 8804176c3fc0 RCX: 00057754
> [ 438.635362] RDX: 880415418030 RSI: 880414b65003 RDI: ea000dde6030
> [ 438.635410] RBP: 880414b65000 R08: 00057755 R09: 1bbde1bd
> [ 438.635458] R10: 1c28e0c3 R11: 0002 R12: ea000dde6030
> [ 438.635506] R13: 8804176c3f80 R14: 8804151ed7b0 R15: 88041bf23be0
> [ 438.63] FS: 7f617772e700() GS:88041fd0() knlGS:
> [ 438.635607] CS: 0010 DS: ES: CR0: 8005003b
> [ 438.635654] CR2: 00e7 CR3: 01583000 CR4: 06e0
> [ 438.635703] DR0: 0045 DR1: DR2:
> [ 438.635750] DR3: 0005 DR6: 0ff0 DR7: 0400
> [ 438.635799] Process ksmd (pid: 551, threadinfo 88041c064000, task 88041d8caa70)
> [ 438.635851] Stack:
> [ 438.635893] ea000c43c408 8804176c3fc0 036c 810b58f2
> [ 438.636088] 88041c782a00 7fc2d81b5000 88041c064000 003808ff
> [ 438.636281] 88041c064000 88041c065e98 88041d8caa70 8804165d4480
> [ 438.636479] Call Trace:
> [ 438.636528] [810b58f2] ? ksm_scan_thread+0x4e2/0xc20
> [ 438.636580] [81052a20] ? wake_up_bit+0x40/0x40
> [ 438.636628] [810b5410] ? try_to_merge_with_ksm_page+0x570/0x570
> [ 438.636679] [810b5410] ? try_to_merge_with_ksm_page+0x570/0x570
> [ 438.636730] [810525b6] ? kthread+0x96/0xa0
> [ 438.636781] [813d1794] ? kernel_thread_helper+0x4/0x10
> [ 438.636832] [81052520] ? kthread_worker_fn+0x120/0x120
> [ 438.636882] [813d1790] ? gs_change+0xb/0xb
> [ 438.636926] Code: 28 48 89 ef e8 6c fe ff ff 48 85 c0 49 89 c4 74 d2 f0 0f ba 28 00 19 c0 85 c0 0f 85 ae 00 00 00 00 48 8b 43 30 48 8b 53 38 48 85 c0
> [ 438.638504] 89 02 74 04 48 89 50 08 48 ba 00 01 10 00 00 00 00 ad de 48 b8
> [ 438.639329] RIP [810b4596] remove_rmap_item_from_tree+0x96/0x150
> [ 438.639414] RSP 88041c065e20
> [ 438.639460] ---[ end trace c29fb871f6b874e3 ]---
> [ 438.639506] Kernel panic - not syncing: Fatal exception
> [ 438.639553] Pid: 551, comm: ksmd Tainted: G D 2.6.39 #3
> [ 438.639598] Call Trace:
> [ 438.639644] [813cd6f5] ? panic+0x92/0x18a
> [ 438.639693] [81038b61] ? kmsg_dump+0x41/0xf0
> [ 438.639745] [810050ad] ?