Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-08-22 Thread Avi Kivity

On 08/20/2011 04:16 PM, Brad Campbell wrote:

Author: Alexander Duyck alexander.h.du...@intel.com
Date:   Thu Jul 1 13:28:27 2010 +

x86: Drop CONFIG_MCORE2 check around setting of NET_IP_ALIGN

This patch removes the CONFIG_MCORE2 check from around NET_IP_ALIGN.  It is
based on a suggestion from Andi Kleen.  The assumption is that there are
not any x86 cores where unaligned access is really slow, and this change
would allow for a performance improvement to still exist on configurations
that are not necessarily optimized for Core 2.

Cc: Andi Kleen a...@linux.intel.com
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@redhat.com
Cc: H. Peter Anvin h...@zytor.com
Cc: x...@kernel.org
Signed-off-by: Alexander Duyck alexander.h.du...@intel.com
Signed-off-by: Jeff Kirsher jeffrey.t.kirs...@intel.com
Acked-by: H. Peter Anvin h...@zytor.com
Signed-off-by: David S. Miller da...@davemloft.net

:040000 040000 5a15867789080a2f67a74b17c4422f85b7a9fb4a b98769348bd765731ca3ff03b33764257e23226c M	arch


I can confirm this bug exists in the 3.0 kernel; however, I'm unable to
reproduce it on today's git.


So anyone using netfilter, kvm and bridge on kernels between 
2.6.36-rc1 and 3.0 may hit this bug, but it looks like it is fixed in 
the current 3.1-rc kernels.




Thanks for this effort.  I don't think this patch is buggy in itself; it
merely exposed another bug which was fixed later on.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-08-22 Thread Eric Dumazet
Le lundi 22 août 2011 à 09:36 +0300, Avi Kivity a écrit :
 On 08/20/2011 04:16 PM, Brad Campbell wrote:
  Author: Alexander Duyck alexander.h.du...@intel.com
  Date:   Thu Jul 1 13:28:27 2010 +
 
  x86: Drop CONFIG_MCORE2 check around setting of NET_IP_ALIGN
 
  This patch removes the CONFIG_MCORE2 check from around NET_IP_ALIGN.  It is
  based on a suggestion from Andi Kleen.  The assumption is that there are
  not any x86 cores where unaligned access is really slow, and this change
  would allow for a performance improvement to still exist on configurations
  that are not necessarily optimized for Core 2.
 
  Cc: Andi Kleen a...@linux.intel.com
  Cc: Thomas Gleixner t...@linutronix.de
  Cc: Ingo Molnar mi...@redhat.com
  Cc: H. Peter Anvin h...@zytor.com
  Cc: x...@kernel.org
  Signed-off-by: Alexander Duyck alexander.h.du...@intel.com
  Signed-off-by: Jeff Kirsher jeffrey.t.kirs...@intel.com
  Acked-by: H. Peter Anvin h...@zytor.com
  Signed-off-by: David S. Miller da...@davemloft.net
 
  :040000 040000 5a15867789080a2f67a74b17c4422f85b7a9fb4a b98769348bd765731ca3ff03b33764257e23226c M	arch
 
  I can confirm this bug exists in the 3.0 kernel; however, I'm unable to
  reproduce it on today's git.
 
  So anyone using netfilter, kvm and bridge on kernels between 
  2.6.36-rc1 and 3.0 may hit this bug, but it looks like it is fixed in 
  the current 3.1-rc kernels.
 
 
 Thanks for this effort.  I don't think this patch is buggy in itself, it 
 merely exposed another bug which was fixed later on.
 

Some piece of hardware has a 2-byte offset requirement, and its driver
incorrectly assumed NET_IP_ALIGN was 2 on x86.

Brad, could you post your config (lsmod, dmesg) again ?

tg3.c code for example uses a private value, not related to NET_IP_ALIGN

#define TG3_RAW_IP_ALIGN 2
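
To illustrate the kind of driver bug described above, here is a minimal
userspace sketch. It assumes a 14-byte Ethernet header and a hypothetical NIC
that always writes received frames at a fixed 2-byte offset, while the driver
sizes its buffer from NET_IP_ALIGN. HW_RX_OFFSET, rx_buf_len() and
rx_overrun() are made-up names for illustration only; they are not taken from
tg3 or any real driver.

```c
#include <assert.h>
#include <stddef.h>

#define ETH_HLEN      14  /* Ethernet header length in bytes */
#define HW_RX_OFFSET   2  /* hypothetical: offset the NIC always applies */

/* Offset of the IP header within the buffer when `pad` bytes are reserved. */
size_t ip_header_offset(size_t pad)
{
    return pad + ETH_HLEN;
}

/* Buffer length a driver would allocate if it trusted NET_IP_ALIGN. */
size_t rx_buf_len(size_t net_ip_align, size_t max_frame)
{
    return net_ip_align + max_frame;
}

/* Bytes written past the end of that buffer by a full-size frame. */
size_t rx_overrun(size_t net_ip_align, size_t max_frame)
{
    size_t len = rx_buf_len(net_ip_align, max_frame);
    size_t end = HW_RX_OFFSET + max_frame;  /* where the hardware stops */

    assert(max_frame > 0);
    return end > len ? end - len : 0;
}
```

With NET_IP_ALIGN at 2, rx_overrun(2, 1514) is 0. With NET_IP_ALIGN at 0,
rx_overrun(0, 1514) is 2, so two bytes land past the end of the buffer. This
is only one way such a bug could look, not a diagnosis of this report; a
private constant like the tg3 one avoids the dependency altogether.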





Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-08-20 Thread Brad Campbell

On 07/06/11 21:37, Eric Dumazet wrote:

Le mardi 07 juin 2011 à 21:27 +0800, Brad Campbell a écrit :

On 07/06/11 04:22, Eric Dumazet wrote:


Could you please try latest linux-2.6 tree ?

We fixed many networking bugs that could explain your crash.






No good I'm afraid.

[  543.040056]
=
[  543.040136] BUG ip_dst_cache: Padding overwritten.
0x8803e4217ffe-0x8803e4217fff
[  543.040194]


That's pretty strange: these are the last two bytes of a page, set to
0x0000 (a 16-bit value).

There is no way a dst field could actually sit at this location (it's
padding), since a dst is a bit less than 256 bytes (0xe8), and each
entry is aligned on a 64-byte address.

grep dst /proc/slabinfo

ip_dst_cache       32823  62944    256   32    2 : tunables    0    0    0 : slabdata   1967   1967      0

sizeof(struct rtable)=0xe8



-
[  543.040198]
[  543.040298] INFO: Slab 0xea000d9e74d0 objects=25 used=25 fp=0x
 (null) flags=0x80004081
[  543.040364] Pid: 4576, comm: kworker/1:2 Not tainted 3.0.0-rc2 #1
[  543.040415] Call Trace:
[  543.040472]  [810b9c1d] ? slab_err+0xad/0xd0
[  543.040528]  [8102e034] ? check_preempt_wakeup+0xa4/0x160
[  543.040595]  [810ba206] ? slab_pad_check+0x126/0x170
[  543.040650]  [8133045b] ? dst_destroy+0x8b/0x110
[  543.040701]  [810ba29a] ? check_slab+0x4a/0xc0
[  543.040753]  [810baf2d] ? free_debug_processing+0x2d/0x250
[  543.040808]  [810bb27b] ? __slab_free+0x12b/0x140
[  543.040862]  [810bbe99] ? kmem_cache_free+0x99/0xa0
[  543.040915]  [8133045b] ? dst_destroy+0x8b/0x110
[  543.040967]  [813307f6] ? dst_gc_task+0x196/0x1f0
[  543.041021]  [8104e954] ? queue_delayed_work_on+0x154/0x160
[  543.041081]  [813066fe] ? do_dbs_timer+0x20e/0x3d0
[  543.041133]  [81330660] ? dst_alloc+0x180/0x180
[  543.041187]  [8104f28b] ? process_one_work+0xfb/0x3b0
[  543.041242]  [8104f964] ? worker_thread+0x144/0x3d0
[  543.041296]  [8102cc10] ? __wake_up_common+0x50/0x80
[  543.041678]  [8104f820] ? rescuer_thread+0x2e0/0x2e0
[  543.041729]  [8104f820] ? rescuer_thread+0x2e0/0x2e0
[  543.041782]  [81053436] ? kthread+0x96/0xa0
[  543.041835]  [813e1d14] ? kernel_thread_helper+0x4/0x10
[  543.041890]  [810533a0] ? kthread_worker_fn+0x120/0x120
[  543.041944]  [813e1d10] ? gs_change+0xb/0xb
[  543.041993]  Padding 0x8803e4217f40:  5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 
[  543.042718]  Padding 0x8803e4217f50:  5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 
[  543.043433]  Padding 0x8803e4217f60:  5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 
[  543.044155]  Padding 0x8803e4217f70:  5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 
[  543.044866]  Padding 0x8803e4217f80:  5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 
[  543.045590]  Padding 0x8803e4217f90:  5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 
[  543.046311]  Padding 0x8803e4217fa0:  5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 
[  543.047034]  Padding 0x8803e4217fb0:  5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 
[  543.047755]  Padding 0x8803e4217fc0:  5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 
[  543.048474]  Padding 0x8803e4217fd0:  5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 
[  543.049203]  Padding 0x8803e4217fe0:  5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 
[  543.049909]  Padding 0x8803e4217ff0:  5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 00 00 ZZ..
[  543.050021] FIX ip_dst_cache: Restoring
0x8803e4217f40-0x8803e4217fff=0x5a
[  543.050021]

Dropped -mm, Hugh and Andrea from CC as this does not appear to be mm or
ksm related.

I'll pare down the firewall and see if I can make it break easier with a
smaller test set.


Hmm, not sure now :(

Could you reproduce another bug please ?


I know this is an old one, but I recently purchased a second system to 
allow me to test and bisect this off-line (the live system is too much 
of a headache to bisect on).


brad@test:/raid10/src/linux-2.6$ git bisect log
git bisect start
# good: [9fe6206f400646a2322096b56c59891d530e8d51] Linux 2.6.35
git bisect good 9fe6206f400646a2322096b56c59891d530e8d51
# bad: [da5cabf80e2433131bf0ed8993abc0f7ea618c73] Linux 2.6.36-rc1
git bisect bad da5cabf80e2433131bf0ed8993abc0f7ea618c73
# bad: [0f477dd0851bdcee82923da66a7fc4a44cb1bc3d] Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
git bisect bad 0f477dd0851bdcee82923da66a7fc4a44cb1bc3d
# bad: 

Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-12 Thread Avi Kivity

On 06/10/2011 05:52 AM, Simon Horman wrote:

At one point I would have needed an 8000km long wire to the reset switch :-)


Even more off-topic, there has been a case when a 200,000,000 km long 
wire to the reset button was needed.  IIRC they got away with a watchdog.


--
error compiling committee.c: too many arguments to function



Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-10 Thread Mark Lord
On 11-06-09 10:52 PM, Simon Horman wrote:
 On Thu, Jun 09, 2011 at 01:02:13AM +0800, Brad Campbell wrote:
 On 08/06/11 11:59, Eric Dumazet wrote:

 Well, a bisection definitely should help, but needs a lot of time in
 your case.

 Yes. compile, test, crash, walk out to the other building to press
 reset, lather, rinse, repeat.

 I need a reset button on the end of a 50M wire, or a hardware watchdog!


Something many of us don't realize is that nearly all Intel chipsets
have a built-in hardware watchdog timer.  This includes chipsets for
consumer desktop boards as well as the big iron server stuff.

The i8xx_tco driver in the kernel enables use of them:

modprobe i8xx_tco

Cheers


Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-10 Thread Henrique de Moraes Holschuh
On Fri, 10 Jun 2011, Mark Lord wrote:
 Something many of us don't realize is that nearly all Intel chipsets
 have a built-in hardware watchdog timer.  This includes chipset for
 consumer desktop boards as well as the big iron server stuff.
 
 It's the i8xx_tco driver in the kernel enables use of them:

That's the old module name, but yes, it is very useful in desktops and
laptops (when it works).   Server-class hardware will have a baseboard
management unit that can really power-cycle the system instead of just
rebooting.

And test it first before you depend on it triggering at a remote location,
as the firmware might cause the Intel chipset watchdog to actually hang the
box instead of causing a proper reboot (happens on the IBM thinkpad T43, for
example).

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh


Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-09 Thread Simon Horman
On Thu, Jun 09, 2011 at 01:02:13AM +0800, Brad Campbell wrote:
 On 08/06/11 11:59, Eric Dumazet wrote:
 
 Well, a bisection definitely should help, but needs a lot of time in
 your case.
 
 Yes. compile, test, crash, walk out to the other building to press
 reset, lather, rinse, repeat.
 
 I need a reset button on the end of a 50M wire, or a hardware watchdog!

Not strictly on-topic, but in situations where I have machines
that either don't have lights-out facilities or have broken ones,
I find network-controlled power switches very useful.

At one point I would have needed an 8000km long wire to the reset switch :-)


Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-08 Thread Brad Campbell

On 08/06/11 11:59, Eric Dumazet wrote:


Well, a bisection definitely should help, but needs a lot of time in
your case.


Yes. compile, test, crash, walk out to the other building to press 
reset, lather, rinse, repeat.


I need a reset button on the end of a 50M wire, or a hardware watchdog!

Actually it's not so bad. If I turn off slub debugging the kernel panics 
and reboots itself.


This.. :
[2.913034] netconsole: remote ethernet address 00:16:cb:a7:dd:d1
[2.913066] netconsole: device eth0 not up yet, forcing it
[3.660062] Refined TSC clocksource calibration: 3213.422 MHz.
[3.660118] Switching to clocksource tsc
[   63.200273] r8169 :03:00.0: eth0: unable to load firmware patch rtl_nic/rtl8168e-1.fw (-2)
[   63.223513] r8169 :03:00.0: eth0: link down
[   63.223556] r8169 :03:00.0: eth0: link down

..is slowing down reboots considerably. 3.0-rc does _not_ like some 
timing hardware in my machine. Having said that, at least it does not 
randomly panic on SCSI like 2.6.39 does.


Ok, I've ruled out TCPMSS. Found out where it was being set and neutered 
it. I've replicated it with only the single DNAT rule.




Could you try following patch, because this is the 'usual suspect' I had
yesterday :

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 46cbd28..9f548f9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -792,6 +792,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 		fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta;
 	}
 
+#if 0
 	if (fastpath &&
 	    size + sizeof(struct skb_shared_info) <= ksize(skb->head)) {
 		memmove(skb->head + size, skb_shinfo(skb),
@@ -802,7 +803,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 		off = nhead;
 		goto adjust_others;
 	}
-
+#endif
 	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
 	if (!data)
 		goto nodata;
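
For readers unfamiliar with the code being disabled, here is a rough userspace
stand-in for the test that the `#if 0` switches off. The function name
can_expand_in_place() and the `usable` parameter are assumptions made for this
sketch; in the kernel the last value comes from ksize(skb->head), i.e. the
real size of the allocation, which may exceed what was requested.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true when the head buffer could be grown in place: the buffer is
 * not shared (fastpath) and the new data size plus the trailing shared-info
 * block still fits in the memory the allocator actually handed out.
 */
bool can_expand_in_place(bool fastpath, size_t size,
                         size_t shinfo_size, size_t usable)
{
    assert(shinfo_size > 0);
    return fastpath && size + shinfo_size <= usable;
}
```

With the patch applied, this shortcut is never taken and every call falls
through to the kmalloc() path, which is why it works as a quick way to rule
the in-place path in or out. The sizes used to exercise the function
(for example a 320-byte shared-info block) are illustrative only.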





Nope.. that's not it. <sigh> That might have changed the characteristic
of the fault slightly, but unfortunately I got caught with a couple of
fsck's, so I only got to test it 3 times tonight.


It's unfortunate that this is a production system, so I can only take it 
down between about 9pm and 1am. That would normally be pretty 
productive, except that an fsck of a 14TB ext4 can take 30 minutes if it 
panics at the wrong time.


I'm out of time tonight, but I'll have a crack at some bisection 
tomorrow night. Now I just have to go back far enough that it works, and 
be near enough not to have to futz around with /proc /sys or drivers.
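
As a back-of-the-envelope aid for planning the bisection, the sketch below
estimates how many build-and-test rounds a plain halving search needs and how
many nights that takes. The commit count passed in and the rounds-per-night
figure are assumptions the caller supplies; real git bisect runs can differ
slightly because history is not linear.

```c
#include <assert.h>

/* Worst-case number of rounds to narrow n candidate commits down to one. */
unsigned int bisect_steps(unsigned long n)
{
    unsigned int steps = 0;

    assert(n > 0);
    /* Each round keeps at most half of the candidates (rounded up). */
    while (n > 1) {
        n = (n + 1) / 2;
        steps++;
    }
    return steps;
}

/* Nights needed when only `per_night` rounds fit in the downtime window. */
unsigned int nights_needed(unsigned int steps, unsigned int per_night)
{
    assert(per_night > 0);
    return (steps + per_night - 1) / per_night;
}
```

For a hypothetical range of 10000 commits, bisect_steps(10000) gives 14
rounds; at three tests per night, as managed tonight, nights_needed(14, 3)
gives 5 nights.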


I really, really, really appreciate you guys helping me with this. It 
has been driving me absolutely bonkers. If I'm ever in the same town as 
any of you, dinner and drinks are on me.



Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-08 Thread Eric Dumazet
Le jeudi 09 juin 2011 à 01:02 +0800, Brad Campbell a écrit :
 On 08/06/11 11:59, Eric Dumazet wrote:
 
  Well, a bisection definitely should help, but needs a lot of time in
  your case.
 
 Yes. compile, test, crash, walk out to the other building to press 
 reset, lather, rinse, repeat.
 
 I need a reset button on the end of a 50M wire, or a hardware watchdog!
 
 Actually it's not so bad. If I turn off slub debugging the kernel panics 
 and reboots itself.
 
 This.. :
 [2.913034] netconsole: remote ethernet address 00:16:cb:a7:dd:d1
 [2.913066] netconsole: device eth0 not up yet, forcing it
 [3.660062] Refined TSC clocksource calibration: 3213.422 MHz.
 [3.660118] Switching to clocksource tsc
 [   63.200273] r8169 :03:00.0: eth0: unable to load firmware patch 
 rtl_nic/rtl8168e-1.fw (-2)
 [   63.223513] r8169 :03:00.0: eth0: link down
 [   63.223556] r8169 :03:00.0: eth0: link down
 
 ..is slowing down reboots considerably. 3.0-rc does _not_ like some 
 timing hardware in my machine. Having said that, at least it does not 
 randomly panic on SCSI like 2.6.39 does.
 
 Ok, I've ruled out TCPMSS. Found out where it was being set and neutered 
 it. I've replicated it with only the single DNAT rule.
 
 
  Could you try following patch, because this is the 'usual suspect' I had
  yesterday :
 
  diff --git a/net/core/skbuff.c b/net/core/skbuff.c
  index 46cbd28..9f548f9 100644
  --- a/net/core/skbuff.c
  +++ b/net/core/skbuff.c
  @@ -792,6 +792,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, 
  int ntail,
  fastpath = atomic_read(skb_shinfo(skb)-dataref) == delta;
  }
 
  +#if 0
  if (fastpath
  size + sizeof(struct skb_shared_info)= ksize(skb-head)) {
  memmove(skb-head + size, skb_shinfo(skb),
  @@ -802,7 +803,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, 
  int ntail,
  off = nhead;
  goto adjust_others;
  }
  -
  +#endif
  data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
  if (!data)
  goto nodata;
 
 
 
 
 Nope.. that's not it. sigh That might have changed the characteristic 
 of the fault slightly, but unfortunately I got caught with a couple of 
 fsck's, so I only got to test it 3 times tonight.
 
 It's unfortunate that this is a production system, so I can only take it 
 down between about 9pm and 1am. That would normally be pretty 
 productive, except that an fsck of a 14TB ext4 can take 30 minutes if it 
 panics at the wrong time.
 
 I'm out of time tonight, but I'll have a crack at some bisection 
 tomorrow night. Now I just have to go back far enough that it works, and 
 be near enough not to have to futz around with /proc /sys or drivers.
 
 I really, really, really appreciate you guys helping me with this. It 
 has been driving me absolutely bonkers. If I'm ever in the same town as 
 any of you, dinner and drinks are on me.

Hmm, I wonder if kmemcheck could help you, but it's slow as hell, so not
appropriate for production :(





Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-07 Thread Brad Campbell

On 07/06/11 04:22, Eric Dumazet wrote:


Could you please try latest linux-2.6 tree ?

We fixed many networking bugs that could explain your crash.






No good I'm afraid.

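[  543.040056] =============================================================================
[  543.040136] BUG ip_dst_cache: Padding overwritten. 0x8803e4217ffe-0x8803e4217fff
[  543.040194] -----------------------------------------------------------------------------
[  543.040198]
[  543.040298] INFO: Slab 0xea000d9e74d0 objects=25 used=25 fp=0x          (null) flags=0x80004081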

[  543.040364] Pid: 4576, comm: kworker/1:2 Not tainted 3.0.0-rc2 #1
[  543.040415] Call Trace:
[  543.040472]  [810b9c1d] ? slab_err+0xad/0xd0
[  543.040528]  [8102e034] ? check_preempt_wakeup+0xa4/0x160
[  543.040595]  [810ba206] ? slab_pad_check+0x126/0x170
[  543.040650]  [8133045b] ? dst_destroy+0x8b/0x110
[  543.040701]  [810ba29a] ? check_slab+0x4a/0xc0
[  543.040753]  [810baf2d] ? free_debug_processing+0x2d/0x250
[  543.040808]  [810bb27b] ? __slab_free+0x12b/0x140
[  543.040862]  [810bbe99] ? kmem_cache_free+0x99/0xa0
[  543.040915]  [8133045b] ? dst_destroy+0x8b/0x110
[  543.040967]  [813307f6] ? dst_gc_task+0x196/0x1f0
[  543.041021]  [8104e954] ? queue_delayed_work_on+0x154/0x160
[  543.041081]  [813066fe] ? do_dbs_timer+0x20e/0x3d0
[  543.041133]  [81330660] ? dst_alloc+0x180/0x180
[  543.041187]  [8104f28b] ? process_one_work+0xfb/0x3b0
[  543.041242]  [8104f964] ? worker_thread+0x144/0x3d0
[  543.041296]  [8102cc10] ? __wake_up_common+0x50/0x80
[  543.041678]  [8104f820] ? rescuer_thread+0x2e0/0x2e0
[  543.041729]  [8104f820] ? rescuer_thread+0x2e0/0x2e0
[  543.041782]  [81053436] ? kthread+0x96/0xa0
[  543.041835]  [813e1d14] ? kernel_thread_helper+0x4/0x10
[  543.041890]  [810533a0] ? kthread_worker_fn+0x120/0x120
[  543.041944]  [813e1d10] ? gs_change+0xb/0xb
[  543.041993]  Padding 0x8803e4217f40:  5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[  543.042718]  Padding 0x8803e4217f50:  5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[  543.043433]  Padding 0x8803e4217f60:  5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[  543.044155]  Padding 0x8803e4217f70:  5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[  543.044866]  Padding 0x8803e4217f80:  5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[  543.045590]  Padding 0x8803e4217f90:  5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[  543.046311]  Padding 0x8803e4217fa0:  5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[  543.047034]  Padding 0x8803e4217fb0:  5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[  543.047755]  Padding 0x8803e4217fc0:  5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[  543.048474]  Padding 0x8803e4217fd0:  5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[  543.049203]  Padding 0x8803e4217fe0:  5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
[  543.049909]  Padding 0x8803e4217ff0:  5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 00 00 ZZ..
[  543.050021] FIX ip_dst_cache: Restoring 0x8803e4217f40-0x8803e4217fff=0x5a
[  543.050021]

Dropped -mm, Hugh and Andrea from CC as this does not appear to be mm or 
ksm related.


I'll pare down the firewall and see if I can make it break easier with a 
smaller test set.



Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-07 Thread Patrick McHardy
On 07.06.2011 05:33, Brad Campbell wrote:
 On 07/06/11 04:10, Bart De Schuymer wrote:
 Hi Brad,

 This has probably nothing to do with ebtables, so please rmmod in case
 it's loaded.
 A few questions I didn't directly see an answer to in the threads I
 scanned...
 I'm assuming you actually use the bridging firewall functionality. So,
 what iptables modules do you use? Can you reduce your iptables rules to
 a core that triggers the bug?
 Or does it get triggered even with an empty set of firewall rules?
 Are you using a stock .35 kernel or is it patched?
 Is this something I can trigger on a poor guy's laptop or does it
 require specialized hardware (I'm catching up on qemu/kvm...)?
 
 Not specialised hardware as such, I've just not been able to reproduce
 it outside of this specific operating scenario.

The last similar problem we've had was related to the 32/64 bit compat
code. Are you running 32 bit userspace on a 64 bit kernel?

 I can't trigger it with empty firewall rules as it relies on a DNAT to
 occur. If I try it directly to the internal IP address (as I have to
 without netfilter loaded) then of course nothing fails.
 
 It's a pain in the bum as a fault, but it's one I can easily reproduce
 as long as I use the same set of circumstances.
 
 I'll try using 3.0-rc2 (current git) tonight, and if I can reproduce it
 on that then I'll attempt to pare down the IPTABLES rules to a bare
 minimum.
 
 It is nothing to do with ebtables as I don't compile it. I'm not really
 sure about bridging firewall functionality. I just use a couple of
 hand coded bash scripts to set the tables up.

From one of your previous mails:

 # CONFIG_BRIDGE_NF_EBTABLES is not set

How about CONFIG_BRIDGE_NETFILTER?


Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-07 Thread Eric Dumazet
Le mardi 07 juin 2011 à 21:27 +0800, Brad Campbell a écrit :
 On 07/06/11 04:22, Eric Dumazet wrote:
 
  Could you please try latest linux-2.6 tree ?
 
  We fixed many networking bugs that could explain your crash.
 
 
 
 
 
 No good I'm afraid.
 
 [  543.040056] 
 =
 [  543.040136] BUG ip_dst_cache: Padding overwritten. 
 0x8803e4217ffe-0x8803e4217fff
 [  543.040194] 

That's pretty strange: these are the last two bytes of a page, set to
0x0000 (a 16-bit value).

There is no way a dst field could actually sit at this location (it's
padding), since a dst is a bit less than 256 bytes (0xe8), and each
entry is aligned on a 64-byte address.

grep dst /proc/slabinfo 

ip_dst_cache       32823  62944    256   32    2 : tunables    0    0    0 : slabdata   1967   1967      0

sizeof(struct rtable)=0xe8
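
The arithmetic behind this observation can be checked with a small sketch.
It uses the addresses exactly as printed in the report (only the low bits
matter here) and assumes 4 KiB pages and an 8 KiB, two-page slab; the helper
names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES 4096u  /* assumed page size */

/* Offset of an address within its page. */
unsigned int page_offset(uint64_t addr)
{
    return (unsigned int)(addr & (PAGE_BYTES - 1));
}

/* Length in bytes of an inclusive address range. */
uint64_t range_len(uint64_t first, uint64_t last)
{
    assert(last >= first);
    return last - first + 1;
}

/* Per-object footprint if `objects` objects fill the slab up to the padding. */
uint64_t object_footprint(uint64_t slab_bytes, uint64_t padding,
                          unsigned int objects)
{
    assert(objects > 0 && padding <= slab_bytes);
    return (slab_bytes - padding) / objects;
}
```

The overwritten range 0x8803e4217ffe-0x8803e4217fff sits at page offsets
0xffe and 0xfff, the last two bytes of the page. The restored padding range
0x8803e4217f40-0x8803e4217fff is 192 bytes long, and under the two-page
assumption 25 objects would each occupy (8192 - 192) / 25 = 320 bytes, so
the damaged bytes lie well past the last object, in slab padding.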


 -
 [  543.040198]
 [  543.040298] INFO: Slab 0xea000d9e74d0 objects=25 used=25 fp=0x 
 (null) flags=0x80004081
 [  543.040364] Pid: 4576, comm: kworker/1:2 Not tainted 3.0.0-rc2 #1
 [  543.040415] Call Trace:
 [  543.040472]  [810b9c1d] ? slab_err+0xad/0xd0
 [  543.040528]  [8102e034] ? check_preempt_wakeup+0xa4/0x160
 [  543.040595]  [810ba206] ? slab_pad_check+0x126/0x170
 [  543.040650]  [8133045b] ? dst_destroy+0x8b/0x110
 [  543.040701]  [810ba29a] ? check_slab+0x4a/0xc0
 [  543.040753]  [810baf2d] ? free_debug_processing+0x2d/0x250
 [  543.040808]  [810bb27b] ? __slab_free+0x12b/0x140
 [  543.040862]  [810bbe99] ? kmem_cache_free+0x99/0xa0
 [  543.040915]  [8133045b] ? dst_destroy+0x8b/0x110
 [  543.040967]  [813307f6] ? dst_gc_task+0x196/0x1f0
 [  543.041021]  [8104e954] ? queue_delayed_work_on+0x154/0x160
 [  543.041081]  [813066fe] ? do_dbs_timer+0x20e/0x3d0
 [  543.041133]  [81330660] ? dst_alloc+0x180/0x180
 [  543.041187]  [8104f28b] ? process_one_work+0xfb/0x3b0
 [  543.041242]  [8104f964] ? worker_thread+0x144/0x3d0
 [  543.041296]  [8102cc10] ? __wake_up_common+0x50/0x80
 [  543.041678]  [8104f820] ? rescuer_thread+0x2e0/0x2e0
 [  543.041729]  [8104f820] ? rescuer_thread+0x2e0/0x2e0
 [  543.041782]  [81053436] ? kthread+0x96/0xa0
 [  543.041835]  [813e1d14] ? kernel_thread_helper+0x4/0x10
 [  543.041890]  [810533a0] ? kthread_worker_fn+0x120/0x120
 [  543.041944]  [813e1d10] ? gs_change+0xb/0xb
 [  543.041993]  Padding 0x8803e4217f40:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 5a 5a 
 [  543.042718]  Padding 0x8803e4217f50:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 5a 5a 
 [  543.043433]  Padding 0x8803e4217f60:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 5a 5a 
 [  543.044155]  Padding 0x8803e4217f70:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 5a 5a 
 [  543.044866]  Padding 0x8803e4217f80:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 5a 5a 
 [  543.045590]  Padding 0x8803e4217f90:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 5a 5a 
 [  543.046311]  Padding 0x8803e4217fa0:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 5a 5a 
 [  543.047034]  Padding 0x8803e4217fb0:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 5a 5a 
 [  543.047755]  Padding 0x8803e4217fc0:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 5a 5a 
 [  543.048474]  Padding 0x8803e4217fd0:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 5a 5a 
 [  543.049203]  Padding 0x8803e4217fe0:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 5a 5a 
 [  543.049909]  Padding 0x8803e4217ff0:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 00 00 ZZ..
 [  543.050021] FIX ip_dst_cache: Restoring 
 0x8803e4217f40-0x8803e4217fff=0x5a
 [  543.050021]
 
 Dropped -mm, Hugh and Andrea from CC as this does not appear to be mm or 
 ksm related.
 
 I'll pare down the firewall and see if I can make it break easier with a 
 smaller test set.

Hmm, not sure now :(

Could you reproduce another bug please ?





Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-07 Thread Brad Campbell

On 07/06/11 21:30, Patrick McHardy wrote:

On 07.06.2011 05:33, Brad Campbell wrote:

On 07/06/11 04:10, Bart De Schuymer wrote:

Hi Brad,

This has probably nothing to do with ebtables, so please rmmod in case
it's loaded.
A few questions I didn't directly see an answer to in the threads I
scanned...
I'm assuming you actually use the bridging firewall functionality. So,
what iptables modules do you use? Can you reduce your iptables rules to
a core that triggers the bug?
Or does it get triggered even with an empty set of firewall rules?
Are you using a stock .35 kernel or is it patched?
Is this something I can trigger on a poor guy's laptop or does it
require specialized hardware (I'm catching up on qemu/kvm...)?


Not specialised hardware as such, I've just not been able to reproduce
it outside of this specific operating scenario.


The last similar problem we've had was related to the 32/64 bit compat
code. Are you running 32 bit userspace on a 64 bit kernel?


No, 32 bit Guest OS, but a completely 64 bit userspace on a 64 bit kernel.

Userspace is current Debian Stable. Kernel is Vanilla and qemu-kvm is 
current git




I can't trigger it with empty firewall rules as it relies on a DNAT to
occur. If I try it directly to the internal IP address (as I have to
without netfilter loaded) then of course nothing fails.

It's a pain in the bum as a fault, but it's one I can easily reproduce
as long as I use the same set of circumstances.

I'll try using 3.0-rc2 (current git) tonight, and if I can reproduce it
on that then I'll attempt to pare down the IPTABLES rules to a bare
minimum.

It is nothing to do with ebtables as I don't compile it. I'm not really
sure about bridging firewall functionality. I just use a couple of
hand coded bash scripts to set the tables up.


 From one of your previous mails:


# CONFIG_BRIDGE_NF_EBTABLES is not set


How about CONFIG_BRIDGE_NETFILTER?



It was compiled in.

With the following table set I was able to reproduce the problem on 
3.0-rc2. Replaced my IP with xxx.xxx.xxx.xxx, but otherwise unmodified


root@srv:~# iptables-save
# Generated by iptables-save v1.4.10 on Tue Jun  7 22:11:30 2011
*filter
:INPUT ACCEPT [978:107619]
:FORWARD ACCEPT [142:7068]
:OUTPUT ACCEPT [1659:291870]
-A INPUT -i ppp0 -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT ! -i ppp0 -m state --state NEW -j ACCEPT
-A INPUT -i ppp0 -j DROP
COMMIT
# Completed on Tue Jun  7 22:11:30 2011
# Generated by iptables-save v1.4.10 on Tue Jun  7 22:11:30 2011
*nat
:PREROUTING ACCEPT [813:49170]
:INPUT ACCEPT [91:7090]
:OUTPUT ACCEPT [267:20731]
:POSTROUTING ACCEPT [296:22281]
-A PREROUTING -d xxx.xxx.xxx.xxx/32 ! -i ppp0 -p tcp -m tcp --dport 443 -j DNAT --to-destination 192.168.253.198
COMMIT
# Completed on Tue Jun  7 22:11:30 2011
# Generated by iptables-save v1.4.10 on Tue Jun  7 22:11:30 2011
*mangle
:PREROUTING ACCEPT [2729:274392]
:INPUT ACCEPT [2508:262976]
:FORWARD ACCEPT [142:7068]
:OUTPUT ACCEPT [1674:293701]
:POSTROUTING ACCEPT [2131:346411]
-A FORWARD -o ppp0 -p tcp -m tcp --tcp-flags SYN,RST SYN -m tcpmss --mss 1400:1536 -j TCPMSS --clamp-mss-to-pmtu
COMMIT
# Completed on Tue Jun  7 22:11:30 2011

I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access 
the address the way I was doing it, so that's a no-go for me.





Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-07 Thread Patrick McHardy
On 07.06.2011 16:40, Brad Campbell wrote:
 On 07/06/11 21:30, Patrick McHardy wrote:
 On 07.06.2011 05:33, Brad Campbell wrote:
 On 07/06/11 04:10, Bart De Schuymer wrote:
 Hi Brad,

 This has probably nothing to do with ebtables, so please rmmod in case
 it's loaded.
 A few questions I didn't directly see an answer to in the threads I
 scanned...
 I'm assuming you actually use the bridging firewall functionality. So,
 what iptables modules do you use? Can you reduce your iptables rules to
 a core that triggers the bug?
 Or does it get triggered even with an empty set of firewall rules?
 Are you using a stock .35 kernel or is it patched?
 Is this something I can trigger on a poor guy's laptop or does it
 require specialized hardware (I'm catching up on qemu/kvm...)?

 Not specialised hardware as such, I've just not been able to reproduce
 it outside of this specific operating scenario.

 The last similar problem we've had was related to the 32/64 bit compat
 code. Are you running 32 bit userspace on a 64 bit kernel?
 
 No, 32 bit Guest OS, but a completely 64 bit userspace on a 64 bit kernel.
 
 Userspace is current Debian Stable. Kernel is Vanilla and qemu-kvm is
 current git
 
 
 I can't trigger it with empty firewall rules as it relies on a DNAT to
 occur. If I try it directly to the internal IP address (as I have to
 without netfilter loaded) then of course nothing fails.

 It's a pain in the bum as a fault, but it's one I can easily reproduce
 as long as I use the same set of circumstances.

 I'll try using 3.0-rc2 (current git) tonight, and if I can reproduce it
 on that then I'll attempt to pare down the IPTABLES rules to a bare
 minimum.

 It is nothing to do with ebtables as I don't compile it. I'm not really
 sure about bridging firewall functionality. I just use a couple of
 hand coded bash scripts to set the tables up.

  From one of your previous mails:

 # CONFIG_BRIDGE_NF_EBTABLES is not set

 How about CONFIG_BRIDGE_NETFILTER?

 
 It was compiled in.
 
 With the following table set I was able to reproduce the problem on
 3.0-rc2. Replaced my IP with xxx.xxx.xxx.xxx, but otherwise unmodified

Which kernel was the last version without this problem?

 root@srv:~# iptables-save
 # Generated by iptables-save v1.4.10 on Tue Jun  7 22:11:30 2011
 *filter
 :INPUT ACCEPT [978:107619]
 :FORWARD ACCEPT [142:7068]
 :OUTPUT ACCEPT [1659:291870]
 -A INPUT -i ppp0 -m state --state RELATED,ESTABLISHED -j ACCEPT
 -A INPUT ! -i ppp0 -m state --state NEW -j ACCEPT
 -A INPUT -i ppp0 -j DROP
 COMMIT
 # Completed on Tue Jun  7 22:11:30 2011
 # Generated by iptables-save v1.4.10 on Tue Jun  7 22:11:30 2011
 *nat
 :PREROUTING ACCEPT [813:49170]
 :INPUT ACCEPT [91:7090]
 :OUTPUT ACCEPT [267:20731]
 :POSTROUTING ACCEPT [296:22281]
 -A PREROUTING -d xxx.xxx.xxx.xxx/32 ! -i ppp0 -p tcp -m tcp --dport 443 -j DNAT --to-destination 192.168.253.198
 COMMIT
 # Completed on Tue Jun  7 22:11:30 2011
 # Generated by iptables-save v1.4.10 on Tue Jun  7 22:11:30 2011
 *mangle
 :PREROUTING ACCEPT [2729:274392]
 :INPUT ACCEPT [2508:262976]
 :FORWARD ACCEPT [142:7068]
 :OUTPUT ACCEPT [1674:293701]
 :POSTROUTING ACCEPT [2131:346411]
 -A FORWARD -o ppp0 -p tcp -m tcp --tcp-flags SYN,RST SYN -m tcpmss --mss 1400:1536 -j TCPMSS --clamp-mss-to-pmtu
 COMMIT
 # Completed on Tue Jun  7 22:11:30 2011

The main suspects would be NAT and TCPMSS. Did you also try whether
the crash occurs with only one of these rules?
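
One way to try each suspect on its own without hand-editing the rules is sketched below. It takes iptables-save style text and drops every rule that jumps to a given target, leaving a NAT-only or TCPMSS-only ruleset. The helper name drop_target and the shortened sample dump (zeroed counters, each rule on one line) are assumptions for illustration, not something posted in this thread. The output would still need the real address in place of xxx.xxx.xxx.xxx before being loaded with iptables-restore.

```python
def drop_target(dump, target):
    """Return iptables-save style text with every rule jumping to `target` removed."""
    kept = []
    for line in dump.splitlines():
        fields = line.split()
        is_rule = fields[:1] == ["-A"]
        # Collect whatever follows each "-j" on the line.
        jumps = [fields[i + 1] for i, f in enumerate(fields[:-1]) if f == "-j"]
        if is_rule and target in jumps:
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


# Shortened sample, assumed for illustration (counters zeroed).
DUMP = """*nat
:PREROUTING ACCEPT [0:0]
-A PREROUTING -d xxx.xxx.xxx.xxx/32 ! -i ppp0 -p tcp -m tcp --dport 443 -j DNAT --to-destination 192.168.253.198
COMMIT
*mangle
:FORWARD ACCEPT [0:0]
-A FORWARD -o ppp0 -p tcp -m tcp --tcp-flags SYN,RST SYN -m tcpmss --mss 1400:1536 -j TCPMSS --clamp-mss-to-pmtu
COMMIT
"""

nat_only = drop_target(DUMP, "TCPMSS")   # keeps the DNAT rule only
mss_only = drop_target(DUMP, "DNAT")     # keeps the TCPMSS rule only
print(nat_only)
print(mss_only)
```

Table headers, chain policies and COMMIT lines are passed through untouched, so each variant stays a complete dump.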

 I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access
 the address the way I was doing it, so that's a no-go for me.

That's really weird since you're apparently not using any bridge
netfilter features. It shouldn't have any effect besides changing
at which point ip_tables is invoked. How are your network devices
configured (specifically any bridges)?


Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-07 Thread Bart De Schuymer

On 7/06/2011 16:40, Brad Campbell wrote:

On 07/06/11 21:30, Patrick McHardy wrote:

On 07.06.2011 05:33, Brad Campbell wrote:

On 07/06/11 04:10, Bart De Schuymer wrote:

Hi Brad,

This has probably nothing to do with ebtables, so please rmmod in case
it's loaded.
A few questions I didn't directly see an answer to in the threads I
scanned...
I'm assuming you actually use the bridging firewall functionality. So,
what iptables modules do you use? Can you reduce your iptables rules to
a core that triggers the bug?
Or does it get triggered even with an empty set of firewall rules?
Are you using a stock .35 kernel or is it patched?
Is this something I can trigger on a poor guy's laptop or does it
require specialized hardware (I'm catching up on qemu/kvm...)?


Not specialised hardware as such, I've just not been able to reproduce
it outside of this specific operating scenario.


The last similar problem we've had was related to the 32/64 bit compat
code. Are you running 32 bit userspace on a 64 bit kernel?


No, 32 bit Guest OS, but a completely 64 bit userspace on a 64 bit kernel.


Userspace is current Debian Stable. Kernel is Vanilla and qemu-kvm is current git


If the bug is easily triggered with your guest os, then you could try to 
capture the traffic with wireshark (or something else) in a 
configuration that doesn't crash your system. Save the traffic in a pcap 
file. Then you can see if resending that traffic in the vulnerable 
configuration triggers the bug (I don't know if something in Windows 
exists, but tcpreplay should work for Linux). Once you have such a 
capture, chances are the bug is even easily reproducible by us (unless 
it's hardware-specific). Success isn't guaranteed, but I think it's 
worth a shot...


cheers,
Bart


--
Bart De Schuymer
www.artinalgorithms.be



Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-07 Thread Eric Dumazet
On Tuesday 07 June 2011 at 17:35 +0200, Patrick McHardy wrote:

 The main suspects would be NAT and TCPMSS. Did you also try whether
 the crash occurs with only one of these rules?
 
  I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access
  the address the way I was doing it, so that's a no-go for me.
 
 That's really weird since you're apparently not using any bridge
 netfilter features. It shouldn't have any effect besides changing
 at which point ip_tables is invoked. How are your network devices
 configured (specifically any bridges)?

Something in the kernel does

u16 *ptr = addr; /* addr as returned by kmalloc() */

ptr[-1] = 0;

Could be an off-by-one error in a memmove()/memcpy() or loop...

I can't see a network issue here.

I checked arch/x86/lib/memmove_64.S and it seems fine.
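
To make the suspected pattern concrete, here is a toy model in Python of what a 16-bit store at ptr[-1] does when objects sit back to back in one buffer. It is only a sketch: OBJ_SIZE, the flat layout and the helper names are assumptions for illustration and say nothing about real SLUB internals. The point is just that the store lands in the last two bytes of whatever object precedes the one being written.

```python
import struct

OBJ_SIZE = 512   # assumed object size, matching the kmalloc-512 cache in the report
NUM_OBJS = 4     # assumed number of objects in the toy slab


def make_slab():
    """Return a zero-filled buffer holding NUM_OBJS adjacent objects."""
    return bytearray(OBJ_SIZE * NUM_OBJS)


def write_u16(slab, obj_index, elem_index, value):
    """Store a 16-bit value at ptr[elem_index], ptr being the start of obj_index.

    A negative elem_index lands before the object, as in `ptr[-1] = 0`.
    Returns the byte offset that was written.
    """
    offset = obj_index * OBJ_SIZE + elem_index * 2
    if offset < 0 or offset + 2 > len(slab):
        raise IndexError("write falls outside the toy slab")
    struct.pack_into("<H", slab, offset, value)
    return offset


def owner_of(offset):
    """Index of the object that actually owns a byte offset."""
    return offset // OBJ_SIZE


slab = make_slab()
slab[0:OBJ_SIZE] = b"\xaa" * OBJ_SIZE   # object 0 belongs to someone else
hit = write_u16(slab, obj_index=1, elem_index=-1, value=0)
print(hit, owner_of(hit))               # offset written and the object it belongs to
print(slab[508:512].hex())              # tail of object 0 after the stray store
```

The writer believes it is touching object 1, yet the modified bytes belong to object 0, which is why such a bug tends to show up far away from the code that caused it.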





Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-07 Thread Patrick McHardy
On 07.06.2011 20:31, Eric Dumazet wrote:
 On Tuesday 07 June 2011 at 17:35 +0200, Patrick McHardy wrote:
 
 The main suspects would be NAT and TCPMSS. Did you also try whether
 the crash occurs with only one of these rules?

 I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access
 the address the way I was doing it, so that's a no-go for me.

 That's really weird since you're apparently not using any bridge
 netfilter features. It shouldn't have any effect besides changing
 at which point ip_tables is invoked. How are your network devices
 configured (specifically any bridges)?
 
 Something in the kernel does 
 
 u16 *ptr = addr (given by kmalloc())
 
 ptr[-1] = 0;
 
 Could be an off-by-one error in a memmove()/memcpy() or loop...
 
 I can't see a network issue here.

So far me neither, but netfilter appears to trigger the bug.

 I checked arch/x86/lib/memmove_64.S and it seems fine.

I was thinking it might be a missing skb_make_writable() combined
with vhost_net specifics in the netfilter code (TCPMSS and NAT are
both suspect), but was unable to find something. I also went
through the dst_metrics() conversion to see whether anything could
cause problems with the bridge fake_rttable, but also nothing
so far.


Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-07 Thread Brad Campbell

On 08/06/11 02:04, Bart De Schuymer wrote:


If the bug is easily triggered with your guest os, then you could try to
capture the traffic with wireshark (or something else) in a
configuration that doesn't crash your system. Save the traffic in a pcap
file. Then you can see if resending that traffic in the vulnerable
configuration triggers the bug (I don't know if something in Windows
exists, but tcpreplay should work for Linux). Once you have such a
capture, chances are the bug is even easily reproducible by us (unless
it's hardware-specific). Success isn't guaranteed, but I think it's
worth a shot...


The issue with this is I don't have a configuration that does not crash 
the system. This only happens under the specific circumstance that 
traffic from VM A is being DNAT'd to VM B. If I disable 
CONFIG_BRIDGE_NETFILTER, or I leave out the DNAT then I can't replicate 
the problem as I don't seem to be able to get the packets to go where I 
want them to go.


Let me try and explain it a little more clearly with made up IP 
addresses to illustrate the problem.


I have VM A (1.1.1.2) and VM B (1.1.1.3) on br1 (1.1.1.1)
I have public IP on ppp0 (2.2.2.2).

VM B can talk to VM A using its host address (1.1.1.2) and there is no 
problem.


The DNAT says anything destined for PPP0 that is on port 443 and coming 
from anywhere other than PPP0 (ie inside the network) is to be DNAT'd to 
1.1.1.3.


So VM A (1.1.1.2) tries to connect to ppp0 (2.2.2.2) on port 443, and 
this is redirected to VM B on 1.1.1.3.


Only under this specific circumstance does the problem occur. I can get 
VM A (1.1.1.2) to talk directly to VM B (1.1.1.3) all day long and there 
is no problem, it's only when VM A tries to talk to ppp0 that there is 
an issue (and it happens within seconds of the initial connection).
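
The rule described above can be sketched as a small Python model, using the made-up addresses from this message. The Packet type and the prerouting_dnat function are hypothetical names for illustration; the model only covers the match and the destination rewrite, not connection tracking or the return path.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    in_iface: str
    src: str
    dst: str
    proto: str
    dport: int


PUBLIC_IP = "2.2.2.2"     # made-up ppp0 address from the example
DNAT_TARGET = "1.1.1.3"   # guest that the DNAT rule points at


def prerouting_dnat(pkt):
    """Model of: -d PUBLIC_IP ! -i ppp0 -p tcp --dport 443 -j DNAT --to-destination DNAT_TARGET."""
    if (pkt.dst == PUBLIC_IP and pkt.in_iface != "ppp0"
            and pkt.proto == "tcp" and pkt.dport == 443):
        return replace(pkt, dst=DNAT_TARGET)
    return pkt


# A guest on br1 connecting to the public address: destination is rewritten.
via_public = Packet("br1", "1.1.1.2", PUBLIC_IP, "tcp", 443)
# The same guest connecting straight to the other guest: rule does not match.
direct = Packet("br1", "1.1.1.2", "1.1.1.3", "tcp", 443)
# Traffic arriving on ppp0 is excluded by "! -i ppp0".
from_outside = Packet("ppp0", "3.3.3.3", PUBLIC_IP, "tcp", 443)

print(prerouting_dnat(via_public).dst)
print(prerouting_dnat(direct) == direct)
print(prerouting_dnat(from_outside) == from_outside)
```

Only the first packet takes the rewritten path, which matches the observation that direct guest-to-guest traffic never triggers the fault.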


All these tests have been performed with VM B being a Windows XP guest. 
Tonight I'll try it with a Linux guest and see if I can make it happen. 
If that works I might be able to come up with some reproducible test 
case for you. I have a desktop machine that has Intel VT extensions, so 
I'll work toward making a portable test case.



Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-07 Thread Brad Campbell

On 08/06/11 06:57, Patrick McHardy wrote:

On 07.06.2011 20:31, Eric Dumazet wrote:

On Tuesday 07 June 2011 at 17:35 +0200, Patrick McHardy wrote:


The main suspects would be NAT and TCPMSS. Did you also try whether
the crash occurs with only one of these rules?


I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access
the address the way I was doing it, so that's a no-go for me.


That's really weird since you're apparently not using any bridge
netfilter features. It shouldn't have any effect besides changing
at which point ip_tables is invoked. How are your network devices
configured (specifically any bridges)?


Something in the kernel does

u16 *ptr = addr (given by kmalloc())

ptr[-1] = 0;

Could be an off-by-one error in a memmove()/memcpy() or loop...

I can't see a network issue here.


So far me neither, but netfilter appears to trigger the bug.


Would it help if I tried some older kernels? This issue only surfaced 
for me recently as I only installed the VM's in question about 12 weeks 
ago and have only just started really using them in anger. I could try 
reproducing it on progressively older kernels to see if I can find one 
that works and then bisect from there.
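
For a rough idea of the effort involved, bisection is a binary search over an ordered list of kernels, which is what git bisect automates over commits. The Python sketch below assumes the oldest entry is good, the newest is bad, and everything after the first bad entry is bad as well. The kernel names k0..k15 and the position of the first bad one are purely hypothetical.

```python
def first_bad(versions, is_bad):
    """Return (index of first bad version, number of tests run).

    Assumes versions[0] is good, versions[-1] is bad, and badness never
    disappears again once it has appeared.
    """
    lo, hi = 0, len(versions) - 1   # lo is known good, hi is known bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return hi, tests


# Hypothetical history of 16 kernels where the fault first appears in k11.
versions = [f"k{i}" for i in range(16)]
index, tests = first_bad(versions, lambda v: int(v[1:]) >= 11)
print(versions[index], tests)
```

Each test here means a build, a reboot and an attempt to reproduce the crash, so the count of tests is what matters on a production machine that can only go down at night.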



Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-07 Thread Brad Campbell

On 07/06/11 23:35, Patrick McHardy wrote:


The main suspects would be NAT and TCPMSS. Did you also try whether
the crash occurs with only one of these rules?


To be honest I'm actually having trouble finding where TCPMSS is 
actually set in that ruleset. This is a production machine so I can only 
take it down after about 9PM at night. I'll have another crack at it 
tonight.



I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access
the address the way I was doing it, so that's a no-go for me.


That's really weird since you're apparently not using any bridge
netfilter features. It shouldn't have any effect besides changing
at which point ip_tables is invoked. How are your network devices
configured (specifically any bridges)?



I have one bridge with all my virtual machines on it.

In this particular instance the packets leave VM A destined for the IP 
address of ppp0 (the external interface). This is intercepted by the 
DNAT PREROUTING rule above and shunted back to VM B.


The VM's are on br1 and the external address is ppp0. Without 
CONFIG_BRIDGE_NETFILTER compiled in I can see the traffic entering and 
leaving VM B with tcpdump, but the packets never seem to get back to VM A.


VM A is XP 32 bit, VM B is Linux. I have some other Linux VM's, so I'll 
do some more testing tonight between those to see where the packets are 
going without CONFIG_BRIDGE_NETFILTER set.



Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-07 Thread Eric Dumazet
On Wednesday 08 June 2011 at 08:18 +0800, Brad Campbell wrote:
 On 08/06/11 06:57, Patrick McHardy wrote:
  On 07.06.2011 20:31, Eric Dumazet wrote:
  On Tuesday 07 June 2011 at 17:35 +0200, Patrick McHardy wrote:
 
  The main suspects would be NAT and TCPMSS. Did you also try whether
  the crash occurs with only one of these rules?
 
  I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access
  the address the way I was doing it, so that's a no-go for me.
 
  That's really weird since you're apparently not using any bridge
  netfilter features. It shouldn't have any effect besides changing
  at which point ip_tables is invoked. How are your network devices
  configured (specifically any bridges)?
 
  Something in the kernel does
 
  u16 *ptr = addr (given by kmalloc())
 
  ptr[-1] = 0;
 
  Could be an off-by-one error in a memmove()/memcpy() or loop...
 
  I can't see a network issue here.
 
  So far me neither, but netfilter appears to trigger the bug.
 
 Would it help if I tried some older kernels? This issue only surfaced 
 for me recently as I only installed the VM's in question about 12 weeks 
 ago and have only just started really using them in anger. I could try 
 reproducing it on progressively older kernels to see if I can find one 
 that works and then bisect from there.

Well, a bisection definitely should help, but needs a lot of time in
your case.

Could you try the following patch? It disables the 'usual suspect' I had
in mind yesterday:

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 46cbd28..9f548f9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -792,6 +792,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 		fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta;
 	}
 
+#if 0
 	if (fastpath &&
 	    size + sizeof(struct skb_shared_info) <= ksize(skb->head)) {
 		memmove(skb->head + size, skb_shinfo(skb),
@@ -802,7 +803,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 		off = nhead;
 		goto adjust_others;
 	}
-
+#endif
 	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
 	if (!data)
 		goto nodata;
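
For anyone wondering what the disabled branch does: as far as the visible hunk shows, it reuses the existing buffer when the allocation is already large enough, instead of calling kmalloc(). The Python toy below models that idea on a bytearray. It is a sketch under assumptions: SHINFO_SIZE is a made-up stand-in for sizeof(struct skb_shared_info), alignment is ignored, and the second move (shifting the packet data up by nhead) is my reading of the lines the hunk does not show.

```python
SHINFO_SIZE = 64   # assumed stand-in for sizeof(struct skb_shared_info)


def expand_head_fastpath(buf, end, tail, nhead, ntail):
    """Model the in-place branch on one allocation.

    buf   whole allocation; len(buf) plays the role of ksize(skb->head)
    end   offset where the shared info currently starts
    tail  offset where the packet data ends
    Returns the new (end, tail) offsets, or None when the buffer is too
    small and a fresh allocation would be needed instead.
    """
    size = nhead + end + ntail
    if size + SHINFO_SIZE > len(buf):
        return None
    # Move the shared info up to its new position (slices copy first,
    # so overlapping ranges behave like memmove).
    buf[size:size + SHINFO_SIZE] = buf[end:end + SHINFO_SIZE]
    # Shift the existing data up to open nhead bytes of headroom.
    buf[nhead:nhead + tail] = buf[0:tail]
    return size, tail + nhead


buf = bytearray(256)
buf[0:3] = b"PKT"                     # three bytes of packet data, no headroom
buf[100:164] = b"S" * SHINFO_SIZE     # marker bytes standing in for the shared info
result = expand_head_fastpath(buf, end=100, tail=3, nhead=16, ntail=0)
print(result, bytes(buf[16:19]))

too_small = expand_head_fastpath(bytearray(170), end=100, tail=3, nhead=16, ntail=0)
print(too_small)
```

With the branch compiled out, every call takes the equivalent of the None path, so if the crash disappears the in-place moves become the prime suspect.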




Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-06 Thread Bart De Schuymer

Hi Brad,

This has probably nothing to do with ebtables, so please rmmod in case 
it's loaded.
A few questions I didn't directly see an answer to in the threads I 
scanned...
I'm assuming you actually use the bridging firewall functionality. So, 
what iptables modules do you use? Can you reduce your iptables rules to 
a core that triggers the bug?

Or does it get triggered even with an empty set of firewall rules?
Are you using a stock .35 kernel or is it patched?
Is this something I can trigger on a poor guy's laptop or does it 
require specialized hardware (I'm catching up on qemu/kvm...)?


cheers,
Bart

PS: I'm not sure if we should keep CC-ing everybody, netfilter-devel 
together with kvm should probably do fine.


On 3/06/2011 18:07, Brad Campbell wrote:

On 03/06/11 23:50, Bernhard Held wrote:

On 03.06.2011 15:38, Brad Campbell wrote:

On 02/06/11 07:03, CaT wrote:

On Wed, Jun 01, 2011 at 07:52:33PM +0800, Brad Campbell wrote:

Unfortunately the only interface that is mentioned by name anywhere
in my firewall is $DMZ (which is ppp0 and not part of any bridge).

All of the nat/dnat and other horrible hacks are based on IP addresses.


Damn. Not referencing the bridge interfaces at all stopped our host from
going down in flames when we passed it a few packets. These are two
of the oopses we got from it. Whilst the kernel here is .35 we got the
same issue from a range of kernels. Seems related.


Well, I tried sending an explanatory message to netdev, netfilter & cc'd to kvm,
but it appears not to have made it to kvm or netfilter, and the cc to netdev has
not elicited a response. My resend to netfilter seems to have dropped into the
bit bucket also.

Just another reference 3.5 months ago:
http://www.spinics.net/lists/netfilter-devel/msg17239.html


*waves hands around shouting* "I have a reproducible test case for this 
and don't mind patching and crashing the machine to get it fixed"


Attempted to add netfilter-devel to the cc this time.




--
Bart De Schuymer
www.artinalgorithms.be



Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-06 Thread Eric Dumazet
On Sunday 05 June 2011 at 21:45 +0800, Brad Campbell wrote:
 On 05/06/11 16:14, Avi Kivity wrote:
  On 06/03/2011 04:38 PM, Brad Campbell wrote:
 
  Is there anyone who can point me at the appropriate cage to rattle? I
  know it appears to be a netfilter issue, but I don't seem to be able
  to get a message to the list (and I am subscribed to it and have been
  getting mail for months) and I'm not sure who to pester. The other
  alternative is I just stop doing that and wait for it to bite
  someone else.
 
  The mailing list might be set not to send your own mails back to you.
  Check the list archive.
 
 Yep, I did that first..
 
 Given the response to previous issues along the same line, it looks a 
 bit like I just remember not to actually use the system in the way that 
 triggers the bug and be happy that 99% of the time the kernel does not 
 panic, but have that lovely feeling in the back of the skull that says 
 any time now, and without obvious reason the whole machine might just 
 come crashing down..
 
 I guess it's still better than running Xen or Windows..

Could you please try the latest linux-2.6 tree?

We fixed many networking bugs that could explain your crash.





Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-06 Thread Eric Dumazet
On Monday 06 June 2011 at 22:10 +0200, Bart De Schuymer wrote:
 Hi Brad,
 
 This has probably nothing to do with ebtables, so please rmmod in case 
 it's loaded.
 A few questions I didn't directly see an answer to in the threads I 
 scanned...
 I'm assuming you actually use the bridging firewall functionality. So, 
 what iptables modules do you use? Can you reduce your iptables rules to 
 a core that triggers the bug?
 Or does it get triggered even with an empty set of firewall rules?
 Are you using a stock .35 kernel or is it patched?
 Is this something I can trigger on a poor guy's laptop or does it 
 require specialized hardware (I'm catching up on qemu/kvm...)?
 

Keep netdev, as this most probably is a networking bug.





Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-06 Thread Brad Campbell

On 07/06/11 04:10, Bart De Schuymer wrote:

Hi Brad,

This has probably nothing to do with ebtables, so please rmmod in case
it's loaded.
A few questions I didn't directly see an answer to in the threads I
scanned...
I'm assuming you actually use the bridging firewall functionality. So,
what iptables modules do you use? Can you reduce your iptables rules to
a core that triggers the bug?
Or does it get triggered even with an empty set of firewall rules?
Are you using a stock .35 kernel or is it patched?
Is this something I can trigger on a poor guy's laptop or does it
require specialized hardware (I'm catching up on qemu/kvm...)?


Not specialised hardware as such, I've just not been able to reproduce 
it outside of this specific operating scenario.


I can't trigger it with empty firewall rules as it relies on a DNAT to 
occur. If I try it directly to the internal IP address (as I have to 
without netfilter loaded) then of course nothing fails.


It's a pain in the bum as a fault, but it's one I can easily reproduce 
as long as I use the same set of circumstances.


I'll try using 3.0-rc2 (current git) tonight, and if I can reproduce it 
on that then I'll attempt to pare down the IPTABLES rules to a bare minimum.


It is nothing to do with ebtables as I don't compile it. I'm not really 
sure about bridging firewall functionality. I just use a couple of 
hand coded bash scripts to set the tables up.


brad@srv:~$ lsmod
Module                 Size  Used by
xt_iprange             1637  1
xt_DSCP                2077  2
xt_length              1216  1
xt_CLASSIFY            1091  26
sch_sfq                6681  4
xt_CHECKSUM            1229  2
ipt_REJECT             2277  1
ipt_MASQUERADE         1759  7
ipt_REDIRECT           1133  1
xt_recent              8223  2
xt_state               1226  5
iptable_nat            3993  1
nf_nat                16773  3 ipt_MASQUERADE,ipt_REDIRECT,iptable_nat
nf_conntrack_ipv4     11868  8 iptable_nat,nf_nat
nf_conntrack          60962  5 ipt_MASQUERADE,xt_state,iptable_nat,nf_nat,nf_conntrack_ipv4
nf_defrag_ipv4         1417  1 nf_conntrack_ipv4
xt_TCPMSS              2567  2
xt_tcpmss              1469  0
xt_tcpudp              2467  56
iptable_mangle         1487  1
pppoe                  9574  2
pppox                  2188  1 pppoe
iptable_filter         1442  1
ip_tables             16762  3 iptable_nat,iptable_mangle,iptable_filter
x_tables              20462  17 xt_iprange,xt_DSCP,xt_length,xt_CLASSIFY,xt_CHECKSUM,ipt_REJECT,ipt_MASQUERADE,ipt_REDIRECT,xt_recent,xt_state,iptable_nat,xt_TCPMSS,xt_tcpmss,xt_tcpudp,iptable_mangle,iptable_filter,ip_tables
ppp_generic           24243  6 pppoe,pppox
slhc                   5293  1 ppp_generic
cls_u32                6468  6
sch_htb               14432  2
deflate                1937  0
zlib_deflate          21228  1 deflate
des_generic           16135  0
cbc                    2721  0
ecb                    1975  0
crypto_blkcipher      13645  2 cbc,ecb
sha1_generic           2095  0
md5                    4001  0
hmac                   2977  0
crypto_hash           14519  3 sha1_generic,md5,hmac
cryptomgr              2636  0
aead                   6137  1 cryptomgr
crypto_algapi         15289  9 deflate,des_generic,cbc,ecb,crypto_blkcipher,hmac,crypto_hash,cryptomgr,aead
af_key                27372  0
fuse                  66747  1
w83627ehf             32052  0
hwmon_vid              2867  1 w83627ehf
vhost_net             16802  6
powernow_k8           12932  0
mperf                  1263  1 powernow_k8
kvm_amd               53431  24
kvm                  235155  1 kvm_amd
pl2303                12732  1
xhci_hcd              62865  0
i2c_piix4              8391  0
k10temp                3183  0
usbserial             34452  3 pl2303
usb_storage           37887  1
usb_libusual          10999  1 usb_storage
ohci_hcd              18105  0
ehci_hcd              33641  0
ahci                  20748  4
usbcore              130936  7 pl2303,xhci_hcd,usbserial,usb_storage,usb_libusual,ohci_hcd,ehci_hcd
libahci               21202  1 ahci
sata_mv               26939  0
megaraid_sas          71659  14

Nat Table (external ip substituted for xxx.xxx.xxx.xxx)

Chain PREROUTING (policy ACCEPT 1761K packets, 152M bytes)
 pkts bytes target prot opt in     out source     destination
    5   210 DNAT   udp  --  ppp0   *   0.0.0.0/0  0.0.0.0/0        udp dpt:1195 to:192.168.253.199
    6   252 DNAT   udp  --  !ppp0  *   0.0.0.0/0  xxx.xxx.xxx.xxx  udp dpt:1195 to:192.168.253.199
    0     0 DNAT   tcp  --  ppp0   *   0.0.0.0/0  0.0.0.0/0        tcp dpt:25001 to:192.168.253.199:465

Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-05 Thread Avi Kivity

On 06/03/2011 04:38 PM, Brad Campbell wrote:


Is there anyone who can point me at the appropriate cage to rattle? I 
know it appears to be a netfilter issue, but I don't seem to be able 
to get a message to the list (and I am subscribed to it and have been 
getting mail for months) and I'm not sure who to pester. The other 
alternative is I just stop doing that and wait for it to bite 
someone else.


The mailing list might be set not to send your own mails back to you.  
Check the list archive.


--
error compiling committee.c: too many arguments to function



Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-05 Thread Brad Campbell

On 05/06/11 16:14, Avi Kivity wrote:

On 06/03/2011 04:38 PM, Brad Campbell wrote:


Is there anyone who can point me at the appropriate cage to rattle? I
know it appears to be a netfilter issue, but I don't seem to be able
to get a message to the list (and I am subscribed to it and have been
getting mail for months) and I'm not sure who to pester. The other
alternative is I just stop doing that and wait for it to bite
someone else.


The mailing list might be set not to send your own mails back to you.
Check the list archive.


Yep, I did that first..

Given the response to previous issues along the same line, it looks a 
bit like I just remember not to actually use the system in the way that 
triggers the bug and be happy that 99% of the time the kernel does not 
panic, but have that lovely feeling in the back of the skull that says 
any time now, and without obvious reason the whole machine might just 
come crashing down..


I guess it's still better than running Xen or Windows..


Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-05 Thread Avi Kivity

On 06/05/2011 04:45 PM, Brad Campbell wrote:

The mailing list might be set not to send your own mails back to you.
Check the list archive.



Yep, I did that first..

Given the response to previous issues along the same line, it looks a 
bit like I just remember not to actually use the system in the way 
that triggers the bug and be happy that 99% of the time the kernel 
does not panic, but have that lovely feeling in the back of the skull 
that says any time now, and without obvious reason the whole machine 
might just come crashing down..


I guess it's still better than running Xen or Windows..


Not at all.  Can some networking/netfilter expert look at this?

Please file a bug with all the relevant information in this thread.  If 
you can look for a previous version that worked, that might increase the 
chances of the bug being resolved faster.


--
error compiling committee.c: too many arguments to function



Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-03 Thread Brad Campbell

On 02/06/11 07:03, CaT wrote:

On Wed, Jun 01, 2011 at 07:52:33PM +0800, Brad Campbell wrote:

Unfortunately the only interface that is mentioned by name anywhere
in my firewall is $DMZ (which is ppp0 and not part of any bridge).

All of the nat/dnat and other horrible hacks are based on IP addresses.


Damn. Not referencing the bridge interfaces at all stopped our host from
going down in flames when we passed it a few packets. These are two
of the oopses we got from it. Whilst the kernel here is .35 we got the
same issue from a range of kernels. Seems related.


Well, I tried sending an explanatory message to netdev, netfilter & cc'd 
to kvm, but it appears not to have made it to kvm or netfilter, and the 
cc to netdev has not elicited a response. My resend to netfilter seems 
to have dropped into the bit bucket also.


Is there anyone who can point me at the appropriate cage to rattle? I 
know it appears to be a netfilter issue, but I don't seem to be able to 
get a message to the list (and I am subscribed to it and have been 
getting mail for months) and I'm not sure who to pester. The other 
alternative is I just stop doing that and wait for it to bite someone 
else.


Cheers.
Brad


Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-03 Thread Bernhard Held

On 03.06.2011 15:38, Brad Campbell wrote:

On 02/06/11 07:03, CaT wrote:

On Wed, Jun 01, 2011 at 07:52:33PM +0800, Brad Campbell wrote:

Unfortunately the only interface that is mentioned by name anywhere
in my firewall is $DMZ (which is ppp0 and not part of any bridge).

All of the nat/dnat and other horrible hacks are based on IP addresses.


Damn. Not referencing the bridge interfaces at all stopped our host from
going down in flames when we passed it a few packets. These are two
of the oopses we got from it. Whilst the kernel here is .35 we got the
same issue from a range of kernels. Seems related.


Well, I tried sending an explanatory message to netdev, netfilter & cc'd to kvm,
but it appears not to have made it to kvm or netfilter, and the cc to netdev has
not elicited a response. My resend to netfilter seems to have dropped into the
bit bucket also.

Just another reference 3.5 months ago:
http://www.spinics.net/lists/netfilter-devel/msg17239.html

Bernhard



Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-03 Thread Brad Campbell

On 03/06/11 23:50, Bernhard Held wrote:

On 03.06.2011 15:38, Brad Campbell wrote:

On 02/06/11 07:03, CaT wrote:

On Wed, Jun 01, 2011 at 07:52:33PM +0800, Brad Campbell wrote:

Unfortunately the only interface that is mentioned by name anywhere
in my firewall is $DMZ (which is ppp0 and not part of any bridge).

All of the nat/dnat and other horrible hacks are based on IP addresses.


Damn. Not referencing the bridge interfaces at all stopped our host from
going down in flames when we passed it a few packets. These are two
of the oopses we got from it. Whilst the kernel here is .35 we got the
same issue from a range of kernels. Seems related.


Well, I tried sending an explanatory message to netdev, netfilter & cc'd to kvm,
but it appears not to have made it to kvm or netfilter, and the cc to netdev has
not elicited a response. My resend to netfilter seems to have dropped into the
bit bucket also.

Just another reference 3.5 months ago:
http://www.spinics.net/lists/netfilter-devel/msg17239.html


*waves hands around shouting* "I have a reproducible test case for this 
and don't mind patching and crashing the machine to get it fixed"


Attempted to add netfilter-devel to the cc this time.


Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-01 Thread Brad Campbell

On 01/06/11 12:52, Hugh Dickins wrote:



I guess Brad could try SLUB debugging, boot with slub_debug=P
for poisoning perhaps; though it might upset alignments and
drive the problem underground.  Or see if the same happens
with SLAB instead of SLUB.


Not much use I'm afraid.
This is all I get in the log

[ 3161.300073] 
=

[ 3161.300147] BUG kmalloc-512: Freechain corrupt

The qemu process is then frozen, unkillable but reported in state R

13881 ? R 3:27 /usr/bin/qemu -S -M pc-0.13 -enable-kvm -m 1024 -smp 2,sockets=2,cores=1,threads=1 -nam


The machine then progressively dies until it's frozen solid with no 
further error messages.


I stupidly forgot to do an alt-sysrq-t prior to doing an alt-sysrq-b, 
but at least it responded to that.


On the bright side I can reproduce it at will.


Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-01 Thread Avi Kivity

On 06/01/2011 09:31 AM, Brad Campbell wrote:

On 01/06/11 12:52, Hugh Dickins wrote:



I guess Brad could try SLUB debugging, boot with slub_debug=P
for poisoning perhaps; though it might upset alignments and
drive the problem underground.  Or see if the same happens
with SLAB instead of SLUB.


Not much use I'm afraid.
This is all I get in the log

[ 3161.300073] 
=

[ 3161.300147] BUG kmalloc-512: Freechain corrupt

The qemu process is then frozen, unkillable but reported in state R

13881 ? R 3:27 /usr/bin/qemu -S -M pc-0.13 -enable-kvm -m 1024 -smp 2,sockets=2,cores=1,threads=1 -nam


The machine then progressively dies until it's frozen solid with no 
further error messages.


I stupidly forgot to do an alt-sysrq-t prior to doing an alt-sysrq-b, 
but at least it responded to that.


On the bright side I can reproduce it at will.


Please try slub_debug=FZPU; that should point the finger (hopefully at 
somebody else).
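
For anyone following along: slub_debug takes a string of single-letter options, optionally followed by a comma and a slab name. Below is a minimal Python sketch that spells out what FZPU asks for. The letter meanings are as I understand them from the kernel's SLUB documentation, the table is deliberately partial, and explain_slub_debug is only an illustrative helper name, not an existing tool.

```python
# Partial table of slub_debug option letters (as described in the kernel's
# SLUB documentation); letters not listed here are reported as unknown.
SLUB_DEBUG_FLAGS = {
    "F": "sanity checks",
    "Z": "red zoning",
    "P": "poisoning (object and padding)",
    "U": "user tracking (alloc and free call sites)",
    "T": "trace",
}


def explain_slub_debug(value):
    """Split 'FLAGS[,slab]' and describe each flag letter.

    Returns (list of 'letter: description' strings, slab name or None).
    """
    flags, _, slab = value.partition(",")
    described = [
        f"{letter}: {SLUB_DEBUG_FLAGS.get(letter, 'not in this table')}"
        for letter in flags
    ]
    return described, (slab or None)
```

explain_slub_debug("FZPU") describes the four checks requested here, applied to every cache. A value such as "FZPU,kmalloc-512" should restrict them to the named cache, which may be worth trying if full debugging perturbs the timing too much.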


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-01 Thread Avi Kivity

On 06/01/2011 12:29 PM, Brad Campbell wrote:

On 01/06/11 14:56, Avi Kivity wrote:

On 06/01/2011 09:31 AM, Brad Campbell wrote:

On 01/06/11 12:52, Hugh Dickins wrote:



I guess Brad could try SLUB debugging, boot with slub_debug=P
for poisoning perhaps; though it might upset alignments and
drive the problem underground. Or see if the same happens
with SLAB instead of SLUB.


Not much use I'm afraid.
This is all I get in the log

[ 3161.300073]
= 



[ 3161.300147] BUG kmalloc-512: Freechain corrupt

The qemu process is then frozen, unkillable but reported in state R

13881 ? R 3:27 /usr/bin/qemu -S -M pc-0.13 -enable-kvm -m 1024 -smp
2,sockets=2,cores=1,threads=1 -nam

The machine then progressively dies until it's frozen solid with no
further error messages.

I stupidly forgot to do an alt-sysrq-t prior to doing an alt-sysrq-b,
but at least it responded to that.

On the bright side I can reproduce it at will.


Please try slub_debug=FZPU; that should point the finger (hopefully at
somebody else).



Well the first attempt locked the machine solid. No network, no console..

I saw 
==


on the console.. nothing after that. Would not respond to sysrq-t or 
any other sysrq combination other than -b, which rebooted the box.



No output on netconsole at all, I had to walk to the other building to 
look at the monitor and reboot it.


The second attempt jammed netconsole again, but I managed to get this 
from an ssh session I already had established. The machine died a slow 
and horrible death, but remained interactive enough for me to reboot 
it with


echo b > /proc/sysrq-trigger

Nothing else worked.


[  413.756416]  [81318f1c] ? pskb_expand_head+0x15c/0x250
[  413.756424]  [813a6c45] ? nf_bridge_copy_header+0x145/0x160
[  413.756431]  [8139f78d] ? br_dev_queue_push_xmit+0x6d/0x80
[  413.756439]  [813a55a0] ? br_nf_post_routing+0x2a0/0x2f0
[  413.756447]  [81346bc4] ? nf_iterate+0x84/0xb0
[  413.756453]  [8139f720] ? br_flood_deliver+0x20/0x20
[  413.756459]  [81346c64] ? nf_hook_slow+0x74/0x120
[  413.756465]  [8139f720] ? br_flood_deliver+0x20/0x20
[  413.756472]  [8139f7da] ? br_forward_finish+0x3a/0x60
[  413.756479]  [813a5758] ? br_nf_forward_finish+0x168/0x170
[  413.756487]  [813a5c90] ? br_nf_forward_ip+0x360/0x3a0
[  413.756492]  [81346bc4] ? nf_iterate+0x84/0xb0
[  413.756498]  [8139f7a0] ? br_dev_queue_push_xmit+0x80/0x80
[  413.756504]  [81346c64] ? nf_hook_slow+0x74/0x120
[  413.756510]  [8139f7a0] ? br_dev_queue_push_xmit+0x80/0x80
[  413.756516]  [8139f800] ? br_forward_finish+0x60/0x60
[  413.756522]  [8139f800] ? br_forward_finish+0x60/0x60
[  413.756528]  [8139f875] ? __br_forward+0x75/0xc0
[  413.756534]  [8139f426] ? deliver_clone+0x36/0x60
[  413.756540]  [8139f69d] ? br_flood+0xbd/0x100
[  413.756546]  [813a05b0] ? br_handle_local_finish+0x40/0x40
[  413.756552]  [813a080e] ? br_handle_frame_finish+0x25e/0x280
[  413.756560]  [813a60f0] ? br_nf_pre_routing_finish+0x1a0/0x330
[  413.756568]  [813a6958] ? br_nf_pre_routing+0x6d8/0x800
[  413.756577]  [8102d46a] ? enqueue_task+0x3a/0x90
[  413.756582]  [81346bc4] ? nf_iterate+0x84/0xb0
[  413.756589]  [813a05b0] ? br_handle_local_finish+0x40/0x40
[  413.756594]  [81346c64] ? nf_hook_slow+0x74/0x120
[  413.756600]  [813a05b0] ? br_handle_local_finish+0x40/0x40
[  413.756607]  [810339b0] ? try_to_wake_up+0x2c0/0x2c0
[  413.756613]  [813a09d9] ? br_handle_frame+0x1a9/0x280
[  413.756620]  [813a0830] ? br_handle_frame_finish+0x280/0x280
[  413.756627]  [81320ef7] ? __netif_receive_skb+0x157/0x5c0
[  413.756634]  [81321443] ? process_backlog+0xe3/0x1d0
[  413.756641]  [81321da5] ? net_rx_action+0xc5/0x1d0
[  413.756650]  [8103df11] ? __do_softirq+0x91/0x120
[  413.756657]  [813d838c] ? call_softirq+0x1c/0x30
[  413.756660] EOI  [81003cbd] ? do_softirq+0x4d/0x80
[  413.756673]  [81321ece] ? netif_rx_ni+0x1e/0x30
[  413.756681]  [812b3ae2] ? tun_chr_aio_write+0x332/0x4e0
[  413.756688]  [812b37b0] ? tun_sendmsg+0x4d0/0x4d0
[  413.756697]  [810c24e9] ? do_sync_readv_writev+0xa9/0xf0
[  413.756704]  [81063f9c] ? do_futex+0x13c/0xa70
[  413.756711]  [811d6730] ? timerqueue_add+0x60/0xb0
[  413.756719]  [81056ab7] ? __hrtimer_start_range_ns+0x1e7/0x410
[  413.756726]  [810c231b] ? rw_copy_check_uvector+0x7b/0x140
[  413.756734]  [810c2bcf] ? do_readv_writev+0xdf/0x210
[  413.756742]  [810c2e7e] ? sys_writev+0x4e/0xc0
[  413.756750]  [813d753b] ? system_call_fastpath+0x16/0x1b
[  413.756756] FIX kmalloc-1024: 

Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-01 Thread Avi Kivity

On 06/01/2011 12:40 PM, Avi Kivity wrote:


bridge and netfilter, IIRC this was also the problem last time.

Do you have any ebtables loaded?

Can you try building a kernel without ebtables?  Without netfilter at 
all?


Please run all tests with slub_debug=FZPU.



--
error compiling committee.c: too many arguments to function



Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-01 Thread Brad Campbell

On 01/06/11 17:41, Avi Kivity wrote:

On 06/01/2011 12:40 PM, Avi Kivity wrote:


bridge and netfilter, IIRC this was also the problem last time.

Do you have any ebtables loaded?


Never heard of them, but making a cursory check just in case..

brad@srv:/raid10/src/linux-2.6.39$ grep EBTABLE .config
# CONFIG_BRIDGE_NF_EBTABLES is not set


Can you try building a kernel without ebtables? Without netfilter at all?


Well, without netfilter I can't get it to crash. The problem is without 
netfilter I can't actually use it the way I use it to get it to crash.


I rebooted into a netfilter kernel, and did all the steps I'd used on 
the no-netfilter kernel and it ticked along happily.


So the result of the experiment is inconclusive. Having said that, the 
backtraces certainly smell networky.
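
One rough way to put a number on "networky" is to count the frames in a pasted backtrace by symbol prefix. This is only a sketch in Python: the prefix list (br_, nf_, tun_) is an arbitrary choice for this setup, count_prefixes is a made-up helper name, and it assumes frames look like "? symbol+0xoff/0xsize" as they do in the traces posted in this thread.

```python
import re
from collections import Counter

# Matches the "? symbol+0xOFF/0xSIZE" part of a call-trace line.
FRAME = re.compile(r"\?\s+(\w[\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def count_prefixes(trace_lines, prefixes=("br_", "nf_", "tun_")):
    """Count call-trace frames per symbol prefix; everything else is 'other'."""
    counts = Counter()
    for line in trace_lines:
        match = FRAME.search(line)
        if not match:
            continue  # not a frame line (header, separator, ...)
        # Ignore leading underscores so __br_forward counts as a bridge frame.
        symbol = match.group(1).lstrip("_")
        for prefix in prefixes:
            if symbol.startswith(prefix):
                counts[prefix] += 1
                break
        else:
            counts["other"] += 1
    return counts
```

Feeding it the lines of a trace gives a quick split between bridge, netfilter, tun and everything else, which is handy when comparing oopses from different kernels.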


To get it to crash, I have to start IE in the VM and https to the public 
address of the machine, which is then redirected by netfilter back into 
another of the VM's.


I can https directly to the other VM's address, but that does not cause 
it to crash, however without netfilter loaded I can't bounce off the 
public IP. It's all rather confusing really.


What next Sherlock?



Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-01 Thread Avi Kivity

On 06/01/2011 01:53 PM, Brad Campbell wrote:

On 01/06/11 17:41, Avi Kivity wrote:

On 06/01/2011 12:40 PM, Avi Kivity wrote:


bridge and netfilter, IIRC this was also the problem last time.

Do you have any ebtables loaded?


Never heard of them, but making a cursory check just in case..

brad@srv:/raid10/src/linux-2.6.39$ grep EBTABLE .config
# CONFIG_BRIDGE_NF_EBTABLES is not set

Can you try building a kernel without ebtables? Without netfilter at 
all?


Well, without netfilter I can't get it to crash. The problem is 
without netfilter I can't actually use it the way I use it to get it 
to crash.


I rebooted into a netfilter kernel, and did all the steps I'd used on 
the no-netfilter kernel and it ticked along happily.


So the result of the experiment is inconclusive. Having said that, the 
backtraces certainly smell networky.


To get it to crash, I have to start IE in the VM and https to the 
public address of the machine, which is then redirected by netfilter 
back into another of the VM's.


I can https directly to the other VM's address, but that does not 
cause it to crash, however without netfilter loaded I can't bounce off 
the public IP. It's all rather confusing really.


What next Sherlock?



Maybe the Sherlocks at netdev@ can tell.

--
error compiling committee.c: too many arguments to function



Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-01 Thread CaT
On Wed, Jun 01, 2011 at 06:53:31PM +0800, Brad Campbell wrote:
> I rebooted into a netfilter kernel, and did all the steps I'd used
> on the no-netfilter kernel and it ticked along happily.
>
> So the result of the experiment is inconclusive. Having said that,
> the backtraces certainly smell networky.
>
> To get it to crash, I have to start IE in the VM and https to the
> public address of the machine, which is then redirected by netfilter
> back into another of the VM's.
>
> I can https directly to the other VM's address, but that does not
> cause it to crash, however without netfilter loaded I can't bounce
> off the public IP. It's all rather confusing really.
>
> What next Sherlock?

I think you're hitting something I've seen. Can you try rewriting
your firewall rules so that it does not reference any bridge
interfaces at all. Instead, reference the real interface names
in their place. I'm betting it won't crash.

(netdev added to CC since we're already bouncing there)

-- 
  A search of his car uncovered pornography, a homemade sex aid, women's 
  stockings and a Jack Russell terrier.
- 
http://www.dailytelegraph.com.au/news/wacky/indeed/story-e6frev20-118083480


Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-01 Thread Brad Campbell

On 01/06/11 19:18, CaT wrote:

On Wed, Jun 01, 2011 at 06:53:31PM +0800, Brad Campbell wrote:

I rebooted into a netfilter kernel, and did all the steps I'd used
on the no-netfilter kernel and it ticked along happily.

So the result of the experiment is inconclusive. Having said that,
the backtraces certainly smell networky.

To get it to crash, I have to start IE in the VM and https to the
public address of the machine, which is then redirected by netfilter
back into another of the VM's.

I can https directly to the other VM's address, but that does not
cause it to crash, however without netfilter loaded I can't bounce
off the public IP. It's all rather confusing really.

What next Sherlock?


I think you're hitting something I've seen. Can you try rewriting
your firewall rules so that it does not reference any bridge
interfaces at all. Instead, reference the real interface names
in their place. I'm betting it won't crash.



Unfortunately the only interface that is mentioned by name anywhere in 
my firewall is $DMZ (which is ppp0 and not part of any bridge).


All of the nat/dnat and other horrible hacks are based on IP addresses.


Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-01 Thread CaT
On Wed, Jun 01, 2011 at 07:52:33PM +0800, Brad Campbell wrote:
> Unfortunately the only interface that is mentioned by name anywhere
> in my firewall is $DMZ (which is ppp0 and not part of any bridge).
>
> All of the nat/dnat and other horrible hacks are based on IP addresses.

Damn. Not referencing the bridge interfaces at all stopped our host from
going down in flames when we passed it a few packets. These are two
of the oopses we got from it. Whilst the kernel here is .35 we got the
same issue from a range of kernels. Seems related.

The oopses may be a bit weird. Copy and paste from an ipmi terminal.


slab error in cache_alloc_debugcheck_after(): cache `size-64': double free, or n
Pid: 2431, comm: kvm Tainted: G  D 2.6.35.9-local.20110314-141930 #1
Call Trace:
IRQ  [810fb8bf] ? __slab_error+0x1f/0x30
  [810fc22b] ? cache_alloc_debugcheck_after+0x6b/0x1f0
  [81530a00] ? br_nf_pre_routing_finish+0x0/0x370
  [8153106b] ? br_nf_pre_routing+0x2fb/0x980
  [810fdd3d] ? kmem_cache_alloc_notrace+0x7d/0xf0

  [8153106b] ? br_nf_pre_routing+0x2fb/0x980
  [81466e66] ? nf_iterate+0x66/0xb0
  [8152b9f0] ? br_handle_frame_finish+0x0/0x1c0
  [81466f14] ? nf_hook_slow+0x64/0xf0
  [8152b9f0] ? br_handle_frame_finish+0x0/0x1c0
  [8152bd3c] ? br_handle_frame+0x18c/0x250
  [81445459] ? __netif_receive_skb+0x169/0x2a0
  [81445673] ? process_backlog+0xe3/0x1d0
  [81446347] ? net_rx_action+0x87/0x1c0
  [810793f7] ? __do_softirq+0xa7/0x1d0
  [81035b8c] ? call_softirq+0x1c/0x30
EOI  [81037c6d] ? do_softirq+0x4d/0x80
  [81446b4e] ? netif_rx_ni+0x1e/0x30
  [8139541a] ? tun_chr_aio_write+0x36a/0x510
  [813950b0] ? tun_chr_aio_write+0x0/0x510
  [81102859] ? do_sync_readv_writev+0xa9/0xf0
  [810973fb] ? ktime_get+0x5b/0xe0
  [8104f958] ? lapic_next_event+0x18/0x20
  [8109be18] ? tick_dev_program_event+0x38/0x100
  [81102697] ? rw_copy_check_uvector+0x77/0x130
  [81102f0c] ? do_readv_writev+0xdc/0x200
  [8108dfec] ? sys_timer_settime+0x13c/0x2e0
  [8110317e] ? sys_writev+0x4e/0x90
  [81034d6b] ? system_call_fastpath+0x16/0x1b
8801e7621500: redzone 1:0xbf05bd010006, redzone 2:0x9f911029d74e35b

--

Code: 40 01 00 00 4c 8b a4 24 48 01 00 00 4c 8b ac 24 50 01 00 00 4c 8b b4 24 5
RIP  [81652c67] icmp_send+0x297/0x650
  RSP 880001c036b8
---[ end trace 9d3f7be7684ac91e ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 0, comm: swapper Tainted: G  D 2.6.35.9-local.20110314-144920 #2
Call Trace:
IRQ  [8170eada] ? panic+0x94/0x116
  [81711326] ? _raw_spin_lock_irqsave+0x26/0x40
  [8103a05f] ? oops_end+0xef/0xf0
  [81711a15] ? general_protection+0x25/0x30
  [81652c2f] ? icmp_send+0x25f/0x650
  [81652c67] ? icmp_send+0x297/0x650
  [815fd8e6] ? nf_iterate+0x66/0xb0
  [816dbfa0] ? br_nf_forward_finish+0x0/0x170
  [815fd994] ? nf_hook_slow+0x64/0xf0
  [816dbfa0] ? br_nf_forward_finish+0x0/0x170
  [816dc461] ? br_nf_forward_ip+0x201/0x3e0
  [815fd8e6] ? nf_iterate+0x66/0xb0
  [816d6620] ? br_forward_finish+0x0/0x60
  [815fd994] ? nf_hook_slow+0x64/0xf0
  [816d6620] ? br_forward_finish+0x0/0x60
  [816d66e9] ? __br_forward+0x69/0xb0
  [816d741a] ? br_handle_frame_finish+0x12a/0x280
  [816dcac8] ? br_nf_pre_routing_finish+0x208/0x370
  [815fd994] ? nf_hook_slow+0x64/0xf0
  [816dc8c0] ? br_nf_pre_routing_finish+0x0/0x370
  [816dc538] ? br_nf_forward_ip+0x2d8/0x3e0
  [816dd3b5] ? br_nf_pre_routing+0x785/0x980
  [815fd8e6] ? nf_iterate+0x66/0xb0
  [815fd994] ? nf_hook_slow+0x64/0xf0
  [816d72f0] ? br_handle_frame_finish+0x0/0x280
  [815fd994] ? nf_hook_slow+0x64/0xf0
  [816d72f0] ? br_handle_frame_finish+0x0/0x280
  [816d76fc] ? br_handle_frame+0x18c/0x250
  [815dec5b] ? __netif_receive_skb+0x1cb/0x350
  [8103d115] ? read_tsc+0x5/0x20
  [815dfa18] ? netif_receive_skb+0x78/0x80
  [815e0217] ? napi_gro_receive+0x27/0x40
  [815e01d8] ? napi_skb_finish+0x38/0x50
  [8152586d] ? bnx2_poll_work+0xd0d/0x13d0
  [8160c950] ? ctnetlink_conntrack_event+0x210/0x7d0
  [81092029] ? autoremove_wake_function+0x9/0x30
  [8109a71b] ? ktime_get+0x5b/0xe0
  [81526051] ? bnx2_poll+0x61/0x230
  [81051db8] ? lapic_next_event+0x18/0x20
  [815dfbef] ? net_rx_action+0x9f/0x200
  [8109636f] ? __hrtimer_start_range_ns+0x22f/0x410
  [8107c35f] ? __do_softirq+0xaf/0x1e0
  [810ab547] ? handle_IRQ_event+0x47/0x160
  [81036d5c] ? call_softirq+0x1c/0x30
  [81038c85] ? do_softirq+0x65/0xa0
  [8107c235] ? irq_exit+0x85/0x90
  

Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-05-31 Thread Brad Campbell

On 31/05/11 13:47, Borislav Petkov wrote:

Looks like a KSM issue. Disabling CONFIG_KSM should at least stop your
machine from oopsing.

Adding linux-mm.



I initially thought that, so the second panic was produced with KSM 
disabled from boot.


echo 0 > /sys/kernel/mm/ksm/run

If you still think that compiling ksm out of the kernel will prevent it 
then I'm willing to give it a go.


It's a production server, so I can only really bounce it around after 
about 9PM - GMT+8.


Regards,
Brad


Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-05-31 Thread Borislav Petkov
On Tue, May 31, 2011 at 05:26:10PM +0800, Brad Campbell wrote:
 On 31/05/11 13:47, Borislav Petkov wrote:
 Looks like a KSM issue. Disabling CONFIG_KSM should at least stop your
 machine from oopsing.
 
 Adding linux-mm.
 
 
 I initially thought that, so the second panic was produced with KSM
 disabled from boot.
 
 echo 0 > /sys/kernel/mm/ksm/run
 
 If you still think that compiling ksm out of the kernel will prevent
 it then I'm willing to give it a go.

Ok, from looking at the code, when KSM inits, it starts the ksm kernel
thread and it looks like your oops comes from the function that is run
in the kernel thread - ksm_scan_thread.

So even if you disable it from sysfs, it runs at least once.
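
As far as I understand the sysfs knob, writing 0 to run only tells ksmd to stop scanning; it does not remove the thread or undo merges that already happened. A small Python sketch of how the value can be interpreted follows. The meanings are taken from the kernel's KSM documentation as I remember it, and parse_ksm_run / read_ksm_run are illustrative names only.

```python
from pathlib import Path

# Meaning of the values accepted by /sys/kernel/mm/ksm/run
# (per the kernel's KSM documentation).
KSM_RUN_STATES = {
    0: "ksmd stopped; pages already merged stay merged",
    1: "ksmd running",
    2: "ksmd stopped; all merged pages are unmerged",
}


def parse_ksm_run(text):
    """Turn the raw contents of the 'run' file into (value, description)."""
    value = int(text.strip())
    return value, KSM_RUN_STATES.get(value, "unknown value")


def read_ksm_run(path="/sys/kernel/mm/ksm/run"):
    """Read the live setting; only works on a kernel built with CONFIG_KSM."""
    return parse_ksm_run(Path(path).read_text())
```

So a box that was booted with run set to 0 still has the KSM code and its kernel thread present, which is why compiling KSM out is a different experiment from switching it off via sysfs.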

Let's add some more people to Cc and see what happens :).

-- 
Regards/Gruss,
Boris.


Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-05-31 Thread Brad Campbell

On 31/05/11 18:38, Borislav Petkov wrote:

On Tue, May 31, 2011 at 05:26:10PM +0800, Brad Campbell wrote:

On 31/05/11 13:47, Borislav Petkov wrote:

Looks like a KSM issue. Disabling CONFIG_KSM should at least stop your
machine from oopsing.

Adding linux-mm.



I initially thought that, so the second panic was produced with KSM
disabled from boot.

echo 0 > /sys/kernel/mm/ksm/run

If you still think that compiling ksm out of the kernel will prevent
it then I'm willing to give it a go.


Ok, from looking at the code, when KSM inits, it starts the ksm kernel
thread and it looks like your oops comes from the function that is run
in the kernel thread - ksm_scan_thread.

So even if you disable it from sysfs, it runs at least once.



Just to confirm, I recompiled 2.6.38.7 without KSM enabled and I've been 
unable to reproduce the bug, so it looks like you were on the money.


I've moved back to 2.6.38.7 as 2.6.39 has a painful SCSI bug that panics 
about 75% of boots, and the reboot cycle required to luck my way 
into a working kernel is just too much hassle.


It would appear that XP zeroes its memory space on bootup, so there 
would be lots of pages to merge with a couple of relatively freshly 
booted XP machines running.


Regards,
Brad.


Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-05-31 Thread Hugh Dickins
On Tue, 31 May 2011, Brad Campbell wrote:
 On 31/05/11 18:38, Borislav Petkov wrote:
  On Tue, May 31, 2011 at 05:26:10PM +0800, Brad Campbell wrote:
   On 31/05/11 13:47, Borislav Petkov wrote:
Looks like a KSM issue. Disabling CONFIG_KSM should at least stop your
machine from oopsing.

Adding linux-mm.

   
   I initially thought that, so the second panic was produced with KSM
   disabled from boot.
   
   echo 0 > /sys/kernel/mm/ksm/run
   
   If you still think that compiling ksm out of the kernel will prevent
   it then I'm willing to give it a go.
  
  Ok, from looking at the code, when KSM inits, it starts the ksm kernel
  thread and it looks like your oops comes from the function that is run
  in the kernel thread - ksm_scan_thread.
  
  So even if you disable it from sysfs, it runs at least once.
  
 
 Just to confirm, I recompiled 2.6.38.7 without KSM enabled and I've been
 unable to reproduce the bug, so it looks like you were on the money.
 
 I've moved back to 2.6.38.7 as 2.6.39 has a painful SCSI bug that panics
 about 75% of boots, and the reboot cycle required to luck my way into a
 working kernel is just too much hassle.
 
 It would appear that XP zeroes its memory space on bootup, so there would be
 lots of pages to merge with a couple of relatively freshly booted XP
 machines running.

Thanks for the Cc, Borislav.

Brad, my suspicion is that in each case the top 16 bits of RDX have been
mysteriously corrupted from  to , causing the general protection
faults.  I don't understand what that has to do with KSM.
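
To illustrate why a flip in the top 16 bits ends in a general protection fault rather than an ordinary page fault: with 48-bit virtual addresses, x86-64 requires bits 63..47 to be all equal (a "canonical" address), and touching a non-canonical address raises #GP. Here is a small Python sketch. The pointer value in the usage note is made up for illustration and is not taken from the traces, and the 48-bit width is an assumption about the CPU in question.

```python
MASK64 = (1 << 64) - 1


def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits-1) of addr are all zeros or all ones."""
    addr &= MASK64
    top = addr >> (va_bits - 1)              # bit 47 and everything above it
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


def flip_top16(addr):
    """Invert the top 16 bits of a 64-bit value (a hypothetical corruption)."""
    return (addr ^ (0xFFFF << 48)) & MASK64
```

For a made-up kernel pointer such as 0xFFFF88040EB12840, is_canonical() is true, while is_canonical(flip_top16(...)) is false: bit 47 is still set but the bits above it no longer match, so any dereference faults immediately.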

But it's only a suspicion, because I can't make sense of the Code:
lines in your traces, they have more than the expected 64 bytes, and
only one of them has a  (with no ) to mark faulting instruction.

I did try compiling the 2.6.39 kernel from your config, but of course
we have different compilers, so although I got close, it wasn't exact.

Would you mind mailing me privately (it's about 73MB) the objdump -trd
output for your original vmlinux (with KSM on)?  (Those -trd options are
the ones I'm used to typing, I bet they're not all relevant.)

Of course, it's only a tiny fraction of that output that I need,
might be better to cut it down to remove_rmap_item_from_tree and
dup_fd and ksm_scan_thread, if you have the time to do so.
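
A rough sketch of cutting the objdump output down to a few functions, in Python. It assumes the usual "address <symbol>:" header line that objdump -d prints before each function; extract_functions is an illustrative name, and the exact output layout can differ between binutils versions.

```python
import re

# Header line printed by objdump before each function, e.g.
# "ffffffff81000000 <dup_fd>:"
HEADER = re.compile(r"^[0-9a-f]+ <([^>]+)>:\s*$")


def extract_functions(disasm_lines, wanted):
    """Return {name: [lines]} for each wanted function found in the listing."""
    wanted = set(wanted)
    found = {}
    current = None
    for raw in disasm_lines:
        line = raw.rstrip("\n")
        header = HEADER.match(line)
        if header:
            name = header.group(1)
            current = name if name in wanted else None
            if current is not None:
                found[current] = [line]
        elif line.startswith("Disassembly of section"):
            current = None  # a new section ends the previous function
        elif current is not None and line.strip():
            found[current].append(line)
    return found
```

Run over the lines of the saved objdump output with the names remove_rmap_item_from_tree, dup_fd and ksm_scan_thread, this should leave a few hundred lines instead of 73MB.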

Thanks,
Hugh


Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-05-31 Thread Brad Campbell

On 01/06/11 06:31, Hugh Dickins wrote:


Brad, my suspicion is that in each case the top 16 bits of RDX have been
mysteriously corrupted from  to , causing the general protection
faults.  I don't understand what that has to do with KSM.


No, nor do I. The panic I reproduced with KSM off was in a completely 
unrelated code path. To be honest I would not be surprised if it turns 
out I have dodgy RAM, although it has passed multiple memtests and I've 
tried clocking it down. Just a gut feeling.



But it's only a suspicion, because I can't make sense of the Code:
lines in your traces, they have more than the expected 64 bytes, and
only one of them has a  (with no) to mark faulting instruction.


Yeah, with hindsight I must have removed them when I re-formatted the 
code from the oops. Each byte was one line in the syslog so there was a 
lot of deleting to get it to a postable format.



I did try compiling the 2.6.39 kernel from your config, but of course
we have different compilers, so although I got close, it wasn't exact.

Would you mind mailing me privately (it's about 73MB) the objdump -trd
output for your original vmlinux (with KSM on)?  (Those -trd options are
the ones I'm used to typing, I bet they're not all relevant.)

Of course, it's only a tiny fraction of that output that I need,
might be better to cut it down to remove_rmap_item_from_tree and
dup_fd and ksm_scan_thread, if you have the time to do so.


Ok, so since my initial posting I've figured out how to get a clean oops 
out of netconsole, so tonight (after 9PM GMT+8) I'll reproduce the oops 
a couple of times. How about I upload the oops, plus the vmlinux, plus 
.config and System.map to a server with a fat pipe and give you a link 
to it?


At least I can reproduce it quickly and easily.



Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-05-31 Thread Brad Campbell

On 01/06/11 06:31, Hugh Dickins wrote:

Brad, my suspicion is that in each case the top 16 bits of RDX have been
mysteriously corrupted from  to , causing the general protection
faults.  I don't understand what that has to do with KSM.

But it's only a suspicion, because I can't make sense of the Code:
lines in your traces, they have more than the expected 64 bytes, and
only one of them has a  (with no) to mark faulting instruction.

I did try compiling the 2.6.39 kernel from your config, but of course
we have different compilers, so although I got close, it wasn't exact.

Would you mind mailing me privately (it's about 73MB) the objdump -trd
output for your original vmlinux (with KSM on)?  (Those -trd options are
the ones I'm used to typing, I bet they're not all relevant.)

Of course, it's only a tiny fraction of that output that I need,
might be better to cut it down to remove_rmap_item_from_tree and
dup_fd and ksm_scan_thread, if you have the time to do so.


Would you believe about 20 seconds after I pressed send the kernel oopsed.

http://www.fnarfbargle.com/private/003_kernel_oops/

oops reproduced here, but an un-munged version is in that directory 
alongside the kernel.


[36542.880228] general protection fault:  [#1] SMP
[36542.880271] last sysfs file: /sys/devices/pci:00/:00:18.3/temp1_input
[36542.880290] CPU 4
[36542.880301] Modules linked in: xt_iprange xt_DSCP xt_length xt_CLASSIFY sch_sfq xt_CHECKSUM ipt_REJECT ipt_MASQUERADE ipt_REDIRECT xt_recent xt_state iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 xt_TCPMSS xt_tcpmss xt_tcpudp iptable_mangle ip_tables x_tables pppoe pppox ppp_generic slhc cls_u32 sch_htb deflate zlib_deflate des_generic cbc ecb crypto_blkcipher sha1_generic md5 hmac crypto_hash cryptomgr aead crypto_algapi af_key fuse hwmon_vid netconsole configfs vhost_net powernow_k8 mperf kvm_amd kvm pl2303 usbserial xhci_hcd k10temp i2c_piix4 ahci usb_storage usb_libusual ohci_hcd ehci_hcd r8169 libahci usbcore mii sata_mv megaraid_sas [last unloaded: scsi_wait_scan]
[36542.880842]
[36542.880858] Pid: 13346, comm: bash Not tainted 2.6.38.7 #29 To Be Filled By O.E.M. To Be Filled By O.E.M./880G Extreme3
[36542.880911] RIP: 0010:[810cf0de]  [810cf0de] do_vfs_ioctl+0x5e/0x510
[36542.880948] RSP: 0018:8802d25a1ec8  EFLAGS: 00010206
[36542.880965] RAX: fff7 RBX: 88040eb12840 RCX: 7fff4fe4a4c0
[36542.880984] RDX: 5413 RSI: 5413 RDI: 00ff
[36542.881002] RBP: 00ff R08: 7fff4fe4a400 R09: 
[36542.881020] R10: 7fff4fe4a380 R11: 0246 R12: 7fff4fe4a4c0
[36542.881038] R13: 7fff4fe4a4c0 R14:  R15: 0001
[36542.881058] FS:  7f65f725b700() GS:8800dbd0() knlGS:
[36542.881081] CS:  0010 DS:  ES:  CR0: 80050033
[36542.881098] CR2: 01f01008 CR3: 0002d25c3000 CR4: 06e0
[36542.881116] DR0: 00a0 DR1:  DR2: 0003
[36542.881133] DR3: 00b0 DR6: 0ff0 DR7: 0400
[36542.881152] Process bash (pid: 13346, threadinfo 8802d25a, task 88041df88000)
[36542.881172] Stack:
[36542.881183]   88041df88218 0010 0001
[36542.881225]  0002 7fff4fe4a2c0 7fff4fe4a220 0002
[36542.881268]   81046d6a 88040eb12840 00ff
[36542.881312] Call Trace:
[36542.881333]  [81046d6a] ? sys_rt_sigaction+0x8a/0xc0
[36542.881351]  [810cf5d9] ? sys_ioctl+0x49/0x80
[36542.881373]  [810023fb] ? system_call_fastpath+0x16/0x1b
[36542.881389] Code: 76 7b 81 fa 77 58 04 c0 0f 84 77 01 00 00 0f 1f 80 00 00 00 00 0f 87 a2 00 00 00 81 fa 60 54 00 00 0f 1f 40 00 0f 84 ba 01 00 00 48 8b 43 18 48 8b 50 30 0f b7 02 25 00 f0 00 00 3d 00 80 00 00
[36542.881793] RIP  [810cf0de] do_vfs_ioctl+0x5e/0x510
[36542.881818]  RSP 8802d25a1ec8
[36542.882082] ---[ end trace 1b8d730cd479e388 ]---
[36542.882126] Kernel panic - not syncing: Fatal exception
[36542.882175] Pid: 13346, comm: bash Tainted: G  D 2.6.38.7 #29
[36542.88] Call Trace:
[36542.882269]  [813c7f42] ? panic+0x92/0x18a
[36542.882318]  [81039a41] ? kmsg_dump+0x41/0xf0
[36542.882366]  [810062bd] ? oops_end+0x8d/0xa0
[36542.882414]  [813caeef] ? general_protection+0x1f/0x30
[36542.882463]  [810cf0de] ? do_vfs_ioctl+0x5e/0x510
[36542.882511]  [81046d6a] ? sys_rt_sigaction+0x8a/0xc0
[36542.882560]  [810cf5d9] ? sys_ioctl+0x49/0x80
[36542.882608]  [810023fb] ? system_call_fastpath+0x16/0x1b
[36542.882688] Rebooting in 60 seconds..[   33.104725] fuse init (API version 7.16)


Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-05-31 Thread Andrea Arcangeli
Hello,

On Wed, Jun 01, 2011 at 08:37:25AM +0800, Brad Campbell wrote:
 On 01/06/11 06:31, Hugh Dickins wrote:
  Brad, my suspicion is that in each case the top 16 bits of RDX have been
  mysteriously corrupted from  to , causing the general protection
  faults.  I don't understand what that has to do with KSM.
 
  But it's only a suspicion, because I can't make sense of the Code:
  lines in your traces, they have more than the expected 64 bytes, and
  only one of them has a  (with no) to mark faulting instruction.
 
  I did try compiling the 2.6.39 kernel from your config, but of course
  we have different compilers, so although I got close, it wasn't exact.
 
  Would you mind mailing me privately (it's about 73MB) the objdump -trd
  output for your original vmlinux (with KSM on)?  (Those -trd options are
  the ones I'm used to typing, I bet they're not all relevant.)
 
  Of course, it's only a tiny fraction of that output that I need,
  might be better to cut it down to remove_rmap_item_from_tree and
  dup_fd and ksm_scan_thread, if you have the time to do so.
 
 Would you believe about 20 seconds after I pressed send the kernel oopsed.
 
 http://www.fnarfbargle.com/private/003_kernel_oops/
 
 oops reproduced here, but an un-munged version is in that directory 
 alongside the kernel.
 
 [36542.880228] general protection fault:  [#1] SMP

Reminds me of another oops that was reported on the kvm list for
2.6.38.1 with message id 4D8C6110.6090204. There the top 16 bits of
rsi were flipped and it was a general protection too because of
hitting on the not mappable virtual range.

http://www.virtall.com/files/temp/kvm.txt
http://www.virtall.com/files/temp/config-2.6.38.1
http://virtall.com/files/temp/mmu-objdump.txt

That oops happened in kvm_unmap_rmapp though. It looked like memory
corruption (Avi suggested use after free), but it was a production
system so we couldn't debug it further.

I recommend the next thing is to reproduce again with 2.6.39 or
3.0.0-rc1. Let's fix your scsi trouble if needed, but it's better that
you test with 2.6.39.

We'd need chmod +r vmlinux on private/003_kernel_oops/

Thanks,
Andrea


Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-05-31 Thread Brad Campbell

On 01/06/11 09:15, Andrea Arcangeli wrote:
 Hello,
 
 On Wed, Jun 01, 2011 at 08:37:25AM +0800, Brad Campbell wrote:
  On 01/06/11 06:31, Hugh Dickins wrote:
   Brad, my suspicion is that in each case the top 16 bits of RDX have been
   mysteriously corrupted from  to , causing the general protection
   faults.  I don't understand what that has to do with KSM.
   
   But it's only a suspicion, because I can't make sense of the Code:
   lines in your traces, they have more than the expected 64 bytes, and
   only one of them has a  (with no) to mark faulting instruction.
   
   I did try compiling the 2.6.39 kernel from your config, but of course
   we have different compilers, so although I got close, it wasn't exact.
   
   Would you mind mailing me privately (it's about 73MB) the objdump -trd
   output for your original vmlinux (with KSM on)?  (Those -trd options are
   the ones I'm used to typing; I bet they're not all relevant.)
   
   Of course, it's only a tiny fraction of that output that I need,
   might be better to cut it down to remove_rmap_item_from_tree and
   dup_fd and ksm_scan_thread, if you have the time to do so.
  
  Would you believe about 20 seconds after I pressed send the kernel oopsed.
  
  http://www.fnarfbargle.com/private/003_kernel_oops/
  
  oops reproduced here, but an un-munged version is in that directory
  alongside the kernel.
  
  [36542.880228] general protection fault:  [#1] SMP
 
 This reminds me of another oops that was reported on the kvm list for
 2.6.38.1, with message id 4D8C6110.6090204. There the top 16 bits of
 rsi were flipped, and it was a general protection fault too, caused by
 hitting the not-mappable virtual range.
 
 http://www.virtall.com/files/temp/kvm.txt
 http://www.virtall.com/files/temp/config-2.6.38.1
 http://virtall.com/files/temp/mmu-objdump.txt
 
 That oops happened in kvm_unmap_rmapp though. It looked like memory
 corruption (Avi suggested use after free), but it was a production
 system so we couldn't debug it further.
 
 I recommend the next thing is to reproduce again with 2.6.39 or
 3.0.0-rc1. Let's fix your scsi trouble if needed, but it's better that
 you test with 2.6.39.
 
 We'd need chmod +r vmlinux on private/003_kernel_oops/

Ok, here we go then.

http://www.fnarfbargle.com/private/004_kernel_oops/

The permissions are right this time.
2.6.39 + KSM

[  694.227866] general protection fault:  [#1] SMP
[  694.228001] last sysfs file: /sys/devices/platform/w83627ehf.656/cpu0_vid
[  694.228050] CPU 3
[  694.228091] Modules linked in: xt_iprange xt_DSCP xt_length 
xt_CLASSIFY sch_sfq xt_CHECKSUM ipt_REJECT ipt_MASQUERADE ipt_REDIRECT 
xt_recent xt_state iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 
nf_conntrack nf_defrag_ipv4 xt_TCPMSS xt_tcpmss xt_tcpudp iptable_mangle 
ip_tables x_tables pppoe pppox ppp_generic slhc cls_u32 sch_htb deflate 
zlib_deflate des_generic cbc ecb crypto_blkcipher sha1_generic md5 hmac 
crypto_hash cryptomgr aead crypto_algapi af_key fuse w83627ehf hwmon_vid 
netconsole configfs vhost_net powernow_k8 mperf kvm_amd kvm pl2303 
usbserial i2c_piix4 k10temp xhci_hcd usb_storage usb_libusual ohci_hcd 
r8169 ehci_hcd ahci usbcore sata_mv mii libahci megaraid_sas [last 
unloaded: scsi_wait_scan]

[  694.230897]
[  694.230944] Pid: 11841, comm: keepalive Not tainted 2.6.39 #3 To Be 
Filled By O.E.M. To Be Filled By O.E.M./880G Extreme3
[  694.23] RIP: 0010:[810db878]  [810db878] 
dup_fd+0x168/0x300

[  694.231210] RSP: 0018:8802f524fdd0  EFLAGS: 00010206
[  694.231258] RAX: 07f8 RBX: 8802f5721b80 RCX: 
bfff
[  694.231308] RDX: 8802f51cacc0 RSI: 00ff RDI: 
0800
[  694.231358] RBP: 8803bf419800 R08: 88030167f6c0 R09: 
0003
[  694.231407] R10: 0001 R11: 4000 R12: 
0100
[  694.231457] R13: 880417aa9800 R14: 88030167f440 R15: 
8803bd8c1600
[  694.231507] FS:  7f02cfc32700() GS:88041fcc() 
knlGS:

[  694.231560] CS:  0010 DS:  ES:  CR0: 8005003b
[  694.231609] CR2: 7f02cf5d4810 CR3: 0002f52c3000 CR4: 
06e0
[  694.231657] DR0: 0045 DR1:  DR2: 

[  694.231707] DR3: 0005 DR6: 0ff0 DR7: 
0400
[  694.231757] Process keepalive (pid: 11841, threadinfo 
8802f524e000, task 8802f5143690)

[  694.231809] Stack:
[  694.231852]  8802f5143690 0020 8802f56badc0 
8802f5721b90
[  694.232050]  880417aa54e0 01200011 880417aa54e0 

[  694.232248]  7f02cfc329d0 8802f5143690  
81037645

[  694.232448] Call Trace:
[  694.232499]  [81037645] ? copy_process+0xa75/0xfd0
[  694.232549]  [81037c0d] ? do_fork+0x6d/0x2b0
[  694.232599]  [810457a9] ? sigprocmask+0x69/0x100
[  694.232651]  [813d0ca3] ? stub_clone+0x13/0x20
[  694.232699]  [813d0a3b] ? system_call_fastpath+0x16/0x1b
[  694.232745] Code: 4c 89 c2 e8 

Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-05-31 Thread Hugh Dickins
On Wed, 1 Jun 2011, Andrea Arcangeli wrote:
 On Wed, Jun 01, 2011 at 08:37:25AM +0800, Brad Campbell wrote:
  On 01/06/11 06:31, Hugh Dickins wrote:
   Brad, my suspicion is that in each case the top 16 bits of RDX have been
   mysteriously corrupted from  to , causing the general protection
   faults.  I don't understand what that has to do with KSM.
  
   But it's only a suspicion, because I can't make sense of the Code:
   lines in your traces, they have more than the expected 64 bytes, and
   only one of them has a  (with no) to mark faulting instruction.
  
   I did try compiling the 2.6.39 kernel from your config, but of course
   we have different compilers, so although I got close, it wasn't exact.
  
   Would you mind mailing me privately (it's about 73MB) the objdump -trd
   output for your original vmlinux (with KSM on)?  (Those -trd options are
   the ones I'm used to typing; I bet they're not all relevant.)
  
   Of course, it's only a tiny fraction of that output that I need,
   might be better to cut it down to remove_rmap_item_from_tree and
   dup_fd and ksm_scan_thread, if you have the time to do so.
  
  Would you believe about 20 seconds after I pressed send the kernel oopsed.
  
  http://www.fnarfbargle.com/private/003_kernel_oops/
  
  oops reproduced here, but an un-munged version is in that directory 
  alongside the kernel.
  
  [36542.880228] general protection fault:  [#1] SMP
 
 This reminds me of another oops that was reported on the kvm list for
 2.6.38.1, with message id 4D8C6110.6090204. There the top 16 bits of
 rsi were flipped, and it was a general protection fault too, caused by
 hitting the not-mappable virtual range.
 
 http://www.virtall.com/files/temp/kvm.txt
 http://www.virtall.com/files/temp/config-2.6.38.1
 http://virtall.com/files/temp/mmu-objdump.txt
 
 That oops happened in kvm_unmap_rmapp though. It looked like memory
 corruption (Avi suggested use after free), but it was a production
 system so we couldn't debug it further.
 
 I recommend the next thing is to reproduce again with 2.6.39 or
 3.0.0-rc1. Let's fix your scsi trouble if needed, but it's better that
 you test with 2.6.39.

Brad, thanks for this and the other further crash, with vmlinux etc:
very helpful info.

Andrea, I'm pretty sure you're right to connect Brad's report with
the one above.

In four out of five of Brad's reports (cannot tell in the fifth),
the bad pointer (with top 16 bits  instead of ) had been
loaded from SLUB memory at an address offset 0x7f8 (1 case) or
0xff8 (3 cases) i.e. it's the short at 0x7fe or 0xffe that has
been zeroed.
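
The offset arithmetic above can be checked mechanically. x86-64 is little-endian, so a 64-bit pointer stored at offset 0x7f8 keeps its most significant 16 bits in the two bytes at 0x7fe and 0x7ff; zeroing that short is exactly what clears the top of the pointer. The sketch below is only an illustration: the buffer stands in for a slab page, and the pointer value and helper names are made up.

```python
import struct

PTR_SIZE = 8  # bytes in an x86-64 pointer


def top16_byte_offset(ptr_offset: int) -> int:
    """Offset of the 16-bit word holding bits 63..48 of a little-endian pointer."""
    return ptr_offset + PTR_SIZE - 2


def zero_short(buf: bytearray, offset: int) -> None:
    """Overwrite the 16-bit word at `offset` with zero (the suspected stray write)."""
    struct.pack_into("<H", buf, offset, 0)


def demo(ptr_offset: int, ptr: int = 0xFFFF880012345678) -> int:
    """Store a hypothetical pointer, zero its top short, and read it back."""
    buf = bytearray(0x1000)  # stand-in for one 4 KiB page of slab memory
    struct.pack_into("<Q", buf, ptr_offset, ptr)
    zero_short(buf, top16_byte_offset(ptr_offset))
    return struct.unpack_from("<Q", buf, ptr_offset)[0]


print(hex(top16_byte_offset(0x7F8)))  # 0x7fe
print(hex(top16_byte_offset(0xFF8)))  # 0xffe
print(hex(demo(0x7F8)))               # 0x880012345678
```

The low 48 bits survive untouched, which matches the observation that the victims differ (KSM, the file table) while the damage always looks the same.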

No reason to suspect KSM's rmap_item code, or file table handling:
they just seem to be the victims of corruption from elsewhere.

I notice %rax and %rsi, the corrupted pointer in your kvm.txt
case, is itself a ...7f8 address; and %r13 an ...ff8 address.
I've not even glanced at the code, but I wonder if that implies
that KVM is close to the origin of the corruption.

I doubt I'll be able to spend more time on this, hope you can
take over.

I guess Brad could try SLUB debugging, boot with slub_debug=P
for poisoning perhaps; though it might upset alignments and
drive the problem underground.  Or see if the same happens
with SLAB instead of SLUB.
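
To make the poisoning suggestion concrete, here is a toy model of the idea, not the kernel's implementation. A freed object is filled with a known pattern, and any byte that no longer matches when the object is next inspected points at a stray write. The byte values 0x6b and 0xa5 are what I believe include/linux/poison.h uses for POISON_FREE and POISON_END; treat them, and the helper names, as assumptions.

```python
POISON_FREE = 0x6B  # assumed fill byte for freed objects
POISON_END = 0xA5   # assumed marker for the last byte of a poisoned object


def poison(size: int) -> bytearray:
    """Return a freed object of `size` bytes filled with the poison pattern."""
    return bytearray([POISON_FREE] * (size - 1) + [POISON_END])


def find_corruption(obj: bytearray) -> list[int]:
    """Return the offsets whose bytes no longer match the poison pattern."""
    expected = poison(len(obj))
    return [i for i, (got, want) in enumerate(zip(obj, expected)) if got != want]


# Simulate the suspected bug: something zeroes the short at 0x7fe
# in an object that has already been freed.
freed = poison(0x800)
freed[0x7FE] = 0
freed[0x7FF] = 0

print([hex(off) for off in find_corruption(freed)])  # ['0x7fe', '0x7ff']
```

With poisoning enabled, a write like this would be reported when the object is checked, rather than surfacing later as a bad pointer in an unrelated subsystem.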

But I rather hope that you or someone will understand the 7fe clue.

Thanks,
Hugh


KVM induced panic on 2.6.38[2367] 2.6.39

2011-05-30 Thread Brad Campbell

G'day all,

I'm running a pretty standard home server
x86_64 Phenom-II 6 Core
16GB DDR 3

I run some virtual machines under that. 3 x Debian 64 Bit, 1 x XP 32 
Bit. These run at boot.


When I fire up another XP 32 bit instance and play with it for more than 
about 2 minutes, I get the panics included in this mail.


I've included three of them here. The first and third are as booted. The 
second was with ksmd disabled just as a data point.


The machine passes every load test and memory test I can throw at it, 
but I still can't rule out this being a hardware issue.


Provided I don't start this XP VM the machine is quite stable, but 
running this VM will kill it within minutes.


This was tested with qemu-kvm.
The last commit in the git tree was
commit c007db193eb6b2557acb5caf2dc4d7023639e6f3
Author: Avi Kivity a...@redhat.com
Date:   Sun May 29 09:00:42 2011 -0400
(I pulled it yesterday)

These panics were captured with netconsole to a remote syslog daemon, 
and the formatting was ruined, so I've reformatted them by hand prior to 
posting.


I've tested and reproduced this on 2.6.38.[2,3,6 & 7] and 2.6.39 obviously.

Can anyone help shed some light on this?

Regards,
Brad

[ 438.632061] general protection fault:  [#1] SMP
[ 438.632196] last sysfs file: /sys/module/x_tables/initstate
[ 438.632242] CPU 4
[ 438.632282] Modules linked in: xt_iprange xt_DSCP xt_length 
xt_CLASSIFY sch_sfq xt_CHECKSUM ipt_REJECT ipt_MASQUERADE ipt_REDIRECT 
xt_recent xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack 
nf_defrag_ipv4 iptable_filter xt_TCPMSS xt_tcpmss xt_tcpudp 
iptable_mangle ip_tables x_tables pppoe pppox ppp_generic slhc cls_u32 
sch_htb deflate zlib_deflate des_generic cbc ecb crypto_blkcipher 
sha1_generic md5 hmac crypto_hash cryptomgr aead crypto_algapi af_key 
fuse netconsole configfs vhost_net powernow_k8 mperf i2c_nforce2 
kvm_amd kvm pl2303 usbserial xhci_hcd k10temp i2c_piix4 usb_storage 
usb_libusual ohci_hcd ehci_hcd usbcore ahci libahci r8169 mii sata_mv 
megaraid_sas [last unloaded: scsi_wait_scan]

[ 438.634960]
[ 438.635006] Pid: 551, comm: ksmd Not tainted 2.6.39 #3 To Be Filled By 
O.E.M. To Be Filled By O.E.M. /880G Extreme3
[ 438.635170] RIP: 0010:[810b4596] [810b4596] 
remove_rmap_item_from_tree+0x96/0x150

[ 438.635268] RSP: 0018:88041c065e20  EFLAGS: 00010282
[ 438.635314] RAX: 8804153bd8b0 RBX: 8804176c3fc0 RCX: 
00057754
[ 438.635362] RDX: 880415418030 RSI: 880414b65003 RDI: 
ea000dde6030
[ 438.635410] RBP: 880414b65000 R08: 00057755 R09: 
1bbde1bd
[ 438.635458] R10: 1c28e0c3 R11: 0002 R12: 
ea000dde6030
[ 438.635506] R13: 8804176c3f80 R14: 8804151ed7b0 R15: 
88041bf23be0
[ 438.63] FS:  7f617772e700() GS:88041fd0() 
knlGS:

[ 438.635607] CS:  0010 DS:  ES:  CR0: 8005003b
[ 438.635654] CR2: 00e7 CR3: 01583000 CR4: 
06e0
[ 438.635703] DR0: 0045 DR1:  DR2: 

[ 438.635750] DR3: 0005 DR6: 0ff0 DR7: 
0400
[ 438.635799] Process ksmd (pid: 551, threadinfo 88041c064000, task 
88041d8caa70)

[ 438.635851] Stack:
[ 438.635893]  ea000c43c408 8804176c3fc0 036c 
810b58f2
[ 438.636088]  88041c782a00 7fc2d81b5000 88041c064000 
003808ff
[ 438.636281]  88041c064000 88041c065e98 88041d8caa70 
8804165d4480

[ 438.636479] Call Trace:
[ 438.636528]  [810b58f2] ? ksm_scan_thread+0x4e2/0xc20
[ 438.636580]  [81052a20] ? wake_up_bit+0x40/0x40
[ 438.636628]  [810b5410] ? try_to_merge_with_ksm_page+0x570/0x570
[ 438.636679]  [810b5410] ? try_to_merge_with_ksm_page+0x570/0x570
[ 438.636730]  [810525b6] ? kthread+0x96/0xa0
[ 438.636781]  [813d1794] ? kernel_thread_helper+0x4/0x10
[ 438.636832]  [81052520] ? kthread_worker_fn+0x120/0x120
[ 438.636882]  [813d1790] ? gs_change+0xb/0xb
[ 438.636926] Code: 28 48 89 ef e8 6c fe ff ff 48 85 c0 49 89 c4 74 d2 
f0 0f ba 28 00 19 c0 85 c0 0f 85 ae 00 00 00 00 48 8b 43 30 48 8b 53 38 
48 85 c0

[ 438.638504] 89 02 74 04 48 89 50 08 48 ba 00 01 10 00 00 00 00 ad de 48 b8
[ 438.639329] RIP [810b4596] remove_rmap_item_from_tree+0x96/0x150
[ 438.639414]  RSP 88041c065e20
[ 438.639460] ---[ end trace c29fb871f6b874e3 ]---
[ 438.639506] Kernel panic - not syncing: Fatal exception
[ 438.639553] Pid: 551, comm: ksmd Tainted: G  D 2.6.39 #3
[ 438.639598] Call Trace:
[ 438.639644]  [813cd6f5] ? panic+0x92/0x18a
[ 438.639693]  [81038b61] ? kmsg_dump+0x41/0xf0
[ 438.639745]  [810050ad] ? oops_end+0x8d/0xa0
[ 438.639794]  [813d05ef] ? general_protection+0x1f/0x30
[ 438.639842]  [810b4596] ? remove_rmap_item_from_tree+0x96/0x150
[ 438.639891]  [810b58f2] ? ksm_scan_thread+0x4e2/0xc20
[ 438.639941]  

Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-05-30 Thread Borislav Petkov
Looks like a KSM issue. Disabling CONFIG_KSM should at least stop your
machine from oopsing.

Adding linux-mm.

On Tue, May 31, 2011 at 09:24:03AM +0800, Brad Campbell wrote:
 G'day all,
 
 I'm running a pretty standard home server
 x86_64 Phenom-II 6 Core
 16GB DDR 3
 
 I run some virtual machines under that. 3 x Debian 64 Bit, 1 x XP 32
 Bit. These run at boot.
 
 When I fire up another XP 32 bit instance and play with it for more
 than about 2 minutes, I get the panics included in this mail.
 
 I've included three of them here. The first and third are as booted.
 The second was with ksmd disabled just as a data point.
 
 The machine passes every load test and memory test I can throw at
 it, but I still can't rule out this being a hardware issue.
 
 Provided I don't start this XP VM the machine is quite stable, but
 running this VM will kill it within minutes.
 
 This was tested with qemu-kvm.
 The last commit in the git tree was
 commit c007db193eb6b2557acb5caf2dc4d7023639e6f3
 Author: Avi Kivity a...@redhat.com
 Date:   Sun May 29 09:00:42 2011 -0400
 (I pulled it yesterday)
 
 These panics were captured with netconsole to a remote syslog
 daemon, and the formatting was ruined, so I've reformatted them by
 hand prior to posting.
 
 I've tested and reproduced this on 2.6.38.[2,3,6 & 7] and 2.6.39 obviously.
 
 Can anyone help shed some light on this?
 
 Regards,
 Brad
 
 [ 438.632061] general protection fault:  [#1] SMP
 [ 438.632196] last sysfs file: /sys/module/x_tables/initstate
 [ 438.632242] CPU 4
 [ 438.632282] Modules linked in: xt_iprange xt_DSCP xt_length
 xt_CLASSIFY sch_sfq xt_CHECKSUM ipt_REJECT ipt_MASQUERADE
 ipt_REDIRECT xt_recent xt_state iptable_nat nf_nat nf_conntrack_ipv4
 nf_conntrack nf_defrag_ipv4 iptable_filter xt_TCPMSS xt_tcpmss
 xt_tcpudp iptable_mangle ip_tables x_tables pppoe pppox ppp_generic
 slhc cls_u32 sch_htb deflate zlib_deflate des_generic cbc ecb
 crypto_blkcipher sha1_generic md5 hmac crypto_hash cryptomgr aead
 crypto_algapi af_key fuse netconsole configfs vhost_net powernow_k8
 mperf i2c_nforce2 kvm_amd kvm pl2303 usbserial xhci_hcd k10temp
 i2c_piix4 usb_storage usb_libusual ohci_hcd ehci_hcd usbcore ahci
 libahci r8169 mii sata_mv megaraid_sas [last unloaded:
 scsi_wait_scan]
 [ 438.634960]
 [ 438.635006] Pid: 551, comm: ksmd Not tainted 2.6.39 #3 To Be
 Filled By O.E.M. To Be Filled By O.E.M. /880G Extreme3
 [ 438.635170] RIP: 0010:[810b4596] [810b4596]
 remove_rmap_item_from_tree+0x96/0x150
 [ 438.635268] RSP: 0018:88041c065e20  EFLAGS: 00010282
 [ 438.635314] RAX: 8804153bd8b0 RBX: 8804176c3fc0 RCX:
 00057754
 [ 438.635362] RDX: 880415418030 RSI: 880414b65003 RDI:
 ea000dde6030
 [ 438.635410] RBP: 880414b65000 R08: 00057755 R09:
 1bbde1bd
 [ 438.635458] R10: 1c28e0c3 R11: 0002 R12:
 ea000dde6030
 [ 438.635506] R13: 8804176c3f80 R14: 8804151ed7b0 R15:
 88041bf23be0
 [ 438.63] FS:  7f617772e700() GS:88041fd0()
 knlGS:
 [ 438.635607] CS:  0010 DS:  ES:  CR0: 8005003b
 [ 438.635654] CR2: 00e7 CR3: 01583000 CR4:
 06e0
 [ 438.635703] DR0: 0045 DR1:  DR2:
 
 [ 438.635750] DR3: 0005 DR6: 0ff0 DR7:
 0400
 [ 438.635799] Process ksmd (pid: 551, threadinfo 88041c064000,
 task 88041d8caa70)
 [ 438.635851] Stack:
 [ 438.635893]  ea000c43c408 8804176c3fc0 036c
 810b58f2
 [ 438.636088]  88041c782a00 7fc2d81b5000 88041c064000
 003808ff
 [ 438.636281]  88041c064000 88041c065e98 88041d8caa70
 8804165d4480
 [ 438.636479] Call Trace:
 [ 438.636528]  [810b58f2] ? ksm_scan_thread+0x4e2/0xc20
 [ 438.636580]  [81052a20] ? wake_up_bit+0x40/0x40
 [ 438.636628]  [810b5410] ? try_to_merge_with_ksm_page+0x570/0x570
 [ 438.636679]  [810b5410] ? try_to_merge_with_ksm_page+0x570/0x570
 [ 438.636730]  [810525b6] ? kthread+0x96/0xa0
 [ 438.636781]  [813d1794] ? kernel_thread_helper+0x4/0x10
 [ 438.636832]  [81052520] ? kthread_worker_fn+0x120/0x120
 [ 438.636882]  [813d1790] ? gs_change+0xb/0xb
 [ 438.636926] Code: 28 48 89 ef e8 6c fe ff ff 48 85 c0 49 89 c4 74
 d2 f0 0f ba 28 00 19 c0 85 c0 0f 85 ae 00 00 00 00 48 8b 43 30 48 8b
 53 38 48 85 c0
 [ 438.638504] 89 02 74 04 48 89 50 08 48 ba 00 01 10 00 00 00 00 ad de 48 b8
 [ 438.639329] RIP [810b4596] remove_rmap_item_from_tree+0x96/0x150
 [ 438.639414]  RSP 88041c065e20
 [ 438.639460] ---[ end trace c29fb871f6b874e3 ]---
 [ 438.639506] Kernel panic - not syncing: Fatal exception
 [ 438.639553] Pid: 551, comm: ksmd Tainted: G  D 2.6.39 #3
 [ 438.639598] Call Trace:
 [ 438.639644]  [813cd6f5] ? panic+0x92/0x18a
 [ 438.639693]  [81038b61] ? kmsg_dump+0x41/0xf0
 [ 438.639745]  [810050ad] ?