Processed: Re: Bug#542250: repeatable crashes while copying 500G from NFS mount to local logical volume

2009-08-21 Thread Debian Bug Tracking System
Processing commands for cont...@bugs.debian.org:

> tags 542250 +patch
Bug #542250 [src:linux-2.6] repeatable crashes while copying 500G from NFS 
mount to local logical volume
Bug #516479 [src:linux-2.6] linux-image-2.6.26-1-xen-amd64: kernel-panic in 
xen_spin_wait an mutlicore dom0 with high load, not interruption save?
Ignoring request to alter tags of bug #542250 to the same tags previously set
Ignoring request to alter tags of bug #516479 to the same tags previously set
> thanks
Stopping processing here.

Please contact me if you need assistance.

Debian bug tracking system administrator
(administrator, Debian Bugs database)


-- 
To UNSUBSCRIBE, email to debian-kernel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Re: Bug#542250: repeatable crashes while copying 500G from NFS mount to local logical volume

2009-08-19 Thread Nikita V. Youshchenko
> This asserts that if we spin on a lock after interrupting another spin,
> and interrupts are enabled, we must be in a softirq.

Looking at the bottom of the same file drivers/xen/core/spinlock.c:

void xen_spin_kick(raw_spinlock_t *lock, unsigned int token)
{
    unsigned int cpu;

    token &= (1U << TICKET_SHIFT) - 1;
    for_each_online_cpu(cpu) {
        if (spinning(&per_cpu(spinning, cpu), cpu, lock, token))
            return;
        if (in_interrupt()
            && spinning(&per_cpu(spinning_bh, cpu), cpu, lock, token))
            return;
        if (raw_irqs_disabled()
            && spinning(&per_cpu(spinning_irq, cpu), cpu, lock, token))
            return;
    }
}
EXPORT_SYMBOL(xen_spin_kick);

... I may guess that line 74 should check for in_interrupt() instead of 
in_softirq().

However it is just a guess based on analogy.
I don't currently understand the logic of that code.





Re: Bug#542250: repeatable crashes while copying 500G from NFS mount to local logical volume

2009-08-19 Thread Nikita V. Youshchenko
tags 542250 +patch
thanks

> ... I may guess that line 74 should check for in_interrupt() instead of
> in_softirq().

I've tried that and it really fixed the problem. The server has already been
running the same backup procedure for several hours. Previously it crashed
within 15 minutes.

Here is the patch I've applied:

--- a/drivers/xen/core/spinlock.c   2009-08-19 16:20:17.0 +0400
+++ b/drivers/xen/core/spinlock.c   2009-08-19 17:36:55.0 +0400
@@ -71,7 +71,7 @@
             BUG_ON(__get_cpu_var(spinning_bh).lock == lock);
             spinning = &__get_cpu_var(spinning_irq);
         } else {
-            BUG_ON(!in_softirq());
+            BUG_ON(!in_interrupt());
             spinning = &__get_cpu_var(spinning_bh);
         }
         BUG_ON(spinning->lock);


signature.asc
Description: This is a digitally signed message part.


Processed: Re: Bug#542250: repeatable crashes while copying 500G from NFS mount to local logical volume

2009-08-19 Thread Debian Bug Tracking System
Processing commands for cont...@bugs.debian.org:

> tags 542250 +patch
Bug #542250 [linux-image-2.6.26-2-xen-amd64] repeatable crashes while copying 
500G from NFS mount to local logical volume
Added tag(s) patch.
> thanks
Stopping processing here.

Please contact me if you need assistance.

Debian bug tracking system administrator
(administrator, Debian Bugs database)





Re: Bug#542250: repeatable crashes while copying 500G from NFS mount to local logical volume

2009-08-19 Thread Ben Hutchings
On Wed, 2009-08-19 at 22:36 +0400, Nikita V. Youshchenko wrote:
> tags 542250 +patch
> thanks
>
> > ... I may guess that line 74 should check for in_interrupt() instead of
> > in_softirq().
>
> I've tried that and it really fixed the problem. The server has already been
> running the same backup procedure for several hours. Previously it crashed
> within 15 minutes.
>
> Here is the patch I've applied:
>
> --- a/drivers/xen/core/spinlock.c   2009-08-19 16:20:17.0 +0400
> +++ b/drivers/xen/core/spinlock.c   2009-08-19 17:36:55.0 +0400
> @@ -71,7 +71,7 @@
>              BUG_ON(__get_cpu_var(spinning_bh).lock == lock);
>              spinning = &__get_cpu_var(spinning_irq);
>          } else {
> -            BUG_ON(!in_softirq());
> +            BUG_ON(!in_interrupt());
>              spinning = &__get_cpu_var(spinning_bh);
>          }
>          BUG_ON(spinning->lock);

I'm glad it works for you, but it isn't a proper fix.

Ben.

-- 
Ben Hutchings
If at first you don't succeed, you're doing about average.




Re: Bug#542250: repeatable crashes while copying 500G from NFS mount to local logical volume

2009-08-19 Thread Nikita V. Youshchenko
> On Wed, 2009-08-19 at 22:36 +0400, Nikita V. Youshchenko wrote:
> > tags 542250 +patch
> > thanks
> >
> > > ... I may guess that line 74 should check for in_interrupt() instead
> > > of in_softirq().
> >
> > I've tried that and it really fixed the problem. The server has already
> > been running the same backup procedure for several hours. Previously it
> > crashed within 15 minutes.
> >
> > Here is the patch I've applied:
> >
> > --- a/drivers/xen/core/spinlock.c   2009-08-19 16:20:17.0 +0400
> > +++ b/drivers/xen/core/spinlock.c   2009-08-19 17:36:55.0 +0400
> > @@ -71,7 +71,7 @@
> >              BUG_ON(__get_cpu_var(spinning_bh).lock == lock);
> >              spinning = &__get_cpu_var(spinning_irq);
> >          } else {
> > -            BUG_ON(!in_softirq());
> > +            BUG_ON(!in_interrupt());
> >              spinning = &__get_cpu_var(spinning_bh);
> >          }
> >          BUG_ON(spinning->lock);
>
> I'm glad it works for you, but it isn't a proper fix.

Could you please explain? How could that line of code be hit if not in an
interrupt handler?

Here is my understanding of the logic of that code. It tries to track the
spinlocks that each CPU is currently spinning on. A spinning CPU can be
interrupted only by an irq. Normal (not SA_NODELAY) interrupt handlers can't
be active on the same CPU at the same time. That leads to a maximum of 3
nested spins:
- one from process context,
- one from a normal irq handler that interrupted that process context,
- and one from an SA_NODELAY irq handler that interrupted the normal irq
handler. This one can't be interrupted, since it runs with interrupts
disabled.
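
To make the model above concrete, here is a minimal user-space sketch of how a
new spin could be assigned to one of the three per-CPU records. It is not the
kernel code: `struct slot`, `struct cpu_model` and `announce_spin()` are
hypothetical names, the two boolean parameters stand in for
raw_irqs_disabled() and in_interrupt(), and the outer "first slot already
busy" test is an assumption about code that is not quoted in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one per-CPU spinning record; not a kernel type. */
struct slot {
    const void *lock;           /* lock being spun on, NULL when unused */
};

/* Hypothetical model of the three per-CPU records named in the thread. */
struct cpu_model {
    struct slot spinning;       /* first spin on this CPU */
    struct slot spinning_bh;    /* spin that interrupted another, irqs enabled */
    struct slot spinning_irq;   /* spin that interrupted another, irqs disabled */
};

/*
 * Record a new spin on `lock`. Returns the slot used, or NULL at the
 * points where the quoted code would hit a BUG_ON().
 */
static struct slot *announce_spin(struct cpu_model *cpu, const void *lock,
                                  bool irqs_disabled, bool in_interrupt)
{
    struct slot *s = &cpu->spinning;

    if (s->lock) {              /* assumed: we interrupted an earlier spin */
        if (irqs_disabled) {
            if (cpu->spinning_bh.lock == lock)
                return NULL;    /* models BUG_ON(spinning_bh.lock == lock) */
            s = &cpu->spinning_irq;
        } else {
            if (!in_interrupt)
                return NULL;    /* models the patched BUG_ON(!in_interrupt()) */
            s = &cpu->spinning_bh;
        }
        if (s->lock)
            return NULL;        /* models BUG_ON(spinning->lock) */
    }
    s->lock = lock;
    return s;
}
```

In this sketch three nested spins on different locks each get their own
record, and a fourth has nowhere to go, which matches the "maximum of 3"
argument.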

If so, the code path in question corresponds to a normal interrupt handler
starting to spin. Thus the check should be in_interrupt().

How is this wrong?

Perhaps a softirq handler could be activated at the exit of the normal
handler? Maybe a better check is BUG_ON(!in_interrupt() && !in_softirq()).
Need to check the code ...
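
As a rough aid for comparing the two checks, here is a user-space sketch of
how the context tests relate, assuming they are bit tests on a per-task
counter as in 2.6-era <linux/hardirq.h>. The `model_*` helpers are
hypothetical, and the bit widths are an illustrative assumption rather than
values taken from this kernel's configuration.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout: softirq count above the preempt count, hardirq above that. */
#define SOFTIRQ_SHIFT   8
#define HARDIRQ_SHIFT   16
#define SOFTIRQ_MASK    (0xffu << SOFTIRQ_SHIFT)
#define HARDIRQ_MASK    (0xfffu << HARDIRQ_SHIFT)
#define SOFTIRQ_OFFSET  (1u << SOFTIRQ_SHIFT)
#define HARDIRQ_OFFSET  (1u << HARDIRQ_SHIFT)

/* Models in_softirq(): softirq bits set (softirq running or bh disabled). */
static bool model_in_softirq(unsigned int count)
{
    return (count & SOFTIRQ_MASK) != 0;
}

/* Models in_interrupt(): either hardirq or softirq bits set. */
static bool model_in_interrupt(unsigned int count)
{
    return (count & (HARDIRQ_MASK | SOFTIRQ_MASK)) != 0;
}

/* Models the combined condition !in_interrupt() && !in_softirq(). */
static bool model_combined_bug(unsigned int count)
{
    return !model_in_interrupt(count) && !model_in_softirq(count);
}
```

Under these assumptions a hard irq handler running with interrupts enabled
satisfies in_interrupt() but not in_softirq(), which would explain why the
original BUG_ON() fired. The combined condition is true exactly when
!in_interrupt() is true, because in_softirq() implies in_interrupt() in this
model, so it would behave the same as the patched check.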

Nikita

