Re: 2.4.0-test10-pre3 Ooops

2000-10-18 Thread Mike Galbraith

On Wed, 18 Oct 2000, Mike Galbraith wrote:

> On Wed, 18 Oct 2000, Gary E. Miller wrote:
> 
> > Yo Mike!
> > 
> > On Wed, 18 Oct 2000, Mike Galbraith wrote:
> > 
> > > > Help!  See below for my kernel oops.  I have not been able to use any 
> > > > kernel after 2.4.0-test5 due to this problem.  It happens shortly
> > > > after booting the kernel and is very repeatable.
> > 
> > > Are you sure that you used the right System.map?
> > Yes, I just rechecked the date and time of all the files.  I will
> > re-run the whole test in the morning just to be sure.
> > 
> > > > Trace; c01df362 <scsi_dispatch_cmd+1c6/26c>
> > >  ^^^
> > > Here, scsi_dispatch_cmd() isn't that large.. it's only 0x168 in size.
> > Mine is 0x26c, maybe you have a different compiler or different config?
> 
> Ok, I don't _see_ what could make such a big difference, but something
> obviously does.

(Decides to take another peek.  Yup, scsi logging adds quite a bit)

Looks like deadlock at scsi.c:696 to me.  ide_end_request() has the
io_request_lock on cpu0 and we try to grab it again with the same cpu.

What is unclear to me (SMP.. technology, sufficiently advanced.. magic)
is how in the heck the scsi interrupt happened on cpu0.  It _looks_ to
me as though md_spin_unlock_irq() enabled interrupts at a very bad time.

>>EIP; c0272362<=
Trace; c01df362 <scsi_dispatch_cmd+1c6/26c> scsi.c:696 deadlock cpu0
Trace; c010cab1 
Trace; c010cc98  Uhoh!
Trace; c010b370 
Trace; c02215d3   md_spin_unlock_irq()!!
Trace; c022167a 
Trace; c01927c9  ll_rw_blk.c:1000
Trace; c01c99a8 ide.c:516 (cpu0 has lock,
Trace; c01cdadb (interrupts are disabled)
Trace; c01cb27e 
Trace; c01cda30 
Trace; c010cab1 
Trace; c010cc98 
Trace; c010b370 
Code;  c0272362 

If this were happening on my system, I'd boldly change raid1_free_bh()
like such..

--- drivers/md/raid1.c.org  Wed Oct 18 15:30:07 2000
+++ drivers/md/raid1.c  Wed Oct 18 15:33:08 2000
@@ -91,7 +91,8 @@
 
 static inline void raid1_free_bh(raid1_conf_t *conf, struct buffer_head *bh)
 {
-	md_spin_lock_irq(&conf->device_lock);
+	unsigned long flags;
+	md_spin_lock_irqsave(&conf->device_lock, flags);
 	while (bh) {
 		struct buffer_head *t = bh;
 		bh=bh->b_next;
@@ -103,7 +104,7 @@
 			conf->freebh_cnt++;
 		}
 	}
-	md_spin_unlock_irq(&conf->device_lock);
+	md_spin_unlock_irqrestore(&conf->device_lock, flags);
 	wake_up(&conf->wait_buffer);
 }

..and see if the problem went away.
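
For readers who want to see why the irqsave/irqrestore pair matters here, the following is a toy, user-space sketch of the suspected sequence on a single CPU. It is not kernel code: every name in it (irqs_enabled, request_lock_held, run_scenario and the inner_* helpers) is hypothetical and only models the behaviour described above, on the assumption that the outer path holds its lock with interrupts disabled and that a pending interrupt handler wants that same lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-CPU model. All names are hypothetical, not kernel API. */
static bool irqs_enabled = true;  /* models the CPU interrupt flag */
static bool request_lock_held;    /* models the outer lock held by this CPU */

/* Models a lock_irq/unlock_irq pair: unlock always re-enables interrupts. */
static void inner_lock_irq(void)
{
    irqs_enabled = false;
}

static void inner_unlock_irq(void)
{
    assert(!irqs_enabled);        /* the lock was taken with interrupts off */
    irqs_enabled = true;          /* unconditional enable: the suspected bug */
}

/* Models a lock_irqsave/unlock_irqrestore pair: unlock restores old state. */
static bool inner_lock_irqsave(void)
{
    bool flags = irqs_enabled;    /* remember the caller's interrupt state */
    irqs_enabled = false;
    return flags;
}

static void inner_unlock_irqrestore(bool flags)
{
    assert(!irqs_enabled);
    irqs_enabled = flags;         /* caller had interrupts off, so they stay off */
}

/* A pending interrupt whose handler needs the outer lock. Returns true if the
 * handler would run now and spin forever on a lock its own CPU already holds. */
static bool pending_irq_deadlocks(void)
{
    if (!irqs_enabled)
        return false;             /* interrupt stays pending, nothing runs */
    return request_lock_held;
}

/* Runs the nested-lock sequence; returns true if it ends in self-deadlock. */
static bool run_scenario(bool use_irqsave)
{
    bool deadlock;

    irqs_enabled = false;         /* outer path: interrupts off, lock taken */
    request_lock_held = true;

    if (use_irqsave) {
        bool flags = inner_lock_irqsave();
        inner_unlock_irqrestore(flags);
    } else {
        inner_lock_irq();
        inner_unlock_irq();
    }
    deadlock = pending_irq_deadlocks();

    request_lock_held = false;    /* reset the model for the next run */
    irqs_enabled = true;
    return deadlock;
}
```

In this model run_scenario(false) reports a deadlock, because the inner unlock turns interrupts back on while the outer lock is still held, and run_scenario(true) does not, because the saved state is restored instead. Whether the real oops follows this exact path is still only a guess from the trace.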
 
Now, if I'm off base, someone please clean my clock so I'll understand
better next time I foolishly attempt to figure out an SMP deadlock ;-)

-Mike

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.0-test10-pre3 Ooops

2000-10-18 Thread Mike Galbraith

On Wed, 18 Oct 2000, Gary E. Miller wrote:

> Yo Mike!
> 
> On Wed, 18 Oct 2000, Mike Galbraith wrote:
> 
> > > Help!  See below for my kernel oops.  I have not been able to use any 
> > > kernel after 2.4.0-test5 due to this problem.  It happens shortly
> > > after booting the kernel and is very repeatable.
> 
> > Are you sure that you used the right System.map?
> Yes, I just rechecked the date and time of all the files.  I will
> re-run the whole test in the morning just to be sure.
> 
> > > Trace; c01df362 <scsi_dispatch_cmd+1c6/26c>
> >  ^^^
> > Here, scsi_dispatch_cmd() isn't that large.. it's only 0x168 in size.
> Mine is 0x26c, maybe you have a different compiler or different config?

Ok, I don't _see_ what could make such a big difference, but something
obviously does.

> > If you want to try some light troubleshooting, grab kdb..
> > ftp://oss.sgi.com/www/projects/kdb/download/ix86/kdb-v1.5-2.4.0-test9-pre9.gz
> > .. and check what both cpus were up to at lock time.  A stack trace of
> > both cpus might help developers locate the trouble.
> 
> I thought Linus was not a big fan of kernel debuggers?  I will look into
> it if I get no better suggestions.

Linus doesn't need a supporter (nads of steel;), but the vast majority
of folks out there aren't in the same league as Linus.  Heck, some of
us aren't even sure we're playing the same sport.

-Mike




Re: 2.4.0-test10-pre3 Ooops

2000-10-18 Thread Mike Galbraith



On Tue, 17 Oct 2000, Gary E. Miller wrote:

> Yo All!
> 
> Help!  See below for my kernel oops.  I have not been able to use any 
> kernel after 2.4.0-test5 due to this problem.  It happens shortly
> after booting the kernel and is very repeatable.
> 
> This is a dual PII system with PIIX4 ide, 53c875 scsi and Raid 1.
> It is not a production system so I am open to any patches or
> tests.
> 
> The system would not even stay up long enough to run ksymoops so
> I had to copy the data and run it under 2.2.17.
> 
> Any ideas out there?

Hi,

Are you sure that you used the right System.map?

> Trace; c01df362 <scsi_dispatch_cmd+1c6/26c>
 ^^^
Here, scsi_dispatch_cmd() isn't that large.. it's only 0x168 in size.

(scsi_dispatch_cmd+0x1c6 is very close to scsi_wait_req():down (sem))

If you want to try some light troubleshooting, grab kdb..
ftp://oss.sgi.com/www/projects/kdb/download/ix86/kdb-v1.5-2.4.0-test9-pre9.gz
.. and check what both cpus were up to at lock time.  A stack trace of
both cpus might help developers locate the trouble.

-Mike






