Re: 2.4.0-test10-pre3 Ooops
On Wed, 18 Oct 2000, Mike Galbraith wrote:

> On Wed, 18 Oct 2000, Gary E. Miller wrote:
>
> > Yo Mike!
> >
> > On Wed, 18 Oct 2000, Mike Galbraith wrote:
> >
> > > > Help!  See below for my kernel oops.  I have not been able to use any
> > > > kernel after 2.4.0-test5 due to this problem.  It happens shortly
> > > > after booting the kernel and is very repeatable.
> > >
> > > Are you sure that you used the right System.map?
> >
> > Yes, I just rechecked the date and time of all the files.  I will
> > re-run the whole test in the morning just to be sure.
> >
> > > > Trace; c01df362 <scsi_dispatch_cmd+1c6/26c>
> > >     ^^^
> > > Here, scsi_dispatch_cmd() isn't that large.. it's only 0x168 in size.
> >
> > Mine is 0x26c, maybe you have a different compiler or different config?
>
> Ok, I don't _see_ what could make such a big difference, but something
> obviously does.

(Decides to take another peek.  Yup, scsi logging adds quite a bit)

Looks like deadlock at scsi.c:696 to me.  ide_end_request() has the
io_request_lock on cpu0 and we try to grab it again with the same cpu.

What is unclear to me (SMP.. technology, sufficiently advanced.. magic)
is how in the heck the scsi interrupt happened on cpu0.  It _looks_ to
me as though md_spin_unlock_irq() enabled interrupts at a very bad time.

>>EIP; c0272362 <=
Trace; c01df362 <scsi_dispatch_cmd+1c6/26c>  scsi.c:696 deadlock cpu0
Trace; c010cab1
Trace; c010cc98  Uhoh!
Trace; c010b370
Trace; c02215d3  md_spin_unlock_irq()!!
Trace; c022167a
Trace; c01927c9  ll_rw_blk.c:1000
Trace; c01c99a8  ide.c:516 (cpu0 has lock,
Trace; c01cdadb  (interrupts are disabled)
Trace; c01cb27e
Trace; c01cda30
Trace; c010cab1
Trace; c010cc98
Trace; c010b370
Code;  c0272362

If this were happening on my system, I'd boldly change raid1_free_bh()
like such..

--- drivers/md/raid1.c.org	Wed Oct 18 15:30:07 2000
+++ drivers/md/raid1.c	Wed Oct 18 15:33:08 2000
@@ -91,7 +91,8 @@
 
 static inline void raid1_free_bh(raid1_conf_t *conf, struct buffer_head *bh)
 {
-	md_spin_lock_irq(&conf->device_lock);
+	unsigned long flags;
+	md_spin_lock_irqsave(&conf->device_lock, flags);
 	while (bh) {
 		struct buffer_head *t = bh;
 		bh=bh->b_next;
@@ -103,7 +104,7 @@
 			conf->freebh_cnt++;
 		}
 	}
-	md_spin_unlock_irq(&conf->device_lock);
+	md_spin_unlock_irqrestore(&conf->device_lock, flags);
 	wake_up(&conf->wait_buffer);
 }
 

..and see if the problem went away.

Now, if I'm off base, someone please clean my clock so I'll understand
better next time I foolishly attempt to figure out an SMP deadlock ;-)

	-Mike

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
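The difference between the two locking variants in the patch can be illustrated with a small user-space toy model. This is only a sketch under stated assumptions: the toy_* functions, the irqs_enabled flag and irqs_on_after_inner() are hypothetical stand-ins invented for illustration, not kernel APIs, and no real locking or interrupt handling takes place. It models only the interrupt-enable state that the analysis above turns on: an unlock_irq-style release switches interrupts on unconditionally, while an irqsave/irqrestore pair puts back whatever state the caller had.

```c
#include <stdbool.h>

/* Toy model of one CPU's interrupt-enable flag. Hypothetical, not kernel code. */
static bool irqs_enabled = true;

/* Models a lock_irq/unlock_irq pair: the unlock always re-enables interrupts. */
static void toy_lock_irq(void)   { irqs_enabled = false; }
static void toy_unlock_irq(void) { irqs_enabled = true; }

/* Models an irqsave/irqrestore pair: the previous state is saved and put back. */
static bool toy_lock_irqsave(void)
{
    bool flags = irqs_enabled;  /* remember the caller's state */
    irqs_enabled = false;
    return flags;
}

static void toy_unlock_irqrestore(bool flags)
{
    irqs_enabled = flags;       /* restore, do not blindly enable */
}

/*
 * The outer caller has interrupts disabled (as if it held another lock),
 * then runs an inner critical section. Returns whether interrupts are
 * enabled once the inner section has finished.
 */
static bool irqs_on_after_inner(bool use_irqsave)
{
    irqs_enabled = false;       /* outer caller: interrupts off */

    if (use_irqsave) {
        bool flags = toy_lock_irqsave();
        toy_unlock_irqrestore(flags);
    } else {
        toy_lock_irq();
        toy_unlock_irq();
    }
    return irqs_enabled;
}
```

In this model irqs_on_after_inner(false) returns true: the inner unlock has switched interrupts back on while the outer caller still expects them off, which is the window the trace above points at. irqs_on_after_inner(true) returns false, so the caller's state survives the inner section.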
Re: 2.4.0-test10-pre3 Ooops
On Wed, 18 Oct 2000, Gary E. Miller wrote:

> Yo Mike!
>
> On Wed, 18 Oct 2000, Mike Galbraith wrote:
>
> > > Help!  See below for my kernel oops.  I have not been able to use any
> > > kernel after 2.4.0-test5 due to this problem.  It happens shortly
> > > after booting the kernel and is very repeatable.
> >
> > Are you sure that you used the right System.map?
>
> Yes, I just rechecked the date and time of all the files.  I will
> re-run the whole test in the morning just to be sure.
>
> > > Trace; c01df362 <scsi_dispatch_cmd+1c6/26c>
> >     ^^^
> > Here, scsi_dispatch_cmd() isn't that large.. it's only 0x168 in size.
>
> Mine is 0x26c, maybe you have a different compiler or different config?

Ok, I don't _see_ what could make such a big difference, but something
obviously does.

> > If you want to try some light troubleshooting, grab kdb..
> > ftp://oss.sgi.com/www/projects/kdb/download/ix86/kdb-v1.5-2.4.0-test9-pre9.gz
> > .. and check what both cpus were up to at lock time.  A stack trace of
> > both cpus might help developers locate the trouble.
>
> I thought Linus was not a big fan of kernel debuggers?  I will look into
> it if I get no better suggestions.

Linus doesn't need a supporter (nads of steel;), but the vast majority
of folks out there aren't in the same league as Linus.  Heck, some of
us aren't even sure we're playing the same sport.

	-Mike
Re: 2.4.0-test10-pre3 Ooops
On Tue, 17 Oct 2000, Gary E. Miller wrote:

> Yo All!
>
> Help!  See below for my kernel oops.  I have not been able to use any
> kernel after 2.4.0-test5 due to this problem.  It happens shortly
> after booting the kernel and is very repeatable.
>
> This is a dual PII system with PIIX4 ide, 53c875 scsi and Raid 1.
> It is not a production system so I am open to any patches or
> tests.
>
> The system would not even stay up long enough to run ksymoops so
> I had to copy the data and run it under 2.2.17.
>
> Any ideas out there?

Hi,

Are you sure that you used the right System.map?

> Trace; c01df362 <scsi_dispatch_cmd+1c6/26c>
    ^^^
Here, scsi_dispatch_cmd() isn't that large.. it's only 0x168 in size.
(scsi_dispatch_cmd+0x1c6 is very close to scsi_wait_req():down (sem))

If you want to try some light troubleshooting, grab kdb..
ftp://oss.sgi.com/www/projects/kdb/download/ix86/kdb-v1.5-2.4.0-test9-pre9.gz
.. and check what both cpus were up to at lock time.  A stack trace of
both cpus might help developers locate the trouble.

	-Mike
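The size check Mike does by eye can be written down as a small sketch. ksymoops annotates an address as symbol+offset/size, with both numbers in hex. If the offset does not fit inside the size the same function has in your own build, the System.map may not match the running kernel, or the two kernels were built with different options. The helper names parse_sym_ref() and offset_fits() are hypothetical and exist only for this illustration; they are not part of ksymoops.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Parse a "symbol+offset/size" reference such as the one in the trace.
 * Offset and size are hex without a 0x prefix. name must have room for
 * 64 bytes. Returns true only if all three fields were read.
 */
static bool parse_sym_ref(const char *ref, char name[64],
                          unsigned int *offset, unsigned int *size)
{
    return sscanf(ref, "%63[^+]+%x/%x", name, offset, size) == 3;
}

/* An offset can only lie inside a function if it is smaller than its size. */
static bool offset_fits(unsigned int offset, unsigned int size)
{
    return offset < size;
}
```

For the reference scsi_dispatch_cmd+1c6/26c this gives offset 0x1c6 and size 0x26c, so the offset fits the reporter's own build. Checked against the 0x168 size Mike sees on his build, offset_fits(0x1c6, 0x168) is false, which is why the two builds had to differ in some way.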