Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
Hi, On Thu, Mar 22, 2007 at 07:42:35PM +0100, Jens Axboe wrote: > > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c > > index b6491c0..ca84f0b 100644 > > --- a/block/cfq-iosched.c > > +++ b/block/cfq-iosched.c > > @@ -961,8 +961,8 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct > > cfq_queue *cfqq, > > /* > > * follow expired path, else get first next available > > */ > > - if ((rq = cfq_check_fifo(cfqq)) == NULL) > > - rq = cfqq->next_rq; > > + if (!(rq = cfq_check_fifo(cfqq)) && !(rq = cfqq->next_rq)) > > + break; > > That still only hides a bug. It is illegal for ->next_rq to be NULL > while the RB tree is non-empty. As I noticed afterwards this isn't even the point where the NULL ptr is dereferenced. It must be in the next code-line, cfqd->queue or cfqd->queue->elevator was NULL when the oops occured or am I wrong? 'Hannes - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
On Wed, Mar 21 2007, Johannes Weiner wrote: > Hi, > > I think I found where the NULL may come from. Please, anybody, do not > apply this patch before a trustful person reviewed it... Jens? ;) > > My thoughts on this are, that there are two possibilities cfqq->next_rq > could be NULL: End of list or a bug when it is set (or not set). > But why does RB_EMPTY_ROOT() as last call in this loop does not trigger? > > Did I even get the right place on where the NULL pointer dereference > happens? :) > > =Hannes > > Signed-off-by: Johannes Weiner <[EMAIL PROTECTED]> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c > index b6491c0..ca84f0b 100644 > --- a/block/cfq-iosched.c > +++ b/block/cfq-iosched.c > @@ -961,8 +961,8 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct > cfq_queue *cfqq, > /* >* follow expired path, else get first next available >*/ > - if ((rq = cfq_check_fifo(cfqq)) == NULL) > - rq = cfqq->next_rq; > + if (!(rq = cfq_check_fifo(cfqq)) && !(rq = cfqq->next_rq)) > + break; That still only hides a bug. It is illegal for ->next_rq to be NULL while the RB tree is non-empty. -- Jens Axboe - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
Hi, On Wed, Mar 21, 2007 at 08:04:00PM +0100, Johannes Weiner wrote: > Did I even get the right place on where the NULL pointer dereference > happens? :) No, I did not. Sorry for the noise. =Hannes - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
On Wed, 2007-03-21 at 20:59 +0100, Jens Axboe wrote: > On Wed, Mar 21 2007, Dale Blount wrote: > > On Wed, 2007-03-21 at 14:09 -0400, Chuck Ebbert wrote: > > > Dale Blount wrote: > > > >> I'm puzzled why this is hitting Dan, but no one else has reported > > > >> anything. Dan, did 2.6.19 work for you? > > > > > > > > Actually, I believe it is happening to me too. This is on a 4-disk > > > > raid5 with > > > > one failed disk on two 2-port sata_sil pci controller cards. > > > > > > > > The BUG below is from 2.6.20.3, but I will try 2.6.21-rc4 tonight. > > > > The disk didn't fail until 2.6.20, so I don't know if 2.6.19 would have > > > > worked > > > > or not. > > > > > > > > > > > > > > Just to be clear: you have a RAID array (what level?) with a failed disk > > > and it's giving you the same error that Dan gets? > > > > > > Yes, the BUG looks to be pretty similar to me. It's a 4 disk raid5 with > > one disk missing (dead and sent in for RMA). > > Interesting, I'll definitely see if I can reproduce it like that. > FWIW, the server made it through a round using 2.6.21-rc4 without errors. 2.6.20.x failed 5 times straight and worked 4 times prior to that (also with a missing disk), so it could be a coincidence too. The disk is back from RMA, do you need me to hold off replacing it? Thanks, Dale - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
On Wed, 2007-03-21 at 20:59 +0100, Jens Axboe wrote: On Wed, Mar 21 2007, Dale Blount wrote: On Wed, 2007-03-21 at 14:09 -0400, Chuck Ebbert wrote: Dale Blount wrote: I'm puzzled why this is hitting Dan, but no one else has reported anything. Dan, did 2.6.19 work for you? Actually, I believe it is happening to me too. This is on a 4-disk raid5 with one failed disk on two 2-port sata_sil pci controller cards. The BUG below is from 2.6.20.3, but I will try 2.6.21-rc4 tonight. The disk didn't fail until 2.6.20, so I don't know if 2.6.19 would have worked or not. Just to be clear: you have a RAID array (what level?) with a failed disk and it's giving you the same error that Dan gets? Yes, the BUG looks to be pretty similar to me. It's a 4 disk raid5 with one disk missing (dead and sent in for RMA). Interesting, I'll definitely see if I can reproduce it like that. FWIW, the server made it through a round using 2.6.21-rc4 without errors. 2.6.20.x failed 5 times straight and worked 4 times prior to that (also with a missing disk), so it could be a coincidence too. The disk is back from RMA, do you need me to hold off replacing it? Thanks, Dale - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
Hi, On Wed, Mar 21, 2007 at 08:04:00PM +0100, Johannes Weiner wrote: Did I even get the right place on where the NULL pointer dereference happens? :) No, I did not. Sorry for the noise. =Hannes - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
On Wed, Mar 21 2007, Johannes Weiner wrote: Hi, I think I found where the NULL may come from. Please, anybody, do not apply this patch before a trustful person reviewed it... Jens? ;) My thoughts on this are, that there are two possibilities cfqq-next_rq could be NULL: End of list or a bug when it is set (or not set). But why does RB_EMPTY_ROOT() as last call in this loop does not trigger? Did I even get the right place on where the NULL pointer dereference happens? :) =Hannes Signed-off-by: Johannes Weiner [EMAIL PROTECTED] diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index b6491c0..ca84f0b 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -961,8 +961,8 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq, /* * follow expired path, else get first next available */ - if ((rq = cfq_check_fifo(cfqq)) == NULL) - rq = cfqq-next_rq; + if (!(rq = cfq_check_fifo(cfqq)) !(rq = cfqq-next_rq)) + break; That still only hides a bug. It is illegal for -next_rq to be NULL while the RB tree is non-empty. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
Hi, On Thu, Mar 22, 2007 at 07:42:35PM +0100, Jens Axboe wrote: diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index b6491c0..ca84f0b 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -961,8 +961,8 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq, /* * follow expired path, else get first next available */ - if ((rq = cfq_check_fifo(cfqq)) == NULL) - rq = cfqq-next_rq; + if (!(rq = cfq_check_fifo(cfqq)) !(rq = cfqq-next_rq)) + break; That still only hides a bug. It is illegal for -next_rq to be NULL while the RB tree is non-empty. As I noticed afterwards this isn't even the point where the NULL ptr is dereferenced. It must be in the next code-line, cfqd-queue or cfqd-queue-elevator was NULL when the oops occured or am I wrong? 'Hannes - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
On Wed, Mar 21 2007, Dale Blount wrote: > On Wed, 2007-03-21 at 14:09 -0400, Chuck Ebbert wrote: > > Dale Blount wrote: > > >> I'm puzzled why this is hitting Dan, but no one else has reported > > >> anything. Dan, did 2.6.19 work for you? > > > > > > Actually, I believe it is happening to me too. This is on a 4-disk raid5 > > > with > > > one failed disk on two 2-port sata_sil pci controller cards. > > > > > > The BUG below is from 2.6.20.3, but I will try 2.6.21-rc4 tonight. > > > The disk didn't fail until 2.6.20, so I don't know if 2.6.19 would have > > > worked > > > or not. > > > > > > > > > > Just to be clear: you have a RAID array (what level?) with a failed disk > > and it's giving you the same error that Dan gets? > > > Yes, the BUG looks to be pretty similar to me. It's a 4 disk raid5 with > one disk missing (dead and sent in for RMA). Interesting, I'll definitely see if I can reproduce it like that. -- Jens Axboe - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
Hi, I think I found where the NULL may come from. Please, anybody, do not apply this patch before a trustful person reviewed it... Jens? ;) My thoughts on this are, that there are two possibilities cfqq->next_rq could be NULL: End of list or a bug when it is set (or not set). But why does RB_EMPTY_ROOT() as last call in this loop does not trigger? Did I even get the right place on where the NULL pointer dereference happens? :) =Hannes Signed-off-by: Johannes Weiner <[EMAIL PROTECTED]> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index b6491c0..ca84f0b 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -961,8 +961,8 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq, /* * follow expired path, else get first next available */ - if ((rq = cfq_check_fifo(cfqq)) == NULL) - rq = cfqq->next_rq; + if (!(rq = cfq_check_fifo(cfqq)) && !(rq = cfqq->next_rq)) + break; /* * finally, insert request into driver dispatch list
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
Dale Blount wrote: > On Wed, 2007-03-21 at 14:09 -0400, Chuck Ebbert wrote: >> Dale Blount wrote: I'm puzzled why this is hitting Dan, but no one else has reported anything. Dan, did 2.6.19 work for you? >>> Actually, I believe it is happening to me too. This is on a 4-disk raid5 >>> with >>> one failed disk on two 2-port sata_sil pci controller cards. >>> >>> The BUG below is from 2.6.20.3, but I will try 2.6.21-rc4 tonight. >>> The disk didn't fail until 2.6.20, so I don't know if 2.6.19 would have >>> worked >>> or not. >>> >>> >> Just to be clear: you have a RAID array (what level?) with a failed disk >> and it's giving you the same error that Dan gets? > > > Yes, the BUG looks to be pretty similar to me. It's a 4 disk raid5 with > one disk missing (dead and sent in for RMA). > OK, I'm add NeilB to the cc: (Dan has a RAID6 array with two failed disks.) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
On Wed, 2007-03-21 at 14:09 -0400, Chuck Ebbert wrote: > Dale Blount wrote: > >> I'm puzzled why this is hitting Dan, but no one else has reported > >> anything. Dan, did 2.6.19 work for you? > > > > Actually, I believe it is happening to me too. This is on a 4-disk raid5 > > with > > one failed disk on two 2-port sata_sil pci controller cards. > > > > The BUG below is from 2.6.20.3, but I will try 2.6.21-rc4 tonight. > > The disk didn't fail until 2.6.20, so I don't know if 2.6.19 would have > > worked > > or not. > > > > > > Just to be clear: you have a RAID array (what level?) with a failed disk > and it's giving you the same error that Dan gets? Yes, the BUG looks to be pretty similar to me. It's a 4 disk raid5 with one disk missing (dead and sent in for RMA). Dale - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
Dale Blount wrote: >> I'm puzzled why this is hitting Dan, but no one else has reported >> anything. Dan, did 2.6.19 work for you? > > Actually, I believe it is happening to me too. This is on a 4-disk raid5 with > one failed disk on two 2-port sata_sil pci controller cards. > > The BUG below is from 2.6.20.3, but I will try 2.6.21-rc4 tonight. > The disk didn't fail until 2.6.20, so I don't know if 2.6.19 would have worked > or not. > > Just to be clear: you have a RAID array (what level?) with a failed disk and it's giving you the same error that Dan gets? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
> I'm puzzled why this is hitting Dan, but no one else has reported > anything. Dan, did 2.6.19 work for you? Actually, I believe it is happening to me too. This is on a 4-disk raid5 with one failed disk on two 2-port sata_sil pci controller cards. The BUG below is from 2.6.20.3, but I will try 2.6.21-rc4 tonight. The disk didn't fail until 2.6.20, so I don't know if 2.6.19 would have worked or not. BUG: unable to handle kernel NULL pointer dereference at virtual address 005c printing eip: *pde = Oops: [#1] PREEMPT SMP Modules linked in: ipv6 raid456 xor md_mod e1000 usbcore ext3 jbd mbcache sata_sil sd_mod sr_mod cdrom generic ide_core ata_piix libata CPU:0 EIP:0060:[]Not tainted VLI EFLAGS: 00010086 (2.6.20-ARCH #1) EIP is at cfq_dispatch_insert+0x19/0x70 eax: c2bc1340 ebx: ecx: 0049 edx: esi: c298802c edi: f7c17cc0 ebp: 00e47352 esp: f18addfc ds: 007b es: 007b ss: 0068 Process rsync (pid: 3698, ti=f18ac000 task=dfe97570 task.ti=f18ac000) Stack: c298ce3c f7c17cc0 00e47352 c0220bd9 0040 c2b3cd40 c2b3c4a8 cafa6500 c0294750 0004 d8ed1480 cafa6500 c298802c c29f0800 c2b3c000 c298802c c0215690 c0294750 cafa6500 Call Trace: [] cfq_dispatch_requests+0xf9/0x4e0 [] scsi_done+0x0/0x20 [] elv_next_request+0x20/0x1d0 [] scsi_done+0x0/0x20 [] scsi_dispatch_cmd+0x146/0x230 [] scsi_request_fn+0x194/0x2d0 [] blk_run_queue+0x58/0x70 [] scsi_next_command+0x30/0x50 [] scsi_end_request+0xab/0xe0 [] scsi_io_completion+0x86/0x370 [] ata_hsm_qc_complete+0x90/0x110 [libata] [] sd_rw_intr+0x2b/0x200 [sd_mod] [] scsi_finish_command+0x49/0x60 [] blk_done_softirq+0x58/0x70 [] __do_softirq+0x82/0xf0 [] do_softirq+0x37/0x40 [] irq_exit+0x45/0x50 [] do_IRQ+0x45/0x80 [] common_interrupt+0x23/0x28 === Code: ae 40 ee ff e9 25 ff ff ff 0f 0b eb fe 0f 0b eb fe 90 83 ec 10 89 1c 24 89 d3 89 74 24 04 89 c6 89 7c 24 08 89 6c 24 0c 8b 40 0c <8b> 7a 5c 8b 68 04 89 d0 e8 7a fe ff ff 8b 43 14 89 da 25 01 80 EIP: [] cfq_dispatch_insert+0x19/0x70 SS:ESP 0068:f18addfc <0>Kernel panic - not syncing: Fatal exception in interrupt - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
I'm puzzled why this is hitting Dan, but no one else has reported anything. Dan, did 2.6.19 work for you? Actually, I believe it is happening to me too. This is on a 4-disk raid5 with one failed disk on two 2-port sata_sil pci controller cards. The BUG below is from 2.6.20.3, but I will try 2.6.21-rc4 tonight. The disk didn't fail until 2.6.20, so I don't know if 2.6.19 would have worked or not. BUG: unable to handle kernel NULL pointer dereference at virtual address 005c printing eip: *pde = Oops: [#1] PREEMPT SMP Modules linked in: ipv6 raid456 xor md_mod e1000 usbcore ext3 jbd mbcache sata_sil sd_mod sr_mod cdrom generic ide_core ata_piix libata CPU:0 EIP:0060:[c0220a29]Not tainted VLI EFLAGS: 00010086 (2.6.20-ARCH #1) EIP is at cfq_dispatch_insert+0x19/0x70 eax: c2bc1340 ebx: ecx: 0049 edx: esi: c298802c edi: f7c17cc0 ebp: 00e47352 esp: f18addfc ds: 007b es: 007b ss: 0068 Process rsync (pid: 3698, ti=f18ac000 task=dfe97570 task.ti=f18ac000) Stack: c298ce3c f7c17cc0 00e47352 c0220bd9 0040 c2b3cd40 c2b3c4a8 cafa6500 c0294750 0004 d8ed1480 cafa6500 c298802c c29f0800 c2b3c000 c298802c c0215690 c0294750 cafa6500 Call Trace: [c0220bd9] cfq_dispatch_requests+0xf9/0x4e0 [c0294750] scsi_done+0x0/0x20 [c0215690] elv_next_request+0x20/0x1d0 [c0294750] scsi_done+0x0/0x20 [c0294ab6] scsi_dispatch_cmd+0x146/0x230 [c0299904] scsi_request_fn+0x194/0x2d0 [c0219518] blk_run_queue+0x58/0x70 [c0298460] scsi_next_command+0x30/0x50 [c029866b] scsi_end_request+0xab/0xe0 [c0298776] scsi_io_completion+0x86/0x370 [f882c770] ata_hsm_qc_complete+0x90/0x110 [libata] [f88127eb] sd_rw_intr+0x2b/0x200 [sd_mod] [c02945a9] scsi_finish_command+0x49/0x60 [c0219d88] blk_done_softirq+0x58/0x70 [c012bc12] __do_softirq+0x82/0xf0 [c012bcb7] do_softirq+0x37/0x40 [c012bee5] irq_exit+0x45/0x50 [c0105dc5] do_IRQ+0x45/0x80 [c0103c0f] common_interrupt+0x23/0x28 === Code: ae 40 ee ff e9 25 ff ff ff 0f 0b eb fe 0f 0b eb fe 90 83 ec 10 89 1c 24 89 d3 89 74 24 04 89 c6 89 7c 24 08 89 6c 24 0c 8b 40 0c 8b 7a 5c 8b 68 04 89 d0 e8 7a fe ff ff 8b 43 14 89 da 25 01 80 EIP: [c0220a29] cfq_dispatch_insert+0x19/0x70 SS:ESP 0068:f18addfc 0Kernel panic - not syncing: Fatal exception in interrupt - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
Dale Blount wrote: I'm puzzled why this is hitting Dan, but no one else has reported anything. Dan, did 2.6.19 work for you? Actually, I believe it is happening to me too. This is on a 4-disk raid5 with one failed disk on two 2-port sata_sil pci controller cards. The BUG below is from 2.6.20.3, but I will try 2.6.21-rc4 tonight. The disk didn't fail until 2.6.20, so I don't know if 2.6.19 would have worked or not. Just to be clear: you have a RAID array (what level?) with a failed disk and it's giving you the same error that Dan gets? - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
On Wed, 2007-03-21 at 14:09 -0400, Chuck Ebbert wrote: Dale Blount wrote: I'm puzzled why this is hitting Dan, but no one else has reported anything. Dan, did 2.6.19 work for you? Actually, I believe it is happening to me too. This is on a 4-disk raid5 with one failed disk on two 2-port sata_sil pci controller cards. The BUG below is from 2.6.20.3, but I will try 2.6.21-rc4 tonight. The disk didn't fail until 2.6.20, so I don't know if 2.6.19 would have worked or not. Just to be clear: you have a RAID array (what level?) with a failed disk and it's giving you the same error that Dan gets? Yes, the BUG looks to be pretty similar to me. It's a 4 disk raid5 with one disk missing (dead and sent in for RMA). Dale - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
Dale Blount wrote: On Wed, 2007-03-21 at 14:09 -0400, Chuck Ebbert wrote: Dale Blount wrote: I'm puzzled why this is hitting Dan, but no one else has reported anything. Dan, did 2.6.19 work for you? Actually, I believe it is happening to me too. This is on a 4-disk raid5 with one failed disk on two 2-port sata_sil pci controller cards. The BUG below is from 2.6.20.3, but I will try 2.6.21-rc4 tonight. The disk didn't fail until 2.6.20, so I don't know if 2.6.19 would have worked or not. Just to be clear: you have a RAID array (what level?) with a failed disk and it's giving you the same error that Dan gets? Yes, the BUG looks to be pretty similar to me. It's a 4 disk raid5 with one disk missing (dead and sent in for RMA). OK, I'm add NeilB to the cc: (Dan has a RAID6 array with two failed disks.) - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
Hi, I think I found where the NULL may come from. Please, anybody, do not apply this patch before a trustful person reviewed it... Jens? ;) My thoughts on this are, that there are two possibilities cfqq-next_rq could be NULL: End of list or a bug when it is set (or not set). But why does RB_EMPTY_ROOT() as last call in this loop does not trigger? Did I even get the right place on where the NULL pointer dereference happens? :) =Hannes Signed-off-by: Johannes Weiner [EMAIL PROTECTED] diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index b6491c0..ca84f0b 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -961,8 +961,8 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq, /* * follow expired path, else get first next available */ - if ((rq = cfq_check_fifo(cfqq)) == NULL) - rq = cfqq-next_rq; + if (!(rq = cfq_check_fifo(cfqq)) !(rq = cfqq-next_rq)) + break; /* * finally, insert request into driver dispatch list
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
On Wed, Mar 21 2007, Dale Blount wrote: On Wed, 2007-03-21 at 14:09 -0400, Chuck Ebbert wrote: Dale Blount wrote: I'm puzzled why this is hitting Dan, but no one else has reported anything. Dan, did 2.6.19 work for you? Actually, I believe it is happening to me too. This is on a 4-disk raid5 with one failed disk on two 2-port sata_sil pci controller cards. The BUG below is from 2.6.20.3, but I will try 2.6.21-rc4 tonight. The disk didn't fail until 2.6.20, so I don't know if 2.6.19 would have worked or not. Just to be clear: you have a RAID array (what level?) with a failed disk and it's giving you the same error that Dan gets? Yes, the BUG looks to be pretty similar to me. It's a 4 disk raid5 with one disk missing (dead and sent in for RMA). Interesting, I'll definitely see if I can reproduce it like that. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
On 3/1/07, Jens Axboe <[EMAIL PROTECTED]> wrote: On Thu, Mar 01 2007, Frank Seidel wrote: > Am Mittwoch, 28. Februar 2007 19:02 schrieb Dan Williams: > > I can reliably reproduce a null pointer dereference on 2.6.20 and > > 2.6.21-rc2. I will keep digging to find the kernel version where > > this last worked, but wanted to see if there were any immediate > > experiments I should try. > > ... > > Kernel 2.6.21-rc2 on an i686 > > ... > > [ 431.709022] BUG: unable to handle kernel NULL pointer dereference > > at virtual address 005c [ 431.717993] printing eip: > > ... > > [ 431.825386] EIP is at cfq_dispatch_insert+0xb/0x53 > > ... > > [ 431.887396] [] cfq_dispatch_requests+0x138/0x3f0 > Hi, > unfortunately i yet don't really have much/enough knowledge of cfq and > the kernels inwards at the moment... > but looking at cfq_dispatch_insert+0xb it seems the struct request > pointer given (as second parameter by cfq_dispatch_request) was NULL > and dereferencing it in the RQ_CFQQ macro leads to this oops. > > The "break"-out patch below for __cfq_dispatch_request might be at least > a possible workaround for this, but it could also be total bullsh.. > Perhaps someone smarter might pick this up.. and give a real fix. > > Have fun, > Frank > --- > > block/cfq-iosched.c |3 ++- > 1 files changed, 2 insertions(+), 1 deletion(-) > > Index: linux-2.6/block/cfq-iosched.c > === > --- linux-2.6.orig/block/cfq-iosched.c > +++ linux-2.6/block/cfq-iosched.c > @@ -962,7 +962,8 @@ __cfq_dispatch_requests(struct cfq_data > * follow expired path, else get first next available > */ > if ((rq = cfq_check_fifo(cfqq)) == NULL) > - rq = cfqq->next_rq; > + if ((rq = cfqq->next_rq) == NULL) > + break; > > /* > * finally, insert request into driver dispatch list That is not the right fix. A little further up in this function, a check (well BUG_ON()) is done for a non-empty sort list. So we know at this point, that we have requests pending for this queue. When that is the case, ->next_rq must always be kept uptodate and non-NULL. The oops at least tells us this, it should not be papered around. The real fix is finding out _where_ this now isn't being updated. I'm puzzled why this is hitting Dan, but no one else has reported anything. Dan, did 2.6.19 work for you? I am puzzled as well, although I do not think many people run raid6 arrays with 2-failed disks, so it might be an under-tested path, but a non-degraded array runs fine... I fired up a 2.6.19 kernel and tiobench ran past the point (in terms of time) where it had failed on .20 and .21-rc. However I noticed things were running much slower since the cpu optimizations had fallen back to Pentium-Pro from Core2 which affects the raid6 p+q calculation speed among other things. So I need to re-baseline the failure against a more common config to say whether it is actually gone in 2.6.19. I should have time to try these tests next week. -- Jens Axboe Regards, Dan - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
On Thu, Mar 01 2007, Frank Seidel wrote: > Am Mittwoch, 28. Februar 2007 19:02 schrieb Dan Williams: > > I can reliably reproduce a null pointer dereference on 2.6.20 and > > 2.6.21-rc2. I will keep digging to find the kernel version where > > this last worked, but wanted to see if there were any immediate > > experiments I should try. > > ... > > Kernel 2.6.21-rc2 on an i686 > > ... > > [ 431.709022] BUG: unable to handle kernel NULL pointer dereference > > at virtual address 005c [ 431.717993] printing eip: > > ... > > [ 431.825386] EIP is at cfq_dispatch_insert+0xb/0x53 > > ... > > [ 431.887396] [] cfq_dispatch_requests+0x138/0x3f0 > Hi, > unfortunately i yet don't really have much/enough knowledge of cfq and > the kernels inwards at the moment... > but looking at cfq_dispatch_insert+0xb it seems the struct request > pointer given (as second parameter by cfq_dispatch_request) was NULL > and dereferencing it in the RQ_CFQQ macro leads to this oops. > > The "break"-out patch below for __cfq_dispatch_request might be at least > a possible workaround for this, but it could also be total bullsh.. > Perhaps someone smarter might pick this up.. and give a real fix. > > Have fun, > Frank > --- > > block/cfq-iosched.c |3 ++- > 1 files changed, 2 insertions(+), 1 deletion(-) > > Index: linux-2.6/block/cfq-iosched.c > === > --- linux-2.6.orig/block/cfq-iosched.c > +++ linux-2.6/block/cfq-iosched.c > @@ -962,7 +962,8 @@ __cfq_dispatch_requests(struct cfq_data > * follow expired path, else get first next available > */ > if ((rq = cfq_check_fifo(cfqq)) == NULL) > - rq = cfqq->next_rq; > + if ((rq = cfqq->next_rq) == NULL) > + break; > > /* > * finally, insert request into driver dispatch list That is not the right fix. A little further up in this function, a check (well BUG_ON()) is done for a non-empty sort list. So we know at this point, that we have requests pending for this queue. When that is the case, ->next_rq must always be kept uptodate and non-NULL. The oops at least tells us this, it should not be papered around. The real fix is finding out _where_ this now isn't being updated. I'm puzzled why this is hitting Dan, but no one else has reported anything. Dan, did 2.6.19 work for you? -- Jens Axboe - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
Am Mittwoch, 28. Februar 2007 19:02 schrieb Dan Williams: > I can reliably reproduce a null pointer dereference on 2.6.20 and > 2.6.21-rc2. I will keep digging to find the kernel version where > this last worked, but wanted to see if there were any immediate > experiments I should try. > ... > Kernel 2.6.21-rc2 on an i686 > ... > [ 431.709022] BUG: unable to handle kernel NULL pointer dereference > at virtual address 005c [ 431.717993] printing eip: > ... > [ 431.825386] EIP is at cfq_dispatch_insert+0xb/0x53 > ... > [ 431.887396] [] cfq_dispatch_requests+0x138/0x3f0 Hi, unfortunately i yet don't really have much/enough knowledge of cfq and the kernels inwards at the moment... but looking at cfq_dispatch_insert+0xb it seems the struct request pointer given (as second parameter by cfq_dispatch_request) was NULL and dereferencing it in the RQ_CFQQ macro leads to this oops. The "break"-out patch below for __cfq_dispatch_request might be at least a possible workaround for this, but it could also be total bullsh.. Perhaps someone smarter might pick this up.. and give a real fix. Have fun, Frank --- block/cfq-iosched.c |3 ++- 1 files changed, 2 insertions(+), 1 deletion(-) Index: linux-2.6/block/cfq-iosched.c === --- linux-2.6.orig/block/cfq-iosched.c +++ linux-2.6/block/cfq-iosched.c @@ -962,7 +962,8 @@ __cfq_dispatch_requests(struct cfq_data * follow expired path, else get first next available */ if ((rq = cfq_check_fifo(cfqq)) == NULL) - rq = cfqq->next_rq; + if ((rq = cfqq->next_rq) == NULL) + break; /* * finally, insert request into driver dispatch list - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
Am Mittwoch, 28. Februar 2007 19:02 schrieb Dan Williams: I can reliably reproduce a null pointer dereference on 2.6.20 and 2.6.21-rc2. I will keep digging to find the kernel version where this last worked, but wanted to see if there were any immediate experiments I should try. ... Kernel 2.6.21-rc2 on an i686 ... [ 431.709022] BUG: unable to handle kernel NULL pointer dereference at virtual address 005c [ 431.717993] printing eip: ... [ 431.825386] EIP is at cfq_dispatch_insert+0xb/0x53 ... [ 431.887396] [c01e1fc9] cfq_dispatch_requests+0x138/0x3f0 Hi, unfortunately i yet don't really have much/enough knowledge of cfq and the kernels inwards at the moment... but looking at cfq_dispatch_insert+0xb it seems the struct request pointer given (as second parameter by cfq_dispatch_request) was NULL and dereferencing it in the RQ_CFQQ macro leads to this oops. The break-out patch below for __cfq_dispatch_request might be at least a possible workaround for this, but it could also be total bullsh.. Perhaps someone smarter might pick this up.. and give a real fix. Have fun, Frank --- block/cfq-iosched.c |3 ++- 1 files changed, 2 insertions(+), 1 deletion(-) Index: linux-2.6/block/cfq-iosched.c === --- linux-2.6.orig/block/cfq-iosched.c +++ linux-2.6/block/cfq-iosched.c @@ -962,7 +962,8 @@ __cfq_dispatch_requests(struct cfq_data * follow expired path, else get first next available */ if ((rq = cfq_check_fifo(cfqq)) == NULL) - rq = cfqq-next_rq; + if ((rq = cfqq-next_rq) == NULL) + break; /* * finally, insert request into driver dispatch list - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
On Thu, Mar 01 2007, Frank Seidel wrote: Am Mittwoch, 28. Februar 2007 19:02 schrieb Dan Williams: I can reliably reproduce a null pointer dereference on 2.6.20 and 2.6.21-rc2. I will keep digging to find the kernel version where this last worked, but wanted to see if there were any immediate experiments I should try. ... Kernel 2.6.21-rc2 on an i686 ... [ 431.709022] BUG: unable to handle kernel NULL pointer dereference at virtual address 005c [ 431.717993] printing eip: ... [ 431.825386] EIP is at cfq_dispatch_insert+0xb/0x53 ... [ 431.887396] [c01e1fc9] cfq_dispatch_requests+0x138/0x3f0 Hi, unfortunately i yet don't really have much/enough knowledge of cfq and the kernels inwards at the moment... but looking at cfq_dispatch_insert+0xb it seems the struct request pointer given (as second parameter by cfq_dispatch_request) was NULL and dereferencing it in the RQ_CFQQ macro leads to this oops. The break-out patch below for __cfq_dispatch_request might be at least a possible workaround for this, but it could also be total bullsh.. Perhaps someone smarter might pick this up.. and give a real fix. Have fun, Frank --- block/cfq-iosched.c |3 ++- 1 files changed, 2 insertions(+), 1 deletion(-) Index: linux-2.6/block/cfq-iosched.c === --- linux-2.6.orig/block/cfq-iosched.c +++ linux-2.6/block/cfq-iosched.c @@ -962,7 +962,8 @@ __cfq_dispatch_requests(struct cfq_data * follow expired path, else get first next available */ if ((rq = cfq_check_fifo(cfqq)) == NULL) - rq = cfqq-next_rq; + if ((rq = cfqq-next_rq) == NULL) + break; /* * finally, insert request into driver dispatch list That is not the right fix. A little further up in this function, a check (well BUG_ON()) is done for a non-empty sort list. So we know at this point, that we have requests pending for this queue. When that is the case, -next_rq must always be kept uptodate and non-NULL. The oops at least tells us this, it should not be papered around. The real fix is finding out _where_ this now isn't being updated. I'm puzzled why this is hitting Dan, but no one else has reported anything. Dan, did 2.6.19 work for you? -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
On 3/1/07, Jens Axboe [EMAIL PROTECTED] wrote: On Thu, Mar 01 2007, Frank Seidel wrote: Am Mittwoch, 28. Februar 2007 19:02 schrieb Dan Williams: I can reliably reproduce a null pointer dereference on 2.6.20 and 2.6.21-rc2. I will keep digging to find the kernel version where this last worked, but wanted to see if there were any immediate experiments I should try. ... Kernel 2.6.21-rc2 on an i686 ... [ 431.709022] BUG: unable to handle kernel NULL pointer dereference at virtual address 005c [ 431.717993] printing eip: ... [ 431.825386] EIP is at cfq_dispatch_insert+0xb/0x53 ... [ 431.887396] [c01e1fc9] cfq_dispatch_requests+0x138/0x3f0 Hi, unfortunately i yet don't really have much/enough knowledge of cfq and the kernels inwards at the moment... but looking at cfq_dispatch_insert+0xb it seems the struct request pointer given (as second parameter by cfq_dispatch_request) was NULL and dereferencing it in the RQ_CFQQ macro leads to this oops. The break-out patch below for __cfq_dispatch_request might be at least a possible workaround for this, but it could also be total bullsh.. Perhaps someone smarter might pick this up.. and give a real fix. Have fun, Frank --- block/cfq-iosched.c |3 ++- 1 files changed, 2 insertions(+), 1 deletion(-) Index: linux-2.6/block/cfq-iosched.c === --- linux-2.6.orig/block/cfq-iosched.c +++ linux-2.6/block/cfq-iosched.c @@ -962,7 +962,8 @@ __cfq_dispatch_requests(struct cfq_data * follow expired path, else get first next available */ if ((rq = cfq_check_fifo(cfqq)) == NULL) - rq = cfqq-next_rq; + if ((rq = cfqq-next_rq) == NULL) + break; /* * finally, insert request into driver dispatch list That is not the right fix. A little further up in this function, a check (well BUG_ON()) is done for a non-empty sort list. So we know at this point, that we have requests pending for this queue. When that is the case, -next_rq must always be kept uptodate and non-NULL. The oops at least tells us this, it should not be papered around. The real fix is finding out _where_ this now isn't being updated. I'm puzzled why this is hitting Dan, but no one else has reported anything. Dan, did 2.6.19 work for you? I am puzzled as well, although I do not think many people run raid6 arrays with 2-failed disks, so it might be an under-tested path, but a non-degraded array runs fine... I fired up a 2.6.19 kernel and tiobench ran past the point (in terms of time) where it had failed on .20 and .21-rc. However I noticed things were running much slower since the cpu optimizations had fallen back to Pentium-Pro from Core2 which affects the raid6 p+q calculation speed among other things. So I need to re-baseline the failure against a more common config to say whether it is actually gone in 2.6.19. I should have time to try these tests next week. -- Jens Axboe Regards, Dan - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
Chuck Ebbert wrote: > There are two patches for raid5/6 out there that might fix this. I'll > attach them (the second just fixes a minor bug in the first one.) Never mind, those patches are already in 2.6.21-rc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
Dan Williams wrote: > I can reliably reproduce a null pointer dereference on 2.6.20 and > 2.6.21-rc2. I will keep digging to find the kernel version where this > last worked, but wanted to see if there were any immediate experiments I > should try. > > The failure is caused by running tiobench on a MD raid6 array with 6 out > of 8 disks available. The commands I issued to reproduce this are: > > mdadm -A /dev/md0 /dev/sd[bcdefg] > mount /dev/md0 /mnt/raid > tiobench --numruns 5 --size 2048 --dir /mnt/raid > > The filesystem is ext3. The controller is an LSI 1068. Here are the > two BUG messages first 2.6.21-rc2 followed by 2.6.20. I will reply to > this message with the config. > Kernel 2.6.20 on an i686 > > [ 177.299787] BUG: unable to handle kernel NULL pointer dereference at > virtual address 005c > [ 177.308526] printing eip: > [ 177.311287] c01de510 > [ 177.313521] *pde = 34d40001 > [ 177.316353] Oops: [#1] > [ 177.319202] SMP > [ 177.321107] Modules linked in: raid456 xor nfsd exportfs lockd nfs_acl > sunrpc autofs4 hidp l2cap bluetooth iptable_raw xt_policy xt_multiport > ipt_ULOG ipt_TTL ipt_ttl ipt_TOS ipt_tos ipt_SAME ipt_REJECT ipt_REDIRECT > ipt_recent ipt_owner ipt_NETMAP ipt_MASQUERADE ipt_LOG ipt_iprange ipt_ECN > ipt_ecn ipt_CLUSTERIP ipt_ah ipt_addrtype xt_tcpmss xt_pkttype xt_physdev > xt_NFQUEUE xt_MARK xt_mark xt_mac xt_limit xt_length xt_helper xt_dccp > xt_conntrack xt_CONNMARK xt_connmark xt_CLASSIFY xt_tcpudp xt_state > iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack iptable_mangle nfnetlink > iptable_filter ip_tables x_tables video sbs i2c_ec dock button battery > asus_acpi ac radeon drm ipv6 lp parport_pc parport e1000 uhci_hcd floppy > mptsas mptscsih mptbase sg ehci_hcd scsi_transport_sas i2c_i801 i2c_core > pcspkr dm_snapshot dm_zero dm_mirror dm_mod ata_piix ata_generic libata > sd_mod scsi_mod ext3 jbd > [ 177.402252] CPU:2 > [ 177.402253] EIP:0060:[]Not tainted VLI > [ 177.402253] EFLAGS: 00210016 (2.6.20 #5) > [ 177.414194] EIP is at cfq_dispatch_insert+0xb/0x53 > [ 177.419056] eax: f7773ec0 ebx: ecx: f7773cc0 edx: > [ 177.425982] esi: f70abae0 edi: f7773cc0 ebp: esp: f34dbcbc > [ 177.432953] ds: 007b es: 007b ss: 0068 > [ 177.437127] Process tiotest (pid: 5405, ti=f34db000 task=f7efc030 > task.ti=f34db000) > [ 177.444763] Stack: 0049 f77d3b9c f7773cc0 c01de6ce c014041e > f8a26806 0082 > [ 177.453456]f7efc030 fffe22d6 0004 > f7efc030 f7773cc0 > [ 177.462121] f70abae0 f7cd5800 f70abae0 > c01d4fcc 0001 > [ 177.470798] Call Trace: > [ 177.473503] [] cfq_dispatch_requests+0x12d/0x466 > [ 177.479223] [] __lock_acquire+0x9e9/0xa72 > [ 177.484285] [] scsi_request_fn+0x286/0x336 [scsi_mod] > [ 177.490485] [] elv_next_request+0x1a2/0x1b2 > [ 177.495766] [] scsi_request_fn+0x286/0x336 [scsi_mod] > [ 177.501912] [] _spin_lock_irq+0x38/0x43 > [ 177.506840] [] scsi_request_fn+0x59/0x336 [scsi_mod] > [ 177.512981] [] blk_remove_plug+0x5a/0x66 > [ 177.517983] [] __generic_unplug_device+0x1d/0x1f > [ 177.523705] [] generic_unplug_device+0x15/0x21 > [ 177.529272] [] unplug_slaves+0x54/0x88 [raid456] > [ 177.535013] [] blk_backing_dev_unplug+0x73/0x7b > [ 177.540657] [] _spin_unlock_irqrestore+0x3e/0x4d > [ 177.546382] [] sync_page+0x0/0x3b > [ 177.550774] [] trace_hardirqs_on+0x12e/0x158 > [ 177.556108] [] sync_page+0x0/0x3b > [ 177.560471] [] block_sync_page+0x31/0x32 > [ 177.565449] [] sync_page+0x33/0x3b > [ 177.569916] [] __wait_on_bit_lock+0x2a/0x52 > [ 177.575201] [] __lock_page+0x58/0x5e > [ 177.579810] [] wake_bit_function+0x0/0x3c > [ 177.584905] [] do_generic_mapping_read+0x1db/0x44f > [ 177.590911] [] generic_file_aio_read+0x173/0x1a4 > [ 177.596617] [] file_read_actor+0x0/0xdb > [ 177.601525] [] do_sync_read+0xc7/0x10a > [ 177.606365] [] autoremove_wake_function+0x0/0x35 > [ 177.612130] [] do_sync_read+0x0/0x10a > [ 177.616867] [] vfs_read+0xa6/0x152 > [ 177.621362] [] sys_read+0x41/0x67 > [ 177.625794] [] syscall_call+0x7/0xb > [ 177.630403] === > [ 177.634031] Code: da 11 3b c0 c7 04 24 51 9d 39 c0 e8 c9 a1 f4 ff e8 ca 6e > f2 ff ff 4f 34 83 c4 18 5b 5e 5f 5d c3 55 57 56 89 c6 53 8b 40 0c 89 d3 <8b> > 7a 5c 8b 68 04 89 d0 e8 b5 fe ff ff 8b 43 14 89 da 25 01 80 > [ 177.654378] EIP: [] cfq_dispatch_insert+0xb/0x53 SS:ESP > 0068:f34dbcbc cfq_dispatch_requests() has called cfq_dispatch_insert() with a NULL second argument (struct request *rq) There are two patches for raid5/6 out there that might fix this. I'll attach them (the second just fixes a minor bug in the first one.) From: Neil Brown <[EMAIL PROTECTED]> On Sunday February 11, [EMAIL PROTECTED] wrote: > > Greetings, > > > > I've been running md on my server for some time now and a few days ago one > > of
PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
I can reliably reproduce a null pointer dereference on 2.6.20 and 2.6.21-rc2. I will keep digging to find the kernel version where this last worked, but wanted to see if there were any immediate experiments I should try. The failure is caused by running tiobench on a MD raid6 array with 6 out of 8 disks available. The commands I issued to reproduce this are: mdadm -A /dev/md0 /dev/sd[bcdefg] mount /dev/md0 /mnt/raid tiobench --numruns 5 --size 2048 --dir /mnt/raid The filesystem is ext3. The controller is an LSI 1068. Here are the two BUG messages first 2.6.21-rc2 followed by 2.6.20. I will reply to this message with the config. Fedora Core release 5 (Bordeaux) Kernel 2.6.21-rc2 on an i686 [ 431.709022] BUG: unable to handle kernel NULL pointer dereference at virtual address 005c [ 431.717993] printing eip: [ 431.720825] c01e1e00 [ 431.723112] *pde = 32e70001 [ 431.726065] Oops: [#1] [ 431.728997] SMP [ 431.730922] Modules linked in: raid456 xor nfsd exportfs lockd nfs_acl sunrpc autofs4 hidp l2cap bluetooth iptable_raw xt_policy xt_multiport ipt_ULOG ipt_TTL ipt_ttl ipt_TOS ipt_tos ipt_SAME ipt_REJECT ipt_REDIRECT ipt_recent ipt_owner ipt_NETMAP ipt_MASQUERADE ipt_LOG ipt_iprange ipt_ECN ipt_ecn ipt_CLUSTERIP ipt_ah ipt_addrtype xt_tcpmss xt_pkttype xt_physdev xt_NFQUEUE xt_MARK xt_mark xt_mac xt_limit xt_length xt_helper xt_dccp xt_conntrack xt_CONNMARK xt_connmark xt_CLASSIFY xt_tcpudp xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack iptable_mangle nfnetlink iptable_filter ip_tables x_tables video sbs i2c_ec dock button battery asus_acpi ac radeon drm ipv6 lp parport_pc parport floppy uhci_hcd ehci_hcd e1000 i2c_i801 sg mptsas mptscsih mptbase i2c_core scsi_transport_sas pcspkr dm_snapshot dm_zero dm_mirror dm_mod ata_piix ata_generic libata sd_mod scsi_mod ext3 jbd [ 431.812682] CPU:0 [ 431.812682] EIP:0060:[]Not tainted VLI [ 431.812683] EFLAGS: 00010002 (2.6.21-rc2 #4) [ 431.825386] EIP is at cfq_dispatch_insert+0xb/0x53 [ 431.830413] eax: f6c96ec0 ebx: ecx: c0410568 edx: [ 431.837608] esi: f7e956a4 edi: ebp: f6c96cc0 esp: c0491e54 [ 431.844760] ds: 007b es: 007b fs: 00d8 gs: ss: 0068 [ 431.850847] Process swapper (pid: 0, ti=c0491000 task=c03ff4c0 task.ti=c0447000) [ 431.858360] Stack: f76ae3bc f6c96cc0 f6c96cc0 c01e1fc9 00e7 [ 431.867165]c03ffa10 c0143123 0004 c03ff4c0 f7e957ac [ 431.875998]f7e956a4 f7e956a4 f7d39000 f7e956a4 c01d8767 0001 0046 [ 431.884656] Call Trace: [ 431.887396] [] cfq_dispatch_requests+0x138/0x3f0 [ 431.893274] [] __lock_acquire+0xb64/0xbf4 [ 431.898513] [] elv_next_request+0x1a1/0x1b1 [ 431.903923] [] scsi_request_fn+0x59/0x336 [scsi_mod] [ 431.910148] [] blk_run_queue+0x37/0x63 [ 431.915100] [] scsi_next_command+0x25/0x2f [scsi_mod] [ 431.921330] [] scsi_end_request+0x9e/0xa8 [scsi_mod] [ 431.927493] [] scsi_io_completion+0x15a/0x32b [scsi_mod] [ 431.934113] [] sd_rw_intr+0x21b/0x245 [sd_mod] [ 431.939787] [] _spin_unlock_irqrestore+0x3e/0x4d [ 431.945640] [] scsi_finish_command+0x84/0x8b [scsi_mod] [ 431.952051] [] trace_hardirqs_on+0x116/0x158 [ 431.957446] [] __do_softirq+0x5a/0xe9 [ 431.962329] [] blk_done_softirq+0x68/0x73 [ 431.967447] [] __do_softirq+0x72/0xe9 [ 431.972290] [] do_softirq+0x6f/0xec [ 431.976888] [] _spin_unlock_irq+0x20/0x2c [ 431.982064] [] __sched_text_start+0x96b/0x9f3 [ 431.987574] [] handle_fasteoi_irq+0x0/0xab [ 431.992823] [] do_IRQ+0xbd/0xd4 [ 431.997061] [] common_interrupt+0x2e/0x34 [ 432.002301] [] mwait_idle_with_hints+0x3b/0x3f [ 432.007931] [] cpu_idle+0xb5/0xce [ 432.012368] [] start_kernel+0x4a5/0x4ad [ 432.017398] [] unknown_bootoption+0x0/0x202 [ 432.022829] === [ 432.026511] Code: 1f e9 3b c0 c7 04 24 51 6d 3a c0 e8 43 83 f4 ff e8 77 46 f2 ff ff 4f 34 83 c4 18 5b 5e 5f 5d c3 55 57 56 89 c6 53 8b 40 0c 89 d3 <8b> 7a 5c 8b 68 04 89 d0 e8 b5 fe ff ff 8b 43 14 89 da 25 01 80 [ 432.046781] EIP: [] cfq_dispatch_insert+0xb/0x53 SS:ESP 0068:c0491e54 [ 432.054403] Kernel panic - not syncing: Fatal exception in interrupt [ 432.060912] BUG: at arch/i386/kernel/smp.c:546 smp_call_function() [ 432.067203] [] smp_call_function+0x64/0xd0 [ 432.072473] [] do_unblank_screen+0x25/0x11b [ 432.077910] [] smp_send_stop+0x1b/0x40 [ 432.082848] [] panic+0x54/0xfd [ 432.087033] [] die+0x202/0x236 [ 432.091222] [] do_page_fault+0x507/0x5e0 [ 432.096323] [] kmem_cache_free+0xa1/0xb2 [ 432.101353] [] kmem_cache_free+0xa1/0xb2 [ 432.106415] [] do_page_fault+0x0/0x5e0 [ 432.111334] [] error_code+0x7c/0x84 [ 432.115934] [] cfq_dispatch_insert+0xb/0x53 [ 432.121304] [] cfq_dispatch_requests+0x138/0x3f0 [ 432.127161] [] __lock_acquire+0xb64/0xbf4 [ 432.132338] [] elv_next_request+0x1a1/0x1b1 [ 432.137608] []
PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
I can reliably reproduce a null pointer dereference on 2.6.20 and 2.6.21-rc2. I will keep digging to find the kernel version where this last worked, but wanted to see if there were any immediate experiments I should try. The failure is caused by running tiobench on a MD raid6 array with 6 out of 8 disks available. The commands I issued to reproduce this are: mdadm -A /dev/md0 /dev/sd[bcdefg] mount /dev/md0 /mnt/raid tiobench --numruns 5 --size 2048 --dir /mnt/raid The filesystem is ext3. The controller is an LSI 1068. Here are the two BUG messages first 2.6.21-rc2 followed by 2.6.20. I will reply to this message with the config. Fedora Core release 5 (Bordeaux) Kernel 2.6.21-rc2 on an i686 [ 431.709022] BUG: unable to handle kernel NULL pointer dereference at virtual address 005c [ 431.717993] printing eip: [ 431.720825] c01e1e00 [ 431.723112] *pde = 32e70001 [ 431.726065] Oops: [#1] [ 431.728997] SMP [ 431.730922] Modules linked in: raid456 xor nfsd exportfs lockd nfs_acl sunrpc autofs4 hidp l2cap bluetooth iptable_raw xt_policy xt_multiport ipt_ULOG ipt_TTL ipt_ttl ipt_TOS ipt_tos ipt_SAME ipt_REJECT ipt_REDIRECT ipt_recent ipt_owner ipt_NETMAP ipt_MASQUERADE ipt_LOG ipt_iprange ipt_ECN ipt_ecn ipt_CLUSTERIP ipt_ah ipt_addrtype xt_tcpmss xt_pkttype xt_physdev xt_NFQUEUE xt_MARK xt_mark xt_mac xt_limit xt_length xt_helper xt_dccp xt_conntrack xt_CONNMARK xt_connmark xt_CLASSIFY xt_tcpudp xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack iptable_mangle nfnetlink iptable_filter ip_tables x_tables video sbs i2c_ec dock button battery asus_acpi ac radeon drm ipv6 lp parport_pc parport floppy uhci_hcd ehci_hcd e1000 i2c_i801 sg mptsas mptscsih mptbase i2c_core scsi_transport_sas pcspkr dm_snapshot dm_zero dm_mirror dm_mod ata_piix ata_generic libata sd_mod scsi_mod ext3 jbd [ 431.812682] CPU:0 [ 431.812682] EIP:0060:[c01e1e00]Not tainted VLI [ 431.812683] EFLAGS: 00010002 (2.6.21-rc2 #4) [ 431.825386] EIP is at cfq_dispatch_insert+0xb/0x53 [ 431.830413] eax: f6c96ec0 ebx: ecx: c0410568 edx: [ 431.837608] esi: f7e956a4 edi: ebp: f6c96cc0 esp: c0491e54 [ 431.844760] ds: 007b es: 007b fs: 00d8 gs: ss: 0068 [ 431.850847] Process swapper (pid: 0, ti=c0491000 task=c03ff4c0 task.ti=c0447000) [ 431.858360] Stack: f76ae3bc f6c96cc0 f6c96cc0 c01e1fc9 00e7 [ 431.867165]c03ffa10 c0143123 0004 c03ff4c0 f7e957ac [ 431.875998]f7e956a4 f7e956a4 f7d39000 f7e956a4 c01d8767 0001 0046 [ 431.884656] Call Trace: [ 431.887396] [c01e1fc9] cfq_dispatch_requests+0x138/0x3f0 [ 431.893274] [c0143123] __lock_acquire+0xb64/0xbf4 [ 431.898513] [c01d8767] elv_next_request+0x1a1/0x1b1 [ 431.903923] [f8a26621] scsi_request_fn+0x59/0x336 [scsi_mod] [ 431.910148] [c01dbb20] blk_run_queue+0x37/0x63 [ 431.915100] [f8a25561] scsi_next_command+0x25/0x2f [scsi_mod] [ 431.921330] [f8a2571f] scsi_end_request+0x9e/0xa8 [scsi_mod] [ 431.927493] [f8a258c0] scsi_io_completion+0x15a/0x32b [scsi_mod] [ 431.934113] [f882c5fb] sd_rw_intr+0x21b/0x245 [sd_mod] [ 431.939787] [c031b23a] _spin_unlock_irqrestore+0x3e/0x4d [ 431.945640] [f8a213f6] scsi_finish_command+0x84/0x8b [scsi_mod] [ 431.952051] [c0142166] trace_hardirqs_on+0x116/0x158 [ 431.957446] [c012e181] __do_softirq+0x5a/0xe9 [ 431.962329] [c01dc291] blk_done_softirq+0x68/0x73 [ 431.967447] [c012e199] __do_softirq+0x72/0xe9 [ 431.972290] [c0107033] do_softirq+0x6f/0xec [ 431.976888] [c031b0ce] _spin_unlock_irq+0x20/0x2c [ 431.982064] [c0318b1b] __sched_text_start+0x96b/0x9f3 [ 431.987574] [c01553a1] handle_fasteoi_irq+0x0/0xab [ 431.992823] [c010716d] do_IRQ+0xbd/0xd4 [ 431.997061] [c0105886] common_interrupt+0x2e/0x34 [ 432.002301] [c0103240] mwait_idle_with_hints+0x3b/0x3f [ 432.007931] [c01033b9] cpu_idle+0xb5/0xce [ 432.012368] [c044ca9a] start_kernel+0x4a5/0x4ad [ 432.017398] [c044c1b8] unknown_bootoption+0x0/0x202 [ 432.022829] === [ 432.026511] Code: 1f e9 3b c0 c7 04 24 51 6d 3a c0 e8 43 83 f4 ff e8 77 46 f2 ff ff 4f 34 83 c4 18 5b 5e 5f 5d c3 55 57 56 89 c6 53 8b 40 0c 89 d3 8b 7a 5c 8b 68 04 89 d0 e8 b5 fe ff ff 8b 43 14 89 da 25 01 80 [ 432.046781] EIP: [c01e1e00] cfq_dispatch_insert+0xb/0x53 SS:ESP 0068:c0491e54 [ 432.054403] Kernel panic - not syncing: Fatal exception in interrupt [ 432.060912] BUG: at arch/i386/kernel/smp.c:546 smp_call_function() [ 432.067203] [c0118c63] smp_call_function+0x64/0xd0 [ 432.072473] [c023df9a] do_unblank_screen+0x25/0x11b [ 432.077910] [c0118cea] smp_send_stop+0x1b/0x40 [ 432.082848] [c01296cb] panic+0x54/0xfd [ 432.087033] [c010639c] die+0x202/0x236 [ 432.091222] [c031cc58] do_page_fault+0x507/0x5e0 [ 432.096323] [c01716e2] kmem_cache_free+0xa1/0xb2 [ 432.101353] [c01716e2] kmem_cache_free+0xa1/0xb2 [ 432.106415]
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
Dan Williams wrote: I can reliably reproduce a null pointer dereference on 2.6.20 and 2.6.21-rc2. I will keep digging to find the kernel version where this last worked, but wanted to see if there were any immediate experiments I should try. The failure is caused by running tiobench on a MD raid6 array with 6 out of 8 disks available. The commands I issued to reproduce this are: mdadm -A /dev/md0 /dev/sd[bcdefg] mount /dev/md0 /mnt/raid tiobench --numruns 5 --size 2048 --dir /mnt/raid The filesystem is ext3. The controller is an LSI 1068. Here are the two BUG messages first 2.6.21-rc2 followed by 2.6.20. I will reply to this message with the config. Kernel 2.6.20 on an i686 [ 177.299787] BUG: unable to handle kernel NULL pointer dereference at virtual address 005c [ 177.308526] printing eip: [ 177.311287] c01de510 [ 177.313521] *pde = 34d40001 [ 177.316353] Oops: [#1] [ 177.319202] SMP [ 177.321107] Modules linked in: raid456 xor nfsd exportfs lockd nfs_acl sunrpc autofs4 hidp l2cap bluetooth iptable_raw xt_policy xt_multiport ipt_ULOG ipt_TTL ipt_ttl ipt_TOS ipt_tos ipt_SAME ipt_REJECT ipt_REDIRECT ipt_recent ipt_owner ipt_NETMAP ipt_MASQUERADE ipt_LOG ipt_iprange ipt_ECN ipt_ecn ipt_CLUSTERIP ipt_ah ipt_addrtype xt_tcpmss xt_pkttype xt_physdev xt_NFQUEUE xt_MARK xt_mark xt_mac xt_limit xt_length xt_helper xt_dccp xt_conntrack xt_CONNMARK xt_connmark xt_CLASSIFY xt_tcpudp xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack iptable_mangle nfnetlink iptable_filter ip_tables x_tables video sbs i2c_ec dock button battery asus_acpi ac radeon drm ipv6 lp parport_pc parport e1000 uhci_hcd floppy mptsas mptscsih mptbase sg ehci_hcd scsi_transport_sas i2c_i801 i2c_core pcspkr dm_snapshot dm_zero dm_mirror dm_mod ata_piix ata_generic libata sd_mod scsi_mod ext3 jbd [ 177.402252] CPU:2 [ 177.402253] EIP:0060:[c01de510]Not tainted VLI [ 177.402253] EFLAGS: 00210016 (2.6.20 #5) [ 177.414194] EIP is at cfq_dispatch_insert+0xb/0x53 [ 177.419056] eax: f7773ec0 ebx: ecx: f7773cc0 edx: [ 177.425982] esi: f70abae0 edi: f7773cc0 ebp: esp: f34dbcbc [ 177.432953] ds: 007b es: 007b ss: 0068 [ 177.437127] Process tiotest (pid: 5405, ti=f34db000 task=f7efc030 task.ti=f34db000) [ 177.444763] Stack: 0049 f77d3b9c f7773cc0 c01de6ce c014041e f8a26806 0082 [ 177.453456]f7efc030 fffe22d6 0004 f7efc030 f7773cc0 [ 177.462121] f70abae0 f7cd5800 f70abae0 c01d4fcc 0001 [ 177.470798] Call Trace: [ 177.473503] [c01de6ce] cfq_dispatch_requests+0x12d/0x466 [ 177.479223] [c014041e] __lock_acquire+0x9e9/0xa72 [ 177.484285] [f8a26806] scsi_request_fn+0x286/0x336 [scsi_mod] [ 177.490485] [c01d4fcc] elv_next_request+0x1a2/0x1b2 [ 177.495766] [f8a26806] scsi_request_fn+0x286/0x336 [scsi_mod] [ 177.501912] [c0315ba8] _spin_lock_irq+0x38/0x43 [ 177.506840] [f8a265d9] scsi_request_fn+0x59/0x336 [scsi_mod] [ 177.512981] [c01d7e7d] blk_remove_plug+0x5a/0x66 [ 177.517983] [c01d7ea6] __generic_unplug_device+0x1d/0x1f [ 177.523705] [c01d8278] generic_unplug_device+0x15/0x21 [ 177.529272] [f97ee054] unplug_slaves+0x54/0x88 [raid456] [ 177.535013] [c01d997a] blk_backing_dev_unplug+0x73/0x7b [ 177.540657] [c0315d82] _spin_unlock_irqrestore+0x3e/0x4d [ 177.546382] [c0154b26] sync_page+0x0/0x3b [ 177.550774] [c013f5f4] trace_hardirqs_on+0x12e/0x158 [ 177.556108] [c0154b26] sync_page+0x0/0x3b [ 177.560471] [c018caa5] block_sync_page+0x31/0x32 [ 177.565449] [c0154b59] sync_page+0x33/0x3b [ 177.569916] [c0313d9e] __wait_on_bit_lock+0x2a/0x52 [ 177.575201] [c0154b18] __lock_page+0x58/0x5e [ 177.579810] [c0139612] wake_bit_function+0x0/0x3c [ 177.584905] [c0155228] do_generic_mapping_read+0x1db/0x44f [ 177.590911] [c01570cb] generic_file_aio_read+0x173/0x1a4 [ 177.596617] [c0154930] file_read_actor+0x0/0xdb [ 177.601525] [c0171b47] do_sync_read+0xc7/0x10a [ 177.606365] [c01395dd] autoremove_wake_function+0x0/0x35 [ 177.612130] [c0171a80] do_sync_read+0x0/0x10a [ 177.616867] [c01723ce] vfs_read+0xa6/0x152 [ 177.621362] [c0172830] sys_read+0x41/0x67 [ 177.625794] [c0103e24] syscall_call+0x7/0xb [ 177.630403] === [ 177.634031] Code: da 11 3b c0 c7 04 24 51 9d 39 c0 e8 c9 a1 f4 ff e8 ca 6e f2 ff ff 4f 34 83 c4 18 5b 5e 5f 5d c3 55 57 56 89 c6 53 8b 40 0c 89 d3 8b 7a 5c 8b 68 04 89 d0 e8 b5 fe ff ff 8b 43 14 89 da 25 01 80 [ 177.654378] EIP: [c01de510] cfq_dispatch_insert+0xb/0x53 SS:ESP 0068:f34dbcbc cfq_dispatch_requests() has called cfq_dispatch_insert() with a NULL second argument (struct request *rq) There are two patches for raid5/6 out there that might fix this. I'll attach them (the second just fixes a minor bug in the first one.) From: Neil Brown [EMAIL
Re: PROBLEM: null pointer dereference in cfq_dispatch_requests (2.6.21-rc2 and 2.6.20)
Chuck Ebbert wrote: There are two patches for raid5/6 out there that might fix this. I'll attach them (the second just fixes a minor bug in the first one.) Never mind, those patches are already in 2.6.21-rc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/