RE: AACRAID driver broken in 2.6.22.x (and beyond?) [WAS: Re: 2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN]
At which version of the kernel did the aacraid driver allegedly first break? At which version did it get fixed? (1.1.5-2451 is older than the latest represented on kernel.org.)

How is the SATA disk arrayed on the aacraid controller? The controller is limited to generating 24 arrays, and since /dev/sdac is the 29th target, it would appear we need more details on your array's topology inside the aacraid controller. If you are using the driver with aacraid.physical=1 and thus using the physical drives directly (in the case of a SATA disk, a SATr0.9 translation in the Firmware), this is not a supported configuration; it was added only to enable limited experimentation. If there is a problem in that path in the driver, I will be glad to fix it, but it remains unsupported.

You may need to acquire a diagnostic dump from the controller (Adaptec technical support can advise; it will depend on your application suite) and a report of any error recovery actions, initiated by the SCSI subsystem, that the driver reported in the system log.

There are no changes in the I/O path of the aacraid driver, and due to the simplicity of the I/O path to this processor-based controller, it is unlikely to be an issue in that path. There have, however, been several changes in the driver to deal with error recovery actions initiated by the SCSI subsystem. One likely candidate extended the default SCSI layer timeout, because it was shorter than the adapter firmware's own timeout. You can check whether this is the issue by manually increasing the timeout for the target(s) via sysfs.
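For example, something like the following (a minimal sketch, not Adaptec tooling; the sdac device name and the 90-second value are assumptions for illustration, and "echo 90 > /sys/block/sdac/device/timeout" from a shell is equivalent):

/* Raise the SCSI command timeout (in seconds) for one target via
 * sysfs. The device name (sdac) is taken from the report below;
 * pick a value longer than the adapter firmware's own timeout.
 */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/block/sdac/device/timeout";
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return 1;
    }
    fprintf(f, "%d\n", 90);    /* 90s: illustrative value only */
    return fclose(f) ? 1 : 0;
}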
There were recent patches to deal with orphaned commands resulting from devices being taken offline by the SCSI layer. There have also been changes in the driver to reset the controller should it go into a BlinkLED (Firmware Assert) state. The symptom also resembles a condition in the older drivers (pre-08/08/2006 on scsi-misc-2.6, showing up in 2.6.20.4) which did not reset the adapter when it entered the BlinkLED state and merely allowed the system to lock up; but alas, the version you report already includes this reset fix. A BlinkLED condition generally indicates a serious hardware problem or target incompatibility, and is generally rare, as these are the result of corner-case conditions within the Adapter Firmware.

The diagnostic dump reported by the Adaptec utilities should be able to point to the fault you are experiencing if these appear to be the root causes.

Sincerely -- Mark Salyzyn

-----Original Message-----
From: Mike Snitzer [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 22, 2008 7:10 PM
To: linux-raid@vger.kernel.org; NeilBrown
Cc: [EMAIL PROTECTED]; K. Tanaka; AACRAID; [EMAIL PROTECTED]
Subject: AACRAID driver broken in 2.6.22.x (and beyond?) [WAS: Re: 2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN]

[...]
Re: AACRAID driver broken in 2.6.22.x (and beyond?) [WAS: Re: 2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN]

> The diagnostic dump reported by the Adaptec utilities should be able
> to point to the fault you are experiencing if these appear to be the
> root causes.

It would seem that 1.1.5-2451 has the firmware reset support, given the log I provided above, no?

Anyway, with 2.6.22.16, when a drive is pulled using the aacraid 1.1-5[2437]-mh4 there are absolutely no errors from the aacraid driver; in fact the SCSI layer doesn't see anything until I force the issue with explicit reads/writes to the device that was pulled.
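(To force that I/O deterministically, a direct read of the raw device works; below is a minimal C sketch, with the /dev/sdac path assumed -- "dd if=/dev/sdac of=/dev/null bs=4096 count=1 iflag=direct" is the shell equivalent. O_DIRECT bypasses the page cache so the read really hits the target:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    int fd = open("/dev/sdac", O_RDONLY | O_DIRECT);

    if (fd < 0) {
        perror("open /dev/sdac");
        return 1;
    }
    if (posix_memalign(&buf, 4096, 4096)) {  /* O_DIRECT needs alignment */
        close(fd);
        return 1;
    }
    if (read(fd, buf, 4096) < 0)
        perror("read");    /* an I/O error here is exactly the point */

    free(buf);
    close(fd);
    return 0;
}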
It could be that on a drive pull the 1.1.5-2451 driver results in a BlinkLED, resets the firmware, and continues; whereas with 1.1-5[2437]-mh4 I get no BlinkLED, and as such Linux (both the SCSI layer and raid1) is completely unaware of any disconnect of the physical device.

thanks, Mike

-----Original Message-----
From: Mike Snitzer [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 22, 2008 7:10 PM
To: linux-raid@vger.kernel.org; NeilBrown
Cc: [EMAIL PROTECTED]; K. Tanaka; AACRAID; [EMAIL PROTECTED]
Subject: AACRAID driver broken in 2.6.22.x (and beyond?) [WAS: Re: 2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN]

[...]
AACRAID driver broken in 2.6.22.x (and beyond?) [WAS: Re: 2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN]

On Jan 22, 2008 12:29 AM, Mike Snitzer [EMAIL PROTECTED] wrote:
> cc'ing Tanaka-san given his recent raid1 BUG report:
> http://lkml.org/lkml/2008/1/14/515
>
> On Jan 21, 2008 6:04 PM, Mike Snitzer [EMAIL PROTECTED] wrote:
> > Under 2.6.22.16, I physically pulled a SATA disk (/dev/sdac,
> > connected to an aacraid controller) that was acting as the local
> > raid1 member of /dev/md30. Linux MD didn't see an /dev/sdac1 error
> > until I tried forcing the issue by doing a read (with dd) from
> > /dev/md30:

The raid1d thread is locked at line 720 in raid1.c (raid1d+2437), aka freeze_array:

(gdb) l *0x2539
0x2539 is in raid1d (drivers/md/raid1.c:720).
715              *    wait until barrier+nr_pending match nr_queued+2
716              */
717             spin_lock_irq(&conf->resync_lock);
718             conf->barrier++;
719             conf->nr_waiting++;
720             wait_event_lock_irq(conf->wait_barrier,
721                                 conf->barrier+conf->nr_pending == conf->nr_queued+2,
722                                 conf->resync_lock,
723                                 raid1_unplug(conf->mddev->queue));
724             spin_unlock_irq(&conf->resync_lock);

Given Tanaka-san's report against 2.6.23, and my hitting what seems to be the same deadlock in 2.6.22.16, it stands to reason this affects raid1 in 2.6.24-rcX too.

Turns out that the aacraid driver in 2.6.22.x is HORRIBLY BROKEN (when you pull a drive); it responds to MD's write requests with uptodate=1 (in raid1_end_write_request) for the drive that was pulled! I've not looked to see if aacraid has been fixed in newer kernels... are others aware of any crucial aacraid fixes in 2.6.23.x or 2.6.24?

After the drive was physically pulled, and small periodic writes continued to the associated MD device, the raid1 MD driver did _NOT_ detect the pulled drive's writes as having failed (verified this with systemtap). MD happily thought the writes completed to both members (so MD had no reason to mark the pulled drive faulty, or mark the raid degraded).
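To see why that false uptodate=1 is fatal to failure detection: raid1 only calls md_error() (which marks the member Faulty and degrades the array) when a write completes with uptodate == 0. Here is a tiny runnable userspace model of that decision -- a sketch, not the kernel source -- showing how a driver that reports success for a pulled drive leaves the array at [UU]:

#include <stdio.h>

struct member { const char *name; int faulty; };

/* Models raid1_end_write_request's failure handling: only a
 * completion with uptodate == 0 escalates to the md_error() path. */
static void end_write_request(struct member *m, int uptodate)
{
    if (!uptodate)
        m->faulty = 1;  /* md_error(): mark Faulty, degrade the array */
    /* uptodate == 1: the write is simply counted as good */
}

int main(void)
{
    struct member sdac1 = { "sdac1", 0 }, nbd2 = { "nbd2", 0 };

    /* sdac1 was pulled, but the buggy driver still completes the
     * write with uptodate=1 (it should have been 0): */
    end_write_request(&sdac1, 1);
    end_write_request(&nbd2, 1);

    printf("array state: [%c%c]\n",
           sdac1.faulty ? '_' : 'U',
           nbd2.faulty ? '_' : 'U');
    /* prints "array state: [UU]" -- MD never sees a failure */
    return 0;
}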
Installing an Adaptec-provided 1.1-5[2451] driver enabled raid1 to work as expected.

That said, I now have a recipe for hitting the raid1 deadlock that Tanaka first reported over a week ago. I'm still surprised that all of this chatter about that BUG hasn't drawn interest/scrutiny from others!?

regards, Mike
2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN
Under 2.6.22.16, I physically pulled a SATA disk (/dev/sdac, connected to an aacraid controller) that was acting as the local raid1 member of /dev/md30. Linux MD didn't see an /dev/sdac1 error until I tried forcing the issue by doing a read (with dd) from /dev/md30:

Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Sense Key : Hardware Error [current]
Jan 21 17:08:07 lab17-233 kernel: Info fld=0x0
Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Add. Sense: Internal target failure
Jan 21 17:08:07 lab17-233 kernel: end_request: I/O error, dev sdac, sector 71
Jan 21 17:08:07 lab17-233 kernel: printk: 3 messages suppressed.
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 8
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 16
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 24
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 32
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 40
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 48
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 56
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 64
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 72
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 80
Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Sense Key : Hardware Error [current]
Jan 21 17:08:07 lab17-233 kernel: Info fld=0x0
Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Add. Sense: Internal target failure
Jan 21 17:08:07 lab17-233 kernel: end_request: I/O error, dev sdac, sector 343
Jan 21 17:08:08 lab17-233 kernel: sd 2:0:27:0: [sdac] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Jan 21 17:08:08 lab17-233 kernel: sd 2:0:27:0: [sdac] Sense Key : Hardware Error [current]
Jan 21 17:08:08 lab17-233 kernel: Info fld=0x0
...
Jan 21 17:08:12 lab17-233 kernel: sd 2:0:27:0: [sdac] Add. Sense: Internal target failure
Jan 21 17:08:12 lab17-233 kernel: end_request: I/O error, dev sdac, sector 3399
Jan 21 17:08:12 lab17-233 kernel: printk: 765 messages suppressed.
Jan 21 17:08:12 lab17-233 kernel: raid1: sdac1: rescheduling sector 3336

However, the MD layer still hasn't marked the sdac1 member faulty:

md30 : active raid1 nbd2[1](W) sdac1[0]
      4016204 blocks super 1.0 [2/2] [UU]
      bitmap: 1/8 pages [4KB], 256KB chunk

The dd I used to read from /dev/md30 is blocked on IO:

Jan 21 17:13:55 lab17-233 kernel: dd            D 0afa9cf5c346     0 12337   7702 (NOTLB)
Jan 21 17:13:55 lab17-233 kernel:  81010c449868 0082 80268f14
Jan 21 17:13:55 lab17-233 kernel:  81015da6f320 81015de532c0 0008 81012d9d7780
Jan 21 17:13:55 lab17-233 kernel:  81015fae2880 4926 81012d9d7970 0001802879a0
Jan 21 17:13:55 lab17-233 kernel: Call Trace:
Jan 21 17:13:55 lab17-233 kernel:  [80268f14] mempool_alloc+0x24/0xda
Jan 21 17:13:55 lab17-233 kernel:  [88b91381] :raid1:wait_barrier+0x84/0xc2
Jan 21 17:13:55 lab17-233 kernel:  [8022d8fa] default_wake_function+0x0/0xe
Jan 21 17:13:55 lab17-233 kernel:  [88b92093] :raid1:make_request+0x83/0x5c0
Jan 21 17:13:55 lab17-233 kernel:  [80305acd] __make_request+0x57f/0x668
Jan 21 17:13:55 lab17-233 kernel:  [80302dc7] generic_make_request+0x26e/0x2a9
Jan 21 17:13:55 lab17-233 kernel:  [80268f14] mempool_alloc+0x24/0xda
Jan 21 17:13:55 lab17-233 kernel:  [8030db39] __next_cpu+0x19/0x28
Jan 21 17:13:55 lab17-233 kernel:  [80305162] submit_bio+0xb6/0xbd
Jan 21 17:13:55 lab17-233 kernel:  [802aba6a] submit_bh+0xdf/0xff
Jan 21 17:13:55 lab17-233 kernel:  [802ae188] block_read_full_page+0x271/0x28e
Jan 21 17:13:55 lab17-233 kernel:  [802b0b27] blkdev_get_block+0x0/0x46
Jan 21 17:13:55 lab17-233 kernel:  [803103ad] radix_tree_insert+0xcb/0x18c
Jan 21 17:13:55 lab17-233 kernel:  [8026d003] __do_page_cache_readahead+0x16d/0x1df
Jan 21 17:13:55 lab17-233 kernel:  [80248c51] getnstimeofday+0x32/0x8d
Jan 21 17:13:55 lab17-233 kernel:  [80247e5e] ktime_get_ts+0x1a/0x4e
Jan 21 17:13:55 lab17-233 kernel:  [80265543] delayacct_end+0x7d/0x88
Jan 21 17:13:55 lab17-233 kernel:  [8026d0c8] blockable_page_cache_readahead+0x53/0xb2
Jan 21 17:13:55 lab17-233 kernel:  [8026d1a9] make_ahead_window+0x82/0x9e
Jan 21 17:13:55 lab17-233 kernel:  [8026d34f] page_cache_readahead+0x18a/0x1c1
Jan 21 17:13:55 lab17-233 kernel:  [8026723c] do_generic_mapping_read+0x135/0x3fc
Jan 21 17:13:55 lab17-233 kernel:  [80266755] file_read_actor+0x0/0x170
Jan 21 17:13:55 lab17-233 kernel:
Re: 2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN
cc'ing Tanaka-san given his recent raid1 BUG report:
http://lkml.org/lkml/2008/1/14/515

On Jan 21, 2008 6:04 PM, Mike Snitzer [EMAIL PROTECTED] wrote:
> Under 2.6.22.16, I physically pulled a SATA disk (/dev/sdac, connected
> to an aacraid controller) that was acting as the local raid1 member of
> /dev/md30. Linux MD didn't see an /dev/sdac1 error until I tried
> forcing the issue by doing a read (with dd) from /dev/md30:
> [...]