Re: System lockups caused by USB external HDD
On 01/24/11 13:27, Hans Petter Selasky wrote:
> On Monday 24 January 2011 12:08:47 CDP wrote:
>> On 01/24/11 11:34, Hans Petter Selasky wrote:
>>> On Monday 24 January 2011 10:00:53 CDP wrote:
>>>> On 01/24/11 01:56, Daniel O'Connor wrote:
>>>>> On 24/01/2011, at 9:10, CDP wrote:
>>>>>> g_vfs_done():da0s2[WRITE(offset=, length=16384)]error = 5
>>>>>> [several more lines similar to the above]
>>>>>> panic: softdep_move_dependencies: need merge code
>>>>>> cpuid = 0
>>>>>> KDB: stack backtrace:
>>>>>> #0 0x... at kdb_backtrace+0x5e
>>>>>> #1 0x... at panic+0x182
>>>>>
>>>>> It looks like the disk is dying, or the FS is corrupt (the former might
>>>>> cause the later).
>>>>>
>>>>> Can you run smartctl on the disk? Unfortunately a lot of enclosures
>>>>> reject SMART commands so you might not be able to :(
>>>>
>>>> I have attached the output of smartctl -d sat -a /dev/da0. I didn't yet
>>>> run a SMART long test for the simple reason that the disk is going into
>>>> sleep mode and interrupts it. Haven't bothered to keep it alive for a
>>>> long test but I might just do that.
>>>>
>>>> Although, I doubt it's a disk failure, since I do backups on it without
>>>> problems by using FreeBSD 7.3, on the same space where FreeBSD 8.x
>>>> fails. And I am talking about over 150GB of data in one run, while
>>>> 8.2-RC2 crashes after 5-10GB. I have experienced disk failure in the
>>>> past, on SATA, and a few read/write errors never caused a system lockup.
>>>>
>>>> My feeling is that enough traffic on USB causes the problem, and that
>>>> this problem is only present in the new USB stack.
>>>> Unfortunately downgrading to 7.x is not an option because there are
>>>> things that won't work on this notebook.
>>>
>>> If you run a simple test like this:
>>>
>>> dd if=/dev/da0 of=/dev/null bs=65536
>>> dd if=/dev/da0 of=/dev/null bs=16384
>>>
>>> Do you then see any errors?
>>>
>>> Do you have a spare USB memory stick which you could run similar write
>>> tests on?
>>
>> Both reads fail with I/O error, while writes to an unused partition seem
>> to be fine (I interrupted the writes after a while):
>>
>> % dd if=/dev/da0 of=/dev/null bs=65536
>> dd: /dev/da0: Input/output error
>> 191732+0 records in
>> 191732+0 records out
>> 12565348352 bytes transferred in 429.999272 secs (29221790 bytes/sec)
>>
>> % dd if=/dev/da0 of=/dev/null bs=16384
>> dd: /dev/da0: Input/output error
>> 126427+0 records in
>> 126427+0 records out
>> 2071379968 bytes transferred in 169.431766 secs (12225452 bytes/sec)
>>
>> # dd if=/dev/random of=/dev/da0s3 bs=65536
>> ^C329378+0 records in
>> 329377+0 records out
>> 21586051072 bytes transferred in 1003.020293 secs (21521051 bytes/sec)
>>
>> # dd if=/dev/random of=/dev/da0s3 bs=16384
>> ^C679571+0 records in
>> 679571+0 records out
>> 11134091264 bytes transferred in 690.135793 secs (16133189 bytes/sec)
>>
>> This is what I get in /var/log/messages when the I/O error occurs:
>> (da0:umass-sim0:0:0:0): AutoSense failed
>>
>> However, I experience no lockup. Maybe this situation is not handled
>> correctly at another level ?
>
> I haven't looked into the code of CAM or GEOM that much so I won't say too
> much about that. I believe the USB/umass is not to blame. What you could do
> is to add a conditional error printout in "umass_t_bbb_status_callback()" in
> /sys/dev/usb/storage/umass.c when the error happens. If that error is not a
> USB transport error, then we are most likely seeing a SCSI issue in layers
> above umass. Or if you have access to USB analyser use that. There is now
> also the option to trace USB from the kernel itself, but the feature is in
> its early development.

You are right, I've tracked the problem down to CAM (cam_periph.c:
camperiphsensedone()). I've changed the code to behave as it did in 7.3,
and it mitigates the problem. I don't get "AutoSense failed" errors
anymore and I don't get any lockups/crashes, not even when using
softupdates on the external hdd.
However, the pauses in disk operations still happen, but this doesn't
seem to create any further issues. I haven't looked into this.

I've attached a patch. I don't know if this behavior is correct, and I
hope someone that knows CAM can take a look into this issue.

Claudiu.

--- sys/cam/cam_periph.c.orig 2011-01-26 09:38:21.0 +0200
+++ sys/cam/cam_periph.c 2011-01-26 09:38:02.0 +0200
@@ -1024,7 +1024,9 @@
 	int frozen = 0;
 	u_int sense_key;
 	int depth = done_ccb->ccb_h.recovery_depth;
+	int xpt_done_ccb;
 
+	xpt_done_ccb = FALSE;
 	status = done_ccb->ccb_h.status;
 	if (status & CAM_DEV_QFRZN) {
 		frozen = 1;
@@ -1049,14 +1051,22 @@
 	if (sense_key != SSD_KEY_NO_SENSE) {
 		saved_ccb->ccb_h.status |= CAM_AUTOSNS_VALID;
-	} else {
+
+xpt_done_ccb = TRUE;
+
Re: System lockups caused by USB external HDD
On 01/24/11 13:27, Hans Petter Selasky wrote:
> On Monday 24 January 2011 12:08:47 CDP wrote:
>> On 01/24/11 11:34, Hans Petter Selasky wrote:
>>> On Monday 24 January 2011 10:00:53 CDP wrote:
>>>> On 01/24/11 01:56, Daniel O'Connor wrote:
>>>>> On 24/01/2011, at 9:10, CDP wrote:
>>>>>> g_vfs_done():da0s2[WRITE(offset=, length=16384)]error = 5
>>>>>> [several more lines similar to the above]
>>>>>> panic: softdep_move_dependencies: need merge code
>>>>>> cpuid = 0
>>>>>> KDB: stack backtrace:
>>>>>> #0 0x... at kdb_backtrace+0x5e
>>>>>> #1 0x... at panic+0x182
>>>>>
>>>>> It looks like the disk is dying, or the FS is corrupt (the former might
>>>>> cause the later).
>>>>>
>>>>> Can you run smartctl on the disk? Unfortunately a lot of enclosures
>>>>> reject SMART commands so you might not be able to :(
>>>>
>>>> I have attached the output of smartctl -d sat -a /dev/da0. I didn't yet
>>>> run a SMART long test for the simple reason that the disk is going into
>>>> sleep mode and interrupts it. Haven't bothered to keep it alive for a
>>>> long test but I might just do that.
>>>>
>>>> Although, I doubt it's a disk failure, since I do backups on it without
>>>> problems by using FreeBSD 7.3, on the same space where FreeBSD 8.x
>>>> fails. And I am talking about over 150GB of data in one run, while
>>>> 8.2-RC2 crashes after 5-10GB. I have experienced disk failure in the
>>>> past, on SATA, and a few read/write errors never caused a system lockup.
>>>>
>>>> My feeling is that enough traffic on USB causes the problem, and that
>>>> this problem is only present in the new USB stack.
>>>> Unfortunately downgrading to 7.x is not an option because there are
>>>> things that won't work on this notebook.
>>>
>>> If you run a simple test like this:
>>>
>>> dd if=/dev/da0 of=/dev/null bs=65536
>>> dd if=/dev/da0 of=/dev/null bs=16384
>>>
>>> Do you then see any errors?
>>>
>>> Do you have a spare USB memory stick which you could run similar write
>>> tests on?
>>
>> Both reads fail with I/O error, while writes to an unused partition seem
>> to be fine (I interrupted the writes after a while):
>>
>> % dd if=/dev/da0 of=/dev/null bs=65536
>> dd: /dev/da0: Input/output error
>> 191732+0 records in
>> 191732+0 records out
>> 12565348352 bytes transferred in 429.999272 secs (29221790 bytes/sec)
>>
>> % dd if=/dev/da0 of=/dev/null bs=16384
>> dd: /dev/da0: Input/output error
>> 126427+0 records in
>> 126427+0 records out
>> 2071379968 bytes transferred in 169.431766 secs (12225452 bytes/sec)
>>
>> # dd if=/dev/random of=/dev/da0s3 bs=65536
>> ^C329378+0 records in
>> 329377+0 records out
>> 21586051072 bytes transferred in 1003.020293 secs (21521051 bytes/sec)
>>
>> # dd if=/dev/random of=/dev/da0s3 bs=16384
>> ^C679571+0 records in
>> 679571+0 records out
>> 11134091264 bytes transferred in 690.135793 secs (16133189 bytes/sec)
>>
>> This is what I get in /var/log/messages when the I/O error occurs:
>> (da0:umass-sim0:0:0:0): AutoSense failed
>>
>> However, I experience no lockup. Maybe this situation is not handled
>> correctly at another level ?
>
> I haven't looked into the code of CAM or GEOM that much so I won't say too
> much about that. I believe the USB/umass is not to blame. What you could do
> is to add a conditional error printout in "umass_t_bbb_status_callback()" in
> /sys/dev/usb/storage/umass.c when the error happens. If that error is not a
> USB transport error, then we are most likely seeing a SCSI issue in layers
> above umass. Or if you have access to USB analyser use that. There is now
> also the option to trace USB from the kernel itself, but the feature is in
> its early development.

The panics I was able to catch/inspect (latest from add_to_worklist() /
ffs_softdep.c) indicated they were thrown by ffs/softupdates code,
therefore I tried disabling softupdates. The system doesn't panic anymore.
The operations on the USB HDD still stop, but after several tens of
seconds the system logs the 'autosense failed' error, a bunch of write
errors, and the copy operation resumes. md5 shows the copied files are
identical to the source files.

In 7.x I don't recall having any kind of errors, or any temporary stalls
in disk operations, so I'm guessing the 'autosense failed' situation is
handled differently in 8.x, compared to 7.x.

Claudiu.
___
freebsd-usb@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-usb
To unsubscribe, send any mail to "freebsd-usb-unsubscr...@freebsd.org"
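The md5 comparison mentioned in this message can be scripted so that a copied file is checked against its source in one step. The following is only a minimal sketch: it assumes md5(1) is available as on FreeBSD and falls back to md5sum(1) elsewhere, and the function names file_md5 and same_md5 are made up for illustration.

```sh
#!/bin/sh
# Print the MD5 digest of a file, using whichever tool is installed.
# Assumption: FreeBSD ships md5(1); many Linux systems ship md5sum(1) instead.
file_md5() {
    if command -v md5 >/dev/null 2>&1; then
        md5 -q "$1"
    else
        md5sum "$1" | awk '{ print $1 }'
    fi
}

# Exit status 0 only if both files are readable and have the same digest.
# Returns 2 if either file cannot be read, so a missing file never
# counts as "identical".
same_md5() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    [ "$(file_md5 "$1")" = "$(file_md5 "$2")" ]
}
```

For example, `same_md5 /path/to/source /path/to/copy && echo identical`, where both paths are placeholders for a source file and its copy on the USB disk.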
Re: System lockups caused by USB external HDD
On Monday 24 January 2011 12:08:47 CDP wrote:
> On 01/24/11 11:34, Hans Petter Selasky wrote:
> > On Monday 24 January 2011 10:00:53 CDP wrote:
> >> On 01/24/11 01:56, Daniel O'Connor wrote:
> >>> On 24/01/2011, at 9:10, CDP wrote:
> >>>> g_vfs_done():da0s2[WRITE(offset=, length=16384)]error = 5
> >>>> [several more lines similar to the above]
> >>>> panic: softdep_move_dependencies: need merge code
> >>>> cpuid = 0
> >>>> KDB: stack backtrace:
> >>>> #0 0x... at kdb_backtrace+0x5e
> >>>> #1 0x... at panic+0x182
> >>>
> >>> It looks like the disk is dying, or the FS is corrupt (the former might
> >>> cause the later).
> >>>
> >>> Can you run smartctl on the disk? Unfortunately a lot of enclosures
> >>> reject SMART commands so you might not be able to :(
> >>
> >> I have attached the output of smartctl -d sat -a /dev/da0. I didn't yet
> >> run a SMART long test for the simple reason that the disk is going into
> >> sleep mode and interrupts it. Haven't bothered to keep it alive for a
> >> long test but I might just do that.
> >>
> >> Although, I doubt it's a disk failure, since I do backups on it without
> >> problems by using FreeBSD 7.3, on the same space where FreeBSD 8.x
> >> fails. And I am talking about over 150GB of data in one run, while
> >> 8.2-RC2 crashes after 5-10GB. I have experienced disk failure in the
> >> past, on SATA, and a few read/write errors never caused a system lockup.
> >>
> >> My feeling is that enough traffic on USB causes the problem, and that
> >> this problem is only present in the new USB stack.
> >> Unfortunately downgrading to 7.x is not an option because there are
> >> things that won't work on this notebook.
> >
> > If you run a simple test like this:
> >
> > dd if=/dev/da0 of=/dev/null bs=65536
> > dd if=/dev/da0 of=/dev/null bs=16384
> >
> > Do you then see any errors?
> >
> > Do you have a spare USB memory stick which you could run similar write
> > tests on?
>
> Both reads fail with I/O error, while writes to an unused partition seem
> to be fine (I interrupted the writes after a while):
>
> % dd if=/dev/da0 of=/dev/null bs=65536
> dd: /dev/da0: Input/output error
> 191732+0 records in
> 191732+0 records out
> 12565348352 bytes transferred in 429.999272 secs (29221790 bytes/sec)
>
> % dd if=/dev/da0 of=/dev/null bs=16384
> dd: /dev/da0: Input/output error
> 126427+0 records in
> 126427+0 records out
> 2071379968 bytes transferred in 169.431766 secs (12225452 bytes/sec)
>
> # dd if=/dev/random of=/dev/da0s3 bs=65536
> ^C329378+0 records in
> 329377+0 records out
> 21586051072 bytes transferred in 1003.020293 secs (21521051 bytes/sec)
>
> # dd if=/dev/random of=/dev/da0s3 bs=16384
> ^C679571+0 records in
> 679571+0 records out
> 11134091264 bytes transferred in 690.135793 secs (16133189 bytes/sec)
>
> This is what I get in /var/log/messages when the I/O error occurs:
> (da0:umass-sim0:0:0:0): AutoSense failed
>
> However, I experience no lockup. Maybe this situation is not handled
> correctly at another level ?

I haven't looked into the code of CAM or GEOM that much, so I won't say too
much about that. I believe USB/umass is not to blame. What you could do is
add a conditional error printout in "umass_t_bbb_status_callback()" in
/sys/dev/usb/storage/umass.c when the error happens. If that error is not a
USB transport error, then we are most likely seeing a SCSI issue in the
layers above umass. Or, if you have access to a USB analyser, use that. There
is now also the option to trace USB from the kernel itself, but the feature
is in its early development.

--HPS
Re: System lockups caused by USB external HDD
On 01/24/11 11:34, Hans Petter Selasky wrote:
> On Monday 24 January 2011 10:00:53 CDP wrote:
>> On 01/24/11 01:56, Daniel O'Connor wrote:
>>> On 24/01/2011, at 9:10, CDP wrote:
>>>> g_vfs_done():da0s2[WRITE(offset=, length=16384)]error = 5
>>>> [several more lines similar to the above]
>>>> panic: softdep_move_dependencies: need merge code
>>>> cpuid = 0
>>>> KDB: stack backtrace:
>>>> #0 0x... at kdb_backtrace+0x5e
>>>> #1 0x... at panic+0x182
>>>
>>> It looks like the disk is dying, or the FS is corrupt (the former might
>>> cause the later).
>>>
>>> Can you run smartctl on the disk? Unfortunately a lot of enclosures
>>> reject SMART commands so you might not be able to :(
>>
>> I have attached the output of smartctl -d sat -a /dev/da0. I didn't yet
>> run a SMART long test for the simple reason that the disk is going into
>> sleep mode and interrupts it. Haven't bothered to keep it alive for a
>> long test but I might just do that.
>>
>> Although, I doubt it's a disk failure, since I do backups on it without
>> problems by using FreeBSD 7.3, on the same space where FreeBSD 8.x
>> fails. And I am talking about over 150GB of data in one run, while
>> 8.2-RC2 crashes after 5-10GB. I have experienced disk failure in the
>> past, on SATA, and a few read/write errors never caused a system lockup.
>>
>> My feeling is that enough traffic on USB causes the problem, and that
>> this problem is only present in the new USB stack.
>> Unfortunately downgrading to 7.x is not an option because there are
>> things that won't work on this notebook.
>
> If you run a simple test like this:
>
> dd if=/dev/da0 of=/dev/null bs=65536
> dd if=/dev/da0 of=/dev/null bs=16384
>
> Do you then see any errors?
>
> Do you have a spare USB memory stick which you could run similar write
> tests on?

Both reads fail with I/O error, while writes to an unused partition seem
to be fine (I interrupted the writes after a while):

% dd if=/dev/da0 of=/dev/null bs=65536
dd: /dev/da0: Input/output error
191732+0 records in
191732+0 records out
12565348352 bytes transferred in 429.999272 secs (29221790 bytes/sec)

% dd if=/dev/da0 of=/dev/null bs=16384
dd: /dev/da0: Input/output error
126427+0 records in
126427+0 records out
2071379968 bytes transferred in 169.431766 secs (12225452 bytes/sec)

# dd if=/dev/random of=/dev/da0s3 bs=65536
^C329378+0 records in
329377+0 records out
21586051072 bytes transferred in 1003.020293 secs (21521051 bytes/sec)

# dd if=/dev/random of=/dev/da0s3 bs=16384
^C679571+0 records in
679571+0 records out
11134091264 bytes transferred in 690.135793 secs (16133189 bytes/sec)

This is what I get in /var/log/messages when the I/O error occurs:
(da0:umass-sim0:0:0:0): AutoSense failed

However, I experience no lockup. Maybe this situation is not handled
correctly at another level ?

I've done the read test with a 4GB memory stick and it passed. I'll do
the read tests with another HDD later today, but I expect to get the
same error, since on file copying it behaves in the same way.

Claudiu.
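The record counts dd printed can be cross-checked against its byte totals, which also gives the device offset at which each read stopped. This is a small sketch rather than anything from the thread: the helper name bytes_read is invented, and it assumes the shell does 64-bit arithmetic, as most current sh implementations do.

```sh
#!/bin/sh
# Bytes covered by N full dd records of a given block size.
# $1 = number of full records, $2 = block size in bytes.
bytes_read() {
    echo $(( $1 * $2 ))
}
```

`bytes_read 191732 65536` prints 12565348352 and `bytes_read 126427 16384` prints 2071379968, matching the totals dd reported for the two reads. The two runs therefore stopped at different offsets on the disk (about 12.6 GB and about 2.1 GB).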
Re: System lockups caused by USB external HDD
On Monday 24 January 2011 10:00:53 CDP wrote:
> On 01/24/11 01:56, Daniel O'Connor wrote:
> > On 24/01/2011, at 9:10, CDP wrote:
> >> g_vfs_done():da0s2[WRITE(offset=, length=16384)]error = 5
> >> [several more lines similar to the above]
> >> panic: softdep_move_dependencies: need merge code
> >> cpuid = 0
> >> KDB: stack backtrace:
> >> #0 0x... at kdb_backtrace+0x5e
> >> #1 0x... at panic+0x182
> >
> > It looks like the disk is dying, or the FS is corrupt (the former might
> > cause the later).
> >
> > Can you run smartctl on the disk? Unfortunately a lot of enclosures
> > reject SMART commands so you might not be able to :(
>
> I have attached the output of smartctl -d sat -a /dev/da0. I didn't yet
> run a SMART long test for the simple reason that the disk is going into
> sleep mode and interrupts it. Haven't bothered to keep it alive for a
> long test but I might just do that.
>
> Although, I doubt it's a disk failure, since I do backups on it without
> problems by using FreeBSD 7.3, on the same space where FreeBSD 8.x
> fails. And I am talking about over 150GB of data in one run, while
> 8.2-RC2 crashes after 5-10GB. I have experienced disk failure in the
> past, on SATA, and a few read/write errors never caused a system lockup.
>
> My feeling is that enough traffic on USB causes the problem, and that
> this problem is only present in the new USB stack.
> Unfortunately downgrading to 7.x is not an option because there are
> things that won't work on this notebook.

If you run a simple test like this:

dd if=/dev/da0 of=/dev/null bs=65536
dd if=/dev/da0 of=/dev/null bs=16384

Do you then see any errors?

Do you have a spare USB memory stick which you could run similar write
tests on?

--HPS
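The read test suggested here can be wrapped in a small loop so the same device is read once per block size and the failures are summarised. This is a hedged sketch only: read_test and read_all are illustrative names, not part of any FreeBSD tool, and the dd statistics are discarded in the summary for brevity.

```sh
#!/bin/sh
# Read a device (or file) to the end with one block size.
# dd writes its statistics and any error message to stderr.
# Returns dd's exit status, which is non-zero on an I/O error.
read_test() {
    dd if="$1" of=/dev/null bs="$2"
}

# Run read_test once per block size and print one
# "bs=<size> ok" or "bs=<size> FAILED" line for each.
# Returns 1 if any read failed, 0 otherwise.
read_all() {
    dev=$1
    shift
    rc=0
    for bs in "$@"; do
        # dd's own output is suppressed here; drop the redirect to see it.
        if read_test "$dev" "$bs" 2>/dev/null; then
            echo "bs=$bs ok"
        else
            echo "bs=$bs FAILED"
            rc=1
        fi
    done
    return $rc
}
```

With the device from this thread the call would be `read_all /dev/da0 65536 16384`; a full read of a 1 TB disk takes hours at the transfer rates seen on USB 2.0.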
Re: System lockups caused by USB external HDD
On 01/24/11 01:56, Daniel O'Connor wrote:
>
> On 24/01/2011, at 9:10, CDP wrote:
>> g_vfs_done():da0s2[WRITE(offset=, length=16384)]error = 5
>> [several more lines similar to the above]
>> panic: softdep_move_dependencies: need merge code
>> cpuid = 0
>> KDB: stack backtrace:
>> #0 0x... at kdb_backtrace+0x5e
>> #1 0x... at panic+0x182
>
> It looks like the disk is dying, or the FS is corrupt (the former might
> cause the later).
>
> Can you run smartctl on the disk? Unfortunately a lot of enclosures
> reject SMART commands so you might not be able to :(

I have attached the output of smartctl -d sat -a /dev/da0. I didn't yet
run a SMART long test for the simple reason that the disk is going into
sleep mode and interrupts it. Haven't bothered to keep it alive for a
long test but I might just do that.

Although, I doubt it's a disk failure, since I do backups on it without
problems by using FreeBSD 7.3, on the same space where FreeBSD 8.x
fails. And I am talking about over 150GB of data in one run, while
8.2-RC2 crashes after 5-10GB. I have experienced disk failure in the
past, on SATA, and a few read/write errors never caused a system lockup.

My feeling is that enough traffic on USB causes the problem, and that
this problem is only present in the new USB stack.
Unfortunately downgrading to 7.x is not an option because there are
things that won't work on this notebook.

Regards,
Claudiu.
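One way to keep the disk from going to sleep during a SMART long test might be to read a single sector at regular intervals while the test runs. The sketch below is an assumption, not something tried in this thread: keep_awake is an invented helper, and whether periodic reads actually stop this enclosure from spinning down, or are tolerated by the drive during the self-test, would need to be verified.

```sh
#!/bin/sh
# Read one 512-byte sector from the device every <interval> seconds,
# <count> times, stepping the offset each time so the reads do not
# all hit the same sector. Prints the number of successful reads.
# $1 = device or file, $2 = interval in seconds, $3 = number of reads.
keep_awake() {
    dev=$1
    interval=$2
    count=$3
    i=0
    ok=0
    while [ "$i" -lt "$count" ]; do
        if dd if="$dev" of=/dev/null bs=512 skip="$i" count=1 2>/dev/null; then
            ok=$((ok + 1))
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    echo "$ok"
}
```

A possible use is to start the test with `smartctl -d sat -t long /dev/da0` and then run `keep_awake /dev/da0 60 240`, which reads once a minute for 240 minutes and so covers the 230-minute extended self-test polling time reported in the smartctl output.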
smartctl 5.40 2010-10-16 r3189 [FreeBSD 8.2-RC2 amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green family
Device Model:     WDC WD10EADS-00M2B0
Serial Number:    WD-WCAV53634762
Firmware Version: 01.00A01
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Jan 24 10:39:01 2011 EET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (19980) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 230) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3037) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   112   111   021    Pre-fail  Always       -       7366
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       144
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       146
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   A
Re: System lockups caused by USB external HDD
On 24/01/2011, at 9:10, CDP wrote:
> g_vfs_done():da0s2[WRITE(offset=, length=16384)]error = 5
> [several more lines similar to the above]
> panic: softdep_move_dependencies: need merge code
> cpuid = 0
> KDB: stack backtrace:
> #0 0x... at kdb_backtrace+0x5e
> #1 0x... at panic+0x182

It looks like the disk is dying, or the FS is corrupt (the former might
cause the latter).

Can you run smartctl on the disk? Unfortunately a lot of enclosures
reject SMART commands so you might not be able to :(

--
Daniel O'Connor software and network engineer
for Genesis Software - http://www.gsoft.com.au
"The nice thing about standards is that there
are so many of them to choose from."
  -- Andrew Tanenbaum
GPG Fingerprint - 5596 B766 97C0 0E94 4347 295E E593 DC20 7B3F CE8C