[Bug 1042369] [NEW] SCSI bus errors with 3TB HDDs + data corruption

Sean Clarke Mon, 27 Aug 2012 11:46:04 -0700

Public bug reported:

Hi,

I am building a NAS unit with 6x Seagate 3TB HDDs. When using the drives I
get a flood of errors in the kernel log.


I have replaced the motherboard and even upgraded to 12.10 'Quantal
Quetzal' and still the problem remains.


Aug 25 19:06:58 enterprise kernel: [  595.548983] ata7.00: error: { ICRC ABRT }
Aug 25 19:06:58 enterprise kernel: [  595.549359] ata7: hard resetting link
Aug 25 19:06:58 enterprise kernel: [  595.629945] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 310)
Aug 25 19:06:58 enterprise kernel: [  595.769862] ata8.00: configured for UDMA/133
Aug 25 19:06:58 enterprise kernel: [  595.769889] sd 7:0:0:0: [sdc]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Aug 25 19:06:58 enterprise kernel: [  595.769893] sd 7:0:0:0: [sdc]  Sense Key : Aborted Command [current] [descriptor]
Aug 25 19:06:58 enterprise kernel: [  595.769898] Descriptor sense data with sense descriptors (in hex):
Aug 25 19:06:58 enterprise kernel: [  595.769900]         72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 01
Aug 25 19:06:58 enterprise kernel: [  595.769910]         5d 50 9f b0
Aug 25 19:06:58 enterprise kernel: [  595.769914] sd 7:0:0:0: [sdc]  Add. Sense: Scsi parity error
Aug 25 19:06:58 enterprise kernel: [  595.769918] sd 7:0:0:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 5d 50 9f b0 00 00 03 a8 00 00
Aug 25 19:06:58 enterprise kernel: [  595.769930] end_request: I/O error, dev sdc, sector 5860532144
Aug 25 19:06:58 enterprise kernel: [  595.770250] quiet_error: 502 callbacks suppressed
Aug 25 19:06:58 enterprise kernel: [  595.770253] Buffer I/O error on device sdc, logical block 732566518
Aug 25 19:06:58 enterprise kernel: [  595.770567] lost page write due to I/O error on sdc
Aug 25 19:06:58 enterprise kernel: [  595.770571] Buffer I/O error on device sdc, logical block 732566519
Aug 25 19:06:58 enterprise kernel: [  595.770874] lost page write due to I/O error on sdc
Aug 25 19:06:58 enterprise kernel: [  595.770877] Buffer I/O error on device sdc, logical block 732566520
Aug 25 19:06:58 enterprise kernel: [  595.771193] lost page write due to I/O error on sdc
Aug 25 19:06:58 enterprise kernel: [  595.771196] Buffer I/O error on device sdc, logical block 732566521
Aug 25 19:06:58 enterprise kernel: [  595.771556] lost page write due to I/O error on sdc
Aug 25 19:06:58 enterprise kernel: [  595.771559] Buffer I/O error on device sdc, logical block 732566522
Aug 25 19:06:58 enterprise kernel: [  595.771910] lost page write due to I/O error on sdc
Aug 25 19:06:58 enterprise kernel: [  595.771913] Buffer I/O error on device sdc, logical block 732566523
Aug 25 19:06:58 enterprise kernel: [  595.772260] lost page write due to I/O error on sdc

Aug 25 19:06:58 enterprise kernel: [  595.773664] ata8: EH complete
Aug 25 19:06:58 enterprise kernel: [  595.794893] ata8.00: exception Emask 0x0 SAct 0x7ff SErr 0x0 action 0x6
Aug 25 19:06:58 enterprise kernel: [  595.795185] ata8.00: irq_stat 0x40000008
Aug 25 19:06:58 enterprise kernel: [  595.795464] ata8.00: failed command: WRITE FPDMA QUEUED
Aug 25 19:06:58 enterprise kernel: [  595.795742] ata8.00: cmd 61/00:00:00:0c:00/04:00:00:00:00/40 tag 0 ncq 524288 out
Aug 25 19:06:58 enterprise kernel: [  595.795743]          res 41/84:00:00:0c:00/00:04:00:00:00/00 Emask 0x410 (ATA bus error) <F>
Aug 25 19:06:58 enterprise kernel: [  595.796292] ata8.00: status: { DRDY ERR }
Aug 25 19:06:58 enterprise kernel: [  595.796581] ata8.00: error: { ICRC ABRT }
Aug 25 19:06:58 enterprise kernel: [  595.796894] ata8: hard resetting link
Aug 25 19:06:58 enterprise kernel: [  595.873861] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 310)
Aug 25 19:06:58 enterprise kernel: [  596.017488] ata7.00: configured for UDMA/33
Aug 25 19:06:58 enterprise kernel: [  596.017502] ata7: EH complete
Aug 25 19:06:58 enterprise kernel: [  596.038766] ata7.00: exception Emask 0x0 SAct 0x3f SErr 0x0 action 0x6
Aug 25 19:06:58 enterprise kernel: [  596.039055] ata7.00: irq_stat 0x40000008
Aug 25 19:06:58 enterprise kernel: [  596.039334] ata7.00: failed command: WRITE FPDMA QUEUED
Aug 25 19:06:58 enterprise kernel: [  596.039614] ata7.00: cmd 61/00:00:b0:97:50/04:00:5d:01:00/40 tag 0 ncq 524288 out
Aug 25 19:06:58 enterprise kernel: [  596.039616]          res 41/84:00:b0:97:50/00:04:5d:01:00/00 Emask 0x410 (ATA bus error) <F>
Aug 25 19:06:58 enterprise kernel: [  596.040165] ata7.00: status: { DRDY ERR }
Aug 25 19:06:58 enterprise kernel: [  596.040459] ata7.00: error: { ICRC ABRT }
Aug 25 19:06:58 enterprise kernel: [  596.040774] ata7: hard resetting link
Aug 25 19:06:59 enterprise kernel: [  596.121778] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 310)
Aug 25 19:06:59 enterprise kernel: [  596.261840] ata8.00: configured for UDMA/133
Aug 25 19:06:59 enterprise kernel: [  596.261866] sd 7:0:0:0: [sdc]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Aug 25 19:06:59 enterprise kernel: [  596.261871] sd 7:0:0:0: [sdc]  Sense Key : Aborted Command [current] [descriptor]
Aug 25 19:06:59 enterprise kernel: [  596.261875] Descriptor sense data with sense descriptors (in hex):
Aug 25 19:06:59 enterprise kernel: [  596.261877]         72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Aug 25 19:06:59 enterprise kernel: [  596.261887]         00 00 0c 00
Aug 25 19:06:59 enterprise kernel: [  596.261891] sd 7:0:0:0: [sdc]  Add. Sense: Scsi parity error
Aug 25 19:06:59 enterprise kernel: [  596.261895] sd 7:0:0:0: [sdc] CDB: Write(10): 2a 00 00 00 0c 00 00 04 00 00
Aug 25 19:06:59 enterprise kernel: [  596.261904] end_request: I/O error, dev sdc, sector 3072
Aug 25 19:06:59 enterprise kernel: [  596.262275] ata8: EH complete
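
As a cross-check on the log above, the sector numbers in the end_request
lines can be recovered from the CDB bytes that the kernel prints. The
following is only a minimal Python sketch: the helper name
decode_write_cdb is made up for illustration, it handles just the two
opcodes seen in this log (0x8a Write(16) and 0x2a Write(10)), and the
4096-byte buffer block size used for the "logical block" figure is an
assumption that happens to fit the numbers logged here.

```python
def decode_write_cdb(cdb_hex: str) -> tuple[int, int]:
    """Return (lba, transfer_length_in_sectors) for a WRITE(10)/WRITE(16) CDB.

    cdb_hex is the space-separated hex string as printed in the kernel log.
    """
    cdb = bytes.fromhex(cdb_hex)  # whitespace between byte pairs is ignored
    opcode = cdb[0]
    if opcode == 0x8A:
        # WRITE(16): 8-byte big-endian LBA in bytes 2-9, length in bytes 10-13
        return int.from_bytes(cdb[2:10], "big"), int.from_bytes(cdb[10:14], "big")
    if opcode == 0x2A:
        # WRITE(10): 4-byte big-endian LBA in bytes 2-5, length in bytes 7-8
        return int.from_bytes(cdb[2:6], "big"), int.from_bytes(cdb[7:9], "big")
    raise ValueError(f"unsupported opcode 0x{opcode:02x}")


# CDBs copied from the two failed writes in the log above
lba16, len16 = decode_write_cdb("8a 00 00 00 00 01 5d 50 9f b0 00 00 03 a8 00 00")
lba10, len10 = decode_write_cdb("2a 00 00 00 0c 00 00 04 00 00")

print(lba16, len16)  # 5860532144 936
print(lba10, len10)  # 3072 1024

# Assumption: 4096-byte buffer blocks on 512-byte logical sectors (8 per block)
SECTORS_PER_BLOCK = 4096 // 512
print(lba16 // SECTORS_PER_BLOCK)  # 732566518
```

The decoded LBAs match the "sector 5860532144" and "sector 3072" lines,
and the first one divided by 8 gives the first "logical block 732566518"
reported in the Buffer I/O errors.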


Just to recap: all 6 drives are affected at one point or another, across 2
systems (Core i7 and AMD Bulldozer 8 core) and 2 kernels (12.04 running
3.2.0-29-generic #46-Ubuntu SMP and 12.10 running 3.5.0-11-generic #11-Ubuntu
SMP). I have also tried a new PSU in case that was the cause.

I have experienced intermittent errors using parted (when writing to
disk) and both ext4 and btrfs show data errors - btrfs found over 10K
errors (and corrected them) during a scrub.

I have run the SMART quick tests and the disks report they are all good.

Here is the output from one of them:


smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.5.0-11-generic] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
Device Model:     ST3000DM001-1CH166
Serial Number:    S1F0MM27
LU WWN Device Id: 5 000c50 051759d96
Firmware Version: CC43
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Mon Aug 27 20:33:49 2012 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  575) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 333) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   115   100   006    Pre-fail  Always       -       95707744
  3 Spin_Up_Time            0x0003   091   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       60
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always       -       4294983044
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       9
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       61
183 Runtime_Bad_Block       0x0032   095   095   000    Old_age   Always       -       5
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   072   056   045    Old_age   Always       -       28 (Min/Max 26/28)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       51
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       65
194 Temperature_Celsius     0x0022   028   044   000    Old_age   Always       -       28 (0 22 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   163   000    Old_age   Always       -       69
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       79070347919370
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       232499601
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       93817799

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         6         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

====================================================================
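
With six drives to compare, it may be easier to pull individual
attributes out of the table above than to read each report by eye. The
following is a rough Python sketch, not part of smartmontools: the
function name parse_smart_attributes and the SAMPLE text are
illustrative only, and it assumes each attribute row sits on a single
line, as smartctl itself prints it. Attribute 199
(UDMA_CRC_Error_Count) is picked out because it is probably the one
most closely related to the ICRC errors in the kernel log, though that
link is an assumption here.

```python
def parse_smart_attributes(text):
    """Parse 'smartctl -A' style attribute rows into a dict keyed by ID."""
    attrs = {}
    for line in text.splitlines():
        # 10 columns; the last one (RAW_VALUE) may itself contain spaces
        fields = line.split(None, 9)
        if len(fields) == 10 and fields[0].isdigit():
            attrs[int(fields[0])] = {
                "name": fields[1],
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "raw": fields[9].strip(),
            }
    return attrs


# Three rows copied from the report above, plus the header line
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
190 Airflow_Temperature_Cel 0x0022   072   056   045    Old_age   Always       -       28 (Min/Max 26/28)
199 UDMA_CRC_Error_Count    0x003e   200   163   000    Old_age   Always       -       69
"""

attrs = parse_smart_attributes(SAMPLE)
crc = attrs[199]
print(crc["name"], crc["raw"], crc["worst"])  # UDMA_CRC_Error_Count 69 163
```

The raw value is kept as a string because some attributes, such as 190
here, carry extra text after the number.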

I can get the disks to fail almost immediately by doing a mkfs.btrfs
across the full set - at some point one or two of them flood the syslog
with errors.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1042369
