Re: [zfs-discuss] Disk failure chokes all the disks attached to the failing disk HBA

2012-05-31 Thread Antonio S. Cofiño

Jim,

Thank you for the explanation. I have 'discovered' that this is a typical 
situation that makes the system unstable.


Just out of curiosity: this morning it happened again. Below you can check 
the log output. This time it was an HBA with an LSI 1068E chip (mpt driver); 
the previous one was an LSI 2008 (mpt_sas driver).


In this case ZFS 'discovered' the error and was able to self-heal, and the 
system is working smoothly.
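
For the record, a few standard commands one could use to confirm that ZFS 
caught and repaired the errors (device and pool names will of course differ):

   zpool status -xv    # pools with errors, per-vdev error counts, affected files
   fmdump -e           # FMA error telemetry for the underlying transport events
   iostat -En          # per-device soft/hard/transport error counters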



Antonio

May 31 10:48:11 seal.macc.unican.es scsi: [ID 243001 kern.warning] 
WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:11 seal.macc.unican.es mpt_handle_event_sync: 
IOCStatus=0x8000, IOCLogInfo=0x31123000
May 31 10:48:11 seal.macc.unican.es scsi: [ID 243001 kern.warning] 
WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:11 seal.macc.unican.es mpt_handle_event: 
IOCStatus=0x8000, IOCLogInfo=0x31123000
May 31 10:48:13 seal.macc.unican.es scsi: [ID 365881 kern.info] 
/pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:13 seal.macc.unican.es Log info 0x31123000 received for 
target 12.
May 31 10:48:13 seal.macc.unican.es scsi_status=0x0, 
ioc_status=0x804b, scsi_state=0xc
May 31 10:48:13 seal.macc.unican.es scsi: [ID 365881 kern.info] 
/pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:13 seal.macc.unican.es Log info 0x31123000 received for 
target 12.
May 31 10:48:13 seal.macc.unican.es scsi_status=0x0, 
ioc_status=0x804b, scsi_state=0xc
May 31 10:48:13 seal.macc.unican.es scsi: [ID 365881 kern.info] 
/pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:13 seal.macc.unican.es Log info 0x31123000 received for 
target 12.
May 31 10:48:13 seal.macc.unican.es scsi_status=0x0, 
ioc_status=0x804b, scsi_state=0xc
May 31 10:48:13 seal.macc.unican.es scsi: [ID 365881 kern.info] 
/pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:13 seal.macc.unican.es Log info 0x31123000 received for 
target 12.
May 31 10:48:13 seal.macc.unican.es scsi_status=0x0, 
ioc_status=0x804b, scsi_state=0xc
May 31 10:48:16 seal.macc.unican.es scsi: [ID 243001 kern.warning] 
WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:16 seal.macc.unican.es mpt_handle_event_sync: 
IOCStatus=0x8000, IOCLogInfo=0x3000
May 31 10:48:16 seal.macc.unican.es scsi: [ID 243001 kern.warning] 
WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:16 seal.macc.unican.es mpt_handle_event: 
IOCStatus=0x8000, IOCLogInfo=0x3000
May 31 10:48:16 seal.macc.unican.es scsi: [ID 243001 kern.warning] 
WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:16 seal.macc.unican.es mpt_handle_event_sync: 
IOCStatus=0x8000, IOCLogInfo=0x31112000
May 31 10:48:16 seal.macc.unican.es scsi: [ID 243001 kern.warning] 
WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:16 seal.macc.unican.es mpt_handle_event: 
IOCStatus=0x8000, IOCLogInfo=0x31112000
May 31 10:48:17 seal.macc.unican.es scsi: [ID 365881 kern.info] 
/pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:17 seal.macc.unican.es Log info 0x3000 received for 
target 12.
May 31 10:48:17 seal.macc.unican.es scsi_status=0x0, 
ioc_status=0x804b, scsi_state=0xc
May 31 10:48:20 seal.macc.unican.es scsi: [ID 243001 kern.warning] 
WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:20 seal.macc.unican.es SAS Discovery Error on port 0. 
DiscoveryStatus is DiscoveryStatus is |Unaddressable device found|
May 31 10:48:22 seal.macc.unican.es scsi: [ID 243001 kern.warning] 
WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:22 seal.macc.unican.es mpt_handle_event_sync: 
IOCStatus=0x8000, IOCLogInfo=0x31123000
May 31 10:48:22 seal.macc.unican.es scsi: [ID 243001 kern.warning] 
WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:22 seal.macc.unican.es mpt_handle_event: 
IOCStatus=0x8000, IOCLogInfo=0x31123000
May 31 10:48:27 seal.macc.unican.es scsi: [ID 243001 kern.warning] 
WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:27 seal.macc.unican.es mpt_handle_event_sync: 
IOCStatus=0x8000, IOCLogInfo=0x3000
May 31 10:48:27 seal.macc.unican.es scsi: [ID 243001 kern.warning] 
WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:27 seal.macc.unican.es mpt_handle_event: 
IOCStatus=0x8000, IOCLogInfo=0x3000
May 31 10:48:27 seal.macc.unican.es scsi: [ID 243001 kern.warning] 
WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:27 seal.macc.unican.es mpt_handle_event_sync: 
IOCStatus=0x8000, IOCLogInfo=0x31112000
May 31 10:48:27 seal.macc.unican.es scsi: [ID 243001 kern.warning] 
WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:27 seal.macc.unican.es mpt_handle_event: 
IOCStatus=0x8000, IOCLogInfo=0x31112000
May 31 10:48:28 seal.macc.unican.es scsi: [ID 365881 kern.info] 
/pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:28 

Re: [zfs-discuss] Disk failure chokes all the disks attached to the failing disk HBA

2012-05-31 Thread Antonio S. Cofiño

Markus,

After Jim's answer I have started to read about this well-known issue.

> Is it just mpt causing the errors or also mpt_sas?

Both drivers are causing the reset storm (see my answer to Jim's e-mail).


> General consensus from various people: don't use SATA drives on SAS back-
> planes. Some SATA drives might work better, but there seems to be no
> guarantee. And even for SAS-SAS, try to avoid SAS1 backplanes.


In Paul Kraus's answer he mentions that Oracle support says (among 
other things):

> 4. the problem happens with SAS as well as SATA drives, but is much
> less frequent




That means that using SAS drives will reduce the probability of the 
issue, but no guarantee exists.




> General consensus from various people: don't use SATA drives on SAS back-
> planes. Some SATA drives might work better, but there seems to be no
> guarantee. And even for SAS-SAS, try to avoid SAS1 backplanes.


Yes, maybe the 'general consensus' is right, but 'general consensus' also 
told me to use hardware-based RAID solutions. Instead, I started doing 'risky 
business' (as some vendors told me) with ZFS, and I have ended up 
discovering how robust ZFS is against this kind of protocol error.


From my completely naive point of view it appears to be more an issue with 
the HBA's firmware than an issue with the SATA drives.
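
For what it's worth, one way to record the HBA firmware levels for comparison 
or for a support case (assuming LSI's host utilities are installed; these tool 
names are the usual ones, not something taken from this thread):

   sas2flash -listall   # SAS2008-based HBAs (SAS9200-8e): firmware/BIOS versions
   lsiutil              # 1068-based HBAs (3081E-R): interactive menu reports firmware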


With your answers I have done a lot of research, which is helping me to learn 
new things.


More comments and help are welcome (from some SAS expert, perhaps?).

Antonio

--
Antonio S. Cofiño


On 31/05/2012 18:04, Weber, Markus wrote:

Antonio S. Cofiño wrote:

[...]
The system is a Supermicro motherboard X8DTH-6F in a 4U chassis
(SC847E1-R1400LPB) and an external SAS2 JBOD (SC847E16-RJBOD1).
That makes a system with a total of 4 backplanes (2x SAS + 2x SAS2),
each of them connected to a different HBA (2x LSI 3081E-R (1068
chip) + 2x LSI SAS9200-8e (2008 chip)).
The system has a total of 81 disks (2x SAS (SEAGATE ST3146356SS)
+ 34 SATA3 (Hitachi HDS722020ALA330) + 45 SATA6 (Hitachi HDS723020BLA642)).
The issue arises when one of the disks starts to fail, making accesses
that take a long time. After some time (minutes, but I'm not sure) all
the disks connected to the same HBA start to report errors. This situation
produces a general failure in ZFS, making the whole pool unavailable.
[...]


Have been there and gave up at the end[1]. Could reproduce it (even though
it took a bit longer) under most Linux versions (incl. using the latest LSI
drivers) with an LSI 3081E-R HBA.

Is it just mpt causing the errors or also mpt_sas?

In a lab environment the LSI 9200 HBA behaved better: I/O only dropped
briefly and then continued on the other disks without generating errors.

Had a lengthy Oracle case on this, but none of the proposed workarounds
worked for me at all. They had been (some also from other forums; a sketch
of where these settings live follows the list):

- disabling NCQ
- adding allow-bus-device-reset=0; to /kernel/drv/sd.conf
- set zfs:zfs_vdev_max_pending=1
- set mpt:mpt_enable_msi=0
- keeping usage below 90%
- no FM services running, and temporarily doing fmadm unload disk-transport
  and stopping other disk-access stuff (smartd?)
- trying to change retries/timeouts via sd.conf for the disks, without any
  success, and ending up doing it via mdb
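
For reference, a sketch of where the settings above normally go (the values
are simply the ones listed; /etc/system changes need a reboot to take effect):

   # /etc/system
   set zfs:zfs_vdev_max_pending=1
   set mpt:mpt_enable_msi=0

   # /kernel/drv/sd.conf
   allow-bus-device-reset=0;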

In the end I knew the bad sector of the bad disk, and by simply dd'ing this
sector once or twice to /dev/zero I could easily bring down the system/pool
without putting any load on the disk system.
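
For concreteness, a minimal sketch of that kind of single-sector read (the
device path and sector number here are made up for illustration; /dev/null is
used as the sink, though /dev/zero as written above serves the same purpose):

   dd if=/dev/rdsk/c8t12d0s0 of=/dev/null bs=512 skip=123456789 count=1

In Markus's case, a single slow read of a bad sector like this was apparently
enough to start the timeout/reset cascade on the shared HBA.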


General consensus from various people: don't use SATA drives on SAS back-
planes. Some SATA drives might work better, but there seems to be no
guarantee. And even for SAS-SAS, try to avoid SAS1 backplanes.

Markus



[1] Search for "What's wrong with LSI 3081 (1068) + expander + (bad) SATA
    disk?"



Re: [zfs-discuss] Disk failure chokes all the disks attached to the failing disk HBA

2012-05-31 Thread Richard Elling
On May 31, 2012, at 9:45 AM, Antonio S. Cofiño wrote:

> Markus,
>
> After Jim's answer I have started to read about this well-known issue.
>
> Is it just mpt causing the errors or also mpt_sas?
>
> Both drivers are causing the reset storm (see my answer to Jim's e-mail).

No. Resets are corrective actions that occur because of command timeouts. 
The cause is the command timeout.
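
To put the mechanism in concrete terms: the Solaris sd target driver gives
each command sd_io_time seconds (60 by default) and retries it before resets
are escalated, so one drive that responds slowly can stall its HBA for
minutes. A rough sketch of shortening that window for testing (illustrative
only; the value 10 is arbitrary, not a recommendation):

   # persistent, read at the next boot: add to /etc/system
   set sd:sd_io_time=10

   # live change on a running kernel (use with care)
   echo 'sd_io_time/W 0t10' | mdb -kw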

> General consensus from various people: don't use SATA drives on SAS back-
> planes. Some SATA drives might work better, but there seems to be no
> guarantee. And even for SAS-SAS, try to avoid SAS1 backplanes.
>
> In Paul Kraus's answer he mentions that Oracle support says (among other
> things): "4. the problem happens with SAS as well as SATA drives, but is
> much less frequent"
>
> That means that using SAS drives will reduce the probability of the issue,
> but no guarantee exists.

I have seen broken SAS drives crush expanders/HBAs such that POST would not run.
Obviously, at this point there is no OS running, so we can't blame the OS or 
drivers.

I have a few SATA disks that have the same effect on motherboards. I call them 
the drives of doom :-)

> General consensus from various people: don't use SATA drives on SAS back-
> planes. Some SATA drives might work better, but there seems to be no
> guarantee. And even for SAS-SAS, try to avoid SAS1 backplanes.
>
> Yes, maybe the 'general consensus' is right, but 'general consensus' also
> told me to use hardware-based RAID solutions. Instead, I started doing
> 'risky business' (as some vendors told me) with ZFS, and I have ended up
> discovering how robust ZFS is against this kind of protocol error.

Hardware RAID solutions are also susceptible to these failure modes.

> From my completely naive point of view it appears to be more an issue with
> the HBA's firmware than an issue with the SATA drives.

There are multiple contributors, but perhaps the most difficult to overcome
are the fundamental differences between the SAS and SATA protocols. Let's just
agree that SATA was not designed for network-like fabrics.
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422









Re: [zfs-discuss] Disk failure chokes all the disks attached to the failing disk HBA

2012-05-31 Thread Hung-Sheng Tsao Ph.D.

Just FYI, this is from Intel:
http://www.intel.com/support/motherboards/server/sb/CS-031831.htm

Another observation:
Oracle/Sun has moved away from SATA to SAS in its ZFS storage appliances.

If you want to go deeper, take a look at these presentations
http://www.scsita.org/sas_library/tutorials/
and the other presentations on the site.

Regards




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss