Re: [zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)

2011-12-06 Thread Jim Klimov

2011-12-05 5:15, Ryan Wehler wrote:

Well, if we want to get into theories on faulty hardware batches and such, we 
can, though I think the likelihood is slim (but not impossible, I suppose).

I did the best I could diagnostics-wise, given that I have no spare parts that 
have never been a part of this SAN. As I said, I still think two failed HBAs or 
two failed cables just doesn't add up.  The errors thrown by the cards are 
pretty consistent between cable swaps too, so nothing is really indicative of 
a bad cable, let alone two.



Well, speculation-wise, if these were nearly identical items serving
for the same time in identical conditions (same enclosure), they could
fail together just because they were subjected to the same shocks,
power surges, or perhaps more likely aging of components (e.g. drying
up of capacitors, oxidation of soldered connections, diffusion of
atoms in the microchips, whatever). Regarding soldered connections:
there was a true story some 10 years ago about Fujitsu desktop drives
dying at nearly the same age after leaving the factory (a few months
old), which was traced to higher-than-usual acidity of the solder
or its additives. Overall, the electrical links just stopped working
after a while due to oxidation spreading into the bulk of the metal blobs :)

Still, congratulations that the replacement hardware did solve the
problem! ;)

//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)

2011-12-05 Thread Ryan Wehler
Here's LSIUTIL after swapping to a 6GB backplane and dual 9211-8i cards on
a fresh boot.

Much better. :)

Adapter Phy 0:  Link Up, No Errors

Adapter Phy 1:  Link Up, No Errors

Adapter Phy 2:  Link Up, No Errors

Adapter Phy 3:  Link Up, No Errors

Adapter Phy 4:  Link Down, No Errors

Adapter Phy 5:  Link Down, No Errors

Adapter Phy 6:  Link Down, No Errors

Adapter Phy 7:  Link Down, No Errors

Expander (Handle 0009) Phy 0:  Link Down, No Errors

Expander (Handle 0009) Phy 1:  Link Down, No Errors

Expander (Handle 0009) Phy 2:  Link Down, No Errors

Expander (Handle 0009) Phy 3:  Link Down, No Errors

Expander (Handle 0009) Phy 4:  Link Up
  Invalid DWord Count   8
  Running Disparity Error Count 7
  Loss of DWord Synch Count 2
  Phy Reset Problem Count   0

Expander (Handle 0009) Phy 5:  Link Up
  Invalid DWord Count   8
  Running Disparity Error Count 6
  Loss of DWord Synch Count 2
  Phy Reset Problem Count   0

Expander (Handle 0009) Phy 6:  Link Up
  Invalid DWord Count   8
  Running Disparity Error Count 5
  Loss of DWord Synch Count 2
  Phy Reset Problem Count   0

Expander (Handle 0009) Phy 7:  Link Up
  Invalid DWord Count   8
  Running Disparity Error Count 7
  Loss of DWord Synch Count 2
  Phy Reset Problem Count   0

Expander (Handle 0009) Phy 8:  Link Up
  Invalid DWord Count   8
  Running Disparity Error Count 7
  Loss of DWord Synch Count 2
  Phy Reset Problem Count   0

Expander (Handle 0009) Phy 9:  Link Up
  Invalid DWord Count   8
  Running Disparity Error Count 4
  Loss of DWord Synch Count 2
  Phy Reset Problem Count   0

Expander (Handle 0009) Phy 10:  Link Up
  Invalid DWord Count   8
  Running Disparity Error Count 6
  Loss of DWord Synch Count 2
  Phy Reset Problem Count   0

Expander (Handle 0009) Phy 11:  Link Up
  Invalid DWord Count   8
  Running Disparity Error Count 6
  Loss of DWord Synch Count 2
  Phy Reset Problem Count   0

Expander (Handle 0009) Phy 12:  Link Down, No Errors

Expander (Handle 0009) Phy 13:  Link Down, No Errors

Expander (Handle 0009) Phy 14:  Link Down, No Errors

Expander (Handle 0009) Phy 15:  Link Down, No Errors

Expander (Handle 0009) Phy 16:  Link Down, No Errors

Expander (Handle 0009) Phy 17:  Link Down, No Errors

Expander (Handle 0009) Phy 18:  Link Up, No Errors

Expander (Handle 0009) Phy 19:  Link Up, No Errors

Expander (Handle 0009) Phy 20:  Link Down, No Errors

Expander (Handle 0009) Phy 21:  Link Down, No Errors

Expander (Handle 0009) Phy 22:  Link Down, No Errors

Expander (Handle 0009) Phy 23:  Link Down, No Errors

Expander (Handle 0009) Phy 24:  Link Down, No Errors

Expander (Handle 0009) Phy 25:  Link Down, No Errors

Expander (Handle 0009) Phy 26:  Link Down, No Errors

Expander (Handle 0009) Phy 27:  Link Down, No Errors

Expander (Handle 0009) Phy 28:  Link Up, No Errors

Expander (Handle 0009) Phy 29:  Link Up, No Errors

Expander (Handle 0009) Phy 30:  Link Up, No Errors

Expander (Handle 0009) Phy 31:  Link Up, No Errors

Expander (Handle 0009) Phy 32:  Link Up, No Errors

Expander (Handle 0009) Phy 33:  Link Up, No Errors

Expander (Handle 0009) Phy 34:  Link Down, No Errors

Expander (Handle 0009) Phy 35:  Link Down, No Errors

Expander (Handle 0009) Phy 36:  Link Up, No Errors

Expander (Handle 0009) Phy 37:  Link Down, No Errors
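
For anyone who wants to tally a listing like the one above rather than eyeball it, here is a minimal Python sketch. It assumes the text has exactly the shape shown in this message (one header line per phy, indented counter lines below it). The function names and the default threshold of 1,000 are illustrative choices, not part of lsiutil; the threshold only echoes the rule of thumb quoted elsewhere in this thread.

```python
import re

# One header line per phy, e.g. "Expander (Handle 0009) Phy 4:  Link Up"
PHY_RE = re.compile(
    r"^(Adapter|Expander \(Handle (\w+)\)) Phy (\d+):\s+Link (Up|Down)(, No Errors)?\s*$"
)
# Indented counter lines, e.g. "  Invalid DWord Count   8"
COUNTER_RE = re.compile(r"^\s+(.+?)\s+(\d+)\s*$")


def parse_phy_counters(text):
    """Return one dict per phy: device, phy number, link state, counters."""
    phys = []
    for line in text.splitlines():
        header = PHY_RE.match(line)
        if header:
            handle = header.group(2)  # None for "Adapter" lines
            device = "adapter" if handle is None else "expander " + handle
            phys.append({"device": device, "phy": int(header.group(3)),
                         "link_up": header.group(4) == "Up", "counters": {}})
            continue
        counter = COUNTER_RE.match(line)
        if counter and phys:
            # Attach the counter to the most recently seen phy header.
            phys[-1]["counters"][counter.group(1)] = int(counter.group(2))
    return phys


def noisy_phys(phys, threshold=1000):
    """List (device, phy, total) for phys whose summed counters exceed threshold."""
    return [(p["device"], p["phy"], sum(p["counters"].values()))
            for p in phys if sum(p["counters"].values()) > threshold]


# A few lines copied from the listing above.
SAMPLE = """\
Adapter Phy 0:  Link Up, No Errors
Expander (Handle 0009) Phy 4:  Link Up
  Invalid DWord Count   8
  Running Disparity Error Count 7
  Loss of DWord Synch Count 2
  Phy Reset Problem Count   0
Expander (Handle 0009) Phy 12:  Link Down, No Errors
"""

phys = parse_phy_counters(SAMPLE)
for p in phys:
    print(p["device"], p["phy"], sum(p["counters"].values()))
print(noisy_phys(phys))
```

Summing the four counters per phy is a simplification; keeping them separate may be more useful when only one kind of error is growing.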


Re: [zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)

2011-12-05 Thread Ryan Wehler
Whoops. Make that 9211-4i cards. :) Still promising.



Re: [zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)

2011-12-04 Thread Ryan Wehler

On Dec 3, 2011, at 11:45 PM, Richard Elling wrote:

 On Dec 3, 2011, at 9:32 PM, Ryan Wehler wrote:
 On Dec 3, 2011, at 11:18 PM, Richard Elling wrote:
 
 On Dec 3, 2011, at 9:02 PM, Ryan Wehler wrote:
 
 On Dec 3, 2011, at 10:31 PM, Richard Elling wrote:
 
 On Dec 3, 2011, at 7:36 PM, Ryan Wehler wrote:
 
 Hi Richard,
 Thanks for getting back to me.
 
 
 On Dec 3, 2011, at 9:03 PM, Richard Elling wrote:
 
 On Dec 1, 2011, at 5:08 PM, Ryan Wehler wrote:
 
 During the diagnostics of my SAN failure last week we thought we had 
 seen a backplane failure due to high error counts with 'lsiutil'.  
 However, even with a new backplane and ruling out failed cards (MPXIO 
 or singular) or bad cables I'm still seeing my error count with 
 LSIUTIL increment.  I've got no disks attached to the array right now 
 so I've also ruled those out.
 
 The link error counters are on the receiving side. To see the complete 
 picture, you need to look at
 link errors on both ends of each link (more below…)
 
 
 Even with nothing connected but the HBA to the backplane expander, a 
 simple restart of the SAN into a OpenIndiana LiveCD or other 
 distribution (NexentaStor) increments the counter.
 
 A few counters can tick up when the system is reset at boot. These can 
 be ignored.
 
 What you are looking for is  a consistent increase of the  counters 
 under load. In some cases
 I have seen millions of errors per minute on a very unhappy system.
 
 But we're talking about 600,000 - 2,000,000 errors on a simple reset at 
 boot.  Per my VAR their 6GB hardware show significantly less (in the 10s 
 to 100s of errors, not 100s to millions). 
 
 For high-quality hardware, I see 4 to 8.  If I see  1,000, then I start 
 replacing hardware.
 
 
 And how do you define high quality hardware?  Obviously these aren't 
 crummy SATA adapters and low cost drives.  The chassis and backplane are 
 on Nexenta's HSL.  While the cards are not explicitly listed, the 
 underlying chip (LSI 1068) is on another card (3081E-R) that is on the HSL.
 
 I recently tested a HP DL380 G7 with D2600 and D2700 JBOD chassis. Zero 
 errors.
 
 I'm assuming these had some sort of LSI cards in them since that's the 
 primary focus here.  Do you happen to know models and what expander chip was 
 used on the backplane(s)?
 
 LSI 2008 chipset (HP SC08Ge HBA).  Expanders are HP-branded, I'll speculate 
 they are LSI SAS2x28.
 
 Note: there is also firmware on the HBAs and expanders. But I do not expect 
 firmware to change the
 link error counts. I suspect that is more of a physical issue.

In an effort to solve this problem I did update my 3442E-R HBAs from a 2009 
firmware to Phase 21, which came out earlier this year from LSI.  The 
replacement backplane I got from my VAR, when they thought that was the issue, 
moved the backplane firmware from 7015 to 7017 per lsiutil's output.  You're 
right, it must be a physical issue, but it just seems highly unlikely that BOTH 
HBAs failed and BOTH SAS cables failed (we'll take the expander out of the 
equation since it was replaced).

 
 Currently, the test process for HSL records any errors, but as long as the 
 root cause can be
 explained, the devices can pass certification.
 
 Well, since we can't even come to a reasonable justification for why these 
 errors exist with no true indicator of bad hardware, something like this 
 could pass the HSL if the VAR can justify it? I'm not saying that's what 
 happened. I'm just trying to understand the process.
 
 A certification does not mean that any specific implementation operates 
 without errors. A failed part,
 noisy environment, or other influences will affect any specific 
 implementation.

Would it not be more prudent to re-run the tests after a failure was fixed and 
try to eliminate environmental variables?  If you were to look up the reason it 
made it onto the HSL, it should be "It just works!", not "it works, but this is 
why we're seeing errors". That leads to doubt when there are caveats, and when 
trying to diagnose like/same hardware in the future.

 -- richard
 
 -- 
 
 ZFS and performance consulting
 http://www.RichardElling.com
 LISA '11, Boston, MA, December 4-9


Re: [zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)

2011-12-04 Thread Richard Elling
On Dec 4, 2011, at 8:50 AM, Ryan Wehler wrote:
 
 A certification does not mean that any specific implementation operates 
 without errors. A failed part,
 noisy environment, or other influences will affect any specific 
 implementation.
 
 Would it not be more prudent to re-run the tests after a failure was fixed 
 and try to eliminate environmental variables?  If you were to look up the 
 reason it made it onto the HSL, it should be "It just works!", not "it works, 
 but this is why we're seeing errors". That leads to doubt when there are 
 caveats, and when trying to diagnose like/same hardware in the future.

Perhaps I wasn't clear. When we root cause an error reported during
certification it is to absolve the device under test. For example, if we
run a test against a disk and see errors on the wire caused by a backplane
or cable, then we must absolve the disk of the errors. If the disk is the
root cause of the error reports, then it fails certification.

Do not confuse certification with "it runs forever with no problems in all 
cases".
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com
LISA '11, Boston, MA, December 4-9


Re: [zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)

2011-12-04 Thread James C. McPherson

On  5/12/11 02:50 AM, Ryan Wehler wrote:
...

In an effort to solve this problem I did update my 3442E-R HBAs from a
2009 firmware to Phase 21 which came out earlier this year from LSI. The
replacement backplane I got from my VAR when they thought that was the
issue moved the backplane firmware from 7015 to 7017 per lsiutil's output.
You're right it must be a physical issue but it just seems highly unlikely
that BOTH HBAs failed and BOTH SAS cables failed (we'll take the expander
out of the equation since it was replaced)



You need to look at the data available, rather than making
assumptions. When I was part of CPRE (now PTS?) in Sun we
referred to swapping hardware without investigation as
practicing "swaptronics". Every escalation we got where this
had happened took longer to resolve as a result.

So yes, it certainly could be a hardware problem twice in a
row. You'd want to examine the serial numbers and other identifying
data such as manufacturing date codes to see how likely that is.
In the past I've seen cases where replacement disks turned out to
be duds across several different batches and different factories
involved. The true root cause was traced to a chip that was supplied
to the manufacturer by a third party.

Personally, I'd start looking at the cables first - in my
experience they seem to incur more physical stress through the
connect/disconnect operations than HBAs.



James C. McPherson
--
Oracle
http://www.jmcp.homeunix.com/blog


Re: [zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)

2011-12-04 Thread Ryan Wehler
Well, if we want to get into theories on faulty hardware batches and such, we 
can, though I think the likelihood is slim (but not impossible, I suppose).

I did the best I could diagnostics-wise, given that I have no spare parts that 
have never been a part of this SAN. As I said, I still think two failed HBAs or 
two failed cables just doesn't add up.  The errors thrown by the cards are 
pretty consistent between cable swaps too, so nothing is really indicative of 
a bad cable, let alone two.

My vendor has more hardware on its way to me early this coming week, so I'll 
be able to report back once I have new HBAs and cables too.

On Dec 4, 2011, at 4:11 PM, James C. McPherson wrote:

 On  5/12/11 02:50 AM, Ryan Wehler wrote:
 ...
 In an effort to solve this problem I did update my 3442E-R HBAs from a
 2009 firmware to Phase 21 which came out earlier this year from LSI. The
 replacement backplane I got from my VAR when they thought that was the
 issue moved the backplane firmware from 7015 to 7017 per lsiutil's output.
 You're right it must be a physical issue but it just seems highly unlikely
 that BOTH HBAs failed and BOTH SAS cables failed (we'll take the expander
 out of the equation since it was replaced)
 
 
 You need to look at the data available, rather than making
 assumptions. When I was part of CPRE (now PTS?) in Sun we
 referred to swapping hardware without investigation as
 practicing swaptronics. Every escalation we got where this
 had happened took longer to resolve as a result.
 
 So yes, it certainly could be a hardware problem twice in a
 row. You'd want to examine the serial numbers and other identifying
 data such as manufacturing date codes to see how likely that is.
 In the past I've seen cases where replacement disks turned out to
 be duds across several different batches and different factories
 involved. The true root cause was traced to a chip that was supplied
 to the manufacturer by a third party.
 
 Personally, I'd start looking at the cables first - in my
 experience they seem to incur more physical stress through the
 connect/disconnect operations than HBAs.
 
 
 
 James C. McPherson
 --
 Oracle
 http://www.jmcp.homeunix.com/blog



Re: [zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)

2011-12-03 Thread Richard Elling
On Dec 1, 2011, at 5:08 PM, Ryan Wehler wrote:

 During the diagnostics of my SAN failure last week we thought we had seen a 
 backplane failure due to high error counts with 'lsiutil'.  However, even 
 with a new backplane and ruling out failed cards (MPXIO or singular) or bad 
 cables I'm still seeing my error count with LSIUTIL increment.  I've got no 
 disks attached to the array right now so I've also ruled those out.

The link error counters are on the receiving side. To see the complete picture, 
you need to look at link errors on both ends of each link (more below…)

 
 Even with nothing connected but the HBA to the backplane expander, a simple 
 restart of the SAN into a OpenIndiana LiveCD or other distribution 
 (NexentaStor) increments the counter.

A few counters can tick up when the system is reset at boot. These can be 
ignored.

What you are looking for is a consistent increase of the counters under load. 
In some cases I have seen millions of errors per minute on a very unhappy system.
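
To make "a consistent increase under load" concrete, here is a small Python sketch that compares two counter snapshots and reports the growth per minute. The snapshot dictionaries, the phy labels, the function name, and every number below are made up for illustration; they are not measurements from any system in this thread.

```python
def counter_growth(before, after, minutes):
    """Errors gained per minute for each phy between two snapshots.

    Each snapshot maps a phy label to its total link error count.
    A phy missing from the first snapshot is treated as starting at zero.
    """
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return {phy: (after[phy] - before.get(phy, 0)) / minutes for phy in after}


# Hypothetical snapshots: one taken right after boot, one after 10 minutes of I/O.
after_boot = {"expander phy 4": 17, "expander phy 5": 16}
after_load = {"expander phy 4": 17, "expander phy 5": 3016}

rates = counter_growth(after_boot, after_load, minutes=10)
# Phys whose counters kept climbing while the system was busy.
suspects = [phy for phy, rate in rates.items() if rate > 0]

print(rates)
print(suspects)
```

The point of taking the first snapshot after boot is that the handful of errors logged during reset drops out of the difference, leaving only what accumulated under load.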

 I've been as careful as I can be to clear the counter between changes to 
 parts to try and eliminate a potentially bad cable/card/etc.  You can see phy 
 8-15 throw errors regardless of MPXIO or single card config, OR which 
 expander port I use on the backplane.

The info you attached doesn't show the topology (lsiutil command 16), so it is 
difficult to say why this occurs.

 
 According to my VAR something in the mptsas code changed recently (not sure 
 what that means in time terms) and they do not see the problems with 6GB 
 backplanes and adapters.

These counters are in the physical interfaces, far away from any OS.

 
 SAS Diags.txt
 
 
 Attached is a log I took through NexentaStor 3.1.1 with my disks still 
 attached.  The disks themselves don't seem to be throwing errors, so that's 
 good.

To see errors from the disk's perspective, you need to look at the disk's logs.
I use sg3_utils for this (sg_logs -a /dev/rdsk/...)
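
A rough Python sketch for pulling the numeric fields out of saved sg_logs text follows. It assumes a layout in which each log page starts with an unindented title ending in a bracketed page code, followed by indented "name = value" lines; the exact layout can differ between sg3_utils versions, so treat the two regular expressions as assumptions to adjust. The sample text and all numbers in it are invented for illustration.

```python
import re

# Assumed page title shape, e.g. "Read error counter page  [0x3]"
PAGE_RE = re.compile(r"^(\S.*?)\s+\[0x[0-9a-fA-F]+\]\s*$")
# Assumed field shape, e.g. "  Total uncorrected errors = 0"
FIELD_RE = re.compile(r"^\s+(.+?)\s*=\s*(\d+)\s*$")


def parse_log_pages(text):
    """Return {page title: {field name: integer value}}."""
    pages = {}
    current = None
    for line in text.splitlines():
        page = PAGE_RE.match(line)
        if page:
            current = pages.setdefault(page.group(1), {})
            continue
        field = FIELD_RE.match(line)
        if field and current is not None:
            current[field.group(1)] = int(field.group(2))
    return pages


# Invented sample in the assumed layout; not output from a real disk.
SAMPLE = """\
Write error counter page  [0x2]
  Errors corrected without substantial delay = 0
  Total uncorrected errors = 0
Read error counter page  [0x3]
  Errors corrected without substantial delay = 1523
  Total uncorrected errors = 0
"""

pages = parse_log_pages(SAMPLE)
for title, fields in pages.items():
    print(title, fields)
```

Grouping by page title matters because the read and write error counter pages use the same field names, so a flat dictionary would silently overwrite one with the other.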

 
 
 Has anyone seen anything like this?  I have not tried to boot into an older 
 version of Solaris or NexentaStor yet, but booting into Scientific Linux 6.1 
 yields about the same results with lsiutil.

Yes. Root cause is always hardware.

 
 Nothing from fmadm, /var/adm/messages or otherwise indicate these data errors 
 outside of lsiutil.

Those errors are counters as part of the SAS link state machine. The symptoms 
will show as
poor performance or occasional command resets at the OS level.
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com
LISA '11, Boston, MA, December 4-9


Re: [zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)

2011-12-03 Thread Ryan Wehler
Hi Richard,
  Thanks for getting back to me.


On Dec 3, 2011, at 9:03 PM, Richard Elling wrote:

 On Dec 1, 2011, at 5:08 PM, Ryan Wehler wrote:
 
 During the diagnostics of my SAN failure last week we thought we had seen a 
 backplane failure due to high error counts with 'lsiutil'.  However, even 
 with a new backplane and ruling out failed cards (MPXIO or singular) or bad 
 cables I'm still seeing my error count with LSIUTIL increment.  I've got no 
 disks attached to the array right now so I've also ruled those out.
 
 The link error counters are on the receiving side. To see the complete 
 picture, you need to look at
 link errors on both ends of each link (more below…)
 
 
 Even with nothing connected but the HBA to the backplane expander, a simple 
 restart of the SAN into a OpenIndiana LiveCD or other distribution 
 (NexentaStor) increments the counter.
 
 A few counters can tick up when the system is reset at boot. These can be 
 ignored.
 
 What you are looking for is  a consistent increase of the  counters under 
 load. In some cases
 I have seen millions of errors per minute on a very unhappy system.

But we're talking about 600,000 - 2,000,000 errors on a simple reset at boot.  
Per my VAR, their 6GB hardware shows significantly fewer (in the 10s to 100s of 
errors, not 100s to millions). 

 
 I've been as careful as I can be to clear the counter between changes to 
 parts to try and eliminate a potentially bad cable/card/etc.  You can see 
 phy 8-15 throw errors regardless of MPXIO or single card config, OR which 
 expander port I use on the backplane.
 
 The info you attached doesn't show the topology (lsiutil command 16), so it is 
 difficult to say why this occurs.

Attached is the output of option 16 on each card.

Card #1

SAS1068E's links are 3.0 G, 3.0 G, 3.0 G, 3.0 G, down, down, down, down

 B___T     SASAddress     PhyNum  Handle  Parent  Type
        500605b0014cb930           0001           SAS Initiator
        500605b0014cb931           0002           SAS Initiator
        500605b0014cb932           0003           SAS Initiator
        500605b0014cb933           0004           SAS Initiator
        500605b0014cb934           0005           SAS Initiator
        500605b0014cb935           0006           SAS Initiator
        500605b0014cb936           0007           SAS Initiator
        500605b0014cb937           0008           SAS Initiator
        500304800111abff     0     0009    0001   Edge Expander
 0   0  5000c500040e92f5     0     000a    0009   SAS Target
 0   1  5000c500040eb1f1     1     000b    0009   SAS Target
 0   2  5000c50017a58869     2     000c    0009   SAS Target
 0   3  5000c50017418d4d     3     000d    0009   SAS Target
 0   4  5000c50017a58b51     4     000e    0009   SAS Target
 0   5  5000c50017a5892d     5     000f    0009   SAS Target
 0   6  5000c50017a66061     6     0010    0009   SAS Target
 0   7  5000c50017a677f5     7     0011    0009   SAS Target
        500605b0014ca930    12     0012    0009   SAS Initiator, not present
 0   8  5000c5003bf025b5    22     0013    0009   SAS Target
 0   9  5000c5003be44d55    23     0014    0009   SAS Target
 0  10  5000c50017a6a635    24     0015    0009   SAS Target
 0  11  5000c500174132f9    25     0016    0009   SAS Target
 0  12  5000c50017a64da1    26     0017    0009   SAS Target
 0  13  5000c50017a58729    27     0018    0009   SAS Target
 0  14  5000c5001741b321    28     0019    0009   SAS Target
 0  15  5000c500174190a5    29     001a    0009   SAS Target
 0  16  5000c50017a58b11    30     001b    0009   SAS Target
 0  17  5000c50017a69511    31     001c    0009   SAS Target
 0  18  5000c5003bf22ebd    32     001d    0009   SAS Target
 0  19

Re: [zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)

2011-12-03 Thread Richard Elling
On Dec 3, 2011, at 7:36 PM, Ryan Wehler wrote:

 Hi Richard,
  Thanks for getting back to me.
 
 
 On Dec 3, 2011, at 9:03 PM, Richard Elling wrote:
 
 On Dec 1, 2011, at 5:08 PM, Ryan Wehler wrote:
 
 During the diagnostics of my SAN failure last week we thought we had seen a 
 backplane failure due to high error counts with 'lsiutil'.  However, even 
 with a new backplane and ruling out failed cards (MPXIO or singular) or bad 
 cables I'm still seeing my error count with LSIUTIL increment.  I've got no 
 disks attached to the array right now so I've also ruled those out.
 
 The link error counters are on the receiving side. To see the complete 
 picture, you need to look at
 link errors on both ends of each link (more below…)
 
 
 Even with nothing connected but the HBA to the backplane expander, a simple 
 restart of the SAN into a OpenIndiana LiveCD or other distribution 
 (NexentaStor) increments the counter.
 
 A few counters can tick up when the system is reset at boot. These can be 
 ignored.
 
 What you are looking for is  a consistent increase of the  counters under 
 load. In some cases
 I have seen millions of errors per minute on a very unhappy system.
 
 But we're talking about 600,000 - 2,000,000 errors on a simple reset at 
 boot.  Per my VAR their 6GB hardware show significantly less (in the 10s to 
 100s of errors, not 100s to millions). 

For high-quality hardware, I see 4 to 8.  If I see  1,000, then I start 
replacing hardware.

 I've been as careful as I can be to clear the counter between changes to 
 parts to try and eliminate a potentially bad cable/card/etc.  You can see 
 phy 8-15 throw errors irregardless of MPXIO or single card config, OR which 
 expander port I use on the backplane.
 
 The info you attaced doesn't show the topology (lsiutil command 16), so it 
 is difficult to say
 why this occurs.
 
 Attached is the output of option 16 on each card.
 
 LSI1068.rtf

This shows that the handle 0009 phys 12 to 15 are the other HBA (initiator).

It is unusual to see millions of errors there.

Also, the number of errors is not symmetrical. From the HBA (Adapter phy 1)
you see on the order of a thousand errors. From the expander (handle 0009)
you see millions of errors on phys 12 to 15, which are connected to the HBA.

Also interesting is that one of the phys, adapter phy 0, shows no errors, but 
we see errors on the others. This is unusual because there are 4 links in the 
cable.
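
As a rough illustration of the both-ends comparison described above, the following Python sketch labels each lane of a wide link by how lopsided its two receive-side counts are. The function name, the 10x ratio, the lane labels, and all of the numbers are assumptions chosen for the example; they are not taken from the attached logs.

```python
def compare_link_ends(lanes, ratio=10):
    """Classify each lane from the error counts seen at its two ends.

    lanes: list of (label, errors_counted_at_hba, errors_counted_at_expander).
    Counters sit on the receiving side, so a count at one end describes
    traffic arriving at that end.
    """
    report = []
    for label, hba_rx, exp_rx in lanes:
        low, high = sorted((hba_rx, exp_rx))
        if high == 0:
            verdict = "clean"
        elif high >= ratio * max(low, 1):
            worse = "expander" if exp_rx > hba_rx else "HBA"
            verdict = "asymmetric: worse at " + worse
        else:
            verdict = "errors at both ends"
        report.append((label, verdict))
    return report


# Hypothetical counts for the four lanes of one cable.
lanes = [
    ("lane 0", 0, 1_500_000),
    ("lane 1", 1_200, 1_400_000),
    ("lane 2", 0, 0),
    ("lane 3", 900, 1_100),
]

for label, verdict in compare_link_ends(lanes):
    print(label, verdict)
```

Because each counter is kept by the receiver, a count that is much larger at the expander end hints at trouble in the HBA-to-expander direction, although that alone does not say whether the transmitter, the cable, or the receiver is at fault.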

Still smells like hardware to me.
 -- richard

 
 According to my VAR something in the mptsas code changed recently (not 
 sure what that means in time terms) and they do not see the problems with 
 6GB backplanes and adapters.
 
 These counters are in the physical interfaces, far away from any OS.
 
 
 SAS Diags.txt
 
 
 Attached is a log I took through NexentaStor 3.1.1 with my disks still 
 attached.  The disks themselves don't seem to be throwing errors, so that's 
 good.
 
 To see errors from the disk's perspective, you need to look at the disk's 
 logs.
 I use sg3 utils for this (sg_logs -a /dev/rdsk/...)
 
 
 I'd paste some of this, but the output would be pretty big.  :) I'll look 
 more into this. Though my "errors corrected without substantial delay" count 
 stands out as pretty high, even on a new disk I just received. Is there 
 anything specific I should be looking at?
 
 
 
 
 Has anyone seen anything like this?  I have not tried to boot into an older 
 version of Solaris or NexentaStor yet, but booting into Scientific Linux 
 6.1 yields about the same results with lsiutil.
 
 Yes. Root cause is always hardware.
 
 
 Nothing from fmadm, /var/adm/messages or otherwise indicate these data 
 errors outside of lsiutil.
 
 Those errors are counters as part of the SAS link state machine. The 
 symptoms will show as
 poor performance or occasional command resets at the OS level.
 -- richard
 
 -- 
 
 ZFS and performance consulting
 http://www.RichardElling.com
 LISA '11, Boston, MA, December 4-9

-- 

ZFS and performance consulting
http://www.RichardElling.com
LISA '11, Boston, MA, December 4-9


Re: [zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)

2011-12-03 Thread Ryan Wehler

On Dec 3, 2011, at 10:31 PM, Richard Elling wrote:

 On Dec 3, 2011, at 7:36 PM, Ryan Wehler wrote:
 
 Hi Richard,
 Thanks for getting back to me.
 
 
 On Dec 3, 2011, at 9:03 PM, Richard Elling wrote:
 
 On Dec 1, 2011, at 5:08 PM, Ryan Wehler wrote:
 
 During the diagnostics of my SAN failure last week we thought we had seen 
 a backplane failure due to high error counts with 'lsiutil'.  However, 
 even with a new backplane and ruling out failed cards (MPXIO or singular) 
 or bad cables I'm still seeing my error count with LSIUTIL increment.  
 I've got no disks attached to the array right now so I've also ruled those 
 out.
 
 The link error counters are on the receiving side. To see the complete 
 picture, you need to look at
 link errors on both ends of each link (more below…)
 
 
 Even with nothing connected but the HBA to the backplane expander, a 
 simple restart of the SAN into a OpenIndiana LiveCD or other distribution 
 (NexentaStor) increments the counter.
 
 A few counters can tick up when the system is reset at boot. These can be 
 ignored.
 
 What you are looking for is  a consistent increase of the  counters under 
 load. In some cases
 I have seen millions of errors per minute on a very unhappy system.
 
 But we're talking about 600,000 - 2,000,000 errors on a simple reset at 
 boot.  Per my VAR their 6GB hardware show significantly less (in the 10s to 
 100s of errors, not 100s to millions). 
 
 For high-quality hardware, I see 4 to 8.  If I see  1,000, then I start 
 replacing hardware.


And how do you define high quality hardware?  Obviously these aren't crummy 
SATA adapters and low cost drives.  The chassis and backplane are on Nexenta's 
HSL.  While the cards are not explicitly listed, the underlying chip (LSI 
1068) is on another card (3081E-R) that is on the HSL.

 
 I've been as careful as I can be to clear the counter between changes to 
 parts to try and eliminate a potentially bad cable/card/etc.  You can see 
 phy 8-15 throw errors irregardless of MPXIO or single card config, OR 
 which expander port I use on the backplane.
 
 The info you attaced doesn't show the topology (lsiutil command 16), so it 
 is difficult to say
 why this occurs.
 
 Attached is the output of option 16 on each card.
 
 LSI1068.rtf
 
 This shows that the handle 0009 phys 12 to 15 are the other HBA (initiator).
 
 It is unusual to see millions of errors there.
 
 Also, the number of errors is not symmetrical. From the HBA (Adapter phy 1)
 you see on the order of  thousand errors. From the expander (handle 0009)
 you see millions of errors on phys 12 to 15, that are connected to the HBA.
 
 Also interesting is that one of the phys, adapter phy 0, shows no errors, but 
 we see
 errors on the others. This is unusual because there are 4 links in the cable.
 
 Still smells like hardware to me.
 -- richard
 

I'm not quite extrapolating this data like you are.  I see handle 0009 which 
looks to be the expander.  Card #1 is hooked to phy 8-11 and Card #2 is hooked 
to phy 12-15.  (port 0 and 1 on the expander)

As far as symmetrical errors, yeah, the whole thing is screwy. The one thing 
that stands out, which I did not notice before for some reason, is that the 
right card (the one that normally handles phy 12-15) in the output from my 
initial inquiry carries 1M+ errors on the expander phys regardless of whether 
the right or left cable is used.  Perhaps that is an indicator of hardware 
malfunction. The left card (usually responsible for phy 8-11) throws 
something on the order of 600K+ (under 1M) using either the right or left 
cable (phy 8-11 or 12-15).  Those numbers are uncomfortably high too, though.

Basically, the output in my SAS Diags.txt came from switching between each 
card on its own with each of the two cables I had available to me.  If I were 
to show the output now, with both cards enabled, phys 8-15 on the expander 
would all show link up.

The other mystery as you mentioned is why Adapter phy 0 is error free while the 
other 3 phys are not. It's also persistent across cables used AND cards used.

 
 According to my VAR something in the mptsas code changed recently (not 
 sure what that means in time terms) and they do not see the problems with 
 6GB backplanes and adapters.
 
 These counters are in the physical interfaces, far away from any OS.
 
 
 SAS Diags.txt
 
 
 Attached is a log I took through NexentaStor 3.1.1 with my disks still 
 attached.  The disks themselves don't seem to be throwing errors, so 
 that's good.
 
 To see errors from the disk's perspective, you need to look at the disk's 
 logs.
 I use sg3 utils for this (sg_logs -a /dev/rdsk/...)
 
 
 I'd paste some of this, but the output would be pretty big.  :) I'll look 
 more into this. Though my errors corrected without substantial delay 
 stands out as pretty high, even on a new disk I just received. Is there 
 anything specific I should be looking at?
 
 
 
 
 Has anyone seen anything like this?  I have not tried to boot into an 
 older 

Re: [zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)

2011-12-03 Thread Richard Elling
On Dec 3, 2011, at 9:02 PM, Ryan Wehler wrote:
 
 On Dec 3, 2011, at 10:31 PM, Richard Elling wrote:
 
 On Dec 3, 2011, at 7:36 PM, Ryan Wehler wrote:
 
 Hi Richard,
 Thanks for getting back to me.
 
 
 On Dec 3, 2011, at 9:03 PM, Richard Elling wrote:
 
 On Dec 1, 2011, at 5:08 PM, Ryan Wehler wrote:
 
 During the diagnostics of my SAN failure last week we thought we had seen 
 a backplane failure due to high error counts with 'lsiutil'.  However, 
 even with a new backplane and ruling out failed cards (MPXIO or singular) 
 or bad cables I'm still seeing my error count with LSIUTIL increment.  
 I've got no disks attached to the array right now so I've also ruled 
 those out.
 
 The link error counters are on the receiving side. To see the complete 
 picture, you need to look at
 link errors on both ends of each link (more below…)
 
 
 Even with nothing connected but the HBA to the backplane expander, a 
 simple restart of the SAN into an OpenIndiana LiveCD or other distribution 
 (NexentaStor) increments the counter.
 
 A few counters can tick up when the system is reset at boot. These can be 
 ignored.
 
 What you are looking for is  a consistent increase of the  counters under 
 load. In some cases
 I have seen millions of errors per minute on a very unhappy system.
 
 But we're talking about 600,000 - 2,000,000 errors on a simple reset at 
 boot.  Per my VAR their 6GB hardware show significantly less (in the 10s to 
 100s of errors, not 100s to millions). 
 
 For high-quality hardware, I see 4 to 8.  If I see  1,000, then I start 
 replacing hardware.
 
 
 And how do you define high quality hardware?  Obviously these aren't crummy 
 SATA adapters and low cost drives.  The Chassis and backplane are on 
 Nexenta's HSL.  While the cards are not explicitly listed, the underlying 
 chip (LSI 1068) is on another card (3081E-R) that is on the HSL.

I recently tested a HP DL380 G7 with D2600 and D2700 JBOD chassis. Zero errors.

Currently, the test process for HSL records any errors, but as long as the root 
cause can be explained, the devices can pass certification.

 I've been as careful as I can be to clear the counter between changes to 
 parts to try and eliminate a potentially bad cable/card/etc.  You can see 
 phy 8-15 throw errors regardless of MPXIO or single card config, OR 
 which expander port I use on the backplane.
 
 The info you attached doesn't show the topology (lsiutil command 16), so it 
 is difficult to say
 why this occurs.
 
 Attached is the output of option 16 on each card.
 
 LSI1068.rtf
 
 This shows that the handle 0009 phys 12 to 15 are the other HBA (initiator).
 
 It is unusual to see millions of errors there.
 
 Also, the number of errors is not symmetrical. From the HBA (Adapter phy 1)
 you see on the order of  thousand errors. From the expander (handle 0009)
 you see millions of errors on phys 12 to 15, that are connected to the HBA.
 
 Also interesting is that one of the phys, adapter phy 0, shows no errors, 
 but we see
 errors on the others. This is unusual because there are 4 links in the cable.
 
 Still smells like hardware to me.
 -- richard
 
 
 I'm not quite extrapolating this data like you are.  I see handle 0009 which 
 looks to be the expander.  Card #1 is hooked to phy 8-11 and Card #2 is 
 hooked to phy 12-15.  (port 0 and 1 on the expander)
 
 As far as symmetrical errors, yeah the whole thing is screwy. The one thing I 
 am seeing as stand out that I did not notice before for some reason is that 
 right card (the one that normally handles phy 12-15) in my previous output 
 from my initial inquiry carries 1+M errors on the expander phys regardless of 
 the right or left cable.  Perhaps that is an indicator of hardware 
 malfunction. The left card (usually responsible for phy 8-11) throws 
 something in the order of 600+K (under 1M) using right or left cable (phy 
 8-11 or 12-15).  Those numbers are uncomfortably high too, though.

Agree.

 Basically the output of my SAS Diag.txt was flipping between single use of 
 each card with each of the two cables I had available to me.  If I were to 
 show the output now with both cards enabled phy 8-15 on the expander all show 
 link up situation.

Are the cables of the same make/model? Unfortunately, it is not uncommon to see 
bad cables :-(
I had one just last week :-(

 The other mystery as you mentioned is why Adapter phy 0 is error free while 
 the other 3 phys are not. It's also persistent across cables used AND cards 
 used.

A mystery…
 -- richard
-- 

ZFS and performance consulting
http://www.RichardElling.com
LISA '11, Boston, MA, December 4-9


Re: [zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)

2011-12-03 Thread Ryan Wehler

On Dec 3, 2011, at 11:18 PM, Richard Elling wrote:

 On Dec 3, 2011, at 9:02 PM, Ryan Wehler wrote:
 
 On Dec 3, 2011, at 10:31 PM, Richard Elling wrote:
 
 On Dec 3, 2011, at 7:36 PM, Ryan Wehler wrote:
 
 Hi Richard,
 Thanks for getting back to me.
 
 
 On Dec 3, 2011, at 9:03 PM, Richard Elling wrote:
 
 On Dec 1, 2011, at 5:08 PM, Ryan Wehler wrote:
 
 During the diagnostics of my SAN failure last week we thought we had 
 seen a backplane failure due to high error counts with 'lsiutil'.  
 However, even with a new backplane and ruling out failed cards (MPXIO or 
 singular) or bad cables I'm still seeing my error count with LSIUTIL 
 increment.  I've got no disks attached to the array right now so I've 
 also ruled those out.
 
 The link error counters are on the receiving side. To see the complete 
 picture, you need to look at
 link errors on both ends of each link (more below…)
 
 
 Even with nothing connected but the HBA to the backplane expander, a 
 simple restart of the SAN into an OpenIndiana LiveCD or other 
 distribution (NexentaStor) increments the counter.
 
 A few counters can tick up when the system is reset at boot. These can be 
 ignored.
 
 What you are looking for is  a consistent increase of the  counters under 
 load. In some cases
 I have seen millions of errors per minute on a very unhappy system.
 
 But we're talking about 600,000 - 2,000,000 errors on a simple reset at 
 boot.  Per my VAR their 6GB hardware show significantly less (in the 10s 
 to 100s of errors, not 100s to millions). 
 
 For high-quality hardware, I see 4 to 8.  If I see  1,000, then I start 
 replacing hardware.
 
 
 And how do you define high quality hardware?  Obviously these aren't 
 crummy SATA adapters and low cost drives.  The Chassis and backplane are on 
 Nexenta's HSL.  While the cards are not explicitly listed, the underlying 
 chip (LSI 1068) is on another card (3081E-R) that is on the HSL.
 
 I recently tested a HP DL380 G7 with D2600 and D2700 JBOD chassis. Zero 
 errors.

I'm assuming these had some sort of LSI cards in them since that's the primary 
focus here.  Do you happen to know models and what expander chip was used on 
the backplane(s)?

 Currently, the test process for HSL records any errors, but as long as the 
 root cause can be
 explained, the devices can pass certification.

Well, since we can't even come to a reasonable justification for why these 
errors exist with no true indicator of bad hardware, could something like this 
pass the HSL if the VAR can justify it?  I'm not saying that's what 
happened. I'm just trying to understand the process.


 
 I've been as careful as I can be to clear the counter between changes to 
 parts to try and eliminate a potentially bad cable/card/etc.  You can 
 see phy 8-15 throw errors regardless of MPXIO or single card config, 
 OR which expander port I use on the backplane.
 
 The info you attached doesn't show the topology (lsiutil command 16), so 
 it is difficult to say
 why this occurs.
 
 Attached is the output of option 16 on each card.
 
 LSI1068.rtf
 
 This shows that the handle 0009 phys 12 to 15 are the other HBA (initiator).
 
 It is unusual to see millions of errors there.
 
 Also, the number of errors is not symmetrical. From the HBA (Adapter phy 1)
 you see on the order of  thousand errors. From the expander (handle 0009)
 you see millions of errors on phys 12 to 15, that are connected to the HBA.
 
 Also interesting is that one of the phys, adapter phy 0, shows no errors, 
 but we see
 errors on the others. This is unusual because there are 4 links in the 
 cable.
 
 Still smells like hardware to me.
 -- richard
 
 
 I'm not quite extrapolating this data like you are.  I see handle 0009 which 
 looks to be the expander.  Card #1 is hooked to phy 8-11 and Card #2 is 
 hooked to phy 12-15.  (port 0 and 1 on the expander)
 
 As far as symmetrical errors, yeah the whole thing is screwy. The one thing 
 I am seeing as stand out that I did not notice before for some reason is 
 that right card (the one that normally handles phy 12-15) in my previous 
 output from my initial inquiry carries 1+M errors on the expander phys 
 regardless of the right or left cable.  Perhaps that is an indicator of 
 hardware malfunction. The left card (usually responsible for phy 8-11) 
 throws something in the order of 600+K (under 1M) using right or left 
 cable (phy 8-11 or 12-15).  Those numbers are uncomfortably high too, though.
 
 Agree.
 
 Basically the output of my SAS Diag.txt was flipping between single use of 
 each card with each of the two cables I had available to me.  If I were to 
 show the output now with both cards enabled phy 8-15 on the expander all 
 show link up situation.
 
 Are the cables of the same make/model? Unfortunately, it is not uncommon to 
 see bad cables :-(
 I had one just last week :-(

The cables are identical.  My VAR put this all together about 2 years ago.   I 
don't have any other cables to test but the 

Re: [zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)

2011-12-03 Thread Richard Elling
On Dec 3, 2011, at 9:32 PM, Ryan Wehler wrote:
 On Dec 3, 2011, at 11:18 PM, Richard Elling wrote:
 
 On Dec 3, 2011, at 9:02 PM, Ryan Wehler wrote:
 
 On Dec 3, 2011, at 10:31 PM, Richard Elling wrote:
 
 On Dec 3, 2011, at 7:36 PM, Ryan Wehler wrote:
 
 Hi Richard,
 Thanks for getting back to me.
 
 
 On Dec 3, 2011, at 9:03 PM, Richard Elling wrote:
 
 On Dec 1, 2011, at 5:08 PM, Ryan Wehler wrote:
 
 During the diagnostics of my SAN failure last week we thought we had 
 seen a backplane failure due to high error counts with 'lsiutil'.  
 However, even with a new backplane and ruling out failed cards (MPXIO 
 or singular) or bad cables I'm still seeing my error count with LSIUTIL 
 increment.  I've got no disks attached to the array right now so I've 
 also ruled those out.
 
 The link error counters are on the receiving side. To see the complete 
 picture, you need to look at
 link errors on both ends of each link (more below…)
 
 
 Even with nothing connected but the HBA to the backplane expander, a 
 simple restart of the SAN into an OpenIndiana LiveCD or other 
 distribution (NexentaStor) increments the counter.
 
 A few counters can tick up when the system is reset at boot. These can 
 be ignored.
 
 What you are looking for is  a consistent increase of the  counters 
 under load. In some cases
 I have seen millions of errors per minute on a very unhappy system.
 
 But we're talking about 600,000 - 2,000,000 errors on a simple reset at 
 boot.  Per my VAR their 6GB hardware show significantly less (in the 10s 
 to 100s of errors, not 100s to millions). 
 
 For high-quality hardware, I see 4 to 8.  If I see  1,000, then I start 
 replacing hardware.
 
 
 And how do you define high quality hardware?  Obviously these aren't 
 crummy SATA adapters and low cost drives.  The Chassis and backplane are on 
 Nexenta's HSL.  While the cards are not explicitly listed, the underlying 
 chip (LSI 1068) is on another card (3081E-R) that is on the HSL.
 
 I recently tested a HP DL380 G7 with D2600 and D2700 JBOD chassis. Zero 
 errors.
 
 I'm assuming these had some sort of LSI cards in them since that's the 
 primary focus here.  Do you happen to know models and what expander chip was 
 used on the backplane(s)?

LSI 2008 chipset (HP SC08Ge HBA).  Expanders are HP-branded, I'll speculate 
they are LSI SAS2x28.

Note: there is also firmware on the HBAs and expanders. But I do not expect 
firmware to change the link error counts. I suspect that is more of a physical 
issue.

 Currently, the test process for HSL records any errors, but as long as the 
 root cause can be
 explained, the devices can pass certification.
 
 Well since we can't even come to a reasonable justification on why these 
 errors exist with no true indicator of bad hardware, something like this 
 could pass the HSL if the VAR can justify it?  I'm not saying that's what 
 happened.. I'm just trying to understand the process.

A certification does not mean that any specific implementation operates without 
errors. A failed part, noisy environment, or other influences will affect any 
specific implementation.
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com
LISA '11, Boston, MA, December 4-9


[zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)

2011-12-01 Thread Ryan Wehler
During the diagnostics of my SAN failure last week we thought we had seen a 
backplane failure due to high error counts in 'lsiutil'.  However, even with 
a new backplane, and after ruling out failed cards (MPXIO or singular) and bad 
cables, I'm still seeing my error counts in LSIUTIL increment.  I've got no 
disks attached to the array right now, so I've also ruled those out.

Even with nothing connected but the HBA to the backplane expander, a simple 
restart of the SAN into an OpenIndiana LiveCD or other distribution 
(NexentaStor) increments the counter.

I've been as careful as I can be to clear the counter between changes to parts 
to try and eliminate a potentially bad cable/card/etc.  You can see phy 8-15 
throw errors regardless of MPXIO or single card config, or which expander 
port I use on the backplane.
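
Because a reset at boot can bump the counters on its own, one way to separate 
boot noise from ongoing errors is to record the counters twice and look only 
at the growth in between. Here is a minimal Python sketch of that comparison. 
The snapshot layout (a dict keyed by device, phy, and counter name) and the 
function name counter_deltas are assumptions made for illustration, and the 
readings are invented rather than taken from a capture.

```python
from typing import Dict, Tuple

# Key: (device label, phy number, counter name); value: counter reading.
Snapshot = Dict[Tuple[str, int, str], int]


def counter_deltas(before: Snapshot, after: Snapshot) -> Snapshot:
    """Return only the counters that grew between two readings.

    A counter missing from `before` is treated as 0, which matches a
    phy that previously reported "No Errors".
    """
    deltas: Snapshot = {}
    for key, value in after.items():
        growth = value - before.get(key, 0)
        if growth > 0:
            deltas[key] = growth
    return deltas


# Illustrative readings only (hypothetical, not from a real capture).
before = {("Expander 0009", 12, "Invalid DWord Count"): 0}
after = {
    ("Expander 0009", 12, "Invalid DWord Count"): 1500,
    ("Adapter", 0, "Invalid DWord Count"): 0,
}

for (device, phy, name), growth in sorted(counter_deltas(before, after).items()):
    print(f"{device} phy {phy}: {name} +{growth}")
```

Taking the first reading right after clearing the counters, and the second 
after a fixed period under load, would make runs with different cards or 
cables easier to compare.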

According to my VAR something in the mptsas code changed recently (not sure 
what that means in time terms) and they do not see the problems with 6GB 
backplanes and adapters.
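
For anyone who wants to summarize a dump like the one in this post, the 
sketch below parses the phy counter listing into a dictionary and flags phys 
whose counters reach a threshold. It is a rough Python sketch that assumes 
the text layout shown in this post. The function names are made up for 
illustration, and the 1,000 threshold is a rule of thumb mentioned elsewhere 
in this thread, not a vendor limit.

```python
import re
from typing import Dict, List, Tuple

# Matches lines such as "Adapter Phy 1:  Link Up" or
# "Expander (Handle 0009) Phy 12:  Link Up, No Errors".
PHY_HEADER = re.compile(
    r"^(Adapter|Expander \(Handle [0-9A-Fa-f]+\)) Phy (\d+):\s+Link (?:Up|Down)"
)
# Matches lines such as "  Invalid DWord Count   1,276".
COUNTER_LINE = re.compile(r"^\s*([A-Za-z][A-Za-z ]*?)\s+([\d,]+)\s*$")

# Assumed rule-of-thumb threshold, not a vendor limit.
THRESHOLD = 1000

PhyKey = Tuple[str, int]


def parse_phy_counters(text: str) -> Dict[PhyKey, Dict[str, int]]:
    """Map (device, phy) to its error counters; clean phys get an empty dict."""
    phys: Dict[PhyKey, Dict[str, int]] = {}
    current = None
    for line in text.splitlines():
        header = PHY_HEADER.match(line)
        if header:
            current = (header.group(1), int(header.group(2)))
            phys[current] = {}
            continue
        counter = COUNTER_LINE.match(line)
        if counter and current is not None:
            value = int(counter.group(2).replace(",", ""))
            phys[current][counter.group(1)] = value
    return phys


def suspect_phys(phys: Dict[PhyKey, Dict[str, int]],
                 threshold: int = THRESHOLD) -> List[PhyKey]:
    """Return the phys whose largest counter is at or above the threshold."""
    return [key for key, counters in phys.items()
            if counters and max(counters.values()) >= threshold]


# Excerpt copied from the dump in this post.
SAMPLE = """\
Adapter Phy 0:  Link Up, No Errors

Adapter Phy 1:  Link Up
  Invalid DWord Count   1,276
  Running Disparity Error Count 1,167
  Loss of DWord Synch Count 0
  Phy Reset Problem Count   0

Expander (Handle 0009) Phy 12:  Link Up
  Invalid DWord Count 687,520
  Running Disparity Error Count   651,781
  Loss of DWord Synch Count 1
  Phy Reset Problem Count   0
"""

for device, phy in suspect_phys(parse_phy_counters(SAMPLE)):
    print(f"{device} phy {phy} is at or above {THRESHOLD} errors")
```

Keeping the adapter and expander entries in one dictionary makes it easy to 
put both ends of a link side by side, which is where the asymmetry in this 
dump shows up.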

*
* Right Card - Right Cable - Round Robin (Adpt #2 in lsiutil)
*
Adapter Phy 0:  Link Up, No Errors

Adapter Phy 1:  Link Up
  Invalid DWord Count   1,276
  Running Disparity Error Count 1,167
  Loss of DWord Synch Count 0
  Phy Reset Problem Count   0

Adapter Phy 2:  Link Up
  Invalid DWord Count   3,779
  Running Disparity Error Count 3,494
  Loss of DWord Synch Count 0
  Phy Reset Problem Count   0

Adapter Phy 3:  Link Up
  Invalid DWord Count   3,477
  Running Disparity Error Count 2,964
  Loss of DWord Synch Count 0
  Phy Reset Problem Count   0

Adapter Phy 4:  Link Down, No Errors

Adapter Phy 5:  Link Down, No Errors

Adapter Phy 6:  Link Down, No Errors

Adapter Phy 7:  Link Down, No Errors

Expander (Handle 0009) Phy 0:  Link Up, No Errors

Expander (Handle 0009) Phy 1:  Link Up, No Errors

Expander (Handle 0009) Phy 2:  Link Up, No Errors

Expander (Handle 0009) Phy 3:  Link Up, No Errors

Expander (Handle 0009) Phy 4:  Link Up, No Errors

Expander (Handle 0009) Phy 5:  Link Up, No Errors

Expander (Handle 0009) Phy 6:  Link Up, No Errors

Expander (Handle 0009) Phy 7:  Link Up, No Errors

Expander (Handle 0009) Phy 8:  Link Down, No Errors

Expander (Handle 0009) Phy 9:  Link Down, No Errors

Expander (Handle 0009) Phy 10:  Link Down, No Errors

Expander (Handle 0009) Phy 11:  Link Down, No Errors

Expander (Handle 0009) Phy 12:  Link Up
  Invalid DWord Count 687,520
  Running Disparity Error Count   651,781
  Loss of DWord Synch Count 1
  Phy Reset Problem Count   0

Expander (Handle 0009) Phy 13:  Link Up
  Invalid DWord Count 689,145
  Running Disparity Error Count   678,705
  Loss of DWord Synch Count 1
  Phy Reset Problem Count   0

Expander (Handle 0009) Phy 14:  Link Up
  Invalid DWord Count 663,734
  Running Disparity Error Count   622,380
  Loss of DWord Synch Count 1
  Phy Reset Problem Count   0

Expander (Handle 0009) Phy 15:  Link Up
  Invalid DWord Count 645,744
  Running Disparity Error Count   611,468
  Loss of DWord Synch Count 1
  Phy Reset Problem Count   0

Expander (Handle 0009) Phy 16:  Link Down, No Errors

Expander (Handle 0009) Phy 17:  Link Down, No Errors

Expander (Handle 0009) Phy 18:  Link Down, No Errors

Expander (Handle 0009) Phy 19:  Link Down, No Errors

Expander (Handle 0009) Phy 20:  Link Down, No Errors

Expander (Handle 0009) Phy 21:  Link Down, No Errors

Expander (Handle 0009) Phy 22:  Link Up, No Errors

Expander (Handle 0009) Phy 23:  Link Up, No Errors

Expander (Handle 0009) Phy 24:  Link Up, No Errors

Expander (Handle 0009) Phy 25:  Link Up, No Errors

Expander (Handle 0009) Phy 26:  Link Up, No Errors

Expander (Handle 0009) Phy 27:  Link Up, No Errors

Expander (Handle 0009) Phy 28:  Link Up, No Errors

Expander (Handle 0009) Phy 29:  Link Up, No Errors

Expander (Handle 0009) Phy 30:  Link Up, No Errors

Expander (Handle 0009) Phy 31:  Link Up, No Errors

Expander (Handle 0009) Phy 32:  Link Up, No Errors