On Dec 3, 2011, at 9:32 PM, Ryan Wehler wrote:
> On Dec 3, 2011, at 11:18 PM, Richard Elling wrote:
>> On Dec 3, 2011, at 9:02 PM, Ryan Wehler wrote:
>>> On Dec 3, 2011, at 10:31 PM, Richard Elling wrote:
>>>> On Dec 3, 2011, at 7:36 PM, Ryan Wehler wrote:
>>>>> Hi Richard,
>>>>> Thanks for getting back to me.
>>>>> On Dec 3, 2011, at 9:03 PM, Richard Elling wrote:
>>>>>> On Dec 1, 2011, at 5:08 PM, Ryan Wehler wrote:
>>>>>>> During the diagnostics of my SAN failure last week we thought we had 
>>>>>>> seen a backplane failure due to high error counts with 'lsiutil'.  
>>>>>>> However, even with a new backplane and ruling out failed cards (MPXIO 
>>>>>>> or singular) or bad cables I'm still seeing my error count with LSIUTIL 
>>>>>>> increment.  I've got no disks attached to the array right now so I've 
>>>>>>> also ruled those out.
>>>>>> The link error counters are on the receiving side. To see the complete
>>>>>> picture, you need to look at link errors on both ends of each link
>>>>>> (more below…)
>>>>>>> Even with nothing connected but the HBA to the backplane expander, a 
>>>>>>> simple restart of the SAN into an OpenIndiana LiveCD or other 
>>>>>>> distribution (NexentaStor) increments the counter.
>>>>>> A few counters can tick up when the system is reset at boot. These can 
>>>>>> be ignored.
>>>>>> What you are looking for is a consistent increase of the counters under
>>>>>> load. In some cases I have seen millions of errors per minute on a very
>>>>>> unhappy system.
>>>>> But we're talking about 600,000 -> 2,000,000 errors on a simple reset at 
>>>>> boot.  Per my VAR, their 6Gb/s hardware shows significantly fewer (in the 
>>>>> 10s to 100s of errors, not 100s to millions). 
>>>> For high-quality hardware, I see 4 to 8.  If I see > 1,000, then I start 
>>>> replacing hardware.
>>> And how do you define "high quality hardware"?  Obviously these aren't 
>>> crummy SATA adapters and low cost drives.  The chassis and backplane are on 
>>> Nexenta's HSL.  While the cards are not explicitly listed, the underlying 
>>> chip (LSI 1068) is on another card (3081E-R) that is on the HSL.
>> I recently tested an HP DL380 G7 with D2600 and D2700 JBOD chassis. Zero 
>> errors.
> I'm assuming these had some sort of LSI cards in them since that's the 
> primary focus here.  Do you happen to know models and what expander chip was 
> used on the backplane(s)?

LSI 2008 chipset (HP SC08Ge HBA).  Expanders are HP-branded; I'll speculate 
they are LSI SAS2x28.

Note: there is also firmware on the HBAs and expanders. But I do not expect
firmware to change the link error counts. I suspect that is more of a
physical issue.
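
To make the rule of thumb quoted earlier in this thread concrete (ignore the bump at boot, watch for growth under load), here is a minimal sketch in Python. It assumes you have already collected per-phy cumulative error counts into plain dictionaries; the phy names, the snapshot format, and the function names are hypothetical and not something lsiutil produces. Applying the "> 1,000" figure to growth under load, rather than to the raw count, is also an assumption.

```python
from typing import Dict, List

# Hypothetical snapshot format: {phy_name: cumulative_link_error_count}.
# How the counts are gathered is left out on purpose; this is not lsiutil output.
Snapshot = Dict[str, int]

# Figure mentioned in the thread ("> 1,000, then I start replacing hardware").
# Using it as a threshold on growth under load is an assumption of this sketch.
REPLACE_THRESHOLD = 1000


def errors_under_load(baseline: Snapshot, after_load: Snapshot) -> Dict[str, int]:
    """Return per-phy error growth between a post-boot baseline and a later reading."""
    growth = {}
    for phy, count in after_load.items():
        start = baseline.get(phy, 0)
        # If the counter went backwards, assume it was cleared in between
        # and count everything seen since the clear.
        growth[phy] = count - start if count >= start else count
    return growth


def suspect_phys(baseline: Snapshot, after_load: Snapshot,
                 threshold: int = REPLACE_THRESHOLD) -> List[str]:
    """List the phys whose error growth under load exceeds the threshold."""
    growth = errors_under_load(baseline, after_load)
    return sorted(phy for phy, delta in growth.items() if delta > threshold)


# Made-up numbers: phy0 keeps climbing under load, phy1 only ticked up at boot.
baseline = {"phy0": 600_000, "phy1": 4}
after_load = {"phy0": 2_000_000, "phy1": 6}
print(suspect_phys(baseline, after_load))
```

Taking the baseline after boot is what keeps the reset-time bump out of the comparison. To cover both ends of each link, the same comparison would be run on snapshots taken from the HBA side and from the expander side.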

>> Currently, the test process for HSL records any errors, but as long as the
>> root cause can be explained, the devices can pass certification.
> Well... since we can't even come to a reasonable justification on why these 
> errors exist with no "true" indicator of bad hardware, something like this 
> could pass the HSL if the VAR can justify it?  I'm not saying that's what 
> happened. I'm just trying to understand the process.

A certification does not mean that any specific implementation operates without
errors. A failed part, noisy environment, or other influences will affect any
specific implementation.
 -- richard


ZFS and performance consulting
LISA '11, Boston, MA, December 4-9 

zfs-discuss mailing list
