Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2010-06-10 Thread Eugen Leitl
On Thu, Jun 10, 2010 at 04:04:42PM +0300, Pasi Kärkkäinen wrote:

> > Intel X25-M G1 firmware 8820 (80GB MLC)
> > Intel X25-M G2 firmware 02HD (160GB MLC)
> > 
> 
> What problems did you have with the X25-M models?

I'm not the OP, but I've had two X25-M G2s (80 and 160 GByte)
suddenly die on me, out of a sample size of maybe 20.

-- 
Eugen* Leitl leitl http://leitl.org
__
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2010-06-10 Thread Pasi Kärkkäinen
On Thu, Jun 10, 2010 at 05:46:19AM -0700, Peter Eriksson wrote:
> Just a quick followup that the same issue still seems to be there on our 
> X4500s with the latest Solaris 10 with all the latest patches and the 
> following SSD disks:
> 
> Intel X25-M G1 firmware 8820 (80GB MLC)
> Intel X25-M G2 firmware 02HD (160GB MLC)
> 

What problems did you have with the X25-M models?

-- Pasi

> However - things seem to work smoothly with:
> 
> Intel X25-E G1 firmware 8850 (32GB SLC)
> OCZ Vertex 2 firmware 1.00 and 1.02 (100GB MLC)
> 
> I'm currently testing a setup with dual OCZ Vertex 2 100GB SSD units that 
> will be used both as mirrored boot/root (32GB of the 100GB), and then use the 
> rest of those disks as L2ARC cache devices for the big data zpool. And have 
> two mirrored X25-E as slog devices:
> 
> zpool create DATA raidz2 c0t0d0 c0t1d0 c1t0d0 c1t1d0 c2t0d0 c2t1d0 c3t1d0 \
>   raidz2 c4t0d0 c4t1d0 c5t0d0 c5t1d0 c0t2d0 c0t3d0 c3t2d0 \
>   raidz2 c1t2d0 c1t3d0 c2t2d0 c2t3d0 c4t2d0 c4t3d0 c3t3d0 \
>   raidz2 c5t2d0 c5t3d0 c0t4d0 c0t5d0 c1t4d0 c1t5d0 c3t5d0 \
>   raidz2 c2t4d0 c2t5d0 c4t4d0 c4t5d0 c5t4d0 c5t5d0 c3t6d0 \
>   raidz2 c0t6d0 c0t7d0 c1t6d0 c1t7d0 c2t6d0 c2t7d0 c3t7d0 \
>   spare c4t6d0 c5t6d0 \
>   cache c3t0d0s3 c3t4d0s3 \
>   log mirror c4t7d0 c5t7d0
> -- 
> This message posted from opensolaris.org


Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2010-06-10 Thread Peter Eriksson
Just a quick followup that the same issue still seems to be there on our X4500s 
with the latest Solaris 10 with all the latest patches and the following SSD 
disks:

Intel X25-M G1 firmware 8820 (80GB MLC)
Intel X25-M G2 firmware 02HD (160GB MLC)

However - things seem to work smoothly with:

Intel X25-E G1 firmware 8850 (32GB SLC)
OCZ Vertex 2 firmware 1.00 and 1.02 (100GB MLC)

I'm currently testing a setup with dual OCZ Vertex 2 100GB SSD units that will 
be used both as mirrored boot/root (32GB of the 100GB), and then use the rest of 
those disks as L2ARC cache devices for the big data zpool. And have two 
mirrored X25-E as slog devices:

zpool create DATA raidz2 c0t0d0 c0t1d0 c1t0d0 c1t1d0 c2t0d0 c2t1d0 c3t1d0 \
  raidz2 c4t0d0 c4t1d0 c5t0d0 c5t1d0 c0t2d0 c0t3d0 c3t2d0 \
  raidz2 c1t2d0 c1t3d0 c2t2d0 c2t3d0 c4t2d0 c4t3d0 c3t3d0 \
  raidz2 c5t2d0 c5t3d0 c0t4d0 c0t5d0 c1t4d0 c1t5d0 c3t5d0 \
  raidz2 c2t4d0 c2t5d0 c4t4d0 c4t5d0 c5t4d0 c5t5d0 c3t6d0 \
  raidz2 c0t6d0 c0t7d0 c1t6d0 c1t7d0 c2t6d0 c2t7d0 c3t7d0 \
  spare c4t6d0 c5t6d0 \
  cache c3t0d0s3 c3t4d0s3 \
  log mirror c4t7d0 c5t7d0
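
For readers who want to sanity-check a layout like this before running it, here is a minimal Python sketch that tallies the command above. The helper names (parse_layout, controller, disk) are made up for illustration and are not part of any ZFS tooling. It assumes the usual Solaris cXtYdZ[sN] naming, where cX is the controller and a trailing sN is a slice.

```python
from collections import Counter

# The zpool command from the message above, copied verbatim.
ZPOOL_CMD = r"""
zpool create DATA raidz2 c0t0d0 c0t1d0 c1t0d0 c1t1d0 c2t0d0 c2t1d0 c3t1d0 \
  raidz2 c4t0d0 c4t1d0 c5t0d0 c5t1d0 c0t2d0 c0t3d0 c3t2d0 \
  raidz2 c1t2d0 c1t3d0 c2t2d0 c2t3d0 c4t2d0 c4t3d0 c3t3d0 \
  raidz2 c5t2d0 c5t3d0 c0t4d0 c0t5d0 c1t4d0 c1t5d0 c3t5d0 \
  raidz2 c2t4d0 c2t5d0 c4t4d0 c4t5d0 c5t4d0 c5t5d0 c3t6d0 \
  raidz2 c0t6d0 c0t7d0 c1t6d0 c1t7d0 c2t6d0 c2t7d0 c3t7d0 \
  spare c4t6d0 c5t6d0 \
  cache c3t0d0s3 c3t4d0s3 \
  log mirror c4t7d0 c5t7d0
"""


def parse_layout(cmd):
    """Group the devices of a `zpool create` command by vdev role (hypothetical helper)."""
    tokens = cmd.replace("\\", " ").split()  # drop shell line continuations
    if tokens[:2] != ["zpool", "create"]:
        raise ValueError("not a zpool create command")
    layout = {"raidz2": [], "spare": [], "cache": [], "log": []}
    section = None
    for tok in tokens[3:]:  # tokens[2] is the pool name
        if tok == "raidz2":
            layout["raidz2"].append([])  # each raidz2 keyword starts a new vdev
            section = "raidz2"
        elif tok in ("spare", "cache", "log"):
            section = tok
        elif tok == "mirror":
            continue  # only marks the log devices as mirrored
        elif section is None:
            raise ValueError("device %r appears before any vdev keyword" % tok)
        elif section == "raidz2":
            layout["raidz2"][-1].append(tok)
        else:
            layout[section].append(tok)
    return layout


def controller(dev):
    """Controller part of a cXtYdZ[sN] name, e.g. 'c3' for 'c3t0d0s3'."""
    return dev.split("t", 1)[0]


def disk(dev):
    """Strip a trailing slice (sN) so that c3t0d0s3 counts as disk c3t0d0."""
    head, sep, tail = dev.rpartition("s")
    return head if sep and head and tail.isdigit() else dev


layout = parse_layout(ZPOOL_CMD)
all_disks = {disk(d) for vdev in layout["raidz2"] for d in vdev}
all_disks |= {disk(d) for role in ("spare", "cache", "log") for d in layout[role]}

# Largest number of disks any single controller contributes to one raidz2 vdev.
worst = max(
    max(Counter(controller(d) for d in vdev).values()) for vdev in layout["raidz2"]
)

print("raidz2 vdevs:", len(layout["raidz2"]))
print("distinct disks:", len(all_disks))
print("max disks per controller in one vdev:", worst)
```

Run against the command above, this counts six 7-disk raidz2 vdevs and 48 distinct disks in total, and no raidz2 vdev takes more than two disks from any one controller, which is the number of failures raidz2 tolerates.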


Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-09-15 Thread Paul B. Henson
On Tue, 15 Sep 2009, Eric Schrock wrote:

> I don't have the ATA spec in front of me, but that looks like pretty
> normal output to me.  Glad to hear they addressed the issue.

Excellent. I reinstalled it in my test x4500; if no other issues show up, I
can try to get my proposal to install them in production going again.
They make a huge difference for common sysadmin operations such as tarball
extraction or code development scenarios like revision control checkouts.
If I'm lucky, maybe the ability to import a pool with a dead slog will make
it into U8; that was the only other potential snag in my deployment plan,
as I'd only have one SSD in each system.


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768


Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-09-15 Thread Eric Schrock


On Sep 15, 2009, at 8:32 PM, Paul B. Henson wrote:

> I updated to the new X25-E firmware, and I think it might have resolved the
> problem. smartctl under Linux no longer gives a warning, and the diskstat
> check under Solaris no longer appears to have garbage. I attached output
> from smartctl, diskstat, and the dtrace script at the bottom, does it look
> like the firmware is returning valid stuff now?

I don't have the ATA spec in front of me, but that looks like pretty
normal output to me.  Glad to hear they addressed the issue.

- Eric

>> Absolutely.  The SATA code could definitely be cleaned up to bail when
>> processing an invalid record.  I can file a CR for you if you haven't
>> already done so.
>
> I haven't; even if the new firmware does resolve the problem, I like
> robustness :), so it would still be nice in general for the code to be more
> forgiving and perhaps just log a warning.
>
> Thanks...


Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-09-15 Thread Paul B. Henson
On Sun, 13 Sep 2009, Eric Schrock wrote:

> Actually, it's not one byte - the entire page is garbage (as we saw in
> the dtrace output).  But I'm guessing that smartctl (and hardware SATL)
> is aborting on the first invalid record, while we keep going and blindly
> "translate" one form of garbage into another.

I updated to the new X25-E firmware, and I think it might have resolved the
problem. smartctl under Linux no longer gives a warning, and the diskstat
check under Solaris no longer appears to have garbage. I attached output
from smartctl, diskstat, and the dtrace script at the bottom; does it look
like the firmware is returning valid stuff now?
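
One quick check that can be run against the attribute table in the smartctl output further down is to look for raw values that are all one bits. This Python sketch is only an illustration: the list is copied by hand from that table, is_all_ones is a made-up helper, and reading all-ones as "placeholder rather than a real counter" is an assumption, not something the drive documents. smartctl also reports the device as not in its database, so the attribute names are generic labels and may not describe what this SSD actually stores there.

```python
# (id, name, raw_value) copied by hand from the attribute table in this message.
ATTRIBUTES = [
    (3, "Spin_Up_Time", 0),
    (4, "Start_Stop_Count", 0),
    (5, "Reallocated_Sector_Ct", 0),
    (9, "Power_On_Hours", 68),
    (12, "Power_Cycle_Count", 151),
    (192, "Power-Off_Retract_Count", 22),
    (232, "Unknown_Attribute", 0),
    (233, "Unknown_Attribute", 0),
    (225, "Load_Cycle_Count", 50147),
    (226, "Load-in_Time", 4294967295),
    (227, "Torq-amp_Count", 281474976710655),
    (228, "Power-off_Retract_Count", 4294967295),
]


def is_all_ones(raw):
    """True if raw is 0xFFFF, 0xFFFFFFFF or 0xFFFFFFFFFFFF (hypothetical check)."""
    return raw in {(1 << bits) - 1 for bits in (16, 32, 48)}


# Attributes whose raw value is a run of one bits rather than a plausible count.
suspicious = [(attr_id, name) for attr_id, name, raw in ATTRIBUTES if is_all_ones(raw)]

for attr_id, name in suspicious:
    print("attribute %d (%s): raw value is all ones" % (attr_id, name))
```

The three attributes it picks out (226, 227 and 228) are the same ones the table marks In_the_past or FAILING_NOW, so those failure flags may say more about how the fields are filled in than about the health of the drive.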

> Absolutely.  The SATA code could definitely be cleaned up to bail when
> processing an invalid record.  I can file a CR for you if you haven't
> already done so.

I haven't; even if the new firmware does resolve the problem, I like
robustness :), so it would still be nice in general for the code to be more
forgiving and perhaps just log a warning.

Thanks...

--

smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SSDSA2SH032G1GN INTEL
Serial Number:    CVEM902600J6032HGN
Firmware Version: 045C8850
User Capacity:    32,000,000,000 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 1
Local Time is:    Mon Sep 14 18:26:09 2009 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  32) The self-test routine was interrupted
                                        by the host with a hard or soft reset.
Total time to complete Offline
data collection:                 (   1) seconds.
Offline data collection
capabilities:                    (0x75) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (   2) minutes.
Conveyance self-test routine
recommended polling time:        (   1) minutes.

SMART Attributes Data Structure revision number: 5
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x       100   000   000    Old_age   Offline  In_the_past 0
  4 Start_Stop_Count        0x       100   000   000    Old_age   Offline  In_the_past 0
  5 Reallocated_Sector_Ct   0x0002   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0002   100   100   000    Old_age   Always       -       68
 12 Power_Cycle_Count       0x0002   100   100   000    Old_age   Always       -       151
192 Power-Off_Retract_Count 0x0002   100   100   000    Old_age   Always       -       22
232 Unknown_Attribute       0x0003   100   100   010    Pre-fail  Always       -       0
233 Unknown_Attribute       0x0002   099   099   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x       200   200   000    Old_age   Offline      -       50147
226 Load-in_Time            0x0002   255   000   000    Old_age   Always   In_the_past 4294967295
227 Torq-amp_Count          0x0002   000   000   000    Old_age   Always   FAILING_NOW 281474976710655
228 Power-off_Retract_Count 0x0002   000   000   000    Old_age   Always   FAILING_NOW 4294967295

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%        68         -
# 2  Short offline  

Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-09-14 Thread Peter Eriksson
Now tested a firmware 8850 X25-E in one of our X4500s and things look better:

> # /ifm/bin/smartctl -d scsi -l selftest /dev/rdsk/c5t7d0s0
> smartctl version 5.38 [i386-pc-solaris2.10] Copyright (C) 2002-8 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
> 
> No self-tests have been logged

No "scsi" console errors so far.


Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-09-14 Thread Peter Eriksson
I can confirm that on an X4240 with the LSI (mpt) controller:

X25-M G1 with 8820 still returns invalid selftest data
X25-E G1 with 8850 now returns correct selftest data
(I haven't got any X25-M G2)

Going to replace an X25-E with the old firmware in one of our X4500s
soon, and we'll see if things work right there.

I still see heavy write load-induced bus resets with the 8850-firmware X25-Es 
on the X4240 though.
(Unless I wrap the X25-E inside a DiskSuite SVM metadevice for some strange 
reason).


Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-09-13 Thread Paul B. Henson
On Sun, 13 Sep 2009, Mike Gerdts wrote:

> August 11 they released firmware revisions 8820, 8850, and 02G9,
> depending on the drive model.

Ooooh, cool, last time I checked they only had updates for the X25-M.
Thanks for the pointer.


Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-09-13 Thread Eric Schrock


On Sep 12, 2009, at 10:49 PM, Paul B. Henson wrote:

> In any case, I agree with you that the firmware is buggy; however I
> disagree with you as to the outcome of that bug. The drive is not returning
> random garbage, it has *one* byte wrong. Other than that all of the data
> seems ok, at least to my inexpert eyes. smartctl under Linux issues a
> warning about that invalid byte and reports everything else ok. Solaris on
> an x4500 evidently barfs over that invalid byte and returns garbage.

Actually, it's not one byte - the entire page is garbage (as we saw in
the dtrace output).  But I'm guessing that smartctl (and hardware
SATL) is aborting on the first invalid record, while we keep going and
blindly "translate" one form of garbage into another.

> Overall, I think the Linux approach seems more useful. Be strict in what
> you generate, and lenient in what you accept ;), or something like that. As
> I already said, it would be really really nice if the Solaris driver could
> be fixed to be a little more forgiving and deal better with the drive, but
> I've got no expectation that it should be done. But it could be :).

Absolutely.  The SATA code could definitely be cleaned up to bail when
processing an invalid record.  I can file a CR for you if you haven't
already done so.  Also, I'd encourage any developers out there with
one of these drives to take a shot at fixing the issue via the
OpenSolaris sponsor process.

- Eric

--
Eric Schrock, Fishworks    http://blogs.sun.com/eschrock





Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-09-13 Thread Eric Schrock


On Sep 12, 2009, at 11:14 PM, Paul B. Henson wrote:

> On Sat, 12 Sep 2009, Paul B. Henson wrote:
>
> On another note, my understanding is that the official Sun sold
> and supported SSD for the x4540 is basically just an OEM'd Intel X25-E. Did
> Sun install their own fixed firmware on their version of that drive, or
> does it have the same buggy firmware as the street version? It would be
> funny if you guys were shipping a drive with buggy firmware that just
> happens to work because the x4540 hardware doesn't trip over the one
> invalid byte :)...

The X4540 uses SAS, not SATA.  So the translation via SATL is done in
hardware, not software.

- Eric

--
Eric Schrock, Fishworks    http://blogs.sun.com/eschrock





Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-09-13 Thread Mike Gerdts
On Sun, Sep 13, 2009 at 1:14 AM, Paul B. Henson  wrote:
> On Sat, 12 Sep 2009, Paul B. Henson wrote:
>
>> In any case, I agree with you that the firmware is buggy; however I
>> disagree with you as to the outcome of that bug. The drive is not
>> returning random garbage, it has *one* byte wrong. Other than that all of
>> the data seems ok, at least to my inexpert eyes. smartctl under Linux
>> issues a warning about that invalid byte and reports everything else ok.
>> Solaris on an x4500 evidently barfs over that invalid byte and returns
>> garbage.
>
> On another note, my understanding is that the official Sun sold
> and supported SSD for the x4540 is basically just an OEM'd Intel X25-E. Did
> Sun install their own fixed firmware on their version of that drive, or
> does it have the same buggy firmware as the street version? It would be
> funny if you guys were shipping a drive with buggy firmware that just
> happens to work because the x4540 hardware doesn't trip over the one
> invalid byte :)...

Perhaps some of their fixes have made it upstream.  Your message at
http://mail.opensolaris.org/pipermail/fm-discuss/2009-June/000436.html
from June 10 suggests you are running firmware release (045C)8626.  On
August 11 they released firmware revisions 8820, 8850, and 02G9,
depending on the drive model.

http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&ProdId=3043&DwnldID=17485&lang=eng

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-09-12 Thread Paul B. Henson
On Sat, 12 Sep 2009, Paul B. Henson wrote:

> In any case, I agree with you that the firmware is buggy; however I
> disagree with you as to the outcome of that bug. The drive is not
> returning random garbage, it has *one* byte wrong. Other than that all of
> the data seems ok, at least to my inexpert eyes. smartctl under Linux
> issues a warning about that invalid byte and reports everything else ok.
> Solaris on an x4500 evidently barfs over that invalid byte and returns
> garbage.

On another note, my understanding is that the official Sun sold
and supported SSD for the x4540 is basically just an OEM'd Intel X25-E. Did
Sun install their own fixed firmware on their version of that drive, or
does it have the same buggy firmware as the street version? It would be
funny if you guys were shipping a drive with buggy firmware that just
happens to work because the x4540 hardware doesn't trip over the one
invalid byte :)...


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768


Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-09-12 Thread Paul B. Henson
On Sat, 12 Sep 2009, Eric Schrock wrote:

> Also, were you ever able to get this disk behind a SAS transport (X4540,
> J4400, J4500, etc)?  It would be interesting to see how hardware SATL
> deals with this invalid data.  Output from 'smartctl -d sat' and
> 'smartctl -d scsi' on such a system would show both the ATA data and the
> translated SCSI data.  My guess is that it just gives up at the first
> invalid version record, something we should probably be doing.

Phil Steinbachs gave you some data from an X25-E in a J4400 attached to an
X4240 via an LSI 1068E based HBA, as well as one in one of the X4240's SAS
slots connected to the internal Adaptec RAID controller:

http://mail.opensolaris.org/pipermail/fm-discuss/2009-June/000432.html

and:

http://mail.opensolaris.org/pipermail/fm-discuss/2009-June/000435.html

Your last email on the subject was:

http://mail.opensolaris.org/pipermail/fm-discuss/2009-June/000447.html

in which you said:

"The primary thing is that this drive is completely busted - it's reporting
totally invalid data in response to the ATA READ EXT LOG command for log
0x07 (Extended SMART self-test log).  The spec defines that byte 0 must be
0x1 and that byte 1 is reserved."

Phil might still be in a position to run smartctl on the drives if you're
still interested in the data.

I guess this is why you're now saying the drive is returning invalid data,
I had forgotten the details, that was almost three months ago.

In any case, I agree with you that the firmware is buggy; however I
disagree with you as to the outcome of that bug. The drive is not returning
random garbage, it has *one* byte wrong. Other than that all of the data
seems ok, at least to my inexpert eyes. smartctl under Linux issues a
warning about that invalid byte and reports everything else ok. Solaris on
an x4500 evidently barfs over that invalid byte and returns garbage.

Overall, I think the Linux approach seems more useful. Be strict in what
you generate, and lenient in what you accept ;), or something like that. As
I already said, it would be really really nice if the Solaris driver could
be fixed to be a little more forgiving and deal better with the drive, but
I've got no expectation that it should be done. But it could be :).
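
To make the "check the header, warn, and stop" idea concrete, here is a minimal Python sketch. It is not the Solaris SATA/SATL code; the function names are invented for illustration. The only rule it relies on is the one quoted above, that byte 0 of the Extended SMART self-test log must be 0x1 and byte 1 is reserved, plus the 512-byte size of an ATA log page.

```python
import logging

LOG_PAGE_SIZE = 512       # ATA log pages are 512 bytes
EXPECTED_REVISION = 0x01  # per the spec text quoted above: byte 0 must be 0x1


def check_selftest_log_page(page):
    """Return (ok, reason); ok is False when the page should not be translated."""
    if len(page) != LOG_PAGE_SIZE:
        return False, "unexpected page length %d" % len(page)
    if page[0] != EXPECTED_REVISION:
        return False, "revision byte is 0x%02x, expected 0x%02x" % (
            page[0],
            EXPECTED_REVISION,
        )
    # Byte 1 is reserved, so it is deliberately not checked here.
    return True, "header looks valid"


def translate_or_skip(page, translate):
    """Call translate(page) only if the header is sane; otherwise warn and return None."""
    ok, reason = check_selftest_log_page(page)
    if not ok:
        logging.warning("ignoring self-test log: %s", reason)
        return None
    return translate(page)
```

With a check like this in front of the translation step, a page with a bad revision byte produces a warning and "no data" rather than a self-test failure built from whatever bytes follow.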

Thanks again for your help. I apologize if I've been a bit antagonistic, I
tend to go "dog with a bone" when I start debating something.


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768


Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-09-12 Thread Paul B. Henson
On Sat, 12 Sep 2009, Eric Schrock wrote:

> Your statement that it is "just fine" is false:

I didn't say it worked "perfectly", I said it worked "fine". Yes, it gave a
*warning* that the "SMART Selective Self-Test Log Data Structure Revision
Number" was 0 instead of 1, **however** other than that warning the data
smartctl returned from the drive appeared correct.

Results from the virgin drive:

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


Results after manually initiating self tests:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        68         -
# 2  Short offline       Completed without error       00%        68         -


The exact same drive in the x4500 running your test program to check
self-test results:

self-test-failure = (embedded nvlist)
nvlist version: 0
result-code = 0x4
timestamp = 0x48a5
segment = 0x0
address = 0xa548a548a548
(end self-test-failure)


There's definitely invalid data all right, but it's **not** originating
from the drive.

For that matter, the warning is about the "SMART Selective Self-Test Log
Data Structure Revision Number", not the "SMART Self-test log structure
revision number" -- which is correctly version 1.

> Like I said, there are ways we could tighten up the FMA code to better
> handle bad data before going off the rails - most likely smartctl gives
> up when it sees this invalid record, while we (via SATL) keep going.
> But any way you slice it, the drive is returning invalid data.

The drive is not returning invalid data in a Linux box running smartctl.
Other than a *warning* about the wrong revision of a data structure for a
different self test, the drive seems to work just fine.

I really appreciated the help you provided with figuring out what was going
on with this drive in an x4500 under Solaris. I understand there's no
obligation on anybody's part to make this unsupported drive work. However,
given it does work correctly (at least in regards to returning smart
self-test logs) under Linux, I don't see why it could not work correctly
under Solaris. If it doesn't get fixed, it doesn't get fixed, but I don't
understand why you're saying the drive is returning invalid data when the
evidence does not support that conclusion.


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768


Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-09-12 Thread Carson Gaspar

Carson Gaspar wrote:
> Except you replied to me, not to the person who has SSDs. I have dead
> standard hard disks, and the mpt driver is just not happy. After
> applying 141737-04 to my Sol 10 system, things improved greatly, and
> the constant bus resets went away. After upgrading to OpenSolaris 6/09
> things went back to being crappy. Updating to b118 did not help.


And for the curious, here are one week of uniq'd log messages I receive when I'm 
having problems:


Log info 0x31110b00 received for target 1.
Log info 0x3113 received for target 0.
Log info 0x3113 received for target 1.
Log info 0x3114 received for target 0.
Log info 0x3114 received for target 1.
Log info 0x3114 received for target 3.
mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31110b00
mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31110b00

All disks are identical. An example iostat -nE output (note the 93 transport 
errors...):


c7t1d0   Soft Errors: 0 Hard Errors: 6 Transport Errors: 93
Vendor: ATA  Product: HDS725050KLA360  Revision: A10C Serial No:
Size: 500.11GB <500107861504 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 6 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
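
If you want to track these counters over time rather than eyeball them, a small parser is enough. This Python sketch assumes the summary line has the shape shown above (device name, then Soft, Hard and Transport error counts); the function name and the sample constant are only for illustration.

```python
import re

# Summary line of one device in `iostat -nE` output, as pasted above.
SAMPLE = "c7t1d0   Soft Errors: 0 Hard Errors: 6 Transport Errors: 93"

ERROR_LINE = re.compile(
    r"^(?P<device>\S+)\s+"
    r"Soft Errors:\s*(?P<soft>\d+)\s+"
    r"Hard Errors:\s*(?P<hard>\d+)\s+"
    r"Transport Errors:\s*(?P<transport>\d+)"
)


def parse_error_counts(text):
    """Map device name -> dict of soft/hard/transport error counts."""
    counts = {}
    for line in text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            counts[match.group("device")] = {
                key: int(match.group(key)) for key in ("soft", "hard", "transport")
            }
    return counts


for device, errs in parse_error_counts(SAMPLE).items():
    if errs["transport"]:
        print("%s: %d transport errors" % (device, errs["transport"]))
```

Lines that do not match the pattern (vendor, size, media error details) are simply skipped, so the whole iostat output can be fed in unchanged.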

--
Carson


Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-09-12 Thread Carson Gaspar

James C. McPherson wrote:
> On Thu, 10 Sep 2009 12:31:11 -0700
> Carson Gaspar wrote:
>
>> Alex Li wrote:
>>> We finally resolved this issue by changing the LSI driver. For details,
>>> please refer to here
>>> http://enginesmith.wordpress.com/2009/08/28/ssd-faults-finally-resolved/
>>
>> Anyone from Sun have any knowledge of when the open source mpt driver will
>> be less broken? Things improved greatly for me re: bus resets with a recent
>> Sol 10 patch, but after my upgrade to OpenSolaris, they're back with a
>> vengeance. An update to b118 didn't improve things, and I dare not go to
>> anything more recent until the ZFS bug fixes hit the dev repo.
>
> From reading your blog post, it appears that mpt and fma were
> trying really hard to tell you that your SSD was misbehaving,
> and therefore you should do something about it. Turning _off_
> disk fma and then totally replacing the driver with one that
> doesn't support fma were definitely not the recommended actions!
>
> Given the rest of this thread, I'm really keen to see (as somebody
> who works on mpt(7d)) how your system behaves with fixed SSD
> firmware, using mpt(7d) and with disk fma turned on again.
>
> After that, let's talk about "broken" drivers.

Except you replied to me, not to the person who has SSDs. I have dead standard
hard disks, and the mpt driver is just not happy. After applying 141737-04 to my
Sol 10 system, things improved greatly, and the constant bus resets went away.
After upgrading to OpenSolaris 6/09 things went back to being crappy. Updating
to b118 did not help.


--
Carson


Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-09-12 Thread James C. McPherson
On Thu, 10 Sep 2009 12:31:11 -0700
Carson Gaspar  wrote:

> Alex Li wrote:
> > We finally resolved this issue by changing the LSI driver. For details, please
> > refer to here
> > http://enginesmith.wordpress.com/2009/08/28/ssd-faults-finally-resolved/
> 
> Anyone from Sun have any knowledge of when the open source mpt driver will be 
> less broken? Things improved greatly for me re: bus resets with a recent Sol 
> 10 
> patch, but after my upgrade to OpenSolaris, they're back with a vengeance. An 
> update to b118 didn't improve things, and I dare not go to anything more 
> recent 
> until the ZFS bug fixes hit the dev repo.


From reading your blog post, it appears that mpt and fma were
trying really hard to tell you that your SSD was misbehaving,
and therefore you should do something about it. Turning _off_
disk fma and then totally replacing the driver with one that
doesn't support fma were definitely not the recommended actions!

Given the rest of this thread, I'm really keen to see (as somebody
who works on mpt(7d)) how your system behaves with fixed SSD
firmware, using mpt(7d) and with disk fma turned on again.

After that, let's talk about "broken" drivers. 


James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog


Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-09-12 Thread Eric Schrock
Also, were you ever able to get this disk behind a SAS transport  
(X4540, J4400, J4500, etc)?  It would be interesting to see how  
hardware SATL deals with this invalid data.  Output from 'smartctl -d  
sat' and 'smartctl -d scsi' on such a system would show both the ATA  
data and the translated SCSI data.  My guess is that it just gives up  
at the first invalid version record, something we should probably be  
doing.


- Eric

--
Eric Schrock, Fishworks    http://blogs.sun.com/eschrock


Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-09-12 Thread Eric Schrock


On Sep 12, 2009, at 12:00 AM, Paul B. Henson wrote:


Well, I won't claim the drive firmware is completely innocent, but as
evidenced in

	http://mail.opensolaris.org/pipermail/fm-discuss/2009-June/ 
000436.html


smartctl on a Linux box seems to work just fine. The exact same  
model drive
also works just fine in an x4540. So I think the assertion that the  
drive

returns random data is demonstrably false.


Your statement that it is "just fine" is false:

---
SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
---

Like I said, there are ways we could tighten up the FMA code to better  
handle bad data before going off the rails - most likely smartctl  
gives up when it sees this invalid record, while we (via SATL) keep  
going.  But any way you slice it, the drive is returning invalid data.
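
To illustrate the "give up at the first invalid record" idea, here is a rough sketch of validating a selective self-test log before trusting anything else in it. This is not the FMA or smartctl code. The layout is an assumption based on the ATA log format as I understand it: a 512-byte log with a 16-bit little-endian revision number at offset 0, five 16-byte (start LBA, end LBA) spans starting at offset 2, and a final checksum byte that makes all bytes sum to zero modulo 256. The function names are made up for the example.

```python
import struct

LOG_SIZE = 512          # one log sector (assumed layout, see note above)
EXPECTED_REVISION = 1   # the value smartctl's warning says the spec requires


def check_selective_log(buf):
    """Return a list of problems; an empty list means the log looks sane."""
    if len(buf) != LOG_SIZE:
        return ["expected %d bytes, got %d" % (LOG_SIZE, len(buf))]
    problems = []
    (revision,) = struct.unpack_from("<H", buf, 0)
    if revision != EXPECTED_REVISION:
        problems.append(
            "revision number %d, expected %d" % (revision, EXPECTED_REVISION)
        )
    if sum(buf) % 256 != 0:
        problems.append("checksum mismatch")
    return problems


def parse_spans(buf):
    """Parse the five (start LBA, end LBA) spans only if the log validates."""
    problems = check_selective_log(buf)
    if problems:
        # Stop here instead of interpreting the rest of a bad log.
        raise ValueError("; ".join(problems))
    return [struct.unpack_from("<QQ", buf, 2 + 16 * i) for i in range(5)]
```

With a check like this, a log reporting revision 0 would be rejected up front rather than having its remaining fields interpreted as a possible failure indication.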


- Eric

--
Eric Schrock, Fishworks                        http://blogs.sun.com/eschrock





Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-09-12 Thread Paul B. Henson
On Fri, 11 Sep 2009, Eric Schrock wrote:

> It's clearly bad firmware - there's no bug in the sata driver.  That
> drive basically returns random data, and if you're unlucky that
> randomness will look like a valid failure response.  In the process I
> found one or two things that could be tightened up with the FMA analysis,
> but when your drive is returning random log data it's impossible to
> actually fix the problem in software.

Well, I won't claim the drive firmware is completely innocent, but as
evidenced in

http://mail.opensolaris.org/pipermail/fm-discuss/2009-June/000436.html

smartctl on a Linux box seems to work just fine. The exact same model drive
also works just fine in an x4540. So I think the assertion that the drive
returns random data is demonstrably false. There's something about the SSD
in an x4500 that just doesn't play nice -- it might be partially the drive
firmware, it might be the SAS controller, it might be something else -- but
it's *not* simply random data being returned from the drive.

It would be really appreciated if that problem could be tracked down so the
drive works as well SMART-wise in an x4500 as it does in a Linux box or an
x4540, but I understand Sun does not certify the x4500 with SSD's so
there's no expectation that would happen. But it would be really really
appreciated :)...


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768


Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-09-11 Thread Eric Schrock


On Sep 11, 2009, at 8:48 PM, Paul B. Henson wrote:


> x4500's have Marvell SATA controllers, not LSI. My issue with Intel SSD's
> being marked faulty in X4500's has yet to be resolved. The last time I
> rebooted it fm started marking the SSD failed again due to invalid
> self-check log data. I had some correspondence with Eric Schrock who
> indicated it looked like a combination of buggy Intel firmware and a bug in
> the Solaris SATL driver, but haven't heard back from him as to whether they
> might fix it.


It's clearly bad firmware - there's no bug in the sata driver.  That  
drive basically returns random data, and if you're unlucky that  
randomness will look like a valid failure response.  In the process I  
found one or two things that could be tightened up with the FMA  
analysis, but when your drive is returning random log data it's  
impossible to actually fix the problem in software.


- Eric

--
Eric Schrock, Fishworks                        http://blogs.sun.com/eschrock





Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-09-11 Thread Paul B. Henson
On Thu, 10 Sep 2009, Alex Li wrote:

> We finally resolved this issue by changing the LSI driver. For details, please
> refer to
> http://enginesmith.wordpress.com/2009/08/28/ssd-faults-finally-resolved/

I believe you hijacked my thread ;).

x4500's have Marvell SATA controllers, not LSI. My issue with Intel SSD's
being marked faulty in X4500's has yet to be resolved. The last time I
rebooted it fm started marking the SSD failed again due to invalid
self-check log data. I had some correspondence with Eric Schrock who
indicated it looked like a combination of buggy Intel firmware and a bug in
the Solaris SATL driver, but haven't heard back from him as to whether they
might fix it.


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768


Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-09-10 Thread Carson Gaspar

Alex Li wrote:

> We finally resolved this issue by changing the LSI driver. For details, please
> refer to
> http://enginesmith.wordpress.com/2009/08/28/ssd-faults-finally-resolved/


Anyone from Sun have any knowledge of when the open source mpt driver will be 
less broken? Things improved greatly for me re: bus resets with a recent Sol 10 
patch, but after my upgrade to OpenSolaris, they're back with a vengeance. An 
update to b118 didn't improve things, and I dare not go to anything more recent 
until the ZFS bug fixes hit the dev repo.


--
Carson


Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-09-10 Thread Alex Li
We finally resolved this issue by changing the LSI driver. For details, please see
http://enginesmith.wordpress.com/2009/08/28/ssd-faults-finally-resolved/
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] Intel X25-E SSD in x4500 followup

2009-07-30 Thread Alex Li
We saw many SAS controller resets and SSD errors on our servers
(OpenSolaris 2008.05 and 2009.06 with a third-party JBOD and X25-E drives).
Whenever an error occurred, a MySQL insert took more than 4 seconds. It was
quite scary.

Eventually our engineer disabled the Fault Management SMART polling, and that
seems to be working.
-- 
This message posted from opensolaris.org