Re: [zfs-discuss] What is your data error rate?

2012-01-25 Thread Anonymous Remailer (austria)

I've been watching the heat control issue carefully since I had to take a
job offshore (cough reverse H1B cough) in a place without adequate AC, and I
was able to get them to ship my servers and some other gear. Then I read that
Intel is guaranteeing their servers will work at up to 100 degrees F ambient
temperature as part of the pricing wars to sell servers: whoever goes green and
saves on data center cooling will win big, since everyone now realizes AC costs
more than hardware for server farms. And this is not only for new, special
heat-tolerant gear; I heard they will put this in writing even for their older
units. From that I would conclude that commercial server gear, at least, can
take a lot more abuse than it usually gets without components failing often
enough to matter, because if they did, Intel could not afford to make this
guarantee. YMMV of course. I still feel nervous running equipment in this kind
of environment, but after 3 years of doing it, including with commodity
desktops, I haven't seen any abnormal failures.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What is your data error rate?

2012-01-25 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Stefan Ring
 
 I’ve been running a raidz1 on three 1TB consumer disks for
 approx. 2 years now (about 90% full), and I scrub the pool every 3-4
 weeks and have never had a single error. 

Well...  You're probably not 100% active 100% of the time...
And...
Assuming the failure rate of drives is not linear, but skewed toward a higher 
failure rate after some period of time (say, 3 years), then you're more likely to 
experience no errors for the first year or two, and more likely to 
experience multiple simultaneous failures after 3 years or so.
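
Here's a minimal sketch (Python, with made-up annual failure rates purely for
illustration, not measured data) of how a higher per-drive failure rate late in
life inflates the odds of losing two drives of a 3-disk raidz1 in the same year:

# Hypothetical sketch: how a rising per-drive annual failure rate (AFR)
# changes the odds of losing 2+ drives of a 3-disk raidz1 within one year.
# The AFR values below are illustrative assumptions, not measured data.
from math import comb

def p_at_least_two_failures(afr, n_drives=3):
    """Probability that 2 or more of n_drives fail within one year,
    assuming independent failures with the same annual failure rate."""
    return sum(comb(n_drives, k) * afr**k * (1 - afr)**(n_drives - k)
               for k in range(2, n_drives + 1))

for year, afr in [(1, 0.02), (2, 0.02), (3, 0.06), (4, 0.08)]:
    print(f"year {year}: AFR {afr:.0%} -> "
          f"P(>=2 drives fail) = {p_at_least_two_failures(afr):.2%}")

With these assumed numbers, the chance of a double failure in a single year
rises from roughly 0.1% to almost 2% once the drives pass the knee of the curve.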

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What is your data error rate?

2012-01-25 Thread John Martin

On 01/25/12 09:08, Edward Ned Harvey wrote:


Assuming the failure rate of drives is not linear, but skewed toward a higher 
failure rate after some period of time (say, 3 years) ...


See section 3.1 of the Google study:

  http://research.google.com/archive/disk_failures.pdf

although section 4.2 of the Carnegie Mellon study
is much more supportive of the assumption.

  http://www.usenix.org/events/fast07/tech/schroeder/schroeder.pdf
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What is your data error rate?

2012-01-25 Thread Bob Friesenhahn

On Wed, 25 Jan 2012, Anonymous Remailer (austria) wrote:



I've been watching the heat control issue carefully since I had to take a
job offshore (cough reverse H1B cough) in a place without adequate AC and I
was able to get them to ship my servers and some other gear. Then I read
Intel is guaranteeing their servers will work up to 100 degrees F ambient
temps in the pricing wars to sell servers, he who goes green and saves data


Most servers seem to be specified to run at up to 95 degrees F, with some 
particularly dense ones specified to handle only 90.  Network 
switching gear is usually specified to handle 105.


My own equipment typically experiences up to 83 degrees during the 
peak of summer (but quite a lot more if the AC fails).


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What is your data error rate?

2012-01-25 Thread Paul Kraus
On Tue, Jan 24, 2012 at 10:50 AM, Stefan Ring stefan...@gmail.com wrote:

 After having read this mailing list for a little while, I get the
 impression that there are at least some people who regularly
 experience on-disk corruption that ZFS should be able to report and
 handle. I’ve been running a raidz1 on three 1TB consumer disks for
 approx. 2 years now (about 90% full), and I scrub the pool every 3-4
 weeks and have never had a single error. From the oft-quoted 10^14
 error rate that consumer disks are rated at, I should have seen an
 error by now -- the scrubbing process is not the only activity on the
 disks, after all, and the data transfer volume from that alone clocks
 in at almost exactly 10^14 by now.

The 10^-14 (or 10^-15 or 10^-16) number is a statistical average.
So in a big enough pool of drives, for every drive that moves more
than 10^14 bits with no uncorrectable errors, there will be a drive
that moves less than 10^14 bits before hitting an uncorrectable
error. The three 1 TB consumer drives you have must have been
manufactured on a good day and not a bad day :-)

Note the error rate is 10^-14 (or 10^-15 or 10^-16), which
translates into one error per 10^14 bits transferred to /
from the drive. Note the sign change on the exponent :-)
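
As a rough sanity check (a sketch only, assuming errors are independent and
arrive at exactly the quoted rate of one per 10^14 bits, which real drives do
not promise), you can treat the URE count as Poisson-distributed:

# Rough sketch only: assumes independent errors at exactly the quoted rate
# (1 unrecoverable error per 10^14 bits read), modeled as a Poisson process.
from math import exp

bits_read = 1e14   # roughly the volume Stefan estimates he has transferred
rate = 1e-14       # unrecoverable errors per bit (consumer-drive spec)

expected_errors = bits_read * rate     # = 1.0 expected URE
p_no_errors = exp(-expected_errors)    # probability of seeing none: ~37%

print(f"expected UREs: {expected_errors:.1f}")
print(f"chance of zero errors: {p_no_errors:.0%}")

So even after transferring exactly 10^14 bits the expected number of errors is
only 1, and there is still roughly a one-in-three chance of having seen none at
all. A clean scrub history after that much I/O is well within the statistics.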

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
- Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
- Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
- Technical Advisor, Troy Civic Theatre Company
- Technical Advisor, RPI Players
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] What is your data error rate?

2012-01-24 Thread Stefan Ring
After having read this mailing list for a little while, I get the
impression that there are at least some people who regularly
experience on-disk corruption that ZFS should be able to report and
handle. I’ve been running a raidz1 on three 1TB consumer disks for
approx. 2 years now (about 90% full), and I scrub the pool every 3-4
weeks and have never had a single error. From the oft-quoted 10^14
error rate that consumer disks are rated at, I should have seen an
error by now -- the scrubbing process is not the only activity on the
disks, after all, and the data transfer volume from that alone clocks
in at almost exactly 10^14 by now.

Not that I’m worried, of course, but it comes as a slight surprise to
me. Or does the 10^14 rating just reflect the strength of the on-disk
ECC algorithm?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What is your data error rate?

2012-01-24 Thread Jim Klimov

2012-01-24 19:50, Stefan Ring wrote:

After having read this mailing list for a little while, I get the
impression that there are at least some people who regularly
experience on-disk corruption that ZFS should be able to report and
handle. I’ve been running a raidz1 on three 1TB consumer disks for
approx. 2 years now (about 90% full), and I scrub the pool every 3-4
weeks and have never had a single error. From the oft-quoted 10^14
error rate that consumer disks are rated at, I should have seen an
error by now -- the scrubbing process is not the only activity on the
disks, after all, and the data transfer volume from that alone clocks
in at almost exactly 10^14 by now.

Not that I’m worried, of course, but it comes as a slight surprise to
me. Or does the 10^14 rating just reflect the strength of the on-disk
ECC algorithm?


I maintained several dozen storage servers for about
12 years, and I've seen quite a few drive deaths as
well as automatically triggered RAID array rebuilds.
But usually these were infant deaths in the first
year, and the drives that passed the age test often
gave no noticeable problems for the next decade.
Several 2-4 disk systems have worked as OpenSolaris SXCE
servers with ZFS pools for root and data for years
now, and also show no problems. However, most of
these are branded systems and disks from Sun.
I think we've only had one or two drives die, but
we happened to have cold spares due to over-ordering ;)

I do have a suspiciously high error rate on my home NAS,
which was thrown together from whatever pieces I had
at home at the time I left for an overseas trip. The
box has been nearly unmaintained since then, and can suffer
from physical problems known and unknown, such as the
SATA cabling (varied and quite possibly bad), non-ECC
memory, dust and overheating, etc.

It is also possible that aging components such as the
CPU and motherboard, which have seen about 5 years of
active lifetime (including an overclocked past),
contribute to the error rates.

The old 80 GB root drive has had some bad sectors (READ
errors during scrubs and data access), and rpool has been
recreated with copies=2 a few times now, thanks to a LiveUSB,
but the main data pool had no substantial errors until
the CKSUM errors reported this winter (metadata:0x0 and
then a dozen in-file checksum mismatches). Since
one of the drives got itself lost soon after, and only
reappeared after all the cables were replugged, I still
tend to blame this on SATA cabling as the most probable
root cause.

I do not have an up-to-date SMART error report, and
the box is not accessible at the moment, so I can't
comment on lower-level errors in the main pool drives.
They were new at the time I put the box together (almost
a year ago now).

However, so far I am bothered much more by the tendency
of this box to lock up and/or reboot after somewhat
repeatable actions (such as destroying large snapshots
of deduped datasets, etc.) than by the discovered on-disk
CKSUM errors (however they appeared). I tend to write
this off as shortcomings of the OS (i.e. memory hunger
and lockups in scan-rate hell as the most frequent cause),
and this really bothers me more now, since it causes lots
of downtime until some friend comes to that apartment to
reboot the box.

 Or does the 10^14 rating just reflect the strength
 of the on-disk ECC algorithm?

I am not sure how much the algorithms differ between
enterprise and consumer disks, while the UBER is
said to differ by about a factor of 100. It might also
have to do with the quality of materials (better steel
in ball bearings, etc.), as well as better firmware and
processors that optimize mechanical workloads and reduce
mechanical wear. Maybe so, at least...

Finally, this is statistics. It does not guarantee
that for some 90 Tbits of transferred data you will
certainly see an error (and just one, for that matter).
Those drives which died young hopefully also count
in the overall stats, moving the bar a bit higher
for their better-made brethren.

Also, a disk's UBER covers media failures and the ability
of the disk's cache, firmware and ECC to deal with them.
After the disk sends a correct sector onto the wire,
many things can still happen: noise in bad connectors,
electromagnetic interference from all the motors in
your computer onto the data cable, the ability (or lack
thereof) of the data protocol (IDE, ATA, SCSI) to
detect and/or recover from random bits injected
between disk and HBA, errors in HBA chips and code,
noise in old rusty PCI* connector slots, bit flips in
non-ECC RAM or overheated CPUs, power surges from the PSU...
There is a lot of stuff that can break :)

//Jim Klimov
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What is your data error rate?

2012-01-24 Thread Bob Friesenhahn

On Tue, 24 Jan 2012, Jim Klimov wrote:



Or does the 10^14 rating just reflect the strength
of the on-disk ECC algorithm?


I am not sure how much the algorithms differ between
enterprise and consumer disks, while the UBER is
said to differ by about a factor of 100. It might also
have to do with the quality of materials (better steel
in ball bearings, etc.), as well as better firmware and
processors that optimize mechanical workloads and reduce
mechanical wear. Maybe so, at least...


In addition to the above, an important factor is that enterprise disks 
with 10^16 ratings also offer considerably less storage density. 
Instead of 3TB storage per drive, you get 400GB storage per drive.


So-called nearline enterprise storage drives fit somewhere in the 
middle, with higher storage densities but also higher error rates.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What is your data error rate?

2012-01-24 Thread Gregg Wonderly
What I've noticed is that when I have my drives in a situation of limited 
airflow, and hence hotter operating temperatures, my disks will drop quite 
quickly.  I've now moved my systems into large cases with large amounts of 
airflow, using the IcyDock brand of removable drive enclosures.


http://www.newegg.com/Product/Product.aspx?Item=N82E16817994097
http://www.newegg.com/Product/Product.aspx?Item=N82E16817994113

I use the SASUC8I SATA/SAS controller to access 8 drives.

http://www.newegg.com/Product/Product.aspx?Item=N82E16816117157

I put it in a PCI-e x16 slot on graphics-heavy motherboards, which might have as 
many as 4 PCI-e x16 slots.  I am replacing an old motherboard with this one.


http://www.tigerdirect.com/applications/SearchTools/item-details.asp?EdpNo=1124780

The case that I found to be a good match for my needs is the Raven

http://www.newegg.com/Product/Product.aspx?Item=N82E16811163180

It has enough bays (7) to fit 2x 3-in-2 and 1x 4-in-3 IcyDock enclosures, 
providing 10 drives in hot-swap bays.


I really think that the big issue is that you must move the air.  The drives 
really need to stay cool or else you will see degraded performance and/or data 
loss much more often.


Gregg Wonderly

On 1/24/2012 9:50 AM, Stefan Ring wrote:

After having read this mailing list for a little while, I get the
impression that there are at least some people who regularly
experience on-disk corruption that ZFS should be able to report and
handle. I’ve been running a raidz1 on three 1TB consumer disks for
approx. 2 years now (about 90% full), and I scrub the pool every 3-4
weeks and have never had a single error. From the oft-quoted 10^14
error rate that consumer disks are rated at, I should have seen an
error by now -- the scrubbing process is not the only activity on the
disks, after all, and the data transfer volume from that alone clocks
in at almost exactly 10^14 by now.

Not that I’m worried, of course, but it comes as a slight surprise to
me. Or does the 10^14 rating just reflect the strength of the on-disk
ECC algorithm?


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What is your data error rate?

2012-01-24 Thread John Martin

On 01/24/12 17:06, Gregg Wonderly wrote:

What I've noticed is that when I have my drives in a situation of limited
airflow, and hence hotter operating temperatures, my disks will drop
quite quickly.


While I *believe* the same thing, and thus have over-provisioned
airflow in my cases (for both drives and memory), there
are studies which failed to find a strong correlation between
drive temperature and failure rates:

  http://research.google.com/archive/disk_failures.pdf

  http://www.usenix.org/events/fast07/tech/schroeder.html

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss