Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Casper . Dik

On 05/22/09 21:08, Toby Thain wrote:
 Yes, the important thing is to *detect* them, no system can run reliably
 with bad memory, and that includes any system with ZFS. Doing nutty
 things like calculating the checksum twice does not buy anything of
 value here.

All memory is bad if it doesn't have ECC. There are only varying
degrees of badness. Calculating the checksum twice on its own would
be nutty, as you say, but doing so on a separate copy of the data
might prevent unrecoverable errors after writes to mirrored drives.
You can't detect memory errors if you don't have ECC.

And where exactly do you get the second good copy of the data?

If you copy the data, you've just doubled your chance of using bad memory.
The original copy can be good or bad; the second copy cannot be better
than the first copy.

 But you can
try to mitigate them. Not doing so makes ZFS less reliable than
the memory it is running on. The problem is that ZFS makes any file
with a bad checksum inaccessible, even if one really doesn't care
whether the data has been corrupted. A workaround might be a way to allow
such files to be readable despite the bad checksum...

You can disable the checksums if you don't care.
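
For example - the dataset names here are only placeholders - the checksum
property is per dataset, so it can be relaxed only where you genuinely don't
care and strengthened where you do:

# see what a dataset currently uses (fletcher2 is the default for data)
zfs get checksum tank/scratch

# turn checksums off for that one dataset; affects newly written blocks only
zfs set checksum=off tank/scratch

# or go the other way and use a stronger checksum
zfs set checksum=sha256 tank/important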

But it isn't. Applications aren't dying, compilers are not segfaulting
(it was even possible to compile GCC 4.3.2 with the supplied gcc); gdm
is staying up for weeks at a time... And I wouldn't consider running a
non-trivial database application on a machine without ECC.

One broken bit may not have caused serious damage; most things still work.

 Absolutely, memory diags are essential. And you certainly run them if
 you see unexpected behaviour that has no other obvious cause.

Runs for days, as noted.

Doesn't prove anything.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] nonunique devids with Solaris 10 zfs

2009-05-26 Thread Willi Burmeister
Hi,

I'm trying to get Solaris 10U6 on an old V240 with two new Seagate disks,
using zfs as the root filesystem, but failed with this status:

--
# zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid.  Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: none requested
config:

NAME                     STATE     READ WRITE CKSUM
rpool                    DEGRADED     0     0     0
  mirror                 DEGRADED     0     0     0
    3665986270438154650  FAULTED      0     0     0  was /dev/dsk/c0t0d0s0
    c0t1d0s0             ONLINE       0     0     0

errors: No known data errors
--

I think the reason is nonunique devids on both drives:

--
# zdb -l /dev/dsk/c0t0d0s0 | egrep devid | head -1
devid='id1,s...@n5000/a'

# zdb -l /dev/dsk/c0t1d0s0 | egrep devid | head -1
devid='id1,s...@n5000/a'
--

How is this 'devid' generated, and who makes sure these devids end up
unique? And will Update 7 or OpenSolaris help?

Any help is appreciated.

Willi

P.S. Here is some more information about the drives:

Disk 0:
  SN:   3LM63XBW
  Model:ST3300655LC
  Firmware: 0003
  LOT No:   A-01-0925-3

Disk 1:
  SN:   3LM62RDB
  Model:ST3300655LC
  Firmware: 0003
  LOT No:   A-01-0925-3



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zfs send speed. Was: User quota design discussion..

2009-05-26 Thread Dirk Wriedt

Jorgen,

what is the size of the sending zfs?

I thought replication speed depends on the size of the sending fs too, not only on the size of the
snapshot being sent.


Regards
Dirk


--On Friday, May 22, 2009 19:19:34 +0900 Jorgen Lundman lund...@gmo.jp wrote:


Sorry, yes. It is straight;

# time zfs send zpool1/leroy_c...@speedtest | nc 172.20.12.232 3001
real    19m48.199s

# /var/tmp/nc -l -p 3001 -vvv | time zfs recv -v zpool1/le...@speedtest
received 82.3GB stream in 1195 seconds (70.5MB/sec)


Sending is osol-b114.
Receiver is Solaris 10 10/08

When we tested Solaris 10 10/08 -> Solaris 10 10/08, these were the results:

zfs send | nc | zfs recv                - 1 MB/s
tar -cvf /zpool/leroy | nc | tar -xvf - - 2.5 MB/s
ufsdump | nc | ufsrestore               - 5.0 MB/s

So, none of those solutions was usable with regular Sol 10. Note most of our
volumes are ufs in zvols, but even zfs volumes were slow.

Someone else had mentioned the speed was fixed in an earlier release, but I had not
had a chance to upgrade. Since we wanted to try zfs user-quotas, I finally had the chance.
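
For anyone curious, the user-quota support we wanted to try looks roughly
like this (user and filesystem names are made up):

# per-user quota on a shared filesystem (new around the b114 timeframe)
zfs set userquota@alice=10G zpool1/home

# report per-user space usage against those quotas
zfs userspace zpool1/home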

Lund


Brent Jones wrote:

On Thu, May 21, 2009 at 10:17 PM, Jorgen Lundman lund...@gmo.jp wrote:

To finally close my quest. I tested zfs send in osol-b114 version:

received 82.3GB stream in 1195 seconds (70.5MB/sec)

Yeeaahh!

That makes it completely usable! Just need to change our support contract to
allow us to run b114 and we're set! :)


Thanks,

Lund


Jorgen Lundman wrote:

We finally managed to upgrade the production x4500s to Sol 10 10/08
(unrelated to this) but with the hope that it would also make zfs send
usable.

Exactly how does build 105 translate to Solaris 10 10/08?  My current
speed test has sent 34Gb in 24 hours, which isn't great. Perhaps the next
version of Solaris 10 will have the improvements.




Robert Milkowski wrote:

Hello Jorgen,

If you look at the list archives you will see that it made a huge
difference for some people, including me. Now I'm easily able to
saturate a GbE link while zfs send|recv'ing.


Since build 105 it should be *MUCH* faster.



--
Jorgen Lundman   | lund...@lundman.net
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo| +81 (0)90-5578-8500  (cell)
Japan| +81 (0)3 -3375-1767  (home)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



Can you give any details about your data set, what you piped zfs
send/receive through (SSH?), hardware/network, etc?
I'm envious of your speeds!




--
Dirk Wriedt, dirk.wri...@sun.com, Sun Microsystems GmbH
Systemingenieur Strategic Accounts
Nagelsweg 55, 20097 Hamburg, Germany
Tel.: +49-40-251523-132 Fax: +49-40-251523-425 Mobile: +49 172 848 4166
Never been afraid of chances I been takin' - Joan Jett

Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1, D-85551 
Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Wolf Frenkel
Vorsitzender des Aufsichtsrates: Martin Haering

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zfs send speed. Was: User quota design discussion..

2009-05-26 Thread Jorgen Lundman



So you recommend I also do speed tests on larger volumes? The test data I
had on the b114 server was only 90GB. Previous tests included 500G of ufs
on a zvol, etc. It is just that it will take 4 days to send it to the b114
server to start with ;) (from the Sol 10 servers).
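
Before committing to a 4-day transfer, a rough way to separate source read
speed from the network/receive side (reusing the snapshot and host from the
earlier test):

# how fast can the sender read and serialize the stream by itself?
time zfs send zpool1/leroy_c...@speedtest > /dev/null

# and how fast is the raw network path, with no zfs on either end?
# (run /var/tmp/nc -l -p 3001 > /dev/null on the receiver first)
dd if=/dev/zero bs=1024k count=10240 | nc 172.20.12.232 3001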


Lund

Dirk Wriedt wrote:

Jorgen,

what is the size of the sending zfs?

I thought replication speed depends on the size of the sending fs too,
not only on the size of the snapshot being sent.


Regards
Dirk


--On Friday, May 22, 2009 19:19:34 +0900 Jorgen Lundman
lund...@gmo.jp wrote:



Sorry, yes. It is straight;

# time zfs send zpool1/leroy_c...@speedtest | nc 172.20.12.232 3001
real    19m48.199s

# /var/tmp/nc -l -p 3001 -vvv | time zfs recv -v zpool1/le...@speedtest
received 82.3GB stream in 1195 seconds (70.5MB/sec)


Sending is osol-b114.
Receiver is Solaris 10 10/08

When we tested Solaris 10 10/08 -> Solaris 10 10/08, these were the
results:


zfs send | nc | zfs recv                - 1 MB/s
tar -cvf /zpool/leroy | nc | tar -xvf - - 2.5 MB/s
ufsdump | nc | ufsrestore               - 5.0 MB/s

So, none of those solutions was usable with regular Sol 10. Note most of
our volumes are ufs in zvols, but even zfs volumes were slow.

Someone else had mentioned the speed was fixed in an earlier release,
but I had not had a chance to
upgrade. Since we wanted to try zfs user-quotas, I finally had the
chance.


Lund


Brent Jones wrote:

On Thu, May 21, 2009 at 10:17 PM, Jorgen Lundman lund...@gmo.jp wrote:

To finally close my quest. I tested zfs send in osol-b114 version:

received 82.3GB stream in 1195 seconds (70.5MB/sec)

Yeeaahh!

That makes it completely usable! Just need to change our support 
contract to

allow us to run b114 and we're set! :)


Thanks,

Lund


Jorgen Lundman wrote:

We finally managed to upgrade the production x4500s to Sol 10 10/08
(unrelated to this) but with the hope that it would also make zfs 
send

usable.

Exactly how does build 105 translate to Solaris 10 10/08?  My 
current
speed test has sent 34Gb in 24 hours, which isn't great. Perhaps 
the next

version of Solaris 10 will have the improvements.




Robert Milkowski wrote:

Hello Jorgen,

If you look at the list archives you will see that it made a huge
difference for some people, including me. Now I'm easily able to
saturate a GbE link while zfs send|recv'ing.


Since build 105 it should be *MUCH* faster.



--
Jorgen Lundman   | lund...@lundman.net
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo| +81 (0)90-5578-8500  (cell)
Japan| +81 (0)3 -3375-1767  (home)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



Can you give any details about your data set, what you piped zfs
send/receive through (SSH?), hardware/network, etc?
I'm envious of your speeds!




--
Dirk Wriedt, dirk.wri...@sun.com, Sun Microsystems GmbH
Systemingenieur Strategic Accounts
Nagelsweg 55, 20097 Hamburg, Germany
Tel.: +49-40-251523-132 Fax: +49-40-251523-425 Mobile: +49 172 848 4166
Never been afraid of chances I been takin' - Joan Jett

Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1, D-85551 
Kirchheim-Heimstetten

Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Wolf Frenkel
Vorsitzender des Aufsichtsrates: Martin Haering




--
Jorgen Lundman   | lund...@lundman.net
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo| +81 (0)90-5578-8500  (cell)
Japan| +81 (0)3 -3375-1767  (home)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] nonunique devids with Solaris 10 zfs

2009-05-26 Thread James C. McPherson
On Tue, 26 May 2009 10:19:06 +0200
Willi Burmeister w...@cs.uni-kiel.de wrote:

 Hi,
 
 I'm trying to get Solaris 10U6 on an old V240 with two new Seagate disks,
 using zfs as the root filesystem, but failed with this status:
 
 --
 # zpool status
   pool: rpool
  state: DEGRADED
 status: One or more devices could not be used because the label is missing or
 invalid.  Sufficient replicas exist for the pool to continue
 functioning in a degraded state.
 action: Replace the device using 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-4J
  scrub: none requested
 config:
 
 NAME                     STATE     READ WRITE CKSUM
 rpool                    DEGRADED     0     0     0
   mirror                 DEGRADED     0     0     0
     3665986270438154650  FAULTED      0     0     0  was /dev/dsk/c0t0d0s0
     c0t1d0s0             ONLINE       0     0     0
 
 errors: No known data errors
 --
 
 I think the reason is nonunique devids on both drives:
 
 --
 # zdb -l /dev/dsk/c0t0d0s0 | egrep devid | head -1
 devid='id1,s...@n5000/a'
 
 # zdb -l /dev/dsk/c0t1d0s0 | egrep devid | head -1
 devid='id1,s...@n5000/a'
 --
 
 How is this 'devid' generated, and who makes sure these devids end up
 unique? And will Update 7 or OpenSolaris help?

Yes, that'll be the most likely cause of the problem.

The devid is generated from the SCSI INQUIRY Page83 data
if that's available, or Page80 if not, or faked in some
cases.
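
If it helps to confirm, the labels and the drives' own inquiry data can be
compared directly (the commands are the ones already used in this thread,
plus iostat; adjust the cXtYdZ names for your system):

# devid recorded in each vdev label - these should differ per drive
for d in c0t0d0s0 c0t1d0s0; do
    echo "--- $d"
    zdb -l /dev/dsk/$d | grep devid | head -1
done

# what each drive reports about itself (vendor, product, serial number)
iostat -En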

You can read more about devids in my presentation on them

http://www.jmcp.homeunix.com/~jmcp/WhatIsAGuid.pdf



 P.S. Here is some more information about the drives:
 
 Disk 0:
   SN:   3LM63XBW
   Model:ST3300655LC
   Firmware: 0003
   LOT No:   A-01-0925-3
 
 Disk 1:
   SN:   3LM62RDB
   Model:ST3300655LC
   Firmware: 0003
   LOT No:   A-01-0925-3

I would have hoped that new Seagate disks would be providing
a correct response to the Page83 inquiry. 


James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog
Kernel Conference Australia - http://au.sun.com/sunnews/events/2009/kernel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Frank Middleton

On 05/23/09 10:21, Richard Elling wrote:

<preface>
This forum is littered with claims of "zfs checksums are broken" where
the root cause turned out to be faulty hardware or firmware in the data
path.
</preface>

I think that before you speculate on a redesign, we should get to
the root cause.


The hardware is clearly misbehaving. No argument. The question is: how
far out of reasonable behavior is it?

Redesign? I'm not sure I can conceive an architecture that would make
double buffering difficult to do. It is unclear how faulty hardware or
firmware could be responsible for such a low error rate (1 in 4*10^10).
Just asking if an option for machines with no ecc and their inevitable
memory errors is a reasonable thing to suggest in an RFE.


The checksum occurs in the pipeline prior to write to disk.
So if the data is damaged prior to checksum, then ZFS will
never know. Nor will UFS. Neither will be able to detect
this. In Solaris, if the damage is greater than the ability
of the memory system and CPU to detect or correct, then
even Solaris won't know. If the memory system or CPU
detects a problem, then Solaris fault management will kick
in and do something, preempting ZFS.


Exactly. My whole point. And without ECC there's no way of knowing.
But if the data is damaged /after/ checksum but /before/ write, then
you have a real problem...


Memory diagnostics just test memory. Disk diagnostics just test disks.


This is not completely accurate. Disk diagnostics also test the
data path. Memory tests also test the CPU. The difference is the
amount of test coverage for the subsystem.


Quite. But the disk diagnostic doesn't really test memory beyond what
it uses to run itself. Likewise it may not test the FPU, for example.


ZFS keeps disks pretty busy, so perhaps it loads the power supply
to the point where it heats up and memory glitches are more likely.


In general, for like configurations, ZFS won't keep a disk any more
busy than other file systems. In fact, because ZFS groups transactions,
it may create less activity than other file systems, such as UFS.


That's a point in its favor, although not really relevant. If the disks
are really busy they will load the PSU more, and that could drag the supply
down, which in turn might make errors occur that otherwise wouldn't.


Ironically, the Open Solaris installer does not allow for ZFS
mirroring at install time, one time where it might be really important!
Now that sounds like a more useful RFE, especially since it would be
relatively easy to implement. Anaconda does it...


This is not an accurate statement. The OpenSolaris installer does
support mirrored boot disks via the Automated Installer method.
http://dlc.sun.com/osol/docs/content/2008.11/AIinstall/index.html
You can also install Solaris 10 to mirrored root pools via JumpStart.


Talking about the live CD here. I prefer to install via jumpstart, but
AFAIK Open Solaris (indiana) isn't available as an installable DVD. But
most consumers are going to be installing from the live CD and they
are the ones with the low end hardware without ECC. There was recently
a suggestion on another thread about an RFE to add mirroring as an
install option.
 

I think a better test would be to md5 the file from all systems
and see if the md5 hashes are the same. If they are, then yes,
the finger would point more in the direction of ZFS. The
send/recv protocol hasn't changed in quite some time, but it
is arguably not as robust as it could be.


Thanks! md5 hash is exactly the kind of test I was looking for.
md5sum on SPARC 9ec4f7da41741b469fcd7cb8c5040564 (local ZFS)
md5sum on X86   9ec4f7da41741b469fcd7cb8c5040564 (remote NFS)
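
For anyone repeating the comparison, it is just the same file hashed on each
box (the path is a placeholder); Solaris also ships digest(1) if md5sum is
not installed:

# run on both machines against the same file (local ZFS vs. the NFS mount)
md5sum /path/to/testfile
digest -a md5 /path/to/testfile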


ZFS send/recv use fletcher4 for the checksums. ZFS uses fletcher2
for data (by default) and fletcher4 for metadata. The same fletcher
code is used. So if you believe fletcher4 is broken for send/recv,
how do you explain that it works for the metadata? Or does it?
There may be another failure mode at work here...
(see comment on scrubs at the end of this extended post)

[Did you forget the scrubs comment?]

Never said it was broken. I assume the same code is used for both SPARC
and X86, and it works fine on SPARC. It would seem that this machine
gets memory errors so often (even though it passes the Linux memory
diagnostic) that it can never get to the end of a 4GB recv stream. Odd
that it can do the md5sum, but as mentioned, perhaps doing the i/o
puts more strain on the machine and stresses it to where more memory
faults occur. I can't quite picture a software bug that would cause
random failures on specific hardware and I am happy to give ZFS the
benefit of the doubt.


It would have been nice if we were able to recover the contents of the
file; if you also know what was supposed to be there, you can diff and
then we can find out what was wrong.


Running file(1) on those files resulted in a bus error. Is there a way to actually
read a file reported by ZFS as unrecoverable to do just that (and to
separately 

Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Frank Middleton

On 05/26/09 03:23, casper@sun.com wrote:


And where exactly do you get the second good copy of the data?


From the first. And if it is already bad, as noted previously, this
is no worse than the UFS/ext3 case. If you want total freedom from
this class of errors, use ECC.
 

If you copy the data, you've just doubled your chance of using bad memory.
The original copy can be good or bad; the second copy cannot be better
than the first copy.


The whole point is that the memory isn't bad. About once a month, 4GB
of memory of any quality can experience 1 bit being flipped, perhaps
more or less often. If that bit happens to be in the checksummed buffer
then you'll get an unrecoverable error on a mirrored drive. And if I
understand correctly, ZFS keeps data in memory for a lot longer than
other file systems and uses more memory doing so. Good features, but
makes it more vulnerable to random bit flips. This is why decent
machines have ECC. To argue that ZFS should work reliably on machines
without ECC flies in the face of statistical reality and the reason
for ECC in the first place.


You can disable the checksums if you don't care.


But I do care. I'd like to know if my files have been corrupted, or at
least as much as possible. But there are huge classes of files for
which the odd flipped bit doesn't matter and the loss of which would
be very painful. Email archives and videos come to mind. An easy
workaround is to simply store all important stuff on a machine with
ECC. Problem solved...


One broken bit may not have caused serious damage; most things still work.


Exactly.


Absolutely, memory diags are essential. And you certainly run them if
you see unexpected behaviour that has no other obvious cause.

Runs for days, as noted.


Doesn't prove anything.


Quite. But nonetheless, the unrecoverable errors did occur on mirrored
drives, and it seems to defeat the whole purpose of mirroring, which is,
AFAIK, keeping two independent copies of every file in case one gets lost.
Writing both images from one buffer appears to violate the premise. I
can think of two RFEs:

1) Add an option to buffer writes on machines without ECC memory to
   avoid the possibility of random memory flips causing unrecoverable
   errors with mirrored drives.

2) An option to read files even if they have failed checksums.

1) could be fixed in the documentation - ZFS should be used with caution
on machines with no ECC since random bit flips can cause unrecoverable
checksum failures on mirrored drives. Or ZFS isn't supported on
machines with memory that has no ECC.

Disabling checksums is one way of working around 2). But it also disables
a cool feature. I suppose you could optionally change checksum failure
from an error to a warning, but ideally it would be file by file...

Ironically, I wonder if this is even a problem with raidz? But grotty
machines like these can't really support 3 or more internal drives...

Cheers -- Frank

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Bob Friesenhahn

On Tue, 26 May 2009, Frank Middleton wrote:


1) could be fixed in the documentation - ZFS should be used with caution
on machines with no ECC since random bit flips can cause unrecoverable
checksum failures on mirrored drives. Or ZFS isn't supported on
machines with memory that has no ECC.


What problem are you looking to solve?  Data is written by application 
software which includes none of the extra safeguards you are insisting 
should be in ZFS.  This means that the data may be undetectably 
corrupted.


I strongly recommend that you purchase a system with ECC in order to 
operate reliably in the (apparent) radium mine where you live.  It is 
time to wake up, smell the radon, and do something about the problem. 
Check this map to see if there is cause for concern in your area:
http://upload.wikimedia.org/wikipedia/en/8/8b/US_homes_over_recommended_radon_levels.gif


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Bob Friesenhahn

On Tue, 26 May 2009, Frank Middleton wrote:

Just asking if an option for machines with no ecc and their inevitable
memory errors is a reasonable thing to suggest in an RFE.


Machines lacking ECC do not suffer from inevitable memory errors. 
Memory errors are not like death and taxes.



Exactly. My whole point. And without ECC there's no way of knowing.
But if the data is damaged /after/ checksum but /before/ write, then
you have a real problem...


If memory does not work, then you do have a real problem.  The ZFS ARC 
consumes a large amount of memory.  Note that the problem of 
corruption around the time of the checksum/write is minor compared to 
corruption in the ZFS ARC since data is continually read from the ZFS 
ARC and so bad data may be returned to the user even though it is 
(was?) fine on disk.  This is as close as ZFS comes to having an 
Achilles' heel.  Solving this problem would require crippling the 
system performance.



Never said it was broken. I assume the same code is used for both SPARC
and X86, and it works fine on SPARC. It would seem that this machine
gets memory errors so often (even though it passes the Linux memory
diagnostic) that it can never get to the end of a 4GB recv stream. Odd


Maybe you need a new computer, or need to fix your broken one.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Richard Elling

Frank brings up some interesting ideas, some of which might
need some additional thoughts...

Frank Middleton wrote:

On 05/23/09 10:21, Richard Elling wrote:

<preface>
This forum is littered with claims of "zfs checksums are broken" where
the root cause turned out to be faulty hardware or firmware in the data
path.
</preface>

I think that before you speculate on a redesign, we should get to
the root cause.


The hardware is clearly misbehaving. No argument. The question is: how
far out of reasonable behavior is it?


Hardware is much less expensive than software, even free software.
Your system has a negative ROI, kinda like trading credit default
swaps.  The best thing you can do is junk it :-)



Redesign? I'm not sure I can conceive an architecture that would make
double buffering difficult to do. It is unclear how faulty hardware or
firmware could be responsible for such a low error rate (1 in 4*10^10).
Just asking if an option for machines with no ecc and their inevitable
memory errors is a reasonable thing to suggest in an RFE.


It is a good RFE, but it isn't an RFE for the software folks.


The checksum occurs in the pipeline prior to write to disk.
So if the data is damaged prior to checksum, then ZFS will
never know. Nor will UFS. Neither will be able to detect
this. In Solaris, if the damage is greater than the ability
of the memory system and CPU to detect or correct, then
even Solaris won't know. If the memory system or CPU
detects a problem, then Solaris fault management will kick
in and do something, preempting ZFS.


Exactly. My whole point. And without ECC there's no way of knowing.
But if the data is damaged /after/ checksum but /before/ write, then
you have a real problem...


To put this in perspective, ECC is a broad category.  When we
think of ECC for memory, it is usually Single Error (bit) Correction,
Double Error (bit) Detection (SECDED).  A well designed system
will also do Single Device Data Correction (aka Chipkill or Extended
ECC, since Chipkill is trademarked).  What this means is that faults
of more than 2 bits per word are not detected, unless all of the faults
occur in the same chip for SDDC cases.

Clearly, this wouldn't scale well to large data streams, which is why
they use checksums like Fletcher or hash functions like SHA-256.


ZFS keeps disks pretty busy, so perhaps it loads the power supply
to the point where it heats up and memory glitches are more likely.


In general, for like configurations, ZFS won't keep a disk any more
busy than other file systems. In fact, because ZFS groups transactions,
it may create less activity than other file systems, such as UFS.


That's a point in its favor, although not really relevant. If the disks
are really busy they will load the PSU more, and that could drag the supply
down, which in turn might make errors occur that otherwise wouldn't.


The dynamic loads of modern disk drives are not very great.  I don't
believe your argument is very strong, here.  Also, the solution is,
once again, fix the hardware.


I think a better test would be to md5 the file from all systems
and see if the md5 hashes are the same. If they are, then yes,
the finger would point more in the direction of ZFS. The
send/recv protocol hasn't changed in quite some time, but it
is arguably not as robust as it could be.


Thanks! md5 hash is exactly the kind of test I was looking for.
md5sum on SPARC 9ec4f7da41741b469fcd7cb8c5040564 (local ZFS)
md5sum on X86   9ec4f7da41741b469fcd7cb8c5040564 (remote NFS)


Good.


ZFS send/recv use fletcher4 for the checksums. ZFS uses fletcher2
for data (by default) and fletcher4 for metadata. The same fletcher
code is used. So if you believe fletcher4 is broken for send/recv,
how do you explain that it works for the metadata? Or does it?
There may be another failure mode at work here...
(see comment on scrubs at the end of this extended post)

[Did you forget the scrubs comment?]


no, you responded that you had been seeing scrubs fix errors.


Never said it was broken. I assume the same code is used for both SPARC
and X86, and it works fine on SPARC. It would seem that this machine
gets memory errors so often (even though it passes the Linux memory
diagnostic) that it can never get to the end of a 4GB recv stream. Odd
that it can do the md5sum, but as mentioned, perhaps doing the i/o
puts more strain on the machine and stresses it to where more memory
faults occur. I can't quite picture a software bug that would cause
random failures on specific hardware and I am happy to give ZFS the
benefit of the doubt.


Yes, software can trigger memory failures.  More below...


It would have been nice if we were able to recover the contents of the
file; if you also know what was supposed to be there, you can diff and
then we can find out what was wrong.


Running file(1) on those files resulted in a bus error. Is there a way to actually
read a file reported by ZFS as unrecoverable to do just that (and to
separately retrieve the copy 

Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Darren J Moffat

Bob Friesenhahn wrote:

On Tue, 26 May 2009, Frank Middleton wrote:

Just asking if an option for machines with no ecc and their inevitable
memory errors is a reasonable thing to suggest in an RFE.


Machines lacking ECC do not suffer from inevitable memory errors. 
Memory errors are not like death and taxes.



Exactly. My whole point. And without ECC there's no way of knowing.
But if the data is damaged /after/ checksum but /before/ write, then
you have a real problem...


If memory does not work, then you do have a real problem.  The ZFS ARC 
consumes a large amount of memory.  Note that the problem of corruption 
around the time of the checksum/write is minor compared to corruption in 
the ZFS ARC since data is continually read from the ZFS ARC and so bad 
data may be returned to the user even though it is (was?) fine on disk.  
This is as close as ZFS comes to having an Achilles' heel.  Solving this 
problem would require crippling the system performance.


When running a DEBUG kernel (not something most people would do on a
production system), ZFS does actually checksum and verify the buffers
in the ARC - not on every access, but certain operations cause it to happen.


--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] eon or nexentacore or opensolaris

2009-05-26 Thread Erast
Maybe what you are saying is true wrt. NexentaCore 2.0. But hey, think
about open source principles and the development process. We do hope that
NexentaCore will become an official Debian distribution some day! We are
evolving and driven completely by the community here. Anyone can
participate, fix the bugs, and make it happen:


https://launchpad.net/distros/nexenta

As for the commercial bits:

1. NexentaStor is still based on 1.x. Once the 2.x branch is more or less
polished, we will make a safe transition.


2. ON patches go through serious stress testing, not only by Nexenta
but also by the growing list of Nexenta partners, to ensure that the
end solution is absolutely stable and safe:


http://www.nexenta.com/partners

3. The development model of NexentaCore is indeed very much Debian-like.
However, NexentaStor is developed with different rules in mind - rules
of focused testing, conservative principles, and partner-wide openness.


4. Is Debian helping NexentaStor to integrate stuff? Yes, absolutely!
Lots of advantages here. Debian is NOT just package management, as one
might think - it is also a polished distribution foundation.
NexentaStor plugins, which are pretty much Debian packages, are used to
extend NexentaStor's capabilities. Learn more:


http://www.nexenta.com/corp/index.php?option=com_jreviewsItemid=112

C. Bergström wrote:

Anil Gulecha wrote:

On Sat, May 23, 2009 at 1:19 PM, Bogdan M. Maryniuk
bogdan.maryn...@gmail.com wrote:
 

On Sat, May 23, 2009 at 4:56 AM, Joe S js.li...@gmail.com wrote:
   

EON ZFS NAS
http://eonstorage.blogspot.com/
  

No idea.

   

NexentaCore Platform (v2.0 RC3)
http://www.nexenta.org/os/NexentaCore
  

Personally, I tried it a few times. For now, it is still too broken
for me and looks scary. The previous version is much more stable but
also older. The newer v2.0 looks exactly like bleeding-edge Debian in the
old days: each time you run apt-get upgrade you have to use a shaman's
tambourine, dancing around the fireplace. I don't remember exactly, but
some packages are just broken and cannot find dependencies, the
installation crashes, pollutes your system, and cannot be restored
nicely, etc. However, once it is no longer that broken, it should
be a great distribution with excellent package management and very
convenient to use.



Hi Bogdan,

Which particular packages were these? RC3 is quite stable, and all
server packages are solid. If you do face issues with a particular
one, we'd appreciate a bug report. All information on this is
helpful..
  
I've done some preliminary patch review on the core on-nexenta patches
and I'd concur with putting Nexenta pretty low on the trusted list for
enterprise storage.  This is in addition to the packaging problems
you've pointed out.  As if the issues at hand were not enough, when I sent
an email to their dev list it was completely ignored.  Marketing for
Nexenta, as Anil points out, is strong, but like many other distributions
outside Sun there's still a lot of work to go.  I'm not sure about EON's
update delivery, but I believe it's just a minimal repackage of the
OpenSolaris release.  This isn't the advocacy list, so if you're
interested in other alternatives feel free to email me off list.


Cheers,


./Christopher

--
OSUNIX - Built from the best of OpenSolaris Technology
http://www.osunix.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Toby Thain


On 26-May-09, at 10:21 AM, Frank Middleton wrote:


On 05/26/09 03:23, casper@sun.com wrote:


And where exactly do you get the second good copy of the data?


From the first. And if it is already bad, as noted previously, this
is no worse than the UFS/ext3 case. If you want total freedom from
this class of errors, use ECC.

If you copy the code you've just doubled your chance of using bad  
memory.
The original copy can be good or bad; the second copy cannot be  
better

than the first copy.


The whole point is that the memory isn't bad. About once a month, 4GB
of memory of any quality can experience 1 bit being flipped, perhaps
more or less often.



What you are proposing does practically nothing to mitigate random
bit flips. Think about the probabilities involved. You're testing
one tiny buffer, very occasionally, for an extremely improbable
event. It also has nothing to do with ZFS, and leaves every other byte
of your RAM untested. See the reasoning?


--Toby


...

Cheers -- Frank



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Toby Thain


On 25-May-09, at 11:16 PM, Frank Middleton wrote:


On 05/22/09 21:08, Toby Thain wrote:
Yes, the important thing is to *detect* them, no system can run  
reliably

with bad memory, and that includes any system with ZFS. Doing nutty
things like calculating the checksum twice does not buy anything of
value here.


All memory is bad if it doesn't have ECC. There are only varying
degrees of badness. Calculating the checksum twice on its own would
be nutty, as you say, but doing so on a separate copy of the data
might prevent unrecoverable errors


I don't see this at all. The kernel reads the application buffer. How  
does reading it twice buy you anything?? It sounds like you are  
assuming 1) the buffer includes faulty RAM; and 2) the faulty RAM  
reads differently each time. Doesn't that seem statistically unlikely  
to you? And even if you really are chasing this improbable scenario,  
why make ZFS do the job of a memory tester?



after writes to mirrored drives.
You can't detect memory errors if you don't have ECC. But you can
try to mitigate them. Not doing so makes ZFS less reliable than
the memory it is running on. The problem is that ZFS makes any file
with a bad checksum inaccessible, even if one really doesn't care
whether the data has been corrupted. A workaround might be a way to allow
such files to be readable despite the bad checksum...


I am not sure what you are trying to say here.



...


How can a machine with bad memory work fine with ext3?


It does. It works fine with ZFS too. Just really annoying  
unrecoverable
files every now and then on mirrored drives. This shouldn't happen  
even

with lousy memory and wouldn't (doesn't) with ECC. If there was a way
to examine the files and their checksums, I would be surprised if they
were different (If they were, it would almost certainly be the  
controller

or the PCI bus itself causing the problem). But I speculate that it is
predictable memory hits.


You're making this harder than it really is. Run a memory test. If it  
fails, take the machine out of service until it's fixed. There's no  
reasonable way to keep running faulty hardware.


--Toby



-- Frank



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Kjetil Torgrim Homme
Frank Middleton f.middle...@apogeect.com writes:

 Exactly. My whole point. And without ECC there's no way of knowing.
 But if the data is damaged /after/ checksum but /before/ write, then
 you have a real problem...

we can't do much to protect ourselves from damage to the data itself
(an extra copy in RAM will help little and ruin performance).

damage to the bits holding the computed checksum before it is written
can be alleviated by doing the calculation independently for each
written copy.  in particular, this will help if the bit error is
transient.

since the number of octets in RAM holding the checksum is dwarfed by the
number of octets occupied by data (256 bits vs. one
mebibit for a full default-sized record), such a paranoia mode will
most likely tell you that the *data* is corrupt, not the checksum.
but today you don't know, so it's an improvement in my book.

 Quoting the ZFS admin guide: "The failmode property ... provides the
 failmode property for determining the behavior of a catastrophic
 pool failure due to a loss of device connectivity or the failure of
 all devices in the pool."  Has this changed since the ZFS admin
 guide was last updated?  If not, it doesn't seem relevant.

I guess checksum error handling is orthogonal to this and should have
its own property.  it sure would be nice if the admin could ask the OS
to deliver the bits contained in a file, no matter what, and just log
the problem.
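
for reference, failmode today is just a pool property covering the
whole-pool failure case quoted above, which is why a separate knob for
checksum errors would be needed (the pool name is only an example):

# inspect the current setting ('wait' is the default)
zpool get failmode rpool

# choose the reaction to catastrophic device loss: wait | continue | panic
zpool set failmode=continue rpool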

 Cheers -- Frank

thank you for pointing out this potential weakness in ZFS' consistency
checking, I didn't realise it was there.

also thank you, all ZFS developers, for your great job :-)

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] disabling showmount -e behaviour

2009-05-26 Thread Roman V Shaposhnik
I must admit that this question originates in the context of Sun's
Storage 7210 product, which imposes additional restrictions on the
kind of knobs I can turn.

But here's the setup: suppose I have an installation where ZFS
is the storage for user home directories. Since I need quotas, each
directory gets to be its own filesystem. Since I also need these
homes to be accessible remotely, each FS is exported via NFS. Here's
the question, though: how do I prevent showmount -e (or a manually
constructed EXPORT/EXPORTALL RPC request) from disclosing a list of
users that are hosted on a particular server?
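
To make the setup concrete (names invented), each home is something like
the following, and it is exactly these shares that showmount -e then lists:

# one filesystem per user, with a quota, shared over NFS
zfs create pool/home/alice
zfs set quota=10G pool/home/alice
zfs set sharenfs=on pool/home/alice

# from any remote host, this enumerates the exports, i.e. the user names
showmount -e fileserver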

Thanks,
Roman.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] eon or nexentacore or opensolaris

2009-05-26 Thread Bogdan M. Maryniuk
On Sun, May 24, 2009 at 6:11 PM, Anil Gulecha anil.ve...@gmail.com wrote:
 One example is StormOS, an XFCE-based distro being built on NCP2.
 According to the latest blog entry, a release is imminent. Perhaps
 you'll have a better desktop experience with this. (www.stormos.org)

So. Tried it just now. In short: I'd stay with OpenSolaris for at least
a year. :-)
--
Kind regards, bm
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss