Re: [zfs-discuss] Errors on mirrored drive
> On 05/22/09 21:08, Toby Thain wrote:
>> Yes, the important thing is to *detect* them. No system can run reliably
>> with bad memory, and that includes any system with ZFS. Doing nutty things
>> like calculating the checksum twice does not buy anything of value here.
>
> All memory is bad if it doesn't have ECC. There are only varying degrees
> of badness. Calculating the checksum twice on its own would be nutty, as
> you say, but doing so on a separate copy of the data might prevent
> unrecoverable errors after writes to mirrored drives.

You can't detect memory errors if you don't have ECC. And where exactly do
you get the second good copy of the data? If you copy the data you've just
doubled your chance of using bad memory. The original copy can be good or
bad; the second copy cannot be better than the first copy.

> But you can try to mitigate them. Not doing so makes ZFS less reliable
> than the memory it is running on. The problem is that ZFS makes any file
> with a bad checksum inaccessible, even if one really doesn't care if the
> data has been corrupted. A workaround might be a way to allow such files
> to be readable despite the bad checksum...

You can disable the checksums if you don't care.

> But it isn't. Applications aren't dying, compilers are not segfaulting
> (it was even possible to compile GCC 4.3.2 with the supplied gcc); gdm is
> staying up for weeks at a time...

And I wouldn't consider running a non-trivial database application on a
machine without ECC. One broken bit may not cause serious damage; most
things will keep working.

> Absolutely, memory diags are essential. And you certainly run them if you
> see unexpected behaviour that has no other obvious cause. Runs for days,
> as noted.

That doesn't prove anything.

Casper
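For readers who do want to take Casper up on that, a minimal sketch of what
disabling checksums looks like (the dataset name is hypothetical); note this
only affects data blocks, since ZFS always checksums its own metadata:

    #!/bin/sh
    # Turn off data-block checksums on one dataset only; ZFS metadata
    # remains checksummed regardless of this property.
    zfs set checksum=off tank/scratch
    # Confirm the property took effect:
    zfs get checksum tank/scratch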
[zfs-discuss] nonunique devids with Solaris 10 zfs
Hi,

I'm trying to get Solaris 10U6 onto an old V240 with two new Seagate disks,
using zfs as the root filesystem, but failed with this status:

--
# zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing
        or invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: none requested
config:

        NAME                     STATE     READ WRITE CKSUM
        rpool                    DEGRADED     0     0     0
          mirror                 DEGRADED     0     0     0
            3665986270438154650  FAULTED      0     0     0  was /dev/dsk/c0t0d0s0
            c0t1d0s0             ONLINE       0     0     0

errors: No known data errors
--

I think the reason is nonunique devids for both drives:

--
# zdb -l /dev/dsk/c0t0d0s0 | egrep devid | head -1
    devid='id1,s...@n5000/a'
# zdb -l /dev/dsk/c0t1d0s0 | egrep devid | head -1
    devid='id1,s...@n5000/a'
--

How is this 'devid' generated, and who makes sure these devids end up
unique? And will Update 7 or OpenSolaris help?

Any help is appreciated.

Willi

P.S. Here is some more information about the drives:

Disk 0: SN: 3LM63XBW  Model: ST3300655LC  Firmware: 0003  LOT No: A-01-0925-3
Disk 1: SN: 3LM62RDB  Model: ST3300655LC  Firmware: 0003  LOT No: A-01-0925-3
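(For anyone finding this in the archives: the action line above is the
standard remedy. A minimal sketch, using the GUID and device name from the
status output; with colliding devids the replacement may well fault again,
which is the point of the question.)

    #!/bin/sh
    # Replace the faulted half of the mirror; the numeric GUID comes
    # from the 'zpool status' output above.
    zpool replace rpool 3665986270438154650 c0t0d0s0
    # On SPARC, a root-pool disk also needs the ZFS boot block:
    installboot /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/c0t0d0s0
    # Watch the resilver:
    zpool status rpool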
Re: [zfs-discuss] Zfs send speed. Was: User quota design discussion..
Jorgen,

what is the size of the sending zfs? I thought replication speed depends on
the size of the sending fs too, not only on the size of the snapshot being
sent.

Regards
Dirk

--On Friday, May 22, 2009 19:19:34 +0900 Jorgen Lundman lund...@gmo.jp wrote:

> Sorry, yes. It is straight:
>
> # time zfs send zpool1/leroy_c...@speedtest | nc 172.20.12.232 3001
> real    19m48.199s
>
> # /var/tmp/nc -l -p 3001 -vvv | time zfs recv -v zpool1/le...@speedtest
> received 82.3GB stream in 1195 seconds (70.5MB/sec)
>
> Sending is osol-b114. Receiver is Solaris 10 10/08.
>
> When we tested Solaris 10 10/08 - Solaris 10 10/08 these were the results:
>
> zfs send | nc | zfs recv                 - 1.0 MB/s
> tar -cvf /zpool/leroy | nc | tar -xvf -  - 2.5 MB/s
> ufsdump | nc | ufsrestore                - 5.0 MB/s
>
> So none of those solutions was usable with regular Sol 10. Note most of
> our volumes are ufs in zvol, but even zfs volumes were slow.
>
> Someone else had mentioned the speed was fixed in an earlier release; I
> had not had a chance to upgrade. But since we wanted to try zfs
> user-quotas, I finally had the chance.
>
> Lund
>
> Brent Jones wrote:
>> On Thu, May 21, 2009 at 10:17 PM, Jorgen Lundman lund...@gmo.jp wrote:
>>> To finally close my quest. I tested zfs send in osol-b114 version:
>>> received 82.3GB stream in 1195 seconds (70.5MB/sec)
>>> Yeeaahh! That makes it completely usable! Just need to change our
>>> support contract to allow us to run b114 and we're set! :)
>>> Thanks, Lund
>>>
>>> Jorgen Lundman wrote:
>>>> We finally managed to upgrade the production x4500s to Sol 10 10/08
>>>> (unrelated to this) but with the hope that it would also make zfs
>>>> send usable. Exactly how does build 105 translate to Solaris 10
>>>> 10/08? My current speed test has sent 34Gb in 24 hours, which isn't
>>>> great. Perhaps the next version of Solaris 10 will have the
>>>> improvements.
>>>>
>>>> Robert Milkowski wrote:
>>>>> Hello Jorgen,
>>>>> If you look at the list archives you will see that it made a huge
>>>>> difference for some people including me. Now I'm easily able to
>>>>> saturate a GbE link while zfs send|recv'ing. Since build 105 it
>>>>> should be *MUCH* faster.
>>
>> Can you give any details about your data set, what you piped zfs
>> send/receive through (SSH?), hardware/network, etc? I'm envious of your
>> speeds!

--
Dirk Wriedt, dirk.wri...@sun.com, Sun Microsystems GmbH
Systemingenieur Strategic Accounts
Nagelsweg 55, 20097 Hamburg, Germany
Tel.: +49-40-251523-132  Fax: +49-40-251523-425  Mobile: +49 172 848 4166
Never been afraid of chances I been takin' - Joan Jett
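To Brent's transport question: a raw nc pipe as above avoids cipher
overhead; the ssh equivalent is a sketch like this (host and dataset names
are placeholders), typically slower but with no listener to set up:

    #!/bin/sh
    # Send a snapshot over ssh instead of nc; simpler and encrypted,
    # but the cipher often becomes the bottleneck on GbE.
    zfs send zpool1/home@speedtest | ssh recvhost zfs recv -v zpool1/home_copy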
Re: [zfs-discuss] Zfs send speed. Was: User quota design discussion..
So you recommend I also do a speed test on larger volumes? The test data I
had on the b114 server was only 90GB. Previous tests included 500G ufs on
zvol etc.

It's just that it will take 4 days to send it to the b114 server to start
with ;) (from the Sol 10 servers).

Lund

Dirk Wriedt wrote:
> Jorgen,
> what is the size of the sending zfs? I thought replication speed depends
> on the size of the sending fs too, not only on the size of the snapshot
> being sent.
> Regards
> Dirk

--
Jorgen Lundman       | lund...@lundman.net
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
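If staging a larger local test set is easier than waiting four days for a
transfer, one rough approach (names and sizes are made up) is:

    #!/bin/sh
    # Build a throwaway dataset, fill it, snapshot it, and time a send
    # to /dev/null to measure the read side alone. mkfile writes zeros,
    # so leave compression off for realistic numbers.
    zfs create zpool1/sendtest
    mkfile 10g /zpool1/sendtest/bigfile
    zfs snapshot zpool1/sendtest@speedtest
    time zfs send zpool1/sendtest@speedtest > /dev/null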
Re: [zfs-discuss] nonunique devids with Solaris 10 zfs
On Tue, 26 May 2009 10:19:06 +0200 Willi Burmeister w...@cs.uni-kiel.de wrote:

> I'm trying to get Solaris 10U6 onto an old V240 with two new Seagate
> disks, using zfs as the root filesystem, but failed with this status:
>
>         3665986270438154650  FAULTED  0  0  0  was /dev/dsk/c0t0d0s0
>         c0t1d0s0             ONLINE   0  0  0
>
> I think the reason is nonunique devids for both drives:
>
> # zdb -l /dev/dsk/c0t0d0s0 | egrep devid | head -1
>     devid='id1,s...@n5000/a'
> # zdb -l /dev/dsk/c0t1d0s0 | egrep devid | head -1
>     devid='id1,s...@n5000/a'
>
> How is this 'devid' generated, and who makes sure these devids end up
> unique? And will Update 7 or OpenSolaris help?

Yes, that'll be the most likely cause of the problem. The devid is
generated from the SCSI INQUIRY Page83 data if that's available, or Page80
if not, or faked in some cases. You can read more about devids in my
presentation on them:

http://www.jmcp.homeunix.com/~jmcp/WhatIsAGuid.pdf

> P.S. Here is some more information about the drives:
>
> Disk 0: SN: 3LM63XBW  Model: ST3300655LC  Firmware: 0003  LOT No: A-01-0925-3
> Disk 1: SN: 3LM62RDB  Model: ST3300655LC  Firmware: 0003  LOT No: A-01-0925-3

I would have hoped that new Seagate disks would provide a correct response
to the Page83 inquiry.

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog
Kernel Conference Australia - http://au.sun.com/sunnews/events/2009/kernel
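One way to compare what the OS sees against those devids is the per-device
inquiry dump (a sketch; field names vary by platform and driver):

    #!/bin/sh
    # iostat -En prints inquiry data per device, including Vendor,
    # Product, Revision and Serial No.
    iostat -En c0t0d0 c0t1d0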
Re: [zfs-discuss] Errors on mirrored drive
On 05/23/09 10:21, Richard Elling wrote:
> <preface> This forum is littered with claims of "zfs checksums are
> broken" where the root cause turned out to be faulty hardware or firmware
> in the data path. </preface>
>
> I think that before you speculate on a redesign, we should get to the
> root cause. The hardware is clearly misbehaving.

No argument. The question is - how far out of reasonable behavior is it?
Redesign? I'm not sure I can conceive of an architecture that would make
double buffering difficult to do. It is unclear how faulty hardware or
firmware could be responsible for such a low error rate (1 in 4*10^10).
Just asking if an option for machines with no ECC and their inevitable
memory errors is a reasonable thing to suggest in an RFE.

> The checksum occurs in the pipeline prior to the write to disk. So if the
> data is damaged prior to checksum, then ZFS will never know. Nor will
> UFS. Neither will be able to detect this. In Solaris, if the damage is
> greater than the ability of the memory system and CPU to detect or
> correct, then even Solaris won't know. If the memory system or CPU
> detects a problem, then Solaris fault management will kick in and do
> something, preempting ZFS.

Exactly. My whole point. And without ECC there's no way of knowing. But if
the data is damaged /after/ checksum but /before/ write, then you have a
real problem...

>> Memory diagnostics just test memory. Disk diagnostics just test disks.
>
> This is not completely accurate. Disk diagnostics also test the data
> path. Memory tests also test the CPU. The difference is the amount of
> test coverage for the subsystem.

Quite. But the disk diagnostic doesn't really test memory beyond what it
uses to run itself. Likewise it may not test the FPU, for example.

>> ZFS keeps disks pretty busy, so perhaps it loads the power supply to the
>> point where it heats up and memory glitches are more likely.
>
> In general, for like configurations, ZFS won't keep a disk any more busy
> than other file systems. In fact, because ZFS groups transactions, it may
> create less activity than other file systems, such as UFS.

That's a point in its favor, although not really relevant. If the disks are
really busy they will load the PSU more, and that could drag the supply
down, which in turn might make errors occur that otherwise wouldn't.

>> Ironically, the OpenSolaris installer does not allow for ZFS mirroring
>> at install time, one time where it might be really important!
>
> Now that sounds like a more useful RFE, especially since it would be
> relatively easy to implement.
>
>> Anaconda does it...
>
> This is not an accurate statement. The OpenSolaris installer does support
> mirrored boot disks via the Automated Installer method.
> http://dlc.sun.com/osol/docs/content/2008.11/AIinstall/index.html
> You can also install Solaris 10 to mirrored root pools via JumpStart.

Talking about the live CD here. I prefer to install via jumpstart, but
AFAIK OpenSolaris (Indiana) isn't available as an installable DVD. But most
consumers are going to be installing from the live CD, and they are the
ones with the low-end hardware without ECC. There was recently a suggestion
on another thread about an RFE to add mirroring as an install option.

> I think a better test would be to md5 the file from all systems and see
> if the md5 hashes are the same. If they are, then yes, the finger would
> point more in the direction of ZFS. The send/recv protocol hasn't changed
> in quite some time, but it is arguably not as robust as it could be.

Thanks! md5 hash is exactly the kind of test I was looking for.

md5sum on SPARC  9ec4f7da41741b469fcd7cb8c5040564  (local ZFS)
md5sum on X86    9ec4f7da41741b469fcd7cb8c5040564  (remote NFS)

>> ZFS send/recv use fletcher4 for the checksums.
>
> ZFS uses fletcher2 for data (by default) and fletcher4 for metadata. The
> same fletcher code is used. So if you believe fletcher4 is broken for
> send/recv, how do you explain that it works for the metadata? Or does it?
> There may be another failure mode at work here... (see comment on scrubs
> at the end of this extended post)

[Did you forget the scrubs comment?]

Never said it was broken. I assume the same code is used for both SPARC and
X86, and it works fine on SPARC. It would seem that this machine gets
memory errors so often (even though it passes the Linux memory diagnostic)
that it can never get to the end of a 4GB recv stream. Odd that it can do
the md5sum, but as mentioned, perhaps doing the I/O puts more strain on the
machine and stresses it to where more memory faults occur. I can't quite
picture a software bug that would cause random failures on specific
hardware, and I am happy to give ZFS the benefit of the doubt.

> It would have been nice if we were able to recover the contents of the
> file; if you also know what was supposed to be there, you can diff and
> then we can find out what was wrong.

file on those files resulted in a bus error. Is there a way to actually
read a file reported by ZFS as unrecoverable to do just that (and to
separately
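For the record, the comparison above amounts to the following (file path
hypothetical; Solaris ships digest(1), while md5sum is the GNU tool on the
Linux/x86 side):

    #!/bin/sh
    # On the SPARC side, against the local ZFS copy:
    digest -a md5 /tank/testfile
    # On the x86 side, against the same file over NFS:
    md5sum /net/sparc-host/tank/testfile
    # Identical output means the file contents match bit for bit.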
Re: [zfs-discuss] Errors on mirrored drive
On 05/26/09 03:23, casper@sun.com wrote:
> And where exactly do you get the second good copy of the data?

From the first. And if it is already bad, as noted previously, this is no
worse than the UFS/ext3 case. If you want total freedom from this class of
errors, use ECC.

> If you copy the data you've just doubled your chance of using bad memory.
> The original copy can be good or bad; the second copy cannot be better
> than the first copy.

The whole point is that the memory isn't bad. About once a month, 4GB of
memory of any quality can experience 1 bit being flipped, perhaps more or
less often. If that bit happens to be in the checksummed buffer then you'll
get an unrecoverable error on a mirrored drive. And if I understand
correctly, ZFS keeps data in memory for a lot longer than other file
systems and uses more memory doing so. Good features, but they make it more
vulnerable to random bit flips. This is why decent machines have ECC. To
argue that ZFS should work reliably on machines without ECC flies in the
face of statistical reality and the reason for ECC in the first place.

> You can disable the checksums if you don't care.

But I do care. I'd like to know if my files have been corrupted, or at
least as much as possible. But there are huge classes of files for which
the odd flipped bit doesn't matter and the loss of which would be very
painful. Email archives and videos come to mind. An easy workaround is to
simply store all important stuff on a machine with ECC. Problem solved...

> One broken bit may not cause serious damage; most things work.

Exactly.

> Doesn't prove anything.

Quite. But nonetheless, the unrecoverable errors did occur on mirrored
drives, and that seems to defeat the whole purpose of mirroring, which is,
AFAIK, keeping two independent copies of every file in case one gets lost.
Writing both images from one buffer appears to violate the premise.

I can think of two RFEs:

1) Add an option to buffer writes on machines without ECC memory, to avoid
   the possibility of random memory flips causing unrecoverable errors
   with mirrored drives.

2) An option to read files even if they have failed checksums.

1) could be fixed in the documentation - "ZFS should be used with caution
on machines with no ECC, since random bit flips can cause unrecoverable
checksum failures on mirrored drives." Or: "ZFS isn't supported on machines
with memory that has no ECC."

Disabling checksums is one way of working around 2), but it also disables a
cool feature. I suppose you could optionally change checksum failure from
an error to a warning, but ideally it would be file by file...

Ironically, I wonder if this is even a problem with raidz? But grotty
machines like these can't really support 3 or more internal drives...

Cheers -- Frank
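As an aside on RFE 2), ZFS will at least tell you which files are affected,
so you know what to restore from backup; a minimal sketch (pool name
hypothetical):

    #!/bin/sh
    # After a scrub, 'zpool status -v' lists the files with permanent
    # (unrecoverable) errors by pathname.
    zpool scrub tank
    zpool status -v tank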
Re: [zfs-discuss] Errors on mirrored drive
On Tue, 26 May 2009, Frank Middleton wrote:
> 1) could be fixed in the documentation - "ZFS should be used with caution
> on machines with no ECC, since random bit flips can cause unrecoverable
> checksum failures on mirrored drives." Or: "ZFS isn't supported on
> machines with memory that has no ECC."

What problem are you looking to solve? Data is written by application
software which includes none of the extra safeguards you are insisting
should be in ZFS. This means that the data may be undetectably corrupted.

I strongly recommend that you purchase a system with ECC in order to
operate reliably in the (apparent) radium mine where you live. It is time
to wake up, smell the radon, and do something about the problem. Check this
map to see if there is cause for concern in your area:

http://upload.wikimedia.org/wikipedia/en/8/8b/US_homes_over_recommended_radon_levels.gif

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] Errors on mirrored drive
On Tue, 26 May 2009, Frank Middleton wrote:
> Just asking if an option for machines with no ECC and their inevitable
> memory errors is a reasonable thing to suggest in an RFE.

Machines lacking ECC do not suffer from inevitable memory errors. Memory
errors are not like death and taxes.

> Exactly. My whole point. And without ECC there's no way of knowing. But
> if the data is damaged /after/ checksum but /before/ write, then you have
> a real problem...

If memory does not work, then you do have a real problem. The ZFS ARC
consumes a large amount of memory. Note that the problem of corruption
around the time of the checksum/write is minor compared to corruption in
the ZFS ARC, since data is continually read from the ZFS ARC, and so bad
data may be returned to the user even though it is (was?) fine on disk.
This is as close as ZFS comes to having an Achilles' heel. Solving this
problem would require crippling the system performance.

> Never said it was broken. I assume the same code is used for both SPARC
> and X86, and it works fine on SPARC. It would seem that this machine gets
> memory errors so often (even though it passes the Linux memory
> diagnostic) that it can never get to the end of a 4GB recv stream. Odd

Maybe you need a new computer, or need to fix your broken one.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] Errors on mirrored drive
Frank brings up some interesting ideas, some of which might need some
additional thoughts...

Frank Middleton wrote:
> On 05/23/09 10:21, Richard Elling wrote:
>> <preface> This forum is littered with claims of "zfs checksums are
>> broken" where the root cause turned out to be faulty hardware or
>> firmware in the data path. </preface>
>> I think that before you speculate on a redesign, we should get to the
>> root cause. The hardware is clearly misbehaving.
>
> No argument. The question is - how far out of reasonable behavior is it?

Hardware is much less expensive than software, even free software. Your
system has a negative ROI, kinda like trading credit default swaps. The
best thing you can do is junk it :-)

> Redesign? I'm not sure I can conceive of an architecture that would make
> double buffering difficult to do. It is unclear how faulty hardware or
> firmware could be responsible for such a low error rate (1 in 4*10^10).
> Just asking if an option for machines with no ECC and their inevitable
> memory errors is a reasonable thing to suggest in an RFE.

It is a good RFE, but it isn't an RFE for the software folks.

>> The checksum occurs in the pipeline prior to the write to disk. So if
>> the data is damaged prior to checksum, then ZFS will never know. Nor
>> will UFS. Neither will be able to detect this. In Solaris, if the
>> damage is greater than the ability of the memory system and CPU to
>> detect or correct, then even Solaris won't know. If the memory system
>> or CPU detects a problem, then Solaris fault management will kick in
>> and do something, preempting ZFS.
>
> Exactly. My whole point. And without ECC there's no way of knowing. But
> if the data is damaged /after/ checksum but /before/ write, then you
> have a real problem...

To put this in perspective, ECC is a broad category. When we think of ECC
for memory, it is usually Single Error (bit) Correction, Double Error (bit)
Detection (SECDED). A well designed system will also do Single Device Data
Correction (aka Chipkill or Extended ECC, since Chipkill is trademarked).
What this means is that faults of more than 2 bits per word are not
detected, unless all of the faults occur in the same chip for SDDC cases.
Clearly, this wouldn't scale well to large data streams, which is why they
use checksums like Fletcher or hash functions like SHA-256.

>>> ZFS keeps disks pretty busy, so perhaps it loads the power supply to
>>> the point where it heats up and memory glitches are more likely.
>>
>> In general, for like configurations, ZFS won't keep a disk any more
>> busy than other file systems. In fact, because ZFS groups transactions,
>> it may create less activity than other file systems, such as UFS.
>
> That's a point in its favor, although not really relevant. If the disks
> are really busy they will load the PSU more, and that could drag the
> supply down, which in turn might make errors occur that otherwise
> wouldn't.

The dynamic loads of modern disk drives are not very great. I don't believe
your argument is very strong here. Also, the solution is, once again, fix
the hardware.

>> I think a better test would be to md5 the file from all systems and see
>> if the md5 hashes are the same. If they are, then yes, the finger would
>> point more in the direction of ZFS. The send/recv protocol hasn't
>> changed in quite some time, but it is arguably not as robust as it
>> could be.
>
> Thanks! md5 hash is exactly the kind of test I was looking for.
>
> md5sum on SPARC  9ec4f7da41741b469fcd7cb8c5040564  (local ZFS)
> md5sum on X86    9ec4f7da41741b469fcd7cb8c5040564  (remote NFS)

Good.

>>> ZFS send/recv use fletcher4 for the checksums.
>>
>> ZFS uses fletcher2 for data (by default) and fletcher4 for metadata.
>> The same fletcher code is used. So if you believe fletcher4 is broken
>> for send/recv, how do you explain that it works for the metadata? Or
>> does it? There may be another failure mode at work here... (see comment
>> on scrubs at the end of this extended post)
>
> [Did you forget the scrubs comment?]

No, you responded that you had been seeing scrubs fix errors.

> Never said it was broken. I assume the same code is used for both SPARC
> and X86, and it works fine on SPARC. It would seem that this machine
> gets memory errors so often (even though it passes the Linux memory
> diagnostic) that it can never get to the end of a 4GB recv stream. Odd
> that it can do the md5sum, but as mentioned, perhaps doing the I/O puts
> more strain on the machine and stresses it to where more memory faults
> occur. I can't quite picture a software bug that would cause random
> failures on specific hardware, and I am happy to give ZFS the benefit of
> the doubt.

Yes, software can trigger memory failures. More below...

>> It would have been nice if we were able to recover the contents of the
>> file; if you also know what was supposed to be there, you can diff and
>> then we can find out what was wrong.
>
> file on those files resulted in a bus error. Is there a way to actually
> read a file reported by ZFS as unrecoverable to do just that (and to
> separately retrieve the copy
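On the "fix the hardware" theme, two quick checks of what the platform
itself reports (coverage varies a lot by machine; treat this as a sketch):

    #!/bin/sh
    # prtdiag -v dumps platform diagnostics, including the memory
    # configuration; fmdump -e lists error reports FMA has logged.
    prtdiag -v | grep -i mem
    fmdump -e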
Re: [zfs-discuss] Errors on mirrored drive
Bob Friesenhahn wrote:
> If memory does not work, then you do have a real problem. The ZFS ARC
> consumes a large amount of memory. Note that the problem of corruption
> around the time of the checksum/write is minor compared to corruption in
> the ZFS ARC, since data is continually read from the ZFS ARC, and so bad
> data may be returned to the user even though it is (was?) fine on disk.
> This is as close as ZFS comes to having an Achilles' heel. Solving this
> problem would require crippling the system performance.

When running a DEBUG kernel (not something most people would do on a
production system), ZFS does actually checksum and verify the buffers in
the ARC - not on every access, but certain operations cause it to happen.

--
Darren J Moffat
Re: [zfs-discuss] eon or nexentacore or opensolaris
Maybe what you are saying is true wrt. NexentaCore 2.0. But hey, think
about open source principles and the development process. We do hope that
NexentaCore will become an official Debian distribution some day! We are
evolving and driven completely by the community here. Anyone can
participate, fix the bugs, and make it happen:
https://launchpad.net/distros/nexenta

As far as the commercial bits go:

1. NexentaStor is still based off 1.x. Once the 2.x branch is more or less
   polished we will make a safe transition.

2. ON patches go through serious stress testing, not only by Nexenta but
   also by the growing list of Nexenta partners - i.e. to ensure that the
   end solution is absolutely stable and safe:
   http://www.nexenta.com/partners

3. The development model of NexentaCore is indeed very much Debian-like.
   However, NexentaStor is developed with different rules in mind - rules
   of focused testing, conservative principles and partner-wide openness.

4. Is Debian helping NexentaStor to integrate stuff? Yes, absolutely!
   Lots of advantages here. Debian is NOT just package management as one
   could think - it is as well a polished distribution foundation.
   NexentaStor plugins, which are pretty much Debian packages, are used to
   extend NexentaStor capabilities. Learn more:
   http://www.nexenta.com/corp/index.php?option=com_jreviews&Itemid=112

C. Bergström wrote:
> Anil Gulecha wrote:
>> On Sat, May 23, 2009 at 1:19 PM, Bogdan M. Maryniuk
>> bogdan.maryn...@gmail.com wrote:
>>> On Sat, May 23, 2009 at 4:56 AM, Joe S js.li...@gmail.com wrote:
>>>> EON ZFS NAS http://eonstorage.blogspot.com/
>>> No idea.
>>>> NexentaCore Platform (v2.0 RC3) http://www.nexenta.org/os/NexentaCore
>>> Personally, I tried it a few times. For now, it is still too broken
>>> for me and looks scary. The previous version is much more stable, but
>>> also older. The newer v2.0 looks exactly like bleeding-edge Debian in
>>> the old days: each time you run apt-get upgrade you have to use a
>>> shaman's tambourine, dancing around the fireplace. I don't remember
>>> exactly, but some packages are just broken and cannot find
>>> dependencies, installation crashes, pollutes your system and cannot be
>>> restored nicely, etc. However, when it is not that broken anymore, it
>>> should be a great distribution with excellent package management and
>>> very convenient to use.
>>
>> Hi Bogdan,
>> Which particular packages were these? RC3 is quite stable, and all
>> server packages are solid. If you do face issues with a particular one,
>> we'd appreciate a bug report. All information on this is helpful.
>
> I've done some preliminary patch review on the core on-nexenta patches,
> and I'd concur to put Nexenta pretty low on the trusted list for
> enterprise storage. This is in addition to the packaging problems you've
> pointed out. If the issues at hand were not enough, when I sent an email
> to their dev list it was completely ignored. Marketing for Nexenta, as
> Anil points out, is strong, but like many other distributions outside
> Sun there's still a lot of work to go. I'm not sure about EON's update
> delivery, but I believe it's just a minimal repackage of the OpenSolaris
> release. This isn't the advocacy list, so if you're interested in other
> alternatives feel free to email me off list.
>
> Cheers,
> ./Christopher
> --
> OSUNIX - Built from the best of OpenSolaris Technology
> http://www.osunix.org
Re: [zfs-discuss] Errors on mirrored drive
On 26-May-09, at 10:21 AM, Frank Middleton wrote:
> On 05/26/09 03:23, casper@sun.com wrote:
>> And where exactly do you get the second good copy of the data?
>
> From the first. And if it is already bad, as noted previously, this is
> no worse than the UFS/ext3 case. If you want total freedom from this
> class of errors, use ECC.
>
>> If you copy the data you've just doubled your chance of using bad
>> memory. The original copy can be good or bad; the second copy cannot be
>> better than the first copy.
>
> The whole point is that the memory isn't bad. About once a month, 4GB of
> memory of any quality can experience 1 bit being flipped, perhaps more
> or less often.

What you are proposing does practically nothing to mitigate random bit
flips. Think about the probabilities involved. You're testing one tiny
buffer, very occasionally, for an extremely improbable event. It is also
nothing to do with ZFS, and leaves every other byte of your RAM untested.
See the reasoning?

--Toby
Re: [zfs-discuss] Errors on mirrored drive
On 25-May-09, at 11:16 PM, Frank Middleton wrote:
> On 05/22/09 21:08, Toby Thain wrote:
>> Yes, the important thing is to *detect* them. No system can run
>> reliably with bad memory, and that includes any system with ZFS. Doing
>> nutty things like calculating the checksum twice does not buy anything
>> of value here.
>
> All memory is bad if it doesn't have ECC. There are only varying degrees
> of badness. Calculating the checksum twice on its own would be nutty, as
> you say, but doing so on a separate copy of the data might prevent
> unrecoverable errors

I don't see this at all. The kernel reads the application buffer. How does
reading it twice buy you anything?? It sounds like you are assuming 1) the
buffer includes faulty RAM; and 2) the faulty RAM reads differently each
time. Doesn't that seem statistically unlikely to you? And even if you
really are chasing this improbable scenario, why make ZFS do the job of a
memory tester?

> after writes to mirrored drives. You can't detect memory errors if you
> don't have ECC. But you can try to mitigate them. Not doing so makes ZFS
> less reliable than the memory it is running on. The problem is that ZFS
> makes any file with a bad checksum inaccessible, even if one really
> doesn't care if the data has been corrupted. A workaround might be a way
> to allow such files to be readable despite the bad checksum...

I am not sure what you are trying to say here.

> ... How can a machine with bad memory work fine with ext3? It does. It
> works fine with ZFS too. Just really annoying unrecoverable files every
> now and then on mirrored drives. This shouldn't happen even with lousy
> memory and wouldn't (doesn't) with ECC. If there was a way to examine
> the files and their checksums, I would be surprised if they were
> different. (If they were, it would almost certainly be the controller or
> the PCI bus itself causing the problem.) But I speculate that it is
> predictable memory hits.

You're making this harder than it really is. Run a memory test. If it
fails, take the machine out of service until it's fixed. There's no
reasonable way to keep running faulty hardware.

--Toby
Re: [zfs-discuss] Errors on mirrored drive
Frank Middleton f.middle...@apogeect.com writes:

> Exactly. My whole point. And without ECC there's no way of knowing. But
> if the data is damaged /after/ checksum but /before/ write, then you
> have a real problem...

We can't do much to protect ourselves from damage to the data itself (an
extra copy in RAM will help little and ruin performance). Damage to the
bits holding the computed checksum before it is written can be alleviated
by doing the calculation independently for each written copy. In
particular, this will help if the bit error is transient.

Since the number of octets in RAM holding the checksum is dwarfed by the
number of octets occupied by the data (256 bits vs. one mebibit for a full
default-sized record), such a paranoia mode will most likely tell you that
the *data* is corrupt, not the checksum. But today you don't know, so it's
an improvement in my book.

> Quoting the ZFS admin guide: "The failmode property ... provides the
> failmode property for determining the behavior of a catastrophic pool
> failure due to a loss of device connectivity or the failure of all
> devices in the pool." Has this changed since the ZFS admin guide was
> last updated? If not, it doesn't seem relevant.

I guess checksum error handling is orthogonal to this and should have its
own property. It sure would be nice if the admin could ask the OS to
deliver the bits contained in a file, no matter what, and just log the
problem.

> Cheers -- Frank

Thank you for pointing out this potential weakness in ZFS' consistency
checking; I didn't realise it was there. Also thank you, all ZFS
developers, for your great job :-)

--
Kjetil T. Homme
Redpill Linpro AS - Changing the game
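For reference, the failmode property quoted above is set per pool (pool
name hypothetical), and, as Kjetil notes, it governs device loss rather
than checksum errors:

    #!/bin/sh
    # failmode may be wait (the default), continue, or panic; it controls
    # behaviour on catastrophic device failure, not on checksum errors.
    zpool set failmode=continue tank
    zpool get failmode tank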
[zfs-discuss] disabling showmount -e behaviour
I must admit that this question originates in the context of Sun's Storage
7210 product, which imposes additional restrictions on the kind of knobs I
can turn. But here's the setup: suppose I have an installation where ZFS is
the storage for user home directories. Since I need quotas, each directory
gets to be its own filesystem. Since I also need these homes to be
accessible remotely, each FS is exported via NFS.

Here's the question though: how do I prevent showmount -e (or a manually
constructed EXPORT/EXPORTALL RPC request) from disclosing a list of the
users that are hosted on a particular server?

Thanks,
Roman.
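For context, the setup described looks roughly like this (names are
hypothetical); every filesystem shared this way lands in the export list
that showmount -e reads:

    #!/bin/sh
    # One filesystem per user, so quotas can be set per user...
    zfs create -o quota=10g pool/home/alice
    zfs set sharenfs=on pool/home/alice
    # ...and this is exactly what exposes the user list to any client:
    showmount -e nfs-server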
Re: [zfs-discuss] eon or nexentacore or opensolaris
On Sun, May 24, 2009 at 6:11 PM, Anil Gulecha anil.ve...@gmail.com wrote:
> One example is StormOS, an XFCE-based distro being built on NCP2.
> According to the latest blog entry, a release is imminent. Perhaps
> you'll have a better desktop experience with this. (www.stormos.org)

So. Tried it just now. Shortly: I'd stay with OpenSolaris for at least a
year. :-)

--
Kind regards, bm