Re: [zfs-discuss] ZFS Dedup and bad checksums
2012-01-23 18:25, Jim Klimov wrote:
>>> 4) I did not get to check whether "dedup=verify" triggers a
>>> checksum mismatch alarm if the preexisting on-disk data does
>>> not in fact match the checksum.
>>
>> All checksum mismatches are handled the same way.
>
> I have yet to test (to be certain) whether writing over a
> block which is invalid on-disk and marked as deduped, with
> dedup=verify, would increase the CKSUM counter.

I checked (oi_148a LiveUSB), by writing a correct block (128KB) in
place of the corrupted one in the file, and:

* "dedup=on" neither fixed the on-disk file nor logged an error, and
  subsequent reads produced IO errors (and increased the counter).
  Probably just the DDT counter was increased during the write (that's
  the "works as designed" part);

* "dedup=verify" *doesn't* log a checksum error if it finds a block
  whose assumed checksum matches the newly written block, but whose
  contents differ from the new block during dedup-verification - and
  in fact these contents do not match the checksum either (at least,
  not the one in the block pointer). Reading the block produced no
  errors;

* what's worse, re-enabling "dedup=on" and writing the same block
  again crashes (reboots) the system instantly. Possibly because now
  there are two DDT entries pointing to the same checksum in different
  blocks, and no verification was explicitly requested?

A reenactment of the test (as a hopefully reproducible test case)
constitutes the remainder of the post, so it is going to be lengthy...
Analyze that! ;)

>>> I think such an alarm should exist and do as much as a scrub,
>>> read or other means of error detection and recovery would.

Statement/desire still stands.

>> Checksum mismatches are logged,

No, they are not (in this case).

>> what was your root cause?

Probably the same as before - some sort of existing on-disk data
corruption: something overwrote some sectors and raidz2 failed to
reconstruct the stripe. I seem to have had about a dozen such files.
Fixed some by rsync with different dedup settings, before going into
it all deeper. I am not sure if any of them had overlapping DVAs
(those which remain corrupted now don't), but many addresses lie in
very roughly similar address ranges (within several GBs or so). As
written above, at least in one case it was probably a random write by
a disk over existing sectors, invalidating the block.

Still, according to "works as designed" above, logging the mismatch so
far has no effect on not-using the old DDT entry pointing to corrupt
data.

Just in case, logged as https://www.illumos.org/issues/2015

REENACTMENT OF THE TEST CASE

Besides illustrating my error for those who decide to take on the bug,
I hope this post will also help others in their data recovery
attempts, zfs research, etc. If my methodology is faulty, I hope
someone points that out ;)

1) Uh, I have unrecoverable errors!

The computer was freshly rebooted, pool imported (with rollback), no
newly known CKSUM errors (but we have the nvlist of existing
mismatching files):

# zpool status -vx
  pool: pool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore
        the entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub repaired 244K in 138h57m with 31 errors on Sat Jan 14 01:50:16 2012
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            c6t0d0  ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
            c6t2d0  ONLINE       0     0     0
            c6t3d0  ONLINE       0     0     0
            c6t4d0  ONLINE       0     0     0
            c6t5d0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:
        <metadata>:<0x0>
        ...
        pool/mymedia:/incoming/DSCF1234.AVI

NOTE: I don't yet have full detail of the :<0x0> error; I've asked
about it on the list a number of times.
2) Mine some information about the file and error location

* mount the dataset

# zfs mount pool/mymedia

* find the inode number

# ls -i /pool/mymedia/incoming/DSCF1234.AVI
6313 /pool/mymedia/incoming/DSCF1234.AVI

* dump ZDB info

# zdb -dd pool/mymedia 6313 > /tmp/zdb.txt

* find the bad block offset

# dd if=/pool/mymedia/incoming/DSCF1234.AVI of=/dev/null \
   bs=512 conv=noerror,notrunc
dd: reading `/pool/mymedia/incoming/DSCF1234.AVI': I/O error
58880+0 records in
58880+0 records out
30146560 bytes (30 MB) copied, 676.772 s, 44.5 kB/s
(error repeated 256 times)
239145+1 records in
239145+1 records out
122442738 bytes (122 MB) copied, 2136.19 s, 57.3 kB/s

So the error is at offset 58880*512 bytes = 30146560 = 0x1CC0000,
and its size is 512b*256 = 128KB.

3) Review the /tmp/zdb.txt information

We need the L0 entry for the erroneous block and its parent L1 entry:
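A quick sanity check of the step-2 arithmetic (a throwaway Python
snippet of mine, not part of the original procedure):

```python
# Sanity-check the dd output above: 58880 good 512-byte records were
# copied before the first I/O error, then 256 consecutive records failed.
SECTOR = 512
good_records = 58880   # records dd copied before the first error
bad_records = 256      # consecutive erroring records

offset = good_records * SECTOR   # file offset of the bad L0 block
size = bad_records * SECTOR      # extent of the error

print(hex(offset))   # -> 0x1cc0000
print(size // 1024)  # -> 128 (one 128KB ZFS record)
```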
Re: [zfs-discuss] ZFS Dedup and bad checksums
2012-01-22 22:58, Richard Elling wrote:
> On Jan 21, 2012, at 6:32 AM, Jim Klimov wrote:
>> ...
>> So "it currently seems to me, that":
>>
>> 1) My on-disk data could get corrupted for whatever reason ZFS tries
>> to protect it from, at least once probably from misdirected writes
>> (i.e. the head landed not where it was asked to write). It can not
>> be ruled out that the checksums got broken in non-ECC RAM before
>> writes of block pointers for some of my data, thus leading to
>> mismatches. One way or another, ZFS noted the discrepancy during
>> scrubs and "normal" file accesses. There is no (automatic) way to
>> tell which part is faulty - checksum or data.
>
> Untrue. If a block pointer is corrupted, then on read it will be
> logged and ignored. I'm not sure you have grasped the concept of
> checksums in the parent object.

If a block pointer is corrupted on disk after the write - then yes, it
will not match the parent's checksum, and there would be another 1 or
2 ditto copies with possibly correct data. Is that the correct
grasping of the concept? ;)

Now, the (non-zero-probability) scenario I meant was that the checksum
for the block was calculated and then corrupted in RAM/CPU before the
ditto blocks were fanned out to disks, and before the parent block
checksums were calculated. In this case the on-disk data block is
correct as compared to other sources (if it is copies=2 - it may even
be the same as its other copy), but it does not match the BP's
checksum, while the BP tree seems valid (all tree checksums match). I
believe in this case ZFS should flag a data checksum mismatch,
although in reality (with minuscule probability) it is the bad
checksum mismatching the good data.

Anyway, the situation would look the same if the data block itself was
corrupted in RAM before fanning out with copies>1, and that is more
probable given the size of such a block compared to the 256 bits of
the checksum.
Just *HOW* probable that is on an ECC or a non-ECC system, with or
without an overclocked overheated CPU, in an enthusiast's overpumped
workstation or an unsuspecting consumer's dusty closet - that is a
separate maths question, with different answers for different models.
Random answer: on par with disk UBER errors, which ZFS by design
considers serious enough to combat.

>> 2) In the case where on-disk data did get corrupted, the checksum
>> in the block pointer was correct (matching original data), but the
>> raidz2 redundancy did not aid recovery.
>
> I think your analysis is incomplete.

As I last wrote, I dumped the blocks with ZDB and compared the bytes
with the same block from a good copy. Notably, that copy had the same
SHA256 checksum as was stored in my problematic pool's block-pointer
entry for the corrupt block.

These blocks differed in three sets of 4096 bytes, starting at "round"
offsets at even intervals (4KB, 36KB, 68KB). 4KB is my disks' block
size. It seems that some disk(s?) overwrote existing data, or got
scratched, or whatever (no IO errors in dmesg though). I am not
certain why raidz2 did not suffice to fix the block, and what garbage
or data exists on all 6 drives - I did not get zdb to dump all 0x3
bytes of raidz2 raw data to try permutations myself.

Possibly, for whatever reason (such as a cabling error, or some
firmware error given the same model of the drives), several drives got
the same erroneous write command at once, and ultimately invalidated
parts of the same stripe. Many of the files now in peril have existed
on the pool for some time, and scrubs completed successfully many
times.

> Have you determined the root cause?

Unfortunately, I'm currently in another country, away from my home-NAS
server. So all physical maintenance, including pushing the reset
button, is done by friends living in the apartment, and there is not
much physical examination that can be done this way.
At one point in time recently (during a scrub in January), one of the
disks got lost and was not seen by the motherboard even after reboots,
so I had my friends take out and replug the SATA cables. This helped,
so connector noise was possibly the root cause. It might also account
for an incorrect address for a certain write that slashed randomly on
the platter.

The PSU is excessive for the box's requirements, with slack
performance to degrade ;) The P4 CPU is not overclocked. RAM is
non-ECC, and that is not changeable given the Intel CPU, chipset and
motherboard. HDDs are on the MB's controller. The 6 HDDs in the raidz2
pool are consumer-grade SATA Seagate ST2000DL003-9VT166, firmware
CC32. Degrading cabling and/or connectors can indeed be one of about
two main causes, the other being non-ECC RAM. Or an aging CPU.

>> 3) The file in question was created on a dataset with enabled
>> deduplication, so at the very least the dedup bit was set on the
>> corrupted block's pointer and a DDT entry likely existed. Attempts
>> to rewrite the block with the original one (having "dedup=on")
>> failed in fact, probably because the matching checksum was already
>> in DDT.
>
> Works as designed. If this is the case,
Re: [zfs-discuss] ZFS Dedup and bad checksums
On Jan 21, 2012, at 6:32 AM, Jim Klimov wrote:

> 2012-01-21 0:33, Jim Klimov wrote:
>> 2012-01-13 4:12, Jim Klimov wrote:
>>> As I recently wrote, my data pool has experienced some
>>> "unrecoverable errors". It seems that a userdata block
>>> of deduped data got corrupted and no longer matches the
>>> stored checksum. For whatever reason, raidz2 did not
>>> help in recovery of this data, so I rsync'ed the files
>>> over from another copy. Then things got interesting...
>>
>> Well, after some crawling over my data with zdb, od and dd,
>> I guess ZFS was right about finding checksum errors - the
>> metadata's checksum matched that of a block on the original
>> system, and the data block was indeed erring.
>
> Well, as I'm moving to close my quest with broken data, I'd
> like to draw up some conclusions and RFEs. I am still not
> sure if they are factually true, I'm still learning the ZFS
> internals. So "it currently seems to me, that":
>
> 1) My on-disk data could get corrupted for whatever reason
> ZFS tries to protect it from, at least once probably
> from misdirected writes (i.e. the head landed not where
> it was asked to write). It can not be ruled out that the
> checksums got broken in non-ECC RAM before writes of
> block pointers for some of my data, thus leading to
> mismatches. One way or another, ZFS noted the discrepancy
> during scrubs and "normal" file accesses. There is no
> (automatic) way to tell which part is faulty - checksum
> or data.

Untrue. If a block pointer is corrupted, then on read it will be
logged and ignored. I'm not sure you have grasped the concept of
checksums in the parent object.

> 2) In the case where on-disk data did get corrupted, the
> checksum in block pointer was correct (matching original
> data), but the raidz2 redundancy did not aid recovery.

I think your analysis is incomplete. Have you determined the root
cause?
> 3) The file in question was created on a dataset with enabled
> deduplication, so at the very least the dedup bit was set
> on the corrupted block's pointer and a DDT entry likely
> existed. Attempts to rewrite the block with the original
> one (having "dedup=on") failed in fact, probably because
> the matching checksum was already in DDT.

Works as designed.

> Rewrites of such blocks with "dedup=off" or "dedup=verify"
> succeeded.
>
> Failure/success were tested by "sync; md5sum FILE" some
> time after the fix attempt. (When done just after the
> fix, the test tends to return success even if the on-disk
> data is bad, "thanks" to caching.)

No, I think you've missed the root cause. By default, data that does
not match its checksum is not used.

> My last attempt was to set "dedup=on" and write the block
> again and sync; the (remote) computer hung instantly :(
>
> 3*) The RFE stands: deduped blocks found to be invalid and not
> recovered by redundancy should somehow be evicted from DDT
> (or marked for required verification-before-write) so as
> not to pollute further writes, including repair attempts.
>
> Alternatively, "dedup=verify" takes care of the situation
> and should be the recommended option.

I have lobbied for this, but so far people prefer performance to
dependability.

> 3**) It was suggested to set "dedupditto" to small values,
> like "2". My oi_148a refused to set values smaller than 100.
> Moreover, it seems reasonable to have two dedupditto values:
> for example, to make a ditto copy when the DDT reference
> counter exceeds some small value (2-5), and add ditto copies
> every "N" references for frequently-referenced data (every
> 64-128).
>
> 4) I did not get to check whether "dedup=verify" triggers a
> checksum mismatch alarm if the preexisting on-disk data
> does not in fact match the checksum.

All checksum mismatches are handled the same way.
> I think such an alarm should exist and do as much as a scrub,
> read or other means of error detection and recovery would.

Checksum mismatches are logged, what was your root cause?

> 5) It seems like a worthy RFE to include a pool-wide option to
> "verify-after-write/commit" - to test that recent TXG sync
> data has indeed made it to disk on (consumer-grade) hardware
> into the designated sector numbers. Perhaps the test should
> be delayed several seconds after the sync writes.

There are highly-reliable systems that do this in the fault-tolerant
market.

> If the verification fails, currently cached data from recent
> TXGs can be recovered from on-disk redundancy and/or still
> exist in RAM cache, and be rewritten again (and tested again).
>
> More importantly, a failed test *may* mean that the write
> landed on disk randomly, and the pool should be scrubbed
> ASAP. It may be guessed that the yet-unknown error can lie
> within "epsilon" tracks (sector numbers) from the currently
> found non-written data, so if it is possible to scrub just
> a portion of the pool based on
Re: [zfs-discuss] ZFS Dedup and bad checksums
2012-01-22 0:55, Bob Friesenhahn wrote:
> On Sun, 22 Jan 2012, Jim Klimov wrote:
>> So far I rather considered "flaky" hardware with lousy consumer
>> qualities. The server you describe is likely to exceed that bar ;)
>
> The most common "flaky" behavior of consumer hardware which causes
> troubles for zfs is not honoring cache-related requests.
> Unfortunately, it is not possible for zfs to fix such hardware. Zfs
> works best with hardware which does what it is told.

Also true. That's what the "option" stood for in my proposal: since
the verification feature is going to be expensive and add random IOs,
we don't want to enforce it on everybody. Besides, the user might
choose to trust his reliable and expensive hardware, like a SAN/NAS
with battery-backed NVRAM, which is indeed likely better than a
homebrewn NAS box with random HDDs thrown in with no measure, but with
a desire for some reliability nonetheless ;)

We can "expect" the individual HDDs' caches to get expired after some
time (i.e. after we've sent 64MB worth of writes to a particular disk
with a 64MB cache), and after that we are likely to get true media
reads. That's when the verification reads are likely to return the
most relevant (on-disk) sectors...

//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
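As a rough illustration of that rule of thumb (the figures are
examples from the paragraph above, not measurements, and the
FIFO-like eviction model is an assumption):

```python
# Roughly how much subsequent write traffic should push one ZFS record
# out of a drive's volatile cache, assuming simple FIFO-like eviction.
cache_size = 64 * 1024 * 1024    # 64MB drive cache, as in the example
record_size = 128 * 1024         # one 128KB ZFS record

print(cache_size // record_size)  # -> 512 later records until eviction
```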
Re: [zfs-discuss] ZFS Dedup and bad checksums
On Sun, 22 Jan 2012, Jim Klimov wrote:
> So far I rather considered "flaky" hardware with lousy consumer
> qualities. The server you describe is likely to exceed that bar ;)

The most common "flaky" behavior of consumer hardware which causes
troubles for zfs is not honoring cache-related requests.
Unfortunately, it is not possible for zfs to fix such hardware. Zfs
works best with hardware which does what it is told.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS Dedup and bad checksums
2012-01-21 20:50, Bob Friesenhahn wrote:
>> TXGs get forgotten from memory as soon as they are written.

As I said, that can be arranged - i.e. free the TXG cache only after
the corresponding TXG number has been verified? The point about the
ARC being overwritten seems valid...

> Zfs already knows how to by-pass the ARC. However, any "media" reads
> are subject to caching since the underlying devices try very hard to
> cache data in order to improve read performance.

As a pointer, the "format" command presents options to disable
(separately) read and write caching on the drives it sees. MAYBE there
is some option to explicitly read data from media, like sync-writes.
Whether the drive firmwares honor that (disabling caching and/or such
hypothetical sync-reads) is something out of ZFS's control. But we can
make the best effort...

> As an extreme case of caching, consider a device represented by an
> iSCSI LUN on an OpenSolaris server with 512GB of RAM. If you request
> to read data you are exceedingly likely to read data from the zfs
> ARC on that server rather than underlying "media".

So far I rather considered "flaky" hardware with lousy consumer
qualities. The server you describe is likely to exceed that bar ;)

Besides, if this OpenSolaris server is up-to-date, it would do such
media checks itself, and/or honour the sync-read requests or temporary
cache disabling ;) Of course, this can't be guaranteed of other
devices, so in general ZFS can do best-effort verification.

//Jim
Re: [zfs-discuss] ZFS Dedup and bad checksums
On Sat, 21 Jan 2012, Jim Klimov wrote:
> Regarding the written data, I believe it may find a place in the
> ARC, and for the past few TXGs it could still remain there.

Any data in the ARC is subject to being overwritten with updated data
just a millisecond later. It is a live cache.

> I am not sure it is feasible to "guarantee" that it remains in RAM
> for a certain time. Also there should be a way to enforce media
> reads and not ARC re-reads when verifying writes...

Zfs already knows how to by-pass the ARC. However, any "media" reads
are subject to caching since the underlying devices try very hard to
cache data in order to improve read performance.

As an extreme case of caching, consider a device represented by an
iSCSI LUN on an OpenSolaris server with 512GB of RAM. If you request
to read data you are exceedingly likely to read data from the zfs ARC
on that server rather than underlying "media".

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS Dedup and bad checksums
2012-01-21 19:18, Bob Friesenhahn wrote:
> On Sat, 21 Jan 2012, Jim Klimov wrote:
>> 5) It seems like a worthy RFE to include a pool-wide option to
>> "verify-after-write/commit" - to test that recent TXG sync data has
>> indeed made it to disk on (consumer-grade) hardware into the
>> designated sector numbers. Perhaps the test should be delayed
>> several seconds after the sync writes.
>
> This is an interesting idea. I think that you would want to do a
> mini-scrub on a TXG at least one behind the last one written since
> otherwise any test would surely be foiled by caching. The ability to
> restore data from RAM is doubtful since TXGs get forgotten from
> memory as soon as they are written.

That could be rearranged as part of the bug/RFE resolution ;)

Regarding the written data, I believe it may find a place in the ARC,
and for the past few TXGs it could still remain there. I am not sure
it is feasible to "guarantee" that it remains in RAM for a certain
time. Also there should be a way to enforce media reads and not ARC
re-reads when verifying writes...

//Jim
Re: [zfs-discuss] ZFS Dedup and bad checksums
On Sat, 21 Jan 2012, Jim Klimov wrote:
> 5) It seems like a worthy RFE to include a pool-wide option to
> "verify-after-write/commit" - to test that recent TXG sync data has
> indeed made it to disk on (consumer-grade) hardware into the
> designated sector numbers. Perhaps the test should be delayed
> several seconds after the sync writes.

This is an interesting idea. I think that you would want to do a
mini-scrub on a TXG at least one behind the last one written since
otherwise any test would surely be foiled by caching. The ability to
restore data from RAM is doubtful since TXGs get forgotten from memory
as soon as they are written.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
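For what it's worth, the verify-after-write idea can be caricatured in
a few lines of Python (ordinary file I/O stands in for raw vdev
access, which the real feature would need in order to bypass the ARC
and drive caches; the function name and journal layout are made up):

```python
import hashlib
import os
import time

def write_with_deferred_verify(path, blocks, delay=0.0):
    """Write (offset, data) blocks, sync, wait a while, then re-read
    and compare checksums. Returns offsets that failed verification -
    candidates for a rewrite from cache/redundancy and a targeted
    scrub around those addresses."""
    journal = []
    with open(path, "wb") as f:
        for off, data in blocks:
            f.seek(off)
            f.write(data)
            journal.append((off, len(data), hashlib.sha256(data).digest()))
        f.flush()
        os.fsync(f.fileno())      # the "TXG sync"
    time.sleep(delay)             # "delayed several seconds after the sync"
    bad = []
    with open(path, "rb") as f:
        for off, length, want in journal:
            f.seek(off)
            if hashlib.sha256(f.read(length)).digest() != want:
                bad.append(off)
    return bad
```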
Re: [zfs-discuss] ZFS Dedup and bad checksums
2012-01-21 0:33, Jim Klimov wrote:
> 2012-01-13 4:12, Jim Klimov wrote:
>> As I recently wrote, my data pool has experienced some
>> "unrecoverable errors". It seems that a userdata block
>> of deduped data got corrupted and no longer matches the
>> stored checksum. For whatever reason, raidz2 did not
>> help in recovery of this data, so I rsync'ed the files
>> over from another copy. Then things got interesting...
>
> Well, after some crawling over my data with zdb, od and dd,
> I guess ZFS was right about finding checksum errors - the
> metadata's checksum matched that of a block on the original
> system, and the data block was indeed erring.

Well, as I'm moving to close my quest with broken data, I'd like to
draw up some conclusions and RFEs. I am still not sure if they are
factually true, I'm still learning the ZFS internals. So "it currently
seems to me, that":

1) My on-disk data could get corrupted for whatever reason ZFS tries
   to protect it from, at least once probably from misdirected writes
   (i.e. the head landed not where it was asked to write). It can not
   be ruled out that the checksums got broken in non-ECC RAM before
   writes of block pointers for some of my data, thus leading to
   mismatches. One way or another, ZFS noted the discrepancy during
   scrubs and "normal" file accesses. There is no (automatic) way to
   tell which part is faulty - checksum or data.

2) In the case where on-disk data did get corrupted, the checksum in
   the block pointer was correct (matching original data), but the
   raidz2 redundancy did not aid recovery.

3) The file in question was created on a dataset with enabled
   deduplication, so at the very least the dedup bit was set on the
   corrupted block's pointer and a DDT entry likely existed. Attempts
   to rewrite the block with the original one (having "dedup=on")
   failed in fact, probably because the matching checksum was already
   in DDT.

   Rewrites of such blocks with "dedup=off" or "dedup=verify"
   succeeded.
Failure/success were tested by "sync; md5sum FILE" some time after the
fix attempt. (When done just after the fix, the test tends to return
success even if the on-disk data is bad, "thanks" to caching.)

My last attempt was to set "dedup=on" and write the block again and
sync; the (remote) computer hung instantly :(

3*) The RFE stands: deduped blocks found to be invalid and not
    recovered by redundancy should somehow be evicted from the DDT
    (or marked for required verification-before-write) so as not to
    pollute further writes, including repair attempts.

    Alternatively, "dedup=verify" takes care of the situation and
    should be the recommended option.

3**) It was suggested to set "dedupditto" to small values, like "2".
     My oi_148a refused to set values smaller than 100. Moreover, it
     seems reasonable to have two dedupditto values: for example, to
     make a ditto copy when the DDT reference counter exceeds some
     small value (2-5), and add ditto copies every "N" references for
     frequently-referenced data (every 64-128).

4) I did not get to check whether "dedup=verify" triggers a checksum
   mismatch alarm if the preexisting on-disk data does not in fact
   match the checksum.

   I think such an alarm should exist and do as much as a scrub, read
   or other means of error detection and recovery would.

5) It seems like a worthy RFE to include a pool-wide option to
   "verify-after-write/commit" - to test that recent TXG sync data
   has indeed made it to disk on (consumer-grade) hardware into the
   designated sector numbers. Perhaps the test should be delayed
   several seconds after the sync writes.

   If the verification fails, currently cached data from recent TXGs
   can be recovered from on-disk redundancy and/or still exist in RAM
   cache, and be rewritten again (and tested again).

   More importantly, a failed test *may* mean that the write landed
   on disk randomly, and the pool should be scrubbed ASAP.
It may be guessed that the yet-unknown error can lie within "epsilon"
tracks (sector numbers) from the currently found non-written data, so
if it is possible to scrub just a portion of the pool based on DVAs -
that's a preferred start. It is possible that some data can be
recovered if it is tended to ASAP (i.e. on mirror, raidz, copies>1)...

Finally, I should say I'm sorry for lame questions arising from not
reading the format spec and zdb blogs carefully ;) In particular, it
was my understanding for a long time that block pointers each have a
sector of their own, leading to the overheads that I've seen. Now I
know (and checked) that most of the block-pointer tree is made of
larger groupings (128 blkptr_t's in a single 16KB block), reducing the
impact of BPs on fragmentation and/or the slacky waste of large
sectors that I predicted and expected for the past year. Sad that
nobody ever contradicted that (mis)understanding of mine.

//Jim Klimov
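The grouping figures above are easy to double-check (the 128-byte
blkptr_t size is taken from the on-disk format spec; trivial Python):

```python
# Back-of-the-envelope check of the indirect-block figures above.
BLKPTR_SIZE = 128            # sizeof(blkptr_t) in bytes, per the spec
INDIRECT_BLOCK = 16 * 1024   # one 16KB indirect (L1+) block

print(INDIRECT_BLOCK // BLKPTR_SIZE)  # -> 128 blkptr_t entries per block
```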
Re: [zfs-discuss] ZFS Dedup and bad checksums
2012-01-13 4:12, Jim Klimov wrote:
> As I recently wrote, my data pool has experienced some
> "unrecoverable errors". It seems that a userdata block of deduped
> data got corrupted and no longer matches the stored checksum. For
> whatever reason, raidz2 did not help in recovery of this data, so I
> rsync'ed the files over from another copy. Then things got
> interesting...

Well, after some crawling over my data with zdb, od and dd, I guess
ZFS was right about finding checksum errors - the metadata's checksum
matched that of a block on the original system, and the data block was
indeed erring.

Just in case it helps others, the SHA256 checksums can be tested with
openssl as I show below. I am still searching for a command-line
fletcher4/fletcher2 checker (as that weaker hash is used on metadata;
I wonder why).

Here's a tail from the on-disk blkptr_t, the bytes with the checksum:

# tail -2 /tmp/osx.l0+110.blkptr.txt
000460 1f 6f 4c 73 5d c1 ab 15 00 cc 56 90 38 8e b4 dd
000470 a9 8e 54 6f f1 a7 db 43 7d 61 9e 01 23 45 2e 70

In byte 0x435 I have the value 0x8 - SHA256. And here is the SHA256
hash of the excerpt from the original file (128KB cut out with dd):

# dd if=osx.zip of=/tmp/osx.l0+110.bin.orig bs=512 skip=34816 count=256
# openssl dgst -sha256 < /tmp/osx.l0+110.bin.orig
15abc15d734c6f1fddb48e389056cc0043dba7f16f548ea9702e4523019e617d

As my x86 is little-endian, the four 8-byte words of the checksum
appear reversed. But you can see it matches, so my source file is
okay.

I did not find the DDT entries (yet), so I don't know what hash is
there or what addresses it references for how many files. The block
pointer has the dedup bit set, though. However, of all my files with
errors, there are no DVA overlaps.

I hexdumped (with od) the two 128KB excerpts (one from the original
file, another fetched with zdb) and diffed them; while some lines
matched, others did not.
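The byte-order juggling above can be shown with a short Python snippet
(the zero-filled stand-in block is mine; a real check would read the
128KB excerpt cut out with dd):

```python
import hashlib
import struct

# Stand-in for the 128KB excerpt (all zeros, just to show the mechanics).
block = b"\x00" * 131072
digest = hashlib.sha256(block).digest()

# ZFS stores the 256-bit checksum as four 64-bit words in native byte
# order, so on little-endian x86 each 8-byte word of the digest shows
# up byte-reversed in a raw blkptr_t dump compared to openssl's output.
as_on_disk = struct.unpack("<4Q", digest)   # word values as dumped on x86
as_openssl = struct.unpack(">4Q", digest)   # word values as openssl prints

print(["%016x" % w for w in as_openssl])
```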
What is more interesting is that most of the error area contains a
repeating pattern like this, sometimes with "extra" chars thrown in:

fc 42 fc 42 fc 42 fc 42 fc 42 fc fc 42 1f fc 42 fc 42 42 ff fc 42
fc 42 fc 42 fc 42

I have seen similar patterns when I zdb-dumped compressed blocks
without decompression, so I guess this could be a miswrite of
compressed data and/or parity destined for another file (which also
did not get it). The erroneous data starts and ends at "round" offsets
like 0x1000-0x2000, 0x9000-0xa000, 0x11000-0x12000 (a step of 0x8000
between both sets of mismatches; the size of 4KB is my disk sector
size), which also suggests a non-coincidental problem. However, part
of the differing data is "normal-looking random noise", while some
part is the pattern above, starting and ending at seemingly random
locations mid-sector.

Here's about all I have to say and share so far :) Open to suggestions
on how to compute fletcher checksums on blocks...

Thanks,
//Jim
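On the fletcher question: below is my reading of the algorithm as a
small Python sketch (fletcher4 as four running 64-bit sums over
little-endian 32-bit words, per the on-disk format documentation). I
have not verified it against zdb output, so treat it as a starting
point rather than a finished checker:

```python
import struct

def fletcher4(data: bytes):
    """fletcher_4 as I understand it: four 64-bit accumulators updated
    once per little-endian 32-bit word, truncated modulo 2**64."""
    a = b = c = d = 0
    mask = (1 << 64) - 1
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & mask
        b = (b + a) & mask
        c = (c + b) & mask
        d = (d + c) & mask
    return a, b, c, d

# Example: the two words 1 and 2 give running sums (3, 4, 5, 6).
print(fletcher4(struct.pack("<2I", 1, 2)))  # -> (3, 4, 5, 6)
```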
Re: [zfs-discuss] ZFS Dedup and bad checksums
2012-01-13 4:26, Richard Elling wrote:
> On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:
>> The problem was solved by disabling dedup for the dataset involved
>> and rsync-updating the file in-place. After the dedup feature was
>> disabled and new blocks were uniquely written, everything was
>> readable (and md5sums matched) as expected.
>
> In theory, the verify option will correct this going forward.

Well, I have got more complaining blocks, and even new errors in files
that I had previously "repaired" with rsync, before I figured out the
problem with dedup today. Now I've set the verify flag instead of
dedup=off, and the rsync replacement from external storage seems to
happen a lot faster. It also seems to persist even a few minutes after
the copying ;)

Thanks for the tip, Richard!
//Jim Klimov
Re: [zfs-discuss] ZFS Dedup and bad checksums
2012-01-13 5:34, Daniel Carosone wrote:
> On Fri, Jan 13, 2012 at 05:16:36AM +0400, Jim Klimov wrote:
>> Either I misunderstand some of the above, or I fail to see how
>> verification would eliminate this failure mode (namely, as per my
>> suggestion, replace the bad block with a good one and have all
>> references updated and block-chains -> files fixed with one shot).
>
> It doesn't update past data. It gets treated as if there were a hash
> collision, and the new data is really different despite having the
> same checksum, and so gets written out instead of incrementing the
> existing DDT pointer. So it addresses your ability to recover the
> primary filesystem by overwriting with same data, that dedup was
> previously defeating.

But (yes/no?) I have to do this repair file-by-file, either with
dedup=off or dedup=verify. Actually, that's what I properly should do
if there is such a serious error, but what if the original data is not
available, so I can't fix it file-by-file, or if there are very many
errors (read: DDT references from a number of files just under the
dedupditto value) and such a match-and-repair procedure is
prohibitively inconvenient, slow, whatever?

Say, previously we trusted the hash algorithm: the same checksums mean
identical blocks. With such trust, the user might want to replace the
faulty block with another one (matching the checksum) and expect ALL
deduped files that used this block to become automagically recovered.
Chances are, they actually would be correct (by external
verification). And if we trust unverified dedup in the first place,
there is nothing wrong with such an approach to repair. It would not
make possible errors worse than there were in the originally saved
on-disk data (even if there were hash collisions of really-different
blocks - the user had discarded that difference long ago).

I think the user should be given an (informed) ability to shoot
himself in the foot or recover data, depending on his luck.
Anyway, people are doing it thanks to Max Bruning's or Viktor Latushkin's posts and direct help, or they research hardcore internals of ZFS. We might as well play along and increase their chances of success, even if unsupported and unguaranteed - no?

This situation with "obscured" recovery methods reminds me of prohibited changes of firmware on cell phones: customers are allowed to sit on a phone or drop it into a sink, and perhaps have it replaced, but they are not allowed to install different software. Many still do.

//Jim Klimov
Re: [zfs-discuss] ZFS Dedup and bad checksums
On Fri, Jan 13, 2012 at 05:16:36AM +0400, Jim Klimov wrote:
> 2012-01-13 4:26, Richard Elling wrote:
>> On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:
>>> Alternatively (opportunistically), a flag might be set
>>> in the DDT entry requesting that a new write matching
>>> this stored checksum should get committed to disk - thus
>>> "repairing" all files which reference the block (at least,
>>> stopping the IO errors).
>>
>> verify eliminates this failure mode.
>
> Thinking about it... got more questions:
>
> In this case: DDT/BP contain multiple references with
> correct checksums, but the on-disk block is bad.
> The newly written block has the same checksum, and verification
> proves that the on-disk data is different byte-to-byte.
>
> 1) How does the write-stack interact with those checksums
>    that do not match the data? Would any checksum be tested
>    for this verification read of existing data at all?
>
> 2) It would make sense for the failed verification to
>    have the new block committed to disk, and a new DDT
>    entry with the same checksum created. I would normally
>    expect this to be the new unique block of a new file,
>    and have no influence on existing data (block chains).
>    However in the discussed problematic case, this safe
>    behavior would also mean not contributing to the reparation
>    of those existing block chains which include the
>    mismatching on-disk block.
>
> Either I misunderstand some of the above, or I fail to
> see how verification would eliminate this failure mode
> (namely, as per my suggestion, replace the bad block
> with a good one and have all references updated and
> block-chains -> files fixed in one shot).

It doesn't update past data. It gets treated as if there were a hash collision: the new data is really different despite having the same checksum, and so gets written out instead of incrementing the existing DDT pointer. So it addresses your ability to recover the primary filesystem by overwriting with the same data, which dedup was previously defeating.
-- Dan.
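Dan's description of the verify write path can be modeled with a small sketch. This is hypothetical Python, not the actual ZFS code paths: with verify, a byte-for-byte mismatch against the existing block is treated like a hash collision and the new data is written out separately; without verify, the DDT refcount is simply bumped, even when the on-disk copy is silently corrupt.

```python
# Hypothetical sketch of the dedup write decision, NOT actual ZFS code.
# `read_block`/`write_block` stand in for the I/O layer; `ddt` maps a
# checksum to {"addr", "refcount"}.

def dedup_write(ddt, checksum, new_data, verify, read_block, write_block):
    entry = ddt.get(checksum)
    if entry is None:
        # First block with this checksum: write it and create a DDT entry.
        addr = write_block(new_data)
        ddt[checksum] = {"addr": addr, "refcount": 1}
        return addr
    if verify:
        on_disk = read_block(entry["addr"])
        if on_disk != new_data:
            # Treated as a collision: the new data is written out on its
            # own. Note that nothing here logs a checksum error, even if
            # the on-disk bytes no longer match `checksum` at all.
            return write_block(new_data)
    # Without verify (or when the bytes match), just reference the old
    # block - even if the on-disk copy is corrupt.
    entry["refcount"] += 1
    return entry["addr"]
```

Run against a toy in-memory store, this reproduces both behaviors discussed in the thread: with dedup=on a rewrite of good data only bumps the refcount of the corrupt block, while dedup=verify allocates a fresh, good copy.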
Re: [zfs-discuss] ZFS Dedup and bad checksums
2012-01-13 4:26, Richard Elling wrote:
On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:
>> Alternatively (opportunistically), a flag might be set in the DDT entry requesting that a new write matching this stored checksum should get committed to disk - thus "repairing" all files which reference the block (at least, stopping the IO errors).
>
> verify eliminates this failure mode.

Thinking about it... got more questions:

In this case: DDT/BP contain multiple references with correct checksums, but the on-disk block is bad. The newly written block has the same checksum, and verification proves that the on-disk data is different byte-to-byte.

1) How does the write-stack interact with those checksums that do not match the data? Would any checksum be tested for this verification read of existing data at all?

2) It would make sense for the failed verification to have the new block committed to disk, and a new DDT entry with the same checksum created. I would normally expect this to be the new unique block of a new file, and have no influence on existing data (block chains). However in the discussed problematic case, this safe behavior would also mean not contributing to the reparation of those existing block chains which include the mismatching on-disk block.

Either I misunderstand some of the above, or I fail to see how verification would eliminate this failure mode (namely, as per my suggestion, replace the bad block with a good one and have all references updated and block-chains -> files fixed in one shot). Would you please explain?

Thanks,
//Jim Klimov
Re: [zfs-discuss] ZFS Dedup and bad checksums
2012-01-13 4:26, Richard Elling wrote:
On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:
>> As I recently wrote, my data pool has experienced some "unrecoverable errors". It seems that a userdata block of deduped data got corrupted and no longer matches the stored checksum. For whatever reason, raidz2 did not help in recovery of this data, so I rsync'ed the files over from another copy. Then things got interesting...
>>
>> Bug alert: it seems the block-pointer block with that mismatching checksum did not get invalidated, so my attempts to rsync known-good versions of the bad files from an external source seemed to work, but in fact failed: subsequent reads of the files produced IO errors. Apparently (my wild guess), upon writing the blocks, checksums were calculated and the matching DDT entry was found. ZFS did not care that the entry pointed to inconsistent data (no longer matching the checksum); it still increased the DDT counter.
>>
>> The problem was solved by disabling dedup for the dataset involved and rsync-updating the file in-place. After the dedup feature was disabled and new blocks were uniquely written, everything was readable (and md5sums matched) as expected.
>>
>> I think of a couple of solutions:
>
> In theory, the verify option will correct this going forward.

But in practice there are many suggestions to disable verification because it slows down writes beyond what the DDT alone does to performance, and since there is only some 10^-77 chance that two different blocks would have the same checksum, it is there only for paranoiacs.

>> If the block is detected to be corrupt (checksum mismatches the data), the checksum value in blockpointers and DDT should be rewritten to an "impossible" value, perhaps all-zeroes or such, when the error is detected.
>
> What if it is a transient fault?

Reread the disk, retest the checksums?.. I don't know...
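The failure mode described above (the rsync "repair" seems to work, yet reads still fail) comes down to the read path: the block pointer carries the expected checksum, and every read recomputes it over whatever bytes come back. A minimal model, as a hypothetical Python sketch rather than the actual illumos code:

```python
# Hypothetical sketch of a checksummed read, NOT the actual ZFS code.
# If the bytes read back do not hash to the checksum stored in the
# block pointer, the read fails with EIO (and the CKSUM counter would
# be bumped).

import hashlib

def zfs_read(block_ptr, read_block):
    data = read_block(block_ptr["addr"])
    if hashlib.sha256(data).hexdigest() != block_ptr["checksum"]:
        raise IOError("checksum mismatch (EIO)")
    return data
```

Because the deduped rewrite never replaced the bytes at the old address, this check keeps failing until the file is rewritten with dedup=off (or dedup=verify), which allocates a fresh block with good bytes.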
:)

>> Alternatively (opportunistically), a flag might be set in the DDT entry requesting that a new write matching this stored checksum should get committed to disk - thus "repairing" all files which reference the block (at least, stopping the IO errors).
>
> verify eliminates this failure mode.

Sounds true; I didn't try that, though. But my scrub is not yet complete, so maybe there will be more test subjects ;)

>> Alas, so far there is anyway no guarantee that it was not the checksum itself that got corrupted (except for using ZDB to retrieve the block contents and matching that with a known-good copy of the data, if any), so corruption of the checksum would also cause replacement of "really-good-but-normally-inaccessible" data.
>
> Extremely unlikely. The metadata is also checksummed. To arrive here you would have to have two corruptions, each of which generates the proper checksum. Not impossible, but… I'd buy a lottery ticket instead.

I rather meant the opposite: the file data is actually good, but the checksums (apparently both the DDT and BlockPointer ones, with all their ditto copies) are bad, either due to disk rot or RAM failures. For example, are the "blockpointer" and "dedup" versions of the sha256 checksum recalculated by both stages, or reused, on writes of a block?..

> See also dedupditto. I could argue that the default value of dedupditto should be 2 rather than "off".

I couldn't set it to smallish values (like 64) on oi_148a LiveUSB:

root@openindiana:~# zpool set dedupditto=64 pool
cannot set property for 'pool': invalid argument for this pool operation
root@openindiana:~# zpool set dedupditto=2 pool
cannot set property for 'pool': invalid argument for this pool operation
root@openindiana:~# zpool set dedupditto=127 pool
root@openindiana:~# zpool get dedupditto pool
NAME  PROPERTY    VALUE  SOURCE
pool  dedupditto  127    local

Thanks,
//Jim
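The rejected values are consistent with a minimum-value check on the property. As far as I can tell from the illumos sources, dedupditto must be 0 (off) or at least 100 (ZIO_DEDUPDITTO_MIN); treat the exact constant as an assumption on my part, not something verified against this exact build. A sketch of that validation:

```python
# Sketch of the dedupditto property validation. The minimum of 100
# (ZIO_DEDUPDITTO_MIN in illumos) is an assumption from my reading of
# the sources; it would explain why 2 and 64 are rejected while 127
# is accepted.

ZIO_DEDUPDITTO_MIN = 100  # assumed illumos constant

def validate_dedupditto(value):
    if value != 0 and value < ZIO_DEDUPDITTO_MIN:
        raise ValueError("invalid argument for this pool operation")
    return value
```

Under this rule, setting dedupditto to 2 (as suggested above as a better default) would be rejected just like 64 was.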
Re: [zfs-discuss] ZFS Dedup and bad checksums
On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:
> As I recently wrote, my data pool has experienced some
> "unrecoverable errors". It seems that a userdata block
> of deduped data got corrupted and no longer matches the
> stored checksum. For whatever reason, raidz2 did not
> help in recovery of this data, so I rsync'ed the files
> over from another copy. Then things got interesting...
>
> Bug alert: it seems the block-pointer block with that
> mismatching checksum did not get invalidated, so my
> attempts to rsync known-good versions of the bad files
> from an external source seemed to work, but in fact failed:
> subsequent reads of the files produced IO errors.
> Apparently (my wild guess), upon writing the blocks,
> checksums were calculated and the matching DDT entry
> was found. ZFS did not care that the entry pointed to
> inconsistent data (not matching the checksum now),
> it still increased the DDT counter.
>
> The problem was solved by disabling dedup for the dataset
> involved and rsync-updating the file in-place. After the
> dedup feature was disabled and new blocks were uniquely
> written, everything was readable (and md5sums matched)
> as expected.
>
> I think of a couple of solutions:

In theory, the verify option will correct this going forward.

> If the block is detected to be corrupt (checksum mismatches
> the data), the checksum value in blockpointers and DDT
> should be rewritten to an "impossible" value, perhaps
> all-zeroes or such, when the error is detected.

What if it is a transient fault?

> Alternatively (opportunistically), a flag might be set
> in the DDT entry requesting that a new write matching
> this stored checksum should get committed to disk - thus
> "repairing" all files which reference the block (at least,
> stopping the IO errors).

verify eliminates this failure mode.
> Alas, so far there is anyway no guarantee that it was
> not the checksum itself that got corrupted (except for
> using ZDB to retrieve the block contents and matching
> that with a known-good copy of the data, if any), so
> corruption of the checksum would also cause replacement
> of "really-good-but-normally-inaccessible" data.

Extremely unlikely. The metadata is also checksummed. To arrive here you would have to have two corruptions, each of which generates the proper checksum. Not impossible, but… I'd buy a lottery ticket instead.

See also dedupditto. I could argue that the default value of dedupditto should be 2 rather than "off".

> //Jim Klimov
>
> (Bug reported to Illumos: https://www.illumos.org/issues/1981)

Thanks!
-- richard

--
ZFS and performance consulting
http://www.RichardElling.com