Re: [zfs-discuss] ZFS Dedup and bad checksums

2012-01-23 Thread Jim Klimov

2012-01-23 18:25, Jim Klimov wrote:

4) I did not get to check whether "dedup=verify" triggers a
checksum mismatch alarm if the preexisting on-disk data
does not in fact match the checksum.


All checksum mismatches are handled the same way.


> I have yet to test (to be certain) whether writing over a
> block which is invalid on-disk and marked as deduped, with
> dedup=verify, would increase the CKSUM counter.

I checked (on an oi_148a LiveUSB) by writing a correct block
(128KB) over the corrupted one in the file, and:
* "dedup=on" neither fixed the on-disk file nor logged
  an error, and subsequent reads produced IO errors
  (and increased the counter). Probably just the DDT
  counter was increased during the write (that's the
  "works as designed" part);
* "dedup=verify" *doesn't* log a checksum error when it
  finds an existing block whose stored checksum matches
  the newly written block but whose contents differ from
  the new block during dedup verification - and in fact
  those contents do not match the checksum either (at
  least not the one in the block pointer). Reading the
  block afterwards produced no errors;
* what's worse, re-enabling "dedup=on" and writing the
  same block again crashes (reboots) the system instantly.
  Possibly because there are now two DDT entries pointing
  to the same checksum for different blocks, and no
  verification was explicitly requested?

A reenactment of the test (as a hopefully reproducible
test case) constitutes the remainder of this post, so it
is going to be lengthy... Analyze that! ;)


I think such an alarm should exist and do as much as a scrub,
read or other means of error detection and recovery would.


Statement/desire still stands.


Checksum mismatches are logged,


No, they are not (in this case).

>> what was your root cause?

Probably the same as before - some sort of existing on-disk
data corruption where something overwrote some sectors and
raidz2 failed to reconstruct the stripe. I seem to have had
about a dozen such files. I fixed some by rsync with different
dedup settings before going into it all deeper. I am not sure
if any of them had overlapping DVAs (those which remain
corrupted now don't), but many addresses lie in very roughly
similar address ranges (within several GBs or so).



As written above, at least for one case it was probably
a random write by a disk over existing sectors, invalidating
the block.

Still, per the "works as designed" note above, even a logged
mismatch currently has no effect on avoiding the old DDT entry
that points to corrupt data.

Just in case, logged as https://www.illumos.org/issues/2015


REENACTMENT OF THE TEST CASE

Besides illustrating my error for those who decide to take on
the bug, I hope this post will also help others in their data
recovery attempts, zfs research, etc.

If my methodology is faulty, I hope someone points that out ;)


1) Uh, I have unrecoverable errors!

The computer was freshly rebooted and the pool imported (with
rollback); there were no newly known CKSUM errors (but we have
the nvlist of existing mismatching files):

# zpool status -vx
  pool: pool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scan: scrub repaired 244K in 138h57m with 31 errors on Sat Jan 14 01:50:16 2012

config:

NAMESTATE READ WRITE CKSUM
poolONLINE   0 0 0
  raidz2-0  ONLINE   0 0 0
c6t0d0  ONLINE   0 0 0
c6t1d0  ONLINE   0 0 0
c6t2d0  ONLINE   0 0 0
c6t3d0  ONLINE   0 0 0
c6t4d0  ONLINE   0 0 0
c6t5d0  ONLINE   0 0 0

errors: Permanent errors have been detected in the following files:

:<0x0>
...
pool/mymedia:/incoming/DSCF1234.AVI

NOTE: I don't yet have full detail of the :<0x0> error;
I have asked about it numerous times on the list.


2) Mine some information about the file and error location

* mount the dataset
  # zfs mount pool/mymedia
* find the inode number
  # ls -i /pool/mymedia/incoming/DSCF1234.AVI
  6313 /pool/mymedia/incoming/DSCF1234.AVI
* dump ZDB info
  # zdb -dd pool/mymedia 6313 > /tmp/zdb.txt
* find the bad block offset
  # dd if=/pool/mymedia/incoming/DSCF1234.AVI of=/dev/null \
bs=512 conv=noerror,notrunc
  dd: reading `/pool/mymedia/incoming/DSCF1234.AVI': I/O error
  58880+0 records in
  58880+0 records out
  30146560 bytes (30 MB) copied, 676.772 s, 44.5 kB/s
(error repeated 256 times)
  239145+1 records in
  239145+1 records out
  122442738 bytes (122 MB) copied, 2136.19 s, 57.3 kB/s

  So the error is at offset 58880*512 bytes = 0x1CC0000
  And its size is 512B*256 = 128KB
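
For readers repeating this step, the same scan can be scripted
instead of watching dd's error counts. A rough sketch in Python -
the path and the 128KB block size are just the values from this
example; it reads the file block by block and reports the offsets
where the kernel returns an I/O error:

  import os

  PATH = "/pool/mymedia/incoming/DSCF1234.AVI"   # path from the example above
  BS = 128 * 1024                                # assumed dataset recordsize

  def find_unreadable_blocks(path, bs=BS):
      """Read the file block by block, collecting offsets that raise I/O errors."""
      bad = []
      size = os.path.getsize(path)
      with open(path, "rb", buffering=0) as f:
          offset = 0
          while offset < size:
              try:
                  f.seek(offset)
                  f.read(min(bs, size - offset))
              except OSError:              # EIO surfaces here on a checksum failure
                  bad.append(offset)
              offset += bs
      return bad

  for off in find_unreadable_blocks(PATH):
      print("unreadable block at offset 0x%X (= %d bytes)" % (off, off))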



3) Review the /tmp/zdb.txt information

We need the L0 entry for the erroneous block and its
parent L1 entry:

   

Re: [zfs-discuss] ZFS Dedup and bad checksums

2012-01-23 Thread Jim Klimov

2012-01-22 22:58, Richard Elling wrote:

On Jan 21, 2012, at 6:32 AM, Jim Klimov wrote:



... So "it currently seems to me, that":

1) My on-disk data could get corrupted for whatever reason
ZFS tries to protect it from, at least once probably
from misdirected writes (i.e. the head landed not where
it was asked to write). It can not be ruled out that the
checksums got broken in non-ECC RAM before writes of
block pointers for some of my data, thus leading to
mismatches. One way or another, ZFS noted the discrepancy
during scrubs and "normal" file accesses. There is no
(automatic) way to tell which part is faulty - checksum
or data.


Untrue. If a block pointer is corrupted, then on read it will be logged
and ignored. I'm not sure you have grasped the concept of checksums
in the parent object.


If a block pointer is corrupted on disk after the write -
then yes, it will not match the parent's checksum, and
there would be another 1 or 2 ditto copies with possibly
correct data. Is that the correct grasping of the concept? ;)

Now, the (non-zero-probability) scenario I meant was that
the checksum for the block was calculated and then corrupted
in RAM/CPU before the ditto blocks were fanned out to disks,
and before the parent block checksums were calculated.

In this case the on-disk data block is correct as compared to
other sources (if it is copies=2, it may even be the same as
its other copy), but it does not match the BP's checksum,
while the BP tree seems valid (all tree checksums match).

I believe in this case ZFS should flag the data checksum
mismatch, although in reality (with minuscule probability)
it is the bad checksum mismatching the good data. Anyway,
the situation would look the same if the data block were
corrupted in RAM before fanning out with copies>1, and that
is more probable given the size of this block compared to
the 256 bits of checksum.

Just *HOW* probable that is on an ECC or a non-ECC system,
with or without an overclocked, overheated CPU, in an
enthusiast's overpumped workstation or an unsuspecting
consumer's dusty closet - that is a separate maths question,
with different answers for different models. Random answer -
on par with the disk UBER errors which ZFS by design considers
serious enough to combat.
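
As a back-of-the-envelope illustration of that last point
(assuming, simplistically, a single random bit flip equally
likely to land anywhere in the buffers involved), the checksum
is a much smaller target than the data block it covers:

  # Rough odds that one random bit flip hits the 256-bit checksum
  # rather than the 128KB data block it covers (uniform-flip assumption).
  checksum_bits = 256
  block_bits = 128 * 1024 * 8
  print(checksum_bits / float(block_bits))   # 0.000244..., i.e. about 1 in 4096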


2) In the case where on-disk data did get corrupted, the
checksum in block pointer was correct (matching original
data), but the raidz2 redundancy did not aid recovery.


I think your analysis is incomplete.


As I last wrote, I dumped the blocks with ZDB and compared
the bytes with the same block from a good copy. In particular,
that copy had the same SHA256 checksum as was stored in my
problematic pool's blockpointer entry for the corrupt block.

These blocks differed in three sets of 4096 bytes starting
at "round" offsets at even intervals (4KB, 36KB, 68KB);
4KB is my disks' block size. It seems that some disk(s?)
overwrote existing data, or got scratched, or whatever
(no IO errors in dmesg though).

I am not certain why raidz2 did not suffice to fix the block,
nor what garbage or data exists on all 6 drives - I did not
get zdb to dump all 0x3 bytes of raidz2 raw data to try
permutations myself.

Possibly, for whatever reason (such as a cable error, or some
firmware error given that the drives are all the same model),
several drives got the same erroneous write command at once
and ultimately invalidated parts of the same stripe.

Many of the files now in peril have existed on the pool
for some time, and scrubs completed successfully many times.

> Have you determined the root cause?

Unfortunately, I'm currently in another country, away from
my home NAS server. So all physical maintenance, including
pushing the reset button, is done by friends living in the
apartment. And there is not much physical examination that
can be done this way.

At one point recently (during a scrub in January), one of
the disks got lost and was not seen by the motherboard even
after reboots, so I had my friends pull out and replug the
SATA cables. This helped, so connector noise was possibly
the root cause. It might also account for an incorrect
address on a certain write that slashed randomly across
the platter.

The PSU is oversized for the box's requirements, with slack
performance to degrade ;) The P4 CPU is not overclocked.
RAM is non-ECC, and that is not changeable given the Intel
CPU, chipset and motherboard. The HDDs are on the motherboard's
controller. The 6 HDDs in the raidz2 pool are consumer-grade
SATA Seagate ST2000DL003-9VT166, firmware CC32.

Degrading cabling and/or connectors can indeed be one of
roughly two main suspects, the other being non-ECC RAM.
Or an aging CPU.


3) The file in question was created on a dataset with enabled
deduplication, so at the very least the dedup bit was set
on the corrupted block's pointer and a DDT entry likely
existed. Attempts to rewrite the block with the original
one (having "dedup=on") failed in fact, probably because
the matching checksum was already in DDT.


Works as designed.


If this is the case, 

Re: [zfs-discuss] ZFS Dedup and bad checksums

2012-01-22 Thread Richard Elling
On Jan 21, 2012, at 6:32 AM, Jim Klimov wrote:
> 2012-01-21 0:33, Jim Klimov wrote:
>> 2012-01-13 4:12, Jim Klimov wrote:
>>> As I recently wrote, my data pool has experienced some
>>> "unrecoverable errors". It seems that a userdata block
>>> of deduped data got corrupted and no longer matches the
>>> stored checksum. For whatever reason, raidz2 did not
>>> help in recovery of this data, so I rsync'ed the files
>>> over from another copy. Then things got interesting...
>> 
>> 
>> Well, after some crawling over my data with zdb, od and dd,
>> I guess ZFS was right about finding checksum errors - the
>> metadata's checksum matched that of a block on original
>> system, and the data block was indeed erring.
> 
> Well, as I'm moving to close my quest with broken data, I'd
> like to draw up some conclusions and RFEs. I am still not
> sure if they are factually true, I'm still learning the ZFS
> internals. So "it currently seems to me, that":
> 
> 1) My on-disk data could get corrupted for whatever reason
>   ZFS tries to protect it from, at least once probably
>   from misdirected writes (i.e. the head landed not where
>   it was asked to write). It can not be ruled out that the
>   checksums got broken in non-ECC RAM before writes of
>   block pointers for some of my data, thus leading to
>   mismatches. One way or another, ZFS noted the discrepancy
>   during scrubs and "normal" file accesses. There is no
>   (automatic) way to tell which part is faulty - checksum
>   or data.

Untrue. If a block pointer is corrupted, then on read it will be logged
and ignored. I'm not sure you have grasped the concept of checksums
in the parent object.

> 
> 2) In the case where on-disk data did get corrupted, the
>   checksum in block pointer was correct (matching original
>   data), but the raidz2 redundancy did not aid recovery.

I think your analysis is incomplete. Have you determined the root cause?

> 
> 3) The file in question was created on a dataset with enabled
>   deduplication, so at the very least the dedup bit was set
>   on the corrupted block's pointer and a DDT entry likely
>   existed. Attempts to rewrite the block with the original
>   one (having "dedup=on") failed in fact, probably because
>   the matching checksum was already in DDT.

Works as designed.

> 
>   Rewrites of such blocks with "dedup=off" or "dedup=verify"
>   succeeded.
> 
>   Failure/success were tested by "sync; md5sum FILE" some
>   time after the fix attempt. (When done just after the
>   fix, test tends to return success even if the ondisk data
>   is bad, "thanks" to caching).

No, I think you've missed the root cause. By default, data that does
not match its checksum is not used.

> 
>   My last attempt was to set "dedup=on" and write the block
>   again and sync; the (remote) computer hung instantly :(
> 
> 3*)The RFE stands: deduped blocks found to be invalid and not
>   recovered by redundancy should somehow be evicted from DDT
>   (or marked for required verification-before-write) so as
>   not to pollute further writes, including repair attempts.
> 
>   Alternatively, "dedup=verify" takes care of the situation
>   and should be the recommended option.

I have lobbied for this, but so far people prefer performance to dependability.

> 
> 3**) It was suggested to set "dedupditto" to small values,
>   like "2". My oi_148a refused to set values smaller than 100.
>   Moreover, it seems reasonable to have two dedupditto values:
>   for example, to make a ditto copy when DDT reference counter
>   exceeds some small value (2-5), and add ditto copies every
>   "N" values for frequently-referenced data (every 64-128).
> 
> 4) I did not get to check whether "dedup=verify" triggers a
>   checksum mismatch alarm if the preexisting on-disk data
>   does not in fact match the checksum.

All checksum mismatches are handled the same way.

> 
>   I think such an alarm should exist and do as much as a scrub,
>   read or other means of error detection and recovery would.

Checksum mismatches are logged, what was your root cause?

> 
> 5) It seems like a worthy RFE to include a pool-wide option to
>   "verify-after-write/commit" - to test that recent TXG sync
>   data has indeed made it to disk on (consumer-grade) hardware
>   into the designated sector numbers. Perhaps the test should
>   be delayed several seconds after the sync writes.

There are highly-reliable systems that do this in the fault-tolerant
market.

> 
>   If the verification fails, currently cached data from recent
>   TXGs can be recovered from on-disk redundancy and/or still
>   exist in RAM cache, and rewritten again (and tested again).
> 
>   More importantly, a failed test *may* mean that the write
>   landed on disk randomly, and the pool should be scrubbed
>   ASAP. It may be guessed that the yet-unknown error can lie
>   within "epsilon" tracks (sector numbers) from the currently
>   found non-written data, so if it is possible to scrub just
>   a portion of the pool based on

Re: [zfs-discuss] ZFS Dedup and bad checksums

2012-01-21 Thread Jim Klimov

2012-01-22 0:55, Bob Friesenhahn wrote:

On Sun, 22 Jan 2012, Jim Klimov wrote:

So far I was rather considering "flaky" hardware with lousy
consumer qualities. The server you describe is likely
to exceed that bar ;)


The most common "flaky" behavior of consumer hardware which causes
troubles for zfs is not honoring cache-related requests. Unfortunately,
it is not possible for zfs to fix such hardware. Zfs works best with
hardware which does what it is told.


Also true. That's what the "option" stood for in my proposal:
since the verification feature is going to be expensive and
add random IOs, we don't want to enforce it on everybody.

Besides, the user might choose to trust his reliable and
expensive hardware like a SAN/NAS with battery-backed NVRAM,
which is indeed likely better than a homebrew NAS box with
random HDDs thrown in with no measure, but with a desire for
some reliability nonetheless ;)

We can "expect" the individual HDDs caches to get expired
after some time (i.e. after we've sent 64Mbs worth of writes
to the particualr disk with a 64Mb cache), and after that
we are likely to get true media reads. That's when the
verification reads are likely to return most relevant
(ondisk) sectors...

//Jim


Re: [zfs-discuss] ZFS Dedup and bad checksums

2012-01-21 Thread Bob Friesenhahn

On Sun, 22 Jan 2012, Jim Klimov wrote:

So far I was rather considering "flaky" hardware with lousy
consumer qualities. The server you describe is likely
to exceed that bar ;)


The most common "flaky" behavior of consumer hardware which causes 
troubles for zfs is not honoring cache-related requests. 
Unfortunately, it is not possible for zfs to fix such hardware.  Zfs 
works best with hardware which does what it is told.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] ZFS Dedup and bad checksums

2012-01-21 Thread Jim Klimov

2012-01-21 20:50, Bob Friesenhahn wrote:
> TXGs get forgotten from memory as soon as they are written.

As I said, that can be arranged - i.e. free the TXG cache
after the corresponding TXG number has been verified?

Point about ARC being overwritten seems valid...


Zfs already knows how to by-pass the ARC. However, any "media" reads are
subject to caching since the underlying devices try very hard to cache
data in order to improve read performance.


As a pointer, the "format" command presents options to
disable (separately) read and write caching on drives
it sees. MAYBE there is some option to explicitly read
data from media, like sync-writes. Whether the drive
firmwares honor that (disabling caching and/or such
hypothetical sync-reads) - it's something out of ZFS's
control. But we can do the best effort...


As an extreme case of caching, consider a device represented by an iSCSI
LUN on a OpenSolaris server with 512GB of RAM. If you request to read
data you are exceedingly likely to read data from the zfs ARC on that
server rather than underlying "media".


So far I was rather considering "flaky" hardware with lousy
consumer qualities. The server you describe is likely
to exceed that bar ;)

Besides, if this OpenSolaris server is up-to-date, it
would do such media checks itself, and/or honour the
sync-read requests or temporary cache disabling ;)

Of course, this can't be guaranteed for other devices,
so in general ZFS can do best-effort verification.

//Jim



Re: [zfs-discuss] ZFS Dedup and bad checksums

2012-01-21 Thread Bob Friesenhahn

On Sat, 21 Jan 2012, Jim Klimov wrote:


Regarding the written data, I believe it may find a place in the
ARC, and for the past few TXGs it could still remain there.


Any data in the ARC is subject to being overwritten with updated data 
just a millisecond later.  It is a live cache.



I am not sure it is feasible to "guarantee" that it remains in
RAM for a certain time. Also there should be a way to enforce
media reads and not ARC re-reads when verifying writes...


Zfs already knows how to by-pass the ARC.  However, any "media" reads 
are subject to caching since the underlying devices try very hard to 
cache data in order to improve read performance.


As an extreme case of caching, consider a device represented by an 
iSCSI LUN on a OpenSolaris server with 512GB of RAM.  If you request 
to read data you are exceedingly likely to read data from the zfs ARC 
on that server rather than underlying "media".


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] ZFS Dedup and bad checksums

2012-01-21 Thread Jim Klimov

2012-01-21 19:18, Bob Friesenhahn wrote:

On Sat, 21 Jan 2012, Jim Klimov wrote:


5) It seems like a worthy RFE to include a pool-wide option to
"verify-after-write/commit" - to test that recent TXG sync
data has indeed made it to disk on (consumer-grade) hardware
into the designated sector numbers. Perhaps the test should
be delayed several seconds after the sync writes.


This is an interesting idea. I think that you would want to do a
mini-scrub on a TXG at least one behind the last one written since
otherwise any test would surely be foiled by caching. The ability to
restore data from RAM is doubtful since TXGs get forgotten from memory
as soon as they are written.


That could be rearranged as part of the bug/RFE resolution ;)

Regarding the written data, I believe it may find a place in
the ARC, and for the past few TXGs it could still remain there.
I am not sure it is feasible to "guarantee" that it remains in
RAM for a certain time. Also there should be a way to enforce
media reads, and not ARC re-reads, when verifying writes...

//Jim



Re: [zfs-discuss] ZFS Dedup and bad checksums

2012-01-21 Thread Bob Friesenhahn

On Sat, 21 Jan 2012, Jim Klimov wrote:


5) It seems like a worthy RFE to include a pool-wide option to
  "verify-after-write/commit" - to test that recent TXG sync
  data has indeed made it to disk on (consumer-grade) hardware
  into the designated sector numbers. Perhaps the test should
  be delayed several seconds after the sync writes.


This is an interesting idea.  I think that you would want to do a 
mini-scrub on a TXG at least one behind the last one written since 
otherwise any test would surely be foiled by caching.  The ability to 
restore data from RAM is doubtful since TXGs get forgotten from memory 
as soon as they are written.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] ZFS Dedup and bad checksums

2012-01-21 Thread Jim Klimov

2012-01-21 0:33, Jim Klimov wrote:

2012-01-13 4:12, Jim Klimov wrote:

As I recently wrote, my data pool has experienced some
"unrecoverable errors". It seems that a userdata block
of deduped data got corrupted and no longer matches the
stored checksum. For whatever reason, raidz2 did not
help in recovery of this data, so I rsync'ed the files
over from another copy. Then things got interesting...



Well, after some crawling over my data with zdb, od and dd,
I guess ZFS was right about finding checksum errors - the
metadata's checksum matched that of a block on original
system, and the data block was indeed erring.


Well, as I'm moving to close my quest with broken data, I'd
like to draw up some conclusions and RFEs. I am still not
sure if they are factually true, I'm still learning the ZFS
internals. So "it currently seems to me, that":

1) My on-disk data could get corrupted for whatever reason
   ZFS tries to protect it from, at least once probably
   from misdirected writes (i.e. the head landed not where
   it was asked to write). It can not be ruled out that the
   checksums got broken in non-ECC RAM before writes of
   block pointers for some of my data, thus leading to
   mismatches. One way or another, ZFS noted the discrepancy
   during scrubs and "normal" file accesses. There is no
   (automatic) way to tell which part is faulty - checksum
   or data.

2) In the case where on-disk data did get corrupted, the
   checksum in block pointer was correct (matching original
   data), but the raidz2 redundancy did not aid recovery.

3) The file in question was created on a dataset with enabled
   deduplication, so at the very least the dedup bit was set
   on the corrupted block's pointer and a DDT entry likely
   existed. Attempts to rewrite the block with the original
   one (having "dedup=on") failed in fact, probably because
   the matching checksum was already in DDT.

   Rewrites of such blocks with "dedup=off" or "dedup=verify"
   succeeded.

   Failure/success was tested by "sync; md5sum FILE" some
   time after the fix attempt. (When done just after the
   fix, the test tends to return success even if the on-disk
   data is bad, "thanks" to caching).

   My last attempt was to set "dedup=on" and write the block
   again and sync; the (remote) computer hung instantly :(

3*)The RFE stands: deduped blocks found to be invalid and not
   recovered by redundancy should somehow be evicted from the DDT
   (or marked as requiring verification-before-write) so as
   not to pollute further writes, including repair attempts.

   Alternatively, "dedup=verify" takes care of the situation
   and should be the recommended option.

3**) It was suggested to set "dedupditto" to small values,
   like "2". My oi_148a refused to set values smaller than 100.
   Moreover, it seems reasonable to have two dedupditto values:
   for example, to make a ditto copy when DDT reference counter
   exceeds some small value (2-5), and add ditto copies every
   "N" values for frequently-referenced data (every 64-128).

4) I did not get to check whether "dedup=verify" triggers a
   checksum mismatch alarm if the preexisting on-disk data
   does not in fact match the checksum.

   I think such an alarm should exist and do as much as a scrub,
   read or other means of error detection and recovery would.

5) It seems like a worthy RFE to include a pool-wide option to
   "verify-after-write/commit" - to test that recent TXG sync
   data has indeed made it to disk on (consumer-grade) hardware
   into the designated sector numbers. Perhaps the test should
   be delayed several seconds after the sync writes.

   If the verification fails, currently cached data from recent
   TXGs can be recovered from on-disk redundancy and/or still
   exist in RAM cache, and rewritten again (and tested again).

   More importantly, a failed test *may* mean that the write
   landed on disk randomly, and the pool should be scrubbed
   ASAP. It may be guessed that the yet-unknown error can lie
   within "epsilon" tracks (sector numbers) from the currently
   found non-written data, so if it is possible to scrub just
   a portion of the pool based on DVAs - that's a preferred
   start. It is possible that some data can be recovered if
   it is tended to ASAP (i.e. on mirror, raidz, copies>1)...

Finally, I should say I'm sorry for the lame questions arising
from not reading the format spec and zdb blogs carefully ;)

In particular, it was my understanding for a long time that
block pointers each have a sector of their own, leading to
the overheads that I've seen. Now I know (and checked) that
most of the blockpointer tree is made of larger groupings
(128 blkptr_t's in a single 16KB block), reducing the impact
of BPs on fragmentation and/or the slack waste of large
sectors that I predicted and expected for the past year.

Sad that nobody ever contradicted that (mis)understanding
of mine.

//Jim Klimov

Re: [zfs-discuss] ZFS Dedup and bad checksums

2012-01-20 Thread Jim Klimov

2012-01-13 4:12, Jim Klimov wrote:

As I recently wrote, my data pool has experienced some
"unrecoverable errors". It seems that a userdata block
of deduped data got corrupted and no longer matches the
stored checksum. For whatever reason, raidz2 did not
help in recovery of this data, so I rsync'ed the files
over from another copy. Then things got interesting...



Well, after some crawling over my data with zdb, od and dd,
I guess ZFS was right about finding checksum errors - the
metadata's checksum matched that of a block on original
system, and the data block was indeed erring.

Just in case it helps others, the SHA256 checksums can be
tested with openssl as I show below. I am still searching for
a command-line fletcher4/fletcher2 checker (as that weaker
hash is used on metadata; I wonder why).

Here's a tail of the on-disk blkptr_t dump - the bytes holding the checksum:
# tail -2 /tmp/osx.l0+110.blkptr.txt
000460 1f 6f 4c 73 5d c1 ab 15 00 cc 56 90 38 8e b4 dd
000470 a9 8e 54 6f f1 a7 db 43 7d 61 9e 01 23 45 2e 70

In byte 0x435 I have the value 0x8 - SHA256. And here is the
SHA256 hash of the excerpt from the original file (128KB
cut out with dd):

# dd if=osx.zip of=/tmp/osx.l0+110.bin.orig bs=512 skip=34816 count=256

# openssl dgst -sha256 < /tmp/osx.l0+110.bin.orig
15abc15d734c6f1fddb48e389056cc0043dba7f16f548ea9702e4523019e617d

As my x86 is little-endian, the four 8-byte words of
the checksum appear reversed. But you can see it matches,
so my source file is okay.
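
Assuming Python 3, the same comparison can be scripted instead
of eyeballing the hex dump; a quick sketch using the excerpt
from the dd step above and the 32 checksum bytes copied from
the blkptr dump:

  import hashlib, struct

  # 128KB excerpt cut from the known-good copy with dd (see above)
  with open("/tmp/osx.l0+110.bin.orig", "rb") as f:
      data = f.read()

  # Checksum bytes 0x460..0x47f from the on-disk blkptr dump above
  ondisk = bytes.fromhex(
      "1f6f4c735dc1ab1500cc5690388eb4dd"
      "a98e546ff1a7db437d619e0123452e70")

  # The digest is canonically big-endian; the blkptr stores four 64-bit
  # words in host byte order (little-endian on x86), hence the per-word swap.
  digest_words = struct.unpack(">4Q", hashlib.sha256(data).digest())
  ondisk_words = struct.unpack("<4Q", ondisk)

  print(["%016x" % w for w in digest_words])
  print(["%016x" % w for w in ondisk_words])
  print("match:", digest_words == ondisk_words)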

I did not find the DDT entries (yet), so I don't know
what hash is there or what addresses it references, or for
how many files. The block pointer has the dedup bit set,
though.

However, among all my files with errors, there are no DVA
overlaps.

I hexdumped (with od) the two 128KB excerpts (one from the
original file, the other fetched with zdb) and diffed them,
and while some lines matched, others did not.

What is more interesting is that most of the error area
contains a repeating pattern like this, sometimes with
"extra" chars thrown in:
fc 42 fc 42 fc 42 fc 42 fc 42 fc
fc 42 1f fc 42 fc 42
42 ff fc 42 fc 42 fc 42 fc 42

I have seen similar patterns when I zdb-dumped compressed
blocks without decompression, so I guess this could be a
miswrite of compressed data and/or parity destined for
another file (which also did not get it).

The erroneous data starts and ends at "round" offsets like
0x1000-0x2000, 0x9000-0xa000, 0x11000-0x12000 (a step of 0x8000
between the sets of mismatches; each is 4KB, my disk sector
size), which also suggests a non-coincidental problem.
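
For anyone repeating this comparison, the mismatching sectors can
be listed programmatically instead of diffing od output by eye.
A sketch, assuming the known-good excerpt from above and a
hypothetical file name for the zdb dump of the on-disk block:

  SECTOR = 4096   # my disks' sector size

  with open("/tmp/osx.l0+110.bin.orig", "rb") as f:
      good = f.read()
  with open("/tmp/osx.l0+110.bin.zdb", "rb") as f:    # hypothetical name for the zdb dump
      bad = f.read()

  assert len(good) == len(bad) == 128 * 1024

  for off in range(0, len(good), SECTOR):
      if good[off:off + SECTOR] != bad[off:off + SECTOR]:
          print("mismatch in sector 0x%05x - 0x%05x" % (off, off + SECTOR))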

However, part of the differing data is "normal-looking
random noise", while some part is that pattern above,
starting and ending at a seemingly random location mid-sector.

Here's about all I have to say and share so far :)

Open to suggestions on how to compute fletcher checksums
on blocks...
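
For the record, here is a minimal sketch of fletcher4 as described
for the on-disk format: the data is consumed as native (little-endian
on x86) 32-bit words, with four 64-bit running sums truncated modulo
2^64. I have not verified this against zdb output yet, so treat it
as a starting point rather than a reference implementation:

  import struct, sys

  MASK64 = (1 << 64) - 1

  def fletcher4(data):
      """Fletcher4 over little-endian 32-bit words with four 64-bit accumulators."""
      a = b = c = d = 0
      # ZFS block sizes are multiples of 4 bytes, so no tail handling here
      for (w,) in struct.iter_unpack("<I", data):
          a = (a + w) & MASK64
          b = (b + a) & MASK64
          c = (c + b) & MASK64
          d = (d + c) & MASK64
      return a, b, c, d

  if __name__ == "__main__":
      with open(sys.argv[1], "rb") as f:
          print(" ".join("%016x" % w for w in fletcher4(f.read())))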

Thanks,
//Jim




Re: [zfs-discuss] ZFS Dedup and bad checksums

2012-01-12 Thread Jim Klimov

2012-01-13 4:26, Richard Elling wrote:

On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:

The problem was solved by disabling dedup for the dataset
involved and rsync-updating the file in-place. After the
dedup feature was disabled and new blocks were uniquely
written, everything was readable (and md5sums matched)
as expected.

In theory, the verify option will correct this going forward.


Well, I have got more complaining blocks, and even new errors
in files that I had previously "repaired" with rsync before I
figured out the problem with dedup today.

Now I've set the verify flag instead of dedup=off, and the
rsync replacement from external storage seems to happen a lot
faster. It also seems to persist even a few minutes after the
copying ;)

Thanks for the tip, Richard!
//Jim Klimov


Re: [zfs-discuss] ZFS Dedup and bad checksums

2012-01-12 Thread Jim Klimov

2012-01-13 5:34, Daniel Carosone wrote:

On Fri, Jan 13, 2012 at 05:16:36AM +0400, Jim Klimov wrote:

Either I misunderstand some of the above, or I fail to
see how verification would eliminate this failure mode
(namely, as per my suggestion, replace the bad block
with a good one and have all references updated and
block-chains ->  files fixed with one shot).


It doesn't update past data.

It gets treated as if there were a hash collision, and the new data is
really different despite having the same checksum, and so gets written
out instead of incrementing the existing DDT pointer.  So it addresses
your ability to recover the primary filesystem by overwriting with
same data, that dedup was previously defeating.


But (yes/no?) I have to do this repair file-by-file,
either with dedup=off or dedup=verify.

Actually, that's what I properly should do if there is
such a serious error, but what if the original data
is not available, so I can't fix it file-by-file, or
if there are very many errors (read: DDT references
from a number of files just under the dedupditto value)
and such a match-and-repair procedure is prohibitively
inconvenient, slow, whatever?

Say, previously we trusted the hash algorithm: identical
checksums mean identical blocks. With such trust the
user might want to replace the faulty block with another
one (matching the checksum) and expect ALL deduped files
that used this block to become automagically recovered.
Chances are, they actually would be correct (by external
verification).

And if we trust unverified dedup in the first place,
there is nothing wrong with such an approach to repair.
It would not make possible errors any worse than they were
in the originally saved on-disk data (even if there were
hash collisions of really-different blocks - the user had
discarded that difference long ago).

I think the user should be given an (informed) ability
to shoot himself in the foot or recover his data, depending
on his luck. Anyway, people are doing it, thanks to
Max Bruning's or Viktor Latushkin's posts and direct
help, or they research the hardcore internals of ZFS.
We might as well play along and increase their chances
of success, even if unsupported and unguaranteed - no?

This situation with "obscured" recovery methods reminds
me of prohibited changes of firmware on cell phones:
customers are allowed to sit on a phone or drop it into
a sink, and perhaps have it replaced, but they are not
allowed to install different software. Many still do.

//Jim Klimov




Re: [zfs-discuss] ZFS Dedup and bad checksums

2012-01-12 Thread Daniel Carosone
On Fri, Jan 13, 2012 at 05:16:36AM +0400, Jim Klimov wrote:
> 2012-01-13 4:26, Richard Elling wrote:
>> On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:
>>> Alternatively (opportunistically), a flag might be set
>>> in the DDT entry requesting that a new write matching
>>> this stored checksum should get committed to disk - thus
>>> "repairing" all files which reference the block (at least,
>>> stopping the IO errors).
>>
>> verify eliminates this failure mode.
>
> Thinking about it... got more questions:
>
> In this case: DDT/BP contain multiple references with
> correct checksums, but the on-disk block is bad.
> Newly written block has the same checksum, and verification
> proves that on-disk data is different byte-to-byte.
>
> 1) How does the write-stack interact with those checksums
>that do not match the data? Would any checksum be tested
>for this verification read of existing data at all?
>
> 2) It would make sense for the failed verification to
>have the new block committed to disk, and a new DDT
>entry with same checksum created. I would normally
>expect this to be the new unique block of a new file,
>and have no influence on existing data (block chains).
>However in the discussed problematic case, this safe
>behavior would also mean not contributing to reparation
>of those existing block chains which include the
>mismatching on-disk block.
>
> Either I misunderstand some of the above, or I fail to
> see how verification would eliminate this failure mode
> (namely, as per my suggestion, replace the bad block
> with a good one and have all references updated and
> block-chains -> files fixed with one shot).

It doesn't update past data.

It gets treated as if there were a hash collision, and the new data is
really different despite having the same checksum, and so gets written
out instead of incrementing the existing DDT pointer.  So it addresses
your ability to recover the primary filesystem by overwriting with
same data, that dedup was previously defeating. 

--
Dan.





Re: [zfs-discuss] ZFS Dedup and bad checksums

2012-01-12 Thread Jim Klimov

2012-01-13 4:26, Richard Elling wrote:

On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:

Alternatively (opportunistically), a flag might be set
in the DDT entry requesting that a new write matching
this stored checksum should get committed to disk - thus
"repairing" all files which reference the block (at least,
stopping the IO errors).


verify eliminates this failure mode.


Thinking about it... got more questions:

In this case: DDT/BP contain multiple references with
correct checksums, but the on-disk block is bad.
Newly written block has the same checksum, and verification
proves that on-disk data is different byte-to-byte.

1) How does the write-stack interact with those checksums
   that do not match the data? Would any checksum be tested
   for this verification read of existing data at all?

2) It would make sense for the failed verification to
   have the new block committed to disk, and a new DDT
   entry with same checksum created. I would normally
   expect this to be the new unique block of a new file,
   and have no influence on existing data (block chains).
   However in the discussed problematic case, this safe
   behavior would also mean not contributing to reparation
   of those existing block chains which include the
   mismatching on-disk block.

Either I misunderstand some of the above, or I fail to
see how verification would eliminate this failure mode
(namely, as per my suggestion, replace the bad block
with a good one and have all references updated and
block-chains -> files fixed with one shot).

Would you please explain?
Thanks,
//Jim Klimov



Re: [zfs-discuss] ZFS Dedup and bad checksums

2012-01-12 Thread Jim Klimov

2012-01-13 4:26, Richard Elling wrote:

On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:


As I recently wrote, my data pool has experienced some
"unrecoverable errors". It seems that a userdata block
of deduped data got corrupted and no longer matches the
stored checksum. For whatever reason, raidz2 did not
help in recovery of this data, so I rsync'ed the files
over from another copy. Then things got interesting...

Bug alert: it seems the block-pointer block with that
mismatching checksum did not get invalidated, so my
attempts to rsync known-good versions of the bad files
from external source seemed to work, but in fact failed:
subsequent reads of the files produced IO errors.
Apparently (my wild guess), upon writing the blocks,
checksums were calculated and the matching DDT entry
was found. ZFS did not care that the entry pointed to
inconsistent data (not matching the checksum now),
it still increased the DDT counter.

The problem was solved by disabling dedup for the dataset
involved and rsync-updating the file in-place. After the
dedup feature was disabled and new blocks were uniquely
written, everything was readable (and md5sums matched)
as expected.

I think of a couple of solutions:


In theory, the verify option will correct this going forward.


But in practice there are many suggestions to disable
verification because it slows down writes beyond what the
DDT already does to performance, and since there is just
some 10^-77 chance that two different blocks would have the
same checksum value, it is there only for paranoiacs.
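
(For reference, that figure simply reflects the size of a
256-bit hash space; assuming a uniformly distributed hash,
a one-line check:)

  p = 2.0 ** -256   # chance that a given different block collides on a 256-bit hash
  print(p)          # on the order of 8.6e-78, i.e. roughly the quoted 10^-77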



If the block is detected to be corrupt (checksum mismatches
the data), the checksum value in blockpointers and DDT
should be rewritten to an "impossible" value, perhaps
all-zeroes or such, when the error is detected.


What if it is a transient fault?


Reread the disk, retest the checksums?.. I don't know... :)




Alternatively (opportunistically), a flag might be set
in the DDT entry requesting that a new write matching
this stored checksum should get committed to disk - thus
"repairing" all files which reference the block (at least,
stopping the IO errors).


verify eliminates this failure mode.


Sounds true; I didn't try that, though.
But my scrub is not yet complete - maybe there will be more
test subjects ;)




Alas, so far there is anyways no guarantee that it was
not the checksum itself that got corrupted (except for
using ZDB to retrieve the block contents and matching
that with a known-good copy of the data, if any), so
corruption of the checksum would also cause replacement
of "really-good-but-normally-inaccessible" data.


Extremely unlikely. The metadata is also checksummed. To arrive here
you will have to have two corruptions each of which generate the proper
checksum. Not impossible, but… I'd buy a lottery ticket instead.


I rather meant the opposite: the file data is actually good,
but the checksums (apparently both the DDT and BlockPointer
ones, with all their ditto copies) are bad, either due to disk
rot or RAM failures. For example, are the "blockpointer"
and "dedup" versions of the sha256 checksum recalculated
by both stages, or reused, on writes of a block?..



See also dedupditto. I could argue that the default value of dedupditto
should be 2 rather than "off".


I couldn't set it to smallish values (like 64) on an oi_148a LiveUSB:

root@openindiana:~# zpool set dedupditto=64 pool
cannot set property for 'pool': invalid argument for this pool operation

root@openindiana:~# zpool set dedupditto=2 pool
cannot set property for 'pool': invalid argument for this pool operation

root@openindiana:~# zpool set dedupditto=127 pool
root@openindiana:~# zpool get dedupditto pool
NAME  PROPERTYVALUE   SOURCE
pool  dedupditto  127 local


Thanks,
//Jim


Re: [zfs-discuss] ZFS Dedup and bad checksums

2012-01-12 Thread Richard Elling
On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:

> As I recently wrote, my data pool has experienced some
> "unrecoverable errors". It seems that a userdata block
> of deduped data got corrupted and no longer matches the
> stored checksum. For whatever reason, raidz2 did not
> help in recovery of this data, so I rsync'ed the files
> over from another copy. Then things got interesting...
> 
> Bug alert: it seems the block-pointer block with that
> mismatching checksum did not get invalidated, so my
> attempts to rsync known-good versions of the bad files
> from external source seemed to work, but in fact failed:
> subsequent reads of the files produced IO errors.
> Apparently (my wild guess), upon writing the blocks,
> checksums were calculated and the matching DDT entry
> was found. ZFS did not care that the entry pointed to
> inconsistent data (not matching the checksum now),
> it still increased the DDT counter.
> 
> The problem was solved by disabling dedup for the dataset
> involved and rsync-updating the file in-place. After the
> dedup feature was disabled and new blocks were uniquely
> written, everything was readable (and md5sums matched)
> as expected.
> 
> I think of a couple of solutions:

In theory, the verify option will correct this going forward.

> If the block is detected to be corrupt (checksum mismatches
> the data), the checksum value in blockpointers and DDT
> should be rewritten to an "impossible" value, perhaps
> all-zeroes or such, when the error is detected.

What if it is a transient fault?

> Alternatively (opportunistically), a flag might be set
> in the DDT entry requesting that a new write matching
> this stored checksum should get committed to disk - thus
> "repairing" all files which reference the block (at least,
> stopping the IO errors).

verify eliminates this failure mode.

> Alas, so far there is anyways no guarantee that it was
> not the checksum itself that got corrupted (except for
> using ZDB to retrieve the block contents and matching
> that with a known-good copy of the data, if any), so
> corruption of the checksum would also cause replacement
> of "really-good-but-normally-inaccessible" data.

Extremely unlikely. The metadata is also checksummed. To arrive here
you will have to have two corruptions each of which generate the proper
checksum. Not impossible, but… I'd buy a lottery ticket instead.

See also dedupditto. I could argue that the default value of dedupditto 
should be 2 rather than "off".

> //Jim Klimov
> 
> (Bug reported to Illumos: https://www.illumos.org/issues/1981)

Thanks!
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com