Re: [zfs-discuss] How does resilver/scrub work?

2012-05-25 Thread zfs user

On 5/23/12 11:28 PM, Richard Elling wrote:

The man page is clear on this topic, IMHO


Indeed, even in snv_117 the zpool man page says that. But the
console/dmesg message was also quite clear, so go figure whom
to trust (or fear) more ;)


The FMA message is consistent with the man page.


The man page seems not to mention the critical part of the FMA message that the OP is
worried about.
The OP said that his motivation for clearing the errors, and for fearing the degraded
state, was this:


 AUTO-RESPONSE: The device has been marked as degraded.  An attempt
 will be made to activate a hot spare if available.

He doesn't want his dd'd new device kicked out of the vdev and replaced by a
hot spare (if available) because of the number of errors and the scarlet letter
of DEGRADED at the device level - I don't think he cares about the pool-level
degraded status, since it doesn't do anything.



fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-GH, TYPE: Fault, VER: 1, 
SEVERITY: Major
EVENT-TIME: Wed May 16 03:27:31 MSK 2012
PLATFORM: Sun Fire X4500, CSN: 0804AMT023, HOSTNAME: thumper
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: cc25a316-4018-4f13-c675-d1d84c6325c3
DESC: The number of checksum errors associated with a ZFS device
exceeded acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-GH for more 
information.
AUTO-RESPONSE: The device has been marked as degraded.  An attempt
will be made to activate a hot spare if available.
IMPACT: Fault tolerance of the pool may be compromised.
REC-ACTION: Run 'zpool status -x' and replace the bad device.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How does resilver/scrub work?

2012-05-25 Thread Richard Elling
On May 25, 2012, at 1:53 PM, zfs user wrote:

 On 5/23/12 11:28 PM, Richard Elling wrote:
 The man page is clear on this topic, IMHO
 
 Indeed, even in snv_117 the zpool man page says that. But the
 console/dmesg message was also quite clear, so go figure whom
 to trust (or fear) more ;)
 
 The FMA message is consistent with the man page.
 
 The man page seems not to mention the critical part of the FMA message that the OP is
 worried about.
 The OP said that his motivation for clearing the errors, and for fearing the degraded
 state, was this:
 
  AUTO-RESPONSE: The device has been marked as degraded.  An attempt
  will be made to activate a hot spare if available.
 
 He doesn't want his dd'd new device kicked out of the vdev and replaced by a
 hot spare (if available) because of the number of errors and the scarlet letter
 of DEGRADED at the device level - I don't think he cares about the pool-level
 degraded status, since it doesn't do anything.

By the time you could read such a message, the hot spare would have already
kicked in. Obviously, this was not the OP's issue.
 -- richard

 
 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-GH, TYPE: Fault, VER: 
 1, SEVERITY: Major
 EVENT-TIME: Wed May 16 03:27:31 MSK 2012
 PLATFORM: Sun Fire X4500, CSN: 0804AMT023, HOSTNAME: thumper
 SOURCE: zfs-diagnosis, REV: 1.0
 EVENT-ID: cc25a316-4018-4f13-c675-d1d84c6325c3
 DESC: The number of checksum errors associated with a ZFS device
 exceeded acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-GH for 
 more information.
 AUTO-RESPONSE: The device has been marked as degraded.  An attempt
 will be made to activate a hot spare if available.
 IMPACT: Fault tolerance of the pool may be compromised.
 REC-ACTION: Run 'zpool status -x' and replace the bad device.

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How does resilver/scrub work?

2012-05-25 Thread Jim Klimov

2012-05-26 1:07, Richard Elling wrote:

On May 25, 2012, at 1:53 PM, zfs user wrote:

The man page seems not to mention the critical part of the FMA message
that the OP is worried about.
The OP said that his motivation for clearing the errors, and for fearing
the degraded state, was this:

 AUTO-RESPONSE: The device has been marked as degraded. An attempt
 will be made to activate a hot spare if available.

He doesn't want his dd'd new device kicked out of the vdev and
replaced by a hot spare (if available) because of the number of errors
and the scarlet letter of DEGRADED at the device level - I don't
think he cares about the pool-level degraded status, since it doesn't
do anything.


By the time you could read such a message, the hot spare would have already
kicked in. Obviously, this was not the OP's issue.
-- richard


Kind of, it was - at least, it was the motivation for feeling insecure
and ultimately for clearing the CKSUM errors every minute in which a
nonzero error count appeared, using the script you said should never
be used in normal practice - and I agree with that conclusion.
(Manual) DD'ing is not the normal practice that the degradation and
hot-sparing mechanism is meant to cover.

As I wrote, the first time I saw the message, the pool did not
have an assigned hotspare, but it got marked degraded. Just in
case, I came up with that cksum-mismatch-clearing script and
restarted the scrub, since I knew the errors on-disk were due
to an unfinished proper resilver onto it. I was not convinced
that the new disk still fully operates in the pool while it
is marked as degraded, and I did not want to let the scrub run
on only to discover that the disk was not being actively used
and repaired.

To put it another way, I know that sometimes docs can lag
behind or hop ahead of implemented features, and the latter can
also be buggy or incomplete. While the theory (FMA and man-page
snippets) said the disk should continue being used by the array
despite the DEGRADED mark, I had no intention of staging an
experiment here to find out whether it actually would, in that
aged version of the software.

Thanks,
//the OP ;)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How does resilver/scrub work?

2012-05-24 Thread Richard Elling

On May 23, 2012, at 2:56 PM, Jim Klimov wrote:

 Thanks again,
 
 2012-05-24 1:01, Richard Elling wrote:
 At least the textual error message implies that if a hotspare
 were available for the pool, it would kick in and invalidate
 the device I am scrubbing to update into the pool after the
 DD-phase (well, it was not DD but a hung-up resilver in this
 case, but that is not substantial).
 
 The man page is clear on this topic, IMHO
 
 Indeed, even in snv_117 the zpool man page says that. But the
 console/dmesg message was also quite clear, so go figure whom
 to trust (or fear) more ;)

The FMA message is consistent with the man page.

 
 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-GH, TYPE: Fault, VER: 1, 
 SEVERITY: Major
 EVENT-TIME: Wed May 16 03:27:31 MSK 2012
 PLATFORM: Sun Fire X4500, CSN: 0804AMT023, HOSTNAME: thumper
 SOURCE: zfs-diagnosis, REV: 1.0
 EVENT-ID: cc25a316-4018-4f13-c675-d1d84c6325c3
 DESC: The number of checksum errors associated with a ZFS device
 exceeded acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-GH for more 
 information.
 AUTO-RESPONSE: The device has been marked as degraded.  An attempt
 will be made to activate a hot spare if available.
 IMPACT: Fault tolerance of the pool may be compromised.
 REC-ACTION: Run 'zpool status -x' and replace the bad device.
 
 
 
  dd, or similar dumb block copiers, should work fine.
  However, they are inefficient...
 
  Define "efficient"? In terms of transferring the 900Gb payload
 of a 1Tb HDD that has been used for ZFS for a year, DD would beat
 resilver any time at getting most (or, less likely, all) of the
 valid bits of data onto the new device. It is the next phase -
 getting the rest of the bits into a valid state - that needs
 some attention, manual or automated.
 
 speed != efficiency
 
  Ummm... this is likely to start a flame war with other posters,
 and you did not say what efficiency means to you. How can we compare
 apples to meat without even knowing whether the latter is a steak or
 a pork knuckle?

Efficiency allows use of denominators other than time. Speed is restricted
to a denominator of time. There is no flame war here, look elsewhere.

  For now, I choose to stand by the statement that reducing the
 timeframe during which the old disk needs to be in the system is a
 good thing, and that changing the IO pattern from random writes
 into (mostly) sequential writes, followed by random reads, may
 also be somewhat more efficient, especially under other loads
 (interfering less with them). Even though the whole replacement
 process may take more wall-clock time, there are cases where I'd
 likely trust it to do a better job than the original resilvering.
 
 I think, someone with equipment could stage an experiment and
 compare the two procedures (existing and proposed) on a nearly
 full and somewhat fragmented pool.

Operationally, your method loses every time.

 
  Maybe you can disenchant me (not with vague phrases but with either
 theory or practice) and I would then see that my trust is blind,
 misdirected and without foundation. =)

 IMHO, this is too operationally complex for most folks. KISS wins.
 
  That's why I proposed to tuck this scenario under the zfs hood
 (DD + selective scrub + ditto writes during the process,
 as an optional alternative to the current resilver), or to have
 someone explain coherently why this should not be done in any
 situation. Implementing it as a standard supported command would be KISS ;)
 
  Especially if it is known that, with some quirks, this procedure
 works and may be beneficial in some cases, e.g. by reducing
 the timeframe during which a pool with a flaky disk in place is
 exposed to potential loss of redundancy and large amounts of data;
 in the worst case the loss is constrained to those sectors
 which couldn't be (correctly) read by DD from the source disk
 and couldn't be reconstructed from raidz/mirror redundancy due
 to whatever overlapping problems (e.g. a sector from the same block
 died on another disk too).

You have not made a case for why this hybrid and failure-prone 
procedure is required. What problem are you trying to solve?

 What is it about error counters that frightens you enough to want to clear
 them often?
 
  In this case, mostly, the fright of having the device kicked
 out of the pool automatically instead of getting it synced
 (resilvered is an improper term here, I guess) into a proper state.

Why not follow the well-designed existing procedure?

  In general - since this is part of a migration procedure
 which is, again, expected to have errors, we don't really care
 about signalling them. Why doesn't the original resilver signal
 several million CKSUM errors for a new empty disk when it
 reconstructs sectors onto it? I'd say this is functionally
 identical. (At least, it would be - if it were part of a supported
 procedure, as I suggest.)
 
 Thanks,
 //Jim Klimov
 
  PS: I pondered for a while whether I should make up an argument that
 on a disk with dying mechanics, lots of random IO (resilver) instead
 of sequential IO (DD) would cause it to die faster, but that's
 just FUD not backed by any scientific data or statistics -
 which you likely have, perhaps even opposing this argument.

Re: [zfs-discuss] How does resilver/scrub work?

2012-05-24 Thread Jim Klimov

Let me try to formulate my idea again... You called a similar
process "pushing the rope" some time ago, I think.

I feel like I'm sitting some exam, trying to pick answers
for a discipline like philosophy with no idea about the
examiner's preferences - is he an ex-Communism teacher or an
eager new-religion fanatic? The same answer can lead to an A
or to an F on a state exam. Ah, that was some fun experience :)

Well, what we know is what remains after we forget everything
that we were taught, while the exams are our last chance to
learn something at all =)

2012-05-24 10:28, Richard Elling wrote:

You have not made a case for why this hybrid and failure-prone
procedure is required. What problem are you trying to solve?


Bigger-better-faster? ;)

The original proposal in this thread was about understanding
how resilvers and scrubs work, why they are so dog slow on
HDDs in comparison to sequential reads, and thinking aloud
what can be improved in this area.

One of the later posts was about improving the disk replacement
(where the original is still responsive, but may be imperfect)
for filled-up fragmented pools by including a stage of fast
data transfer and a different IO pattern for verification and
updating of the new disk image, in comparison with current
resilver's IO patterns.

This may or may not have some benefits in certain (corner?)
cases which are of practical interest to some users on this
list, and if this discussion leads to a POC made by a competent
ZFS programmer, which can be tested on a variety of ZFS pools
(without risking one's only pool on a homeNAS) - so much the
better. Then we would see if this scenario is viable or utterly
useless and bad in every tested case.

The practical numbers I have from the same box and disks are:
* Copy from a 250Gb raidz1 (9*(4+1)) pool to a single-disk 3Tb
  test pool took 24 hours to fill the new disk - including the
  ZFS overheads.
* Copying of one raw 250(232)Gb partition takes under 2 hours
  (if it can sustain about 70Mb/s reads from the source without
  distractions like other pool IO - then 1 hour).
* Proper resilvering (reading all BP-tree from the original pool,
  reading all blocks from the TLVDEV, writing reconstructed(?)
  sectors to the target disk) from one partition to another
  took 17 hours.
* Full scrubbing (reading all blocks from the pool, fixing
  checksum mismatches) takes 25-27 hours.
* Selective scrubbing - unimplemented, timeframe unknown
  (reading all BP-tree from the original pool, reading all
  blocks from the TLVDEV including the target disk and the
  original disk, fixing checksum mismatches without panicky
  messages and/or hotspares kicking in).
  I *guess* it would have similar speed to a resilver, but
  less bound to random write IO patterns, which may be better
  for latencies of other tasks on the system.

So, in case of original resilver, I replace the not-yet-dead
disk with a hotspare, and after 17 hours of waiting I see if
it was successfully resilvered or not. During this time the
disk can die for example, leaving my pool with lowered
protection (or lack thereof in case of raidz1 or two-way
mirrors).

In the case of the new method proposed for a POC implementation,
after 1 hour I'd already have a somewhat reliable copy of
that vdev (a few blocks may have mismatches, but if the
source disk dies or is taken away now, the whole TLVDEV
or pool is not degraded with compromised protection). Then,
after the same +17 hours for the scrub, I'd be certain that
this copy is good.

If the new writes incoming to this TLVDEV between the start of the
DD and the end of the scrub are directed to be written to both the
source disk and its copy, then there are fewer (down to zero)
checksum discrepancies for the scrub phase to find.


Why not follow the well-designed existing procedure?


At first it was theoretical speculation, but a couple of days
later the incomplete resilver turned it into a practical experiment
with the idea.


The failure data does not support your hypothesis.

Ok, then my made-up and dismissed argument does not stand ;)

Thanks for the discussion,
//Jim Klimov
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How does resilver/scrub work?

2012-05-24 Thread Richard Elling
big assumption below...

On May 24, 2012, at 6:06 AM, Jim Klimov wrote:

 Let me try to formulate my idea again... You called a similar
 process "pushing the rope" some time ago, I think.
 
 I feel like I'm sitting some exam, trying to pick answers
 for a discipline like philosophy with no idea about the
 examiner's preferences - is he an ex-Communism teacher or an
 eager new-religion fanatic? The same answer can lead to an A
 or to an F on a state exam. Ah, that was some fun experience :)
 
 Well, what we know is what remains after we forget everything
 that we were taught, while the exams are our last chance to
 learn something at all =)
 
 2012-05-24 10:28, Richard Elling wrote:
 You have not made a case for why this hybrid and failure-prone
 procedure is required. What problem are you trying to solve?
 
 Bigger-better-faster? ;)
 
 The original proposal in this thread was about understanding
 how resilvers and scrubs work, why they are so dog slow on
 HDDs in comparison to sequential reads, and thinking aloud
 what can be improved in this area.
 
 One of the later posts was about improving the disk replacement
 (where the original is still responsive, but may be imperfect)
 for filled-up fragmented pools by including a stage of fast
 data transfer and a different IO pattern for verification and
 updating of the new disk image, in comparison with current
 resilver's IO patterns.
 
 This may or may not have some benefits in certain (corner?)
 cases which are of practical interest to some users on this
 list, and if this discussion leads to a POC made by a competent
 ZFS programmer, which can be tested on a variety of ZFS pools
 (without risking one's only pool on a homeNAS) - so much the
 better. Then we would see if this scenario is viable or utterly
 useless and bad in every tested case.
 
 The practical numbers I have from the same box and disks are:
 * Copy from a 250Gb raidz1 (9*(4+1)) pool to a single-disk 3Tb
  test pool took 24 hours to fill the new disk - including the
  ZFS overheads.
 * Copying of one raw 250(232)Gb partition takes under 2 hours
  (if it can sustain about 70Mb/s reads from the source without
  distractions like other pool IO - then 1 hour).
 * Proper resilvering (reading all BP-tree from the original pool,
  reading all blocks from the TLVDEV, writing reconstructed(?)
  sectors to the target disk) from one partition to another
  took 17 hours.
 * Full scrubbing (reading all blocks from the pool, fixing
  checksum mismatches) takes 25-27 hours.
 * Selective scrubbing - unimplemented, timeframe unknown
  (reading all BP-tree from the original pool, reading all
  blocks from the TLVDEV including the target disk and the
  original disk, fixing checksum mismatches without panicky
  messages and/or hotspares kicking in).
  I *guess* it would have similar speed to a resilver, but
  less bound to random write IO patterns, which may be better
  for latencies of other tasks on the system.
 
 So, in case of original resilver, I replace the not-yet-dead
 disk with a hotspare, and after 17 hours of waiting I see if
 it was successfully resilvered or not. During this time the
 disk can die for example, leaving my pool with lowered
 protection (or lack thereof in case of raidz1 or two-way
 mirrors).
 
 In the case of the new method proposed for a POC implementation,
 after 1 hour I'd already have a somewhat reliable copy of
 that vdev (a few blocks may have mismatches,

This is a big assumption -- that the disk will operate normally, even
for data it cannot read. In my experience, this assumption is not valid
for the majority of HDD failure modes. Also, in the case of consumer-grade
disks, a single sector media error could take a very long time to retry/fail.

 but if the
 source disk dies or is taken away now, the whole TLVDEV
 or pool is not degraded with compromised protection). Then,
 after the same +17 hours for the scrub, I'd be certain that
 this copy is good.
 
 If the new writes incoming to this TLVDEV between the start of the
 DD and the end of the scrub are directed to be written to both the
 source disk and its copy, then there are fewer (down to zero)
 checksum discrepancies for the scrub phase to find.
 
 Why not follow the well-designed existing procedure?
 
 At first it was theoretical speculation, but a couple of days
 later the incomplete resilver turned it into a practical experiment
 with the idea.
 
 The failure data does not support your hypothesis.
 Ok, then my made-up and dismissed argument does not stand ;)
 
 Thanks for the discussion,

np 
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How does resilver/scrub work?

2012-05-24 Thread Jim Klimov

2012-05-24 18:55, Richard Elling wrote:

This is a big assumption -- that the disk will operate normally, even
for data it cannot read. In my experience, this assumption is not valid
for the majority of HDD failure modes. Also, in the case of consumer-grade
disks, a single sector media error could take a very long time to
retry/fail.


Indeed it is, and I've covered this in the thread earlier -
the bulk copying phase (DD-phase) should monitor its real
progress, and if it detects lags compared to the average
or expected speed (expected = some tuning variable, e.g. 50Mb/s),
the process should skip over some (arbitrary) range of sectors
and go on from another location (such skipped sectors are indeed
in danger until the scrub-phase detects and reconstructs
them), or fall back to the original resilver method completely.
I already described this in as much detail as I had thought of
at the time of posting, and I can't add much to it yet.

What I've seen of faulty sectors is that they are usually
either single errors or a scratched range which can be worked
around with e.g. partitioning for legacy FSes (if the SMART
relocation doesn't deal with them properly for any reason),
while most of the rest of the disk is okay. Retries may be
lengthy, ranging from several seconds up to a minute, but
they are often confined to a few locations and *may* add
little delay in the overall scheme of things. If the delay
is more than acceptable and/or we can't find a working
location on the source disk, we just fall back to the
old method - either the original resilver or, if much data has
already been copied to the new disk, the new selective scrub
(which is much like the resilver, but takes into account
those sectors on the target disk which may have been copied
over correctly).

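To make this concrete, here is a purely illustrative sketch of such a
chunked DD-phase (not an existing tool; the device names, chunk size,
source size and the per-chunk time budget are all made up, and it relies
on Solaris dd's iseek/oseek options):

#!/bin/bash
# Copy the source slice in chunks; if a chunk takes suspiciously long,
# log it and jump ahead - the skipped sectors stay "in danger" until
# the scrub-phase reconstructs them from the TLVDEV's redundancy.
SRC=/dev/rdsk/c5t6d0s0      # flaky source slice (hypothetical)
DST=/dev/rdsk/c1t2d0s1      # new target slice (hypothetical)
CHUNK_MB=64                 # copy unit, in 1024k blocks
BUDGET=30                   # seconds allowed per chunk before skipping ahead
SIZE_MB=238475              # size of the source slice in MB (hypothetical)

off=0
while [ "$off" -lt "$SIZE_MB" ]; do
    t0=$SECONDS
    dd if="$SRC" of="$DST" bs=1024k count="$CHUNK_MB" \
       iseek="$off" oseek="$off" conv=noerror,sync 2>/dev/null
    if [ $((SECONDS - t0)) -gt "$BUDGET" ]; then
        echo "slow area, skipped ahead near ${off} MB" >> /var/tmp/dd-skipped.log
        off=$((off + 10 * CHUNK_MB))    # arbitrary skip distance
    else
        off=$((off + CHUNK_MB))
    fi
done
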
A somewhat worse case is intermittent errors at random times
and logical disk locations due to who knows what - overheating,
firmware overflow errors, bus resets, or whatever. Those, rather,
are the reason for scrub-validation of data after a mass
migration, perhaps (as well as a reason for preventive regular
scrubs)...

//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How does resilver/scrub work?

2012-05-23 Thread Richard Elling
comments far below...

On May 22, 2012, at 1:42 AM, Jim Klimov wrote:

 2012-05-22 7:30, Daniel Carosone wrote:
 On Mon, May 21, 2012 at 09:18:03PM -0500, Bob Friesenhahn wrote:
 On Mon, 21 May 2012, Jim Klimov wrote:
 This is so far a relatively raw idea and I've probably missed
 something. Do you think it is worth pursuing and asking some
 zfs developers to make a POC? ;)
 
 I did read all of your text. :-)
 
 This is an interesting idea and could be of some use but it would be
 wise to test it first a few times before suggesting it as a general
 course.
 
 I've done basically this kind of thing before: dd a disk and then
 scrub rather than replace, treating errors as expected.
 
 I got into a similar situation last night on that Thumper -
 it is now migrating a flaky source disk in the array from
 an original old 250Gb disk into a same-sized partition on
 the new 3Tb drive (as I outlined as IDEA7 in another thread).
 The source disk itself had about 300 CKSUM errors during
 the process, and for reasons beyond my current understanding,
 the resilver never completed.
 
 In zpool status it said that the process was done several
 hours before the time I looked at it, but the TLVDEV still
 had a spare component device comprised of the old disk
 and new partition, and the (same) hotspare device in the
 pool was INUSE.
 
 After a while we just detached the old disk from the pool
 and ran scrub, which first found some 178 CKSUM errors on
 the new partition right away, and degraded the TLVDEV and
 pool.
 
 We cleared the errors, and ran the script below to log
 the detected errors and clear them, so the disk is fixed
 and not kicked out of the pool due to mismatches.
 Overall 1277 errors were logged and apparently fixed, and
 the pool is now on its second full scrub run - no bugs so
 far (knocking wood; certainly none this early in the scrub
 as we had last time).
 
 So in effect, this methodology works for two of us :)
 
 Since you did similar stuff already, I have a few questions:
 1) How/what did you DD? The whole slice with the zfs vdev?

dd, or similar dumb block copiers, should work fine. However, they 
are inefficient and operationally difficult to manage, which is why they
tend to fall into the prefer-to-use-something-else category.

   Did the system complain (much) about the renaming of the
   device compared to paths embedded in pool/vdev headers?

It shouldn't unless you did something to confuse it, such as having both
the original and the dd copy online at the same time. In that case, you
will have two different copies of the same identified device that are
independent. This is an operational mistake, hence my comment above.

   Did you do anything manually to remedy that (forcing
   import, DDing some handcrafted uberblocks, anything?)

Not needed.

 
 2) How did you treat errors as expected during scrub?
   As I've discovered, there were hoops to jump through.
   Is there a switch to disable degrading of pools and
   TLVDEVs based on only the CKSUM counts?

DEGRADED is the status. You clear degraded states by fixing the problem
and running zpool clear. DEGRADED, in and of itself, is not a problem.
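
For example (the pool and device names here are placeholders only):

  zpool status -x               # shows which pools/devices are DEGRADED and why
  zpool clear pond c1t2d0s1     # reset the error counters once the cause is addressed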

 
 
 My raw hoop-jumping script:
 -
 
#!/bin/bash

# /root/scrubwatch.sh
# Watches the 'pond' scrub and resets errors to avoid auto-degrading
# the device, but logs the detected error counts nevertheless.
# See also fmstat|grep zfs-diag for precise counts.
# See also https://blogs.oracle.com/bobn/entry/zfs_and_fma_two_great
#  for details on FMA and fmstat with zfs hotspares

while true; do
    zpool status pond | gegrep -A4 -B3 'resilv|error|c1t2d|c5t6d|%'
    date
    echo ""

    C1=`zpool status pond | grep c1t2d`
    C2=`echo $C1 | grep 'c1t2d0s1  ONLINE   0 0 0'`
    if [ "x$C2" = "x" ]; then
        echo "`date`: $C1" >> /var/tmp/zpool-clear_pond.log
        zpool clear pond
        zpool status pond | gegrep -A4 -B3 'resilv|error|c1t2d|c5t6d|%'
        date
    fi
    echo ""

    sleep 60
done

I would never allow such scripts at my site. It is important to track the 
progress and state changes. This script resets those counters for no
good reason.

I post this comment in the hope that future searches will not encourage 
people to try such things.
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How does resilver/scrub work?

2012-05-23 Thread Jim Klimov

2012-05-23 20:54, Richard Elling wrote:

comments far below...


Thank you, Richard, for taking notice of this thread and for the
definitive answers, which I need not quote below for further
questions ;)


2) How did you treat errors as expected during scrub?
As I've discovered, there were hoops to jump through.
Is there a switch to disable degrading of pools and
TLVDEVs based on only the CKSUM counts?


DEGRADED is the status. You clear degraded states by fixing the problem
and running zpool clear. DEGRADED, in and of itself, is not a problem.


Doesn't this status preclude the device with many CKSUM errors
from participating in the pool (TLVDEV) and the remainder of
the scrub in particular?

At least the textual error message implies that if a hotspare
were available for the pool, it would kick in and invalidate
the device I am scrubbing to update into the pool after the
DD-phase (well, it was not DD but a hung-up resilver in this
case, but that is not substantial).

Such automatic replacement is definitely not what I needed
in this particular case, so if it were to happen - it would
be a problem indeed, in and of itself.

 dd, or similar dumb block copiers, should work fine.
 However, they are inefficient...

Define "efficient"? In terms of transferring the 900Gb payload
of a 1Tb HDD that has been used for ZFS for a year, DD would beat
resilver any time at getting most (or, less likely, all) of the
valid bits of data onto the new device. It is the next phase -
getting the rest of the bits into a valid state - that needs
some attention, manual or automated.

Again, DD is indeed not a good fit for pools with little
data on big disks, and while I see why these could be used
(e.g. to never face fragmentation), I haven't seen them in
practice around here.

... and operationally difficult to manage

Actually, that's why I asked whether it makes sense to
automate such a scenario as another legal variant of disk
replacement, complete with fast data transfer and verification
and simultaneous work of the new and old devices until the
data migration is marked complete. In particular that would
take care of accepting the scrub errors as an expected part
of the disk replacement and not a fatal fault/degradation,
and/or allowing new writes to propagate onto the new disk
while the replacement is going on and minimize discrepancies
right on the run.

In visible effect this would be similar to current resilver
during replacement of a live disk with a hotspare, but the
prcess would follow a different scenario I suggested earlier
in the thread.


My raw hoop-jumping script:

...

I would never allow such scripts in my site. It is important to track the
progress and state changes. This script resets those counters for no
good reason.

I post this comment in the hope that future searches will not encourage
people to try such things.


Understood, point taken; I won't try to promote such a solution,
and I agree that it is certainly not a good general idea.
It should be noted, however (or I want to be corrected, please,
if I am wrong), that:

1) Errors are expected on this run, since the DD'ed copy is expected
   to deviate from the current pool state; if the degradation mark
   on the new disk would force it to be kicked out of the pool just
   because there are many CKSUM errors - which we know should be
   there due to the manual DD-phase - then clearing them is
   justified IMHO (in this one case);

2) The progress is tracked by logging the error counts into a text
   file. If the admin fired up the script (manually in his terminal
   or a vnc/screen session), he can also look into the log file or
   even tail it.

3) The individual CKSUM errors are summed up in fmstat output, and
   this script does not zero them out, so even system-side tracking
   is not disturbed here.

Anyhow, if there is a device with just a few CKSUM errors, then the
next scrub clears its error counts anyway (if no new problems are
found).

Thanks,
//Jim Klimov
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How does resilver/scrub work?

2012-05-23 Thread Jim Klimov

Thanks again,

2012-05-24 1:01, Richard Elling wrote:

At least the textual error message implies that if a hotspare
were available for the pool, it would kick in and invalidate
the device I am scrubbing to update into the pool after the
DD-phase (well, it was not DD but a hung-up resilver in this
case, but that is not substantial).


The man page is clear on this topic, IMHO


Indeed, even in snv_117 the zpool man page says that. But the
console/dmesg message was also quite clear, so go figure whom
to trust (or fear) more ;)

fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-GH, TYPE: Fault, 
VER: 1, SEVERITY: Major

EVENT-TIME: Wed May 16 03:27:31 MSK 2012
PLATFORM: Sun Fire X4500, CSN: 0804AMT023, HOSTNAME: thumper
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: cc25a316-4018-4f13-c675-d1d84c6325c3
DESC: The number of checksum errors associated with a ZFS device
exceeded acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-GH for 
more information.

AUTO-RESPONSE: The device has been marked as degraded.  An attempt
will be made to activate a hot spare if available.
IMPACT: Fault tolerance of the pool may be compromised.
REC-ACTION: Run 'zpool status -x' and replace the bad device.




 dd, or similar dumb block copiers, should work fine.
 However, they are inefficient...

Define "efficient"? In terms of transferring the 900Gb payload
of a 1Tb HDD that has been used for ZFS for a year, DD would beat
resilver any time at getting most (or, less likely, all) of the
valid bits of data onto the new device. It is the next phase -
getting the rest of the bits into a valid state - that needs
some attention, manual or automated.


speed != efficiency


Ummm... this is likely to start a flame war with other posters,
and you did not say what efficiency means to you. How can we compare
apples to meat without even knowing whether the latter is a steak or
a pork knuckle?

For now, I choose to stand by the statement that reducing the
timeframe during which the old disk needs to be in the system is a
good thing, and that changing the IO pattern from random writes
into (mostly) sequential writes, followed by random reads, may
also be somewhat more efficient, especially under other loads
(interfering less with them). Even though the whole replacement
process may take more wall-clock time, there are cases where I'd
likely trust it to do a better job than the original resilvering.

I think, someone with equipment could stage an experiment and
compare the two procedures (existing and proposed) on a nearly
full and somewhat fragmented pool.

Maybe you can disenchant me (not with vague phrases but with either
theory or practice) and I would then see that my trust is blind,
misdirected and without foundation. =)


IMHO, this is too operationally complex for most folks. KISS wins.


That's why I proposed to tuck this scenario under the zfs hood
(DD + selective scrub + ditto writes during the process,
as an optional alternative to the current resilver), or to have
someone explain coherently why this should not be done in any
situation. Implementing it as a standard supported command would be KISS ;)

Especially if it is known that, with some quirks, this procedure
works and may be beneficial in some cases, e.g. by reducing
the timeframe during which a pool with a flaky disk in place is
exposed to potential loss of redundancy and large amounts of data;
in the worst case the loss is constrained to those sectors
which couldn't be (correctly) read by DD from the source disk
and couldn't be reconstructed from raidz/mirror redundancy due
to whatever overlapping problems (e.g. a sector from the same block
died on another disk too).


What is it about error counters that frightens you enough to want to clear
them often?


In this case, mostly, the fright of having the device kicked
out of the pool automatically instead of getting it synced
(resilvered is an improper term here, I guess) into a proper state.

In general - since this is part of a migration procedure
which is, again, expected to have errors, we don't really care
about signalling them. Why doesn't the original resilver signal
several million CKSUM errors for a new empty disk when it
reconstructs sectors onto it? I'd say this is functionally
identical. (At least, it would be - if it were part of a supported
procedure, as I suggest.)

Thanks,
//Jim Klimov

PS: I pondered for a while whether I should make up an argument that
on a disk with dying mechanics, lots of random IO (resilver) instead
of sequential IO (DD) would cause it to die faster, but that's
just FUD not backed by any scientific data or statistics -
which you likely have, perhaps even opposing this argument.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How does resilver/scrub work?

2012-05-22 Thread Jim Klimov

2012-05-22 7:30, Daniel Carosone wrote:

On Mon, May 21, 2012 at 09:18:03PM -0500, Bob Friesenhahn wrote:

On Mon, 21 May 2012, Jim Klimov wrote:

This is so far a relatively raw idea and I've probably missed
something. Do you think it is worth pursuing and asking some
zfs developers to make a POC? ;)


I did read all of your text. :-)

This is an interesting idea and could be of some use but it would be
wise to test it first a few times before suggesting it as a general
course.


I've done basically this kind of thing before: dd a disk and then
scrub rather than replace, treating errors as expected.


I got into a similar situation last night on that Thumper -
it is now migrating a flaky source disk in the array from
an original old 250Gb disk into a same-sized partition on
the new 3Tb drive (as I outlined as IDEA7 in another thread).
The source disk itself had about 300 CKSUM errors during
the process, and for reasons beyond my current understanding,
the resilver never completed.

In zpool status it said that the process was done several
hours before the time I looked at it, but the TLVDEV still
had a spare component device comprised of the old disk
and new partition, and the (same) hotspare device in the
pool was INUSE.

After a while we just detached the old disk from the pool
and ran scrub, which first found some 178 CKSUM errors on
the new partition right away, and degraded the TLVDEV and
pool.

We cleared the errors, and ran the script below to log
the detected errors and clear them, so the disk is fixed
and not kicked out of the pool due to mismatches.
Overall 1277 errors were logged and apparently fixed, and
the pool is now on its second full scrub run - no bugs so
far (knocking wood; certainly none this early in the scrub
as we had last time).

So in effect, this methodology works for two of us :)

Since you did similar stuff already, I have a few questions:
1) How/what did you DD? The whole slice with the zfs vdev?
   Did the system complain (much) about the renaming of the
   device compared to paths embedded in pool/vdev headers?
   Did you do anything manually to remedy that (forcing
   import, DDing some handcrafted uberblocks, anything?)

2) How did you treat errors as expected during scrub?
   As I've discovered, there were hoops to jump through.
   Is there a switch to disable degrading of pools and
   TLVDEVs based on only the CKSUM counts?


My raw hoop-jumping script:
-

#!/bin/bash

# /root/scrubwatch.sh
# Watches the 'pond' scrub and resets errors to avoid auto-degrading
# the device, but logs the detected error counts nevertheless.
# See also fmstat|grep zfs-diag for precise counts.
# See also https://blogs.oracle.com/bobn/entry/zfs_and_fma_two_great
#  for details on FMA and fmstat with zfs hotspares

while true; do
    zpool status pond | gegrep -A4 -B3 'resilv|error|c1t2d|c5t6d|%'
    date
    echo ""

    C1=`zpool status pond | grep c1t2d`
    C2=`echo $C1 | grep 'c1t2d0s1  ONLINE   0 0 0'`
    if [ "x$C2" = "x" ]; then
        echo "`date`: $C1" >> /var/tmp/zpool-clear_pond.log
        zpool clear pond
        zpool status pond | gegrep -A4 -B3 'resilv|error|c1t2d|c5t6d|%'
        date
    fi
    echo ""

    sleep 60
done




HTH,
//Jim Klimov
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How does resilver/scrub work?

2012-05-22 Thread Daniel Carosone
On Tue, May 22, 2012 at 12:42:02PM +0400, Jim Klimov wrote:
 2012-05-22 7:30, Daniel Carosone wrote:
 I've done basically this kind of thing before: dd a disk and then
 scrub rather than replace, treating errors as expected.

 I got into a similar situation last night on that Thumper -
 it is now migrating a flaky source disk in the array from
 an original old 250Gb disk into a same-sized partition on
 the new 3Tb drive (as I outlined as IDEA7 in another thread).
 The source disk itself had about 300 CKSUM errors during
 the process, and for reasons beyond my current understanding,
 the resilver never completed.

 In zpool status it said that the process was done several
 hours before the time I looked at it, but the TLVDEV still
 had a spare component device comprised of the old disk
 and new partition, and the (same) hotspare device in the
 pool was INUSE.

I think this is at least in part an issue with older code.  There have
been various fixes for hangs/restarts/incomplete replaces and sparings
over the time since.  

 After a while we just detached the old disk from the pool
 and ran scrub, which first found some 178 CKSUM errors on
 the new partition right away, and degraded the TLVDEV and
 pool.

 We cleared the errors, and ran the script below to log
 the detected errors and clear them, so the disk is fixed
 and not kicked out of the pool due to mismatches.

 So in effect, this methodology works for two of us :)

 Since you did similar stuff already, I have a few questions:
 1) How/what did you DD? The whole slice with the zfs vdev?
Did the system complain (much) about the renaming of the
device compared to paths embedded in pool/vdev headers?
Did you do anything manually to remedy that (forcing
import, DDing some handcrafted uberblocks, anything?)

I've done it a couple of times at least:

 * a failed disk in a raidz1, where I didn't trust that the other
   disks didn't also have errors.  Basically did a ddrescue from the
   old disk to the new one. I think these days a 'replace' where the
   original disk is still online will use that content, like a
   hotspare replace, rather than assume it has gone away and must be
   recreated, but that wasn't the case at the time.

 * Where I had an iscsi mirror of a laptop hard disk, but it was out
   of date and had been detached when the laptop iscsi initiator
   refused to start.  Later, the disk developed a few bad sectors.  I
   made a new submirror, let it sync (with the errors still present),
   then blatted bits of the old image over the new one in the areas
   where the bad sectors were being reported.  Scrubbed again, and they
   were fixed (as well as some blocks on the new submirror being
   repaired, bringing it back up to date again).

 2) How did you treat errors as expected during scrub?

Pretty much as you did: decline to panic and restart scrubs.

--
Dan.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How does resilver/scrub work?

2012-05-21 Thread Bob Friesenhahn

On Mon, 21 May 2012, Jim Klimov wrote:

This is so far a relatively raw idea and I've probably missed
something. Do you think it is worth pursuing and asking some
zfs developers to make a POC? ;)


I did read all of your text. :-)

This is an interesting idea and could be of some use but it would be 
wise to test it first a few times before suggesting it as a general 
course.  Zfs is still totally not foolproof.  I still see postings 
from time to time regarding pools which panic/crash the system 
(probably due to memory corruption).


Zfs will try to keep the data compacted at the beginning of the 
partition so if you have a way to know how far out it extends, then 
the initial 'dd' could be much faster when the pool is not close to 
full.


Zfs scrub does need to do many more reads than a resilver since it 
reads all data and metadata copies.  Triggering a resilver operation 
for the specific disk would likely hasten progress.
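
For instance (hypothetical device name), a resilver can be aimed at a
single leaf vdev:

  zpool replace pond c1t2d0        # rebuild this one disk in place (full resilver of the leaf)
  zpool offline pond c1t2d0 ; zpool online pond c1t2d0
                                   # or: resilver only the transactions the disk missed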


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How does resilver/scrub work?

2012-05-21 Thread Daniel Carosone
On Mon, May 21, 2012 at 09:18:03PM -0500, Bob Friesenhahn wrote:
 On Mon, 21 May 2012, Jim Klimov wrote:
 This is so far a relatively raw idea and I've probably missed
 something. Do you think it is worth pursuing and asking some
 zfs developers to make a POC? ;)

 I did read all of your text. :-)

 This is an interesting idea and could be of some use but it would be  
 wise to test it first a few times before suggesting it as a general  
 course. 

I've done basically this kind of thing before: dd a disk and then
scrub rather than replace, treating errors as expected. 

 Zfs will try to keep the data compacted at the beginning of the  
 partition so if you have a way to know how far out it extends, then the 
 initial 'dd' could be much faster when the pool is not close to full.

zdb will show you usage per metaslab, you could use that and
effectively select offset ranges to skip any empty ones.  After a
while, and once the pool has seen usage fill past low %'ages, I'd say
most metaslabs would have some usage, so you might not save much
time.  Going to finer detail within a metaslab is not worthwhile -
much more involved and involves the seeks you're trying to avoid.
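
For instance (the pool name is a placeholder, and the exact output format
of zdb varies between releases):

  zdb -m pond      # dump per-metaslab space maps: allocated vs. free space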

--
Dan.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How does resilver/scrub work?

2012-05-20 Thread Jim Klimov

I hope there is some good outcome of this thread after all, below...
I wonder if anyone else thinks the following proposal is reasonable? ;)

2012-05-18 10:18, Daniel Carosone wrote:

Let's go over those, and clarify terminology, before going through the
rest of your post:
...* Replace: A device has gone, and needs to be completely
 reconstructed.


As I detail below, I see Replace happening when a device
is going to be gone - but it is still available and is being
proactively replaced.



Scrub is very similar to normal reads, apart from checking all copies
rather than serving the data from whichever copy successfully returns
first. Errors are not expected, are counted and repaired as/if found.

Resilver and Replace are very similar, and the terms are often used
interchangably. Replace is essentially resilver with a starting TXG of
0 (plus some labelling). In both cases, an error is expected or
assumed from the device in question, and repair initiated
unconditionally (and without incrementing error counters).

You're suggesting an assymetry between Resilver and Replace to exploit
the possibile speedup of sequential access; ok, seems attractive at
first blush, let's explore the idea.


Well, I've gone to a swimming pool today to swim the half-mile
and clear my head (metaphorically at least), and from the
depths I emerged with another idea:

From what I do see with the pool I'm upgrading (in another
thread), there is also a Replace mode for hotspare devices,
namely:
* I attached the hotspare to the pool
  zpool add poolname spare c1t2d0
* I asked the pool to migrate a flaky disk's data to the new disk:
  zpool replace poolname c5t6d0 c1t2d0
* I asked the pool to forget the old disk so it can be removed:
  zpool detach poolname c5t6d0
  (cfgadm, removal, pluck in the new disk, cfgadm, etc)

From iostat I see that all existing TLVDEV's drives, including
the one being replaced, are actively thrashed by reads for many
hours, with some writes pouring onto the new disk.


SO THE IDEA IS as follows: the disk being explicitly replaced,
as in upgrades of the pool to larger drives, should first be
copied onto the new media DD-style, which would be sequential IO
for both devices, bandwidth-bound and rather fast. Then there
should be a selective scrub, reading and checking allocated
blocks from this TLVDEV only - like resilver does today - and
repairing possible discrepancies (since the pool was likely
live during the DD stage, and since errors were possible
on the source drive as well as on any other); after this
selective scrub the process is complete.

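Roughly, the manual equivalent that I have in mind (a conceptual sketch
only - the device names are made up, and the exact labelling/import
behaviour depends on the ZFS release in use) would be:

  zpool offline pond c5t6d0                  # stop writes to the flaky source
  dd if=/dev/rdsk/c5t6d0s0 of=/dev/rdsk/c1t2d0s1 bs=1024k conv=noerror,sync
                                             # fast sequential copy, ZFS labels included
  # swap the copy in where the original used to be (it carries the original
  # leaf vdev's GUID), e.g. by physically moving disks or re-importing, then:
  zpool scrub pond                           # stand-in for the "selective scrub":
                                             #   verify and repair whatever went stale
  zpool clear pond                           # the mismatches found are expected here
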
BENEFITS:
* The pool quickly gets a more-or-less good copy of the original
  disk, if it has not died completely and is able to serve reads
  for DD-style copying. This decreases the window of exposure of
  the TLVDEV to complete failure due to decreased redundancy, and
  can already help to salvage much of the data in case of partly
  bad source disk.

  That is, after the DD-style copy the new disk may be able to
  serve much of the valid data, and discrepancies might be easy
  to repair using the normal checksum-mismatch mechanisms - if the
  old disk kicks the bucket and/or is removed before the selective
  scrub completes and gracefully finishes the replacement procedure.

  The standard scrubbing approach after the DD-copy takes care
  of ensuring that by the end of the procedure the new disk's
  data is fully valid. This also allows to not bother about the
  problems of the source disk being updated in locations ahead
  or behind the point where we're reading now - some corrections
  to be made by the selective scrub are expected anyway.

  However, arguably, incoming writes may be placed on both the source
  disk and its syncing-up spare replacement (into the correct sector
  locations right from the start).

* Instead of scheduling many random writes, which may be slower
  due to sync requirements, caching priorities, etc., we lean
  towards many random reads - which would still be used if we
  were using the original replace/resilver mode. Arguably, the
  reads can be optimized better by ZFS pipeline and HDD NCQ/TCQ,
  and in a safer manner than (random) write optimizations.

* This method should be beneficial to raidz as well as mirrors,
  although the latter may have more options to cheaply recover
  bad sectors detected (as HDD IO errors) on source media of
  the one disk being replaced, on the fly - during DD-phase.

CAVEATS:

* This mode is of benefit for users whose pools are rather
  fragmented and full, so that sequential copy is noticeably
  faster than BP-tree-walk based resilvering. It is about 30x
  quicker on the utilized servers and homeNAS'es that I see.

  For example, on a Thumper in my other thread, resilvering
  of a 250Gb disk (partition) takes 15-17 hours while writing
  files and zfs-sends into a single-disk ZFS pool located on
  the same 3Tb drive fills it up in 24 hours. A full scrub of
  the original pool (45*250Gb) takes 24-27 hours. Time matters.

  

Re: [zfs-discuss] How does resilver/scrub work?

2012-05-18 Thread Daniel Carosone
On Fri, May 18, 2012 at 03:05:09AM +0400, Jim Klimov wrote:
   While waiting for that resilver to complete last week,
 I caught myself wondering how the resilvers (are supposed
 to) work in ZFS?

The devil finds work for idle hands... :-)

   Based on what I see in practice and read in this list
 and some blogs, I've built a picture and would be grateful
 if some experts actually familiar with code and architecture
 would say how far off I guessed from the truth ;)

Well, I'm not that - certainly not on the code.  It would probably be
best (for both of us) to spend idle time looking at the code, before
spending too much on speculation. Nonetheless, let's have at it! :)

   Ultimately I wonder if there are possible optimizations
 to make the scrub process more like a sequential
 drive-cloning operation (bandwidth/throughput-bound) than the
 IOPS-bound random-seek thrashing for hours that we
 often see now, at least on (over?)saturated pools.

The tradeoff will be code complexity and resulting fragility. Choose
wisely what you wish for.

 This may possibly improve zfs send speeds as well.

Less likely, that's pretty much always going to have to go in txg
order.

   First of all, I state (and ask to confirm): I think
 resilvers are a subset of scrubs, in that:
 1) resilvers are limited to a particular top-level VDEV
 (and its number is a component of each block's DVA address)
 and
 2) when scrub finds a block mismatching its known checksum,
 scrub reallocates the whole block anew using the recovered
 known-valid data - in essence it is a newly written block
 with a new path in BP tree and so on; a resilver expects
 to have a disk full of known-missing pieces of blocks,
 and reconstructed pieces are written on the resilvering
 disk in-place at an address dictated by the known DVA -
 this allows to not rewrite the other disks and BP tree
 as COW would otherwise require.

No. Scrub (and any other repair, such as for errors found in the
course of normal reads) rewrites the reconstructed blocks in-place: to
the original DVA as referenced by its parents in the BP tree, even if
the device underneath that DVA is actually a new disk.

There is no COW. This is not a rewrite, and there is no original data
to preserve; this is a repair: making the disk sector contain what the
rest of the filesystem tree 'expects' it to contain. More specifically,
making it contain data that checksums to the value that block pointers
elsewhere say it should, via reconstruction using redundant
information (same DVA on a mirror/RAIDZ recon, or ditto blocks at
different DVAs found in the parent BP for copies > 1, including metadata).

BTW, if a new BP tree were required to repair blocks, we'd have
bp-rewrite already (or we wouldn't have repair yet).

   Other than these points, resilvers and scrubs should
 work the same, perhaps with nuances like separate tunables
 for throttling and such - but generic algorithms should
 be nearly identical.

 Q1: Is this assessment true?

In a sense, yes, despite the correction above.  There is less
difference between these cases than you expected, so they are nearly
identical :-)

   So I'll call them both a scrub below - it's shorter :)

Call them all repair.

The difference is not in how repair happens, but in how the need for a
given sector to be repaired is discovered.

Let's go over those, and clarify terminology, before going through the
rest of your post:

 * Normal reads: a device error or checksum failure triggers a
   repair. 

 * Scrub: Devices may be fine, but we want to verify that and fix any
   errors. In particular, we want to check all redundant copies.

 * Resilver: A device has been offline for a while, and needs to be
   'caught up', from its last known-good TXG to current.

 * Replace: A device has gone, and needs to be completely
   reconstructed.

Scrub is very similar to normal reads, apart from checking all copies
rather than serving the data from whichever copy successfully returns
first. Errors are not expected, are counted and repaired as/if found.

Resilver and Replace are very similar, and the terms are often used
interchangably. Replace is essentially resilver with a starting TXG of
0 (plus some labelling). In both cases, an error is expected or
assumed from the device in question, and repair initiated
unconditionally (and without incrementing error counters). 

You're suggesting an assymetry between Resilver and Replace to exploit
the possibile speedup of sequential access; ok, seems attractive at
first blush, let's explore the idea.

   Now, as everybody knows, at least by word-of-mouth on
 this list, the scrub tends to be slow on pools with a rich
 life (many updates and deletions, causing fragmentation,
 with old and young blocks intermixed on disk), more
 so if the pools are quite full (over about 80% for some
 reporters). This slowness (on non-SSD disks with non-zero
 seek latency) is attributed to several reasons I've seen
 stated and/or thought up while pondering. The reasons 

Re: [zfs-discuss] How does resilver/scrub work?

2012-05-18 Thread Daniel Carosone
On Fri, May 18, 2012 at 04:18:12PM +1000, Daniel Carosone wrote:
 
 When doing a scrub, you start at the root bp and walk the tree, doing
 reads for everything, verifying checksums, and letting repair happen
 for any errors. That traversal is either a breadth-first or
 depth-first traversal of the tree (I'm not sure which) done in TXG
 order.  
 
 [..]
 
 Note that there can be a lot of fanout in the tree;

Given the latter point, I'm going to guess depth-first.  Yes, I should
look at the code instead of posting speculation. 

--
Dan.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] How does resilver/scrub work?

2012-05-18 Thread Jim Klimov

First of all, thank you Daniel for taking the time to post a
lengthy reply! I do not get that kind of high-quality feedback
very often :)

I hope the community and googlers would benefit from that
conversation sometime. I did straighten out some thoughts
and (mis-)understandings, at least, more on that below :)

2012-05-18 15:30, Daniel Carosone wrote:

On Fri, May 18, 2012 at 03:05:09AM +0400, Jim Klimov wrote:

While waiting for that resilver to complete last week,
 I caught myself wondering how the resilvers (are supposed
 to) work in ZFS?
 The devil finds work for idle hands... :-)

Or rather, brains ;)

 Well, I'm not that - certainly not on the code.  It would probably be
 best (for both of us) to spend idle time looking at the code, before
 spending too much on speculation. Nonetheless, let's have at it! :)

...Yes, I should look at the code instead of posting speculation.



A good idea any day, but rather time-consuming. I have looked at the
code, at blogs, at mailing list archives, at the aged ZFS spec, for
about a year on-and-off now, and as you could see - understanding
remains imperfect ;)

Besides, turning the specific C code, even with the good comments
that are in place, into a narrative description like we did in this
thread is bulky and time-consuming, and likely would not convey much
to other people wanting to understand the same things and perhaps
hoping to contribute - even if only algorithmic ideas ;)

Finally, poring over the existing code alone, instead of sitting back
and doing some educated thinking (speculation), *may* be of limited
use if the current algorithms (or their implementation) work
unsatisfactorily for at least the use-cases I see them used in.
Thus I, as a n00b researcher, might care a bit less about what exactly
is wrong in the system that does not work (the way I want it to, at
least), and a bit more about designing and planning - speculating -
how (I think) it should work to suit my needs and usage patterns.
In this regard the existing implementation may be seen as a
POC which demonstrates what can be done, even if sub-optimally.
It works somewhat, and since we see downsides - it might work
better.

At the very least I can try to understand how it works now
and why some particular choices and tradeoffs were made
(perhaps we do use the lesser of evils indeed) - explained
in higher-level concepts and natural-language words that
correspondents like you or other ZFS experts (and authors)
on this list can quickly confirm or deny without wasting
their precious time (no sarcasm) on lengthy posts like these,
describing it all in detail. This is a useful experience and
learning source, and different from what reading the code
alone gives me.

Anyway, this speculation would be done by this n00b reader of
the code implicitly and with less (without any?) constructive
discussion (thanks again for that!) if I were to look into code
trying to fix something without planning ahead, and I know that
often does not end very well.

Ultimately, I guess I got more understanding by spending a few
hours to formulate correct questions (and thankfully getting some
answers) than from compiling all the disparate (and often outdated)
docs and blogs, and code, into some form of a structure in my head.
I also got to confirm that much of this compilation was correct
and which parts I missed ;)

Perhaps, now I (or someone else) won't waste months on inventing
or implementing something senseless from the start, or would find
ways to make a pluggable writing policy for tests of different
allocators for different purposes, or something of that kind... -
as you propose here:
 That said, there are always opportunities for tweaks and improvements
 to the allocation policy, or even for multiple allocation policies
 each more suited/tuned to specific workloads if known in advance.

Now, on to my ZFS questions and your selected responses:

 This may possibly improve zfs send speeds as well.

 Less likely, that's pretty much always going to have to go in txg
 order.

Would that really be TXG order - i.e. send blocks from TXG(N),
then send blocks from TXG(N+1), and so on; OR a BPtree walk
of the selected branch (starting from the root of snapshot
dataset), perhaps limiting the range of chosen TXG numbers
by the snapshot's creation and completion TXG timestamps?

Essentially, I don't want to quote all those pieces of text,
but I still doubt that tree walks are done in TXG order - at
least the way I understand it (which may be different from
your or others' understanding): I interpreted TXG order as
I said above - a monotonic, incremental walk from older TXG
numbers to newer ones. In order to do that you must have the
whole tree in RAM and sort it by TXGs (perhaps making an
array of all defined TXGs and pointers to individual block
pointers that have this TXG), which is lengthy, bulky on
RAM and I don't think I see it happening in real life.
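
One way this could work without any global sort (pure speculation on
my part, to be checked against the code): because of COW, a parent
block is rewritten whenever any of its children is, so a parent's
birth TXG is always >= that of its children. A plain depth-first walk
can then prune whole subtrees whose birth TXG falls at or before the
start of the interesting window. A toy sketch of such a pruned walk
(invented names, not ZFS code):

  /* Toy model of a birth-TXG-pruned depth-first walk -- not ZFS code. */
  #include <stdio.h>
  #include <stdint.h>
  #include <stddef.h>

  typedef struct node {
      uint64_t     birth_txg;        /* COW: always >= any child's birth */
      int          nchildren;
      struct node *children[4];
  } node_t;

  /* Visit only blocks born in the window (from_txg, to_txg]. */
  static void walk(const node_t *n, uint64_t from_txg, uint64_t to_txg)
  {
      if (n == NULL || n->birth_txg <= from_txg)
          return;                    /* whole subtree is older: skip it */
      if (n->birth_txg <= to_txg)
          printf("emit block born in txg %llu\n",
                 (unsigned long long)n->birth_txg);
      for (int i = 0; i < n->nchildren; i++)
          walk(n->children[i], from_txg, to_txg);
  }

  int main(void)
  {
      node_t leaf_old = { 900,  0, { NULL } };
      node_t leaf_new = { 1300, 0, { NULL } };
      node_t indirect = { 1300, 2, { &leaf_old, &leaf_new } };
      node_t root     = { 1300, 1, { &indirect } };

      walk(&root, 1000, 1500);  /* emits root, indirect, leaf_new only */
      return 0;
  }

The blocks come out in tree order, not in ascending TXG order, yet
every emitted block is guaranteed to lie inside the requested TXG
window - so no sorting and no whole-tree-in-RAM would be needed.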

If the statement means that when walking the 

Re: [zfs-discuss] How does resilver/scrub work?

2012-05-18 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov

I'm reading the ZFS on-disk spec, and I get the idea that there's an
uberblock pointing to a self-balancing tree (some say b-tree, some say
avl-tree, some say nv-tree), where data is only contained in the nodes.  But
I haven't found one particular important detail yet:

On which values does the balancing tree balance?  Is it balancing on the
logical block address?  This would make sense, as an application requests to
read/write some logical block, making it easy and fast to find the
corresponding physical blocks...

If that is the case, wouldn't scrub/resilver need to work according to
logical block order?  (Which would also be random-ish, but decidedly NOT the
same as TXG temporal order.)



Re: [zfs-discuss] How does resilver/scrub work?

2012-05-18 Thread Jim Klimov

2012-05-18 19:08, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Jim Klimov


I'm reading the ZFS on-disk spec, and I get the idea that there's an
uberblock pointing to a self-balancing tree (some say b-tree, some say
avl-tree, some say nv-tree), where data is only contained in the nodes.  But
I haven't found one particular important detail yet:

On which values does the balancing tree balance?  Is it balancing on the
logical block address?  This would make sense, as an application requests to
read/write some logical block, making it easy and fast to find the
corresponding physical blocks...


My memory fails me here for a precise answer... I think that
the on-disk data within a raidzN top-level VDEV (mirrors are
trivial) is laid out as follows, for an arbitrary 6-disk set
of raidz2 TLVDEV:

D1   D2   D3   D4   D5   D6
Ar1  Ar2  Ad1  Ad2  Ad3  Ad4
Br1  Br2  Bd1  Cr1  Cr2  Cd1
Cd2  Cd3  Cd4  Cr3  Cr4  Cd5
Cd6  Dr1  Dr2  Dd1  ...

In these examples above, several blocks are laid out in sectors
of different disks, including the redundancy blocks. Sequential
accesses on one disk progress in a column from top to bottom.
Accesses in a row are parallelized between many disks.

The A block's userdata is 4 sectors long, with 2 redundancy sectors.
The B block has just one userdata sector, and the C block has
6 userdata sectors, with a new pair of redundancy sectors started for
each group of up to 4 data sectors.
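
For illustration, here is a toy C program that reproduces the table
above under my assumption that sectors are handed out round-robin,
i.e. disk = offset modulo the number of disks (see also point 3
further below), with each stripe's parity sectors allocated first.
This is speculation about the allocator, not code taken from ZFS:

  /* Toy layout generator for the 6-disk raidz2 picture above.
   * Assumption: sector n of the allocation lands on disk (n % 6),
   * row (n / 6), and each stripe of up to 4 data sectors is
   * preceded by its 2 parity sectors.  Illustrative only. */
  #include <stdio.h>

  #define NDISKS  6
  #define NPARITY 2

  /* Lay out one block of 'ndata' user sectors starting at offset 'off';
   * returns the next free offset. */
  static int lay_out(char name, int ndata, int off)
  {
      int done = 0, ri = 1, di = 1;
      while (done < ndata) {
          int stripe = ndata - done;
          if (stripe > NDISKS - NPARITY)
              stripe = NDISKS - NPARITY;
          for (int p = 0; p < NPARITY; p++, off++)
              printf("%cr%d -> disk D%d, row %d\n", name, ri++,
                     off % NDISKS + 1, off / NDISKS + 1);
          for (int s = 0; s < stripe; s++, off++, done++)
              printf("%cd%d -> disk D%d, row %d\n", name, di++,
                     off % NDISKS + 1, off / NDISKS + 1);
      }
      return off;
  }

  int main(void)
  {
      int off = 0;
      off = lay_out('A', 4, off);  /* Ar1 Ar2 Ad1..Ad4           */
      off = lay_out('B', 1, off);  /* Br1 Br2 Bd1                */
      off = lay_out('C', 6, off);  /* Cr1 Cr2 Cd1..Cd4 Cr3 Cr4.. */
      off = lay_out('D', 1, off);  /* Dr1 Dr2 Dd1                */
      return 0;
  }

Its output matches the rows drawn above (Cd6 ends up on D1 in row 4,
followed by Dr1 Dr2 Dd1).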

AFAIK each ZFS block fully resides within one TLVDEV (and ditto
copies have their own separate life in another TLVDEV if available),
and striping over several TLVDEVs occurs at a whole-block level.
This, in particular, allows unbalanced pools with TLVDEVs of
different size and layout.

IF this picture is correct (confirmation or the reverse is
kindly requested), then:

1) DVA to LBA translation should be somewhat trivial, since
   the DVA is defined as ID(tlvdev):offset:length in 512-byte
   units (regardless of ashift value on the pool). I did not
   test this in practice or incur from the code though.

   I don't know if there are any gaps to take into account
   (i.e. maybe between metaslabs, of which there are supposed to be
   about 200 per vdev (or tlvdev, or pool?), in order to limit
   seeking between data written at roughly the same time). Even if
   there are gaps (i.e. to round allocations to on-disk tracks or
   offsets at multiples of a given number), I'd not complicate
   things and just leave the gaps as addressable but unreferenced
   free space.

   A poster on the list recently referenced slabs; I don't
   think I had seen this term before - but I guess it stands for
   the total allocation needed for a userdata block?

2) Addressing of blocks (or reverse - saying that these sectors
   belong to a particular block or are available) is impossible
   without knowing the (generally whole) blockpointer tree, and
   depending on (re-)written object sizes, the same sector can
   at different times in its life belong to blocks (slabs?) of
   different lengths and starting at different DVA offsets...

   Indeed, we also can not assume that sectors read-in from the
   disks contain a valid part of the blockpointer tree (despite
   even matching some magic number), not until we find a path
   through the known tree that leads to this block (I discussed
   this in my other post regarding vdev prefetch and defrag).
   However since reads are free as long as the HDD head is in
   the right location, and if blkptr_t's leading one to another
   are colocated on the disk, clever use of the prefetch and
   timely inspection of the prefetch cache can hopefully boost
   the BPtree walking speed.

   MAYBE I am wrong in this and there is also an allocation map
   in the large metaslabs or something? (I know there is some
   cleverness about finding available locations to write into,
   but I'm not ready to speak about it off the top of my head).

   I am not sure if this gives a clue to whether it's balancing
   "on the logical block address", though :) AFAIK the balancing
   tries to keep the maximum tree depth shortest, yet there is
   one root block and no rewriting of existing unchanged stale
   blocks (tree nodes). I am puzzled too :)


3) The layout is fixed at tlvdev creation time by its total
   number of disks, since that directly affects the calculation
   of which disk a given sector offset belongs to - it would
   be offset modulo the number of disks for raidzN regardless of N
   (because of not-full stripes), and just 0 for single drives
   and mirrors. This is why resizing a raidz set is indeed hard,
   while conversion of single disks to mirrors and back is easy.

   To a lesser extent the layout is limited by vdev size (which
   can be increased easily, but can not be decreased without
   reallocation and BP rewrite [*1]), and somewhat by the number
   of redundancy disks which influences individual blocks' on-disk
   representation and required length [*2].

[*1]: This might 

[zfs-discuss] How does resilver/scrub work?

2012-05-17 Thread Jim Klimov

Hello all,

  While waiting for that resilver to complete last week,
I caught myself wondering how the resilvers (are supposed
to) work in ZFS?

  Based on what I see in practice and read in this list
and some blogs, I've built a picture and would be grateful
if some experts actually familiar with code and architecture
would say how far off I guessed from the truth ;)

  Ultimately I wonder if there are possible optimizations
to make the scrub process more closely resemble sequential
drive-cloning (bandwidth/throughput-bound), rather than the
IOPS-bound random-seek thrashing for hours that we
often see now, at least on (over?)saturated pools.
This may possibly improve zfs send speeds as well.

  First of all, I state (and ask to confirm): I think
resilvers are a subset of scrubs, in that:
1) resilvers are limited to a particular top-level VDEV
(and its number is a component of each block's DVA address)
and
2) when scrub finds a block mismatching its known checksum,
scrub reallocates the whole block anew using the recovered
known-valid data - in essence it is a newly written block
with a new path in BP tree and so on; a resilver expects
to have a disk full of known-missing pieces of blocks,
and reconstructed pieces are written on the resilvering
disk in-place at an address dictated by the known DVA -
this makes it possible to not rewrite the other disks and BP tree
as COW would otherwise require.

  Other than these points, resilvers and scrubs should
work the same, perhaps with nuances like separate tunables
for throttling and such - but generic algorithms should
be nearly identical.

Q1: Is this assessment true?

  So I'll call them both a scrub below - it's shorter :)



  Now, as everybody knows, at least by word-of-mouth on
this list, the scrub tends to be slow on pools with a rich
life (many updates and deletions, causing fragmentation,
with old and young blocks intermixed on disk), more
so if the pools are quite full (over about 80% for some
reporters). This slowness (on non-SSD disks with non-zero
seek latency) is attributed to several reasons I've seen
stated and/or thought up while pondering. The reasons may
include statements like:



1) Scrub goes on in TXG order.

If it is indeed so - the system must find older blocks,
then newer ones, and so on. IF the block-pointer tree
starting from uberblock is the only reference to the
entirety of the on-disk blocks (unlike say DDT) then
this tree would have to be read into memory and sorted
by TXG age and then processed.

From my system's failures I know that this tree would
take about 30 GB on my home-NAS box with 8 GB of RAM, and
the kernel crashes the machine by depleting RAM and
not going into swap after certain operations (i.e.
large deletes on datasets with enabled deduplication).
That was discussed last year by me, and recently by
other posters.

Since the scrub does not do that and does not even
press on RAM in a fatal manner, I think this reason
is wrong. I also fail to see why one would use that
processing order in the first place - on a fairly
fragmented system even the blocks from newer TXGs
do not necessarily follow those from the previous
ones.

What this rumour could reflect, however, is that a scrub
(or, more importantly, a resilver) is indeed limited to
the interesting range of TXGs, such as picking only
those blocks which were written between the last TXG that
a lost-and-reconnected disk knew of (known to the system
via that disk's stale uberblock), and the current TXG
at the moment of its reconnection. Newer writes would
probably land onto all disks anyway, so a resilver has
only to find and fix those missing TXG numbers.
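
(If I recall correctly, ZFS tracks roughly this as per-vdev "dirty
time logs" - ranges of TXGs during which a device missed writes -
though I have not checked the details in the code. A toy version of
the per-block test, with invented names, could look like this:

  /* Toy DTL-style membership test -- invented names, not the ZFS code. */
  #include <stdio.h>
  #include <stdint.h>
  #include <stdbool.h>

  typedef struct { uint64_t start, end; } txg_range_t;  /* inclusive */

  /* Should a block born in 'birth' be rebuilt on the returning disk? */
  static bool needs_resilver(const txg_range_t *missed, int n, uint64_t birth)
  {
      for (int i = 0; i < n; i++)
          if (birth >= missed[i].start && birth <= missed[i].end)
              return true;
      return false;
  }

  int main(void)
  {
      /* Disk missed txgs 1001..1500 and 1600..1610 while detached. */
      txg_range_t missed[] = { { 1001, 1500 }, { 1600, 1610 } };

      printf("%d %d %d\n",
             needs_resilver(missed, 2,  900),   /* 0: written before outage */
             needs_resilver(missed, 2, 1200),   /* 1: missed, resilver it   */
             needs_resilver(missed, 2, 1700));  /* 0: landed on all disks   */
      return 0;
  }

Only the middle block would be resilvered; the newest one already
landed on the reattached disk.)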

In my problematic system, however, I only saw full resilvers
even after they restarted numerous times... This may actually
support the idea that scrubs are NOT txg-ordered; otherwise
a regularly updated bookkeeping attribute on the disk (in the
uberblock?) would note that some TXGs are known to fully
exist on the resilvering drive - and this is not happening.




2) Scrub walks the block-pointer tree.

That seems like a viable reason for lots of random reads
(hitting the IOPS barrier). It does not directly explain
the reports I think I've seen about L2ARC improving scrub
speeds and system responsiveness - although extra caching
takes the repetitive load off the HDDs and leaves them
some more timeslices to participate in scrubbing (and
*that* should incur reads from disks, not caches).

On an active system, block pointer entries are relatively
short-lived, with whole branches of a tree being updated
and written in a new location upon every file update.
This layout is bound to look like Swiss cheese after a while,
even if the writes were initially coalesced into few IOs.



3) If there are N top-level VDEVs in a pool, then only
the one with the resilvering disk would take the
performance hit - not quite true, because pieces of the
BPtree are spread across all VDEVs. The one resilvering
would get the most bulk traffic, when DVAs residing on
it are found and