First of all, thank you Daniel for taking the time to post a
lengthy reply! I do not get that kind of high-quality feedback
very often :)
I hope the community and googlers will benefit from this
conversation someday. It did help me straighten out some thoughts
and (mis-)understandings, at least - more on that below :)
2012-05-18 15:30, Daniel Carosone wrote:
> On Fri, May 18, 2012 at 03:05:09AM +0400, Jim Klimov wrote:
>> While waiting for that resilver to complete last week,
>> I caught myself wondering how the resilvers (are supposed
>> to) work in ZFS?
> The devil finds work for idle hands... :-)
Or rather, brains ;)
> Well, I'm not that - certainly not on the code. It would probably be
> best (for both of us) to spend idle time looking at the code, before
> spending too much on speculation. Nonetheless, let's have at it! :)
...Yes, I should look at the code instead of posting speculation.
A good idea any day, but rather time-consuming. I have looked at the
code, at blogs, at mailing list archives, at the aged ZFS spec, for
about a year on-and-off now, and as you can see - understanding
remains imperfect ;)
Besides, turning the specific C code, even with those good comments
that are in place, into a narrative description like we did in this
thread, is bulky, time-consuming and likely useless (not conveyed)
to other people wanting to understand the same and perhaps hoping
to contribute - even if only algorithmic ideas ;)
Finally, racking my brain over the existing code alone, instead of
sitting back and doing some educated thinking (speculation),
*may* be useless if the current algorithms (or their
implementation) work unsatisfactorily, at least for the
use-cases I see them used in. Thus I, as a n00b researcher,
might care a bit less about what exactly is wrong in the system
that does not work (the way I want it to, at least), and
care a bit more about designing and planning = speculating =
how (I think) it should work to suit my needs and usage patterns.
In this regard the existing implementation may be seen as a
POC which demonstrates what can be done, even if sub-optimally.
It works somewhat, and since we see downsides - it might work better.
At the very least I can try to understand how it works now
and why some particular choices and tradeoffs were made
(perhaps we do use the lesser of evils indeed) - explained
in higher-level concepts and natural-language words that
correspondents like you or other ZFS experts (and authors)
on this list can quickly confirm or deny without wasting
their precious time (no sarcasm) on lengthy posts like these,
describing it all in detail. This is a useful experience and
learning source, and different from what reading the code
alone gives me.
Anyway, this "speculation" would be done by this n00b reader of
the code implicitly and with less (without any?) constructive
discussion (thanks again for that!) if I were to look into code
trying to fix something without planning ahead, and I know that
often does not end very well.
Ultimately, I guess I got more understanding by spending a few
hours to formulate correct questions (and thankfully getting some
answers) than from compiling all the disparate (and often outdated)
docs and blogs, and code, into some form of a structure in my head.
I also got to confirm that much of this compilation was correct,
and to learn which parts I had missed ;)
Perhaps, now I (or someone else) won't waste months on inventing
or implementing something senseless from the start, or would find
ways to make a pluggable writing policy for tests of different
allocators for different purposes, or something of that kind... -
as you propose here:
> That said, there are always opportunities for tweaks and improvements
> to the allocation policy, or even for multiple allocation policies
> each more suited/tuned to specific workloads if known in advance.
Now, on to my ZFS questions and your selected responses:
>> This may possibly improve zfs send speeds as well.
> Less likely, that's pretty much always going to have to go in txg
Would that be really TXG order - i.e. send blocks from TXG(N),
then send blocks from TXG(N+1), and so on; OR a BPtree walk
of the selected branch (starting from the root of snapshot
dataset), perhaps limiting the range of chosen TXG numbers
by the snapshot's creation and completion "TXG timestamps"?
Essentially, I don't want to quote all those pieces of text,
but I still doubt that tree walks are done in TXG order - at
least the way I understand it (which may be different from
yours or others' understanding): I interpreted "TXG order" as
I said above - a monotonically increasing walk from older TXG
numbers to newer ones. In order to do that you must have the
whole tree in RAM and sort it by TXGs (perhaps making an
array of all defined TXGs and pointers to individual block
pointers that have this TXG), which is slow, heavy on
RAM, and I don't think I see it happening in real life.
If the statement means that "when walking the tree, first
walk the child branch with lower TXG" then the statement
makes sense somewhat - but it is not strictly "TXG-ordered",
I think. At the very least, the walk starts with the most
recent TXG being the uberblock (or poolwide root block) ;)
Such a walk would indeed reach out to the oldest TXGs in a
particular branch first, but starting from (and backtracking
to) newer ones.
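
To illustrate the difference between a strict TXG-ordered walk and
a tree walk that is merely limited by TXG, here is a minimal Python
sketch. It is not the ZFS code: BlockPtr, walk_since and the toy tree
are names and data I made up, and the pruning rule rests on my
assumption that a parent block is never older than the blocks below it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BlockPtr:
    """Hypothetical, simplified stand-in for a ZFS block pointer."""
    name: str
    birth_txg: int
    children: List["BlockPtr"] = field(default_factory=list)


def walk_since(bp: BlockPtr, min_txg: int, visited: List[BlockPtr]) -> List[BlockPtr]:
    """Depth-first walk that visits only blocks born after min_txg."""
    # Assumption: with copy-on-write a parent is rewritten whenever a child
    # changes, so a parent is never older than anything below it. An old
    # parent therefore lets us skip its whole subtree.
    if bp.birth_txg <= min_txg:
        return visited
    visited.append(bp)
    for child in bp.children:
        walk_since(child, min_txg, visited)
    return visited


# Toy tree: the root carries the newest TXG, older data hangs below it.
tree = BlockPtr("root", 50, [
    BlockPtr("a", 20, [BlockPtr("a1", 10), BlockPtr("a2", 20)]),
    BlockPtr("b", 50, [BlockPtr("b1", 30), BlockPtr("b2", 50)]),
])

order = walk_since(tree, min_txg=25, visited=[])
print([(bp.name, bp.birth_txg) for bp in order])
```

The visit order comes out as root, b, b1, b2 with birth TXGs
50, 50, 30, 50: the walk is pruned by TXG, but it is not sorted by it.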
So in order to benefit from sequential reads during the
tree walk, the blocks holding the block-pointer tree
(at least one copy of them) should be stored on disk in
essentially this same order that a tree walk reader expects
to find them. Then a read request (with associated vdev
prefetch) would find large portions of the BP tree needed
"now or in a few steps" in one mechanical IO...
> So, if reading blocks sequentially, you can't verify them. You don't
> know what their checksums are supposed to be, or even what they
> contain or where to look for the checksums, even if you were prepared
> to seek to find them. This is why scrub walks the bp tree.
...And perhaps to take more advantage of this, the system
should not descend into a single child BP and its branch
right away, but rather try to see in the rolling prefetch
cache (after a read was satisfied by a mechanical IO) if
more of the soon-to-be-needed blkptrs are in RAM currently
and should be relocated to the ARC/L2ARC before they roll
out of the prefetch cache, even if actual requests for
them would come after the subtree walk, perhaps in a few
seconds or minutes. If the subtree is so big that these
ARCed entries would be pushed out by then, well, we did
all we could to speed up the system for smaller branches
and lost little time in the process. And cache misses
would be logged so users can know to upgrade their ARCs.
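
To put rough numbers on the layout-plus-prefetch argument, here is a
small Python sketch. It is only a toy model under my own assumptions:
offsets are in arbitrary units, every mechanical read brings in one
aligned window of 64 units, and a small LRU cache keeps the last few
windows. The function name and the parameters are hypothetical and do
not describe the real vdev prefetch cache.

```python
from collections import OrderedDict
from typing import Iterable


def count_mechanical_reads(offsets: Iterable[int],
                           window: int = 64,
                           cache_windows: int = 4) -> int:
    """Count the reads that miss a small LRU cache of prefetched windows."""
    cache: "OrderedDict[int, bool]" = OrderedDict()
    ios = 0
    for off in offsets:
        w = off // window            # aligned window containing this block
        if w in cache:
            cache.move_to_end(w)     # cache hit: no mechanical IO needed
            continue
        ios += 1                     # cache miss: one mechanical IO
        cache[w] = True
        if len(cache) > cache_windows:
            cache.popitem(last=False)  # evict the least recently used window
    return ios


# 256 tree blocks stored back to back, in the order the walk wants them.
colocated = list(range(256))
# The same 256 blocks scattered far apart across the disk.
scattered = [i * 1000 for i in range(256)]

print(count_mechanical_reads(colocated), count_mechanical_reads(scattered))
```

In this model the colocated layout needs 4 mechanical reads and the
scattered one needs 256, one per block.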
> No. Scrub (and any other repair, such as for errors found in the
> course of normal reads) rewrite the reconstructed blocks in-place: to
> the original DVA as referenced by its parents in the BP tree, even if
> the device underneath that DVA is actually a new disk.
> There is no COW. This is not a rewrite, and there is no original data
> to preserve...
Okay, thanks, I guess this simplifies things - although
somewhat defies the BPtree defrag approach I proposed.
> BTW, if a new BP tree was required to repair blocks, we'd have
> bp-rewrite already (or we wouldn't have repair yet).
I'm not so sure. I've seen discussed (and proposed) many small
tasks that could be done by a BP rewrite in general, but can
be done "elsehow". Taking as an example my (mis)understanding
of scrub repairs, the recovered block data could just be written
into the pool just like any other new data block, and cause the
rewriting of the BP tree branch leading to it. If that is not
done (or required) here - well, that's for the better I guess.
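
For the record, this is the kind of branch rewrite I had in mind,
as a minimal Python sketch. Node, cow_rewrite and the DVA allocator
are made-up names for illustration, and the tree is a toy; it only
shows that writing one block to a new place drags all of its ancestors
along, while untouched siblings stay shared.

```python
import itertools
from dataclasses import dataclass
from typing import Iterator, Sequence, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical block: an address, a birth TXG and child pointers."""
    dva: int
    birth_txg: int
    children: Tuple["Node", ...] = ()


def cow_rewrite(node: Node, path: Sequence[int], txg: int,
                alloc: Iterator[int]) -> Node:
    """Rewrite the block reached by `path` (child indices) copy-on-write."""
    if path:
        idx = path[0]
        new_child = cow_rewrite(node.children[idx], path[1:], txg, alloc)
        children = node.children[:idx] + (new_child,) + node.children[idx + 1:]
    else:
        children = node.children
    # Every block on the path gets a new address and the current TXG,
    # because its content (a child pointer, or the data itself) changed.
    return Node(dva=next(alloc), birth_txg=txg, children=children)


leaf_a, leaf_b = Node(1, 10), Node(2, 10)
mid = Node(3, 10, (leaf_a, leaf_b))
other = Node(4, 5)
root = Node(5, 10, (mid, other))

new_root = cow_rewrite(root, [0, 1], txg=42, alloc=itertools.count(1000))
print(new_root.dva, new_root.birth_txg)
```

The old root is left untouched, which is what keeps older snapshots
valid in a COW scheme; an in-place repair needs none of this.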
> ...This is bp rewrite, or rather, why bp-rewrite is hard.
The generic BP rewrite also should handle things like
reduction of VDEV sizes, removal of TLVDEVs, changes to
TLVDEV layouts (i.e. migration of raidz levels) and so
on. That is likely hard (especially to do online) indeed.
Individual operations, like defragmentation, recompression
or dedup of existing data, all of which can be done today
by zfs-sending data away from the pool, cleaning it up, and
zfs-receiving the data back - without all the lowlevel layout
changes that BP rewrite can do - well, they can be done today.
Why not in-place?
Unlike manual send-away-and-receive cycles incurring downtime,
the equivalent in-place manipulations can be done transparently
to ZPL/ZVOL users by just invalidating parts of the ARC (by DVA
of reallocated blocks), I think, and do not seem as inherently
difficult as complete BP rewrites.
Again, this interim solution may be just a POC for later works
on BP rewrite to include and improve :)
> "Just" is such a four-letter word.
> If you move a bp, you change its DVA. Which means that the parent bp
> pointing to it needs to be updated and rewritten, and its parents as
> well. This is new, COW data, with a new TXG attached -- but referring
> to data that is old and has not been changed.
> This gets back to the misunderstanding (way) above. Repair is not
> COW; repair is repairing the disk block to the original, correct
Changes of DVAs causing reallocation of the whole branch of
BPs during the defrag - yes, as I also wrote. However I am
not sure that it would induce such changes to TXG numbers
that must be fatal to snapshots and scrubs: as I've seen in
the code (unlike the ZFS on-disk format docs), the current
blkptr_t includes two fields for a TXG number - the birth
TXG and (IIRC) the physical birth TXG. I guess one refers to the
timestamp of when the data block was initially allocated
in the queue, and another one (if non-zero) refers to the
timestamp of when the block was optionally reallocated and
written into the pool - perhaps upon recovery from ZIL, or
(as I thought above) upon generic repair, or my proposed
idea of defrag.
So perhaps the system is already ready to correctly
process such reallocations, or can be cheated into that
by "clever" use and/or ignoring of one of these fields...
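
A minimal Python sketch of what I mean by using two TXG fields. The
BlkPtr class, its field names and the helper functions are my own
simplification for illustration, not the real blkptr_t layout, and the
"changed since snapshot" test is an assumption about what a consumer
such as an incremental send would ask.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BlkPtr:
    """Hypothetical two-TXG block pointer; not the real blkptr_t layout."""
    dva: int
    birth_txg: int            # when the data logically came into being
    phys_birth_txg: int = 0   # when it was (re)written; 0 = same as birth_txg


def physical_birth(bp: BlkPtr) -> int:
    """TXG of the last physical write, falling back to the logical birth."""
    return bp.phys_birth_txg or bp.birth_txg


def relocate(bp: BlkPtr, new_dva: int, current_txg: int) -> BlkPtr:
    """Move the block to a new address but keep its logical age."""
    return replace(bp, dva=new_dva, phys_birth_txg=current_txg)


def changed_since(bp: BlkPtr, snapshot_txg: int) -> bool:
    """Assumed question a snapshot diff would ask: is the data newer?"""
    return bp.birth_txg > snapshot_txg


old = BlkPtr(dva=1234, birth_txg=100)
moved = relocate(old, new_dva=42, current_txg=900)
print(physical_birth(moved), changed_since(moved, snapshot_txg=500))
```

Here the block moved in TXG 900 still counts as unchanged relative to
a snapshot taken at TXG 500. The sketch says nothing about the parent
blocks, which would still have to be rewritten to point at the new DVA.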
> You just broke snapshots and scrub, at least.
As for snapshots: you can send a series of incremental
snapshots from one system to another, and of course the
TXG numbers on a particular pool for blocks of the snapshot
dataset would differ. But this does not matter, as long as
they are committed on disk in a particular order, with
BPtree branches properly pointing to timestamp-ordered
snapshots of the parent dataset.
Your concern seems valid indeed, but I think it can be
countered by scheduling a BPtree defrag to involve
relocating and updating block pointers for all snapshots
of a dataset (and maybe its clones), or at least ensuring
that the parent blocks of newer snapshots have higher TXG
numbers - if that is required. This may place non-trivial
demands on cache or buffer memory size and usage in order
to prepare the big transaction in case of large datasets,
so perhaps if the system detects it can't properly defrag
the BPtree branch in one operation, it should abort without
crashing the OS into scanrate-hell ;)
> It's not going to help a scrub, since that reads all of the ditto
> block copies, so bunching just one copy isn't useful.
I can agree - but only partially. If the goal of storing
the block pointers together, and so minimizing mechanical reads
by getting many of them at once, is achievable, then it becomes
possible to "preread" the "colocated" version of BP tree
or its large portions quickly (if there are no checksum
or device errors during such reads - otherwise we fall
back to scattered ditto copies of those corrupted BP tree
blocks). Then we can schedule more optimal reads for the
scattered data, including the ditto blocks of the BP tree
that we've already read in (the other copies of these blocks).
It would be the same walk covering the same data objects
on disk, but possibly in a different (and hopefully faster)
manner than today.
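
A rough Python sketch of the "read the tree first, then plan the data
reads" idea. It assumes the block pointers have already been read and
verified, so that the offsets of everything they point to are known.
The function names and the offsets are made up, and total head travel
is only a crude proxy for real disk behaviour.

```python
from typing import List, Sequence


def seek_distance(offsets: Sequence[int]) -> int:
    """Total head travel when reading the offsets in the given order."""
    return sum(abs(b - a) for a, b in zip(offsets, offsets[1:]))


def plan_reads(offsets_in_walk_order: Sequence[int]) -> List[int]:
    """Reorder known block offsets into a single ascending sweep."""
    return sorted(offsets_in_walk_order)


# Offsets of data blocks in the order a plain tree walk would request them.
walk_order = [900, 10, 500, 20, 880, 30]
planned = plan_reads(walk_order)

print(seek_distance(walk_order), seek_distance(planned))
```

For these offsets the plain walk order travels 3570 units and the
planned sweep travels 890, while reading exactly the same blocks.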
Thanks a lot for the discussion, I really appreciate it :)
zfs-discuss mailing list