While waiting for that resilver to complete last week,
I caught myself wondering how resilvers (are supposed
to) work in ZFS.
Based on what I see in practice and read on this list
and in some blogs, I've built a picture and would be grateful
if some experts actually familiar with the code and architecture
would say how far my guess is from the truth ;)
Ultimately I wonder if there are possible optimizations
to make the scrub process resemble a sequential
drive clone (bandwidth/throughput-bound) rather than the
IOPS-bound random-seek thrashing for hours that we
often see now, at least on (over?)saturated pools.
This might improve zfs send speeds as well.
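
To put rough numbers on that difference, here is a back-of-the-envelope sketch. Every figure in it (1 TB allocated, 128 KiB blocks, 100 random IOPS, 100 MB/s streaming) is an illustrative assumption, not a measurement from any real pool.

```python
# Back-of-the-envelope: seek-bound vs. streaming scrub time.
# All numbers below are assumed for illustration, not measured.

def seek_bound_hours(data_bytes, block_bytes, iops):
    """One random I/O per block, limited by the disk's IOPS."""
    blocks = data_bytes / block_bytes
    return blocks / iops / 3600

def streaming_hours(data_bytes, bytes_per_sec):
    """Sequential read, limited only by throughput."""
    return data_bytes / bytes_per_sec / 3600

DATA = 10**12        # 1 TB allocated (assumed)
BLOCK = 128 * 1024   # 128 KiB blocks (assumed)
IOPS = 100           # random IOPS of one HDD (assumed)
BW = 100 * 10**6     # 100 MB/s streaming rate (assumed)

random_h = seek_bound_hours(DATA, BLOCK, IOPS)
stream_h = streaming_hours(DATA, BW)
print(f"seek-bound: {random_h:.1f} h, streaming: {stream_h:.1f} h")
```

With these assumed figures the seek-bound case takes roughly 21 hours against under 3 hours for streaming, and smaller blocks widen the gap further.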
First of all, I state (and ask to confirm): I think
resilvers are a subset of scrubs, in that:
1) resilvers are limited to a particular top-level VDEV
(and its number is a component of each block's DVA address)
2) when scrub finds a block mismatching its known checksum,
scrub reallocates the whole block anew using the recovered
known-valid data - in essence it is a newly written block
with a new path in BP tree and so on; a resilver expects
to have a disk full of known-missing pieces of blocks,
and reconstructed pieces are written on the resilvering
disk "in-place" at an address dictated by the known DVA -
this avoids rewriting the other disks and the BP tree,
as COW would otherwise require.
Other than these points, resilvers and scrubs should
work the same, perhaps with nuances like separate tunables
for throttling and such - but generic algorithms should
be nearly identical.
Q1: Is this assessment true?
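
Point 1 above can be sketched as a toy model. The class and function names here (DVA, BlockPointer, resilver_targets) are hypothetical and only illustrate my guess; they are not the real ZFS structures or code paths. The one thing taken as given is that a block pointer carries up to three DVAs, each naming a top-level vdev and an offset on it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DVA:
    vdev: int     # top-level vdev number
    offset: int   # byte offset within that vdev

@dataclass
class BlockPointer:
    dvas: list      # up to three DVAs (ditto copies)
    birth_txg: int  # TXG in which the block was written

def resilver_targets(bps, vdev_id):
    """Return the DVAs that live on `vdev_id`.

    In this toy model these are the addresses a resilver would
    rewrite in place; blocks on other vdevs are left alone.
    """
    return [d for bp in bps for d in bp.dvas if d.vdev == vdev_id]

# Made-up block pointers spread over three top-level vdevs.
bps = [
    BlockPointer([DVA(0, 4096), DVA(1, 8192)], birth_txg=10),
    BlockPointer([DVA(1, 0)], birth_txg=12),
    BlockPointer([DVA(2, 512)], birth_txg=15),
]
targets = resilver_targets(bps, vdev_id=1)
print(targets)
```

Note that finding these targets still means looking at every block pointer, including those whose data sits on other vdevs.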
So I'll call them both a "scrub" below - it's shorter :)
Now, as everybody knows, at least by word-of-mouth on
this list, the scrub tends to be slow on pools with a rich
life (many updates and deletions, causing fragmentation,
with "old" and "young" blocks intermixed on disk), more
so if the pools are quite full (over about 80% for some
reporters). This slowness (on non-SSD disks with non-zero
seek latency) is attributed to several reasons I've seen
stated and/or thought up while pondering. The reasons may
include statements like:
1) "Scrub goes on in TXG order".
If it is indeed so - the system must find older blocks,
then newer ones, and so on. IF the block-pointer tree
starting from uberblock is the only reference to the
entirety of the on-disk blocks (unlike say DDT) then
this tree would have to be read into memory and sorted
by TXG age and then processed.
From my system's failures I know that this tree would
take about 30GB on my home-NAS box with 8GB RAM, and
the kernel crashes the machine by depleting RAM and
not going into swap after certain operations (e.g.
large deletes on datasets with deduplication enabled).
That was discussed last year by me, and recently by
Since the scrub does not do that and does not even
press on RAM in a fatal manner, I think this "reason"
is wrong. I also fail to see why one would do that
processing ordering in the first place - on a fairly
fragmented system even the blocks from "newer" TXGs
do not necessarily follow those from the "previous"
What this rumour could reflect, however, is that a scrub
(or more importantly, a resilver) is indeed limited by
the "interesting" range of TXGs, such as picking only
those blocks which were written between the last TXG that
a lost-and-reconnected disk knew of (known to the system
via that disk's stale uberblock), and the current TXG
at the moment of its reconnection. Newer writes would
probably land onto all disks anyway, so a resilver has
only to find and fix those missing TXG numbers.
In my problematic system, however, I only saw full resilvers
even after they restarted numerous times... This may actually
support the idea that scrubs are NOT TXG-ordered; otherwise
a regularly updated bookkeeping attribute on the disk (in the
uberblock?) would note that some TXGs are known to fully
exist on the resilvering drive - and this is not happening.
2) "Scrub walks the block-pointer tree".
That seems like a viable reason for lots of random reads
(hitting the IOPS barrier). It does not directly explain
the reports I think I've seen about L2ARC improving scrub
speeds and system responsiveness - although extra caching
takes the repetitive load off the HDDs and leaves them
some more timeslices to participate in scrubbing (and
*that* should incur reads from disks, not caches).
On an active system, block pointer entries are relatively
short-lived, with whole branches of a tree being updated
and written in a new location upon every file update.
This image is bound to look like good cheese after a while
even if the writes were initially coalesced into few IOs.
3) "If there are N top-level VDEVs in a pool, then only
the one with the resilvering disk would be hit for
performance" - not quite true, because pieces of the
BPtree are spread across all VDEVs. The one resilvering
would get the most bulk traffic, when DVAs residing on
it are found and userdata blocks get transferred, but
random read seeks caused by the resilvering process
should happen all over the pool.
Q2: Am I correct with the interpretation of statements 1-3?
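
The "interesting TXG range" idea from statement 1 combines naturally with the tree walk from statement 2. A minimal sketch, assuming that copy-on-write makes every parent at least as young as its youngest child, so a subtree whose pointer carries an old birth TXG can be skipped without reading it. Node and walk are hypothetical names for this toy, not ZFS code.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    birth_txg: int   # as recorded in the parent's pointer to this block
    children: list = field(default_factory=list)

def walk(node, min_txg):
    """Return names of blocks born after min_txg, in tree-walk order.

    Assumption of this toy model: a parent is rewritten whenever a
    child changes, so parent.birth_txg >= child.birth_txg and an old
    subtree can be pruned as a whole.
    """
    if node.birth_txg <= min_txg:
        return []                      # nothing newer below this point
    found = [node.name]
    for child in node.children:
        found += walk(child, min_txg)
    return found

tree = Node("root", 20, [
    Node("a", 5, [Node("a1", 3), Node("a2", 5)]),
    Node("b", 20, [Node("b1", 7), Node("b2", 20)]),
])
changed = walk(tree, min_txg=10)
print(changed)
```

Even with pruning, the blocks that do get visited come out in tree order, not in on-disk order, which is where the random seeks would come from.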
One optimization that could take place here would be to
store some of the BPs' ditto copies in compact locations
on disk (not all over it evenly), albeit maybe hurting
the write performance. This way a resilver run, or even
a scrub or zfs send, might be like a vdev-prefetch - a
scooping read of several megabytes worth of blockpointers
(this would especially help if the whole tree would fit
in RAM/L2ARC/swap), then sorting out the tree or its major
branches. The benefit would be little mechanical seeking
for lots of BP data. This might possibly require us to
invalidate the freed BP slots somehow as well :\
In case of scrubs, where we would have to read in all of
the allocated blocks from the media to test it, this would
let us schedule a sequential read of the drives' userdata
while making sense of the sectors we find (as particular
In case of resilvering - this would let us find DVAs of
blocks in the interesting TLVDEV and in the TXG range and
also schedule huge sequential reads instead of random
In case of zfs send, this would help us pick out the
TXG-limited ranges of the blocks for a dataset, and
again schedule the sequential reads for userdata (if any).
Q3: Does the IDEA above make sense - storing BP entries
(one of the ditto blocks) in some common location on disk,
so as to minimize mechanical seeks while reading much of
the BP tree?
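
The payoff I have in mind, once the block pointers are in RAM, can be sketched as follows: sort the discovered extents by on-disk offset and merge neighbours into larger reads. The offsets are made up and the helper names (head_travel, coalesce) are hypothetical; this is a toy model of the idea, not a claim about what ZFS does.

```python
def head_travel(offsets):
    """Total distance the head moves when visiting offsets in order."""
    return sum(abs(b - a) for a, b in zip(offsets, offsets[1:]))

def coalesce(extents, max_gap=0):
    """Sort (offset, size) extents and merge those at most max_gap apart."""
    merged = []
    for off, size in sorted(extents):
        if merged and off - (merged[-1][0] + merged[-1][1]) <= max_gap:
            last_off, last_size = merged[-1]
            merged[-1] = (last_off, max(last_size, off + size - last_off))
        else:
            merged.append((off, size))
    return merged

# Extents in the order a tree walk might discover them (made-up values).
extents = [(900, 100), (0, 100), (500, 100),
           (100, 100), (600, 100), (1000, 100)]

tree_order = [off for off, _ in extents]
disk_order = sorted(tree_order)
big_reads = coalesce(extents)

print(head_travel(tree_order), head_travel(disk_order), big_reads)
```

Raising max_gap would trade some wasted bandwidth (reading over small holes) for even fewer, larger I/Os.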
It seems possible to enable defragmentation of the BP tree
(those ditto copies that are stored together) by just
relocating the valid ones in correct order onto a free
metaslab. It seems that ZFS keeps some free space for
passive defrag purposes anyway - why not use it actively?
Live migration of blocks like this seems to be available
with scrub's repair of the mismatching blocks. However,
here some care should be taken to take into account that
the parent blockpointers would also need to be reallocated
since the children's checksums would change - so the whole
tree/branch of reallocations would have to be planned and
written out in sequential order onto the spare free space.
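
Why a single relocation ripples up the whole branch can be shown with a toy hash tree. The assumption here is that each parent block stores its child's address together with the child's checksum; the addresses and the parent_payload helper are invented for illustration. In this toy model, at least, the moved block's own checksum stays the same, and it is the parent that changes because it records the child's new address.

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def parent_payload(child_addr: int, child_data: bytes) -> bytes:
    """Toy parent block: the child's address plus the child's checksum."""
    return f"{child_addr}:{checksum(child_data)}".encode()

leaf = b"user data"

# Before the move: leaf at address 1000, indirect block at 2000.
indirect_old = parent_payload(1000, leaf)
root_old = parent_payload(2000, indirect_old)

# Move the leaf to address 5000; its contents are untouched.
indirect_new = parent_payload(5000, leaf)
# The indirect block now has different contents, so under COW it is
# written to a new address too (6000 here), which changes the root.
root_new = parent_payload(6000, indirect_new)

print(checksum(indirect_old) != checksum(indirect_new))
print(checksum(root_old) != checksum(root_new))
```

So relocating one block means rewriting every ancestor up to the root, which is why the writes would need to be planned as a batch.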
Overall, if my understanding somewhat resembles how things
really are, these ideas may help create and maintain a
layout of metadata that can be bulk-read, which is IMHO
critical for many operations as well as to shorten recovery
windows when resilvering disks.
Q4: I wonder if similar (equivalent) solutions are already
in place and did not help much? ;)
zfs-discuss mailing list