The idea of dedicated metadata devices (likely SSDs) for ZFS
has been generically discussed a number of times on this list,
but I don't think I've seen a final proposal that someone would
take up for implementation (as public source code, at least).
I'd like to take the liberty of summarizing the ideas I've either
seen in discussions or proposed myself on this matter, to see if
the overall idea would make sense to gurus of ZFS architecture.
So, the assumption was that the performance killer in ZFS at
least on smallish deployments (few HDDs and an SSD accelerator),
like those in Home-NAS types of boxes, was random IO to lots of
metadata. This IMHO includes primarily the block pointer tree
and the DDT for those who risked using dedup. I am not sure how
often other types of metadata (dataset descriptors, etc.) need
to be read, or whether occasional reads plus caching already
cover them.
Another idea was that L2ARC caching might not really cut it
for metadata in comparison to a dedicated metadata storage,
partly due to the L2ARC becoming empty upon every export/import
(boot) and needing to get re-heated.
So, here go the highlights of the proposal (up for discussion).
In short, the idea is to use today's format of the blkptr_t,
which by default allows storing up to 3 DVA addresses of the
block, and many types of metadata use only 2 copies (at least
by default). This new feature adds a specially processed
TLVDEV in the common DVA address space of the pool, and
enforces storage of added third copies for certain types
of metadata blocks on these devices. Limited backwards
compatibility is quite possible, and an on-disk format change
may not be required. The proposal also addresses some questions
that arose in previous discussions, especially about proposals
where SSDs would be the only storage for the pool's metadata:
* What if the dedicated metadata device overflows?
* What if the dedicated metadata device breaks?
= okay/expected by design, nothing dies.
In more detail:
1) Add a special Top-Level VDEV (TLVDEV below) device type (like
"cache" and "log" - say, "metaxel" for "metadata accelerator"?),
and allow (even encourage) use of mirrored devices and allow
expansion (raid0, raid10 and/or separate TLVDEVs) with added
singlets/mirrors of such devices.
How the device type is recorded for the pool is open to discussion;
I'd go with a special attribute (array) or nvlist in the pool
descriptor, rather than some special type ID in the ZFS label
(backwards compatibility, see point 4 for detailed rationale).
Also open to discussion: enable pool-wide or per-dataset (i.e. don't
waste accelerator space and lifetime for rarely-reused
datasets like rolling backups)? Choose what to store on
(particular) metaxels - DDT, BPTree, something else?
Overall, this availability of choice is similar to choice
of modes for ARC/L2ARC caching or enabling ZIL per-dataset...
2) These devices should be formally addressable as part of the
pool in DVA terms (tlvdev:offset:size), but writes onto them
are artificially limited by the ZFS scheduler so as to only allow
specific types of metadata blocks (blkptr_t's, DDT entries),
and to also enforce writing of added third copies (for blocks
of metadata with the usual copies=2) onto these devices.
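
To make point 2 a bit more concrete, here is a rough sketch of such a
write filter. It is only a toy model in Python, not ZFS code: the names
Vdev, BlockKind, METAXEL_KINDS and allocate_dvas are all made up for
illustration, and a DVA is modelled as a plain (tlvdev, offset, size)
tuple. The regular copies always go to ordinary TLVDEVs; the extra copy
goes to a metaxel only for the selected block types, and is silently
skipped if the metaxel is full.

```python
from dataclasses import dataclass
from enum import Enum, auto

MAX_DVAS = 3  # a blkptr_t has room for three DVAs


class BlockKind(Enum):
    USER_DATA = auto()
    BLKPTR_TREE = auto()   # blocks that hold block pointers
    DDT = auto()           # dedup table entries
    OTHER_META = auto()


# Assumption: which kinds qualify would be a policy knob (see point 1).
METAXEL_KINDS = {BlockKind.BLKPTR_TREE, BlockKind.DDT}


@dataclass
class Vdev:
    vdev_id: int
    is_metaxel: bool
    free: int              # bytes still unallocated
    next_offset: int = 0

    def alloc(self, size):
        """Toy bump allocator; returns a (tlvdev, offset, size) DVA or None."""
        if size > self.free:
            return None
        dva = (self.vdev_id, self.next_offset, size)
        self.next_offset += size
        self.free -= size
        return dva


def allocate_dvas(vdevs, kind, size, copies):
    """Place `copies` regular copies on ordinary TLVDEVs, plus one extra
    copy on a metaxel when the block kind qualifies and space allows."""
    ordinary = [v for v in vdevs if not v.is_metaxel]
    metaxels = [v for v in vdevs if v.is_metaxel]
    dvas = []
    for i in range(copies):
        # Start each copy on a different ordinary TLVDEV where possible.
        for j in range(len(ordinary)):
            dva = ordinary[(i + j) % len(ordinary)].alloc(size)
            if dva is not None:
                dvas.append(dva)
                break
        else:
            raise RuntimeError("ordinary TLVDEVs are out of space")
    if kind in METAXEL_KINDS and len(dvas) < MAX_DVAS:
        for m in metaxels:
            dva = m.alloc(size)
            if dva is not None:
                dvas.append(dva)   # the accelerated extra copy
                break
        # If no metaxel had room, the block just keeps its HDD copies.
    return dvas
```

Note that user data never lands on the metaxel in this model, whatever
its copies setting is.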
3) Absence or "FAULTEDness" of this device should not be fatal
to the pool, but it may require manual intervention to force
the import. In particular, removal, replacement or resilvering
onto different storage (i.e. migrating to larger SSDs) should
be supported in the design.
Besides experimentation and migration concerns, this approach
should also ease replacement of SSDs used for metadata in case
of their untimely fatal failures. This may be a concern
for many SSD deployments, which are increasingly susceptible to
write wear and ultimate death (at least in the cheaper, bulkier
range, which is a likely component in Home-NAS solutions).
4) For backwards compatibility, older versions of ZFS should
see this device as a normal single-disk or mirror TLVDEV
which contains blocks addressed within the common pool DVA
address-space. This should have no effect for read-only
imports. However, other ZFS releases likely won't respect the
filtering and alignment limitations enforced for the device
normally in this design, and can "contaminate" the device
with other types of blocks (and would refuse to import the
pool if the device is missing/faulted).
5) The ZFS reads should be tweaked to first consult the copy
of metadata blocks on the metadata accelerator device, and
only use spinning rust (ordinary TLVDEVs) if there are
errors (checksum mismatches, missing devices, etc.) or during
scrubs and similar tasks which would require full reads of
the pool's addressed blocks.
Prioritized reads from this metadata accelerator won't need
a special bit in the blkptr_t (as is done for the dedup bit);
the TLVDEV number in the DVA already points to the known
identifier of the TLVDEV, which we know is a metaxel.
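
The read preference from point 5 can be sketched in a few lines. Again
this is a toy model under my own assumptions: DVAs are plain
(tlvdev, offset, size) tuples, metaxel_ids is the set of TLVDEV numbers
known to be metaxels, and read_copy is a hypothetical callback that
returns the block contents or raises OSError on a checksum mismatch or
a missing device.

```python
def read_order(dvas, metaxel_ids):
    """Put copies that live on a metaxel first; keep the rest in order.

    sorted() is stable, and False sorts before True, so DVAs whose
    TLVDEV number is a known metaxel come out in front.
    """
    return sorted(dvas, key=lambda dva: dva[0] not in metaxel_ids)


def read_block(dvas, metaxel_ids, read_copy):
    """Try the accelerated copy first, fall back to the HDD copies."""
    for dva in read_order(dvas, metaxel_ids):
        try:
            return read_copy(dva)
        except OSError:
            continue  # bad checksum or absent device: try the next copy
    raise OSError("all copies of the block failed")
```

No extra state is consulted beyond the TLVDEV number inside each DVA,
which is the point made above. A scrub would bypass this ordering and
read every copy.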
6) ZFS writes onto this storage should take into account
the increased block size (likely 4-8 KB for either current
HDDs or for SSDs) and the subsequent coalescing and pagination
required to reduce SSD wear-out. This might be a tweakable
component of the scheduler, which could be disabled if some
different media is used and this scheduler is not needed
(small-sectored HDDs, DDR, SSDs of the future), but the
default writing mode today should expect SSDs.
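
As an illustration of the coalescing in point 6, here is a minimal
sketch that groups small metadata writes into page-sized batches so
that each batch can be issued as one aligned write. The function name
and the 8192-byte default are my assumptions (the upper end of the
4-8 KB range mentioned above); a real scheduler would also have to
bound how long a partially filled page may wait.

```python
def coalesce(blocks, page_size=8192):
    """Group (block_id, size) records into batches of at most page_size
    bytes, in arrival order. A block larger than a page gets a batch of
    its own."""
    batches = []
    current, used = [], 0
    for block_id, size in blocks:
        if current and used + size > page_size:
            batches.append(current)   # flush the page that is filling up
            current, used = [], 0
        current.append(block_id)
        used += size
    if current:
        batches.append(current)       # flush the last, partial page
    return batches
```

Disabling the feature for media that do not need it would amount to
setting the page size to the device sector size.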
7) A special tool like scrub should be added to walk the pool's
block tree and rewrite the existing block pointers (and I am
not sure this is as problematic as the generic BPRewrite -
if needed, the task can be done once offline, for example).
By definition this is a restartable task (initiating a new
tree walk), so it should be pausable or abortable as well.
As a result, new copies of metadata blocks would be created
on the accelerator device, and the two old copies remain in place.
In case the metaxel TLVDEV has been "contaminated" by other
block types (by virtue of read-write import and usage by
ZFS implementations that do not support this feature), those
should be relocated onto the main HDD pool.
One thing to discuss is: what should be done for metadata
with already existing three copies on HDDs? Should one copy
be disposed of and recreated on the accelerator?
8) If the metaxel devices are filled up, the "overflowing"
metadata blocks may just not get the third copies (only
the standard HDD-based copies are written). If the metaxel
device is freed up (by deletion of data and release of the
block pointers) or expanded (or another one is added), then
another run of the "scrub-like" procedure from point 7 can
add the missing copies.
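
A rough model of the "scrub-like" pass from points 7 and 8 follows.
Everything here is hypothetical: block pointers are plain dicts with a
"dvas" list of (tlvdev, offset, size) tuples and an "is_accel_meta"
flag, and alloc_on_metaxel / alloc_on_hdd are stand-in allocator
callbacks that return a new DVA (or None when the metaxel is full).
The sketch deliberately ignores the hard part, namely that changing a
block pointer means rewriting its parent block as well.

```python
def backfill(bps, metaxel_ids, alloc_on_metaxel, alloc_on_hdd):
    """One pass over the block pointers. Safe to abort and re-run,
    since block pointers that are already in order are left alone."""
    stats = {"added": 0, "relocated": 0, "skipped": 0}
    for bp in bps:
        on_metaxel = [d for d in bp["dvas"] if d[0] in metaxel_ids]
        if not bp["is_accel_meta"]:
            # "Contamination": a block that should not be on the metaxel.
            for d in on_metaxel:
                idx = bp["dvas"].index(d)
                bp["dvas"][idx] = alloc_on_hdd(d[2])
                stats["relocated"] += 1
            continue
        if on_metaxel or len(bp["dvas"]) >= 3:
            # Already accelerated, or three HDD copies exist (what to do
            # with the latter is the open question in point 7).
            continue
        new_dva = alloc_on_metaxel(bp["dvas"][0][2])
        if new_dva is None:
            stats["skipped"] += 1     # metaxel full: retry on a later pass
        else:
            bp["dvas"].append(new_dva)
            stats["added"] += 1
    return stats
```

Running the same pass again after space is freed or a metaxel is added
picks up exactly the block pointers counted as "skipped" before.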
9) Also, the solution should allow discarding and recreating
the copies of block pointers on the accelerator TLVDEV in case
it fails fatally or is replaced by new empty media.
Unlike usual data, where loss of a TLVDEV is considered
fatal to the pool, in this case we know that we have
(redundant) copies of these blocks on other media.
If the new TLVDEV is at least as big as the failed one,
the pre-recorded accelerator tlvdevid:offsets in HDD-based
copies of the block pointers can be used to re-instantiate
the copies on metaxel, just like scrub or in-flight repairs
happen on usual pools (rewriting corrupt blocks in-place
and not having to change the BP tree at all). In this case
the tlvdevid part of DVA can (should) remain unchanged.
For new metaxels smaller than the replaced one, new DVA
allocations might be required. To enforce this and avoid
mixups, the tlvdevid should change to some new unique
number, and the BP tree gets rewritten as in point 7.
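
The decision in point 9 boils down to a size comparison, which could be
sketched like this. The function name, the returned action strings and
the "highest used id plus one" numbering are all my assumptions; the
only thing taken from the proposal is the rule itself.

```python
def plan_replacement(old_id, old_size, new_size, used_ids):
    """Decide how to repopulate a replaced metaxel.

    Returns (action, tlvdev_id):
      - same or larger device: keep the id, so the offsets recorded in
        the HDD copies of the block pointers stay valid and blocks can
        be rewritten in place, much like a scrub repair;
      - smaller device: take a fresh id, so stale DVAs can never be
        mistaken for valid ones, and rewrite the BP tree (point 7).
    """
    if new_size >= old_size:
        return ("rewrite_in_place", old_id)
    return ("rewrite_bp_tree", max(used_ids) + 1)
```

For example, replacing metaxel 2 of a three-TLVDEV pool with a smaller
SSD would yield a new TLVDEV number 3 in this model.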
10) If these metaxel devices are used (and known to be SSDs?)
then (SSD-based) L2ARC caches should not be used for the
metadata blocks readily available from the metaxel.
Guess: This might in fact reduce overheads from use of
dedup, where pushing blocks into L2ARC only halves the
needed RAM footprint. With an SSD metaxel we can just
drop unneeded DDT entries from RAM ARC, and quickly get
them from stable storage when needed.
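
The L2ARC rule from point 10 is a one-line check in the same toy model
(DVAs as (tlvdev, offset, size) tuples; the function name and the
metaxel_is_ssd flag are made up for illustration):

```python
def l2arc_eligible(dvas, metaxel_ids, metaxel_is_ssd=True):
    """A block evicted from RAM ARC is worth pushing into L2ARC only if
    it has no copy on an SSD metaxel already."""
    if not metaxel_is_ssd:
        return True
    return not any(dva[0] in metaxel_ids for dva in dvas)
```

Blocks whose third copy was skipped because the metaxel was full remain
eligible for L2ARC under this rule.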
I hope I covered all or most of what I think on this matter;
discussion (and ultimately open-sourced implementations) are
most welcome ;)
zfs-discuss mailing list