Hello all,

  The idea of dedicated metadata devices (likely SSDs) for ZFS
has been discussed in general terms a number of times on this list,
but I don't think I've seen a final proposal that someone would
take up for implementation (as public source code, at least).

  I'd like to take the liberty of summarizing the ideas I've either
seen in discussions or proposed myself on this matter, to see if
the overall idea would make sense to the gurus of ZFS architecture.

  So, the assumption was that the performance killer in ZFS, at
least on smallish deployments (a few HDDs and an SSD accelerator)
like those in Home-NAS types of boxes, is random IO to lots of
metadata. This IMHO includes primarily the block pointer tree
and the DDT for those who risked using dedup. I am not sure how
often other types of metadata (like dataset descriptors, etc.)
need to be read in ways that occasional reading and caching
won't already cover.

  Another idea was that L2ARC caching might not really cut it
for metadata in comparison to a dedicated metadata storage,
partly due to the L2ARC becoming empty upon every export/import
(boot) and needing to get re-heated.

  So, here are the highlights of the proposal (up for discussion).

In short, the idea is to use today's format of the blkptr_t,
which allows storing up to 3 DVA addresses of the block, while
many types of metadata use only 2 copies (at least by default).
This new feature adds a specially processed TLVDEV in the common
DVA address space of the pool, and enforces storage of added
third copies for certain types of metadata blocks on these
devices. (Limited) backwards compatibility is quite possible,
and an on-disk format change may not be required. The proposal
also addresses some questions that arose in previous discussions,
especially about proposals where SSDs would be the only storage
for the pool's metadata:
* What if the dedicated metadata device overflows?
* What if the dedicated metadata device breaks?
= okay/expected by design, nothing dies.

  In more detail:
1) Add a special Top-Level VDEV (TLVDEV below) device type (like
   "cache" and "log" - say, "metaxel" for "metadata accelerator"?),
   and allow (even encourage) use of mirrored devices and allow
   expansion (raid0, raid10 and/or separate TLVDEVs) with added
   singlets/mirrors of such devices.
   How the device type is defined for the pool is open to
   discussion; I'd go with a special attribute (array) or nvlist
   in the pool descriptor, rather than some special type ID in the
   ZFS label (backwards compatibility, see point 4 for detailed
   rationale).

   Also open to discussion: enable this pool-wide or per-dataset
   (i.e. don't waste accelerator space and lifetime on
   rarely-reused datasets like rolling backups)? Choose what to
   store on (particular) metaxels - DDT, BPTree, something else?
   Overall, this availability of choice is similar to the choice
   of modes for ARC/L2ARC caching or enabling ZIL per-dataset...

2) These devices should be formally addressable as part of the
   pool in DVA terms (tlvdev:offset:size), but writes onto them
   are artificially limited by the ZFS scheduler so as to only
   allow specific types of metadata blocks (blkptr_t's, DDT
   entries), and to enforce writing of added third copies (for
   blocks of metadata with the usual copies=2) onto these devices.
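
As a rough illustration of the placement rule in point 2 (not actual
ZFS code), the Python sketch below decides where the copies of a block
would go. The names BlockKind, METAXEL_KINDS and plan_targets are made
up for this example, and "hdd" and "metaxel" merely stand for the two
classes of TLVDEVs.

```python
from enum import Enum, auto

MAX_DVAS = 3  # a blkptr_t has room for up to three DVAs


class BlockKind(Enum):
    USER_DATA = auto()
    BLOCK_POINTER = auto()   # blocks of the block pointer tree
    DDT_ENTRY = auto()       # dedup table entries
    OTHER_METADATA = auto()


# Hypothetical policy: the only block kinds allowed onto a metaxel.
METAXEL_KINDS = {BlockKind.BLOCK_POINTER, BlockKind.DDT_ENTRY}


def plan_targets(kind, copies, metaxel_present=True):
    """Return the device class for each copy of a block.

    The usual copies always go to the ordinary (HDD) TLVDEVs; one
    extra copy goes to the metaxel when the block kind qualifies
    and the block pointer still has a free DVA slot.
    """
    targets = ["hdd"] * copies
    if metaxel_present and kind in METAXEL_KINDS and copies < MAX_DVAS:
        targets.append("metaxel")
    return targets
```

With this policy a block pointer block written with copies=2 gets two
HDD copies plus one metaxel copy, while user data never lands on the
metaxel.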

3) Absence or "FAULTEDness" of this device should not be fatal
   to the pool, but it may require manual intervention to force
   the import. Particularly, removal, replacement or resilvering
   onto different storage (i.e. migrating to larger SSDs) should
   be supported in the design.
   Beside experimentation and migration concerns, this approach
   should also ease replacement of SSDs used for metadata in case
   of their untimely fatal failures - and this may be a concern
   for many SSD deployments, increasingly susceptible to write
   wearing and ultimate death (at least in the cheaper bulkier
   range, which is a likely component in Home-NAS solutions).

4) For backwards compatibility, this device should look to older
   versions of ZFS like a normal single-disk or mirror TLVDEV
   which contains blocks addressed within the common pool DVA
   address space. This should have no effect on read-only
   imports. However, other ZFS releases likely won't respect the
   filtering and alignment limitations normally enforced for the
   device in this design, and can "contaminate" the device
   with other types of blocks (and would refuse to import the
   pool if the device is missing/faulted).

5) The ZFS reads should be tweaked to first consult the copy
   of metadata blocks on the metadata accelerator device, and
   only use spinning rust (ordinary TLVDEVs) if there are some
   errors (checksum mismatches, lacking devices, etc.) or during
   scrubs and similar tasks which would require full reads of
   the pool's addressed blocks.
   Prioritized reads from this metadata accelerator won't need
   a special bit in the blkptr_t (as is done for the dedup bit) -
   the TLVDEV number in the DVA already points to the known
   identifier of the TLVDEV, which we know is a metaxel.
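
The read preference from point 5 could look roughly like the Python
sketch below. It is only an illustration under my own assumptions: DVA
is a simplified (vdev, offset, size) tuple, metaxel_vdevs is the set of
TLVDEV ids known to be metaxels, and read_copy is a hypothetical
callback that returns the data, or None on a checksum mismatch or a
missing device.

```python
from typing import NamedTuple


class DVA(NamedTuple):
    vdev: int
    offset: int
    size: int


def read_order(dvas, metaxel_vdevs):
    """Put copies that live on a metaxel first, keep the rest in order.

    No extra flag in the block pointer is needed: the vdev id inside
    each DVA is enough to tell which copy sits on the accelerator.
    """
    # sorted() is stable and False sorts before True, so metaxel
    # copies come first and the HDD copies keep their original order.
    return sorted(dvas, key=lambda d: d.vdev not in metaxel_vdevs)


def read_block(dvas, metaxel_vdevs, read_copy):
    """Try the metaxel copy first, then fall back to the other copies."""
    for dva in read_order(dvas, metaxel_vdevs):
        data = read_copy(dva)
        if data is not None:
            return data
    raise OSError("no readable copy of the block")
```

A scrub would not use this shortcut, since it has to read every copy
anyway.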

6) The ZFS writes onto this storage should take into account
   the increased block size (likely 4-8 KB for either current
   HDDs or for SSDs) and the subsequent coalescing and pagination
   required to reduce SSD wear-out. This might be a tweakable
   component of the scheduler, which could be disabled if some
   different media is used and this scheduler is not needed
   (small-sectored HDDs, DDR, SSDs of the future), but the
   default writing mode today should expect SSDs.
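
To make the coalescing in point 6 a bit more concrete, here is a
minimal Python sketch that packs small metadata blocks into page-sized
writes. The function name coalesce, the zero padding and the default
page size of 8192 bytes are assumptions for the example, not a
description of any existing ZFS scheduler.

```python
def coalesce(blocks, page_size=8192):
    """Pack small blocks into full pages so the SSD sees whole-page writes.

    blocks is an iterable of bytes objects; the result is a list of
    bytes objects, each exactly page_size long (zero-padded).
    """
    pages = []
    current = bytearray()
    for block in blocks:
        if len(block) > page_size:
            raise ValueError("block does not fit into one page")
        # Flush the current page when the next block would overflow it.
        if len(current) + len(block) > page_size:
            pages.append(bytes(current.ljust(page_size, b"\0")))
            current = bytearray()
        current += block
    if current:
        pages.append(bytes(current.ljust(page_size, b"\0")))
    return pages
```

For media that do not need this (small-sectored HDDs, DDR), the same
function with page_size set to the sector size degenerates into nearly
plain block writes.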

7) A special tool like scrub should be added to walk the pool's
   block tree and rewrite the existing block pointers (and I am
   not sure this is as problematic as the generic BPRewrite -
   if needed, the task can be done once offline, for example).

   By definition this is a restartable task (initiating a new
   tree walk), so it should be pausable or abortable as well.

   As the result, new copies of metadata blocks would be created
   on the accelerator device, and the two old copies remain in
   place on the HDDs.

   In case the metaxel TLVDEV is "contaminated" by other
   block types (by virtue of read-write import and usage by
   ZFS implementations that do not support this feature), those
   should be relocated onto the main HDD pool.

   One thing to discuss is: what should be done for metadata
   with already existing three copies on HDDs? Should one copy
   be disposed of and recreated on the accelerator?
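
The per-block decision that such a walker would make in point 7 can be
sketched in Python as follows. BlockPtr and rewrite_action are
hypothetical names, the block pointer is reduced to the vdev ids of its
DVAs, and the "undecided" result just marks the open question about
blocks that already have three HDD copies.

```python
from typing import NamedTuple, Tuple


class BlockPtr(NamedTuple):
    is_accel_metadata: bool   # block type selected for the metaxel
    vdevs: Tuple[int, ...]    # vdev ids of the DVAs in use (at most 3)


def rewrite_action(bp, metaxel_vdevs):
    """Decide what the scrub-like pass should do with one block."""
    on_metaxel = any(v in metaxel_vdevs for v in bp.vdevs)
    if not bp.is_accel_metadata:
        # "Contamination" written by an implementation without the feature.
        return "relocate_to_hdd" if on_metaxel else "skip"
    if on_metaxel:
        # Already handled; this makes an aborted pass safe to restart.
        return "skip"
    if len(bp.vdevs) >= 3:
        return "undecided"    # open question: drop one HDD copy or not?
    return "add_metaxel_copy"
```

Because blocks that already have a metaxel copy are skipped, an aborted
pass can simply be started again from the top of the tree.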

8) If the metaxel devices are filled up, the "overflowing"
   metadata blocks may just not get the third copies (only
   the standard HDD-based copies are written). If the metaxel
   device is freed up (by deletion of data and release of the
   block pointers) or expanded (or another one is added), then
   another run of the "scrub-like" procedure from point 7 can
   add the missing copies.
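
The overflow behaviour from point 8 can be illustrated with a toy
allocator in Python. Metaxel, try_alloc and backfill are invented names
and the allocator is a simple bump pointer, so this only shows the
"skip now, add later" logic, not real space management.

```python
class Metaxel:
    """Toy accelerator device with a fixed capacity in bytes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def try_alloc(self, size):
        """Return an offset, or None when the device is full."""
        if self.used + size > self.capacity:
            return None
        offset = self.used
        self.used += size
        return offset


def backfill(blocks, metaxel):
    """Add missing metaxel copies; return how many were added.

    blocks is a list of dicts with the keys "size" and
    "metaxel_offset" (None while the block has no metaxel copy).
    Blocks that do not fit are left with their HDD copies only.
    """
    added = 0
    for blk in blocks:
        if blk["metaxel_offset"] is None:
            offset = metaxel.try_alloc(blk["size"])
            if offset is not None:
                blk["metaxel_offset"] = offset
                added += 1
    return added
```

Running backfill again after the device has been expanded picks up
exactly the blocks that were skipped the first time.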

9) Also, the solution should allow discarding and recreating the
   copies of block pointers on the accelerator TLVDEV in case
   it fails fatally or is replaced by new empty media.

   Unlike usual data, where loss of a TLVDEV is considered
   fatal to the pool, in this case we are known to have
   (redundant) copies of these blocks on other media.

   If the new TLVDEV is at least as big as the failed one,
   the pre-recorded accelerator tlvdevid:offsets in HDD-based
   copies of the block pointers can be used to re-instantiate
   the copies on metaxel, just like scrub or in-flight repairs
   happen on usual pools (rewriting corrupt blocks in-place
   and not having to change the BP tree at all). In this case
   the tlvdevid part of DVA can (should) remain unchanged.

   For new metaxels smaller than the replaced one, new DVA
   allocations might be required. To enforce this and avoid
   mixups, the tlvdevid should change to some new unique
   number, and the BP tree gets rewritten as in point 7.
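
The two replacement cases of point 9 boil down to a small decision,
sketched here in Python. The function plan_replacement, its arguments
and the action names are purely illustrative; sizes are in whatever
unit the caller uses consistently.

```python
def plan_replacement(old_vdev_id, old_size, new_size, unused_vdev_id):
    """Decide how a replaced metaxel gets repopulated.

    Returns a (vdev_id, action) tuple for the new device.
    """
    if new_size >= old_size:
        # Old tlvdevid:offset addresses still fit, so the copies can be
        # re-created in place from the HDD copies, as a scrub would do.
        return (old_vdev_id, "repair_in_place")
    # Smaller device: old offsets may not exist, so use a fresh id and
    # let the scrub-like pass allocate new DVAs and rewrite the BP tree.
    return (unused_vdev_id, "rewrite_bp_tree")
```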

10) If these metaxel devices are used (and known to be SSDs?)
   then (SSD-based) L2ARC caches should not be used for the
   metadata blocks readily available from the metaxel.
   Guess: This might in fact reduce overheads from use of
   dedup, where pushing blocks into L2ARC only halves the
   needed RAM footprint. With an SSD metaxel we can just
   drop unneeded DDT entries from RAM ARC, and quickly get
   them from stable storage when needed.

I hope I covered all or most of what I think on this matter;
discussion (and ultimately open-sourced implementations) is
most welcome ;)

//Jim Klimov

zfs-discuss mailing list
