This is something I've been looking into in the code, and my take on your
proposed points is this:
1) This requires many and deep changes across much of ZFS's architecture
(especially the ability to sustain tlvdev failures).
2) Most of this can be achieved (except for cache persistency) by
implementing ARC space reservations for certain types of data.
The latter has the added benefit of spreading load across all ARC and
L2ARC resources, so your metaxel device never becomes the sole
bottleneck, and it better embraces the ZFS design philosophy of pooled
storage.
I plan on having a look at implementing cache management policies (which
would allow for tuning space reservations for metadata/etc. in a
fine-grained manner without the cruft of having to worry about physical
cache devices as well).
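To make the space-reservation idea in point 2 a bit more concrete, here is a
minimal sketch written as a toy Python LRU cache. It is not ARC code and does
not reflect how the ARC is actually structured. The class name ReservedCache,
the "metadata"/"data" labels and the eviction rule are all assumptions made up
for illustration. The rule it shows: a class of data may always evict its own
entries, but other classes may only take space that a class holds beyond its
reservation.

```python
from collections import OrderedDict


class ReservedCache:
    """Toy LRU cache with per-class minimum space reservations.

    Hypothetical illustration only; this is not ZFS ARC code.
    """

    def __init__(self, capacity, reservations):
        # reservations: class name -> bytes that other classes may not evict.
        if sum(reservations.values()) > capacity:
            raise ValueError("reservations exceed cache capacity")
        self.capacity = capacity
        self.reservations = dict(reservations)
        self.entries = OrderedDict()  # key -> (class name, size), LRU first
        self.used = {}                # class name -> bytes currently cached

    def _total(self):
        return sum(self.used.values())

    def _evictable(self, kind, size, incoming_kind):
        # A class may always evict its own entries. Another class may evict
        # an entry only if the owner stays at or above its reservation.
        if kind == incoming_kind:
            return True
        return self.used.get(kind, 0) - size >= self.reservations.get(kind, 0)

    def _remove(self, key):
        kind, size = self.entries.pop(key)
        self.used[kind] -= size

    def get(self, key):
        # Returns True on a hit and marks the entry as most recently used.
        if key not in self.entries:
            return False
        self.entries.move_to_end(key)
        return True

    def put(self, key, kind, size):
        # Returns True if the entry was cached, False if no room could be made.
        if size > self.capacity:
            return False
        if key in self.entries:
            self._remove(key)
        while self._total() + size > self.capacity:
            victim = next(
                (k for k, (c, s) in self.entries.items()
                 if self._evictable(c, s, kind)),
                None,
            )
            if victim is None:
                return False
            self._remove(victim)
        self.entries[key] = (kind, size)
        self.used[kind] = self.used.get(kind, 0) + size
        return True
```

With a 100-byte cache and a 40-byte "metadata" reservation, a stream of "data"
insertions can push out older data entries but leaves the 40 bytes of cached
metadata alone. A real policy would also have to cover L2ARC feeding and
per-dataset tuning, which this sketch ignores.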
On 08/24/2012 03:39 PM, Jim Klimov wrote:
> Hello all,
> The idea of dedicated metadata devices (likely SSDs) for ZFS
> has been generically discussed a number of times on this list,
> but I don't think I've seen a final proposal that someone would
> take up for implementation (as a public source code, at least).
> I'd like to take a liberty of summarizing the ideas I've either
> seen in discussions or proposed myself on this matter, to see if
> the overall idea would make sense to gurus of ZFS architecture.
> So, the assumption was that the performance killer in ZFS at
> least on smallish deployments (few HDDs and an SSD accelerator),
> like those in Home-NAS types of boxes, was random IO to lots of
> metadata. This IMHO includes primarily the block pointer tree
> and the DDT for those who risked using dedup. I am not sure how
> frequent is the required read access to other types of metadata
> (like dataset descriptors, etc.) that the occasional reading and
> caching won't solve.
> Another idea was that L2ARC caching might not really cut it
> for metadata in comparison to a dedicated metadata storage,
> partly due to the L2ARC becoming empty upon every export/import
> (boot) and needing to get re-heated.
> So, here go the highlights of proposal (up for discussion).
> In short, the idea is to use today's format of the blkptr_t
> which by default allows to store up to 3 DVA addresses of the
> block, and many types of metadata use only 2 copies (at least
> by default). This new feature adds a specially processed
> TLVDEV in the common DVA address space of the pool, and
> enforces storage of added third copies for certain types
> of metadata blocks on these devices. (Limited) Backwards
> compatibility is quite possible, on-disk format change may
> be not required. The proposal also addresses some questions
> that arose in previous discussions, especially about proposals
> where SSDs would be the only storage for pool's metadata:
> * What if the dedicated metadata device overflows?
> * What if the dedicated metadata device breaks?
> = okay/expected by design, nothing dies.
> In more detail:
> 1) Add a special Top-Level VDEV (TLVDEV below) device type (like
> "cache" and "log" - say, "metaxel" for "metadata accelerator"?),
> and allow (even encourage) use of mirrored devices and allow
> expansion (raid0, raid10 and/or separate TLVDEVs) with added
> singlets/mirrors of such devices.
> Method of device type definition for the pool is discussable,
> I'd go with a special attribute (array) or nvlist in the pool
> descriptor, rather than some special type ID in the ZFS label
> (backwards compatibility, see point 4 for detailed rationale).
> Discussable: enable pool-wide or per-dataset (i.e. don't
> waste accelerator space and lifetime for rarely-reused
> datasets like rolling backups)? Choose what to store on
> (particular) metaxels - DDT, BPTree, something else?
> Overall, this availability of choice is similar to choice
> of modes for ARC/L2ARC caching or enabling ZIL per-dataset...
> 2) These devices should be formally addressable as part of the
> pool in DVA terms (tlvdev:offset:size), but writes onto them
> are artificially limited by ZFS scheduler so as to only allow
> specific types of metadata blocks (blkptr_t's, DDT entries),
> and also enforce writing of added third copies (for blocks
> of metadata with usual copies=2) onto these devices.
> 3) Absence or "FAULTEDness" of this device should not be fatal
> to the pool, but it may require manual intervention to force
> the import. Particularly, removal, replacement or resilvering
> onto different storage (i.e. migrating to larger SSDs) should
> be supported in the design.
> Beside experimentation and migration concerns, this approach
> should also ease replacement of SSDs used for metadata in case
> of their untimely fatal failures - and this may be a concern
> for many SSD deployments, increasingly susceptible to write
> wearing and ultimate death (at least in the cheaper bulkier
> range, which is a likely component in Home-NAS solutions).
> 4) For backwards compatibility, to older versions of ZFS this
> device should seem like a normal single-disk or mirror TLVDEV
> which contains blocks addressed within the common pool DVA
> address-space. This should have no effect for read-only
> imports. However, other ZFS releases likely won't respect the
> filtering and alignment limitations enforced for the device
> normally in this design, and can "contaminate" the device
> with other types of blocks (and would refuse to import the
> pool if the device is missing/faulted).
> 5) The ZFS reads should be tweaked to first consult the copy
> of metadata blocks on the metadata accelerator device, and
> only use spinning rust (ordinary TLVDEVs) if there are some
> errors (checksum mismatches, lacking devices, etc.) or during
> scrubs and similar tasks which would require full reads of
> the pool's addressed blocks.
> Prioritized reads from this metadata accelerator won't need
> a special bit in the blkptr_t (like is done for deduped-bit) -
> the TLVDEV number in the DVA already points to the known
> identifier of the TLVDEV, which we know is a metaxel.
> 6) The ZFS writes onto this storage should take into account
> the increased blocksize (likely 4-8Kb for either current
> HDDs or for SSDs) and subsequent coalescing and pagination
> required to reduce SSD wear-out. This might be a tweakable
> component of the scheduler, which could be disabled if some
> different media is used and this scheduler is not needed
> (small-sectored HDDs, DDR, SSDs of the future), but the
> default writing mode today should expect SSDs.
> 7) A special tool like scrub should be added to walk the pool's
> block tree and rewrite the existing block pointers (and I am
> not sure this is as problematic as the generic BPRewrite -
> if needed, the task can be done once offline, for example).
> By definition this is a restartable task (initiating a new
> tree walk), so it should be pausable or abortable as well.
> As the result, new copies of metadata blocks would be created
> on the accelerator device, and the two old copies remain in place.
> Just in case the metaxel TLVDEV is "contaminated" by other
> block types (by virtue of read-write import and usage by
> ZFS implementations unsupporting this feature), those should
> be relocated onto the main HDD pool.
> One thing to discuss is: what should be done for metadata
> with already existing three copies on HDDs? Should one copy
> be disposed of and recreated on the accelerator?
> 8) If the metaxel devices are filled up, the "overflowing"
> metadata blocks may just not get the third copies (only
> the standard HDD-based copies are written). If the metaxel
> device is freed up (by deletion of data and release of the
> block pointers) or expanded (or another one is added), then
> another run of the "scrub-like" procedure from point 7 can
> add the missing copies.
> 9) Also, the solution should allow to discard and recreate the
> copies of block pointers on the accelerator TLVDEV in case
> that it fails fatally or is replaced by new empty media.
> Unlike usual data, where loss of a TLVDEV is considered
> fatal to the pool, in this case we are known to have
> (redundant) copies of these blocks on other media.
> If the new TLVDEV is at least as big as the failed one,
> the pre-recorded accelerator tlvdevid:offsets in HDD-based
> copies of the block pointers can be used to re-instantiate
> the copies on metaxel, just like scrub or in-flight repairs
> happen on usual pools (rewriting corrupt blocks in-place
> and not having to change the BP tree at all). In this case
> the tlvdevid part of DVA can (should) remain unchanged.
> For new metaxels smaller than the replaced one new DVA
> allocations might be required. To enforce this and avoid
> mixups, the tlvdevid should change to some new unique
> number, and the BP tree gets rewritten as in point 7.
> 10) If these metaxel devices are used (and known to be SSDs?)
> then (SSD-based) L2ARC caches should not be used for the
> metadata blocks readily available from the metaxel.
> Guess: This might in fact reduce overheads from use of
> dedup, where pushing blocks into L2ARC only halves the
> needed RAM footprint. With an SSD metaxel we can just
> drop unneeded DDT entries from RAM ARC, and quickly get
> them from stable storage when needed.
> I hope I covered all or most of what I think on this matter,
> discussion (and ultimately open-sourced implementations) are
> most welcome ;)
> //Jim Klimov
> zfs-discuss mailing list