On Aug 24, 2012, at 6:50 AM, Sašo Kiselkov wrote:
> This is something I've been looking into in the code, and my take on your
> proposed points is this:
> 1) This requires many and deep changes across much of ZFS's architecture
> (especially the ability to sustain tlvdev failures).
> 2) Most of this can be achieved (except for cache persistency) by
> implementing ARC space reservations for certain types of data.
I think the simple solution of increasing the default metadata limit above 1/4 of
arc_max will take care of the vast majority of small-system complaints. The
limit is arbitrary and was set well before dedup was delivered.
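For reference, the 1/4 cap is hard-coded in arc_init(); roughly as below
(paraphrased from illumos arc.c, exact code varies by release), and the
zfs_arc_meta_limit tunable already lets an administrator override it, e.g.
from /etc/system:

    /* paraphrased from arc_init(); not verbatim */
    arc_meta_limit = arc_c_max / 4;              /* default: 1/4 of max ARC size */
    if (zfs_arc_meta_limit > 0 && zfs_arc_meta_limit <= arc_c_max)
            arc_meta_limit = zfs_arc_meta_limit; /* administrator override, if set */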
> The latter has the added benefit of spreading load across all ARC and
> L2ARC resources, so your metaxel device never becomes the sole
> bottleneck, and it better embraces the ZFS design philosophy of pooled storage.
> I plan on having a look at implementing cache management policies (which
> would allow for tuning space reservations for metadata/etc. in a
> fine-grained manner without the cruft of having to worry about physical
> cache devices as well).
> On 08/24/2012 03:39 PM, Jim Klimov wrote:
>> Hello all,
>> The idea of dedicated metadata devices (likely SSDs) for ZFS
>> has been generically discussed a number of times on this list,
>> but I don't think I've seen a final proposal that someone would
>> take up for implementation (as a public source code, at least).
>> I'd like to take a liberty of summarizing the ideas I've either
>> seen in discussions or proposed myself on this matter, to see if
>> the overall idea would make sense to gurus of ZFS architecture.
>> So, the assumption was that the performance killer in ZFS, at
>> least on smallish deployments (a few HDDs and an SSD accelerator)
>> like those in Home-NAS types of boxes, was random IO to lots of
>> small metadata blocks scattered across the pool.
It is a bad idea to make massive investments in development and
testing because of an assumption. Build test cases, prove that the
benefits of the investment can outweigh other alternatives, and then proceed.
>> This IMHO includes primarily the block pointer tree
>> and the DDT for those who risked using dedup. I am not sure how
>> frequently other types of metadata (dataset descriptors, etc.)
>> need to be read, or whether occasional reading and caching
>> wouldn't suffice for them.
>> Another idea was that L2ARC caching might not really cut it
>> for metadata in comparison to a dedicated metadata storage,
>> partly due to the L2ARC becoming empty upon every export/import
>> (boot) and needing to get re-heated.
>> So, here are the highlights of the proposal (up for discussion).
>> In short, the idea is to use today's format of the blkptr_t,
>> which can already hold up to 3 DVA addresses for a block, while
>> many types of metadata use only 2 copies (at least by default).
>> This new feature adds a specially processed TLVDEV in the common
>> DVA address space of the pool, and enforces storage of an added
>> third copy for certain types of metadata blocks on these devices.
>> (Limited) backwards compatibility is quite possible, and an
>> on-disk format change may not be required. The proposal also
>> addresses some questions that arose in previous discussions,
>> especially about proposals where SSDs would be the only storage
>> for the pool's metadata:
>> * What if the dedicated metadata device overflows?
>> * What if the dedicated metadata device breaks?
>> = okay/expected by design, nothing dies.
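For context, the on-disk block pointer does indeed reserve three DVA slots
already; a simplified sketch of the relevant fields (paraphrased from
sys/spa.h, not verbatim):

    typedef struct dva {
            uint64_t dva_word[2];   /* packs the vdev id, offset and asize */
    } dva_t;

    typedef struct blkptr {
            dva_t   blk_dva[3];     /* up to three copies; metadata normally uses two */
            /* ... checksum, compression, birth txgs, fill count, etc. ... */
    } blkptr_t;

The vdev id packed into each DVA already identifies the top-level vdev, so a
reader could tell whether a given copy lives on a metaxel without any new
on-disk bit.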
>> In more detail:
>> 1) Add a special Top-Level VDEV (TLVDEV below) device type (like
>> "cache" and "log" - say, "metaxel" for "metadata accelerator"?),
>> and allow (even encourage) use of mirrored devices and allow
>> expansion (raid0, raid10 and/or separate TLVDEVs) with added
>> singlets/mirrors of such devices.
>> Method of device type definition for the pool is discussable;
>> I'd go with a special attribute (array) or nvlist in the pool
>> descriptor, rather than some special type ID in the ZFS label
>> (backwards compatibility - see point 4 for the detailed rationale).
>> Discussable: enable pool-wide or per-dataset (i.e. don't
>> waste accelerator space and lifetime for rarely-reused
>> datasets like rolling backups)? Choose what to store on
>> (particular) metaxels - DDT, BPTree, something else?
>> Overall, this availability of choice is similar to choice
>> of modes for ARC/L2ARC caching or enabling ZIL per-dataset...
>> 2) These devices should be formally addressable as part of the
>> pool in DVA terms (tlvdev:offset:size), but writes onto them
>> are artificially limited by the ZFS scheduler so as to allow only
>> specific types of metadata blocks (blkptr_t's, DDT entries),
>> and also to enforce writing of the added third copies (for
>> metadata blocks with the usual copies=2) onto these devices.
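To make that filtering concrete, here is a rough sketch of the kind of
write-side check point 2 implies; the function and the block-type list are
hypothetical, not existing ZFS code:

    /* Illustrative only: decide whether a block's extra (third) copy
     * may be allocated from the hypothetical metaxel vdev class. */
    static boolean_t
    metaxel_accepts(dmu_object_type_t ot, int level)
    {
            if (level > 0)                  /* indirect blocks, i.e. the BP tree */
                    return (B_TRUE);
            switch (ot) {
            case DMU_OT_DDT_ZAP:            /* dedup table blocks */
            case DMU_OT_DDT_STATS:
                    return (B_TRUE);
            default:
                    return (B_FALSE);
            }
    }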
>> 3) Absence or "FAULTEDness" of this device should not be fatal
>> to the pool, but it may require manual intervention to force
>> the import. Particularly, removal, replacement or resilvering
>> onto different storage (i.e. migrating to larger SSDs) should
>> be supported in the design.
>> Besides experimentation and migration concerns, this approach
>> should also ease replacement of SSDs used for metadata in case
>> of their untimely fatal failures - and this may be a concern
>> for many SSD deployments, which are increasingly susceptible to
>> write wear and ultimate death (at least in the cheaper, bulkier
>> range that is a likely component of Home-NAS solutions).
>> 4) For backwards compatibility, to older versions of ZFS this
>> device should seem like a normal single-disk or mirror TLVDEV
>> which contains blocks addressed within the common pool DVA
>> address-space. This should have no effect for read-only
>> imports. However, other ZFS releases likely won't respect the
>> filtering and alignment limitations enforced for the device
>> normally in this design, and can "contaminate" the device
>> with other types of blocks (and would refuse to import the
>> pool if the device is missing/faulted).
>> 5) The ZFS reads should be tweaked to first consult the copy
>> of metadata blocks on the metadata accelerator device, and
>> only use spinning rust (ordinary TLVDEVs) if there are some
>> errors (checksum mismatches, lacking devices, etc.) or during
>> scrubs and similar tasks which would require full reads of
>> the pool's addressed blocks.
>> Prioritized reads from this metadata accelerator won't need
>> a special bit in the blkptr_t (as is done for the dedup bit) -
>> the TLVDEV number in the DVA already identifies the TLVDEV,
>> which we know is a metaxel.
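A rough sketch of the DVA selection this implies; BP_GET_NDVAS() and
DVA_GET_VDEV() are existing macros, while vdev_is_metaxel() is a hypothetical
predicate:

    /* Illustrative only: prefer the copy whose DVA points at a metaxel
     * top-level vdev; fall back to the other copies on read errors. */
    static int
    pick_dva_to_read(spa_t *spa, const blkptr_t *bp)
    {
            for (int d = 0; d < BP_GET_NDVAS(bp); d++) {
                    uint64_t vd = DVA_GET_VDEV(&bp->blk_dva[d]);
                    if (vdev_is_metaxel(spa, vd))   /* hypothetical check */
                            return (d);             /* read the SSD copy first */
            }
            return (0);                             /* no metaxel copy: default order */
    }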
>> 6) The ZFS writes onto this storage should take into account
>> the increased blocksize (likely 4-8 KB for either current
>> HDDs or for SSDs) and subsequent coalescing and pagination
>> required to reduce SSD wear-out. This might be a tweakable
>> component of the scheduler, which could be disabled if some
>> different media is used and this scheduler is not needed
>> (small-sectored HDDs, DDR, SSDs of the future), but the
>> default writing mode today should expect SSDs.
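Worth noting that the existing per-vdev ashift already provides most of this:
a metaxel vdev created with ashift=12 or 13 (4 KiB or 8 KiB sectors) gets
every allocation padded and aligned to that size. Roughly (P2ROUNDUP and
vdev_ashift are existing names; the line paraphrases the default
psize-to-asize rounding):

    /* minimum on-device allocation for a block of psize bytes on vdev vd */
    uint64_t asize = P2ROUNDUP(psize, 1ULL << vd->vdev_ashift);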
>> 7) A special tool like scrub should be added to walk the pool's
>> block tree and rewrite the existing block pointers (and I am
>> not sure this is as problematic as the generic BPRewrite -
>> if needed, the task can be done once offline, for example).
>> By definition this is a restartable task (initiating a new
>> tree walk), so it should be pausable or abortable as well.
>> As a result, new copies of metadata blocks would be created
>> on the accelerator device, and the two old copies remain in
>> place on the main pool devices.
>> Just in case the metaxel TLVDEV is "contaminated" by other
>> block types (by virtue of read-write import and usage by
>> ZFS implementations not supporting this feature), those should
>> be relocated onto the main HDD pool.
>> One thing to discuss is: what should be done for metadata
>> with already existing three copies on HDDs? Should one copy
>> be disposed of and recreated on the accelerator?
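For illustration, the skeleton of such a pass might look like the following;
every name here is hypothetical, and the hard parts (allocating the new DVA,
rewriting the parent block pointer, keeping birth txgs consistent) are hidden
behind the helpers:

    /* Illustrative only: a scrub-like callback invoked for every block
     * pointer in the pool, adding a metaxel copy where one is missing. */
    static int
    metaxel_fill_cb(spa_t *spa, const blkptr_t *bp, void *arg)
    {
            if (!bp_is_eligible_metadata(bp))       /* hypothetical filter, see point 2 */
                    return (0);
            if (bp_has_metaxel_copy(spa, bp))       /* third copy already on the SSD */
                    return (0);
            return (metaxel_add_copy(spa, bp));     /* allocate DVA[2] and write the copy */
    }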
>> 8) If the metaxel devices are filled up, the "overflowing"
>> metadata blocks may just not get the third copies (only
>> the standard HDD-based copies are written). If the metaxel
>> device is freed up (by deletion of data and release of the
>> block pointers) or expanded (or another one is added), then
>> another run of the "scrub-like" procedure from point 7 can
>> add the missing copies.
>> 9) Also, the solution should allow discarding and recreating the
>> copies of block pointers on the accelerator TLVDEV in case
>> it fails fatally or is replaced by new, empty media.
>> Unlike usual data, where loss of a TLVDEV is considered
>> fatal to the pool, in this case we know we have (redundant)
>> copies of these blocks on other media.
>> If the new TLVDEV is at least as big as the failed one,
>> the pre-recorded accelerator tlvdevid:offsets in HDD-based
>> copies of the block pointers can be used to re-instantiate
>> the copies on metaxel, just like scrub or in-flight repairs
>> happen on usual pools (rewriting corrupt blocks in-place
>> and not having to change the BP tree at all). In this case
>> the tlvdevid part of DVA can (should) remain unchanged.
>> For new metaxels smaller than the replaced one, new DVA
>> allocations might be required. To enforce this and avoid
>> mix-ups, the tlvdevid should change to some new unique
>> number, and the BP tree gets rewritten as in point 7.
>> 10) If these metaxel devices are used (and known to be SSDs?)
>> then (SSD-based) L2ARC caches should not be used for the
>> metadata blocks readily available from the metaxel.
>> Guess: This might in fact reduce overheads from use of
>> dedup, where pushing blocks into L2ARC only halves the
>> needed RAM footprint. With an SSD metaxel we can just
>> drop unneeded DDT entries from RAM ARC, and quickly get
>> them from stable storage when needed.
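A minimal sketch of that exclusion, assuming the same hypothetical
vdev_is_metaxel() predicate as above, as an extra test in the L2ARC
write-eligibility path:

    /* Illustrative only: skip L2ARC caching for blocks that already have
     * a persistent copy on a metaxel vdev. */
    static boolean_t
    l2arc_skip_metaxel(spa_t *spa, const blkptr_t *bp)
    {
            for (int d = 0; d < BP_GET_NDVAS(bp); d++) {
                    if (vdev_is_metaxel(spa, DVA_GET_VDEV(&bp->blk_dva[d])))
                            return (B_TRUE);        /* no point caching it twice */
            }
            return (B_FALSE);
    }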
>> I hope I covered all or most of what I think on this matter;
>> discussion (and ultimately open-source implementations) are
>> most welcome ;)
>> //Jim Klimov