Hello all,

The idea of dedicated metadata devices (likely SSDs) for ZFS has been discussed in general terms a number of times on this list, but I don't think I've seen a final proposal that someone would take up for implementation (as publicly available source code, at least).
I'd like to take the liberty of summarizing the ideas I've either seen in discussions or proposed myself on this matter, to see if the overall idea would make sense to gurus of ZFS architecture.

So, the assumption was that the performance killer in ZFS, at least on smallish deployments (a few HDDs and an SSD accelerator) like Home-NAS types of boxes, was random IO to lots of metadata. This IMHO includes primarily the block pointer tree and the DDT for those who risked using dedup. I am not sure how often other types of metadata (like dataset descriptors, etc.) need to be read in ways that occasional reading and caching won't solve.

Another idea was that L2ARC caching might not really cut it for metadata in comparison to dedicated metadata storage, partly because the L2ARC becomes empty upon every export/import (boot) and needs to be re-heated.

So, here go the highlights of the proposal (up for discussion).

In short, the idea is to use today's format of the blkptr_t, which by default allows storing up to 3 DVA addresses of the block, while many types of metadata use only 2 copies (at least by default). The new feature adds a specially processed TLVDEV in the common DVA address space of the pool, and enforces storage of an added third copy for certain types of metadata blocks on these devices. (Limited) backwards compatibility is quite possible; an on-disk format change may not be required.

The proposal also addresses some questions that arose in previous discussions, especially about proposals where SSDs would be the only storage for the pool's metadata:
* What if the dedicated metadata device overflows?
* What if the dedicated metadata device breaks?
= okay/expected by design, nothing dies.
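
To make the "third copy goes to the accelerator, and overflow is harmless" idea a bit more tangible, here is a toy Python model. It is only a sketch under loud assumptions: this is not ZFS code, the names `Vdev`, `allocate` and `place_metadata_block` are made up for illustration, the allocator is a trivial bump allocator, and the regular copies are simply spread round-robin over the HDD vdevs.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MAX_DVAS = 3                 # a blkptr_t has room for up to three DVAs
DVA = Tuple[int, int, int]   # (tlvdev id, offset, size), as in tlvdev:offset:size


@dataclass
class Vdev:
    """Hypothetical top-level vdev with a trivial bump allocator."""
    vdev_id: int
    capacity: int
    used: int = 0

    def allocate(self, size: int) -> Optional[DVA]:
        # Return a DVA on this vdev, or None if the vdev is full.
        if self.used + size > self.capacity:
            return None
        dva = (self.vdev_id, self.used, size)
        self.used += size
        return dva


def place_metadata_block(size: int, hdds: List[Vdev], metaxels: List[Vdev],
                         copies: int = 2) -> List[DVA]:
    """Write the usual copies to HDD vdevs, then try one extra copy on a metaxel."""
    dvas: List[DVA] = []
    for i in range(copies):
        # Regular copies: mandatory, spread round-robin over the HDD vdevs.
        dva = hdds[i % len(hdds)].allocate(size)
        if dva is None:
            raise RuntimeError("pool is out of space")
        dvas.append(dva)
    if len(dvas) < MAX_DVAS:
        # Extra copy: best effort. A full (or absent) metaxel is not an error,
        # the block just keeps its standard HDD-based copies.
        for mx in metaxels:
            extra = mx.allocate(size)
            if extra is not None:
                dvas.append(extra)
                break
    return dvas
```

With two HDD vdevs and a metaxel that has room for only two 4 KB blocks, the first two metadata blocks would get three DVAs each and the third block would quietly end up with the standard two.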
In more detail:

1) Add a special Top-Level VDEV (TLVDEV below) device type (like "cache" and "log" - say, "metaxel" for "metadata accelerator"?). Allow (even encourage) use of mirrored devices, and allow expansion (raid0, raid10 and/or separate TLVDEVs) with added singlets/mirrors of such devices.

The method of defining the device type for the pool is open to discussion; I'd go with a special attribute (array) or nvlist in the pool descriptor, rather than some special type ID in the ZFS label (for backwards compatibility, see point 4 for the detailed rationale).

Also open to discussion: should this be enabled pool-wide or per-dataset (i.e. don't waste accelerator space and lifetime on rarely-reused datasets like rolling backups)? Should one be able to choose what to store on (particular) metaxels - DDT, BPTree, something else? Overall, this availability of choice is similar to the choice of modes for ARC/L2ARC caching or enabling ZIL per-dataset...

2) These devices should be formally addressable as part of the pool in DVA terms (tlvdev:offset:size), but writes onto them are artificially limited by the ZFS scheduler so as to only allow specific types of metadata blocks (blkptr_t's, DDT entries). The scheduler should also enforce writing of the added third copies (for blocks of metadata with the usual copies=2) onto these devices.

3) Absence or "FAULTEDness" of this device should not be fatal to the pool, but it may require manual intervention to force the import. In particular, removal, replacement or resilvering onto different storage (i.e. migrating to larger SSDs) should be supported in the design.

Besides experimentation and migration concerns, this approach should also ease replacement of SSDs used for metadata in case of their untimely fatal failures. This may be a concern for many SSD deployments, which are increasingly susceptible to write wear and ultimate death (at least in the cheaper, bulkier range, which is a likely component of Home-NAS solutions).

4) For backwards compatibility, to older versions of ZFS this device should look like a normal single-disk or mirror TLVDEV which contains blocks addressed within the common pool DVA address space. This should have no effect for read-only imports. However, other ZFS releases likely won't respect the filtering and alignment limitations normally enforced for the device in this design, and can "contaminate" the device with other types of blocks (and would refuse to import the pool if the device is missing/faulted).

5) ZFS reads should be tweaked to first consult the copy of metadata blocks on the metadata accelerator device, and only use spinning rust (ordinary TLVDEVs) if there are errors (checksum mismatches, missing devices, etc.) or during scrubs and similar tasks which require full reads of the pool's addressed blocks.

Prioritized reads from this metadata accelerator won't need a special bit in the blkptr_t (as is done for the dedup bit) - the TLVDEV number in the DVA already points to the known identifier of a TLVDEV, which we know is a metaxel.

6) ZFS writes onto this storage should take into account the increased block size (likely 4-8KB for either current HDDs or for SSDs) and the subsequent coalescing and pagination required to reduce SSD wear-out. This might be a tweakable component of the scheduler, which could be disabled if some different media is used and this scheduling is not needed (small-sectored HDDs, DDR, SSDs of the future), but the default writing mode today should expect SSDs.

7) A special tool like scrub should be added to walk the pool's block tree and rewrite the existing block pointers (and I am not sure this is as problematic as the generic BPRewrite - if needed, the task can be done once offline, for example). By definition this is a restartable task (by initiating a new tree walk), so it should be pausable or abortable as well.
As a result, new copies of metadata blocks would be created on the accelerator device, while the two old copies remain in place.

In case the metaxel TLVDEV has been "contaminated" by other block types (by virtue of read-write import and usage by ZFS implementations that do not support this feature), those blocks should be relocated onto the main HDD pool.

One thing to discuss: what should be done for metadata which already has three copies on HDDs? Should one copy be disposed of and recreated on the accelerator?

8) If the metaxel devices fill up, the "overflowing" metadata blocks may just not get the third copies (only the standard HDD-based copies are written). If space on the metaxel device is freed up (by deletion of data and release of the block pointers), or the device is expanded (or another one is added), then another run of the "scrub-like" procedure from point 7 can add the missing copies.

9) Also, the solution should allow discarding and recreating the copies of block pointers on the accelerator TLVDEV in case it fails fatally or is replaced by new empty media. Unlike usual data, where loss of a TLVDEV is considered fatal to the pool, in this case we know we have (redundant) copies of these blocks on other media.

If the new TLVDEV is at least as big as the failed one, the pre-recorded accelerator tlvdevid:offsets in the HDD-based copies of the block pointers can be used to re-instantiate the copies on the metaxel, just like scrub or in-flight repairs happen on usual pools (rewriting corrupt blocks in place and not having to change the BP tree at all). In this case the tlvdevid part of the DVA can (should) remain unchanged.

For new metaxels smaller than the replaced one, new DVA allocations might be required. To enforce this and avoid mixups, the tlvdevid should change to some new unique number, and the BP tree gets rewritten as in point 7.

10) If these metaxel devices are used (and known to be SSDs?)
then (SSD-based) L2ARC caches should not be used for the metadata blocks readily available from the metaxel.

Guess: This might in fact reduce overheads from the use of dedup, where pushing blocks into L2ARC only halves the needed RAM footprint. With an SSD metaxel we can just drop unneeded DDT entries from the RAM ARC, and quickly get them from stable storage when needed.

I hope I covered all or most of what I think on this matter; discussion (and ultimately open-sourced implementations) is most welcome ;)

HTH,
//Jim Klimov

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
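
As a rough illustration of the read path from point 5 and the "nothing dies" behaviour from points 3 and 9, here is a toy Python model. It is a sketch only and not ZFS code: `order_dvas_for_read`, `read_block` and the `read_copy` callback are hypothetical names, a DVA is modelled as a plain (tlvdev id, offset, size) tuple, and "read failed" (missing device, checksum mismatch) is modelled as the callback returning `None`.

```python
from typing import Callable, List, Optional, Set, Tuple

DVA = Tuple[int, int, int]   # (tlvdev id, offset, size)


def order_dvas_for_read(dvas: List[DVA], metaxel_ids: Set[int],
                        scrubbing: bool = False) -> List[DVA]:
    """Decide in which order the copies of a block are tried."""
    if scrubbing:
        # Scrub-like tasks visit every copy, so no reordering is needed.
        return list(dvas)
    # No special bit in the block pointer is consulted: the tlvdev id in
    # each DVA is enough to tell that a copy lives on a metaxel.
    fast = [d for d in dvas if d[0] in metaxel_ids]
    slow = [d for d in dvas if d[0] not in metaxel_ids]
    return fast + slow


def read_block(dvas: List[DVA], metaxel_ids: Set[int],
               read_copy: Callable[[DVA], Optional[bytes]]) -> bytes:
    """Try the metaxel copy first, fall back to the HDD copies on any error."""
    for dva in order_dvas_for_read(dvas, metaxel_ids):
        data = read_copy(dva)   # None stands for a failed or missing copy
        if data is not None:
            return data
    raise IOError("all copies of the block failed")
```

In this model a faulted or removed metaxel only costs one failed attempt per block before the HDD copies are used, which is the behaviour the proposal aims for; a real implementation would presumably skip known-missing devices up front.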