First of all, thanks for reading and discussing! :)
2012-08-24 17:50, Sašo Kiselkov wrote:
> This is something I've been looking into in the code and my take on your
> proposed points is this:
> 1) This requires many and deep changes across much of ZFS's architecture
> (especially the ability to sustain tlvdev failures).
I'd trust the expert; from the outside it did not seem like a very
deep change. At least, if for the first POC tests we leave out the
rewriting of existing block pointers to store copies of existing
metadata on an SSD, and the resilience to failures and absence
Basically, for a POC implementation we can just take a regular
top-level VDEV, restricted to a single disk or a mirror, and add
some hint to mark it as a METAXEL component of the pool, so that
the ZFS kernel code restricts what gets written there (new
metadata writes) and prioritizes it for reads (fetch metadata
from METAXELs, unless there is no copy on a known METAXEL or
the copy is corrupted).
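As a rough illustration of the policy just described, a minimal
Python sketch could look like the following. All names in it
(MetaCopy, pick_metadata_copy, may_allocate) are made up for this
example and are not ZFS code; it also assumes the checksum result
is already known per copy, while a real implementation would read
a copy first and only then verify it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MetaCopy:
    vdev: str          # name of the top-level vdev holding this copy
    on_metaxel: bool   # True if that vdev carries the (hypothetical) METAXEL hint
    checksum_ok: bool  # simplification: verification result known up front


def pick_metadata_copy(copies: List[MetaCopy]) -> Optional[MetaCopy]:
    """Read side: prefer a valid METAXEL copy, else any other valid copy."""
    valid = [c for c in copies if c.checksum_ok]
    for c in valid:
        if c.on_metaxel:
            return c
    # No usable METAXEL copy: fall back to the regular vdevs, if possible.
    return valid[0] if valid else None


def may_allocate(vdev_is_metaxel: bool, block_is_metadata: bool) -> bool:
    """Write side: a METAXEL accepts metadata only; other vdevs accept anything."""
    return block_is_metadata or not vdev_is_metaxel
```

With one copy on a raidz vdev and one on a METAXEL, the METAXEL
copy is picked; if it fails its checksum, the read falls back to
the raidz copy.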
The POC as outlined would be useful to estimate the benefits and
impacts of the solution, and, as with "BP Rewrite", the more advanced
features might be delayed by a few years - so even the POC could
easily be a useful solution for many of us, especially if applied
to new pools from TXG=0.
"There is nothing as immortal as a temporary solution" ;)
> 2) Most of this can be achieved (except for cache persistency) by
> implementing ARC space reservations for certain types of data.
> The latter has the added benefit of spreading load across all ARC and
> L2ARC resources, so your METAXEL device never becomes the sole
> bottleneck and it better embraces the ZFS design philosophy of pooled
Well, we already have somewhat non-pooled ZILs and L2ARCs.
Or, rather, they are in sub-pools of their own, reserved
for specific tasks to optimize and speed up the ZFS storage
subsystem in the face of particular problems.
My proposal does indeed add another sub-pool for another such
task (and nominally METAXELs are part of the common pool,
more so than cache and log devices are today), and explicitly
describes adding several METAXELs or raid10'ing them (which
addresses the bottleneck question). On larger systems, this
metadata storage might be available with a different SAS
controller on a separate PCI bus, further boosting performance
and reducing bottlenecks. Unlike L2ARC, METAXELs can be N-way
mirrored and so instances are available in parallel from
several controllers and lanes - further boosting IO and
reliability of metadata operations.
However, unlike L2ARC in general, here we know our "target
audience" better, so we can optimize for one particularly
useful situation: gigabytes worth of data in small portions
(sized from 512 bytes to 8 KB, IIRC?), stored quite randomly
and read often relative to the amount of writes.
Regarding size in particular: with blocks of 128 KB and BP entries
of 512 bytes, the minimum overhead for a single copy of BPtree
metadata is 1/256 (and that is without the tree itself, dataset
labels, etc.). So for each 1 TB of userdata written to the ZFS pool
we get at least 4 GB of metadata for just the block pointer tree
(likely more in reality).
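The arithmetic above can be checked with a few lines of Python.
This is only a back-of-the-envelope sketch: the function name is
made up, the 128 KB block size and 512-byte BP entry are simply the
figures used above, and real pools with smaller blocks or extra
metadata copies will need more.

```python
KIB = 1024
GIB = 1024 ** 3
TIB = 1024 ** 4


def bptree_min_bytes(userdata_bytes, block_size=128 * KIB,
                     bp_entry_size=512, copies=1):
    """Lower bound on block-pointer metadata for the given amount of userdata."""
    blocks = -(-userdata_bytes // block_size)  # ceiling division
    return blocks * bp_entry_size * copies


print(bptree_min_bytes(1 * TIB) / GIB)    # 4.0
print(bptree_min_bytes(10 * TIB) / GIB)   # 40.0
```

The ratio is just bp_entry_size / block_size = 512 / 131072 = 1/256,
so the lower bound scales linearly with the amount of userdata.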
For practical Home-NAS pools of about 10 TB this warrants about
60 GB (give or take an order of magnitude) on SSD dedicated to
casual metadata without even a DDT, be it generic L2ARC or an
The tradeoffs for dedicating a storage device (or several) to
this one task are, hopefully: no need for heating up the cache
every time with gigabytes that are known to be needed again
and again, even if only to boost weekly scrubs, some RAM ARC
savings and release of L2ARC to tasks it is more efficient at
(generic larger blocks). By eliminating many small random IOs to
spinning rust, we gain HDD performance and arguably improve
power consumption and drive longevity (less mechanical overhead
and delay per gigabyte transferred overall).
There is also a relatively large RAM pointer overhead for storing
small pieces of data (such as metadata blocks sized one or a few
sectors) in L2ARC, which I expect to be eliminated by storing
and using these blocks directly from the pool (on SSD METAXELs),
having both SSD-fast access to the blocks and no expiration into
L2ARC and back with inefficiently-sized ARC pointers to remember.
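To get a feel for why small blocks are so costly to track in L2ARC,
here is a rough Python sketch. The function name is made up, and
the per-block header size of 200 bytes is purely an assumed
placeholder, not a measured ZFS figure; plug in the real value for
your build.

```python
GIB = 1024 ** 3


def l2arc_header_ram(cached_bytes, block_size, header_bytes):
    """RAM needed to remember every block of cached_bytes held in L2ARC."""
    blocks = cached_bytes // block_size
    return blocks * header_bytes


ASSUMED_HEADER = 200  # bytes per cached block; an assumption for illustration

# 4 GiB of metadata in 512-byte blocks vs. the same amount in 128 KiB blocks
small = l2arc_header_ram(4 * GIB, 512, ASSUMED_HEADER)
large = l2arc_header_ram(4 * GIB, 128 * 1024, ASSUMED_HEADER)
print(small, large)  # 1677721600 6553600
```

Under that assumption the RAM cost is about 39% of the cached data
for 512-byte blocks versus about 0.15% for 128 KiB blocks, which is
the kind of disproportion described above.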
I guess METAXEL might indeed be cheaper and faster than L2ARC,
for this particular use-case (metadata). Also, this way the
true L2ARC device would be more available to "real" userdata
which is likely to use larger blocks - improving benefits
from your L2ARC compression features as well as reducing
the overhead percentage for ARC pointers; and, being a random
selection of the pool's blocks, the userdata is too unpredictable
to accelerate well by other means (short of a full-SSD pool).
Also, having this bulky amount of bytes (BPTree, DDT) is
essentially required for fast operation of the overall pool,
and it is not some unpredictable random set of blocks as is
expected for usual cacheable data. So why keep reheating it
into the cache upon every restart (and the greener home-NAS
users might power down their boxes when not in use, to save
on power bills, so reheating L2ARC is frequent), needlessly
wearing the cache device out with writes and, in any case,
taking this amount of bytes away from usual L2ARC data and
RAM ARC as well?
The DDT also hopefully won't have the drastic impact we see
today on budgeting and/or performance of smaller machines
(like your HP Microserver with 8 GB RAM tops) with dedup
enabled, because DDT entries can be quickly re-read from
SSD and would not consume RAM while expired from ARC, as
they do today - thus freeing it for more efficient caching of
real data or for pointers to bulkier userdata in L2ARC.
Finally, even scrubs should be faster - besides checking the
integrity of on-HDD copies of metadata blocks, the system
won't need to read them with slower access times in order
to find addresses and checksums of the bulk of userdata.
Thus scrubs are likely to become more sequential and faster,
with little to no special coding needed for this boost.
What do you think?
zfs-discuss mailing list