Oh man, that's a million-billion points you made. I'll try to run
through each quickly.
On 08/24/2012 05:43 PM, Jim Klimov wrote:
> First of all, thanks for reading and discussing! :)
No problem at all ;)
> 2012-08-24 17:50, Sašo Kiselkov wrote:
>> This is something I've been looking into in the code and my take on your
>> proposed points is this:
>> 1) This requires many and deep changes across much of ZFS's architecture
>> (especially the ability to sustain tlvdev failures).
> I'd trust the expert; on the outside it did not seem like a very
> deep change. At least, if for the first POC tests we leave out the
> rewriting of existing block pointers to store copies of existing
> metadata on an SSD, and the resilience to failures and absence
> of METAXELs.
The initial set of change areas I can identify, even for the
stripped-down version of your proposal, is:
*) implement a new vdev type (mirrored or straight metaxel)
*) integrate all format changes to labels to describe these
*) alter the block allocator strategy so that if there are metaxels
present, we utilize those
*) alter the metadata fetch points (of which there are many) to
preferably fetch from metaxels when possible, or fall back to
*) make sure that the previous two points play nicely with copies=X
The other points you mentioned, i.e. fault-resiliency, block-pointer
rewrite and other stuff, are another mountain of work with an even higher
mountain of testing to be done on all possible combinations.
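
To make the allocator and fetch-path items in the list above a bit more
concrete, here is a toy Python sketch. It is not ZFS code: the names
Vdev, Pool, is_metaxel and the CRC32 checksum are all hypothetical
stand-ins. It assumes the stripped-down POC behaviour discussed here:
every block keeps a copy on a regular top-level vdev, new metadata also
gets a copy on a metaxel, and reads prefer the metaxel but fall back
when the copy is missing or fails its checksum.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Vdev:
    """Hypothetical top-level vdev; blocks maps block id -> (checksum, data)."""
    name: str
    is_metaxel: bool = False
    blocks: dict = field(default_factory=dict)


def checksum(data: bytes) -> int:
    # CRC32 is only a stand-in for a real block checksum.
    return zlib.crc32(data)


class Pool:
    def __init__(self, vdevs):
        self.metaxels = [v for v in vdevs if v.is_metaxel]
        self.regular = [v for v in vdevs if not v.is_metaxel]

    def write(self, block_id: int, data: bytes, is_metadata: bool) -> None:
        record = (checksum(data), data)
        # Every block keeps a copy on a regular top-level vdev.
        self.regular[block_id % len(self.regular)].blocks[block_id] = record
        # New metadata additionally lands on a metaxel, if one is present.
        if is_metadata and self.metaxels:
            self.metaxels[block_id % len(self.metaxels)].blocks[block_id] = record

    def read(self, block_id: int):
        # Prefer metaxels; skip copies that are missing or corrupted.
        for vdev in self.metaxels + self.regular:
            record = vdev.blocks.get(block_id)
            if record is not None and checksum(record[1]) == record[0]:
                return record[1], vdev.name
        raise IOError(f"no valid copy of block {block_id}")
```

Even in this toy form, the write path and the read path both need to
know about metaxels, which is the point about the number of call sites
that would have to change.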
> Basically, for a POC implementation we can just make a regular
> top-level VDEV forced as a single disk or mirror and add some
> hint to describe that it is a METAXEL component of the pool,
> so the ZFS kernel gets some restrictions on what gets written
> there (for new metadata writes) and to prioritize reads (fetch
> metadata from METAXELs, unless there is no copy on a known
> METAXEL or the copy is corrupted).
As noted before, you'll have to go through the code to look for paths
which fetch metadata (mostly the object layer) and replace those with
metaxel-aware calls. That's a lot of work for a POC.
> The POC as outlined would be useful to estimate the benefits and
> impacts of the solution, and like "BP Rewrite", the more advanced
> features might be delayed by a few years - so even the POC would
> easily be the useful solution for many of us, especially if applied
> to new pools from TXG=0.
I wish I had all the time to implement it, but alas, I'm just a zfs n00b
and am not doing this for a living :-)
>> 2) Most of this can be achieved (except for cache persistency) by
>> implementing ARC space reservations for certain types of data.
>> The latter has the added benefit of spreading load across all ARC and
>> L2ARC resources, so your METAXEL device never becomes the sole
>> bottleneck and it better embraces the ZFS design philosophy of pooled
> Well, we already have somewhat non-pooled ZILs and L2ARCs.
Yes, that's because these have vastly different performance properties
from main-pool storage. However, metaxels and cache devices are
essentially the same (many small random reads, infrequent large async
> Or, rather, they are in sub-pools of their own, reserved
> for specific tasks to optimize and speed up the ZFS storage
> subsystem in face of particular problems.
Exactly. The difference between metaxel and cache, however, is cosmetic.
> My proposal does indeed add another sub-pool for another such
> task (and nominally METAXELs are parts of the common pool -
> more than cache and log devices are today), and explicitly
> describes adding several METAXELs or raid10'ing them (thus
> regarding the bottleneck question).
The problem regarding bottlenecking is that you're creating a new
separate island of resources which has very little difference in
performance requirements to cache devices, yet by separating them out
artificially, you're creating a potential scalability barrier.
> On larger systems, this
> metadata storage might be available with a different SAS
> controller on a separate PCI bus, further boosting performance
> and reducing bottlenecks. Unlike L2ARC, METAXELs can be N-way
> mirrored and so instances are available in parallel from
> several controllers and lanes - further boosting IO and
> reliability of metadata operations.
How often do you expect cache devices to fail? I mean we're talking
about a one-off occasional event that doesn't even cause data loss
(only a little bit of performance loss, especially if you use multiple
cache devices). And since you're proposing mirroring metaxels, you are
essentially going to be continuously doing twice the write work for a
50% reduction in read performance from the vdev in case of a device
failure. If you just used both devices as cache, you'd get a 100% speedup
in read AND write performance (and in case you lose one cache device,
you've still got 50% of your cache data available). So to sum up, you're
applying raid to something that doesn't need it.
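
The trade-off above can be put into numbers with a deliberately
idealized model. The functions and dictionary keys below are
hypothetical, and the model assumes identical devices, perfect load
balancing and no controller or bus limits, so treat it as a
back-of-the-envelope sketch only.

```python
def mirrored(n, capacity, read_bw, write_bw):
    """n devices holding identical copies (the mirrored-metaxel case)."""
    return {
        "usable_capacity": capacity,
        "device_writes_per_logical_write": n,
        "read_bw": n * read_bw,
        "write_bw": write_bw,
        "read_bw_after_one_failure": (n - 1) * read_bw,
        "data_surviving_one_failure": 1.0,
    }


def independent(n, capacity, read_bw, write_bw):
    """n devices each caching different data (the plain cache-device case)."""
    return {
        "usable_capacity": n * capacity,
        "device_writes_per_logical_write": 1,
        "read_bw": n * read_bw,
        "write_bw": n * write_bw,
        "read_bw_after_one_failure": (n - 1) * read_bw,
        "data_surviving_one_failure": (n - 1) / n,
    }
```

For two devices, mirrored(2, ...) does two device writes per logical
write and loses half its read bandwidth when one device fails, while
independent(2, ...) doubles usable capacity and write bandwidth and
keeps half of the cached data after a failure.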
> However, unlike L2ARC in general, here we know our "target
> audience" better, so we can do an optimization for a particular
> useful situation: gigabytes worth of data in small portions
> (sized from 512b to 8Kb, IIRC?), quite randomly stored and
> often read in comparison to amount of writes.
L2ARC also knows its target audience well, and it's nearly identical
with what you've described. L2ARC doesn't cache prefetched buffers or
streaming workloads (unless you instruct it to). It's there merely to
serve as a low-latency random-read accelerator.
> Regarding size in particular: with blocks of 128K and BP entries
> of 512b, the minimum overhead for a single copy of BPtree metadata
> is 1/256 (without actually the tree, dataset labels, etc).
Block pointers are actually much smaller. ZFS groups them into gang
blocks when it needs to store several of them and their size is below
SPA_MINBLOCKSIZE (hope I remember that macro's name right).
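
As a rough check of the arithmetic, the sketch below computes the
minimum block-pointer overhead for one pointer per data block. The
function name is hypothetical. The 512-byte figure is the one quoted
above; the 128-byte figure is an assumption (it is, as far as I know,
the size of the on-disk blkptr_t). Neither counts indirect-block
padding, extra copies, dnodes or the DDT.

```python
KIB = 1 << 10
GIB = 1 << 30
TIB = 1 << 40


def bp_overhead(pool_bytes: int, block_size: int, bp_size: int) -> int:
    """Bytes of block pointers needed for one pointer per data block."""
    return pool_bytes // block_size * bp_size


# 512-byte pointers, as assumed in the quoted text: a 1/256 overhead.
quoted = bp_overhead(TIB, 128 * KIB, 512)
# 128-byte pointers (assumption): a 1/1024 overhead.
smaller = bp_overhead(TIB, 128 * KIB, 128)
```

With 128K blocks this gives 4 GiB per TiB of user data for 512-byte
pointers and 1 GiB per TiB for 128-byte pointers. Smaller record sizes
raise both figures proportionally.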
> So for each 1Tb of written ZFS pool userdata we get at least 4Gb
> metadata of just the block pointer tree (likely more in reality).
> For practical Home-NAS pools of about 10Tb this warrants about
> 60Gb (give or take an order of magnitude) on SSD dedicated to
> casual metadata without even a DDT, be it generic L2ARC or an
> optimized METAXEL.
And how is that different from having a cache-sizing policy which selects
how much each data type gets allocated from a single common cache?
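
One way such a policy could look is sketched below. It is an
illustration only, not how the ARC or L2ARC work today: ReservedCache
and its methods are hypothetical names, sizes are counted in entries
rather than bytes, and the policy is a plain LRU that refuses to evict a
data type once it is down to its reserved share.

```python
from collections import OrderedDict


class ReservedCache:
    """Shared LRU cache with a guaranteed minimum number of entries per type."""

    def __init__(self, capacity, reserved):
        # reserved maps a type name to the entry count guaranteed to it.
        assert sum(reserved.values()) <= capacity
        self.capacity = capacity
        self.reserved = reserved
        self.entries = OrderedDict()  # key -> (kind, value), oldest first
        self.usage = {}               # kind -> current entry count

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key][1]

    def put(self, key, kind, value):
        if key in self.entries:
            self._remove(key)
        self.entries[key] = (kind, value)
        self.usage[kind] = self.usage.get(kind, 0) + 1
        while len(self.entries) > self.capacity:
            self._evict()

    def _remove(self, key):
        kind, _ = self.entries.pop(key)
        self.usage[kind] -= 1

    def _evict(self):
        # Evict the oldest entry whose type is above its reservation.
        for key, (kind, _) in self.entries.items():
            if self.usage[kind] > self.reserved.get(kind, 0):
                self._remove(key)
                return
        raise RuntimeError("reservations exceed capacity")
```

With ReservedCache(4, {"metadata": 2}), two metadata entries survive
any amount of later user-data traffic, yet all four slots remain
available to whichever type needs them when metadata is not using its
share.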
> The tradeoffs for dedicating a storage device (or several) to
> this one task are, hopefully: no need for heating up the cache
> every time with gigabytes that are known to be needed again
> and again,
Agree, the persistency would be nice to have, and in fact it might be a
lot easier to implement (I've already thought about how to do this, but
that's a topic for another day).
> even if only to boost weekly scrubs, some RAM ARC
> savings and release of L2ARC to tasks it is more efficient at
> (generic larger blocks).
Scrubs will populate your cache anyways, so only the first one will be
slow, the next one will be much faster. Also, you're wrong if you think
the clientele of l2arc and metaxel would be different - it most likely
> Eliminating many small random IOs to
> spinning rust, we're winning in HDD performance and arguably
> power consumption and vitality (less mechanical overheads and
> delays per overall amount of transferred gigabytes).
No disagreement there.
> There is also relatively large RAM pointer overhead for storing
> small pieces of data (such as metadata blocks sized 1 or few
> sectors) in L2ARC, which I expect to be eliminated by storing
> and using these blocks directly from the pool (on SSD METAXELs),
> having both SSD-fast access to the blocks and no expiration into
> L2ARC and back with inefficiently-sized ARC pointers to remember.
You'd still need to reference metaxel data from ARC, so your savings
would be very small. ZFS already is pretty efficient there.
> I guess METAXEL might indeed be cheaper and faster than L2ARC,
> for this particular use-case (metadata). Also, this way the
> true L2ARC device would be more available to "real" userdata
> which is likely to use larger blocks - improving benefits
> from your L2ARC compression features as well as reducing
> the overhead percentage for ARC pointers; and being a random
> selection of the pool's blocks, the userdata is unpredictable
> for good acceleration by other means (short of a full-SSD pool).
While compression indeed works much better on larger blocks, I hardly
think the proportion of metadata to regular data is significant enough
to warrant taking it out of the compression datastream. At worst it's a
few percent of compression overhead - in fact, my current implementation
of l2arc compression already does a check for block size and refuses to
compress blocks smaller than ~2048 bytes.
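
The size check described above amounts to something like the following
sketch. This is not the actual l2arc compression code: zlib is only a
stand-in compressor, maybe_compress and MIN_COMPRESS_SIZE are
hypothetical names, and the second check (keep the original when
compression does not save space) is an added assumption.

```python
import zlib

MIN_COMPRESS_SIZE = 2048  # assumed value, from the "~2048 bytes" above


def maybe_compress(block: bytes):
    """Return (is_compressed, payload) for a block headed to the cache device."""
    # Small blocks (typically metadata) are stored as they are.
    if len(block) < MIN_COMPRESS_SIZE:
        return False, block
    packed = zlib.compress(block)
    # Assumption: keep the original when compression does not save space.
    if len(packed) >= len(block):
        return False, block
    return True, packed
```

A 512-byte block passes through untouched, so small metadata blocks cost
nothing extra in the compression path.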
> Also, having this bulky amount of bytes (BPTree, DDT) is
> essentially required for fast operation of the overall pool,
> and it is not some unpredictable random set of blocks as is
> expected for usual cacheable data - so why keep reheating it
> into the cache upon every restart (and the greener home-NAS
> users might power down their boxes when not in use, to save
> on power bills, so reheating L2ARC is frequent), needlessly
> wearing it out with writes and anyway chopping this amount
> of bytes from usual L2ARC data and RAM ARC as well.
> The DDT also hopefully won't have the drastic impacts we see
> today on budgeting and/or performance of smaller machines
> (like your HP Microserver with 8Gb RAM tops) with enabled
> dedup, because DDT entries can be quickly re-seeked from
> SSD and not consume RAM while they are expired from ARC as
> today - thus freeing it for more efficient caching of real
> data or for pointers to bulkier userdata in L2ARC.
So essentially that's an argument for l2arc persistency. As I said, it
can be done (and more easily than using metaxels).
> Finally, even scrubs should be faster - beside checking
> integrity of on-HDD copies of metadata blocks, the system
> won't need to read them with slower access times in order
> to find addresses and checksums of the bulk of userdata.
> Thus scrubs are likely to become more sequential and fast
> with little to no special coding to do this boost.
See above. All of this can be solved by cache sizing policies and l2arc
zfs-discuss mailing list