On 08/25/2012 12:22 AM, Jim Klimov wrote:
> 2012-08-25 0:42, Sašo Kiselkov wrote:
>> Oh man, that's a million-billion points you made. I'll try to run
>> through each quickly.
> 
> Thanks...
> I still do not have the feeling that you've fully got my
> idea, or, alternatively, that I correctly understand ARC :)

Could be I misunderstood you; it's past midnight here...

>>> There is also a relatively large RAM pointer overhead for storing
>>> small pieces of data (such as metadata blocks sized at one or a few
>>> sectors) in L2ARC, which I expect to be eliminated by storing
>>> and using these blocks directly from the pool (on SSD METAXELs),
>>> giving both SSD-fast access to the blocks and no expiration into
>>> L2ARC and back with inefficiently-sized ARC pointers to remember.
> 
> ...And these counter-arguments probably are THE point of deviation:
> 
>> However, metaxels and cache devices are essentially the same
>> (many small random reads, infrequent large async writes).
>> The difference between metaxel and cache is cosmetic.
> 
>> You'd still need to reference metaxel data from ARC, so your savings
>> would be very small. ZFS is already pretty efficient there.
> 
> No, you don't! "Republic credits WON'T do fine!" ;)
> 
> The way I understood ARC (without/before L2ARC), it either caches
> pool blocks or it doesn't. More correctly, there is also a cache
> of ghosts without bulk block data, so we can account for misses
> of recently expired blocks of one of the two categories, and so
> adjust the cache subdivision towards MRU or MFU. Ultimately, those
> ghosts which were not requested also expire away from the cache,
> and no reference to a recently-cached block remains.

Correct so far.
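
(For anyone following along: the ARC tracks each buffer header in one of
a handful of states, and the "ghost" states keep only the header, no
data, purely so the MRU/MFU split can be tuned. A simplified sketch from
memory - the names are approximate, not the literal arc.c definitions:)

typedef enum arc_state_type {
        ARC_STATE_ANON,         /* dirty buffers not yet written to the pool */
        ARC_STATE_MRU,          /* read once recently, data held in RAM */
        ARC_STATE_MRU_GHOST,    /* evicted from MRU; header only, no data */
        ARC_STATE_MFU,          /* read more than once, data held in RAM */
        ARC_STATE_MFU_GHOST,    /* evicted from MFU; header only, no data */
        ARC_STATE_L2C_ONLY      /* data present only on an L2ARC device */
} arc_state_type_t;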

> With L2ARC on the other hand, there is some list of pointers in
> the ARC so it knows which blocks were cached on the SSD - and
> lack of this list upon pool import is in effect the perceived
> emptiness of the L2ARC device. L2ARC's pointers are of comparable
> size to the small metadata blocks,

No, they're not. Here's l2arc_buf_hdr_t, the per-buffer structure held
for buffers which have been moved to L2ARC:

typedef struct l2arc_buf_hdr {
        l2arc_dev_t     *b_dev;         /* the L2ARC device the buffer lives on */
        uint64_t        b_daddr;        /* disk address of the data on that device */
} l2arc_buf_hdr_t;

That's about 16 bytes of overhead per block, or 3.125% if the block's
data is 512 bytes long.

> and *this* consideration IMHO
> makes it much more efficient to use L2ARC with larger cached blocks,
> especially on systems with limited RAM (which effectively limits
> addressable L2ARC size as accounted in amount of blocks), with
> the added benefit that you can compress larger blocks in L2ARC.

The main overhead comes from the arc_buf_hdr_t, which is pretty fat:
around 180 bytes by a first-order approximation, so roughly 200 bytes in
all per ARC + L2ARC entry. At 512 bytes per block this is painfully
inefficient (around 39% overhead). At a 4k average block size, however,
it drops to ~5%, and at a 64k average block size (which is entirely
plausible on an untuned storage pool) it drops to ~0.3% overhead.
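
If you want to check the arithmetic, a trivial throwaway program does it
(the 200-byte figure is my rough estimate from above, not a measured
sizeof):

#include <stdio.h>

int
main(void)
{
        /*
         * ~180 bytes of arc_buf_hdr_t plus ~16 bytes of l2arc_buf_hdr_t,
         * rounded to the 200-byte estimate used above.
         */
        const double hdr_bytes = 200.0;
        const int blksz[] = { 512, 4096, 65536 };
        int i;

        for (i = 0; i < 3; i++)
                printf("%6d-byte blocks: %4.1f%% header overhead\n",
                    blksz[i], 100.0 * hdr_bytes / blksz[i]);
        return (0);
}

which prints roughly 39.1%, 4.9% and 0.3% - the figures quoted above.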

> This way, the *difference* between L2ARC and a METAXEL is that
> the latter is an ordinary pool tlvdev with a specially biased
> read priority and write filter. If a metadata block is read,
> it goes into the ARC. If it expires - then there's a ghost
> for a while and soon there is no memory that this block was
> cached - unlike L2ARC's list of pointers which are just a
> couple of times smaller than the cached block of this type.
> But re-fetching metadata from SSD METAXEL is faster, when
> it is needed again.

As explained above, the RAM savings would be about 9% at best:
sizeof(l2arc_buf_hdr_t) / sizeof(arc_buf_hdr_t) = 16 / 180 = 0.0888...

>> Also, you're wrong if you think the clientele of l2arc and
>> metaxel would be different - it most likely wouldn't.
> 
> This only underscores L2ARC's shortcomings for metadata, the way
> I see them (if they do indeed exist) - in particular, it chews up
> a lot more RAM than it could or should for a mechanism meant to
> increase caching efficiency.

And as I demonstrated above, the savings would be negligible.

> If their clientele is indeed similar, and if metaxels would
> be more efficient for metadata storage, then you might not
> need L2ARC with its overheads, or not as much of it, and
> get a clear win in system resource consumption ;)

Would it be a win? Probably. But the cost-benefit analysis suggests to
me that it simply wouldn't be worth the added hassle.

>> How often do you expect cache devices to fail?
> 
> From what I hear, life expectancy for today's consumer-scale
> devices is short (1-3 years) under heavy writes - and the L2ARC
> would likely exceed a METAXEL's write rates, due to the need to
> write the same metadata into L2ARC time and again, were it not
> for the special throttling that limits L2ARC write bandwidth.

Depending on your workload, L2ARC write throughput tends to get pretty
low once your working dataset has been cached. Remember, the L2ARC only
caches random reads, so think databases, not linear copy operations.
Once it's warmed up, it's pretty much read-only (assuming most of your
working dataset fits in there).
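
(The throttling Jim refers to is the handful of L2ARC feed-thread
tunables in arc.c; I'm quoting their defaults from memory, so take the
exact numbers with a grain of salt:)

uint64_t l2arc_write_max = 8 * 1024 * 1024;     /* max bytes written per feed interval */
uint64_t l2arc_write_boost = 8 * 1024 * 1024;   /* extra allowance while the ARC is still cold */
uint64_t l2arc_headroom = 2;                    /* multiple of write_max scanned for eligible buffers */
uint64_t l2arc_feed_secs = 1;                   /* seconds between feed cycles */
boolean_t l2arc_noprefetch = B_TRUE;            /* don't cache prefetched (streaming) buffers */

The last one is why the L2ARC ends up holding mostly random-read blocks:
prefetched (i.e. streaming) buffers never get fed to it by default.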

>> So to sum up, you're applying raid to something that doesn't
>> need it.
> 
> Well, metadata is kinda important - though here we do add
> a third copy where two previously sufficed. And you're not
> "required" to mirror it. Also, on the other hand, if a METAXEL
> is a top-level vdev without special resilience to its
> failure/absence as described in my first post, then its failure
> would formally be considered a fatal situation and bring down
> the whole pool - unlike problems with L2ARC or ZIL devices,
> which can be ignored at the admin's discretion.

Is doubly-redundant metadata not enough? Remember, if you've lost even a
single top-level vdev, your data is essentially toast, and doubly-redundant
metadata is there to try and save your ass by letting you copy off what
remains readable (by making sure another metadata copy is available
somewhere else). A double-vdev failure is considered a catastrophic pool
failure.
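
(For reference, the write policy already bumps the copy count for
metadata; this is a simplified sketch from memory of the logic in
dmu_write_policy(), not the literal code:)

#define MIN(a, b)       ((a) < (b) ? (a) : (b))

/* Simplified sketch: how many DVA copies (ditto blocks) a block gets. */
static int
block_copies(int copies_prop, int is_metadata, int is_pool_metadata)
{
        int copies = copies_prop;       /* the "copies" property, default 1 */

        if (is_pool_metadata)
                copies = 3;             /* pool-wide (MOS) metadata: triple ditto */
        else if (is_metadata)
                copies++;               /* dataset metadata: one extra ditto copy */

        return (MIN(copies, 3));        /* a block pointer holds at most 3 DVAs */
}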

>> And how is that different to having a cache-sizing policy
>> which selects how much each data type gets allocated from
>> a single common cache?
> ...
>> All of this can be solved by cache sizing policies and
>> l2arc persistency.
> 
> Ultimately, I don't disagree with this point :)
> But I do think that this might not be the optimal solution
> in terms of RAM requirements, coding complexity, etc.
> If you want to store some data long-term - such as is my
> desire to store the metadata - ZFS has mechanisms for that
> in the form of normal VDEVs (or subclassing that into metaxels) ;)

How about we implement L2ARC persistency instead? That's a lot easier
to do and it would allow us to make all/most caches persistent, not just
the metadata cache.
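
Purely to illustrate what I mean by persistency - none of this exists
and all the names below are made up - the rebuild information could live
in a small on-device directory that the ARC scans at import time:

/*
 * HYPOTHETICAL on-SSD layout for persistent L2ARC. Nothing like this is
 * in the code today; the structures and names are invented for illustration.
 */
typedef struct l2arc_dev_phys {
        uint64_t        l2p_magic;      /* marks a persistent L2ARC device */
        uint64_t        l2p_version;    /* on-device format version */
        uint64_t        l2p_dir_daddr;  /* device address of the buffer directory */
        uint64_t        l2p_dir_count;  /* number of directory entries */
} l2arc_dev_phys_t;

typedef struct l2arc_buf_phys {
        uint64_t        l2b_dva[2];     /* DVA of the block in the main pool */
        uint64_t        l2b_birth;      /* birth txg, to validate the entry */
        uint64_t        l2b_daddr;      /* where the data sits on the cache device */
        uint64_t        l2b_size;       /* size of the cached data */
} l2arc_buf_phys_t;

On import, a persistency-aware host would walk the directory and rebuild
the in-RAM l2arc_buf_hdr_t entries instead of starting from an empty
cache.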

>>  *) implement a new vdev type (mirrored or straight metaxel)
>>  *) integrate all format changes to labels to describe these
> 
> One idea in the proposal - though I don't insist on sticking
> to it - is that the metaxel's job is described in the pool
> metadata (i.e. a read-only attribute which can be set during
> tlvdev creation/addition - metaxels:list-of-guids).
> Until the pool is imported, a metaxel looks like a normal
> single-disk/mirrored tlvdev in a normal pool.

Yeah, that would be workable, but the trouble is that when somebody
mounts the pool on an older version, they might allocate non-metadata
blocks there, resulting in an inconsistent metaxel state. That would
make implementing metaxel-failure resilience a lot harder. Plus, you'll
need to propagate information on the data type (metadata vs. normal
data) down to the spa layer - that might not be that hard; I haven't
looked at that code yet.
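
A purely hypothetical sketch of the kind of check that would be needed
at allocation time - every name here is invented, none of it exists in
the current code:

/*
 * HYPOTHETICAL: a metadata-aware allocation filter for a metaxel tlvdev.
 * All names are invented for illustration.
 */
typedef struct alloc_request {
        int     ar_is_metadata;         /* would need to be plumbed down from the DMU */
} alloc_request_t;

typedef struct tlvdev {
        int     tv_is_metaxel;          /* the proposed per-tlvdev metaxel marker */
} tlvdev_t;

static int
tlvdev_accepts(const tlvdev_t *tv, const alloc_request_t *ar)
{
        /* a metaxel only accepts metadata; any other tlvdev accepts anything */
        if (tv->tv_is_metaxel && !ar->ar_is_metadata)
                return (0);
        return (1);
}

An older implementation simply wouldn't have this check, which is how
the inconsistent state described above would arise.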

> This approach can limit the importability of a pool with failed
> metaxels, unless we expect that and try to make sense of the
> other pool devices - essentially until we can decipher the
> nvlist and see that the absent device is a metaxel, so the
> error is deemed not fatal. However, the way I see it, this
> requires no label changes or other incompatible on-disk format
> changes. As long as the metaxel is not faulted, any other ZFS
> implementation (like grub or an older livecd) can import this
> pool and read 1/3 of the metadata faster, on average ;)

Which is why I would propose using cache-sizing policies and possibly
persistent l2arc contents. A persistency-unaware host would simply use
the l2arc device as normal (so backwards compatibility wouldn't be an
issue), while newer hosts could happily coexist alongside it.

>> As noted before, you'll have to go through the code to look for paths
>> which fetch metadata (mostly the object layer) and replace those with
>> metaxel-aware calls. That's a lot of work for a POC.
> 
> Alas, for some years now I've been a lot less of a programmer and
> a lot more of a brainstormer ;) Still, judging from whatever
> experience I have, a working POC with some corners cut might
> be a matter of a week or two of coding... just to see whether the
> expected benefits over L2ARC do exist.
> The full-scale thing, yes, might take months or years even from
> a team of programmers ;)

Code talks ;-)

Cheers,
--
Saso