2012-08-25 0:42, Sašo Kiselkov wrote:
Oh man, that's a million-billion points you made. I'll try to run
through each quickly.
I still do not have the feeling that you've fully got my
idea - or, alternatively, that I correctly understand ARC :)
There is also a relatively large RAM pointer overhead for
storing small pieces of data (such as metadata blocks sized
one or a few sectors) in L2ARC. I expect this to be eliminated
by storing and using these blocks directly from the pool (on
SSD METAXELs), giving SSD-fast access to the blocks with no
expiration into L2ARC and back, and no inefficiently-sized
ARC pointers to remember.
...And these counter-arguments probably are THE point of deviation:
> However, metaxels and cache devices are essentially the same
> (many small random reads, infrequent large async writes).
> The difference between metaxel and cache, however, is cosmetic.
> You'd still need to reference metaxel data from ARC, so your savings
> would be very small. ZFS already is pretty efficient there.
No, you don't! "Republic credits WON'T do fine!" ;)
The way I understand ARC (without/before L2ARC), it either
caches pool blocks or it doesn't. More precisely, there is
also a cache of ghosts without bulk block data, so that misses
of recently expired blocks can be accounted to one of the two
categories and the cache subdivision adjusted towards MRU or
MFU. Ultimately, ghosts which were not requested also expire
from the cache, and no reference to a recently-cached block
remains.
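To illustrate, a minimal sketch in plain C of how I understand
that adaptation (NOT the actual illumos code; arc_p/arc_c just
stand in for the real ARC targets):

#include <stdint.h>
#include <stdbool.h>

/*
 * arc_p is the target size of the MRU half, arc_c the total
 * ARC target. A hit in a ghost list means a block was evicted
 * from that side too early, so that side's target grows at the
 * other's expense. Ghosts themselves expire later, after which
 * no trace of the block remains in RAM.
 */
static uint64_t arc_p, arc_c;

static void
arc_adapt_sketch(bool hit_in_mru_ghost, uint64_t blksz)
{
    if (hit_in_mru_ghost)
        arc_p = (arc_p + blksz > arc_c) ? arc_c : arc_p + blksz;
    else
        arc_p = (arc_p > blksz) ? arc_p - blksz : 0;
}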
With L2ARC, on the other hand, the ARC keeps a list of pointers
so it knows which blocks are cached on the SSD - and the lack
of this list upon pool import is, in effect, the perceived
emptiness of the L2ARC device. L2ARC's pointers are of a size
comparable to the small metadata blocks themselves, and *this*
consideration IMHO makes it much more efficient to use L2ARC
with larger cached blocks, especially on systems with limited
RAM (which effectively limits the addressable L2ARC size, as
accounted in number of blocks), with the added benefit that
you can compress larger blocks in L2ARC.
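A back-of-the-envelope check of that claim, assuming ~200 bytes
of RAM per L2ARC-resident block for an illumos-era header
(substitute real sizeof() values on your platform):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    const double hdr_ram = 200.0;    /* assumed bytes RAM/block */
    const uint64_t sizes[] = { 512, 4096, 131072 };

    for (size_t i = 0; i < sizeof (sizes) / sizeof (sizes[0]); i++)
        printf("%7llu-byte block: RAM overhead %.2f%%\n",
            (unsigned long long)sizes[i],
            100.0 * hdr_ram / sizes[i]);
    return (0);
}
/* ~39% overhead for a 512-byte metadata block vs ~0.15% for 128K. */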
This way, the *difference* between L2ARC and a METAXEL is that
the latter is an ordinary pool tlvdev with a specially biased
read priority and write filter. If a metadata block is read,
it goes into the ARC. If it expires - then there's a ghost
for a while, and soon there is no memory that this block was
ever cached - unlike L2ARC's list of pointers, which are only
a couple of times smaller than a cached block of this type.
But re-fetching metadata from an SSD METAXEL is faster when
it is needed again.
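The read-priority bias could hypothetically look like this
(vdev_is_metaxel() is an assumed helper, not existing code;
the BP/DVA macros are the real ones from sys/spa.h):

#include <sys/spa.h>    /* blkptr_t, BP/DVA macros */

/*
 * If a block pointer carries a copy on a metaxel, prefer that
 * DVA for the read; otherwise fall back to the usual choice.
 */
static int
pick_dva_sketch(spa_t *spa, const blkptr_t *bp)
{
    for (int d = 0; d < BP_GET_NDVAS(bp); d++) {
        if (vdev_is_metaxel(spa, DVA_GET_VDEV(&bp->blk_dva[d])))
            return (d);    /* SSD copy: read this one first */
    }
    return (0);    /* no metaxel copy: default to the first DVA */
}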
> Also, you're wrong if you think the clientele of l2arc and
> metaxel would be different - it most likely wouldn't.
This only underscores L2ARC's shortcomings for metadata, the
way I see them (if they do indeed exist) - in particular, that
it chews up a lot more RAM than a mechanism meant to increase
caching efficiency could or should.
If their clientele is indeed similar, and if metaxels would
be more efficient for metadata storage, then you might not
need L2ARC with its overheads, or not as much of it, and
get a clear win in system resource consumption ;)
> How often do you expect cache devices to fail?
From what I hear, life expectancy of today's consumer-scale
devices under heavy writes is short (1-3 years) - and L2ARC
would likely exceed a METAXEL's write rates, due to the need
to write the same metadata into L2ARC time and again, were it
not for the special throttling that limits L2ARC write bandwidth.
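Just to show the shape of that math - the 30 TB endurance
budget and both sustained write rates below are assumed
numbers, not measurements:

#include <stdio.h>

int
main(void)
{
    const double endurance_tb = 30.0;         /* assumed budget */
    const double rates_mb_s[] = { 8.0, 0.5 }; /* assumed rates */

    for (int i = 0; i < 2; i++) {
        double years = endurance_tb * 1024 * 1024 /
            rates_mb_s[i] / (3600.0 * 24 * 365);
        printf("%.1f MB/s sustained -> ~%.2f years to exhaust\n",
            rates_mb_s[i], years);
    }
    return (0);
}
/* 8 MB/s exhausts the budget in ~0.12 years; 0.5 MB/s lasts ~2. */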
> So to sum up, you're applying raid to something that doesn't
> need it.
Well, metadata is kinda important - though here we do add
a third copy where two previously sufficed. And you're not
"required" to mirror it. On the other hand, if a METAXEL is
a top-level vdev without special resilience to its
failure/absence, as described in my first post, then its
failure would formally be considered a fatal situation and
bring down the whole pool - unlike problems with L2ARC or
ZIL devices, which can be ignored at the admin's discretion.
> And how is that different to having a cache-sizing policy
> which selects how much each data type get allocated from
> a single common cache?
> All of this can be solved by cache sizing policies and
> l2arc persistency.
Ultimately, I don't disagree with this point :)
But I do think that this might not be the optimal solution
in terms of RAM requirements and coding complexity, etc.
If you want to store some data long-term - such as is my
desire to store the metadata - ZFS has mechanisms for that
in the form of normal VDEVs (or subclassing those into
metaxels) ;)
> *) implement a new vdev type (mirrored or straight metaxel)
> *) integrate all format changes to labels to describe these
One idea in the proposal - though I don't insist on sticking
to it - is that the metaxel's job is described in the pool
metadata (i.e. a read-only attribute which can be set during
tlvdev creation/addition - metaxels:list-of-guids).
Until the pool is imported, a metaxel looks like a normal
single-disk/mirrored tlvdev in a normal pool.
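As a sketch, such an attribute could be recorded with stock
libnvpair calls; the "metaxels" key name is this proposal's
own invention, not an existing ZFS pool-config name:

#include <libnvpair.h>

/* Record which top-level vdev GUIDs act as metaxels. */
static int
metaxel_record_sketch(nvlist_t *pool_config, uint64_t *guids,
    uint_t n)
{
    return (nvlist_add_uint64_array(pool_config, "metaxels",
        guids, n));
}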
This approach can limit importability of a pool with failed
metaxels, unless we expect that and try to make sense of the
other pool devices - essentially, until we can decipher the
nvlist and see that the absent device is a metaxel, so the
error is deemed non-fatal. However, the way I see it, this
requires no label changes or other incompatible on-disk format
changes. As long as the metaxel is not faulted, any other ZFS
implementation (like grub or an older livecd) can import this
pool and read about 1/3 of metadata faster, on average.
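The import-time tolerance might then reduce to a check like
this hypothetical sketch (the key name and the downgrade
policy are assumptions, not existing code):

#include <libnvpair.h>
#include <sys/types.h>    /* boolean_t */

/*
 * A missing tlvdev normally faults the pool, but if the
 * config's metaxel list names its GUID, the loss is tolerable
 * because every block it held has replicas elsewhere.
 */
static boolean_t
missing_tlvdev_is_fatal_sketch(nvlist_t *config, uint64_t guid)
{
    uint64_t *mx;
    uint_t n;

    if (nvlist_lookup_uint64_array(config, "metaxels",
        &mx, &n) == 0) {
        for (uint_t i = 0; i < n; i++)
            if (mx[i] == guid)
                return (B_FALSE);    /* metaxel: non-fatal */
    }
    return (B_TRUE);    /* ordinary tlvdev: pool cannot import */
}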
> As noted before, you'll have to go through the code to look for
> paths which fetch metadata (mostly the object layer) and replace
> those with metaxel-aware calls. That's a lot of work for a POC.
Alas, for some years now I've been a lot less of a programmer
and a lot more of a brainstormer ;) Still, judging from whatever
experience I have, a working POC with some corners cut might
be a matter of a week or two of coding... Just to see if the
expected benefits in comparison to L2ARC do exist.
The full-scale thing, yes, might take months or years even
from a team of programmers ;)