Re: [zfs-discuss] Dedicated metadata devices

2012-08-28 Thread Karl Wagner

On 2012-08-24 14:39, Jim Klimov wrote:

Hello all,

  The idea of dedicated metadata devices (likely SSDs) for ZFS
has been generically discussed a number of times on this list,
but I don't think I've seen a final proposal that someone would
take up for implementation (as a public source code, at least).



Hi

OK, I am not a ZFS dev and have barely even looked at the code, but it 
seems to me that this could be dealt with in an easier and more 
efficient manner by modifying the current L2ARC code to make a persistent 
cache device, and adding the preference mechanism somebody has already 
suggested (e.g. prefer metadata, or prefer specific types of metadata).


My reasoning is as follows:
1) As metadata is already available on the main pool devices, there is 
no need to make this data redundant: it is there for acceleration. In 
the event of a failure, it can just be read directly from the pool, so 
there is no need to write the data twice (as there would be in a mirrored 
'metaxel') or waste the space. This is only my opinion, but it makes 
sense to me. The other option, for me, would be to make it the main 
storage area for metadata, with no requirement to store it on the main 
pool devices beyond needing enough copies: i.e. if you need 2 metadata 
copies but have only one metaxel, store one on there and one in the pool; 
if you need 2 copies and there are 2 metaxels, store them on the 
metaxels, with no pool storage needed (see the sketch after this list).
2) Persistent cache devices and cache policies would bring more 
benefits to the system overall than adding this metaxel: no warming of 
the cache (besides reading in what is stored there on import/boot, so 
let's say accelerated warming) and finer control over what to store in 
the cache. The cache devices could then be tuned on a per-dataset basis 
(and possibly per cache device, so certain data types prefer the cache 
device with the best performance profile for them) to provide the best 
fit for your own unique situation. Possibly even a "keep this dataset in 
cache at all times" option would be useful for less frequently accessed 
but time-critical data (so no more loops cat'ing to /dev/null to keep 
data in cache).
3) This would provide, IMHO, the building blocks for a combined 
cache/log device. This would basically go as follows: You set up, say, a 
pair of persistent cache devices. You then tell ZFS that these can be 
used for ZIL blocks, with something like the copies attribute to tell it 
to ensure redundancy. So it basically builds a ZIL device from blocks 
within the cache as it needs it. It would not be as fast as a dedicated 
log device, but would allow greater efficiency.
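
To illustrate point 1, here is a toy sketch of the placement rule I 
have in mind (plain C with made-up names, not ZFS code, just to make 
the policy concrete):

#include <stdio.h>

/* Toy model: satisfy the required metadata copies from metaxels first,
 * and only fall back to the main pool devices for the remainder. */
static void
place_copies(int copies_needed, int metaxels_available)
{
        int on_metaxel = copies_needed < metaxels_available ?
            copies_needed : metaxels_available;
        int on_pool = copies_needed - on_metaxel;

        printf("copies=%d metaxels=%d -> %d on metaxel(s), %d in main pool\n",
            copies_needed, metaxels_available, on_metaxel, on_pool);
}

int
main(void)
{
        place_copies(2, 1);     /* one copy on the metaxel, one in the pool  */
        place_copies(2, 2);     /* both copies on metaxels, none in the pool */
        return 0;
}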


Point 3 would be for future development, but I believe the benefits of 
cache persistence and policies are enough to make them a priority. I 
believe it would cover what the metaxel is trying to do and more.


The other, simpler, option I could see is a flag which tells ZFS "Keep 
metadata in the cache", which ensures all metadata (where possible) is 
stored in ARC/L2ARC at all times, and possibly forces it to be read in 
on import/boot.
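
(For what it's worth, the closest existing knobs I know of are the 
per-dataset primarycache/secondarycache properties, e.g. 
"zfs set secondarycache=metadata pool/fs" - but those only restrict 
what is eligible to enter the ARC/L2ARC; they don't pin metadata there 
or pre-read it on import, which is what I'm after.)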



Re: [zfs-discuss] Dedicated metadata devices

2012-08-25 Thread Jim Klimov

2012-08-25 15:46, Sašo Kiselkov wrote:

The difference is that when you want to go fetch a block from a metaxel,
you still need some way to reference it. Either you use direct
references (i.e. ARC entries as above), or you use an indirect
mechanism, which means that for each read you will need to walk the
metaxel device, which is slow.


Um... how does your application (or the filesystem during its
metadata traversal) know that it wants to read a certain block?
IMHO, it has the block's address for that (likely known from
a higher-level "parent" block of metadata), and it requests -
"give me L bytes from offset O on tlvdev T", which is the layman
interpretation of DVA.

From what I understand, with ARC-cached blocks, we traverse the
RAM-based cache and find one with the requested DVA; then we
have its data already in RAM and return it to the caller.
If the block is not in ARC (and there's no L2ARC), we can fetch
it from the media using the DVA address(es?) we already know
from the request.
In case of L2ARC there is probably a non-null pointer to the
l2arc_buf_hdr_t, so we can request the block from the L2ARC.

If true, this is no faster than fetching the block from the
same SSD used as a metadata accelerator rather than as an
L2ARC device with a policy (or even without one, as today) -
and in comparison it only wastes RAM on the ARC entries.
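
To make sure we're arguing about the same flow, here's a toy sketch
of the lookup order I mean (plain C with made-up names, not the
actual ZFS code): the caller already holds the DVA from the parent
block pointer, and the ARC hash only decides whether a cached copy
exists.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t vdev, offset; } dva_t;                   /* simplified DVA     */
typedef struct { const void *ram_copy; uint64_t l2_daddr; } hdr_t; /* simplified ARC hdr */

static const void *
fetch_block(const dva_t *dva, const hdr_t *hdr)
{
        if (hdr != NULL && hdr->ram_copy != NULL) {
                puts("ARC hit: data is already in RAM");
                return hdr->ram_copy;
        }
        if (hdr != NULL && hdr->l2_daddr != 0) {
                puts("L2ARC hit: read from the cache SSD at the stored daddr");
                return NULL;    /* the real code would issue the SSD read here */
        }
        printf("miss: read vdev=%llu offset=%llu from the pool\n",
            (unsigned long long)dva->vdev, (unsigned long long)dva->offset);
        return NULL;            /* pool read - metaxel SSD or spinning rust */
}

int
main(void)
{
        dva_t d = { 2, 0x1000 };
        fetch_block(&d, NULL);  /* no ARC header at all: plain pool read by DVA */
        return 0;
}

The point being: in the "miss" branch the read is addressed by the
DVA we already have, whether that DVA lands on rotating media or on
a metaxel SSD.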

BTW, as I see in "struct arc_buf_hdr", it only stores one DVA -
so I guess for blocks with multiple on-disk copies it is possible
to have them cached twice, or does ZFS always enforce storing and
seeking the ARC by a particular DVA of the block (likely DVA[0])?

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/arc.c#433

If we do go with METAXELs as I described/proposed, and prefer
fetching metadata from SSD unless there are errors, then some
care should be taken to use this instance of DVA to reference
cached metadata blocks in the ARC.

//Jim Klimov


Re: [zfs-discuss] Dedicated metadata devices

2012-08-25 Thread Sašo Kiselkov
On 08/25/2012 11:53 AM, Jim Klimov wrote:
>> No they're not, here's l2arc_buf_hdr_t a per-buffer structure 
>> held for
>> buffers which were moved to l2arc:
>>
>> typedef struct l2arc_buf_hdr {
>> l2arc_dev_t *b_dev;
>> uint64_t b_daddr;
>> } l2arc_buf_hdr_t;
>>
>> That's about 16-bytes overhead per block, or 3.125% if the 
>> block's data is 512 bytes long.
>>
>> The main overhead comes from an arc_buf_hdr_t, which is pretty fat,
>> around 180 bytes by a first degree approximation, so in all 
>> around 200
>> bytes per ARC + L2ARC entry. At 512 bytes per block, this is painfully
>> inefficient (around 39% overhead), however, at 4k average block size,
>> this drops to ~5% and at 64k average block size (which is entirely
>> possible on average untuned storage pools) this drops down to ~0.3%
>> overhead.
> 
> So... unless I miscalculated before drinking a morning coffee, for a 512b 
> block
> quickly fetchable from SSD in both L2ARC and METAXEL cases, we have
> roughly these numbers?:
> 1) When it is in RAM, we consume 512+180 bytes (though some ZFS
> slides said that for 1 byte stored we spend 1 byte - i thought this meant zero
> overhead, though I couldn't imagine how... or 100% overhead, also quite
> unimaginable =) )
>  
> 2L) When the block is on L2ARC SSD, we spend 180+16 bytes (though
> discussions about DDT on L2ARC at least, settled on 176 bytes of cache
> metainformation per entry moved off to L2ARC, with the DDT entry's size 
> being around 350 bytes, IIRC).
>  
> 2M) When the block is expired from ARC and is only stored on the pool,
> including the SSD-based copy on a METAXEL, we spend zero RAM to
> reference this block from ARC - because we don't remember it anymore.
> And when needed, we can access it just as fast (right?) as from L2ARC
> on the same media type.
>  
> Where am I wrong, because we seem to dispute over THIS point over 
> several emails, and I'm ready to accept that you've seen the code and 
> I'm the clueless one. So I want to learn, then ;)

The difference is that when you want to go fetch a block from a metaxel,
you still need some way to reference it. Either you use direct
references (i.e. ARC entries as above), or you use an indirect
mechanism, which means that for each read you will need to walk the
metaxel device, which is slow.

Cheers,
--
Saso


Re: [zfs-discuss] Dedicated metadata devices

2012-08-25 Thread Jim Klimov
> No they're not, here's l2arc_buf_hdr_t a per-buffer structure 
> held for
> buffers which were moved to l2arc:
> 
> typedef struct l2arc_buf_hdr {
> l2arc_dev_t *b_dev;
> uint64_t b_daddr;
> } l2arc_buf_hdr_t;
> 
> That's about 16-bytes overhead per block, or 3.125% if the 
> block's data is 512 bytes long.
> 
> The main overhead comes from an arc_buf_hdr_t, which is pretty fat,
> around 180 bytes by a first degree approximation, so in all 
> around 200
> bytes per ARC + L2ARC entry. At 512 bytes per block, this is painfully
> inefficient (around 39% overhead), however, at 4k average block size,
> this drops to ~5% and at 64k average block size (which is entirely
> possible on average untuned storage pools) this drops down to ~0.3%
> overhead.

So... unless I miscalculated before drinking a morning coffee, for a 512b block
quickly fetchable from SSD in both L2ARC and METAXEL cases, we have
roughly these numbers?:
1) When it is in RAM, we consume 512+180 bytes (though some ZFS
slides said that for 1 byte stored we spend 1 byte - i thought this meant zero
overhead, though I couldn't imagine how... or 100% overhead, also quite
unimaginable =) )
 
2L) When the block is on L2ARC SSD, we spend 180+16 bytes (though
discussions about DDT on L2ARC at least, settled on 176 bytes of cache
metainformation per entry moved off to L2ARC, with the DDT entry's size 
being around 350 bytes, IIRC).
 
2M) When the block is expired from ARC and is only stored on the pool,
including the SSD-based copy on a METAXEL, we spend zero RAM to
reference this block from ARC - because we don't remember it anymore.
And when needed, we can access it just as fast (right?) as from L2ARC
on the same media type.
 
Where am I wrong, because we seem to dispute over THIS point over 
several emails, and I'm ready to accept that you've seen the code and 
I'm the clueless one. So I want to learn, then ;)
 
//Jim


Re: [zfs-discuss] Dedicated metadata devices

2012-08-24 Thread Sašo Kiselkov
On 08/25/2012 12:22 AM, Jim Klimov wrote:
> 2012-08-25 0:42, Sašo Kiselkov wrote:
>> Oh man, that's a million-billion points you made. I'll try to run
>> through each quickly.
> 
> Thanks...
> I still do not have the feeling that you've fully got my
> idea, or, alternately, that I correctly understand ARC :)

Could be I misunderstood you, it's past midnight here...

>>> There is also relatively large RAM pointer overhead for storing
>>> small pieces of data (such as metadata blocks sized 1 or few
>>> sectors) in L2ARC, which I expect to be eliminated by storing
>>> and using these blocks directly from the pool (on SSD METAXELs),
>>> having both SSD-fast access to the blocks and no expiration into
>>> L2ARC and back with inefficiently-sized ARC pointers to remember.
> 
> ...And these counter-arguments probably are THE point of deviation:
> 
>> However, metaxels and cache devices are essentially the same
>> (many small random reads, infrequent large async writes).
>> The difference between metaxel and cache, however, is cosmetic.
> 
>> You'd still need to reference metaxel data from ARC, so your savings
>> would be very small. ZFS already is pretty efficient there.
> 
> No, you don't! "Republic credits WON'T do fine!" ;)
> 
> The way I understood ARC (without/before L2ARC), it either caches
> pool blocks or it doesn't. More correctly, there is also a cache
> of ghosts without bulk block data, so we can account for misses
> of recently expired blocks of one of the two categories, and so
> adjust the cache subdivision towards MRU or MFU. Ultimately, those
> ghosts which were not requested, also expire away from the cache,
> and no reference to a recently-cached block remains.

Correct so far.

> With L2ARC on the other hand, there is some list of pointers in
> the ARC so it knows which blocks were cached on the SSD - and
> lack of this list upon pool import is in effect the perceived
> emptiness of the L2ARC device. L2ARC's pointers are of comparable
> size to the small metadata blocks,

No they're not. Here's l2arc_buf_hdr_t, a per-buffer structure held for
buffers which were moved to l2arc:

typedef struct l2arc_buf_hdr {
        l2arc_dev_t     *b_dev;         /* L2ARC device the buffer lives on */
        uint64_t        b_daddr;        /* disk address on that device */
} l2arc_buf_hdr_t;

That's about 16 bytes of overhead per block, or 3.125% if the block's data
is 512 bytes long.

> and *this* consideration IMHO
> makes it much more efficient to use L2ARC with larger cached blocks,
> especially on systems with limited RAM (which effectively limits
> addressable L2ARC size as accounted in amount of blocks), with
> the added benefit that you can compress larger blocks in L2ARC.

The main overhead comes from an arc_buf_hdr_t, which is pretty fat,
around 180 bytes by a first degree approximation, so in all around 200
bytes per ARC + L2ARC entry. At 512 bytes per block, this is painfully
inefficient (around 39% overhead), however, at 4k average block size,
this drops to ~5% and at 64k average block size (which is entirely
possible on average untuned storage pools) this drops down to ~0.3%
overhead.
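
A quick back-of-the-envelope check of those percentages (toy
arithmetic only, not taken from the ZFS source; the 200-byte figure
is my rough estimate from above):

#include <stdio.h>

int
main(void)
{
        const double l2hdr = 16.0;      /* l2arc_buf_hdr_t, roughly          */
        const double hdr = 200.0;       /* approx. arc_buf_hdr_t + l2arc hdr */
        const double bs[] = { 512.0, 4096.0, 65536.0 };

        printf("l2 header alone on a 512B block: %.3f%%\n",
            100.0 * l2hdr / 512.0);
        for (int i = 0; i < 3; i++)
                printf("%7.0fB block: %.1f%% total header overhead\n",
                    bs[i], 100.0 * hdr / bs[i]);
        return 0;
}

This prints 3.125%, then ~39.1%, ~4.9% and ~0.3% - matching the
figures above.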

> This way, the *difference* between L2ARC and a METAXEL is that
> the latter is an ordinary pool tlvdev with a specially biased
> read priority and write filter. If a metadata block is read,
> it goes into the ARC. If it expires - then there's a ghost
> for a while and soon there is no memory that this block was
> cached - unlike L2ARC's list of pointers which are just a
> couple of times smaller than the cached block of this type.
> But re-fetching metadata from SSD METAXEL is faster, when
> it is needed again.

As explained above, the difference would be about 9% at best:
sizeof(l2arc_buf_hdr_t) / sizeof(arc_buf_hdr_t) = 0.0888...

>> Also, you're wrong if you think the clientele of l2arc and
>> metaxel would be different - it most likely wouldn't.
> 
> This only stresses the problem with L2ARC's shortcomings for
> metadata, the way I see them (if they do indeed exist), and
> in particular chews your RAM a lot more than it could or
> should, being a mechanism to increase caching efficiency.

And as I demonstrated above, the savings would be negligible.

> If their clientele is indeed similar, and if metaxels would
> be more efficient for metadata storage, then you might not
> need L2ARC with its overheads, or not as much of it, and
> get a clear win in system resource consumption ;)

Would it be a win? Probably. But the cost-benefit analysis suggests to
me that it would probably simply not be worth the added hassle.

>> How often do you expect cache devices to fail?
> 
> From what I hear, life expectancy for today's consumer-scale
> devices is small (1-3 years) for heavy writes - at which the
> L2ARC would likely exceed METAXEL's write rates, due to the
> need to write the same metadata into L2ARC time and again,
> if it were not for the special throttling to limit L2ARC
> write bandwidth.

Depending on your workload, l2arc write throughput tends to get pretty
low once you've cached in your working data set.

Re: [zfs-discuss] Dedicated metadata devices

2012-08-24 Thread Jim Klimov

2012-08-25 0:42, Sašo Kiselkov wrote:

Oh man, that's a million-billion points you made. I'll try to run
through each quickly.


Thanks...
I still do not have the feeling that you've fully got my
idea, or, alternately, that I correctly understand ARC :)


There is also relatively large RAM pointer overhead for storing
small pieces of data (such as metadata blocks sized 1 or few
sectors) in L2ARC, which I expect to be eliminated by storing
and using these blocks directly from the pool (on SSD METAXELs),
having both SSD-fast access to the blocks and no expiration into
L2ARC and back with inefficiently-sized ARC pointers to remember.


...And these counter-arguments probably are THE point of deviation:


However, metaxels and cache devices are essentially the same

> (many small random reads, infrequent large async writes).
> The difference between metaxel and cache, however, is cosmetic.


You'd still need to reference metaxel data from ARC, so your savings
would be very small. ZFS already is pretty efficient there.


No, you don't! "Republic credits WON'T do fine!" ;)

The way I understood ARC (without/before L2ARC), it either caches
pool blocks or it doesn't. More correctly, there is also a cache
of ghosts without bulk block data, so we can account for misses
of recently expired blocks of one of the two categories, and so
adjust the cache subdivision towards MRU or MFU. Ultimately, those
ghosts which were not requested, also expire away from the cache,
and no reference to a recently-cached block remains.

With L2ARC on the other hand, there is some list of pointers in
the ARC so it knows which blocks were cached on the SSD - and
lack of this list upon pool import is in effect the perceived
emptiness of the L2ARC device. L2ARC's pointers are of comparable
size to the small metadata blocks, and *this* consideration IMHO
makes it much more efficient to use L2ARC with larger cached blocks,
especially on systems with limited RAM (which effectively limits
addressable L2ARC size as accounted in amount of blocks), with
the added benefit that you can compress larger blocks in L2ARC.

This way, the *difference* between L2ARC and a METAXEL is that
the latter is an ordinary pool tlvdev with a specially biased
read priority and write filter. If a metadata block is read,
it goes into the ARC. If it expires - then there's a ghost
for a while and soon there is no memory that this block was
cached - unlike L2ARC's list of pointers which are just a
couple of times smaller than the cached block of this type.
But re-fetching metadata from SSD METAXEL is faster, when
it is needed again.

> Also, you're wrong if you think the clientele of l2arc and
> metaxel would be different - it most likely wouldn't.

This only underscores L2ARC's shortcomings for metadata, as
I see them (if they do indeed exist) - in particular, that it
chews up a lot more of your RAM than a mechanism meant to
increase caching efficiency could or should.

If their clientele is indeed similar, and if metaxels would
be more efficient for metadata storage, then you might not
need L2ARC with its overheads, or not as much of it, and
get a clear win in system resource consumption ;)

> How often do you expect cache devices to fail?

From what I hear, life expectancy for today's consumer-scale
devices is short (1-3 years) under heavy writes - a regime in
which the L2ARC would likely exceed a METAXEL's write rates,
due to the need to write the same metadata into L2ARC time and
again, were it not for the special throttling that limits L2ARC
write bandwidth.

> So to sum up, you're applying raid to something that doesn't
> need it.

Well, metadata is kinda important - though here we do add
a third copy where two previously sufficed. And you're not
"required" to mirror it. Also, on the other hand,
if a METAXEL is a top-level vdev without special resilience
to its failure/absence as described in my first post, then
its failure would formally be considered a fatal situation
and bring down the whole pool - unlike problems with L2ARC
or ZIL devices, which can be ignored at admin's discretion.

> And how is that different to having a cache-sizing policy
> which selects how much each data type get allocated from
> a single common cache?
...
> All of this can be solved by cache sizing policies and
> l2arc persistency.

Ultimately, I don't disagree with this point :)
But I do think that this might not be the optimal solution
in terms of RAM requirements and coding complexity, etc.
If you want to store some data long-term, such as is my
desire to store the metadata - ZFS has mechanisms for that
in ways of normal VDEVs (or subclassing that into metaxels) ;)


 *) implement a new vdev type (mirrored or straight metaxel)
 *) integrate all format changes to labels to describe these


One idea in the proposal - though I don't insist on sticking
to it - is that the metaxel's job is described in the pool
metadata (i.e. a readonly attribute which can be set during
tlvdev de

Re: [zfs-discuss] Dedicated metadata devices

2012-08-24 Thread Sašo Kiselkov
Oh man, that's a million-billion points you made. I'll try to run
through each quickly.

On 08/24/2012 05:43 PM, Jim Klimov wrote:
> First of all, thanks for reading and discussing! :)

No problem at all ;)

> 2012-08-24 17:50, Sašo Kiselkov wrote:
>> This is something I've been looking into in the code and my take on your
>> proposed points is this:
>>
>> 1) This requires many and deep changes across much of ZFS's architecture
>> (especially the ability to sustain tlvdev failures).
> 
> I'd trust the expert; on the outside it did not seem as a very
> deep change. At least, if for the first POC tests we leave out the
> rewriting of existing block pointers to store copies of existing
> metadata on an SSD, and the resilience to failures and absence
> of METAXELs.

The initial set of change areas I can identify, even for the stripped-down
version of your proposal, is:

 *) implement a new vdev type (mirrored or straight metaxel)
 *) integrate all format changes to labels to describe these
 *) alter the block allocator strategy so that if there are metaxels
present, we utilize those
 *) alter the metadata fetch points (of which there are many) to
preferably fetch from metaxels when possible, or fall back to
main-pool copies
 *) make sure that the previous two points play nicely with copies=X

The other points you mentioned - fault-resiliency, block-pointer
rewrite and other stuff - are another mountain of work with an even
higher mountain of testing to be done on all possible combinations.

> Basically, for a POC implementation we can just make a regular
> top-level VDEV forced as a single disk or mirror and add some
> hint to describe that it is a METAXEL component of the pool,
> so the ZFS kernel gets some restrictions on what gets written
> there (for new metadata writes) and to prioritize reads (fetch
> metadata from METAXELs, unless there is no copy on a known
> METAXEL or the copy is corrupted).

As noted before, you'll have to go through the code to look for paths
which fetch metadata (mostly the object layer) and replace those with
metaxel-aware calls. That's a lot of work for a POC.

> The POC as outlined would be useful to estimate the benefits and
> impacts of the solution, and like "BP Rewrite", the more advanced
> features might be delayed by a few years - so even the POC would
> easily be the useful solution for many of us, especially if applied
> to new pools from TXG=0.

I wish I had all the time to implement it, but alas, I'm just a zfs n00b
and am not doing this for a living :-)

>> 2) Most of this can be achieved (except for cache persistency) by
>> implementing ARC space reservations for certain types of data.
>>
>> The latter has the added benefit of spreading load across all ARC and
>> L2ARC resources, so your METAXEL device never becomes the sole
>> bottleneck and it better embraces the ZFS design philosophy of pooled
>> storage.
> 
> Well, we already have somewhat non-pooled ZILs and L2ARCs.

Yes, that's because these have vastly different performance properties
from main-pool storage. However, metaxels and cache devices are
essentially the same (many small random reads, infrequent large async
writes).

> Or, rather, they are in sub-pools of their own, reserved
> for specific tasks to optimize and speed up the ZFS storage
> subsystem in face of particular problems.

Exactly. The difference between metaxel and cache, however, is cosmetic.

> My proposal does indeed add another sub-pool for another such
> task (and nominally METAXELs are parts of the common pool -
> more than cache and log devices are today), and explicitly
> describes adding several METAXELs or raid10'ing them (thus
> regarding the bottleneck question).

The problem regarding bottlenecking is that you're creating a new
separate island of resources which has very little difference in
performance requirements to cache devices, yet by separating them out
artificially, you're creating a potential scalability barrier.

> On larger systems, this
> metadata storage might be available with a different SAS
> controller on a separate PCI bus, further boosting performance
> and reducing bottlenecks. Unlike L2ARC, METAXELs can be N-way
> mirrored and so instances are available in parallel from
> several controllers and lanes - further boosting IO and
> reliability of metadata operations.

How often do you expect cache devices to fail? I mean we're talking
about a one-off occasional event that doesn't even present data loss
(only a little bit of performance loss, especially if you use multiple
cache devices). And since you're proposing mirroring metaxels, you are
essentially going to be continuously doing twice the write work for a
50% reduction in read performance from the vdev in case of a device
failure. If you just used both devices as cache, you'll get 100% speedup
in read AND write performance (in case you lose one cache device, you've
still got 50% of your cache data available). So to sum up, you're
applying raid to something that doesn't need it.

Re: [zfs-discuss] Dedicated metadata devices

2012-08-24 Thread Jim Klimov

2012-08-24 17:39, Jim Klimov wrote:

Hello all,

   The idea of dedicated metadata devices (likely SSDs) for ZFS
has been generically discussed a number of times on this list,
but I don't think I've seen a final proposal that someone would
take up for implementation (as a public source code, at least).



Hmmm... now that I think of it, this solution might also eat some
of the cake for dedicated ZIL devices: having an SSD with in-pool
metadata, we can send sync writes (for metadata blocks) straight
to the METAXEL SSD, while the TXG sync would flush out their HDD-
based counterparts and other (async) data. In case of pool import
after breakage we just need to repair the last uncommitted TXG's
worth of recorded metadata on METAXEL...

Also, I now think that directories, ACLs and such (POSIX fs layer
metadata) should have a copy on the METAXEL SSDs too.

This certainly does not replace the ZIL in general, but unlike the
rolling "write a lot, read never" ZIL approach, this would actually
write the needed data onto the pool with low latency and (hopefully)
won't abuse the flash cells needlessly.

Expert opinion and/or tests could confirm my guess that this could
provide a "good enough" boost for some modes of sync writes (NFS,
maybe?) that are heavier on metadata updates than on userdata, so
that tight-on-budget deployments considering a costly dedicated ZIL
device might no longer require one, or not need it as badly.


Is there any sanity to this? ;)
Thanks,
//Jim Klimov


Re: [zfs-discuss] Dedicated metadata devices

2012-08-24 Thread Jim Klimov

First of all, thanks for reading and discussing! :)

2012-08-24 17:50, Sašo Kiselkov wrote:

This is something I've been looking into in the code and my take on your
proposed points is this:

1) This requires many and deep changes across much of ZFS's architecture
(especially the ability to sustain tlvdev failures).


I'd trust the expert; from the outside it did not seem like a very
deep change - at least not if, for the first POC tests, we leave out
the rewriting of existing block pointers to store copies of existing
metadata on an SSD, and the resilience to failures and absence
of METAXELs.

Basically, for a POC implementation we can just make a regular
top-level VDEV forced as a single disk or mirror and add some
hint to describe that it is a METAXEL component of the pool,
so the ZFS kernel gets some restrictions on what gets written
there (for new metadata writes) and to prioritize reads (fetch
metadata from METAXELs, unless there is no copy on a known
METAXEL or the copy is corrupted).
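
To be concrete, the read-priority part might look roughly like this
(a toy sketch with invented names and a hard-coded metaxel check,
not actual ZFS code):

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define DVAS_PER_BP 3

typedef struct { uint64_t vdev_id; uint64_t offset; bool valid; } dva_t;
typedef struct { dva_t dva[DVAS_PER_BP]; } bp_t;

/* Hypothetical predicate: is this top-level vdev flagged as a metaxel?
 * (hard-coded here; the real hint would live in the pool metadata) */
static bool
vdev_is_metaxel(uint64_t vdev_id)
{
        return vdev_id == 2;
}

/* Prefer a copy on a metaxel SSD; fall back to any other valid copy. */
const dva_t *
pick_read_dva(const bp_t *bp)
{
        for (int i = 0; i < DVAS_PER_BP; i++)
                if (bp->dva[i].valid && vdev_is_metaxel(bp->dva[i].vdev_id))
                        return &bp->dva[i];
        for (int i = 0; i < DVAS_PER_BP; i++)
                if (bp->dva[i].valid)
                        return &bp->dva[i];
        return NULL;
}

The checksum verification that already happens on every read would
then catch a corrupted metaxel copy and trigger the fallback to the
in-pool copies.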

The POC as outlined would be useful to estimate the benefits and
impacts of the solution, and like "BP Rewrite", the more advanced
features might be delayed by a few years - so even the POC would
easily be the useful solution for many of us, especially if applied
to new pools from TXG=0.

"There is nothing as immortal as a temporary solution" ;)



2) Most of this can be achieved (except for cache persistency) by
implementing ARC space reservations for certain types of data.

The latter has the added benefit of spreading load across all ARC and
L2ARC resources, so your METAXEL device never becomes the sole
bottleneck and it better embraces the ZFS design philosophy of pooled
storage.


Well, we already have somewhat non-pooled ZILs and L2ARCs.
Or, rather, they are in sub-pools of their own, reserved
for specific tasks to optimize and speed up the ZFS storage
subsystem in face of particular problems.

My proposal does indeed add another sub-pool for another such
task (and nominally METAXELs are parts of the common pool -
more than cache and log devices are today), and explicitly
describes adding several METAXELs or raid10'ing them (thus
regarding the bottleneck question). On larger systems, this
metadata storage might be available with a different SAS
controller on a separate PCI bus, further boosting performance
and reducing bottlenecks. Unlike L2ARC, METAXELs can be N-way
mirrored and so instances are available in parallel from
several controllers and lanes - further boosting IO and
reliability of metadata operations.

However, unlike L2ARC in general, here we know our "target
audience" better, so we can optimize for a particular useful
situation: gigabytes worth of data in small portions (sized
from 512b to 8Kb, IIRC?), stored quite randomly and read far
more often than they are written.
Regarding size in particular: with 128K blocks and BP entries
of 512b, the minimum overhead for a single copy of the BPtree
metadata is 1/256 (not counting the higher levels of the tree,
dataset labels, etc.). So for each 1Tb of written ZFS pool
userdata we get at least 4Gb of block-pointer-tree metadata
alone (likely more in reality). For practical Home-NAS pools
of about 10Tb this warrants about 60Gb (give or take an order
of magnitude) of SSD dedicated to casual metadata, even without
a DDT - be it generic L2ARC or an optimized METAXEL.
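
Toy arithmetic to back that up (my own estimate, not derived from
the ZFS source - real pools add further indirect levels, labels,
DDT, etc.):

#include <stdio.h>

int
main(void)
{
        /* assumption from above: one 512b BP entry per 128K data block */
        double ratio = 512.0 / (128.0 * 1024.0);

        printf("metadata fraction:    1/%.0f\n", 1.0 / ratio);
        printf("per 1 Tb of userdata: %.0f Gb\n", 1024.0 * ratio);
        printf("per 10 Tb pool:       %.0f Gb\n", 10240.0 * ratio);
        return 0;
}

That is roughly 4Gb per 1Tb and 40Gb per 10Tb for one copy of the
BP tree alone, which lands in the same ballpark as the 60Gb guess
above once the rest of the metadata is added.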

The hoped-for payoff of dedicating a storage device (or several)
to this one task: no need to re-heat the cache every time with
gigabytes that are known to be needed again and again (even if
only to boost weekly scrubs), some RAM/ARC savings, and freeing
the L2ARC for tasks it is more efficient at (generic larger
blocks). By eliminating many small random IOs to spinning rust,
we also win in HDD performance and arguably in power consumption
and longevity (less mechanical overhead and fewer delays per
gigabyte transferred).

There is also relatively large RAM pointer overhead for storing
small pieces of data (such as metadata blocks sized 1 or few
sectors) in L2ARC, which I expect to be eliminated by storing
and using these blocks directly from the pool (on SSD METAXELs),
having both SSD-fast access to the blocks and no expiration into
L2ARC and back with inefficiently-sized ARC pointers to remember.

I guess METAXEL might indeed be cheaper and faster than L2ARC,
for this particular use-case (metadata). Also, this way the
true L2ARC device would be more available to "real" userdata
which is likely to use larger blocks - improving benefits
from your L2ARC compression features as well as reducing
the overhead percentage for ARC pointers; and since it is a
random selection of the pool's blocks, the userdata is too
unpredictable to accelerate well by other means (short of a
full-SSD pool).

Also, having this bulky amount of bytes (BPTree, DDT) is
essentially required for fast operation of the overall pool,
and it is not some unpredictable random set of blocks as is
expected for usual cacheable data - so why 

Re: [zfs-discuss] Dedicated metadata devices

2012-08-24 Thread Richard Elling

On Aug 24, 2012, at 6:50 AM, Sašo Kiselkov wrote:

> This is something I've been looking into in the code and my take on your
> proposed points is this:
> 
> 1) This requires many and deep changes across much of ZFS's architecture
> (especially the ability to sustain tlvdev failures).
> 
> 2) Most of this can be achieved (except for cache persistency) by
> implementing ARC space reservations for certain types of data.

I think the simple solution of increasing the default metadata limit
above 1/4 of arc_max will take care of the vast majority of small-system
complaints. The limit is arbitrary and was set well before dedup was
delivered.

> 
> The latter has the added benefit of spreading load across all ARC and
> L2ARC resources, so your metaxel device never becomes the sole
> bottleneck and it better embraces the ZFS design philosophy of pooled
> storage.
> 
> I plan on having a look at implementing cache management policies (which
> would allow for tuning space reservations for metadata/etc. in a
> fine-grained manner without the cruft of having to worry about physical
> cache devices as well).
> 
> Cheers,
> --
> Saso
> 
> On 08/24/2012 03:39 PM, Jim Klimov wrote:
>> Hello all,
>> 
>>  The idea of dedicated metadata devices (likely SSDs) for ZFS
>> has been generically discussed a number of times on this list,
>> but I don't think I've seen a final proposal that someone would
>> take up for implementation (as a public source code, at least).
>> 
>>  I'd like to take a liberty of summarizing the ideas I've either
>> seen in discussions or proposed myself on this matter, to see if
>> the overall idea would make sense to gurus of ZFS architecture.
>> 
>>  So, the assumption was that the performance killer in ZFS at
>> least on smallish deployments (few HDDs and an SSD accelerator),
>> like those in Home-NAS types of boxes, was random IO to lots of
>> metadata.

It is a bad idea to make massive investments in development and 
testing because of an assumption. Build test cases, prove that the
benefits of the investment can outweigh other alternatives, and then
deliver code.
 -- richard

>> This IMHO includes primarily the block pointer tree
>> and the DDT for those who risked using dedup. I am not sure how
>> frequent is the required read access to other types of metadata
>> (like dataset descriptors, etc.) that the occasional reading and
>> caching won't solve.
>> 
>>  Another idea was that L2ARC caching might not really cut it
>> for metadata in comparison to a dedicated metadata storage,
>> partly due to the L2ARC becoming empty upon every export/import
>> (boot) and needing to get re-heated.
>> 
>>  So, here go the highlights of proposal (up for discussion).
>> 
>> In short, the idea is to use today's format of the blkptr_t
>> which by default allows to store up to 3 DVA addresses of the
>> block, and many types of metadata use only 2 copies (at least
>> by default). This new feature adds a specially processed
>> TLVDEV in the common DVA address space of the pool, and
>> enforces storage of added third copies for certain types
>> of metadata blocks on these devices. (Limited) Backwards
>> compatibility is quite possible, on-disk format change may
>> be not required. The proposal also addresses some questions
>> that arose in previous discussions, especially about proposals
>> where SSDs would be the only storage for pool's metadata:
>> * What if the dedicated metadata device overflows?
>> * What if the dedicated metadata device breaks?
>> = okay/expected by design, nothing dies.
>> 
>>  In more detail:
>> 1) Add a special Top-Level VDEV (TLVDEV below) device type (like
>>   "cache" and "log" - say, "metaxel" for "metadata accelerator"?),
>>   and allow (even encourage) use of mirrored devices and allow
>>   expansion (raid0, raid10 and/or separate TLVDEVs) with added
>>   singlets/mirrors of such devices.
>>   Method of device type definition for the pool is discussable,
>>   I'd go with a special attribute (array) or nvlist in the pool
>>   descriptor, rather than some special type ID in the ZFS label
>>   (backwards compatibility, see point 4 for detailed rationale).
>> 
>>   Discussable: enable pool-wide or per-dataset (i.e. don't
>>   waste accelerator space and lifetime for rarely-reused
>>   datasets like rolling backups)? Choose what to store on
>>   (particular) metaxels - DDT, BPTree, something else?
>>   Overall, this availability of choice is similar to choice
>>   of modes for ARC/L2ARC caching or enabling ZIL per-dataset...
>> 
>> 2) These devices should be formally addressable as part of the
>>   pool in DVA terms (tlvdev:offset:size), but writes onto them
>>   are artificially limited by ZFS scheduler so as to only allow
>>   specific types of metadata blocks (blkptr_t's, DDT entries),
>>   and also enforce writing of added third copies (for blocks
>>   of metadata with usual copies=2) onto these devices.
>> 
>> 3) Absence or "FAULTEDness" of this device should not be fatal
>>   to the pool, but it may require manual intervention to force
>>   the import.

Re: [zfs-discuss] Dedicated metadata devices

2012-08-24 Thread Sašo Kiselkov
This is something I've been looking into in the code and my take on your
proposed points is this:

1) This requires many and deep changes across much of ZFS's architecture
(especially the ability to sustain tlvdev failures).

2) Most of this can be achieved (except for cache persistency) by
implementing ARC space reservations for certain types of data.

The latter has the added benefit of spreading load across all ARC and
L2ARC resources, so your metaxel device never becomes the sole
bottleneck and it better embraces the ZFS design philosophy of pooled
storage.

I plan on having a look at implementing cache management policies (which
would allow for tuning space reservations for metadata/etc. in a
fine-grained manner without the cruft of having to worry about physical
cache devices as well).

Cheers,
--
Saso

On 08/24/2012 03:39 PM, Jim Klimov wrote:
> Hello all,
> 
>   The idea of dedicated metadata devices (likely SSDs) for ZFS
> has been generically discussed a number of times on this list,
> but I don't think I've seen a final proposal that someone would
> take up for implementation (as a public source code, at least).
> 
>   I'd like to take a liberty of summarizing the ideas I've either
> seen in discussions or proposed myself on this matter, to see if
> the overall idea would make sense to gurus of ZFS architecture.
> 
>   So, the assumption was that the performance killer in ZFS at
> least on smallish deployments (few HDDs and an SSD accelerator),
> like those in Home-NAS types of boxes, was random IO to lots of
> metadata. This IMHO includes primarily the block pointer tree
> and the DDT for those who risked using dedup. I am not sure how
> frequent is the required read access to other types of metadata
> (like dataset descriptors, etc.) that the occasional reading and
> caching won't solve.
> 
>   Another idea was that L2ARC caching might not really cut it
> for metadata in comparison to a dedicated metadata storage,
> partly due to the L2ARC becoming empty upon every export/import
> (boot) and needing to get re-heated.
> 
>   So, here go the highlights of proposal (up for discussion).
> 
> In short, the idea is to use today's format of the blkptr_t
> which by default allows to store up to 3 DVA addresses of the
> block, and many types of metadata use only 2 copies (at least
> by default). This new feature adds a specially processed
> TLVDEV in the common DVA address space of the pool, and
> enforces storage of added third copies for certain types
> of metadata blocks on these devices. (Limited) Backwards
> compatibility is quite possible, on-disk format change may
> be not required. The proposal also addresses some questions
> that arose in previous discussions, especially about proposals
> where SSDs would be the only storage for pool's metadata:
> * What if the dedicated metadata device overflows?
> * What if the dedicated metadata device breaks?
> = okay/expected by design, nothing dies.
> 
>   In more detail:
> 1) Add a special Top-Level VDEV (TLVDEV below) device type (like
>"cache" and "log" - say, "metaxel" for "metadata accelerator"?),
>and allow (even encourage) use of mirrored devices and allow
>expansion (raid0, raid10 and/or separate TLVDEVs) with added
>singlets/mirrors of such devices.
>Method of device type definition for the pool is discussable,
>I'd go with a special attribute (array) or nvlist in the pool
>descriptor, rather than some special type ID in the ZFS label
>(backwards compatibility, see point 4 for detailed rationale).
> 
>Discussable: enable pool-wide or per-dataset (i.e. don't
>waste accelerator space and lifetime for rarely-reused
>datasets like rolling backups)? Choose what to store on
>(particular) metaxels - DDT, BPTree, something else?
>Overall, this availability of choice is similar to choice
>of modes for ARC/L2ARC caching or enabling ZIL per-dataset...
> 
> 2) These devices should be formally addressable as part of the
>pool in DVA terms (tlvdev:offset:size), but writes onto them
>are artificially limited by ZFS scheduler so as to only allow
>specific types of metadata blocks (blkptr_t's, DDT entries),
>and also enforce writing of added third copies (for blocks
>of metadata with usual copies=2) onto these devices.
> 
> 3) Absence or "FAULTEDness" of this device should not be fatal
>to the pool, but it may require manual intervention to force
>the import. Particularly, removal, replacement or resilvering
>onto different storage (i.e. migrating to larger SSDs) should
>be supported in the design.
>Beside experimentation and migration concerns, this approach
>should also ease replacement of SSDs used for metadata in case
>of their untimely fatal failures - and this may be a concern
>for many SSD deployments, increasingly susceptible to write
>wearing and ultimate death (at least in the cheaper bulkier
>range, which is a likely component in Home-NAS s

[zfs-discuss] Dedicated metadata devices

2012-08-24 Thread Jim Klimov

Hello all,

  The idea of dedicated metadata devices (likely SSDs) for ZFS
has been generically discussed a number of times on this list,
but I don't think I've seen a final proposal that someone would
take up for implementation (as a public source code, at least).

  I'd like to take the liberty of summarizing the ideas I've either
seen in discussions or proposed myself on this matter, to see if
the overall idea would make sense to gurus of ZFS architecture.

  So, the assumption was that the performance killer in ZFS at
least on smallish deployments (few HDDs and an SSD accelerator),
like those in Home-NAS types of boxes, was random IO to lots of
metadata. This IMHO includes primarily the block pointer tree
and the DDT for those who risked using dedup. I am not sure how
frequently other types of metadata (dataset descriptors, etc.)
need to be read, beyond what occasional reading and caching
can handle.

  Another idea was that L2ARC caching might not really cut it
for metadata in comparison to a dedicated metadata storage,
partly due to the L2ARC becoming empty upon every export/import
(boot) and needing to get re-heated.

  So, here are the highlights of the proposal (up for discussion).

In short, the idea is to use today's blkptr_t format, which
allows storing up to 3 DVA addresses per block while many types
of metadata keep only 2 copies (at least by default). This new
feature adds a specially processed TLVDEV in the common DVA
address space of the pool, and enforces storage of an added
third copy of certain types of metadata blocks on these devices.
(Limited) backwards compatibility is quite possible; an on-disk
format change may not be required. The proposal also addresses
some questions that arose in previous discussions, especially
about proposals where SSDs would be the only storage for the
pool's metadata:
* What if the dedicated metadata device overflows?
* What if the dedicated metadata device breaks?
= okay/expected by design, nothing dies.
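
For reference, the existing block pointer already has room for this
third copy. A simplified rendering (paraphrased from memory of
illumos' sys/spa.h and trimmed so it stands alone - check the real
header for the authoritative layout):

#include <stdint.h>

typedef struct dva {
        uint64_t dva_word[2];           /* packed vdev id, offset, asize, ... */
} dva_t;

typedef struct zio_cksum {
        uint64_t zc_word[4];            /* 256-bit checksum */
} zio_cksum_t;

typedef struct blkptr {
        dva_t       blk_dva[3];         /* up to three copies; copies=2
                                           metadata leaves the third slot
                                           free for a METAXEL copy */
        uint64_t    blk_prop;           /* size, compression, type, ... */
        uint64_t    blk_pad[2];
        uint64_t    blk_phys_birth;
        uint64_t    blk_birth;
        uint64_t    blk_fill;
        zio_cksum_t blk_cksum;
} blkptr_t;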

  In more detail:
1) Add a special Top-Level VDEV (TLVDEV below) device type (like
   "cache" and "log" - say, "metaxel" for "metadata accelerator"?),
   and allow (even encourage) use of mirrored devices and allow
   expansion (raid0, raid10 and/or separate TLVDEVs) with added
   singlets/mirrors of such devices.
   Method of device type definition for the pool is discussable,
   I'd go with a special attribute (array) or nvlist in the pool
   descriptor, rather than some special type ID in the ZFS label
   (backwards compatibility, see point 4 for detailed rationale).

   Discussable: enable pool-wide or per-dataset (i.e. don't
   waste accelerator space and lifetime for rarely-reused
   datasets like rolling backups)? Choose what to store on
   (particular) metaxels - DDT, BPTree, something else?
   Overall, this availability of choice is similar to choice
   of modes for ARC/L2ARC caching or enabling ZIL per-dataset...

2) These devices should be formally addressable as part of the
   pool in DVA terms (tlvdev:offset:size), but writes onto them
   are artificially limited by ZFS scheduler so as to only allow
   specific types of metadata blocks (blkptr_t's, DDT entries),
   and also enforce writing of added third copies (for blocks
   of metadata with usual copies=2) onto these devices.

3) Absence or "FAULTEDness" of this device should not be fatal
   to the pool, but it may require manual intervention to force
   the import. Particularly, removal, replacement or resilvering
   onto different storage (i.e. migrating to larger SSDs) should
   be supported in the design.
   Beside experimentation and migration concerns, this approach
   should also ease replacement of SSDs used for metadata in case
   of their untimely fatal failures - and this may be a concern
   for many SSD deployments, increasingly susceptible to write
   wearing and ultimate death (at least in the cheaper bulkier
   range, which is a likely component in Home-NAS solutions).

4) For backwards compatibility, to older versions of ZFS this
   device should seem like a normal single-disk or mirror TLVDEV
   which contains blocks addressed within the common pool DVA
   address-space. This should have no effect for read-only
   imports. However, other ZFS releases likely won't respect the
   filtering and alignment limitations enforced for the device
   normally in this design, and can "contaminate" the device
   with other types of blocks (and would refuse to import the
   pool if the device is missing/faulted).

5) The ZFS reads should be tweaked to first consult the copy
   of metadata blocks on the metadata accelerator device, and
   only use spinning rust (ordinary TLVDEVs) if there are some
   errors (checksum mismatches, lacking devices, etc.) or during
   scrubs and similar tasks which would require full reads of
   the pool's addressed blocks.
   Prioritized reads from this metadata accelerator won't need
   a special bit in the blkptr_t (like is done for d