Adam Leventhal wrote:
> On Thu, Nov 08, 2007 at 07:28:47PM -0800, can you guess? wrote:
>>> How so? In my opinion, it seems like a cure for the brain damage of RAID-5.
>> Nope.
>>
>> A decent RAID-5 hardware implementation has no 'write hole' to worry about, 
>> and one can make a software implementation similarly robust with some effort 
>> (e.g., by using a transaction log to protect the data-plus-parity 
>> double-update or by using COW mechanisms like ZFS's in a more intelligent 
>> manner).
> 
> Can you reference a software RAID implementation which implements a solution
> to the write hole and performs well?

No, but I described how to use a transaction log to do so and later on in the 
post how ZFS could implement a different solution more consistent with its 
current behavior.  In the case of the transaction log, the key is to use the 
log not only to protect the RAID update but to protect the associated 
higher-level file operation as well, such that a single log force satisfies 
both (otherwise, logging the RAID update separately would indeed slow things 
down - unless you had NVRAM to use for it, in which case you've effectively 
just reimplemented a low-end RAID controller - which is probably why no one has 
implemented that kind of solution in a stand-alone software RAID product).
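
To make that concrete, here's a minimal sketch of the single-log-force idea in Python (my own toy illustration with made-up record formats, not anyone's actual implementation): the parity-update record rides in the same transaction as the higher-level file operation it protects, so one force of the log covers both.

import json
import os
import struct

class TxLog:
    """Toy append-only transaction log; names and formats are invented."""

    def __init__(self, path):
        # unbuffered binary append so records reach the OS before the fsync
        self.f = open(path, "ab", buffering=0)

    def commit(self, records):
        # append every record belonging to this transaction...
        for rec in records:
            payload = json.dumps(rec).encode()
            self.f.write(struct.pack(">I", len(payload)) + payload)
        # ...then force the log exactly once for the whole batch
        os.fsync(self.f.fileno())

log = TxLog("/tmp/txlog")
log.commit([
    # the higher-level file operation the application actually asked for
    {"op": "file_write", "inode": 1234, "offset": 0, "length": 16384},
    # the data-plus-parity double update that creates the write hole
    {"op": "raid_update", "stripe": 77, "data_disks": [0, 1, 2], "parity_disk": 3},
])

The point is just that the second record is nearly free once you're already forcing the log for the first one; a stand-alone software RAID layer has no higher-level operation to piggyback on, which is the NVRAM-or-slow trade-off described above.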

...
 
>> The part of RAID-Z that's brain-damaged is its 
>> concurrent-small-to-medium-sized-access performance (at least up to request 
>> sizes equal to the largest block size that ZFS supports, and arguably 
>> somewhat beyond that):  while conventional RAID-5 can satisfy N+1 
>> small-to-medium read accesses or (N+1)/2 small-to-medium write accesses in 
>> parallel (though the latter also take an extra rev to complete), RAID-Z can 
>> satisfy only one small-to-medium access request at a time (well, plus a 
>> smidge for read accesses if it doesn't verify the parity) - effectively 
>> providing RAID-3-style performance.
> 
> Brain damage seems a bit of an alarmist label.

I consider 'brain damage' to be, if anything, a charitable characterization.
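
For what it's worth, the arithmetic behind the parallelism comparison quoted above is trivial; here it is with assumed numbers (a 7-drive array, parity not verified on RAID-5 reads):

# back-of-the-envelope only; the drive count is an assumption
disks = 7                             # N+1 drives in the array

raid5_concurrent_reads  = disks       # each small read touches a single drive
raid5_concurrent_writes = disks // 2  # each small write ties up a data drive and a parity drive
raidz_concurrent_ops    = 1           # every RAID-Z block spans the whole stripe

print(raid5_concurrent_reads, raid5_concurrent_writes, raidz_concurrent_ops)  # 7 3 1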

> While you're certainly right
> that for a given block we do need to access all disks in the given stripe,
> it seems like a rather quaint argument: aren't most environments that matter
> trying to avoid waiting for the disk at all?

Everyone tries to avoid waiting for the disk at all.  Remarkably few succeed 
very well.

> Intelligent prefetch and large
> caches -- I'd argue -- are far more important for performance these days.

Intelligent prefetch doesn't do squat if your problem is disk throughput (which 
in server environments it frequently is).  And all caching does (if you're 
lucky and your workload benefits much at all from caching) is improve your 
system throughput at the point where you hit the disk throughput wall.

Improving your disk utilization, by contrast, pushes back that wall.  And as I 
just observed in another thread, not by 20% or 50% but potentially by around 
two decimal orders of magnitude if you compare sequential-scan performance on 
multiple randomly-updated database tables between a moderately 
coarsely-chunked conventional RAID and a fine-grained ZFS block size (e.g., the 
16 KB used by the example database) with each block sprayed across several 
disks.

Sure, that's a worst-case scenario.  But two orders of magnitude is a hell of a 
lot, even if it doesn't happen often - and suggests that in more typical cases 
you're still likely leaving a considerable amount of performance on the table 
even if that amount is a lot less than a factor of 100.
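
Here's roughly where a factor of that size can come from, using assumed (illustrative, not measured) drive and layout numbers:

# all figures below are assumptions for illustration
positioning_ms = 8.0     # average seek plus rotational delay
media_mb_per_s = 60.0    # sustained transfer rate per disk
data_disks     = 4       # data drives in the group
block_kb       = 16.0    # database/ZFS block size

# Coarsely-chunked conventional RAID, table updated in place and thus still
# laid out contiguously: a scan streams at roughly media rate on every disk.
conventional_scan_mb_per_s = data_disks * media_mb_per_s

# Fine-grained COW layout: after random updates the 16 KB blocks are
# scattered, and each block is split across all the data disks, so the
# whole group pays one positioning delay per block scanned.
per_disk_kb  = block_kb / data_disks
ms_per_block = positioning_ms + (per_disk_kb / 1024.0) / media_mb_per_s * 1000.0
scattered_scan_mb_per_s = (block_kb / 1024.0) / (ms_per_block / 1000.0)

print(round(conventional_scan_mb_per_s / scattered_scan_mb_per_s))  # ~124 with these numbers

Change the assumed numbers and the ratio moves, of course, but for this access pattern it stays painfully large.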

> 
>> The easiest way to fix ZFS's deficiency in this area would probably be to 
>> map each group of N blocks in a file as a stripe with its own parity - which 
>> would have the added benefit of removing any need to handle parity groups at 
>> the disk level (this would, incidentally, not be a bad idea to use for 
>> mirroring as well, if my impression is correct that there's a remnant of 
>> LVM-style internal management there).  While this wouldn't allow use of 
>> parity RAID for very small files, in most installations they really don't 
>> occupy much space compared to that used by large files so this should not 
>> constitute a significant drawback.
> 
> I don't really think this would be feasible given how ZFS is stratified
> today, but go ahead and prove me wrong: here are the instructions for
> bringing over a copy of the source code:
> 
>   http://www.opensolaris.org/os/community/tools/scm

Now you want me not only to design the fix but code it for you?  I'm afraid 
that you vastly overestimate my commitment to ZFS:  while I'm somewhat 
interested in discussing it and happy to provide what insights I can, I really 
don't personally care whether it succeeds or fails.

But I sort of assumed that you might.
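
That said, to make the earlier suggestion slightly more concrete, here's a toy sketch of what I mean by per-file parity groups - entirely my own illustration with invented names, nothing derived from the actual ZFS source:

from functools import reduce

GROUP = 4                  # data blocks per parity group ("N" above)
BLOCK = 16 * 1024          # bytes per block

def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def parity_groups(file_blocks):
    """Yield (data_blocks, parity) for each run of GROUP blocks in a file."""
    for i in range(0, len(file_blocks), GROUP):
        group = file_blocks[i:i + GROUP]
        padded = [b.ljust(BLOCK, b"\x00") for b in group]  # pad a short tail group
        yield group, reduce(xor_blocks, padded)

blocks = [bytes([i]) * BLOCK for i in range(10)]   # a 10-block file
stripes = list(parity_groups(blocks))
print(len(stripes))        # 3 parity groups, each written to GROUP + 1 disks

A one-block file degenerates to parity that is just a copy of the block - i.e., mirroring - which is the small-file caveat mentioned above; but as I said, such files don't account for much space in most installations.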

- bill
 
 