Re: Vinum write performance (was: RAID performance (was: cvs commit: src/sys/kern subr_diskmbr.c))

2001-12-12 Thread Bernd Walter

On Thu, Dec 13, 2001 at 12:47:53PM +1030, Greg Lehey wrote:
> On Thursday, 13 December 2001 at  3:06:14 +0100, Bernd Walter wrote:
> > Currently, if we have two writes in each of two stripes, all
> > initiated before the first one finishes, the drive has to seek
> > between the two stripes, as the second write to the same stripe has
> > to wait.
> 
> I'm not sure I understand this.  The stripes are on different drives,
> after all.

Let's assume a single-plex volume with 3 subdisks and a 256k stripe size.
We get a layout like this:

sd1     sd2     sd3
256k    256k    parity
256k    parity  256k
parity  256k    256k
256k    256k    parity
...     ...     ...

Now we write blocks 1, 10, 1040 and 1045 on the volume.
All writes are initiated at the same time.
Ideally we would write 1 first, then 10, then 1040 and finally 1045.
What we currently see is 1, then 1040, then 10 and finally 1045.
This is because we can't write 10 until 1 is finished, but we can
already start 1040 because it is in an independent stripe.
The result is avoidable seeking on subdisk 1.
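
To make the mapping concrete, here is a small standalone C sketch.  It
is hypothetical code, not taken from Vinum, and it assumes 512-byte
blocks and the parity rotation shown in the table above:

#include <stdio.h>

#define NDISKS       3
#define STRIPE_BLKS  512                        /* 256 kB / 512 bytes */
#define DATA_COLS    (NDISKS - 1)

void
map_block(long blkno)
{
    long stripe = blkno / (STRIPE_BLKS * DATA_COLS);
    long off    = blkno % (STRIPE_BLKS * DATA_COLS);
    int  col    = (int)(off / STRIPE_BLKS);     /* which data column      */
    int  parity = (int)(NDISKS - 1 - stripe % NDISKS);
    int  sd     = col < parity ? col : col + 1; /* skip the parity column */
    long sdoff  = stripe * STRIPE_BLKS + off % STRIPE_BLKS;

    printf("block %4ld -> stripe %ld, sd%d offset %ld, parity on sd%d\n",
        blkno, stripe, sd + 1, sdoff, parity + 1);
}

int
main(void)
{
    long blocks[] = { 1, 10, 1040, 1045 };

    for (int i = 0; i < 4; i++)
        map_block(blocks[i]);
    return (0);
}

All four blocks land on subdisk 1: blocks 1 and 10 near its start, and
1040/1045 roughly 512 sectors further in, so the interleaved ordering
above makes the head jump back and forth between the two areas.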

Back to the >256k performance breakdown you described.
Because of this we not only get unneeded seeks on the drive, but also
a different usage pattern on the drive cache.

Once the locking is untangled, the situation needs to be measured
again, as the drive cache may behave differently.

-- 
B.Walter  COSMO-Project http://www.cosmo-project.de
[EMAIL PROTECTED] Usergroup   [EMAIL PROTECTED]





Re: Vinum write performance (was: RAID performance (was: cvs commit: src/sys/kern subr_diskmbr.c))

2001-12-12 Thread Greg Lehey

On Thursday, 13 December 2001 at  3:06:14 +0100, Bernd Walter wrote:
> On Thu, Dec 13, 2001 at 10:54:13AM +1030, Greg Lehey wrote:
>> On Wednesday, 12 December 2001 at 12:53:37 +0100, Bernd Walter wrote:
>>> On Wed, Dec 12, 2001 at 04:22:05PM +1030, Greg Lehey wrote:
>>>> On Tuesday, 11 December 2001 at  3:11:21 +0100, Bernd Walter wrote:
>>>> 2.  Cache the parity blocks.  This is an optimization which I think
>>>> would be very valuable, but which Vinum doesn't currently perform.
>>>
>>> I thought of connecting the parity to the wait lock.
>>> If there's a waiter for the same parity data, it's not dropped.
>>> This way we don't waste memory but still get an effect.
>>
>> That's a possibility, though it doesn't directly address parity block
>> caching.  The problem is that by the time you find another lock,
>> you've already performed part of the parity calculation, and probably
>> part of the I/O transfer.  But it's an interesting consideration.
>
> I know that it isn't the best solution, but it's easy to implement.
> More complex handling for better results can still be added later.

I don't have the time to work out an example, but I don't think it
would change anything until you had two lock waits.  I could be wrong,
though: you've certainly brought out something here that I hadn't
considered, so if you can write up a detailed example (preferably
after you've looked at the code and decided how to handle it), I'd
certainly be interested.

>>> I would guess it happens when the stripe size is bigger than the
>>> read-ahead cache the drives use.  This would mean we have less
>>> chance of getting the parity data out of the drive cache.
>>
>> Yes, this was one of the possibilities we considered.
>
> It should be measured and compared again after I have changed the
> locking.
> The picture will look different after that and may point to other
> causes, because we will have a different load characteristic on the
> drives.
> Currently, if we have two writes in each of two stripes, all initiated
> before the first one finishes, the drive has to seek between the two
> stripes, as the second write to the same stripe has to wait.

I'm not sure I understand this.  The stripes are on different drives,
after all.

>>> Whenever a write hits a driver there is a waiter for it:
>>> either a softdep, a memory-freeing operation or an application doing
>>> a sync transfer.
>>> I'm almost sure delaying writes will harm performance in the upper
>>> layers.
>>
>> I'm not so sure.  Full-stripe writes, where needed, are *much* faster
>> than partial-stripe writes.
>
> Hardware RAID usually comes with NVRAM and can cache write data
> without delaying the acknowledgement to the initiator.
> That option is not available to software RAID.

It could be.  It's probably something worth investigating and
supporting.

Greg
--
See complete headers for address and phone numbers




Re: Vinum write performance (was: RAID performance (was: cvs commit: src/sys/kern subr_diskmbr.c))

2001-12-12 Thread Bernd Walter

On Thu, Dec 13, 2001 at 10:54:13AM +1030, Greg Lehey wrote:
> On Wednesday, 12 December 2001 at 12:53:37 +0100, Bernd Walter wrote:
> > On Wed, Dec 12, 2001 at 04:22:05PM +1030, Greg Lehey wrote:
> >> On Tuesday, 11 December 2001 at  3:11:21 +0100, Bernd Walter wrote:
> >> 2.  Cache the parity blocks.  This is an optimization which I think
> >> would be very valuable, but which Vinum doesn't currently perform.
> >
> > I thought of connecting the parity to the wait lock.
> > If there's a waiter for the same parity data, it's not dropped.
> > This way we don't waste memory but still get an effect.
> 
> That's a possibility, though it doesn't directly address parity block
> caching.  The problem is that by the time you find another lock,
> you've already performed part of the parity calculation, and probably
> part of the I/O transfer.  But it's an interesting consideration.

I know that it isn't the best solution, but it's easy to implement.
More complex handling for better results can still be added later.
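
A minimal userland sketch of what I mean - the structure and function
names are invented here, they are not the real Vinum ones:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct stripelock {
    int    locked;            /* stripe currently locked? */
    int    waiters;           /* requests sleeping on this stripe lock */
    char  *parity;            /* cached parity buffer, or NULL */
    size_t paritylen;
};

/* Called when the write holding the stripe lock completes. */
void
unlock_stripe(struct stripelock *sl, char *newparity, size_t len)
{
    free(sl->parity);         /* drop any stale cached parity */
    if (sl->waiters > 0) {
        /* Someone wants this stripe next: hand the parity buffer over. */
        sl->parity = newparity;
        sl->paritylen = len;
    } else {
        /* Nobody waiting: don't tie up memory, just drop it. */
        free(newparity);
        sl->parity = NULL;
        sl->paritylen = 0;
    }
    sl->locked = 0;
    /* in the kernel a wakeup() on the lock would go here */
}

/* Called by the next writer once it holds the stripe lock. */
char *
take_cached_parity(struct stripelock *sl, size_t *lenp)
{
    char *p = sl->parity;

    sl->parity = NULL;        /* the caller now owns the buffer */
    *lenp = sl->paritylen;
    sl->paritylen = 0;
    return (p);               /* NULL means: read the parity from disk */
}

int
main(void)
{
    struct stripelock sl = { 1, 1, NULL, 0 };   /* locked, one waiter */
    char *parity = strdup("parity block");
    size_t len;

    unlock_stripe(&sl, parity, strlen(parity) + 1);
    char *cached = take_cached_parity(&sl, &len);
    printf("waiter got %s parity (%zu bytes)\n",
        cached != NULL ? "cached" : "no", len);
    free(cached);
    return (0);
}

When take_cached_parity() returns a buffer, the waiter can skip one of
its two reads; when it returns NULL, nothing changes compared to today.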

> >>> If we had a fine grained locking which only locks the accessed sectors
> >>> in the parity we would be able to have more than a single ascending
> >>> write transaction onto a single drive.
> >>
> >> Hmm.  This is something I hadn't thought about.  Note that sequential
> >> writes to a RAID-5 volume don't go to sequential addresses on the
> >> spindles; they will work up to the end of the stripe on one spindle,
> >> then start on the next spindle at the start of the stripe.  You can do
> >> that as long as the address ranges in the parity block don't overlap,
> >> but the larger the stripe, the greater the likelihood of this would
> >> be. This might also explain the following observed behaviour:
> >>
> >> 1.  RAID-5 writes slow down when the stripe size gets > 256 kB or so.
> >> I don't know if this happens on all disks, but I've seen it often
> >> enough.
> >
> > I would guess it happens when the stripe size is bigger than the
> > read-ahead cache the drives use.
> > This would mean we have less chance of getting the parity data out
> > of the drive cache.
> 
> Yes, this was one of the possibilities we considered.  

It should be measured and compared again after I have changed the
locking.
The picture will look different after that and may point to other
causes, because we will have a different load characteristic on the
drives.
Currently, if we have two writes in each of two stripes, all initiated
before the first one finishes, the drive has to seek between the two
stripes, as the second write to the same stripe has to wait.

> >> Note that there's another possible optimization here: delay the writes
> >> by a certain period of time and coalesce them if possible.  I haven't
> >> finished thinking about the implications.
> >
> > That's exactly what the UFS clustering and softupdates do.
> > If that doesn't fit modern drives anymore, it should be tuned there.
> 
> This doesn't have too much to do with modern drives; it's just as
> applicable to 70s drives.

One of softupdates' jobs is to eliminate redundant writes and to do
async writes without losing consistency of the on-media structures.
This also means that we have a better chance of data being written in
big chunks.
In general the wire speed of data to the drive increases with every new
bus generation, but a large part of the per-transaction overhead is
kept for compatibility with older drives.
I agree that the parity-based RAID situation depends more on principle
than on the age of the drive.
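
By "big chunks" I mean coalescing along these lines - a generic sketch,
not the actual UFS clustering or softupdates code:

#include <stdio.h>

struct wreq {
    long blkno;               /* starting block */
    long nblks;               /* length in blocks */
};

/* Merge runs of adjacent requests in a sorted queue; returns new count. */
int
coalesce(struct wreq *q, int n)
{
    int out = 0;

    if (n == 0)
        return (0);
    for (int i = 1; i < n; i++) {
        if (q[i].blkno == q[out].blkno + q[out].nblks)
            q[out].nblks += q[i].nblks;       /* extends the current run */
        else
            q[++out] = q[i];                  /* gap: start a new transfer */
    }
    return (out + 1);
}

int
main(void)
{
    struct wreq q[] = { { 0, 8 }, { 8, 8 }, { 16, 8 }, { 64, 8 } };
    int n = coalesce(q, 4);

    for (int i = 0; i < n; i++)
        printf("write %ld blocks at block %ld\n", q[i].nblks, q[i].blkno);
    return (0);               /* two transfers: 24 blocks at 0, 8 at 64 */
}

Softupdates and the UFS clusterer do considerably more than this, of
course; the sketch only shows why sorted, adjacent writes end up as a
few large transfers.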

> > Whenever a write hits a driver there is a waiter for it:
> > either a softdep, a memory-freeing operation or an application doing
> > a sync transfer.
> > I'm almost sure delaying writes will harm performance in the upper
> > layers.
> 
> I'm not so sure.  Full-stripe writes, where needed, are *much* faster
> than partial-stripe writes.

Hardware RAID usually comes with NVRAM and can cache write data without
delaying the acknowledgement to the initiator.
That option is not available to software RAID.

-- 
B.Walter  COSMO-Project http://www.cosmo-project.de
[EMAIL PROTECTED] Usergroup   [EMAIL PROTECTED]





Re: Vinum write performance (was: RAID performance (was: cvs commit: src/sys/kern subr_diskmbr.c))

2001-12-12 Thread Greg Lehey

On Wednesday, 12 December 2001 at 12:53:37 +0100, Bernd Walter wrote:
> On Wed, Dec 12, 2001 at 04:22:05PM +1030, Greg Lehey wrote:
>> On Tuesday, 11 December 2001 at  3:11:21 +0100, Bernd Walter wrote:
>>> striped:
>>> If you have 512-byte stripes and two disks, a 64k access is split
>>> into two 32k transactions, one onto each disk.
>>
>> Only if your software optimizes the transfers.  There are reasons why
>> it should not.  Without optimization, you get 128 individual
>> transfers.
>
> If the software does not, we end up with 128 transactions anyway,
> which is not very good because of the overhead of each of them.

Correct.

> UFS does a more or less good job at this.

Well, it requires a lot of moves.  Vinum *could* do this, but for the
reasons specified below, there's no need.

>>> raid5:
>>> For a write you have two read transactions and two writes.
>>
>> This is the way Vinum does it.  There are other possibilities:
>>
>> 1.  Always do full-stripe writes.  Then you don't need to read the old
>> contents.
>
> Which isn't that good with the big stripes we usually want.

Correct.  That's why most RAID controllers limit the stripe size to
something sub-optimal: it simplifies the code to do full-stripe writes.
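
For reference, the arithmetic behind the two approaches looks roughly
like this - generic RAID-5 XOR handling, sketched here rather than
taken from the Vinum code:

#include <stdio.h>
#include <stddef.h>

/* Small write (read-modify-write): the old data chunk and the old
 * parity are read first, then two writes follow -- hence the "two
 * reads and two writes". */
void
parity_rmw(unsigned char *parity, const unsigned char *olddata,
    const unsigned char *newdata, size_t len)
{
    for (size_t i = 0; i < len; i++)
        parity[i] ^= olddata[i] ^ newdata[i];
    /* now write newdata and parity back */
}

/* Full-stripe write: the parity is just the XOR of the new data
 * columns, so no reads are needed at all. */
void
parity_full_stripe(unsigned char *parity, const unsigned char *const *cols,
    int ncols, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char p = 0;
        for (int c = 0; c < ncols; c++)
            p ^= cols[c][i];
        parity[i] = p;
    }
    /* now write all data columns plus the parity in one go */
}

int
main(void)
{
    unsigned char d0[4] = { 1, 2, 3, 4 }, d1[4] = { 5, 6, 7, 8 };
    unsigned char parity[4], newd0[4] = { 9, 9, 9, 9 };
    const unsigned char *cols[2] = { d0, d1 };

    parity_full_stripe(parity, cols, 2, 4);     /* parity = d0 ^ d1    */
    parity_rmw(parity, d0, newd0, 4);           /* parity = newd0 ^ d1 */
    printf("parity[0] = %d (expected %d)\n", parity[0], newd0[0] ^ d1[0]);
    return (0);
}

Caching the parity block would let a following small write to the same
stripe skip the old-parity read before parity_rmw().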

>> 2.  Cache the parity blocks.  This is an optimization which I think
>> would be very valuable, but which Vinum doesn't currently perform.
>
> I thought of connecting the parity to the wait lock.
> If there's a waiter for the same parity data, it's not dropped.
> This way we don't waste memory but still get an effect.

That's a possibility, though it doesn't directly address parity block
caching.  The problem is that by the time you find another lock,
you've already performed part of the parity calculation, and probably
part of the I/O transfer.  But it's an interesting consideration.

>>> If we had a fine grained locking which only locks the accessed sectors
>>> in the parity we would be able to have more than a single ascending
>>> write transaction onto a single drive.
>>
>> Hmm.  This is something I hadn't thought about.  Note that sequential
>> writes to a RAID-5 volume don't go to sequential addresses on the
>> spindles; they will work up to the end of the stripe on one spindle,
>> then start on the next spindle at the start of the stripe.  You can do
>> that as long as the address ranges in the parity block don't overlap,
>> but the larger the stripe, the greater the likelihood of this would
>> be. This might also explain the following observed behaviour:
>>
>> 1.  RAID-5 writes slow down when the stripe size gets > 256 kB or so.
>> I don't know if this happens on all disks, but I've seen it often
>> enough.
>
> I would guess it happens when the stripe size is bigger than the
> read-ahead cache the drives use.
> This would mean we have less chance of getting the parity data out of
> the drive cache.

Yes, this was one of the possibilities we considered.  

>> 2.  rawio write performance is better than ufs write performance.
>> rawio does "truly" random transfers, where ufs is a mixture.
>
> The current problem is to increase linear write performance.
> I don't see how rawio would benefit from it, but UFS will.

Well, rawio doesn't need to benefit.  It's supposed to be a neutral
observer, but in this case it's not doing too well.

>> Do you feel like changing the locking code?  It shouldn't be that much
>> work, and I'd be interested to see how much performance difference it
>> makes.
>
> I put it onto my todo list.

Thanks.

>> Note that there's another possible optimization here: delay the writes
>> by a certain period of time and coalesce them if possible.  I haven't
>> finished thinking about the implications.
>
> That's exactly what the UFS clustering and softupdates do.
> If that doesn't fit modern drives anymore, it should be tuned there.

This doesn't have too much to do with modern drives; it's just as
applicable to 70s drives.

> Whenever a write hits a driver there is a waiter for it:
> either a softdep, a memory-freeing operation or an application doing a
> sync transfer.
> I'm almost sure delaying writes will harm performance in the upper
> layers.

I'm not so sure.  Full-stripe writes, where needed, are *much* faster
than partial-stripe writes.

Greg
--
See complete headers for address and phone numbers




Re: Vinum write performance (was: RAID performance (was: cvs commit: src/sys/kern subr_diskmbr.c))

2001-12-12 Thread Bernd Walter

On Wed, Dec 12, 2001 at 04:22:05PM +1030, Greg Lehey wrote:
> On Tuesday, 11 December 2001 at  3:11:21 +0100, Bernd Walter wrote:
> > striped:
> > If you have 512-byte stripes and two disks, a 64k access is split
> > into two 32k transactions, one onto each disk.
> 
> Only if your software optimizes the transfers.  There are reasons why
> it should not.  Without optimization, you get 128 individual
> transfers.

If the software does not, we end up with 128 transactions anyway, which
is not very good because of the overhead of each of them.
UFS does a more or less good job at this.
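
A trivial sketch of the numbers in that example - hypothetical code
that just counts transfers for 512-byte stripes on two disks:

#include <stdio.h>

#define NDISKS     2
#define STRIPE     512            /* bytes */
#define REQUEST    (64 * 1024)    /* bytes */

int
main(void)
{
    int chunks = REQUEST / STRIPE;            /* 128 pieces of 512 bytes */
    int naive = chunks;                       /* one transfer per piece  */

    /*
     * With coalescing: the chunks alternate between the two disks, and
     * on each disk they sit at consecutive offsets, so the request
     * collapses into one transfer per disk.
     */
    int coalesced = NDISKS;

    printf("without optimization: %d transfers of %d bytes\n",
        naive, STRIPE);
    printf("with coalescing:      %d transfers of %d bytes\n",
        coalesced, REQUEST / NDISKS);
    return (0);
}

Whether the driver sees 128 small transfers or 2 large ones is exactly
the optimization question above.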

> > Linear speed could be about twice the speed of a single drive.  But
> > this is more theoretical than real today.  The average transaction
> > size per disk decreases with a growing number of spindles and you get
> > more transaction overhead.  Also, the voice coil technology used in
> > drives for many years now adds a random amount of time to the access
> > time, which invalidates some of the spindle sync potential.  Plus it
> > may break some benefits of the precaching mechanisms in drives.  I'm
> > almost sure there is no real performance gain with modern drives.
> 
> The real problem with this scenario is that you're missing a couple of
> points:
> 
> 1.  Typically it's not the latency that matters.  If you have to wait
> a few ms longer, that's not important.  What's interesting is the
> case of a heavily loaded system, where the throughput is much more
> important than the latency.

Agreed - especially because we don't wait for writes, as most of them
are async.

> 2.  Throughput is the data transferred per unit time.  There's active
> transfer time, nowadays in the order of 500 µs, and positioning
> time, in the order of 6 ms.  Clearly the fewer positioning
> operations, the better.  This means that you should want to put
> most transfers on a single spindle, not a single stripe.  To do
> this, you need big stripes.

In the general case yes.

> > raid5:
> > For a write you have two read transactions and two writes.
> 
> This is the way Vinum does it.  There are other possibilities:
> 
> 1.  Always do full-stripe writes.  Then you don't need to read the old
> contents.

Which isn't that good with the big stripes we usually want.

> 2.  Cache the parity blocks.  This is an optimization which I think
> would be very valuable, but which Vinum doesn't currently perform.

I thought of connecting the parity to the wait lock.
If there's a waiter for the same parity data, it's not dropped.
This way we don't waste memory but still get an effect.

> > There are easier ways to raise performance.
> > Ever wondered why people claim Vinum's RAID-5 writes are slow?
> > The answer is astonishingly simple:
> > Vinum does stripe-based locking, while UFS tries to lay out data in
> > mostly ascending sectors.
> > What happens here is that the first write has to wait for two reads
> > and two writes.
> > If we have an ascending write it has to wait for the first write to
> > finish, because the stripe is still locked.
> > The stripe is only unlocked after both physical writes are on disk.
> > Now we start our two reads, which are (thanks to the drive's
> > read-ahead) most likely in the drive cache - then we write.
> >
> > The problem here is that the physical writes get serialized and the
> > drive has to wait a complete rotation between each of them.
> 
> Not if the data is in the drive cache.

This example was about writing.
Reads get prefetched by the drive and have a very good chance of being
in the cache.
It doesn't matter on IDE disks, because with the write cache enabled
the write gets acknowledged from the cache and not from the media.  If
the write cache is disabled, writes get serialized anyway.

> > If we had a fine grained locking which only locks the accessed sectors
> > in the parity we would be able to have more than a single ascending
> > write transaction onto a single drive.
> 
> Hmm.  This is something I hadn't thought about.  Note that sequential
> writes to a RAID-5 volume don't go to sequential addresses on the
> spindles; they will work up to the end of the stripe on one spindle,
> then start on the next spindle at the start of the stripe.  You can do
> that as long as the address ranges in the parity block don't overlap,
> but the larger the stripe, the greater the likelihood of this would
> be. This might also explain the following observed behaviour:
> 
> 1.  RAID-5 writes slow down when the stripe size gets > 256 kB or so.
> I don't know if this happens on all disks, but I've seen it often
> enough.

I would guess it happens when the stripe size is bigger than the
read-ahead cache the drives use.
This would mean we have less chance of getting the parity data out of
the drive cache.

> 2.  rawio write performance is better than ufs write performance.
> rawio does "truly" random transfers, where ufs is a mixture.

The current problem is to increase linear write performance.
I don't see how rawio would benefit from it, but UFS will.