Re: linear writes to raid5

2006-04-26 Thread Neil Brown
On Thursday April 20, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
  
  What is the rationale for your position?
 
 My rationale was that if the md layer receives *write* requests not smaller
 than a full stripe size, it is able to omit reading data to update, and
 can just calculate new parity from the new data.  Hence, combining a
 dozen small write requests coming from a filesystem to form a single
 request >= full stripe size should dramatically increase
 performance.

That makes sense.

However in both cases (above and below raid5), the device receiving
the requests is in a better position to know what size is a good
size than the client sending the requests.
That is exactly what the 'plugging' concept is for.  When a request
arrives, the device is 'plugged' so that it won't process new
requests, and the request plus any following requests are queued.  At
some point the queue is unplugged and the device should be able to
collect related requests to make large requests of an appropriate size
and alignment for the device.
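To illustrate the idea, here is a toy userspace model of plugging (purely
illustrative -- this is not the block-layer API; the fixed-size queue and the
explicit unplug() stand in for the real request queue and its unplug timer):

#include <stdio.h>

#define MAXQ 16

struct req { long sector; long sectors; };

static struct req rq[MAXQ];
static int qlen;
static int plugged;

static void submit(long sector, long sectors)
{
        if (qlen >= MAXQ)
                return;
        plugged = 1;                        /* first request plugs the queue */
        rq[qlen].sector  = sector;
        rq[qlen].sectors = sectors;
        qlen++;
}

static void unplug(void)                    /* stands in for the unplug timer */
{
        int i = 0;

        if (!plugged)
                return;
        while (i < qlen) {
                long start = rq[i].sector;
                long len   = rq[i].sectors;

                /* coalesce contiguous queued requests into one large request */
                while (i + 1 < qlen && rq[i + 1].sector == start + len) {
                        len += rq[i + 1].sectors;
                        i++;
                }
                printf("issue %ld sectors at sector %ld\n", len, start);
                i++;
        }
        qlen = 0;
        plugged = 0;
}

int main(void)
{
        submit(0, 8);                       /* three adjacent 4K writes ... */
        submit(8, 8);
        submit(16, 8);
        unplug();                           /* ... go out as a single 12K write */
        return 0;
}

The point is the same as above: the coalescing happens at the layer that owns
the queue, closest to the device.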

The current suggestion is that plugging isn't quite working right for
raid5.  That is certainly possible.


 
 Eg, when I use dd with O_DIRECT mode (oflag=direct) and experiment with
 different block sizes, write performance increases a lot when bs becomes
 the full stripe size.  Of course it decreases again when bs is increased a
 bit further (as md starts reading again, to construct parity blocks).
 

Yes, O_DIRECT is essentially saying "I know what I am doing and I
want to bypass all the smarts and go straight to the device".
O_DIRECT requests should certainly be sized and aligned to match the
device.  For non-O_DIRECT it shouldn't matter so much.
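For reference, a minimal C sketch of such an aligned O_DIRECT write; the
device path and the 128K "full stripe" figure (2 data disks x 64K chunk) are
assumptions for illustration only, and the write is destructive:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define STRIPE (128 * 1024)   /* assumed full data stripe: 2 x 64K chunks */

int main(void)
{
        void *buf;
        int fd = open("/dev/md0", O_WRONLY | O_DIRECT);  /* destroys data! */

        if (fd < 0)
                return 1;
        if (posix_memalign(&buf, 4096, STRIPE))   /* O_DIRECT needs alignment */
                return 1;
        memset(buf, 0, STRIPE);

        /* one full-stripe, stripe-aligned write: no read-modify-write needed */
        if (pwrite(fd, buf, STRIPE, 0) != STRIPE)
                return 1;

        free(buf);
        close(fd);
        return 0;
}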

NeilBrown


Re: linear writes to raid5

2006-04-20 Thread Michael Tokarev

Neil Brown wrote:

On Tuesday April 18, [EMAIL PROTECTED] wrote:

[]

 I mean, merging bios into larger requests makes a lot of sense between
 a filesystem and md levels, but it makes a lot less sense to do that
 between md and physical (fsvo physical, anyway) disks.


This seems completely backwards to me, so we must be thinking of
different things.

Creating large requests above the md level doesn't make a lot of sense
to me because there is a reasonable chance that md will just need to
break the requests up again to submit to different devices lower down.

Building large requests for the physical disk makes lots of sense
because you get much better throughput on e.g. a SCSI bus by having
few large requests rather than many small requests.  But this building
should be done close to the device so that as much information as
possible is available about particular device characteristics.

What is the rationale for your position?


My rationale was that if the md layer receives *write* requests not smaller
than a full stripe size, it is able to omit reading data to update, and
can just calculate new parity from the new data.  Hence, combining a
dozen small write requests coming from a filesystem to form a single
request >= full stripe size should dramatically increase performance.
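To make the no-read case concrete, here is a minimal userspace sketch (the
chunk size and disk count are illustrative values only; this is not md code):
with a full-stripe write, the parity chunk is simply the XOR of the new data
chunks, so nothing has to be read first.

#include <stdlib.h>
#include <string.h>

#define CHUNK (64 * 1024)
#define NDATA 2                        /* 3-disk raid5: 2 data + 1 parity */

/* parity for a full-stripe write: pure computation, zero reads */
static void parity_from_full_stripe(unsigned char *parity,
                                    unsigned char *data[NDATA])
{
        memcpy(parity, data[0], CHUNK);
        for (int d = 1; d < NDATA; d++)
                for (int i = 0; i < CHUNK; i++)
                        parity[i] ^= data[d][i];
}

int main(void)
{
        unsigned char *d0 = malloc(CHUNK), *d1 = malloc(CHUNK);
        unsigned char *p  = malloc(CHUNK);
        unsigned char *data[NDATA] = { d0, d1 };

        if (!d0 || !d1 || !p)
                return 1;
        memset(d0, 0xaa, CHUNK);
        memset(d1, 0x55, CHUNK);
        parity_from_full_stripe(p, data);
        free(d0); free(d1); free(p);
        return 0;
}

A write smaller than the full stripe cannot do this: the missing data (or the
old data and parity) has to be read in first.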

Eg, when I use dd with O_DIRECT mode (oflag=direct) and experiment with
different block sizes, write performance increases a lot when bs becomes
the full stripe size.  Of course it decreases again when bs is increased a
bit further (as md starts reading again, to construct parity blocks).

For read requests, it makes much less difference where to combine them.

/mjt


Re: linear writes to raid5

2006-04-19 Thread Alex Tomas
 Neil Brown (NB) writes:

 NB raid5 shouldn't need to merge small requests into large requests.
 NB That is what the 'elevator' or io_scheduler algorithms are for.  These
 NB already merge multiple bio's into larger 'requests'.  If they aren't
 NB doing that, then something needs to be fixed.

hmm. then why do filesystems try to allocate big chunks and submit them
at once? what's the point of having the bio subsystem?

 NB It is certainly possible that raid5 is doing something wrong that
 NB makes merging harder - maybe sending bios in the wrong order, or
 NB sending them with unfortunate timing.  And if that is the case it
 NB certainly makes sense to fix it.  
 NB But I really don't see that raid5 should be merging requests together
 NB - that is for a lower-level to do.

well, another thing is that it's extremely cheap to merge them in raid5
because we know the request size and what stripes it covers. at the same
time the block layer doesn't know that and needs to _search_ where to merge to.
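The arithmetic involved is roughly this (illustrative only: made-up geometry,
and parity rotation is ignored; not raid5 code):

#include <stdio.h>

#define SECTOR        512L
#define CHUNK_SECTORS (64 * 1024 / SECTOR)   /* 64K chunk */
#define DATA_DISKS    2                      /* 3-disk raid5 */

int main(void)
{
        long start = 0, len = 256;           /* a 128K request at sector 0 */
        long stripe_sectors = CHUNK_SECTORS * DATA_DISKS;
        long first = start / stripe_sectors;
        long last  = (start + len - 1) / stripe_sectors;

        printf("request covers stripes %ld..%ld\n", first, last);
        return 0;
}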

 NB This implies 3 milliseconds have passed since the queue was plugged, which
 NB is a long time.
 NB I guess what could be happening is that the queue is being unplugged
 NB every 3msec whether it is really needed or not.
 NB i.e. we plug the queue, more requests come, the stripes we plugged the
 NB queue for get filled up and processed, but the timer never gets reset.
 NB Maybe we need to find a way to call blk_remove_plug when there are no
 NB stripes waiting for pre-read...

 NB Alternately, stripes on the delayed queue could get a timestamp, and
 NB only get removed if they are older than 3msec.  Then we would replug
 NB the queue if there were some new stripes left

could we somehow mark all stripes that belong to a given incoming request
in make_request() and skip them in raid5_activate_delayed()?  after the
whole incoming request is processed, drop the mark.
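As a toy model of that mark-and-skip idea (all names made up; this is not md
code, just the shape of the logic):

#include <stdbool.h>
#include <stdio.h>

#define NSTRIPES 8

static bool delayed[NSTRIPES];             /* stripes waiting for pre-read */
static bool in_current_request[NSTRIPES];  /* the proposed mark */

static void activate_delayed(void)
{
        for (int s = 0; s < NSTRIPES; s++) {
                if (!delayed[s])
                        continue;
                if (in_current_request[s])
                        continue;          /* skip: still being filled by the
                                              request in flight */
                delayed[s] = false;
                printf("activating stripe %d (pre-read allowed)\n", s);
        }
}

int main(void)
{
        delayed[2] = delayed[4] = true;
        in_current_request[4] = true;      /* stripe 4 belongs to the current
                                              incoming request */
        activate_delayed();                /* only stripe 2 gets activated */

        in_current_request[4] = false;     /* request finished: drop the mark */
        activate_delayed();                /* now stripe 4 gets activated too */
        return 0;
}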

thanks, Alex


Re: linear writes to raid5

2006-04-19 Thread Alex Tomas
 Michael Tokarev (MT) writes:

 MT Hmm.  So where's the elevator level - before raid level (between e.g.
 MT a filesystem and md), or after it (between md and physical devices) ?

in both, because raid5 produces _new_ requests and sends them
to the elevator again.

 MT I mean, merging bios into larger requests makes a lot of sense between
 MT a filesystem and md levels, but it makes a lot less sense to do that
 MT between md and physical (fsvo physical, anyway) disks.

i'm not talking about merging small _incoming_ requests. the problem
is that a filesystem sends big requests to raid5 (say, 1MB) and then
raid5 produces a ton of small requests (PAGE_SIZE) to handle that 1MB
one. this kills performance.
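Back-of-envelope for that 1MB example (illustrative numbers; raid5 works in
PAGE_SIZE stripe units, taken here as 4K):

#include <stdio.h>

int main(void)
{
        long request = 1024 * 1024;              /* 1MB incoming write */
        long page    = 4096;                     /* raid5 stripe unit */
        int  disks   = 3;                        /* 2 data + 1 parity */

        long data_pages   = request / page;              /* 256 */
        long stripe_units = data_pages / (disks - 1);    /* 128 */

        /* each stripe unit can emit one small bio per disk (data + parity) */
        printf("%ld stripe units -> up to %ld small per-disk bios\n",
               stripe_units, stripe_units * disks);
        return 0;
}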

thanks, Alex


Re: linear writes to raid5

2006-04-19 Thread Neil Brown
On Tuesday April 18, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
 []
  raid5 shouldn't need to merge small requests into large requests.
  That is what the 'elevator' or io_scheduler algorithms are for.  These
  already merge multiple bio's into larger 'requests'.  If they aren't
  doing that, then something needs to be fixed.
  
  It is certainly possible that raid5 is doing something wrong that
  makes merging harder - maybe sending bios in the wrong order, or
  sending them with unfortunate timing.  And if that is the case it
  certainly makes sense to fix it.  
  But I really don't see that raid5 should be merging requests together
  - that is for a lower-level to do.
 
 Hmm.  So where's the elevator level - before raid level (between e.g.
 a filesystem and md), or after it (between md and physical devices)?

The elevator is immediately above the low-level device.  So it is
between md and the physical device.  There is no elevator above md.


 
 I mean, merging bios into larger requests makes a lot of sense between
 a filesystem and md levels, but it makes a lot less sense to do that
 between md and physical (fsvo physical, anyway) disks.

This seems completely backwards to me, so we must be thinking of
different things.

Creating large requests above the md level doesn't make a lot of sense
to me because there is a reasonable chance that md will just need to
break the requests up again to submit to different devices lower down.

Building large requests for the physical disk makes lots of sense
because you get much better throughput on e.g. a SCSI bus by having
few large requests rather than many small requests.  But this building
should be done close to the device so that as much information as
possible is available about particular device characteristics.

What is the rationale for your position?

NeilBrown


Re: linear writes to raid5

2006-04-19 Thread Neil Brown
On Wednesday April 19, [EMAIL PROTECTED] wrote:
  Neil Brown (NB) writes:
 
  NB raid5 shouldn't need to merge small requests into large requests.
  NB That is what the 'elevator' or io_scheduler algorithms are for.  These
  NB already merge multiple bio's into larger 'requests'.  If they aren't
  NB doing that, then something needs to be fixed.
 
 hmm. then why do filesystems try to allocate big chunks and submit them
 at once? what's the point of having the bio subsystem?

I've often wondered this

The rationale for creating large bios has to do with code path length.
Making small requests and sending each one down the block device stack
results in long code paths being called over and over again, each call
doing almost exactly the same thing.  This isn't nice to the L1 cache.

Creating a large request and sending it down once means the long path
is traversed less often.

However I would have built a linked-list of very lightweight
structures and passed that down...
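Something like this rough userspace sketch (hypothetical types, not the bio
API): many small segments chained together and handed down in a single call,
so the long submission path runs once rather than once per segment.

#include <stdio.h>
#include <stdlib.h>

struct seg {                    /* lightweight: just location, length, link */
        long sector;
        unsigned int len;
        struct seg *next;
};

static void submit_chain(struct seg *head)   /* one pass down the long path */
{
        unsigned int total = 0;

        for (struct seg *s = head; s; s = s->next)
                total += s->len;
        printf("submitted %u bytes in one pass\n", total);
}

int main(void)
{
        struct seg *head = NULL;

        for (int i = 255; i >= 0; i--) {     /* 256 x 4K segments, chained */
                struct seg *s = malloc(sizeof(*s));

                if (!s)
                        return 1;
                s->sector = i * 8L;
                s->len = 4096;
                s->next = head;
                head = s;
        }
        submit_chain(head);                  /* vs. 256 separate submissions */

        while (head) {                       /* free the chain */
                struct seg *n = head->next;
                free(head);
                head = n;
        }
        return 0;
}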

 
  NB It is certainly possible that raid5 is doing something wrong that
  NB makes merging harder - maybe sending bios in the wrong order, or
  NB sending them with unfortunate timing.  And if that is the case it
  NB certainly makes sense to fix it.  
  NB But I really don't see that raid5 should be merging requests together
  NB - that is for a lower-level to do.
 
 well, another thing is that it's extremely cheap to merge them in raid5
 because we know the request size and what stripes it covers. at the same time
 the block layer doesn't know that and needs to _search_ where to merge
 to.

For write requests, I don't think there is much gain here.  By the
time you have done all the parity updates, you have probably lost
track of what follows what.

For read requests on a working drive, I'd like to simply bypass the
stripe cache altogether as I outlined in a separate email on
linux-raid a couple of weeks ago.

 
  NB This implies 3 milliseconds have passed since the queue was plugged, which
  NB is a long time.
  NB I guess what could be happening is that the queue is being unplugged
  NB every 3msec whether it is really needed or not.
  NB i.e. we plug the queue, more requests come, the stripes we plugged the
  NB queue for get filled up and processed, but the timer never gets reset.
  NB Maybe we need to find a way to call blk_remove_plug when there are no
  NB stripes waiting for pre-read...
 
  NB Alternately, stripes on the delayed queue could get a timestamp, and
  NB only get removed if they are older than 3msec.  Then we would replug
  NB the queue if there were some new stripes left
 
 could we somehow mark all stripes that belong to a given incoming request
 in make_request() and skip them in raid5_activate_delayed()?  after the
 whole incoming request is processed, drop the mark.

Again, I don't think that the logic should be based on a given
incoming request.  Yes, something needs to be done here, but I think
it should essentially be time based rather than incoming-request
based.

However you are welcome to try things out and see if you can make it
work faster.  If you can, I'm sure your results will be a significant
contribution to whatever ends up being the final solution.

NeilBrown


Re: linear writes to raid5

2006-04-18 Thread Michael Tokarev

Neil Brown wrote:
[]

raid5 shouldn't need to merge small requests into large requests.
That is what the 'elevator' or io_scheduler algorithms are for.  These
already merge multiple bio's into larger 'requests'.  If they aren't
doing that, then something needs to be fixed.

It is certainly possible that raid5 is doing something wrong that
makes merging harder - maybe sending bios in the wrong order, or
sending them with unfortunate timing.  And if that is the case it
certainly makes sense to fix it.  
But I really don't see that raid5 should be merging requests together
- that is for a lower-level to do.


Hmm.  So where's the elevator level - before raid level (between e.g.
a filesystem and md), or after it (between md and physical devices) ?

I mean, merging bios into larger requests makes a lot of sense between
a filesystem and md levels, but it makes a lot less sense to do that
between md and physical (fsvo physical, anyway) disks.

Thanks.

/mjt



Re: linear writes to raid5

2006-04-16 Thread Neil Brown
On Wednesday April 12, [EMAIL PROTECTED] wrote:
  Neil Brown (NB) writes:
 
  NB There are a number of aspects to this.
 
  NB  - When a write arrives we 'plug' the queue so the stripe goes onto a
  NB    'delayed' list which doesn't get processed until an unplug happens,
  NB    or until the stripe is full and not requiring any reads.
  NB  - If there is already pre-read active, then we don't start any more
  NB    prereading until the pre-read is finished.  This effectively
  NB    batches the prereading which delays writes a little, but not too
  NB    much.
  NB  - When the stripe-cache becomes full, we wait until it gets down to
  NB    3/4 full before allocating another stripe.  This means that when
  NB    some write requests come in, there should be enough room in the
  NB    cache to delay them until they become full.
 
 I see. though my point is a bit different:
 say, there is an application that's doing big linear writes in order
 to achieve good throughput. on the other hand, most modern storage
 is very sensitive to request size and tends to suck at serving zillions
 of small I/Os. raid5 breaks all incoming requests into small ones and
 handles them separately. of course, one might be lucky and, after submitting,
 those small requests get merged into larger ones. but only due to luck,
 I'm afraid. what I'm talking about is explicit code in raid5 that
 would try to merge small requests in some obvious cases.
 for example:

raid5 shouldn't need to merge small requests into large requests.
That is what the 'elevator' or io_scheduler algorithms are for.  These
already merge multiple bio's into larger 'requests'.  If they aren't
doing that, then something needs to be fixed.

It is certainly possible that raid5 is doing something wrong that
makes merging harder - maybe sending bios in the wrong order, or
sending them with unfortunate timing.  And if that is the case it
certainly makes sense to fix it.  
But I really don't see that raid5 should be merging requests together
- that is for a lower-level to do.

 
 
  NB You are right.  This isn't optimal.
  NB I don't think that the queue should get unplugged at this point.
  NB Do you know what is calling raid5_unplug_device in your step 4?
 
  NB We could take the current request into account, but I would rather
  NB avoid that if possible.  If we can develop a mechanism that does the
  NB right thing without reference to the current request, then it will
  NB work equally if the request comes down in smaller chunks.
 
 note also, that there can be other stripes being served. and they
 may need reads. thus you'll have to unplug the queue for them.
 
   cause delayed stripes to get activated.
 
  NB Can you explain where they cause delayed stripes to get activated?
 
 just caught it:
 
  [c0106b3e] dump_stack+0x1e/0x30
 [f881186e] raid5_unplug_device+0xee/0x110 [raid5]
 [c02452e2] blk_unplug_work+0x12/0x20
  [c01319ad] worker_thread+0x19d/0x240
  [c013611a] kthread+0xba/0xc0
  [c01047c5] kernel_thread_helper+0x5/0x10

This implies 3 milliseconds have passed since the queue was plugged, which
is a long time.
I guess what could be happening is that the queue is being unplugged
every 3msec whether it is really needed or not.
i.e. we plug the queue, more requests come, the stripes we plugged the
queue for get filled up and processed, but the timer never gets reset.
Maybe we need to find a way to call blk_remove_plug when there are no
stripes waiting for pre-read...

Alternately, stripes on the delayed queue could get a timestamp, and
only get removed if they are older than 3msec.  Then we would replug
the queue if there were some new stripes left

Something like that might work.

NeilBrown


Re: linear writes to raid5

2006-04-12 Thread Alex Tomas
 Neil Brown (NB) writes:

 NB There are a number of aspects to this.

 NB  - When a write arrives we 'plug' the queue so the stripe goes onto a
 NB    'delayed' list which doesn't get processed until an unplug happens,
 NB    or until the stripe is full and not requiring any reads.
 NB  - If there is already pre-read active, then we don't start any more
 NB    prereading until the pre-read is finished.  This effectively
 NB    batches the prereading which delays writes a little, but not too
 NB    much.
 NB  - When the stripe-cache becomes full, we wait until it gets down to
 NB    3/4 full before allocating another stripe.  This means that when
 NB    some write requests come in, there should be enough room in the
 NB    cache to delay them until they become full.

I see. though my point is a bit different:
say, there is an application that's doing big linear writes in order
to achieve good throughput. on the other hand, most modern storage
is very sensitive to request size and tends to suck at serving zillions
of small I/Os. raid5 breaks all incoming requests into small ones and
handles them separately. of course, one might be lucky and, after submitting,
those small requests get merged into larger ones. but only due to luck,
I'm afraid. what I'm talking about is explicit code in raid5 that
would try to merge small requests in some obvious cases.
for example:


 NB You are right.  This isn't optimal.
 NB I don't think that the queue should get unplugged at this point.
 NB Do you know what is calling raid5_unplug_device in your step 4?

 NB We could take the current request into account, but I would rather
 NB avoid that if possible.  If we can develop a mechanism that does the
 NB right thing without reference to the current request, then it will
 NB work equally if the request comes down in smaller chunks.

note also, that there can be other stripes being served. and they
may need reads. thus you'll have to unplug the queue for them.

  cause delayed stripes to get activated.

 NB Can you explain where they cause delayed stripes to get activated?

just caught it:

 [c0106b3e] dump_stack+0x1e/0x30
 [f881186e] raid5_unplug_device+0xee/0x110 [raid5]
 [c02452e2] blk_unplug_work+0x12/0x20
 [c01319ad] worker_thread+0x19d/0x240
 [c013611a] kthread+0xba/0xc0
 [c01047c5] kernel_thread_helper+0x5/0x10


thanks, Alex


Re: linear writes to raid5

2006-04-12 Thread Alex Tomas
 Alex Tomas (AT) writes:

 AT I see. though my point is a bit different:
 AT say, there is an application that's doing big linear writes in order
 AT to achieve good throughput. on the other hand, most modern storage
 AT is very sensitive to request size and tends to suck at serving zillions
 AT of small I/Os. raid5 breaks all incoming requests into small ones and
 AT handles them separately. of course, one might be lucky and, after submitting,
 AT those small requests get merged into larger ones. but only due to luck,
 AT I'm afraid. what I'm talking about is explicit code in raid5 that
 AT would try to merge small requests in some obvious cases.
 AT for example:


sorry, forgot to include the example ...

there is a 3-disk raid5 with chunk=64K. one does a 128K request.

and in make_request() we do something like this:

struct context {
  struct bio *bios[FOR_EVERY_DISK];   /* one accumulating bio per member disk */
};

/* initialize the context with empty bios */
for (all internal stripes covered by the incoming bio) {
    sh = get_active_stripe(...);
    add_stripe_bio(sh, bi, ...);
    handle_stripe(sh, context);
}
/* submit all non-empty bios from the context */


and in handle_stripe(): instead of immediately calling
generic_make_request(), try to merge the request into the corresponding
bio in the context.

I understand there are a few different limits on bio size,
active stripes, etc., but what do you think about the idea in general?

thanks, Alex


Re: linear writes to raid5

2006-04-11 Thread Alex Tomas
 Mark Hahn (MH) writes:

 MH don't you mean _3_ chunk-sized writes?  if so, are you actually
 MH asking about the case when you issue an aligned two-stripe write?
 MH (which might get broken into 6 64K writes, not sure, rather than 
 MH three 2-chunk writes...)

actually, yes. I'm talking about 3 requests: 2 of data and one of parity.


Re: linear writes to raid5

2006-04-11 Thread Alex Tomas
 Neil Brown (NB) writes:

 NB The raid5 code attempts to do this already, though I'm not sure how
 NB successful it is.  I think it is fairly successful, but not completely
 NB successful. 

hmm. could you tell me what code I should look at?


 NB There is a trade-off that raid5 has to make.  Waiting longer can mean
 NB more blocks on the same stripe, and so less reads.  But waiting longer
 NB can also increase latency which might not be good.

yes, I agree.

 NB The thing to do would be to put some tracing in to find out exactly what
 NB is happening for some sample workloads, and then see if anything can
 NB be improved.

well, the simplest case I tried was this:

mdadm -C /dev/md0 --level=5 --chunk=8 --raid-disks=3 ...
then open /dev/md0 with O_DIRECT and send a write of 16K.
it ended up doing a few writes and one read. the sequence was:
1) serving the first 4K of the request - put the stripe onto the delayed list
2) serving the 2nd 4KB -- again onto the delayed list
3) serving the 3rd 4KB -- get a full uptodate stripe, time to make the parity
   3 writes are issued for stripe #0
4) raid5_unplug_device() is called because of those 3 writes
   it activates delayed stripe #4
5) raid5d() finds stripe #4 and issues READ
...

I tend to think this isn't the most optimal way. couldn't we take the current
request into account somehow? something like keeping delayed stripes off the
queue while the current request is still being served AND the stripe cache
isn't full.

another similar case is when you have two processes writing to very
different stripes, and the low-level requests they make from handle_stripe()
cause delayed stripes to get activated.

thanks, Alex


Re: linear writes to raid5

2006-04-09 Thread Mark Hahn
 is there a way to explicitly batch the write requests raid5 issues?

sort of like TCP_CORK?

 for example, there is a raid5 built from 3 disks with chunk=64K.
 one types dd if=/dev/zero of=/dev/md0 bs=128k count=1

OK, so this is an aligned, whole-stripe write.

and a 128K
 bio gets into the raid5. raid5 processes the request, does xor
 for the parity stripe, then issues 2 64KB requests down to the lower level.

don't you mean _3_ chunk-sized writes?  if so, are you actually
asking about the case when you issue an aligned two-stripe write?
(which might get broken into 6 64K writes, not sure, rather than 
three 2-chunk writes...)

_not_ that I know this code at all!



Re: linear writes to raid5

2006-04-09 Thread Neil Brown
On Saturday April 8, [EMAIL PROTECTED] wrote:
 
 Good day all,
 
 is there a way to explicitly batch the write requests raid5 issues?
 for example, there is a raid5 built from 3 disks with chunk=64K.
 one types dd if=/dev/zero of=/dev/md0 bs=128k count=1 and a 128K
 bio gets into the raid5. raid5 processes the request, does xor
 for the parity stripe, then issues 2 64KB requests down to the lower level.
 
 is it even possible to implement? if so, how complex?
 
 I suppose we could introduce a context which holds the last
 non-issued bio and, instead of calling generic_make_request() in
 handle_stripe(), try to merge the current request into the previous
 one from the context?  how does this sound to you?

The raid5 code attempts to do this already, though I'm not sure how
successful it is.  I think it is fairly successful, but not completely
successful. 

There is a trade-off that raid5 has to make.  Waiting longer can mean
more blocks on the same stripe, and so less reads.  But waiting longer
can also increase latency which might not be good.

The thing to do would be to put some tracing in to find out exactly what
is happening for some sample workloads, and then see if anything can
be improved.

NeilBrown