Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-09-01 Thread Daniel Phillips
On Friday 31 August 2007 14:41, Alasdair G Kergon wrote:
> On Thu, Aug 30, 2007 at 04:20:35PM -0700, Daniel Phillips wrote:
> > Resubmitting a bio or submitting a dependent bio from
> > inside a block driver does not need to be throttled because all
> > resources required to guarantee completion must have been obtained
> > _before_ the bio was allowed to proceed into the block layer.
>
> I'm toying with the idea of keeping track of the maximum device stack
> depth for each stacked device, and only permitting it to increase in
> controlled circumstances.

Hi Alasdair,

What kind of circumstances did you have in mind?

Regards,

Daniel


Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-31 Thread Alasdair G Kergon
On Thu, Aug 30, 2007 at 04:20:35PM -0700, Daniel Phillips wrote:
> Resubmitting a bio or submitting a dependent bio from 
> inside a block driver does not need to be throttled because all 
> resources required to guarantee completion must have been obtained 
> _before_ the bio was allowed to proceed into the block layer.

I'm toying with the idea of keeping track of the maximum device stack
depth for each stacked device, and only permitting it to increase in
controlled circumstances.
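
For what it's worth, one way to picture that bookkeeping: each stacked
device records one more than the deepest stack beneath it, and the block
layer refuses (or specially handles) any reconfiguration that would deepen
an existing stack. A rough sketch only; every name below is hypothetical
and not from any posted patch:

	/* Hypothetical sketch of per-device stack-depth tracking. */
	struct stacked_dev {
		struct stacked_dev **below;	/* underlying devices */
		unsigned int nr_below;
		unsigned int depth;		/* 1 + max depth underneath */
	};

	static unsigned int stack_depth(struct stacked_dev *dev)
	{
		unsigned int i, max = 0;

		for (i = 0; i < dev->nr_below; i++)
			if (dev->below[i]->depth > max)
				max = dev->below[i]->depth;
		return max + 1;	/* caller rejects it if this grows */
	}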

Alasdair
-- 
[EMAIL PROTECTED]


Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-31 Thread Evgeniy Polyakov
Hi Daniel.

On Thu, Aug 30, 2007 at 04:20:35PM -0700, Daniel Phillips ([EMAIL PROTECTED]) 
wrote:
> On Wednesday 29 August 2007 01:53, Evgeniy Polyakov wrote:
> > Then, if of course you will want, which I doubt, you can reread
> > previous mails and find that it was pointed to that race and
> > possibilities to solve it way too long ago.
> 
> What still bothers me about your response is that, while you know the 
> race exists and do not disagree with my example, you don't seem to see 
> that that race can eventually lock up the block device by repeatedly 
> losing throttle counts which are never recovered.  What prevents that?

I posted a trivial hack with the possible errors pointed out, and a
question about whether it should be extended further (with the race
fixed by any of the possible methods and so on) or whether a new one
should be developed (like in your approach, where only the high-level
device is charged); instead I got replies that it contains bugs which
will stop the system and kill the gene pool of mankind. I know how it
works and where the problems are. And if we are going with this
approach I will fix the pointed-out issues.

> > > --- 2.6.22.clean/block/ll_rw_blk.c	2007-07-08 16:32:17.000000000 -0700
> > > +++ 2.6.22/block/ll_rw_blk.c	2007-08-24 12:07:16.000000000 -0700
> > > @@ -3237,6 +3237,15 @@ end_io:
> > >   */
> > >  void generic_make_request(struct bio *bio)
> > >  {
> > > + struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> > > +
> > > + if (q && q->metric) {
> > > + int need = bio->bi_reserved = q->metric(bio);
> > > + bio->queue = q;
> >
> > In case you have a stacked device, this entry will be rewritten and
> > you will lose all your accounting data.
> 
> It is a weakness all right.  Well,
> 
> - if (q && q->metric) {
> + if (q && q->metric && !bio->queue) {
> 
> which fixes that problem.  Maybe there is a better fix possible.  Thanks 
> for the catch!

Yes, it should.

> The original conception was that this block throttling would apply only 
> to the highest level submission of the bio, the one that crosses the 
> boundary between filesystem (or direct block device application) and 
> block layer.  Resubmitting a bio or submitting a dependent bio from 
> inside a block driver does not need to be throttled because all 
> resources required to guarantee completion must have been obtained 
> _before_ the bio was allowed to proceed into the block layer.

We still have not come to a conclusion, and I do not want to start a
flamewar: you believe that throttling must be done on the top-level
device, so you need to extend the bio and convince others that the idea
is worth it.

> The other principle we are trying to satisfy is that the throttling 
> should not be released until bio->endio, which I am not completely sure 
> about with the patch as modified above.  Your earlier idea of having 
> the throttle protection only cover the actual bio submission is 
> interesting and may be effective in some cases, in fact it may cover 
> the specific case of ddsnap.  But we don't have to look any further 
> than ddraid (distributed raid) to find a case it doesn't cover - the 
> additional memory allocated to hold parity data has to be reserved 
> until parity data is deallocated, long after the submission completes.
> So while you manage to avoid some logistical difficulties, it also looks 
> like you didn't solve the general problem.

The block layer does not know, and should not be bothered with, the
nature of the underlying device. If you think that the limit should not
be recharged in the endio callback, then provide your own layer on top
of the bio and thus call the endio callback only when you think it is
ready to be completed.
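
(The usual way to build such a layer, for readers following the thread:
save the bio's completion callback and private data, substitute your own,
and only forward the completion once your own accounting says the bio is
really done. A hedged sketch against the 2.6.22 bio_end_io_t signature
quoted elsewhere in this thread; all names are hypothetical:)

	struct throttle_hook {
		bio_end_io_t *saved_end_io;
		void *saved_private;
	};

	static int throttled_end_io(struct bio *bio, unsigned int done, int err)
	{
		struct throttle_hook *hook = bio->bi_private;

		if (bio->bi_size)	/* partial completion: not done yet */
			return 1;
		/* ... release throttle units here ... */
		bio->bi_end_io = hook->saved_end_io;
		bio->bi_private = hook->saved_private;
		kfree(hook);	/* hook was kmalloc'ed at submission time */
		if (bio->bi_end_io)
			return bio->bi_end_io(bio, done, err);
		return 0;
	}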

> Hopefully I will be able to report on whether my patch actually works 
> soon, when I get back from vacation.  The mechanism in ddsnap this is 
> supposed to replace is effective, it is just ugly and tricky to verify.
> 
> Regards,
> 
> Daniel

-- 
Evgeniy Polyakov


Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-30 Thread Daniel Phillips
On Wednesday 29 August 2007 01:53, Evgeniy Polyakov wrote:
> Then, if of course you will want, which I doubt, you can reread
> previous mails and find that it was pointed to that race and
> possibilities to solve it way too long ago.

What still bothers me about your response is that, while you know the 
race exists and do not disagree with my example, you don't seem to see 
that that race can eventually lock up the block device by repeatedly 
losing throttle counts which are never recovered.  What prevents that?

> > --- 2.6.22.clean/block/ll_rw_blk.c	2007-07-08 16:32:17.000000000 -0700
> > +++ 2.6.22/block/ll_rw_blk.c	2007-08-24 12:07:16.000000000 -0700
> > @@ -3237,6 +3237,15 @@ end_io:
> >   */
> >  void generic_make_request(struct bio *bio)
> >  {
> > +   struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> > +
> > +   if (q && q->metric) {
> > +   int need = bio->bi_reserved = q->metric(bio);
> > +   bio->queue = q;
>
> In case you have a stacked device, this entry will be rewritten and
> you will lose all your accounting data.

It is a weakness all right.  Well,

-   if (q && q->metric) {
+   if (q && q->metric && !bio->queue) {

which fixes that problem.  Maybe there is a better fix possible.  Thanks 
for the catch!

The original conception was that this block throttling would apply only 
to the highest level submission of the bio, the one that crosses the 
boundary between filesystem (or direct block device application) and 
block layer.  Resubmitting a bio or submitting a dependent bio from 
inside a block driver does not need to be throttled because all 
resources required to guarantee completion must have been obtained 
_before_ the bio was allowed to proceed into the block layer.

The other principle we are trying to satisfy is that the throttling 
should not be released until bio->endio, which I am not completely sure 
about with the patch as modified above.  Your earlier idea of having 
the throttle protection only cover the actual bio submission is 
interesting and may be effective in some cases, in fact it may cover 
the specific case of ddsnap.  But we don't have to look any further 
than ddraid (distributed raid) to find a case it doesn't cover - the 
additional memory allocated to hold parity data has to be reserved 
until parity data is deallocated, long after the submission completes.
So while you manage to avoid some logistical difficulties, it also looks 
like you didn't solve the general problem.

Hopefully I will be able to report on whether my patch actually works 
soon, when I get back from vacation.  The mechanism in ddsnap this is 
supposed to replace is effective, it is just ugly and tricky to verify.

Regards,

Daniel


Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-29 Thread Evgeniy Polyakov
On Tue, Aug 28, 2007 at 02:08:04PM -0700, Daniel Phillips ([EMAIL PROTECTED]) 
wrote:
> On Tuesday 28 August 2007 10:54, Evgeniy Polyakov wrote:
> > On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > > > We do not care about one cpu being able to increase its counter
> > > > higher than the limit, such inaccuracy (maximum bios in flight
> > > > thus can be more than limit, difference is equal to the number of
> > > > CPUs - 1) is a price for removing atomic operation. I thought I
> > > > pointed it in the original description, but might forget, that if
> > > > it will be an issue, that atomic operations can be introduced
> > > > there. Any uber-precise measurements in the case when we are
> > > > close to the edge will not give us any benefit at all, since we
> > > > are already in the grey area.
> > >
> > > This is not just inaccurate, it is suicide.  Keep leaking throttle
> > > counts and eventually all of them will be gone.  No more IO
> > > on that block device!
> >
> > First, because number of increased and decreased operations are the
> > same, so it will dance around limit in both directions.
> 
> No.  Please go and read the description of the race again.  A count
> gets irretrievably lost because the write operation of the first
> decrement is overwritten by the second. Data gets lost.  Atomic 
> operations exist to prevent that sort of thing.  You either need to use 
> them or have a deep understanding of SMP read and write ordering in 
> order to preserve data integrity by some equivalent algorithm.

I think you should complete your emotional email with a description of
how atomic types are operated on and how processors access data. Just
to give a lesson to those who never knew how SMP works, yet create
patches and have the conscience to send them and even discuss them.
Then, if of course you want to, which I doubt, you can reread previous
mails and find that that race, and possibilities for solving it, were
pointed out way too long ago.
Anyway, I prefer to look like I do not know how SMP and atomic
operations work and thus to stay away from this discussion.

> --- 2.6.22.clean/block/ll_rw_blk.c	2007-07-08 16:32:17.000000000 -0700
> +++ 2.6.22/block/ll_rw_blk.c	2007-08-24 12:07:16.000000000 -0700
> @@ -3237,6 +3237,15 @@ end_io:
>   */
>  void generic_make_request(struct bio *bio)
>  {
> + struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> +
> + if (q && q->metric) {
> + int need = bio->bi_reserved = q->metric(bio);
> + bio->queue = q;

In case you have a stacked device, this entry will be rewritten and you
will lose all your accounting data.

> + wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
> + atomic_sub(need, &q->available);
> + }

-- 
Evgeniy Polyakov


Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-28 Thread Daniel Phillips
On Tuesday 28 August 2007 10:54, Evgeniy Polyakov wrote:
> On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([EMAIL PROTECTED]) 
> wrote:
> > > We do not care about one cpu being able to increase its counter
> > > higher than the limit, such inaccuracy (maximum bios in flight
> > > thus can be more than limit, difference is equal to the number of
> > > CPUs - 1) is a price for removing atomic operation. I thought I
> > > pointed it in the original description, but might forget, that if
> > > it will be an issue, that atomic operations can be introduced
> > > there. Any uber-precise measurements in the case when we are
> > close to the edge will not give us any benefit at all, since we
> > > are already in the grey area.
> >
> > This is not just inaccurate, it is suicide.  Keep leaking throttle
> > counts and eventually all of them will be gone.  No more IO
> > on that block device!
>
> First, because number of increased and decreased operations are the
> same, so it will dance around limit in both directions.

No.  Please go and read the description of the race again.  A count
gets irretrievably lost because the write operation of the first
decrement is overwritten by the second. Data gets lost.  Atomic 
operations exist to prevent that sort of thing.  You either need to use 
them or have a deep understanding of SMP read and write ordering in 
order to preserve data integrity by some equivalent algorithm.

> Let's solve problems in order of their appearence. If bio structure
> will be allowed to grow, then the whole patches can be done better.

How about something like the patch below.  This throttles any block driver by
implementing a throttle metric method so that each block driver can
keep track of its own resource consumption in units of its choosing.
As an (important) example, it implements a simple metric for device
mapper devices.  Other block devices will work as before, because
they do not define any metric.  Short, sweet and untested, which is
why I have not posted it until now.

This patch originally kept its accounting info in backing_dev_info,
however that structure seems to be in some flux, and it is just a part
of struct queue anyway, so I lifted the throttle accounting up into
struct queue.  We should be able to report on the efficacy of this
patch in terms of deadlock prevention pretty soon.

--- 2.6.22.clean/block/ll_rw_blk.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/block/ll_rw_blk.c	2007-08-24 12:07:16.000000000 -0700
@@ -3237,6 +3237,15 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
+   struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+   if (q && q->metric) {
+   int need = bio->bi_reserved = q->metric(bio);
+   bio->queue = q;
+   wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
+   atomic_sub(need, &q->available);
+   }
+
if (current->bio_tail) {
/* make_request is active */
*(current->bio_tail) = bio;
--- 2.6.22.clean/drivers/md/dm.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/drivers/md/dm.c	2007-08-24 12:14:23.000000000 -0700
@@ -880,6 +880,11 @@ static int dm_any_congested(void *conges
return r;
 }
 
+static unsigned dm_metric(struct bio *bio)
+{
+   return bio->bi_vcnt;
+}
+
 /*-
  * An IDR is used to keep track of allocated minor numbers.
  *---*/
@@ -997,6 +1002,10 @@ static struct mapped_device *alloc_dev(i
goto bad1_free_minor;
 
md->queue->queuedata = md;
+   md->queue->metric = dm_metric;
+   atomic_set(&md->queue->available, md->queue->capacity = 1000);
+   init_waitqueue_head(&md->queue->throttle_wait);
+
md->queue->backing_dev_info.congested_fn = dm_any_congested;
md->queue->backing_dev_info.congested_data = md;
blk_queue_make_request(md->queue, dm_request);
--- 2.6.22.clean/fs/bio.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/fs/bio.c	2007-08-24 12:10:41.000000000 -0700
@@ -1025,7 +1025,12 @@ void bio_endio(struct bio *bio, unsigned
bytes_done = bio->bi_size;
}
 
-   bio->bi_size -= bytes_done;
+   if (!(bio->bi_size -= bytes_done) && bio->bi_reserved) {
+   struct request_queue *q = bio->queue;
+   atomic_add(bio->bi_reserved, &q->available);
+   bio->bi_reserved = 0; /* just in case */
+   wake_up(&q->throttle_wait);
+   }
bio->bi_sector += (bytes_done >> 9);
 
if (bio->bi_end_io)
--- 2.6.22.clean/include/linux/bio.h	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/include/linux/bio.h	2007-08-24 11:53:51.000000000 -0700
@@ -109,6 +109,9 @@ struct bio {
	bio_end_io_t		*bi_end_io;
	atomic_t		bi_cnt;		/* pin count */

+	struct request_queue	*queue;		/* for throttling */
+	unsigned		bi_reserved;	/* throttle units reserved */
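
(To make the "units of its choosing" point above concrete: a driver that
wanted to meter by I/O size rather than by vector count could supply a
different metric. A hypothetical example only, not part of the patch:)

	/* Charge one throttle unit per 512-byte sector of the bio. */
	static unsigned my_sector_metric(struct bio *bio)
	{
		return bio->bi_size >> 9;
	}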

Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-28 Thread Evgeniy Polyakov
On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([EMAIL PROTECTED]) 
wrote:
> > We do not care about one cpu being able to increase its counter
> > higher than the limit, such inaccuracy (maximum bios in flight thus
> > can be more than limit, difference is equal to the number of CPUs -
> > 1) is a price for removing atomic operation. I thought I pointed it
> > in the original description, but might forget, that if it will be an
> > issue, that atomic operations can be introduced there. Any
> > uber-precise measurements in the case when we are close to the edge
> > will not give us any benefit at all, since we are already in the
> > grey area.
> 
> This is not just inaccurate, it is suicide.  Keep leaking throttle 
> counts and eventually all of them will be gone.  No more IO
> on that block device!

First, because the number of increase and decrease operations is the
same, it will dance around the limit in both directions. Second, I
wrote about this race, and there are a number of ways to deal with it,
from atomic operations to separate counters for in-flight and completed
bios (which can be racy too, but from a different angle). Third, if
people cannot agree even on a much higher-level detail - whether the
bio structure should be increased or not - how can we discuss details
of the preliminary implementation with known issues?

So I cannot agree that the issue is fatal, but of course it exists, and
it was highlighted.

Let's solve problems in order of their appearance. If the bio structure
is allowed to grow, then the whole set of patches can be done better;
if not, then there are performance issues (although the more I think
about it, the more I become sure that since a bio itself is very rarely
shared, and thus requires cloning and allocation/freeing, which is
itself a much more costly operation than atomic_sub/dec, it can safely
host an additional operation).

-- 
Evgeniy Polyakov


Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-28 Thread Daniel Phillips
On Tuesday 28 August 2007 02:35, Evgeniy Polyakov wrote:
> On Mon, Aug 27, 2007 at 02:57:37PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > Say Evgeniy, something I was curious about but forgot to ask you
> > earlier...
> >
> > On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> > > ...All operations are not atomic, since we do not care about
> > > precise number of bios, but a fact, that we are close or close
> > > enough to the limit.
> > > ... in bio->endio
> > > + q->bio_queued--;
> >
> > In your proposed patch, what prevents the race:
> >
> >	cpu1				cpu2
> >
> >	read q->bio_queued
> >					q->bio_queued--
> >	write q->bio_queued - 1
> >	Whoops! We leaked a throttle count.
>
> We do not care about one cpu being able to increase its counter
> higher than the limit, such inaccuracy (maximum bios in flight thus
> can be more than limit, difference is equal to the number of CPUs -
> 1) is a price for removing atomic operation. I thought I pointed it
> in the original description, but might forget, that if it will be an
> issue, that atomic operations can be introduced there. Any
> uber-precise measurements in the case when we are close to the edge
> will not give us any benefit at all, since were are already in the
> grey area.

This is not just inaccurate, it is suicide.  Keep leaking throttle 
counts and eventually all of them will be gone.  No more IO
on that block device!

> Another possibility is to create a queue/device pointer in the bio
> structure to hold original device and then in its backing dev
> structure add a callback to recalculate the limit, but it increases
> the size of the bio. Do we need this?

Different issue.  Yes, I think we need a nice simple approach like
that, and prove it is stable before worrying about the size cost.

Regards,

Daniel


Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-28 Thread Evgeniy Polyakov
On Mon, Aug 27, 2007 at 02:57:37PM -0700, Daniel Phillips ([EMAIL PROTECTED]) 
wrote:
> Say Evgeniy, something I was curious about but forgot to ask you 
> earlier...
> 
> On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> > ...All operations are not atomic, since we do not care about precise
> > number of bios, but a fact, that we are close or close enough to the
> > limit. 
> > ... in bio->endio
> > +   q->bio_queued--;
> 
> In your proposed patch, what prevents the race:
> 
>	cpu1				cpu2
>
>	read q->bio_queued
>					q->bio_queued--
>	write q->bio_queued - 1
>	Whoops! We leaked a throttle count.

We do not care about one cpu being able to increase its counter higher
than the limit; such inaccuracy (the maximum number of bios in flight can
thus be more than the limit, the difference being equal to the number of
CPUs - 1) is the price for removing an atomic operation. I thought I
pointed this out in the original description, but may have forgotten: if
it becomes an issue, atomic operations can be introduced there. Any
uber-precise measurements in the case when we are close to the edge will
not give us any benefit at all, since we are already in the grey area.

Another possibility is to create a queue/device pointer in the bio
structure to hold the original device, and then add a callback in its
backing-dev structure to recalculate the limit, but that increases the
size of the bio. Do we need this?

> Regards,
> 
> Daniel

-- 
Evgeniy Polyakov


Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-27 Thread Daniel Phillips
Say Evgeniy, something I was curious about but forgot to ask you 
earlier...

On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> ...All operations are not atomic, since we do not care about precise
> number of bios, but a fact, that we are close or close enough to the
> limit. 
> ... in bio->endio
> + q->bio_queued--;

In your proposed patch, what prevents the race:

	cpu1				cpu2

	read q->bio_queued
					q->bio_queued--
	write q->bio_queued - 1
	Whoops! We leaked a throttle count.
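
(Spelled out in code: the decrement compiles to a separate load, subtract
and store, so both CPUs can load the same value and one decrement is
silently overwritten. Illustrative fragment only:)

	q->bio_queued--;		/* load; subtract 1; store: not atomic */

	atomic_dec(&q->bio_queued);	/* safe, if bio_queued were an atomic_t */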

Regards,

Daniel


Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-13 Thread Evgeniy Polyakov
Hi Daniel.

On Sun, Aug 12, 2007 at 04:16:10PM -0700, Daniel Phillips ([EMAIL PROTECTED]) 
wrote:
> Your patch is close to the truth, but it needs to throttle at the top 
> (virtual) end of each block device stack instead of the bottom 
> (physical) end.  It does head in the direction of eliminating your own 
> deadlock risk indeed, however there are block devices it does not 
> cover.

I decided to limit physical devices just because any limit on top of a
virtual one is not correct. When the system recharges a bio from a
virtual device to a physical one, and the latter is full, the virtual
device will not accept any new blocks for that physical device, but can
accept them for other ones. That was done specially to allow fair use
of network and physical storage.

-- 
Evgeniy Polyakov


Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-12 Thread Daniel Phillips
Hi Evgeniy,

Sorry for not getting back to you right away, I was on the road with 
limited email access.  Incidentally, the reason my mails to you keep 
bouncing is, your MTA is picky about my mailer's IP reversing to a real 
hostname.  I will take care of that pretty soon, but for now my direct 
mail to you is going to bounce and you will only see the lkml copy.

On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> This throttling mechanism allows limiting the maximum number of queued
> bios per physical device. By default it is turned off and the old block
> layer behaviour with an unlimited number of bios is used. When turned on
> (queue limit is set to something different than -1U via
> blk_queue_set_limit()), generic_make_request() will sleep until there
> is room in the queue. The number of bios is increased in
> generic_make_request() and reduced either in bio_endio(), when the bio
> is completely processed (bi_size is zero), or recharged from the
> original queue when a new device is assigned to the bio via
> blk_set_bdev(). All operations are non-atomic, since we do not care
> about the precise number of bios, only the fact that we are close or
> close enough to the limit.
>
> Tested on distributed storage device - with limit of 2 bios it works
> slow :)

it seems to me you need:

-   if (q) {
+   if (q && q->bio_limit != -1) {

This patch is short and simple, and will throttle more accurately than 
the current simplistic per-request allocation limit.  However, it fails 
to throttle device mapper devices.  This is because no request is 
allocated by the device mapper queue method, instead the mapping call 
goes straight through to the mapping function.  If the mapping function 
allocates memory (typically the case) then this resource usage evades 
throttling and deadlock becomes a risk.
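
(Concretely: a request-based driver goes through the queue's default
make_request function, which allocates struct requests against the queue
limits, while a bio-based driver installs its own make_request function
via blk_queue_make_request() and consumes the bio directly. A schematic
sketch of that shape; the driver names are hypothetical:)

	/* Schematic only: per-bio memory is taken with no struct request,
	 * so nothing in the block layer accounts for it.
	 */
	static int my_make_request(request_queue_t *q, struct bio *bio)
	{
		struct my_io *io = mempool_alloc(my_io_pool, GFP_NOIO);

		io->bio = bio;
		my_queue_work(io);	/* remap and resubmit later */
		return 0;
	}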

There are three obvious fixes:

   1) Implement bio throttling in each virtual block device
   2) Implement bio throttling generically in device mapper
   3) Implement bio throttling for all block devices

Number 1 is the approach we currently use in ddsnap, but it is ugly and 
repetitious.  Number 2 is a possibility, but I favor number 3 because 
it is a system-wide solution to a system-wide problem, does not need to 
be repeated for every block device that lacks a queue, heads in the 
direction of code subtraction, and allows system-wide reserve 
accounting. 

Your patch is close to the truth, but it needs to throttle at the top 
(virtual) end of each block device stack instead of the bottom 
(physical) end.  It does head in the direction of eliminating your own 
deadlock risk indeed, however there are block devices it does not 
cover.

Regards,

Daniel


Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-08 Thread Evgeniy Polyakov
On Wed, Aug 08, 2007 at 02:17:09PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) 
wrote:
> This throttling mechanism allows limiting the maximum number of queued
> bios per physical device. By default it is turned off and the old block
> layer behaviour with an unlimited number of bios is used. When turned on
> (queue limit is set to something different than -1U via
> blk_queue_set_limit()), generic_make_request() will sleep until there
> is room in the queue. The number of bios is increased in
> generic_make_request() and reduced either in bio_endio(), when the bio
> is completely processed (bi_size is zero), or recharged from the
> original queue when a new device is assigned to the bio via
> blk_set_bdev(). All operations are non-atomic, since we do not care
> about the precise number of bios, only the fact that we are close or
> close enough to the limit.
> 
> Tested on distributed storage device - with limit of 2 bios it works
> slow :)

As an add-on I can cook up a patch to configure this via sysfs if needed.
Thoughts?
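
(For illustration, such a knob would presumably follow the
queue_sysfs_entry pattern already used for queue attributes such as
nr_requests in ll_rw_blk.c; a hedged, untested sketch building on the
bio_limit field and blk_queue_set_limit() from the patch quoted below
in this thread:)

	static ssize_t queue_bio_limit_show(request_queue_t *q, char *page)
	{
		return queue_var_show(q->bio_limit, page);
	}

	static ssize_t queue_bio_limit_store(request_queue_t *q,
					     const char *page, size_t count)
	{
		unsigned long limit;
		ssize_t ret = queue_var_store(&limit, page, count);

		blk_queue_set_limit(q, limit);
		return ret;
	}

	static struct queue_sysfs_entry queue_bio_limit_entry = {
		.attr  = { .name = "bio_limit", .mode = S_IRUGO | S_IWUSR },
		.show  = queue_bio_limit_show,
		.store = queue_bio_limit_store,
	};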

-- 
Evgeniy Polyakov


[1/1] Block device throttling [Re: Distributed storage.]

2007-08-08 Thread Evgeniy Polyakov
This throttling mechanism allows limiting the maximum number of queued
bios per physical device. By default it is turned off and the old block
layer behaviour with an unlimited number of bios is used. When turned on
(queue limit is set to something different than -1U via
blk_queue_set_limit()), generic_make_request() will sleep until there is
room in the queue. The number of bios is increased in
generic_make_request() and reduced either in bio_endio(), when the bio is
completely processed (bi_size is zero), or recharged from the original
queue when a new device is assigned to the bio via blk_set_bdev(). All
operations are non-atomic, since we do not care about the precise number
of bios, only the fact that we are close or close enough to the limit.

Tested on distributed storage device - with limit of 2 bios it works
slow :)

Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]>

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index c99b463..1882c9b 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1851,6 +1851,10 @@ request_queue_t *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
q->backing_dev_info.unplug_io_data = q;
 
+   q->bio_limit = -1U;
+   q->bio_queued = 0;
+   init_waitqueue_head(&q->wait);
+
mutex_init(&q->sysfs_lock);
 
return q;
@@ -3237,6 +3241,16 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
+   request_queue_t *q;
+
+   BUG_ON(!bio->bi_bdev);
+
+   q = bdev_get_queue(bio->bi_bdev);
+   if (q && q->bio_limit != -1U) {
+   wait_event_interruptible(q->wait, q->bio_queued + 1 <= q->bio_limit);
+   q->bio_queued++;
+   }
+
if (current->bio_tail) {
/* make_request is active */
*(current->bio_tail) = bio;
diff --git a/fs/bio.c b/fs/bio.c
index 093345f..0a33958 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1028,6 +1028,16 @@ void bio_endio(struct bio *bio, unsigned int bytes_done, int error)
bio->bi_size -= bytes_done;
bio->bi_sector += (bytes_done >> 9);
 
+   if (!bio->bi_size && bio->bi_bdev) {
+   request_queue_t *q;
+
+   q = bdev_get_queue(bio->bi_bdev);
+   if (q) {
+   q->bio_queued--;
+   wake_up(&q->wait);
+   }
+   }
+
if (bio->bi_end_io)
bio->bi_end_io(bio, bytes_done, error);
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index db5b00a..7ce0cd7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -467,6 +467,9 @@ struct request_queue
struct request  *orig_bar_rq;
unsigned int    bi_size;
 
+   wait_queue_head_t   wait;
+   unsigned int    bio_limit, bio_queued;
+
struct mutexsysfs_lock;
 };
 
@@ -764,6 +767,30 @@ extern long nr_blockdev_pages(void);
 int blk_get_queue(request_queue_t *);
 request_queue_t *blk_alloc_queue(gfp_t);
 request_queue_t *blk_alloc_queue_node(gfp_t, int);
+
+static inline void blk_queue_set_limit(request_queue_t *q, unsigned int limit)
+{
+   q->bio_limit = limit;
+}
+
+static inline void blk_set_bdev(struct bio *bio, struct block_device *bdev)
+{
+   request_queue_t *q;
+
+   if (!bio->bi_bdev) {
+   bio->bi_bdev = bdev;
+   return;
+   }
+   
+   q = bdev_get_queue(bio->bi_bdev);
+   if (q) {
+   q->bio_queued--;
+   wake_up(&q->wait);
+   }
+
+   bio->bi_bdev = bdev;
+}
+
 extern void blk_put_queue(request_queue_t *);
 
 /*

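For illustration, a driver would opt in with something like this
(hypothetical call site, not part of the patch; my_make_request stands
for whatever make_request method the driver registers, and 128 is an
arbitrary limit):

	request_queue_t *q = blk_alloc_queue(GFP_KERNEL);

	if (q) {
		blk_queue_make_request(q, my_make_request);
		blk_queue_set_limit(q, 128);	/* at most 128 queued bios */
	}
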
-- 
Evgeniy Polyakov

