Re: [1/1] Block device throttling [Re: Distributed storage.]
On Friday 31 August 2007 14:41, Alasdair G Kergon wrote:
> On Thu, Aug 30, 2007 at 04:20:35PM -0700, Daniel Phillips wrote:
> > Resubmitting a bio or submitting a dependent bio from
> > inside a block driver does not need to be throttled because all
> > resources required to guarantee completion must have been obtained
> > _before_ the bio was allowed to proceed into the block layer.
>
> I'm toying with the idea of keeping track of the maximum device stack
> depth for each stacked device, and only permitting it to increase in
> controlled circumstances.

Hi Alasdair,

What kind of circumstances did you have in mind?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Thu, Aug 30, 2007 at 04:20:35PM -0700, Daniel Phillips wrote:
> Resubmitting a bio or submitting a dependent bio from
> inside a block driver does not need to be throttled because all
> resources required to guarantee completion must have been obtained
> _before_ the bio was allowed to proceed into the block layer.

I'm toying with the idea of keeping track of the maximum device stack
depth for each stacked device, and only permitting it to increase in
controlled circumstances.

Alasdair
-- 
[EMAIL PROTECTED]
Re: [1/1] Block device throttling [Re: Distributed storage.]
Hi Daniel.

On Thu, Aug 30, 2007 at 04:20:35PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> On Wednesday 29 August 2007 01:53, Evgeniy Polyakov wrote:
> > Then, if of course you will want, which I doubt, you can reread
> > previous mails and find that it was pointed to that race and
> > possibilities to solve it way too long ago.
>
> What still bothers me about your response is that, while you know the
> race exists and do not disagree with my example, you don't seem to see
> that that race can eventually lock up the block device by repeatedly
> losing throttle counts which are never recovered. What prevents that?

I posted a trivial hack, pointed out its possible errors myself, and
asked whether it should be extended (with the race fixed by any of the
possible methods, and so on) or whether a new one should be developed
(like in your approach, where only the high-level device is charged).
Instead I got replies that it contains bugs which will stop the system
and kill the gene pool of mankind.

I know how it works and where the problems are. And if we are going
with this approach, I will fix the issues that were pointed out.

> > > --- 2.6.22.clean/block/ll_rw_blk.c	2007-07-08 16:32:17.000000000 -0700
> > > +++ 2.6.22/block/ll_rw_blk.c	2007-08-24 12:07:16.000000000 -0700
> > > @@ -3237,6 +3237,15 @@ end_io:
> > >   */
> > >  void generic_make_request(struct bio *bio)
> > >  {
> > > +	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> > > +
> > > +	if (q && q->metric) {
> > > +		int need = bio->bi_reserved = q->metric(bio);
> > > +		bio->queue = q;
> >
> > In case you have a stacked device, this entry will be rewritten and
> > you will lose all your accounting data.
>
> It is a weakness all right. Well,
>
> -	if (q && q->metric) {
> +	if (q && q->metric && !bio->queue) {
>
> which fixes that problem. Maybe there is a better fix possible. Thanks
> for the catch!

Yes, it should.
> The original conception was that this block throttling would apply only
> to the highest level submission of the bio, the one that crosses the
> boundary between filesystem (or direct block device application) and
> block layer. Resubmitting a bio or submitting a dependent bio from
> inside a block driver does not need to be throttled because all
> resources required to guarantee completion must have been obtained
> _before_ the bio was allowed to proceed into the block layer.

We still have not come to a conclusion, and I do not want to start a
flamewar: you believe that throttling must be done on the top-level
device, so you need to extend bio and convince others that the idea is
worth it.

> The other principle we are trying to satisfy is that the throttling
> should not be released until bio->endio, which I am not completely sure
> about with the patch as modified above. Your earlier idea of having
> the throttle protection only cover the actual bio submission is
> interesting and may be effective in some cases, in fact it may cover
> the specific case of ddsnap. But we don't have to look any further
> than ddraid (distributed raid) to find a case it doesn't cover - the
> additional memory allocated to hold parity data has to be reserved
> until parity data is deallocated, long after the submission completes.
> So while you manage to avoid some logistical difficulties, it also looks
> like you didn't solve the general problem.

The block layer does not know, and should not be bothered with, the
nature of the underlying device - if you think that the limit should
not be recharged in the endio callback, then provide your own layer on
top of bio and call the endio callback only when you think the bio is
ready to be completed.

> Hopefully I will be able to report on whether my patch actually works
> soon, when I get back from vacation. The mechanism in ddsnap this is
> supposed to replace is effective, it is just ugly and tricky to verify.
>
> Regards,
>
> Daniel

-- 
	Evgeniy Polyakov
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Wednesday 29 August 2007 01:53, Evgeniy Polyakov wrote:
> Then, if of course you will want, which I doubt, you can reread
> previous mails and find that it was pointed to that race and
> possibilities to solve it way too long ago.

What still bothers me about your response is that, while you know the
race exists and do not disagree with my example, you don't seem to see
that that race can eventually lock up the block device by repeatedly
losing throttle counts which are never recovered. What prevents that?

> > --- 2.6.22.clean/block/ll_rw_blk.c	2007-07-08 16:32:17.000000000 -0700
> > +++ 2.6.22/block/ll_rw_blk.c	2007-08-24 12:07:16.000000000 -0700
> > @@ -3237,6 +3237,15 @@ end_io:
> >   */
> >  void generic_make_request(struct bio *bio)
> >  {
> > +	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> > +
> > +	if (q && q->metric) {
> > +		int need = bio->bi_reserved = q->metric(bio);
> > +		bio->queue = q;
>
> In case you have a stacked device, this entry will be rewritten and
> you will lose all your accounting data.

It is a weakness all right. Well,

-	if (q && q->metric) {
+	if (q && q->metric && !bio->queue) {

which fixes that problem. Maybe there is a better fix possible. Thanks
for the catch!

The original conception was that this block throttling would apply only
to the highest level submission of the bio, the one that crosses the
boundary between filesystem (or direct block device application) and
block layer. Resubmitting a bio or submitting a dependent bio from
inside a block driver does not need to be throttled because all
resources required to guarantee completion must have been obtained
_before_ the bio was allowed to proceed into the block layer.

The other principle we are trying to satisfy is that the throttling
should not be released until bio->endio, which I am not completely sure
about with the patch as modified above.

Your earlier idea of having the throttle protection only cover the
actual bio submission is interesting and may be effective in some
cases, in fact it may cover the specific case of ddsnap. But we don't
have to look any further than ddraid (distributed raid) to find a case
it doesn't cover - the additional memory allocated to hold parity data
has to be reserved until parity data is deallocated, long after the
submission completes. So while you manage to avoid some logistical
difficulties, it also looks like you didn't solve the general problem.

Hopefully I will be able to report on whether my patch actually works
soon, when I get back from vacation. The mechanism in ddsnap this is
supposed to replace is effective, it is just ugly and tricky to verify.

Regards,

Daniel
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tue, Aug 28, 2007 at 02:08:04PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> On Tuesday 28 August 2007 10:54, Evgeniy Polyakov wrote:
> > On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > > > We do not care about one cpu being able to increase its counter
> > > > higher than the limit, such inaccuracy (maximum bios in flight
> > > > thus can be more than limit, difference is equal to the number of
> > > > CPUs - 1) is a price for removing atomic operation. I thought I
> > > > pointed it in the original description, but might forget, that if
> > > > it will be an issue, that atomic operations can be introduced
> > > > there. Any uber-precise measurements in the case when we are
> > > > close to the edge will not give us any benefit at all, since we
> > > > are already in the grey area.
> > >
> > > This is not just inaccurate, it is suicide. Keep leaking throttle
> > > counts and eventually all of them will be gone. No more IO
> > > on that block device!
> >
> > First, because the number of increment and decrement operations is
> > the same, so it will dance around the limit in both directions.
>
> No. Please go and read the description of the race again. A count
> gets irretrievably lost because the write operation of the first
> decrement is overwritten by the second. Data gets lost. Atomic
> operations exist to prevent that sort of thing. You either need to use
> them or have a deep understanding of SMP read and write ordering in
> order to preserve data integrity by some equivalent algorithm.

I think you should complete your emotional email with a description of
how atomic types are operated on and how processors access data - just
to give a lesson to those who never knew how SMP works, yet create
patches and have the conscience to send them and even discuss them.

Then, if of course you will want, which I doubt, you can reread
previous mails and find that it was pointed to that race and
possibilities to solve it way too long ago.

Anyway, I prefer to look like I do not know how SMP and atomic
operations work and thus stay away from this discussion.

> --- 2.6.22.clean/block/ll_rw_blk.c	2007-07-08 16:32:17.000000000 -0700
> +++ 2.6.22/block/ll_rw_blk.c	2007-08-24 12:07:16.000000000 -0700
> @@ -3237,6 +3237,15 @@ end_io:
>   */
>  void generic_make_request(struct bio *bio)
>  {
> +	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> +
> +	if (q && q->metric) {
> +		int need = bio->bi_reserved = q->metric(bio);
> +		bio->queue = q;

In case you have a stacked device, this entry will be rewritten and you
will lose all your accounting data.

> +		wait_event_interruptible(q->throttle_wait,
> +				atomic_read(&q->available) >= need);
> +		atomic_sub(need, &q->available);
> +	}

-- 
	Evgeniy Polyakov
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tuesday 28 August 2007 10:54, Evgeniy Polyakov wrote:
> On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > > We do not care about one cpu being able to increase its counter
> > > higher than the limit, such inaccuracy (maximum bios in flight
> > > thus can be more than limit, difference is equal to the number of
> > > CPUs - 1) is a price for removing atomic operation. I thought I
> > > pointed it in the original description, but might forget, that if
> > > it will be an issue, that atomic operations can be introduced
> > > there. Any uber-precise measurements in the case when we are
> > > close to the edge will not give us any benefit at all, since we
> > > are already in the grey area.
> >
> > This is not just inaccurate, it is suicide. Keep leaking throttle
> > counts and eventually all of them will be gone. No more IO
> > on that block device!
>
> First, because the number of increment and decrement operations is
> the same, so it will dance around the limit in both directions.

No. Please go and read the description of the race again. A count
gets irretrievably lost because the write operation of the first
decrement is overwritten by the second. Data gets lost. Atomic
operations exist to prevent that sort of thing. You either need to use
them or have a deep understanding of SMP read and write ordering in
order to preserve data integrity by some equivalent algorithm.

> Let's solve problems in order of their appearance. If the bio
> structure will be allowed to grow, then the whole patches can be done
> better.

How about the patch below. It throttles any block driver by
implementing a throttle metric method so that each block driver can
keep track of its own resource consumption in units of its choosing.
As an (important) example, it implements a simple metric for device
mapper devices. Other block devices will work as before, because they
do not define any metric.

Short, sweet and untested, which is why I have not posted it until now.
This patch originally kept its accounting info in backing_dev_info,
however that structure seems to be in some flux and it is just a part
of struct queue anyway, so I lifted the throttle accounting up into
struct queue. We should be able to report on the efficacy of this
patch in terms of deadlock prevention pretty soon.

--- 2.6.22.clean/block/ll_rw_blk.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/block/ll_rw_blk.c	2007-08-24 12:07:16.000000000 -0700
@@ -3237,6 +3237,15 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+	if (q && q->metric) {
+		int need = bio->bi_reserved = q->metric(bio);
+		bio->queue = q;
+		wait_event_interruptible(q->throttle_wait,
+				atomic_read(&q->available) >= need);
+		atomic_sub(need, &q->available);
+	}
+
 	if (current->bio_tail) {
 		/* make_request is active */
 		*(current->bio_tail) = bio;
--- 2.6.22.clean/drivers/md/dm.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/drivers/md/dm.c	2007-08-24 12:14:23.000000000 -0700
@@ -880,6 +880,11 @@ static int dm_any_congested(void *conges
 	return r;
 }
 
+static unsigned dm_metric(struct bio *bio)
+{
+	return bio->bi_vcnt;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *---------------------------------------------------------------*/
@@ -997,6 +1002,10 @@ static struct mapped_device *alloc_dev(i
 		goto bad1_free_minor;
 
 	md->queue->queuedata = md;
+	md->queue->metric = dm_metric;
+	atomic_set(&md->queue->available, md->queue->capacity = 1000);
+	init_waitqueue_head(&md->queue->throttle_wait);
+
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
 	blk_queue_make_request(md->queue, dm_request);
--- 2.6.22.clean/fs/bio.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/fs/bio.c	2007-08-24 12:10:41.000000000 -0700
@@ -1025,7 +1025,12 @@ void bio_endio(struct bio *bio, unsigned
 		bytes_done = bio->bi_size;
 	}
 
-	bio->bi_size -= bytes_done;
+	if (!(bio->bi_size -= bytes_done) && bio->bi_reserved) {
+		struct request_queue *q = bio->queue;
+		atomic_add(bio->bi_reserved, &q->available);
+		bio->bi_reserved = 0; /* just in case */
+		wake_up(&q->throttle_wait);
+	}
 	bio->bi_sector += (bytes_done >> 9);
 
 	if (bio->bi_end_io)
--- 2.6.22.clean/include/linux/bio.h	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/include/linux/bio.h	2007-08-24 11:53:51.000000000 -0700
@@ -109,6 +109,9 @@ struct bio {
 	bio_end_io_t		*bi_end_io;
 	atomic_t		bi_cnt;		/* pin count */
+	struct request_queue
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > We do not care about one cpu being able to increase its counter
> > higher than the limit, such inaccuracy (maximum bios in flight thus
> > can be more than limit, difference is equal to the number of CPUs -
> > 1) is a price for removing atomic operation. I thought I pointed it
> > in the original description, but might forget, that if it will be an
> > issue, that atomic operations can be introduced there. Any
> > uber-precise measurements in the case when we are close to the edge
> > will not give us any benefit at all, since we are already in the
> > grey area.
>
> This is not just inaccurate, it is suicide. Keep leaking throttle
> counts and eventually all of them will be gone. No more IO
> on that block device!

First, the number of increment and decrement operations is the same,
so it will dance around the limit in both directions.

Second, I wrote about this race, and there are a number of ways to
deal with it, from atomic operations to separate counters for
in-flight and completed bios (which can be racy too, but from a
different angle).

Third, if people cannot agree even on the much higher-level question
of whether the bio structure should be allowed to grow, how can we
discuss details of a preliminary implementation with known issues?

So I cannot agree that the issue is fatal, but of course it exists,
and it was highlighted. Let's solve problems in order of their
appearance. If the bio structure is allowed to grow, then the whole
patch set can be done better; if not, then there are performance
issues (although the more I think about it, the more I become sure
that since a bio itself is very rarely shared, and thus requires
cloning and allocation/freeing, which is itself a much more costly
operation than atomic_sub/dec, it can safely host an additional
atomic operation).

-- 
	Evgeniy Polyakov
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tuesday 28 August 2007 02:35, Evgeniy Polyakov wrote:
> On Mon, Aug 27, 2007 at 02:57:37PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > Say Evgeniy, something I was curious about but forgot to ask you
> > earlier...
> >
> > On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> > > ...All operations are not atomic, since we do not care about
> > > precise number of bios, but a fact, that we are close or close
> > > enough to the limit.
> > > ... in bio->endio
> > > +	q->bio_queued--;
> >
> > In your proposed patch, what prevents the race:
> >
> > cpu1				cpu2
> >
> > read q->bio_queued
> > 				read q->bio_queued
> > 				q->bio_queued--
> > 				write q->bio_queued - 1
> > q->bio_queued--
> > write q->bio_queued - 1
> >
> > Whoops! We leaked a throttle count.
>
> We do not care about one cpu being able to increase its counter
> higher than the limit, such inaccuracy (maximum bios in flight thus
> can be more than limit, difference is equal to the number of CPUs -
> 1) is a price for removing atomic operation. I thought I pointed it
> in the original description, but might forget, that if it will be an
> issue, that atomic operations can be introduced there. Any
> uber-precise measurements in the case when we are close to the edge
> will not give us any benefit at all, since we are already in the
> grey area.

This is not just inaccurate, it is suicide. Keep leaking throttle
counts and eventually all of them will be gone. No more IO on that
block device!

> Another possibility is to create a queue/device pointer in the bio
> structure to hold the original device and then in its backing dev
> structure add a callback to recalculate the limit, but it increases
> the size of the bio. Do we need this?

Different issue. Yes, I think we need a nice simple approach like
that, and to prove it is stable before worrying about the size cost.

Regards,

Daniel
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Mon, Aug 27, 2007 at 02:57:37PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> Say Evgeniy, something I was curious about but forgot to ask you
> earlier...
>
> On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> > ...All operations are not atomic, since we do not care about precise
> > number of bios, but a fact, that we are close or close enough to the
> > limit.
> > ... in bio->endio
> > +	q->bio_queued--;
>
> In your proposed patch, what prevents the race:
>
> cpu1				cpu2
>
> read q->bio_queued
> 				read q->bio_queued
> 				q->bio_queued--
> 				write q->bio_queued - 1
> q->bio_queued--
> write q->bio_queued - 1
>
> Whoops! We leaked a throttle count.

We do not care about one cpu being able to increase its counter higher
than the limit, such inaccuracy (maximum bios in flight thus can be
more than limit, difference is equal to the number of CPUs - 1) is a
price for removing atomic operation. I thought I pointed it in the
original description, but might forget, that if it will be an issue,
that atomic operations can be introduced there. Any uber-precise
measurements in the case when we are close to the edge will not give
us any benefit at all, since we are already in the grey area.

Another possibility is to create a queue/device pointer in the bio
structure to hold the original device and then in its backing dev
structure add a callback to recalculate the limit, but it increases
the size of the bio. Do we need this?

> Regards,
>
> Daniel

-- 
	Evgeniy Polyakov
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > We do not care about one cpu being able to increase its counter
> > higher than the limit; such inaccuracy (maximum bios in flight can
> > thus exceed the limit, the difference being equal to the number of
> > CPUs - 1) is the price for removing the atomic operation. I thought
> > I pointed this out in the original description, but might have
> > forgotten; if it becomes an issue, atomic operations can be
> > introduced there. Any uber-precise measurements in the case when we
> > are close to the edge will not give us any benefit at all, since we
> > are already in the grey area.
>
> This is not just inaccurate, it is suicide. Keep leaking throttle
> counts and eventually all of them will be gone. No more IO on that
> block device!

First, the number of increment and decrement operations is the same, so
it will dance around the limit in both directions.

Second, I wrote about this race, and there are a number of ways to deal
with it, from atomic operations to separate counters for in-flight and
completed bios (which can be racy too, but from a different angle).

Third, if people cannot agree even on a much higher-level detail
(whether the bio structure should be increased or not), how can we
discuss details of a preliminary implementation with known issues?

So I cannot agree that the issue is fatal, but of course it exists, and
was highlighted. Let's solve problems in the order of their appearance.
If the bio structure is allowed to grow, then the whole patch set can
be done better; if not, then there are issues with performance
(although the more I think about it, the more I become sure that since
a bio itself is very rarely shared, and thus requires cloning and
allocation/freeing, which is itself a much more costly operation than
atomic_sub/dec, it can safely host an additional atomic operation).
-- 
Evgeniy Polyakov
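Evgeniy's second alternative, separate counters for in-flight and completed bios, can be sketched like this (hypothetical names, C11 atomics standing in for kernel primitives). Because each counter is only ever incremented by one side, no release can be lost; the in-flight estimate can read transiently high, which is the "racy from a different angle" caveat.

```c
#include <stdatomic.h>

/* Two monotonic counters instead of one shared up/down counter. */
struct split_throttle {
    atomic_ulong submitted;   /* only ever incremented, at submit time */
    atomic_ulong completed;   /* only ever incremented, at completion  */
};

void throttle_submit(struct split_throttle *t)   { atomic_fetch_add(&t->submitted, 1); }
void throttle_complete(struct split_throttle *t) { atomic_fetch_add(&t->completed, 1); }

unsigned long in_flight(struct split_throttle *t)
{
    /* completed is read first, so a racing completion can only make the
     * estimate too high, never negative. */
    unsigned long done = atomic_load(&t->completed);
    unsigned long sub  = atomic_load(&t->submitted);
    return sub - done;
}

/* Three submissions, one completion: two bios remain in flight. */
unsigned long demo_in_flight(void)
{
    struct split_throttle t = { 0 };
    throttle_submit(&t); throttle_submit(&t); throttle_submit(&t);
    throttle_complete(&t);
    return in_flight(&t);
}
```

An overestimate only throttles a little early, which is safe; the fatal failure mode of the single racy counter (a permanently lost release) cannot occur.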
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tuesday 28 August 2007 10:54, Evgeniy Polyakov wrote:
> On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > > We do not care about one cpu being able to increase its counter
> > > higher than the limit; such inaccuracy (maximum bios in flight
> > > can thus exceed the limit, the difference being equal to the
> > > number of CPUs - 1) is the price for removing the atomic
> > > operation. I thought I pointed this out in the original
> > > description, but might have forgotten; if it becomes an issue,
> > > atomic operations can be introduced there. Any uber-precise
> > > measurements in the case when we are close to the edge will not
> > > give us any benefit at all, since we are already in the grey
> > > area.
> >
> > This is not just inaccurate, it is suicide. Keep leaking throttle
> > counts and eventually all of them will be gone. No more IO on that
> > block device!
>
> First, the number of increment and decrement operations is the same,
> so it will dance around the limit in both directions.

No. Please go and read the description of the race again. A count gets
irretrievably lost because the write operation of the first decrement
is overwritten by the second. Data gets lost. Atomic operations exist
to prevent that sort of thing. You either need to use them or have a
deep understanding of SMP read and write ordering in order to preserve
data integrity by some equivalent algorithm.

> Let's solve problems in the order of their appearance. If the bio
> structure is allowed to grow, then the whole patch set can be done
> better.

How about the patch below. This throttles any block driver by
implementing a throttle metric method so that each block driver can
keep track of its own resource consumption in units of its choosing.
As an (important) example, it implements a simple metric for device
mapper devices. Other block devices will work as before, because they
do not define any metric. Short, sweet and untested, which is why I
have not posted it until now.
This patch originally kept its accounting info in backing_dev_info;
however, that structure seems to be in some flux, and it is just a part
of struct queue anyway, so I lifted the throttle accounting up into
struct queue. We should be able to report on the efficacy of this patch
in terms of deadlock prevention pretty soon.

--- 2.6.22.clean/block/ll_rw_blk.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/block/ll_rw_blk.c	2007-08-24 12:07:16.000000000 -0700
@@ -3237,6 +3237,15 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+	if (q && q->metric) {
+		int need = bio->bi_reserved = q->metric(bio);
+		bio->queue = q;
+		wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
+		atomic_sub(need, &q->available);
+	}
+
 	if (current->bio_tail) {
 		/* make_request is active */
 		*(current->bio_tail) = bio;
--- 2.6.22.clean/drivers/md/dm.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/drivers/md/dm.c	2007-08-24 12:14:23.000000000 -0700
@@ -880,6 +880,11 @@ static int dm_any_congested(void *conges
 	return r;
 }
 
+static unsigned dm_metric(struct bio *bio)
+{
+	return bio->bi_vcnt;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *---------------------------------------------------------------*/
@@ -997,6 +1002,10 @@ static struct mapped_device *alloc_dev(i
 		goto bad1_free_minor;
 
 	md->queue->queuedata = md;
+	md->queue->metric = dm_metric;
+	atomic_set(&md->queue->available, md->queue->capacity = 1000);
+	init_waitqueue_head(&md->queue->throttle_wait);
+
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
 	blk_queue_make_request(md->queue, dm_request);
--- 2.6.22.clean/fs/bio.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/fs/bio.c	2007-08-24 12:10:41.000000000 -0700
@@ -1025,7 +1025,12 @@ void bio_endio(struct bio *bio, unsigned
 		bytes_done = bio->bi_size;
 	}
 
-	bio->bi_size -= bytes_done;
+	if (!(bio->bi_size -= bytes_done) && bio->bi_reserved) {
+		struct request_queue *q = bio->queue;
+		atomic_add(bio->bi_reserved, &q->available);
+		bio->bi_reserved = 0; /* just in case */
+		wake_up(&q->throttle_wait);
+	}
 	bio->bi_sector += (bytes_done >> 9);
 
 	if (bio->bi_end_io)
--- 2.6.22.clean/include/linux/bio.h	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/include/linux/bio.h	2007-08-24 11:53:51.000000000 -0700
@@ -109,6 +109,9 @@ struct bio {
 	bio_end_io_t		*bi_end_io;
 	atomic_t		bi_cnt;		/* pin count */
 
+	struct request_queue	*queue;		/* for throttling */
+	unsigned
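The reserve/release accounting in Daniel's patch can be modelled in user space as follows. The names mirror the patch (metric, available, bi_reserved), but the sleep on throttle_wait is replaced by a failure return, so this is only a sketch of the semantics, not kernel code.

```c
#include <stdatomic.h>

struct toy_bio;

/* Each queue owns a pool of abstract units; metric() prices a bio. */
struct toy_queue {
    atomic_int available;                  /* units left in the pool  */
    int capacity;
    unsigned (*metric)(struct toy_bio *);  /* price of one bio        */
};

struct toy_bio {
    struct toy_queue *queue;
    unsigned bi_vcnt;                      /* pages carried by the bio */
    unsigned bi_reserved;                  /* units charged at submit  */
};

/* The device-mapper metric from the patch: one unit per bio_vec. */
unsigned vcnt_metric(struct toy_bio *bio) { return bio->bi_vcnt; }

/* Mirrors generic_make_request(): charge before the bio may proceed.
 * Returns -1 where the kernel would sleep on throttle_wait instead. */
int toy_submit(struct toy_queue *q, struct toy_bio *bio)
{
    int need = bio->bi_reserved = q->metric(bio);
    bio->queue = q;
    if (atomic_load(&q->available) < need)
        return -1;
    atomic_fetch_sub(&q->available, need);
    return 0;
}

/* Mirrors bio_endio(): return the units (and wake waiters, in kernel). */
void toy_endio(struct toy_bio *bio)
{
    atomic_fetch_add(&bio->queue->available, bio->bi_reserved);
    bio->bi_reserved = 0;                  /* "just in case", as in the patch */
}

/* A 4-unit pool: a 3-page bio fits, a second one must wait until the
 * first completes and its reservation is returned. */
int demo_throttle(void)
{
    struct toy_queue q = { .capacity = 4, .metric = vcnt_metric };
    atomic_store(&q.available, q.capacity);
    struct toy_bio a = { .bi_vcnt = 3 }, b = { .bi_vcnt = 3 };
    toy_submit(&q, &a);
    int blocked = toy_submit(&q, &b);      /* only 1 unit left: blocks  */
    toy_endio(&a);                         /* 3 units come back         */
    int ok = toy_submit(&q, &b);           /* now admitted              */
    return blocked == -1 && ok == 0 && atomic_load(&q.available) == 1;
}
```

Because the release is keyed off bi_reserved rather than a shared up/down count, a completed bio always gives back exactly what it took, which is what makes the accounting leak-free.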
Re: [1/1] Block device throttling [Re: Distributed storage.]
Say Evgeniy, something I was curious about but forgot to ask you
earlier...

On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> ...All operations are not atomic, since we do not care about the
> precise number of bios, but the fact that we are close or close
> enough to the limit.
> ... in bio->endio
> +	q->bio_queued--;

In your proposed patch, what prevents the race:

cpu1				cpu2

read q->bio_queued
				q->bio_queued--
write q->bio_queued - 1

Whoops! We leaked a throttle count.

Regards,

Daniel
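The interleaving in the question above can be replayed step by step in plain C (illustrative, not kernel code): with a non-atomic counter the decrement is a separate read, modify, and write, so one of the two decrements is silently overwritten, while an atomic read-modify-write cannot interleave destructively.

```c
#include <stdatomic.h>

/* Two CPUs each decrement bio_queued once; the steps interleave as in
 * the race diagram above, and cpu2's decrement is lost. */
int racy_double_decrement(int bio_queued)
{
    int cpu1_tmp = bio_queued;       /* cpu1: read q->bio_queued        */
    bio_queued = bio_queued - 1;     /* cpu2: q->bio_queued-- completes */
    bio_queued = cpu1_tmp - 1;       /* cpu1: write q->bio_queued - 1   */
    return bio_queued;               /* one decrement has leaked        */
}

/* With an atomic counter the read-modify-write is indivisible, so both
 * decrements always land, whatever the interleaving. */
int atomic_double_decrement(int start)
{
    atomic_int bio_queued = start;
    atomic_fetch_sub(&bio_queued, 1);   /* cpu1 */
    atomic_fetch_sub(&bio_queued, 1);   /* cpu2 */
    return atomic_load(&bio_queued);
}
```

Starting from 2 in-flight bios, the racy version ends at 1 instead of 0: the leaked throttle count Daniel is pointing at.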
Re: [1/1] Block device throttling [Re: Distributed storage.]
Hi Daniel.

On Sun, Aug 12, 2007 at 04:16:10PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> Your patch is close to the truth, but it needs to throttle at the top
> (virtual) end of each block device stack instead of the bottom
> (physical) end. It does head in the direction of eliminating your own
> deadlock risk indeed, however there are block devices it does not
> cover.

I decided to limit physical devices just because any limit on top of a
virtual one is not correct. When the system recharges a bio from a
virtual device to a physical one, and the latter is full, the virtual
device will not accept any new blocks for that physical device, but can
still accept them for other ones. That was done specially to allow fair
use of network and physical storage.

-- 
Evgeniy Polyakov
Re: [1/1] Block device throttling [Re: Distributed storage.]
Hi Evgeniy,

Sorry for not getting back to you right away, I was on the road with
limited email access. Incidentally, the reason my mails to you keep
bouncing is, your MTA is picky about my mailer's IP reversing to a real
hostname. I will take care of that pretty soon, but for now my direct
mail to you is going to bounce and you will only see the lkml copy.

On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> This throttling mechanism allows limiting the maximum number of
> queued bios per physical device. By default it is turned off and the
> old block layer behaviour with an unlimited number of bios is used.
> When turned on (the queue limit is set to something other than -1U
> via blk_queue_set_limit()), generic_make_request() will sleep until
> there is room in the queue. The number of bios is increased in
> generic_make_request() and reduced either in bio_endio(), when the
> bio is completely processed (bi_size is zero), or recharged from the
> original queue when a new device is assigned to the bio via
> blk_set_bdev(). The operations are not atomic, since we do not care
> about the precise number of bios, but the fact that we are close or
> close enough to the limit.
>
> Tested on distributed storage device - with limit of 2 bios it works
> slow :)

it seems to me you need:

-	if (q) {
+	if (q && q->bio_limit != -1) {

This patch is short and simple, and will throttle more accurately than
the current simplistic per-request allocation limit. However, it fails
to throttle device mapper devices. This is because no request is
allocated by the device mapper queue method; instead, the mapping call
goes straight through to the mapping function. If the mapping function
allocates memory (typically the case) then this resource usage evades
throttling and deadlock becomes a risk.
There are three obvious fixes:

  1) Implement bio throttling in each virtual block device
  2) Implement bio throttling generically in device mapper
  3) Implement bio throttling for all block devices

Number 1 is the approach we currently use in ddsnap, but it is ugly and
repetitious. Number 2 is a possibility, but I favor number 3 because it
is a system-wide solution to a system-wide problem, does not need to be
repeated for every block device that lacks a queue, heads in the
direction of code subtraction, and allows system-wide reserve
accounting.

Your patch is close to the truth, but it needs to throttle at the top
(virtual) end of each block device stack instead of the bottom
(physical) end. It does head in the direction of eliminating your own
deadlock risk indeed, however there are block devices it does not
cover.

Regards,

Daniel
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Wed, Aug 08, 2007 at 02:17:09PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> This throttling mechanism allows limiting the maximum number of
> queued bios per physical device. By default it is turned off and the
> old block layer behaviour with an unlimited number of bios is used.
> When turned on (the queue limit is set to something other than -1U
> via blk_queue_set_limit()), generic_make_request() will sleep until
> there is room in the queue. The number of bios is increased in
> generic_make_request() and reduced either in bio_endio(), when the
> bio is completely processed (bi_size is zero), or recharged from the
> original queue when a new device is assigned to the bio via
> blk_set_bdev(). The operations are not atomic, since we do not care
> about the precise number of bios, but the fact that we are close or
> close enough to the limit.
>
> Tested on distributed storage device - with limit of 2 bios it works
> slow :)

As an addon I can cook up a patch to configure this via sysfs if
needed. Thoughts?

-- 
Evgeniy Polyakov
[1/1] Block device throttling [Re: Distributed storage.]
This throttling mechanism allows limiting the maximum number of queued
bios per physical device. By default it is turned off and the old block
layer behaviour with an unlimited number of bios is used. When turned
on (the queue limit is set to something other than -1U via
blk_queue_set_limit()), generic_make_request() will sleep until there
is room in the queue. The number of bios is increased in
generic_make_request() and reduced either in bio_endio(), when the bio
is completely processed (bi_size is zero), or recharged from the
original queue when a new device is assigned to the bio via
blk_set_bdev(). The operations are not atomic, since we do not care
about the precise number of bios, but the fact that we are close or
close enough to the limit.

Tested on distributed storage device - with limit of 2 bios it works
slow :)

Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]>

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index c99b463..1882c9b 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1851,6 +1851,10 @@ request_queue_t *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
 	q->backing_dev_info.unplug_io_data = q;
 
+	q->bio_limit = -1U;
+	q->bio_queued = 0;
+	init_waitqueue_head(&q->wait);
+
 	mutex_init(&q->sysfs_lock);
 
 	return q;
@@ -3237,6 +3241,16 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
+	request_queue_t *q;
+
+	BUG_ON(!bio->bi_bdev);
+
+	q = bdev_get_queue(bio->bi_bdev);
+	if (q && q->bio_limit != -1U) {
+		wait_event_interruptible(q->wait, q->bio_queued + 1 <= q->bio_limit);
+		q->bio_queued++;
+	}
+
 	if (current->bio_tail) {
 		/* make_request is active */
 		*(current->bio_tail) = bio;
diff --git a/fs/bio.c b/fs/bio.c
index 093345f..0a33958 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1028,6 +1028,16 @@ void bio_endio(struct bio *bio, unsigned int bytes_done, int error)
 	bio->bi_size -= bytes_done;
 	bio->bi_sector += (bytes_done >> 9);
 
+	if (!bio->bi_size && bio->bi_bdev) {
+		request_queue_t *q;
+
+		q = bdev_get_queue(bio->bi_bdev);
+		if (q) {
+			q->bio_queued--;
+			wake_up(&q->wait);
+		}
+	}
+
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio, bytes_done, error);
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index db5b00a..7ce0cd7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -467,6 +467,9 @@ struct request_queue
 	struct request		*orig_bar_rq;
 	unsigned int		bi_size;
 
+	wait_queue_head_t	wait;
+	unsigned int		bio_limit, bio_queued;
+
 	struct mutex		sysfs_lock;
 };
 
@@ -764,6 +767,30 @@ extern long nr_blockdev_pages(void);
 int blk_get_queue(request_queue_t *);
 request_queue_t *blk_alloc_queue(gfp_t);
 request_queue_t *blk_alloc_queue_node(gfp_t, int);
+
+static inline void blk_queue_set_limit(request_queue_t *q, unsigned int limit)
+{
+	q->bio_limit = limit;
+}
+
+static inline void blk_set_bdev(struct bio *bio, struct block_device *bdev)
+{
+	request_queue_t *q;
+
+	if (!bio->bi_bdev) {
+		bio->bi_bdev = bdev;
+		return;
+	}
+
+	q = bdev_get_queue(bio->bi_bdev);
+	if (q) {
+		q->bio_queued--;
+		wake_up(&q->wait);
+	}
+
+	bio->bi_bdev = bdev;
+}
+
 extern void blk_put_queue(request_queue_t *);
 
 /*

-- 
Evgeniy Polyakov