Re: [RFC] [PATCH] A clean approach to writeout throttling
On Monday 10 December 2007 13:31, Jonathan Corbet wrote:
> Hey, Daniel,
>
> I'm just getting around to looking at this.  One thing jumped out at me:
>
> > +	if (bio->bi_throttle) {
> > +		struct request_queue *q = bio->bi_queue;
> > +		bio->bi_throttle = 0; /* or detect multiple endio and err? */
> > +		atomic_add(bio->bi_throttle, &q->available);
> > +		wake_up(&q->throttle_wait);
> > +	}
>
> I'm feeling like I must be really dumb, but...how can that possibly
> work?  You're zeroing ->bi_throttle before adding it back into
> q->available, so the latter will never increase...

Hi Jon,

Don't you know?  These days we optimize all our code for modern
processors with tunnelling instructions and metaphysical cache.  On such
processors, setting a register to zero does not entirely destroy all the
data that used to be in the register, so subsequent instructions can
make further use of the overwritten data by reconstructing it from
remnants of bits left attached to the edges of the register.

Um, yeah, that's it.  Actually, I fat-fingered it in the merge to -mm.
Thanks for the catch, corrected patch attached.

The offending line isn't even a functional part of the algorithm, it is
just supposed to defend against the possibility that, somehow,
->bi_endio gets called multiple times.  Probably it should really be
something like:

	BUG_ON(bio->bi_throttle == -1);
	if (bio->bi_throttle) {
		...
		bio->bi_throttle = -1;

Or perhaps we should just rely on nobody ever making that mistake and
let somebody else catch it if it does.

Regards,

Daniel

--- 2.6.24-rc3-mm.clean/block/ll_rw_blk.c	2007-12-04 14:45:25.000000000 -0800
+++ 2.6.24-rc3-mm/block/ll_rw_blk.c	2007-12-10 04:49:56.000000000 -0800
@@ -3210,9 +3210,9 @@ static inline int bio_check_eod(struct b
  */
 static inline void __generic_make_request(struct bio *bio)
 {
-	struct request_queue *q;
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 	sector_t old_sector;
-	int ret, nr_sectors = bio_sectors(bio);
+	int nr_sectors = bio_sectors(bio);
 	dev_t old_dev;
 	int err = -EIO;
 
@@ -3221,6 +3221,13 @@ static inline void __generic_make_reques
 	if (bio_check_eod(bio, nr_sectors))
 		goto end_io;
 
+	if (q && q->metric && !bio->bi_queue) {
+		int need = bio->bi_throttle = q->metric(bio);
+		bio->bi_queue = q;
+		/* FIXME: potential race if atomic_sub is called in the middle of condition check */
+		wait_event(q->throttle_wait, atomic_read(&q->available) >= need);
+		atomic_sub(need, &q->available);
+	}
 	/*
 	 * Resolve the mapping until finished. (drivers are
 	 * still free to implement/resolve their own stacking
@@ -3231,10 +3238,9 @@ static inline void __generic_make_reques
 	 */
 	old_sector = -1;
 	old_dev = 0;
-	do {
+	while (1) {
 		char b[BDEVNAME_SIZE];
 
-		q = bdev_get_queue(bio->bi_bdev);
 		if (!q) {
 			printk(KERN_ERR
 			       "generic_make_request: Trying to access "
@@ -3282,8 +3288,10 @@ end_io:
 			goto end_io;
 		}
 
-		ret = q->make_request_fn(q, bio);
-	} while (ret);
+		if (!q->make_request_fn(q, bio))
+			return;
+		q = bdev_get_queue(bio->bi_bdev);
+	}
 }
 
 /*
--- 2.6.24-rc3-mm.clean/drivers/md/dm.c	2007-12-04 14:46:04.000000000 -0800
+++ 2.6.24-rc3-mm/drivers/md/dm.c	2007-12-04 23:31:41.000000000 -0800
@@ -889,6 +889,11 @@ static int dm_any_congested(void *conges
 	return r;
 }
 
+static unsigned dm_metric(struct bio *bio)
+{
+	return bio->bi_vcnt;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *---------------------------------------------------------------*/
@@ -967,6 +972,7 @@ out:
 
 static struct block_device_operations dm_blk_dops;
 
+#define DEFAULT_THROTTLE_CAPACITY 1000
 /*
  * Allocate and initialise a blank device with a given minor.
  */
@@ -1009,6 +1015,11 @@ static struct mapped_device *alloc_dev(i
 		goto bad1_free_minor;
 
 	md->queue->queuedata = md;
+	md->queue->metric = dm_metric;
+	/* A dm device constructor may change the throttle capacity */
+	atomic_set(&md->queue->available, md->queue->capacity = DEFAULT_THROTTLE_CAPACITY);
+	init_waitqueue_head(&md->queue->throttle_wait);
+
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
 	blk_queue_make_request(md->queue, dm_request);
--- 2.6.24-rc3-mm.clean/fs/bio.c	2007-12-04 14:38:47.000000000 -0800
+++ 2.6.24-rc3-mm/fs/bio.c	2007-12-04 23:31:41.000000000 -0800
@@ -1007,6 +1007,13 @@ void bio_endio(struct bio *bio, int erro
 	else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
 		error = -EIO;
 
+	if (bio->bi_throttle) {
+		struct request_queue *q = bio->bi_queue;
+		atomic_add(bio->bi_throttle, &q->available);
+		bio->bi_throttle = 0; /* or detect multiple endio and err? */
+		wake_up(&q->throttle_wait);
+	}
+
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio, error);
 }
--- 2.6.24-rc3-mm.clean/include/linux/bio.h	2007-12-04
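For reference, the defensive variant Daniel sketches above would read
roughly like this in bio_endio() -- a sketch only, not part of the
attached patch, using -1 as an "already completed" sentinel:

	/* catch ->bi_endio being called a second time */
	BUG_ON(bio->bi_throttle == -1);
	if (bio->bi_throttle) {
		struct request_queue *q = bio->bi_queue;

		/* credit the queue before clobbering the metric */
		atomic_add(bio->bi_throttle, &q->available);
		bio->bi_throttle = -1;	/* poison: marks bio completed */
		wake_up(&q->throttle_wait);
	}

Since ->bi_throttle is unsigned, -1 is just an all-ones poison value
that a legitimate metric would never produce.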
Re: [RFC] [PATCH] A clean approach to writeout throttling
Hi,

On Dec 10, 2007 11:31 PM, Jonathan Corbet <[EMAIL PROTECTED]> wrote:
> I'm just getting around to looking at this.  One thing jumped out at me:
>
> > +	if (bio->bi_throttle) {
> > +		struct request_queue *q = bio->bi_queue;
> > +		bio->bi_throttle = 0; /* or detect multiple endio and err? */
> > +		atomic_add(bio->bi_throttle, &q->available);
> > +		wake_up(&q->throttle_wait);
> > +	}
>
> I'm feeling like I must be really dumb, but...how can that possibly
> work?  You're zeroing ->bi_throttle before adding it back into
> q->available, so the latter will never increase...

Heh, well, that's ok as long as bio->bi_vcnt is set to zero and I think
we have some md raid drivers do just that... ;-)

			Pekka
Re: [RFC] [PATCH] A clean approach to writeout throttling
Hey, Daniel,

I'm just getting around to looking at this.  One thing jumped out at me:

> +	if (bio->bi_throttle) {
> +		struct request_queue *q = bio->bi_queue;
> +		bio->bi_throttle = 0; /* or detect multiple endio and err? */
> +		atomic_add(bio->bi_throttle, &q->available);
> +		wake_up(&q->throttle_wait);
> +	}

I'm feeling like I must be really dumb, but...how can that possibly
work?  You're zeroing ->bi_throttle before adding it back into
q->available, so the latter will never increase...

jon
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Monday 10 December 2007 02:47, Jens Axboe wrote:
> ...the warning on request_queue_t...

There you go, Jens, service with a smile.

Regards,

Daniel

--- 2.6.24-rc3-mm.clean/block/ll_rw_blk.c	2007-12-04 14:45:25.000000000 -0800
+++ 2.6.24-rc3-mm/block/ll_rw_blk.c	2007-12-10 03:27:42.000000000 -0800
@@ -3210,7 +3210,7 @@ static inline int bio_check_eod(struct b
  */
 static inline void __generic_make_request(struct bio *bio)
 {
-	struct request_queue *q;
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 	sector_t old_sector;
 	int ret, nr_sectors = bio_sectors(bio);
 	dev_t old_dev;
@@ -3221,6 +3221,13 @@ static inline void __generic_make_reques
 	if (bio_check_eod(bio, nr_sectors))
 		goto end_io;
 
+	if (q && q->metric && !bio->bi_queue) {
+		int need = bio->bi_throttle = q->metric(bio);
+		bio->bi_queue = q;
+		/* FIXME: potential race if atomic_sub is called in the middle of condition check */
+		wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
+		atomic_sub(need, &q->available);
+	}
 	/*
 	 * Resolve the mapping until finished. (drivers are
 	 * still free to implement/resolve their own stacking
@@ -3234,7 +3241,6 @@ static inline void __generic_make_reques
 	do {
 		char b[BDEVNAME_SIZE];
 
-		q = bdev_get_queue(bio->bi_bdev);
 		if (!q) {
 			printk(KERN_ERR
 			       "generic_make_request: Trying to access "
--- 2.6.24-rc3-mm.clean/drivers/md/dm.c	2007-12-04 14:46:04.000000000 -0800
+++ 2.6.24-rc3-mm/drivers/md/dm.c	2007-12-04 23:31:41.000000000 -0800
@@ -889,6 +889,11 @@ static int dm_any_congested(void *conges
 	return r;
 }
 
+static unsigned dm_metric(struct bio *bio)
+{
+	return bio->bi_vcnt;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *---------------------------------------------------------------*/
@@ -967,6 +972,7 @@ out:
 
 static struct block_device_operations dm_blk_dops;
 
+#define DEFAULT_THROTTLE_CAPACITY 1000
 /*
  * Allocate and initialise a blank device with a given minor.
  */
@@ -1009,6 +1015,11 @@ static struct mapped_device *alloc_dev(i
 		goto bad1_free_minor;
 
 	md->queue->queuedata = md;
+	md->queue->metric = dm_metric;
+	/* A dm device constructor may change the throttle capacity */
+	atomic_set(&md->queue->available, md->queue->capacity = DEFAULT_THROTTLE_CAPACITY);
+	init_waitqueue_head(&md->queue->throttle_wait);
+
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
 	blk_queue_make_request(md->queue, dm_request);
--- 2.6.24-rc3-mm.clean/fs/bio.c	2007-12-04 14:38:47.000000000 -0800
+++ 2.6.24-rc3-mm/fs/bio.c	2007-12-04 23:31:41.000000000 -0800
@@ -1007,6 +1007,13 @@ void bio_endio(struct bio *bio, int erro
 	else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
 		error = -EIO;
 
+	if (bio->bi_throttle) {
+		struct request_queue *q = bio->bi_queue;
+		bio->bi_throttle = 0; /* or detect multiple endio and err? */
+		atomic_add(bio->bi_throttle, &q->available);
+		wake_up(&q->throttle_wait);
+	}
+
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio, error);
 }
--- 2.6.24-rc3-mm.clean/include/linux/bio.h	2007-12-04 14:39:31.000000000 -0800
+++ 2.6.24-rc3-mm/include/linux/bio.h	2007-12-04 23:31:41.000000000 -0800
@@ -111,6 +111,9 @@ struct bio {
 	bio_end_io_t		*bi_end_io;
 	atomic_t		bi_cnt;		/* pin count */
 
+	struct request_queue	*bi_queue;	/* for throttling */
+	unsigned		bi_throttle;	/* throttle metric */
+
 	void			*bi_private;
 	bio_destructor_t	*bi_destructor;	/* destructor */
--- 2.6.24-rc3-mm.clean/include/linux/blkdev.h	2007-12-04 14:47:18.000000000 -0800
+++ 2.6.24-rc3-mm/include/linux/blkdev.h	2007-12-04 23:31:41.000000000 -0800
@@ -383,6 +383,10 @@ struct request_queue
 	struct work_struct	unplug_work;
 
 	struct backing_dev_info	backing_dev_info;
+	unsigned (*metric)(struct bio *bio);	/* bio throttle metric */
+	wait_queue_head_t	throttle_wait;
+	atomic_t		available;
+	unsigned		capacity;
 
 	/*
 	 * The queue owner gets to use this for whatever they like.
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Wed, Dec 05 2007, Daniel Phillips wrote:
> --- 2.6.24-rc3-mm.clean/block/ll_rw_blk.c	2007-12-04 14:45:25.000000000 -0800
> +++ 2.6.24-rc3-mm/block/ll_rw_blk.c	2007-12-04 14:01:18.000000000 -0800
> @@ -3210,7 +3210,7 @@ static inline int bio_check_eod(struct b
>   */
>  static inline void __generic_make_request(struct bio *bio)
>  {
> -	struct request_queue *q;
> +	request_queue_t *q = bdev_get_queue(bio->bi_bdev);
>  	sector_t old_sector;
>  	int ret, nr_sectors = bio_sectors(bio);
>  	dev_t old_dev;
> @@ -3221,6 +3221,13 @@ static inline void __generic_make_reques
>  	if (bio_check_eod(bio, nr_sectors))
>  		goto end_io;
>  
> +	if (q && q->metric && !bio->bi_queue) {
> +		int need = bio->bi_throttle = q->metric(bio);
> +		bio->bi_queue = q;
> +		/* FIXME: potential race if atomic_sub is called in the middle of condition check */
> +		wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
> +		atomic_sub(need, &q->available);
> +	}
>  	/*
>  	 * Resolve the mapping until finished. (drivers are
>  	 * still free to implement/resolve their own stacking
> @@ -3234,7 +3241,6 @@ static inline void __generic_make_reques
>  	do {
>  		char b[BDEVNAME_SIZE];
>  
> -		q = bdev_get_queue(bio->bi_bdev);
>  		if (!q) {
>  			printk(KERN_ERR
>  			       "generic_make_request: Trying to access "

Ehm, this patch is so broken it's not even funny - did you even compile?
You would have noticed the warning on request_queue_t, surely.

The big problem is the last hunk here though, how would that work on
stacked devices?  Clue: ->bi_bdev is not const, it can change after a
call to ->make_request_fn().

-- 
Jens Axboe
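The stacking hazard Jens describes comes from hoisting bdev_get_queue()
out of the remap loop while leaving it out of the loop body.  The fix,
as adopted in Daniel's follow-up patch later in this thread, is to look
the queue up again after every ->make_request_fn() call -- roughly this
shape:

	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
	...
	while (1) {
		...
		if (!q->make_request_fn(q, bio))
			return;
		/* the driver may have remapped bio->bi_bdev to a
		 * lower device, so the queue must be refetched */
		q = bdev_get_queue(bio->bi_bdev);
	}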
Re: [RFC] [PATCH] A clean approach to writeout throttling
Hi Andrew,

Unfortunately, I agreed with your suggestion too hastily.  Not only
would it be complex to implement, it does not work.  It took me several
days to put my finger on exactly why.  Here it is in a nutshell:
resources may be consumed _after_ the gatekeeper runs the "go, no go"
throttling decision.

To illustrate, throw 10,000 bios simultaneously at a block stack that is
supposed to allow only about 1,000 in flight at a time.  If the block
stack allocates memory somewhat late in its servicing scheme (for
example, when it sends a network message) then it is possible that no
actual resource consumption will have taken place before all 10,000 bios
are allowed past the gatekeeper, and deadlock is sure to occur sooner or
later.

In general, we must throttle against the maximum requirement of inflight
bios rather than against the measured consumption.  This achieves the
invariant I have touted, namely that memory consumption on the block
writeout path must be bounded.  We could therefore possibly use your
suggestion or something resembling it to implement a debug check that
the programmer did in fact do their bounds arithmetic correctly, but it
is not useful for enforcing the bound itself.

In case that coffin needs more nails in it, consider that we would not
only need to account page allocations, but frees as well.  So what tells
us that a page has returned to the reserve pool?  Oops, tough one.  The
page may have been returned to a slab and thus not actually freed,
though it remains available for satisfying new bio transactions.
Because of such caching, your algorithm would quickly lose track of
available resources and grind to a halt.

Never mind that keeping track of page frees is a nasty problem in
itself.  They can occur in interrupt context, so forget the current->
idea.  Even keeping track of page allocations for bio transactions in
normal context will be a mess, and that is the easy part.  I can just
imagine the code attempting to implement this approach accreting into a
monster that gets confusingly close to working without ever actually
getting there.

We do have a simple, elegant solution posted at the head of this thread,
which is known to work.

Regards,

Daniel
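The code-level difference between the two approaches is small but
decisive.  The posted patch debits a worst-case estimate *before* the
bio enters the stack and credits it back only at completion, so the
bound holds no matter how late the stack actually allocates:

	/* on submission: debit the upper bound, not measured usage */
	int need = bio->bi_throttle = q->metric(bio);
	wait_event(q->throttle_wait, atomic_read(&q->available) >= need);
	atomic_sub(need, &q->available);

	/* on completion (bio_endio): credit it back and wake waiters */
	atomic_add(bio->bi_throttle, &q->available);
	bio->bi_throttle = 0;
	wake_up(&q->throttle_wait);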
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Thursday 06 December 2007 16:29, Andrew Morton wrote:
> On Thu, 6 Dec 2007 16:04:41 -0800
> Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > The runner up key idea is that we will gain a notion of "block
> > device stack" (or block stack for short, so that we may implement
> > block stackers) which for the time being will simply be Device
> > Mapper's notion of device stack, however many warts that may have.
> > It's there now and we use it for ddsnap.
>
> Perhaps all we need to track is the outermost point?
>
> submit_bio(...)
> {
> 	bool remove_the_rq = false;
>
> 	...
> 	if (current->the_rq == NULL) {
> 		current->the_rq = rq;
> 		remove_the_rq = true;
> 	}
> 	...
> 	if (remove_the_rq)
> 		current->the_rq = NULL;
> }
>
> ?

The parent patch already has that crucial property in a simple way; see:

	if (q && q->metric && !bio->bi_queue) {
		bio->bi_queue = q;
		...

Regards,

Daniel
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Thu, 6 Dec 2007 16:04:41 -0800 Daniel Phillips <[EMAIL PROTECTED]> wrote:

> The runner up key idea is that we will gain a notion of "block device
> stack" (or block stack for short, so that we may implement block
> stackers) which for the time being will simply be Device Mapper's
> notion of device stack, however many warts that may have.  It's there
> now and we use it for ddsnap.

Perhaps all we need to track is the outermost point?

submit_bio(...)
{
	bool remove_the_rq = false;

	...
	if (current->the_rq == NULL) {
		current->the_rq = rq;
		remove_the_rq = true;
	}
	...
	if (remove_the_rq)
		current->the_rq = NULL;
}

?
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Thursday 06 December 2007 13:53, Bill Davidsen wrote:
> Daniel Phillips wrote:
> The problem is that you (a) may or may not know just how bad a worst
> case can be, and (b) may block unnecessarily by being pessimistic.

True, but after a quick introspect I realized that that issue (it's
really a single issue) is not any worse than the way I planned to wave
my hands at the issue of programmers constructing their metrics wrongly
and thereby breaking the throttling assumptions.  Which is to say that I
am now entirely convinced by Andrew's argument and am prepared to reroll
the patch along the lines he suggests.

The result will be somewhat bigger.  Only a minor change is required to
the main mechanism: we will now account things entirely in units of
pages instead of abstract units, eliminating a whole class of things to
go wrong.  I like that.  Accounting variables get shifted to a new home,
maybe.  Must try a few ideas and see what works.

Anyway, the key idea is that task struct will gain a field pointing at a
handle for the "block device stack", whatever that is (this is sure to
evolve over time), and alloc_pages will know how to account pages to
that object.  The submit_bio and bio->endio bits change hardly at all.

The runner up key idea is that we will gain a notion of "block device
stack" (or block stack for short, so that we may implement block
stackers) which for the time being will simply be Device Mapper's notion
of device stack, however many warts that may have.  It's there now and
we use it for ddsnap.

The other player in this is Peterz's swap over network use case, which
does not involve a device mapper device.  Maybe it should?  Otherwise we
will need a variant notion of block device stack, and the two threads of
work should merge eventually.  There is little harm in starting this
effort in two different places, quite the contrary.  In the meantime we
do have a strategy that works, posted at the head of this thread, for
anybody who needs it now.

> The dummy transaction would be nice, but it would be perfect if you
> could send the real transaction down with a max memory limit and a
> flag, have each level check and decrement the max by what's actually
> needed, and then return some pass/fail status for that particular
> transaction.  Clearly every level in the stack would have to know how
> to do that.  It would seem that once excess memory use was detected
> the transaction could be failed without deadlock.

The function of the dummy transaction will be to establish roughly what
kind of footprint for a single transaction we see on that block IO path.
Then we will make the reservation _hugely_ greater than that, to
accommodate 1000 or so of those.  A transaction blocks if it actually
tries to use more than that.  We close the "many partial submissions all
deadlock together" hole by ... .  Can't go wrong, right?

Agreed that drivers should pay special attention to our dummy
transaction and try to use the maximum possible resources when they see
one.  More handwaving, but this is progress.

Regards,

Daniel
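A rough sketch of the direction described above, with every name
invented purely for illustration: the task carries one new pointer to a
per-stack accounting object, and the page allocator charges allocations
to it, in units of pages:

	struct block_stack {			/* hypothetical */
		atomic_t		pages_used;
		unsigned		pages_reserved;
		wait_queue_head_t	wait;
	};

	/* the single new field in struct task_struct: */
	struct block_stack *block_stack;	/* NULL when not serving a stack */

	/* in the page allocator, on a successful allocation: */
	if (!in_interrupt() && current->block_stack)
		atomic_inc(&current->block_stack->pages_used);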
Re: [RFC] [PATCH] A clean approach to writeout throttling
Daniel Phillips wrote:
> On Wednesday 05 December 2007 17:24, Andrew Morton wrote:
> > On Wed, 5 Dec 2007 16:03:01 -0800 Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > > ...a block device these days may not be just a single device, but
> > > may be a stack of devices connected together by a generic mechanism
> > > such as device mapper, or a hardcoded stack such as multi-disk or
> > > network block device.  It is necessary to consider the resource
> > > requirements of the stack as a whole _before_ letting a transfer
> > > proceed into any layer of the stack, otherwise deadlock on many
> > > partially completed transfers becomes a possibility.  For this
> > > reason, the bio throttling is only implemented at the initial,
> > > highest level submission of the bio to the block layer and not for
> > > any recursive submission of the same bio to a lower level block
> > > device in a stack.
> > >
> > > This in turn has rather far reaching implications: the top level
> > > device in a stack must take care of inspecting the entire stack in
> > > order to determine how to calculate its resource requirements, thus
> > > becoming the boss device for the entire stack.  Though this
> > > intriguing idea could easily become the cause of endless design
> > > work and many thousands of lines of fancy code, today I sidestep
> > > the question entirely using the "just provide lots of reserve"
> > > strategy.  Horrifying as it may seem to some, this is precisely the
> > > strategy that Linux has used in the context of resource management
> > > in general, from the very beginning and likely continuing for quite
> > > some time into the future.
> > >
> > > My strongly held opinion in this matter is that we need to solve
> > > the real, underlying problems definitively with nice code before
> > > declaring the opening of fancy patch season.  So I am leaving
> > > further discussion of automatic resource discovery algorithms and
> > > the like out of this post.
> >
> > Rather than asking the stack "how much memory will this request
> > consume" you could instead ask "how much memory are you currently
> > using".
> >
> > ie: on entry to the stack, do
> >
> > 	current->account_block_allocations = 1;
> > 	make_request(...);
> > 	rq->used_memory += current->pages_used_for_block_allocations;
> >
> > and in the page allocator do
> >
> > 	if (!in_interrupt() && current->account_block_allocations)
> > 		current->pages_used_for_block_allocations++;
> >
> > and then somehow handle deallocation too ;)
>
> Ah, and how do you ensure that you do not deadlock while making this
> inquiry?  Perhaps send a dummy transaction down the pipe?  Even so,
> deadlock is possible, quite evidently so in the real life example I
> have at hand.
>
> Yours is essentially one of the strategies I had in mind, the other
> major one being simply to examine the whole stack, which presupposes
> some as-yet-nonexistent kernel wide method of representing block
> device stacks in all their glorious possible topology variations.
>
> The basic idea being to know in real time how much memory a particular
> block stack is presently using.  Then, on entry to that stack, if the
> stack's current usage is too high, wait for it to subside.  We do not
> wait for high block device resource usage to subside before submitting
> more requests.
>
> The improvement you suggest is aimed at automatically determining
> resource requirements by sampling a running system, rather than
> requiring a programmer to determine them arduously by hand.  Something
> like automatically determining a workable locking strategy by
> analyzing running code, wouldn't that be a treat?  I will hope for one
> of those under my tree at Christmas.

The problem is that you (a) may or may not know just how bad a worst
case can be, and (b) may block unnecessarily by being pessimistic.

The dummy transaction would be nice, but it would be perfect if you
could send the real transaction down with a max memory limit and a flag,
have each level check and decrement the max by what's actually needed,
and then return some pass/fail status for that particular transaction.
Clearly every level in the stack would have to know how to do that.  It
would seem that once excess memory use was detected the transaction
could be failed without deadlock.

-- 
Bill Davidsen <[EMAIL PROTECTED]>
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot
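Bill's scheme might look something like the following sketch, where
every identifier is hypothetical and each level knows its own worst-case
appetite:

	/* pass a memory budget down the stack; each level debits its
	 * worst case and fails the bio cleanly when the budget runs out
	 * instead of deadlocking */
	static int budgeted_submit(struct bio *bio, unsigned budget)
	{
		unsigned need = my_worst_case(bio);	/* per-driver estimate */

		if (need > budget)
			return -ENOMEM;		/* pass/fail status, no deadlock */
		return next_level_submit(bio, budget - need);
	}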
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Thursday 06 December 2007 12:27, Andrew Morton wrote:
> On Thu, 6 Dec 2007 12:04:14 -0800
> Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > Any idea how to extend the accounting idea to all tasks involved in
> > a particular block device stack?
>
> SMOP, I'd have thought.

Agreed, which I realized as soon as the post was one minute old.  Sure:
each helper for the device registers itself, which puts a pointer in its
task struct pointing at the accounting info, so only one new field in
task struct is needed.  The more I ponder, the more doable it seems.

> As long as each piece of code which handles data for this stack knows
> that it's handling data for that stack it should be able to account
> its memory allocations.

Don't forget that we do not actually have a usable notion of "block
device stack" yet.  Perhaps you are just assuming that is
easy/imminent?

> The tricky part will be networking allocations because a NIC can of
> course handle data for all sorts of consumers.  But I expect this can
> be greatly simplified with a few heuristics - work out how much memory
> your typical networking stack will allocate for a frame and tack that
> onto the total.  Couple of pages worst case..

Actually, the same pattern that Peter and I developed for handling
network deadlock extends to this accounting concept.  As you say, it's a
SMOP.

Daniel
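The helper registration described above could be as small as this
(names hypothetical, reusing the per-stack accounting object sketched
earlier), so that a daemon's allocations are charged to the stack it
serves:

	/* called by a daemon that does writeout work for a stack */
	static void register_stack_helper(struct block_stack *stack)
	{
		current->block_stack = stack;	/* the one new task field */
	}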
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Thu, 6 Dec 2007 12:04:14 -0800 Daniel Phillips <[EMAIL PROTECTED]> wrote:

> Any idea how to extend the accounting idea to all tasks involved in a
> particular block device stack?

SMOP, I'd have thought.  As long as each piece of code which handles
data for this stack knows that it's handling data for that stack it
should be able to account its memory allocations.

The tricky part will be networking allocations because a NIC can of
course handle data for all sorts of consumers.  But I expect this can be
greatly simplified with a few heuristics - work out how much memory your
typical networking stack will allocate for a frame and tack that onto
the total.  Couple of pages worst case..
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Thursday 06 December 2007 03:55, Andrew Morton wrote:
> Consider an example.
>
> - We a-priori decide to limit a particular stack's peak memory usage
>   to 1MB

Ah, this is the piece I was missing.

> - We empirically discover that the maximum amount of memory which is
>   allocated by that stack on behalf of a single BIO is 16kb.  (ie:
>   that's the most it has ever used for a single BIO).
>
> - Now, we refuse to feed any more BIOs into the stack when its
>   instantaneous memory usage exceeds (1MB - 16kb).

And printk a warning, because the programmer's static analysis was
wrong.

> Of course, the _average_ memory-per-BIO is much less than 16kb.  So
> there are a lot of BIOs in flight - probably hundreds, but a minimum
> of 63.

And progress is being made, everybody is happy.

> There is a teeny so-small-it-doesn't-matter chance that the stack
> will exceed the 1MB limit.  If it happens to be at its (1MB-16kb)
> limit and all the memory in the machine is AWOL and then someone
> throws a never-seen-before twirly BIO at it.  Not worth worrying
> about, surely.

OK, I see where you are going.  The programmer statically determines a
total reservation for the stack, which is big enough that progress is
guaranteed.  We then throttle based on actual memory consumption, and
essentially use a heuristic to decide when we are near the upper limit.
Workable I think, but...

The main idea, current->pages_used_for_block_allocations++, is valid
only in direct call context.  If a daemon needs to allocate memory on
behalf of the IO transfer (not unusual) it won't get accounted, which is
actually the central issue in this whole class of deadlocks.  Any idea
how to extend the accounting idea to all tasks involved in a particular
block device stack?

Regards,

Daniel
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Thu, 6 Dec 2007 09:34:32 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote:

> That's all rather handwavy and misses a lot of details and might be
> inaccurate too.  Probably sufficient to just work out by hand the
> amount of memory which the network stack will need to allocate.  I
> expect it'll be two pages..

Doesn't Peter Zijlstra's patch series take care of all those nasty
details already?

-- 
All Rights Reversed
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Thu, 6 Dec 2007 10:52:42 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:

> On Thu, 6 Dec 2007 03:55:11 -0800
> Andrew Morton <[EMAIL PROTECTED]> wrote:
>
> > - We a-priori decide to limit a particular stack's peak memory usage
> >   to 1MB
> >
> > - We empirically discover that the maximum amount of memory which is
> >   allocated by that stack on behalf of a single BIO is 16kb.  (ie:
> >   that's the most it has ever used for a single BIO).
> >
> > - Now, we refuse to feed any more BIOs into the stack when its
> >   instantaneous memory usage exceeds (1MB - 16kb).
> >
> > Of course, the _average_ memory-per-BIO is much less than 16kb.  So
> > there are a lot of BIOs in flight - probably hundreds, but a minimum
> > of 63.
>
> There is only one problem I can see with this.  With network block IO,
> some memory will be consumed upon IO completion.  We need to make sure
> we reserve (number of in flight BIOs * maximum amount of memory
> consumed upon IO completion) memory, in addition to the memory you're
> accounting in your example above.

hm, yeah, drat.

What we could do is

- in do_IRQ(): set up a per-cpu pointer to some counter which
  corresponds to this IRQ.

- in the page allocator, if in_irq(), retrieve that per-cpu pointer and
  increment the counter.

- in the network block-io stack we can now look at the number of
  interrupts, number of packets, size of packets and amount of memory
  allocated and work out the max amount of memory which needs to be
  allocated for each frame.

That's all rather handwavy and misses a lot of details and might be
inaccurate too.  Probably sufficient to just work out by hand the amount
of memory which the network stack will need to allocate.  I expect it'll
be two pages..
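Andrew's interrupt-accounting idea, sketched with hypothetical names
(the mapping from IRQ number to counter is the invented part):

	static DEFINE_PER_CPU(atomic_t *, irq_page_counter);

	/* in do_IRQ(), before dispatching the handler for this irq: */
	__get_cpu_var(irq_page_counter) = counter_for_irq(irq);

	/* in the page allocator, charge IRQ-context allocations: */
	if (in_irq() && __get_cpu_var(irq_page_counter))
		atomic_inc(__get_cpu_var(irq_page_counter));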
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Thu, 6 Dec 2007 03:55:11 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote:

> - We a-priori decide to limit a particular stack's peak memory usage
>   to 1MB
>
> - We empirically discover that the maximum amount of memory which is
>   allocated by that stack on behalf of a single BIO is 16kb.  (ie:
>   that's the most it has ever used for a single BIO).
>
> - Now, we refuse to feed any more BIOs into the stack when its
>   instantaneous memory usage exceeds (1MB - 16kb).
>
> Of course, the _average_ memory-per-BIO is much less than 16kb.  So
> there are a lot of BIOs in flight - probably hundreds, but a minimum
> of 63.

There is only one problem I can see with this.  With network block IO,
some memory will be consumed upon IO completion.  We need to make sure
we reserve (number of in flight BIOs * maximum amount of memory consumed
upon IO completion) memory, in addition to the memory you're accounting
in your example above.

-- 
All Rights Reversed
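To put numbers on that formula using Andrew's example (the
per-completion figure here is invented purely for illustration): with at
least 63 BIOs in flight and, say, up to 8kb consumed by each network IO
completion, the completion-side reserve would need to grow by
63 * 8kb, roughly another 500kb on top of the 1MB submission budget.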
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Thu, 6 Dec 2007 01:48:52 -0800 Daniel Phillips <[EMAIL PROTECTED]> wrote:

> On Wednesday 05 December 2007 23:31, Andrew Morton wrote:
> > > > Rather than asking the stack "how much memory will this request
> > > > consume" you could instead ask "how much memory are you currently
> > > > using".
> > > >
> > > > ie: on entry to the stack, do
> > > >
> > > > 	current->account_block_allocations = 1;
> > > > 	make_request(...);
> > > > 	rq->used_memory += current->pages_used_for_block_allocations;
> > > >
> > > > and in the page allocator do
> > > >
> > > > 	if (!in_interrupt() && current->account_block_allocations)
> > > > 		current->pages_used_for_block_allocations++;
> > > >
> > > > and then somehow handle deallocation too ;)
> > >
> > > Ah, and how do you ensure that you do not deadlock while making
> > > this inquiry?
> >
> > It isn't an inquiry - it's a plain old submit_bio() and it runs to
> > completion in the usual fashion.
> >
> > Thing is, we wouldn't have called it at all if this queue was already
> > over its allocation limit.  IOW, we know that it's below its
> > allocation limit, so we know it won't deadlock.  Given, of course,
> > reasonably pessimistic error margins.
>
> OK, I see what you are suggesting.  Yes, one could set the inflight
> limit very low and the reserve very high, and run a bio through the
> stack (what I meant by "inquiry") to discover the actual usage, then
> shrink the reserve accordingly.  By also running a real bio through the
> stack we can discover something about the latency.  So we would then
> know roughly how high the inflight limit should be set and how much the
> memalloc reserve should be increased to handle that particular driver
> instance.
>
> The big fly in this ointment is that we cannot possibly know that our
> bio followed the worst case resource consumption path, whereas it is
> fairly easy (hopefully) for a programmer to determine this statically.

nonono...

Consider an example.

- We a-priori decide to limit a particular stack's peak memory usage to
  1MB

- We empirically discover that the maximum amount of memory which is
  allocated by that stack on behalf of a single BIO is 16kb.  (ie:
  that's the most it has ever used for a single BIO).

- Now, we refuse to feed any more BIOs into the stack when its
  instantaneous memory usage exceeds (1MB - 16kb).

Of course, the _average_ memory-per-BIO is much less than 16kb.  So
there are a lot of BIOs in flight - probably hundreds, but a minimum of
63.

There is a teeny so-small-it-doesn't-matter chance that the stack will
exceed the 1MB limit.  If it happens to be at its (1MB-16kb) limit and
all the memory in the machine is AWOL and then someone throws a
never-seen-before twirly BIO at it.  Not worth worrying about, surely.
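In code, the admission rule of this example reduces to a one-line gate
(identifiers invented for illustration; the posted patch uses
q->available/q->capacity in abstract units rather than byte counts):

	/* refuse entry while instantaneous usage exceeds limit - worst;
	 * worst is the running max footprint of any single BIO seen */
	wait_event(stack->wait,
		   atomic_read(&stack->bytes_used) <=
		   stack->limit - stack->max_per_bio);
	submit_bio(rw, bio);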
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Wednesday 05 December 2007 23:31, Andrew Morton wrote:
> > > Rather than asking the stack "how much memory will this request
> > > consume" you could instead ask "how much memory are you currently
> > > using".
> > >
> > > ie: on entry to the stack, do
> > >
> > > 	current->account_block_allocations = 1;
> > > 	make_request(...);
> > > 	rq->used_memory += current->pages_used_for_block_allocations;
> > >
> > > and in the page allocator do
> > >
> > > 	if (!in_interrupt() && current->account_block_allocations)
> > > 		current->pages_used_for_block_allocations++;
> > >
> > > and then somehow handle deallocation too ;)
> >
> > Ah, and how do you ensure that you do not deadlock while making this
> > inquiry?
>
> It isn't an inquiry - it's a plain old submit_bio() and it runs to
> completion in the usual fashion.
>
> Thing is, we wouldn't have called it at all if this queue was already
> over its allocation limit.  IOW, we know that it's below its allocation
> limit, so we know it won't deadlock.  Given, of course, reasonably
> pessimistic error margins.

OK, I see what you are suggesting.  Yes, one could set the inflight
limit very low and the reserve very high, and run a bio through the
stack (what I meant by "inquiry") to discover the actual usage, then
shrink the reserve accordingly.  By also running a real bio through the
stack we can discover something about the latency.  So we would then
know roughly how high the inflight limit should be set and how much the
memalloc reserve should be increased to handle that particular driver
instance.

The big fly in this ointment is that we cannot possibly know that our
bio followed the worst case resource consumption path, whereas it is
fairly easy (hopefully) for a programmer to determine this statically.

> Which margins can even be observed at runtime: keep a running "max" of
> this stack's most-ever memory consumption (for a single call), and
> only submit a bio into it when its current allocation is less than
> (limit - that-max).

Actually, your mechanism would always have to be operable at runtime,
since inserting a new driver while the system is under heavy memory load
is a perfectly valid operation and has to be reliable.  Anyway, even if
you run a bio through the stack lots of times (insert definition of
"lots" here) you still cannot be sure that it has explored the worst
case path.  To put this in perspective, some of the deadlocks we have
hunted down recently have taken days to manifest under artificially high
load.  It just takes that long to randomly explore a sufficient number
of corner cases.

> > Perhaps send a dummy transaction down the pipe?  Even so, deadlock
> > is possible, quite evidently so in the real life example I have at
> > hand.
> >
> > Yours is essentially one of the strategies I had in mind, the other
> > major one being simply to examine the whole stack, which presupposes
> > some as-yet-nonexistent kernel wide method of representing block
> > device stacks in all their glorious possible topology variations.
>
> We already have that, I think: blk_run_backing_dev().  One could
> envisage a similar thing which runs up and down the stack accumulating
> "how much memory do you need for this request" data, but I think that
> would be hard to implement and plain dumb.

I don't think I quite communicated there.  We don't actually have any
generic notion of "the block device stack".  Device mapper has its own
model, md has another model, and other stacking devices may have no
model at all, just some through-coded hack.

It would be worth fixing this problem as part of an effort to generalize
the block IO model and make block devices in general look more like
device mapper devices.  But that would be a pretty big project, the need
for which is not generally recognized.

> > ...Something like automatically determining a workable locking
> > strategy by analyzing running code, wouldn't that be a treat?  I
> > will hope for one of those under my tree at Christmas.
>
> I don't see any unviability.  A small matter of coding.

It would be a legendary hack.

> > ...number of requests does not map well to the amount of resources
> > consumed.  In ddsnap for example, the amount of memory used by the
> > userspace ddsnapd is roughly linear vs the number of pages
> > transferred, not the number of requests.
>
> Yeah, one would need to be pretty pessimal.  Perhaps unacceptably
> inaccurate, dunno.

Orders of magnitude more reserve would need to be allocated in the case
of ddsnap, since bio payload can vary through a big range, which is
expected to get bigger as time goes by.  So the few lines of extra code
and the extra bio field needed to get a better fit is well worth the
effort, or even indispensable.

> > > > @@ -3221,6 +3221,13 @@ static inline void __generic_make_reques
> > > > 	if (bio_check_eod(bio, nr_sectors))
> > > > 		goto end_io;
> > > >
> > > > +	if (q && q->metric && !bio->bi_queue) {
>
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Wednesday 05 December 2007 23:31, Andrew Morton wrote:
> > > Rather than asking the stack "how much memory will this request consume" you could instead ask "how much memory are you currently using". ie: on entry to the stack, do
> > >
> > > 	current->account_block_allocations = 1;
> > > 	make_request(...);
> > > 	rq->used_memory += current->pages_used_for_block_allocations;
> > >
> > > and in the page allocator do
> > >
> > > 	if (!in_interrupt() && current->account_block_allocations)
> > > 		current->pages_used_for_block_allocations++;
> > >
> > > and then somehow handle deallocation too ;)
> >
> > Ah, and how do you ensure that you do not deadlock while making this inquiry?
>
> It isn't an inquiry - it's a plain old submit_bio() and it runs to completion in the usual fashion. Thing is, we wouldn't have called it at all if this queue was already over its allocation limit. IOW, we know that it's below its allocation limit, so we know it won't deadlock. Given, of course, reasonably pessimistic error margins.

OK, I see what you are suggesting. Yes, one could set the inflight limit very low and the reserve very high, and run a bio through the stack (what I meant by inquiry) to discover the actual usage, then shrink the reserve accordingly. By also running a real bio through the stack we can discover something about the latency. So we would then know roughly how high the inflight limit should be set and how much the memalloc reserve should be increased to handle that particular driver instance.

The big fly in this ointment is that we cannot possibly know that our bio followed the worst case resource consumption path, whereas it is fairly easy (hopefully) for a programmer to determine this statically.

> Which margins can even be observed at runtime: keep a running "max" of this stack's most-ever memory consumption (for a single call), and only submit a bio into it when its current allocation is less than (limit - that-max).

Actually, your mechanism would always have to be operable at runtime, since inserting a new driver while the system is under heavy memory load is a perfectly valid operation and has to be reliable. Anyway, even if you run a bio through the stack lots of times (insert definition of lots here) you still cannot be sure that it has explored the worst case path. To put this in perspective, some of the deadlocks we have hunted down recently have taken days to manifest under artificially high load. It just takes that long to randomly explore a sufficient number of corner cases.

> > Perhaps send a dummy transaction down the pipe? Even so, deadlock is possible, quite evidently so in the real life example I have at hand. Yours is essentially one of the strategies I had in mind, the other major one being simply to examine the whole stack, which presupposes some as-yet-nonexistent kernel wide method of representing block device stacks in all their glorious possible topology variations.
>
> We already have that, I think: blk_run_backing_dev(). One could envisage a similar thing which runs up and down the stack accumulating "how much memory do you need for this request" data, but I think that would be hard to implement and plain dumb.

I don't think I quite communicated there. We don't actually have any generic notion of the block device stack. Device mapper has its own model, md has another model, and other stacking devices may have no model at all, just some through-coded hack. It would be worth fixing this problem as part of an effort to generalize the block IO model and make block devices in general look more like device mapper devices. But that would be a pretty big project, the need for which is not generally recognized.

> > ...Something like automatically determining a workable locking strategy by analyzing running code, wouldn't that be a treat? I will hope for one of those under my tree at Christmas.
>
> I don't see any unviability.

A small matter of coding. It would be a legendary hack.

> > ...number of requests does not map well to the amount of resources consumed. In ddsnap for example, the amount of memory used by the userspace ddsnapd is roughly linear vs the number of pages transferred, not the number of requests.
>
> Yeah, one would need to be pretty pessimal. Perhaps unacceptably inaccurate, dunno.

Orders of magnitude more reserve would need to be allocated in the case of ddsnap, since bio payload can vary through a big range, which is expected to get bigger as time goes by. So the few lines of extra code and the extra bio field needed to get a better fit is well worth the effort, or even indispensable.

> > > > @@ -3221,6 +3221,13 @@ static inline void __generic_make_reques
> > > >  	if (bio_check_eod(bio, nr_sectors))
> > > >  		goto end_io;
> > > >
> > > > +	if (q && q->metric && !bio->bi_queue) {
> > > > +		int need = bio->bi_throttle = q->metric(bio);
> > > > +		bio->bi_queue = q;
> > > > +		/* FIXME: potential race if
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Thu, 6 Dec 2007 01:48:52 -0800 Daniel Phillips <[EMAIL PROTECTED]> wrote:
> On Wednesday 05 December 2007 23:31, Andrew Morton wrote:
> > > > Rather than asking the stack "how much memory will this request consume" you could instead ask "how much memory are you currently using". ie: on entry to the stack, do
> > > >
> > > > 	current->account_block_allocations = 1;
> > > > 	make_request(...);
> > > > 	rq->used_memory += current->pages_used_for_block_allocations;
> > > >
> > > > and in the page allocator do
> > > >
> > > > 	if (!in_interrupt() && current->account_block_allocations)
> > > > 		current->pages_used_for_block_allocations++;
> > > >
> > > > and then somehow handle deallocation too ;)
> > >
> > > Ah, and how do you ensure that you do not deadlock while making this inquiry?
> >
> > It isn't an inquiry - it's a plain old submit_bio() and it runs to completion in the usual fashion. Thing is, we wouldn't have called it at all if this queue was already over its allocation limit. IOW, we know that it's below its allocation limit, so we know it won't deadlock. Given, of course, reasonably pessimistic error margins.
>
> OK, I see what you are suggesting. Yes, one could set the inflight limit very low and the reserve very high, and run a bio through the stack (what I meant by inquiry) to discover the actual usage, then shrink the reserve accordingly. By also running a real bio through the stack we can discover something about the latency. So we would then know roughly how high the inflight limit should be set and how much the memalloc reserve should be increased to handle that particular driver instance.
>
> The big fly in this ointment is that we cannot possibly know that our bio followed the worst case resource consumption path, whereas it is fairly easy (hopefully) for a programmer to determine this statically.

nonono... Consider an example.

- We a-priori decide to limit a particular stack's peak memory usage to 1MB

- We empirically discover that the maximum amount of memory which is allocated by that stack on behalf of a single BIO is 16kb. (ie: that's the most it has ever used for a single BIO).

- Now, we refuse to feed any more BIOs into the stack when its instantaneous memory usage exceeds (1MB - 16kb).

Of course, the _average_ memory-per-BIO is much less than 16kb. So there are a lot of BIOs in flight - probably hundreds, but a minimum of 63.

There is a teeny so-small-it-doesn't-matter chance that the stack will exceed the 1MB limit. If it happens to be at its (1MB-16kb) limit and all the memory in the machine is AWOL and then someone throws a never-seen-before twirly BIO at it. Not worth worrying about, surely.
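To make the arithmetic concrete, here is a minimal sketch of the admission test Andrew describes. Every identifier below (stack_account, inuse, max_per_bio, limit, throttled_submit) is invented for illustration; nothing here is an existing kernel interface, and the check-then-proceed sequence has the same race that the patch's own FIXME flags.

	#include <linux/bio.h>
	#include <linux/wait.h>

	/* Illustrative per-stack accounting object. */
	struct stack_account {
		atomic_t inuse;			/* pages this stack has allocated now */
		unsigned max_per_bio;		/* most pages ever used for one bio */
		unsigned limit;			/* a-priori cap, e.g. 256 pages = 1MB */
		wait_queue_head_t wait;		/* woken as completions release pages */
	};

	static void throttled_submit(struct stack_account *acct, int rw,
				     struct bio *bio)
	{
		/* Admit a bio only while even a worst-case consumer cannot
		   push instantaneous usage past the cap. */
		wait_event(acct->wait, atomic_read(&acct->inuse) +
			   acct->max_per_bio <= acct->limit);
		submit_bio(rw, bio);	/* runs to completion as usual */
	}

	/* The completion path would do the inverse: subtract the pages
	   actually used, raise max_per_bio if a new worst case was seen,
	   and wake_up(&acct->wait). */

With Andrew's numbers, limit is 256 pages and max_per_bio is 4 pages (16kb), which is where the minimum of 63 in-flight BIOs comes from: (1MB - 16kb) / 16kb = 63.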
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Thu, 6 Dec 2007 03:55:11 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote:
> - We a-priori decide to limit a particular stack's peak memory usage to 1MB
>
> - We empirically discover that the maximum amount of memory which is allocated by that stack on behalf of a single BIO is 16kb. (ie: that's the most it has ever used for a single BIO).
>
> - Now, we refuse to feed any more BIOs into the stack when its instantaneous memory usage exceeds (1MB - 16kb).
>
> Of course, the _average_ memory-per-BIO is much less than 16kb. So there are a lot of BIOs in flight - probably hundreds, but a minimum of 63.

There is only one problem I can see with this. With network block IO, some memory will be consumed upon IO completion. We need to make sure we reserve (number of in flight BIOs * maximum amount of memory consumed upon IO completion) memory, in addition to the memory you're accounting in your example above.

-- 
All Rights Reversed
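Rik's correction folds into the reserve sizing as straightforward arithmetic. A hedged sketch with invented names; note that the in-flight bio count must be bounded by the worst case (pages admitted at the submit cap divided by the smallest possible bio), not the minimum of 63:

	/* Total reserve = submit-side cap plus worst-case completion cost.
	   All names are illustrative. */
	static unsigned reserve_pages(unsigned submit_cap_pages,
				      unsigned max_inflight_bios,
				      unsigned completion_pages_per_bio)
	{
		return submit_cap_pages +
			max_inflight_bios * completion_pages_per_bio;
	}

	/* e.g. a 256-page cap with one-page bios allows up to 256 in
	   flight; at 2 pages per completion: 256 + 256 * 2 = 768 pages. */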
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Thu, 6 Dec 2007 10:52:42 -0500 Rik van Riel <[EMAIL PROTECTED]> wrote:
> On Thu, 6 Dec 2007 03:55:11 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote:
> > - We a-priori decide to limit a particular stack's peak memory usage to 1MB
> >
> > - We empirically discover that the maximum amount of memory which is allocated by that stack on behalf of a single BIO is 16kb. (ie: that's the most it has ever used for a single BIO).
> >
> > - Now, we refuse to feed any more BIOs into the stack when its instantaneous memory usage exceeds (1MB - 16kb).
> >
> > Of course, the _average_ memory-per-BIO is much less than 16kb. So there are a lot of BIOs in flight - probably hundreds, but a minimum of 63.
>
> There is only one problem I can see with this. With network block IO, some memory will be consumed upon IO completion. We need to make sure we reserve (number of in flight BIOs * maximum amount of memory consumed upon IO completion) memory, in addition to the memory you're accounting in your example above.

hm, yeah, drat. What we could do is

- in do_IRQ(): set up a per-cpu pointer to some counter which corresponds to this IRQ.

- in the page allocator, if in_irq(), retrieve that per-cpu pointer and increment the counter.

- in the network block-io stack we can now look at the number of interrupts, number of packets, size of packets and amount of memory allocated and work out the max amount of memory which needs to be allocated for each frame.

That's all rather handwavy and misses a lot of details and might be inaccurate too. Probably sufficient to just work out by hand the amount of memory which the network stack will need to allocate. I expect it'll be two pages..
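A sketch of the accounting hooks Andrew hand-waves at; none of these exist in mainline, both helper names and the per-cpu variable are invented, and the per-cpu API shown is the 2.6.24-era one:

	#include <linux/percpu.h>
	#include <linux/hardirq.h>

	/* Invented per-cpu slot: the counter for the IRQ currently
	   being serviced on this cpu. */
	static DEFINE_PER_CPU(atomic_t *, irq_alloc_counter);

	/* Would be called from do_IRQ() before dispatching the handler. */
	static void irq_account_enter(atomic_t *counter_for_this_irq)
	{
		__get_cpu_var(irq_alloc_counter) = counter_for_this_irq;
	}

	/* Hypothetical page allocator hook: charge interrupt-context
	   allocations to whichever IRQ is being serviced. */
	static void account_irq_page_alloc(void)
	{
		atomic_t *counter;

		if (!in_irq())
			return;
		counter = __get_cpu_var(irq_alloc_counter);
		if (counter)
			atomic_inc(counter);
	}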
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Thu, 6 Dec 2007 09:34:32 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote:
> That's all rather handwavy and misses a lot of details and might be inaccurate too. Probably sufficient to just work out by hand the amount of memory which the network stack will need to allocate. I expect it'll be two pages..

Doesn't Peter Zijlstra's patch series take care of all those nasty details already?

-- 
All Rights Reversed
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Thursday 06 December 2007 03:55, Andrew Morton wrote:
> Consider an example.
>
> - We a-priori decide to limit a particular stack's peak memory usage to 1MB

Ah, this is the piece I was missing.

> - We empirically discover that the maximum amount of memory which is allocated by that stack on behalf of a single BIO is 16kb. (ie: that's the most it has ever used for a single BIO).
>
> - Now, we refuse to feed any more BIOs into the stack when its instantaneous memory usage exceeds (1MB - 16kb).

And printk a warning, because the programmer's static analysis was wrong.

> Of course, the _average_ memory-per-BIO is much less than 16kb. So there are a lot of BIOs in flight - probably hundreds, but a minimum of 63.

And progress is being made, everybody is happy.

> There is a teeny so-small-it-doesn't-matter chance that the stack will exceed the 1MB limit. If it happens to be at its (1MB-16kb) limit and all the memory in the machine is AWOL and then someone throws a never-seen-before twirly BIO at it. Not worth worrying about, surely.

OK, I see where you are going. The programmer statically determines a total reservation for the stack, which is big enough that progress is guaranteed. We then throttle based on actual memory consumption, and essentially use a heuristic to decide when we are near the upper limit. Workable I think, but...

The main idea, current->pages_used_for_block_allocations++, is valid only in direct call context. If a daemon needs to allocate memory on behalf of the IO transfer (not unusual) it won't get accounted, which is actually the central issue in this whole class of deadlocks. Any idea how to extend the accounting idea to all tasks involved in a particular block device stack?

Regards,

Daniel
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Thu, 6 Dec 2007 12:04:14 -0800 Daniel Phillips <[EMAIL PROTECTED]> wrote:
> Any idea how to extend the accounting idea to all tasks involved in a particular block device stack?

SMOP, I'd have thought. As long as each piece of code which handles data for this stack knows that it's handling data for that stack it should be able to account its memory allocations.

The tricky part will be networking allocations because a NIC can of course handle data for all sorts of consumers. But I expect this can be greatly simplified with a few heuristics - work out how much memory your typical networking stack will allocate for a frame and tack that onto the total. Couple of pages worst case..
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Thursday 06 December 2007 12:27, Andrew Morton wrote:
> On Thu, 6 Dec 2007 12:04:14 -0800 Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > Any idea how to extend the accounting idea to all tasks involved in a particular block device stack?
>
> SMOP, I'd have thought.

Agreed, which I realized as soon as the post was one minute old. Sure: each helper for the device registers itself as a helper, which puts a pointer in the task struct pointing at the accounting info, so only one new field in task struct is needed. The more I ponder, the more doable it seems.

> As long as each piece of code which handles data for this stack knows that it's handling data for that stack it should be able to account its memory allocations.

Don't forget that we do not actually have a usable notion of block device stack yet. Perhaps you are just assuming that is easy/imminent?

> The tricky part will be networking allocations because a NIC can of course handle data for all sorts of consumers. But I expect this can be greatly simplified with a few heuristics - work out how much memory your typical networking stack will allocate for a frame and tack that onto the total. Couple of pages worst case..

Actually, the same pattern that Peter and I developed for handling network deadlock extends to this accounting concept. As you say, it's a SMOP.

Daniel
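A sketch of that registration scheme, under the assumption that task_struct grows exactly one new pointer member; every name here (blk_stack_account, blk_account, both helper functions) is invented for illustration:

	#include <linux/sched.h>
	#include <linux/hardirq.h>

	/* Shared accounting object, one per block device stack. */
	struct blk_stack_account {
		atomic_t pages_used;	/* pages allocated on the stack's behalf */
	};

	/* Assumed new member of struct task_struct:
	 *	struct blk_stack_account *blk_account;
	 */

	/* A helper daemon for the stack registers itself once at startup. */
	static void blk_helper_register(struct blk_stack_account *acct)
	{
		current->blk_account = acct;
	}

	/* Hypothetical page allocator hook: submitter and helpers alike
	   get charged to the same per-stack counter. */
	static void blk_account_page_alloc(void)
	{
		if (!in_interrupt() && current->blk_account)
			atomic_inc(&current->blk_account->pages_used);
	}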
Re: [RFC] [PATCH] A clean approach to writeout throttling
Daniel Phillips wrote:
> On Wednesday 05 December 2007 17:24, Andrew Morton wrote:
> > On Wed, 5 Dec 2007 16:03:01 -0800 Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > > ...a block device these days may not be just a single device, but may be a stack of devices connected together by a generic mechanism such as device mapper, or a hardcoded stack such as multi-disk or network block device. It is necessary to consider the resource requirements of the stack as a whole _before_ letting a transfer proceed into any layer of the stack, otherwise deadlock on many partially completed transfers becomes a possibility. For this reason, the bio throttling is only implemented at the initial, highest level submission of the bio to the block layer and not for any recursive submission of the same bio to a lower level block device in a stack.
> > >
> > > This in turn has rather far reaching implications: the top level device in a stack must take care of inspecting the entire stack in order to determine how to calculate its resource requirements, thus becoming the boss device for the entire stack. Though this intriguing idea could easily become the cause of endless design work and many thousands of lines of fancy code, today I sidestep the question entirely using the "just provide lots of reserve" strategy. Horrifying as it may seem to some, this is precisely the strategy that Linux has used in the context of resource management in general, from the very beginning and likely continuing for quite some time into the future. My strongly held opinion in this matter is that we need to solve the real, underlying problems definitively with nice code before declaring the opening of fancy patch season. So I am leaving further discussion of automatic resource discovery algorithms and the like out of this post.
> >
> > Rather than asking the stack "how much memory will this request consume" you could instead ask "how much memory are you currently using". ie: on entry to the stack, do
> >
> > 	current->account_block_allocations = 1;
> > 	make_request(...);
> > 	rq->used_memory += current->pages_used_for_block_allocations;
> >
> > and in the page allocator do
> >
> > 	if (!in_interrupt() && current->account_block_allocations)
> > 		current->pages_used_for_block_allocations++;
> >
> > and then somehow handle deallocation too ;)
>
> Ah, and how do you ensure that you do not deadlock while making this inquiry? Perhaps send a dummy transaction down the pipe? Even so, deadlock is possible, quite evidently so in the real life example I have at hand. Yours is essentially one of the strategies I had in mind, the other major one being simply to examine the whole stack, which presupposes some as-yet-nonexistent kernel wide method of representing block device stacks in all their glorious possible topology variations.
>
> > The basic idea being to know in real time how much memory a particular block stack is presently using. Then, on entry to that stack, if the stack's current usage is too high, wait for it to subside.
>
> We do not wait for high block device resource usage to subside before submitting more requests. The improvement you suggest is aimed at automatically determining resource requirements by sampling a running system, rather than requiring a programmer to determine them arduously by hand. Something like automatically determining a workable locking strategy by analyzing running code, wouldn't that be a treat? I will hope for one of those under my tree at Christmas.

The problem is that you (a) may or may not know just how bad a worst case can be, and (b) may block unnecessarily by being pessimistic.

The dummy transaction would be nice, but it would be perfect if you could send the real transaction down with a max memory limit and a flag, have each level check and decrement the max by what's actually needed, and then return some pass/fail status for that particular transaction. Clearly every level in the stack would have to know how to do that. It would seem that once excess memory use was detected the transaction could be failed without deadlock.

-- 
Bill Davidsen <[EMAIL PROTECTED]>
"We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot
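A sketch of Bill's pass/fail budget; the struct and function are hypothetical (no such interface exists), and -ENOMEM merely stands in for whatever status the failed transaction would carry back up the stack:

	#include <linux/errno.h>

	/* A memory budget that rides along with one transfer. */
	struct bio_budget {
		int pages_left;		/* remaining allowance, in pages */
	};

	/* Each level of the stack debits its actual need before passing
	   the transfer further down. */
	static int bio_budget_charge(struct bio_budget *budget, int pages)
	{
		if (budget->pages_left < pages)
			return -ENOMEM;	/* fail the transfer, don't deadlock */
		budget->pages_left -= pages;
		return 0;
	}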
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Thursday 06 December 2007 13:53, Bill Davidsen wrote:
> The problem is that you (a) may or may not know just how bad a worst case can be, and (b) may block unnecessarily by being pessimistic.

True, but after a quick introspect I realized that that issue (it's really a single issue) is not any worse than the way I planned to wave my hands at the issue of programmers constructing their metrics wrongly and thereby breaking the throttling assumptions. Which is to say that I am now entirely convinced by Andrew's argument and am prepared to reroll the patch along the lines he suggests.

The result will be somewhat bigger. Only a minor change is required to the main mechanism: we will now account things entirely in units of pages instead of abstract units, eliminating a whole class of things to go wrong. I like that. Accounting variables get shifted to a new home, maybe. Must try a few ideas and see what works. Anyway, the key idea is that task struct will gain a field pointing at a handle for the block device stack, whatever that is (this is sure to evolve over time) and alloc_pages will know how to account pages to that object. The submit_bio and bio->endio bits change hardly at all.

The runner up key idea is that we will gain a notion of block device stack (or "block stack" for short, so that we may implement "block stackers") which for the time being will simply be Device Mapper's notion of device stack, however many warts that may have. It's there now and we use it for ddsnap. The other player in this is Peterz's swap over network use case, which does not involve a device mapper device. Maybe it should? Otherwise we will need a variant notion of block device stack, and the two threads of work should merge eventually. There is little harm in starting this effort in two different places, quite the contrary.

In the meantime we do have a strategy that works, posted at the head of this thread, for anybody who needs it now.

> The dummy transaction would be nice, but it would be perfect if you could send the real transaction down with a max memory limit and a flag, have each level check and decrement the max by what's actually needed, and then return some pass/fail status for that particular transaction. Clearly every level in the stack would have to know how to do that. It would seem that once excess memory use was detected the transaction could be failed without deadlock.

The function of the dummy transaction will be to establish roughly what kind of footprint for a single transaction we see on that block IO path. Then we will make the reservation _hugely_ greater than that, to accommodate 1000 or so of those. A transaction blocks if it actually tries to use more than that. We close the "many partial submissions all deadlock together" hole by ... insert handwaving here. Can't go wrong, right? Agreed that drivers should pay special attention to our dummy transaction and try to use the maximum possible resources when they see one. More handwaving, but this is progress.

Regards,

Daniel
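The sizing rule Daniel sketches reduces to one line; the names are invented, and 1000 is simply the headroom factor he proposes above:

	#define THROTTLE_HEADROOM 1000	/* room for ~1000 worst-case transactions */

	/* Reserve sized from the footprint the dummy transaction observed. */
	static unsigned stack_reserve_pages(unsigned observed_pages_per_transaction)
	{
		return observed_pages_per_transaction * THROTTLE_HEADROOM;
	}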
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Thu, 6 Dec 2007 16:04:41 -0800 Daniel Phillips <[EMAIL PROTECTED]> wrote:
> The runner up key idea is that we will gain a notion of block device stack (or "block stack" for short, so that we may implement "block stackers") which for the time being will simply be Device Mapper's notion of device stack, however many warts that may have. It's there now and we use it for ddsnap.

Perhaps all we need to track is the outermost point?

	submit_bio(...)
	{
		bool remove_the_rq = false;

		...
		if (current->the_rq == NULL) {
			current->the_rq = rq;
			remove_the_rq = true;
		}
		...
		if (remove_the_rq)
			current->the_rq = NULL;
	}

?
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Thursday 06 December 2007 16:29, Andrew Morton wrote:
> On Thu, 6 Dec 2007 16:04:41 -0800 Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > The runner up key idea is that we will gain a notion of block device stack (or "block stack" for short, so that we may implement "block stackers") which for the time being will simply be Device Mapper's notion of device stack, however many warts that may have. It's there now and we use it for ddsnap.
>
> Perhaps all we need to track is the outermost point?
>
> 	submit_bio(...)
> 	{
> 		bool remove_the_rq = false;
>
> 		...
> 		if (current->the_rq == NULL) {
> 			current->the_rq = rq;
> 			remove_the_rq = true;
> 		}
> 		...
> 		if (remove_the_rq)
> 			current->the_rq = NULL;
> 	}
>
> ?

The parent patch already has that crucial property in a simple way, see:

	if (q && q->metric && !bio->bi_queue) {
		bio->bi_queue = q;
		...

Regards,

Daniel
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Wed, 5 Dec 2007 22:21:44 -0800 Daniel Phillips <[EMAIL PROTECTED]> wrote:
> On Wednesday 05 December 2007 17:24, Andrew Morton wrote:
> > On Wed, 5 Dec 2007 16:03:01 -0800 Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > > ...a block device these days may not be just a single device, but may be a stack of devices connected together by a generic mechanism such as device mapper, or a hardcoded stack such as multi-disk or network block device. It is necessary to consider the resource requirements of the stack as a whole _before_ letting a transfer proceed into any layer of the stack, otherwise deadlock on many partially completed transfers becomes a possibility. For this reason, the bio throttling is only implemented at the initial, highest level submission of the bio to the block layer and not for any recursive submission of the same bio to a lower level block device in a stack.
> > >
> > > This in turn has rather far reaching implications: the top level device in a stack must take care of inspecting the entire stack in order to determine how to calculate its resource requirements, thus becoming the boss device for the entire stack. Though this intriguing idea could easily become the cause of endless design work and many thousands of lines of fancy code, today I sidestep the question entirely using the "just provide lots of reserve" strategy. Horrifying as it may seem to some, this is precisely the strategy that Linux has used in the context of resource management in general, from the very beginning and likely continuing for quite some time into the future. My strongly held opinion in this matter is that we need to solve the real, underlying problems definitively with nice code before declaring the opening of fancy patch season. So I am leaving further discussion of automatic resource discovery algorithms and the like out of this post.
> >
> > Rather than asking the stack "how much memory will this request consume" you could instead ask "how much memory are you currently using". ie: on entry to the stack, do
> >
> > 	current->account_block_allocations = 1;
> > 	make_request(...);
> > 	rq->used_memory += current->pages_used_for_block_allocations;
> >
> > and in the page allocator do
> >
> > 	if (!in_interrupt() && current->account_block_allocations)
> > 		current->pages_used_for_block_allocations++;
> >
> > and then somehow handle deallocation too ;)
>
> Ah, and how do you ensure that you do not deadlock while making this inquiry?

It isn't an inquiry - it's a plain old submit_bio() and it runs to completion in the usual fashion. Thing is, we wouldn't have called it at all if this queue was already over its allocation limit. IOW, we know that it's below its allocation limit, so we know it won't deadlock. Given, of course, reasonably pessimistic error margins.

Which margins can even be observed at runtime: keep a running "max" of this stack's most-ever memory consumption (for a single call), and only submit a bio into it when its current allocation is less than (limit - that-max).

> Perhaps send a dummy transaction down the pipe? Even so, deadlock is possible, quite evidently so in the real life example I have at hand.
>
> Yours is essentially one of the strategies I had in mind, the other major one being simply to examine the whole stack, which presupposes some as-yet-nonexistent kernel wide method of representing block device stacks in all their glorious possible topology variations.

We already have that, I think: blk_run_backing_dev(). One could envisage a similar thing which runs up and down the stack accumulating "how much memory do you need for this request" data, but I think that would be hard to implement and plain dumb.

> > The basic idea being to know in real time how much memory a particular block stack is presently using. Then, on entry to that stack, if the stack's current usage is too high, wait for it to subside.
>
> We do not wait for high block device resource usage to subside before submitting more requests. The improvement you suggest is aimed at automatically determining resource requirements by sampling a running system, rather than requiring a programmer to determine them arduously by hand. Something like automatically determining a workable locking strategy by analyzing running code, wouldn't that be a treat? I will hope for one of those under my tree at Christmas.

I don't see any unviability.

> More practically, I can see a debug mode implemented along the lines you describe where we automatically detect that a writeout path has violated its covenant as expressed by its throttle_metric.
>
> > otoh we already have mechanisms for limiting the number of requests in flight. This is approximately proportional to the amount of memory which
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Wednesday 05 December 2007 17:24, Andrew Morton wrote:
> On Wed, 5 Dec 2007 16:03:01 -0800 Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > ...a block device these days may not be just a single device, but may be a stack of devices connected together by a generic mechanism such as device mapper, or a hardcoded stack such as multi-disk or network block device. It is necessary to consider the resource requirements of the stack as a whole _before_ letting a transfer proceed into any layer of the stack, otherwise deadlock on many partially completed transfers becomes a possibility. For this reason, the bio throttling is only implemented at the initial, highest level submission of the bio to the block layer and not for any recursive submission of the same bio to a lower level block device in a stack.
> >
> > This in turn has rather far reaching implications: the top level device in a stack must take care of inspecting the entire stack in order to determine how to calculate its resource requirements, thus becoming the boss device for the entire stack. Though this intriguing idea could easily become the cause of endless design work and many thousands of lines of fancy code, today I sidestep the question entirely using the "just provide lots of reserve" strategy. Horrifying as it may seem to some, this is precisely the strategy that Linux has used in the context of resource management in general, from the very beginning and likely continuing for quite some time into the future. My strongly held opinion in this matter is that we need to solve the real, underlying problems definitively with nice code before declaring the opening of fancy patch season. So I am leaving further discussion of automatic resource discovery algorithms and the like out of this post.
>
> Rather than asking the stack "how much memory will this request consume" you could instead ask "how much memory are you currently using". ie: on entry to the stack, do
>
> 	current->account_block_allocations = 1;
> 	make_request(...);
> 	rq->used_memory += current->pages_used_for_block_allocations;
>
> and in the page allocator do
>
> 	if (!in_interrupt() && current->account_block_allocations)
> 		current->pages_used_for_block_allocations++;
>
> and then somehow handle deallocation too ;)

Ah, and how do you ensure that you do not deadlock while making this inquiry? Perhaps send a dummy transaction down the pipe? Even so, deadlock is possible, quite evidently so in the real life example I have at hand.

Yours is essentially one of the strategies I had in mind, the other major one being simply to examine the whole stack, which presupposes some as-yet-nonexistent kernel wide method of representing block device stacks in all their glorious possible topology variations.

> The basic idea being to know in real time how much memory a particular block stack is presently using. Then, on entry to that stack, if the stack's current usage is too high, wait for it to subside.

We do not wait for high block device resource usage to subside before submitting more requests. The improvement you suggest is aimed at automatically determining resource requirements by sampling a running system, rather than requiring a programmer to determine them arduously by hand. Something like automatically determining a workable locking strategy by analyzing running code, wouldn't that be a treat? I will hope for one of those under my tree at Christmas.

More practically, I can see a debug mode implemented along the lines you describe where we automatically detect that a writeout path has violated its covenant as expressed by its throttle_metric.

> otoh we already have mechanisms for limiting the number of requests in flight. This is approximately proportional to the amount of memory which was allocated to service those requests. Why not just use that?

Two reasons. The minor one is that device mapper bypasses that mechanism (no elevator) and the major one is that number of requests does not map well to the amount of resources consumed. In ddsnap for example, the amount of memory used by the userspace ddsnapd is roughly linear vs the number of pages transferred, not the number of requests.

> > @@ -3221,6 +3221,13 @@ static inline void __generic_make_reques
> >  	if (bio_check_eod(bio, nr_sectors))
> >  		goto end_io;
> >
> > +	if (q && q->metric && !bio->bi_queue) {
> > +		int need = bio->bi_throttle = q->metric(bio);
> > +		bio->bi_queue = q;
> > +		/* FIXME: potential race if atomic_sub is called in the middle of condition check */
> > +		wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
>
> This will fall straight through if signal_pending() and (I assume) bad stuff will happen. uninterruptible sleep, methinks.

Yes, as a first order repair. To be done properly I need to express this in terms of the guts of wait_event_*, and get rid of
Re: [RFC] [PATCH] A clean approach to writeout throttling
On Wed, 5 Dec 2007 16:03:01 -0800 Daniel Phillips <[EMAIL PROTECTED]> wrote:
> Practice
>
> So here is the key idea of today's patch: it provides a simple mechanism for imposing a limit on the amount of data that can be in flight to any particular block device.
>
> The limit on in-flight data is in fact expressed generically, as it is hard to set down any single rule to specify the amount of resources any particular bio transfer will require. Instead, the block layer provides a method that a block driver may optionally fill in, to calculate the resource bound in units of the block driver's choosing. The block driver thus takes upon itself the task of translating its own, self-imposed bound into generic resource units that can be treated generically by the kernel. In simple terms, the block driver looks at each bio and decides how many pages (at most) of memalloc reserve could be needed to fully service the transfer, and translates that requirement into generic units for use by the block layer. The block layer compares the generic units to a generic bound provided by the block driver at initialization time and decides whether to let the bio transfer in question proceed, or hold it back.
>
> This idea is Simplicity itself. Some not so obvious details follow.
>
> For one thing, a block device these days may not be just a single device, but may be a stack of devices connected together by a generic mechanism such as device mapper, or a hardcoded stack such as multi-disk or network block device. It is necessary to consider the resource requirements of the stack as a whole _before_ letting a transfer proceed into any layer of the stack, otherwise deadlock on many partially completed transfers becomes a possibility. For this reason, the bio throttling is only implemented at the initial, highest level submission of the bio to the block layer and not for any recursive submission of the same bio to a lower level block device in a stack.
>
> This in turn has rather far reaching implications: the top level device in a stack must take care of inspecting the entire stack in order to determine how to calculate its resource requirements, thus becoming the boss device for the entire stack. Though this intriguing idea could easily become the cause of endless design work and many thousands of lines of fancy code, today I sidestep the question entirely using the "just provide lots of reserve" strategy. Horrifying as it may seem to some, this is precisely the strategy that Linux has used in the context of resource management in general, from the very beginning and likely continuing for quite some time into the future. My strongly held opinion in this matter is that we need to solve the real, underlying problems definitively with nice code before declaring the opening of fancy patch season. So I am leaving further discussion of automatic resource discovery algorithms and the like out of this post.

Rather than asking the stack "how much memory will this request consume" you could instead ask "how much memory are you currently using".

ie: on entry to the stack, do

	current->account_block_allocations = 1;
	make_request(...);
	rq->used_memory += current->pages_used_for_block_allocations;

and in the page allocator do

	if (!in_interrupt() && current->account_block_allocations)
		current->pages_used_for_block_allocations++;

and then somehow handle deallocation too ;)

The basic idea being to know in real time how much memory a particular block stack is presently using. Then, on entry to that stack, if the stack's current usage is too high, wait for it to subside.

otoh we already have mechanisms for limiting the number of requests in flight. This is approximately proportional to the amount of memory which was allocated to service those requests. Why not just use that?

> @@ -3221,6 +3221,13 @@ static inline void __generic_make_reques
>  	if (bio_check_eod(bio, nr_sectors))
>  		goto end_io;
>
> +	if (q && q->metric && !bio->bi_queue) {
> +		int need = bio->bi_throttle = q->metric(bio);
> +		bio->bi_queue = q;
> +		/* FIXME: potential race if atomic_sub is called in the middle of condition check */
> +		wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);

This will fall straight through if signal_pending() and (I assume) bad stuff will happen. uninterruptible sleep, methinks.
[RFC] [PATCH] A clean approach to writeout throttling
Good afternoon,

According to me, each line of code removed from the kernel is worth ten lines added. If lines can be removed while at the same time improving performance, that is worth, ah, about 1,000 times more than lines added, correct? Or maybe 1,000,000 times as much, if removing lines removes a deadlock at the same time, something like that.

Today's patch is a step towards removing many lines of code from mainline and also aims to improve write cache performance, eliminate a troublesome class of deadlocks, save kernel memory and make the kernel easier to read.

Background

We seriously began to tackle the issue of block writeout vm deadlock more than three years ago, and by now we have accreted a creaky agglomeration of dirty page limits, dirty page balancing between block devices, and miscellaneous other hacks attempting to solve these deadlocks. These bandaids as of 2.6.23 do not in themselves address the underlying problem, but they do add a lot of code to core kernel and they do impair Linux's write cache performance. This is a classic case of fixing symptoms instead of the cause of a problem. The worst part of it? The dirty page limit idea did not actually fix the deadlock it was supposed to, but instead generated a new class of deadlocks that are easier to trigger.

The basis of the writeout deadlock scenario is easy to see: a task requests a page of memory, but all pages are currently in use for caching among other things. To recover memory, some disk-backed pages must be evicted. Dirty pages must be written to disk before being evicted, so they are passed to the block layer for writeout. If any code in the block writeout path needs to allocate memory to do its work, then we can deadlock because the shortage of memory prevents the block layer from making progress to recover memory.

So far, I have just summarized what everybody knows. This deadlock scenario has been present in Linux since day one, and has been solved for local block devices since very early on, by the simple expedient of providing a reserve of "memalloc" pages that a task may access only if it is in the call chain of a memory manager task engaged in writing out dirty pages.

What was not commonly known three years ago is that the memalloc reserve solution fails in some cases, all of which share the common attribute that the task trying to allocate memory for block writeout is not the same as the task that initiated the writeout. In this case, the "PF_MEMALLOC" task flag strategy fails because the consumer of the memory is not in the call chain of the submitter, and thus does not inherit the flag. Examples of such deadlock-prone use cases include:

* Fancy virtual block devices with helper daemons
* Network block device accessed locally
* Swap over network
* Remote block devices in general
* Any block device with a user space component
* Filesystems implemented in user space

To put it bluntly: without solving these deadlocks, Linux is pretty much useless in many modern storage roles. When I started working on this problem years ago, I went at it by tackling the nastiest, most icky manifestation of it first, namely the possibility that the network layer might be unable to allocate memory to receive a reply packet from a remote block device. Unfortunately, the tricky details of the solution to this problem had the unintended effect of overshadowing the main issues, just because the details of the network receive deadlock are so deliciously obscure. Ironically, we found that the network read deadlock scenario does not occur in practice. It is high time to draw attention back to the main issue.

The meta-mistake I made while tackling this problem was to focus mainly on the logistics of providing "writeout helper" tasks with access to memory reserves, and just assume that it would be easy to place limits on how deep the helpers would dip into the reserves, which restriction is obviously necessary to prevent deadlock. This was so obvious in fact that I (we) did not get around to implementing it until quite recently. Then it became abundantly clear that limiting resource requirements is actually the main ingredient of any correct solution, and that the various complexities of providing access to memory reserves do not amount to much more than an interesting side show. We (Zumastor team) proved this to ourselves by removing the entire "PeterZ" patch set we were carrying (a distant descendant of my original network deadlock prevention patch) and lo! No deadlocks. It seems that throttling writeout traffic in an organized way amounts to powerful magic indeed.

So that is all by way of saying, today's patch to limit the amount of data in flight to a block device is an Important Patch. We find that our own storage project needs very little more than this hundred lines of code or so to wave goodbye permanently to writeout deadlocks,