Re: [PATCH 2/2] block: fix leak of q->rq_wb

2017-03-27 Thread Omar Sandoval
On Tue, Mar 28, 2017 at 11:43:04AM +0800, Ming Lei wrote: > On Mon, Mar 27, 2017 at 10:43:59AM -0700, Omar Sandoval wrote: > > From: Omar Sandoval > > > > CONFIG_DEBUG_TEST_DRIVER_REMOVE found a possible leak of q->rq_wb in a > > couple of cases: when a request queue is

Re: [PATCH 2/2] block: fix leak of q->rq_wb

2017-03-27 Thread Ming Lei
On Mon, Mar 27, 2017 at 10:43:59AM -0700, Omar Sandoval wrote: > From: Omar Sandoval > > CONFIG_DEBUG_TEST_DRIVER_REMOVE found a possible leak of q->rq_wb in a > couple of cases: when a request queue is reregistered and when gendisks > share a request_queue. This has been a

Re: [PATCH 2/2] block: fix leak of q->rq_wb

2017-03-27 Thread Omar Sandoval
On Mon, Mar 27, 2017 at 10:43:59AM -0700, Omar Sandoval wrote: > From: Omar Sandoval > > CONFIG_DEBUG_TEST_DRIVER_REMOVE found a possible leak of q->rq_wb in a > couple of cases: when a request queue is reregistered and when gendisks > share a request_queue. This has been a

Re: [dm-devel] [PATCH v3] block: trace completion of all bios.

2017-03-27 Thread NeilBrown
On Mon, Mar 27 2017, Christoph Hellwig wrote: > On Mon, Mar 27, 2017 at 08:49:57PM +1100, NeilBrown wrote: >> On Mon, Mar 27 2017, Christoph Hellwig wrote: >> >> > I don't really like the flag at all. I'd much prefer a __bio_endio >> > with a 'bool trace' flag. Also please remove the manual

Re: [PATCH 7/7] sd: use ZERO_PAGE for WRITE_SAME payloads

2017-03-27 Thread Bart Van Assche
On Mon, 2017-03-20 at 16:43 -0400, Christoph Hellwig wrote: > We're never touching the contents of the page, so save a memory > allocation for these cases. > > Signed-off-by: Christoph Hellwig > --- > drivers/scsi/sd.c | 7 --- > 1 file changed, 4 insertions(+), 3 deletions(-)

Re: [PATCH 6/7] sd: support multi-range TRIM for ATA disks

2017-03-27 Thread Bart Van Assche
On Mon, 2017-03-20 at 16:43 -0400, Christoph Hellwig wrote: > @@ -698,13 +698,19 @@ static void sd_config_discard(struct scsi_disk *sdkp, > unsigned int mode) > break; > > case SD_LBP_ATA_TRIM: > - max_blocks = 65535 * (512 / sizeof(__le64)); > +

[PATCH 3/3] blk-throttle: add latency target support

2017-03-27 Thread Shaohua Li
One hard problem adding .low limit is to detect idle cgroup. If one cgroup doesn't dispatch enough IO against its low limit, we must have a mechanism to determine if other cgroups dispatch more IO. We added the think time detection mechanism before, but it doesn't work for all workloads. Here we

Re: [PATCH 5/7] block: add a max_discard_segment_size queue limit

2017-03-27 Thread Bart Van Assche
On Mon, 2017-03-20 at 16:43 -0400, Christoph Hellwig wrote: > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index 5a7da607ca04..3b3bd646f580 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -333,6 +333,7 @@ struct queue_limits { > unsigned short

[PATCH 2/3] blk-throttle: add a mechanism to estimate IO latency

2017-03-27 Thread Shaohua Li
User configures latency target, but the latency threshold for each request size isn't fixed. For a SSD, the IO latency highly depends on request size. To calculate latency threshold, we sample some data, eg, average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency threshold of each

[PATCH 1/3] block: track request size in blk_issue_stat

2017-03-27 Thread Shaohua Li
Currently there is no way to know the request size when the request is finished. Next patch will need this info. We could add extra field to record the size, but blk_issue_stat has enough space to record it, so this patch just overloads blk_issue_stat. With this, we will have 49bits to track time,

[PATCH 0/3] blk-throttle: add .low limit fix

2017-03-27 Thread Shaohua Li
Jens, The 3 patches replace the last two patches (patch 17/18) of the series I sent out today. The new patch overloads blk stat as you suggested. Thanks, Shaohua Shaohua Li (3): block: track request size in blk_issue_stat blk-throttle: add a mechanism to estimate IO latency blk-throttle:

Re: [PATCH 2/7] sd: provide a new ata trim provisioning mode

2017-03-27 Thread Bart Van Assche
On Mon, 2017-03-20 at 16:43 -0400, Christoph Hellwig wrote: > + case SD_LBP_ATA_TRIM: > + max_blocks = 65535 * (512 / sizeof(__le64)); > + if (sdkp->device->ata_trim_zeroes_data) > + q->limits.discard_zeroes_data = 1; > + break; Do we

Re: [PATCH 1/7] ѕd: split sd_setup_discard_cmnd

2017-03-27 Thread Bart Van Assche
On Mon, 2017-03-20 at 16:43 -0400, Christoph Hellwig wrote: > + u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9); > + u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9); Although I know this is an issue in the existing code and not something introduced by

v4.11-rc blk-mq lockup?

2017-03-27 Thread Bart Van Assche
Hello Jens, If I leave the srp-test software running for a few minutes using the following command: # while ~bart/software/infiniband/srp-test/run_tests -d -r 30; do :; done then after some time the following complaint appears for multiple kworkers: INFO: task kworker/9:0:65 blocked for more

Re: [PATCH V7 00/18] blk-throttle: add .low limit

2017-03-27 Thread Jens Axboe
On 03/27/2017 01:00 PM, Shaohua Li wrote: > On Mon, Mar 27, 2017 at 12:15:29PM -0600, Jens Axboe wrote: >> On 03/27/2017 11:51 AM, Shaohua Li wrote: >>> V6->V7: >>> - Don't overload blk stat, which will simplify the code. This will add extra >>> space in bio/request though with the low interface

Re: [PATCH V7 00/18] blk-throttle: add .low limit

2017-03-27 Thread Shaohua Li
On Mon, Mar 27, 2017 at 12:15:29PM -0600, Jens Axboe wrote: > On 03/27/2017 11:51 AM, Shaohua Li wrote: > > V6->V7: > > - Don't overload blk stat, which will simplify the code. This will add extra > > space in bio/request though with the low interface configure option on. > > Hmm, why? As far

[PATCH V7 07/18] blk-throttle: add downgrade logic

2017-03-27 Thread Shaohua Li
When queue state machine is in LIMIT_MAX state, but a cgroup is below its low limit for some time, the queue should be downgraded to lower state as one cgroup's low limit isn't met. Signed-off-by: Shaohua Li --- block/blk-throttle.c | 156

[PATCH V7 00/18] blk-throttle: add .low limit

2017-03-27 Thread Shaohua Li
Hi, cgroup still lacks a good iocontroller. CFQ works well for hard disk, but not much for SSD. This patch set try to add a conservative limit for blk-throttle. It isn't a proportional scheduling, but can help prioritize cgroups. There are several advantages we choose blk-throttle: - blk-throttle

[PATCH V7 13/18] blk-throttle: add a simple idle detection

2017-03-27 Thread Shaohua Li
A cgroup gets assigned a low limit, but the cgroup could never dispatch enough IO to cross the low limit. In such case, the queue state machine will remain in LIMIT_LOW state and all other cgroups will be throttled according to low limit. This is unfair for other cgroups. We should treat the

Re: [PATCH] blkcg: allocate struct blkcg_gq outside request queue spinlock

2017-03-27 Thread Tahsin Erdogan
On Sun, Mar 26, 2017 at 3:54 AM, Julia Lawall wrote: > Is an unlock needed before line 848? Or does blkg_lookup_check take care > of it? Unlock is not needed. On success, function returns with locks held. It is documented at line 805: "... This function returns with RCU

[PATCH V7 16/18] blk-throttle: add interface for per-cgroup target latency

2017-03-27 Thread Shaohua Li
Here we introduce per-cgroup latency target. The target determines how a cgroup can afford latency increasement. We will use the target latency to calculate a threshold and use it to schedule IO for cgroups. If a cgroup's bandwidth is below its low limit but its average latency is below the

[PATCH V7 06/18] blk-throttle: add upgrade logic for LIMIT_LOW state

2017-03-27 Thread Shaohua Li
When queue is in LIMIT_LOW state and all cgroups with low limit cross the bps/iops limitation, we will upgrade queue's state to LIMIT_MAX. To determine if a cgroup exceeds its limitation, we check if the cgroup has pending request. Since cgroup is throttled according to the limit, pending request

[PATCH V7 03/18] blk-throttle: add configure option for new .low interface

2017-03-27 Thread Shaohua Li
As discussed in LSF, add configure option for the interface and mark it as experimental, so people can try/test. Signed-off-by: Shaohua Li --- block/Kconfig | 12 1 file changed, 12 insertions(+) diff --git a/block/Kconfig b/block/Kconfig index e9f780f..89cd28f 100644

Re: [BUG] Deadlock due due to interactions of block, RCU, and cpu offline

2017-03-27 Thread Paul E. McKenney
On Mon, Mar 27, 2017 at 12:02:27PM -0600, Jeffrey Hugo wrote: > Hi Paul. > > Thanks for the quick reply. > > On 3/26/2017 5:28 PM, Paul E. McKenney wrote: > >On Sun, Mar 26, 2017 at 05:10:40PM -0600, Jeffrey Hugo wrote: > > >>It is a race between this work running, and the cpu offline

Re: [PATCH V7 00/18] blk-throttle: add .low limit

2017-03-27 Thread Jens Axboe
On 03/27/2017 11:51 AM, Shaohua Li wrote: > V6->V7: > - Don't overload blk stat, which will simplify the code. This will add extra > space in bio/request though with the low interface configure option on. Hmm, why? As far as I can see, it just makes things worse. -- Jens Axboe

[PATCH V7 09/18] blk-throttle: make throtl_slice tunable

2017-03-27 Thread Shaohua Li
throtl_slice is important for blk-throttling. It's called slice internally but it really is a time window blk-throttling samples data. blk-throttling will make decision based on the samplings. An example is bandwidth measurement. A cgroup's bandwidth is measured in the time interval of

[PATCH V7 08/18] blk-throttle: make sure expire time isn't too big

2017-03-27 Thread Shaohua Li
cgroup could be throttled to a limit but when all cgroups cross high limit, queue enters a higher state and so the group should be throttled to a higher limit. It's possible the cgroup is sleeping because of throttle and other cgroups don't dispatch IO any more. In this case, nobody can trigger

[PATCH V7 18/18] blk-throttle: add latency target support

2017-03-27 Thread Shaohua Li
One hard problem adding .low limit is to detect idle cgroup. If one cgroup doesn't dispatch enough IO against its low limit, we must have a mechanism to determine if other cgroups dispatch more IO. We added the think time detection mechanism before, but it doesn't work for all workloads. Here we

[PATCH V7 15/18] blk-throttle: ignore idle cgroup limit

2017-03-27 Thread Shaohua Li
Last patch introduces a way to detect idle cgroup. We use it to make upgrade/downgrade decision. And the new algorithm can detect completely idle cgroup too, so we can delete the corresponding code. Signed-off-by: Shaohua Li --- block/blk-throttle.c | 40

[PATCH V7 11/18] blk-throttle: detect completed idle cgroup

2017-03-27 Thread Shaohua Li
cgroup could be assigned a limit, but doesn't dispatch enough IO, eg the cgroup is idle. When this happens, the cgroup doesn't hit its limit, so we can't move the state machine to higher level and all cgroups will be throttled to their lower limit, so we waste bandwidth. Detecting idle cgroup is

[PATCH V7 12/18] blk-throttle: make bandwidth change smooth

2017-03-27 Thread Shaohua Li
When cgroups all reach low limit, cgroups can dispatch more IO. This could make some cgroups dispatch more IO but others not, and even some cgroups could dispatch less IO than their low limit. For example, cg1 low limit 10MB/s, cg2 limit 80MB/s, assume disk maximum bandwidth is 120M/s for the

[PATCH 2/2] block: fix leak of q->rq_wb

2017-03-27 Thread Omar Sandoval
From: Omar Sandoval CONFIG_DEBUG_TEST_DRIVER_REMOVE found a possible leak of q->rq_wb in a couple of cases: when a request queue is reregistered and when gendisks share a request_queue. This has been a problem since wbt was introduced, but the WARN_ON(!list_empty(>callbacks)) in

[PATCH 1/2] blk-mq: fix leak of q->stats

2017-03-27 Thread Omar Sandoval
From: Omar Sandoval blk_alloc_queue_node() already allocates q->stats, so blk_mq_init_allocated_queue() is overwriting it with a new allocation. Fixes: a83b576c9c25 ("block: fix stacked driver stats init and free") Signed-off-by: Omar Sandoval ---

Re: [PATCH v3] block: trace completion of all bios.

2017-03-27 Thread Christoph Hellwig
On Mon, Mar 27, 2017 at 08:49:57PM +1100, NeilBrown wrote: > On Mon, Mar 27 2017, Christoph Hellwig wrote: > > > I don't really like the flag at all. I'd much prefer a __bio_endio > > with a 'bool trace' flag. Also please remove the manual tracing in > > dm.ċ. Once that is done I suspect we

Re: [PATCH 1/2] blk-stat: convert blk-stat bucket callback to signed

2017-03-27 Thread Omar Sandoval
On Mon, Mar 27, 2017 at 04:00:08PM +, Stephen Bates wrote: > Thanks for the review Omar! > > >> > >> -unsigned int blk_stat_rq_ddir(const struct request *rq) > >> +int blk_stat_rq_ddir(const struct request *rq) > >> { > >> - return rq_data_dir(rq); > >> + return (int)rq_data_dir(rq); >

Re: [PATCH 1/2] blk-stat: convert blk-stat bucket callback to signed

2017-03-27 Thread Stephen Bates
Thanks for the review Omar! >> >> -unsigned int blk_stat_rq_ddir(const struct request *rq) >> +int blk_stat_rq_ddir(const struct request *rq) >> { >> -return rq_data_dir(rq); >> +return (int)rq_data_dir(rq); > >The cast here here isn't necessary, let's leave it off. > OK, I will add

Re: [PATCH v3 4/4] block: block new I/O just after queue is set as dying

2017-03-27 Thread Bart Van Assche
On Mon, 2017-03-27 at 20:06 +0800, Ming Lei wrote: > Before commit 780db2071a(blk-mq: decouble blk-mq freezing > from generic bypassing), the dying flag is checked before > entering queue, and Tejun converts the checking into .mq_freeze_depth, > and assumes the counter is increased just after

Re: [PATCH v3 2/4] block: add a read barrier in blk_queue_enter()

2017-03-27 Thread Bart Van Assche
On Mon, 2017-03-27 at 20:06 +0800, Ming Lei wrote: > diff --git a/block/blk-core.c b/block/blk-core.c > index ad388d5e309a..5e8963bc98d9 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -669,6 +669,15 @@ int blk_queue_enter(struct request_queue *q, bool nowait) > if

Re: [Drbd-dev] RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload

2017-03-27 Thread Bart Van Assche
On Mon, 2017-03-27 at 10:03 -0400, Mike Snitzer wrote: > As for the blkdev_issue_zeroout() resorting to manually zeroing the > range, if the discard fails or dzd not supported, that certainly > requires DM thinp to implement manual zeroing of the head and tail of > the range if partial blocks are

Re: RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload

2017-03-27 Thread Christoph Hellwig
On Mon, Mar 27, 2017 at 10:03:07AM -0400, Mike Snitzer wrote: > By "you" I assume you're referring to Lars? Yes. > Lars' approach for discard, > when drbd is layered on dm-thinp, feels over-engineered. Not his fault, > the way discard and zeroing got conflated certainly lends itself to > these

Re: RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload

2017-03-27 Thread Mike Snitzer
On Mon, Mar 27 2017 at 5:10am -0400, Christoph Hellwig wrote: > It sounds like you don't want to support traditional discard at all, > but only WRITE ZEROES. So in many ways this series is the right way > forward. It would be nice if we could do a full blown >

Re: [PATCH v3 2/4] block: add a read barrier in blk_queue_enter()

2017-03-27 Thread Johannes Thumshirn
On Mon, Mar 27, 2017 at 08:06:56PM +0800, Ming Lei wrote: > Without the barrier, reading DEAD flag of .q_usage_counter > and reading .mq_freeze_depth may be reordered, then the > following wait_event_interruptible() may never return. > > Reviewed-by: Hannes Reinecke >

Re: [PATCH v3 4/4] block: block new I/O just after queue is set as dying

2017-03-27 Thread Johannes Thumshirn
On Mon, Mar 27, 2017 at 08:06:58PM +0800, Ming Lei wrote: > Before commit 780db2071a(blk-mq: decouble blk-mq freezing > from generic bypassing), the dying flag is checked before > entering queue, and Tejun converts the checking into .mq_freeze_depth, > and assumes the counter is increased just

Re: [PATCH v3 3/4] block: rename blk_mq_freeze_queue_start()

2017-03-27 Thread Johannes Thumshirn
On Mon, Mar 27, 2017 at 08:06:57PM +0800, Ming Lei wrote: > As the .q_usage_counter is used by both legacy and > mq path, we need to block new I/O if queue becomes > dead in blk_queue_enter(). > > So rename it and we can use this function in both > paths. > > Reviewed-by: Bart Van Assche

Re: [PATCH v3 1/4] blk-mq: comment on races related with timeout handler

2017-03-27 Thread Johannes Thumshirn
On Mon, Mar 27, 2017 at 08:06:55PM +0800, Ming Lei wrote: > This patch adds comment on two races related with > timeout handler: > > - requeue from queue busy vs. timeout > - rq free & reallocation vs. timeout > > Both the races themselves and current solution aren't > explicit enough, so add

[PATCH v3 3/4] block: rename blk_mq_freeze_queue_start()

2017-03-27 Thread Ming Lei
As the .q_usage_counter is used by both legacy and mq path, we need to block new I/O if queue becomes dead in blk_queue_enter(). So rename it and we can use this function in both paths. Reviewed-by: Bart Van Assche Reviewed-by: Hannes Reinecke

[PATCH v3 2/4] block: add a read barrier in blk_queue_enter()

2017-03-27 Thread Ming Lei
Without the barrier, reading DEAD flag of .q_usage_counter and reading .mq_freeze_depth may be reordered, then the following wait_event_interruptible() may never return. Reviewed-by: Hannes Reinecke Signed-off-by: Ming Lei --- block/blk-core.c | 9

[PATCH v3 0/4] block: misc changes

2017-03-27 Thread Ming Lei
Hi, The 1st patch add comments on blk-mq races with timeout handler. The other 3 patches improves handling for dying queue: - the 2nd one adds one barrier in blk_queue_enter() for avoiding hanging caused by out-of-order - the 3rd and 4th patches block new I/O entering

[PATCH v3 1/4] blk-mq: comment on races related with timeout handler

2017-03-27 Thread Ming Lei
This patch adds comment on two races related with timeout handler: - requeue from queue busy vs. timeout - rq free & reallocation vs. timeout Both the races themselves and current solution aren't explicit enough, so add comments on them. Cc: Bart Van Assche

[PATCH v3 4/4] block: block new I/O just after queue is set as dying

2017-03-27 Thread Ming Lei
Before commit 780db2071a(blk-mq: decouble blk-mq freezing from generic bypassing), the dying flag is checked before entering queue, and Tejun converts the checking into .mq_freeze_depth, and assumes the counter is increased just after dying flag is set. Unfortunately we doesn't do that in

Re: [PATCH v2 2/4] block: add a read barrier in blk_queue_enter()

2017-03-27 Thread Ming Lei
On Sat, Mar 25, 2017 at 2:45 AM, Bart Van Assche wrote: > On Sat, 2017-03-25 at 01:38 +0800, Ming Lei wrote: >> As I explained, the dying flag should only be mentioned after we change >> the code in blk_set_queue_dying(). > > Hello Ming, > > If patches 2 and 4 would be

Re: [RFC PATCH 00/28] INFINIBAND NETWORK BLOCK DEVICE (IBNBD)

2017-03-27 Thread Jinpu Wang
Hi Sagi, On Mon, Mar 27, 2017 at 4:20 AM, Sagi Grimberg wrote: > >> This series introduces IBNBD/IBTRS kernel modules. >> >> IBNBD (InfiniBand network block device) allows for an RDMA transfer of >> block IO >> over InfiniBand network. The driver presents itself as a block

Re: [PATCH v3 02/14] md: move two macros into md.h

2017-03-27 Thread NeilBrown
On Mon, Mar 27 2017, Christoph Hellwig wrote: > On Fri, Mar 24, 2017 at 09:53:25AM -0700, Shaohua Li wrote: >> >> I had the same concern when I looked at this patch firstly. The number for >> raid1/10 doesn't need to be the same. But if we don't move the number to a >> generic header, the third

Re: [PATCH v3] block: trace completion of all bios.

2017-03-27 Thread NeilBrown
On Mon, Mar 27 2017, Christoph Hellwig wrote: > I don't really like the flag at all. I'd much prefer a __bio_endio > with a 'bool trace' flag. Also please remove the manual tracing in > dm.ċ. Once that is done I suspect we can also remove the > block_bio_complete export. Can you say why you

Re: [PATCH v3 02/14] md: move two macros into md.h

2017-03-27 Thread Christoph Hellwig
On Fri, Mar 24, 2017 at 09:53:25AM -0700, Shaohua Li wrote: > > I had the same concern when I looked at this patch firstly. The number for > raid1/10 doesn't need to be the same. But if we don't move the number to a > generic header, the third patch will become a little more complicated. I >

Re: [PATCH] blk-mq: Add NULL pointer check for HW dispatch queue

2017-03-27 Thread Christoph Hellwig
On Mon, Mar 20, 2017 at 03:10:01PM +0530, Jitendra Bhivare wrote: > As part of blk_mq_realloc_hw_ctx(), if the init_hctx() ops is > failed by the underyling transport, the hctx pointer is freed and > initialized to NULL. > However, functions down the line, access this hwctx pointer without > a

Re: [PATCH v3 01/14] md: raid1/raid10: don't handle failure of bio_add_page()

2017-03-27 Thread Christoph Hellwig
Looks fine, Reviewed-by: Christoph Hellwig

Re: [PATCH 06/23] dm-kcopyd: switch to use REQ_OP_WRITE_ZEROES

2017-03-27 Thread Christoph Hellwig
On Thu, Mar 23, 2017 at 11:10:38AM -0400, Mike Snitzer wrote: > Not sure why you've split out the dm-kcopyd patch, likely best to just > fold it into the previous dm support patch. The dm-kcopyd patch switches to using WRITE_ZEROES instead of write same for dm as user of these interfaces. The

Re: RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload

2017-03-27 Thread Christoph Hellwig
It sounds like you don't want to support traditional discard at all, but only WRITE ZEROES. So in many ways this series is the right way forward. It would be nice if we could do a full blown REQ_OP_WRITE_ZEROES for dm_think that zeroes out partial blocks, similar to what hardware that implements

Re: [PATCH] fs: remove _submit_bh()

2017-03-27 Thread Christoph Hellwig
Looks fine, Reviewed-by: Christoph Hellwig

Re: [PATCH v3] block: trace completion of all bios.

2017-03-27 Thread Christoph Hellwig
I don't really like the flag at all. I'd much prefer a __bio_endio with a 'bool trace' flag. Also please remove the manual tracing in dm.ċ. Once that is done I suspect we can also remove the block_bio_complete export.