On Tue, Mar 28, 2017 at 11:43:04AM +0800, Ming Lei wrote:
> On Mon, Mar 27, 2017 at 10:43:59AM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval
> >
> > CONFIG_DEBUG_TEST_DRIVER_REMOVE found a possible leak of q->rq_wb in a
> > couple of cases: when a request queue is
On Mon, Mar 27, 2017 at 10:43:59AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval
>
> CONFIG_DEBUG_TEST_DRIVER_REMOVE found a possible leak of q->rq_wb in a
> couple of cases: when a request queue is reregistered and when gendisks
> share a request_queue. This has been a
On Mon, Mar 27 2017, Christoph Hellwig wrote:
> On Mon, Mar 27, 2017 at 08:49:57PM +1100, NeilBrown wrote:
>> On Mon, Mar 27 2017, Christoph Hellwig wrote:
>>
>> > I don't really like the flag at all. I'd much prefer a __bio_endio
>> > with a 'bool trace' flag. Also please remove the manual
On Mon, 2017-03-20 at 16:43 -0400, Christoph Hellwig wrote:
> We're never touching the contents of the page, so save a memory
> allocation for these cases.
>
> Signed-off-by: Christoph Hellwig
> ---
> drivers/scsi/sd.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
On Mon, 2017-03-20 at 16:43 -0400, Christoph Hellwig wrote:
> @@ -698,13 +698,19 @@ static void sd_config_discard(struct scsi_disk *sdkp,
> unsigned int mode)
> break;
>
> case SD_LBP_ATA_TRIM:
> - max_blocks = 65535 * (512 / sizeof(__le64));
> +
One hard problem in adding a .low limit is detecting an idle cgroup. If
one cgroup doesn't dispatch enough IO against its low limit, we must
have a mechanism to determine whether other cgroups may dispatch more
IO. We added the think time detection mechanism before, but it doesn't
work for all workloads. Here we
On Mon, 2017-03-20 at 16:43 -0400, Christoph Hellwig wrote:
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 5a7da607ca04..3b3bd646f580 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -333,6 +333,7 @@ struct queue_limits {
> unsigned short
The user configures a latency target, but the latency threshold for
each request size isn't fixed. For an SSD, IO latency highly depends on
request size. To calculate latency thresholds, we sample some data,
e.g. the average latency for request sizes 4k, 8k, 16k, 32k .. 1M. The
latency threshold of each
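A hedged sketch of what such per-size sampling could look like (the
bucket layout and all names are illustrative assumptions, not the
posted patch):

#define LAT_BUCKETS 9	/* 4k, 8k, 16k, ..., 1M */

struct lat_bucket {
	u64 total_ns;	/* sum of completion latencies */
	u64 nr;		/* number of samples */
};

static struct lat_bucket lat_table[LAT_BUCKETS];

static int size_to_bucket(unsigned int bytes)
{
	int idx = ilog2(bytes) - 12;	/* 4k -> 0, 8k -> 1, ..., 1M -> 8 */

	return clamp(idx, 0, LAT_BUCKETS - 1);
}

static void lat_record(unsigned int bytes, u64 lat_ns)
{
	struct lat_bucket *b = &lat_table[size_to_bucket(bytes)];

	b->total_ns += lat_ns;
	b->nr++;
}

/* The per-size threshold is then derived from the measured average
 * (combined with the configured latency target, omitted here). */
static u64 lat_threshold(unsigned int bytes)
{
	struct lat_bucket *b = &lat_table[size_to_bucket(bytes)];

	return b->nr ? div64_u64(b->total_ns, b->nr) : 0;
}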
Currently there is no way to know the request size when the request is
finished. The next patch will need this info. We could add an extra
field to record the size, but blk_issue_stat has enough spare space to
record it, so this patch just overloads blk_issue_stat. With this, we
will have 49 bits to track time,
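The packing idea, as a minimal sketch (constants and helper names are
assumptions for illustration; the real layout lives in blk_issue_stat):

#define ISSUE_STAT_TIME_BITS	49
#define ISSUE_STAT_TIME_MASK	((1ULL << ISSUE_STAT_TIME_BITS) - 1)

/* Store the size bucket in the bits above the 49-bit timestamp. */
static inline u64 issue_stat_pack(u64 time, u64 size_bucket)
{
	return (size_bucket << ISSUE_STAT_TIME_BITS) |
	       (time & ISSUE_STAT_TIME_MASK);
}

static inline u64 issue_stat_time(u64 stat)
{
	return stat & ISSUE_STAT_TIME_MASK;
}

static inline u64 issue_stat_size(u64 stat)
{
	return stat >> ISSUE_STAT_TIME_BITS;
}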
Jens,
The 3 patches replace the last two patches (patch 17/18) of the series I sent
out today. The new patch overloads blk stat as you suggested.
Thanks,
Shaohua
Shaohua Li (3):
block: track request size in blk_issue_stat
blk-throttle: add a mechanism to estimate IO latency
blk-throttle:
On Mon, 2017-03-20 at 16:43 -0400, Christoph Hellwig wrote:
> + case SD_LBP_ATA_TRIM:
> + max_blocks = 65535 * (512 / sizeof(__le64));
> + if (sdkp->device->ata_trim_zeroes_data)
> + q->limits.discard_zeroes_data = 1;
> + break;
Do we
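For reference, the arithmetic in the quoted hunk: each ATA TRIM range
entry is an 8-byte __le64 encoding a 48-bit LBA plus a 16-bit sector
count (at most 65535 sectors), so one 512-byte TRIM payload block holds
512 / 8 = 64 entries and can describe up to 65535 * 64 = 4,194,240
sectors.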
On Mon, 2017-03-20 at 16:43 -0400, Christoph Hellwig wrote:
> + u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
> + u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
Although I know this is an issue in the existing code and not something
introduced by
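For reference: blk_rq_pos() and blk_rq_sectors() report position and
length in 512-byte units, so the shift by ilog2(sdp->sector_size) - 9
converts them into device logical blocks. With a 4096-byte logical
sector, for example, ilog2(4096) - 9 = 3, i.e. the 512-byte-unit values
are divided by 8.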
Hello Jens,
If I leave the srp-test software running for a few minutes using the
following command:
# while ~bart/software/infiniband/srp-test/run_tests -d -r 30; do :; done
then after some time the following complaint appears for multiple
kworkers:
INFO: task kworker/9:0:65 blocked for more
On 03/27/2017 01:00 PM, Shaohua Li wrote:
> On Mon, Mar 27, 2017 at 12:15:29PM -0600, Jens Axboe wrote:
>> On 03/27/2017 11:51 AM, Shaohua Li wrote:
>>> V6->V7:
>>> - Don't overload blk stat, which will simplify the code. This will add extra
>>> space in bio/request though with the low interface
On Mon, Mar 27, 2017 at 12:15:29PM -0600, Jens Axboe wrote:
> On 03/27/2017 11:51 AM, Shaohua Li wrote:
> > V6->V7:
> > - Don't overload blk stat, which will simplify the code. This will add extra
> > space in bio/request though with the low interface configure option on.
>
> Hmm, why? As far
When the queue state machine is in the LIMIT_MAX state but a cgroup
stays below its low limit for some time, the queue should be downgraded
to a lower state, since that cgroup's low limit isn't met.
Signed-off-by: Shaohua Li
---
block/blk-throttle.c | 156
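Schematically, the downgrade check could look like this (a hedged
sketch; the field and helper names, and the grace period, are
assumptions rather than the real logic in block/blk-throttle.c):

static bool tg_should_downgrade(struct throtl_grp *tg, unsigned long now,
				unsigned long grace_jiffies)
{
	/* The cgroup is currently under its low limit... */
	if (!tg_below_low_limit(tg))
		return false;

	/* ...and has been under it for longer than a grace period. */
	return time_after(now, tg->last_low_overflow_time + grace_jiffies);
}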
Hi,
cgroup still lacks a good IO controller. CFQ works well for hard disks,
but not so well for SSDs. This patch set tries to add a conservative
limit for blk-throttle. It isn't proportional scheduling, but it can
help prioritize cgroups. There are several advantages to choosing
blk-throttle:
- blk-throttle
A cgroup gets assigned a low limit, but the cgroup could never dispatch
enough IO to cross the low limit. In such a case, the queue state
machine will remain in the LIMIT_LOW state and all other cgroups will
be throttled according to their low limits. This is unfair to the other
cgroups. We should treat the
On Sun, Mar 26, 2017 at 3:54 AM, Julia Lawall wrote:
> Is an unlock needed before line 848? Or does blkg_lookup_check take care
> of it?
Unlock is not needed. On success, the function returns with the locks
held. It is documented at line 805:
"... This function returns with RCU
Here we introduce a per-cgroup latency target. The target determines
how much latency increase a cgroup can afford. We will use the target
latency to calculate a threshold and use it to schedule IO for cgroups.
If a cgroup's bandwidth is below its low limit but its average latency
is below the
When the queue is in the LIMIT_LOW state and all cgroups with a low
limit cross the bps/iops limitation, we will upgrade the queue's state
to LIMIT_MAX. To determine if a cgroup exceeds its limitation, we check
if the cgroup has pending requests. Since the cgroup is throttled
according to the limit, pending request
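A hedged sketch of the upgrade condition (the list and helper names are
assumptions): every cgroup with a low limit must either have crossed it
or still have IO queued, since a throttled cgroup with pending requests
clearly wants more than its limit:

static bool throtl_can_upgrade(struct throtl_data *td)
{
	struct throtl_grp *tg;

	list_for_each_entry(tg, &td->low_limit_groups, low_node) {
		if (!tg_crossed_low_limit(tg) && !tg_has_queued_io(tg))
			return false;
	}
	return true;
}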
As discussed at LSF, add a config option for the interface and mark it
as experimental, so people can try/test it.
Signed-off-by: Shaohua Li
---
block/Kconfig | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/block/Kconfig b/block/Kconfig
index e9f780f..89cd28f 100644
On Mon, Mar 27, 2017 at 12:02:27PM -0600, Jeffrey Hugo wrote:
> Hi Paul.
>
> Thanks for the quick reply.
>
> On 3/26/2017 5:28 PM, Paul E. McKenney wrote:
> >On Sun, Mar 26, 2017 at 05:10:40PM -0600, Jeffrey Hugo wrote:
>
> >>It is a race between this work running, and the cpu offline
On 03/27/2017 11:51 AM, Shaohua Li wrote:
> V6->V7:
> - Don't overload blk stat, which will simplify the code. This will add extra
> space in bio/request though with the low interface configure option on.
Hmm, why? As far as I can see, it just makes things worse.
--
Jens Axboe
throtl_slice is important for blk-throttling. It's called a slice
internally, but it really is the time window over which blk-throttling
samples data. blk-throttling makes decisions based on the samplings. An
example is bandwidth measurement. A cgroup's bandwidth is measured over
the time interval of
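A minimal sketch of such a window-based measurement (names are
illustrative): the observed bandwidth is simply the bytes dispatched in
the current slice divided by the elapsed time.

static u64 tg_window_bps(u64 bytes_disp, unsigned long slice_start,
			 unsigned long now)
{
	/* Guard against a zero-length window. */
	unsigned long elapsed = max(now - slice_start, 1UL);

	/* bytes/sec = bytes * HZ / elapsed_jiffies */
	return div64_u64(bytes_disp * HZ, elapsed);
}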
A cgroup could be throttled to a limit, but when all cgroups cross the
high limit, the queue enters a higher state, and so the group should be
throttled to a higher limit. It's possible the cgroup is sleeping
because of throttling and the other cgroups don't dispatch IO any more.
In this case, nobody can trigger
The last patch introduces a way to detect an idle cgroup. We use it to
make upgrade/downgrade decisions. The new algorithm can detect a
completely idle cgroup too, so we can delete the corresponding code.
Signed-off-by: Shaohua Li
---
block/blk-throttle.c | 40
A cgroup could be assigned a limit but not dispatch enough IO, e.g. the
cgroup is idle. When this happens, the cgroup doesn't hit its limit, so
we can't move the state machine to a higher level, and all cgroups will
be throttled to their lower limits, so we waste bandwidth. Detecting an
idle cgroup is
When cgroups all reach their low limits, cgroups can dispatch more IO.
This could make some cgroups dispatch more IO but others not, and some
cgroups could even dispatch less IO than their low limit. For example:
cg1 has a low limit of 10MB/s, cg2 of 80MB/s, and assume the disk's
maximum bandwidth is 120MB/s for the
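To make the numbers concrete (a worked note; the message is truncated
here): once cg1 sustains its 10MB/s and cg2 its 80MB/s, both low limits
are met and 120 - (10 + 80) = 30MB/s of headroom remains; how that
30MB/s gets split once the queue upgrades is exactly what can leave one
cgroup above and another below its fair share.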
From: Omar Sandoval
CONFIG_DEBUG_TEST_DRIVER_REMOVE found a possible leak of q->rq_wb in a
couple of cases: when a request queue is reregistered and when gendisks
share a request_queue. This has been a problem since wbt was introduced,
but the WARN_ON(!list_empty(&stats->callbacks)) in
From: Omar Sandoval
blk_alloc_queue_node() already allocates q->stats, so
blk_mq_init_allocated_queue() is overwriting it with a new allocation.
Fixes: a83b576c9c25 ("block: fix stacked driver stats init and free")
Signed-off-by: Omar Sandoval
---
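Schematically, the shape of the leak (a hedged sketch, not the actual
diff):

/* blk_alloc_queue_node(), conceptually: */
q->stats = blk_alloc_queue_stats();	/* first allocation */

/* blk_mq_init_allocated_queue(), before the fix, conceptually: */
q->stats = blk_alloc_queue_stats();	/* overwrites the pointer and
					 * leaks the first allocation */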
On Mon, Mar 27, 2017 at 08:49:57PM +1100, NeilBrown wrote:
> On Mon, Mar 27 2017, Christoph Hellwig wrote:
>
> > I don't really like the flag at all. I'd much prefer a __bio_endio
> > with a 'bool trace' flag. Also please remove the manual tracing in
> > dm.c. Once that is done I suspect we
On Mon, Mar 27, 2017 at 04:00:08PM +, Stephen Bates wrote:
> Thanks for the review Omar!
>
> >>
> >> -unsigned int blk_stat_rq_ddir(const struct request *rq)
> >> +int blk_stat_rq_ddir(const struct request *rq)
> >> {
> >> - return rq_data_dir(rq);
> >> + return (int)rq_data_dir(rq);
>
Thanks for the review Omar!
>>
>> -unsigned int blk_stat_rq_ddir(const struct request *rq)
>> +int blk_stat_rq_ddir(const struct request *rq)
>> {
>> -return rq_data_dir(rq);
>> +return (int)rq_data_dir(rq);
>
>The cast here isn't necessary, let's leave it off.
>
OK, I will add
On Mon, 2017-03-27 at 20:06 +0800, Ming Lei wrote:
> Before commit 780db2071a ("blk-mq: decouble blk-mq freezing
> from generic bypassing"), the dying flag was checked before
> entering the queue; Tejun converted that check into one of
> .mq_freeze_depth, assuming the counter is increased just after
On Mon, 2017-03-27 at 20:06 +0800, Ming Lei wrote:
> diff --git a/block/blk-core.c b/block/blk-core.c
> index ad388d5e309a..5e8963bc98d9 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -669,6 +669,15 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
> if
On Mon, 2017-03-27 at 10:03 -0400, Mike Snitzer wrote:
> As for the blkdev_issue_zeroout() resorting to manually zeroing the
> range, if the discard fails or dzd not supported, that certainly
> requires DM thinp to implement manual zeroing of the head and tail of
> the range if partial blocks are
On Mon, Mar 27, 2017 at 10:03:07AM -0400, Mike Snitzer wrote:
> By "you" I assume you're referring to Lars?
Yes.
> Lars' approach for discard,
> when drbd is layered on dm-thinp, feels over-engineered. Not his fault,
> the way discard and zeroing got conflated certainly lends itself to
> these
On Mon, Mar 27 2017 at 5:10am -0400,
Christoph Hellwig wrote:
> It sounds like you don't want to support traditional discard at all,
> but only WRITE ZEROES. So in many ways this series is the right way
> forward. It would be nice if we could do a full blown
>
On Mon, Mar 27, 2017 at 08:06:56PM +0800, Ming Lei wrote:
> Without the barrier, reading the DEAD flag of .q_usage_counter
> and reading .mq_freeze_depth may be reordered, and then the
> following wait_event_interruptible() may never return.
>
> Reviewed-by: Hannes Reinecke
>
On Mon, Mar 27, 2017 at 08:06:58PM +0800, Ming Lei wrote:
> Before commit 780db2071a ("blk-mq: decouble blk-mq freezing
> from generic bypassing"), the dying flag was checked before
> entering the queue; Tejun converted that check into one of
> .mq_freeze_depth, assuming the counter is increased just
On Mon, Mar 27, 2017 at 08:06:57PM +0800, Ming Lei wrote:
> As the .q_usage_counter is used by both the legacy and
> mq paths, we need to block new I/O if the queue becomes
> dead in blk_queue_enter().
>
> So rename it, and we can use this function in both
> paths.
>
> Reviewed-by: Bart Van Assche
On Mon, Mar 27, 2017 at 08:06:55PM +0800, Ming Lei wrote:
> This patch adds comment on two races related with
> timeout handler:
>
> - requeue from queue busy vs. timeout
> - rq free & reallocation vs. timeout
>
> Both the races themselves and current solution aren't
> explicit enough, so add
As the .q_usage_counter is used by both the legacy and
mq paths, we need to block new I/O if the queue becomes
dead in blk_queue_enter().
So rename it, and we can use this function in both
paths.
Reviewed-by: Bart Van Assche
Reviewed-by: Hannes Reinecke
Without the barrier, reading the DEAD flag of .q_usage_counter
and reading .mq_freeze_depth may be reordered, and then the
following wait_event_interruptible() may never return.
Reviewed-by: Hannes Reinecke
Signed-off-by: Ming Lei
---
block/blk-core.c | 9
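Conceptually, the pairing looks like this (a hedged sketch of the idea,
not the exact diff; the smp_wmb() placement on the writer side is an
assumption):

/* writer side (marking the queue dying), conceptually: */
blk_set_queue_dying(q);		/* store the dying state */
smp_wmb();			/* order the flag store before the
				 * counter bump */
atomic_inc(&q->mq_freeze_depth);

/* reader side (blk_queue_enter()), conceptually: */
if (!percpu_ref_tryget_live(&q->q_usage_counter)) {	/* sees DEAD */
	smp_rmb();	/* pairs with the writer's smp_wmb(): the
			 * .mq_freeze_depth read below must not be
			 * reordered before the DEAD-flag read */
	wait_event_interruptible(q->mq_freeze_wq,
			!atomic_read(&q->mq_freeze_depth) ||
			blk_queue_dying(q));
}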
Hi,
The 1st patch adds comments on blk-mq races with the timeout handler.
The other 3 patches improve the handling of a dying queue:
- the 2nd one adds a barrier in blk_queue_enter() to avoid a hang
caused by out-of-order reads
- the 3rd and 4th patches block new I/O entering
This patch adds comments on two races related to the
timeout handler:
- requeue from queue busy vs. timeout
- rq free & reallocation vs. timeout
Neither the races themselves nor the current solution is
explicit enough, so add comments on them.
Cc: Bart Van Assche
Before commit 780db2071a ("blk-mq: decouble blk-mq freezing
from generic bypassing"), the dying flag was checked before
entering the queue; Tejun converted that check into one of
.mq_freeze_depth, assuming the counter is increased just after the
dying flag is set. Unfortunately we don't do that in
On Sat, Mar 25, 2017 at 2:45 AM, Bart Van Assche
wrote:
> On Sat, 2017-03-25 at 01:38 +0800, Ming Lei wrote:
>> As I explained, the dying flag should only be mentioned after we change
>> the code in blk_set_queue_dying().
>
> Hello Ming,
>
> If patches 2 and 4 would be
Hi Sagi,
On Mon, Mar 27, 2017 at 4:20 AM, Sagi Grimberg wrote:
>
>> This series introduces IBNBD/IBTRS kernel modules.
>>
>> IBNBD (InfiniBand network block device) allows for an RDMA transfer of
>> block IO
>> over InfiniBand network. The driver presents itself as a block
On Mon, Mar 27 2017, Christoph Hellwig wrote:
> On Fri, Mar 24, 2017 at 09:53:25AM -0700, Shaohua Li wrote:
>>
>> I had the same concern when I looked at this patch firstly. The number for
>> raid1/10 doesn't need to be the same. But if we don't move the number to a
>> generic header, the third
On Mon, Mar 27 2017, Christoph Hellwig wrote:
> I don't really like the flag at all. I'd much prefer a __bio_endio
> with a 'bool trace' flag. Also please remove the manual tracing in
> dm.c. Once that is done I suspect we can also remove the
> block_bio_complete export.
Can you say why you
On Fri, Mar 24, 2017 at 09:53:25AM -0700, Shaohua Li wrote:
>
> I had the same concern when I looked at this patch firstly. The number for
> raid1/10 doesn't need to be the same. But if we don't move the number to a
> generic header, the third patch will become a little more complicated. I
>
On Mon, Mar 20, 2017 at 03:10:01PM +0530, Jitendra Bhivare wrote:
> As part of blk_mq_realloc_hw_ctx(), if the init_hctx() ops is
> failed by the underlying transport, the hctx pointer is freed and
> initialized to NULL.
> However, functions down the line access this hwctx pointer without
> a
Looks fine,
Reviewed-by: Christoph Hellwig
On Thu, Mar 23, 2017 at 11:10:38AM -0400, Mike Snitzer wrote:
> Not sure why you've split out the dm-kcopyd patch, likely best to just
> fold it into the previous dm support patch.
The dm-kcopyd patch switches to using WRITE_ZEROES instead of write
same for dm as a user of these interfaces. The
It sounds like you don't want to support traditional discard at all,
but only WRITE ZEROES. So in many ways this series is the right way
forward. It would be nice if we could do a full blown
REQ_OP_WRITE_ZEROES for dm-thinp that zeroes out partial blocks,
similar to what hardware that implements
Looks fine,
Reviewed-by: Christoph Hellwig
I don't really like the flag at all. I'd much prefer a __bio_endio
with a 'bool trace' flag. Also please remove the manual tracing in
dm.c. Once that is done I suspect we can also remove the
block_bio_complete export.