Re: [PATCH v2] blk-throttle: track read and write request individually

2017-10-22 Thread Joseph Qi
Hi Jens,
Could you please pick this patch?

Thanks,
Joseph

On 17/10/13 01:13, Shaohua Li wrote:
> On Thu, Oct 12, 2017 at 05:15:46PM +0800, Joseph Qi wrote:
>> From: Joseph Qi 
>>
>> In a mixed read/write workload on SSD, write latency is much lower than
>> read latency. But currently we only track and record read latency, and
>> then use it as the threshold base for both read and write I/O latency
>> accounting. As a result, write I/O latency will always be considered
>> good and bad_bio_cnt stays much smaller than 20% of bio_cnt. That means
>> the tg being checked will be treated as idle most of the time and will
>> let others dispatch more I/Os, even if it is truly running under its low
>> limit and wants that low limit to be guaranteed, which is not what we
>> expect. So track read and write requests individually, which brings more
>> precise latency control for low limit idle detection.
>>
>> Signed-off-by: Joseph Qi 
> 
> Reviewed-by: Shaohua Li 
>> ---
>>  block/blk-throttle.c | 134 ++-
>>  1 file changed, 79 insertions(+), 55 deletions(-)
>>
>> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
>> index 17816a0..0897378 100644
>> --- a/block/blk-throttle.c
>> +++ b/block/blk-throttle.c
>> @@ -215,9 +215,9 @@ struct throtl_data
>>  
>>  unsigned int scale;
>>  
>> -struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE];
>> -struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE];
>> -struct latency_bucket __percpu *latency_buckets;
>> +struct latency_bucket tmp_buckets[2][LATENCY_BUCKET_SIZE];
>> +struct avg_latency_bucket avg_buckets[2][LATENCY_BUCKET_SIZE];
>> +struct latency_bucket __percpu *latency_buckets[2];
>>  unsigned long last_calculate_time;
>>  unsigned long filtered_latency;
>>  
>> @@ -2040,10 +2040,10 @@ static void blk_throtl_update_idletime(struct throtl_grp *tg)
>>  #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
>>  static void throtl_update_latency_buckets(struct throtl_data *td)
>>  {
>> -struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
>> -int i, cpu;
>> -unsigned long last_latency = 0;
>> -unsigned long latency;
>> +struct avg_latency_bucket avg_latency[2][LATENCY_BUCKET_SIZE];
>> +int i, cpu, rw;
>> +unsigned long last_latency[2] = { 0 };
>> +unsigned long latency[2];
>>  
>>  if (!blk_queue_nonrot(td->queue))
>>  return;
>> @@ -2052,56 +2052,67 @@ static void throtl_update_latency_buckets(struct throtl_data *td)
>>  td->last_calculate_time = jiffies;
>>  
>>  memset(avg_latency, 0, sizeof(avg_latency));
>> -for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
>> -struct latency_bucket *tmp = &td->tmp_buckets[i];
>> -
>> -for_each_possible_cpu(cpu) {
>> -struct latency_bucket *bucket;
>> -
>> -/* this isn't race free, but ok in practice */
>> -bucket = per_cpu_ptr(td->latency_buckets, cpu);
>> -tmp->total_latency += bucket[i].total_latency;
>> -tmp->samples += bucket[i].samples;
>> -bucket[i].total_latency = 0;
>> -bucket[i].samples = 0;
>> -}
>> +for (rw = READ; rw <= WRITE; rw++) {
>> +for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
>> +struct latency_bucket *tmp = &td->tmp_buckets[rw][i];
>> +
>> +for_each_possible_cpu(cpu) {
>> +struct latency_bucket *bucket;
>> +
>> +/* this isn't race free, but ok in practice */
>> +bucket = per_cpu_ptr(td->latency_buckets[rw],
>> +cpu);
>> +tmp->total_latency += bucket[i].total_latency;
>> +tmp->samples += bucket[i].samples;
>> +bucket[i].total_latency = 0;
>> +bucket[i].samples = 0;
>> +}
>>  
>> -if (tmp->samples >= 32) {
>> -int samples = tmp->samples;
>> +if (tmp->samples >= 32) {
>> +int samples = tmp->samples;
>>  
>> -latency = tmp->total_latency;
>> +latency[rw] = tmp->total_latency;
>>  
>> -tmp->total_latency = 0;
>> -tmp->samples = 0;
>> -latency /= samples;
>> -if (latency == 0)
>> -continue;
>> -avg_latency[i].latency = latency;
>> +tmp->total_latency = 0;
>> +tmp->samples = 0;
>> +latency[rw] /= samples;
>> +if (latency[rw] == 0)
>> +continue;
>> +   
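The gist of the diff above: the per-cpu latency buckets, the temporary
aggregation buffers and the derived thresholds all gain a second dimension
indexed by READ/WRITE, so a write is judged against a write threshold instead
of the read-derived one. A stripped-down user-space illustration of that
indexing follows; it is not the blk-throttle code, and the names and numbers
are made up.

#include <stdio.h>

#define LATENCY_BUCKET_SIZE 9
enum { READ = 0, WRITE = 1 };

/* Simplified per-direction latency bucket; illustration only. */
struct lat_bucket {
	unsigned long total_latency;
	int samples;
};

static struct lat_bucket buckets[2][LATENCY_BUCKET_SIZE];
static unsigned long threshold[2];	/* per-direction "good latency" bound, usecs */

/* Account one completed bio and judge it against the threshold of its own
 * direction, so fast writes no longer hide slow reads (and vice versa). */
static int bio_is_bad(int rw, int bucket, unsigned long latency)
{
	struct lat_bucket *b = &buckets[rw][bucket];

	b->total_latency += latency;
	b->samples++;
	return latency > threshold[rw];
}

int main(void)
{
	threshold[READ] = 200;	/* made-up numbers */
	threshold[WRITE] = 50;

	printf("300us read  bad? %d\n", bio_is_bad(READ, 0, 300));	/* 1 */
	/* judged against the read threshold this would have looked "good" */
	printf("80us  write bad? %d\n", bio_is_bad(WRITE, 0, 80));	/* 1 */
	return 0;
}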

Re: nvme multipath support V4

2017-10-22 Thread Guan Junxiong
Hi Christoph,


On 2017/10/19 0:52, Christoph Hellwig wrote:
> Hi all,
> 
> this series adds support for multipathing, that is accessing nvme
> namespaces through multiple controllers to the nvme core driver.
> 
> It is a very thin and efficient implementation that relies on
> close cooperation with other bits of the nvme driver, and few small
> and simple block helpers.
> 
> Compared to dm-multipath the important differences are how management
> of the paths is done, and how the I/O path works.
> 
> Management of the paths is fully integrated into the nvme driver,
> for each newly found nvme controller we check if there are other
> controllers that refer to the same subsystem, and if so we link them
> up in the nvme driver.  Then for each namespace found we check if
> the namespace id and identifiers match to check if we have multiple
> controllers that refer to the same namespaces.  For now path
> availability is based entirely on the controller status, which at
> least for fabrics will be continuously updated based on the mandatory
> keep alive timer.  Once the Asynchronous Namespace Access (ANA)
> proposal passes in NVMe we will also get per-namespace states in
> addition to that, but for now any details of that remain confidential
> to NVMe members.
> 
> The I/O path is very different from the existing multipath drivers,
> which is enabled by the fact that NVMe (unlike SCSI) does not support
> partial completions - a controller will either complete a whole
> command or not, but never only complete parts of it.  Because of that
> there is no need to clone bios or requests - the I/O path simply
> redirects the I/O to a suitable path.  For successful commands
> multipath is not in the completion stack at all.  For failed commands
> we decide if the error could be a path failure, and if yes remove
> the bios from the request structure and requeue them before completing
> the request.  All together this means there is no performance
> degradation compared to normal nvme operation when using the multipath
> device node (at least not until I find a dual ported DRAM backed
> device :))
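A user-space sketch of the failover idea described above, purely illustrative
and not the nvme-mpath code: the shared node picks a live path per I/O, and on
a path error it requeues the I/O on another path instead of returning the
error to the submitter.

#include <stdio.h>
#include <stdbool.h>

#define NR_PATHS 2

struct path { const char *name; bool up; };

static struct path paths[NR_PATHS] = {
	{ "nvme0 (ctrl A)", true },
	{ "nvme1 (ctrl B)", true },
};
static int current_path;

/* Pretend to send one command down a path; returning false models a path error. */
static bool send_on_path(struct path *p, int lba)
{
	if (!p->up)
		return false;
	printf("LBA %d completed via %s\n", lba, p->name);
	return true;
}

/* Model of the shared multipath node: pick a live path; on a path error,
 * fail over and retry (requeue) rather than completing with the error. */
static int submit(int lba)
{
	for (int tries = 0; tries < NR_PATHS; tries++) {
		struct path *p = &paths[current_path];

		if (send_on_path(p, lba))
			return 0;
		current_path = (current_path + 1) % NR_PATHS;
	}
	return -1;	/* all paths down */
}

int main(void)
{
	submit(0);
	paths[0].up = false;	/* e.g. keep-alive timeout on controller A */
	submit(1);		/* transparently fails over to controller B */
	return 0;
}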
> 
> A git tree is available at:
> 
>git://git.infradead.org/users/hch/block.git nvme-mpath
> 
> gitweb:
> 
>http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/nvme-mpath
> 
> Changes since V3:
>   - new block layer support for hidden gendisks
>   - a couple new patches to refactor device handling before the
> actual multipath support
>   - don't expose per-controller block device nodes
>   - use /dev/nvmeXnZ as the device nodes for the whole subsystem.

If the per-controller block device nodes are hidden, how can user-space tools
such as multipath-tools and nvme-cli (if it supports this) know the status of
each path of the multipath device?
In some cases, the admin wants to know which path is down or degraded, e.g.
suffering intermittent I/O errors because of a shaky link, so that he can fix
the link or isolate it from the normal paths.

Regards
Guan


>   - expose subsystems in sysfs (Hannes Reinecke)
>   - fix a subsystem leak when duplicate NQNs are found
>   - fix up some names
>   - don't clear current_path if freeing a different namespace
> 
> Changes since V2:
>   - don't create duplicate subsystems on reset (Keith Bush)
>   - free requests properly when failing over in I/O completion (Keith Bush)
>   - new devices names: /dev/nvm-sub%dn%d
>   - expose the namespace identification sysfs files for the mpath nodes
> 
> Changes since V1:
>   - introduce new nvme_ns_ids structure to clean up identifier handling
>   - generic_make_request_fast is now named direct_make_request and calls
> generic_make_request_checks
>   - reset bi_disk on resubmission
>   - create sysfs links between the existing nvme namespace block devices and
> the new share mpath device
>   - temporarily added the timeout patches from James, this should go into
> nvme-4.14, though
> 



Re: [RESEND PATCH 1/3] completion: Add support for initializing completion with lockdep_map

2017-10-22 Thread Byungchul Park
On Sun, Oct 22, 2017 at 02:34:56PM +, Bart Van Assche wrote:
> On Sat, 2017-10-21 at 11:23 +0900, Byungchul Park wrote:
> > On Sat, Oct 21, 2017 at 4:58 AM, Bart Van Assche  
> > wrote:
> > > As explained in another e-mail thread, unlike the lock inversion checking
> > > performed by the <= v4.13 lockdep code, cross-release checking is a 
> > > heuristic
> > > that does not have a sound theoretical basis. The lock validator is an
> > 
> > It's not heuristic but based on the same theoretical basis as <=4.13
> > lockdep. I mean, the key basis is:
> > 
> >1) What causes deadlock
> >2) What is a dependency
> >3) Build a dependency when identified
> 
> Sorry but I doubt that that statement is correct. The publication [1] contains

IMHO, the paper is talking about totally different things w.r.t.
deadlocks caused by wait_for_event/event, that is, lost events.

Furthermore, it doesn't rely on dependencies themselves, but just on lock
ordering 'case by case', which is a subset of the more general concept.

> a proof that an algorithm that is closely related to the traditional lockdep
> lock inversion detector is able to detect all deadlocks and does not report

I can admit this.

> false positives for programs that only use mutexes as synchronization objects.

I want to ask you. What makes false positives avoidable in the paper?

> The comment of the authors of that paper for programs that use mutexes,
> condition variables and semaphores is as follows: "It is unclear how to extend
> the lock-graph-based algorithm in Section 3 to efficiently consider the 
> effects
> of condition variables and semaphores. Therefore, when considering all three
> synchronization mechanisms, we currently use a naive algorithm that checks 
> each

Right. The paper seems to use a naive algorithm for those cases, not
relying on dependencies, which it should.

> feasible permutation of the trace for deadlock." In other words, if you have
> found an approach for detecting potential deadlocks for programs that use 
> these
> three kinds of synchronization objects and that does not report false 
> positives
> then that's a breakthrough that's worth publishing in a journal or in the
> proceedings of a scientific conference.

Please point out logical problems of cross-release rather than saying it's
impossible according to the paper. I think you'd better understand how
cross-release works *first*. I'll do my best to help you do so.
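
To make the point concrete, here is a tiny user-space analogue (added for
illustration, not taken from the thread) of the dependency class
cross-release is about: a pthread mutex stands in for a kernel mutex and a
semaphore for a completion. Classic lock-inversion checking sees only one
lock here and reports nothing, yet the two contexts deadlock.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
static sem_t done;			/* plays the role of a completion */

static void *waiter(void *arg)
{
	pthread_mutex_lock(&a);
	sem_wait(&done);		/* "wait_for_completion()" while holding 'a' */
	pthread_mutex_unlock(&a);
	return NULL;
}

static void *completer(void *arg)
{
	pthread_mutex_lock(&a);		/* blocks forever: 'a' is held by the waiter */
	sem_post(&done);		/* the "complete()" is never reached */
	pthread_mutex_unlock(&a);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	sem_init(&done, 0, 0);
	pthread_create(&t1, NULL, waiter, NULL);
	pthread_create(&t2, NULL, completer, NULL);
	printf("deadlock: this program hangs by design\n");
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}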

> Bart.
> 
> [1] Agarwal, Rahul, and Scott D. Stoller. "Run-time detection of potential
> deadlocks for programs with locks, semaphores, and condition variables." In
> Proceedings of the 2006 workshop on Parallel and distributed systems: testing
> and debugging, pp. 51-60. ACM, 2006.
> (https://pdfs.semanticscholar.org/9324/fc0b5d5cd5e05d551a3e98757122039946a2.pdf).


Re: [PATCH v2 4/4] lockdep: Assign a lock_class per gendisk used for wait_for_completion()

2017-10-22 Thread Byungchul Park
On Fri, Oct 20, 2017 at 07:44:51AM -0700, Christoph Hellwig wrote:
> The Subject prefix for this should be "block:".
> 
> > @@ -945,7 +945,7 @@ int submit_bio_wait(struct bio *bio)
> >  {
> > struct submit_bio_ret ret;
> >  
> > -   init_completion(&ret.event);
> > +   init_completion_with_map(&ret.event, &bio->bi_disk->lockdep_map);
> 
> FYI, I have an outstanding patch to simplify this a lot, which
> switches this to DECLARE_COMPLETION_ONSTACK.  I can delay this or let
> you pick it up with your series, but we'll need a variant of
> DECLARE_COMPLETION_ONSTACK with the lockdep annotations.

Hello,

I'm sorry for the late reply.

I think your patch makes the block code simpler and better. I like it.

But I just wonder whether it's related to my series. Is it proper to add
your patch to my series?

Thanks,
Byungchul

> Patch below for reference:
> 
> ---
> >From d65b89843c9f82c0744643515ba51dd10e66e67b Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig 
> Date: Thu, 5 Oct 2017 18:31:02 +0200
> Subject: block: use DECLARE_COMPLETION_ONSTACK in submit_bio_wait
> 
> Simplify the code by getting rid of the submit_bio_ret structure.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  block/bio.c | 19 +--
>  1 file changed, 5 insertions(+), 14 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 8338304ea256..4e18e959fc0a 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -917,17 +917,9 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  }
>  EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
>  
> -struct submit_bio_ret {
> - struct completion event;
> - int error;
> -};
> -
>  static void submit_bio_wait_endio(struct bio *bio)
>  {
> - struct submit_bio_ret *ret = bio->bi_private;
> -
> - ret->error = blk_status_to_errno(bio->bi_status);
> - complete(&ret->event);
> + complete(bio->bi_private);
>  }
>  
>  /**
> @@ -943,16 +935,15 @@ static void submit_bio_wait_endio(struct bio *bio)
>   */
>  int submit_bio_wait(struct bio *bio)
>  {
> - struct submit_bio_ret ret;
> + DECLARE_COMPLETION_ONSTACK(done);
>  
> - init_completion(&ret.event);
> - bio->bi_private = &ret;
> + bio->bi_private = &done;
>   bio->bi_end_io = submit_bio_wait_endio;
>   bio->bi_opf |= REQ_SYNC;
>   submit_bio(bio);
> - wait_for_completion_io(&ret.event);
> + wait_for_completion_io(&done);
>  
> - return ret.error;
> + return blk_status_to_errno(bio->bi_status);
>  }
>  EXPORT_SYMBOL(submit_bio_wait);
>  
> -- 
> 2.14.2
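
Combining the two patches in this thread, an annotated on-stack version of
submit_bio_wait() might look like the sketch below. It is only a sketch: the
on-stack completion is open-coded with the init_completion_with_map() helper
this series introduces, since no DECLARE_COMPLETION_ONSTACK variant taking a
lockdep_map exists at this point.

int submit_bio_wait(struct bio *bio)
{
	struct completion done;

	/* on-stack completion annotated with the gendisk's lockdep_map */
	init_completion_with_map(&done, &bio->bi_disk->lockdep_map);
	bio->bi_private = &done;
	bio->bi_end_io = submit_bio_wait_endio;
	bio->bi_opf |= REQ_SYNC;
	submit_bio(bio);
	wait_for_completion_io(&done);

	return blk_status_to_errno(bio->bi_status);
}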


Re: [PATCH] blk-mq: Iterate also over sched_tags requests at blk_mq_tagset_iter()

2017-10-22 Thread Sagi Grimberg



Currently, blk_mq_tagset_iter() iterates over the initial hctx tags only.
In case a scheduler is used, it doesn't iterate the hctx scheduler tags
and the static requests aren't updated.
For example, while using an NVMe over Fabrics RDMA host, this causes us not to
reinit the scheduler requests and thus not re-register all the memory regions
during the tagset re-initialization in the reconnect flow.


I think this is a sign that we should stop embedding memory
regions in the pre-allocated requests. It wastes too many resources.
In our case, tags are not really cheap, given
that they take a physical HW resource (an RDMA memory region).

I think we should switch (again) to a pool design instead. I guess it's
time for a generic MR pool that will serve nvmf, xprt, srp, iser and
friends.
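
A toy user-space model of that pool design, with made-up names: the expensive
resources are allocated up front into a shared free list and borrowed per I/O
instead of being embedded in every pre-allocated request (locking omitted for
brevity).

#include <stdio.h>
#include <stdlib.h>

struct mr { int id; struct mr *next; };	/* stand-in for an RDMA memory region */

static struct mr *pool;			/* shared free list */

static void mr_pool_init(int nr)
{
	for (int i = 0; i < nr; i++) {
		struct mr *m = malloc(sizeof(*m));

		m->id = i;
		m->next = pool;
		pool = m;
	}
}

static struct mr *mr_pool_get(void)
{
	struct mr *m = pool;

	if (m)
		pool = m->next;
	return m;	/* NULL means: hold the I/O until an MR is put back */
}

static void mr_pool_put(struct mr *m)
{
	m->next = pool;
	pool = m;
}

int main(void)
{
	mr_pool_init(4);		/* far fewer MRs than tags */
	struct mr *m = mr_pool_get();	/* borrow an MR when an I/O needs one */

	printf("got MR %d\n", m->id);
	mr_pool_put(m);			/* return it at completion */
	return 0;
}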


Re: [RESEND PATCH 1/3] completion: Add support for initializing completion with lockdep_map

2017-10-22 Thread Bart Van Assche
On Sat, 2017-10-21 at 11:23 +0900, Byungchul Park wrote:
> On Sat, Oct 21, 2017 at 4:58 AM, Bart Van Assche  
> wrote:
> > As explained in another e-mail thread, unlike the lock inversion checking
> > performed by the <= v4.13 lockdep code, cross-release checking is a 
> > heuristic
> > that does not have a sound theoretical basis. The lock validator is an
> 
> It's not heuristic but based on the same theoretical basis as <=4.13
> lockdep. I mean, the key basis is:
> 
>1) What causes deadlock
>2) What is a dependency
>3) Build a dependency when identified

Sorry but I doubt that that statement is correct. The publication [1] contains
a proof that an algorithm that is closely related to the traditional lockdep
lock inversion detector is able to detect all deadlocks and does not report
false positives for programs that only use mutexes as synchronization objects.
The comment of the authors of that paper for programs that use mutexes,
condition variables and semaphores is as follows: "It is unclear how to extend
the lock-graph-based algorithm in Section 3 to efficiently consider the effects
of condition variables and semaphores. Therefore, when considering all three
synchronization mechanisms, we currently use a naive algorithm that checks each
feasible permutation of the trace for deadlock." In other words, if you have
found an approach for detecting potential deadlocks for programs that use these
three kinds of synchronization objects and that does not report false positives
then that's a breakthrough that's worth publishing in a journal or in the
proceedings of a scientific conference.

Bart.

[1] Agarwal, Rahul, and Scott D. Stoller. "Run-time detection of potential
deadlocks for programs with locks, semaphores, and condition variables." In
Proceedings of the 2006 workshop on Parallel and distributed systems: testing
and debugging, pp. 51-60. ACM, 2006.
(https://pdfs.semanticscholar.org/9324/fc0b5d5cd5e05d551a3e98757122039946a2.pdf).

[PATCH] blk-mq: Iterate also over sched_tags requests at blk_mq_tagset_iter()

2017-10-22 Thread Israel Rukshin
Currently, blk_mq_tagset_iter() iterates over the initial hctx tags only.
In case a scheduler is used, it doesn't iterate the hctx scheduler tags
and the static requests aren't updated.
For example, while using an NVMe over Fabrics RDMA host, this causes us not to
reinit the scheduler requests and thus not re-register all the memory regions
during the tagset re-initialization in the reconnect flow.

This may lead to a memory registration error:
"MEMREG for CQE 0x88044c14dce8 failed with status memory
management operation error (6)"

Signed-off-by: Israel Rukshin 
Reviewed-by: Max Gurtovoy 
---

The commit is based on the nvme branch for 4.15, which includes
Sagi's patches for reinit_tagset.

---
 block/blk-mq-sched.c   |  3 +++
 block/blk-mq-tag.c | 16 
 block/blk-mq.c | 14 +-
 include/linux/blk-mq.h |  1 +
 4 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 4ab6943..4db9797 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -426,6 +426,7 @@ static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
blk_mq_free_rqs(set, hctx->sched_tags, hctx_idx);
blk_mq_free_rq_map(hctx->sched_tags);
hctx->sched_tags = NULL;
+   set->sched_tags[hctx_idx] = NULL;
}
 }
 
@@ -441,6 +442,8 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
if (!hctx->sched_tags)
return -ENOMEM;
 
+   set->sched_tags[hctx_idx] = hctx->sched_tags;
+
ret = blk_mq_alloc_rqs(set, hctx->sched_tags, hctx_idx, q->nr_requests);
if (ret)
blk_mq_sched_free_tags(set, hctx, hctx_idx);
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index c81b40e..c290de0 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -322,6 +322,22 @@ int blk_mq_tagset_iter(struct blk_mq_tag_set *set, void *data,
}
}
 
+   for (i = 0; i < set->nr_hw_queues; i++) {
+   struct blk_mq_tags *sched_tags = set->sched_tags[i];
+
+   if (!sched_tags)
+   continue;
+
+   for (j = 0; j < sched_tags->nr_tags; j++) {
+   if (!sched_tags->static_rqs[j])
+   continue;
+
+   ret = fn(data, sched_tags->static_rqs[j]);
+   if (ret)
+   goto out;
+   }
+   }
+
 out:
return ret;
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7f01d69..d7675b7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2576,10 +2576,16 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
return -ENOMEM;
 
ret = -ENOMEM;
+
+   set->sched_tags = kzalloc_node(nr_cpu_ids * sizeof(struct blk_mq_tags *),
+  GFP_KERNEL, set->numa_node);
+   if (!set->sched_tags)
+   goto out_free_tags;
+
set->mq_map = kzalloc_node(sizeof(*set->mq_map) * nr_cpu_ids,
GFP_KERNEL, set->numa_node);
if (!set->mq_map)
-   goto out_free_tags;
+   goto out_free_sched_tags;
 
ret = blk_mq_update_queue_map(set);
if (ret)
@@ -2597,6 +2603,9 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 out_free_mq_map:
kfree(set->mq_map);
set->mq_map = NULL;
+out_free_sched_tags:
+   kfree(set->sched_tags);
+   set->sched_tags = NULL;
 out_free_tags:
kfree(set->tags);
set->tags = NULL;
@@ -2614,6 +2623,9 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
kfree(set->mq_map);
set->mq_map = NULL;
 
+   kfree(set->sched_tags);
+   set->sched_tags = NULL;
+
kfree(set->tags);
set->tags = NULL;
 }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index cfd64e5..9ec629f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -78,6 +78,7 @@ struct blk_mq_tag_set {
void*driver_data;
 
struct blk_mq_tags  **tags;
+   struct blk_mq_tags  **sched_tags;
 
struct mutextag_list_lock;
struct list_headtag_list;
-- 
1.8.3.1



Re: high overhead of functions blkg_*stats_* in bfq

2017-10-22 Thread Paolo Valente

> On 21 Oct 2017, at 18:13, Tejun Heo wrote:
> 
> Hello, Paolo.
> 
> On Thu, Oct 19, 2017 at 08:50:17AM +0200, Paolo Valente wrote:
>> The blkg obtained through a blkg_lookup, in a rcu_read section, is
>> protected.  But, outside that section, a pointer to that blkg is not
>> guaranteed to be valid any longer.  Stat-update functions seem safe in
> 
> blkg's destruction is rcu delayed.  If you have access to a blkg under
> rcu, it won't get freed until the rcu read lock is released.
> 
>> cfq and bfq, just because they are executed within locks that happen
>> to be taken also before destroying the blkg.  They are the
>> request_queue lock for cfq and the scheduler lock for bfq.  Thus, at
>> least the request_queue lock apparently needs to be taken around
>> stat-update functions in bfq, if they are moved outside the section
>> protected by the scheduler lock.
> 
> So, a blkg stays alive if the queue lock is held, or the cgroup and
> request_queue stays alive, and won't be freed (different from being
> alive) as long as RCU read lock is held.
> 

Ok, then I either have to protect these stats-update functions with
the queue lock, or to somehow extend RCU protection to the bfq groups
referred to by these functions.  According to our tests with different
processors, the gain of the latter solution is probably around a 2-3%
IOPS boost with respect to using just the queue lock.  That would be a
small gain, against the current rather high overhead of the functions
to protect, and it would cost additional code complexity.  So I'm
oriented towards using just the queue lock.
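
Concretely, the queue-lock option amounts to something like the sketch below
around each group of stat updates; bfqg_stats_update_io_add() stands in here
for whichever blkg_*stat_* helpers end up being moved out of the scheduler
lock.

/* Sketch of the queue-lock option: while q->queue_lock is held the blkg
 * backing bfqg stays alive (not merely unfreed), per the lifetime rules
 * described above. */
static void bfq_stats_add_locked(struct request_queue *q,
				 struct bfq_group *bfqg,
				 struct bfq_queue *bfqq, unsigned int op)
{
	spin_lock_irq(q->queue_lock);
	bfqg_stats_update_io_add(bfqg, bfqq, op);
	spin_unlock_irq(q->queue_lock);
}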

Thank you very much for helping me with these doubts,
Paolo

> Thanks.
> 
> -- 
> tejun