Re: [PATCH 0/5][v2] blk-iolatency: Fixes and tweak the miss algo for ssds

2018-09-28 Thread Jens Axboe
On 9/28/18 11:45 AM, Josef Bacik wrote:
> v1->v2:
> - rebased onto a recent for-4.20/block branch
> - dropped the changed variable cleanup.
> 
> -- Original message --
> 
> Testing on ssds with the current iolatency code wasn't working quite as well.
> This is mostly because ssds don't behave like rotational drives; they are more
> spiky, which means that using the average latency for IO wasn't very
> responsive until the drive was extremely overtaxed.  To deal with this I've
> reworked iolatency to use a p(90) based approach for ssd latencies.  I
> originally intended to use this approach for both ssds and rotational drives,
> but p(90) was too high of a bar to use.  By the time we were exceeding p(90)
> things were already pretty bad.  So to keep things simpler, just use p(90) for
> ssds since their latency targets tend to be orders of magnitude lower than
> rotational drives, and keep the average latency calculations for rotational
> drives.
> 
> This testing also showed a few issues with blk-iolatency, so the preceding
> patches all fix issues we saw in testing.  Using q->nr_requests instead of
> blk_queue_depth() is probably the most subtle and important change.  We want
> to limit the IOs based on the number of outstanding requests we can have in
> the block layer, not necessarily how many we can have going to the device.
> So make this explicit by using nr_requests directly.  These patches have been
> in production for a week on both our rotational and ssd tiers and everything
> is going smoothly.  Thanks,

Applied for 4.20, thanks for respinning.

-- 
Jens Axboe



[PATCH 5/5] blk-iolatency: keep track of previous windows stats

2018-09-28 Thread Josef Bacik
We apply smoothing to the scale changes in order to keep sawtoothy
behavior from occurring.  However our window for checking whether we've
missed our target can sometimes be shorter than the smoothing interval
(500ms), especially on faster drives like ssds.  In order to deal with
this, keep a running tally of the previous intervals that we threw away
because we had already done a scale event recently.

This is needed for the ssd case, as these low latency drives will have
bursts of latency; if latency happens to be ok for the window that
directly follows the re-opening of the scale window, we could unthrottle
even though we were missing our target in the previous windows.

Signed-off-by: Josef Bacik 
---
 block/blk-iolatency.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
index fd246805b0be..35c48d7b8f78 100644
--- a/block/blk-iolatency.c
+++ b/block/blk-iolatency.c
@@ -130,6 +130,7 @@ struct latency_stat {
 struct iolatency_grp {
struct blkg_policy_data pd;
struct latency_stat __percpu *stats;
+   struct latency_stat cur_stat;
struct blk_iolatency *blkiolat;
struct rq_depth rq_depth;
struct rq_wait rq_wait;
@@ -570,24 +571,27 @@ static void iolatency_check_latencies(struct iolatency_grp *iolat, u64 now)
 
/* Somebody beat us to the punch, just bail. */
spin_lock_irqsave(&lat_info->lock, flags);
+
+   latency_stat_sum(iolat, &iolat->cur_stat, &stat);
lat_info->nr_samples -= iolat->nr_samples;
-   lat_info->nr_samples += latency_stat_samples(iolat, &stat);
-   iolat->nr_samples = latency_stat_samples(iolat, &stat);
+   lat_info->nr_samples += latency_stat_samples(iolat, &iolat->cur_stat);
+   iolat->nr_samples = latency_stat_samples(iolat, &iolat->cur_stat);
 
if ((lat_info->last_scale_event >= now ||
-   now - lat_info->last_scale_event < BLKIOLATENCY_MIN_ADJUST_TIME) &&
-   lat_info->scale_lat <= iolat->min_lat_nsec)
+   now - lat_info->last_scale_event < BLKIOLATENCY_MIN_ADJUST_TIME))
goto out;
 
-   if (latency_sum_ok(iolat, &stat)) {
-   if (latency_stat_samples(iolat, &stat) <
+   if (latency_sum_ok(iolat, &iolat->cur_stat) &&
+   latency_sum_ok(iolat, &stat)) {
+   if (latency_stat_samples(iolat, &iolat->cur_stat) <
BLKIOLATENCY_MIN_GOOD_SAMPLES)
goto out;
if (lat_info->scale_grp == iolat) {
lat_info->last_scale_event = now;
scale_cookie_change(iolat->blkiolat, lat_info, true);
}
-   } else {
+   } else if (lat_info->scale_lat == 0 ||
+  lat_info->scale_lat >= iolat->min_lat_nsec) {
lat_info->last_scale_event = now;
if (!lat_info->scale_grp ||
lat_info->scale_lat > iolat->min_lat_nsec) {
@@ -596,6 +600,7 @@ static void iolatency_check_latencies(struct iolatency_grp *iolat, u64 now)
}
scale_cookie_change(iolat->blkiolat, lat_info, false);
}
+   latency_stat_init(iolat, &iolat->cur_stat);
 out:
spin_unlock_irqrestore(&lat_info->lock, flags);
 }
@@ -966,6 +971,7 @@ static void iolatency_pd_init(struct blkg_policy_data *pd)
latency_stat_init(iolat, stat);
}
 
+   latency_stat_init(iolat, &iolat->cur_stat);
rq_wait_init(&iolat->rq_wait);
spin_lock_init(&iolat->child_lat.lock);
iolat->rq_depth.queue_depth = blkg->q->nr_requests;
-- 
2.14.3
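
To make the carried-over-window idea concrete, here is a minimal userspace
sketch (illustrative only, not kernel code; the struct and helper names are
made up, and the 10% missed-ratio check mirrors the ssd heuristic added in
patch 4 of the series): a single good window right after the smoothing
interval cannot hide the misses accumulated in the windows that were folded
into the running tally.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct win_stat { uint64_t total, missed; };

static void win_stat_sum(struct win_stat *sum, const struct win_stat *w)
{
	sum->total  += w->total;
	sum->missed += w->missed;
}

/* "ok" when fewer than 10% of the IOs missed the target */
static bool win_stat_ok(const struct win_stat *w)
{
	uint64_t thresh = w->total / 10;

	if (thresh < 1)
		thresh = 1;
	return w->missed < thresh;
}

int main(void)
{
	/* two bad windows inside the smoothing interval, then one good one */
	struct win_stat windows[] = {
		{ .total = 100, .missed = 30 },
		{ .total = 100, .missed = 25 },
		{ .total = 100, .missed = 2 },
	};
	struct win_stat carry = { 0, 0 };
	int i;

	for (i = 0; i < 3; i++)
		win_stat_sum(&carry, &windows[i]);

	/* without the carry, only the last window would be consulted */
	printf("last window ok: %d, accumulated ok: %d\n",
	       win_stat_ok(&windows[2]), win_stat_ok(&carry));
	return 0;
}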



[PATCH 0/5][v2] blk-iolatency: Fixes and tweak the miss algo for ssds

2018-09-28 Thread Josef Bacik
v1->v2:
- rebased onto a recent for-4.20/block branch
- dropped the changed variable cleanup.

-- Original message --

Testing on ssds with the current iolatency code wasn't working quite as well.
This is mostly because ssds don't behave like rotational drives; they are more
spiky, which means that using the average latency for IO wasn't very responsive
until the drive was extremely overtaxed.  To deal with this I've reworked
iolatency to use a p(90) based approach for ssd latencies.  I originally
intended to use this approach for both ssds and rotational drives, but p(90)
was too high of a bar to use.  By the time we were exceeding p(90) things were
already pretty bad.  So to keep things simpler, just use p(90) for ssds since
their latency targets tend to be orders of magnitude lower than rotational
drives, and keep the average latency calculations for rotational drives.

This testing also showed a few issues with blk-iolatency, so the preceding
patches all fix issues we saw in testing.  Using q->nr_requests instead
of blk_queue_depth() is probably the most subtle and important change.  We want
to limit the IOs based on the number of outstanding requests we can have in the
block layer, not necessarily how many we can have going to the device.  So make
this explicit by using nr_requests directly.  These patches have been in
production for a week on both our rotational and ssd tiers and everything is
going smoothly.  Thanks,

Josef
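
To illustrate the cover letter's point about averages hiding spikes, here is
a small userspace sketch with entirely made-up numbers (not taken from the
patches): the mean latency of a spiky distribution stays under the target
while well over 10% of the IOs blow past it, which is what the missed-count
check in patch 4 reacts to.

#include <stdint.h>
#include <stdio.h>

#define NR_IOS    100
#define TARGET_NS 2000000ULL		/* 2ms latency target, illustrative */

int main(void)
{
	uint64_t lat[NR_IOS];
	uint64_t sum = 0, missed = 0;
	int i;

	/* 85 fast completions and 15 large spikes: a "spiky" ssd-like shape */
	for (i = 0; i < NR_IOS; i++)
		lat[i] = (i < 85) ? 200000ULL : 8000000ULL;	/* 0.2ms vs 8ms */

	for (i = 0; i < NR_IOS; i++) {
		sum += lat[i];
		if (lat[i] >= TARGET_NS)
			missed++;
	}

	/* mean = (85*0.2ms + 15*8ms) / 100 = 1.37ms, i.e. under the target */
	printf("mean: %llu ns (under target: %d)\n",
	       (unsigned long long)(sum / NR_IOS), sum / NR_IOS <= TARGET_NS);
	/* but 15 of 100 IOs missed, which the missed-ratio check catches */
	printf("missed: %llu of %d (over the 10%% threshold: %d)\n",
	       (unsigned long long)missed, NR_IOS, missed >= NR_IOS / 10);
	return 0;
}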


[PATCH 1/5] blk-iolatency: use q->nr_requests directly

2018-09-28 Thread Josef Bacik
We were using blk_queue_depth() assuming that it would return nr_requests,
but we hit a case in production on drives that had to have NCQ turned off
in order for them to not shit the bed, which resulted in a qd of 1 even
though nr_requests was much larger.  iolatency really only cares about the
requests we are allowed to queue up, as any IO that gets onto the request
list is going to be serviced soonish, so we want to be throttling before
the bio gets onto the request list.  To make iolatency work as expected,
simply use q->nr_requests instead of blk_queue_depth(), as that is what we
actually care about.

Signed-off-by: Josef Bacik 
---
 block/blk-iolatency.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
index 27c14f8d2576..c2e38bc12f27 100644
--- a/block/blk-iolatency.c
+++ b/block/blk-iolatency.c
@@ -255,7 +255,7 @@ static void scale_cookie_change(struct blk_iolatency *blkiolat,
struct child_latency_info *lat_info,
bool up)
 {
-   unsigned long qd = blk_queue_depth(blkiolat->rqos.q);
+   unsigned long qd = blkiolat->rqos.q->nr_requests;
unsigned long scale = scale_amount(qd, up);
unsigned long old = atomic_read(&lat_info->scale_cookie);
unsigned long max_scale = qd << 1;
@@ -295,7 +295,7 @@ static void scale_cookie_change(struct blk_iolatency *blkiolat,
  */
 static void scale_change(struct iolatency_grp *iolat, bool up)
 {
-   unsigned long qd = blk_queue_depth(iolat->blkiolat->rqos.q);
+   unsigned long qd = iolat->blkiolat->rqos.q->nr_requests;
unsigned long scale = scale_amount(qd, up);
unsigned long old = iolat->rq_depth.max_depth;
 
@@ -857,7 +857,7 @@ static void iolatency_pd_init(struct blkg_policy_data *pd)
 
rq_wait_init(&iolat->rq_wait);
spin_lock_init(&iolat->child_lat.lock);
-   iolat->rq_depth.queue_depth = blk_queue_depth(blkg->q);
+   iolat->rq_depth.queue_depth = blkg->q->nr_requests;
iolat->rq_depth.max_depth = UINT_MAX;
iolat->rq_depth.default_depth = iolat->rq_depth.queue_depth;
iolat->blkiolat = blkiolat;
-- 
2.14.3



[PATCH 3/5] blk-iolatency: deal with small samples

2018-09-28 Thread Josef Bacik
There is logic to keep cgroups that haven't done a lot of IO in the most
recent scale window from being punished for over-active higher priority
groups.  However, for things like ssds, where the windows are pretty
short, we'll end up with small numbers of samples, so 5% of the samples
will come out to 0 if there aren't enough.  Make the floor 1 sample to
keep us from improperly bailing out of scaling down.

Signed-off-by: Josef Bacik 
---
 block/blk-iolatency.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
index 8daea7a4fe49..e7be77b0ce8b 100644
--- a/block/blk-iolatency.c
+++ b/block/blk-iolatency.c
@@ -366,7 +366,7 @@ static void check_scale_change(struct iolatency_grp *iolat)
 * scale down event.
 */
samples_thresh = lat_info->nr_samples * 5;
-   samples_thresh = div64_u64(samples_thresh, 100);
+   samples_thresh = max(1ULL, div64_u64(samples_thresh, 100));
if (iolat->nr_samples <= samples_thresh)
return;
}
-- 
2.14.3
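
As a quick check of the rounding problem described above (userspace sketch
with made-up sample counts, not kernel code): with a short window the parent
may only have a dozen samples, 5% of which rounds down to 0 under integer
division; the max() in the patch forces the threshold up to one sample.

#include <stdint.h>
#include <stdio.h>

/* threshold before the fix: plain integer division, may round to 0 */
static uint64_t thresh_before(uint64_t parent_samples)
{
	return parent_samples * 5 / 100;
}

/* threshold after the fix: same computation with a floor of one sample */
static uint64_t thresh_after(uint64_t parent_samples)
{
	uint64_t t = parent_samples * 5 / 100;

	return t > 0 ? t : 1;
}

int main(void)
{
	uint64_t parent_samples = 12;	/* a short ssd window, made up */

	printf("5%% of %llu samples: before=%llu, after=%llu\n",
	       (unsigned long long)parent_samples,
	       (unsigned long long)thresh_before(parent_samples),
	       (unsigned long long)thresh_after(parent_samples));
	return 0;
}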



[PATCH 4/5] blk-iolatency: use a percentile approach for ssd's

2018-09-28 Thread Josef Bacik
We use an average latency approach for determining whether we're missing
our latency target.  This works well for rotational storage, where we have
generally consistent latencies, but ssds and other low latency devices
show more spiky behavior, which means we often won't throttle misbehaving
groups because a lot of IO completes at drastically faster times than our
latency target.  Instead, keep track of how many IOs miss our target and
how many IOs are done in our time window.  If the p(90) latency is above
our target then we know we need to throttle.  With this change in place we
are seeing the same throttling behavior with our testcase on ssds as we
see with rotational drives.

Signed-off-by: Josef Bacik 
---
 block/blk-iolatency.c | 179 --
 1 file changed, 145 insertions(+), 34 deletions(-)

diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
index e7be77b0ce8b..fd246805b0be 100644
--- a/block/blk-iolatency.c
+++ b/block/blk-iolatency.c
@@ -115,9 +115,21 @@ struct child_latency_info {
atomic_t scale_cookie;
 };
 
+struct percentile_stats {
+   u64 total;
+   u64 missed;
+};
+
+struct latency_stat {
+   union {
+   struct percentile_stats ps;
+   struct blk_rq_stat rqs;
+   };
+};
+
 struct iolatency_grp {
struct blkg_policy_data pd;
-   struct blk_rq_stat __percpu *stats;
+   struct latency_stat __percpu *stats;
struct blk_iolatency *blkiolat;
struct rq_depth rq_depth;
struct rq_wait rq_wait;
@@ -132,6 +144,7 @@ struct iolatency_grp {
/* Our current number of IO's for the last summation. */
u64 nr_samples;
 
+   bool ssd;
struct child_latency_info child_lat;
 };
 
@@ -172,6 +185,80 @@ static inline struct blkcg_gq *lat_to_blkg(struct iolatency_grp *iolat)
return pd_to_blkg(&iolat->pd);
 }
 
+static inline void latency_stat_init(struct iolatency_grp *iolat,
+struct latency_stat *stat)
+{
+   if (iolat->ssd) {
+   stat->ps.total = 0;
+   stat->ps.missed = 0;
+   } else
+   blk_rq_stat_init(&stat->rqs);
+}
+
+static inline void latency_stat_sum(struct iolatency_grp *iolat,
+   struct latency_stat *sum,
+   struct latency_stat *stat)
+{
+   if (iolat->ssd) {
+   sum->ps.total += stat->ps.total;
+   sum->ps.missed += stat->ps.missed;
+   } else
+   blk_rq_stat_sum(&sum->rqs, &stat->rqs);
+}
+
+static inline void latency_stat_record_time(struct iolatency_grp *iolat,
+   u64 req_time)
+{
+   struct latency_stat *stat = get_cpu_ptr(iolat->stats);
+   if (iolat->ssd) {
+   if (req_time >= iolat->min_lat_nsec)
+   stat->ps.missed++;
+   stat->ps.total++;
+   } else
+   blk_rq_stat_add(&stat->rqs, req_time);
+   put_cpu_ptr(stat);
+}
+
+static inline bool latency_sum_ok(struct iolatency_grp *iolat,
+ struct latency_stat *stat)
+{
+   if (iolat->ssd) {
+   u64 thresh = div64_u64(stat->ps.total, 10);
+   thresh = max(thresh, 1ULL);
+   return stat->ps.missed < thresh;
+   }
+   return stat->rqs.mean <= iolat->min_lat_nsec;
+}
+
+static inline u64 latency_stat_samples(struct iolatency_grp *iolat,
+  struct latency_stat *stat)
+{
+   if (iolat->ssd)
+   return stat->ps.total;
+   return stat->rqs.nr_samples;
+}
+
+static inline void iolat_update_total_lat_avg(struct iolatency_grp *iolat,
+ struct latency_stat *stat)
+{
+   int exp_idx;
+
+   if (iolat->ssd)
+   return;
+
+   /*
+* CALC_LOAD takes in a number stored in fixed point representation.
+* Because we are using this for IO time in ns, the values stored
+* are significantly larger than the FIXED_1 denominator (2048).
+* Therefore, rounding errors in the calculation are negligible and
+* can be ignored.
+*/
+   exp_idx = min_t(int, BLKIOLATENCY_NR_EXP_FACTORS - 1,
+   div64_u64(iolat->cur_win_nsec,
+ BLKIOLATENCY_EXP_BUCKET_SIZE));
+   CALC_LOAD(iolat->lat_avg, iolatency_exp_factors[exp_idx], stat->rqs.mean);
+}
+
 static inline bool iolatency_may_queue(struct iolatency_grp *iolat,
   wait_queue_entry_t *wait,
   bool first_block)
@@ -418,7 +505,6 @@ static void iolatency_record_time(struct iolatency_grp *iolat,
  struct bio_issue *issue, u64 now,
  bool issue_as_root)
 {
-   struct blk_rq_stat *rq_stat;
u64 start = bio_issue_ti

[PATCH 2/5] blk-iolatency: deal with nr_requests == 1

2018-09-28 Thread Josef Bacik
Hitting the case where blk_queue_depth() returned 1 uncovered the fact
that iolatency doesn't actually handle this case properly: it simply
doesn't scale down anybody.  For this case we should go straight into
applying the time delay, which we weren't doing.  Since we already limit
the floor to 1 request, this if statement is not needed, and removing it
allows us to set our depth to 1, which lets us apply the delay if needed.

Signed-off-by: Josef Bacik 
---
 block/blk-iolatency.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
index c2e38bc12f27..8daea7a4fe49 100644
--- a/block/blk-iolatency.c
+++ b/block/blk-iolatency.c
@@ -312,7 +312,7 @@ static void scale_change(struct iolatency_grp *iolat, bool up)
iolat->rq_depth.max_depth = old;
wake_up_all(&iolat->rq_wait.wait);
}
-   } else if (old > 1) {
+   } else {
old >>= 1;
iolat->rq_depth.max_depth = max(old, 1UL);
}
-- 
2.14.3



Re: [PATCH] kyber: fix integer overflow of latency targets on 32-bit

2018-09-28 Thread Jens Axboe
On 9/28/18 10:22 AM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> NSEC_PER_SEC has type long, so 5 * NSEC_PER_SEC is calculated as a long.
> However, 5 seconds is 5,000,000,000 nanoseconds, which overflows a
> 32-bit long. Make sure all of the targets are calculated as 64-bit
> values.

Applied, thanks Omar.

-- 
Jens Axboe



[PATCH] kyber: fix integer overflow of latency targets on 32-bit

2018-09-28 Thread Omar Sandoval
From: Omar Sandoval 

NSEC_PER_SEC has type long, so 5 * NSEC_PER_SEC is calculated as a long.
However, 5 seconds is 5,000,000,000 nanoseconds, which overflows a
32-bit long. Make sure all of the targets are calculated as 64-bit
values.

Fixes: 6e25cb01ea20 ("kyber: implement improved heuristics")
Reported-by: Stephen Rothwell 
Signed-off-by: Omar Sandoval 
---
 block/kyber-iosched.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 2b62e362fb36..eccac01a10b6 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -79,9 +79,9 @@ static const unsigned int kyber_depth[] = {
  * Default latency targets for each scheduling domain.
  */
 static const u64 kyber_latency_targets[] = {
-   [KYBER_READ] = 2 * NSEC_PER_MSEC,
-   [KYBER_WRITE] = 10 * NSEC_PER_MSEC,
-   [KYBER_DISCARD] = 5 * NSEC_PER_SEC,
+   [KYBER_READ] = 2ULL * NSEC_PER_MSEC,
+   [KYBER_WRITE] = 10ULL * NSEC_PER_MSEC,
+   [KYBER_DISCARD] = 5ULL * NSEC_PER_SEC,
 };
 
 /*
-- 
2.19.0
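
The overflow is easy to demonstrate in userspace.  The sketch below emulates
a 32-bit long by truncating the product to its low 32 bits (using unsigned
arithmetic so the wrap-around is well defined) and contrasts it with the
64-bit computation the patch switches to; this is a standalone illustration,
not kernel code.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define NSEC_PER_SEC 1000000000L	/* has type long, as in the kernel */

int main(void)
{
	/* what a 32-bit long keeps of 5,000,000,000: only the low 32 bits */
	uint32_t wrapped = (uint32_t)(5 * (uint64_t)NSEC_PER_SEC);
	/* what the patched initializer computes */
	uint64_t fixed = 5ULL * NSEC_PER_SEC;

	printf("wrapped 32-bit value: %" PRIu32 " ns (~0.7s)\n", wrapped);
	printf("correct 64-bit value: %" PRIu64 " ns (5s)\n", fixed);
	return 0;
}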



Re: [GIT PULL] nvme fixes for 4.19

2018-09-28 Thread Jens Axboe
On 9/28/18 9:40 AM, Christoph Hellwig wrote:
> The following changes since commit b228ba1cb95afbaeeb86cf06cd9fd6f6369c3b14:
> 
>   null_blk: fix zoned support for non-rq based operation (2018-09-12 18:21:11 -0600)
> 
> are available in the Git repository at:
> 
>   git://git.infradead.org/nvme.git nvme-4.19
> 
> for you to fetch changes up to bb830add192e9d8338082c0fc2c209e23b43d865:
> 
>   nvme: properly propagate errors in nvme_mpath_init (2018-09-25 16:21:40 -0700)
> 
> 
> Hannes Reinecke (1):
>   nvme: count all ANA groups for ANA Log page

This one I already pulled a while back, only this one:

> Susobhan Dey (1):
>   nvme: properly propagate errors in nvme_mpath_init
is new.

-- 
Jens Axboe



[GIT PULL] nvme fixes for 4.19

2018-09-28 Thread Christoph Hellwig
The following changes since commit b228ba1cb95afbaeeb86cf06cd9fd6f6369c3b14:

  null_blk: fix zoned support for non-rq based operation (2018-09-12 18:21:11 -0600)

are available in the Git repository at:

  git://git.infradead.org/nvme.git nvme-4.19

for you to fetch changes up to bb830add192e9d8338082c0fc2c209e23b43d865:

  nvme: properly propagate errors in nvme_mpath_init (2018-09-25 16:21:40 -0700)


Hannes Reinecke (1):
  nvme: count all ANA groups for ANA Log page

Susobhan Dey (1):
  nvme: properly propagate errors in nvme_mpath_init

 drivers/nvme/host/multipath.c   | 6 --
 drivers/nvme/target/admin-cmd.c | 4 
 2 files changed, 8 insertions(+), 2 deletions(-)


Re: [PATCH 2/5] nvme: register ns_id attributes as default sysfs groups

2018-09-28 Thread Christoph Hellwig
On Fri, Sep 28, 2018 at 08:17:20AM +0200, Hannes Reinecke wrote:
> We should be registering the ns_id attributes as default sysfs
> attribute groups; otherwise we have a race condition between
> the uevent and the attributes appearing in sysfs.

Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCHv4 0/5] genhd: register default groups with device_add_disk()

2018-09-28 Thread Jens Axboe
On 9/28/18 12:17 AM, Hannes Reinecke wrote:
> device_add_disk() doesn't allow registering default sysfs groups,
> which introduces a race with udev as these groups have to be created after
> the uevent has been sent.
> This patchset updates device_add_disk() to accept a 'groups' argument to
> avoid this race and updates the obvious drivers to use it.
> There are some more drivers which might benefit from this (e.g. loop or md),
> but the interface is not straightforward, so I haven't attempted it.

Thanks Hannes, applied for 4.20.

-- 
Jens Axboe



Re: [PATCH 2/5] nvme: register ns_id attributes as default sysfs groups

2018-09-28 Thread Keith Busch
On Fri, Sep 28, 2018 at 08:17:20AM +0200, Hannes Reinecke wrote:
> We should be registering the ns_id attributes as default sysfs
> attribute groups; otherwise we have a race condition between
> the uevent and the attributes appearing in sysfs.
> 
> Suggested-by: Bart van Assche 
> Signed-off-by: Hannes Reinecke 

Thanks, looks great!

Reviewed-by: Keith Busch 


[PATCH V2] blk-mq: complete req in softirq context in case of single queue

2018-09-28 Thread Ming Lei
Lots of controllers may have only one irq vector for completing IO
requests.  Usually the affinity of that irq vector covers all possible
CPUs; however, on most architectures only one specific CPU ends up
handling the interrupt.

So if all IOs are completed in hardirq context, IO performance inevitably
degrades because of increased irq latency.

This patch tries to address the issue by allowing requests to be completed
in softirq context, like the legacy IO path.

An IOPS improvement of ~13% is observed in the following randread test on
raid0 over virtio-scsi.

mdadm --create --verbose /dev/md0 --level=0 --chunk=1024 --raid-devices=8 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi

fio --time_based --name=benchmark --runtime=30 --filename=/dev/md0 --nrfiles=1 --ioengine=libaio --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=32 --rw=randread --blocksize=4k

Cc: Dongli Zhang 
Cc: Zach Marano 
Cc: Christoph Hellwig 
Cc: Bart Van Assche 
Cc: Jianchao Wang 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c  | 14 ++
 block/blk-softirq.c |  5 ++---
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 85a1c1a59c72..d4792c3ac983 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -565,6 +565,20 @@ static void __blk_mq_complete_request(struct request *rq)
if (rq->internal_tag != -1)
blk_mq_sched_completed_request(rq);
 
+   /*
+* For most single queue controllers, there is only one irq vector
+* for handling IO completion, and the only irq's affinity is set
+* to all possible CPUs. On most ARCHs, this affinity means the
+* irq is handled on one specific CPU.
+*
+* So complete the IO request in softirq context in case of a single
+* queue, to avoid degrading IO performance because of irqsoff latency.
+*/
+   if (rq->q->nr_hw_queues == 1) {
+   __blk_complete_request(rq);
+   return;
+   }
+
if (!test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags)) {
rq->q->softirq_done_fn(rq);
return;
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index 15c1f5e12eb8..e47a2f751884 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -97,8 +97,8 @@ static int blk_softirq_cpu_dead(unsigned int cpu)
 
 void __blk_complete_request(struct request *req)
 {
-   int ccpu, cpu;
struct request_queue *q = req->q;
+   int cpu, ccpu = q->mq_ops ? req->mq_ctx->cpu : req->cpu;
unsigned long flags;
bool shared = false;
 
@@ -110,8 +110,7 @@ void __blk_complete_request(struct request *req)
/*
 * Select completion CPU
 */
-   if (req->cpu != -1) {
-   ccpu = req->cpu;
+   if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && ccpu != -1) {
if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
shared = cpus_share_cache(cpu, ccpu);
} else
-- 
2.9.5
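
As a reading aid for the blk-softirq.c hunks, here is a condensed userspace
model of the completion-CPU choice after this change.  It assumes (this part
is outside the hunks shown) that the remainder of __blk_complete_request()
raises the softirq locally when the chosen CPU matches the current one or
shares a cache with it, and kicks the remote CPU otherwise; the function and
parameter names below are made up.

#include <stdbool.h>
#include <stdio.h>

/* Return the CPU on which the softirq completion would run. */
static int completion_cpu(bool mq, int mq_ctx_cpu, int req_cpu,
			  bool same_comp, bool same_force,
			  bool shares_cache, int this_cpu)
{
	/* candidate CPU: always valid for blk-mq, may be -1 for legacy */
	int ccpu = mq ? mq_ctx_cpu : req_cpu;
	bool shared = false;

	if (same_comp && ccpu != -1) {
		if (!same_force)
			shared = shares_cache;
	} else {
		ccpu = this_cpu;
	}

	/* complete locally when the CPUs match or share a cache */
	return (ccpu == this_cpu || shared) ? this_cpu : ccpu;
}

int main(void)
{
	/* blk-mq without QUEUE_FLAG_SAME_COMP: stay on the interrupted CPU */
	printf("mq, no SAME_COMP: cpu %d\n",
	       completion_cpu(true, 2, -1, false, false, false, 5));
	/* blk-mq with QUEUE_FLAG_SAME_COMP: bounce to the submitting ctx CPU */
	printf("mq, SAME_COMP:    cpu %d\n",
	       completion_cpu(true, 2, -1, true, false, false, 5));
	return 0;
}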



Re: [PATCH] blk-mq: complete req in softirq context in case of single queue

2018-09-28 Thread Ming Lei
On Thu, Sep 27, 2018 at 11:30:19AM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 09/27/2018 12:08 AM, Ming Lei wrote:
> > Lot of controllers may have only one irq vector for completing IO
> > request. And usually affinity of the only irq vector is all possible
> > CPUs, however, on most of ARCH, there may be only one specific CPU
> > for handling this interrupt.
> > 
> > So if all IOs are completed in hardirq context, it is inevitable to
> > degrade IO performance because of increased irq latency.
> > 
> > This patch tries to address this issue by allowing to complete request
> > in softirq context, like the legacy IO path.
> > 
> > IOPS is observed as ~13%+ in the following randread test on raid0 over
> > virtio-scsi.
> > 
> > mdadm --create --verbose /dev/md0 --level=0 --chunk=1024 --raid-devices=8 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi
> > 
> > fio --time_based --name=benchmark --runtime=30 --filename=/dev/md0 --nrfiles=1 --ioengine=libaio --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=32 --rw=randread --blocksize=4k
> > 
> > Cc: Zach Marano 
> > Cc: Christoph Hellwig 
> > Cc: Bart Van Assche 
> > Cc: Jianchao Wang 
> > Signed-off-by: Ming Lei 
> > ---
> >  block/blk-mq.c  | 14 ++
> >  block/blk-softirq.c |  7 +--
> >  2 files changed, 19 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 85a1c1a59c72..d4792c3ac983 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -565,6 +565,20 @@ static void __blk_mq_complete_request(struct request *rq)
> > if (rq->internal_tag != -1)
> > blk_mq_sched_completed_request(rq);
> >  
> > +   /*
> > +* Most of single queue controllers, there is only one irq vector
> > +* for handling IO completion, and the only irq's affinity is set
> > +* as all possible CPUs. On most of ARCHs, this affinity means the
> > +* irq is handled on one specific CPU.
> > +*
> > +* So complete IO reqeust in softirq context in case of single queue
> > +* for not degrading IO performance by irqsoff latency.
> > +*/
> > +   if (rq->q->nr_hw_queues == 1) {
> > +   __blk_complete_request(rq);
> > +   return;
> > +   }
> > +
> > if (!test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags)) {
> > rq->q->softirq_done_fn(rq);
> > return;
> > diff --git a/block/blk-softirq.c b/block/blk-softirq.c
> > index 15c1f5e12eb8..b1df9b6c1731 100644
> > --- a/block/blk-softirq.c
> > +++ b/block/blk-softirq.c
> > @@ -101,17 +101,20 @@ void __blk_complete_request(struct request *req)
> > struct request_queue *q = req->q;
> > unsigned long flags;
> > bool shared = false;
> > +   int rq_cpu;
> >  
> > BUG_ON(!q->softirq_done_fn);
> >  
> > +   rq_cpu = q->mq_ops ? req->mq_ctx->cpu : req->cpu;
> > +
> > local_irq_save(flags);
> > cpu = smp_processor_id();
> >  
> > /*
> >  * Select completion CPU
> >  */
> > -   if (req->cpu != -1) {
> > -   ccpu = req->cpu;
> > +   if (rq_cpu != -1) {
> > +   ccpu = q->mq_ops ? req->mq_ctx->cpu : req->cpu;
> > if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
> > shared = cpus_share_cache(cpu, ccpu);
> > } else
> > 
> 
> Only when QUEUE_FLAG_SAME_COMP is set will we try to complete the request
> on the cpu where it was allocated.
> 
> For blk-legacy, req->cpu is -1 when QUEUE_FLAG_SAME_COMP is not set.
> 
> blk_queue_bio
> 
>   if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags))
>   req->cpu = raw_smp_processor_id()
> 
> But for blk-mq, req->mq_ctx->cpu is always valid.
> So we should check QUEUE_FLAG_SAME_COMP for the blk-mq case as well.

Good catch, will fix it in V2.

> 
> On the other hand, how about splitting the softirq completion part out of
> __blk_complete_request?  It is confusing to find __blk_complete_request
> in __blk_mq_complete_request.

__blk_complete_request() is just a common helper shared between the blk-mq
and legacy paths.  We have such usages already, so it looks fine.

Thanks,
Ming