[PATCH v2 2/2] bcache: fix high CPU occupancy during journal

2018-01-31 Thread tang . junhui
From: Tang Junhui 

After running small-write I/O for a long time, we found that CPU occupancy
was very high and that I/O performance had dropped by about half:

[root@ceph151 internal]# top
top - 15:51:05 up 1 day,2:43,  4 users,  load average: 16.89, 15.15, 16.53
Tasks: 2063 total,   4 running, 2059 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.3 us, 17.1 sy,  0.0 ni, 66.1 id, 12.0 wa,  0.0 hi,  0.5 si,  0.0 st
KiB Mem : 65450044 total, 24586420 free, 38909008 used,  1954616 buff/cache
KiB Swap: 65667068 total, 65667068 free,        0 used. 25136812 avail Mem

  PID USER     PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2023 root 20  0   0  0  0 S 55.1  0.0   0:04.42 kworker/11:191
14126 root 20  0   0  0  0 S 42.9  0.0   0:08.72 kworker/10:3
 9292 root 20  0   0  0  0 S 30.4  0.0   1:10.99 kworker/6:1
 8553 ceph 20  0 4242492 1.805g  18804 S 30.0  2.9 410:07.04 ceph-osd
12287 root 20  0   0  0  0 S 26.7  0.0   0:28.13 kworker/7:85
31019 root 20  0   0  0  0 S 26.1  0.0   1:30.79 kworker/22:1
 1787 root 20  0   0  0  0 R 25.7  0.0   5:18.45 kworker/8:7
32169 root 20  0   0  0  0 S 14.5  0.0   1:01.92 kworker/23:1
21476 root 20  0   0  0  0 S 13.9  0.0   0:05.09 kworker/1:54
 2204 root 20  0   0  0  0 S 12.5  0.0   1:25.17 kworker/9:10
16994 root 20  0   0  0  0 S 12.2  0.0   0:06.27 kworker/5:106
15714 root 20  0   0  0  0 R 10.9  0.0   0:01.85 kworker/19:2
 9661 ceph 20  0 4246876 1.731g  18800 S 10.6  2.8 403:00.80 ceph-osd
11460 ceph 20  0 4164692 2.206g  18876 S 10.6  3.5 360:27.19 ceph-osd
 9960 root 20  0   0  0  0 S 10.2  0.0   0:02.75 kworker/2:139
11699 ceph 20  0 4169244 1.920g  18920 S 10.2  3.1 355:23.67 ceph-osd
 6843 ceph 20  0 4197632 1.810g  18900 S  9.6  2.9 380:08.30 ceph-osd

The kernel workers consumed a lot of CPU, and I found they were running journal
work. The journal was reclaiming resources and flushing btree nodes with
surprising frequency.

Through further analysis, we found that in btree_flush_write() we try to
get the btree node with the smallest fifo index to flush by traversing all
the btree nodes in c->bucket_hash. After we get it, since no lock protects
it, this btree node may already have been written to the cache device by
other work; if that happens, we retry the traversal of c->bucket_hash and
get another btree node. When the problem occurred, the retry count was very
high, and we burned a lot of CPU looking for an appropriate btree node.

In this patch, we record up to 128 btree nodes with the smallest fifo index
in a heap, and pop them one by one when we need to flush a btree node. This
greatly reduces the time the loop spends finding an appropriate btree node,
and also reduces CPU occupancy.
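
To illustrate the idea, here is a small self-contained userspace sketch of the
"collect the candidates with the smallest index, then drain them smallest-first"
pattern (illustration only; it does not use the real bcache heap helpers, and
plain linear scans stand in for the heap operations):

#include <stdio.h>
#include <stdlib.h>

#define FLUSH_MAX 128

struct node { unsigned fifo_idx; };

static struct node *cand[FLUSH_MAX];
static int used;

/* keep the FLUSH_MAX nodes with the smallest fifo index seen so far */
static void consider(struct node *n)
{
	int i, max = 0;

	if (used < FLUSH_MAX) {
		cand[used++] = n;
		return;
	}
	for (i = 1; i < FLUSH_MAX; i++)
		if (cand[i]->fifo_idx > cand[max]->fifo_idx)
			max = i;
	if (n->fifo_idx < cand[max]->fifo_idx)
		cand[max] = n;
}

/* hand out the remaining candidate with the smallest fifo index */
static struct node *pop_min(void)
{
	struct node *ret;
	int i, min = 0;

	if (!used)
		return NULL;
	for (i = 1; i < used; i++)
		if (cand[i]->fifo_idx < cand[min]->fifo_idx)
			min = i;
	ret = cand[min];
	cand[min] = cand[--used];
	return ret;
}

int main(void)
{
	struct node nodes[1000];
	struct node *n;
	int i;

	/* one pass over the "hash" fills the candidate set ... */
	for (i = 0; i < 1000; i++) {
		nodes[i].fifo_idx = rand() % 10000;
		consider(&nodes[i]);
	}
	/* ... and later flushes just pop, instead of rescanning everything */
	while ((n = pop_min()))
		printf("%u\n", n->fifo_idx);
	return 0;
}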

Signed-off-by: Tang Junhui 
---
 drivers/md/bcache/bcache.h  |  2 ++
 drivers/md/bcache/journal.c | 47 ++---
 drivers/md/bcache/util.h|  2 ++
 3 files changed, 36 insertions(+), 15 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 0432e28..b343ba4 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -669,6 +669,8 @@ struct cache_set {
 
 #define BUCKET_HASH_BITS   12
struct hlist_head   bucket_hash[1 << BUCKET_HASH_BITS];
+
+   DECLARE_HEAP(struct btree *, flush_btree);
 };
 
 struct bbio {
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 47fd0b8..f42d3ea 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -363,6 +363,12 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
 }
 
 /* Journalling */
+#define journal_max_cmp(l, r) \
+   (fifo_idx(&c->journal.pin, btree_current_write(l)->journal) < \
+fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
+#define journal_min_cmp(l, r) \
+   (fifo_idx(&c->journal.pin, btree_current_write(l)->journal) > \
+fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
 
 static void btree_flush_write(struct cache_set *c)
 {
@@ -370,25 +376,35 @@ static void btree_flush_write(struct cache_set *c)
 * Try to find the btree node with that references the oldest journal
 * entry, best is our current candidate and is locked if non NULL:
 */
-   struct btree *b, *best;
-   unsigned i;
+   struct btree *b;
+   int i;
 
atomic_long_inc(&c->flush_write);
+
 retry:
-   best = NULL;
-
-   for_each_cached_btree(b, c, i)
-   if (btree_current_write(b)->journal) {
-   if (!best)
-   best = b;
-   else if (journal_pin_cmp(c,
-   btree_current_write(best)->journal,
-   btree_current_write(b)->journal)) {
-   best = b;
+   

Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3

2018-01-31 Thread Keith Busch
On Wed, Jan 31, 2018 at 08:07:41PM -0700, Jens Axboe wrote:
>   if (total_phys_segments > queue_max_segments(q))
> - return 0;
> + return 0;

This perhaps unintended change happens to point out another problem:
queue_max_segments is the wrong limit for discards, which require
queue_max_discard_segments.

It might be easier to special-case merging of discard requests, like how
merging a discard bio is handled (untested).

---
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 8452fc7164cc..01671e1373ff 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -550,6 +550,28 @@ static bool req_no_special_merge(struct request *req)
return !q->mq_ops && req->special;
 }
 
+static bool req_attempt_discard_merge(struct request_queue *q, struct request *req,
+   struct request *next)
+{
+   unsigned short segments = blk_rq_nr_discard_segments(req);
+
+   if (segments >= queue_max_discard_segments(q))
+   goto no_merge;
+   if (blk_rq_sectors(req) + bio_sectors(next->bio) >
+   blk_rq_get_max_sectors(req, blk_rq_pos(req)))
+   goto no_merge;
+
+   req->biotail->bi_next = next->bio;
+   req->biotail = next->biotail;
+   req->nr_phys_segments = segments + blk_rq_nr_discard_segments(next);
+   next->bio = NULL;
+   return true;
+
+no_merge:
+   req_set_nomerge(q, req);
+   return false;
+}
+
 static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
struct request *next)
 {
@@ -679,6 +701,15 @@ static struct request *attempt_merge(struct request_queue *q,
if (req->write_hint != next->write_hint)
return NULL;
 
+   /*
+* Discards are ... special.
+*/
+   if (req_op(req) == REQ_OP_DISCARD) {
+   if (req_attempt_discard_merge(q, req, next))
+   return next;
+   return NULL;
+   }
+
/*
 * If we are allowed to merge, then append bio list
 * from next to rq and release next. merge_requests_fn
--


Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3

2018-01-31 Thread Jens Axboe
On 1/31/18 8:33 PM, jianchao.wang wrote:
> Sorry, Jens, I think I didn't get the point.
> Do I miss anything ?
> 
> On 02/01/2018 11:07 AM, Jens Axboe wrote:
>> Yeah I agree, and my last patch missed that we do care about segments for
>> discards. Below should be better...
>>
>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>> index 8452fc7164cc..055057bd727f 100644
>> --- a/block/blk-merge.c
>> +++ b/block/blk-merge.c
>> @@ -553,9 +553,8 @@ static bool req_no_special_merge(struct request *req)
>>  static int ll_merge_requests_fn(struct request_queue *q, struct request 
>> *req,
>>  struct request *next)
>>  {
>> -int total_phys_segments;
>> -unsigned int seg_size =
>> -req->biotail->bi_seg_back_size + next->bio->bi_seg_front_size;
>> +int total_phys_segments = req->nr_phys_segments +
>> +next->nr_phys_segments;
> 
> For DISCARD reqs, the total_phys_segments is still zero here.

This seems broken, it should count the ranges in the discard request.

>> @@ -574,8 +573,15 @@ static int ll_merge_requests_fn(struct request_queue 
>> *q, struct request *req,
>>  blk_rq_get_max_sectors(req, blk_rq_pos(req)))
>>  return 0;
>>  
>> -total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
>> -if (blk_phys_contig_segment(q, req->biotail, next->bio)) {
>> +/*
>> + * If the requests aren't carrying any data payloads, we don't need
>> + * to look at the segment count
>> + */
>> +if (bio_has_data(next->bio) &&
>> +blk_phys_contig_segment(q, req->biotail, next->bio)) {
>> +unsigned int seg_size = req->biotail->bi_seg_back_size +
>> +next->bio->bi_seg_front_size;
> 
> Yes, total_phys_segments will not be decreased.
> 
>> +
>>  if (req->nr_phys_segments == 1)
>>  req->bio->bi_seg_front_size = seg_size;
>>  if (next->nr_phys_segments == 1)
>> @@ -584,7 +590,7 @@ static int ll_merge_requests_fn(struct request_queue *q, 
>> struct request *req,
>>  }
>>  
>>  if (total_phys_segments > queue_max_segments(q))
>> -return 0;
>> +return 0;
>>  
>>  if (blk_integrity_merge_rq(q, req, next) == false)
>>  return 0;
> 
> But finally, the merged DISCARD req's nr_phys_segment is still zero.
> 
> blk_rq_nr_discard_segments will return 1 but this req has two bios.
> blk_rq_nr_discard_segments 's comment says

They should have the range count. The discard merge stuff is a bit of
a hack, it would be nice to get that improved.

-- 
Jens Axboe



Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3

2018-01-31 Thread jianchao.wang
Sorry, Jens, I think I didn't get the point.
Do I miss anything ?

On 02/01/2018 11:07 AM, Jens Axboe wrote:
> Yeah I agree, and my last patch missed that we do care about segments for
> discards. Below should be better...
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 8452fc7164cc..055057bd727f 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -553,9 +553,8 @@ static bool req_no_special_merge(struct request *req)
>  static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
>   struct request *next)
>  {
> - int total_phys_segments;
> - unsigned int seg_size =
> - req->biotail->bi_seg_back_size + next->bio->bi_seg_front_size;
> + int total_phys_segments = req->nr_phys_segments +
> + next->nr_phys_segments;

For DISCARD reqs, the total_phys_segments is still zero here.

>  
>   /*
>* First check if the either of the requests are re-queued
> @@ -574,8 +573,15 @@ static int ll_merge_requests_fn(struct request_queue *q, 
> struct request *req,
>   blk_rq_get_max_sectors(req, blk_rq_pos(req)))
>   return 0;
>  
> - total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
> - if (blk_phys_contig_segment(q, req->biotail, next->bio)) {
> + /*
> +  * If the requests aren't carrying any data payloads, we don't need
> +  * to look at the segment count
> +  */
> + if (bio_has_data(next->bio) &&
> + blk_phys_contig_segment(q, req->biotail, next->bio)) {
> + unsigned int seg_size = req->biotail->bi_seg_back_size +
> + next->bio->bi_seg_front_size;

Yes, total_phys_segments will not be decreased.

> +
>   if (req->nr_phys_segments == 1)
>   req->bio->bi_seg_front_size = seg_size;
>   if (next->nr_phys_segments == 1)
> @@ -584,7 +590,7 @@ static int ll_merge_requests_fn(struct request_queue *q, 
> struct request *req,
>   }
>  
>   if (total_phys_segments > queue_max_segments(q))
> - return 0;
> + return 0;
>  
>   if (blk_integrity_merge_rq(q, req, next) == false)
>   return 0;

But finally, the merged DISCARD req's nr_phys_segments is still zero.

blk_rq_nr_discard_segments will return 1 but this req has two bios.
blk_rq_nr_discard_segments's comment says
-- 
Each discard bio merged into a request is counted as one segment
--

Maybe the patch below should go along with yours.

diff --git a/block/blk-core.c b/block/blk-core.c
index a2005a4..b444fb7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1763,7 +1763,6 @@ bool bio_attempt_discard_merge(struct request_queue *q, struct request *req,
req->biotail = bio;
req->__data_len += bio->bi_iter.bi_size;
req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));
-   req->nr_phys_segments = segments + 1;
 
blk_account_io_start(req, false);
return true;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4f3df80..1af2138 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1312,7 +1312,13 @@ static inline unsigned short blk_rq_nr_phys_segments(struct request *rq)
  */
 static inline unsigned short blk_rq_nr_discard_segments(struct request *rq)
 {
-   return max_t(unsigned short, rq->nr_phys_segments, 1);
+   struct bio *bio;
+   unsigned short count = 0;
+
+   __rq_for_each_bio(bio, rq)
+   count++;
+
+   return count;
 }


Thanks
Jianchao


Re: [PATCH 2/2] bcache: fix high CPU occupancy during journal

2018-01-31 Thread tang . junhui
From: Tang Junhui 


> Unfortunately, this doesn't build because of nonexistent call heap_empty
> (I assume some changes to util.h got left out).  I really need clean
> patches that build and are formatted properly.
> 
> Mike
Oh, I am so sorry for that. A new version of the patch is on the way.
I will be more careful with later patches.
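
(For reference, the missing piece is presumably just a tiny helper in util.h,
mirroring the existing heap_full() macro:)

#define heap_empty(h)	((h)->used == 0)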

Thanks.
Tang Junhui



Re: [PATCH 2/2] bcache: fix high CPU occupancy during journal

2018-01-31 Thread Michael Lyle
Unfortunately, this doesn't build because of nonexistent call heap_empty
(I assume some changes to util.h got left out).  I really need clean
patches that build and are formatted properly.

Mike

On 01/26/2018 12:24 AM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> After long time small writing I/O running, we found the occupancy of CPU
> is very high and I/O performance has been reduced by about half:
> 
> [root@ceph151 internal]# top
> top - 15:51:05 up 1 day,2:43,  4 users,  load average: 16.89, 15.15, 16.53
> Tasks: 2063 total,   4 running, 2059 sleeping,   0 stopped,   0 zombie
> %Cpu(s):4.3 us, 17.1 sy 0.0 ni, 66.1 id, 12.0 wa,  0.0 hi,  0.5 si,  0.0 st
> KiB Mem : 65450044 total, 24586420 free, 38909008 used,  1954616 buff/cache
> KiB Swap: 65667068 total, 65667068 free,0 used. 25136812 avail Mem
> 
>   PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND
>  2023 root 20  0   0  0  0 S 55.1  0.0   0:04.42 kworker/11:191
> 14126 root 20  0   0  0  0 S 42.9  0.0   0:08.72 kworker/10:3
>  9292 root 20  0   0  0  0 S 30.4  0.0   1:10.99 kworker/6:1   
>  8553 ceph 20  0 4242492 1.805g  18804 S 30.0  2.9 410:07.04 ceph-osd
> 12287 root 20  0   0  0  0 S 26.7  0.0   0:28.13 kworker/7:85
> 31019 root 20  0   0  0  0 S 26.1  0.0   1:30.79 kworker/22:1
>  1787 root 20  0   0  0  0 R 25.7  0.0   5:18.45 kworker/8:7
> 32169 root 20  0   0  0  0 S 14.5  0.0   1:01.92 kworker/23:1
> 21476 root 20  0   0  0  0 S 13.9  0.0   0:05.09 kworker/1:54
>  2204 root 20  0   0  0  0 S 12.5  0.0   1:25.17 kworker/9:10
> 16994 root 20  0   0  0  0 S 12.2  0.0   0:06.27 kworker/5:106
> 15714 root 20  0   0  0  0 R 10.9  0.0   0:01.85 kworker/19:2
>  9661 ceph 20  0 4246876 1.731g  18800 S 10.6  2.8 403:00.80 ceph-osd
> 11460 ceph 20  0 4164692 2.206g  18876 S 10.6  3.5 360:27.19 ceph-osd
>  9960 root 20  0   0  0  0 S 10.2  0.0   0:02.75 kworker/2:139
> 11699 ceph 20  0 4169244 1.920g  18920 S 10.2  3.1 355:23.67 ceph-osd
>  6843 ceph 20  0 4197632 1.810g  18900 S  9.6  2.9 380:08.30 ceph-osd
> 
> The kernel work consumed a lot of CPU, and I found they are running journal
> work, The journal is reclaiming source and flush btree node with surprising
> frequency.
> 
> Through further analysis, we found that in btree_flush_write(), we try to
> get a btree node with the smallest fifo idex to flush by traverse all the
> btree nodein c->bucket_hash, after we getting it, since no locker protects
> it, this btree node may have been written to cache device by other works,
> and if this occurred, we retry to traverse in c->bucket_hash and get
> another btree node. When the problem occurrd, the retry times is very high,
> and we consume a lot of CPU in looking for a appropriate btree node.
> 
> In this patch, we try to record 128 btree nodes with the smallest fifo idex
> in heap, and pop one by one when we need to flush btree node. It greatly
> reduces the time for the loop to find the appropriate BTREE node, and also
> reduce the occupancy of CPU.
> 
> Signed-off-by: Tang Junhui 
> ---
>  drivers/md/bcache/bcache.h  |  2 ++
>  drivers/md/bcache/journal.c | 47 
> ++---
>  2 files changed, 34 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 0432e28..b343ba4 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -669,6 +669,8 @@ struct cache_set {
>  
>  #define BUCKET_HASH_BITS 12
>   struct hlist_head   bucket_hash[1 << BUCKET_HASH_BITS];
> +
> + DECLARE_HEAP(struct btree *, flush_btree);
>  };
>  
>  struct bbio {
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index 47fd0b8..f42d3ea 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -363,6 +363,12 @@ int bch_journal_replay(struct cache_set *s, struct 
> list_head *list)
>  }
>  
>  /* Journalling */
> +#define journal_max_cmp(l, r) \
> + (fifo_idx(>journal.pin, btree_current_write(l)->journal) < \
> +  fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
> +#define journal_min_cmp(l, r) \
> + (fifo_idx(>journal.pin, btree_current_write(l)->journal) > \
> +  fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
>  
>  static void btree_flush_write(struct cache_set *c)
>  {
> @@ -370,25 +376,35 @@ static void btree_flush_write(struct cache_set *c)
>* Try to find the btree node with that references the oldest journal
>* entry, best is our current candidate and is locked if non NULL:
>*/
> - struct btree *b, *best;
> - unsigned i;
> + struct btree *b;
> + int i;
>  
>   atomic_long_inc(>flush_write);
> +
>  retry:
> - best = NULL;
> -
> - for_each_cached_btree(b, c, i)
> -  

Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3

2018-01-31 Thread Jens Axboe
On 1/31/18 8:03 PM, jianchao.wang wrote:
> Hi Jens
> 
> 
> On 01/31/2018 11:29 PM, Jens Axboe wrote:
>> How about something like the below?
>>
>>
>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>> index 8452fc7164cc..cee102fb060e 100644
>> --- a/block/blk-merge.c
>> +++ b/block/blk-merge.c
>> @@ -574,8 +574,13 @@ static int ll_merge_requests_fn(struct request_queue 
>> *q, struct request *req,
>>  blk_rq_get_max_sectors(req, blk_rq_pos(req)))
>>  return 0;
>>  
>> +/*
>> + * For DISCARDs, the segment count isn't interesting since
>> + * the requests have no data attached.
>> + */
>>  total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
>> -if (blk_phys_contig_segment(q, req->biotail, next->bio)) {
>> +if (total_phys_segments &&
>> +blk_phys_contig_segment(q, req->biotail, next->bio)) {
>>  if (req->nr_phys_segments == 1)
>>  req->bio->bi_seg_front_size = seg_size;
>>  if (next->nr_phys_segments == 1)
> 
> This patch will avoid the nr_phys_segments to be set to 0x,
> but the merged req will have two bios but zero nr_phys_segments.
> 
> We have to align with the DISCARD merging strategy.
> 
> Please refer to:
> /*
>  * Number of discard segments (or ranges) the driver needs to fill in.
>  * Each discard bio merged into a request is counted as one segment.
>  */
> static inline unsigned short blk_rq_nr_discard_segments(struct request *rq)
> {
>return max_t(unsigned short, rq->nr_phys_segments, 1);
> }
> bool bio_attempt_discard_merge(struct request_queue *q, struct request *req,
>   struct bio *bio)
> {
>   unsigned short segments = blk_rq_nr_discard_segments(req);
> 
>   if (segments >= queue_max_discard_segments(q))
>   goto no_merge;
>   if (blk_rq_sectors(req) + bio_sectors(bio) >
>   blk_rq_get_max_sectors(req, blk_rq_pos(req)))
>   goto no_merge;
> 
>   req->biotail->bi_next = bio;
>   req->biotail = bio;
>   req->__data_len += bio->bi_iter.bi_size;
>   req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));
>   req->nr_phys_segments = segments + 1;
> 
>   blk_account_io_start(req, false);
>   return true;
> no_merge:
>   req_set_nomerge(q, req);
>   return false;
> }
> 
> blk_rq_nr_discard_segments will get a wrong value finally.
> 
> Maybe we could change blk_rq_nr_discard_segments to iterate and count the 
> bios in one request
> to decide the DISCARD request nr_phy_segment. And discard the 
> nr_phys_segments operations in
> the DISCARD merging path, plus your patch here.

Yeah I agree, and my last patch missed that we do care about segments for
discards. Below should be better...

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 8452fc7164cc..055057bd727f 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -553,9 +553,8 @@ static bool req_no_special_merge(struct request *req)
 static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
struct request *next)
 {
-   int total_phys_segments;
-   unsigned int seg_size =
-   req->biotail->bi_seg_back_size + next->bio->bi_seg_front_size;
+   int total_phys_segments = req->nr_phys_segments +
+   next->nr_phys_segments;
 
/*
 * First check if the either of the requests are re-queued
@@ -574,8 +573,15 @@ static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
blk_rq_get_max_sectors(req, blk_rq_pos(req)))
return 0;
 
-   total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
-   if (blk_phys_contig_segment(q, req->biotail, next->bio)) {
+   /*
+* If the requests aren't carrying any data payloads, we don't need
+* to look at the segment count
+*/
+   if (bio_has_data(next->bio) &&
+   blk_phys_contig_segment(q, req->biotail, next->bio)) {
+   unsigned int seg_size = req->biotail->bi_seg_back_size +
+   next->bio->bi_seg_front_size;
+
if (req->nr_phys_segments == 1)
req->bio->bi_seg_front_size = seg_size;
if (next->nr_phys_segments == 1)
@@ -584,7 +590,7 @@ static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
}
 
if (total_phys_segments > queue_max_segments(q))
-   return 0;
+   return 0;
 
if (blk_integrity_merge_rq(q, req, next) == false)
return 0;

-- 
Jens Axboe



Re: [LSF/MM TOPIC] KPTI effect on IO performance

2018-01-31 Thread Ming Lei
On Wed, Jan 31, 2018 at 11:43:33AM -0700, Scotty Bauer wrote:
> On 2018-01-31 01:23, Ming Lei wrote:
> > Hi All,
> > 
> > After KPTI is merged, there is extra load introduced to context switch
> > between user space and kernel space. It is observed on my laptop that
> > one
> > syscall takes extra ~0.15us[1] compared with 'nopti'.
> > 
> > IO performance is affected too, it is observed that IOPS drops by 32% in
> > my test[2] on null_blk compared with 'nopti':
> > 
> > randread IOPS on latest linus tree:
> > -
> > | randread IOPS | randread IOPS with 'nopti'|
> > 
> > | 928K  | 1372K |
> > 
> > 
> > 
> 
> Do you know if your CPU has PCID? It would be interesting to see these tests
> on older CPUs or older kernels without PCID support.

BTW, I also saw test data for a vCPU without PCID, and reportedly the syscall
time can then be close to ~30X that of 'nopti'; such a test should be easy to
set up by adjusting the CPU model in QEMU.

So without PCID, the KPTI effect on IO performance should be much bigger than
the numbers above.

Thanks,
Ming


Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3

2018-01-31 Thread Jens Axboe
On 1/31/18 4:33 PM, Keith Busch wrote:
> On Wed, Jan 31, 2018 at 08:29:37AM -0700, Jens Axboe wrote:
>>
>> How about something like the below?
>>
>>
>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>> index 8452fc7164cc..cee102fb060e 100644
>> --- a/block/blk-merge.c
>> +++ b/block/blk-merge.c
>> @@ -574,8 +574,13 @@ static int ll_merge_requests_fn(struct request_queue 
>> *q, struct request *req,
>>  blk_rq_get_max_sectors(req, blk_rq_pos(req)))
>>  return 0;
>>  
>> +/*
>> + * For DISCARDs, the segment count isn't interesting since
>> + * the requests have no data attached.
>> + */
>>  total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
>> -if (blk_phys_contig_segment(q, req->biotail, next->bio)) {
>> +if (total_phys_segments &&
>> +blk_phys_contig_segment(q, req->biotail, next->bio)) {
>>  if (req->nr_phys_segments == 1)
>>  req->bio->bi_seg_front_size = seg_size;
>>  if (next->nr_phys_segments == 1)
> 
> That'll keep it from going to 0x, but you'll still hit the warning and
> IO error. Even worse, this will corrupt memory: blk_rq_nr_discard_segments
> will return 1, and since you really had 2 segments, the nvme driver will
> overrun its array.

Yeah you are right, that patch was shit. How about the below? We only
need to worry about segment size and number of segments if we are
carrying data. req->biotail and next->bio must be the same type, so
should be safe.


diff --git a/block/blk-merge.c b/block/blk-merge.c
index 8452fc7164cc..cf9adc4c64b5 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -553,9 +553,7 @@ static bool req_no_special_merge(struct request *req)
 static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
struct request *next)
 {
-   int total_phys_segments;
-   unsigned int seg_size =
-   req->biotail->bi_seg_back_size + next->bio->bi_seg_front_size;
+   int total_phys_segments = 0;
 
/*
 * First check if the either of the requests are re-queued
@@ -574,17 +572,27 @@ static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
blk_rq_get_max_sectors(req, blk_rq_pos(req)))
return 0;
 
-   total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
-   if (blk_phys_contig_segment(q, req->biotail, next->bio)) {
-   if (req->nr_phys_segments == 1)
-   req->bio->bi_seg_front_size = seg_size;
-   if (next->nr_phys_segments == 1)
-   next->biotail->bi_seg_back_size = seg_size;
-   total_phys_segments--;
-   }
+   /*
+* If the requests aren't carrying any data payloads, we don't need
+* to look at the segment count
+*/
+   if (bio_has_data(next->bio)) {
+   total_phys_segments = req->nr_phys_segments +
+   next->nr_phys_segments;
+   if (blk_phys_contig_segment(q, req->biotail, next->bio)) {
+   unsigned int seg_size = req->biotail->bi_seg_back_size +
+   next->bio->bi_seg_front_size;
+
+   if (req->nr_phys_segments == 1)
+   req->bio->bi_seg_front_size = seg_size;
+   if (next->nr_phys_segments == 1)
+   next->biotail->bi_seg_back_size = seg_size;
+   total_phys_segments--;
+   }
 
-   if (total_phys_segments > queue_max_segments(q))
-   return 0;
+   if (total_phys_segments > queue_max_segments(q))
+   return 0;
+   }
 
if (blk_integrity_merge_rq(q, req, next) == false)
return 0;

-- 
Jens Axboe



Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3

2018-01-31 Thread jianchao.wang
Hi Jens


On 01/31/2018 11:29 PM, Jens Axboe wrote:
> How about something like the below?
> 
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 8452fc7164cc..cee102fb060e 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -574,8 +574,13 @@ static int ll_merge_requests_fn(struct request_queue *q, 
> struct request *req,
>   blk_rq_get_max_sectors(req, blk_rq_pos(req)))
>   return 0;
>  
> + /*
> +  * For DISCARDs, the segment count isn't interesting since
> +  * the requests have no data attached.
> +  */
>   total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
> - if (blk_phys_contig_segment(q, req->biotail, next->bio)) {
> + if (total_phys_segments &&
> + blk_phys_contig_segment(q, req->biotail, next->bio)) {
>   if (req->nr_phys_segments == 1)
>   req->bio->bi_seg_front_size = seg_size;
>   if (next->nr_phys_segments == 1)

This patch will avoid nr_phys_segments being set to 0xffff,
but the merged req will have two bios and zero nr_phys_segments.

We have to align with the DISCARD merging strategy.

Please refer to:
/*
 * Number of discard segments (or ranges) the driver needs to fill in.
 * Each discard bio merged into a request is counted as one segment.
 */
static inline unsigned short blk_rq_nr_discard_segments(struct request *rq)
{
   return max_t(unsigned short, rq->nr_phys_segments, 1);
}
bool bio_attempt_discard_merge(struct request_queue *q, struct request *req,
struct bio *bio)
{
unsigned short segments = blk_rq_nr_discard_segments(req);

if (segments >= queue_max_discard_segments(q))
goto no_merge;
if (blk_rq_sectors(req) + bio_sectors(bio) >
blk_rq_get_max_sectors(req, blk_rq_pos(req)))
goto no_merge;

req->biotail->bi_next = bio;
req->biotail = bio;
req->__data_len += bio->bi_iter.bi_size;
req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));
req->nr_phys_segments = segments + 1;

blk_account_io_start(req, false);
return true;
no_merge:
req_set_nomerge(q, req);
return false;
}

blk_rq_nr_discard_segments will end up with a wrong value.

Maybe we could change blk_rq_nr_discard_segments to iterate over and count
the bios in one request to decide the DISCARD request's nr_phys_segments,
and drop the nr_phys_segments operations in the DISCARD merging path, plus
your patch here.

Thanks
Jianchao


Re: [LSF/MM TOPIC] KPTI effect on IO performance

2018-01-31 Thread Ming Lei
Hi Scotty,

On Wed, Jan 31, 2018 at 11:43:33AM -0700, Scotty Bauer wrote:
> On 2018-01-31 01:23, Ming Lei wrote:
> > Hi All,
> > 
> > After KPTI is merged, there is extra load introduced to context switch
> > between user space and kernel space. It is observed on my laptop that
> > one
> > syscall takes extra ~0.15us[1] compared with 'nopti'.
> > 
> > IO performance is affected too, it is observed that IOPS drops by 32% in
> > my test[2] on null_blk compared with 'nopti':
> > 
> > randread IOPS on latest linus tree:
> > -
> > | randread IOPS | randread IOPS with 'nopti'|
> > 
> > | 928K  | 1372K |
> > 
> > 
> > 
> 
> Do you know if your CPU has PCID? It would be interesting to see these tests
> on older CPUs or older kernels without PCID support.

My CPU has PCID, which can be checked via /proc/cpuinfo.

The above test was run on the same kernel binary; the only difference is
whether 'nopti' was present on the kernel command line.
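
For what it's worth, the per-syscall overhead itself can be estimated with a
trivial userspace loop like the one below (just an illustration, not the exact
benchmark referenced above), run once with and once without 'nopti':

/*
 * Illustration only: time N back-to-back trivial syscalls and report the
 * average cost, to compare a default (pti) boot against a 'nopti' boot.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	const long n = 10 * 1000 * 1000;
	struct timespec t0, t1;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < n; i++)
		syscall(SYS_getpid);	/* forces a real kernel entry/exit */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.1f ns per syscall\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / n);
	return 0;
}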

Thanks,
Ming


Re: [PATCH 2/2] bcache: fix high CPU occupancy during journal

2018-01-31 Thread Michael Lyle
LGTM on first read--  I'll read it again and test in test branch.

Reviewed-by: Michael Lyle 

On 01/26/2018 12:24 AM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> After long time small writing I/O running, we found the occupancy of CPU
> is very high and I/O performance has been reduced by about half:
> 
> [root@ceph151 internal]# top
> top - 15:51:05 up 1 day,2:43,  4 users,  load average: 16.89, 15.15, 16.53
> Tasks: 2063 total,   4 running, 2059 sleeping,   0 stopped,   0 zombie
> %Cpu(s):4.3 us, 17.1 sy 0.0 ni, 66.1 id, 12.0 wa,  0.0 hi,  0.5 si,  0.0 st
> KiB Mem : 65450044 total, 24586420 free, 38909008 used,  1954616 buff/cache
> KiB Swap: 65667068 total, 65667068 free,0 used. 25136812 avail Mem
> 
>   PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND
>  2023 root 20  0   0  0  0 S 55.1  0.0   0:04.42 kworker/11:191
> 14126 root 20  0   0  0  0 S 42.9  0.0   0:08.72 kworker/10:3
>  9292 root 20  0   0  0  0 S 30.4  0.0   1:10.99 kworker/6:1   
>  8553 ceph 20  0 4242492 1.805g  18804 S 30.0  2.9 410:07.04 ceph-osd
> 12287 root 20  0   0  0  0 S 26.7  0.0   0:28.13 kworker/7:85
> 31019 root 20  0   0  0  0 S 26.1  0.0   1:30.79 kworker/22:1
>  1787 root 20  0   0  0  0 R 25.7  0.0   5:18.45 kworker/8:7
> 32169 root 20  0   0  0  0 S 14.5  0.0   1:01.92 kworker/23:1
> 21476 root 20  0   0  0  0 S 13.9  0.0   0:05.09 kworker/1:54
>  2204 root 20  0   0  0  0 S 12.5  0.0   1:25.17 kworker/9:10
> 16994 root 20  0   0  0  0 S 12.2  0.0   0:06.27 kworker/5:106
> 15714 root 20  0   0  0  0 R 10.9  0.0   0:01.85 kworker/19:2
>  9661 ceph 20  0 4246876 1.731g  18800 S 10.6  2.8 403:00.80 ceph-osd
> 11460 ceph 20  0 4164692 2.206g  18876 S 10.6  3.5 360:27.19 ceph-osd
>  9960 root 20  0   0  0  0 S 10.2  0.0   0:02.75 kworker/2:139
> 11699 ceph 20  0 4169244 1.920g  18920 S 10.2  3.1 355:23.67 ceph-osd
>  6843 ceph 20  0 4197632 1.810g  18900 S  9.6  2.9 380:08.30 ceph-osd
> 
> The kernel work consumed a lot of CPU, and I found they are running journal
> work, The journal is reclaiming source and flush btree node with surprising
> frequency.
> 
> Through further analysis, we found that in btree_flush_write(), we try to
> get a btree node with the smallest fifo idex to flush by traverse all the
> btree nodein c->bucket_hash, after we getting it, since no locker protects
> it, this btree node may have been written to cache device by other works,
> and if this occurred, we retry to traverse in c->bucket_hash and get
> another btree node. When the problem occurrd, the retry times is very high,
> and we consume a lot of CPU in looking for a appropriate btree node.
> 
> In this patch, we try to record 128 btree nodes with the smallest fifo idex
> in heap, and pop one by one when we need to flush btree node. It greatly
> reduces the time for the loop to find the appropriate BTREE node, and also
> reduce the occupancy of CPU.
> 
> Signed-off-by: Tang Junhui 
> ---
>  drivers/md/bcache/bcache.h  |  2 ++
>  drivers/md/bcache/journal.c | 47 
> ++---
>  2 files changed, 34 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 0432e28..b343ba4 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -669,6 +669,8 @@ struct cache_set {
>  
>  #define BUCKET_HASH_BITS 12
>   struct hlist_head   bucket_hash[1 << BUCKET_HASH_BITS];
> +
> + DECLARE_HEAP(struct btree *, flush_btree);
>  };
>  
>  struct bbio {
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index 47fd0b8..f42d3ea 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -363,6 +363,12 @@ int bch_journal_replay(struct cache_set *s, struct 
> list_head *list)
>  }
>  
>  /* Journalling */
> +#define journal_max_cmp(l, r) \
> + (fifo_idx(>journal.pin, btree_current_write(l)->journal) < \
> +  fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
> +#define journal_min_cmp(l, r) \
> + (fifo_idx(>journal.pin, btree_current_write(l)->journal) > \
> +  fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
>  
>  static void btree_flush_write(struct cache_set *c)
>  {
> @@ -370,25 +376,35 @@ static void btree_flush_write(struct cache_set *c)
>* Try to find the btree node with that references the oldest journal
>* entry, best is our current candidate and is locked if non NULL:
>*/
> - struct btree *b, *best;
> - unsigned i;
> + struct btree *b;
> + int i;
>  
>   atomic_long_inc(>flush_write);
> +
>  retry:
> - best = NULL;
> -
> - for_each_cached_btree(b, c, i)
> - if (btree_current_write(b)->journal) {
> - if (!best)
> - 

Re: [PATCH 1/2] bcache: add journal statistic

2018-01-31 Thread Michael Lyle
LGTM except for formatting / an extra newline (will fix) -- in my test
branch for possible 4.16

Reviewed-by: Michael Lyle 

On 01/26/2018 12:23 AM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> Sometimes, Journal takes up a lot of CPU, we need statistics
> to know what's the journal is doing. So this patch provide
> some journal statistics:
> 1) reclaim: how many times the journal try to reclaim resource,
>usually the journal bucket or/and the pin are exhausted.
> 2) flush_write: how many times the journal try to flush btree node
>to cache device, usually the journal bucket are exhausted.
> 3) retry_flush_write: how many times the journal retry to flush
>the next btree node, usually the previous tree node have been
>flushed by other thread.
> we show these statistic by sysfs interface. Through these statistics
> We can totally see the status of journal module when the CPU is too
> high.
> 
> Signed-off-by: Tang Junhui 
> ---
>  drivers/md/bcache/bcache.h  |  5 +
>  drivers/md/bcache/journal.c |  5 +
>  drivers/md/bcache/sysfs.c   | 15 +++
>  3 files changed, 25 insertions(+)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index abd31e8..0432e28 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -647,6 +647,11 @@ struct cache_set {
>   atomic_long_t   writeback_keys_done;
>   atomic_long_t   writeback_keys_failed;
>  
> +
> + atomic_long_t   reclaim;
> + atomic_long_t   flush_write;
> + atomic_long_t   retry_flush_write;
> +
>   enum{
>   ON_ERROR_UNREGISTER,
>   ON_ERROR_PANIC,
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index 02a98dd..47fd0b8 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -372,6 +372,8 @@ static void btree_flush_write(struct cache_set *c)
>*/
>   struct btree *b, *best;
>   unsigned i;
> +
> + atomic_long_inc(&c->flush_write);
>  retry:
>   best = NULL;
>  
> @@ -392,6 +394,7 @@ static void btree_flush_write(struct cache_set *c)
>   if (!btree_current_write(b)->journal) {
>   mutex_unlock(&b->write_lock);
>   /* We raced */
> + atomic_long_inc(&c->retry_flush_write);
>   goto retry;
>   }
>  
> @@ -471,6 +474,8 @@ static void journal_reclaim(struct cache_set *c)
>   unsigned iter, n = 0;
>   atomic_t p;
>  
> + atomic_long_inc(&c->reclaim);
> +
>   while (!atomic_read(&fifo_front(&c->journal.pin)))
>   fifo_pop(&c->journal.pin, p);
>  
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index 234b2f5..0afbf1a 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -65,6 +65,9 @@
>  
>  read_attribute(state);
>  read_attribute(cache_read_races);
> +read_attribute(reclaim);
> +read_attribute(flush_write);
> +read_attribute(retry_flush_write);
>  read_attribute(writeback_keys_done);
>  read_attribute(writeback_keys_failed);
>  read_attribute(io_errors);
> @@ -543,6 +546,15 @@ static unsigned bch_average_key_size(struct cache_set *c)
>   sysfs_print(cache_read_races,
>   atomic_long_read(&c->cache_read_races));
>  
> + sysfs_print(reclaim,
> + atomic_long_read(&c->reclaim));
> +
> + sysfs_print(flush_write,
> + atomic_long_read(&c->flush_write));
> +
> + sysfs_print(retry_flush_write,
> + atomic_long_read(&c->retry_flush_write));
> +
>   sysfs_print(writeback_keys_done,
>   atomic_long_read(&c->writeback_keys_done));
>   sysfs_print(writeback_keys_failed,
> @@ -729,6 +741,9 @@ static void bch_cache_set_internal_release(struct kobject 
> *k)
>  
>   _bset_tree_stats,
>   _cache_read_races,
> + _reclaim,
> + _flush_write,
> + _retry_flush_write,
>   _writeback_keys_done,
>   _writeback_keys_failed,
>  
> 



Re: [PATCH 1/2] lightnvm: remove mlc pairs structure

2018-01-31 Thread Javier González
> On 31 Jan 2018, at 16.35, Matias Bjørling  wrote:
> 
> On 01/31/2018 03:00 AM, Javier González wrote:
>>> On 30 Jan 2018, at 21.26, Matias Bjørling  wrote:
>>> 
>>> The known implementations of the 1.2 specification, and upcoming 2.0
>>> implementation all expose a sequential list of pages to write.
>>> Remove the data structure, as it is no longer needed.
>>> 
>>> Signed-off-by: Matias Bjørling 
>>> ---
>>> drivers/nvme/host/lightnvm.c | 14 +-
>>> 1 file changed, 1 insertion(+), 13 deletions(-)
>> Even though the current implementations does not use the MLC pairing
>> information, this is part of the on the 1.2 identification. Until we
>> eventually remove 1.2 support (if we do), the identify structure should
>> reflect the specification as is.
>> Javier
> 
> I already started on removing the MLC bits in previous patches. These
> two patches continue that trend. I don't know of any implementing
> either multiple groups or MLC, and since it has not been implemented
> before, no one is going to use it. No reason to have dead code hanging
> around. This is similar to the NVMe device driver not having
> definitions for all that is in the specification until it is used in
> the implementation.

It's ok with me - just wanted to point out that it seems weird removing
bits from the structure that describes the 1.2 identify command. Guess
these can come back if someone ever needs them...

Javier




Re: [PATCH v2 2/2] block: Fix a race between the throttling code and request queue initialization

2018-01-31 Thread Joseph Qi
Hi Bart,

On 18/2/1 07:53, Bart Van Assche wrote:
> Initialize the request queue lock earlier such that the following
> race can no longer occur:
> 
> blk_init_queue_node blkcg_print_blkgs
>   blk_alloc_queue_node (1)
> q->queue_lock = >__queue_lock (2)
> blkcg_init_queue(q) (3)
> spin_lock_irq(blkg->q->queue_lock) (4)
>   q->queue_lock = lock (5)
> spin_unlock_irq(blkg->q->queue_lock) (6)
> 
> (1) allocate an uninitialized queue;
> (2) initialize queue_lock to its default internal lock;
> (3) initialize blkcg part of request queue, which will create blkg and
> then insert it to blkg_list;
> (4) traverse blkg_list and find the created blkg, and then take its
> queue lock, here it is the default *internal lock*;
> (5) *race window*, now queue_lock is overridden with *driver specified
> lock*;
> (6) now unlock *driver specified lock*, not the locked *internal lock*,
> unlock balance breaks.
> 
> The changes in this patch are as follows:
> - Move the .queue_lock initialization from blk_init_queue_node() into
>   blk_alloc_queue_node().
> - For all all block drivers that initialize .queue_lock explicitly,
>   change the blk_alloc_queue() call in the driver into a
>   blk_alloc_queue_node() call and remove the explicit .queue_lock
>   initialization. Additionally, initialize the spin lock that will
>   be used as queue lock earlier if necessary.
> 
I'm afraid the risk may also exist in blk_cleanup_queue(), which will
set queue_lock back to the default internal lock:

spin_lock_irq(lock);
if (q->queue_lock != &q->__queue_lock)
	q->queue_lock = &q->__queue_lock;
spin_unlock_irq(lock);

I'm thinking of taking blkg->q->queue_lock into a local variable first, but
that still means we could be using the driver lock even after queue_lock has
already been reset to the default internal lock.
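
For illustration, what I mean is roughly this (a sketch only, not a proposed
fix):

	spinlock_t *lock = blkg->q->queue_lock;

	spin_lock_irq(lock);
	/* ... print the blkg stats ... */
	spin_unlock_irq(lock);

	/*
	 * The lock/unlock pair is balanced now, but we may still be taking the
	 * driver's lock here after blk_cleanup_queue() has already switched
	 * q->queue_lock back to the internal &q->__queue_lock.
	 */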

Thanks,
Joseph


[PATCH v2 0/2] block: Fix a race between the throttling code and request queue initialization

2018-01-31 Thread Bart Van Assche
Hello Jens,

The two patches in this series fix a recently reported race between the
throttling code and request queue initialization. It would be appreciated
if you could have a look at this patch series.

Thanks,

Bart.

Changes between v1 and v2:
- Split a single patch into two patches.
- Dropped blk_alloc_queue_node2() and modified all block drivers that call
  blk_alloc_queue_node().

Bart Van Assche (2):
  block: Add a third argument to blk_alloc_queue_node()
  block: Fix a race between the throttling code and request queue
initialization

 block/blk-core.c   | 29 +++--
 block/blk-mq.c |  2 +-
 drivers/block/drbd/drbd_main.c |  3 +--
 drivers/block/null_blk.c   |  3 ++-
 drivers/block/umem.c   |  7 +++
 drivers/ide/ide-probe.c|  2 +-
 drivers/lightnvm/core.c|  2 +-
 drivers/md/dm.c|  2 +-
 drivers/mmc/core/queue.c   |  3 +--
 drivers/nvdimm/pmem.c  |  2 +-
 drivers/nvme/host/multipath.c  |  2 +-
 drivers/scsi/scsi_lib.c|  2 +-
 include/linux/blkdev.h |  3 ++-
 13 files changed, 35 insertions(+), 27 deletions(-)

-- 
2.16.0



[PATCH v2 2/2] block: Fix a race between the throttling code and request queue initialization

2018-01-31 Thread Bart Van Assche
Initialize the request queue lock earlier such that the following
race can no longer occur:

blk_init_queue_node                  blkcg_print_blkgs
  blk_alloc_queue_node (1)
    q->queue_lock = &q->__queue_lock (2)
    blkcg_init_queue(q) (3)
                                       spin_lock_irq(blkg->q->queue_lock) (4)
  q->queue_lock = lock (5)
                                       spin_unlock_irq(blkg->q->queue_lock) (6)

(1) allocate an uninitialized queue;
(2) initialize queue_lock to its default internal lock;
(3) initialize blkcg part of request queue, which will create blkg and
then insert it to blkg_list;
(4) traverse blkg_list and find the created blkg, and then take its
queue lock, here it is the default *internal lock*;
(5) *race window*, now queue_lock is overridden with *driver specified
lock*;
(6) now unlock *driver specified lock*, not the locked *internal lock*,
unlock balance breaks.

The changes in this patch are as follows:
- Move the .queue_lock initialization from blk_init_queue_node() into
  blk_alloc_queue_node().
- For all block drivers that initialize .queue_lock explicitly,
  change the blk_alloc_queue() call in the driver into a
  blk_alloc_queue_node() call and remove the explicit .queue_lock
  initialization. Additionally, initialize the spin lock that will
  be used as queue lock earlier if necessary.

Reported-by: Joseph Qi 
Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Joseph Qi 
Cc: Philipp Reisner 
Cc: Ulf Hansson 
Cc: Kees Cook 
Cc: 
---
 block/blk-core.c   | 24 
 drivers/block/drbd/drbd_main.c |  3 +--
 drivers/block/umem.c   |  7 +++
 drivers/mmc/core/queue.c   |  3 +--
 4 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 860a039fd1a8..c2c81c5b7420 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -946,6 +946,20 @@ static void blk_rq_timed_out_timer(struct timer_list *t)
kblockd_schedule_work(&q->timeout_work);
 }
 
+/**
+ * blk_alloc_queue_node - allocate a request queue
+ * @gfp_mask: memory allocation flags
+ * @node_id: NUMA node to allocate memory from
+ * @lock: Pointer to a spinlock that will be used to e.g. serialize calls to
+ *   the legacy .request_fn(). Only set this pointer for queues that use
+ *   legacy mode and not for queues that use blk-mq.
+ *
+ * Note: pass the queue lock as the third argument to this function instead of
+ * setting the queue lock pointer explicitly to avoid triggering a crash in
+ * the blkcg throttling code. That code namely makes sysfs attributes visible
+ * in user space before this function returns and the show methods of these
+ * sysfs attributes use the queue lock.
+ */
 struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id,
   spinlock_t *lock)
 {
@@ -998,11 +1012,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id,
mutex_init(&q->sysfs_lock);
spin_lock_init(&q->__queue_lock);
 
-   /*
-* By default initialize queue_lock to internal lock and driver can
-* override it later if need be.
-*/
-   q->queue_lock = &q->__queue_lock;
+   q->queue_lock = lock ? : &q->__queue_lock;
 
/*
 * A queue starts its life with bypass turned on to avoid
@@ -1089,13 +1099,11 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 {
struct request_queue *q;
 
-   q = blk_alloc_queue_node(GFP_KERNEL, node_id, NULL);
+   q = blk_alloc_queue_node(GFP_KERNEL, node_id, lock);
if (!q)
return NULL;
 
q->request_fn = rfn;
-   if (lock)
-   q->queue_lock = lock;
if (blk_init_allocated_queue(q) < 0) {
blk_cleanup_queue(q);
return NULL;
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 4b4697a1f963..058247bc2f30 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2822,7 +2822,7 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
 
drbd_init_set_defaults(device);
 
-   q = blk_alloc_queue(GFP_KERNEL);
+   q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE, &resource->req_lock);
if (!q)
goto out_no_q;
device->rq_queue = q;
@@ -2854,7 +2854,6 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
/* Setting the max_hw_sectors to an odd value of 8kibyte here
   This triggers a max_bio_size message upon first attach or connect */
blk_queue_max_hw_sectors(q, DRBD_MAX_BIO_SIZE_SAFE >> 8);
-   q->queue_lock = &resource->req_lock;
 
device->md_io.page = alloc_page(GFP_KERNEL);
  

[PATCH v2 1/2] block: Add a third argument to blk_alloc_queue_node()

2018-01-31 Thread Bart Van Assche
This patch does not change any functionality.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Joseph Qi 
Cc: Philipp Reisner 
Cc: Ulf Hansson 
Cc: Kees Cook 
Cc: 
---
 block/blk-core.c  | 7 ---
 block/blk-mq.c| 2 +-
 drivers/block/null_blk.c  | 3 ++-
 drivers/ide/ide-probe.c   | 2 +-
 drivers/lightnvm/core.c   | 2 +-
 drivers/md/dm.c   | 2 +-
 drivers/nvdimm/pmem.c | 2 +-
 drivers/nvme/host/multipath.c | 2 +-
 drivers/scsi/scsi_lib.c   | 2 +-
 include/linux/blkdev.h| 3 ++-
 10 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index bd43bc50740a..860a039fd1a8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -868,7 +868,7 @@ void blk_exit_rl(struct request_queue *q, struct request_list *rl)
 
 struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 {
-   return blk_alloc_queue_node(gfp_mask, NUMA_NO_NODE);
+   return blk_alloc_queue_node(gfp_mask, NUMA_NO_NODE, NULL);
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
@@ -946,7 +946,8 @@ static void blk_rq_timed_out_timer(struct timer_list *t)
kblockd_schedule_work(&q->timeout_work);
 }
 
-struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
+struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id,
+  spinlock_t *lock)
 {
struct request_queue *q;
 
@@ -1088,7 +1089,7 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 {
struct request_queue *q;
 
-   q = blk_alloc_queue_node(GFP_KERNEL, node_id);
+   q = blk_alloc_queue_node(GFP_KERNEL, node_id, NULL);
if (!q)
return NULL;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index aacc5280e25f..8191391d1a1d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2554,7 +2554,7 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
 {
struct request_queue *uninit_q, *q;
 
-   uninit_q = blk_alloc_queue_node(GFP_KERNEL, set->numa_node);
+   uninit_q = blk_alloc_queue_node(GFP_KERNEL, set->numa_node, NULL);
if (!uninit_q)
return ERR_PTR(-ENOMEM);
 
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index aa1c7d4bcac5..9fe8f2a3ec45 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -1717,7 +1717,8 @@ static int null_add_dev(struct nullb_device *dev)
}
null_init_queues(nullb);
} else if (dev->queue_mode == NULL_Q_BIO) {
-   nullb->q = blk_alloc_queue_node(GFP_KERNEL, dev->home_node);
+   nullb->q = blk_alloc_queue_node(GFP_KERNEL, dev->home_node,
+   NULL);
if (!nullb->q) {
rv = -ENOMEM;
goto out_cleanup_queues;
diff --git a/drivers/ide/ide-probe.c b/drivers/ide/ide-probe.c
index 70d6d8ff0fd9..1303d0e31e80 100644
--- a/drivers/ide/ide-probe.c
+++ b/drivers/ide/ide-probe.c
@@ -766,7 +766,7 @@ static int ide_init_queue(ide_drive_t *drive)
 *  limits and LBA48 we could raise it but as yet
 *  do not.
 */
-   q = blk_alloc_queue_node(GFP_KERNEL, hwif_to_node(hwif));
+   q = blk_alloc_queue_node(GFP_KERNEL, hwif_to_node(hwif), NULL);
if (!q)
return 1;
 
diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index dcc9e621e651..5f1988df1593 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -384,7 +384,7 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
goto err_dev;
}
 
-   tqueue = blk_alloc_queue_node(GFP_KERNEL, dev->q->node);
+   tqueue = blk_alloc_queue_node(GFP_KERNEL, dev->q->node, NULL);
if (!tqueue) {
ret = -ENOMEM;
goto err_disk;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 5c478c185041..93f3ef15b4b2 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1731,7 +1731,7 @@ static struct mapped_device *alloc_dev(int minor)
INIT_LIST_HEAD(>table_devices);
spin_lock_init(>uevent_lock);
 
-   md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id);
+   md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id, NULL);
if (!md->queue)
goto bad;
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 49285701fe48..118b4b13592d 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -342,7 +342,7 @@ static int pmem_attach_disk(struct device *dev,
return -EBUSY;
}
 
-   q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(dev));
+   q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(dev), NULL);
if (!q)
   

Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3

2018-01-31 Thread Keith Busch
On Wed, Jan 31, 2018 at 08:29:37AM -0700, Jens Axboe wrote:
> 
> How about something like the below?
> 
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 8452fc7164cc..cee102fb060e 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -574,8 +574,13 @@ static int ll_merge_requests_fn(struct request_queue *q, 
> struct request *req,
>   blk_rq_get_max_sectors(req, blk_rq_pos(req)))
>   return 0;
>  
> + /*
> +  * For DISCARDs, the segment count isn't interesting since
> +  * the requests have no data attached.
> +  */
>   total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
> - if (blk_phys_contig_segment(q, req->biotail, next->bio)) {
> + if (total_phys_segments &&
> + blk_phys_contig_segment(q, req->biotail, next->bio)) {
>   if (req->nr_phys_segments == 1)
>   req->bio->bi_seg_front_size = seg_size;
>   if (next->nr_phys_segments == 1)

That'll keep it from going to 0xffff, but you'll still hit the warning and
IO error. Even worse, this will corrupt memory: blk_rq_nr_discard_segments
will return 1, and since you really had 2 segments, the nvme driver will
overrun its array.


Re: [PATCH] Use bio_endio instead of bio_put in error path of blk_rq_append_bio

2018-01-31 Thread Jiri Palecek


On 1/31/18 6:24 AM, Ming Lei wrote:

On Tue, Jan 30, 2018 at 04:24:14PM +0100, Jiri Palecek wrote:

On 1/30/18 1:53 PM, Ming Lei wrote:

On Thu, Jan 25, 2018 at 9:58 PM, Jiří Paleček  wrote:

   Avoids page leak from bounced requests
---
   block/blk-map.c | 3 ++-
   1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index d3a94719f03f..702d68166689 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -26,7 +26,8 @@ int blk_rq_append_bio(struct request *rq, struct bio **bio)
  } else {
  if (!ll_back_merge_fn(rq->q, rq, *bio)) {
  if (orig_bio != *bio) {
-   bio_put(*bio);
+   bio_inc_remaining(orig_bio);
+   bio_endio(*bio);

'bio_inc_remaining(orig_bio);' shouldn't be needed since we don't chain bounced
bio, otherwise this patch is fine.

I believe it is needed or at least desirable. The situation when the request
bounced is like this

bio (bounced) . bi_private ---> orig_bio

and at the end of bounce_end_io, bio_endio(bio->bi_private) [which is
bio_endio(orig_bio) in our case] is called. That doesn't have any effect on
__blk_rq_map_user_iov; its bios have .bi_end_io==0, therefore one call more
or less doesn't matter. However, for other callers, like osd_initiator.c,
this is not the case. They pass bios which have bi_end_io, and might be
surprised if this was called unexpectedly.

OK, I think it is good to follow the rule of not calling .bi_end_io() in
the failure path, just like __blk_rq_map_user_iov()/blk_rq_map_kern().

But seems pscsi_map_sg() depends on bio_endio(orig_bio) to free the 'orig_bio',
could you fix it in this patch too?
Yes, there seems to be a leak in the error path of that code. However, it has
been there for more than a year, so I didn't think it was urgent. I'll have a
look at it, but I would prefer that someone familiar with pscsi have their say
as well.



Before  caa4b02476e3, blk_rq_append_request wouldn't touch/delete the passed
bio at all under any circumstances. I believe it should stay that way and
incrementing the remaining counter, which effectively nullifies the extra
bio_endio call, does that pretty straightforwardly.

Seems too tricky to use bio_inc_remaining() for avoiding bio_endio(orig_bio),
if we have to do that, I suggest to add comment on that.


Okay. I thought that it didn't really reach the level of sophistication 
otherwise seen in block layer code :)


Regards

    Jiri Palecek



Re: [PATCH] block: Fix a race between the throttling code and request queue initialization

2018-01-31 Thread Jens Axboe
On 1/31/18 12:13 PM, Bart Van Assche wrote:
> Initialize the request queue lock earlier such that the following
> race can no longer occur:
> 
> blk_init_queue_node blkcg_print_blkgs
>   blk_alloc_queue_node (1)
> q->queue_lock = >__queue_lock (2)
> blkcg_init_queue(q) (3)
> spin_lock_irq(blkg->q->queue_lock) (4)
>   q->queue_lock = lock (5)
> spin_unlock_irq(blkg->q->queue_lock) (6)
> 
> (1) allocate an uninitialized queue;
> (2) initialize queue_lock to its default internal lock;
> (3) initialize blkcg part of request queue, which will create blkg and
> then insert it to blkg_list;
> (4) traverse blkg_list and find the created blkg, and then take its
> queue lock, here it is the default *internal lock*;
> (5) *race window*, now queue_lock is overridden with *driver specified
> lock*;
> (6) now unlock *driver specified lock*, not the locked *internal lock*,
> unlock balance breaks.
> 
> The changes in this patch are as follows:
> - Rename blk_alloc_queue_node() into blk_alloc_queue_node2() and add
>   a new queue lock argument.
> - Introduce a wrapper function with the same name and behavior as the
>   old blk_alloc_queue_node() function.

Let's please not do any of that, that's a horrible name to export. It's
not like we have hundreds of callers of blk_alloc_queue_node(), just
change them to pass in a NULL if they use the queue embedded lock.


-- 
Jens Axboe



[PATCH] block: Fix a race between the throttling code and request queue initialization

2018-01-31 Thread Bart Van Assche
Initialize the request queue lock earlier such that the following
race can no longer occur:

blk_init_queue_node              blkcg_print_blkgs
  blk_alloc_queue_node (1)
    q->queue_lock = &q->__queue_lock (2)
    blkcg_init_queue(q) (3)
                                   spin_lock_irq(blkg->q->queue_lock) (4)
  q->queue_lock = lock (5)
                                   spin_unlock_irq(blkg->q->queue_lock) (6)

(1) allocate an uninitialized queue;
(2) initialize queue_lock to its default internal lock;
(3) initialize blkcg part of request queue, which will create blkg and
then insert it to blkg_list;
(4) traverse blkg_list and find the created blkg, and then take its
queue lock, here it is the default *internal lock*;
(5) *race window*, now queue_lock is overridden with *driver specified
lock*;
(6) now unlock *driver specified lock*, not the locked *internal lock*,
unlock balance breaks.

The changes in this patch are as follows:
- Rename blk_alloc_queue_node() into blk_alloc_queue_node2() and add
  a new queue lock argument.
- Introduce a wrapper function with the same name and behavior as the
  old blk_alloc_queue_node() function.
- Move the .queue_lock initialization from blk_init_queue_node() into
  blk_alloc_queue_node2().
- For all block drivers that initialize .queue_lock explicitly,
  change the blk_alloc_queue() call in the driver into a
  blk_alloc_queue_node2() call and remove the explicit .queue_lock
  initialization. Additionally, initialize the spin lock that will
  be used as queue lock earlier if necessary.

Reported-by: Joseph Qi 
Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Joseph Qi 
Cc: Philipp Reisner 
Cc: Ulf Hansson 
Cc: Kees Cook 
Cc: 
---
 block/blk-core.c   | 33 -
 drivers/block/drbd/drbd_main.c |  4 ++--
 drivers/block/umem.c   |  7 +++
 drivers/mmc/core/queue.c   |  3 +--
 include/linux/blkdev.h |  2 ++
 5 files changed, 32 insertions(+), 17 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index bd43bc50740a..cb18f57e5b13 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -946,7 +946,22 @@ static void blk_rq_timed_out_timer(struct timer_list *t)
	kblockd_schedule_work(&q->timeout_work);
 }
 
-struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
+/**
+ * blk_alloc_queue_node2 - allocate a request queue
+ * @gfp_mask: memory allocation flags
+ * @node_id: NUMA node to allocate memory from
+ * @lock: Pointer to a spinlock that will be used to e.g. serialize calls to
+ *   the legacy .request_fn(). Only set this pointer for queues that use
+ *   legacy mode and not for queues that use blk-mq.
+ *
+ * Note: use this function instead of calling blk_alloc_queue_node() and
+ * setting the queue lock pointer explicitly to avoid triggering a crash in
+ * the blkcg throttling code. That code namely makes sysfs attributes visible
+ * in user space before this function returns and the show method of these
+ * attributes uses the queue lock.
+ */
+struct request_queue *blk_alloc_queue_node2(gfp_t gfp_mask, int node_id,
+   spinlock_t *lock)
 {
struct request_queue *q;
 
@@ -997,11 +1012,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
mutex_init(>sysfs_lock);
spin_lock_init(>__queue_lock);
 
-   /*
-* By default initialize queue_lock to internal lock and driver can
-* override it later if need be.
-*/
-   q->queue_lock = &q->__queue_lock;
+   q->queue_lock = lock ? : &q->__queue_lock;
 
/*
 * A queue starts its life with bypass turned on to avoid
@@ -1042,6 +1053,12 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
kmem_cache_free(blk_requestq_cachep, q);
return NULL;
 }
+EXPORT_SYMBOL(blk_alloc_queue_node2);
+
+struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
+{
+   return blk_alloc_queue_node2(gfp_mask, node_id, NULL);
+}
 EXPORT_SYMBOL(blk_alloc_queue_node);
 
 /**
@@ -1088,13 +1105,11 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 {
struct request_queue *q;
 
-   q = blk_alloc_queue_node(GFP_KERNEL, node_id);
+   q = blk_alloc_queue_node2(GFP_KERNEL, node_id, lock);
if (!q)
return NULL;
 
q->request_fn = rfn;
-   if (lock)
-   q->queue_lock = lock;
if (blk_init_allocated_queue(q) < 0) {
blk_cleanup_queue(q);
return NULL;
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 4b4697a1f963..965e80b13443 100644
--- a/drivers/block/drbd/drbd_main.c
+++ 

Re: [LSF/MM TOPIC] KPTI effect on IO performance

2018-01-31 Thread Scotty Bauer

On 2018-01-31 01:23, Ming Lei wrote:

Hi All,

After KPTI is merged, there is extra load introduced to context switch
between user space and kernel space. It is observed on my laptop that one
syscall takes extra ~0.15us[1] compared with 'nopti'.

IO performance is affected too, it is observed that IOPS drops by 32% in
my test[2] on null_blk compared with 'nopti':

randread IOPS on latest linus tree:
---------------------------------------------
| randread IOPS | randread IOPS with 'nopti' |
| 928K          | 1372K                      |





Do you know if your CPU has PCID? It would be interesting to see these 
tests on older CPUs or older kernels without PCID support.


Re: [PATCH 5/5] lightnvm: pblk: refactor bad block identification

2018-01-31 Thread Matias Bjørling

On 01/31/2018 10:13 AM, Javier Gonzalez wrote:




On 31 Jan 2018, at 16.51, Matias Bjørling  wrote:


On 01/31/2018 03:06 AM, Javier González wrote:
In preparation for the OCSSD 2.0 spec. bad block identification,
refactor the current code to generalize bad block get/set functions and
structures.
Signed-off-by: Javier González 
---
  drivers/lightnvm/pblk-init.c | 213 +++
  drivers/lightnvm/pblk.h  |   6 --
  2 files changed, 112 insertions(+), 107 deletions(-)
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index a5e3510c0f38..86a94a7faa96 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -365,7 +365,25 @@ static void pblk_luns_free(struct pblk *pblk)
  kfree(pblk->luns);
  }
  -static void pblk_free_line_bitmaps(struct pblk_line *line)
+static void pblk_line_mg_free(struct pblk *pblk)
+{
+struct pblk_line_mgmt *l_mg = >l_mg;
+int i;
+
+kfree(l_mg->bb_template);
+kfree(l_mg->bb_aux);
+kfree(l_mg->vsc_list);
+
+for (i = 0; i < PBLK_DATA_LINES; i++) {
+kfree(l_mg->sline_meta[i]);
+pblk_mfree(l_mg->eline_meta[i]->buf, l_mg->emeta_alloc_type);
+kfree(l_mg->eline_meta[i]);
+}
+
+kfree(pblk->lines);
+}
+
+static void pblk_line_meta_free(struct pblk_line *line)
  {
  kfree(line->blk_bitmap);
  kfree(line->erase_bitmap);
@@ -382,40 +400,16 @@ static void pblk_lines_free(struct pblk *pblk)
  line = >lines[i];
pblk_line_free(pblk, line);
-pblk_free_line_bitmaps(line);
+pblk_line_meta_free(line);
  }
  spin_unlock(_mg->free_lock);
  }
  -static void pblk_line_meta_free(struct pblk *pblk)
+static int pblk_bb_get_tbl(struct nvm_tgt_dev *dev, struct pblk_lun *rlun,
+   u8 *blks, int nr_blks)
  {
-struct pblk_line_mgmt *l_mg = >l_mg;
-int i;
-
-kfree(l_mg->bb_template);
-kfree(l_mg->bb_aux);
-kfree(l_mg->vsc_list);
-
-for (i = 0; i < PBLK_DATA_LINES; i++) {
-kfree(l_mg->sline_meta[i]);
-pblk_mfree(l_mg->eline_meta[i]->buf, l_mg->emeta_alloc_type);
-kfree(l_mg->eline_meta[i]);
-}
-
-kfree(pblk->lines);
-}
-
-static int pblk_bb_discovery(struct nvm_tgt_dev *dev, struct pblk_lun *rlun)
-{
-struct nvm_geo *geo = >geo;
  struct ppa_addr ppa;
-u8 *blks;
-int nr_blks, ret;
-
-nr_blks = geo->nr_chks * geo->plane_mode;
-blks = kmalloc(nr_blks, GFP_KERNEL);
-if (!blks)
-return -ENOMEM;
+int ret;
ppa.ppa = 0;
  ppa.g.ch = rlun->bppa.g.ch;
@@ -423,34 +417,56 @@ static int pblk_bb_discovery(struct nvm_tgt_dev *dev, 
struct pblk_lun *rlun)
ret = nvm_get_tgt_bb_tbl(dev, ppa, blks);
  if (ret)
-goto out;
+return ret;
nr_blks = nvm_bb_tbl_fold(dev->parent, blks, nr_blks);
-if (nr_blks < 0) {
-ret = nr_blks;
-goto out;
-}
-
-rlun->bb_list = blks;
+if (nr_blks < 0)
+return -EIO;
return 0;
-out:
-kfree(blks);
-return ret;
+}
+
+static void *pblk_bb_get_log(struct pblk *pblk)
+{
+struct nvm_tgt_dev *dev = pblk->dev;
+struct nvm_geo *geo = >geo;
+u8 *log;
+int i, nr_blks, blk_per_lun;
+int ret;
+
+blk_per_lun = geo->nr_chks * geo->plane_mode;
+nr_blks = blk_per_lun * geo->all_luns;
+
+log = kmalloc(nr_blks, GFP_KERNEL);
+if (!log)
+return ERR_PTR(-ENOMEM);
+
+for (i = 0; i < geo->all_luns; i++) {
+struct pblk_lun *rlun = >luns[i];
+u8 *log_pos = log + i * blk_per_lun;
+
+ret = pblk_bb_get_tbl(dev, rlun, log_pos, blk_per_lun);
+if (ret) {
+kfree(log);
+return ERR_PTR(-EIO);
+}
+}
+
+return log;
  }
static int pblk_bb_line(struct pblk *pblk, struct pblk_line *line,
-int blk_per_line)
+u8 *bb_log, int blk_per_line)
  {
  struct nvm_tgt_dev *dev = pblk->dev;
  struct nvm_geo *geo = >geo;
-struct pblk_lun *rlun;
-int bb_cnt = 0;
-int i;
+int i, bb_cnt = 0;
for (i = 0; i < blk_per_line; i++) {
-rlun = >luns[i];
-if (rlun->bb_list[line->id] == NVM_BLK_T_FREE)
+struct pblk_lun *rlun = >luns[i];
+u8 *lun_bb_log = bb_log + i * blk_per_line;
+
+if (lun_bb_log[line->id] == NVM_BLK_T_FREE)
  continue;
set_bit(pblk_ppa_to_pos(geo, rlun->bppa), line->blk_bitmap);
@@ -460,29 +476,12 @@ static int pblk_bb_line(struct pblk *pblk, struct 
pblk_line *line,
  return bb_cnt;
  }
  -static int pblk_alloc_line_bitmaps(struct pblk *pblk, struct pblk_line *line)
-{
-struct pblk_line_meta *lm = >lm;
-
-line->blk_bitmap = kzalloc(lm->blk_bitmap_len, GFP_KERNEL);
-if (!line->blk_bitmap)
-return -ENOMEM;
-
-line->erase_bitmap = kzalloc(lm->blk_bitmap_len, GFP_KERNEL);
-if (!line->erase_bitmap) {
-kfree(line->blk_bitmap);
-return 

Re: [LSF/MM TOPIC] Killing reliance on struct page->mapping

2018-01-31 Thread Jerome Glisse
On Wed, Jan 31, 2018 at 05:55:58PM +, Al Viro wrote:
> On Wed, Jan 31, 2018 at 12:42:45PM -0500, Jerome Glisse wrote:
> 
> > For block devices the idea is to use struct page and buffer_head (first one 
> > of
> > a page) as a key to find mapping (struct address_space) back.
> 
> Details, please...

Note that I am not talking about block device pages (I am excluding
those). Just regular filesystem pages (ext*, xfs, btrfs, ...).

So in the block device context, AFAIK the only time you need the mapping is
if there is some I/O error. Given that I am doing this with the intent to
write protect the page, one can argue that I can wait for all writeback
to complete before proceeding. At that point it does not matter to the block
device if page->mapping is no longer an address_space, because the block
device code is done with the page and has forgotten about it.

That's one solution. Another one is to have struct bio_vec store a
buffer_head pointer rather than a page pointer; from the buffer_head you can
find the struct page, and using the buffer_head and struct page pointers you
can walk the KSM rmap_item chain to find the mapping back. This would be
needed on I/O error for pending writeback of a newly write-protected page,
so one can argue that the overhead of the chain lookup to find the mapping
against which to report the I/O error is an acceptable cost.

Another solution is to override the writeback end callback with a special
one capable of finding the mapping from the struct page and bio pointers.
This would not need any change to block device code. It would have the same
overhead, though, as solution 2 above.
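
A minimal sketch of that third option, only to make the idea concrete (the
wp_* names are made up, not existing kernel API, and error reporting is
reduced to mapping_set_error()):

#include <linux/bio.h>
#include <linux/pagemap.h>
#include <linux/slab.h>

struct wp_end_io_ctx {
	struct address_space *mapping;
	bio_end_io_t *orig_end_io;
	void *orig_private;
};

static void wp_end_io(struct bio *bio)
{
	struct wp_end_io_ctx *ctx = bio->bi_private;

	if (bio->bi_status)
		mapping_set_error(ctx->mapping,
				  blk_status_to_errno(bio->bi_status));

	/* Hand the completion back to the original owner. */
	bio->bi_private = ctx->orig_private;
	bio->bi_end_io = ctx->orig_end_io;
	kfree(ctx);
	if (bio->bi_end_io)
		bio->bi_end_io(bio);
}

/* Called before submitting the writeback bio of a write-protected page. */
static int wp_hook_end_io(struct bio *bio, struct address_space *mapping)
{
	struct wp_end_io_ctx *ctx = kmalloc(sizeof(*ctx), GFP_NOIO);

	if (!ctx)
		return -ENOMEM;
	ctx->mapping = mapping;
	ctx->orig_end_io = bio->bi_end_io;
	ctx->orig_private = bio->bi_private;
	bio->bi_private = ctx;
	bio->bi_end_io = wp_end_io;
	return 0;
}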


My intention was to stick to the first solution (wait for writeback and
make no modification to block device structs or functions), and then later,
if it makes sense, add support to write protect a page before writeback
is done.

Cheers,
Jérôme


Re: [LSF/MM TOPIC] Killing reliance on struct page->mapping

2018-01-31 Thread Al Viro
On Wed, Jan 31, 2018 at 12:42:45PM -0500, Jerome Glisse wrote:

> For block devices the idea is to use struct page and buffer_head (first one of
> a page) as a key to find mapping (struct address_space) back.

Details, please...


Re: [LSF/MM TOPIC] Killing reliance on struct page->mapping

2018-01-31 Thread Jerome Glisse
On Wed, Jan 31, 2018 at 07:09:48PM +0200, Igor Stoppa wrote:
> On 30/01/18 02:43, Jerome Glisse wrote:
> 
> [...]
> 
> > Maybe we can kill page->mapping altogether as a result of this. However 
> > this is
> > not my motivation at this time.
> 
> We had a discussion some time ago
> 
> http://www.openwall.com/lists/kernel-hardening/2017/07/07/7
> 
> where you advised to use it for tracking pmalloc pages vs area, which
> generated this patch:
> 
> http://www.openwall.com/lists/kernel-hardening/2018/01/24/7
> 
> Could you please comment on what would happen to the shortcut from struct
> page to vm_struct that this patch is now introducing?

Sadly, struct page fields mean different things depending on the context
in which the page is used. This is confusing, I know. So when I say kill
page->mapping, I am not saying shrink struct page and remove that field; I
am saying maybe we can kill the current users of page->mapping for regular
process pages (ie pages that are in some mmap() area of a process).

Other uses of that field in different contexts, like yours, are not affected
by this change and can ignore it altogether.

Hope this clarifies it :)

Cheers,
Jérôme


Re: [LSF/MM TOPIC] Killing reliance on struct page->mapping

2018-01-31 Thread Jerome Glisse
On Wed, Jan 31, 2018 at 04:56:46PM +, Al Viro wrote:
> On Mon, Jan 29, 2018 at 07:43:48PM -0500, Jerome Glisse wrote:
> > I started a patchset about $TOPIC a while ago, right now i am working on 
> > other
> > thing but i hope to have an RFC for $TOPIC before LSF/MM and thus would 
> > like a
> > slot during common track to talk about it as it impacts FS, BLOCK and MM (i 
> > am
> > assuming their will be common track).
> > 
> > Idea is that mapping (struct address_space) is available in virtualy all the
> > places where it is needed and that their should be no reasons to depend 
> > only on
> > struct page->mapping field. My patchset basicly add mapping to a bunch of 
> > vfs
> > callback (struct address_space_operations) where it is missing, changing 
> > call
> > site. Then i do an individual patch per filesystem to leverage the new 
> > argument
> > instead on struct page.
> 
> Oh?  What about the places like fs/coda?  Or block devices, for that matter...
> You can't count upon file->f_mapping->host == file_inode(file).

What matters is that the place that calls an address_space_operations callback
already has mapping == page->mapping; in many places this is obvious. For
instance, the page has just been looked up using the mapping, and thus you must
have mapping == page->mapping. But I believe this holds in all places. There
are a few dark corners (fuse, splice, ...). Truncate also factors into all
this, as page->mapping is used to determine if a page has been truncated, but
it should not be an issue.

So I am not counting on file->f_mapping->host == file_inode(file), but I might
count in _some_ places on vma->file->f_mapping == page->mapping for any
non-private page inside that vma. AFAICT this holds for coda and should hold
elsewhere too.

For block devices the idea is to use struct page and buffer_head (first one of
a page) as a key to find mapping (struct address_space) back.

The overall idea I have is that in any place in the kernel (except memory
reclaim, but that's ok) we can get either the mapping or the buffer_head
information without relying on struct page, and if we have either one plus a
struct page then we can find the other one.

Like I said, I am not done with a patchset for that yet, so maybe I am too
optimistic. I have another patchset I need to finish first before I go back to
this. I hope to have an RFC sometime in February or March, and maybe by then
I will have found a roadblock; I am crossing my fingers until then :)

If it turns out that it is not doable, I will comment on this thread and we can
kill that off from the agenda.

Cheers,
Jérôme


Re: [LSF/MM TOPIC] Killing reliance on struct page->mapping

2018-01-31 Thread Igor Stoppa
On 30/01/18 02:43, Jerome Glisse wrote:

[...]

> Maybe we can kill page->mapping altogether as a result of this. However this 
> is
> not my motivation at this time.

We had a discussion some time ago

http://www.openwall.com/lists/kernel-hardening/2017/07/07/7

where you advised to use it for tracking pmalloc pages vs area, which
generated this patch:

http://www.openwall.com/lists/kernel-hardening/2018/01/24/7

Could you please comment on what would happen to the shortcut from struct
page to vm_struct that this patch is now introducing?


--
thanks, igor


Re: [LSF/MM TOPIC] Killing reliance on struct page->mapping

2018-01-31 Thread Al Viro
On Mon, Jan 29, 2018 at 07:43:48PM -0500, Jerome Glisse wrote:
> I started a patchset about $TOPIC a while ago; right now I am working on other
> things, but I hope to have an RFC for $TOPIC before LSF/MM and thus would like
> a slot during the common track to talk about it, as it impacts FS, BLOCK and MM
> (I am assuming there will be a common track).
> 
> The idea is that the mapping (struct address_space) is available in virtually
> all the places where it is needed, and that there should be no reason to
> depend only on the struct page->mapping field. My patchset basically adds the
> mapping to a bunch of vfs callbacks (struct address_space_operations) where it
> is missing, changing the call sites. Then I do an individual patch per
> filesystem to leverage the new argument instead of struct page.

Oh?  What about the places like fs/coda?  Or block devices, for that matter...
You can't count upon file->f_mapping->host == file_inode(file).



Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3

2018-01-31 Thread Jens Axboe
On 1/30/18 9:25 PM, jianchao.wang wrote:
> Hi Jens
> 
> On 01/30/2018 11:57 PM, Jens Axboe wrote:
>> On 1/30/18 8:41 AM, Jens Axboe wrote:
>>> Hi,
>>>
>>> I just hit this on 4.15+ on the laptop, it's running Linus' git
>>> as of yesterday, right after the block tree merge:
>>>
>>> commit 0a4b6e2f80aad46fb55a5cf7b1664c0aef030ee0
>>> Merge: 9697e9da8429 796baeeef85a
>>> Author: Linus Torvalds 
>>> Date:   Mon Jan 29 11:51:49 2018 -0800
>>>
>>> Merge branch 'for-4.16/block' of git://git.kernel.dk/linux-block
>>>
>>> It's hitting this case:
>>>
>>> if (WARN_ON_ONCE(n != segments)) {  
>>> 
>>>
>>> in nvme, indicating that rq->nr_phys_segments does not equal the
>>> number of bios in the request. Anyone seen this? I'll take a closer
>>> look as time permits, full trace below.
>>>
>>>
>>>  WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 
>>> nvme_setup_cmd+0x3d3/0x430
>>>  Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc 
>>> snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat 
>>> snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 
>>> snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq 
>>> x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb 
>>> snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer 
>>> videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 
>>> crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore 
>>> hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd 
>>> intel_gtt
>>>  CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G U   4.15.0+ 
>>> #176
>>>  Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) 
>>> 12/19/2017
>>>  RIP: 0010:nvme_setup_cmd+0x3d3/0x430
>>>  RSP: 0018:880423e9f838 EFLAGS: 00010217
>>>  RAX:  RBX: 880423e9f8c8 RCX: 0001
>>>  RDX: 88022b200010 RSI: 0002 RDI: 327f
>>>  RBP: 880421251400 R08: 88022b20 R09: 0009
>>>  R10:  R11:  R12: 
>>>  R13: 88042341e280 R14:  R15: 880421251440
>>>  FS:  () GS:88044150() 
>>> knlGS:
>>>  CS:  0010 DS:  ES:  CR0: 80050033
>>>  CR2: 55b684795030 CR3: 02e09006 CR4: 001606e0
>>>  DR0:  DR1:  DR2: 
>>>  DR3:  DR6: fffe0ff0 DR7: 0400
>>>  Call Trace:
>>>   nvme_queue_rq+0x40/0xa00
>>>   ? __sbitmap_queue_get+0x24/0x90
>>>   ? blk_mq_get_tag+0xa3/0x250
>>>   ? wait_woken+0x80/0x80
>>>   ? blk_mq_get_driver_tag+0x97/0xf0
>>>   blk_mq_dispatch_rq_list+0x7b/0x4a0
>>>   ? deadline_remove_request+0x49/0xb0
>>>   blk_mq_do_dispatch_sched+0x4f/0xc0
>>>   blk_mq_sched_dispatch_requests+0x106/0x170
>>>   __blk_mq_run_hw_queue+0x53/0xa0
>>>   __blk_mq_delay_run_hw_queue+0x83/0xa0
>>>   blk_mq_run_hw_queue+0x6c/0xd0
>>>   blk_mq_sched_insert_request+0x96/0x140
>>>   __blk_mq_try_issue_directly+0x3d/0x190
>>>   blk_mq_try_issue_directly+0x30/0x70
>>>   blk_mq_make_request+0x1a4/0x6a0
>>>   generic_make_request+0xfd/0x2f0
>>>   ? submit_bio+0x5c/0x110
>>>   submit_bio+0x5c/0x110
>>>   ? __blkdev_issue_discard+0x152/0x200
>>>   submit_bio_wait+0x43/0x60
>>>   ext4_process_freed_data+0x1cd/0x440
>>>   ? account_page_dirtied+0xe2/0x1a0
>>>   ext4_journal_commit_callback+0x4a/0xc0
>>>   jbd2_journal_commit_transaction+0x17e2/0x19e0
>>>   ? kjournald2+0xb0/0x250
>>>   kjournald2+0xb0/0x250
>>>   ? wait_woken+0x80/0x80
>>>   ? commit_timeout+0x10/0x10
>>>   kthread+0x111/0x130
>>>   ? kthread_create_worker_on_cpu+0x50/0x50
>>>   ? do_group_exit+0x3a/0xa0
>>>   ret_from_fork+0x1f/0x30
>>>  Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 
>>> 48 8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff <0f> ff 4c 89 
>>> c7 41 bc 0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff 
>>>  ---[ end trace 50d361cc444506c8 ]---
>>>  print_req_error: I/O error, dev nvme0n1, sector 847167488
>>
>> Looking at the disassembly, 'n' is 2 and 'segments' is 0x.
>>
> 
> Let's consider the following case:
> 
> blk_mq_bio_to_request()
>   -> blk_init_request_from_bio()
> -> blk_rq_bio_prep()
> 
> if (bio_has_data(bio))
> rq->nr_phys_segments = bio_phys_segments(q, bio);
> 
> static inline bool bio_has_data(struct bio *bio)
> {
> if (bio &&
> bio->bi_iter.bi_size &&
> bio_op(bio) != REQ_OP_DISCARD &&   //-> HERE !
> bio_op(bio) != REQ_OP_SECURE_ERASE &&
> bio_op(bio) != REQ_OP_WRITE_ZEROES)
> return true;
> 
> return false;
> }
> For the DISCARD req, the nr_phys_segments is 0.
> 
> dd_insert_request()
>   -> blk_mq_sched_try_insert_merge()
> -> 

Re: [LSF/MM TOPIC] De-clustered RAID with MD

2018-01-31 Thread David Brown
On 31/01/18 15:27, Wols Lists wrote:
> On 31/01/18 09:58, David Brown wrote:
>> I would also be interested in how the data and parities are distributed
>> across cabinets and disk controllers.  When you manually build from
>> smaller raid sets, you can ensure that in set the data disks and the
>> parity are all in different cabinets - that way if an entire cabinet
>> goes up in smoke, you have lost one drive from each set, and your data
>> is still there.  With a pseudo random layout, you have lost that.  (I
>> don't know how often entire cabinets of disks die, but I once lost both
>> disks of a raid1 mirror when the disk controller card died.)
> 
> The more I think about how I plan to spec raid-61, the more a modulo
> approach seems to make sense. That way, it'll be fairly easy to predict
> what ends up where, and make sure your disks are evenly scattered.
> 
> I think both your and my approach might have problems with losing an
> entire cabinet, however. Depends on how many drives per cabinet ...

Exactly.  I don't know how many cabinets are used on such systems.

> 
> Anyways, my second thoughts are ...
> 
> We have what I will call a stripe-block. The lowest common multiple of
> "disks needed" ie number of mirrors times number of drives in the
> raid-6, and the disks available.
> 
> Assuming my blocks are all stored sequentially I can then quickly
> calculate their position in this stripe-block. But this will fall foul
> of just hammering the drives nearest to the failed drive. But if I
> pseudo-randomise this position with "position * prime mod drives" where
> "prime" is not common to either the number of drives or the number or
> mirrors or the number of raid-drives, then this should achieve my aim of
> uniquely shuffling the location of all the blocks without collisions.
> 
> Pretty simple maths, for efficiency, that smears the data over all the
> drives. Does that sound feasible? All the heavy lifting, calculating the
> least common multiple, finding the prime, etc etc can be done at array
> set-up time.

Something like that should work, and be convenient to implement.  I am
not sure off the top of my head if such a simple modulo system is valid,
but it won't be difficult to check.
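
For instance, a quick userspace check of the "position * prime mod drives"
shuffle (purely illustrative; the constants here are arbitrary) could be:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Confirm the chosen multiplier is coprime to the drive count. */
static unsigned gcd(unsigned a, unsigned b)
{
	while (b) {
		unsigned t = a % b;
		a = b;
		b = t;
	}
	return a;
}

/* True if pos -> (pos * prime) % drives hits every slot exactly once. */
static bool is_collision_free(unsigned prime, unsigned drives)
{
	bool *seen = calloc(drives, sizeof(*seen));
	bool ok = true;
	unsigned pos;

	for (pos = 0; pos < drives; pos++) {
		unsigned slot = (pos * prime) % drives;

		if (seen[slot]) {
			ok = false;
			break;
		}
		seen[slot] = true;
	}
	free(seen);
	return ok;
}

int main(void)
{
	unsigned drives = 100, prime = 7;

	printf("gcd(%u, %u) = %u, collision free: %s\n",
	       prime, drives, gcd(prime, drives),
	       is_collision_free(prime, drives) ? "yes" : "no");
	return 0;
}

The shuffle is collision free exactly when the multiplier shares no factor
with the number of drives, which is what the gcd check confirms.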

> 
> (If this then allows feasible 100-drive arrays, we won't just need an
> incremental assemble mode, we might need an incremental build mode :-)
> 

You really want to track which stripes are valid here, and which are not
yet made consistent.  A blank array will start with everything marked
invalid or inconsistent - build mode is just a matter of writing the
metadata.  You only need to make stripes consistent when you first write
to them.




Re: [LSF/MM TOPIC] De-clustered RAID with MD

2018-01-31 Thread Wols Lists
On 31/01/18 09:58, David Brown wrote:
> I would also be interested in how the data and parities are distributed
> across cabinets and disk controllers.  When you manually build from
> smaller raid sets, you can ensure that in set the data disks and the
> parity are all in different cabinets - that way if an entire cabinet
> goes up in smoke, you have lost one drive from each set, and your data
> is still there.  With a pseudo random layout, you have lost that.  (I
> don't know how often entire cabinets of disks die, but I once lost both
> disks of a raid1 mirror when the disk controller card died.)

The more I think about how I plan to spec raid-61, the more a modulo
approach seems to make sense. That way, it'll be fairly easy to predict
what ends up where, and make sure your disks are evenly scattered.

I think both your and my approach might have problems with losing an
entire cabinet, however. Depends on how many drives per cabinet ...

Anyways, my second thoughts are ...

We have what I will call a stripe-block. The lowest common multiple of
"disks needed" ie number of mirrors times number of drives in the
raid-6, and the disks available.

Assuming my blocks are all stored sequentially I can then quickly
calculate their position in this stripe-block. But this will fall foul
of just hammering the drives nearest to the failed drive. But if I
pseudo-randomise this position with "position * prime mod drives" where
"prime" is not common to either the number of drives or the number or
mirrors or the number of raid-drives, then this should achieve my aim of
uniquely shuffling the location of all the blocks without collisions.

Pretty simple maths, for efficiency, that smears the data over all the
drives. Does that sound feasible? All the heavy lifting, calculating the
least common multiple, finding the prime, etc etc can be done at array
set-up time.

(If this then allows feasible 100-drive arrays, we won't just need an
incremental assemble mode, we might need an incremental build mode :-)

Cheers,
Wol


Re: [LSF/MM TOPIC] De-clustered RAID with MD

2018-01-31 Thread Johannes Thumshirn
David Brown  writes:
> That sounds smart.  I don't see that you need anything particularly
> complicated for how you distribute your data and parity drives across
> the 100 disks - you just need a fairly even spread.

Exactly.

> I would be more concerned with how you could deal with resizing such an
> array.  In particular, I think it is not unlikely that someone with a
> 100 drive array will one day want to add another bank of 24 disks (or
> whatever fits in a cabinet).  Making that work nicely would, I believe,
> be more important than making sure the rebuild load distribution is
> balanced evenly across 99 drives.

I don't think rebuilding is such a big deal; let's consider the following
hypothetical scenario:

6 Disks with 4 data blocks (3 replicas per block, could be RAID1 like
duplicates or RAID5 like data + parity, doesn't matter at all for this
example)

D1  D2  D3  D4  D5  D6
[A] [B] [C] [ ] [ ] [ ]
[ ] [ ] [ ] [A] [D] [B] 
[ ] [A] [B] [ ] [C] [ ]
[C] [ ] [ ] [D] [ ] [D]

Now we're adding one disk and rebalance:

D1  D2  D3  D4  D5  D6  D7
[A] [B] [C] [ ] [ ] [ ] [A]
[ ] [ ] [ ] [ ] [D] [B] [ ]
[ ] [A] [B] [ ] [ ] [ ] [C]
[C] [ ] [ ] [D] [ ] [D] [ ]

This moved the "A" from D4 and the "C" from D5 to D7. The whole
rebalancing affected only 3 disks (read from D4 and D5 write to D7).

> I would also be interested in how the data and parities are distributed
> across cabinets and disk controllers.  When you manually build from
> smaller raid sets, you can ensure that in set the data disks and the
> parity are all in different cabinets - that way if an entire cabinet
> goes up in smoke, you have lost one drive from each set, and your data
> is still there.  With a pseudo random layout, you have lost that.  (I
> don't know how often entire cabinets of disks die, but I once lost both
> disks of a raid1 mirror when the disk controller card died.)

Well, this is something CRUSH takes care of. As I said earlier, it's a
weighted decision tree. One of the weights could be to evenly spread all
blocks across two cabinets.

Taking this into account would require a non-trivial user interface, and
I'm not sure the benefits of this outweigh the costs (at least for
an initial implementation).
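
Just to make the failure-domain idea concrete (a toy sketch, not CRUSH and
not a proposed interface; the hash and the 24-drives-per-cabinet constant
are arbitrary):

#include <stdio.h>

#define DRIVES_PER_CABINET 24

/* Toy integer hash; a real implementation would use something CRUSH-like. */
static unsigned mix(unsigned x)
{
	x ^= x >> 16;
	x *= 0x7feb352d;
	x ^= x >> 15;
	x *= 0x846ca68b;
	x ^= x >> 16;
	return x;
}

/*
 * Pick a drive for replica 'r' of 'block', forcing each replica into a
 * different cabinet so that losing one cabinet loses at most one copy.
 */
static unsigned place(unsigned block, unsigned r, unsigned cabinets)
{
	unsigned cabinet = (mix(block) + r) % cabinets;
	unsigned slot = mix(block ^ (r * 0x9e3779b9)) % DRIVES_PER_CABINET;

	return cabinet * DRIVES_PER_CABINET + slot;
}

int main(void)
{
	unsigned block, r;

	for (block = 0; block < 4; block++)
		for (r = 0; r < 3; r++)
			printf("block %u replica %u -> drive %u\n",
			       block, r, place(block, r, 4));
	return 0;
}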

Byte,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [LSF/MM TOPIC] De-clustered RAID with MD

2018-01-31 Thread David Brown
On 29/01/18 22:50, NeilBrown wrote:
> On Mon, Jan 29 2018, Wols Lists wrote:
> 
>> On 29/01/18 15:23, Johannes Thumshirn wrote:
>>> Hi linux-raid, lsf-pc
>>>
>>> (If you've received this mail multiple times, I'm sorry, I'm having
>>> trouble with the mail setup).
>>
>> My immediate reactions as a lay person (I edit the raid wiki) ...
>>>
>>> With the rise of bigger and bigger disks, array rebuilding times start
>>> skyrocketing.
>>
>> And? Yes, your data is at risk during a rebuild, but md-raid throttles
>> the i/o, so it doesn't hammer the system.
>>>
>>> In a paper form '92 Holland and Gibson [1] suggest a mapping algorithm
>>> similar to RAID5 but instead of utilizing all disks in an array for
>>> every I/O operation, but implement a per-I/O mapping function to only
>>> use a subset of the available disks.
>>>
>>> This has at least two advantages:
>>> 1) If one disk has to be replaced, it's not needed to read the data from
>>>all disks to recover the one failed disk so non-affected disks can be
>>>used for real user I/O and not just recovery and
>>
>> Again, that's throttling, so that's not a problem ...
> 
> Imagine an array with 100 drives on which we store data in sets of
> (say) 6 data chunks and 2 parity chunks.
> Each group of 8 chunks is distributed over the 100 drives in a
> different way so that (e.g) 600 data chunks and 200 parity chunks are
> distributed over 8 physical stripes using some clever distribution
> function.
> If (when) one drive fails, the 8 chunks in this set of 8 physical
> stripes can be recovered by reading 6*8 == 48 chunks which will each be
> on a different drive.  Half the drives deliver only one chunk (in an ideal
> distribution) and the other half deliver none.  Maybe they will deliver
> some for the next set of 100 logical stripes.
> 
> You would probably say that even doing raid6 on 100 drives is crazy.
> Better to make, e.g. 10 groups of 10 and do raid6 on each of the 10,
> then LVM them together.
> 
> By doing declustered parity you can sanely do raid6 on 100 drives, using
> a logical stripe size that is much smaller than 100.
> When recovering a single drive, the 10-groups-of-10 would put heavy load
> on 9 other drives, while the decluster approach puts light load on 99
> other drives.  No matter how clever md is at throttling recovery, I
> would still rather distribute the load so that md has an easier job.
> 

That sounds smart.  I don't see that you need anything particularly
complicated for how you distribute your data and parity drives across
the 100 disks - you just need a fairly even spread.

I would be more concerned with how you could deal with resizing such an
array.  In particular, I think it is not unlikely that someone with a
100 drive array will one day want to add another bank of 24 disks (or
whatever fits in a cabinet).  Making that work nicely would, I believe,
be more important than making sure the rebuild load distribution is
balanced evenly across 99 drives.

I would also be interested in how the data and parities are distributed
across cabinets and disk controllers.  When you manually build from
smaller raid sets, you can ensure that in each set the data disks and the
parity are all in different cabinets - that way if an entire cabinet
goes up in smoke, you have lost one drive from each set, and your data
is still there.  With a pseudo random layout, you have lost that.  (I
don't know how often entire cabinets of disks die, but I once lost both
disks of a raid1 mirror when the disk controller card died.)




Re: [PATCH 5/5] lightnvm: pblk: refactor bad block identification

2018-01-31 Thread Javier Gonzalez


> On 31 Jan 2018, at 16.51, Matias Bjørling  wrote:
> 
>> On 01/31/2018 03:06 AM, Javier González wrote:
>> In preparation for the OCSSD 2.0 spec. bad block identification,
>> refactor the current code to generalize bad block get/set functions and
>> structures.
>> Signed-off-by: Javier González 
>> ---
>>  drivers/lightnvm/pblk-init.c | 213 
>> +++
>>  drivers/lightnvm/pblk.h  |   6 --
>>  2 files changed, 112 insertions(+), 107 deletions(-)
>> diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
>> index a5e3510c0f38..86a94a7faa96 100644
>> --- a/drivers/lightnvm/pblk-init.c
>> +++ b/drivers/lightnvm/pblk-init.c
>> @@ -365,7 +365,25 @@ static void pblk_luns_free(struct pblk *pblk)
>>  kfree(pblk->luns);
>>  }
>>  -static void pblk_free_line_bitmaps(struct pblk_line *line)
>> +static void pblk_line_mg_free(struct pblk *pblk)
>> +{
>> +struct pblk_line_mgmt *l_mg = >l_mg;
>> +int i;
>> +
>> +kfree(l_mg->bb_template);
>> +kfree(l_mg->bb_aux);
>> +kfree(l_mg->vsc_list);
>> +
>> +for (i = 0; i < PBLK_DATA_LINES; i++) {
>> +kfree(l_mg->sline_meta[i]);
>> +pblk_mfree(l_mg->eline_meta[i]->buf, l_mg->emeta_alloc_type);
>> +kfree(l_mg->eline_meta[i]);
>> +}
>> +
>> +kfree(pblk->lines);
>> +}
>> +
>> +static void pblk_line_meta_free(struct pblk_line *line)
>>  {
>>  kfree(line->blk_bitmap);
>>  kfree(line->erase_bitmap);
>> @@ -382,40 +400,16 @@ static void pblk_lines_free(struct pblk *pblk)
>>  line = >lines[i];
>>pblk_line_free(pblk, line);
>> -pblk_free_line_bitmaps(line);
>> +pblk_line_meta_free(line);
>>  }
>>  spin_unlock(_mg->free_lock);
>>  }
>>  -static void pblk_line_meta_free(struct pblk *pblk)
>> +static int pblk_bb_get_tbl(struct nvm_tgt_dev *dev, struct pblk_lun *rlun,
>> +   u8 *blks, int nr_blks)
>>  {
>> -struct pblk_line_mgmt *l_mg = >l_mg;
>> -int i;
>> -
>> -kfree(l_mg->bb_template);
>> -kfree(l_mg->bb_aux);
>> -kfree(l_mg->vsc_list);
>> -
>> -for (i = 0; i < PBLK_DATA_LINES; i++) {
>> -kfree(l_mg->sline_meta[i]);
>> -pblk_mfree(l_mg->eline_meta[i]->buf, l_mg->emeta_alloc_type);
>> -kfree(l_mg->eline_meta[i]);
>> -}
>> -
>> -kfree(pblk->lines);
>> -}
>> -
>> -static int pblk_bb_discovery(struct nvm_tgt_dev *dev, struct pblk_lun *rlun)
>> -{
>> -struct nvm_geo *geo = >geo;
>>  struct ppa_addr ppa;
>> -u8 *blks;
>> -int nr_blks, ret;
>> -
>> -nr_blks = geo->nr_chks * geo->plane_mode;
>> -blks = kmalloc(nr_blks, GFP_KERNEL);
>> -if (!blks)
>> -return -ENOMEM;
>> +int ret;
>>ppa.ppa = 0;
>>  ppa.g.ch = rlun->bppa.g.ch;
>> @@ -423,34 +417,56 @@ static int pblk_bb_discovery(struct nvm_tgt_dev *dev, 
>> struct pblk_lun *rlun)
>>ret = nvm_get_tgt_bb_tbl(dev, ppa, blks);
>>  if (ret)
>> -goto out;
>> +return ret;
>>nr_blks = nvm_bb_tbl_fold(dev->parent, blks, nr_blks);
>> -if (nr_blks < 0) {
>> -ret = nr_blks;
>> -goto out;
>> -}
>> -
>> -rlun->bb_list = blks;
>> +if (nr_blks < 0)
>> +return -EIO;
>>return 0;
>> -out:
>> -kfree(blks);
>> -return ret;
>> +}
>> +
>> +static void *pblk_bb_get_log(struct pblk *pblk)
>> +{
>> +struct nvm_tgt_dev *dev = pblk->dev;
>> +struct nvm_geo *geo = >geo;
>> +u8 *log;
>> +int i, nr_blks, blk_per_lun;
>> +int ret;
>> +
>> +blk_per_lun = geo->nr_chks * geo->plane_mode;
>> +nr_blks = blk_per_lun * geo->all_luns;
>> +
>> +log = kmalloc(nr_blks, GFP_KERNEL);
>> +if (!log)
>> +return ERR_PTR(-ENOMEM);
>> +
>> +for (i = 0; i < geo->all_luns; i++) {
>> +struct pblk_lun *rlun = >luns[i];
>> +u8 *log_pos = log + i * blk_per_lun;
>> +
>> +ret = pblk_bb_get_tbl(dev, rlun, log_pos, blk_per_lun);
>> +if (ret) {
>> +kfree(log);
>> +return ERR_PTR(-EIO);
>> +}
>> +}
>> +
>> +return log;
>>  }
>>static int pblk_bb_line(struct pblk *pblk, struct pblk_line *line,
>> -int blk_per_line)
>> +u8 *bb_log, int blk_per_line)
>>  {
>>  struct nvm_tgt_dev *dev = pblk->dev;
>>  struct nvm_geo *geo = >geo;
>> -struct pblk_lun *rlun;
>> -int bb_cnt = 0;
>> -int i;
>> +int i, bb_cnt = 0;
>>for (i = 0; i < blk_per_line; i++) {
>> -rlun = >luns[i];
>> -if (rlun->bb_list[line->id] == NVM_BLK_T_FREE)
>> +struct pblk_lun *rlun = >luns[i];
>> +u8 *lun_bb_log = bb_log + i * blk_per_line;
>> +
>> +if (lun_bb_log[line->id] == NVM_BLK_T_FREE)
>>  continue;
>>set_bit(pblk_ppa_to_pos(geo, rlun->bppa), line->blk_bitmap);
>> @@ -460,29 +476,12 @@ static int pblk_bb_line(struct pblk *pblk, struct 
>> pblk_line *line,
>>  return bb_cnt;
>>  }
>>  

Re: [PATCH 5/5] lightnvm: pblk: refactor bad block identification

2018-01-31 Thread Matias Bjørling

On 01/31/2018 03:06 AM, Javier González wrote:

In preparation for the OCSSD 2.0 spec. bad block identification,
refactor the current code to generalize bad block get/set functions and
structures.

Signed-off-by: Javier González 
---
  drivers/lightnvm/pblk-init.c | 213 +++
  drivers/lightnvm/pblk.h  |   6 --
  2 files changed, 112 insertions(+), 107 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index a5e3510c0f38..86a94a7faa96 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -365,7 +365,25 @@ static void pblk_luns_free(struct pblk *pblk)
kfree(pblk->luns);
  }
  
-static void pblk_free_line_bitmaps(struct pblk_line *line)

+static void pblk_line_mg_free(struct pblk *pblk)
+{
+   struct pblk_line_mgmt *l_mg = >l_mg;
+   int i;
+
+   kfree(l_mg->bb_template);
+   kfree(l_mg->bb_aux);
+   kfree(l_mg->vsc_list);
+
+   for (i = 0; i < PBLK_DATA_LINES; i++) {
+   kfree(l_mg->sline_meta[i]);
+   pblk_mfree(l_mg->eline_meta[i]->buf, l_mg->emeta_alloc_type);
+   kfree(l_mg->eline_meta[i]);
+   }
+
+   kfree(pblk->lines);
+}
+
+static void pblk_line_meta_free(struct pblk_line *line)
  {
kfree(line->blk_bitmap);
kfree(line->erase_bitmap);
@@ -382,40 +400,16 @@ static void pblk_lines_free(struct pblk *pblk)
line = >lines[i];
  
  		pblk_line_free(pblk, line);

-   pblk_free_line_bitmaps(line);
+   pblk_line_meta_free(line);
}
spin_unlock(_mg->free_lock);
  }
  
-static void pblk_line_meta_free(struct pblk *pblk)

+static int pblk_bb_get_tbl(struct nvm_tgt_dev *dev, struct pblk_lun *rlun,
+  u8 *blks, int nr_blks)
  {
-   struct pblk_line_mgmt *l_mg = >l_mg;
-   int i;
-
-   kfree(l_mg->bb_template);
-   kfree(l_mg->bb_aux);
-   kfree(l_mg->vsc_list);
-
-   for (i = 0; i < PBLK_DATA_LINES; i++) {
-   kfree(l_mg->sline_meta[i]);
-   pblk_mfree(l_mg->eline_meta[i]->buf, l_mg->emeta_alloc_type);
-   kfree(l_mg->eline_meta[i]);
-   }
-
-   kfree(pblk->lines);
-}
-
-static int pblk_bb_discovery(struct nvm_tgt_dev *dev, struct pblk_lun *rlun)
-{
-   struct nvm_geo *geo = >geo;
struct ppa_addr ppa;
-   u8 *blks;
-   int nr_blks, ret;
-
-   nr_blks = geo->nr_chks * geo->plane_mode;
-   blks = kmalloc(nr_blks, GFP_KERNEL);
-   if (!blks)
-   return -ENOMEM;
+   int ret;
  
  	ppa.ppa = 0;

ppa.g.ch = rlun->bppa.g.ch;
@@ -423,34 +417,56 @@ static int pblk_bb_discovery(struct nvm_tgt_dev *dev, 
struct pblk_lun *rlun)
  
  	ret = nvm_get_tgt_bb_tbl(dev, ppa, blks);

if (ret)
-   goto out;
+   return ret;
  
  	nr_blks = nvm_bb_tbl_fold(dev->parent, blks, nr_blks);

-   if (nr_blks < 0) {
-   ret = nr_blks;
-   goto out;
-   }
-
-   rlun->bb_list = blks;
+   if (nr_blks < 0)
+   return -EIO;
  
  	return 0;

-out:
-   kfree(blks);
-   return ret;
+}
+
+static void *pblk_bb_get_log(struct pblk *pblk)
+{
+   struct nvm_tgt_dev *dev = pblk->dev;
+   struct nvm_geo *geo = >geo;
+   u8 *log;
+   int i, nr_blks, blk_per_lun;
+   int ret;
+
+   blk_per_lun = geo->nr_chks * geo->plane_mode;
+   nr_blks = blk_per_lun * geo->all_luns;
+
+   log = kmalloc(nr_blks, GFP_KERNEL);
+   if (!log)
+   return ERR_PTR(-ENOMEM);
+
+   for (i = 0; i < geo->all_luns; i++) {
+   struct pblk_lun *rlun = >luns[i];
+   u8 *log_pos = log + i * blk_per_lun;
+
+   ret = pblk_bb_get_tbl(dev, rlun, log_pos, blk_per_lun);
+   if (ret) {
+   kfree(log);
+   return ERR_PTR(-EIO);
+   }
+   }
+
+   return log;
  }
  
  static int pblk_bb_line(struct pblk *pblk, struct pblk_line *line,

-   int blk_per_line)
+   u8 *bb_log, int blk_per_line)
  {
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = >geo;
-   struct pblk_lun *rlun;
-   int bb_cnt = 0;
-   int i;
+   int i, bb_cnt = 0;
  
  	for (i = 0; i < blk_per_line; i++) {

-   rlun = >luns[i];
-   if (rlun->bb_list[line->id] == NVM_BLK_T_FREE)
+   struct pblk_lun *rlun = >luns[i];
+   u8 *lun_bb_log = bb_log + i * blk_per_line;
+
+   if (lun_bb_log[line->id] == NVM_BLK_T_FREE)
continue;
  
  		set_bit(pblk_ppa_to_pos(geo, rlun->bppa), line->blk_bitmap);

@@ -460,29 +476,12 @@ static int pblk_bb_line(struct pblk *pblk, struct 
pblk_line *line,
return bb_cnt;
  }
  
-static int pblk_alloc_line_bitmaps(struct pblk *pblk, struct pblk_line *line)

-{
-   struct 

Re: [PATCH 4/5] lightnvm: pblk: add padding distribution sysfs attribute

2018-01-31 Thread Matias Bjørling

On 01/31/2018 03:06 AM, Javier González wrote:

From: Hans Holmberg 

When pblk receives a sync, all data up to that point in the write buffer
must be committed to persistent storage, and as flash memory comes with a
minimal write size there is a significant cost involved, both in terms
of time for completing the sync and in terms of write amplification
(padded sectors filling up to the minimal write size).

In order to get a better understanding of the costs involved for syncs,
add a sysfs attribute to pblk: padded_dist, showing a normalized
distribution of padded sectors. In order to facilitate measurements of
specific workloads during the lifetime of the pblk instance, the
distribution can be reset by writing 0 to the attribute.

Do this by introducing counters for each possible padding:
{0..(minimal write size - 1)} and calculating the normalized distribution
when showing the attribute.

Signed-off-by: Hans Holmberg 
Signed-off-by: Javier González 
---
  drivers/lightnvm/pblk-init.c  | 16 --
  drivers/lightnvm/pblk-rb.c| 15 +-
  drivers/lightnvm/pblk-sysfs.c | 68 +++
  drivers/lightnvm/pblk.h   |  6 +++-
  4 files changed, 95 insertions(+), 10 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 7eedc5daebc8..a5e3510c0f38 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -921,6 +921,7 @@ static void pblk_free(struct pblk *pblk)
  {
pblk_luns_free(pblk);
pblk_lines_free(pblk);
+   kfree(pblk->pad_dist);
pblk_line_meta_free(pblk);
pblk_core_free(pblk);
pblk_l2p_free(pblk);
@@ -998,11 +999,13 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct 
gendisk *tdisk,
pblk->pad_rst_wa = 0;
pblk->gc_rst_wa = 0;
  
+	atomic_long_set(>nr_flush, 0);

+   pblk->nr_flush_rst = 0;
+
  #ifdef CONFIG_NVM_DEBUG
atomic_long_set(>inflight_writes, 0);
atomic_long_set(>padded_writes, 0);
atomic_long_set(>padded_wb, 0);
-   atomic_long_set(>nr_flush, 0);
atomic_long_set(>req_writes, 0);
atomic_long_set(>sub_writes, 0);
atomic_long_set(>sync_writes, 0);
@@ -1034,10 +1037,17 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct 
gendisk *tdisk,
goto fail_free_luns;
}
  
+	pblk->pad_dist = kzalloc((pblk->min_write_pgs - 1) * sizeof(atomic64_t),

+GFP_KERNEL);
+   if (!pblk->pad_dist) {
+   ret = -ENOMEM;
+   goto fail_free_line_meta;
+   }
+
ret = pblk_core_init(pblk);
if (ret) {
pr_err("pblk: could not initialize core\n");
-   goto fail_free_line_meta;
+   goto fail_free_pad_dist;
}
  
  	ret = pblk_l2p_init(pblk);

@@ -1097,6 +1107,8 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct 
gendisk *tdisk,
pblk_l2p_free(pblk);
  fail_free_core:
pblk_core_free(pblk);
+fail_free_pad_dist:
+   kfree(pblk->pad_dist);
  fail_free_line_meta:
pblk_line_meta_free(pblk);
  fail_free_luns:
diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index 7044b5599cc4..264372d7b537 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -437,9 +437,7 @@ static int pblk_rb_may_write_flush(struct pblk_rb *rb, 
unsigned int nr_entries,
if (bio->bi_opf & REQ_PREFLUSH) {
struct pblk *pblk = container_of(rb, struct pblk, rwb);
  
-#ifdef CONFIG_NVM_DEBUG

atomic_long_inc(>nr_flush);
-#endif
if (pblk_rb_flush_point_set(>rwb, bio, mem))
*io_ret = NVM_IO_OK;
}
@@ -620,14 +618,17 @@ unsigned int pblk_rb_read_to_bio(struct pblk_rb *rb, 
struct nvm_rq *rqd,
pr_err("pblk: could not pad page in write bio\n");
return NVM_IO_ERR;
}
+
+   if (pad < pblk->min_write_pgs)
+   atomic64_inc(>pad_dist[pad - 1]);
+   else
+   pr_warn("pblk: padding more than min. sectors\n");


Curious, when would this happen? Would it be an implementation error or 
something a user did wrong?



+
+   atomic64_add(pad, >pad_wa);
}
  
-	atomic64_add(pad, &((struct pblk *)

-   (container_of(rb, struct pblk, rwb)))->pad_wa);
-
  #ifdef CONFIG_NVM_DEBUG
-   atomic_long_add(pad, &((struct pblk *)
-   (container_of(rb, struct pblk, rwb)))->padded_writes);
+   atomic_long_add(pad, >padded_writes);
  #endif
  
  	return NVM_IO_OK;

diff --git a/drivers/lightnvm/pblk-sysfs.c b/drivers/lightnvm/pblk-sysfs.c
index 4804bbd32d5f..f902ac00d071 100644
--- a/drivers/lightnvm/pblk-sysfs.c
+++ b/drivers/lightnvm/pblk-sysfs.c
@@ -341,6 +341,38 @@ static ssize_t 

Re: [PATCH 1/2] lightnvm: remove mlc pairs structure

2018-01-31 Thread Matias Bjørling

On 01/31/2018 03:00 AM, Javier González wrote:

On 30 Jan 2018, at 21.26, Matias Bjørling  wrote:

The known implementations of the 1.2 specification, and upcoming 2.0
implementation all expose a sequential list of pages to write.
Remove the data structure, as it is no longer needed.

Signed-off-by: Matias Bjørling 
---
drivers/nvme/host/lightnvm.c | 14 +-
1 file changed, 1 insertion(+), 13 deletions(-)


Even though the current implementations do not use the MLC pairing
information, it is part of the 1.2 identification. Until we
eventually remove 1.2 support (if we do), the identify structure should
reflect the specification as is.

Javier



I already started on removing the MLC bits in previous patches. These
two patches continue that trend. I don't know of anyone implementing either
multiple groups or MLC, and since it has not been implemented before, no
one is going to use it. There is no reason to have dead code hanging around.
This is similar to the NVMe device driver not having definitions for
everything in the specification until it is used in the implementation.




[LSF/MM TOPIC] KPTI effect on IO performance

2018-01-31 Thread Ming Lei
Hi All,

After KPTI is merged, there is extra load introduced to context switch
between user space and kernel space. It is observed on my laptop that one
syscall takes extra ~0.15us[1] compared with 'nopti'.

IO performance is affected too, it is observed that IOPS drops by 32% in
my test[2] on null_blk compared with 'nopti':

randread IOPS on latest linus tree:
---------------------------------------------
| randread IOPS | randread IOPS with 'nopti' |
| 928K          | 1372K                      |



Two paths are affected: one is IO submission (the read, write, ... syscalls),
the other is the IO completion path, in which the interrupt may arrive while
running in user space, so a context switch is needed.

So is there something we can do to decrease the effect on IO performance?

This effect may make Hannes's issue[3] worse, and maybe 'irq poll' should be
used more widely for all high performance IO devices; perhaps some optimization
should also be considered for KPTI's effect.


[1] http://people.redhat.com/minlei/tests/tools/syscall_speed.c
[2] http://people.redhat.com/minlei/tests/tools/null_perf
[3] [LSF/MM TOPIC] irq affinity handling for high CPU count machines
    https://marc.info/?t=15172215682&r=1&w=2

Thanks,
Ming


Re: [LSF/MM TOPIC] De-clustered RAID with MD

2018-01-31 Thread David Brown
On 30/01/18 10:40, Johannes Thumshirn wrote:
> Wols Lists  writes:
> 
>> On 29/01/18 15:23, Johannes Thumshirn wrote:
>>> Hi linux-raid, lsf-pc
>>>
>>> (If you've received this mail multiple times, I'm sorry, I'm having
>>> trouble with the mail setup).
>>
>> My immediate reactions as a lay person (I edit the raid wiki) ...
>>>
>>> With the rise of bigger and bigger disks, array rebuilding times start
>>> skyrocketing.
>>
>> And? Yes, your data is at risk during a rebuild, but md-raid throttles
>> the i/o, so it doesn't hammer the system.
>>>
>>> In a paper form '92 Holland and Gibson [1] suggest a mapping algorithm
>>> similar to RAID5 but instead of utilizing all disks in an array for
>>> every I/O operation, but implement a per-I/O mapping function to only
>>> use a subset of the available disks.
>>>
>>> This has at least two advantages:
>>> 1) If one disk has to be replaced, it's not needed to read the data from
>>>all disks to recover the one failed disk so non-affected disks can be
>>>used for real user I/O and not just recovery and
>>
>> Again, that's throttling, so that's not a problem ...
> 
> And throttling in a production environment is not exactly
> desired. Imagine a 500 disk array (and yes this is something we've seen
> with MD) and you have to replace disks. While the array is rebuilt you
> have to throttle all I/O because with raid-{1,5,6,10} all data is
> striped across all disks.

You definitely don't want a stripe across 500 disks!  I'd be inclined to
have raid1 pairs as the basic block, or perhaps 6-8 drive raid6 if you
want higher space efficiency.  Then you build your full array on top of
that, along with a file system that can take advantage of the layout.
If you have an XFS over a linear concat of these sets, then you have a
system that can quickly serve many parallel loads - but that could give
poor distribution if you are storing massive streaming data.  And
rebuilds only delay data from the one block that is involved in the rebuild.

(I have no experience with anything bigger than about 6 disks - this is
just theory on my part.)

> 
> With a parity declustered RAID (or DDP like Dell, NetApp or Huawei call
> it) you don't have to as the I/O is replicated in parity groups across a
> subset of disks. All I/O targeting disks which aren't needed to recover
> the data from the failed disks aren't affected by the throttling at all.
> 
>>> 2) an efficient mapping function can improve parallel I/O submission, as
>>>two different I/Os are not necessarily going to the same disks in the
>>>array. 
>>>
>>> For the mapping function used a hashing algorithm like Ceph's CRUSH [2]
>>> would be ideal, as it provides a pseudo random but deterministic mapping
>>> for the I/O onto the drives.
>>>
>>> This whole declustering of cause only makes sense for more than (at
>>> least) 4 drives but we do have customers with several orders of
>>> magnitude more drivers in an MD array.
>>
>> If you have four drives or more - especially if they are multi-terabyte
>> drives - you should NOT be using raid-5 ...
> 
> raid-6 won't help you much in above scenario.
> 

Raid-6 is still a great deal better than raid-5 :-)

And for your declustered raid or distributed parity, you can have two
parities rather than just one.