Re: bfq: BUG bfq_queue: Objects remaining in bfq_queue on __kmem_cache_shutdown() after rmmod

2017-12-20 Thread Paolo Valente


> On 21 Dec 2017, at 08:08, Guoqing Jiang wrote:
> 
> Hi,
> 
> 
> On 12/08/2017 08:34 AM, Holger Hoffstätte wrote:
>> So plugging in a device on USB with BFQ as scheduler now works without
>> hiccup (probably thanks to Ming Lei's last patch), but of course I found
>> another problem. Unmounting the device after use, changing the scheduler
>> back to deadline or kyber and rmmod'ing the BFQ module reproducibly gives me:
>> 
>> kernel: 
>> =
>> kernel: BUG bfq_queue (Tainted: GB  ): Objects remaining in 
>> bfq_queue on __kmem_cache_shutdown()
>> kernel: 
>> -
>> kernel:
>> kernel: INFO: Slab 0xea001601fc00 objects=37 used=3 
>> fp=0x8805807f0360 flags=0x80008100
>> kernel: CPU: 0 PID: 9967 Comm: rmmod Tainted: GB   4.14.5 #1
>> kernel: Hardware name: Gigabyte Technology Co., Ltd. P67-DS3-B3/P67-DS3-B3, 
>> BIOS F1 05/06/2011
>> kernel: Call Trace:
>> kernel:  dump_stack+0x46/0x5e
>> kernel:  slab_err+0x9e/0xb0
>> kernel:  ? on_each_cpu_mask+0x35/0x40
>> kernel:  ? ksm_migrate_page+0x60/0x60
>> kernel:  ? __kmalloc+0x1c9/0x1d0
>> kernel:  __kmem_cache_shutdown+0x177/0x350
>> kernel:  shutdown_cache+0xf/0x130
>> kernel:  kmem_cache_destroy+0x19e/0x1b0
>> kernel:  SyS_delete_module+0x168/0x230
>> kernel:  ? exit_to_usermode_loop+0x39/0x80
>> kernel:  entry_SYSCALL_64_fastpath+0x13/0x94
>> kernel: RIP: 0033:0x7f53e4136b97
>> kernel: RSP: 002b:7ffd660061d8 EFLAGS: 0206 ORIG_RAX: 
>> 00b0
>> kernel: RAX: ffda RBX: 0003 RCX: 7f53e4136b97
>> kernel: RDX: 000a RSI: 0800 RDI: 006247f8
>> kernel: RBP:  R08: 7ffd66005171 R09: 
>> kernel: R10: 7f53e41acbc0 R11: 0206 R12: 00624790
>> kernel: R13: 7ffd660051d0 R14: 00624790 R15: 00623260
>> kernel: INFO: Object 0x8805807f @offset=0
>> kernel: INFO: Object 0x8805807f01b0 @offset=432
>> kernel: INFO: Object 0x8805807f3cc0 @offset=15552
>> kernel: kmem_cache_destroy bfq_queue: Slab cache still has objects
>> kernel: CPU: 0 PID: 9967 Comm: rmmod Tainted: GB   4.14.5 #1
>> kernel: Hardware name: Gigabyte Technology Co., Ltd. P67-DS3-B3/P67-DS3-B3, 
>> BIOS F1 05/06/2011
>> kernel: Call Trace:
>> kernel:  dump_stack+0x46/0x5e
>> kernel:  kmem_cache_destroy+0x191/0x1b0
>> kernel:  SyS_delete_module+0x168/0x230
>> kernel:  ? exit_to_usermode_loop+0x39/0x80
>> kernel:  entry_SYSCALL_64_fastpath+0x13/0x94
>> kernel: RIP: 0033:0x7f53e4136b97
>> kernel: RSP: 002b:7ffd660061d8 EFLAGS: 0206 ORIG_RAX: 
>> 00b0
>> kernel: RAX: ffda RBX: 0003 RCX: 7f53e4136b97
>> kernel: RDX: 000a RSI: 0800 RDI: 006247f8
>> kernel: RBP:  R08: 7ffd66005171 R09: 
>> kernel: R10: 7f53e41acbc0 R11: 0206 R12: 00624790
>> kernel: R13: 7ffd660051d0 R14: 00624790 R15: 00623260
>> 
> 
> I also encountered a similar bug; it seems there are some async objects which 
> are not freed when BFQ_GROUP_IOSCHED is built as "Y".
> 
> If I revert part of commit e21b7a0b988772e82e7147e1c659a5afe2ae003c ("block, 
> bfq: add full hierarchical scheduling and cgroups support") like below, then I 
> can't see the bug again with a simple test.
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 889a8549d97f..c640e64e042f 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -4710,6 +4710,7 @@ static void bfq_exit_queue(struct elevator_queue *e)
> spin_lock_irq(&bfqd->lock);
> list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)
> bfq_deactivate_bfqq(bfqd, bfqq, false, false);
> +   bfq_put_async_queues(bfqd, bfqd->root_group);
> spin_unlock_irq(&bfqd->lock);
> 
> hrtimer_cancel(&bfqd->idle_slice_timer);
> @@ -4718,7 +4719,6 @@ static void bfq_exit_queue(struct elevator_queue *e)
> blkcg_deactivate_policy(bfqd->queue, &blkcg_policy_bfq);
>  #else
> spin_lock_irq(&bfqd->lock);
> -   bfq_put_async_queues(bfqd, bfqd->root_group);
> kfree(bfqd->root_group);
> spin_unlock_irq(&bfqd->lock);
> 
> But perhaps we should do it inside blkcg_deactivate_policy, right?
> 

Thank you very much for investigating this. And yes, I removed that 
bfq_put_async_queues invocation a long time ago, because it had to be done in 
blkcg_deactivate_policy; and it was actually invoked as expected. We are trying 
to understand what is going wrong now.
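
For reference, this is roughly the path meant above (a minimal sketch, assuming
the 4.14-era bfq-cgroup code, not the actual implementation): with
BFQ_GROUP_IOSCHED=y, blkcg_deactivate_policy() offlines each group's policy
data, and the per-group async queues are expected to be dropped there:

/*
 * Minimal sketch only: async bfq_queues are expected to be released per
 * group when the blkcg policy data is offlined, not in bfq_exit_queue().
 */
static void bfq_pd_offline(struct blkg_policy_data *pd)
{
	struct bfq_group *bfqg = pd_to_bfqg(pd);
	struct bfq_data *bfqd = bfqg->bfqd;

	if (!bfqd)
		return;

	/* drop the async queues owned by this group */
	bfq_put_async_queues(bfqd, bfqg);
}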

Thanks,
Paolo

> Thanks,
> Guoqing



Re: bfq: BUG bfq_queue: Objects remaining in bfq_queue on __kmem_cache_shutdown() after rmmod

2017-12-20 Thread Guoqing Jiang

Hi,


On 12/08/2017 08:34 AM, Holger Hoffstätte wrote:

So plugging in a device on USB with BFQ as scheduler now works without
hiccup (probably thanks to Ming Lei's last patch), but of course I found
another problem. Unmounting the device after use, changing the scheduler
back to deadline or kyber and rmmod'ing the BFQ module reproducibly gives me:

kernel: 
=
kernel: BUG bfq_queue (Tainted: GB  ): Objects remaining in 
bfq_queue on __kmem_cache_shutdown()
kernel: 
-
kernel:
kernel: INFO: Slab 0xea001601fc00 objects=37 used=3 fp=0x8805807f0360 
flags=0x80008100
kernel: CPU: 0 PID: 9967 Comm: rmmod Tainted: GB   4.14.5 #1
kernel: Hardware name: Gigabyte Technology Co., Ltd. P67-DS3-B3/P67-DS3-B3, 
BIOS F1 05/06/2011
kernel: Call Trace:
kernel:  dump_stack+0x46/0x5e
kernel:  slab_err+0x9e/0xb0
kernel:  ? on_each_cpu_mask+0x35/0x40
kernel:  ? ksm_migrate_page+0x60/0x60
kernel:  ? __kmalloc+0x1c9/0x1d0
kernel:  __kmem_cache_shutdown+0x177/0x350
kernel:  shutdown_cache+0xf/0x130
kernel:  kmem_cache_destroy+0x19e/0x1b0
kernel:  SyS_delete_module+0x168/0x230
kernel:  ? exit_to_usermode_loop+0x39/0x80
kernel:  entry_SYSCALL_64_fastpath+0x13/0x94
kernel: RIP: 0033:0x7f53e4136b97
kernel: RSP: 002b:7ffd660061d8 EFLAGS: 0206 ORIG_RAX: 00b0
kernel: RAX: ffda RBX: 0003 RCX: 7f53e4136b97
kernel: RDX: 000a RSI: 0800 RDI: 006247f8
kernel: RBP:  R08: 7ffd66005171 R09: 
kernel: R10: 7f53e41acbc0 R11: 0206 R12: 00624790
kernel: R13: 7ffd660051d0 R14: 00624790 R15: 00623260
kernel: INFO: Object 0x8805807f @offset=0
kernel: INFO: Object 0x8805807f01b0 @offset=432
kernel: INFO: Object 0x8805807f3cc0 @offset=15552
kernel: kmem_cache_destroy bfq_queue: Slab cache still has objects
kernel: CPU: 0 PID: 9967 Comm: rmmod Tainted: GB   4.14.5 #1
kernel: Hardware name: Gigabyte Technology Co., Ltd. P67-DS3-B3/P67-DS3-B3, 
BIOS F1 05/06/2011
kernel: Call Trace:
kernel:  dump_stack+0x46/0x5e
kernel:  kmem_cache_destroy+0x191/0x1b0
kernel:  SyS_delete_module+0x168/0x230
kernel:  ? exit_to_usermode_loop+0x39/0x80
kernel:  entry_SYSCALL_64_fastpath+0x13/0x94
kernel: RIP: 0033:0x7f53e4136b97
kernel: RSP: 002b:7ffd660061d8 EFLAGS: 0206 ORIG_RAX: 00b0
kernel: RAX: ffda RBX: 0003 RCX: 7f53e4136b97
kernel: RDX: 000a RSI: 0800 RDI: 006247f8
kernel: RBP:  R08: 7ffd66005171 R09: 
kernel: R10: 7f53e41acbc0 R11: 0206 R12: 00624790
kernel: R13: 7ffd660051d0 R14: 00624790 R15: 00623260



I also encountered a similar bug; it seems there are some async objects 
which are not freed when BFQ_GROUP_IOSCHED is built as "Y".

If I revert part of commit e21b7a0b988772e82e7147e1c659a5afe2ae003c 
("block, bfq: add full hierarchical scheduling and cgroups support") like 
below, then I can't see the bug again with a simple test.


diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 889a8549d97f..c640e64e042f 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4710,6 +4710,7 @@ static void bfq_exit_queue(struct elevator_queue *e)
    spin_lock_irq(&bfqd->lock);
    list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)
    bfq_deactivate_bfqq(bfqd, bfqq, false, false);
+   bfq_put_async_queues(bfqd, bfqd->root_group);
    spin_unlock_irq(&bfqd->lock);

    hrtimer_cancel(&bfqd->idle_slice_timer);
@@ -4718,7 +4719,6 @@ static void bfq_exit_queue(struct elevator_queue *e)
    blkcg_deactivate_policy(bfqd->queue, &blkcg_policy_bfq);
 #else
    spin_lock_irq(&bfqd->lock);
-   bfq_put_async_queues(bfqd, bfqd->root_group);
    kfree(bfqd->root_group);
    spin_unlock_irq(&bfqd->lock);

But perhaps we should do it inside blkcg_deactivate_policy, right?

Thanks,
Guoqing


[PATCH V9 3/7] mq-deadline: Introduce zone locking support

2017-12-20 Thread Damien Le Moal
Introduce zone write locking to avoid write request reordering with
zoned block devices. This is achieved using a finer selection of the
next request to dispatch:
1) Any non-write request is always allowed to proceed.
2) Any write to a conventional zone is always allowed to proceed.
3) For a write to a sequential zone, the zone lock is first checked.
   a) If the zone is not locked, the write is allowed to proceed after
  its target zone is locked.
   b) If the zone is locked, the write request is skipped and the next
  request in the dispatch queue is tested (back to step 1).

For a write request that has locked its target zone, the zone is
unlocked either when the request completes with a call to the method
deadline_request_completed() or when the request is requeued using
dd_insert_request().

Requests targeting a locked zone are always left in the scheduler queue
to preserve the LBA ordering for write requests. If no write request
can be dispatched, allow reads to be dispatched even if the write batch
is not done.

If the device used is not a zoned block device, or if zoned block device
support is disabled, this patch does not modify mq-deadline behavior.
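
The three rules above collapse into a single per-request predicate. A hedged
sketch of the shape of the helper used below (simplified, based on the helpers
introduced in patch 1 of this series; not a verbatim copy):

/*
 * Sketch only: a request may be dispatched unless it is a write that
 * needs the zone write lock and that lock is already held.
 */
static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
{
	/* rules 1 and 2: non-writes and writes to conventional zones */
	if (!blk_req_needs_zone_write_lock(rq))
		return true;

	/* rule 3: a write to a sequential zone, only if its zone is unlocked */
	return !test_bit(blk_rq_zone_no(rq), rq->q->seq_zones_wlock);
}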

Signed-off-by: Damien Le Moal 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
---
 block/mq-deadline.c | 89 +++--
 1 file changed, 86 insertions(+), 3 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 8bd6db9e69c7..d56972e8ebda 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -59,6 +59,7 @@ struct deadline_data {
int front_merges;
 
spinlock_t lock;
+   spinlock_t zone_lock;
struct list_head dispatch;
 };
 
@@ -198,13 +199,33 @@ static inline int deadline_check_fifo(struct 
deadline_data *dd, int ddir)
 static struct request *
 deadline_fifo_request(struct deadline_data *dd, int data_dir)
 {
+   struct request *rq;
+   unsigned long flags;
+
if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE))
return NULL;
 
if (list_empty(&dd->fifo_list[data_dir]))
return NULL;
 
-   return rq_entry_fifo(dd->fifo_list[data_dir].next);
+   rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+   if (data_dir == READ || !blk_queue_is_zoned(rq->q))
+   return rq;
+
+   /*
+* Look for a write request that can be dispatched, that is one with
+* an unlocked target zone.
+*/
+   spin_lock_irqsave(&dd->zone_lock, flags);
+   list_for_each_entry(rq, &dd->fifo_list[WRITE], queuelist) {
+   if (blk_req_can_dispatch_to_zone(rq))
+   goto out;
+   }
+   rq = NULL;
+out:
+   spin_unlock_irqrestore(&dd->zone_lock, flags);
+
+   return rq;
 }
 
 /*
@@ -214,10 +235,32 @@ deadline_fifo_request(struct deadline_data *dd, int 
data_dir)
 static struct request *
 deadline_next_request(struct deadline_data *dd, int data_dir)
 {
+   struct request *rq;
+   unsigned long flags;
+
if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE))
return NULL;
 
-   return dd->next_rq[data_dir];
+   rq = dd->next_rq[data_dir];
+   if (!rq)
+   return NULL;
+
+   if (data_dir == READ || !blk_queue_is_zoned(rq->q))
+   return rq;
+
+   /*
+* Look for a write request that can be dispatched, that is one with
+* an unlocked target zone.
+*/
+   spin_lock_irqsave(&dd->zone_lock, flags);
+   while (rq) {
+   if (blk_req_can_dispatch_to_zone(rq))
+   break;
+   rq = deadline_latter_request(rq);
+   }
+   spin_unlock_irqrestore(&dd->zone_lock, flags);
+
+   return rq;
 }
 
 /*
@@ -259,7 +302,8 @@ static struct request *__dd_dispatch_request(struct 
blk_mq_hw_ctx *hctx)
if (reads) {
BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
 
-   if (writes && (dd->starved++ >= dd->writes_starved))
+   if (deadline_fifo_request(dd, WRITE) &&
+   (dd->starved++ >= dd->writes_starved))
goto dispatch_writes;
 
data_dir = READ;
@@ -304,6 +348,13 @@ static struct request *__dd_dispatch_request(struct 
blk_mq_hw_ctx *hctx)
rq = next_rq;
}
 
+   /*
+* For a zoned block device, if we only have writes queued and none of
+* them can be dispatched, rq will be NULL.
+*/
+   if (!rq)
+   return NULL;
+
dd->batching = 0;
 
 dispatch_request:
@@ -313,6 +364,10 @@ static struct request *__dd_dispatch_request(struct 
blk_mq_hw_ctx *hctx)
dd->batching++;
deadline_move_request(dd, rq);
 done:
+   /*
+* If the request needs its target zone locked, do it.
+*/
+   blk_req_zone_write_lock(rq);
rq->rq_flags |= RQF_STARTED;

[PATCH V9 4/7] deadline-iosched: Introduce dispatch helpers

2017-12-20 Thread Damien Le Moal
Avoid directly referencing the next_rq and fifo_list arrays using the
helper functions deadline_next_request() and deadline_fifo_request() to
facilitate changes in the dispatch request selection in
deadline_dispatch_requests() for zoned block devices.

While at it, also remove the unnecessary forward declaration of the
function deadline_move_request().

Signed-off-by: Damien Le Moal 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
---
 block/deadline-iosched.c | 47 +--
 1 file changed, 37 insertions(+), 10 deletions(-)

diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index b83f77460d28..81e3f0897457 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -50,8 +50,6 @@ struct deadline_data {
int front_merges;
 };
 
-static void deadline_move_request(struct deadline_data *, struct request *);
-
 static inline struct rb_root *
 deadline_rb_root(struct deadline_data *dd, struct request *rq)
 {
@@ -230,6 +228,35 @@ static inline int deadline_check_fifo(struct deadline_data 
*dd, int ddir)
return 0;
 }
 
+/*
+ * For the specified data direction, return the next request to dispatch using
+ * arrival ordered lists.
+ */
+static struct request *
+deadline_fifo_request(struct deadline_data *dd, int data_dir)
+{
+   if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE))
+   return NULL;
+
+   if (list_empty(&dd->fifo_list[data_dir]))
+   return NULL;
+
+   return rq_entry_fifo(dd->fifo_list[data_dir].next);
+}
+
+/*
+ * For the specified data direction, return the next request to dispatch using
+ * sector position sorted lists.
+ */
+static struct request *
+deadline_next_request(struct deadline_data *dd, int data_dir)
+{
+   if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE))
+   return NULL;
+
+   return dd->next_rq[data_dir];
+}
+
 /*
  * deadline_dispatch_requests selects the best request according to
  * read/write expire, fifo_batch, etc
@@ -239,16 +266,15 @@ static int deadline_dispatch_requests(struct 
request_queue *q, int force)
struct deadline_data *dd = q->elevator->elevator_data;
const int reads = !list_empty(&dd->fifo_list[READ]);
const int writes = !list_empty(&dd->fifo_list[WRITE]);
-   struct request *rq;
+   struct request *rq, *next_rq;
int data_dir;
 
/*
 * batches are currently reads XOR writes
 */
-   if (dd->next_rq[WRITE])
-   rq = dd->next_rq[WRITE];
-   else
-   rq = dd->next_rq[READ];
+   rq = deadline_next_request(dd, WRITE);
+   if (!rq)
+   rq = deadline_next_request(dd, READ);
 
if (rq && dd->batching < dd->fifo_batch)
/* we have a next request are still entitled to batch */
@@ -291,19 +317,20 @@ static int deadline_dispatch_requests(struct 
request_queue *q, int force)
/*
 * we are not running a batch, find best request for selected data_dir
 */
-   if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
+   next_rq = deadline_next_request(dd, data_dir);
+   if (deadline_check_fifo(dd, data_dir) || !next_rq) {
/*
 * A deadline has expired, the last request was in the other
 * direction, or we have run out of higher-sectored requests.
 * Start again from the request with the earliest expiry time.
 */
-   rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+   rq = deadline_fifo_request(dd, data_dir);
} else {
/*
 * The last req was the same dir and we have a next request in
 * sort order. No expired requests so continue on from here.
 */
-   rq = dd->next_rq[data_dir];
+   rq = next_rq;
}
 
dd->batching = 0;
-- 
2.14.3



[PATCH V9 7/7] sd: Remove zone write locking

2017-12-20 Thread Damien Le Moal
The block layer now handles zone write locking.

Signed-off-by: Damien Le Moal 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
---
 drivers/scsi/sd.c| 41 +++-
 drivers/scsi/sd.h| 11 ---
 drivers/scsi/sd_zbc.c| 83 
 include/scsi/scsi_cmnd.h |  3 +-
 4 files changed, 6 insertions(+), 132 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index a028ab3322a9..f1157b4fe32e 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -851,16 +851,13 @@ static int sd_setup_write_zeroes_cmnd(struct scsi_cmnd 
*cmd)
struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
-   int ret;
 
if (!(rq->cmd_flags & REQ_NOUNMAP)) {
switch (sdkp->zeroing_mode) {
case SD_ZERO_WS16_UNMAP:
-   ret = sd_setup_write_same16_cmnd(cmd, true);
-   goto out;
+   return sd_setup_write_same16_cmnd(cmd, true);
case SD_ZERO_WS10_UNMAP:
-   ret = sd_setup_write_same10_cmnd(cmd, true);
-   goto out;
+   return sd_setup_write_same10_cmnd(cmd, true);
}
}
 
@@ -868,15 +865,9 @@ static int sd_setup_write_zeroes_cmnd(struct scsi_cmnd 
*cmd)
return BLKPREP_INVALID;
 
if (sdkp->ws16 || sector > 0xffffffff || nr_sectors > 0xffff)
-   ret = sd_setup_write_same16_cmnd(cmd, false);
-   else
-   ret = sd_setup_write_same10_cmnd(cmd, false);
-
-out:
-   if (sd_is_zoned(sdkp) && ret == BLKPREP_OK)
-   return sd_zbc_write_lock_zone(cmd);
+   return sd_setup_write_same16_cmnd(cmd, false);
 
-   return ret;
+   return sd_setup_write_same10_cmnd(cmd, false);
 }
 
 static void sd_config_write_same(struct scsi_disk *sdkp)
@@ -964,12 +955,6 @@ static int sd_setup_write_same_cmnd(struct scsi_cmnd *cmd)
 
BUG_ON(bio_offset(bio) || bio_iovec(bio).bv_len != sdp->sector_size);
 
-   if (sd_is_zoned(sdkp)) {
-   ret = sd_zbc_write_lock_zone(cmd);
-   if (ret != BLKPREP_OK)
-   return ret;
-   }
-
sector >>= ilog2(sdp->sector_size) - 9;
nr_sectors >>= ilog2(sdp->sector_size) - 9;
 
@@ -1004,9 +989,6 @@ static int sd_setup_write_same_cmnd(struct scsi_cmnd *cmd)
ret = scsi_init_io(cmd);
rq->__data_len = nr_bytes;
 
-   if (sd_is_zoned(sdkp) && ret != BLKPREP_OK)
-   sd_zbc_write_unlock_zone(cmd);
-
return ret;
 }
 
@@ -1036,19 +1018,12 @@ static int sd_setup_read_write_cmnd(struct scsi_cmnd 
*SCpnt)
sector_t threshold;
unsigned int this_count = blk_rq_sectors(rq);
unsigned int dif, dix;
-   bool zoned_write = sd_is_zoned(sdkp) && rq_data_dir(rq) == WRITE;
int ret;
unsigned char protect;
 
-   if (zoned_write) {
-   ret = sd_zbc_write_lock_zone(SCpnt);
-   if (ret != BLKPREP_OK)
-   return ret;
-   }
-
ret = scsi_init_io(SCpnt);
if (ret != BLKPREP_OK)
-   goto out;
+   return ret;
WARN_ON_ONCE(SCpnt != rq->special);
 
/* from here on until we're complete, any goto out
@@ -1267,9 +1242,6 @@ static int sd_setup_read_write_cmnd(struct scsi_cmnd 
*SCpnt)
 */
ret = BLKPREP_OK;
  out:
-   if (zoned_write && ret != BLKPREP_OK)
-   sd_zbc_write_unlock_zone(SCpnt);
-
return ret;
 }
 
@@ -1314,9 +1286,6 @@ static void sd_uninit_command(struct scsi_cmnd *SCpnt)
struct request *rq = SCpnt->request;
u8 *cmnd;
 
-   if (SCpnt->flags & SCMD_ZONE_WRITE_LOCK)
-   sd_zbc_write_unlock_zone(SCpnt);
-
if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
__free_page(rq->special_vec.bv_page);
 
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 320de758323e..0d663b5e45bb 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -77,7 +77,6 @@ struct scsi_disk {
unsigned intnr_zones;
unsigned intzone_blocks;
unsigned intzone_shift;
-   unsigned long   *zones_wlock;
unsigned intzones_optimal_open;
unsigned intzones_optimal_nonseq;
unsigned intzones_max_open;
@@ -283,8 +282,6 @@ static inline int sd_is_zoned(struct scsi_disk *sdkp)
 extern int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buffer);
 extern void sd_zbc_remove(struct scsi_disk *sdkp);
 extern void sd_zbc_print_zones(struct scsi_disk *sdkp);
-extern int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd);
-extern void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd);
 extern int 

[PATCH V9 6/7] sd_zbc: Initialize device request queue zoned data

2017-12-20 Thread Damien Le Moal
Initialize the seq_zones_bitmap, seq_zones_wlock and nr_zones fields of
the disk request queue on disk revalidate. As the seq_zones_bitmap
and seq_zones_wlock allocations are identical, introduce the helper
sd_zbc_alloc_zone_bitmap(). Using this helper, reallocate the bitmaps
whenever the disk capacity (number of zones) changes.

Signed-off-by: Damien Le Moal 
---
 drivers/scsi/sd_zbc.c | 152 +++---
 1 file changed, 144 insertions(+), 8 deletions(-)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 27793b9f54c0..c715b8363ce0 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -586,8 +586,123 @@ static int sd_zbc_check_zone_size(struct scsi_disk *sdkp)
return 0;
 }
 
+/**
+ * sd_zbc_alloc_zone_bitmap - Allocate a zone bitmap (one bit per zone).
+ * @sdkp: The disk of the bitmap
+ */
+static inline unsigned long *sd_zbc_alloc_zone_bitmap(struct scsi_disk *sdkp)
+{
+   struct request_queue *q = sdkp->disk->queue;
+
+   return kzalloc_node(BITS_TO_LONGS(sdkp->nr_zones)
+   * sizeof(unsigned long),
+   GFP_KERNEL, q->node);
+}
+
+/**
+ * sd_zbc_get_seq_zones - Parse report zones reply to identify sequential zones
+ * @sdkp: disk used
+ * @buf: report reply buffer
+ * @seq_zones_bitmap: bitmap of sequential zones to set
+ *
+ * Parse reported zone descriptors in @buf to identify sequential zones and
+ * set the reported zone bit in @seq_zones_bitmap accordingly.
+ * Since read-only and offline zones cannot be written, do not
+ * mark them as sequential in the bitmap.
+ * Return the LBA after the last zone reported.
+ */
+static sector_t sd_zbc_get_seq_zones(struct scsi_disk *sdkp, unsigned char 
*buf,
+unsigned int buflen,
+unsigned long *seq_zones_bitmap)
+{
+   sector_t lba, next_lba = sdkp->capacity;
+   unsigned int buf_len, list_length;
+   unsigned char *rec;
+   u8 type, cond;
+
+   list_length = get_unaligned_be32(&buf[0]) + 64;
+   buf_len = min(list_length, buflen);
+   rec = buf + 64;
+
+   while (rec < buf + buf_len) {
+   type = rec[0] & 0x0f;
+   cond = (rec[1] >> 4) & 0xf;
+   lba = get_unaligned_be64(&rec[16]);
+   if (type != ZBC_ZONE_TYPE_CONV &&
+   cond != ZBC_ZONE_COND_READONLY &&
+   cond != ZBC_ZONE_COND_OFFLINE)
+   set_bit(lba >> sdkp->zone_shift, seq_zones_bitmap);
+   next_lba = lba + get_unaligned_be64(&rec[8]);
+   rec += 64;
+   }
+
+   return next_lba;
+}
+
+/**
+ * sd_zbc_setup_seq_zones_bitmap - Initialize the disk seq zone bitmap.
+ * @sdkp: target disk
+ *
+ * Allocate a zone bitmap and initialize it by identifying sequential zones.
+ */
+static int sd_zbc_setup_seq_zones_bitmap(struct scsi_disk *sdkp)
+{
+   struct request_queue *q = sdkp->disk->queue;
+   unsigned long *seq_zones_bitmap;
+   sector_t lba = 0;
+   unsigned char *buf;
+   int ret = -ENOMEM;
+
+   seq_zones_bitmap = sd_zbc_alloc_zone_bitmap(sdkp);
+   if (!seq_zones_bitmap)
+   return -ENOMEM;
+
+   buf = kmalloc(SD_ZBC_BUF_SIZE, GFP_KERNEL);
+   if (!buf)
+   goto out;
+
+   while (lba < sdkp->capacity) {
+   ret = sd_zbc_report_zones(sdkp, buf, SD_ZBC_BUF_SIZE, lba);
+   if (ret)
+   goto out;
+   lba = sd_zbc_get_seq_zones(sdkp, buf, SD_ZBC_BUF_SIZE,
+  seq_zones_bitmap);
+   }
+
+   if (lba != sdkp->capacity) {
+   /* Something went wrong */
+   ret = -EIO;
+   }
+
+out:
+   kfree(buf);
+   if (ret) {
+   kfree(seq_zones_bitmap);
+   return ret;
+   }
+
+   q->seq_zones_bitmap = seq_zones_bitmap;
+
+   return 0;
+}
+
+static void sd_zbc_cleanup(struct scsi_disk *sdkp)
+{
+   struct request_queue *q = sdkp->disk->queue;
+
+   kfree(q->seq_zones_bitmap);
+   q->seq_zones_bitmap = NULL;
+
+   kfree(q->seq_zones_wlock);
+   q->seq_zones_wlock = NULL;
+
+   q->nr_zones = 0;
+}
+
 static int sd_zbc_setup(struct scsi_disk *sdkp)
 {
+   struct request_queue *q = sdkp->disk->queue;
+   int ret;
 
/* READ16/WRITE16 is mandatory for ZBC disks */
sdkp->device->use_16_for_rw = 1;
@@ -599,15 +714,36 @@ static int sd_zbc_setup(struct scsi_disk *sdkp)
sdkp->nr_zones =
round_up(sdkp->capacity, sdkp->zone_blocks) >> sdkp->zone_shift;
 
-   if (!sdkp->zones_wlock) {
-   sdkp->zones_wlock = kcalloc(BITS_TO_LONGS(sdkp->nr_zones),
-   sizeof(unsigned long),
-   GFP_KERNEL);
-   if (!sdkp->zones_wlock)
-   return -ENOMEM;

[PATCH V9 5/7] deadline-iosched: Introduce zone locking support

2017-12-20 Thread Damien Le Moal
Introduce zone write locking to avoid write request reordering with
zoned block devices. This is achieved using a finer selection of the
next request to dispatch:
1) Any non-write request is always allowed to proceed.
2) Any write to a conventional zone is always allowed to proceed.
3) For a write to a sequential zone, the zone lock is first checked.
   a) If the zone is not locked, the write is allowed to proceed after
  its target zone is locked.
   b) If the zone is locked, the write request is skipped and the next
  request in the dispatch queue is tested (back to step 1).

For a write request that has locked its target zone, the zone is
unlocked either when the request completes and the method
deadline_request_completed() is called, or when the request is requeued
using the method deadline_add_request().

Requests targeting a locked zone are always left in the scheduler queue
to preserve the initial write order. If no write request can be
dispatched, allow reads to be dispatched even if the write batch is not
done.

If the device used is not a zoned block device, or if zoned block device
support is disabled, this patch does not modify deadline behavior.

Signed-off-by: Damien Le Moal 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
---
 block/deadline-iosched.c | 71 ++--
 1 file changed, 68 insertions(+), 3 deletions(-)

diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 81e3f0897457..9de9f156e203 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -98,6 +98,12 @@ deadline_add_request(struct request_queue *q, struct request 
*rq)
struct deadline_data *dd = q->elevator->elevator_data;
const int data_dir = rq_data_dir(rq);
 
+   /*
+* This may be a requeue of a write request that has locked its
+* target zone. If it is the case, this releases the zone lock.
+*/
+   blk_req_zone_write_unlock(rq);
+
deadline_add_rq_rb(dd, rq);
 
/*
@@ -188,6 +194,12 @@ deadline_move_to_dispatch(struct deadline_data *dd, struct 
request *rq)
 {
struct request_queue *q = rq->q;
 
+   /*
+* For a zoned block device, write requests must write lock their
+* target zone.
+*/
+   blk_req_zone_write_lock(rq);
+
deadline_remove_request(q, rq);
elv_dispatch_add_tail(q, rq);
 }
@@ -235,13 +247,28 @@ static inline int deadline_check_fifo(struct 
deadline_data *dd, int ddir)
 static struct request *
 deadline_fifo_request(struct deadline_data *dd, int data_dir)
 {
+   struct request *rq;
+
if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE))
return NULL;
 
if (list_empty(&dd->fifo_list[data_dir]))
return NULL;
 
-   return rq_entry_fifo(dd->fifo_list[data_dir].next);
+   rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+   if (data_dir == READ || !blk_queue_is_zoned(rq->q))
+   return rq;
+
+   /*
+* Look for a write request that can be dispatched, that is one with
+* an unlocked target zone.
+*/
+   list_for_each_entry(rq, &dd->fifo_list[WRITE], queuelist) {
+   if (blk_req_can_dispatch_to_zone(rq))
+   return rq;
+   }
+
+   return NULL;
 }
 
 /*
@@ -251,10 +278,29 @@ deadline_fifo_request(struct deadline_data *dd, int 
data_dir)
 static struct request *
 deadline_next_request(struct deadline_data *dd, int data_dir)
 {
+   struct request *rq;
+
if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE))
return NULL;
 
-   return dd->next_rq[data_dir];
+   rq = dd->next_rq[data_dir];
+   if (!rq)
+   return NULL;
+
+   if (data_dir == READ || !blk_queue_is_zoned(rq->q))
+   return rq;
+
+   /*
+* Look for a write request that can be dispatched, that is one with
+* an unlocked target zone.
+*/
+   while (rq) {
+   if (blk_req_can_dispatch_to_zone(rq))
+   return rq;
+   rq = deadline_latter_request(rq);
+   }
+
+   return NULL;
 }
 
 /*
@@ -288,7 +334,8 @@ static int deadline_dispatch_requests(struct request_queue 
*q, int force)
if (reads) {
BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
 
-   if (writes && (dd->starved++ >= dd->writes_starved))
+   if (deadline_fifo_request(dd, WRITE) &&
+   (dd->starved++ >= dd->writes_starved))
goto dispatch_writes;
 
data_dir = READ;
@@ -333,6 +380,13 @@ static int deadline_dispatch_requests(struct request_queue 
*q, int force)
rq = next_rq;
}
 
+   /*
+* For a zoned block device, if we only have writes queued and none of
+* them can be dispatched, rq will be NULL.
+*/
+  

[PATCH V9 2/7] mq-deadline: Introduce dispatch helpers

2017-12-20 Thread Damien Le Moal
Avoid directly referencing the next_rq and fifo_list arrays using the
helper functions deadline_next_request() and deadline_fifo_request() to
facilitate changes in the dispatch request selection in
__dd_dispatch_request() for zoned block devices.

Signed-off-by: Damien Le Moal 
Reviewed-by: Bart Van Assche 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
---
 block/mq-deadline.c | 45 +
 1 file changed, 37 insertions(+), 8 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 0179e484ec98..8bd6db9e69c7 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -191,6 +191,35 @@ static inline int deadline_check_fifo(struct deadline_data 
*dd, int ddir)
return 0;
 }
 
+/*
+ * For the specified data direction, return the next request to
+ * dispatch using arrival ordered lists.
+ */
+static struct request *
+deadline_fifo_request(struct deadline_data *dd, int data_dir)
+{
+   if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE))
+   return NULL;
+
+   if (list_empty(&dd->fifo_list[data_dir]))
+   return NULL;
+
+   return rq_entry_fifo(dd->fifo_list[data_dir].next);
+}
+
+/*
+ * For the specified data direction, return the next request to
+ * dispatch using sector position sorted lists.
+ */
+static struct request *
+deadline_next_request(struct deadline_data *dd, int data_dir)
+{
+   if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE))
+   return NULL;
+
+   return dd->next_rq[data_dir];
+}
+
 /*
  * deadline_dispatch_requests selects the best request according to
  * read/write expire, fifo_batch, etc
@@ -198,7 +227,7 @@ static inline int deadline_check_fifo(struct deadline_data 
*dd, int ddir)
 static struct request *__dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 {
struct deadline_data *dd = hctx->queue->elevator->elevator_data;
-   struct request *rq;
+   struct request *rq, *next_rq;
bool reads, writes;
int data_dir;
 
@@ -214,10 +243,9 @@ static struct request *__dd_dispatch_request(struct 
blk_mq_hw_ctx *hctx)
/*
 * batches are currently reads XOR writes
 */
-   if (dd->next_rq[WRITE])
-   rq = dd->next_rq[WRITE];
-   else
-   rq = dd->next_rq[READ];
+   rq = deadline_next_request(dd, WRITE);
+   if (!rq)
+   rq = deadline_next_request(dd, READ);
 
if (rq && dd->batching < dd->fifo_batch)
/* we have a next request are still entitled to batch */
@@ -260,19 +288,20 @@ static struct request *__dd_dispatch_request(struct 
blk_mq_hw_ctx *hctx)
/*
 * we are not running a batch, find best request for selected data_dir
 */
-   if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
+   next_rq = deadline_next_request(dd, data_dir);
+   if (deadline_check_fifo(dd, data_dir) || !next_rq) {
/*
 * A deadline has expired, the last request was in the other
 * direction, or we have run out of higher-sectored requests.
 * Start again from the request with the earliest expiry time.
 */
-   rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+   rq = deadline_fifo_request(dd, data_dir);
} else {
/*
 * The last req was the same dir and we have a next request in
 * sort order. No expired requests so continue on from here.
 */
-   rq = dd->next_rq[data_dir];
+   rq = next_rq;
}
 
dd->batching = 0;
-- 
2.14.3



[PATCH V9 1/7] block: introduce zoned block devices zone write locking

2017-12-20 Thread Damien Le Moal
From: Christoph Hellwig 

Components relying only on the request_queue structure for accessing
block devices (e.g. I/O schedulers) have limited knowledge of the
device characteristics. In particular, the device capacity cannot be
easily discovered, which for a zoned block device also results in the
inability to easily know the number of zones of the device (the zone
size is indicated by the chunk_sectors field of the queue limits).

Introduce the nr_zones field to the request_queue structure to simplify
access to this information. Also, add the bitmap seq_zones_bitmap which
indicates which zones of the device are sequential zones (write
preferred or write required) and the bitmap seq_zones_wlock which
indicates if a zone is write locked, that is, if a write request
targeting a zone was dispatched to the device. These fields are
initialized by the low level block device driver (sd.c for ZBC/ZAC
disks). They are not initialized by stacking drivers (device mappers)
handling zoned block devices (e.g. dm-linear).

Using this, I/O schedulers can introduce zone write locking to control
request dispatching to a zoned block device and avoid write request
reordering by limiting to at most a single write request per zone
outside of the scheduler at any time.
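
As an illustration, a minimal sketch of how an I/O scheduler dispatch path
might use these facilities (assuming the blkdev.h wrappers
blk_req_zone_write_lock()/blk_req_zone_write_unlock() around the __-prefixed
functions exported below, plus blk_req_can_dispatch_to_zone(); this is not the
mq-deadline code itself):

/*
 * Sketch only: lock the target zone when dispatching a write, and
 * unlock it again on completion or requeue.
 */
static struct request *sched_try_dispatch(struct request *rq)
{
	/* skip writes whose sequential target zone is already write locked */
	if (!blk_req_can_dispatch_to_zone(rq))
		return NULL;

	/* no-op for reads, passthrough requests and conventional zones */
	blk_req_zone_write_lock(rq);
	return rq;
}

static void sched_complete_or_requeue(struct request *rq)
{
	blk_req_zone_write_unlock(rq);
}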

Based on previous patches from Damien Le Moal.

Signed-off-by: Christoph Hellwig 
[Damien]
* Fixed comments and indentation in blkdev.h
* Changed helper functions
* Fixed this commit message
Signed-off-by: Damien Le Moal 
Reviewed-by: Martin K. Petersen 
---
 block/blk-core.c   |   1 +
 block/blk-zoned.c  |  42 +++
 include/linux/blkdev.h | 111 +
 3 files changed, 154 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index b8881750a3ac..e6e5bbc4c366 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1641,6 +1641,7 @@ void __blk_put_request(struct request_queue *q, struct 
request *req)
 
lockdep_assert_held(q->queue_lock);
 
+   blk_req_zone_write_unlock(req);
blk_pm_put_request(req);
 
elv_completed_request(q, req);
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index ff57fb51b338..acb7252c7e81 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -21,6 +21,48 @@ static inline sector_t blk_zone_start(struct request_queue 
*q,
return sector & ~zone_mask;
 }
 
+/*
+ * Return true if a request is a write requests that needs zone write locking.
+ */
+bool blk_req_needs_zone_write_lock(struct request *rq)
+{
+   if (!rq->q->seq_zones_wlock)
+   return false;
+
+   if (blk_rq_is_passthrough(rq))
+   return false;
+
+   switch (req_op(rq)) {
+   case REQ_OP_WRITE_ZEROES:
+   case REQ_OP_WRITE_SAME:
+   case REQ_OP_WRITE:
+   return blk_rq_zone_is_seq(rq);
+   default:
+   return false;
+   }
+}
+EXPORT_SYMBOL_GPL(blk_req_needs_zone_write_lock);
+
+void __blk_req_zone_write_lock(struct request *rq)
+{
+   if (WARN_ON_ONCE(test_and_set_bit(blk_rq_zone_no(rq),
+ rq->q->seq_zones_wlock)))
+   return;
+
+   WARN_ON_ONCE(rq->rq_flags & RQF_ZONE_WRITE_LOCKED);
+   rq->rq_flags |= RQF_ZONE_WRITE_LOCKED;
+}
+EXPORT_SYMBOL_GPL(__blk_req_zone_write_lock);
+
+void __blk_req_zone_write_unlock(struct request *rq)
+{
+   rq->rq_flags &= ~RQF_ZONE_WRITE_LOCKED;
+   if (rq->q->seq_zones_wlock)
+   WARN_ON_ONCE(!test_and_clear_bit(blk_rq_zone_no(rq),
+rq->q->seq_zones_wlock));
+}
+EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
+
 /*
  * Check that a zone report belongs to the partition.
  * If yes, fix its start sector and write pointer, copy it in the
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8089ca17db9a..46e606f5b44b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -121,6 +121,8 @@ typedef __u32 __bitwise req_flags_t;
 /* Look at ->special_vec for the actual data payload instead of the
bio chain. */
 #define RQF_SPECIAL_PAYLOAD((__force req_flags_t)(1 << 18))
+/* The per-zone write lock is held for this request */
+#define RQF_ZONE_WRITE_LOCKED  ((__force req_flags_t)(1 << 19))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
@@ -546,6 +548,22 @@ struct request_queue {
 
struct queue_limits limits;
 
+   /*
+* Zoned block device information for request dispatch control.
+* nr_zones is the total number of zones of the device. This is always
+* 0 for regular block devices. seq_zones_bitmap is a bitmap of nr_zones
+* bits which indicates if a zone is conventional (bit clear) or
+* sequential (bit set). seq_zones_wlock is a bitmap of nr_zones
+* bits which indicates if a zone is write locked, that is, if a write
+

[PATCH V9 0/7] blk-mq support for ZBC disks

2017-12-20 Thread Damien Le Moal
This series, formerly titled "scsi-mq support for ZBC disks", implements
support for ZBC disks for systems using the scsi-mq I/O path.

The current scsi level support of ZBC disks guarantees write request ordering
using a per-zone write lock which prevents simultaneously issuing multiple
write commands to a zone, thereby avoiding reordering of sequential writes to
sequential zones. This method is however ineffective when scsi-mq is used with
zoned block devices. This is due to the different execution model of blk-mq
which passes a request to the scsi layer for dispatching after the request has
been removed from the I/O scheduler queue. That is, when the scsi layer tries
to lock the target zone of the request, the request may already be out of
order and zone write locking fails to prevent that.

Various approaches have been tried to solve this problem directly from the core
code of blk-mq. All of them had the serious disadvantage of cluttering blk-mq
code with zoned block device specific conditions and processing, making
maintenance and testing difficult.

This series adds blk-mq support for zoned block devices at the I/O scheduler
level with simple modifications of the mq-deadline scheduler. Implementation
is done with reusable helpers defined in the zoned block device support file
(blk-zoned.c). These helpers provide per zone write locking control functions
similar to what was implemented directly in the SCSI layer in sd_zbc.c.
The zone write locking mechanism is used by mq-deadline for the exact same
purpose, that is, to limit writes per zone to at most one request to avoid
reordering.

The changes to mq-deadline do not affect its operation with regular disks. The
same scheduling behavior is maintained for these devices. Compared to the SCSI
layer zone locking implementation, this series avoids locking conventional
zones, which results in use of these zones that is comparable to that of a
regular disk.

This series also implements changes to the legacy deadline-iosched. Doing so,
the zone locking code at the SCSI layer in sd.c and sd_zbc.c can be removed.
This results in a significant simplification of the sd driver command handling.

Patches 1 to 5 introduce the zone locking helpers in the block layer and modify
the deadline and mq-deadline schedulers.
Patches 6 and 7 remove the SCSI layer zone locking and initialize the device
request queue zone information.

All patches apply without conflicts to the scsi tree branch 4.16/scsi-queue, to
the block tree branch for-linus as well as to the current 4.15-rc4 tree.

Of note is that this series imposes the use of the deadline and mq-deadline
schedulers with zoned block devices. A system can trivially enforce this using
a udev rule such as:

ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{queue/zoned}=="host-managed", \
ATTR{queue/scheduler}="deadline"

This rule applies equally to the legacy SCSI path and the scsi-mq path,
thanks to "mq-deadline" being aliased to "deadline".

Comments are as always very much appreciated.

Changes from v8:
* Rebased on 4.15-rc4

Changes from v7:
* Merged patch 8 into patch 6
* Fixed various typos and commit messages

Changes from v6:
* Implement zone write locking helpers in the block layer
* Also modify legacy path deadline scheduler to remove all zone write locking
  code from the scsi layer

Changes from v5:
* Refactor patches to introduce the zone_lock spinlock only when needed
* Addressed Bart's comments (in particular explanations of the zone_lock
  spinlock use)

Changes from v4:
* Various fixes and improvements (From Christoph's comments)
* Dropped zones_wlock scheduler tunable attribute

Changes from v3:
* Integrated support directly into mq-deadline instead of creating a new I/O
  scheduler.
* Disable setting of default mq scheduler for single queue devices

Changes from v2:
* Introduced blk_zoned structure
* Moved I/O scheduler from drivers/scsi to block

Changes from v1:
* Addressed Bart's comments for the blk-mq patches (declarations files)
* Split (former) patch 4 into multiple patches to facilitate review
* Fixed scsi disk lookup from io scheduler by introducing
  scsi_disk_from_queue()

Christoph Hellwig (1):
  block: introduce zoned block devices zone write locking

Damien Le Moal (6):
  mq-deadline: Introduce dispatch helpers
  mq-deadline: Introduce zone locking support
  deadline-iosched: Introduce dispatch helpers
  deadline-iosched: Introduce zone locking support
  sd_zbc: Initialize device request queue zoned data
  sd: Remove zone write locking

 block/blk-core.c |   1 +
 block/blk-zoned.c|  42 +
 block/deadline-iosched.c | 114 ---
 block/mq-deadline.c  | 130 --
 drivers/scsi/sd.c|  41 +
 drivers/scsi/sd.h|  11 ---
 drivers/scsi/sd_zbc.c| 235 +--
 include/linux/blkdev.h   | 111 ++
 include/scsi/scsi_cmnd.h |   3 +-
 9 files changed, 528 

Re: [PATCHSET v2] blk-mq: reimplement timeout handling

2017-12-20 Thread Bart Van Assche
On Wed, 2017-12-20 at 16:08 -0800, t...@kernel.org wrote:
> On Wed, Dec 20, 2017 at 11:41:02PM +, Bart Van Assche wrote:
> > On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > > Currently, blk-mq timeout path synchronizes against the usual
> > > issue/completion path using a complex scheme involving atomic
> > > bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> > > rules.  Unfortunatley, it contains quite a few holes.
> > 
> > Hello Tejun,
> > 
> > An attempt to run SCSI I/O with this patch series applied resulted in
> > the following:
> 
> Can you please try the v3?  There were a couple bugs that I missed
> while testing earlier versions.
> 
>  http://lkml.kernel.org/r/20171216120726.517153-1...@kernel.org

Will do. But please Cc: linux-block in case you would post a v4 of this patch
series. I searched the linux-block folder of my mailbox for the latest version
of this patch series and that is how I ended up testing v2 instead of v3 ...

Bart.

Re: [PATCHSET v2] blk-mq: reimplement timeout handling

2017-12-20 Thread t...@kernel.org
On Wed, Dec 20, 2017 at 11:41:02PM +, Bart Van Assche wrote:
> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > Currently, blk-mq timeout path synchronizes against the usual
> > issue/completion path using a complex scheme involving atomic
> > bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> > rules.  Unfortunatley, it contains quite a few holes.
> 
> Hello Tejun,
> 
> An attempt to run SCSI I/O with this patch series applied resulted in
> the following:

Can you please try the v3?  There were a couple bugs that I missed
while testing earlier versions.

 http://lkml.kernel.org/r/20171216120726.517153-1...@kernel.org

Thanks.

-- 
tejun


Re: [PATCHSET v2] blk-mq: reimplement timeout handling

2017-12-20 Thread Bart Van Assche
On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> Currently, blk-mq timeout path synchronizes against the usual
> issue/completion path using a complex scheme involving atomic
> bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> rules.  Unfortunatley, it contains quite a few holes.

Hello Tejun,

An attempt to run SCSI I/O with this patch series applied resulted in
the following:

BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: scsi_times_out+0x1c/0x2d0
PGD 0 P4D 0
Oops:  [#1] PREEMPT SMP
CPU: 1 PID: 437 Comm: kworker/1:1H Tainted: GW4.15.0-rc4-dbg+ #1
Hardware name: Dell Inc. PowerEdge R720/0VWT90, BIOS 2.5.4 01/22/2016
Workqueue: kblockd blk_mq_timeout_work
RIP: 0010:scsi_times_out+0x1c/0x2d0
RSP: 0018:c90007ef3d58 EFLAGS: 00010246
RAX:  RBX: 880878eab000 RCX: 
RDX:  RSI:  RDI: 880878eab000
RBP: 880878eab1a0 R08:  R09: 0001
R10:  R11:  R12: 0004
R13:  R14: 88085e4a5ce8 R15: 880878e9f848
FS:  () GS:88093f60() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2:  CR3: 01c0f002 CR4: 000606e0
Call Trace:
 blk_mq_terminate_expired+0x36/0x70
 bt_iter+0x43/0x50
 blk_mq_queue_tag_busy_iter+0xee/0x200
 blk_mq_timeout_work+0x186/0x2e0
 process_one_work+0x221/0x6e0
 worker_thread+0x3a/0x390
 kthread+0x11c/0x140
 ret_from_fork+0x24/0x30
RIP: scsi_times_out+0x1c/0x2d0 RSP: c90007ef3d58
CR2: 

(gdb) list *(scsi_times_out+0x1c)
0x8147adbc is in scsi_times_out (drivers/scsi/scsi_error.c:285).
280  */
281 enum blk_eh_timer_return scsi_times_out(struct request *req)
282 {
283 struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
284 enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
285 struct Scsi_Host *host = scmd->device->host;
286
287 trace_scsi_dispatch_cmd_timeout(scmd);
288 scsi_log_completion(scmd, TIMEOUT_ERROR);
289

(gdb) disas /s scsi_times_out
[ ... ]
283 struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
284 enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
285 struct Scsi_Host *host = scmd->device->host;
   0x8147adb2 <+18>:mov0x1d8(%rdi),%rax
   0x8147adb9 <+25>:mov%rdi,%rbx
   0x8147adbc <+28>:mov(%rax),%r13
   0x8147adbf <+31>:nopl   0x0(%rax,%rax,1)

Bart.

Re: [PATCH V3] block: drain queue before waiting for q_usage_counter becoming zero

2017-12-20 Thread Ming Lei
On Thu, Nov 30, 2017 at 07:56:35AM +0800, Ming Lei wrote:
> Now we track legacy requests with .q_usage_counter in commit 055f6e18e08f
> ("block: Make q_usage_counter also track legacy requests"), but that
> commit never runs and drains legacy queue before waiting for this counter
> becoming zero, then IO hang is caused in the test of pulling disk during IO.
> 
> This patch fixes the issue by draining requests before waiting for
> q_usage_counter becoming zero, both Mauricio and chenxiang reported this
> issue, and observed that it can be fixed by this patch.
> 
> Link: https://marc.info/?l=linux-block=151192424731797=2
> Fixes: 055f6e18e08f("block: Make q_usage_counter also track legacy requests")
> Cc: Wen Xiong 
> Tested-by: "chenxiang (M)" 
> Tested-by: Mauricio Faria de Oliveira 
> Signed-off-by: Ming Lei 
> ---
> V3:
>   - V2 can't cover chenxiang's issue, so we have to drain queue via
>   blk_drain_queue(), and fallback to original post(V1)

Hi Jens,

This patch fixes regression caused by 055f6e18e08f merged to v4.15-rc,
could you consider it for v4.15?

Thanks,
Ming


Re: [dm-devel] [for-4.16 PATCH 4/5] dm mpath: use NVMe error handling to know when an error is retryable

2017-12-20 Thread Sagi Grimberg



But interestingly, with my "mptest" link failure test
(test_01_nvme_offline) I'm not actually seeing NVMe trigger a failure
that needs a multipath layer (be it NVMe multipath or DM multipath) to
fail a path and retry the IO.  The pattern is that the link goes down,
and nvme waits for it to come back (internalizing any failure) and then
the IO continues.. so no multipath _really_ needed:

[55284.011286] nvme nvme0: NVME-FC{0}: controller connectivity lost. Awaiting 
Reconnect
[55284.020078] nvme nvme1: NVME-FC{1}: controller connectivity lost. Awaiting 
Reconnect
[55284.028872] nvme nvme2: NVME-FC{2}: controller connectivity lost. Awaiting 
Reconnect
[55284.037658] nvme nvme3: NVME-FC{3}: controller connectivity lost. Awaiting 
Reconnect
[55295.157773] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
[55295.157775] nvmet: ctrl 4 keep-alive timer (15 seconds) expired!
[55295.157778] nvmet: ctrl 3 keep-alive timer (15 seconds) expired!
[55295.157780] nvmet: ctrl 2 keep-alive timer (15 seconds) expired!
[55295.157781] nvmet: ctrl 4 fatal error occurred!
[55295.157784] nvmet: ctrl 3 fatal error occurred!
[55295.157785] nvmet: ctrl 2 fatal error occurred!
[55295.199816] nvmet: ctrl 1 fatal error occurred!
[55304.047540] nvme nvme0: NVME-FC{0}: connectivity re-established. Attempting 
reconnect
[55304.056533] nvme nvme1: NVME-FC{1}: connectivity re-established. Attempting 
reconnect
[55304.066053] nvme nvme2: NVME-FC{2}: connectivity re-established. Attempting 
reconnect
[55304.075037] nvme nvme3: NVME-FC{3}: connectivity re-established. Attempting 
reconnect
[55304.373776] nvmet: creating controller 1 for subsystem mptestnqn for NQN 
nqn.2014-08.org.nvmexpress:uuid:----.
[55304.373835] nvmet: creating controller 2 for subsystem mptestnqn for NQN 
nqn.2014-08.org.nvmexpress:uuid:----.
[55304.373873] nvmet: creating controller 3 for subsystem mptestnqn for NQN 
nqn.2014-08.org.nvmexpress:uuid:----.
[55304.373879] nvmet: creating controller 4 for subsystem mptestnqn for NQN 
nqn.2014-08.org.nvmexpress:uuid:----.
[55304.430988] nvme nvme0: NVME-FC{0}: controller reconnect complete
[55304.433124] nvme nvme3: NVME-FC{3}: controller reconnect complete
[55304.433705] nvme nvme1: NVME-FC{1}: controller reconnect complete

It seems if we have multipath ontop (again: either NVMe native multipath
_or_ DM multipath) we'd prefer to have the equivalent of SCSI's
REQ_FAILFAST_TRANSPORT support?

But nvme_req_needs_retry() calls blk_noretry_request() which returns
true if REQ_FAILFAST_TRANSPORT is set.  Which results in
nvme_req_needs_retry() returning false.  Which causes nvme_complete_rq()
to skip the multipath specific nvme_req_needs_failover(), etc.
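
For reference, a simplified sketch of the retry gating being described
(assumed shape, roughly the 4.15-era code; not a verbatim copy):

static inline bool nvme_req_needs_retry(struct request *req)
{
	/* REQ_FAILFAST_TRANSPORT makes blk_noretry_request() true ... */
	if (blk_noretry_request(req))
		return false;		/* ... so no retry, and no failover */
	if (nvme_req(req)->status & NVME_SC_DNR)
		return false;
	if (nvme_req(req)->retries >= nvme_max_retries)
		return false;
	return true;
}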

So all said:

1) why wait for connection recovery if we have other connections to try?
I think NVMe needs to be plumbed for respecting REQ_FAILFAST_TRANSPORT.


This is specific to FC fail fast logic, nvme-rdma will fail inflight
commands as soon as the transport see an error (or keep alive timeout
expires).

It seems that FC wants to wait for the request retries counter to be
exceeded, but given that the queue isn't unquiesced, the requests stay
quiesced until the host successfully reconnects.


Re: random call_single_data alignment

2017-12-20 Thread Jens Axboe
On 12/20/17 1:18 PM, Peter Zijlstra wrote:
> On Wed, Dec 20, 2017 at 12:40:25PM -0700, Jens Axboe wrote:
>> On 12/20/17 12:10 PM, Jens Axboe wrote:
>>> For some reason, commit 966a967116e6 was added to the tree without
>>> CC'ing relevant maintainers, even though it's touching random subsystems.
>>> One example is struct request, a core structure in the block layer.
>>> After this change, struct request grows from 296 to 320 bytes on my
>>> setup.
> 
>> https://marc.info/?l=linux-block=151379793913822=2
> 
> Does that actually matter though?, kmalloc is likely to over-allocate in
> any case. Sure it introduces a weird hole in the data structure, but
> that can be easily fixed by rearranging the thing.

Yes it matters, if you look at how blk-mq allocates requests. We do it
by the page, and chop it up.
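
To make the size argument concrete, a rough back-of-the-envelope illustration
(hypothetical numbers apart from the two struct sizes quoted above; the real
blk-mq carving also accounts for cmd_size and alignment):

#include <stdio.h>

int main(void)
{
	/* blk-mq allocates requests in page-sized chunks and carves them up,
	 * so per-request size directly limits how many fit per page */
	unsigned int page = 4096;
	unsigned int before = 296, after = 320;	/* sizes reported above */

	printf("requests per page: %u -> %u\n", page / before, page / after);
	return 0;
}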

But you're missing the entire point of this email - don't make random
changes in core data structures without CC'ing the relevant folks. That is
even before getting to the fact that the change is nonsensical on the block
front.

>>> Why are we blindly aligning to 32 bytes? The comment says to avoid
>>> it spanning two cache lines - but if that's the case, look at the
>>> actual use case and see if that's actually a problem. For struct
>>> request, it is not.
>>>
>>> Seems to me, this should have been applied in the specific area
>>> where it was needed. Keep struct call_single_data (instead of some
>>> __ version), and just manually align it where it matters.
> 
> Without enforcement of some kind, it's too easy to get wrong.

I'd argue that the change already did more harm than good.
Alignment bloat should only be applied where it matters. And whether
it matters or not should be investigated and deduced separately.

>> https://marc.info/?l=linux-block=151379849914002=2
> 
> diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
> index ccb9975a97fa..e0c44e4efa44 100644
> --- a/drivers/block/null_blk.c
> +++ b/drivers/block/null_blk.c
> @@ -33,9 +33,9 @@ static inline u64 mb_per_tick(int mbps)
>  }
>  
>  struct nullb_cmd {
> + call_single_data_t csd;
>   struct list_head list;
>   struct llist_node ll_list;
> - call_single_data_t csd;
>   struct request *rq;
>   struct bio *bio;
>   unsigned int tag;
> 
> 
> Gets you the exact same size back.

Again, besides the point, the random spray of changes like this is
really poor form.

>> In the future, please CC the relevant folks before making (and
>> committing) changes like that.
> 
> Yeah, I usually do, sorry about that :/

At least it didn't make it to a release. Oh...

-- 
Jens Axboe



Re: random call_single_data alignment

2017-12-20 Thread Peter Zijlstra
On Wed, Dec 20, 2017 at 12:40:25PM -0700, Jens Axboe wrote:
> On 12/20/17 12:10 PM, Jens Axboe wrote:
> > For some reason, commit 966a967116e6 was added to the tree without
> > CC'ing relevant maintainers, even though it's touching random subsystems.
> > One example is struct request, a core structure in the block layer.
> > After this change, struct request grows from 296 to 320 bytes on my
> > setup.

> https://marc.info/?l=linux-block=151379793913822=2

Does that actually matter, though? kmalloc is likely to over-allocate in
any case. Sure, it introduces a weird hole in the data structure, but
that can be easily fixed by rearranging the thing.

struct request {
	struct list_head {
		struct list_head *next;                          /*  0  8 */
		struct list_head *prev;                          /*  8  8 */
	} queuelist;                                             /*  0 16 */

	/* XXX 16 bytes hole, try to pack */

	union {
		/* typedef call_single_data_t */ struct __call_single_data {
			struct llist_node {
				struct llist_node *next;         /* 32  8 */
			} llist;                                 /* 32  8 */
			/* typedef smp_call_func_t */ void (*func)(void *); /* 40  8 */
			void *info;                              /* 48  8 */
			unsigned int flags;                      /* 56  4 */
		} csd;                                           /* 32 32 */
		/* typedef u64 */ long long unsigned int fifo_time; /* 32  8 */
	};                                                       /* 32 32 */
	/* --- cacheline 1 boundary (64 bytes) --- */

}


diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8089ca17db9a..9d6fb6d59268 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -133,11 +133,11 @@ typedef __u32 __bitwise req_flags_t;
  * especially blk_mq_rq_ctx_init() to take care of the added fields.
  */
 struct request {
-   struct list_head queuelist;
union {
call_single_data_t csd;
u64 fifo_time;
};
+   struct list_head queuelist;
 
struct request_queue *q;
struct blk_mq_ctx *mq_ctx;


> > Why are we blindly aligning to 32 bytes? The comment says to avoid
> > it spanning two cache lines - but if that's the case, look at the
> > actual use case and see if that's actually a problem. For struct
> > request, it is not.
> > 
> > Seems to me, this should have been applied in the specific area
> > where it was needed. Keep struct call_single_data (instead of some
> > __ version), and just manually align it where it matters.

Without enforcement of some kind, it's too easy to get wrong.

> https://marc.info/?l=linux-block=151379849914002=2

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index ccb9975a97fa..e0c44e4efa44 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -33,9 +33,9 @@ static inline u64 mb_per_tick(int mbps)
 }
 
 struct nullb_cmd {
+   call_single_data_t csd;
struct list_head list;
struct llist_node ll_list;
-   call_single_data_t csd;
struct request *rq;
struct bio *bio;
unsigned int tag;


Gets you the exact same size back.

> In the future, please CC the relevant folks before making (and
> committing) changes like that.

Yeah, I usually do, sorry about that :/


Re: random call_single_data alignment

2017-12-20 Thread Jens Axboe
On 12/20/17 12:10 PM, Jens Axboe wrote:
> For some reason, commit 966a967116e6 was added to the tree without
> CC'ing relevant maintainers, even though it's touching random subsystems.
> One example is struct request, a core structure in the block layer.
> After this change, struct request grows from 296 to 320 bytes on my
> setup.
> 
> Why are we blindly aligning to 32 bytes? The comment says to avoid
> it spanning two cache lines - but if that's the case, look at the
> actual use case and see if that's actually a problem. For struct
> request, it is not.
> 
> Seems to me, this should have been applied in the specific area
> where it was needed. Keep struct call_single_data (instead of some
> __ version), and just manually align it where it matters.

https://marc.info/?l=linux-block=151379793913822=2

https://marc.info/?l=linux-block=151379849914002=2

In the future, please CC the relevant folks before making (and
committing) changes like that.

-- 
Jens Axboe



[PATCH] null_blk: unalign call_single_data

2017-12-20 Thread Jens Axboe
Commit 966a967116e6 randomly added alignment to this structure, but
it's actually detrimental to performance of null_blk. Test case:

# modprobe null_blk queue_mode=1 irqmode=1 home_node={0,1}
# echo noop > /sys/block/nullb0/queue/scheduler
# fio --name=csd --filename=/dev/nullb0 --numjobs=8 --direct=1 --rw=randread 
--norandommap --cpus_allowed=$(cat /sys/devices/system/node/node0/cpulist) 
--group_reporting=1

Running on both the home and remote node shows a ~5% degradation
in performance.

Fixes: 966a967116e69 ("smp: Avoid using two cache lines for struct 
call_single_data")
Signed-off-by: Jens Axboe 

---

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index ccb9975a97fa..8c6912dd5142 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -35,7 +35,7 @@ static inline u64 mb_per_tick(int mbps)
 struct nullb_cmd {
struct list_head list;
struct llist_node ll_list;
-   call_single_data_t csd;
+   struct __call_single_data csd;
struct request *rq;
struct bio *bio;
unsigned int tag;


-- 
Jens Axboe



[PATCH] block: unalign call_single_data in struct request

2017-12-20 Thread Jens Axboe
A previous change blindly added massive alignment to the
call_single_data structure in struct request. This ballooned it in
size from 296 to 320 bytes on my setup, for no valid reason at all.

Use the unaligned struct __call_single_data variant instead.

Fixes: 966a967116e69 ("smp: Avoid using two cache lines for struct 
call_single_data")
Signed-off-by: Jens Axboe 

---

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 100d0df38026..0ce8a372d506 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -135,7 +135,7 @@ typedef __u32 __bitwise req_flags_t;
 struct request {
struct list_head queuelist;
union {
-   call_single_data_t csd;
+   struct __call_single_data csd;
u64 fifo_time;
};
 

-- 
Jens Axboe



random call_single_data alignment

2017-12-20 Thread Jens Axboe
For some reason, commit 966a967116e6 was added to the tree without
CC'ing relevant maintainers, even though it's touching random subsystems.
One example is struct request, a core structure in the block layer.
After this change, struct request grows from 296 to 320 bytes on my
setup.

Why are we blindly aligning to 32 bytes? The comment says to avoid
it spanning two cache lines - but if that's the case, look at the
actual use case and see if that's actually a problem. For struct
request, it is not.

Seems to me, this should have been applied in the specific area
where it was needed. Keep struct call_single_data (instead of some
__ version), and just manually align it where it matters.

-- 
Jens Axboe


Re: [PATCH V2] block-throttle: avoid double charge

2017-12-20 Thread Jens Axboe
On 11/13/17 1:37 PM, Shaohua Li wrote:
> If a bio is throttled and split after throttling, the bio could be
> resubmitted and enter throttling again. This will cause part of the
> bio to be charged multiple times. If the cgroup has an IO limit, the
> double charge will significantly harm performance. Bio splits have
> become quite common since the arbitrary bio size change.
> 
> To fix this, we always set the BIO_THROTTLED flag if a bio is throttled.
> If the bio is cloned/split, we copy the flag to the new bio too to avoid
> a double charge. However, a cloned bio could be directed to a new disk,
> and keeping the flag would then be a problem. The observation is that we
> always set a new disk for the bio in this case, so we can clear the flag
> in bio_set_dev().
> 
> This issue has existed for a long time; the arbitrary bio size change
> makes it worse, so this should go into stable at least since v4.2.
> 
> V1-> V2: Not add extra field in bio based on discussion with Tejun
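
A minimal sketch of the flag clearing described above, assuming bio_set_dev()
stays a macro and drops BIO_THROTTLED whenever the bio is pointed at a
different disk (the applied patch may differ in detail):

#define bio_set_dev(bio, bdev)				\
do {							\
	if ((bio)->bi_disk != (bdev)->bd_disk)		\
		bio_clear_flag(bio, BIO_THROTTLED);	\
	(bio)->bi_disk = (bdev)->bd_disk;		\
	(bio)->bi_partno = (bdev)->bd_partno;		\
} while (0)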

Applied, thanks Shaohua.

-- 
Jens Axboe



[PATCH 01/25] null_blk: remove lightnvm support

2017-12-20 Thread Matias Bjørling
With rrpc being removed, the null_blk lightnvm support is no longer
functional. Remove the lightnvm implementation; it can perhaps be added to
another module in the future if someone takes on the challenge.

Signed-off-by: Matias Bjørling 
---
 drivers/block/null_blk.c | 220 +--
 1 file changed, 3 insertions(+), 217 deletions(-)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index bf2c8ca..842bced 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -12,7 +12,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -107,7 +106,6 @@ struct nullb_device {
unsigned int hw_queue_depth; /* queue depth */
unsigned int index; /* index of the disk, only valid with a disk */
unsigned int mbps; /* Bandwidth throttle cap (in MB/s) */
-   bool use_lightnvm; /* register as a LightNVM device */
bool blocking; /* blocking blk-mq device */
bool use_per_node_hctx; /* use per-node allocation for hardware context 
*/
bool power; /* power on/off the device */
@@ -121,7 +119,6 @@ struct nullb {
unsigned int index;
struct request_queue *q;
struct gendisk *disk;
-   struct nvm_dev *ndev;
struct blk_mq_tag_set *tag_set;
struct blk_mq_tag_set __tag_set;
unsigned int queue_depth;
@@ -139,7 +136,6 @@ static LIST_HEAD(nullb_list);
 static struct mutex lock;
 static int null_major;
 static DEFINE_IDA(nullb_indexes);
-static struct kmem_cache *ppa_cache;
 static struct blk_mq_tag_set tag_set;
 
 enum {
@@ -208,10 +204,6 @@ static int nr_devices = 1;
 module_param(nr_devices, int, S_IRUGO);
 MODULE_PARM_DESC(nr_devices, "Number of devices to register");
 
-static bool g_use_lightnvm;
-module_param_named(use_lightnvm, g_use_lightnvm, bool, S_IRUGO);
-MODULE_PARM_DESC(use_lightnvm, "Register as a LightNVM device");
-
 static bool g_blocking;
 module_param_named(blocking, g_blocking, bool, S_IRUGO);
 MODULE_PARM_DESC(blocking, "Register as a blocking blk-mq driver device");
@@ -345,7 +337,6 @@ NULLB_DEVICE_ATTR(blocksize, uint);
 NULLB_DEVICE_ATTR(irqmode, uint);
 NULLB_DEVICE_ATTR(hw_queue_depth, uint);
 NULLB_DEVICE_ATTR(index, uint);
-NULLB_DEVICE_ATTR(use_lightnvm, bool);
 NULLB_DEVICE_ATTR(blocking, bool);
 NULLB_DEVICE_ATTR(use_per_node_hctx, bool);
 NULLB_DEVICE_ATTR(memory_backed, bool);
@@ -455,7 +446,6 @@ static struct configfs_attribute *nullb_device_attrs[] = {
_device_attr_irqmode,
_device_attr_hw_queue_depth,
_device_attr_index,
-   _device_attr_use_lightnvm,
_device_attr_blocking,
_device_attr_use_per_node_hctx,
_device_attr_power,
@@ -574,7 +564,6 @@ static struct nullb_device *null_alloc_dev(void)
dev->blocksize = g_bs;
dev->irqmode = g_irqmode;
dev->hw_queue_depth = g_hw_queue_depth;
-   dev->use_lightnvm = g_use_lightnvm;
dev->blocking = g_blocking;
dev->use_per_node_hctx = g_use_per_node_hctx;
return dev;
@@ -1420,170 +1409,6 @@ static void cleanup_queues(struct nullb *nullb)
kfree(nullb->queues);
 }
 
-#ifdef CONFIG_NVM
-
-static void null_lnvm_end_io(struct request *rq, blk_status_t status)
-{
-   struct nvm_rq *rqd = rq->end_io_data;
-
-   /* XXX: lighnvm core seems to expect NVM_RSP_* values here.. */
-   rqd->error = status ? -EIO : 0;
-   nvm_end_io(rqd);
-
-   blk_put_request(rq);
-}
-
-static int null_lnvm_submit_io(struct nvm_dev *dev, struct nvm_rq *rqd)
-{
-   struct request_queue *q = dev->q;
-   struct request *rq;
-   struct bio *bio = rqd->bio;
-
-   rq = blk_mq_alloc_request(q,
-   op_is_write(bio_op(bio)) ? REQ_OP_DRV_OUT : REQ_OP_DRV_IN, 0);
-   if (IS_ERR(rq))
-   return -ENOMEM;
-
-   blk_init_request_from_bio(rq, bio);
-
-   rq->end_io_data = rqd;
-
-   blk_execute_rq_nowait(q, NULL, rq, 0, null_lnvm_end_io);
-
-   return 0;
-}
-
-static int null_lnvm_id(struct nvm_dev *dev, struct nvm_id *id)
-{
-   struct nullb *nullb = dev->q->queuedata;
-   sector_t size = (sector_t)nullb->dev->size * 1024 * 1024ULL;
-   sector_t blksize;
-   struct nvm_id_group *grp;
-
-   id->ver_id = 0x1;
-   id->vmnt = 0;
-   id->cap = 0x2;
-   id->dom = 0x1;
-
-   id->ppaf.blk_offset = 0;
-   id->ppaf.blk_len = 16;
-   id->ppaf.pg_offset = 16;
-   id->ppaf.pg_len = 16;
-   id->ppaf.sect_offset = 32;
-   id->ppaf.sect_len = 8;
-   id->ppaf.pln_offset = 40;
-   id->ppaf.pln_len = 8;
-   id->ppaf.lun_offset = 48;
-   id->ppaf.lun_len = 8;
-   id->ppaf.ch_offset = 56;
-   id->ppaf.ch_len = 8;
-
-   sector_div(size, nullb->dev->blocksize); /* convert size to pages */
-   size >>= 8; /* concert size to pgs pr blk */
-   grp = >grp;
-   grp->mtype = 0;
-   grp->fmtype = 0;
-   grp->num_ch = 1;
-   grp->num_pg = 256;
-   

[PATCH 00/25] Updates to lightnvm and pblk

2017-12-20 Thread Matias Bjørling
Hi,

A bunch of patches for the lightnvm subsystem and pblk.

The first part is preparation patches for the 2.0 revision of the
specification. This includes removing the null_blk lightnvm implementation
and killing the rrpc implementation, which used already-deprecated
definitions from the 1.2 revision. Then come a couple of general cleanup
patches, and finally a path to support both the 1.2 and 2.0 revisions
simultaneously.

The second part is fixes for the lightnvm subsystem and pblk, plus new
features: ioctl support for instantiating pblk with a specific
over-provisioning percentage, and iostat support.

-Matias

Hans Holmberg (5):
  lightnvm: pblk: refactor emeta consistency check
  lightnvm: pblk: rename sync_point to flush_point
  lightnvm: pblk: clear flush point on completed writes
  lightnvm: pblk: prevent premature sync point resets
  lightnvm: pblk: remove pblk_gc_stop

Javier González (13):
  lightnvm: remove unnecessary field from nvm_rq
  lightnvm: refactor target type lookup
  lightnvm: guarantee target unique name across devs.
  lightnvm: pblk: compress and reorder helper functions
  lightnvm: pblk: remove pblk_for_each_lun helper
  lightnvm: pblk: use exact free block counter in RL
  lightnvm: set target over-provision on create ioctl
  lightnvm: pblk: ignore high ecc errors on recovery
  lightnvm: pblk: do not log recovery read errors
  lightnvm: pblk: ensure kthread alloc. before kicking it
  lightnvm: pblk: free write buffer on init failure
  lightnvm: pblk: print instance name on instance info
  lightnvm: pblk: add iostat support

Matias Bjørling (7):
  null_blk: remove lightnvm support
  lightnvm: remove rrpc
  lightnvm: use internal pblk methods
  lightnvm: remove hybrid ocssd 1.2 support
  lightnvm: remove lower page tables
  lightnvm: make geometry structures 2.0 ready
  lightnvm: pblk: refactor pblk_ppa_comp function

 drivers/block/null_blk.c |  220 +-
 drivers/lightnvm/Kconfig |7 -
 drivers/lightnvm/Makefile|1 -
 drivers/lightnvm/core.c  |  460 ---
 drivers/lightnvm/pblk-cache.c|5 +
 drivers/lightnvm/pblk-core.c |   55 +-
 drivers/lightnvm/pblk-gc.c   |   23 +-
 drivers/lightnvm/pblk-init.c |  104 ++-
 drivers/lightnvm/pblk-map.c  |2 +-
 drivers/lightnvm/pblk-rb.c   |  111 ++-
 drivers/lightnvm/pblk-read.c |   35 +-
 drivers/lightnvm/pblk-recovery.c |   43 +-
 drivers/lightnvm/pblk-rl.c   |   54 +-
 drivers/lightnvm/pblk-sysfs.c|   15 +-
 drivers/lightnvm/pblk-write.c|   23 +-
 drivers/lightnvm/pblk.h  |  165 ++--
 drivers/lightnvm/rrpc.c  | 1625 --
 drivers/lightnvm/rrpc.h  |  290 ---
 drivers/nvme/host/lightnvm.c |  137 +---
 include/linux/lightnvm.h |  120 +--
 include/uapi/linux/lightnvm.h|9 +
 21 files changed, 586 insertions(+), 2918 deletions(-)
 delete mode 100644 drivers/lightnvm/rrpc.c
 delete mode 100644 drivers/lightnvm/rrpc.h

-- 
2.9.3



[PATCH 05/25] lightnvm: remove unnecessary field from nvm_rq

2017-12-20 Thread Matias Bjørling
From: Javier González 

Remove the wait field in nvm_rq. It is not used anymore, as targets rely
on the functionality provided by the LightNVM subsystem when sending
sync I/O.
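
For context, the synchronous path that targets rely on is the core's blocking
submit helper; a hedged sketch of a typical call site, assuming the
nvm_submit_io_sync() helper from this kernel's timeframe (the function name
and error handling here are illustrative):

static int example_read_sync(struct nvm_tgt_dev *dev, struct nvm_rq *rqd)
{
	int ret;

	/* blocks until the device completes the request */
	ret = nvm_submit_io_sync(dev, rqd);
	if (ret)
		return ret;

	/* rqd->error carries the device status; no private completion needed */
	return rqd->error ? -EIO : 0;
}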

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 include/linux/lightnvm.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index 13f0b70..900fe9f 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -234,7 +234,6 @@ struct nvm_rq {
void *meta_list;
dma_addr_t dma_meta_list;
 
-   struct completion *wait;
nvm_end_io_fn *end_io;
 
uint8_t opcode;
-- 
2.9.3



[PATCH 06/25] lightnvm: remove lower page tables

2017-12-20 Thread Matias Bjørling
The lower page table is unused. All page tables reported by 1.2
devices report a sequential 1:1 page mapping, and this is also not
used going forward with the 2.0 revision.

Signed-off-by: Matias Bjørling 
Reviewed-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/core.c  | 67 
 drivers/nvme/host/lightnvm.c | 14 -
 include/linux/lightnvm.h |  6 
 3 files changed, 87 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 390d5ef..52059dd 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -751,53 +751,6 @@ int nvm_get_tgt_bb_tbl(struct nvm_tgt_dev *tgt_dev, struct 
ppa_addr ppa,
 }
 EXPORT_SYMBOL(nvm_get_tgt_bb_tbl);
 
-static int nvm_init_slc_tbl(struct nvm_dev *dev, struct nvm_id_group *grp)
-{
-   struct nvm_geo *geo = >geo;
-   int i;
-
-   dev->lps_per_blk = geo->pgs_per_blk;
-   dev->lptbl = kcalloc(dev->lps_per_blk, sizeof(int), GFP_KERNEL);
-   if (!dev->lptbl)
-   return -ENOMEM;
-
-   /* Just a linear array */
-   for (i = 0; i < dev->lps_per_blk; i++)
-   dev->lptbl[i] = i;
-
-   return 0;
-}
-
-static int nvm_init_mlc_tbl(struct nvm_dev *dev, struct nvm_id_group *grp)
-{
-   int i, p;
-   struct nvm_id_lp_mlc *mlc = >lptbl.mlc;
-
-   if (!mlc->num_pairs)
-   return 0;
-
-   dev->lps_per_blk = mlc->num_pairs;
-   dev->lptbl = kcalloc(dev->lps_per_blk, sizeof(int), GFP_KERNEL);
-   if (!dev->lptbl)
-   return -ENOMEM;
-
-   /* The lower page table encoding consists of a list of bytes, where each
-* has a lower and an upper half. The first half byte maintains the
-* increment value and every value after is an offset added to the
-* previous incrementation value
-*/
-   dev->lptbl[0] = mlc->pairs[0] & 0xF;
-   for (i = 1; i < dev->lps_per_blk; i++) {
-   p = mlc->pairs[i >> 1];
-   if (i & 0x1) /* upper */
-   dev->lptbl[i] = dev->lptbl[i - 1] + ((p & 0xF0) >> 4);
-   else /* lower */
-   dev->lptbl[i] = dev->lptbl[i - 1] + (p & 0xF);
-   }
-
-   return 0;
-}
-
 static int nvm_core_init(struct nvm_dev *dev)
 {
struct nvm_id *id = >identity;
@@ -846,25 +799,6 @@ static int nvm_core_init(struct nvm_dev *dev)
if (!dev->lun_map)
return -ENOMEM;
 
-   switch (grp->fmtype) {
-   case NVM_ID_FMTYPE_SLC:
-   if (nvm_init_slc_tbl(dev, grp)) {
-   ret = -ENOMEM;
-   goto err_fmtype;
-   }
-   break;
-   case NVM_ID_FMTYPE_MLC:
-   if (nvm_init_mlc_tbl(dev, grp)) {
-   ret = -ENOMEM;
-   goto err_fmtype;
-   }
-   break;
-   default:
-   pr_err("nvm: flash type not supported\n");
-   ret = -EINVAL;
-   goto err_fmtype;
-   }
-
INIT_LIST_HEAD(>area_list);
INIT_LIST_HEAD(>targets);
mutex_init(>mlock);
@@ -890,7 +824,6 @@ static void nvm_free(struct nvm_dev *dev)
dev->ops->destroy_dma_pool(dev->dma_pool);
 
nvm_unregister_map(dev);
-   kfree(dev->lptbl);
kfree(dev->lun_map);
kfree(dev);
 }
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index da2d6fe..9d0bca1 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -279,20 +279,6 @@ static int init_grps(struct nvm_id *nvm_id, struct 
nvme_nvm_id *nvme_nvm_id)
 
dst->cpar = le16_to_cpu(src->cpar);
 
-   if (dst->fmtype == NVM_ID_FMTYPE_MLC) {
-   memcpy(dst->lptbl.id, src->lptbl.id, 8);
-   dst->lptbl.mlc.num_pairs =
-   le16_to_cpu(src->lptbl.mlc.num_pairs);
-
-   if (dst->lptbl.mlc.num_pairs > NVME_NVM_LP_MLC_PAIRS) {
-   pr_err("nvm: number of MLC pairs not supported\n");
-   return -EINVAL;
-   }
-
-   memcpy(dst->lptbl.mlc.pairs, src->lptbl.mlc.pairs,
-   dst->lptbl.mlc.num_pairs);
-   }
-
return 0;
 }
 
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index 900fe9f..312cfb1 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -175,8 +175,6 @@ struct nvm_id_group {
u32 mpos;
u32 mccap;
u16 cpar;
-
-   struct nvm_id_lp_tbl lptbl;
 };
 
 struct nvm_addr_format {
@@ -314,10 +312,6 @@ struct nvm_dev {
/* Device information */
struct nvm_geo geo;
 
- /* lower page table */
-   int lps_per_blk;
-   int *lptbl;
-
unsigned long total_secs;
 
unsigned long *lun_map;
-- 
2.9.3



Re: [PATCH] lightnvm: pblk: remove some unnecessary NULL checks

2017-12-20 Thread Matias Bjørling

On 11/06/2017 12:48 PM, Dan Carpenter wrote:

Smatch complains that flush_workqueue() dereferences the work queue
pointer but then we check if it's NULL on the next line when it's too
late.  These NULL checks can be removed because the module won't load if
we can't allocate the work queues.
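
A hedged illustration of why the checks are redundant: the GC init path bails
out if any workqueue allocation fails, so by the time pblk_gc_exit() runs the
pointers are known to be valid. The workqueue name, max_active value and error
label below are illustrative, not copied from pblk:

	gc->gc_reader_wq = alloc_workqueue("pblk-gc-rd-wq",
					   WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
	if (!gc->gc_reader_wq) {
		pr_err("pblk: could not allocate GC reader workqueue\n");
		ret = -ENOMEM;
		goto fail_free_reader_kthread;	/* target creation fails here */
	}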

Signed-off-by: Dan Carpenter 

diff --git a/drivers/lightnvm/pblk-gc.c b/drivers/lightnvm/pblk-gc.c
index 00d5698d64a9..aaba2049a135 100644
--- a/drivers/lightnvm/pblk-gc.c
+++ b/drivers/lightnvm/pblk-gc.c
@@ -669,12 +669,10 @@ void pblk_gc_exit(struct pblk *pblk)
 		kthread_stop(gc->gc_reader_ts);
 
 	flush_workqueue(gc->gc_reader_wq);
-	if (gc->gc_reader_wq)
-		destroy_workqueue(gc->gc_reader_wq);
+	destroy_workqueue(gc->gc_reader_wq);
 
 	flush_workqueue(gc->gc_line_reader_wq);
-	if (gc->gc_line_reader_wq)
-		destroy_workqueue(gc->gc_line_reader_wq);
+	destroy_workqueue(gc->gc_line_reader_wq);
 
 	if (gc->gc_writer_ts)
 		kthread_stop(gc->gc_writer_ts);



Thanks Dan. I've applied it for 4.16.


[PATCH 11/25] lightnvm: pblk: remove pblk_for_each_lun helper

2017-12-20 Thread Matias Bjørling
From: Javier González 

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk.h | 4 
 1 file changed, 4 deletions(-)

diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index b62790e..7b09fb2 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -51,10 +51,6 @@
 
 #define NR_PHY_IN_LOG (PBLK_EXPOSED_PAGE_SIZE / PBLK_SECTOR)
 
-#define pblk_for_each_lun(pblk, rlun, i) \
-   for ((i) = 0, rlun = &(pblk)->luns[0]; \
-   (i) < (pblk)->nr_luns; (i)++, rlun = &(pblk)->luns[(i)])
-
 /* Static pool sizes */
 #define PBLK_GEN_WS_POOL_SIZE (2)
 
-- 
2.9.3



[PATCH 03/25] lightnvm: use internal pblk methods

2017-12-20 Thread Matias Bjørling
Now that rrpc has been removed, the only users of the ppa helpers
is pblk. However, pblk already defines similar functions.

Switch pblk to use the internal ones, and remove the generic ppa
helpers.

Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-map.c   |  2 +-
 drivers/lightnvm/pblk-write.c |  4 ++--
 include/linux/lightnvm.h  | 19 ---
 3 files changed, 3 insertions(+), 22 deletions(-)

diff --git a/drivers/lightnvm/pblk-map.c b/drivers/lightnvm/pblk-map.c
index 6f3ecde..7445e64 100644
--- a/drivers/lightnvm/pblk-map.c
+++ b/drivers/lightnvm/pblk-map.c
@@ -146,7 +146,7 @@ void pblk_map_erase_rq(struct pblk *pblk, struct nvm_rq 
*rqd,
return;
 
/* Erase blocks that are bad in this line but might not be in next */
-   if (unlikely(ppa_empty(*erase_ppa)) &&
+   if (unlikely(pblk_ppa_empty(*erase_ppa)) &&
bitmap_weight(d_line->blk_bitmap, lm->blk_per_line)) {
int bit = -1;
 
diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c
index 6c1cafa..6c30b7a 100644
--- a/drivers/lightnvm/pblk-write.c
+++ b/drivers/lightnvm/pblk-write.c
@@ -439,7 +439,7 @@ static int pblk_submit_io_set(struct pblk *pblk, struct 
nvm_rq *rqd)
struct pblk_line *meta_line;
int err;
 
-   ppa_set_empty(_ppa);
+   pblk_ppa_set_empty(_ppa);
 
/* Assign lbas to ppas and populate request structure */
err = pblk_setup_w_rq(pblk, rqd, _ppa);
@@ -457,7 +457,7 @@ static int pblk_submit_io_set(struct pblk *pblk, struct 
nvm_rq *rqd)
return NVM_IO_ERR;
}
 
-   if (!ppa_empty(erase_ppa)) {
+   if (!pblk_ppa_empty(erase_ppa)) {
/* Submit erase for next data line */
if (pblk_blk_erase_async(pblk, erase_ppa)) {
struct pblk_line *e_line = pblk_line_get_erase(pblk);
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index b7f111f..c506e5d 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -417,25 +417,6 @@ static inline struct ppa_addr dev_to_generic_addr(struct 
nvm_tgt_dev *tgt_dev,
return l;
 }
 
-static inline int ppa_empty(struct ppa_addr ppa_addr)
-{
-   return (ppa_addr.ppa == ADDR_EMPTY);
-}
-
-static inline void ppa_set_empty(struct ppa_addr *ppa_addr)
-{
-   ppa_addr->ppa = ADDR_EMPTY;
-}
-
-static inline int ppa_cmp_blk(struct ppa_addr ppa1, struct ppa_addr ppa2)
-{
-   if (ppa_empty(ppa1) || ppa_empty(ppa2))
-   return 0;
-
-   return ((ppa1.g.ch == ppa2.g.ch) && (ppa1.g.lun == ppa2.g.lun) &&
-   (ppa1.g.blk == ppa2.g.blk));
-}
-
 typedef blk_qc_t (nvm_tgt_make_rq_fn)(struct request_queue *, struct bio *);
 typedef sector_t (nvm_tgt_capacity_fn)(void *);
 typedef void *(nvm_tgt_init_fn)(struct nvm_tgt_dev *, struct gendisk *,
-- 
2.9.3



[PATCH 07/25] lightnvm: make geometry structures 2.0 ready

2017-12-20 Thread Matias Bjørling
From: Matias Bjørling 

Prepare for the 2.0 revision by adapting the geometry
structures to coexist with the 1.2 revision.

Signed-off-by: Matias Bjørling 
Reviewed-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/core.c  | 89 +++-
 drivers/lightnvm/pblk-core.c |  6 +--
 drivers/lightnvm/pblk-init.c | 62 ++--
 drivers/lightnvm/pblk-recovery.c |  2 +-
 drivers/lightnvm/pblk-sysfs.c|  6 +--
 drivers/lightnvm/pblk.h  |  8 ++--
 drivers/nvme/host/lightnvm.c | 80 
 include/linux/lightnvm.h | 47 -
 8 files changed, 159 insertions(+), 141 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 52059dd..6d6d2c1 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -98,7 +98,7 @@ static void nvm_remove_tgt_dev(struct nvm_tgt_dev *tgt_dev, 
int clear)
if (clear) {
for (j = 0; j < ch_map->nr_luns; j++) {
int lun = j + lun_offs[j];
-   int lunid = (ch * dev->geo.luns_per_chnl) + lun;
+   int lunid = (ch * dev->geo.nr_luns) + lun;
 
WARN_ON(!test_and_clear_bit(lunid,
dev->lun_map));
@@ -124,10 +124,10 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct 
nvm_dev *dev,
struct ppa_addr *luns;
int nr_luns = lun_end - lun_begin + 1;
int luns_left = nr_luns;
-   int nr_chnls = nr_luns / dev->geo.luns_per_chnl;
-   int nr_chnls_mod = nr_luns % dev->geo.luns_per_chnl;
-   int bch = lun_begin / dev->geo.luns_per_chnl;
-   int blun = lun_begin % dev->geo.luns_per_chnl;
+   int nr_chnls = nr_luns / dev->geo.nr_luns;
+   int nr_chnls_mod = nr_luns % dev->geo.nr_luns;
+   int bch = lun_begin / dev->geo.nr_luns;
+   int blun = lun_begin % dev->geo.nr_luns;
int lunid = 0;
int lun_balanced = 1;
int prev_nr_luns;
@@ -148,15 +148,15 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct 
nvm_dev *dev,
if (!luns)
goto err_luns;
 
-   prev_nr_luns = (luns_left > dev->geo.luns_per_chnl) ?
-   dev->geo.luns_per_chnl : luns_left;
+   prev_nr_luns = (luns_left > dev->geo.nr_luns) ?
+   dev->geo.nr_luns : luns_left;
for (i = 0; i < nr_chnls; i++) {
struct nvm_ch_map *ch_rmap = _rmap->chnls[i + bch];
int *lun_roffs = ch_rmap->lun_offs;
struct nvm_ch_map *ch_map = _map->chnls[i];
int *lun_offs;
-   int luns_in_chnl = (luns_left > dev->geo.luns_per_chnl) ?
-   dev->geo.luns_per_chnl : luns_left;
+   int luns_in_chnl = (luns_left > dev->geo.nr_luns) ?
+   dev->geo.nr_luns : luns_left;
 
if (lun_balanced && prev_nr_luns != luns_in_chnl)
lun_balanced = 0;
@@ -193,8 +193,8 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct 
nvm_dev *dev,
memcpy(_dev->geo, >geo, sizeof(struct nvm_geo));
/* Target device only owns a portion of the physical device */
tgt_dev->geo.nr_chnls = nr_chnls;
-   tgt_dev->geo.nr_luns = nr_luns;
-   tgt_dev->geo.luns_per_chnl = (lun_balanced) ? prev_nr_luns : -1;
+   tgt_dev->geo.all_luns = nr_luns;
+   tgt_dev->geo.nr_luns = (lun_balanced) ? prev_nr_luns : -1;
tgt_dev->total_secs = nr_luns * tgt_dev->geo.sec_per_lun;
tgt_dev->q = dev->q;
tgt_dev->map = dev_map;
@@ -414,7 +414,7 @@ static int nvm_register_map(struct nvm_dev *dev)
for (i = 0; i < dev->geo.nr_chnls; i++) {
struct nvm_ch_map *ch_rmap;
int *lun_roffs;
-   int luns_in_chnl = dev->geo.luns_per_chnl;
+   int luns_in_chnl = dev->geo.nr_luns;
 
ch_rmap = >chnls[i];
 
@@ -717,10 +717,10 @@ int nvm_bb_tbl_fold(struct nvm_dev *dev, u8 *blks, int 
nr_blks)
struct nvm_geo *geo = >geo;
int blk, offset, pl, blktype;
 
-   if (nr_blks != geo->blks_per_lun * geo->plane_mode)
+   if (nr_blks != geo->nr_chks * geo->plane_mode)
return -EINVAL;
 
-   for (blk = 0; blk < geo->blks_per_lun; blk++) {
+   for (blk = 0; blk < geo->nr_chks; blk++) {
offset = blk * geo->plane_mode;
blktype = blks[offset];
 
@@ -736,7 +736,7 @@ int nvm_bb_tbl_fold(struct nvm_dev *dev, u8 *blks, int 
nr_blks)
blks[blk] = blktype;
}
 
-   return geo->blks_per_lun;
+   return geo->nr_chks;
 }
 EXPORT_SYMBOL(nvm_bb_tbl_fold);
 
@@ -758,43 +758,40 @@ static 

[PATCH 09/25] lightnvm: guarantee target unique name across devs.

2017-12-20 Thread Matias Bjørling
From: Javier González 

Until now, target unique naming is only guaranteed per device. This is
ok from a lightnvm perspective, but not from a sysfs one, since groups
will collide regardless of the underlying device.

Check that names are unique across all lightnvm-capable devices.

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/core.c | 33 +++--
 1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 5c2d0f3..d5f231c 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -56,6 +56,30 @@ static struct nvm_target *nvm_find_target(struct nvm_dev 
*dev, const char *name)
return NULL;
 }
 
+static bool nvm_target_exists(const char *name)
+{
+   struct nvm_dev *dev;
+   struct nvm_target *tgt;
+   bool ret = false;
+
+   down_write(_lock);
+   list_for_each_entry(dev, _devices, devices) {
+   mutex_lock(>mlock);
+   list_for_each_entry(tgt, >targets, list) {
+   if (!strcmp(name, tgt->disk->disk_name)) {
+   ret = true;
+   mutex_unlock(>mlock);
+   goto out;
+   }
+   }
+   mutex_unlock(>mlock);
+   }
+
+out:
+   up_write(_lock);
+   return ret;
+}
+
 static int nvm_reserve_luns(struct nvm_dev *dev, int lun_begin, int lun_end)
 {
int i;
@@ -259,14 +283,11 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct 
nvm_ioctl_create *create)
return -EINVAL;
}
 
-   mutex_lock(>mlock);
-   t = nvm_find_target(dev, create->tgtname);
-   if (t) {
-   pr_err("nvm: target name already exists.\n");
-   mutex_unlock(>mlock);
+   if (nvm_target_exists(create->tgtname)) {
+   pr_err("nvm: target name already exists (%s)\n",
+   create->tgtname);
return -EINVAL;
}
-   mutex_unlock(>mlock);
 
ret = nvm_reserve_luns(dev, s->lun_begin, s->lun_end);
if (ret)
-- 
2.9.3



[PATCH 10/25] lightnvm: pblk: compress and reorder helper functions

2017-12-20 Thread Matias Bjørling
From: Javier González 

Over time, we have accumulated some redundant helper functions.
Refactor them to eliminate redundant and unnecessary code, and reorder
them to improve readability.

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-core.c |  24 
 drivers/lightnvm/pblk-rb.c   |   2 +-
 drivers/lightnvm/pblk-read.c |   4 +-
 drivers/lightnvm/pblk-recovery.c |  20 +++
 drivers/lightnvm/pblk.h  | 121 +++
 5 files changed, 72 insertions(+), 99 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index dd041a9..9ebc60c 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -32,8 +32,8 @@ static void pblk_line_mark_bb(struct work_struct *work)
struct pblk_line *line;
int pos;
 
-   line = >lines[pblk_dev_ppa_to_line(*ppa)];
-   pos = pblk_dev_ppa_to_pos(>geo, *ppa);
+   line = >lines[pblk_ppa_to_line(*ppa)];
+   pos = pblk_ppa_to_pos(>geo, *ppa);
 
pr_err("pblk: failed to mark bb, line:%d, pos:%d\n",
line->id, pos);
@@ -48,7 +48,7 @@ static void pblk_mark_bb(struct pblk *pblk, struct pblk_line 
*line,
 {
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = >geo;
-   int pos = pblk_dev_ppa_to_pos(geo, *ppa);
+   int pos = pblk_ppa_to_pos(geo, *ppa);
 
pr_debug("pblk: erase failed: line:%d, pos:%d\n", line->id, pos);
atomic_long_inc(>erase_failed);
@@ -66,7 +66,7 @@ static void __pblk_end_io_erase(struct pblk *pblk, struct 
nvm_rq *rqd)
 {
struct pblk_line *line;
 
-   line = >lines[pblk_dev_ppa_to_line(rqd->ppa_addr)];
+   line = >lines[pblk_ppa_to_line(rqd->ppa_addr)];
atomic_dec(>left_seblks);
 
if (rqd->error) {
@@ -144,7 +144,7 @@ void pblk_map_invalidate(struct pblk *pblk, struct ppa_addr 
ppa)
BUG_ON(pblk_ppa_empty(ppa));
 #endif
 
-   line_id = pblk_tgt_ppa_to_line(ppa);
+   line_id = pblk_ppa_to_line(ppa);
line = >lines[line_id];
paddr = pblk_dev_ppa_to_line_addr(pblk, ppa);
 
@@ -650,7 +650,7 @@ static int pblk_line_submit_emeta_io(struct pblk *pblk, 
struct pblk_line *line,
} else {
for (i = 0; i < rqd.nr_ppas; ) {
struct ppa_addr ppa = addr_to_gen_ppa(pblk, paddr, id);
-   int pos = pblk_dev_ppa_to_pos(geo, ppa);
+   int pos = pblk_ppa_to_pos(geo, ppa);
int read_type = PBLK_READ_RANDOM;
 
if (pblk_io_aligned(pblk, rq_ppas))
@@ -668,7 +668,7 @@ static int pblk_line_submit_emeta_io(struct pblk *pblk, 
struct pblk_line *line,
}
 
ppa = addr_to_gen_ppa(pblk, paddr, id);
-   pos = pblk_dev_ppa_to_pos(geo, ppa);
+   pos = pblk_ppa_to_pos(geo, ppa);
}
 
if (pblk_boundary_paddr_checks(pblk, paddr + min)) {
@@ -854,8 +854,8 @@ static int pblk_blk_erase_sync(struct pblk *pblk, struct 
ppa_addr ppa)
struct nvm_geo *geo = >geo;
 
pr_err("pblk: could not sync erase line:%d,blk:%d\n",
-   pblk_dev_ppa_to_line(ppa),
-   pblk_dev_ppa_to_pos(geo, ppa));
+   pblk_ppa_to_line(ppa),
+   pblk_ppa_to_pos(geo, ppa));
 
rqd.error = ret;
goto out;
@@ -1561,8 +1561,8 @@ int pblk_blk_erase_async(struct pblk *pblk, struct 
ppa_addr ppa)
struct nvm_geo *geo = >geo;
 
pr_err("pblk: could not async erase line:%d,blk:%d\n",
-   pblk_dev_ppa_to_line(ppa),
-   pblk_dev_ppa_to_pos(geo, ppa));
+   pblk_ppa_to_line(ppa),
+   pblk_ppa_to_pos(geo, ppa));
}
 
return err;
@@ -1884,7 +1884,7 @@ void pblk_lookup_l2p_seq(struct pblk *pblk, struct 
ppa_addr *ppas,
 
/* If the L2P entry maps to a line, the reference is valid */
if (!pblk_ppa_empty(ppa) && !pblk_addr_in_cache(ppa)) {
-   int line_id = pblk_dev_ppa_to_line(ppa);
+   int line_id = pblk_ppa_to_line(ppa);
struct pblk_line *line = >lines[line_id];
 
kref_get(>ref);
diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index b8f78e4..62db408 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -226,7 +226,7 @@ static int __pblk_rb_update_l2p(struct pblk_rb *rb, 
unsigned int 

[PATCH 08/25] lightnvm: refactor target type lookup

2017-12-20 Thread Matias Bjørling
From: Javier González 

Refactor the target type lookup so that locking is explicit at the call
sites, instead of using a hidden parameter to control whether the function
takes the lock.

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/core.c | 30 +-
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 6d6d2c1..5c2d0f3 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -220,21 +220,25 @@ static const struct block_device_operations nvm_fops = {
.owner  = THIS_MODULE,
 };
 
-static struct nvm_tgt_type *nvm_find_target_type(const char *name, int lock)
+static struct nvm_tgt_type *__nvm_find_target_type(const char *name)
 {
-   struct nvm_tgt_type *tmp, *tt = NULL;
+   struct nvm_tgt_type *tt;
 
-   if (lock)
-   down_write(_tgtt_lock);
+   list_for_each_entry(tt, _tgt_types, list)
+   if (!strcmp(name, tt->name))
+   return tt;
 
-   list_for_each_entry(tmp, _tgt_types, list)
-   if (!strcmp(name, tmp->name)) {
-   tt = tmp;
-   break;
-   }
+   return NULL;
+}
+
+static struct nvm_tgt_type *nvm_find_target_type(const char *name)
+{
+   struct nvm_tgt_type *tt;
+
+   down_write(_tgtt_lock);
+   tt = __nvm_find_target_type(name);
+   up_write(_tgtt_lock);
 
-   if (lock)
-   up_write(_tgtt_lock);
return tt;
 }
 
@@ -249,7 +253,7 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct 
nvm_ioctl_create *create)
void *targetdata;
int ret;
 
-   tt = nvm_find_target_type(create->tgttype, 1);
+   tt = nvm_find_target_type(create->tgttype);
if (!tt) {
pr_err("nvm: target type %s not found\n", create->tgttype);
return -EINVAL;
@@ -523,7 +527,7 @@ int nvm_register_tgt_type(struct nvm_tgt_type *tt)
int ret = 0;
 
down_write(_tgtt_lock);
-   if (nvm_find_target_type(tt->name, 0))
+   if (__nvm_find_target_type(tt->name))
ret = -EEXIST;
else
list_add(>list, _tgt_types);
-- 
2.9.3



[PATCH 15/25] lightnvm: pblk: prevent premature sync point resets

2017-12-20 Thread Matias Bjørling
From: Hans Holmberg 

Unless we protect flush pointer updates with a lock, we risk
resetting new flush points before we've synced all sectors
up to that point.

This patch protects new flush points with the same spin lock
that is being held when advancing the sync pointer and
resetting completed flush points.

Signed-off-by: Hans Holmberg 
Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-rb.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index 9a4100c..3d33d77 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -367,17 +367,17 @@ static int pblk_rb_flush_point_set(struct pblk_rb *rb, 
struct bio *bio,
flush_point = (pos == 0) ? (rb->nr_entries - 1) : (pos - 1);
entry = >entries[flush_point];
 
+   pblk_rb_sync_init(rb, NULL);
+
/* Protect flush points */
smp_store_release(>flush_point, flush_point);
 
-   if (!bio)
-   return 0;
+   if (bio)
+   bio_list_add(>w_ctx.bios, bio);
 
-   spin_lock_irq(>s_lock);
-   bio_list_add(>w_ctx.bios, bio);
-   spin_unlock_irq(>s_lock);
+   pblk_rb_sync_end(rb, NULL);
 
-   return 1;
+   return bio ? 1 : 0;
 }
 
 static int __pblk_rb_may_write(struct pblk_rb *rb, unsigned int nr_entries,
-- 
2.9.3



[PATCH 18/25] lightnvm: set target over-provision on create ioctl

2017-12-20 Thread Matias Bjørling
From: Javier González 

Allow setting the over-provision percentage on target creation. If the
value is not provided, fall back to the default value set by the target.

In pblk, set the default OP to 11% of the total size of the device.
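
A worked sketch of the resulting split, mirroring the pblk_set_provision()
change later in this series (numbers illustrative):

	/* e.g. nr_free_blks = 1000, op = 11 (percent) */
	provisioned   = nr_free_blks * (100 - op) / 100;  /* 890 blocks exposed for user data */
	pblk->op_blks = nr_free_blks - provisioned;       /* 110 blocks reserved for GC       */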

Signed-off-by: Javier González 
Signed-off-by: Hans Holmberg 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/core.c   | 106 +-
 drivers/lightnvm/pblk-init.c  |   5 +-
 drivers/lightnvm/pblk.h   |   2 +
 include/linux/lightnvm.h  |   6 +++
 include/uapi/linux/lightnvm.h |   9 
 5 files changed, 104 insertions(+), 24 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index d5f231c..dcc9e62 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -140,7 +140,8 @@ static void nvm_remove_tgt_dev(struct nvm_tgt_dev *tgt_dev, 
int clear)
 }
 
 static struct nvm_tgt_dev *nvm_create_tgt_dev(struct nvm_dev *dev,
- int lun_begin, int lun_end)
+ u16 lun_begin, u16 lun_end,
+ u16 op)
 {
struct nvm_tgt_dev *tgt_dev = NULL;
struct nvm_dev_map *dev_rmap = dev->rmap;
@@ -219,6 +220,7 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct 
nvm_dev *dev,
tgt_dev->geo.nr_chnls = nr_chnls;
tgt_dev->geo.all_luns = nr_luns;
tgt_dev->geo.nr_luns = (lun_balanced) ? prev_nr_luns : -1;
+   tgt_dev->geo.op = op;
tgt_dev->total_secs = nr_luns * tgt_dev->geo.sec_per_lun;
tgt_dev->q = dev->q;
tgt_dev->map = dev_map;
@@ -266,9 +268,57 @@ static struct nvm_tgt_type *nvm_find_target_type(const 
char *name)
return tt;
 }
 
+static int nvm_config_check_luns(struct nvm_geo *geo, int lun_begin,
+int lun_end)
+{
+   if (lun_begin > lun_end || lun_end >= geo->all_luns) {
+   pr_err("nvm: lun out of bound (%u:%u > %u)\n",
+   lun_begin, lun_end, geo->all_luns - 1);
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+static int __nvm_config_simple(struct nvm_dev *dev,
+  struct nvm_ioctl_create_simple *s)
+{
+   struct nvm_geo *geo = >geo;
+
+   if (s->lun_begin == -1 && s->lun_end == -1) {
+   s->lun_begin = 0;
+   s->lun_end = geo->all_luns - 1;
+   }
+
+   return nvm_config_check_luns(geo, s->lun_begin, s->lun_end);
+}
+
+static int __nvm_config_extended(struct nvm_dev *dev,
+struct nvm_ioctl_create_extended *e)
+{
+   struct nvm_geo *geo = >geo;
+
+   if (e->lun_begin == 0xFFFF && e->lun_end == 0xFFFF) {
+   e->lun_begin = 0;
+   e->lun_end = dev->geo.all_luns - 1;
+   }
+
+   /* op not set falls into target's default */
+   if (e->op == 0xFFFF)
+   e->op = NVM_TARGET_DEFAULT_OP;
+
+   if (e->op < NVM_TARGET_MIN_OP ||
+   e->op > NVM_TARGET_MAX_OP) {
+   pr_err("nvm: invalid over provisioning value\n");
+   return -EINVAL;
+   }
+
+   return nvm_config_check_luns(geo, e->lun_begin, e->lun_end);
+}
+
 static int nvm_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
 {
-   struct nvm_ioctl_create_simple *s = >conf.s;
+   struct nvm_ioctl_create_extended e;
struct request_queue *tqueue;
struct gendisk *tdisk;
struct nvm_tgt_type *tt;
@@ -277,6 +327,28 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct 
nvm_ioctl_create *create)
void *targetdata;
int ret;
 
+   switch (create->conf.type) {
+   case NVM_CONFIG_TYPE_SIMPLE:
+   ret = __nvm_config_simple(dev, >conf.s);
+   if (ret)
+   return ret;
+
+   e.lun_begin = create->conf.s.lun_begin;
+   e.lun_end = create->conf.s.lun_end;
+   e.op = NVM_TARGET_DEFAULT_OP;
+   break;
+   case NVM_CONFIG_TYPE_EXTENDED:
+   ret = __nvm_config_extended(dev, >conf.e);
+   if (ret)
+   return ret;
+
+   e = create->conf.e;
+   break;
+   default:
+   pr_err("nvm: config type not valid\n");
+   return -EINVAL;
+   }
+
tt = nvm_find_target_type(create->tgttype);
if (!tt) {
pr_err("nvm: target type %s not found\n", create->tgttype);
@@ -289,7 +361,7 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct 
nvm_ioctl_create *create)
return -EINVAL;
}
 
-   ret = nvm_reserve_luns(dev, s->lun_begin, s->lun_end);
+   ret = nvm_reserve_luns(dev, e.lun_begin, e.lun_end);
if (ret)
return ret;
 
@@ -299,7 +371,7 @@ static int nvm_create_tgt(struct nvm_dev 

[PATCH 12/25] lightnvm: pblk: refactor emeta consistency check

2017-12-20 Thread Matias Bjørling
From: Hans Holmberg 

Currently pblk_recov_get_lba_list does two separate things:
it checks the consistency of the emeta and extracts the lba list.

This patch separates the consistency check to make the code easier
to read and to prepare for version checks of the line emeta
persistent data format version.

Signed-off-by: Hans Holmberg 
Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-gc.c   |  9 -
 drivers/lightnvm/pblk-recovery.c | 15 ++-
 drivers/lightnvm/pblk.h  |  2 +-
 3 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/drivers/lightnvm/pblk-gc.c b/drivers/lightnvm/pblk-gc.c
index 00d5698..9e3f22b 100644
--- a/drivers/lightnvm/pblk-gc.c
+++ b/drivers/lightnvm/pblk-gc.c
@@ -169,7 +169,14 @@ static void pblk_gc_line_prepare_ws(struct work_struct 
*work)
 * the line untouched. TODO: Implement a recovery routine that scans and
 * moves all sectors on the line.
 */
-   lba_list = pblk_recov_get_lba_list(pblk, emeta_buf);
+
+   ret = pblk_recov_check_emeta(pblk, emeta_buf);
+   if (ret) {
+   pr_err("pblk: inconsistent emeta (line %d)\n", line->id);
+   goto fail_free_emeta;
+   }
+
+   lba_list = emeta_to_lbas(pblk, emeta_buf);
if (!lba_list) {
pr_err("pblk: could not interpret emeta (line %d)\n", line->id);
goto fail_free_emeta;
diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 1b272ae..39a2e19 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -111,18 +111,18 @@ int pblk_recov_setup_rq(struct pblk *pblk, struct 
pblk_c_ctx *c_ctx,
return 0;
 }
 
-__le64 *pblk_recov_get_lba_list(struct pblk *pblk, struct line_emeta 
*emeta_buf)
+int pblk_recov_check_emeta(struct pblk *pblk, struct line_emeta *emeta_buf)
 {
u32 crc;
 
crc = pblk_calc_emeta_crc(pblk, emeta_buf);
if (le32_to_cpu(emeta_buf->crc) != crc)
-   return NULL;
+   return 1;
 
if (le32_to_cpu(emeta_buf->header.identifier) != PBLK_MAGIC)
-   return NULL;
+   return 1;
 
-   return emeta_to_lbas(pblk, emeta_buf);
+   return 0;
 }
 
 static int pblk_recov_l2p_from_emeta(struct pblk *pblk, struct pblk_line *line)
@@ -137,7 +137,7 @@ static int pblk_recov_l2p_from_emeta(struct pblk *pblk, 
struct pblk_line *line)
u64 nr_valid_lbas, nr_lbas = 0;
u64 i;
 
-   lba_list = pblk_recov_get_lba_list(pblk, emeta_buf);
+   lba_list = emeta_to_lbas(pblk, emeta_buf);
if (!lba_list)
return 1;
 
@@ -938,6 +938,11 @@ struct pblk_line *pblk_recov_l2p(struct pblk *pblk)
goto next;
}
 
+   if (pblk_recov_check_emeta(pblk, line->emeta->buf)) {
+   pblk_recov_l2p_from_oob(pblk, line);
+   goto next;
+   }
+
if (pblk_recov_l2p_from_emeta(pblk, line))
pblk_recov_l2p_from_oob(pblk, line);
 
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 7b09fb2..f324a7e 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -808,7 +808,7 @@ int pblk_submit_read_gc(struct pblk *pblk, struct 
pblk_gc_rq *gc_rq);
 void pblk_submit_rec(struct work_struct *work);
 struct pblk_line *pblk_recov_l2p(struct pblk *pblk);
 int pblk_recov_pad(struct pblk *pblk);
-__le64 *pblk_recov_get_lba_list(struct pblk *pblk, struct line_emeta *emeta);
+int pblk_recov_check_emeta(struct pblk *pblk, struct line_emeta *emeta);
 int pblk_recov_setup_rq(struct pblk *pblk, struct pblk_c_ctx *c_ctx,
struct pblk_rec_ctx *recovery, u64 *comp_bits,
unsigned int comp);
-- 
2.9.3



[PATCH 16/25] lightnvm: pblk: remove pblk_gc_stop

2017-12-20 Thread Matias Bjørling
From: Hans Holmberg 

pblk_gc_stop just sets pblk->gc->gc_active to zero, ignoring
the flush parameter. This is plain confusing, so remove the
function and set the gc active flag at the call points instead.

Signed-off-by: Hans Holmberg 
Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-gc.c | 14 ++
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/drivers/lightnvm/pblk-gc.c b/drivers/lightnvm/pblk-gc.c
index 9e3f22b..8f71898 100644
--- a/drivers/lightnvm/pblk-gc.c
+++ b/drivers/lightnvm/pblk-gc.c
@@ -526,22 +526,12 @@ void pblk_gc_should_start(struct pblk *pblk)
}
 }
 
-/*
- * If flush_wq == 1 then no lock should be held by the caller since
- * flush_workqueue can sleep
- */
-static void pblk_gc_stop(struct pblk *pblk, int flush_wq)
-{
-   pblk->gc.gc_active = 0;
-   pr_debug("pblk: gc stop\n");
-}
-
 void pblk_gc_should_stop(struct pblk *pblk)
 {
struct pblk_gc *gc = >gc;
 
if (gc->gc_active && !gc->gc_forced)
-   pblk_gc_stop(pblk, 0);
+   gc->gc_active = 0;
 }
 
 void pblk_gc_should_kick(struct pblk *pblk)
@@ -667,7 +657,7 @@ void pblk_gc_exit(struct pblk *pblk)
 
gc->gc_enabled = 0;
del_timer_sync(>gc_timer);
-   pblk_gc_stop(pblk, 1);
+   gc->gc_active = 0;
 
if (gc->gc_ts)
kthread_stop(gc->gc_ts);
-- 
2.9.3



[PATCH 17/25] lightnvm: pblk: use exact free block counter in RL

2017-12-20 Thread Matias Bjørling
From: Javier González 

Until now, pblk's rate-limiter has used a heuristic to reserve space for
GC I/O given that the over-provision area was fixed.

In preparation for allowing the over-provision area to be defined on
target creation, add a dedicated free-block counter in the rate-limiter to
track the number of blocks being used for user data.

Signed-off-by: Javier González 
Signed-off-by: Hans Holmberg 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-core.c | 19 +-
 drivers/lightnvm/pblk-init.c | 18 +++---
 drivers/lightnvm/pblk-recovery.c |  4 +--
 drivers/lightnvm/pblk-rl.c   | 54 +++-
 drivers/lightnvm/pblk-sysfs.c|  9 ---
 drivers/lightnvm/pblk.h  | 15 ++-
 6 files changed, 73 insertions(+), 46 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 9ebc60c..22cdafd 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1145,7 +1145,7 @@ int pblk_line_recov_alloc(struct pblk *pblk, struct 
pblk_line *line)
}
spin_unlock(_mg->free_lock);
 
-   pblk_rl_free_lines_dec(>rl, line);
+   pblk_rl_free_lines_dec(>rl, line, true);
 
if (!pblk_line_init_bb(pblk, line, 0)) {
list_add(>list, _mg->free_list);
@@ -1233,7 +1233,7 @@ static struct pblk_line *pblk_line_retry(struct pblk 
*pblk,
l_mg->data_line = retry_line;
spin_unlock(_mg->free_lock);
 
-   pblk_rl_free_lines_dec(>rl, retry_line);
+   pblk_rl_free_lines_dec(>rl, line, false);
 
if (pblk_line_erase(pblk, retry_line))
goto retry;
@@ -1252,7 +1252,6 @@ struct pblk_line *pblk_line_get_first_data(struct pblk 
*pblk)
 {
struct pblk_line_mgmt *l_mg = >l_mg;
struct pblk_line *line;
-   int is_next = 0;
 
spin_lock(_mg->free_lock);
line = pblk_line_get(pblk);
@@ -1280,7 +1279,6 @@ struct pblk_line *pblk_line_get_first_data(struct pblk 
*pblk)
} else {
l_mg->data_next->seq_nr = l_mg->d_seq_nr++;
l_mg->data_next->type = PBLK_LINETYPE_DATA;
-   is_next = 1;
}
spin_unlock(_mg->free_lock);
 
@@ -1290,10 +1288,6 @@ struct pblk_line *pblk_line_get_first_data(struct pblk 
*pblk)
return NULL;
}
 
-   pblk_rl_free_lines_dec(>rl, line);
-   if (is_next)
-   pblk_rl_free_lines_dec(>rl, l_mg->data_next);
-
 retry_setup:
if (!pblk_line_init_metadata(pblk, line, NULL)) {
line = pblk_line_retry(pblk, line);
@@ -1311,6 +1305,8 @@ struct pblk_line *pblk_line_get_first_data(struct pblk 
*pblk)
goto retry_setup;
}
 
+   pblk_rl_free_lines_dec(>rl, line, true);
+
return line;
 }
 
@@ -1395,7 +1391,6 @@ struct pblk_line *pblk_line_replace_data(struct pblk 
*pblk)
struct pblk_line_mgmt *l_mg = >l_mg;
struct pblk_line *cur, *new = NULL;
unsigned int left_seblks;
-   int is_next = 0;
 
cur = l_mg->data_line;
new = l_mg->data_next;
@@ -1444,6 +1439,8 @@ struct pblk_line *pblk_line_replace_data(struct pblk 
*pblk)
goto retry_setup;
}
 
+   pblk_rl_free_lines_dec(>rl, new, true);
+
/* Allocate next line for preparation */
spin_lock(_mg->free_lock);
l_mg->data_next = pblk_line_get(pblk);
@@ -1457,13 +1454,9 @@ struct pblk_line *pblk_line_replace_data(struct pblk 
*pblk)
} else {
l_mg->data_next->seq_nr = l_mg->d_seq_nr++;
l_mg->data_next->type = PBLK_LINETYPE_DATA;
-   is_next = 1;
}
spin_unlock(_mg->free_lock);
 
-   if (is_next)
-   pblk_rl_free_lines_dec(>rl, l_mg->data_next);
-
 out:
return new;
 }
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 2af1b0a..e939f0a 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -579,22 +579,34 @@ static unsigned int calc_emeta_len(struct pblk *pblk)
 static void pblk_set_provision(struct pblk *pblk, long nr_free_blks)
 {
struct nvm_tgt_dev *dev = pblk->dev;
+   struct pblk_line_mgmt *l_mg = >l_mg;
+   struct pblk_line_meta *lm = >lm;
struct nvm_geo *geo = >geo;
sector_t provisioned;
+   int sec_meta, blk_meta;
 
-   pblk->over_pct = 20;
+   pblk->op = 20;
 
provisioned = nr_free_blks;
-   provisioned *= (100 - pblk->over_pct);
+   provisioned *= (100 - pblk->op);
sector_div(provisioned, 100);
 
+   pblk->op_blks = nr_free_blks - provisioned;
+
/* Internally pblk manages all free blocks, but all calculations based
 * on user capacity consider only provisioned blocks
 */
pblk->rl.total_blocks = nr_free_blks;
pblk->rl.nr_secs = nr_free_blks * 

[PATCH 25/25] lightnvm: pblk: refactor pblk_ppa_comp function

2017-12-20 Thread Matias Bjørling
Shorten function to simply return the value of the if statement.

Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk.h | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 1e9eafd..a1434da 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -1047,10 +1047,7 @@ static inline void pblk_ppa_set_empty(struct ppa_addr 
*ppa_addr)
 
 static inline bool pblk_ppa_comp(struct ppa_addr lppa, struct ppa_addr rppa)
 {
-   if (lppa.ppa == rppa.ppa)
-   return true;
-
-   return false;
+   return (lppa.ppa == rppa.ppa);
 }
 
 static inline int pblk_addr_in_cache(struct ppa_addr ppa)
-- 
2.9.3



[PATCH 24/25] lightnvm: pblk: add iostat support

2017-12-20 Thread Matias Bjørling
From: Javier González 

Since pblk registers its own block device, the iostat accounting is
not automatically done for us. Therefore, add the necessary
accounting logic to satisfy the iostat interface.
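
The pattern used below is the generic partition-accounting pair; a minimal
sketch of how a bio-based driver hooks into it, assuming the 4.15-era
generic_start_io_acct()/generic_end_io_acct() signatures ('disk' stands for
the target's struct gendisk):

	unsigned long start_time = jiffies;

	generic_start_io_acct(q, READ, bio_sectors(bio), &disk->part0);
	/* ... submit the bio and wait for the read to complete ... */
	generic_end_io_acct(q, READ, &disk->part0, start_time);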

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-cache.c |  5 +
 drivers/lightnvm/pblk-read.c  | 31 +++
 drivers/lightnvm/pblk.h   |  1 +
 3 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/drivers/lightnvm/pblk-cache.c b/drivers/lightnvm/pblk-cache.c
index 0d227ef..000fcad 100644
--- a/drivers/lightnvm/pblk-cache.c
+++ b/drivers/lightnvm/pblk-cache.c
@@ -19,12 +19,16 @@
 
 int pblk_write_to_cache(struct pblk *pblk, struct bio *bio, unsigned long 
flags)
 {
+   struct request_queue *q = pblk->dev->q;
struct pblk_w_ctx w_ctx;
sector_t lba = pblk_get_lba(bio);
+   unsigned long start_time = jiffies;
unsigned int bpos, pos;
int nr_entries = pblk_get_secs(bio);
int i, ret;
 
+   generic_start_io_acct(q, WRITE, bio_sectors(bio), >disk->part0);
+
/* Update the write buffer head (mem) with the entries that we can
 * write. The write in itself cannot fail, so there is no need to
 * rollback from here on.
@@ -67,6 +71,7 @@ int pblk_write_to_cache(struct pblk *pblk, struct bio *bio, 
unsigned long flags)
pblk_rl_inserted(>rl, nr_entries);
 
 out:
+   generic_end_io_acct(q, WRITE, >disk->part0, start_time);
pblk_write_should_kick(pblk);
return ret;
 }
diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
index 0fe0c04..2f76128 100644
--- a/drivers/lightnvm/pblk-read.c
+++ b/drivers/lightnvm/pblk-read.c
@@ -158,8 +158,12 @@ static void pblk_end_user_read(struct bio *bio)
 static void __pblk_end_io_read(struct pblk *pblk, struct nvm_rq *rqd,
   bool put_line)
 {
+   struct nvm_tgt_dev *dev = pblk->dev;
struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
struct bio *bio = rqd->bio;
+   unsigned long start_time = r_ctx->start_time;
+
+   generic_end_io_acct(dev->q, READ, >disk->part0, start_time);
 
if (rqd->error)
pblk_log_read_err(pblk, rqd);
@@ -193,9 +197,9 @@ static void pblk_end_io_read(struct nvm_rq *rqd)
__pblk_end_io_read(pblk, rqd, true);
 }
 
-static int pblk_fill_partial_read_bio(struct pblk *pblk, struct nvm_rq *rqd,
- unsigned int bio_init_idx,
- unsigned long *read_bitmap)
+static int pblk_partial_read_bio(struct pblk *pblk, struct nvm_rq *rqd,
+unsigned int bio_init_idx,
+unsigned long *read_bitmap)
 {
struct bio *new_bio, *bio = rqd->bio;
struct pblk_sec_meta *meta_list = rqd->meta_list;
@@ -306,6 +310,8 @@ static int pblk_fill_partial_read_bio(struct pblk *pblk, 
struct nvm_rq *rqd,
return NVM_IO_OK;
 
 err:
+   pr_err("pblk: failed to perform partial read\n");
+
/* Free allocated pages in new bio */
pblk_bio_free_pages(pblk, bio, 0, new_bio->bi_vcnt);
__pblk_end_io_read(pblk, rqd, false);
@@ -357,6 +363,7 @@ static void pblk_read_rq(struct pblk *pblk, struct nvm_rq 
*rqd,
 int pblk_submit_read(struct pblk *pblk, struct bio *bio)
 {
struct nvm_tgt_dev *dev = pblk->dev;
+   struct request_queue *q = dev->q;
sector_t blba = pblk_get_lba(bio);
unsigned int nr_secs = pblk_get_secs(bio);
struct pblk_g_ctx *r_ctx;
@@ -372,6 +379,8 @@ int pblk_submit_read(struct pblk *pblk, struct bio *bio)
return NVM_IO_ERR;
}
 
+   generic_start_io_acct(q, READ, bio_sectors(bio), >disk->part0);
+
bitmap_zero(_bitmap, nr_secs);
 
rqd = pblk_alloc_rqd(pblk, PBLK_READ);
@@ -383,6 +392,7 @@ int pblk_submit_read(struct pblk *pblk, struct bio *bio)
rqd->end_io = pblk_end_io_read;
 
r_ctx = nvm_rq_to_pdu(rqd);
+   r_ctx->start_time = jiffies;
r_ctx->lba = blba;
 
/* Save the index for this bio's start. This is needed in case
@@ -422,7 +432,7 @@ int pblk_submit_read(struct pblk *pblk, struct bio *bio)
int_bio = bio_clone_fast(bio, GFP_KERNEL, pblk_bio_set);
if (!int_bio) {
pr_err("pblk: could not clone read bio\n");
-   return NVM_IO_ERR;
+   goto fail_end_io;
}
 
rqd->bio = int_bio;
@@ -433,7 +443,7 @@ int pblk_submit_read(struct pblk *pblk, struct bio *bio)
pr_err("pblk: read IO submission failed\n");
if (int_bio)
bio_put(int_bio);
-   return ret;
+   goto fail_end_io;
}
 
return NVM_IO_OK;
@@ -442,17 

[PATCH 20/25] lightnvm: pblk: do not log recovery read errors

2017-12-20 Thread Matias Bjørling
From: Javier González 

On scan recovery, reads can fail. This happens because the first page
of each line is read in order to determine whether the line has been used
(and thus needs to be recovered) or not. This can lead to "empty page"
read errors.

Since these errors are normal, do not log them, as they are confusing
when reviewing the logs.

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-core.c | 6 +++---
 drivers/lightnvm/pblk.h  | 1 +
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 22cdafd..5692d33 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -742,7 +742,7 @@ static int pblk_line_submit_smeta_io(struct pblk *pblk, 
struct pblk_line *line,
cmd_op = NVM_OP_PWRITE;
flags = pblk_set_progr_mode(pblk, PBLK_WRITE);
lba_list = emeta_to_lbas(pblk, line->emeta->buf);
-   } else if (dir == PBLK_READ) {
+   } else if (dir == PBLK_READ_RECOV || dir == PBLK_READ) {
bio_op = REQ_OP_READ;
cmd_op = NVM_OP_PREAD;
flags = pblk_set_read_mode(pblk, PBLK_READ_SEQUENTIAL);
@@ -802,7 +802,7 @@ static int pblk_line_submit_smeta_io(struct pblk *pblk, 
struct pblk_line *line,
if (rqd.error) {
if (dir == PBLK_WRITE)
pblk_log_write_err(pblk, );
-   else
+   else if (dir == PBLK_READ)
pblk_log_read_err(pblk, );
}
 
@@ -816,7 +816,7 @@ int pblk_line_read_smeta(struct pblk *pblk, struct 
pblk_line *line)
 {
u64 bpaddr = pblk_line_smeta_start(pblk, line);
 
-   return pblk_line_submit_smeta_io(pblk, line, bpaddr, PBLK_READ);
+   return pblk_line_submit_smeta_io(pblk, line, bpaddr, PBLK_READ_RECOV);
 }
 
 int pblk_line_read_emeta(struct pblk *pblk, struct pblk_line *line,
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 430d076..cb38990 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -60,6 +60,7 @@ enum {
PBLK_READ   = READ,
PBLK_WRITE  = WRITE,/* Write from write buffer */
PBLK_WRITE_INT, /* Internal write - no write buffer */
+   PBLK_READ_RECOV,/* Recovery read - errors allowed */
PBLK_ERASE,
 };
 
-- 
2.9.3



[PATCH 21/25] lightnvm: pblk: ensure kthread alloc. before kicking it

2017-12-20 Thread Matias Bjørling
From: Javier González 

When creating the write thread, ensure that the kthread has been created
before initializing the timer responsible for kicking it. Otherwise, if
the kthread creation fails or gets killed from user space, we risk
kicking an empty thread structure.

Also, since the kthread creation can be interrupted from user space,
adapt the error path to not report an error when this happens, since it
is intentional that the instance creation is aborted.

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-init.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 30ea592..3150b8f 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -883,15 +883,19 @@ static int pblk_lines_init(struct pblk *pblk)
 
 static int pblk_writer_init(struct pblk *pblk)
 {
-   setup_timer(>wtimer, pblk_write_timer_fn, (unsigned long)pblk);
-   mod_timer(>wtimer, jiffies + msecs_to_jiffies(100));
-
pblk->writer_ts = kthread_create(pblk_write_ts, pblk, "pblk-writer-t");
if (IS_ERR(pblk->writer_ts)) {
-   pr_err("pblk: could not allocate writer kthread\n");
-   return PTR_ERR(pblk->writer_ts);
+   int err = PTR_ERR(pblk->writer_ts);
+
+   if (err != -EINTR)
+   pr_err("pblk: could not allocate writer kthread (%d)\n",
+   err);
+   return err;
}
 
+   setup_timer(>wtimer, pblk_write_timer_fn, (unsigned long)pblk);
+   mod_timer(>wtimer, jiffies + msecs_to_jiffies(100));
+
return 0;
 }
 
@@ -1042,7 +1046,8 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct 
gendisk *tdisk,
 
ret = pblk_writer_init(pblk);
if (ret) {
-   pr_err("pblk: could not initialize write thread\n");
+   if (ret != -EINTR)
+   pr_err("pblk: could not initialize write thread\n");
goto fail_free_lines;
}
 
-- 
2.9.3



[PATCH 19/25] lightnvm: pblk: ignore high ecc errors on recovery

2017-12-20 Thread Matias Bjørling
From: Javier González 

On recovery, do not stop L2P recovery if reads report a high ECC error,
since the data is still available.

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-recovery.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index fd38036..1d5e961 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -288,7 +288,7 @@ static int pblk_recov_read_oob(struct pblk *pblk, struct 
pblk_line *line,
/* At this point, the read should not fail. If it does, it is a problem
 * we cannot recover from here. Need FTL log.
 */
-   if (rqd->error) {
+   if (rqd->error && rqd->error != NVM_RSP_WARN_HIGHECC) {
pr_err("pblk: L2P recovery failed (%d)\n", rqd->error);
return -EINTR;
}
-- 
2.9.3



[PATCH 23/25] lightnvm: pblk: print instance name on instance info

2017-12-20 Thread Matias Bjørling
From: Javier González 

Add the instance name to the information printed out on target creation.

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-init.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index d603b94..55d3b42 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -1069,7 +1069,8 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct 
gendisk *tdisk,
blk_queue_max_discard_sectors(tqueue, UINT_MAX >> 9);
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, tqueue);
 
-   pr_info("pblk init: luns:%u, lines:%d, secs:%llu, buf entries:%u\n",
+   pr_info("pblk(%s): luns:%u, lines:%d, secs:%llu, buf entries:%u\n",
+   tdisk->disk_name,
geo->all_luns, pblk->l_mg.nr_lines,
(unsigned long long)pblk->rl.nr_secs,
pblk->rwb.nr_entries);
-- 
2.9.3



[PATCH 14/25] lightnvm: pblk: clear flush point on completed writes

2017-12-20 Thread Matias Bjørling
From: Hans Holmberg 

Move completion of syncs and clearing of flush points to the
write completion path - this ensures that the data has been
committed to the media before completing bios containing syncs.
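
To illustrate the accounting this enables, here is a worked example of the
new flush-point count, assuming pblk_rb_ring_count() is the usual
power-of-two ring distance, (head - tail) & (size - 1); the numbers are
hypothetical:

        /* nr_entries = 64, and the ring pointers happen to be: */
        unsigned int sync = 10, subm = 20, flush_point = 17;

        unsigned int submitted = (subm - sync) & 63;               /* 10 */
        unsigned int to_flush  = ((flush_point - sync) & 63) + 1;  /*  8 */

        /*
         * submitted >= to_flush: every sector up to the flush point is
         * already in flight, so pblk_rb_flush_point_count() returns 0 and
         * the flush completes from the write-completion path alone.  With
         * flush_point = 25 instead, to_flush would be 16 and the writer
         * would still have 6 sectors left to submit.
         */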

Signed-off-by: Hans Holmberg 
Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-rb.c| 58 +--
 drivers/lightnvm/pblk-write.c | 17 -
 2 files changed, 44 insertions(+), 31 deletions(-)

diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index 941842e..9a4100c 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -353,17 +353,17 @@ static int pblk_rb_flush_point_set(struct pblk_rb *rb, 
struct bio *bio,
  unsigned int pos)
 {
struct pblk_rb_entry *entry;
-   unsigned int subm, flush_point;
+   unsigned int sync, flush_point;
 
-   subm = READ_ONCE(rb->subm);
+   sync = READ_ONCE(rb->sync);
+
+   if (pos == sync)
+   return 0;
 
 #ifdef CONFIG_NVM_DEBUG
atomic_inc(>inflight_flush_point);
 #endif
 
-   if (pos == subm)
-   return 0;
-
flush_point = (pos == 0) ? (rb->nr_entries - 1) : (pos - 1);
entry = >entries[flush_point];
 
@@ -606,22 +606,6 @@ unsigned int pblk_rb_read_to_bio(struct pblk_rb *rb, 
struct nvm_rq *rqd,
return NVM_IO_ERR;
}
 
-   if (flags & PBLK_FLUSH_ENTRY) {
-   unsigned int flush_point;
-
-   flush_point = READ_ONCE(rb->flush_point);
-   if (flush_point == pos) {
-   /* Protect flush points */
-   smp_store_release(>flush_point,
-   EMPTY_ENTRY);
-   }
-
-   flags &= ~PBLK_FLUSH_ENTRY;
-#ifdef CONFIG_NVM_DEBUG
-   atomic_dec(>inflight_flush_point);
-#endif
-   }
-
flags &= ~PBLK_WRITTEN_DATA;
flags |= PBLK_SUBMITTED_ENTRY;
 
@@ -731,15 +715,24 @@ void pblk_rb_sync_end(struct pblk_rb *rb, unsigned long 
*flags)
 
 unsigned int pblk_rb_sync_advance(struct pblk_rb *rb, unsigned int nr_entries)
 {
-   unsigned int sync;
-   unsigned int i;
-
+   unsigned int sync, flush_point;
lockdep_assert_held(>s_lock);
 
sync = READ_ONCE(rb->sync);
+   flush_point = READ_ONCE(rb->flush_point);
 
-   for (i = 0; i < nr_entries; i++)
-   sync = (sync + 1) & (rb->nr_entries - 1);
+   if (flush_point != EMPTY_ENTRY) {
+   unsigned int secs_to_flush;
+
+   secs_to_flush = pblk_rb_ring_count(flush_point, sync,
+   rb->nr_entries);
+   if (secs_to_flush <= nr_entries) {
+   /* Protect flush points */
+   smp_store_release(>flush_point, EMPTY_ENTRY);
+   }
+   }
+
+   sync = (sync + nr_entries) & (rb->nr_entries - 1);
 
/* Protect from counts */
smp_store_release(>sync, sync);
@@ -747,22 +740,27 @@ unsigned int pblk_rb_sync_advance(struct pblk_rb *rb, 
unsigned int nr_entries)
return sync;
 }
 
+/* Calculate how many sectors to submit up to the current flush point. */
 unsigned int pblk_rb_flush_point_count(struct pblk_rb *rb)
 {
-   unsigned int subm, flush_point;
-   unsigned int count;
+   unsigned int subm, sync, flush_point;
+   unsigned int submitted, to_flush;
 
/* Protect flush points */
flush_point = smp_load_acquire(>flush_point);
if (flush_point == EMPTY_ENTRY)
return 0;
 
+   /* Protect syncs */
+   sync = smp_load_acquire(>sync);
+
subm = READ_ONCE(rb->subm);
+   submitted = pblk_rb_ring_count(subm, sync, rb->nr_entries);
 
/* The sync point itself counts as a sector to sync */
-   count = pblk_rb_ring_count(flush_point, subm, rb->nr_entries) + 1;
+   to_flush = pblk_rb_ring_count(flush_point, sync, rb->nr_entries) + 1;
 
-   return count;
+   return (submitted < to_flush) ? (to_flush - submitted) : 0;
 }
 
 /*
diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c
index 018af87..aae86ed 100644
--- a/drivers/lightnvm/pblk-write.c
+++ b/drivers/lightnvm/pblk-write.c
@@ -21,13 +21,28 @@ static unsigned long pblk_end_w_bio(struct pblk *pblk, 
struct nvm_rq *rqd,
struct pblk_c_ctx *c_ctx)
 {
struct bio *original_bio;
+   struct pblk_rb *rwb = >rwb;
unsigned long ret;
int i;
 
for (i = 0; i < c_ctx->nr_valid; i++) {
struct pblk_w_ctx *w_ctx;
+   int pos = c_ctx->sentry + i;
+   int flags;
+
+   w_ctx = 

[PATCH 13/25] lightnvm: pblk: rename sync_point to flush_point

2017-12-20 Thread Matias Bjørling
From: Hans Holmberg 

Sync point is a really confusing name for keeping track of
the last entry that needs to be flushed, so change the name
to flush_point instead.

Signed-off-by: Hans Holmberg 
Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-rb.c| 61 ++-
 drivers/lightnvm/pblk-write.c |  2 +-
 drivers/lightnvm/pblk.h   |  6 ++---
 3 files changed, 35 insertions(+), 34 deletions(-)

diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index 62db408..941842e 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -54,7 +54,7 @@ int pblk_rb_init(struct pblk_rb *rb, struct pblk_rb_entry 
*rb_entry_base,
rb->seg_size = (1 << power_seg_sz);
rb->nr_entries = (1 << power_size);
rb->mem = rb->subm = rb->sync = rb->l2p_update = 0;
-   rb->sync_point = EMPTY_ENTRY;
+   rb->flush_point = EMPTY_ENTRY;
 
spin_lock_init(>w_lock);
spin_lock_init(>s_lock);
@@ -112,7 +112,7 @@ int pblk_rb_init(struct pblk_rb *rb, struct pblk_rb_entry 
*rb_entry_base,
up_write(_rb_lock);
 
 #ifdef CONFIG_NVM_DEBUG
-   atomic_set(>inflight_sync_point, 0);
+   atomic_set(>inflight_flush_point, 0);
 #endif
 
/*
@@ -349,26 +349,26 @@ void pblk_rb_write_entry_gc(struct pblk_rb *rb, void 
*data,
smp_store_release(>w_ctx.flags, flags);
 }
 
-static int pblk_rb_sync_point_set(struct pblk_rb *rb, struct bio *bio,
+static int pblk_rb_flush_point_set(struct pblk_rb *rb, struct bio *bio,
  unsigned int pos)
 {
struct pblk_rb_entry *entry;
-   unsigned int subm, sync_point;
+   unsigned int subm, flush_point;
 
subm = READ_ONCE(rb->subm);
 
 #ifdef CONFIG_NVM_DEBUG
-   atomic_inc(>inflight_sync_point);
+   atomic_inc(>inflight_flush_point);
 #endif
 
if (pos == subm)
return 0;
 
-   sync_point = (pos == 0) ? (rb->nr_entries - 1) : (pos - 1);
-   entry = >entries[sync_point];
+   flush_point = (pos == 0) ? (rb->nr_entries - 1) : (pos - 1);
+   entry = >entries[flush_point];
 
-   /* Protect syncs */
-   smp_store_release(>sync_point, sync_point);
+   /* Protect flush points */
+   smp_store_release(>flush_point, flush_point);
 
if (!bio)
return 0;
@@ -416,7 +416,7 @@ void pblk_rb_flush(struct pblk_rb *rb)
struct pblk *pblk = container_of(rb, struct pblk, rwb);
unsigned int mem = READ_ONCE(rb->mem);
 
-   if (pblk_rb_sync_point_set(rb, NULL, mem))
+   if (pblk_rb_flush_point_set(rb, NULL, mem))
return;
 
pblk_write_should_kick(pblk);
@@ -440,7 +440,7 @@ static int pblk_rb_may_write_flush(struct pblk_rb *rb, 
unsigned int nr_entries,
 #ifdef CONFIG_NVM_DEBUG
atomic_long_inc(>nr_flush);
 #endif
-   if (pblk_rb_sync_point_set(>rwb, bio, mem))
+   if (pblk_rb_flush_point_set(>rwb, bio, mem))
*io_ret = NVM_IO_OK;
}
 
@@ -607,17 +607,18 @@ unsigned int pblk_rb_read_to_bio(struct pblk_rb *rb, 
struct nvm_rq *rqd,
}
 
if (flags & PBLK_FLUSH_ENTRY) {
-   unsigned int sync_point;
+   unsigned int flush_point;
 
-   sync_point = READ_ONCE(rb->sync_point);
-   if (sync_point == pos) {
-   /* Protect syncs */
-   smp_store_release(>sync_point, EMPTY_ENTRY);
+   flush_point = READ_ONCE(rb->flush_point);
+   if (flush_point == pos) {
+   /* Protect flush points */
+   smp_store_release(>flush_point,
+   EMPTY_ENTRY);
}
 
flags &= ~PBLK_FLUSH_ENTRY;
 #ifdef CONFIG_NVM_DEBUG
-   atomic_dec(>inflight_sync_point);
+   atomic_dec(>inflight_flush_point);
 #endif
}
 
@@ -746,20 +747,20 @@ unsigned int pblk_rb_sync_advance(struct pblk_rb *rb, 
unsigned int nr_entries)
return sync;
 }
 
-unsigned int pblk_rb_sync_point_count(struct pblk_rb *rb)
+unsigned int pblk_rb_flush_point_count(struct pblk_rb *rb)
 {
-   unsigned int subm, sync_point;
+   unsigned int subm, flush_point;
unsigned int count;
 
-   /* Protect syncs */
-   sync_point = smp_load_acquire(>sync_point);
-   if (sync_point == EMPTY_ENTRY)
+   /* Protect flush points */
+   flush_point = smp_load_acquire(>flush_point);
+   if (flush_point == EMPTY_ENTRY)
return 0;
 
subm = READ_ONCE(rb->subm);
 
/* The sync point itself counts as a sector to sync */
-   count = 

[PATCH 04/25] lightnvm: remove hybrid ocssd 1.2 support

2017-12-20 Thread Matias Bjørling
Now that rrpc has been removed, also remove the hybrid 1.2 support
from the core.

Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/core.c  | 141 ---
 drivers/nvme/host/lightnvm.c |  59 --
 include/linux/lightnvm.h |  41 -
 3 files changed, 241 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 83249b4..390d5ef 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -45,12 +45,6 @@ struct nvm_dev_map {
int nr_chnls;
 };
 
-struct nvm_area {
-   struct list_head list;
-   sector_t begin;
-   sector_t end;   /* end is excluded */
-};
-
 static struct nvm_target *nvm_find_target(struct nvm_dev *dev, const char 
*name)
 {
struct nvm_target *tgt;
@@ -524,35 +518,6 @@ static void nvm_rq_dev_to_tgt(struct nvm_tgt_dev *tgt_dev, 
struct nvm_rq *rqd)
nvm_ppa_dev_to_tgt(tgt_dev, rqd->ppa_list, rqd->nr_ppas);
 }
 
-void nvm_part_to_tgt(struct nvm_dev *dev, sector_t *entries,
-int len)
-{
-   struct nvm_geo *geo = >geo;
-   struct nvm_dev_map *dev_rmap = dev->rmap;
-   u64 i;
-
-   for (i = 0; i < len; i++) {
-   struct nvm_ch_map *ch_rmap;
-   int *lun_roffs;
-   struct ppa_addr gaddr;
-   u64 pba = le64_to_cpu(entries[i]);
-   u64 diff;
-
-   if (!pba)
-   continue;
-
-   gaddr = linear_to_generic_addr(geo, pba);
-   ch_rmap = _rmap->chnls[gaddr.g.ch];
-   lun_roffs = ch_rmap->lun_offs;
-
-   diff = ((ch_rmap->ch_off * geo->luns_per_chnl) +
-   (lun_roffs[gaddr.g.lun])) * geo->sec_per_lun;
-
-   entries[i] -= cpu_to_le64(diff);
-   }
-}
-EXPORT_SYMBOL(nvm_part_to_tgt);
-
 int nvm_register_tgt_type(struct nvm_tgt_type *tt)
 {
int ret = 0;
@@ -726,112 +691,6 @@ int nvm_submit_io_sync(struct nvm_tgt_dev *tgt_dev, 
struct nvm_rq *rqd)
 }
 EXPORT_SYMBOL(nvm_submit_io_sync);
 
-int nvm_erase_sync(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *ppas,
-   int nr_ppas)
-{
-   struct nvm_geo *geo = _dev->geo;
-   struct nvm_rq rqd;
-   int ret;
-
-   memset(, 0, sizeof(struct nvm_rq));
-
-   rqd.opcode = NVM_OP_ERASE;
-   rqd.flags = geo->plane_mode >> 1;
-
-   ret = nvm_set_rqd_ppalist(tgt_dev, , ppas, nr_ppas);
-   if (ret)
-   return ret;
-
-   ret = nvm_submit_io_sync(tgt_dev, );
-   if (ret) {
-   pr_err("rrpr: erase I/O submission failed: %d\n", ret);
-   goto free_ppa_list;
-   }
-
-free_ppa_list:
-   nvm_free_rqd_ppalist(tgt_dev, );
-
-   return ret;
-}
-EXPORT_SYMBOL(nvm_erase_sync);
-
-int nvm_get_l2p_tbl(struct nvm_tgt_dev *tgt_dev, u64 slba, u32 nlb,
-   nvm_l2p_update_fn *update_l2p, void *priv)
-{
-   struct nvm_dev *dev = tgt_dev->parent;
-
-   if (!dev->ops->get_l2p_tbl)
-   return 0;
-
-   return dev->ops->get_l2p_tbl(dev, slba, nlb, update_l2p, priv);
-}
-EXPORT_SYMBOL(nvm_get_l2p_tbl);
-
-int nvm_get_area(struct nvm_tgt_dev *tgt_dev, sector_t *lba, sector_t len)
-{
-   struct nvm_dev *dev = tgt_dev->parent;
-   struct nvm_geo *geo = >geo;
-   struct nvm_area *area, *prev, *next;
-   sector_t begin = 0;
-   sector_t max_sectors = (geo->sec_size * dev->total_secs) >> 9;
-
-   if (len > max_sectors)
-   return -EINVAL;
-
-   area = kmalloc(sizeof(struct nvm_area), GFP_KERNEL);
-   if (!area)
-   return -ENOMEM;
-
-   prev = NULL;
-
-   spin_lock(>lock);
-   list_for_each_entry(next, >area_list, list) {
-   if (begin + len > next->begin) {
-   begin = next->end;
-   prev = next;
-   continue;
-   }
-   break;
-   }
-
-   if ((begin + len) > max_sectors) {
-   spin_unlock(>lock);
-   kfree(area);
-   return -EINVAL;
-   }
-
-   area->begin = *lba = begin;
-   area->end = begin + len;
-
-   if (prev) /* insert into sorted order */
-   list_add(>list, >list);
-   else
-   list_add(>list, >area_list);
-   spin_unlock(>lock);
-
-   return 0;
-}
-EXPORT_SYMBOL(nvm_get_area);
-
-void nvm_put_area(struct nvm_tgt_dev *tgt_dev, sector_t begin)
-{
-   struct nvm_dev *dev = tgt_dev->parent;
-   struct nvm_area *area;
-
-   spin_lock(>lock);
-   list_for_each_entry(area, >area_list, list) {
-   if (area->begin != begin)
-   continue;
-
-   list_del(>list);
-   spin_unlock(>lock);
-   kfree(area);
-   return;
-   }
-   spin_unlock(>lock);
-}
-EXPORT_SYMBOL(nvm_put_area);
-
 void 

[PATCH 02/25] lightnvm: remove rrpc

2017-12-20 Thread Matias Bjørling
The hybrid mode for the 1.2 revision was deprecated and has
no users. Remove it to make it easier to move to the 2.0 revision.

Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/Kconfig  |7 -
 drivers/lightnvm/Makefile |1 -
 drivers/lightnvm/rrpc.c   | 1625 -
 drivers/lightnvm/rrpc.h   |  290 
 4 files changed, 1923 deletions(-)
 delete mode 100644 drivers/lightnvm/rrpc.c
 delete mode 100644 drivers/lightnvm/rrpc.h

diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig
index 2a953ef..10c0898 100644
--- a/drivers/lightnvm/Kconfig
+++ b/drivers/lightnvm/Kconfig
@@ -27,13 +27,6 @@ config NVM_DEBUG
 
It is required to create/remove targets without IOCTLs.
 
-config NVM_RRPC
-   tristate "Round-robin Hybrid Open-Channel SSD target"
-   ---help---
-   Allows an open-channel SSD to be exposed as a block device to the
-   host. The target is implemented using a linear mapping table and
-   cost-based garbage collection. It is optimized for 4K IO sizes.
-
 config NVM_PBLK
tristate "Physical Block Device Open-Channel SSD target"
---help---
diff --git a/drivers/lightnvm/Makefile b/drivers/lightnvm/Makefile
index 82d1a11..c2d0b38 100644
--- a/drivers/lightnvm/Makefile
+++ b/drivers/lightnvm/Makefile
@@ -3,7 +3,6 @@
 #
 
 obj-$(CONFIG_NVM)  := core.o
-obj-$(CONFIG_NVM_RRPC) += rrpc.o
 obj-$(CONFIG_NVM_PBLK) += pblk.o
 pblk-y := pblk-init.o pblk-core.o pblk-rb.o \
   pblk-write.o pblk-cache.o pblk-read.o \
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
deleted file mode 100644
index 267f01a..000
--- a/drivers/lightnvm/rrpc.c
+++ /dev/null
@@ -1,1625 +0,0 @@
-/*
- * Copyright (C) 2015 IT University of Copenhagen
- * Initial release: Matias Bjorling 
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License version
- * 2 as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License for more details.
- *
- * Implementation of a Round-robin page-based Hybrid FTL for Open-channel SSDs.
- */
-
-#include "rrpc.h"
-
-static struct kmem_cache *rrpc_gcb_cache, *rrpc_rq_cache;
-static DECLARE_RWSEM(rrpc_lock);
-
-static int rrpc_submit_io(struct rrpc *rrpc, struct bio *bio,
-   struct nvm_rq *rqd, unsigned long flags);
-
-#define rrpc_for_each_lun(rrpc, rlun, i) \
-   for ((i) = 0, rlun = &(rrpc)->luns[0]; \
-   (i) < (rrpc)->nr_luns; (i)++, rlun = &(rrpc)->luns[(i)])
-
-static void rrpc_page_invalidate(struct rrpc *rrpc, struct rrpc_addr *a)
-{
-   struct nvm_tgt_dev *dev = rrpc->dev;
-   struct rrpc_block *rblk = a->rblk;
-   unsigned int pg_offset;
-
-   lockdep_assert_held(>rev_lock);
-
-   if (a->addr == ADDR_EMPTY || !rblk)
-   return;
-
-   spin_lock(>lock);
-
-   div_u64_rem(a->addr, dev->geo.sec_per_blk, _offset);
-   WARN_ON(test_and_set_bit(pg_offset, rblk->invalid_pages));
-   rblk->nr_invalid_pages++;
-
-   spin_unlock(>lock);
-
-   rrpc->rev_trans_map[a->addr].addr = ADDR_EMPTY;
-}
-
-static void rrpc_invalidate_range(struct rrpc *rrpc, sector_t slba,
-   unsigned int len)
-{
-   sector_t i;
-
-   spin_lock(>rev_lock);
-   for (i = slba; i < slba + len; i++) {
-   struct rrpc_addr *gp = >trans_map[i];
-
-   rrpc_page_invalidate(rrpc, gp);
-   gp->rblk = NULL;
-   }
-   spin_unlock(>rev_lock);
-}
-
-static struct nvm_rq *rrpc_inflight_laddr_acquire(struct rrpc *rrpc,
-   sector_t laddr, unsigned int pages)
-{
-   struct nvm_rq *rqd;
-   struct rrpc_inflight_rq *inf;
-
-   rqd = mempool_alloc(rrpc->rq_pool, GFP_ATOMIC);
-   if (!rqd)
-   return ERR_PTR(-ENOMEM);
-
-   inf = rrpc_get_inflight_rq(rqd);
-   if (rrpc_lock_laddr(rrpc, laddr, pages, inf)) {
-   mempool_free(rqd, rrpc->rq_pool);
-   return NULL;
-   }
-
-   return rqd;
-}
-
-static void rrpc_inflight_laddr_release(struct rrpc *rrpc, struct nvm_rq *rqd)
-{
-   struct rrpc_inflight_rq *inf = rrpc_get_inflight_rq(rqd);
-
-   rrpc_unlock_laddr(rrpc, inf);
-
-   mempool_free(rqd, rrpc->rq_pool);
-}
-
-static void rrpc_discard(struct rrpc *rrpc, struct bio *bio)
-{
-   sector_t slba = bio->bi_iter.bi_sector / NR_PHY_IN_LOG;
-   sector_t len = bio->bi_iter.bi_size / RRPC_EXPOSED_PAGE_SIZE;
-   struct nvm_rq *rqd;
-
-   while (1) {
-   rqd = 

[PATCH] BLOCK: blk-flush: fixed line with more than 80 character

2017-12-20 Thread Khan M Rashedun-Naby
The line has more than 80 characters, which violates the Linux kernel
coding style. Thus it has been rewritten into two lines.

Signed-off-by: Khan M Rashedun-Naby 
---
 block/blk-flush.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index f171706..7ab46bb 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -299,7 +299,8 @@ static bool blk_kick_flush(struct request_queue *q, struct 
blk_flush_queue *fq)
struct request *flush_rq = fq->flush_rq;
 
/* C1 described at the top of this file */
-   if (fq->flush_pending_idx != fq->flush_running_idx || 
list_empty(pending))
+   if (fq->flush_pending_idx != fq->flush_running_idx ||
+   list_empty(pending))
return false;
 
/* C2 and C3
-- 
2.7.4



Re: [PATCH V2] block-throttle: avoid double charge

2017-12-20 Thread Shaohua Li
On Wed, Dec 20, 2017 at 02:42:20PM +0800, xuejiufei wrote:
> Hi Shaohua,
> 
> I noticed that the splitted bio will goto the scheduler directly while
> the cloned bio entering the generic_make_request again. So can we just
> leave the BIO_THROTTLED flag in the original bio and do not copy the
> flag to new splitted bio, so it is not necessary to remove the flag in
> bio_set_dev()? Or there are other different situations?

but we can still send the original bio to a different bdev, for example with stacked
disks. That bit will prevent throttling for the underlying disks.

Thanks,
Shaohua
 
> On 2017/11/14 上午4:37, Shaohua Li wrote:
> > If a bio is throttled and splitted after throttling, the bio could be
> > resubmited and enters the throttling again. This will cause part of the
> > bio is charged multiple times. If the cgroup has an IO limit, the double
> > charge will significantly harm the performance. The bio split becomes
> > quite common after arbitrary bio size change.
> > 
> > To fix this, we always set the BIO_THROTTLED flag if a bio is throttled.
> > If the bio is cloned/slitted, we copy the flag to new bio too to avoid
> > double charge. However cloned bio could be directed to a new disk,
> > keeping the flag will have problem. The observation is we always set new
> > disk for the bio in this case, so we can clear the flag in
> > bio_set_dev().
> > 
> > This issue exists a long time, arbitrary bio size change makes it worse,
> > so this should go into stable at least since v4.2.
> > 
> > V1-> V2: Not add extra field in bio based on discussion with Tejun
> > 
> > Cc: Tejun Heo 
> > Cc: Vivek Goyal 
> > Cc: sta...@vger.kernel.org
> > Signed-off-by: Shaohua Li 
> > ---
> >  block/bio.c  | 2 ++
> >  block/blk-throttle.c | 8 +---
> >  include/linux/bio.h  | 2 ++
> >  3 files changed, 5 insertions(+), 7 deletions(-)
> > 
> > diff --git a/block/bio.c b/block/bio.c
> > index 8338304..d1d4d51 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -598,6 +598,8 @@ void __bio_clone_fast(struct bio *bio, struct bio 
> > *bio_src)
> >  */
> > bio->bi_disk = bio_src->bi_disk;
> > bio_set_flag(bio, BIO_CLONED);
> > +   if (bio_flagged(bio_src, BIO_THROTTLED))
> > +   bio_set_flag(bio, BIO_THROTTLED);
> > bio->bi_opf = bio_src->bi_opf;
> > bio->bi_write_hint = bio_src->bi_write_hint;
> > bio->bi_iter = bio_src->bi_iter;
> > diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> > index ee6d7b0..f90fec1 100644
> > --- a/block/blk-throttle.c
> > +++ b/block/blk-throttle.c
> > @@ -,13 +,7 @@ bool blk_throtl_bio(struct request_queue *q, struct 
> > blkcg_gq *blkg,
> >  out_unlock:
> > spin_unlock_irq(q->queue_lock);
> >  out:
> > -   /*
> > -* As multiple blk-throtls may stack in the same issue path, we
> > -* don't want bios to leave with the flag set.  Clear the flag if
> > -* being issued.
> > -*/
> > -   if (!throttled)
> > -   bio_clear_flag(bio, BIO_THROTTLED);
> > +   bio_set_flag(bio, BIO_THROTTLED);
> >  
> >  #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
> > if (throttled || !td->track_bio_latency)
> > diff --git a/include/linux/bio.h b/include/linux/bio.h
> > index 9c75f58..27b5bac 100644
> > --- a/include/linux/bio.h
> > +++ b/include/linux/bio.h
> > @@ -504,6 +504,8 @@ extern unsigned int bvec_nr_vecs(unsigned short idx);
> >  
> >  #define bio_set_dev(bio, bdev) \
> >  do {   \
> > +   if ((bio)->bi_disk != (bdev)->bd_disk)  \
> > +   bio_clear_flag(bio, BIO_THROTTLED);\
> > (bio)->bi_disk = (bdev)->bd_disk;   \
> > (bio)->bi_partno = (bdev)->bd_partno;   \
> >  } while (0)
> > 


[PATCH IMPROVEMENT] block, bfq: increase threshold to deem I/O as random

2017-12-20 Thread Paolo Valente
If two processes do I/O close to each other, i.e., are cooperating
processes in BFQ (and CFQ's) nomenclature, then BFQ merges their
associated bfq_queues, so as to get sequential I/O from the union of
the I/O requests of the processes, and thus reach a higher
throughput. A merged queue is then split if its I/O stops being
sequential. In this respect, BFQ deems the I/O of a bfq_queue as
(mostly) sequential only if less than 4 I/O requests are random, out
of the last 32 requests inserted into the queue.
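
For illustration, with a hypothetical history in which 5 of the last 32
requests were seeky (the shift-register update of seek_history is an
assumption about the surrounding code; only the threshold value comes
from this patch):

        u32 seek_history = 0x1f;        /* 5 bits set: 5 seeky requests */

        /* Old rule: 5 > 32/8 (= 4), so the queue is deemed seeky and a
         * merged queue carrying this history is soon split. */
        bool seeky_old = hweight32(seek_history) > 32/8;

        /* New rule: 5 > 19 is false, so the merge survives. */
        bool seeky_new = hweight32(seek_history) > 19;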

Unfortunately, extensive testing (with the interleaved_io benchmark of
the S suite [1], and with real applications spawning cooperating
processes) has clearly shown that, with such a low threshold, only a
rather low I/O throughput may be reached when several cooperating
processes do I/O. In particular, the outcome of each test run was
bimodal: if queue merging occurred and was stable during the test,
then the throughput was close to the peak rate of the storage device,
otherwise the throughput was arbitrarily low (usually around 1/10 of
the peak rate with a rotational device). The probability of getting the
unlucky outcomes grew with the number of cooperating processes: it was
already significant with 5 processes, and close to one with 7 or more
processes.

The cause of the low throughput in the unlucky runs was that the
merged queues containing the I/O of these cooperating processes were
soon split, because they contained more random I/O requests than those
tolerated by the 4/32 threshold, but
- that I/O would nevertheless have allowed the storage device to reach
  peak throughput, or almost peak throughput;
- in contrast, the I/O of these processes, if served individually
  (from separate queues), yielded a rather low throughput.

So we repeated our tests with increasing values of the threshold,
until we found the minimum value (19) for which we obtained maximum
throughput, reliably, with at least up to 9 cooperating
processes. Then we checked that the use of that higher threshold value
did not cause any regression for any other benchmark in the suite [1].
This commit raises the threshold to this higher value.

[1] https://github.com/Algodev-github/S

Signed-off-by: Angelo Ruocco 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index c66578592c9e..e33c5c4c9856 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -192,7 +192,7 @@ static struct kmem_cache *bfq_pool;
 #define BFQQ_SEEK_THR  (sector_t)(8 * 100)
 #define BFQQ_SECT_THR_NONROT   (sector_t)(2 * 32)
 #define BFQQ_CLOSE_THR (sector_t)(8 * 1024)
-#define BFQQ_SEEKY(bfqq)   (hweight32(bfqq->seek_history) > 32/8)
+#define BFQQ_SEEKY(bfqq)   (hweight32(bfqq->seek_history) > 19)
 
 /* Min number of samples required to perform peak-rate update */
 #define BFQ_RATE_MIN_SAMPLES   32
-- 
2.15.1



[PATCH] Block: blk-flush: removed an unnecessary else statement

2017-12-20 Thread Khan M Rashedun-Naby
As both the if and the else branches return, there is no need for
the else statement. Thus the else statement has been removed.

Signed-off-by: Khan M Rashedun-Naby 
---
 block/blk-flush.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index f171706..b629014 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -137,13 +137,14 @@ static bool blk_flush_queue_rq(struct request *rq, bool 
add_front)
if (rq->q->mq_ops) {
blk_mq_add_to_requeue_list(rq, add_front, true);
return false;
-   } else {
-   if (add_front)
-   list_add(>queuelist, >q->queue_head);
-   else
-   list_add_tail(>queuelist, >q->queue_head);
-   return true;
}
+
+   if (add_front)
+   list_add(>queuelist, >q->queue_head);
+   else
+   list_add_tail(>queuelist, >q->queue_head);
+
+   return true;
 }
 
 /**
-- 
2.7.4



Re: 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)

2017-12-20 Thread Christian Borntraeger
On 12/18/2017 02:56 PM, Stefan Haberland wrote:
> On 07.12.2017 00:29, Christoph Hellwig wrote:
>> On Wed, Dec 06, 2017 at 01:25:11PM +0100, Christian Borntraeger wrote:
>> t > commit 11b2025c3326f7096ceb588c3117c7883850c068    -> bad
>>>  blk-mq: create a blk_mq_ctx for each possible CPU
>>> does not boot on DASD and
>>> commit 9c6ae239e01ae9a9f8657f05c55c4372e9fc8bcc    -> good
>>>     genirq/affinity: assign vectors to all possible CPUs
>>> does boot with DASD disks.
>>>
>>> Also adding Stefan Haberland if he has an idea why this fails on DASD and 
>>> adding Martin (for the
>>> s390 irq handling code).
>> That is interesting as it really isn't related to interrupts at all,
>> it just ensures that possible CPUs are set in ->cpumask.
>>
>> I guess we'd really want:
>>
>> e005655c389e3d25bf3e43f71611ec12f3012de0
>> "blk-mq: only select online CPUs in blk_mq_hctx_next_cpu"
>>
>> before this commit, but it seems like the whole stack didn't work for
>> your either.
>>
>> I wonder if there is some weird thing about nr_cpu_ids in s390?
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe linux-s390" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 
> I tried this on my system and the blk-mq-hotplug-fix branch does not boot for 
> me as well.
> The disks get up and running and I/O works fine. At least the partition 
> detection and EXT4-fs mount works.
> 
> But at some point in time the disk do not get any requests.
> 
> I currently have no clue why.
> I took a dump and had a look at the disk states and they are fine. No error 
> in the logs or in our debug entrys. Just empty DASD devices waiting to be 
> called for I/O requests.
> 
> Do you have anything I could have a look at?

Jens, Christoph, so what do we do about this?
To summarize:
- commit 4b855ad37194f7 ("blk-mq: Create hctx for each present CPU") broke CPU 
hotplug.
- Jens' quick revert did fix the issue and did not break DASD support, but it
has some issues with interrupt affinity.
- Christoph's patch set fixes the hotplug issue for virtio-blk but causes I/O
hangs on DASDs (even without hotplug).

Christian



Re: [PATCH V2 0/2] block: fix queue freeze and cleanup

2017-12-20 Thread Mauricio Faria de Oliveira

Hi Bart,

On 12/13/2017 07:49 PM, Bart Van Assche wrote:

Would it be possible to repeat your test with the patch below applied on your
kernel tree? This patch has just been posted on the dm-devel mailing list.


Sorry for the delay. I missed this.

Unfortunately the oops problem still happens on PATCH v2 and that patch
(actually, the version that ended up in v4.15-rc4 [1] by Mike Snitzer).

The problem does not happen with PATCH v3 (a.k.a. v1) -- i.e., v3 is OK.

Thanks!


Test-case w/ PATCH v3:
---

[root@guest ~]# uname -r
4.15.0-rc4.mingleiv3

[root@guest ~]# reboot
...
systemd-shutdown[1]: Detaching DM devices.
systemd-shutdown[1]: Detaching DM 253:2.
shutdown: 7 output lines suppressed due to ratelimiting
dracut Warning: Killing all remaining processes
dracut Warning: Killing all remaining processes
XFS (dm-0): Unmounting Filesystem
dracut Warning: Unmounted /oldroot.
dracut: Disassembling device-mapper devices
Rebooting.
sd 0:0:0:0: [sda] Synchronizing SCSI cache
reboot: Restarting system


Test-case w/ PATCH v2:
---

[root@guest ~]# uname -r
4.15.0-rc4.mingleiv2

[root@guest ~]# reboot
...
systemd-shutdown[1]: Detaching DM devices.
systemd-shutdown[1]: Detaching DM 253:2.
Unable to handle kernel paging request for instruction fetch
Faulting instruction address: 0x
Oops: Kernel access of bad area, sig: 11 [#1]
LE SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: vmx_crypto virtio_balloon ip_tables x_tables xfs 
libcrc32c virtio_net virtio_scsi crc32c_vpmsum virtio_pci virtio_ring 
virtio autofs4

CPU: 3 PID: 1 Comm: systemd-shutdow Not tainted 4.15.0-rc4.mingleiv2 #4
<...>
NIP []   (null)
LR [c057d0dc] __blk_run_queue+0x6c/0xb0
Call Trace:
[c001fb083970] [c001fb0839e0] 0xc001fb0839e0 (unreliable)
[c001fb0839a0] [c057d6ec] blk_run_queue+0x4c/0x80
[c001fb0839d0] [c0592834] blk_freeze_queue_start+0xa4/0xb0
[c001fb083a00] [c057deac] blk_set_queue_dying+0x6c/0x190
[c001fb083a30] [c08a4b7c] __dm_destroy+0xac/0x300
[c001fb083ad0] [c08b0244] dev_remove+0x154/0x1d0
[c001fb083b20] [c08b0b70] ctl_ioctl+0x360/0x4f0
[c001fb083d10] [c08b0d38] dm_ctl_ioctl+0x38/0x50
[c001fb083d50] [c03867d8] do_vfs_ioctl+0xd8/0x8c0
[c001fb083df0] [c0387028] SyS_ioctl+0x68/0x100
[c001fb083e30] [c000b760] system_call+0x58/0x6c
Instruction dump:
       
       
---[ end trace 0fceefbe4fc1cd29 ]---

Kernel panic - not syncing: Attempted to kill init! exitcode=0x000b



[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.15=afc567a4977b2d798e05153dd131a3c8d4758c0c



--
Mauricio Faria de Oliveira
IBM Linux Technology Center



[PATCH IMPROVEMENT/BUGFIX 2/4] block, bfq: check low_latency flag in bfq_bfqq_save_state()

2017-12-20 Thread Paolo Valente
From: Angelo Ruocco 

A just-created bfq_queue will certainly be deemed as interactive on
the arrival of its first I/O request, if the low_latency flag is
set. Yet, if the queue is merged with another queue on the arrival of
its first I/O request, it will not have the chance to be flagged as
interactive. Nevertheless, if the queue is then split soon enough, it
has to be flagged as interactive after the split.

To handle this early-merge scenario correctly, BFQ saves the state of
the queue, on the merge, as if the latter had already been deemed
interactive. So, if the queue is split soon, it will get
weight-raised, because the previous state of the queue is resumed on
the split.

Unfortunately, in the act of saving the state of the newly-created
queue, BFQ doesn't check whether the low_latency flag is set, and this
causes early-merged queues to be then weight-raised, on queue splits,
even if low_latency is off. This commit addresses this problem by
adding the missing check.

Signed-off-by: Angelo Ruocco 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index c6f0b93a769c..f58367ef98c1 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2064,7 +2064,8 @@ static void bfq_bfqq_save_state(struct bfq_queue *bfqq)
bic->saved_in_large_burst = bfq_bfqq_in_large_burst(bfqq);
bic->was_in_burst_list = !hlist_unhashed(>burst_list_node);
if (unlikely(bfq_bfqq_just_created(bfqq) &&
-!bfq_bfqq_in_large_burst(bfqq))) {
+!bfq_bfqq_in_large_burst(bfqq) &&
+bfqq->bfqd->low_latency)) {
/*
 * bfqq being merged right after being created: bfqq
 * would have deserved interactive weight raising, but
-- 
2.15.1



[PATCH IMPROVEMENT/BUGFIX 3/4] block, bfq: let a queue be merged only shortly after starting I/O

2017-12-20 Thread Paolo Valente
In BFQ and CFQ, two processes are said to be cooperating if they do
I/O in such a way that the union of their I/O requests yields a
sequential I/O pattern. To get such a sequential I/O pattern out of
the non-sequential pattern of each cooperating process, BFQ and CFQ
merge the queues associated with these processes. In more detail,
cooperating processes, and thus their associated queues, usually
start, or restart, to do I/O shortly after each other. This is the
case, e.g., for the I/O threads of KVM/QEMU and of the dump
utility. Based on this assumption, this commit allows a bfq_queue to
be merged only during a short time interval (100ms) after it starts,
or re-starts, to do I/O.  This filtering provides two important
benefits.

First, it greatly reduces the probability that two non-cooperating
processes have their queues merged by mistake, if they just happen to
do I/O close to each other for a short time interval. These spurious
merges cause loss of service guarantees. A low-weight bfq_queue may
unjustly get more than its expected share of the throughput: if such a
low-weight queue is merged with a high-weight queue, then the I/O for
the low-weight queue is served as if the queue had a high weight. This
may damage other high-weight queues unexpectedly.  For instance,
because of this issue, lxterminal occasionally took 7.5 seconds to
start, instead of 6.5 seconds, when some sequential readers and
writers did I/O in the background on a FUJITSU MHX2300BT HDD.  The
reason is that the bfq_queues associated with some of the readers or
the writers were merged with the high-weight queues of some processes
that had to do some urgent but little I/O. The readers then exploited
the inherited high weight for all or most of their I/O, during the
start-up of the terminal. The filtering introduced by this commit
eliminated any outlier caused by spurious queue merges in our start-up
time tests.

This filtering also provides a little boost of the throughput
sustainable by BFQ: 3-4%, depending on the CPU. The reason is that,
once a bfq_queue cannot be merged any longer, this commit makes BFQ
stop updating the data needed to handle merging for the queue.

Signed-off-by: Paolo Valente 
Signed-off-by: Angelo Ruocco 
---
 block/bfq-iosched.c | 57 ++---
 block/bfq-iosched.h |  2 ++
 block/bfq-wf2q.c|  4 
 3 files changed, 52 insertions(+), 11 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index f58367ef98c1..320022135dc8 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -166,6 +166,20 @@ static const int bfq_async_charge_factor = 10;
 /* Default timeout values, in jiffies, approximating CFQ defaults. */
 const int bfq_timeout = HZ / 8;
 
+/*
+ * Time limit for merging (see comments in bfq_setup_cooperator). Set
+ * to the slowest value that, in our tests, proved to be effective in
+ * removing false positives, while not causing true positives to miss
+ * queue merging.
+ *
+ * As can be deduced from the low time limit below, queue merging, if
+ * successful, happens at the very beginning of the I/O of the involved
+ * cooperating processes, as a consequence of the arrival of the very
+ * first requests from each cooperator.  After that, there is very
+ * little chance to find cooperators.
+ */
+static const unsigned long bfq_merge_time_limit = HZ/10;
+
 static struct kmem_cache *bfq_pool;
 
 /* Below this threshold (in ns), we consider thinktime immediate. */
@@ -444,6 +458,13 @@ bfq_rq_pos_tree_lookup(struct bfq_data *bfqd, struct 
rb_root *root,
return bfqq;
 }
 
+static bool bfq_too_late_for_merging(struct bfq_queue *bfqq)
+{
+   return bfqq->service_from_backlogged > 0 &&
+   time_is_before_jiffies(bfqq->first_IO_time +
+  bfq_merge_time_limit);
+}
+
 void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 {
struct rb_node **p, *parent;
@@ -454,6 +475,14 @@ void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct 
bfq_queue *bfqq)
bfqq->pos_root = NULL;
}
 
+   /*
+* bfqq cannot be merged any longer (see comments in
+* bfq_setup_cooperator): no point in adding bfqq into the
+* position tree.
+*/
+   if (bfq_too_late_for_merging(bfqq))
+   return;
+
if (bfq_class_idle(bfqq))
return;
if (!bfqq->next_rq)
@@ -1935,6 +1964,9 @@ bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue 
*new_bfqq)
 static bool bfq_may_be_close_cooperator(struct bfq_queue *bfqq,
struct bfq_queue *new_bfqq)
 {
+   if (bfq_too_late_for_merging(new_bfqq))
+   return false;
+
if (bfq_class_idle(bfqq) || bfq_class_idle(new_bfqq) ||
(bfqq->ioprio_class != new_bfqq->ioprio_class))
return false;
@@ -2003,6 +2035,20 @@ 

[PATCH IMPROVEMENT/BUGFIX 4/4] block, bfq: remove superfluous check in queue-merging setup

2017-12-20 Thread Paolo Valente
From: Angelo Ruocco 

When two or more processes do I/O in a way that their requests are
sequential with respect to one another, BFQ merges the bfq_queues associated
with the processes. This way the overall I/O pattern becomes sequential,
and thus there is a boost in throughput.
These cooperating processes usually start or restart doing I/O shortly
after each other. So, in order to avoid merging non-cooperating processes,
BFQ ensures that none of these queues has been in weight raising for too
long.

In this respect, since commit "block, bfq-sq, bfq-mq: let a queue be merged
only shortly after being created", BFQ checks whether any queue (and not
only weight-raised ones) has been doing I/O continuously for too long to be
merged.

This new additional check makes the first one useless: a queue doing
I/O for long enough, if weight-raised, is also a queue that has been in
weight raising for too long to be merged. Accordingly, this commit
removes the first check.

Signed-off-by: Angelo Ruocco 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 36 +---
 1 file changed, 5 insertions(+), 31 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 320022135dc8..c66578592c9e 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1990,20 +1990,6 @@ static bool bfq_may_be_close_cooperator(struct bfq_queue 
*bfqq,
return true;
 }
 
-/*
- * If this function returns true, then bfqq cannot be merged. The idea
- * is that true cooperation happens very early after processes start
- * to do I/O. Usually, late cooperations are just accidental false
- * positives. In case bfqq is weight-raised, such false positives
- * would evidently degrade latency guarantees for bfqq.
- */
-static bool wr_from_too_long(struct bfq_queue *bfqq)
-{
-   return bfqq->wr_coeff > 1 &&
-   time_is_before_jiffies(bfqq->last_wr_start_finish +
-  msecs_to_jiffies(100));
-}
-
 /*
  * Attempt to schedule a merge of bfqq with the currently in-service
  * queue or with a close queue among the scheduled queues.  Return
@@ -2017,11 +2003,6 @@ static bool wr_from_too_long(struct bfq_queue *bfqq)
  * to maintain. Besides, in such a critical condition as an out of memory,
  * the benefits of queue merging may be little relevant, or even negligible.
  *
- * Weight-raised queues can be merged only if their weight-raising
- * period has just started. In fact cooperating processes are usually
- * started together. Thus, with this filter we avoid false positives
- * that would jeopardize low-latency guarantees.
- *
  * WARNING: queue merging may impair fairness among non-weight raised
  * queues, for at least two reasons: 1) the original weight of a
  * merged queue may change during the merged state, 2) even being the
@@ -2052,9 +2033,7 @@ bfq_setup_cooperator(struct bfq_data *bfqd, struct 
bfq_queue *bfqq,
if (bfqq->new_bfqq)
return bfqq->new_bfqq;
 
-   if (!io_struct ||
-   wr_from_too_long(bfqq) ||
-   unlikely(bfqq == >oom_bfqq))
+   if (!io_struct || unlikely(bfqq == >oom_bfqq))
return NULL;
 
/* If there is only one backlogged queue, don't search. */
@@ -2063,12 +2042,9 @@ bfq_setup_cooperator(struct bfq_data *bfqd, struct 
bfq_queue *bfqq,
 
in_service_bfqq = bfqd->in_service_queue;
 
-   if (!in_service_bfqq || in_service_bfqq == bfqq
-   || wr_from_too_long(in_service_bfqq) ||
-   unlikely(in_service_bfqq == >oom_bfqq))
-   goto check_scheduled;
-
-   if (bfq_rq_close_to_sector(io_struct, request, bfqd->last_position) &&
+   if (in_service_bfqq && in_service_bfqq != bfqq &&
+   likely(in_service_bfqq != >oom_bfqq) &&
+   bfq_rq_close_to_sector(io_struct, request, bfqd->last_position) &&
bfqq->entity.parent == in_service_bfqq->entity.parent &&
bfq_may_be_close_cooperator(bfqq, in_service_bfqq)) {
new_bfqq = bfq_setup_merge(bfqq, in_service_bfqq);
@@ -2080,12 +2056,10 @@ bfq_setup_cooperator(struct bfq_data *bfqd, struct 
bfq_queue *bfqq,
 * queues. The only thing we need is that the bio/request is not
 * NULL, as we need it to establish whether a cooperator exists.
 */
-check_scheduled:
new_bfqq = bfq_find_close_cooperator(bfqd, bfqq,
bfq_io_struct_pos(io_struct, request));
 
-   if (new_bfqq && !wr_from_too_long(new_bfqq) &&
-   likely(new_bfqq != >oom_bfqq) &&
+   if (new_bfqq && likely(new_bfqq != >oom_bfqq) &&
bfq_may_be_close_cooperator(bfqq, new_bfqq))
return bfq_setup_merge(bfqq, new_bfqq);
 
-- 
2.15.1



[PATCH IMPROVEMENT/BUGFIX 1/4] block, bfq: add missing rq_pos_tree update on rq removal

2017-12-20 Thread Paolo Valente
If two processes do I/O close to each other, then BFQ merges the
bfq_queues associated with these processes, to get a more sequential
I/O, and thus a higher throughput.  In this respect, to detect whether
two processes are doing I/O close to each other, BFQ keeps a list of
the head-of-line I/O requests of all active bfq_queues.  The list is
ordered by initial sectors, and implemented through a red-black tree
(rq_pos_tree).
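
For readers unfamiliar with the structure, a minimal sketch of such a
position tree follows; the names below are hypothetical and only
illustrate the idea of keying queues by the sector of their head-of-line
request:

#include <linux/rbtree.h>

struct pos_node {
        struct rb_node node;
        sector_t head_sector;   /* sector of the queue's next request */
};

static struct pos_node *pos_tree_search(struct rb_root *root, sector_t sector)
{
        struct rb_node *n = root->rb_node;

        while (n) {
                struct pos_node *p = rb_entry(n, struct pos_node, node);

                if (sector < p->head_sector)
                        n = n->rb_left;
                else if (sector > p->head_sector)
                        n = n->rb_right;
                else
                        return p;       /* another queue does I/O right here */
        }
        return NULL;    /* BFQ additionally considers the closest non-exact match */
}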

Unfortunately, the update of the rq_pos_tree was incomplete, because
the tree was not updated on the removal of the head-of-line I/O
request of a bfq_queue, in case the queue did not remain empty. This
commit adds the missing update.

Signed-off-by: Paolo Valente 
Signed-off-by: Angelo Ruocco 
---
 block/bfq-iosched.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a9e06217e64d..c6f0b93a769c 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1627,6 +1627,8 @@ static void bfq_remove_request(struct request_queue *q,
rb_erase(>pos_node, bfqq->pos_root);
bfqq->pos_root = NULL;
}
+   } else {
+   bfq_pos_tree_add_move(bfqd, bfqq);
}
 
if (rq->cmd_flags & REQ_META)
-- 
2.15.1



[PATCH IMPROVEMENT/BUGFIX 0/4] remove start-up time outlier caused by wrong detection of cooperating processes

2017-12-20 Thread Paolo Valente
Hi,
the main patch in this series ("block, bfq: let a queue be merged only
shortly after starting I/O") eliminates an outlier in the application
start-up time guaranteed by BFQ. This outlier occurs more or less
frequently, as a function of the characteristics of the system, and is
caused by a wrong detection of so-called cooperating processes
(details in the commit message).  This main patch is preceded by two
patches that fix two bugs found while working on this problem. The main
patch is then followed by a further optimization patch that removes an
operation made superfluous by it.

Jens, I've not forgotten our decision to make a patch that allows the
blkio stats (related to the proportional-share policy) not to be enabled
at boot, or when the CFQ or BFQ modules are loaded. It's just that we
have already prepared the present patches, plus a few other patches for
improving BFQ and fixing bugs, and I'd like to clear this backlog first.

In this respect, after a patch for boosting throughput more reliably
with cooperating processes, I'll send out a patch to solve an
important read starvation problem. If I'm not making a blunder, this
problem affects every I/O scheduler in blk-mq. As a first step, I'll
propose a fix for BFQ. If the fix is ok, I'm willing to port it to the
other schedulers.

Thanks,
Paolo

Angelo Ruocco (2):
  block, bfq: check low_latency flag in bfq_bfqq_save_state()
  block, bfq: remove superfluous check in queue-merging setup

Paolo Valente (2):
  block, bfq: add missing rq_pos_tree update on rq removal
  block, bfq: let a queue be merged only shortly after starting I/O

 block/bfq-iosched.c | 98 ++---
 block/bfq-iosched.h |  2 ++
 block/bfq-wf2q.c|  4 +++
 3 files changed, 61 insertions(+), 43 deletions(-)

--
2.15.1


[PATCH blktests] Fix syntax error with bash v4.1.2(e.g RHEL6)

2017-12-20 Thread xiao yang
The -v option is not supported by conditional expressions in
bash v4.1.2, so use -n instead of -v to fix this issue.
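
For reference, a small sketch of the behavioural difference (not part of
the patch); the two tests differ only for a variable that is set but empty:

        unset TIMEOUT
        [[ -v TIMEOUT ]]        # bash >= 4.2 only: true iff the variable is set
        [[ -n $TIMEOUT ]]       # any bash: true iff the expansion is non-empty

        TIMEOUT=
        [[ -v TIMEOUT ]]        # true: the variable is set, although empty
        [[ -n $TIMEOUT ]]       # false: the expansion is empty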

Signed-off-by: xiao yang 
---
 check| 12 ++--
 common/fio   |  2 +-
 common/rc|  2 +-
 tests/meta/group |  4 ++--
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/check b/check
index 95d269a..7fbdc70 100755
--- a/check
+++ b/check
@@ -178,7 +178,7 @@ _output_status() {
local test="$1"
local status="$2"
 
-   if [[ -v DESCRIPTION ]]; then
+   if [[ -n $DESCRIPTION ]]; then
printf '%-60s' "$test ($DESCRIPTION)"
else
printf '%-60s' "$test"
@@ -215,7 +215,7 @@ _output_notrun() {
 }
 
 _output_last_test_run() {
-   if [[ -v TEST_DEV ]]; then
+   if [[ -n $TEST_DEV ]]; then
_output_status "$TEST_NAME => $(basename "$TEST_DEV")" ""
else
_output_status "$TEST_NAME" ""
@@ -240,7 +240,7 @@ _output_test_run() {
tput cuu $((${#LAST_TEST_RUN[@]} - 3))
fi
 
-   if [[ -v TEST_DEV ]]; then
+   if [[ -n $TEST_DEV ]]; then
_output_status "$TEST_NAME => $(basename "$TEST_DEV")" 
"${TEST_RUN[status]}ed"
else
_output_status "$TEST_NAME" "${TEST_RUN[status]}ed"
@@ -270,7 +270,7 @@ _output_test_run() {
 }
 
 _cleanup() {
-   if [[ -v TMPDIR ]]; then
+   if [[ -n $TMPDIR ]]; then
rm -rf "$TMPDIR"
unset TMPDIR
fi
@@ -282,7 +282,7 @@ _cleanup() {
unset TEST_DEV_QUEUE_SAVED["$key"]
done
 
-   if [[ -v RESTORE_CPUS_ONLINE ]]; then
+   if [[ -n $RESTORE_CPUS_ONLINE ]]; then
local cpu
for cpu in "${!CPUS_ONLINE_SAVED[@]}"; do
echo "${CPUS_ONLINE_SAVED["$cpu"]}" 
>"/sys/devices/system/cpu/cpu$cpu/online"
@@ -599,7 +599,7 @@ while true; do
esac
 done
 
-if [[ QUICK_RUN -ne 0 && ! -v TIMEOUT ]]; then
+if [[ QUICK_RUN -ne 0 && ! -n $TIMEOUT ]]; then
_error "QUICK_RUN specified without TIMEOUT"
 fi
 
diff --git a/common/fio b/common/fio
index f5787b4..eb21038 100644
--- a/common/fio
+++ b/common/fio
@@ -160,7 +160,7 @@ _fio_perf() {
 # You should usually use this instead of calling fio directly.
 _run_fio() {
local args=("--output=$TMPDIR/fio_perf" "--output-format=terse" 
"--terse-version=4" "--group_reporting=1")
-   if [[ -v TIMEOUT ]]; then
+   if [[ -n $TIMEOUT ]]; then
args+=("--runtime=$TIMEOUT")
fi
fio "${args[@]}" "$@"
diff --git a/common/rc b/common/rc
index 09ce4ef..0fdab99 100644
--- a/common/rc
+++ b/common/rc
@@ -35,7 +35,7 @@ _error() {
 # for TIMEOUT / number of subtests.
 _divide_timeout() {
local num_tests="$1"
-   if [[ -v TIMEOUT ]]; then
+   if [[ -n $TIMEOUT ]]; then
((TIMEOUT = (TIMEOUT + num_tests - 1) / num_tests))
fi
 }
diff --git a/tests/meta/group b/tests/meta/group
index 4281ea1..012a700 100644
--- a/tests/meta/group
+++ b/tests/meta/group
@@ -18,7 +18,7 @@
 # along with this program.  If not, see .
 
 group_requires() {
-   if [[ -v META_REQUIRES_SKIP ]]; then
+   if [[ -n $META_REQUIRES_SKIP ]]; then
SKIP_REASON="META_REQUIRES_SKIP was set"
return 1
fi
@@ -26,7 +26,7 @@ group_requires() {
 }
 
 group_device_requires() {
-   if [[ -v META_DEVICE_REQUIRES_SKIP ]]; then
+   if [[ -n $META_DEVICE_REQUIRES_SKIP ]]; then
SKIP_REASON="META_DEVICE_REQUIRES_SKIP was set"
return 1
fi
-- 
1.8.3.1





Re: [PATCH blktests] block/013: Add test for BLKRRPART ioctl

2017-12-20 Thread Johannes Thumshirn
Omar Sandoval  writes:

> On Tue, Dec 19, 2017 at 11:47:09AM +0100, Johannes Thumshirn wrote:
>> xiao yang  writes:
>> 
>> > +requires() {
>> > +  _have_program mkfs.ext3
>> > +}
>> [...]
>> > +  # Format
>> > +  mkfs.ext3 -F "$TEST_DEV" >> "$FULL" 2>&1
>> 
>> What's the reason to limit the test case to ext3 only? Can you switch it
>> to the generic 'mkfs' command? IIRC we require 'util-linux' to be
>> present for blktests but mkfs.ext3 is (at least in SUSE based distros)
>> in the e2fsprogs package.
>
> I'm fine with only testing one specific filesystem, since we're really
> testing generic functionality and not any filesystem code. And according
> to the manpage, bare mkfs is deprecated. I'm fine with requiring
> e2fsprogs for this test, and xiao yang added the _have_program there for
> it.

what about something like (totally untested):

if _have_program mkfs.ext3; then
   mkfs_prog="mkfs.ext3"
   return 1
elif _have_program mkfs.xfs; then
   mkfs_prog="mkfs.xfs"
   return 1
elif _have_program mkfs.btrfs; then
   mkfs_prog="mkfs.btrfs"
   return 1
else
   return 0
fi


-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850