[PATCH 1/3] kthread: add a mechanism to store cgroup info

2017-09-06 Thread Shaohua Li
From: Shaohua Li 

A kthread usually runs jobs on behalf of other threads. The jobs should be
charged to the cgroups of the original threads, but because the jobs run in
a kthread, the cgroup context of the original threads is lost. This patch
adds a mechanism to record the cgroup info of the original thread in the
kthread context, so we can later retrieve it and attach it to the jobs.

Since this mechanism is only required by kthreads, we store the cgroup
info in the kthread data instead of the generic task_struct.
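
A minimal usage sketch (hypothetical driver code, mirroring what the loop
conversion in patch 3 does; the my_cmd names are made up and error handling
is omitted):

#include <linux/kthread.h>
#include <linux/cgroup.h>

/* Hypothetical per-command state of a driver that hands IO to a kthread. */
struct my_cmd {
        struct cgroup_subsys_state *css;        /* issuer's css, reference held */
};

/* Issuing context: remember the issuer's css alongside the queued command. */
static void my_cmd_prepare(struct my_cmd *cmd, struct cgroup_subsys_state *css)
{
        cmd->css = css;
        if (cmd->css)
                css_get(cmd->css);
}

/* Worker kthread: run the job under the original cgroup context. */
static void my_cmd_execute(struct my_cmd *cmd)
{
        if (cmd->css)
                kthread_set_orig_css(cmd->css);

        /* ... submit the actual IO here; it is charged to cmd->css ... */

        kthread_reset_orig_css();
        if (cmd->css)
                css_put(cmd->css);
}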

Signed-off-by: Shaohua Li 
---
 include/linux/kthread.h | 13 +++
 include/linux/sched.h   |  1 +
 kernel/kthread.c| 57 -
 3 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 82e197e..3eb24d1 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -3,6 +3,7 @@
 /* Simple interface for creating and stopping kernel threads without mess. */
 #include 
 #include 
+#include 
 
 __printf(4, 5)
 struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
@@ -198,4 +199,16 @@ bool kthread_cancel_delayed_work_sync(struct kthread_delayed_work *work);
 
 void kthread_destroy_worker(struct kthread_worker *worker);
 
+#ifdef CONFIG_CGROUPS
+void kthread_set_orig_css(struct cgroup_subsys_state *css);
+struct cgroup_subsys_state *kthread_get_orig_css(void);
+void kthread_reset_orig_css(void);
+#else
+static inline void kthread_set_orig_css(struct cgroup_subsys_state *css) { }
+static inline struct cgroup_subsys_state *kthread_get_orig_css(void)
+{
+   return NULL;
+}
+static inline void kthread_reset_orig_css(void) { }
+#endif
 #endif /* _LINUX_KTHREAD_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c05ac5f..ab2295d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1276,6 +1276,7 @@ extern struct pid *cad_pid;
 #define PF_SWAPWRITE   0x0080  /* Allowed to write to swap */
 #define PF_NO_SETAFFINITY  0x0400  /* Userland is not allowed to meddle with cpus_allowed */
 #define PF_MCE_EARLY   0x0800  /* Early kill for mce process policy */
+#define PF_KTHREAD_HAS_CSS 0x1000  /* kthread has css info attached */
 #define PF_MUTEX_TESTER0x2000  /* Thread belongs to the rt mutex tester */
 #define PF_FREEZER_SKIP0x4000  /* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK0x8000  /* This thread called freeze_processes() and should not be frozen */
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 26db528..d084ef3 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -20,7 +20,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 static DEFINE_SPINLOCK(kthread_create_lock);
@@ -47,6 +46,7 @@ struct kthread {
void *data;
struct completion parked;
struct completion exited;
+   struct cgroup_subsys_state *orig_css;
 };
 
 enum KTHREAD_BITS {
@@ -1153,3 +1153,58 @@ void kthread_destroy_worker(struct kthread_worker *worker)
kfree(worker);
 }
 EXPORT_SYMBOL(kthread_destroy_worker);
+
+#ifdef CONFIG_CGROUPS
+/**
+ * kthread_set_orig_css - set the original css for current thread
+ * @css: the cgroup info
+ *
+ * The current thread must be a kthread. The thread runs jobs on behalf of
+ * other threads. In some cases, we expect the jobs to be attached to the
+ * cgroup info of the original threads instead of that of the current thread.
+ * This function stores the original thread's cgroup info in the current
+ * kthread context for later retrieval.
+ */
+void kthread_set_orig_css(struct cgroup_subsys_state *css)
+{
+   struct kthread *kthread = to_kthread(current);
+
+   if (css) {
+   css_get(css);
+   kthread->orig_css = css;
+   current->flags |= PF_KTHREAD_HAS_CSS;
+   }
+}
+EXPORT_SYMBOL(kthread_set_orig_css);
+
+/**
+ * kthread_get_orig_css - get the stored original cgroup info
+ *
+ * Current thread must be a kthread.
+ */
+struct cgroup_subsys_state *kthread_get_orig_css(void)
+{
+   if (current->flags & PF_KTHREAD_HAS_CSS)
+   return to_kthread(current)->orig_css;
+   return NULL;
+}
+EXPORT_SYMBOL(kthread_get_orig_css);
+
+/**
+ * kthread_reset_orig_css - clear stored cgroup info
+ *
+ * Current thread must be a kthread.
+ */
+void kthread_reset_orig_css(void)
+{
+   struct kthread *kthread = to_kthread(current);
+   struct cgroup_subsys_state *css;
+
+   css = kthread->orig_css;
+   if (css)
+   css_put(css);
+   kthread->orig_css = NULL;
+   current->flags &= ~PF_KTHREAD_HAS_CSS;
+}
+EXPORT_SYMBOL(kthread_reset_orig_css);
+#endif
-- 
2.9.5



[PATCH 2/3] block: make blkcg aware of kthread stored original cgroup info

2017-09-06 Thread Shaohua Li
From: Shaohua Li 

Several blkcg APIs are deprecated. After removing them, bio_blkcg() is the
only API to get the cgroup info for a bio. If bio_blkcg() finds that the
current task is a kthread with an original css recorded, it uses that css
instead of associating the bio with the current task.

Signed-off-by: Shaohua Li 
---
 block/bio.c| 31 ---
 include/linux/bio.h|  2 --
 include/linux/blk-cgroup.h | 25 +++--
 3 files changed, 7 insertions(+), 51 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 6745759..9271fa3 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -2033,37 +2033,6 @@ int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css)
 EXPORT_SYMBOL_GPL(bio_associate_blkcg);
 
 /**
- * bio_associate_current - associate a bio with %current
- * @bio: target bio
- *
- * Associate @bio with %current if it hasn't been associated yet.  Block
- * layer will treat @bio as if it were issued by %current no matter which
- * task actually issues it.
- *
- * This function takes an extra reference of @task's io_context and blkcg
- * which will be put when @bio is released.  The caller must own @bio,
- * ensure %current->io_context exists, and is responsible for synchronizing
- * calls to this function.
- */
-int bio_associate_current(struct bio *bio)
-{
-   struct io_context *ioc;
-
-   if (bio->bi_css)
-   return -EBUSY;
-
-   ioc = current->io_context;
-   if (!ioc)
-   return -ENOENT;
-
-   get_io_context_active(ioc);
-   bio->bi_ioc = ioc;
-   bio->bi_css = task_get_css(current, io_cgrp_id);
-   return 0;
-}
-EXPORT_SYMBOL_GPL(bio_associate_current);
-
-/**
  * bio_disassociate_task - undo bio_associate_current()
  * @bio: target bio
  */
diff --git a/include/linux/bio.h b/include/linux/bio.h
index a8fe793..d795cdd 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -514,13 +514,11 @@ do {  \
 
 #ifdef CONFIG_BLK_CGROUP
 int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css);
-int bio_associate_current(struct bio *bio);
 void bio_disassociate_task(struct bio *bio);
 void bio_clone_blkcg_association(struct bio *dst, struct bio *src);
 #else  /* CONFIG_BLK_CGROUP */
 static inline int bio_associate_blkcg(struct bio *bio,
struct cgroup_subsys_state *blkcg_css) { return 0; }
-static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }
 static inline void bio_disassociate_task(struct bio *bio) { }
 static inline void bio_clone_blkcg_association(struct bio *dst,
struct bio *src) { }
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 9d92153..0cdcf6b 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* percpu_counter batch for blkg_[rw]stats, per-cpu drift doesn't matter */
 #define BLKG_STAT_CPU_BATCH(INT_MAX / 2)
@@ -223,22 +224,16 @@ static inline struct blkcg *css_to_blkcg(struct cgroup_subsys_state *css)
return css ? container_of(css, struct blkcg, css) : NULL;
 }
 
-static inline struct blkcg *task_blkcg(struct task_struct *tsk)
-{
-   return css_to_blkcg(task_css(tsk, io_cgrp_id));
-}
-
 static inline struct blkcg *bio_blkcg(struct bio *bio)
 {
+   struct cgroup_subsys_state *css;
+
if (bio && bio->bi_css)
return css_to_blkcg(bio->bi_css);
-   return task_blkcg(current);
-}
-
-static inline struct cgroup_subsys_state *
-task_get_blkcg_css(struct task_struct *task)
-{
-   return task_get_css(task, io_cgrp_id);
+   css = kthread_get_orig_css();
+   if (css)
+   return css_to_blkcg(css);
+   return css_to_blkcg(task_css(current, io_cgrp_id));
 }
 
 /**
@@ -735,12 +730,6 @@ struct blkcg_policy {
 
 #define blkcg_root_css ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))
 
-static inline struct cgroup_subsys_state *
-task_get_blkcg_css(struct task_struct *task)
-{
-   return NULL;
-}
-
 #ifdef CONFIG_BLOCK
 
 static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { return NULL; }
-- 
2.9.5



[PATCH 3/3] block/loop: make loop cgroup aware

2017-09-06 Thread Shaohua Li
From: Shaohua Li 

The loop block device handles IO in a separate thread. The IO it actually
dispatches isn't cloned from the IO the loop device received, so the
dispatched IO loses the cgroup context.

I'm ignoring the buffered IO case for now, which is quite complicated.
Making the loop thread aware of the cgroup context doesn't really help
there: the loop device only writes to a single file, and in the current
writeback cgroup implementation a file can only belong to one cgroup.

For the direct IO case, we could work around the issue in theory. For
example, say we assign cgroup1 5MB/s of bandwidth for the loop device and
cgroup2 10MB/s. We could create a special cgroup for the loop thread and
assign it at least 15MB/s on the underlying disk. In this way the two
cgroups are throttled correctly, but this is tricky to set up.

This patch tries to address the issue. We record the bio's css in the loop
command. When the loop thread handles the command, we use the API added in
patch 1 to set the css for the current task, and the bio layer will then
use that css for new IO.

Signed-off-by: Shaohua Li 
---
 drivers/block/loop.c | 13 +
 drivers/block/loop.h |  1 +
 2 files changed, 14 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 9d4545f..9850b27 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -482,6 +482,8 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
 {
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
 
+   if (cmd->css)
+   css_put(cmd->css);
cmd->ret = ret > 0 ? 0 : ret;
lo_rw_aio_do_completion(cmd);
 }
@@ -541,6 +543,8 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
cmd->iocb.ki_filp = file;
cmd->iocb.ki_complete = lo_rw_aio_complete;
cmd->iocb.ki_flags = IOCB_DIRECT;
+   if (cmd->css)
+   kthread_set_orig_css(cmd->css);
 
if (rw == WRITE)
ret = call_write_iter(file, >iocb, );
@@ -548,6 +552,7 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
ret = call_read_iter(file, >iocb, );
 
lo_rw_aio_do_completion(cmd);
+   kthread_reset_orig_css();
 
if (ret != -EIOCBQUEUED)
cmd->iocb.ki_complete(>iocb, ret, 0);
@@ -1692,6 +1697,14 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
break;
}
 
+   /* always use the first bio's css */
+#ifdef CONFIG_CGROUPS
+   if (cmd->use_aio && cmd->rq->bio && cmd->rq->bio->bi_css) {
+   cmd->css = cmd->rq->bio->bi_css;
+   css_get(cmd->css);
+   } else
+#endif
+   cmd->css = NULL;
kthread_queue_work(>worker, >work);
 
return BLK_STS_OK;
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index f68c1d5..d93b669 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -74,6 +74,7 @@ struct loop_cmd {
long ret;
struct kiocb iocb;
struct bio_vec *bvec;
+   struct cgroup_subsys_state *css;
 };
 
 /* Support for loadable transfer modules */
-- 
2.9.5



[PATCH 0/3] block: make loop block device cgroup aware

2017-09-06 Thread Shaohua Li
From: Shaohua Li 

Hi,

The IO dispatched to the underlying disk by the loop block device isn't
cloned from the original bio, so it loses the original bio's cgroup
information and escapes cgroup control. These patches try to address the
issue. The idea is quite generic, but we currently only make it work for
blkcg.

Thanks,
Shaohua

Shaohua Li (3):
  kthread: add a mechanism to store cgroup info
  block: make blkcg aware of kthread stored original cgroup info
  block/loop: make loop cgroup aware

 block/bio.c| 31 -
 drivers/block/loop.c   | 13 +++
 drivers/block/loop.h   |  1 +
 include/linux/bio.h|  2 --
 include/linux/blk-cgroup.h | 25 ++--
 include/linux/kthread.h| 13 +++
 include/linux/sched.h  |  1 +
 kernel/kthread.c   | 57 +-
 8 files changed, 91 insertions(+), 52 deletions(-)

-- 
2.9.5



[PATCH V2 1/3] block/loop: don't hijack error number

2017-09-06 Thread Shaohua Li
From: Shaohua Li 

If the bio returns -EOPNOTSUPP, we shouldn't hijack it and return -EIO instead.
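
For illustration, a sketch of the mapping this relies on (my_errno_to_blk_status()
is a made-up stand-in; the real errno_to_blk_status() in block/blk-core.c is
table driven):

/* Sketch only: -EOPNOTSUPP keeps its own status instead of becoming -EIO. */
static blk_status_t my_errno_to_blk_status(int err)
{
        switch (err) {
        case 0:
                return BLK_STS_OK;
        case -EOPNOTSUPP:
                return BLK_STS_NOTSUPP;
        default:
                return BLK_STS_IOERR;
        }
}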

Signed-off-by: Shaohua Li 
---
 drivers/block/loop.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 85de673..715b762 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -460,7 +460,7 @@ static void lo_complete_rq(struct request *rq)
zero_fill_bio(bio);
}
 
-   blk_mq_end_request(rq, cmd->ret < 0 ? BLK_STS_IOERR : BLK_STS_OK);
+   blk_mq_end_request(rq, errno_to_blk_status(cmd->ret));
 }
 
 static void lo_rw_aio_do_completion(struct loop_cmd *cmd)
@@ -476,7 +476,7 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
 {
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
 
-   cmd->ret = ret;
+   cmd->ret = ret > 0 ? 0 : ret;
lo_rw_aio_do_completion(cmd);
 }
 
@@ -1706,7 +1706,7 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
  failed:
/* complete non-aio request */
if (!cmd->use_aio || ret) {
-   cmd->ret = ret ? -EIO : 0;
+   cmd->ret = ret;
blk_mq_complete_request(cmd->rq);
}
 }
-- 
2.9.5



Re: [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance

2017-09-06 Thread Tom Nguyen
Likewise with no problems on my work laptop with 4 days uptime.

Tested-by: Tom Nguyen 


On 09/07/2017 04:09 AM, Oleksandr Natalenko wrote:
> Feel free to add:
>
> Tested-by: Oleksandr Natalenko 
>
> since I'm running this on 4 machines without issues.
>
>> Hi Jens,
>>
>> Ping...




Re: [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance

2017-09-06 Thread Oleksandr Natalenko
Feel free to add:

Tested-by: Oleksandr Natalenko 

since I'm running this on 4 machines without issues.

> Hi Jens,
>
> Ping...


Re: [PATCH 1/6] lightnvm: pblk: check for failed mempool alloc.

2017-09-06 Thread Javier González
> On 6 Sep 2017, at 17.20, Jens Axboe  wrote:
> 
> On 09/06/2017 09:13 AM, Jens Axboe wrote:
>> On 09/06/2017 09:12 AM, Javier González wrote:
 On 6 Sep 2017, at 17.09, Jens Axboe  wrote:
 
 On 09/06/2017 09:08 AM, Johannes Thumshirn wrote:
> On Wed, Sep 06, 2017 at 05:01:01PM +0200, Javier González wrote:
>> Check for failed mempool allocations and act accordingly.
> 
> Are you sure it is needed? Quoting from mempool_alloc()s Documentation:
> "[...] Note that due to preallocation, this function *never* fails when 
> called
> from process contexts. (it might fail if called from an IRQ context.) 
> [...]"
 
 It's not needed, mempool() will never fail if __GFP_WAIT is set in the
 mask. The use case here is GFP_KERNEL, which does include __GFP_WAIT.
>>> 
>>> Thanks for the clarification. Do you just drop the patch, or do you want
>>> me to re-send the series?
>> 
>> No need to resend. I'll pick up the others in a day or two, once people
>> have had some time to go over them.
> 
> I took a quick look at your mempool usage, and I'm not sure it's
> correct.  For a mempool to work, you have to ensure that you provide a
> forward progress guarantee. With that guarantee, you know that if you do
> end up sleeping on allocation, you already have items inflight that will
> be freed when that operation completes. In other words, all allocations
> must have a defined and finite life time, as any allocation can
> potentially sleep/block for that life time. You can't allocate something
> and hold on to it forever, then you are violating the terms of agreement
> that makes a mempool work.

I understood the part about guaranteeing the number of inflight items to
keep the mempool active without waiting, but I must admit that I assumed
the mempool would resize under pressure and that the penalty would be
increased latency, not the mempool giving up and causing a deadlock.

> 
> The first one that caught my eye is pblk->page_pool. You have this loop:
> 
> for (i = 0; i < nr_pages; i++) {
>page = mempool_alloc(pblk->page_pool, flags);
>if (!page)
>goto err;
> 
>ret = bio_add_pc_page(q, bio, page, PBLK_EXPOSED_PAGE_SIZE, 0);
>if (ret != PBLK_EXPOSED_PAGE_SIZE) {
>pr_err("pblk: could not add page to bio\n");
>mempool_free(page, pblk->page_pool);
>goto err;
>}
> }
> 
> which looks suspect. This mempool is created with a reserve pool of
> PAGE_POOL_SIZE (16) members. Do we know if the bio has 16 pages or less?
> If not, then this is broken and can deadlock forever.

I can see that in this case, the 16 elements do not hold. In the read
path, we can guarantee that a read will be <= 64 sectors (4KB pages), so
this is definitely wrong. I'll fix it tomorrow.

While we are at it, I have wondered for some time what the right way is to
balance the mempool sizes so that we are a good citizen and still perform
well. I don't see a lot of code using mempool_resize() to tune min_nr based
on load. For example, in our write path the numbers are easy to calculate,
but on the read path I completely over-dimensioned the mempool to ensure we
never have to wait on the completion path. Any good rule of thumb here?

> You have a lot of mempool usage in the code, would probably not hurt to
> audit all of them.

Yes. I will take a look and add comments to the sizes.

Thanks Jens,
Javier


signature.asc
Description: Message signed with OpenPGP


[PATCH 13/13] bcache: initialize dirty stripes in flash_dev_run()

2017-09-06 Thread Coly Li
From: Tang Junhui 

bcache uses a Proportional-Differential (PD) controller algorithm to
control the writeback rate to cached devices. In the PD controller
algorithm, dirty stripes of thin flash devices should not be counted,
because flash-only volumes never write back dirty data.

Currently the dirty stripe counter for a thin flash device is not
initialized when the thin flash device starts, which means the subsequent
calculation in the PD controller references an undefined number of dirty
stripes, and all cached devices attached to the same cache set as the thin
flash device may get an inaccurate writeback rate.
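
For illustration, a schematic PD step (made-up names; bcache's real code is
in __update_writeback_rate()) shows how the dirty counters feed the rate and
why an uninitialized counter corrupts it:

/* Schematic PD controller step, not bcache's exact code. */
static s64 pd_update_rate(s64 rate, s64 dirty, s64 target,
                          s64 *last_dirty, s64 p_inv, s64 d_inv)
{
        s64 proportional = (dirty - target) / p_inv;    /* error term */
        s64 derivative = (dirty - *last_dirty) / d_inv; /* change term */

        *last_dirty = dirty;
        /* If 'dirty' or '*last_dirty' was never initialized, both terms are garbage. */
        return rate + proportional + derivative;
}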

This patch calls bch_sectors_dirty_init() in flash_dev_run() to correctly
initialize the dirty stripe counter when the thin flash device starts to
run. The patch also makes the following parameter type change,
 -void bch_sectors_dirty_init(struct cached_dev *dc);
 +void bch_sectors_dirty_init(struct bcache_device *);
so that the function can be called conveniently from flash_dev_run().

(Commit log is composed by Coly Li)

Signed-off-by: Tang Junhui 
Reviewed-by: Coly Li 
Cc: sta...@vger.kernel.org
---
 drivers/md/bcache/super.c | 3 ++-
 drivers/md/bcache/writeback.c | 8 
 drivers/md/bcache/writeback.h | 2 +-
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 8352fad765f6..60d2c3ce192e 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1026,7 +1026,7 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c)
}
 
if (BDEV_STATE(>sb) == BDEV_STATE_DIRTY) {
-   bch_sectors_dirty_init(dc);
+   bch_sectors_dirty_init(>disk);
atomic_set(>has_dirty, 1);
atomic_inc(>count);
bch_writeback_queue(dc);
@@ -1228,6 +1228,7 @@ static int flash_dev_run(struct cache_set *c, struct uuid_entry *u)
goto err;
 
bcache_device_attach(d, c, u - c->uuids);
+   bch_sectors_dirty_init(d);
bch_flash_dev_request_init(d);
add_disk(d->disk);
 
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 42c66e76f05e..4fd332a7402e 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -482,17 +482,17 @@ static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b,
return MAP_CONTINUE;
 }
 
-void bch_sectors_dirty_init(struct cached_dev *dc)
+void bch_sectors_dirty_init(struct bcache_device *d)
 {
struct sectors_dirty_init op;
 
bch_btree_op_init(, -1);
-   op.inode = dc->disk.id;
+   op.inode = d->id;
 
-   bch_btree_map_keys(, dc->disk.c, (op.inode, 0, 0),
+   bch_btree_map_keys(, d->c, (op.inode, 0, 0),
   sectors_dirty_init_fn, 0);
 
-   dc->disk.sectors_dirty_last = bcache_dev_sectors_dirty(>disk);
+   d->sectors_dirty_last = bcache_dev_sectors_dirty(d);
 }
 
 void bch_cached_dev_writeback_init(struct cached_dev *dc)
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 629bd1a502fd..c992f06c1e9a 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -84,7 +84,7 @@ static inline void bch_writeback_add(struct cached_dev *dc)
 
 void bcache_dev_sectors_dirty_add(struct cache_set *, unsigned, uint64_t, int);
 
-void bch_sectors_dirty_init(struct cached_dev *dc);
+void bch_sectors_dirty_init(struct bcache_device *);
 void bch_cached_dev_writeback_init(struct cached_dev *);
 int bch_cached_dev_writeback_start(struct cached_dev *);
 
-- 
2.13.5



[PATCH] block: cope with WRITE SAME failing in blkdev_issue_zeroout()

2017-09-06 Thread Ilya Dryomov
sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to
permit trying WRITE SAME on older SCSI devices, unless ->no_write_same
is set.  This means blkdev_issue_zeroout() must cope with WRITE SAME
failing with BLK_STS_TARGET/-EREMOTEIO and explicitly write zeroes,
unless BLKDEV_ZERO_NOFALLBACK is specified.

Commit 71027e97d796 ("block: stop using discards for zeroing") added
the following to __blkdev_issue_zeroout() comment:

  "Note that this function may fail with -EOPNOTSUPP if the driver signals
  zeroing offload support, but the device fails to process the command (for
  some devices there is no non-destructive way to verify whether this
  operation is actually supported).  In this case the caller should call
  retry the call to blkdev_issue_zeroout() and the fallback path will be used."

But __blkdev_issue_zeroout() doesn't fail in this case: if WRITE SAME
support is indicated, a REQ_OP_WRITE_ZEROES bio is built and returned
to blkdev_issue_zeroout().  -EREMOTEIO is then propagated up:

  $ fallocate -zn -l 1k /dev/sdg
  fallocate: fallocate failed: Remote I/O error
  $ fallocate -zn -l 1k /dev/sdg  # OK

(The second fallocate(1) succeeds because sd_done() sets ->no_write_same
in response to a sense that would become BLK_STS_TARGET.)

Retry __blkdev_issue_zeroout() if the I/O fails with -EREMOTEIO.  This
is roughly what we did until 4.12, sans BLKDEV_ZERO_NOFALLBACK knob.

Fixes: c20cfc27a473 ("block: stop using blkdev_issue_write_same for zeroing")
Cc: Christoph Hellwig 
Cc: "Martin K. Petersen" 
Cc: Hannes Reinecke 
Cc: sta...@vger.kernel.org # 4.12+
Signed-off-by: Ilya Dryomov 
---
 block/blk-lib.c | 49 +
 1 file changed, 33 insertions(+), 16 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 3fe0aec90597..876b5478c1a2 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -287,12 +287,6 @@ static unsigned int __blkdev_sectors_to_bio_pages(sector_t nr_sects)
  *  Zero-fill a block range, either using hardware offload or by explicitly
  *  writing zeroes to the device.
  *
- *  Note that this function may fail with -EOPNOTSUPP if the driver signals
- *  zeroing offload support, but the device fails to process the command (for
- *  some devices there is no non-destructive way to verify whether this
- *  operation is actually supported).  In this case the caller should call
- *  retry the call to blkdev_issue_zeroout() and the fallback path will be used.
- *
  *  If a device is using logical block provisioning, the underlying space will
  *  not be released if %flags contains BLKDEV_ZERO_NOUNMAP.
  *
@@ -343,6 +337,34 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 }
 EXPORT_SYMBOL(__blkdev_issue_zeroout);
 
+/*
+ * This function may fail with -EREMOTEIO if the driver signals zeroing
+ * offload support, but the device fails to process the command (for
+ * some devices there is no non-destructive way to verify whether this
+ * operation is actually supported).  In this case the caller should
+ * retry and the fallback path in __blkdev_issue_zeroout() will be used
+ * unless %flags contains BLKDEV_ZERO_NOFALLBACK.
+ */
+static int __blkdev_issue_zeroout_wait(struct block_device *bdev,
+  sector_t sector, sector_t nr_sects,
+  gfp_t gfp_mask, unsigned flags)
+{
+   int ret;
+   struct bio *bio = NULL;
+   struct blk_plug plug;
+
+   blk_start_plug();
+   ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
+   , flags);
+   if (ret == 0 && bio) {
+   ret = submit_bio_wait(bio);
+   bio_put(bio);
+   }
+   blk_finish_plug();
+
+   return ret;
+}
+
 /**
  * blkdev_issue_zeroout - zero-fill a block range
  * @bdev:  blockdev to write
@@ -360,17 +382,12 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, unsigned flags)
 {
int ret;
-   struct bio *bio = NULL;
-   struct blk_plug plug;
 
-   blk_start_plug();
-   ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
-   , flags);
-   if (ret == 0 && bio) {
-   ret = submit_bio_wait(bio);
-   bio_put(bio);
-   }
-   blk_finish_plug();
+   ret = __blkdev_issue_zeroout_wait(bdev, sector, nr_sects,
+ gfp_mask, flags);
+   if (ret == -EREMOTEIO)
+   ret = __blkdev_issue_zeroout_wait(bdev, sector, nr_sects,
+ gfp_mask, flags);
 
return ret;
 }
-- 
2.4.3



Re: [PATCH 00/13] bcache: fixes and update for 4.14

2017-09-06 Thread Coly Li
On 2017/9/6 at 11:46 PM, Jens Axboe wrote:
> On 09/06/2017 09:41 AM, Coly Li wrote:
>> On 2017/9/6 at 10:20 PM, Jens Axboe wrote:
>>> On Wed, Sep 06 2017, Coly Li wrote:
 Hi Jens,

 Here are 12 patchs for bcache fixes and updates, most of them were posted
 by Eric Wheeler in 4.13 merge window, but delayed due to lack of code
 review.
>>>
>>> Next time, please send this _before_ the merge window opens. Not a huge
>>> problem for this series, since it has been posted numerous times in the
>>> past and already has good review coverage, but it really should have
>>> been in linux-next for a week or two before heading upstream.
>>>
>> Hi Jens,
>>
>> Copied, send patches _before_ merge window opens.
>>
>> But could you please to give me a hint, how can I find when a specific
>> merge window will open ? I search LKML, and see people send pull
>> requests for 4.14 merge window, but I don't see any announcement for
>> 4.14 merge window time slot.
> 
> The merge window opens when Linus releases the previous kernel. So you
> have to try and keep track of when he expects to do that. That really
> isn't that difficult - Linus always cuts a release on Sundays. We always
> have 7 or 8 rc releases, so a good time to check is his mail on -rc6 or
> -rc7 release, that usually gives a good indication of when he expects to
> release the final.
> 
> To keep things simple, if you always have things ready when -rc7 is
> released, then you are at least one week out from the merge window,
> possibly two if we end up doing an -rc8. That means you don't have to
> track anything, you know the exact date of when -rc7 is released since
> Linus's schedule is usually reliable.
> 

Hi Jens,

Copied, I will follow the above rule in the next cycle.

And I just posted the last patch
(0013-bcache-initialize-stripe_sectors_dirty-correctly-for.patch) to you
for the 4.14 cycle.

This patch is an important fix for the bcache writeback rate calculation;
another patch I sent to you
(0006-bcache-correct-cache_dirty_target-in-__update_writeb.patch) depends
on it, so please have it in 4.14.

Thanks in advance.

-- 
Coly Li


Re: [PATCH V6 00/18] blk-throttle: add .low limit

2017-09-06 Thread Shaohua Li
On Wed, Sep 06, 2017 at 09:12:20AM +0800, Joseph Qi wrote:
> Hi Shaohua,
> 
> On 17/9/6 05:02, Shaohua Li wrote:
> > On Thu, Aug 31, 2017 at 09:24:23AM +0200, Paolo VALENTE wrote:
> >>
> >>> Il giorno 15 gen 2017, alle ore 04:42, Shaohua Li  ha 
> >>> scritto:
> >>>
> >>> Hi,
> >>>
> >>> cgroup still lacks a good iocontroller. CFQ works well for hard disk, but 
> >>> not
> >>> much for SSD. This patch set try to add a conservative limit for 
> >>> blk-throttle.
> >>> It isn't a proportional scheduling, but can help prioritize cgroups. 
> >>> There are
> >>> several advantages we choose blk-throttle:
> >>> - blk-throttle resides early in the block stack. It works for both bio and
> >>>  request based queues.
> >>> - blk-throttle is light weight in general. It still takes queue lock, but 
> >>> it's
> >>>  not hard to implement a per-cpu cache and remove the lock contention.
> >>> - blk-throttle doesn't use 'idle disk' mechanism, which is used by 
> >>> CFQ/BFQ. The
> >>>  mechanism is proved to harm performance for fast SSD.
> >>>
> >>> The patch set add a new io.low limit for blk-throttle. It's only for 
> >>> cgroup2.
> >>> The existing io.max is a hard limit throttling. cgroup with a max limit 
> >>> never
> >>> dispatch more IO than its max limit. While io.low is a best effort 
> >>> throttling.
> >>> cgroups with 'low' limit can run above their 'low' limit at appropriate 
> >>> time.
> >>> Specifically, if all cgroups reach their 'low' limit, all cgroups can run 
> >>> above
> >>> their 'low' limit. If any cgroup runs under its 'low' limit, all other 
> >>> cgroups
> >>> will run according to their 'low' limit. So the 'low' limit could act as 
> >>> two
> >>> roles, it allows cgroups using free bandwidth and it protects cgroups from
> >>> their 'low' limit.
> >>>
> >>> An example usage is we have a high prio cgroup with high 'low' limit and 
> >>> a low
> >>> prio cgroup with low 'low' limit. If the high prio cgroup isn't running, 
> >>> the low
> >>> prio can run above its 'low' limit, so we don't waste the bandwidth. When 
> >>> the
> >>> high prio cgroup runs and is below its 'low' limit, low prio cgroup will 
> >>> run
> >>> under its 'low' limit. This will protect high prio cgroup to get more
> >>> resources.
> >>>
> >>
> >> Hi Shaohua,
> > 
> > Hi,
> > 
> > Sorry for the late response.
> >> I would like to ask you some questions, to make sure I fully
> >> understand how the 'low' limit and the idle-group detection work in
> >> your above scenario.  Suppose that: the drive has a random-I/O peak
> >> rate of 100MB/s, the high prio group has a 'low' limit of 90 MB/s, and
> >> the low prio group has a 'low' limit of 10 MB/s.  If
> >> - the high prio process happens to do, say, only 5 MB/s for a given
> >>   long time
> >> - the low prio process constantly does greedy I/O
> >> - the idle-group detection is not being used
> >> then the low prio process is limited to 10 MB/s during all this time
> >> interval.  And only 10% of the device bandwidth is utilized.
> >>
> >> To recover lost bandwidth through idle-group detection, we need to set
> >> a target IO latency for the high-prio group.  The high prio group
> >> should happen to be below the threshold, and thus to be detected as
> >> idle, leaving the low prio group free too use all the bandwidth.
> >>
> >> Here are my questions:
> >> 1) Is all I wrote above correct?
> > 
> > Yes
> >> 2) In particular, maybe there are other better mechanism to saturate
> >> the bandwidth in the above scenario?
> > 
> > Assume it's the 4) below.
> >> If what I wrote above is correct:
> >> 3) Doesn't fluctuation occur?  I mean: when the low prio group gets
> >> full bandwidth, the latency threshold of the high prio group may be
> >> overcome, causing the high prio group to not be considered idle any
> >> longer, and thus the low prio group to be limited again; this in turn
> >> will cause the threshold to not be overcome any longer, and so on.
> > 
> > That's true. We try to mitigate the fluctuation by increasing the low prio
> > cgroup bandwidth gradually though.
> > 
> >> 4) Is there a way to compute an appropriate target latency of the high
> >> prio group, if it is a generic group, for which the latency
> >> requirements of the processes it contains are only partially known or
> >> completely unknown?  By appropriate target latency, I mean a target
> >> latency that enables the framework to fully utilize the device
> >> bandwidth while the high prio group is doing less I/O than its limit.
> > 
> > Not sure how we can do this. The device max bandwidth varies based on 
> > request
> > size and read/write ratio. We don't know when the max bandwidth is reached.
> > Also I think we must consider a case that the workloads never use the full
> > bandwidth of a disk, which is pretty common for SSD (at least in our
> > environment).
> > 
> I have a question on the base latency tracking.
> From my test on SSD, write latency is much lower than read when doing
> 

Re: [PATCH 00/13] bcache: fixes and update for 4.14

2017-09-06 Thread Jens Axboe
On 09/06/2017 09:41 AM, Coly Li wrote:
> On 2017/9/6 at 10:20 PM, Jens Axboe wrote:
>> On Wed, Sep 06 2017, Coly Li wrote:
>>> Hi Jens,
>>>
>>> Here are 12 patchs for bcache fixes and updates, most of them were posted
>>> by Eric Wheeler in 4.13 merge window, but delayed due to lack of code
>>> review.
>>
>> Next time, please send this _before_ the merge window opens. Not a huge
>> problem for this series, since it has been posted numerous times in the
>> past and already has good review coverage, but it really should have
>> been in linux-next for a week or two before heading upstream.
>>
> Hi Jens,
> 
> Copied, send patches _before_ merge window opens.
> 
> But could you please to give me a hint, how can I find when a specific
> merge window will open ? I search LKML, and see people send pull
> requests for 4.14 merge window, but I don't see any announcement for
> 4.14 merge window time slot.

The merge window opens when Linus releases the previous kernel. So you
have to try and keep track of when he expects to do that. That really
isn't that difficult - Linus always cuts a release on Sundays. We always
have 7 or 8 rc releases, so a good time to check is his mail on -rc6 or
-rc7 release, that usually gives a good indication of when he expects to
release the final.

To keep things simple, if you always have things ready when -rc7 is
released, then you are at least one week out from the merge window,
possibly two if we end up doing an -rc8. That means you don't have to
track anything, you know the exact date of when -rc7 is released since
Linus's schedule is usually reliable.

-- 
Jens Axboe



[PATCH 1/3] block: introduce blk_mq_tagset_iter

2017-09-06 Thread Sagi Grimberg
Add an iterator helper to apply a function to all the tags in a given
tagset, and export it as it will be used outside the block layer later on.
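
A hypothetical caller sketch (the my_count_* names are made up), matching the
signature introduced below:

/* Count all statically allocated requests in a tagset. */
static int my_count_one(void *data, struct request *rq)
{
        unsigned int *count = data;

        (*count)++;
        return 0;               /* returning non-zero stops the iteration */
}

static unsigned int my_count_requests(struct blk_mq_tag_set *set)
{
        unsigned int count = 0;

        blk_mq_tagset_iter(set, &count, my_count_one);
        return count;
}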

Suggested-by: Bart Van Assche 
Signed-off-by: Sagi Grimberg 
---
 block/blk-mq-tag.c | 16 +++-
 include/linux/blk-mq.h |  2 ++
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 6714507aa6c7..bce1c76fc768 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -298,12 +298,12 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
 
-int blk_mq_reinit_tagset(struct blk_mq_tag_set *set,
-int (reinit_request)(void *, struct request *))
+int blk_mq_tagset_iter(struct blk_mq_tag_set *set, void *data,
+int (fn)(void *, struct request *))
 {
int i, j, ret = 0;
 
-   if (WARN_ON_ONCE(!reinit_request))
+   if (WARN_ON_ONCE(!fn))
goto out;
 
for (i = 0; i < set->nr_hw_queues; i++) {
@@ -316,8 +316,7 @@ int blk_mq_reinit_tagset(struct blk_mq_tag_set *set,
if (!tags->static_rqs[j])
continue;
 
-   ret = reinit_request(set->driver_data,
-tags->static_rqs[j]);
+   ret = fn(data, tags->static_rqs[j]);
if (ret)
goto out;
}
@@ -326,6 +325,13 @@ int blk_mq_reinit_tagset(struct blk_mq_tag_set *set,
 out:
return ret;
 }
+EXPORT_SYMBOL_GPL(blk_mq_tagset_iter);
+
+int blk_mq_reinit_tagset(struct blk_mq_tag_set *set,
+int (reinit_request)(void *, struct request *))
+{
+   return blk_mq_tagset_iter(set, set->driver_data, reinit_request);
+}
 EXPORT_SYMBOL_GPL(blk_mq_reinit_tagset);
 
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 50c6485cb04f..6bc29f73a9aa 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -259,6 +259,8 @@ void blk_freeze_queue_start(struct request_queue *q);
 void blk_mq_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 unsigned long timeout);
+int blk_mq_tagset_iter(struct blk_mq_tag_set *set, void *data,
+   int (reinit_request)(void *, struct request *));
 int blk_mq_reinit_tagset(struct blk_mq_tag_set *set,
 int (reinit_request)(void *, struct request *));
 
-- 
2.7.4



[PATCH 0/3] Move tagset reinit to its only current consumer, nvme-core

2017-09-06 Thread Sagi Grimberg
As suggested by Bart Van Assche, get rid of
blk_mq_reinit_tagset and move it to nvme-core (its
only current consumer).

Instead, introduce a more generic tagset iterator helper.

Sagi Grimberg (3):
  block: introduce blk_mq_tagset_iter
  nvme: introduce nvme_reinit_tagset
  block: remove blk_mq_reinit_tagset

 block/blk-mq-tag.c   | 11 +--
 drivers/nvme/host/core.c | 13 +
 drivers/nvme/host/fc.c   |  3 ++-
 drivers/nvme/host/nvme.h |  2 ++
 drivers/nvme/host/rdma.c |  7 +++
 include/linux/blk-mq.h   |  4 ++--
 6 files changed, 27 insertions(+), 13 deletions(-)

-- 
2.7.4



[PATCH 3/3] block: remove blk_mq_reinit_tagset

2017-09-06 Thread Sagi Grimberg
No callers left.

Suggested-by: Bart Van Assche 
Signed-off-by: Sagi Grimberg 
---
 block/blk-mq-tag.c | 7 ---
 include/linux/blk-mq.h | 2 --
 2 files changed, 9 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index bce1c76fc768..c81b40ecd3f1 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -327,13 +327,6 @@ int blk_mq_tagset_iter(struct blk_mq_tag_set *set, void *data,
 }
 EXPORT_SYMBOL_GPL(blk_mq_tagset_iter);
 
-int blk_mq_reinit_tagset(struct blk_mq_tag_set *set,
-int (reinit_request)(void *, struct request *))
-{
-   return blk_mq_tagset_iter(set, set->driver_data, reinit_request);
-}
-EXPORT_SYMBOL_GPL(blk_mq_reinit_tagset);
-
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
void *priv)
 {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 6bc29f73a9aa..cfd64e500d82 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -261,8 +261,6 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 unsigned long timeout);
 int blk_mq_tagset_iter(struct blk_mq_tag_set *set, void *data,
int (reinit_request)(void *, struct request *));
-int blk_mq_reinit_tagset(struct blk_mq_tag_set *set,
-int (reinit_request)(void *, struct request *));
 
 int blk_mq_map_queues(struct blk_mq_tag_set *set);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
-- 
2.7.4



Re: [PATCH V4 00/14] blk-mq-sched: improve SCSI-MQ performance

2017-09-06 Thread Ming Lei
On Tue, Sep 05, 2017 at 09:39:51AM +0800, Ming Lei wrote:
> On Mon, Sep 04, 2017 at 11:12:49AM +0200, Paolo Valente wrote:
> > 
> > > Il giorno 02 set 2017, alle ore 17:17, Ming Lei  ha 
> > > scritto:
> > > 
> > > Hi,
> > > 
> > > In Red Hat internal storage test wrt. blk-mq scheduler, we
> > > found that I/O performance is much bad with mq-deadline, especially
> > > about sequential I/O on some multi-queue SCSI devcies(lpfc, qla2xxx,
> > > SRP...)
> > > 
> > > Turns out one big issue causes the performance regression: requests
> > > are still dequeued from sw queue/scheduler queue even when ldd's
> > > queue is busy, so I/O merge becomes quite difficult to make, then
> > > sequential IO degrades a lot.
> > > 
> > > The 1st five patches improve this situation, and brings back
> > > some performance loss.
> > > 
> > > Patch 6 ~ 7 uses q->queue_depth as hint for setting up
> > > scheduler queue depth.
> > > 
> > > Patch 8 ~ 15 improve bio merge via hash table in sw queue,
> > > which makes bio merge more efficient than current approch
> > > in which only the last 8 requests are checked. Since patch
> > > 6~14 converts to the scheduler way of dequeuing one request
> > > from sw queue one time for SCSI device, and the times of
> > > acquring ctx->lock is increased, and merging bio via hash
> > > table decreases holding time of ctx->lock and should eliminate
> > > effect from patch 14. 
> > > 
> > > With this changes, SCSI-MQ sequential I/O performance is
> > > improved much, Paolo reported that mq-deadline performance
> > > improved much[2] in his dbench test wrt V2. Also performanc
> > > improvement on lpfc/qla2xx was observed with V1.[1]
> > > 
> > > Also Bart worried that this patchset may affect SRP, so provide
> > > test data on SCSI SRP this time:
> > > 
> > > - fio(libaio, bs:4k, dio, queue_depth:64, 64 jobs)
> > > - system(16 cores, dual sockets, mem: 96G)
> > > 
> > >  |v4.13-rc6+*  |v4.13-rc6+   | patched v4.13-rc6+ 
> > > -
> > > IOPS(K)  |  DEADLINE   |NONE |NONE 
> > > -
> > > read  |  587.81 |  511.96 |  518.51 
> > > -
> > > randread  |  116.44 |  142.99 |  142.46 
> > > -
> > > write |  580.87 |   536.4 |  582.15 
> > > -
> > > randwrite |  104.95 |  124.89 |  123.99 
> > > -
> > > 
> > > 
> > >  |v4.13-rc6+   |v4.13-rc6+   | patched v4.13-rc6+ 
> > > -
> > > IOPS(K)  |  DEADLINE   |MQ-DEADLINE  |MQ-DEADLINE  
> > > -
> > > read  |  587.81 |   158.7 |  450.41 
> > > -
> > > randread  |  116.44 |  142.04 |  142.72 
> > > -
> > > write |  580.87 |  136.61 |  569.37 
> > > -
> > > randwrite |  104.95 |  123.14 |  124.36 
> > > -
> > > 
> > > *: v4.13-rc6+ means v4.13-rc6 with block for-next
> > > 
> > > 
> > > Please consider to merge to V4.4.
> > > 
> > > [1] http://marc.info/?l=linux-block=150151989915776=2
> > > [2] https://marc.info/?l=linux-block=150217980602843=2
> > > 
> > > V4:
> > >   - add Reviewed-by tag
> > >   - some trival change: typo fix in commit log or comment,
> > >   variable name, no actual functional change
> > > 
> > > V3:
> > >   - totally round robin for picking req from ctx, as suggested
> > >   by Bart
> > >   - remove one local variable in __sbitmap_for_each_set()
> > >   - drop patches of single dispatch list, which can improve
> > >   performance on mq-deadline, but cause a bit degrade on
> > >   none because all hctxs need to be checked after ->dispatch
> > >   is flushed. Will post it again once it is mature.
> > >   - rebase on v4.13-rc6 with block for-next
> > > 
> > > V2:
> > >   - dequeue request from sw queues in round roubin's style
> > >   as suggested by Bart, and introduces one helper in sbitmap
> > >   for this purpose
> > >   - improve bio merge via hash table from sw queue
> > >   - add comments about using DISPATCH_BUSY state in lockless way,
> > >   simplifying handling on busy state,
> > >   - hold ctx->lock when clearing ctx busy bit as suggested
> > >   by Bart
> > > 
> > > 
> > 
> > Tested-by: Paolo Valente 
> 
> Hi Jens,
> 
> Is there any chance to make this patchset merged to V4.4?

Hi Jens,

Ping...

Thanks,
Ming


Re: [PATCH 5/6] lightnvm: pblk: fix write I/O sync stat

2017-09-06 Thread Johannes Thumshirn

Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 4/6] lightnvm: pblk: free padded entries in write buffer

2017-09-06 Thread Johannes Thumshirn
On Wed, Sep 06, 2017 at 05:01:04PM +0200, Javier González wrote:
> the completion pah. This might require some sectors to be padded in
 ^ path

Looks good otherwise,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 1/6] lightnvm: pblk: check for failed mempool alloc.

2017-09-06 Thread Jens Axboe
On 09/06/2017 09:13 AM, Jens Axboe wrote:
> On 09/06/2017 09:12 AM, Javier González wrote:
>>> On 6 Sep 2017, at 17.09, Jens Axboe  wrote:
>>>
>>> On 09/06/2017 09:08 AM, Johannes Thumshirn wrote:
 On Wed, Sep 06, 2017 at 05:01:01PM +0200, Javier González wrote:
> Check for failed mempool allocations and act accordingly.

 Are you sure it is needed? Quoting from mempool_alloc()s Documentation:
 "[...] Note that due to preallocation, this function *never* fails when 
 called
 from process contexts. (it might fail if called from an IRQ context.) 
 [...]"
>>>
>>> It's not needed, mempool() will never fail if __GFP_WAIT is set in the
>>> mask. The use case here is GFP_KERNEL, which does include __GFP_WAIT.
>>
>> Thanks for the clarification. Do you just drop the patch, or do you want
>> me to re-send the series?
> 
> No need to resend. I'll pick up the others in a day or two, once people
> have had some time to go over them.

I took a quick look at your mempool usage, and I'm not sure it's
correct.  For a mempool to work, you have to ensure that you provide a
forward progress guarantee. With that guarantee, you know that if you do
end up sleeping on allocation, you already have items inflight that will
be freed when that operation completes. In other words, all allocations
must have a defined and finite life time, as any allocation can
potentially sleep/block for that life time. You can't allocate something
and hold on to it forever, then you are violating the terms of agreement
that makes a mempool work.
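
As an illustration (made-up names), the pattern that keeps the guarantee is a
strictly bounded hold time per allocation, never holding more elements at once
than the reserve:

/* Fine: at most one element held at a time, freed after bounded work. */
for (i = 0; i < nr; i++) {
        item = mempool_alloc(pool, GFP_KERNEL); /* may sleep, never fails */
        do_bounded_work(item);                  /* finite, doesn't need the pool */
        mempool_free(item, pool);
}

/* Broken if nr can exceed the reserve: the nr-th allocation may wait
 * forever for an element that only this loop would ever free. */
for (i = 0; i < nr; i++)
        items[i] = mempool_alloc(pool, GFP_KERNEL);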

The first one that caught my eye is pblk->page_pool. You have this loop:

for (i = 0; i < nr_pages; i++) {
page = mempool_alloc(pblk->page_pool, flags);   
if (!page)  
goto err;   

ret = bio_add_pc_page(q, bio, page, PBLK_EXPOSED_PAGE_SIZE, 0); 
if (ret != PBLK_EXPOSED_PAGE_SIZE) {
pr_err("pblk: could not add page to bio\n");
mempool_free(page, pblk->page_pool);
goto err;   
}   
}  

which looks suspect. This mempool is created with a reserve pool of
PAGE_POOL_SIZE (16) members. Do we know if the bio has 16 pages or less?
If not, then this is broken and can deadlock forever.

You have a lot of mempool usage in the code, would probably not hurt to
audit all of them.

-- 
Jens Axboe



Re: [PATCH 3/6] lightnvm: pblk: use right flag for GC allocation

2017-09-06 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 2/6] lightnvm: pblk: initialize debug stat counter

2017-09-06 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 1/6] lightnvm: pblk: check for failed mempool alloc.

2017-09-06 Thread Jens Axboe
On 09/06/2017 09:12 AM, Javier González wrote:
>> On 6 Sep 2017, at 17.09, Jens Axboe  wrote:
>>
>> On 09/06/2017 09:08 AM, Johannes Thumshirn wrote:
>>> On Wed, Sep 06, 2017 at 05:01:01PM +0200, Javier González wrote:
 Check for failed mempool allocations and act accordingly.
>>>
>>> Are you sure it is needed? Quoting from mempool_alloc()s Documentation:
>>> "[...] Note that due to preallocation, this function *never* fails when 
>>> called
>>> from process contexts. (it might fail if called from an IRQ context.) [...]"
>>
>> It's not needed, mempool() will never fail if __GFP_WAIT is set in the
>> mask. The use case here is GFP_KERNEL, which does include __GFP_WAIT.
> 
> Thanks for the clarification. Do you just drop the patch, or do you want
> me to re-send the series?

No need to resend. I'll pick up the others in a day or two, once people
have had some time to go over them.

-- 
Jens Axboe



Re: [PATCH 1/6] lightnvm: pblk: check for failed mempool alloc.

2017-09-06 Thread Johannes Thumshirn
On Wed, Sep 06, 2017 at 09:09:29AM -0600, Jens Axboe wrote:
> On 09/06/2017 09:08 AM, Johannes Thumshirn wrote:
> > On Wed, Sep 06, 2017 at 05:01:01PM +0200, Javier González wrote:
> >> Check for failed mempool allocations and act accordingly.
> > 
> > Are you sure it is needed? Quoting from mempool_alloc()s Documentation:
> > "[...] Note that due to preallocation, this function *never* fails when 
> > called
> > from process contexts. (it might fail if called from an IRQ context.) [...]"
> 
> It's not needed, mempool() will never fail if __GFP_WAIT is set in the
> mask. The use case here is GFP_KERNEL, which does include __GFP_WAIT.

Exactly. Maybe I shouldn't have phrased it as a question, though...

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 1/6] lightnvm: pblk: check for failed mempool alloc.

2017-09-06 Thread Javier González
> On 6 Sep 2017, at 17.09, Jens Axboe  wrote:
> 
> On 09/06/2017 09:08 AM, Johannes Thumshirn wrote:
>> On Wed, Sep 06, 2017 at 05:01:01PM +0200, Javier González wrote:
>>> Check for failed mempool allocations and act accordingly.
>> 
>> Are you sure it is needed? Quoting from mempool_alloc()s Documentation:
>> "[...] Note that due to preallocation, this function *never* fails when 
>> called
>> from process contexts. (it might fail if called from an IRQ context.) [...]"
> 
> It's not needed, mempool() will never fail if __GFP_WAIT is set in the
> mask. The use case here is GFP_KERNEL, which does include __GFP_WAIT.

Thanks for the clarification. Do you just drop the patch, or do you want
me to re-send the series?

I see that I do this in other places; I'll clean it up for the next window.

Thanks,
Javier


signature.asc
Description: Message signed with OpenPGP


Re: [PATCH 1/6] lightnvm: pblk: check for failed mempool alloc.

2017-09-06 Thread Jens Axboe
On 09/06/2017 09:08 AM, Johannes Thumshirn wrote:
> On Wed, Sep 06, 2017 at 05:01:01PM +0200, Javier González wrote:
>> Check for failed mempool allocations and act accordingly.
> 
> Are you sure it is needed? Quoting from mempool_alloc()s Documentation:
> "[...] Note that due to preallocation, this function *never* fails when called
> from process contexts. (it might fail if called from an IRQ context.) [...]"

It's not needed, mempool() will never fail if __GFP_WAIT is set in the
mask. The use case here is GFP_KERNEL, which does include __GFP_WAIT.

-- 
Jens Axboe



Re: [PATCH 1/6] lightnvm: pblk: check for failed mempool alloc.

2017-09-06 Thread Johannes Thumshirn
On Wed, Sep 06, 2017 at 05:01:01PM +0200, Javier González wrote:
> Check for failed mempool allocations and act accordingly.

Are you sure it is needed? Quoting from mempool_alloc()s Documentation:
"[...] Note that due to preallocation, this function *never* fails when called
from process contexts. (it might fail if called from an IRQ context.) [...]"


Byte,
Johannes

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


[PATCH 5/6] lightnvm: pblk: fix write I/O sync stat

2017-09-06 Thread Javier González
Fix stat counter to collect the right number of I/Os being synced on the
completion path.

Fixes: 0880a9aa2d91f ("lightnvm: pblk: delete redundant buffer pointer")
Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-write.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c
index 6acb4a92435f..b48d52b2ae38 100644
--- a/drivers/lightnvm/pblk-write.c
+++ b/drivers/lightnvm/pblk-write.c
@@ -41,7 +41,7 @@ static unsigned long pblk_end_w_bio(struct pblk *pblk, struct nvm_rq *rqd,
c_ctx->nr_padded);
 
 #ifdef CONFIG_NVM_DEBUG
-   atomic_long_add(c_ctx->nr_valid, >sync_writes);
+   atomic_long_add(rqd->nr_ppas, >sync_writes);
 #endif
 
ret = pblk_rb_sync_advance(>rwb, c_ctx->nr_valid);
-- 
2.7.4



[PATCH 1/6] lightnvm: pblk: check for failed mempool alloc.

2017-09-06 Thread Javier González
Check for failed mempool allocations and act accordingly.

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-core.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 81501644fb15..acb07bbcb416 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -165,6 +165,8 @@ struct nvm_rq *pblk_alloc_rqd(struct pblk *pblk, int rw)
}
 
rqd = mempool_alloc(pool, GFP_KERNEL);
+   if (!rqd)
+   return NULL;
memset(rqd, 0, rq_size);
 
return rqd;
@@ -1478,6 +1480,8 @@ int pblk_blk_erase_async(struct pblk *pblk, struct ppa_addr ppa)
int err;
 
rqd = mempool_alloc(pblk->g_rq_pool, GFP_KERNEL);
+   if (!rqd)
+   return -ENOMEM;
memset(rqd, 0, pblk_g_rq_size);
 
pblk_setup_e_rq(pblk, rqd, ppa);
-- 
2.7.4



[PATCH 6/6] lightnvm: pblk: avoid deadlock on low LUN config

2017-09-06 Thread Javier González
On low LUN configurations, make sure not to send bios that are bigger
than the buffer size.

Fixes: a4bd217b4326 ("lightnvm: physical block device (pblk) target")
Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-init.c | 2 +-
 drivers/lightnvm/pblk-rl.c   | 6 ++
 drivers/lightnvm/pblk.h  | 2 ++
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 8459434edd89..12c05aebac16 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -46,7 +46,7 @@ static int pblk_rw_io(struct request_queue *q, struct pblk *pblk,
 * user I/Os. Unless stalled, the rate limiter leaves at least 256KB
 * available for user I/O.
 */
-   if (unlikely(pblk_get_secs(bio) >= pblk_rl_sysfs_rate_show(>rl)))
+   if (pblk_get_secs(bio) > pblk_rl_max_io(>rl))
blk_queue_split(q, );
 
return pblk_write_to_cache(pblk, bio, PBLK_IOTYPE_USER);
diff --git a/drivers/lightnvm/pblk-rl.c b/drivers/lightnvm/pblk-rl.c
index 2e6a5361baf0..596bdec433c3 100644
--- a/drivers/lightnvm/pblk-rl.c
+++ b/drivers/lightnvm/pblk-rl.c
@@ -178,6 +178,11 @@ int pblk_rl_sysfs_rate_show(struct pblk_rl *rl)
return rl->rb_user_max;
 }
 
+int pblk_rl_max_io(struct pblk_rl *rl)
+{
+   return rl->rb_max_io;
+}
+
 static void pblk_rl_u_timer(unsigned long data)
 {
struct pblk_rl *rl = (struct pblk_rl *)data;
@@ -214,6 +219,7 @@ void pblk_rl_init(struct pblk_rl *rl, int budget)
/* To start with, all buffer is available to user I/O writers */
rl->rb_budget = budget;
rl->rb_user_max = budget;
+   rl->rb_max_io = budget >> 1;
rl->rb_gc_max = 0;
rl->rb_state = PBLK_RL_HIGH;
 
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 67e623bd5c2d..a2753f9da830 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -267,6 +267,7 @@ struct pblk_rl {
int rb_gc_max;  /* Max buffer entries available for GC I/O */
int rb_gc_rsv;  /* Reserved buffer entries for GC I/O */
int rb_state;   /* Rate-limiter current state */
+   int rb_max_io;  /* Maximum size for an I/O given the config */
 
atomic_t rb_user_cnt;   /* User I/O buffer counter */
atomic_t rb_gc_cnt; /* GC I/O buffer counter */
@@ -844,6 +845,7 @@ int pblk_rl_gc_may_insert(struct pblk_rl *rl, int nr_entries);
 void pblk_rl_gc_in(struct pblk_rl *rl, int nr_entries);
 void pblk_rl_out(struct pblk_rl *rl, int nr_user, int nr_gc);
 int pblk_rl_sysfs_rate_show(struct pblk_rl *rl);
+int pblk_rl_max_io(struct pblk_rl *rl);
 void pblk_rl_free_lines_inc(struct pblk_rl *rl, struct pblk_line *line);
 void pblk_rl_free_lines_dec(struct pblk_rl *rl, struct pblk_line *line);
 void pblk_rl_set_space_limit(struct pblk_rl *rl, int entries_left);
-- 
2.7.4



[PATCH 2/6] lightnvm: pblk: initialize debug stat counter

2017-09-06 Thread Javier González
Initialize the stat counter for garbage collected reads.

Fixes: a4bd217b43268 ("lightnvm: physical block device (pblk) target")
Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-init.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 1b0f61233c21..8459434edd89 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -944,6 +944,7 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk,
	atomic_long_set(&pblk->recov_writes, 0);
	atomic_long_set(&pblk->recov_writes, 0);
	atomic_long_set(&pblk->recov_gc_writes, 0);
+   atomic_long_set(&pblk->recov_gc_reads, 0);
 #endif
 
	atomic_long_set(&pblk->read_failed, 0);
-- 
2.7.4



[PATCH 0/6] lightnvm: pblk bug fixes for 4.14

2017-09-06 Thread Javier González
Hi Jens,

Here are the pblk bug fixes for this window.

The patches apply on top of your for-4.14/block branch, and can be found at:
  https://github.com/OpenChannelSSD/linux/tree/pblk.for-4.14

Thanks,
Javier

Javier González (6):
  lightnvm: pblk: check for failed mempool alloc.
  lightnvm: pblk: initialize debug stat counter
  lightnvm: pblk: use right flag for GC allocation
  lightnvm: pblk: free padded entries in write buffer
  lightnvm: pblk: fix write I/O sync stat
  lightnvm: pblk: avoid deadlock on low LUN config

 drivers/lightnvm/pblk-core.c  |  5 -
 drivers/lightnvm/pblk-init.c  |  3 ++-
 drivers/lightnvm/pblk-read.c  |  7 +--
 drivers/lightnvm/pblk-rl.c|  6 ++
 drivers/lightnvm/pblk-write.c | 16 
 drivers/lightnvm/pblk.h   |  2 ++
 6 files changed, 31 insertions(+), 8 deletions(-)

-- 
2.7.4



Re: [PATCH] lightnvm: Add error code for bad write pointer

2017-09-06 Thread Javier González
> On 6 Sep 2017, at 15.44, Christoph Hellwig  wrote:
> 
> On Wed, Sep 06, 2017 at 12:22:38PM +0200, Javier González wrote:
>> Add new error code introduced on the OCSSD spec 2.0 for write pointer
>> mismatch on the device side. This indicates to the host that a write on
>> a block (chunk) is not respecting the required sequentiality.
> 
> Do you have a pointer to that spec?

Not yet. Still writing it up nicely.

I'll wait to re-post until we have the spec out.

Javier


signature.asc
Description: Message signed with OpenPGP


Re: [PATCH 00/13] bcache: fixes and update for 4.14

2017-09-06 Thread Jens Axboe
On Wed, Sep 06 2017, Coly Li wrote:
> Hi Jens,
> 
> Here are 12 patches for bcache fixes and updates; most of them were posted
> by Eric Wheeler in the 4.13 merge window, but were delayed due to lack of code
> review.

Next time, please send this _before_ the merge window opens. Not a huge
problem for this series, since it has been posted numerous times in the
past and already has good review coverage, but it really should have
been in linux-next for a week or two before heading upstream.

> There are some other patches which do not pass my tests currently;
> once they are fixed and reviewed by peer developers I will send them
> to you for the 4.14 merge window.

If they are not passing tests etc before the merge window opens, they
are not for this cycle.

-- 
Jens Axboe



Re: [PATCH 1/2] bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer

2017-09-06 Thread Benjamin Block
On Wed, Sep 06, 2017 at 08:07:43AM -0600, Jens Axboe wrote:
> On 09/06/2017 07:44 AM, Christoph Hellwig wrote:
> > From: Benjamin Block 
> > 
> > Since we split the scsi_request out of struct request, bsg fails to
> > provide a reply buffer for the drivers. This was previously done via the
> > pointer for sense data, which is not preallocated anymore.
> 
> Confused, this is already in master, was included before 4.13 was
> finalized.
> 

I think this is the backport for 4.11 and 4.12 as requested by Greg. I
just sent a mail in reply saying that I would do it, but I guess Christoph
already had something in the pipeline.

Or is it? I'll take a look.


Beste Grüße / Best regards,
  - Benjamin Block
-- 
Linux on z Systems Development / IBM Systems & Technology Group
  IBM Deutschland Research & Development GmbH 
Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294



Re: Archiving linux-block

2017-09-06 Thread Jens Axboe
On 09/06/2017 08:10 AM, Jens Axboe wrote:
> On 09/06/2017 07:13 AM, Bart Van Assche wrote:
>> On Tue, 2017-09-05 at 08:33 -0600, Jens Axboe wrote:
>>> On 09/05/2017 08:19 AM, Bart Van Assche wrote:
 Since gmane.org is no longer operational linux-block messages are no longer
 archived. I think it's useful to archive linux-block messages because it
 makes it possible to review past discussions, especially for people who are
 not subscribed to the list. How about activating archival by 
 mail-archive.com
 and spinics.net? As far as I know this can only be done by someone who is 
 list
 administrator. But I don't know who the linux-block list administrator(s) 
 are?
>>>
>>> I always use marc.info:
>>>
>>> https://marc.info/?l=linux-block
>>>
>>> If we want to add it elsewhere, that's fine with me. I'm probably the admin,
>>> I was the one that asked for the list to be created.
>>
>> Thanks Jens for the feedback. What I tried yesterday morning was to look up a
>> linux-block message by searching for its contents with a web search engine. 
>> That
>> search did not yield any results. Does this mean that messages on marc.info 
>> are
>> not indexed by Google? Is this a sufficient reason to let an additional web 
>> site
>> archive the linux-block messages?
> 
> Definitely, it'd be nice to have it be searchable. That's half the utility of
> an archive...
> 
> I'll see if I can get mail-archive.com setup.

spinics already has it, fwiw:

https://www.spinics.net/lists/linux-block/


-- 
Jens Axboe



Re: Archiving linux-block

2017-09-06 Thread Jens Axboe
On 09/06/2017 07:13 AM, Bart Van Assche wrote:
> On Tue, 2017-09-05 at 08:33 -0600, Jens Axboe wrote:
>> On 09/05/2017 08:19 AM, Bart Van Assche wrote:
>>> Since gmane.org is no longer operational linux-block messages are no longer
>>> archived. I think it's useful to archive linux-block messages because it
>>> makes it possible to review past discussions, especially for people who are
>>> not subscribed to the list. How about activating archival by 
>>> mail-archive.com
>>> and spinics.net? As far as I know this can only be done by someone who is 
>>> list
>>> administrator. But I don't know who the linux-block list administrator(s) 
>>> are?
>>
>> I always use marc.info:
>>
>> https://marc.info/?l=linux-block
>>
>> If we want to add it elsewhere, that's fine with me. I'm probably the admin,
>> I was the one that asked for the list to be created.
> 
> Thanks Jens for the feedback. What I tried yesterday morning was to look up a
> linux-block message by searching for its contents with a web search engine. 
> That
> search did not yield any results. Does this mean that messages on marc.info 
> are
> not indexed by Google? Is this a sufficient reason to let an additional web 
> site
> archive the linux-block messages?

Definitely, it'd be nice to have it be searchable. That's half the utility of
an archive...

I'll see if I can get mail-archive.com setup.

-- 
Jens Axboe



Re: [PATCH 00/18] lightnvm: pblk patches for 4.14

2017-09-06 Thread Jens Axboe
On 09/06/2017 04:50 AM, Javier González wrote:
> Hi Jens,
> 
> Here is the pblk patchset for this window.

You must mean for the next window, surely, since we're in the middle
of the current merge window? Have these patches been posted for
review before? This should have been done (at least) two weeks ago!

Get rid of anything that isn't a bug fix, I don't want to take
anything that looks like a feature or refactoring of code.

-- 
Jens Axboe



Re: [PATCH 10/18] lightnvm: pblk: use bio_copy_kern when possible

2017-09-06 Thread Christoph Hellwig
On Wed, Sep 06, 2017 at 12:51:03PM +0200, Javier González wrote:
> In pblk, buffers forming bios can be allocated on physically contiguous
> or virtually contiguous memory. For physically contiguous memory, we
> already use the bio_map_kern helper function; however, for virtually
> contiguous memory, we form the bio manually. This makes the code more
> complex, especially on the completion path, where mapped pages need to be
> freed.
> 
> Instead, use bio_copy_kern, which does the same and at the same time
> simplifies the completion path.

Nope.  You want to loop over vmalloc_to_page and call bio_add_page
for each page, after taking care of virtually tagged caches instead
of this bounce buffering.

And you really want to allocate the request first and only then map
the data to the request, as said before.


Re: [PATCH 01/18] lightnvm: pblk: improve naming for internal req.

2017-09-06 Thread Christoph Hellwig
> -#define ERASE 2 /* READ = 0, WRITE = 1 */
> +/* READ = 0, WRITE (user) = 1 */
> +#define ERASE 2
> +#define WRITE_INT 3 /* Internal write. Not through write buffer */

enum {
	PBLK_READ,
	PBLK_WRITE,
	PBLK_ERASE,
	PBLK_WRITE_INT,
};

please.  Don't abuse and overload the messy READ/WRITE macros in any
new code.


Re: [PATCH] lightnvm: Add error code for bad write pointer

2017-09-06 Thread Christoph Hellwig
On Wed, Sep 06, 2017 at 12:22:38PM +0200, Javier González wrote:
> Add new error code introduced on the OCSSD spec 2.0 for write pointer
> mismatch on the device side. This indicates to the host that a write on
> a block (chunk) is not respecting the required sequentiality.

Do you have a pointer to that spec?


[PATCH 2/2] scsi_transport_fc: fix NULL pointer dereference in fc_bsg_job_timeout

2017-09-06 Thread Christoph Hellwig
bsg-lib now embeds the job structure into the request, and req->special
can't be used anymore.

Signed-off-by: Christoph Hellwig 
Cc: sta...@vger.kernel.org
---
 drivers/scsi/scsi_transport_fc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 3c6bc0081fcb..ba9d70f8a6a1 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -3571,7 +3571,7 @@ fc_vport_sched_delete(struct work_struct *work)
 static enum blk_eh_timer_return
 fc_bsg_job_timeout(struct request *req)
 {
-   struct bsg_job *job = (void *) req->special;
+   struct bsg_job *job = blk_mq_rq_to_pdu(req);
struct Scsi_Host *shost = fc_bsg_to_shost(job);
struct fc_rport *rport = fc_bsg_to_rport(job);
struct fc_internal *i = to_fc_internal(shost->transportt);
-- 
2.11.0



[PATCH 1/2] bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer

2017-09-06 Thread Christoph Hellwig
From: Benjamin Block 

Since we split the scsi_request out of struct request, bsg fails to
provide a reply buffer for the drivers. This was previously done via the
pointer for sense data, which is not preallocated anymore.

Failing to allocate/assign it results in illegal dereferences because
LLDs use this pointer unquestioned.

An example panic on s390x, using the zFCP driver, looks like this (I had
debugging on, otherwise NULL-pointer dereferences wouldn't even panic on
s390x):

Unable to handle kernel pointer dereference in virtual kernel address space
Failing address: 6b6b6b6b6b6b6000 TEID: 6b6b6b6b6b6b6403
Fault in home space mode while using kernel ASCE.
AS:01590007 R3:0024
Oops: 0038 ilc:2 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: 
CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.12.0-bsg-regression+ #3
Hardware name: IBM 2964 N96 702 (z/VM 6.4.0)
task: 65cb0100 task.stack: 65cb4000
Krnl PSW : 0704e0018000 03ff801e4156 
(zfcp_fc_ct_els_job_handler+0x16/0x58 [zfcp])
   R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
Krnl GPRS: 0001 5fa9d0d0 5fa9d078 00e16866
   03ff0290 6b6b6b6b6b6b6b6b 59f78f00 000f
   593a0958 593a0958 60d88800 5ddd4c38
   58b50100 0700659cba08 03ff801e8556 659cb9a8
Krnl Code: 03ff801e4146: e3102054lg  %r1,80(%r2)
   03ff801e414c: 58402040   l   %r4,64(%r2)
  #03ff801e4150: e3502024   lg  %r5,32(%r2)
  >03ff801e4156: 50405004   st  %r4,4(%r5)
   03ff801e415a: e54c5008   mvhi8(%r5),0
   03ff801e4160: e33010280012   lt  %r3,40(%r1)
   03ff801e4166: a718fffb   lhi %r1,-5
   03ff801e416a: 1803   lr  %r0,%r3
Call Trace:
([<03ff801e8556>] zfcp_fsf_req_complete+0x726/0x768 [zfcp])
 [<03ff801ea82a>] zfcp_fsf_reqid_check+0x102/0x180 [zfcp]
 [<03ff801eb980>] zfcp_qdio_int_resp+0x230/0x278 [zfcp]
 [<009b91b6>] qdio_kick_handler+0x2ae/0x2c8
 [<009b9e3e>] __tiqdio_inbound_processing+0x406/0xc10
 [<001684c2>] tasklet_action+0x15a/0x1d8
 [<00bd28ec>] __do_softirq+0x3ec/0x848
 [<001675a4>] irq_exit+0x74/0xf8
 [<0010dd6a>] do_IRQ+0xba/0xf0
 [<00bd19e8>] io_int_handler+0x104/0x2d4
 [<001033b6>] enabled_wait+0xb6/0x188
([<0010339e>] enabled_wait+0x9e/0x188)
 [<0010396a>] arch_cpu_idle+0x32/0x50
 [<00bd0112>] default_idle_call+0x52/0x68
 [<001cd0fa>] do_idle+0x102/0x188
 [<001cd41e>] cpu_startup_entry+0x3e/0x48
 [<00118c64>] smp_start_secondary+0x11c/0x130
 [<00bd2016>] restart_int_handler+0x62/0x78
 [<>]   (null)
INFO: lockdep is turned off.
Last Breaking-Event-Address:
 [<03ff801e41d6>] zfcp_fc_ct_job_handler+0x3e/0x48 [zfcp]

Kernel panic - not syncing: Fatal exception in interrupt

This patch moves bsg-lib to allocate and setup struct bsg_job ahead of
time, including the allocation of a buffer for the reply-data.

This means, struct bsg_job is not allocated separately anymore, but as part
of struct request allocation - similar to struct scsi_cmd. Reflect this in
the function names that used to handle creation/destruction of struct
bsg_job.
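
A minimal sketch of the resulting layout (illustrative only; the helper name
below is made up, while job->reply and SCSI_SENSE_BUFFERSIZE are the names
used in mainline): the bsg_job is now the request PDU, so it lives as long as
the request itself and the reply buffer can be allocated up front.

static int example_bsg_prep(struct request *req)
{
	struct bsg_job *job = blk_mq_rq_to_pdu(req);

	memset(job, 0, sizeof(*job));
	/* pre-allocate the reply buffer that LLDs dereference unconditionally */
	job->reply = kzalloc(SCSI_SENSE_BUFFERSIZE, GFP_KERNEL);

	return job->reply ? 0 : -ENOMEM;
}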

Reported-by: Steffen Maier 
Suggested-by: Christoph Hellwig 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Benjamin Block 
Fixes: 82ed4db499b8 ("block: split scsi_request out of struct request")
Cc:  #4.11+
Signed-off-by: Jens Axboe 
---
 block/bsg-lib.c | 74 +
 include/linux/blkdev.h  |  1 -
 include/linux/bsg-lib.h |  2 ++
 3 files changed, 46 insertions(+), 31 deletions(-)

diff --git a/block/bsg-lib.c b/block/bsg-lib.c
index 4752dbc3dc49..c82408c7cc3c 100644
--- a/block/bsg-lib.c
+++ b/block/bsg-lib.c
@@ -29,26 +29,25 @@
 #include 
 
 /**
- * bsg_destroy_job - routine to teardown/delete a bsg job
+ * bsg_teardown_job - routine to teardown a bsg job
  * @job: bsg_job that is to be torn down
  */
-static void bsg_destroy_job(struct kref *kref)
+static void bsg_teardown_job(struct kref *kref)
 {
struct bsg_job *job = container_of(kref, struct bsg_job, kref);
struct request *rq = job->req;
 
-   blk_end_request_all(rq, BLK_STS_OK);
-
put_device(job->dev);   /* release reference for the request */
 
kfree(job->request_payload.sg_list);
kfree(job->reply_payload.sg_list);
-   kfree(job);
+
+   blk_end_request_all(rq, BLK_STS_OK);
 }
 
 void bsg_job_put(struct bsg_job *job)
 {
-   kref_put(&job->kref, bsg_destroy_job);
+   kref_put(&job->kref, bsg_teardown_job);
 }
 

two small bsg fixes V2

2017-09-06 Thread Christoph Hellwig
Two fixups for the recent bsg-lib fixes, should go into 4.13 stable as
well.




Re: [PATCH] block: export symbol for bio_copy_kern

2017-09-06 Thread Christoph Hellwig
On Wed, Sep 06, 2017 at 12:20:25PM +0200, Javier González wrote:
> Export symbol for bio_copy_kern so that we can use it in modules.

NAK.  Always allocate the request first and then map the data to
the request.


Re: [PATCH v2] xfs: fix incorrect log_flushed on fsync

2017-09-06 Thread Brian Foster
On Wed, Sep 06, 2017 at 03:19:37PM +0200, Christoph Hellwig wrote:
> > You could also say that flush sequence counting code doesn't belong
> > to xfs code at all. There is nothing xfs specific about it.
> > 
> > If we had an API:
> > 
> > flush_seq = blkdev_get_flush_seq(bdev, flush_seq);
> > blkdev_issue_flush_after(bdev, flush_seq);
> > 
> > I am sure it would have been useful to more fs.
> > 
> > In fact, block drivers that use blk_insert_flush(),
> > already have serialized flush requests, so implementing
> > the above functionality would be quite trivial for those.
> > 
> > I am not fluent enough in block layer to say if this makes sense.
> > Christoph? Should we add some block people to this discussion?
> 
> Note that the interface can't be based on blkdev_issue_flush, as
> our most important flushes are submitted asynchronously.
> 
> But except for that tying into the flush state machine sounds
> very interesting.  Let me look into that as the designated
> xfs <-> block layer liaison.

That sounds like a nice idea to me, particularly if there is potential
for other users. I'll wait to look into doing anything in XFS until we
see how this turns out. Thanks.

Brian

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] xfs: fix incorrect log_flushed on fsync

2017-09-06 Thread Christoph Hellwig
> You could also say that flush sequence counting code doesn't belong
> to xfs code at all. There is nothing xfs specific about it.
> 
> If we had an API:
> 
> flush_seq = blkdev_get_flush_seq(bdev, flush_seq);
> blkdev_issue_flush_after(bdev, flush_seq);
> 
> I am sure it would have been useful to more fs.
> 
> In fact, block drivers that use blk_insert_flush(),
> already have serialized flush requests, so implementing
> the above functionality would be quite trivial for those.
> 
> I am not fluent enough in block layer to say if this makes sense.
> Christoph? Should we add some block people to this discussion?

Note that the interface can't be based on blkdev_issue_flush, as
our most important flushes are submitted asynchronously.

But except for that tying into the flush state machine sounds
very interesting.  Let me look into that as the designated
xfs <-> block layer liaison.
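
To make the proposal more concrete, a purely hypothetical sketch of how a
filesystem's fsync tail could use such an interface; neither function exists
today, the names come from the quoted proposal and the exact signatures are
assumed:

static int example_fsync_flush(struct block_device *bdev)
{
	u64 flush_seq;

	/* sample the flush state machine before waiting for data writeback */
	flush_seq = blkdev_get_flush_seq(bdev);

	/* ... write back and wait for the file's dirty pages here ... */

	/* becomes a no-op if a flush newer than flush_seq already completed */
	return blkdev_issue_flush_after(bdev, flush_seq);
}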


Re: Archiving linux-block

2017-09-06 Thread Bart Van Assche
On Tue, 2017-09-05 at 08:33 -0600, Jens Axboe wrote:
> On 09/05/2017 08:19 AM, Bart Van Assche wrote:
> > Since gmane.org is no longer operational linux-block messages are no longer
> > archived. I think it's useful to archive linux-block messages because it
> > makes it possible to review past discussions, especially for people who are
> > not subscribed to the list. How about activating archival by 
> > mail-archive.com
> > and spinics.net? As far as I know this can only be done by someone who is 
> > list
> > administrator. But I don't know who the linux-block list administrator(s) 
> > are?
> 
> I always use marc.info:
> 
> https://marc.info/?l=linux-block
> 
> If we want to add it elsewhere, that's fine with me. I'm probably the admin,
> I was the one that asked for the list to be created.

Thanks Jens for the feedback. What I tried yesterday morning was to look up a
linux-block message by searching for its contents with a web search engine. That
search did not yield any results. Does this mean that messages on marc.info are
not indexed by Google? Is this a sufficient reason to let an additional web site
archive the linux-block messages?

Thanks,

Bart.

Re: [PATCH] blk-mq: fix nr_requests wrong value when modify it from sysfs

2017-09-06 Thread Bart Van Assche
On Wed, 2017-09-06 at 15:34 +0800, weiping zhang wrote:
> On Tue, Sep 05, 2017 at 03:42:45PM +, Bart Van Assche wrote:
> > On Sun, 2017-09-03 at 21:46 +0800, weiping zhang wrote:
> > > if blk-mq uses the "none" io scheduler, nr_requests gets a wrong value when
> > > a number > tag_set->queue_depth is input. blk_mq_tag_update_depth will take
> > > the smaller one, min(nr, set->queue_depth), and then q->nr_requests gets a
> > > wrong value.
> > > 
> > > Reproduce:
> > > 
> > > echo none > /sys/block/nvme0n1/queue/ioscheduler
> > > echo 100 > /sys/block/nvme0n1/queue/nr_requests
> > > cat /sys/block/nvme0n1/queue/nr_requests
> > > 100
> > > 
> > > Signed-off-by: weiping zhang 
> > > ---
> > >  block/blk-mq.c | 7 +--
> > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > index f84d145..8303e5e 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -2622,8 +2622,11 @@ int blk_mq_update_nr_requests(struct request_queue 
> > > *q, unsigned int nr)
> > >* queue depth. This is similar to what the old code would do.
> > >*/
> > >   if (!hctx->sched_tags) {
> > > - ret = blk_mq_tag_update_depth(hctx, &hctx->tags,
> > > - min(nr, set->queue_depth),
> > > + if (nr > set->queue_depth) {
> > > + nr = set->queue_depth;
> > > + pr_warn("reduce nr_request to %u\n", nr);
> > > + }
> > > + ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
> > >   false);
> > >   } else {
> > >   ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
> > 
> > Shouldn't this code return -EINVAL or -ERANGE if 'nr' is too large? That 
> > will help to
> > keep user space code simple that updates the queue depth.
> 
> Hi Bart,
> 
> The reason for not returning -EINVAL is to keep in line with the minimum checking
> in queue_requests_store;
> if you insist on returning -EINVAL/-ERANGE, the minimum checking should also keep
> the same behavior. Both returning an error and quietly clamping are okay
> with me. Which way do you prefer?
> 
> static ssize_t
> queue_requests_store(struct request_queue *q, const char *page, size_t count)
> {
>   unsigned long nr;
>   int ret, err;
> 
>   if (!q->request_fn && !q->mq_ops)
>   return -EINVAL;
> 
>   ret = queue_var_store(&nr, page, count);
>   if (ret < 0)
>   return ret;
> 
>   if (nr < BLKDEV_MIN_RQ)
>   nr = BLKDEV_MIN_RQ;

Hello Jens,

Do you perhaps have a preference for one of the approaches that have been 
discussed
in this e-mail thread?

Thanks,

Bart.
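
For reference, the stricter alternative discussed above would look roughly
like this inside blk_mq_update_nr_requests() (sketch only; 'nr', 'hctx',
'set' and 'ret' as in the quoted hunk):

	if (!hctx->sched_tags) {
		/* reject instead of silently clamping to the hardware queue depth */
		if (nr > set->queue_depth)
			return -EINVAL;		/* or -ERANGE */
		ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr, false);
	}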

Re: [PATCH v2] xfs: fix incorrect log_flushed on fsync

2017-09-06 Thread Amir Goldstein
[-stable]

On Tue, Sep 5, 2017 at 5:40 PM, Brian Foster  wrote:
> On Sat, Sep 02, 2017 at 06:47:03PM +0300, Amir Goldstein wrote:
>> On Sat, Sep 2, 2017 at 4:19 PM, Brian Foster  wrote:
>> > On Fri, Sep 01, 2017 at 06:39:25PM +0300, Amir Goldstein wrote:
>> >> When calling into _xfs_log_force{,_lsn}() with a pointer
>> >> to log_flushed variable, log_flushed will be set to 1 if:
>> >> 1. xlog_sync() is called to flush the active log buffer
>> >> AND/OR
>> >> 2. xlog_wait() is called to wait on a syncing log buffers
>> >>
>> >> xfs_file_fsync() checks the value of log_flushed after
>> >> _xfs_log_force_lsn() call to optimize away an explicit
>> >> PREFLUSH request to the data block device after writing
>> >> out all the file's pages to disk.
>> >>
>> >> This optimization is incorrect in the following sequence of events:
>> >>
>> >>  Task A                        Task B
>> >>  ----------------------        ----------------------
>> >>  xfs_file_fsync()
>> >>    _xfs_log_force_lsn()
>> >>      xlog_sync()
>> >>         [submit PREFLUSH]
>> >>                                xfs_file_fsync()
>> >>                                  file_write_and_wait_range()
>> >>                                    [submit WRITE X]
>> >>                                    [endio  WRITE X]
>> >>                                  _xfs_log_force_lsn()
>> >>                                    xlog_wait()
>> >>         [endio  PREFLUSH]
>> >>
>> >> The write X is not guaranteed to be on persistent storage
>> >> when the PREFLUSH request is completed, because write X was submitted
>> >> after the PREFLUSH request, but xfs_file_fsync() of task A will
>> >> be notified of log_flushed=1 and will skip explicit flush.
>> >>
>> >> If the system crashes after fsync of task A, write X may not be
>> >> present on disk after reboot.
>> >>
>> >> This bug was discovered and demonstrated using Josef Bacik's
>> >> dm-log-writes target, which can be used to record block io operations
>> >> and then replay a subset of these operations onto the target device.
>> >> The test goes something like this:
>> >> - Use fsx to execute ops of a file and record ops on log device
>> >> - Every now and then fsync the file, store md5 of file and mark
>> >
>> >>   md5 of file to stored value
>> >>
>> >> Cc: Christoph Hellwig 
>> >> Cc: Josef Bacik 
>> >> Cc: 
>> >> Signed-off-by: Amir Goldstein 
>> >> ---
>> >>
>> >> Christoph,
>> >>
>> >> Here is another, more subtle, version of the fix.
>> >> It keeps a lot more use cases optimized (e.g. many threads doing fsync
>> >> and setting WANT_SYNC may still be optimized).
>> >> It also addresses your concern that xlog_state_release_iclog()
>> >> may not actually start the buffer sync.
>> >>
>> >> I tried to keep the logic as simple as possible:
>> >> If we see a buffer who is not yet SYNCING and we later
>> >> see that l_last_sync_lsn is >= to the lsn of that buffer,
>> >> we can safely say log_flushed.
>> >>
>> >> I only tested this patch through a few crash test runs, not even full
>> >> xfstests cycle, so I am not sure it is correct, just posting to get
>> >> your feedback.
>> >>
>> >> Is it worth something over the simpler v1 patch?
>> >> I can't really say.
>> >>
>> >
>> > This looks like it has a couple potential problems on a very quick look
>> > (so I could definitely be mistaken). E.g., the lsn could be zero at the
>> > time we set log_flushed in _xfs_log_force(). It also looks like waiting
>> > on an iclog that is waiting to run callbacks due to out of order
>> > completion could be interpreted as a log flush having occurred, but I
>> > haven't stared at this long enough to say whether that is actually
>> > possible.
>> >
>> > Stepping back from the details.. this seems like it could be done
>> > correctly in general. IIUC what you want to know is whether any iclog
>> > went from a pre-SYNCING state to a post-SYNCING state during the log
>> > force, right? The drawbacks to this are that the log states do not by
>> > design tell us whether devices have been flushed (landmine alert).
>> > AFAICS, the last tail lsn isn't necessarily updated on every I/O
>> > completion either.
>> >
>> > I'm really confused by the preoccupation with finding a way to keep this
>> > fix localized to xfs_log_force(), as if that provides some inherent
>> > advantage over fundamentally more simple code. If anything, the fact
>> > that this has been broken for so long suggests that is not the case.
>> >
>>
>> Brian,
>>
>> Don't let my motives confuse you, the localized approach has two reasons:
>> 1. I though there may be a low hanging fix, because of already existing
>> lsn counters
>> 2. I lack the experience and time to make the 'correct' fix you suggested
>>
>
> Fair enough, but note that the "correct" fix was your idea (based on
> hch's patch). :) I just suggested refactoring it out of the logging code
> because there's no reason it needs to be 

Re: [PATCH 1/2] scsi_transport_fc: fix NULL pointer dereference in fc_bsg_job_timeout

2017-09-06 Thread Christoph Hellwig
On Wed, Sep 06, 2017 at 06:59:39PM +0800, Ming Lei wrote:
> On Wed, Sep 6, 2017 at 6:11 PM, Christoph Hellwig  wrote:
> > bsg-lib now embeds the job structure into the request, and req->special
> > can't be used anymore.
> >
> > Signed-off-by: Christoph Hellwig 
> > Cc: sta...@vger.kernel.org
> > ---
> >  drivers/scsi/scsi_transport_fc.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/scsi/scsi_transport_fc.c 
> > b/drivers/scsi/scsi_transport_fc.c
> > index 3c6bc0081fcb..d8de46806a1e 100644
> > --- a/drivers/scsi/scsi_transport_fc.c
> > +++ b/drivers/scsi/scsi_transport_fc.c
> > @@ -3571,7 +3571,7 @@ fc_vport_sched_delete(struct work_struct *work)
> >  static enum blk_eh_timer_return
> >  fc_bsg_job_timeout(struct request *req)
> >  {
> > -   struct bsg_job *job = (void *) req->special;
> > +   struct bsg_job *job = blk_mq_rq_to_pdu(req->special);
> 
> still req->special?

Meh, sent out before the rebase finished - I'll fix it up.



Re: [PATCH 1/2] scsi_transport_fc: fix NULL pointer dereference in fc_bsg_job_timeout

2017-09-06 Thread Ming Lei
On Wed, Sep 6, 2017 at 6:11 PM, Christoph Hellwig  wrote:
> bsg-lib now embeds the job structure into the request, and req->special
> can't be used anymore.
>
> Signed-off-by: Christoph Hellwig 
> Cc: sta...@vger.kernel.org
> ---
>  drivers/scsi/scsi_transport_fc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/scsi_transport_fc.c 
> b/drivers/scsi/scsi_transport_fc.c
> index 3c6bc0081fcb..d8de46806a1e 100644
> --- a/drivers/scsi/scsi_transport_fc.c
> +++ b/drivers/scsi/scsi_transport_fc.c
> @@ -3571,7 +3571,7 @@ fc_vport_sched_delete(struct work_struct *work)
>  static enum blk_eh_timer_return
>  fc_bsg_job_timeout(struct request *req)
>  {
> -   struct bsg_job *job = (void *) req->special;
> +   struct bsg_job *job = blk_mq_rq_to_pdu(req->special);

still req->special?


-- 
Ming Lei


[PATCH 03/18] lightnvm: pblk: normalize ppa namings

2017-09-06 Thread Javier González
Normalize the way we name ppa variables to improve code readability.

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-core.c | 48 +++-
 1 file changed, 25 insertions(+), 23 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 81501644fb15..e3997abdf740 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1764,7 +1764,7 @@ void pblk_up_rq(struct pblk *pblk, struct ppa_addr 
*ppa_list, int nr_ppas,
 
 void pblk_update_map(struct pblk *pblk, sector_t lba, struct ppa_addr ppa)
 {
-   struct ppa_addr l2p_ppa;
+   struct ppa_addr ppa_l2p;
 
/* logic error: lba out-of-bounds. Ignore update */
if (!(lba < pblk->rl.nr_secs)) {
@@ -1773,10 +1773,10 @@ void pblk_update_map(struct pblk *pblk, sector_t lba, 
struct ppa_addr ppa)
}
 
	spin_lock(&pblk->trans_lock);
-   l2p_ppa = pblk_trans_map_get(pblk, lba);
+   ppa_l2p = pblk_trans_map_get(pblk, lba);
 
-   if (!pblk_addr_in_cache(l2p_ppa) && !pblk_ppa_empty(l2p_ppa))
-   pblk_map_invalidate(pblk, l2p_ppa);
+   if (!pblk_addr_in_cache(ppa_l2p) && !pblk_ppa_empty(ppa_l2p))
+   pblk_map_invalidate(pblk, ppa_l2p);
 
pblk_trans_map_set(pblk, lba, ppa);
	spin_unlock(&pblk->trans_lock);
@@ -1793,16 +1793,16 @@ void pblk_update_map_cache(struct pblk *pblk, sector_t 
lba, struct ppa_addr ppa)
pblk_update_map(pblk, lba, ppa);
 }
 
-int pblk_update_map_gc(struct pblk *pblk, sector_t lba, struct ppa_addr ppa,
+int pblk_update_map_gc(struct pblk *pblk, sector_t lba, struct ppa_addr 
ppa_new,
   struct pblk_line *gc_line)
 {
-   struct ppa_addr l2p_ppa;
+   struct ppa_addr ppa_l2p;
int ret = 1;
 
 #ifdef CONFIG_NVM_DEBUG
/* Callers must ensure that the ppa points to a cache address */
-   BUG_ON(!pblk_addr_in_cache(ppa));
-   BUG_ON(pblk_rb_pos_oob(&pblk->rwb, pblk_addr_to_cacheline(ppa)));
+   BUG_ON(!pblk_addr_in_cache(ppa_new));
+   BUG_ON(pblk_rb_pos_oob(&pblk->rwb, pblk_addr_to_cacheline(ppa_new)));
 #endif
 
/* logic error: lba out-of-bounds. Ignore update */
@@ -1812,36 +1812,38 @@ int pblk_update_map_gc(struct pblk *pblk, sector_t lba, 
struct ppa_addr ppa,
}
 
	spin_lock(&pblk->trans_lock);
-   l2p_ppa = pblk_trans_map_get(pblk, lba);
+   ppa_l2p = pblk_trans_map_get(pblk, lba);
 
/* Prevent updated entries to be overwritten by GC */
-   if (pblk_addr_in_cache(l2p_ppa) || pblk_ppa_empty(l2p_ppa) ||
-   pblk_tgt_ppa_to_line(l2p_ppa) != gc_line->id) {
+   if (pblk_addr_in_cache(ppa_l2p) || pblk_ppa_empty(ppa_l2p) ||
+   pblk_tgt_ppa_to_line(ppa_l2p) != gc_line->id) {
+
ret = 0;
goto out;
}
 
-   pblk_trans_map_set(pblk, lba, ppa);
+   pblk_trans_map_set(pblk, lba, ppa_new);
 out:
	spin_unlock(&pblk->trans_lock);
return ret;
 }
 
-void pblk_update_map_dev(struct pblk *pblk, sector_t lba, struct ppa_addr ppa,
-struct ppa_addr entry_line)
+void pblk_update_map_dev(struct pblk *pblk, sector_t lba,
+struct ppa_addr ppa_mapped, struct ppa_addr ppa_cache)
 {
-   struct ppa_addr l2p_line;
+   struct ppa_addr ppa_l2p;
 
 #ifdef CONFIG_NVM_DEBUG
/* Callers must ensure that the ppa points to a device address */
-   BUG_ON(pblk_addr_in_cache(ppa));
+   BUG_ON(pblk_addr_in_cache(ppa_mapped));
 #endif
/* Invalidate and discard padded entries */
if (lba == ADDR_EMPTY) {
 #ifdef CONFIG_NVM_DEBUG
	atomic_long_inc(&pblk->padded_wb);
 #endif
-   pblk_map_invalidate(pblk, ppa);
+   if (!pblk_ppa_empty(ppa_mapped))
+   pblk_map_invalidate(pblk, ppa_mapped);
return;
}
 
@@ -1852,22 +1854,22 @@ void pblk_update_map_dev(struct pblk *pblk, sector_t 
lba, struct ppa_addr ppa,
}
 
	spin_lock(&pblk->trans_lock);
-   l2p_line = pblk_trans_map_get(pblk, lba);
+   ppa_l2p = pblk_trans_map_get(pblk, lba);
 
/* Do not update L2P if the cacheline has been updated. In this case,
 * the mapped ppa must be invalidated
 */
-   if (l2p_line.ppa != entry_line.ppa) {
-   if (!pblk_ppa_empty(ppa))
-   pblk_map_invalidate(pblk, ppa);
+   if (!pblk_ppa_comp(ppa_l2p, ppa_cache)) {
+   if (!pblk_ppa_empty(ppa_mapped))
+   pblk_map_invalidate(pblk, ppa_mapped);
goto out;
}
 
 #ifdef CONFIG_NVM_DEBUG
-   WARN_ON(!pblk_addr_in_cache(l2p_line) && !pblk_ppa_empty(l2p_line));
+   WARN_ON(!pblk_addr_in_cache(ppa_l2p) && !pblk_ppa_empty(ppa_l2p));
 #endif
 
-   pblk_trans_map_set(pblk, lba, ppa);
+   pblk_trans_map_set(pblk, lba, 

[PATCH 05/18] lightnvm: pblk: initialize debug stat counter

2017-09-06 Thread Javier González
Initialize the stat counter for garbage collected reads.

Fixes: a4bd217b43268 ("lightnvm: physical block device (pblk) target")
Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-init.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 1b0f61233c21..8459434edd89 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -944,6 +944,7 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct 
gendisk *tdisk,
	atomic_long_set(&pblk->recov_writes, 0);
	atomic_long_set(&pblk->recov_writes, 0);
	atomic_long_set(&pblk->recov_gc_writes, 0);
+   atomic_long_set(&pblk->recov_gc_reads, 0);
 #endif
 
	atomic_long_set(&pblk->read_failed, 0);
-- 
2.7.4



[PATCH 01/18] lightnvm: pblk: improve naming for internal req.

2017-09-06 Thread Javier González
Each request type sent to the LightNVM subsystem requires different
metadata. Until now, we have tailored this metadata based on write, read
and erase commands. However, pblk uses different metadata for internal
writes that do not hit the write buffer. Instead of abusing the metadata
for reads, create a new request type - internal write. This improves
code readability.

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-recovery.c | 6 +++---
 drivers/lightnvm/pblk-write.c| 6 +++---
 drivers/lightnvm/pblk.h  | 5 -
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index cb556e06673e..279638309d9a 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -344,7 +344,7 @@ static void pblk_end_io_recov(struct nvm_rq *rqd)
 
bio_put(rqd->bio);
nvm_dev_dma_free(dev->parent, rqd->meta_list, rqd->dma_meta_list);
-   pblk_free_rqd(pblk, rqd, WRITE);
+   pblk_free_rqd(pblk, rqd, WRITE_INT);
 
	atomic_dec(&pblk->inflight_io);
	kref_put(&pad_rq->ref, pblk_recov_complete);
@@ -404,7 +404,7 @@ static int pblk_recov_pad_oob(struct pblk *pblk, struct 
pblk_line *line,
ppa_list = (void *)(meta_list) + pblk_dma_meta_size;
dma_ppa_list = dma_meta_list + pblk_dma_meta_size;
 
-   rqd = pblk_alloc_rqd(pblk, WRITE);
+   rqd = pblk_alloc_rqd(pblk, WRITE_INT);
if (IS_ERR(rqd)) {
ret = PTR_ERR(rqd);
goto fail_free_meta;
@@ -491,7 +491,7 @@ static int pblk_recov_pad_oob(struct pblk *pblk, struct 
pblk_line *line,
 fail_free_bio:
bio_put(bio);
 fail_free_rqd:
-   pblk_free_rqd(pblk, rqd, WRITE);
+   pblk_free_rqd(pblk, rqd, WRITE_INT);
 fail_free_meta:
nvm_dev_dma_free(dev->parent, meta_list, dma_meta_list);
 fail_free_pad:
diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c
index 3ad9e56d2473..920b93fb15d5 100644
--- a/drivers/lightnvm/pblk-write.c
+++ b/drivers/lightnvm/pblk-write.c
@@ -199,7 +199,7 @@ static void pblk_end_io_write_meta(struct nvm_rq *rqd)
 
bio_put(rqd->bio);
nvm_dev_dma_free(dev->parent, rqd->meta_list, rqd->dma_meta_list);
-   pblk_free_rqd(pblk, rqd, READ);
+   pblk_free_rqd(pblk, rqd, WRITE_INT);
 
	atomic_dec(&pblk->inflight_io);
 }
@@ -370,7 +370,7 @@ int pblk_submit_meta_io(struct pblk *pblk, struct pblk_line 
*meta_line)
int i, j;
int ret;
 
-   rqd = pblk_alloc_rqd(pblk, READ);
+   rqd = pblk_alloc_rqd(pblk, WRITE_INT);
if (IS_ERR(rqd)) {
pr_err("pblk: cannot allocate write req.\n");
return PTR_ERR(rqd);
@@ -434,7 +434,7 @@ int pblk_submit_meta_io(struct pblk *pblk, struct pblk_line 
*meta_line)
if (likely(l_mg->emeta_alloc_type == PBLK_VMALLOC_META))
bio_put(bio);
 fail_free_rqd:
-   pblk_free_rqd(pblk, rqd, READ);
+   pblk_free_rqd(pblk, rqd, WRITE_INT);
return ret;
 }
 
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 67e623bd5c2d..08a463fe855f 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -59,7 +59,10 @@
for ((i) = 0, rlun = &(pblk)->luns[0]; \
(i) < (pblk)->nr_luns; (i)++, rlun = &(pblk)->luns[(i)])
 
-#define ERASE 2 /* READ = 0, WRITE = 1 */
+/* READ = 0, WRITE (user) = 1 */
+#define ERASE 2
+#define WRITE_INT 3 /* Internal write. Not through write buffer */
+
 
 enum {
/* IO Types */
-- 
2.7.4



[PATCH 16/18] lightnvm: pblk: enable 1 LUN configuration

2017-09-06 Thread Javier González
Metadata I/Os are scheduled to minimize their impact on user data I/Os.
When there are enough LUNs instantiated (i.e., enough bandwidth), it is
easy to interleave metadata and data one after the other so that
metadata I/Os are the ones being blocked and not vice versa.

We do this by calculating the distance between the I/Os in terms of the
LUNs that are not in use, and selecting a free LUN that satisfies the
simple heuristic that metadata is scheduled behind. The per-LUN
semaphores guarantee consistency. This works fine on a >1 LUN
configuration. However, when a single LUN is instantiated, this design
leads to a deadlock, where metadata waits to be scheduled on a free LUN.

This patch implements the 1 LUN case by simply scheduling the metadata
I/O after the data I/O. In the process, we refactor the way a line is
replaced to ensure that metadata writes are submitted after data writes
in order to guarantee block sequentiality. Note that, since there is
only one LUN, both I/Os will block each other by design. However, such a
configuration only pursues tight read latencies, not write bandwidth.
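
As a sketch of the single-LUN ordering (illustrative only; these are not the
exact call sites, which live in the write path refactored by this patch): with
one LUN there is no spare LUN to hide the metadata write behind, so it is
simply issued after the data write and the per-LUN semaphore serializes the
two.

static void example_submit_order(struct pblk *pblk, struct nvm_rq *data_rqd,
				 struct pblk_line *meta_line)
{
	pblk_submit_io(pblk, data_rqd);			/* user data first */
	if (meta_line)
		pblk_submit_meta_io(pblk, meta_line);	/* line metadata behind it */
}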

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-core.c  | 17 ++---
 drivers/lightnvm/pblk-init.c  |  8 ++--
 drivers/lightnvm/pblk-map.c   | 21 -
 drivers/lightnvm/pblk-write.c |  8 +---
 drivers/lightnvm/pblk.h   |  2 +-
 5 files changed, 34 insertions(+), 22 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 1f746e31e379..873b66200678 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1326,17 +1326,17 @@ void pblk_pipeline_stop(struct pblk *pblk)
	spin_unlock(&l_mg->free_lock);
 }
 
-void pblk_line_replace_data(struct pblk *pblk)
+struct pblk_line *pblk_line_replace_data(struct pblk *pblk)
 {
	struct pblk_line_mgmt *l_mg = &pblk->l_mg;
-   struct pblk_line *cur, *new;
+   struct pblk_line *cur, *new = NULL;
unsigned int left_seblks;
int is_next = 0;
 
cur = l_mg->data_line;
new = l_mg->data_next;
if (!new)
-   return;
+   goto out;
l_mg->data_line = new;
 
	spin_lock(&l_mg->free_lock);
@@ -1344,7 +1344,7 @@ void pblk_line_replace_data(struct pblk *pblk)
	l_mg->data_line = NULL;
	l_mg->data_next = NULL;
	spin_unlock(&l_mg->free_lock);
-   return;
+   goto out;
}
 
	pblk_line_setup_metadata(new, l_mg, &pblk->lm);
@@ -1356,7 +1356,7 @@ void pblk_line_replace_data(struct pblk *pblk)
	/* If line is not fully erased, erase it */
	if (atomic_read(&new->left_eblks)) {
if (pblk_line_erase(pblk, new))
-   return;
+   goto out;
} else {
io_schedule();
}
@@ -1367,7 +1367,7 @@ void pblk_line_replace_data(struct pblk *pblk)
if (!pblk_line_init_metadata(pblk, new, cur)) {
new = pblk_line_retry(pblk, new);
if (!new)
-   return;
+   goto out;
 
goto retry_setup;
}
@@ -1375,7 +1375,7 @@ void pblk_line_replace_data(struct pblk *pblk)
if (!pblk_line_init_bb(pblk, new, 1)) {
new = pblk_line_retry(pblk, new);
if (!new)
-   return;
+   goto out;
 
goto retry_setup;
}
@@ -1399,6 +1399,9 @@ void pblk_line_replace_data(struct pblk *pblk)
 
if (is_next)
	pblk_rl_free_lines_dec(&pblk->rl, l_mg->data_next);
+
+out:
+   return new;
 }
 
 void pblk_line_free(struct pblk *pblk, struct pblk_line *line)
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 12c05aebac16..0409839cc8fc 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -714,8 +714,12 @@ static int pblk_lines_init(struct pblk *pblk)
}
 
lm->emeta_bb = geo->nr_luns - i;
-   lm->min_blk_line = 1 + DIV_ROUND_UP(lm->smeta_sec + lm->emeta_sec[0],
-   geo->sec_per_blk);
+
+   lm->min_blk_line = 1;
+   if (geo->nr_luns > 1)
+   lm->min_blk_line += DIV_ROUND_UP(lm->smeta_sec +
+   lm->emeta_sec[0], geo->sec_per_blk);
+
if (lm->min_blk_line > lm->blk_per_line) {
pr_err("pblk: config. not supported. Min. LUN in line:%d\n",
lm->blk_per_line);
diff --git a/drivers/lightnvm/pblk-map.c b/drivers/lightnvm/pblk-map.c
index fddb924f6dde..3bc4c94f9cf2 100644
--- a/drivers/lightnvm/pblk-map.c
+++ b/drivers/lightnvm/pblk-map.c
@@ -25,13 +25,23 @@ static void pblk_map_page_data(struct pblk *pblk, unsigned 

[PATCH 12/18] lightnvm: pblk: free padded entries in write buffer

2017-09-06 Thread Javier González
When a REQ_FLUSH reaches pblk, the bio cannot be directly completed.
Instead, data on the write buffer is flushed and the bio is completed on
the completion path. This might require some sectors to be padded in
order to guarantee a successful write.

This patch fixes a memory leak on the padded pages. A consequence of
this bad free was that internal bios not containing data (only a flush)
were not being completed.

Fixes: a4bd217b4326 ("lightnvm: physical block device (pblk) target")
Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-core.c  |  1 -
 drivers/lightnvm/pblk-write.c | 14 +++---
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index e69e8829b093..1f746e31e379 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -187,7 +187,6 @@ void pblk_bio_free_pages(struct pblk *pblk, struct bio 
*bio, int off,
 
WARN_ON(off + nr_pages != bio->bi_vcnt);
 
-   bio_advance(bio, off * PBLK_EXPOSED_PAGE_SIZE);
for (i = off; i < nr_pages + off; i++) {
bv = bio->bi_io_vec[i];
mempool_free(bv.bv_page, pblk->page_pool);
diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c
index e5d77af1aad1..33dd90251a60 100644
--- a/drivers/lightnvm/pblk-write.c
+++ b/drivers/lightnvm/pblk-write.c
@@ -25,13 +25,20 @@ static unsigned long pblk_end_w_bio(struct pblk *pblk, 
struct nvm_rq *rqd,
unsigned long ret;
int i;
 
-   for (i = 0; i < c_ctx->nr_valid; i++) {
+   i = 0;
+   do {
struct pblk_w_ctx *w_ctx;
 
	w_ctx = pblk_rb_w_ctx(&pblk->rwb, c_ctx->sentry + i);
	while ((original_bio = bio_list_pop(&w_ctx->bios)))
bio_endio(original_bio);
-   }
+
+   i++;
+   } while (i < c_ctx->nr_valid);
+
+   if (c_ctx->nr_padded)
+   pblk_bio_free_pages(pblk, rqd->bio, c_ctx->nr_valid,
+   c_ctx->nr_padded);
 
 #ifdef CONFIG_NVM_DEBUG
	atomic_long_add(c_ctx->nr_valid, &pblk->sync_writes);
@@ -516,7 +523,8 @@ static void pblk_free_write_rqd(struct pblk *pblk, struct 
nvm_rq *rqd)
struct bio *bio = rqd->bio;
 
if (c_ctx->nr_padded)
-   pblk_bio_free_pages(pblk, bio, rqd->nr_ppas, c_ctx->nr_padded);
+   pblk_bio_free_pages(pblk, bio, c_ctx->nr_valid,
+   c_ctx->nr_padded);
 }
 
 static int pblk_submit_write(struct pblk *pblk)
-- 
2.7.4



[PATCH 07/18] lightnvm: pblk: use constant for GC parameter

2017-09-06 Thread Javier González
Use a constant instead of meaningless tuning parameters in the
garbage collector algorithms.

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-gc.c | 4 ++--
 drivers/lightnvm/pblk.h| 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-gc.c b/drivers/lightnvm/pblk-gc.c
index 6090d28f7995..92694e35afde 100644
--- a/drivers/lightnvm/pblk-gc.c
+++ b/drivers/lightnvm/pblk-gc.c
@@ -93,7 +93,7 @@ static int pblk_gc_move_valid_secs(struct pblk *pblk, struct 
pblk_gc_rq *gc_rq)
 
 retry:
	spin_lock(&gc->w_lock);
-   if (gc->w_entries >= PBLK_GC_W_QD) {
+   if (gc->w_entries >= PBLK_GC_RQ_QD) {
	spin_unlock(&gc->w_lock);
	pblk_gc_writer_kick(&pblk->gc);
usleep_range(128, 256);
@@ -602,7 +602,7 @@ int pblk_gc_init(struct pblk *pblk)
	spin_lock_init(&gc->w_lock);
	spin_lock_init(&gc->r_lock);

-   sema_init(&gc->gc_sem, 128);
+   sema_init(&gc->gc_sem, PBLK_GC_RQ_QD);

	INIT_LIST_HEAD(&gc->w_list);
	INIT_LIST_HEAD(&gc->r_list);
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 08a463fe855f..b7f5fa8b49d0 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -818,7 +818,7 @@ int pblk_recov_setup_rq(struct pblk *pblk, struct 
pblk_c_ctx *c_ctx,
  * pblk gc
  */
 #define PBLK_GC_MAX_READERS 8  /* Max number of outstanding GC reader jobs */
-#define PBLK_GC_W_QD 128   /* Queue depth for inflight GC write I/Os */
+#define PBLK_GC_RQ_QD 128  /* Queue depth for inflight GC requests */
 #define PBLK_GC_L_QD 4 /* Queue depth for inflight GC lines */
 #define PBLK_GC_RSV_LINE 1 /* Reserved lines for GC */
 
-- 
2.7.4



[PATCH 14/18] lightnvm: pblk: simplify path on REQ_PREFLUSH

2017-09-06 Thread Javier González
On REQ_PREFLUSH, directly tag the I/O context flags to signal a flush in
the write to cache path, instead of finding the correct entry context
and imposing a memory barrier. This simplifies the code and might
potentially prevent race conditions when adding functionality to the
write path.

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-cache.c | 4 +++-
 drivers/lightnvm/pblk-rb.c| 8 +---
 2 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/lightnvm/pblk-cache.c b/drivers/lightnvm/pblk-cache.c
index 1d6b8e3585f1..0d227ef7d1b9 100644
--- a/drivers/lightnvm/pblk-cache.c
+++ b/drivers/lightnvm/pblk-cache.c
@@ -43,8 +43,10 @@ int pblk_write_to_cache(struct pblk *pblk, struct bio *bio, 
unsigned long flags)
if (unlikely(!bio_has_data(bio)))
goto out;
 
+   pblk_ppa_set_empty(&w_ctx.ppa);
	w_ctx.flags = flags;
-   pblk_ppa_set_empty(&w_ctx.ppa);
+   if (bio->bi_opf & REQ_PREFLUSH)
+   w_ctx.flags |= PBLK_FLUSH_ENTRY;
 
for (i = 0; i < nr_entries; i++) {
void *data = bio_data(bio);
diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index 74c768ce09ef..05e6b2e9221d 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -355,7 +355,6 @@ static int pblk_rb_sync_point_set(struct pblk_rb *rb, 
struct bio *bio,
 {
struct pblk_rb_entry *entry;
unsigned int subm, sync_point;
-   int flags;
 
subm = READ_ONCE(rb->subm);
 
@@ -369,12 +368,6 @@ static int pblk_rb_sync_point_set(struct pblk_rb *rb, 
struct bio *bio,
sync_point = (pos == 0) ? (rb->nr_entries - 1) : (pos - 1);
	entry = &rb->entries[sync_point];
 
-   flags = READ_ONCE(entry->w_ctx.flags);
-   flags |= PBLK_FLUSH_ENTRY;
-
-   /* Release flags on context. Protect from writes */
-   smp_store_release(&entry->w_ctx.flags, flags);
-
/* Protect syncs */
	smp_store_release(&rb->sync_point, sync_point);
 
@@ -454,6 +447,7 @@ static int pblk_rb_may_write_flush(struct pblk_rb *rb, 
unsigned int nr_entries,
 
/* Protect from read count */
	smp_store_release(&rb->mem, mem);
+
return 1;
 }
 
-- 
2.7.4



[PATCH 15/18] lightnvm: pblk: avoid deadlock on low LUN config

2017-09-06 Thread Javier González
On low LUN configurations, make sure not to send bios that are bigger
than the buffer size.

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-init.c | 2 +-
 drivers/lightnvm/pblk-rl.c   | 6 ++
 drivers/lightnvm/pblk.h  | 2 ++
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 8459434edd89..12c05aebac16 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -46,7 +46,7 @@ static int pblk_rw_io(struct request_queue *q, struct pblk 
*pblk,
 * user I/Os. Unless stalled, the rate limiter leaves at least 256KB
 * available for user I/O.
 */
-   if (unlikely(pblk_get_secs(bio) >= pblk_rl_sysfs_rate_show(&pblk->rl)))
+   if (pblk_get_secs(bio) > pblk_rl_max_io(&pblk->rl))
	blk_queue_split(q, &bio);
 
return pblk_write_to_cache(pblk, bio, PBLK_IOTYPE_USER);
diff --git a/drivers/lightnvm/pblk-rl.c b/drivers/lightnvm/pblk-rl.c
index 2e6a5361baf0..596bdec433c3 100644
--- a/drivers/lightnvm/pblk-rl.c
+++ b/drivers/lightnvm/pblk-rl.c
@@ -178,6 +178,11 @@ int pblk_rl_sysfs_rate_show(struct pblk_rl *rl)
return rl->rb_user_max;
 }
 
+int pblk_rl_max_io(struct pblk_rl *rl)
+{
+   return rl->rb_max_io;
+}
+
 static void pblk_rl_u_timer(unsigned long data)
 {
struct pblk_rl *rl = (struct pblk_rl *)data;
@@ -214,6 +219,7 @@ void pblk_rl_init(struct pblk_rl *rl, int budget)
/* To start with, all buffer is available to user I/O writers */
rl->rb_budget = budget;
rl->rb_user_max = budget;
+   rl->rb_max_io = budget >> 1;
rl->rb_gc_max = 0;
rl->rb_state = PBLK_RL_HIGH;
 
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 2304a4891f20..2ae9d94e9852 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -273,6 +273,7 @@ struct pblk_rl {
int rb_gc_max;  /* Max buffer entries available for GC I/O */
int rb_gc_rsv;  /* Reserved buffer entries for GC I/O */
int rb_state;   /* Rate-limiter current state */
+   int rb_max_io;  /* Maximum size for an I/O giving the config */
 
atomic_t rb_user_cnt;   /* User I/O buffer counter */
atomic_t rb_gc_cnt; /* GC I/O buffer counter */
@@ -846,6 +847,7 @@ int pblk_rl_gc_may_insert(struct pblk_rl *rl, int 
nr_entries);
 void pblk_rl_gc_in(struct pblk_rl *rl, int nr_entries);
 void pblk_rl_out(struct pblk_rl *rl, int nr_user, int nr_gc);
 int pblk_rl_sysfs_rate_show(struct pblk_rl *rl);
+int pblk_rl_max_io(struct pblk_rl *rl);
 void pblk_rl_free_lines_inc(struct pblk_rl *rl, struct pblk_line *line);
 void pblk_rl_free_lines_dec(struct pblk_rl *rl, struct pblk_line *line);
 void pblk_rl_set_space_limit(struct pblk_rl *rl, int entries_left);
-- 
2.7.4



[PATCH 09/18] lightnvm: pblk: simplify data validity check on GC

2017-09-06 Thread Javier González
When a line is selected for recycling by the garbage collector (GC), the
line state changes and the invalid bitmap is frozen, preventing
invalidations from happening. Throughout the GC, the L2P map is checked
to verify that the data being recycled has not been updated. The last
check is done before the new mapping is stored in the L2P table. Though
this algorithm works, it requires a number of corner cases to be checked
each time the L2P table is updated. This complicates readability
and is error prone in case the recycling algorithm is modified.

Instead, this patch makes the invalid bitmap accessible even when the
line is being recycled. When recycled data is being remapped, it is
enough to check the invalid bitmap for the line before updating the L2P
table.
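
An illustrative sketch of the simplified check (paddr is assumed to be the
in-line physical address remembered when the sector was picked for GC): if
the sector was invalidated while GC was moving it, the invalid bitmap already
records that, so the stale data is simply not remapped.

static int example_gc_entry_still_valid(struct pblk_line *line, u64 paddr)
{
	int valid;

	spin_lock(&line->lock);
	valid = !test_bit(paddr, line->invalid_bitmap);
	spin_unlock(&line->lock);

	return valid;	/* update the L2P entry only when this returns true */
}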

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-cache.c | 20 +--
 drivers/lightnvm/pblk-core.c  | 29 +++-
 drivers/lightnvm/pblk-gc.c| 59 ++--
 drivers/lightnvm/pblk-rb.c|  6 ++--
 drivers/lightnvm/pblk-read.c  | 79 +++
 drivers/lightnvm/pblk.h   | 23 -
 6 files changed, 109 insertions(+), 107 deletions(-)

diff --git a/drivers/lightnvm/pblk-cache.c b/drivers/lightnvm/pblk-cache.c
index 024a8fc93069..1d6b8e3585f1 100644
--- a/drivers/lightnvm/pblk-cache.c
+++ b/drivers/lightnvm/pblk-cache.c
@@ -73,12 +73,11 @@ int pblk_write_to_cache(struct pblk *pblk, struct bio *bio, 
unsigned long flags)
  * On GC the incoming lbas are not necessarily sequential. Also, some of the
  * lbas might not be valid entries, which are marked as empty by the GC thread
  */
-int pblk_write_gc_to_cache(struct pblk *pblk, void *data, u64 *lba_list,
-  unsigned int nr_entries, unsigned int nr_rec_entries,
-  struct pblk_line *gc_line, unsigned long flags)
+int pblk_write_gc_to_cache(struct pblk *pblk, struct pblk_gc_rq *gc_rq)
 {
struct pblk_w_ctx w_ctx;
unsigned int bpos, pos;
+   void *data = gc_rq->data;
int i, valid_entries;
 
/* Update the write buffer head (mem) with the entries that we can
@@ -86,28 +85,29 @@ int pblk_write_gc_to_cache(struct pblk *pblk, void *data, 
u64 *lba_list,
 * rollback from here on.
 */
 retry:
-   if (!pblk_rb_may_write_gc(&pblk->rwb, nr_rec_entries, &bpos)) {
+   if (!pblk_rb_may_write_gc(&pblk->rwb, gc_rq->secs_to_gc, &bpos)) {
io_schedule();
goto retry;
}
 
-   w_ctx.flags = flags;
+   w_ctx.flags = PBLK_IOTYPE_GC;
	pblk_ppa_set_empty(&w_ctx.ppa);
 
-   for (i = 0, valid_entries = 0; i < nr_entries; i++) {
-   if (lba_list[i] == ADDR_EMPTY)
+   for (i = 0, valid_entries = 0; i < gc_rq->nr_secs; i++) {
+   if (gc_rq->lba_list[i] == ADDR_EMPTY)
continue;
 
-   w_ctx.lba = lba_list[i];
+   w_ctx.lba = gc_rq->lba_list[i];
 
	pos = pblk_rb_wrap_pos(&pblk->rwb, bpos + valid_entries);
-   pblk_rb_write_entry_gc(&pblk->rwb, data, w_ctx, gc_line, pos);
+   pblk_rb_write_entry_gc(&pblk->rwb, data, w_ctx, gc_rq->line,
+   gc_rq->paddr_list[i], pos);
 
data += PBLK_EXPOSED_PAGE_SIZE;
valid_entries++;
}
 
-   WARN_ONCE(nr_rec_entries != valid_entries,
+   WARN_ONCE(gc_rq->secs_to_gc != valid_entries,
"pblk: inconsistent GC write\n");
 
 #ifdef CONFIG_NVM_DEBUG
diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 11a684162a5d..f6b9afbe8589 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -77,11 +77,7 @@ void __pblk_map_invalidate(struct pblk *pblk, struct 
pblk_line *line,
 * that newer updates are not overwritten.
 */
	spin_lock(&line->lock);
-   if (line->state == PBLK_LINESTATE_GC ||
-   line->state == PBLK_LINESTATE_FREE) {
-   spin_unlock(&line->lock);
-   return;
-   }
+   WARN_ON(line->state == PBLK_LINESTATE_FREE);
 
if (test_and_set_bit(paddr, line->invalid_bitmap)) {
WARN_ONCE(1, "pblk: double invalidate\n");
@@ -98,8 +94,7 @@ void __pblk_map_invalidate(struct pblk *pblk, struct 
pblk_line *line,
	spin_lock(&l_mg->gc_lock);
	spin_lock(&line->lock);
	/* Prevent moving a line that has just been chosen for GC */
-   if (line->state == PBLK_LINESTATE_GC ||
-   line->state == PBLK_LINESTATE_FREE) {
+   if (line->state == PBLK_LINESTATE_GC) {
	spin_unlock(&line->lock);
	spin_unlock(&l_mg->gc_lock);
return;
@@ -1788,6 +1783,7 @@ void pblk_update_map(struct pblk 

[PATCH 10/18] lightnvm: pblk: use bio_copy_kern when possible

2017-09-06 Thread Javier González
In pblk, buffers forming bios can be allocated on physically contiguous
or virtually contiguous memory. For physically contiguous memory, we
already use the bio_map_kern helper function; however, for virtually
contiguous memory, we form the bio manually. This makes the code more
complex, especially on the completion path, where mapped pages need to be
freed.

Instead, use bio_copy_kern, which does the same and at the same time
simplifies the completion path.

Signed-off-by: Javier González 
Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/pblk-core.c | 39 ---
 drivers/lightnvm/pblk-read.c |  3 +--
 drivers/lightnvm/pblk-recovery.c |  3 +--
 drivers/lightnvm/pblk-write.c|  7 +--
 drivers/lightnvm/pblk.h  |  2 +-
 5 files changed, 8 insertions(+), 46 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index f6b9afbe8589..e69e8829b093 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -423,42 +423,14 @@ int pblk_submit_io(struct pblk *pblk, struct nvm_rq *rqd)
 
 struct bio *pblk_bio_map_addr(struct pblk *pblk, void *data,
  unsigned int nr_secs, unsigned int len,
- int alloc_type, gfp_t gfp_mask)
+ int alloc_type, gfp_t gfp_mask, int reading)
 {
struct nvm_tgt_dev *dev = pblk->dev;
-   void *kaddr = data;
-   struct page *page;
-   struct bio *bio;
-   int i, ret;
 
if (alloc_type == PBLK_KMALLOC_META)
-   return bio_map_kern(dev->q, kaddr, len, gfp_mask);
+   return bio_map_kern(dev->q, data, len, gfp_mask);
 
-   bio = bio_kmalloc(gfp_mask, nr_secs);
-   if (!bio)
-   return ERR_PTR(-ENOMEM);
-
-   for (i = 0; i < nr_secs; i++) {
-   page = vmalloc_to_page(kaddr);
-   if (!page) {
-   pr_err("pblk: could not map vmalloc bio\n");
-   bio_put(bio);
-   bio = ERR_PTR(-ENOMEM);
-   goto out;
-   }
-
-   ret = bio_add_pc_page(dev->q, bio, page, PAGE_SIZE, 0);
-   if (ret != PAGE_SIZE) {
-   pr_err("pblk: could not add page to bio\n");
-   bio_put(bio);
-   bio = ERR_PTR(-ENOMEM);
-   goto out;
-   }
-
-   kaddr += PAGE_SIZE;
-   }
-out:
-   return bio;
+   return bio_copy_kern(dev->q, data, len, GFP_KERNEL, reading);
 }
 
 int pblk_calc_secs(struct pblk *pblk, unsigned long secs_avail,
@@ -588,7 +560,7 @@ static int pblk_line_submit_emeta_io(struct pblk *pblk, 
struct pblk_line *line,
rq_len = rq_ppas * geo->sec_size;
 
bio = pblk_bio_map_addr(pblk, emeta_buf, rq_ppas, rq_len,
-   l_mg->emeta_alloc_type, GFP_KERNEL);
+   l_mg->emeta_alloc_type, GFP_KERNEL, dir == READ);
if (IS_ERR(bio)) {
ret = PTR_ERR(bio);
goto free_rqd_dma;
@@ -673,9 +645,6 @@ static int pblk_line_submit_emeta_io(struct pblk *pblk, 
struct pblk_line *line,
	atomic_dec(&pblk->inflight_io);
	reinit_completion(&wait);
 
-   if (likely(pblk->l_mg.emeta_alloc_type == PBLK_VMALLOC_META))
-   bio_put(bio);
-
if (rqd.error) {
if (dir == WRITE)
pblk_log_write_err(pblk, );
diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
index f32091503547..1be972521dcd 100644
--- a/drivers/lightnvm/pblk-read.c
+++ b/drivers/lightnvm/pblk-read.c
@@ -542,7 +542,7 @@ int pblk_submit_read_gc(struct pblk *pblk, struct 
pblk_gc_rq *gc_rq)
 
data_len = (gc_rq->secs_to_gc) * geo->sec_size;
bio = pblk_bio_map_addr(pblk, gc_rq->data, gc_rq->secs_to_gc, data_len,
-   PBLK_VMALLOC_META, GFP_KERNEL);
+   PBLK_VMALLOC_META, GFP_KERNEL, 1);
if (IS_ERR(bio)) {
pr_err("pblk: could not allocate GC bio (%lu)\n", PTR_ERR(bio));
goto err_free_dma;
@@ -583,7 +583,6 @@ int pblk_submit_read_gc(struct pblk *pblk, struct pblk_gc_rq *gc_rq)
atomic_long_sub(gc_rq->secs_to_gc, &pblk->inflight_reads);
 #endif
 
-   bio_put(bio);
 out:
nvm_dev_dma_free(dev->parent, rqd.meta_list, rqd.dma_meta_list);
return ret;
diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 279638309d9a..b033d4f2b446 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -342,7 +342,6 @@ static void pblk_end_io_recov(struct nvm_rq *rqd)
 
pblk_up_page(pblk, rqd->ppa_list, rqd->nr_ppas);
 
-   bio_put(rqd->bio);
nvm_dev_dma_free(dev->parent, rqd->meta_list, rqd->dma_meta_list);
pblk_free_rqd(pblk, 

[PATCH 00/18] lightnvm: pblk patches for 4.14

2017-09-06 Thread Javier González
Hi Jens,

Here is the pblk patchset for this window.

The most notable patches are on the read path:
  - ("lightnvm: pblk: check lba sanity on read path") uses the lba
stored on the out-of-bound area to verify that the read ppa
corresponds to the lba pointed to by the read bio.
  - ("lightnvm: pblk: guarantee line integrity on reads") guarantees
that if a line is being GC'ed and a read comes in the middle, that
line is not being moved into the free list until the read completes.
Otherwise, the line could be reclaimed to be erased and re-written,
thus causing data corruption.

The rest of the patches are basically bug fixes and refactoring to
improve code readability.
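
As a rough sketch of the lba sanity check idea only (the helper below
is made up for illustration and is not the code in the patch;
ADDR_EMPTY is assumed to be pblk's sentinel for unmapped sectors): the
lba stored in the out-of-band area of each read sector is compared
against the lba the bio asked for.

/*
 * Sketch: a read is only valid if the lba recorded in the sector's
 * out-of-band area matches the lba the bio requested; otherwise the
 * L2P table pointed us at stale data.
 */
static int pblk_check_read_lba(u64 oob_lba, u64 bio_lba)
{
	if (oob_lba == ADDR_EMPTY)	/* padded or unmapped sector */
		return 0;

	if (oob_lba != bio_lba) {
		pr_err("pblk: corrupted read lba (oob %llu, bio %llu)\n",
		       (unsigned long long)oob_lba,
		       (unsigned long long)bio_lba);
		return -EIO;
	}

	return 0;
}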

The patches apply on top of your for-4.14/block branch, and can be
found at:
  https://github.com/OpenChannelSSD/linux/tree/pblk.for-4.14

Thanks,
Javier

Javier González (18):
  lightnvm: pblk: improve naming for internal req.
  lightnvm: pblk: refactor read lba sanity check
  lightnvm: pblk: normalize ppa namings
  lightnvm: pblk: check for failed mempool alloc.
  lightnvm: pblk: initialize debug stat counter
  lightnvm: pblk: use right flag for GC allocation
  lightnvm: pblk: use constant for GC parameter
  lightnvm: pblk: check lba sanity on read path
  lightnvm: pblk: simplify data validity check on GC
  lightnvm: pblk: use bio_copy_kern when possible
  lightnvm: pblk: refactor read path on GC
  lightnvm: pblk: free padded entries in write buffer
  lightnvm: pblk: fix write I/O sync stat
  lightnvm: pblk: simplify path on REQ_PREFLUSH
  lightnvm: pblk: avoid deadlock on low LUN config
  lightnvm: pblk: enable 1 LUN configuration
  lightnvm: pblk: guarantee line integrity on reads
  lightnvm: pblk: remove unnecessary check

 drivers/lightnvm/pblk-cache.c|  24 ++--
 drivers/lightnvm/pblk-core.c | 187 --
 drivers/lightnvm/pblk-gc.c   | 133 ++
 drivers/lightnvm/pblk-init.c |  25 -
 drivers/lightnvm/pblk-map.c  |  21 ++--
 drivers/lightnvm/pblk-rb.c   |  14 +--
 drivers/lightnvm/pblk-read.c | 237 ++-
 drivers/lightnvm/pblk-recovery.c |   9 +-
 drivers/lightnvm/pblk-rl.c   |   6 +
 drivers/lightnvm/pblk-write.c|  37 +++---
 drivers/lightnvm/pblk.h  |  40 +++
 11 files changed, 423 insertions(+), 310 deletions(-)

-- 
2.7.4



[PATCH] lightnvm: Add error code for bad write pointer

2017-09-06 Thread Javier González
Add the new error code introduced in the OCSSD 2.0 spec for a write
pointer mismatch on the device side. This indicates to the host that a
write to a block (chunk) is not respecting the required sequentiality.
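
For illustration only (the handler name and policy below are
hypothetical, not part of this patch), a host-side completion path
could distinguish the new code roughly like this:

/*
 * Sketch: classify the completion status. A write-pointer failure
 * means the write did not respect the chunk's sequential-write
 * constraint, as opposed to a generic media write failure.
 */
static void nvm_classify_write_error(struct nvm_rq *rqd)
{
	switch (rqd->error) {
	case NVM_RSP_SUCCESS:
		return;
	case NVM_RSP_ERR_FAILWP:
		pr_err("nvm: out-of-order write to chunk\n");
		break;
	case NVM_RSP_ERR_FAILWRITE:
		pr_err("nvm: media write failure\n");
		break;
	default:
		pr_err("nvm: write failed with status 0x%x\n", rqd->error);
		break;
	}
}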

Signed-off-by: Javier González 
---
 include/linux/lightnvm.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index 7dfa56ebbc6d..81b71c6d5873 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -102,7 +102,8 @@ enum {
/* Status codes */
NVM_RSP_SUCCESS = 0x0,
NVM_RSP_NOT_CHANGEABLE  = 0x1,
-   NVM_RSP_ERR_FAILWRITE   = 0x40ff,
+   NVM_RSP_ERR_FAILWRITE   = 0x40ff,   /* Generic write failure */
+   NVM_RSP_ERR_FAILWP  = 0x42f0,   /* Write pointer failure */
NVM_RSP_ERR_EMPTYPAGE   = 0x42ff,
NVM_RSP_ERR_FAILECC = 0x4281,
NVM_RSP_ERR_FAILCRC = 0x4004,
-- 
2.7.4



[PATCH] block: export symbol for bio_copy_kern

2017-09-06 Thread Javier González
Export symbol for bio_copy_kern so that we can use it in modules.
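
For context, a minimal sketch of what a modular caller looks like once
the symbol is exported (example_copy_bio() is made up; the queue,
buffer and length are assumed to come from the caller):

/*
 * Sketch: build a bio for 'len' bytes of a vmalloc'ed buffer.
 * bio_copy_kern() allocates bounce pages and, for a read (last
 * argument non-zero), copies the data back into 'buf' on completion.
 */
static struct bio *example_copy_bio(struct request_queue *q, void *buf,
				    unsigned int len, int reading)
{
	return bio_copy_kern(q, buf, len, GFP_KERNEL, reading);
}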

Reported-by: Andiry Xu 
Signed-off-by: Javier González 
---
 block/bio.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/bio.c b/block/bio.c
index 6745759028da..cface315ace5 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1610,6 +1610,7 @@ struct bio *bio_copy_kern(struct request_queue *q, void *data, unsigned int len,
bio_put(bio);
return ERR_PTR(-ENOMEM);
 }
+EXPORT_SYMBOL(bio_copy_kern);
 
 /*
  * bio_set_pages_dirty() and bio_check_pages_dirty() are support functions
-- 
2.7.4



[PATCH 1/2] scsi_transport_fc: fix NULL pointer dereference in fc_bsg_job_timeout

2017-09-06 Thread Christoph Hellwig
bsg-lib now embeds the job structure into the request, and req->special
can't be used anymore.
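
In other words (a sketch only; the accessor name is illustrative), the
job now lives in the per-request PDU area right behind struct request
and should be fetched like this:

/* The bsg_job is embedded as the request PDU, not hung off req->special. */
static inline struct bsg_job *req_to_bsg_job(struct request *req)
{
	return blk_mq_rq_to_pdu(req);
}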

Signed-off-by: Christoph Hellwig 
Cc: sta...@vger.kernel.org
---
 drivers/scsi/scsi_transport_fc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 3c6bc0081fcb..d8de46806a1e 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -3571,7 +3571,7 @@ fc_vport_sched_delete(struct work_struct *work)
 static enum blk_eh_timer_return
 fc_bsg_job_timeout(struct request *req)
 {
-   struct bsg_job *job = (void *) req->special;
+   struct bsg_job *job = blk_mq_rq_to_pdu(req);
struct Scsi_Host *shost = fc_bsg_to_shost(job);
struct fc_rport *rport = fc_bsg_to_rport(job);
struct fc_internal *i = to_fc_internal(shost->transportt);
-- 
2.11.0



[PATCH 2/2] bsg-lib: don't free job in bsg_prepare_job

2017-09-06 Thread Christoph Hellwig
The job structure is allocated as part of the request, so we should not
free it in the error path of bsg_prepare_job.

Signed-off-by: Christoph Hellwig 
Cc: sta...@vger.kernel.org
---
 block/bsg-lib.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/bsg-lib.c b/block/bsg-lib.c
index c82408c7cc3c..dbddff8174e5 100644
--- a/block/bsg-lib.c
+++ b/block/bsg-lib.c
@@ -154,7 +154,6 @@ static int bsg_prepare_job(struct device *dev, struct request *req)
 failjob_rls_rqst_payload:
kfree(job->request_payload.sg_list);
 failjob_rls_job:
-   kfree(job);
return -ENOMEM;
 }
 
-- 
2.11.0



two small bsg fixes

2017-09-06 Thread Christoph Hellwig
Two fixups for the recent bsg-lib fixups, should go into 4.13 stable as
well.



Re: [PATCH V7 00/10] mmc: Add Command Queue support

2017-09-06 Thread Ulf Hansson
On 6 September 2017 at 09:20, Adrian Hunter  wrote:
> On 05/09/17 20:54, Ulf Hansson wrote:
>> On 5 September 2017 at 10:10, Adrian Hunter  wrote:
>>> On 05/09/17 10:24, Ulf Hansson wrote:
 [...]

>>>
>>> I can send blk-mq support for legacy requests in a few days if you 
>>> like, but
>>> I want to hear a better explanation of why you are delaying CQE support.
>>
>> That would be very nice, however be aware of that we are in the merge
>> window, so I am not picking new material for 4.14 from this point. I
>> assume you understand why.
>
> Nope.  This is new functionality - doesn't affect anyone who doesn't have 
> a
> command queue engine.  Next to no chance of regressions.  Tested by 
> several
> in the community.  Substantially unchanged since February.  It is not even
> very much code in the block driver.

 Let me make it clear, once more - I don't want to maintain more hacks
 in mmc block layer code.

 This series add blkmq support, using a method (which may be considered
 as intermediate) via a new change in patch1 - but only for the new CQE
 path. That means the old legacy mmc block path is still there. So, for
 the reason stated above - no thanks!
>>>
>>> And where is your alternative.  When I pointed out you need a way to
>>> arbitrate between internal partitions, you went silent again.
>>>
>>> Can't have CQE without blk-mq but can't have blk-mq because you don't
>>> understand it, is hardly acceptable.
>>
>> Adrian, this discussion seems to lead nowhere. Can we please stop and
>> be constructive instead!
>
> If you want to be constructive you will queue CQE support for v4.15 now!
>
>>
>> Regarding the arbitration issue. We have been moving forward,
>> re-factoring the mmc block driver code, soon also solving the problem
>> for the rpmb internal partition [1]. Maybe the background to why Linus
>> is working on mmc block re-factoring, hasn't been entirely clear.
>> Anyway, it's exactly because of moving closer to address these issues.
>
> Nope, wrt blk-mq you are moving sideways with no clear idea where you're 
> going.
>
>> Even if the problems certainly becomes a step harder to resolve for
>> the boot and the general purpose partitions, it's still a path we
>> should try to find a solution for. Yeah, that may mean we need to
>> suggest changes for the generic block layer, to teach it to better
>> deal with these kind of devices.
>
> You mean you have no idea how to do it but we are still expected to wait.
> How is that acceptable!
>
>> Finally, I have never said the arbitration issue *must* be solved
>> before converting to blkmq. Only that we should avoid performance
>> regressions, but that of course applies to whatever changes we do.
>
> Then there is no problem in queuing up the CQE patches now!

Either you are not paying attention or just being grumpy.

So for the last time, I don't want to pick up CQE before the existing
mmc block code has been converted to blkmq.

End of discussion!

Kind regards
Uffe


Re: [PATCH 3/6] hid: make device_attribute const

2017-09-06 Thread Jiri Kosina
On Mon, 21 Aug 2017, Bhumika Goyal wrote:

> Make this const as it is only passed as an argument to the
> function device_create_file and device_remove_file and the corresponding
> arguments are of type const.
> Done using Coccinelle
> 
> Signed-off-by: Bhumika Goyal 
> ---
>  drivers/hid/hid-core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/hid/hid-core.c b/drivers/hid/hid-core.c
> index 9bc9116..24e929c 100644
> --- a/drivers/hid/hid-core.c
> +++ b/drivers/hid/hid-core.c
> @@ -1662,7 +1662,7 @@ static bool hid_hiddev(struct hid_device *hdev)
>   .size = HID_MAX_DESCRIPTOR_SIZE,
>  };
>  
> -static struct device_attribute dev_attr_country = {
> +static const struct device_attribute dev_attr_country = {
>   .attr = { .name = "country", .mode = 0444 },
>   .show = show_country,

Applied, thanks.

-- 
Jiri Kosina
SUSE Labs



Re: [PATCH] blk-mq: fix nr_requests wrong value when modify it from sysfs

2017-09-06 Thread weiping zhang
On Tue, Sep 05, 2017 at 03:42:45PM +, Bart Van Assche wrote:
> On Sun, 2017-09-03 at 21:46 +0800, weiping zhang wrote:
> > if blk-mq use "none" io scheduler, nr_request get a wrong value when
> > input a number > tag_set->queue_depth. blk_mq_tag_update_depth will get
> > the smaller one min(nr, set->queue_depth), and then q->nr_request get a
> > wrong value.
> > 
> > Reproduce:
> > 
> > echo none > /sys/block/nvme0n1/queue/ioscheduler
> > echo 100 > /sys/block/nvme0n1/queue/nr_requests
> > cat /sys/block/nvme0n1/queue/nr_requests
> > 100
> > 
> > Signed-off-by: weiping zhang 
> > ---
> >  block/blk-mq.c | 7 +--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index f84d145..8303e5e 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2622,8 +2622,11 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
> >  * queue depth. This is similar to what the old code would do.
> >  */
> > if (!hctx->sched_tags) {
> > -   ret = blk_mq_tag_update_depth(hctx, &hctx->tags,
> > -   min(nr, set->queue_depth),
> > +   if (nr > set->queue_depth) {
> > +   nr = set->queue_depth;
> > +   pr_warn("reduce nr_request to %u\n", nr);
> > +   }
> > +   ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
> > false);
> > } else {
> > ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
> 
> Shouldn't this code return -EINVAL or -ERANGE if 'nr' is too large? That will 
> help to
> keep user space code simple that updates the queue depth.

Hi Bart,

The reason for not returning -EINVAL is to stay aligned with the
minimum-value check in queue_requests_store; if you insist on returning
-EINVAL/-ERANGE, the minimum check should keep the same behavior. Both
returning an error and quietly clamping are okay for me. Which way do
you prefer?

static ssize_t
queue_requests_store(struct request_queue *q, const char *page, size_t count)
{
unsigned long nr;
int ret, err;

if (!q->request_fn && !q->mq_ops)
return -EINVAL;

ret = queue_var_store(&nr, page, count);
if (ret < 0)
return ret;

if (nr < BLKDEV_MIN_RQ)
nr = BLKDEV_MIN_RQ;

Thanks


Re: [PATCH V7 00/10] mmc: Add Command Queue support

2017-09-06 Thread Adrian Hunter
On 05/09/17 20:54, Ulf Hansson wrote:
> On 5 September 2017 at 10:10, Adrian Hunter  wrote:
>> On 05/09/17 10:24, Ulf Hansson wrote:
>>> [...]
>>>
>>
>> I can send blk-mq support for legacy requests in a few days if you like, 
>> but
>> I want to hear a better explanation of why you are delaying CQE support.
>
> That would be very nice, however be aware of that we are in the merge
> window, so I am not picking new material for 4.14 from this point. I
> assume you understand why.

 Nope.  This is new functionality - doesn't affect anyone who doesn't have a
 command queue engine.  Next to no chance of regressions.  Tested by several
 in the community.  Substantially unchanged since February.  It is not even
 very much code in the block driver.
>>>
>>> Let me make it clear, once more - I don't want to maintain more hacks
>>> in mmc block layer code.
>>>
>>> This series add blkmq support, using a method (which may be considered
>>> as intermediate) via a new change in patch1 - but only for the new CQE
>>> path. That means the old legacy mmc block path is still there. So, for
>>> the reason stated above - no thanks!
>>
>> And where is your alternative.  When I pointed out you need a way to
>> arbitrate between internal partitions, you went silent again.
>>
>> Can't have CQE without blk-mq but can't have blk-mq because you don't
>> understand it, is hardly acceptable.
> 
> Adrian, this discussion seems to lead nowhere. Can we please stop and
> be constructive instead!

If you want to be constructive you will queue CQE support for v4.15 now!

> 
> Regarding the arbitration issue. We have been moving forward,
> re-factoring the mmc block driver code, soon also solving the problem
> for the rpmb internal partition [1]. Maybe the background to why Linus
> is working on mmc block re-factoring, hasn't been entirely clear.
> Anyway, it's exactly because of moving closer to address these issues.

Nope, wrt blk-mq you are moving sideways with no clear idea where you're going.

> Even if the problems certainly becomes a step harder to resolve for
> the boot and the general purpose partitions, it's still a path we
> should try to find a solution for. Yeah, that may mean we need to
> suggest changes for the generic block layer, to teach it to better
> deal with these kind of devices.

You mean you have no idea how to do it but we are still expected to wait.
How is that acceptable!

> Finally, I have never said the arbitration issue *must* be solved
> before converting to blkmq. Only that we should avoid performance
> regressions, but that of course applies to whatever changes we do.

Then there is no problem in queuing up the CQE patches now!


[PATCH 08/12] bcache: increase the number of open buckets

2017-09-06 Thread Coly Li
From: Tang Junhui 

Currently we only allocate 6 open buckets for each cache set, but we
usually attach about 10 or so backend devices to each cache set, and
each bcache device is accessed by about 10 or so threads in the top
application layer. So 6 open buckets are too few: it leads to the same
thread writing data to different buckets, which causes inefficient
write-back, poor bucket utilization, and makes it very easy to run out
of open buckets.
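
Rough arithmetic, for illustration only: 10 backend devices x ~10
threads each is on the order of 100 concurrent writers, so 6 open
buckets force heavy sharing, while 128 give each writer a realistic
chance of keeping its own bucket.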

I added a debug message in bch_open_buckets_alloc() to print allocated
bucket info, and tested with ten bcache devices on one cache set, each
bcache device accessed by ten threads.

From the debug messages we can see that, after the modification, a
bucket is more likely to be assigned to the same thread, and data from
the same thread is more likely to be written to the same bucket.
Usually the same thread always writes/reads the same backend device, so
this is good for write-back and also improves the usage efficiency of
buckets.

Signed-off-by: Tang Junhui 
Reviewed-by: Coly Li 
---
 drivers/md/bcache/alloc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index ca4abe1ccd8d..cacbe2dbd5c3 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -68,6 +68,8 @@
 #include 
 #include 
 
+#define MAX_OPEN_BUCKETS 128
+
 /* Bucket heap / gen */
 
 uint8_t bch_inc_gen(struct cache *ca, struct bucket *b)
@@ -671,7 +673,7 @@ int bch_open_buckets_alloc(struct cache_set *c)
 
spin_lock_init(&c->data_bucket_lock);
 
-   for (i = 0; i < 6; i++) {
+   for (i = 0; i < MAX_OPEN_BUCKETS; i++) {
struct open_bucket *b = kzalloc(sizeof(*b), GFP_KERNEL);
if (!b)
return -ENOMEM;
-- 
2.13.5



[PATCH 09/12] bcache: fix for gc and write-back race

2017-09-06 Thread Coly Li
From: Tang Junhui 

gc and write-back can race with each other (see the email "bcache get
stucked" I sent before):
gc thread   write-back thread
|   |bch_writeback_thread()
|bch_gc_thread()|
|   |==>read_dirty()
|==>bch_btree_gc()  |
|==>btree_root() //get btree root   |
|//node write lock  |
|==>bch_btree_gc_root() |
|   |==>read_dirty_submit()
|   |==>write_dirty()
|   |==>continue_at(cl,
|   |   write_dirty_finish,
|   |   system_wq);
|   |==>write_dirty_finish()//execute
|   |   //in system_wq
|   |==>bch_btree_insert()
|   |==>bch_btree_map_leaf_nodes()
|   |==>__bch_btree_map_nodes()
|   |==>btree_root //try to get btree
|   |  //root node read
|   |  //lock
|   |-stuck here
|==>bch_btree_set_root()
|==>bch_journal_meta()
|==>bch_journal()
|==>journal_try_write()
|==>journal_write_unlocked() //journal_full(>journal)
|//condition satisfied
|==>continue_at(cl, journal_write, system_wq); //try to execute
|   //journal_write in system_wq
|   //but work queue is executing
|   //write_dirty_finish()
|==>closure_sync(); //wait journal_write execute
|   //over and wake up gc,
|-stuck here
|==>release root node write lock

This patch allocates a separate workqueue for the write-back thread to
avoid such a race.

(Commit log re-organized by Coly Li to pass checkpatch.pl checking)

Signed-off-by: Tang Junhui 
Acked-by: Coly Li 
Cc: sta...@vger.kernel.org
---
 drivers/md/bcache/bcache.h| 1 +
 drivers/md/bcache/super.c | 2 ++
 drivers/md/bcache/writeback.c | 9 +++--
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index dee542fff68e..2ed9bd231d84 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -333,6 +333,7 @@ struct cached_dev {
/* Limit number of writeback bios in flight */
struct semaphorein_flight;
struct task_struct  *writeback_thread;
+   struct workqueue_struct *writeback_write_wq;
 
struct keybuf   writeback_keys;
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 9a2c190745b6..253918972335 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1059,6 +1059,8 @@ static void cached_dev_free(struct closure *cl)
cancel_delayed_work_sync(&dc->writeback_rate_update);
if (!IS_ERR_OR_NULL(dc->writeback_thread))
kthread_stop(dc->writeback_thread);
+   if (dc->writeback_write_wq)
+   destroy_workqueue(dc->writeback_write_wq);
 
mutex_lock(&bch_register_lock);
 
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index b533c2292ba5..323551f7cb28 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -187,7 +187,7 @@ static void write_dirty(struct closure *cl)
 
closure_bio_submit(&io->bio, cl);
 
-   continue_at(cl, write_dirty_finish, system_wq);
+   continue_at(cl, write_dirty_finish, io->dc->writeback_write_wq);
 }
 
 static void read_dirty_endio(struct bio *bio)
@@ -207,7 +207,7 @@ static void read_dirty_submit(struct closure *cl)
 
closure_bio_submit(&io->bio, cl);
 
-   continue_at(cl, write_dirty, system_wq);
+   continue_at(cl, write_dirty, io->dc->writeback_write_wq);
 }
 
 static void read_dirty(struct cached_dev *dc)
@@ -517,6 +517,11 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
 
 int bch_cached_dev_writeback_start(struct cached_dev *dc)
 {
+   dc->writeback_write_wq = alloc_workqueue("bcache_writeback_wq",
+   WQ_MEM_RECLAIM, 0);
+   if (!dc->writeback_write_wq)
+   return -ENOMEM;
+
dc->writeback_thread = kthread_create(bch_writeback_thread, dc,
  "bcache_writeback");
if (IS_ERR(dc->writeback_thread))
-- 
2.13.5



[PATCH 10/12] bcache: silence static checker warning

2017-09-06 Thread Coly Li
From: Dan Carpenter 

In olden times, closure_return() used to have a hidden return built in.
We removed the hidden return but forgot to add a new return here.  If
"c" were NULL we would oops on the next line, but fortunately "c" is
never NULL.  Let's just remove the if statement.

Signed-off-by: Dan Carpenter 
Reviewed-by: Coly Li 
---
 drivers/md/bcache/super.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 253918972335..c5deec621fdd 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1376,9 +1376,6 @@ static void cache_set_flush(struct closure *cl)
struct btree *b;
unsigned i;
 
-   if (!c)
-   closure_return(cl);
-
bch_cache_accounting_destroy(&c->accounting);
 
kobject_put(&c->internal);
-- 
2.13.5



[PATCH 07/12] bcache: Correct return value for sysfs attach errors

2017-09-06 Thread Coly Li
From: Tony Asleson 

If you encounter any errors in bch_cached_dev_attach it will return
a negative error code.  The variable 'v' which stores the result is
unsigned, thus user space sees a very large value returned for bytes
written, which can cause incorrect user space behavior. Use a single
signed variable throughout the function to preserve the ability to
return errors.

Signed-off-by: Tony Asleson 
Acked-by: Coly Li 
Cc: sta...@vger.kernel.org
---
 drivers/md/bcache/sysfs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 09295c07ea88..104c57cd666c 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -192,7 +192,7 @@ STORE(__cached_dev)
 {
struct cached_dev *dc = container_of(kobj, struct cached_dev,
 disk.kobj);
-   unsigned v = size;
+   ssize_t v = size;
struct cache_set *c;
struct kobj_uevent_env *env;
 
@@ -227,7 +227,7 @@ STORE(__cached_dev)
bch_cached_dev_run(dc);
 
if (attr == &sysfs_cache_mode) {
-   ssize_t v = bch_read_string_list(buf, bch_cache_modes + 1);
+   v = bch_read_string_list(buf, bch_cache_modes + 1);
 
if (v < 0)
return v;
-- 
2.13.5



[PATCH 04/12] bcache: Don't reinvent the wheel but use existing llist API

2017-09-06 Thread Coly Li
From: Byungchul Park 

Although llist provides proper APIs, they are not used. Make them used.

Signed-off-by: Byungchul Park 
Acked-by: Coly Li 
---
 drivers/md/bcache/closure.c | 15 ++-
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c
index 864e673aec39..7d5286b05036 100644
--- a/drivers/md/bcache/closure.c
+++ b/drivers/md/bcache/closure.c
@@ -70,21 +70,10 @@ void __closure_wake_up(struct closure_waitlist *wait_list)
list = llist_del_all(&wait_list->list);
 
/* We first reverse the list to preserve FIFO ordering and fairness */
-
-   while (list) {
-   struct llist_node *t = list;
-   list = llist_next(list);
-
-   t->next = reverse;
-   reverse = t;
-   }
+   reverse = llist_reverse_order(list);
 
/* Then do the wakeups */
-
-   while (reverse) {
-   cl = container_of(reverse, struct closure, list);
-   reverse = llist_next(reverse);
-
+   llist_for_each_entry(cl, reverse, list) {
closure_set_waiting(cl, 0);
closure_sub(cl, CLOSURE_WAITING + 1);
}
-- 
2.13.5



[PATCH 06/12] bcache: correct cache_dirty_target in __update_writeback_rate()

2017-09-06 Thread Coly Li
From: Tang Junhui 

__update_writeback_rate() uses a Proportion-Differentiation (PD)
controller algorithm to control the writeback rate. A dirty target
number is used by this PD controller: a larger target number makes the
writeback rate smaller and, conversely, a smaller target number makes
the writeback rate larger.

bcache uses the following steps to calculate the target number,
1) cache_sectors = all-buckets-of-cache-set * buckets-size
2) cache_dirty_target = cache_sectors * cached-device-writeback_percent
3) target = cache_dirty_target *
(sectors-of-cached-device/sectors-of-all-cached-devices-of-this-cache-set)

The calculation of cache_sectors at step 1) is incorrect: it does not
consider dirty blocks occupied by flash only volumes.

A flash only volume can be regarded as a bcache device without a cached
device. All data sectors allocated for it are persistent on the cache
device and marked dirty, and they are not touched by the bcache
writeback and garbage collection code. So data blocks of flash only
volumes should be ignored when calculating cache_sectors of a cache
set.

The current code does not subtract dirty sectors of flash only volumes,
which results in a larger target number from the above 3 steps. As a
consequence the cache device's writeback rate is smaller than it should
be, and writeback is slower on all cached devices.

This patch fixes the incorrect slower writeback rate by subtracting
dirty sectors of flash only volumes in __update_writeback_rate().
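
A made-up numeric illustration of the effect: with nbuckets *
bucket_size = 1000000 sectors, writeback_percent = 10, a cached device
holding half of all cached-device sectors, and a flash only volume
holding 200000 dirty sectors,

  old: target = 1000000            * 10% * 1/2 = 50000 sectors
  new: target = (1000000 - 200000) * 10% * 1/2 = 40000 sectors

so the smaller target lets the PD controller raise the writeback rate
as intended.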

(Commit log composed by Coly Li to pass checkpatch.pl checking)

Signed-off-by: Tang Junhui 
Reviewed-by: Coly Li 
Cc: sta...@vger.kernel.org
---
 drivers/md/bcache/writeback.c |  3 ++-
 drivers/md/bcache/writeback.h | 19 +++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 42c66e76f05e..b533c2292ba5 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -21,7 +21,8 @@
 static void __update_writeback_rate(struct cached_dev *dc)
 {
struct cache_set *c = dc->disk.c;
-   uint64_t cache_sectors = c->nbuckets * c->sb.bucket_size;
+   uint64_t cache_sectors = c->nbuckets * c->sb.bucket_size -
+   bcache_flash_devs_sectors_dirty(c);
uint64_t cache_dirty_target =
div_u64(cache_sectors * dc->writeback_percent, 100);
 
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 629bd1a502fd..8789b9c8c484 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -14,6 +14,25 @@ static inline uint64_t bcache_dev_sectors_dirty(struct bcache_device *d)
return ret;
 }
 
+static inline uint64_t  bcache_flash_devs_sectors_dirty(struct cache_set *c)
+{
+   uint64_t i, ret = 0;
+
+   mutex_lock(&bch_register_lock);
+
+   for (i = 0; i < c->nr_uuids; i++) {
+   struct bcache_device *d = c->devices[i];
+
+   if (!d || !UUID_FLASH_ONLY(&c->uuids[i]))
+   continue;
+  ret += bcache_dev_sectors_dirty(d);
+   }
+
+   mutex_unlock(&bch_register_lock);
+
+   return ret;
+}
+
 static inline unsigned offset_to_stripe(struct bcache_device *d,
uint64_t offset)
 {
-- 
2.13.5



[PATCH 01/12] bcache: Fix leak of bdev reference

2017-09-06 Thread Coly Li
From: Jan Kara 

If blkdev_get_by_path() in register_bcache() fails, we try to lookup the
block device using lookup_bdev() to detect which situation we are in to
properly report error. However we never drop the reference returned to
us from lookup_bdev(). Fix that.

Signed-off-by: Jan Kara 
Acked-by: Coly Li 
Cc: sta...@vger.kernel.org
---
 drivers/md/bcache/super.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 8352fad765f6..9a2c190745b6 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1964,6 +1964,8 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
else
err = "device busy";
mutex_unlock(&bch_register_lock);
+   if (!IS_ERR(bdev))
+   bdput(bdev);
if (attr == &ksysfs_register_quiet)
goto out;
}
-- 
2.13.5



[PATCH 03/12] bcache: do not subtract sectors_to_gc for bypassed IO

2017-09-06 Thread Coly Li
From: Tang Junhui 

Since bypassed IOs use no bucket, do not subtract them from
sectors_to_gc when deciding whether to trigger the gc thread.

Signed-off-by: tang.junhui 
Acked-by: Coly Li 
Reviewed-by: Eric Wheeler 
Reviewed-by: Christoph Hellwig 
Cc: sta...@vger.kernel.org
---
 drivers/md/bcache/request.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 958072a11347..4b413db99276 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -196,12 +196,12 @@ static void bch_data_insert_start(struct closure *cl)
struct data_insert_op *op = container_of(cl, struct data_insert_op, cl);
struct bio *bio = op->bio, *n;
 
-   if (atomic_sub_return(bio_sectors(bio), &op->c->sectors_to_gc) < 0)
-   wake_up_gc(op->c);
-
if (op->bypass)
return bch_data_invalidate(cl);
 
+   if (atomic_sub_return(bio_sectors(bio), &op->c->sectors_to_gc) < 0)
+   wake_up_gc(op->c);
+
/*
 * Journal writes are marked REQ_PREFLUSH; if the original write was a
 * flush, it'll wait on the journal write.
-- 
2.13.5



[PATCH 05/12] bcache: gc does not work when triggering by manual command

2017-09-06 Thread Coly Li
From: Tang Junhui 

I tried to execute the following command to trigger the gc thread:
[root@localhost internal]# echo 1 > trigger_gc
But it did not work. I debugged the code in gc_should_run(): gc only
runs while invalidating or when sectors_to_gc < 0. So set sectors_to_gc
to -1 to meet the condition when gc is triggered by a manual command.

(Code comments added by Coly Li)

Signed-off-by: Tang Junhui 
Reviewed-by: Coly Li 
---
 drivers/md/bcache/sysfs.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index f90f13616980..09295c07ea88 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -615,8 +615,21 @@ STORE(__bch_cache_set)
bch_cache_accounting_clear(&c->accounting);
}
 
-   if (attr == &sysfs_trigger_gc)
+   if (attr == &sysfs_trigger_gc) {
+   /*
+* Garbage collection thread only works when sectors_to_gc < 0,
+* when users write to sysfs entry trigger_gc, most of time
+* they want to forcibly trigger garbage collection. Here -1 is
+* set to c->sectors_to_gc, to make gc_should_run() give a
+* chance to permit gc thread to run. "give a chance" means
+* before going into gc_should_run(), there is still chance
+* that c->sectors_to_gc being set to other positive value. So
+* writing sysfs entry trigger_gc won't always make sure gc
+* thread takes effect.
+*/
+   atomic_set(&c->sectors_to_gc, -1);
wake_up_gc(c);
+   }
 
if (attr == &sysfs_prune_cache) {
struct shrink_control sc;
-- 
2.13.5



[PATCH 00/13] bcache: fixes and update for 4.14

2017-09-06 Thread Coly Li
Hi Jens,

Here are 12 patches with bcache fixes and updates; most of them were
posted by Eric Wheeler during the 4.13 merge window, but were delayed
due to lack of code review.

The following patches are reviewed or acked by peer developers,
0001-bcache-Fix-leak-of-bdev-reference.patch
0002-bcache-fix-sequential-large-write-IO-bypass.patch
0003-bcache-do-not-subtract-sectors_to_gc-for-bypassed-IO.patch
0004-bcache-Don-t-reinvent-the-wheel-but-use-existing-lli.patch
0005-bcache-gc-does-not-work-when-triggering-by-manual-co.patch
0006-bcache-correct-cache_dirty_target-in-__update_writeb.patch
0007-bcache-Correct-return-value-for-sysfs-attach-errors.patch
0008-bcache-increase-the-number-of-open-buckets.patch
0009-bcache-fix-for-gc-and-write-back-race.patch
0010-bcache-silence-static-checker-warning.patch
0011-bcache-Update-continue_at-documentation.patch
0012-bcache-fix-bch_hprint-crash-and-improve-output.patch
You may consider picking them up if no other comments come up,
especially 0012-bcache-fix-bch_hprint-crash-and-improve-output.patch,
which fixes a potentially serious security issue as its author
declares.

I did basic testing on the whole patch set; bcache works and behaves
as expected. I will follow up on further bcache bug reports.

There are some other patches which do not pass my tests currently;
once they are fixed and reviewed by peer developers, I will send them
to you for the 4.14 merge window.

Thanks in advance.

Coly Li