Re: [PATCH V4 02/14] sbitmap: introduce __sbitmap_for_each_set()

2017-09-13 Thread Ming Lei
On Wed, Sep 13, 2017 at 11:37:20AM -0700, Omar Sandoval wrote:
> On Mon, Sep 11, 2017 at 12:08:29PM +0800, Ming Lei wrote:
> > On Sun, Sep 10, 2017 at 10:20:27AM -0700, Omar Sandoval wrote:
> 
> [snip]
> 
> > > What I mean is that you keep the same initialization above, but instead of
> > >   depth += nr
> > > you do
> > >   depth = min_t(unsigned int, word->depth, sb->depth - scanned);
> > > because like I said, the reasoning about why `+= nr` is okay in the
> > > `sb->depth - scanned` case is subtle.
> > > 
> > > And maybe even replace the
> > >   scanned += depth;
> > > with
> > >   scanned += min_t(unsigned int, word->depth - nr,
> > >sb->depth - scanned);
> > > I.e., don't reuse the depth local variable for two different things. I'm
> > > nitpicking here but this code is tricky enough as it is.
> > 
> > It wasn't reused in the old version; that was just to save one local
> > variable and one extra min_t().
> > 
> > Yeah, I admit it isn't clean enough.
> > 
> > > 
> > > For completeness, I mean this exactly:
> > > 
> > >   while (1) {
> > >   struct sbitmap_word *word = &sb->map[index];
> > >   unsigned int depth;
> > > 
> > >   scanned += min_t(unsigned int, word->depth - nr,
> > >sb->depth - scanned);
> > >   if (!word->word)
> > >   goto next;
> > > 
> > >   depth = min_t(unsigned int, word->depth, sb->depth - scanned);
> > 
> > two min_t and a little code duplication.
> 
> They're similar but they represent different things, so I think trying
> to deduplicate this code just makes it more confusing. If performance is
> your concern, I'd be really surprised if there's a noticeable difference.

Not only is there the extra min_t(); it also isn't easy to read the code,
since it is only in the first scan that the amount scanned differs from
'depth'. That is why I do the first 'scanned' update outside of the loop:
then we can update 'scanned' with 'depth' in every iteration, and the
meaning is easy to follow.

> 
> As a side note, I also realized that this code doesn't handle the
> sb->depth == 0 case. We should change the while (1) to
> while (scanned < sb->depth) and remove the
> if (scanned >= sb->depth) break;

In the attached patch, as I recall, the zero-depth case is
addressed by:

if (start >= sb->depth)
return;

which is required anyway since the 'start' parameter is introduced
in this patch.

> 
> > >   off = index << sb->shift;
> > >   while (1) {
> > >   nr = find_next_bit(&word->word, depth, nr);
> > >   if (nr >= depth)
> > >   break;
> > > 
> > >   if (!fn(sb, off + nr, data))
> > >   return;
> > > 
> > >   nr++;
> > >   }
> > > next:
> > >   if (scanned >= sb->depth)
> > >   break;
> > >   nr = 0;
> > >   if (++index >= sb->map_nr)
> > >   index = 0;
> > >   }
> > 
> > The following patch switches to do {} while and handles the
> > first scan outside of the loop; then it should be clean enough
> > (no duplicated min_t()), so how about this one?
> 
> I find this one subtler and harder to follow. The less it looks like the
> typical loop pattern, the longer someone reading the code has to reason
> about it.

Using 'depth' to update 'scanned' looks easier to follow than the two
min_t() calls, since it makes the relation between the two obvious,
which in turn makes the whole function easier to understand.
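For reference, here is how that reads once the reasoning behind
'depth += nr' is spelled out in a comment. This is a sketch only: it
combines the 'scanned += depth' update with Omar's
while (scanned < sb->depth) suggestion, and assumes 'index', 'nr' and
'scanned' are initialized from 'start' as in the patch under discussion:

	while (scanned < sb->depth) {
		struct sbitmap_word *word = &sb->map[index];
		unsigned int off, depth;

		depth = min_t(unsigned int, word->depth - nr,
			      sb->depth - scanned);
		scanned += depth;
		if (!word->word)
			goto next;

		/*
		 * On the first pass 'nr' is the bit offset of 'start'
		 * inside this word, so add it back for find_next_bit();
		 * on every later pass 'nr' is zero and this is a no-op.
		 */
		depth += nr;
		off = index << sb->shift;
		while (1) {
			nr = find_next_bit(&word->word, depth, nr);
			if (nr >= depth)
				break;
			if (!fn(sb, off + nr, data))
				return;
			nr++;
		}
next:
		nr = 0;
		if (++index >= sb->map_nr)
			index = 0;
	}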

-- 
Ming


Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen

2017-09-13 Thread Ming Lei
On Wed, Sep 13, 2017 at 07:07:53PM +, Bart Van Assche wrote:
> On Thu, 2017-09-14 at 01:48 +0800, Ming Lei wrote:
> > No, that patch only changes blk_insert_cloned_request() which is used
> > by dm-rq(mpath) only, nothing to do with the reported issue during
> > suspend and sending SCSI Domain validation.
> 
> There may be other ways to fix the SCSI domain validation code.

Again, the issue isn't in domain validation, it is in quiesce, so we
need to fix quiesce instead of working around transport_spi.

Also, what is the other way? Why not this patchset?

> 
> Regarding the hang during resume: I have not been able to reproduce that
> hang with Jens' latest for-next branch. If that hang would turn out not to
> be reproducible with v4.13-rc1, why to change the behavior of the block
> layer with regard to PREEMPT requests?

Previously you objected to this patchset because you said there was a
race; now you want to switch to another reason?

So I suppose the race you worried about does not exist after all.

PREEMPT is special but widely used. Unlike NVMe, traditional SCSI
doesn't have a queue dedicated to administration only, so RQF_PREEMPT
requests need to be dispatched to the same queue even while normal I/O
is not allowed.

Because this is a common/generic issue with quiesce, we have to avoid
I/O hangs during quiesce. This approach is generic enough to cover all
issues of this kind, and it has no side effect on the normal I/O path
of the block layer.

Can you guarantee the following two points, so that there is no I/O
hang during suspend/resume?

- no normal I/O is triggered during suspend

- all I/O requests in the queue are drained out before
entering suspend

-- 
Ming


Re: [PATCH V2 1/4] kthread: add a mechanism to store cgroup info

2017-09-13 Thread Shaohua Li
On Wed, Sep 13, 2017 at 02:38:20PM -0700, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 13, 2017 at 02:01:26PM -0700, Shaohua Li wrote:
> > diff --git a/kernel/kthread.c b/kernel/kthread.c
> > index 26db528..3107eee 100644
> > --- a/kernel/kthread.c
> > +++ b/kernel/kthread.c
> > @@ -20,7 +20,6 @@
> >  #include 
> >  #include 
> >  #include 
> > -#include 
> >  #include 
> >  
> >  static DEFINE_SPINLOCK(kthread_create_lock);
> > @@ -47,6 +46,7 @@ struct kthread {
> > void *data;
> > struct completion parked;
> > struct completion exited;
> 
> maybe #ifdef CONFIG_CGROUPS?
> 
> > +   struct cgroup_subsys_state *blkcg_css;

Ah, I thought cgroup_subsys_state is always defined, let me double check.

> >  };
> ...
> > +void kthread_associate_blkcg(struct cgroup_subsys_state *css)
> > +{
> > +   struct kthread *kthread;
> > +
> > +   if (!(current->flags & PF_KTHREAD) || !current->set_child_tid)
> > +   return;
> > +   kthread = to_kthread(current);
> > +   if (css) {
> > +   css_get(css);
> > +   kthread->blkcg_css = css;
> > +   } else if (kthread->blkcg_css) {
> > +   css_put(kthread->blkcg_css);
> > +   kthread->blkcg_css = NULL;
> > +   }
> > +}
> > +EXPORT_SYMBOL(kthread_associate_blkcg);
> 
> Maybe doing sth like the following is less error-prone?
> 
> kthread_associate_blkcg(@css)
> {
>   if (current's kthread->blkcg_css)
>   put kthread->blkcg_css and set it to NULL;
>   if (@css)
>   get @css and set kthread->blkcg;
> }

Sounds good.

Thanks,
Shaohua


Re: [PATCH V2 1/4] kthread: add a mechanism to store cgroup info

2017-09-13 Thread Tejun Heo
On Wed, Sep 13, 2017 at 02:38:20PM -0700, Tejun Heo wrote:
> Maybe doing sth like the following is less error-prone?
> 
> kthread_associate_blkcg(@css)
> {
>   if (current's kthread->blkcg_css)
>   put kthread->blkcg_css and set it to NULL;
>   if (@css)
>   get @css and set kthread->blkcg;
> }

Ooh, also, maybe add a WARN_ON_ONCE on non-NULL blkcg_css on kthread
exit?
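
Put together, a minimal sketch of both suggestions (reusing the helpers
and the PF_KTHREAD/set_child_tid check from the posted patch; the
placement of the exit-time check is illustrative):

	/*
	 * Drop any previously stored css first, then take a reference on
	 * the new one (if any), so a non-NULL @css can never silently
	 * leak an old association.
	 */
	void kthread_associate_blkcg(struct cgroup_subsys_state *css)
	{
		struct kthread *kthread;

		if (!(current->flags & PF_KTHREAD) || !current->set_child_tid)
			return;
		kthread = to_kthread(current);

		if (kthread->blkcg_css) {
			css_put(kthread->blkcg_css);
			kthread->blkcg_css = NULL;
		}
		if (css) {
			css_get(css);
			kthread->blkcg_css = css;
		}
	}

	/*
	 * And on the kthread exit path (illustrative placement), warn if
	 * an association was left behind:
	 *
	 *	WARN_ON_ONCE(to_kthread(current)->blkcg_css);
	 */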

Thanks.

-- 
tejun


Re: [PATCH V2 4/4] block/loop: make loop cgroup aware

2017-09-13 Thread Tejun Heo
On Wed, Sep 13, 2017 at 02:01:29PM -0700, Shaohua Li wrote:
> From: Shaohua Li 
> 
> The loop block device handles IO in a separate thread. The actual IO
> dispatched isn't cloned from the IO the loop device received, so the
> dispatched IO loses the cgroup context.
> 
> I'm ignoring the buffered IO case for now, which is quite complicated.
> Making the loop thread aware of the cgroup context doesn't really help
> there. The loop device only writes to a single file, and in the current
> writeback cgroup implementation the file can only belong to one cgroup.
> 
> For the direct IO case, we could work around the issue in theory. For
> example, say we assign the loop device 5M/s of bandwidth in cgroup1 and
> 10M/s in cgroup2. We can create a special cgroup for the loop thread
> and assign at least 15M/s on the underlying disk. In this way, we
> correctly throttle the two cgroups. But this is tricky to set up.
> 
> This patch tries to address the issue. We record the bio's css in the
> loop command. When the loop thread handles the command, we use the API
> provided in patch 1 to set the css for the current task. The bio layer
> will then use that css for new IO (from patch 3).
> 
> Signed-off-by: Shaohua Li 

Acked-by: Tejun Heo 

Thanks.

-- 
tejun
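
For readers following the series, the mechanism described above boils
down to a pattern like the following sketch (the field and helper names
here are illustrative, not necessarily the exact ones in the patch):

	/*
	 * Illustrative only: the bio's css is recorded when the loop
	 * command is queued, and the worker thread temporarily associates
	 * that css with itself while handling the command, so bio_blkcg()
	 * on the dispatched IO resolves to the original cgroup.
	 */
	static void loop_handle_cmd_sketch(struct loop_device *lo,
					   struct loop_cmd *cmd)
	{
		kthread_associate_blkcg(cmd->css);	/* css saved at submit time */

		do_loop_request(lo, cmd);		/* new IO now inherits cmd->css */

		kthread_associate_blkcg(NULL);		/* drop the reference */
	}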


Re: [PATCH V2 3/4] block: make blkcg aware of kthread stored original cgroup info

2017-09-13 Thread Tejun Heo
On Wed, Sep 13, 2017 at 02:01:28PM -0700, Shaohua Li wrote:
> From: Shaohua Li 
> 
> bio_blkcg is currently the only API to get cgroup info for a bio. If
> bio_blkcg finds that the current task is a kthread with an original
> blkcg associated, it will use that css instead of associating the bio
> with the current task. This makes it possible for a kthread to dispatch
> bios on behalf of other threads.
> 
> Signed-off-by: Shaohua Li 

Acked-by: Tejun Heo 

Thanks.

-- 
tejun


Re: [PATCH V2 2/4] blkcg: delete unused APIs

2017-09-13 Thread Tejun Heo
On Wed, Sep 13, 2017 at 02:01:27PM -0700, Shaohua Li wrote:
> From: Shaohua Li 
> 
> Nobody uses the APIs right now.
> 
> Signed-off-by: Shaohua Li 

Acked-by: Tejun Heo 

Thanks.

-- 
tejun


Re: [PATCH V2 1/4] kthread: add a mechanism to store cgroup info

2017-09-13 Thread Tejun Heo
Hello,

On Wed, Sep 13, 2017 at 02:01:26PM -0700, Shaohua Li wrote:
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 26db528..3107eee 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -20,7 +20,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  
>  static DEFINE_SPINLOCK(kthread_create_lock);
> @@ -47,6 +46,7 @@ struct kthread {
>   void *data;
>   struct completion parked;
>   struct completion exited;

maybe #ifdef CONFIG_CGROUPS?

> + struct cgroup_subsys_state *blkcg_css;
>  };
...
> +void kthread_associate_blkcg(struct cgroup_subsys_state *css)
> +{
> + struct kthread *kthread;
> +
> + if (!(current->flags & PF_KTHREAD) || !current->set_child_tid)
> + return;
> + kthread = to_kthread(current);
> + if (css) {
> + css_get(css);
> + kthread->blkcg_css = css;
> + } else if (kthread->blkcg_css) {
> + css_put(kthread->blkcg_css);
> + kthread->blkcg_css = NULL;
> + }
> +}
> +EXPORT_SYMBOL(kthread_associate_blkcg);

Maybe doing sth like the following is less error-prone?

kthread_associate_blkcg(@css)
{
if (current's kthread->blkcg_css)
put kthread->blkcg_css and set it to NULL;
if (@css)
get @css and set kthread->blkcg;
}

Thanks.

-- 
tejun


[PATCH V2 1/4] kthread: add a mechanism to store cgroup info

2017-09-13 Thread Shaohua Li
From: Shaohua Li 

A kthread usually runs jobs on behalf of other threads. The jobs should
be charged to the cgroup of the original threads, but they run in a
kthread, where we lose the cgroup context of the original threads. This
patch adds a mechanism to record the cgroup info of the original threads
in the kthread context, so that we can later retrieve the cgroup info
and attach it to the jobs.

Since this mechanism is only required by kthreads, we store the cgroup
info in the kthread data instead of the generic task_struct.

Signed-off-by: Shaohua Li 
---
 include/linux/kthread.h | 11 +++
 kernel/kthread.c| 44 +++-
 2 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 82e197e..bd4369c 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -3,6 +3,7 @@
 /* Simple interface for creating and stopping kernel threads without mess. */
 #include 
 #include 
+#include 
 
 __printf(4, 5)
 struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
@@ -198,4 +199,14 @@ bool kthread_cancel_delayed_work_sync(struct 
kthread_delayed_work *work);
 
 void kthread_destroy_worker(struct kthread_worker *worker);
 
+#ifdef CONFIG_CGROUPS
+void kthread_associate_blkcg(struct cgroup_subsys_state *css);
+struct cgroup_subsys_state *kthread_blkcg(void);
+#else
+static inline void kthread_associate_blkcg(struct cgroup_subsys_state *css) { }
+static inline struct cgroup_subsys_state *kthread_blkcg(void)
+{
+   return NULL;
+}
+#endif
 #endif /* _LINUX_KTHREAD_H */
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 26db528..3107eee 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -20,7 +20,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 static DEFINE_SPINLOCK(kthread_create_lock);
@@ -47,6 +46,7 @@ struct kthread {
void *data;
struct completion parked;
struct completion exited;
+   struct cgroup_subsys_state *blkcg_css;
 };
 
 enum KTHREAD_BITS {
@@ -1153,3 +1153,45 @@ void kthread_destroy_worker(struct kthread_worker 
*worker)
kfree(worker);
 }
 EXPORT_SYMBOL(kthread_destroy_worker);
+
+#ifdef CONFIG_CGROUPS
+/**
+ * kthread_associate_blkcg - associate blkcg to current kthread
+ * @css: the cgroup info
+ *
+ * The current thread must be a kthread. The thread runs jobs on behalf of
+ * other threads. In some cases, we expect the jobs to attach the cgroup info
+ * of the original threads instead of that of the current thread. This function
+ * stores the original thread's cgroup info in the current kthread context for
+ * later retrieval.
+ */
+void kthread_associate_blkcg(struct cgroup_subsys_state *css)
+{
+   struct kthread *kthread;
+
+   if (!(current->flags & PF_KTHREAD) || !current->set_child_tid)
+   return;
+   kthread = to_kthread(current);
+   if (css) {
+   css_get(css);
+   kthread->blkcg_css = css;
+   } else if (kthread->blkcg_css) {
+   css_put(kthread->blkcg_css);
+   kthread->blkcg_css = NULL;
+   }
+}
+EXPORT_SYMBOL(kthread_associate_blkcg);
+
+/**
+ * kthread_blkcg - get associated blkcg css of current kthread
+ *
+ * Current thread must be a kthread.
+ */
+struct cgroup_subsys_state *kthread_blkcg(void)
+{
+   if ((current->flags & PF_KTHREAD) && current->set_child_tid)
+   return to_kthread(current)->blkcg_css;
+   return NULL;
+}
+EXPORT_SYMBOL(kthread_blkcg);
+#endif
-- 
2.9.5



[PATCH V2 3/4] block: make blkcg aware of kthread stored original cgroup info

2017-09-13 Thread Shaohua Li
From: Shaohua Li 

bio_blkcg is currently the only API to get cgroup info for a bio. If
bio_blkcg finds that the current task is a kthread with an original
blkcg associated, it will use that css instead of associating the bio
with the current task. This makes it possible for a kthread to dispatch
bios on behalf of other threads.

Signed-off-by: Shaohua Li 
---
 include/linux/blk-cgroup.h | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 0cfa8d2..f57e54d 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* percpu_counter batch for blkg_[rw]stats, per-cpu drift doesn't matter */
 #define BLKG_STAT_CPU_BATCH(INT_MAX / 2)
@@ -223,16 +224,16 @@ static inline struct blkcg *css_to_blkcg(struct 
cgroup_subsys_state *css)
return css ? container_of(css, struct blkcg, css) : NULL;
 }
 
-static inline struct blkcg *task_blkcg(struct task_struct *tsk)
-{
-   return css_to_blkcg(task_css(tsk, io_cgrp_id));
-}
-
 static inline struct blkcg *bio_blkcg(struct bio *bio)
 {
+   struct cgroup_subsys_state *css;
+
if (bio && bio->bi_css)
return css_to_blkcg(bio->bi_css);
-   return task_blkcg(current);
+   css = kthread_blkcg();
+   if (css)
+   return css_to_blkcg(css);
+   return css_to_blkcg(task_css(current, io_cgrp_id));
 }
 
 /**
-- 
2.9.5



Re: [PATCH V2 00/12] scsi-mq support for ZBC disks

2017-09-13 Thread Christoph Hellwig
On Tue, Sep 12, 2017 at 05:38:05PM +0900, Damien Le Moal wrote:
> struct blk_zoned {
>   unsigned int nr_zones;
>   unsigned long *seq_zones;
> };
> 
> struct request_queue {
>   ...
> #ifdef CONFIG_BLK_DEV_ZONED
>   struct blk_zoned zoned;
> #endif
>   ...

Do we even need a new structure for it?  I'd just hang it off
the gendisk directly.

Except for that the idea looks fine to me.
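
I.e. something along these lines (a sketch of the suggestion only, not
a final layout):

	struct gendisk {
		...
	#ifdef CONFIG_BLK_DEV_ZONED
		unsigned int	nr_zones;	/* number of zones */
		unsigned long	*seq_zones;	/* bitmap of sequential-write-required zones */
	#endif
		...
	};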


Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen

2017-09-13 Thread Bart Van Assche
On Thu, 2017-09-14 at 01:48 +0800, Ming Lei wrote:
> No, that patch only changes blk_insert_cloned_request() which is used
> by dm-rq(mpath) only, nothing to do with the reported issue during
> suspend and sending SCSI Domain validation.

There may be other ways to fix the SCSI domain validation code.

Regarding the hang during resume: I have not been able to reproduce that
hang with Jens' latest for-next branch. If that hang would turn out not to
be reproducible with v4.13-rc1, why to change the behavior of the block
layer with regard to PREEMPT requests?

Bart.

Re: [PATCH V4 02/14] sbitmap: introduce __sbitmap_for_each_set()

2017-09-13 Thread Omar Sandoval
On Mon, Sep 11, 2017 at 12:08:29PM +0800, Ming Lei wrote:
> On Sun, Sep 10, 2017 at 10:20:27AM -0700, Omar Sandoval wrote:

[snip]

> > What I mean is that you keep the same initialization above, but instead of
> > depth += nr
> > you do
> > depth = min_t(unsigned int, word->depth, sb->depth - scanned);
> > because like I said, the reasoning about why `+= nr` is okay in the
> > `sb->depth - scanned` case is subtle.
> > 
> > And maybe even replace the
> > scanned += depth;
> > with
> > scanned += min_t(unsigned int, word->depth - nr,
> >  sb->depth - scanned);
> > I.e., don't reuse the depth local variable for two different things. I'm
> > nitpicking here but this code is tricky enough as it is.
> 
> It wasn't reused in the old version; that was just to save one local
> variable and one extra min_t().
> 
> Yeah, I admit it isn't clean enough.
> 
> > 
> > For completeness, I mean this exactly:
> > 
> > while (1) {
> > struct sbitmap_word *word = &sb->map[index];
> > unsigned int depth;
> > 
> > scanned += min_t(unsigned int, word->depth - nr,
> >  sb->depth - scanned);
> > if (!word->word)
> > goto next;
> > 
> > depth = min_t(unsigned int, word->depth, sb->depth - scanned);
> 
> two min_t and a little code duplication.

They're similar but they represent different things, so I think trying
to deduplicate this code just makes it more confusing. If performance is
your concern, I'd be really surprised if there's a noticeable difference.

As a side note, I also realized that this code doesn't handle the
sb->depth == 0 case. We should change the while (1) to
while (scanned < sb->depth) and remove the
if (scanned >= sb->depth) break;

> > off = index << sb->shift;
> > while (1) {
> > nr = find_next_bit(&word->word, depth, nr);
> > if (nr >= depth)
> > break;
> > 
> > if (!fn(sb, off + nr, data))
> > return;
> > 
> > nr++;
> > }
> > next:
> > if (scanned >= sb->depth)
> > break;
> > nr = 0;
> > if (++index >= sb->map_nr)
> > index = 0;
> > }
> 
> The following patch switches to do {} while and handles the
> first scan outside of the loop; then it should be clean enough
> (no duplicated min_t()), so how about this one?

I find this one subtler and harder to follow. The less it looks like the
typical loop pattern, the longer someone reading the code has to reason
about it.


Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen

2017-09-13 Thread Ming Lei
On Wed, Sep 13, 2017 at 05:28:24PM +, Bart Van Assche wrote:
> On Thu, 2017-09-14 at 00:48 +0800, Ming Lei wrote:
> > Could you please let me know if your concern about race between
> > preempt freeze and blk_cleanup_queue() is addressed in my last
> > reply?
> 
> Shouldn't we wait until v4.13-rc1 has been released before spending more
> energy on this? Certain patches that have been sent to Linus, e.g. commit

The discussion can happen at any time, right?

I believe I have replied to all of your comments; if you don't agree
with any of my follow-ups, please point it out.

> 157f377beb71 ("block: directly insert blk-mq request from
> blk_insert_cloned_request()"), may affect the test results. I have not

No, that patch only changes blk_insert_cloned_request() which is used
by dm-rq(mpath) only, nothing to do with the reported issue during
suspend and sending SCSI Domain validation.


-- 
Ming


Re: [PATCH V4 05/14] blk-mq-sched: improve dispatching from sw queue

2017-09-13 Thread Omar Sandoval
On Mon, Sep 11, 2017 at 12:13:09PM +0800, Ming Lei wrote:
> On Sun, Sep 10, 2017 at 10:38:33AM -0700, Omar Sandoval wrote:
> > On Sun, Sep 10, 2017 at 12:45:15PM +0800, Ming Lei wrote:
> > > On Fri, Sep 08, 2017 at 04:54:39PM -0700, Omar Sandoval wrote:
> > > > On Sat, Sep 02, 2017 at 11:17:20PM +0800, Ming Lei wrote:
> > > > > SCSI devices use a host-wide tagset, and the shared
> > > > > driver tag space is often quite big. Meanwhile there
> > > > > is also a queue depth for each LUN (.cmd_per_lun),
> > > > > which is often small.
> > > > > 
> > > > > So lots of requests may stay in the sw queue, and we
> > > > > always flush all of those belonging to the same hw queue
> > > > > and dispatch them all to the driver; unfortunately this
> > > > > easily makes the queue busy because of the small
> > > > > per-LUN queue depth. Once these requests are flushed
> > > > > out, they have to stay in hctx->dispatch, no bio
> > > > > merging can take part in these requests, and
> > > > > sequential IO performance is hurt.
> > > > > 
> > > > > This patch improves dispatching from the sw queue when
> > > > > there is a per-request-queue queue depth, by taking
> > > > > requests one by one from the sw queue, just like an
> > > > > IO scheduler does.
> > > > > 
> > > > > Reviewed-by: Bart Van Assche 
> > > > > Signed-off-by: Ming Lei 
> > > > > ---
> > > > >  block/blk-mq-sched.c   | 61 
> > > > > +-
> > > > >  include/linux/blk-mq.h |  2 ++
> > > > >  2 files changed, 57 insertions(+), 6 deletions(-)
> > > > > 
> > > > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > > > index f69752961a34..735e432294ab 100644
> > > > > --- a/block/blk-mq-sched.c
> > > > > +++ b/block/blk-mq-sched.c
> > > > > @@ -89,9 +89,9 @@ static bool blk_mq_sched_restart_hctx(struct 
> > > > > blk_mq_hw_ctx *hctx)
> > > > >   return false;
> > > > >  }
> > > > >  
> > > > > -static void blk_mq_do_dispatch(struct request_queue *q,
> > > > > -struct elevator_queue *e,
> > > > > -struct blk_mq_hw_ctx *hctx)
> > > > > +static void blk_mq_do_dispatch_sched(struct request_queue *q,
> > > > > +  struct elevator_queue *e,
> > > > > +  struct blk_mq_hw_ctx *hctx)
> > > > >  {
> > > > >   LIST_HEAD(rq_list);
> > > > >  
> > > > > @@ -105,6 +105,42 @@ static void blk_mq_do_dispatch(struct 
> > > > > request_queue *q,
> > > > >   } while (blk_mq_dispatch_rq_list(q, &rq_list));
> > > > >  }
> > > > >  
> > > > > +static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
> > > > > +   struct blk_mq_ctx *ctx)
> > > > > +{
> > > > > + unsigned idx = ctx->index_hw;
> > > > > +
> > > > > + if (++idx == hctx->nr_ctx)
> > > > > + idx = 0;
> > > > > +
> > > > > + return hctx->ctxs[idx];
> > > > > +}
> > > > > +
> > > > > +static void blk_mq_do_dispatch_ctx(struct request_queue *q,
> > > > > +struct blk_mq_hw_ctx *hctx)
> > > > > +{
> > > > > + LIST_HEAD(rq_list);
> > > > > + struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
> > > > > + bool dispatched;
> > > > > +
> > > > > + do {
> > > > > + struct request *rq;
> > > > > +
> > > > > + rq = blk_mq_dispatch_rq_from_ctx(hctx, ctx);
> > > > > + if (!rq)
> > > > > + break;
> > > > > + list_add(&rq->queuelist, &rq_list);
> > > > > +
> > > > > + /* round robin for fair dispatch */
> > > > > + ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
> > > > 
> > > > Hm... this next ctx will get skipped if the dispatch on the previous ctx
> > > > fails, since we call blk_mq_next_ctx() again. Seems unfair. Maybe move
> > > > the blk_mq_next_ctx() from the if (!dispatched) below into the if (!rq)
> > > > above?
> > > 
> > > In the case of if (!rq), there isn't any request in any of the ctxs
> > > belonging to this hctx, so it is reasonable to start the dispatch from
> > > any one of these ctxs next time, including the next one.
> > 
> > Yep, that case is okay.
> > 
> > > If dispatch fails on the previous ctx, the rq from that ctx will be
> > > put into ->dispatch, so it is fair to start dispatching from the next
> > > ctx next time too.
> > 
> > I'm talking about this case
> > 
> > LIST_HEAD(rq_list);
> > struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
> > bool dispatched;
> >  
> > /*
> >  * Let's say that ctxs 0, 1, and 2 all have requests pending and
> >  * hctx->dispatch_from was ctx0, so ctx is ctx0 when we start.
> >  */
> > do {
> > struct request *rq;
> >  
> > rq = blk_mq_dispatch_rq_from_ctx(hctx, ctx);
> > if (!rq)
> > break;
> > list_add(&rq->queuelist, &rq_list);
> > 
> > /* Now rq is a request from ctx0 */
> > 
> >  

Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen

2017-09-13 Thread Bart Van Assche
On Thu, 2017-09-14 at 00:48 +0800, Ming Lei wrote:
> Could you please let me know if your concern about race between
> preempt freeze and blk_cleanup_queue() is addressed in my last
> reply?

Shouldn't we wait until v4.13-rc1 has been released before spending more
energy on this? Certain patches that have been sent to Linus, e.g. commit
157f377beb71 ("block: directly insert blk-mq request from
blk_insert_cloned_request()"), may affect the test results. I have not
been able to reproduce the hang during resume with Jens' latest for-next
branch.

Bart.

Re: [PATCH] brd: fix overflow in __brd_direct_access

2017-09-13 Thread Jens Axboe
On 09/13/2017 07:17 AM, Mikulas Patocka wrote:
> The code in __brd_direct_access multiplies the pgoff variable by page size
> and divides it by 512. It can cause overflow on 32-bit architectures. The
> overflow happens if we create ramdisk larger than 4G and use it as a
> sparse device.
> 
> This patch replaces multiplication and division with multiplication by the
> number of sectors per page.

Thanks, applied with Dan's ack.

-- 
Jens Axboe



Re: [BUG] xfs/104 triggered NULL pointer dereference in iomap based dio

2017-09-13 Thread Christoph Hellwig
Does it work fine if you call sb_init_dio_done_wq unconditionally?
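
I.e. roughly this near the top of iomap_dio_rw(), before any bio is
submitted (a sketch of the suggestion; the error-path label is
illustrative):

	/*
	 * Sketch: set up the sb's dio completion workqueue up front, so
	 * iomap_dio_bio_end_io() can never observe a NULL s_dio_done_wq.
	 */
	ret = sb_init_dio_done_wq(inode->i_sb);
	if (ret < 0)
		goto out_free_dio;	/* label name is illustrative */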


Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen

2017-09-13 Thread Ming Lei
On Tue, Sep 12, 2017 at 11:40:57AM +0800, Ming Lei wrote:
> On Mon, Sep 11, 2017 at 04:03:55PM +, Bart Van Assche wrote:
> > On Mon, 2017-09-11 at 19:10 +0800, Ming Lei wrote:
> > > @@ -787,6 +787,35 @@ int blk_queue_enter(struct request_queue *q, 
> > > unsigned flags)
> > >   if (percpu_ref_tryget_live(&q->q_usage_counter))
> > >   return 0;
> > >  
> > > + /*
> > > +  * If queue is preempt frozen and caller need to allocate
> > > +  * request for RQF_PREEMPT, we grab the .q_usage_counter
> > > +  * unconditionally and return successfully.
> > > +  *
> > > +  * There isn't race with queue cleanup because:
> > > +  *
> > > +  * 1) it is guaranteed that preempt freeze can't be
> > > +  * started after queue is set as dying
> > > +  *
> > > +  * 2) normal freeze runs exclusively with preempt
> > > +  * freeze, so even after queue is set as dying
> > > +  * afterwards, blk_queue_cleanup() won't move on
> > > +  * until preempt freeze is done
> > > +  *
> > > +  * 3) blk_queue_dying() needn't to be checked here
> > > +  *  - for legacy path, it will be checked in
> > > +  *  __get_request()
> > 
> > For the legacy block layer core, what do you think will happen if the
> > "dying" state is set by another thread after __get_request() has passed the
> > blk_queue_dying() check?
> 
> Without this patchset, the block core still needs to handle the above
> situation, so your question isn't related to this patchset.
> 
> Also, q->queue_lock is required both for setting dying and for checking
> dying in __get_request(). But the lock can be released in
> __get_request(), so it is possible to allocate a request after the
> queue is set as dying, and for legacy the request can be dispatched to
> a dying queue too.
> 
> > 
> > > +  *  - blk-mq depends on driver to handle dying well
> > > +  *  because it is normal for queue to be set as dying
> > > +  *  just between blk_queue_enter() and allocating new
> > > +  *  request.
> > 
> > The above comment is not correct. The block layer core handles the "dying"
> > state. Block drivers other than dm-mpath should not have to query this state
> > directly.
> 
> If blk-mq doesn't query the dying state, how does it know the queue is
> dying and handle that state? Also, blk-mq is no different from legacy
> wrt. depending on the driver to handle dying.
> 
> > 
> > > +  */
> > > + if ((flags & BLK_REQ_PREEMPT) &&
> > > + blk_queue_is_preempt_frozen(q)) {
> > > + blk_queue_enter_live(q);
> > > + return 0;
> > > + }
> > > +
> > 
> > Sorry but to me it looks like the above code introduces a race condition
> > between blk_queue_cleanup() and blk_get_request() for at least blk-mq.
> > Consider e.g. the following scenario:
> > * A first thread preempt-freezes a request queue.
> > * A second thread calls blk_get_request() with BLK_REQ_PREEMPT set. That
> >   results in a call of blk_queue_is_preempt_frozen().
> > * A context switch occurs to the first thread.
> > * The first thread preempt-unfreezes the same request queue and calls
> >   blk_queue_cleanup(). That last function changes the request queue state
> >   into DYING and waits until all pending requests have finished.
> > * The second thread continues and calls blk_queue_enter_live(), allocates
> >   a request and submits it.
> 
> OK, that looks like a race I didn't think of, but it can be fixed
> easily by calling blk_queue_enter_live() while holding q->freeze_lock,
> and it won't cause a performance issue either since it is in the slow
> path.
> 
> For example, we can introduce the following code in blk_queue_enter():
> 
>   if ((flags & BLK_REQ_PREEMPT) &&
>   blk_queue_enter_preempt_freeze(q))
>   return 0;
> 
> static inline bool blk_queue_enter_preempt_freeze(struct request_queue *q)
> {
>   bool preempt_frozen;
> 
>   spin_lock(&q->freeze_lock);
>   preempt_frozen = q->preempt_freezing && !q->preempt_unfreezing;
>   if (preempt_frozen)
>   blk_queue_enter_live(q);
>   spin_unlock(&q->freeze_lock);
> 
>   return preempt_frozen;
> }
> 
> > 
> > In other words, a request gets submitted against a dying queue. This must
> > not happen. See also my explanation of queue shutdown from a few days ago
> 
> That is not correct; think about why queue dead is checked in
> __blk_run_queue_uncond() instead of queue dying. We still need to
> submit requests to the driver while the queue is dying, and the driver
> knows how to handle that.
> 
> > (https://marc.info/?l=linux-block=150449845831789).
> 
> > from (https://marc.info/?l=linux-block=150449845831789).
> >> Do you understand how request queue cleanup works? The algorithm used for
> >> request queue cleanup is as follows:
> >> * Set the DYING flag. This flag 

Re: [PATCH] blk-mq: don't allocate driver tag beforehand for flush rq

2017-09-13 Thread Ming Lei
On Wed, Sep 13, 2017 at 09:32:30PM +0800, Ming Lei wrote:
> The idea behind this is simple:
> 
> 1) for the none scheduler, the driver tag has to be borrowed for the
> flush rq, otherwise we may run out of tags and cause an IO hang.
> get/put of the driver tag is actually a nop, so reordering tags isn't
> necessary at all.
> 
> 2) for a real I/O scheduler, we don't need to allocate a driver tag
> beforehand for the flush rq, and it works just fine to follow the
> same way as normal requests: allocate the driver tag for each rq just
> before calling .queue_rq().
> 
> Then the flush rq is no longer treated specially wrt. get/put of the
> driver tag, and the code gets cleaned up a lot; for example,
> reorder_tags_to_front() is removed, and we no longer need to worry
> about request order in the dispatch list.
> 
> One visible change to drivers is that the flush rq's tag may not be
> the same as that of the data rq in the flush sequence; that isn't a
> problem, since we always do that in the legacy path.

Please ignore this patch, since it looks like there is an issue.

-- 
Ming


Re: [BUG] xfs/104 triggered NULL pointer dereference in iomap based dio

2017-09-13 Thread Chandan Rajendra
On Wednesday, September 13, 2017 4:28:23 PM IST Eryu Guan wrote:
> Hi all,
> 
> Recently I noticed multiple crashes triggered by xfs/104 on ppc64 hosts
> in my upstream 4.13 kernel testings. The block layer is involved in the
> call trace so I add linux-block to cc list too. I append the full
> console log to the end of this mail.
> 
> Now I can reproduce the crash on x86_64 hosts too by running xfs/104
> many times (usually within 100 iterations). A git-bisect run (I ran it
> for 500 iterations before calling it good in bisect run to be safe)
> pointed the first bad commit to commit acdda3aae146 ("xfs: use
> iomap_dio_rw").
> 
> I confirmed the bisect result by checking out a new branch with commit
> acdda3aae146 as HEAD, xfs/104 would crash kernel within 100 iterations,
> then reverting HEAD, xfs/104 passed 1500 iterations.

I am able to recreate the issue on my ppc64 guest. I added some printk()
statements and got this,

xfs_fs_fill_super:1670: Filled up sb c006344db800.
iomap_dio_bio_end_io:784: sb = c006344db800; inode->i_sb->s_dio_done_wq =
(null), &dio->aio.work = c006344bb5b0.


In iomap_dio_rw(), I had added the following printk() statement,

	ret = sb_init_dio_done_wq(inode->i_sb);
	if (ret < 0)
		iomap_dio_set_error(dio, ret);
	printk("%s:%d: sb = %p; Created s_dio_done_wq.\n",
	       __func__, __LINE__, inode->i_sb);

In the crash case, I don't see the above message being printed.

Btw, I am unable to recreate this issue on today's linux-next. Maybe
that is because the race condition is accidentally masked out.

I will continue debugging this and provide an update.

> 
> On one of my test vms, the crash happened as
> 
> [  340.419429] BUG: unable to handle kernel NULL pointer dereference at 
> 0102
> [  340.420408] IP: __queue_work+0x32/0x420
> 
> and that IP points to
> 
> (gdb) l *(__queue_work+0x32)
> 0x9cf32 is in __queue_work (kernel/workqueue.c:1383).
> 1378WARN_ON_ONCE(!irqs_disabled());
> 1379
> 1380debug_work_activate(work);
> 1381
> 1382/* if draining, only works from the same workqueue are 
> allowed */
> 1383if (unlikely(wq->flags & __WQ_DRAINING) &&
> 1384WARN_ON_ONCE(!is_chained_work(wq)))
> 1385return;
> 1386retry:
> 1387if (req_cpu == WORK_CPU_UNBOUND)
> 
> So looks like "wq" is null. The test vm is a kvm guest running 4.13
> kernel with 4 vcpus and 8G memory.
> 
> If more information is needed please let me know.
> 
> Thanks,
> Eryu
> 
> P.S. console log when crashing
> 
> [  339.746983] run fstests xfs/104 at 2017-09-13 17:38:26
> [  340.027352] XFS (vda6): Unmounting Filesystem
> [  340.207107] XFS (vda6): Mounting V5 Filesystem
> [  340.217553] XFS (vda6): Ending clean mount
> [  340.419429] BUG: unable to handle kernel NULL pointer dereference at 
> 0102
> [  340.420408] IP: __queue_work+0x32/0x420
> [  340.420408] PGD 215373067
> [  340.420408] P4D 215373067
> [  340.420408] PUD 21210d067
> [  340.420408] PMD 0
> [  340.420408]
> [  340.420408] Oops:  [#1] SMP
> [  340.420408] Modules linked in: xfs ip6t_rpfilter ipt_REJECT nf_reject_ipv4 
> nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 
> nf_defrag_ipv6 xt_conntrack nf_conntrack libcrc32c ip_set nfnetlink 
> ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security 
> ip6table_raw iptable_mangle iptable_security iptable_raw ebtable_filter 
> ebtables ip6table_filter ip6_tables iptable_filter btrfs xor raid6_pq ppdev 
> i2c_piix4 parport_pc i2c_core parport virtio_balloon pcspkr nfsd auth_rpcgss 
> nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi 
> virtio_net virtio_blk ata_piix libata virtio_pci virtio_ring serio_raw virtio 
> floppy
> [  340.420408] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.13.0 #64
> [  340.420408] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
> [  340.420408] task: 8b1d96222500 task.stack: b06bc0cb8000
> [  340.420408] RIP: 0010:__queue_work+0x32/0x420
> [  340.420408] RSP: 0018:8b1d9fd83d18 EFLAGS: 00010046
> [  340.420408] RAX: 0096 RBX: 0002 RCX: 
> 8b1d9489e6d8
> [  340.420408] RDX: 8b1d903c2090 RSI:  RDI: 
> 2000
> [  340.420408] RBP: 8b1d9fd83d58 R08: 0400 R09: 
> 0009
> [  340.420408] R10: 8b1d9532b400 R11:  R12: 
> 2000
> [  340.420408] R13:  R14: 8b1d903c2090 R15: 
> 7800
> [  340.420408] FS:  () GS:8b1d9fd8() 
> knlGS:
> [  340.420408] CS:  0010 DS:  ES:  CR0: 80050033
> [  340.420408] CR2: 0102 CR3: 0002152ce000 CR4: 
> 06e0
> [  340.420408] Call Trace:
> [  340.420408]  
> [  340.420408]  ? __slab_free+0x8e/0x260
> [  

Re: [PATCH] brd: fix overflow in __brd_direct_access

2017-09-13 Thread Dan Williams
On Wed, Sep 13, 2017 at 6:17 AM, Mikulas Patocka  wrote:
> The code in __brd_direct_access multiplies the pgoff variable by page size
> and divides it by 512. It can cause overflow on 32-bit architectures. The
> overflow happens if we create ramdisk larger than 4G and use it as a
> sparse device.
>
> This patch replaces multiplication and division with multiplication by the
> number of sectors per page.
>
> Signed-off-by: Mikulas Patocka 
> Fixes: 1647b9b959c7 ("brd: add dax_operations support")
> Cc: sta...@vger.kernel.org  # 4.12+
>
> ---
>  drivers/block/brd.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-4.13/drivers/block/brd.c
> ===
> --- linux-4.13.orig/drivers/block/brd.c
> +++ linux-4.13/drivers/block/brd.c
> @@ -339,7 +339,7 @@ static long __brd_direct_access(struct b
>
> if (!brd)
> return -ENODEV;
> -   page = brd_insert_page(brd, PFN_PHYS(pgoff) / 512);
> +   page = brd_insert_page(brd, (sector_t)pgoff << PAGE_SECTORS_SHIFT);

Looks good to me, you can add:

Reviewed-by: Dan Williams 
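
To spell out the overflow the changelog describes, a small illustration
(not kernel code; assuming 4K pages, so PAGE_SHIFT == 12 and
PAGE_SECTORS_SHIFT == 3, and a 32-bit phys_addr_t as on a non-PAE
32-bit build):

	unsigned long pgoff = 0x100001;	/* page index just past 4 GiB */

	/*
	 * old: the 32-bit intermediate wraps:
	 * 0x100001 << 12 = 0x100001000 truncates to 0x1000,
	 * and 0x1000 / 512 = 8 -- the wrong sector.
	 */
	u32 bad = ((u32)(pgoff << 12)) / 512;

	/*
	 * new: widen to 64 bits first, then convert pages straight to
	 * sectors: (u64)0x100001 << 3 = 0x800008 -- the correct sector.
	 */
	u64 good = (u64)pgoff << 3;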


[PATCH] Support for secure erase functionality

2017-09-13 Thread Philipp Guendisch
This patch adds a software-based secure erase option to improve data
confidentiality. The CONFIG_BLK_DEV_SECURE_ERASE option enables a mount
flag called 'sw_secure_erase'. When you mount a volume with this flag,
every discard call is preceded by an explicit write command that
overwrites the data before it is discarded. A volume without discard
support can be used as well: discard calls are enabled for such a device
and then suppressed after the write call is made.

Built against torvalds/linux

Signed-off-by: Philipp Guendisch 
Signed-off-by: Mate Horvath 
---
 block/Kconfig  | 14 
 block/blk-lib.c| 86 +-
 fs/super.c | 59 ++
 include/linux/blkdev.h | 14 
 4 files changed, 172 insertions(+), 1 deletion(-)

diff --git a/block/Kconfig b/block/Kconfig
index 3ab42bb..438da83 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -103,6 +103,20 @@ config BLK_DEV_ZONED
 
Say yes here if you have a ZAC or ZBC storage device.
 
+config BLK_DEV_SECURE_ERASE
+   bool "Block layer secure erase support (EXPERIMENTAL)"
+   default n
+   ---help---
+   With this option set, every discard operation will be prepended by
+   a write operation which overwrites the data with random values.
+   Use this option for a secure deletion of data.
+
+   WARNING:
+   Due to unpredictable circumstances we cannot guarantee you that your
+   data will be irrecoverably deleted in every case.
+   This option also increases the amount of data written to block
+   devices which may reduce their lifetime.
+
 config BLK_DEV_THROTTLING
bool "Block layer bio throttling support"
depends on BLK_CGROUP=y
diff --git a/block/blk-lib.c b/block/blk-lib.c
index e01adb5..949a666 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -6,6 +6,9 @@
 #include 
 #include 
 #include 
+#ifdef CONFIG_BLK_DEV_SECURE_ERASE
+#include 
+#endif
 
 #include "blk.h"
 
@@ -22,6 +25,60 @@ static struct bio *next_bio(struct bio *bio, unsigned int 
nr_pages,
return new;
 }
 
+/*
+ * __blkdev_secure_erase - erase data queued for discard
+ * @bdev:  blockdev to issue discard for
+ * @sector:start sector
+ * @nr_sects:  number of sectors to discard
+ * @gfp_mask:  memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *Overwrites sectors issued to discard with random data before discarding
+ */
+#ifdef CONFIG_BLK_DEV_SECURE_ERASE
+static unsigned int __blkdev_sectors_to_bio_pages(sector_t nr_sects);
+static void __blkdev_secure_erase(struct block_device *bdev, sector_t sector,
+ sector_t nr_sects, gfp_t gfp_mask,
+ struct bio **biop)
+{
+   struct bio *bio = *biop;
+   int bi_size = 0;
+   static struct page *datapage;
+   void *page_cont;
+   static unsigned int count = 1;
+   unsigned int sz;
+
+   if (unlikely(!datapage))
+   datapage = alloc_page(GFP_NOIO);
+
+   if (unlikely(count % 64 == 1)) {
+   page_cont = kmap(datapage);
+   get_random_bytes(page_cont, PAGE_SIZE);
+   kunmap(datapage);
+   }
+   count++;
+
+   while (nr_sects != 0) {
+   bio = next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects),
+  gfp_mask);
+   bio->bi_iter.bi_sector = sector;
+   bio_set_dev(bio, bdev);
+   bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+   bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
+
+   while (nr_sects != 0) {
+   sz = min((sector_t) PAGE_SIZE, nr_sects << 9);
+   bi_size = bio_add_page(bio, datapage, sz, 0);
+   nr_sects -= bi_size >> 9;
+   sector += bi_size >> 9;
+   if (bi_size < sz)
+   break;
+   }
+   cond_resched();
+   }
+}
+#endif
+
 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, int flags,
struct bio **biop)
@@ -29,13 +86,14 @@ int __blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
struct request_queue *q = bdev_get_queue(bdev);
struct bio *bio = *biop;
unsigned int granularity;
-   unsigned int op;
+   unsigned int op = REQ_OP_DISCARD;
int alignment;
sector_t bs_mask;
 
if (!q)
return -ENXIO;
 
+#ifndef CONFIG_BLK_DEV_SECURE_ERASE
if (flags & BLKDEV_DISCARD_SECURE) {
if (!blk_queue_secure_erase(q))
return -EOPNOTSUPP;
@@ -45,6 +103,7 @@ int __blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
return -EOPNOTSUPP;
 

Re: [PATCH 01/10] Initial multipath implementation.

2017-09-13 Thread Keith Busch
On Tue, Sep 12, 2017 at 10:48:22PM -0700, Anish Jhaveri wrote:
> On Tue, Sep 12, 2017 at 12:00:44PM -0400, Keith Busch wrote:
> > 
> > I find this patch series confusing to review. You declare these failover
> > functions in patch 1, use them in patch 2, but they're not defined until
> > patch 7.
> Sorry for the late reply.
> 
> The idea was to keep the header file changes as a separate patch.
> I will move the function declarations to patch 7 and
> rearrange the patch series. If you or anyone else finds
> something that could help in browsing the changes, I
> will try to incorporate it in the next patchset.

At the very least, you don't want any patch in the series to depend
on a future patch just to compile. There are multiple breakages like
that in your series. For example, patch 5 starts a kthread at function
nvme_mpath_kthread, but that function doesn't exist until patch 8.


[GIT PULL] Follow up block fixes for 4.14-rc1

2017-09-13 Thread Jens Axboe
Hi Linus,

Small collection of fixes that would be nice to have in -rc1. This pull
request contains:

- NVMe pull request from Christoph, mostly with fixes for nvme-pci,
  host memory buffer in particular.

- Error handling fixup for cgwb_create(), in case allocation of 'wb'
  fails. From Christophe JAILLET.

- Ensure that trace_block_getrq() gets the 'dev' in an appropriate
  fashion, to avoid a potential NULL deref. From Greg Thelen.

- Regression fix for dm-mq with blk-mq, fixing a problem with stacking
  IO schedulers. From me.

- string.h fixup, fixing an issue with memcpy_and_pad(). This original
  change came in through an NVMe dependency, which is why I'm including
  it here. From Martin Wilck.

- Fix potential int overflow in __blkdev_sectors_to_bio_pages(), from
  Mikulas.

- MBR enable fix for sed-opal, from Scott.

Please pull!


  git://git.kernel.dk/linux-block.git for-linus



Akinobu Mita (1):
  nvme-pci: use appropriate initial chunk size for HMB allocation

Christoph Hellwig (4):
  nvme: fix lightnvm check
  nvme-pci: fix host memory buffer allocation fallback
  nvme-pci: propagate (some) errors from host memory buffer setup
  nvme-pci: implement the HMB entry number and size limitations

Christophe JAILLET (1):
  mm/backing-dev.c: fix an error handling path in 'cgwb_create()'

Greg Thelen (1):
  block: tolerate tracing of NULL bio

Jens Axboe (2):
  Merge branch 'nvme-4.14' of git://git.infradead.org/nvme into for-linus
  block: directly insert blk-mq request from blk_insert_cloned_request()

Martin Wilck (1):
  string.h: un-fortify memcpy_and_pad

Mikulas Patocka (1):
  block: fix integer overflow in __blkdev_sectors_to_bio_pages()

Scott Bauer (1):
  block: sed-opal: Set MBRDone on S3 resume path if TPER is MBREnabled

 block/blk-core.c |  7 -
 block/blk-lib.c  |  4 +--
 block/blk-mq.c   | 16 ++
 block/blk-mq.h   |  1 +
 block/opal_proto.h   |  1 +
 block/sed-opal.c | 32 +++
 drivers/nvme/host/core.c | 11 ---
 drivers/nvme/host/lightnvm.c | 26 
 drivers/nvme/host/nvme.h | 13 +---
 drivers/nvme/host/pci.c  | 74 ++--
 include/linux/nvme.h |  4 ++-
 include/linux/string.h   | 15 ++---
 include/trace/events/block.h |  5 ++-
 mm/backing-dev.c |  6 ++--
 14 files changed, 134 insertions(+), 81 deletions(-)

-- 
Jens Axboe



[PATCH] blk-mq: don't allocate driver tag beforehand for flush rq

2017-09-13 Thread Ming Lei
The idea behind this is simple:

1) for the none scheduler, the driver tag has to be borrowed for the
flush rq, otherwise we may run out of tags and cause an IO hang.
get/put of the driver tag is actually a nop, so reordering tags isn't
necessary at all.

2) for a real I/O scheduler, we don't need to allocate a driver tag
beforehand for the flush rq, and it works just fine to follow the
same way as normal requests: allocate the driver tag for each rq just
before calling .queue_rq().

Then the flush rq is no longer treated specially wrt. get/put of the
driver tag, and the code gets cleaned up a lot; for example,
reorder_tags_to_front() is removed, and we no longer need to worry
about request order in the dispatch list.

One visible change to drivers is that the flush rq's tag may not be
the same as that of the data rq in the flush sequence; that isn't a
problem, since we always do that in the legacy path.

Signed-off-by: Ming Lei 
---
 block/blk-flush.c| 43 ++-
 block/blk-mq-sched.c | 64 
 block/blk-mq.c   | 38 ---
 3 files changed, 56 insertions(+), 89 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 4938bec8cfef..080f778257cf 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -216,6 +216,17 @@ static bool blk_flush_complete_seq(struct request *rq,
return kicked | queued;
 }
 
+/*
+ * We don't share tag between flush rq and data rq in case of
+ * IO scheduler, so have to release tag and restart queue
+ */
+static void put_flush_driver_tag(struct blk_mq_hw_ctx *hctx,
+struct blk_mq_ctx *ctx, int tag)
+{
+   blk_mq_put_tag(hctx, hctx->tags, ctx, tag);
+   blk_mq_sched_restart(hctx);
+}
+
 static void flush_end_io(struct request *flush_rq, blk_status_t error)
 {
struct request_queue *q = flush_rq->q;
@@ -231,8 +242,14 @@ static void flush_end_io(struct request *flush_rq, 
blk_status_t error)
/* release the tag's ownership to the req cloned from */
    spin_lock_irqsave(&q->mq_flush_lock, flags);
hctx = blk_mq_map_queue(q, flush_rq->mq_ctx->cpu);
-   blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
-   flush_rq->tag = -1;
+   if (!q->elevator) {
+   blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
+   flush_rq->tag = -1;
+   } else {
+   put_flush_driver_tag(hctx, flush_rq->mq_ctx,
+flush_rq->tag);
+   flush_rq->internal_tag = -1;
+   }
}
 
    running = &fq->flush_queue[fq->flush_running_idx];
@@ -321,16 +338,24 @@ static bool blk_kick_flush(struct request_queue *q, 
struct blk_flush_queue *fq)
 * Borrow tag from the first request since they can't
 * be in flight at the same time. And acquire the tag's
 * ownership for flush req.
+*
+* In case of io scheduler, flush rq need to borrow
+* scheduler tag for making put/get driver tag workable.
+* In case of none, flush rq need to borrow driver tag.
 */
if (q->mq_ops) {
struct blk_mq_hw_ctx *hctx;
 
flush_rq->mq_ctx = first_rq->mq_ctx;
-   flush_rq->tag = first_rq->tag;
-   fq->orig_rq = first_rq;
 
-   hctx = blk_mq_map_queue(q, first_rq->mq_ctx->cpu);
-   blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq);
+   if (!q->elevator) {
+   fq->orig_rq = first_rq;
+   flush_rq->tag = first_rq->tag;
+   hctx = blk_mq_map_queue(q, first_rq->mq_ctx->cpu);
+   blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq);
+   } else {
+   flush_rq->internal_tag = first_rq->internal_tag;
+   }
}
 
flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
@@ -394,6 +419,12 @@ static void mq_flush_data_end_io(struct request *rq, 
blk_status_t error)
 
hctx = blk_mq_map_queue(q, ctx->cpu);
 
+   if (q->elevator) {
+   WARN_ON(rq->tag < 0);
+   put_flush_driver_tag(hctx, ctx, rq->tag);
+   rq->tag = -1;
+   }
+
/*
 * After populating an empty queue, kick it to avoid stall.  Read
 * the comment in flush_end_io().
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 4ab69435708c..dbf100871ad6 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -259,22 +259,20 @@ void blk_mq_sched_request_inserted(struct request *rq)
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_request_inserted);
 
-static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
-  struct request *rq)
+static bool blk_mq_bypass_insert(struct blk_mq_hw_ctx *hctx,
+bool has_sched, struct request *rq)
 {
-   if (rq->tag == -1) 

[PATCH] brd: fix overflow in __brd_direct_access

2017-09-13 Thread Mikulas Patocka
The code in __brd_direct_access multiplies the pgoff variable by page size
and divides it by 512. It can cause overflow on 32-bit architectures. The
overflow happens if we create ramdisk larger than 4G and use it as a
sparse device.

This patch replaces multiplication and division with multiplication by the
number of sectors per page.

Signed-off-by: Mikulas Patocka 
Fixes: 1647b9b959c7 ("brd: add dax_operations support")
Cc: sta...@vger.kernel.org  # 4.12+

---
 drivers/block/brd.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-4.13/drivers/block/brd.c
===
--- linux-4.13.orig/drivers/block/brd.c
+++ linux-4.13/drivers/block/brd.c
@@ -339,7 +339,7 @@ static long __brd_direct_access(struct b
 
if (!brd)
return -ENODEV;
-   page = brd_insert_page(brd, PFN_PHYS(pgoff) / 512);
+   page = brd_insert_page(brd, (sector_t)pgoff << PAGE_SECTORS_SHIFT);
if (!page)
return -ENOSPC;
*kaddr = page_address(page);


[PATCH V8 01/14] mmc: core: Introduce host claiming by context

2017-09-13 Thread Adrian Hunter
Currently the host can be claimed by a task.  Change this so that the host
can be claimed by a context that may or may not be a task.  This provides
for the host to be claimed by a block driver queue to support blk-mq, while
maintaining compatibility with the existing use of mmc_claim_host().

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/core/core.c  | 92 
 drivers/mmc/core/core.h  |  6 
 include/linux/mmc/host.h |  7 +++-
 3 files changed, 97 insertions(+), 8 deletions(-)

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 66c9cf49ad2f..09fdc467b804 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -832,9 +832,38 @@ unsigned int mmc_align_data_size(struct mmc_card *card, 
unsigned int sz)
 }
 EXPORT_SYMBOL(mmc_align_data_size);
 
+/*
+ * Allow claiming an already claimed host if the context is the same or there 
is
+ * no context but the task is the same.
+ */
+static inline bool mmc_ctx_matches(struct mmc_host *host, struct mmc_ctx *ctx,
+  struct task_struct *task)
+{
+   return host->claimer == ctx ||
+  (!ctx && task && host->claimer->task == task);
+}
+
+static inline void mmc_ctx_set_claimer(struct mmc_host *host,
+  struct mmc_ctx *ctx,
+  struct task_struct *task)
+{
+   if (!host->claimer) {
+   if (ctx)
+   host->claimer = ctx;
+   else
+   host->claimer = &host->default_ctx;
+   }
+   if (task)
+   host->claimer->task = task;
+}
+
 /**
- * __mmc_claim_host - exclusively claim a host
+ * __mmc_ctx_task_claim_host - exclusively claim a host
  * @host: mmc host to claim
+ * @ctx: context that claims the host or NULL in which case the default
+ * context will be used
+ * @task: task that claims the host or NULL in which case @ctx must be
+ * provided
  * @abort: whether or not the operation should be aborted
  *
  * Claim a host for a set of operations.  If @abort is non null and
@@ -842,7 +871,8 @@ unsigned int mmc_align_data_size(struct mmc_card *card, 
unsigned int sz)
  * that non-zero value without acquiring the lock.  Returns zero
  * with the lock held otherwise.
  */
-int __mmc_claim_host(struct mmc_host *host, atomic_t *abort)
+static int __mmc_ctx_task_claim_host(struct mmc_host *host, struct mmc_ctx 
*ctx,
+struct task_struct *task, atomic_t *abort)
 {
DECLARE_WAITQUEUE(wait, current);
unsigned long flags;
@@ -856,7 +886,7 @@ int __mmc_claim_host(struct mmc_host *host, atomic_t *abort)
while (1) {
set_current_state(TASK_UNINTERRUPTIBLE);
stop = abort ? atomic_read(abort) : 0;
-   if (stop || !host->claimed || host->claimer == current)
+   if (stop || !host->claimed || mmc_ctx_matches(host, ctx, task))
break;
spin_unlock_irqrestore(&host->lock, flags);
schedule();
@@ -865,7 +895,7 @@ int __mmc_claim_host(struct mmc_host *host, atomic_t *abort)
set_current_state(TASK_RUNNING);
if (!stop) {
host->claimed = 1;
-   host->claimer = current;
+   mmc_ctx_set_claimer(host, ctx, task);
host->claim_cnt += 1;
if (host->claim_cnt == 1)
pm = true;
@@ -879,8 +909,19 @@ int __mmc_claim_host(struct mmc_host *host, atomic_t 
*abort)
 
return stop;
 }
+
+int __mmc_claim_host(struct mmc_host *host, atomic_t *abort)
+{
+   return __mmc_ctx_task_claim_host(host, NULL, current, abort);
+}
 EXPORT_SYMBOL(__mmc_claim_host);
 
+void mmc_ctx_claim_host(struct mmc_host *host, struct mmc_ctx *ctx)
+{
+   __mmc_ctx_task_claim_host(host, ctx, NULL, NULL);
+}
+EXPORT_SYMBOL(mmc_ctx_claim_host);
+
 /**
  * mmc_release_host - release a host
  * @host: mmc host to release
@@ -888,18 +929,17 @@ int __mmc_claim_host(struct mmc_host *host, atomic_t 
*abort)
  * Release a MMC host, allowing others to claim the host
  * for their operations.
  */
-void mmc_release_host(struct mmc_host *host)
+static void __mmc_release_host(struct mmc_host *host)
 {
unsigned long flags;
 
-   WARN_ON(!host->claimed);
-
spin_lock_irqsave(&host->lock, flags);
if (--host->claim_cnt) {
/* Release for nested claim */
spin_unlock_irqrestore(&host->lock, flags);
} else {
host->claimed = 0;
+   host->claimer->task = NULL;
host->claimer = NULL;
spin_unlock_irqrestore(&host->lock, flags);
wake_up(&host->wq);
@@ -907,8 +947,23 @@ void mmc_release_host(struct mmc_host *host)
pm_runtime_put_autosuspend(mmc_dev(host));
}
 }
+
+void mmc_release_host(struct 

[PATCH V8 04/14] mmc: mmc: Enable CQE's

2017-09-13 Thread Adrian Hunter
Enable or disable CQE when a card is added or removed respectively.

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/core/bus.c |  7 +++
 drivers/mmc/core/mmc.c | 12 
 2 files changed, 19 insertions(+)

diff --git a/drivers/mmc/core/bus.c b/drivers/mmc/core/bus.c
index 301246513a37..a4b49e25fe96 100644
--- a/drivers/mmc/core/bus.c
+++ b/drivers/mmc/core/bus.c
@@ -369,10 +369,17 @@ int mmc_add_card(struct mmc_card *card)
  */
 void mmc_remove_card(struct mmc_card *card)
 {
+   struct mmc_host *host = card->host;
+
 #ifdef CONFIG_DEBUG_FS
mmc_remove_card_debugfs(card);
 #endif
 
+   if (host->cqe_enabled) {
+   host->cqe_ops->cqe_disable(host);
+   host->cqe_enabled = false;
+   }
+
if (mmc_card_present(card)) {
if (mmc_host_is_spi(card->host)) {
pr_info("%s: SPI card removed\n",
diff --git a/drivers/mmc/core/mmc.c b/drivers/mmc/core/mmc.c
index 194cf2082e73..0dcbd25126a1 100644
--- a/drivers/mmc/core/mmc.c
+++ b/drivers/mmc/core/mmc.c
@@ -1807,6 +1807,18 @@ static int mmc_init_card(struct mmc_host *host, u32 ocr,
 */
card->reenable_cmdq = card->ext_csd.cmdq_en;
 
+   if (card->ext_csd.cmdq_en && !host->cqe_enabled) {
+   err = host->cqe_ops->cqe_enable(host, card);
+   if (err) {
+   pr_err("%s: Failed to enable CQE, error %d\n",
+   mmc_hostname(host), err);
+   } else {
+   host->cqe_enabled = true;
+   pr_info("%s: Command Queue Engine enabled\n",
+   mmc_hostname(host));
+   }
+   }
+
if (!oldcard)
host->card = card;
 
-- 
1.9.1



[PATCH V8 03/14] mmc: mmc: Enable Command Queuing

2017-09-13 Thread Adrian Hunter
Enable the Command Queue if the host controller supports a command queue
engine. It is not compatible with Packed Commands, so make a note of that in the
comment.

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/core/mmc.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/drivers/mmc/core/mmc.c b/drivers/mmc/core/mmc.c
index a7eb623f8daa..194cf2082e73 100644
--- a/drivers/mmc/core/mmc.c
+++ b/drivers/mmc/core/mmc.c
@@ -1784,6 +1784,23 @@ static int mmc_init_card(struct mmc_host *host, u32 ocr,
}
 
/*
+* Enable Command Queue if supported. Note that Packed Commands cannot
+* be used with Command Queue.
+*/
+   card->ext_csd.cmdq_en = false;
+   if (card->ext_csd.cmdq_support && host->caps2 & MMC_CAP2_CQE) {
+   err = mmc_cmdq_enable(card);
+   if (err && err != -EBADMSG)
+   goto free_card;
+   if (err) {
+   pr_warn("%s: Enabling CMDQ failed\n",
+   mmc_hostname(card->host));
+   card->ext_csd.cmdq_support = false;
+   card->ext_csd.cmdq_depth = 0;
+   err = 0;
+   }
+   }
+   /*
 * In some cases (e.g. RPMB or mmc_test), the Command Queue must be
 * disabled for a time, so a flag is needed to indicate to re-enable the
 * Command Queue.
-- 
1.9.1



[PATCH V8 05/14] mmc: block: Use local variables in mmc_blk_data_prep()

2017-09-13 Thread Adrian Hunter
Use local variables in mmc_blk_data_prep() in preparation for adding CQE
support which doesn't use the output variables.

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/core/block.c | 29 ++---
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 29fc1e662891..1ec5f0bc98d2 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -1534,21 +1534,22 @@ static enum mmc_blk_status mmc_blk_err_check(struct 
mmc_card *card,
 }
 
 static void mmc_blk_data_prep(struct mmc_queue *mq, struct mmc_queue_req *mqrq,
- int disable_multi, bool *do_rel_wr,
- bool *do_data_tag)
+ int disable_multi, bool *do_rel_wr_p,
+ bool *do_data_tag_p)
 {
struct mmc_blk_data *md = mq->blkdata;
struct mmc_card *card = md->queue.card;
struct mmc_blk_request *brq = &mqrq->brq;
struct request *req = mmc_queue_req_to_req(mqrq);
+   bool do_rel_wr, do_data_tag;
 
/*
 * Reliable writes are used to implement Forced Unit Access and
 * are supported only on MMCs.
 */
-   *do_rel_wr = (req->cmd_flags & REQ_FUA) &&
-rq_data_dir(req) == WRITE &&
-(md->flags & MMC_BLK_REL_WR);
+   do_rel_wr = (req->cmd_flags & REQ_FUA) &&
+   rq_data_dir(req) == WRITE &&
+   (md->flags & MMC_BLK_REL_WR);
 
memset(brq, 0, sizeof(struct mmc_blk_request));
 
@@ -1596,18 +1597,18 @@ static void mmc_blk_data_prep(struct mmc_queue *mq, 
struct mmc_queue_req *mqrq,
brq->data.blocks);
}
 
-   if (*do_rel_wr)
+   if (do_rel_wr)
mmc_apply_rel_rw(brq, card, req);
 
/*
 * Data tag is used only during writing meta data to speed
 * up write and any subsequent read of this meta data
 */
-   *do_data_tag = card->ext_csd.data_tag_unit_size &&
-  (req->cmd_flags & REQ_META) &&
-  (rq_data_dir(req) == WRITE) &&
-  ((brq->data.blocks * brq->data.blksz) >=
-   card->ext_csd.data_tag_unit_size);
+   do_data_tag = card->ext_csd.data_tag_unit_size &&
+ (req->cmd_flags & REQ_META) &&
+ (rq_data_dir(req) == WRITE) &&
+ ((brq->data.blocks * brq->data.blksz) >=
+  card->ext_csd.data_tag_unit_size);
 
mmc_set_data_timeout(&brq->data, card);
 
@@ -1636,6 +1637,12 @@ static void mmc_blk_data_prep(struct mmc_queue *mq, 
struct mmc_queue_req *mqrq,
mqrq->areq.mrq = &brq->mrq;
 
mmc_queue_bounce_pre(mqrq);
+
+   if (do_rel_wr_p)
+   *do_rel_wr_p = do_rel_wr;
+
+   if (do_data_tag_p)
+   *do_data_tag_p = do_data_tag;
 }
 
 static void mmc_blk_rw_rq_prep(struct mmc_queue_req *mqrq,
-- 
1.9.1



[PATCH V8 06/14] mmc: block: Prepare CQE data

2017-09-13 Thread Adrian Hunter
Enhance mmc_blk_data_prep() to support CQE requests. That means adding
some things that for non-CQE requests would be encoded into the command
arguments - such as the block address, reliable-write flag, and data tag
flag. Also the request tag is needed to provide the command queue task id,
and a comment is added to explain the future possibility of defining a
priority.

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/core/block.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 1ec5f0bc98d2..2424c7360687 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -1554,6 +1554,7 @@ static void mmc_blk_data_prep(struct mmc_queue *mq, 
struct mmc_queue_req *mqrq,
memset(brq, 0, sizeof(struct mmc_blk_request));
 
brq->mrq.data = &brq->data;
+   brq->mrq.tag = req->tag;
 
brq->stop.opcode = MMC_STOP_TRANSMISSION;
brq->stop.arg = 0;
@@ -1568,6 +1569,14 @@ static void mmc_blk_data_prep(struct mmc_queue *mq, 
struct mmc_queue_req *mqrq,
 
brq->data.blksz = 512;
brq->data.blocks = blk_rq_sectors(req);
+   brq->data.blk_addr = blk_rq_pos(req);
+
+   /*
+* The command queue supports 2 priorities: "high" (1) and "simple" (0).
+* The eMMC will give "high" priority tasks priority over "simple"
+* priority tasks. Here we always set "simple" priority by not setting
+* MMC_DATA_PRIO.
+*/
 
/*
 * The block layer doesn't support all sector count
@@ -1597,8 +1606,10 @@ static void mmc_blk_data_prep(struct mmc_queue *mq, 
struct mmc_queue_req *mqrq,
brq->data.blocks);
}
 
-   if (do_rel_wr)
+   if (do_rel_wr) {
mmc_apply_rel_rw(brq, card, req);
+   brq->data.flags |= MMC_DATA_REL_WR;
+   }
 
/*
 * Data tag is used only during writing meta data to speed
@@ -1610,6 +1621,9 @@ static void mmc_blk_data_prep(struct mmc_queue *mq, 
struct mmc_queue_req *mqrq,
  ((brq->data.blocks * brq->data.blksz) >=
   card->ext_csd.data_tag_unit_size);
 
+   if (do_data_tag)
+   brq->data.flags |= MMC_DATA_DAT_TAG;
+
mmc_set_data_timeout(&brq->data, card);
 
brq->data.sg = mqrq->sg;
-- 
1.9.1



[PATCH V8 11/14] mmc: core: Export mmc_start_request()

2017-09-13 Thread Adrian Hunter
Export mmc_start_request() so that the block driver can use it.

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/core/core.c | 3 ++-
 drivers/mmc/core/core.h | 2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 3638ed4f0f9e..24a73f387482 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -344,7 +344,7 @@ static int mmc_mrq_prep(struct mmc_host *host, struct 
mmc_request *mrq)
return 0;
 }
 
-static int mmc_start_request(struct mmc_host *host, struct mmc_request *mrq)
+int mmc_start_request(struct mmc_host *host, struct mmc_request *mrq)
 {
int err;
 
@@ -366,6 +366,7 @@ static int mmc_start_request(struct mmc_host *host, struct 
mmc_request *mrq)
 
return 0;
 }
+EXPORT_SYMBOL(mmc_start_request);
 
 /*
  * mmc_wait_data_done() - done callback for data request
diff --git a/drivers/mmc/core/core.h b/drivers/mmc/core/core.h
index 535539a9e7eb..03e0f8384b1c 100644
--- a/drivers/mmc/core/core.h
+++ b/drivers/mmc/core/core.h
@@ -109,6 +109,8 @@ static inline void mmc_unregister_pm_notifier(struct 
mmc_host *host) { }
 void mmc_wait_for_req_done(struct mmc_host *host, struct mmc_request *mrq);
 bool mmc_is_req_done(struct mmc_host *host, struct mmc_request *mrq);
 
+int mmc_start_request(struct mmc_host *host, struct mmc_request *mrq);
+
 struct mmc_async_req;
 
 struct mmc_async_req *mmc_start_areq(struct mmc_host *host,
-- 
1.9.1



[PATCH V8 09/14] mmc: core: Remove unnecessary host claim

2017-09-13 Thread Adrian Hunter
Callers already have the host claimed, so remove the unnecessary
calls to mmc_claim_host() and mmc_release_host().

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/core/mmc_ops.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index 54686ca4bfb7..a6b0a232f24a 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -977,7 +977,6 @@ void mmc_start_bkops(struct mmc_card *card, bool 
from_exception)
from_exception)
return;
 
-   mmc_claim_host(card->host);
if (card->ext_csd.raw_bkops_status >= EXT_CSD_BKOPS_LEVEL_2) {
timeout = MMC_OPS_TIMEOUT_MS;
use_busy_signal = true;
@@ -995,7 +994,7 @@ void mmc_start_bkops(struct mmc_card *card, bool 
from_exception)
pr_warn("%s: Error %d starting bkops\n",
mmc_hostname(card->host), err);
mmc_retune_release(card->host);
-   goto out;
+   return;
}
 
/*
@@ -1007,8 +1006,6 @@ void mmc_start_bkops(struct mmc_card *card, bool 
from_exception)
mmc_card_set_doing_bkops(card);
else
mmc_retune_release(card->host);
-out:
-   mmc_release_host(card->host);
 }
 
 /*
-- 
1.9.1



[PATCH V8 10/14] mmc: core: Export mmc_start_bkops()

2017-09-13 Thread Adrian Hunter
Export mmc_start_bkops() so that the block driver can use it.

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/core/mmc_ops.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index a6b0a232f24a..908e4db03535 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -1007,6 +1007,7 @@ void mmc_start_bkops(struct mmc_card *card, bool 
from_exception)
else
mmc_retune_release(card->host);
 }
+EXPORT_SYMBOL(mmc_start_bkops);
 
 /*
  * Flush the cache to the non-volatile storage.
-- 
1.9.1



[PATCH V8 12/14] mmc: block: Add CQE and blk-mq support

2017-09-13 Thread Adrian Hunter
Add CQE support to the block driver, including:
- optionally using DCMD for flush requests
- "manually" issuing discard requests
- issuing read / write requests to the CQE
- supporting block-layer timeouts
- handling recovery
- supporting re-tuning

CQE offers 25% - 50% better random multi-threaded I/O.  There is a slight
(e.g. 2%) drop in sequential read speed but no observable change to sequential
write.

CQE automatically sends the commands to complete requests.  However it only
supports reads / writes and so-called "direct commands" (DCMD).  Furthermore
DCMD is limited to one command at a time, but discards require 3 commands.
That makes issuing discards through CQE very awkward, and some CQEs don't
support DCMD anyway.  So for discards, the existing non-CQE approach is
taken, where the mmc core code issues the 3 commands one at a time i.e.
mmc_erase().  DCMD is instead used for issuing flushes.

For host controllers without CQE support, blk-mq support is extended to
synchronous reads/writes or, if the host supports CAP_WAIT_WHILE_BUSY,
asynchronous reads/writes.  The advantage of asynchronous reads/writes is
that it allows the preparation of the next request while the current
request is in progress.
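
To make the request routing described above concrete, the per-request
decision in the block driver boils down to something like the sketch
below.  This is only an illustration of the dispatch logic (flush ->
DCMD if available, discard -> the mmc_erase() path, read/write -> a CQE
task); the helpers mmc_cqe_issue_flush(), mmc_issue_discard() and
mmc_cqe_issue_rw() are placeholder names, not functions this patch adds.

	/* Illustrative sketch only - helper names are placeholders. */
	static int my_cqe_issue(struct mmc_queue *mq, struct request *req)
	{
		switch (req_op(req)) {
		case REQ_OP_FLUSH:
			/* one DCMD slot, if the CQE supports DCMD */
			return mmc_cqe_issue_flush(mq, req);
		case REQ_OP_DISCARD:
		case REQ_OP_SECURE_ERASE:
			/* issued outside the CQE, via mmc_erase() */
			return mmc_issue_discard(mq, req);
		case REQ_OP_READ:
		case REQ_OP_WRITE:
			/* the normal case: a task descriptor queued to the CQE */
			return mmc_cqe_issue_rw(mq, req);
		default:
			return -EOPNOTSUPP;
		}
	}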

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/core/block.c | 734 ++-
 drivers/mmc/core/block.h |   8 +
 drivers/mmc/core/queue.c | 428 +--
 drivers/mmc/core/queue.h |  54 +++-
 4 files changed, 1192 insertions(+), 32 deletions(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 2424c7360687..eca81502d09f 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -109,6 +109,7 @@ struct mmc_blk_data {
 #define MMC_BLK_WRITE  BIT(1)
 #define MMC_BLK_DISCARDBIT(2)
 #define MMC_BLK_SECDISCARD BIT(3)
+#define MMC_BLK_CQE_RECOVERY   BIT(4)
 
/*
 * Only set in main mmc_blk_data associated
@@ -1266,7 +1267,10 @@ static void mmc_blk_issue_discard_rq(struct mmc_queue 
*mq, struct request *req)
else
mmc_blk_reset_success(md, type);
 fail:
-   blk_end_request(req, status, blk_rq_bytes(req));
+   if (req->mq_ctx)
+   blk_mq_end_request(req, status);
+   else
+   blk_end_request(req, status, blk_rq_bytes(req));
 }
 
 static void mmc_blk_issue_secdiscard_rq(struct mmc_queue *mq,
@@ -1336,7 +1340,10 @@ static void mmc_blk_issue_secdiscard_rq(struct mmc_queue 
*mq,
if (!err)
mmc_blk_reset_success(md, type);
 out:
-   blk_end_request(req, status, blk_rq_bytes(req));
+   if (req->mq_ctx)
+   blk_mq_end_request(req, status);
+   else
+   blk_end_request(req, status, blk_rq_bytes(req));
 }
 
 static void mmc_blk_issue_flush(struct mmc_queue *mq, struct request *req)
@@ -1346,7 +1353,10 @@ static void mmc_blk_issue_flush(struct mmc_queue *mq, 
struct request *req)
int ret = 0;
 
ret = mmc_flush_cache(card);
-   blk_end_request_all(req, ret ? BLK_STS_IOERR : BLK_STS_OK);
+   if (req->mq_ctx)
+   blk_mq_end_request(req, ret ? BLK_STS_IOERR : BLK_STS_OK);
+   else
+   blk_end_request_all(req, ret ? BLK_STS_IOERR : BLK_STS_OK);
 }
 
 /*
@@ -1372,15 +1382,18 @@ static inline void mmc_apply_rel_rw(struct 
mmc_blk_request *brq,
}
 }
 
-#define CMD_ERRORS \
-   (R1_OUT_OF_RANGE |  /* Command argument out of range */ \
-R1_ADDRESS_ERROR | /* Misaligned address */\
+#define CMD_ERRORS_EXCL_OOR\
+   (R1_ADDRESS_ERROR | /* Misaligned address */\
 R1_BLOCK_LEN_ERROR |   /* Transferred block length incorrect */\
 R1_WP_VIOLATION |  /* Tried to write to protected block */ \
 R1_CARD_ECC_FAILED |   /* Card ECC failed */   \
 R1_CC_ERROR |  /* Card controller error */ \
 R1_ERROR)  /* General/unknown error */
 
+#define CMD_ERRORS \
+   (CMD_ERRORS_EXCL_OOR |  \
+R1_OUT_OF_RANGE)   /* Command argument out of range */ \
+
 static void mmc_blk_eval_resp_error(struct mmc_blk_request *brq)
 {
u32 val;
@@ -1659,6 +1672,138 @@ static void mmc_blk_data_prep(struct mmc_queue *mq, 
struct mmc_queue_req *mqrq,
*do_data_tag_p = do_data_tag;
 }
 
+#define MMC_CQE_RETRIES 2
+
+static void mmc_blk_cqe_complete_rq(struct mmc_queue *mq, struct request *req)
+{
+   struct mmc_queue_req *mqrq = req_to_mmc_queue_req(req);
+   struct mmc_request *mrq = &mqrq->brq.mrq;
+   struct request_queue *q = req->q;
+   struct mmc_host *host = mq->card->host;
+   unsigned long flags;
+   bool put_card;

[PATCH V8 13/14] mmc: cqhci: support for command queue enabled host

2017-09-13 Thread Adrian Hunter
From: Venkat Gopalakrishnan 

This patch adds CMDQ support for command-queue compatible
hosts.

Command queue is added in eMMC-5.1 specification. This
enables the controller to process upto 32 requests at
a time.

Adrian Hunter contributed renaming to cqhci, recovery, suspend
and resume, cqhci_off, cqhci_wait_for_idle, and external timeout
handling.

Signed-off-by: Asutosh Das 
Signed-off-by: Sujit Reddy Thumma 
Signed-off-by: Konstantin Dorfman 
Signed-off-by: Venkat Gopalakrishnan 
Signed-off-by: Subhash Jadavani 
Signed-off-by: Ritesh Harjani 
Signed-off-by: Adrian Hunter 
---
 drivers/mmc/host/Kconfig  |   13 +
 drivers/mmc/host/Makefile |1 +
 drivers/mmc/host/cqhci.c  | 1154 +
 drivers/mmc/host/cqhci.h  |  240 ++
 4 files changed, 1408 insertions(+)
 create mode 100644 drivers/mmc/host/cqhci.c
 create mode 100644 drivers/mmc/host/cqhci.h

diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig
index 8c15637178ff..6db8e46b4e12 100644
--- a/drivers/mmc/host/Kconfig
+++ b/drivers/mmc/host/Kconfig
@@ -843,6 +843,19 @@ config MMC_SUNXI
  This selects support for the SD/MMC Host Controller on
  Allwinner sunxi SoCs.
 
+config MMC_CQHCI
+   tristate "Command Queue Host Controller Interface support"
+   depends on HAS_DMA
+   help
+ This selects the Command Queue Host Controller Interface (CQHCI)
+ support present in host controllers of Qualcomm Technologies, Inc
+ amongst others.
+ This controller supports eMMC devices with command queue support.
+
+ If you have a controller with this interface, say Y or M here.
+
+ If unsure, say N.
+
 config MMC_TOSHIBA_PCI
tristate "Toshiba Type A SD/MMC Card Interface Driver"
depends on PCI
diff --git a/drivers/mmc/host/Makefile b/drivers/mmc/host/Makefile
index 303f5cd46cd9..4cec2a985a04 100644
--- a/drivers/mmc/host/Makefile
+++ b/drivers/mmc/host/Makefile
@@ -89,6 +89,7 @@ obj-$(CONFIG_MMC_SDHCI_MSM)   += sdhci-msm.o
 obj-$(CONFIG_MMC_SDHCI_ST) += sdhci-st.o
 obj-$(CONFIG_MMC_SDHCI_MICROCHIP_PIC32)+= sdhci-pic32.o
 obj-$(CONFIG_MMC_SDHCI_BRCMSTB)+= sdhci-brcmstb.o
+obj-$(CONFIG_MMC_CQHCI)	+= cqhci.o
 
 ifeq ($(CONFIG_CB710_DEBUG),y)
CFLAGS-cb710-mmc+= -DDEBUG
diff --git a/drivers/mmc/host/cqhci.c b/drivers/mmc/host/cqhci.c
new file mode 100644
index ..eb3c1695b0c7
--- /dev/null
+++ b/drivers/mmc/host/cqhci.c
@@ -0,0 +1,1154 @@
+/* Copyright (c) 2015, The Linux Foundation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 and
+ * only version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#include "cqhci.h"
+
+#define DCMD_SLOT 31
+#define NUM_SLOTS 32
+
+struct cqhci_slot {
+   struct mmc_request *mrq;
+   unsigned int flags;
+#define CQHCI_EXTERNAL_TIMEOUT BIT(0)
+#define CQHCI_COMPLETED    BIT(1)
+#define CQHCI_HOST_CRC BIT(2)
+#define CQHCI_HOST_TIMEOUT BIT(3)
+#define CQHCI_HOST_OTHER   BIT(4)
+};
+
+static inline u8 *get_desc(struct cqhci_host *cq_host, u8 tag)
+{
+   return cq_host->desc_base + (tag * cq_host->slot_sz);
+}
+
+static inline u8 *get_link_desc(struct cqhci_host *cq_host, u8 tag)
+{
+   u8 *desc = get_desc(cq_host, tag);
+
+   return desc + cq_host->task_desc_len;
+}
+
+static inline dma_addr_t get_trans_desc_dma(struct cqhci_host *cq_host, u8 tag)
+{
+   return cq_host->trans_desc_dma_base +
+   (cq_host->mmc->max_segs * tag *
+cq_host->trans_desc_len);
+}
+
+static inline u8 *get_trans_desc(struct cqhci_host *cq_host, u8 tag)
+{
+   return cq_host->trans_desc_base +
+   (cq_host->trans_desc_len * cq_host->mmc->max_segs * tag);
+}
+
+static void setup_trans_desc(struct cqhci_host *cq_host, u8 tag)
+{
+   u8 *link_temp;
+   dma_addr_t trans_temp;
+
+   link_temp = get_link_desc(cq_host, tag);
+   trans_temp = get_trans_desc_dma(cq_host, tag);
+
+   memset(link_temp, 0, cq_host->link_desc_len);
+   if (cq_host->link_desc_len > 8)
+   *(link_temp + 8) = 0;
+
+   if (tag == DCMD_SLOT && (cq_host->mmc->caps2 & MMC_CAP2_CQE_DCMD)) {
+   *link_temp = CQHCI_VALID(0) 

[PATCH V8 14/14] mmc: sdhci-pci: Add CQHCI support for Intel GLK

2017-09-13 Thread Adrian Hunter
Add CQHCI initialization and implement CQHCI operations for Intel GLK.

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/host/Kconfig  |   1 +
 drivers/mmc/host/sdhci-pci-core.c | 154 +-
 2 files changed, 154 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig
index 6db8e46b4e12..23621c52d2f8 100644
--- a/drivers/mmc/host/Kconfig
+++ b/drivers/mmc/host/Kconfig
@@ -81,6 +81,7 @@ config MMC_SDHCI_BIG_ENDIAN_32BIT_BYTE_SWAPPER
 config MMC_SDHCI_PCI
tristate "SDHCI support on PCI bus"
depends on MMC_SDHCI && PCI
+   select MMC_CQHCI
help
  This selects the PCI Secure Digital Host Controller Interface.
  Most controllers found today are PCI devices.
diff --git a/drivers/mmc/host/sdhci-pci-core.c 
b/drivers/mmc/host/sdhci-pci-core.c
index bbaddf18a1b3..e22075f99707 100644
--- a/drivers/mmc/host/sdhci-pci-core.c
+++ b/drivers/mmc/host/sdhci-pci-core.c
@@ -30,6 +30,8 @@
 #include 
 #include 
 
+#include "cqhci.h"
+
 #include "sdhci.h"
 #include "sdhci-pci.h"
 #include "sdhci-pci-o2micro.h"
@@ -117,6 +119,28 @@ int sdhci_pci_resume_host(struct sdhci_pci_chip *chip)
 
return 0;
 }
+
+static int sdhci_cqhci_suspend(struct sdhci_pci_chip *chip)
+{
+   int ret;
+
+   ret = cqhci_suspend(chip->slots[0]->host->mmc);
+   if (ret)
+   return ret;
+
+   return sdhci_pci_suspend_host(chip);
+}
+
+static int sdhci_cqhci_resume(struct sdhci_pci_chip *chip)
+{
+   int ret;
+
+   ret = sdhci_pci_resume_host(chip);
+   if (ret)
+   return ret;
+
+   return cqhci_resume(chip->slots[0]->host->mmc);
+}
 #endif
 
 #ifdef CONFIG_PM
@@ -167,8 +191,48 @@ static int sdhci_pci_runtime_resume_host(struct 
sdhci_pci_chip *chip)
 
return 0;
 }
+
+static int sdhci_cqhci_runtime_suspend(struct sdhci_pci_chip *chip)
+{
+   int ret;
+
+   ret = cqhci_suspend(chip->slots[0]->host->mmc);
+   if (ret)
+   return ret;
+
+   return sdhci_pci_runtime_suspend_host(chip);
+}
+
+static int sdhci_cqhci_runtime_resume(struct sdhci_pci_chip *chip)
+{
+   int ret;
+
+   ret = sdhci_pci_runtime_resume_host(chip);
+   if (ret)
+   return ret;
+
+   return cqhci_resume(chip->slots[0]->host->mmc);
+}
 #endif
 
+static u32 sdhci_cqhci_irq(struct sdhci_host *host, u32 intmask)
+{
+   int cmd_error = 0;
+   int data_error = 0;
+
+   if (!sdhci_cqe_irq(host, intmask, &cmd_error, &data_error))
+   return intmask;
+
+   cqhci_irq(host->mmc, intmask, cmd_error, data_error);
+
+   return 0;
+}
+
+static void sdhci_pci_dumpregs(struct mmc_host *mmc)
+{
+   sdhci_dumpregs(mmc_priv(mmc));
+}
+
 /*\
  *   *
  * Hardware specific quirk handling  *
@@ -567,6 +631,17 @@ static void intel_hs400_enhanced_strobe(struct mmc_host 
*mmc,
.hw_reset   = sdhci_pci_hw_reset,
 };
 
+static const struct sdhci_ops sdhci_intel_glk_ops = {
+   .set_clock  = sdhci_set_clock,
+   .set_power  = sdhci_intel_set_power,
+   .enable_dma = sdhci_pci_enable_dma,
+   .set_bus_width  = sdhci_set_bus_width,
+   .reset  = sdhci_reset,
+   .set_uhs_signaling  = sdhci_set_uhs_signaling,
+   .hw_reset   = sdhci_pci_hw_reset,
+   .irq= sdhci_cqhci_irq,
+};
+
 static void byt_read_dsm(struct sdhci_pci_slot *slot)
 {
struct intel_host *intel_host = sdhci_pci_priv(slot);
@@ -596,15 +671,83 @@ static int glk_emmc_probe_slot(struct sdhci_pci_slot 
*slot)
 {
int ret = byt_emmc_probe_slot(slot);
 
+   slot->host->mmc->caps2 |= MMC_CAP2_CQE;
+
if (slot->chip->pdev->device != PCI_DEVICE_ID_INTEL_GLK_EMMC) {
slot->host->mmc->caps2 |= MMC_CAP2_HS400_ES,
slot->host->mmc_host_ops.hs400_enhanced_strobe =
intel_hs400_enhanced_strobe;
+   slot->host->mmc->caps2 |= MMC_CAP2_CQE_DCMD;
}
 
return ret;
 }
 
+static void glk_cqe_enable(struct mmc_host *mmc)
+{
+   struct sdhci_host *host = mmc_priv(mmc);
+   u32 reg;
+
+   /*
+* CQE gets stuck if it sees Buffer Read Enable bit set, which can be
+* the case after tuning, so ensure the buffer is drained.
+*/
+   reg = sdhci_readl(host, SDHCI_PRESENT_STATE);
+   while (reg & SDHCI_DATA_AVAILABLE) {
+   sdhci_readl(host, SDHCI_BUFFER);
+   reg = sdhci_readl(host, SDHCI_PRESENT_STATE);
+   }
+
+   sdhci_cqe_enable(mmc);
+}
+
+static const struct cqhci_host_ops glk_cqhci_ops = {
+   .enable = glk_cqe_enable,
+   .disable

[PATCH V8 08/14] mmc: core: Add parameter use_blk_mq

2017-09-13 Thread Adrian Hunter
Until mmc has blk-mq support fully implemented and tested, add a
parameter use_blk_mq, defaulting to false unless the config option
MMC_MQ_DEFAULT is selected.
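
As a usage note (not part of the patch): the I/O path can then be chosen
at boot or module load time via the option named in the Kconfig help
below, for example

	mmc_core.use_blk_mq=1

on the kernel command line (or use_blk_mq=1 when mmc_core is loaded as a
module).  CONFIG_MMC_MQ_DEFAULT only changes which value is used when
the option is not given.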

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/Kconfig  | 11 +++
 drivers/mmc/core/core.c  |  7 +++
 drivers/mmc/core/core.h  |  2 ++
 drivers/mmc/core/host.c  |  2 ++
 drivers/mmc/core/host.h  |  4 
 include/linux/mmc/host.h |  1 +
 6 files changed, 27 insertions(+)

diff --git a/drivers/mmc/Kconfig b/drivers/mmc/Kconfig
index ec21388311db..98202934bd29 100644
--- a/drivers/mmc/Kconfig
+++ b/drivers/mmc/Kconfig
@@ -12,6 +12,17 @@ menuconfig MMC
  If you want MMC/SD/SDIO support, you should say Y here and
  also to your specific host controller driver.
 
+config MMC_MQ_DEFAULT
+   bool "MMC: use blk-mq I/O path by default"
+   depends on MMC && BLOCK
+   ---help---
+ This option enables the new blk-mq based I/O path for MMC block
+ devices by default.  With the option the mmc_core.use_blk_mq
+ module/boot option defaults to Y, without it to N, but it can
+ still be overridden either way.
+
+ If unsure say N.
+
 if MMC
 
 source "drivers/mmc/core/Kconfig"
diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index ef2d8aa1e7d2..3638ed4f0f9e 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -66,6 +66,13 @@
 bool use_spi_crc = 1;
 module_param(use_spi_crc, bool, 0);
 
+#ifdef CONFIG_MMC_MQ_DEFAULT
+bool mmc_use_blk_mq = true;
+#else
+bool mmc_use_blk_mq = false;
+#endif
+module_param_named(use_blk_mq, mmc_use_blk_mq, bool, S_IWUSR | S_IRUGO);
+
 static int mmc_schedule_delayed_work(struct delayed_work *work,
 unsigned long delay)
 {
diff --git a/drivers/mmc/core/core.h b/drivers/mmc/core/core.h
index e941342ed450..535539a9e7eb 100644
--- a/drivers/mmc/core/core.h
+++ b/drivers/mmc/core/core.h
@@ -35,6 +35,8 @@ struct mmc_bus_ops {
int (*reset)(struct mmc_host *);
 };
 
+extern bool mmc_use_blk_mq;
+
 void mmc_attach_bus(struct mmc_host *host, const struct mmc_bus_ops *ops);
 void mmc_detach_bus(struct mmc_host *host);
 
diff --git a/drivers/mmc/core/host.c b/drivers/mmc/core/host.c
index ad88deb2e8f3..b624dbb6cd15 100644
--- a/drivers/mmc/core/host.c
+++ b/drivers/mmc/core/host.c
@@ -398,6 +398,8 @@ struct mmc_host *mmc_alloc_host(int extra, struct device 
*dev)
host->max_blk_size = 512;
host->max_blk_count = PAGE_SIZE / 512;
 
+   host->use_blk_mq = mmc_use_blk_mq;
+
return host;
 }
 
diff --git a/drivers/mmc/core/host.h b/drivers/mmc/core/host.h
index 77d6f60d1bf9..170fe5947087 100644
--- a/drivers/mmc/core/host.h
+++ b/drivers/mmc/core/host.h
@@ -69,6 +69,10 @@ static inline bool mmc_card_hs400es(struct mmc_card *card)
return card->host->ios.enhanced_strobe;
 }
 
+static inline bool mmc_host_use_blk_mq(struct mmc_host *host)
+{
+   return host->use_blk_mq;
+}
 
 #endif
 
diff --git a/include/linux/mmc/host.h b/include/linux/mmc/host.h
index 54b0463443bd..5d1bd10991f7 100644
--- a/include/linux/mmc/host.h
+++ b/include/linux/mmc/host.h
@@ -378,6 +378,7 @@ struct mmc_host {
unsigned int    doing_retune:1;     /* re-tuning in progress */
unsigned int    retune_now:1;       /* do re-tuning at next req */
unsigned int    retune_paused:1;    /* re-tuning is temporarily disabled */
+   unsigned int    use_blk_mq:1;       /* use blk-mq */
 
int rescan_disable; /* disable card detection */
int rescan_entered; /* used with nonremovable devices */
-- 
1.9.1



[PATCH V8 07/14] mmc: block: Factor out mmc_setup_queue()

2017-09-13 Thread Adrian Hunter
Factor out some common code that will also be used with blk-mq.

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/core/queue.c | 51 
 1 file changed, 30 insertions(+), 21 deletions(-)

diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 74c663b1c0a7..da74fe4e638a 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -223,6 +223,35 @@ static void mmc_exit_request(struct request_queue *q, 
struct request *req)
mq_rq->sg = NULL;
 }
 
+static void mmc_setup_queue(struct mmc_queue *mq, struct mmc_card *card)
+{
+   struct mmc_host *host = card->host;
+   u64 limit = BLK_BOUNCE_HIGH;
+
+   if (mmc_dev(host)->dma_mask && *mmc_dev(host)->dma_mask)
+   limit = (u64)dma_max_pfn(mmc_dev(host)) << PAGE_SHIFT;
+
+   queue_flag_set_unlocked(QUEUE_FLAG_NONROT, mq->queue);
+   queue_flag_clear_unlocked(QUEUE_FLAG_ADD_RANDOM, mq->queue);
+   if (mmc_can_erase(card))
+   mmc_queue_setup_discard(mq->queue, card);
+
+   if (card->bouncesz) {
+   blk_queue_max_hw_sectors(mq->queue, card->bouncesz / 512);
+   blk_queue_max_segments(mq->queue, card->bouncesz / 512);
+   blk_queue_max_segment_size(mq->queue, card->bouncesz);
+   } else {
+   blk_queue_bounce_limit(mq->queue, limit);
+   blk_queue_max_hw_sectors(mq->queue,
+   min(host->max_blk_count, host->max_req_size / 512));
+   blk_queue_max_segments(mq->queue, host->max_segs);
+   blk_queue_max_segment_size(mq->queue, host->max_seg_size);
+   }
+
+   /* Initialize thread_sem even if it is not used */
+   sema_init(&mq->thread_sem, 1);
+}
+
 /**
  * mmc_init_queue - initialise a queue structure.
  * @mq: mmc queue
@@ -236,12 +265,8 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card 
*card,
   spinlock_t *lock, const char *subname)
 {
struct mmc_host *host = card->host;
-   u64 limit = BLK_BOUNCE_HIGH;
int ret = -ENOMEM;
 
-   if (mmc_dev(host)->dma_mask && *mmc_dev(host)->dma_mask)
-   limit = (u64)dma_max_pfn(mmc_dev(host)) << PAGE_SHIFT;
-
/*
 * mmc_init_request() depends on card->bouncesz so it must be calculated
 * before blk_init_allocated_queue() starts allocating requests.
@@ -266,24 +291,8 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card 
*card,
}
 
blk_queue_prep_rq(mq->queue, mmc_prep_request);
-   queue_flag_set_unlocked(QUEUE_FLAG_NONROT, mq->queue);
-   queue_flag_clear_unlocked(QUEUE_FLAG_ADD_RANDOM, mq->queue);
-   if (mmc_can_erase(card))
-   mmc_queue_setup_discard(mq->queue, card);
-
-   if (card->bouncesz) {
-   blk_queue_max_hw_sectors(mq->queue, card->bouncesz / 512);
-   blk_queue_max_segments(mq->queue, card->bouncesz / 512);
-   blk_queue_max_segment_size(mq->queue, card->bouncesz);
-   } else {
-   blk_queue_bounce_limit(mq->queue, limit);
-   blk_queue_max_hw_sectors(mq->queue,
-   min(host->max_blk_count, host->max_req_size / 512));
-   blk_queue_max_segments(mq->queue, host->max_segs);
-   blk_queue_max_segment_size(mq->queue, host->max_seg_size);
-   }
 
-   sema_init(&mq->thread_sem, 1);
+   mmc_setup_queue(mq, card);
 
mq->thread = kthread_run(mmc_queue_thread, mq, "mmcqd/%d%s",
host->index, subname ? subname : "");
-- 
1.9.1



[PATCH V8 02/14] mmc: core: Add support for handling CQE requests

2017-09-13 Thread Adrian Hunter
Add core support for handling CQE requests, including starting, completing
and recovering.
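
The calling convention, pieced together from the helpers added below (a
sketch of the API contract, not a copy of the later block or cqhci
patches; the my_* names are illustrative): the issuer sets mrq->done and
calls mmc_cqe_start_req(); the CQE host driver calls
mmc_cqe_request_done() when the hardware finishes, which ends up
invoking mrq->done(); the issuer then calls mmc_cqe_post_req() once it
is done with the request.

	/* Sketch only - error handling trimmed, my_* names are made up. */
	struct my_cqe_req {
		struct mmc_request mrq;
		struct completion done;
	};

	static void my_cqe_done(struct mmc_request *mrq)
	{
		struct my_cqe_req *r = container_of(mrq, struct my_cqe_req, mrq);

		/* reached via mmc_cqe_request_done() from the host driver */
		complete(&r->done);
	}

	static int my_issue_and_wait(struct mmc_host *host, struct my_cqe_req *r)
	{
		int err;

		init_completion(&r->done);
		r->mrq.done = my_cqe_done;

		err = mmc_cqe_start_req(host, &r->mrq);
		if (err)
			return err;

		wait_for_completion(&r->done);
		mmc_cqe_post_req(host, &r->mrq);
		return 0;
	}

The real block driver completes requests asynchronously rather than
waiting like this; the sketch only shows which side calls which helper.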

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/core/core.c  | 163 +--
 drivers/mmc/core/core.h  |   4 ++
 include/linux/mmc/host.h |   2 +
 3 files changed, 164 insertions(+), 5 deletions(-)

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 09fdc467b804..ef2d8aa1e7d2 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -266,7 +266,8 @@ static void __mmc_start_request(struct mmc_host *host, 
struct mmc_request *mrq)
host->ops->request(host, mrq);
 }
 
-static void mmc_mrq_pr_debug(struct mmc_host *host, struct mmc_request *mrq)
+static void mmc_mrq_pr_debug(struct mmc_host *host, struct mmc_request *mrq,
+bool cqe)
 {
if (mrq->sbc) {
pr_debug("<%s: starting CMD%u arg %08x flags %08x>\n",
@@ -275,9 +276,12 @@ static void mmc_mrq_pr_debug(struct mmc_host *host, struct 
mmc_request *mrq)
}
 
if (mrq->cmd) {
-   pr_debug("%s: starting CMD%u arg %08x flags %08x\n",
-mmc_hostname(host), mrq->cmd->opcode, mrq->cmd->arg,
-mrq->cmd->flags);
+   pr_debug("%s: starting %sCMD%u arg %08x flags %08x\n",
+mmc_hostname(host), cqe ? "CQE direct " : "",
+mrq->cmd->opcode, mrq->cmd->arg, mrq->cmd->flags);
+   } else if (cqe) {
+   pr_debug("%s: starting CQE transfer for tag %d blkaddr %u\n",
+mmc_hostname(host), mrq->tag, mrq->data->blk_addr);
}
 
if (mrq->data) {
@@ -342,7 +346,7 @@ static int mmc_start_request(struct mmc_host *host, struct 
mmc_request *mrq)
if (mmc_card_removed(host->card))
return -ENOMEDIUM;
 
-   mmc_mrq_pr_debug(host, mrq);
+   mmc_mrq_pr_debug(host, mrq, false);
 
WARN_ON(!host->claimed);
 
@@ -482,6 +486,155 @@ void mmc_wait_for_req_done(struct mmc_host *host, struct 
mmc_request *mrq)
 }
 EXPORT_SYMBOL(mmc_wait_for_req_done);
 
+/*
+ * mmc_cqe_start_req - Start a CQE request.
+ * @host: MMC host to start the request
+ * @mrq: request to start
+ *
+ * Start the request, re-tuning if needed and it is possible. Returns an error
+ * code if the request fails to start or -EBUSY if CQE is busy.
+ */
+int mmc_cqe_start_req(struct mmc_host *host, struct mmc_request *mrq)
+{
+   int err;
+
+   /*
+* CQE cannot process re-tuning commands. Caller must hold retuning
+* while CQE is in use.  Re-tuning can happen here only when CQE has no
+* active requests i.e. this is the first.  Note, re-tuning will call
+* ->cqe_off().
+*/
+   err = mmc_retune(host);
+   if (err)
+   goto out_err;
+
+   mrq->host = host;
+
+   mmc_mrq_pr_debug(host, mrq, true);
+
+   err = mmc_mrq_prep(host, mrq);
+   if (err)
+   goto out_err;
+
+   err = host->cqe_ops->cqe_request(host, mrq);
+   if (err)
+   goto out_err;
+
+   trace_mmc_request_start(host, mrq);
+
+   return 0;
+
+out_err:
+   if (mrq->cmd) {
+   pr_debug("%s: failed to start CQE direct CMD%u, error %d\n",
+mmc_hostname(host), mrq->cmd->opcode, err);
+   } else {
+   pr_debug("%s: failed to start CQE transfer for tag %d, error 
%d\n",
+mmc_hostname(host), mrq->tag, err);
+   }
+   return err;
+}
+EXPORT_SYMBOL(mmc_cqe_start_req);
+
+/**
+ * mmc_cqe_request_done - CQE has finished processing an MMC request
+ * @host: MMC host which completed request
+ * @mrq: MMC request which completed
+ *
+ * CQE drivers should call this function when they have completed
+ * their processing of a request.
+ */
+void mmc_cqe_request_done(struct mmc_host *host, struct mmc_request *mrq)
+{
+   mmc_should_fail_request(host, mrq);
+
+   /* Flag re-tuning needed on CRC errors */
+   if ((mrq->cmd && mrq->cmd->error == -EILSEQ) ||
+   (mrq->data && mrq->data->error == -EILSEQ))
+   mmc_retune_needed(host);
+
+   trace_mmc_request_done(host, mrq);
+
+   if (mrq->cmd) {
+   pr_debug("%s: CQE req done (direct CMD%u): %d\n",
+mmc_hostname(host), mrq->cmd->opcode, mrq->cmd->error);
+   } else {
+   pr_debug("%s: CQE transfer done tag %d\n",
+mmc_hostname(host), mrq->tag);
+   }
+
+   if (mrq->data) {
+   pr_debug("%s: %d bytes transferred: %d\n",
+mmc_hostname(host),
+mrq->data->bytes_xfered, mrq->data->error);
+   }
+
+   mrq->done(mrq);
+}
+EXPORT_SYMBOL(mmc_cqe_request_done);
+
+/**
+ * mmc_cqe_post_req - CQE post process of a completed MMC request
+ * @host: MMC host
+ * @mrq: MMC 

[PATCH V8 00/14] mmc: Add Command Queue support

2017-09-13 Thread Adrian Hunter
Hi

Here is V8 of the hardware command queue patches without the software
command queue patches, now using blk-mq and now with blk-mq support for
non-CQE I/O.

After the unacceptable debacle of the last release cycle, I expect an
immediate response to these patches.

HW CMDQ offers 25% - 50% better random multi-threaded I/O.  I see a slight
2% drop in sequential read speed but no change to sequential write.

Non-CQE blk-mq showed a 3% decrease in sequential read performance.  This
seemed to be coming from the inferior latency of running work items compared
with a dedicated thread.  Hacking blk-mq workqueue to be unbound reduced the
performance degradation from 3% to 1%.

While we should look at changing blk-mq to give better workqueue performance,
a bigger gain is likely to be made by adding a new host API to enable the
next already-prepared request to be issued directly from within ->done()
callback of the current request.
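
For readers wondering what such an API could look like, one possible
shape (purely hypothetical, nothing like this exists in this series) is
an extra host callback that completes the current request and, if the
next request has already been prepared, starts it straight from the
same context:

	/* Hypothetical illustration only - not part of this series. */
	int (*complete_and_issue_next)(struct mmc_host *host,
				       struct mmc_request *done_mrq,
				       struct mmc_request *next_mrq);

so the latency of waking a work item or thread is taken out of the
per-request path.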


Changes since V7:
Re-based
  mmc: core: Introduce host claiming by context
Slightly simplified
  mmc: core: Add parameter use_blk_mq
New patch.
  mmc: core: Remove unnecessary host claim
New patch.
  mmc: core: Export mmc_start_bkops()
New patch.
  mmc: core: Export mmc_start_request()
New patch.
  mmc: block: Add CQE and blk-mq support
Add blk-mq support for non_CQE requests

Changes since V6:
  mmc: core: Introduce host claiming by context
New patch.
  mmc: core: Move mmc_start_areq() declaration
Dropped because it has been applied
  mmc: block: Fix block status codes
Dropped because it has been applied
  mmc: host: Add CQE interface
Dropped because it has been applied
  mmc: core: Turn off CQE before sending commands
Dropped because it has been applied
  mmc: block: Factor out mmc_setup_queue()
New patch.
  mmc: block: Add CQE support
Drop legacy support and add blk-mq support

Changes since V5:
Re-based
  mmc: core: Add mmc_retune_hold_now()
Dropped because it has been applied
  mmc: core: Add members to mmc_request and mmc_data for CQE's
Dropped because it has been applied
  mmc: core: Move mmc_start_areq() declaration
New patch at Ulf's request
  mmc: block: Fix block status codes
Another un-related patch
  mmc: host: Add CQE interface
Move recovery_notifier() callback to struct mmc_request
  mmc: core: Add support for handling CQE requests
Roll __mmc_cqe_request_done() into mmc_cqe_request_done()
Move function declarations requested by Ulf
  mmc: core: Remove unused MMC_CAP2_PACKED_CMD
Dropped because it has been applied
  mmc: block: Add CQE support
Add explanation to commit message
Adjustment for changed recovery_notifier() callback
  mmc: cqhci: support for command queue enabled host
Adjustment for changed recovery_notifier() callback
  mmc: sdhci-pci: Add CQHCI support for Intel GLK
Add DCMD capability for Intel controllers except GLK

Changes since V4:
  mmc: core: Add mmc_retune_hold_now()
Add explanation to commit message.
  mmc: host: Add CQE interface
Add comments to callback declarations.
  mmc: core: Turn off CQE before sending commands
Add explanation to commit message.
  mmc: core: Add support for handling CQE requests
Add comments as requested by Ulf.
  mmc: core: Remove unused MMC_CAP2_PACKED_CMD
New patch.
  mmc: mmc: Enable Command Queuing
Adjust for removal of MMC_CAP2_PACKED_CMD.
Add a comment about Packed Commands.
  mmc: mmc: Enable CQE's
Remove un-necessary check for MMC_CAP2_CQE
  mmc: block: Use local variables in mmc_blk_data_prep()
New patch.
  mmc: block: Prepare CQE data
Adjust due to "mmc: block: Use local variables in mmc_blk_data_prep()"
Remove priority setting.
Add explanation to commit message.
  mmc: cqhci: support for command queue enabled host
Fix transfer descriptor setting in cqhci_set_tran_desc() for 32-bit DMA

Changes since V3:
Adjusted ...blk_end_request...() for new block status codes
Fixed CQHCI transaction descriptor for "no DCMD" case

Changes since V2:
Dropped patches that have been applied.
Re-based
Added "mmc: sdhci-pci: Add CQHCI support for Intel GLK"

Changes since V1:

"Share mmc request array between partitions" is dependent
on changes in "Introduce queue semantics", so added that
and block fixes:

Added "Fix is_waiting_last_req set incorrectly"
Added "Fix cmd error reset failure path"
Added "Use local var for mqrq_cur"
Added "Introduce queue semantics"

Changes since RFC:

Re-based on next.
Added comment about command queue priority.
Added some acks and reviews.

[BUG] xfs/104 triggered NULL pointer dereference in iomap based dio

2017-09-13 Thread Eryu Guan
Hi all,

Recently I noticed multiple crashes triggered by xfs/104 on ppc64 hosts
in my upstream 4.13 kernel testing. The block layer is involved in the
call trace so I add linux-block to cc list too. I append the full
console log to the end of this mail.

Now I can reproduce the crash on x86_64 hosts too by running xfs/104
many times (usually within 100 iterations). A git-bisect run (I ran it
for 500 iterations before calling it good in the bisect run, to be safe)
pointed the first bad commit to commit acdda3aae146 ("xfs: use
iomap_dio_rw").

I confirmed the bisect result by checking out a new branch with commit
acdda3aae146 as HEAD, xfs/104 would crash kernel within 100 iterations,
then reverting HEAD, xfs/104 passed 1500 iterations.

On one of my test vms, the crash happened as

[  340.419429] BUG: unable to handle kernel NULL pointer dereference at 
0102
[  340.420408] IP: __queue_work+0x32/0x420

and that IP points to

(gdb) l *(__queue_work+0x32)
0x9cf32 is in __queue_work (kernel/workqueue.c:1383).
1378            WARN_ON_ONCE(!irqs_disabled());
1379
1380    debug_work_activate(work);
1381
1382    /* if draining, only works from the same workqueue are allowed */
1383    if (unlikely(wq->flags & __WQ_DRAINING) &&
1384        WARN_ON_ONCE(!is_chained_work(wq)))
1385            return;
1386    retry:
1387    if (req_cpu == WORK_CPU_UNBOUND)

So looks like "wq" is null. The test vm is a kvm guest running 4.13
kernel with 4 vcpus and 8G memory.
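
For reference, the end_io path in the trace queues the dio completion
work roughly as below (paraphrased from fs/iomap.c around v4.13 - treat
it as an approximation and check the exact source in your tree).  The
workqueue handed to queue_work() is the superblock's s_dio_done_wq,
which is consistent with the NULL-ish wq dereferenced in __queue_work()
above.

	static void iomap_dio_bio_end_io(struct bio *bio)
	{
		struct iomap_dio *dio = bio->bi_private;

		if (atomic_dec_and_test(&dio->ref)) {
			if (is_sync_kiocb(dio->iocb)) {
				/* synchronous case: wake the submitter */
			} else if (dio->flags & IOMAP_DIO_WRITE) {
				struct inode *inode = file_inode(dio->iocb->ki_filp);

				INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
				queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
			} else {
				iomap_dio_complete_work(&dio->aio.work);
			}
		}
		/* page dirtying / bio_put() handling omitted */
	}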

If more information is needed please let me know.

Thanks,
Eryu

P.S. console log when crashing

[  339.746983] run fstests xfs/104 at 2017-09-13 17:38:26
[  340.027352] XFS (vda6): Unmounting Filesystem
[  340.207107] XFS (vda6): Mounting V5 Filesystem
[  340.217553] XFS (vda6): Ending clean mount
[  340.419429] BUG: unable to handle kernel NULL pointer dereference at 
0102
[  340.420408] IP: __queue_work+0x32/0x420
[  340.420408] PGD 215373067
[  340.420408] P4D 215373067
[  340.420408] PUD 21210d067
[  340.420408] PMD 0
[  340.420408]
[  340.420408] Oops:  [#1] SMP
[  340.420408] Modules linked in: xfs ip6t_rpfilter ipt_REJECT nf_reject_ipv4 
nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 
nf_defrag_ipv6 xt_conntrack nf_conntrack libcrc32c ip_set nfnetlink ebtable_nat 
ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw 
iptable_mangle iptable_security iptable_raw ebtable_filter ebtables 
ip6table_filter ip6_tables iptable_filter btrfs xor raid6_pq ppdev i2c_piix4 
parport_pc i2c_core parport virtio_balloon pcspkr nfsd auth_rpcgss nfs_acl 
lockd grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net 
virtio_blk ata_piix libata virtio_pci virtio_ring serio_raw virtio floppy
[  340.420408] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.13.0 #64
[  340.420408] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[  340.420408] task: 8b1d96222500 task.stack: b06bc0cb8000
[  340.420408] RIP: 0010:__queue_work+0x32/0x420
[  340.420408] RSP: 0018:8b1d9fd83d18 EFLAGS: 00010046
[  340.420408] RAX: 0096 RBX: 0002 RCX: 8b1d9489e6d8
[  340.420408] RDX: 8b1d903c2090 RSI:  RDI: 2000
[  340.420408] RBP: 8b1d9fd83d58 R08: 0400 R09: 0009
[  340.420408] R10: 8b1d9532b400 R11:  R12: 2000
[  340.420408] R13:  R14: 8b1d903c2090 R15: 7800
[  340.420408] FS:  () GS:8b1d9fd8() 
knlGS:
[  340.420408] CS:  0010 DS:  ES:  CR0: 80050033
[  340.420408] CR2: 0102 CR3: 0002152ce000 CR4: 06e0
[  340.420408] Call Trace:
[  340.420408]  
[  340.420408]  ? __slab_free+0x8e/0x260
[  340.420408]  queue_work_on+0x38/0x40
[  340.420408]  iomap_dio_bio_end_io+0x86/0x120
[  340.420408]  bio_endio+0x9f/0x120
[  340.420408]  blk_update_request+0xa8/0x2f0
[  340.420408]  blk_mq_end_request+0x1e/0x70
[  340.420408]  virtblk_request_done+0x22/0x30 [virtio_blk]
[  340.420408]  __blk_mq_complete_request+0x8f/0x140
[  340.420408]  blk_mq_complete_request+0x2a/0x30
[  340.420408]  virtblk_done+0x71/0x100 [virtio_blk]
[  340.420408]  vring_interrupt+0x34/0x80 [virtio_ring]
[  340.420408]  __handle_irq_event_percpu+0x7e/0x190
[  340.420408]  handle_irq_event_percpu+0x32/0x80
[  340.420408]  handle_irq_event+0x3b/0x60
[  340.420408]  handle_edge_irq+0x72/0x180
[  340.420408]  handle_irq+0x6f/0x110
[  340.420408]  do_IRQ+0x46/0xd0
[  340.420408]  common_interrupt+0x93/0x93
[  340.420408] RIP: 0010:native_safe_halt+0x6/0x10
[  340.420408] RSP: 0018:b06bc0cbbe70 EFLAGS: 0246 ORIG_RAX: 
ffae
[  340.420408] RAX:  RBX: 8b1d96222500 RCX: 
[  340.420408] RDX:  RSI:  RDI: 
[  340.420408] RBP: b06bc0cbbe70 R08: 

Re: [PATCH] block: cope with WRITE SAME failing in blkdev_issue_zeroout()

2017-09-13 Thread Ilya Dryomov
On Wed, Sep 6, 2017 at 7:38 PM, Ilya Dryomov  wrote:
> sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to
> permit trying WRITE SAME on older SCSI devices, unless ->no_write_same
> is set.  This means blkdev_issue_zeroout() must cope with WRITE SAME
> failing with BLK_STS_TARGET/-EREMOTEIO and explicitly write zeroes,
> unless BLKDEV_ZERO_NOFALLBACK is specified.
>
> Commit 71027e97d796 ("block: stop using discards for zeroing") added
> the following to __blkdev_issue_zeroout() comment:
>
>   "Note that this function may fail with -EOPNOTSUPP if the driver signals
>   zeroing offload support, but the device fails to process the command (for
>   some devices there is no non-destructive way to verify whether this
>   operation is actually supported).  In this case the caller should call
>   retry the call to blkdev_issue_zeroout() and the fallback path will be 
> used."
>
> But __blkdev_issue_zeroout() doesn't fail in this case: if WRITE SAME
> support is indicated, a REQ_OP_WRITE_ZEROES bio is built and returned
> to blkdev_issue_zeroout().  -EREMOTEIO is then propagated up:
>
>   $ fallocate -zn -l 1k /dev/sdg
>   fallocate: fallocate failed: Remote I/O error
>   $ fallocate -zn -l 1k /dev/sdg  # OK
>
> (The second fallocate(1) succeeds because sd_done() sets ->no_write_same
> in response to a sense that would become BLK_STS_TARGET.)
>
> Retry __blkdev_issue_zeroout() if the I/O fails with -EREMOTEIO.  This
> is roughly what we did until 4.12, sans BLKDEV_ZERO_NOFALLBACK knob.
>
> Fixes: c20cfc27a473 ("block: stop using blkdev_issue_write_same for zeroing")
> Cc: Christoph Hellwig 
> Cc: "Martin K. Petersen" 
> Cc: Hannes Reinecke 
> Cc: sta...@vger.kernel.org # 4.12+
> Signed-off-by: Ilya Dryomov 
> ---
>  block/blk-lib.c | 49 +
>  1 file changed, 33 insertions(+), 16 deletions(-)
>
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 3fe0aec90597..876b5478c1a2 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -287,12 +287,6 @@ static unsigned int 
> __blkdev_sectors_to_bio_pages(sector_t nr_sects)
>   *  Zero-fill a block range, either using hardware offload or by explicitly
>   *  writing zeroes to the device.
>   *
> - *  Note that this function may fail with -EOPNOTSUPP if the driver signals
> - *  zeroing offload support, but the device fails to process the command (for
> - *  some devices there is no non-destructive way to verify whether this
> - *  operation is actually supported).  In this case the caller should call
> - *  retry the call to blkdev_issue_zeroout() and the fallback path will be 
> used.
> - *
>   *  If a device is using logical block provisioning, the underlying space 
> will
>   *  not be released if %flags contains BLKDEV_ZERO_NOUNMAP.
>   *
> @@ -343,6 +337,34 @@ int __blkdev_issue_zeroout(struct block_device *bdev, 
> sector_t sector,
>  }
>  EXPORT_SYMBOL(__blkdev_issue_zeroout);
>
> +/*
> + * This function may fail with -EREMOTEIO if the driver signals zeroing
> + * offload support, but the device fails to process the command (for
> + * some devices there is no non-destructive way to verify whether this
> + * operation is actually supported).  In this case the caller should
> + * retry and the fallback path in __blkdev_issue_zeroout() will be used
> + * unless %flags contains BLKDEV_ZERO_NOFALLBACK.
> + */
> +static int __blkdev_issue_zeroout_wait(struct block_device *bdev,
> +  sector_t sector, sector_t nr_sects,
> +  gfp_t gfp_mask, unsigned flags)
> +{
> +   int ret;
> +   struct bio *bio = NULL;
> +   struct blk_plug plug;
> +
> +   blk_start_plug(&plug);
> +   ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
> +   &bio, flags);
> +   if (ret == 0 && bio) {
> +   ret = submit_bio_wait(bio);
> +   bio_put(bio);
> +   }
> +   blk_finish_plug(&plug);
> +
> +   return ret;
> +}
> +
>  /**
>   * blkdev_issue_zeroout - zero-fill a block range
>   * @bdev:  blockdev to write
> @@ -360,17 +382,12 @@ int blkdev_issue_zeroout(struct block_device *bdev, 
> sector_t sector,
> sector_t nr_sects, gfp_t gfp_mask, unsigned flags)
>  {
> int ret;
> -   struct bio *bio = NULL;
> -   struct blk_plug plug;
>
> -   blk_start_plug(&plug);
> -   ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
> -   &bio, flags);
> -   if (ret == 0 && bio) {
> -   ret = submit_bio_wait(bio);
> -   bio_put(bio);
> -   }
> -   blk_finish_plug(&plug);
> +   ret = __blkdev_issue_zeroout_wait(bdev, sector, nr_sects,
> + gfp_mask, flags);
> +   if (ret == -EREMOTEIO)
> +   ret = __blkdev_issue_zeroout_wait(bdev, sector, nr_sects,
> + 

Re: [PATCH 00/10] nvme multipath support on top of nvme-4.13 branch

2017-09-13 Thread Hannes Reinecke
On 09/12/2017 06:20 AM, Anish M Jhaveri wrote:
> Hello everyone,
> Please find patchset for supporting Multipathing using nvme namespace 
> and nvme controller. 
> 
> Basic idea behind the design and implementation is to re-use the same 
> logic which is between generic nvme namespace and it's controller. 
> One controller to many namespaces. Similarly, this implementation uses
> one multipath controller and sibling namespaces.
> It creates a head multipath nvme namespace as soon as it find a nvme
> namespace during enumeration which has property of being shared. Prior
> to creating new head multipath device, nvme namespace check for match-
> ing NGUID, if it finds one, this nvme namespace is added as a sibling
> to that given namespaces head device. 
> Failover is triggered either due to keep-alive timer expiration or RDMA
>  timeout. Selection is of device is based on first available standby
> device. As of present this implementation support Active-Standby multi-
> path model.
> On selection of device, a command is sent to target for acknowledgement 
> of active device. In meanwhile if any IO are received, they get queued 
> under head multipath device congestion queue and processed as soon as
> a single path becomes active. In scenario where new active path results
> in failure, it will cancel those IOs after multiple retries.
> 
> I have gone through the solutiion Christoph Hellwig suggested and this
> one follow similar path except for it doesn't require any change from 
> kernel and will work prior version of kernel.
> It can made standalone module and be used with other block devices as
> we open any block device handle.
> 
> It has been tested with interface up/down on host and target, target
> node crash and disconnect command. Performance has been tested with
> multiple multipath devices on single host and there is un-noticeable
>  difference in numbers. 
> 
In general I am _not_ in favour of this approach.

This is essentially the same level of multipath support we had in the
old qlogic and lpfc drivers in 2.4/2.6 series, and it took us _years_ to
get rid of this.
Main objection here is that it will be really hard (if not impossible)
to use the same approach for other subsystems (eg SCSI), so we'll end up
having different multipath implementations depending on which subsystem
is being used.
Which really is a maintenance nightmare.

I'm not averse to having other multipath implementations in the kernel,
but it really should be abstracted so that other subsystems can _try_ to
leverage it.

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)