RE: Boot regression (was "Re: [PATCH] genhd: Do not hold event lock when scheduling workqueue elements")

2017-02-08 Thread Dexuan Cui
> From: h...@lst.de [mailto:h...@lst.de]
> Sent: Thursday, February 9, 2017 02:03
> To: Jens Axboe 
> Cc: Dexuan Cui ; Bart Van Assche
> ; h...@suse.com; h...@suse.de; Martin K.
> Petersen ; h...@lst.de; linux-
> ker...@vger.kernel.org; linux-block@vger.kernel.org; j...@kernel.org
> Subject: Re: Boot regression (was "Re: [PATCH] genhd: Do not hold event lock
> when scheduling workqueue elements")
> 
> On Wed, Feb 08, 2017 at 10:43:59AM -0700, Jens Axboe wrote:
> > I've changed the subject line, this issue has nothing to do with the
> > issue that Hannes was attempting to fix.
> 
> Nothing really useful in the thread.  Dexuan, can you throw in some
> prints to see which command times out?

My colleagues have sent you the logs in another thread.

Thanks for looking into this!

-- Dexuan


Re: [PATCHv6 03/37] page-flags: relax page flag policy for a few flags

2017-02-08 Thread Matthew Wilcox
On Thu, Jan 26, 2017 at 02:57:45PM +0300, Kirill A. Shutemov wrote:
> These flags are in use for filesystems with backing storage: PG_error,
> PG_writeback and PG_readahead.

Oh ;-)  Then I amend my comment on patch 1 to be "patch 3 needs to go
ahead of patch 1" ;-)

> Signed-off-by: Kirill A. Shutemov 
> ---
>  include/linux/page-flags.h | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 6b5818d6de32..b85b73cfb1b3 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -263,7 +263,7 @@ static inline int TestClearPage##uname(struct page *page) 
> { return 0; }
>  
>  __PAGEFLAG(Locked, locked, PF_NO_TAIL)
>  PAGEFLAG(Waiters, waiters, PF_ONLY_HEAD) __CLEARPAGEFLAG(Waiters, waiters, 
> PF_ONLY_HEAD)
> -PAGEFLAG(Error, error, PF_NO_COMPOUND) TESTCLEARFLAG(Error, error, 
> PF_NO_COMPOUND)
> +PAGEFLAG(Error, error, PF_NO_TAIL) TESTCLEARFLAG(Error, error, PF_NO_TAIL)
>  PAGEFLAG(Referenced, referenced, PF_HEAD)
>   TESTCLEARFLAG(Referenced, referenced, PF_HEAD)
>   __SETPAGEFLAG(Referenced, referenced, PF_HEAD)
> @@ -303,15 +303,15 @@ PAGEFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
>   * Only test-and-set exist for PG_writeback.  The unconditional operators are
>   * risky: they bypass page accounting.
>   */
> -TESTPAGEFLAG(Writeback, writeback, PF_NO_COMPOUND)
> - TESTSCFLAG(Writeback, writeback, PF_NO_COMPOUND)
> +TESTPAGEFLAG(Writeback, writeback, PF_NO_TAIL)
> + TESTSCFLAG(Writeback, writeback, PF_NO_TAIL)
>  PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
>  
>  /* PG_readahead is only used for reads; PG_reclaim is only for writes */
>  PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
>   TESTCLEARFLAG(Reclaim, reclaim, PF_NO_TAIL)
> -PAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
> - TESTCLEARFLAG(Readahead, reclaim, PF_NO_COMPOUND)
> +PAGEFLAG(Readahead, reclaim, PF_NO_TAIL)
> + TESTCLEARFLAG(Readahead, reclaim, PF_NO_TAIL)
>  
>  #ifdef CONFIG_HIGHMEM
>  /*
> -- 
> 2.11.0
> 


Re: [PATCHv6 01/37] mm, shmem: switch huge tmpfs to multi-order radix-tree entries

2017-02-08 Thread Matthew Wilcox
On Thu, Jan 26, 2017 at 02:57:43PM +0300, Kirill A. Shutemov wrote:
> +++ b/include/linux/pagemap.h
> @@ -332,6 +332,15 @@ static inline struct page *grab_cache_page_nowait(struct 
> address_space *mapping,
>   mapping_gfp_mask(mapping));
>  }
>  
> +static inline struct page *find_subpage(struct page *page, pgoff_t offset)
> +{
> + VM_BUG_ON_PAGE(PageTail(page), page);
> + VM_BUG_ON_PAGE(page->index > offset, page);
> + VM_BUG_ON_PAGE(page->index + (1 << compound_order(page)) < offset,
> + page);
> + return page - page->index + offset;
> +}

What would you think to:

static inline void check_page_index(struct page *page, pgoff_t offset)
{
        VM_BUG_ON_PAGE(PageTail(page), page);
        VM_BUG_ON_PAGE(page->index > offset, page);
        VM_BUG_ON_PAGE(page->index + (1 << compound_order(page)) <= offset,
                       page);
}

(I think I fixed an off-by-one error up there ...  if
index + (1 << order) == offset, this is also a bug, right?
because offset would then refer to the next page, not this page)

static inline struct page *find_subpage(struct page *page, pgoff_t offset)
{
        check_page_index(page, offset);
        return page + (offset - page->index);
}
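
For a quick sanity check of the off-by-one reasoning above, here is a small,
self-contained user-space sketch; the plain integers stand in for page->index
and compound_order(), so it is illustration only, not kernel code.

#include <assert.h>
#include <stdio.h>

/*
 * A head page at index 'head_index' with compound order 'order' covers the
 * offsets [head_index, head_index + (1 << order)); the value
 * head_index + (1 << order) already belongs to the next page, which is why
 * the suggested VM_BUG_ON uses '<=' rather than '<'.
 */
static int offset_in_page_range(unsigned long head_index, unsigned int order,
                                unsigned long offset)
{
        return offset >= head_index && offset < head_index + (1UL << order);
}

int main(void)
{
        /* head page at index 64, order 2 => subpages at offsets 64..67 */
        assert(offset_in_page_range(64, 2, 64));
        assert(offset_in_page_range(64, 2, 67));
        assert(!offset_in_page_range(64, 2, 68));       /* the off-by-one case */
        printf("bounds check behaves as expected\n");
        return 0;
}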

... then you can use check_page_index down ...

> @@ -1250,7 +1233,6 @@ struct page *find_lock_entry(struct address_space 
> *mapping, pgoff_t offset)
>   put_page(page);
>   goto repeat;
>   }
> - VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);

... here?

> @@ -1472,25 +1451,35 @@ unsigned find_get_pages(struct address_space 
> *mapping, pgoff_t start,
>   goto repeat;
>   }
>  
> + /* For multi-order entries, find relevant subpage */
> + if (PageTransHuge(page)) {
> + VM_BUG_ON(index - page->index < 0);
> + VM_BUG_ON(index - page->index >= HPAGE_PMD_NR);
> + page += index - page->index;
> + }

Use find_subpage() here?

>   pages[ret] = page;
>   if (++ret == nr_pages)
>   break;
> + if (!PageTransCompound(page))
> + continue;
> + for (refs = 0; ret < nr_pages &&
> + (index + 1) % HPAGE_PMD_NR;
> + ret++, refs++, index++)
> + pages[ret] = ++page;
> + if (refs)
> + page_ref_add(compound_head(page), refs);
> + if (ret == nr_pages)
> + break;

Can we avoid referencing huge pages specifically in the page cache?  I'd
like us to get to the point where we can put arbitrary compound pages into
the page cache.  For example, I think this can be written as:

        if (!PageCompound(page))
                continue;
        for (refs = 0; ret < nr_pages; refs++, index++) {
                if (index > page->index + (1 << compound_order(page)))
                        break;
                pages[ret++] = ++page;
        }
        if (refs)
                page_ref_add(compound_head(page), refs);
        if (ret == nr_pages)
                break;

> @@ -1541,19 +1533,12 @@ unsigned find_get_pages_contig(struct address_space 
> *mapping, pgoff_t index,
>  
> + /* For multi-order entries, find relevant subpage */
> + if (PageTransHuge(page)) {
> + VM_BUG_ON(index - page->index < 0);
> + VM_BUG_ON(index - page->index >= HPAGE_PMD_NR);
> + page += index - page->index;
> + }
> +
>   pages[ret] = page;
>   if (++ret == nr_pages)
>   break;
> + if (!PageTransCompound(page))
> + continue;
> + for (refs = 0; ret < nr_pages &&
> + (index + 1) % HPAGE_PMD_NR;
> + ret++, refs++, index++)
> + pages[ret] = ++page;
> + if (refs)
> + page_ref_add(compound_head(page), refs);
> + if (ret == nr_pages)
> + break;
>   }
>   rcu_read_unlock();
>   return ret;

Ugh, the same code again.  Hmm ... we only need to modify 'ret' as a result
of this ... so could we split it out like this?

static unsigned populate_pages(struct page **pages, unsigned i, unsigned max,
                               struct page *page)
{
        unsigned refs = 0;

        for (;;) {
                pages[i++] = page;
                if (i == max)
                        break;
                if (PageHead(page + 1))
                        break;
                page++;
                refs++;
        }
        if (refs)
                page_ref_add(compound_head(page), refs);
        return i;
}
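
For what it's worth, here is a self-contained user-space analog of the helper
sketched above, just to show the intended control flow; struct fake_page and
its fields are invented stand-ins for struct page, PageHead() and the compound
head's reference count.

#include <stdio.h>

struct fake_page {
        int is_head;    /* stands in for PageHead() */
        int refcount;   /* stands in for the compound head's _refcount */
};

/* Fill 'out' with consecutive subpages until 'max' or the next head page. */
static unsigned populate_fake_pages(struct fake_page **out, unsigned i,
                                    unsigned max, struct fake_page *page,
                                    struct fake_page *head)
{
        unsigned refs = 0;

        for (;;) {
                out[i++] = page;
                if (i == max)
                        break;
                if ((page + 1)->is_head)        /* next page starts a new compound */
                        break;
                page++;
                refs++;
        }
        if (refs)
                head->refcount += refs; /* one extra reference per extra subpage */
        return i;
}

int main(void)
{
        /* one order-2 compound page (head plus 3 tails) followed by a new head */
        struct fake_page mem[5] = { {1, 1}, {0, 0}, {0, 0}, {0, 0}, {1, 1} };
        struct fake_page *out[8];
        unsigned n = populate_fake_pages(out, 0, 8, &mem[0], &mem[0]);

        printf("filled %u slots, head refcount is now %d\n", n, mem[0].refcount);
        return 0;
}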


Re: [PATCH] block: sed-opal: reduce stack size of ioctl handler

2017-02-08 Thread Scott Bauer
On Wed, Feb 08, 2017 at 02:58:28PM -0700, Scott Bauer wrote:
> On Wed, Feb 08, 2017 at 10:15:28PM +0100, Arnd Bergmann wrote:
> > When CONFIG_KASAN is in use, the sed_ioctl function uses an unusually large
> > stack, as each possible ioctl argument gets its own stack area plus redzone:
> > 
> > block/sed-opal.c: In function 'sed_ioctl':
> > block/sed-opal.c:2447:1: error: the frame size of 2256 bytes is larger than 
> > 2048 bytes [-Werror=frame-larger-than=]
> > 
> > Moving the copy_from_user() calls into the individual functions has little
> > effect on readability, but significantly reduces the stack size, with the
> > largest individual function (opal_enable_disable_shadow_mbr) now at a
> > reasonable 456 bytes.
> > 
> > Fixes: 455a7b238cd6 ("block: Add Sed-opal library")
> > Signed-off-by: Arnd Bergmann 
> 
> 
> Hi Arnd,
> 
> Thank you for the report. We want to keep the function calls agnostic to
> userland. In the future we will have in-kernel callers and I don't want to
> have to do any get_fs(KERNEL_DS) wizardry.
> 
> Instead I think we can use a union to lessen the stack burden. I tested this
> patch just now with CONFIG_KASAN and was able to build.

Nack on this patch; it only masks the issue. Keith pointed out that we have a
call chain from this ioctl down into NVMe and then into the block layer. If we
use 25% of the stack just for this function, it's still too dangerous: we'll
run into corruption later on and won't remember this fix. I'll come up with
another solution.
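
Even though the union approach was nacked above in favor of a different
solution, here is a minimal, stand-alone user-space sketch of the stack-size
argument behind it.  All struct names and sizes are invented for illustration;
they are not the real sed-opal types.

#include <stdio.h>

/* Invented stand-ins for the various sed-opal ioctl argument structs. */
struct fake_key      { char key[256]; };
struct fake_lk_unlk  { struct fake_key key; int state; };
struct fake_mbr_data { struct fake_key key; int enable; };

/* One overlapping slot instead of one local (plus KASAN redzone) per struct. */
union fake_ioctl_arg {
        struct fake_key key;
        struct fake_lk_unlk lk_unlk;
        struct fake_mbr_data mbr;
};

int main(void)
{
        printf("worst case with one local per argument: %zu bytes\n",
               sizeof(struct fake_key) + sizeof(struct fake_lk_unlk) +
               sizeof(struct fake_mbr_data));
        printf("with a union, at most: %zu bytes\n",
               sizeof(union fake_ioctl_arg));
        return 0;
}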


Re: Boot regression (was "Re: [PATCH] genhd: Do not hold event lock when scheduling workqueue elements")

2017-02-08 Thread h...@lst.de
On Wed, Feb 08, 2017 at 10:43:59AM -0700, Jens Axboe wrote:
> I've changed the subject line, this issue has nothing to do with the
> issue that Hannes was attempting to fix.

Nothing really useful in the thread.  Dexuan, can you throw in some
prints to see which command times out?


Boot regression (was "Re: [PATCH] genhd: Do not hold event lock when scheduling workqueue elements")

2017-02-08 Thread Jens Axboe
On 02/08/2017 03:48 AM, Dexuan Cui wrote:
>> From: Jens Axboe [mailto:ax...@kernel.dk]
>> Sent: Wednesday, February 8, 2017 00:09
>> To: Dexuan Cui ; Bart Van Assche
>> ; h...@suse.com; h...@suse.de
>> Cc: h...@lst.de; linux-ker...@vger.kernel.org; linux-block@vger.kernel.org;
>> j...@kernel.org
>> Subject: Re: [PATCH] genhd: Do not hold event lock when scheduling workqueue
>> elements
>>
>> On 02/06/2017 11:29 PM, Dexuan Cui wrote:
 From: linux-block-ow...@vger.kernel.org [mailto:linux-block-
 ow...@vger.kernel.org] On Behalf Of Dexuan Cui
 with the linux-next kernel.

 I can boot the guest with linux-next's next-20170130 without any issue,
 but since next-20170131 I haven't succeeded in booting the guest.

 With next-20170203 (mentioned in my mail last Friday), I got the same
 calltrace as Hannes.

 With today's linux-next (next-20170206), actually the calltrace changed to
 the below.
 [  122.023036]  ? remove_wait_queue+0x70/0x70
 [  122.051383]  async_synchronize_full+0x17/0x20
 [  122.076925]  do_init_module+0xc1/0x1f9
 [  122.097530]  load_module+0x24bc/0x2980
>>>
>>> I don't know why it hangs here, but this is the same calltrace in my
>>> last-Friday mail, which contains 2 calltraces. It looks the other calltrace 
>>> has
>>> been resolved by some changes between next-20170203 and today.
>>>
>>> Here the kernel is trying to load the Hyper-V storage driver (hv_storvsc), 
>>> and
>>> the driver's __init and .probe have finished successfully and then the 
>>> kernel
>>> hangs here.
>>>
>>> I believe something is broken recently, because I don't have any issue 
>>> before
>>> Jan 31.
>>
>> Can you try and bisect it?
>>
>> Jens Axboe
> 
> I bisected it on the branch for-4.11/next of the linux-block repo and the log 
> shows
> the first bad commit is 
> [e9c787e6] scsi: allocate scsi_cmnd structures as part of struct request
> 
> # git bisect log
> git bisect start
> # bad: [80c6b15732f0d8830032149cbcbc8d67e074b5e8] blk-mq-sched: (un)register 
> elevator when (un)registering queue
> git bisect bad 80c6b15732f0d8830032149cbcbc8d67e074b5e8
> # good: [309bd96af9e26da3038661bf5cdad780eef49dd9] md: cleanup bio op / flags 
> handling in raid1_write_request
> git bisect good 309bd96af9e26da3038661bf5cdad780eef49dd9
> # bad: [27410a8927fb89bd150de08d749a8ed7f67b7739] nbd: remove 
> REQ_TYPE_DRV_PRIV leftovers
> git bisect bad 27410a8927fb89bd150de08d749a8ed7f67b7739
> # bad: [e9c787e65c0c36529745be47d490d998b4b6e589] scsi: allocate scsi_cmnd 
> structures as part of struct request
> git bisect bad e9c787e65c0c36529745be47d490d998b4b6e589
> # good: [3278255741326b6d66d8ca7d1cb2c57633ee43d9] scsi_dh_rdac: switch to 
> scsi_execute_req_flags()
> git bisect good 3278255741326b6d66d8ca7d1cb2c57633ee43d9
> # good: [0fbc3e0ff623f1012e7c2af96e781eeb26bcc0d7] scsi: remove gfp_flags 
> member in scsi_host_cmd_pool
> git bisect good 0fbc3e0ff623f1012e7c2af96e781eeb26bcc0d7
> # good: [eeff68c5618c8d0920b14533c70b2df007bd94b4] scsi: remove 
> scsi_cmd_dma_pool
> git bisect good eeff68c5618c8d0920b14533c70b2df007bd94b4
> # good: [d48777a633d6fa7ccde0f0e6509f0c01fbfc5299] scsi: remove 
> __scsi_alloc_queue
> git bisect good d48777a633d6fa7ccde0f0e6509f0c01fbfc5299
> # first bad commit: [e9c787e65c0c36529745be47d490d998b4b6e589] scsi: allocate 
> scsi_cmnd structures as part of struct request

Christoph?

I've changed the subject line, this issue has nothing to do with the
issue that Hannes was attempting to fix.

-- 
Jens Axboe



Re: [PATCH] bfq-mq: cause deadlock by executing exit_icq body immediately

2017-02-08 Thread Omar Sandoval
On Wed, Feb 08, 2017 at 11:39:24AM +0100, Paolo Valente wrote:
> 
> > Il giorno 08 feb 2017, alle ore 11:33, Omar Sandoval  
> > ha scritto:
> > 
> > On Wed, Feb 08, 2017 at 11:03:01AM +0100, Paolo Valente wrote:
> >> 
> >>> Il giorno 07 feb 2017, alle ore 22:45, Omar Sandoval 
> >>>  ha scritto:
> >>> 
> >>> On Tue, Feb 07, 2017 at 06:33:46PM +0100, Paolo Valente wrote:
>  Hi,
>  this patch is meant to show that, if the  body of the hook exit_icq is 
>  executed
>  from inside that hook, and not as deferred work, then a circular deadlock
>  occurs.
>  
>  It happens if, on a CPU
>  - the body of icq_exit takes the scheduler lock,
>  - it does so from inside the exit_icq hook, which is invoked with the 
>  queue
>  lock held
>  
>  while, on another CPU
>  - bfq_bio_merge, after taking the scheduler lock, invokes bfq_bic_lookup,
>  which, in its turn, takes the queue lock. bfq_bic_lookup needs to take 
>  such a
>  lock, because it invokes ioc_lookup_icq.
>  
>  For more details, here is a lockdep report, right before the deadlock 
>  did occur.
>  
>  [   44.059877] ==
>  [   44.124922] [ INFO: possible circular locking dependency detected ]
>  [   44.125795] 4.10.0-rc5-bfq-mq+ #38 Not tainted
>  [   44.126414] ---
>  [   44.127291] sync/2043 is trying to acquire lock:
>  [   44.128918]  (&(>lock)->rlock){-.-...}, at: 
>  [] bfq_exit_icq_bfqq+0x55/0x140
>  [   44.134052]
>  [   44.134052] but task is already holding lock:
>  [   44.134868]  (&(>__queue_lock)->rlock){-.}, at: 
>  [] put_io_context_active+0x6e/0xc0
> >>> 
> >>> Hey, Paolo,
> >>> 
> >>> I only briefly skimmed the code, but what are you using the queue_lock
> >>> for? You should just use your scheduler lock everywhere. blk-mq doesn't
> >>> use the queue lock, so the scheduler is the only thing you need mutual
> >>> exclusion against.
> >> 
> >> Hi Omar,
> >> the cause of the problem is that the hook functions bfq_request_merge
> >> and bfq_allow_bio_merge invoke, directly or through other functions,
> >> the function bfq_bic_lookup, which, in its turn, invokes
> >> ioc_lookup_icq.  The latter must be invoked with the queue lock held.
> >> In particular the offending lines in bfq_bic_lookup are:
> >> 
> >>spin_lock_irqsave(q->queue_lock, flags);
> >>icq = icq_to_bic(ioc_lookup_icq(ioc, q));
> >>spin_unlock_irqrestore(q->queue_lock, flags);
> >> 
> >> Maybe I'm missing something and we can avoid taking this lock?
> > 
> > Ah, I didn't realize we still used the q->queue_lock for the icq stuff.
> > You're right, you still need that lock for ioc_lookup_icq(). Unless
> > there's something else I'm forgetting, that should be the only thing you
> > need it for in the core code, and you should use your scheduler lock for
> > everything else. What else are you using q->queue_lock for? 
> 
> Nothing.  The deadlock follows from the fact that bfq_request_merge gets called
> with the scheduler lock already held.  Problematic paths start from:
> bfq_bio_merge and bfq_insert_request.
> 
> I'm trying to understand whether I/we can reorder operations in some
> way that avoids the nested locking, but to no avail so far.
> 
> Thanks,
> Paolo

Okay, I understand what you're saying now. It was all in the first email
but I didn't see it right away, sorry about that.

I don't think it makes sense for ->exit_icq() to be invoked while
holding q->queue_lock for blk-mq -- we don't hold that lock for any of
the other hooks. Could you try the below? I haven't convinced myself
that there isn't a circular locking dependency between bfqd->lock and
ioc->lock now, but it's probably easiest for you to just try it.

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index fe186a9eade9..b12f9c87b4c3 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -35,7 +35,10 @@ static void icq_free_icq_rcu(struct rcu_head *head)
kmem_cache_free(icq->__rcu_icq_cache, icq);
 }
 
-/* Exit an icq. Called with both ioc and q locked. */
+/*
+ * Exit an icq. Called with both ioc and q locked for sq, only ioc locked for
+ * mq.
+ */
 static void ioc_exit_icq(struct io_cq *icq)
 {
struct elevator_type *et = icq->q->elevator->type;
@@ -166,6 +169,7 @@ EXPORT_SYMBOL(put_io_context);
  */
 void put_io_context_active(struct io_context *ioc)
 {
+   struct elevator_type *et;
unsigned long flags;
struct io_cq *icq;
 
@@ -184,13 +188,19 @@ void put_io_context_active(struct io_context *ioc)
hlist_for_each_entry(icq, >icq_list, ioc_node) {
if (icq->flags & ICQ_EXITED)
continue;
-   if (spin_trylock(icq->q->queue_lock)) {
+
+   et = icq->q->elevator->type;
+   if (et->uses_mq) {
 

Re: [PATCH 08/24] btrfs: Convert to separately allocated bdi

2017-02-08 Thread David Sterba
On Thu, Feb 02, 2017 at 06:34:06PM +0100, Jan Kara wrote:
> Allocate struct backing_dev_info separately instead of embedding it
> inside superblock. This unifies handling of bdi among users.
> 
> CC: Chris Mason 
> CC: Josef Bacik 
> CC: David Sterba 
> CC: linux-bt...@vger.kernel.org
> Signed-off-by: Jan Kara 

Reviewed-by: David Sterba 


[PATCH 3/4] block: optionally merge discontiguous discard bios into a single request

2017-02-08 Thread Christoph Hellwig
Add a new merge strategy that merges discard bios into a request until the
maximum number of discard ranges (or the maximum discard size) is reached
from the plug merging code.  I/O scheduler merging is not wired up yet
but might also be useful, although not for fast devices like NVMe which
are the only user for now.

Note that for now we don't support limiting the size of each discard range,
but if needed that can be added later.

Signed-off-by: Christoph Hellwig 
---
 block/blk-core.c | 27 +++
 block/blk-merge.c|  5 -
 block/blk-mq.c   |  3 +++
 block/blk-settings.c | 20 
 block/blk-sysfs.c| 12 
 block/blk.h  |  2 ++
 include/linux/blkdev.h   | 26 ++
 include/linux/elevator.h |  1 +
 8 files changed, 95 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 75fe534861df..b9e857f4afe8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1485,6 +1485,30 @@ bool bio_attempt_front_merge(struct request_queue *q, 
struct request *req,
return true;
 }
 
+bool bio_attempt_discard_merge(struct request_queue *q, struct request *req,
+   struct bio *bio)
+{
+   unsigned short segments = blk_rq_nr_discard_segments(req);
+
+   if (segments >= queue_max_discard_segments(q))
+   goto no_merge;
+   if (blk_rq_sectors(req) + bio_sectors(bio) >
+   blk_rq_get_max_sectors(req, blk_rq_pos(req)))
+   goto no_merge;
+
+   req->biotail->bi_next = bio;
+   req->biotail = bio;
+   req->__data_len += bio->bi_iter.bi_size;
+   req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));
+   req->nr_phys_segments = segments + 1;
+
+   blk_account_io_start(req, false);
+   return true;
+no_merge:
+   req_set_nomerge(q, req);
+   return false;
+}
+
 /**
  * blk_attempt_plug_merge - try to merge with %current's plugged list
  * @q: request_queue new bio is being queued at
@@ -1549,6 +1573,9 @@ bool blk_attempt_plug_merge(struct request_queue *q, 
struct bio *bio,
case ELEVATOR_FRONT_MERGE:
merged = bio_attempt_front_merge(q, rq, bio);
break;
+   case ELEVATOR_DISCARD_MERGE:
+   merged = bio_attempt_discard_merge(q, rq, bio);
+   break;
default:
break;
}
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 6cbd90ad5f90..2afa262425d1 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -803,7 +803,10 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
 
 enum elv_merge blk_try_merge(struct request *rq, struct bio *bio)
 {
-   if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
+   if (req_op(rq) == REQ_OP_DISCARD &&
+   queue_max_discard_segments(rq->q) > 1)
+   return ELEVATOR_DISCARD_MERGE;
+   else if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
return ELEVATOR_BACK_MERGE;
else if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
return ELEVATOR_FRONT_MERGE;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9294633759d9..89cb2d224488 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -780,6 +780,9 @@ static bool blk_mq_attempt_merge(struct request_queue *q,
if (blk_mq_sched_allow_merge(q, rq, bio))
merged = bio_attempt_front_merge(q, rq, bio);
break;
+   case ELEVATOR_DISCARD_MERGE:
+   merged = bio_attempt_discard_merge(q, rq, bio);
+   break;
default:
continue;
}
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 6eb19bcbf3cb..1e7174ffc9d4 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -88,6 +88,7 @@ EXPORT_SYMBOL_GPL(blk_queue_lld_busy);
 void blk_set_default_limits(struct queue_limits *lim)
 {
lim->max_segments = BLK_MAX_SEGMENTS;
+   lim->max_discard_segments = 1;
lim->max_integrity_segments = 0;
lim->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK;
lim->virt_boundary_mask = 0;
@@ -128,6 +129,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
/* Inherit limits from component devices */
lim->discard_zeroes_data = 1;
lim->max_segments = USHRT_MAX;
+   lim->max_discard_segments = 1;
lim->max_hw_sectors = UINT_MAX;
lim->max_segment_size = UINT_MAX;
lim->max_sectors = UINT_MAX;
@@ -337,6 +339,22 @@ void blk_queue_max_segments(struct request_queue *q, 
unsigned short max_segments
 EXPORT_SYMBOL(blk_queue_max_segments);
 
 /**
+ * blk_queue_max_discard_segments - set max segments for discard requests
+ * @q:  the request queue for the device
+ * @max_segments:  

support for multi-range discard requests V3

2017-02-08 Thread Christoph Hellwig
Hi all,

this series adds support for merging discontiguous discard bios into a
single request if the driver supports it.  This reduces the number of
discards sent to the device by about a factor of 5-6 for typical
workloads on NVMe, and for slower devices that use I/O scheduling
the number will probably be even bigger but I've not implemented
that support yet.

Changes since V2:
  - really fix nr of NVMe ranges to 256
  - free range on error

Changes since V1:
  - hardcoded max ranges for NVMe to 255
  - minor cleanups


[PATCH 1/4] block: move req_set_nomerge to blk.h

2017-02-08 Thread Christoph Hellwig
This makes it available outside of blk-merge.c, and inlining such a trivial
helper seems pretty useful to start with.

Signed-off-by: Christoph Hellwig 
---
 block/blk-merge.c | 7 ---
 block/blk.h   | 7 +++
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index a373416dbc9a..c956d9e7aafd 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -482,13 +482,6 @@ int blk_rq_map_sg(struct request_queue *q, struct request 
*rq,
 }
 EXPORT_SYMBOL(blk_rq_map_sg);
 
-static void req_set_nomerge(struct request_queue *q, struct request *req)
-{
-   req->cmd_flags |= REQ_NOMERGE;
-   if (req == q->last_merge)
-   q->last_merge = NULL;
-}
-
 static inline int ll_new_hw_segment(struct request_queue *q,
struct request *req,
struct bio *bio)
diff --git a/block/blk.h b/block/blk.h
index 4972b98d47e1..3e08703902a9 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -256,6 +256,13 @@ static inline int blk_do_io_stat(struct request *rq)
!blk_rq_is_passthrough(rq);
 }
 
+static inline void req_set_nomerge(struct request_queue *q, struct request 
*req)
+{
+   req->cmd_flags |= REQ_NOMERGE;
+   if (req == q->last_merge)
+   q->last_merge = NULL;
+}
+
 /*
  * Internal io_context interface
  */
-- 
2.11.0



Re: [PATCH 3/4] block: optionally merge discontiguous discard bios into a single request

2017-02-08 Thread Christoph Hellwig
On Wed, Feb 08, 2017 at 06:54:24PM +0800, Ming Lei wrote:
> > +   struct bio *bio)
> > +{
> > +   unsigned short segments = blk_rq_nr_discard_segments(req);
> > +
> > +   if (segments >= queue_max_discard_segments(q))
> > +   goto no_merge;
> > +   if (blk_rq_sectors(req) + bio_sectors(bio) >
> > +   blk_rq_get_max_sectors(req, blk_rq_pos(req)))
> > +   goto no_merge;
> > +
> > +   req->biotail->bi_next = bio;
> > +   req->biotail = bio;
> > +   req->__data_len += bio->bi_iter.bi_size;
> 
> The type of __data_len is unsigned int, and it could easily overflow
> when merging discard requests.

We respect the max_discard_sectors setting which is unsigned as well
above in blk_rq_get_max_sectors(), so we can't actually overflow here.

And please remove the fullquote after your two line comment, thanks!


[PATCH 4/4] nvme: support ranged discard requests

2017-02-08 Thread Christoph Hellwig
NVMe supports up to 256 ranges per DSM command, so wire up support
for ranged discards up to that limit.

Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/host/core.c | 30 +++---
 include/linux/nvme.h |  2 ++
 2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index efe8ec300126..d9f49036819c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -238,26 +238,38 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
 static inline int nvme_setup_discard(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd)
 {
+   unsigned short segments = blk_rq_nr_discard_segments(req), n = 0;
struct nvme_dsm_range *range;
-   unsigned int nr_bytes = blk_rq_bytes(req);
+   struct bio *bio;
 
-   range = kmalloc(sizeof(*range), GFP_ATOMIC);
+   range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
if (!range)
return BLK_MQ_RQ_QUEUE_BUSY;
 
-   range->cattr = cpu_to_le32(0);
-   range->nlb = cpu_to_le32(nr_bytes >> ns->lba_shift);
-   range->slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
+   __rq_for_each_bio(bio, req) {
+   u64 slba = nvme_block_nr(ns, bio->bi_iter.bi_sector);
+   u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
+
+   range[n].cattr = cpu_to_le32(0);
+   range[n].nlb = cpu_to_le32(nlb);
+   range[n].slba = cpu_to_le64(slba);
+   n++;
+   }
+
+   if (WARN_ON_ONCE(n != segments)) {
+   kfree(range);
+   return BLK_MQ_RQ_QUEUE_ERROR;
+   }
 
memset(cmnd, 0, sizeof(*cmnd));
cmnd->dsm.opcode = nvme_cmd_dsm;
cmnd->dsm.nsid = cpu_to_le32(ns->ns_id);
-   cmnd->dsm.nr = 0;
+   cmnd->dsm.nr = segments - 1;
cmnd->dsm.attributes = cpu_to_le32(NVME_DSMGMT_AD);
 
req->special_vec.bv_page = virt_to_page(range);
req->special_vec.bv_offset = offset_in_page(range);
-   req->special_vec.bv_len = sizeof(*range);
+   req->special_vec.bv_len = sizeof(*range) * segments;
req->rq_flags |= RQF_SPECIAL_PAYLOAD;
 
return BLK_MQ_RQ_QUEUE_OK;
@@ -877,6 +889,9 @@ static void nvme_config_discard(struct nvme_ns *ns)
struct nvme_ctrl *ctrl = ns->ctrl;
u32 logical_block_size = queue_logical_block_size(ns->queue);
 
+   BUILD_BUG_ON(PAGE_SIZE / sizeof(struct nvme_dsm_range) <
+   NVME_DSM_MAX_RANGES);
+
if (ctrl->quirks & NVME_QUIRK_DISCARD_ZEROES)
ns->queue->limits.discard_zeroes_data = 1;
else
@@ -885,6 +900,7 @@ static void nvme_config_discard(struct nvme_ns *ns)
ns->queue->limits.discard_alignment = logical_block_size;
ns->queue->limits.discard_granularity = logical_block_size;
blk_queue_max_discard_sectors(ns->queue, UINT_MAX);
+   blk_queue_max_discard_segments(ns->queue, NVME_DSM_MAX_RANGES);
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, ns->queue);
 }
 
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 3d1c6f1b15c9..3e2ed49c3ad8 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -553,6 +553,8 @@ enum {
NVME_DSMGMT_AD  = 1 << 2,
 };
 
+#define NVME_DSM_MAX_RANGES256
+
 struct nvme_dsm_range {
__le32  cattr;
__le32  nlb;
-- 
2.11.0



Re: [PATCH 4/4] nvme: support ranged discard requests

2017-02-08 Thread Christoph Hellwig
On Wed, Feb 08, 2017 at 01:32:22PM +, Andrey Kuzmin wrote:
> On Wed, Feb 8, 2017, 16:14 Christoph Hellwig  wrote:
> 
> > NVMe supports up to 255 ranges per DSM command,
> 
> 
> 256 per spec, with number of ranges being zero-based.

You're right.  I'll do another round of the patch.  That being
said, in my tests I never even managed to get three-digit ranges
with any workload.


[PATCH 4/4] nvme: support ranged discard requests

2017-02-08 Thread Christoph Hellwig
NVMe supports up to 255 ranges per DSM command, so wire up support
for ranged discards up to that limit.

Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/host/core.c | 28 +---
 include/linux/nvme.h |  2 ++
 2 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index efe8ec300126..3fb25d1d0512 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -238,26 +238,36 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
 static inline int nvme_setup_discard(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd)
 {
+   unsigned short segments = blk_rq_nr_discard_segments(req), n = 0;
struct nvme_dsm_range *range;
-   unsigned int nr_bytes = blk_rq_bytes(req);
+   struct bio *bio;
 
-   range = kmalloc(sizeof(*range), GFP_ATOMIC);
+   range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
if (!range)
return BLK_MQ_RQ_QUEUE_BUSY;
 
-   range->cattr = cpu_to_le32(0);
-   range->nlb = cpu_to_le32(nr_bytes >> ns->lba_shift);
-   range->slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
+   __rq_for_each_bio(bio, req) {
+   u64 slba = nvme_block_nr(ns, bio->bi_iter.bi_sector);
+   u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
+
+   range[n].cattr = cpu_to_le32(0);
+   range[n].nlb = cpu_to_le32(nlb);
+   range[n].slba = cpu_to_le64(slba);
+   n++;
+   }
+
+   if (WARN_ON_ONCE(n != segments))
+   return BLK_MQ_RQ_QUEUE_ERROR;
 
memset(cmnd, 0, sizeof(*cmnd));
cmnd->dsm.opcode = nvme_cmd_dsm;
cmnd->dsm.nsid = cpu_to_le32(ns->ns_id);
-   cmnd->dsm.nr = 0;
+   cmnd->dsm.nr = segments - 1;
cmnd->dsm.attributes = cpu_to_le32(NVME_DSMGMT_AD);
 
req->special_vec.bv_page = virt_to_page(range);
req->special_vec.bv_offset = offset_in_page(range);
-   req->special_vec.bv_len = sizeof(*range);
+   req->special_vec.bv_len = sizeof(*range) * segments;
req->rq_flags |= RQF_SPECIAL_PAYLOAD;
 
return BLK_MQ_RQ_QUEUE_OK;
@@ -877,6 +887,9 @@ static void nvme_config_discard(struct nvme_ns *ns)
struct nvme_ctrl *ctrl = ns->ctrl;
u32 logical_block_size = queue_logical_block_size(ns->queue);
 
+   BUILD_BUG_ON(PAGE_SIZE / sizeof(struct nvme_dsm_range) <
+   NVME_DSM_MAX_RANGES);
+
if (ctrl->quirks & NVME_QUIRK_DISCARD_ZEROES)
ns->queue->limits.discard_zeroes_data = 1;
else
@@ -885,6 +898,7 @@ static void nvme_config_discard(struct nvme_ns *ns)
ns->queue->limits.discard_alignment = logical_block_size;
ns->queue->limits.discard_granularity = logical_block_size;
blk_queue_max_discard_sectors(ns->queue, UINT_MAX);
+   blk_queue_max_discard_segments(ns->queue, NVME_DSM_MAX_RANGES);
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, ns->queue);
 }
 
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 3d1c6f1b15c9..dcea305157c2 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -553,6 +553,8 @@ enum {
NVME_DSMGMT_AD  = 1 << 2,
 };
 
+#define NVME_DSM_MAX_RANGES255
+
 struct nvme_dsm_range {
__le32  cattr;
__le32  nlb;
-- 
2.11.0




Re: [PATCH 22/24] ubifs: Convert to separately allocated bdi

2017-02-08 Thread Richard Weinberger
Am 02.02.2017 um 18:34 schrieb Jan Kara:
> Allocate struct backing_dev_info separately instead of embedding it
> inside the superblock. This unifies handling of bdi among users.
> 
> CC: Richard Weinberger 
> CC: Artem Bityutskiy 
> CC: Adrian Hunter 
> CC: linux-...@lists.infradead.org
> Signed-off-by: Jan Kara 
> ---
>  fs/ubifs/super.c | 23 +++
>  fs/ubifs/ubifs.h |  3 ---
>  2 files changed, 7 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
> index e08aa04fc835..34810eb52b22 100644
> --- a/fs/ubifs/super.c
> +++ b/fs/ubifs/super.c
> @@ -1827,7 +1827,6 @@ static void ubifs_put_super(struct super_block *sb)
>   }
>  
>   ubifs_umount(c);
> - bdi_destroy(>bdi);
>   ubi_close_volume(c->ubi);
>   mutex_unlock(>umount_mutex);
>  }
> @@ -2019,29 +2018,23 @@ static int ubifs_fill_super(struct super_block *sb, 
> void *data, int silent)
>   goto out;
>   }
>  
> + err = ubifs_parse_options(c, data, 0);
> + if (err)
> + goto out_close;
> +
>   /*
>* UBIFS provides 'backing_dev_info' in order to disable read-ahead. For
>* UBIFS, I/O is not deferred, it is done immediately in readpage,
>* which means the user would have to wait not just for their own I/O
>* but the read-ahead I/O as well i.e. completely pointless.
>*
> -  * Read-ahead will be disabled because @c->bdi.ra_pages is 0.
> +  * Read-ahead will be disabled because @sb->s_bdi->ra_pages is 0.
>*/
> - c->bdi.name = "ubifs",
> - c->bdi.capabilities = 0;

So ->capabilities is now zero by default since you use __GFP_ZERO in
bdi_alloc().
At least for UBIFS I'll add a comment on this, otherwise it is not so
clear that UBIFS wants a BDI with no capabilities and how it achieves that.

Acked-by: Richard Weinberger 

Thanks,
//richard
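
For context, a sketch of the kind of comment Richard mentions adding; the
wording is hypothetical and only restates what the hunk above already says
(the separately allocated bdi is zeroed, so both ra_pages and capabilities
start out as 0):

        /*
         * The separately allocated bdi is zeroed on allocation, so
         * sb->s_bdi->ra_pages and sb->s_bdi->capabilities are both 0:
         * read-ahead stays disabled and no capabilities are advertised,
         * which is exactly what UBIFS wants.
         */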


Re: [PATCH] bfq-mq: cause deadlock by executing exit_icq body immediately

2017-02-08 Thread Paolo Valente

> Il giorno 07 feb 2017, alle ore 22:45, Omar Sandoval  ha 
> scritto:
> 
> On Tue, Feb 07, 2017 at 06:33:46PM +0100, Paolo Valente wrote:
>> Hi,
>> this patch is meant to show that, if the  body of the hook exit_icq is 
>> executed
>> from inside that hook, and not as deferred work, then a circular deadlock
>> occurs.
>> 
>> It happens if, on a CPU
>> - the body of icq_exit takes the scheduler lock,
>> - it does so from inside the exit_icq hook, which is invoked with the queue
>>  lock held
>> 
>> while, on another CPU
>> - bfq_bio_merge, after taking the scheduler lock, invokes bfq_bic_lookup,
>>  which, in its turn, takes the queue lock. bfq_bic_lookup needs to take such 
>> a
>>  lock, because it invokes ioc_lookup_icq.
>> 
>> For more details, here is a lockdep report, right before the deadlock did 
>> occur.
>> 
>> [   44.059877] ==
>> [   44.124922] [ INFO: possible circular locking dependency detected ]
>> [   44.125795] 4.10.0-rc5-bfq-mq+ #38 Not tainted
>> [   44.126414] ---
>> [   44.127291] sync/2043 is trying to acquire lock:
>> [   44.128918]  (&(>lock)->rlock){-.-...}, at: [] 
>> bfq_exit_icq_bfqq+0x55/0x140
>> [   44.134052]
>> [   44.134052] but task is already holding lock:
>> [   44.134868]  (&(>__queue_lock)->rlock){-.}, at: 
>> [] put_io_context_active+0x6e/0xc0
> 
> Hey, Paolo,
> 
> I only briefly skimmed the code, but what are you using the queue_lock
> for? You should just use your scheduler lock everywhere. blk-mq doesn't
> use the queue lock, so the scheduler is the only thing you need mutual
> exclusion against.

Hi Omar,
the cause of the problem is that the hook functions bfq_request_merge
and bfq_allow_bio_merge invoke, directly or through other functions,
the function bfq_bic_lookup, which, in its turn, invokes
ioc_lookup_icq.  The latter must be invoked with the queue lock held.
In particular the offending lines in bfq_bic_lookup are:

spin_lock_irqsave(q->queue_lock, flags);
icq = icq_to_bic(ioc_lookup_icq(ioc, q));
spin_unlock_irqrestore(q->queue_lock, flags);

Maybe I'm missing something and we can avoid taking this lock?

Thanks,
Paolo

> I'm guessing if you stopped using that, your locking
> issues would go away.
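
To make the circular dependency concrete, here is a stand-alone pthread sketch
of the two-CPU scenario described above; the lock names and the sleep() are
purely illustrative, and trylock is used so the program reports the inversion
instead of actually hanging.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t sched_lock = PTHREAD_MUTEX_INITIALIZER;

/* CPU0: the exit_icq body wants the scheduler lock while the queue lock is held. */
static void *exit_icq_path(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&queue_lock);
        sleep(1);                       /* give the other thread time to run */
        if (pthread_mutex_trylock(&sched_lock))
                printf("CPU0: scheduler lock wanted under queue lock - blocked\n");
        else
                pthread_mutex_unlock(&sched_lock);
        pthread_mutex_unlock(&queue_lock);
        return NULL;
}

/* CPU1: bfq_bio_merge holds the scheduler lock, bfq_bic_lookup wants the queue lock. */
static void *bio_merge_path(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&sched_lock);
        sleep(1);
        if (pthread_mutex_trylock(&queue_lock))
                printf("CPU1: queue lock wanted under scheduler lock - blocked\n");
        else
                pthread_mutex_unlock(&queue_lock);
        pthread_mutex_unlock(&sched_lock);
        return NULL;
}

int main(void)
{
        pthread_t t0, t1;

        pthread_create(&t0, NULL, exit_icq_path, NULL);
        pthread_create(&t1, NULL, bio_merge_path, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        return 0;
}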



Re: [PATCH] bfq-mq: cause deadlock by executing exit_icq body immediately

2017-02-08 Thread Omar Sandoval
On Wed, Feb 08, 2017 at 11:03:01AM +0100, Paolo Valente wrote:
> 
> > Il giorno 07 feb 2017, alle ore 22:45, Omar Sandoval  
> > ha scritto:
> > 
> > On Tue, Feb 07, 2017 at 06:33:46PM +0100, Paolo Valente wrote:
> >> Hi,
> >> this patch is meant to show that, if the  body of the hook exit_icq is 
> >> executed
> >> from inside that hook, and not as deferred work, then a circular deadlock
> >> occurs.
> >> 
> >> It happens if, on a CPU
> >> - the body of icq_exit takes the scheduler lock,
> >> - it does so from inside the exit_icq hook, which is invoked with the queue
> >>  lock held
> >> 
> >> while, on another CPU
> >> - bfq_bio_merge, after taking the scheduler lock, invokes bfq_bic_lookup,
> >>  which, in its turn, takes the queue lock. bfq_bic_lookup needs to take 
> >> such a
> >>  lock, because it invokes ioc_lookup_icq.
> >> 
> >> For more details, here is a lockdep report, right before the deadlock did 
> >> occur.
> >> 
> >> [   44.059877] ==
> >> [   44.124922] [ INFO: possible circular locking dependency detected ]
> >> [   44.125795] 4.10.0-rc5-bfq-mq+ #38 Not tainted
> >> [   44.126414] ---
> >> [   44.127291] sync/2043 is trying to acquire lock:
> >> [   44.128918]  (&(>lock)->rlock){-.-...}, at: [] 
> >> bfq_exit_icq_bfqq+0x55/0x140
> >> [   44.134052]
> >> [   44.134052] but task is already holding lock:
> >> [   44.134868]  (&(>__queue_lock)->rlock){-.}, at: 
> >> [] put_io_context_active+0x6e/0xc0
> > 
> > Hey, Paolo,
> > 
> > I only briefly skimmed the code, but what are you using the queue_lock
> > for? You should just use your scheduler lock everywhere. blk-mq doesn't
> > use the queue lock, so the scheduler is the only thing you need mutual
> > exclusion against.
> 
> Hi Omar,
> the cause of the problem is that the hook functions bfq_request_merge
> and bfq_allow_bio_merge invoke, directly or through other functions,
> the function bfq_bic_lookup, which, in its turn, invokes
> ioc_lookup_icq.  The latter must be invoked with the queue lock held.
> In particular the offending lines in bfq_bic_lookup are:
> 
>   spin_lock_irqsave(q->queue_lock, flags);
>   icq = icq_to_bic(ioc_lookup_icq(ioc, q));
>   spin_unlock_irqrestore(q->queue_lock, flags);
> 
> Maybe I'm missing something and we can avoid taking this lock?

Ah, I didn't realize we still used the q->queue_lock for the icq stuff.
You're right, you still need that lock for ioc_lookup_icq(). Unless
there's something else I'm forgetting, that should be the only thing you
need it for in the core code, and you should use your scheduler lock for
everything else. What else are you using q->queue_lock for? 


Re: [PATCH] bfq-mq: cause deadlock by executing exit_icq body immediately

2017-02-08 Thread Paolo Valente

> Il giorno 08 feb 2017, alle ore 11:33, Omar Sandoval  ha 
> scritto:
> 
> On Wed, Feb 08, 2017 at 11:03:01AM +0100, Paolo Valente wrote:
>> 
>>> Il giorno 07 feb 2017, alle ore 22:45, Omar Sandoval  
>>> ha scritto:
>>> 
>>> On Tue, Feb 07, 2017 at 06:33:46PM +0100, Paolo Valente wrote:
 Hi,
 this patch is meant to show that, if the  body of the hook exit_icq is 
 executed
 from inside that hook, and not as deferred work, then a circular deadlock
 occurs.
 
 It happens if, on a CPU
 - the body of icq_exit takes the scheduler lock,
 - it does so from inside the exit_icq hook, which is invoked with the queue
 lock held
 
 while, on another CPU
 - bfq_bio_merge, after taking the scheduler lock, invokes bfq_bic_lookup,
 which, in its turn, takes the queue lock. bfq_bic_lookup needs to take 
 such a
 lock, because it invokes ioc_lookup_icq.
 
 For more details, here is a lockdep report, right before the deadlock did 
 occur.
 
 [   44.059877] ==
 [   44.124922] [ INFO: possible circular locking dependency detected ]
 [   44.125795] 4.10.0-rc5-bfq-mq+ #38 Not tainted
 [   44.126414] ---
 [   44.127291] sync/2043 is trying to acquire lock:
 [   44.128918]  (&(>lock)->rlock){-.-...}, at: [] 
 bfq_exit_icq_bfqq+0x55/0x140
 [   44.134052]
 [   44.134052] but task is already holding lock:
 [   44.134868]  (&(>__queue_lock)->rlock){-.}, at: 
 [] put_io_context_active+0x6e/0xc0
>>> 
>>> Hey, Paolo,
>>> 
>>> I only briefly skimmed the code, but what are you using the queue_lock
>>> for? You should just use your scheduler lock everywhere. blk-mq doesn't
>>> use the queue lock, so the scheduler is the only thing you need mutual
>>> exclusion against.
>> 
>> Hi Omar,
>> the cause of the problem is that the hook functions bfq_request_merge
>> and bfq_allow_bio_merge invoke, directly or through other functions,
>> the function bfq_bic_lookup, which, in its turn, invokes
>> ioc_lookup_icq.  The latter must be invoked with the queue lock held.
>> In particular the offending lines in bfq_bic_lookup are:
>> 
>>  spin_lock_irqsave(q->queue_lock, flags);
>>  icq = icq_to_bic(ioc_lookup_icq(ioc, q));
>>  spin_unlock_irqrestore(q->queue_lock, flags);
>> 
>> Maybe I'm missing something and we can avoid taking this lock?
> 
> Ah, I didn't realize we still used the q->queue_lock for the icq stuff.
> You're right, you still need that lock for ioc_lookup_icq(). Unless
> there's something else I'm forgetting, that should be the only thing you
> need it for in the core code, and you should use your scheduler lock for
> everything else. What else are you using q->queue_lock for? 

Nothing.  The deadlock follows from the fact that bfq_request_merge gets called
with the scheduler lock already held.  Problematic paths start from:
bfq_bio_merge and bfq_insert_request.

I'm trying to understand whether I/we can reorder operations in some
way that avoids the nested locking, but to no avail so far.

Thanks,
Paolo

RE: [PATCH] genhd: Do not hold event lock when scheduling workqueue elements

2017-02-08 Thread Dexuan Cui
> From: Jens Axboe [mailto:ax...@kernel.dk]
> Sent: Wednesday, February 8, 2017 00:09
> To: Dexuan Cui ; Bart Van Assche
> ; h...@suse.com; h...@suse.de
> Cc: h...@lst.de; linux-ker...@vger.kernel.org; linux-block@vger.kernel.org;
> j...@kernel.org
> Subject: Re: [PATCH] genhd: Do not hold event lock when scheduling workqueue
> elements
> 
> On 02/06/2017 11:29 PM, Dexuan Cui wrote:
> >> From: linux-block-ow...@vger.kernel.org [mailto:linux-block-
> >> ow...@vger.kernel.org] On Behalf Of Dexuan Cui
> >> with the linux-next kernel.
> >>
> >> I can boot the guest with linux-next's next-20170130 without any issue,
> >> but since next-20170131 I haven't succeeded in booting the guest.
> >>
> >> With next-20170203 (mentioned in my mail last Friday), I got the same
> >> calltrace as Hannes.
> >>
> >> With today's linux-next (next-20170206), actually the calltrace changed to
> >> the below.
> >> [  122.023036]  ? remove_wait_queue+0x70/0x70
> >> [  122.051383]  async_synchronize_full+0x17/0x20
> >> [  122.076925]  do_init_module+0xc1/0x1f9
> >> [  122.097530]  load_module+0x24bc/0x2980
> >
> > I don't know why it hangs here, but this is the same calltrace in my
> > last-Friday mail, which contains 2 calltraces. It looks the other calltrace 
> > has
> > been resolved by some changes between next-20170203 and today.
> >
> > Here the kernel is trying to load the Hyper-V storage driver (hv_storvsc), 
> > and
> > the driver's __init and .probe have finished successfully and then the 
> > kernel
> > hangs here.
> >
> > I believe something is broken recently, because I don't have any issue 
> > before
> > Jan 31.
> 
> Can you try and bisect it?
> 
> Jens Axboe

I bisected it on the branch for-4.11/next of the linux-block repo and the log 
shows
the first bad commit is 
[e9c787e6] scsi: allocate scsi_cmnd structures as part of struct request

# git bisect log
git bisect start
# bad: [80c6b15732f0d8830032149cbcbc8d67e074b5e8] blk-mq-sched: (un)register 
elevator when (un)registering queue
git bisect bad 80c6b15732f0d8830032149cbcbc8d67e074b5e8
# good: [309bd96af9e26da3038661bf5cdad780eef49dd9] md: cleanup bio op / flags 
handling in raid1_write_request
git bisect good 309bd96af9e26da3038661bf5cdad780eef49dd9
# bad: [27410a8927fb89bd150de08d749a8ed7f67b7739] nbd: remove REQ_TYPE_DRV_PRIV 
leftovers
git bisect bad 27410a8927fb89bd150de08d749a8ed7f67b7739
# bad: [e9c787e65c0c36529745be47d490d998b4b6e589] scsi: allocate scsi_cmnd 
structures as part of struct request
git bisect bad e9c787e65c0c36529745be47d490d998b4b6e589
# good: [3278255741326b6d66d8ca7d1cb2c57633ee43d9] scsi_dh_rdac: switch to 
scsi_execute_req_flags()
git bisect good 3278255741326b6d66d8ca7d1cb2c57633ee43d9
# good: [0fbc3e0ff623f1012e7c2af96e781eeb26bcc0d7] scsi: remove gfp_flags 
member in scsi_host_cmd_pool
git bisect good 0fbc3e0ff623f1012e7c2af96e781eeb26bcc0d7
# good: [eeff68c5618c8d0920b14533c70b2df007bd94b4] scsi: remove 
scsi_cmd_dma_pool
git bisect good eeff68c5618c8d0920b14533c70b2df007bd94b4
# good: [d48777a633d6fa7ccde0f0e6509f0c01fbfc5299] scsi: remove 
__scsi_alloc_queue
git bisect good d48777a633d6fa7ccde0f0e6509f0c01fbfc5299
# first bad commit: [e9c787e65c0c36529745be47d490d998b4b6e589] scsi: allocate 
scsi_cmnd structures as part of struct request

Thanks,
-- Dexuan


Re: [PATCH 0/4 v2] BDI lifetime fix

2017-02-08 Thread Jan Kara
On Wed 08-02-17 08:51:42, Jan Kara wrote:
> On Tue 07-02-17 12:21:01, Tejun Heo wrote:
> > Hello,
> > 
> > On Tue, Feb 07, 2017 at 01:33:31PM +0100, Jan Kara wrote:
> > > > We can see above that inode->i_wb is in r31, and the machine crashed at 
> > > > 0xc03799a0 so it was trying to dereference wb and crashed.
> > > > r31 is NULL in the crash information.
> > > 
> > > Thanks for report and the analysis. After some looking into the code I see
> > > where the problem is. Writeback code assumes inode->i_wb can never become
> > > invalid once it is set however we still call inode_detach_wb() from
> > > __blkdev_put(). So in a way this is a different problem but closely
> > > related.
> > 
> > Heh, it feels like we're chasing our own tails.
> 
> Pretty much. I went through the history of bdi registration and
> unregistration to understand various constraints and various different
> reasons keep pushing that around and always something breaks...
> 
> > > It seems to me that instead of calling inode_detach_wb() in __blkdev_put()
> > > we may just switch blkdev inode to bdi->wb (it is now guaranteed to stay
> > > around). That way bdi_unregister() can complete (destroying all writeback
> > > structures except for bdi->wb) while block device inode can still live 
> > > with
> > > a valid i_wb structure.
> > 
> > So, the problem there would be synchronizing get_wb against the
> > transition.  We can do that and inode_switch_wbs_work_fn() already
> > does it but it is a bit nasty.
> 
> Yeah, I have prototyped that and it is relatively simple although we have
> to use synchronize_rcu() to be sure unlocked users of i_wb are done before
> switching and that is somewhat ugly. So I'm looking for ways to avoid the
> switching as well. Especially since from high-level POV the switching
> should not be necessary. Everything is going away and there is no real work
> to be done. It's just that we may be unlucky enough that e.g. the flusher is
> looking for work to do while we remove stuff under its hands. So switching
> seems like a bit of an overkill.
> 
> > I'm getting a bit confused here, so the reason we added
> > inode_detach_wb() in __blkdev_put() was because the root wb might go
> > away because it's embedded in the bdi which is embedded in the
> > request_queue which is put and may be released by put_disk().
> > 
> > Are you saying that we changed the behavior so that bdi->wb stays
> > around?  If so, we can just remove the inode_detach_wb() call, no?
> 
> Yes, my patches (currently in linux-block) make bdi->wb stay around as long
> as the block device inode. However things are complicated by the fact that
> these days bdev_inode->i_wb may be pointing even to non-root wb_writeback
> structure. If that happens and we don't call inode_detach_wb(),
> bdi_unregister() will block waiting for reference count on wb_writeback to
> drop to zero which happens only once bdev inode is evicted from inode cache
> which may be far far in the future.
> 
> Now I think we can move bdi_unregister() into del_gendisk() (where it IMHO
> belongs anyway as a counterpart to device_add_disk() in which we call
> bdi_register()) and shutdown the block device inode there before calling
> bdi_unregister(). But I'm still figuring out whether it will not break
> something else because the code has lots of interactions...

More news from device shutdown world ;): I was looking more into how device
shutdown works. I was wondering what happens when device gets hot-removed
and how do we shutdown stuff. If I tracked the callback maze correctly, when
we remove scsi disk, we do __scsi_remove_device() -> device_del() ->
bus_remove_device() -> device_release_driver() -> sd_remove() ->
del_gendisk(). We also have __scsi_remove_device() -> blk_cleanup_queue()
-> bdi_unregister() happening shortly after the previous chain.

This ordering seems to be the real culprit of the bdi name reuse problems
Omar has reported. It is the same issue as described in commit 6cd18e711dd8
for MD, BTW, and Dan's patch could IMHO be replaced by moving bdi_unregister()
from blk_cleanup_queue() to del_gendisk(), where it logically belongs as a
counterpart of device_add_disk(). I'll test that.


__scsi_remove_device() is also called when the device was hot-removed. At
that point the bdev can still be open and in active use and its i_wb can
point to some non-root wb_writeback struct. Thus bdi_unregister() will
block waiting for that wb_writeback to get released and thus SCSI device
removal will block basically intefinitely (at least until fs on top of bdev
gets unmounted). I believe this is a bug and __scsi_remove_device() is
expected to finish regardless of upper layers still using the bdev. So to
fix this I don't think we can really avoid the switching of bdev inode from
non-root wb_writeback structure to bdi->wb.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 0/4 v2] BDI lifetime fix

2017-02-08 Thread Jan Kara
On Tue 07-02-17 12:21:01, Tejun Heo wrote:
> Hello,
> 
> On Tue, Feb 07, 2017 at 01:33:31PM +0100, Jan Kara wrote:
> > > We can see above that inode->i_wb is in r31, and the machine crashed at 
> > > 0xc03799a0 so it was trying to dereference wb and crashed.
> > > r31 is NULL in the crash information.
> > 
> > Thanks for report and the analysis. After some looking into the code I see
> > where the problem is. Writeback code assumes inode->i_wb can never become
> > invalid once it is set however we still call inode_detach_wb() from
> > __blkdev_put(). So in a way this is a different problem but closely
> > related.
> 
> Heh, it feels like we're chasing our own tails.

Pretty much. I went through the history of bdi registration and
unregistration to understand various constraints and various different
reasons keep pushing that around and always something breaks...

> > It seems to me that instead of calling inode_detach_wb() in __blkdev_put()
> > we may just switch blkdev inode to bdi->wb (it is now guaranteed to stay
> > around). That way bdi_unregister() can complete (destroying all writeback
> > structures except for bdi->wb) while block device inode can still live with
> > a valid i_wb structure.
> 
> So, the problem there would be synchronizing get_wb against the
> transition.  We can do that and inode_switch_wbs_work_fn() already
> does it but it is a bit nasty.

Yeah, I have prototyped that and it is relatively simple although we have
to use synchronize_rcu() to be sure unlocked users of i_wb are done before
switching and that is somewhat ugly. So I'm looking for ways to avoid the
switching as well. Especially since from high-level POV the switching
should not be necessary. Everything is going away and there is no real work
to be done. It's just that we may be unlucky enough that e.g. the flusher is
looking for work to do while we remove stuff under its hands. So switching
seems like a bit of an overkill.

> I'm getting a bit confused here, so the reason we added
> inode_detach_wb() in __blkdev_put() was because the root wb might go
> away because it's embedded in the bdi which is embedded in the
> request_queue which is put and may be released by put_disk().
> 
> Are you saying that we changed the behavior so that bdi->wb stays
> around?  If so, we can just remove the inode_detach_wb() call, no?

Yes, my patches (currently in linux-block) make bdi->wb stay around as long
as the block device inode. However things are complicated by the fact that
these days bdev_inode->i_wb may be pointing even to non-root wb_writeback
structure. If that happens and we don't call inode_detach_wb(),
bdi_unregister() will block waiting for reference count on wb_writeback to
drop to zero which happens only once bdev inode is evicted from inode cache
which may be far far in the future.

Now I think we can move bdi_unregister() into del_gendisk() (where it IMHO
belongs anyway as a counterpart to device_add_disk() in which we call
bdi_register()) and shutdown the block device inode there before calling
bdi_unregister(). But I'm still figuring out whether it will not break
something else because the code has lots of interactions...

Honza
-- 
Jan Kara 
SUSE Labs, CR