Re: [PATCH v2] blkcg: allocate struct blkcg_gq outside request queue spinlock

2017-03-01 Thread Tahsin Erdogan
Hi Tejun,
>
> Ah, indeed, but we can break out allocation of blkg and its
> initialization, right?  It's a bit more work but then we'd be able to
> do something like.
>
>
> retry:
> new_blkg = alloc;
> lock;
> sanity checks;
> blkg = blkg_lookup_and_create(..., new_blkg);
> if (!blkg) {
> unlock;
> goto retry;
> }
I tried doing it based on the sample above but I wasn't happy with the
result. The amount of code grew too big. I sent a simplified version
that does blkg allocation within blkg_lookup_create(). I think this
version is simpler, let me know what you think.


On Wed, Mar 1, 2017 at 3:43 PM, Tahsin Erdogan  wrote:
> blkg_conf_prep() currently calls blkg_lookup_create() while holding
> request queue spinlock. This means allocating memory for struct
> blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
> failures in call paths like below:
>
>   pcpu_alloc+0x68f/0x710
>   __alloc_percpu_gfp+0xd/0x10
>   __percpu_counter_init+0x55/0xc0
>   cfq_pd_alloc+0x3b2/0x4e0
>   blkg_alloc+0x187/0x230
>   blkg_create+0x489/0x670
>   blkg_lookup_create+0x9a/0x230
>   blkg_conf_prep+0x1fb/0x240
>   __cfqg_set_weight_device.isra.105+0x5c/0x180
>   cfq_set_weight_on_dfl+0x69/0xc0
>   cgroup_file_write+0x39/0x1c0
>   kernfs_fop_write+0x13f/0x1d0
>   __vfs_write+0x23/0x120
>   vfs_write+0xc2/0x1f0
>   SyS_write+0x44/0xb0
>   entry_SYSCALL_64_fastpath+0x18/0xad
>
> In the code path above, percpu allocator cannot call vmalloc() due to
> queue spinlock.
>
> A failure in this call path gives grief to tools which are trying to
> configure io weights. We see occasional failures happen shortly after
> reboots even when the system is not under any memory pressure. Machines
> with a lot of cpus are more vulnerable to this condition.
>
> Add a flag to blkg_lookup_create() to indicate whether releasing locks
> for memory allocation purposes is okay.
>
> Suggested-by: Tejun Heo 
> Signed-off-by: Tahsin Erdogan 
> ---
> v2:
>   Moved blkg creation into blkg_lookup_create() to avoid duplicating
>   blkg_lookup_create() logic.
>
>  block/blk-cgroup.c | 51 +++---
>  include/linux/blk-cgroup.h |  4 ++--
>  2 files changed, 46 insertions(+), 9 deletions(-)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 295e98c2c8cc..afb16e998bf3 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -258,18 +258,22 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
>   * blkg_lookup_create - lookup blkg, try to create one if not there
>   * @blkcg: blkcg of interest
>   * @q: request_queue of interest
> + * @wait_ok: whether blocking for memory allocations is okay
>   *
>   * Lookup blkg for the @blkcg - @q pair.  If it doesn't exist, try to
>   * create one.  blkg creation is performed recursively from blkcg_root such
>   * that all non-root blkg's have access to the parent blkg.  This function
>   * should be called under RCU read lock and @q->queue_lock.
>   *
> + * When @wait_ok is true, rcu and queue locks may be dropped for allocating
> + * memory. In this case, the locks will be reacquired on return.
> + *
>   * Returns pointer to the looked up or created blkg on success, ERR_PTR()
>   * value on error.  If @q is dead, returns ERR_PTR(-EINVAL).  If @q is not
>   * dead and bypassing, returns ERR_PTR(-EBUSY).
>   */
>  struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
> -   struct request_queue *q)
> +   struct request_queue *q, bool wait_ok)
>  {
> struct blkcg_gq *blkg;
>
> @@ -300,7 +304,30 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
> parent = blkcg_parent(parent);
> }
>
> -   blkg = blkg_create(pos, q, NULL);
> +   if (wait_ok) {
> +   struct blkcg_gq *new_blkg;
> +
> +   spin_unlock_irq(q->queue_lock);
> +   rcu_read_unlock();
> +
> +   new_blkg = blkg_alloc(pos, q, GFP_KERNEL);
> +
> +   rcu_read_lock();
> +   spin_lock_irq(q->queue_lock);
> +
> +   if (unlikely(!new_blkg))
> +   return ERR_PTR(-ENOMEM);
> +
> +   if (unlikely(blk_queue_bypass(q))) {
> +   blkg_free(new_blkg);
> +   return ERR_PTR(blk_queue_dying(q) ?
> +   -ENODEV : -EBUSY);
> +   }
> +
> +   blkg = blkg_create(pos, q, new_blkg);
> +   } else
> +   blkg = blkg_create(pos, q, NULL);
> +
> if (pos == blkcg || IS_ERR(blkg))
> return blkg;
> }
> @@ -789,6 +816,7 @@ int 

Re: blk_integrity_revalidate() clears BDI_CAP_STABLE_WRITES

2017-03-01 Thread Martin K. Petersen
> "Ilya" == Ilya Dryomov  writes:

Ilya,

Ilya> Given the above, I'm not sure what the baseline is --
Ilya> blk_integrity code isn't invoked for data-only lbafs.  Could nvme
Ilya> folks please look at this?  rbd regression caused by integrity
Ilya> revalidate change goes back to 4.4 and I'd really like to get it
Ilya> fixed.  The proposed patch is attached to my earlier reply on
Ilya> linux-block.

I've had a couple of fires to attend to this week. I'll try and give
your patch a spin on my FC setup tomorrow or Friday.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 1/2] loop: fix LO_FLAGS_PARTSCAN hang

2017-03-01 Thread Jens Axboe
On 03/01/2017 11:42 AM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> loop_reread_partitions() needs to do I/O, but we just froze the queue,
> so we end up waiting forever. This can easily be reproduced with losetup
> -P. Fix it by moving the reread to after we unfreeze the queue.

Thanks Omar, applied. That's a brown paper bag bug.

-- 
Jens Axboe



Re: [PATCH 2/2] blk-mq-debugfs: add q->mq_freeze_depth output

2017-03-01 Thread Omar Sandoval
On Wed, Mar 01, 2017 at 07:25:59PM -0700, Jens Axboe wrote:
> On 03/01/2017 11:42 AM, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > This can be used to diagnose freeze-related hangs.
> 
> Do we really need this? For the loop case, it was painfully
> obvious that it was a count issue. And I can't think of any
> cases where it would have helped us, outside of rewriting
> core parts (like the first edition of the shadow requests
> for the scheduler).
> 
> Not strongly opposed to it, feel free to convince me.

I don't feel a burning need for it, let's just drop it.


Re: [PATCH 1/2] loop: fix LO_FLAGS_PARTSCAN hang

2017-03-01 Thread Ming Lei
On Thu, Mar 2, 2017 at 2:42 AM, Omar Sandoval  wrote:
> From: Omar Sandoval 
>
> loop_reread_partitions() needs to do I/O, but we just froze the queue,
> so we end up waiting forever. This can easily be reproduced with losetup
> -P. Fix it by moving the reread to after we unfreeze the queue.
>
> Fixes: ecdd09597a57 ("block/loop: fix race between I/O and set_status")
> Reported-by: Tejun Heo 
> Cc: Ming Lei 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Omar Sandoval 

Reviewed-by: Ming Lei 

Thanks,
Ming


Re: [PATCH 2/2] blk-mq-debugfs: add q->mq_freeze_depth output

2017-03-01 Thread Jens Axboe
On 03/01/2017 11:42 AM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> This can be used to diagnose freeze-related hangs.

Do we really need this? For the loop case, it was painfully
obvious that it was a count issue. And I can't think of any
cases where it would have helped us, outside of rewriting
core parts (like the first edition of the shadow requests
for the scheduler).

Not strongly opposed to it, feel free to convince me.

-- 
Jens Axboe



Re: [PATCH v2 13/13] md: raid10: avoid direct access to bvec table in handle_reshape_read_error

2017-03-01 Thread Ming Lei
Hi Shaohua,

On Wed, Mar 1, 2017 at 7:46 AM, Shaohua Li  wrote:
> On Tue, Feb 28, 2017 at 11:41:43PM +0800, Ming Lei wrote:
>> The cost is 128 bytes (8*16) of stack space in kernel thread context, and
>> it just uses the bio helper to retrieve pages from the bio.
>>
>> Signed-off-by: Ming Lei 
>> ---
>>  drivers/md/raid10.c | 12 ++--
>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index 0b97631e3905..6ffb64ab45f8 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -4670,7 +4670,15 @@ static int handle_reshape_read_error(struct mddev *mddev,
>>   struct r10bio *r10b = _stack.r10_bio;
>>   int slot = 0;
>>   int idx = 0;
>> - struct bio_vec *bvec = r10_bio->master_bio->bi_io_vec;
>> + struct bio_vec *bvl;
>> + struct page *pages[RESYNC_PAGES];
>> +
>> + /*
>> +  * This bio is allocated in reshape_request(), and size
>> +  * is still RESYNC_PAGES
>> +  */
>> + bio_for_each_segment_all(bvl, r10_bio->master_bio, idx)
>> + pages[idx] = bvl->bv_page;
>
> The reshape bio is doing IO against the memory we allocated for r10_bio, I'm
> wondering why we can't get the pages from r10_bio. In this way, we don't need
> to access the bio_vec any more.

Reshape read is special and the bio (r10_bio->master_bio) isn't allocated from
.r10buf_pool, please see reshape_request().

Thanks,
Ming Lei


Re: [PATCH v2 09/13] md: raid1: use bio_segments_all()

2017-03-01 Thread Ming Lei
Hi Shaohua,

On Wed, Mar 1, 2017 at 7:42 AM, Shaohua Li  wrote:
> On Tue, Feb 28, 2017 at 11:41:39PM +0800, Ming Lei wrote:
>> Use this helper, instead of direct access to .bi_vcnt.
>
> What we really need to do for the behind IO is:
> - allocate memory and copy bio data to the memory
> - let behind bio do IO against the memory
>
> The behind bio doesn't need to have exactly the same bio_vec setting. If we
> just track the new memory, we don't need to use bio_segments_all() or access
> the bio_vec either.

But we need to figure out how many vecs (each vec storing one page) to
allocate for the cloned/behind bio, and that is the only use of
bio_segments_all() here. Or do you have an idea to avoid that?

Thanks,
Ming


Re: [PATCH v2 06/13] md: raid1: don't use bio's vec table to manage resync pages

2017-03-01 Thread Ming Lei
Hi Shaohua,

On Wed, Mar 1, 2017 at 7:37 AM, Shaohua Li  wrote:
> On Tue, Feb 28, 2017 at 11:41:36PM +0800, Ming Lei wrote:
>> Now we allocate one page array for managing resync pages, instead
>> of using bio's vec table to do that, and the old way is very hacky
>> and won't work any more if multipage bvec is enabled.
>>
>> The introduced cost is that we need to allocate (128 + 16) * raid_disks
>> bytes per r1_bio, and it is fine because the inflight r1_bio for
>> resync shouldn't be much, as pointed by Shaohua.
>>
>> Also the bio_reset() in raid1_sync_request() is removed because
>> all bios are freshly new now and not necessary to reset any more.
>>
>> This patch can be thought as a cleanup too
>>
>> Suggested-by: Shaohua Li 
>> Signed-off-by: Ming Lei 
>> ---
>>  drivers/md/raid1.c | 83 ++
>>  1 file changed, 53 insertions(+), 30 deletions(-)
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index c442b4657e2f..900144f39630 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -77,6 +77,16 @@ static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
>>  #define raid1_log(md, fmt, args...)  \
>>   do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0)
>>
>> +static inline struct resync_pages *get_resync_pages(struct bio *bio)
>> +{
>> + return bio->bi_private;
>> +}
>> +
>> +static inline struct r1bio *get_resync_r1bio(struct bio *bio)
>> +{
>> + return get_resync_pages(bio)->raid_bio;
>> +}
>
> This is a weird relationship between bio, r1bio and the resync_pages. I'd like the pages
> are

It is only a bit weird inside allocating and freeing r1bio; once all are
allocated, you can see everything is clean and simple:

- r1bio includes lots of bios,
- and each bio has one resync_pages attached via .bi_private

> embedded in r1bio. Maybe add a pointer from r1bio to the pages. It's cleaner
> and more straightforward.

In theory, the cleanest way is to allocate one resync_pages for each resync
bio, but that isn't efficient. That is why this patch allocates all the
resync_pages together in r1buf_pool_alloc() and splits them among the bios.

BTW, the only trick is that the whole page array is stored in .bi_private
of the 1st bio, so we can avoid adding one more pointer to r1bio.

>
> I think the patch 6, 7 8 should be in a same patch. Otherwise bisect will be 
> broken.
>

No, it won't. Patches 7 and 8 just replace reading from the bvec table
with reading from the pre-allocated page array. Both should be fine, but
the latter is cleaner and simpler to do.

Thanks,
Ming


Re: [PATCH v2 05/13] md: raid1: simplify r1buf_pool_free()

2017-03-01 Thread Ming Lei
Hi Shaohua,

On Wed, Mar 1, 2017 at 7:31 AM, Shaohua Li  wrote:
> On Tue, Feb 28, 2017 at 11:41:35PM +0800, Ming Lei wrote:
>> This patch gets each page's reference of each bio for resync,
>> then r1buf_pool_free() gets simplified a lot.
>>
>> The same policy has been taken in raid10's buf pool allocation/free
>> too.
>
> We are going to delete the code, so this simplification isn't really required.

Could you explain it a bit? Or I can put that patch into this patchset if
you can provide one.

Thanks,
Ming Lei


Re: [PATCH v2 04/13] md: prepare for managing resync I/O pages in clean way

2017-03-01 Thread Ming Lei
Hi Shaohua,

On Wed, Mar 1, 2017 at 7:30 AM, Shaohua Li  wrote:
> On Tue, Feb 28, 2017 at 11:41:34PM +0800, Ming Lei wrote:
>> Now resync I/O uses bio's bvec table to manage pages,
>> this way is very hacky, and may not work any more
>> once multipage bvec is introduced.
>>
>> So introduce helpers and new data structure for
>> managing resync I/O pages more cleanly.
>>
>> Signed-off-by: Ming Lei 
>> ---
>>  drivers/md/md.h | 54 ++
>>  1 file changed, 54 insertions(+)
>>
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index 1d63239a1be4..b5a638d85cb4 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -720,4 +720,58 @@ static inline void mddev_check_writesame(struct mddev *mddev, struct bio *bio)
>>  #define RESYNC_BLOCK_SIZE (64*1024)
>>  #define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)
>>
>> +/* for managing resync I/O pages */
>> +struct resync_pages {
>> + unsignedidx;/* for get/put page from the pool */
>> + void*raid_bio;
>> + struct page *pages[RESYNC_PAGES];
>> +};
>
> I'd like this to be embedded into r1bio directly. Not sure if we really need a
> structure.

There are two reasons we can't put this into r1bio:
- r1bio is used in both normal and resync I/O, and it isn't fair to allocate
more for normal I/O; that is why this patch doesn't touch r1bio or r10bio

- the number of 'struct resync_pages' instances depends on raid_disks (raid1)
or copies (raid10), and neither can be decided at compile time.
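
To make that concrete, a hedged sketch (not the md code; the function name
and 'nr_bios' are purely illustrative) of sizing the resync_pages array at
pool-allocation time, reusing the resync_alloc_pages()/resync_free_pages()
helpers from the patch:

/* one resync_pages per resync bio; the device count is only known at runtime */
static struct resync_pages *alloc_resync_pages_array(int nr_bios, gfp_t gfp)
{
    struct resync_pages *rps;
    int i;

    rps = kmalloc_array(nr_bios, sizeof(*rps), gfp);
    if (!rps)
        return NULL;

    for (i = 0; i < nr_bios; i++) {
        if (resync_alloc_pages(&rps[i], gfp)) {
            while (--i >= 0)
                resync_free_pages(&rps[i]);
            kfree(rps);
            return NULL;
        }
    }
    return rps;
}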

>
>> +
>> +static inline int resync_alloc_pages(struct resync_pages *rp,
>> +  gfp_t gfp_flags)
>> +{
>> + int i;
>> +
>> + for (i = 0; i < RESYNC_PAGES; i++) {
>> + rp->pages[i] = alloc_page(gfp_flags);
>> + if (!rp->pages[i])
>> + goto out_free;
>> + }
>> +
>> + return 0;
>> +
>> + out_free:
>> + while (--i >= 0)
>> + __free_page(rp->pages[i]);
>> + return -ENOMEM;
>> +}
>> +
>> +static inline void resync_free_pages(struct resync_pages *rp)
>> +{
>> + int i;
>> +
>> + for (i = 0; i < RESYNC_PAGES; i++)
>> + __free_page(rp->pages[i]);
>
> Since we will use get_page, shouldn't this be put_page?

You are right, will fix in v3.

>
>> +}
>> +
>> +static inline void resync_get_all_pages(struct resync_pages *rp)
>> +{
>> + int i;
>> +
>> + for (i = 0; i < RESYNC_PAGES; i++)
>> + get_page(rp->pages[i]);
>> +}
>> +
>> +static inline struct page *resync_fetch_page(struct resync_pages *rp)
>> +{
>> + if (WARN_ON_ONCE(rp->idx >= RESYNC_PAGES))
>> + return NULL;
>> + return rp->pages[rp->idx++];
>
> I'd like the caller explicitly specify the index instead of a global variable
> to track it, which will make the code more understandable and less error 
> prone.

That is fine, but the index has to be per bio, and in the end the index has
to be stored in 'struct resync_pages', so every user would have to call it
in the following way:

  resync_fetch_page(rp, rp->idx);

so there looks to be no benefit in passing the index explicitly.

>
>> +}
>> +
>> +static inline bool resync_page_available(struct resync_pages *rp)
>> +{
>> + return rp->idx < RESYNC_PAGES;
>> +}
>
> Then we don't need this obscure API.

That is fine.


Thanks,
Ming Lei


Re: [PATCH 2/3] blk-mq: Provide freeze queue timeout

2017-03-01 Thread Christoph Hellwig
> +int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
> +  unsigned long timeout)
> +{
> + return wait_event_timeout(q->mq_freeze_wq,
> + percpu_ref_is_zero(&q->q_usage_counter),
> + timeout);
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait_timeout);

Can you just add the timeout argument to blk_mq_freeze_queue_wait?
Existing callers can pass 0, which is interpreted as no timeout by
the low-level wait code.
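
A minimal sketch of that single-function variant (illustrative only, not
what was merged; here a zero timeout is mapped explicitly onto
MAX_SCHEDULE_TIMEOUT rather than relying on the low-level wait code):

/* Sketch: fold the timeout into blk_mq_freeze_queue_wait().  Returns 0 on
 * timeout, otherwise the remaining jiffies, as wait_event_timeout() does. */
long blk_mq_freeze_queue_wait(struct request_queue *q, unsigned long timeout)
{
    if (!timeout)
        timeout = MAX_SCHEDULE_TIMEOUT; /* 0 means "no timeout" for callers */

    return wait_event_timeout(q->mq_freeze_wq,
                              percpu_ref_is_zero(&q->q_usage_counter),
                              timeout);
}
EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);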


[PATCH v2] blkcg: allocate struct blkcg_gq outside request queue spinlock

2017-03-01 Thread Tahsin Erdogan
blkg_conf_prep() currently calls blkg_lookup_create() while holding
request queue spinlock. This means allocating memory for struct
blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
failures in call paths like below:

  pcpu_alloc+0x68f/0x710
  __alloc_percpu_gfp+0xd/0x10
  __percpu_counter_init+0x55/0xc0
  cfq_pd_alloc+0x3b2/0x4e0
  blkg_alloc+0x187/0x230
  blkg_create+0x489/0x670
  blkg_lookup_create+0x9a/0x230
  blkg_conf_prep+0x1fb/0x240
  __cfqg_set_weight_device.isra.105+0x5c/0x180
  cfq_set_weight_on_dfl+0x69/0xc0
  cgroup_file_write+0x39/0x1c0
  kernfs_fop_write+0x13f/0x1d0
  __vfs_write+0x23/0x120
  vfs_write+0xc2/0x1f0
  SyS_write+0x44/0xb0
  entry_SYSCALL_64_fastpath+0x18/0xad

In the code path above, percpu allocator cannot call vmalloc() due to
queue spinlock.

A failure in this call path gives grief to tools which are trying to
configure io weights. We see occasional failures happen shortly after
reboots even when the system is not under any memory pressure. Machines
with a lot of cpus are more vulnerable to this condition.

Add a flag to blkg_lookup_create() to indicate whether releasing locks
for memory allocation purposes is okay.

Suggested-by: Tejun Heo 
Signed-off-by: Tahsin Erdogan 
---
v2:
  Moved blkg creation into blkg_lookup_create() to avoid duplicating
  blkg_lookup_create() logic.

 block/blk-cgroup.c | 51 +++---
 include/linux/blk-cgroup.h |  4 ++--
 2 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 295e98c2c8cc..afb16e998bf3 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -258,18 +258,22 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
  * blkg_lookup_create - lookup blkg, try to create one if not there
  * @blkcg: blkcg of interest
  * @q: request_queue of interest
+ * @wait_ok: whether blocking for memory allocations is okay
  *
  * Lookup blkg for the @blkcg - @q pair.  If it doesn't exist, try to
  * create one.  blkg creation is performed recursively from blkcg_root such
  * that all non-root blkg's have access to the parent blkg.  This function
  * should be called under RCU read lock and @q->queue_lock.
  *
+ * When @wait_ok is true, rcu and queue locks may be dropped for allocating
+ * memory. In this case, the locks will be reacquired on return.
+ *
  * Returns pointer to the looked up or created blkg on success, ERR_PTR()
  * value on error.  If @q is dead, returns ERR_PTR(-EINVAL).  If @q is not
  * dead and bypassing, returns ERR_PTR(-EBUSY).
  */
 struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
-   struct request_queue *q)
+   struct request_queue *q, bool wait_ok)
 {
struct blkcg_gq *blkg;
 
@@ -300,7 +304,30 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
parent = blkcg_parent(parent);
}
 
-   blkg = blkg_create(pos, q, NULL);
+   if (wait_ok) {
+   struct blkcg_gq *new_blkg;
+
+   spin_unlock_irq(q->queue_lock);
+   rcu_read_unlock();
+
+   new_blkg = blkg_alloc(pos, q, GFP_KERNEL);
+
+   rcu_read_lock();
+   spin_lock_irq(q->queue_lock);
+
+   if (unlikely(!new_blkg))
+   return ERR_PTR(-ENOMEM);
+
+   if (unlikely(blk_queue_bypass(q))) {
+   blkg_free(new_blkg);
+   return ERR_PTR(blk_queue_dying(q) ?
+   -ENODEV : -EBUSY);
+   }
+
+   blkg = blkg_create(pos, q, new_blkg);
+   } else
+   blkg = blkg_create(pos, q, NULL);
+
if (pos == blkcg || IS_ERR(blkg))
return blkg;
}
@@ -789,6 +816,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
 {
struct gendisk *disk;
struct blkcg_gq *blkg;
+   struct request_queue *q;
struct module *owner;
unsigned int major, minor;
int key_len, part, ret;
@@ -812,18 +840,27 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
return -ENODEV;
}
 
+   q = disk->queue;
+
rcu_read_lock();
-   spin_lock_irq(disk->queue->queue_lock);
+   spin_lock_irq(q->queue_lock);
 
-   if (blkcg_policy_enabled(disk->queue, pol))
-   blkg = blkg_lookup_create(blkcg, disk->queue);
-   else
+   if (blkcg_policy_enabled(q, pol)) {
+   blkg = blkg_lookup_create(blkcg, q, true /* wait_ok */);
+
+   /*
+* blkg_lookup_create() may have dropped and reacquired the
+* queue lock. Check policy enabled state 

Re: [PATCH 0/3] nvme suspend/resume fix

2017-03-01 Thread Jens Axboe
On 03/01/2017 12:22 PM, Keith Busch wrote:
> Hi Jens,
> 
> This is hopefully the last version to fix nvme stopping blk-mq's CPU
> event from making forward progress. The solution requires a couple new
> blk-mq exports so the nvme driver can properly sync with queue states.
> 
> Since this depends on the blk-mq parts, and if you approve of the
> proposal, I think it'd be easiest if you can take this directly into
> linux-block/for-linus. Otherwise, we can send you a pull request if you
> Ack the blk-mq parts.
> 
> The difference from the previous patch is an update that Artur
> confirmed passes hibernate on a stacked request queue. Personally,
> I tested this for several hours with fio running buffered writes
> in the back-ground and rtcwake running suspend/resume at intervals.
> This succeeded with no fio errors.

I've queued it up for this series, thanks Keith.

-- 
Jens Axboe



Re: [PATCH 1/8] nowait aio: Introduce IOCB_FLAG_NOWAIT

2017-03-01 Thread Christoph Hellwig
On Wed, Mar 01, 2017 at 10:57:17AM -0600, Goldwyn Rodrigues wrote:
> RWF_*? Aren't those kernel-space flags? Or did you intend to say
> IOCB_FLAG_*?

No, they are the flags for preadv2/pwritev2.

> If yes, we maintain two flag fields? aio_reserved1 (perhaps
> renamed to aio_flags2) and aio_flags?

Yes - I'd call it aio_rw_flags or similar.

> aio_reserved1 is also used to return key for the purpose of io_cancel,
> but we should be able to fetch the flags before putting the key value
> there. Still I am not comfortable using the same field for it because it
> will be overwritten when io_submit returns.

It's not - the key is a separate field.  It's just that the two are
defined using a very strange macro switching around their positions
based on the endianness.
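
For reference, the layout in question looks roughly like this fragment of
include/uapi/linux/aio_abi.h (condensed and paraphrased from memory of that
era's header, so check the tree for the exact text); aio_key and
aio_reserved1 swap positions depending on byte order:

#if defined(__BYTE_ORDER) ? __BYTE_ORDER == __LITTLE_ENDIAN : defined(__LITTLE_ENDIAN)
#define PADDED(x, y)    x, y
#else /* big endian */
#define PADDED(x, y)    y, x
#endif

struct iocb {
    /* these are internal to the kernel/libc. */
    __u64   aio_data;       /* data to be returned in event's data */
    __u32   PADDED(aio_key, aio_reserved1);
                            /* the kernel sets aio_key to the req # */
    /* ... remaining iocb fields ... */
};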

> Which brings me to the next question: What is the purpose of aio_key?
> Why is aio_key set to KIOCB_KEY (which is zero) every time? You are not
> differentiating the requests by setting every iocb's key to zero.

I don't know the history of this rather odd field.


[PATCH 1/2] loop: fix LO_FLAGS_PARTSCAN hang

2017-03-01 Thread Omar Sandoval
From: Omar Sandoval 

loop_reread_partitions() needs to do I/O, but we just froze the queue,
so we end up waiting forever. This can easily be reproduced with losetup
-P. Fix it by moving the reread to after we unfreeze the queue.

Fixes: ecdd09597a57 ("block/loop: fix race between I/O and set_status")
Reported-by: Tejun Heo 
Cc: Ming Lei 
Cc: sta...@vger.kernel.org
Signed-off-by: Omar Sandoval 
---
 drivers/block/loop.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 4b52a1690329..132c9f371dce 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1142,13 +1142,6 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 (info->lo_flags & LO_FLAGS_AUTOCLEAR))
lo->lo_flags ^= LO_FLAGS_AUTOCLEAR;
 
-   if ((info->lo_flags & LO_FLAGS_PARTSCAN) &&
-!(lo->lo_flags & LO_FLAGS_PARTSCAN)) {
-   lo->lo_flags |= LO_FLAGS_PARTSCAN;
-   lo->lo_disk->flags &= ~GENHD_FL_NO_PART_SCAN;
-   loop_reread_partitions(lo, lo->lo_device);
-   }
-
lo->lo_encrypt_key_size = info->lo_encrypt_key_size;
lo->lo_init[0] = info->lo_init[0];
lo->lo_init[1] = info->lo_init[1];
@@ -1163,6 +1156,14 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 
  exit:
blk_mq_unfreeze_queue(lo->lo_queue);
+
+   if (!err && (info->lo_flags & LO_FLAGS_PARTSCAN) &&
+!(lo->lo_flags & LO_FLAGS_PARTSCAN)) {
+   lo->lo_flags |= LO_FLAGS_PARTSCAN;
+   lo->lo_disk->flags &= ~GENHD_FL_NO_PART_SCAN;
+   loop_reread_partitions(lo, lo->lo_device);
+   }
+
return err;
 }
 
-- 
2.12.0



[PATCH 2/2] blk-mq-debugfs: add q->mq_freeze_depth output

2017-03-01 Thread Omar Sandoval
From: Omar Sandoval 

This can be used to diagnose freeze-related hangs.

Signed-off-by: Omar Sandoval 
---
 block/blk-mq-debugfs.c | 28 
 1 file changed, 28 insertions(+)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index f6d917977b33..1459546788da 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -29,6 +29,26 @@ struct blk_mq_debugfs_attr {
const struct file_operations *fops;
 };
 
+static int queue_freeze_depth_show(struct seq_file *m, void *v)
+{
+   struct request_queue *q = m->private;
+
+   seq_printf(m, "%d\n", atomic_read(&q->mq_freeze_depth));
+   return 0;
+}
+
+static int queue_freeze_depth_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, queue_freeze_depth_show, inode->i_private);
+}
+
+static const struct file_operations queue_freeze_depth_fops = {
+   .open   = queue_freeze_depth_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+
 static int blk_mq_debugfs_seq_open(struct inode *inode, struct file *file,
   const struct seq_operations *ops)
 {
@@ -636,6 +656,11 @@ static const struct file_operations ctx_completed_fops = {
.release= single_release,
 };
 
+static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
+   {"freeze_depth", 0400, _freeze_depth_fops},
+   {},
+};
+
 static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
{"state", 0400, _state_fops},
{"flags", 0400, _flags_fops},
@@ -753,6 +778,9 @@ int blk_mq_debugfs_register_hctxs(struct request_queue *q)
if (!q->mq_debugfs_dir)
goto err;
 
+   if (!debugfs_create_files(q->mq_debugfs_dir, q, blk_mq_debugfs_queue_attrs))
+   return -ENOMEM;
+
queue_for_each_hw_ctx(q, hctx, i) {
if (blk_mq_debugfs_register_hctx(q, hctx))
goto err;
-- 
2.12.0



[PATCH 3/3] nvme: Complete all stuck requests

2017-03-01 Thread Keith Busch
If the nvme driver is shutting down its controller, the driver will not
start the queues up again, preventing blk-mq's hot CPU notifier from
making forward progress.

To fix that, this patch starts a request_queue freeze when the driver
resets a controller so no new requests may enter. The driver will wait
for frozen after IO queues are restarted to ensure the queue reference
can be reinitialized when nvme requests to unfreeze the queues.

If the driver is doing a safe shutdown, the driver will wait for the
controller to successfully complete all inflight requests so that we
don't unnecessarily fail them. Once the controller has been disabled,
the queues will be restarted to force remaining entered requests to end
in failure so that blk-mq's hot cpu notifier may progress.
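
Roughly, the ordering the changelog describes is (an illustrative summary,
not a quote of the patch; the reset-path hunks are cut off in this mail):

    nvme_start_freeze(&dev->ctrl);  /* on disable: no new requests enter */
    /* ... disable the controller, then reset ... */
    nvme_start_queues(&dev->ctrl);  /* restart the IO queues */
    nvme_wait_freeze(&dev->ctrl);   /* wait for entered requests to drain */
    /* reinitialize queue state here */
    nvme_unfreeze(&dev->ctrl);      /* allow new requests again */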

Signed-off-by: Keith Busch 
---
 drivers/nvme/host/core.c | 47 +++
 drivers/nvme/host/nvme.h |  4 
 drivers/nvme/host/pci.c  | 33 -
 3 files changed, 79 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 25ec4e5..9b3b57f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2344,6 +2344,53 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_kill_queues);
 
+void nvme_unfreeze(struct nvme_ctrl *ctrl)
+{
+   struct nvme_ns *ns;
+
+   mutex_lock(&ctrl->namespaces_mutex);
+   list_for_each_entry(ns, &ctrl->namespaces, list)
+   blk_mq_unfreeze_queue(ns->queue);
+   mutex_unlock(&ctrl->namespaces_mutex);
+}
+EXPORT_SYMBOL_GPL(nvme_unfreeze);
+
+void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout)
+{
+   struct nvme_ns *ns;
+
+   mutex_lock(&ctrl->namespaces_mutex);
+   list_for_each_entry(ns, &ctrl->namespaces, list) {
+   timeout = blk_mq_freeze_queue_wait_timeout(ns->queue, timeout);
+   if (timeout <= 0)
+   break;
+   }
+   mutex_unlock(&ctrl->namespaces_mutex);
+}
+EXPORT_SYMBOL_GPL(nvme_wait_freeze_timeout);
+
+void nvme_wait_freeze(struct nvme_ctrl *ctrl)
+{
+   struct nvme_ns *ns;
+
+   mutex_lock(&ctrl->namespaces_mutex);
+   list_for_each_entry(ns, &ctrl->namespaces, list)
+   blk_mq_freeze_queue_wait(ns->queue);
+   mutex_unlock(&ctrl->namespaces_mutex);
+}
+EXPORT_SYMBOL_GPL(nvme_wait_freeze);
+
+void nvme_start_freeze(struct nvme_ctrl *ctrl)
+{
+   struct nvme_ns *ns;
+
+   mutex_lock(&ctrl->namespaces_mutex);
+   list_for_each_entry(ns, &ctrl->namespaces, list)
+   blk_mq_freeze_queue_start(ns->queue);
+   mutex_unlock(&ctrl->namespaces_mutex);
+}
+EXPORT_SYMBOL_GPL(nvme_start_freeze);
+
 void nvme_stop_queues(struct nvme_ctrl *ctrl)
 {
struct nvme_ns *ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a3da1e9..2aa20e3 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -294,6 +294,10 @@ void nvme_queue_async_events(struct nvme_ctrl *ctrl);
 void nvme_stop_queues(struct nvme_ctrl *ctrl);
 void nvme_start_queues(struct nvme_ctrl *ctrl);
 void nvme_kill_queues(struct nvme_ctrl *ctrl);
+void nvme_unfreeze(struct nvme_ctrl *ctrl);
+void nvme_wait_freeze(struct nvme_ctrl *ctrl);
+void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout);
+void nvme_start_freeze(struct nvme_ctrl *ctrl);
 
 #define NVME_QID_ANY -1
 struct request *nvme_alloc_request(struct request_queue *q,
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index eee8f84..26a5fd0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1675,21 +1675,34 @@ static void nvme_pci_disable(struct nvme_dev *dev)
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 {
int i, queues;
-   u32 csts = -1;
+   bool dead = true;
+   struct pci_dev *pdev = to_pci_dev(dev->dev);
 
del_timer_sync(&dev->watchdog_timer);
 
mutex_lock(&dev->shutdown_lock);
-   if (pci_is_enabled(to_pci_dev(dev->dev))) {
-   nvme_stop_queues(&dev->ctrl);
-   csts = readl(dev->bar + NVME_REG_CSTS);
+   if (pci_is_enabled(pdev)) {
+   u32 csts = readl(dev->bar + NVME_REG_CSTS);
+
+   if (dev->ctrl.state == NVME_CTRL_LIVE)
+   nvme_start_freeze(&dev->ctrl);
+   dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
+   pdev->error_state  != pci_channel_io_normal);
}
 
+   /*
+* Give the controller a chance to complete all entered requests if
+* doing a safe shutdown.
+*/
+   if (!dead && shutdown)
+   nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
+   nvme_stop_queues(&dev->ctrl);
+
queues = dev->online_queues - 1;
for (i = dev->queue_count - 1; i > 0; i--)
nvme_suspend_queue(dev->queues[i]);
 
-   if (csts & NVME_CSTS_CFS || !(csts & NVME_CSTS_RDY)) {
+   if (dead) {
/* A device might 

[PATCH 2/3] blk-mq: Provide freeze queue timeout

2017-03-01 Thread Keith Busch
A driver may wish to take corrective action if queued requests do not
complete within a set time.

Signed-off-by: Keith Busch 
---
 block/blk-mq.c | 9 +
 include/linux/blk-mq.h | 2 ++
 2 files changed, 11 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8da2c04..a5e66a7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -81,6 +81,15 @@ void blk_mq_freeze_queue_wait(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
 
+int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
+unsigned long timeout)
+{
+   return wait_event_timeout(q->mq_freeze_wq,
+   percpu_ref_is_zero(&q->q_usage_counter),
+   timeout);
+}
+EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait_timeout);
+
 /*
  * Guarantee no request is in use, so we can change any data structure of
  * the queue afterward.
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8dacf68..b296a90 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -246,6 +246,8 @@ void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_unfreeze_queue(struct request_queue *q);
 void blk_mq_freeze_queue_start(struct request_queue *q);
 void blk_mq_freeze_queue_wait(struct request_queue *q);
+int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
+unsigned long timeout);
 int blk_mq_reinit_tagset(struct blk_mq_tag_set *set);
 
 int blk_mq_map_queues(struct blk_mq_tag_set *set);
-- 
2.5.5



[PATCH 1/3] blk-mq: Export blk_mq_freeze_queue_wait

2017-03-01 Thread Keith Busch
Drivers can start a freeze, so this provides a way to wait for frozen.

Signed-off-by: Keith Busch 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Sagi Grimberg 
---
 block/blk-mq.c | 3 ++-
 include/linux/blk-mq.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 94593c6..8da2c04 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -75,10 +75,11 @@ void blk_mq_freeze_queue_start(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start);
 
-static void blk_mq_freeze_queue_wait(struct request_queue *q)
+void blk_mq_freeze_queue_wait(struct request_queue *q)
 {
wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
 }
+EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
 
 /*
  * Guarantee no request is in use, so we can change any data structure of
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 001d30d..8dacf68 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -245,6 +245,7 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_unfreeze_queue(struct request_queue *q);
 void blk_mq_freeze_queue_start(struct request_queue *q);
+void blk_mq_freeze_queue_wait(struct request_queue *q);
 int blk_mq_reinit_tagset(struct blk_mq_tag_set *set);
 
 int blk_mq_map_queues(struct blk_mq_tag_set *set);
-- 
2.5.5



[PATCH 0/3] nvme suspend/resume fix

2017-03-01 Thread Keith Busch
Hi Jens,

This is hopefully the last version to fix nvme stopping blk-mq's CPU
event from making forward progress. The solution requires a couple new
blk-mq exports so the nvme driver can properly sync with queue states.

Since this depends on the blk-mq parts, and if you approve of the
proposal, I think it'd be easiest if you can take this directly into
linux-block/for-linus. Otherwise, we can send you a pull request if you
Ack the blk-mq parts.

The difference from the previous patch is an update that Artur
confirmed passes hibernate on a stacked request queue. Personally,
I tested this for several hours with fio running buffered writes
in the back-ground and rtcwake running suspend/resume at intervals.
This succeeded with no fio errors.

Keith Busch (3):
  blk-mq: Export blk_mq_freeze_queue_wait
  blk-mq: Provide queue freeze wait timeout
  nvme: Complete all stuck requests

 block/blk-mq.c   | 12 +++-
 drivers/nvme/host/core.c | 47 +++
 drivers/nvme/host/nvme.h |  4 
 drivers/nvme/host/pci.c  | 33 -
 include/linux/blk-mq.h   |  3 +++
 5 files changed, 93 insertions(+), 6 deletions(-)

-- 
2.5.5



Re: [PATCH 11/16] mmc: block: shuffle retry and error handling

2017-03-01 Thread Bartlomiej Zolnierkiewicz
On Wednesday, March 01, 2017 04:52:38 PM Bartlomiej Zolnierkiewicz wrote:

> I assume that the problem got introduced even earlier,
> commit 4515dc6 ("mmc: block: shuffle retry and error
> handling") just makes it happen every time.

It seems to be introduced by patch #6. Patch #5 survived
30 consecutive boot+sync iterations (with later patches
the issue shows up during the first 12 iterations).

root@target:~# sync
[  248.801846] INFO: task mmcqd/0:128 blocked for more than 120 seconds.
[  248.806866]   Tainted: GW   4.10.0-rc3-00113-g5750765 #2739
[  248.814051] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  248.821696] mmcqd/0 D0   128  2 0x
[  248.827123] [] (__schedule) from [] (schedule+0x40/0xac)
[  248.834210] [] (schedule) from [] 
(schedule_timeout+0x148/0x220)
[  248.841912] [] (schedule_timeout) from [] 
(wait_for_common+0xb8/0x144)
[  248.850058] [] (wait_for_common) from [] 
(mmc_start_areq+0x40/0x1ac)
[  248.858209] [] (mmc_start_areq) from [] 
(mmc_blk_issue_rw_rq+0x78/0x314)
[  248.866599] [] (mmc_blk_issue_rw_rq) from [] 
(mmc_blk_issue_rq+0x9c/0x458)
[  248.875293] [] (mmc_blk_issue_rq) from [] 
(mmc_queue_thread+0x98/0x180)
[  248.883789] [] (mmc_queue_thread) from [] 
(kthread+0xfc/0x134)
[  248.891058] [] (kthread) from [] 
(ret_from_fork+0x14/0x3c)
[  248.898364] INFO: task jbd2/mmcblk0p2-:136 blocked for more than 120 seconds.
[  248.905400]   Tainted: GW   4.10.0-rc3-00113-g5750765 #2739
[  248.912353] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  248.919923] jbd2/mmcblk0p2- D0   136  2 0x
[  248.925693] [] (__schedule) from [] (schedule+0x40/0xac)
[  248.932470] [] (schedule) from [] 
(jbd2_journal_commit_transaction+0x1e8/0x15c4)
[  248.941552] [] (jbd2_journal_commit_transaction) from [] 
(kjournald2+0xbc/0x264)
[  248.950608] [] (kjournald2) from [] (kthread+0xfc/0x134)
[  248.957660] [] (kthread) from [] 
(ret_from_fork+0x14/0x3c)
[  248.964860] INFO: task kworker/u16:2:730 blocked for more than 120 seconds.
[  248.971780]   Tainted: GW   4.10.0-rc3-00113-g5750765 #2739
[  248.978673] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  248.986686] kworker/u16:2   D0   730  2 0x
[  248.991993] Workqueue: writeback wb_workfn (flush-179:0)
[  248.997230] [] (__schedule) from [] (schedule+0x40/0xac)
[  249.004287] [] (schedule) from [] 
(schedule_timeout+0x148/0x220)
[  249.011997] [] (schedule_timeout) from [] 
(io_schedule_timeout+0x74/0xb0)
[  249.020451] [] (io_schedule_timeout) from [] 
(bit_wait_io+0x10/0x58)
[  249.028545] [] (bit_wait_io) from [] 
(__wait_on_bit_lock+0x74/0xd0)
[  249.036513] [] (__wait_on_bit_lock) from [] 
(out_of_line_wait_on_bit_lock+0x68/0x70)
[  249.046231] [] (out_of_line_wait_on_bit_lock) from [] 
(do_get_write_access+0x3d0/0x4c4)
[  249.055729] [] (do_get_write_access) from [] 
(jbd2_journal_get_write_access+0x38/0x64)
[  249.065336] [] (jbd2_journal_get_write_access) from [] 
(__ext4_journal_get_write_access+0x2c/0x68)
[  249.076016] [] (__ext4_journal_get_write_access) from [] 
(ext4_mb_mark_diskspace_used+0x64/0x474)
[  249.086515] [] (ext4_mb_mark_diskspace_used) from [] 
(ext4_mb_new_blocks+0x258/0xa1c)
[  249.096040] [] (ext4_mb_new_blocks) from [] 
(ext4_ext_map_blocks+0x8b4/0xf28)
[  249.104883] [] (ext4_ext_map_blocks) from [] 
(ext4_map_blocks+0x144/0x5f8)
[  249.113468] [] (ext4_map_blocks) from [] 
(mpage_map_and_submit_extent+0xa4/0x788)
[  249.122641] [] (mpage_map_and_submit_extent) from [] 
(ext4_writepages+0x4e0/0x670)
[  249.131925] [] (ext4_writepages) from [] 
(do_writepages+0x24/0x38)
[  249.139774] [] (do_writepages) from [] 
(__writeback_single_inode+0x28/0x18c)
[  249.148555] [] (__writeback_single_inode) from [] 
(writeback_sb_inodes+0x1e0/0x394)
[  249.157909] [] (writeback_sb_inodes) from [] 
(__writeback_inodes_wb+0x70/0xac)
[  249.166833] [] (__writeback_inodes_wb) from [] 
(wb_writeback+0x18c/0x1b4)
[  249.175324] [] (wb_writeback) from [] 
(wb_workfn+0xd4/0x388)
[  249.182704] [] (wb_workfn) from [] 
(process_one_work+0x120/0x318)
[  249.190464] [] (process_one_work) from [] 
(worker_thread+0x2c/0x4ac)
[  249.198551] [] (worker_thread) from [] 
(kthread+0xfc/0x134)
[  249.205904] [] (kthread) from [] 
(ret_from_fork+0x14/0x3c)
[  249.213094] INFO: task sync:1403 blocked for more than 120 seconds.
[  249.219261]   Tainted: GW   4.10.0-rc3-00113-g5750765 #2739
[  249.226220] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  249.234019] syncD0  1403   1396 0x
[  249.239424] [] (__schedule) from [] (schedule+0x40/0xac)
[  249.246624] [] (schedule) from [] 
(wb_wait_for_completion+0x50/0x7c)
[  249.254538] [] (wb_wait_for_completion) from [] 
(sync_inodes_sb+0x94/0x20c)
[  249.263200] [] (sync_inodes_sb) from [] 
(iterate_supers+0xac/0xd4)
[  249.271056] [] 

Re: [PATCH 1/8] nowait aio: Introduce IOCB_FLAG_NOWAIT

2017-03-01 Thread Goldwyn Rodrigues


On 03/01/2017 09:56 AM, Christoph Hellwig wrote:
> On Wed, Mar 01, 2017 at 07:36:48AM -0800, Christoph Hellwig wrote:
>> Given that we aren't validating aio_flags in older kernels we can't
>> just add this flag as it will be a no-op in older kernels.  I think
>> we will have to add IOCB_CMD_PREADV2/IOCB_CMD_WRITEV2 opcodes that
>> properly validate all reserved fields or flags first.
>>
>> Once we do that I'd really prefer to use the same flags values
>> as preadv2/pwritev2 so that we'll only need one set of flags over
>> sync/async read/write ops.
> 
> I just took another look and we do verify that
> aio_reserved1/aio_reserved2 must be zero.  So I think we can just
> stick RWF_* into aio_reserved1 and fix that problem that way.
> 

RWF_*? Aren't those kernel-space flags? Or did you intend to say
IOCB_FLAG_*? If yes, do we maintain two flag fields? aio_reserved1 (perhaps
renamed to aio_flags2) and aio_flags?

aio_reserved1 is also used to return key for the purpose of io_cancel,
but we should be able to fetch the flags before putting the key value
there. Still I am not comfortable using the same field for it because it
will be overwritten when io_submit returns.

Which brings me to the next question: What is the purpose of aio_key?
Why is aio_key set to KIOCB_KEY (which is zero) every time? You are not
differentiating the requests by setting every iocb's key to zero.


-- 
Goldwyn


Re: [PATCH] blkcg: allocate struct blkcg_gq outside request queue spinlock

2017-03-01 Thread Tejun Heo
Hello,

On Tue, Feb 28, 2017 at 03:51:27PM -0800, Tahsin Erdogan wrote:
> On Tue, Feb 28, 2017 at 2:47 PM, Tejun Heo  wrote:
> >> + if (!blkcg_policy_enabled(q, pol)) {
> >> + ret = -EOPNOTSUPP;
> >> + goto fail;
> >
> > Pulling this out of the queue_lock doesn't seem safe to me.  This
> > function may end up calling into callbacks of disabled policies this
> > way.
> 
> I will move this to within the lock. To make things safe, I am also
> thinking of rechecking both blkcg_policy_enabled()  and
> blk_queue_bypass() after reacquiring the locks in each iteration.
> 
> >> + parent = blkcg_parent(blkcg);
> >> + while (parent && !__blkg_lookup(parent, q, false)) {
> >> + pos = parent;
> >> + parent = blkcg_parent(parent);
> >> + }
> >
> > Hmm... how about adding @new_blkg to blkg_lookup_create() and calling
> > it with non-NULL @new_blkg until it succeeds?  Wouldn't that be
> > simpler?
> >
> >> +
> >> + new_blkg = blkg_alloc(pos, q, GFP_KERNEL);
> 
> The challenge with that approach is creating a new_blkg with the right
> blkcg before passing to blkg_lookup_create(). blkg_lookup_create()
> walks down the hierarchy and will try to fill the first missing entry
> and the preallocated new_blkg must have been created with the right
> blkcg (feel free to send a code fragment if you think I am
> misunderstanding the suggestion).

Ah, indeed, but we can break out allocation of blkg and its
initialization, right?  It's a bit more work but then we'd be able to
do something like.


retry:
new_blkg = alloc;
lock;
sanity checks;
blkg = blkg_lookup_and_create(..., new_blkg);
if (!blkg) {
unlock;
goto retry;
}
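
Slightly fleshed out, purely as an illustration of that shape (this is not
code from the thread; a blkg_lookup_and_create() variant that consumes a
preallocated blkg is hypothetical, and it is assumed to free an unused
@new_blkg itself):

    struct blkcg_gq *blkg, *new_blkg;

retry:
    new_blkg = blkg_alloc(blkcg, q, GFP_KERNEL);    /* may sleep, no locks held */
    if (!new_blkg)
        return ERR_PTR(-ENOMEM);

    rcu_read_lock();
    spin_lock_irq(q->queue_lock);

    /* sanity checks: policy still enabled, queue not bypassing/dying, ... */

    blkg = blkg_lookup_and_create(blkcg, q, new_blkg);
    if (!blkg) {
        /* told to retry, e.g. @new_blkg was consumed for an ancestor */
        spin_unlock_irq(q->queue_lock);
        rcu_read_unlock();
        goto retry;
    }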

Thanks.

-- 
tejun


Re: [PATCH 0/13 v2] block: Fix block device shutdown related races

2017-03-01 Thread Tejun Heo
Hello, Jan.

On Wed, Mar 01, 2017 at 04:37:00PM +0100, Jan Kara wrote:
> > The other thing which came to mind is the congested->__bdi sever
> > semantics.  IIRC, that one was also to support the "bdi must go away now"
> > behavior.  As bdi is refcnted now, I think we can probably just let cong
> > hold onto the bdi rather than try to sever the ref there.
> 
> So currently I get away with __bdi not being a proper refcounted reference.
> If we were to remove the clearing of __bdi, we'd have to make it into a
> refcounted reference, which is slightly ugly as we need to special-case
> embedded bdi_writeback_congested structures. Maybe it will be a worthwhile
> cleanup but for now I left it alone...

Yeah, absolutely, it's an additional step that we can take later.
Nothing urgent.

Thanks.

-- 
tejun


Re: [PATCH 11/16] mmc: block: shuffle retry and error handling

2017-03-01 Thread Bartlomiej Zolnierkiewicz
On Wednesday, March 01, 2017 04:52:38 PM Bartlomiej Zolnierkiewicz wrote:

> I assume that the problem got introduced even earlier,
> commit 4515dc6 ("mmc: block: shuffle retry and error
> handling") just makes it happen every time.

Patch #16 makes it worse as now I get deadlock on boot:

[  248.801750] INFO: task kworker/2:2:113 blocked for more than 120 seconds.
[  248.807119]   Tainted: GW   4.10.0-rc3-00123-g1bec9a6 #2726
[  248.814162] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  248.821943] kworker/2:2 D0   113  2 0x
[  248.827357] Workqueue: events_freezable mmc_rescan
[  248.832227] [] (__schedule) from [] (schedule+0x40/0xac)
[  248.839123] [] (schedule) from [] 
(__mmc_claim_host+0x8c/0x1a0)
[  248.846851] [] (__mmc_claim_host) from [] 
(mmc_attach_mmc+0xb8/0x14c)
[  248.854989] [] (mmc_attach_mmc) from [] 
(mmc_rescan+0x274/0x34c)
[  248.862725] [] (mmc_rescan) from [] 
(process_one_work+0x120/0x318)
[  248.870498] [] (process_one_work) from [] 
(worker_thread+0x2c/0x4ac)
[  248.878653] [] (worker_thread) from [] 
(kthread+0xfc/0x134)
[  248.885934] [] (kthread) from [] 
(ret_from_fork+0x14/0x3c)
[  248.893098] INFO: task jbd2/mmcblk0p2-:132 blocked for more than 120 seconds.
[  248.900092]   Tainted: GW   4.10.0-rc3-00123-g1bec9a6 #2726
[  248.907108] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  248.914904] jbd2/mmcblk0p2- D0   132  2 0x
[  248.920319] [] (__schedule) from [] (schedule+0x40/0xac)
[  248.927433] [] (schedule) from [] 
(schedule_timeout+0x148/0x220)
[  248.935139] [] (schedule_timeout) from [] 
(io_schedule_timeout+0x74/0xb0)
[  248.943634] [] (io_schedule_timeout) from [] 
(bit_wait_io+0x10/0x58)
[  248.951684] [] (bit_wait_io) from [] 
(__wait_on_bit+0x84/0xbc)
[  248.959134] [] (__wait_on_bit) from [] 
(out_of_line_wait_on_bit+0x68/0x70)
[  248.968142] [] (out_of_line_wait_on_bit) from [] 
(jbd2_journal_commit_transaction+0x1468/0x15c4)
[  248.978397] [] (jbd2_journal_commit_transaction) from [] 
(kjournald2+0xbc/0x264)
[  248.987514] [] (kjournald2) from [] (kthread+0xfc/0x134)
[  248.994494] [] (kthread) from [] 
(ret_from_fork+0x14/0x3c)
[  249.001714] INFO: task kworker/1:2H:134 blocked for more than 120 seconds.
[  249.008412]   Tainted: GW   4.10.0-rc3-00123-g1bec9a6 #2726
[  249.015479] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  249.023094] kworker/1:2HD0   134  2 0x
[  249.028510] Workqueue: kblockd blk_mq_run_work_fn
[  249.00] [] (__schedule) from [] (schedule+0x40/0xac)
[  249.040199] [] (schedule) from [] 
(__mmc_claim_host+0x8c/0x1a0)
[  249.047856] [] (__mmc_claim_host) from [] 
(mmc_queue_rq+0x9c/0xa8)
[  249.055736] [] (mmc_queue_rq) from [] 
(blk_mq_dispatch_rq_list+0xd4/0x1d0)
[  249.064316] [] (blk_mq_dispatch_rq_list) from [] 
(blk_mq_process_rq_list+0x180/0x198)
[  249.073845] [] (blk_mq_process_rq_list) from [] 
(__blk_mq_run_hw_queue+0xb8/0x110)
[  249.083120] [] (__blk_mq_run_hw_queue) from [] 
(process_one_work+0x120/0x318)
[  249.092076] [] (process_one_work) from [] 
(worker_thread+0x2c/0x4ac)
[  249.00] [] (worker_thread) from [] 
(kthread+0xfc/0x134)
[  249.107322] [] (kthread) from [] 
(ret_from_fork+0x14/0x3c)
[  249.114485] INFO: task kworker/5:2H:136 blocked for more than 120 seconds.
[  249.121326]   Tainted: GW   4.10.0-rc3-00123-g1bec9a6 #2726
[  249.128232] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  249.136074] kworker/5:2HD0   136  2 0x
[  249.141544] Workqueue: kblockd blk_mq_run_work_fn
[  249.146187] [] (__schedule) from [] (schedule+0x40/0xac)
[  249.153419] [] (schedule) from [] 
(__mmc_claim_host+0x8c/0x1a0)
[  249.160825] [] (__mmc_claim_host) from [] 
(mmc_queue_rq+0x9c/0xa8)
[  249.168755] [] (mmc_queue_rq) from [] 
(blk_mq_dispatch_rq_list+0xd4/0x1d0)
[  249.177318] [] (blk_mq_dispatch_rq_list) from [] 
(blk_mq_process_rq_list+0x180/0x198)
[  249.186858] [] (blk_mq_process_rq_list) from [] 
(__blk_mq_run_hw_queue+0xb8/0x110)
[  249.196124] [] (__blk_mq_run_hw_queue) from [] 
(process_one_work+0x120/0x318)
[  249.204969] [] (process_one_work) from [] 
(worker_thread+0x2c/0x4ac)
[  249.213161] [] (worker_thread) from [] 
(kthread+0xfc/0x134)
[  249.220270] [] (kthread) from [] 
(ret_from_fork+0x14/0x3c)
[  249.227505] INFO: task kworker/0:1H:145 blocked for more than 120 seconds.
[  249.234328]   Tainted: GW   4.10.0-rc3-00123-g1bec9a6 #2726
[  249.241229] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  249.249066] kworker/0:1HD0   145  2 0x
[  249.254521] Workqueue: kblockd blk_mq_run_work_fn
[  249.259176] [] (__schedule) from [] (schedule+0x40/0xac)
[  249.266233] [] (schedule) from [] 
(__mmc_claim_host+0x8c/0x1a0)
[  249.274001] [] (__mmc_claim_host) from [] 

Re: [PATCH 1/8] nowait aio: Introduce IOCB_FLAG_NOWAIT

2017-03-01 Thread Christoph Hellwig
On Wed, Mar 01, 2017 at 07:36:48AM -0800, Christoph Hellwig wrote:
> Given that we aren't validating aio_flags in older kernels we can't
> just add this flag as it will be a no-op in older kernels.  I think
> we will have to add IOCB_CMD_PREADV2/IOCB_CMD_WRITEV2 opcodes that
> properly validate all reserved fields or flags first.
> 
> Once we do that I'd really prefer to use the same flags values
> as preadv2/pwritev2 so that we'll only need one set of flags over
> sync/async read/write ops.

I just took another look and we do verify that
aio_reserved1/aio_reserved2 must be zero.  So I think we can just
stick RWF_* into aio_reserved1 and fix that problem that way.


Re: [PATCH 11/16] mmc: block: shuffle retry and error handling

2017-03-01 Thread Bartlomiej Zolnierkiewicz
On Wednesday, March 01, 2017 12:45:57 PM Bartlomiej Zolnierkiewicz wrote:
> 
> Hi,
> 
> On Tuesday, February 28, 2017 06:45:20 PM Bartlomiej Zolnierkiewicz wrote:
> > On Thursday, February 09, 2017 04:33:58 PM Linus Walleij wrote:
> > > Instead of doing retries at the same time as trying to submit new
> > > requests, do the retries when the request is reported as completed
> > > by the driver, in the finalization worker.
> > > 
> > > This is achieved by letting the core worker call back into the block
> > > layer using mmc_blk_rw_done(), that will read the status and repeatedly
> > > try to hammer the request using single request etc by calling back to
> > > the core layer using mmc_restart_areq()
> > > 
> > > The beauty of it is that the completion will not complete until the
> > > block layer has had the opportunity to hammer a bit at the card using
> > > a bunch of different approaches in the while() loop in
> > > mmc_blk_rw_done()
> > > 
> > > The algorithm for recapture, retry and handle errors is essentially
> > > identical to the one we used to have in mmc_blk_issue_rw_rq(),
> > > only augmented to get called in another path.
> > > 
> > > We have to add and initialize a pointer back to the struct mmc_queue
> > > from the struct mmc_queue_req to find the queue from the asynchronous
> > > request.
> > > 
> > > Signed-off-by: Linus Walleij 
> > 
> > It seems that after this change we can end up queuing more
> > work for kthread from the kthread worker itself and wait
> > inside it for this nested work to complete.  I hope that
> 
> On the second look it seems that there is no waiting for
> the retried areq to complete so I cannot see what protects
> us from racing and trying to run two areq-s in parallel:
> 
> 1st areq being retried (in the completion kthread):
> 
>   mmc_blk_rw_done()->mmc_restart_areq()->__mmc_start_data_req()
> 
> 2nd areq coming from the second request in the queue
> (in the queuing kthread):
> 
>   mmc_blk_issue_rw_rq()->mmc_start_areq()->__mmc_start_data_req()
> 
> (after mmc_blk_rw_done() is done in mmc_finalize_areq() 1st
> areq is marked as completed by the completion kthread and
> the waiting on host->areq in mmc_start_areq() of the queuing
> kthread is done and 2nd areq is started while the 1st one
> is still being retried)
> 
> ?
> 
> Also, retrying of areqs for the MMC_BLK_RETRY status case got broken
> (before the change the do {} while() loop increased the retry variable;
> now the loop is gone, the retry variable will not be increased
> correctly, and we can loop forever).

There is another problem with this patch.

During boot there is ~30 sec delay and later I get deadlock
on trying to run sync command (first thing I do after boot):

...
[5.960623] asoc-simple-card sound: HiFi <-> 383.i2s mapping ok
done.
[] Waiting for /dev to be fully populated...[   17.745887] random: crng 
init done
done.
[] Activating swap...done.
[   39.767982] EXT4-fs (mmcblk0p2): re-mounted. Opts: (null)
...
root@target:~# sync
[  248.801708] INFO: task udevd:287 blocked for more than 120 seconds.
[  248.806552]   Tainted: GW   4.10.0-rc3-00118-g4515dc6 #2736
[  248.813590] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  248.821275] udevd   D0   287249 0x0005
[  248.826815] [] (__schedule) from [] (schedule+0x40/0xac)
[  248.833889] [] (schedule) from [] 
(schedule_timeout+0x148/0x220)
[  248.841598] [] (schedule_timeout) from [] 
(io_schedule_timeout+0x74/0xb0)
[  248.849993] [] (io_schedule_timeout) from [] 
(__lock_page+0xe8/0x118)
[  248.858235] [] (__lock_page) from [] 
(truncate_inode_pages_range+0x580/0x59c)
[  248.867053] [] (truncate_inode_pages_range) from [] 
(truncate_inode_pages+0x18/0x20)
[  248.876525] [] (truncate_inode_pages) from [] 
(__blkdev_put+0x68/0x1d8)
[  248.884828] [] (__blkdev_put) from [] 
(blkdev_close+0x18/0x20)
[  248.892375] [] (blkdev_close) from [] (__fput+0x84/0x1c0)
[  248.899383] [] (__fput) from [] (task_work_run+0xbc/0xdc)
[  248.906593] [] (task_work_run) from [] 
(do_exit+0x304/0x9bc)
[  248.913938] [] (do_exit) from [] 
(do_group_exit+0x3c/0xbc)
[  248.921046] [] (do_group_exit) from [] 
(get_signal+0x200/0x65c)
[  248.928776] [] (get_signal) from [] 
(do_signal+0x84/0x3c4)
[  248.935970] [] (do_signal) from [] 
(do_work_pending+0xa4/0xb4)
[  248.943506] [] (do_work_pending) from [] 
(slow_work_pending+0xc/0x20)
[  248.951637] INFO: task sync:1398 blocked for more than 120 seconds.
[  248.957756]   Tainted: GW   4.10.0-rc3-00118-g4515dc6 #2736
[  248.965052] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  248.972681] syncD0  1398   1390 0x
[  248.978117] [] (__schedule) from [] (schedule+0x40/0xac)
[  248.985174] [] (schedule) from [] 
(schedule_preempt_disabled+0x14/0x20)
[  248.993609] [] (schedule_preempt_disabled) from [] 
(__mutex_lock_slowpath+0x480/0x6ec)
[  249.003153] [] 

Re: [PATCH 0/13 v2] block: Fix block device shutdown related races

2017-03-01 Thread Jan Kara
On Tue 28-02-17 23:26:53, Omar Sandoval wrote:
> On Tue, Feb 28, 2017 at 11:25:28PM -0800, Omar Sandoval wrote:
> > On Wed, Feb 22, 2017 at 11:24:25AM +0100, Jan Kara wrote:
> > > On Tue 21-02-17 10:19:28, Jens Axboe wrote:
> > > > On 02/21/2017 10:09 AM, Jan Kara wrote:
> > > > > Hello,
> > > > > 
> > > > > this is a second revision of the patch set to fix several different 
> > > > > races and
> > > > > issues I've found when testing device shutdown and reuse. The first 
> > > > > three
> > > > > patches are fixes to problems in my previous series fixing BDI 
> > > > > lifetime issues.
> > > > > Patch 4 fixes issues with reuse of BDI name with scsi devices. With 
> > > > > it I cannot
> > > > > reproduce the BDI name reuse issues using Omar's stress test using 
> > > > > scsi_debug
> > > > > so it can be used as a replacement of Dan's patches. Patches 5-11 fix 
> > > > > oops that
> > > > > is triggered by __blkdev_put() calling inode_detach_wb() too early 
> > > > > (the problem
> > > > > reported by Thiago). Patches 12 and 13 fix oops due to a bug in 
> > > > > gendisk code
> > > > > where get_gendisk() can return already freed gendisk structure (again 
> > > > > triggered
> > > > > by Omar's stress test).
> > > > > 
> > > > > People, please have a look at patches. They are mostly simple however 
> > > > > the
> > > > > interactions are rather complex so I may have missed something. Also 
> > > > > I'm
> > > > > happy for any additional testing these patches can get - I've 
> > > > > stressed them
> > > > > with Omar's script, tested memcg writeback, tested static (not udev 
> > > > > managed)
> > > > > device inodes.
> > > > > 
> > > > > Jens, I think at least patches 1-3 should go in together with my 
> > > > > fixes you
> > > > > already have in your tree (or shortly after them). It is up to you 
> > > > > whether
> > > > > you decide to delay my first fixes or pick these up quickly. Patch 4 
> > > > > is
> > > > > (IMHO a cleaner) replacement of Dan's patches so consider whether you 
> > > > > want
> > > > > to use it instead of those patches.
> > > > 
> > > > I have applied 1-3 to my for-linus branch, which will go in after
> > > > the initial pull request has been pulled by Linus. Consider fixing up
> > > > #4 so it applies, I like it.
> > > 
> > > OK, attached is patch 4 rebased on top of Linus' tree from today which
> > > already has linux-block changes pulled in. I've left put_disk_devt() in
> > > blk_cleanup_queue() to maintain the logic in the original patch (now 
> > > commit
> > > 0dba1314d4f8) that request_queue and gendisk each hold one devt reference.
> > > The bdi_unregister() call that is moved to del_gendisk() by this patch is
> > > now protected by the gendisk reference instead of the request_queue one
> > > so it still maintains the property that devt reference protects bdi
> > > registration-unregistration lifetime (as much as that is not needed 
> > > anymore
> > > after this patch).
> > > 
> > > I have also updated the comment in the code and the changelog - they were
> > > somewhat stale after changes to the whole series Tejun suggested.
> > > 
> > >   Honza
> > 
> > Hey, Jan, I just tested this out when I was seeing similar crashes with
> > sr instead of sd, and this fixed it.
> > 
> > Tested-by: Omar Sandoval 
> 
> Just realized it wasn't clear, I'm talking about patch 4 specifically.

Thanks for confirmation!

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 7/8] nowait aio: xfs

2017-03-01 Thread Christoph Hellwig
> @@ -528,12 +528,17 @@ xfs_file_dio_aio_write(
>   ((iocb->ki_pos + count) & mp->m_blockmask)) {
>   unaligned_io = 1;
>   iolock = XFS_IOLOCK_EXCL;
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EAGAIN;

So all unaligned I/O will return -EAGAIN?  Why?  Also please explain
that reason in a comment right here.
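
For illustration, the kind of comment being asked for might read like the
sketch below; the stated rationale (exclusive iolock plus possible sub-block
zeroing) is an assumption about why unaligned direct writes must be allowed
to block, not wording taken from the patch:

	if ((iocb->ki_pos & mp->m_blockmask) ||
	    ((iocb->ki_pos + count) & mp->m_blockmask)) {
		unaligned_io = 1;
		iolock = XFS_IOLOCK_EXCL;
		/*
		 * Suggested wording (assumption): unaligned direct writes
		 * serialize against other I/O under the exclusive iolock
		 * and may need sub-block zeroing, both of which can block,
		 * so they cannot be issued under IOCB_NOWAIT.
		 */
		if (iocb->ki_flags & IOCB_NOWAIT)
			return -EAGAIN;
	}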

> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 1aa3abd..84f981a 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1020,6 +1020,11 @@ xfs_file_iomap_begin(
>   if ((flags & IOMAP_REPORT) ||
>   (xfs_is_reflink_inode(ip) &&
>(flags & IOMAP_WRITE) && (flags & IOMAP_DIRECT))) {
> + /* Allocations due to reflinks */
> + if ((flags & IOMAP_NOWAIT) && !(flags & IOMAP_REPORT)) {
> + error = -EAGAIN;
> + goto out_unlock;
> + }

FYI, this code looks very different in current Linus' tree - I think
you're on some old kernel base.


Re: [PATCH] sbitmap: boundary checks for resized bitmap

2017-03-01 Thread Hannes Reinecke
On 02/28/2017 08:15 PM, Omar Sandoval wrote:
> On Wed, Feb 15, 2017 at 12:10:42PM +0100, Hannes Reinecke wrote:
>> If the sbitmap gets resized we need to ensure not to overflow
>> the original allocation. And we should limit the search in
>> sbitmap_any_bit_set() to the available depth to avoid looking
>> into unused space.
> 
> Hey, Hannes, I don't really like this change. It's easy enough for the
> caller to keep track of this and check themselves if they really care. I
> even included a comment in sbitmap.h to that effect:
> 
Okay.

> /**
>  * sbitmap_resize() - Resize a &struct sbitmap.
>  * @sb: Bitmap to resize.
>  * @depth: New number of bits to resize to.
>  *
>  * Doesn't reallocate anything. It's up to the caller to ensure that the new
>  * depth doesn't exceed the depth that the sb was initialized with.
>  */
> 
> 
> As for the sbitmap_any_bit_set() change, the bits beyond the actual
> depth should all be zero, so I don't think that change is worth it,
> either.
> 
Hmm. That would be okay if we can be sure that the remaining bits really
are zero. Which probably would need to be checked by the caller, too.

So yeah, if you don't like it, okay.
Just ignore it then.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


Re: [PATCH 3/8] nowait aio: return if direct write will trigger writeback

2017-03-01 Thread Christoph Hellwig
On Tue, Feb 28, 2017 at 07:46:06PM -0800, Matthew Wilcox wrote:
> Ugh, this is pretty inefficient.  If that's all you want to know, then
> using the radix tree directly will be far more efficient than spinning
> up all the pagevec machinery only to discard the pages found.
> 
> But what's going to kick these pages out of cache?  Shouldn't we rather
> find the pages, kick them out if clean, start writeback if not, and *then*
> return -EAGAIN?
> 
> So maybe we want to spin up the pagevec machinery after all so we can
> do that extra work?

As pointed out in the last round of these patches I think we really
need to pass a flags argument to filemap_write_and_wait_range to
communicate the non-blocking nature and only return -EAGAIN if we'd
block.  As a bonus that can indeed start to kick the pages out.
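
As a rough sketch of the shape being suggested, the call site in the direct
write path could end up looking something like the following;
filemap_write_and_wait_range_nonblock() and its exact -EAGAIN contract are
assumptions drawn from this discussion, not an existing kernel interface:

	if (iocb->ki_flags & IOCB_NOWAIT) {
		/*
		 * Hypothetical non-blocking variant: returns -EAGAIN if any
		 * page in the range is dirty or under writeback, and may
		 * kick off asynchronous writeback so a later retry succeeds.
		 */
		ret = filemap_write_and_wait_range_nonblock(mapping, pos,
							    pos + count - 1);
	} else {
		ret = filemap_write_and_wait_range(mapping, pos,
						   pos + count - 1);
	}
	if (ret)
		return ret;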


Re: [PATCH 2/8] nowait aio: Return if cannot get hold of i_rwsem

2017-03-01 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 1/8] nowait aio: Introduce IOCB_FLAG_NOWAIT

2017-03-01 Thread Christoph Hellwig
On Tue, Feb 28, 2017 at 05:36:03PM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> This flag informs kernel to bail out if an AIO request will block
> for reasons such as file allocations, or a writeback triggered,
> or would block while allocating requests while performing
> direct I/O.
> 
> IOCB_FLAG_NOWAIT is translated to IOCB_NOWAIT for
> iocb->ki_flags.

Given that we aren't validating aio_flags in older kernels we can't
just add this flag as it will be a no-op in older kernels.  I think
we will have to add IOCB_CMD_PREADV2/IOCB_CMD_WRITEV2 opcodes that
properly validate all reserved fields or flags first.

Once we do that I'd really prefer to use the same flags values
as preadv2/pwritev2 so that we'll only need one set of flags over
sync/async read/write ops.
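
For reference, this is roughly what the per-op flag model looks like from
userspace; RWF_NOWAIT is used here only as an example of the kind of
preadv2/pwritev2 flag being argued for, and its availability and exact
semantics depend on the kernel and glibc in use:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("testfile", O_WRONLY | O_CREAT, 0644);
	if (fd < 0) { perror("open"); return 1; }

	char buf[] = "hello\n";
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) - 1 };

	/* Ask the kernel to fail with -EAGAIN rather than block. */
	ssize_t ret = pwritev2(fd, &iov, 1, 0, RWF_NOWAIT);
	if (ret < 0 && (errno == EAGAIN || errno == EOPNOTSUPP)) {
		/* Would block, or the flag is unsupported here: fall back
		 * to a plain, possibly blocking write. */
		ret = pwritev2(fd, &iov, 1, 0, 0);
	}
	if (ret < 0) { perror("pwritev2"); return 1; }

	close(fd);
	return 0;
}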


Re: [PATCH 0/13 v2] block: Fix block device shutdown related races

2017-03-01 Thread Jan Kara
Hello,

On Tue 28-02-17 11:54:41, Tejun Heo wrote:
> It generally looks good to me.

Thanks for review!

> The only worry I have is around wb_shutdown() synchronization and if that
> is actually an issue it shouldn't be too difficult to fix.

Yeah, I'll have a look at that.

> The other thing which came to mind is that the congested->__bdi sever
> semantics.  IIRC, that one was also to support the "bdi must go away now"
> behavior.  As bdi is refcnted now, I think we can probably just let cong
> hold onto the bdi rather than try to sever the ref there.

So currently I get away with __bdi not being a proper refcounted reference.
If we were to remove the clearing of __bdi, we'd have to make it into
refcounted reference which is slightly ugly as we need to special-case
embedded bdi_writeback_congested structures. Maybe it will be a worthwhile
cleanup but for now I left it alone...

Honza
-- 
Jan Kara 
SUSE Labs, CR


RE: connect cmd error for nvme-rdma with eventual kernel crash

2017-03-01 Thread Parav Pandit
Thanks Omar.

> -Original Message-
> From: Omar Sandoval [mailto:osan...@osandov.com]
> Sent: Tuesday, February 28, 2017 11:51 PM
> To: Parav Pandit 
> Cc: Jens Axboe ; ax...@fb.com; linux-
> bl...@vger.kernel.org; Sagi Grimberg ; Christoph
> Hellwig ; Max Gurtovoy 
> Subject: Re: connect cmd error for nvme-rdma with eventual kernel crash
> 
> On Wed, Mar 01, 2017 at 04:55:23AM +, Parav Pandit wrote:
> > Hi Jens,
> >
> > > -Original Message-
> > > From: Jens Axboe [mailto:ax...@kernel.dk]
> > > Subject: Re: connect cmd error for nvme-rdma with eventual kernel
> > > crash
> > >
> > > > On Feb 28, 2017, at 5:57 PM, Parav Pandit 
> wrote:
> > > >
> > > > Hi Jens,
> > > >
> > > > With your commit 2af8cbe30531eca73c8f3ba277f155fc0020b01a in
> > > > linux-block git tree, There are two requests tables. Static and
> > > > dynamic of
> > > same size.
> > > > However function blk_mq_tag_to_rq() always try to get the tag from
> > > > the
> > > dynamic table which doesn't seem to be always initialized.
> > > >
> > > > I am running nvme-rdma initiator and it fails to find the request
> > > > for the
> > > given tag when command completes.
> > > > Command triggers error recovery with "tag not found" error.
> > > > Eventually kernel is crashing in blk_mq_queue_tag_busy_iter() with
> > > > NULL
> > > pointer. Seems to be additional bug in error recovery.
> > > >
> > > > To debug, I added initializing dynamic tags as well.
> > > >
> > > > blk_mq_alloc_rqs() {
> > > >tags->static_rqs[i] = rq;
> > > > +tags->rqs[i] = rq;
> > > >
> > > > This appears to resolve the issue. But that's not the fix.
> > > > It appears to me that nvme stack is broken in certain conditions
> > > > with recent
> > > static and dynamic rq tables change.
> > >
> > > Can you try my for-linus branch?
> >
> > I tried for-linus branch and it works.
> >
> > Seems like ac6e0c2d633ab0411810fe6b15a40808309041db fixes it.
> > __blk_mq_alloc_request()
> > data->hctx->tags->rqs[rq->tag] = rq;
> >
> > Commit says no functional difference but it is actually fixing this issue.
> >
> > Parav
> 
> The fix is actually this one:
> 
> commit f867f4804d55adeef42c68c89edad49cdf3058f7
> Author: Omar Sandoval 
> Date:   Mon Feb 27 10:28:27 2017 -0800
> 
> blk-mq: make blk_mq_alloc_request_hctx() allocate a scheduler request
> 
> blk_mq_alloc_request_hctx() allocates a driver request directly, unlike
> its blk_mq_alloc_request() counterpart. It also crashes because it
> doesn't update the tags->rqs map.
> 
> Fix it by making it allocate a scheduler request.
> 
> Reported-by: Sagi Grimberg 
> Signed-off-by: Omar Sandoval 
> Signed-off-by: Jens Axboe 
> Tested-by: Sagi Grimberg 
> 
> The commit you pointed out would have also fixed it, but after this change
> it's a no-op.


Re: [PATCH 07/12] xfs: implement failure-atomic writes

2017-03-01 Thread Christoph Hellwig
On Tue, Feb 28, 2017 at 03:09:40PM -0800, Darrick J. Wong wrote:
> By the way, the copy on write code remembers the extents it has
> allocated for CoW staging in the refcount btree so that it can free them
> after a crash, which means that O_ATOMIC requires reflink to be enabled.

Yeah.

> There doesn't seem to be any explicit checking that reflink is even
> enabled, which will probably just lead to weird crashes on a pre-reflink
> xfs.

True.  I had this earlier when I had basic O_ATOMIC validity checking,
but that was dropped from the series I posted.

> 
> FWIW I didn't see any checking anywhere (vfs or xfs) that the filesystem
> can actually support O_ATOMIC.  If the FS doesn't support atomic writes,
> shouldn't the kernel send EINVAL or something back to userspace?

Older kernels can't check it, so having new ones check it creates even
more of a mess.

I'm still not feeling very well about O_ATOMIC - either we need an
open2 that checks for unknown flags, or I need to change this to
a per-op flag - RWF_ATOMIC for write (pwritev2 actually), and MAP_ATOMIC
for mmap.  But given that pwritev2 isn't really supported in common
userland yet that might be rather painful.

> At the start of xfs_reflink.c is a long block comment describing how the
> copy on write mechanism works.  Since O_ATOMIC is a variant on CoW (it's
> basically CoW with remapping deferred until fsync), please update the
> comment so that the comments capture the details of how atomic writes
> work.
> 
> (IOWs: Dave asked me to leave the big comment, so I'm going to try to
> keep it fairly up to date.)

I'll add some information to it.

> I suppose it goes without saying that userspace will have to coordinate
> its O_ATOMIC writes to the file.

It does - but if you have multiple writers to a file they really need
to be coordinated anyway.  If you have threads whose updates race
you'd need something like

open(O_TMPFILE)
clone file (or range) into tempfile

update tempfile

clone region you want atomically inserted back into the original file.

We can actually do that with existing primitives, but it's a bit more
heavyweight.  We could optimize this a bit by checking if an extent
already points to the same physical blocks before replacing it in
clone_file_range.
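
Spelled out with those existing primitives, the sequence might look roughly
like the sketch below. It assumes a reflink-capable filesystem (XFS with
reflink enabled, or btrfs), a 4 KiB filesystem block size, and a data.db of
at least 8 KiB; the file name, offsets and record contents are purely
illustrative.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	int orig = open("data.db", O_RDWR);
	/* Unnamed tempfile on the same filesystem as data.db. */
	int tmp = open(".", O_TMPFILE | O_RDWR, 0600);
	if (orig < 0 || tmp < 0) { perror("open"); return 1; }

	/* Share all of data.db's blocks with the tempfile, no data copy. */
	if (ioctl(tmp, FICLONE, orig) < 0) { perror("FICLONE"); return 1; }

	/* Update a region in the tempfile; the write goes to new blocks. */
	char rec[] = "updated record";
	if (pwrite(tmp, rec, sizeof(rec), 4096) < 0) { perror("pwrite"); return 1; }
	if (fsync(tmp) < 0) { perror("fsync"); return 1; }

	/* Splice the updated, block-aligned region back into the original
	 * file by remapping extents rather than copying data. */
	struct file_clone_range fcr = {
		.src_fd = tmp,
		.src_offset = 4096,
		.src_length = 4096,
		.dest_offset = 4096,
	};
	if (ioctl(orig, FICLONERANGE, &fcr) < 0) { perror("FICLONERANGE"); return 1; }
	if (fsync(orig) < 0) { perror("fsync"); return 1; }

	close(tmp);
	close(orig);
	return 0;
}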

> > +   if (file->f_flags & O_ATOMIC)
> > +   printk_ratelimited("O_ATOMIC!\n");
> 
> Per above,
> 
> if (file->f_flags & O_ATOMIC) {
>   if (!xfs_sb_version_hasreflink(...))
>   return -EPROTONOSUPPORT;

Yeah.

>   printk_ratelimited("EXPERIMENTAL atomic writes feature in use!\n");

And that should just go away - it was a local debug aid :)


Re: [PATCH 05/12] fs: add a F_IOINFO fcntl

2017-03-01 Thread Christoph Hellwig
On Tue, Feb 28, 2017 at 08:51:39AM -0800, Darrick J. Wong wrote:
> Hm... is fio_alignment is specified in units of bytes?

Yes.

> If so, then
> shouldn't this be a __u32 so that we can handle some weird future device
> that wants, say, 1MB alignment for its atomic IO?

That would be pretty useless.  Anything bigger than sector / block
size would not really be usable for typical applications.

> Though, now that I look at the XFS ioinfo patch, I guess fio_alignment
> is set only for O_DIRECT files?

Yes.

> So it's really the required alignment
> for directio operations.

For buffered I/O we can write at byte granularity and still use the
atomic commits, but for direct I/O we can only COW at block size
granularity.


Re: [RFC] failure atomic writes for file systems and block devices

2017-03-01 Thread Christoph Hellwig
On Tue, Feb 28, 2017 at 03:22:04PM -0800, Darrick J. Wong wrote:
> (Assuming there's no syncv involved here...?)

No.  While I think we could implement it for XFS similar how we roll
transactions over multiple inodes for a few transactions, the use case
is much more limited, and the potential pitfalls are much bigger.

> > have to check the F_IOINFO fcntl before, which is a bit of a killer.
> > Because of that I've also not implemented any other validity checks
> > yet, as they might make things even worse when an open on a not supported
> > file system or device fails, but not on an old kernel.  Maybe we need
> > a new open version that checks arguments properly first?
> 
> Does fcntl(F_SETFL...) suffer from this?

Yes.


Re: [RFC] failure atomic writes for file systems and block devices

2017-03-01 Thread Christoph Hellwig
On Tue, Feb 28, 2017 at 03:48:16PM -0500, Chris Mason wrote:
> One thing that isn't clear to me is how we're dealing with boundary bio 
> mappings, which will get submitted by submit_page_section()
>
> sdio->boundary = buffer_boundary(map_bh);

The old dio code is not supported at all by this code at the moment.
We'll either need the new block dev direct I/O code on block
devices (and limit to BIO_MAX_PAGES, this is a bug in this patchset
if people ever have devices with > 1MB atomic write support.  And thanks
to NVMe the failure case is silent, sigh..), or we need file system support
for out of place writes.

>
> In btrfs I'd just chain things together and do the extent pointer swap 
> afterwards, but I didn't follow the XFS code well enough to see how its 
> handled there.  But either way it feels like an error prone surprise 
> waiting for later, and one gap we really want to get right in the FS 
> support is O_ATOMIC across a fragmented extent.
>
> If I'm reading the XFS patches right, the code always cows for atomic.

It doesn't really COW - it uses the COW infrastructure to write out of
place and then commit it into the file later.  Because of that we don't
really care about things like boundary blocks (which XFS never used in
that form anyway) - data is written first, the cache is flushed and then
we swap around the extent pointers.

> Are 
> you planning on adding an optimization to use atomic support in the device 
> to skip COW when possible?

We could do that fairly easily for files that have a contiguous mapping
for the atomic write I/O.  But at this point I have a lot more trust in
the fs code than the devices, especially due to the silent failure mode.


Re: [RFC] failure atomic writes for file systems and block devices

2017-03-01 Thread Christoph Hellwig
On Wed, Mar 01, 2017 at 01:21:41PM +0200, Amir Goldstein wrote:
> [CC += linux-...@vger.kernel.org] for that question and for the new API

We'll need to iterate over the API a few more times first I think..


Re: [PATCH] sbitmap: boundary checks for resized bitmap

2017-03-01 Thread Omar Sandoval
On Wed, Feb 15, 2017 at 12:10:42PM +0100, Hannes Reinecke wrote:
> If the sbitmap gets resized we need to ensure not to overflow
> the original allocation. And we should limit the search in
> sbitmap_any_bit_set() to the available depth to avoid looking
> into unused space.

Hey, Hannes, I don't really like this change. It's easy enough for the
caller to keep track of this and check themselves if they really care. I
even included a comment in sbitmap.h to that effect:

/**
 * sbitmap_resize() - Resize a &struct sbitmap.
 * @sb: Bitmap to resize.
 * @depth: New number of bits to resize to.
 *
 * Doesn't reallocate anything. It's up to the caller to ensure that the new
 * depth doesn't exceed the depth that the sb was initialized with.
 */


As for the sbitmap_any_bit_set() change, the bits beyond the actual
depth should all be zero, so I don't think that change is worth it,
either.

Thanks!
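
To make the caller-tracks-the-depth rule concrete, here is a minimal
kernel-side sketch (illustrative only; error handling is trimmed, and the
sbitmap_init_node() signature is the one from this era of the kernel):

#include <linux/gfp.h>
#include <linux/numa.h>
#include <linux/sbitmap.h>

static int example_sbitmap_usage(struct sbitmap *sb)
{
	/* The caller remembers the depth it allocated for ... */
	const unsigned int allocated_depth = 256;
	int ret;

	ret = sbitmap_init_node(sb, allocated_depth, -1, GFP_KERNEL,
				NUMA_NO_NODE);
	if (ret)
		return ret;

	/* Shrinking is always fine: nothing is reallocated. */
	sbitmap_resize(sb, 64);

	/* ... because growing again is only safe up to that original depth. */
	sbitmap_resize(sb, allocated_depth);

	return 0;
}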


Re: [PATCH 15/16] mmc: queue: issue requests in massive parallel

2017-03-01 Thread Bartlomiej Zolnierkiewicz
On Thursday, February 09, 2017 04:34:02 PM Linus Walleij wrote:
> This makes a crucial change to the issuing mechanism for the
> MMC requests:
> 
> Before commit "mmc: core: move the asynchronous post-processing"
> some parallelism on the read/write requests was achieved by
> speculatively postprocessing a request and re-preprocess and
> re-issue the request if something went wrong, which we discover
> later when checking for an error.
> 
> This is kind of ugly. Instead we need a mechanism like here:
> 
> We issue requests, and when they come back from the hardware,
> we know if they finished successfully or not. If the request
> was successful, we complete the asynchronous request and let a
> new request immediately start on the hardware. If, and only if,
> it returned an error from the hardware we go down the error
> path.
> 
> This is achieved by splitting the work path from the hardware
> in two: a successful path ending up calling down to
> mmc_blk_rw_done_success() and an errorpath calling down to
> mmc_blk_rw_done_error().
> 
> This has a profound effect: we reintroduce the parallelism on
> the successful path as mmc_post_req() can now be called
> while the next request is in transit (just like prior to
> commit "mmc: core: move the asynchronous post-processing")
> but ALSO we can call mmc_queue_bounce_post() and
> blk_end_request() in parallel.
> 
> The latter has the profound effect of issuing a new request
> again so that we actually need to have at least three requests
> in transit at the same time: we haven't yet dropped the
> reference to our struct mmc_queue_req so we need at least
> three. I put the pool to 4 requests for now.
> 
> I expect the improvement to be noticeable on systems that use
> bounce buffers since they can now process requests in parallel
> with post-processing their bounce buffers, but I don't have a
> test target for that.
> 
> Signed-off-by: Linus Walleij 
> ---
>  drivers/mmc/core/block.c | 61 
> +---
>  drivers/mmc/core/block.h |  4 +++-
>  drivers/mmc/core/core.c  | 27 ++---
>  drivers/mmc/core/queue.c |  2 +-
>  4 files changed, 75 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
> index acca15cc1807..f1008ce5376b 100644
> --- a/drivers/mmc/core/block.c
> +++ b/drivers/mmc/core/block.c
> @@ -1622,8 +1622,51 @@ static void mmc_blk_rw_try_restart(struct mmc_queue_req *mq_rq)
>   mmc_restart_areq(mq->card->host, &mq_rq->areq);
>  }
>  
> -void mmc_blk_rw_done(struct mmc_async_req *areq,
> -  enum mmc_blk_status status)
> +/**
> + * Final handling of an asynchronous request if there was no error.
> + * This is the common path that we take when everything is nice
> + * and smooth. The status from the command is always MMC_BLK_SUCCESS.
> + */
> +void mmc_blk_rw_done_success(struct mmc_async_req *areq)
> +{
> + struct mmc_queue_req *mq_rq;
> + struct mmc_blk_request *brq;
> + struct mmc_blk_data *md;
> + struct request *old_req;
> + bool req_pending;
> + int type;
> +
> + mq_rq = container_of(areq, struct mmc_queue_req, areq);
> + md = mq_rq->mq->blkdata;
> + brq = &mq_rq->brq;
> + old_req = mq_rq->req;
> + type = rq_data_dir(old_req) == READ ? MMC_BLK_READ : MMC_BLK_WRITE;
> +
> + mmc_queue_bounce_post(mq_rq);
> + mmc_blk_reset_success(md, type);
> + req_pending = blk_end_request(old_req, 0,
> +   brq->data.bytes_xfered);
> + /*
> +  * If the blk_end_request function returns non-zero even
> +  * though all data has been transferred and no errors
> +  * were returned by the host controller, it's a bug.
> +  */
> + if (req_pending) {
> + pr_err("%s BUG rq_tot %d d_xfer %d\n",
> +__func__, blk_rq_bytes(old_req),
> +brq->data.bytes_xfered);

What has happened to the mmc_blk_rw_cmd_abort() call?

> + return;
> + }
> +}
> +
> +/**
> + * Error, recapture, retry etc for asynchronous requests.
> + * This is the error path that we take when there is bad status
> + * coming back from the hardware and we need to do a bit of
> + * cleverness.
> + */
> +void mmc_blk_rw_done_error(struct mmc_async_req *areq,
> +enum mmc_blk_status status)
>  {
>   struct mmc_queue *mq;
>   struct mmc_queue_req *mq_rq;
> @@ -1652,6 +1695,8 @@ void mmc_blk_rw_done(struct mmc_async_req *areq,
>  
>   switch (status) {
>   case MMC_BLK_SUCCESS:
> + pr_err("%s: MMC_BLK_SUCCESS on error path\n", __func__);
> + /* This should not happen: anyway fall through */
>   case MMC_BLK_PARTIAL:
>   /*
>* A block was successfully transferred.
> @@ -1660,18 +1705,6 @@ void mmc_blk_rw_done(struct mmc_async_req *areq,
>  
>   req_pending = blk_end_request(old_req, 0,
>  

Re: [RFC] failure atomic writes for file systems and block devices

2017-03-01 Thread Amir Goldstein
On Tue, Feb 28, 2017 at 4:57 PM, Christoph Hellwig  wrote:
> Hi all,
>
> this series implements a new O_ATOMIC flag for failure atomic writes
> to files.   It is based on and tries to unify to earlier proposals,
> the first one for block devices by Chris Mason:
>
> https://lwn.net/Articles/573092/
>
> and the second one for regular files, published by HP Research at
> Usenix FAST 2015:
>
> 
> https://www.usenix.org/conference/fast15/technical-sessions/presentation/verma
>
> It adds a new O_ATOMIC flag for open, which requests writes to be
> failure-atomic, that is either the whole write makes it to persistent
> storage, or none of it, even in case of power of other failures.
>
> There are two implementation variants of this:  on block devices O_ATOMIC
> must be combined with O_(D)SYNC so that storage devices that can handle
> large writes atomically can simply do that without any additional work.
> This case is supported by NVMe.
>
> The second case is for file systems, where we simply write new blocks
> out of places and then remap them into the file atomically on either
> completion of an O_(D)SYNC write or when fsync is called explicitly.
>
> The semantics of the latter case are explained in detail at the Usenix
> paper above.
>
> Last but not least a new fcntl is implemented to provide information
> about I/O restrictions such as alignment requirements and the maximum
> atomic write size.
>
> The implementation is simple and clean, but I'm rather unhappy about
> the interface as it has too many failure modes to bullet proof.  For
> one old kernels ignore unknown open flags silently, so applications
> have to check the F_IOINFO fcntl before, which is a bit of a killer.
> Because of that I've also not implemented any other validity checks
> yet, as they might make things even worse when an open on a not supported
> file system or device fails, but not on an old kernel.  Maybe we need
> a new open version that checks arguments properly first?
>

[CC += linux-...@vger.kernel.org] for that question and for the new API

> Also I'm really worried about the NVMe failure modes - devices simply
> advertise an atomic write size, with no way for the device to know
> that the host requested a given write to be atomic, and thus no
> error reporting.  This is made worse by NVMe 1.2 adding per-namespace
> atomic I/O parameters that devices can use to introduce additional
> odd alignment quirks - while there is some language in the spec
> requiring them not to weaken the per-controller guarantees it all
> looks rather weak and I'm not too confident in all implementations
> getting everything right.
>
> Last but not least this depends on a few XFS patches, so to actually
> apply / run the patches please use this git tree:
>
> git://git.infradead.org/users/hch/vfs.git O_ATOMIC
>
> Gitweb:
>
> http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/O_ATOMIC


Re: [PATCH 0/13 v2] block: Fix block device shutdown related races

2017-03-01 Thread Omar Sandoval
On Wed, Feb 22, 2017 at 11:24:25AM +0100, Jan Kara wrote:
> On Tue 21-02-17 10:19:28, Jens Axboe wrote:
> > On 02/21/2017 10:09 AM, Jan Kara wrote:
> > > Hello,
> > > 
> > > this is a second revision of the patch set to fix several different races 
> > > and
> > > issues I've found when testing device shutdown and reuse. The first three
> > > patches are fixes to problems in my previous series fixing BDI lifetime 
> > > issues.
> > > Patch 4 fixes issues with reuse of BDI name with scsi devices. With it I 
> > > cannot
> > > reproduce the BDI name reuse issues using Omar's stress test using 
> > > scsi_debug
> > > so it can be used as a replacement of Dan's patches. Patches 5-11 fix 
> > > oops that
> > > is triggered by __blkdev_put() calling inode_detach_wb() too early (the 
> > > problem
> > > reported by Thiago). Patches 12 and 13 fix oops due to a bug in gendisk 
> > > code
> > > where get_gendisk() can return already freed gendisk structure (again 
> > > triggered
> > > by Omar's stress test).
> > > 
> > > People, please have a look at patches. They are mostly simple however the
> > > interactions are rather complex so I may have missed something. Also I'm
> > > happy for any additional testing these patches can get - I've stressed 
> > > them
> > > with Omar's script, tested memcg writeback, tested static (not udev 
> > > managed)
> > > device inodes.
> > > 
> > > Jens, I think at least patches 1-3 should go in together with my fixes you
> > > already have in your tree (or shortly after them). It is up to you whether
> > > you decide to delay my first fixes or pick these up quickly. Patch 4 is
> > > (IMHO a cleaner) replacement of Dan's patches so consider whether you want
> > > to use it instead of those patches.
> > 
> > I have applied 1-3 to my for-linus branch, which will go in after
> > the initial pull request has been pulled by Linus. Consider fixing up
> > #4 so it applies, I like it.
> 
> OK, attached is patch 4 rebased on top of Linus' tree from today which
> already has linux-block changes pulled in. I've left put_disk_devt() in
> blk_cleanup_queue() to maintain the logic in the original patch (now commit
> 0dba1314d4f8) that request_queue and gendisk each hold one devt reference.
> The bdi_unregister() call that is moved to del_gendisk() by this patch is
> now protected by the gendisk reference instead of the request_queue one
> so it still maintains the property that devt reference protects bdi
> registration-unregistration lifetime (as much as that is not needed anymore
> after this patch).
> 
> I have also updated the comment in the code and the changelog - they were
> somewhat stale after changes to the whole series Tejun suggested.
> 
>   Honza

Hey, Jan, I just tested this out when I was seeing similar crashes with
sr instead of sd, and this fixed it.

Tested-by: Omar Sandoval 


Re: [PATCHv2 2/2] nvme: Complete all stuck requests

2017-03-01 Thread Artur Paszkiewicz
On 02/28/2017 05:57 PM, Keith Busch wrote:
> On Tue, Feb 28, 2017 at 08:42:19AM +0100, Artur Paszkiewicz wrote:
>>
>> I'm observing the same thing when hibernating during mdraid resync on
>> nvme - it hangs in blk_mq_freeze_queue_wait() after "Disabling non-boot
>> CPUs ...".
> 
> The patch guarantees forward progress for blk-mq's hot-cpu notifier on
> nvme request queues by failing all entered requests. It sounds like some
> part of your setup needs those requests to succeed in order to hibernate.
> 
> If your mdraid uses a stacking request_queue that submits retries while
> its request queue is entered, that may explain how you remain stuck
> at blk_mq_freeze_queue_wait.
> 
>> This patch did not help but when I put nvme_wait_freeze()
>> right after nvme_start_freeze() it appeared to be working. Maybe the
>> difference here is that requests are submitted from a non-freezable
>> kernel thread (md sync_thread)?
> 
> Wait freeze prior to quiescing the queue is ok when the controller is
> functioning, but it'd be impossible to complete a reset if the controller
> is in a failed or degraded state.
> 
> We probably want to give those requests a chance to succeed, and I think
> we'd need to be able to timeout the freeze wait. Below are two patches
> I tested. Prior to these, the fio test would report IO errors from some
> of its jobs; no errors with these.

With these patches it works fine. I tested multiple iterations on 2
platforms and they were able to hibernate and resume without issues.

Thanks,
Artur