Re: [PATCH 16/34] iomap: add initial support for writes without buffer heads

2018-05-22 Thread Christoph Hellwig
On Wed, May 23, 2018 at 08:38:06AM +1000, Dave Chinner wrote:
> Ok, I missed that detail as it's in a different patch. It looks like
> if (pos > EOF) it will be zeroed. But in this case I think that pos ==
> EOF and so it was reading instead. That smells like an off-by-one bug
> to me.

This has been fixed in the tree I pushed yesterday already.


Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-05-22 Thread Hannes Reinecke

On 05/22/2018 06:17 PM, Christoph Hellwig wrote:

Hi Keith,

I like this series a lot.  One comment that is probably close
to the big discussion in the thread:


	switch (ret) {
	case BLK_EH_HANDLED:
		/*
+		 * If the request is still in flight, the driver is requesting
+		 * blk-mq complete it.
		 */
+		if (blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT)
+			__blk_mq_complete_request(req);
+		break;


The state check here really irked me, and from the thread it seems like
I'm not the only one.  At least for the NVMe case I think it is perfectly
safe, although I agree I'd rather audit what other drivers do carefully.

That being said I think BLK_EH_HANDLED seems like a fundamentally broken
idea, and I'd actually prefer to get rid of it over adding things like
the MQ_RQ_IN_FLIGHT check above.


I can't agree more here.

BLK_EH_HANDLED is breaking all sorts of assumptions, and I'd be glad to 
see it go.



E.g. if we look at the cases where nvme-pci returns it:

  - if we did call nvme_dev_disable, we already canceled all requests,
so we might as well just return BLK_EH_NOT_HANDLED
  - the poll for completion case already completed the command,
so we should return BLK_EH_NOT_HANDLED

So I think we need to fix up nvme and, if needed, any other driver
to return the right value, and then assert that the request is
still in the in-flight state for the BLK_EH_HANDLED case.
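
A rough sketch of that assertion, reusing the state helpers from the
quoted patch (a sketch, not the final code):

	case BLK_EH_HANDLED:
		/*
		 * A driver returning BLK_EH_HANDLED must not have completed
		 * the request itself, so it must still be in flight here.
		 */
		WARN_ON_ONCE(blk_mq_rq_state(req) != MQ_RQ_IN_FLIGHT);
		__blk_mq_complete_request(req);
		break;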


... and then kill BLK_EH_HANDLED :-)

Cheers,

Hannes


Re: [PATCH] bcache: do not check return value of debugfs_create_dir()

2018-05-22 Thread Coly Li
On 2018/5/23 3:56 AM, Kai Krakow wrote:
> Hello Coly!
> 
> 2018-05-22 9:10 GMT+02:00 Coly Li :
>> Greg KH suggests that normal code should not care about debugfs. Therefore,
>> whether debugfs_create_dir() succeeds or fails, it is unnecessary to check
>> its return value.
>>
>> Two functions call debugfs_create_dir() and check its return value:
>> bch_debug_init() and closure_debug_init(). This patch changes these two
>> functions from int to void type, and ignores the return value of
>> debugfs_create_dir().
>>
>> This patch does not fix an exact bug; it just makes things work as they should.
> 
> I applied the patch to master and cherry-picked to my 4.16 branch
> (cleaning up the conflicts).
> 
> I'm not sure if I resolved the conflict in closure_debug correctly, but as
> I compiled with CONFIG_DEBUG_FS=n, I think it doesn't matter for my
> configuration.
> 
> The resulting patch works, currently running it while writing this message.
> 
> So you can add my tested-by if it makes sense.

Hi Kai,

Thank you for the effort, very helpful!

I will add a Tested-by: in the next version I post.

Coly Li


>> Signed-off-by: Coly Li 
>> Suggested-by: Greg Kroah-Hartman 
>> Cc: Kai Krakow 
>> Cc: Kent Overstreet 
>> ---
>>  drivers/md/bcache/bcache.h  |  2 +-
>>  drivers/md/bcache/closure.c | 13 +
>>  drivers/md/bcache/closure.h |  4 ++--
>>  drivers/md/bcache/debug.c   | 11 ++-
>>  drivers/md/bcache/super.c   |  4 +++-
>>  5 files changed, 21 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>> index 3a0cfb237af9..bee6251d2d40 100644
>> --- a/drivers/md/bcache/bcache.h
>> +++ b/drivers/md/bcache/bcache.h
>> @@ -995,7 +995,7 @@ void bch_open_buckets_free(struct cache_set *);
>>  int bch_cache_allocator_start(struct cache *ca);
>>
>>  void bch_debug_exit(void);
>> -int bch_debug_init(struct kobject *);
>> +void bch_debug_init(struct kobject *);
>>  void bch_request_exit(void);
>>  int bch_request_init(void);
>>
>> diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c
>> index 0e14969182c6..618253683d40 100644
>> --- a/drivers/md/bcache/closure.c
>> +++ b/drivers/md/bcache/closure.c
>> @@ -199,11 +199,16 @@ static const struct file_operations debug_ops = {
>> .release= single_release
>>  };
>>
>> -int __init closure_debug_init(void)
>> +void  __init closure_debug_init(void)
>>  {
>> -   closure_debug = debugfs_create_file("closures",
>> -   0400, bcache_debug, NULL, &debug_ops);
>> -   return IS_ERR_OR_NULL(closure_debug);
>> +   if (!IS_ERR_OR_NULL(bcache_debug))
>> +   /*
>> +* it is unnecessary to check return value of
>> +* debugfs_create_file(), we should not care
>> +* about this.
>> +*/
>> +   closure_debug = debugfs_create_file(
>> +   "closures", 0400, bcache_debug, NULL, _ops);
>>  }
>>  #endif
>>
>> diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h
>> index 71427eb5fdae..7c2c5bc7c88b 100644
>> --- a/drivers/md/bcache/closure.h
>> +++ b/drivers/md/bcache/closure.h
>> @@ -186,13 +186,13 @@ static inline void closure_sync(struct closure *cl)
>>
>>  #ifdef CONFIG_BCACHE_CLOSURES_DEBUG
>>
>> -int closure_debug_init(void);
>> +void closure_debug_init(void);
>>  void closure_debug_create(struct closure *cl);
>>  void closure_debug_destroy(struct closure *cl);
>>
>>  #else
>>
>> -static inline int closure_debug_init(void) { return 0; }
>> +static inline void closure_debug_init(void) {}
>>  static inline void closure_debug_create(struct closure *cl) {}
>>  static inline void closure_debug_destroy(struct closure *cl) {}
>>
>> diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
>> index d030ce3025a6..57f8f5aeee55 100644
>> --- a/drivers/md/bcache/debug.c
>> +++ b/drivers/md/bcache/debug.c
>> @@ -248,11 +248,12 @@ void bch_debug_exit(void)
>> debugfs_remove_recursive(bcache_debug);
>>  }
>>
>> -int __init bch_debug_init(struct kobject *kobj)
>> +void __init bch_debug_init(struct kobject *kobj)
>>  {
>> -   if (!IS_ENABLED(CONFIG_DEBUG_FS))
>> -   return 0;
>> -
>> +   /*
>> +    * it is unnecessary to check the return value of
>> +    * debugfs_create_dir(); we should not care
>> +    * about this.
>> +    */
>> bcache_debug = debugfs_create_dir("bcache", NULL);
>> -   return IS_ERR_OR_NULL(bcache_debug);
>>  }
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index 3dea06b41d43..2b62671e4d83 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -2301,10 +2301,12 @@ static int __init bcache_init(void)
>> if (!(bcache_wq = alloc_workqueue("bcache", WQ_MEM_RECLAIM, 0)) ||
>> !(bcache_kobj = 

Re: [PATCH V2] blk-mq: avoid to starve tag allocation after allocation process migrates

2018-05-22 Thread Jens Axboe
On 5/19/18 1:44 AM, Ming Lei wrote:
> When the allocation process is scheduled back and the mapped hw queue is
> changed, do one extra wake up on the original queue to compensate for the
> missed wake up, so other allocations on the original queue won't be starved.
> 
> This patch fixes one request allocation hang issue, which can be
> triggered easily in case of very low nr_requests.

Trying to think of better ways we can fix this, but I don't see
any right now. Getting rid of the wake_up_nr() kills us on tons
of tasks waiting. Maybe it might be possible to only go through
the fake wakeup IFF we have a task waiting on the list, that'd
spare us the atomic dec and cmpxchg for all cases except if we
have a task (or more) waiting on the existing wait state.
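
Something like the following, as a rough sketch (sbq_has_waiters() is a
hypothetical helper, not an existing API; it walks the sbq_wait_state
array that sbitmap already maintains):

	/* hypothetical: true if anyone is parked on one of the wait queues */
	static bool sbq_has_waiters(struct sbitmap_queue *sbq)
	{
		int i;

		for (i = 0; i < SBQ_WAIT_QUEUES; i++)
			if (waitqueue_active(&sbq->ws[i].wait))
				return true;
		return false;
	}

and then the compensation path would only fire when it matters:

	if (bt != bt_prev && sbq_has_waiters(bt_prev))
		sbitmap_queue_wake_up(bt_prev);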

> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 336dde07b230..77607f89d205 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -134,6 +134,8 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>   ws = bt_wait_ptr(bt, data->hctx);
>   drop_ctx = data->ctx == NULL;
>   do {
> + struct sbitmap_queue *bt_orig;

This should be called 'bt_prev'.

> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> index 841585f6e5f2..b23f50355281 100644
> --- a/include/linux/sbitmap.h
> +++ b/include/linux/sbitmap.h
> @@ -484,6 +484,13 @@ static inline struct sbq_wait_state *sbq_wait_ptr(struct sbitmap_queue *sbq,
>  void sbitmap_queue_wake_all(struct sbitmap_queue *sbq);
>  
>  /**
> + * sbitmap_queue_wake_up() - Do a regular wake up compensation if the queue
> + * allocated from is changed after scheduling back.
> + * @sbq: Bitmap queue to wake up.
> + */
> +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq);

The blk-mq issue is bleeding into sbitmap here. This should just detail
that this issues a wakeup, similar to how freeing a tag would.

-- 
Jens Axboe



Re: [PATCH V2] blk-mq: avoid to starve tag allocation after allocation process migrates

2018-05-22 Thread Jens Axboe
On 5/22/18 6:22 PM, Ming Lei wrote:
> On Tue, May 22, 2018 at 02:20:37PM -0600, Jens Axboe wrote:
>> On 5/19/18 1:44 AM, Ming Lei wrote:
>>> When the allocation process is scheduled back and the mapped hw queue is
>>> changed, do one extra wake up on orignal queue for compensating wake up
>>> miss, so other allocations on the orignal queue won't be starved.
>>>
>>> This patch fixes one request allocation hang issue, which can be
>>> triggered easily in case of very low nr_request.
>>
>> Can you explain what the actual bug is? The commit message doesn't
>> really say, it just explains what the code change does. What wake up
>> miss?
> 
> For example:
> 
> 1) one request queue has 2 queues, and the queue depth is low, so wake_batch
> is set to 1 or very small.

We have too many 'queues' :)

> 2) There are 2 allocations blocked on hw queue 0; now the last
> request mapped to hw queue 0 is completed, then sbq_wake_up() wakes only
> one of the 2 blockers.
> 
> 3) the woken-up allocation process is migrated to another CPU, its hw
> queue mapping changes to hw queue 1, and this allocation is switched
> to hw queue 1
> 
> 4) so there isn't any chance for the other blocked allocation to move on,
> since all requests in hw queue 0 have been completed. That is why I call
> it a wake up miss.

OK, that makes sense to me. With that, let me review your patch in
a different email.

-- 
Jens Axboe



Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Ming Lei
On Tue, May 22, 2018 at 02:38:37PM -0600, Jens Axboe wrote:
> On 5/22/18 2:33 PM, Bart Van Assche wrote:
> > On Tue, 2018-05-22 at 13:38 -0600, Jens Axboe wrote:
> >> On 5/22/18 1:03 PM, Jens Axboe wrote:
> >>> On 5/22/18 12:47 PM, Jens Axboe wrote:
>  Ran into this, running block/014 from blktests:
> 
>  [ 5744.949839] run blktests block/014 at 2018-05-22 12:41:25
>  [ 5750.723000] null: rq ff68f103 timed out
>  [ 5750.728181] WARNING: CPU: 45 PID: 2480 at block/blk-mq.c:585 __blk_mq_complete_request+0xa6/0x110
>  [ 5750.738187] Modules linked in: null_blk(+) configfs nvme nvme_core sb_edac x86_pkg_temp_therma]
>  [ 5750.765509] CPU: 45 PID: 2480 Comm: kworker/45:1H Not tainted 4.17.0-rc6+ #712
>  [ 5750.774087] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
>  [ 5750.783369] Workqueue: kblockd blk_mq_timeout_work
>  [ 5750.789223] RIP: 0010:__blk_mq_complete_request+0xa6/0x110
>  [ 5750.795850] RSP: 0018:883ffb417d68 EFLAGS: 00010202
>  [ 5750.802187] RAX: 0003 RBX: 881ff100d800 RCX: 
>  [ 5750.810649] RDX: 88407fd9e040 RSI: 88407fd956b8 RDI: 881ff100d800
>  [ 5750.819119] RBP: e8d91800 R08:  R09: 82066eb8
>  [ 5750.827588] R10: 883ffa386138 R11: 883ffa385900 R12: 0001
>  [ 5750.836050] R13: 881fe7da6000 R14: 0020 R15: 0002
>  [ 5750.844529] FS:  () GS:88407fd8() knlGS:
>  [ 5750.854482] CS:  0010 DS:  ES:  CR0: 80050033
>  [ 5750.861397] CR2: 7ffc92f97f68 CR3: 0201d005 CR4: 003606e0
>  [ 5750.869861] DR0:  DR1:  DR2: 
>  [ 5750.878333] DR3:  DR6: fffe0ff0 DR7: 0400
>  [ 5750.886805] Call Trace:
>  [ 5750.890033]  bt_iter+0x42/0x50
>  [ 5750.894000]  blk_mq_queue_tag_busy_iter+0x12b/0x220
>  [ 5750.899941]  ? blk_mq_tag_to_rq+0x20/0x20
>  [ 5750.904913]  ? __rcu_read_unlock+0x50/0x50
>  [ 5750.909978]  ? blk_mq_tag_to_rq+0x20/0x20
>  [ 5750.914948]  blk_mq_timeout_work+0x14b/0x240
>  [ 5750.920220]  process_one_work+0x21b/0x510
>  [ 5750.925197]  worker_thread+0x3a/0x390
>  [ 5750.929781]  ? process_one_work+0x510/0x510
>  [ 5750.934944]  kthread+0x11c/0x140
>  [ 5750.939028]  ? kthread_create_worker_on_cpu+0x50/0x50
>  [ 5750.945169]  ret_from_fork+0x1f/0x30
>  [ 5750.949656] Code: 48 02 00 00 80 e6 80 74 29 8b 95 80 00 00 00 44 39 e2 75 3b 48 89 df ff 90 2 
>  [ 5750.972139] ---[ end trace 40065cb1764bf500 ]---
> 
>  which is this:
> 
>  WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
> >>>
> >>> That check looks wrong, since TIMED_OUT -> COMPLETE is also a valid
> >>> state transition. So that check should be:
> >>>
> >>> WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE &&
> >>>  blk_mq_rq_state(rq) != MQ_RQ_TIMED_OUT);
> >>
> >> I guess it would be cleaner to actually do the transition, in
> >> blk_mq_rq_timed_out():
> >>
> >> case BLK_EH_HANDLED:
> >> 	if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,
> >> 				   MQ_RQ_COMPLETE))
> >> 		__blk_mq_complete_request(req);
> >> 	break;
> >>
> >> This works for me.
> > 
> > Hello Jens,
> > 
> > Thanks for having reported this. How about using the following change to 
> > suppress
> > that warning:
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index bb99c03e7a34..84e55ea55baf 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -844,6 +844,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
> >  
> > switch (ret) {
> > case BLK_EH_HANDLED:
> > +   blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE);
> > __blk_mq_complete_request(req);
> > break;
> > case BLK_EH_RESET_TIMER:
> > 
> > I think this will work better than what was proposed in your last e-mail.
> > I'm afraid that with that change a completion that occurs while the
> > timeout handler is running can be ignored.
> 
> What if that races with e.g. requeue? We get the completion from IRQ, decide we
> need to requeue/restart the request rather than complete it.

If the driver needs to requeue this request, that should have been done
in its .complete handler.

But if the requeue is done directly from the IRQ handler, that may be an
existing race between normal completion and timeout, and Bart's patch
doesn't change that situation either.

thanks,

Re: [PATCH] Revert "block: Add warning for bi_next not NULL in bio_endio()"

2018-05-22 Thread Kent Overstreet
On Wed, May 23, 2018 at 01:36:02AM +, Bart Van Assche wrote:
> On Tue, 2018-05-22 at 21:30 -0400, Kent Overstreet wrote:
> > On Tue, May 22, 2018 at 04:55:05PM -0700, Bart Van Assche wrote:
> > > This patch avoids that KASAN reports the following complaint when
> > > running the srp-test software:
> > 
> > I made a real attempt to get your test suite working.
> 
> Please try a little harder. Everyone I asked to run that test suite
> has been able to run it.
> 
> Bart.

= TEST   srp

systemd-journald[143]: Received SIGTERM from PID 1 (systemd).
Removed /etc/systemd/system/multipathd.service.
Starting multipath-tools (via systemctl): multipath-tools.service.
multipathd> reconfigure
ok
multipathd> make -C discontiguous-io discontiguous-io
make[1]: Entering directory
'/host/home/kent/ktest/tests/srp-test/discontiguous-io'
make[1]: 'discontiguous-io' is up to date.
make[1]: Leaving directory
'/host/home/kent/ktest/tests/srp-test/discontiguous-io'
Unloaded the ib_srpt kernel module
Unloaded the rdma_rxe kernel module
../run_tests: line 65: cd: /lib/modules/4.17.0-rc4+/kernel/block: No such file
or directory
Zero-initializing /dev/ram0 ... done
Zero-initializing /dev/ram1 ... done
Configured SRP target driver
Running test ../tests/01 ...
Unloaded the ib_srp kernel module
SRP login failed
Test ../tests/01 failed
Running test ../tests/02-mq ...
Test file I/O on top of multipath concurrently with logout and login (10 min;
mq)
Unloaded the ib_srp kernel module
SRP login failed
Test ../tests/02-mq failed
Running test ../tests/02-sq ...
Test file I/O on top of multipath concurrently with logout and login (10 min;
sq)
Unloaded the ib_srp kernel module
SRP login failed
Test ../tests/02-sq failed
Running test ../tests/02-sq-on-mq ...
Test file I/O on top of multipath concurrently with logout and login (10 min;
sq-on-mq)
Unloaded the ib_srp kernel module
SRP login failed
Test ../tests/02-sq-on-mq failed
Running test ../tests/03-4M ...
Test direct I/O with large transfer sizes and cmd_sg_entries=255
Unloaded the ib_srp kernel module
SRP login failed
Test ../tests/03-4M failed
Running test ../tests/03-8M ...
Test direct I/O with large transfer sizes and cmd_sg_entries=255
Unloaded the ib_srp kernel module
SRP login failed
Test ../tests/03-8M failed
Running test ../tests/04-4M ...
Test direct I/O with large transfer sizes and cmd_sg_entries=1
Unloaded the ib_srp kernel module
SRP login failed
Test ../tests/04-4M failed
Running test ../tests/04-8M ...
Test direct I/O with large transfer sizes and cmd_sg_entries=1
Unloaded the ib_srp kernel module
SRP login failed
Test ../tests/04-8M failed
Running test ../tests/05-4M ...
Test buffered I/O with large transfer sizes and cmd_sg_entries=255
Unloaded the ib_srp kernel module
SRP login failed
Test ../tests/05-4M failed
Running test ../tests/05-8M ...
Test buffered I/O with large transfer sizes and cmd_sg_entries=255
Unloaded the ib_srp kernel module
SRP login failed
Test ../tests/05-8M failed
Running test ../tests/06 ...
Test block I/O on top of multipath concurrently with logout and login (10 min)
Unloaded the ib_srp kernel module
SRP login failed
Test ../tests/06 failed
0 tests succeeded and 11 tests failed
Unloaded the ib_srpt kernel module
Unloaded the rdma_rxe kernel module

= PASSED srp in 14s


May 22 22:13:46 moria-kvm-kvm-kvm-kvm multipathd[387]: start up
May 22 22:13:46 moria-kvm-kvm-kvm-kvm multipathd[387]: read /etc/multipath.conf
May 22 22:13:46 moria-kvm-kvm-kvm-kvm multipathd[387]: path checkers start up
May 22 22:13:46 moria-kvm-kvm-kvm-kvm multipathd[387]: sda: fail to get serial
May 22 22:13:46 moria-kvm-kvm-kvm-kvm multipathd[387]: sda: using deprecated
getuid callout
May 22 22:13:46 moria-kvm-kvm-kvm-kvm multipathd[387]:
/.../srp-test/bin/getuid_callout exited with 255
May 22 22:13:46 moria-kvm-kvm-kvm-kvm multipathd[387]: sda: failed to get
callout uid: Input/output error
May 22 22:13:46 moria-kvm-kvm-kvm-kvm multipathd[387]: sdb: fail to get serial
May 22 22:13:46 moria-kvm-kvm-kvm-kvm multipathd[387]: sdb: using deprecated
getuid callout
May 22 22:13:46 moria-kvm-kvm-kvm-kvm multipathd[387]:
/.../srp-test/bin/getuid_callout exited with 255
May 22 22:13:46 moria-kvm-kvm-kvm-kvm multipathd[387]: sdb: failed to get
callout uid: Input/output error
May 22 22:13:46 moria-kvm-kvm-kvm-kvm multipathd[387]: sdc: using deprecated
getuid callout
May 22 22:13:46 moria-kvm-kvm-kvm-kvm multipathd[387]:
/.../srp-test/bin/getuid_callout exited with 255
May 22 22:13:46 moria-kvm-kvm-kvm-kvm multipathd[387]: sdc: failed to get
callout uid: Input/output error
May 22 22:13:46 moria-kvm-kvm-kvm-kvm systemd[1]: Started Device-Mapper
Multipath Device Controller.
May 22 22:13:46 moria-kvm-kvm-kvm-kvm multipathd[387]: reconfigure (operator)
May 22 22:13:46 moria-kvm-kvm-kvm-kvm multipathd[387]: sda: fail to get serial
May 22 22:13:46 moria-kvm-kvm-kvm-kvm multipathd[387]: sda: using deprecated
getuid callout
May 22 22:13:46 

Re: [PATCH] block: kyber: make kyber more friendly with merging

2018-05-22 Thread jianchao.wang
Hi Jens and Holger

Thanks for your kind response, that's really appreciated.

I will post the next version based on Jens' patch.

Thanks
Jianchao

On 05/23/2018 02:32 AM, Holger Hoffstätte wrote:
 This looks great but prevents kyber from being built as a module,
 which is AFAIK supposed to work (and works now):

 ..
   CC [M]  block/kyber-iosched.o
   Building modules, stage 2.
   MODPOST 313 modules
 ERROR: "bio_attempt_back_merge" [block/kyber-iosched.ko] undefined!
 ERROR: "bio_attempt_front_merge" [block/kyber-iosched.ko] undefined!
 ERROR: "bio_attempt_discard_merge" [block/kyber-iosched.ko] undefined!
 ERROR: "blk_try_merge" [block/kyber-iosched.ko] undefined!
 ERROR: "blk_rq_merge_ok" [block/kyber-iosched.ko] undefined!
 ..

 It does build fine when compiled in, obviously. :)
>>> It's basically duplicating the contents of blk_attempt_plug_merge().
>>> I would suggest abstracting out the list loop and merge check
>>> into a helper, that could then both be called from kyber and the
>>> plug merge function.
>> See attached, prep patch and yours rebased on top of it.
> That was quick. :)
> 
> Applies smoothly on top of my 4.16++ tree, now builds correctly as a
> module and is reproducibly (slightly) faster even on my pedestrian
> SATA SSDs, now on par or occasionally even faster than mq-deadline.
> What's not to like? So:
> 
> Tested-by: Holger Hoffstätte 


Re: [PATCH] block: kyber: make kyber more friendly with merging

2018-05-22 Thread jianchao.wang
Hi Omar

Thanks for your kind response.

On 05/23/2018 04:02 AM, Omar Sandoval wrote:
> On Tue, May 22, 2018 at 10:48:29PM +0800, Jianchao Wang wrote:
>> Currently, kyber is very unfriendly to merging. kyber depends
>> on the ctx rq_list to do merging; however, most of the time it will not
>> leave any requests in the ctx rq_list. This is because even if the tokens
>> of one domain are used up, kyber will try to dispatch requests
>> from other domains and flush the rq_list there.
> 
> That's a great catch, I totally missed this.
> 
> This approach does end up duplicating a lot of code with the blk-mq core
> even after Jens' change, so I'm curious if you tried other approaches.
> One idea I had is to try the bio merge against the kqd->rqs lists. Since
> that's per-queue, the locking overhead might be too high. Alternatively,

Yes, I once made a patch as you suggest, trying the bio merge against
kqd->rqs directly. That patch looks even simpler. However, because the lock
protecting kqd->rqs is needed every time we try a bio merge, there may be
high contention overhead on that lock when the cpu-to-hctx mapping is not
1:1.

> you could keep the software queues as-is but add our own version of
> flush_busy_ctxs() that only removes requests of the domain that we want.
> If one domain gets backed up, that might get messy with long iterations,
> though.

Yes, I also considered this approach :)
But the long iterations over every ctx->rq_list look really inefficient.

> 
> Regarding this approach, a couple of comments below.
...
>>  }
>> @@ -379,12 +414,33 @@ static void kyber_exit_sched(struct elevator_queue *e)
>>  static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int 
>> hctx_idx)
>>  {
>>  struct kyber_hctx_data *khd;
>> +struct kyber_queue_data *kqd = hctx->queue->elevator->elevator_data;
>>  int i;
>> +int sd;
>>  
>>  khd = kmalloc_node(sizeof(*khd), GFP_KERNEL, hctx->numa_node);
>>  if (!khd)
>>  return -ENOMEM;
>>  
>> +khd->kcqs = kmalloc_array_node(nr_cpu_ids, sizeof(void *),
>> +GFP_KERNEL, hctx->numa_node);
>> +if (!khd->kcqs)
>> +goto err_khd;
> 
> Why the double indirection of a percpu allocation per hardware queue
> here? With, say, 56 cpus and that many hardware queues, that's 3136
> pointers, which seems like overkill. Can't you just use the percpu array
> in the kqd directly, or make it per-hardware queue instead?

Oops, I forgot to change nr_cpu_ids to hctx->nr_ctx.
The mapping between cpu and hctx has been set up by the time kyber_init_hctx()
is invoked, so we just need to allocate hctx->nr_ctx kyber_ctx_queue structures
per khd.

...
>> +static int bio_sched_domain(const struct bio *bio)
>> +{
>> +unsigned int op = bio->bi_opf;
>> +
>> +if ((op & REQ_OP_MASK) == REQ_OP_READ)
>> +return KYBER_READ;
>> +else if ((op & REQ_OP_MASK) == REQ_OP_WRITE && op_is_sync(op))
>> +return KYBER_SYNC_WRITE;
>> +else
>> +return KYBER_OTHER;
>> +}
> 
> Please add a common helper for rq_sched_domain() and bio_sched_domain()
> instead of duplicating the logic.
> 

Yes, I will do it in the next version.
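
A minimal version might look like this (kyber_sched_domain() is an assumed
name; rq_sched_domain() would pass rq->cmd_flags and bio_sched_domain()
would pass bio->bi_opf):

	static unsigned int kyber_sched_domain(unsigned int op)
	{
		if ((op & REQ_OP_MASK) == REQ_OP_READ)
			return KYBER_READ;
		else if ((op & REQ_OP_MASK) == REQ_OP_WRITE && op_is_sync(op))
			return KYBER_SYNC_WRITE;
		else
			return KYBER_OTHER;
	}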

Thanks
Jianchao


Re: [PATCH] Revert "block: Add warning for bi_next not NULL in bio_endio()"

2018-05-22 Thread Bart Van Assche
On Tue, 2018-05-22 at 21:30 -0400, Kent Overstreet wrote:
> On Tue, May 22, 2018 at 04:55:05PM -0700, Bart Van Assche wrote:
> > This patch avoids that KASAN reports the following complaint when
> > running the srp-test software:
> 
> I made a real attempt to get your test suite working.

Please try a little harder. Everyone I asked to run that test suite
has been able to run it.

Bart.


Re: [PATCH] Revert "block: Add warning for bi_next not NULL in bio_endio()"

2018-05-22 Thread Kent Overstreet
On Tue, May 22, 2018 at 04:55:05PM -0700, Bart Van Assche wrote:
> This patch avoids that KASAN reports the following complaint when
> running the srp-test software:

I made a real attempt to get your test suite working. It would be nice if you
could make an attempt at tracking it down instead of just reverting, and holding
up improvements to the whole block layer, since you seem to have an easy repro
for it.

This is a real looming issue - we have other code which assumes things about
bi_next and this isn't a practical thing to grep/audit for.


Re: [PATCH blktests] block/020: test IO req allocation

2018-05-22 Thread Ming Lei
On Tue, May 22, 2018 at 05:01:54PM -0700, Omar Sandoval wrote:
> On Tue, May 22, 2018 at 07:12:49PM +0800, Ming Lei wrote:
> > IO request allocation may hang forever if the allocation process
> > migrates. This test covers the request allocation code path.
> > 
> > The following patch fixes this issue on Linus' kernel:
> > 
> > https://marc.info/?l=linux-block&m=152671586923185&w=2
> > 
> > Signed-off-by: Ming Lei 
> > ---
> >  tests/block/020 | 50 ++
> >  tests/block/020.out |  2 ++
> >  2 files changed, 52 insertions(+)
> >  create mode 100755 tests/block/020
> >  create mode 100755 tests/block/020.out
> > 
> > diff --git a/tests/block/020 b/tests/block/020
> > new file mode 100755
> > index ..b102d7f12173
> > --- /dev/null
> > +++ b/tests/block/020
> > @@ -0,0 +1,50 @@
> > +#!/bin/bash
> > +#
> > +# Test blk-mq IO request allocation.
> 
> This isn't very descriptive. How about
> 
> Test blk-mq request allocation when hardware tags are limited.

It is fine, except that the issue isn't limited to small hw queue tag counts;
it might be triggered on NVMe too if there is a huge number of IO processes.

> 
> Additionally, for regression tests for specific patches, please
> reference what patch it is testing. In this case, it'd be
> 
> Regression test for patch "blk-mq: avoid to starve tag allocation after
> allocation process migrates".

OK.

> 
> > +# Copyright (C) 2018 Ming Lei 
> > +#
> > +# This program is free software: you can redistribute it and/or modify
> > +# it under the terms of the GNU General Public License as published by
> > +# the Free Software Foundation, either version 3 of the License, or
> > +# (at your option) any later version.
> > +#
> > +# This program is distributed in the hope that it will be useful,
> > +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > +# GNU General Public License for more details.
> > +#
> > +# You should have received a copy of the GNU General Public License
> > +# along with this program.  If not, see <http://www.gnu.org/licenses/>.
> > +
> > +DESCRIPTION="check if IO req allocation may hang"
> 
> A better description would say what this test actually does. In this
> case, something like
> 
> DESCRIPTION="run null-blk on different schedulers with only one hardware tag"
> 
> I can fix this and apply when the discussion on the patch itself
> settles. Thanks, Ming!

OK, thanks!

Thanks,
Ming


Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-05-22 Thread Ming Lei
On Tue, May 22, 2018 at 06:17:04PM +0200, Christoph Hellwig wrote:
> Hi Keith,
> 
> I like this series a lot.  One comment that is probably close
> to the big discussion in the thread:
> 
> > switch (ret) {
> > case BLK_EH_HANDLED:
> > /*
> > +* If the request is still in flight, the driver is requesting
> > +* blk-mq complete it.
> >  */
> > +   if (blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT)
> > +   __blk_mq_complete_request(req);
> > +   break;
> 
> The state check here really irked me, and from the thread it seems like
> I'm not the only one.  At least for the NVMe case I think it is perfectly
> safe, although I agree I'd rather audit what other drivers do carefully.

Let's consider the normal NVMe timeout code path:

1) one request is timed out;

2) the controller is shut down; this timed-out request is requeued from
nvme_cancel_request(), but can't be dispatched because the queues are
quiesced;

3) the reset is done from another context, and this request is dispatched
again and completed just before EH_HANDLED is returned to blk-mq, but
its state isn't updated to COMPLETE yet;

4) then the request is completed twice, from both the normal completion
path and the timeout path.

The same issue seems to exist on the poll path.

Thanks,
Ming


Re: [PATCH V2] blk-mq: avoid to starve tag allocation after allocation process migrates

2018-05-22 Thread Ming Lei
On Tue, May 22, 2018 at 02:20:37PM -0600, Jens Axboe wrote:
> On 5/19/18 1:44 AM, Ming Lei wrote:
> > When the allocation process is scheduled back and the mapped hw queue is
> > changed, do one extra wake up on the original queue to compensate for the
> > missed wake up, so other allocations on the original queue won't be starved.
> > 
> > This patch fixes one request allocation hang issue, which can be
> > triggered easily in case of very low nr_requests.
> 
> Can you explain what the actual bug is? The commit message doesn't
> really say, it just explains what the code change does. What wake up
> miss?

For example:

1) one request queue has 2 hw queues, and the queue depth is low, so
wake_batch is set to 1 or very small.

2) There are 2 allocations blocked on hw queue 0; now the last request
mapped to hw queue 0 is completed, then sbq_wake_up() wakes only one of
the 2 blockers.

3) the woken-up allocation process is migrated to another CPU, its hw
queue mapping changes to hw queue 1, and this allocation is switched to
hw queue 1

4) so there isn't any chance for the other blocked allocation to move on,
since all requests in hw queue 0 have been completed. That is why I call
it a wake up miss.
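
The core of the V2 fix is roughly the following (a sketch using the
bt_prev name Jens suggested, not the exact patch):

	bt_prev = bt;
	io_schedule();

	data->ctx = blk_mq_get_ctx(data->q);
	data->hctx = blk_mq_map_queue(data->q, data->ctx->cpu);
	...
	/*
	 * If we migrated to a different hw queue, the wakeup we may have
	 * consumed belonged to the old queue's waiters, so issue one
	 * extra wakeup there as compensation.
	 */
	if (bt != bt_prev)
		sbitmap_queue_wake_up(bt_prev);

	ws = bt_wait_ptr(bt, data->hctx);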

BTW, this issue can be reproduced reliably by the block/020 I posted
yesterday:

https://marc.info/?l=linux-block&m=152698758418536&w=2

Thanks,
Ming


Re: [PATCH blktests] Documentation: document prerequisite scriptlets

2018-05-22 Thread Omar Sandoval
On Tue, May 22, 2018 at 09:30:24AM +0200, Johannes Thumshirn wrote:
> On Mon, May 21, 2018 at 11:29:04AM -0700, Omar Sandoval wrote:
> > But I'm curious about this specific example. Is this not mounted for you
> > automatically? I'm guessing systemd does it for me on my setup.
> 
> No, my setup is kind of special. I don't want to mess with all the
> user-space, so all I do is:
> 
> dracut --no-compress --kver `make kernelrelease` --kmoddir mods/ \
>--no-hostonly --no-hostonly-cmdline --modules "bash base" \
>--tmpdir `pwd`/myinitrd --force myinitrd/initrd \
>--add-drivers "loop nvme nvme-loop nvmet" \
>--install "lsblk find sort parted getopt fio tput column date \
>dirname mktemp ps diff awk timeout time losetup \
>truncate wc grep stat basename cut blktrace sg_inq \
>realpath findmnt vi dd sed rm rmdir nvme" \
>--include "$HOME/src/blktests" "/blktests"
> 
> qemu-kvm -m 4096 -smp 4 -nographic -serial mon:stdio -kernel \
>arch/x86/boot/bzImage -initrd myinitrd/initrd \
>-append "console=ttyS0 debug"
> 
> I have the above in a shell wrapper as I simply can't remember it, or
> I just use rapido [1] which a co-worker of mine started for this job.
> 
> As minimal as possible. I try to keep my rebuild - boot - test cycles
> short.
> 
> With my current config I have a bootup time in qemu below half a
> second on a reasonable machine in my lab and just below one second
> on my laptop.
> 
> [1] https://github.com/rapido-linux/rapido
> 
> Byte,
>   Johannes

Cool setup :) I have my own fanciness [1] which lets me boot straight
into a test kernel without packaging it up in any way.

Anyways, I added some documentation based on yours. Thanks!

1: https://github.com/osandov/osandov-linux#vm-setup


Re: [PATCH blktests] block/020: test IO req allocation

2018-05-22 Thread Omar Sandoval
On Tue, May 22, 2018 at 07:12:49PM +0800, Ming Lei wrote:
> IO request allocation may hang forever if the allocation process
> migrates. This test covers the request allocation code path.
> 
> The following patch fixes this issue on Linus' kernel:
> 
> https://marc.info/?l=linux-block&m=152671586923185&w=2
> 
> Signed-off-by: Ming Lei 
> ---
>  tests/block/020 | 50 ++
>  tests/block/020.out |  2 ++
>  2 files changed, 52 insertions(+)
>  create mode 100755 tests/block/020
>  create mode 100755 tests/block/020.out
> 
> diff --git a/tests/block/020 b/tests/block/020
> new file mode 100755
> index ..b102d7f12173
> --- /dev/null
> +++ b/tests/block/020
> @@ -0,0 +1,50 @@
> +#!/bin/bash
> +#
> +# Test blk-mq IO request allocation.

This isn't very descriptive. How about

Test blk-mq request allocation when hardware tags are limited.

Additionally, for regression tests for specific patches, please
reference what patch it is testing. In this case, it'd be

Regression test for patch "blk-mq: avoid to starve tag allocation after
allocation process migrates".

> +# Copyright (C) 2018 Ming Lei 
> +#
> +# This program is free software: you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation, either version 3 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see <http://www.gnu.org/licenses/>.
> +
> +DESCRIPTION="check if IO req allocation may hang"

A better description would say what this test actually does. In this
case, something like

DESCRIPTION="run null-blk on different schedulers with only one hardware tag"

I can fix this and apply when the discussion on the patch itself
settles. Thanks, Ming!

> +QUICK=1
> +
> +requires() {
> + _have_module null_blk && _have_fio
> +}
> +
> +test() {
> + echo "Running ${TEST_NAME}"
> +
> + modprobe -r null_blk
> + modprobe null_blk queue_mode=2 irqmode=2 completion_nsec=200 \
> + shared_tags=1 submit_queues=4 hw_queue_depth=1 || continue
> +
> + local scheds
> + # shellcheck disable=SC2207
> + scheds=($(sed 's/[][]//g' /sys/block/nullb0/queue/scheduler))
> +
> + for sched in "${scheds[@]}"; do
> + echo "Testing $sched" >> "$FULL"
> + echo "$sched" > /sys/block/nullb0/queue/scheduler
> + _fio_perf --bs=4k --ioengine=libaio --iodepth=1024 \
> + --numjobs="$(nproc)" --rw=randread --name=async \
> + --filename=/dev/nullb0 --size=1g --direct=1 \
> + --runtime=10
> + done
> +
> + modprobe -r null_blk
> +
> + echo "Test complete"
> +}
> diff --git a/tests/block/020.out b/tests/block/020.out
> new file mode 100755
> index ..dcb584d22e3e
> --- /dev/null
> +++ b/tests/block/020.out
> @@ -0,0 +1,2 @@
> +Running block/020
> +Test complete
> -- 
> 2.9.5
> 


Re: [PATCH blktests] blktests: use consistent helper returns

2018-05-22 Thread Omar Sandoval
On Tue, May 22, 2018 at 03:36:11AM -0400, Chaitanya Kulkarni wrote:
> This is a cleanup patch which uses consistent return values
> for helper functions.
> 
> Signed-off-by: Chaitanya Kulkarni 

Thanks, applied.


[PATCH] Revert "block: Add warning for bi_next not NULL in bio_endio()"

2018-05-22 Thread Bart Van Assche
This patch avoids that KASAN reports the following complaint when
running the srp-test software:

BUG: KASAN: use-after-free in bio_advance+0x110/0x1b0
Read of size 4 at addr 88014c7aa950 by task ksoftirqd/10/72
Call Trace:
dump_stack+0x9a/0xeb
print_address_description+0x65/0x270
kasan_report+0x232/0x350
bio_advance+0x110/0x1b0
blk_update_request+0x9d/0x5a0
scsi_end_request+0x4c/0x300 [scsi_mod]
scsi_io_completion+0x71e/0xa40 [scsi_mod]
__blk_mq_complete_request+0x13e/0x220
srp_recv_done+0x454/0x1100 [ib_srp]
__ib_process_cq+0x9a/0xf0 [ib_core]
ib_poll_handler+0x2d/0x90 [ib_core]
irq_poll_softirq+0xe5/0x1e0
__do_softirq+0x112/0x5f0
run_ksoftirqd+0x29/0x50
smpboot_thread_fn+0x30f/0x410
kthread+0x1b2/0x1d0
ret_from_fork+0x24/0x30

This reverts commit 0ba99ca4838bc75481a4bf0e70bad20b0a5457c7.

Fixes: commit 45db54d58de0 ("block: Split out bio_list_copy_data()")
Signed-off-by: Bart Van Assche 
Cc: Kent Overstreet 
---
 block/bio.c  | 3 ---
 block/blk-core.c | 8 +---
 2 files changed, 1 insertion(+), 10 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 0a4df92cd689..e22ebab450f8 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1777,9 +1777,6 @@ void bio_endio(struct bio *bio)
if (!bio_integrity_endio(bio))
return;
 
-   if (WARN_ONCE(bio->bi_next, "driver left bi_next not NULL"))
-   bio->bi_next = NULL;
-
/*
 * Need to have a real endio function for chained bios, otherwise
 * various corner cases will break (like stacking block devices that
diff --git a/block/blk-core.c b/block/blk-core.c
index a216b8b137f4..4d69f91a6431 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -277,10 +277,6 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
bio_advance(bio, nbytes);
 
/* don't actually finish bio if it's part of flush sequence */
-   /*
-* XXX this code looks suspicious - it's not consistent with advancing
-* req->bio in caller
-*/
if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ))
bio_endio(bio);
 }
@@ -3115,10 +3111,8 @@ bool blk_update_request(struct request *req, blk_status_t error,
struct bio *bio = req->bio;
unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);
 
-   if (bio_bytes == bio->bi_iter.bi_size) {
+   if (bio_bytes == bio->bi_iter.bi_size)
req->bio = bio->bi_next;
-   bio->bi_next = NULL;
-   }
 
/* Completion has already been traced */
bio_clear_flag(bio, BIO_TRACE_COMPLETION);
-- 
2.16.3



Re: [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

2018-05-22 Thread Kees Cook
On Tue, May 22, 2018 at 4:42 PM, Jens Axboe  wrote:
> On May 22, 2018, at 5:31 PM, Kees Cook  wrote:
>>
>>> On Tue, May 22, 2018 at 12:16 PM, Jens Axboe  wrote:
 On 5/22/18 1:13 PM, Christoph Hellwig wrote:
> On Tue, May 22, 2018 at 01:09:41PM -0600, Jens Axboe wrote:
> I think Martin and Christoph are objecting to moving the code to
> block/scsi_ioctl.h. I don't care too much about where the code is, but
> think it would be nice to have the definitions in a separate header. But
> if they prefer just pulling in all of SCSI for it, well then I guess
> it's pointless to move the header bits. Seems very heavy handed to me,
> though.

 It might be heavy handed for the 3 remaining users of drivers/ide,
>>>
>>> Brutal :-)
>>
>> Heh. I noticed a similar sense buffer use in drivers/cdrom/cdrom.c
>> too. Is this okay under the same considerations?
>
> This is going from somewhat crazy to pretty nuts, imho. I guess in practical 
> terms it doesn’t matter that much, since most folks would be using sr. I 
> still think a split would be vastly superior. Just keep the scsi code in 
> drivers/scsi/ and make it independently selectable.

I had originally tied it to BLK_SCSI_REQUEST. Logically speaking,
sense buffers are part of the request, and the CONFIG work is already
done. This is roughly what I tried to do before, since scsi_ioctl.c is
the only code pulled in for CONFIG_BLK_SCSI_REQUEST:

$ git grep BLK_SCSI_REQUEST | grep Makefile
block/Makefile:obj-$(CONFIG_BLK_SCSI_REQUEST)   += scsi_ioctl.o

Should I move the code to a new drivers/scsi/scsi_sense.c and add it to
drivers/scsi/Makefile as:

obj-$(CONFIG_BLK_SCSI_REQUEST)+= scsi_sense.o

Every place I want to use the code is already covered by
CONFIG_BLK_SCSI_REQUEST, so it seems like I just need to know where to
put the .c file. :P

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

2018-05-22 Thread Jens Axboe
On May 22, 2018, at 5:31 PM, Kees Cook  wrote:
> 
>> On Tue, May 22, 2018 at 12:16 PM, Jens Axboe  wrote:
>>> On 5/22/18 1:13 PM, Christoph Hellwig wrote:
 On Tue, May 22, 2018 at 01:09:41PM -0600, Jens Axboe wrote:
 I think Martin and Christoph are objecting to moving the code to
 block/scsi_ioctl.h. I don't care too much about where the code is, but
 think it would be nice to have the definitions in a separate header. But
 if they prefer just pulling in all of SCSI for it, well then I guess
 it's pointless to move the header bits. Seems very heavy handed to me,
 though.
>>> 
>>> It might be heavy handed for the 3 remaining users of drivers/ide,
>> 
>> Brutal :-)
> 
> Heh. I noticed a similar sense buffer use in drivers/cdrom/cdrom.c
> too. Is this okay under the same considerations?

This is going from somewhat crazy to pretty nuts, imho. I guess in practical 
terms it doesn’t matter that much, since most folks would be using sr. I still 
think a split would be vastly superior. Just keep the scsi code in 
drivers/scsi/ and make it independently selectable.

Jens



Re: [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

2018-05-22 Thread Randy Dunlap
On 05/22/2018 04:39 PM, Kees Cook wrote:
> On Tue, May 22, 2018 at 4:34 PM, Randy Dunlap  wrote:
>> On 05/22/2018 04:31 PM, Kees Cook wrote:
>>> On Tue, May 22, 2018 at 12:16 PM, Jens Axboe  wrote:
 On 5/22/18 1:13 PM, Christoph Hellwig wrote:
> On Tue, May 22, 2018 at 01:09:41PM -0600, Jens Axboe wrote:
>> I think Martin and Christoph are objecting to moving the code to
>> block/scsi_ioctl.h. I don't care too much about where the code is, but
>> think it would be nice to have the definitions in a separate header. But
>> if they prefer just pulling in all of SCSI for it, well then I guess
>> it's pointless to move the header bits. Seems very heavy handed to me,
>> though.
>
> It might be heavy handed for the 3 remaining users of drivers/ide,

 Brutal :-)
>>>
>>> Heh. I noticed a similar sense buffer use in drivers/cdrom/cdrom.c
>>> too. Is this okay under the same considerations?
>>
>> No.  Do not select an entire subsystem.  Use depends on it instead.
> 
> I looked at that first, but it seems it's designed for that. For
> example, "config ATA" already has a "select SCSI".
> 
> It does look fishy, though, since "config SCSI" has a "depends" which
> would be ignored by "select". Luckily, all these uses already do a
> "depends on BLOCK" (directly or indirectly).

Linus has railed against selecting subsystems.  We shouldn't be adding
more IMHO, although it is difficult to get rid of ones that we already have.


-- 
~Randy


Re: [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

2018-05-22 Thread Randy Dunlap
On 05/22/2018 04:31 PM, Kees Cook wrote:
> On Tue, May 22, 2018 at 12:16 PM, Jens Axboe  wrote:
>> On 5/22/18 1:13 PM, Christoph Hellwig wrote:
>>> On Tue, May 22, 2018 at 01:09:41PM -0600, Jens Axboe wrote:
 I think Martin and Christoph are objecting to moving the code to
 block/scsi_ioctl.h. I don't care too much about where the code is, but
 think it would be nice to have the definitions in a separate header. But
 if they prefer just pulling in all of SCSI for it, well then I guess
 it's pointless to move the header bits. Seems very heavy handed to me,
 though.
>>>
>>> It might be heavy handed for the 3 remaining users of drivers/ide,
>>
>> Brutal :-)
> 
> Heh. I noticed a similar sense buffer use in drivers/cdrom/cdrom.c
> too. Is this okay under the same considerations?

No.  Do not select an entire subsystem.  Use depends on it instead.
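
For illustration, the depends-on form for one of the hunks below would be
something like this (an untested sketch, not a tested Kconfig change):

config GDROM
	tristate "SEGA Dreamcast GD-ROM drive"
	depends on SH_DREAMCAST
	depends on SCSI
	select CDROM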


> diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
> index ad9b687a236a..220ff321c102 100644
> --- a/drivers/block/Kconfig
> +++ b/drivers/block/Kconfig
> @@ -79,7 +79,7 @@ config GDROM
> tristate "SEGA Dreamcast GD-ROM drive"
> depends on SH_DREAMCAST
> select CDROM
> -   select BLK_SCSI_REQUEST # only for the generic cdrom code
> +   select SCSI
> help
>   A standard SEGA Dreamcast comes with a modified CD ROM drive called a
>   "GD-ROM" by SEGA to signify it is capable of reading special disks
> @@ -345,7 +345,7 @@ config CDROM_PKTCDVD
> tristate "Packet writing on CD/DVD media (DEPRECATED)"
> depends on !UML
> select CDROM
> -   select BLK_SCSI_REQUEST
> +   select SCSI
> help
>   Note: This driver is deprecated and will be removed from the
>   kernel in the near future!
> diff --git a/drivers/block/paride/Kconfig b/drivers/block/paride/Kconfig
> index f8bd6ef3605a..7fdfcc5eaca5 100644
> --- a/drivers/block/paride/Kconfig
> +++ b/drivers/block/paride/Kconfig
> @@ -27,7 +27,7 @@ config PARIDE_PCD
> tristate "Parallel port ATAPI CD-ROMs"
> depends on PARIDE
> select CDROM
> -   select BLK_SCSI_REQUEST # only for the generic cdrom code
> +   select SCSI
> ---help---
>   This option enables the high-level driver for ATAPI CD-ROM devices
>   connected through a parallel port. If you chose to build PARIDE
> 
>>> but as long as that stuff just keeps working I'd rather worry about
>>> everyone else, and keep the scsi code where it belongs.
>>
>> Fine with me then, hopefully we can some day kill it off.
> 
> I'll send a v2. I found a few other things to fix up (including the
> cdrom.c one).


-- 
~Randy


Re: [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

2018-05-22 Thread Kees Cook
On Tue, May 22, 2018 at 12:16 PM, Jens Axboe  wrote:
> On 5/22/18 1:13 PM, Christoph Hellwig wrote:
>> On Tue, May 22, 2018 at 01:09:41PM -0600, Jens Axboe wrote:
>>> I think Martin and Christoph are objecting to moving the code to
>>> block/scsi_ioctl.h. I don't care too much about where the code is, but
>>> think it would be nice to have the definitions in a separate header. But
>>> if they prefer just pulling in all of SCSI for it, well then I guess
>>> it's pointless to move the header bits. Seems very heavy handed to me,
>>> though.
>>
>> It might be heavy handed for the 3 remaining users of drivers/ide,
>
> Brutal :-)

Heh. I noticed a similar sense buffer use in drivers/cdrom/cdrom.c
too. Is this okay under the same considerations?

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index ad9b687a236a..220ff321c102 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -79,7 +79,7 @@ config GDROM
tristate "SEGA Dreamcast GD-ROM drive"
depends on SH_DREAMCAST
select CDROM
-   select BLK_SCSI_REQUEST # only for the generic cdrom code
+   select SCSI
help
  A standard SEGA Dreamcast comes with a modified CD ROM drive called a
  "GD-ROM" by SEGA to signify it is capable of reading special disks
@@ -345,7 +345,7 @@ config CDROM_PKTCDVD
tristate "Packet writing on CD/DVD media (DEPRECATED)"
depends on !UML
select CDROM
-   select BLK_SCSI_REQUEST
+   select SCSI
help
  Note: This driver is deprecated and will be removed from the
  kernel in the near future!
diff --git a/drivers/block/paride/Kconfig b/drivers/block/paride/Kconfig
index f8bd6ef3605a..7fdfcc5eaca5 100644
--- a/drivers/block/paride/Kconfig
+++ b/drivers/block/paride/Kconfig
@@ -27,7 +27,7 @@ config PARIDE_PCD
tristate "Parallel port ATAPI CD-ROMs"
depends on PARIDE
select CDROM
-   select BLK_SCSI_REQUEST # only for the generic cdrom code
+   select SCSI
---help---
  This option enables the high-level driver for ATAPI CD-ROM devices
  connected through a parallel port. If you chose to build PARIDE

>> but as long as that stuff just keeps working I'd rather worry about
>> everyone else, and keep the scsi code where it belongs.
>
> Fine with me then, hopefully we can some day kill it off.

I'll send a v2. I found a few other things to fix up (including the
cdrom.c one).

Thanks!

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH 16/34] iomap: add initial support for writes without buffer heads

2018-05-22 Thread Dave Chinner
On Tue, May 22, 2018 at 10:24:54AM +0200, Christoph Hellwig wrote:
> On Tue, May 22, 2018 at 10:07:45AM +1000, Dave Chinner wrote:
> > > Something doesn't smell right here.  The only pages we need to read in
> > > are the first and last pages in the write_begin range, and only if they
> > > aren't page aligned and the underlying extent is IOMAP_MAPPED, right?
> > 
> > And not beyond EOF, too.
> > 
> > The bufferhead code handles this via the buffer_new() flag - it
> > triggers the skipping of read IO and the states in which it is
> > set are clearly indicated in iomap_to_bh(). That same logic needs to
> > apply here.
> 
> The buffer_new logic itself isn't really something to copy directly,
> as it has all kinds of warts.

Sure, my point was that it documents all the cases where we can
avoid reading from disk, not that we should copy the logic.

> > > I also noticed that speculative preallocation kicks in by the second 80M
> > > write() call and writeback for the second call can successfully allocate
> > > the entire preallocation, which means that the third (or nth) write call
> > > can have a real extent already mapped in, and then we end up reading it.
> > 
> > Yeah, that's because there's no check against EOF here. These writes
> > are all beyond EOF, so there shouldn't be any read at all...
> 
> The EOF case is already handled in iomap_block_needs_zeroing.

Ok, I missed that detail as it's in a different patch. It looks like
if (pos > EOF) it will be zeroed. But in this case I think that pos ==
EOF and so it was reading instead. That smells like an off-by-one bug
to me.
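
For reference, the predicate being discussed is essentially the following
(a sketch using the series' naming; making the comparison inclusive is
what would turn the pos == EOF read into zeroing):

	static inline bool
	iomap_block_needs_zeroing(struct inode *inode, struct iomap *iomap,
			loff_t pos)
	{
		return iomap->type != IOMAP_MAPPED ||
			pos >= i_size_read(inode);
	}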

> We just
> need to skip the read for ranges entirely covered by the write.

Yup.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v7 0/5] AIO add per-command iopriority

2018-05-22 Thread Adam Manzanares


On 5/22/18 11:30 AM, Jens Axboe wrote:
> On 5/22/18 12:30 PM, Al Viro wrote:
>> On Tue, May 22, 2018 at 11:55:04AM -0600, Jens Axboe wrote:
>>> On 5/22/18 11:52 AM, adam.manzana...@wdc.com wrote:
 From: Adam Manzanares 

 This is the per-I/O equivalent of the ioprio_set system call.
 See the following link for performance implications on a SATA HDD:
 https://lkml.org/lkml/2016/12/6/495

 First patch factors ioprio_check_cap function out of ioprio_set system 
 call to
 also be used by the aio ioprio interface.

 Second patch converts kiocb ki_hint field to a u16 to avoid kiocb bloat.

 Third patch passes ioprio hint from aio iocb to kiocb and initializes kiocb
 ioprio value appropriately when it is not explicitly set.

 Fourth patch enables the feature for blkdev.

 Fifth patch enables the feature for iomap direct IO
>>>
>>> LGTM, you can add:
>>>
>>> Reviewed-by: Jens Axboe 
>>>
>>> Al, are you picking this series up, or should I?
>>
>> Probably better if I do, once I finish reviewing Christoph's patchset -
>> we already have a bunch of stuff around fs/aio.c in this cycle...
> 
> Alright, sounds good, thanks Al.
> 

I was working on the man page update for this feature and noticed I
could be a bit more informative on error if I return the error value
returned by ioprio_check_cap() in fs/aio.c.
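
A sketch of what that could look like in the aio setup path (flag and
field names are from the series as posted; exact placement is an
assumption):

	if (iocb->aio_flags & IOCB_FLAG_IOPRIO) {
		ret = ioprio_check_cap(iocb->aio_reqprio);
		if (ret) {
			pr_debug("aio ioprio check cap error: %d\n", ret);
			return ret;	/* instead of a blanket -EINVAL */
		}
		req->ki_ioprio = iocb->aio_reqprio;
	}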

Should I resend the whole series?

Re: [PATCH 00/10] Misc block layer patches for bcachefs

2018-05-22 Thread Bart Van Assche
On Sun, 2018-05-20 at 18:17 -0400, Kent Overstreet wrote:
> On Fri, May 18, 2018 at 03:12:27PM +, Bart Van Assche wrote:
> > On Fri, 2018-05-18 at 05:06 -0400, Kent Overstreet wrote:
> > > On Thu, May 17, 2018 at 08:54:57PM +, Bart Van Assche wrote:
> > > > With Jens' latest for-next branch I hit the kernel warning shown below. 
> > > > Can
> > > > you have a look?
> > > 
> > > Any hints on how to reproduce it?
> > 
> > Sure. This is how I triggered it:
> > * Clone https://github.com/bvanassche/srp-test.
> > * Follow the instructions in README.md.
> > * Run srp-test/run_tests -c -r 10
> 
> Can you bisect it? I don't have infiniband hardware handy...

Hello Kent,

The bisect I ran reported the following commit as the first faulty commit:

block: Add warning for bi_next not NULL in bio_endio()

Bart.


Re: [PATCH v4 06/14] PCI/P2PDMA: Add P2P DMA driver writer's documentation

2018-05-22 Thread Logan Gunthorpe
Thanks for the review Randy! I'll make the changes for the next time we
post the series.

On 22/05/18 03:24 PM, Randy Dunlap wrote:
>> +The first task an orchestrator driver must do is compile a list of
>> +all client drivers that will be involved in a given transaction. For
>> +example, the NVMe Target driver creates a list including all NVMe drives
>  ^^
>or drivers 
> ?
> Could be either, I guess, but the previous sentence says "compile a list of 
> drivers."

I did mean "drives". But perhaps "devices" would be more clear. A list
of all NVMe drivers doesn't make much sense as I'm pretty sure there is
only one NVMe driver.

Logan


Re: [PATCH v4 06/14] PCI/P2PDMA: Add P2P DMA driver writer's documentation

2018-05-22 Thread Randy Dunlap
On 04/23/2018 04:30 PM, Logan Gunthorpe wrote:
> Add a restructured text file describing how to write drivers
> with support for P2P DMA transactions. The document describes
> how to use the APIs that were added in the previous few
> commits.
> 
> Also adds an index for the PCI documentation tree even though this
> is the only PCI document that has been converted to restructured text
> at this time.
> 
> Signed-off-by: Logan Gunthorpe 
> Cc: Jonathan Corbet 
> ---
>  Documentation/PCI/index.rst |  14 +++
>  Documentation/driver-api/pci/index.rst  |   1 +
>  Documentation/driver-api/pci/p2pdma.rst | 166 
> 
>  Documentation/index.rst |   3 +-
>  4 files changed, 183 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/PCI/index.rst
>  create mode 100644 Documentation/driver-api/pci/p2pdma.rst


> diff --git a/Documentation/driver-api/pci/p2pdma.rst 
> b/Documentation/driver-api/pci/p2pdma.rst
> new file mode 100644
> index ..49a512c405b2
> --- /dev/null
> +++ b/Documentation/driver-api/pci/p2pdma.rst
> @@ -0,0 +1,166 @@
> +
> +PCI Peer-to-Peer DMA Support
> +
> +
> +The PCI bus has pretty decent support for performing DMA transfers
> +between two endpoints on the bus. This type of transaction is
> +henceforth called Peer-to-Peer (or P2P). However, there are a number of
> +issues that make P2P transactions tricky to do in a perfectly safe way.
> +
> +One of the biggest issues is that PCI Root Complexes are not required
> +to support forwarding packets between Root Ports. To make things worse,
> +there is no simple way to determine if a given Root Complex supports
> +this or not. (See PCIe r4.0, sec 1.3.1). Therefore, as of this writing,
> +the kernel only supports doing P2P when the endpoints involved are all
> +behind the same PCIe root port as the spec guarantees that all
> +packets will always be routable but does not require routing between
> +root ports.
> +
> +The second issue is that to make use of existing interfaces in Linux,
> +memory that is used for P2P transactions needs to be backed by struct
> +pages. However, PCI BARs are not typically cache coherent so there are
> +a few corner case gotchas with these pages so developers need to
> +be careful about what they do with them.
> +
> +
> +Driver Writer's Guide
> +=
> +
> +In a given P2P implementation there may be three or more different
> +types of kernel drivers in play:
> +
> +* Providers - A driver which provides or publishes P2P resources like

   * Provider -

> +  memory or doorbell registers to other drivers.
> +* Clients - A driver which makes use of a resource by setting up a

   * Client -

> +  DMA transaction to or from it.
> +* Orchestrators - A driver which orchestrates the flow of data between

   * Orchestrator -

> +  clients and providers
> +
> +In many cases there could be overlap between these three types (ie.

  (i.e.,

> +it may be typical for a driver to be both a provider and a client).
> +
> +For example, in the NVMe Target Copy Offload implementation:
> +
> +* The NVMe PCI driver is both a client, provider and orchestrator
> +  in that it exposes any CMB (Controller Memory Buffer) as a P2P memory
> +  resource (provider), it accepts P2P memory pages as buffers in requests
> +  to be used directly (client) and it can also make use the CMB as
> +  submission queue entries.
> +* The RDMA driver is a client in this arrangement so that an RNIC
> +  can DMA directly to the memory exposed by the NVMe device.
> +* The NVMe Target driver (nvmet) can orchestrate the data from the RNIC
> +  to the P2P memory (CMB) and then to the NVMe device (and vice versa).
> +
> +This is currently the only arrangement supported by the kernel but
> +one could imagine slight tweaks to this that would allow for the same
> +functionality. For example, if a specific RNIC added a BAR with some
> +memory behind it, its driver could add support as a P2P provider and
> +then the NVMe Target could use the RNIC's memory instead of the CMB
> +in cases where the NVMe cards in use do not have CMB support.
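
A sketch of that orchestrator/client flow with the interfaces this series
adds may help; the function names are from the posted patches, but treat the
exact signatures as assumptions, and "rnic" and "nvme" are hypothetical
device pointers:

	LIST_HEAD(clients);
	struct pci_dev *p2p_dev;
	void *buf;

	/* collect every device that must be able to reach the P2P memory */
	if (pci_p2pdma_add_client(&clients, &rnic->dev) ||
	    pci_p2pdma_add_client(&clients, &nvme->dev))
		goto fallback;		/* fall back to regular system memory */

	/* pick a provider whose memory all registered clients can DMA to */
	p2p_dev = pci_p2pmem_find(&clients);
	if (p2p_dev)
		buf = pci_alloc_p2pmem(p2p_dev, length);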
> +
> +
> +Provider Drivers
> +================
> +
> +A provider simply needs to register a BAR (or a portion of a BAR)
> +as a P2P DMA resource using :c:func:`pci_p2pdma_add_resource()`.
> +This will register struct pages for all the specified memory.
> +
> +After that it may optionally publish all of its resources as
> +P2P memory using :c:func:`pci_p2pmem_publish()`. This will allow
> +any orchestrator drivers to find and use the memory. When marked in
> +this way, the resource must be regular memory with no side effects.
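
For illustration only (not part of the patch), a minimal provider probe
could look roughly like this, assuming the signatures in this series and an
arbitrary BAR 4:

	static int foo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
	{
		int rc;

		/* register all of BAR 4 as P2P DMA memory, offset 0 */
		rc = pci_p2pdma_add_resource(pdev, 4,
					     pci_resource_len(pdev, 4), 0);
		if (rc)
			return rc;

		/* optional: let orchestrators discover and use this memory */
		pci_p2pmem_publish(pdev, true);
		return 0;
	}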
> +
> +For the time being this is fairly rudimentary in that all resources
> +are typically going to be P2P memory. Future work will likely expand
> +this to include other types 

Re: [PATCH 2/6] scsi: cxlflash: Drop unused sense buffers

2018-05-22 Thread Matthew R. Ochs
On Tue, May 22, 2018 at 11:15:08AM -0700, Kees Cook wrote:
> This removes the unused sense buffer in read_cap16() and write_same16().
> 
> Signed-off-by: Kees Cook 

Looks good.

Acked-by: Matthew R. Ochs 



Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Christoph Hellwig
On Tue, May 22, 2018 at 02:29:21PM -0600, Jens Axboe wrote:
> > of BLK_EH_HANDLED. Is the driver supposed to return BLK_EH_NOT_HANDLED
> > when the driver actually knows the request has been completed before
> > returning the status?
> 
> If the driver knows it's completed, it'd have to return BLK_EH_NOT_HANDLED.
> Or BLK_EH_HANDLED would work too, since the above state transition would
> then fail.

Btw, I think we should just kill off BLK_EH_HANDLED.  WIP totally
untested progress toward that is here:

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-eh-ret

The only real missing bit is SCSI overloading the value for internal
use.



Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Bart Van Assche
On Tue, 2018-05-22 at 14:44 -0600, Keith Busch wrote:
> On Tue, May 22, 2018 at 08:36:27PM +, Bart Van Assche wrote:
> > 
> > Have you noticed that if blk_mq_complete_request() encounters a request with
> > state MQ_RQ_TIMED_OUT that it doesn't call __blk_mq_complete_request()? I 
> > think
> > the code in blk_mq_complete_request() together with the above code 
> > guarantees
> > that __blk_mq_complete_request() is only called once per request generation.
> 
> Okay, now to the BLK_EH_NOT_HANDLED case: that's supposedly the correct
> status to return if the driver knows blk_mq_complete_request() was called
> prior to returning from the timeout handler, so we need a similar check
> there, right?

Good catch. To me that seems like the right place to handle that case.

Bart.

Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Bart Van Assche
On Tue, 2018-05-22 at 14:38 -0600, Jens Axboe wrote:
> On 5/22/18 2:33 PM, Bart Van Assche wrote:
> > Thanks for having reported this. How about using the following change to 
> > suppress
> > that warning:
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index bb99c03e7a34..84e55ea55baf 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -844,6 +844,7 @@ static void blk_mq_rq_timed_out(struct request *req, 
> > bool reserved)
> >  
> > switch (ret) {
> > case BLK_EH_HANDLED:
> > +   blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE);
> > __blk_mq_complete_request(req);
> > break;
> > case BLK_EH_RESET_TIMER:
> > 
> > I think this will work better than what was proposed in your last e-mail. 
> > I'm afraid
> > that with that change that a completion that occurs while the timeout 
> > handler is
> > running can be ignored.
> 
> What if that races with eg requeue? We get the completion from IRQ, decide we
> need to requeue/restart the request rather than complete it.

Shouldn't block drivers that requeue a request from inside their timeout handler
return BLK_EH_NOT_HANDLED instead of BLK_EH_HANDLED?

Thanks,

Bart.




Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Keith Busch
On Tue, May 22, 2018 at 08:36:27PM +, Bart Van Assche wrote:
> 
> Have you noticed that if blk_mq_complete_request() encounters a request with
> state MQ_RQ_TIMED_OUT that it doesn't call __blk_mq_complete_request()? I 
> think
> the code in blk_mq_complete_request() together with the above code guarantees
> that __blk_mq_complete_request() is only called once per request generation.

Okay, now to the BLK_EH_NOT_HANDLED case: that's supposedly the correct
status to return if the driver knows blk_mq_complete_request() was called
prior to returning from the timeout handler, so we need a similar check
there, right?


Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Jens Axboe
On 5/22/18 2:33 PM, Bart Van Assche wrote:
> On Tue, 2018-05-22 at 13:38 -0600, Jens Axboe wrote:
>> On 5/22/18 1:03 PM, Jens Axboe wrote:
>>> On 5/22/18 12:47 PM, Jens Axboe wrote:
 Ran into this, running block/014 from blktests:

 [ 5744.949839] run blktests block/014 at 2018-05-22 12:41:25
 [ 5750.723000] null: rq ff68f103 timed out
 [ 5750.728181] WARNING: CPU: 45 PID: 2480 at block/blk-mq.c:585 
 __blk_mq_complete_request+0xa6/0x0
 [ 5750.738187] Modules linked in: null_blk(+) configfs nvme nvme_core 
 sb_edac x86_pkg_temp_therma]
 [ 5750.765509] CPU: 45 PID: 2480 Comm: kworker/45:1H Not tainted 
 4.17.0-rc6+ #712
 [ 5750.774087] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 
 11/09/2016
 [ 5750.783369] Workqueue: kblockd blk_mq_timeout_work
 [ 5750.789223] RIP: 0010:__blk_mq_complete_request+0xa6/0x110
 [ 5750.795850] RSP: 0018:883ffb417d68 EFLAGS: 00010202
 [ 5750.802187] RAX: 0003 RBX: 881ff100d800 RCX: 
 
 [ 5750.810649] RDX: 88407fd9e040 RSI: 88407fd956b8 RDI: 
 881ff100d800
 [ 5750.819119] RBP: e8d91800 R08:  R09: 
 82066eb8
 [ 5750.827588] R10: 883ffa386138 R11: 883ffa385900 R12: 
 0001
 [ 5750.836050] R13: 881fe7da6000 R14: 0020 R15: 
 0002
 [ 5750.844529] FS:  () GS:88407fd8() 
 knlGS:
 [ 5750.854482] CS:  0010 DS:  ES:  CR0: 80050033
 [ 5750.861397] CR2: 7ffc92f97f68 CR3: 0201d005 CR4: 
 003606e0
 [ 5750.869861] DR0:  DR1:  DR2: 
 
 [ 5750.878333] DR3:  DR6: fffe0ff0 DR7: 
 0400
 [ 5750.886805] Call Trace:
 [ 5750.890033]  bt_iter+0x42/0x50
 [ 5750.894000]  blk_mq_queue_tag_busy_iter+0x12b/0x220
 [ 5750.899941]  ? blk_mq_tag_to_rq+0x20/0x20
 [ 5750.904913]  ? __rcu_read_unlock+0x50/0x50
 [ 5750.909978]  ? blk_mq_tag_to_rq+0x20/0x20
 [ 5750.914948]  blk_mq_timeout_work+0x14b/0x240
 [ 5750.920220]  process_one_work+0x21b/0x510
 [ 5750.925197]  worker_thread+0x3a/0x390
 [ 5750.929781]  ? process_one_work+0x510/0x510
 [ 5750.934944]  kthread+0x11c/0x140
 [ 5750.939028]  ? kthread_create_worker_on_cpu+0x50/0x50
 [ 5750.945169]  ret_from_fork+0x1f/0x30
 [ 5750.949656] Code: 48 02 00 00 80 e6 80 74 29 8b 95 80 00 00 00 44 39 e2 
 75 3b 48 89 df ff 90 2 
 [ 5750.972139] ---[ end trace 40065cb1764bf500 ]---

 which is this:

 WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
>>>
>>> That check looks wrong, since TIMED_OUT -> COMPLETE is also a valid
>>> state transition. So that check should be:
>>>
>>> WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE &&
>>>  blk_mq_rq_state(rq) != MQ_RQ_TIMED_OUT);
>>
>> I guess it would be cleaner to actually do the transition, in
>> blk_mq_rq_timed_out():
>>
>> case BLK_EH_HANDLED: 
>>
>> if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT, 
>>
>> MQ_RQ_COMPLETE)) 
>>
>> __blk_mq_complete_request(req);  
>>
>> break;
>>
>> This works for me.
> 
> Hello Jens,
> 
> Thanks for having reported this. How about using the following change to 
> suppress
> that warning:
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index bb99c03e7a34..84e55ea55baf 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -844,6 +844,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool 
> reserved)
>  
>   switch (ret) {
>   case BLK_EH_HANDLED:
> + blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE);
>   __blk_mq_complete_request(req);
>   break;
>   case BLK_EH_RESET_TIMER:
> 
> I think this will work better than what was proposed in your last e-mail. I'm 
> afraid
> that with that change that a completion that occurs while the timeout handler 
> is
> running can be ignored.

What if that races with eg requeue? We get the completion from IRQ, decide we
need to requeue/restart the request rather than complete it.

-- 
Jens Axboe



Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Keith Busch
On Tue, May 22, 2018 at 08:36:27PM +, Bart Van Assche wrote:
> Have you noticed that if blk_mq_complete_request() encounters a request with
> state MQ_RQ_TIMED_OUT that it doesn't call __blk_mq_complete_request()? I 
> think
> the code in blk_mq_complete_request() together with the above code guarantees
> that __blk_mq_complete_request() is only called once per request generation.

Right, my mistake. I noticed that when I saw your reply on the EH_HANDLED
case, so looks fine.


Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Bart Van Assche
On Tue, 2018-05-22 at 14:33 -0600, Keith Busch wrote:
> On Tue, May 22, 2018 at 09:25:15AM -0700, Bart Van Assche wrote:
> > @@ -848,13 +843,22 @@ static void blk_mq_rq_timed_out(struct request *req, 
> > bool reserved)
> > case BLK_EH_RESET_TIMER:
> > +   blk_mq_add_timer(req);
> > /*
> > +* The loop is for the unlikely case of a race with the
> > +* completion code. There is no need to reset the deadline
> > +* if the transition to the in-flight state fails because
> > +* the deadline only matters in the in-flight state.
> >  */
> > -   blk_mq_rq_update_aborted_gstate(req, 0);
> > -   blk_add_timer(req);
> > +   while (true) {
> > +   if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,
> > +  MQ_RQ_IN_FLIGHT))
> > +   break;
> > +   if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) {
> > +   __blk_mq_complete_request(req);
> > +   break;
> > +   }
> > +   }
> 
> I'm having some trouble triggering this case where the state is already
> MQ_RQ_COMPLETE, so I'm just trying to figure this out through inspection.
> 
> It looks like the new blk_mq_complete_request() already called
> __blk_mq_complete_request() when it gets the state to MQ_RQ_COMPLETE,
> so doing that again completes it a second time.

Hello Keith,

Have you noticed that if blk_mq_complete_request() encounters a request with
state MQ_RQ_TIMED_OUT that it doesn't call __blk_mq_complete_request()? I think
the code in blk_mq_complete_request() together with the above code guarantees
that __blk_mq_complete_request() is only called once per request generation.

Bart.





Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Bart Van Assche
On Tue, 2018-05-22 at 13:38 -0600, Jens Axboe wrote:
> On 5/22/18 1:03 PM, Jens Axboe wrote:
> > On 5/22/18 12:47 PM, Jens Axboe wrote:
> > > Ran into this, running block/014 from blktests:
> > > 
> > > [ 5744.949839] run blktests block/014 at 2018-05-22 12:41:25
> > > [ 5750.723000] null: rq ff68f103 timed out
> > > [ 5750.728181] WARNING: CPU: 45 PID: 2480 at block/blk-mq.c:585 
> > > __blk_mq_complete_request+0xa6/0x0
> > > [ 5750.738187] Modules linked in: null_blk(+) configfs nvme nvme_core 
> > > sb_edac x86_pkg_temp_therma]
> > > [ 5750.765509] CPU: 45 PID: 2480 Comm: kworker/45:1H Not tainted 
> > > 4.17.0-rc6+ #712
> > > [ 5750.774087] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 
> > > 11/09/2016
> > > [ 5750.783369] Workqueue: kblockd blk_mq_timeout_work
> > > [ 5750.789223] RIP: 0010:__blk_mq_complete_request+0xa6/0x110
> > > [ 5750.795850] RSP: 0018:883ffb417d68 EFLAGS: 00010202
> > > [ 5750.802187] RAX: 0003 RBX: 881ff100d800 RCX: 
> > > 
> > > [ 5750.810649] RDX: 88407fd9e040 RSI: 88407fd956b8 RDI: 
> > > 881ff100d800
> > > [ 5750.819119] RBP: e8d91800 R08:  R09: 
> > > 82066eb8
> > > [ 5750.827588] R10: 883ffa386138 R11: 883ffa385900 R12: 
> > > 0001
> > > [ 5750.836050] R13: 881fe7da6000 R14: 0020 R15: 
> > > 0002
> > > [ 5750.844529] FS:  () GS:88407fd8() 
> > > knlGS:
> > > [ 5750.854482] CS:  0010 DS:  ES:  CR0: 80050033
> > > [ 5750.861397] CR2: 7ffc92f97f68 CR3: 0201d005 CR4: 
> > > 003606e0
> > > [ 5750.869861] DR0:  DR1:  DR2: 
> > > 
> > > [ 5750.878333] DR3:  DR6: fffe0ff0 DR7: 
> > > 0400
> > > [ 5750.886805] Call Trace:
> > > [ 5750.890033]  bt_iter+0x42/0x50
> > > [ 5750.894000]  blk_mq_queue_tag_busy_iter+0x12b/0x220
> > > [ 5750.899941]  ? blk_mq_tag_to_rq+0x20/0x20
> > > [ 5750.904913]  ? __rcu_read_unlock+0x50/0x50
> > > [ 5750.909978]  ? blk_mq_tag_to_rq+0x20/0x20
> > > [ 5750.914948]  blk_mq_timeout_work+0x14b/0x240
> > > [ 5750.920220]  process_one_work+0x21b/0x510
> > > [ 5750.925197]  worker_thread+0x3a/0x390
> > > [ 5750.929781]  ? process_one_work+0x510/0x510
> > > [ 5750.934944]  kthread+0x11c/0x140
> > > [ 5750.939028]  ? kthread_create_worker_on_cpu+0x50/0x50
> > > [ 5750.945169]  ret_from_fork+0x1f/0x30
> > > [ 5750.949656] Code: 48 02 00 00 80 e6 80 74 29 8b 95 80 00 00 00 44 39 
> > > e2 75 3b 48 89 df ff 90 2 
> > > [ 5750.972139] ---[ end trace 40065cb1764bf500 ]---
> > > 
> > > which is this:
> > > 
> > > WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
> > 
> > That check looks wrong, since TIMED_OUT -> COMPLETE is also a valid
> > state transition. So that check should be:
> > 
> > WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE &&
> >  blk_mq_rq_state(rq) != MQ_RQ_TIMED_OUT);
> 
> I guess it would be cleaner to actually do the transition, in
> blk_mq_rq_timed_out():
> 
> case BLK_EH_HANDLED:  
>   
> if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,  
>   
> MQ_RQ_COMPLETE))  
>   
> __blk_mq_complete_request(req);   
>   
> break;
> 
> This works for me.

Hello Jens,

Thanks for having reported this. How about using the following change to 
suppress
that warning:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index bb99c03e7a34..84e55ea55baf 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -844,6 +844,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool 
reserved)
 
switch (ret) {
case BLK_EH_HANDLED:
+   blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE);
__blk_mq_complete_request(req);
break;
case BLK_EH_RESET_TIMER:

I think this will work better than what was proposed in your last e-mail. I'm 
afraid
that with that change that a completion that occurs while the timeout handler is
running can be ignored.

Thanks,

Bart.



Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Keith Busch
On Tue, May 22, 2018 at 09:25:15AM -0700, Bart Van Assche wrote:
> @@ -848,13 +843,22 @@ static void blk_mq_rq_timed_out(struct request *req, 
> bool reserved)
>   case BLK_EH_RESET_TIMER:
> + blk_mq_add_timer(req);
>   /*
> +  * The loop is for the unlikely case of a race with the
> +  * completion code. There is no need to reset the deadline
> +  * if the transition to the in-flight state fails because
> +  * the deadline only matters in the in-flight state.
>*/
> - blk_mq_rq_update_aborted_gstate(req, 0);
> - blk_add_timer(req);
> + while (true) {
> + if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,
> +MQ_RQ_IN_FLIGHT))
> + break;
> + if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) {
> + __blk_mq_complete_request(req);
> + break;
> + }
> + }

I'm having some trouble triggering this case where the state is already
MQ_RQ_COMPLETE, so I'm just trying to figure this out through inspection.

It looks like the new blk_mq_complete_request() already called
__blk_mq_complete_request() when it gets the state to MQ_RQ_COMPLETE,
so doing that again completes it a second time.


Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Jens Axboe
On 5/22/18 2:26 PM, Keith Busch wrote:
> On Tue, May 22, 2018 at 01:38:06PM -0600, Jens Axboe wrote:
>> I guess it would be cleaner to actually do the transition, in
>> blk_mq_rq_timed_out():
>>
>> case BLK_EH_HANDLED: 
>>
>> if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT, 
>>
>> MQ_RQ_COMPLETE)) 
>>
>> __blk_mq_complete_request(req);  
>>
>> break;
>>
>> This works for me.
> 
> Works for me as well on manual fault injection tests.
> 
> I think this change above goes back to Christoph's point earlier on usage
> of BLK_EH_HANDLED. Is the driver supposed to return BLK_EH_NOT_HANDLED
> when the driver actually knows the request has been completed before
> returning the status?

If the driver knows it's completed, it'd have to return BLK_EH_NOT_HANDLED.
Or BLK_EH_HANDLED would work too, since the above state transition would
then fail.

-- 
Jens Axboe
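
For readers following the thread: blk_mq_change_rq_state() in Bart's patch
is essentially a compare-and-swap on the per-request state, which is why the
transition fails if the completion path already moved the request to
MQ_RQ_COMPLETE. A minimal sketch of the idea; the field name and the plain
cmpxchg() are assumptions for illustration, not the exact code from v13:

	/* return true if @rq moved from @old to @new, false if we raced */
	static inline bool blk_mq_change_rq_state(struct request *rq,
						  enum mq_rq_state old,
						  enum mq_rq_state new)
	{
		return cmpxchg(&rq->mq_state, old, new) == old;
	}

Only one of the timeout path and the completion path can win the cmpxchg(),
so __blk_mq_complete_request() runs at most once per request generation.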



Re: buffered I/O without buffer heads in xfs and iomap v2

2018-05-22 Thread Christoph Hellwig
On Mon, May 21, 2018 at 01:46:10PM -0700, Darrick J. Wong wrote:
> On Fri, May 18, 2018 at 06:47:56PM +0200, Christoph Hellwig wrote:
> > Hi all,
> > 
> > this series adds support for buffered I/O without buffer heads to
> > the iomap and XFS code.
> > 
> > For now this series only contains support for block size == PAGE_SIZE,
> > with the 4k support split into a separate series.
> > 
> > 
> > A git tree is available at:
> > 
> > git://git.infradead.org/users/hch/xfs.git xfs-iomap-read.2
> > 
> > Gitweb:
> > 
> > 
> > http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-iomap-read.2
> 
> Hmm, so I pulled this and ran my trivial stupid benchmark on for-next.
> It's a stupid VM with a 2G of RAM and a 12GB virtio-scsi disk backed by
> tmpfs:

The xfs-iomap-read.3 branch in the above repo should sort out these
issues.  Still waiting for some more feedback before reposting.


Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Keith Busch
On Tue, May 22, 2018 at 01:38:06PM -0600, Jens Axboe wrote:
> I guess it would be cleaner to actually do the transition, in
> blk_mq_rq_timed_out():
> 
> case BLK_EH_HANDLED:  
>   
> if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,  
>   
> MQ_RQ_COMPLETE))  
>   
> __blk_mq_complete_request(req);   
>   
> break;
> 
> This works for me.

Works for me as well on manual fault injection tests.

I think this change above goes back to Christoph's point earlier on usage
of BLK_EH_HANDLED. Is the driver supposed to return BLK_EH_NOT_HANDLED
when the driver actually knows the request has been completed before
returning the status?


Re: [PATCH V2] blk-mq: avoid to starve tag allocation after allocation process migrates

2018-05-22 Thread Jens Axboe
On 5/19/18 1:44 AM, Ming Lei wrote:
> When the allocation process is scheduled back and the mapped hw queue is
> changed, do one extra wake up on original queue for compensating wake up
> miss, so other allocations on the original queue won't be starved.
> 
> This patch fixes one request allocation hang issue, which can be
> triggered easily in case of very low nr_request.

Can you explain what the actual bug is? The commit message doesn't
really say, it just explains what the code change does. What wake up
miss?

-- 
Jens Axboe



Re: [PATCH] block: kyber: make kyber more friendly with merging

2018-05-22 Thread Omar Sandoval
On Tue, May 22, 2018 at 10:48:29PM +0800, Jianchao Wang wrote:
> Currently, kyber is very unfriendly with merging. kyber depends
> on ctx rq_list to do merging; however, most of the time it will not
> leave any requests in ctx rq_list. This is because even if tokens
> of one domain are used up, kyber will try to dispatch requests
> from other domains and flush the rq_list there.

That's a great catch, I totally missed this.

This approach does end up duplicating a lot of code with the blk-mq core
even after Jens' change, so I'm curious if you tried other approaches.
One idea I had is to try the bio merge against the kqd->rqs lists. Since
that's per-queue, the locking overhead might be too high. Alternatively,
you could keep the software queues as-is but add our own version of
flush_busy_ctxs() that only removes requests of the domain that we want.
If one domain gets backed up, that might get messy with long iterations,
though.
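
For concreteness, that flush_busy_ctxs() alternative might look roughly like
the sketch below; the callback shape follows the 4.16-era sbitmap iteration
in blk-mq, and the struct name and fields are made up for illustration:

	struct kyber_flush_data {
		struct blk_mq_hw_ctx *hctx;
		unsigned int domain;
		struct list_head *list;
	};

	static bool kyber_flush_busy_ctx_domain(struct sbitmap *sb,
						unsigned int bitnr, void *data)
	{
		struct kyber_flush_data *fd = data;
		struct blk_mq_ctx *ctx = fd->hctx->ctxs[bitnr];
		struct request *rq, *next;

		spin_lock(&ctx->lock);
		list_for_each_entry_safe(rq, next, &ctx->rq_list, queuelist) {
			/* only steal requests of the wanted scheduling domain */
			if (kyber_sched_domain(rq->cmd_flags) == fd->domain)
				list_move_tail(&rq->queuelist, fd->list);
		}
		/* leave the ctx marked busy if other domains still have requests */
		if (list_empty(&ctx->rq_list))
			sbitmap_clear_bit(sb, bitnr);
		spin_unlock(&ctx->lock);

		return true;
	}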

Regarding this approach, a couple of comments below.

> To improve this, we setup kyber_ctx_queue (kcq) which is similar
> with ctx, but it has rq_lists for different domain and build same
> mapping between kcq and khd as the ctx & hctx. Then we could merge,
> insert and dispatch for different domains separately. If one domain
> token is used up, the requests could be left in the rq_list of
> that domain and maybe merged with following io.
> 
> Following is my test result on machine with 8 cores and NVMe card
> INTEL SSDPEKKR128G7
> 
> fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
> seq/random
> +------+----------+-----------+------------+-----------------+---------+
> |patch?| bw(MB/s) |   iops    | slat(usec) |   clat(usec)    |  merge  |
> +------+----------+-----------+------------+-----------------+---------+
> | w/o  |  606/612 | 151k/153k |  6.89/7.03 | 3349.21/3305.40 |   0/0   |
> +------+----------+-----------+------------+-----------------+---------+
> | w/   | 1083/616 | 277k/154k |  4.93/6.95 | 1830.62/3279.95 | 223k/3k |
> +------+----------+-----------+------------+-----------------+---------+
> When numjobs is set to 16, the bw and iops could reach 1662MB/s and 425k
> on my platform.
> 
> Signed-off-by: Jianchao Wang 
> ---
>  block/kyber-iosched.c | 240 
> --
>  1 file changed, 212 insertions(+), 28 deletions(-)
> 
> diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
> index 0d6d25e3..04da05b 100644
> --- a/block/kyber-iosched.c
> +++ b/block/kyber-iosched.c
> @@ -72,6 +72,15 @@ static const unsigned int kyber_batch_size[] = {
>   [KYBER_OTHER] = 8,
>  };
>  
> +struct kyber_ctx_queue {
> + /*
> +  * Copied from blk_mq_ctx->index_hw
> +  */
> + unsigned int index;
> + spinlock_t lock;
> + struct list_head rq_list[KYBER_NUM_DOMAINS];
> +} cacheline_aligned_in_smp;
> +
>  struct kyber_queue_data {
>   struct request_queue *q;
>  
> @@ -84,6 +93,7 @@ struct kyber_queue_data {
>*/
>   struct sbitmap_queue domain_tokens[KYBER_NUM_DOMAINS];
>  
> + struct kyber_ctx_queue *ctx_queue;
>   /*
>* Async request percentage, converted to per-word depth for
>* sbitmap_get_shallow().
> @@ -99,6 +109,8 @@ struct kyber_hctx_data {
>   struct list_head rqs[KYBER_NUM_DOMAINS];
>   unsigned int cur_domain;
>   unsigned int batching;
> + struct kyber_ctx_queue **kcqs;
> + struct sbitmap kcq_map[KYBER_NUM_DOMAINS];
>   wait_queue_entry_t domain_wait[KYBER_NUM_DOMAINS];
>   struct sbq_wait_state *domain_ws[KYBER_NUM_DOMAINS];
>   atomic_t wait_index[KYBER_NUM_DOMAINS];
> @@ -284,6 +296,19 @@ static unsigned int kyber_sched_tags_shift(struct 
> kyber_queue_data *kqd)
>   return kqd->q->queue_hw_ctx[0]->sched_tags->bitmap_tags.sb.shift;
>  }
>  
> +static void kyber_ctx_queue_init(struct kyber_queue_data *kqd)
> +{
> + unsigned int i, j;
> +
> + for (i = 0; i < nr_cpu_ids; i++) {
> + struct kyber_ctx_queue *kcq = &kqd->ctx_queue[i];
> +
> + spin_lock_init(&kcq->lock);
> + for (j = 0; j < KYBER_NUM_DOMAINS; j++)
> + INIT_LIST_HEAD(&kcq->rq_list[j]);
> + }
> +}
> +
>  static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue 
> *q)
>  {
>   struct kyber_queue_data *kqd;
> @@ -302,6 +327,13 @@ static struct kyber_queue_data 
> *kyber_queue_data_alloc(struct request_queue *q)
>   if (!kqd->cb)
>   goto err_kqd;
>  
> + kqd->ctx_queue = kmalloc_array_node(nr_cpu_ids,
> + sizeof(struct kyber_ctx_queue), GFP_KERNEL, -1);

The whitespace here and in several other places is weird, please run
this through checkpatch.

> + if (!kqd->ctx_queue)
> + goto err_cb;
> +
> + kyber_ctx_queue_init(kqd);
> +
>   /*
>* The maximum number of tokens for any scheduling domain is at least
>* the queue depth of a single hardware queue. If the hardware doesn't
> @@ 

Re: [PATCH] bcache: do not check return value of debugfs_create_dir()

2018-05-22 Thread Kai Krakow
Hello Coly!

2018-05-22 9:10 GMT+02:00 Coly Li :
> Greg KH suggests that normal code should not care about debugfs. Therefore,
> no matter whether debugfs_create_dir() execution succeeds or fails, it is
> unnecessary to check its return value.
>
> There are two functions that call debugfs_create_dir() and check the return
> value, namely bch_debug_init() and closure_debug_init(). This patch
> changes these two functions from int to void type, and ignores the return
> values of debugfs_create_dir().
>
> This patch does not fix an exact bug; it just makes things work as they should.

I applied the patch to master and cherry-picked to my 4.16 branch
(cleaning up the conflicts).

I'm not sure if I did the conflict in closure_debug right but as I
compiled with CONFIG_DEBUG_FS=n, I think it doesn't matter for my
configuration.

The resulting patch works, currently running it while writing this message.

So you can add my tested-by if it makes sense.

Regards,
Kai


> Signed-off-by: Coly Li 
> Suggested-by: Greg Kroah-Hartman 
> Cc: Kai Krakow 
> Cc: Kent Overstreet 
> ---
>  drivers/md/bcache/bcache.h  |  2 +-
>  drivers/md/bcache/closure.c | 13 +
>  drivers/md/bcache/closure.h |  4 ++--
>  drivers/md/bcache/debug.c   | 11 ++-
>  drivers/md/bcache/super.c   |  4 +++-
>  5 files changed, 21 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 3a0cfb237af9..bee6251d2d40 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -995,7 +995,7 @@ void bch_open_buckets_free(struct cache_set *);
>  int bch_cache_allocator_start(struct cache *ca);
>
>  void bch_debug_exit(void);
> -int bch_debug_init(struct kobject *);
> +void bch_debug_init(struct kobject *);
>  void bch_request_exit(void);
>  int bch_request_init(void);
>
> diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c
> index 0e14969182c6..618253683d40 100644
> --- a/drivers/md/bcache/closure.c
> +++ b/drivers/md/bcache/closure.c
> @@ -199,11 +199,16 @@ static const struct file_operations debug_ops = {
> .release= single_release
>  };
>
> -int __init closure_debug_init(void)
> +void  __init closure_debug_init(void)
>  {
> -   closure_debug = debugfs_create_file("closures",
> -   0400, bcache_debug, NULL, &debug_ops);
> -   return IS_ERR_OR_NULL(closure_debug);
> +   if (!IS_ERR_OR_NULL(bcache_debug))
> +   /*
> +* it is unnecessary to check return value of
> +* debugfs_create_file(), we should not care
> +* about this.
> +*/
> +   closure_debug = debugfs_create_file(
> +   "closures", 0400, bcache_debug, NULL, _ops);
>  }
>  #endif
>
> diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h
> index 71427eb5fdae..7c2c5bc7c88b 100644
> --- a/drivers/md/bcache/closure.h
> +++ b/drivers/md/bcache/closure.h
> @@ -186,13 +186,13 @@ static inline void closure_sync(struct closure *cl)
>
>  #ifdef CONFIG_BCACHE_CLOSURES_DEBUG
>
> -int closure_debug_init(void);
> +void closure_debug_init(void);
>  void closure_debug_create(struct closure *cl);
>  void closure_debug_destroy(struct closure *cl);
>
>  #else
>
> -static inline int closure_debug_init(void) { return 0; }
> +static inline void closure_debug_init(void) {}
>  static inline void closure_debug_create(struct closure *cl) {}
>  static inline void closure_debug_destroy(struct closure *cl) {}
>
> diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
> index d030ce3025a6..57f8f5aeee55 100644
> --- a/drivers/md/bcache/debug.c
> +++ b/drivers/md/bcache/debug.c
> @@ -248,11 +248,12 @@ void bch_debug_exit(void)
> debugfs_remove_recursive(bcache_debug);
>  }
>
> -int __init bch_debug_init(struct kobject *kobj)
> +void __init bch_debug_init(struct kobject *kobj)
>  {
> -   if (!IS_ENABLED(CONFIG_DEBUG_FS))
> -   return 0;
> -
> +   /*
> +* it is unnecessary to check return value of
> +* debugfs_create_file(), we should not care
> +* about this.
> +*/
> bcache_debug = debugfs_create_dir("bcache", NULL);
> -   return IS_ERR_OR_NULL(bcache_debug);
>  }
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 3dea06b41d43..2b62671e4d83 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -2301,10 +2301,12 @@ static int __init bcache_init(void)
> if (!(bcache_wq = alloc_workqueue("bcache", WQ_MEM_RECLAIM, 0)) ||
> !(bcache_kobj = kobject_create_and_add("bcache", fs_kobj)) ||
> bch_request_init() ||
> -   bch_debug_init(bcache_kobj) || closure_debug_init() ||
> sysfs_create_files(bcache_kobj, files))
> goto err;
>
> +   

Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Jens Axboe
On 5/22/18 1:03 PM, Jens Axboe wrote:
> On 5/22/18 12:47 PM, Jens Axboe wrote:
>> On 5/22/18 11:17 AM, Jens Axboe wrote:
>>> On 5/22/18 10:44 AM, Jens Axboe wrote:

 On 5/22/18 10:25 AM, Bart Van Assche wrote:
> Recently the blk-mq timeout handling code was reworked. See also Tejun
> Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
> This patch reworks the blk-mq timeout handling code again. The timeout
> handling code is simplified by introducing a state machine per request.
> This change avoids that the blk-mq timeout handling code ignores
> completions that occur after blk_mq_check_expired() has been called and
> before blk_mq_rq_timed_out() has been called.

 I'll take a look at this again, getting rid of cmpxchg64 makes me
 much more comfortable.
>>>
>>> FWIW, a quick pass on runtime testing works fine. As expected, it's
>>> more efficient than what's currently in the kernel, testing with both
>>> null_blk (1 and nr_cpus worth of queues), and nvme as well. A huge win
>>> is that we shrink the request size from 360 bytes to 312, and I did
>>> a small followup patch that brings that to 304. That's a 15% reduction,
>>> massive.
>>
>> Ran into this, running block/014 from blktests:
>>
>> [ 5744.949839] run blktests block/014 at 2018-05-22 12:41:25
>> [ 5750.723000] null: rq ff68f103 timed out
>> [ 5750.728181] WARNING: CPU: 45 PID: 2480 at block/blk-mq.c:585 
>> __blk_mq_complete_request+0xa6/0x0
>> [ 5750.738187] Modules linked in: null_blk(+) configfs nvme nvme_core 
>> sb_edac x86_pkg_temp_therma]
>> [ 5750.765509] CPU: 45 PID: 2480 Comm: kworker/45:1H Not tainted 4.17.0-rc6+ 
>> #712
>> [ 5750.774087] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 
>> 11/09/2016
>> [ 5750.783369] Workqueue: kblockd blk_mq_timeout_work
>> [ 5750.789223] RIP: 0010:__blk_mq_complete_request+0xa6/0x110
>> [ 5750.795850] RSP: 0018:883ffb417d68 EFLAGS: 00010202
>> [ 5750.802187] RAX: 0003 RBX: 881ff100d800 RCX: 
>> 
>> [ 5750.810649] RDX: 88407fd9e040 RSI: 88407fd956b8 RDI: 
>> 881ff100d800
>> [ 5750.819119] RBP: e8d91800 R08:  R09: 
>> 82066eb8
>> [ 5750.827588] R10: 883ffa386138 R11: 883ffa385900 R12: 
>> 0001
>> [ 5750.836050] R13: 881fe7da6000 R14: 0020 R15: 
>> 0002
>> [ 5750.844529] FS:  () GS:88407fd8() 
>> knlGS:
>> [ 5750.854482] CS:  0010 DS:  ES:  CR0: 80050033
>> [ 5750.861397] CR2: 7ffc92f97f68 CR3: 0201d005 CR4: 
>> 003606e0
>> [ 5750.869861] DR0:  DR1:  DR2: 
>> 
>> [ 5750.878333] DR3:  DR6: fffe0ff0 DR7: 
>> 0400
>> [ 5750.886805] Call Trace:
>> [ 5750.890033]  bt_iter+0x42/0x50
>> [ 5750.894000]  blk_mq_queue_tag_busy_iter+0x12b/0x220
>> [ 5750.899941]  ? blk_mq_tag_to_rq+0x20/0x20
>> [ 5750.904913]  ? __rcu_read_unlock+0x50/0x50
>> [ 5750.909978]  ? blk_mq_tag_to_rq+0x20/0x20
>> [ 5750.914948]  blk_mq_timeout_work+0x14b/0x240
>> [ 5750.920220]  process_one_work+0x21b/0x510
>> [ 5750.925197]  worker_thread+0x3a/0x390
>> [ 5750.929781]  ? process_one_work+0x510/0x510
>> [ 5750.934944]  kthread+0x11c/0x140
>> [ 5750.939028]  ? kthread_create_worker_on_cpu+0x50/0x50
>> [ 5750.945169]  ret_from_fork+0x1f/0x30
>> [ 5750.949656] Code: 48 02 00 00 80 e6 80 74 29 8b 95 80 00 00 00 44 39 e2 
>> 75 3b 48 89 df ff 90 2 
>> [ 5750.972139] ---[ end trace 40065cb1764bf500 ]---
>>
>> which is this:
>>
>> WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
> 
> That check looks wrong, since TIMED_OUT -> COMPLETE is also a valid
> state transition. So that check should be:
> 
> WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE &&
>  blk_mq_rq_state(rq) != MQ_RQ_TIMED_OUT);

I guess it would be cleaner to actually do the transition, in
blk_mq_rq_timed_out():

case BLK_EH_HANDLED:
if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,
MQ_RQ_COMPLETE))
__blk_mq_complete_request(req); 
break;

This works for me.

-- 
Jens Axboe



Re: [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

2018-05-22 Thread Jens Axboe
On 5/22/18 1:13 PM, Christoph Hellwig wrote:
> On Tue, May 22, 2018 at 01:09:41PM -0600, Jens Axboe wrote:
>> I think Martin and Christoph are objecting to moving the code to
>> block/scsi_ioctl.h. I don't care too much about where the code is, but
>> think it would be nice to have the definitions in a separate header. But
>> if they prefer just pulling in all of SCSI for it, well then I guess
>> it's pointless to move the header bits. Seems very heavy handed to me,
>> though.
> 
> It might be heavy handed for the 3 remaining users of drivers/ide,

Brutal :-)

> but as long as that stuff just keeps working I'd rather worry about
> everyone else, and keep the scsi code where it belongs.

Fine with me then, hopefully we can some day kill it off.

-- 
Jens Axboe



Re: [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

2018-05-22 Thread Christoph Hellwig
On Tue, May 22, 2018 at 01:09:41PM -0600, Jens Axboe wrote:
> I think Martin and Christoph are objecting to moving the code to
> block/scsi_ioctl.h. I don't care too much about where the code is, but
> think it would be nice to have the definitions in a separate header. But
> if they prefer just pulling in all of SCSI for it, well then I guess
> it's pointless to move the header bits. Seems very heavy handed to me,
> though.

It might be heavy handed for the 3 remaining users of drivers/ide,
but as long as that stuff just keeps working I'd rather worry about
everyone else, and keep the scsi code where it belongs.


Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()

2018-05-22 Thread Mike Snitzer
On Tue, May 22 2018 at  2:41am -0400,
Christoph Hellwig  wrote:

> On Mon, May 21, 2018 at 07:38:55PM -0400, Kent Overstreet wrote:
> > On Mon, May 21, 2018 at 02:24:32PM -0400, Mike Snitzer wrote:
> > > Every single data structure change in this series should be reviewed for
> > > unforeseen alignment consequences.  Jens seemed to say that is
> > > worthwhile.  Not sure if he'll do it or we divide it up.  If we divide
> > > it up a temp topic branch should be published for others to inspect.
> > > 
> > > Could be alignment hasn't been a historic concern for a bunch of the
> > > data structures changed in this series.. if so then all we can do is fix
> > > up any obvious potential for false sharing.
> > 
> > Honestly, I almost never worry about alignment... the very few times I do 
> > care,
> > I use __cacheline_aligned_in_smp.
> 
> And that generally is the right strategy.  If Mike really doesn't want
> a change we can just open code the kmalloc for the bio set there, the
> important point is that we should not keep the old API around for no
> good reason.

Again, I never said I didn't want these changes.  I just thought it
worthwhile to mention the potential for alignment quirks arising.

Reality is false sharing is quite uncommon.  So my concern was/is pretty
niche and unlikely to be applicable.

That said, I did take the opportunity to look at all the DM structures
that were modified.  The bio_set and mempool_t structs are so large that
they span multiple cachelines as is.  So I focused on eliminating
unnecessary spanning of cachelines (by non-bio_set and non-mempool_t
members) and eliminated most holes in DM structs.  This is the result:
http://people.redhat.com/msnitzer/dm-4.18-struct-reorg/

Compare before.txt vs after_kent.txt vs after_mike.txt

Nothing staggering or special.  Just something I'll add once Kent's
latest changes land.

But, I also eliminated 2 4-byte holes in struct bio_set, Jens please
feel free to pick this up (if you think it useful):

From: Mike Snitzer 
Subject: [PATCH] block: eliminate 2 4-byte holes in bio_set structure

Signed-off-by: Mike Snitzer 
---
 include/linux/bio.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 5e472fc..e6fc692 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -735,7 +735,6 @@ static inline void bio_inc_remaining(struct bio *bio)
 
 struct bio_set {
struct kmem_cache *bio_slab;
-   unsigned int front_pad;
 
mempool_t bio_pool;
mempool_t bvec_pool;
@@ -743,6 +742,7 @@ struct bio_set {
mempool_t bio_integrity_pool;
mempool_t bvec_integrity_pool;
 #endif
+   unsigned int front_pad;
 
/*
 * Deadlock avoidance for stacking block drivers: see comments in
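
For anyone curious where holes like these come from: on a 64-bit build, a
4-byte member sandwiched between pointer-aligned members leaves 4 bytes of
padding, so grouping same-size members closes the gap. A simplified
illustration (assumed LP64 layout, not the full struct):

	struct layout_demo {
		struct kmem_cache *bio_slab;	/* 8 bytes */
		unsigned int front_pad;		/* 4 bytes + 4-byte hole */
		mempool_t bio_pool;		/* 8-byte aligned */
	};

Moving front_pad below the mempools, as the patch does, pushes the 4-byte
member to the end of the pointer-aligned run, where at most tail padding
remains.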



Re: [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

2018-05-22 Thread Jens Axboe
On 5/22/18 12:59 PM, Kees Cook wrote:
> On Tue, May 22, 2018 at 11:50 AM, Martin K. Petersen
>  wrote:
>>
>> Christoph,
>>
>>> On Tue, May 22, 2018 at 11:15:09AM -0700, Kees Cook wrote:
 Both SCSI and ATAPI share the sense header. In preparation for using the
 struct scsi_sense_hdr more widely, move this into a separate header and
 move the helper function to scsi_ioctl.c which is linked with CONFIG_IDE
 by way of CONFIG_BLK_SCSI_REQUEST.
>>>
>>> Please keep the code where it is and just depend on SCSI on the legacy
>>> ide driver.  No need to do gymnastics just for a legacy case.
>>
>> Yup, I agree.
> 
> Oh, er, this was mainly done at Jens's request. Jens, can you advise?

I think Martin and Christoph are objecting to moving the code to
block/scsi_ioctl.h. I don't care too much about where the code is, but
think it would be nice to have the definitions in a separate header. But
if they prefer just pulling in all of SCSI for it, well then I guess
it's pointless to move the header bits. Seems very heavy handed to me,
though.

-- 
Jens Axboe



Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Jens Axboe
On 5/22/18 12:47 PM, Jens Axboe wrote:
> On 5/22/18 11:17 AM, Jens Axboe wrote:
>> On 5/22/18 10:44 AM, Jens Axboe wrote:
>>>
>>> On 5/22/18 10:25 AM, Bart Van Assche wrote:
 Recently the blk-mq timeout handling code was reworked. See also Tejun
 Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
 This patch reworks the blk-mq timeout handling code again. The timeout
 handling code is simplified by introducing a state machine per request.
 This change avoids that the blk-mq timeout handling code ignores
 completions that occur after blk_mq_check_expired() has been called and
 before blk_mq_rq_timed_out() has been called.
>>>
>>> I'll take a look at this again, getting rid of cmpxchg64 makes me
>>> much more comfortable.
>>
>> FWIW, a quick pass on runtime testing works fine. As expected, it's
>> more efficient than what's currently in the kernel, testing with both
>> null_blk (1 and nr_cpus worth of queues), and nvme as well. A huge win
>> is that we shrink the request size from 360 bytes to 312, and I did
>> a small followup patch that brings that to 304. That's a 15% reduction,
>> massive.
> 
> Ran into this, running block/014 from blktests:
> 
> [ 5744.949839] run blktests block/014 at 2018-05-22 12:41:25
> [ 5750.723000] null: rq ff68f103 timed out
> [ 5750.728181] WARNING: CPU: 45 PID: 2480 at block/blk-mq.c:585 
> __blk_mq_complete_request+0xa6/0x0
> [ 5750.738187] Modules linked in: null_blk(+) configfs nvme nvme_core sb_edac 
> x86_pkg_temp_therma]
> [ 5750.765509] CPU: 45 PID: 2480 Comm: kworker/45:1H Not tainted 4.17.0-rc6+ 
> #712
> [ 5750.774087] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 
> 11/09/2016
> [ 5750.783369] Workqueue: kblockd blk_mq_timeout_work
> [ 5750.789223] RIP: 0010:__blk_mq_complete_request+0xa6/0x110
> [ 5750.795850] RSP: 0018:883ffb417d68 EFLAGS: 00010202
> [ 5750.802187] RAX: 0003 RBX: 881ff100d800 RCX: 
> 
> [ 5750.810649] RDX: 88407fd9e040 RSI: 88407fd956b8 RDI: 
> 881ff100d800
> [ 5750.819119] RBP: e8d91800 R08:  R09: 
> 82066eb8
> [ 5750.827588] R10: 883ffa386138 R11: 883ffa385900 R12: 
> 0001
> [ 5750.836050] R13: 881fe7da6000 R14: 0020 R15: 
> 0002
> [ 5750.844529] FS:  () GS:88407fd8() 
> knlGS:
> [ 5750.854482] CS:  0010 DS:  ES:  CR0: 80050033
> [ 5750.861397] CR2: 7ffc92f97f68 CR3: 0201d005 CR4: 
> 003606e0
> [ 5750.869861] DR0:  DR1:  DR2: 
> 
> [ 5750.878333] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [ 5750.886805] Call Trace:
> [ 5750.890033]  bt_iter+0x42/0x50
> [ 5750.894000]  blk_mq_queue_tag_busy_iter+0x12b/0x220
> [ 5750.899941]  ? blk_mq_tag_to_rq+0x20/0x20
> [ 5750.904913]  ? __rcu_read_unlock+0x50/0x50
> [ 5750.909978]  ? blk_mq_tag_to_rq+0x20/0x20
> [ 5750.914948]  blk_mq_timeout_work+0x14b/0x240
> [ 5750.920220]  process_one_work+0x21b/0x510
> [ 5750.925197]  worker_thread+0x3a/0x390
> [ 5750.929781]  ? process_one_work+0x510/0x510
> [ 5750.934944]  kthread+0x11c/0x140
> [ 5750.939028]  ? kthread_create_worker_on_cpu+0x50/0x50
> [ 5750.945169]  ret_from_fork+0x1f/0x30
> [ 5750.949656] Code: 48 02 00 00 80 e6 80 74 29 8b 95 80 00 00 00 44 39 e2 75 
> 3b 48 89 df ff 90 2 
> [ 5750.972139] ---[ end trace 40065cb1764bf500 ]---
> 
> which is this:
> 
> WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);

That check looks wrong, since TIMED_OUT -> COMPLETE is also a valid
state transition. So that check should be:

WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE &&
 blk_mq_rq_state(rq) != MQ_RQ_TIMED_OUT);

-- 
Jens Axboe



Re: [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

2018-05-22 Thread Kees Cook
On Tue, May 22, 2018 at 11:50 AM, Martin K. Petersen
 wrote:
>
> Christoph,
>
>> On Tue, May 22, 2018 at 11:15:09AM -0700, Kees Cook wrote:
>>> Both SCSI and ATAPI share the sense header. In preparation for using the
>>> struct scsi_sense_hdr more widely, move this into a separate header and
>>> move the helper function to scsi_ioctl.c which is linked with CONFIG_IDE
>>> by way of CONFIG_BLK_SCSI_REQUEST.
>>
>> Please keep the code where it is and just depend on SCSI on the legacy
>> ide driver.  No need to do gymnastics just for a legacy case.
>
> Yup, I agree.

Oh, er, this was mainly done at Jens's request. Jens, can you advise?

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

2018-05-22 Thread Martin K. Petersen

Christoph,

> On Tue, May 22, 2018 at 11:15:09AM -0700, Kees Cook wrote:
>> Both SCSI and ATAPI share the sense header. In preparation for using the
>> struct scsi_sense_hdr more widely, move this into a separate header and
>> move the helper function to scsi_ioctl.c which is linked with CONFIG_IDE
>> by way of CONFIG_BLK_SCSI_REQUEST.
>
> Please keep the code where it is and just depend on SCSI on the legacy
> ide driver.  No need to do gymnastics just for a legacy case.

Yup, I agree.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Jens Axboe
On 5/22/18 11:17 AM, Jens Axboe wrote:
> On 5/22/18 10:44 AM, Jens Axboe wrote:
>>
>> On 5/22/18 10:25 AM, Bart Van Assche wrote:
>>> Recently the blk-mq timeout handling code was reworked. See also Tejun
>>> Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
>>> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
>>> This patch reworks the blk-mq timeout handling code again. The timeout
>>> handling code is simplified by introducing a state machine per request.
>>> This change avoids that the blk-mq timeout handling code ignores
>>> completions that occur after blk_mq_check_expired() has been called and
>>> before blk_mq_rq_timed_out() has been called.
>>
>> I'll take a look at this again, getting rid of cmpxchg64 makes me
>> much more comfortable.
> 
> FWIW, a quick pass on runtime testing works fine. As expected, it's
> more efficient than what's currently in the kernel, testing with both
> null_blk (1 and nr_cpus worth of queues), and nvme as well. A huge win
> is that we shrink the request size from 360 bytes to 312, and I did
> a small followup patch that brings that to 304. That's a 15% reduction,
> massive.

Ran into this, running block/014 from blktests:

[ 5744.949839] run blktests block/014 at 2018-05-22 12:41:25
[ 5750.723000] null: rq ff68f103 timed out
[ 5750.728181] WARNING: CPU: 45 PID: 2480 at block/blk-mq.c:585 
__blk_mq_complete_request+0xa6/0x0
[ 5750.738187] Modules linked in: null_blk(+) configfs nvme nvme_core sb_edac 
x86_pkg_temp_therma]
[ 5750.765509] CPU: 45 PID: 2480 Comm: kworker/45:1H Not tainted 4.17.0-rc6+ 
#712
[ 5750.774087] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 
11/09/2016
[ 5750.783369] Workqueue: kblockd blk_mq_timeout_work
[ 5750.789223] RIP: 0010:__blk_mq_complete_request+0xa6/0x110
[ 5750.795850] RSP: 0018:883ffb417d68 EFLAGS: 00010202
[ 5750.802187] RAX: 0003 RBX: 881ff100d800 RCX: 
[ 5750.810649] RDX: 88407fd9e040 RSI: 88407fd956b8 RDI: 881ff100d800
[ 5750.819119] RBP: e8d91800 R08:  R09: 82066eb8
[ 5750.827588] R10: 883ffa386138 R11: 883ffa385900 R12: 0001
[ 5750.836050] R13: 881fe7da6000 R14: 0020 R15: 0002
[ 5750.844529] FS:  () GS:88407fd8() 
knlGS:
[ 5750.854482] CS:  0010 DS:  ES:  CR0: 80050033
[ 5750.861397] CR2: 7ffc92f97f68 CR3: 0201d005 CR4: 003606e0
[ 5750.869861] DR0:  DR1:  DR2: 
[ 5750.878333] DR3:  DR6: fffe0ff0 DR7: 0400
[ 5750.886805] Call Trace:
[ 5750.890033]  bt_iter+0x42/0x50
[ 5750.894000]  blk_mq_queue_tag_busy_iter+0x12b/0x220
[ 5750.899941]  ? blk_mq_tag_to_rq+0x20/0x20
[ 5750.904913]  ? __rcu_read_unlock+0x50/0x50
[ 5750.909978]  ? blk_mq_tag_to_rq+0x20/0x20
[ 5750.914948]  blk_mq_timeout_work+0x14b/0x240
[ 5750.920220]  process_one_work+0x21b/0x510
[ 5750.925197]  worker_thread+0x3a/0x390
[ 5750.929781]  ? process_one_work+0x510/0x510
[ 5750.934944]  kthread+0x11c/0x140
[ 5750.939028]  ? kthread_create_worker_on_cpu+0x50/0x50
[ 5750.945169]  ret_from_fork+0x1f/0x30
[ 5750.949656] Code: 48 02 00 00 80 e6 80 74 29 8b 95 80 00 00 00 44 39 e2 75 
3b 48 89 df ff 90 2 
[ 5750.972139] ---[ end trace 40065cb1764bf500 ]---

which is this:

WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);

hitting when blk_mq_terminate_expired() completes the request through 
BLK_EH_HANDLED.

-- 
Jens Axboe



Re: [PATCH 1/6] ide-cd: Drop unused sense buffers

2018-05-22 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 2/6] scsi: cxlflash: Drop unused sense buffers

2018-05-22 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

2018-05-22 Thread Christoph Hellwig
On Tue, May 22, 2018 at 11:15:09AM -0700, Kees Cook wrote:
> Both SCSI and ATAPI share the sense header. In preparation for using the
> struct scsi_sense_hdr more widely, move this into a separate header and
> move the helper function to scsi_ioctl.c which is linked with CONFIG_IDE
> by way of CONFIG_BLK_SCSI_REQUEST.

Please keep the code where it is and just depend on SCSI on the legacy
ide driver.  No need to do gymnastics just for a legacy case.



Re: [PATCH] block: kyber: make kyber more friendly with merging

2018-05-22 Thread Holger Hoffstätte
On 05/22/18 19:46, Jens Axboe wrote:
> On 5/22/18 10:20 AM, Jens Axboe wrote:
>> On 5/22/18 10:17 AM, Holger Hoffstätte wrote:
>>> On 05/22/18 16:48, Jianchao Wang wrote:
 Currently, kyber is very unfriendly with merging. kyber depends
 on ctx rq_list to do merging; however, most of the time it will not
 leave any requests in ctx rq_list. This is because even if tokens
 of one domain are used up, kyber will try to dispatch requests
 from other domains and flush the rq_list there.

 To improve this, we setup kyber_ctx_queue (kcq) which is similar
 with ctx, but it has rq_lists for different domain and build same
 mapping between kcq and khd as the ctx & hctx. Then we could merge,
 insert and dispatch for different domains separately. If one domain
 token is used up, the requests could be left in the rq_list of
 that domain and maybe merged with following io.

 Following is my test result on machine with 8 cores and NVMe card
 INTEL SSDPEKKR128G7

 fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
 seq/random
 +------+----------+-----------+------------+-----------------+---------+
 |patch?| bw(MB/s) |   iops    | slat(usec) |   clat(usec)    |  merge  |
 +------+----------+-----------+------------+-----------------+---------+
 | w/o  |  606/612 | 151k/153k |  6.89/7.03 | 3349.21/3305.40 |   0/0   |
 +------+----------+-----------+------------+-----------------+---------+
 | w/   | 1083/616 | 277k/154k |  4.93/6.95 | 1830.62/3279.95 | 223k/3k |
 +------+----------+-----------+------------+-----------------+---------+
 When numjobs is set to 16, the bw and iops could reach 1662MB/s and 425k
 on my platform.

 Signed-off-by: Jianchao Wang 
>>>
>>> 
>>>
>>> This looks great but prevents kyber from being built as module,
>>> which is AFAIK supposed to work (and works now):
>>>
>>> ..
>>>   CC [M]  block/kyber-iosched.o
>>>   Building modules, stage 2.
>>>   MODPOST 313 modules
>>> ERROR: "bio_attempt_back_merge" [block/kyber-iosched.ko] undefined!
>>> ERROR: "bio_attempt_front_merge" [block/kyber-iosched.ko] undefined!
>>> ERROR: "bio_attempt_discard_merge" [block/kyber-iosched.ko] undefined!
>>> ERROR: "blk_try_merge" [block/kyber-iosched.ko] undefined!
>>> ERROR: "blk_rq_merge_ok" [block/kyber-iosched.ko] undefined!
>>> ..
>>>
>>> It does build fine when compiled in, obviously. :)
>>
>> It's basically duplicating the contents of blk_attempt_plug_merge().
>> I would suggest abstracting out the list loop and merge check
>> into a helper, that could then both be called from kyber and the
>> plug merge function.
> 
> See attached, prep patch and yours rebased on top of it.

That was quick. :)

Applies smoothly on top of my 4.16++ tree, now builds correctly as
module and is reproducibly (slightly) faster even on my pedestrian
SATA SSDs, now on par or occasionally even faster than mq-deadline.
What's not to like? So:

Tested-by: Holger Hoffstätte 

cheers,
Holger


Re: [PATCH v7 0/5] AIO add per-command iopriority

2018-05-22 Thread Jens Axboe
On 5/22/18 12:30 PM, Al Viro wrote:
> On Tue, May 22, 2018 at 11:55:04AM -0600, Jens Axboe wrote:
>> On 5/22/18 11:52 AM, adam.manzana...@wdc.com wrote:
>>> From: Adam Manzanares 
>>>
>>> This is the per-I/O equivalent of the ioprio_set system call.
>>> See the following link for performance implications on a SATA HDD:
>>> https://lkml.org/lkml/2016/12/6/495
>>>
>>> First patch factors ioprio_check_cap function out of ioprio_set system call 
>>> to
>>> also be used by the aio ioprio interface.
>>>
>>> Second patch converts kiocb ki_hint field to a u16 to avoid kiocb bloat.
>>>
>>> Third patch passes ioprio hint from aio iocb to kiocb and initializes kiocb
>>> ioprio value appropriately when it is not explicitly set.
>>>
>>> Fourth patch enables the feature for blkdev.
>>>
>>> Fifth patch enables the feature for iomap direct IO
>>
>> LGTM, you can add:
>>
>> Reviewed-by: Jens Axboe 
>>
>> Al, are you picking this series up, or should I?
> 
> Probably better if I do, once I finish reviewing Christoph's patchset -
> we already have a bunch of stuff around fs/aio.c in this cycle...

Alright, sounds good, thanks Al.

-- 
Jens Axboe



Re: [PATCH v7 0/5] AIO add per-command iopriority

2018-05-22 Thread Al Viro
On Tue, May 22, 2018 at 11:55:04AM -0600, Jens Axboe wrote:
> On 5/22/18 11:52 AM, adam.manzana...@wdc.com wrote:
> > From: Adam Manzanares 
> > 
> > This is the per-I/O equivalent of the ioprio_set system call.
> > See the following link for performance implications on a SATA HDD:
> > https://lkml.org/lkml/2016/12/6/495
> > 
> > First patch factors ioprio_check_cap function out of ioprio_set system call 
> > to
> > also be used by the aio ioprio interface.
> > 
> > Second patch converts kiocb ki_hint field to a u16 to avoid kiocb bloat.
> > 
> > Third patch passes ioprio hint from aio iocb to kiocb and initializes kiocb
> > ioprio value appropriately when it is not explicitly set.
> > 
> > Fourth patch enables the feature for blkdev.
> > 
> > Fifth patch enables the feature for iomap direct IO
> 
> LGTM, you can add:
> 
> Reviewed-by: Jens Axboe 
> 
> Al, are you picking this series up, or should I?

Probably better if I do, once I finish reviewing Christoph's patchset -
we already have a bunch of stuff around fs/aio.c in this cycle...
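
For reference, a userspace sketch of the per-iocb interface as posted; the
flag and field names follow this series, but the constants here are spelled
out by hand and should be checked against the final uapi headers:

	#include <stdint.h>
	#include <string.h>
	#include <linux/aio_abi.h>

	#define IOCB_FLAG_IOPRIO	(1 << 1)	/* added by this series */
	#define IOPRIO_CLASS_RT		1
	#define IOPRIO_CLASS_SHIFT	13
	#define IOPRIO_PRIO_VALUE(c, d)	(((c) << IOPRIO_CLASS_SHIFT) | (d))

	static void setup_rt_read(struct iocb *cb, int fd, void *buf, size_t len)
	{
		memset(cb, 0, sizeof(*cb));
		cb->aio_fildes = fd;
		cb->aio_lio_opcode = IOCB_CMD_PREAD;
		cb->aio_buf = (uint64_t)(uintptr_t)buf;
		cb->aio_nbytes = len;
		/* per-command priority instead of per-process ioprio_set() */
		cb->aio_rw_flags |= IOCB_FLAG_IOPRIO;
		cb->aio_reqprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0);
	}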


[PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

2018-05-22 Thread Kees Cook
Both SCSI and ATAPI share the sense header. In preparation for using the
struct scsi_sense_hdr more widely, move this into a separate header and
move the helper function to scsi_ioctl.c which is linked with CONFIG_IDE
by way of CONFIG_BLK_SCSI_REQUEST.

Signed-off-by: Kees Cook 
---
 block/scsi_ioctl.c | 64 ++
 drivers/scsi/scsi_common.c | 64 --
 include/scsi/scsi_cmnd.h   |  1 -
 include/scsi/scsi_common.h | 32 +--
 include/scsi/scsi_sense.h  | 44 ++
 5 files changed, 109 insertions(+), 96 deletions(-)
 create mode 100644 include/scsi/scsi_sense.h

diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index 60b471f8621b..87ff3cc7a364 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -728,6 +728,70 @@ void scsi_req_init(struct scsi_request *req)
 }
 EXPORT_SYMBOL(scsi_req_init);
 
+/**
+ * scsi_normalize_sense - normalize main elements from either fixed or
+ * descriptor sense data format into a common format.
+ *
+ * @sense_buffer:  byte array containing sense data returned by device
+ * @sb_len:number of valid bytes in sense_buffer
+ * @sshdr: pointer to instance of structure that common
+ * elements are written to.
+ *
+ * Notes:
+ * The "main elements" from sense data are: response_code, sense_key,
+ * asc, ascq and additional_length (only for descriptor format).
+ *
+ * Typically this function can be called after a device has
+ * responded to a SCSI command with the CHECK_CONDITION status.
+ *
+ * Return value:
+ * true if valid sense data information found, else false;
+ */
+bool scsi_normalize_sense(const u8 *sense_buffer, int sb_len,
+ struct scsi_sense_hdr *sshdr)
+{
+   memset(sshdr, 0, sizeof(struct scsi_sense_hdr));
+
+   if (!sense_buffer || !sb_len)
+   return false;
+
+   sshdr->response_code = (sense_buffer[0] & 0x7f);
+
+   if (!scsi_sense_valid(sshdr))
+   return false;
+
+   if (sshdr->response_code >= 0x72) {
+   /*
+* descriptor format
+*/
+   if (sb_len > 1)
+   sshdr->sense_key = (sense_buffer[1] & 0xf);
+   if (sb_len > 2)
+   sshdr->asc = sense_buffer[2];
+   if (sb_len > 3)
+   sshdr->ascq = sense_buffer[3];
+   if (sb_len > 7)
+   sshdr->additional_length = sense_buffer[7];
+   } else {
+   /*
+* fixed format
+*/
+   if (sb_len > 2)
+   sshdr->sense_key = (sense_buffer[2] & 0xf);
+   if (sb_len > 7) {
+   sb_len = (sb_len < (sense_buffer[7] + 8)) ?
+sb_len : (sense_buffer[7] + 8);
+   if (sb_len > 12)
+   sshdr->asc = sense_buffer[12];
+   if (sb_len > 13)
+   sshdr->ascq = sense_buffer[13];
+   }
+   }
+
+   return true;
+}
+EXPORT_SYMBOL(scsi_normalize_sense);
+
 static int __init blk_scsi_ioctl_init(void)
 {
blk_set_cmd_filter_defaults(&blk_default_cmd_filter);
diff --git a/drivers/scsi/scsi_common.c b/drivers/scsi/scsi_common.c
index 90349498f686..8621a07cb967 100644
--- a/drivers/scsi/scsi_common.c
+++ b/drivers/scsi/scsi_common.c
@@ -116,70 +116,6 @@ void int_to_scsilun(u64 lun, struct scsi_lun *scsilun)
 }
 EXPORT_SYMBOL(int_to_scsilun);
 
-/**
- * scsi_normalize_sense - normalize main elements from either fixed or
- * descriptor sense data format into a common format.
- *
- * @sense_buffer:  byte array containing sense data returned by device
- * @sb_len:number of valid bytes in sense_buffer
- * @sshdr: pointer to instance of structure that common
- * elements are written to.
- *
- * Notes:
- * The "main elements" from sense data are: response_code, sense_key,
- * asc, ascq and additional_length (only for descriptor format).
- *
- * Typically this function can be called after a device has
- * responded to a SCSI command with the CHECK_CONDITION status.
- *
- * Return value:
- * true if valid sense data information found, else false;
- */
-bool scsi_normalize_sense(const u8 *sense_buffer, int sb_len,
- struct scsi_sense_hdr *sshdr)
-{
-   memset(sshdr, 0, sizeof(struct scsi_sense_hdr));
-
-   if (!sense_buffer || !sb_len)
-   return false;
-
-   sshdr->response_code = (sense_buffer[0] & 0x7f);
-
-   if (!scsi_sense_valid(sshdr))
-   return false;
-
-   if (sshdr->response_code >= 0x72) {
-   /*
-* descriptor format
-*/
-   if 
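
For context, a typical caller of the moved helper looks something like the
following; this is a sketch for illustration, not code from the patch:

	static void report_sense(struct scsi_device *sdev,
				 const u8 *sense, int sb_len)
	{
		struct scsi_sense_hdr sshdr;

		if (scsi_normalize_sense(sense, sb_len, &sshdr))
			sdev_printk(KERN_INFO, sdev,
				    "sense: key 0x%x asc 0x%x ascq 0x%x\n",
				    sshdr.sense_key, sshdr.asc, sshdr.ascq);
	}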

[PATCH 2/6] scsi: cxlflash: Drop unused sense buffers

2018-05-22 Thread Kees Cook
This removes the unused sense buffer in read_cap16() and write_same16().

Signed-off-by: Kees Cook 
---
 drivers/scsi/cxlflash/superpipe.c | 8 ++--
 drivers/scsi/cxlflash/vlun.c  | 7 ++-
 2 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/drivers/scsi/cxlflash/superpipe.c 
b/drivers/scsi/cxlflash/superpipe.c
index 2fe79df5c73c..59b9f2023748 100644
--- a/drivers/scsi/cxlflash/superpipe.c
+++ b/drivers/scsi/cxlflash/superpipe.c
@@ -324,7 +324,6 @@ static int read_cap16(struct scsi_device *sdev, struct 
llun_info *lli)
struct scsi_sense_hdr sshdr;
u8 *cmd_buf = NULL;
u8 *scsi_cmd = NULL;
-   u8 *sense_buf = NULL;
int rc = 0;
int result = 0;
int retry_cnt = 0;
@@ -333,8 +332,7 @@ static int read_cap16(struct scsi_device *sdev, struct 
llun_info *lli)
 retry:
cmd_buf = kzalloc(CMD_BUFSIZE, GFP_KERNEL);
scsi_cmd = kzalloc(MAX_COMMAND_SIZE, GFP_KERNEL);
-   sense_buf = kzalloc(SCSI_SENSE_BUFFERSIZE, GFP_KERNEL);
-   if (unlikely(!cmd_buf || !scsi_cmd || !sense_buf)) {
+   if (unlikely(!cmd_buf || !scsi_cmd)) {
rc = -ENOMEM;
goto out;
}
@@ -349,7 +347,7 @@ static int read_cap16(struct scsi_device *sdev, struct 
llun_info *lli)
/* Drop the ioctl read semahpore across lengthy call */
	up_read(&cfg->ioctl_rwsem);
	result = scsi_execute(sdev, scsi_cmd, DMA_FROM_DEVICE, cmd_buf,
-			      CMD_BUFSIZE, sense_buf, &sshdr, to, CMD_RETRIES,
+			      CMD_BUFSIZE, NULL, &sshdr, to, CMD_RETRIES,
			      0, 0, NULL);
	down_read(&cfg->ioctl_rwsem);
rc = check_state(cfg);
@@ -380,7 +378,6 @@ static int read_cap16(struct scsi_device *sdev, struct 
llun_info *lli)
if (retry_cnt++ < 1) {
kfree(cmd_buf);
kfree(scsi_cmd);
-   kfree(sense_buf);
goto retry;
}
}
@@ -411,7 +408,6 @@ static int read_cap16(struct scsi_device *sdev, struct 
llun_info *lli)
 out:
kfree(cmd_buf);
kfree(scsi_cmd);
-   kfree(sense_buf);
 
dev_dbg(dev, "%s: maxlba=%lld blklen=%d rc=%d\n",
__func__, gli->max_lba, gli->blk_len, rc);
diff --git a/drivers/scsi/cxlflash/vlun.c b/drivers/scsi/cxlflash/vlun.c
index 5deef57a7834..e7e9b2f2ad21 100644
--- a/drivers/scsi/cxlflash/vlun.c
+++ b/drivers/scsi/cxlflash/vlun.c
@@ -425,7 +425,6 @@ static int write_same16(struct scsi_device *sdev,
 {
u8 *cmd_buf = NULL;
u8 *scsi_cmd = NULL;
-   u8 *sense_buf = NULL;
int rc = 0;
int result = 0;
u64 offset = lba;
@@ -439,8 +438,7 @@ static int write_same16(struct scsi_device *sdev,
 
cmd_buf = kzalloc(CMD_BUFSIZE, GFP_KERNEL);
scsi_cmd = kzalloc(MAX_COMMAND_SIZE, GFP_KERNEL);
-   sense_buf = kzalloc(SCSI_SENSE_BUFFERSIZE, GFP_KERNEL);
-   if (unlikely(!cmd_buf || !scsi_cmd || !sense_buf)) {
+   if (unlikely(!cmd_buf || !scsi_cmd)) {
rc = -ENOMEM;
goto out;
}
@@ -456,7 +454,7 @@ static int write_same16(struct scsi_device *sdev,
/* Drop the ioctl read semahpore across lengthy call */
	up_read(&cfg->ioctl_rwsem);
result = scsi_execute(sdev, scsi_cmd, DMA_TO_DEVICE, cmd_buf,
- CMD_BUFSIZE, sense_buf, NULL, to,
+ CMD_BUFSIZE, NULL, NULL, to,
  CMD_RETRIES, 0, 0, NULL);
	down_read(&cfg->ioctl_rwsem);
rc = check_state(cfg);
@@ -481,7 +479,6 @@ static int write_same16(struct scsi_device *sdev,
 out:
kfree(cmd_buf);
kfree(scsi_cmd);
-   kfree(sense_buf);
dev_dbg(dev, "%s: returning rc=%d\n", __func__, rc);
return rc;
 }
-- 
2.17.0



[PATCH 1/6] ide-cd: Drop unused sense buffers

2018-05-22 Thread Kees Cook
This drops unused sense buffers from:

cdrom_eject()
cdrom_read_capacity()
cdrom_read_tocentry()
ide_cd_lockdoor()
ide_cd_read_toc()

Signed-off-by: Kees Cook 
---
 drivers/ide/ide-cd.c   | 36 +++-
 drivers/ide/ide-cd.h   |  2 +-
 drivers/ide/ide-cd_ioctl.c | 34 --
 3 files changed, 28 insertions(+), 44 deletions(-)

diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index 5a8e8e3c22cd..1d5b35312e33 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -890,8 +890,7 @@ int cdrom_check_status(ide_drive_t *drive, struct 
request_sense *sense)
 }
 
 static int cdrom_read_capacity(ide_drive_t *drive, unsigned long *capacity,
-  unsigned long *sectors_per_frame,
-  struct request_sense *sense)
+  unsigned long *sectors_per_frame)
 {
struct {
__be32 lba;
@@ -908,7 +907,7 @@ static int cdrom_read_capacity(ide_drive_t *drive, unsigned 
long *capacity,
memset(cmd, 0, BLK_MAX_CDB);
cmd[0] = GPCMD_READ_CDVD_CAPACITY;
 
-	stat = ide_cd_queue_pc(drive, cmd, 0, &capbuf, &len, sense, 0,
+	stat = ide_cd_queue_pc(drive, cmd, 0, &capbuf, &len, NULL, 0,
   RQF_QUIET);
if (stat)
return stat;
@@ -944,8 +943,7 @@ static int cdrom_read_capacity(ide_drive_t *drive, unsigned 
long *capacity,
 }
 
 static int cdrom_read_tocentry(ide_drive_t *drive, int trackno, int msf_flag,
-   int format, char *buf, int buflen,
-   struct request_sense *sense)
+   int format, char *buf, int buflen)
 {
unsigned char cmd[BLK_MAX_CDB];
 
@@ -962,11 +960,11 @@ static int cdrom_read_tocentry(ide_drive_t *drive, int 
trackno, int msf_flag,
if (msf_flag)
cmd[1] = 2;
 
-	return ide_cd_queue_pc(drive, cmd, 0, buf, &buflen, sense, 0,
-			       RQF_QUIET);
+	return ide_cd_queue_pc(drive, cmd, 0, buf, &buflen, NULL, 0, RQF_QUIET);
 }
 
 /* Try to read the entire TOC for the disk into our internal buffer. */
-int ide_cd_read_toc(ide_drive_t *drive, struct request_sense *sense)
+int ide_cd_read_toc(ide_drive_t *drive)
 {
int stat, ntracks, i;
struct cdrom_info *info = drive->driver_data;
@@ -996,14 +994,13 @@ int ide_cd_read_toc(ide_drive_t *drive, struct 
request_sense *sense)
 * Check to see if the existing data is still valid. If it is,
 * just return.
 */
-   (void) cdrom_check_status(drive, sense);
+   (void) cdrom_check_status(drive, NULL);
 
if (drive->atapi_flags & IDE_AFLAG_TOC_VALID)
return 0;
 
/* try to get the total cdrom capacity and sector size */
-	stat = cdrom_read_capacity(drive, &toc->capacity, &sectors_per_frame,
-				   sense);
+	stat = cdrom_read_capacity(drive, &toc->capacity, &sectors_per_frame);
if (stat)
toc->capacity = 0x1f;
 
@@ -1016,7 +1013,7 @@ int ide_cd_read_toc(ide_drive_t *drive, struct 
request_sense *sense)
 
/* first read just the header, so we know how long the TOC is */
	stat = cdrom_read_tocentry(drive, 0, 1, 0, (char *) &toc->hdr,
-   sizeof(struct atapi_toc_header), sense);
+   sizeof(struct atapi_toc_header));
if (stat)
return stat;
 
@@ -1036,7 +1033,7 @@ int ide_cd_read_toc(ide_drive_t *drive, struct 
request_sense *sense)
				   (char *)&toc->hdr,
   sizeof(struct atapi_toc_header) +
   (ntracks + 1) *
-  sizeof(struct atapi_toc_entry), sense);
+  sizeof(struct atapi_toc_entry));
 
if (stat && toc->hdr.first_track > 1) {
/*
@@ -1056,8 +1053,7 @@ int ide_cd_read_toc(ide_drive_t *drive, struct 
request_sense *sense)
				   (char *)&toc->hdr,
   sizeof(struct atapi_toc_header) +
   (ntracks + 1) *
-  sizeof(struct atapi_toc_entry),
-  sense);
+  sizeof(struct atapi_toc_entry));
if (stat)
return stat;
 
@@ -1094,7 +1090,7 @@ int ide_cd_read_toc(ide_drive_t *drive, struct 
request_sense *sense)
if (toc->hdr.first_track != CDROM_LEADOUT) {
/* read the multisession information */
		stat = cdrom_read_tocentry(drive, 0, 0, 1, (char *)&ms_tmp,
-  sizeof(ms_tmp), sense);
+  sizeof(ms_tmp));
if (stat)
return 

[PATCH 0/6] block: Consolidate scsi sense buffer usage

2018-05-22 Thread Kees Cook
This is a follow-up to commit f7068114d45e ("sr: pass down correctly
sized SCSI sense buffer") which further cleans up and removes needless
sense character array buffers and "struct request_sense" usage in favor
of the common "struct scsi_sense_hdr".

First, drop a bunch of unused sense buffers:
 [PATCH 1/6] ide-cd: Drop unused sense buffers
 [PATCH 2/6] scsi: cxlflash: Drop unused sense buffers

Next, split out struct scsi_sense_hdr:
 [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

Then move all request_sense usage to scsi_sense_hdr:
 [PATCH 4/6] block: Consolidate scsi sense buffer usage

Finally add a build-time check to make sure we don't pass bad buffer sizes:
 [PATCH 5/6] libata-scsi: Move sense buffers onto stack
 [PATCH 6/6] scsi: Check sense buffer size at build time

-Kees



[PATCH 6/6] scsi: Check sense buffer size at build time

2018-05-22 Thread Kees Cook
To avoid introducing problems like those fixed in commit f7068114d45e
("sr: pass down correctly sized SCSI sense buffer"), this creates a macro
wrapper for scsi_execute() that verifies the size of the sense buffer
similar to what was done for command string sizes in commit 3756f6401c30
("exec: avoid gcc-8 warning for get_task_comm").

Another solution could be to add another argument to scsi_execute(),
but this function already takes a lot of arguments and Jens was not fond
of that approach. As there was only a pair of dynamically allocated sense
buffers, this also moves those 96 bytes onto the stack to avoid triggering
the sizeof() check.

Signed-off-by: Kees Cook 
---
 drivers/scsi/scsi_lib.c|  6 +++---
 include/scsi/scsi_device.h | 12 +++-
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index e9b4f279d29c..718c2bec4516 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -238,7 +238,7 @@ void scsi_queue_insert(struct scsi_cmnd *cmd, int reason)
 
 
 /**
- * scsi_execute - insert request and wait for the result
+ * __scsi_execute - insert request and wait for the result
  * @sdev:  scsi device
  * @cmd:   scsi command
  * @data_direction: data direction
@@ -255,7 +255,7 @@ void scsi_queue_insert(struct scsi_cmnd *cmd, int reason)
  * Returns the scsi_cmnd result field if a command was executed, or a negative
  * Linux error code if we didn't get that far.
  */
-int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
+int __scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 int data_direction, void *buffer, unsigned bufflen,
 unsigned char *sense, struct scsi_sense_hdr *sshdr,
 int timeout, int retries, u64 flags, req_flags_t rq_flags,
@@ -309,7 +309,7 @@ int scsi_execute(struct scsi_device *sdev, const unsigned 
char *cmd,
 
return ret;
 }
-EXPORT_SYMBOL(scsi_execute);
+EXPORT_SYMBOL(__scsi_execute);
 
 /*
  * Function:scsi_init_cmd_errh()
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 7ae177c8e399..1bb87b6c0ad2 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -426,11 +426,21 @@ extern const char *scsi_device_state_name(enum 
scsi_device_state);
 extern int scsi_is_sdev_device(const struct device *);
 extern int scsi_is_target_device(const struct device *);
 extern void scsi_sanitize_inquiry_string(unsigned char *s, int len);
-extern int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
+extern int __scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
int data_direction, void *buffer, unsigned bufflen,
unsigned char *sense, struct scsi_sense_hdr *sshdr,
int timeout, int retries, u64 flags,
req_flags_t rq_flags, int *resid);
+/* Make sure any sense buffer is the correct size. */
+#define scsi_execute(sdev, cmd, data_direction, buffer, bufflen, sense,	\
+		     sshdr, timeout, retries, flags, rq_flags, resid)		\
+({										\
+	BUILD_BUG_ON((sense) != NULL &&						\
+		     sizeof(sense) != SCSI_SENSE_BUFFERSIZE);			\
+   __scsi_execute(sdev, cmd, data_direction, buffer, bufflen,  \
+  sense, sshdr, timeout, retries, flags, rq_flags, \
+  resid);  \
+})
 static inline int scsi_execute_req(struct scsi_device *sdev,
const unsigned char *cmd, int data_direction, void *buffer,
unsigned bufflen, struct scsi_sense_hdr *sshdr, int timeout,
-- 
2.17.0
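
To see why the sizeof() check in the wrapper works, here is a standalone
illustration in plain C (a sketch, not from the patch): sizeof() reports
the full array size only while the buffer is still an array in the
caller's scope; once it decays to a pointer, sizeof() yields the pointer
size, which is why patch 5/6 first moves the dynamically allocated sense
buffers onto the stack.

	#include <assert.h>

	#define SENSE_SZ 96

	static void sketch(void)
	{
		unsigned char good[SENSE_SZ];
		unsigned char *heap = 0;	/* stand-in for a kzalloc'd buffer */

		static_assert(sizeof(good) == SENSE_SZ, "array: full size visible");
		/*
		 * sizeof(heap) would be 4 or 8 here, so an equivalent
		 * compile-time check on a pointer argument refuses to build.
		 */
		(void)heap;
	}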



[PATCH 5/6] libata-scsi: Move sense buffers onto stack

2018-05-22 Thread Kees Cook
Instead of dynamically allocating the sense buffers, put them on the
stack so that future compile-time sizeof() checks will be able to see
their buffer length.

Signed-off-by: Kees Cook 
---
 drivers/ata/libata-scsi.c | 18 ++
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 89a9d4a2efc8..3a43d3a1ce2d 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -597,8 +597,9 @@ static int ata_get_identity(struct ata_port *ap, struct 
scsi_device *sdev,
 int ata_cmd_ioctl(struct scsi_device *scsidev, void __user *arg)
 {
int rc = 0;
+   u8 sensebuf[SCSI_SENSE_BUFFERSIZE];
u8 scsi_cmd[MAX_COMMAND_SIZE];
-   u8 args[4], *argbuf = NULL, *sensebuf = NULL;
+   u8 args[4], *argbuf = NULL;
int argsize = 0;
enum dma_data_direction data_dir;
struct scsi_sense_hdr sshdr;
@@ -610,10 +611,7 @@ int ata_cmd_ioctl(struct scsi_device *scsidev, void __user 
*arg)
if (copy_from_user(args, arg, sizeof(args)))
return -EFAULT;
 
-   sensebuf = kzalloc(SCSI_SENSE_BUFFERSIZE, GFP_NOIO);
-   if (!sensebuf)
-   return -ENOMEM;
-
+   memset(sensebuf, 0, sizeof(sensebuf));
memset(scsi_cmd, 0, sizeof(scsi_cmd));
 
if (args[3]) {
@@ -685,7 +683,6 @@ int ata_cmd_ioctl(struct scsi_device *scsidev, void __user 
*arg)
 && copy_to_user(arg + sizeof(args), argbuf, argsize))
rc = -EFAULT;
 error:
-   kfree(sensebuf);
kfree(argbuf);
return rc;
 }
@@ -704,8 +701,9 @@ int ata_cmd_ioctl(struct scsi_device *scsidev, void __user 
*arg)
 int ata_task_ioctl(struct scsi_device *scsidev, void __user *arg)
 {
int rc = 0;
+   u8 sensebuf[SCSI_SENSE_BUFFERSIZE];
u8 scsi_cmd[MAX_COMMAND_SIZE];
-   u8 args[7], *sensebuf = NULL;
+   u8 args[7];
struct scsi_sense_hdr sshdr;
int cmd_result;
 
@@ -715,10 +713,7 @@ int ata_task_ioctl(struct scsi_device *scsidev, void 
__user *arg)
if (copy_from_user(args, arg, sizeof(args)))
return -EFAULT;
 
-   sensebuf = kzalloc(SCSI_SENSE_BUFFERSIZE, GFP_NOIO);
-   if (!sensebuf)
-   return -ENOMEM;
-
+   memset(sensebuf, 0, sizeof(sensebuf));
memset(scsi_cmd, 0, sizeof(scsi_cmd));
scsi_cmd[0]  = ATA_16;
scsi_cmd[1]  = (3 << 1); /* Non-data */
@@ -769,7 +764,6 @@ int ata_task_ioctl(struct scsi_device *scsidev, void __user 
*arg)
}
 
  error:
-   kfree(sensebuf);
return rc;
 }
 
-- 
2.17.0



[PATCH 4/6] block: Consolidate scsi sense buffer usage

2018-05-22 Thread Kees Cook
There is a lot of needless struct request_sense usage in the CDROM
code. These can all be struct scsi_sense_hdr instead, to avoid any
confusion over their respective structure sizes. This patch is a lot
of noise changing "sense" to "sshdr", but the final code is more
readable to distinguish between "sense" meaning "struct request_sense"
and "sshdr" meaning "struct scsi_sense_hdr".

Signed-off-by: Kees Cook 
---
 drivers/block/pktcdvd.c| 36 ++--
 drivers/cdrom/cdrom.c  | 22 +++---
 drivers/ide/ide-cd.c   | 11 ++-
 drivers/ide/ide-cd.h   |  5 +++--
 drivers/ide/ide-cd_ioctl.c | 30 +++---
 drivers/scsi/sr_ioctl.c| 22 +-
 include/linux/cdrom.h  |  3 ++-
 7 files changed, 64 insertions(+), 65 deletions(-)

diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index c61d20c9f3f8..f91c9f85e29d 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -748,13 +748,13 @@ static const char *sense_key_string(__u8 index)
 static void pkt_dump_sense(struct pktcdvd_device *pd,
   struct packet_command *cgc)
 {
-   struct request_sense *sense = cgc->sense;
+   struct scsi_sense_hdr *sshdr = cgc->sshdr;
 
-   if (sense)
+   if (sshdr)
pkt_err(pd, "%*ph - sense %02x.%02x.%02x (%s)\n",
CDROM_PACKET_SIZE, cgc->cmd,
-   sense->sense_key, sense->asc, sense->ascq,
-   sense_key_string(sense->sense_key));
+   sshdr->sense_key, sshdr->asc, sshdr->ascq,
+   sense_key_string(sshdr->sense_key));
else
pkt_err(pd, "%*ph - no sense\n", CDROM_PACKET_SIZE, cgc->cmd);
 }
@@ -787,11 +787,11 @@ static noinline_for_stack int pkt_set_speed(struct 
pktcdvd_device *pd,
unsigned write_speed, unsigned read_speed)
 {
struct packet_command cgc;
-   struct request_sense sense;
+   struct scsi_sense_hdr sshdr;
int ret;
 
	init_cdrom_command(&cgc, NULL, 0, CGC_DATA_NONE);
-	cgc.sense = &sense;
+	cgc.sshdr = &sshdr;
cgc.cmd[0] = GPCMD_SET_SPEED;
cgc.cmd[2] = (read_speed >> 8) & 0xff;
cgc.cmd[3] = read_speed & 0xff;
@@ -1645,7 +1645,7 @@ static noinline_for_stack int pkt_get_last_written(struct 
pktcdvd_device *pd,
 static noinline_for_stack int pkt_set_write_settings(struct pktcdvd_device *pd)
 {
struct packet_command cgc;
-   struct request_sense sense;
+   struct scsi_sense_hdr sshdr;
write_param_page *wp;
char buffer[128];
int ret, size;
@@ -1656,7 +1656,7 @@ static noinline_for_stack int 
pkt_set_write_settings(struct pktcdvd_device *pd)
 
memset(buffer, 0, sizeof(buffer));
	init_cdrom_command(&cgc, buffer, sizeof(*wp), CGC_DATA_READ);
-	cgc.sense = &sense;
+	cgc.sshdr = &sshdr;
	if ((ret = pkt_mode_sense(pd, &cgc, GPMODE_WRITE_PARMS_PAGE, 0))) {
		pkt_dump_sense(pd, &cgc);
return ret;
@@ -1671,7 +1671,7 @@ static noinline_for_stack int 
pkt_set_write_settings(struct pktcdvd_device *pd)
 * now get it all
 */
	init_cdrom_command(&cgc, buffer, size, CGC_DATA_READ);
-	cgc.sense = &sense;
+	cgc.sshdr = &sshdr;
	if ((ret = pkt_mode_sense(pd, &cgc, GPMODE_WRITE_PARMS_PAGE, 0))) {
		pkt_dump_sense(pd, &cgc);
return ret;
@@ -1905,12 +1905,12 @@ static noinline_for_stack int pkt_write_caching(struct 
pktcdvd_device *pd,
int set)
 {
struct packet_command cgc;
-   struct request_sense sense;
+   struct scsi_sense_hdr sshdr;
unsigned char buf[64];
int ret;
 
	init_cdrom_command(&cgc, buf, sizeof(buf), CGC_DATA_READ);
-	cgc.sense = &sense;
+	cgc.sshdr = &sshdr;
cgc.buflen = pd->mode_offset + 12;
 
/*
@@ -1950,14 +1950,14 @@ static noinline_for_stack int pkt_get_max_speed(struct 
pktcdvd_device *pd,
unsigned *write_speed)
 {
struct packet_command cgc;
-   struct request_sense sense;
+   struct scsi_sense_hdr sshdr;
unsigned char buf[256+18];
unsigned char *cap_buf;
int ret, offset;
 
	cap_buf = &buf[sizeof(struct mode_page_header) + pd->mode_offset];
	init_cdrom_command(&cgc, buf, sizeof(buf), CGC_DATA_UNKNOWN);
-	cgc.sense = &sense;
+	cgc.sshdr = &sshdr;
 
	ret = pkt_mode_sense(pd, &cgc, GPMODE_CAPABILITIES_PAGE, 0);
if (ret) {
@@ -2011,13 +2011,13 @@ static noinline_for_stack int pkt_media_speed(struct 
pktcdvd_device *pd,
unsigned *speed)
 {
struct packet_command cgc;
-   struct request_sense sense;
+   struct scsi_sense_hdr sshdr;
unsigned char buf[64];
unsigned int size, st, sp;
int ret;
 
	init_cdrom_command(&cgc, buf, 2, 

Re: [PATCH] blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffers

2018-05-22 Thread Jens Axboe
On 5/22/18 9:27 AM, Bart Van Assche wrote:
> Avoid that complaints similar to the following appear in the kernel log
> if the number of zones is sufficiently large:
> 
>   fio: page allocation failure: order:9, 
> mode:0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
>   Call Trace:
>   dump_stack+0x63/0x88
>   warn_alloc+0xf5/0x190
>   __alloc_pages_slowpath+0x8f0/0xb0d
>   __alloc_pages_nodemask+0x242/0x260
>   alloc_pages_current+0x6a/0xb0
>   kmalloc_order+0x18/0x50
>   kmalloc_order_trace+0x26/0xb0
>   __kmalloc+0x20e/0x220
>   blkdev_report_zones_ioctl+0xa5/0x1a0
>   blkdev_ioctl+0x1ba/0x930
>   block_ioctl+0x41/0x50
>   do_vfs_ioctl+0xaa/0x610
>   SyS_ioctl+0x79/0x90
>   do_syscall_64+0x79/0x1b0
>   entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Applied, thanks.

-- 
Jens Axboe
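
The order:9 failure above comes from asking kmalloc() for a large
physically contiguous buffer. A sketch of the usual remedy (the patch
body is not quoted here, so the variable names are illustrative):
allocate with kvmalloc(), which tries kmalloc() first and transparently
falls back to vmalloc():

	struct blk_zone *zones;

	zones = kvmalloc(nr_zones * sizeof(*zones), GFP_KERNEL | __GFP_ZERO);
	if (!zones)
		return -ENOMEM;
	/* ... fill in the zone report and copy it to user space ... */
	kvfree(zones);	/* correct for kmalloc()- and vmalloc()-backed memory */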



Re: [PATCH v7 0/5] AIO add per-command iopriority

2018-05-22 Thread Jens Axboe
On 5/22/18 11:52 AM, adam.manzana...@wdc.com wrote:
> From: Adam Manzanares 
> 
> This is the per-I/O equivalent of the ioprio_set system call.
> See the following link for performance implications on a SATA HDD:
> https://lkml.org/lkml/2016/12/6/495
> 
> First patch factors ioprio_check_cap function out of ioprio_set system call to
> also be used by the aio ioprio interface.
> 
> Second patch converts kiocb ki_hint field to a u16 to avoid kiocb bloat.
> 
> Third patch passes ioprio hint from aio iocb to kiocb and initializes kiocb
> ioprio value appropriately when it is not explicitly set.
> 
> Fourth patch enables the feature for blkdev.
> 
> Fifth patch enables the feature for iomap direct IO

LGTM, you can add:

Reviewed-by: Jens Axboe 

Al, are you picking this series up, or should I?

-- 
Jens Axboe



[PATCH v7 5/5] fs: iomap dio set bio prio from kiocb prio

2018-05-22 Thread adam . manzanares
From: Adam Manzanares 

Now that kiocb has an ioprio field copy this over to the bio when it is
created from the kiocb during direct IO.

Signed-off-by: Adam Manzanares 
Reviewed-by: Jeff Moyer 
Reviewed-by: Christoph Hellwig 
---
 fs/iomap.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index afd163586aa0..65aae194aeca 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -919,6 +919,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t 
length,
bio->bi_iter.bi_sector =
(iomap->addr + pos - iomap->offset) >> 9;
bio->bi_write_hint = dio->iocb->ki_hint;
+   bio->bi_ioprio = dio->iocb->ki_ioprio;
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;
 
-- 
2.15.1



[PATCH v7 2/5] fs: Convert kiocb rw_hint from enum to u16

2018-05-22 Thread adam . manzanares
From: Adam Manzanares 

In order to avoid kiocb bloat for per command iopriority support, rw_hint
is converted from enum to a u16. Added a guard around ki_hint assignment.

Signed-off-by: Adam Manzanares 
---
 fs/aio.c   |  2 +-
 include/linux/fs.h | 13 +++--
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 755d3f57bcc8..87d8939bb1e4 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1450,7 +1450,7 @@ static int aio_prep_rw(struct kiocb *req, struct iocb 
*iocb)
req->ki_flags = iocb_flags(req->ki_filp);
if (iocb->aio_flags & IOCB_FLAG_RESFD)
req->ki_flags |= IOCB_EVENTFD;
-   req->ki_hint = file_write_hint(req->ki_filp);
+   req->ki_hint = ki_hint_validate(file_write_hint(req->ki_filp));
ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags);
if (unlikely(ret))
fput(req->ki_filp);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7f07977bdfd7..9b76ee73af14 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -299,7 +299,7 @@ struct kiocb {
void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
void*private;
int ki_flags;
-   enum rw_hintki_hint;
+   u16 ki_hint;
 } __randomize_layout;
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
@@ -1927,12 +1927,21 @@ static inline enum rw_hint file_write_hint(struct file 
*file)
 
 static inline int iocb_flags(struct file *file);
 
+static inline u16 ki_hint_validate(enum rw_hint hint)
+{
+   typeof(((struct kiocb *)0)->ki_hint) max_hint = -1;
+
+   if (hint <= max_hint)
+   return hint;
+   return 0;
+}
+
 static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
 {
*kiocb = (struct kiocb) {
.ki_filp = filp,
.ki_flags = iocb_flags(filp),
-   .ki_hint = file_write_hint(filp),
+   .ki_hint = ki_hint_validate(file_write_hint(filp)),
};
 }
 
-- 
2.15.1



[PATCH v7 3/5] fs: Add aio iopriority support

2018-05-22 Thread adam . manzanares
From: Adam Manzanares 

This is the per-I/O equivalent of the ioprio_set system call.

When IOCB_FLAG_IOPRIO is set on the iocb aio_flags field, then we set the
newly added kiocb ki_ioprio field to the value in the iocb aio_reqprio field.

This patch depends on block: add ioprio_check_cap function.

Signed-off-by: Adam Manzanares 
Reviewed-by: Jeff Moyer 
Reviewed-by: Christoph Hellwig 
---
 drivers/block/loop.c |  3 +++
 fs/aio.c | 16 
 include/linux/fs.h   |  3 +++
 include/uapi/linux/aio_abi.h |  1 +
 4 files changed, 23 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5d4e31655d96..dd98dfd97f5e 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -76,6 +76,8 @@
 #include 
 #include 
 #include 
+#include 
+
 #include "loop.h"
 
 #include 
@@ -559,6 +561,7 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
cmd->iocb.ki_filp = file;
cmd->iocb.ki_complete = lo_rw_aio_complete;
cmd->iocb.ki_flags = IOCB_DIRECT;
+   cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
if (cmd->css)
kthread_associate_blkcg(cmd->css);
 
diff --git a/fs/aio.c b/fs/aio.c
index 87d8939bb1e4..76a9d4c14e55 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1451,6 +1451,22 @@ static int aio_prep_rw(struct kiocb *req, struct iocb 
*iocb)
if (iocb->aio_flags & IOCB_FLAG_RESFD)
req->ki_flags |= IOCB_EVENTFD;
req->ki_hint = ki_hint_validate(file_write_hint(req->ki_filp));
+   if (iocb->aio_flags & IOCB_FLAG_IOPRIO) {
+   /*
+* If the IOCB_FLAG_IOPRIO flag of aio_flags is set, then
+* aio_reqprio is interpreted as an I/O scheduling
+* class and priority.
+*/
+   ret = ioprio_check_cap(iocb->aio_reqprio);
+   if (ret) {
+   pr_debug("aio ioprio check cap error\n");
+   return -EINVAL;
+   }
+
+   req->ki_ioprio = iocb->aio_reqprio;
+   } else
+   req->ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
+
ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags);
if (unlikely(ret))
fput(req->ki_filp);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9b76ee73af14..0c61d5987879 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -300,6 +301,7 @@ struct kiocb {
void*private;
int ki_flags;
u16 ki_hint;
+   u16 ki_ioprio; /* See linux/ioprio.h */
 } __randomize_layout;
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
@@ -1942,6 +1944,7 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, 
struct file *filp)
.ki_filp = filp,
.ki_flags = iocb_flags(filp),
.ki_hint = ki_hint_validate(file_write_hint(filp)),
+   .ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0),
};
 }
 
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index 2c0a3415beee..d4e768d55d14 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -55,6 +55,7 @@ enum {
  *   is valid.
  */
 #define IOCB_FLAG_RESFD(1 << 0)
+#define IOCB_FLAG_IOPRIO   (1 << 1)
 
 /* read() from /dev/aio returns these structures. */
 struct io_event {
-- 
2.15.1
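
For illustration, a hypothetical userspace submission using the new flag
(a sketch via the raw io_submit(2) interface; error handling is elided,
and IOPRIO_PRIO_VALUE()/IOPRIO_CLASS_BE mirror the kernel's
linux/ioprio.h, which is not a uapi header):

	struct iocb cb = { 0 };
	struct iocb *cbs[1] = { &cb };

	cb.aio_fildes     = fd;			/* assumed open fd */
	cb.aio_lio_opcode = IOCB_CMD_PREAD;
	cb.aio_buf        = (__u64)(unsigned long)buf;
	cb.aio_nbytes     = 4096;
	cb.aio_flags      = IOCB_FLAG_IOPRIO;
	cb.aio_reqprio    = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 0);

	io_submit(ctx, 1, cbs);			/* ctx from a prior io_setup() */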



[PATCH v7 4/5] fs: blkdev set bio prio from kiocb prio

2018-05-22 Thread adam . manzanares
From: Adam Manzanares 

Now that kiocb has an ioprio field copy this over to the bio when it is
created from the kiocb.

Signed-off-by: Adam Manzanares 
Reviewed-by: Jeff Moyer 
---
 fs/block_dev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 7ec920e27065..11ba99e79d2a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -216,6 +216,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct 
iov_iter *iter,
bio.bi_write_hint = iocb->ki_hint;
bio.bi_private = current;
bio.bi_end_io = blkdev_bio_end_io_simple;
+   bio.bi_ioprio = iocb->ki_ioprio;
 
	ret = bio_iov_iter_get_pages(&bio, iter);
if (unlikely(ret))
@@ -355,6 +356,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
*iter, int nr_pages)
bio->bi_write_hint = iocb->ki_hint;
bio->bi_private = dio;
bio->bi_end_io = blkdev_bio_end_io;
+   bio->bi_ioprio = iocb->ki_ioprio;
 
ret = bio_iov_iter_get_pages(bio, iter);
if (unlikely(ret)) {
-- 
2.15.1



[PATCH v7 0/5] AIO add per-command iopriority

2018-05-22 Thread adam . manzanares
From: Adam Manzanares 

This is the per-I/O equivalent of the ioprio_set system call.
See the following link for performance implications on a SATA HDD:
https://lkml.org/lkml/2016/12/6/495

First patch factors ioprio_check_cap function out of ioprio_set system call to
also be used by the aio ioprio interface.

Second patch converts kiocb ki_hint field to a u16 to avoid kiocb bloat.

Third patch passes ioprio hint from aio iocb to kiocb and initializes kiocb
ioprio value appropriately when it is not explicitly set.

Fourth patch enables the feature for blkdev.

Fifth patch enables the feature for iomap direct IO

Note: this work is based on top of linux-vfs/for-next

v2: merge patches
use IOCB_FLAG_IOPRIO
validate intended use with IOCB_IOPRIO
add linux-api and linux-block to cc

v3: add ioprio_check_cap function
convert kiocb ki_hint to u16
use ioprio_check_cap when adding ioprio to kiocb in aio.c

v4: handle IOCB_IOPRIO in aio_prep_rw
note patch 3 depends on patch 1 in commit msg

v5: rename ki_hint_valid -> ki_hint_validate
remove ki_hint_validate comment and whitespace
remove IOCB_IOPRIO flag
initialize kiocb to have no priority

v6: add __blkdev_direct_IO_simple ioprio support

v7: Tie ki_hint_validate to kiocb ki_hint type
Add additional ki_hint_validate check

Adam Manzanares (5):
  block: add ioprio_check_cap function
  fs: Convert kiocb rw_hint from enum to u16
  fs: Add aio iopriority support
  fs: blkdev set bio prio from kiocb prio
  fs: iomap dio set bio prio from kiocb prio

 block/ioprio.c   | 22 --
 drivers/block/loop.c |  3 +++
 fs/aio.c | 18 +-
 fs/block_dev.c   |  2 ++
 fs/iomap.c   |  1 +
 include/linux/fs.h   | 16 ++--
 include/linux/ioprio.h   |  2 ++
 include/uapi/linux/aio_abi.h |  1 +
 8 files changed, 56 insertions(+), 9 deletions(-)

-- 
2.15.1



[PATCH v7 1/5] block: add ioprio_check_cap function

2018-05-22 Thread adam . manzanares
From: Adam Manzanares 

Aio per command iopriority support introduces a second interface between
userland and the kernel capable of passing iopriority. The aio interface also
needs the ability to verify that the submitting context has sufficient
privileges to submit IOPRIO_RT commands. This patch creates the
ioprio_check_cap function to be used by the ioprio_set system call and also by
the aio interface.

Signed-off-by: Adam Manzanares 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jeff Moyer 
---
 block/ioprio.c | 22 --
 include/linux/ioprio.h |  2 ++
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/block/ioprio.c b/block/ioprio.c
index 6f5d0b6625e3..f9821080c92c 100644
--- a/block/ioprio.c
+++ b/block/ioprio.c
@@ -61,15 +61,10 @@ int set_task_ioprio(struct task_struct *task, int ioprio)
 }
 EXPORT_SYMBOL_GPL(set_task_ioprio);
 
-SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio)
+int ioprio_check_cap(int ioprio)
 {
int class = IOPRIO_PRIO_CLASS(ioprio);
int data = IOPRIO_PRIO_DATA(ioprio);
-   struct task_struct *p, *g;
-   struct user_struct *user;
-   struct pid *pgrp;
-   kuid_t uid;
-   int ret;
 
switch (class) {
case IOPRIO_CLASS_RT:
@@ -92,6 +87,21 @@ SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, 
ioprio)
return -EINVAL;
}
 
+   return 0;
+}
+
+SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio)
+{
+   struct task_struct *p, *g;
+   struct user_struct *user;
+   struct pid *pgrp;
+   kuid_t uid;
+   int ret;
+
+   ret = ioprio_check_cap(ioprio);
+   if (ret)
+   return ret;
+
ret = -ESRCH;
rcu_read_lock();
switch (which) {
diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
index 627efac73e6d..4a28cec49ec3 100644
--- a/include/linux/ioprio.h
+++ b/include/linux/ioprio.h
@@ -77,4 +77,6 @@ extern int ioprio_best(unsigned short aprio, unsigned short 
bprio);
 
 extern int set_task_ioprio(struct task_struct *task, int ioprio);
 
+extern int ioprio_check_cap(int ioprio);
+
 #endif
-- 
2.15.1



Re: [PATCH] block: kyber: make kyber more friendly with merging

2018-05-22 Thread Jens Axboe
On 5/22/18 10:20 AM, Jens Axboe wrote:
> On 5/22/18 10:17 AM, Holger Hoffstätte wrote:
>> On 05/22/18 16:48, Jianchao Wang wrote:
>>> Currently, kyber is very unfriendly with merging. kyber depends
>>> on ctx rq_list to do merging; however, most of the time it will not
>>> leave any requests in ctx rq_list. This is because even if the tokens
>>> of one domain are used up, kyber will try to dispatch requests
>>> from other domains and flush the rq_list there.
>>>
>>> To improve this, we set up a kyber_ctx_queue (kcq) which is similar
>>> to ctx, but has rq_lists for the different domains, and build the same
>>> mapping between kcq and khd as between ctx & hctx. Then we can merge,
>>> insert and dispatch for different domains separately. If one domain's
>>> tokens are used up, the requests can be left in the rq_list of
>>> that domain and maybe merged with following IO.
>>>
>>> Following is my test result on machine with 8 cores and NVMe card
>>> INTEL SSDPEKKR128G7
>>>
>>> fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
>>> seq/random
>>> +------+----------+-----------+------------+-----------------+---------+
>>> |patch?| bw(MB/s) |   iops    | slat(usec) |   clat(usec)    |  merge  |
>>> +------+----------+-----------+------------+-----------------+---------+
>>> | w/o  |  606/612 | 151k/153k |  6.89/7.03 | 3349.21/3305.40 |   0/0   |
>>> +------+----------+-----------+------------+-----------------+---------+
>>> | w/   | 1083/616 | 277k/154k |  4.93/6.95 | 1830.62/3279.95 | 223k/3k |
>>> +------+----------+-----------+------------+-----------------+---------+
>>> When numjobs is set to 16, the bw and iops reach 1662MB/s and 425k
>>> on my platform.
>>>
>>> Signed-off-by: Jianchao Wang 
>>
>> 
>>
>> This looks great but prevents kyber from being built as module,
>> which is AFAIK supposed to work (and works now):
>>
>> ..
>>   CC [M]  block/kyber-iosched.o
>>   Building modules, stage 2.
>>   MODPOST 313 modules
>> ERROR: "bio_attempt_back_merge" [block/kyber-iosched.ko] undefined!
>> ERROR: "bio_attempt_front_merge" [block/kyber-iosched.ko] undefined!
>> ERROR: "bio_attempt_discard_merge" [block/kyber-iosched.ko] undefined!
>> ERROR: "blk_try_merge" [block/kyber-iosched.ko] undefined!
>> ERROR: "blk_rq_merge_ok" [block/kyber-iosched.ko] undefined!
>> ..
>>
>> It does build fine when compiled in, obviously. :)
> 
> It's basically duplicating the contents of blk_attempt_plug_merge().
> I would suggest abstracting out the list loop and merge check
> into a helper that could then be called from both kyber and the
> plug merge function.

See attached, prep patch and yours rebased on top of it.

-- 
Jens Axboe

>From edb57af07cf619f27a3fee85705870875f853f58 Mon Sep 17 00:00:00 2001
From: Jianchao Wang 
Date: Tue, 22 May 2018 11:45:01 -0600
Subject: [PATCH 2/2] block: kyber: make kyber more friendly with merging

Currently, kyber is very unfriendly with merging. kyber depends
on ctx rq_list to do merging; however, most of the time it will not
leave any requests in ctx rq_list. This is because even if the tokens
of one domain are used up, kyber will try to dispatch requests
from other domains and flush the rq_list there.

To improve this, we set up a kyber_ctx_queue (kcq) which is similar
to ctx, but has rq_lists for the different domains, and build the same
mapping between kcq and khd as between ctx & hctx. Then we can merge,
insert and dispatch for different domains separately. If one domain's
tokens are used up, the requests can be left in the rq_list of
that domain and maybe merged with following IO.

Following is my test result on machine with 8 cores and NVMe card
INTEL SSDPEKKR128G7

fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
seq/random
+--+---+
|patch?| bw(MB/s) |   iops| slat(usec) |clat(usec)   |  merge  |
+--+
| w/o  |  606/612 | 151k/153k |  6.89/7.03 | 3349.21/3305.40 |   0/0   |
+--+
| w/   | 1083/616 | 277k/154k |  4.93/6.95 | 1830.62/3279.95 | 223k/3k |
+--+
When set numjobs to 16, the bw and iops could reach 1662MB/s and 425k
on my platform.

Signed-off-by: Jianchao Wang 
---
 block/kyber-iosched.c | 212 +++---
 1 file changed, 184 insertions(+), 28 deletions(-)

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 5b33dc394cc7..f28147be729b 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -72,6 +72,15 @@ static const unsigned int kyber_batch_size[] = {
 	[KYBER_OTHER] = 8,
 };
 
+struct kyber_ctx_queue {
+	/*
+	 * Copied from blk_mq_ctx->index_hw
+	 */
+	unsigned int index;
+	spinlock_t lock;
+	struct list_head rq_list[KYBER_NUM_DOMAINS];
+} ____cacheline_aligned_in_smp;
+
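
For readers without the attachment, a sketch of what the shared helper
from the prep patch might look like (illustrative only; the function
name and the per-walk check limit are assumptions): the list walk and
merge check from blk_attempt_plug_merge() are lifted into one function
that both the plug path and kyber's per-domain kcq lists can call:

	bool blk_mq_bio_list_merge(struct request_queue *q,
				   struct list_head *list, struct bio *bio)
	{
		struct request *rq;
		int checked = 8;

		list_for_each_entry_reverse(rq, list, queuelist) {
			if (!checked--)
				break;
			if (!blk_rq_merge_ok(rq, bio))
				continue;
			switch (blk_try_merge(rq, bio)) {
			case ELEVATOR_BACK_MERGE:
				return bio_attempt_back_merge(q, rq, bio);
			case ELEVATOR_FRONT_MERGE:
				return bio_attempt_front_merge(q, rq, bio);
			case ELEVATOR_DISCARD_MERGE:
				return bio_attempt_discard_merge(q, rq, bio);
			default:
				return false;
			}
		}
		return false;
	}

Keeping such a helper in the core block layer (and exporting it) would
also avoid the undefined-symbol errors Holger hit with the modular build.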
 

Re: [PATCH v5 3/5] fs: Add aio iopriority support

2018-05-22 Thread kbuild test robot
Hi Adam,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on next-20180517]
[cannot apply to linus/master block/for-next v4.17-rc6 v4.17-rc5 v4.17-rc4 
v4.17-rc6]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/adam-manzanares-wdc-com/AIO-add-per-command-iopriority/20180522-232203
config: x86_64-randconfig-s0-05230027 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   fs/aio.o: In function `aio_prep_rw':
>> fs/aio.c:1460: undefined reference to `ioprio_check_cap'

vim +1460 fs/aio.c

  1440  
  1441  static int aio_prep_rw(struct kiocb *req, struct iocb *iocb)
  1442  {
  1443  int ret;
  1444  
  1445  req->ki_filp = fget(iocb->aio_fildes);
  1446  if (unlikely(!req->ki_filp))
  1447  return -EBADF;
  1448  req->ki_complete = aio_complete_rw;
  1449  req->ki_pos = iocb->aio_offset;
  1450  req->ki_flags = iocb_flags(req->ki_filp);
  1451  if (iocb->aio_flags & IOCB_FLAG_RESFD)
  1452  req->ki_flags |= IOCB_EVENTFD;
  1453  req->ki_hint = file_write_hint(req->ki_filp);
  1454  if (iocb->aio_flags & IOCB_FLAG_IOPRIO) {
  1455  /*
  1456   * If the IOCB_FLAG_IOPRIO flag of aio_flags is set, 
then
  1457   * aio_reqprio is interpreted as an I/O scheduling
  1458   * class and priority.
  1459   */
> 1460  ret = ioprio_check_cap(iocb->aio_reqprio);
  1461  if (ret) {
  1462  pr_debug("aio ioprio check cap error\n");
  1463  return -EINVAL;
  1464  }
  1465  
  1466  req->ki_ioprio = iocb->aio_reqprio;
  1467  } else
  1468  req->ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 
0);
  1469  
  1470  ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags);
  1471  if (unlikely(ret))
  1472  fput(req->ki_filp);
  1473  return ret;
  1474  }
  1475  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH v6 2/5] fs: Convert kiocb rw_hint from enum to u16

2018-05-22 Thread Adam Manzanares


On 5/22/18 9:30 AM, Jens Axboe wrote:
> On 5/22/18 10:24 AM, Goldwyn Rodrigues wrote:
>>
>>
>> On 05/22/2018 10:32 AM, Jens Axboe wrote:
>>> On 5/22/18 9:07 AM, adam.manzana...@wdc.com wrote:
 From: Adam Manzanares 

 In order to avoid kiocb bloat for per command iopriority support, rw_hint
 is converted from enum to a u16. Added a guard around ki_hint assignment.

 Signed-off-by: Adam Manzanares 
 Reviewed-by: Christoph Hellwig 
 ---
   include/linux/fs.h | 13 +++--
   1 file changed, 11 insertions(+), 2 deletions(-)

 diff --git a/include/linux/fs.h b/include/linux/fs.h
 index 7f07977bdfd7..50de40dbbb85 100644
 --- a/include/linux/fs.h
 +++ b/include/linux/fs.h
 @@ -284,6 +284,8 @@ enum rw_hint {
WRITE_LIFE_EXTREME  = RWH_WRITE_LIFE_EXTREME,
   };
   
 +#define MAX_KI_HINT   ((1 << 16) - 1) /* ki_hint type is u16 
 */
>>>
>>> Instead of having to do this and now rely on those now being synced,
>>> how about something like the below.
>>>
>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>> index 760d8da1b6c7..070438d0b62d 100644
>>> --- a/include/linux/fs.h
>>> +++ b/include/linux/fs.h
>>> @@ -299,7 +299,7 @@ struct kiocb {
>>> void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
>>> void*private;
>>> int ki_flags;
>>> -   enum rw_hintki_hint;
>>> +   u16 ki_hint;
>>>   } __randomize_layout;
>>>   
>>>   static inline bool is_sync_kiocb(struct kiocb *kiocb)
>>> @@ -1927,12 +1927,22 @@ static inline enum rw_hint file_write_hint(struct 
>>> file *file)
>>>   
>>>   static inline int iocb_flags(struct file *file);
>>>   
>>> +static inline u16 ki_hint_validate(enum rw_hint hint)
>>> +{
>>> +   typeof(((struct kiocb *)0)->ki_hint) max_hint = -1;
>>
>> This looks complex to me. It would force a reader to look back at what
>> datatype ki_hint is. I'd prefer to declare it as u16 max_hint = -1, or
>> even the previous #define MAX_KI_HINT format, which is easier to read.
>> Just a programming style preference, though.
> 
> How is it complex? It's defining a type that'll be the same as ki_hint
> in the kiocb, which is _exactly_ what we care about. Any sort of other
> definition will rely on those two locations now being synced. The
> above will never break.
> 
> So I strongly disagree. The above will _never_ require the reader to
> figure out what the type is. Any other variant will _always_ require
> the reader to check if they are the same.
> 

I do think the previous code was a bit easier to parse at first glance, 
but that is outweighed by the fact that the validate function is now 
directly tied to the kiocb ki_hint type.

I also missed one spot where I should have used ki_hint_validate. Will 
resend soon.
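
To make the point concrete, a standalone illustration of the construct
under discussion (demo types only, not the kernel's): assigning -1 to an
unsigned type yields that type's maximum value, so the bound follows
whatever width ki_hint is declared with, with no separate macro to keep
in sync.

	struct kiocb_demo {
		unsigned short ki_hint;		/* stand-in for the real u16 */
	};

	/* hint is assumed non-negative, as enum rw_hint values are */
	static inline unsigned short demo_hint_validate(int hint)
	{
		typeof(((struct kiocb_demo *)0)->ki_hint) max_hint = -1; /* 0xffff */

		return hint <= max_hint ? hint : 0;
	}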

Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Jens Axboe
On 5/22/18 10:44 AM, Jens Axboe wrote:
> 
> On 5/22/18 10:25 AM, Bart Van Assche wrote:
>> Recently the blk-mq timeout handling code was reworked. See also Tejun
>> Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
>> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
>> This patch reworks the blk-mq timeout handling code again. The timeout
>> handling code is simplified by introducing a state machine per request.
> This change prevents the blk-mq timeout handling code from ignoring
>> completions that occur after blk_mq_check_expired() has been called and
>> before blk_mq_rq_timed_out() has been called.
> 
> I'll take a look at this again, getting rid of cmpxchg64 makes me
> much more comfortable.

FWIW, a quick pass on runtime testing works fine. As expected, it's
more efficient than what's currently in the kernel, testing with both
null_blk (1 and nr_cpus worth of queues), and nvme as well. A huge win
is that we shrink the request size from 360 bytes to 312, and I did
a small followup patch that brings that to 304. That's a 15% reduction,
massive.

-- 
Jens Axboe



Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()

2018-05-22 Thread Paul E. McKenney
On Tue, May 22, 2018 at 09:38:13AM -0700, Linus Torvalds wrote:
> On Tue, May 22, 2018 at 2:09 AM Roman Penyaev <
> roman.peny...@profitbricks.com> wrote:
> 
> > Should I resend current patch with more clear comments about how careful
> > caller should be with a leaking pointer?
> 
> No. Even if we go your way, there is *one* single user, and that one is
> special and needs to take a lot more care.
> 
> Just roll your own version, and make it an inline function like I've asked
> now many times, and add a shit-ton of explanations of why it's safe to
> use in that *one* situation.
> 
> I don't want any crazy and unsafe stuff in the generic header file that
> absolutely *nobody* should ever use.

Completely agreed!

I was perhaps foolishly assuming that they would be making that adjustment
based on earlier emails, but yes, I should have explicitly stated this
requirement in my earlier reply.

Thanx, Paul



Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()

2018-05-22 Thread Paul E. McKenney
On Tue, May 22, 2018 at 11:09:16AM +0200, Roman Penyaev wrote:
> On Mon, May 21, 2018 at 5:31 PM, Paul E. McKenney
>  wrote:
> > On Mon, May 21, 2018 at 03:50:10PM +0200, Roman Penyaev wrote:
> >> On Sun, May 20, 2018 at 2:43 AM, Paul E. McKenney
> >>  wrote:
> >> > On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote:
> >> >> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney
> >> >>  wrote:
> >> >> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote:
> >> >> >> Function is going to be used in transport over RDMA module
> >> >> >> in subsequent patches.
> >> >> >>
> >> >> >> Function returns next element in round-robin fashion,
> >> >> >> i.e. head will be skipped.  NULL will be returned if list
> >> >> >> is observed as empty.
> >> >> >>
> >> >> >> Signed-off-by: Roman Pen 
> >> >> >> Cc: Paul E. McKenney 
> >> >> >> Cc: linux-ker...@vger.kernel.org
> >> >> >> ---
> >> >> >>  include/linux/rculist.h | 19 +++
> >> >> >>  1 file changed, 19 insertions(+)
> >> >> >>
> >> >> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> >> >> >> index 127f534fec94..b0840d5ab25a 100644
> >> >> >> --- a/include/linux/rculist.h
> >> >> >> +++ b/include/linux/rculist.h
> >> >> >> @@ -339,6 +339,25 @@ static inline void 
> >> >> >> list_splice_tail_init_rcu(struct list_head *list,
> >> >> >>  })
> >> >> >>
> >> >> >>  /**
> >> >> >> + * list_next_or_null_rr_rcu - get next list element in round-robin 
> >> >> >> fashion.
> >> >> >> + * @head:the head for the list.
> >> >> >> + * @ptr:the list head to take the next element from.
> >> >> >> + * @type:   the type of the struct this is embedded in.
> >> >> >> + * @memb:   the name of the list_head within the struct.
> >> >> >> + *
> >> >> >> + * Next element returned in round-robin fashion, i.e. head will be 
> >> >> >> skipped,
> >> >> >> + * but if list is observed as empty, NULL will be returned.
> >> >> >> + *
> >> >> >> + * This primitive may safely run concurrently with the _rcu 
> >> >> >> list-mutation
> >> >> >> + * primitives such as list_add_rcu() as long as it's guarded by 
> >> >> >> rcu_read_lock().
> >> >> >
> >> >> > Of course, all the set of list_next_or_null_rr_rcu() invocations that
> >> >> > are round-robining a given list must all be under the same RCU 
> >> >> > read-side
> >> >> > critical section.  For example, the following will break badly:
> >> >> >
> >> >> > struct foo *take_rr_step(struct list_head *head, struct foo 
> >> >> > *ptr)
> >> >> > {
> >> >> > struct foo *ret;
> >> >> >
> >> >> > rcu_read_lock();
> >> >> > ret = list_next_or_null_rr_rcu(head, ptr, struct foo, 
> >> >> > foolist);
> >> >> > rcu_read_unlock();  /* BUG */
> >> >> > return ret;
> >> >> > }
> >> >> >
> >> >> > You need a big fat comment stating this, at the very least.  The 
> >> >> > resulting
> >> >> > bug can be very hard to trigger and even harder to debug.
> >> >> >
> >> >> > And yes, I know that the same restriction applies to list_next_rcu()
> >> >> > and friends.  The difference is that if you try to invoke those in an
> >> >> > infinite loop, you will be rapped on the knuckles as soon as you hit
> >> >> > the list header.  Without that knuckle-rapping, RCU CPU stall warnings
> >> >> > might tempt people to do something broken like take_rr_step() above.
> >> >>
> >> >> Hi Paul,
> >> >>
> >> >> I need -rr behaviour for doing IO load-balancing when I choose next RDMA
> >> >> connection from the list in order to send a request, i.e. my code is
> >> >> something like the following:
> >> >>
> >> >> static struct conn *get_and_set_next_conn(void)
> >> >> {
> >> >> struct conn *conn;
> >> >>
> >> >> conn = rcu_dereference(rcu_conn);
> >> >> if (unlikely(!conn))
> >> >> return conn;
> >> >
> >> > Wait.  Don't you need to restart from the beginning of the list in
> >> > this case?  Or does the list never have anything added to it and is
> >> > rcu_conn initially the first element in the list?
> >>
> >> Hi Paul,
> >>
> >> No, I continue from the pointer, which I assigned on the previous IO
> >> in order to send IO fairly and keep load balanced.
> >>
> >> Initially @rcu_conn points to the first element, but elements can
> >> be deleted from the list and list can become empty.
> >>
> >> The deletion code is below.
> >>
> >> >
> >> >> conn = list_next_or_null_rr_rcu(&conn_list,
> >> >> &conn->entry,
> >> >> typeof(*conn),
> >> >> entry);
> >> >> rcu_assign_pointer(rcu_conn, conn);
> >> >
> >> > Linus is correct to doubt this 

Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence

2018-05-22 Thread Bart Van Assche
On Tue, 2018-05-22 at 10:34 -0600, Keith Busch wrote:
> On Tue, May 22, 2018 at 04:29:07PM +, Bart Van Assche wrote:
> > Please have another look at the current code that handles request timeouts
> > and completions. The current implementation guarantees that no double
> > completions can occur but your patch removes essential aspects of that
> > implementation.
> 
> How does the current implementation guarantee a double completion doesn't
> happen when the request is allocated for a new command?

Hello Keith,

If a request is completes and is reused after the timeout handler has set
aborted_gstate and before blk_mq_terminate_expired() is called then the latter
function will skip the request because restarting a request causes the
generation number in rq->gstate to be incremented. From 
blk_mq_rq_update_state():

if (state == MQ_RQ_IN_FLIGHT) {
WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
new_val += MQ_RQ_GEN_INC;
}

Bart.





Re: [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-05-22 Thread Jason Gunthorpe
On Fri, May 18, 2018 at 03:03:47PM +0200, Roman Pen wrote:
> Hi all,
> 
> This is v2 of series, which introduces IBNBD/IBTRS modules.
> 
> This cover letter is split on three parts:
> 
> 1. Introduction, which almost repeats everything from previous cover
>letters.
> 2. Changelog.
> 3. Performance measurements on linux-4.17.0-rc2 and on two different
>Mellanox cards: ConnectX-2 and ConnectX-3 and CPUs: Intel and AMD.
> 
> 
>  Introduction
> 
> IBTRS (InfiniBand Transport) is a reliable high speed transport library
> which allows for establishing connection between client and server
> machines via RDMA. It is optimized to transfer (read/write) IO blocks
> in the sense that it follows the BIO semantics of providing the
> possibility to either write data from a scatter-gather list to the
> remote side or to request ("read") data transfer from the remote side
> into a given set of buffers.
> 
> IBTRS is multipath capable and provides I/O fail-over and load-balancing
> functionality, i.e. in IBTRS terminology, an IBTRS path is a set of RDMA
> CMs and a particular path is selected according to the load-balancing policy.
> 
> IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
> (client and server) that allow for remote access of a block device on
> the server over IBTRS protocol. After being mapped, the remote block
> devices can be accessed on the client side as local block devices.
> Internally IBNBD uses IBTRS as an RDMA transport library.
> 
> Why?
> 
>- IBNBD/IBTRS is developed in order to map thin provisioned volumes,
>  thus internal protocol is simple.
>- IBTRS was developed as an independent RDMA transport library, which
>  supports fail-over and load-balancing policies using multipath, thus
>  it can be used for any other IO needs rather than only for block
>  device.
>- IBNBD/IBTRS is faster than NVME over RDMA.
>  Old comparison results:
>  https://www.spinics.net/lists/linux-rdma/msg48799.html
>  New comparison results: see performance measurements section below.
> 
> Key features of IBTRS transport library and IBNBD block device:
> 
> o High throughput and low latency due to:
>- Only two RDMA messages per IO.
>- IMM InfiniBand messages on responses to reduce round trip latency.
>- Simplified memory management: memory allocation happens once on
>  server side when IBTRS session is established.
> 
> o IO fail-over and load-balancing by using multipath.  According to
>   our test loads, an additional path brings ~20% more bandwidth.
> 
> o Simple configuration of IBNBD:
>- Server side is completely passive: volumes do not need to be
>  explicitly exported.
>- Only IB port GID and device path needed on client side to map
>  a block device.
>- A device is remapped automatically i.e. after storage reboot.
> 
> Commits for kernel can be found here:
>https://github.com/profitbricks/ibnbd/commits/linux-4.17-rc2
> 
> The out-of-tree modules are here:
>https://github.com/profitbricks/ibnbd/
> 
> Vault 2017 presentation:
>
> http://events.linuxfoundation.org/sites/events/files/slides/IBNBD-Vault-2017.pdf

I think from the RDMA side, before we accept something like this, I'd
like to hear from Christoph, Chuck or Sagi that the dataplane
implementation of this is correct, eg it uses the MRs properly and
invalidates at the right time, sequences with dma_ops as required,
etc.

They have all done this work on their ULPs and it was tricky; I don't
want to see another ULP implement this wrong.

I'm skeptical here already due to the performance numbers - they are
not really what I'd expect, and we may find that invalidate changes
will bring the performance down further.

Jason


Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Jens Axboe

On 5/22/18 10:25 AM, Bart Van Assche wrote:
> Recently the blk-mq timeout handling code was reworked. See also Tejun
> Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
> This patch reworks the blk-mq timeout handling code again. The timeout
> handling code is simplified by introducing a state machine per request.
> This change prevents the blk-mq timeout handling code from ignoring
> completions that occur after blk_mq_check_expired() has been called and
> before blk_mq_rq_timed_out() has been called.

I'll take a look at this again, getting rid of cmpxchg64 makes me
much more comfortable.

-- 
Jens Axboe
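
The core idea of the per-request state machine can be sketched in a few
lines (an illustration of the concept, not Bart's actual patch):
completion and timeout both try to claim the request with one atomic
transition out of IN_FLIGHT, so exactly one side wins and a late
blk_mq_complete_request() becomes a harmless no-op.

	enum mq_rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE };

	/* Returns true only for the single caller that owns the completion. */
	static bool demo_claim_complete(atomic_t *state)
	{
		return atomic_cmpxchg(state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) ==
		       MQ_RQ_IN_FLIGHT;
	}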



Re: [RFC PATCH 0/3] blk-mq: Timeout rework

2018-05-22 Thread Keith Busch
On Tue, May 22, 2018 at 04:30:41PM +, Bart Van Assche wrote:
> 
> Please have a look at v13 of the timeout handling rework patch that I posted.
> That patch should not introduce any new race conditions and should also handle
> the scenario fine in which blk_mq_complete_request() is called while the NVMe
> timeout handling function is in progress.

Thanks for the notice. That sounds very interesting and will be happy
take a look.


Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()

2018-05-22 Thread Linus Torvalds
On Tue, May 22, 2018 at 2:09 AM Roman Penyaev <
roman.peny...@profitbricks.com> wrote:

> Should I resend current patch with more clear comments about how careful
> caller should be with a leaking pointer?

No. Even if we go your way, there is *one* single user, and that one is
special and needs to take a lot more care.

Just roll your own version, and make it an inline function like I've asked
now many times, and add a shit-ton of explanations of why it's safe to
use in that *one* situation.

I don't want any crazy and unsafe stuff in the generic header file that
absolutely *nobody* should ever use.

  Linus
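
Concretely, the request boils down to something like the following
driver-private inline (a sketch under the assumptions of the quoted
thread, i.e. a struct conn linked through a list_head named entry, not a
proposed generic primitive), with the safety argument spelled out where
the pointer leaks out of the read-side critical section:

	/*
	 * Round-robin next element, skipping the list head; NULL if the
	 * list is observed empty.  The caller stores the result across
	 * RCU read-side critical sections, which is only safe here
	 * because entries are freed via kfree_rcu() and the saved
	 * pointer is re-validated under rcu_read_lock() before every
	 * dereference (see the deletion path).
	 */
	static inline struct conn *conn_next_rr_rcu(struct list_head *head,
						    struct conn *cur)
	{
		struct list_head *next;

		next = rcu_dereference(list_next_rcu(&cur->entry));
		if (next == head)		/* skip the head */
			next = rcu_dereference(list_next_rcu(next));
		if (next == head)		/* list observed as empty */
			return NULL;
		return list_entry(next, struct conn, entry);
	}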


Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()

2018-05-22 Thread Paul E. McKenney
On Tue, May 22, 2018 at 11:09:08AM +0200, Roman Penyaev wrote:
> On Mon, May 21, 2018 at 5:33 PM, Paul E. McKenney
>  wrote:
> > On Mon, May 21, 2018 at 08:16:59AM -0700, Linus Torvalds wrote:
> >> On Mon, May 21, 2018 at 6:51 AM Roman Penyaev <
> >> roman.peny...@profitbricks.com> wrote:
> >>
> >> > No, I continue from the pointer, which I assigned on the previous IO
> >> > in order to send IO fairly and keep load balanced.
> >>
> >> Right. And that's exactly what has both me and Paul nervous. You're no
> >> longer in the RCU domain. You're using a pointer where the lifetime has
> >> nothing to do with RCU any more.
> >>
> >> Can it be done? Sure. But you need *other* locking for it (that you haven't
> >> explained), and it's fragile as hell.
> >
> > He looks to actually have it right, but I would want to see a big comment
> > on the read side noting the leak of the pointer and documenting why it
> > is OK.
> 
> Hi Paul and Linus,
> 
> Should I resend current patch with more clear comments about how careful
> caller should be with a leaking pointer?  Also I will update read side
> with a fat comment about "rcu_assign_pointer()" which leaks the pointer
> out of RCU domain and what is done to prevent nasty consequences.
> Does that sound acceptable?

That sounds good to me.

Thanx, Paul



Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence

2018-05-22 Thread Keith Busch
On Tue, May 22, 2018 at 04:29:07PM +, Bart Van Assche wrote:
> Please have another look at the current code that handles request timeouts
> and completions. The current implementation guarantees that no double
> completions can occur but your patch removes essential aspects of that
> implementation.

How does the current implementation guarantee a double completion doesn't
happen when the request is allocated for a new command?


Re: [RFC PATCH 0/3] blk-mq: Timeout rework

2018-05-22 Thread Bart Van Assche
On Tue, 2018-05-22 at 08:06 -0600, Keith Busch wrote:
> On Mon, May 21, 2018 at 11:29:21PM +, Bart Van Assche wrote:
> > Can you explain why the NVMe driver needs reference counting of requests but
> > no other block driver needs this? Additionally, why is it that for all block
> > drivers except NVMe the current block layer API is sufficient
> > (blk_get_request()/blk_execute_rq()/blk_mq_start_request()/
> > blk_mq_complete_request()/blk_mq_end_request())?
> 
> I'm pretty sure NVMe isn't the only driver where a call to
> blk_mq_complete_request silently fails to transition the request to
> COMPLETE, forcing unnecessary error handling. This patch isn't so
> much about NVMe as it is about removing that silent exception from the
> block API.

Hello Keith,

Please have a look at v13 of the timeout handling rework patch that I posted.
That patch should not introduce any new race conditions and should also handle
the scenario fine in which blk_mq_complete_request() is called while the NVMe
timeout handling function is in progress.

Thanks,

Bart.





Re: [PATCH v6 2/5] fs: Convert kiocb rw_hint from enum to u16

2018-05-22 Thread Jens Axboe
On 5/22/18 10:24 AM, Goldwyn Rodrigues wrote:
> 
> 
> On 05/22/2018 10:32 AM, Jens Axboe wrote:
>> On 5/22/18 9:07 AM, adam.manzana...@wdc.com wrote:
>>> From: Adam Manzanares 
>>>
>>> In order to avoid kiocb bloat for per command iopriority support, rw_hint
>>> is converted from enum to a u16. Added a guard around ki_hint assignment.
>>>
>>> Signed-off-by: Adam Manzanares 
>>> Reviewed-by: Christoph Hellwig 
>>> ---
>>>  include/linux/fs.h | 13 +++--
>>>  1 file changed, 11 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>> index 7f07977bdfd7..50de40dbbb85 100644
>>> --- a/include/linux/fs.h
>>> +++ b/include/linux/fs.h
>>> @@ -284,6 +284,8 @@ enum rw_hint {
>>> WRITE_LIFE_EXTREME  = RWH_WRITE_LIFE_EXTREME,
>>>  };
>>>  
>>> +#define MAX_KI_HINT((1 << 16) - 1) /* ki_hint type is u16 
>>> */
>>
>> Instead of having to do this and now rely on those now being synced,
>> how about something like the below.
>>
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 760d8da1b6c7..070438d0b62d 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -299,7 +299,7 @@ struct kiocb {
>>  void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
>>  void*private;
>>  int ki_flags;
>> -enum rw_hintki_hint;
>> +u16 ki_hint;
>>  } __randomize_layout;
>>  
>>  static inline bool is_sync_kiocb(struct kiocb *kiocb)
>> @@ -1927,12 +1927,22 @@ static inline enum rw_hint file_write_hint(struct file *file)
>>  
>>  static inline int iocb_flags(struct file *file);
>>  
>> +static inline u16 ki_hint_validate(enum rw_hint hint)
>> +{
>> +typeof(((struct kiocb *)0)->ki_hint) max_hint = -1;
> 
> This looks complex to me. It would force a reader to look back at what
> datatype ki_hint is. I'd prefer to declare it as u16 max_hint = -1; even
> the previous #define MAX_KI_HINT format is easier to read. Just a matter
> of reading style you are comfortable with, though.

How is it complex? It's defining a type that'll be the same as ki_hint
in the kiocb, which is _exactly_ what we care about. Any sort of other
definition will rely on those two locations now being synced. The
above will never break.

So I strongly disagree. The above will _never_ require the reader to
figure out what the type is. Any other variant will _always_ require
the reader to check if they are the same.

-- 
Jens Axboe
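
To see why the typeof trick works: assigning -1 to an unsigned integer
type yields that type's maximum value, so the guard automatically tracks
whatever width ki_hint has. A standalone sketch (illustrative only, not
part of the patch; typeof is a GNU C extension):

#include <stdio.h>

struct kiocb_demo {
	unsigned short ki_hint;		/* stands in for the u16 field */
};

int main(void)
{
	/* -1 converted to an unsigned type wraps to its maximum */
	typeof(((struct kiocb_demo *)0)->ki_hint) max_hint = -1;

	printf("%u\n", (unsigned)max_hint);	/* prints 65535 */
	return 0;
}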



Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-05-22 Thread Bart Van Assche
On Tue, 2018-05-22 at 08:15 -0600, Keith Busch wrote:
> This shouldn't be introducing any new concurrent calls to
> __blk_mq_complete_request if they don't already exist. The timeout work
> calls it only if the driver's timeout returns BLK_EH_HANDLED, which means
> the driver is claiming the command is now done, but the driver didn't call
> blk_mq_complete_request as indicated by the request's IN_FLIGHT state.
> 
> In order to get a second call to __blk_mq_complete_request(), then,
> the driver would have to end up calling blk_mq_complete_request()
> in another context. But there's nothing stopping that from happening
> today, and would be bad if any driver actually did that: the request
> may have been re-allocated and issued as a new command, and calling
> blk_mq_complete_request() the second time introduces data corruption.

Hello Keith,

Please have another look at the current code that handles request timeouts
and completions. The current implementation guarantees that no double
completions can occur but your patch removes essential aspects of that
implementation.

Thanks,

Bart.

Re: [RFC PATCH 2/3] blk-mq: Fix timeout and state order

2018-05-22 Thread Bart Van Assche
On Tue, 2018-05-22 at 17:24 +0200, Christoph Hellwig wrote:
> On Mon, May 21, 2018 at 05:11:30PM -0600, Keith Busch wrote:
> > The block layer had been setting the state to in-flight prior to updating
> > the timer. This is the wrong order since the timeout handler could observe
> > the in-flight state with the older timeout, believing the request had
> > expired when in fact it is just getting started.
> > 
> > Signed-off-by: Keith Busch 
> 
> The way I understood Barts comments to my comments on his patch we
> actually need the two updates to be atomic.  I haven't had much time
> to follow up, but I'd like to hear Barts opinion.  Either way we
> clearly need to document our assumptions here in comments.

After we discussed request state updating I have been thinking further about
this. I now think it is safe to update the request deadline first because the
timeout code ignores any request whose state is not MQ_RQ_IN_FLIGHT.
Additionally, it is unlikely that the request timer will fire before the
request state has been updated. And if that were to happen, the timeout will
simply be handled the next time request timeouts are examined.

Bart.
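
The ordering Bart describes can be sketched as follows (helper names
taken from the changelog of his v13 patch; whether the two stores also
need to be fenced or made atomic is exactly what is being debated):

	/* sketch: write the deadline before publishing the state, so
	 * the timeout code, which skips any request that is not
	 * MQ_RQ_IN_FLIGHT, can never see the new state paired with a
	 * stale deadline */
	blk_mq_rq_set_deadline(rq, jiffies + rq->q->rq_timeout);
	WRITE_ONCE(rq->state, MQ_RQ_IN_FLIGHT);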






Re: [PATCH v6 2/5] fs: Convert kiocb rw_hint from enum to u16

2018-05-22 Thread Goldwyn Rodrigues


On 05/22/2018 10:32 AM, Jens Axboe wrote:
> On 5/22/18 9:07 AM, adam.manzana...@wdc.com wrote:
>> From: Adam Manzanares 
>>
>> In order to avoid kiocb bloat for per command iopriority support, rw_hint
>> is converted from enum to a u16. Added a guard around ki_hint assignment.
>>
>> Signed-off-by: Adam Manzanares 
>> Reviewed-by: Christoph Hellwig 
>> ---
>>  include/linux/fs.h | 13 +++--
>>  1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 7f07977bdfd7..50de40dbbb85 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -284,6 +284,8 @@ enum rw_hint {
>>  WRITE_LIFE_EXTREME  = RWH_WRITE_LIFE_EXTREME,
>>  };
>>  
>> +#define MAX_KI_HINT ((1 << 16) - 1) /* ki_hint type is u16 */
> 
> Instead of having to do this and now rely on those now being synced,
> how about something like the below.
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 760d8da1b6c7..070438d0b62d 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -299,7 +299,7 @@ struct kiocb {
>   void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
>   void*private;
>   int ki_flags;
> - enum rw_hintki_hint;
> + u16 ki_hint;
>  } __randomize_layout;
>  
>  static inline bool is_sync_kiocb(struct kiocb *kiocb)
> @@ -1927,12 +1927,22 @@ static inline enum rw_hint file_write_hint(struct file *file)
>  
>  static inline int iocb_flags(struct file *file);
>  
> +static inline u16 ki_hint_validate(enum rw_hint hint)
> +{
> + typeof(((struct kiocb *)0)->ki_hint) max_hint = -1;

This looks complex to me. It would force a reader to look back at what
datatype ki_hint is. I'd prefer to declare it as u16 max_hint = -1; even
the previous #define MAX_KI_HINT format is easier to read. Just a matter
of reading style you are comfortable with, though.


-- 
Goldwyn


[PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Bart Van Assche
Recently the blk-mq timeout handling code was reworked. See also Tejun
Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
This patch reworks the blk-mq timeout handling code again. The timeout
handling code is simplified by introducing a state machine per request.
This change prevents the blk-mq timeout handling code from ignoring
completions that occur after blk_mq_check_expired() has been called and
before blk_mq_rq_timed_out() has been called.

Fix this race as follows:
- Reduce the gstate field from 64 to 32 bits such that cmpxchg() can
  be used to update it. Introduce deadline_seq for updating the deadline
  on 32-bit systems.
- Remove the request member variables that became superfluous due to
  this change, namely gstate_seq and aborted_gstate_sync.
- Remove the request state information that became superfluous due to
  this patch, namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the code that became superfluous due to this change, namely
  the RCU lock and unlock statements in blk_mq_complete_request() and
  also the synchronize_rcu() call in the timeout handler.

Notes:
- Atomic instructions are only used to update the request state if
  a concurrent request state change could be in progress.
- blk_add_timer() has been split into two functions - one for the
  legacy block layer and one for blk-mq.

Signed-off-by: Bart Van Assche 
Cc: Tejun Heo 
Cc: Christoph Hellwig 
Cc: Keith Busch 
Cc: Jianchao Wang 
Cc: Ming Lei 
Cc: Sebastian Ott 
Cc: Sagi Grimberg 
Cc: Israel Rukshin ,
Cc: Max Gurtovoy 
---

Changes compared to v12:
- Switched from cmpxchg64() to cmpxchg(). This became possible because the
  deadline is now updated before the request state.
- Introduced a new request state to ensure that completions that occur while
  the timeout function is in progress are not lost.
- Left out the ARM cmpxchg64() patch.

Changes compared to v11:
- Reworked patch 1/2: instead of introducing CONFIG_ARCH_HAVE_CMPXCHG64, make
  sure that cmpxchg64() is only defined if it can be used.

Changes compared to v10:
- In patch 1/2, added "default y if 64BIT" to the "config ARCH_HAVE_CMPXCHG64"
  entry in arch/Kconfig. Left out the "select ARCH_HAVE_CMPXCHG64" statements
  that became superfluous due to this change (alpha, arm64, powerpc and s390).
- Also in patch 1/2, only select ARCH_HAVE_CMPXCHG64 if X86_CMPXCHG64 has been
  selected.
- In patch 2/2, moved blk_mq_change_rq_state() from blk-mq.h to blk-mq.c.
- Added a comment header above __blk_mq_requeue_request() and
  blk_mq_requeue_request().
- Documented the MQ_RQ_* state transitions in block/blk-mq.h.
- Left out the fourth argument of blk_mq_rq_set_deadline().

Changes compared to v9:
- Addressed multiple comments related to patch 1/2: added
  CONFIG_ARCH_HAVE_CMPXCHG64 for riscv, modified
  features/locking/cmpxchg64/arch-support.txt as requested and made the
  order of the symbols in the arch/*/Kconfig alphabetical where possible.

Changes compared to v8:
- Split into two patches.
- Moved the spin_lock_init() call from blk_mq_rq_ctx_init() into
  blk_mq_init_request().
- Fixed the deadline set by blk_add_timer().
- Surrounded the das_lock member with #ifndef CONFIG_ARCH_HAVE_CMPXCHG64 /
  #endif.

Changes compared to v7:
- Fixed the generation number mechanism. Note: with this patch applied the
  behavior of the block layer does not depend on the generation number.
- Added more 32-bit architectures to the list of architectures on which
  cmpxchg64() should not be used.

Changes compared to v6:
- Used a union instead of bit manipulations to store multiple values into
  a single 64-bit field.
- Reduced the size of the timeout field from 64 to 32 bits.
- Made sure that the block layer still builds with this patch applied
  for the sh and mips architectures.
- Fixed two sparse warnings that were introduced by this patch in the
  WRITE_ONCE() calls.

Changes compared to v5:
- Restored the synchronize_rcu() call between marking a request for timeout
  handling and the actual timeout handling to avoid that timeout handling
  starts while .queue_rq() is still in progress if the timeout is very short.
- Only use cmpxchg() if another context could attempt to change the request
  state concurrently. Use WRITE_ONCE() otherwise.

Changes compared to v4:
- Addressed multiple review comments from Christoph. The most important are
  that atomic_long_cmpxchg() has been changed into cmpxchg() and also that
  there is now a nice and clean split between the legacy and blk-mq versions
  of blk_add_timer().
- Changed the patch name and modified the patch description because there is
  disagreement about whether or not the v4.16 blk-mq core can complete a
  single request twice. Kept the "Cc: 
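
A sketch of the cmpxchg()-based transition this changelog relies on
(simplified and assumed: in v13 the state is packed into the 32-bit
gstate field together with a generation, so the real compare-exchange
operates on the combined value):

static bool blk_mq_change_rq_state(struct request *rq, u32 old, u32 new)
{
	/* exactly one of several racing contexts (normal completion,
	 * timeout handling) can win the old -> new transition */
	return cmpxchg(&rq->gstate, old, new) == old;
}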

Re: [PATCH] block: kyber: make kyber more friendly with merging

2018-05-22 Thread Jens Axboe
On 5/22/18 10:17 AM, Holger Hoffstätte wrote:
> On 05/22/18 16:48, Jianchao Wang wrote:
>> Currently, kyber is very unfriendly to merging. kyber depends
>> on the ctx rq_list to do merging; however, most of the time it
>> will not leave any requests in the ctx rq_list. This is because
>> even if the tokens of one domain are used up, kyber will try to
>> dispatch requests from the other domains and flush the rq_list
>> there.
>>
>> To improve this, we set up a kyber_ctx_queue (kcq), which is
>> similar to a ctx but has one rq_list per domain, and build the
>> same mapping between kcq and khd as between ctx and hctx. Then we
>> can merge, insert and dispatch for each domain separately. If one
>> domain's tokens are used up, requests can be left in that domain's
>> rq_list and possibly merged with following IO.
>>
>> Following are my test results on a machine with 8 cores and an
>> NVMe card (INTEL SSDPEKKR128G7):
>>
>> fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
>> seq/random
>> +------+----------+-----------+------------+-----------------+---------+
>> |patch?| bw(MB/s) |   iops    | slat(usec) |   clat(usec)    |  merge  |
>> +------+----------+-----------+------------+-----------------+---------+
>> | w/o  |  606/612 | 151k/153k |  6.89/7.03 | 3349.21/3305.40 |   0/0   |
>> +------+----------+-----------+------------+-----------------+---------+
>> | w/   | 1083/616 | 277k/154k |  4.93/6.95 | 1830.62/3279.95 | 223k/3k |
>> +------+----------+-----------+------------+-----------------+---------+
>> With numjobs set to 16, bw and iops reach 1662MB/s and 425k on my
>> platform.
>>
>> Signed-off-by: Jianchao Wang 
> 
> 
> 
> This looks great but prevents kyber from being built as module,
> which is AFAIK supposed to work (and works now):
> 
> ..
>   CC [M]  block/kyber-iosched.o
>   Building modules, stage 2.
>   MODPOST 313 modules
> ERROR: "bio_attempt_back_merge" [block/kyber-iosched.ko] undefined!
> ERROR: "bio_attempt_front_merge" [block/kyber-iosched.ko] undefined!
> ERROR: "bio_attempt_discard_merge" [block/kyber-iosched.ko] undefined!
> ERROR: "blk_try_merge" [block/kyber-iosched.ko] undefined!
> ERROR: "blk_rq_merge_ok" [block/kyber-iosched.ko] undefined!
> ..
> 
> It does build fine when compiled in, obviously. :)

It's basically duplicating the contents of blk_attempt_plug_merge().
I would suggest abstracting out the list loop and merge check into a
helper that could then be called from both kyber and the plug merge
function, as sketched below.

-- 
Jens Axboe
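
A sketch of the helper Jens is suggesting (name and signature
hypothetical): lift the list walk and merge attempt out of
blk_attempt_plug_merge() so kyber could call it on each per-domain
rq_list.

static bool blk_bio_list_merge(struct request_queue *q,
			       struct list_head *list, struct bio *bio)
{
	struct request *rq;
	bool merged = false;

	list_for_each_entry_reverse(rq, list, queuelist) {
		if (!blk_rq_merge_ok(rq, bio))
			continue;

		switch (blk_try_merge(rq, bio)) {
		case ELEVATOR_BACK_MERGE:
			merged = bio_attempt_back_merge(q, rq, bio);
			break;
		case ELEVATOR_FRONT_MERGE:
			merged = bio_attempt_front_merge(q, rq, bio);
			break;
		case ELEVATOR_DISCARD_MERGE:
			merged = bio_attempt_discard_merge(q, rq, bio);
			break;
		default:
			continue;
		}
		return merged;
	}
	return false;
}

This would also resolve the module build errors above: kyber would use
one exported helper in the block core instead of five unexported
symbols.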



Re: [PATCH] block: kyber: make kyber more friendly with merging

2018-05-22 Thread Holger Hoffstätte
On 05/22/18 16:48, Jianchao Wang wrote:
> Currently, kyber is very unfriendly to merging. kyber depends
> on the ctx rq_list to do merging; however, most of the time it
> will not leave any requests in the ctx rq_list. This is because
> even if the tokens of one domain are used up, kyber will try to
> dispatch requests from the other domains and flush the rq_list
> there.
>
> To improve this, we set up a kyber_ctx_queue (kcq), which is
> similar to a ctx but has one rq_list per domain, and build the
> same mapping between kcq and khd as between ctx and hctx. Then we
> can merge, insert and dispatch for each domain separately. If one
> domain's tokens are used up, requests can be left in that domain's
> rq_list and possibly merged with following IO.
>
> Following are my test results on a machine with 8 cores and an
> NVMe card (INTEL SSDPEKKR128G7):
>
> fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
> seq/random
> +------+----------+-----------+------------+-----------------+---------+
> |patch?| bw(MB/s) |   iops    | slat(usec) |   clat(usec)    |  merge  |
> +------+----------+-----------+------------+-----------------+---------+
> | w/o  |  606/612 | 151k/153k |  6.89/7.03 | 3349.21/3305.40 |   0/0   |
> +------+----------+-----------+------------+-----------------+---------+
> | w/   | 1083/616 | 277k/154k |  4.93/6.95 | 1830.62/3279.95 | 223k/3k |
> +------+----------+-----------+------------+-----------------+---------+
> With numjobs set to 16, bw and iops reach 1662MB/s and 425k on my
> platform.
> 
> Signed-off-by: Jianchao Wang 



This looks great but prevents kyber from being built as module,
which is AFAIK supposed to work (and works now):

..
  CC [M]  block/kyber-iosched.o
  Building modules, stage 2.
  MODPOST 313 modules
ERROR: "bio_attempt_back_merge" [block/kyber-iosched.ko] undefined!
ERROR: "bio_attempt_front_merge" [block/kyber-iosched.ko] undefined!
ERROR: "bio_attempt_discard_merge" [block/kyber-iosched.ko] undefined!
ERROR: "blk_try_merge" [block/kyber-iosched.ko] undefined!
ERROR: "blk_rq_merge_ok" [block/kyber-iosched.ko] undefined!
..

It does build fine when compiled in, obviously. :)

cheers,
Holger
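
For reference, the per-ctx structure Jianchao's description implies
looks roughly like this (a sketch; field names are assumed, not copied
from the patch):

struct kyber_ctx_queue {
	spinlock_t lock;	/* protects the per-domain lists */
	/* one request list per scheduling domain, so requests whose
	 * domain tokens are exhausted can wait without being flushed */
	struct list_head rq_list[KYBER_NUM_DOMAINS];
};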


Re: [PATCH v2 14/26] ibtrs: include client and server modules into kernel compilation

2018-05-22 Thread Roman Penyaev
On Tue, May 22, 2018 at 3:18 PM, Leon Romanovsky  wrote:
> On Tue, May 22, 2018 at 11:27:21AM +0200, Roman Penyaev wrote:
>> On Tue, May 22, 2018 at 7:05 AM, Leon Romanovsky  wrote:
>> > On Fri, May 18, 2018 at 03:04:01PM +0200, Roman Pen wrote:
>> >> Add IBTRS Makefile, Kconfig and also corresponding lines into upper
>> >> layer infiniband/ulp files.
>> >>
>> >> Signed-off-by: Roman Pen 
>> >> Signed-off-by: Danil Kipnis 
>> >> Cc: Jack Wang 
>> >> ---
>> >>  drivers/infiniband/Kconfig|  1 +
>> >>  drivers/infiniband/ulp/Makefile   |  1 +
>> >>  drivers/infiniband/ulp/ibtrs/Kconfig  | 20 
>> >>  drivers/infiniband/ulp/ibtrs/Makefile | 15 +++
>> >>  4 files changed, 37 insertions(+)
>> >>  create mode 100644 drivers/infiniband/ulp/ibtrs/Kconfig
>> >>  create mode 100644 drivers/infiniband/ulp/ibtrs/Makefile
>> >>
>> >> diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
>> >> index ee270e065ba9..787bd286fb08 100644
>> >> --- a/drivers/infiniband/Kconfig
>> >> +++ b/drivers/infiniband/Kconfig
>> >> @@ -94,6 +94,7 @@ source "drivers/infiniband/ulp/srpt/Kconfig"
>> >>
>> >>  source "drivers/infiniband/ulp/iser/Kconfig"
>> >>  source "drivers/infiniband/ulp/isert/Kconfig"
>> >> +source "drivers/infiniband/ulp/ibtrs/Kconfig"
>> >>
>> >>  source "drivers/infiniband/ulp/opa_vnic/Kconfig"
>> >>  source "drivers/infiniband/sw/rdmavt/Kconfig"
>> >> diff --git a/drivers/infiniband/ulp/Makefile b/drivers/infiniband/ulp/Makefile
>> >> index 437813c7b481..1c4f10dc8d49 100644
>> >> --- a/drivers/infiniband/ulp/Makefile
>> >> +++ b/drivers/infiniband/ulp/Makefile
>> >> @@ -5,3 +5,4 @@ obj-$(CONFIG_INFINIBAND_SRPT) += srpt/
>> >>  obj-$(CONFIG_INFINIBAND_ISER)+= iser/
>> >>  obj-$(CONFIG_INFINIBAND_ISERT)   += isert/
>> >>  obj-$(CONFIG_INFINIBAND_OPA_VNIC)+= opa_vnic/
>> >> +obj-$(CONFIG_INFINIBAND_IBTRS)   += ibtrs/
>> >> diff --git a/drivers/infiniband/ulp/ibtrs/Kconfig b/drivers/infiniband/ulp/ibtrs/Kconfig
>> >> new file mode 100644
>> >> index ..eaeb8f3f6b4e
>> >> --- /dev/null
>> >> +++ b/drivers/infiniband/ulp/ibtrs/Kconfig
>> >> @@ -0,0 +1,20 @@
>> >> +config INFINIBAND_IBTRS
>> >> + tristate
>> >> + depends on INFINIBAND_ADDR_TRANS
>> >> +
>> >> +config INFINIBAND_IBTRS_CLIENT
>> >> + tristate "IBTRS client module"
>> >> + depends on INFINIBAND_ADDR_TRANS
>> >> + select INFINIBAND_IBTRS
>> >> + help
>> >> +   IBTRS client allows for simplified data transfer and connection
>> >> +   establishment over RDMA (InfiniBand, RoCE, iWarp). Uses BIO-like
>> >> +   READ/WRITE semantics and provides multipath capabilities.
>> >> +
>> >> +config INFINIBAND_IBTRS_SERVER
>> >> + tristate "IBTRS server module"
>> >> + depends on INFINIBAND_ADDR_TRANS
>> >> + select INFINIBAND_IBTRS
>> >> + help
>> >> +   IBTRS server module processing connection and IO requests received
>> >> +   from the IBTRS client module.
>> >> diff --git a/drivers/infiniband/ulp/ibtrs/Makefile b/drivers/infiniband/ulp/ibtrs/Makefile
>> >> new file mode 100644
>> >> index ..e6ea858745ad
>> >> --- /dev/null
>> >> +++ b/drivers/infiniband/ulp/ibtrs/Makefile
>> >> @@ -0,0 +1,15 @@
>> >> +ibtrs-client-y := ibtrs-clt.o \
>> >> +   ibtrs-clt-stats.o \
>> >> +   ibtrs-clt-sysfs.o
>> >> +
>> >> +ibtrs-server-y := ibtrs-srv.o \
>> >> +   ibtrs-srv-stats.o \
>> >> +   ibtrs-srv-sysfs.o
>> >> +
>> >> +ibtrs-core-y := ibtrs.o
>> >> +
>> >> +obj-$(CONFIG_INFINIBAND_IBTRS)+= ibtrs-core.o
>> >
>> > Will it build ibtrs-core in case both server and client are disabled in .config?
>>
>> No, CONFIG_INFINIBAND_IBTRS is selected/deselected by
>> CONFIG_INFINIBAND_IBTRS_CLIENT or CONFIG_INFINIBAND_IBTRS_SERVER,
>> when you choose them in kconfig.
>>
>
> Thanks
>
>>
>> >> +obj-$(CONFIG_INFINIBAND_IBTRS_CLIENT) += ibtrs-client.o
>> >> +obj-$(CONFIG_INFINIBAND_IBTRS_SERVER) += ibtrs-server.o
>> >> +
>> >> +-include $(src)/compat/compat.mk
>> >
>> > What is this?
>>
>> Well, in our production we use the same source code, and in order not to
>> spoil the sources with 'ifdef' macros for different kernel versions I use
>> a compat layer, which obviously will never go upstream.  This line is the
>> only clean way to keep the sources always up-to-date with the latest
>> kernel and still be compatible with what we have on our servers in
>> production.
>>
>> The '-' prefix at the beginning of the line tells make to ignore the
>> include if the file does not exist, so it should not raise any error when
>> compiling against the latest kernel.
>>
>> Here is an example of the compat layer for IBNBD block device:
>> https://github.com/profitbricks/ibnbd/tree/master/ibnbd/compat
>
> I see it, you will need to remove this line from the 
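
For readers unfamiliar with the make idiom Roman describes: plain
'include' fails the build when the file is missing, while '-include'
skips it silently, which is what keeps the optional compat layer
harmless for upstream builds. The line in question:

# optional include: no error if compat.mk does not exist
-include $(src)/compat/compat.mk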

Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-05-22 Thread Christoph Hellwig
Hi Keith,

I like this series a lot.  One comment that is probably close
to the big discussion in the thread:

>   switch (ret) {
>   case BLK_EH_HANDLED:
>   /*
> +  * If the request is still in flight, the driver is requesting
> +  * blk-mq complete it.
>*/
> + if (blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT)
> + __blk_mq_complete_request(req);
> + break;

The state check here really irked me, and from the thread it seems like
I'm not the only one.  At least for the NVMe case I think it is perfectly
safe, although I agree I'd rather audit what other drivers do carefully.

That being said I think BLK_EH_HANDLED seems like a fundamentally broken
idea, and I'd actually prefer to get rid of it over adding things like
the MQ_RQ_IN_FLIGHT check above.

E.g. if we look at the cases where nvme-pci returns it:

 - if we did call nvme_dev_disable, we already canceled all requests,
   so we might as well just return BLK_EH_NOT_HANDLED
 - the poll for completion case already completed the command,
   so we should return BLK_EH_NOT_HANDLED

So I think we need to fix up nvme and if needed any other driver
to return the right value and then assert that the request is
still in in-flight status for the BLK_EH_HANDLED case.
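
A sketch of that assertion (not code from this series):

	case BLK_EH_HANDLED:
		/* drivers may only return BLK_EH_HANDLED for a request
		 * they have not completed through any other path */
		WARN_ON_ONCE(blk_mq_rq_state(req) != MQ_RQ_IN_FLIGHT);
		__blk_mq_complete_request(req);
		break;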

> @@ -124,16 +119,7 @@ static inline int blk_mq_rq_state(struct request *rq)
>  static inline void blk_mq_rq_update_state(struct request *rq,
> enum mq_rq_state state)
>  {
> + WRITE_ONCE(rq->state, state);
>  }

I think this helper can go away now.  But we should have a comment
near the state field documenting the concurrency implications.



> + u64 state;

This should probably be a mq_rq_state instead.  Which means it needs
to be moved to blkdev.h, but that should be ok.
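
The change asked for here amounts to something like the following
sketch (the enum and its values already exist in block/blk-mq.h; only
their location and the field type would change):

enum mq_rq_state {
	MQ_RQ_IDLE	= 0,
	MQ_RQ_IN_FLIGHT	= 1,
	MQ_RQ_COMPLETE	= 2,
};

struct request {
	/* ... */
	enum mq_rq_state state;	/* accessed with READ_ONCE()/WRITE_ONCE() */
};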


Re: [PATCH v6 2/5] fs: Convert kiocb rw_hint from enum to u16

2018-05-22 Thread Adam Manzanares


On 5/22/18 8:32 AM, Jens Axboe wrote:
> On 5/22/18 9:07 AM, adam.manzana...@wdc.com wrote:
>> From: Adam Manzanares 
>>
>> In order to avoid kiocb bloat for per command iopriority support, rw_hint
>> is converted from enum to a u16. Added a guard around ki_hint assignment.
>>
>> Signed-off-by: Adam Manzanares 
>> Reviewed-by: Christoph Hellwig 
>> ---
>>   include/linux/fs.h | 13 +++--
>>   1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 7f07977bdfd7..50de40dbbb85 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -284,6 +284,8 @@ enum rw_hint {
>>  WRITE_LIFE_EXTREME  = RWH_WRITE_LIFE_EXTREME,
>>   };
>>   
>> +#define MAX_KI_HINT ((1 << 16) - 1) /* ki_hint type is u16 */
> 
> Instead of having to do this and now rely on those now being synced,
> how about something like the below.
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 760d8da1b6c7..070438d0b62d 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -299,7 +299,7 @@ struct kiocb {
>   void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
>   void*private;
>   int ki_flags;
> - enum rw_hintki_hint;
> + u16 ki_hint;
>   } __randomize_layout;
>   
>   static inline bool is_sync_kiocb(struct kiocb *kiocb)
> @@ -1927,12 +1927,22 @@ static inline enum rw_hint file_write_hint(struct file *file)
>   
>   static inline int iocb_flags(struct file *file);
>   
> +static inline u16 ki_hint_validate(enum rw_hint hint)
> +{
> + typeof(((struct kiocb *)0)->ki_hint) max_hint = -1;
> +
> + if (hint <= max_hint)
> + return hint;
> +
> + return 0;
> +}
> +
>   static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
>   {
>   *kiocb = (struct kiocb) {
>   .ki_filp = filp,
>   .ki_flags = iocb_flags(filp),
> - .ki_hint = file_write_hint(filp),
> + .ki_hint = ki_hint_validate(file_write_hint(filp)),
>   };
>   }

Looks good. I'll resubmit.

>   
> 

Re: [PATCH v6 2/5] fs: Convert kiocb rw_hint from enum to u16

2018-05-22 Thread Jens Axboe
On 5/22/18 9:07 AM, adam.manzana...@wdc.com wrote:
> From: Adam Manzanares 
> 
> In order to avoid kiocb bloat for per command iopriority support, rw_hint
> is converted from enum to a u16. Added a guard around ki_hint assignment.
> 
> Signed-off-by: Adam Manzanares 
> Reviewed-by: Christoph Hellwig 
> ---
>  include/linux/fs.h | 13 +++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7f07977bdfd7..50de40dbbb85 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -284,6 +284,8 @@ enum rw_hint {
>   WRITE_LIFE_EXTREME  = RWH_WRITE_LIFE_EXTREME,
>  };
>  
> +#define MAX_KI_HINT  ((1 << 16) - 1) /* ki_hint type is u16 */

Instead of having to do this and now rely on those now being synced,
how about something like the below.

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 760d8da1b6c7..070438d0b62d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -299,7 +299,7 @@ struct kiocb {
void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
void*private;
int ki_flags;
-   enum rw_hintki_hint;
+   u16 ki_hint;
 } __randomize_layout;
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
@@ -1927,12 +1927,22 @@ static inline enum rw_hint file_write_hint(struct file *file)
 
 static inline int iocb_flags(struct file *file);
 
+static inline u16 ki_hint_validate(enum rw_hint hint)
+{
+   typeof(((struct kiocb *)0)->ki_hint) max_hint = -1;
+
+   if (hint <= max_hint)
+   return hint;
+
+   return 0;
+}
+
 static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
 {
*kiocb = (struct kiocb) {
.ki_filp = filp,
.ki_flags = iocb_flags(filp),
-   .ki_hint = file_write_hint(filp),
+   .ki_hint = ki_hint_validate(file_write_hint(filp)),
};
 }
 

-- 
Jens Axboe



[PATCH] blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffers

2018-05-22 Thread Bart Van Assche
Avoid complaints similar to the following appearing in the kernel log
when the number of zones is sufficiently large:

  fio: page allocation failure: order:9, mode:0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
  Call Trace:
  dump_stack+0x63/0x88
  warn_alloc+0xf5/0x190
  __alloc_pages_slowpath+0x8f0/0xb0d
  __alloc_pages_nodemask+0x242/0x260
  alloc_pages_current+0x6a/0xb0
  kmalloc_order+0x18/0x50
  kmalloc_order_trace+0x26/0xb0
  __kmalloc+0x20e/0x220
  blkdev_report_zones_ioctl+0xa5/0x1a0
  blkdev_ioctl+0x1ba/0x930
  block_ioctl+0x41/0x50
  do_vfs_ioctl+0xaa/0x610
  SyS_ioctl+0x79/0x90
  do_syscall_64+0x79/0x1b0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls")
Signed-off-by: Bart Van Assche 
Cc: Shaun Tancheff 
Cc: Damien Le Moal 
Cc: Christoph Hellwig 
Cc: Martin K. Petersen 
Cc: Hannes Reinecke 
Cc: 
---
 block/blk-zoned.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 20bfc37e1852..92e6108487c4 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -328,7 +328,11 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
if (!rep.nr_zones)
return -EINVAL;
 
-   zones = kcalloc(rep.nr_zones, sizeof(struct blk_zone), GFP_KERNEL);
+   if (rep.nr_zones > INT_MAX / sizeof(struct blk_zone))
+   return -ERANGE;
+
+   zones = kvmalloc(rep.nr_zones * sizeof(struct blk_zone),
+   GFP_KERNEL | __GFP_ZERO);
if (!zones)
return -ENOMEM;
 
@@ -350,7 +354,7 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
}
 
  out:
-   kfree(zones);
+   kvfree(zones);
 
return ret;
 }
-- 
2.16.3
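
A side note on the allocation call: kvmalloc_array() folds the same
overflow check into the allocation itself, so an equivalent (hedged)
formulation of the first hunk above would be:

	/* kvmalloc_array() checks the n * size multiplication for
	 * overflow internally; the explicit INT_MAX test would still
	 * be useful as an upper bound on the allocation size */
	zones = kvmalloc_array(rep.nr_zones, sizeof(struct blk_zone),
			       GFP_KERNEL | __GFP_ZERO);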


