Re: [PATCH 1/3] blk-mq: unify hctx delayed_run_work and run_work

2017-04-14 Thread Bart Van Assche
On Fri, 2017-04-14 at 14:02 -0600, Jens Axboe wrote:
> I was waiting for further comments on patch 3/3.

Hello Jens,

Patch 3/3 is probably fine, but I hope you understand that the introduction
of a new race condition does not make me enthusiastic. Should your explanation of
why that race is harmless perhaps be added as a comment?

Bart.

Re: [PATCH block-tree] net: off by one in inet6_pton()

2017-04-14 Thread Jens Axboe
On 04/13/2017 01:42 PM, Dan Carpenter wrote:
> If "scope_len" is sizeof(scope_id) then we would put the NUL terminator
> one space beyond the end of the buffer.

Added, thanks Dan.

-- 
Jens Axboe



Re: [PATCH v4 0/5] blk-mq: Kyber multiqueue I/O scheduler

2017-04-14 Thread Jens Axboe
On 04/14/2017 01:59 AM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> This is v4 of Kyber, an I/O scheduler for multiqueue devices combining
> several techniques: the scalable bitmap library, the new blk-stats API,
> and queue depth throttling similar to blk-wbt. v1 was here [1], v2 was
> here [2], v3 was here [3].
> 
> v4 fixes a hang in v3 caused by a race condition in the wait queue
> handling in kyber_get_domain_token().
> 
> This series is based on block/for-next. Patches 1 and 2 implement a new
> sbitmap operation. Patch 3 exports a couple of helpers. Patch 4 moves a
> scheduler callback to somewhere more useful. Patch 5 implements the new
> scheduler.

Added for 4.12, thanks Omar.

-- 
Jens Axboe



Re: [PATCH 1/3] blk-mq: unify hctx delayed_run_work and run_work

2017-04-14 Thread Jens Axboe
On 04/11/2017 12:00 PM, Bart Van Assche wrote:
> On Mon, 2017-04-10 at 09:54 -0600, Jens Axboe wrote:
>>  void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
>>  {
>> -cancel_work(&hctx->run_work);
>> +cancel_delayed_work(&hctx->run_work);
>>  cancel_delayed_work(&hctx->delay_work);
>>  set_bit(BLK_MQ_S_STOPPED, &hctx->state);
>>  }
> 
> Hello Jens,
> 
> I would like to change the above cancel_*work() calls into cancel_*work_sync()
> calls because this code is used when e.g. switching between I/O schedulers and
> no .queue_rq() calls must be ongoing while switching between schedulers. Do 
> you
> want to integrate that change into this patch or do you want me to post a
> separate patch? In the latter case, should I start from your for-next branch
> to develop that patch or from your for-next branch + this patch series?

I agree, we should make it _sync(). I'll just make the edit in the patch
when I send it out again. I was waiting for further comments on patch 3/3.

-- 
Jens Axboe



Re: [PATCH] remove the mg_disk driver

2017-04-14 Thread Jens Axboe
On 04/06/2017 05:28 AM, Christoph Hellwig wrote:
> This driver was added in 2008, but as far as I can tell we never had a
> single platform that actually registered resources for the platform driver.
> 
> It's also been unmaintained for a long time and apparently has an ATA mode
> that can be driven using the IDE/libata subsystem.

Added for 4.12.

-- 
Jens Axboe



Re: [PATCH] block: fix bio_will_gap()

2017-04-14 Thread Jens Axboe
On 04/14/2017 05:26 AM, Johannes Thumshirn wrote:
> On Thu, Apr 13, 2017 at 6:06 PM, Ming Lei  wrote:
>> Now commit 729204ef49ec ("block: relax check on sg gap")
>> allows us to merge bios if both are physically contiguous.
>> This change can merge a huge number of small bios in use cases
>> such as mkfs; for example, mkfs.ntfs running time can be decreased to ~1/10.
>>
>> But if one rq starts with a non-aligned buffer (the 1st bvec's
>> bv_offset isn't zero) and we allow the merge, it is quite
>> difficult to respect the sg gap limit; in particular the segment
>> can't be at the maximum segment size, because otherwise the segment
>> would end at an unaligned virt boundary. This patch tries to avoid the
>> issue by not allowing the merge if the req starts with a non-aligned
>> buffer.
>>
>> Also add comments to explain why the merged segment can't
>> end at an unaligned virt boundary.
>>
>> Fixes: 729204ef49ec ("block: relax check on sg gap")
>> Tested-by: Johannes Thumshirn 
>> Reviewed-by: Johannes Thumshirn 
>> Signed-off-by: Ming Lei 
>> ---
> 
> 
> Hi Jens and Omar,
> 
> I know Jens is currently on vacation, but could Omar maybe get this fix to Linus
> before v4.11? This would be highly appreciated.

I'll pick this up now. I'm back early next week, so we're still fine for
4.11. But I'm glad I pushed back on the original change for 4.10 now.

-- 
Jens Axboe



RE: Outstanding MQ questions from MMC

2017-04-14 Thread Avri Altman


> 
> 2. Turn RPMB and other ioctl() MMC operations into mmc_queue_req
>    things and funnel them into the block scheduler
>    using REQ_OP_DRV_IN/OUT requests.
> 

Accessing the RPMB is done via a strange protocol, in which each access
comprises several requests.
For example, writing to the RPMB requires sending 5 different requests:
2 requests to read the write counter, and then 3 more requests for the write
operation itself.

Once the sequence has started, it must not be interfered with by other
requests, or the operation will fail.

So, if you are looking to eliminate the host lock and rely solely on the blk
queue to regulate access to the device, you'll need some barrier mechanism
to ensure that such a sequence of requests executes as an atomic unit.

But then again, doesn't that contradict the very idea of a multi-queue?
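
Just to illustrate what I mean by an atomic unit (all helper names below
are made up for the example, they are not real MMC core APIs), a single
RPMB write would have to go down roughly like this:

static int rpmb_write_one_block(struct rpmb_dev *dev, const void *frame)
{
	int err;

	rpmb_sequence_begin(dev);	/* the barrier/claim discussed above */

	/* Requests 1-2: read the current write counter. */
	err = rpmb_send(dev, RPMB_GET_WRITE_COUNTER_REQ);
	if (!err)
		err = rpmb_send(dev, RPMB_GET_WRITE_COUNTER_RESP);

	/* Requests 3-5: authenticated data write, result register read
	 * request, and the result response. */
	if (!err)
		err = rpmb_send_frame(dev, RPMB_AUTH_DATA_WRITE_REQ, frame);
	if (!err)
		err = rpmb_send(dev, RPMB_RESULT_READ_REQ);
	if (!err)
		err = rpmb_send(dev, RPMB_RESULT_READ_RESP);

	rpmb_sequence_end(dev);		/* other requests may run again */
	return err;
}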

Cheers,
Avri


Re: [PATCH 1/6] blk-mq: Do not invoke queue operations on a dead queue

2017-04-14 Thread Bart Van Assche
On Fri, 2017-04-14 at 10:13 -0700, Omar Sandoval wrote:
> Thanks, Bart. In this case, the absence of the "mq" directory should
> tell you that the queue is dead. Will you move the cleanup in v2 or
> should I submit a separate patch?

Hello Omar,

I will include a patch in v2 of this patch series that unregisters the
debugfs attributes earlier.

Bart.

Re: [PATCH 1/6] blk-mq: Do not invoke queue operations on a dead queue

2017-04-14 Thread Omar Sandoval
On Fri, Apr 14, 2017 at 04:12:01PM +, Bart Van Assche wrote:
> On Fri, 2017-04-14 at 00:40 -0700, Omar Sandoval wrote:
> > On Thu, Apr 13, 2017 at 11:05:32PM +, Bart Van Assche wrote:
> > > On Thu, 2017-04-13 at 16:01 -0700, Omar Sandoval wrote:
> > > > Looking at this, I think we have similar issues with most of the other
> > > > debugfs files. Should we move the debugfs cleanup earlier?
> > > 
> > > That's a good question. However, while I was debugging it was very 
> > > convenient
> > > to be able to access the queue state after it had reached the "dead" 
> > > state.
> > > Performing the cleanup earlier would be an alternative solution but would
> > > make debugging a bit harder ...
> > 
> > What useful information were you getting out of debugfs once the queue
> > was already dead? Wasn't the interesting stuff freed at that point?
> 
> Hello Omar,
> 
> I'm currently chasing a stall of dm-rq + dm-mpath that occurs after the
> queues below it have reached the "dead" state. I will look for another
> way to obtain the information I need such that we can remove the block
> layer queue debugfs information before these queues reach the "dead"
> state.
> 
> Bart.

Thanks, Bart. In this case, the absence of the "mq" directory should
tell you that the queue is dead. Will you move the cleanup in v2 or
should I submit a separate patch?


Re: [PATCH v4 6/6] dm rq: Avoid that request processing stalls sporadically

2017-04-14 Thread Bart Van Assche
On Fri, 2017-04-14 at 09:13 +0800, Ming Lei wrote:
> On Thu, Apr 13, 2017 at 09:59:57AM -0700, Bart Van Assche wrote:
> > On 04/12/17 19:20, Ming Lei wrote:
> > > On Wed, Apr 12, 2017 at 06:38:07PM +, Bart Van Assche wrote:
> > > > If the blk-mq core would always rerun a hardware queue if a block driver
> > > > returns BLK_MQ_RQ_QUEUE_BUSY then that would cause 100% of a single CPU 
> > > > core
> > > 
> > > It won't cause 100% CPU utilization since we restart the queue in the
> > > completion path, and at that time at least one tag is available, so progress
> > > can be made.
> > 
> > Hello Ming,
> > 
> > Sorry but you are wrong. If .queue_rq() returns BLK_MQ_RQ_QUEUE_BUSY
> > then it's likely that calling .queue_rq() again after only a few
> > microseconds will cause it to return BLK_MQ_RQ_QUEUE_BUSY again. If you
> > don't believe me, change "if (!blk_mq_sched_needs_restart(hctx) &&
> > !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state)) blk_mq_run_hw_queue(hctx,
> > true);" into "blk_mq_run_hw_queue(hctx, true);", trigger a busy
> 
> Yes, that can be true, but I mean it is still OK to run the queue again
> with
> 
>   if (!blk_mq_sched_needs_restart(hctx) &&
>   !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state))
>   blk_mq_run_hw_queue(hctx, true);
> 
> and restarting queue in __blk_mq_finish_request() when
> BLK_MQ_RQ_QUEUE_BUSY is returned from .queue_rq(). And both are in current
> blk-mq implementation.
> 
> Then why do we need blk_mq_delay_run_hw_queue(hctx, 100/*ms*/) in dm?

Because if dm_mq_queue_rq() returns BLK_MQ_RQ_QUEUE_BUSY there is no
guarantee that __blk_mq_finish_request() will be called later on for the
same queue. dm_mq_queue_rq() can e.g. return BLK_MQ_RQ_QUEUE_BUSY while no
dm requests are in progress because the SCSI error handler is active for
all underlying paths. See also scsi_lld_busy() and scsi_host_in_recovery().

Bart.

Re: [PATCH 1/6] blk-mq: Do not invoke queue operations on a dead queue

2017-04-14 Thread Bart Van Assche
On Fri, 2017-04-14 at 00:40 -0700, Omar Sandoval wrote:
> On Thu, Apr 13, 2017 at 11:05:32PM +, Bart Van Assche wrote:
> > On Thu, 2017-04-13 at 16:01 -0700, Omar Sandoval wrote:
> > > Looking at this, I think we have similar issues with most of the other
> > > debugfs files. Should we move the debugfs cleanup earlier?
> > 
> > That's a good question. However, while I was debugging it was very 
> > convenient
> > to be able to access the queue state after it had reached the "dead" state.
> > Performing the cleanup earlier would be an alternative solution but would
> > make debugging a bit harder ...
> 
> What useful information were you getting out of debugfs once the queue
> was already dead? Wasn't the interesting stuff freed at that point?

Hello Omar,

I'm currently chasing a stall of dm-rq + dm-mpath that occurs after the
queues below it have reached the "dead" state. I will look for another
way to obtain the information I need such that we can remove the block
layer queue debugfs information before these queues reach the "dead"
state.

Bart.

Re: [PATCH 5/6] blk-mq: Add blk_mq_ops.show_rq()

2017-04-14 Thread Bart Van Assche
On Thu, 2017-04-13 at 16:21 -0700, Omar Sandoval wrote:
> How about passing the seq_file to the callback instead of this
> arbitrarily-sized on-stack buffer?

Hello Omar,

That sounds like a good idea to me. I will make that change.

Bart.

Re: [PATCH] block: Make writeback throttling defaults consistent for SQ devices

2017-04-14 Thread Bart Van Assche
On Wed, 2017-04-12 at 10:23 +0200, Jan Kara wrote:
> +#ifndef CONFIG_BLK_WBT_MQ
> + if (q->mq_ops)
> + return;
> +#endif
> +#ifndef CONFIG_BLK_WBT_SQ
> + if (q->request_fn)
> + return;
> +#endif
> +
> + /*
> +  * If this fails, we don't get throttling
> +  */
> + wbt_init(q);

Hello Jan,

How about using positive logic to enable WBT, e.g. as follows? Wouldn't
that make the code easier to read?

	if ((IS_ENABLED(CONFIG_BLK_WBT_MQ) && q->mq_ops) ||
	    (IS_ENABLED(CONFIG_BLK_WBT_SQ) && q->request_fn))
		wbt_init(q);

> +static void deadline_registered_queue(struct request_queue *q)
> +{
> + wbt_enable_default(q);
> +}
> +
>  /*
>   * sysfs parts below
>   */
> @@ -445,6 +451,7 @@ static struct elevator_type iosched_deadline = {
>   .elevator_latter_req_fn =   elv_rb_latter_request,
>   .elevator_init_fn = deadline_init_queue,
>   .elevator_exit_fn = deadline_exit_queue,
> + .elevator_registered_fn =   deadline_registered_queue,
>   },
>  
>   .elevator_attrs = deadline_attrs,
> [ ... ]
> @@ -91,6 +92,11 @@ static void noop_exit_queue(struct elevator_queue *e)
>   kfree(nd);
>  }
>  
> +static void noop_registered_queue(struct request_queue *q)
> +{
> + wbt_enable_default(q);
> +}
> +
>  static struct elevator_type elevator_noop = {
>   .ops.sq = {
>   .elevator_merge_req_fn  = noop_merged_requests,
> @@ -100,6 +106,7 @@ static struct elevator_type elevator_noop = {
>   .elevator_latter_req_fn = noop_latter_request,
>   .elevator_init_fn   = noop_init_queue,
>   .elevator_exit_fn   = noop_exit_queue,
> + .elevator_registered_fn = noop_registered_queue,
>   },
>   .elevator_name = "noop",
>   .elevator_owner = THIS_MODULE,

This approach is not suited for blk-mq because with blk-mq "none" means no
scheduler and hence no struct elevator_type. Please consider not adding any
elevator_registered_fn() callbacks to noop and deadline but instead calling
wbt_enable_default() from elv_unregister_queue().
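
A rough sketch of that suggestion, purely as an illustration (based on the
4.11-era body of elv_unregister_queue(); block/elevator.c would also need to
include blk-wbt.h):

void elv_unregister_queue(struct request_queue *q)
{
	if (q) {
		struct elevator_queue *e = q->elevator;

		kobject_uevent(&e->kobj, KOBJ_REMOVE);
		kobject_del(&e->kobj);
		e->registered = 0;
		/* Switching away from this elevator: restore the default
		 * writeback throttling state. */
		wbt_enable_default(q);
	}
}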

Thanks,

Bart.

[PATCH 6/8] nowait aio: ext4

2017-04-14 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Return EAGAIN for direct I/O if any of the following holds:
  + i_rwsem is not immediately lockable
  + The write goes beyond the end of the file (will trigger allocation)
  + Blocks are not allocated at the write location

Signed-off-by: Goldwyn Rodrigues 
---
 fs/ext4/file.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index cefa9835f275..2efdc6d4d3e8 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -216,7 +216,13 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
*from)
return ext4_dax_write_iter(iocb, from);
 #endif
 
-   inode_lock(inode);
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   if (!inode_trylock(inode))
+   return -EAGAIN;
+   } else {
+   inode_lock(inode);
+   }
+
ret = ext4_write_checks(iocb, from);
if (ret <= 0)
goto out;
@@ -235,9 +241,15 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
*from)
 
iocb->private = &overwrite;
/* Check whether we do a DIO overwrite or not */
-   if (o_direct && ext4_should_dioread_nolock(inode) && !unaligned_aio &&
-   ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from)))
-   overwrite = 1;
+   if (o_direct && !unaligned_aio) {
+   if (ext4_overwrite_io(inode, iocb->ki_pos, 
iov_iter_count(from))) {
+   if (ext4_should_dioread_nolock(inode))
+   overwrite = 1;
+   } else if (iocb->ki_flags & IOCB_NOWAIT) {
+   ret = -EAGAIN;
+   goto out;
+   }
+   }
 
ret = __generic_file_write_iter(iocb, from);
inode_unlock(inode);
-- 
2.12.0



[PATCH 7/8] nowait aio: xfs

2017-04-14 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

If IOCB_NOWAIT is set, bail if the i_rwsem is not lockable
immediately.

If IOMAP_NOWAIT is set, return EAGAIN in xfs_file_iomap_begin
if it needs allocation, either due to file extension, writing to a hole,
or COW, or if it has to wait for other DIOs to finish.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/xfs/xfs_file.c  | 19 ++-
 fs/xfs/xfs_iomap.c | 17 +
 2 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 35703a801372..b307940e7d56 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -541,8 +541,11 @@ xfs_file_dio_aio_write(
iolock = XFS_IOLOCK_SHARED;
}
 
-   xfs_ilock(ip, iolock);
-
+   if (!xfs_ilock_nowait(ip, iolock)) {
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EAGAIN;
+   xfs_ilock(ip, iolock);
+   }
ret = xfs_file_aio_write_checks(iocb, from, &iolock);
if (ret)
goto out;
@@ -553,9 +556,15 @@ xfs_file_dio_aio_write(
 * otherwise demote the lock if we had to take the exclusive lock
 * for other reasons in xfs_file_aio_write_checks.
 */
-   if (unaligned_io)
-   inode_dio_wait(inode);
-   else if (iolock == XFS_IOLOCK_EXCL) {
+   if (unaligned_io) {
+   /* If we are going to wait for other DIO to finish, bail */
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   if (atomic_read(&inode->i_dio_count))
+   return -EAGAIN;
+   } else {
+   inode_dio_wait(inode);
+   }
+   } else if (iolock == XFS_IOLOCK_EXCL) {
xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
iolock = XFS_IOLOCK_SHARED;
}
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 288ee5b840d7..9baa65eeae9e 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1015,6 +1015,15 @@ xfs_file_iomap_begin(
 
if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
if (flags & IOMAP_DIRECT) {
+   /*
+* A reflinked inode will result in CoW alloc.
+* FIXME: It could still overwrite on unshared extents
+* and not need allocation.
+*/
+   if (flags & IOMAP_NOWAIT) {
+   error = -EAGAIN;
+   goto out_unlock;
+   }
/* may drop and re-acquire the ilock */
error = xfs_reflink_allocate_cow(ip, &imap, &shared,
&lockmode);
@@ -1032,6 +1041,14 @@ xfs_file_iomap_begin(
 
if ((flags & IOMAP_WRITE) && imap_needs_alloc(inode, &imap, nimaps)) {
/*
+* If nowait is set bail since we are going to make
+* allocations.
+*/
+   if (flags & IOMAP_NOWAIT) {
+   error = -EAGAIN;
+   goto out_unlock;
+   }
+   /*
 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
 * pages to keep the chunks of work done where somewhat 
symmetric
 * with the work writeback does. This is a completely arbitrary
-- 
2.12.0



[PATCH 3/8] nowait aio: return if direct write will trigger writeback

2017-04-14 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Find out if the write will trigger a wait due to writeback. If yes,
return -EAGAIN.

This introduces a new function filemap_range_has_page() which
returns true if the file's mapping has a page within the range
mentioned.

Return -EINVAL for buffered AIO: there are multiple causes of
delay such as page locks, dirty throttling logic, page loading
from disk etc. which cannot be taken care of.

Signed-off-by: Goldwyn Rodrigues 
---
 include/linux/fs.h |  2 ++
 mm/filemap.c   | 50 +++---
 2 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index e44de1c981a0..b14eab5daeb2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2514,6 +2514,8 @@ extern int filemap_fdatawait(struct address_space *);
 extern void filemap_fdatawait_keep_errors(struct address_space *);
 extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
   loff_t lend);
+extern int filemap_range_has_page(struct address_space *, loff_t lstart,
+  loff_t lend);
 extern int filemap_write_and_wait(struct address_space *mapping);
 extern int filemap_write_and_wait_range(struct address_space *mapping,
loff_t lstart, loff_t lend);
diff --git a/mm/filemap.c b/mm/filemap.c
index d51670b7fe6b..48b83d1d4a30 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -376,6 +376,39 @@ int filemap_flush(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_flush);
 
+/**
+ * filemap_range_has_page - check if a page exists in range.
+ * @mapping:   address space structure to wait for
+ * @start_byte:offset in bytes where the range starts
+ * @end_byte:  offset in bytes where the range ends (inclusive)
+ *
+ * Find at least one page in the range supplied, usually used to check if
+ * direct writing in this range will trigger a writeback.
+ */
+int filemap_range_has_page(struct address_space *mapping,
+   loff_t start_byte, loff_t end_byte)
+{
+   pgoff_t index = start_byte >> PAGE_SHIFT;
+   pgoff_t end = end_byte >> PAGE_SHIFT;
+   struct pagevec pvec;
+   int ret;
+
+   if (end_byte < start_byte)
+   return 0;
+
+   if (mapping->nrpages == 0)
+   return 0;
+
+   pagevec_init(&pvec, 0);
+   ret = pagevec_lookup(&pvec, mapping, index, 1);
+   if (!ret)
+   return 0;
+   ret = (pvec.pages[0]->index <= end);
+   pagevec_release(&pvec);
+   return ret;
+}
+EXPORT_SYMBOL(filemap_range_has_page);
+
 static int __filemap_fdatawait_range(struct address_space *mapping,
 loff_t start_byte, loff_t end_byte)
 {
@@ -2640,6 +2673,9 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, 
struct iov_iter *from)
 
pos = iocb->ki_pos;
 
+   if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
+   return -EINVAL;
+
if (limit != RLIM_INFINITY) {
if (iocb->ki_pos >= limit) {
send_sig(SIGXFSZ, current, 0);
@@ -2709,9 +2745,17 @@ generic_file_direct_write(struct kiocb *iocb, struct 
iov_iter *from)
write_len = iov_iter_count(from);
end = (pos + write_len - 1) >> PAGE_SHIFT;
 
-   written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 
1);
-   if (written)
-   goto out;
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   /* If there are pages to writeback, return */
+   if (filemap_range_has_page(inode->i_mapping, pos,
+  pos + iov_iter_count(from)))
+   return -EAGAIN;
+   } else {
+   written = filemap_write_and_wait_range(mapping, pos,
+   pos + write_len - 1);
+   if (written)
+   goto out;
+   }
 
/*
 * After a write we want buffered reads to be sure to go to disk to get
-- 
2.12.0



[PATCH 5/8] nowait aio: return on congested block device

2017-04-14 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

A new bio operation flag REQ_NOWAIT is introduced to identify bios
originating from an iocb with IOCB_NOWAIT. This flag indicates
that we should return immediately if a request cannot be made, instead
of retrying.

To facilitate this, QUEUE_FLAG_NOWAIT is set on devices
which support this. Currently it is set only for
virtio and sd. Support for more devices will be added soon,
once I am sure they don't block. Currently, block drivers such as dm/md
block while performing sync.

Signed-off-by: Goldwyn Rodrigues 
---
 block/blk-core.c   | 24 ++--
 block/blk-mq-sched.c   |  3 +++
 block/blk-mq.c |  4 
 drivers/block/virtio_blk.c |  3 +++
 drivers/scsi/sd.c  |  3 +++
 fs/direct-io.c | 10 --
 include/linux/bio.h|  6 ++
 include/linux/blk_types.h  |  2 ++
 include/linux/blkdev.h |  3 +++
 9 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d772c221cc17..54698521756b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1232,6 +1232,11 @@ static struct request *get_request(struct request_queue 
*q, unsigned int op,
if (!IS_ERR(rq))
return rq;
 
+   if (bio && (bio->bi_opf & REQ_NOWAIT)) {
+   blk_put_rl(rl);
+   return ERR_PTR(-EAGAIN);
+   }
+
if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) 
{
blk_put_rl(rl);
return rq;
@@ -1870,6 +1875,18 @@ generic_make_request_checks(struct bio *bio)
goto end_io;
}
 
+   if (bio->bi_opf & REQ_NOWAIT) {
+   if (!blk_queue_nowait(q)) {
+   err = -EOPNOTSUPP;
+   goto end_io;
+   }
+   if (!(bio->bi_opf & REQ_SYNC)) {
+   err = -EINVAL;
+   goto end_io;
+   }
+   }
+
+
part = bio->bi_bdev->bd_part;
if (should_fail_request(part, bio->bi_iter.bi_size) ||
should_fail_request(&part_to_disk(part)->part0,
@@ -2021,7 +2038,7 @@ blk_qc_t generic_make_request(struct bio *bio)
do {
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
-   if (likely(blk_queue_enter(q, false) == 0)) {
+   if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) {
struct bio_list lower, same;
 
/* Create a fresh bio_list for all subordinate requests 
*/
@@ -2046,7 +2063,10 @@ blk_qc_t generic_make_request(struct bio *bio)
bio_list_merge(&bio_list_on_stack[0], &same);
bio_list_merge(&bio_list_on_stack[0],
&bio_list_on_stack[1]);
} else {
-   bio_io_error(bio);
+   if (unlikely(!blk_queue_dying(q) && (bio->bi_opf & 
REQ_NOWAIT)))
+   bio_wouldblock_error(bio);
+   else
+   bio_io_error(bio);
}
bio = bio_list_pop(&bio_list_on_stack[0]);
} while (bio);
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index c974a1bbf4cb..9f88190ff395 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -119,6 +119,9 @@ struct request *blk_mq_sched_get_request(struct 
request_queue *q,
if (likely(!data->hctx))
data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
 
+   if (bio && (bio->bi_opf & REQ_NOWAIT))
+   data->flags |= BLK_MQ_REQ_NOWAIT;
+
if (e) {
data->flags |= BLK_MQ_REQ_INTERNAL;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 572966f49596..8b9b1a411ce2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1538,6 +1538,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue 
*q, struct bio *bio)
rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
if (unlikely(!rq)) {
__wbt_done(q->rq_wb, wb_acct);
+   if (bio && (bio->bi_opf & REQ_NOWAIT))
+   bio_wouldblock_error(bio);
return BLK_QC_T_NONE;
}
 
@@ -1662,6 +1664,8 @@ static blk_qc_t blk_sq_make_request(struct request_queue 
*q, struct bio *bio)
rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
if (unlikely(!rq)) {
__wbt_done(q->rq_wb, wb_acct);
+   if (bio && (bio->bi_opf & REQ_NOWAIT))
+   bio_wouldblock_error(bio);
return BLK_QC_T_NONE;
}
 
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 1d4c9f8bc1e1..7481124c5025 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -731,6 +731,9 @@ static int virtblk_probe(struct virtio_device *vdev)
/* No real sector limit. */
blk_queue_max_hw_sectors(q, -1U);
 
+   /* Request queue supports BIO_NOWAIT */
+

[PATCH 8/8] nowait aio: btrfs

2017-04-14 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Return EAGAIN if any of the following conditions holds:
 + i_rwsem is not lockable
 + NODATACOW or PREALLOC is not set
 + Cannot nocow at the desired location
 + Writing beyond end of file which is not allocated

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/file.c  | 25 -
 fs/btrfs/inode.c |  3 +++
 2 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 520cb7230b2d..a870e5dd2b4d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1823,12 +1823,29 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
ssize_t num_written = 0;
bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host);
ssize_t err;
-   loff_t pos;
-   size_t count;
+   loff_t pos = iocb->ki_pos;
+   size_t count = iov_iter_count(from);
loff_t oldsize;
int clean_page = 0;
 
-   inode_lock(inode);
+   if ((iocb->ki_flags & IOCB_NOWAIT) &&
+   (iocb->ki_flags & IOCB_DIRECT)) {
+   /* Don't sleep on inode rwsem */
+   if (!inode_trylock(inode))
+   return -EAGAIN;
+   /*
+* We will allocate space in case nodatacow is not set,
+* so bail
+*/
+   if (!(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
+ BTRFS_INODE_PREALLOC)) ||
+   check_can_nocow(BTRFS_I(inode), pos, &count) <= 0) {
+   inode_unlock(inode);
+   return -EAGAIN;
+   }
+   } else
+   inode_lock(inode);
+
err = generic_write_checks(iocb, from);
if (err <= 0) {
inode_unlock(inode);
@@ -1862,8 +1879,6 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 */
update_time_for_write(inode);
 
-   pos = iocb->ki_pos;
-   count = iov_iter_count(from);
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
if (start_pos > oldsize) {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a18510be76c1..d91b21a76d6d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8627,6 +8627,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct 
iov_iter *iter)
dio_data.overwrite = 1;
inode_unlock(inode);
relock = true;
+   } else if (iocb->ki_flags & IOCB_NOWAIT) {
+   ret = -EAGAIN;
+   goto out;
}
ret = btrfs_delalloc_reserve_space(inode, offset, count);
if (ret)
-- 
2.12.0



[PATCH 2/8] nowait aio: Introduce RWF_NOWAIT

2017-04-14 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

This flag informs the kernel to bail out if an AIO request will block
for reasons such as file allocations, a writeback being triggered,
or blocking while allocating requests while performing
direct I/O.

Unfortunately, aio_flags is not checked for validity; using it would
break existing applications which have it set to anything besides zero
or IOCB_FLAG_RESFD. So, we are using aio_reserved1 and renaming it
to aio_rw_flags.

RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags.

The check for -EOPNOTSUPP is placed in generic_file_write_iter(). This
is called by most filesystems, either directly as their .write_iter() or
from the function they assign to .write_iter(). If a filesystem does not go
through it, the check is performed in its own .write_iter(), which is called
for direct IO specifically.

Filesystems xfs, btrfs and ext4 are supported in the following patches.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/9p/vfs_file.c| 3 +++
 fs/aio.c| 9 +++--
 fs/ceph/file.c  | 3 +++
 fs/cifs/file.c  | 3 +++
 fs/fuse/file.c  | 3 +++
 fs/nfs/direct.c | 3 +++
 fs/ocfs2/file.c | 3 +++
 fs/read_write.c | 2 +-
 include/linux/fs.h  | 3 +++
 include/uapi/linux/fs.h | 1 +
 mm/filemap.c| 3 +++
 11 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index 3de3b4a89d89..403681db7723 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -411,6 +411,9 @@ v9fs_file_write_iter(struct kiocb *iocb, struct iov_iter 
*from)
loff_t origin;
int err = 0;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
retval = generic_write_checks(iocb, from);
if (retval <= 0)
return retval;
diff --git a/fs/aio.c b/fs/aio.c
index b8a33f5beef5..d3b5c8dc6549 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1546,12 +1546,12 @@ static int io_submit_one(struct kioctx *ctx, struct 
iocb __user *user_iocb,
return -EINVAL;
}
 
-   if (unlikely(iocb->aio_rw_flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))) 
{
+   if (unlikely(iocb->aio_rw_flags &
+~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT))) {
pr_debug("EINVAL: aio_rw_flags set with incompatible flags\n");
return -EINVAL;
}
 
-
/* prevent overflows */
if (unlikely(
(iocb->aio_buf != (unsigned long)iocb->aio_buf) ||
@@ -1593,6 +1593,11 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,
}
 
req->common.ki_flags |= iocb_rw_flags(iocb->aio_rw_flags);
+   if ((req->common.ki_flags & IOCB_NOWAIT) &&
+   !(req->common.ki_flags & IOCB_DIRECT)) {
+   ret = -EINVAL;
+   goto out_put_req;
+   }
 
ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
if (unlikely(ret)) {
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 26cc95421cca..af28419b1731 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1267,6 +1267,9 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct 
iov_iter *from)
int err, want, got;
loff_t pos;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
if (ceph_snap(inode) != CEPH_NOSNAP)
return -EROFS;
 
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index aa3debbba826..a828ab3e7775 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2638,6 +2638,9 @@ ssize_t cifs_user_writev(struct kiocb *iocb, struct 
iov_iter *from)
 * write request.
 */
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
rc = generic_write_checks(iocb, from);
if (rc <= 0)
return rc;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ec238fb5a584..72786e798319 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1425,6 +1425,9 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, 
struct iov_iter *from)
struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(file);
ssize_t res;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
if (is_bad_inode(inode))
return -EIO;
 
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index aab32fc3d6a8..ab419caebd5f 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -991,6 +991,9 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct 
iov_iter *iter)
dfprintk(FILE, "NFS: direct write(%pD2, %zd@%Ld)\n",
file, iov_iter_count(iter), (long long) iocb->ki_pos);
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
result = generic_write_checks(iocb, iter);
if (result <= 0)
return result;
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index bfeb647459d9..e7f8ba890305 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2235,6 +2235,9 @@ 

[PATCH 0/8 v6] No wait AIO

2017-04-14 Thread Goldwyn Rodrigues
Formerly known as non-blocking AIO.

This series adds a nonblocking feature to asynchronous I/O writes.
io_submit() can be delayed because of a number of reasons:
 - Block allocation for files
 - Data writebacks for direct I/O
 - Sleeping because of waiting to acquire i_rwsem
 - Congested block device

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions is met. This way userspace can push most
of the write()s to the kernel, which completes them to the best of its
ability, and defer any write that returns -EAGAIN to another thread.

In order to enable this, IOCB_RW_FLAG_NOWAIT is introduced in
uapi/linux/aio_abi.h. If set for aio_rw_flags, it translates to
IOCB_NOWAIT for struct iocb, BIO_NOWAIT for bio and IOMAP_NOWAIT for
iomap. aio_rw_flags is a new flag replacing aio_reserved1. We could
not use aio_flags because it is not currently checked for invalidity
in the kernel.

This feature is provided for direct I/O of asynchronous I/O only. I have
tested it against xfs, ext4, and btrfs, and I intend to add more filesystems.
The same goes for QUEUE_FLAG_NOWAIT, which is currently set for sd and virtio
devices only. This is primarily to block md/dm devices, which may wait in
places such as recovery/sync/suspend. In the future, I intend to add support
for these devices as well. Applications will have to check supportability
by sending an async direct write; any error besides -EAGAIN
would mean it is not supported.
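
For illustration only (this is not part of the series), an application
could use it roughly as follows. This assumes the aio_rw_flags field
introduced in patch 1/8 (called aio_reserved1 in unpatched headers), and
the RWF_NOWAIT value below is assumed purely for the example:

#include <linux/aio_abi.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <errno.h>

#ifndef RWF_NOWAIT
#define RWF_NOWAIT	0x00000008	/* assumed value */
#endif

/* Submit one O_DIRECT write with RWF_NOWAIT; a result of -EAGAIN means
 * the kernel would have had to block, so the caller can defer the write
 * to a thread that is allowed to sleep. */
static long pwrite_nowait(aio_context_t ctx, int fd, void *buf,
			  size_t len, off_t off)
{
	struct iocb cb;
	struct iocb *cbs[1] = { &cb };
	struct io_event ev;

	memset(&cb, 0, sizeof(cb));
	cb.aio_lio_opcode = IOCB_CMD_PWRITE;
	cb.aio_fildes = fd;			/* opened with O_DIRECT */
	cb.aio_buf = (uint64_t)(uintptr_t)buf;	/* suitably aligned */
	cb.aio_nbytes = len;
	cb.aio_offset = off;
	cb.aio_rw_flags = RWF_NOWAIT;		/* bail out instead of blocking */

	if (syscall(__NR_io_submit, ctx, 1, cbs) != 1)
		return -errno;			/* e.g. EINVAL for buffered AIO */

	/* A filesystem that would have blocked reports -EAGAIN here. */
	if (syscall(__NR_io_getevents, ctx, 1, 1, &ev, NULL) != 1)
		return -errno;
	return (long)ev.res;
}

Exactly where a given error surfaces (at io_submit() time or in the
completion event) follows whatever the individual patches settle on, so
this only shows the calling convention.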

Changes since v1:
 + changed name from _NONBLOCKING to *_NOWAIT
 + filemap_range_has_page call moved closer to (just before) the call to
   filemap_write_and_wait_range().
 + BIO_NOWAIT limited to get_request()
 + XFS fixes 
- included reflink 
- use of xfs_ilock_nowait() instead of a XFS_IOLOCK_NONBLOCKING flag
- Translate the flag through IOMAP_NOWAIT (iomap) to check for
  block allocation for the file.
 + ext4 coding style

Changes since v2:
 + Using aio_reserved1 as aio_rw_flags instead of aio_flags
 + blk-mq support
 + xfs uptodate with kernel and reflink changes

 Changes since v3:
  + Added FS_NOWAIT, which is set if the filesystem supports NOWAIT feature.
  + Checks in generic_make_request() to make sure BIO_NOWAIT comes in
for async direct writes only.
  + Added QUEUE_FLAG_NOWAIT, which is set if the device supports BIO_NOWAIT.
This is added (rather not set) to block devices such as dm/md currently.

 Changes since v4:
  + Ported AIO code to use RWF_* flags. Check for RWF_* flags in
generic_file_write_iter().
  + Changed IOCB_RW_FLAGS_NOWAIT to RWF_NOWAIT.

 Changes since v5:
  + BIO_NOWAIT to REQ_NOWAIT
  + Common helper for RWF flags.

-- 
Goldwyn




[PATCH 4/8] nowait-aio: Introduce IOMAP_NOWAIT

2017-04-14 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

IOCB_NOWAIT translates to IOMAP_NOWAIT for iomaps.
This is used by XFS in the XFS patch.
---
 fs/iomap.c| 2 ++
 include/linux/iomap.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index 141c3cd55a8b..d1c81753d411 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -885,6 +885,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
} else {
dio->flags |= IOMAP_DIO_WRITE;
flags |= IOMAP_WRITE;
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   flags |= IOMAP_NOWAIT;
}
 
if (mapping->nrpages) {
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 7291810067eb..53f6af89c625 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -51,6 +51,7 @@ struct iomap {
 #define IOMAP_REPORT   (1 << 2) /* report extent status, e.g. FIEMAP */
 #define IOMAP_FAULT(1 << 3) /* mapping for page fault */
 #define IOMAP_DIRECT   (1 << 4) /* direct I/O */
+#define IOMAP_NOWAIT   (1 << 5) /* Don't wait for writeback */
 
 struct iomap_ops {
/*
-- 
2.12.0



[PATCH 1/8] Use RWF_* flags for AIO operations

2017-04-14 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

RWF_* flags is used for preadv2/pwritev2 calls. Port to use
it for aio operations as well. For this, aio_rw_flags is
introduced in struct iocb (using aio_reserved1) which will
carry these flags.

This is a precursor to the nowait AIO calls.

Note, the only place RWF_HIPRI comes into effect is dio_await_one().
In all the other locations, the aio code returns -EIOCBQUEUED before the
checks for RWF_HIPRI.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/aio.c | 10 +-
 fs/read_write.c  |  7 +--
 include/linux/fs.h   | 12 
 include/uapi/linux/aio_abi.h |  2 +-
 4 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index f52d925ee259..b8a33f5beef5 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1541,11 +1541,17 @@ static int io_submit_one(struct kioctx *ctx, struct 
iocb __user *user_iocb,
ssize_t ret;
 
/* enforce forwards compatibility on users */
-   if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) {
+   if (unlikely(iocb->aio_reserved2)) {
pr_debug("EINVAL: reserve field set\n");
return -EINVAL;
}
 
+   if (unlikely(iocb->aio_rw_flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))) 
{
+   pr_debug("EINVAL: aio_rw_flags set with incompatible flags\n");
+   return -EINVAL;
+   }
+
+
/* prevent overflows */
if (unlikely(
(iocb->aio_buf != (unsigned long)iocb->aio_buf) ||
@@ -1586,6 +1592,8 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,
req->common.ki_flags |= IOCB_EVENTFD;
}
 
+   req->common.ki_flags |= iocb_rw_flags(iocb->aio_rw_flags);
+
ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
if (unlikely(ret)) {
pr_debug("EFAULT: aio_key\n");
diff --git a/fs/read_write.c b/fs/read_write.c
index c4f88afbc67f..9aa557bb471c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -682,12 +682,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, 
struct iov_iter *iter,
return -EOPNOTSUPP;
 
init_sync_kiocb(&kiocb, filp);
-   if (flags & RWF_HIPRI)
-   kiocb.ki_flags |= IOCB_HIPRI;
-   if (flags & RWF_DSYNC)
-   kiocb.ki_flags |= IOCB_DSYNC;
-   if (flags & RWF_SYNC)
-   kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
+   kiocb.ki_flags |= iocb_rw_flags(flags);
kiocb.ki_pos = *ppos;
 
if (type == READ)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7251f7bb45e8..35cfb08ceb9d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3049,6 +3049,18 @@ static inline int iocb_flags(struct file *file)
return res;
 }
 
+static inline int iocb_rw_flags(int flags)
+{
+   int res = 0;
+   if (flags & RWF_HIPRI)
+   res |= IOCB_HIPRI;
+   if (flags & RWF_DSYNC)
+   res |= IOCB_DSYNC;
+   if (flags & RWF_SYNC)
+   res |= (IOCB_DSYNC | IOCB_SYNC);
+   return res;
+}
+
 static inline ino_t parent_ino(struct dentry *dentry)
 {
ino_t res;
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index bb2554f7fbd1..a2d4a8ac94ca 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -79,7 +79,7 @@ struct io_event {
 struct iocb {
/* these are internal to the kernel/libc. */
__u64   aio_data;   /* data to be returned in event's data */
-   __u32   PADDED(aio_key, aio_reserved1);
+   __u32   PADDED(aio_key, aio_rw_flags);
/* the kernel sets aio_key to the req # */
 
/* common fields */
-- 
2.12.0



Re: [PATCH] block: fix bio_will_gap()

2017-04-14 Thread Johannes Thumshirn
On Thu, Apr 13, 2017 at 6:06 PM, Ming Lei  wrote:
> Now commit 729204ef49ec ("block: relax check on sg gap")
> allows us to merge bios if both are physically contiguous.
> This change can merge a huge number of small bios in use cases
> such as mkfs; for example, mkfs.ntfs running time can be decreased to ~1/10.
>
> But if one rq starts with a non-aligned buffer (the 1st bvec's
> bv_offset isn't zero) and we allow the merge, it is quite
> difficult to respect the sg gap limit; in particular the segment
> can't be at the maximum segment size, because otherwise the segment
> would end at an unaligned virt boundary. This patch tries to avoid the
> issue by not allowing the merge if the req starts with a non-aligned
> buffer.
>
> Also add comments to explain why the merged segment can't
> end at an unaligned virt boundary.
>
> Fixes: 729204ef49ec ("block: relax check on sg gap")
> Tested-by: Johannes Thumshirn 
> Reviewed-by: Johannes Thumshirn 
> Signed-off-by: Ming Lei 
> ---


Hi Jens and Omar,

I know Jens is currently on vacation, but could Omar maybe get this fix to Linus
before v4.11? This would be highly appreciated.

Thanks a lot,
Johannes


Re: [PATCH 02/25] block: remove the blk_execute_rq return value

2017-04-14 Thread h...@lst.de
On Thu, Apr 13, 2017 at 08:03:22PM +, Bart Van Assche wrote:
> That blk_execute_rq() call can only be reached if a few lines above 0 was
> assigned to the "error" variable. Since nfsd4_scsi_identify_device() returns
> the value of the "error" variable I think -EIO should be assigned to that
> variable before the "goto out_put_request" statement is reached.

You're right!  I'll fix it up.


[PATCH v4 5/5] blk-mq: introduce Kyber multiqueue I/O scheduler

2017-04-14 Thread Omar Sandoval
From: Omar Sandoval 

The Kyber I/O scheduler is an I/O scheduler for fast devices designed to
scale to multiple queues. Users configure only two knobs, the target
read and synchronous write latencies, and the scheduler tunes itself to
achieve that latency goal.

The implementation is based on "tokens", built on top of the scalable
bitmap library. Tokens serve as a mechanism for limiting requests. There
are two tiers of tokens: queueing tokens and dispatch tokens.

A queueing token is required to allocate a request. In fact, these
tokens are actually the blk-mq internal scheduler tags, but the
scheduler manages the allocation directly in order to implement its
policy.

Dispatch tokens are device-wide and split up into two scheduling
domains: reads vs. writes. Each hardware queue dispatches batches
round-robin between the scheduling domains as long as tokens are
available for that domain.

These tokens can be used as the mechanism to enable various policies.
The policy Kyber uses is inspired by active queue management techniques
for network routing, similar to blk-wbt. The scheduler monitors
latencies and scales the number of dispatch tokens accordingly. Queueing
tokens are used to prevent starvation of synchronous requests by
asynchronous requests.

Various extensions are possible, including better heuristics and ionice
support. The new scheduler isn't set as the default yet.

Signed-off-by: Omar Sandoval 
---
 Documentation/block/kyber-iosched.txt |  14 +
 block/Kconfig.iosched |   9 +
 block/Makefile|   1 +
 block/kyber-iosched.c | 719 ++
 4 files changed, 743 insertions(+)
 create mode 100644 Documentation/block/kyber-iosched.txt
 create mode 100644 block/kyber-iosched.c

diff --git a/Documentation/block/kyber-iosched.txt 
b/Documentation/block/kyber-iosched.txt
new file mode 100644
index ..e94feacd7edc
--- /dev/null
+++ b/Documentation/block/kyber-iosched.txt
@@ -0,0 +1,14 @@
+Kyber I/O scheduler tunables
+===
+
+The only two tunables for the Kyber scheduler are the target latencies for
+reads and synchronous writes. Kyber will throttle requests in order to meet
+these target latencies.
+
+read_lat_nsec
+-
+Target latency for reads (in nanoseconds).
+
+write_lat_nsec
+--
+Target latency for synchronous writes (in nanoseconds).
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 58fc8684788d..916e69c68fa4 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -69,6 +69,15 @@ config MQ_IOSCHED_DEADLINE
---help---
  MQ version of the deadline IO scheduler.
 
+config MQ_IOSCHED_KYBER
+   tristate "Kyber I/O scheduler"
+   default y
+   ---help---
+ The Kyber I/O scheduler is a low-overhead scheduler suitable for
+ multiqueue and other fast devices. Given target latencies for reads 
and
+ synchronous writes, it will self-tune queue depths to achieve that
+ goal.
+
 endmenu
 
 endif
diff --git a/block/Makefile b/block/Makefile
index 081bb680789b..6146d2eaaeaa 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_IOSCHED_NOOP)+= noop-iosched.o
 obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
 obj-$(CONFIG_IOSCHED_CFQ)  += cfq-iosched.o
 obj-$(CONFIG_MQ_IOSCHED_DEADLINE)  += mq-deadline.o
+obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o
 
 obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
 obj-$(CONFIG_BLK_CMDLINE_PARSER)   += cmdline-parser.o
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
new file mode 100644
index ..fe4af5b97c0e
--- /dev/null
+++ b/block/kyber-iosched.c
@@ -0,0 +1,719 @@
+/*
+ * The Kyber I/O scheduler. Controls latency by throttling queue depths using
+ * scalable techniques.
+ *
+ * Copyright (C) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-sched.h"
+#include "blk-mq-tag.h"
+#include "blk-stat.h"
+
+/* Scheduling domains. */
+enum {
+   KYBER_READ,
+   KYBER_SYNC_WRITE,
+   KYBER_OTHER, /* Async writes, discard, etc. */
+   KYBER_NUM_DOMAINS,
+};
+
+enum {
+   KYBER_MIN_DEPTH = 256,
+
+   /*
+* In order to prevent starvation of synchronous requests 

[PATCH v4 2/5] blk-mq: add shallow depth option for blk_mq_get_tag()

2017-04-14 Thread Omar Sandoval
From: Omar Sandoval 

Wire up the sbitmap_get_shallow() operation to the tag code so that a
caller can limit the number of tags available to it.

Signed-off-by: Omar Sandoval 
---
 block/blk-mq-tag.c | 5 -
 block/blk-mq.h | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 9d97bfc4d465..d0be72ccb091 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -96,7 +96,10 @@ static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
!hctx_may_queue(data->hctx, bt))
return -1;
-   return __sbitmap_queue_get(bt);
+   if (data->shallow_depth)
+   return __sbitmap_queue_get_shallow(bt, data->shallow_depth);
+   else
+   return __sbitmap_queue_get(bt);
 }
 
 unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 7e6f2e467696..524f44742816 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -141,6 +141,7 @@ struct blk_mq_alloc_data {
/* input parameter */
struct request_queue *q;
unsigned int flags;
+   unsigned int shallow_depth;
 
/* input & output parameter */
struct blk_mq_ctx *ctx;
-- 
2.12.2



[PATCH v4 1/5] sbitmap: add sbitmap_get_shallow() operation

2017-04-14 Thread Omar Sandoval
From: Omar Sandoval 

This operation supports the use case of limiting the number of bits that
can be allocated for a given operation. Rather than setting aside some
bits at the end of the bitmap, we can set aside bits in each word of the
bitmap. This means we can keep the allocation hints spread out and
support sbitmap_resize() nicely at the cost of lower granularity for the
allowed depth.

Signed-off-by: Omar Sandoval 
---
 include/linux/sbitmap.h | 55 
 lib/sbitmap.c   | 75 -
 2 files changed, 123 insertions(+), 7 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index d4e0a204c118..a1904aadbc45 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -176,6 +176,25 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int 
depth);
 int sbitmap_get(struct sbitmap *sb, unsigned int alloc_hint, bool round_robin);
 
 /**
+ * sbitmap_get_shallow() - Try to allocate a free bit from a &struct sbitmap,
+ * limiting the depth used from each word.
+ * @sb: Bitmap to allocate from.
+ * @alloc_hint: Hint for where to start searching for a free bit.
+ * @shallow_depth: The maximum number of bits to allocate from a single word.
+ *
+ * This rather specific operation allows for having multiple users with
+ * different allocation limits. E.g., there can be a high-priority class that
+ * uses sbitmap_get() and a low-priority class that uses sbitmap_get_shallow()
+ * with a @shallow_depth of (1 << (@sb->shift - 1)). Then, the low-priority
+ * class can only allocate half of the total bits in the bitmap, preventing it
+ * from starving out the high-priority class.
+ *
+ * Return: Non-negative allocated bit number if successful, -1 otherwise.
+ */
+int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
+   unsigned long shallow_depth);
+
+/**
  * sbitmap_any_bit_set() - Check for a set bit in a &struct sbitmap.
  * @sb: Bitmap to check.
  *
@@ -326,6 +345,19 @@ void sbitmap_queue_resize(struct sbitmap_queue *sbq, 
unsigned int depth);
 int __sbitmap_queue_get(struct sbitmap_queue *sbq);
 
 /**
+ * __sbitmap_queue_get_shallow() - Try to allocate a free bit from a &struct
+ * sbitmap_queue, limiting the depth used from each word, with preemption
+ * already disabled.
+ * @sbq: Bitmap queue to allocate from.
+ * @shallow_depth: The maximum number of bits to allocate from a single word.
+ * See sbitmap_get_shallow().
+ *
+ * Return: Non-negative allocated bit number if successful, -1 otherwise.
+ */
+int __sbitmap_queue_get_shallow(struct sbitmap_queue *sbq,
+   unsigned int shallow_depth);
+
+/**
  * sbitmap_queue_get() - Try to allocate a free bit from a &struct
  * sbitmap_queue.
  * @sbq: Bitmap queue to allocate from.
@@ -346,6 +378,29 @@ static inline int sbitmap_queue_get(struct sbitmap_queue 
*sbq,
 }
 
 /**
+ * sbitmap_queue_get_shallow() - Try to allocate a free bit from a &struct
+ * sbitmap_queue, limiting the depth used from each word.
+ * @sbq: Bitmap queue to allocate from.
+ * @cpu: Output parameter; will contain the CPU we ran on (e.g., to be passed 
to
+ *   sbitmap_queue_clear()).
+ * @shallow_depth: The maximum number of bits to allocate from a single word.
+ * See sbitmap_get_shallow().
+ *
+ * Return: Non-negative allocated bit number if successful, -1 otherwise.
+ */
+static inline int sbitmap_queue_get_shallow(struct sbitmap_queue *sbq,
+   unsigned int *cpu,
+   unsigned int shallow_depth)
+{
+   int nr;
+
+   *cpu = get_cpu();
+   nr = __sbitmap_queue_get_shallow(sbq, shallow_depth);
+   put_cpu();
+   return nr;
+}
+
+/**
  * sbitmap_queue_clear() - Free an allocated bit and wake up waiters on a
  * &struct sbitmap_queue.
  * @sbq: Bitmap to free from.
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 60e800e0b5a0..80aa8d5463fa 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -79,15 +79,15 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
 }
 EXPORT_SYMBOL_GPL(sbitmap_resize);
 
-static int __sbitmap_get_word(struct sbitmap_word *word, unsigned int hint,
- bool wrap)
+static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
+ unsigned int hint, bool wrap)
 {
unsigned int orig_hint = hint;
int nr;
 
while (1) {
-   nr = find_next_zero_bit(&word->word, word->depth, hint);
-   if (unlikely(nr >= word->depth)) {
+   nr = find_next_zero_bit(word, depth, hint);
+   if (unlikely(nr >= depth)) {
/*
 * We started with an offset, and we didn't reset the
 * offset to 0 in a failure case, so start from 0 to
@@ -100,11 +100,11 @@ static int __sbitmap_get_word(struct sbitmap_word *word, 
unsigned int hint,

[PATCH v4 4/5] blk-mq-sched: make completed_request() callback more useful

2017-04-14 Thread Omar Sandoval
From: Omar Sandoval 

Currently, this callback is called right after put_request() and has no
distinguishable purpose. Instead, let's call it before put_request() as
soon as I/O has completed on the request, before we account it in
blk-stat. With this, Kyber can enable stats when it sees a latency
outlier and make sure the outlier gets accounted.

Signed-off-by: Omar Sandoval 
---
 block/blk-mq-sched.h | 11 +++
 block/blk-mq.c   |  5 -
 include/linux/elevator.h |  2 +-
 3 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index f4bc186c3440..120c6abc37cc 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -82,17 +82,12 @@ blk_mq_sched_allow_merge(struct request_queue *q, struct 
request *rq,
return true;
 }
 
-static inline void
-blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
+static inline void blk_mq_sched_completed_request(struct request *rq)
 {
-   struct elevator_queue *e = hctx->queue->elevator;
+   struct elevator_queue *e = rq->q->elevator;
 
if (e && e->type->ops.mq.completed_request)
-   e->type->ops.mq.completed_request(hctx, rq);
-
-   BUG_ON(rq->internal_tag == -1);
-
-   blk_mq_put_tag(hctx, hctx->sched_tags, rq->mq_ctx, rq->internal_tag);
+   e->type->ops.mq.completed_request(rq);
 }
 
 static inline void blk_mq_sched_started_request(struct request *rq)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7138cd98146e..e2ef7b460924 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -350,7 +350,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, 
struct blk_mq_ctx *ctx,
if (rq->tag != -1)
blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
if (sched_tag != -1)
-   blk_mq_sched_completed_request(hctx, rq);
+   blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
blk_mq_sched_restart(hctx);
blk_queue_exit(q);
 }
@@ -444,6 +444,9 @@ static void __blk_mq_complete_request(struct request *rq)
 {
struct request_queue *q = rq->q;
 
+   if (rq->internal_tag != -1)
+   blk_mq_sched_completed_request(rq);
+
blk_mq_stat_add(rq);
 
if (!q->softirq_done_fn)
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index b7ec315ee7e7..3a216318ae73 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -106,7 +106,7 @@ struct elevator_mq_ops {
void (*insert_requests)(struct blk_mq_hw_ctx *, struct list_head *, 
bool);
struct request *(*dispatch_request)(struct blk_mq_hw_ctx *);
bool (*has_work)(struct blk_mq_hw_ctx *);
-   void (*completed_request)(struct blk_mq_hw_ctx *, struct request *);
+   void (*completed_request)(struct request *);
void (*started_request)(struct request *);
void (*requeue_request)(struct request *);
struct request *(*former_request)(struct request_queue *, struct 
request *);
-- 
2.12.2



[PATCH v4 3/5] blk-mq: export helpers

2017-04-14 Thread Omar Sandoval
From: Omar Sandoval 

blk_mq_finish_request() is required for schedulers that define their own
put_request(). blk_mq_run_hw_queue() is required for schedulers that
hold back requests to be run later.

Signed-off-by: Omar Sandoval 
---
 block/blk-mq.c | 2 ++
 include/linux/blk-mq.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e536dacfae4c..7138cd98146e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -368,6 +368,7 @@ void blk_mq_finish_request(struct request *rq)
 {
blk_mq_finish_hctx_request(blk_mq_map_queue(rq->q, rq->mq_ctx->cpu), 
rq);
 }
+EXPORT_SYMBOL_GPL(blk_mq_finish_request);
 
 void blk_mq_free_request(struct request *rq)
 {
@@ -1183,6 +1184,7 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool 
async)
 {
__blk_mq_delay_run_hw_queue(hctx, async, 0);
 }
+EXPORT_SYMBOL(blk_mq_run_hw_queue);
 
 void blk_mq_run_hw_queues(struct request_queue *q, bool async)
 {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b90c3d5766cd..d75de612845d 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -238,6 +238,7 @@ void blk_mq_start_hw_queues(struct request_queue *q);
 void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
 void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long 
msecs);
+void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_run_hw_queues(struct request_queue *q, bool async);
 void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
-- 
2.12.2



[PATCH v4 0/5] blk-mq: Kyber multiqueue I/O scheduler

2017-04-14 Thread Omar Sandoval
From: Omar Sandoval 

This is v4 of Kyber, an I/O scheduler for multiqueue devices combining
several techniques: the scalable bitmap library, the new blk-stats API,
and queue depth throttling similar to blk-wbt. v1 was here [1], v2 was
here [2], v3 was here [3].

v4 fixes a hang in v3 caused by a race condition in the wait queue
handling in kyber_get_domain_token().

This series is based on block/for-next. Patches 1 and 2 implement a new
sbitmap operation. Patch 3 exports a couple of helpers. Patch 4 moves a
scheduler callback to somewhere more useful. Patch 5 implements the new
scheduler.

Thanks!

1: http://marc.info/?l=linux-block&m=148978871820916&w=2
2: http://marc.info/?l=linux-block&m=149132467510945&w=2
3: http://marc.info/?l=linux-block&m=149180713919341&w=2

Omar Sandoval (5):
  sbitmap: add sbitmap_get_shallow() operation
  blk-mq: add shallow depth option for blk_mq_get_tag()
  blk-mq: export helpers
  blk-mq-sched: make completed_request() callback more useful
  blk-mq: introduce Kyber multiqueue I/O scheduler

 Documentation/block/kyber-iosched.txt |  14 +
 block/Kconfig.iosched |   9 +
 block/Makefile|   1 +
 block/blk-mq-sched.h  |  11 +-
 block/blk-mq-tag.c|   5 +-
 block/blk-mq.c|   7 +-
 block/blk-mq.h|   1 +
 block/kyber-iosched.c | 719 ++
 include/linux/blk-mq.h|   1 +
 include/linux/elevator.h  |   2 +-
 include/linux/sbitmap.h   |  55 +++
 lib/sbitmap.c |  75 +++-
 12 files changed, 882 insertions(+), 18 deletions(-)
 create mode 100644 Documentation/block/kyber-iosched.txt
 create mode 100644 block/kyber-iosched.c

-- 
2.12.2



Re: [PATCH 1/6] blk-mq: Do not invoke queue operations on a dead queue

2017-04-14 Thread Omar Sandoval
On Thu, Apr 13, 2017 at 11:05:32PM +, Bart Van Assche wrote:
> On Thu, 2017-04-13 at 16:01 -0700, Omar Sandoval wrote:
> > On Tue, Apr 11, 2017 at 01:58:37PM -0700, Bart Van Assche wrote:
> > > The blk-mq debugfs attributes are removed after blk_cleanup_queue()
> > > has finished. Since running a queue after a queue has entered the
> > > "dead" state is not allowed, disallow this. This patch avoids that
> > > an attempt to run a dead queue triggers a kernel crash.
> > > 
> > > Signed-off-by: Bart Van Assche 
> > > Cc: Omar Sandoval 
> > > Cc: Hannes Reinecke 
> > > ---
> > >  block/blk-mq-debugfs.c | 8 
> > >  1 file changed, 8 insertions(+)
> > > 
> > > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> > > index df9b688b877c..a1ce823578c7 100644
> > > --- a/block/blk-mq-debugfs.c
> > > +++ b/block/blk-mq-debugfs.c
> > > @@ -111,6 +111,14 @@ static ssize_t blk_queue_flags_store(struct file 
> > > *file, const char __user *ubuf,
> > >   struct request_queue *q = file_inode(file)->i_private;
> > >   char op[16] = { }, *s;
> > >  
> > > + /*
> > > +  * The debugfs attributes are removed after blk_cleanup_queue() has
> > > +  * called blk_mq_free_queue(). Return if QUEUE_FLAG_DEAD has been set
> > > +  * to avoid triggering a use-after-free.
> > > +  */
> > > + if (blk_queue_dead(q))
> > > + return -ENOENT;
> > > +
> > >   len = min(len, sizeof(op) - 1);
> > >   if (copy_from_user(op, ubuf, len))
> > >   return -EFAULT;
> > 
> > Looking at this, I think we have similar issues with most of the other
> > debugfs files. Should we move the debugfs cleanup earlier?
> 
> Hello Omar,
> 
> That's a good question. However, while I was debugging it was very convenient
> to be able to access the queue state after it had reached the "dead" state.
> Performing the cleanup earlier would be an alternative solution but would
> make debugging a bit harder ...
> 
> Bart.

What useful information were you getting out of debugfs once the queue
was already dead? Wasn't the interesting stuff freed at that point?