Re: [PATCH 1/6] nvme: Sync request queues on reset

2018-05-18 Thread Ming Lei
On Fri, May 18, 2018 at 05:44:08PM -0600, Keith Busch wrote:
> On Sat, May 19, 2018 at 06:32:11AM +0800, Ming Lei wrote:
> > This approach can't sync timeouts reliably, since timeout events can
> > come from two namespaces at the same time, and one may be handled as
> > RESET_TIMER while the other is handled as EH_HANDLED.
> 
> You keep saying that, but the controller state is global to the
> controller. It doesn't matter which namespace request_queue started the
> reset: every namespace's request queue sees the RESETTING controller state

When the timeouts come, the global RESETTING state may not have been
updated yet, so not all of the timeout handlers will observe it.

Please see my previous explanation:

https://marc.info/?l=linux-block&m=152600464317808&w=2
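
For illustration only, here is a minimal userspace sketch of the ordering
problem described above (plain C with pthreads, not driver code; all names
are made up): two per-namespace timeout handlers sample a shared controller
state while a reset is flipping it, so one handler can end up deciding
RESET_TIMER while the other decides EH_HANDLED.

/*
 * build: cc -pthread -o race race.c
 * Two "namespace timeout handlers" sample a shared controller state while
 * a "reset" flips it.  Depending on when each timeout fires relative to the
 * state change, one handler backs off (RESET_TIMER) and the other handles
 * the timeout itself (EH_HANDLED), which is the inconsistency noted above.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

enum ctrl_state { CTRL_LIVE, CTRL_RESETTING };

static atomic_int ctrl_state = CTRL_LIVE;

static void *timeout_handler(void *name)
{
    /* what each namespace's timeout handler conceptually does */
    if (atomic_load(&ctrl_state) == CTRL_RESETTING)
        printf("%s: reset observed -> RESET_TIMER\n", (char *)name);
    else
        printf("%s: no reset observed -> EH_HANDLED\n", (char *)name);
    return NULL;
}

int main(void)
{
    pthread_t ns1, ns2;

    /* ns1's timeout fires before the controller is marked RESETTING ... */
    pthread_create(&ns1, NULL, timeout_handler, "ns1");
    usleep(1000);

    atomic_store(&ctrl_state, CTRL_RESETTING);  /* the reset kicks in */

    /* ... and ns2's timeout fires after it */
    pthread_create(&ns2, NULL, timeout_handler, "ns2");

    pthread_join(ns1, NULL);
    pthread_join(ns2, NULL);
    return 0;
}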


Thanks,
Ming


Re: [PATCH V6 11/11] nvme: pci: support nested EH

2018-05-18 Thread Ming Lei
On Fri, May 18, 2018 at 05:45:00PM -0600, Keith Busch wrote:
> On Sat, May 19, 2018 at 06:26:28AM +0800, Ming Lei wrote:
> > So could we please face the real issue instead of working around the
> > test case?
> 
> Yes, that's why I want you to stop referencing the broken test.

Unfortunately I don't care whether it is broken or not; I think the
test case is valuable precisely because it exposes the real problem.
That is its value!

The reason I referenced this test is that it lets everyone reproduce
the real issue easily, that is all.

Thanks,
Ming


Re: [PATCH V6 11/11] nvme: pci: support nested EH

2018-05-18 Thread Keith Busch
On Sat, May 19, 2018 at 06:26:28AM +0800, Ming Lei wrote:
> So could we please face the real issue instead of working around the
> test case?

Yes, that's why I want you to stop referencing the broken test.


Re: [PATCH 1/6] nvme: Sync request queues on reset

2018-05-18 Thread Keith Busch
On Sat, May 19, 2018 at 06:32:11AM +0800, Ming Lei wrote:
> This approach can't sync timeouts reliably, since timeout events can
> come from two namespaces at the same time, and one may be handled as
> RESET_TIMER while the other is handled as EH_HANDLED.

You keep saying that, but the controller state is global to the
controller. It doesn't matter which namespace request_queue started the
reset: every namespace's request queue sees the RESETTING controller state
from the point the syncing occurs, so none of them returns RESET_TIMER;
on top of that, the reset reclaims every single IO command no matter
which namespace request_queue initiated the reset.


Re: [PATCH 3/6] nvme: Move all IO out of controller reset

2018-05-18 Thread Ming Lei
On Fri, May 18, 2018 at 10:38:20AM -0600, Keith Busch wrote:
> IO may be retryable, so don't wait for them in the reset path. These
> commands may trigger a reset if that IO expires without a completion,
> placing it on the requeue list. Waiting for these would then deadlock
> the reset handler.
> 
> To fix the theoretical deadlock, this patch unblocks IO submission from
> the reset_work as before, but moves the waiting to the IO safe scan_work
> so that the reset_work may proceed to completion. Since the unfreezing
> happens in the controller LIVE state, the nvme device has to track if
> the queues were frozen now to prevent incorrect freeze depths.
> 
> This patch is also renaming the function 'nvme_dev_add' to a
> more appropriate name that describes what it's actually doing:
> nvme_alloc_io_tags.
> 
> Signed-off-by: Keith Busch 
> ---
>  drivers/nvme/host/core.c |  3 +++
>  drivers/nvme/host/nvme.h |  1 +
>  drivers/nvme/host/pci.c  | 46 +-
>  3 files changed, 37 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 1de68b56b318..34d7731f1419 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -214,6 +214,7 @@ static inline bool nvme_req_needs_retry(struct request 
> *req)
>   if (blk_noretry_request(req))
>   return false;
>   if (nvme_req(req)->status & NVME_SC_DNR)
> +
>   return false;
>   if (nvme_req(req)->retries >= nvme_max_retries)
>   return false;
> @@ -3177,6 +3178,8 @@ static void nvme_scan_work(struct work_struct *work)
>   struct nvme_id_ctrl *id;
>   unsigned nn;
>  
> + if (ctrl->ops->update_hw_ctx)
> + ctrl->ops->update_hw_ctx(ctrl);
>   if (ctrl->state != NVME_CTRL_LIVE)
>   return;
>  
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index c15c2ee7f61a..230c5424b197 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -320,6 +320,7 @@ struct nvme_ctrl_ops {
>   int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
>   int (*reinit_request)(void *data, struct request *rq);
>   void (*stop_ctrl)(struct nvme_ctrl *ctrl);
> + void (*update_hw_ctx)(struct nvme_ctrl *ctrl);
>  };
>  
>  #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 2bd9d84f58d0..6a7cbc631d92 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -99,6 +99,7 @@ struct nvme_dev {
>   u32 cmbloc;
>   struct nvme_ctrl ctrl;
>   struct completion ioq_wait;
> + bool queues_froze;
>  
>   /* shadow doorbell buffer support: */
>   u32 *dbbuf_dbs;
> @@ -2065,10 +2066,33 @@ static void nvme_disable_io_queues(struct nvme_dev 
> *dev)
>   }
>  }
>  
> +static void nvme_pci_update_hw_ctx(struct nvme_ctrl *ctrl)
> +{
> + struct nvme_dev *dev = to_nvme_dev(ctrl);
> + bool unfreeze;
> +
> + mutex_lock(&dev->shutdown_lock);
> + unfreeze = dev->queues_froze;
> + mutex_unlock(&dev->shutdown_lock);

What if nvme_dev_disable() just sets the .queues_froze flag and
userspace sends a RESCAN command at the same time?

> +
> + if (unfreeze)
> + nvme_wait_freeze(&dev->ctrl);
> +

A timeout may come just before blk_mq_update_nr_hw_queues() or the
above nvme_wait_freeze(), and then both may hang forever.

> + blk_mq_update_nr_hw_queues(ctrl->tagset, dev->online_queues - 1);
> + nvme_free_queues(dev, dev->online_queues);
> +
> + if (unfreeze)
> + nvme_unfreeze(&dev->ctrl);
> +
> + mutex_lock(&dev->shutdown_lock);
> + dev->queues_froze = false;
> + mutex_unlock(&dev->shutdown_lock);

If the running scan work was triggered from user space, the above code
may clear the .queues_froze flag incorrectly.

Thanks,
Ming


Re: [PATCH 1/6] nvme: Sync request queues on reset

2018-05-18 Thread Ming Lei
On Fri, May 18, 2018 at 10:38:18AM -0600, Keith Busch wrote:
> This patch fixes races that occur with simultaneous controller
> resets by synchronizing request queues prior to initializing the
> controller. Without this, a thread may attempt disabling a controller
> at the same time as we're trying to enable it.
> 
> Signed-off-by: Keith Busch 
> ---
>  drivers/nvme/host/core.c | 21 +++--
>  drivers/nvme/host/nvme.h |  1 +
>  drivers/nvme/host/pci.c  |  1 +
>  3 files changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 99b857e5a7a9..1de68b56b318 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3471,6 +3471,12 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct 
> device *dev,
>  }
>  EXPORT_SYMBOL_GPL(nvme_init_ctrl);
>  
> +static void nvme_start_queue(struct nvme_ns *ns)
> +{
> + blk_mq_unquiesce_queue(ns->queue);
> + blk_mq_kick_requeue_list(ns->queue);
> +}
> +
>  /**
>   * nvme_kill_queues(): Ends all namespace queues
>   * @ctrl: the dead controller that needs to end
> @@ -3499,7 +3505,7 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl)
>   blk_set_queue_dying(ns->queue);
>  
>   /* Forcibly unquiesce queues to avoid blocking dispatch */
> - blk_mq_unquiesce_queue(ns->queue);
> + nvme_start_queue(ns);
>   }
> up_read(&ctrl->namespaces_rwsem);
>  }
> @@ -3569,11 +3575,22 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
>  
> down_read(&ctrl->namespaces_rwsem);
> list_for_each_entry(ns, &ctrl->namespaces, list)
> - blk_mq_unquiesce_queue(ns->queue);
> + nvme_start_queue(ns);
> up_read(&ctrl->namespaces_rwsem);
>  }
>  EXPORT_SYMBOL_GPL(nvme_start_queues);
>  
> +void nvme_sync_queues(struct nvme_ctrl *ctrl)
> +{
> + struct nvme_ns *ns;
> +
> + down_read(&ctrl->namespaces_rwsem);
> + list_for_each_entry(ns, &ctrl->namespaces, list)
> + blk_sync_queue(ns->queue);
> + up_read(&ctrl->namespaces_rwsem);
> +}
> +EXPORT_SYMBOL_GPL(nvme_sync_queues);

This approach can't sync timeouts reliably, since timeout events can
come from two namespaces at the same time, and one may be handled as
RESET_TIMER while the other is handled as EH_HANDLED.


Thanks,
Ming


Re: [PATCH V6 11/11] nvme: pci: support nested EH

2018-05-18 Thread Ming Lei
On Fri, May 18, 2018 at 07:57:51AM -0600, Keith Busch wrote:
> On Fri, May 18, 2018 at 08:20:05AM +0800, Ming Lei wrote:
> > What I think makes block/011 helpful is that it can trigger an IO
> > timeout during reset, which can happen in reality too.
> 
> As I mentioned earlier, there is nothing wrong with the spirit of
> the test. What's wrong with it is the misguided implementation.
> 
> Do you understand why it ever passes? It succeeds when the enabling
> part of the loop happens to coincide with the driver's enabling,
> leaving pci_dev->enable_cnt > 1, which makes subsequent disable parts
> of the loop do absolutely nothing; exactly the same as the one-liner
> (non-serious) patch I sent to defeat the test.
> 
> A better way to induce the timeout is:
> 
>   # setpci -s <dev> 4.w=0:6
> 
> This will halt the device without messing with the kernel structures,
> just like how a real device failure would occur.

Frankly speaking, I don't care how the test-case is implemented at all.

The big problem is that the NVMe driver can't handle an IO timeout in
the reset context: eventually either the controller becomes DEAD or the
reset context hangs forever, and nothing can make progress.

The issue can be reproduced more easily via io-timeout-fail fault injection.

So could we please face the real issue instead of working around the
test case?

Thanks,
Ming


Re: [PATCH v10 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64

2018-05-18 Thread Palmer Dabbelt

On Tue, 15 May 2018 15:51:20 PDT (-0700), bart.vanass...@wdc.com wrote:

The next patch in this series introduces a call to cmpxchg64()
in the block layer core for those architectures on which this
functionality is available. Make it possible to test whether
cmpxchg64() is available by introducing CONFIG_ARCH_HAVE_CMPXCHG64.

Signed-off-by: Bart Van Assche 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Geert Uytterhoeven 
Cc: "James E.J. Bottomley" 
Cc: Helge Deller 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: David S. Miller 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: H. Peter Anvin 
Cc: Chris Zankel 
Cc: Max Filippov 
Cc: Arnd Bergmann 
Cc: Jonathan Corbet 
---
 .../features/locking/cmpxchg64/arch-support.txt| 33 ++
 arch/Kconfig   |  3 ++
 arch/alpha/Kconfig |  1 +
 arch/arm/Kconfig   |  1 +
 arch/arm64/Kconfig |  1 +
 arch/ia64/Kconfig  |  1 +
 arch/m68k/Kconfig  |  1 +
 arch/mips/Kconfig  |  1 +
 arch/parisc/Kconfig|  1 +
 arch/powerpc/Kconfig   |  1 +
 arch/riscv/Kconfig |  1 +
 arch/s390/Kconfig  |  1 +
 arch/sparc/Kconfig |  1 +
 arch/x86/Kconfig   |  1 +
 arch/xtensa/Kconfig|  1 +
 15 files changed, 49 insertions(+)
 create mode 100644 Documentation/features/locking/cmpxchg64/arch-support.txt


[...]


--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -8,6 +8,7 @@ config RISCV
select OF
select OF_EARLY_FLATTREE
select OF_IRQ
+   select ARCH_HAVE_CMPXCHG64
select ARCH_WANT_FRAME_POINTERS
select CLONE_BACKWARDS
select COMMON_CLK


If I understand correctly, we should only have ARCH_HAVE_CMPXCHG64 on 64-bit 
RISC-V systems so this should look something like


  select ARCH_HAVE_CMPXCHG64 if 64BIT

Of course, the RV32I port is broken right now so it's not a big deal, but we're 
working through making it less broken...


Re: [PATCH v4 3/3] fs: Add aio iopriority support for block_dev

2018-05-18 Thread kbuild test robot
Hi Adam,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on next-20180516]
[cannot apply to linus/master block/for-next v4.17-rc5 v4.17-rc4 v4.17-rc3 
v4.17-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/adam-manzanares-wdc-com/AIO-add-per-command-iopriority/20180519-031848
config: x86_64-randconfig-x013-201819 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   fs/aio.c: In function 'aio_prep_rw':
>> fs/aio.c:1460:9: error: implicit declaration of function 'ioprio_check_cap'; 
>> did you mean 'param_check_charp'? [-Werror=implicit-function-declaration]
  ret = ioprio_check_cap(iocb->aio_reqprio);
^~~~
param_check_charp
   cc1: some warnings being treated as errors

vim +1460 fs/aio.c

  1440  
  1441  static int aio_prep_rw(struct kiocb *req, struct iocb *iocb)
  1442  {
  1443  int ret;
  1444  
  1445  req->ki_filp = fget(iocb->aio_fildes);
  1446  if (unlikely(!req->ki_filp))
  1447  return -EBADF;
  1448  req->ki_complete = aio_complete_rw;
  1449  req->ki_pos = iocb->aio_offset;
  1450  req->ki_flags = iocb_flags(req->ki_filp);
  1451  if (iocb->aio_flags & IOCB_FLAG_RESFD)
  1452  req->ki_flags |= IOCB_EVENTFD;
  1453  req->ki_hint = file_write_hint(req->ki_filp);
  1454  if (iocb->aio_flags & IOCB_FLAG_IOPRIO) {
  1455  /*
  1456   * If the IOCB_FLAG_IOPRIO flag of aio_flags is set, 
then
  1457   * aio_reqprio is interpreted as an I/O scheduling
  1458   * class and priority.
  1459   */
> 1460  ret = ioprio_check_cap(iocb->aio_reqprio);
  1461  if (ret) {
  1462  pr_debug("aio ioprio check cap error\n");
  1463  return -EINVAL;
  1464  }
  1465  
  1466  req->ki_ioprio = iocb->aio_reqprio;
  1467  req->ki_flags |= IOCB_IOPRIO;
  1468  }
  1469  
  1470  ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags);
  1471  if (unlikely(ret))
  1472  fput(req->ki_filp);
  1473  return ret;
  1474  }
  1475  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH v4 3/3] fs: Add aio iopriority support for block_dev

2018-05-18 Thread Adam Manzanares


On 5/18/18 9:06 AM, Christoph Hellwig wrote:
> Looks fine, although I'd split it into an aio and a block_dev patch.
> 
> Also please wire this up for the fs/iomap.c direct I/O code, it should
> be essentially the same snippet as in the block_dev.c code.
> 

Will do.

Re: [PATCH v4 2/3] fs: Convert kiocb rw_hint from enum to u16

2018-05-18 Thread Adam Manzanares


On 5/18/18 9:05 AM, Christoph Hellwig wrote:
>> +/* ki_hint changed from enum to u16, make sure rw_hint fits into u16 */
> 
> I don't think this comment is very useful.
> 
>> +static inline u16 ki_hint_valid(enum rw_hint hint)
> 
> I'd call this ki_hint_validate.
> 
>> +{
>> +if (hint > MAX_KI_HINT)
>> +return 0;
>> +
>> +return hint;
> 
> Nit: kill the empty line.
> 

I'll clean this up in the next revision.

[PATCH v3 0/2] Ensure that a request queue is dissociated from the cgroup controller

2018-05-18 Thread Bart Van Assche
Hello Jens,

Several block drivers call alloc_disk() followed by put_disk() if
something fails before device_add_disk() is called without calling
blk_cleanup_queue(). Make sure that also for this scenario a request
queue is dissociated from the cgroup controller. This patch avoids
that loading the parport_pc, paride and pf drivers trigger a kernel
crash. Please consider these patches for the upstream kernel.

Thanks,

Bart.

Changes between v2 and v3:
- Avoid code duplication by introducing a new helper function.

Changes between v1 and v2:
- Fixed the build for CONFIG_BLK_CGROUP=n.

Bart Van Assche (2):
  block: Introduce blk_exit_queue()
  block: Ensure that a request queue is dissociated from the cgroup
controller

 block/blk-core.c  | 54 ++
 block/blk-sysfs.c | 25 +
 block/blk.h   |  1 +
 3 files changed, 56 insertions(+), 24 deletions(-)

-- 
2.16.3



[PATCH v3 2/2] block: Ensure that a request queue is dissociated from the cgroup controller

2018-05-18 Thread Bart Van Assche
Several block drivers call alloc_disk() followed by put_disk() if
something fails before device_add_disk() is called without calling
blk_cleanup_queue(). Make sure that a request queue is dissociated from
the cgroup controller in this scenario as well. This patch avoids the
following kernel crash, triggered by loading the parport_pc, paride and
pf drivers:

BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride]
Read of size 4 at addr 0008 by task modprobe/744
Call Trace:
dump_stack+0x9a/0xeb
kasan_report+0x139/0x350
pi_init+0x42e/0x580 [paride]
pf_init+0x2bb/0x1000 [pf]
do_one_initcall+0x8e/0x405
do_init_module+0xd9/0x2f2
load_module+0x3ab4/0x4700
SYSC_finit_module+0x176/0x1a0
do_syscall_64+0xee/0x2b0
entry_SYSCALL_64_after_hwframe+0x42/0xb7

Reported-by: Alexandru Moise <00moses.alexande...@gmail.com>
Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the 
block cgroup controller")
Signed-off-by: Bart Van Assche 
Tested-by: Alexandru Moise <00moses.alexande...@gmail.com>
Cc: Tejun Heo 
Cc: Alexandru Moise <00moses.alexande...@gmail.com>
Cc: Joseph Qi 
---
 block/blk-sysfs.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 9cf41fee3790..a239c73fa20f 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -804,6 +804,31 @@ static void __blk_release_queue(struct work_struct *work)
blk_stat_remove_callback(q, q->poll_cb);
blk_stat_free_callback(q->poll_cb);
 
+   if (!blk_queue_dead(q)) {
+   /*
+* Last reference was dropped without having called
+* blk_cleanup_queue().
+*/
+   WARN_ONCE(blk_queue_init_done(q),
+ "request queue %p has been registered but 
blk_cleanup_queue() has not been called for that queue\n",
+ q);
+   blk_exit_queue(q);
+   }
+
+#ifdef CONFIG_BLK_CGROUP
+   {
+   struct blkcg_gq *blkg;
+
+   rcu_read_lock();
+   blkg = blkg_lookup(&blkcg_root, q);
+   rcu_read_unlock();
+
+   WARN(blkg,
+"request queue %p is being released but it has not yet 
been removed from the blkcg controller\n",
+q);
+   }
+#endif
+
blk_free_queue_stats(q->stats);
 
blk_exit_rl(q, &q->root_rl);
-- 
2.16.3



[PATCH v3 1/2] block: Introduce blk_exit_queue()

2018-05-18 Thread Bart Van Assche
This patch does not change any functionality.

Signed-off-by: Bart Van Assche 
Cc: Alexandru Moise <00moses.alexande...@gmail.com>
Cc: Hannes Reinecke 
Cc: Ming Lei 
Cc: Omar Sandoval 
Cc: Joseph Qi 
---
 block/blk-core.c | 54 ++
 block/blk.h  |  1 +
 2 files changed, 31 insertions(+), 24 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4dadff238b02..ac28a9c7136f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -720,6 +720,35 @@ void blk_set_queue_dying(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_set_queue_dying);
 
+/* Unconfigure the I/O scheduler and dissociate from the cgroup controller. */
+void blk_exit_queue(struct request_queue *q)
+{
+   /*
+* Since the I/O scheduler exit code may access cgroup information,
+* perform I/O scheduler exit before disassociating from the block
+* cgroup controller.
+*/
+   if (q->elevator) {
+   ioc_clear_queue(q);
+   elevator_exit(q, q->elevator);
+   q->elevator = NULL;
+   }
+
+   /*
+* Remove all references to @q from the block cgroup controller before
+* restoring @q->queue_lock to avoid that restoring this pointer causes
+* e.g. blkcg_print_blkgs() to crash.
+*/
+   blkcg_exit_queue(q);
+
+   /*
+* Since the cgroup code may dereference the @q->backing_dev_info
+* pointer, only decrease its reference count after having removed the
+* association with the block cgroup controller.
+*/
+   bdi_put(q->backing_dev_info);
+}
+
 /**
  * blk_cleanup_queue - shutdown a request queue
  * @q: request queue to shutdown
@@ -785,30 +814,7 @@ void blk_cleanup_queue(struct request_queue *q)
 */
WARN_ON_ONCE(q->kobj.state_in_sysfs);
 
-   /*
-* Since the I/O scheduler exit code may access cgroup information,
-* perform I/O scheduler exit before disassociating from the block
-* cgroup controller.
-*/
-   if (q->elevator) {
-   ioc_clear_queue(q);
-   elevator_exit(q, q->elevator);
-   q->elevator = NULL;
-   }
-
-   /*
-* Remove all references to @q from the block cgroup controller before
-* restoring @q->queue_lock to avoid that restoring this pointer causes
-* e.g. blkcg_print_blkgs() to crash.
-*/
-   blkcg_exit_queue(q);
-
-   /*
-* Since the cgroup code may dereference the @q->backing_dev_info
-* pointer, only decrease its reference count after having removed the
-* association with the block cgroup controller.
-*/
-   bdi_put(q->backing_dev_info);
+   blk_exit_queue(q);
 
if (q->mq_ops)
blk_mq_free_queue(q);
diff --git a/block/blk.h b/block/blk.h
index 204a0345996c..ec914ae7130c 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -130,6 +130,7 @@ void blk_free_flush_queue(struct blk_flush_queue *q);
 int blk_init_rl(struct request_list *rl, struct request_queue *q,
gfp_t gfp_mask);
 void blk_exit_rl(struct request_queue *q, struct request_list *rl);
+void blk_exit_queue(struct request_queue *q);
 void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
struct bio *bio);
 void blk_queue_bypass_start(struct request_queue *q);
-- 
2.16.3



Re: [PATCH v11 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64

2018-05-18 Thread hpa
On May 18, 2018 11:00:05 AM PDT, Bart Van Assche  wrote:
>The next patch in this series introduces a call to cmpxchg64()
>in the block layer core for those architectures on which this
>functionality is available. Make it possible to test whether
>cmpxchg64() is available by introducing CONFIG_ARCH_HAVE_CMPXCHG64.
>
>Signed-off-by: Bart Van Assche 
>Cc: Catalin Marinas 
>Cc: Will Deacon 
>Cc: Tony Luck 
>Cc: Fenghua Yu 
>Cc: Geert Uytterhoeven 
>Cc: "James E.J. Bottomley" 
>Cc: Helge Deller 
>Cc: Benjamin Herrenschmidt 
>Cc: Paul Mackerras 
>Cc: Michael Ellerman 
>Cc: Martin Schwidefsky 
>Cc: Heiko Carstens 
>Cc: David S. Miller 
>Cc: Thomas Gleixner 
>Cc: Ingo Molnar 
>Cc: H. Peter Anvin 
>Cc: Chris Zankel 
>Cc: Max Filippov 
>Cc: Arnd Bergmann 
>Cc: Jonathan Corbet 
>---
>.../features/locking/cmpxchg64/arch-support.txt| 33
>++
> arch/Kconfig   |  4 +++
> arch/arm/Kconfig   |  1 +
> arch/ia64/Kconfig  |  1 +
> arch/m68k/Kconfig  |  1 +
> arch/mips/Kconfig  |  1 +
> arch/parisc/Kconfig|  1 +
> arch/riscv/Kconfig |  1 +
> arch/sparc/Kconfig |  1 +
> arch/x86/Kconfig   |  1 +
> arch/xtensa/Kconfig|  1 +
> 11 files changed, 46 insertions(+)
>create mode 100644
>Documentation/features/locking/cmpxchg64/arch-support.txt
>
>diff --git a/Documentation/features/locking/cmpxchg64/arch-support.txt
>b/Documentation/features/locking/cmpxchg64/arch-support.txt
>new file mode 100644
>index ..84bfef7242b2
>--- /dev/null
>+++ b/Documentation/features/locking/cmpxchg64/arch-support.txt
>@@ -0,0 +1,33 @@
>+#
>+# Feature name:  cmpxchg64
>+# Kconfig:   ARCH_HAVE_CMPXCHG64
>+# description:   arch supports the cmpxchg64() API
>+#
>+---
>+| arch |status|
>+---
>+|   alpha: |  ok  |
>+| arc: |  ..  |
>+| arm: |  ok  |
>+|   arm64: |  ok  |
>+| c6x: |  ..  |
>+|   h8300: |  ..  |
>+| hexagon: |  ..  |
>+|ia64: |  ok  |
>+|m68k: |  ok  |
>+|  microblaze: |  ..  |
>+|mips: |  ok  |
>+|   nds32: |  ..  |
>+|   nios2: |  ..  |
>+|openrisc: |  ..  |
>+|  parisc: |  ok  |
>+| powerpc: |  ok  |
>+|   riscv: |  ok  |
>+|s390: |  ok  |
>+|  sh: |  ..  |
>+|   sparc: |  ok  |
>+|  um: |  ..  |
>+|   unicore32: |  ..  |
>+| x86: |  ok  |
>+|  xtensa: |  ok  |
>+---
>diff --git a/arch/Kconfig b/arch/Kconfig
>index 8e0d665c8d53..9840b2577af1 100644
>--- a/arch/Kconfig
>+++ b/arch/Kconfig
>@@ -358,6 +358,10 @@ config HAVE_ALIGNED_STRUCT_PAGE
> on a struct page for better performance. However selecting this
> might increase the size of a struct page by a word.
> 
>+config ARCH_HAVE_CMPXCHG64
>+  bool
>+  default y if 64BIT
>+
> config HAVE_CMPXCHG_LOCAL
>   bool
> 
>diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
>index a7f8e7f4b88f..02c75697176e 100644
>--- a/arch/arm/Kconfig
>+++ b/arch/arm/Kconfig
>@@ -13,6 +13,7 @@ config ARM
>   select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
>   select ARCH_HAS_STRICT_MODULE_RWX if MMU
>   select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
>+  select ARCH_HAVE_CMPXCHG64 if !THUMB2_KERNEL
>   select ARCH_HAVE_CUSTOM_GPIO_H
>   select ARCH_HAS_GCOV_PROFILE_ALL
>   select ARCH_MIGHT_HAVE_PC_PARPORT
>diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
>index bbe12a038d21..31c49e1482e2 100644
>--- a/arch/ia64/Kconfig
>+++ b/arch/ia64/Kconfig
>@@ -41,6 +41,7 @@ config IA64
>   select GENERIC_PENDING_IRQ if SMP
>   select GENERIC_IRQ_SHOW
>   select GENERIC_IRQ_LEGACY
>+  select ARCH_HAVE_CMPXCHG64
>   select ARCH_HAVE_NMI_SAFE_CMPXCHG
>   select GENERIC_IOMAP
>   select GENERIC_SMP_IDLE_THREAD
>diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig
>index 785612b576f7..7b87cda3bbed 100644
>--- a/arch/m68k/Kconfig
>+++ b/arch/m68k/Kconfig
>@@ -11,6 +11,7 @@ config M68K
>   select GENERIC_ATOMIC64
>   select HAVE_UID16
>   select VIRT_TO_BUS
>+  select 

[PATCH v11 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64

2018-05-18 Thread Bart Van Assche
The next patch in this series introduces a call to cmpxchg64()
in the block layer core for those architectures on which this
functionality is available. Make it possible to test whether
cmpxchg64() is available by introducing CONFIG_ARCH_HAVE_CMPXCHG64.

Signed-off-by: Bart Van Assche 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Geert Uytterhoeven 
Cc: "James E.J. Bottomley" 
Cc: Helge Deller 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: David S. Miller 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: H. Peter Anvin 
Cc: Chris Zankel 
Cc: Max Filippov 
Cc: Arnd Bergmann 
Cc: Jonathan Corbet 
---
 .../features/locking/cmpxchg64/arch-support.txt| 33 ++
 arch/Kconfig   |  4 +++
 arch/arm/Kconfig   |  1 +
 arch/ia64/Kconfig  |  1 +
 arch/m68k/Kconfig  |  1 +
 arch/mips/Kconfig  |  1 +
 arch/parisc/Kconfig|  1 +
 arch/riscv/Kconfig |  1 +
 arch/sparc/Kconfig |  1 +
 arch/x86/Kconfig   |  1 +
 arch/xtensa/Kconfig|  1 +
 11 files changed, 46 insertions(+)
 create mode 100644 Documentation/features/locking/cmpxchg64/arch-support.txt

diff --git a/Documentation/features/locking/cmpxchg64/arch-support.txt 
b/Documentation/features/locking/cmpxchg64/arch-support.txt
new file mode 100644
index ..84bfef7242b2
--- /dev/null
+++ b/Documentation/features/locking/cmpxchg64/arch-support.txt
@@ -0,0 +1,33 @@
+#
+# Feature name:  cmpxchg64
+# Kconfig:   ARCH_HAVE_CMPXCHG64
+# description:   arch supports the cmpxchg64() API
+#
+---
+| arch |status|
+---
+|   alpha: |  ok  |
+| arc: |  ..  |
+| arm: |  ok  |
+|   arm64: |  ok  |
+| c6x: |  ..  |
+|   h8300: |  ..  |
+| hexagon: |  ..  |
+|ia64: |  ok  |
+|m68k: |  ok  |
+|  microblaze: |  ..  |
+|mips: |  ok  |
+|   nds32: |  ..  |
+|   nios2: |  ..  |
+|openrisc: |  ..  |
+|  parisc: |  ok  |
+| powerpc: |  ok  |
+|   riscv: |  ok  |
+|s390: |  ok  |
+|  sh: |  ..  |
+|   sparc: |  ok  |
+|  um: |  ..  |
+|   unicore32: |  ..  |
+| x86: |  ok  |
+|  xtensa: |  ok  |
+---
diff --git a/arch/Kconfig b/arch/Kconfig
index 8e0d665c8d53..9840b2577af1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -358,6 +358,10 @@ config HAVE_ALIGNED_STRUCT_PAGE
  on a struct page for better performance. However selecting this
  might increase the size of a struct page by a word.
 
+config ARCH_HAVE_CMPXCHG64
+   bool
+   default y if 64BIT
+
 config HAVE_CMPXCHG_LOCAL
bool
 
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index a7f8e7f4b88f..02c75697176e 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -13,6 +13,7 @@ config ARM
select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
select ARCH_HAS_STRICT_MODULE_RWX if MMU
select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
+   select ARCH_HAVE_CMPXCHG64 if !THUMB2_KERNEL
select ARCH_HAVE_CUSTOM_GPIO_H
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_MIGHT_HAVE_PC_PARPORT
diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index bbe12a038d21..31c49e1482e2 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -41,6 +41,7 @@ config IA64
select GENERIC_PENDING_IRQ if SMP
select GENERIC_IRQ_SHOW
select GENERIC_IRQ_LEGACY
+   select ARCH_HAVE_CMPXCHG64
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select GENERIC_IOMAP
select GENERIC_SMP_IDLE_THREAD
diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig
index 785612b576f7..7b87cda3bbed 100644
--- a/arch/m68k/Kconfig
+++ b/arch/m68k/Kconfig
@@ -11,6 +11,7 @@ config M68K
select GENERIC_ATOMIC64
select HAVE_UID16
select VIRT_TO_BUS
+   select ARCH_HAVE_CMPXCHG64
select ARCH_HAVE_NMI_SAFE_CMPXCHG if RMW_INSNS
select GENERIC_CPU_DEVICES
select GENERIC_IOMAP
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 

[PATCH v11 0/2] blk-mq: Rework blk-mq timeout handling again

2018-05-18 Thread Bart Van Assche
Hello Jens,

This patch series reworks blk-mq timeout handling by introducing a state
machine per request. Please consider this patch series for inclusion in the
upstream kernel.

Bart.

Changes compared to v10:
- In patch 1/2, added "default y if 64BIT" to the "config ARCH_HAVE_CMPXCHG64"
  entry in arch/Kconfig. Left out the "select ARCH_HAVE_CMPXCHG64" statements
  that became superfluous due to this change (alpha, arm64, powerpc and s390).
- Also in patch 1/2, only select ARCH_HAVE_CMPXCHG64 if X86_CMPXCHG64 has been
  selected.
- In patch 2/2, moved blk_mq_change_rq_state() from blk-mq.h to blk-mq.c.
- Added a comment header above __blk_mq_requeue_request() and
  blk_mq_requeue_request().
- Documented the MQ_RQ_* state transitions in block/blk-mq.h.
- Left out the fourth argument of blk_mq_rq_set_deadline().

Changes compared to v9:
- Addressed multiple comments related to patch 1/2: added
  CONFIG_ARCH_HAVE_CMPXCHG64 for riscv, modified
  features/locking/cmpxchg64/arch-support.txt as requested and made the
  order of the symbols in the arch/*/Kconfig alphabetical where possible.

Changes compared to v8:
- Split into two patches.
- Moved the spin_lock_init() call from blk_mq_rq_ctx_init() into
  blk_mq_init_request().
- Fixed the deadline set by blk_add_timer().
- Surrounded the das_lock member with #ifndef CONFIG_ARCH_HAVE_CMPXCHG64 /
  #endif.

Changes compared to v7:
- Fixed the generation number mechanism. Note: with this patch applied the
  behavior of the block layer does not depend on the generation number.
- Added more 32-bit architectures to the list of architectures on which
  cmpxchg64() should not be used.

Changes compared to v6:
- Used a union instead of bit manipulations to store multiple values into
  a single 64-bit field.
- Reduced the size of the timeout field from 64 to 32 bits.
- Made sure that the block layer still builds with this patch applied
  for the sh and mips architectures.
- Fixed two sparse warnings that were introduced by this patch in the
  WRITE_ONCE() calls.

Changes compared to v5:
- Restored the synchronize_rcu() call between marking a request for timeout
  handling and the actual timeout handling to avoid that timeout handling
  starts while .queue_rq() is still in progress if the timeout is very short.
- Only use cmpxchg() if another context could attempt to change the request
  state concurrently. Use WRITE_ONCE() otherwise.

Changes compared to v4:
- Addressed multiple review comments from Christoph. The most important are
  that atomic_long_cmpxchg() has been changed into cmpxchg() and also that
  there is now a nice and clean split between the legacy and blk-mq versions
  of blk_add_timer().
- Changed the patch name and modified the patch description because there is
  disagreement about whether or not the v4.16 blk-mq core can complete a
  single request twice. Kept the "Cc: stable" tag because of
  https://bugzilla.kernel.org/show_bug.cgi?id=199077.

Changes compared to v3 (see also 
https://www.mail-archive.com/linux-block@vger.kernel.org/msg20073.html):
- Removed the spinlock again that was introduced to protect the request state.
  v4 uses atomic_long_cmpxchg() instead.
- Split __deadline into two variables - one for the legacy block layer and one
  for blk-mq.

Changes compared to v2 
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg18338.html):
- Rebased and retested on top of kernel v4.16.

Changes compared to v1 
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html):
- Removed the gstate and aborted_gstate members of struct request and used
  the __deadline member to encode both the generation and state information.

Bart Van Assche (2):
  arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64
  blk-mq: Rework blk-mq timeout handling again

 .../features/locking/cmpxchg64/arch-support.txt|  33 +++
 arch/Kconfig   |   4 +
 arch/arm/Kconfig   |   1 +
 arch/ia64/Kconfig  |   1 +
 arch/m68k/Kconfig  |   1 +
 arch/mips/Kconfig  |   1 +
 arch/parisc/Kconfig|   1 +
 arch/riscv/Kconfig |   1 +
 arch/sparc/Kconfig |   1 +
 arch/x86/Kconfig   |   1 +
 arch/xtensa/Kconfig|   1 +
 block/blk-core.c   |   6 -
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c | 238 ++---
 block/blk-mq.h |  64 +++---
 block/blk-timeout.c| 133 
 block/blk.h|  11 +-
 include/linux/blkdev.h |  47 ++--
 18 files changed, 308 insertions(+), 238 deletions(-)
 create mode 100644 

[PATCH v11 2/2] blk-mq: Rework blk-mq timeout handling again

2018-05-18 Thread Bart Van Assche
Recently the blk-mq timeout handling code was reworked. See also Tejun
Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
This patch reworks the blk-mq timeout handling code again. The timeout
handling code is simplified by introducing a state machine per request.
This change prevents the blk-mq timeout handling code from ignoring
completions that occur after blk_mq_check_expired() has been called and
before blk_mq_rq_timed_out() has been called.

Fix this race as follows:
- Replace the __deadline member of struct request by a new member
  called das that contains the generation number, state and deadline.
  Only 32 bits are used for the deadline field such that all three
  fields occupy only 64 bits. This change reduces the maximum supported
  request timeout value from (2**63/HZ) to (2**31/HZ).
- Remove all request member variables that became superfluous due to
  this change: gstate, gstate_seq and aborted_gstate_sync.
- Remove the request state information that became superfluous due to
  this patch, namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the code that became superfluous due to this change, namely
  the RCU lock and unlock statements in blk_mq_complete_request() and
  also the synchronize_rcu() call in the timeout handler.

Notes:
- A spinlock is used to protect atomic changes of rq->das on those
  architectures that do not provide a cmpxchg64() implementation.
- Atomic instructions are only used to update the request state if
  a concurrent request state change could be in progress.
- blk_add_timer() has been split into two functions - one for the
  legacy block layer and one for blk-mq.
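
As a rough userspace illustration of the scheme described above (field
widths and names here are only assumptions; the real definitions are
union blk_deadline_and_state and blk_mq_change_rq_state() in the patch
below): generation, state and a 32-bit deadline share a single 64-bit
word, so one 64-bit compare-and-swap can test the old state and install
the new one atomically.

/*
 * build: cc -o das das.c
 * Generation, state and a 32-bit deadline share one 64-bit word, so a
 * single 64-bit compare-and-swap can check the old state and install the
 * new one atomically.  Field widths below are only illustrative.
 */
#include <stdint.h>
#include <stdio.h>

enum mq_rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE };

union deadline_and_state {
    struct {
        uint64_t generation : 30;
        uint64_t state      : 2;
        uint64_t deadline   : 32;   /* jiffies, truncated to 32 bits */
    };
    uint64_t val;
};

/* succeed only if the request is still in @old_state */
static int change_rq_state(uint64_t *das, enum mq_rq_state old_state,
                           enum mq_rq_state new_state)
{
    union deadline_and_state oldv, newv;

    oldv.val = __atomic_load_n(das, __ATOMIC_RELAXED);
    newv = oldv;
    oldv.state = old_state;
    newv.state = new_state;
    return __atomic_compare_exchange_n(das, &oldv.val, newv.val, 0,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

int main(void)
{
    union deadline_and_state s = { .val = 0 };
    uint64_t das;

    s.state = MQ_RQ_IN_FLIGHT;
    s.deadline = 10000;             /* some deadline, in jiffies */
    das = s.val;

    /* the completion path wins the race ... */
    printf("complete: %d\n",
           change_rq_state(&das, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE));
    /* ... so a late timeout handler finds the state already changed */
    printf("timeout:  %d\n",
           change_rq_state(&das, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE));
    return 0;
}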

Signed-off-by: Bart Van Assche 
Cc: Tejun Heo 
Cc: Christoph Hellwig 
Cc: Jianchao Wang 
Cc: Ming Lei 
Cc: Sebastian Ott 
Cc: Sagi Grimberg 
Cc: Israel Rukshin ,
Cc: Max Gurtovoy 
---
 block/blk-core.c   |   6 --
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c | 238 ++---
 block/blk-mq.h |  64 ++---
 block/blk-timeout.c| 133 ++-
 block/blk.h|  11 +--
 include/linux/blkdev.h |  47 +-
 7 files changed, 262 insertions(+), 238 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 43370faee935..cee03cad99f2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -198,12 +198,6 @@ void blk_rq_init(struct request_queue *q, struct request 
*rq)
rq->internal_tag = -1;
rq->start_time_ns = ktime_get_ns();
rq->part = NULL;
-   seqcount_init(&rq->gstate_seq);
-   u64_stats_init(&rq->aborted_gstate_sync);
-   /*
-* See comment of blk_mq_init_request
-*/
-   WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 3080e18cb859..ffa622366922 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -344,7 +344,6 @@ static const char *const rqf_name[] = {
RQF_NAME(STATS),
RQF_NAME(SPECIAL_PAYLOAD),
RQF_NAME(ZONE_WRITE_LOCKED),
-   RQF_NAME(MQ_TIMEOUT_EXPIRED),
RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4cbfd784e837..e7dfa6ed7a44 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -318,7 +318,7 @@ static struct request *blk_mq_rq_ctx_init(struct 
blk_mq_alloc_data *data,
rq->special = NULL;
/* tag was already set */
rq->extra_len = 0;
-   rq->__deadline = 0;
+   WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
 
INIT_LIST_HEAD(&rq->timeout_list);
rq->timeout = 0;
@@ -465,6 +465,39 @@ struct request *blk_mq_alloc_request_hctx(struct 
request_queue *q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
 
+/**
+ * blk_mq_change_rq_state - atomically test and set request state
+ * @rq: Request pointer.
+ * @old_state: Old request state.
+ * @new_state: New request state.
+ *
+ * Returns %true if and only if the old state was @old and if the state has
+ * been changed into @new.
+ */
+static bool blk_mq_change_rq_state(struct request *rq,
+  enum mq_rq_state old_state,
+  enum mq_rq_state new_state)
+{
+   union blk_deadline_and_state das = READ_ONCE(rq->das);
+   union blk_deadline_and_state old_val = das;
+   union blk_deadline_and_state new_val = das;
+
+   WARN_ON_ONCE(new_state == MQ_RQ_IN_FLIGHT);
+
+   old_val.state = old_state;
+   new_val.state = new_state;
+   /*
+* For transitions from state in-flight to another state cmpxchg64()
+* must be used. For other state transitions it is safe to use
+* WRITE_ONCE().
+*/
+   if (old_state 

[PATCH blktests] Fix block/011 to not use sysfs for device disabling

2018-05-18 Thread Keith Busch
The PCI sysfs interface may not be a dependable method for toggling the
PCI device state to trigger the timeouts. This patch goes directly to
the config space to make device failure occur.

The success of this test is still sensitive to timing, as it may disable
IO memory when a driver is trying to bring it online. This can look like
a permanent device failure from the driver's perspective.
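
For reference, a small standalone sketch of what the setpci writes in this
test mean (illustrative only, not part of the test): config-space offset
0x04 is the PCI Command Register, and mask 0x6 covers bit 1 (Memory Space
Enable) and bit 2 (Bus Master Enable), so 4.w=0:6 clears both bits and the
device stops decoding MMIO and issuing DMA, while 4.w=6:6 sets them again.

/*
 * build: cc -o pcicmd pcicmd.c
 * setpci's reg.w=value:mask semantics are new = (old & ~mask) | (value & mask).
 * Offset 0x04 is the PCI Command Register; mask 0x6 selects bit 1
 * (Memory Space Enable) and bit 2 (Bus Master Enable).
 */
#include <stdint.h>
#include <stdio.h>

#define PCI_COMMAND_MEMORY  0x2     /* bit 1: device responds to MMIO */
#define PCI_COMMAND_MASTER  0x4     /* bit 2: device may issue DMA */

static uint16_t setpci_write(uint16_t old, uint16_t value, uint16_t mask)
{
    return (old & ~mask) | (value & mask);
}

int main(void)
{
    uint16_t cmd = 0x0406;          /* example: memory + bus master enabled */

    cmd = setpci_write(cmd, 0x0, 0x6);      /* 4.w=0:6 -> halt the device */
    printf("after 4.w=0:6: %#06x (MEM=%d, MASTER=%d)\n", (unsigned)cmd,
           !!(cmd & PCI_COMMAND_MEMORY), !!(cmd & PCI_COMMAND_MASTER));

    cmd = setpci_write(cmd, 0x6, 0x6);      /* 4.w=6:6 -> re-enable it */
    printf("after 4.w=6:6: %#06x (MEM=%d, MASTER=%d)\n", (unsigned)cmd,
           !!(cmd & PCI_COMMAND_MEMORY), !!(cmd & PCI_COMMAND_MASTER));
    return 0;
}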

Signed-off-by: Keith Busch 
---
 tests/block/011 | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/tests/block/011 b/tests/block/011
index 62e89f7..2fc0ffb 100755
--- a/tests/block/011
+++ b/tests/block/011
@@ -21,7 +21,7 @@ DESCRIPTION="disable PCI device while doing I/O"
 TIMED=1
 
 requires() {
-   _have_fio
+   _have_fio && _have_program setpci
 }
 
 device_requires() {
@@ -43,10 +43,11 @@ test_device() {
_run_fio_rand_io --filename="$TEST_DEV" --size="$size" \
--ignore_error=EIO,ENXIO,ENODEV &
 
+   # toggle PCI Command Register's Memory and Bus Master enabling
while kill -0 $! 2>/dev/null; do
-   echo 0 > "/sys/bus/pci/devices/${pdev}/enable"
+   setpci -s "${pdev}" 4.w=0:6
sleep .2
-   echo 1 > "/sys/bus/pci/devices/${pdev}/enable"
+   setpci -s "${pdev}" 4.w=6:6
sleep .2
done
 
-- 
2.14.3



Re: [PATCH 02/10] block: Convert bio_set to mempool_init()

2018-05-18 Thread Kent Overstreet
On Fri, May 18, 2018 at 09:20:28AM -0700, Christoph Hellwig wrote:
> On Tue, May 08, 2018 at 09:33:50PM -0400, Kent Overstreet wrote:
> > Minor performance improvement by getting rid of pointer indirections
> > from allocation/freeing fastpaths.
> 
> Can you please also send along a conversion for the remaining
> few bioset_create users?  It would be rather silly to keep two
> almost identical interfaces around for just about two handfuls
> of users.

Yeah, I can do that


[PATCH 05/34] fs: use ->is_partially_uptodate in page_cache_seek_hole_data

2018-05-18 Thread Christoph Hellwig
This way the implementation doesn't depend on buffer_head internals.

Signed-off-by: Christoph Hellwig 
---
 fs/iomap.c | 83 +++---
 1 file changed, 42 insertions(+), 41 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index bef5e91d40bf..0fecd5789d7b 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -594,31 +594,54 @@ EXPORT_SYMBOL_GPL(iomap_fiemap);
  *
  * Returns the offset within the file on success, and -ENOENT otherwise.
  */
-static loff_t
-page_seek_hole_data(struct page *page, loff_t lastoff, int whence)
+static bool
+page_seek_hole_data(struct inode *inode, struct page *page, loff_t *lastoff,
+   int whence)
 {
-   loff_t offset = page_offset(page);
-   struct buffer_head *bh, *head;
+   const struct address_space_operations *ops = inode->i_mapping->a_ops;
+   unsigned int bsize = i_blocksize(inode), off;
bool seek_data = whence == SEEK_DATA;
+   loff_t poff = page_offset(page);
 
-   if (lastoff < offset)
-   lastoff = offset;
-
-   bh = head = page_buffers(page);
-   do {
-   offset += bh->b_size;
-   if (lastoff >= offset)
-   continue;
+   if (WARN_ON_ONCE(*lastoff >= poff + PAGE_SIZE))
+   return false;
 
+   if (*lastoff < poff) {
/*
-* Any buffer with valid data in it should have BH_Uptodate set.
+* Last offset smaller than the start of the page means we found
+* a hole:
 */
-   if (buffer_uptodate(bh) == seek_data)
-   return lastoff;
+   if (whence == SEEK_HOLE)
+   return true;
+   *lastoff = poff;
+   }
 
-   lastoff = offset;
-   } while ((bh = bh->b_this_page) != head);
-   return -ENOENT;
+   /*
+* Just check the page unless we can and should check block ranges:
+*/
+   if (bsize == PAGE_SIZE || !ops->is_partially_uptodate) {
+   if (PageUptodate(page) == seek_data)
+   return true;
+   return false;
+   }
+
+   lock_page(page);
+   if (unlikely(page->mapping != inode->i_mapping))
+   goto out_unlock_not_found;
+
+   for (off = 0; off < PAGE_SIZE; off += bsize) {
+   if ((*lastoff & ~PAGE_MASK) >= off + bsize)
+   continue;
+   if (ops->is_partially_uptodate(page, off, bsize) == seek_data) {
+   unlock_page(page);
+   return true;
+   }
+   *lastoff = poff + off + bsize;
+   }
+
+out_unlock_not_found:
+   unlock_page(page);
+   return false;
 }
 
 /*
@@ -655,30 +678,8 @@ page_cache_seek_hole_data(struct inode *inode, loff_t 
offset, loff_t length,
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
 
-   /*
-* At this point, the page may be truncated or
-* invalidated (changing page->mapping to NULL), or
-* even swizzled back from swapper_space to tmpfs file
-* mapping.  However, page->index will not change
-* because we have a reference on the page.
- *
-* If current page offset is beyond where we've ended,
-* we've found a hole.
- */
-   if (whence == SEEK_HOLE &&
-   lastoff < page_offset(page))
+   if (page_seek_hole_data(inode, page, &lastoff, whence))
goto check_range;
-
-   lock_page(page);
-   if (likely(page->mapping == inode->i_mapping) &&
-   page_has_buffers(page)) {
-   lastoff = page_seek_hole_data(page, lastoff, 
whence);
-   if (lastoff >= 0) {
-   unlock_page(page);
-   goto check_range;
-   }
-   }
-   unlock_page(page);
lastoff = page_offset(page) + PAGE_SIZE;
}
pagevec_release(&pvec);
-- 
2.17.0



[PATCH 14/34] iomap: add an iomap-based bmap implementation

2018-05-18 Thread Christoph Hellwig
This adds a simple iomap-based implementation of the legacy ->bmap
interface.  Note that we can't easily add checks for rt or reflink
files, so these will have to remain in the callers.  This interface
just needs to die..

Signed-off-by: Christoph Hellwig 
---
 fs/iomap.c| 34 ++
 include/linux/iomap.h |  3 +++
 2 files changed, 37 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index 44259eadb69d..7c1b071d115c 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1419,3 +1419,37 @@ int iomap_swapfile_activate(struct swap_info_struct *sis,
 }
 EXPORT_SYMBOL_GPL(iomap_swapfile_activate);
 #endif /* CONFIG_SWAP */
+
+static loff_t
+iomap_bmap_actor(struct inode *inode, loff_t pos, loff_t length,
+   void *data, struct iomap *iomap)
+{
+   sector_t *bno = data, addr;
+
+   if (iomap->type == IOMAP_MAPPED) {
+   addr = (pos - iomap->offset + iomap->addr) >> inode->i_blkbits;
+   if (addr > INT_MAX)
+   WARN(1, "would truncate bmap result\n");
+   else
+   *bno = addr;
+   }
+   return 0;
+}
+
+/* legacy ->bmap interface.  0 is the error return (!) */
+sector_t
+iomap_bmap(struct address_space *mapping, sector_t bno,
+   const struct iomap_ops *ops)
+{
+   struct inode *inode = mapping->host;
+   loff_t pos = bno << inode->i_blkbits;
+   unsigned blocksize = i_blocksize(inode);
+
+   if (filemap_write_and_wait(mapping))
+   return 0;
+
+   bno = 0;
+   iomap_apply(inode, pos, blocksize, 0, ops, &bno, iomap_bmap_actor);
+   return bno;
+}
+EXPORT_SYMBOL_GPL(iomap_bmap);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 819e0cd2a950..a044a824da85 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -4,6 +4,7 @@
 
 #include 
 
+struct address_space;
 struct fiemap_extent_info;
 struct inode;
 struct iov_iter;
@@ -100,6 +101,8 @@ loff_t iomap_seek_hole(struct inode *inode, loff_t offset,
const struct iomap_ops *ops);
 loff_t iomap_seek_data(struct inode *inode, loff_t offset,
const struct iomap_ops *ops);
+sector_t iomap_bmap(struct address_space *mapping, sector_t bno,
+   const struct iomap_ops *ops);
 
 /*
  * Flags for direct I/O ->end_io:
-- 
2.17.0



[PATCH 12/34] iomap: use __bio_add_page in iomap_dio_zero

2018-05-18 Thread Christoph Hellwig
We don't need any merging logic, and this also replaces a BUG_ON with a
WARN_ON_ONCE inside __bio_add_page for the impossible overflow condition.

Signed-off-by: Christoph Hellwig 
---
 fs/iomap.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index a859e15d7bec..6427627a247f 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -957,8 +957,7 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, 
loff_t pos,
bio->bi_end_io = iomap_dio_bio_end_io;
 
get_page(page);
-   if (bio_add_page(bio, page, len, 0) != len)
-   BUG();
+   __bio_add_page(bio, page, len, 0);
bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
 
atomic_inc(&dio->ref);
-- 
2.17.0



[PATCH 20/34] xfs: simplify xfs_aops_discard_page

2018-05-18 Thread Christoph Hellwig
Instead of looking at the buffer heads to see if a block is delalloc just
call xfs_bmap_punch_delalloc_range on the whole page - this will leave
any non-delalloc block intact and handle the iteration for us.  As a side
effect one more place stops caring about buffer heads and we can remove the
xfs_check_page_type function entirely.

Signed-off-by: Christoph Hellwig 
---
 fs/xfs/xfs_aops.c | 85 +--
 1 file changed, 9 insertions(+), 76 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index c631c457b444..f2333e351e07 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -711,49 +711,6 @@ xfs_map_at_offset(
clear_buffer_unwritten(bh);
 }
 
-/*
- * Test if a given page contains at least one buffer of a given @type.
- * If @check_all_buffers is true, then we walk all the buffers in the page to
- * try to find one of the type passed in. If it is not set, then the caller 
only
- * needs to check the first buffer on the page for a match.
- */
-STATIC bool
-xfs_check_page_type(
-   struct page *page,
-   unsigned inttype,
-   boolcheck_all_buffers)
-{
-   struct buffer_head  *bh;
-   struct buffer_head  *head;
-
-   if (PageWriteback(page))
-   return false;
-   if (!page->mapping)
-   return false;
-   if (!page_has_buffers(page))
-   return false;
-
-   bh = head = page_buffers(page);
-   do {
-   if (buffer_unwritten(bh)) {
-   if (type == XFS_IO_UNWRITTEN)
-   return true;
-   } else if (buffer_delay(bh)) {
-   if (type == XFS_IO_DELALLOC)
-   return true;
-   } else if (buffer_dirty(bh) && buffer_mapped(bh)) {
-   if (type == XFS_IO_OVERWRITE)
-   return true;
-   }
-
-   /* If we are only checking the first buffer, we are done now. */
-   if (!check_all_buffers)
-   break;
-   } while ((bh = bh->b_this_page) != head);
-
-   return false;
-}
-
 STATIC void
 xfs_vm_invalidatepage(
struct page *page,
@@ -785,9 +742,6 @@ xfs_vm_invalidatepage(
  * transaction. Indeed - if we get ENOSPC errors, we have to be able to do this
  * truncation without a transaction as there is no space left for block
  * reservation (typically why we see a ENOSPC in writeback).
- *
- * This is not a performance critical path, so for now just do the punching a
- * buffer head at a time.
  */
 STATIC void
 xfs_aops_discard_page(
@@ -795,47 +749,26 @@ xfs_aops_discard_page(
 {
struct inode*inode = page->mapping->host;
struct xfs_inode*ip = XFS_I(inode);
-   struct buffer_head  *bh, *head;
+   struct xfs_mount*mp = ip->i_mount;
loff_t  offset = page_offset(page);
+   xfs_fileoff_t   start_fsb = XFS_B_TO_FSBT(mp, offset);
+   int error;
 
-   if (!xfs_check_page_type(page, XFS_IO_DELALLOC, true))
-   goto out_invalidate;
-
-   if (XFS_FORCED_SHUTDOWN(ip->i_mount))
+   if (XFS_FORCED_SHUTDOWN(mp))
goto out_invalidate;
 
-   xfs_alert(ip->i_mount,
+   xfs_alert(mp,
"page discard on page "PTR_FMT", inode 0x%llx, offset %llu.",
page, ip->i_ino, offset);
 
xfs_ilock(ip, XFS_ILOCK_EXCL);
-   bh = head = page_buffers(page);
-   do {
-   int error;
-   xfs_fileoff_t   start_fsb;
-
-   if (!buffer_delay(bh))
-   goto next_buffer;
-
-   start_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
-   error = xfs_bmap_punch_delalloc_range(ip, start_fsb, 1);
-   if (error) {
-   /* something screwed, just bail */
-   if (!XFS_FORCED_SHUTDOWN(ip->i_mount)) {
-   xfs_alert(ip->i_mount,
-   "page discard unable to remove delalloc mapping.");
-   }
-   break;
-   }
-next_buffer:
-   offset += i_blocksize(inode);
-
-   } while ((bh = bh->b_this_page) != head);
-
+   error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
+   PAGE_SIZE / i_blocksize(inode));
xfs_iunlock(ip, XFS_ILOCK_EXCL);
+   if (error && !XFS_FORCED_SHUTDOWN(mp))
+   xfs_alert(mp, "page discard unable to remove delalloc 
mapping.");
 out_invalidate:
xfs_vm_invalidatepage(page, 0, PAGE_SIZE);
-   return;
 }
 
 static int
-- 
2.17.0



[PATCH 22/34] xfs: make xfs_writepage_map extent map centric

2018-05-18 Thread Christoph Hellwig
From: Dave Chinner 

xfs_writepage_map() iterates over the bufferheads on a page to decide
what sort of IO to do and what actions to take.  However, when it comes
to reflink and deciding when it needs to execute a COW operation, we no
longer look at the bufferhead state but instead we ignore that and look
up internal state held in the COW fork extent list.

This means xfs_writepage_map() is somewhat confused. It does stuff, then
ignores it, then tries to handle the impedance mismatch by shovelling the
results inside the existing mapping code.  It works, but it's a bit of a
mess and it makes it hard to fix the cached map bug that the writepage
code currently has.

To unify the two different mechanisms, we first have to choose a direction.
That's already been set - we're de-emphasising bufferheads so they are no
longer a control structure as we need to do that to allow for eventual
removal.  Hence we need to move away from looking at bufferhead state to
determine what operations we need to perform.

We can't completely get rid of bufferheads yet - they do contain some
state that is absolutely necessary, such as whether that part of the page
contains valid data or not (buffer_uptodate()).  Other state in the
bufferhead is redundant:

BH_dirty - the page is dirty, so we can ignore this and just
write it
BH_delay - we have delalloc extent info in the DATA fork extent
tree
BH_unwritten - same as BH_delay
BH_mapped - indicates we've already used it once for IO and it is
mapped to a disk address. Needs to be ignored for COW
blocks.

The BH_mapped flag is an interesting case - it's supposed to indicate that
it's already mapped to disk and so we can just use it "as is".  In theory,
we don't even have to do an extent lookup to find where to write it to,
but we have to do that anyway to determine we are actually writing over a
valid extent.  Hence it's not even serving the purpose of avoiding an
extent lookup during writeback, and so we can pretty much ignore it.
Especially as we have to ignore it for COW operations...

Therefore, use the extent map as the source of information to tell us
what actions we need to take and what sort of IO we should perform.  The
first step is integration xfs_map_blocks() and xfs_map_cow() and have
xfs_map_blocks() set the io type according to what it looks up.  This
means it can easily handle both normal overwrite and COW cases.  The
only thing we also need to add is the ability to return hole mappings.

We need to return and cache hole mappings now for the case of multiple
blocks per page.  We no longer use the BH_mapped to indicate a block over
a hole, so we have to get that info from xfs_map_blocks().  We cache it so
that holes that span two pages don't need separate lookups.  This allows us
to avoid ever doing write IO over a hole, too.

Further, we need to drop the XFS_BMAPI_IGSTATE flag so that we don't
combine contiguous written and unwritten extents into a single map.  The
io type needs to match the extent type we are writing to so that we run the
correct IO completion routine for the IO. There is scope for optimisation
that would allow us to re-instate the XFS_BMAPI_IGSTATE flag, but this
requires tweaks to code outside the scope of this change.

Now that we have xfs_map_blocks() returning both a cached map and the type
of IO we need to perform, we can rewrite xfs_writepage_map() to drop all
the bufferhead control. It's also much simplified because it doesn't need
to explicitly handle COW operations.  Instead of iterating bufferheads, it
iterates blocks within the page and then looks up what per-block state is
required from the appropriate bufferhead.  It then validates the cached
map, and if it's not valid, we get a new map.  If we don't get a valid map
or it's over a hole, we skip the block.

At this point, we have to remap the bufferhead via xfs_map_at_offset().
As previously noted, we had to do this even if the buffer was already
mapped as the mapping would be stale for XFS_IO_DELALLOC, XFS_IO_UNWRITTEN
and XFS_IO_COW IO types.  With xfs_map_blocks() now controlling the type,
even XFS_IO_OVERWRITE types need remapping, as converted-but-not-yet-
written delalloc extents beyond EOF can be reported at XFS_IO_OVERWRITE.
Bufferheads that span such regions still need their BH_Delay flags cleared
and their block numbers calculated, so we now unconditionally map each
bufferhead before submission.

But wait! There's more - remember the old "treat unwritten extents as
holes on read" hack?  Yeah, that means we can have a dirty page with
unmapped, unwritten bufferheads that contain data!  What makes these so
special is that the unwritten "hole" bufferheads do not have a valid block
device pointer, so if we attempt to write them xfs_add_to_ioend() blows
up. So we make xfs_map_at_offset() do the "realtime or data device"
lookup from the inode and ignore what was or 

[PATCH 19/34] xfs: simplify xfs_bmap_punch_delalloc_range

2018-05-18 Thread Christoph Hellwig
Instead of using xfs_bmapi_read to find delalloc extents and then punch
them out using xfs_bunmapi, opencode the loop to iterate over the extents
and call xfs_bmap_del_extent_delay directly.  This both simplifies the
code and reduces the number of extent tree lookups required.

Signed-off-by: Christoph Hellwig 
---
 fs/xfs/xfs_bmap_util.c | 78 ++
 1 file changed, 25 insertions(+), 53 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 06badcbadeb4..c009bdf9fdce 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -695,12 +695,10 @@ xfs_getbmap(
 }
 
 /*
- * dead simple method of punching delalyed allocation blocks from a range in
- * the inode. Walks a block at a time so will be slow, but is only executed in
- * rare error cases so the overhead is not critical. This will always punch out
- * both the start and end blocks, even if the ranges only partially overlap
- * them, so it is up to the caller to ensure that partial blocks are not
- * passed in.
+ * Dead simple method of punching delalyed allocation blocks from a range in
+ * the inode.  This will always punch out both the start and end blocks, even
+ * if the ranges only partially overlap them, so it is up to the caller to
+ * ensure that partial blocks are not passed in.
  */
 int
 xfs_bmap_punch_delalloc_range(
@@ -708,63 +706,37 @@ xfs_bmap_punch_delalloc_range(
xfs_fileoff_t   start_fsb,
xfs_fileoff_t   length)
 {
-   xfs_fileoff_t   remaining = length;
+   struct xfs_ifork*ifp = &ip->i_df;
+   struct xfs_bmbt_irecgot, del;
+   struct xfs_iext_cursor  icur;
int error = 0;
 
ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 
-   do {
-   int done;
-   xfs_bmbt_irec_t imap;
-   int nimaps = 1;
-   xfs_fsblock_t   firstblock;
-   struct xfs_defer_ops dfops;
+   if (!(ifp->if_flags & XFS_IFEXTENTS)) {
+   error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
+   if (error)
+   return error;
+   }
 
-   /*
-* Map the range first and check that it is a delalloc extent
-* before trying to unmap the range. Otherwise we will be
-* trying to remove a real extent (which requires a
-* transaction) or a hole, which is probably a bad idea...
-*/
-   error = xfs_bmapi_read(ip, start_fsb, 1, , ,
-  XFS_BMAPI_ENTIRE);
+   if (!xfs_iext_lookup_extent(ip, ifp, start_fsb, &icur, &got))
+   return 0;
 
-   if (error) {
-   /* something screwed, just bail */
-   if (!XFS_FORCED_SHUTDOWN(ip->i_mount)) {
-   xfs_alert(ip->i_mount,
-   "Failed delalloc mapping lookup ino %lld fsb %lld.",
-   ip->i_ino, start_fsb);
-   }
+   do {
+   if (got.br_startoff >= start_fsb + length)
break;
-   }
-   if (!nimaps) {
-   /* nothing there */
-   goto next_block;
-   }
-   if (imap.br_startblock != DELAYSTARTBLOCK) {
-   /* been converted, ignore */
-   goto next_block;
-   }
-   WARN_ON(imap.br_blockcount == 0);
+   if (!isnullstartblock(got.br_startblock))
+   continue;
 
-   /*
-* Note: while we initialise the firstblock/dfops pair, they
-* should never be used because blocks should never be
-* allocated or freed for a delalloc extent and hence we need
-* don't cancel or finish them after the xfs_bunmapi() call.
-*/
-   xfs_defer_init(, );
-   error = xfs_bunmapi(NULL, ip, start_fsb, 1, 0, 1, ,
-   , );
+   del = got;
+   xfs_trim_extent(, start_fsb, length);
+   error = xfs_bmap_del_extent_delay(ip, XFS_DATA_FORK, &icur,
+   &got, &del);
if (error)
break;
-
-   ASSERT(!xfs_defer_has_unfinished_work());
-next_block:
-   start_fsb++;
-   remaining--;
-   } while(remaining > 0);
+   if (!xfs_iext_get_extent(ifp, &icur, &got))
+   break;
+   } while (xfs_iext_next_extent(ifp, &icur, &got));
 
return error;
 }
-- 
2.17.0



[PATCH 24/34] xfs: remove xfs_reflink_find_cow_mapping

2018-05-18 Thread Christoph Hellwig
We only have one caller left, and open coding the simple extent list
lookup in it allows us to both make the code more understandable and
reuse calculations and variables already present.

Signed-off-by: Christoph Hellwig 
---
 fs/xfs/xfs_aops.c| 17 -
 fs/xfs/xfs_reflink.c | 30 --
 fs/xfs/xfs_reflink.h |  2 --
 fs/xfs/xfs_trace.h   |  1 -
 4 files changed, 12 insertions(+), 38 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index a50f69c2c602..a4b4a7037deb 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -385,6 +385,7 @@ xfs_map_blocks(
ssize_t count = i_blocksize(inode);
xfs_fileoff_t   offset_fsb, end_fsb;
int whichfork = XFS_DATA_FORK;
+   struct xfs_iext_cursor  icur;
int error = 0;
int nimaps = 1;
 
@@ -396,8 +397,18 @@ xfs_map_blocks(
   (ip->i_df.if_flags & XFS_IFEXTENTS));
ASSERT(offset <= mp->m_super->s_maxbytes);
 
+   if (offset > mp->m_super->s_maxbytes - count)
+   count = mp->m_super->s_maxbytes - offset;
+   end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
+   offset_fsb = XFS_B_TO_FSBT(mp, offset);
+
+   /*
+* Check if this is offset is covered by a COW extents, and if yes use
+* it directly instead of looking up anything in the data fork.
+*/
if (xfs_is_reflink_inode(ip) &&
-   xfs_reflink_find_cow_mapping(ip, offset, imap)) {
+   xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &icur, imap) &&
+   imap->br_startoff <= offset_fsb) {
xfs_iunlock(ip, XFS_ILOCK_SHARED);
/*
 * Truncate can race with writeback since writeback doesn't
@@ -417,10 +428,6 @@ xfs_map_blocks(
goto done;
}
 
-   if (offset > mp->m_super->s_maxbytes - count)
-   count = mp->m_super->s_maxbytes - offset;
-   end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
-   offset_fsb = XFS_B_TO_FSBT(mp, offset);
error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
imap, &nimaps, XFS_BMAPI_ENTIRE);
if (!nimaps) {
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 713e857d9ffa..8e5eb8e70c89 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -484,36 +484,6 @@ xfs_reflink_allocate_cow(
return error;
 }
 
-/*
- * Find the CoW reservation for a given byte offset of a file.
- */
-bool
-xfs_reflink_find_cow_mapping(
-   struct xfs_inode*ip,
-   xfs_off_t   offset,
-   struct xfs_bmbt_irec*imap)
-{
-   struct xfs_ifork*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
-   xfs_fileoff_t   offset_fsb;
-   struct xfs_bmbt_irecgot;
-   struct xfs_iext_cursor  icur;
-
-   ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL | XFS_ILOCK_SHARED));
-
-   if (!xfs_is_reflink_inode(ip))
-   return false;
-   offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
-   if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb, , ))
-   return false;
-   if (got.br_startoff > offset_fsb)
-   return false;
-
-   trace_xfs_reflink_find_cow_mapping(ip, offset, 1, XFS_IO_OVERWRITE,
-   );
-   *imap = got;
-   return true;
-}
-
 /*
  * Trim an extent to end at the next CoW reservation past offset_fsb.
  */
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 701487bab468..15a456492667 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -32,8 +32,6 @@ extern int xfs_reflink_allocate_cow(struct xfs_inode *ip,
struct xfs_bmbt_irec *imap, bool *shared, uint *lockmode);
 extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
xfs_off_t count);
-extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t 
offset,
-   struct xfs_bmbt_irec *imap);
 extern void xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
xfs_fileoff_t offset_fsb, struct xfs_bmbt_irec *imap);
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 9d4c4ca24fe6..ed8f774944ba 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3227,7 +3227,6 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_convert_cow);
 DEFINE_RW_EVENT(xfs_reflink_reserve_cow);
 
 DEFINE_SIMPLE_IO_EVENT(xfs_reflink_bounce_dio_write);
-DEFINE_IOMAP_EVENT(xfs_reflink_find_cow_mapping);
 DEFINE_INODE_IREC_EVENT(xfs_reflink_trim_irec);
 
 DEFINE_SIMPLE_IO_EVENT(xfs_reflink_cancel_cow_range);
-- 
2.17.0



[PATCH 29/34] xfs: don't look at buffer heads in xfs_add_to_ioend

2018-05-18 Thread Christoph Hellwig
Calculate all information for the bio based on the passed in information
without requiring a buffer_head structure.

Signed-off-by: Christoph Hellwig 
---
 fs/xfs/xfs_aops.c | 68 ++-
 1 file changed, 32 insertions(+), 36 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index f01c1dd737ec..592b33b35a30 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -44,7 +44,6 @@ struct xfs_writepage_ctx {
struct xfs_bmbt_irecimap;
unsigned intio_type;
struct xfs_ioend*ioend;
-   sector_tlast_block;
 };
 
 void
@@ -545,11 +544,6 @@ xfs_start_page_writeback(
unlock_page(page);
 }
 
-static inline int xfs_bio_add_buffer(struct bio *bio, struct buffer_head *bh)
-{
-   return bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
-}
-
 /*
  * Submit the bio for an ioend. We are passed an ioend with a bio attached to
  * it, and we submit that bio. The ioend may be used for multiple bio
@@ -604,27 +598,20 @@ xfs_submit_ioend(
return 0;
 }
 
-static void
-xfs_init_bio_from_bh(
-   struct bio  *bio,
-   struct buffer_head  *bh)
-{
-   bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
-   bio_set_dev(bio, bh->b_bdev);
-}
-
 static struct xfs_ioend *
 xfs_alloc_ioend(
struct inode*inode,
unsigned inttype,
xfs_off_t   offset,
-   struct buffer_head  *bh)
+   struct block_device *bdev,
+   sector_tsector)
 {
struct xfs_ioend*ioend;
struct bio  *bio;
 
bio = bio_alloc_bioset(GFP_NOFS, BIO_MAX_PAGES, xfs_ioend_bioset);
-   xfs_init_bio_from_bh(bio, bh);
+   bio_set_dev(bio, bdev);
+   bio->bi_iter.bi_sector = sector;
 
ioend = container_of(bio, struct xfs_ioend, io_inline_bio);
INIT_LIST_HEAD(&ioend->io_list);
@@ -649,13 +636,14 @@ static void
 xfs_chain_bio(
struct xfs_ioend*ioend,
struct writeback_control *wbc,
-   struct buffer_head  *bh)
+   struct block_device *bdev,
+   sector_tsector)
 {
struct bio *new;
 
new = bio_alloc(GFP_NOFS, BIO_MAX_PAGES);
-   xfs_init_bio_from_bh(new, bh);
-
+   bio_set_dev(new, bdev);
+   new->bi_iter.bi_sector = sector;
bio_chain(ioend->io_bio, new);
bio_get(ioend->io_bio); /* for xfs_destroy_ioend */
ioend->io_bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
@@ -665,39 +653,45 @@ xfs_chain_bio(
 }
 
 /*
- * Test to see if we've been building up a completion structure for
- * earlier buffers -- if so, we try to append to this ioend if we
- * can, otherwise we finish off any current ioend and start another.
- * Return the ioend we finished off so that the caller can submit it
- * once it has finished processing the dirty page.
+ * Test to see if we have an existing ioend structure that we could append to
+ * first, otherwise finish off the current ioend and start another.
  */
 STATIC void
 xfs_add_to_ioend(
struct inode*inode,
-   struct buffer_head  *bh,
xfs_off_t   offset,
+   struct page *page,
struct xfs_writepage_ctx *wpc,
struct writeback_control *wbc,
struct list_head*iolist)
 {
+   struct xfs_inode*ip = XFS_I(inode);
+   struct xfs_mount*mp = ip->i_mount;
+   struct block_device *bdev = xfs_find_bdev_for_inode(inode);
+   unsignedlen = i_blocksize(inode);
+   unsignedpoff = offset & (PAGE_SIZE - 1);
+   sector_tsector;
+
+   sector = xfs_fsb_to_db(ip, wpc->imap.br_startblock) +
+   ((offset - XFS_FSB_TO_B(mp, wpc->imap.br_startoff)) >> 9);
+
if (!wpc->ioend || wpc->io_type != wpc->ioend->io_type ||
-   bh->b_blocknr != wpc->last_block + 1 ||
+   sector != bio_end_sector(wpc->ioend->io_bio) ||
offset != wpc->ioend->io_offset + wpc->ioend->io_size) {
if (wpc->ioend)
list_add(&wpc->ioend->io_list, iolist);
-   wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset, bh);
+   wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset,
+   bdev, sector);
}
 
/*
-* If the buffer doesn't fit into the bio we need to allocate a new
-* one.  This shouldn't happen more than once for a given buffer.
+* If the block doesn't fit into the bio we need to allocate a new
+* one.  This shouldn't happen more than once for a given block.
 */
-   while (xfs_bio_add_buffer(wpc->ioend->io_bio, bh) != bh->b_size)
-   xfs_chain_bio(wpc->ioend, wbc, bh);
+   while (bio_add_page(wpc->ioend->io_bio, page, len, poff) != len)
+  

[PATCH 26/34] xfs: simplify xfs_map_blocks by using xfs_iext_lookup_extent directly

2018-05-18 Thread Christoph Hellwig
xfs_bmapi_read adds zero value in xfs_map_blocks.  Replace it with a
direct call to the low-level extent lookup function.

Signed-off-by: Christoph Hellwig 
---
 fs/xfs/xfs_aops.c | 19 +--
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 354d26d66c12..b1dee2171194 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -387,7 +387,6 @@ xfs_map_blocks(
int whichfork = XFS_DATA_FORK;
struct xfs_iext_cursor  icur;
int error = 0;
-   int nimaps = 1;
boolcow_valid = false;
 
if (XFS_FORCED_SHUTDOWN(mp))
@@ -432,24 +431,16 @@ xfs_map_blocks(
goto done;
}
 
-   error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
-   imap, , XFS_BMAPI_ENTIRE);
+   if (!xfs_iext_lookup_extent(ip, &ip->i_df, offset_fsb, &icur, imap))
+   imap->br_startoff = end_fsb;/* fake a hole past EOF */
xfs_iunlock(ip, XFS_ILOCK_SHARED);
-   if (error)
-   return error;
 
-   if (!nimaps) {
-   /*
-* Lookup returns no match? Beyond eof? regardless,
-* return it as a hole so we don't write it
-*/
+   if (imap->br_startoff > offset_fsb) {
+   /* landed in a hole or beyond EOF */
+   imap->br_blockcount = imap->br_startoff - offset_fsb;
imap->br_startoff = offset_fsb;
-   imap->br_blockcount = end_fsb - offset_fsb;
imap->br_startblock = HOLESTARTBLOCK;
*type = XFS_IO_HOLE;
-   } else if (imap->br_startblock == HOLESTARTBLOCK) {
-   /* landed in a hole */
-   *type = XFS_IO_HOLE;
} else if (isnullstartblock(imap->br_startblock)) {
/* got a delalloc extent */
*type = XFS_IO_DELALLOC;
-- 
2.17.0



[PATCH 30/34] xfs: move all writeback buffer_head manipulation into xfs_map_at_offset

2018-05-18 Thread Christoph Hellwig
This keeps it in a single place so it can be made optional more easily.

Signed-off-by: Christoph Hellwig 
---
 fs/xfs/xfs_aops.c | 22 +-
 1 file changed, 5 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 592b33b35a30..951b329abb23 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -505,21 +505,6 @@ xfs_imap_valid(
offset < imap->br_startoff + imap->br_blockcount;
 }
 
-STATIC void
-xfs_start_buffer_writeback(
-   struct buffer_head  *bh)
-{
-   ASSERT(buffer_mapped(bh));
-   ASSERT(buffer_locked(bh));
-   ASSERT(!buffer_delay(bh));
-   ASSERT(!buffer_unwritten(bh));
-
-   bh->b_end_io = NULL;
-   set_buffer_async_write(bh);
-   set_buffer_uptodate(bh);
-   clear_buffer_dirty(bh);
-}
-
 STATIC void
 xfs_start_page_writeback(
struct page *page,
@@ -728,6 +713,7 @@ xfs_map_at_offset(
ASSERT(imap->br_startblock != HOLESTARTBLOCK);
ASSERT(imap->br_startblock != DELAYSTARTBLOCK);
 
+   lock_buffer(bh);
xfs_map_buffer(inode, bh, imap, offset);
set_buffer_mapped(bh);
clear_buffer_delay(bh);
@@ -740,6 +726,10 @@ xfs_map_at_offset(
 * set the bdev now.
 */
bh->b_bdev = xfs_find_bdev_for_inode(inode);
+   bh->b_end_io = NULL;
+   set_buffer_async_write(bh);
+   set_buffer_uptodate(bh);
+   clear_buffer_dirty(bh);
 }
 
 STATIC void
@@ -885,11 +875,9 @@ xfs_writepage_map(
continue;
}
 
-   lock_buffer(bh);
xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
xfs_add_to_ioend(inode, file_offset, page, wpc, wbc,
&submit_list);
-   xfs_start_buffer_writeback(bh);
count++;
}
 
-- 
2.17.0



[PATCH 31/34] xfs: remove xfs_start_page_writeback

2018-05-18 Thread Christoph Hellwig
This helper only has two callers, one of them with a constant error
argument.  Remove it to make pending changes to the code a little easier.

Signed-off-by: Christoph Hellwig 
---
 fs/xfs/xfs_aops.c | 47 +--
 1 file changed, 21 insertions(+), 26 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 951b329abb23..dd92d99df51f 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -505,30 +505,6 @@ xfs_imap_valid(
offset < imap->br_startoff + imap->br_blockcount;
 }
 
-STATIC void
-xfs_start_page_writeback(
-   struct page *page,
-   int clear_dirty)
-{
-   ASSERT(PageLocked(page));
-   ASSERT(!PageWriteback(page));
-
-   /*
-* if the page was not fully cleaned, we need to ensure that the higher
-* layers come back to it correctly. That means we need to keep the page
-* dirty, and for WB_SYNC_ALL writeback we need to ensure the
-* PAGECACHE_TAG_TOWRITE index mark is not removed so another attempt to
-* write this page in this writeback sweep will be made.
-*/
-   if (clear_dirty) {
-   clear_page_dirty_for_io(page);
-   set_page_writeback(page);
-   } else
-   set_page_writeback_keepwrite(page);
-
-   unlock_page(page);
-}
-
 /*
  * Submit the bio for an ioend. We are passed an ioend with a bio attached to
  * it, and we submit that bio. The ioend may be used for multiple bio
@@ -887,6 +863,9 @@ xfs_writepage_map(
ASSERT(wpc->ioend || list_empty(&submit_list));
 
 out:
+   ASSERT(PageLocked(page));
+   ASSERT(!PageWriteback(page));
+
/*
 * On error, we have to fail the ioend here because we have locked
 * buffers in the ioend. If we don't do this, we'll deadlock
@@ -905,7 +884,21 @@ xfs_writepage_map(
 * treated correctly on error.
 */
if (count) {
-   xfs_start_page_writeback(page, !error);
+   /*
+* If the page was not fully cleaned, we need to ensure that the
+* higher layers come back to it correctly.  That means we need
+* to keep the page dirty, and for WB_SYNC_ALL writeback we need
+* to ensure the PAGECACHE_TAG_TOWRITE index mark is not removed
+* so another attempt to write this page in this writeback sweep
+* will be made.
+*/
+   if (error) {
+   set_page_writeback_keepwrite(page);
+   } else {
+   clear_page_dirty_for_io(page);
+   set_page_writeback(page);
+   }
+   unlock_page(page);
 
/*
 * Preserve the original error if there was one, otherwise catch
@@ -930,7 +923,9 @@ xfs_writepage_map(
 * race with a partial page truncate on a sub-page block sized
 * filesystem. In that case we need to mark the page clean.
 */
-   xfs_start_page_writeback(page, 1);
+   clear_page_dirty_for_io(page);
+   set_page_writeback(page);
+   unlock_page(page);
end_page_writeback(page);
}
 
-- 
2.17.0



[PATCH 33/34] xfs: do not set the page uptodate in xfs_writepage_map

2018-05-18 Thread Christoph Hellwig
We already track the page uptodate status based on the buffer uptodate
status, which is updated whenever reading or zeroing blocks.

This code has been there since a ptool commit in 2002, which
claims to:

"merge" the 2.4 fsx fix for block size < page size to 2.5.  This needed
major changes to actually fit.

and isn't present in other writepage implementations.

Signed-off-by: Christoph Hellwig 
---
 fs/xfs/xfs_aops.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index a4e53e0a57c2..492f4a4b1deb 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -796,7 +796,6 @@ xfs_writepage_map(
ssize_t len = i_blocksize(inode);
int error = 0;
int count = 0;
-   booluptodate = true;
loff_t  file_offset;/* file offset of page */
unsignedpoffset;/* offset into page */
 
@@ -823,7 +822,6 @@ xfs_writepage_map(
if (!buffer_uptodate(bh)) {
if (PageUptodate(page))
ASSERT(buffer_mapped(bh));
-   uptodate = false;
continue;
}
 
@@ -857,9 +855,6 @@ xfs_writepage_map(
count++;
}
 
-   if (uptodate && poffset == PAGE_SIZE)
-   SetPageUptodate(page);
-
ASSERT(wpc->ioend || list_empty(&submit_list));
 
 out:
-- 
2.17.0



[PATCH 32/34] xfs: refactor the tail of xfs_writepage_map

2018-05-18 Thread Christoph Hellwig
Rejuggle how we deal with the different error vs non-error and
have-ioends vs no-ioends cases, to keep the fast path streamlined and
the duplicate code to a minimum.

Signed-off-by: Christoph Hellwig 
---
 fs/xfs/xfs_aops.c | 65 +++
 1 file changed, 32 insertions(+), 33 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index dd92d99df51f..a4e53e0a57c2 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -883,7 +883,14 @@ xfs_writepage_map(
 * submission of outstanding ioends on the writepage context so they are
 * treated correctly on error.
 */
-   if (count) {
+   if (unlikely(error)) {
+   if (!count) {
+   xfs_aops_discard_page(page);
+   ClearPageUptodate(page);
+   unlock_page(page);
+   goto done;
+   }
+
/*
 * If the page was not fully cleaned, we need to ensure that the
 * higher layers come back to it correctly.  That means we need
@@ -892,43 +899,35 @@ xfs_writepage_map(
 * so another attempt to write this page in this writeback sweep
 * will be made.
 */
-   if (error) {
-   set_page_writeback_keepwrite(page);
-   } else {
-   clear_page_dirty_for_io(page);
-   set_page_writeback(page);
-   }
-   unlock_page(page);
-
-   /*
-* Preserve the original error if there was one, otherwise catch
-* submission errors here and propagate into subsequent ioend
-* submissions.
-*/
-   list_for_each_entry_safe(ioend, next, _list, io_list) {
-   int error2;
-
-   list_del_init(>io_list);
-   error2 = xfs_submit_ioend(wbc, ioend, error);
-   if (error2 && !error)
-   error = error2;
-   }
-   } else if (error) {
-   xfs_aops_discard_page(page);
-   ClearPageUptodate(page);
-   unlock_page(page);
+   set_page_writeback_keepwrite(page);
} else {
-   /*
-* We can end up here with no error and nothing to write if we
-* race with a partial page truncate on a sub-page block sized
-* filesystem. In that case we need to mark the page clean.
-*/
clear_page_dirty_for_io(page);
set_page_writeback(page);
-   unlock_page(page);
-   end_page_writeback(page);
}
 
+   unlock_page(page);
+
+   /*
+* Preserve the original error if there was one, otherwise catch
+* submission errors here and propagate into subsequent ioend
+* submissions.
+*/
+   list_for_each_entry_safe(ioend, next, &submit_list, io_list) {
+   int error2;
+
+   list_del_init(>io_list);
+   error2 = xfs_submit_ioend(wbc, ioend, error);
+   if (error2 && !error)
+   error = error2;
+   }
+
+   /*
+* We can end up here with no error and nothing to write if we race with
+* a partial page truncate on a sub-page block sized filesystem.
+*/
+   if (!count)
+   end_page_writeback(page);
+done:
mapping_set_error(page->mapping, error);
return error;
 }
-- 
2.17.0



Re: [PATCH V6 11/11] nvme: pci: support nested EH

2018-05-18 Thread Jens Axboe
On 5/18/18 7:57 AM, Keith Busch wrote:
> On Fri, May 18, 2018 at 08:20:05AM +0800, Ming Lei wrote:
>> What I think block/011 is helpful is that it can trigger IO timeout
>> during reset, which can be triggered in reality too.
> 
> As I mentioned earlier, there is nothing wrong with the spirit of
> the test. What's wrong with it is the misguided implemention.
> 
> Do you underestand why it ever passes? The success happens when the
> enabling part of the loop happens to coincide with the driver's enabling,
> creating the pci_dev->enable_cnt > 1, making subsequent disable parts
> of the loop do absolutely nothing; the exact same as the one-liner
> (non-serious) patch I sent to defeat the test.
> 
> A better way to induce the timeout is:
> 
>   # setpci -s  4.w=0:6
> 
> This will halt the device without messing with the kernel structures,
> just like how a real device failure would occur.

Let's just improve/fix the test case. Sounds like the 'enable' sysfs
attribute should never have been exported, and hence the test should
never have used it. blktests is not the source of truth, necessarily,
and it would be silly to work around cases in the kernel if it's a
clear case of "doctor it hurts when I shoot myself in the foot".

-- 
Jens Axboe



[PATCH 34/34] xfs: allow writeback on pages without buffer heads

2018-05-18 Thread Christoph Hellwig
Disable the IOMAP_F_BUFFER_HEAD flag on file systems with a block size
equal to the page size, and deal with pages without buffer heads in
writeback.  Thanks to the previous refactoring this is basically trivial
now.

Signed-off-by: Christoph Hellwig 
---
 fs/xfs/xfs_aops.c  | 47 +-
 fs/xfs/xfs_iomap.c |  3 ++-
 2 files changed, 36 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 492f4a4b1deb..efa2cbb27d67 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -91,6 +91,19 @@ xfs_find_daxdev_for_inode(
return mp->m_ddev_targp->bt_daxdev;
 }
 
+static void
+xfs_finish_page_writeback(
+   struct inode*inode,
+   struct bio_vec  *bvec,
+   int error)
+{
+   if (error) {
+   SetPageError(bvec->bv_page);
+   mapping_set_error(inode->i_mapping, -EIO);
+   }
+   end_page_writeback(bvec->bv_page);
+}
+
 /*
  * We're now finished for good with this page.  Update the page state via the
  * associated buffer_heads, paying attention to the start and end offsets that
@@ -103,7 +116,7 @@ xfs_find_daxdev_for_inode(
  * and buffers potentially freed after every call to end_buffer_async_write.
  */
 static void
-xfs_finish_page_writeback(
+xfs_finish_buffer_writeback(
struct inode*inode,
struct bio_vec  *bvec,
int error)
@@ -178,9 +191,12 @@ xfs_destroy_ioend(
next = bio->bi_private;
 
/* walk each page on bio, ending page IO on them */
-   bio_for_each_segment_all(bvec, bio, i)
-   xfs_finish_page_writeback(inode, bvec, error);
-
+   bio_for_each_segment_all(bvec, bio, i) {
+   if (page_has_buffers(bvec->bv_page))
+   xfs_finish_buffer_writeback(inode, bvec, error);
+   else
+   xfs_finish_page_writeback(inode, bvec, error);
+   }
bio_put(bio);
}
 
@@ -792,13 +808,16 @@ xfs_writepage_map(
 {
LIST_HEAD(submit_list);
struct xfs_ioend*ioend, *next;
-   struct buffer_head  *bh;
+   struct buffer_head  *bh = NULL;
ssize_t len = i_blocksize(inode);
int error = 0;
int count = 0;
loff_t  file_offset;/* file offset of page */
unsignedpoffset;/* offset into page */
 
+   if (page_has_buffers(page))
+   bh = page_buffers(page);
+
/*
* Walk the blocks on the page, and if we run off the end of the
 * current map or find the current map invalid, grab a new one.
@@ -807,11 +826,9 @@ xfs_writepage_map(
 * replace the bufferhead with some other state tracking mechanism in
 * future.
 */
-   file_offset = page_offset(page);
-   bh = page_buffers(page);
-   for (poffset = 0;
+   for (poffset = 0, file_offset = page_offset(page);
 poffset < PAGE_SIZE;
-poffset += len, file_offset += len, bh = bh->b_this_page) {
+poffset += len, file_offset += len) {
/* past the range we are writing, so nothing more to write. */
if (file_offset >= end_offset)
break;
@@ -819,9 +836,10 @@ xfs_writepage_map(
/*
 * Block does not contain valid data, skip it.
 */
-   if (!buffer_uptodate(bh)) {
+   if (bh && !buffer_uptodate(bh)) {
if (PageUptodate(page))
ASSERT(buffer_mapped(bh));
+   bh = bh->b_this_page;
continue;
}
 
@@ -846,10 +864,15 @@ xfs_writepage_map(
 * meaningless for holes (!mapped && uptodate), so 
check we did
 * have a buffer covering a hole here and continue.
 */
+   if (bh)
+   bh = bh->b_this_page;
continue;
}
 
-   xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
+   if (bh) {
+   xfs_map_at_offset(inode, bh, &wpc->imap, file_offset);
+   bh = bh->b_this_page;
+   }
xfs_add_to_ioend(inode, file_offset, page, wpc, wbc,
&submit_list);
count++;
@@ -949,8 +972,6 @@ xfs_do_writepage(
 
trace_xfs_writepage(inode, page, 0, 0);
 
-   ASSERT(page_has_buffers(page));
-
/*
 * Refuse to write the page out if we are called from reclaim context.
 *
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index f949f0dd7382..93c40da3378a 

Re: [PATCH 01/34] block: add a lower-level bio_add_page interface

2018-05-18 Thread Jens Axboe
On 5/18/18 10:47 AM, Christoph Hellwig wrote:
> For the upcoming removal of buffer heads in XFS we need to keep track of
> the number of outstanding writeback requests per page.  For this we need
> to know if bio_add_page merged a region with the previous bvec or not.
> Instead of adding additional arguments this refactors bio_add_page to
> be implemented using three lower level helpers which users like XFS can
> use directly if they care about the merge decisions.

Reviewed-by: Jens Axboe 

-- 
Jens Axboe
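
For reference, the refactor described in the quoted changelog boils down to
roughly the following composition (a paraphrase, not the patch itself):

int bio_add_page(struct bio *bio, struct page *page,
		 unsigned int len, unsigned int offset)
{
	/* fast path: extend the last bvec if the new range is contiguous */
	if (!__bio_try_merge_page(bio, page, len, offset)) {
		if (bio_full(bio))
			return 0;
		/* otherwise start a brand new bvec */
		__bio_add_page(bio, page, len, offset);
	}
	return len;
}

Callers that care whether a merge happened, like the XFS writeback code later
in this series, call __bio_try_merge_page() and __bio_add_page() directly
instead.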



Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()

2018-05-18 Thread Linus Torvalds
On Fri, May 18, 2018 at 6:07 AM Roman Pen 
wrote:

> Function is going to be used in transport over RDMA module
> in subsequent patches.

Does this really merit its own helper macro in a generic header?

It honestly smells more like "just have an inline helper function that is
specific to rdma" to me. Particularly since it's probably just one specific
list where you want this oddly specific behavior.

Also, if we really want a round-robin list traversal macro, this isn't the
way it should be implemented, I suspect, and it probably shouldn't be
RCU-specific to begin with.

Side note: I notice that I should already  have been more critical of even
the much simpler "list_next_or_null_rcu()" macro. The "documentation"
comment above the macro is pure and utter cut-and-paste garbage.

Paul, mind giving this a look?

 Linus


[PATCH 2/2] xfs: add support for sub-pagesize writeback without buffer_heads

2018-05-18 Thread Christoph Hellwig
Switch to using the iomap_page structure for checking sub-page uptodate
status and track sub-page I/O completion status, and remove large
quantities of boilerplate code working around buffer heads.

Signed-off-by: Christoph Hellwig 
---
 fs/xfs/xfs_aops.c  | 534 +++--
 fs/xfs/xfs_buf.h   |   1 -
 fs/xfs/xfs_iomap.c |   3 -
 fs/xfs/xfs_super.c |   2 +-
 fs/xfs/xfs_trace.h |  18 +-
 5 files changed, 77 insertions(+), 481 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index efa2cbb27d67..fd664ba423e6 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -32,9 +32,6 @@
 #include "xfs_bmap_util.h"
 #include "xfs_bmap_btree.h"
 #include "xfs_reflink.h"
-#include 
-#include 
-#include 
 #include 
 
 /*
@@ -46,25 +43,6 @@ struct xfs_writepage_ctx {
struct xfs_ioend*ioend;
 };
 
-void
-xfs_count_page_state(
-   struct page *page,
-   int *delalloc,
-   int *unwritten)
-{
-   struct buffer_head  *bh, *head;
-
-   *delalloc = *unwritten = 0;
-
-   bh = head = page_buffers(page);
-   do {
-   if (buffer_unwritten(bh))
-   (*unwritten) = 1;
-   else if (buffer_delay(bh))
-   (*delalloc) = 1;
-   } while ((bh = bh->b_this_page) != head);
-}
-
 struct block_device *
 xfs_find_bdev_for_inode(
struct inode*inode)
@@ -97,67 +75,17 @@ xfs_finish_page_writeback(
struct bio_vec  *bvec,
int error)
 {
+   struct iomap_page   *iop = to_iomap_page(bvec->bv_page);
+
if (error) {
SetPageError(bvec->bv_page);
mapping_set_error(inode->i_mapping, -EIO);
}
-   end_page_writeback(bvec->bv_page);
-}
 
-/*
- * We're now finished for good with this page.  Update the page state via the
- * associated buffer_heads, paying attention to the start and end offsets that
- * we need to process on the page.
- *
- * Note that we open code the action in end_buffer_async_write here so that we
- * only have to iterate over the buffers attached to the page once.  This is 
not
- * only more efficient, but also ensures that we only calls end_page_writeback
- * at the end of the iteration, and thus avoids the pitfall of having the page
- * and buffers potentially freed after every call to end_buffer_async_write.
- */
-static void
-xfs_finish_buffer_writeback(
-   struct inode*inode,
-   struct bio_vec  *bvec,
-   int error)
-{
-   struct buffer_head  *head = page_buffers(bvec->bv_page), *bh = head;
-   boolbusy = false;
-   unsigned intoff = 0;
-   unsigned long   flags;
-
-   ASSERT(bvec->bv_offset < PAGE_SIZE);
-   ASSERT((bvec->bv_offset & (i_blocksize(inode) - 1)) == 0);
-   ASSERT(bvec->bv_offset + bvec->bv_len <= PAGE_SIZE);
-   ASSERT((bvec->bv_len & (i_blocksize(inode) - 1)) == 0);
-
-   local_irq_save(flags);
-   bit_spin_lock(BH_Uptodate_Lock, >b_state);
-   do {
-   if (off >= bvec->bv_offset &&
-   off < bvec->bv_offset + bvec->bv_len) {
-   ASSERT(buffer_async_write(bh));
-   ASSERT(bh->b_end_io == NULL);
-
-   if (error) {
-   mark_buffer_write_io_error(bh);
-   clear_buffer_uptodate(bh);
-   SetPageError(bvec->bv_page);
-   } else {
-   set_buffer_uptodate(bh);
-   }
-   clear_buffer_async_write(bh);
-   unlock_buffer(bh);
-   } else if (buffer_async_write(bh)) {
-   ASSERT(buffer_locked(bh));
-   busy = true;
-   }
-   off += bh->b_size;
-   } while ((bh = bh->b_this_page) != head);
-   bit_spin_unlock(BH_Uptodate_Lock, >b_state);
-   local_irq_restore(flags);
+   ASSERT(iop || i_blocksize(inode) == PAGE_SIZE);
+   ASSERT(!iop || atomic_read(&iop->write_count) > 0);
 
-   if (!busy)
+   if (!iop || atomic_dec_and_test(&iop->write_count))
end_page_writeback(bvec->bv_page);
 }
 
@@ -191,12 +119,8 @@ xfs_destroy_ioend(
next = bio->bi_private;
 
/* walk each page on bio, ending page IO on them */
-   bio_for_each_segment_all(bvec, bio, i) {
-   if (page_has_buffers(bvec->bv_page))
-   xfs_finish_buffer_writeback(inode, bvec, error);
-   else
-   xfs_finish_page_writeback(inode, bvec, error);
-   }
+   bio_for_each_segment_all(bvec, bio, i)
+   xfs_finish_page_writeback(inode, 

sub-page blocksize support in iomap non-buffer head path

2018-05-18 Thread Christoph Hellwig
Hi all,

this series adds support for buffered I/O without buffer heads for
block size < PAGE_SIZE to the iomap and XFS code.

A git tree is available at:

git://git.infradead.org/users/hch/xfs.git xfs-iomap-read 
xfs-remove-bufferheads.2

Gitweb:


http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-remove-bufferheads.2

Changes since v1:
 - call iomap_page_create in page_mkwrite to fix generic/095
 - split into a separate series


[PATCH 1/2] iomap: add support for sub-pagesize buffered I/O without buffer heads

2018-05-18 Thread Christoph Hellwig
After already supporting a simple implementation of buffered writes for
the blocksize == PAGE_SIZE case in the last commit this adds full support
even for smaller block sizes.   There are three bits of per-block
information in the buffer_head structure that really matter for the iomap
read and write path:

 - uptodate status (BH_uptodate)
 - marked as currently under read I/O (BH_Async_Read)
 - marked as currently under write I/O (BH_Async_Write)

Instead of having new per-block structures this now adds a per-page
structure called struct iomap_page to track this information in a slightly
different form:

 - a bitmap for the per-block uptodate status.  For worst case of a 64k
   page size system this bitmap needs to contain 128 bits.  For the
   typical 4k page size case it only needs 8 bits, although we still
   need a full unsigned long due to the way the atomic bitmap API works.
 - two atomic_t counters are used to track the outstanding read and write
   counts

There is quite a bit of boilerplate code as the buffered I/O path uses
various helper methods, but the actual code is very straightforward.
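
Concretely, the per-page structure described above looks roughly like this
(field names taken from the code below; treat it as a sketch rather than the
exact header hunk):

struct iomap_page {
	atomic_t		read_count;	/* outstanding read bios */
	atomic_t		write_count;	/* outstanding write bios */
	DECLARE_BITMAP(uptodate, PAGE_SIZE / SECTOR_SIZE);
};

static inline struct iomap_page *to_iomap_page(struct page *page)
{
	if (page_has_private(page))
		return (struct iomap_page *)page_private(page);
	return NULL;
}

The bitmap is sized for the smallest supported block size, one bit per
512-byte sector, which is where the 128-bit worst case for 64k pages comes
from.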

In this commit the code can't actually be used yet, as we need to
switch from the old implementation to the new one together with the
XFS writeback code.

Signed-off-by: Christoph Hellwig 
---
 fs/iomap.c| 260 +-
 include/linux/iomap.h |  31 +
 2 files changed, 262 insertions(+), 29 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index cd4c563db80a..8d62f0eb874c 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -110,6 +111,107 @@ iomap_block_needs_zeroing(struct inode *inode, loff_t 
pos, struct iomap *iomap)
return iomap->type != IOMAP_MAPPED || pos > i_size_read(inode);
 }
 
+static struct iomap_page *
+iomap_page_create(struct inode *inode, struct page *page)
+{
+   struct iomap_page *iop = to_iomap_page(page);
+
+   if (iop || i_blocksize(inode) == PAGE_SIZE)
+   return iop;
+
+   iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
+   atomic_set(&iop->read_count, 0);
+   atomic_set(&iop->write_count, 0);
+   bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
+   set_page_private(page, (unsigned long)iop);
+   SetPagePrivate(page);
+   return iop;
+}
+
+/*
+ * Calculate the range inside the page that we actually need to read.
+ */
+static void
+iomap_read_calculate_range(struct inode *inode, struct iomap_page *iop,
+   loff_t *pos, loff_t length, unsigned *offp, unsigned *lenp)
+{
+   unsigned poff = *pos & (PAGE_SIZE - 1);
+   unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
+
+   if (iop) {
+   unsigned block_size = i_blocksize(inode);
+   unsigned first = poff >> inode->i_blkbits;
+   unsigned last = (poff + plen - 1) >> inode->i_blkbits;
+   unsigned int i;
+
+   /* move forward for each leading block marked uptodate */
+   for (i = first; i <= last; i++) {
+   if (!test_bit(i, iop->uptodate))
+   break;
+   *pos += block_size;
+   poff += block_size;
+   plen -= block_size;
+   }
+
+   /* truncate len if we find any trailing uptodate block(s) */
+   for ( ; i <= last; i++) {
+   if (test_bit(i, iop->uptodate)) {
+   plen -= (last - i + 1) * block_size;
+   break;
+   }
+   }
+   }
+
+   *offp = poff;
+   *lenp = plen;
+}
+
+static void
+iomap_set_range_uptodate(struct page *page, unsigned off, unsigned len)
+{
+   struct iomap_page *iop = to_iomap_page(page);
+   struct inode *inode = page->mapping->host;
+   unsigned first = off >> inode->i_blkbits;
+   unsigned last = (off + len - 1) >> inode->i_blkbits;
+   unsigned int i;
+   bool uptodate = true;
+
+   if (iop) {
+   for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
+   if (i >= first && i <= last)
+   set_bit(i, iop->uptodate);
+   else if (!test_bit(i, iop->uptodate))
+   uptodate = false;
+   }
+   }
+
+   if (uptodate && !PageError(page))
+   SetPageUptodate(page);
+}
+
+static void
+iomap_read_finish(struct iomap_page *iop, struct page *page)
+{
+   if (!iop || atomic_dec_and_test(&iop->read_count))
+   unlock_page(page);
+}
+
+static void
+iomap_read_page_end_io(struct bio_vec *bvec, int error)
+{
+   struct page *page = bvec->bv_page;
+   struct iomap_page *iop = to_iomap_page(page);
+
+   if (unlikely(error)) {
+   ClearPageUptodate(page);
+   

[PATCH 28/34] xfs: remove the imap_valid flag

2018-05-18 Thread Christoph Hellwig
Simplify the way we check for a valid imap - we know we have a valid
mapping after xfs_map_blocks returned successfully, and we know we can
call xfs_imap_valid on any imap, as it will always fail on a
zero-initialized map.

Signed-off-by: Christoph Hellwig 
---
 fs/xfs/xfs_aops.c | 11 ++-
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 82fd08c29f7f..f01c1dd737ec 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -42,7 +42,6 @@
  */
 struct xfs_writepage_ctx {
struct xfs_bmbt_irecimap;
-   boolimap_valid;
unsigned intio_type;
struct xfs_ioend*ioend;
sector_tlast_block;
@@ -868,10 +867,6 @@ xfs_writepage_map(
continue;
}
 
-   /* Check to see if current map spans this file offset */
-   if (wpc->imap_valid)
-   wpc->imap_valid = xfs_imap_valid(inode, >imap,
-file_offset);
/*
 * If we don't have a valid map, now it's time to get a new one
 * for this offset.  This will convert delayed allocations
@@ -879,16 +874,14 @@ xfs_writepage_map(
 * a valid map, it means we landed in a hole and we skip the
 * block.
 */
-   if (!wpc->imap_valid) {
+   if (!xfs_imap_valid(inode, &wpc->imap, file_offset)) {
error = xfs_map_blocks(inode, file_offset, &wpc->imap,
 &wpc->io_type);
if (error)
goto out;
-   wpc->imap_valid = xfs_imap_valid(inode, >imap,
-file_offset);
}
 
-   if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE) {
+   if (wpc->io_type == XFS_IO_HOLE) {
/*
 * set_page_dirty dirties all buffers in a page, 
independent
 * of their state.  The dirty state however is entirely
-- 
2.17.0



[PATCH 27/34] xfs: don't clear imap_valid for a non-uptodate buffers

2018-05-18 Thread Christoph Hellwig
Finding a buffer that isn't uptodate doesn't invalidate the mapping for
any given block.  The last_sector check will already take care of starting
another ioend as soon as we find any non-uptodate buffer, and if the current
mapping doesn't include the next uptodate buffer the xfs_imap_valid check
will take care of it.

Signed-off-by: Christoph Hellwig 
---
 fs/xfs/xfs_aops.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index b1dee2171194..82fd08c29f7f 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -859,15 +859,12 @@ xfs_writepage_map(
break;
 
/*
-* Block does not contain valid data, skip it, mark the current
-* map as invalid because we have a discontiguity. This ensures
-* we put subsequent writeable buffers into a new ioend.
+* Block does not contain valid data, skip it.
 */
if (!buffer_uptodate(bh)) {
if (PageUptodate(page))
ASSERT(buffer_mapped(bh));
uptodate = false;
-   wpc->imap_valid = false;
continue;
}
 
-- 
2.17.0



[PATCH 11/34] iomap: move IOMAP_F_BOUNDARY to gfs2

2018-05-18 Thread Christoph Hellwig
Just define a range of fs specific flags and use that in gfs2 instead of
exposing this internal flag globally.

Signed-off-by: Christoph Hellwig 
---
 fs/gfs2/bmap.c| 8 +---
 include/linux/iomap.h | 9 +++--
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index cbeedd3cfb36..8efa6297e19c 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -683,6 +683,8 @@ static void gfs2_stuffed_iomap(struct inode *inode, struct 
iomap *iomap)
iomap->type = IOMAP_INLINE;
 }
 
+#define IOMAP_F_GFS2_BOUNDARY IOMAP_F_PRIVATE
+
 /**
  * gfs2_iomap_begin - Map blocks from an inode to disk blocks
  * @inode: The inode
@@ -774,7 +776,7 @@ int gfs2_iomap_begin(struct inode *inode, loff_t pos, 
loff_t length,
bh = mp.mp_bh[ip->i_height - 1];
len = gfs2_extent_length(bh->b_data, bh->b_size, ptr, lend - lblock, &eob);
if (eob)
-   iomap->flags |= IOMAP_F_BOUNDARY;
+   iomap->flags |= IOMAP_F_GFS2_BOUNDARY;
iomap->length = (u64)len << inode->i_blkbits;
 
 out_release:
@@ -846,12 +848,12 @@ int gfs2_block_map(struct inode *inode, sector_t lblock,
 
if (iomap.length > bh_map->b_size) {
iomap.length = bh_map->b_size;
-   iomap.flags &= ~IOMAP_F_BOUNDARY;
+   iomap.flags &= ~IOMAP_F_GFS2_BOUNDARY;
}
if (iomap.addr != IOMAP_NULL_ADDR)
map_bh(bh_map, inode->i_sb, iomap.addr >> inode->i_blkbits);
bh_map->b_size = iomap.length;
-   if (iomap.flags & IOMAP_F_BOUNDARY)
+   if (iomap.flags & IOMAP_F_GFS2_BOUNDARY)
set_buffer_boundary(bh_map);
if (iomap.flags & IOMAP_F_NEW)
set_buffer_new(bh_map);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 13d19b4c29a9..819e0cd2a950 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -27,8 +27,7 @@ struct vm_fault;
  * written data and requires fdatasync to commit them to persistent storage.
  */
 #define IOMAP_F_NEW0x01/* blocks have been newly allocated */
-#define IOMAP_F_BOUNDARY   0x02/* mapping ends at metadata boundary */
-#define IOMAP_F_DIRTY  0x04/* uncommitted metadata */
+#define IOMAP_F_DIRTY  0x02/* uncommitted metadata */
 
 /*
  * Flags that only need to be reported for IOMAP_REPORT requests:
@@ -36,6 +35,12 @@ struct vm_fault;
 #define IOMAP_F_MERGED 0x10/* contains multiple blocks/extents */
 #define IOMAP_F_SHARED 0x20/* block shared with another file */
 
+/*
+ * Flags from 0x1000 up are for file system specific usage:
+ */
+#define IOMAP_F_PRIVATE0x1000
+
+
 /*
  * Magic value for addr:
  */
-- 
2.17.0



[PATCH 25/34] xfs: remove xfs_reflink_trim_irec_to_next_cow

2018-05-18 Thread Christoph Hellwig
In the only caller we just did a lookup in the COW extent tree for
the same offset.  Reuse that result and save a lookup, as well as
shortening the ilock hold time.

Signed-off-by: Christoph Hellwig 
---
 fs/xfs/xfs_aops.c| 25 +
 fs/xfs/xfs_reflink.c | 33 -
 fs/xfs/xfs_reflink.h |  2 --
 3 files changed, 17 insertions(+), 43 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index a4b4a7037deb..354d26d66c12 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -383,11 +383,12 @@ xfs_map_blocks(
struct xfs_inode*ip = XFS_I(inode);
struct xfs_mount*mp = ip->i_mount;
ssize_t count = i_blocksize(inode);
-   xfs_fileoff_t   offset_fsb, end_fsb;
+   xfs_fileoff_t   offset_fsb, end_fsb, cow_fsb = 0;
int whichfork = XFS_DATA_FORK;
struct xfs_iext_cursor  icur;
int error = 0;
int nimaps = 1;
+   boolcow_valid = false;
 
if (XFS_FORCED_SHUTDOWN(mp))
return -EIO;
@@ -407,8 +408,11 @@ xfs_map_blocks(
 * it directly instead of looking up anything in the data fork.
 */
if (xfs_is_reflink_inode(ip) &&
-   xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, , imap) &&
-   imap->br_startoff <= offset_fsb) {
+   xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &icur, imap)) {
+   cow_fsb = imap->br_startoff;
+   cow_valid = true;
+   }
+   if (cow_valid && cow_fsb <= offset_fsb) {
xfs_iunlock(ip, XFS_ILOCK_SHARED);
/*
 * Truncate can race with writeback since writeback doesn't
@@ -430,6 +434,10 @@ xfs_map_blocks(
 
error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
imap, &nimaps, XFS_BMAPI_ENTIRE);
+   xfs_iunlock(ip, XFS_ILOCK_SHARED);
+   if (error)
+   return error;
+
if (!nimaps) {
/*
 * Lookup returns no match? Beyond eof? regardless,
@@ -451,16 +459,17 @@ xfs_map_blocks(
 * is a pending CoW reservation before the end of this extent,
 * so that we pick up the COW extents in the next iteration.
 */
-   xfs_reflink_trim_irec_to_next_cow(ip, offset_fsb, imap);
+   if (cow_valid &&
+   cow_fsb < imap->br_startoff + imap->br_blockcount) {
+   imap->br_blockcount = cow_fsb - imap->br_startoff;
+   trace_xfs_reflink_trim_irec(ip, imap);
+   }
+
if (imap->br_state == XFS_EXT_UNWRITTEN)
*type = XFS_IO_UNWRITTEN;
else
*type = XFS_IO_OVERWRITE;
}
-   xfs_iunlock(ip, XFS_ILOCK_SHARED);
-   if (error)
-   return error;
-
 done:
switch (*type) {
case XFS_IO_HOLE:
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 8e5eb8e70c89..ff76bc56ff3d 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -484,39 +484,6 @@ xfs_reflink_allocate_cow(
return error;
 }
 
-/*
- * Trim an extent to end at the next CoW reservation past offset_fsb.
- */
-void
-xfs_reflink_trim_irec_to_next_cow(
-   struct xfs_inode*ip,
-   xfs_fileoff_t   offset_fsb,
-   struct xfs_bmbt_irec*imap)
-{
-   struct xfs_ifork*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
-   struct xfs_bmbt_irecgot;
-   struct xfs_iext_cursor  icur;
-
-   if (!xfs_is_reflink_inode(ip))
-   return;
-
-   /* Find the extent in the CoW fork. */
-   if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb, , ))
-   return;
-
-   /* This is the extent before; try sliding up one. */
-   if (got.br_startoff < offset_fsb) {
-   if (!xfs_iext_next_extent(ifp, , ))
-   return;
-   }
-
-   if (got.br_startoff >= imap->br_startoff + imap->br_blockcount)
-   return;
-
-   imap->br_blockcount = got.br_startoff - imap->br_startoff;
-   trace_xfs_reflink_trim_irec(ip, imap);
-}
-
 /*
  * Cancel CoW reservations for some block range of an inode.
  *
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 15a456492667..e8d4d50c629f 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -32,8 +32,6 @@ extern int xfs_reflink_allocate_cow(struct xfs_inode *ip,
struct xfs_bmbt_irec *imap, bool *shared, uint *lockmode);
 extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
xfs_off_t count);
-extern void xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
-   xfs_fileoff_t offset_fsb, struct xfs_bmbt_irec *imap);
 
 extern 

[PATCH 16/34] iomap: add initial support for writes without buffer heads

2018-05-18 Thread Christoph Hellwig
For now just limited to blocksize == PAGE_SIZE, where we can simply read
in the full page in write begin, and just set the whole page dirty after
copying data into it.  This code is enabled by default and XFS will now
be fed pages without buffer heads in ->writepage and ->writepages.

If a file system sets the IOMAP_F_BUFFER_HEAD flag on the iomap the old
path will still be used; this both helps the transition in XFS and
prepares for the gfs2 migration to the iomap infrastructure.
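
As a rough illustration, a file system that still wants the old code keeps it
simply by setting the flag from its ->iomap_begin method (foo_iomap_begin
below is a hypothetical example, not code from this series):

static int foo_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
		unsigned flags, struct iomap *iomap)
{
	/* map the request 1:1 purely for the sake of the example */
	iomap->addr = pos;
	iomap->offset = pos;
	iomap->length = length;
	iomap->type = IOMAP_MAPPED;
	iomap->bdev = inode->i_sb->s_bdev;

	/* opt back in to buffer heads and __block_write_begin_int() */
	iomap->flags |= IOMAP_F_BUFFER_HEAD;
	return 0;
}

As noted in the cover letter, gfs2 is the main intended user of this flag for
now.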

Signed-off-by: Christoph Hellwig 
---
 fs/iomap.c| 132 ++
 fs/xfs/xfs_iomap.c|   6 +-
 include/linux/iomap.h |   2 +
 3 files changed, 127 insertions(+), 13 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 821671af2618..cd4c563db80a 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -314,6 +314,58 @@ iomap_write_failed(struct inode *inode, loff_t pos, 
unsigned len)
truncate_pagecache_range(inode, max(pos, i_size), pos + len);
 }
 
+static int
+iomap_read_page_sync(struct inode *inode, loff_t block_start, struct page 
*page,
+   unsigned poff, unsigned plen, struct iomap *iomap)
+{
+   struct bio_vec bvec;
+   struct bio bio;
+   int ret;
+
+   bio_init(&bio, &bvec, 1);
+   bio.bi_opf = REQ_OP_READ;
+   bio.bi_iter.bi_sector = iomap_sector(iomap, block_start);
+   bio_set_dev(&bio, iomap->bdev);
+   __bio_add_page(&bio, page, plen, poff);
+   ret = submit_bio_wait(&bio);
+   if (ret < 0 && iomap_block_needs_zeroing(inode, block_start, iomap))
+   zero_user(page, poff, plen);
+   return ret;
+}
+
+static int
+__iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
+   struct page *page, struct iomap *iomap)
+{
+   loff_t block_size = i_blocksize(inode);
+   loff_t block_start = pos & ~(block_size - 1);
+   loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
+   unsigned poff = block_start & (PAGE_SIZE - 1);
+   unsigned plen = min_t(loff_t, PAGE_SIZE - poff, block_end - 
block_start);
+   int status;
+
+   WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE);
+
+   if (PageUptodate(page))
+   return 0;
+
+   if (iomap_block_needs_zeroing(inode, block_start, iomap)) {
+   unsigned from = pos & (PAGE_SIZE - 1), to = from + len;
+   unsigned pend = poff + plen;
+
+   if (poff < from || pend > to)
+   zero_user_segments(page, poff, from, to, pend);
+   } else {
+   status = iomap_read_page_sync(inode, block_start, page,
+   poff, plen, iomap);
+   if (status < 0)
+   return status;
+   SetPageUptodate(page);
+   }
+
+   return 0;
+}
+
 static int
 iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned 
flags,
struct page **pagep, struct iomap *iomap)
@@ -331,7 +383,10 @@ iomap_write_begin(struct inode *inode, loff_t pos, 
unsigned len, unsigned flags,
if (!page)
return -ENOMEM;
 
-   status = __block_write_begin_int(page, pos, len, NULL, iomap);
+   if (iomap->flags & IOMAP_F_BUFFER_HEAD)
+   status = __block_write_begin_int(page, pos, len, NULL, iomap);
+   else
+   status = __iomap_write_begin(inode, pos, len, page, iomap);
if (unlikely(status)) {
unlock_page(page);
put_page(page);
@@ -344,14 +399,63 @@ iomap_write_begin(struct inode *inode, loff_t pos, 
unsigned len, unsigned flags,
return status;
 }
 
+int
+iomap_set_page_dirty(struct page *page)
+{
+   struct address_space *mapping = page_mapping(page);
+   int newly_dirty;
+
+   if (unlikely(!mapping))
+   return !TestSetPageDirty(page);
+
+   /*
+* Lock out page->mem_cgroup migration to keep PageDirty
+* synchronized with per-memcg dirty page counters.
+*/
+   lock_page_memcg(page);
+   newly_dirty = !TestSetPageDirty(page);
+   if (newly_dirty)
+   __set_page_dirty(page, mapping, 0);
+   unlock_page_memcg(page);
+
+   if (newly_dirty)
+   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+   return newly_dirty;
+}
+EXPORT_SYMBOL_GPL(iomap_set_page_dirty);
+
+static int
+__iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
+   unsigned copied, struct page *page, struct iomap *iomap)
+{
+   unsigned start = pos & (PAGE_SIZE - 1);
+
+   if (unlikely(copied < len)) {
+   /* see block_write_end() for an explanation */
+   if (!PageUptodate(page))
+   copied = 0;
+   if (iomap_block_needs_zeroing(inode, pos, iomap))
+   zero_user(page, start + copied, len - copied);
+   }
+
+   flush_dcache_page(page);
+   SetPageUptodate(page);
+   iomap_set_page_dirty(page);
+ 

[PATCH 18/34] xfs: use iomap for blocksize == PAGE_SIZE readpage and readpages

2018-05-18 Thread Christoph Hellwig
For file systems with a block size that equals the page size we never do
partial reads, so we can use the buffer_head-less iomap versions of
readpage and readpages without conflicting with the buffer_head structures
created later in write_begin.

Signed-off-by: Christoph Hellwig 
---
 fs/xfs/xfs_aops.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 56e405572909..c631c457b444 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1402,6 +1402,8 @@ xfs_vm_readpage(
struct page *page)
 {
trace_xfs_vm_readpage(page->mapping->host, 1);
+   if (i_blocksize(page->mapping->host) == PAGE_SIZE)
+   return iomap_readpage(page, _iomap_ops);
return mpage_readpage(page, xfs_get_blocks);
 }
 
@@ -1413,6 +1415,8 @@ xfs_vm_readpages(
unsignednr_pages)
 {
trace_xfs_vm_readpages(mapping->host, nr_pages);
+   if (i_blocksize(mapping->host) == PAGE_SIZE)
+   return iomap_readpages(mapping, pages, nr_pages, 
_iomap_ops);
return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
 }
 
-- 
2.17.0



[PATCH 15/34] iomap: add an iomap-based readpage and readpages implementation

2018-05-18 Thread Christoph Hellwig
Simply use iomap_apply to iterate over the file and submit a bio for
each non-uptodate but mapped region and zero everything else.  Note that
as-is this can not be used for file systems with a blocksize smaller than
the page size, but that support will be added later.

Signed-off-by: Christoph Hellwig 
---
 fs/iomap.c| 200 +-
 include/linux/iomap.h |   4 +
 2 files changed, 203 insertions(+), 1 deletion(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 7c1b071d115c..821671af2618 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (C) 2010 Red Hat, Inc.
- * Copyright (c) 2016 Christoph Hellwig.
+ * Copyright (c) 2016-2018 Christoph Hellwig.
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms and conditions of the GNU General Public License,
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -103,6 +104,203 @@ iomap_sector(struct iomap *iomap, loff_t pos)
return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
 }
 
+static inline bool
+iomap_block_needs_zeroing(struct inode *inode, loff_t pos, struct iomap *iomap)
+{
+   return iomap->type != IOMAP_MAPPED || pos > i_size_read(inode);
+}
+
+static void
+iomap_read_end_io(struct bio *bio)
+{
+   int error = blk_status_to_errno(bio->bi_status);
+   struct bio_vec *bvec;
+   int i;
+
+   bio_for_each_segment_all(bvec, bio, i)
+   page_endio(bvec->bv_page, false, error);
+   bio_put(bio);
+}
+
+static struct bio *
+iomap_read_bio_alloc(struct iomap *iomap, sector_t sector, loff_t length)
+{
+   int nr_vecs = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
+   struct bio *bio = bio_alloc(GFP_NOFS, min(BIO_MAX_PAGES, nr_vecs));
+
+   bio->bi_opf = REQ_OP_READ;
+   bio->bi_iter.bi_sector = sector;
+   bio_set_dev(bio, iomap->bdev);
+   bio->bi_end_io = iomap_read_end_io;
+   return bio;
+}
+
+struct iomap_readpage_ctx {
+   struct page *cur_page;
+   boolcur_page_in_bio;
+   struct bio  *bio;
+   struct list_head*pages;
+};
+
+static loff_t
+iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void 
*data,
+   struct iomap *iomap)
+{
+   struct iomap_readpage_ctx *ctx = data;
+   struct page *page = ctx->cur_page;
+   unsigned poff = pos & (PAGE_SIZE - 1);
+   unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
+   bool is_contig = false;
+   sector_t sector;
+
+   /* we don't support blocksize < PAGE_SIZE quite yet: */
+   WARN_ON_ONCE(pos != page_offset(page));
+   WARN_ON_ONCE(plen != PAGE_SIZE);
+
+   if (iomap_block_needs_zeroing(inode, pos, iomap)) {
+   zero_user(page, poff, plen);
+   SetPageUptodate(page);
+   goto done;
+   }
+
+   ctx->cur_page_in_bio = true;
+
+   /*
+* Try to merge into a previous segment if we can.
+*/
+   sector = iomap_sector(iomap, pos);
+   if (ctx->bio && bio_end_sector(ctx->bio) == sector) {
+   if (__bio_try_merge_page(ctx->bio, page, plen, poff))
+   goto done;
+   is_contig = true;
+   }
+
+   if (!ctx->bio || !is_contig || bio_full(ctx->bio)) {
+   if (ctx->bio)
+   submit_bio(ctx->bio);
+   ctx->bio = iomap_read_bio_alloc(iomap, sector, length);
+   }
+
+   __bio_add_page(ctx->bio, page, plen, poff);
+done:
+   return plen;
+}
+
+int
+iomap_readpage(struct page *page, const struct iomap_ops *ops)
+{
+   struct iomap_readpage_ctx ctx = { .cur_page = page };
+   struct inode *inode = page->mapping->host;
+   unsigned poff;
+   loff_t ret;
+
+   WARN_ON_ONCE(page_has_buffers(page));
+
+   for (poff = 0; poff < PAGE_SIZE; poff += ret) {
+   ret = iomap_apply(inode, page_offset(page) + poff,
+   PAGE_SIZE - poff, 0, ops, &ctx,
+   iomap_readpage_actor);
+   if (ret <= 0) {
+   WARN_ON_ONCE(ret == 0);
+   SetPageError(page);
+   break;
+   }
+   }
+
+   if (ctx.bio) {
+   submit_bio(ctx.bio);
+   WARN_ON_ONCE(!ctx.cur_page_in_bio);
+   } else {
+   WARN_ON_ONCE(ctx.cur_page_in_bio);
+   unlock_page(page);
+   }
+   return 0;
+}
+EXPORT_SYMBOL_GPL(iomap_readpage);
+
+static struct page *
+iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
+   loff_t length, loff_t *done)
+{
+   while (!list_empty(pages)) {
+   struct page *page = lru_to_page(pages);
+
+   if (page_offset(page) >= (u64)pos + length)
+   break;
+
+   list_del(&page->lru);
+   

[PATCH 21/34] xfs: move locking into xfs_bmap_punch_delalloc_range

2018-05-18 Thread Christoph Hellwig
Both callers want the same locking, so do it only once.

Signed-off-by: Christoph Hellwig 
---
 fs/xfs/xfs_aops.c  | 2 --
 fs/xfs/xfs_bmap_util.c | 7 ---
 fs/xfs/xfs_iomap.c | 3 ---
 3 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index f2333e351e07..5dd09e83c81c 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -761,10 +761,8 @@ xfs_aops_discard_page(
"page discard on page "PTR_FMT", inode 0x%llx, offset %llu.",
page, ip->i_ino, offset);
 
-   xfs_ilock(ip, XFS_ILOCK_EXCL);
error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
PAGE_SIZE / i_blocksize(inode));
-   xfs_iunlock(ip, XFS_ILOCK_EXCL);
if (error && !XFS_FORCED_SHUTDOWN(mp))
xfs_alert(mp, "page discard unable to remove delalloc 
mapping.");
 out_invalidate:
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index c009bdf9fdce..1a55fc06f917 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -711,12 +711,11 @@ xfs_bmap_punch_delalloc_range(
struct xfs_iext_cursor  icur;
int error = 0;
 
-   ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
-
+   xfs_ilock(ip, XFS_ILOCK_EXCL);
if (!(ifp->if_flags & XFS_IFEXTENTS)) {
error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
if (error)
-   return error;
+   goto out_unlock;
}
 
if (!xfs_iext_lookup_extent(ip, ifp, start_fsb, &icur, &got))
@@ -738,6 +737,8 @@ xfs_bmap_punch_delalloc_range(
break;
} while (xfs_iext_next_extent(ifp, &icur, &got));
 
+out_unlock:
+   xfs_iunlock(ip, XFS_ILOCK_EXCL);
return error;
 }
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index da6d1995e460..f949f0dd7382 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1203,11 +1203,8 @@ xfs_file_iomap_end_delalloc(
truncate_pagecache_range(VFS_I(ip), XFS_FSB_TO_B(mp, start_fsb),
 XFS_FSB_TO_B(mp, end_fsb) - 1);
 
-   xfs_ilock(ip, XFS_ILOCK_EXCL);
error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
   end_fsb - start_fsb);
-   xfs_iunlock(ip, XFS_ILOCK_EXCL);
-
if (error && !XFS_FORCED_SHUTDOWN(mp)) {
xfs_alert(mp, "%s: unable to clean up ino %lld",
__func__, ip->i_ino);
-- 
2.17.0



[PATCH 17/34] xfs: use iomap_bmap

2018-05-18 Thread Christoph Hellwig
Switch to the iomap based bmap implementation to get rid of one of the
last users of xfs_get_blocks.

Signed-off-by: Christoph Hellwig 
---
 fs/xfs/xfs_aops.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 80de476cecf8..56e405572909 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1378,10 +1378,9 @@ xfs_vm_bmap(
struct address_space*mapping,
sector_tblock)
 {
-   struct inode*inode = (struct inode *)mapping->host;
-   struct xfs_inode*ip = XFS_I(inode);
+   struct xfs_inode*ip = XFS_I(mapping->host);
 
-   trace_xfs_vm_bmap(XFS_I(inode));
+   trace_xfs_vm_bmap(ip);
 
/*
 * The swap code (ab-)uses ->bmap to get a block mapping and then
@@ -1394,9 +1393,7 @@ xfs_vm_bmap(
 */
if (xfs_is_reflink_inode(ip) || XFS_IS_REALTIME_INODE(ip))
return 0;
-
-   filemap_write_and_wait(mapping);
-   return generic_block_bmap(mapping, block, xfs_get_blocks);
+   return iomap_bmap(mapping, block, &xfs_iomap_ops);
 }
 
 STATIC int
-- 
2.17.0



buffered I/O without buffer heads in xfs and iomap v2

2018-05-18 Thread Christoph Hellwig
Hi all,

this series adds support for buffered I/O without buffer heads to
the iomap and XFS code.

For now this series only contains support for block size == PAGE_SIZE,
with the 4k support split into a separate series.


A git tree is available at:

git://git.infradead.org/users/hch/xfs.git xfs-iomap-read.2

Gitweb:


http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-iomap-read.2

Changes since v1:
 - fix the iomap_readpages error handling
 - use unsigned file offsets in a few places to avoid arithmetic overflows
 - allocate a iomap_page in iomap_page_mkwrite to fix generic/095
 - improve a few comments
 - add more asserts
 - warn about truncated block numbers from ->bmap
 - new patch to change the __do_page_cache_readahead return value to
   unsigned int
 - remove an incorrectly added empty line
 - make inline data an explicit iomap type instead of a flag
 - add a IOMAP_F_BUFFER_HEAD flag to force use of buffer heads for gfs2,
   and keep the basic buffer head infrastructure around for now.


[PATCH 10/34] iomap: fix the comment describing IOMAP_NOWAIT

2018-05-18 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 include/linux/iomap.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 8f7095fc514e..13d19b4c29a9 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -59,7 +59,7 @@ struct iomap {
 #define IOMAP_REPORT   (1 << 2) /* report extent status, e.g. FIEMAP */
 #define IOMAP_FAULT(1 << 3) /* mapping for page fault */
 #define IOMAP_DIRECT   (1 << 4) /* direct I/O */
-#define IOMAP_NOWAIT   (1 << 5) /* Don't wait for writeback */
+#define IOMAP_NOWAIT   (1 << 5) /* do not block */
 
 struct iomap_ops {
/*
-- 
2.17.0



[PATCH 06/34] mm: give the 'ret' variable a better name __do_page_cache_readahead

2018-05-18 Thread Christoph Hellwig
It counts the number of pages acted on, so name it nr_pages to make that
obvious.

Signed-off-by: Christoph Hellwig 
---
 mm/readahead.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 539bbb6c1fad..16d0cb1e2616 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -156,7 +156,7 @@ int __do_page_cache_readahead(struct address_space 
*mapping, struct file *filp,
unsigned long end_index;/* The last page we want to read */
LIST_HEAD(page_pool);
int page_idx;
-   int ret = 0;
+   int nr_pages = 0;
loff_t isize = i_size_read(inode);
gfp_t gfp_mask = readahead_gfp_mask(mapping);
 
@@ -187,7 +187,7 @@ int __do_page_cache_readahead(struct address_space 
*mapping, struct file *filp,
list_add(&page->lru, &page_pool);
if (page_idx == nr_to_read - lookahead_size)
SetPageReadahead(page);
-   ret++;
+   nr_pages++;
}
 
/*
@@ -195,11 +195,11 @@ int __do_page_cache_readahead(struct address_space 
*mapping, struct file *filp,
 * uptodate then the caller will launch readpage again, and
 * will then handle the error.
 */
-   if (ret)
-   read_pages(mapping, filp, &page_pool, ret, gfp_mask);
+   if (nr_pages)
+   read_pages(mapping, filp, &page_pool, nr_pages, gfp_mask);
BUG_ON(!list_empty(&page_pool));
 out:
-   return ret;
+   return nr_pages;
 }
 
 /*
-- 
2.17.0



[PATCH 01/34] block: add a lower-level bio_add_page interface

2018-05-18 Thread Christoph Hellwig
For the upcoming removal of buffer heads in XFS we need to keep track of
the number of outstanding writeback requests per page.  For this we need
to know if bio_add_page merged a region with the previous bvec or not.
Instead of adding additional arguments this refactors bio_add_page to
be implemented using three lower level helpers which users like XFS can
use directly if they care about the merge decisions.

Signed-off-by: Christoph Hellwig 
---
 block/bio.c | 96 +
 include/linux/bio.h |  9 +
 2 files changed, 72 insertions(+), 33 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 53e0f0a1ed94..fdf635d42bbd 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -773,7 +773,7 @@ int bio_add_pc_page(struct request_queue *q, struct bio 
*bio, struct page
return 0;
}
 
-   if (bio->bi_vcnt >= bio->bi_max_vecs)
+   if (bio_full(bio))
return 0;
 
/*
@@ -821,52 +821,82 @@ int bio_add_pc_page(struct request_queue *q, struct bio 
*bio, struct page
 EXPORT_SYMBOL(bio_add_pc_page);
 
 /**
- * bio_add_page-   attempt to add page to bio
- * @bio: destination bio
- * @page: page to add
- * @len: vec entry length
- * @offset: vec entry offset
+ * __bio_try_merge_page - try appending data to an existing bvec.
+ * @bio: destination bio
+ * @page: page to add
+ * @len: length of the data to add
+ * @off: offset of the data in @page
  *
- * Attempt to add a page to the bio_vec maplist. This will only fail
- * if either bio->bi_vcnt == bio->bi_max_vecs or it's a cloned bio.
+ * Try to add the data at @page + @off to the last bvec of @bio.  This is a
+ * a useful optimisation for file systems with a block size smaller than the
+ * page size.
+ *
+ * Return %true on success or %false on failure.
  */
-int bio_add_page(struct bio *bio, struct page *page,
-unsigned int len, unsigned int offset)
+bool __bio_try_merge_page(struct bio *bio, struct page *page,
+   unsigned int len, unsigned int off)
 {
-   struct bio_vec *bv;
-
-   /*
-* cloned bio must not modify vec list
-*/
if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
-   return 0;
+   return false;
 
-   /*
-* For filesystems with a blocksize smaller than the pagesize
-* we will often be called with the same page as last time and
-* a consecutive offset.  Optimize this special case.
-*/
if (bio->bi_vcnt > 0) {
-   bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+   struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
 
-   if (page == bv->bv_page &&
-   offset == bv->bv_offset + bv->bv_len) {
+   if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
bv->bv_len += len;
-   goto done;
+   bio->bi_iter.bi_size += len;
+   return true;
}
}
+   return false;
+}
+EXPORT_SYMBOL_GPL(__bio_try_merge_page);
 
-   if (bio->bi_vcnt >= bio->bi_max_vecs)
-   return 0;
+/**
+ * __bio_add_page - add page to a bio in a new segment
+ * @bio: destination bio
+ * @page: page to add
+ * @len: length of the data to add
+ * @off: offset of the data in @page
+ *
+ * Add the data at @page + @off to @bio as a new bvec.  The caller must ensure
+ * that @bio has space for another bvec.
+ */
+void __bio_add_page(struct bio *bio, struct page *page,
+   unsigned int len, unsigned int off)
+{
+   struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt];
 
-   bv  = &bio->bi_io_vec[bio->bi_vcnt];
-   bv->bv_page = page;
-   bv->bv_len  = len;
-   bv->bv_offset   = offset;
+   WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
+   WARN_ON_ONCE(bio_full(bio));
+
+   bv->bv_page = page;
+   bv->bv_offset = off;
+   bv->bv_len = len;
 
-   bio->bi_vcnt++;
-done:
bio->bi_iter.bi_size += len;
+   bio->bi_vcnt++;
+}
+EXPORT_SYMBOL_GPL(__bio_add_page);
+
+/**
+ * bio_add_page-   attempt to add page to bio
+ * @bio: destination bio
+ * @page: page to add
+ * @len: vec entry length
+ * @offset: vec entry offset
+ *
+ * Attempt to add a page to the bio_vec maplist. This will only fail
+ * if either bio->bi_vcnt == bio->bi_max_vecs or it's a cloned bio.
+ */
+int bio_add_page(struct bio *bio, struct page *page,
+unsigned int len, unsigned int offset)
+{
+   if (!__bio_try_merge_page(bio, page, len, offset)) {
+   if (bio_full(bio))
+   return 0;
+   __bio_add_page(bio, page, len, offset);
+   }
return len;
 }
 EXPORT_SYMBOL(bio_add_page);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index ce547a25e8ae..3e73c8bc25ea 100644
--- a/include/linux/bio.h
+++ 

[PATCH 13/34] iomap: add a iomap_sector helper

2018-05-18 Thread Christoph Hellwig
Factor the repeated calculation of the on-disk sector for a given logical
block into a little helper.

Signed-off-by: Christoph Hellwig 
---
 fs/iomap.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 6427627a247f..44259eadb69d 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -97,6 +97,12 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, 
unsigned flags,
return written ? written : ret;
 }
 
+static sector_t
+iomap_sector(struct iomap *iomap, loff_t pos)
+{
+   return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
+}
+
 static void
 iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
 {
@@ -354,11 +360,8 @@ static int iomap_zero(struct inode *inode, loff_t pos, 
unsigned offset,
 static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
struct iomap *iomap)
 {
-   sector_t sector = (iomap->addr +
-  (pos & PAGE_MASK) - iomap->offset) >> 9;
-
-   return __dax_zero_page_range(iomap->bdev, iomap->dax_dev, sector,
-   offset, bytes);
+   return __dax_zero_page_range(iomap->bdev, iomap->dax_dev,
+   iomap_sector(iomap, pos & PAGE_MASK), offset, bytes);
 }
 
 static loff_t
@@ -951,8 +954,7 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, 
loff_t pos,
 
bio = bio_alloc(GFP_KERNEL, 1);
bio_set_dev(bio, iomap->bdev);
-   bio->bi_iter.bi_sector =
-   (iomap->addr + pos - iomap->offset) >> 9;
+   bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;
 
@@ -1046,8 +1048,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t 
length,
 
bio = bio_alloc(GFP_KERNEL, nr_pages);
bio_set_dev(bio, iomap->bdev);
-   bio->bi_iter.bi_sector =
-   (iomap->addr + pos - iomap->offset) >> 9;
+   bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
bio->bi_write_hint = dio->iocb->ki_hint;
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;
-- 
2.17.0



[PATCH 08/34] mm: split ->readpages calls to avoid non-contiguous pages lists

2018-05-18 Thread Christoph Hellwig
That way file systems don't have to go spotting for non-contiguous pages
and work around them.  It also kicks off I/O earlier, allowing it to
finish earlier and reduce latency.

Signed-off-by: Christoph Hellwig 
---
 mm/readahead.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index fa4d4b767130..044ab0c137cc 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -177,8 +177,18 @@ unsigned int __do_page_cache_readahead(struct 
address_space *mapping,
rcu_read_lock();
page = radix_tree_lookup(&mapping->i_pages, page_offset);
rcu_read_unlock();
-   if (page && !radix_tree_exceptional_entry(page))
+   if (page && !radix_tree_exceptional_entry(page)) {
+   /*
+* Page already present?  Kick off the current batch of
+* contiguous pages before continuing with the next
+* batch.
+*/
+   if (nr_pages)
+   read_pages(mapping, filp, &page_pool, nr_pages,
+   gfp_mask);
+   nr_pages = 0;
continue;
+   }
 
page = __page_cache_alloc(gfp_mask);
if (!page)
-- 
2.17.0



[PATCH 09/34] iomap: inline data should be an iomap type, not a flag

2018-05-18 Thread Christoph Hellwig
Inline data is fundamentally different from our normal mapped case in that
it doesn't even have a block address.  So instead of having a flag for it
it should be an entirely separate iomap range type.

Signed-off-by: Christoph Hellwig 
---
 fs/ext4/inline.c  |  4 ++--
 fs/gfs2/bmap.c|  3 +--
 fs/iomap.c| 21 -
 include/linux/iomap.h |  2 +-
 4 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 70cf4c7b268a..e1f00891ef95 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -1835,8 +1835,8 @@ int ext4_inline_data_iomap(struct inode *inode, struct 
iomap *iomap)
iomap->offset = 0;
iomap->length = min_t(loff_t, ext4_get_inline_size(inode),
  i_size_read(inode));
-   iomap->type = 0;
-   iomap->flags = IOMAP_F_DATA_INLINE;
+   iomap->type = IOMAP_INLINE;
+   iomap->flags = 0;
 
 out:
up_read(&EXT4_I(inode)->xattr_sem);
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 278ed0869c3c..cbeedd3cfb36 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -680,8 +680,7 @@ static void gfs2_stuffed_iomap(struct inode *inode, struct 
iomap *iomap)
  sizeof(struct gfs2_dinode);
iomap->offset = 0;
iomap->length = i_size_read(inode);
-   iomap->type = IOMAP_MAPPED;
-   iomap->flags = IOMAP_F_DATA_INLINE;
+   iomap->type = IOMAP_INLINE;
 }
 
 /**
diff --git a/fs/iomap.c b/fs/iomap.c
index 0fecd5789d7b..a859e15d7bec 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -503,10 +503,13 @@ static int iomap_to_fiemap(struct fiemap_extent_info *fi,
case IOMAP_DELALLOC:
flags |= FIEMAP_EXTENT_DELALLOC | FIEMAP_EXTENT_UNKNOWN;
break;
+   case IOMAP_MAPPED:
+   break;
case IOMAP_UNWRITTEN:
flags |= FIEMAP_EXTENT_UNWRITTEN;
break;
-   case IOMAP_MAPPED:
+   case IOMAP_INLINE:
+   flags |= FIEMAP_EXTENT_DATA_INLINE;
break;
}
 
@@ -514,8 +517,6 @@ static int iomap_to_fiemap(struct fiemap_extent_info *fi,
flags |= FIEMAP_EXTENT_MERGED;
if (iomap->flags & IOMAP_F_SHARED)
flags |= FIEMAP_EXTENT_SHARED;
-   if (iomap->flags & IOMAP_F_DATA_INLINE)
-   flags |= FIEMAP_EXTENT_DATA_INLINE;
 
return fiemap_fill_next_extent(fi, iomap->offset,
iomap->addr != IOMAP_NULL_ADDR ? iomap->addr : 0,
@@ -1326,14 +1327,16 @@ static loff_t iomap_swapfile_activate_actor(struct 
inode *inode, loff_t pos,
struct iomap_swapfile_info *isi = data;
int error;
 
-   /* No inline data. */
-   if (iomap->flags & IOMAP_F_DATA_INLINE) {
+   switch (iomap->type) {
+   case IOMAP_MAPPED:
+   case IOMAP_UNWRITTEN:
+   /* Only real or unwritten extents. */
+   break;
+   case IOMAP_INLINE:
+   /* No inline data. */
pr_err("swapon: file is inline\n");
return -EINVAL;
-   }
-
-   /* Only real or unwritten extents. */
-   if (iomap->type != IOMAP_MAPPED && iomap->type != IOMAP_UNWRITTEN) {
+   default:
pr_err("swapon: file has unallocated extents\n");
return -EINVAL;
}
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 4bd87294219a..8f7095fc514e 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -18,6 +18,7 @@ struct vm_fault;
 #define IOMAP_DELALLOC 0x02/* delayed allocation blocks */
 #define IOMAP_MAPPED   0x03/* blocks allocated at @addr */
 #define IOMAP_UNWRITTEN0x04/* blocks allocated at @addr in 
unwritten state */
+#define IOMAP_INLINE   0x05/* data inline in the inode */
 
 /*
  * Flags for all iomap mappings:
@@ -34,7 +35,6 @@ struct vm_fault;
  */
 #define IOMAP_F_MERGED 0x10/* contains multiple blocks/extents */
 #define IOMAP_F_SHARED 0x20/* block shared with another file */
-#define IOMAP_F_DATA_INLINE0x40/* data inline in the inode */
 
 /*
  * Magic value for addr:
-- 
2.17.0



[PATCH 04/34] fs: remove the buffer_unwritten check in page_seek_hole_data

2018-05-18 Thread Christoph Hellwig
We only call into this function through the iomap iterators, so we already
know the buffer is unwritten.  In addition to that we always require the
uptodate flag that is ORed with the result anyway.

Signed-off-by: Christoph Hellwig 
---
 fs/iomap.c | 13 -
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 4a01d2f4e8e9..bef5e91d40bf 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -611,14 +611,9 @@ page_seek_hole_data(struct page *page, loff_t lastoff, int 
whence)
continue;
 
/*
-* Unwritten extents that have data in the page cache covering
-* them can be identified by the BH_Unwritten state flag.
-* Pages with multiple buffers might have a mix of holes, data
-* and unwritten extents - any buffer with valid data in it
-* should have BH_Uptodate flag set on it.
+* Any buffer with valid data in it should have BH_Uptodate set.
 */
-
-   if ((buffer_unwritten(bh) || buffer_uptodate(bh)) == seek_data)
+   if (buffer_uptodate(bh) == seek_data)
return lastoff;
 
lastoff = offset;
@@ -630,8 +625,8 @@ page_seek_hole_data(struct page *page, loff_t lastoff, int 
whence)
  * Seek for SEEK_DATA / SEEK_HOLE in the page cache.
  *
  * Within unwritten extents, the page cache determines which parts are holes
- * and which are data: unwritten and uptodate buffer heads count as data;
- * everything else counts as a hole.
+ * and which are data: uptodate buffer heads count as data; everything else
+ * counts as a hole.
  *
  * Returns the resulting offset on successs, and -ENOENT otherwise.
  */
-- 
2.17.0



[PATCH 07/34] mm: return an unsigned int from __do_page_cache_readahead

2018-05-18 Thread Christoph Hellwig
We never return an error, so switch to returning an unsigned int.  Most
callers already did implicit casts to an unsigned type, and the one that
didn't can be simplified now.

Suggested-by: Matthew Wilcox 
Signed-off-by: Christoph Hellwig 
---
 mm/internal.h  |  2 +-
 mm/readahead.c | 15 +--
 2 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 62d8c34e63d5..954003ac766a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -53,7 +53,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 unsigned long addr, unsigned long end,
 struct zap_details *details);
 
-extern int __do_page_cache_readahead(struct address_space *mapping,
+extern unsigned int __do_page_cache_readahead(struct address_space *mapping,
struct file *filp, pgoff_t offset, unsigned long nr_to_read,
unsigned long lookahead_size);
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 16d0cb1e2616..fa4d4b767130 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -147,16 +147,16 @@ static int read_pages(struct address_space *mapping, 
struct file *filp,
  *
  * Returns the number of pages requested, or the maximum amount of I/O allowed.
  */
-int __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
-   pgoff_t offset, unsigned long nr_to_read,
-   unsigned long lookahead_size)
+unsigned int __do_page_cache_readahead(struct address_space *mapping,
+   struct file *filp, pgoff_t offset, unsigned long nr_to_read,
+   unsigned long lookahead_size)
 {
struct inode *inode = mapping->host;
struct page *page;
unsigned long end_index;/* The last page we want to read */
LIST_HEAD(page_pool);
int page_idx;
-   int nr_pages = 0;
+   unsigned int nr_pages = 0;
loff_t isize = i_size_read(inode);
gfp_t gfp_mask = readahead_gfp_mask(mapping);
 
@@ -223,16 +223,11 @@ int force_page_cache_readahead(struct address_space 
*mapping, struct file *filp,
max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages);
nr_to_read = min(nr_to_read, max_pages);
while (nr_to_read) {
-   int err;
-
unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_SIZE;
 
if (this_chunk > nr_to_read)
this_chunk = nr_to_read;
-   err = __do_page_cache_readahead(mapping, filp,
-   offset, this_chunk, 0);
-   if (err < 0)
-   return err;
+   __do_page_cache_readahead(mapping, filp, offset, this_chunk, 0);
 
offset += this_chunk;
nr_to_read -= this_chunk;
-- 
2.17.0



[PATCH 02/34] fs: factor out a __generic_write_end helper

2018-05-18 Thread Christoph Hellwig
Bits of the buffer.c based write_end implementations that don't know
about buffer_heads and can be reused by other implementations.

Signed-off-by: Christoph Hellwig 
---
 fs/buffer.c   | 67 +++
 fs/internal.h |  2 ++
 2 files changed, 37 insertions(+), 32 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 249b83fafe48..bd964b2ad99a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2076,6 +2076,40 @@ int block_write_begin(struct address_space *mapping, 
loff_t pos, unsigned len,
 }
 EXPORT_SYMBOL(block_write_begin);
 
+int __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
+   struct page *page)
+{
+   loff_t old_size = inode->i_size;
+   bool i_size_changed = false;
+
+   /*
+* No need to use i_size_read() here, the i_size cannot change under us
+* because we hold i_rwsem.
+*
+* But it's important to update i_size while still holding page lock:
+* page writeout could otherwise come in and zero beyond i_size.
+*/
+   if (pos + copied > inode->i_size) {
+   i_size_write(inode, pos + copied);
+   i_size_changed = true;
+   }
+
+   unlock_page(page);
+   put_page(page);
+
+   if (old_size < pos)
+   pagecache_isize_extended(inode, old_size, pos);
+   /*
+* Don't mark the inode dirty under page lock. First, it unnecessarily
+* makes the holding time of page lock longer. Second, it forces lock
+* ordering of page lock and transaction start for journaling
+* filesystems.
+*/
+   if (i_size_changed)
+   mark_inode_dirty(inode);
+   return copied;
+}
+
 int block_write_end(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata)
@@ -2116,39 +2150,8 @@ int generic_write_end(struct file *file, struct 
address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata)
 {
-   struct inode *inode = mapping->host;
-   loff_t old_size = inode->i_size;
-   int i_size_changed = 0;
-
copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
-
-   /*
-* No need to use i_size_read() here, the i_size
-* cannot change under us because we hold i_mutex.
-*
-* But it's important to update i_size while still holding page lock:
-* page writeout could otherwise come in and zero beyond i_size.
-*/
-   if (pos+copied > inode->i_size) {
-   i_size_write(inode, pos+copied);
-   i_size_changed = 1;
-   }
-
-   unlock_page(page);
-   put_page(page);
-
-   if (old_size < pos)
-   pagecache_isize_extended(inode, old_size, pos);
-   /*
-* Don't mark the inode dirty under page lock. First, it unnecessarily
-* makes the holding time of page lock longer. Second, it forces lock
-* ordering of page lock and transaction start for journaling
-* filesystems.
-*/
-   if (i_size_changed)
-   mark_inode_dirty(inode);
-
-   return copied;
+   return __generic_write_end(mapping->host, pos, copied, page);
 }
 EXPORT_SYMBOL(generic_write_end);
 
diff --git a/fs/internal.h b/fs/internal.h
index e08972db0303..b955232d3d49 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -43,6 +43,8 @@ static inline int __sync_blockdev(struct block_device *bdev, 
int wait)
 extern void guard_bio_eod(int rw, struct bio *bio);
 extern int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
get_block_t *get_block, struct iomap *iomap);
+int __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
+   struct page *page);
 
 /*
  * char_dev.c
-- 
2.17.0



[PATCH 03/34] fs: move page_cache_seek_hole_data to iomap.c

2018-05-18 Thread Christoph Hellwig
This function is only used by the iomap code, depends on being called
from it, and will soon stop poking into buffer head internals.

Signed-off-by: Christoph Hellwig 
---
 fs/buffer.c | 114 ---
 fs/iomap.c  | 116 
 include/linux/buffer_head.h |   2 -
 3 files changed, 116 insertions(+), 116 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index bd964b2ad99a..aba2a948b235 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3430,120 +3430,6 @@ int bh_submit_read(struct buffer_head *bh)
 }
 EXPORT_SYMBOL(bh_submit_read);
 
-/*
- * Seek for SEEK_DATA / SEEK_HOLE within @page, starting at @lastoff.
- *
- * Returns the offset within the file on success, and -ENOENT otherwise.
- */
-static loff_t
-page_seek_hole_data(struct page *page, loff_t lastoff, int whence)
-{
-   loff_t offset = page_offset(page);
-   struct buffer_head *bh, *head;
-   bool seek_data = whence == SEEK_DATA;
-
-   if (lastoff < offset)
-   lastoff = offset;
-
-   bh = head = page_buffers(page);
-   do {
-   offset += bh->b_size;
-   if (lastoff >= offset)
-   continue;
-
-   /*
-* Unwritten extents that have data in the page cache covering
-* them can be identified by the BH_Unwritten state flag.
-* Pages with multiple buffers might have a mix of holes, data
-* and unwritten extents - any buffer with valid data in it
-* should have BH_Uptodate flag set on it.
-*/
-
-   if ((buffer_unwritten(bh) || buffer_uptodate(bh)) == seek_data)
-   return lastoff;
-
-   lastoff = offset;
-   } while ((bh = bh->b_this_page) != head);
-   return -ENOENT;
-}
-
-/*
- * Seek for SEEK_DATA / SEEK_HOLE in the page cache.
- *
- * Within unwritten extents, the page cache determines which parts are holes
- * and which are data: unwritten and uptodate buffer heads count as data;
- * everything else counts as a hole.
- *
- * Returns the resulting offset on successs, and -ENOENT otherwise.
- */
-loff_t
-page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
- int whence)
-{
-   pgoff_t index = offset >> PAGE_SHIFT;
-   pgoff_t end = DIV_ROUND_UP(offset + length, PAGE_SIZE);
-   loff_t lastoff = offset;
-   struct pagevec pvec;
-
-   if (length <= 0)
-   return -ENOENT;
-
-   pagevec_init(&pvec);
-
-   do {
-   unsigned nr_pages, i;
-
-   nr_pages = pagevec_lookup_range(&pvec, inode->i_mapping, &index,
-   end - 1);
-   if (nr_pages == 0)
-   break;
-
-   for (i = 0; i < nr_pages; i++) {
-   struct page *page = pvec.pages[i];
-
-   /*
-* At this point, the page may be truncated or
-* invalidated (changing page->mapping to NULL), or
-* even swizzled back from swapper_space to tmpfs file
-* mapping.  However, page->index will not change
-* because we have a reference on the page.
- *
-* If current page offset is beyond where we've ended,
-* we've found a hole.
- */
-   if (whence == SEEK_HOLE &&
-   lastoff < page_offset(page))
-   goto check_range;
-
-   lock_page(page);
-   if (likely(page->mapping == inode->i_mapping) &&
-   page_has_buffers(page)) {
-   lastoff = page_seek_hole_data(page, lastoff, 
whence);
-   if (lastoff >= 0) {
-   unlock_page(page);
-   goto check_range;
-   }
-   }
-   unlock_page(page);
-   lastoff = page_offset(page) + PAGE_SIZE;
-   }
-   pagevec_release(&pvec);
-   } while (index < end);
-
-   /* When no page at lastoff and we are not done, we found a hole. */
-   if (whence != SEEK_HOLE)
-   goto not_found;
-
-check_range:
-   if (lastoff < offset + length)
-   goto out;
-not_found:
-   lastoff = -ENOENT;
-out:
-   pagevec_release(&pvec);
-   return lastoff;
-}
-
 void __init buffer_init(void)
 {
unsigned long nrpages;
diff --git a/fs/iomap.c b/fs/iomap.c
index f2456d0d8ddd..4a01d2f4e8e9 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -588,6 +589,121 @@ 

[PATCH 2/6] nvme-pci: Fix queue freeze criteria on reset

2018-05-18 Thread Keith Busch
The driver had been relying on the pci_dev to maintain the state of
the pci device to know when starting a freeze would be appropriate. The
blktests block/011 however shows us that users may alter the state of
pci_dev out from under drivers and break the criteria we had been using.

This patch uses the private nvme controller struct to track the
enabling/disabling state. Since we're relying on that now, the reset will
unconditionally disable the device on reset. This is necessary anyway
on a controller failure reset, and was already being done in the reset
during admin bring up, and is not harmful to do a second time.

Signed-off-by: Keith Busch 
---
 drivers/nvme/host/pci.c | 18 --
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8da63402d474..2bd9d84f58d0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2196,24 +2196,22 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool 
shutdown)
struct pci_dev *pdev = to_pci_dev(dev->dev);
 
mutex_lock(&dev->shutdown_lock);
-   if (pci_is_enabled(pdev)) {
+   if (dev->ctrl.ctrl_config & NVME_CC_ENABLE &&
+   (dev->ctrl.state == NVME_CTRL_LIVE ||
+dev->ctrl.state == NVME_CTRL_RESETTING)) {
u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
-   if (dev->ctrl.state == NVME_CTRL_LIVE ||
-   dev->ctrl.state == NVME_CTRL_RESETTING)
-   nvme_start_freeze(&dev->ctrl);
+   nvme_start_freeze(&dev->ctrl);
dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
-   pdev->error_state  != pci_channel_io_normal);
+   pci_channel_offline(pdev) || !pci_is_enabled(pdev));
}
 
/*
 * Give the controller a chance to complete all entered requests if
 * doing a safe shutdown.
 */
-   if (!dead) {
-   if (shutdown)
-   nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
-   }
+   if (!dead && shutdown)
+   nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
 
nvme_stop_queues(&dev->ctrl);
 
@@ -2227,8 +2225,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool 
shutdown)
if (dev->host_mem_descs)
nvme_set_host_mem(dev, 0);
nvme_disable_io_queues(dev);
-   nvme_disable_admin_queue(dev, shutdown);
}
+   nvme_disable_admin_queue(dev, shutdown);
for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
nvme_suspend_queue(&dev->queues[i]);
 
-- 
2.14.3



[PATCH 5/6] nvme-pci: Attempt reset retry for IO failures

2018-05-18 Thread Keith Busch
If the reset failed due to a non-fatal error, this patch will attempt
to reset the controller again, with a maximum of 4 attempts.

Since the failed reset case has changed purpose, this patch provides a
more appropriate name and warning message for the reset failure.

Signed-off-by: Keith Busch 
---
 drivers/nvme/host/pci.c | 26 +++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6a7cbc631d92..ddfeb186d129 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -37,6 +37,8 @@
 
 #define SGES_PER_PAGE  (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
 
+#define MAX_RESET_FAILURES 4
+
 static int use_threaded_interrupts;
 module_param(use_threaded_interrupts, int, 0);
 
@@ -101,6 +103,8 @@ struct nvme_dev {
struct completion ioq_wait;
bool queues_froze;
 
+   int reset_failures;
+
/* shadow doorbell buffer support: */
u32 *dbbuf_dbs;
dma_addr_t dbbuf_dbs_dma_addr;
@@ -2307,9 +2311,23 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
kfree(dev);
 }
 
-static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
+static void nvme_reset_failure(struct nvme_dev *dev, int status)
 {
-   dev_warn(dev->ctrl.device, "Removing after probe failure status: %d\n", 
status);
+   dev->reset_failures++;
+   dev_warn(dev->ctrl.device, "Reset failure status: %d, failures:%d\n",
+   status, dev->reset_failures);
+
+   /* IO and Interrupted Call may indicate a retryable error */
+   switch (status) {
+   case -EIO:
+   case -EINTR:
+   if (dev->reset_failures < MAX_RESET_FAILURES &&
+   !nvme_reset_ctrl(&dev->ctrl))
+   return;
+   break;
+   default:
+   break;
+   }
 
nvme_get_ctrl(&dev->ctrl);
nvme_dev_disable(dev, false);
@@ -2410,14 +2428,16 @@ static void nvme_reset_work(struct work_struct *work)
if (!nvme_change_ctrl_state(&dev->ctrl, new_state)) {
dev_warn(dev->ctrl.device,
"failed to mark controller state %d\n", new_state);
+   result = -ENODEV;
goto out;
}
 
+   dev->reset_failures = 0;
nvme_start_ctrl(&dev->ctrl);
return;
 
  out:
-   nvme_remove_dead_ctrl(dev, result);
+   nvme_reset_failure(dev, result);
 }
 
 static void nvme_remove_dead_ctrl_work(struct work_struct *work)
-- 
2.14.3



[PATCH 3/6] nvme: Move all IO out of controller reset

2018-05-18 Thread Keith Busch
IO may be retryable, so don't wait for them in the reset path. These
commands may trigger a reset if that IO expires without a completion,
placing it on the requeue list. Waiting for these would then deadlock
the reset handler.

To fix the theoretical deadlock, this patch unblocks IO submission from
the reset_work as before, but moves the waiting to the IO safe scan_work
so that the reset_work may proceed to completion. Since the unfreezing
happens in the controller LIVE state, the nvme device has to track if
the queues were frozen now to prevent incorrect freeze depths.

This patch is also renaming the function 'nvme_dev_add' to a
more appropriate name that describes what it's actually doing:
nvme_alloc_io_tags.

Signed-off-by: Keith Busch 
---
 drivers/nvme/host/core.c |  3 +++
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/pci.c  | 46 +-
 3 files changed, 37 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1de68b56b318..34d7731f1419 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -214,6 +214,7 @@ static inline bool nvme_req_needs_retry(struct request *req)
if (blk_noretry_request(req))
return false;
if (nvme_req(req)->status & NVME_SC_DNR)
+
return false;
if (nvme_req(req)->retries >= nvme_max_retries)
return false;
@@ -3177,6 +3178,8 @@ static void nvme_scan_work(struct work_struct *work)
struct nvme_id_ctrl *id;
unsigned nn;
 
+   if (ctrl->ops->update_hw_ctx)
+   ctrl->ops->update_hw_ctx(ctrl);
if (ctrl->state != NVME_CTRL_LIVE)
return;
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index c15c2ee7f61a..230c5424b197 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -320,6 +320,7 @@ struct nvme_ctrl_ops {
int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
int (*reinit_request)(void *data, struct request *rq);
void (*stop_ctrl)(struct nvme_ctrl *ctrl);
+   void (*update_hw_ctx)(struct nvme_ctrl *ctrl);
 };
 
 #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 2bd9d84f58d0..6a7cbc631d92 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -99,6 +99,7 @@ struct nvme_dev {
u32 cmbloc;
struct nvme_ctrl ctrl;
struct completion ioq_wait;
+   bool queues_froze;
 
/* shadow doorbell buffer support: */
u32 *dbbuf_dbs;
@@ -2065,10 +2066,33 @@ static void nvme_disable_io_queues(struct nvme_dev *dev)
}
 }
 
+static void nvme_pci_update_hw_ctx(struct nvme_ctrl *ctrl)
+{
+   struct nvme_dev *dev = to_nvme_dev(ctrl);
+   bool unfreeze;
+
+   mutex_lock(&dev->shutdown_lock);
+   unfreeze = dev->queues_froze;
+   mutex_unlock(&dev->shutdown_lock);
+
+   if (unfreeze)
+   nvme_wait_freeze(&dev->ctrl);
+
+   blk_mq_update_nr_hw_queues(ctrl->tagset, dev->online_queues - 1);
+   nvme_free_queues(dev, dev->online_queues);
+
+   if (unfreeze)
+   nvme_unfreeze(&dev->ctrl);
+
+   mutex_lock(&dev->shutdown_lock);
+   dev->queues_froze = false;
+   mutex_unlock(&dev->shutdown_lock);
+}
+
 /*
  * return error value only when tagset allocation failed
  */
-static int nvme_dev_add(struct nvme_dev *dev)
+static int nvme_alloc_io_tags(struct nvme_dev *dev)
 {
int ret;
 
@@ -2097,10 +2121,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
 
nvme_dbbuf_set(dev);
} else {
-   blk_mq_update_nr_hw_queues(&dev->tagset, dev->online_queues - 1);
-
-   /* Free previously allocated queues that are no longer usable */
-   nvme_free_queues(dev, dev->online_queues);
+   nvme_start_queues(&dev->ctrl);
}
 
return 0;
@@ -2201,7 +,10 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool 
shutdown)
 dev->ctrl.state == NVME_CTRL_RESETTING)) {
u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
-   nvme_start_freeze(&dev->ctrl);
+   if (!dev->queues_froze) {
+   nvme_start_freeze(&dev->ctrl);
+   dev->queues_froze = true;
+   }
dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
pci_channel_offline(pdev) || !pci_is_enabled(pdev));
}
@@ -2375,13 +2399,8 @@ static void nvme_reset_work(struct work_struct *work)
nvme_kill_queues(&dev->ctrl);
nvme_remove_namespaces(&dev->ctrl);
new_state = NVME_CTRL_ADMIN_ONLY;
-   } else {
-   nvme_start_queues(&dev->ctrl);
-   nvme_wait_freeze(&dev->ctrl);
-   /* hit this only when allocate tagset fails */
-   if (nvme_dev_add(dev))
-   new_state = 

[PATCH 6/6] nvme-pci: Rate limit the nvme timeout warnings

2018-05-18 Thread Keith Busch
The block layer's timeout handling currently refuses to let the driver
complete commands outside the timeout callback once blk-mq decides they've
expired. If a device breaks, this could potentially create many thousands
of timed out commands. There's nothing of value to be gleaned from
observing each of those messages, so this patch adds a ratelimit on them.

Signed-off-by: Keith Busch 
---
 drivers/nvme/host/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ddfeb186d129..e4b91c246e36 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1251,7 +1251,7 @@ static enum blk_eh_timer_return nvme_timeout(struct 
request *req, bool reserved)
 * returned to the driver, or if this is the admin queue.
 */
if (!nvmeq->qid || iod->aborted) {
-   dev_warn(dev->ctrl.device,
+   dev_warn_ratelimited(dev->ctrl.device,
 "I/O %d QID %d timeout, reset controller\n",
 req->tag, nvmeq->qid);
nvme_dev_disable(dev, false);
-- 
2.14.3



[PATCH 1/6] nvme: Sync request queues on reset

2018-05-18 Thread Keith Busch
This patch fixes races that occur with simultaneous controller
resets by synchronizing request queues prior to initializing the
controller. Without this, a thread may attempt disabling a controller
at the same time as we're trying to enable it.

Signed-off-by: Keith Busch 
---
 drivers/nvme/host/core.c | 21 +++--
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/pci.c  |  1 +
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 99b857e5a7a9..1de68b56b318 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3471,6 +3471,12 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device 
*dev,
 }
 EXPORT_SYMBOL_GPL(nvme_init_ctrl);
 
+static void nvme_start_queue(struct nvme_ns *ns)
+{
+   blk_mq_unquiesce_queue(ns->queue);
+   blk_mq_kick_requeue_list(ns->queue);
+}
+
 /**
  * nvme_kill_queues(): Ends all namespace queues
  * @ctrl: the dead controller that needs to end
@@ -3499,7 +3505,7 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl)
blk_set_queue_dying(ns->queue);
 
/* Forcibly unquiesce queues to avoid blocking dispatch */
-   blk_mq_unquiesce_queue(ns->queue);
+   nvme_start_queue(ns);
}
up_read(&ctrl->namespaces_rwsem);
 }
@@ -3569,11 +3575,22 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
 
down_read(&ctrl->namespaces_rwsem);
list_for_each_entry(ns, &ctrl->namespaces, list)
-   blk_mq_unquiesce_queue(ns->queue);
+   nvme_start_queue(ns);
up_read(&ctrl->namespaces_rwsem);
 }
 EXPORT_SYMBOL_GPL(nvme_start_queues);
 
+void nvme_sync_queues(struct nvme_ctrl *ctrl)
+{
+   struct nvme_ns *ns;
+
+   down_read(&ctrl->namespaces_rwsem);
+   list_for_each_entry(ns, &ctrl->namespaces, list)
+   blk_sync_queue(ns->queue);
+   up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_sync_queues);
+
 int nvme_reinit_tagset(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set)
 {
if (!ctrl->ops->reinit_request)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 17d2f7cf3fed..c15c2ee7f61a 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -407,6 +407,7 @@ int nvme_sec_submit(void *data, u16 spsp, u8 secp, void 
*buffer, size_t len,
 void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
union nvme_result *res);
 
+void nvme_sync_queues(struct nvme_ctrl *ctrl);
 void nvme_stop_queues(struct nvme_ctrl *ctrl);
 void nvme_start_queues(struct nvme_ctrl *ctrl);
 void nvme_kill_queues(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 17a0190bd88f..8da63402d474 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2312,6 +2312,7 @@ static void nvme_reset_work(struct work_struct *work)
 */
if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
nvme_dev_disable(dev, false);
+   nvme_sync_queues(&dev->ctrl);
 
/*
 * Introduce CONNECTING state from nvme-fc/rdma transports to mark the
-- 
2.14.3



[PATCH 4/6] nvme: Allow reset from CONNECTING state

2018-05-18 Thread Keith Busch
A failed connection may be retryable. This patch allows the connecting
state to initiate a reset so that it may try to connect again.

Signed-off-by: Keith Busch 
---
 drivers/nvme/host/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 34d7731f1419..bccc92206fba 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -293,6 +293,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
case NVME_CTRL_NEW:
case NVME_CTRL_LIVE:
case NVME_CTRL_ADMIN_ONLY:
+   case NVME_CTRL_CONNECTING:
changed = true;
/* FALLTHRU */
default:
-- 
2.14.3



Re: [PATCH 00/10] Misc block layer patches for bcachefs

2018-05-18 Thread Jens Axboe
On 5/18/18 10:23 AM, Christoph Hellwig wrote:
> On Fri, May 11, 2018 at 03:13:38PM -0600, Jens Axboe wrote:
>> Looked over the series, and looks like both good cleanups and optimizations.
>> If we can get the mempool patch sorted, I can apply this for 4.18.
> 
> FYI, I agree on the actual cleanups and optimization, but we really
> shouldn't add new functions or even just exports without the code
> using them.  I think it is enough if we can collect ACKs on them, but
> there is no point in using them.  Especially as I'd really like to see
> the users for some of them first.

I certainly agree on that in general, but at the same time it makes the
expected submission of bcachefs not having to carry a number of
(essentially) unrelated patches. I'm assuming the likelihood of bcachefs
being submitted soonish is high, hence we won't have exports that don't
have in-kernel users in the longer term.

-- 
Jens Axboe



Re: [PATCH V6 11/11] nvme: pci: support nested EH

2018-05-18 Thread Keith Busch
On Thu, May 17, 2018 at 04:23:45PM +0200, Johannes Thumshirn wrote:
> > Agreed. Alternatively possibly call the driver's reset_preparei/done
> > callbacks.
> 
> Exactly, but as long as we can issue the reset via sysfs the test-case
> is still valid.

I disagree the test case is valid. The test writes '0' to the
pci-sysfs 'enable', but the driver also disables the pci device as part
of resetting, which is a perfectly reasonable thing for a driver to do.

If the timing of the test's loop happens to write '0' right after the
driver disabled the device that it owns, a 'write error' on that sysfs
write occurs, and blktests then incorrectly claims the test failed.


Re: [PATCH 00/10] Misc block layer patches for bcachefs

2018-05-18 Thread Christoph Hellwig
On Fri, May 11, 2018 at 03:13:38PM -0600, Jens Axboe wrote:
> Looked over the series, and looks like both good cleanups and optimizations.
> If we can get the mempool patch sorted, I can apply this for 4.18.

FYI, I agree on the actual cleanups and optimization, but we really
shouldn't add new functions or even just exports without the code
using them.  I think it is enough if we can collect ACKs on them, but
there is no point in using them.  Especially as I'd really like to see
the users for some of them first.


Re: [PATCH 02/10] block: Convert bio_set to mempool_init()

2018-05-18 Thread Christoph Hellwig
On Fri, May 18, 2018 at 09:20:28AM -0700, Christoph Hellwig wrote:
> On Tue, May 08, 2018 at 09:33:50PM -0400, Kent Overstreet wrote:
> > Minor performance improvement by getting rid of pointer indirections
> > from allocation/freeing fastpaths.
> 
> Can you please also send along a conversion for the remaining
> few bioset_create users?  It would be rather silly to keep two
> almost the same interfaces around for just about two handfuls
> of users.

This comment was meant in reply to the next patch, sorry.


Re: [PATCH 02/10] block: Convert bio_set to mempool_init()

2018-05-18 Thread Christoph Hellwig
On Tue, May 08, 2018 at 09:33:50PM -0400, Kent Overstreet wrote:
> Minor performance improvement by getting rid of pointer indirections
> from allocation/freeing fastpaths.

Can you please also send along a conversion for the remaining
few bioset_create users?  It would be rather silly to keep two
almost the same interfaces around for just about two handfuls
of users.


Re: [PATCH v4 3/3] fs: Add aio iopriority support for block_dev

2018-05-18 Thread Christoph Hellwig
Looks fine, although I'd split it into an aio patch and a block_dev patch.

Also please wire this up for the fs/iomap.c direct I/O code; it should
be essentially the same snippet as in the block_dev.c code.
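
For illustration, the equivalent hunk in fs/iomap.c's iomap_dio_actor()
would presumably look something like the sketch below (assuming the
IOCB_IOPRIO flag and kiocb ki_ioprio field from patch 3 as posted;
illustrative only, not part of the series):

 	bio->bi_write_hint = dio->iocb->ki_hint;
+	/* propagate the per-I/O priority, same as in __blkdev_direct_IO() */
+	if (dio->iocb->ki_flags & IOCB_IOPRIO)
+		bio->bi_ioprio = dio->iocb->ki_ioprio;
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;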


Re: [PATCH v4 2/3] fs: Convert kiocb rw_hint from enum to u16

2018-05-18 Thread Christoph Hellwig
> +/* ki_hint changed from enum to u16, make sure rw_hint fits into u16 */

I don't think this comment is very useful.

> +static inline u16 ki_hint_valid(enum rw_hint hint)

I'd call this ki_hint_validate.

> +{
> + if (hint > MAX_KI_HINT)
> + return 0;
> +
> + return hint;

Nit: kill the empty line.
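
With both nits applied, the helper would presumably end up roughly like
the sketch below (a sketch of the suggestion, not the posted code):

/* clamp out-of-range hints to "no hint" so they fit into the u16 field */
static inline u16 ki_hint_validate(enum rw_hint hint)
{
	if (hint > MAX_KI_HINT)
		return 0;
	return hint;
}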


Re: [PATCH v4 1/3] block: add ioprio_check_cap function

2018-05-18 Thread Christoph Hellwig
On Thu, May 17, 2018 at 01:38:01PM -0700, adam.manzana...@wdc.com wrote:
> From: Adam Manzanares 
> 
> Aio per command iopriority support introduces a second interface between
> userland and the kernel capable of passing iopriority. The aio interface also
> needs the ability to verify that the submitting context has sufficient
> priviledges to submit IOPRIO_RT commands. This patch creates the
> ioprio_check_cap function to be used by the ioprio_set system call and also by
> the aio interface.
> 
> Signed-off-by: Adam Manzanares 

Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH v4 3/3] fs: Add aio iopriority support for block_dev

2018-05-18 Thread Adam Manzanares


On 5/18/18 8:14 AM, Jens Axboe wrote:
> On 5/17/18 2:38 PM, adam.manzana...@wdc.com wrote:
>> From: Adam Manzanares 
>>
>> This is the per-I/O equivalent of the ioprio_set system call.
>>
>> When IOCB_FLAG_IOPRIO is set on the iocb aio_flags field, then we set the
>> newly added kiocb ki_ioprio field to the value in the iocb aio_reqprio field.
>>
>> When a bio is created for an aio request by the block dev we set the priority
>> value of the bio to the user supplied value.
>>
>> This patch depends on block: add ioprio_check_cap function
> 
> Actually, one comment on this one:
> 
>> diff --git a/fs/aio.c b/fs/aio.c
>> index f3eae5d5771b..ff3107aa82d5 100644
>> --- a/fs/aio.c
>> +++ b/fs/aio.c
>> @@ -1451,6 +1451,22 @@ static int aio_prep_rw(struct kiocb *req, struct iocb 
>> *iocb)
>>  if (iocb->aio_flags & IOCB_FLAG_RESFD)
>>  req->ki_flags |= IOCB_EVENTFD;
>>  req->ki_hint = file_write_hint(req->ki_filp);
>> +if (iocb->aio_flags & IOCB_FLAG_IOPRIO) {
>> +/*
>> + * If the IOCB_FLAG_IOPRIO flag of aio_flags is set, then
>> + * aio_reqprio is interpreted as an I/O scheduling
>> + * class and priority.
>> + */
>> +ret = ioprio_check_cap(iocb->aio_reqprio);
>> +if (ret) {
>> +pr_debug("aio ioprio check cap error\n");
>> +return -EINVAL;
>> +}
>> +
>> +req->ki_ioprio = iocb->aio_reqprio;
>> +req->ki_flags |= IOCB_IOPRIO;
>> +}
> 
> Do we really need IOCB_IOPRIO? All zeroes is no priority set anyway,
> so we should be able to get by with just setting ->ki_ioprio to either
> the priority, or 0.
> 
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 7ec920e27065..970bef79caa6 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -355,6 +355,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
>> *iter, int nr_pages)
>>  bio->bi_write_hint = iocb->ki_hint;
>>  bio->bi_private = dio;
>>  bio->bi_end_io = blkdev_bio_end_io;
>> +if (iocb->ki_flags & IOCB_IOPRIO)
>> +bio->bi_ioprio = iocb->ki_ioprio;
> 
> And then this assignment can just happen unconditionally.

That is a cleaner way of guaranteeing the ioprio set on the kiocb is 
only set when the user intends to use the ioprio from the iocb.

I'll resend the series.
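
For reference, the resend would presumably drop IOCB_IOPRIO and end up
roughly like the sketch below (illustrative only, not the actual v5):

	/* in aio_prep_rw(): zero already means "no priority set" */
	req->ki_ioprio = 0;
	if (iocb->aio_flags & IOCB_FLAG_IOPRIO) {
		/* aio_reqprio is an I/O scheduling class and priority */
		ret = ioprio_check_cap(iocb->aio_reqprio);
		if (ret) {
			pr_debug("aio ioprio check cap error\n");
			return -EINVAL;
		}
		req->ki_ioprio = iocb->aio_reqprio;
	}

	/* in __blkdev_direct_IO(): copy the value unconditionally */
	bio->bi_ioprio = iocb->ki_ioprio;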


> 

Re: blk-mq: make sure that correct hctx->dispatch_from is set

2018-05-18 Thread Jens Axboe
On 5/18/18 9:10 AM, Bart Van Assche wrote:
> On Fri, 2018-05-18 at 22:46 +0800, huhai wrote:
>> Yes, it is more readable
>>
>> Finally, thank you for reminding me. Next time I'll change gmail to submit 
>> patch.
> 
> Hello Huhai,
> 
> Please have a look at Documentation/process/email-clients.rst.

Yeah, I did point at that one too.

For sending out patches, I would strongly recommend just using git send-email.
It works fine with gmail, that's what I always use.

$ cat ~/.gitconfig
[sendemail]
from = Jens Axboe 
smtpserver = smtp.gmail.com
smtpuser = ax...@kernel.dk
smtpencryption = tls
smtppass = 
smtpserverport = 587

-- 
Jens Axboe



Re: [PATCH v4 3/3] fs: Add aio iopriority support for block_dev

2018-05-18 Thread Jens Axboe
On 5/17/18 2:38 PM, adam.manzana...@wdc.com wrote:
> From: Adam Manzanares 
> 
> This is the per-I/O equivalent of the ioprio_set system call.
> 
> When IOCB_FLAG_IOPRIO is set on the iocb aio_flags field, then we set the
> newly added kiocb ki_ioprio field to the value in the iocb aio_reqprio field.
> 
> When a bio is created for an aio request by the block dev we set the priority
> value of the bio to the user supplied value.
> 
> This patch depends on block: add ioprio_check_cap function

Actually, one comment on this one:

> diff --git a/fs/aio.c b/fs/aio.c
> index f3eae5d5771b..ff3107aa82d5 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1451,6 +1451,22 @@ static int aio_prep_rw(struct kiocb *req, struct iocb 
> *iocb)
>   if (iocb->aio_flags & IOCB_FLAG_RESFD)
>   req->ki_flags |= IOCB_EVENTFD;
>   req->ki_hint = file_write_hint(req->ki_filp);
> + if (iocb->aio_flags & IOCB_FLAG_IOPRIO) {
> + /*
> +  * If the IOCB_FLAG_IOPRIO flag of aio_flags is set, then
> +  * aio_reqprio is interpreted as an I/O scheduling
> +  * class and priority.
> +  */
> + ret = ioprio_check_cap(iocb->aio_reqprio);
> + if (ret) {
> + pr_debug("aio ioprio check cap error\n");
> + return -EINVAL;
> + }
> +
> + req->ki_ioprio = iocb->aio_reqprio;
> + req->ki_flags |= IOCB_IOPRIO;
> + }

Do we really need IOCB_IOPRIO? All zeroes is no priority set anyway,
so we should be able to get by with just setting ->ki_ioprio to either
the priority, or 0.

> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 7ec920e27065..970bef79caa6 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -355,6 +355,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
> *iter, int nr_pages)
>   bio->bi_write_hint = iocb->ki_hint;
>   bio->bi_private = dio;
>   bio->bi_end_io = blkdev_bio_end_io;
> + if (iocb->ki_flags & IOCB_IOPRIO)
> + bio->bi_ioprio = iocb->ki_ioprio;

And then this assignment can just happen unconditionally.

-- 
Jens Axboe



Re: [PATCH 00/10] Misc block layer patches for bcachefs

2018-05-18 Thread Bart Van Assche
On Fri, 2018-05-18 at 05:06 -0400, Kent Overstreet wrote:
> On Thu, May 17, 2018 at 08:54:57PM +, Bart Van Assche wrote:
> > With Jens' latest for-next branch I hit the kernel warning shown below. Can
> > you have a look?
> 
> Any hints on how to reproduce it?

Sure. This is how I triggered it:
* Clone https://github.com/bvanassche/srp-test.
* Follow the instructions in README.md.
* Run srp-test/run_tests -c -r 10

Thanks,

Bart.






Re: blk-mq: make sure that correct hctx->dispatch_from is set

2018-05-18 Thread Bart Van Assche
On Fri, 2018-05-18 at 22:46 +0800, huhai wrote:
> Yes, it is more readable
> 
> Finally, thank you for reminding me. Next time I'll change gmail to submit 
> patch.

Hello Huhai,

Please have a look at Documentation/process/email-clients.rst.

Thanks,

Bart.










Re: [PATCH v2 02/26] sysfs: export sysfs_remove_file_self()

2018-05-18 Thread Tejun Heo
On Fri, May 18, 2018 at 03:03:49PM +0200, Roman Pen wrote:
> Function is going to be used in transport over RDMA module
> in subsequent patches.
> 
> Signed-off-by: Roman Pen 
> Cc: Tejun Heo 
> Cc: linux-ker...@vger.kernel.org

Acked-by: Tejun Heo 

Please feel free to apply with other patches.

Thanks.

-- 
tejun


Re: [PATCH v4 0/3] AIO add per-command iopriority

2018-05-18 Thread Adam Manzanares


On 5/17/18 7:41 PM, Jens Axboe wrote:
> On 5/17/18 2:38 PM, adam.manzana...@wdc.com wrote:
>> From: Adam Manzanares 
>>
>> This is the per-I/O equivalent of the ioprio_set system call.
>> See the following link for performance implications on a SATA HDD:
>> https://lkml.org/lkml/2016/12/6/495
>>
>> First patch factors ioprio_check_cap function out of ioprio_set system call 
>> to
>> also be used by the aio ioprio interface.
>>
>> Second patch converts kiocb ki_hint field to a u16 to avoid kiocb bloat.
>>
>> Third patch passes ioprio hint from aio iocb to kiocb and enables block_dev
>> usage of the per I/O ioprio feature.
>>
>> v2: merge patches
>>  use IOCB_FLAG_IOPRIO
>>  validate intended use with IOCB_IOPRIO
>>  add linux-api and linux-block to cc
>>
>> v3: add ioprio_check_cap function
>>  convert kiocb ki_hint to u16
>>  use ioprio_check_cap when adding ioprio to kiocb in aio.c
>>
>> v4: handle IOCB_IOPRIO in aio_prep_rw
>>  note patch 3 depends on patch 1 in commit msg
>>
>> Adam Manzanares (3):
>>block: add ioprio_check_cap function
>>fs: Convert kiocb rw_hint from enum to u16
>>fs: Add aio iopriority support for block_dev
>>
>>   block/ioprio.c   | 22 --
>>   fs/aio.c | 16 
>>   fs/block_dev.c   |  2 ++
>>   include/linux/fs.h   | 17 +++--
>>   include/linux/ioprio.h   |  2 ++
>>   include/uapi/linux/aio_abi.h |  1 +
>>   6 files changed, 52 insertions(+), 8 deletions(-)
> 
> This looks fine to me now. I can pick up #1 for 4.18 - and 2+3 as well,
> unless someone else wants to take them.

Great, thanks Jens.

> 

[GIT PULL] Single block fix for 4.17-rc6

2018-05-18 Thread Jens Axboe
Hi Linus,

Single fix this time, from Coly, fixing a failure case when
CONFIG_DEBUGFS isn't enabled.

Please pull!


  git://git.kernel.dk/linux-block.git tags/for-linus-20180518



Coly Li (1):
  bcache: return 0 from bch_debug_init() if CONFIG_DEBUG_FS=n

 drivers/md/bcache/debug.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

-- 
Jens Axboe



Re: blk-mq: make sure that correct hctx->dispatch_from is set

2018-05-18 Thread Jens Axboe
On 5/18/18 8:46 AM, huhai wrote:
> Yes, it is more readable

Final version:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.18/block=d416c92c5d6229b33f37f0f75e52194081ccbcc4

> Finally, thank you for reminding me. Next time I'll change gmail to submit 
> patch.

Not sure gmail can ever really work. You should not top-post reply to
postings either. If you want to experiment with getting a mailer setup
and whether or not it does the right thing, feel free to send a patch
to my email privately, and I can let you know if the end result is
as it should be.

-- 
Jens Axboe



Re: blk-mq: make sure that correct hctx->dispatch_from is set

2018-05-18 Thread huhai
Yes, it is more readable

Finally, thank you for reminding me. Next time I'll change gmail to submit 
patch.

 
 
 
-- Original --
From:  "Jens Axboe";
Date:  Fri, May 18, 2018 10:31 PM
To:  "胡海";
Cc:  "ming.lei"; 
"linux-block";
Subject:  Re: blk-mq: make sure that correct hctx->dispatch_from is set
 
On 5/18/18 8:27 AM, Jens Axboe wrote:
> On 5/18/18 7:42 AM, 胡海 wrote:
>> Author: huhai 
>> Date:   Fri May 18 17:09:56 2018 +0800
>>
>> blk-mq: make sure that correct hctx->dispatch_from is set
>> 
>> When the number of hardware queues is changed, the drivers will call
>> blk_mq_update_nr_hw_queues() to remap hardware queues, and then
>> the ctx mapped on hctx will also change, but the current code forgets to
>> make sure that correct hctx->dispatch_from is set, and 
>> hctx->dispatch_from
>> may point to a ctx that does not belong to the current hctx.
> 
> Looks good, thanks. One minor note for future patches - for cases like this,
> when the patch fixes an issue with a specific commit, add a fixes line.
> For this one, it would be:
> 
> Fixes: b347689ffbca ("blk-mq-sched: improve dispatching from sw queue")

Two more notes... Your patches are still coming through as base64 encoded,
they should just be plain text.

Finally, I think the below is much clearer, since that's the loop where
we clear any existing hctx context.


diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6c6aef44badd..4cbfd784e837 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2358,6 +2358,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
queue_for_each_hw_ctx(q, hctx, i) {
cpumask_clear(hctx->cpumask);
hctx->nr_ctx = 0;
+   hctx->dispatch_from = NULL;
}
 
/*

-- 
Jens Axboe

Re: blk-mq: make sure that correct hctx->dispatch_from is set

2018-05-18 Thread Jens Axboe
On 5/18/18 7:42 AM, 胡海 wrote:
> Author: huhai 
> Date:   Fri May 18 17:09:56 2018 +0800
> 
> blk-mq: make sure that correct hctx->dispatch_from is set
> 
> When the number of hardware queues is changed, the drivers will call
> blk_mq_update_nr_hw_queues() to remap hardware queues, and then
> the ctx mapped on hctx will also change, but the current code forgets to
> make sure that correct hctx->dispatch_from is set, and hctx->dispatch_from
> may point to a ctx that does not belong to the current hctx.

Looks good, thanks. One minor note for future patches - for cases like this,
when the patch fixes an issue with a specific commit, add a fixes line.
For this one, it would be:

Fixes: b347689ffbca ("blk-mq-sched: improve dispatching from sw queue")

-- 
Jens Axboe



Re: blk-mq: make sure that correct hctx->dispatch_from is set

2018-05-18 Thread Ming Lei
On Fri, May 18, 2018 at 9:42 PM, 胡海  wrote:
> Author: huhai 
> Date:   Fri May 18 17:09:56 2018 +0800
>
> blk-mq: make sure that correct hctx->dispatch_from is set
>
> When the number of hardware queues is changed, the drivers will call
> blk_mq_update_nr_hw_queues() to remap hardware queues, and then
> the ctx mapped on hctx will also change, but the current code forgets to
> make sure that correct hctx->dispatch_from is set, and hctx->dispatch_from
> may point to a ctx that does not belong to the current hctx.
>
> Signed-off-by: huhai 
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 2545081..55d8a3d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2214,6 +2214,8 @@ static void blk_mq_map_swqueue(struct request_queue *q)
> hctx->tags = set->tags[i];
> WARN_ON(!hctx->tags);
>
> +   hctx->dispatch_from = NULL;
> +
> /*
>  * Set the map size to the number of mapped software queues.
>  * This is more accurate and more efficient than looping

Good catch,

Reviewed-by: Ming Lei 

Thanks,
Ming Lei


Re: [PATCH V6 11/11] nvme: pci: support nested EH

2018-05-18 Thread Keith Busch
On Fri, May 18, 2018 at 08:20:05AM +0800, Ming Lei wrote:
> What I think block/011 is helpful is that it can trigger IO timeout
> during reset, which can be triggered in reality too.

As I mentioned earlier, there is nothing wrong with the spirit of
the test. What's wrong with it is the misguided implementation.

Do you understand why it ever passes? The success happens when the
enabling part of the loop happens to coincide with the driver's enabling,
creating the pci_dev->enable_cnt > 1, making subsequent disable parts
of the loop do absolutely nothing; the exact same as the one-liner
(non-serious) patch I sent to defeat the test.

A better way to induce the timeout is:

  # setpci -s <B:D.F> 4.w=0:6

This will halt the device without messing with the kernel structures,
just like how a real device failure would occur.

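For reference, what that write does is clear the Memory Space Enable and Bus
Master Enable bits in the PCI Command register (offset 4, mask 0x6), so the
function stops decoding MMIO and stops issuing DMA/MSI.  A minimal in-kernel
sketch of the same operation, assuming a valid struct pci_dev *pdev for the
NVMe function (illustration only, not part of any patch here):

	#include <linux/pci.h>

	static void halt_pci_function(struct pci_dev *pdev)
	{
		u16 cmd;

		/* Read-modify-write the Command register (offset 0x04). */
		pci_read_config_word(pdev, PCI_COMMAND, &cmd);
		cmd &= ~(PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER);
		pci_write_config_word(pdev, PCI_COMMAND, cmd);
	}

Outstanding IO then times out just as it would on a genuinely dead device.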

blk-mq: make sure that correct hctx->dispatch_from is set

2018-05-18 Thread 胡海
Author: huhai 
Date:   Fri May 18 17:09:56 2018 +0800

blk-mq: make sure that correct hctx->dispatch_from is set

When the number of hardware queues is changed, the drivers will call
blk_mq_update_nr_hw_queues() to remap hardware queues, and then
the ctx mapped on hctx will also change, but the current code forgets to
make sure that correct hctx->dispatch_from is set, and hctx->dispatch_from
may point to a ctx that does not belong to the current hctx.

Signed-off-by: huhai 

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2545081..55d8a3d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2214,6 +2214,8 @@ static void blk_mq_map_swqueue(struct request_queue *q)
hctx->tags = set->tags[i];
WARN_ON(!hctx->tags);
 
+   hctx->dispatch_from = NULL;
+
/*
 * Set the map size to the number of mapped software queues.
 * This is more accurate and more efficient than looping

[PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()

2018-05-18 Thread Roman Pen
Function is going to be used in transport over RDMA module
in subsequent patches.

Function returns next element in round-robin fashion,
i.e. head will be skipped.  NULL will be returned if list
is observed as empty.

Signed-off-by: Roman Pen 
Cc: Paul E. McKenney 
Cc: linux-ker...@vger.kernel.org
---
 include/linux/rculist.h | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 127f534fec94..b0840d5ab25a 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct 
list_head *list,
 })
 
 /**
+ * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
+ * @head:  the head for the list.
+ * @ptr:the list head to take the next element from.
+ * @type:   the type of the struct this is embedded in.
+ * @memb:   the name of the list_head within the struct.
+ *
+ * Next element returned in round-robin fashion, i.e. head will be skipped,
+ * but if list is observed as empty, NULL will be returned.
+ *
+ * This primitive may safely run concurrently with the _rcu list-mutation
+ * primitives such as list_add_rcu() as long as it's guarded by 
rcu_read_lock().
+ */
+#define list_next_or_null_rr_rcu(head, ptr, type, memb) \
+({ \
+   list_next_or_null_rcu(head, ptr, type, memb) ?: \
+   list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, 
memb); \
+})
+
+/**
  * list_for_each_entry_rcu -   iterate over rcu list of given type
  * @pos:   the type * to use as a loop cursor.
  * @head:  the head for your list.
-- 
2.13.1

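A minimal usage sketch for the new helper (hypothetical "path" type and
caller, not taken from this series): the caller keeps a cursor pointing at
the last returned element and asks for the next one under rcu_read_lock(),
with the cursor initialized to the list head:

	#include <linux/rculist.h>
	#include <linux/rcupdate.h>

	struct path {
		struct list_head entry;	/* linked on an RCU-protected list */
		int id;
	};

	/*
	 * Round-robin pick: return the id of the element after *cursor,
	 * skipping the list head; -1 if the list is observed as empty.
	 * Assumes elements are freed only after a grace period and that
	 * the caller resets the cursor when the referenced element goes away.
	 */
	static int next_path_id(struct list_head *head, struct list_head **cursor)
	{
		struct path *p;
		int id = -1;

		rcu_read_lock();
		p = list_next_or_null_rr_rcu(head, *cursor, struct path, entry);
		if (p) {
			id = p->id;
			*cursor = &p->entry;
		}
		rcu_read_unlock();

		return id;
	}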


[PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-05-18 Thread Roman Pen
Hi all,

This is v2 of the series, which introduces the IBNBD/IBTRS modules.

This cover letter is split into three parts:

1. Introduction, which almost repeats everything from previous cover
   letters.
2. Changelog.
3. Performance measurements on linux-4.17.0-rc2 and on two different
   Mellanox cards: ConnectX-2 and ConnectX-3 and CPUs: Intel and AMD.


 Introduction
 -

IBTRS (InfiniBand Transport) is a reliable high speed transport library
which allows for establishing connection between client and server
machines via RDMA. It is optimized to transfer (read/write) IO blocks
in the sense that it follows the BIO semantics of providing the
possibility to either write data from a scatter-gather list to the
remote side or to request ("read") data transfer from the remote side
into a given set of buffers.

IBTRS is multipath capable and provides I/O fail-over and load-balancing
functionality; in IBTRS terminology, an IBTRS path is a set of RDMA
CMs, and a particular path is selected according to the load-balancing policy.

IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
(client and server) that allow for remote access of a block device on
the server over IBTRS protocol. After being mapped, the remote block
devices can be accessed on the client side as local block devices.
Internally IBNBD uses IBTRS as an RDMA transport library.

Why?

   - IBNBD/IBTRS is developed in order to map thin provisioned volumes,
 thus internal protocol is simple.
   - IBTRS was developed as an independent RDMA transport library, which
 supports fail-over and load-balancing policies using multipath, thus
 it can be used for any other IO needs rather than only for block
 device.
   - IBNBD/IBTRS is faster than NVME over RDMA.
 Old comparison results:
 https://www.spinics.net/lists/linux-rdma/msg48799.html
 New comparison results: see performance measurements section below.

Key features of IBTRS transport library and IBNBD block device:

o High throughput and low latency due to:
   - Only two RDMA messages per IO.
   - IMM InfiniBand messages on responses to reduce round trip latency.
   - Simplified memory management: memory allocation happens once on
 server side when IBTRS session is established.

o IO fail-over and load-balancing by using multipath.  According to
  our test loads, an additional path brings ~20% more bandwidth.

o Simple configuration of IBNBD:
   - Server side is completely passive: volumes do not need to be
 explicitly exported.
   - Only IB port GID and device path needed on client side to map
 a block device.
   - A device is remapped automatically i.e. after storage reboot.

Commits for kernel can be found here:
   https://github.com/profitbricks/ibnbd/commits/linux-4.17-rc2

The out-of-tree modules are here:
   https://github.com/profitbricks/ibnbd/

Vault 2017 presentation:
   
http://events.linuxfoundation.org/sites/events/files/slides/IBNBD-Vault-2017.pdf


 Changelog
 -

v2:
  o IBNBD:
 - No legacy request IO mode, only MQ is left.

  o IBTRS:
 - No FMR registration, only FR is left.

 - By default memory is always registered for the sake of the security,
   i.e. by default no pd is created with IB_PD_UNSAFE_GLOBAL_RKEY.

 - Server side (target) always does memory registration and exchanges
   MRs dma addresses with client for direct writes from client side.

 - Client side (initiator) has `noreg_cnt` module option, which 
specifies
   sg number, from which read IO should be registered.  By default 0
   is set, i.e. always register memory for read IOs. (IBTRS protocol
   does not require registration for writes, which always go directly
   to server memory).

 - Proper DMA sync with ib_dma_sync_single_for_(cpu|device) calls.

 - Do signalled IB_WR_LOCAL_INV.

 - Avoid open-coding of string conversion to IPv4/6 sockaddr,
   inet_pton_with_scope() is used instead.

 - Introduced block device namespaces configuration on server side
   (target) to avoid a security gap in untrusted environments, where a
   client could map a block device that does not belong to it.
   When device namespaces are enabled on server side, server opens
   device using client's session name in the device path, where
   session name is a random token, e.g. GUID.  If server is configured
   to find device namespaces in a folder /run/ibnbd-guid/, then
   request to map device 'sda1' from client with session 'A' (or any
   token) will be resolved by path /run/ibnbd-guid/A/sda1.

 - README is extended with description of IBTRS and IBNBD protocol,
   e.g. how IB IMM field is used to acknowledge IO requests or
   heartbeats.

 - IBTRS/IBNBD client and server modules are registered as devices in
   the kernel in order to have all sysfs configuration entries under

[PATCH v2 02/26] sysfs: export sysfs_remove_file_self()

2018-05-18 Thread Roman Pen
Function is going to be used in transport over RDMA module
in subsequent patches.

Signed-off-by: Roman Pen 
Cc: Tejun Heo 
Cc: linux-ker...@vger.kernel.org
---
 fs/sysfs/file.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 5c13f29bfcdb..ff7443ac2aa7 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -444,6 +444,7 @@ bool sysfs_remove_file_self(struct kobject *kobj, const 
struct attribute *attr)
kernfs_put(kn);
return ret;
 }
+EXPORT_SYMBOL_GPL(sysfs_remove_file_self);
 
 void sysfs_remove_files(struct kobject *kobj, const struct attribute **ptr)
 {
-- 
2.13.1

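For context, the caller pattern this export enables in modules (a sketch
with hypothetical object names, not code from this series) is a store()
handler that removes its own sysfs file before tearing down the object, so
the write that triggers the removal does not deadlock on itself:

	#include <linux/kobject.h>
	#include <linux/sysfs.h>

	struct my_obj {
		struct kobject kobj;
	};

	static ssize_t remove_store(struct kobject *kobj,
				    struct kobj_attribute *attr,
				    const char *buf, size_t count)
	{
		struct my_obj *obj = container_of(kobj, struct my_obj, kobj);

		/* Returns false if another writer already removed the file. */
		if (!sysfs_remove_file_self(kobj, &attr->attr))
			return -EAGAIN;

		kobject_put(&obj->kobj);	/* final put frees via the ktype release */
		return count;
	}

	/* Registered elsewhere with sysfs_create_file() against obj->kobj. */
	static struct kobj_attribute remove_attr = __ATTR_WO(remove);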


[PATCH v2 06/26] ibtrs: client: private header with client structs and functions

2018-05-18 Thread Roman Pen
This header describes main structs and functions used by ibtrs-client
module, mainly for managing IBTRS sessions, creating/destroying sysfs
entries, accounting statistics on client side.

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/infiniband/ulp/ibtrs/ibtrs-clt.h | 315 +++
 1 file changed, 315 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-clt.h

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-clt.h 
b/drivers/infiniband/ulp/ibtrs/ibtrs-clt.h
new file mode 100644
index ..0323da91ca01
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-clt.h
@@ -0,0 +1,315 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *  Swapnil Ingle 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBTRS_CLT_H
+#define IBTRS_CLT_H
+
+#include 
+#include "ibtrs-pri.h"
+
+/**
+ * enum ibtrs_clt_state - Client states.
+ */
+enum ibtrs_clt_state {
+   IBTRS_CLT_CONNECTING,
+   IBTRS_CLT_CONNECTING_ERR,
+   IBTRS_CLT_RECONNECTING,
+   IBTRS_CLT_CONNECTED,
+   IBTRS_CLT_CLOSING,
+   IBTRS_CLT_CLOSED,
+   IBTRS_CLT_DEAD,
+};
+
+static inline const char *ibtrs_clt_state_str(enum ibtrs_clt_state state)
+{
+   switch (state) {
+   case IBTRS_CLT_CONNECTING:
+   return "IBTRS_CLT_CONNECTING";
+   case IBTRS_CLT_CONNECTING_ERR:
+   return "IBTRS_CLT_CONNECTING_ERR";
+   case IBTRS_CLT_RECONNECTING:
+   return "IBTRS_CLT_RECONNECTING";
+   case IBTRS_CLT_CONNECTED:
+   return "IBTRS_CLT_CONNECTED";
+   case IBTRS_CLT_CLOSING:
+   return "IBTRS_CLT_CLOSING";
+   case IBTRS_CLT_CLOSED:
+   return "IBTRS_CLT_CLOSED";
+   case IBTRS_CLT_DEAD:
+   return "IBTRS_CLT_DEAD";
+   default:
+   return "UNKNOWN";
+   }
+}
+
+enum ibtrs_mp_policy {
+   MP_POLICY_RR,
+   MP_POLICY_MIN_INFLIGHT,
+};
+
+struct ibtrs_clt_stats_reconnects {
+   int successful_cnt;
+   int fail_cnt;
+};
+
+struct ibtrs_clt_stats_wc_comp {
+   u32 cnt;
+   u64 total_cnt;
+};
+
+struct ibtrs_clt_stats_cpu_migr {
+   atomic_t from;
+   int to;
+};
+
+struct ibtrs_clt_stats_rdma {
+   struct {
+   u64 cnt;
+   u64 size_total;
+   } dir[2];
+
+   u64 failover_cnt;
+};
+
+struct ibtrs_clt_stats_rdma_lat {
+   u64 read;
+   u64 write;
+};
+
+#define MIN_LOG_SG 2
+#define MAX_LOG_SG 5
+#define MAX_LIN_SG BIT(MIN_LOG_SG)
+#define SG_DISTR_SZ (MAX_LOG_SG - MIN_LOG_SG + MAX_LIN_SG + 2)
+
+#define MAX_LOG_LAT 16
+#define MIN_LOG_LAT 0
+#define LOG_LAT_SZ (MAX_LOG_LAT - MIN_LOG_LAT + 2)
+
+struct ibtrs_clt_stats_pcpu {
+   struct ibtrs_clt_stats_cpu_migr cpu_migr;
+   struct ibtrs_clt_stats_rdma rdma;
+   u64 sg_list_total;
+   u64 sg_list_distr[SG_DISTR_SZ];
+   struct ibtrs_clt_stats_rdma_lat rdma_lat_distr[LOG_LAT_SZ];
+   struct ibtrs_clt_stats_rdma_lat rdma_lat_max;
+   struct ibtrs_clt_stats_wc_comp  wc_comp;
+};
+
+struct ibtrs_clt_stats {
+   bool                                enable_rdma_lat;
+   struct ibtrs_clt_stats_pcpu __percpu *pcpu_stats;
+   struct ibtrs_clt_stats_reconnects   reconnects;
+   atomic_t                            inflight;
+};
+
+struct ibtrs_clt_con {
+   struct ibtrs_con    c;
+   unsigned            cpu;
+   atomic_t            io_cnt;
+   int                 cm_err;
+};
+
+/**
+ * ibtrs_tag - tags 

[PATCH v2 05/26] ibtrs: core: lib functions shared between client and server modules

2018-05-18 Thread Roman Pen
This is a set of library functions existing as an ibtrs-core module,
used by the client and server modules.

Mainly these functions wrap IB and RDMA calls and provide a slightly higher
abstraction for implementing the IBTRS protocol on the client or server
side.

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/infiniband/ulp/ibtrs/ibtrs.c | 609 +++
 1 file changed, 609 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs.c

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs.c 
b/drivers/infiniband/ulp/ibtrs/ibtrs.c
new file mode 100644
index ..39a933fe528e
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs.c
@@ -0,0 +1,609 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include 
+#include 
+
+#include "ibtrs-pri.h"
+#include "ibtrs-log.h"
+
+MODULE_AUTHOR("ib...@profitbricks.com");
+MODULE_DESCRIPTION("IBTRS Core");
+MODULE_VERSION(IBTRS_VER_STRING);
+MODULE_LICENSE("GPL");
+
+struct ibtrs_iu *ibtrs_iu_alloc(u32 tag, size_t size, gfp_t gfp_mask,
+   struct ib_device *dma_dev,
+   enum dma_data_direction direction,
+   void (*done)(struct ib_cq *cq,
+struct ib_wc *wc))
+{
+   struct ibtrs_iu *iu;
+
+   iu = kmalloc(sizeof(*iu), gfp_mask);
+   if (unlikely(!iu))
+   return NULL;
+
+   iu->buf = kzalloc(size, gfp_mask);
+   if (unlikely(!iu->buf))
+   goto err1;
+
+   iu->dma_addr = ib_dma_map_single(dma_dev, iu->buf, size, direction);
+   if (unlikely(ib_dma_mapping_error(dma_dev, iu->dma_addr)))
+   goto err2;
+
+   iu->cqe.done  = done;
+   iu->size  = size;
+   iu->direction = direction;
+   iu->tag   = tag;
+
+   return iu;
+
+err2:
+   kfree(iu->buf);
+err1:
+   kfree(iu);
+
+   return NULL;
+}
+EXPORT_SYMBOL_GPL(ibtrs_iu_alloc);
+
+void ibtrs_iu_free(struct ibtrs_iu *iu, enum dma_data_direction dir,
+  struct ib_device *ibdev)
+{
+   if (!iu)
+   return;
+
+   ib_dma_unmap_single(ibdev, iu->dma_addr, iu->size, dir);
+   kfree(iu->buf);
+   kfree(iu);
+}
+EXPORT_SYMBOL_GPL(ibtrs_iu_free);
+
+int ibtrs_iu_post_recv(struct ibtrs_con *con, struct ibtrs_iu *iu)
+{
+   struct ibtrs_sess *sess = con->sess;
+   struct ib_recv_wr wr, *bad_wr;
+   struct ib_sge list;
+
+   list.addr   = iu->dma_addr;
+   list.length = iu->size;
+   list.lkey   = sess->dev->ib_pd->local_dma_lkey;
+
+   if (WARN_ON(list.length == 0)) {
+   ibtrs_wrn(con, "Posting receive work request failed,"
+ " sg list is empty\n");
+   return -EINVAL;
+   }
+
+   wr.next    = NULL;
+   wr.wr_cqe  = &iu->cqe;
+   wr.sg_list = &list;
+   wr.num_sge = 1;
+
+   return ib_post_recv(con->qp, &wr, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_iu_post_recv);
+
+int ibtrs_post_recv_empty(struct ibtrs_con *con, struct ib_cqe *cqe)
+{
+   struct ib_recv_wr wr, *bad_wr;
+
+   wr.next    = NULL;
+   wr.wr_cqe  = cqe;
+   wr.sg_list = NULL;
+   wr.num_sge = 0;
+
+   return ib_post_recv(con->qp, &wr, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_post_recv_empty);
+
+int ibtrs_post_recv_empty_x2(struct ibtrs_con *con, struct ib_cqe *cqe)
+{
+   struct ib_recv_wr wr_arr[2], *wr, *bad_wr;
+   int i;
+
+   memset(wr_arr, 0, sizeof(wr_arr));
+   for (i = 0; i < ARRAY_SIZE(wr_arr); i++) {
+   wr 

[PATCH v2 04/26] ibtrs: private headers with IBTRS protocol structs and helpers

2018-05-18 Thread Roman Pen
These are common private headers with IBTRS protocol structures,
logging, sysfs and other helper functions, which are used on
both client and server sides.

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/infiniband/ulp/ibtrs/ibtrs-log.h |  91 ++
 drivers/infiniband/ulp/ibtrs/ibtrs-pri.h | 459 +++
 2 files changed, 550 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-log.h
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs-pri.h

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-log.h 
b/drivers/infiniband/ulp/ibtrs/ibtrs-log.h
new file mode 100644
index ..f56257eabdee
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-log.h
@@ -0,0 +1,91 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBTRS_LOG_H
+#define IBTRS_LOG_H
+
+#define P1 )
+#define P2 ))
+#define P3 )))
+#define P4 ))))
+#define P(N) P ## N
+
+#define CAT(a, ...) PRIMITIVE_CAT(a, __VA_ARGS__)
+#define PRIMITIVE_CAT(a, ...) a ## __VA_ARGS__
+
+#define LIST(...)  \
+   __VA_ARGS__,\
+   ({ unknown_type(); NULL; }) \
+   CAT(P, COUNT_ARGS(__VA_ARGS__)) \
+
+#define EMPTY()
+#define DEFER(id) id EMPTY()
+
+#define _CASE(obj, type, member)   \
+   __builtin_choose_expr(  \
+   __builtin_types_compatible_p(   \
+   typeof(obj), type), \
+   ((type)obj)->member
+#define CASE(o, t, m) DEFER(_CASE)(o,t,m)
+
+/*
+ * Below we define retrieving of sessname from common IBTRS types.
+ * Client or server related types have to be defined by special
+ * TYPES_TO_SESSNAME macro.
+ */
+
+void unknown_type(void);
+
+#ifndef TYPES_TO_SESSNAME
+#define TYPES_TO_SESSNAME(...) ({ unknown_type(); NULL; })
+#endif
+
+#define ibtrs_prefix(obj)  \
+   _CASE(obj, struct ibtrs_con *,  sess->sessname),\
+   _CASE(obj, struct ibtrs_sess *, sessname),  \
+   TYPES_TO_SESSNAME(obj)  \
+   ))
+
+#define ibtrs_log(fn, obj, fmt, ...)   \
+   fn("<%s>: " fmt, ibtrs_prefix(obj), ##__VA_ARGS__)
+
+#define ibtrs_err(obj, fmt, ...)   \
+   ibtrs_log(pr_err, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_err_rl(obj, fmt, ...)\
+   ibtrs_log(pr_err_ratelimited, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_wrn(obj, fmt, ...)   \
+   ibtrs_log(pr_warn, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_wrn_rl(obj, fmt, ...) \
+   ibtrs_log(pr_warn_ratelimited, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_info(obj, fmt, ...) \
+   ibtrs_log(pr_info, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_info_rl(obj, fmt, ...) \
+   ibtrs_log(pr_info_ratelimited, obj, fmt, ##__VA_ARGS__)
+
+#endif /* IBTRS_LOG_H */
diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-pri.h 
b/drivers/infiniband/ulp/ibtrs/ibtrs-pri.h
new file mode 100644
index ..40647f066840
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-pri.h
@@ -0,0 +1,459 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 

[PATCH v2 03/26] ibtrs: public interface header to establish RDMA connections

2018-05-18 Thread Roman Pen
Introduce public header which provides set of API functions to
establish RDMA connections from client to server machine using
IBTRS protocol, which manages RDMA connections for each session,
does multipathing and load balancing.

Main functions for client (active) side:

 ibtrs_clt_open() - Creates a set of RDMA connections encapsulated
in an IBTRS session and returns a pointer to the IBTRS
session object.
 ibtrs_clt_close() - Closes RDMA connections associated with IBTRS
 session.
 ibtrs_clt_request() - Requests zero-copy RDMA transfer to/from
   server.

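As a small usage sketch (hypothetical handler name, not part of this patch),
a client passes a link-event callback matching the link_clt_ev_fn typedef
below and reacts to reconnect/disconnect notifications:

	static void my_link_ev(void *priv, enum ibtrs_clt_link_ev ev)
	{
		/* priv is whatever was handed to ibtrs_clt_open() */
		switch (ev) {
		case IBTRS_CLT_LINK_EV_RECONNECTED:
			/* resume submitting IO via ibtrs_clt_request() */
			break;
		case IBTRS_CLT_LINK_EV_DISCONNECTED:
			/* quiesce IO until the next RECONNECTED event */
			break;
		}
	}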
Main functions for server (passive) side:

 ibtrs_srv_open() - Starts listening for IBTRS clients on specified
port and invokes IBTRS callbacks for incoming
RDMA requests or link events.
 ibtrs_srv_close() - Closes IBTRS server context.

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/infiniband/ulp/ibtrs/ibtrs.h | 324 +++
 1 file changed, 324 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/ibtrs.h

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs.h 
b/drivers/infiniband/ulp/ibtrs/ibtrs.h
new file mode 100644
index ..08325e39a41e
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs.h
@@ -0,0 +1,324 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBTRS_H
+#define IBTRS_H
+
+#include 
+#include 
+
+struct ibtrs_tag;
+struct ibtrs_clt;
+struct ibtrs_srv_ctx;
+struct ibtrs_srv;
+struct ibtrs_srv_op;
+
+/*
+ * Here goes IBTRS client API
+ */
+
+/**
+ * enum ibtrs_clt_link_ev - Events about connectivity state of a client
+ * @IBTRS_CLT_LINK_EV_RECONNECTED  Client was reconnected.
+ * @IBTRS_CLT_LINK_EV_DISCONNECTED Client was disconnected.
+ */
+enum ibtrs_clt_link_ev {
+   IBTRS_CLT_LINK_EV_RECONNECTED,
+   IBTRS_CLT_LINK_EV_DISCONNECTED,
+};
+
+/**
+ * Source and destination address of a path to be established
+ */
+struct ibtrs_addr {
+   struct sockaddr_storage *src;
+   struct sockaddr_storage *dst;
+};
+
+typedef void (link_clt_ev_fn)(void *priv, enum ibtrs_clt_link_ev ev);
+/**
+ * ibtrs_clt_open() - Open a session to a IBTRS client
+ * @priv:  User supplied private data.
+ * @link_ev:   Event notification for connection state changes
+ * @priv:  user supplied data that was passed to
+ * ibtrs_clt_open()
+ * @ev:Occurred event
+ * @sessname: name of the session
+ * @paths: Paths to be established defined by their src and dst addresses
+ * @path_cnt: Number of elements in the @paths array
+ * @port: port to be used by the IBTRS session
+ * @pdu_sz: Size of extra payload which can be accessed after tag allocation.
+ * @max_inflight_msg: Max. number of parallel inflight messages for the session
+ * @max_segments: Max. number of segments per IO request
+ * @reconnect_delay_sec: time between reconnect tries
+ * @max_reconnect_attempts: Number of times to reconnect on error before giving
+ * up, 0 for disabled, -1 for forever
+ *
+ * Starts session establishment with the ibtrs_server. The function can block
+ * up to ~2000ms until it returns.
+ *
+ * Return a valid pointer on success otherwise PTR_ERR.
+ */
+struct ibtrs_clt *ibtrs_clt_open(void *priv, link_clt_ev_fn *link_ev,
+const char *sessname,
+const struct ibtrs_addr *paths,
+size_t path_cnt, short port,
+

[PATCH v2 23/26] ibnbd: server: sysfs interface functions

2018-05-18 Thread Roman Pen
This is the sysfs interface to IBNBD mapped devices on server side:

  /sys/devices/virtual/ibnbd-server/ctl/devices//
|- block_dev
|  *** link pointing to the corresponding block device sysfs entry
|
|- sessions//
|  *** sessions directory
   |
   |- read_only
   |  *** whether the device is mapped read only
   |
   |- mapping_path
  *** relative device path provided by the client during mapping

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/block/ibnbd/ibnbd-srv-sysfs.c | 242 ++
 1 file changed, 242 insertions(+)
 create mode 100644 drivers/block/ibnbd/ibnbd-srv-sysfs.c

diff --git a/drivers/block/ibnbd/ibnbd-srv-sysfs.c 
b/drivers/block/ibnbd/ibnbd-srv-sysfs.c
new file mode 100644
index ..5bf77cdb09c8
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-srv-sysfs.c
@@ -0,0 +1,242 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "ibnbd-srv.h"
+
+static struct device *ibnbd_dev;
+static struct class *ibnbd_dev_class;
+static struct kobject *ibnbd_devs_kobj;
+
+static struct attribute *ibnbd_srv_default_dev_attrs[] = {
+   NULL,
+};
+
+static struct attribute_group ibnbd_srv_default_dev_attr_group = {
+   .attrs = ibnbd_srv_default_dev_attrs,
+};
+
+static struct kobj_type ktype = {
+   .sysfs_ops  = &kobj_sysfs_ops,
+};
+
+int ibnbd_srv_create_dev_sysfs(struct ibnbd_srv_dev *dev,
+  struct block_device *bdev,
+  const char *dir_name)
+{
+   struct kobject *bdev_kobj;
+   int ret;
+
+   ret = kobject_init_and_add(&dev->dev_kobj, &ktype,
+  ibnbd_devs_kobj, dir_name);
+   if (ret)
+   return ret;
+
+   ret = kobject_init_and_add(&dev->dev_sessions_kobj,
+  &ktype,
+  &dev->dev_kobj, "sessions");
+   if (ret)
+   goto err;
+
+   ret = sysfs_create_group(&dev->dev_kobj,
+&ibnbd_srv_default_dev_attr_group);
+   if (ret)
+   goto err2;
+
+   bdev_kobj = &disk_to_dev(bdev->bd_disk)->kobj;
+   ret = sysfs_create_link(&dev->dev_kobj, bdev_kobj, "block_dev");
+   if (ret)
+   goto err3;
+
+   return 0;
+
+err3:
+   sysfs_remove_group(&dev->dev_kobj,
+  &ibnbd_srv_default_dev_attr_group);
+err2:
+   kobject_del(&dev->dev_sessions_kobj);
+   kobject_put(&dev->dev_sessions_kobj);
+err:
+   kobject_del(&dev->dev_kobj);
+   kobject_put(&dev->dev_kobj);
+   return ret;
+}
+
+void ibnbd_srv_destroy_dev_sysfs(struct ibnbd_srv_dev *dev)
+{
+   sysfs_remove_link(&dev->dev_kobj, "block_dev");
+   sysfs_remove_group(&dev->dev_kobj, &ibnbd_srv_default_dev_attr_group);
+   kobject_del(&dev->dev_sessions_kobj);
+   kobject_put(&dev->dev_sessions_kobj);
+   kobject_del(&dev->dev_kobj);
+   kobject_put(&dev->dev_kobj);
+}
+
+static ssize_t ibnbd_srv_dev_session_ro_show(struct kobject *kobj,
+struct kobj_attribute *attr,
+char *page)
+{
+   struct ibnbd_srv_sess_dev *sess_dev;
+
+   sess_dev = container_of(kobj, struct ibnbd_srv_sess_dev, kobj);
+
+   return scnprintf(page, PAGE_SIZE, "%s\n",
+(sess_dev->open_flags & FMODE_WRITE) ? "0" : "1");
+}
+
+static struct kobj_attribute ibnbd_srv_dev_session_ro_attr =
+   __ATTR(read_only, 0444,
+  

[PATCH v2 20/26] ibnbd: server: private header with server structs and functions

2018-05-18 Thread Roman Pen
This header describes main structs and functions used by ibnbd-server
module, namely structs for managing sessions from different clients
and mapped (opened) devices.

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/block/ibnbd/ibnbd-srv.h | 100 
 1 file changed, 100 insertions(+)
 create mode 100644 drivers/block/ibnbd/ibnbd-srv.h

diff --git a/drivers/block/ibnbd/ibnbd-srv.h b/drivers/block/ibnbd/ibnbd-srv.h
new file mode 100644
index ..191a1650bc1d
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-srv.h
@@ -0,0 +1,100 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef IBNBD_SRV_H
+#define IBNBD_SRV_H
+
+#include 
+#include 
+#include 
+
+#include "ibtrs.h"
+#include "ibnbd-proto.h"
+#include "ibnbd-log.h"
+
+struct ibnbd_srv_session {
+   /* Entry inside global sess_list */
+   struct list_head        list;
+   struct ibtrs_srv        *ibtrs;
+   char                    sessname[NAME_MAX];
+   int                     queue_depth;
+   struct bio_set          *sess_bio_set;
+
+   rwlock_t                index_lock ____cacheline_aligned;
+   struct idr              index_idr;
+   /* List of struct ibnbd_srv_sess_dev */
+   struct list_head        sess_dev_list;
+   struct mutex            lock;
+   u8                      ver;
+};
+
+struct ibnbd_srv_dev {
+   /* Entry inside global dev_list */
+   struct list_head        list;
+   struct kobject          dev_kobj;
+   struct kobject          dev_sessions_kobj;
+   struct kref             kref;
+   char                    id[NAME_MAX];
+   /* List of ibnbd_srv_sess_dev structs */
+   struct list_head        sess_dev_list;
+   struct mutex            lock;
+   int                     open_write_cnt;
+   enum ibnbd_io_mode      mode;
+};
+
+/* Structure which binds N devices and N sessions */
+struct ibnbd_srv_sess_dev {
+   /* Entry inside ibnbd_srv_dev struct */
+   struct list_head            dev_list;
+   /* Entry inside ibnbd_srv_session struct */
+   struct list_head            sess_list;
+   struct ibnbd_dev            *ibnbd_dev;
+   struct ibnbd_srv_session    *sess;
+   struct ibnbd_srv_dev        *dev;
+   struct kobject              kobj;
+   struct completion           *sysfs_release_compl;
+   u32                         device_id;
+   fmode_t                     open_flags;
+   struct kref                 kref;
+   struct completion           *destroy_comp;
+   char                        pathname[NAME_MAX];
+};
+
+/* ibnbd-srv-sysfs.c */
+
+int ibnbd_srv_create_dev_sysfs(struct ibnbd_srv_dev *dev,
+  struct block_device *bdev,
+  const char *dir_name);
+void ibnbd_srv_destroy_dev_sysfs(struct ibnbd_srv_dev *dev);
+int ibnbd_srv_create_dev_session_sysfs(struct ibnbd_srv_sess_dev *sess_dev);
+void ibnbd_srv_destroy_dev_session_sysfs(struct ibnbd_srv_sess_dev *sess_dev);
+int ibnbd_srv_create_sysfs_files(void);
+void ibnbd_srv_destroy_sysfs_files(void);
+
+#endif /* IBNBD_SRV_H */
-- 
2.13.1



[PATCH v2 21/26] ibnbd: server: main functionality

2018-05-18 Thread Roman Pen
This is the main functionality of the ibnbd-server module, which handles
IBTRS events and IBNBD protocol requests, like map (open) or unmap (close)
device.  The server side is also responsible for processing incoming IBTRS
IO requests and forwarding them to the local mapped devices.

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/block/ibnbd/ibnbd-srv.c | 922 
 1 file changed, 922 insertions(+)
 create mode 100644 drivers/block/ibnbd/ibnbd-srv.c

diff --git a/drivers/block/ibnbd/ibnbd-srv.c b/drivers/block/ibnbd/ibnbd-srv.c
new file mode 100644
index ..a42a9191dad9
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-srv.c
@@ -0,0 +1,922 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include 
+#include 
+
+#include "ibnbd-srv.h"
+#include "ibnbd-srv-dev.h"
+
+MODULE_AUTHOR("ib...@profitbricks.com");
+MODULE_VERSION(IBNBD_VER_STRING);
+MODULE_DESCRIPTION("InfiniBand Network Block Device Server");
+MODULE_LICENSE("GPL");
+
+#define DEFAULT_DEV_SEARCH_PATH "/"
+
+static char dev_search_path[PATH_MAX] = DEFAULT_DEV_SEARCH_PATH;
+
+static int dev_search_path_set(const char *val, const struct kernel_param *kp)
+{
+   char *dup;
+
+   if (strlen(val) >= sizeof(dev_search_path))
+   return -EINVAL;
+
+   dup = kstrdup(val, GFP_KERNEL);
+
+   if (dup[strlen(dup) - 1] == '\n')
+   dup[strlen(dup) - 1] = '\0';
+
+   strlcpy(dev_search_path, dup, sizeof(dev_search_path));
+
+   kfree(dup);
+   pr_info("dev_search_path changed to '%s'\n", dev_search_path);
+
+   return 0;
+}
+
+static struct kparam_string dev_search_path_kparam_str = {
+   .maxlen = sizeof(dev_search_path),
+   .string = dev_search_path
+};
+
+static const struct kernel_param_ops dev_search_path_ops = {
+   .set= dev_search_path_set,
+   .get= param_get_string,
+};
+
+module_param_cb(dev_search_path, &dev_search_path_ops,
+   &dev_search_path_kparam_str, 0444);
+MODULE_PARM_DESC(dev_search_path, "Sets the dev_search_path."
+" When a device is mapped this path is prepended to the"
+" device path from the map device operation.  If %SESSNAME%"
+" is specified in a path, then device will be searched in a"
+" session namespace."
+" (default: " DEFAULT_DEV_SEARCH_PATH ")");
+
+static int def_io_mode = IBNBD_BLOCKIO;
+module_param(def_io_mode, int, 0444);
+MODULE_PARM_DESC(def_io_mode, "By default, export devices in"
+" blockio(" __stringify(_IBNBD_BLOCKIO) ") or"
+" fileio(" __stringify(_IBNBD_FILEIO) ") mode."
+" (default: " __stringify(_IBNBD_BLOCKIO) " (blockio))");
+
+static DEFINE_MUTEX(sess_lock);
+static DEFINE_SPINLOCK(dev_lock);
+
+static LIST_HEAD(sess_list);
+static LIST_HEAD(dev_list);
+
+struct ibnbd_io_private {
+   struct ibtrs_srv_op *id;
+   struct ibnbd_srv_sess_dev   *sess_dev;
+};
+
+static void ibnbd_sess_dev_release(struct kref *kref)
+{
+   struct ibnbd_srv_sess_dev *sess_dev;
+
+   sess_dev = container_of(kref, struct ibnbd_srv_sess_dev, kref);
+   complete(sess_dev->destroy_comp);
+}
+
+static inline void ibnbd_put_sess_dev(struct ibnbd_srv_sess_dev *sess_dev)
+{
+   kref_put(&sess_dev->kref, ibnbd_sess_dev_release);
+}
+
+static void ibnbd_endio(void *priv, int error)
+{
+   struct ibnbd_io_private *ibnbd_priv = priv;
+   struct ibnbd_srv_sess_dev *sess_dev = ibnbd_priv->sess_dev;
+
+   

[PATCH v2 26/26] MAINTAINERS: Add maintainer for IBNBD/IBTRS modules

2018-05-18 Thread Roman Pen
Signed-off-by: Roman Pen 
Cc: Danil Kipnis 
Cc: Jack Wang 
---
 MAINTAINERS | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 92be777d060a..e5a001bd0f05 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6786,6 +6786,20 @@ IBM ServeRAID RAID DRIVER
 S: Orphan
 F: drivers/scsi/ips.*
 
+IBNBD BLOCK DRIVERS
+M: IBNBD/IBTRS Storage Team 
+L: linux-block@vger.kernel.org
+S: Maintained
+T: git git://github.com/profitbricks/ibnbd.git
+F: drivers/block/ibnbd/
+
+IBTRS TRANSPORT DRIVERS
+M: IBNBD/IBTRS Storage Team 
+L: linux-r...@vger.kernel.org
+S: Maintained
+T: git git://github.com/profitbricks/ibnbd.git
+F: drivers/infiniband/ulp/ibtrs/
+
 ICH LPC AND GPIO DRIVER
 M: Peter Tyser 
 S: Maintained
-- 
2.13.1



[PATCH v2 24/26] ibnbd: include client and server modules into kernel compilation

2018-05-18 Thread Roman Pen
Add IBNBD Makefile, Kconfig and also corresponding lines into upper
block layer files.

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/block/Kconfig|  2 ++
 drivers/block/Makefile   |  1 +
 drivers/block/ibnbd/Kconfig  | 22 ++
 drivers/block/ibnbd/Makefile | 13 +
 4 files changed, 38 insertions(+)
 create mode 100644 drivers/block/ibnbd/Kconfig
 create mode 100644 drivers/block/ibnbd/Makefile

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index ad9b687a236a..d8c1590411c8 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -481,4 +481,6 @@ config BLK_DEV_RSXX
  To compile this driver as a module, choose M here: the
  module will be called rsxx.
 
+source "drivers/block/ibnbd/Kconfig"
+
 endif # BLK_DEV
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index dc061158b403..65346a1d0b1a 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -38,6 +38,7 @@ obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX)+= mtip32xx/
 obj-$(CONFIG_BLK_DEV_RSXX) += rsxx/
 obj-$(CONFIG_BLK_DEV_NULL_BLK) += null_blk.o
 obj-$(CONFIG_ZRAM) += zram/
+obj-$(CONFIG_BLK_DEV_IBNBD)+= ibnbd/
 
 skd-y  := skd_main.o
 swim_mod-y := swim.o swim_asm.o
diff --git a/drivers/block/ibnbd/Kconfig b/drivers/block/ibnbd/Kconfig
new file mode 100644
index ..b381c6c084d2
--- /dev/null
+++ b/drivers/block/ibnbd/Kconfig
@@ -0,0 +1,22 @@
+config BLK_DEV_IBNBD
+   bool
+
+config BLK_DEV_IBNBD_CLIENT
+   tristate "Network block device driver on top of IBTRS transport"
+   depends on INFINIBAND_IBTRS_CLIENT
+   select BLK_DEV_IBNBD
+   help
+ IBNBD client allows for mapping of remote block devices over
+ IBTRS protocol from a target system where IBNBD server is running.
+
+ If unsure, say N.
+
+config BLK_DEV_IBNBD_SERVER
+   tristate "Network block device over RDMA Infiniband server support"
+   depends on INFINIBAND_IBTRS_SERVER
+   select BLK_DEV_IBNBD
+   help
+ IBNBD server allows for exporting local block devices to a remote 
client
+ over IBTRS protocol.
+
+ If unsure, say N.
diff --git a/drivers/block/ibnbd/Makefile b/drivers/block/ibnbd/Makefile
new file mode 100644
index ..5f20e72e0633
--- /dev/null
+++ b/drivers/block/ibnbd/Makefile
@@ -0,0 +1,13 @@
+ccflags-y := -Idrivers/infiniband/ulp/ibtrs
+
+ibnbd-client-y := ibnbd-clt.o \
+ ibnbd-clt-sysfs.o
+
+ibnbd-server-y := ibnbd-srv.o \
+ ibnbd-srv-dev.o \
+ ibnbd-srv-sysfs.o
+
+obj-$(CONFIG_BLK_DEV_IBNBD_CLIENT) += ibnbd-client.o
+obj-$(CONFIG_BLK_DEV_IBNBD_SERVER) += ibnbd-server.o
+
+-include $(src)/compat/compat.mk
-- 
2.13.1


