date:20170425

Re: Status of pwritev2/preadv2 in glibc?

2017-04-25 Thread Adhemerval Zanella

On 25/04/2017 18:23, Stephen  Bates wrote:
> 
>> I am working on it btw [1]
> 
> Thanks for working on this Adhemerval!
> 
>> PS: resending with cc to all.
> 
> Can you cc linux-block when you submit the patchset?

Sure, I am just finishing the sanity tests on the supported architectures.

Re: [resend PATCH v2 00/33] dax: introduce dax_operations

2017-04-25 Thread Dan Williams

On Fri, Apr 21, 2017 at 6:06 PM, Dan Williams  wrote:
> [ adding akpm, sfr, and jens ]
>
> I applied this series and pushed it out for the nvdimm.git branch that
> gets auto pulled into -next. The set is still awaiting acks from
> device-mapper, ext4, xfs, and vfs (for the copy_from_iter_ops, patch
> 29/33). If those come next week perhaps this can be merged for 4.12,
> but if not this will need to wait until 4.13.
>
> There are some minor collisions with Al's copy_from_user rework, the
> new dax tracepoints, and the removal of discard support from the brd
> driver. A sample merge is available here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git/log/?h=libnvdimm-for-4.12-merge
>
> If it causes any other problems just drop and I'll retry for 4.13.

Al has nak'd the uaccess related changes, and I'll need to rework
those patches to move the pmem routines into lib/iov_iter.c directly.
That doesn't affect the dax_device and dax_operations work, so I'm
still looking to move forward with that change.  That reduces the set
targeting 4.12 to just the first 18 patches from this series:

Dan Williams (18):
  device-dax: rename 'dax_dev' to 'dev_dax'
  dax: refactor dax-fs into a generic provider of 'struct
dax_device' instances
  dax: add a facility to lookup a dax device by 'host' device name
  dax: introduce dax_operations
  pmem: add dax_operations support
  axon_ram: add dax_operations support
  brd: add dax_operations support
  dcssblk: add dax_operations support
  block: kill bdev_dax_capable()
  dax: introduce dax_direct_access()
  dm: add dax_device and dax_operations support
  dm: teach dm-targets to use a dax_device + dax_operations
  ext2, ext4, xfs: retrieve dax_device for iomap operations
  Revert "block: use DAX for partition table reads"
  filesystem-dax: convert to dax_direct_access()
  block, dax: convert bdev_dax_supported() to dax_direct_access()
  block: remove block_device_operations ->direct_access()
  x86, dax, pmem: remove indirection around memcpy_from_pmem()

Re: [PATCH v5 05/10] blk-mq: Unregister debugfs attributes earlier

2017-04-25 Thread Omar Sandoval

On Tue, Apr 25, 2017 at 10:24:48PM +, Bart Van Assche wrote:
> On Tue, 2017-04-25 at 14:30 -0700, Omar Sandoval wrote:
> > On Tue, Apr 25, 2017 at 01:37:40PM -0700, Bart Van Assche wrote:
> > > One of the debugfs attributes allows to run a queue. Since running
> > > a queue after a queue has entered the "dead" state is not allowed
> > > and triggers a use-after-free, unregister the debugfs attributes
> > > before a queue reaches the "dead" state.
> > 
> > Still not happy with this commit message. I'd prefer:
> > 
> > We currently call blk_mq_free_queue() from blk_cleanup_queue() before we
> > unregister the debugfs attributes for that queue in blk_release_queue().
> > This leaves a window open during which accessing most of the mq debugfs
> > attributes would cause a use-after-free. Additionally, the "state"
> > attribute allows running the queue, which we should not do after the
> > queue has entered the "dead" state. Fix both of these cases by
> > unregistering the debugfs attributes before this.
> 
> Hello Omar,
> 
> That's a very verbose description. How about this?
> 
>     Unregister the debugfs attributes before freeing of request queue
> resources starts to avoid that a use-after-free can be triggered
> through one of the debugfs attributes.
> 
> Bart.

Are you aware that there is nothing wrong with a descriptive commit
message?

Re: [PATCH v5 05/10] blk-mq: Unregister debugfs attributes earlier

2017-04-25 Thread Jens Axboe

On 04/25/2017 03:24 PM, Bart Van Assche wrote:
> On Tue, 2017-04-25 at 14:30 -0700, Omar Sandoval wrote:
>> On Tue, Apr 25, 2017 at 01:37:40PM -0700, Bart Van Assche wrote:
>>> One of the debugfs attributes allows to run a queue. Since running
>>> a queue after a queue has entered the "dead" state is not allowed
>>> and triggers a use-after-free, unregister the debugfs attributes
>>> before a queue reaches the "dead" state.
>>
>> Still not happy with this commit message. I'd prefer:
>>
>> We currently call blk_mq_free_queue() from blk_cleanup_queue() before we
>> unregister the debugfs attributes for that queue in blk_release_queue().
>> This leaves a window open during which accessing most of the mq debugfs
>> attributes would cause a use-after-free. Additionally, the "state"
>> attribute allows running the queue, which we should not do after the
>> queue has entered the "dead" state. Fix both of these cases by
>> unregistering the debugfs attributes before this.
> 
> Hello Omar,
> 
> That's a very verbose description. How about this?
> 
> Unregister the debugfs attributes before freeing of request queue
> resources starts to avoid that a use-after-free can be triggered
> through one of the debugfs attributes.

Personally I find Omar's commit message much cleaner to read, and
more easily understandable. We really don't need to be laconic in
commit messages.

-- 
Jens Axboe

Re: [PATCH v5 05/10] blk-mq: Unregister debugfs attributes earlier

2017-04-25 Thread Bart Van Assche

On Tue, 2017-04-25 at 14:30 -0700, Omar Sandoval wrote:
> On Tue, Apr 25, 2017 at 01:37:40PM -0700, Bart Van Assche wrote:
> > One of the debugfs attributes allows to run a queue. Since running
> > a queue after a queue has entered the "dead" state is not allowed
> > and triggers a use-after-free, unregister the debugfs attributes
> > before a queue reaches the "dead" state.
> 
> Still not happy with this commit message. I'd prefer:
> 
> We currently call blk_mq_free_queue() from blk_cleanup_queue() before we
> unregister the debugfs attributes for that queue in blk_release_queue().
> This leaves a window open during which accessing most of the mq debugfs
> attributes would cause a use-after-free. Additionally, the "state"
> attribute allows running the queue, which we should not do after the
> queue has entered the "dead" state. Fix both of these cases by
> unregistering the debugfs attributes before this.

Hello Omar,

That's a very verbose description. How about this?

    Unregister the debugfs attributes before freeing of request queue
resources starts to avoid that a use-after-free can be triggered
through one of the debugfs attributes.

Bart.

Re: [PATCH v5 10/10] scsi: Implement blk_mq_ops.show_rq()

2017-04-25 Thread Bart Van Assche

On Tue, 2017-04-25 at 14:39 -0700, Omar Sandoval wrote:
> On Tue, Apr 25, 2017 at 01:37:45PM -0700, Bart Van Assche wrote:
> > Show the SCSI CDB, .eh_eflags and .result for pending SCSI commands
> > in /sys/kernel/debug/block/*/mq/*/dispatch and */rq_list.
> 
> Only thing I noticed was that the only other caller I see has buf[70].
> No idea if that's a meaningful number. For the sake of this not getting
> bike-shedded to death,

Neither length is sufficient to avoid truncation of e.g. ATA pass-through
commands or commands with variable length CDBs. However, from the point of
view of debugging queue lockups the most useful information in a SCSI
command are the first two bytes of the CDB. The chosen buffer length is
definitely enough to make sure that these two bytes will be reported.

Bart.

Re: [PATCH v5 05/10] blk-mq: Unregister debugfs attributes earlier

2017-04-25 Thread Jens Axboe

On 04/25/2017 02:30 PM, Omar Sandoval wrote:
> On Tue, Apr 25, 2017 at 01:37:40PM -0700, Bart Van Assche wrote:
>> One of the debugfs attributes allows to run a queue. Since running
>> a queue after a queue has entered the "dead" state is not allowed
>> and triggers a use-after-free, unregister the debugfs attributes
>> before a queue reaches the "dead" state.
> 
> Still not happy with this commit message. I'd prefer:
> 
> We currently call blk_mq_free_queue() from blk_cleanup_queue() before we
> unregister the debugfs attributes for that queue in blk_release_queue().
> This leaves a window open during which accessing most of the mq debugfs
> attributes would cause a use-after-free. Additionally, the "state"
> attribute allows running the queue, which we should not do after the
> queue has entered the "dead" state. Fix both of these cases by
> unregistering the debugfs attributes before this.
> 
> Jens, could you ack that dropping the lock is okay?

Looks fine to me. However, I think there's room for improvement here.
Why don't we just make it:

if (!q->mq_ops) {
spin_lock_irq(lock);
__blk_drain_queue(q, true);
} else {
blk_mq_debugfs_unregister_mq(q);
spin_lock_irq(lock);
}

queue_flag_set(QUEUE_FLAG_DEAD, q);
[...]

Would seem much more readable to me, and less dropping/acquiring for
cases where we don't need it.

-- 
Jens Axboe

Re: [PATCH v5 10/10] scsi: Implement blk_mq_ops.show_rq()

2017-04-25 Thread Omar Sandoval

On Tue, Apr 25, 2017 at 01:37:45PM -0700, Bart Van Assche wrote:
> Show the SCSI CDB, .eh_eflags and .result for pending SCSI commands
> in /sys/kernel/debug/block/*/mq/*/dispatch and */rq_list.

Only thing I noticed was that the only other caller I see has buf[70].
No idea if that's a meaningful number. For the sake of this not getting
bike-shedded to death,

Reviewed-by: Omar Sandoval 

> Signed-off-by: Bart Van Assche 
> Cc: Martin K. Petersen 
> Cc: James Bottomley 
> Cc: Omar Sandoval 
> Cc: Hannes Reinecke 
> Cc: 
> ---
>  drivers/scsi/Makefile   |  1 +
>  drivers/scsi/scsi_debugfs.c | 13 +
>  drivers/scsi/scsi_debugfs.h |  4 
>  drivers/scsi/scsi_lib.c |  4 
>  4 files changed, 22 insertions(+)
>  create mode 100644 drivers/scsi/scsi_debugfs.c
>  create mode 100644 drivers/scsi/scsi_debugfs.h
> 
> diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
> index fc2855565a51..93dbe58c47c8 100644
> --- a/drivers/scsi/Makefile
> +++ b/drivers/scsi/Makefile
> @@ -166,6 +166,7 @@ scsi_mod-y+= scsi_scan.o 
> scsi_sysfs.o scsi_devinfo.o
>  scsi_mod-$(CONFIG_SCSI_NETLINK)  += scsi_netlink.o
>  scsi_mod-$(CONFIG_SYSCTL)+= scsi_sysctl.o
>  scsi_mod-$(CONFIG_SCSI_PROC_FS)  += scsi_proc.o
> +scsi_mod-$(CONFIG_BLK_DEBUG_FS)  += scsi_debugfs.o
>  scsi_mod-y   += scsi_trace.o scsi_logging.o
>  scsi_mod-$(CONFIG_PM)+= scsi_pm.o
>  scsi_mod-$(CONFIG_SCSI_DH)   += scsi_dh.o
> diff --git a/drivers/scsi/scsi_debugfs.c b/drivers/scsi/scsi_debugfs.c
> new file mode 100644
> index ..f831c23fdee3
> --- /dev/null
> +++ b/drivers/scsi/scsi_debugfs.c
> @@ -0,0 +1,13 @@
> +#include 
> +#include 
> +#include 
> +#include "scsi_debugfs.h"
> +
> +void scsi_show_rq(struct seq_file *m, struct request *rq)
> +{
> + struct scsi_cmnd *cmd = container_of(scsi_req(rq), typeof(*cmd), req);
> + char buf[64];
> +
> + __scsi_format_command(buf, sizeof(buf), cmd->cmnd, cmd->cmd_len);
> + seq_printf(m, ", .cmd=%s", buf);
> +}
> diff --git a/drivers/scsi/scsi_debugfs.h b/drivers/scsi/scsi_debugfs.h
> new file mode 100644
> index ..951b043e82d0
> --- /dev/null
> +++ b/drivers/scsi/scsi_debugfs.h
> @@ -0,0 +1,4 @@
> +struct request;
> +struct seq_file;
> +
> +void scsi_show_rq(struct seq_file *m, struct request *rq);
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index abc391e00f7d..1c3e87d6c48f 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -34,6 +34,7 @@
>  
>  #include 
>  
> +#include "scsi_debugfs.h"
>  #include "scsi_priv.h"
>  #include "scsi_logging.h"
>  
> @@ -2157,6 +2158,9 @@ static const struct blk_mq_ops scsi_mq_ops = {
>   .queue_rq   = scsi_queue_rq,
>   .complete   = scsi_softirq_done,
>   .timeout= scsi_timeout,
> +#ifdef CONFIG_BLK_DEBUG_FS
> + .show_rq= scsi_show_rq,
> +#endif
>   .init_request   = scsi_init_request,
>   .exit_request   = scsi_exit_request,
>   .map_queues = scsi_map_queues,
> -- 
> 2.12.2
>

Re: [PATCH v5 09/10] blk-mq: Add blk_mq_ops.show_rq()

2017-04-25 Thread Omar Sandoval

On Tue, Apr 25, 2017 at 01:37:44PM -0700, Bart Van Assche wrote:
> This new callback function will be used in the next patch to show
> more information about SCSI requests.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Bart Van Assche 
> Cc: Omar Sandoval 
> Cc: Hannes Reinecke 
> ---
>  block/blk-mq-debugfs.c | 6 +-
>  include/linux/blk-mq.h | 8 
>  2 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index ac39093c4ef7..bcd2a7d4a3a5 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -311,6 +311,7 @@ static const char *const rqf_name[] = {
>  static int blk_mq_debugfs_rq_show(struct seq_file *m, void *v)
>  {
>   struct request *rq = list_entry_rq(v);
> + const struct blk_mq_ops *const mq_ops = rq->q->mq_ops;
>   const unsigned int op = rq->cmd_flags & REQ_OP_MASK;
>  
>   seq_printf(m, "%p {.op=", rq);
> @@ -324,8 +325,11 @@ static int blk_mq_debugfs_rq_show(struct seq_file *m, 
> void *v)
>   seq_puts(m, ", .rq_flags=");
>   blk_flags_show(m, (__force unsigned int)rq->rq_flags, rqf_name,
>  ARRAY_SIZE(rqf_name));
> - seq_printf(m, ", .tag=%d, .internal_tag=%d}\n", rq->tag,
> + seq_printf(m, ", .tag=%d, .internal_tag=%d", rq->tag,
>  rq->internal_tag);
> + if (mq_ops->show_rq)
> + mq_ops->show_rq(m, rq);
> + seq_puts(m, "}\n");
>   return 0;
>  }
>  
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 0c4dadb85f62..32bd8eb5ba67 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -142,6 +142,14 @@ struct blk_mq_ops {
>   reinit_request_fn   *reinit_request;
>  
>   map_queues_fn   *map_queues;
> +
> +#ifdef CONFIG_BLK_DEBUG_FS
> + /*
> +  * Used by the debugfs implementation to show driver-specific
> +  * information about a request.
> +  */
> + void (*show_rq)(struct seq_file *m, struct request *rq);
> +#endif
>  };
>  
>  enum {
> -- 
> 2.12.2
>

Re: [PATCH v5 05/10] blk-mq: Unregister debugfs attributes earlier

2017-04-25 Thread Omar Sandoval

On Tue, Apr 25, 2017 at 01:37:40PM -0700, Bart Van Assche wrote:
> One of the debugfs attributes allows to run a queue. Since running
> a queue after a queue has entered the "dead" state is not allowed
> and triggers a use-after-free, unregister the debugfs attributes
> before a queue reaches the "dead" state.

Still not happy with this commit message. I'd prefer:

We currently call blk_mq_free_queue() from blk_cleanup_queue() before we
unregister the debugfs attributes for that queue in blk_release_queue().
This leaves a window open during which accessing most of the mq debugfs
attributes would cause a use-after-free. Additionally, the "state"
attribute allows running the queue, which we should not do after the
queue has entered the "dead" state. Fix both of these cases by
unregistering the debugfs attributes before this.

Jens, could you ack that dropping the lock is okay?

> Signed-off-by: Bart Van Assche 
> Reviewed-by: Hannes Reinecke 
> Reviewed-by: Omar Sandoval 
> ---
>  block/blk-core.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index a49b0830aaaf..33c91a4bee97 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -566,6 +566,11 @@ void blk_cleanup_queue(struct request_queue *q)
>   spin_lock_irq(lock);
>   if (!q->mq_ops)
>   __blk_drain_queue(q, true);
> + spin_unlock_irq(lock);
> +
> + blk_mq_debugfs_unregister_mq(q);
> +
> + spin_lock_irq(lock);
>   queue_flag_set(QUEUE_FLAG_DEAD, q);
>   spin_unlock_irq(lock);
>  
> -- 
> 2.12.2
>

Re: Status of pwritev2/preadv2 in glibc?

2017-04-25 Thread Stephen Bates


> I am working on it btw [1]

Thanks for working on this Adhemerval!

> PS: resending with cc to all.

Can you cc linux-block when you submit the patchset?

Stephen

Re: [PATCH v5 04/10] blk-mq: Only unregister hctxs for which registration succeeded

2017-04-25 Thread Omar Sandoval

On Tue, Apr 25, 2017 at 01:37:39PM -0700, Bart Van Assche wrote:
> Hctx unregistration involves calling kobject_del(). kobject_del()
> must not be called if kobject_add() has not been called. Hence in
> the error path only unregister hctxs for which registration succeeded.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Omar Sandoval 
> Cc: Hannes Reinecke 
> ---
>  block/blk-mq-sysfs.c | 18 +-
>  1 file changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index c2cac20d981d..053549a32778 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -323,16 +323,24 @@ int __blk_mq_register_dev(struct device *dev, struct 
> request_queue *q)
>   queue_for_each_hw_ctx(q, hctx, i) {
>   ret = blk_mq_register_hctx(hctx);
>   if (ret)
> - break;
> + goto unreg;
>   }
>  
> - if (ret)
> - __blk_mq_unregister_dev(dev, q);
> - else
> - q->mq_sysfs_init_done = true;
> + q->mq_sysfs_init_done = true;
>  
>  out:
>   return ret;
> +
> +unreg:
> + while (--i >= 0)
> + blk_mq_unregister_hctx(hctx);

Missed this in your last submission, won't this unregister the same hctx
repeatedly?

> + blk_mq_debugfs_unregister_mq(q);
> +
> + kobject_uevent(>mq_kobj, KOBJ_REMOVE);
> + kobject_del(>mq_kobj);
> + kobject_put(>kobj);
> + goto out;

The out label doesn't do anything interesting, can we just return ret?

>  }
>  
>  int blk_mq_register_dev(struct device *dev, struct request_queue *q)
> -- 
> 2.12.2
>

[PATCH v5 02/10] blk-mq: Let blk_mq_debugfs_register() look up the queue name

2017-04-25 Thread Bart Van Assche

A later patch will move the call of blk_mq_debugfs_register() to
a function to which the queue name is not passed as an argument.
To avoid having to add a 'name' argument to multiple callers, let
blk_mq_debugfs_register() look up the queue name.

Signed-off-by: Bart Van Assche 
Reviewed-by: Omar Sandoval 
Reviewed-by: Hannes Reinecke 
---
 block/blk-mq-debugfs.c | 5 +++--
 block/blk-mq-sysfs.c   | 2 +-
 block/blk-mq.h | 5 ++---
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 3057641d5d15..e9282b945f6b 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -785,12 +785,13 @@ static const struct blk_mq_debugfs_attr 
blk_mq_debugfs_ctx_attrs[] = {
{},
 };
 
-int blk_mq_debugfs_register(struct request_queue *q, const char *name)
+int blk_mq_debugfs_register(struct request_queue *q)
 {
if (!blk_debugfs_root)
return -ENOENT;
 
-   q->debugfs_dir = debugfs_create_dir(name, blk_debugfs_root);
+   q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
+   blk_debugfs_root);
if (!q->debugfs_dir)
goto err;
 
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index a2dbb1a48e72..afb3451cf8a5 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -318,7 +318,7 @@ int __blk_mq_register_dev(struct device *dev, struct 
request_queue *q)
 
kobject_uevent(>mq_kobj, KOBJ_ADD);
 
-   blk_mq_debugfs_register(q, kobject_name(>kobj));
+   blk_mq_debugfs_register(q);
 
queue_for_each_hw_ctx(q, hctx, i) {
ret = blk_mq_register_hctx(hctx);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 7d955c756810..9049c0f11505 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -87,13 +87,12 @@ extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx 
*hctx);
  * debugfs helpers
  */
 #ifdef CONFIG_BLK_DEBUG_FS
-int blk_mq_debugfs_register(struct request_queue *q, const char *name);
+int blk_mq_debugfs_register(struct request_queue *q);
 void blk_mq_debugfs_unregister(struct request_queue *q);
 int blk_mq_debugfs_register_hctxs(struct request_queue *q);
 void blk_mq_debugfs_unregister_hctxs(struct request_queue *q);
 #else
-static inline int blk_mq_debugfs_register(struct request_queue *q,
- const char *name)
+static inline int blk_mq_debugfs_register(struct request_queue *q)
 {
return 0;
 }
-- 
2.12.2

[PATCH v5 10/10] scsi: Implement blk_mq_ops.show_rq()

2017-04-25 Thread Bart Van Assche

Show the SCSI CDB, .eh_eflags and .result for pending SCSI commands
in /sys/kernel/debug/block/*/mq/*/dispatch and */rq_list.

Signed-off-by: Bart Van Assche 
Cc: Martin K. Petersen 
Cc: James Bottomley 
Cc: Omar Sandoval 
Cc: Hannes Reinecke 
Cc: 
---
 drivers/scsi/Makefile   |  1 +
 drivers/scsi/scsi_debugfs.c | 13 +
 drivers/scsi/scsi_debugfs.h |  4 
 drivers/scsi/scsi_lib.c |  4 
 4 files changed, 22 insertions(+)
 create mode 100644 drivers/scsi/scsi_debugfs.c
 create mode 100644 drivers/scsi/scsi_debugfs.h

diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
index fc2855565a51..93dbe58c47c8 100644
--- a/drivers/scsi/Makefile
+++ b/drivers/scsi/Makefile
@@ -166,6 +166,7 @@ scsi_mod-y  += scsi_scan.o scsi_sysfs.o 
scsi_devinfo.o
 scsi_mod-$(CONFIG_SCSI_NETLINK)+= scsi_netlink.o
 scsi_mod-$(CONFIG_SYSCTL)  += scsi_sysctl.o
 scsi_mod-$(CONFIG_SCSI_PROC_FS)+= scsi_proc.o
+scsi_mod-$(CONFIG_BLK_DEBUG_FS)+= scsi_debugfs.o
 scsi_mod-y += scsi_trace.o scsi_logging.o
 scsi_mod-$(CONFIG_PM)  += scsi_pm.o
 scsi_mod-$(CONFIG_SCSI_DH) += scsi_dh.o
diff --git a/drivers/scsi/scsi_debugfs.c b/drivers/scsi/scsi_debugfs.c
new file mode 100644
index ..f831c23fdee3
--- /dev/null
+++ b/drivers/scsi/scsi_debugfs.c
@@ -0,0 +1,13 @@
+#include 
+#include 
+#include 
+#include "scsi_debugfs.h"
+
+void scsi_show_rq(struct seq_file *m, struct request *rq)
+{
+   struct scsi_cmnd *cmd = container_of(scsi_req(rq), typeof(*cmd), req);
+   char buf[64];
+
+   __scsi_format_command(buf, sizeof(buf), cmd->cmnd, cmd->cmd_len);
+   seq_printf(m, ", .cmd=%s", buf);
+}
diff --git a/drivers/scsi/scsi_debugfs.h b/drivers/scsi/scsi_debugfs.h
new file mode 100644
index ..951b043e82d0
--- /dev/null
+++ b/drivers/scsi/scsi_debugfs.h
@@ -0,0 +1,4 @@
+struct request;
+struct seq_file;
+
+void scsi_show_rq(struct seq_file *m, struct request *rq);
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index abc391e00f7d..1c3e87d6c48f 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -34,6 +34,7 @@
 
 #include 
 
+#include "scsi_debugfs.h"
 #include "scsi_priv.h"
 #include "scsi_logging.h"
 
@@ -2157,6 +2158,9 @@ static const struct blk_mq_ops scsi_mq_ops = {
.queue_rq   = scsi_queue_rq,
.complete   = scsi_softirq_done,
.timeout= scsi_timeout,
+#ifdef CONFIG_BLK_DEBUG_FS
+   .show_rq= scsi_show_rq,
+#endif
.init_request   = scsi_init_request,
.exit_request   = scsi_exit_request,
.map_queues = scsi_map_queues,
-- 
2.12.2

[PATCH v5 09/10] blk-mq: Add blk_mq_ops.show_rq()

2017-04-25 Thread Bart Van Assche

This new callback function will be used in the next patch to show
more information about SCSI requests.

Signed-off-by: Bart Van Assche 
Cc: Omar Sandoval 
Cc: Hannes Reinecke 
---
 block/blk-mq-debugfs.c | 6 +-
 include/linux/blk-mq.h | 8 
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index ac39093c4ef7..bcd2a7d4a3a5 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -311,6 +311,7 @@ static const char *const rqf_name[] = {
 static int blk_mq_debugfs_rq_show(struct seq_file *m, void *v)
 {
struct request *rq = list_entry_rq(v);
+   const struct blk_mq_ops *const mq_ops = rq->q->mq_ops;
const unsigned int op = rq->cmd_flags & REQ_OP_MASK;
 
seq_printf(m, "%p {.op=", rq);
@@ -324,8 +325,11 @@ static int blk_mq_debugfs_rq_show(struct seq_file *m, void 
*v)
seq_puts(m, ", .rq_flags=");
blk_flags_show(m, (__force unsigned int)rq->rq_flags, rqf_name,
   ARRAY_SIZE(rqf_name));
-   seq_printf(m, ", .tag=%d, .internal_tag=%d}\n", rq->tag,
+   seq_printf(m, ", .tag=%d, .internal_tag=%d", rq->tag,
   rq->internal_tag);
+   if (mq_ops->show_rq)
+   mq_ops->show_rq(m, rq);
+   seq_puts(m, "}\n");
return 0;
 }
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 0c4dadb85f62..32bd8eb5ba67 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -142,6 +142,14 @@ struct blk_mq_ops {
reinit_request_fn   *reinit_request;
 
map_queues_fn   *map_queues;
+
+#ifdef CONFIG_BLK_DEBUG_FS
+   /*
+* Used by the debugfs implementation to show driver-specific
+* information about a request.
+*/
+   void (*show_rq)(struct seq_file *m, struct request *rq);
+#endif
 };
 
 enum {
-- 
2.12.2

[PATCH v5 01/10] blk-mq: Register /queue/mq after having registered /queue

2017-04-25 Thread Bart Van Assche

A later patch in this series will modify blk_mq_debugfs_register()
such that it uses q->kobj.parent to determine the name of a
request queue. Hence make sure that that pointer is initialized
before blk_mq_debugfs_register() is called. To avoid lock inversion,
protect sysfs / debugfs registration with the queue sysfs_lock
instead of the global mutex all_q_mutex.

Signed-off-by: Bart Van Assche 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Omar Sandoval 
---
 block/blk-mq-sysfs.c | 35 ---
 block/blk-mq.h   |  1 +
 block/blk-sysfs.c|  6 +++---
 3 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index d745ab81033a..a2dbb1a48e72 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -253,6 +253,8 @@ static void __blk_mq_unregister_dev(struct device *dev, 
struct request_queue *q)
struct blk_mq_hw_ctx *hctx;
int i;
 
+   lockdep_assert_held(>sysfs_lock);
+
queue_for_each_hw_ctx(q, hctx, i)
blk_mq_unregister_hctx(hctx);
 
@@ -267,9 +269,9 @@ static void __blk_mq_unregister_dev(struct device *dev, 
struct request_queue *q)
 
 void blk_mq_unregister_dev(struct device *dev, struct request_queue *q)
 {
-   blk_mq_disable_hotplug();
+   mutex_lock(>sysfs_lock);
__blk_mq_unregister_dev(dev, q);
-   blk_mq_enable_hotplug();
+   mutex_unlock(>sysfs_lock);
 }
 
 void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx)
@@ -302,12 +304,13 @@ void blk_mq_sysfs_init(struct request_queue *q)
}
 }
 
-int blk_mq_register_dev(struct device *dev, struct request_queue *q)
+int __blk_mq_register_dev(struct device *dev, struct request_queue *q)
 {
struct blk_mq_hw_ctx *hctx;
int ret, i;
 
-   blk_mq_disable_hotplug();
+   WARN_ON_ONCE(!q->kobj.parent);
+   lockdep_assert_held(>sysfs_lock);
 
ret = kobject_add(>mq_kobj, kobject_get(>kobj), "%s", "mq");
if (ret < 0)
@@ -327,8 +330,18 @@ int blk_mq_register_dev(struct device *dev, struct 
request_queue *q)
__blk_mq_unregister_dev(dev, q);
else
q->mq_sysfs_init_done = true;
+
 out:
-   blk_mq_enable_hotplug();
+   return ret;
+}
+
+int blk_mq_register_dev(struct device *dev, struct request_queue *q)
+{
+   int ret;
+
+   mutex_lock(>sysfs_lock);
+   ret = __blk_mq_register_dev(dev, q);
+   mutex_unlock(>sysfs_lock);
 
return ret;
 }
@@ -339,13 +352,17 @@ void blk_mq_sysfs_unregister(struct request_queue *q)
struct blk_mq_hw_ctx *hctx;
int i;
 
+   mutex_lock(>sysfs_lock);
if (!q->mq_sysfs_init_done)
-   return;
+   goto unlock;
 
blk_mq_debugfs_unregister_hctxs(q);
 
queue_for_each_hw_ctx(q, hctx, i)
blk_mq_unregister_hctx(hctx);
+
+unlock:
+   mutex_unlock(>sysfs_lock);
 }
 
 int blk_mq_sysfs_register(struct request_queue *q)
@@ -353,8 +370,9 @@ int blk_mq_sysfs_register(struct request_queue *q)
struct blk_mq_hw_ctx *hctx;
int i, ret = 0;
 
+   mutex_lock(>sysfs_lock);
if (!q->mq_sysfs_init_done)
-   return ret;
+   goto unlock;
 
blk_mq_debugfs_register_hctxs(q);
 
@@ -364,5 +382,8 @@ int blk_mq_sysfs_register(struct request_queue *q)
break;
}
 
+unlock:
+   mutex_unlock(>sysfs_lock);
+
return ret;
 }
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 524f44742816..7d955c756810 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -78,6 +78,7 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct 
request_queue *q,
  */
 extern void blk_mq_sysfs_init(struct request_queue *q);
 extern void blk_mq_sysfs_deinit(struct request_queue *q);
+extern int __blk_mq_register_dev(struct device *dev, struct request_queue *q);
 extern int blk_mq_sysfs_register(struct request_queue *q);
 extern void blk_mq_sysfs_unregister(struct request_queue *q);
 extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f85723332288..3f37813ccbaf 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -877,9 +877,6 @@ int blk_register_queue(struct gendisk *disk)
if (ret)
return ret;
 
-   if (q->mq_ops)
-   blk_mq_register_dev(dev, q);
-
/* Prevent changes through sysfs until registration is completed. */
mutex_lock(>sysfs_lock);
 
@@ -889,6 +886,9 @@ int blk_register_queue(struct gendisk *disk)
goto unlock;
}
 
+   if (q->mq_ops)
+   __blk_mq_register_dev(dev, q);
+
kobject_uevent(>kobj, KOBJ_ADD);
 
wbt_enable_default(q);
-- 
2.12.2

[PATCH v5 05/10] blk-mq: Unregister debugfs attributes earlier

2017-04-25 Thread Bart Van Assche

One of the debugfs attributes allows to run a queue. Since running
a queue after a queue has entered the "dead" state is not allowed
and triggers a use-after-free, unregister the debugfs attributes
before a queue reaches the "dead" state.

Signed-off-by: Bart Van Assche 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Omar Sandoval 
---
 block/blk-core.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index a49b0830aaaf..33c91a4bee97 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -566,6 +566,11 @@ void blk_cleanup_queue(struct request_queue *q)
spin_lock_irq(lock);
if (!q->mq_ops)
__blk_drain_queue(q, true);
+   spin_unlock_irq(lock);
+
+   blk_mq_debugfs_unregister_mq(q);
+
+   spin_lock_irq(lock);
queue_flag_set(QUEUE_FLAG_DEAD, q);
spin_unlock_irq(lock);
 
-- 
2.12.2

[PATCH v5 06/10] blk-mq: Move the "state" debugfs attribute one level down

2017-04-25 Thread Bart Van Assche

Move the "state" attribute from the top level to the "mq" directory
as requested by Omar.

Signed-off-by: Bart Van Assche 
Reviewed-by: Omar Sandoval 
Reviewed-by: Hannes Reinecke 
---
 block/blk-mq-debugfs.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index d99064e9e76a..1132be4e9c1c 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -141,11 +141,6 @@ static const struct file_operations blk_queue_flags_fops = 
{
.write  = blk_queue_flags_store,
 };
 
-static const struct blk_mq_debugfs_attr blk_queue_attrs[] = {
-   {"state", 0600, _queue_flags_fops},
-   {},
-};
-
 static void print_stat(struct seq_file *m, struct blk_rq_stat *stat)
 {
if (stat->nr_samples) {
@@ -757,6 +752,7 @@ static const struct file_operations ctx_completed_fops = {
 
 static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
{"poll_stat", 0400, _poll_stat_fops},
+   {"state", 0600, _queue_flags_fops},
{},
 };
 
@@ -873,9 +869,6 @@ int blk_mq_debugfs_register_mq(struct request_queue *q)
if (!q->debugfs_dir)
return -ENOENT;
 
-   if (!debugfs_create_files(q->debugfs_dir, q, blk_queue_attrs))
-   goto err;
-
q->mq_debugfs_dir = debugfs_create_dir("mq", q->debugfs_dir);
if (!q->mq_debugfs_dir)
goto err;
-- 
2.12.2

[PATCH v5 00/10] blk-mq debugfs patches for kernel v4.12

2017-04-25 Thread Bart Van Assche

Hello Jens,

Please consider the ten patches in this series for kernel v4.12.
These patches improve blk-mq debugfs support.

Thanks,

Bart.

Changes compared to v4:
- Modified patch 4 such that debugfs registration failures no longer cause
  block device registration to fail.
- Modified patch 8 such that it no longer introduces new sparse warnings.
- Modified patch 9 such that .show_rq() is only defined if CONFIG_BLK_DEBUG_FS
  is enabled.
- Moved the definition of scsi_show_rq() to a new file in patch 10 and changed
  the implementation of scsi_show_rq() such that it now uses
  __scsi_format_command().

Changes compared to v3:
- Changed the mutex_lock_interruptible() calls back to mutex_lock()
  calls.
- Added a patch that renames the functions for registering and
  unregistering the "mq" directory in debugfs.
- Moved the changes that add checking of the blk_mq_debugfs_register()
  return value into a separate patch.
- Moved the unregistration of the "mq" directory into blk_cleanup_queue().
- Removed uninteresting information from scsi_show_rq().
- Added Reviewed-by tag to the patches that got a positive review.

Changes compared to v2:
- Changed the mutex_lock() calls in registration methods into
  mutex_lock_interruptible() since these functions can be called from
  the context of a user space process.
- Avoid that the blk_mq_register_dev() changes in patch 1/8 cause a
  deadlock.

Changes compared to v1:
- Added two patches and replaced patch 1/6 such that debugfs
  attributes are now unregistered before freeing of a blk-mq queue
  starts instead of checking the "dead" queue flag.
- Changed "rq->cmd_flags ^ op" into "rq->cmd_flags & ~REQ_OP_MASK" as
  proposed by Omar.
- A seq_file pointer is now passed to the new queue_rq callback function
  instead of a fixed-size char buffer.
Bart Van Assche (10):
  blk-mq: Register /queue/mq after having registered /queue
  blk-mq: Let blk_mq_debugfs_register() look up the queue name
  blk-mq-debugfs: Rename functions for registering and unregistering the
mq directory
  blk-mq: Only unregister hctxs for which registration succeeded
  blk-mq: Unregister debugfs attributes earlier
  blk-mq: Move the "state" debugfs attribute one level down
  blk-mq: Make blk_flags_show() callers append a newline character
  blk-mq: Show operation, cmd_flags and rq_flags names
  blk-mq: Add blk_mq_ops.show_rq()
  scsi: Implement blk_mq_ops.show_rq()

 block/blk-core.c|   5 +++
 block/blk-mq-debugfs.c  | 102 
 block/blk-mq-sysfs.c|  61 +++---
 block/blk-mq.h  |  14 +++---
 block/blk-sysfs.c   |   6 +--
 drivers/scsi/Makefile   |   1 +
 drivers/scsi/scsi_debugfs.c |  13 ++
 drivers/scsi/scsi_debugfs.h |   4 ++
 drivers/scsi/scsi_lib.c |   4 ++
 include/linux/blk-mq.h  |   8 
 10 files changed, 174 insertions(+), 44 deletions(-)
 create mode 100644 drivers/scsi/scsi_debugfs.c
 create mode 100644 drivers/scsi/scsi_debugfs.h

-- 
2.12.2

[PATCH v5 08/10] blk-mq: Show operation, cmd_flags and rq_flags names

2017-04-25 Thread Bart Van Assche

Show the operation name, .cmd_flags and .rq_flags as names instead
of numbers.

Signed-off-by: Bart Van Assche 
Reviewed-by: Omar Sandoval 
Reviewed-by: Hannes Reinecke 
---
 block/blk-mq-debugfs.c | 72 +++---
 1 file changed, 69 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index ccc7b0f71230..ac39093c4ef7 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -253,13 +253,79 @@ static const struct file_operations hctx_flags_fops = {
.release= single_release,
 };
 
+static const char *const op_name[] = {
+   [REQ_OP_READ]   = "READ",
+   [REQ_OP_WRITE]  = "WRITE",
+   [REQ_OP_FLUSH]  = "FLUSH",
+   [REQ_OP_DISCARD]= "DISCARD",
+   [REQ_OP_ZONE_REPORT]= "ZONE_REPORT",
+   [REQ_OP_SECURE_ERASE]   = "SECURE_ERASE",
+   [REQ_OP_ZONE_RESET] = "ZONE_RESET",
+   [REQ_OP_WRITE_SAME] = "WRITE_SAME",
+   [REQ_OP_WRITE_ZEROES]   = "WRITE_ZEROES",
+   [REQ_OP_SCSI_IN]= "SCSI_IN",
+   [REQ_OP_SCSI_OUT]   = "SCSI_OUT",
+   [REQ_OP_DRV_IN] = "DRV_IN",
+   [REQ_OP_DRV_OUT]= "DRV_OUT",
+};
+
+static const char *const cmd_flag_name[] = {
+   [__REQ_FAILFAST_DEV]= "FAILFAST_DEV",
+   [__REQ_FAILFAST_TRANSPORT]  = "FAILFAST_TRANSPORT",
+   [__REQ_FAILFAST_DRIVER] = "FAILFAST_DRIVER",
+   [__REQ_SYNC]= "SYNC",
+   [__REQ_META]= "META",
+   [__REQ_PRIO]= "PRIO",
+   [__REQ_NOMERGE] = "NOMERGE",
+   [__REQ_IDLE]= "IDLE",
+   [__REQ_INTEGRITY]   = "INTEGRITY",
+   [__REQ_FUA] = "FUA",
+   [__REQ_PREFLUSH]= "PREFLUSH",
+   [__REQ_RAHEAD]  = "RAHEAD",
+   [__REQ_BACKGROUND]  = "BACKGROUND",
+   [__REQ_NR_BITS] = "NR_BITS",
+};
+
+static const char *const rqf_name[] = {
+   [ilog2((__force u32)RQF_SORTED)]= "SORTED",
+   [ilog2((__force u32)RQF_STARTED)]   = "STARTED",
+   [ilog2((__force u32)RQF_QUEUED)]= "QUEUED",
+   [ilog2((__force u32)RQF_SOFTBARRIER)]   = "SOFTBARRIER",
+   [ilog2((__force u32)RQF_FLUSH_SEQ)] = "FLUSH_SEQ",
+   [ilog2((__force u32)RQF_MIXED_MERGE)]   = "MIXED_MERGE",
+   [ilog2((__force u32)RQF_MQ_INFLIGHT)]   = "MQ_INFLIGHT",
+   [ilog2((__force u32)RQF_DONTPREP)]  = "DONTPREP",
+   [ilog2((__force u32)RQF_PREEMPT)]   = "PREEMPT",
+   [ilog2((__force u32)RQF_COPY_USER)] = "COPY_USER",
+   [ilog2((__force u32)RQF_FAILED)]= "FAILED",
+   [ilog2((__force u32)RQF_QUIET)] = "QUIET",
+   [ilog2((__force u32)RQF_ELVPRIV)]   = "ELVPRIV",
+   [ilog2((__force u32)RQF_IO_STAT)]   = "IO_STAT",
+   [ilog2((__force u32)RQF_ALLOCED)]   = "ALLOCED",
+   [ilog2((__force u32)RQF_PM)]= "PM",
+   [ilog2((__force u32)RQF_HASHED)]= "HASHED",
+   [ilog2((__force u32)RQF_STATS)] = "STATS",
+   [ilog2((__force u32)RQF_SPECIAL_PAYLOAD)]   = "SPECIAL_PAYLOAD",
+};
+
 static int blk_mq_debugfs_rq_show(struct seq_file *m, void *v)
 {
struct request *rq = list_entry_rq(v);
+   const unsigned int op = rq->cmd_flags & REQ_OP_MASK;
 
-   seq_printf(m, "%p {.cmd_flags=0x%x, .rq_flags=0x%x, .tag=%d, 
.internal_tag=%d}\n",
-  rq, rq->cmd_flags, (__force unsigned int)rq->rq_flags,
-  rq->tag, rq->internal_tag);
+   seq_printf(m, "%p {.op=", rq);
+   if (op < ARRAY_SIZE(op_name) && op_name[op])
+   seq_printf(m, "%s", op_name[op]);
+   else
+   seq_printf(m, "%d", op);
+   seq_puts(m, ", .cmd_flags=");
+   blk_flags_show(m, rq->cmd_flags & ~REQ_OP_MASK, cmd_flag_name,
+  ARRAY_SIZE(cmd_flag_name));
+   seq_puts(m, ", .rq_flags=");
+   blk_flags_show(m, (__force unsigned int)rq->rq_flags, rqf_name,
+  ARRAY_SIZE(rqf_name));
+   seq_printf(m, ", .tag=%d, .internal_tag=%d}\n", rq->tag,
+  rq->internal_tag);
return 0;
 }
 
-- 
2.12.2

[PATCH v5 07/10] blk-mq: Make blk_flags_show() callers append a newline character

2017-04-25 Thread Bart Van Assche

This patch does not change any functionality but makes it possible
to produce a single line of output with multiple flag-to-name
translations.

Signed-off-by: Bart Van Assche 
Reviewed-by: Omar Sandoval 
Reviewed-by: Hannes Reinecke 
---
 block/blk-mq-debugfs.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 1132be4e9c1c..ccc7b0f71230 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -60,7 +60,6 @@ static int blk_flags_show(struct seq_file *m, const unsigned 
long flags,
else
seq_printf(m, "%d", i);
}
-   seq_puts(m, "\n");
return 0;
 }
 
@@ -102,6 +101,7 @@ static int blk_queue_flags_show(struct seq_file *m, void *v)
 
blk_flags_show(m, q->queue_flags, blk_queue_flag_name,
   ARRAY_SIZE(blk_queue_flag_name));
+   seq_puts(m, "\n");
return 0;
 }
 
@@ -193,6 +193,7 @@ static int hctx_state_show(struct seq_file *m, void *v)
 
blk_flags_show(m, hctx->state, hctx_state_name,
   ARRAY_SIZE(hctx_state_name));
+   seq_puts(m, "\n");
return 0;
 }
 
@@ -236,6 +237,7 @@ static int hctx_flags_show(struct seq_file *m, void *v)
blk_flags_show(m,
   hctx->flags ^ BLK_ALLOC_POLICY_TO_MQ_FLAG(alloc_policy),
   hctx_flag_name, ARRAY_SIZE(hctx_flag_name));
+   seq_puts(m, "\n");
return 0;
 }
 
-- 
2.12.2

Get back to me

2017-04-25 Thread Ashraf Basit

Good day. Did you receive the business proposal I sent to you yesterday? I was 
waiting for your reply but I am not sure if you receive the message.  If for 
some reason you did not receive my previous email, I can resend the message to 
you. Please confirm as this is very urgent and important.

Regards,

Ashraf.
ashraf...@secsuremailer.com

Re: [PATCH] lpfc: Fix memory corruption of the lpfc_ncmd->list pointers

2017-04-25 Thread Christoph Hellwig

On Tue, Apr 25, 2017 at 10:58:44AM -0700, Christoph Hellwig wrote:
> On Fri, Apr 21, 2017 at 05:49:08PM -0700, jsmart2...@gmail.com wrote:
> > This is a nvme-specific bug. The patch was cut against the
> > linux-block tree, for-4.12/block tree. It should be pulled in through
> > that tree.
> 
> It conflicts with your nvme changes that are in the nvme-4.12.
> Can you respin it?

Sorry for the noise.  It applies fine and I've applied it.

Re: [PATCH] lpfc: Fix memory corruption of the lpfc_ncmd->list pointers

2017-04-25 Thread Christoph Hellwig

On Fri, Apr 21, 2017 at 05:49:08PM -0700, jsmart2...@gmail.com wrote:
> This is a nvme-specific bug. The patch was cut against the
> linux-block tree, for-4.12/block tree. It should be pulled in through
> that tree.

It conflicts with your nvme changes that are in the nvme-4.12.
Can you respin it?

Re: [PATCH] nvme/scsi: Consider LBA format in IO splitting calculation

2017-04-25 Thread Christoph Hellwig

Applied to nvme-4.12.

Re: [PATCH] nvme/lightnvm: add missing endianess conversion in nvme_nvm_end_io

2017-04-25 Thread Matias Bjørling


From: Christoph Hellwig 
Sent: 25 April 2017 09:41
To: Javier Gonzalez; Matias Bjørling
Cc: linux-n...@lists.infradead.org
Subject: Re: [PATCH] nvme/lightnvm: add missing endianess conversion in 
nvme_nvm_end_io

Javier, Matias:  can I get a quick ACK?  This is the last patch needed
to make drivers/nvme warning-free.

On Fri, Apr 21, 2017 at 08:30:17AM +0200, Christoph Hellwig wrote:
> Found by sparse.
>
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/nvme/host/lightnvm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
> index de61a4a03d78..e4e4e60b1224 100644
> --- a/drivers/nvme/host/lightnvm.c
> +++ b/drivers/nvme/host/lightnvm.c
> @@ -483,7 +483,7 @@ static void nvme_nvm_end_io(struct request *rq, int error)
>  {
>   struct nvm_rq *rqd = rq->end_io_data;
>
> - rqd->ppa_status = nvme_req(rq)->result.u64;
> + rqd->ppa_status = le64_to_cpu(nvme_req(rq)->result.u64);
>   rqd->error = nvme_req(rq)->status;
>   nvm_end_io(rqd);
>
> --
> 2.11.0
>
>
> ___
> Linux-nvme mailing list
> linux-n...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme

Thanks Christoph.

Reviewed-by: Matias Bjørling

Re: [PATCH -next] lightnvm: fix possible memory leak in pblk_bb_discovery()

2017-04-25 Thread Jens Axboe

On 04/25/2017 09:15 AM, Wei Yongjun wrote:
> From: Wei Yongjun 
> 
> 'blks' is malloced in pblk_bb_discovery() and should be freed
> before leaving from the nvm_get_tgt_bb_tbl() error handling cases,
> otherwise it will cause memory leak. Also skip assign blks to
> rlun->bb_list when error.

Added for 4.12, thanks.

-- 
Jens Axboe

Re: [PATCH v3 10/20] fuse: set mapping error in writepage_locked when it fails

2017-04-25 Thread Jeff Layton

On Tue, 2017-04-25 at 13:19 +0200, Jan Kara wrote:
> On Tue 25-04-17 06:35:13, Jeff Layton wrote:
> > On Tue, 2017-04-25 at 10:17 +0200, Jan Kara wrote:
> > > On Mon 24-04-17 13:14:36, Jeff Layton wrote:
> > > > On Mon, 2017-04-24 at 18:04 +0200, Jan Kara wrote:
> > > > > On Mon 24-04-17 09:22:49, Jeff Layton wrote:
> > > > > > This ensures that we see errors on fsync when writeback fails.
> > > > > > 
> > > > > > Signed-off-by: Jeff Layton 
> > > > > 
> > > > > Hum, but do we really want to clobber mapping errors with temporary 
> > > > > stuff
> > > > > like ENOMEM? Or do you want to handle that in mapping_set_error?
> > > > > 
> > > > 
> > > > Right now we don't really have such a thing as temporary errors in the
> > > > writeback codepath. If you return an error here, the data doesn't stay
> > > > dirty or anything, and I think we want to ensure that that gets reported
> > > > via fsync.
> > > > 
> > > > I'd like to see us add better handling for retryable errors for stuff
> > > > like ENOMEM or EAGAIN. I think this is the first step toward that
> > > > though. Once we have more consistent handling of writeback errors in
> > > > general, then we can start doing more interesting things with retryable
> > > > errors.
> > > > 
> > > > So yeah, I this is the right thing to do for now.
> > > 
> > > OK, fair enough. And question number 2):
> > > 
> > > Who is actually responsible for setting the error in the mapping when 
> > > error
> > > happens inside ->writepage()? Is it the ->writepage() callback or the
> > > caller of ->writepage()? Or something else? Currently it seems to be a
> > > strange mix (e.g. mm/page-writeback.c: __writepage() calls
> > > mapping_set_error() when ->writepage() returns error) so I'd like to
> > > understand what's the plan and have that recorded in the changelogs.
> > > 
> > 
> > That's an excellent question.
> > 
> > I think we probably want the writepage/launder_page operations to call
> > mapping_set_error. That makes it possible for filesystems (e.g. NFS) to
> > handle their own error tracking and reporting without using the new
> > infrastructure. If they never call mapping_set_error then we'll always
> > just return whatever their ->fsync operation returns on an fsync.
> 
> OK, makes sense. It is also in line with what you did for DAX, 9p, or here
> for FUSE. So feel free to add:
> 
> Reviewed-by: Jan Kara 
> 
> for this patch but please also add a sentense that ->writepage() is
> responsible for calling mapping_set_error() if it fails and page is not
> redirtied to the changelogs of patches changing writepage handlers.
> 
> > I'll make another pass through the tree and see whether we have some
> > mapping_set_error calls that should be removed, and will flesh out
> > vfs.txt to state this. Maybe that file needs a whole section on
> > writeback error reporting? Hmmm...
> 
> I think it would be nice to have all the logic described in one place. So
> +1 from me.
> 
> > That probably also means that I should drop patch 8 from this series
> > (mm: ensure that we set mapping error if writeout fails), since that
> > should be happening in writepage already.
> 
> Yes.
> 
>   Honza

(cc'ing Jon since I'm proposing a doc update)

Here's what I'm thinking for a vfs.txt update after this series. The
section on writeback_control could probably be more specific.

--8<-

diff --git a/Documentation/filesystems/vfs.txt 
b/Documentation/filesystems/vfs.txt
index 94dd27ef4a76..aa912b65792a 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -576,7 +576,23 @@ should clear PG_Dirty and set PG_Writeback.  It can be 
actually
 written at any point after PG_Dirty is clear.  Once it is known to be
 safe, PG_Writeback is cleared.
 
-Writeback makes use of a writeback_control structure...
+Writeback makes use of a writeback_control structure to direct the
+operations. This tells the writepage and writepages operations something
+about the nature of and reason for the writeback request, and the
+constraints under which it is being done. It is also used to track state
+between successive writeback requests.
+
+When there is an error during writeback, then an error should be
+reported to fsync on all file descriptors that were open at the time of
+the error. This is typically done by setting the wb_err value in the
+address_space via mapping_set_error when writeback errors occur. The
+vfs-layer fsync code will then report the errors on a per-fd basis.
+
+Filesystems are free to track errors internally if they choose, but they
+should aim to provide the same semantics for error reporting when there
+are multiple writers. Filesystems that track their own errors should
+avoid calling mapping_set_error in order to ensure that errors stored in
+the mapping aren't improperly reported by the generic filesystem code.
 
 struct

Re: [PATCH -next] lightnvm: fix possible memory leak in pblk_bb_discovery()

2017-04-25 Thread Javier González

> On 25 Apr 2017, at 18.15, Wei Yongjun  wrote:
> 
> From: Wei Yongjun 
> 
> 'blks' is malloced in pblk_bb_discovery() and should be freed
> before leaving from the nvm_get_tgt_bb_tbl() error handling cases,
> otherwise it will cause memory leak. Also skip assign blks to
> rlun->bb_list when error.
> 
> Fixes: a4bd217b4326 ("lightnvm: physical block device (pblk) target")
> Signed-off-by: Wei Yongjun 
> ---
> 

Thanks Wei. Looks good.

Reviewed-by: Javier González 



signature.asc
Description: Message signed with OpenPGP

Re: [PATCH v4 10/10] scsi: Implement blk_mq_ops.show_rq()

2017-04-25 Thread Martin K. Petersen


Bart,

>> I was merely objecting to the fact that we already have umpteen existing
>> interfaces for displaying SCSI command information.

> Do you perhaps want me to change the for-loop into a call to
> __scsi_format_command()?

If possible, I would love to see some commonality in the per-command
information regardless of whether it is displayed due to SCSI logging,
SCSI tracing or an error condition.

-- 
Martin K. Petersen  Oracle Linux Engineering

Re: [PATCH v4 09/10] blk-mq: Add blk_mq_ops.show_rq()

2017-04-25 Thread Jens Axboe

On 04/25/2017 08:16 AM, Hannes Reinecke wrote:
> On 04/24/2017 11:51 PM, Bart Van Assche wrote:
>> On Mon, 2017-04-24 at 09:32 +0200, Hannes Reinecke wrote:
>>> On 04/22/2017 01:40 AM, Bart Van Assche wrote:
 --- a/include/linux/blk-mq.h
 +++ b/include/linux/blk-mq.h
 @@ -121,6 +121,12 @@ struct blk_mq_ops {
softirq_done_fn *complete;
  
/*
 +   * Used by the debugfs implementation to show driver-specific
 +   * information about a request.
 +   */
 +  void (*show_rq)(struct seq_file *m, struct request *rq);
 +
 +  /*
 * Called when the block layer side of a hardware queue has been
 * set up, allowing the driver to allocate/init matching structures.
 * Ditto for exit/teardown.

>>>
>>> I don't really like this; what does happen if someone disabled
>>> CONFIG_BLK_DEBUGFS? Won't we end up with a stale callback?
>>
>> Hello Hannes,
>>
>> How about surrounding (*show_rq)() function pointer with #ifdef 
>> CONFIG_BLK_DEBUGFS /
>> #endif?
>>
> Nope.
> 
> Then you'll end up with different offsets in the structures, depending
> on how the kernel is compiled. Making debugging a nightmare.

That's nonsense, we deal with this all the time already.

-- 
Jens Axboe

[PATCH -next] lightnvm: fix possible memory leak in pblk_bb_discovery()

2017-04-25 Thread Wei Yongjun

From: Wei Yongjun 

'blks' is malloced in pblk_bb_discovery() and should be freed
before leaving from the nvm_get_tgt_bb_tbl() error handling cases,
otherwise it will cause memory leak. Also skip assign blks to
rlun->bb_list when error.

Fixes: a4bd217b4326 ("lightnvm: physical block device (pblk) target")
Signed-off-by: Wei Yongjun 
---
 drivers/lightnvm/pblk-init.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 3996e4b..5c8f404 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -399,13 +399,15 @@ static int pblk_bb_discovery(struct nvm_tgt_dev *dev, 
struct pblk_lun *rlun)
 
nr_blks = nvm_bb_tbl_fold(dev->parent, blks, nr_blks);
if (nr_blks < 0) {
-   kfree(blks);
ret = nr_blks;
+   goto out;
}
 
rlun->bb_list = blks;
 
+   return 0;
 out:
+   kfree(blks);
return ret;
 }

Re: [PATCH v4 09/10] blk-mq: Add blk_mq_ops.show_rq()

2017-04-25 Thread Bart Van Assche

On Tue, 2017-04-25 at 17:16 +0200, Hannes Reinecke wrote:
> On 04/24/2017 11:51 PM, Bart Van Assche wrote:
> > On Mon, 2017-04-24 at 09:32 +0200, Hannes Reinecke wrote:
> > > On 04/22/2017 01:40 AM, Bart Van Assche wrote:
> > > > --- a/include/linux/blk-mq.h
> > > > +++ b/include/linux/blk-mq.h
> > > > @@ -121,6 +121,12 @@ struct blk_mq_ops {
> > > > softirq_done_fn *complete;
> > > >  
> > > > /*
> > > > +* Used by the debugfs implementation to show driver-specific
> > > > +* information about a request.
> > > > +*/
> > > > +   void (*show_rq)(struct seq_file *m, struct request *rq);
> > > > +
> > > > +   /*
> > > >  * Called when the block layer side of a hardware queue has been
> > > >  * set up, allowing the driver to allocate/init matching 
> > > > structures.
> > > >  * Ditto for exit/teardown.
> > > > 
> > > 
> > > I don't really like this; what does happen if someone disabled
> > > CONFIG_BLK_DEBUGFS? Won't we end up with a stale callback?
> > 
> > How about surrounding (*show_rq)() function pointer with #ifdef 
> > CONFIG_BLK_DEBUGFS /
> > #endif?
> 
> Then you'll end up with different offsets in the structures, depending
> on how the kernel is compiled. Making debugging a nightmare.

Hello Hannes,

How about moving the .show_rq function pointer to the end such that the
offset of other members of struct blk_mq_ops does not depend on whether or
not CONFIG_BLK_DEBUGFS has been defined?

Bart.

Re: [PATCH v4 09/10] blk-mq: Add blk_mq_ops.show_rq()

2017-04-25 Thread Hannes Reinecke

On 04/24/2017 11:51 PM, Bart Van Assche wrote:
> On Mon, 2017-04-24 at 09:32 +0200, Hannes Reinecke wrote:
>> On 04/22/2017 01:40 AM, Bart Van Assche wrote:
>>> --- a/include/linux/blk-mq.h
>>> +++ b/include/linux/blk-mq.h
>>> @@ -121,6 +121,12 @@ struct blk_mq_ops {
>>> softirq_done_fn *complete;
>>>  
>>> /*
>>> +* Used by the debugfs implementation to show driver-specific
>>> +* information about a request.
>>> +*/
>>> +   void (*show_rq)(struct seq_file *m, struct request *rq);
>>> +
>>> +   /*
>>>  * Called when the block layer side of a hardware queue has been
>>>  * set up, allowing the driver to allocate/init matching structures.
>>>  * Ditto for exit/teardown.
>>>
>>
>> I don't really like this; what does happen if someone disabled
>> CONFIG_BLK_DEBUGFS? Won't we end up with a stale callback?
> 
> Hello Hannes,
> 
> How about surrounding (*show_rq)() function pointer with #ifdef 
> CONFIG_BLK_DEBUGFS /
> #endif?
> 
Nope.

Then you'll end up with different offsets in the structures, depending
on how the kernel is compiled. Making debugging a nightmare.

Not sure what would be the best way here ...

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

Re: Status of pwritev2/preadv2 in glibc?

2017-04-25 Thread Adhemerval Zanella



On 24/04/2017 14:55, Stephen  Bates wrote:
> 
>> So far, no one has submitted a patch.
> 
> OK, unless I hear that someone else is working on one I will take a look at 
> this.
> 
>> I hope the off_t parameter is passed exactly the same way as for pwritev 
>> and its 64-bit variant, for all architectures.
> 
> Duly noted.
> 
> For the kernel peeps I think this makes a case for revisiting the “big 
> hammer” control for IO polling since it is going to be a while before 
> applications can utilize the preadv2/pwritev2 approach… I will take a look at 
> that too.
> 
> Stephen
> 
> 

I am working on it btw [1]

[1] https://sourceware.org/ml/libc-alpha/2017-04/msg00461.html

PS: resending with cc to all.

Re: [PATCH v3 10/20] fuse: set mapping error in writepage_locked when it fails

2017-04-25 Thread Jan Kara

On Tue 25-04-17 06:35:13, Jeff Layton wrote:
> On Tue, 2017-04-25 at 10:17 +0200, Jan Kara wrote:
> > On Mon 24-04-17 13:14:36, Jeff Layton wrote:
> > > On Mon, 2017-04-24 at 18:04 +0200, Jan Kara wrote:
> > > > On Mon 24-04-17 09:22:49, Jeff Layton wrote:
> > > > > This ensures that we see errors on fsync when writeback fails.
> > > > > 
> > > > > Signed-off-by: Jeff Layton 
> > > > 
> > > > Hum, but do we really want to clobber mapping errors with temporary 
> > > > stuff
> > > > like ENOMEM? Or do you want to handle that in mapping_set_error?
> > > > 
> > > 
> > > Right now we don't really have such a thing as temporary errors in the
> > > writeback codepath. If you return an error here, the data doesn't stay
> > > dirty or anything, and I think we want to ensure that that gets reported
> > > via fsync.
> > > 
> > > I'd like to see us add better handling for retryable errors for stuff
> > > like ENOMEM or EAGAIN. I think this is the first step toward that
> > > though. Once we have more consistent handling of writeback errors in
> > > general, then we can start doing more interesting things with retryable
> > > errors.
> > > 
> > > So yeah, I this is the right thing to do for now.
> > 
> > OK, fair enough. And question number 2):
> > 
> > Who is actually responsible for setting the error in the mapping when error
> > happens inside ->writepage()? Is it the ->writepage() callback or the
> > caller of ->writepage()? Or something else? Currently it seems to be a
> > strange mix (e.g. mm/page-writeback.c: __writepage() calls
> > mapping_set_error() when ->writepage() returns error) so I'd like to
> > understand what's the plan and have that recorded in the changelogs.
> > 
> 
> That's an excellent question.
> 
> I think we probably want the writepage/launder_page operations to call
> mapping_set_error. That makes it possible for filesystems (e.g. NFS) to
> handle their own error tracking and reporting without using the new
> infrastructure. If they never call mapping_set_error then we'll always
> just return whatever their ->fsync operation returns on an fsync.

OK, makes sense. It is also in line with what you did for DAX, 9p, or here
for FUSE. So feel free to add:

Reviewed-by: Jan Kara 

for this patch but please also add a sentense that ->writepage() is
responsible for calling mapping_set_error() if it fails and page is not
redirtied to the changelogs of patches changing writepage handlers.

> I'll make another pass through the tree and see whether we have some
> mapping_set_error calls that should be removed, and will flesh out
> vfs.txt to state this. Maybe that file needs a whole section on
> writeback error reporting? Hmmm...

I think it would be nice to have all the logic described in one place. So
+1 from me.

> That probably also means that I should drop patch 8 from this series
> (mm: ensure that we set mapping error if writeout fails), since that
> should be happening in writepage already.

Yes.

Honza
-- 
Jan Kara 
SUSE Labs, CR

Re: [PATCH v3 10/20] fuse: set mapping error in writepage_locked when it fails

2017-04-25 Thread Jeff Layton

On Tue, 2017-04-25 at 10:17 +0200, Jan Kara wrote:
> On Mon 24-04-17 13:14:36, Jeff Layton wrote:
> > On Mon, 2017-04-24 at 18:04 +0200, Jan Kara wrote:
> > > On Mon 24-04-17 09:22:49, Jeff Layton wrote:
> > > > This ensures that we see errors on fsync when writeback fails.
> > > > 
> > > > Signed-off-by: Jeff Layton 
> > > 
> > > Hum, but do we really want to clobber mapping errors with temporary stuff
> > > like ENOMEM? Or do you want to handle that in mapping_set_error?
> > > 
> > 
> > Right now we don't really have such a thing as temporary errors in the
> > writeback codepath. If you return an error here, the data doesn't stay
> > dirty or anything, and I think we want to ensure that that gets reported
> > via fsync.
> > 
> > I'd like to see us add better handling for retryable errors for stuff
> > like ENOMEM or EAGAIN. I think this is the first step toward that
> > though. Once we have more consistent handling of writeback errors in
> > general, then we can start doing more interesting things with retryable
> > errors.
> > 
> > So yeah, I this is the right thing to do for now.
> 
> OK, fair enough. And question number 2):
> 
> Who is actually responsible for setting the error in the mapping when error
> happens inside ->writepage()? Is it the ->writepage() callback or the
> caller of ->writepage()? Or something else? Currently it seems to be a
> strange mix (e.g. mm/page-writeback.c: __writepage() calls
> mapping_set_error() when ->writepage() returns error) so I'd like to
> understand what's the plan and have that recorded in the changelogs.
> 

That's an excellent question.

I think we probably want the writepage/launder_page operations to call
mapping_set_error. That makes it possible for filesystems (e.g. NFS) to
handle their own error tracking and reporting without using the new
infrastructure. If they never call mapping_set_error then we'll always
just return whatever their ->fsync operation returns on an fsync.

I'll make another pass through the tree and see whether we have some
mapping_set_error calls that should be removed, and will flesh out
vfs.txt to state this. Maybe that file needs a whole section on
writeback error reporting? Hmmm...

That probably also means that I should drop patch 8 from this series
(mm: ensure that we set mapping error if writeout fails), since that
should be happening in writepage already.

> > 
> > > 
> > > > ---
> > > >  fs/fuse/file.c | 1 +
> > > >  1 file changed, 1 insertion(+)
> > > > 
> > > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > > > index ec238fb5a584..07d0efcb050c 100644
> > > > --- a/fs/fuse/file.c
> > > > +++ b/fs/fuse/file.c
> > > > @@ -1669,6 +1669,7 @@ static int fuse_writepage_locked(struct page 
> > > > *page)
> > > >  err_free:
> > > > fuse_request_free(req);
> > > >  err:
> > > > +   mapping_set_error(page->mapping, error);
> > > > end_page_writeback(page);
> > > > return error;
> > > >  }
> > > > -- 
> > > > 2.9.3
> > > > 
> > > > 
> > 
> > -- 
> > Jeff Layton 

-- 
Jeff Layton

Re: bfq-mq performance comparison to cfq

2017-04-25 Thread Juri Lelli

Hi,

sorry if I jump into this interesting conversation, but I felt some people
might have missed this and might be interested as well (even if from a
slightly different POW). Let me Cc them (Patrick, Morten, Peter, Joel,
Andres).

On 19/04/17 09:02, Paolo Valente wrote:
> 
> > Il giorno 19 apr 2017, alle ore 07:01, Bart Van Assche 
> >  ha scritto:
> > 
> > On 04/11/17 00:29, Paolo Valente wrote:
> >> 
> >>> Il giorno 10 apr 2017, alle ore 17:15, Bart Van Assche 
> >>>  ha scritto:
> >>> 
> >>> On Mon, 2017-04-10 at 11:55 +0200, Paolo Valente wrote:
>  That said, if you do always want maximum throughput, even at the
>  expense of latency, then just switch off low-latency heuristics, i.e.,
>  set low_latency to 0.  Depending on the device, setting slice_ilde to
>  0 may help a lot too (as well as with CFQ).  If the throughput is
>  still low also after forcing BFQ to an only-throughput mode, then you
>  hit some bug, and I'll have a little more work to do ...
> >>> 
> >>> Has it been considered to make applications tell the I/O scheduler
> >>> whether to optimize for latency or for throughput? It shouldn't be that
> >>> hard for window managers and shells to figure out whether or not a new
> >>> application that is being started is interactive or not. This would
> >>> require a mechanism that allows applications to provide such information
> >>> to the I/O scheduler. Wouldn't that be a better approach than the I/O
> >>> scheduler trying to guess whether or not an application is an interactive
> >>> application?
> >> 
> >> IMO that would be an (or maybe the) optimal solution, in terms of both
> >> throughput and latency.  We have even developed a prototype doing what
> >> you propose, for Android.  Unfortunately, I have not yet succeeded in
> >> getting support, to turn it into candidate production code, or to make
> >> a similar solution for lsb-compliant systems.
> > 
> > Hello Paolo,
> > 
> > What API was used by the Android application to tell the I/O scheduler 
> > to optimize for latency? Do you think that it would be sufficient if the 
> > application uses the ioprio_set() system call to set the I/O priority to 
> > IOPRIO_CLASS_RT?
> > 
> 
> That's exactly the hack we are using in our prototype.  However, it
> can only be a temporary hack, because it mixes two slightly different
> concepts: 1) the activation of weight raising and other mechanisms for
> reducing latency for the target app, 2) the assignment of a different
> priority class, which (cleanly) means just that processes in a lower
> priority class will be served only when the processes of the target
> app have no pending I/O request.  Finding a clean boosting API would
> be one of the main steps to turn our prototype into a usable solution.
> 

I also need to append here latest Bart's reply (which hasn't all the
context):

On 19/04/17 15:43, Bart Van Assche wrote:
> On Wed, 2017-04-19 at 09:02 +0200, Paolo Valente wrote:
> > > Il giorno 19 apr 2017, alle ore 07:01, Bart Van Assche 
> > >  ha scritto:
> > > What API was used by the Android application to tell the I/O scheduler 
> > > to optimize for latency? Do you think that it would be sufficient if the 
> > > application uses the ioprio_set() system call to set the I/O priority to 
> > > IOPRIO_CLASS_RT?
> > 
> > That's exactly the hack we are using in our prototype.  However, it
> > can only be a temporary hack, because it mixes two slightly different
> > concepts: 1) the activation of weight raising and other mechanisms for
> > reducing latency for the target app, 2) the assignment of a different
> > priority class, which (cleanly) means just that processes in a lower
> > priority class will be served only when the processes of the target
> > app have no pending I/O request.  Finding a clean boosting API would
> > be one of the main steps to turn our prototype into a usable solution.
> 
> Hello Paolo,
> 
> Sorry but I do not agree that you call this use of I/O priorities a hack.
> I also do not agree that I/O requests submitted by processes in a lower
> priority class will only be served by the I/O scheduler when there are no
> pending requests in a higher class. It wouldn't be that hard to modify I/O
> schedulers that support I/O priorities to avoid the starvation you referred
> to. What I expect that will happen is that sooner or later a Linux
> distributor will start receiving bug reports about the heuristics for
> detecting interactive and streaming applications and that the person who
> will work on that bug report will realize that it will be easier to remove
> those heuristics from BFQ and to modify streaming applications and the
> software that starts interactive applications (e.g. a window manager) to
> use a higher I/O priority.
> 
> Please also note that what I described above may require to introduce
> additional I/O priorities in the Linux kernel next to the

Re: [PATCH v3 10/20] fuse: set mapping error in writepage_locked when it fails

2017-04-25 Thread Jan Kara

On Mon 24-04-17 13:14:36, Jeff Layton wrote:
> On Mon, 2017-04-24 at 18:04 +0200, Jan Kara wrote:
> > On Mon 24-04-17 09:22:49, Jeff Layton wrote:
> > > This ensures that we see errors on fsync when writeback fails.
> > > 
> > > Signed-off-by: Jeff Layton 
> > 
> > Hum, but do we really want to clobber mapping errors with temporary stuff
> > like ENOMEM? Or do you want to handle that in mapping_set_error?
> > 
> 
> Right now we don't really have such a thing as temporary errors in the
> writeback codepath. If you return an error here, the data doesn't stay
> dirty or anything, and I think we want to ensure that that gets reported
> via fsync.
> 
> I'd like to see us add better handling for retryable errors for stuff
> like ENOMEM or EAGAIN. I think this is the first step toward that
> though. Once we have more consistent handling of writeback errors in
> general, then we can start doing more interesting things with retryable
> errors.
> 
> So yeah, I this is the right thing to do for now.

OK, fair enough. And question number 2):

Who is actually responsible for setting the error in the mapping when error
happens inside ->writepage()? Is it the ->writepage() callback or the
caller of ->writepage()? Or something else? Currently it seems to be a
strange mix (e.g. mm/page-writeback.c: __writepage() calls
mapping_set_error() when ->writepage() returns error) so I'd like to
understand what's the plan and have that recorded in the changelogs.

Honza

> 
> > 
> > > ---
> > >  fs/fuse/file.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > > index ec238fb5a584..07d0efcb050c 100644
> > > --- a/fs/fuse/file.c
> > > +++ b/fs/fuse/file.c
> > > @@ -1669,6 +1669,7 @@ static int fuse_writepage_locked(struct page *page)
> > >  err_free:
> > >   fuse_request_free(req);
> > >  err:
> > > + mapping_set_error(page->mapping, error);
> > >   end_page_writeback(page);
> > >   return error;
> > >  }
> > > -- 
> > > 2.9.3
> > > 
> > > 
> 
> -- 
> Jeff Layton 
-- 
Jan Kara 
SUSE Labs, CR

39 matches

Mail list logo