Re: [PATCH 05/16] dm: always defer request allocation to the owner of the request_queue

2017-01-25 Thread Junichi Nomura
On 01/25/17 01:39, Mike Snitzer wrote:
> On Tue, Jan 24 2017 at  9:20am -0500, Christoph Hellwig  wrote:
>> On Tue, Jan 24, 2017 at 05:05:39AM -0500, Mike Snitzer wrote:
>>> possible and is welcome cleanup.  The only concern I have is that using
>>> get_request() for the old request_fn request_queue eliminates the
>>> guaranteed availability of requests to allow for forward progress (on
>>> path failure or for the purposes of swap over mpath, etc).  This isn't a
>>> concern for blk-mq because as you know we have a fixed set of tags (and
>>> associated preallocated requests).
>>>
>>> So I'm left unconvinced that old request_fn request-based DM multipath
>>> isn't regressing in how request resubmission can be assured that a request
>>> will be available when needed on retry in the face of path failure.
>>
>> Mempools only need a size at which we can guarantee forward progress, so
>> for get_request based drivers under dm the theoretical minimum size would
>> be one, as we never rely on a second request to finish the first one,
>> and each request_queue has its own mempool(s) to start with.
> 
> Fair enough.  Cc'ing Junichi just in case he sees anything we're
> missing.

DM multipath could not use blk_get_request() because the function
was not callable from interrupt-disabled context, e.g. request_fn.

However, since the current code no longer calls blk_get_request()
from such a context, the change should be ok.
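
(As an aside, the forward-progress argument above in code form -- a minimal
sketch assuming the standard mempool API; the pool and cache names are
illustrative:)

#include <linux/mempool.h>

/*
 * A mempool created with min_nr = 1 always keeps one element in reserve;
 * an allocation that finds the pool empty sleeps until a completion
 * returns an element.  Since no request here depends on a second
 * allocation from the same pool to finish, one reserved element per
 * request_queue is enough to guarantee forward progress.
 */
static mempool_t *rq_pool;

static int rq_pool_init(struct kmem_cache *rq_cache)
{
	rq_pool = mempool_create_slab_pool(1, rq_cache);
	return rq_pool ? 0 : -ENOMEM;
}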

-- 
Jun'ichi Nomura, NEC Corporation / NEC Solution Innovators, Ltd.


Re: [PATCH 18/18] block: don't assign cmd_flags in __blk_rq_prep_clone

2017-01-25 Thread Martin K. Petersen
> "Christoph" == Christoph Hellwig  writes:

Christoph> These days we have the proper flags set since request
Christoph> allocation time.

Acked-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 15/18] scsi: allocate scsi_cmnd structures as part of struct request

2017-01-25 Thread Martin K. Petersen
> "Christoph" == Christoph Hellwig  writes:

Christoph> Rely on the new block layer functionality to allocate
Christoph> additional driver specific data behind struct request instead
Christoph> of implementing it in SCSI itself.

Acked-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 14/18] scsi: remove __scsi_alloc_queue

2017-01-25 Thread Martin K. Petersen
> "Christoph" == Christoph Hellwig  writes:

Christoph> Instead do an internal export of __scsi_init_queue for the
Christoph> transport classes that export BSG nodes.

Acked-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 13/18] scsi: remove scsi_cmd_dma_pool

2017-01-25 Thread Martin K. Petersen
> "Christoph" == Christoph Hellwig  writes:

Christoph> There is no need for GFP_DMA allocations of the scsi_cmnd
Christoph> structures themselves, all that might be DMAed to or from is
Christoph> the actual payload, or the sense buffers.

Acked-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 12/18] scsi: respect unchecked_isa_dma for blk-mq

2017-01-25 Thread Martin K. Petersen
> "Christoph" == Christoph Hellwig  writes:

Christoph> Currently blk-mq always allocates the sense buffer using
Christoph> normal GFP_KERNEL allocation.  Refactor the cmd pool code to
Christoph> split the cmd and sense allocation and share the code to
Christoph> allocate the sense buffers as well as the sense buffer slab
Christoph> caches between the legacy and blk-mq path.

Acked-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 11/18] scsi: remove gfp_flags member in scsi_host_cmd_pool

2017-01-25 Thread Martin K. Petersen
> "Christoph" == Christoph Hellwig  writes:

Christoph> When using the slab allocator we already decide at cache
Christoph> creation time if an allocation comes from a GFP_DMA pool
Christoph> using the SLAB_CACHE_DMA flag, and there is no point passing
Christoph> the kmalloc-family only GFP_DMA flag to kmem_cache_alloc.
Christoph> Drop all the infrastructure for doing so.

Acked-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 10/18] scsi_dh_hp_sw: switch to scsi_execute_req_flags()

2017-01-25 Thread Martin K. Petersen
> "Christoph" == Christoph Hellwig  writes:

Christoph> From: Hannes Reinecke  Switch to
Christoph> scsi_execute_req_flags() instead of using the block interface
Christoph> directly.  This will set REQ_QUIET and REQ_PREEMPT, but this
Christoph> is okay as we're evaluating the errors anyway and should be
Christoph> able to send the command even if the device is quiesced.

Acked-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 09/18] scsi_dh_emc: switch to scsi_execute_req_flags()

2017-01-25 Thread Martin K. Petersen
> "Christoph" == Christoph Hellwig  writes:

Christoph> From: Hannes Reinecke  Switch to
Christoph> scsi_execute_req_flags() and scsi_get_vpd_page() instead of
Christoph> open-coding it.  Using scsi_execute_req_flags() will set
Christoph> REQ_QUIET and REQ_PREEMPT, but this is okay as we're
Christoph> evaluating the errors anyway and should be able to send the
Christoph> command even if the device is quiesced.

Acked-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 08/18] scsi_dh_rdac: switch to scsi_execute_req_flags()

2017-01-25 Thread Martin K. Petersen
> "Christoph" == Christoph Hellwig  writes:

Christoph> From: Hannes Reinecke  Switch to
Christoph> scsi_execute_req_flags() and scsi_get_vpd_page() instead of
Christoph> open-coding it.  Using scsi_execute_req_flags() will set
Christoph> REQ_QUIET and REQ_PREEMPT, but this is okay as we're
Christoph> evaluating the errors anyway and should be able to send the
Christoph> command even if the device is quiesced.

Acked-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 05/18] block: allow specifying size for extra command data

2017-01-25 Thread Martin K. Petersen
> "Christoph" == Christoph Hellwig  writes:

Christoph,

Christoph> This mirrors the blk-mq capabilities to allocate extra
Christoph> drivers-specific data behind struct request by setting a
Christoph> cmd_size field, as well as having a constructor / destructor
Christoph> for it.

Nice!

A couple of minor nits:

+static void *alloc_request_size(gfp_t gfp_mask, void *data)

I like alloc_request_simple() but alloc_request_size() seems a bit
contrived. _reserve? _extra? _special? Don't have any good suggestions,
I'm afraid.

Also a bit heavy on the else brackets a couple of places. But no biggie.
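
(For reference, the blk-mq side being mirrored here: drivers reserve
per-request data by setting cmd_size in the tag set and reach it with
blk_mq_rq_to_pdu().  The driver structure below is illustrative:)

#include <linux/blk-mq.h>

struct my_cmd {				/* illustrative per-request data */
	int status;
};

static void my_init_tag_set(struct blk_mq_tag_set *set)
{
	/* blk-mq allocates this much extra space behind each request */
	set->cmd_size = sizeof(struct my_cmd);
}

static void my_handle_rq(struct request *rq)
{
	/* the driver data lives directly behind struct request */
	struct my_cmd *cmd = blk_mq_rq_to_pdu(rq);

	cmd->status = 0;
}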

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 04/18] block: simplify blk_init_allocated_queue

2017-01-25 Thread Martin K. Petersen
> "Christoph" == Christoph Hellwig  writes:

Christoph> Return an errno value instead of the passed in queue so that
Christoph> the callers don't have to keep track of two queues, and move
Christoph> the assignment of the request_fn and lock to the caller as
Christoph> passing them as argument doesn't simplify anything.  While
Christoph> we're at it also remove two pointless NULL assignments, given
Christoph> that the request structure is zeroed on allocation.

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 03/18] block: fix elevator init check

2017-01-25 Thread Martin K. Petersen
> "Christoph" == Christoph Hellwig  writes:

Christoph> We can't initialize the elevator fields for flushes as flushes
Christoph> share space in struct request with the elevator data.  But
Christoph> currently we can't commnicate that a request is a flush

communicate

Christoph> through blk_get_request as we can only pass READ or WRITE,
Christoph> and the low-level code looks at the possible NULL bio to
Christoph> check for a flush.

Christoph> Fix this by allowing to pass any block op and flags, and by
Christoph> checking for the flush flags in __get_request.

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 02/18] md: cleanup bio op / flags handling in raid1_write_request

2017-01-25 Thread Martin K. Petersen
> "Christoph" == Christoph Hellwig  writes:

Christoph> No need for the local variables, the bio is still live and we
Christoph> can just assign the bits we want directly.  Makes me wonder
Christoph> why we can't assign all the bio flags to start with.

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 01/18] block: add a op_is_flush helper

2017-01-25 Thread Martin K. Petersen
> "Christoph" == Christoph Hellwig  writes:

Christoph> This centralizes the checks for bios that need to go into
Christoph> the flush state machine.
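
(Presumably the helper reduces to a mask check on the flush-related flags,
along these lines:)

static inline bool op_is_flush(unsigned int op)
{
	return op & (REQ_FUA | REQ_PREFLUSH);
}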

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


[PATCH v3 05/10] sbitmap: add helpers for dumping to a seq_file

2017-01-25 Thread Omar Sandoval
From: Omar Sandoval 

This is useful debugging information that will be used in the blk-mq
debugfs directory.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Omar Sandoval 
---
Jens took offense to me making the bitmap dumps binary, so this turns them into
nicely formatted hex dumps, like this one:

# cat /sys/kernel/debug/block/nvme0n1/mq/0/tags_bitmap
00000000: 0000 0000 0000 0000 0000 0000 0000 0000
00000010: 0000 2000 0000 0000 0000 0000 0000 0000
00000020: 0000 0000 0000 0000 0000 0000 0000 0000
00000030: 0000 0000 0000 0000 0000 0000 0000 0000
00000040: 0000 0000 0000 0000 0000 0000 0000 0000
00000050: 0000 0000 0000 0000 0000 0000 0000 0000
00000060: 0000 0000 0000 0000 0000 0000 0000 0000
00000070: 0000 0000 0000 0000 0000 0000 0000 0000

I decided to go with this format because 1) it's consistent with
print_hex_dump() already in the kernel, and 2) it's much easier to read
than a string of bits or hex values. In the example above, it's easy to
see that byte 0x12 is set to 0x20 (and do some arithmetic from there to
figure out that the stuck bit is 149).
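
(In code form, the arithmetic for recovering a bit number from the dump --
a small illustrative helper:)

/* low bits of each word are emitted first, so bit numbering is linear */
static unsigned int bit_from_dump(unsigned int byte_offset, unsigned char val)
{
	/* for the example above: 0x12 * 8 + ctz(0x20) = 144 + 5 = 149 */
	return byte_offset * 8 + __builtin_ctz(val);
}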

 include/linux/sbitmap.h | 30 ++++++++++
 lib/sbitmap.c           | 91 +++++++++++++++++++++++++++++++++
 2 files changed, 121 insertions(+)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index f017fd6e69c4..d4e0a204c118 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -259,6 +259,26 @@ static inline int sbitmap_test_bit(struct sbitmap *sb, 
unsigned int bitnr)
 unsigned int sbitmap_weight(const struct sbitmap *sb);
 
 /**
+ * sbitmap_show() - Dump &struct sbitmap information to a &struct seq_file.
+ * @sb: Bitmap to show.
+ * @m: struct seq_file to write to.
+ *
+ * This is intended for debugging. The format may change at any time.
+ */
+void sbitmap_show(struct sbitmap *sb, struct seq_file *m);
+
+/**
+ * sbitmap_bitmap_show() - Write a hex dump of a &struct sbitmap to a &struct
+ * seq_file.
+ * @sb: Bitmap to show.
+ * @m: struct seq_file to write to.
+ *
+ * This is intended for debugging. The output isn't guaranteed to be internally
+ * consistent.
+ */
+void sbitmap_bitmap_show(struct sbitmap *sb, struct seq_file *m);
+
+/**
  * sbitmap_queue_init_node() - Initialize a &struct sbitmap_queue on a specific
  * memory node.
  * @sbq: Bitmap queue to initialize.
@@ -370,4 +390,14 @@ static inline struct sbq_wait_state *sbq_wait_ptr(struct 
sbitmap_queue *sbq,
  */
 void sbitmap_queue_wake_all(struct sbitmap_queue *sbq);
 
+/**
+ * sbitmap_queue_show() - Dump &struct sbitmap_queue information to a &struct
+ * seq_file.
+ * @sbq: Bitmap queue to show.
+ * @m: struct seq_file to write to.
+ *
+ * This is intended for debugging. The format may change at any time.
+ */
+void sbitmap_queue_show(struct sbitmap_queue *sbq, struct seq_file *m);
+
 #endif /* __LINUX_SCALE_BITMAP_H */
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 8f5c3b268c77..014913b5407a 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -17,6 +17,7 @@
 
 #include <linux/random.h>
 #include <linux/sbitmap.h>
+#include <linux/seq_file.h>
 
 int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
  gfp_t flags, int node)
@@ -180,6 +181,62 @@ unsigned int sbitmap_weight(const struct sbitmap *sb)
 }
 EXPORT_SYMBOL_GPL(sbitmap_weight);
 
+void sbitmap_show(struct sbitmap *sb, struct seq_file *m)
+{
+   seq_printf(m, "depth=%u\n", sb->depth);
+   seq_printf(m, "weight=%u\n", sbitmap_weight(sb));
+   seq_printf(m, "bits_per_word=%u\n", 1U << sb->shift);
+   seq_printf(m, "map_nr=%u\n", sb->map_nr);
+}
+EXPORT_SYMBOL_GPL(sbitmap_show);
+
+static inline void emit_byte(struct seq_file *m, unsigned int offset, u8 byte)
+{
+   if ((offset & 0xf) == 0) {
+   if (offset != 0)
+   seq_putc(m, '\n');
+   seq_printf(m, "%08x:", offset);
+   }
+   if ((offset & 0x1) == 0)
+   seq_putc(m, ' ');
+   seq_printf(m, "%02x", byte);
+}
+
+void sbitmap_bitmap_show(struct sbitmap *sb, struct seq_file *m)
+{
+   u8 byte = 0;
+   unsigned int byte_bits = 0;
+   unsigned int offset = 0;
+   int i;
+
+   for (i = 0; i < sb->map_nr; i++) {
+   unsigned long word = READ_ONCE(sb->map[i].word);
+   unsigned int word_bits = READ_ONCE(sb->map[i].depth);
+
+   while (word_bits > 0) {
+   unsigned int bits = min(8 - byte_bits, word_bits);
+
+   byte |= (word & (BIT(bits) - 1)) << byte_bits;
+   byte_bits += bits;
+   if (byte_bits == 8) {
+   emit_byte(m, offset, byte);
+   byte = 0;
+   byte_bits = 0;
+   offset++;
+   }
+   word >>= bits;
+   word_bits -= bits;
+   }
+   }
+   if (byte_bits) {
+   emit_byte(m, offset, byte);
+   offset++;
+   }

Re: [PATCH] queue stall with blk-mq-sched

2017-01-25 Thread Jens Axboe
On 01/25/2017 10:42 AM, Jens Axboe wrote:
> On 01/25/2017 10:03 AM, Jens Axboe wrote:
>> On 01/25/2017 09:57 AM, Hannes Reinecke wrote:
>>> On 01/25/2017 04:52 PM, Jens Axboe wrote:
 On 01/25/2017 04:10 AM, Hannes Reinecke wrote:
>>> [ .. ]
> Bah.
>
> Not quite. I'm still seeing some queues with state 'restart'.
>
> I've found that I need another patch on top of that:
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e872555..edcbb44 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -753,8 +754,10 @@ static void blk_mq_timeout_work(struct work_struct
> *work)
>
> queue_for_each_hw_ctx(q, hctx, i) {
> /* the hctx may be unmapped, so check it here */
> -   if (blk_mq_hw_queue_mapped(hctx))
> +   if (blk_mq_hw_queue_mapped(hctx)) {
> blk_mq_tag_idle(hctx);
> +   blk_mq_sched_restart(hctx);
> +   }
> }
> }
> blk_queue_exit(q);
>
>
> Reasoning is that in blk_mq_get_tag() we might end up scheduling the
> request on another hctx, but the original hctx might still have the
> SCHED_RESTART bit set.
> Which will never be cleared as we complete the request on a different hctx,
> so anything we do on the end_request side won't do us any good.

 I think you are right, it'll potentially trigger with shared tags and
 multiple hardware queues. I'll debug this today and come up with a
 decent fix.

 I committed the previous patch, fwiw.

>>> THX.
>>>
>>> The above patch _does_ help in the sense that my testcase now completes 
>>> without stalls. And I even get a decent performance with the mq-sched 
>>> fixes: 82k IOPs sequential read with mq-deadline as compared to 44k IOPs 
>>> when running without I/O scheduling.
>>> Still some way off from the 132k IOPs I'm getting with CFQ, but we're 
>>> getting there.
>>>
>>> However, I do get a noticeable stall during the stonewall sequence 
>>> before the timeout handler kicks in, so there must be a better way of 
>>> handling this.
>>>
>>> But nevertheless, thanks for all your work here.
>>> Very much appreciated.
>>
>> Yeah, the fix isn't really a fix, unless you are willing to tolerate
>> potentially tens of seconds of extra latency until we idle it out :-)
>>
>> So we can't use the un-idling for this, but we can track it on the
>> shared state, which is the tags. The problem isn't that we are
>> switching to a new hardware queue, it's if we mark the hardware queue
>> as restart AND it has nothing pending. In that case, we'll never
>> get it restarted, since IO completion is what restarts it.
>>
>> I need to handle that case separately. Currently testing a patch, I
>> should have something for you to test later today.
> 
> Can you try this one?

And another variant, this one should be better in that it should result
in less queue runs and get better merging. Hope it works with your
stalls as well.


diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index d05061f..6a1656d 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -300,6 +300,34 @@ bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx 
*hctx, struct request *rq)
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_bypass_insert);
 
+static void blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
+{
+   if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
+   clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+   if (blk_mq_hctx_has_pending(hctx))
+   blk_mq_run_hw_queue(hctx, true);
+   }
+}
+
+void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx)
+{
+   unsigned int i;
+
+   if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
+   blk_mq_sched_restart_hctx(hctx);
+   else {
+   struct request_queue *q = hctx->queue;
+
+   if (!test_bit(QUEUE_FLAG_RESTART, &q->queue_flags))
+   return;
+
+   clear_bit(QUEUE_FLAG_RESTART, &q->queue_flags);
+
+   queue_for_each_hw_ctx(q, hctx, i)
+   blk_mq_sched_restart_hctx(hctx);
+   }
+}
+
 static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
   struct blk_mq_hw_ctx *hctx,
   unsigned int hctx_idx)
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 6b465bc..becbc78 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -19,6 +19,7 @@ bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, 
struct request *rq);
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio);
 bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
 bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request 
*rq);
+void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx 

Re: [RFC PATCH v2 0/2] block: fix backing_dev_info lifetime

2017-01-25 Thread Dan Williams
On Mon, Jan 23, 2017 at 1:17 PM, Thiago Jung Bauermann
 wrote:
> Hello Dan,
>
> On Friday, January 6, 2017 at 17:02:51 BRST, Dan Williams wrote:
>> v1 of these changes [1] was a one line change to bdev_get_queue() to
>> prevent a shutdown crash when del_gendisk() races the final
>> __blkdev_put().
>>
>> While it is known at del_gendisk() time that the queue is still alive,
>> Jan Kara points to other paths [2] that are racing __blkdev_put() where
>> the assumption that ->bd_queue, or inode->i_wb is valid does not hold.
>>
>> Fix that broken assumption, make it the case that if you have a live
>> block_device, or block_device-inode that the corresponding queue and
>> inode-write-back data is still valid.
>>
>> These changes survive a run of the libnvdimm unit test suite which puts
>> some stress on the block_device shutdown path.
>
> I realize that the kernel test robot found problems with this series, but FWIW
> it fixes the bug mentioned in [2].
>

Thanks for the test result. I might take a look at cleaning up the
test robot reports and resubmitting this approach unless Jan beats me
to the punch with his backing_dev_info lifetime change patches.

>> [2]: http://www.spinics.net/lists/linux-fsdevel/msg105153.html


Re: [Nbd] [PATCH 4/4] nbd: add a nbd-control interface

2017-01-25 Thread Eric Blake
On 01/25/2017 03:30 PM, Greg KH wrote:
>> Given (because of NBD_DO_IT) we need an ioctl anyway, and we have
>> an ioctl that isn't going to go away, it would seem better if possible
>> to stick with ioctls, and not introduce either a dependency
>> on netlink (which would presumably bloat static binaries that
>> are used early in the boot process). Personally I'd have thought
>> adding a new NBD ioctl (or extending an existing one) would be
>> less entropy than adding a new char device.
> 
> Why can't you just do this on any existing nbd block device with an
> ioctl to it?  No need to have to do it on an out-of-band char device
> node, right?

How do you get an fd to an existing nbd block device?  Your intent is to
use an ioctl to request creating/opening a new nbd device that no one
else is using; opening an existing device in order to send that ioctl
may have negative ramifications to the actual user of that existing
device, if not permissions issues that prevent the open from even
happening.  Having a separate control fd makes it obvious that you are
asking for a new nbd device, and don't want to stomp on any existing
devices.
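
(Compare the loop driver, which already implements exactly this pattern:
/dev/loop-control hands out a free minor without opening any in-use device:)

#include <fcntl.h>
#include <linux/loop.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	int ctrl = open("/dev/loop-control", O_RDWR);

	if (ctrl < 0)
		return 1;
	/* returns the number of the first unused loop device */
	int nr = ioctl(ctrl, LOOP_CTL_GET_FREE);
	if (nr >= 0)
		printf("/dev/loop%d is free\n", nr);
	close(ctrl);
	return 0;
}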

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Nbd] [PATCH 4/4] nbd: add a nbd-control interface

2017-01-25 Thread Greg KH
On Wed, Jan 25, 2017 at 06:25:11PM +0000, Alex Bligh wrote:
> 
> > On 25 Jan 2017, at 16:48, Alex Gartrell  wrote:
> > 
> > 
> > If nbd were *all* netlink I think that that'd be fine, but you'd have
> > problems implementing the NBD_DOIT function in that fashion.  So I'd
> > rather stick to the char device ioctl thing because it's more
> > consistent with the old NBD stuff as well as the loop device stuff.
> 
> I spend most of my time looking at the userspace side of NBD so
> apologies if this is off base.
> 
> Given (because of NBD_DO_IT) we need an ioctl anyway, and we have
> an ioctl that isn't going to go away, it would seem better if possible
> to stick with ioctls, and not introduce either a dependency
> on netlink (which would presumably bloat static binaries that
> are used early in the boot process). Personally I'd have thought
> adding a new NBD ioctl (or extending an existing one) would be
> less entropy than adding a new char device.

Why can't you just do this on any existing nbd block device with an
ioctl to it?  No need to have to do it on an out-of-band char device
node, right?

thanks,

greg k-h


Re: [Lsf-pc] [LSF/MM TOPIC] block level event logging for storage media management

2017-01-25 Thread Oleg Drokin

On Jan 25, 2017, at 4:56 AM, Jan Kara wrote:

> On Tue 24-01-17 15:18:57, Oleg Drokin wrote:
>> 
>> On Jan 23, 2017, at 2:27 AM, Dan Williams wrote:
>> 
>>> [ adding Oleg ]
>>> 
>>> On Sun, Jan 22, 2017 at 10:00 PM, Song Liu  wrote:
 Hi Dan,
 
 I think the block level event log is more like a log-only system. When
 an event
 happens,  it is not necessary to take immediate action. (I guess this is 
 different
 to bad block list?).
 
 I would hope the event log will track more information. Some of these
 individual events may not be very interesting, for example, soft errors or
 latency outliers.
 However, when we gather event logs for a fleet of devices, these "soft
 events" may become valuable for health monitoring.
>>> 
>>> I'd be interested in this. It sounds like you're trying to fill a gap
>>> between tracing and console log messages which I believe others have
>>> encountered as well.
>> 
>> We have a somewhat similar problem in Lustre and I guess it's not
>> just Lustre.  Currently there are all sorts of conditional debug code all
>> over the place that goes to the console and when you enable it for
>> anything verbose, you quickly overflow your dmesg buffer no matter the
>> size, that might be mostly ok for local "block level" stuff, but once you
>> become distributed, it starts to be a mess and once you get to be super
>> large it worsens even more since you need to somehow coordinate data from
>> multiple nodes, ensure all of it is not lost and still you don't end up
>> using a lot of it since only a few nodes end up being useful.  (I don't
>> know how NFS people manage to debug complicated issues using just this,
>> could not be super easy).
>> 
>> Having some sort of a buffer of a (potentially very) large size that
>> could store the data until it's needed, or be eagerly polled by some
>> daemon for storage, would help (especially when you expect a lot of data
>> that definitely won't fit in RAM).
>> 
>> Tracepoints have the buffer and the daemon, but creating new messages is
>> very cumbersome, so converting every debug message into one does not look
>> very feasible.  Also it's convenient to have "event masks" one wants
>> logged that I don't think you could do with tracepoints.
> 
> So creating trace points IMO isn't that cumbersome. I agree that converting
> hundreds or thousands of debug printks into tracepoints is a pain in the
> ass but still it is doable. WRT filtering, you can enable each tracepoint
> individually. Granted that is not exactly the 'event mask' feature you ask
> about but that can be easily scripted in userspace if you give some
> structure to tracepoint names. Finally tracepoints provide a fine grained
> control you never get with printk - e.g. you can make a tracepoint trigger
> only if a specific inode is involved with trace filters which greatly reduces
> the amount of output.
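
(For concreteness, a single such conversion looks roughly like this --
illustrative names, omitting the usual trace-header boilerplate:)

#include <linux/tracepoint.h>

TRACE_EVENT(myfs_debug_read,
	TP_PROTO(const struct inode *inode, size_t len),
	TP_ARGS(inode, len),
	TP_STRUCT__entry(
		__field(unsigned long, ino)
		__field(size_t, len)
	),
	TP_fast_assign(
		__entry->ino = inode->i_ino;
		__entry->len = len;
	),
	TP_printk("ino=%lu len=%zu", __entry->ino, __entry->len)
);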

Oh, I am not dissing tracepoints, don't get me wrong, they add valuable things
at a fine-grained level when you have the necessary details.
The problem is that sometimes there are bugs where you don't have enough
knowledge beforehand, so you cannot do fine-grained debugging.
Think of a 10,000-node cluster (heck, make it even 100 or probably even 10)
with a report of "when running a moderately sized job, there's a
hang/something weird/some unexpected data corruption" that does not occur
when run on a single node. So often what you resort to is the shotgun
approach, where you enable all the debug you could everywhere (or selective,
like "everything in ldlm and everything rpc related"), then run the job for
however long it takes to reproduce, and then once reproduced, you sift
through those logs reconstructing the picture back together, only to
discover there was this weird race on one of the clients, only when some
lock was contended but then the grant RPC and some userspace action
coincided, or some such.
The dev_dbg() and nfs's /proc/sys/sunrpc/*debug are somewhat similar, only
they dump to dmesg, which is quite limited in buffer size, adds huge delays
if it goes out to some slow console, wipes other potentially useful messages
from the buffer in the process, and such.

I guess you could script enabling tracepoints with a pattern in their name
too, but then there's this pain in the ass of converting:
$ git grep CERROR  drivers/staging/lustre/ | wc -l
1069
$ git grep CDEBUG  drivers/staging/lustre/ | wc -l
1140

messages, AND there's also this thing that I do want many of those output to
the console (because they are important enough) and to the buffer (so I can
see them relative to other debug messages I do not want to go to the
console).

If tracepoints could be extended to enable that much - I'd be a super happy
camper, of course.
Sure, you cannot just make a macro that wraps the whole print into a
tracepoint, but that would be a stupid tracepoint with no finegrained
control whatsoever, but perhaps we can do name arguments or some such so 

Re: [Nbd] [PATCH 4/4] nbd: add a nbd-control interface

2017-01-25 Thread Alex Bligh

> On 25 Jan 2017, at 16:48, Alex Gartrell  wrote:
> 
> 
> If nbd were *all* netlink I think that that'd be fine, but you'd have
> problems implementing the NBD_DOIT function in that fashion.  So I'd
> rather stick to the char device ioctl thing because it's more
> consistent with the old NBD stuff as well as the loop device stuff.

I spend most of my time looking at the userspace side of NBD so
apologies if this is off base.

Given (because of NBD_DO_IT) we need an ioctl anyway, and we have
an ioctl that isn't going to go away, it would seem better if possible
to stick with ioctls, and not introduce either a dependency
on netlink (which would presumably bloat static binaries that
are used early in the boot process). Personally I'd have thought
adding a new NBD ioctl (or extending an existing one) would be
less entropy than adding a new char device.

-- 
Alex Bligh






Re: [Nbd] [PATCH 4/4] nbd: add a nbd-control interface

2017-01-25 Thread Paul Clements
On Wed, Jan 25, 2017 at 9:23 AM, Arnd Bergmann  wrote:
>
> They all have their own set of problems, but the needs of nbd as a network
> storage interface seem most closely resemble what we have for other network
> related interfaces, where we typically use netlink to do the setup, see e.g.
> macvtap as an example for a network chardev interface.

But nbd is a virtual block device, first and foremost. That means it
is very similar to both loop (Pavel began nbd as a copy of loop, I
believe) and device mapper.

Also, it uses ioctls for all configuration and administration currently.

--
Paul


Re: [Nbd] [PATCH 4/4] nbd: add a nbd-control interface

2017-01-25 Thread Wouter Verhelst
On Wed, Jan 25, 2017 at 04:48:27PM +0000, Alex Gartrell wrote:
> On 1/25/17, 6:23 AM, "arndbergm...@gmail.com on behalf of Arnd Bergmann" 
>  wrote:
> > We have multiple established ways to deal with this kind of problem, the 
> > most
> > common ones being a separate chardev as you propose, a sysfs interface and
> > netlink.
> > 
> > They all have their own set of problems, but the needs of nbd as a network
> > storage interface seem most closely resemble what we have for other network
> > related interfaces, where we typically use netlink to do the setup, see e.g.
> > macvtap as an example for a network chardev interface.
> > 
> > Can you have a look if that would solve your problem more nicely?
> 
> FWIW, I'm the one consuming this nbd stuff at Facebook and bringing
> netlink into this would be a huge pain for me.  It's very weird to
> need to pull sockets in and then drop back to ioctls from a design
> perspective, and future consumers of this API would be sad as well.

As the person who's been maintaining the official userspace nbd client for 15
years
now, I have to concur. NBD has had an API that deals with /dev/nbdX
since forever, adding other APIs to that just feels wrong.

> This is compounded, I think, by the fact that the breadth of
> functionality here doesn't really merit a shared library for everyone
> to use -- could you imagine libnbdcontrol, which exposes a single
> "get_next_device" function?

Please no :)

> If nbd were *all* netlink I think that that'd be fine, but you'd have
> problems implementing the NBD_DOIT function in that fashion.  So I'd
> rather stick to the char device ioctl thing because it's more
> consistent with the old NBD stuff as well as the loop device stuff.

Indeed.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


Re: [PATCH] queue stall with blk-mq-sched

2017-01-25 Thread Jens Axboe
On 01/25/2017 10:03 AM, Jens Axboe wrote:
> On 01/25/2017 09:57 AM, Hannes Reinecke wrote:
>> On 01/25/2017 04:52 PM, Jens Axboe wrote:
>>> On 01/25/2017 04:10 AM, Hannes Reinecke wrote:
>> [ .. ]
 Bah.

 Not quite. I'm still seeing some queues with state 'restart'.

 I've found that I need another patch on top of that:

 diff --git a/block/blk-mq.c b/block/blk-mq.c
 index e872555..edcbb44 100644
 --- a/block/blk-mq.c
 +++ b/block/blk-mq.c
 @@ -753,8 +754,10 @@ static void blk_mq_timeout_work(struct work_struct
 *work)

 queue_for_each_hw_ctx(q, hctx, i) {
 /* the hctx may be unmapped, so check it here */
 -   if (blk_mq_hw_queue_mapped(hctx))
 +   if (blk_mq_hw_queue_mapped(hctx)) {
 blk_mq_tag_idle(hctx);
 +   blk_mq_sched_restart(hctx);
 +   }
 }
 }
 blk_queue_exit(q);


 Reasoning is that in blk_mq_get_tag() we might end up scheduling the
 request on another hctx, but the original hctx might still have the
 SCHED_RESTART bit set.
 Which will never be cleared as we complete the request on a different hctx,
 so anything we do on the end_request side won't do us any good.
>>>
>>> I think you are right, it'll potentially trigger with shared tags and
>>> multiple hardware queues. I'll debug this today and come up with a
>>> decent fix.
>>>
>>> I committed the previous patch, fwiw.
>>>
>> THX.
>>
>> The above patch _does_ help in the sense that my testcase now completes 
>> without stalls. And I even get a decent performance with the mq-sched 
>> fixes: 82k IOPs sequential read with mq-deadline as compared to 44k IOPs 
>> when running without I/O scheduling.
>> Still some way off from the 132k IOPs I'm getting with CFQ, but we're 
>> getting there.
>>
>> However, I do get a noticeable stall during the stonewall sequence 
>> before the timeout handler kicks in, so there must be a better way of 
>> handling this.
>>
>> But nevertheless, thanks for all your work here.
>> Very much appreciated.
> 
> Yeah, the fix isn't really a fix, unless you are willing to tolerate
> potentially tens of seconds of extra latency until we idle it out :-)
> 
> So we can't use the un-idling for this, but we can track it on the
> shared state, which is the tags. The problem isn't that we are
> switching to a new hardware queue, it's if we mark the hardware queue
> as restart AND it has nothing pending. In that case, we'll never
> get it restarted, since IO completion is what restarts it.
> 
> I need to handle that case separately. Currently testing a patch, I
> should have something for you to test later today.

Can you try this one?


diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index d05061f..6a1656d 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -300,6 +300,34 @@ bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx 
*hctx, struct request *rq)
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_bypass_insert);
 
+static void blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
+{
+   if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
+   clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+   if (blk_mq_hctx_has_pending(hctx))
+   blk_mq_run_hw_queue(hctx, true);
+   }
+}
+
+void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx)
+{
+   unsigned int i;
+
+   if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
+   blk_mq_sched_restart_hctx(hctx);
+   else {
+   struct request_queue *q = hctx->queue;
+
+   if (!test_bit(QUEUE_FLAG_RESTART, &q->queue_flags))
+   return;
+
+   clear_bit(QUEUE_FLAG_RESTART, &q->queue_flags);
+
+   queue_for_each_hw_ctx(q, hctx, i)
+   blk_mq_sched_restart_hctx(hctx);
+   }
+}
+
 static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
   struct blk_mq_hw_ctx *hctx,
   unsigned int hctx_idx)
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 6b465bc..becbc78 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -19,6 +19,7 @@ bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, 
struct request *rq);
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio);
 bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
 bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request 
*rq);
+void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
 void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
@@ -123,11 +124,6 @@ blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, 
struct request *rq)
BUG_ON(rq->internal_tag 

[PATCH 08/18] scsi_dh_rdac: switch to scsi_execute_req_flags()

2017-01-25 Thread Christoph Hellwig
From: Hannes Reinecke 

Switch to scsi_execute_req_flags() and scsi_get_vpd_page() instead of
open-coding it.  Using scsi_execute_req_flags() will set REQ_QUIET and
REQ_PREEMPT, but this is okay as we're evaluating the errors anyway and
should be able to send the command even if the device is quiesced.

Signed-off-by: Hannes Reinecke 
Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/device_handler/scsi_dh_rdac.c | 174 ++++++++----------------------
 1 file changed, 51 insertions(+), 123 deletions(-)

diff --git a/drivers/scsi/device_handler/scsi_dh_rdac.c 
b/drivers/scsi/device_handler/scsi_dh_rdac.c
index 00d9c32..b64eaae 100644
--- a/drivers/scsi/device_handler/scsi_dh_rdac.c
+++ b/drivers/scsi/device_handler/scsi_dh_rdac.c
@@ -205,7 +205,6 @@ struct rdac_dh_data {
 #define RDAC_NON_PREFERRED 1
charpreferred;
 
-   unsigned char   sense[SCSI_SENSE_BUFFERSIZE];
union   {
struct c2_inquiry c2;
struct c4_inquiry c4;
@@ -262,40 +261,12 @@ do { \
sdev_printk(KERN_INFO, sdev, RDAC_NAME ": " f "\n", ## arg); \
 } while (0);
 
-static struct request *get_rdac_req(struct scsi_device *sdev,
-   void *buffer, unsigned buflen, int rw)
+static unsigned int rdac_failover_get(struct rdac_controller *ctlr,
+ struct list_head *list,
+ unsigned char *cdb)
 {
-   struct request *rq;
-   struct request_queue *q = sdev->request_queue;
-
-   rq = blk_get_request(q, rw, GFP_NOIO);
-
-   if (IS_ERR(rq)) {
-   sdev_printk(KERN_INFO, sdev,
-   "get_rdac_req: blk_get_request failed.\n");
-   return NULL;
-   }
-   blk_rq_set_block_pc(rq);
-
-   if (buflen && blk_rq_map_kern(q, rq, buffer, buflen, GFP_NOIO)) {
-   blk_put_request(rq);
-   sdev_printk(KERN_INFO, sdev,
-   "get_rdac_req: blk_rq_map_kern failed.\n");
-   return NULL;
-   }
-
-   rq->cmd_flags |= REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT |
-REQ_FAILFAST_DRIVER;
-   rq->retries = RDAC_RETRIES;
-   rq->timeout = RDAC_TIMEOUT;
-
-   return rq;
-}
-
-static struct request *rdac_failover_get(struct scsi_device *sdev,
-   struct rdac_dh_data *h, struct list_head *list)
-{
-   struct request *rq;
+   struct scsi_device *sdev = ctlr->ms_sdev;
+   struct rdac_dh_data *h = sdev->handler_data;
struct rdac_mode_common *common;
unsigned data_size;
struct rdac_queue_data *qdata;
@@ -332,27 +303,17 @@ static struct request *rdac_failover_get(struct 
scsi_device *sdev,
lun_table[qdata->h->lun] = 0x81;
}
 
-   /* get request for block layer packet command */
-   rq = get_rdac_req(sdev, &h->ctlr->mode_select, data_size, WRITE);
-   if (!rq)
-   return NULL;
-
/* Prepare the command. */
if (h->ctlr->use_ms10) {
-   rq->cmd[0] = MODE_SELECT_10;
-   rq->cmd[7] = data_size >> 8;
-   rq->cmd[8] = data_size & 0xff;
+   cdb[0] = MODE_SELECT_10;
+   cdb[7] = data_size >> 8;
+   cdb[8] = data_size & 0xff;
} else {
-   rq->cmd[0] = MODE_SELECT;
-   rq->cmd[4] = data_size;
+   cdb[0] = MODE_SELECT;
+   cdb[4] = data_size;
}
-   rq->cmd_len = COMMAND_SIZE(rq->cmd[0]);
-
-   rq->sense = h->sense;
-   memset(rq->sense, 0, SCSI_SENSE_BUFFERSIZE);
-   rq->sense_len = 0;
 
-   return rq;
+   return data_size;
 }
 
 static void release_controller(struct kref *kref)
@@ -400,46 +361,14 @@ static struct rdac_controller *get_controller(int index, 
char *array_name,
return ctlr;
 }
 
-static int submit_inquiry(struct scsi_device *sdev, int page_code,
- unsigned int len, struct rdac_dh_data *h)
-{
-   struct request *rq;
-   struct request_queue *q = sdev->request_queue;
-   int err = SCSI_DH_RES_TEMP_UNAVAIL;
-
-   rq = get_rdac_req(sdev, &h->inq, len, READ);
-   if (!rq)
-   goto done;
-
-   /* Prepare the command. */
-   rq->cmd[0] = INQUIRY;
-   rq->cmd[1] = 1;
-   rq->cmd[2] = page_code;
-   rq->cmd[4] = len;
-   rq->cmd_len = COMMAND_SIZE(INQUIRY);
-
-   rq->sense = h->sense;
-   memset(rq->sense, 0, SCSI_SENSE_BUFFERSIZE);
-   rq->sense_len = 0;
-
-   err = blk_execute_rq(q, NULL, rq, 1);
-   if (err == -EIO)
-   err = SCSI_DH_IO;
-
-   blk_put_request(rq);
-done:
-   return err;
-}
-
 static int get_lun_info(struct scsi_device *sdev, struct rdac_dh_data *h,
char *array_name, u8 *array_id)
 {
-   int err, i;
-   struct c8_inquiry *inqp;
+   

[PATCH 11/18] scsi: remove gfp_flags member in scsi_host_cmd_pool

2017-01-25 Thread Christoph Hellwig
When using the slab allocator we already decide at cache creation time if
an allocation comes from a GFP_DMA pool using the SLAB_CACHE_DMA flag,
and there is no point passing the kmalloc-family only GFP_DMA flag to
kmem_cache_alloc.  Drop all the infrastructure for doing so.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
---
 drivers/scsi/scsi.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 75455d4..0f93892 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -105,7 +105,6 @@ struct scsi_host_cmd_pool {
char*cmd_name;
char*sense_name;
unsigned intslab_flags;
-   gfp_t   gfp_mask;
 };
 
 static struct scsi_host_cmd_pool scsi_cmd_pool = {
@@ -118,7 +117,6 @@ static struct scsi_host_cmd_pool scsi_cmd_dma_pool = {
.cmd_name   = "scsi_cmd_cache(DMA)",
.sense_name = "scsi_sense_cache(DMA)",
.slab_flags = SLAB_HWCACHE_ALIGN|SLAB_CACHE_DMA,
-   .gfp_mask   = __GFP_DMA,
 };
 
 static DEFINE_MUTEX(host_cmd_pool_mutex);
@@ -156,12 +154,11 @@ scsi_host_alloc_command(struct Scsi_Host *shost, gfp_t 
gfp_mask)
struct scsi_host_cmd_pool *pool = shost->cmd_pool;
struct scsi_cmnd *cmd;
 
-   cmd = kmem_cache_zalloc(pool->cmd_slab, gfp_mask | pool->gfp_mask);
+   cmd = kmem_cache_zalloc(pool->cmd_slab, gfp_mask);
if (!cmd)
goto fail;
 
-   cmd->sense_buffer = kmem_cache_alloc(pool->sense_slab,
-gfp_mask | pool->gfp_mask);
+   cmd->sense_buffer = kmem_cache_alloc(pool->sense_slab, gfp_mask);
if (!cmd->sense_buffer)
goto fail_free_cmd;
 
@@ -327,10 +324,8 @@ scsi_alloc_host_cmd_pool(struct Scsi_Host *shost)
}
 
pool->slab_flags = SLAB_HWCACHE_ALIGN;
-   if (shost->unchecked_isa_dma) {
+   if (shost->unchecked_isa_dma)
pool->slab_flags |= SLAB_CACHE_DMA;
-   pool->gfp_mask = __GFP_DMA;
-   }
 
if (hostt->cmd_size)
hostt->cmd_pool = pool;
@@ -424,7 +419,6 @@ static void scsi_put_host_cmd_pool(struct Scsi_Host *shost)
  */
 int scsi_setup_command_freelist(struct Scsi_Host *shost)
 {
-   const gfp_t gfp_mask = shost->unchecked_isa_dma ? GFP_DMA : GFP_KERNEL;
struct scsi_cmnd *cmd;
 
spin_lock_init(&shost->free_list_lock);
@@ -437,7 +431,7 @@ int scsi_setup_command_freelist(struct Scsi_Host *shost)
/*
 * Get one backup command for this host.
 */
-   cmd = scsi_host_alloc_command(shost, gfp_mask);
+   cmd = scsi_host_alloc_command(shost, GFP_KERNEL);
if (!cmd) {
scsi_put_host_cmd_pool(shost);
shost->cmd_pool = NULL;
-- 
2.1.4



[PATCH 14/18] scsi: remove __scsi_alloc_queue

2017-01-25 Thread Christoph Hellwig
Instead do an internal export of __scsi_init_queue for the transport
classes that export BSG nodes.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Johannes Thumshirn 
---
 drivers/scsi/scsi_lib.c             | 19 +++++-----------
 drivers/scsi/scsi_transport_fc.c    |  6 ++----
 drivers/scsi/scsi_transport_iscsi.c |  3 ++-
 include/scsi/scsi_host.h            |  2 --
 include/scsi/scsi_transport.h       |  2 ++
 5 files changed, 12 insertions(+), 20 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 3d6b364..7950516 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2082,7 +2082,7 @@ static u64 scsi_calculate_bounce_limit(struct Scsi_Host 
*shost)
return bounce_limit;
 }
 
-static void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
+void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
 {
struct device *dev = shost->dma_dev;
 
@@ -2117,28 +2117,17 @@ static void __scsi_init_queue(struct Scsi_Host *shost, 
struct request_queue *q)
 */
blk_queue_dma_alignment(q, 0x03);
 }
-
-struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
-request_fn_proc *request_fn)
-{
-   struct request_queue *q;
-
-   q = blk_init_queue(request_fn, NULL);
-   if (!q)
-   return NULL;
-   __scsi_init_queue(shost, q);
-   return q;
-}
-EXPORT_SYMBOL(__scsi_alloc_queue);
+EXPORT_SYMBOL_GPL(__scsi_init_queue);
 
 struct request_queue *scsi_alloc_queue(struct scsi_device *sdev)
 {
struct request_queue *q;
 
-   q = __scsi_alloc_queue(sdev->host, scsi_request_fn);
+   q = blk_init_queue(scsi_request_fn, NULL);
if (!q)
return NULL;
 
+   __scsi_init_queue(sdev->host, q);
blk_queue_prep_rq(q, scsi_prep_fn);
blk_queue_unprep_rq(q, scsi_unprep_fn);
blk_queue_softirq_done(q, scsi_softirq_done);
diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 03577bd..afcedec 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -3776,7 +3776,7 @@ fc_bsg_hostadd(struct Scsi_Host *shost, struct 
fc_host_attrs *fc_host)
snprintf(bsg_name, sizeof(bsg_name),
 "fc_host%d", shost->host_no);
 
-   q = __scsi_alloc_queue(shost, bsg_request_fn);
+   q = blk_init_queue(bsg_request_fn, NULL);
if (!q) {
dev_err(dev,
"fc_host%d: bsg interface failed to initialize - no 
request queue\n",
@@ -3784,6 +3784,7 @@ fc_bsg_hostadd(struct Scsi_Host *shost, struct 
fc_host_attrs *fc_host)
return -ENOMEM;
}
 
+   __scsi_init_queue(shost, q);
err = bsg_setup_queue(dev, q, bsg_name, fc_bsg_dispatch,
 i->f->dd_bsg_size);
if (err) {
@@ -3831,12 +3832,13 @@ fc_bsg_rportadd(struct Scsi_Host *shost, struct 
fc_rport *rport)
if (!i->f->bsg_request)
return -ENOTSUPP;
 
-   q = __scsi_alloc_queue(shost, bsg_request_fn);
+   q = blk_init_queue(bsg_request_fn, NULL);
if (!q) {
dev_err(dev, "bsg interface failed to initialize - no request 
queue\n");
return -ENOMEM;
}
 
+   __scsi_init_queue(shost, q);
err = bsg_setup_queue(dev, q, NULL, fc_bsg_dispatch, i->f->dd_bsg_size);
if (err) {
dev_err(dev, "failed to setup bsg queue\n");
diff --git a/drivers/scsi/scsi_transport_iscsi.c 
b/drivers/scsi/scsi_transport_iscsi.c
index 42bca61..04ebe6e 100644
--- a/drivers/scsi/scsi_transport_iscsi.c
+++ b/drivers/scsi/scsi_transport_iscsi.c
@@ -1544,10 +1544,11 @@ iscsi_bsg_host_add(struct Scsi_Host *shost, struct 
iscsi_cls_host *ihost)
 
snprintf(bsg_name, sizeof(bsg_name), "iscsi_host%d", shost->host_no);
 
-   q = __scsi_alloc_queue(shost, bsg_request_fn);
+   q = blk_init_queue(bsg_request_fn, NULL);
if (!q)
return -ENOMEM;
 
+   __scsi_init_queue(shost, q);
ret = bsg_setup_queue(dev, q, bsg_name, iscsi_bsg_host_dispatch, 0);
if (ret) {
shost_printk(KERN_ERR, shost, "bsg interface failed to "
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 36680f1..f4964d7 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -826,8 +826,6 @@ extern void scsi_block_requests(struct Scsi_Host *);
 
 struct class_container;
 
-extern struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
-   void (*) (struct request_queue 
*));
 /*
  * These two functions are used to allocate and free a pseudo device
  * which will connect to the host adapter itself rather than any
diff --git a/include/scsi/scsi_transport.h b/include/scsi/scsi_transport.h
index 8129239..b6e07b5 

[PATCH 16/18] block/bsg: move queue creation into bsg_setup_queue

2017-01-25 Thread Christoph Hellwig
Simplify the boilerplate code needed for bsg nodes a bit.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
---
 block/bsg-lib.c                     | 21 ++++++++++---------
 drivers/scsi/scsi_transport_fc.c    | 36 ++++++--------------------
 drivers/scsi/scsi_transport_iscsi.c | 15 ++++---------
 include/linux/bsg-lib.h             |  5 ++---
 4 files changed, 25 insertions(+), 52 deletions(-)

diff --git a/block/bsg-lib.c b/block/bsg-lib.c
index 9d652a9..c74acf4 100644
--- a/block/bsg-lib.c
+++ b/block/bsg-lib.c
@@ -177,7 +177,7 @@ static int bsg_create_job(struct device *dev, struct 
request *req)
  *
  * Drivers/subsys should pass this to the queue init function.
  */
-void bsg_request_fn(struct request_queue *q)
+static void bsg_request_fn(struct request_queue *q)
__releases(q->queue_lock)
__acquires(q->queue_lock)
 {
@@ -214,24 +214,24 @@ void bsg_request_fn(struct request_queue *q)
put_device(dev);
spin_lock_irq(q->queue_lock);
 }
-EXPORT_SYMBOL_GPL(bsg_request_fn);
 
 /**
  * bsg_setup_queue - Create and add the bsg hooks so we can receive requests
  * @dev: device to attach bsg device to
- * @q: request queue setup by caller
  * @name: device to give bsg device
  * @job_fn: bsg job handler
  * @dd_job_size: size of LLD data needed for each job
- *
- * The caller should have setup the reuqest queue with bsg_request_fn
- * as the request_fn.
  */
-int bsg_setup_queue(struct device *dev, struct request_queue *q,
-   char *name, bsg_job_fn *job_fn, int dd_job_size)
+struct request_queue *bsg_setup_queue(struct device *dev, char *name,
+   bsg_job_fn *job_fn, int dd_job_size)
 {
+   struct request_queue *q;
int ret;
 
+   q = blk_init_queue(bsg_request_fn, NULL);
+   if (!q)
+   return ERR_PTR(-ENOMEM);
+
q->queuedata = dev;
q->bsg_job_size = dd_job_size;
q->bsg_job_fn = job_fn;
@@ -243,9 +243,10 @@ int bsg_setup_queue(struct device *dev, struct 
request_queue *q,
if (ret) {
printk(KERN_ERR "%s: bsg interface failed to "
   "initialize - register queue\n", dev->kobj.name);
-   return ret;
+   blk_cleanup_queue(q);
+   return ERR_PTR(ret);
}
 
-   return 0;
+   return q;
 }
 EXPORT_SYMBOL_GPL(bsg_setup_queue);
diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index afcedec..13dcb9b 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -3765,7 +3765,6 @@ fc_bsg_hostadd(struct Scsi_Host *shost, struct 
fc_host_attrs *fc_host)
struct device *dev = >shost_gendev;
struct fc_internal *i = to_fc_internal(shost->transportt);
struct request_queue *q;
-   int err;
char bsg_name[20];
 
fc_host->rqst_q = NULL;
@@ -3776,24 +3775,14 @@ fc_bsg_hostadd(struct Scsi_Host *shost, struct 
fc_host_attrs *fc_host)
snprintf(bsg_name, sizeof(bsg_name),
 "fc_host%d", shost->host_no);
 
-   q = blk_init_queue(bsg_request_fn, NULL);
-   if (!q) {
-   dev_err(dev,
-   "fc_host%d: bsg interface failed to initialize - no 
request queue\n",
-   shost->host_no);
-   return -ENOMEM;
-   }
-
-   __scsi_init_queue(shost, q);
-   err = bsg_setup_queue(dev, q, bsg_name, fc_bsg_dispatch,
-i->f->dd_bsg_size);
-   if (err) {
+   q = bsg_setup_queue(dev, bsg_name, fc_bsg_dispatch, i->f->dd_bsg_size);
+   if (IS_ERR(q)) {
dev_err(dev,
"fc_host%d: bsg interface failed to initialize - setup 
queue\n",
shost->host_no);
-   blk_cleanup_queue(q);
-   return err;
+   return PTR_ERR(q);
}
+   __scsi_init_queue(shost, q);
blk_queue_rq_timed_out(q, fc_bsg_job_timeout);
blk_queue_rq_timeout(q, FC_DEFAULT_BSG_TIMEOUT);
fc_host->rqst_q = q;
@@ -3825,27 +3814,18 @@ fc_bsg_rportadd(struct Scsi_Host *shost, struct 
fc_rport *rport)
struct device *dev = >dev;
struct fc_internal *i = to_fc_internal(shost->transportt);
struct request_queue *q;
-   int err;
 
rport->rqst_q = NULL;
 
if (!i->f->bsg_request)
return -ENOTSUPP;
 
-   q = blk_init_queue(bsg_request_fn, NULL);
-   if (!q) {
-   dev_err(dev, "bsg interface failed to initialize - no request 
queue\n");
-   return -ENOMEM;
-   }
-
-   __scsi_init_queue(shost, q);
-   err = bsg_setup_queue(dev, q, NULL, fc_bsg_dispatch, i->f->dd_bsg_size);
-   if (err) {
+   q = bsg_setup_queue(dev, NULL, fc_bsg_dispatch, i->f->dd_bsg_size);
+   if (IS_ERR(q)) {
dev_err(dev, "failed to setup bsg queue\n");
-   

[PATCH 18/18] block: don't assign cmd_flags in __blk_rq_prep_clone

2017-01-25 Thread Christoph Hellwig
These days we have the proper flags set since request allocation time.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
 block/blk-core.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 33c5d05e..6bf5ba0 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2983,7 +2983,6 @@ EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
 static void __blk_rq_prep_clone(struct request *dst, struct request *src)
 {
dst->cpu = src->cpu;
-   dst->cmd_flags = src->cmd_flags | REQ_NOMERGE;
dst->cmd_type = src->cmd_type;
dst->__sector = blk_rq_pos(src);
dst->__data_len = blk_rq_bytes(src);
-- 
2.1.4



[PATCH 10/18] scsi_dh_hp_sw: switch to scsi_execute_req_flags()

2017-01-25 Thread Christoph Hellwig
From: Hannes Reinecke 

Switch to scsi_execute_req_flags() instead of using the block interface
directly.  This will set REQ_QUIET and REQ_PREEMPT, but this is okay as
we're evaluating the errors anyway and should be able to send the command
even if the device is quiesced.

Signed-off-by: Hannes Reinecke 
Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/device_handler/scsi_dh_hp_sw.c | 222 +++++++++-----------------------
 1 file changed, 65 insertions(+), 157 deletions(-)

diff --git a/drivers/scsi/device_handler/scsi_dh_hp_sw.c 
b/drivers/scsi/device_handler/scsi_dh_hp_sw.c
index 308e871..be43c94 100644
--- a/drivers/scsi/device_handler/scsi_dh_hp_sw.c
+++ b/drivers/scsi/device_handler/scsi_dh_hp_sw.c
@@ -38,13 +38,10 @@
 #define HP_SW_PATH_PASSIVE 1
 
 struct hp_sw_dh_data {
-   unsigned char sense[SCSI_SENSE_BUFFERSIZE];
int path_state;
int retries;
int retry_cnt;
struct scsi_device *sdev;
-   activate_complete   callback_fn;
-   void*callback_data;
 };
 
 static int hp_sw_start_stop(struct hp_sw_dh_data *);
@@ -56,43 +53,34 @@ static int hp_sw_start_stop(struct hp_sw_dh_data *);
  *
  * Returns SCSI_DH_DEV_OFFLINED if the sdev is on the passive path
  */
-static int tur_done(struct scsi_device *sdev, unsigned char *sense)
+static int tur_done(struct scsi_device *sdev, struct hp_sw_dh_data *h,
+   struct scsi_sense_hdr *sshdr)
 {
-   struct scsi_sense_hdr sshdr;
-   int ret;
+   int ret = SCSI_DH_IO;
 
-   ret = scsi_normalize_sense(sense, SCSI_SENSE_BUFFERSIZE, &sshdr);
-   if (!ret) {
-   sdev_printk(KERN_WARNING, sdev,
-   "%s: sending tur failed, no sense available\n",
-   HP_SW_NAME);
-   ret = SCSI_DH_IO;
-   goto done;
-   }
-   switch (sshdr.sense_key) {
+   switch (sshdr->sense_key) {
case UNIT_ATTENTION:
ret = SCSI_DH_IMM_RETRY;
break;
case NOT_READY:
-   if ((sshdr.asc == 0x04) && (sshdr.ascq == 2)) {
+   if (sshdr->asc == 0x04 && sshdr->ascq == 2) {
/*
 * LUN not ready - Initialization command required
 *
 * This is the passive path
 */
-   ret = SCSI_DH_DEV_OFFLINED;
+   h->path_state = HP_SW_PATH_PASSIVE;
+   ret = SCSI_DH_OK;
break;
}
/* Fallthrough */
default:
sdev_printk(KERN_WARNING, sdev,
   "%s: sending tur failed, sense %x/%x/%x\n",
-  HP_SW_NAME, sshdr.sense_key, sshdr.asc,
-  sshdr.ascq);
+  HP_SW_NAME, sshdr->sense_key, sshdr->asc,
+  sshdr->ascq);
break;
}
-
-done:
return ret;
 }
 
@@ -105,131 +93,36 @@ static int tur_done(struct scsi_device *sdev, unsigned char *sense)
  */
 static int hp_sw_tur(struct scsi_device *sdev, struct hp_sw_dh_data *h)
 {
-   struct request *req;
-   int ret;
+   unsigned char cmd[6] = { TEST_UNIT_READY };
+   struct scsi_sense_hdr sshdr;
+   int ret = SCSI_DH_OK, res;
+   u64 req_flags = REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT |
+   REQ_FAILFAST_DRIVER;
 
 retry:
-   req = blk_get_request(sdev->request_queue, WRITE, GFP_NOIO);
-   if (IS_ERR(req))
-   return SCSI_DH_RES_TEMP_UNAVAIL;
-
-   blk_rq_set_block_pc(req);
-   req->cmd_flags |= REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT |
- REQ_FAILFAST_DRIVER;
-   req->cmd_len = COMMAND_SIZE(TEST_UNIT_READY);
-   req->cmd[0] = TEST_UNIT_READY;
-   req->timeout = HP_SW_TIMEOUT;
-   req->sense = h->sense;
-   memset(req->sense, 0, SCSI_SENSE_BUFFERSIZE);
-   req->sense_len = 0;
-
-   ret = blk_execute_rq(req->q, NULL, req, 1);
-   if (ret == -EIO) {
-   if (req->sense_len > 0) {
-   ret = tur_done(sdev, h->sense);
-   } else {
+	res = scsi_execute_req_flags(sdev, cmd, DMA_NONE, NULL, 0, &sshdr,
+				     HP_SW_TIMEOUT, HP_SW_RETRIES,
+				     NULL, req_flags, 0);
+	if (res) {
+		if (scsi_sense_valid(&sshdr))
+			ret = tur_done(sdev, h, &sshdr);
+   else {
sdev_printk(KERN_WARNING, sdev,
"%s: sending tur failed with %x\n",
-   HP_SW_NAME, req->errors);
+   HP_SW_NAME, res);
ret = SCSI_DH_IO;
}
} else {
h->path_state = HP_SW_PATH_ACTIVE;

[PATCH 12/18] scsi: respect unchecked_isa_dma for blk-mq

2017-01-25 Thread Christoph Hellwig
Currently blk-mq always allocates the sense buffer using a normal GFP_KERNEL
allocation.  Refactor the cmd pool code to split the cmd and sense allocation
and share the code to allocate the sense buffers as well as the sense buffer
slab caches between the legacy and blk-mq paths.

Note that this switches to lazy allocation of the sense slab caches - the
slab caches (not the actual allocations) won't be destroyed until the scsi
module is unloaded, instead of keeping track of the hosts using them.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
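
The scsi_lib.c half of the change is cut off below; as a rough sketch
(reconstructed from the call sites, so details may differ from the actual
patch), the shared sense-buffer helpers look like this:

	static struct kmem_cache *scsi_sense_cache;
	static struct kmem_cache *scsi_sense_isadma_cache;
	static DEFINE_MUTEX(scsi_sense_cache_mutex);

	static inline struct kmem_cache *
	scsi_select_sense_cache(struct Scsi_Host *shost)
	{
		/* pick the DMA-able cache for unchecked_isa_dma hosts */
		return shost->unchecked_isa_dma ?
			scsi_sense_isadma_cache : scsi_sense_cache;
	}

	unsigned char *scsi_alloc_sense_buffer(struct Scsi_Host *shost,
					       gfp_t gfp_mask, int numa_node)
	{
		return kmem_cache_alloc_node(scsi_select_sense_cache(shost),
					     gfp_mask, numa_node);
	}

	void scsi_free_sense_buffer(struct Scsi_Host *shost,
				    unsigned char *sense_buffer)
	{
		kmem_cache_free(scsi_select_sense_cache(shost), sense_buffer);
	}

	int scsi_init_sense_cache(struct Scsi_Host *shost)
	{
		int ret = 0;

		/* lazily create the slab cache on first host registration */
		if (scsi_select_sense_cache(shost))
			return 0;

		mutex_lock(&scsi_sense_cache_mutex);
		if (shost->unchecked_isa_dma) {
			scsi_sense_isadma_cache =
				kmem_cache_create("scsi_sense_cache(DMA)",
					SCSI_SENSE_BUFFERSIZE, 0,
					SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA,
					NULL);
			if (!scsi_sense_isadma_cache)
				ret = -ENOMEM;
		} else {
			scsi_sense_cache =
				kmem_cache_create("scsi_sense_cache",
					SCSI_SENSE_BUFFERSIZE, 0,
					SLAB_HWCACHE_ALIGN, NULL);
			if (!scsi_sense_cache)
				ret = -ENOMEM;
		}
		mutex_unlock(&scsi_sense_cache_mutex);
		return ret;
	}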
 drivers/scsi/hosts.c |  4 
 drivers/scsi/scsi.c  | 24 ---
 drivers/scsi/scsi_lib.c  | 62 +---
 drivers/scsi/scsi_priv.h |  5 
 4 files changed, 73 insertions(+), 22 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 258a3f9..6d29c4a 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -213,6 +213,10 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
goto fail;
}
 
+   error = scsi_init_sense_cache(shost);
+   if (error)
+   goto fail;
+
if (shost_use_blk_mq(shost)) {
error = scsi_mq_setup_tags(shost);
if (error)
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 0f93892..469aa0f 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -100,22 +100,18 @@ EXPORT_SYMBOL(scsi_sd_pm_domain);
 
 struct scsi_host_cmd_pool {
struct kmem_cache   *cmd_slab;
-   struct kmem_cache   *sense_slab;
	unsigned int		users;
	char			*cmd_name;
-	char			*sense_name;
	unsigned int		slab_flags;
 };
 
 static struct scsi_host_cmd_pool scsi_cmd_pool = {
.cmd_name   = "scsi_cmd_cache",
-   .sense_name = "scsi_sense_cache",
.slab_flags = SLAB_HWCACHE_ALIGN,
 };
 
 static struct scsi_host_cmd_pool scsi_cmd_dma_pool = {
.cmd_name   = "scsi_cmd_cache(DMA)",
-   .sense_name = "scsi_sense_cache(DMA)",
.slab_flags = SLAB_HWCACHE_ALIGN|SLAB_CACHE_DMA,
 };
 
@@ -136,7 +132,7 @@ scsi_host_free_command(struct Scsi_Host *shost, struct scsi_cmnd *cmd)
 
if (cmd->prot_sdb)
kmem_cache_free(scsi_sdb_cache, cmd->prot_sdb);
-   kmem_cache_free(pool->sense_slab, cmd->sense_buffer);
+   scsi_free_sense_buffer(shost, cmd->sense_buffer);
kmem_cache_free(pool->cmd_slab, cmd);
 }
 
@@ -158,7 +154,8 @@ scsi_host_alloc_command(struct Scsi_Host *shost, gfp_t gfp_mask)
if (!cmd)
goto fail;
 
-   cmd->sense_buffer = kmem_cache_alloc(pool->sense_slab, gfp_mask);
+   cmd->sense_buffer = scsi_alloc_sense_buffer(shost, gfp_mask,
+   NUMA_NO_NODE);
if (!cmd->sense_buffer)
goto fail_free_cmd;
 
@@ -171,7 +168,7 @@ scsi_host_alloc_command(struct Scsi_Host *shost, gfp_t gfp_mask)
return cmd;
 
 fail_free_sense:
-   kmem_cache_free(pool->sense_slab, cmd->sense_buffer);
+   scsi_free_sense_buffer(shost, cmd->sense_buffer);
 fail_free_cmd:
kmem_cache_free(pool->cmd_slab, cmd);
 fail:
@@ -301,7 +298,6 @@ scsi_find_host_cmd_pool(struct Scsi_Host *shost)
 static void
 scsi_free_host_cmd_pool(struct scsi_host_cmd_pool *pool)
 {
-   kfree(pool->sense_name);
kfree(pool->cmd_name);
kfree(pool);
 }
@@ -317,8 +313,7 @@ scsi_alloc_host_cmd_pool(struct Scsi_Host *shost)
return NULL;
 
pool->cmd_name = kasprintf(GFP_KERNEL, "%s_cmd", hostt->proc_name);
-   pool->sense_name = kasprintf(GFP_KERNEL, "%s_sense", hostt->proc_name);
-   if (!pool->cmd_name || !pool->sense_name) {
+   if (!pool->cmd_name) {
scsi_free_host_cmd_pool(pool);
return NULL;
}
@@ -357,12 +352,6 @@ scsi_get_host_cmd_pool(struct Scsi_Host *shost)
   pool->slab_flags, NULL);
if (!pool->cmd_slab)
goto out_free_pool;
-
-   pool->sense_slab = kmem_cache_create(pool->sense_name,
-SCSI_SENSE_BUFFERSIZE, 0,
-pool->slab_flags, NULL);
-   if (!pool->sense_slab)
-   goto out_free_slab;
}
 
pool->users++;
@@ -371,8 +360,6 @@ scsi_get_host_cmd_pool(struct Scsi_Host *shost)
	mutex_unlock(&host_cmd_pool_mutex);
return retval;
 
-out_free_slab:
-   kmem_cache_destroy(pool->cmd_slab);
 out_free_pool:
if (hostt->cmd_size) {
scsi_free_host_cmd_pool(pool);
@@ -398,7 +385,6 @@ static void scsi_put_host_cmd_pool(struct Scsi_Host *shost)
 
if (!--pool->users) {
kmem_cache_destroy(pool->cmd_slab);
-   

[PATCH 05/18] block: allow specifying size for extra command data

2017-01-25 Thread Christoph Hellwig
This mirrors the blk-mq capability to allocate extra driver-specific
data behind struct request by setting a cmd_size field, as well as having
a constructor / destructor for it.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
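
A request_fn driver would opt in roughly like this (sketch only: the
mydrv_* names are invented for illustration; blk_mq_rq_to_pdu() works here
because the extra q->cmd_size bytes sit directly behind struct request):

	struct mydrv_pdu {			/* invented per-request data */
		unsigned char cdb[16];
	};

	static void mydrv_request_fn(struct request_queue *q);	/* not shown */

	static int mydrv_init_rq(struct request_queue *q, struct request *rq,
				 gfp_t gfp)
	{
		struct mydrv_pdu *pdu = blk_mq_rq_to_pdu(rq);

		memset(pdu, 0, sizeof(*pdu));	/* one-time construction */
		return 0;
	}

	static void mydrv_exit_rq(struct request_queue *q, struct request *rq)
	{
		/* tear down whatever mydrv_init_rq() set up */
	}

	static int mydrv_setup_queue(struct request_queue *q)
	{
		q->cmd_size = sizeof(struct mydrv_pdu);
		q->init_rq_fn = mydrv_init_rq;
		q->exit_rq_fn = mydrv_exit_rq;
		q->request_fn = mydrv_request_fn;
		return blk_init_allocated_queue(q);
	}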
 block/blk-core.c   | 59 --
 block/blk-flush.c  |  5 ++---
 block/blk-sysfs.c  |  7 --
 include/linux/blkdev.h |  7 ++
 4 files changed, 61 insertions(+), 17 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 54b5512..7de7164 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -606,17 +606,41 @@ void blk_cleanup_queue(struct request_queue *q)
 EXPORT_SYMBOL(blk_cleanup_queue);
 
 /* Allocate memory local to the request queue */
-static void *alloc_request_struct(gfp_t gfp_mask, void *data)
+static void *alloc_request_simple(gfp_t gfp_mask, void *data)
 {
-   int nid = (int)(long)data;
-   return kmem_cache_alloc_node(request_cachep, gfp_mask, nid);
+   struct request_queue *q = data;
+
+   return kmem_cache_alloc_node(request_cachep, gfp_mask, q->node);
 }
 
-static void free_request_struct(void *element, void *unused)
+static void free_request_simple(void *element, void *data)
 {
kmem_cache_free(request_cachep, element);
 }
 
+static void *alloc_request_size(gfp_t gfp_mask, void *data)
+{
+   struct request_queue *q = data;
+   struct request *rq;
+
+   rq = kmalloc_node(sizeof(struct request) + q->cmd_size, gfp_mask,
+   q->node);
+   if (rq && q->init_rq_fn && q->init_rq_fn(q, rq, gfp_mask) < 0) {
+   kfree(rq);
+   rq = NULL;
+   }
+   return rq;
+}
+
+static void free_request_size(void *element, void *data)
+{
+   struct request_queue *q = data;
+
+   if (q->exit_rq_fn)
+   q->exit_rq_fn(q, element);
+   kfree(element);
+}
+
 int blk_init_rl(struct request_list *rl, struct request_queue *q,
gfp_t gfp_mask)
 {
@@ -629,10 +653,15 @@ int blk_init_rl(struct request_list *rl, struct request_queue *q,
	init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
	init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
 
-   rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, alloc_request_struct,
- free_request_struct,
- (void *)(long)q->node, gfp_mask,
- q->node);
+   if (q->cmd_size) {
+   rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
+   alloc_request_size, free_request_size,
+   q, gfp_mask, q->node);
+   } else {
+   rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
+   alloc_request_simple, free_request_simple,
+   q, gfp_mask, q->node);
+   }
if (!rl->rq_pool)
return -ENOMEM;
 
@@ -846,12 +875,15 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio);
 
 int blk_init_allocated_queue(struct request_queue *q)
 {
-   q->fq = blk_alloc_flush_queue(q, NUMA_NO_NODE, 0);
+   q->fq = blk_alloc_flush_queue(q, NUMA_NO_NODE, q->cmd_size);
if (!q->fq)
return -ENOMEM;
 
+   if (q->init_rq_fn && q->init_rq_fn(q, q->fq->flush_rq, GFP_KERNEL))
+   goto out_free_flush_queue;
+
	if (blk_init_rl(&q->root_rl, q, GFP_KERNEL))
-   goto fail;
+   goto out_exit_flush_rq;
 
	INIT_WORK(&q->timeout_work, blk_timeout_work);
q->queue_flags  |= QUEUE_FLAG_DEFAULT;
@@ -869,13 +901,16 @@ int blk_init_allocated_queue(struct request_queue *q)
/* init elevator */
if (elevator_init(q, NULL)) {
		mutex_unlock(&q->sysfs_lock);
-   goto fail;
+   goto out_exit_flush_rq;
}
 
	mutex_unlock(&q->sysfs_lock);
return 0;
 
-fail:
+out_exit_flush_rq:
+   if (q->exit_rq_fn)
+   q->exit_rq_fn(q, q->fq->flush_rq);
+out_free_flush_queue:
blk_free_flush_queue(q->fq);
wbt_exit(q);
return -ENOMEM;
diff --git a/block/blk-flush.c b/block/blk-flush.c
index d7de34e..bf3ba3c 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -547,11 +547,10 @@ struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
if (!fq)
goto fail;
 
-   if (q->mq_ops) {
+   if (q->mq_ops)
		spin_lock_init(&fq->mq_flush_lock);
-   rq_sz = round_up(rq_sz + cmd_size, cache_line_size());
-   }
 
+   rq_sz = round_up(rq_sz + cmd_size, cache_line_size());
fq->flush_rq = kzalloc_node(rq_sz, GFP_KERNEL, node);
if (!fq->flush_rq)
goto fail_rq;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 1dbce05..894f773 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -814,10 +814,13 @@ 

[PATCH 13/18] scsi: remove scsi_cmd_dma_pool

2017-01-25 Thread Christoph Hellwig
There is no need for GFP_DMA allocations of the scsi_cmnd structures
themselves; all that might be DMAed to or from is the actual payload,
or the sense buffers.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
 drivers/scsi/scsi.c | 15 +--
 1 file changed, 1 insertion(+), 14 deletions(-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 469aa0f..2e24f31 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -102,17 +102,10 @@ struct scsi_host_cmd_pool {
struct kmem_cache   *cmd_slab;
	unsigned int		users;
	char			*cmd_name;
-	unsigned int		slab_flags;
 };
 
 static struct scsi_host_cmd_pool scsi_cmd_pool = {
.cmd_name   = "scsi_cmd_cache",
-   .slab_flags = SLAB_HWCACHE_ALIGN,
-};
-
-static struct scsi_host_cmd_pool scsi_cmd_dma_pool = {
-   .cmd_name   = "scsi_cmd_cache(DMA)",
-   .slab_flags = SLAB_HWCACHE_ALIGN|SLAB_CACHE_DMA,
 };
 
 static DEFINE_MUTEX(host_cmd_pool_mutex);
@@ -290,8 +283,6 @@ scsi_find_host_cmd_pool(struct Scsi_Host *shost)
 {
if (shost->hostt->cmd_size)
return shost->hostt->cmd_pool;
-   if (shost->unchecked_isa_dma)
-		return &scsi_cmd_dma_pool;
	return &scsi_cmd_pool;
 }
 
@@ -318,10 +309,6 @@ scsi_alloc_host_cmd_pool(struct Scsi_Host *shost)
return NULL;
}
 
-   pool->slab_flags = SLAB_HWCACHE_ALIGN;
-   if (shost->unchecked_isa_dma)
-   pool->slab_flags |= SLAB_CACHE_DMA;
-
if (hostt->cmd_size)
hostt->cmd_pool = pool;
 
@@ -349,7 +336,7 @@ scsi_get_host_cmd_pool(struct Scsi_Host *shost)
 
if (!pool->users) {
pool->cmd_slab = kmem_cache_create(pool->cmd_name, cmd_size, 0,
-  pool->slab_flags, NULL);
+  SLAB_HWCACHE_ALIGN, NULL);
if (!pool->cmd_slab)
goto out_free_pool;
}
-- 
2.1.4



[PATCH 15/18] scsi: allocate scsi_cmnd structures as part of struct request

2017-01-25 Thread Christoph Hellwig
Rely on the new block layer functionality to allocate additional
driver-specific data behind struct request instead of implementing it
in SCSI itself.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
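
The shape of the change, as a sketch (the real hunks follow): the midlayer
now tells the block layer how much space to reserve behind every struct
request,

	cmd_size = sizeof(struct scsi_cmnd) + shost->hostt->cmd_size;

and looking up the command attached to a request becomes plain pointer
arithmetic instead of a freelist allocation:

	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);	/* (void *)(rq + 1) */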
 drivers/scsi/hosts.c  |  20 +--
 drivers/scsi/scsi.c   | 319 --
 drivers/scsi/scsi_error.c |  17 ++-
 drivers/scsi/scsi_lib.c   | 122 --
 drivers/scsi/scsi_priv.h  |   8 +-
 include/scsi/scsi_host.h  |   3 -
 6 files changed, 95 insertions(+), 394 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 6d29c4a..831a1c8 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -230,19 +230,6 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
}
}
 
-   /*
-* Note that we allocate the freelist even for the MQ case for now,
-* as we need a command set aside for scsi_reset_provider.  Having
-* the full host freelist and one command available for that is a
-* little heavy-handed, but avoids introducing a special allocator
-* just for this.  Eventually the structure of scsi_reset_provider
-* will need a major overhaul.
-*/
-   error = scsi_setup_command_freelist(shost);
-   if (error)
-   goto out_destroy_tags;
-
-
if (!shost->shost_gendev.parent)
shost->shost_gendev.parent = dev ? dev : _bus;
if (!dma_dev)
@@ -262,7 +249,7 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 
	error = device_add(&shost->shost_gendev);
if (error)
-   goto out_destroy_freelist;
+   goto out_disable_runtime_pm;
 
scsi_host_set_state(shost, SHOST_RUNNING);
get_device(shost->shost_gendev.parent);
@@ -312,13 +299,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
	device_del(&shost->shost_dev);
 out_del_gendev:
	device_del(&shost->shost_gendev);
- out_destroy_freelist:
+ out_disable_runtime_pm:
	device_disable_async_suspend(&shost->shost_gendev);
	pm_runtime_disable(&shost->shost_gendev);
	pm_runtime_set_suspended(&shost->shost_gendev);
	pm_runtime_put_noidle(&shost->shost_gendev);
-   scsi_destroy_command_freelist(shost);
- out_destroy_tags:
if (shost_use_blk_mq(shost))
scsi_mq_destroy_tags(shost);
  fail:
@@ -359,7 +344,6 @@ static void scsi_host_dev_release(struct device *dev)
		kfree(dev_name(&shost->shost_dev));
}
 
-   scsi_destroy_command_freelist(shost);
if (shost_use_blk_mq(shost)) {
if (shost->tag_set.tags)
scsi_mq_destroy_tags(shost);
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 2e24f31..3d8d215 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -98,163 +98,6 @@ EXPORT_SYMBOL(scsi_sd_probe_domain);
 ASYNC_DOMAIN_EXCLUSIVE(scsi_sd_pm_domain);
 EXPORT_SYMBOL(scsi_sd_pm_domain);
 
-struct scsi_host_cmd_pool {
-   struct kmem_cache   *cmd_slab;
-	unsigned int		users;
-	char			*cmd_name;
-};
-
-static struct scsi_host_cmd_pool scsi_cmd_pool = {
-   .cmd_name   = "scsi_cmd_cache",
-};
-
-static DEFINE_MUTEX(host_cmd_pool_mutex);
-
-/**
- * scsi_host_free_command - internal function to release a command
- * @shost: host to free the command for
- * @cmd:   command to release
- *
- * the command must previously have been allocated by
- * scsi_host_alloc_command.
- */
-static void
-scsi_host_free_command(struct Scsi_Host *shost, struct scsi_cmnd *cmd)
-{
-   struct scsi_host_cmd_pool *pool = shost->cmd_pool;
-
-   if (cmd->prot_sdb)
-   kmem_cache_free(scsi_sdb_cache, cmd->prot_sdb);
-   scsi_free_sense_buffer(shost, cmd->sense_buffer);
-   kmem_cache_free(pool->cmd_slab, cmd);
-}
-
-/**
- * scsi_host_alloc_command - internal function to allocate command
- * @shost: SCSI host whose pool to allocate from
- * @gfp_mask:  mask for the allocation
- *
- * Returns a fully allocated command with sense buffer and protection
- * data buffer (where applicable) or NULL on failure
- */
-static struct scsi_cmnd *
-scsi_host_alloc_command(struct Scsi_Host *shost, gfp_t gfp_mask)
-{
-   struct scsi_host_cmd_pool *pool = shost->cmd_pool;
-   struct scsi_cmnd *cmd;
-
-   cmd = kmem_cache_zalloc(pool->cmd_slab, gfp_mask);
-   if (!cmd)
-   goto fail;
-
-   cmd->sense_buffer = scsi_alloc_sense_buffer(shost, gfp_mask,
-   NUMA_NO_NODE);
-   if (!cmd->sense_buffer)
-   goto fail_free_cmd;
-
-   if (scsi_host_get_prot(shost) >= SHOST_DIX_TYPE0_PROTECTION) {
-   cmd->prot_sdb = kmem_cache_zalloc(scsi_sdb_cache, gfp_mask);
-   if (!cmd->prot_sdb)
-   goto fail_free_sense;
-   }
-
-   return cmd;
-
-fail_free_sense:
-   

[PATCH 06/18] dm: remove incomplete BLOCK_PC support

2017-01-25 Thread Christoph Hellwig
DM tries to copy a few fields around for BLOCK_PC requests, but given
that no dm-target ever wires up scsi_cmd_ioctl, BLOCK_PC can't actually
be sent to dm.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
 drivers/md/dm-rq.c | 16 
 1 file changed, 16 deletions(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 93f6e9f..3f12916 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -270,19 +270,6 @@ static void dm_end_request(struct request *clone, int error)
struct mapped_device *md = tio->md;
struct request *rq = tio->orig;
 
-   if (rq->cmd_type == REQ_TYPE_BLOCK_PC) {
-   rq->errors = clone->errors;
-   rq->resid_len = clone->resid_len;
-
-   if (rq->sense)
-   /*
-* We are using the sense buffer of the original
-* request.
-* So setting the length of the sense data is enough.
-*/
-   rq->sense_len = clone->sense_len;
-   }
-
free_rq_clone(clone);
rq_end_stats(md, rq);
if (!rq->q->mq_ops)
@@ -511,9 +498,6 @@ static int setup_clone(struct request *clone, struct request *rq,
if (r)
return r;
 
-   clone->cmd = rq->cmd;
-   clone->cmd_len = rq->cmd_len;
-   clone->sense = rq->sense;
clone->end_io = end_clone_request;
clone->end_io_data = tio;
 
-- 
2.1.4



[PATCH 07/18] dm: always defer request allocation to the owner of the request_queue

2017-01-25 Thread Christoph Hellwig
DM already calls blk_mq_alloc_request on the request_queue of the
underlying device if it is a blk-mq device.  But now that we allow drivers
to allocate additional data and initialize it ahead of time we need to do
the same for all drivers.   Doing so and using the new cmd_size
infrastructure in the block layer greatly simplifies the dm-rq and mpath
code, and should also make arbitrary combinations of SQ and MQ devices
with SQ or MQ device mapper tables easily possible as a further step.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Mike Snitzer 
---
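
The effect on dm-mpath, as a sketch (the real hunks follow): instead of a
private GFP_ATOMIC mempool, multipath declares its per-I/O footprint and dm
core carves it out of every clone request via the new cmd_size mechanism:

	/* in the target constructor: */
	ti->per_io_data_size = sizeof(struct dm_mpath_io);

	/* on the I/O path the mpio arrives with the request, so there is
	 * no mempool_alloc(..., GFP_ATOMIC) that could fail under
	 * memory pressure: */
	struct dm_mpath_io *mpio = get_mpio(map_context);
	memset(mpio, 0, sizeof(*mpio));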
 drivers/md/dm-core.h  |   1 -
 drivers/md/dm-mpath.c | 132 --
 drivers/md/dm-rq.c| 251 ++
 drivers/md/dm-rq.h|   2 +-
 drivers/md/dm-target.c|   7 --
 drivers/md/dm.c   |  30 ++---
 drivers/md/dm.h   |   3 +-
 include/linux/device-mapper.h |   3 -
 8 files changed, 85 insertions(+), 344 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 40ceba1..136fda3 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -92,7 +92,6 @@ struct mapped_device {
 * io objects are allocated from here.
 */
mempool_t *io_pool;
-   mempool_t *rq_pool;
 
struct bio_set *bs;
 
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 6400cff..784f237 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -92,12 +92,6 @@ struct multipath {
 
unsigned queue_mode;
 
-   /*
-* We must use a mempool of dm_mpath_io structs so that we
-* can resubmit bios on error.
-*/
-   mempool_t *mpio_pool;
-
struct mutex work_mutex;
struct work_struct trigger_event;
 
@@ -115,8 +109,6 @@ struct dm_mpath_io {
 
 typedef int (*action_fn) (struct pgpath *pgpath);
 
-static struct kmem_cache *_mpio_cache;
-
 static struct workqueue_struct *kmultipathd, *kmpath_handlerd;
 static void trigger_event(struct work_struct *work);
 static void activate_path(struct work_struct *work);
@@ -209,7 +201,6 @@ static struct multipath *alloc_multipath(struct dm_target *ti)
	init_waitqueue_head(&m->pg_init_wait);
	mutex_init(&m->work_mutex);
 
-   m->mpio_pool = NULL;
m->queue_mode = DM_TYPE_NONE;
 
m->ti = ti;
@@ -229,16 +220,7 @@ static int alloc_multipath_stage2(struct dm_target *ti, struct multipath *m)
m->queue_mode = DM_TYPE_MQ_REQUEST_BASED;
else
m->queue_mode = DM_TYPE_REQUEST_BASED;
-   }
-
-   if (m->queue_mode == DM_TYPE_REQUEST_BASED) {
-   unsigned min_ios = dm_get_reserved_rq_based_ios();
-
-   m->mpio_pool = mempool_create_slab_pool(min_ios, _mpio_cache);
-   if (!m->mpio_pool)
-   return -ENOMEM;
-   }
-   else if (m->queue_mode == DM_TYPE_BIO_BASED) {
+   } else if (m->queue_mode == DM_TYPE_BIO_BASED) {
		INIT_WORK(&m->process_queued_bios, process_queued_bios);
/*
 * bio-based doesn't support any direct scsi_dh management;
@@ -263,7 +245,6 @@ static void free_multipath(struct multipath *m)
 
kfree(m->hw_handler_name);
kfree(m->hw_handler_params);
-   mempool_destroy(m->mpio_pool);
kfree(m);
 }
 
@@ -272,38 +253,6 @@ static struct dm_mpath_io *get_mpio(union map_info *info)
return info->ptr;
 }
 
-static struct dm_mpath_io *set_mpio(struct multipath *m, union map_info *info)
-{
-   struct dm_mpath_io *mpio;
-
-   if (!m->mpio_pool) {
-   /* Use blk-mq pdu memory requested via per_io_data_size */
-   mpio = get_mpio(info);
-   memset(mpio, 0, sizeof(*mpio));
-   return mpio;
-   }
-
-   mpio = mempool_alloc(m->mpio_pool, GFP_ATOMIC);
-   if (!mpio)
-   return NULL;
-
-   memset(mpio, 0, sizeof(*mpio));
-   info->ptr = mpio;
-
-   return mpio;
-}
-
-static void clear_request_fn_mpio(struct multipath *m, union map_info *info)
-{
-   /* Only needed for non blk-mq (.request_fn) multipath */
-   if (m->mpio_pool) {
-   struct dm_mpath_io *mpio = info->ptr;
-
-   info->ptr = NULL;
-   mempool_free(mpio, m->mpio_pool);
-   }
-}
-
 static size_t multipath_per_bio_data_size(void)
 {
return sizeof(struct dm_mpath_io) + sizeof(struct dm_bio_details);
@@ -530,16 +479,17 @@ static bool must_push_back_bio(struct multipath *m)
 /*
  * Map cloned requests (request-based multipath)
  */
-static int __multipath_map(struct dm_target *ti, struct request *clone,
-  union map_info *map_context,
-  struct request *rq, struct request **__clone)
+static int multipath_clone_and_map(struct dm_target *ti, 

[PATCH 02/18] md: cleanup bio op / flags handling in raid1_write_request

2017-01-25 Thread Christoph Hellwig
No need for the local variables, the bio is still live and we can just
assign the bits we want directly.  Makes me wonder why we can't assign
all the bio flags to start with.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/raid1.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7b0f647..67b0365 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1170,10 +1170,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
int i, disks;
struct bitmap *bitmap = mddev->bitmap;
unsigned long flags;
-   const int op = bio_op(bio);
-   const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
-   const unsigned long do_flush_fua = (bio->bi_opf &
-   (REQ_PREFLUSH | REQ_FUA));
struct md_rdev *blocked_rdev;
struct blk_plug_cb *cb;
struct raid1_plug_cb *plug = NULL;
@@ -1389,7 +1385,8 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
   conf->mirrors[i].rdev->data_offset);
mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
mbio->bi_end_io = raid1_end_write_request;
-   bio_set_op_attrs(mbio, op, do_flush_fua | do_sync);
+   mbio->bi_opf = bio_op(bio) |
+   (bio->bi_opf & (REQ_SYNC | REQ_PREFLUSH | REQ_FUA));
		if (test_bit(FailFast, &conf->mirrors[i].rdev->flags) &&
		    !test_bit(WriteMostly, &conf->mirrors[i].rdev->flags) &&
conf->raid_disks - mddev->degraded > 1)
-- 
2.1.4



[PATCH 04/18] block: simplify blk_init_allocated_queue

2017-01-25 Thread Christoph Hellwig
Return an errno value instead of the passed in queue so that the callers
don't have to keep track of two queues, and move the assignment of the
request_fn and lock to the caller as passing them as argument doesn't
simplify anything.  While we're at it also remove two pointless NULL
assignments, given that the request structure is zeroed on allocation.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
 block/blk-core.c   | 38 +++---
 drivers/md/dm-rq.c |  3 ++-
 include/linux/blkdev.h |  3 +--
 3 files changed, 18 insertions(+), 26 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a84c1b9..54b5512 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -823,15 +823,19 @@ EXPORT_SYMBOL(blk_init_queue);
 struct request_queue *
 blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 {
-   struct request_queue *uninit_q, *q;
+   struct request_queue *q;
 
-   uninit_q = blk_alloc_queue_node(GFP_KERNEL, node_id);
-   if (!uninit_q)
+   q = blk_alloc_queue_node(GFP_KERNEL, node_id);
+   if (!q)
return NULL;
 
-   q = blk_init_allocated_queue(uninit_q, rfn, lock);
-   if (!q)
-   blk_cleanup_queue(uninit_q);
+   q->request_fn = rfn;
+   if (lock)
+   q->queue_lock = lock;
+   if (blk_init_allocated_queue(q) < 0) {
+   blk_cleanup_queue(q);
+   return NULL;
+   }
 
return q;
 }
@@ -839,30 +843,19 @@ EXPORT_SYMBOL(blk_init_queue_node);
 
 static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio);
 
-struct request_queue *
-blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
-spinlock_t *lock)
-{
-   if (!q)
-   return NULL;
 
+int blk_init_allocated_queue(struct request_queue *q)
+{
q->fq = blk_alloc_flush_queue(q, NUMA_NO_NODE, 0);
if (!q->fq)
-   return NULL;
+   return -ENOMEM;
 
	if (blk_init_rl(&q->root_rl, q, GFP_KERNEL))
goto fail;
 
	INIT_WORK(&q->timeout_work, blk_timeout_work);
-   q->request_fn   = rfn;
-   q->prep_rq_fn   = NULL;
-   q->unprep_rq_fn = NULL;
q->queue_flags  |= QUEUE_FLAG_DEFAULT;
 
-   /* Override internal queue lock with supplied lock pointer */
-   if (lock)
-   q->queue_lock   = lock;
-
/*
 * This also sets hw/phys segments, boundary and size
 */
@@ -880,13 +873,12 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
}
 
	mutex_unlock(&q->sysfs_lock);
-
-   return q;
+   return 0;
 
 fail:
blk_free_flush_queue(q->fq);
wbt_exit(q);
-   return NULL;
+   return -ENOMEM;
 }
 EXPORT_SYMBOL(blk_init_allocated_queue);
 
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 9d7275f..93f6e9f 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -823,7 +823,8 @@ static void dm_old_request_fn(struct request_queue *q)
 int dm_old_init_request_queue(struct mapped_device *md)
 {
/* Fully initialize the queue */
-   if (!blk_init_allocated_queue(md->queue, dm_old_request_fn, NULL))
+   md->queue->request_fn = dm_old_request_fn;
+   if (blk_init_allocated_queue(md->queue) < 0)
return -EINVAL;
 
/* disable dm_old_request_fn's merge heuristic by default */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8e0b57e..a036c4a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1131,8 +1131,7 @@ extern void blk_unprep_request(struct request *);
 extern struct request_queue *blk_init_queue_node(request_fn_proc *rfn,
spinlock_t *lock, int node_id);
 extern struct request_queue *blk_init_queue(request_fn_proc *, spinlock_t *);
-extern struct request_queue *blk_init_allocated_queue(struct request_queue *,
-						request_fn_proc *, spinlock_t *);
+extern int blk_init_allocated_queue(struct request_queue *);
 extern void blk_cleanup_queue(struct request_queue *);
 extern void blk_queue_make_request(struct request_queue *, make_request_fn *);
 extern void blk_queue_bounce_limit(struct request_queue *, u64);
-- 
2.1.4



[PATCH 03/18] block: fix elevator init check

2017-01-25 Thread Christoph Hellwig
We can't initialize the elevator fields for flushes as flushes share space
in struct request with the elevator data.  But currently we can't
communicate that a request is a flush through blk_get_request as we
can only pass READ or WRITE, and the low-level code looks at the
possible NULL bio to check for a flush.

Fix this by allowing any block op and flags to be passed, and by checking
for the flush flags in __get_request.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
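
After this change a caller can, for example, do (illustrative, not taken
from the patch):

	/* any op/flag combination is accepted now, not just READ or WRITE */
	struct request *rq = blk_get_request(q, REQ_OP_WRITE | REQ_PREFLUSH,
					     GFP_KERNEL);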
 block/blk-core.c | 26 --
 1 file changed, 4 insertions(+), 22 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index b830e14..a84c1b9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1022,25 +1022,6 @@ int blk_update_nr_requests(struct request_queue *q, unsigned int nr)
return 0;
 }
 
-/*
- * Determine if elevator data should be initialized when allocating the
- * request associated with @bio.
- */
-static bool blk_rq_should_init_elevator(struct bio *bio)
-{
-   if (!bio)
-   return true;
-
-   /*
-* Flush requests do not use the elevator so skip initialization.
-* This allows a request to share the flush and elevator data.
-*/
-   if (op_is_flush(bio->bi_opf))
-   return false;
-
-   return true;
-}
-
 /**
  * __get_request - get a free request
  * @rl: request list to allocate from
@@ -1119,10 +1100,13 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
 * request is freed.  This guarantees icq's won't be destroyed and
 * makes creating new ones safe.
 *
+* Flush requests do not use the elevator so skip initialization.
+* This allows a request to share the flush and elevator data.
+*
 * Also, lookup icq while holding queue_lock.  If it doesn't exist,
 * it will be created after releasing queue_lock.
 */
-   if (blk_rq_should_init_elevator(bio) && !blk_queue_bypass(q)) {
+   if (!op_is_flush(op) && !blk_queue_bypass(q)) {
rq_flags |= RQF_ELVPRIV;
q->nr_rqs_elvpriv++;
if (et->icq_cache && ioc)
@@ -1276,8 +1260,6 @@ static struct request *blk_old_get_request(struct request_queue *q, int rw,
 {
struct request *rq;
 
-   BUG_ON(rw != READ && rw != WRITE);
-
/* create ioc upfront */
create_io_context(gfp_mask, q->node);
 
-- 
2.1.4



[PATCH 01/18] block: add a op_is_flush helper

2017-01-25 Thread Christoph Hellwig
This centralizes the checks for bios that need to go into the flush
state machine.

Signed-off-by: Christoph Hellwig 
---
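
The helper itself lands in include/linux/blk_types.h; that hunk is cut off
below, so here it is as reconstructed from the call sites:

	static inline bool op_is_flush(unsigned int op)
	{
		return op & (REQ_FUA | REQ_PREFLUSH);
	}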
 block/blk-core.c |  8 
 block/blk-mq-sched.c |  5 ++---
 block/blk-mq.c   |  4 ++--
 drivers/md/bcache/request.c  |  2 +-
 drivers/md/dm-cache-target.c | 13 +++--
 drivers/md/dm-thin.c | 13 +
 include/linux/blk_types.h|  9 +
 7 files changed, 26 insertions(+), 28 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a61f140..b830e14 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1035,7 +1035,7 @@ static bool blk_rq_should_init_elevator(struct bio *bio)
 * Flush requests do not use the elevator so skip initialization.
 * This allows a request to share the flush and elevator data.
 */
-   if (bio->bi_opf & (REQ_PREFLUSH | REQ_FUA))
+   if (op_is_flush(bio->bi_opf))
return false;
 
return true;
@@ -1641,7 +1641,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
return BLK_QC_T_NONE;
}
 
-   if (bio->bi_opf & (REQ_PREFLUSH | REQ_FUA)) {
+   if (op_is_flush(bio->bi_opf)) {
spin_lock_irq(q->queue_lock);
where = ELEVATOR_INSERT_FLUSH;
goto get_rq;
@@ -2145,7 +2145,7 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
 */
BUG_ON(blk_queued_rq(rq));
 
-   if (rq->cmd_flags & (REQ_PREFLUSH | REQ_FUA))
+   if (op_is_flush(rq->cmd_flags))
where = ELEVATOR_INSERT_FLUSH;
 
add_acct_request(q, rq, where);
@@ -3256,7 +3256,7 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
/*
 * rq is already accounted, so use raw insert
 */
-   if (rq->cmd_flags & (REQ_PREFLUSH | REQ_FUA))
+   if (op_is_flush(rq->cmd_flags))
__elv_add_request(q, rq, ELEVATOR_INSERT_FLUSH);
else
__elv_add_request(q, rq, ELEVATOR_INSERT_SORT_MERGE);
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index d05061f..3bd66e5 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -111,7 +111,6 @@ struct request *blk_mq_sched_get_request(struct request_queue *q,
struct blk_mq_hw_ctx *hctx;
struct blk_mq_ctx *ctx;
struct request *rq;
-   const bool is_flush = op & (REQ_PREFLUSH | REQ_FUA);
 
blk_queue_enter_live(q);
ctx = blk_mq_get_ctx(q);
@@ -126,7 +125,7 @@ struct request *blk_mq_sched_get_request(struct request_queue *q,
 * Flush requests are special and go directly to the
 * dispatch list.
 */
-   if (!is_flush && e->type->ops.mq.get_request) {
+   if (!op_is_flush(op) && e->type->ops.mq.get_request) {
rq = e->type->ops.mq.get_request(q, op, data);
if (rq)
rq->rq_flags |= RQF_QUEUED;
@@ -138,7 +137,7 @@ struct request *blk_mq_sched_get_request(struct request_queue *q,
}
 
if (rq) {
-   if (!is_flush) {
+   if (!op_is_flush(op)) {
rq->elv.icq = NULL;
if (e && e->type->icq_cache)
blk_mq_sched_assign_ioc(q, rq, bio);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ee69e5e..e229f8a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1378,7 +1378,7 @@ static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
 static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 {
const int is_sync = op_is_sync(bio->bi_opf);
-   const int is_flush_fua = bio->bi_opf & (REQ_PREFLUSH | REQ_FUA);
+   const int is_flush_fua = op_is_flush(bio->bi_opf);
struct blk_mq_alloc_data data;
struct request *rq;
unsigned int request_count = 0, srcu_idx;
@@ -1498,7 +1498,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 {
const int is_sync = op_is_sync(bio->bi_opf);
-   const int is_flush_fua = bio->bi_opf & (REQ_PREFLUSH | REQ_FUA);
+   const int is_flush_fua = op_is_flush(bio->bi_opf);
struct blk_plug *plug;
unsigned int request_count = 0;
struct blk_mq_alloc_data data;
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 76d2087..01035e7 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -666,7 +666,7 @@ static inline struct search *search_alloc(struct bio *bio,
s->iop.write_prio   = 0;
s->iop.error= 0;
s->iop.flags= 0;
-   s->iop.flush_journal= (bio->bi_opf & (REQ_PREFLUSH|REQ_FUA)) != 

Re: [PATCH 4/4] nbd: add a nbd-control interface

2017-01-25 Thread Alex Gartrell
On 1/25/17, 6:23 AM, "arndbergm...@gmail.com on behalf of Arnd Bergmann" 
 wrote:
> On Wed, Jan 25, 2017 at 2:47 PM, Josef Bacik  wrote:
> > On Tue, Jan 24, 2017 at 2:11 AM, Greg KH  wrote:
> >> On Mon, Jan 23, 2017 at 10:52:42AM -0500, Josef Bacik wrote:
> >>>  On Mon, Jan 23, 2017 at 10:03 AM, Greg KH 
> >>>  Loop accomplishes this with the /dev/loop-control
> >>>  and an ioctl.  Then we also need a way to figure out what is the first
> >>>  /dev/nbd# device that isn't currently in use in order to pick the next
> >>> one
> >>>  to configure.  Keep in mind that all of the configuration for /dev/nbd#
> >>>  devices is done through ioctls to those devices, so having a ioctl
> >>> interface
> >>>  for the control device is consistent with the rest of how NBD works.
> >>
> >>
> >> But adding a random char device node and using ioctls on it is different
> >> than using an ioctl on an existing block device node that is already
> >> there.
> >>
> >> So what _exactly_ do you need to do with this interface?  What data do
> >> you need sent/received to/from the kernel?  Specifics please.
> >
> >
> > I need to say "Hey kernel, can you setup some new nbd devices for me?" and
> > "Hey kernel, what is the first free nbd device I can use for this new
> > connection and (optionally) create a new device for me if one does not
> > exist?"  Loop accomplished this (in 2011 btw, not some far off date) with
> > /dev/loop-control.  Btrfs has it's scanning stuff controlled by
> > /dev/btrfs-control.  Device mapper has /dev/mapper/control.  This is a well
> > established and widely used way for control block based device drivers.
> 
> We have multiple established ways to deal with this kind of problem, the most
> common ones being a separate chardev as you propose, a sysfs interface and
> netlink.
> 
> They all have their own set of problems, but the needs of nbd as a network
> storage interface seem most closely resemble what we have for other network
> related interfaces, where we typically use netlink to do the setup, see e.g.
> macvtap as an example for a network chardev interface.
> 
> Can you have a look if that would solve your problem more nicely?

FWIW, I'm the one consuming this nbd stuff at Facebook and bringing
netlink into this would be a huge pain for me.  It's very weird to
need to pull sockets in and then drop back to ioctls from a design
perspective, and future consumers of this API would be sad as well.
This is compounded, I think, by the fact that the breadth of
functionality here doesn't really merit a shared library for everyone
to use -- could you imagine libnbdcontrol, which exposes a single
"get_next_device" function?

If nbd were *all* netlink I think that'd be fine, but you'd have
problems implementing the NBD_DOIT function in that fashion.  So I'd
rather stick to the char device ioctl thing because it's more
consistent with the old NBD stuff as well as the loop device stuff.

-- 
Alex Gartrell 



[PATCH v2 04/10] blk-mq: add extra request information to debugfs

2017-01-25 Thread Omar Sandoval
From: Omar Sandoval 

The request pointers by themselves aren't super useful.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Omar Sandoval 
---
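
With this, an entry in the per-hctx dispatch file reads something like the
following (hypothetical values; the format is the seq_printf() below):

	# cat /sys/kernel/debug/block/vda/mq/0/dispatch
	ffff880056bca000 {.cmd_type=1, .cmd_flags=0x2801, .rq_flags=0x2040, .tag=7, .internal_tag=-1}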
 block/blk-mq-debugfs.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 1967c06d80e0..ee947f2f605b 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -87,7 +87,9 @@ static int blk_mq_debugfs_rq_show(struct seq_file *m, void *v)
 {
struct request *rq = list_entry_rq(v);
 
-   seq_printf(m, "%p\n", rq);
+   seq_printf(m, "%p {.cmd_type=%u, .cmd_flags=0x%x, .rq_flags=0x%x, 
.tag=%d, .internal_tag=%d}\n",
+  rq, rq->cmd_type, rq->cmd_flags, (unsigned int)rq->rq_flags,
+  rq->tag, rq->internal_tag);
return 0;
 }
 
-- 
2.11.0



[PATCH v2 07/10] blk-mq: move tags and sched_tags info from sysfs to debugfs

2017-01-25 Thread Omar Sandoval
From: Omar Sandoval 

These are very tied to the blk-mq tag implementation, so exposing them
to sysfs isn't a great idea. Move the debugging information to debugfs
and add basic entries for the number of tags and the number of reserved
tags to sysfs.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Omar Sandoval 
---
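
The new per-hctx "tags" file then reports something like the following
(hypothetical values; the format follows the seq_printf() calls below, with
the sbitmap dump appended by sbitmap_queue_show()):

	# cat /sys/kernel/debug/block/vda/mq/0/tags
	nr_tags=128
	nr_reserved_tags=0
	active_queues=0

	bitmap_tags:
	...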
 block/blk-mq-debugfs.c | 70 ++
 block/blk-mq-sysfs.c   | 33 
 block/blk-mq-tag.c | 27 ---
 block/blk-mq-tag.h |  1 -
 4 files changed, 86 insertions(+), 45 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 857bdadc6b1f..9f811007cf4f 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -20,6 +20,7 @@
 
 #include <linux/blk-mq.h>
 #include "blk-mq.h"
+#include "blk-mq-tag.h"
 
 struct blk_mq_debugfs_attr {
const char *name;
@@ -154,6 +155,73 @@ static const struct file_operations hctx_ctx_map_fops = {
.release= single_release,
 };
 
+static void blk_mq_debugfs_tags_show(struct seq_file *m,
+struct blk_mq_tags *tags)
+{
+   seq_printf(m, "nr_tags=%u\n", tags->nr_tags);
+   seq_printf(m, "nr_reserved_tags=%u\n", tags->nr_reserved_tags);
+   seq_printf(m, "active_queues=%d\n",
+		   atomic_read(&tags->active_queues));
+
+   seq_puts(m, "\nbitmap_tags:\n");
+	sbitmap_queue_show(&tags->bitmap_tags, m);
+
+   if (tags->nr_reserved_tags) {
+   seq_puts(m, "\nbreserved_tags:\n");
+		sbitmap_queue_show(&tags->breserved_tags, m);
+   }
+}
+
+static int hctx_tags_show(struct seq_file *m, void *v)
+{
+   struct blk_mq_hw_ctx *hctx = m->private;
+   struct request_queue *q = hctx->queue;
+
+	mutex_lock(&q->sysfs_lock);
+   if (hctx->tags)
+   blk_mq_debugfs_tags_show(m, hctx->tags);
+	mutex_unlock(&q->sysfs_lock);
+
+   return 0;
+}
+
+static int hctx_tags_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, hctx_tags_show, inode->i_private);
+}
+
+static const struct file_operations hctx_tags_fops = {
+   .open   = hctx_tags_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+
+static int hctx_sched_tags_show(struct seq_file *m, void *v)
+{
+   struct blk_mq_hw_ctx *hctx = m->private;
+   struct request_queue *q = hctx->queue;
+
+	mutex_lock(&q->sysfs_lock);
+   if (hctx->sched_tags)
+   blk_mq_debugfs_tags_show(m, hctx->sched_tags);
+	mutex_unlock(&q->sysfs_lock);
+
+   return 0;
+}
+
+static int hctx_sched_tags_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, hctx_sched_tags_show, inode->i_private);
+}
+
+static const struct file_operations hctx_sched_tags_fops = {
+   .open   = hctx_sched_tags_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+
 static void *ctx_rq_list_start(struct seq_file *m, loff_t *pos)
 {
struct blk_mq_ctx *ctx = m->private;
@@ -200,6 +268,8 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
	{"flags", 0400, &hctx_flags_fops},
	{"dispatch", 0400, &hctx_dispatch_fops},
	{"ctx_map", 0400, &hctx_ctx_map_fops},
+	{"tags", 0400, &hctx_tags_fops},
+	{"sched_tags", 0400, &hctx_sched_tags_fops},
 };
 
 static const struct blk_mq_debugfs_attr blk_mq_debugfs_ctx_attrs[] = {
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index ee3694d8c4ee..a0ae1f536ed0 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -184,17 +184,16 @@ static ssize_t blk_mq_hw_sysfs_dispatched_show(struct blk_mq_hw_ctx *hctx,
return page - start_page;
 }
 
-static ssize_t blk_mq_hw_sysfs_sched_tags_show(struct blk_mq_hw_ctx *hctx, char *page)
+static ssize_t blk_mq_hw_sysfs_nr_tags_show(struct blk_mq_hw_ctx *hctx,
+   char *page)
 {
-   if (hctx->sched_tags)
-   return blk_mq_tag_sysfs_show(hctx->sched_tags, page);
-
-   return 0;
+   return sprintf(page, "%u\n", hctx->tags->nr_tags);
 }
 
-static ssize_t blk_mq_hw_sysfs_tags_show(struct blk_mq_hw_ctx *hctx, char *page)
+static ssize_t blk_mq_hw_sysfs_nr_reserved_tags_show(struct blk_mq_hw_ctx *hctx,
+						     char *page)
 {
-   return blk_mq_tag_sysfs_show(hctx->tags, page);
+   return sprintf(page, "%u\n", hctx->tags->nr_reserved_tags);
 }
 
static ssize_t blk_mq_hw_sysfs_active_show(struct blk_mq_hw_ctx *hctx, char *page)
@@ -293,18 +292,18 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_dispatched = {
.attr = {.name = "dispatched", .mode = S_IRUGO },
.show = blk_mq_hw_sysfs_dispatched_show,
 };
+static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_nr_tags = {
+   .attr = 

[PATCH v2 06/10] blk-mq: export software queue pending map to debugfs

2017-01-25 Thread Omar Sandoval
From: Omar Sandoval 

This is useful for debugging problems where we've gotten stuck with
requests in the software queues.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Omar Sandoval 
---
 block/blk-mq-debugfs.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index ee947f2f605b..857bdadc6b1f 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -134,6 +134,26 @@ static const struct file_operations hctx_dispatch_fops = {
.release= seq_release,
 };
 
+static int hctx_ctx_map_show(struct seq_file *m, void *v)
+{
+   struct blk_mq_hw_ctx *hctx = m->private;
+
+   sbitmap_bitmap_show(>ctx_map, m);
+   return 0;
+}
+
+static int hctx_ctx_map_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, hctx_ctx_map_show, inode->i_private);
+}
+
+static const struct file_operations hctx_ctx_map_fops = {
+   .open   = hctx_ctx_map_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+
 static void *ctx_rq_list_start(struct seq_file *m, loff_t *pos)
 {
struct blk_mq_ctx *ctx = m->private;
@@ -179,6 +199,7 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
	{"state", 0400, &hctx_state_fops},
	{"flags", 0400, &hctx_flags_fops},
	{"dispatch", 0400, &hctx_dispatch_fops},
+	{"ctx_map", 0400, &hctx_ctx_map_fops},
 };
 
 static const struct blk_mq_debugfs_attr blk_mq_debugfs_ctx_attrs[] = {
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 10/10] blk-mq: move hctx and ctx counters from sysfs to debugfs

2017-01-25 Thread Omar Sandoval
From: Omar Sandoval 

These counters aren't as out-of-place in sysfs as the other stuff, but
debugfs is a slightly better home for them.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Omar Sandoval 
---
 block/blk-mq-debugfs.c | 181 +
 block/blk-mq-sysfs.c   |  64 -
 2 files changed, 181 insertions(+), 64 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 28ca56dd2c4e..5cd2b435a9f5 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -399,6 +399,88 @@ static const struct file_operations hctx_dispatched_fops = {
.release= single_release,
 };
 
+static int hctx_queued_show(struct seq_file *m, void *v)
+{
+   struct blk_mq_hw_ctx *hctx = m->private;
+
+   seq_printf(m, "%lu\n", hctx->queued);
+   return 0;
+}
+
+static int hctx_queued_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, hctx_queued_show, inode->i_private);
+}
+
+static ssize_t hctx_queued_write(struct file *file, const char __user *buf,
+size_t count, loff_t *ppos)
+{
+   struct seq_file *m = file->private_data;
+   struct blk_mq_hw_ctx *hctx = m->private;
+
+   hctx->queued = 0;
+   return count;
+}
+
+static const struct file_operations hctx_queued_fops = {
+   .open   = hctx_queued_open,
+   .read   = seq_read,
+   .write  = hctx_queued_write,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+
+static int hctx_run_show(struct seq_file *m, void *v)
+{
+   struct blk_mq_hw_ctx *hctx = m->private;
+
+   seq_printf(m, "%lu\n", hctx->run);
+   return 0;
+}
+
+static int hctx_run_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, hctx_run_show, inode->i_private);
+}
+
+static ssize_t hctx_run_write(struct file *file, const char __user *buf,
+size_t count, loff_t *ppos)
+{
+   struct seq_file *m = file->private_data;
+   struct blk_mq_hw_ctx *hctx = m->private;
+
+   hctx->run = 0;
+   return count;
+}
+
+static const struct file_operations hctx_run_fops = {
+   .open   = hctx_run_open,
+   .read   = seq_read,
+   .write  = hctx_run_write,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+
+static int hctx_active_show(struct seq_file *m, void *v)
+{
+   struct blk_mq_hw_ctx *hctx = m->private;
+
+   seq_printf(m, "%d\n", atomic_read(>nr_active));
+   return 0;
+}
+
+static int hctx_active_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, hctx_active_show, inode->i_private);
+}
+
+static const struct file_operations hctx_active_fops = {
+   .open   = hctx_active_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+
 static void *ctx_rq_list_start(struct seq_file *m, loff_t *pos)
 {
struct blk_mq_ctx *ctx = m->private;
@@ -440,6 +522,99 @@ static const struct file_operations ctx_rq_list_fops = {
.release= seq_release,
 };
 
+static int ctx_dispatched_show(struct seq_file *m, void *v)
+{
+   struct blk_mq_ctx *ctx = m->private;
+
+   seq_printf(m, "%lu %lu\n", ctx->rq_dispatched[1], 
ctx->rq_dispatched[0]);
+   return 0;
+}
+
+static int ctx_dispatched_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, ctx_dispatched_show, inode->i_private);
+}
+
+static ssize_t ctx_dispatched_write(struct file *file, const char __user *buf,
+   size_t count, loff_t *ppos)
+{
+   struct seq_file *m = file->private_data;
+   struct blk_mq_ctx *ctx = m->private;
+
+   ctx->rq_dispatched[0] = ctx->rq_dispatched[1] = 0;
+   return count;
+}
+
+static const struct file_operations ctx_dispatched_fops = {
+   .open   = ctx_dispatched_open,
+   .read   = seq_read,
+   .write  = ctx_dispatched_write,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+
+static int ctx_merged_show(struct seq_file *m, void *v)
+{
+   struct blk_mq_ctx *ctx = m->private;
+
+   seq_printf(m, "%lu\n", ctx->rq_merged);
+   return 0;
+}
+
+static int ctx_merged_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, ctx_merged_show, inode->i_private);
+}
+
+static ssize_t ctx_merged_write(struct file *file, const char __user *buf,
+   size_t count, loff_t *ppos)
+{
+   struct seq_file *m = file->private_data;
+   struct blk_mq_ctx *ctx = m->private;
+
+   ctx->rq_merged = 0;
+   return count;
+}
+
+static const struct file_operations ctx_merged_fops = {
+   .open   = ctx_merged_open,
+   .read   = 

[PATCH v2 09/10] blk-mq: move hctx io_poll, stats, and dispatched from sysfs to debugfs

2017-01-25 Thread Omar Sandoval
From: Omar Sandoval 

These statistics _might_ be useful to userspace, but it's better not to
commit to an ABI for these yet. Also, the dispatched file in sysfs
couldn't be cleared, so make it clearable like the others in debugfs.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Omar Sandoval 
---
 block/blk-mq-debugfs.c | 132 +
 block/blk-mq-sysfs.c   |  92 --
 2 files changed, 132 insertions(+), 92 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 9b87b15619f1..28ca56dd2c4e 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -270,6 +270,135 @@ static const struct file_operations hctx_sched_tags_bitmap_fops = {
.release= single_release,
 };
 
+static int hctx_io_poll_show(struct seq_file *m, void *v)
+{
+   struct blk_mq_hw_ctx *hctx = m->private;
+
+   seq_printf(m, "considered=%lu\n", hctx->poll_considered);
+   seq_printf(m, "invoked=%lu\n", hctx->poll_invoked);
+   seq_printf(m, "success=%lu\n", hctx->poll_success);
+   return 0;
+}
+
+static int hctx_io_poll_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, hctx_io_poll_show, inode->i_private);
+}
+
+static ssize_t hctx_io_poll_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+   struct seq_file *m = file->private_data;
+   struct blk_mq_hw_ctx *hctx = m->private;
+
+   hctx->poll_considered = hctx->poll_invoked = hctx->poll_success = 0;
+   return count;
+}
+
+static const struct file_operations hctx_io_poll_fops = {
+   .open   = hctx_io_poll_open,
+   .read   = seq_read,
+   .write  = hctx_io_poll_write,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+
+static void print_stat(struct seq_file *m, struct blk_rq_stat *stat)
+{
+   seq_printf(m, "samples=%d, mean=%lld, min=%llu, max=%llu",
+  stat->nr_samples, stat->mean, stat->min, stat->max);
+}
+
+static int hctx_stats_show(struct seq_file *m, void *v)
+{
+   struct blk_mq_hw_ctx *hctx = m->private;
+   struct blk_rq_stat stat[2];
+
+	blk_stat_init(&stat[BLK_STAT_READ]);
+	blk_stat_init(&stat[BLK_STAT_WRITE]);
+
+   blk_hctx_stat_get(hctx, stat);
+
+   seq_puts(m, "read: ");
+	print_stat(m, &stat[BLK_STAT_READ]);
+   seq_puts(m, "\n");
+
+   seq_puts(m, "write: ");
+	print_stat(m, &stat[BLK_STAT_WRITE]);
+   seq_puts(m, "\n");
+   return 0;
+}
+
+static int hctx_stats_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, hctx_stats_show, inode->i_private);
+}
+
+static ssize_t hctx_stats_write(struct file *file, const char __user *buf,
+   size_t count, loff_t *ppos)
+{
+   struct seq_file *m = file->private_data;
+   struct blk_mq_hw_ctx *hctx = m->private;
+   struct blk_mq_ctx *ctx;
+   int i;
+
+   hctx_for_each_ctx(hctx, ctx, i) {
+		blk_stat_init(&ctx->stat[BLK_STAT_READ]);
+		blk_stat_init(&ctx->stat[BLK_STAT_WRITE]);
+   }
+   return count;
+}
+
+static const struct file_operations hctx_stats_fops = {
+   .open   = hctx_stats_open,
+   .read   = seq_read,
+   .write  = hctx_stats_write,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+
+static int hctx_dispatched_show(struct seq_file *m, void *v)
+{
+   struct blk_mq_hw_ctx *hctx = m->private;
+   int i;
+
+   seq_printf(m, "%8u\t%lu\n", 0U, hctx->dispatched[0]);
+
+   for (i = 1; i < BLK_MQ_MAX_DISPATCH_ORDER - 1; i++) {
+   unsigned int d = 1U << (i - 1);
+
+   seq_printf(m, "%8u\t%lu\n", d, hctx->dispatched[i]);
+   }
+
+   seq_printf(m, "%8u+\t%lu\n", 1U << (i - 1), hctx->dispatched[i]);
+   return 0;
+}
+
+static int hctx_dispatched_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, hctx_dispatched_show, inode->i_private);
+}
+
+static ssize_t hctx_dispatched_write(struct file *file, const char __user *buf,
+size_t count, loff_t *ppos)
+{
+   struct seq_file *m = file->private_data;
+   struct blk_mq_hw_ctx *hctx = m->private;
+   int i;
+
+   for (i = 0; i < BLK_MQ_MAX_DISPATCH_ORDER; i++)
+   hctx->dispatched[i] = 0;
+   return count;
+}
+
+static const struct file_operations hctx_dispatched_fops = {
+   .open   = hctx_dispatched_open,
+   .read   = seq_read,
+   .write  = hctx_dispatched_write,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+
 static void *ctx_rq_list_start(struct seq_file *m, loff_t *pos)
 {
struct blk_mq_ctx *ctx = m->private;
@@ -320,6 +449,9 @@ static const struct 

[PATCH v2 02/10] blk-mq: add hctx->{state,flags} to debugfs

2017-01-25 Thread Omar Sandoval
From: Omar Sandoval 

hctx->state could come in handy for bugs where the hardware queue gets
stuck in the stopped state, and hctx->flags is just useful to know.

Signed-off-by: Omar Sandoval 
---
 block/blk-mq-debugfs.c | 42 ++
 1 file changed, 42 insertions(+)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 01711bbf5ade..0c14511fa9e0 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -29,7 +29,49 @@ struct blk_mq_debugfs_attr {
 
 static struct dentry *block_debugfs_root;
 
+static int hctx_state_show(struct seq_file *m, void *v)
+{
+   struct blk_mq_hw_ctx *hctx = m->private;
+
+   seq_printf(m, "0x%lx\n", hctx->state);
+   return 0;
+}
+
+static int hctx_state_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, hctx_state_show, inode->i_private);
+}
+
+static const struct file_operations hctx_state_fops = {
+   .open   = hctx_state_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+
+static int hctx_flags_show(struct seq_file *m, void *v)
+{
+   struct blk_mq_hw_ctx *hctx = m->private;
+
+   seq_printf(m, "0x%lx\n", hctx->flags);
+   return 0;
+}
+
+static int hctx_flags_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, hctx_flags_show, inode->i_private);
+}
+
+static const struct file_operations hctx_flags_fops = {
+   .open   = hctx_flags_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+
 static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
+   {"state", 0400, _state_fops},
+   {"flags", 0400, _flags_fops},
 };
 
 static const struct blk_mq_debugfs_attr blk_mq_debugfs_ctx_attrs[] = {
-- 
2.11.0



[PATCH v2 01/10] blk-mq: create debugfs directory tree

2017-01-25 Thread Omar Sandoval
From: Omar Sandoval 

In preparation for putting blk-mq debugging information in debugfs,
create a directory tree mirroring the one in sysfs:

# tree -d /sys/kernel/debug/block
/sys/kernel/debug/block
|-- nvme0n1
|   `-- mq
|   |-- 0
|   |   `-- cpu0
|   |-- 1
|   |   `-- cpu1
|   |-- 2
|   |   `-- cpu2
|   `-- 3
|   `-- cpu3
`-- vda
`-- mq
`-- 0
|-- cpu0
|-- cpu1
|-- cpu2
`-- cpu3

Also add the scaffolding for the actual files that will go in here,
either under the hardware queue or software queue directories.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Omar Sandoval 
---
 block/Makefile |   1 +
 block/blk-mq-debugfs.c | 152 +
 block/blk-mq-sysfs.c   |   8 +++
 block/blk-mq.c |   2 +
 block/blk-mq.h |  33 +++
 include/linux/blkdev.h |   5 ++
 6 files changed, 201 insertions(+)
 create mode 100644 block/blk-mq-debugfs.c

diff --git a/block/Makefile b/block/Makefile
index 3ee0abd7205a..6cabe6bd2882 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -26,3 +26,4 @@ obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o
 obj-$(CONFIG_BLK_MQ_PCI)   += blk-mq-pci.o
 obj-$(CONFIG_BLK_DEV_ZONED)+= blk-zoned.o
 obj-$(CONFIG_BLK_WBT)  += blk-wbt.o
+obj-$(CONFIG_DEBUG_FS) += blk-mq-debugfs.o
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
new file mode 100644
index ..01711bbf5ade
--- /dev/null
+++ b/block/blk-mq-debugfs.c
@@ -0,0 +1,152 @@
+/*
+ * Copyright (C) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <https://www.gnu.org/licenses/>.
+ */
+
+#include <linux/kernel.h>
+#include <linux/blkdev.h>
+#include <linux/debugfs.h>
+
+#include <linux/blk-mq.h>
+#include "blk-mq.h"
+
+struct blk_mq_debugfs_attr {
+   const char *name;
+   umode_t mode;
+   const struct file_operations *fops;
+};
+
+static struct dentry *block_debugfs_root;
+
+static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
+};
+
+static const struct blk_mq_debugfs_attr blk_mq_debugfs_ctx_attrs[] = {
+};
+
+int blk_mq_debugfs_register(struct request_queue *q, const char *name)
+{
+   if (!block_debugfs_root)
+   return -ENOENT;
+
+   q->debugfs_dir = debugfs_create_dir(name, block_debugfs_root);
+   if (!q->debugfs_dir)
+   goto err;
+
+   if (blk_mq_debugfs_register_hctxs(q))
+   goto err;
+
+   return 0;
+
+err:
+   blk_mq_debugfs_unregister(q);
+   return -ENOMEM;
+}
+
+void blk_mq_debugfs_unregister(struct request_queue *q)
+{
+   debugfs_remove_recursive(q->debugfs_dir);
+   q->mq_debugfs_dir = NULL;
+   q->debugfs_dir = NULL;
+}
+
+static int blk_mq_debugfs_register_ctx(struct request_queue *q,
+  struct blk_mq_ctx *ctx,
+  struct dentry *hctx_dir)
+{
+   struct dentry *ctx_dir;
+   char name[20];
+   int i;
+
+   snprintf(name, sizeof(name), "cpu%u", ctx->cpu);
+   ctx_dir = debugfs_create_dir(name, hctx_dir);
+   if (!ctx_dir)
+   return -ENOMEM;
+
+   for (i = 0; i < ARRAY_SIZE(blk_mq_debugfs_ctx_attrs); i++) {
+   const struct blk_mq_debugfs_attr *attr;
+
+   attr = &blk_mq_debugfs_ctx_attrs[i];
+   if (!debugfs_create_file(attr->name, attr->mode, ctx_dir, ctx,
+attr->fops))
+   return -ENOMEM;
+   }
+
+   return 0;
+}
+
+static int blk_mq_debugfs_register_hctx(struct request_queue *q,
+   struct blk_mq_hw_ctx *hctx)
+{
+   struct blk_mq_ctx *ctx;
+   struct dentry *hctx_dir;
+   char name[20];
+   int i;
+
+   snprintf(name, sizeof(name), "%u", hctx->queue_num);
+   hctx_dir = debugfs_create_dir(name, q->mq_debugfs_dir);
+   if (!hctx_dir)
+   return -ENOMEM;
+
+   for (i = 0; i < ARRAY_SIZE(blk_mq_debugfs_hctx_attrs); i++) {
+   const struct blk_mq_debugfs_attr *attr;
+
+   attr = &blk_mq_debugfs_hctx_attrs[i];
+   if (!debugfs_create_file(attr->name, attr->mode, hctx_dir, hctx,
+attr->fops))
+   return -ENOMEM;
+   }
+
+   

[PATCH v2 05/10] sbitmap: add helpers for dumping to a seq_file

2017-01-25 Thread Omar Sandoval
From: Omar Sandoval 

This is useful debugging information that will be used in the blk-mq
debugfs directory.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Omar Sandoval 
---
 include/linux/sbitmap.h | 28 +++
 lib/sbitmap.c   | 74 +
 2 files changed, 102 insertions(+)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index f017fd6e69c4..4a7998355f2b 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -259,6 +259,25 @@ static inline int sbitmap_test_bit(struct sbitmap *sb, unsigned int bitnr)
 unsigned int sbitmap_weight(const struct sbitmap *sb);
 
 /**
+ * sbitmap_show() - Dump bitmap information to a struct seq_file.
+ * @sb: Bitmap to show.
+ * @m: struct seq_file to write to.
+ *
+ * This is intended for debugging. The format may change at any time.
+ */
+void sbitmap_show(struct sbitmap *sb, struct seq_file *m);
+
+/**
+ * sbitmap_bitmap_show() - Dump the raw bitmap to a struct seq_file.
+ * @sb: Bitmap to show.
+ * @m: struct seq_file to write to.
+ *
+ * This is intended for debugging. The output isn't guaranteed to be internally
+ * consistent.
+ */
+void sbitmap_bitmap_show(struct sbitmap *sb, struct seq_file *m);
+
+/**
  * sbitmap_queue_init_node() - Initialize a &struct sbitmap_queue on a specific
  * memory node.
  * @sbq: Bitmap queue to initialize.
@@ -370,4 +389,13 @@ static inline struct sbq_wait_state *sbq_wait_ptr(struct sbitmap_queue *sbq,
  */
 void sbitmap_queue_wake_all(struct sbitmap_queue *sbq);
 
+/**
+ * sbitmap_queue_show() - Dump bitmap queue information to a struct seq_file.
+ * @sbq: Bitmap queue to show.
+ * @m: struct seq_file to write to.
+ *
+ * This is intended for debugging. The format may change at any time.
+ */
+void sbitmap_queue_show(struct sbitmap_queue *sbq, struct seq_file *m);
+
 #endif /* __LINUX_SCALE_BITMAP_H */
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 8f5c3b268c77..48cb6b5a8e7e 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -17,6 +17,7 @@
 
 #include <linux/random.h>
 #include <linux/sbitmap.h>
+#include <linux/seq_file.h>
 
 int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
  gfp_t flags, int node)
@@ -180,6 +181,45 @@ unsigned int sbitmap_weight(const struct sbitmap *sb)
 }
 EXPORT_SYMBOL_GPL(sbitmap_weight);
 
+void sbitmap_show(struct sbitmap *sb, struct seq_file *m)
+{
+   seq_printf(m, "depth=%u\n", sb->depth);
+   seq_printf(m, "weight=%u\n", sbitmap_weight(sb));
+   seq_printf(m, "bits_per_word=%u\n", 1U << sb->shift);
+   seq_printf(m, "map_nr=%u\n", sb->map_nr);
+}
+EXPORT_SYMBOL_GPL(sbitmap_show);
+
+void sbitmap_bitmap_show(struct sbitmap *sb, struct seq_file *m)
+{
+   u8 byte = 0;
+   unsigned int byte_bits = 0;
+   int i;
+
+   for (i = 0; i < sb->map_nr; i++) {
+   unsigned long word = READ_ONCE(sb->map[i].word);
+   unsigned int word_bits = READ_ONCE(sb->map[i].depth);
+
+   while (word_bits > 0) {
+   unsigned int bits = min(8 - byte_bits, word_bits);
+
+   byte |= (word & (BIT(bits) - 1)) << byte_bits;
+   byte_bits += bits;
+   if (byte_bits == 8) {
+   seq_write(m, &byte, sizeof(byte));
+   byte = 0;
+   byte_bits = 0;
+   }
+   word >>= bits;
+   word_bits -= bits;
+   }
+   }
+
+   if (byte_bits)
+   seq_write(m, &byte, sizeof(byte));
+}
+EXPORT_SYMBOL_GPL(sbitmap_bitmap_show);
+
 static unsigned int sbq_calc_wake_batch(unsigned int depth)
 {
unsigned int wake_batch;
@@ -377,3 +417,37 @@ void sbitmap_queue_wake_all(struct sbitmap_queue *sbq)
}
 }
 EXPORT_SYMBOL_GPL(sbitmap_queue_wake_all);
+
+void sbitmap_queue_show(struct sbitmap_queue *sbq, struct seq_file *m)
+{
+   bool first;
+   int i;
+
+   sbitmap_show(&sbq->sb, m);
+
+   seq_puts(m, "alloc_hint={");
+   first = true;
+   for_each_possible_cpu(i) {
+   if (!first)
+   seq_puts(m, ", ");
+   first = false;
+   seq_printf(m, "%u", *per_cpu_ptr(sbq->alloc_hint, i));
+   }
+   seq_puts(m, "}\n");
+
+   seq_printf(m, "wake_batch=%u\n", sbq->wake_batch);
+   seq_printf(m, "wake_index=%d\n", atomic_read(>wake_index));
+
+   seq_puts(m, "ws={\n");
+   for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
+   struct sbq_wait_state *ws = &sbq->ws[i];
+
+   seq_printf(m, "\t{.wait_cnt=%d, .wait=%s},\n",
+  atomic_read(&ws->wait_cnt),
+  waitqueue_active(&ws->wait) ? "active" : "inactive");
+   }
+   seq_puts(m, "}\n");
+
+   seq_printf(m, "round_robin=%d\n", sbq->round_robin);
+}
+EXPORT_SYMBOL_GPL(sbitmap_queue_show);
-- 
2.11.0

--
To unsubscribe from this list: 
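
A usage sketch to go with the new helpers: a driver that owns a struct
sbitmap_queue could expose it through debugfs roughly as below. The
"mydrv" identifiers and the attribute name are hypothetical, not part
of the posted series.

/* Hedged sketch: wiring sbitmap_queue_show() into a debugfs attribute.
 * mydrv_tags is assumed to be initialized elsewhere in the driver.
 */
#include <linux/debugfs.h>
#include <linux/sbitmap.h>
#include <linux/seq_file.h>

static struct sbitmap_queue mydrv_tags;

static int mydrv_tags_show(struct seq_file *m, void *v)
{
	sbitmap_queue_show(&mydrv_tags, m);
	return 0;
}

static int mydrv_tags_open(struct inode *inode, struct file *file)
{
	return single_open(file, mydrv_tags_show, inode->i_private);
}

static const struct file_operations mydrv_tags_fops = {
	.open		= mydrv_tags_open,
	.read		= seq_read,
	.llseek		= seq_lseek,
	.release	= single_release,
};

/* in the driver's init path, with mydrv_dir a debugfs dentry:
 * debugfs_create_file("tags", 0400, mydrv_dir, NULL, &mydrv_tags_fops);
 */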

[PATCH v2 08/10] blk-mq: add tags and sched_tags bitmaps to debugfs

2017-01-25 Thread Omar Sandoval
From: Omar Sandoval 

These can be used to debug issues like tag leaks and stuck requests.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Omar Sandoval 
---
 block/blk-mq-debugfs.c | 50 ++
 1 file changed, 50 insertions(+)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 9f811007cf4f..9b87b15619f1 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -197,6 +197,30 @@ static const struct file_operations hctx_tags_fops = {
.release= single_release,
 };
 
+static int hctx_tags_bitmap_show(struct seq_file *m, void *v)
+{
+   struct blk_mq_hw_ctx *hctx = m->private;
+   struct request_queue *q = hctx->queue;
+
+   mutex_lock(&q->sysfs_lock);
+   if (hctx->tags)
+   sbitmap_bitmap_show(&hctx->tags->bitmap_tags.sb, m);
+   mutex_unlock(&q->sysfs_lock);
+   return 0;
+}
+
+static int hctx_tags_bitmap_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, hctx_tags_bitmap_show, inode->i_private);
+}
+
+static const struct file_operations hctx_tags_bitmap_fops = {
+   .open   = hctx_tags_bitmap_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+
 static int hctx_sched_tags_show(struct seq_file *m, void *v)
 {
struct blk_mq_hw_ctx *hctx = m->private;
@@ -222,6 +246,30 @@ static const struct file_operations hctx_sched_tags_fops = {
.release= single_release,
 };
 
+static int hctx_sched_tags_bitmap_show(struct seq_file *m, void *v)
+{
+   struct blk_mq_hw_ctx *hctx = m->private;
+   struct request_queue *q = hctx->queue;
+
+   mutex_lock(&q->sysfs_lock);
+   if (hctx->sched_tags)
+   sbitmap_bitmap_show(&hctx->sched_tags->bitmap_tags.sb, m);
+   mutex_unlock(&q->sysfs_lock);
+   return 0;
+}
+
+static int hctx_sched_tags_bitmap_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, hctx_sched_tags_bitmap_show, inode->i_private);
+}
+
+static const struct file_operations hctx_sched_tags_bitmap_fops = {
+   .open   = hctx_sched_tags_bitmap_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+
 static void *ctx_rq_list_start(struct seq_file *m, loff_t *pos)
 {
struct blk_mq_ctx *ctx = m->private;
@@ -269,7 +317,9 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
{"dispatch", 0400, _dispatch_fops},
{"ctx_map", 0400, _ctx_map_fops},
{"tags", 0400, _tags_fops},
+   {"tags_bitmap", 0400, _tags_bitmap_fops},
{"sched_tags", 0400, _sched_tags_fops},
+   {"sched_tags_bitmap", 0400, _sched_tags_bitmap_fops},
 };
 
 static const struct blk_mq_debugfs_attr blk_mq_debugfs_ctx_attrs[] = {
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
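
Since sbitmap_bitmap_show() writes the raw bitmap as binary bytes
rather than text, counting busy tags from userspace takes a small
decoding step. A sketch, assuming the directory layout created by this
series (the exact path depends on the device):

/* Hedged sketch: count set bits (tags in use) in a tags_bitmap file. */
#include <stdio.h>

int main(void)
{
	const char *path = "/sys/kernel/debug/block/nvme0n1/mq/0/tags_bitmap";
	FILE *f = fopen(path, "rb");
	int c, busy = 0;

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* sbitmap_bitmap_show() zero-pads the final partial byte, so a
	 * straight popcount over the stream counts exactly the set bits.
	 */
	while ((c = fgetc(f)) != EOF)
		busy += __builtin_popcount((unsigned char)c);
	fclose(f);
	printf("tags in use: %d\n", busy);
	return 0;
}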


[PATCH v2 00/10] blk-mq: move debugging information from sysfs to debugfs

2017-01-25 Thread Omar Sandoval
From: Omar Sandoval 

Changes from v1:

- Make the sbitmap seq_file helpers take a (struct sbitmap *) instead of a
  (void *), since it's not possible to use them directly as the seq_file
  show helper, anyways
- Fix a crash when reading ctx_map because it was attempting to do
  exactly that ^

Cover letter from v1:

This series ends our abuse of sysfs and puts all of the debugging information
in debugfs instead. This has a few benefits:

1. Removes the possibility of userspace being stupid and relying on something
   in sysfs that we only exposed for debugging.
2. Lifts the limitations of sysfs, including the one-value-per-file convention
   and maximum of one page of output.
3. Allows us to add more debugging information that we often want but don't
   have when debugging a live system.

Thanks!
Omar

Omar Sandoval (10):
  blk-mq: create debugfs directory tree
  blk-mq: add hctx->{state,flags} to debugfs
  blk-mq: move hctx->dispatch and ctx->rq_list from sysfs to debugfs
  blk-mq: add extra request information to debugfs
  sbitmap: add helpers for dumping to a seq_file
  blk-mq: export software queue pending map to debugfs
  blk-mq: move tags and sched_tags info from sysfs to debugfs
  blk-mq: add tags and sched_tags bitmaps to debugfs
  blk-mq: move hctx io_poll, stats, and dispatched from sysfs to debugfs
  blk-mq: move hctx and ctx counters from sysfs to debugfs

 block/Makefile          |   1 +
 block/blk-mq-debugfs.c  | 756 
 block/blk-mq-sysfs.c    | 248 ++--
 block/blk-mq-tag.c      |  27 --
 block/blk-mq-tag.h      |   1 -
 block/blk-mq.c          |   2 +
 block/blk-mq.h          |  33 +++
 include/linux/blkdev.h  |   5 +
 include/linux/sbitmap.h |  28 ++
 lib/sbitmap.c           |  74 +
 10 files changed, 920 insertions(+), 255 deletions(-)
 create mode 100644 block/blk-mq-debugfs.c

-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 03/10] blk-mq: move hctx->dispatch and ctx->rq_list from sysfs to debugfs

2017-01-25 Thread Omar Sandoval
From: Omar Sandoval 

These lists are only useful for debugging; they definitely don't belong
in sysfs. Putting them in debugfs also removes the limitation of a
single page of output.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Omar Sandoval 
---
 block/blk-mq-debugfs.c | 106 +
 block/blk-mq-sysfs.c   |  57 --
 2 files changed, 106 insertions(+), 57 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 0c14511fa9e0..1967c06d80e0 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -29,6 +29,20 @@ struct blk_mq_debugfs_attr {
 
 static struct dentry *block_debugfs_root;
 
+static int blk_mq_debugfs_seq_open(struct inode *inode, struct file *file,
+  const struct seq_operations *ops)
+{
+   struct seq_file *m;
+   int ret;
+
+   ret = seq_open(file, ops);
+   if (!ret) {
+   m = file->private_data;
+   m->private = inode->i_private;
+   }
+   return ret;
+}
+
 static int hctx_state_show(struct seq_file *m, void *v)
 {
struct blk_mq_hw_ctx *hctx = m->private;
@@ -69,12 +83,104 @@ static const struct file_operations hctx_flags_fops = {
.release= single_release,
 };
 
+static int blk_mq_debugfs_rq_show(struct seq_file *m, void *v)
+{
+   struct request *rq = list_entry_rq(v);
+
+   seq_printf(m, "%p\n", rq);
+   return 0;
+}
+
+static void *hctx_dispatch_start(struct seq_file *m, loff_t *pos)
+{
+   struct blk_mq_hw_ctx *hctx = m->private;
+
+   spin_lock(&hctx->lock);
+   return seq_list_start(&hctx->dispatch, *pos);
+}
+
+static void *hctx_dispatch_next(struct seq_file *m, void *v, loff_t *pos)
+{
+   struct blk_mq_hw_ctx *hctx = m->private;
+
+   return seq_list_next(v, &hctx->dispatch, pos);
+}
+
+static void hctx_dispatch_stop(struct seq_file *m, void *v)
+{
+   struct blk_mq_hw_ctx *hctx = m->private;
+
+   spin_unlock(&hctx->lock);
+}
+
+static const struct seq_operations hctx_dispatch_seq_ops = {
+   .start  = hctx_dispatch_start,
+   .next   = hctx_dispatch_next,
+   .stop   = hctx_dispatch_stop,
+   .show   = blk_mq_debugfs_rq_show,
+};
+
+static int hctx_dispatch_open(struct inode *inode, struct file *file)
+{
+   return blk_mq_debugfs_seq_open(inode, file, &hctx_dispatch_seq_ops);
+}
+
+static const struct file_operations hctx_dispatch_fops = {
+   .open   = hctx_dispatch_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= seq_release,
+};
+
+static void *ctx_rq_list_start(struct seq_file *m, loff_t *pos)
+{
+   struct blk_mq_ctx *ctx = m->private;
+
+   spin_lock(&ctx->lock);
+   return seq_list_start(&ctx->rq_list, *pos);
+}
+
+static void *ctx_rq_list_next(struct seq_file *m, void *v, loff_t *pos)
+{
+   struct blk_mq_ctx *ctx = m->private;
+
+   return seq_list_next(v, &ctx->rq_list, pos);
+}
+
+static void ctx_rq_list_stop(struct seq_file *m, void *v)
+{
+   struct blk_mq_ctx *ctx = m->private;
+
+   spin_unlock(&ctx->lock);
+}
+
+static const struct seq_operations ctx_rq_list_seq_ops = {
+   .start  = ctx_rq_list_start,
+   .next   = ctx_rq_list_next,
+   .stop   = ctx_rq_list_stop,
+   .show   = blk_mq_debugfs_rq_show,
+};
+
+static int ctx_rq_list_open(struct inode *inode, struct file *file)
+{
+   return blk_mq_debugfs_seq_open(inode, file, &ctx_rq_list_seq_ops);
+}
+
+static const struct file_operations ctx_rq_list_fops = {
+   .open   = ctx_rq_list_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= seq_release,
+};
+
 static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
{"state", 0400, _state_fops},
{"flags", 0400, _flags_fops},
+   {"dispatch", 0400, _dispatch_fops},
 };
 
 static const struct blk_mq_debugfs_attr blk_mq_debugfs_ctx_attrs[] = {
+   {"rq_list", 0400, _rq_list_fops},
 };
 
 int blk_mq_debugfs_register(struct request_queue *q, const char *name)
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 91b93ca1c1e0..ee3694d8c4ee 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -139,41 +139,6 @@ static ssize_t blk_mq_sysfs_completed_show(struct blk_mq_ctx *ctx, char *page)
ctx->rq_completed[0]);
 }
 
-static ssize_t sysfs_list_show(char *page, struct list_head *list, char *msg)
-{
-   struct request *rq;
-   int len = snprintf(page, PAGE_SIZE - 1, "%s:\n", msg);
-
-   list_for_each_entry(rq, list, queuelist) {
-   const int rq_len = 2 * sizeof(rq) + 2;
-
-   /* if the output will be truncated */
-   if (PAGE_SIZE - 1 < len + rq_len) {
-   /* backspacing if it can't hold '\t...\n' */
-   if (PAGE_SIZE - 1 < len + 5)
-   len 

Re: [PATCH 02/10] blk-mq: add hctx->{state,flags} to debugfs

2017-01-25 Thread Omar Sandoval
On Tue, Jan 24, 2017 at 02:25:39PM +0100, Hannes Reinecke wrote:
> On 01/23/2017 07:59 PM, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > hctx->state could come in handy for bugs where the hardware queue gets
> > stuck in the stopped state, and hctx->flags is just useful to know.
> > 
> > Signed-off-by: Omar Sandoval 
> > ---
> >  block/blk-mq-debugfs.c | 42 ++
> >  1 file changed, 42 insertions(+)
> > 
> Hehe. I've found myself adding exactly the same attributes when
> debugging mq-deadline :-)
> 
> > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> > index 01711bbf5ade..0c14511fa9e0 100644
> > --- a/block/blk-mq-debugfs.c
> > +++ b/block/blk-mq-debugfs.c
> > @@ -29,7 +29,49 @@ struct blk_mq_debugfs_attr {
> >  
> >  static struct dentry *block_debugfs_root;
> >  
> > +static int hctx_state_show(struct seq_file *m, void *v)
> > +{
> > +   struct blk_mq_hw_ctx *hctx = m->private;
> > +
> > +   seq_printf(m, "0x%lx\n", hctx->state);
> > +   return 0;
> > +}
> > +
> What about decoding it?
> Would make life so much easier, and one doesn't have to have a kernel
> source tree available when looking things up...

I thought of this, too, but doing that would require setting up a
mapping from flag values to strings and keeping those in sync with the
actual flags, which I don't think is worth the trouble, since I imagine
that most of the time you're going to be consulting the source to debug
this kind of issue, anyways.

Thanks for the review!
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
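
For readers who do want decoded output, the table-driven approach
Hannes suggests might look roughly like this; the name table and the
function name are hypothetical, and the BLK_MQ_S_* values are the ones
from include/linux/blk-mq.h at the time of this series:

/* Hedged sketch: print hctx->state flag names instead of a hex mask. */
static const char *const hctx_state_name[] = {
	[BLK_MQ_S_STOPPED]	 = "STOPPED",
	[BLK_MQ_S_TAG_ACTIVE]	 = "TAG_ACTIVE",
	[BLK_MQ_S_SCHED_RESTART] = "SCHED_RESTART",
};

static int hctx_state_decode_show(struct seq_file *m, void *v)
{
	struct blk_mq_hw_ctx *hctx = m->private;
	unsigned int i;

	for (i = 0; i < ARRAY_SIZE(hctx_state_name); i++) {
		if (test_bit(i, &hctx->state))
			seq_printf(m, "%s ", hctx_state_name[i]);
	}
	seq_puts(m, "\n");
	return 0;
}

This is exactly the kept-in-sync burden Omar refers to: every new
BLK_MQ_S_* bit would need a matching entry in the table.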


Re: [PATCH] queue stall with blk-mq-sched

2017-01-25 Thread Jens Axboe
On 01/25/2017 04:10 AM, Hannes Reinecke wrote:
> On 01/25/2017 09:07 AM, Hannes Reinecke wrote:
>> On 01/25/2017 08:39 AM, Hannes Reinecke wrote:
>>> On 01/24/2017 11:06 PM, Jens Axboe wrote:
 On 01/24/2017 12:55 PM, Jens Axboe wrote:
> Try this patch. We only want to bump it for the driver tags, not the
> scheduler side.

 More complete version, this one actually tested. I think this should fix
 your issue, let me know.

>>> Nearly there.
>>> The initial stall is gone, but the test got hung at the 'stonewall'
>>> sequence again:
>>>
>>> [global]
>>> bs=4k
>>> ioengine=libaio
>>> iodepth=256
>>> size=4g
>>> direct=1
>>> runtime=60
>>> # directory=/mnt
>>> numjobs=32
>>> group_reporting
>>> cpus_allowed_policy=split
>>> filename=/dev/md127
>>>
>>> [seq-read]
>>> rw=read
>>> -> stonewall
>>>
>>> [rand-read]
>>> rw=randread
>>> stonewall
>>>
>>> Restarting all queues made the fio job continue.
>>> There were 4 queues with state 'restart', and one queue with state 'active'.
>>> So we're missing a queue run somewhere.
>>>
>> I've found the queue stalls are gone with this patch:
>>
>> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
>> index 6b465bc..de5db6c 100644
>> --- a/block/blk-mq-sched.h
>> +++ b/block/blk-mq-sched.h
>> @@ -113,6 +113,15 @@ static inline void blk_mq_sched_put_rq_priv(struct
>> request_queue *q,
>>  }
>>
>>  static inline void
>> +blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
>> +{
>> +   if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
>> +   clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
>> +   blk_mq_run_hw_queue(hctx, true);
>> +   }
>> +}
>> +
>> +static inline void
>>  blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct
>> request *rq)
>>  {
>> struct elevator_queue *e = hctx->queue->elevator;
>> @@ -123,11 +132,6 @@ static inline void blk_mq_sched_put_rq_priv(struct
>> request_queue *q,
>> BUG_ON(rq->internal_tag == -1);
>>
>> blk_mq_put_tag(hctx, hctx->sched_tags, rq->mq_ctx,
>> rq->internal_tag);
>> -
>> -   if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
>> -   clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
>> -   blk_mq_run_hw_queue(hctx, true);
>> -   }
>>  }
>>
>>  static inline void blk_mq_sched_started_request(struct request *rq)
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index e872555..63799ad 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -345,6 +345,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx
>> *hctx, struct blk_mq_ctx *ctx,
>> blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
>> if (sched_tag != -1)
>> blk_mq_sched_completed_request(hctx, rq);
>> +   blk_mq_sched_restart(hctx);
>> blk_queue_exit(q);
>>  }
>>
> Bah.
> 
> Not quite. I'm still seeing some queues with state 'restart'.
> 
> I've found that I need another patch on top of that:
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e872555..edcbb44 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -753,8 +754,10 @@ static void blk_mq_timeout_work(struct work_struct
> *work)
> 
> queue_for_each_hw_ctx(q, hctx, i) {
> /* the hctx may be unmapped, so check it here */
> -   if (blk_mq_hw_queue_mapped(hctx))
> +   if (blk_mq_hw_queue_mapped(hctx)) {
> blk_mq_tag_idle(hctx);
> +   blk_mq_sched_restart(hctx);
> +   }
> }
> }
> blk_queue_exit(q);
> 
> 
> Reasoning is that in blk_mq_get_tag() we might end up scheduling the
> request on another hctx, but the original hctx might still have the
> SCHED_RESTART bit set.
> Which will never cleared as we complete the request on a different hctx,
> so anything we do on the end_request side won't do us any good.

I think you are right, it'll potentially trigger with shared tags and
multiple hardware queues. I'll debug this today and come up with a
decent fix.

I committed the previous patch, fwiw.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
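
One possible shape for that fix, building on the blk_mq_sched_restart()
helper quoted above: when the tag set is shared, kick every hardware
queue of the request queue rather than only the completing one. This is
a sketch of the direction, not the committed patch:

/* Hedged sketch: with shared tags, a sibling hctx may be the one that
 * set SCHED_RESTART while starved of tags, so restart all of them.
 */
static void blk_mq_sched_restart_queues_sketch(struct blk_mq_hw_ctx *hctx)
{
	struct request_queue *q = hctx->queue;
	unsigned int i;

	if (!(hctx->flags & BLK_MQ_F_TAG_SHARED)) {
		blk_mq_sched_restart(hctx);
		return;
	}

	queue_for_each_hw_ctx(q, hctx, i)
		blk_mq_sched_restart(hctx);
}

A complete fix may also have to consider other request queues sharing
the same tag set, which this sketch does not cover.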


Re: [PATCH 4/4] nbd: add a nbd-control interface

2017-01-25 Thread Arnd Bergmann
On Wed, Jan 25, 2017 at 2:47 PM, Josef Bacik  wrote:
> On Tue, Jan 24, 2017 at 2:11 AM, Greg KH  wrote:
>> On Mon, Jan 23, 2017 at 10:52:42AM -0500, Josef Bacik wrote:
>>> On Mon, Jan 23, 2017 at 10:03 AM, Greg KH wrote:
>>> Loop accomplishes this with the /dev/loop-control and an ioctl.  Then
>>> we also need a way to figure out what is the first /dev/nbd# device
>>> that isn't currently in use in order to pick the next one to
>>> configure.  Keep in mind that all of the configuration for /dev/nbd#
>>> devices is done through ioctls to those devices, so having an ioctl
>>> interface for the control device is consistent with the rest of how
>>> NBD works.
>>
>>
>> But adding a random char device node and using ioctls on it is different
>> than using an ioctl on an existing block device node that is already
>> there.
>>
>> So what _exactly_ do you need to do with this interface?  What data do
>> you need sent/received to/from the kernel?  Specifics please.
>
>
> I need to say "Hey kernel, can you setup some new nbd devices for me?" and
> "Hey kernel, what is the first free nbd device I can use for this new
> connection and (optionally) create a new device for me if one does not
> exist?"  Loop accomplished this (in 2011 btw, not some far off date) with
> /dev/loop-control.  Btrfs has its scanning stuff controlled by
> /dev/btrfs-control.  Device mapper has /dev/mapper/control.  This is a well
> established and widely used way to control block-based device drivers.

We have multiple established ways to deal with this kind of problem, the most
common ones being a separate chardev as you propose, a sysfs interface and
netlink.

They all have their own set of problems, but the needs of nbd as a network
storage interface seem most closely resemble what we have for other network
related interfaces, where we typically use netlink to do the setup, see e.g.
macvtap as an example for a network chardev interface.

Can you have a look if that would solve your problem more nicely?

 Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
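
A minimal sketch of the generic-netlink direction Arnd suggests, for
comparison with the ioctl approach. All command and attribute names
here are hypothetical, and genl structure layout varies across kernel
versions:

#include <net/genetlink.h>

enum { NBD_SK_ATTR_UNSPEC, NBD_SK_ATTR_INDEX, __NBD_SK_ATTR_MAX };
#define NBD_SK_ATTR_MAX (__NBD_SK_ATTR_MAX - 1)

enum { NBD_SK_CMD_UNSPEC, NBD_SK_CMD_ADD };

static const struct nla_policy nbd_sk_policy[NBD_SK_ATTR_MAX + 1] = {
	[NBD_SK_ATTR_INDEX] = { .type = NLA_U32 },
};

static int nbd_sk_add(struct sk_buff *skb, struct genl_info *info)
{
	/* would allocate a new /dev/nbd# and reply with its index */
	return 0;
}

static const struct genl_ops nbd_sk_ops[] = {
	{
		.cmd	= NBD_SK_CMD_ADD,
		.doit	= nbd_sk_add,
		.policy	= nbd_sk_policy,
	},
};

static struct genl_family nbd_sk_family = {
	.name		= "nbd-sketch",
	.version	= 1,
	.maxattr	= NBD_SK_ATTR_MAX,
	.module		= THIS_MODULE,
	.ops		= nbd_sk_ops,
	.n_ops		= ARRAY_SIZE(nbd_sk_ops),
};

/* module init would call genl_register_family(&nbd_sk_family); */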


Re: [PATCH 4/4] nbd: add a nbd-control interface

2017-01-25 Thread Josef Bacik
On Tue, Jan 24, 2017 at 2:11 AM, Greg KH wrote:
> On Mon, Jan 23, 2017 at 10:52:42AM -0500, Josef Bacik wrote:
>> On Mon, Jan 23, 2017 at 10:03 AM, Greg KH wrote:
>>> On Mon, Jan 23, 2017 at 09:57:52AM -0500, Josef Bacik wrote:
>>>> On Mon, Jan 23, 2017 at 9:52 AM, Greg KH wrote:
>>>>> On Mon, Jan 23, 2017 at 09:42:08AM -0500, Josef Bacik wrote:
>>>>>> On Sat, Jan 21, 2017 at 4:05 AM, Greg KH wrote:
>>>>>>> On Fri, Jan 20, 2017 at 04:56:52PM -0500, Josef Bacik wrote:
>>>>>>>> This patch mirrors the loop back device behavior with a few
>>>>>>>> changes.  First there is no DEL operation as NBD doesn't get as
>>>>>>>> much churn as loop devices do.  Secondly the GET_NEXT operation
>>>>>>>> can optionally create a new NBD device or not.  Our
>>>>>>>> infrastructure people want to not allow NBD to create new
>>>>>>>> devices as it causes problems for them in containers.  However
>>>>>>>> allow this to be optional as things like the OSS NBD client
>>>>>>>> probably doesn't care and would like to just be given a device
>>>>>>>> to use.
>>>>>>>>
>>>>>>>> Signed-off-by: Josef Bacik
>>>>>>>
>>>>>>> A random char device with odd ioctls?  Why?  There's no other
>>>>>>> configuration choice you could possibly use?  Where is the
>>>>>>> userspace tool that uses this new kernel api?
>>>>>>>
>>>>>>> You aren't passing in structures to the ioctl, so why does this
>>>>>>> HAVE to be an ioctl?
>>>>>>
>>>>>> Again, this is how loop does it so I assumed a known, regularly
>>>>>> used API was the best bet.  I can do literally anything, but these
>>>>>> interfaces have to be used by other people, including internal
>>>>>> people.  The /dev/whatever-control is a well established way for
>>>>>> interacting with dynamic device drivers (loop, DM, btrfs), so
>>>>>> that's what I went with.  Thanks,
>>>>>
>>>>> Again, please don't duplicate what loop did, we must _learn_ from
>>>>> history, not repeat it :(
>>>>
>>>> Sure but what am I supposed to do?  Have some random sysfs knobs?
>>>> Thanks,
>>>
>>> It all depends on what you are trying to do.  I have yet to figure
>>> that out at all here :(
>>
>> I explained it in the changelog and my response to Wouter.  NBD
>> preallocates all of its /dev/nbd# devices at modprobe time, so
>> there's no way to add new devices as we need them.
>
> Why not fix that odd restriction?

That's the entire point of this patchset, adding the ability to create
new devices on the fly as needed.

>> Loop accomplishes this with the /dev/loop-control and an ioctl.  Then
>> we also need a way to figure out what is the first /dev/nbd# device
>> that isn't currently in use in order to pick the next one to
>> configure.  Keep in mind that all of the configuration for /dev/nbd#
>> devices is done through ioctls to those devices, so having an ioctl
>> interface for the control device is consistent with the rest of how
>> NBD works.
>
> But adding a random char device node and using ioctls on it is
> different than using an ioctl on an existing block device node that is
> already there.
>
> So what _exactly_ do you need to do with this interface?  What data do
> you need sent/received to/from the kernel?  Specifics please.

I need to say "Hey kernel, can you setup some new nbd devices for me?"
and "Hey kernel, what is the first free nbd device I can use for this
new connection and (optionally) create a new device for me if one does
not exist?"  Loop accomplished this (in 2011 btw, not some far off
date) with /dev/loop-control.  Btrfs has its scanning stuff controlled
by /dev/btrfs-control.  Device mapper has /dev/mapper/control.  This is
a well established and widely used way to control block-based device
drivers.  Thanks,

Josef

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
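
For contrast, the /dev/nbd-control style node Josef describes is only a
few lines with the misc-device API. A sketch with made-up ioctl numbers
and stubbed-out command bodies:

#include <linux/fs.h>
#include <linux/miscdevice.h>
#include <linux/module.h>

#define NBD_CTRL_ADD	  _IO(0xab, 0x00)	/* hypothetical numbers */
#define NBD_CTRL_GET_FREE _IO(0xab, 0x01)

static long nbd_control_ioctl(struct file *file, unsigned int cmd,
			      unsigned long arg)
{
	switch (cmd) {
	case NBD_CTRL_ADD:
		return -ENOSYS;	/* would create a new nbd device */
	case NBD_CTRL_GET_FREE:
		return -ENOSYS;	/* would return the first unused index */
	}
	return -ENOTTY;
}

static const struct file_operations nbd_control_fops = {
	.owner		= THIS_MODULE,
	.unlocked_ioctl	= nbd_control_ioctl,
};

static struct miscdevice nbd_control_misc = {
	.minor	= MISC_DYNAMIC_MINOR,
	.name	= "nbd-control",
	.fops	= &nbd_control_fops,
};

/* module init would call misc_register(&nbd_control_misc); */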


Re: [PATCH] queue stall with blk-mq-sched

2017-01-25 Thread Hannes Reinecke
On 01/25/2017 09:07 AM, Hannes Reinecke wrote:
> On 01/25/2017 08:39 AM, Hannes Reinecke wrote:
>> On 01/24/2017 11:06 PM, Jens Axboe wrote:
>>> On 01/24/2017 12:55 PM, Jens Axboe wrote:
 Try this patch. We only want to bump it for the driver tags, not the
 scheduler side.
>>>
>>> More complete version, this one actually tested. I think this should fix
>>> your issue, let me know.
>>>
>> Nearly there.
>> The initial stall is gone, but the test got hung at the 'stonewall'
>> sequence again:
>>
>> [global]
>> bs=4k
>> ioengine=libaio
>> iodepth=256
>> size=4g
>> direct=1
>> runtime=60
>> # directory=/mnt
>> numjobs=32
>> group_reporting
>> cpus_allowed_policy=split
>> filename=/dev/md127
>>
>> [seq-read]
>> rw=read
>> -> stonewall
>>
>> [rand-read]
>> rw=randread
>> stonewall
>>
>> Restarting all queues made the fio job continue.
>> There were 4 queues with state 'restart', and one queue with state 'active'.
>> So we're missing a queue run somewhere.
>>
> I've found the queue stalls are gone with this patch:
> 
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index 6b465bc..de5db6c 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -113,6 +113,15 @@ static inline void blk_mq_sched_put_rq_priv(struct
> request_queue *q,
>  }
> 
>  static inline void
> +blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
> +{
> +   if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
> +   clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> +   blk_mq_run_hw_queue(hctx, true);
> +   }
> +}
> +
> +static inline void
>  blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct
> request *rq)
>  {
> struct elevator_queue *e = hctx->queue->elevator;
> @@ -123,11 +132,6 @@ static inline void blk_mq_sched_put_rq_priv(struct
> request_queue *q,
> BUG_ON(rq->internal_tag == -1);
> 
> blk_mq_put_tag(hctx, hctx->sched_tags, rq->mq_ctx,
> rq->internal_tag);
> -
> -   if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
> -   clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> -   blk_mq_run_hw_queue(hctx, true);
> -   }
>  }
> 
>  static inline void blk_mq_sched_started_request(struct request *rq)
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e872555..63799ad 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -345,6 +345,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx
> *hctx, struct blk_mq_ctx *ctx,
> blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
> if (sched_tag != -1)
> blk_mq_sched_completed_request(hctx, rq);
> +   blk_mq_sched_restart(hctx);
> blk_queue_exit(q);
>  }
> 
Bah.

Not quite. I'm still seeing some queues with state 'restart'.

I've found that I need another patch on top of that:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e872555..edcbb44 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -753,8 +754,10 @@ static void blk_mq_timeout_work(struct work_struct *work)

queue_for_each_hw_ctx(q, hctx, i) {
/* the hctx may be unmapped, so check it here */
-   if (blk_mq_hw_queue_mapped(hctx))
+   if (blk_mq_hw_queue_mapped(hctx)) {
blk_mq_tag_idle(hctx);
+   blk_mq_sched_restart(hctx);
+   }
}
}
blk_queue_exit(q);


Reasoning is that in blk_mq_get_tag() we might end up scheduling the
request on another hctx, but the original hctx might still have the
SCHED_RESTART bit set.
Which will never cleared as we complete the request on a different hctx,
so anything we do on the end_request side won't do us any good.

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Lsf-pc] [LSF/MM TOPIC] block level event logging for storage media management

2017-01-25 Thread Jan Kara
On Tue 24-01-17 15:18:57, Oleg Drokin wrote:
> 
> On Jan 23, 2017, at 2:27 AM, Dan Williams wrote:
> 
> > [ adding Oleg ]
> > 
> > On Sun, Jan 22, 2017 at 10:00 PM, Song Liu  wrote:
> >> Hi Dan,
> >> 
> >> I think the the block level event log is more like log only system. When 
> >> en event
> >> happens,  it is not necessary to take immediate action. (I guess this is 
> >> different
> >> to bad block list?).
> >> 
> >> I would hope the event log to track more information. Some of these 
> >> individual
> >> event may not be very interesting, for example, soft error or latency 
> >> outliers.
> >> However, when we gather event log for a fleet of devices, these "soft 
> >> event"
> >> may become valuable for health monitoring.
> > 
> > I'd be interested in this. It sounds like you're trying to fill a gap
> > between tracing and console log messages which I believe others have
> > encountered as well.
> 
> We have a somewhat similar problem in Lustre, and I guess it's not
> just Lustre.  Currently there is all sorts of conditional debug code all
> over the place that goes to the console, and when you enable it for
> anything verbose, you quickly overflow your dmesg buffer no matter the
> size.  That might be mostly ok for local "block level" stuff, but once you
> become distributed, it starts to be a mess, and once you get to be super
> large it worsens even more, since you need to somehow coordinate data from
> multiple nodes, ensure none of it is lost, and still you don't end up
> using most of it, since only a few nodes end up being useful.  (I don't
> know how the NFS people manage to debug complicated issues using just
> this; it can't be super easy.)
> 
> Having some sort of buffer of a (potentially very) large size that
> could store the data until it's needed, or be eagerly polled by some
> daemon for storage (helpful when you expect a lot of data that definitely
> won't fit in RAM), would go a long way.
> 
> Tracepoints have the buffer and the daemon, but creating new messages is
> very cumbersome, so converting every debug message into one does not look
> very feasible.  Also it's convenient to have "event masks" of what one
> wants logged, which I don't think you could do with tracepoints.

So creating tracepoints IMO isn't that cumbersome. I agree that converting
hundreds or thousands of debug printks into tracepoints is a pain in the
ass, but it is still doable. WRT filtering, you can enable each tracepoint
individually. Granted, that is not exactly the 'event mask' feature you ask
about, but that can be easily scripted in userspace if you give some
structure to tracepoint names. Finally, tracepoints provide a fine-grained
control you never get with printk - e.g. with trace filters you can make a
tracepoint trigger only if a specific inode is involved, which greatly
reduces the amount of output.

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
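
For a sense of scale, the boilerplate for one new tracepoint is roughly
the block below; myfs_write is a hypothetical event, and the header is
assumed to live at include/trace/events/myfs.h so the default include
magic works:

#undef TRACE_SYSTEM
#define TRACE_SYSTEM myfs

#if !defined(_TRACE_MYFS_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_MYFS_H

#include <linux/fs.h>
#include <linux/tracepoint.h>

TRACE_EVENT(myfs_write,
	TP_PROTO(struct inode *inode, size_t len),
	TP_ARGS(inode, len),
	TP_STRUCT__entry(
		__field(unsigned long, ino)
		__field(size_t, len)
	),
	TP_fast_assign(
		__entry->ino = inode->i_ino;
		__entry->len = len;
	),
	/* __entry->ino is filterable, e.g.
	 * echo 'ino == 1234' > events/myfs/myfs_write/filter
	 */
	TP_printk("ino=%lu len=%zu", __entry->ino, __entry->len)
);

#endif /* _TRACE_MYFS_H */
#include <trace/define_trace.h>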


Re: [PATCH 1/6] mmc: block: break out mmc_blk_rw_cmd_abort()

2017-01-25 Thread Mateusz Nowak

Hi Linus,

On 1/24/2017 11:17, Linus Walleij wrote:

As a first step toward breaking apart the very complex function
mmc_blk_issue_rw_rq(), we break out the command abort code.
This code assumes "ret" is != 0 and then repeatedly hammers
blk_end_request() until the block layer reports that the whole
request has been ended.

Signed-off-by: Linus Walleij 
---
 drivers/mmc/core/block.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 7bd03381810d..14efe92a14ef 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -1598,6 +1598,17 @@ static int mmc_blk_cmd_err(struct mmc_blk_data *md, 
struct mmc_card *card,
return ret;
 }

+static void mmc_blk_rw_cmd_abort(struct mmc_card *card, struct request *req)
+{
+   int ret = 1;
blk_end_request() returns bool, so maybe this variable should have a
matching type, since this is its only usage in this scope? And maybe it
should have a more meaningful name for this case?



+
+   if (mmc_card_removed(card))
+   req->rq_flags |= RQF_QUIET;
+   while (ret)
+   ret = blk_end_request(req, -EIO,
+ blk_rq_cur_bytes(req));
+}
+
 static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc)
 {
struct mmc_blk_data *md = mq->blkdata;
@@ -1737,11 +1748,7 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, 
struct request *rqc)
return 1;

  cmd_abort:
-   if (mmc_card_removed(card))
-   req->rq_flags |= RQF_QUIET;
-   while (ret)
-   ret = blk_end_request(req, -EIO,
-   blk_rq_cur_bytes(req));
+   mmc_blk_rw_cmd_abort(card, req);

  start_new_req:
if (rqc) {



Regards,
Mateusz.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
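
A sketch of the variant Mateusz hints at: dropping the int entirely and
using blk_end_request()'s bool return as the loop condition (not the
committed code, just the suggested shape):

static void mmc_blk_rw_cmd_abort(struct mmc_card *card, struct request *req)
{
	if (mmc_card_removed(card))
		req->rq_flags |= RQF_QUIET;
	/* keep ending chunks of the request until the block layer
	 * reports the whole request finished
	 */
	while (blk_end_request(req, -EIO, blk_rq_cur_bytes(req)))
		;
}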


Re: [Nbd] [PATCH 4/4] nbd: add a nbd-control interface

2017-01-25 Thread Wouter Verhelst
On Tue, Jan 24, 2017 at 08:11:52AM +0100, Greg KH wrote:
> On Mon, Jan 23, 2017 at 10:52:42AM -0500, Josef Bacik wrote:
> > I explained it in the changelog and my response to Wouter.  NBD preallocates
> > all of its /dev/nbd# devices at modprobe time, so there's no way to add new
> > devices as we need them.
> 
> Why not fix that odd restriction?

Isn't that what this patch is trying to do?

> > Loop accomplishes this with the /dev/loop-control
> > and an ioctl.  Then we also need a way to figure out what is the first
> > /dev/nbd# device that isn't currently in use in order to pick the next one
> > to configure.  Keep in mind that all of the configuration for /dev/nbd#
> > devices is done through ioctls to those devices, so having a ioctl interface
> > for the control device is consistent with the rest of how NBD works.
> 
> But adding a random char device node and using ioctls on it is different
> than using an ioctl on an existing block device node that is already
> there.
> 
> So what _exactly_ do you need to do with this interface?  What data do
> you need sent/received to/from the kernel?  Specifics please.

AIUI: Create new devices, and figure out which device is the next one
free so we can find one without having to check the pid attribute for
every device (as is needed today, and which is racy).

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHSET v4] blk-mq-scheduling framework

2017-01-25 Thread Paolo Valente

> On 23 Jan 2017, at 18:42, Jens Axboe wrote:
> 
> On 01/23/2017 10:04 AM, Paolo Valente wrote:
>> 
>>> On 18 Jan 2017, at 17:21, Jens Axboe wrote:
>>> 
>>> On 01/18/2017 08:14 AM, Paolo Valente wrote:
 according to the function blk_mq_sched_put_request, the
 mq.completed_request hook seems to always be invoked (if set) for a
 request for which the mq.put_rq_priv is invoked (if set).
>>> 
>>> Correct, any request that came out of blk_mq_sched_get_request()
>>> will always have completed called on it, regardless of whether it
>>> had IO started on it or not.
>>> 
>> 
>> It seems that some requests, after being dispatched, occasionally have
>> no mq.put_rq_priv invoked on them.  Is that expected?  If it is,
>> could you point me to the path through which the end of the life of
>> such a request is handled?
> 
> I'm guessing that's a flush request. I added RQF_QUEUED to check for
> that: if RQF_QUEUED is set, you know it has come from your get_request
> handler.
> 

Exactly, the completion-without-put_rq_priv pattern seems to occur
only for requests coming from the flusher, precisely because they have
the flag RQF_ELVPRIV unset.  Just to understand: why is this flag
unset for these requests, if they do have private elevator (bfq)
data attached?  What am I misunderstanding?

Just to be certain: this should be the only case where the
completed_request hook is invoked while the put_rq_priv is not, right?

Thanks,
Paolo

> I'm assuming that is it, let me know.
> 
> -- 
> Jens Axboe
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
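
Schematically, the asymmetry Paolo is asking about lives on the put
side: elevator-private teardown is gated on RQF_ELVPRIV, which flush
requests never acquire because they bypass the scheduler's get_request
path. A sketch of the assumed shape of that logic, not the exact tree:

static void sched_put_request_sketch(struct request *rq)
{
	struct elevator_queue *e = rq->q->elevator;

	/* only requests that went through the scheduler's get_request
	 * path tear down elevator-private data; flush requests skip it
	 */
	if (rq->rq_flags & RQF_ELVPRIV) {
		if (e->type->ops.mq.put_rq_priv)
			e->type->ops.mq.put_rq_priv(rq->q, rq);
	}

	blk_mq_finish_request(rq);
}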


Re: [PATCH] queue stall with blk-mq-sched

2017-01-25 Thread Hannes Reinecke
On 01/25/2017 08:39 AM, Hannes Reinecke wrote:
> On 01/24/2017 11:06 PM, Jens Axboe wrote:
>> On 01/24/2017 12:55 PM, Jens Axboe wrote:
>>> Try this patch. We only want to bump it for the driver tags, not the
>>> scheduler side.
>>
>> More complete version, this one actually tested. I think this should fix
>> your issue, let me know.
>>
> Nearly there.
> The initial stall is gone, but the test got hung at the 'stonewall'
> sequence again:
> 
> [global]
> bs=4k
> ioengine=libaio
> iodepth=256
> size=4g
> direct=1
> runtime=60
> # directory=/mnt
> numjobs=32
> group_reporting
> cpus_allowed_policy=split
> filename=/dev/md127
> 
> [seq-read]
> rw=read
> -> stonewall
> 
> [rand-read]
> rw=randread
> stonewall
> 
> Restarting all queues made the fio job continue.
> There were 4 queues with state 'restart', and one queue with state 'active'.
> So we're missing a queue run somewhere.
> 
I've found the queue stalls are gone with this patch:

diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 6b465bc..de5db6c 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -113,6 +113,15 @@ static inline void blk_mq_sched_put_rq_priv(struct request_queue *q,
 }

 static inline void
+blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
+{
+   if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
+   clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+   blk_mq_run_hw_queue(hctx, true);
+   }
+}
+
+static inline void
 blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
 {
struct elevator_queue *e = hctx->queue->elevator;
@@ -123,11 +132,6 @@ static inline void blk_mq_sched_put_rq_priv(struct request_queue *q,
BUG_ON(rq->internal_tag == -1);

blk_mq_put_tag(hctx, hctx->sched_tags, rq->mq_ctx,
rq->internal_tag);
-
-   if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
-   clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
-   blk_mq_run_hw_queue(hctx, true);
-   }
 }

 static inline void blk_mq_sched_started_request(struct request *rq)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e872555..63799ad 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -345,6 +345,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
if (sched_tag != -1)
blk_mq_sched_completed_request(hctx, rq);
+   blk_mq_sched_restart(hctx);
blk_queue_exit(q);
 }

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html