Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-12-01 Thread Hannes Reinecke

On 12/1/18 5:48 PM, Christoph Hellwig wrote:

On Fri, Nov 30, 2018 at 01:36:09PM -0700, Jens Axboe wrote:

On 11/30/18 1:26 PM, Keith Busch wrote:

A driver may wish to iterate every tagged request, not just ones that
satisfy blk_mq_request_started(). The intended use is so a driver may
terminate entered requests on quiesced queues.


How about we just move the started check into the handler passed in for
those that care about it? Much saner to make the interface iterate
everything, and leave whatever state check to the callback.
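The shape Jens is suggesting can be modelled in plain user-space C: the iterator visits every tagged request unconditionally, and any "started" filtering lives in the callback. The struct and function names below are illustrative stand-ins, not the kernel's blk-mq types.

```c
#include <assert.h>
#include <stddef.h>

/* Model of the proposed interface: iterate everything, filter in the
 * callback.  Names are illustrative, not the kernel's. */
enum rq_state { RQ_IDLE, RQ_STARTED, RQ_COMPLETE };

struct fake_rq {
	int tag;
	enum rq_state state;
};

typedef void (*rq_iter_fn)(struct fake_rq *rq, void *data);

/* Core iterator: no started-check here. */
static void iterate_all_tagged(struct fake_rq *rqs, size_t n,
			       rq_iter_fn fn, void *data)
{
	for (size_t i = 0; i < n; i++)
		fn(&rqs[i], data);
}

/* Callback that applies the state check itself, mirroring
 * blk_mq_request_started() moving out of the core iterator. */
static void count_started(struct fake_rq *rq, void *data)
{
	if (rq->state == RQ_STARTED)
		(*(int *)data)++;
}
```

A caller that only cares about started requests passes a filtering callback like `count_started`; a caller terminating entered requests on a quiesced queue would simply not filter.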


So we used to do that, and I changed it back in May to test for
MQ_RQ_IN_FLIGHT, and then Ming changed it to check
blk_mq_request_started.  So this is clearly a minefield of sorts..

Note that at least mtip32xx, nbd, skd and the various nvme transports
want to use the function to terminate all requests in the error
path, and it would be great to have one single understood, documented
and debugged helper for that in the core, so this is a vote for moving
more of the logic in your second helper into the core code.  skd
will actually need to use ->complete to release resources for that, though,
and mtip plays some odd abort bits.  If it weren't for the interesting
abort behavior in nvme-fc that means we could even unexport the
low-level interface.


Yes, I'm very much in favour of this, too.
We always have this IMO slightly weird notion of stopping the queue,
setting some error flags in the driver, then _restarting_ the queue, just
so that the driver sees the error flag and terminates the requests.

Which I always found quite counter-intuitive.
So having a common helper for terminating requests for queue errors 
would be very welcomed here.


But when we have that we really should audit all drivers to ensure they 
do the right thing (tm).


Cheers,

Hannes



Re: [PATCH 16/16] block: remove the queue_lock indirection

2018-11-14 Thread Hannes Reinecke

On 11/14/18 5:02 PM, Christoph Hellwig wrote:

With the legacy request path gone there is no good reason to keep
queue_lock as a pointer, we can always use the embedded lock now.

Signed-off-by: Christoph Hellwig 
---
  block/bfq-cgroup.c  |  2 +-
  block/bfq-iosched.c | 16 +--
  block/blk-cgroup.c  | 60 -
  block/blk-core.c| 10 +--
  block/blk-ioc.c | 14 +-
  block/blk-iolatency.c   |  4 +--
  block/blk-mq-sched.c|  4 +--
  block/blk-pm.c  | 20 +++---
  block/blk-pm.h  |  6 ++---
  block/blk-sysfs.c   |  4 +--
  block/blk-throttle.c| 22 +++
  drivers/block/pktcdvd.c |  4 +--
  drivers/ide/ide-pm.c| 10 +++
  include/linux/blkdev.h  |  8 +-
  14 files changed, 85 insertions(+), 99 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 14/16] mmc: stop abusing the request queue_lock pointer

2018-11-14 Thread Hannes Reinecke

On 11/14/18 5:02 PM, Christoph Hellwig wrote:

mmc uses the block layer struct request pointer to indirect their own
lock to the mmc_queue structure, given that the original lock isn't
reachable outside of block.c.  Add a lock pointer to struct mmc_queue
instead and stop overriding the block layer lock which protects fields
entirely separate from the mmc use.

Signed-off-by: Christoph Hellwig 
---
  drivers/mmc/core/block.c | 22 ++
  drivers/mmc/core/queue.c | 26 +-
  drivers/mmc/core/queue.h |  1 +
  3 files changed, 24 insertions(+), 25 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 15/16] block: remove the lock argument to blk_alloc_queue_node

2018-11-14 Thread Hannes Reinecke

On 11/14/18 5:02 PM, Christoph Hellwig wrote:

With the legacy request path gone there is no real need to override the
queue_lock.

Signed-off-by: Christoph Hellwig 
---
  block/blk-core.c   | 16 +++-
  block/blk-mq.c |  2 +-
  drivers/block/drbd/drbd_main.c |  2 +-
  drivers/block/null_blk_main.c  |  3 +--
  drivers/block/umem.c   |  2 +-
  drivers/lightnvm/core.c|  2 +-
  drivers/md/dm.c|  2 +-
  drivers/nvdimm/pmem.c  |  2 +-
  drivers/nvme/host/multipath.c  |  2 +-
  include/linux/blkdev.h |  3 +--
  10 files changed, 12 insertions(+), 24 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 13/16] mmc: simplify queue initialization

2018-11-14 Thread Hannes Reinecke

On 11/14/18 5:02 PM, Christoph Hellwig wrote:

Merge three functions initializing the queue into a single one, and drop
an unused argument for it.

Signed-off-by: Christoph Hellwig 
---
  drivers/mmc/core/block.c |  2 +-
  drivers/mmc/core/queue.c | 86 ++--
  drivers/mmc/core/queue.h |  3 +-
  3 files changed, 32 insertions(+), 59 deletions(-)
Reviewed-by: Hannes Reinecke 


Cheers,

Hannes


Re: [PATCH 12/16] umem: don't override the queue_lock

2018-11-14 Thread Hannes Reinecke

On 11/14/18 5:02 PM, Christoph Hellwig wrote:

The umem card->lock and the block layer queue_lock are used for entirely
different resources.  Stop using card->lock as the block layer
queue_lock.

Signed-off-by: Christoph Hellwig 
---
  drivers/block/umem.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/block/umem.c b/drivers/block/umem.c
index be3e3ab79950..8a27b5adc2b3 100644
--- a/drivers/block/umem.c
+++ b/drivers/block/umem.c
@@ -888,8 +888,7 @@ static int mm_pci_probe(struct pci_dev *dev, const struct pci_device_id *id)
 	card->biotail = &card->bio;
 	spin_lock_init(&card->lock);
 
-	card->queue = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE,
-					   &card->lock);
+	card->queue = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE, NULL);
 	if (!card->queue)
 		goto failed_alloc;


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 10/16] blk-cgroup: move locking into blkg_destroy_all

2018-11-14 Thread Hannes Reinecke

On 11/14/18 5:02 PM, Christoph Hellwig wrote:

Signed-off-by: Christoph Hellwig 
---
  block/blk-cgroup.c | 9 ++---
  1 file changed, 2 insertions(+), 7 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 11/16] drbd: don't override the queue_lock

2018-11-14 Thread Hannes Reinecke

On 11/14/18 5:02 PM, Christoph Hellwig wrote:

The DRBD req_lock and block layer queue_lock are used for entirely
different resources.  Stop using the req_lock as the block layer
queue_lock.

Signed-off-by: Christoph Hellwig 
---
  drivers/block/drbd/drbd_main.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index fa8204214ac0..b66c59ce6260 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2792,7 +2792,7 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
 
 	drbd_init_set_defaults(device);
 
-	q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE, &resource->req_lock);
+	q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE, NULL);
 	if (!q)
 		goto out_no_q;
 	device->rq_queue = q;


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 09/16] blk-cgroup: consolidate error handling in blkcg_init_queue

2018-11-14 Thread Hannes Reinecke

On 11/14/18 5:02 PM, Christoph Hellwig wrote:

Use a goto label to merge two identical pieces of error handling code.

Signed-off-by: Christoph Hellwig 
---
  block/blk-cgroup.c | 22 ++
  1 file changed, 10 insertions(+), 12 deletions(-)
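The consolidation pattern this patch applies is the standard kernel idiom: duplicated error-handling code collapses into labelled cleanup blocks at the end of the function. A minimal user-space sketch (struct name and allocation sizes are illustrative, not from blkcg_init_queue):

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative context with two resources acquired in order. */
struct ctx {
	void *a;
	void *b;
};

static int ctx_init(struct ctx *c)
{
	c->a = malloc(64);
	if (!c->a)
		goto out;
	c->b = malloc(64);
	if (!c->b)
		goto out_free_a;
	return 0;

	/* One shared cleanup path instead of duplicating it at each
	 * failure site. */
out_free_a:
	free(c->a);
	c->a = NULL;
out:
	return -1;
}

static void ctx_exit(struct ctx *c)
{
	free(c->b);
	free(c->a);
}
```

Each later failure site jumps to a label that unwinds exactly what was acquired so far, so adding a resource means adding one label rather than touching every error branch.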


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 08/16] block: remove a few unused exports

2018-11-14 Thread Hannes Reinecke

On 11/14/18 5:02 PM, Christoph Hellwig wrote:

Signed-off-by: Christoph Hellwig 
---
  block/blk-cgroup.c   | 6 --
  block/blk-ioc.c  | 3 ---
  block/blk-mq-sysfs.c | 1 -
  block/blk-softirq.c  | 1 -
  block/blk-stat.c | 4 
  block/blk-wbt.c  | 2 --
  6 files changed, 17 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 06/16] block-iolatency: remove the unused lock argument to rq_qos_throttle

2018-11-14 Thread Hannes Reinecke

On 11/14/18 5:02 PM, Christoph Hellwig wrote:

Unused now that the legacy request path is gone.

Signed-off-by: Christoph Hellwig 
---
  block/blk-iolatency.c | 24 ++--
  block/blk-mq.c|  2 +-
  block/blk-rq-qos.c|  5 ++---
  block/blk-rq-qos.h|  4 ++--
  block/blk-wbt.c   | 16 
  5 files changed, 15 insertions(+), 36 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 07/16] block: update a few comments for the legacy request removal

2018-11-14 Thread Hannes Reinecke

On 11/14/18 5:02 PM, Christoph Hellwig wrote:

Only the mq locking is left in the flush state machine.

Signed-off-by: Christoph Hellwig 
---
  block/blk-flush.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index c53197dcdd70..fcd18b158fd6 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -148,7 +148,7 @@ static void blk_flush_queue_rq(struct request *rq, bool add_front)
   * completion and trigger the next step.
   *
   * CONTEXT:
- * spin_lock_irq(q->queue_lock or fq->mq_flush_lock)
+ * spin_lock_irq(fq->mq_flush_lock)
   *
   * RETURNS:
   * %true if requests were added to the dispatch queue, %false otherwise.
@@ -252,7 +252,7 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
   * Please read the comment at the top of this file for more info.
   *
   * CONTEXT:
- * spin_lock_irq(q->queue_lock or fq->mq_flush_lock)
+ * spin_lock_irq(fq->mq_flush_lock)
   *
   */
static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 05/16] block: remove queue_lockdep_assert_held

2018-11-14 Thread Hannes Reinecke

On 11/14/18 5:02 PM, Christoph Hellwig wrote:

The only remaining user unconditionally drops and reacquires the lock,
which means we really don't need any additional (conditional) annotation.

Signed-off-by: Christoph Hellwig 
---
  block/blk-throttle.c |  1 -
  block/blk.h  | 13 -
  2 files changed, 14 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 04/16] block: use atomic bitops for ->queue_flags

2018-11-14 Thread Hannes Reinecke

On 11/14/18 5:02 PM, Christoph Hellwig wrote:

->queue_flags is generally not set or cleared in the fast path, and also
generally set or cleared one flag at a time.  Make use of the normal
atomic bitops for it so that we don't need to take the queue_lock,
which is otherwise mostly unused in the core block layer now.

Signed-off-by: Christoph Hellwig 
---
  block/blk-core.c   | 54 ++--
  block/blk-mq.c |  2 +-
  block/blk-settings.c   | 10 +++-
  block/blk-sysfs.c  | 28 +
  block/blk.h| 56 --
  include/linux/blkdev.h |  1 -
  6 files changed, 24 insertions(+), 127 deletions(-)

I wonder if we can't remove the 'blk_queue_flag_XXX' helpers and replace 
them with inlines ...
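The inline helpers in question would essentially wrap atomic bitops over the flags word. A user-space sketch of that shape, using C11 atomics as a stand-in for the kernel's set_bit/clear_bit/test_bit (names mirror blk_queue_flag_*, but this is a model, not the kernel code):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative queue with an atomically-updated flags word. */
struct fake_queue {
	atomic_ulong queue_flags;
};

static inline void queue_flag_set(unsigned int bit, struct fake_queue *q)
{
	/* atomic RMW: no external lock needed to flip one flag */
	atomic_fetch_or(&q->queue_flags, 1UL << bit);
}

static inline void queue_flag_clear(unsigned int bit, struct fake_queue *q)
{
	atomic_fetch_and(&q->queue_flags, ~(1UL << bit));
}

static inline bool queue_flag_test(unsigned int bit, struct fake_queue *q)
{
	return atomic_load(&q->queue_flags) & (1UL << bit);
}
```

Because each operation is an atomic read-modify-write on a single word, setting or clearing one flag at a time needs no queue_lock, which is exactly what lets the patch drop the lock from these paths.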


Otherwise:

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 03/16] block: don't hold the queue_lock over blk_abort_request

2018-11-14 Thread Hannes Reinecke

On 11/14/18 5:02 PM, Christoph Hellwig wrote:

There is nothing it could synchronize against, so don't go through
the pains of acquiring the lock.

Signed-off-by: Christoph Hellwig 
---
  block/blk-timeout.c |  2 +-
  drivers/ata/libata-eh.c |  4 
  drivers/block/mtip32xx/mtip32xx.c   |  5 +
  drivers/scsi/libsas/sas_ata.c   |  5 -
  drivers/scsi/libsas/sas_scsi_host.c | 10 ++
  5 files changed, 4 insertions(+), 22 deletions(-)


After all the pain we went through with aborts ...

Reviewed-by: Hannes Reinecke 

Cheers

Hannes


Re: [PATCH 02/16] block: remove deadline __deadline manipulation helpers

2018-11-14 Thread Hannes Reinecke

On 11/14/18 5:02 PM, Christoph Hellwig wrote:

No users left since the removal of the legacy request interface, we can
remove all the magic bit stealing now and make it a normal field.

But use WRITE_ONCE/READ_ONCE on the new deadline field, given that we
don't seem to have any mechanism to guarantee a new value actually
gets seen by other threads.

Signed-off-by: Christoph Hellwig 
---
  block/blk-mq.c |  4 ++--
  block/blk-timeout.c|  8 +---
  block/blk.h| 35 ---
  include/linux/blkdev.h |  4 +---
  4 files changed, 8 insertions(+), 43 deletions(-)
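A minimal user-space rendition of what READ_ONCE/WRITE_ONCE buy for the new plain deadline field: the volatile access forbids the compiler from caching, fusing, or tearing the load/store, so other threads see a freshly published value. This models the macros as simple volatile casts; the kernel's real definitions are more elaborate.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel macros (volatile access only). */
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))
#define READ_ONCE(x)		(*(volatile __typeof__(x) *)&(x))

struct fake_request {
	uint64_t deadline;	/* a plain field now, no more bit stealing */
};

static void set_deadline(struct fake_request *rq, uint64_t when)
{
	WRITE_ONCE(rq->deadline, when);
}

static uint64_t get_deadline(struct fake_request *rq)
{
	return READ_ONCE(rq->deadline);
}
```

Accessors like these are the whole point of the patch: with the magic bit manipulation gone, a single wrapped load/store is all the synchronization the deadline field gets.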


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 01/16] block: remove QUEUE_FLAG_BYPASS and ->bypass

2018-11-14 Thread Hannes Reinecke

On 11/14/18 5:02 PM, Christoph Hellwig wrote:

Unused since the removal of the legacy request code.

Signed-off-by: Christoph Hellwig 
---
  block/blk-cgroup.c | 15 ---
  block/blk-core.c   | 21 -
  block/blk-mq-debugfs.c |  1 -
  block/blk-throttle.c   |  3 ---
  include/linux/blk-cgroup.h |  6 +-
  include/linux/blkdev.h |  3 ---
  6 files changed, 1 insertion(+), 48 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 6/6] ide: don't use req->special

2018-11-10 Thread Hannes Reinecke
On 11/9/18 7:32 PM, Christoph Hellwig wrote:
> Just replace it with a field of the same name in struct ide_req.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/ide/ide-atapi.c|  4 ++--
>  drivers/ide/ide-cd.c   |  4 ++--
>  drivers/ide/ide-devsets.c  |  4 ++--
>  drivers/ide/ide-disk.c |  6 +++---
>  drivers/ide/ide-eh.c   |  2 +-
>  drivers/ide/ide-floppy.c   |  2 +-
>  drivers/ide/ide-io.c   | 14 +-
>  drivers/ide/ide-park.c |  4 ++--
>  drivers/ide/ide-pm.c   | 12 ++--
>  drivers/ide/ide-tape.c |  2 +-
>  drivers/ide/ide-taskfile.c |  2 +-
>  include/linux/ide.h|  1 +
>  12 files changed, 31 insertions(+), 26 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 5/6] pd: replace ->special use with private data in the request

2018-11-10 Thread Hannes Reinecke
On 11/9/18 7:32 PM, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/block/paride/pd.c | 30 +-
>  1 file changed, 25 insertions(+), 5 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 4/6] aoe: replace ->special use with private data in the request

2018-11-10 Thread Hannes Reinecke
On 11/9/18 7:32 PM, Christoph Hellwig wrote:
> Makes the code a whole lot better to read..
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/block/aoe/aoe.h|  4 
>  drivers/block/aoe/aoeblk.c |  1 +
>  drivers/block/aoe/aoecmd.c | 27 +--
>  drivers/block/aoe/aoedev.c | 11 ++-
>  4 files changed, 20 insertions(+), 23 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 3/6] skd_main: don't use req->special

2018-11-10 Thread Hannes Reinecke
On 11/9/18 7:32 PM, Christoph Hellwig wrote:
> Add a retries field to the internal request structure instead, which gets
> set to zero on the first submission.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/block/skd_main.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 2/6] nullb: remove leftover legacy request code

2018-11-10 Thread Hannes Reinecke
On 11/9/18 7:32 PM, Christoph Hellwig wrote:
> null_softirq_done_fn is only used for the blk-mq path, so remove the
> other branch.  Also rename the function to better match the method name.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/block/null_blk_main.c | 9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 1/6] fnic: fix fnic_scsi_host_{start,end}_tag

2018-11-10 Thread Hannes Reinecke
On 11/9/18 7:32 PM, Christoph Hellwig wrote:
> The way these functions abuse ->special to try to store the dummy
> request looks completely broken, given that it actually stores the
> original scsi command.
> 
> Instead switch to ->host_scribble and store the actual dummy command.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/scsi/fnic/fnic_scsi.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/fnic/fnic_scsi.c b/drivers/scsi/fnic/fnic_scsi.c
> index 96acfcecd540..cafbcfb85bfa 100644
> --- a/drivers/scsi/fnic/fnic_scsi.c
> +++ b/drivers/scsi/fnic/fnic_scsi.c
> @@ -2274,7 +2274,7 @@ fnic_scsi_host_start_tag(struct fnic *fnic, struct scsi_cmnd *sc)
>   return SCSI_NO_TAG;
>  
>   sc->tag = sc->request->tag = dummy->tag;
> - sc->request->special = sc;
> + sc->host_scribble = (unsigned char *)dummy;
>  
>   return dummy->tag;
>  }
> @@ -2286,7 +2286,7 @@ fnic_scsi_host_start_tag(struct fnic *fnic, struct scsi_cmnd *sc)
>  static inline void
>  fnic_scsi_host_end_tag(struct fnic *fnic, struct scsi_cmnd *sc)
>  {
> - struct request *dummy = sc->request->special;
> + struct request *dummy = (struct request *)sc->host_scribble;
>  
>   blk_mq_free_request(dummy);
>  }
> 
The entire device_reset for fnic should be redone; which reminds me to
redo my scsi-eh rework rebased.

But until then:

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 14/14] nvme: add separate poll queue map

2018-10-29 Thread Hannes Reinecke

On 10/25/18 11:16 PM, Jens Axboe wrote:

Adds support for defining a variable number of poll queues, currently
configurable with the 'poll_queues' module parameter. Defaults to
a single poll queue.

And now we finally have poll support without triggering interrupts!

Signed-off-by: Jens Axboe 
---
  drivers/nvme/host/pci.c | 103 +---
  include/linux/blk-mq.h  |   2 +-
  2 files changed, 88 insertions(+), 17 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 13/14] block: add REQ_HIPRI and inherit it from IOCB_HIPRI

2018-10-29 Thread Hannes Reinecke

On 10/25/18 11:16 PM, Jens Axboe wrote:

We use IOCB_HIPRI to poll for IO in the caller instead of scheduling.
This information is not available for (or after) IO submission. The
driver may make different queue choices based on the type of IO, so
make the fact that we will poll for this IO known to the lower layers
as well.

Signed-off-by: Jens Axboe 
---
  fs/block_dev.c| 2 ++
  fs/direct-io.c| 2 ++
  fs/iomap.c| 9 -
  include/linux/blk_types.h | 4 +++-
  4 files changed, 15 insertions(+), 2 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 12/14] nvme: utilize two queue maps, one for reads and one for writes

2018-10-29 Thread Hannes Reinecke

On 10/25/18 11:16 PM, Jens Axboe wrote:

NVMe does round-robin between queues by default, which means that
sharing a queue map for both reads and writes can be problematic
in terms of read servicing. It's much easier to flood the queue
with writes and reduce the read servicing.

Implement two queue maps, one for reads and one for writes. The
write queue count is configurable through the 'write_queues'
parameter.

By default, we retain the previous behavior of having a single
queue set, shared between reads and writes. Setting 'write_queues'
to a non-zero value will create two queue sets, one for reads and
one for writes, the latter using the configurable number of
queues (hardware queue counts permitting).

Signed-off-by: Jens Axboe 
---
  drivers/nvme/host/pci.c | 139 +---
  1 file changed, 131 insertions(+), 8 deletions(-)
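The split the commit message describes can be sketched as a small queue-count computation: `write_queues == 0` keeps one shared set, a non-zero value carves a separate write set out of the available hardware queues. The names and the exact clamping policy here are illustrative assumptions, not the driver's code.

```c
#include <assert.h>

/* Illustrative result of dividing hardware queues between reads
 * and writes. */
struct queue_split {
	unsigned int nr_read;
	unsigned int nr_write;
};

static struct queue_split split_queues(unsigned int nr_hw_queues,
				       unsigned int write_queues)
{
	struct queue_split s;

	if (write_queues == 0 || write_queues >= nr_hw_queues) {
		/* Default: a single set shared between reads and writes. */
		s.nr_read = nr_hw_queues;
		s.nr_write = 0;
	} else {
		/* Dedicated write set; the rest service reads. */
		s.nr_write = write_queues;
		s.nr_read = nr_hw_queues - write_queues;
	}
	return s;
}
```

Keeping reads on their own set means a burst of writes can no longer starve read servicing through the round-robin queue selection.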


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 10/14] blk-mq: initial support for multiple queue maps

2018-10-29 Thread Hannes Reinecke

On 10/25/18 11:16 PM, Jens Axboe wrote:

Add a queue offset to the tag map. This enables users to map
iteratively, for each queue map type they support.

Bump maximum number of supported maps to 2, we're now fully
able to support more than 1 map.

Signed-off-by: Jens Axboe 
---
  block/blk-mq-cpumap.c  | 9 +
  block/blk-mq-pci.c | 2 +-
  block/blk-mq-virtio.c  | 2 +-
  include/linux/blk-mq.h | 3 ++-
  4 files changed, 9 insertions(+), 7 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 09/14] blk-mq: ensure that plug lists don't straddle hardware queues

2018-10-29 Thread Hannes Reinecke

On 10/25/18 11:16 PM, Jens Axboe wrote:

Since we insert per hardware queue, we have to ensure that every
request on the plug list being inserted belongs to the same
hardware queue.

Signed-off-by: Jens Axboe 
---
  block/blk-mq.c | 27 +--
  1 file changed, 25 insertions(+), 2 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes




Re: [PATCH 08/14] blk-mq: separate number of hardware queues from nr_cpu_ids

2018-10-29 Thread Hannes Reinecke

On 10/25/18 11:16 PM, Jens Axboe wrote:

With multiple maps, nr_cpu_ids is no longer the maximum number of
hardware queues we support on a given device. The initializer of
the tag_set can have set ->nr_hw_queues larger than the available
number of CPUs, since we can exceed that with multiple queue maps.

Signed-off-by: Jens Axboe 
---
  block/blk-mq.c | 28 +---
  1 file changed, 21 insertions(+), 7 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes



Re: [PATCH 07/14] blk-mq: support multiple hctx maps

2018-10-29 Thread Hannes Reinecke

On 10/25/18 11:16 PM, Jens Axboe wrote:

Add support for the tag set carrying multiple queue maps, and
for the driver to inform blk-mq how many it wishes to support
through setting set->nr_maps.

This adds an mq_ops helper for drivers that support more than 1
map, mq_ops->flags_to_type(). The function takes request/bio flags
and CPU, and returns a queue map index for that. We then use the
type information in blk_mq_map_queue() to index the map set.

Signed-off-by: Jens Axboe 
---
  block/blk-mq.c | 85 --
  block/blk-mq.h | 19 ++
  include/linux/blk-mq.h |  7 
  3 files changed, 76 insertions(+), 35 deletions(-)
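The two-step lookup the commit message describes, a flags_to_type()-style hook selecting a map and the map translating a CPU to a hardware-queue index, can be modelled in a few lines. Flag values, map count, and names below are illustrative assumptions, not the kernel's constants.

```c
#include <assert.h>

/* Illustrative map types and flags. */
enum { MAP_DEFAULT, MAP_POLL, NR_MAPS };
#define FLAG_HIPRI	0x1u
#define NR_CPUS_MODEL	4

struct queue_map {
	unsigned int mq_map[NR_CPUS_MODEL];	/* cpu -> hctx index */
};

/* Model of the mq_ops->flags_to_type() hook: request flags pick the
 * queue map type. */
static int flags_to_type(unsigned int flags)
{
	return (flags & FLAG_HIPRI) ? MAP_POLL : MAP_DEFAULT;
}

/* Model of blk_mq_map_queue() with multiple maps: type indexes the map
 * set, then the map translates the CPU. */
static unsigned int map_queue(const struct queue_map maps[NR_MAPS],
			      unsigned int flags, unsigned int cpu)
{
	return maps[flags_to_type(flags)].mq_map[cpu];
}
```

The point of the indirection is that a driver supporting, say, a polled map routes high-priority I/O to different hardware queues than normal I/O from the same CPU.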


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes




Re: [PATCH 06/14] blk-mq: add 'type' attribute to the sysfs hctx directory

2018-10-29 Thread Hannes Reinecke

On 10/25/18 11:16 PM, Jens Axboe wrote:

It can be useful for a user to verify what type a given hardware
queue is, expose this information in sysfs.

Signed-off-by: Jens Axboe 
---
  block/blk-mq-sysfs.c | 10 ++
  1 file changed, 10 insertions(+)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes




Re: [PATCH 05/14] blk-mq: allow software queue to map to multiple hardware queues

2018-10-29 Thread Hannes Reinecke

On 10/25/18 11:16 PM, Jens Axboe wrote:

The mapping used to be dependent on just the CPU location, but
now it's a tuple of { type, cpu} instead. This is a prep patch
for allowing a single software queue to map to multiple hardware
queues. No functional changes in this patch.

Signed-off-by: Jens Axboe 
---
  block/blk-mq-sched.c   |  2 +-
  block/blk-mq.c | 18 --
  block/blk-mq.h |  2 +-
  block/kyber-iosched.c  |  6 +++---
  include/linux/blk-mq.h |  3 ++-
  5 files changed, 19 insertions(+), 12 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 04/14] blk-mq: pass in request/bio flags to queue mapping

2018-10-29 Thread Hannes Reinecke

On 10/25/18 11:16 PM, Jens Axboe wrote:

Prep patch for being able to place requests based not just on
CPU location, but also on the type of request.

Signed-off-by: Jens Axboe 
---
  block/blk-flush.c  |  7 +++---
  block/blk-mq-debugfs.c |  4 +++-
  block/blk-mq-sched.c   | 16 ++
  block/blk-mq-tag.c |  5 +++--
  block/blk-mq.c | 50 +++---
  block/blk-mq.h |  8 ---
  block/blk.h|  6 ++---
  7 files changed, 58 insertions(+), 38 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes



Re: [PATCH 03/14] blk-mq: provide dummy blk_mq_map_queue_type() helper

2018-10-29 Thread Hannes Reinecke

On 10/25/18 11:16 PM, Jens Axboe wrote:

Doesn't do anything right now, but it's needed as a prep patch
to get the interfaces right.

Signed-off-by: Jens Axboe 
---
  block/blk-mq.h | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 889f0069dd80..79c300faa7ce 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -80,6 +80,12 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
return q->queue_hw_ctx[set->map[0].mq_map[cpu]];
  }
  
+static inline struct blk_mq_hw_ctx *blk_mq_map_queue_type(struct request_queue *q,
+							  int type, int cpu)
+{
+   return blk_mq_map_queue(q, cpu);
+}
+
  /*
   * sysfs helpers
   */


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes



Re: [PATCH 02/14] blk-mq: abstract out queue map

2018-10-29 Thread Hannes Reinecke

On 10/25/18 11:16 PM, Jens Axboe wrote:

This is in preparation for allowing multiple sets of maps per
queue, if so desired.

Signed-off-by: Jens Axboe 
---
  block/blk-mq-cpumap.c | 10 
  block/blk-mq-pci.c| 10 
  block/blk-mq-rdma.c   |  4 ++--
  block/blk-mq-virtio.c |  8 +++
  block/blk-mq.c| 34 ++-
  block/blk-mq.h|  8 +++
  drivers/block/virtio_blk.c|  2 +-
  drivers/nvme/host/pci.c   |  2 +-
  drivers/scsi/qla2xxx/qla_os.c |  5 ++--
  drivers/scsi/scsi_lib.c   |  2 +-
  drivers/scsi/smartpqi/smartpqi_init.c |  3 ++-
  drivers/scsi/virtio_scsi.c|  3 ++-
  include/linux/blk-mq-pci.h|  4 ++--
  include/linux/blk-mq-virtio.h |  4 ++--
  include/linux/blk-mq.h| 13 --
  15 files changed, 63 insertions(+), 49 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 01/14] blk-mq: kill q->mq_map

2018-10-29 Thread Hannes Reinecke

On 10/25/18 11:16 PM, Jens Axboe wrote:

It's just a pointer to set->mq_map, use that instead.

Signed-off-by: Jens Axboe 
---
  block/blk-mq.c | 13 -
  block/blk-mq.h |  4 +++-
  include/linux/blkdev.h |  2 --
  3 files changed, 7 insertions(+), 12 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes




Re: [PATCH] blk-mq-debugfs: Also show requests that have not yet been started

2018-10-04 Thread Hannes Reinecke

On 10/4/18 7:35 PM, Bart Van Assche wrote:

When debugging e.g. the SCSI timeout handler it is important that
requests that have not yet been started or that already have
completed are also reported through debugfs.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Martin K. Petersen 
---
  block/blk-mq-debugfs.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index a5ea86835fcb..41b86f50d126 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -431,8 +431,7 @@ static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved)
  {
const struct show_busy_params *params = data;
  
-	if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
-	    blk_mq_rq_state(rq) != MQ_RQ_IDLE)
+	if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx)
 		__blk_mq_debugfs_rq_show(params->m,
 					 list_entry_rq(&rq->queuelist));
  }


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


[PATCH 2/5] nvme: register ns_id attributes as default sysfs groups

2018-09-28 Thread Hannes Reinecke
We should be registering the ns_id attribute as default sysfs
attribute groups, otherwise we have a race condition between
the uevent and the attributes appearing in sysfs.

Suggested-by: Bart van Assche 
Signed-off-by: Hannes Reinecke 
---
 drivers/nvme/host/core.c  |  21 -
 drivers/nvme/host/lightnvm.c  | 105 ++
 drivers/nvme/host/multipath.c |  15 ++
 drivers/nvme/host/nvme.h  |  10 +---
 4 files changed, 59 insertions(+), 92 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0e824e8c8fd7..e0a9e1c5b30e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2734,6 +2734,14 @@ const struct attribute_group nvme_ns_id_attr_group = {
.is_visible = nvme_ns_id_attrs_are_visible,
 };
 
+const struct attribute_group *nvme_ns_id_attr_groups[] = {
+	&nvme_ns_id_attr_group,
+#ifdef CONFIG_NVM
+	&nvme_nvm_attr_group,
+#endif
+   NULL,
+};
+
 #define nvme_show_str_function(field)					\
 static ssize_t  field##_show(struct device *dev,			\
 		struct device_attribute *attr, char *buf)		\
@@ -3099,14 +3107,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
nvme_get_ctrl(ctrl);
 
-	device_add_disk(ctrl->device, ns->disk, NULL);
-	if (sysfs_create_group(&disk_to_dev(ns->disk)->kobj,
-					&nvme_ns_id_attr_group))
-		pr_warn("%s: failed to create sysfs group for identification\n",
-			ns->disk->disk_name);
-	if (ns->ndev && nvme_nvm_register_sysfs(ns))
-		pr_warn("%s: failed to register lightnvm sysfs group for identification\n",
-			ns->disk->disk_name);
+	device_add_disk(ctrl->device, ns->disk, nvme_ns_id_attr_groups);
 
nvme_mpath_add_disk(ns, id);
nvme_fault_inject_init(ns);
@@ -3132,10 +3133,6 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 
nvme_fault_inject_fini(ns);
if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
-		sysfs_remove_group(&disk_to_dev(ns->disk)->kobj,
-					&nvme_ns_id_attr_group);
-		if (ns->ndev)
-			nvme_nvm_unregister_sysfs(ns);
del_gendisk(ns->disk);
blk_cleanup_queue(ns->queue);
if (blk_get_integrity(ns->disk))
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 6fe5923c95d4..1e4f97538838 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -1190,10 +1190,29 @@ static NVM_DEV_ATTR_12_RO(multiplane_modes);
 static NVM_DEV_ATTR_12_RO(media_capabilities);
 static NVM_DEV_ATTR_12_RO(max_phys_secs);
 
-static struct attribute *nvm_dev_attrs_12[] = {
+/* 2.0 values */
+static NVM_DEV_ATTR_20_RO(groups);
+static NVM_DEV_ATTR_20_RO(punits);
+static NVM_DEV_ATTR_20_RO(chunks);
+static NVM_DEV_ATTR_20_RO(clba);
+static NVM_DEV_ATTR_20_RO(ws_min);
+static NVM_DEV_ATTR_20_RO(ws_opt);
+static NVM_DEV_ATTR_20_RO(maxoc);
+static NVM_DEV_ATTR_20_RO(maxocpu);
+static NVM_DEV_ATTR_20_RO(mw_cunits);
+static NVM_DEV_ATTR_20_RO(write_typ);
+static NVM_DEV_ATTR_20_RO(write_max);
+static NVM_DEV_ATTR_20_RO(reset_typ);
+static NVM_DEV_ATTR_20_RO(reset_max);
+
+static struct attribute *nvm_dev_attrs[] = {
+   /* version agnostic attrs */
 	&dev_attr_version.attr,
 	&dev_attr_capabilities.attr,
+	&dev_attr_read_typ.attr,
+	&dev_attr_read_max.attr,
 
+	/* 1.2 attrs */
 	&dev_attr_vendor_opcode.attr,
 	&dev_attr_device_mode.attr,
 	&dev_attr_media_manager.attr,
@@ -1208,8 +1227,6 @@ static struct attribute *nvm_dev_attrs_12[] = {
 	&dev_attr_page_size.attr,
 	&dev_attr_hw_sector_size.attr,
 	&dev_attr_oob_sector_size.attr,
-	&dev_attr_read_typ.attr,
-	&dev_attr_read_max.attr,
 	&dev_attr_prog_typ.attr,
 	&dev_attr_prog_max.attr,
 	&dev_attr_erase_typ.attr,
@@ -1218,33 +1235,7 @@ static struct attribute *nvm_dev_attrs_12[] = {
 	&dev_attr_media_capabilities.attr,
 	&dev_attr_max_phys_secs.attr,
 
-   NULL,
-};
-
-static const struct attribute_group nvm_dev_attr_group_12 = {
-   .name   = "lightnvm",
-   .attrs  = nvm_dev_attrs_12,
-};
-
-/* 2.0 values */
-static NVM_DEV_ATTR_20_RO(groups);
-static NVM_DEV_ATTR_20_RO(punits);
-static NVM_DEV_ATTR_20_RO(chunks);
-static NVM_DEV_ATTR_20_RO(clba);
-static NVM_DEV_ATTR_20_RO(ws_min);
-static NVM_DEV_ATTR_20_RO(ws_opt);
-static NVM_DEV_ATTR_20_RO(maxoc);
-static NVM_DEV_ATTR_20_RO(maxocpu);
-static NVM_DEV_ATTR_20_RO(mw_cunits);
-static NVM_DEV_ATTR_20_RO(write_typ);
-static NVM_DEV_ATTR_20_RO(write_max);
-static NVM_DEV_ATTR_20_RO(reset_typ);
-static NVM_DEV_ATTR_20_RO(reset_max);
-
-static struct attribu

[PATCH 1/5] block: genhd: add 'groups' argument to device_add_disk

2018-09-28 Thread Hannes Reinecke
Update device_add_disk() to take a 'groups' argument so that
individual drivers can register a device with additional sysfs
attributes.
This avoids the race condition the driver would otherwise have if these
groups were created with sysfs_create_groups() after the disk was added.

Signed-off-by: Martin Wilck 
Signed-off-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Bart Van Assche 
---
 arch/um/drivers/ubd_kern.c  |  2 +-
 block/genhd.c   | 19 ++-
 drivers/block/floppy.c  |  2 +-
 drivers/block/mtip32xx/mtip32xx.c   |  2 +-
 drivers/block/ps3disk.c |  2 +-
 drivers/block/ps3vram.c |  2 +-
 drivers/block/rsxx/dev.c|  2 +-
 drivers/block/skd_main.c|  2 +-
 drivers/block/sunvdc.c  |  2 +-
 drivers/block/virtio_blk.c  |  2 +-
 drivers/block/xen-blkfront.c|  2 +-
 drivers/ide/ide-cd.c|  2 +-
 drivers/ide/ide-gd.c|  2 +-
 drivers/memstick/core/ms_block.c|  2 +-
 drivers/memstick/core/mspro_block.c |  2 +-
 drivers/mmc/core/block.c|  2 +-
 drivers/mtd/mtd_blkdevs.c   |  2 +-
 drivers/nvdimm/blk.c|  2 +-
 drivers/nvdimm/btt.c|  2 +-
 drivers/nvdimm/pmem.c   |  2 +-
 drivers/nvme/host/core.c|  2 +-
 drivers/nvme/host/multipath.c   |  2 +-
 drivers/s390/block/dasd_genhd.c |  2 +-
 drivers/s390/block/dcssblk.c|  2 +-
 drivers/s390/block/scm_blk.c|  2 +-
 drivers/scsi/sd.c   |  2 +-
 drivers/scsi/sr.c   |  2 +-
 include/linux/genhd.h   |  5 +++--
 28 files changed, 43 insertions(+), 33 deletions(-)

diff --git a/arch/um/drivers/ubd_kern.c b/arch/um/drivers/ubd_kern.c
index 83c470364dfb..6ee4c56032f7 100644
--- a/arch/um/drivers/ubd_kern.c
+++ b/arch/um/drivers/ubd_kern.c
@@ -891,7 +891,7 @@ static int ubd_disk_register(int major, u64 size, int unit,
 
	disk->private_data = &ubd_devs[unit];
disk->queue = ubd_devs[unit].queue;
-   device_add_disk(parent, disk);
+   device_add_disk(parent, disk, NULL);
 
*disk_out = disk;
return 0;
diff --git a/block/genhd.c b/block/genhd.c
index 8cc719a37b32..ef0936184d69 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -567,7 +567,8 @@ static int exact_lock(dev_t devt, void *data)
return 0;
 }
 
-static void register_disk(struct device *parent, struct gendisk *disk)
+static void register_disk(struct device *parent, struct gendisk *disk,
+ const struct attribute_group **groups)
 {
struct device *ddev = disk_to_dev(disk);
struct block_device *bdev;
@@ -582,6 +583,10 @@ static void register_disk(struct device *parent, struct gendisk *disk)
/* delay uevents, until we scanned partition table */
dev_set_uevent_suppress(ddev, 1);
 
+   if (groups) {
+   WARN_ON(ddev->groups);
+   ddev->groups = groups;
+   }
if (device_add(ddev))
return;
if (!sysfs_deprecated) {
@@ -647,6 +652,7 @@ static void register_disk(struct device *parent, struct gendisk *disk)
  * __device_add_disk - add disk information to kernel list
  * @parent: parent device for the disk
  * @disk: per-device partitioning information
+ * @groups: Additional per-device sysfs groups
  * @register_queue: register the queue if set to true
  *
  * This function registers the partitioning information in @disk
@@ -655,6 +661,7 @@ static void register_disk(struct device *parent, struct gendisk *disk)
  * FIXME: error handling
  */
 static void __device_add_disk(struct device *parent, struct gendisk *disk,
+ const struct attribute_group **groups,
  bool register_queue)
 {
dev_t devt;
@@ -698,7 +705,7 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
blk_register_region(disk_devt(disk), disk->minors, NULL,
exact_match, exact_lock, disk);
}
-   register_disk(parent, disk);
+   register_disk(parent, disk, groups);
if (register_queue)
blk_register_queue(disk);
 
@@ -712,15 +719,17 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
blk_integrity_add(disk);
 }
 
-void device_add_disk(struct device *parent, struct gendisk *disk)
+void device_add_disk(struct device *parent, struct gendisk *disk,
+const struct attribute_group **groups)
+
 {
-   __device_add_disk(parent, disk, true);
+   __device_add_disk(parent, disk, groups, true);
 }
 EXPORT_SYMBOL(device_add_disk);
 
 void device_add_disk_no_queue_reg(struct device *parent, struct gendisk *disk)
 {
-   __device_add_disk(parent, disk, false);
+   __device_add_disk(parent, disk, NULL, false);
 }
 EXPORT_SYMBOL(device_add_disk_no_queue_reg);
 
diff --git a/drivers/b

[PATCH 4/5] zram: register default groups with device_add_disk()

2018-09-28 Thread Hannes Reinecke
Register default sysfs groups during device_add_disk() to avoid a
race condition with udev during startup.

Signed-off-by: Hannes Reinecke 
Cc: Minchan Kim 
Cc: Nitin Gupta 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Bart Van Assche 
---
 drivers/block/zram/zram_drv.c | 28 +++-
 1 file changed, 7 insertions(+), 21 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index a1d6b5597c17..4879595200e1 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1636,6 +1636,11 @@ static const struct attribute_group zram_disk_attr_group = {
.attrs = zram_disk_attrs,
 };
 
+static const struct attribute_group *zram_disk_attr_groups[] = {
+	&zram_disk_attr_group,
+   NULL,
+};
+
 /*
  * Allocate and initialize new zram device. the function returns
  * '>= 0' device_id upon success, and negative value otherwise.
@@ -1716,24 +1721,14 @@ static int zram_add(void)
 
zram->disk->queue->backing_dev_info->capabilities |=
(BDI_CAP_STABLE_WRITES | BDI_CAP_SYNCHRONOUS_IO);
-   add_disk(zram->disk);
-
-	ret = sysfs_create_group(&disk_to_dev(zram->disk)->kobj,
-				&zram_disk_attr_group);
-   if (ret < 0) {
-   pr_err("Error creating sysfs group for device %d\n",
-   device_id);
-   goto out_free_disk;
-   }
+   device_add_disk(NULL, zram->disk, zram_disk_attr_groups);
+
strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor));
 
zram_debugfs_register(zram);
pr_info("Added device: %s\n", zram->disk->disk_name);
return device_id;
 
-out_free_disk:
-   del_gendisk(zram->disk);
-   put_disk(zram->disk);
 out_free_queue:
blk_cleanup_queue(queue);
 out_free_idr:
@@ -1762,15 +1757,6 @@ static int zram_remove(struct zram *zram)
	mutex_unlock(&zram->bd_mutex);
 
zram_debugfs_unregister(zram);
-   /*
-* Remove sysfs first, so no one will perform a disksize
-* store while we destroy the devices. This also helps during
-* hot_remove -- zram_reset_device() is the last holder of
-* ->init_lock, no later/concurrent disksize_store() or any
-* other sysfs handlers are possible.
-*/
-	sysfs_remove_group(&disk_to_dev(zram->disk)->kobj,
-			&zram_disk_attr_group);
 
/* Make sure all the pending I/O are finished */
fsync_bdev(bdev);
-- 
2.16.4



[PATCH 5/5] virtio-blk: modernize sysfs attribute creation

2018-09-28 Thread Hannes Reinecke
Use new-style DEVICE_ATTR_RO/DEVICE_ATTR_RW to create the sysfs attributes
and register the disk with default sysfs attribute groups.

Signed-off-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
Acked-by: Michael S. Tsirkin 
Reviewed-by: Bart Van Assche 
---
 drivers/block/virtio_blk.c | 68 ++
 1 file changed, 39 insertions(+), 29 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index fe8056a1..086c6bb12baa 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -351,8 +351,8 @@ static int minor_to_index(int minor)
return minor >> PART_BITS;
 }
 
-static ssize_t virtblk_serial_show(struct device *dev,
-   struct device_attribute *attr, char *buf)
+static ssize_t serial_show(struct device *dev,
+  struct device_attribute *attr, char *buf)
 {
struct gendisk *disk = dev_to_disk(dev);
int err;
@@ -371,7 +371,7 @@ static ssize_t virtblk_serial_show(struct device *dev,
return err;
 }
 
-static DEVICE_ATTR(serial, 0444, virtblk_serial_show, NULL);
+static DEVICE_ATTR_RO(serial);
 
 /* The queue's logical block size must be set before calling this */
 static void virtblk_update_capacity(struct virtio_blk *vblk, bool resize)
@@ -545,8 +545,8 @@ static const char *const virtblk_cache_types[] = {
 };
 
 static ssize_t
-virtblk_cache_type_store(struct device *dev, struct device_attribute *attr,
-const char *buf, size_t count)
+cache_type_store(struct device *dev, struct device_attribute *attr,
+const char *buf, size_t count)
 {
struct gendisk *disk = dev_to_disk(dev);
struct virtio_blk *vblk = disk->private_data;
@@ -564,8 +564,7 @@ virtblk_cache_type_store(struct device *dev, struct device_attribute *attr,
 }
 
 static ssize_t
-virtblk_cache_type_show(struct device *dev, struct device_attribute *attr,
-char *buf)
+cache_type_show(struct device *dev, struct device_attribute *attr, char *buf)
 {
struct gendisk *disk = dev_to_disk(dev);
struct virtio_blk *vblk = disk->private_data;
@@ -575,12 +574,38 @@ virtblk_cache_type_show(struct device *dev, struct device_attribute *attr,
return snprintf(buf, 40, "%s\n", virtblk_cache_types[writeback]);
 }
 
-static const struct device_attribute dev_attr_cache_type_ro =
-   __ATTR(cache_type, 0444,
-  virtblk_cache_type_show, NULL);
-static const struct device_attribute dev_attr_cache_type_rw =
-   __ATTR(cache_type, 0644,
-  virtblk_cache_type_show, virtblk_cache_type_store);
+static DEVICE_ATTR_RW(cache_type);
+
+static struct attribute *virtblk_attrs[] = {
+	&dev_attr_serial.attr,
+	&dev_attr_cache_type.attr,
+   NULL,
+};
+
+static umode_t virtblk_attrs_are_visible(struct kobject *kobj,
+   struct attribute *a, int n)
+{
+   struct device *dev = container_of(kobj, struct device, kobj);
+   struct gendisk *disk = dev_to_disk(dev);
+   struct virtio_blk *vblk = disk->private_data;
+   struct virtio_device *vdev = vblk->vdev;
+
+	if (a == &dev_attr_cache_type.attr &&
+   !virtio_has_feature(vdev, VIRTIO_BLK_F_CONFIG_WCE))
+   return S_IRUGO;
+
+   return a->mode;
+}
+
+static const struct attribute_group virtblk_attr_group = {
+   .attrs = virtblk_attrs,
+   .is_visible = virtblk_attrs_are_visible,
+};
+
+static const struct attribute_group *virtblk_attr_groups[] = {
+	&virtblk_attr_group,
+   NULL,
+};
 
 static int virtblk_init_request(struct blk_mq_tag_set *set, struct request *rq,
unsigned int hctx_idx, unsigned int numa_node)
@@ -780,24 +805,9 @@ static int virtblk_probe(struct virtio_device *vdev)
virtblk_update_capacity(vblk, false);
virtio_device_ready(vdev);
 
-	device_add_disk(&vdev->dev, vblk->disk, NULL);
-	err = device_create_file(disk_to_dev(vblk->disk), &dev_attr_serial);
-   if (err)
-   goto out_del_disk;
-
-   if (virtio_has_feature(vdev, VIRTIO_BLK_F_CONFIG_WCE))
-		err = device_create_file(disk_to_dev(vblk->disk),
-					 &dev_attr_cache_type_rw);
-	else
-		err = device_create_file(disk_to_dev(vblk->disk),
-					 &dev_attr_cache_type_ro);
-   if (err)
-   goto out_del_disk;
+	device_add_disk(&vdev->dev, vblk->disk, virtblk_attr_groups);
return 0;
 
-out_del_disk:
-   del_gendisk(vblk->disk);
-   blk_cleanup_queue(vblk->disk->queue);
 out_free_tags:
	blk_mq_free_tag_set(&vblk->tag_set);
 out_put_disk:
-- 
2.16.4



[PATCHv4 0/5] genhd: register default groups with device_add_disk()

2018-09-28 Thread Hannes Reinecke
device_add_disk() doesn't allow registering default sysfs groups,
which introduces a race with udev as these groups then have to be created
after the uevent has been sent.
This patchset updates device_add_disk() to accept a 'groups' argument to
avoid this race and updates the obvious drivers to use it.
There are some more drivers which might benefit from this (eg loop or md),
but the interface is not straightforward so I haven't attempted it.

As usual, comments and reviews are welcome.

Changes to v3:
- Better check in is_visible for lightnvm attributes as suggested
  by hch
  
Changes to v2:
- Fold lightnvm attributes into the generic attribute group as
  suggested by Bart

Changes to v1:
- Drop first patch
- Convert lightnvm sysfs attributes as suggested by Bart

Hannes Reinecke (5):
  block: genhd: add 'groups' argument to device_add_disk
  nvme: register ns_id attributes as default sysfs groups
  aoe: register default groups with device_add_disk()
  zram: register default groups with device_add_disk()
  virtio-blk: modernize sysfs attribute creation

 arch/um/drivers/ubd_kern.c  |   2 +-
 block/genhd.c   |  19 +--
 drivers/block/aoe/aoe.h |   1 -
 drivers/block/aoe/aoeblk.c  |  21 +++-
 drivers/block/aoe/aoedev.c  |   1 -
 drivers/block/floppy.c  |   2 +-
 drivers/block/mtip32xx/mtip32xx.c   |   2 +-
 drivers/block/ps3disk.c |   2 +-
 drivers/block/ps3vram.c |   2 +-
 drivers/block/rsxx/dev.c|   2 +-
 drivers/block/skd_main.c|   2 +-
 drivers/block/sunvdc.c  |   2 +-
 drivers/block/virtio_blk.c  |  68 +--
 drivers/block/xen-blkfront.c|   2 +-
 drivers/block/zram/zram_drv.c   |  28 +++---
 drivers/ide/ide-cd.c|   2 +-
 drivers/ide/ide-gd.c|   2 +-
 drivers/memstick/core/ms_block.c|   2 +-
 drivers/memstick/core/mspro_block.c |   2 +-
 drivers/mmc/core/block.c|   2 +-
 drivers/mtd/mtd_blkdevs.c   |   2 +-
 drivers/nvdimm/blk.c|   2 +-
 drivers/nvdimm/btt.c|   2 +-
 drivers/nvdimm/pmem.c   |   2 +-
 drivers/nvme/host/core.c|  21 
 drivers/nvme/host/lightnvm.c| 105 +++-
 drivers/nvme/host/multipath.c   |  15 ++
 drivers/nvme/host/nvme.h|  10 +---
 drivers/s390/block/dasd_genhd.c |   2 +-
 drivers/s390/block/dcssblk.c|   2 +-
 drivers/s390/block/scm_blk.c|   2 +-
 drivers/scsi/sd.c   |   2 +-
 drivers/scsi/sr.c   |   2 +-
 include/linux/genhd.h   |   5 +-
 34 files changed, 152 insertions(+), 188 deletions(-)

-- 
2.16.4



[PATCH 3/5] aoe: register default groups with device_add_disk()

2018-09-28 Thread Hannes Reinecke
Register default sysfs groups during device_add_disk() to avoid a
race condition with udev during startup.

Signed-off-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
Acked-by: Ed L. Cashin 
Reviewed-by: Bart Van Assche 
---
 drivers/block/aoe/aoe.h|  1 -
 drivers/block/aoe/aoeblk.c | 21 +++--
 drivers/block/aoe/aoedev.c |  1 -
 3 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/drivers/block/aoe/aoe.h b/drivers/block/aoe/aoe.h
index c0ebda1283cc..015c68017a1c 100644
--- a/drivers/block/aoe/aoe.h
+++ b/drivers/block/aoe/aoe.h
@@ -201,7 +201,6 @@ int aoeblk_init(void);
 void aoeblk_exit(void);
 void aoeblk_gdalloc(void *);
 void aoedisk_rm_debugfs(struct aoedev *d);
-void aoedisk_rm_sysfs(struct aoedev *d);
 
 int aoechr_init(void);
 void aoechr_exit(void);
diff --git a/drivers/block/aoe/aoeblk.c b/drivers/block/aoe/aoeblk.c
index 429ebb84b592..ff770e7d9e52 100644
--- a/drivers/block/aoe/aoeblk.c
+++ b/drivers/block/aoe/aoeblk.c
@@ -177,10 +177,15 @@ static struct attribute *aoe_attrs[] = {
NULL,
 };
 
-static const struct attribute_group attr_group = {
+static const struct attribute_group aoe_attr_group = {
.attrs = aoe_attrs,
 };
 
+static const struct attribute_group *aoe_attr_groups[] = {
+	&aoe_attr_group,
+   NULL,
+};
+
 static const struct file_operations aoe_debugfs_fops = {
.open = aoe_debugfs_open,
.read = seq_read,
@@ -219,17 +224,6 @@ aoedisk_rm_debugfs(struct aoedev *d)
d->debugfs = NULL;
 }
 
-static int
-aoedisk_add_sysfs(struct aoedev *d)
-{
-	return sysfs_create_group(&disk_to_dev(d->gd)->kobj, &attr_group);
-}
-void
-aoedisk_rm_sysfs(struct aoedev *d)
-{
-	sysfs_remove_group(&disk_to_dev(d->gd)->kobj, &attr_group);
-}
-
 static int
 aoeblk_open(struct block_device *bdev, fmode_t mode)
 {
@@ -417,8 +411,7 @@ aoeblk_gdalloc(void *vp)
 
	spin_unlock_irqrestore(&d->lock, flags);
 
-   add_disk(gd);
-   aoedisk_add_sysfs(d);
+   device_add_disk(NULL, gd, aoe_attr_groups);
aoedisk_add_debugfs(d);
 
	spin_lock_irqsave(&d->lock, flags);
diff --git a/drivers/block/aoe/aoedev.c b/drivers/block/aoe/aoedev.c
index 41060e9cedf2..f29a140cdbc1 100644
--- a/drivers/block/aoe/aoedev.c
+++ b/drivers/block/aoe/aoedev.c
@@ -275,7 +275,6 @@ freedev(struct aoedev *d)
	del_timer_sync(&d->timer);
if (d->gd) {
aoedisk_rm_debugfs(d);
-   aoedisk_rm_sysfs(d);
del_gendisk(d->gd);
put_disk(d->gd);
blk_cleanup_queue(d->blkq);
-- 
2.16.4



Re: REQ_OP_WRITE_ZEROES clash with REQ_FAILFAST_TRANSPORT ?

2018-09-21 Thread Hannes Reinecke

On 9/21/18 11:15 AM, Hannes Reinecke wrote:

Hi all,

there's one thing which decidedly puzzles me:


[ .. ]

Ah. Screw that.
REQ_OP_XXX is a number, and the others are a bitmask.

Sorry for the noise.

Cheers,

Hannes
--
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


REQ_OP_WRITE_ZEROES clash with REQ_FAILFAST_TRANSPORT ?

2018-09-21 Thread Hannes Reinecke

Hi all,

there's one thing which decidedly puzzles me:

We have:

#define REQ_OP_BITS 8
#define REQ_OP_MASK ((1 << REQ_OP_BITS) - 1)
#define REQ_FLAG_BITS   24

enum req_opf {
/* read sectors from the device */
REQ_OP_READ = 0,
/* write sectors to the device */
REQ_OP_WRITE= 1,
/* flush the volatile write cache */
REQ_OP_FLUSH= 2,
/* discard sectors */
REQ_OP_DISCARD  = 3,
/* get zone information */
REQ_OP_ZONE_REPORT  = 4,
/* securely erase sectors */
REQ_OP_SECURE_ERASE = 5,
/* reset a zone write pointer */
REQ_OP_ZONE_RESET   = 6,
/* write the same sector many times */
REQ_OP_WRITE_SAME   = 7,
/* write the zero filled sector many times */
>>REQ_OP_WRITE_ZEROES = 9,

and

enum req_flag_bits {
__REQ_FAILFAST_DEV =/* no driver retries of device errors */
REQ_OP_BITS,
>>__REQ_FAILFAST_TRANSPORT, /* no driver retries of transport errors */
__REQ_FAILFAST_DRIVER,  /* no driver retries of driver errors */

If my calculation skills are correct, both __REQ_FAILFAST_TRANSPORT and 
REQ_OP_WRITE_ZEROES end up on bit 9.


Which means that
a) any check for REQ_OP_WRITE_ZEROES will be true for a request with 
REQ_FAILFAST_TRANSPORT

and
b) this check:
#define req_op(req) \
((req)->cmd_flags & REQ_OP_MASK)

will actually _exclude_ REQ_WRITE_ZEROES.
Am I correct?
Is this intended?

Cheers,

Hannes
--
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM

2018-09-17 Thread Hannes Reinecke
On 09/13/2018 02:15 PM, Ming Lei wrote:
> Hi,
> 
> This patchset introduces per-host admin request queue for submitting
> admin request only, and uses this approach to implement both SCSI
> quiesce and runtime PM in one very simply way. Also runtime PM deadlock
> can be avoided in case that request pool is used up, such as when too
> many IO requests are allocated before resuming device.
> 
> The idea is borrowed from NVMe.
> 
> In this patchset, admin request(all requests submitted via __scsi_execute) 
> will
> be submitted via one per-host admin queue, and the request is still
> associated with the same scsi_device as before, and respects this
> scsi_device's all kinds of limits too. Admin queue shares host tags with
> other IO queues.
> 
> One core idea is that for any admin request submitted from this admin queue,
> this request won't be called back to block layer via the associated IO
> queue(scsi_device). And this is done in the 3rd patch. So once IO queue
> is frozen, it can be observed as really frozen from block layer view.
> 
> SCSI quiesce is implemented by admin queue in very simple way, see patch
> 15.
> 
> Also runtime PM for legacy path is simplified too, see patch 16, and device
> resume is moved to blk_queue_enter().
> 
> blk-mq simply follows legacy's approach for supporting runtime PM.
> 
> Also the fast IO path is simplified much, see blk_queue_enter().
> 
[ .. ]
> 
Please don't do this.
Having an admin queue makes sense for NVMe (where it's even part of the
spec). But for SCSI it's just an additional logical construct which
doesn't correlate to anything we have in the lower layers.
And all of this just to handle PM requests properly.

At ALPSS we've discussed this issue and came up with a different
proposal: Allocate a PM request before _suspending_. Then we trivially
have that request available when resuming, and are sure that nothing can
block the request.
Far simpler, and doesn't require an entirely new infrastructure.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 2/5] nvme: register ns_id attributes as default sysfs groups

2018-09-06 Thread Hannes Reinecke
On 09/05/2018 03:45 PM, Christoph Hellwig wrote:
> On Wed, Sep 05, 2018 at 03:32:03PM +0200, Hannes Reinecke wrote:
>> On 09/05/2018 03:18 PM, Christoph Hellwig wrote:
>>> On Wed, Sep 05, 2018 at 09:00:50AM +0200, Hannes Reinecke wrote:
>>>> We should be registering the ns_id attribute as default sysfs
>>>> attribute groups, otherwise we have a race condition between
>>>> the uevent and the attributes appearing in sysfs.
>>>
>>> Please give Bart credit for his work, as the lightnvm bits are almost
>>> bigger than the rest.
>>>
>> Okay, will be doing so.
>>
>>>> +static umode_t nvm_dev_attrs_visible(struct kobject *kobj,
>>>> +   struct attribute *attr, int index)
>>>>  {
>>>> +  struct device *dev = container_of(kobj, struct device, kobj);
>>>> +  struct gendisk *disk = dev_to_disk(dev);
>>>> +  struct nvme_ns *ns = disk->private_data;
>>>>struct nvm_dev *ndev = ns->ndev;
>>>> +  struct device_attribute *dev_attr =
>>>> +  container_of(attr, typeof(*dev_attr), attr);
>>>>  
>>>> +  if (dev_attr->show == nvm_dev_attr_show)
>>>> +  return attr->mode;
>>>>  
>>>> +  switch (ndev ? ndev->geo.major_ver_id : 0) {
>>>
>>> How could ndev be zero here?
>>>
>> For 'normal' NVMe devices (ie non-lightnvm). As we now register all
>> sysfs attributes (including the lightnvm ones) per default we'll need to
>> blank them out for non-lightnvm devices.
> 
> But then we need to exit early at the beginning of the function,
> as we should not register attributes using nvm_dev_attr_show (or
> anything else for that matter) either.
> 
Okay, will be fixing it up.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    zSeries & Storage
h...@suse.com  +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 2/5] nvme: register ns_id attributes as default sysfs groups

2018-09-05 Thread Hannes Reinecke
On 09/05/2018 03:18 PM, Christoph Hellwig wrote:
> On Wed, Sep 05, 2018 at 09:00:50AM +0200, Hannes Reinecke wrote:
>> We should be registering the ns_id attribute as default sysfs
>> attribute groups, otherwise we have a race condition between
>> the uevent and the attributes appearing in sysfs.
> 
> Please give Bart credit for his work, as the lightnvm bits are almost
> bigger than the rest.
> 
Okay, will be doing so.

>> +static umode_t nvm_dev_attrs_visible(struct kobject *kobj,
>> + struct attribute *attr, int index)
>>  {
>> +struct device *dev = container_of(kobj, struct device, kobj);
>> +struct gendisk *disk = dev_to_disk(dev);
>> +struct nvme_ns *ns = disk->private_data;
>>  struct nvm_dev *ndev = ns->ndev;
>> +struct device_attribute *dev_attr =
>> +container_of(attr, typeof(*dev_attr), attr);
>>  
>> +if (dev_attr->show == nvm_dev_attr_show)
>> +return attr->mode;
>>  
>> +switch (ndev ? ndev->geo.major_ver_id : 0) {
> 
> How could ndev be zero here?
> 
For 'normal' NVMe devices (ie non-lightnvm). As we now register all
sysfs attributes (including the lightnvm ones) per default we'll need to
blank them out for non-lightnvm devices.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


[PATCH 2/5] nvme: register ns_id attributes as default sysfs groups

2018-09-05 Thread Hannes Reinecke
We should be registering the ns_id attributes as default sysfs
attribute groups; otherwise we have a race condition between
the uevent and the attributes appearing in sysfs.

Signed-off-by: Hannes Reinecke 
---
 drivers/nvme/host/core.c  |  21 -
 drivers/nvme/host/lightnvm.c  | 106 +-
 drivers/nvme/host/multipath.c |  15 ++
 drivers/nvme/host/nvme.h  |  10 +---
 4 files changed, 58 insertions(+), 94 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0e824e8c8fd7..e0a9e1c5b30e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2734,6 +2734,14 @@ const struct attribute_group nvme_ns_id_attr_group = {
.is_visible = nvme_ns_id_attrs_are_visible,
 };
 
+const struct attribute_group *nvme_ns_id_attr_groups[] = {
+	&nvme_ns_id_attr_group,
+#ifdef CONFIG_NVM
+	&nvme_nvm_attr_group,
+#endif
+   NULL,
+};
+
 #define nvme_show_str_function(field)					\
 static ssize_t  field##_show(struct device *dev,			\
 		struct device_attribute *attr, char *buf)		\
@@ -3099,14 +3107,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
nvme_get_ctrl(ctrl);
 
-	device_add_disk(ctrl->device, ns->disk, NULL);
-	if (sysfs_create_group(&disk_to_dev(ns->disk)->kobj,
-					&nvme_ns_id_attr_group))
-		pr_warn("%s: failed to create sysfs group for identification\n",
-			ns->disk->disk_name);
-	if (ns->ndev && nvme_nvm_register_sysfs(ns))
-		pr_warn("%s: failed to register lightnvm sysfs group for identification\n",
-			ns->disk->disk_name);
+	device_add_disk(ctrl->device, ns->disk, nvme_ns_id_attr_groups);
 
nvme_mpath_add_disk(ns, id);
nvme_fault_inject_init(ns);
@@ -3132,10 +3133,6 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 
nvme_fault_inject_fini(ns);
if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
-		sysfs_remove_group(&disk_to_dev(ns->disk)->kobj,
-					&nvme_ns_id_attr_group);
-		if (ns->ndev)
-			nvme_nvm_unregister_sysfs(ns);
del_gendisk(ns->disk);
blk_cleanup_queue(ns->queue);
if (blk_get_integrity(ns->disk))
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 6fe5923c95d4..941ce5fc48f5 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -1190,10 +1190,29 @@ static NVM_DEV_ATTR_12_RO(multiplane_modes);
 static NVM_DEV_ATTR_12_RO(media_capabilities);
 static NVM_DEV_ATTR_12_RO(max_phys_secs);
 
-static struct attribute *nvm_dev_attrs_12[] = {
+/* 2.0 values */
+static NVM_DEV_ATTR_20_RO(groups);
+static NVM_DEV_ATTR_20_RO(punits);
+static NVM_DEV_ATTR_20_RO(chunks);
+static NVM_DEV_ATTR_20_RO(clba);
+static NVM_DEV_ATTR_20_RO(ws_min);
+static NVM_DEV_ATTR_20_RO(ws_opt);
+static NVM_DEV_ATTR_20_RO(maxoc);
+static NVM_DEV_ATTR_20_RO(maxocpu);
+static NVM_DEV_ATTR_20_RO(mw_cunits);
+static NVM_DEV_ATTR_20_RO(write_typ);
+static NVM_DEV_ATTR_20_RO(write_max);
+static NVM_DEV_ATTR_20_RO(reset_typ);
+static NVM_DEV_ATTR_20_RO(reset_max);
+
+static struct attribute *nvm_dev_attrs[] = {
+   /* version agnostic attrs */
 	&dev_attr_version.attr,
 	&dev_attr_capabilities.attr,
+	&dev_attr_read_typ.attr,
+	&dev_attr_read_max.attr,
 
+	/* 1.2 attrs */
 	&dev_attr_vendor_opcode.attr,
 	&dev_attr_device_mode.attr,
 	&dev_attr_media_manager.attr,
@@ -1208,8 +1227,6 @@ static struct attribute *nvm_dev_attrs_12[] = {
 	&dev_attr_page_size.attr,
 	&dev_attr_hw_sector_size.attr,
 	&dev_attr_oob_sector_size.attr,
-	&dev_attr_read_typ.attr,
-	&dev_attr_read_max.attr,
 	&dev_attr_prog_typ.attr,
 	&dev_attr_prog_max.attr,
 	&dev_attr_erase_typ.attr,
@@ -1218,33 +1235,7 @@ static struct attribute *nvm_dev_attrs_12[] = {
 	&dev_attr_media_capabilities.attr,
 	&dev_attr_max_phys_secs.attr,
 
-   NULL,
-};
-
-static const struct attribute_group nvm_dev_attr_group_12 = {
-   .name   = "lightnvm",
-   .attrs  = nvm_dev_attrs_12,
-};
-
-/* 2.0 values */
-static NVM_DEV_ATTR_20_RO(groups);
-static NVM_DEV_ATTR_20_RO(punits);
-static NVM_DEV_ATTR_20_RO(chunks);
-static NVM_DEV_ATTR_20_RO(clba);
-static NVM_DEV_ATTR_20_RO(ws_min);
-static NVM_DEV_ATTR_20_RO(ws_opt);
-static NVM_DEV_ATTR_20_RO(maxoc);
-static NVM_DEV_ATTR_20_RO(maxocpu);
-static NVM_DEV_ATTR_20_RO(mw_cunits);
-static NVM_DEV_ATTR_20_RO(write_typ);
-static NVM_DEV_ATTR_20_RO(write_max);
-static NVM_DEV_ATTR_20_RO(reset_typ);
-static NVM_DEV_ATTR_20_RO(reset_max);
-
-static struct attribute *nvm_dev_attrs_20[] = {
- 

[PATCH 4/5] zram: register default groups with device_add_disk()

2018-09-05 Thread Hannes Reinecke
Register default sysfs groups during device_add_disk() to avoid a
race condition with udev during startup.

Signed-off-by: Hannes Reinecke 
Cc: Minchan Kim 
Cc: Nitin Gupta 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Bart Van Assche 
---
 drivers/block/zram/zram_drv.c | 28 +++-
 1 file changed, 7 insertions(+), 21 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index a1d6b5597c17..4879595200e1 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1636,6 +1636,11 @@ static const struct attribute_group zram_disk_attr_group = {
.attrs = zram_disk_attrs,
 };
 
+static const struct attribute_group *zram_disk_attr_groups[] = {
+   &zram_disk_attr_group,
+   NULL,
+};
+
 /*
  * Allocate and initialize new zram device. the function returns
  * '>= 0' device_id upon success, and negative value otherwise.
@@ -1716,24 +1721,14 @@ static int zram_add(void)
 
zram->disk->queue->backing_dev_info->capabilities |=
(BDI_CAP_STABLE_WRITES | BDI_CAP_SYNCHRONOUS_IO);
-   add_disk(zram->disk);
-
-   ret = sysfs_create_group(&disk_to_dev(zram->disk)->kobj,
-   &zram_disk_attr_group);
-   if (ret < 0) {
-   pr_err("Error creating sysfs group for device %d\n",
-   device_id);
-   goto out_free_disk;
-   }
+   device_add_disk(NULL, zram->disk, zram_disk_attr_groups);
+
strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor));
 
zram_debugfs_register(zram);
pr_info("Added device: %s\n", zram->disk->disk_name);
return device_id;
 
-out_free_disk:
-   del_gendisk(zram->disk);
-   put_disk(zram->disk);
 out_free_queue:
blk_cleanup_queue(queue);
 out_free_idr:
@@ -1762,15 +1757,6 @@ static int zram_remove(struct zram *zram)
mutex_unlock(&bdev->bd_mutex);
 
zram_debugfs_unregister(zram);
-   /*
-* Remove sysfs first, so no one will perform a disksize
-* store while we destroy the devices. This also helps during
-* hot_remove -- zram_reset_device() is the last holder of
-* ->init_lock, no later/concurrent disksize_store() or any
-* other sysfs handlers are possible.
-*/
-   sysfs_remove_group(&disk_to_dev(zram->disk)->kobj,
-   &zram_disk_attr_group);
 
/* Make sure all the pending I/O are finished */
fsync_bdev(bdev);
-- 
2.16.4



[PATCH 1/5] block: genhd: add 'groups' argument to device_add_disk

2018-09-05 Thread Hannes Reinecke
Update device_add_disk() to take a 'groups' argument so that
individual drivers can register a device with additional sysfs
attributes.
This avoids a race condition the driver would otherwise have if these
groups were to be created with sysfs_add_groups().

Signed-off-by: Martin Wilck 
Signed-off-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
---
 arch/um/drivers/ubd_kern.c  |  2 +-
 block/genhd.c   | 19 ++-
 drivers/block/floppy.c  |  2 +-
 drivers/block/mtip32xx/mtip32xx.c   |  2 +-
 drivers/block/ps3disk.c |  2 +-
 drivers/block/ps3vram.c |  2 +-
 drivers/block/rsxx/dev.c|  2 +-
 drivers/block/skd_main.c|  2 +-
 drivers/block/sunvdc.c  |  2 +-
 drivers/block/virtio_blk.c  |  2 +-
 drivers/block/xen-blkfront.c|  2 +-
 drivers/ide/ide-cd.c|  2 +-
 drivers/ide/ide-gd.c|  2 +-
 drivers/memstick/core/ms_block.c|  2 +-
 drivers/memstick/core/mspro_block.c |  2 +-
 drivers/mmc/core/block.c|  2 +-
 drivers/mtd/mtd_blkdevs.c   |  2 +-
 drivers/nvdimm/blk.c|  2 +-
 drivers/nvdimm/btt.c|  2 +-
 drivers/nvdimm/pmem.c   |  2 +-
 drivers/nvme/host/core.c|  2 +-
 drivers/nvme/host/multipath.c   |  2 +-
 drivers/s390/block/dasd_genhd.c |  2 +-
 drivers/s390/block/dcssblk.c|  2 +-
 drivers/s390/block/scm_blk.c|  2 +-
 drivers/scsi/sd.c   |  2 +-
 drivers/scsi/sr.c   |  2 +-
 include/linux/genhd.h   |  5 +++--
 28 files changed, 43 insertions(+), 33 deletions(-)

diff --git a/arch/um/drivers/ubd_kern.c b/arch/um/drivers/ubd_kern.c
index 83c470364dfb..6ee4c56032f7 100644
--- a/arch/um/drivers/ubd_kern.c
+++ b/arch/um/drivers/ubd_kern.c
@@ -891,7 +891,7 @@ static int ubd_disk_register(int major, u64 size, int unit,
 
disk->private_data = &ubd_devs[unit];
disk->queue = ubd_devs[unit].queue;
-   device_add_disk(parent, disk);
+   device_add_disk(parent, disk, NULL);
 
*disk_out = disk;
return 0;
diff --git a/block/genhd.c b/block/genhd.c
index 8cc719a37b32..ef0936184d69 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -567,7 +567,8 @@ static int exact_lock(dev_t devt, void *data)
return 0;
 }
 
-static void register_disk(struct device *parent, struct gendisk *disk)
+static void register_disk(struct device *parent, struct gendisk *disk,
+ const struct attribute_group **groups)
 {
struct device *ddev = disk_to_dev(disk);
struct block_device *bdev;
@@ -582,6 +583,10 @@ static void register_disk(struct device *parent, struct gendisk *disk)
/* delay uevents, until we scanned partition table */
dev_set_uevent_suppress(ddev, 1);
 
+   if (groups) {
+   WARN_ON(ddev->groups);
+   ddev->groups = groups;
+   }
if (device_add(ddev))
return;
if (!sysfs_deprecated) {
@@ -647,6 +652,7 @@ static void register_disk(struct device *parent, struct gendisk *disk)
  * __device_add_disk - add disk information to kernel list
  * @parent: parent device for the disk
  * @disk: per-device partitioning information
+ * @groups: Additional per-device sysfs groups
  * @register_queue: register the queue if set to true
  *
  * This function registers the partitioning information in @disk
@@ -655,6 +661,7 @@ static void register_disk(struct device *parent, struct gendisk *disk)
  * FIXME: error handling
  */
 static void __device_add_disk(struct device *parent, struct gendisk *disk,
+ const struct attribute_group **groups,
  bool register_queue)
 {
dev_t devt;
@@ -698,7 +705,7 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
blk_register_region(disk_devt(disk), disk->minors, NULL,
exact_match, exact_lock, disk);
}
-   register_disk(parent, disk);
+   register_disk(parent, disk, groups);
if (register_queue)
blk_register_queue(disk);
 
@@ -712,15 +719,17 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
blk_integrity_add(disk);
 }
 
-void device_add_disk(struct device *parent, struct gendisk *disk)
+void device_add_disk(struct device *parent, struct gendisk *disk,
+const struct attribute_group **groups)
+
 {
-   __device_add_disk(parent, disk, true);
+   __device_add_disk(parent, disk, groups, true);
 }
 EXPORT_SYMBOL(device_add_disk);
 
 void device_add_disk_no_queue_reg(struct device *parent, struct gendisk *disk)
 {
-   __device_add_disk(parent, disk, false);
+   __device_add_disk(parent, disk, NULL, false);
 }
 EXPORT_SYMBOL(device_add_disk_no_queue_reg);
 
diff --git a/drivers/block/floppy.c b/drivers/block

[PATCH 3/5] aoe: use device_add_disk_with_groups()

2018-09-05 Thread Hannes Reinecke
Use device_add_disk_with_groups() to avoid a race condition with
udev during startup.

Signed-off-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
Acked-by: Ed L. Cachin 
Reviewed-by: Bart Van Assche 
---
 drivers/block/aoe/aoe.h|  1 -
 drivers/block/aoe/aoeblk.c | 21 +++--
 drivers/block/aoe/aoedev.c |  1 -
 3 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/drivers/block/aoe/aoe.h b/drivers/block/aoe/aoe.h
index c0ebda1283cc..015c68017a1c 100644
--- a/drivers/block/aoe/aoe.h
+++ b/drivers/block/aoe/aoe.h
@@ -201,7 +201,6 @@ int aoeblk_init(void);
 void aoeblk_exit(void);
 void aoeblk_gdalloc(void *);
 void aoedisk_rm_debugfs(struct aoedev *d);
-void aoedisk_rm_sysfs(struct aoedev *d);
 
 int aoechr_init(void);
 void aoechr_exit(void);
diff --git a/drivers/block/aoe/aoeblk.c b/drivers/block/aoe/aoeblk.c
index 429ebb84b592..ff770e7d9e52 100644
--- a/drivers/block/aoe/aoeblk.c
+++ b/drivers/block/aoe/aoeblk.c
@@ -177,10 +177,15 @@ static struct attribute *aoe_attrs[] = {
NULL,
 };
 
-static const struct attribute_group attr_group = {
+static const struct attribute_group aoe_attr_group = {
.attrs = aoe_attrs,
 };
 
+static const struct attribute_group *aoe_attr_groups[] = {
+   &aoe_attr_group,
+   NULL,
+};
+
 static const struct file_operations aoe_debugfs_fops = {
.open = aoe_debugfs_open,
.read = seq_read,
@@ -219,17 +224,6 @@ aoedisk_rm_debugfs(struct aoedev *d)
d->debugfs = NULL;
 }
 
-static int
-aoedisk_add_sysfs(struct aoedev *d)
-{
-   return sysfs_create_group(&disk_to_dev(d->gd)->kobj, &attr_group);
-}
-void
-aoedisk_rm_sysfs(struct aoedev *d)
-{
-   sysfs_remove_group(&disk_to_dev(d->gd)->kobj, &attr_group);
-}
-
 static int
 aoeblk_open(struct block_device *bdev, fmode_t mode)
 {
@@ -417,8 +411,7 @@ aoeblk_gdalloc(void *vp)
 
spin_unlock_irqrestore(&d->lock, flags);
 
-   add_disk(gd);
-   aoedisk_add_sysfs(d);
+   device_add_disk(NULL, gd, aoe_attr_groups);
aoedisk_add_debugfs(d);
 
spin_lock_irqsave(&d->lock, flags);
diff --git a/drivers/block/aoe/aoedev.c b/drivers/block/aoe/aoedev.c
index 41060e9cedf2..f29a140cdbc1 100644
--- a/drivers/block/aoe/aoedev.c
+++ b/drivers/block/aoe/aoedev.c
@@ -275,7 +275,6 @@ freedev(struct aoedev *d)
del_timer_sync(&d->timer);
if (d->gd) {
aoedisk_rm_debugfs(d);
-   aoedisk_rm_sysfs(d);
del_gendisk(d->gd);
put_disk(d->gd);
blk_cleanup_queue(d->blkq);
-- 
2.16.4



[PATCH 5/5] virtio-blk: modernize sysfs attribute creation

2018-09-05 Thread Hannes Reinecke
Use new-style DEVICE_ATTR_RO/DEVICE_ATTR_RW to create the sysfs attributes
and register the disk with default sysfs attribute groups.

Signed-off-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
Acked-by: Michael S. Tsirkin 
Reviewed-by: Bart Van Assche 
---
 drivers/block/virtio_blk.c | 68 ++
 1 file changed, 39 insertions(+), 29 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index fe8056a1..086c6bb12baa 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -351,8 +351,8 @@ static int minor_to_index(int minor)
return minor >> PART_BITS;
 }
 
-static ssize_t virtblk_serial_show(struct device *dev,
-   struct device_attribute *attr, char *buf)
+static ssize_t serial_show(struct device *dev,
+  struct device_attribute *attr, char *buf)
 {
struct gendisk *disk = dev_to_disk(dev);
int err;
@@ -371,7 +371,7 @@ static ssize_t virtblk_serial_show(struct device *dev,
return err;
 }
 
-static DEVICE_ATTR(serial, 0444, virtblk_serial_show, NULL);
+static DEVICE_ATTR_RO(serial);
 
 /* The queue's logical block size must be set before calling this */
 static void virtblk_update_capacity(struct virtio_blk *vblk, bool resize)
@@ -545,8 +545,8 @@ static const char *const virtblk_cache_types[] = {
 };
 
 static ssize_t
-virtblk_cache_type_store(struct device *dev, struct device_attribute *attr,
-const char *buf, size_t count)
+cache_type_store(struct device *dev, struct device_attribute *attr,
+const char *buf, size_t count)
 {
struct gendisk *disk = dev_to_disk(dev);
struct virtio_blk *vblk = disk->private_data;
@@ -564,8 +564,7 @@ virtblk_cache_type_store(struct device *dev, struct device_attribute *attr,
 }
 
 static ssize_t
-virtblk_cache_type_show(struct device *dev, struct device_attribute *attr,
-char *buf)
+cache_type_show(struct device *dev, struct device_attribute *attr, char *buf)
 {
struct gendisk *disk = dev_to_disk(dev);
struct virtio_blk *vblk = disk->private_data;
@@ -575,12 +574,38 @@ virtblk_cache_type_show(struct device *dev, struct device_attribute *attr,
return snprintf(buf, 40, "%s\n", virtblk_cache_types[writeback]);
 }
 
-static const struct device_attribute dev_attr_cache_type_ro =
-   __ATTR(cache_type, 0444,
-  virtblk_cache_type_show, NULL);
-static const struct device_attribute dev_attr_cache_type_rw =
-   __ATTR(cache_type, 0644,
-  virtblk_cache_type_show, virtblk_cache_type_store);
+static DEVICE_ATTR_RW(cache_type);
+
+static struct attribute *virtblk_attrs[] = {
+   &dev_attr_serial.attr,
+   &dev_attr_cache_type.attr,
+   NULL,
+};
+
+static umode_t virtblk_attrs_are_visible(struct kobject *kobj,
+   struct attribute *a, int n)
+{
+   struct device *dev = container_of(kobj, struct device, kobj);
+   struct gendisk *disk = dev_to_disk(dev);
+   struct virtio_blk *vblk = disk->private_data;
+   struct virtio_device *vdev = vblk->vdev;
+
+   if (a == &dev_attr_cache_type.attr &&
+   !virtio_has_feature(vdev, VIRTIO_BLK_F_CONFIG_WCE))
+   return S_IRUGO;
+
+   return a->mode;
+}
+
+static const struct attribute_group virtblk_attr_group = {
+   .attrs = virtblk_attrs,
+   .is_visible = virtblk_attrs_are_visible,
+};
+
+static const struct attribute_group *virtblk_attr_groups[] = {
+   &virtblk_attr_group,
+   NULL,
+};
 
 static int virtblk_init_request(struct blk_mq_tag_set *set, struct request *rq,
unsigned int hctx_idx, unsigned int numa_node)
@@ -780,24 +805,9 @@ static int virtblk_probe(struct virtio_device *vdev)
virtblk_update_capacity(vblk, false);
virtio_device_ready(vdev);
 
-   device_add_disk(>dev, vblk->disk, NULL);
-   err = device_create_file(disk_to_dev(vblk->disk), &dev_attr_serial);
-   if (err)
-   goto out_del_disk;
-
-   if (virtio_has_feature(vdev, VIRTIO_BLK_F_CONFIG_WCE))
-   err = device_create_file(disk_to_dev(vblk->disk),
-&dev_attr_cache_type_rw);
-   else
-   err = device_create_file(disk_to_dev(vblk->disk),
-&dev_attr_cache_type_ro);
-   if (err)
-   goto out_del_disk;
+   device_add_disk(>dev, vblk->disk, virtblk_attr_groups);
return 0;
 
-out_del_disk:
-   del_gendisk(vblk->disk);
-   blk_cleanup_queue(vblk->disk->queue);
 out_free_tags:
blk_mq_free_tag_set(&vblk->tag_set);
 out_put_disk:
-- 
2.16.4



[PATCHv3 0/5] genhd: register default groups with device_add_disk()

2018-09-05 Thread Hannes Reinecke
device_add_disk() doesn't allow registering default sysfs groups,
which introduces a race with udev as these groups have to be created after
the uevent has been sent.
This patchset updates device_add_disk() to accept a 'groups' argument to
avoid this race and updates the obvious drivers to use it.
There are some more drivers which might benefit from this (eg loop or md),
but the interface is not straightforward so I haven't attempted it.

As usual, comments and reviews are welcome.

Changes to v2:
- Fold lightnvm attributes into the generic attribute group as
  suggested by Bart

Changes to v1:
- Drop first patch
- Convert lightnvm sysfs attributes as suggested by Bart

Hannes Reinecke (5):
  block: genhd: add 'groups' argument to device_add_disk
  nvme: register ns_id attributes as default sysfs groups
  aoe: use device_add_disk_with_groups()
  zram: register default groups with device_add_disk()
  virtio-blk: modernize sysfs attribute creation

 arch/um/drivers/ubd_kern.c  |   2 +-
 block/genhd.c   |  19 +--
 drivers/block/aoe/aoe.h |   1 -
 drivers/block/aoe/aoeblk.c  |  21 +++
 drivers/block/aoe/aoedev.c  |   1 -
 drivers/block/floppy.c  |   2 +-
 drivers/block/mtip32xx/mtip32xx.c   |   2 +-
 drivers/block/ps3disk.c |   2 +-
 drivers/block/ps3vram.c |   2 +-
 drivers/block/rsxx/dev.c|   2 +-
 drivers/block/skd_main.c|   2 +-
 drivers/block/sunvdc.c  |   2 +-
 drivers/block/virtio_blk.c  |  68 +--
 drivers/block/xen-blkfront.c|   2 +-
 drivers/block/zram/zram_drv.c   |  28 +++---
 drivers/ide/ide-cd.c|   2 +-
 drivers/ide/ide-gd.c|   2 +-
 drivers/memstick/core/ms_block.c|   2 +-
 drivers/memstick/core/mspro_block.c |   2 +-
 drivers/mmc/core/block.c|   2 +-
 drivers/mtd/mtd_blkdevs.c   |   2 +-
 drivers/nvdimm/blk.c|   2 +-
 drivers/nvdimm/btt.c|   2 +-
 drivers/nvdimm/pmem.c   |   2 +-
 drivers/nvme/host/core.c|  21 +++
 drivers/nvme/host/lightnvm.c| 106 +++-
 drivers/nvme/host/multipath.c   |  15 ++---
 drivers/nvme/host/nvme.h|  10 +---
 drivers/s390/block/dasd_genhd.c |   2 +-
 drivers/s390/block/dcssblk.c|   2 +-
 drivers/s390/block/scm_blk.c|   2 +-
 drivers/scsi/sd.c   |   2 +-
 drivers/scsi/sr.c   |   2 +-
 include/linux/genhd.h   |   5 +-
 34 files changed, 151 insertions(+), 190 deletions(-)

-- 
2.16.4



Re: [PATCH 3/6] nvme: register ns_id attributes as default sysfs groups

2018-08-14 Thread Hannes Reinecke
On 08/13/2018 09:51 PM, Bart Van Assche wrote:
> On Mon, 2018-07-30 at 09:12 +0200, Hannes Reinecke wrote:
>> @@ -3061,11 +3066,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
>>  
>>  nvme_get_ctrl(ctrl);
>>  
>> -device_add_disk(ctrl->device, ns->disk, NULL);
>> -if (sysfs_create_group(&disk_to_dev(ns->disk)->kobj,
>> -&nvme_ns_id_attr_group))
>> -pr_warn("%s: failed to create sysfs group for identification\n",
>> -ns->disk->disk_name);
>> +device_add_disk(ctrl->device, ns->disk, nvme_ns_id_attr_groups);
>>  if (ns->ndev && nvme_nvm_register_sysfs(ns))
>> pr_warn("%s: failed to register lightnvm sysfs group for identification\n",
>>  ns->disk->disk_name);
> 
> Hello Hannes,
> 
> Have you noticed that nvme_nvm_register_sysfs() also registers sysfs 
> attributes
> and hence that the attributes registered by that function should be merged 
> into
> the nvme_ns_id_attr_groups array?
> 
Actually, no :-)
I'll be looking into it and reposting the series.

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH v2 0/2] Fix a race condition related to creation of /dev/nvme0n

2018-08-13 Thread Hannes Reinecke

On 08/13/2018 07:25 PM, Bart Van Assche wrote:

Hello Jens,

The two patches in this series fix a race condition related to the creation of
the sysfs attributes for the /dev/nvme0n device nodes. I encountered this
race while adding tests to the blktests project for the NVMeOF kernel
drivers. Please consider these patches for the upstream kernel.

Thanks,

Bart.

Bart Van Assche (2):
   block: Introduce alloc_disk_node_attr()
   nvme: Fix race conditions related to creation of /dev/nvme0n

  block/genhd.c| 26 --
  drivers/nvme/host/core.c | 17 +
  drivers/nvme/host/lightnvm.c | 34 +++---
  drivers/nvme/host/nvme.h |  9 -
  include/linux/genhd.h| 13 +++--
  5 files changed, 51 insertions(+), 48 deletions(-)

I have fixed the very same thing with my patchset titled "genhd: 
register default groups with device_add_disk" which is waiting for 
inclusion.


Cheers,

Hannes


Re: [PATCH v4 02/10] block, scsi: Give RQF_PREEMPT back its original meaning

2018-08-06 Thread Hannes Reinecke
On 08/04/2018 02:03 AM, Bart Van Assche wrote:
> In kernel v2.6.18 RQF_PREEMPT was only set for IDE preempt requests.
> Later on the SCSI core was modified such that RQF_PREEMPT requests
> was set for all requests submitted by __scsi_execute(), including
> power management requests. RQF_PREEMPT requests are the only requests
> processed in the SDEV_QUIESCE state. Instead of setting RQF_PREEMPT
> for all requests submitted by __scsi_execute(), only set RQF_PM for
> power management requests. Modify blk_get_request() such that it
> blocks in pm-only mode on non-RQF_PM requests. Leave the code out
> from scsi_prep_state_check() that is no longer reachable.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Martin K. Petersen 
> Cc: Christoph Hellwig 
> Cc: Jianchao Wang 
> Cc: Ming Lei 
> Cc: Alan Stern 
> Cc: Johannes Thumshirn 
> ---
>  block/blk-core.c|  9 +
>  drivers/scsi/scsi_lib.c | 28 
>  include/linux/blk-mq.h  |  2 ++
>  include/linux/blkdev.h  |  6 ++
>  4 files changed, 21 insertions(+), 24 deletions(-)
> 
I am not sure this is going to work for SCSI parallel; we're using the
QUIESCE state there to do domain validation, and all commands there are
most definitely not PM requests.
Can you please validate your patches with e.g. aic7xxx and SCSI parallel
disks?

Thanks.

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH v4 01/10] block: Change the preempt-only flag into a counter

2018-08-06 Thread Hannes Reinecke
On 08/04/2018 02:03 AM, Bart Van Assche wrote:
> Rename "preempt-only" into "pm-only" because the primary purpose of
> this mode is power management. Since the power management core may
> but does not have to resume a runtime suspended device before
> performing system-wide suspend and since a later patch will set
> "pm-only" mode as long as a block device is runtime suspended, make
> it possible to set "pm-only" mode from more than one context.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Christoph Hellwig 
> Cc: Jianchao Wang 
> Cc: Ming Lei 
> Cc: Johannes Thumshirn 
> Cc: Alan Stern 
> ---
>  block/blk-core.c| 35 ++-
>  block/blk-mq-debugfs.c  | 10 +-
>  drivers/scsi/scsi_lib.c |  8 
>  include/linux/blkdev.h  | 14 +++++-
>  4 files changed, 40 insertions(+), 27 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


[PATCH] block: copy ioprio in __bio_clone_fast()

2018-08-01 Thread Hannes Reinecke
We need to copy the io priority, too; otherwise the clone will run
with a different priority than the original one.

Fixes: 43b62ce3ff0a ("block: move bio io prio to a new field")
Signed-off-by: Hannes Reinecke 
---
 block/bio.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 8ecc95615941..88129f20623d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -608,6 +608,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
bio->bi_write_hint = bio_src->bi_write_hint;
bio->bi_iter = bio_src->bi_iter;
bio->bi_io_vec = bio_src->bi_io_vec;
+   bio->bi_ioprio = bio_src->bi_ioprio;
 
bio_clone_blkcg_association(bio, bio_src);
 }
@@ -692,6 +693,7 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
bio->bi_write_hint  = bio_src->bi_write_hint;
bio->bi_iter.bi_sector  = bio_src->bi_iter.bi_sector;
bio->bi_iter.bi_size= bio_src->bi_iter.bi_size;
+   bio->bi_ioprio  = bio_src->bi_ioprio;
 
switch (bio_op(bio)) {
case REQ_OP_DISCARD:
-- 
2.12.3



Re: [PATCH 4/6] aoe: use device_add_disk_with_groups()

2018-08-01 Thread Hannes Reinecke
On 08/01/2018 03:00 AM, Ed Cashin wrote:
> On Mon, Jul 30, 2018 at 3:12 AM Hannes Reinecke <h...@suse.de> wrote:
> 
> Use device_add_disk_with_groups() to avoid a race condition with
> udev during startup.
> 
> 
> I love the idea of getting rid of the race, but I am having trouble
> seeing what happened to the cleanup we had via sysfs_remove_group.
> You're storing a pointer to groups off the device, but I don't see it
> getting
> used for cleanup later in this patch set.  Are you patching linux-next?
> 
And that's the beauty of this patch: you don't need to free/unlink the
groups yourself.
Unlinking is done in the driver core via
device_del()->device_remove_attrs()->device_remove_groups().

So no separate patch needed.

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 1/6] genhd: drop 'bool' argument from __device_add_disk()

2018-07-30 Thread Hannes Reinecke
On 07/30/2018 10:56 AM, Christoph Hellwig wrote:
> I really don't see the point for this change.
> 
Okay, I'll be dropping it on the next iteration.

Cheers,

Hannes
-- 
Dr. Hannes ReineckezSeries & Storage
h...@suse.com  +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


[PATCH 3/6] nvme: register ns_id attributes as default sysfs groups

2018-07-30 Thread Hannes Reinecke
We should be registering the ns_id attributes as a default sysfs
attribute group; otherwise we have a race condition between
the uevent and the attributes appearing in sysfs.

Signed-off-by: Hannes Reinecke 
Cc: Sagi Grimberg 
Cc: Keith Busch 
---
 drivers/nvme/host/core.c  | 13 ++---
 drivers/nvme/host/multipath.c | 15 ---
 drivers/nvme/host/nvme.h  |  2 +-
 3 files changed, 11 insertions(+), 19 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f1b61ebceaf1..39bdfe806d1b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2696,6 +2696,11 @@ const struct attribute_group nvme_ns_id_attr_group = {
.is_visible = nvme_ns_id_attrs_are_visible,
 };
 
+const struct attribute_group *nvme_ns_id_attr_groups[] = {
+   &nvme_ns_id_attr_group,
+   NULL,
+};
+
 #define nvme_show_str_function(field)  \
 static ssize_t  field##_show(struct device *dev,   \
		struct device_attribute *attr, char *buf)   \
@@ -3061,11 +3066,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
nvme_get_ctrl(ctrl);
 
-   device_add_disk(ctrl->device, ns->disk, NULL);
-   if (sysfs_create_group(&disk_to_dev(ns->disk)->kobj,
-   &nvme_ns_id_attr_group))
-   pr_warn("%s: failed to create sysfs group for identification\n",
-   ns->disk->disk_name);
+   device_add_disk(ctrl->device, ns->disk, nvme_ns_id_attr_groups);
if (ns->ndev && nvme_nvm_register_sysfs(ns))
pr_warn("%s: failed to register lightnvm sysfs group for identification\n",
ns->disk->disk_name);
@@ -3094,8 +3095,6 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 
nvme_fault_inject_fini(ns);
if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
-   sysfs_remove_group(&disk_to_dev(ns->disk)->kobj,
-   &nvme_ns_id_attr_group);
if (ns->ndev)
nvme_nvm_unregister_sysfs(ns);
del_gendisk(ns->disk);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index e4d786467129..7d3c30324da9 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -281,13 +281,9 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
if (!head->disk)
return;
 
-   if (!(head->disk->flags & GENHD_FL_UP)) {
-   device_add_disk(&head->subsys->dev, head->disk, NULL);
-   if (sysfs_create_group(&disk_to_dev(head->disk)->kobj,
-   &nvme_ns_id_attr_group))
-   dev_warn(&head->subsys->dev,
-"failed to create id group.\n");
-   }
+   if (!(head->disk->flags & GENHD_FL_UP))
+   device_add_disk(&head->subsys->dev, head->disk,
+   nvme_ns_id_attr_groups);
 
kblockd_schedule_work(&ns->head->requeue_work);
 }
@@ -493,11 +489,8 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head)
 {
if (!head->disk)
return;
-   if (head->disk->flags & GENHD_FL_UP) {
-   sysfs_remove_group(&disk_to_dev(head->disk)->kobj,
-  &nvme_ns_id_attr_group);
+   if (head->disk->flags & GENHD_FL_UP)
del_gendisk(head->disk);
-   }
blk_set_queue_dying(head->disk->queue);
/* make sure all pending bios are cleaned up */
kblockd_schedule_work(&head->requeue_work);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 8b356f1d941c..7f156367bd18 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -466,7 +466,7 @@ int nvme_delete_ctrl_sync(struct nvme_ctrl *ctrl);
 int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page, u8 lsp,
void *log, size_t size, u64 offset);
 
-extern const struct attribute_group nvme_ns_id_attr_group;
+extern const struct attribute_group *nvme_ns_id_attr_groups[];
 extern const struct block_device_operations nvme_ns_head_ops;
 
 #ifdef CONFIG_NVME_MULTIPATH
-- 
2.12.3



[PATCH 2/6] block: genhd: add 'groups' argument to device_add_disk

2018-07-30 Thread Hannes Reinecke
Update device_add_disk() to take a 'groups' argument so that
individual drivers can register a device with additional sysfs
attributes.
This avoids a race condition the driver would otherwise have if these
groups were to be created with sysfs_add_groups().

Signed-off-by: Martin Wilck 
Signed-off-by: Hannes Reinecke 
---
 arch/um/drivers/ubd_kern.c  |  2 +-
 block/genhd.c   | 21 +++--
 drivers/block/floppy.c  |  2 +-
 drivers/block/mtip32xx/mtip32xx.c   |  2 +-
 drivers/block/ps3disk.c |  2 +-
 drivers/block/ps3vram.c |  2 +-
 drivers/block/rsxx/dev.c|  2 +-
 drivers/block/skd_main.c|  2 +-
 drivers/block/sunvdc.c  |  2 +-
 drivers/block/virtio_blk.c  |  2 +-
 drivers/block/xen-blkfront.c|  2 +-
 drivers/ide/ide-cd.c|  2 +-
 drivers/ide/ide-gd.c|  2 +-
 drivers/memstick/core/ms_block.c|  2 +-
 drivers/memstick/core/mspro_block.c |  2 +-
 drivers/mmc/core/block.c|  2 +-
 drivers/mtd/mtd_blkdevs.c   |  2 +-
 drivers/nvdimm/blk.c|  2 +-
 drivers/nvdimm/btt.c|  2 +-
 drivers/nvdimm/pmem.c   |  2 +-
 drivers/nvme/host/core.c|  2 +-
 drivers/nvme/host/multipath.c   |  2 +-
 drivers/s390/block/dasd_genhd.c |  2 +-
 drivers/s390/block/dcssblk.c|  2 +-
 drivers/s390/block/scm_blk.c|  2 +-
 drivers/scsi/sd.c   |  2 +-
 drivers/scsi/sr.c   |  2 +-
 include/linux/genhd.h   |  5 +++--
 28 files changed, 44 insertions(+), 34 deletions(-)

diff --git a/arch/um/drivers/ubd_kern.c b/arch/um/drivers/ubd_kern.c
index 83c470364dfb..6ee4c56032f7 100644
--- a/arch/um/drivers/ubd_kern.c
+++ b/arch/um/drivers/ubd_kern.c
@@ -891,7 +891,7 @@ static int ubd_disk_register(int major, u64 size, int unit,
 
disk->private_data = &ubd_devs[unit];
disk->queue = ubd_devs[unit].queue;
-   device_add_disk(parent, disk);
+   device_add_disk(parent, disk, NULL);
 
*disk_out = disk;
return 0;
diff --git a/block/genhd.c b/block/genhd.c
index 62aba3d8ae19..500c2e8d6345 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -567,7 +567,8 @@ static int exact_lock(dev_t devt, void *data)
return 0;
 }
 
-static void register_disk(struct device *parent, struct gendisk *disk)
+static void register_disk(struct device *parent, struct gendisk *disk,
+ const struct attribute_group **groups)
 {
struct device *ddev = disk_to_dev(disk);
struct block_device *bdev;
@@ -582,6 +583,10 @@ static void register_disk(struct device *parent, struct gendisk *disk)
/* delay uevents, until we scanned partition table */
dev_set_uevent_suppress(ddev, 1);
 
+   if (groups) {
+   WARN_ON(ddev->groups);
+   ddev->groups = groups;
+   }
if (device_add(ddev))
return;
if (!sysfs_deprecated) {
@@ -647,13 +652,15 @@ static void register_disk(struct device *parent, struct gendisk *disk)
  * __device_add_disk - add disk information to kernel list
  * @parent: parent device for the disk
  * @disk: per-device partitioning information
+ * @groups: Additional per-device sysfs groups
  *
  * This function registers the partitioning information in @disk
  * with the kernel.
  *
  * FIXME: error handling
  */
-static void __device_add_disk(struct device *parent, struct gendisk *disk)
+static void __device_add_disk(struct device *parent, struct gendisk *disk,
+ const struct attribute_group **groups)
 {
dev_t devt;
int retval;
@@ -696,12 +703,13 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk)
blk_register_region(disk_devt(disk), disk->minors, NULL,
exact_match, exact_lock, disk);
}
-   register_disk(parent, disk);
+   register_disk(parent, disk, groups);
 }
 
-void device_add_disk(struct device *parent, struct gendisk *disk)
+void device_add_disk(struct device *parent, struct gendisk *disk,
+const struct attribute_group **groups)
 {
-   __device_add_disk(parent, disk);
+   __device_add_disk(parent, disk, groups);
 
blk_register_queue(disk);
 
@@ -718,7 +726,8 @@ EXPORT_SYMBOL(device_add_disk);
 
 void device_add_disk_no_queue_reg(struct device *parent, struct gendisk *disk)
 {
-   __device_add_disk(parent, disk);
+   __device_add_disk(parent, disk, NULL);
+
/*
 * Take an extra ref on queue which will be put on disk_release()
 * so that it sticks around as long as @disk is there.
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index 48f622728ce6..1bc99e9dfaee 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -4676,7 +4676,7 @@ static int __init do_f

[PATCH 1/6] genhd: drop 'bool' argument from __device_add_disk()

2018-07-30 Thread Hannes Reinecke
Split off the last part of __device_add_disk() and move it into the
caller.

Signed-off-by: Hannes Reinecke 
---
 block/genhd.c | 28 +---
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 8cc719a37b32..62aba3d8ae19 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -647,15 +647,13 @@ static void register_disk(struct device *parent, struct gendisk *disk)
  * __device_add_disk - add disk information to kernel list
  * @parent: parent device for the disk
  * @disk: per-device partitioning information
- * @register_queue: register the queue if set to true
  *
  * This function registers the partitioning information in @disk
  * with the kernel.
  *
  * FIXME: error handling
  */
-static void __device_add_disk(struct device *parent, struct gendisk *disk,
- bool register_queue)
+static void __device_add_disk(struct device *parent, struct gendisk *disk)
 {
dev_t devt;
int retval;
@@ -699,8 +697,13 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
exact_match, exact_lock, disk);
}
register_disk(parent, disk);
-   if (register_queue)
-   blk_register_queue(disk);
+}
+
+void device_add_disk(struct device *parent, struct gendisk *disk)
+{
+   __device_add_disk(parent, disk);
+
+   blk_register_queue(disk);
 
/*
 * Take an extra ref on queue which will be put on disk_release()
@@ -711,16 +714,19 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
disk_add_events(disk);
blk_integrity_add(disk);
 }
-
-void device_add_disk(struct device *parent, struct gendisk *disk)
-{
-   __device_add_disk(parent, disk, true);
-}
 EXPORT_SYMBOL(device_add_disk);
 
 void device_add_disk_no_queue_reg(struct device *parent, struct gendisk *disk)
 {
-   __device_add_disk(parent, disk, false);
+   __device_add_disk(parent, disk);
+   /*
+* Take an extra ref on queue which will be put on disk_release()
+* so that it sticks around as long as @disk is there.
+*/
+   WARN_ON_ONCE(!blk_get_queue(disk->queue));
+
+   disk_add_events(disk);
+   blk_integrity_add(disk);
 }
 EXPORT_SYMBOL(device_add_disk_no_queue_reg);
 
-- 
2.12.3



[PATCH 6/6] virtio-blk: modernize sysfs attribute creation

2018-07-30 Thread Hannes Reinecke
Use new-style DEVICE_ATTR_RO/DEVICE_ATTR_RW to create the sysfs attributes
and register the disk with default sysfs attribute groups.

Signed-off-by: Hannes Reinecke 
Cc: Michael Tsirkin 
Cc: Jason Wang 
---
 drivers/block/virtio_blk.c | 68 ++
 1 file changed, 39 insertions(+), 29 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index fe8056a1..086c6bb12baa 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -351,8 +351,8 @@ static int minor_to_index(int minor)
return minor >> PART_BITS;
 }
 
-static ssize_t virtblk_serial_show(struct device *dev,
-   struct device_attribute *attr, char *buf)
+static ssize_t serial_show(struct device *dev,
+  struct device_attribute *attr, char *buf)
 {
struct gendisk *disk = dev_to_disk(dev);
int err;
@@ -371,7 +371,7 @@ static ssize_t virtblk_serial_show(struct device *dev,
return err;
 }
 
-static DEVICE_ATTR(serial, 0444, virtblk_serial_show, NULL);
+static DEVICE_ATTR_RO(serial);
 
 /* The queue's logical block size must be set before calling this */
 static void virtblk_update_capacity(struct virtio_blk *vblk, bool resize)
@@ -545,8 +545,8 @@ static const char *const virtblk_cache_types[] = {
 };
 
 static ssize_t
-virtblk_cache_type_store(struct device *dev, struct device_attribute *attr,
-const char *buf, size_t count)
+cache_type_store(struct device *dev, struct device_attribute *attr,
+const char *buf, size_t count)
 {
struct gendisk *disk = dev_to_disk(dev);
struct virtio_blk *vblk = disk->private_data;
@@ -564,8 +564,7 @@ virtblk_cache_type_store(struct device *dev, struct device_attribute *attr,
 }
 
 static ssize_t
-virtblk_cache_type_show(struct device *dev, struct device_attribute *attr,
-char *buf)
+cache_type_show(struct device *dev, struct device_attribute *attr, char *buf)
 {
struct gendisk *disk = dev_to_disk(dev);
struct virtio_blk *vblk = disk->private_data;
@@ -575,12 +574,38 @@ virtblk_cache_type_show(struct device *dev, struct device_attribute *attr,
return snprintf(buf, 40, "%s\n", virtblk_cache_types[writeback]);
 }
 
-static const struct device_attribute dev_attr_cache_type_ro =
-   __ATTR(cache_type, 0444,
-  virtblk_cache_type_show, NULL);
-static const struct device_attribute dev_attr_cache_type_rw =
-   __ATTR(cache_type, 0644,
-  virtblk_cache_type_show, virtblk_cache_type_store);
+static DEVICE_ATTR_RW(cache_type);
+
+static struct attribute *virtblk_attrs[] = {
+   &dev_attr_serial.attr,
+   &dev_attr_cache_type.attr,
+   NULL,
+};
+
+static umode_t virtblk_attrs_are_visible(struct kobject *kobj,
+   struct attribute *a, int n)
+{
+   struct device *dev = container_of(kobj, struct device, kobj);
+   struct gendisk *disk = dev_to_disk(dev);
+   struct virtio_blk *vblk = disk->private_data;
+   struct virtio_device *vdev = vblk->vdev;
+
+   if (a == &dev_attr_cache_type.attr &&
+   !virtio_has_feature(vdev, VIRTIO_BLK_F_CONFIG_WCE))
+   return S_IRUGO;
+
+   return a->mode;
+}
+
+static const struct attribute_group virtblk_attr_group = {
+   .attrs = virtblk_attrs,
+   .is_visible = virtblk_attrs_are_visible,
+};
+
+static const struct attribute_group *virtblk_attr_groups[] = {
+   &virtblk_attr_group,
+   NULL,
+};
 
 static int virtblk_init_request(struct blk_mq_tag_set *set, struct request *rq,
unsigned int hctx_idx, unsigned int numa_node)
@@ -780,24 +805,9 @@ static int virtblk_probe(struct virtio_device *vdev)
virtblk_update_capacity(vblk, false);
virtio_device_ready(vdev);
 
-   device_add_disk(&vdev->dev, vblk->disk, NULL);
-   err = device_create_file(disk_to_dev(vblk->disk), &dev_attr_serial);
-   if (err)
-   goto out_del_disk;
-
-   if (virtio_has_feature(vdev, VIRTIO_BLK_F_CONFIG_WCE))
-   err = device_create_file(disk_to_dev(vblk->disk),
-&dev_attr_cache_type_rw);
-   else
-   err = device_create_file(disk_to_dev(vblk->disk),
-&dev_attr_cache_type_ro);
-   if (err)
-   goto out_del_disk;
+   device_add_disk(&vdev->dev, vblk->disk, virtblk_attr_groups);
return 0;
 
-out_del_disk:
-   del_gendisk(vblk->disk);
-   blk_cleanup_queue(vblk->disk->queue);
 out_free_tags:
blk_mq_free_tag_set(&vblk->tag_set);
 out_put_disk:
-- 
2.12.3



[PATCH 4/6] aoe: use device_add_disk_with_groups()

2018-07-30 Thread Hannes Reinecke
Register the default sysfs groups via device_add_disk() to avoid a
race condition with udev during startup.

Signed-off-by: Hannes Reinecke 
Cc: Ed L. Cashin 
---
 drivers/block/aoe/aoe.h|  1 -
 drivers/block/aoe/aoeblk.c | 21 +++--
 drivers/block/aoe/aoedev.c |  1 -
 3 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/drivers/block/aoe/aoe.h b/drivers/block/aoe/aoe.h
index c0ebda1283cc..015c68017a1c 100644
--- a/drivers/block/aoe/aoe.h
+++ b/drivers/block/aoe/aoe.h
@@ -201,7 +201,6 @@ int aoeblk_init(void);
 void aoeblk_exit(void);
 void aoeblk_gdalloc(void *);
 void aoedisk_rm_debugfs(struct aoedev *d);
-void aoedisk_rm_sysfs(struct aoedev *d);
 
 int aoechr_init(void);
 void aoechr_exit(void);
diff --git a/drivers/block/aoe/aoeblk.c b/drivers/block/aoe/aoeblk.c
index 429ebb84b592..ff770e7d9e52 100644
--- a/drivers/block/aoe/aoeblk.c
+++ b/drivers/block/aoe/aoeblk.c
@@ -177,10 +177,15 @@ static struct attribute *aoe_attrs[] = {
NULL,
 };
 
-static const struct attribute_group attr_group = {
+static const struct attribute_group aoe_attr_group = {
.attrs = aoe_attrs,
 };
 
+static const struct attribute_group *aoe_attr_groups[] = {
+   &aoe_attr_group,
+   NULL,
+};
+
 static const struct file_operations aoe_debugfs_fops = {
.open = aoe_debugfs_open,
.read = seq_read,
@@ -220,17 +225,6 @@ aoedisk_rm_debugfs(struct aoedev *d)
 }
 
 static int
-aoedisk_add_sysfs(struct aoedev *d)
-{
-   return sysfs_create_group(&disk_to_dev(d->gd)->kobj, &attr_group);
-}
-void
-aoedisk_rm_sysfs(struct aoedev *d)
-{
-   sysfs_remove_group(&disk_to_dev(d->gd)->kobj, &attr_group);
-}
-
-static int
 aoeblk_open(struct block_device *bdev, fmode_t mode)
 {
struct aoedev *d = bdev->bd_disk->private_data;
@@ -417,8 +411,7 @@ aoeblk_gdalloc(void *vp)
 
spin_unlock_irqrestore(&d->lock, flags);
 
-   add_disk(gd);
-   aoedisk_add_sysfs(d);
+   device_add_disk(NULL, gd, aoe_attr_groups);
aoedisk_add_debugfs(d);
 
spin_lock_irqsave(&d->lock, flags);
diff --git a/drivers/block/aoe/aoedev.c b/drivers/block/aoe/aoedev.c
index 697f735b07a4..d92fa1fe3580 100644
--- a/drivers/block/aoe/aoedev.c
+++ b/drivers/block/aoe/aoedev.c
@@ -275,7 +275,6 @@ freedev(struct aoedev *d)
del_timer_sync(&d->timer);
if (d->gd) {
aoedisk_rm_debugfs(d);
-   aoedisk_rm_sysfs(d);
del_gendisk(d->gd);
put_disk(d->gd);
blk_cleanup_queue(d->blkq);
-- 
2.12.3



[PATCH 5/6] zram: register default groups with device_add_disk()

2018-07-30 Thread Hannes Reinecke
Register default sysfs groups during device_add_disk() to avoid a
race condition with udev during startup.

Signed-off-by: Hannes Reinecke 
Cc: Minchan Kim 
Cc: Nitin Gupta 
---
 drivers/block/zram/zram_drv.c | 28 +++-
 1 file changed, 7 insertions(+), 21 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 2907a8156aaf..fc3626cb7fe5 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1618,6 +1618,11 @@ static const struct attribute_group zram_disk_attr_group = {
.attrs = zram_disk_attrs,
 };
 
+static const struct attribute_group *zram_disk_attr_groups[] = {
+   &zram_disk_attr_group,
+   NULL,
+};
+
 /*
  * Allocate and initialize new zram device. the function returns
  * '>= 0' device_id upon success, and negative value otherwise.
@@ -1698,24 +1703,14 @@ static int zram_add(void)
 
zram->disk->queue->backing_dev_info->capabilities |=
(BDI_CAP_STABLE_WRITES | BDI_CAP_SYNCHRONOUS_IO);
-   add_disk(zram->disk);
-
-   ret = sysfs_create_group(&disk_to_dev(zram->disk)->kobj,
-   &zram_disk_attr_group);
-   if (ret < 0) {
-   pr_err("Error creating sysfs group for device %d\n",
-   device_id);
-   goto out_free_disk;
-   }
+   device_add_disk(NULL, zram->disk, zram_disk_attr_groups);
+
strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor));
 
zram_debugfs_register(zram);
pr_info("Added device: %s\n", zram->disk->disk_name);
return device_id;
 
-out_free_disk:
-   del_gendisk(zram->disk);
-   put_disk(zram->disk);
 out_free_queue:
blk_cleanup_queue(queue);
 out_free_idr:
@@ -1744,15 +1739,6 @@ static int zram_remove(struct zram *zram)
mutex_unlock(&bdev->bd_mutex);
 
zram_debugfs_unregister(zram);
-   /*
-* Remove sysfs first, so no one will perform a disksize
-* store while we destroy the devices. This also helps during
-* hot_remove -- zram_reset_device() is the last holder of
-* ->init_lock, no later/concurrent disksize_store() or any
-* other sysfs handlers are possible.
-*/
-   sysfs_remove_group(&disk_to_dev(zram->disk)->kobj,
-   &zram_disk_attr_group);
 
/* Make sure all the pending I/O are finished */
fsync_bdev(bdev);
-- 
2.12.3



[PATCH 0/6] genhd: register default groups with device_add_disk()

2018-07-30 Thread Hannes Reinecke
device_add_disk() doesn't allow registering default sysfs groups,
which introduces a race with udev as these groups then have to be
created after the uevent has been sent.
This patchset updates device_add_disk() to accept a 'groups' argument
to avoid this race and updates the obvious drivers to use it.
There are some more drivers which might benefit from this (e.g. loop
or md), but the interface is not straightforward, so I haven't
attempted those.

As usual, comments and reviews are welcome.

Hannes Reinecke (6):
  genhd: drop 'bool' argument from __device_add_disk()
  block: genhd: add 'groups' argument to device_add_disk
  nvme: register ns_id attributes as default sysfs groups
  aoe: use device_add_disk_with_groups()
  zram: register default groups with device_add_disk()
  virtio-blk: modernize sysfs attribute creation

 arch/um/drivers/ubd_kern.c  |  2 +-
 block/genhd.c   | 39 ++---
 drivers/block/aoe/aoe.h |  1 -
 drivers/block/aoe/aoeblk.c  | 21 
 drivers/block/aoe/aoedev.c  |  1 -
 drivers/block/floppy.c  |  2 +-
 drivers/block/mtip32xx/mtip32xx.c   |  2 +-
 drivers/block/ps3disk.c |  2 +-
 drivers/block/ps3vram.c |  2 +-
 drivers/block/rsxx/dev.c|  2 +-
 drivers/block/skd_main.c|  2 +-
 drivers/block/sunvdc.c  |  2 +-
 drivers/block/virtio_blk.c  | 68 +
 drivers/block/xen-blkfront.c|  2 +-
 drivers/block/zram/zram_drv.c   | 28 ---
 drivers/ide/ide-cd.c|  2 +-
 drivers/ide/ide-gd.c|  2 +-
 drivers/memstick/core/ms_block.c|  2 +-
 drivers/memstick/core/mspro_block.c |  2 +-
 drivers/mmc/core/block.c|  2 +-
 drivers/mtd/mtd_blkdevs.c   |  2 +-
 drivers/nvdimm/blk.c|  2 +-
 drivers/nvdimm/btt.c|  2 +-
 drivers/nvdimm/pmem.c   |  2 +-
 drivers/nvme/host/core.c| 13 ---
 drivers/nvme/host/multipath.c   | 15 +++-
 drivers/nvme/host/nvme.h|  2 +-
 drivers/s390/block/dasd_genhd.c |  2 +-
 drivers/s390/block/dcssblk.c|  2 +-
 drivers/s390/block/scm_blk.c|  2 +-
 drivers/scsi/sd.c   |  2 +-
 drivers/scsi/sr.c   |  2 +-
 include/linux/genhd.h   |  5 +--
 33 files changed, 117 insertions(+), 122 deletions(-)

-- 
2.12.3



Re: [PATCH] blktests: Add '--outdir' to store results in a different directory

2018-07-26 Thread Hannes Reinecke
On 07/25/2018 11:23 PM, Omar Sandoval wrote:
> On Tue, Jul 17, 2018 at 03:27:50PM +0200, Hannes Reinecke wrote:
Adding an option '--outdir' to store results in a different
directory so as not to clutter the git repository itself.
>>
>> Signed-off-by: Hannes Reinecke 
>> ---
>>  check | 14 ++
>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/check b/check
>> index a635531..42d07f8 100755
>> --- a/check
>> +++ b/check
>> @@ -334,7 +334,7 @@ _call_test() {
>>  fi
>>  
>>  trap _cleanup EXIT
>> -if ! TMPDIR="$(mktemp --tmpdir -p "$PWD/results" -d 
>> "tmpdir.${TEST_NAME//\//.}.XXX")"; then
>> +if ! TMPDIR="$(mktemp --tmpdir -p "$RESULTS_DIR" -d 
>> "tmpdir.${TEST_NAME//\//.}.XXX")"; then
>>  return
>>  fi
>>  
>> @@ -415,7 +415,7 @@ _run_test() {
>>  return 0
>>  fi
>>  
>> -RESULTS_DIR="results/nodev"
>> +RESULTS_DIR="${OUT_DIR}/results/nodev"
>>  _call_test test
>>  else
>>  if [[ ${#TEST_DEVS[@]} -eq 0 ]]; then
>> @@ -434,7 +434,7 @@ _run_test() {
>>  _output_notrun "$TEST_NAME => $(basename 
>> "$TEST_DEV")"
>>  continue
>>  fi
>> -RESULTS_DIR="results/$(basename "$TEST_DEV")"
>> +RESULTS_DIR="${OUT_DIR}/results/$(basename "$TEST_DEV")"
>>  if ! _call_test test_device; then
>>  ret=1
>>  fi
>> @@ -567,6 +567,7 @@ Test runs:
>>   tests to run
>>  
>>  Miscellaneous:
>> +  -o, --outdir=OUTDIRwrite results into the specified directory
>>-h, --help display this help message and exit"
>>  
>>  case "$1" in
>> @@ -581,12 +582,13 @@ Miscellaneous:
>>  esac
>>  }
>>  
>> -if ! TEMP=$(getopt -o 'dq::x:h' --long 'quick::,exclude:,help' -n "$0" -- 
>> "$@"); then
>> +if ! TEMP=$(getopt -o 'dq::o:x:h' --long 'quick::,exclude:,outdir:,help' -n 
>> "$0" -- "$@"); then
>>  exit 1
>>  fi
>>  
>>  eval set -- "$TEMP"
>>  unset TEMP
>> +OUT_DIR="."
> 
> This doesn't allow setting it from the config file. How about this?
> 
> diff --git a/check b/check
> index 5f4461f..5e99415 100755
> --- a/check
> +++ b/check
> @@ -313,7 +313,7 @@ _call_test() {
>   local test_func="$1"
>   local seqres="${RESULTS_DIR}/${TEST_NAME}"
>   # shellcheck disable=SC2034
> - FULL="$PWD/${seqres}.full"
> + FULL="${seqres}.full"
>   declare -A TEST_DEV_QUEUE_SAVED
>  
>   _read_last_test_run
> @@ -334,7 +334,7 @@ _call_test() {
>   fi
>  
>   trap _cleanup EXIT
> - if ! TMPDIR="$(mktemp --tmpdir -p "$PWD/results" -d 
> "tmpdir.${TEST_NAME//\//.}.XXX")"; then
> + if ! TMPDIR="$(mktemp --tmpdir -p "$OUTPUT" -d 
> "tmpdir.${TEST_NAME//\//.}.XXX")"; then
>   return
>   fi
>  
> @@ -415,7 +415,7 @@ _run_test() {
>   return 0
>   fi
>  
> - RESULTS_DIR="results/nodev"
> + RESULTS_DIR="$OUTPUT/nodev"
>   _call_test test
>   else
>   if [[ ${#TEST_DEVS[@]} -eq 0 ]]; then
> @@ -434,7 +434,7 @@ _run_test() {
>   _output_notrun "$TEST_NAME => $(basename 
> "$TEST_DEV")"
>   continue
>   fi
> - RESULTS_DIR="results/$(basename "$TEST_DEV")"
> + RESULTS_DIR="$OUTPUT/$(basename "$TEST_DEV")"
>   if ! _call_test test_device; then
>   ret=1
>   fi
> @@ -559,6 +559,9 @@ Test runs:
>-d, --device-only   only run tests which use a test device from the
>TEST_DEVS config setting
>  
> +  -o, --output=DIRoutput results to the given directory (the default is
> +  ./results)
> +
>-q, --quick=SECONDS do a quick run (only run quick tests and limit 
> 

Re: [PATCH v5 2/3] blkdev: __blkdev_direct_IO_simple: fix leak in error case

2018-07-26 Thread Hannes Reinecke
On 07/25/2018 11:15 PM, Martin Wilck wrote:
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for
>  simplified bdev direct-io")
> Reviewed-by: Ming Lei 
> Signed-off-by: Martin Wilck 
> ---
>  fs/block_dev.c | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


[PATCH 1/5] genhd: drop 'bool' argument from __device_add_disk()

2018-07-25 Thread Hannes Reinecke
Split off the last part of __device_add_disk() into __device_get_disk().
With that we can drop the 'bool' argument and streamline the function.

Signed-off-by: Hannes Reinecke 
---
 block/genhd.c | 18 +++---
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index f1543a45e73b..cfa7f4f78435 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -647,15 +647,13 @@ static void register_disk(struct device *parent, struct gendisk *disk)
  * __device_add_disk - add disk information to kernel list
  * @parent: parent device for the disk
  * @disk: per-device partitioning information
- * @register_queue: register the queue if set to true
  *
  * This function registers the partitioning information in @disk
  * with the kernel.
  *
  * FIXME: error handling
  */
-static void __device_add_disk(struct device *parent, struct gendisk *disk,
- bool register_queue)
+static void __device_add_disk(struct device *parent, struct gendisk *disk)
 {
dev_t devt;
int retval;
@@ -699,9 +697,10 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
exact_match, exact_lock, disk);
}
register_disk(parent, disk);
-   if (register_queue)
-   blk_register_queue(disk);
+}
 
+void __device_get_disk(struct gendisk *disk)
+{
/*
 * Take an extra ref on queue which will be put on disk_release()
 * so that it sticks around as long as @disk is there.
@@ -714,13 +713,18 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
 
 void device_add_disk(struct device *parent, struct gendisk *disk)
 {
-   __device_add_disk(parent, disk, true);
+   __device_add_disk(parent, disk);
+
+   blk_register_queue(disk);
+
+   __device_get_disk(disk);
 }
 EXPORT_SYMBOL(device_add_disk);
 
 void device_add_disk_no_queue_reg(struct device *parent, struct gendisk *disk)
 {
-   __device_add_disk(parent, disk, false);
+   __device_add_disk(parent, disk);
+   __device_get_disk(disk);
 }
 EXPORT_SYMBOL(device_add_disk_no_queue_reg);
 
-- 
2.12.3



[PATCH 0/5] genhd: implement device_add_disk_with_groups()

2018-07-25 Thread Hannes Reinecke
When creating a block device, some drivers need to create additional
sysfs groups to store driver-specific information.
With the current workflow these groups are added with a separate sysfs
call after the device has been created, which introduces a race with
udev: the uevent is generated before the sysfs attributes exist, and
udev fails to read the required information.

This patchset adds a new function 'device_add_disk_with_groups()' and converts
the obvious candidates to use this new function.

As usual, comments and reviews are welcome.

Hannes Reinecke (5):
  genhd: drop 'bool' argument from __device_add_disk()
  block: genhd: add device_add_disk_with_groups
  nvme: register ns_id attributes as default sysfs groups
  aoe: use device_add_disk_with_groups()
  zram: use device_add_disk_with_groups()

 block/genhd.c | 38 ++
 drivers/block/aoe/aoe.h   |  1 -
 drivers/block/aoe/aoeblk.c| 21 +++--
 drivers/block/aoe/aoedev.c|  1 -
 drivers/block/zram/zram_drv.c | 28 +++-
 drivers/nvme/host/core.c  | 14 +++---
 drivers/nvme/host/multipath.c | 12 +++-
 drivers/nvme/host/nvme.h  |  2 +-
 include/linux/genhd.h |  3 +++
 9 files changed, 58 insertions(+), 62 deletions(-)

-- 
2.12.3



[PATCH 2/5] block: genhd: add device_add_disk_with_groups

2018-07-25 Thread Hannes Reinecke
Update __device_add_disk() to take a 'groups' argument so that
individual drivers can register a device with additional sysfs
attributes.
This avoids the race condition the driver would otherwise have if
these groups had to be created with sysfs_create_group() after the
uevent has been sent.

Signed-off-by: Martin Wilck 
Signed-off-by: Hannes Reinecke 
---
 block/genhd.c | 28 +++-
 include/linux/genhd.h |  3 +++
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index cfa7f4f78435..fbe27cb2c9d7 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -567,7 +567,8 @@ static int exact_lock(dev_t devt, void *data)
return 0;
 }
 
-static void register_disk(struct device *parent, struct gendisk *disk)
+static void register_disk(struct device *parent, struct gendisk *disk,
+ const struct attribute_group **groups)
 {
struct device *ddev = disk_to_dev(disk);
struct block_device *bdev;
@@ -582,6 +583,10 @@ static void register_disk(struct device *parent, struct gendisk *disk)
/* delay uevents, until we scanned partition table */
dev_set_uevent_suppress(ddev, 1);
 
+   if (groups) {
+   WARN_ON(ddev->groups);
+   ddev->groups = groups;
+   }
if (device_add(ddev))
return;
if (!sysfs_deprecated) {
@@ -647,13 +652,15 @@ static void register_disk(struct device *parent, struct gendisk *disk)
  * __device_add_disk - add disk information to kernel list
  * @parent: parent device for the disk
  * @disk: per-device partitioning information
+ * @groups: Additional per-device sysfs groups
  *
  * This function registers the partitioning information in @disk
  * with the kernel.
  *
  * FIXME: error handling
  */
-static void __device_add_disk(struct device *parent, struct gendisk *disk)
+static void __device_add_disk(struct device *parent, struct gendisk *disk,
+ const struct attribute_group **groups)
 {
dev_t devt;
int retval;
@@ -696,7 +703,7 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk)
blk_register_region(disk_devt(disk), disk->minors, NULL,
exact_match, exact_lock, disk);
}
-   register_disk(parent, disk);
+   register_disk(parent, disk, groups);
 }
 
 void __device_get_disk(struct gendisk *disk)
@@ -711,9 +718,20 @@ void __device_get_disk(struct gendisk *disk)
blk_integrity_add(disk);
 }
 
+void device_add_disk_with_groups(struct device *parent, struct gendisk *disk,
+const struct attribute_group **groups)
+{
+   __device_add_disk(parent, disk, groups);
+
+   blk_register_queue(disk);
+
+   __device_get_disk(disk);
+}
+EXPORT_SYMBOL(device_add_disk_with_groups);
+
 void device_add_disk(struct device *parent, struct gendisk *disk)
 {
-   __device_add_disk(parent, disk);
+   __device_add_disk(parent, disk, NULL);
 
blk_register_queue(disk);
 
@@ -723,7 +741,7 @@ EXPORT_SYMBOL(device_add_disk);
 
 void device_add_disk_no_queue_reg(struct device *parent, struct gendisk *disk)
 {
-   __device_add_disk(parent, disk);
+   __device_add_disk(parent, disk, NULL);
__device_get_disk(disk);
 }
 EXPORT_SYMBOL(device_add_disk_no_queue_reg);
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 6cb8a5789668..f41152979296 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -394,6 +394,9 @@ extern void part_round_stats(struct request_queue *q, int cpu, struct hd_struct
 
 /* block/genhd.c */
 extern void device_add_disk(struct device *parent, struct gendisk *disk);
+extern void device_add_disk_with_groups(struct device *parent,
+   struct gendisk *disk,
+   const struct attribute_group **groups);
 static inline void add_disk(struct gendisk *disk)
 {
device_add_disk(NULL, disk);
-- 
2.12.3



[PATCH 4/5] aoe: use device_add_disk_with_groups()

2018-07-25 Thread Hannes Reinecke
Use device_add_disk_with_groups() to avoid a race condition with
udev during startup.

Signed-off-by: Hannes Reinecke 
---
 drivers/block/aoe/aoe.h|  1 -
 drivers/block/aoe/aoeblk.c | 21 +++--
 drivers/block/aoe/aoedev.c |  1 -
 3 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/drivers/block/aoe/aoe.h b/drivers/block/aoe/aoe.h
index c0ebda1283cc..015c68017a1c 100644
--- a/drivers/block/aoe/aoe.h
+++ b/drivers/block/aoe/aoe.h
@@ -201,7 +201,6 @@ int aoeblk_init(void);
 void aoeblk_exit(void);
 void aoeblk_gdalloc(void *);
 void aoedisk_rm_debugfs(struct aoedev *d);
-void aoedisk_rm_sysfs(struct aoedev *d);
 
 int aoechr_init(void);
 void aoechr_exit(void);
diff --git a/drivers/block/aoe/aoeblk.c b/drivers/block/aoe/aoeblk.c
index 429ebb84b592..b9d30150ce20 100644
--- a/drivers/block/aoe/aoeblk.c
+++ b/drivers/block/aoe/aoeblk.c
@@ -177,10 +177,15 @@ static struct attribute *aoe_attrs[] = {
NULL,
 };
 
-static const struct attribute_group attr_group = {
+static const struct attribute_group aoe_attr_group = {
.attrs = aoe_attrs,
 };
 
+static const struct attribute_group *aoe_attr_groups[] = {
+   &aoe_attr_group,
+   NULL,
+};
+
 static const struct file_operations aoe_debugfs_fops = {
.open = aoe_debugfs_open,
.read = seq_read,
@@ -220,17 +225,6 @@ aoedisk_rm_debugfs(struct aoedev *d)
 }
 
 static int
-aoedisk_add_sysfs(struct aoedev *d)
-{
-   return sysfs_create_group(&disk_to_dev(d->gd)->kobj, &attr_group);
-}
-void
-aoedisk_rm_sysfs(struct aoedev *d)
-{
-   sysfs_remove_group(&disk_to_dev(d->gd)->kobj, &attr_group);
-}
-
-static int
 aoeblk_open(struct block_device *bdev, fmode_t mode)
 {
struct aoedev *d = bdev->bd_disk->private_data;
@@ -417,8 +411,7 @@ aoeblk_gdalloc(void *vp)
 
spin_unlock_irqrestore(&d->lock, flags);
 
-   add_disk(gd);
-   aoedisk_add_sysfs(d);
+   device_add_disk_with_groups(NULL, gd, aoe_attr_groups);
aoedisk_add_debugfs(d);
 
spin_lock_irqsave(&d->lock, flags);
diff --git a/drivers/block/aoe/aoedev.c b/drivers/block/aoe/aoedev.c
index 697f735b07a4..d92fa1fe3580 100644
--- a/drivers/block/aoe/aoedev.c
+++ b/drivers/block/aoe/aoedev.c
@@ -275,7 +275,6 @@ freedev(struct aoedev *d)
del_timer_sync(&d->timer);
if (d->gd) {
aoedisk_rm_debugfs(d);
-   aoedisk_rm_sysfs(d);
del_gendisk(d->gd);
put_disk(d->gd);
blk_cleanup_queue(d->blkq);
-- 
2.12.3



[PATCH 3/5] nvme: register ns_id attributes as default sysfs groups

2018-07-25 Thread Hannes Reinecke
We should be registering the ns_id attributes as a default sysfs
attribute group; otherwise we have a race condition between the
uevent and the attributes appearing in sysfs.

Signed-off-by: Hannes Reinecke 
---
 drivers/nvme/host/core.c  | 14 +++---
 drivers/nvme/host/multipath.c | 12 +++-
 drivers/nvme/host/nvme.h  |  2 +-
 3 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index e77e6418a21c..45dab8d4aff2 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2689,6 +2689,11 @@ const struct attribute_group nvme_ns_id_attr_group = {
.is_visible = nvme_ns_id_attrs_are_visible,
 };
 
+const struct attribute_group *nvme_ns_id_attr_groups[] = {
+   &nvme_ns_id_attr_group,
+   NULL,
+};
+
 #define nvme_show_str_function(field)					\
 static ssize_t  field##_show(struct device *dev,			\
 		struct device_attribute *attr, char *buf)		\
@@ -3056,11 +3061,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
kfree(id);
 
-   device_add_disk(ctrl->device, ns->disk);
-   if (sysfs_create_group(&disk_to_dev(ns->disk)->kobj,
-   &nvme_ns_id_attr_group))
-   pr_warn("%s: failed to create sysfs group for identification\n",
-   ns->disk->disk_name);
+   device_add_disk_with_groups(ctrl->device, ns->disk,
+   nvme_ns_id_attr_groups);
if (ns->ndev && nvme_nvm_register_sysfs(ns))
pr_warn("%s: failed to register lightnvm sysfs group for identification\n",
ns->disk->disk_name);
@@ -3087,8 +3089,6 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 
nvme_fault_inject_fini(ns);
if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
-   sysfs_remove_group(&disk_to_dev(ns->disk)->kobj,
-   &nvme_ns_id_attr_group);
if (ns->ndev)
nvme_nvm_unregister_sysfs(ns);
del_gendisk(ns->disk);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 1ffd3e8b13a1..2f57ff69e26d 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -226,13 +226,9 @@ void nvme_mpath_add_disk(struct nvme_ns_head *head)
return;
 
mutex_lock(&head->subsys->lock);
-   if (!(head->disk->flags & GENHD_FL_UP)) {
-   device_add_disk(&head->subsys->dev, head->disk);
-   if (sysfs_create_group(&disk_to_dev(head->disk)->kobj,
-   &nvme_ns_id_attr_group))
-   pr_warn("%s: failed to create sysfs group for identification\n",
-   head->disk->disk_name);
-   }
+   if (!(head->disk->flags & GENHD_FL_UP))
+   device_add_disk_with_groups(&head->subsys->dev, head->disk,
+   nvme_ns_id_attr_groups);
mutex_unlock(&head->subsys->lock);
 }
 
@@ -240,8 +236,6 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head)
 {
if (!head->disk)
return;
-   sysfs_remove_group(&disk_to_dev(head->disk)->kobj,
-  &nvme_ns_id_attr_group);
del_gendisk(head->disk);
blk_set_queue_dying(head->disk->queue);
/* make sure all pending bios are cleaned up */
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 4ad0c8ad2a27..4b911274150d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -446,7 +446,7 @@ int nvme_delete_ctrl_sync(struct nvme_ctrl *ctrl);
 int nvme_get_log_ext(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
u8 log_page, void *log, size_t size, u64 offset);
 
-extern const struct attribute_group nvme_ns_id_attr_group;
+extern const struct attribute_group *nvme_ns_id_attr_groups[];
 extern const struct block_device_operations nvme_ns_head_ops;
 
 #ifdef CONFIG_NVME_MULTIPATH
-- 
2.12.3



[PATCH 5/5] zram: use device_add_disk_with_groups()

2018-07-25 Thread Hannes Reinecke
Use device_add_disk_with_groups() to avoid a race condition with
udev during startup.

Signed-off-by: Hannes Reinecke 
---
 drivers/block/zram/zram_drv.c | 28 +++-
 1 file changed, 7 insertions(+), 21 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index da51293e7c03..6f55676e2164 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1619,6 +1619,11 @@ static const struct attribute_group zram_disk_attr_group = {
.attrs = zram_disk_attrs,
 };
 
+static const struct attribute_group *zram_disk_attr_groups[] = {
+   &zram_disk_attr_group,
+   NULL,
+};
+
 /*
  * Allocate and initialize new zram device. the function returns
  * '>= 0' device_id upon success, and negative value otherwise.
@@ -1699,24 +1704,14 @@ static int zram_add(void)
 
zram->disk->queue->backing_dev_info->capabilities |=
(BDI_CAP_STABLE_WRITES | BDI_CAP_SYNCHRONOUS_IO);
-   add_disk(zram->disk);
-
-   ret = sysfs_create_group(&disk_to_dev(zram->disk)->kobj,
-   &zram_disk_attr_group);
-   if (ret < 0) {
-   pr_err("Error creating sysfs group for device %d\n",
-   device_id);
-   goto out_free_disk;
-   }
+   device_add_disk_with_groups(NULL, zram->disk, zram_disk_attr_groups);
+
strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor));
 
zram_debugfs_register(zram);
pr_info("Added device: %s\n", zram->disk->disk_name);
return device_id;
 
-out_free_disk:
-   del_gendisk(zram->disk);
-   put_disk(zram->disk);
 out_free_queue:
blk_cleanup_queue(queue);
 out_free_idr:
@@ -1745,15 +1740,6 @@ static int zram_remove(struct zram *zram)
mutex_unlock(&bdev->bd_mutex);
 
zram_debugfs_unregister(zram);
-   /*
-* Remove sysfs first, so no one will perform a disksize
-* store while we destroy the devices. This also helps during
-* hot_remove -- zram_reset_device() is the last holder of
-* ->init_lock, no later/concurrent disksize_store() or any
-* other sysfs handlers are possible.
-*/
-   sysfs_remove_group(&disk_to_dev(zram->disk)->kobj,
-   &zram_disk_attr_group);
 
/* Make sure all the pending I/O are finished */
fsync_bdev(bdev);
-- 
2.12.3



Re: [PATCH 2/2] blktests: add test for ANA state transition

2018-07-23 Thread Hannes Reinecke

On 07/21/2018 11:29 PM, Chaitanya Kulkarni wrote:

From: linux-block-ow...@vger.kernel.org  on behalf 
of Hannes Reinecke 
Sent: Tuesday, July 17, 2018 6:31 AM
To: Omar Sandoval
Cc: Christoph Hellwig; Sagi Grimberg; Keith Busch; 
linux-n...@lists.infradead.org; linux-block@vger.kernel.org; Hannes Reinecke; 
Hannes Reinecke
Subject: [PATCH 2/2] blktests: add test for ANA state transition
   
  
Signed-off-by: Hannes Reinecke 

---
  tests/nvme/014 | 158 +
  tests/nvme/014.out |  17 ++
  2 files changed, 175 insertions(+)
  create mode 100755 tests/nvme/014
  create mode 100644 tests/nvme/014.out

diff --git a/tests/nvme/014 b/tests/nvme/014
new file mode 100755
index 000..4b57229
--- /dev/null
+++ b/tests/nvme/014
@@ -0,0 +1,158 @@
+#!/bin/bash
+#
+# Regression test for ANA base support
+#
+# Copyright (C) 2018 Hannes Reinecke
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.



I don't mind, I just copied it over from testcase 10...


+
+. tests/nvme/rc
+
+DESCRIPTION="test ANA optimized/transitioning/inaccessible support"
+QUICK=1
+
+switch_nvmet_anagroup() {
+   local port1="$1"
+   local port2="$2"
+   local mode="$3"
+
+   echo "ANA state ${mode}"
+
+   if [ "${mode}" = "change" ] ; then
+   _set_nvmet_anagroup_state "${port1}" "1" "change"
+   _set_nvmet_anagroup_state "${port1}" "2" "change"
+   _set_nvmet_anagroup_state "${port2}" "1" "change"
+   _set_nvmet_anagroup_state "${port2}" "2" "change"
+   elif [ "${mode}" = "failover" ] ; then
+   _set_nvmet_anagroup_state "${port1}" "1" "inaccessible"
+   _set_nvmet_anagroup_state "${port1}" "2" "optimized"
+   _set_nvmet_anagroup_state "${port2}" "1" "optimized"
+   _set_nvmet_anagroup_state "${port2}" "2" "inaccessible"
+   else
+   _set_nvmet_anagroup_state "${port1}" "1" "optimized"
+   _set_nvmet_anagroup_state "${port1}" "2" "inaccessible"
+   _set_nvmet_anagroup_state "${port2}" "1" "inaccessible"
+   _set_nvmet_anagroup_state "${port2}" "2" "optimized"
+   fi
+}
+
+_display_ana_state() {
+   local grpid state
[CK] Newline here?


Okay


+   for nvme in /sys/class/nvme/* ; do
+   for c in ${nvme}/nvme* ; do
+   if [ ! -d ${c} ] ; then
+   echo "${nvme##*/}: ANA disabled"
+   continue
+   fi
+   grpid="$(cat "${c}/ana_grpid")"
+   state="$(cat "${c}/ana_state")"
+   echo "${c##*/}: grpid ${grpid} state ${state}"
+   done
+   done
+}

I think we need to move the above functions to ${BLKTESTS_HOME}/tests/nvme/rc.


Okay, can do.


+
+_switch_ana_states() {
+   local port1=$1
+   local port2=$2
+
[CK] Please remove the extra line.
+}
[CK] I was not able to find a caller for the above function, am I missing
something?
+

Yeah, that's an older function which got left over during refactoring.


+requires() {
+   _have_program nvme && _have_module nvme-loop && _have_module loop && \
+   _have_configfs && _have_fio
[CK] Missing nvmet module from the above list.
+}

Can we split the following test function into small routines? It will be easier
to review and maintain.

+
+test() {
+   echo "Running ${TEST_NAME}"
+
+   m

Re: [PATCH 0/2] Fix silent data corruption in blkdev_direct_IO()

2018-07-19 Thread Hannes Reinecke
On 07/19/2018 11:39 AM, Martin Wilck wrote:
> Hello Jens, Ming, Jan, and all others,
> 
> the following patches have been verified by a customer to fix a silent data
> corruption which he has been seeing since "72ecad2 block: support a full bio
> worth of IO for simplified bdev direct-io".
> 
> The patches are based on our observation that the corruption is only
> observed if the __blkdev_direct_IO_simple() code path is executed,
> and if that happens, "short writes" are observed in this code path,
> which causes a fallback to buffered IO, while the application continues
> submitting direct IO requests.
> 
> For the first patch, an alternative solution by Christoph Hellwig
> exists:
> 
>https://marc.info/?l=linux-kernel&m=153013977816825&w=2
> 
> While I believe that Christoph's patch is correct, the one presented
> here is smaller. Ming has suggested to use Christoph's for mainline and
> mine for -stable.
> 
> Wrt the second patch, we've had an internal discussion at SUSE how to handle
> (unlikely) error conditions from bio_iov_iter_get_pages(). The patch presented
> here tries to submit as much IO as possible via the direct path even in the
> error case, while Jan Kara suggested to abort, not submit any IO, and fall 
> back to buffered IO in that case.
> 
> Looking forward to your opinions and suggestions.
> 
Can you please add the test program from Jan Kara to the blktests suite?
This issue is something we really should cover there too, seeing the
amount of time we've spent debugging it...

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 2/2] blkdev: __blkdev_direct_IO_simple: make sure to fill up the bio

2018-07-19 Thread Hannes Reinecke
On 07/19/2018 11:39 AM, Martin Wilck wrote:
> bio_iov_iter_get_pages() returns only pages for a single non-empty
> segment of the input iov_iter's iovec. This may be much less than the number
> of pages __blkdev_direct_IO_simple() is supposed to process. Call
> bio_iov_iter_get_pages() repeatedly until either the requested number
> of bytes is reached, or bio.bi_io_vec is exhausted. If this is not done,
> short writes or reads may occur for direct synchronous IOs with multiple
> iovec slots (such as generated by writev()). In that case,
> __generic_file_write_iter() falls back to buffered writes, which
> has been observed to cause data corruption in certain workloads.
> 
> Note: if segments aren't page-aligned in the input iovec, this patch may
> result in multiple adjacent slots of the bi_io_vec array to reference the same
> page (the byte ranges are guaranteed to be disjunct if the preceding patch is
> applied). We haven't seen problems with that in our and the customer's
> tests. It'd be possible to detect this situation and merge bi_io_vec slots
> that refer to the same page, but I prefer to keep it simple for now.
> 
> Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified 
> bdev direct-io")
> Signed-off-by: Martin Wilck 
> ---
>  fs/block_dev.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 1/2] block: bio_iov_iter_get_pages: fix size of last iovec

2018-07-19 Thread Hannes Reinecke
On 07/19/2018 11:39 AM, Martin Wilck wrote:
> If the last page of the bio is not "full", the length of the last
> vector slot needs to be corrected. This slot has the index
> (bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper
> array, which is shifted by the value of bio->bi_vcnt at function
> invocation, the correct index is (nr_pages - 1).
> 
> V2: improved readability following suggestions from Ming Lei.
> 
> Fixes: 2cefe4dbaadf ("block: add bio_iov_iter_get_pages()")
> Signed-off-by: Martin Wilck 
> ---
>  block/bio.c | 18 --
>  1 file changed, 8 insertions(+), 10 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


[PATCH 2/2] blktests: add test for ANA state transition

2018-07-17 Thread Hannes Reinecke
Signed-off-by: Hannes Reinecke 
---
 tests/nvme/014 | 158 +
 tests/nvme/014.out |  17 ++
 2 files changed, 175 insertions(+)
 create mode 100755 tests/nvme/014
 create mode 100644 tests/nvme/014.out

diff --git a/tests/nvme/014 b/tests/nvme/014
new file mode 100755
index 000..4b57229
--- /dev/null
+++ b/tests/nvme/014
@@ -0,0 +1,158 @@
+#!/bin/bash
+#
+# Regression test for ANA base support
+#
+# Copyright (C) 2018 Hannes Reinecke
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+. tests/nvme/rc
+
+DESCRIPTION="test ANA optimized/transitioning/inaccessible support"
+QUICK=1
+
+switch_nvmet_anagroup() {
+   local port1="$1"
+   local port2="$2"
+   local mode="$3"
+
+   echo "ANA state ${mode}"
+
+   if [ "${mode}" = "change" ] ; then
+   _set_nvmet_anagroup_state "${port1}" "1" "change"
+   _set_nvmet_anagroup_state "${port1}" "2" "change"
+   _set_nvmet_anagroup_state "${port2}" "1" "change"
+   _set_nvmet_anagroup_state "${port2}" "2" "change"
+   elif [ "${mode}" = "failover" ] ; then
+   _set_nvmet_anagroup_state "${port1}" "1" "inaccessible"
+   _set_nvmet_anagroup_state "${port1}" "2" "optimized"
+   _set_nvmet_anagroup_state "${port2}" "1" "optimized"
+   _set_nvmet_anagroup_state "${port2}" "2" "inaccessible"
+   else
+   _set_nvmet_anagroup_state "${port1}" "1" "optimized"
+   _set_nvmet_anagroup_state "${port1}" "2" "inaccessible"
+   _set_nvmet_anagroup_state "${port2}" "1" "inaccessible"
+   _set_nvmet_anagroup_state "${port2}" "2" "optimized"
+   fi
+}
+
+_display_ana_state() {
+   local grpid state
+   for nvme in /sys/class/nvme/* ; do
+   for c in ${nvme}/nvme* ; do
+   if [ ! -d ${c} ] ; then
+   echo "${nvme##*/}: ANA disabled"
+   continue
+   fi
+   grpid="$(cat "${c}/ana_grpid")"
+   state="$(cat "${c}/ana_state")"
+   echo "${c##*/}: grpid ${grpid} state ${state}"
+   done
+   done
+}
+
+_switch_ana_states() {
+   local port1=$1
+   local port2=$2
+
+}
+
+requires() {
+   _have_program nvme && _have_module nvme-loop && _have_module loop && \
+   _have_configfs && _have_fio
+}
+
+test() {
+   echo "Running ${TEST_NAME}"
+
+   modprobe nvmet
+   modprobe nvme-loop
+
+   local port1
+   port1="$(_create_nvmet_port "loop")"
+   ag1="$(_create_nvmet_anagroup "${port1}")"
+
+   local port2
+   port2="$(_create_nvmet_port "loop")"
+   ag2="$(_create_nvmet_anagroup "${port2}")"
+
+   truncate -s 1G "$TMPDIR/img1"
+
+   local loop_dev1
+   loop_dev1="$(losetup -f --show "$TMPDIR/img1")"
+
+   _create_nvmet_subsystem "blktests-subsystem-1" "${loop_dev1}" \
+   "91fdba0d-f87b-4c25-b80f-db7be1418b9e" "1"
+
+   truncate -s 1G "$TMPDIR/img2"
+
+   local loop_dev2
+   loop_dev2="$(losetup -f --show "$TMPDIR/img2")"
+
+   _create_nvmet_ns "blktests-subsystem-1" "2" "${loop_dev2}" \
+   "9aed0138-bfd9-46f5-92ac-24c70377fd49" "2"
+
+   _add_nvmet_subsys_to_port "${port1}" "blktests-subsystem-1"
+   _add_nvmet_subsys_to_port "${port2}" "blktests-subsystem-1"
+
+   switch_nvmet_anagroup "${port1}" "${port2}" failback
+
+   nv

[PATCH 1/2] blktests: enable ANA support

2018-07-17 Thread Hannes Reinecke
Update nvme functions to support ANA.

Signed-off-by: Hannes Reinecke 
---
 tests/nvme/rc | 49 ++---
 1 file changed, 46 insertions(+), 3 deletions(-)

diff --git a/tests/nvme/rc b/tests/nvme/rc
index fb5dbdf..116661d 100644
--- a/tests/nvme/rc
+++ b/tests/nvme/rc
@@ -27,7 +27,7 @@ group_device_requires() {
_test_dev_is_nvme
 }
 
-NVMET_CFS="/sys/kernel/config/nvmet/"
+NVMET_CFS="/sys/kernel/config/nvmet"
 
 _test_dev_is_nvme() {
if ! readlink -f "$TEST_DEV_SYSFS/device" | grep -q nvme; then
@@ -49,6 +49,7 @@ _create_nvmet_port() {
 
mkdir "${NVMET_CFS}/ports/${port}"
echo "${trtype}" > "${NVMET_CFS}/ports/${port}/addr_trtype"
+   echo "${port}" > "${NVMET_CFS}/ports/${port}/addr_traddr"
 
echo "${port}"
 }
@@ -58,6 +59,39 @@ _remove_nvmet_port() {
rmdir "${NVMET_CFS}/ports/${port}"
 }
 
+_create_nvmet_anagroup() {
+   local port="$1"
+   local port_cfs="${NVMET_CFS}/ports/${port}"
+   local grpid
+
+   for ((grpid = 1; ; grpid++)); do
+   if [[ ! -e "${port_cfs}/ana_groups/${grpid}" ]]; then
+   break
+   fi
+   done
+
+   mkdir "${port_cfs}/ana_groups/${grpid}"
+
+   echo "${grpid}"
+}
+
+_remove_nvmet_anagroup() {
+   local port="$1"
+   local grpid="$2"
+   local ana_cfs="${NVMET_CFS}/ports/${port}/ana_groups/${grpid}"
+
+   rmdir "${ana_cfs}"
+}
+
+_set_nvmet_anagroup_state() {
+   local port="$1"
+   local grpid="$2"
+   local state="$3"
+   local ana_cfs="${NVMET_CFS}/ports/${port}/ana_groups/${grpid}"
+
+   echo "${state}" > "${ana_cfs}/ana_state"
+}
+
 _create_nvmet_ns() {
local nvmet_subsystem="$1"
local nsid="$2"
@@ -65,14 +99,22 @@ _create_nvmet_ns() {
local uuid="----"
local subsys_path="${NVMET_CFS}/subsystems/${nvmet_subsystem}"
local ns_path="${subsys_path}/namespaces/${nsid}"
+   local ana_grpid
 
-   if [[ $# -eq 4 ]]; then
+   if [[ $# -ge 4 ]]; then
uuid="$4"
fi
 
+   if [[ $# -eq 5 ]]; then
+   ana_grpid="$5"
+   fi
+
mkdir "${ns_path}"
printf "%s" "${blkdev}" > "${ns_path}/device_path"
printf "%s" "${uuid}" > "${ns_path}/device_uuid"
+   if [ -n "${ana_grpid}" ] ; then
+   printf "%s" "${ana_grpid}" > "${ns_path}/ana_grpid"
+   fi
printf 1 > "${ns_path}/enable"
 }
 
@@ -80,11 +122,12 @@ _create_nvmet_subsystem() {
local nvmet_subsystem="$1"
local blkdev="$2"
local uuid=$3
+   local ana_grpid=$4
local cfs_path="${NVMET_CFS}/subsystems/${nvmet_subsystem}"
 
mkdir -p "${cfs_path}"
echo 1 > "${cfs_path}/attr_allow_any_host"
-   _create_nvmet_ns "${nvmet_subsystem}" "1" "${blkdev}" "${uuid}"
+   _create_nvmet_ns "${nvmet_subsystem}" "1" "${blkdev}" "${uuid}" "${ana_grpid}"
 }
 
 _remove_nvmet_ns() {
-- 
2.12.3



[PATCH 0/2] blktests: test ANA base support

2018-07-17 Thread Hannes Reinecke
Hi all,

here's a small patchset for testing ANA base support.
The test itself requires the ANA patchset from Christoph, plus
the fixes I've sent to the mailing list earlier.

Hannes Reinecke (2):
  blktests: enable ANA support
  blktests: add test for ANA state transition

 tests/nvme/014 | 158 +
 tests/nvme/014.out |  17 ++
 tests/nvme/rc  |  49 -
 3 files changed, 221 insertions(+), 3 deletions(-)
 create mode 100755 tests/nvme/014
 create mode 100644 tests/nvme/014.out

-- 
2.12.3



[PATCH] blktests: Add '--outdir' to store results in a different directory

2018-07-17 Thread Hannes Reinecke
Adding an option '--outdir' to store results in a different
directory, so as not to clutter the git repository itself.

Signed-off-by: Hannes Reinecke 
---
 check | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/check b/check
index a635531..42d07f8 100755
--- a/check
+++ b/check
@@ -334,7 +334,7 @@ _call_test() {
fi
 
trap _cleanup EXIT
-   if ! TMPDIR="$(mktemp --tmpdir -p "$PWD/results" -d "tmpdir.${TEST_NAME//\//.}.XXX")"; then
+   if ! TMPDIR="$(mktemp --tmpdir -p "$RESULTS_DIR" -d "tmpdir.${TEST_NAME//\//.}.XXX")"; then
return
fi
 
@@ -415,7 +415,7 @@ _run_test() {
return 0
fi
 
-   RESULTS_DIR="results/nodev"
+   RESULTS_DIR="${OUT_DIR}/results/nodev"
_call_test test
else
if [[ ${#TEST_DEVS[@]} -eq 0 ]]; then
@@ -434,7 +434,7 @@ _run_test() {
_output_notrun "$TEST_NAME => $(basename "$TEST_DEV")"
continue
fi
-   RESULTS_DIR="results/$(basename "$TEST_DEV")"
+   RESULTS_DIR="${OUT_DIR}/results/$(basename "$TEST_DEV")"
if ! _call_test test_device; then
ret=1
fi
@@ -567,6 +567,7 @@ Test runs:
 tests to run
 
 Miscellaneous:
+  -o, --outdir=OUTDIRwrite results into the specified directory
   -h, --help display this help message and exit"
 
case "$1" in
@@ -581,12 +582,13 @@ Miscellaneous:
esac
 }
 
-if ! TEMP=$(getopt -o 'dq::x:h' --long 'quick::,exclude:,help' -n "$0" -- "$@"); then
+if ! TEMP=$(getopt -o 'dq::o:x:h' --long 'quick::,exclude:,outdir:,help' -n "$0" -- "$@"); then
exit 1
 fi
 
 eval set -- "$TEMP"
 unset TEMP
+OUT_DIR="."
 
 if [[ -r config ]]; then
# shellcheck disable=SC1091
@@ -629,6 +631,10 @@ while true; do
EXCLUDE+=("$2")
shift 2
;;
+   '-o'|'--outdir')
+   OUT_DIR="${2:-${OUT_DIR:-.}}"
+   shift 2
+   ;;
'-h'|'--help')
usage out
;;
-- 
2.12.3



Re: Silent data corruption in blkdev_direct_IO()

2018-07-12 Thread Hannes Reinecke

On 07/12/2018 05:08 PM, Jens Axboe wrote:

On 7/12/18 8:36 AM, Hannes Reinecke wrote:

Hi Jens, Christoph,

we're currently hunting down a silent data corruption occurring due to
commit 72ecad22d9f1 ("block: support a full bio worth of IO for
simplified bdev direct-io").

While the whole thing is still hazy on the details, the one thing we've
found is that reverting that patch fixes the data corruption.

And looking closer, I've found this:

static ssize_t
blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
int nr_pages;

nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
if (!nr_pages)
return 0;
if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
return __blkdev_direct_IO_simple(iocb, iter, nr_pages);

return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
}

When checking the call path
__blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.

So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
It's not that we can handle it in __blkdev_direct_IO() ...


The logic could be cleaned up like below, the sync part is really all
we care about. What is the test case for this? async or sync?

I also don't remember why it's BIO_MAX_PAGES + 1...

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0dd87aaeb39a..14ef3d71b55f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -424,13 +424,13 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
  {
int nr_pages;
  
-	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
+	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
if (!nr_pages)
return 0;
-   if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
+   if (is_sync_kiocb(iocb))
return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
  
-	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
+	return __blkdev_direct_IO(iocb, iter, nr_pages);
  }
  
  static __init int blkdev_init(void)



Hmm. We'll give it a go, but somehow I feel this won't solve our problem.
Another question (which probably shows my complete ignorance):
What happens if the iter holds >= BIO_MAX_PAGES?
'iov_iter_npages' will only ever return BIO_MAX_PAGES, independent on 
whether the iter contains more pages.

What happens to those left-over pages?
Will they ever be processed?

Cheers,

Hannes


Silent data corruption in blkdev_direct_IO()

2018-07-12 Thread Hannes Reinecke
Hi Jens, Christoph,

we're currently hunting down a silent data corruption occurring due to
commit 72ecad22d9f1 ("block: support a full bio worth of IO for
simplified bdev direct-io").

While the whole thing is still hazy on the details, the one thing we've
found is that reverting that patch fixes the data corruption.

And looking closer, I've found this:

static ssize_t
blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
int nr_pages;

nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
if (!nr_pages)
return 0;
if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
return __blkdev_direct_IO_simple(iocb, iter, nr_pages);

return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
}

When checking the call path
__blkdev_direct_IO()->bio_alloc_bioset()->bvec_alloc()
I found that bvec_alloc() will fail if nr_pages > BIO_MAX_PAGES.

So why is there the check for 'nr_pages <= BIO_MAX_PAGES' ?
It's not that we can handle it in __blkdev_direct_IO() ...

Thanks for any clarification.

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH blktests v2 3/3] Add tests for the SRP initiator and target drivers

2018-07-09 Thread Hannes Reinecke
vdd: device node name blacklisted
> sda: udev property DEVLINKS whitelisted
> sda: mask = 0x1f
> sda: dev_t = 8:0
> sda: size = 65536
> sda: vendor = Linux
> sda: product = scsi_debug
> sda: rev = 0188
> sda: h:b:t:l = 0:0:0:0
> sda: tgt_node_name =
> sda: path state = running
> sda: 1024 cyl, 2 heads, 32 sectors/track, start at 0
> sda: serial = 2000
> sda: get_state
> sda: detect_checker = yes (setting: multipath internal)
> failed to issue vpd inquiry for pgc9
> sda: path_checker = tur (setting: multipath internal)
> sda: checker timeout = 30 s (setting: kernel sysfs)
> sda: tur state = up
> sda: uid_attribute = ID_SERIAL (setting: multipath internal)
> sda: uid = 007d0 (udev)
> sda: detect_prio = yes (setting: multipath internal)
> sda: prio = const (setting: multipath internal)
> sda: prio args = "" (setting: multipath internal)
> sda: const prio = 1
> ram0: udev property DEVNAME whitelisted
> ram0: device node name blacklisted
> ram1: udev property DEVNAME whitelisted
> ram1: device node name blacklisted
> ram2: udev property DEVNAME whitelisted
> ram2: device node name blacklisted
> sdb: udev property DEVLINKS whitelisted
> sdd: udev property DEVLINKS whitelisted
> sdc: udev property DEVLINKS whitelisted
> nvme0n1: udev property DEVLINKS whitelisted
> sda: udev property DEVLINKS whitelisted
> sda: (Linux:scsi_debug) vendor/product blacklisted
> tur checker refcount 4
> const prioritizer refcount 2
> libdevmapper version 1.02.146 (2017-12-18)
> DM multipath kernel driver v1.13.0
> sdb: udev property DEVLINKS whitelisted
> wwid 3600140572616d6469736b310 not in wwids file, skipping sdb
> sdb: orphan path, only one path
> sysfs prioritizer refcount 3
> tur checker refcount 3
> sdd: udev property DEVLINKS whitelisted
> wwid 3600140572616d6469736b320 not in wwids file, skipping sdd
> sdd: orphan path, only one path
> sysfs prioritizer refcount 2
> tur checker refcount 2
> sdc: udev property DEVLINKS whitelisted
> wwid 3600140573637369646267000 not in wwids file, skipping sdc
> sdc: orphan path, only one path
> sysfs prioritizer refcount 1
> tur checker refcount 1
> nvme0n1: udev property DEVLINKS whitelisted
> wwid nvme.8086-3137-51454d55204e564d65204374726c-0001 not in
> wwids file, skipping nvme0n1
> nvme0n1: orphan path, only one path
> const prioritizer refcount 1
> none checker refcount 1
> exit (signal)
> ^CReleasing uevent_monitor() resources
> Releasing uevent_listen() resources
> unloading none checker
> unloading tur checker
> unloading sysfs prioritizer
> unloading const prioritizer
> unlink pidfile
> shut down---
> 
Weelll ... that's to be expected.
You are running with RH-style defaults, which won't be starting multipath if
a) only one path is set
and
b) if it's not registered in the wwids file

You can switch off this behaviour with the option

find_multipaths no

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH blktests v2 3/3] Add tests for the SRP initiator and target drivers

2018-07-03 Thread Hannes Reinecke
On 07/03/2018 11:39 PM, Omar Sandoval wrote:
> On Tue, Jul 03, 2018 at 12:50:10PM -0700, Omar Sandoval wrote:
>> On Tue, Jul 03, 2018 at 12:49:13PM -0700, Omar Sandoval wrote:
>>> On Fri, Jun 29, 2018 at 09:13:53AM -0700, Bart Van Assche wrote:
>>>> On 06/28/18 16:43, Omar Sandoval wrote:
>>>>> On Wed, Jun 27, 2018 at 02:49:08PM -0700, Bart Van Assche wrote:
>>>>> [ ... ]
>>>>> srp/002 (File I/O on top of multipath concurrently with logout and login 
>>>>> (mq)) [failed]
>>>>> runtime  6.680s  ...  6.640s
>>>>>  --- tests/srp/002.out   2018-06-28 15:18:36.537169282 -0700
>>>>>  +++ results/nodev/srp/002.out.bad   2018-06-28 16:21:59.930603931 
>>>>> -0700
>>>>>  @@ -3,5 +3,4 @@
>>>>>   Unloaded the rdma_rxe kernel module
>>>>>   Configured SRP target driver
>>>>>   Unloaded the ib_srp kernel module
>>>>>  -Unloaded the ib_srpt kernel module
>>>>>  -Unloaded the rdma_rxe kernel module
>>>>>  +/dev/disk/by-id/dm-uuid-mpath-3600140572616d6469736b310: 
>>>>> not found
>>>>>
>>>>> And everything else fails like srp/002.
>>>>
>>>> I think that indicates a problem with the multipathing software on your
>>>> setup. Was multipathd running? Had the proper multipath configuration
>>>> (/etc/multipath.conf) been provided? Was multipathing enabled in the kernel
>>>> config?
>>>
>>> Ah, the README says /etc/multipathd.conf. Fixing it to multipath.conf
>>> didn't work, either, though.
>>
>> Hit send too early. multipathd is running, but I did find this in my
>> dmesg:
>>
>> [ 1844.347561] multipath[6558]: segfault at 100 ip 7fbb7cda51e6 sp 
>> 7fffe9241798 error 4 in libc-2.27.so[7fbb7cd0e000+1b3000]
>>
>> I'll look into that.
> 
> Alright, I installed multipath-tools from source and the segfaults are
> gone, but I still don't get these symlinks. Instead, they show up as
> 
> /dev/disk/by-id/scsi-3600140572616d6469736b320
> 
> Any ideas?
> 
Yes.

Symlink generation strongly depends on the OS you're running. With a
recent distro you should be getting a lot of symlinks with the WWN of
the disk in it, like

/dev/disk/by-id/dm-uuid-mpath-<WWN>
/dev/disk/by-id/dm-name-<WWN>
/dev/disk/by-id/scsi-<WWN>
/dev/disk/by-id/wwn-<WWN>

originally we just specified /dev/disk/by-id/scsi-<WWN> to be stable,
and that's also the one you're most likely to find.

However, you might get more than _one_ /dev/disk/by-id/scsi-<WWN>, as
these merely reflect the contents of the VPD page 83. So if that
provides more than one identifier per disk you'll get more than one link.

Rule of thumb: Use the WWN starting with the highest number; that is
having the best chance of being actually persistent.

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH] brd: Allow ramdisk to be allocated on selected NUMA node

2018-06-15 Thread Hannes Reinecke

On 06/15/2018 04:07 PM, Bart Van Assche wrote:

On Thu, 2018-06-14 at 15:38 +0200, Hannes Reinecke wrote:

For performance reasons we should be able to allocate all memory
from a given NUMA node, so this patch adds a new parameter
'rd_numa_node' to allow the user to specify the NUMA node id.
When restricting fio to use the same NUMA node I'm seeing a performance
boost of more than 200%.


Passing this information through a kernel module parameter to the brd kernel
module seems wrong to me. There can be multiple brd instances. Using a kernel
module parameter makes it impossible to specify a different NUMA node for
different brd instances.

This patch was primarily done for simplicity; all the existing brd 
parameters affect _all_ ramdisks, so this patch keeps that style.


If you want something more fine-grained you could use the approach 
suggested by Mel Gorman and use 'numactl' to pre-fill the ramdisk via 'dd'.


Cheers,

Hannes




Re: [PATCH] brd: Allow ramdisk to be allocated on selected NUMA node

2018-06-15 Thread Hannes Reinecke

On 06/14/2018 10:53 PM, Adam Manzanares wrote:
[ .. ]


Then how about a numactl ... dd /dev/ram ... after the modprobe.


Yes of course, or you could do that for every application that ends
up in the path of the doing IO to it. The point of the option is to
just make it explicit, and not have to either NUMA pin each task,
or prefill all possible pages.



Makes sense, I have done some similar benchmarking and had to worry
about NUMA awareness and the numactl + dd approach seemed to work
because I wanted to not take a performance hit for page allocation
during the benchmarking.

Would anyone be interested in forcing the allocations to occur during
module initialization?


YES.

Cheers,

Hannes



Re: [PATCH] brd: Allow ramdisk to be allocated on selected NUMA node

2018-06-14 Thread Hannes Reinecke
On Thu, 14 Jun 2018 09:33:35 -0600
Jens Axboe  wrote:

> On 6/14/18 9:29 AM, Hannes Reinecke wrote:
> > On Thu, 14 Jun 2018 08:47:33 -0600
> > Jens Axboe  wrote:
> >   
> >> On 6/14/18 7:38 AM, Hannes Reinecke wrote:  
> >>> For performance reasons we should be able to allocate all memory
> >>> from a given NUMA node, so this patch adds a new parameter
> >>> 'rd_numa_node' to allow the user to specify the NUMA node id.
> >>> When restricting fio to use the same NUMA node I'm seeing a
> >>> performance boost of more than 200%.
> >>
> >> Looks fine to me. One comment.
> >>  
> >>> @@ -342,6 +343,10 @@ static int max_part = 1;
> >>>  module_param(max_part, int, 0444);
> >>>  MODULE_PARM_DESC(max_part, "Num Minors to reserve between
> >>> devices"); 
> >>> +static int rd_numa_node = NUMA_NO_NODE;
> >>> +module_param(rd_numa_node, int, 0444);
> >>> +MODULE_PARM_DESC(rd_numa_node, "NUMA node number to allocate RAM
> >>> disk on.");
> >>
> >> This could feasibly be 0644, as there would be nothing wrong with
> >> altering this at runtime.
> >>  
> > 
> > While we could it would not change the allocation of _existing_ ram
> > devices, making behaviour rather unpredictable.
> > Hence I did decide against it (and yes, I actually thought about
> > it).
> > 
> > But if you insist ...  
> 
> Right, it would just change new allocations. Probably not a common use
> case, but there's really nothing that prevents it from being feasible.
> 
> Next question - what does the memory allocator do if we run out of
> memory on the given node? Should we punt to a different node if that
> happens? Slower, but functional, seems preferable to not being able
> to get memory.
> 

Hmm. That I haven't considered; yes, that really sounds like an idea.
Will be sending an updated patch.

Cheers,

Hannes


Re: [PATCH] brd: Allow ramdisk to be allocated on selected NUMA node

2018-06-14 Thread Hannes Reinecke
On Thu, 14 Jun 2018 08:47:33 -0600
Jens Axboe  wrote:

> On 6/14/18 7:38 AM, Hannes Reinecke wrote:
> > For performance reasons we should be able to allocate all memory
> > from a given NUMA node, so this patch adds a new parameter
> > 'rd_numa_node' to allow the user to specify the NUMA node id.
> > When restricting fio to use the same NUMA node I'm seeing a
> > performance boost of more than 200%.  
> 
> Looks fine to me. One comment.
> 
> > @@ -342,6 +343,10 @@ static int max_part = 1;
> >  module_param(max_part, int, 0444);
> >  MODULE_PARM_DESC(max_part, "Num Minors to reserve between
> > devices"); 
> > +static int rd_numa_node = NUMA_NO_NODE;
> > +module_param(rd_numa_node, int, 0444);
> > +MODULE_PARM_DESC(rd_numa_node, "NUMA node number to allocate RAM
> > disk on.");  
> 
> This could feasibly be 0644, as there would be nothing wrong with
> altering this at runtime.
> 

While we could it would not change the allocation of _existing_ ram
devices, making behaviour rather unpredictable.
Hence I did decide against it (and yes, I actually thought about it).

But if you insist ...

Cheers,

Hannes


[PATCH] brd: Allow ramdisk to be allocated on selected NUMA node

2018-06-14 Thread Hannes Reinecke
For performance reasons we should be able to allocate all memory
from a given NUMA node, so this patch adds a new parameter
'rd_numa_node' to allow the user to specify the NUMA node id.
When restricting fio to use the same NUMA node I'm seeing a performance
boost of more than 200%.

Signed-off-by: Hannes Reinecke 
---
 drivers/block/brd.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index bb976598ee43..7142d836539e 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -36,6 +36,7 @@
  */
 struct brd_device {
int brd_number;
+   int brd_numa_node;
 
	struct request_queue	*brd_queue;
	struct gendisk		*brd_disk;
@@ -103,7 +104,7 @@ static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
 * restriction might be able to be lifted.
 */
gfp_flags = GFP_NOIO | __GFP_ZERO;
-   page = alloc_page(gfp_flags);
+   page = alloc_pages_node(brd->brd_numa_node, gfp_flags, 0);
if (!page)
return NULL;
 
@@ -342,6 +343,10 @@ static int max_part = 1;
 module_param(max_part, int, 0444);
 MODULE_PARM_DESC(max_part, "Num Minors to reserve between devices");
 
+static int rd_numa_node = NUMA_NO_NODE;
+module_param(rd_numa_node, int, 0444);
+MODULE_PARM_DESC(rd_numa_node, "NUMA node number to allocate RAM disk on.");
+
 MODULE_LICENSE("GPL");
 MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR);
 MODULE_ALIAS("rd");
@@ -363,7 +368,7 @@ __setup("ramdisk_size=", ramdisk_size);
 static LIST_HEAD(brd_devices);
 static DEFINE_MUTEX(brd_devices_mutex);
 
-static struct brd_device *brd_alloc(int i)
+static struct brd_device *brd_alloc(int i, int node)
 {
struct brd_device *brd;
struct gendisk *disk;
@@ -372,10 +377,11 @@ static struct brd_device *brd_alloc(int i)
if (!brd)
goto out;
brd->brd_number = i;
+   brd->brd_numa_node = node;
	spin_lock_init(&brd->brd_lock);
	INIT_RADIX_TREE(&brd->brd_pages, GFP_ATOMIC);
 
-   brd->brd_queue = blk_alloc_queue(GFP_KERNEL);
+   brd->brd_queue = blk_alloc_queue_node(GFP_KERNEL, node, NULL);
if (!brd->brd_queue)
goto out_free_dev;
 
@@ -434,7 +440,7 @@ static struct brd_device *brd_init_one(int i, bool *new)
goto out;
}
 
-   brd = brd_alloc(i);
+   brd = brd_alloc(i, rd_numa_node);
if (brd) {
add_disk(brd->brd_disk);
		list_add_tail(&brd->brd_list, &brd_devices);
@@ -495,7 +501,7 @@ static int __init brd_init(void)
max_part = 1;
 
for (i = 0; i < rd_nr; i++) {
-   brd = brd_alloc(i);
+   brd = brd_alloc(i, rd_numa_node);
if (!brd)
goto out_free;
		list_add_tail(&brd->brd_list, &brd_devices);
-- 
2.12.3



Re: [PATCH] blktrace: display failfast and driver-specific flags

2018-06-07 Thread Hannes Reinecke
On Thu, 7 Jun 2018 13:24:23 +0200
Christoph Hellwig  wrote:

> Good idea to display the flags, but the characters aren't really
> very self-describing.  Should we prefix each with an f maybe?

Honestly, I don't care.
Just wanted to stick with the one-letter acronym like all the others
did.
But in the end having the failfast ones prefixed with an 'f' sounds
reasonable.

Cheers,

Hannes



[PATCH] blktrace: display failfast and driver-specific flags

2018-06-07 Thread Hannes Reinecke
Signed-off-by: Hannes Reinecke 
---
 kernel/trace/blktrace.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 987d9a9ae283..238e16211a5c 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -1922,6 +1922,14 @@ void blk_fill_rwbs(char *rwbs, unsigned int op, int bytes)
rwbs[i++] = 'S';
if (op & REQ_META)
rwbs[i++] = 'M';
+   if (op & REQ_FAILFAST_DEV)
+   rwbs[i++] = 'd';
+   if (op & REQ_FAILFAST_TRANSPORT)
+   rwbs[i++] = 't';
+   if (op & REQ_FAILFAST_DRIVER)
+   rwbs[i++] = 'v';
+   if (op & REQ_DRV)
+   rwbs[i++] = 'V';
 
rwbs[i] = '\0';
 }
-- 
2.12.3
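For reference, a small userspace helper that expands an rwbs character, as produced by the patched blk_fill_rwbs() above, into a readable flag name. The names and the rwbs_name() helper are made up for this sketch; the new 'd'/'t'/'v'/'V' characters follow the patch, the rest follow the existing blk_fill_rwbs() mapping (only a subset is shown).

```c
#include <assert.h>
#include <string.h>

/* Map an rwbs character to a short description of the flag it encodes. */
static const char *rwbs_name(char c)
{
	switch (c) {
	case 'R': return "read";
	case 'W': return "write";
	case 'D': return "discard";
	case 'F': return "flush/fua";
	case 'A': return "readahead";
	case 'S': return "sync";
	case 'M': return "meta";
	case 'd': return "failfast-dev";		/* REQ_FAILFAST_DEV */
	case 't': return "failfast-transport";		/* REQ_FAILFAST_TRANSPORT */
	case 'v': return "failfast-driver";		/* REQ_FAILFAST_DRIVER */
	case 'V': return "driver-private";		/* REQ_DRV */
	default:  return "?";
	}
}
```

So an rwbs string such as "WSt" would decode as write, sync, failfast-transport.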



[PATCHv2] block: always set partition number to '0' in blk_partition_remap()

2018-06-07 Thread Hannes Reinecke
blk_partition_remap() will only clear bi_partno if an actual remapping
has happened. But flush requests et al. don't have an actual size, so
the remapping doesn't happen and bi_partno is never cleared.
So for stacked devices blk_partition_remap() will be called on each level.
If (as is the case for native nvme multipathing) one of the lower-level
devices does _not_ support partitioning, a spurious I/O error is generated.

Signed-off-by: Hannes Reinecke 
---
 block/blk-core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3f56be15f17e..cf0ee764b908 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2220,10 +2220,10 @@ static inline int blk_partition_remap(struct bio *bio)
if (bio_check_eod(bio, part_nr_sects_read(p)))
goto out;
bio->bi_iter.bi_sector += p->start_sect;
-   bio->bi_partno = 0;
trace_block_bio_remap(bio->bi_disk->queue, bio, part_devt(p),
  bio->bi_iter.bi_sector - p->start_sect);
}
+   bio->bi_partno = 0;
ret = 0;
 out:
rcu_read_unlock();
-- 
2.12.3
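To see why moving the assignment matters, here is a toy model of the before/after behaviour (the struct and function names are illustrative only, not the real struct bio or block-layer code): a flush bio carries no data, so the old code never reaches the branch that clears the partition number.

```c
#include <assert.h>
#include <string.h>

/* Toy stand-in for the bits of struct bio that matter here. */
struct toy_bio {
	unsigned int sectors;	/* data size; 0 for a flush */
	int partno;		/* stand-in for bi_partno */
};

/* Old behaviour: partno is cleared only when a remap happens. */
static void remap_old(struct toy_bio *bio)
{
	if (bio->partno && bio->sectors) {
		/* ... adjust bi_sector by the partition offset ... */
		bio->partno = 0;
	}
}

/* Patched behaviour: partno is always cleared, remap or not. */
static void remap_fixed(struct toy_bio *bio)
{
	if (bio->partno && bio->sectors) {
		/* ... adjust bi_sector by the partition offset ... */
	}
	bio->partno = 0;
}
```

With remap_old(), a flush bio (sectors == 0) keeps its partition number and every stacked level re-runs the remap; with remap_fixed() the number is cleared on the first pass.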



Re: [PATCH] block: always set partition number to '0' in blk_partition_remap()

2018-06-06 Thread Hannes Reinecke
On Wed, 6 Jun 2018 08:26:56 -0600
Jens Axboe  wrote:

> On 6/6/18 8:22 AM, Hannes Reinecke wrote:
> > blk_partition_remap() will only clear bi_partno if an actual
> > remapping has happened. But flush requests et al. don't have an
> > actual size, so the remapping doesn't happen and bi_partno is never
> > cleared. So for stacked devices blk_partition_remap() will be
> > called on each level. If (as is the case for native nvme
> > multipathing) one of the lower-level devices does _not_ support
> > partitioning, a spurious I/O error is generated.  
> 
> Just move it down, we're now clearing it for both cases.
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3f56be15f17e..cf0ee764b908 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2220,10 +2220,10 @@ static inline int blk_partition_remap(struct bio *bio)
> 	if (bio_check_eod(bio, part_nr_sects_read(p)))
>   goto out;
>   bio->bi_iter.bi_sector += p->start_sect;
> - bio->bi_partno = 0;
> 	trace_block_bio_remap(bio->bi_disk->queue, bio, part_devt(p),
> 		bio->bi_iter.bi_sector - p->start_sect);
>   }
> + bio->bi_partno = 0;
>   ret = 0;
>  out:
>   rcu_read_unlock();
> 

Okay, will be resending.

Cheers,

Hannes



[PATCH] block: always set partition number to '0' in blk_partition_remap()

2018-06-06 Thread Hannes Reinecke
blk_partition_remap() will only clear bi_partno if an actual remapping
has happened. But flush requests et al. don't have an actual size, so
the remapping doesn't happen and bi_partno is never cleared.
So for stacked devices blk_partition_remap() will be called on each level.
If (as is the case for native nvme multipathing) one of the lower-level
devices does _not_ support partitioning, a spurious I/O error is generated.

Signed-off-by: Hannes Reinecke 
---
 block/blk-core.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index cee03cad99f2..8a2c3a474234 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2225,7 +2225,10 @@ static inline int blk_partition_remap(struct bio *bio)
bio->bi_partno = 0;
trace_block_bio_remap(bio->bi_disk->queue, bio, part_devt(p),
  bio->bi_iter.bi_sector - p->start_sect);
-   }
+   } else
+   /* Set partition number to '0' to avoid repetitive calls */
+   bio->bi_partno = 0;
+
ret = 0;
 out:
rcu_read_unlock();
-- 
2.12.3



[PATCH] block: pass failfast and driver-specific flags to flush requests

2018-06-06 Thread Hannes Reinecke
If flush requests are being sent to the device we need to inherit the
failfast and driver-specific flags, too, otherwise I/O will fail.

Signed-off-by: Hannes Reinecke 
---
 block/blk-flush.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index f17170675917..058abdb50f31 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -94,7 +94,7 @@ enum {
 };
 
 static bool blk_kick_flush(struct request_queue *q,
-  struct blk_flush_queue *fq);
+  struct blk_flush_queue *fq, unsigned int flags);
 
 static unsigned int blk_flush_policy(unsigned long fflags, struct request *rq)
 {
@@ -212,7 +212,7 @@ static bool blk_flush_complete_seq(struct request *rq,
BUG();
}
 
-   kicked = blk_kick_flush(q, fq);
+   kicked = blk_kick_flush(q, fq, rq->cmd_flags);
return kicked | queued;
 }
 
@@ -281,6 +281,7 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
  * blk_kick_flush - consider issuing flush request
  * @q: request_queue being kicked
  * @fq: flush queue
+ * @flags: cmd_flags of the original request
  *
  * Flush related states of @q have changed, consider issuing flush request.
  * Please read the comment at the top of this file for more info.
@@ -291,7 +292,8 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
  * RETURNS:
  * %true if flush was issued, %false otherwise.
  */
-static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq)
+static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
+  unsigned int flags)
 {
	struct list_head *pending = &fq->flush_queue[fq->flush_pending_idx];
struct request *first_rq =
@@ -346,6 +348,7 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq)
}
 
flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
+   flush_rq->cmd_flags |= (flags & REQ_DRV) | (flags & REQ_FAILFAST_MASK);
flush_rq->rq_flags |= RQF_FLUSH_SEQ;
flush_rq->rq_disk = first_rq->rq_disk;
flush_rq->end_io = flush_end_io;
-- 
2.12.3
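As a side note, the two ANDs added in the hunk above can be folded into a single combined mask, since `(flags & A) | (flags & B)` equals `flags & (A | B)`. A quick userspace check of that equivalence (the flag values below are stand-ins for the sketch, not the kernel's actual REQ_* bit layout):

```c
#include <assert.h>
#include <string.h>

/* Stand-in flag bits -- illustrative values, not the kernel's. */
#define REQ_FAILFAST_DEV	(1u << 8)
#define REQ_FAILFAST_TRANSPORT	(1u << 9)
#define REQ_FAILFAST_DRIVER	(1u << 10)
#define REQ_FAILFAST_MASK \
	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
#define REQ_SYNC		(1u << 11)
#define REQ_DRV			(1u << 29)

/*
 * Flags the flush request inherits from the originating request:
 * everything failfast-related plus the driver-private bit, and
 * nothing else.
 */
static unsigned int flush_inherit_flags(unsigned int cmd_flags)
{
	return cmd_flags & (REQ_DRV | REQ_FAILFAST_MASK);
}
```

Any unrelated bits (REQ_SYNC in this sketch) are masked out, so the flush request only picks up the failfast and driver-specific state.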



Re: [PATCH 12/12] nvme: limit warnings from nvme_identify_ns

2018-06-04 Thread Hannes Reinecke
On Wed, 30 May 2018 18:46:00 +0200
Christoph Hellwig  wrote:

> When rescanning namespaces after an AEN we will issue Identify
> Namespace commands to namespaces that have gone away, so don't warn
> for this specific case.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Johannes Thumshirn 
> ---
>  drivers/nvme/host/core.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


