Re: [PATCH v2] blk-mq-debugfs: Add names for recently added flags

2017-08-24 Thread Omar Sandoval
On Thu, Aug 24, 2017 at 11:30:46PM +, Bart Van Assche wrote:
> On Fri, 2017-08-18 at 15:52 -0700, Bart Van Assche wrote:
> > The symbolic constants QUEUE_FLAG_SCSI_PASSTHROUGH, QUEUE_FLAG_QUIESCED
> > and REQ_NOWAIT are missing from blk-mq-debugfs.c. Add these to
> > blk-mq-debugfs.c such that these appear as names in debugfs instead of
> > as numbers.
> 
> Hi Jens,
> 
> Have you already had the time to have a look at this patch?
> 
> Thanks,
> 
> Bart.

Reviewed-by: Omar Sandoval 

FWIW


Re: [PATCH v2] blk-mq-debugfs: Add names for recently added flags

2017-08-24 Thread Bart Van Assche
On Fri, 2017-08-18 at 15:52 -0700, Bart Van Assche wrote:
> The symbolic constants QUEUE_FLAG_SCSI_PASSTHROUGH, QUEUE_FLAG_QUIESCED
> and REQ_NOWAIT are missing from blk-mq-debugfs.c. Add these to
> blk-mq-debugfs.c such that these appear as names in debugfs instead of
> as numbers.

Hi Jens,

Have you already had the time to have a look at this patch?

Thanks,

Bart.

BFQ + dm-mpath

2017-08-24 Thread Bart Van Assche
Hello Paolo,

Has BFQ ever been tested in combination with dm-mpath? A few seconds
after I had started a run of the srp-test software with BFQ, a kernel
oops appeared on the console. The command I ran is:

$ while true; do ~bart/software/infiniband/srp-test/run_tests -d -r 30 -e bfq; done

And this is what appeared on the console:

[89251.977201] BUG: unable to handle kernel NULL pointer dereference at 0018
[89251.987831] IP: bfq_setup_cooperator+0x16/0x2e0 [bfq]
[89251.995241] PGD 0
[89251.995243] P4D 0
[89251.999194]
[89252.006423] Oops:  [#1] PREEMPT SMP
[89252.012354] Modules linked in: ib_srp libcrc32c scsi_transport_srp target_core_file ib_srpt target_core_iblock target_core_mod scsi_debug brd bfq dm_service_time rdma_cm iw_cm bridge stp llc ip6table_filter ip6_tables iptable_filter intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm af_packet irqbypass ipmi_ssif dcdbas crct10dif_pclmul ioatdma crc32_pclmul mei_me joydev ghash_clmulni_intel mei aesni_intel sg shpchp ipmi_si aes_x86_64 ipmi_devintf crypto_simd dca cryptd glue_helper lpc_ich ipmi_msghandler acpi_power_meter acpi_pad wmi mfd_core ib_ipoib ib_cm ib_uverbs ib_umad mlx4_ib ib_core dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua netconsole configfs ip_tables x_tables autofs4 ext4 mbcache jbd2 mlx4_en hid_generic usbhid mgag200 i2c_algo_bit drm_kms_helper
[89252.103941]  syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ehci_pci mlx4_core tg3 ehci_hcd crc32c_intel devlink drm ptp usbcore usb_common pps_core megaraid_sas libphy [last unloaded: scsi_transport_srp]
[89252.128375] CPU: 13 PID: 352 Comm: kworker/13:1H Tainted: GW 4.13.0-rc6-dbg+ #2
[89252.139884] Hardware name: Dell Inc. PowerEdge R720/0VWT90, BIOS 1.3.6 09/11/2012
[89252.150243] Workqueue: kblockd blk_mq_run_work_fn
[89252.157449] task: 880911d80040 task.stack: c90007c64000
[89252.166038] RIP: 0010:bfq_setup_cooperator+0x16/0x2e0 [bfq]
[89252.174222] RSP: 0018:c90007c67b20 EFLAGS: 00010082
[89252.182026] RAX: ffe0 RBX: 8807b9988700 RCX: 0001
[89252.191990] RDX: 8807b9988700 RSI:  RDI: 880288554a68
[89252.201956] RBP: c90007c67b68 R08: 825820c0 R09:
[89252.211899] R10: c90007c67ae8 R11: a05a5ad5 R12: 880288554dc8
[89252.221821] R13: 880288554a68 R14: 880928078bd0 R15: 8807bbe03528
[89252.231715] FS:  () GS:88093f78() knlGS:
[89252.242667] CS:  0010 DS:  ES:  CR0: 80050033
[89252.250958] CR2: 0018 CR3:

Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems

2017-08-24 Thread Bart Van Assche
On Thu, 2017-08-24 at 10:59 +0200, h...@lst.de wrote:
> On Wed, Aug 23, 2017 at 06:21:55PM +, Bart Van Assche wrote:
> > Since generic_make_request_fast() returns BLK_STS_AGAIN for a dying path:
> > can the same kind of soft lockups occur with the NVMe multipathing code as
> > with the current upstream device mapper multipathing code? See e.g.
> > "[PATCH 3/7] dm-mpath: Do not lock up a CPU with requeuing activity"
> > (https://www.redhat.com/archives/dm-devel/2017-August/msg00124.html).
> 
> I suspect the code is not going to hit it because we check the controller
> state before trying to queue I/O on the lower queue.  But if you point
> me to a good reproducer test case I'd like to check.

For NVMe over RDMA, how about the simulate_network_failure_loop() function in
https://github.com/bvanassche/srp-test/blob/master/lib/functions? It simulates
a network failure by writing into the reset_controller sysfs attribute.

> Also does the "single queue" case in your mail refer to the old
> request code?  nvme only uses blk-mq so it would not hit that.
> But either way I think get_request should be fixed to return
> BLK_STS_IOERR if the queue is dying instead of BLK_STS_AGAIN.

The description in the patch I referred to indeed refers to the old request
code in the block layer. When I prepared that patch I had analyzed the
behavior of the old request code only.

Bart.
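
For reference, a minimal sketch of the behaviour Christoph suggests above could
look like the fragment below. This is illustrative only and is not the actual
block-layer code (the exact function and error plumbing differ), but
blk_queue_dying() and the BLK_STS_* codes are existing upstream names:

	/* Sketch: when allocating a request for a dying queue, report a hard
	 * error instead of asking the caller to retry. */
	if (blk_queue_dying(q))
		return BLK_STS_IOERR;	/* rather than BLK_STS_AGAIN */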

[PATCH V2 2/2] block/loop: allow request merge for directio mode

2017-08-24 Thread Shaohua Li
From: Shaohua Li 

Currently loop disables merging. While that makes sense for buffered IO mode,
directio mode can benefit from request merging. Without merging, loop could
send small I/Os to the underlying disk and harm performance.

Reviewed-by: Omar Sandoval 
Signed-off-by: Shaohua Li 
---
 drivers/block/loop.c | 66 
 drivers/block/loop.h |  1 +
 2 files changed, 52 insertions(+), 15 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 428da07..75a8f6e 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -213,10 +213,13 @@ static void __loop_update_dio(struct loop_device *lo, 
bool dio)
 */
blk_mq_freeze_queue(lo->lo_queue);
lo->use_dio = use_dio;
-   if (use_dio)
+   if (use_dio) {
+   queue_flag_clear_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
lo->lo_flags |= LO_FLAGS_DIRECT_IO;
-   else
+   } else {
+   queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
+   }
blk_mq_unfreeze_queue(lo->lo_queue);
 }
 
@@ -464,6 +467,8 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long 
ret, long ret2)
 {
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
 
+   kfree(cmd->bvec);
+   cmd->bvec = NULL;
cmd->ret = ret;
blk_mq_complete_request(cmd->rq);
 }
@@ -473,22 +478,50 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
 {
struct iov_iter iter;
struct bio_vec *bvec;
-   struct bio *bio = cmd->rq->bio;
+   struct request *rq = cmd->rq;
+   struct bio *bio = rq->bio;
struct file *file = lo->lo_backing_file;
+   unsigned int offset;
+   int segments = 0;
int ret;
 
-   /* nomerge for loop request queue */
-   WARN_ON(cmd->rq->bio != cmd->rq->biotail);
+   if (rq->bio != rq->biotail) {
+   struct req_iterator iter;
+   struct bio_vec tmp;
+
+   __rq_for_each_bio(bio, rq)
+   segments += bio_segments(bio);
+   bvec = kmalloc(sizeof(struct bio_vec) * segments, GFP_KERNEL);
+   if (!bvec)
+   return -EIO;
+   cmd->bvec = bvec;
+
+   /*
+* The bios of the request may be started from the middle of
+* the 'bvec' because of bio splitting, so we can't directly
+* copy bio->bi_iov_vec to new bvec. The rq_for_each_segment
+* API will take care of all details for us.
+*/
+   rq_for_each_segment(tmp, rq, iter) {
+   *bvec = tmp;
+   bvec++;
+   }
+   bvec = cmd->bvec;
+   offset = 0;
+   } else {
+   /*
+* Same here, this bio may be started from the middle of the
+* 'bvec' because of bio splitting, so offset from the bvec
+* must be passed to iov iterator
+*/
+   offset = bio->bi_iter.bi_bvec_done;
+   bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+   segments = bio_segments(bio);
+   }
 
-   bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
iov_iter_bvec(&iter, ITER_BVEC | rw, bvec,
- bio_segments(bio), blk_rq_bytes(cmd->rq));
-   /*
-* This bio may be started from the middle of the 'bvec'
-* because of bio splitting, so offset from the bvec must
-* be passed to iov iterator
-*/
-   iter.iov_offset = bio->bi_iter.bi_bvec_done;
+ segments, blk_rq_bytes(rq));
+   iter.iov_offset = offset;
 
cmd->iocb.ki_pos = pos;
cmd->iocb.ki_filp = file;
@@ -1800,9 +1833,12 @@ static int loop_add(struct loop_device **l, int i)
lo->lo_queue->queuedata = lo;
 
blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS);
+
/*
-* It doesn't make sense to enable merge because the I/O
-* submitted to backing file is handled page by page.
+* By default, we do buffer IO, so it doesn't make sense to enable
+* merge because the I/O submitted to backing file is handled page by
+* page. For directio mode, merge does help to dispatch bigger request
+* to underlayer disk. We will enable merge once directio is enabled.
 */
queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
 
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index fecd3f9..bc9cf91 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -72,6 +72,7 @@ struct loop_cmd {
bool use_aio;   /* use AIO interface to handle I/O */
long ret;
struct kiocb iocb;
+   struct bio_vec *bvec;
 };
 
 /* Support for loadable transfer modules */
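
For context, the merge path added by this patch is only reachable once the loop
device has been switched into direct-I/O mode. A hypothetical userspace sketch
of that switch is shown below; LOOP_SET_DIRECT_IO is an existing upstream
ioctl, the device path is just an example, and the backing file must already
meet the usual direct-I/O alignment requirements:

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/loop.h>

	int main(void)
	{
		/* Example device node; assumes a loop device already bound to a
		 * suitably aligned backing file. */
		int fd = open("/dev/loop0", O_RDWR);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* 1 = enable direct I/O on the backing file; with this patch the
		 * queue's NOMERGES flag is cleared at the same time. */
		if (ioctl(fd, LOOP_SET_DIRECT_IO, 1UL) < 0)
			perror("LOOP_SET_DIRECT_IO");
		return 0;
	}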

[PATCH V2 0/2] block/loop: improve performance

2017-08-24 Thread Shaohua Li
From: Shaohua Li 

Two small patches to improve performance for loop in directio mode. The goal is
to increase the I/O size sent to the underlying disks. As Omar pointed out, the
patches conflict slightly with his, but that should be easy to fix.

Thanks,
Shaohua

Shaohua Li (2):
  block/loop: set hw_sectors
  block/loop: allow request merge for directio mode

 drivers/block/loop.c | 67 
 drivers/block/loop.h |  1 +
 2 files changed, 53 insertions(+), 15 deletions(-)

-- 
2.9.5



[PATCH V2 1/2] block/loop: set hw_sectors

2017-08-24 Thread Shaohua Li
From: Shaohua Li 

Loop can handle requests of any size. Limiting it to 255 sectors just
burns CPU on bio splitting and request merging for the underlying disk, and
also causes bad fs block allocation in directio mode.

Reviewed-by: Omar Sandoval 
Signed-off-by: Shaohua Li 
---
 drivers/block/loop.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index b55a1f8..428da07 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1799,6 +1799,7 @@ static int loop_add(struct loop_device **l, int i)
}
lo->lo_queue->queuedata = lo;
 
+   blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS);
/*
 * It doesn't make sense to enable merge because the I/O
 * submitted to backing file is handled page by page.
-- 
2.9.5



Re: [bug report] skd: Avoid that module unloading triggers a use-after-free

2017-08-24 Thread Dan Carpenter
On Thu, Aug 24, 2017 at 03:04:12PM +, Bart Van Assche wrote:
> On Thu, 2017-08-24 at 14:04 +0300, Dan Carpenter wrote:
> > Hello Bart Van Assche,
> > 
> > This is a semi-automatic email about new static checker warnings.
> > 
> > The patch 7277cc67b391: "skd: Avoid that module unloading triggers a 
> > use-after-free" from Aug 17, 2017, leads to the following Smatch 
> > complaint:
> > 
> > drivers/block/skd_main.c:3080 skd_free_disk()
> >  error: we previously assumed 'disk' could be null (see line 3074)
> > 
> > drivers/block/skd_main.c
> >   3073  
> >   3074  if (disk && (disk->flags & GENHD_FL_UP))
> > 
> > Existing code checked for NULL.  The new code shuffles things around.
> > 
> >   3075  del_gendisk(disk);
> >   3076  
> >   3077  if (skdev->queue) {
> >   3078  blk_cleanup_queue(skdev->queue);
> >   3079  skdev->queue = NULL;
> >   3080  disk->queue = NULL;
> > ^^^
> > Now we don't check here.
> > 
> >   3081  }
> >   3082  
> > 
> > regards,
> > dan carpenter
> 
> Hello Dan,
> 
> If you have a look at skd_cons_disk() you will see that skdev->queue != NULL
> implies that skdev->disk != NULL. So I think the above report is a false
> positive.
> 

Oh, yeah.  You're right.  Thanks for taking a look at this.

regards,
dan carpenter



Re: [PATCH 2/2] block/loop: allow request merge for directio mode

2017-08-24 Thread Shaohua Li
On Thu, Aug 24, 2017 at 10:57:39AM -0700, Omar Sandoval wrote:
> On Wed, Aug 23, 2017 at 04:49:24PM -0700, Shaohua Li wrote:
> > From: Shaohua Li 
> > 
> > Currently loop disables merge. While it makes sense for buffer IO mode,
> > directio mode can benefit from request merge. Without merge, loop could
> > send small size IO to underlayer disk and harm performance.
> 
> A couple of nits on comments below, besides that
> 
> Reviewed-by: Omar Sandoval 
> 
> > Signed-off-by: Shaohua Li 
> > ---
> >  drivers/block/loop.c | 43 +++
> >  drivers/block/loop.h |  1 +
> >  2 files changed, 36 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index 428da07..1bfa3ff 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -213,10 +213,13 @@ static void __loop_update_dio(struct loop_device *lo, 
> > bool dio)
> >  */
> > blk_mq_freeze_queue(lo->lo_queue);
> > lo->use_dio = use_dio;
> > -   if (use_dio)
> > +   if (use_dio) {
> > +   queue_flag_clear_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
> > lo->lo_flags |= LO_FLAGS_DIRECT_IO;
> > -   else
> > +   } else {
> > +   queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
> > lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
> > +   }
> 
> Could you please also update the comment in loop_add() where we set
> QUEUE_FLAG_NOMERGES to say something like
> 
>   /*
>* By default we do buffered I/O, so it doesn't make sense to
>* enable merging because the I/O submitted to backing file is
>* handled page by page. We enable merging when switching to
>* direct I/O mode.
>*/

sure, will add
 
> > blk_mq_unfreeze_queue(lo->lo_queue);
> >  }
> >  
> > @@ -464,6 +467,8 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long 
> > ret, long ret2)
> >  {
> > struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
> >  
> > +   kfree(cmd->bvec);
> > +   cmd->bvec = NULL;
> > cmd->ret = ret;
> > blk_mq_complete_request(cmd->rq);
> >  }
> > @@ -473,22 +478,44 @@ static int lo_rw_aio(struct loop_device *lo, struct 
> > loop_cmd *cmd,
> >  {
> > struct iov_iter iter;
> > struct bio_vec *bvec;
> > -   struct bio *bio = cmd->rq->bio;
> > +   struct request *rq = cmd->rq;
> > +   struct bio *bio = rq->bio;
> > struct file *file = lo->lo_backing_file;
> > +   unsigned int offset;
> > +   int segments = 0;
> > int ret;
> >  
> > -   /* nomerge for loop request queue */
> > -   WARN_ON(cmd->rq->bio != cmd->rq->biotail);
> > +   if (rq->bio != rq->biotail) {
> > +   struct req_iterator iter;
> > +   struct bio_vec tmp;
> > +
> > +   __rq_for_each_bio(bio, rq)
> > +   segments += bio_segments(bio);
> > +   bvec = kmalloc(sizeof(struct bio_vec) * segments, GFP_KERNEL);
> > +   if (!bvec)
> > +   return -EIO;
> > +   cmd->bvec = bvec;
> > +
> > +   rq_for_each_segment(tmp, rq, iter) {
> > +   *bvec = tmp;
> > +   bvec++;
> > +   }
> > +   bvec = cmd->bvec;
> > +   offset = 0;
> > +   } else {
> > +   offset = bio->bi_iter.bi_bvec_done;
> > +   bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
> > +   segments = bio_segments(bio);
> > +   }
> >  
> > -   bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
> > iov_iter_bvec(, ITER_BVEC | rw, bvec,
> > - bio_segments(bio), blk_rq_bytes(cmd->rq));
> > + segments, blk_rq_bytes(rq));
> > /*
> >  * This bio may be started from the middle of the 'bvec'
> >  * because of bio splitting, so offset from the bvec must
> >  * be passed to iov iterator
> >  */
> 
> This comment should be moved above to where you do
> 
>   offset = bio->bi_iter.bi_bvec_done;

This comment actually applies to the 'rq->bio != rq->biotail' case too, for
each bio of the request. That's why I don't just copy bio->bi_io_vec.
I'll explain this in a different way.

Thanks,
Shaohua


[PATCH] block: update comments to reflect REQ_FLUSH -> REQ_PREFLUSH rename

2017-08-24 Thread Omar Sandoval
From: Omar Sandoval 

Normally I wouldn't bother with this, but in my opinion the comments are
the most important part of this whole file since without them no one
would have any clue how this insanity works.

Signed-off-by: Omar Sandoval 
---
Based on Jens' for-next branch, for 4.14.

 block/blk-flush.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index ed5fe322abba..3e77676244d9 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -1,12 +1,12 @@
 /*
- * Functions to sequence FLUSH and FUA writes.
+ * Functions to sequence PREFLUSH and FUA writes.
  *
  * Copyright (C) 2011  Max Planck Institute for Gravitational Physics
  * Copyright (C) 2011  Tejun Heo 
  *
  * This file is released under the GPLv2.
  *
- * REQ_{FLUSH|FUA} requests are decomposed to sequences consisted of three
+ * REQ_{PREFLUSH|FUA} requests are decomposed to sequences consisted of three
  * optional steps - PREFLUSH, DATA and POSTFLUSH - according to the request
  * properties and hardware capability.
  *
@@ -16,9 +16,9 @@
  * REQ_FUA means that the data must be on non-volatile media on request
  * completion.
  *
- * If the device doesn't have writeback cache, FLUSH and FUA don't make any
- * difference.  The requests are either completed immediately if there's no
- * data or executed as normal requests otherwise.
+ * If the device doesn't have writeback cache, PREFLUSH and FUA don't make any
+ * difference.  The requests are either completed immediately if there's no 
data
+ * or executed as normal requests otherwise.
  *
  * If the device has writeback cache and supports FUA, REQ_PREFLUSH is
  * translated to PREFLUSH but REQ_FUA is passed down directly with DATA.
@@ -31,7 +31,7 @@
  * fq->flush_queue[fq->flush_pending_idx].  Once certain criteria are met, a
  * REQ_OP_FLUSH is issued and the pending_idx is toggled.  When the flush
  * completes, all the requests which were pending are proceeded to the next
- * step.  This allows arbitrary merging of different types of FLUSH/FUA
+ * step.  This allows arbitrary merging of different types of PREFLUSH/FUA
  * requests.
  *
  * Currently, the following conditions are used to determine when to issue
@@ -47,19 +47,19 @@
  * C3. The second condition is ignored if there is a request which has
  * waited longer than FLUSH_PENDING_TIMEOUT.  This is to avoid
  * starvation in the unlikely case where there are continuous stream of
- * FUA (without FLUSH) requests.
+ * FUA (without PREFLUSH) requests.
  *
  * For devices which support FUA, it isn't clear whether C2 (and thus C3)
  * is beneficial.
  *
- * Note that a sequenced FLUSH/FUA request with DATA is completed twice.
+ * Note that a sequenced PREFLUSH/FUA request with DATA is completed twice.
  * Once while executing DATA and again after the whole sequence is
  * complete.  The first completion updates the contained bio but doesn't
  * finish it so that the bio submitter is notified only after the whole
  * sequence is complete.  This is implemented by testing RQF_FLUSH_SEQ in
  * req_bio_endio().
  *
- * The above peculiarity requires that each FLUSH/FUA request has only one
+ * The above peculiarity requires that each PREFLUSH/FUA request has only one
  * bio attached to it, which is guaranteed as they aren't allowed to be
  * merged in the usual way.
  */
@@ -76,7 +76,7 @@
 #include "blk-mq-tag.h"
 #include "blk-mq-sched.h"
 
-/* FLUSH/FUA sequences */
+/* PREFLUSH/FUA sequences */
 enum {
REQ_FSEQ_PREFLUSH   = (1 << 0), /* pre-flushing in progress */
REQ_FSEQ_DATA   = (1 << 1), /* data write in progress */
@@ -148,7 +148,7 @@ static bool blk_flush_queue_rq(struct request *rq, bool 
add_front)
 
 /**
  * blk_flush_complete_seq - complete flush sequence
- * @rq: FLUSH/FUA request being sequenced
+ * @rq: PREFLUSH/FUA request being sequenced
  * @fq: flush queue
  * @seq: sequences to complete (mask of %REQ_FSEQ_*, can be zero)
  * @error: whether an error occurred
@@ -406,7 +406,7 @@ static void mq_flush_data_end_io(struct request *rq, 
blk_status_t error)
 }
 
 /**
- * blk_insert_flush - insert a new FLUSH/FUA request
+ * blk_insert_flush - insert a new PREFLUSH/FUA request
  * @rq: request to insert
  *
  * To be called from __elv_add_request() for %ELEVATOR_INSERT_FLUSH insertions.
-- 
2.14.1
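
To tie the renamed terminology back to how callers actually request these
semantics, here is a small hedged illustration using existing upstream flags;
it is not part of the patch, and the surrounding bio setup is assumed:

	/* Sketch only: REQ_PREFLUSH asks for a cache flush before the data is
	 * written, REQ_FUA asks that the data itself reach non-volatile media.
	 * Depending on the device's writeback-cache/FUA capabilities, the block
	 * layer decomposes this into the PREFLUSH/DATA/POSTFLUSH steps described
	 * in the comments above. */
	bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_FUA;
	submit_bio(bio);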



Re: [PATCH 2/2] block/loop: allow request merge for directio mode

2017-08-24 Thread Omar Sandoval
On Wed, Aug 23, 2017 at 04:49:24PM -0700, Shaohua Li wrote:
> From: Shaohua Li 
> 
> Currently loop disables merge. While it makes sense for buffer IO mode,
> directio mode can benefit from request merge. Without merge, loop could
> send small size IO to underlayer disk and harm performance.

A couple of nits on comments below, besides that

Reviewed-by: Omar Sandoval 

> Signed-off-by: Shaohua Li 
> ---
>  drivers/block/loop.c | 43 +++
>  drivers/block/loop.h |  1 +
>  2 files changed, 36 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 428da07..1bfa3ff 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -213,10 +213,13 @@ static void __loop_update_dio(struct loop_device *lo, 
> bool dio)
>*/
>   blk_mq_freeze_queue(lo->lo_queue);
>   lo->use_dio = use_dio;
> - if (use_dio)
> + if (use_dio) {
> + queue_flag_clear_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
>   lo->lo_flags |= LO_FLAGS_DIRECT_IO;
> - else
> + } else {
> + queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
>   lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
> + }

Could you please also update the comment in loop_add() where we set
QUEUE_FLAG_NOMERGES to say something like

/*
 * By default we do buffered I/O, so it doesn't make sense to
 * enable merging because the I/O submitted to backing file is
 * handled page by page. We enable merging when switching to
 * direct I/O mode.
 */

>   blk_mq_unfreeze_queue(lo->lo_queue);
>  }
>  
> @@ -464,6 +467,8 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long 
> ret, long ret2)
>  {
>   struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
>  
> + kfree(cmd->bvec);
> + cmd->bvec = NULL;
>   cmd->ret = ret;
>   blk_mq_complete_request(cmd->rq);
>  }
> @@ -473,22 +478,44 @@ static int lo_rw_aio(struct loop_device *lo, struct 
> loop_cmd *cmd,
>  {
>   struct iov_iter iter;
>   struct bio_vec *bvec;
> - struct bio *bio = cmd->rq->bio;
> + struct request *rq = cmd->rq;
> + struct bio *bio = rq->bio;
>   struct file *file = lo->lo_backing_file;
> + unsigned int offset;
> + int segments = 0;
>   int ret;
>  
> - /* nomerge for loop request queue */
> - WARN_ON(cmd->rq->bio != cmd->rq->biotail);
> + if (rq->bio != rq->biotail) {
> + struct req_iterator iter;
> + struct bio_vec tmp;
> +
> + __rq_for_each_bio(bio, rq)
> + segments += bio_segments(bio);
> + bvec = kmalloc(sizeof(struct bio_vec) * segments, GFP_KERNEL);
> + if (!bvec)
> + return -EIO;
> + cmd->bvec = bvec;
> +
> + rq_for_each_segment(tmp, rq, iter) {
> + *bvec = tmp;
> + bvec++;
> + }
> + bvec = cmd->bvec;
> + offset = 0;
> + } else {
> + offset = bio->bi_iter.bi_bvec_done;
> + bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
> + segments = bio_segments(bio);
> + }
>  
> - bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
>   iov_iter_bvec(, ITER_BVEC | rw, bvec,
> -   bio_segments(bio), blk_rq_bytes(cmd->rq));
> +   segments, blk_rq_bytes(rq));
>   /*
>* This bio may be started from the middle of the 'bvec'
>* because of bio splitting, so offset from the bvec must
>* be passed to iov iterator
>*/

This comment should be moved above to where you do

offset = bio->bi_iter.bi_bvec_done;

> - iter.iov_offset = bio->bi_iter.bi_bvec_done;
> + iter.iov_offset = offset;
>  
>   cmd->iocb.ki_pos = pos;
>   cmd->iocb.ki_filp = file;
> diff --git a/drivers/block/loop.h b/drivers/block/loop.h
> index fecd3f9..bc9cf91 100644
> --- a/drivers/block/loop.h
> +++ b/drivers/block/loop.h
> @@ -72,6 +72,7 @@ struct loop_cmd {
>   bool use_aio;   /* use AIO interface to handle I/O */
>   long ret;
>   struct kiocb iocb;
> + struct bio_vec *bvec;
>  };
>  
>  /* Support for loadable transfer modules */
> -- 
> 2.9.5
> 


Re: [PATCH 1/2] block/loop: set hw_sectors

2017-08-24 Thread Omar Sandoval
On Wed, Aug 23, 2017 at 04:49:23PM -0700, Shaohua Li wrote:
> From: Shaohua Li 
> 
> Loop can handle any size of request. Limiting it to 255 sectors just
> burns the CPU for bio split and request merge for underlayer disk and
> also cause bad fs block allocation in directio mode.

Reviewed-by: Omar Sandoval 

Note that this will conflict with my loop blocksize series, we can fix
up whichever series goes in second.

> Signed-off-by: Shaohua Li 
> ---
>  drivers/block/loop.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index b55a1f8..428da07 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1799,6 +1799,7 @@ static int loop_add(struct loop_device **l, int i)
>   }
>   lo->lo_queue->queuedata = lo;
>  
> + blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS);
>   /*
>* It doesn't make sense to enable merge because the I/O
>* submitted to backing file is handled page by page.
> -- 
> 2.9.5
> 


[PATCH] block, scheduler: convert xxx_var_store to void

2017-08-24 Thread weiping zhang
The last parameter "count" is never used in xxx_var_store(),
so convert these functions to void.

Signed-off-by: weiping zhang 
---
 block/bfq-iosched.c  | 33 +
 block/cfq-iosched.c  | 13 ++---
 block/deadline-iosched.c |  9 -
 block/mq-deadline.c  |  9 -
 4 files changed, 31 insertions(+), 33 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 436b6ca..7a4085d 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4787,16 +4787,13 @@ static ssize_t bfq_var_show(unsigned int var, char 
*page)
return sprintf(page, "%u\n", var);
 }
 
-static ssize_t bfq_var_store(unsigned long *var, const char *page,
-size_t count)
+static void bfq_var_store(unsigned long *var, const char *page)
 {
unsigned long new_val;
int ret = kstrtoul(page, 10, &new_val);
 
if (ret == 0)
*var = new_val;
-
-   return count;
 }
 
 #define SHOW_FUNCTION(__FUNC, __VAR, __CONV)   \
@@ -4838,7 +4835,7 @@ __FUNC(struct elevator_queue *e, const char *page, size_t 
count)  \
 {  \
struct bfq_data *bfqd = e->elevator_data;   \
unsigned long uninitialized_var(__data);\
-   int ret = bfq_var_store(&__data, (page), count);\
+   bfq_var_store(&__data, (page)); \
if (__data < (MIN)) \
__data = (MIN); \
else if (__data > (MAX))\
@@ -4849,7 +4846,7 @@ __FUNC(struct elevator_queue *e, const char *page, size_t 
count)  \
*(__PTR) = (u64)__data * NSEC_PER_MSEC; \
else\
*(__PTR) = __data;  \
-   return ret; \
+   return count;   \
 }
STORE_FUNCTION(bfq_fifo_expire_sync_store, &bfqd->bfq_fifo_expire[1], 1,
INT_MAX, 2);
@@ -4866,13 +4863,13 @@ static ssize_t __FUNC(struct elevator_queue *e, const 
char *page, size_t count)\
 {  \
struct bfq_data *bfqd = e->elevator_data;   \
unsigned long uninitialized_var(__data);\
-   int ret = bfq_var_store(&__data, (page), count);\
+   bfq_var_store(&__data, (page)); \
if (__data < (MIN)) \
__data = (MIN); \
else if (__data > (MAX))\
__data = (MAX); \
*(__PTR) = (u64)__data * NSEC_PER_USEC; \
-   return ret; \
+   return count;   \
 }
USEC_STORE_FUNCTION(bfq_slice_idle_us_store, &bfqd->bfq_slice_idle, 0,
UINT_MAX);
@@ -4883,7 +4880,8 @@ static ssize_t bfq_max_budget_store(struct elevator_queue 
*e,
 {
struct bfq_data *bfqd = e->elevator_data;
unsigned long uninitialized_var(__data);
-   int ret = bfq_var_store(&__data, (page), count);
+
+   bfq_var_store(&__data, (page));
 
if (__data == 0)
bfqd->bfq_max_budget = bfq_calc_max_budget(bfqd);
@@ -4895,7 +4893,7 @@ static ssize_t bfq_max_budget_store(struct elevator_queue 
*e,
 
bfqd->bfq_user_max_budget = __data;
 
-   return ret;
+   return count;
 }
 
 /*
@@ -4907,7 +4905,8 @@ static ssize_t bfq_timeout_sync_store(struct 
elevator_queue *e,
 {
struct bfq_data *bfqd = e->elevator_data;
unsigned long uninitialized_var(__data);
-   int ret = bfq_var_store(&__data, (page), count);
+
+   bfq_var_store(&__data, (page));
 
if (__data < 1)
__data = 1;
@@ -4918,7 +4917,7 @@ static ssize_t bfq_timeout_sync_store(struct 
elevator_queue *e,
if (bfqd->bfq_user_max_budget == 0)
bfqd->bfq_max_budget = bfq_calc_max_budget(bfqd);
 
-   return ret;
+   return count;
 }
 
 static ssize_t bfq_strict_guarantees_store(struct elevator_queue *e,
@@ -4926,7 +4925,8 @@ static ssize_t bfq_strict_guarantees_store(struct 
elevator_queue *e,
 {
struct bfq_data *bfqd = e->elevator_data;
unsigned long uninitialized_var(__data);
-   int ret = bfq_var_store(&__data, (page), count);
+
+   bfq_var_store(&__data, (page));
 
if (__data > 1)
__data = 1;
@@ 

Re: [PATCH 3/4] loop: add ioctl for changing logical block size

2017-08-24 Thread Omar Sandoval
On Thu, Aug 24, 2017 at 10:06:57AM +0200, Karel Zak wrote:
> On Thu, Aug 24, 2017 at 12:03:43AM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > This is a different approach from the first attempt in f2c6df7dbf9a
> > ("loop: support 4k physical blocksize"). Rather than extending
> > LOOP_{GET,SET}_STATUS, add a separate ioctl just for setting the block
> > size.
> > 
> > Signed-off-by: Omar Sandoval 
> > ---
> >  drivers/block/loop.c  | 24 
> >  include/uapi/linux/loop.h |  1 +
> >  2 files changed, 25 insertions(+)
> > 
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index 1a5b4ecf54ec..6b1d932028e3 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -1047,6 +1047,7 @@ static int loop_clr_fd(struct loop_device *lo)
> > memset(lo->lo_encrypt_key, 0, LO_KEY_SIZE);
> > memset(lo->lo_crypt_name, 0, LO_NAME_SIZE);
> > memset(lo->lo_file_name, 0, LO_NAME_SIZE);
> > +   blk_queue_logical_block_size(lo->lo_queue, 512);
> > if (bdev) {
> > bdput(bdev);
> > invalidate_bdev(bdev);
> > @@ -1330,6 +1331,24 @@ static int loop_set_dio(struct loop_device *lo, 
> > unsigned long arg)
> > return error;
> >  }
> >  
> > +static int loop_set_block_size(struct loop_device *lo, unsigned long arg)
> > +{
> > +   if (lo->lo_state != Lo_bound)
> > +   return -ENXIO;
> > +
> > +   if (arg < 512 || arg > PAGE_SIZE || !is_power_of_2(arg))
> > +   return -EINVAL;
> 
> I'm not sure if the PAGE_SIZE is the right limit for the use-case
> where backing file is a block device ...but it's nothing important.

A bigger logical block size just means we'll send bigger I/Os down to
the backing file, so I think it should be okay.

> > +   blk_mq_freeze_queue(lo->lo_queue);
> > +
> > +   blk_queue_logical_block_size(lo->lo_queue, arg);
> 
> Just note that this function also updates physical_block_size if the
> 'arg' is greater than the current setting. That's good thing :-)

Patch 2 sets the physical block size to PAGE_SIZE, and we cap the
logical block size to PAGE_SIZE, so we won't actually hit that case
anyways :)

> > +   loop_update_dio(lo);
> > +
> > +   blk_mq_unfreeze_queue(lo->lo_queue);
> > +
> > +   return 0;
> > +}
> > +
> >  static int lo_ioctl(struct block_device *bdev, fmode_t mode,
> > unsigned int cmd, unsigned long arg)
> >  {
> > @@ -1378,6 +1397,11 @@ static int lo_ioctl(struct block_device *bdev, 
> > fmode_t mode,
> > if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
> > err = loop_set_dio(lo, arg);
> > break;
> > +   case LOOP_SET_BLOCK_SIZE:
> > +   err = -EPERM;
> > +   if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
> > +   err = loop_set_block_size(lo, arg);
> > +   break;
> 
> I like it this ioctl idea. It makes loop based tests more comfortable
> as we can change sector size on the fly without loop device reset.
> 
> I did not test it yet, but this implementation seems the best. Thanks.

Thanks! Let me know if you spot any issues when you test it.
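
For anyone who wants to try this from a test, a minimal userspace call of the
new ioctl might look like the following. This is a hypothetical example:
LOOP_SET_BLOCK_SIZE is the value added by this patch (0x4C09), the device path
is only an example, and older <linux/loop.h> headers will not define the macro:

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/loop.h>

	#ifndef LOOP_SET_BLOCK_SIZE
	#define LOOP_SET_BLOCK_SIZE 0x4C09	/* from this patch series */
	#endif

	int main(void)
	{
		int fd = open("/dev/loop0", O_RDWR);	/* example device */

		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* Per the patch: must be a power of two between 512 and PAGE_SIZE,
		 * and the loop device must already be bound (Lo_bound). */
		if (ioctl(fd, LOOP_SET_BLOCK_SIZE, 4096UL) < 0)
			perror("LOOP_SET_BLOCK_SIZE");
		return 0;
	}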


Re: [PATCH 4/5] skd: Avoid double completions in case of a timeout

2017-08-24 Thread Bart Van Assche
On Wed, 2017-08-23 at 20:08 +0200, Christoph Hellwig wrote:
> Although as a follow on I would also move the debug printks
> from skd_end_request to skd_softirq_done, then remove skd_end_request
> and rename skd_softirq_done to skd_complete_rq to fit what other
> drivers are doing a little more closely.

Hello Christoph,

Thanks for the review of this patch series. I will include the above changes
when I post my next patch series for the skd driver.

Bart.

Re: [bug report] skd: Avoid that module unloading triggers a use-after-free

2017-08-24 Thread Bart Van Assche
On Thu, 2017-08-24 at 14:04 +0300, Dan Carpenter wrote:
> Hello Bart Van Assche,
> 
> This is a semi-automatic email about new static checker warnings.
> 
> The patch 7277cc67b391: "skd: Avoid that module unloading triggers a 
> use-after-free" from Aug 17, 2017, leads to the following Smatch 
> complaint:
> 
> drivers/block/skd_main.c:3080 skd_free_disk()
>error: we previously assumed 'disk' could be null (see line 3074)
> 
> drivers/block/skd_main.c
>   3073
>   3074if (disk && (disk->flags & GENHD_FL_UP))
> 
> Existing code checked for NULL.  The new code shuffles things around.
> 
>   3075del_gendisk(disk);
>   3076
>   3077if (skdev->queue) {
>   3078blk_cleanup_queue(skdev->queue);
>   3079skdev->queue = NULL;
>   3080disk->queue = NULL;
> ^^^
> Now we don't check here.
> 
>   3081}
>   3082
> 
> regards,
> dan carpenter

Hello Dan,

If you have a look at skd_cons_disk() you will see that skdev->queue != NULL
implies that skdev->disk != NULL. So I think the above report is a false
positive.

Bart.

Re: [PATCH 2/2] compat_hdio_ioctl: Fix a declaration

2017-08-24 Thread Jens Axboe
On 08/23/2017 04:29 PM, Bart Van Assche wrote:
> This patch avoids that sparse reports the following warning messages:
> 
> block/compat_ioctl.c:85:11: warning: incorrect type in assignment (different 
> address spaces)
> block/compat_ioctl.c:85:11:expected unsigned long *[noderef] p
> block/compat_ioctl.c:85:11:got void [noderef] *
> block/compat_ioctl.c:91:21: warning: incorrect type in argument 1 (different 
> address spaces)
> block/compat_ioctl.c:91:21:expected void const volatile [noderef] 
> *
> block/compat_ioctl.c:91:21:got unsigned long *[noderef] p
> block/compat_ioctl.c:87:53: warning: dereference of noderef expression
> block/compat_ioctl.c:91:21: warning: dereference of noderef expression

Applied for 4.14, thanks Bart.

-- 
Jens Axboe



Re: [PATCH 4/4] block_dev: support RFW_NOWAIT on block device nodes

2017-08-24 Thread Jan Kara
On Tue 22-08-17 18:17:12, Christoph Hellwig wrote:
> All support is already there in the generic code, we just need to wire
> it up.
> 
> Signed-off-by: Christoph Hellwig 

Looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/block_dev.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 9941dc8342df..ea21d18d8e79 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1739,6 +1739,8 @@ static int blkdev_open(struct inode * inode, struct 
> file * filp)
>*/
>   filp->f_flags |= O_LARGEFILE;
>  
> + filp->f_mode |= FMODE_NOWAIT;
> +
>   if (filp->f_flags & O_NDELAY)
>   filp->f_mode |= FMODE_NDELAY;
>   if (filp->f_flags & O_EXCL)
> @@ -1891,6 +1893,9 @@ ssize_t blkdev_write_iter(struct kiocb *iocb, struct 
> iov_iter *from)
>   if (iocb->ki_pos >= size)
>   return -ENOSPC;
>  
> + if ((iocb->ki_flags & (IOCB_NOWAIT | IOCB_DIRECT)) == IOCB_NOWAIT)
> + return -EOPNOTSUPP;
> +
>   iov_iter_truncate(from, size - iocb->ki_pos);
>  
>   blk_start_plug();
> -- 
> 2.11.0
> 
-- 
Jan Kara 
SUSE Labs, CR
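
As a usage illustration (not part of the patch): once FMODE_NOWAIT is set on
block device files, an O_DIRECT write submitted with RWF_NOWAIT returns -EAGAIN
instead of blocking, while a buffered RWF_NOWAIT write is rejected with
-EOPNOTSUPP per the hunk above. A hedged sketch, assuming a libc that exposes
pwritev2() and RWF_NOWAIT and an example device node with no data you care
about:

	#define _GNU_SOURCE
	#include <errno.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/uio.h>

	int main(void)
	{
		/* Example device node and 4K-aligned buffer; O_DIRECT is what
		 * makes RWF_NOWAIT usable for writes on a block device here. */
		int fd = open("/dev/sdX", O_WRONLY | O_DIRECT);
		struct iovec iov;
		void *buf;
		ssize_t ret;

		if (fd < 0 || posix_memalign(&buf, 4096, 4096))
			return 1;
		iov.iov_base = buf;
		iov.iov_len = 4096;
		ret = pwritev2(fd, &iov, 1, 0, RWF_NOWAIT);
		if (ret < 0 && errno == EAGAIN)
			printf("write would have blocked, try again later\n");
		return 0;
	}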


Re: [PATCH 3/4] fs: support RWF_NOWAIT for buffered reads

2017-08-24 Thread Jan Kara
On Tue 22-08-17 18:17:11, Christoph Hellwig wrote:
> This is based on the old idea and code from Milosz Tanski.  With the aio
> nowait code it becomes mostly trivial now.  Buffered writes continue to
> return -EOPNOTSUPP if RWF_NOWAIT is passed.
> 
> Signed-off-by: Christoph Hellwig 

Looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/aio.c   |  6 --
>  fs/btrfs/file.c|  6 +-
>  fs/ext4/file.c |  6 +++---
>  fs/xfs/xfs_file.c  | 11 +--
>  include/linux/fs.h |  6 +++---
>  5 files changed, 20 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index dcad3a66748c..d93daa076726 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1593,12 +1593,6 @@ static int io_submit_one(struct kioctx *ctx, struct 
> iocb __user *user_iocb,
>   goto out_put_req;
>   }
>  
> - if ((req->common.ki_flags & IOCB_NOWAIT) &&
> - !(req->common.ki_flags & IOCB_DIRECT)) {
> - ret = -EOPNOTSUPP;
> - goto out_put_req;
> - }
> -
>   ret = put_user(KIOCB_KEY, _iocb->aio_key);
>   if (unlikely(ret)) {
>   pr_debug("EFAULT: aio_key\n");
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 9e75d8a39aac..e62dd55b4079 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1886,6 +1886,10 @@ static ssize_t btrfs_file_write_iter(struct kiocb 
> *iocb,
>   loff_t oldsize;
>   int clean_page = 0;
>  
> + if (!(iocb->ki_flags & IOCB_DIRECT) &&
> + (iocb->ki_flags & IOCB_NOWAIT))
> + return -EOPNOTSUPP;
> +
>   if (!inode_trylock(inode)) {
>   if (iocb->ki_flags & IOCB_NOWAIT)
>   return -EAGAIN;
> @@ -3105,7 +3109,7 @@ static loff_t btrfs_file_llseek(struct file *file, 
> loff_t offset, int whence)
>  
>  static int btrfs_file_open(struct inode *inode, struct file *filp)
>  {
> - filp->f_mode |= FMODE_AIO_NOWAIT;
> + filp->f_mode |= FMODE_NOWAIT;
>   return generic_file_open(inode, filp);
>  }
>  
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 0d7cf0cc9b87..f83521337b8f 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -223,6 +223,8 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
> *from)
>   if (IS_DAX(inode))
>   return ext4_dax_write_iter(iocb, from);
>  #endif
> + if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT))
> + return -EOPNOTSUPP;
>  
>   if (!inode_trylock(inode)) {
>   if (iocb->ki_flags & IOCB_NOWAIT)
> @@ -448,9 +450,7 @@ static int ext4_file_open(struct inode * inode, struct 
> file * filp)
>   return ret;
>   }
>  
> - /* Set the flags to support nowait AIO */
> - filp->f_mode |= FMODE_AIO_NOWAIT;
> -
> + filp->f_mode |= FMODE_NOWAIT;
>   return dquot_file_open(inode, filp);
>  }
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index c4893e226fd8..1a09104b3eb0 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -259,7 +259,11 @@ xfs_file_buffered_aio_read(
>  
>   trace_xfs_file_buffered_read(ip, iov_iter_count(to), iocb->ki_pos);
>  
> - xfs_ilock(ip, XFS_IOLOCK_SHARED);
> + if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) {
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EAGAIN;
> + xfs_ilock(ip, XFS_IOLOCK_SHARED);
> + }
>   ret = generic_file_read_iter(iocb, to);
>   xfs_iunlock(ip, XFS_IOLOCK_SHARED);
>  
> @@ -636,6 +640,9 @@ xfs_file_buffered_aio_write(
>   int enospc = 0;
>   int iolock;
>  
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EOPNOTSUPP;
> +
>  write_retry:
>   iolock = XFS_IOLOCK_EXCL;
>   xfs_ilock(ip, iolock);
> @@ -912,7 +919,7 @@ xfs_file_open(
>   return -EFBIG;
>   if (XFS_FORCED_SHUTDOWN(XFS_M(inode->i_sb)))
>   return -EIO;
> - file->f_mode |= FMODE_AIO_NOWAIT;
> + file->f_mode |= FMODE_NOWAIT;
>   return 0;
>  }
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6e1fd5d21248..ded258edc425 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -146,8 +146,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t 
> offset,
>  /* File was opened by fanotify and shouldn't generate fanotify events */
>  #define FMODE_NONOTIFY   ((__force fmode_t)0x400)
>  
> -/* File is capable of returning -EAGAIN if AIO will block */
> -#define FMODE_AIO_NOWAIT ((__force fmode_t)0x800)
> +/* File is capable of returning -EAGAIN if I/O will block */
> +#define FMODE_NOWAIT ((__force fmode_t)0x800)
>  
>  /*
>   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> @@ -3149,7 +3149,7 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, 
> int flags)
>   return -EOPNOTSUPP;
>  
>   if (flags & 

Re: [PATCH 2/4] fs: support IOCB_NOWAIT in generic_file_buffered_read

2017-08-24 Thread Jan Kara
On Tue 22-08-17 18:17:10, Christoph Hellwig wrote:
> From: Milosz Tanski 
> 
> Allow generic_file_buffered_read to bail out early instead of waiting for
> the page lock or reading a page if IOCB_NOWAIT is specified.
> 
> Signed-off-by: Milosz Tanski 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Jeff Moyer 
> Acked-by: Sage Weil 
> ---
>  mm/filemap.c | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4bcfa74ad802..9f60255fb7bb 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1937,6 +1937,8 @@ static ssize_t generic_file_buffered_read(struct kiocb 
> *iocb,
>  
>   page = find_get_page(mapping, index);
>   if (!page) {
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + goto would_block;
>   page_cache_sync_readahead(mapping,
>   ra, filp,
>   index, last_index - index);

Hum, we have:

if (!PageUptodate(page)) {
/*
 * See comment in do_read_cache_page on why
 * wait_on_page_locked is used to avoid
 * unnecessarily
 * serialisations and why it's safe.
 */
error = wait_on_page_locked_killable(page);
if (unlikely(error))
goto readpage_error;
if (PageUptodate(page))
goto page_ok;

And wait_on_page_locked_killable() above does not seem to be handled in
your patch. I would just check IOCB_NOWAIT in !PageUptodate(page) branch
above and bail - which also removes the need for the two checks below...

Honza
-- 
Jan Kara 
SUSE Labs, CR
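
For reference, a minimal sketch of the early bail-out Jan suggests could look
like the fragment below. This is illustrative only, not the posted patch; it
reuses the variables of generic_file_buffered_read() quoted above:

	if (!PageUptodate(page)) {
		/* Bail instead of sleeping in wait_on_page_locked_killable(). */
		if (iocb->ki_flags & IOCB_NOWAIT) {
			put_page(page);
			goto would_block;
		}
		error = wait_on_page_locked_killable(page);
		if (unlikely(error))
			goto readpage_error;
		if (PageUptodate(page))
			goto page_ok;
		/* remainder of the branch unchanged */
	}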


Re: [PATCH] block: remove blk_free_devt in add_partition

2017-08-24 Thread Jens Axboe
On 08/18/2017 09:54 AM, weiping zhang wrote:
> put_device(pdev) will call pdev->type->release finally, and blk_free_devt
> has been called in part_release(), so remove it.

Your analysis looks correct to me, applied for 4.14.

-- 
Jens Axboe



Re: [PATCH 1/4] fs: pass iocb to do_generic_file_read

2017-08-24 Thread Jan Kara
On Tue 22-08-17 18:17:09, Christoph Hellwig wrote:
> And rename it to the more descriptive generic_file_buffered_read while
> at it.
> 
> Signed-off-by: Christoph Hellwig 

Looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  mm/filemap.c | 15 ---
>  1 file changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a49702445ce0..4bcfa74ad802 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1886,9 +1886,8 @@ static void shrink_readahead_size_eio(struct file *filp,
>  }
>  
>  /**
> - * do_generic_file_read - generic file read routine
> - * @filp:the file to read
> - * @ppos:current file position
> + * generic_file_buffered_read - generic file read routine
> + * @iocb:the iocb to read
>   * @iter:data destination
>   * @written: already copied
>   *
> @@ -1898,12 +1897,14 @@ static void shrink_readahead_size_eio(struct file 
> *filp,
>   * This is really ugly. But the goto's actually try to clarify some
>   * of the logic when it comes to error handling etc.
>   */
> -static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
> +static ssize_t generic_file_buffered_read(struct kiocb *iocb,
>   struct iov_iter *iter, ssize_t written)
>  {
> + struct file *filp = iocb->ki_filp;
>   struct address_space *mapping = filp->f_mapping;
>   struct inode *inode = mapping->host;
>   struct file_ra_state *ra = >f_ra;
> + loff_t *ppos = >ki_pos;
>   pgoff_t index;
>   pgoff_t last_index;
>   pgoff_t prev_index;
> @@ -2151,14 +2152,14 @@ static ssize_t do_generic_file_read(struct file 
> *filp, loff_t *ppos,
>  ssize_t
>  generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>  {
> - struct file *file = iocb->ki_filp;
> - ssize_t retval = 0;
>   size_t count = iov_iter_count(iter);
> + ssize_t retval = 0;
>  
>   if (!count)
>   goto out; /* skip atime */
>  
>   if (iocb->ki_flags & IOCB_DIRECT) {
> + struct file *file = iocb->ki_filp;
>   struct address_space *mapping = file->f_mapping;
>   struct inode *inode = mapping->host;
>   loff_t size;
> @@ -2199,7 +2200,7 @@ generic_file_read_iter(struct kiocb *iocb, struct 
> iov_iter *iter)
>   goto out;
>   }
>  
> - retval = do_generic_file_read(file, >ki_pos, iter, retval);
> + retval = generic_file_buffered_read(iocb, iter, retval);
>  out:
>   return retval;
>  }
> -- 
> 2.11.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 0/1] bsg: fix regression resulting in panics when sending commands via BSG

2017-08-24 Thread Jens Axboe
On 08/23/2017 05:57 PM, Benjamin Block wrote:
> Hello all,
> 
> This is the second try for fixing the regression in the BSG-interface that
> exists since v4.11 (for more infos see the first series).
> 
> I separated my other changes from the bug-fix so that it is easier to apply
> if judged good. I will rebase my cleanups I sent in v1 and send them when I
> get a bit more time. But the regression-fix is more important, so here's
> that.
> 
> I did some more tests on it than on v1, including some heavy parallel I/O
> on the same blk-queue using both BSG and the normal SCSI-stack at the same
> time (throwing some intentional bad commands in it too). That seemed to
> work all well enough - i.e. it didn't crash and got the expected results. I
> haven't done any external error-inject, but IMO that would be beyond the
> scope right now.
> 
> The fix is based on Christoph's idea, I discussed this with him off-list
> already.
> 
> I rebased the series on Jens' for-next.

Added for 4.13, thanks.

-- 
Jens Axboe



Re: [PATCH for-next] block: Fix left-over bi_bdev usage

2017-08-24 Thread Jens Axboe
On 08/24/2017 02:41 AM, Christoph Hellwig wrote:
> On Wed, Aug 23, 2017 at 07:45:55PM -0400, Keith Busch wrote:
>> The bi_bdev field was replaced with the gendisk. This patch just fixes
>> an omission.
> 
> Thanks Keith.  I'm pretty sure I fixed exactly this one when
> rebasing from my nvme tree to the block tree.  But I guess I didn't
> finish the rebase fully, as usual - git rebase considered harmful :)

Problem is, the issue doesn't exist in my for-4.14/block branch, it
only exists when that is merged with master. I'll apply it to my
for-next branch, even though that one is not persistent, so that
it builds.

-- 
Jens Axboe



Re: [PATCH 1/2] powerpc/workqueue: update list of possible CPUs

2017-08-24 Thread Tejun Heo
Hello, Laurent.

On Thu, Aug 24, 2017 at 02:10:31PM +0200, Laurent Vivier wrote:
> > Yeah, it just needs to match up new cpus to the cpu ids assigned to
> > the right node.
> 
> We are not able to assign the cpu ids to the right node before the CPU
> is present, because firmware doesn't provide CPU mapping <-> node id
> before that.

What I meant was to assign the matching CPU ID when the CPU becomes
present - ie. have CPU IDs available for different nodes and allocate
them to the new CPU according to its node mapping when it actually
comes up.  Please note that I'm not saying this is the way to go, just
that it is a solvable problem from the arch code.

> > The node mapping for that cpu id changes *dynamically* while the
> > system is running and that can race with node-affinity sensitive
> > operations such as memory allocations.
> 
> Memory is mapped to the node through its own firmware entry, so I don't
> think cpu id change can affect memory affinity, and before we know the
> node id of the CPU, the CPU is not present and thus it can't use memory.

The latter part isn't true.  For example, percpu memory gets allocated
for all possible CPUs according to their node affinity, so the memory
node association change which happens when the CPU comes up for the
first time can race against such allocations.  I don't know whether
that's actually problematic but we don't have *any* synchronization
around it.  If you think it's safe to have such races, please explain
why that is.

> > Please take a step back and think through the problem again.  You
> > can't bandaid it this way.
> 
> Could you give some ideas, proposals?
> As the firmware doesn't provide the information before the CPU is really
> plugged, I really don't know how to manage this problem.

There are two possible approaches, I think.

1. Make physical cpu -> logical cpu mapping indirect so that the
   kernel's cpu ID assignment is always on the right numa node.  This
   may mean that the kernel might have to keep more possible CPUs
   around than necessary but it does have the benefit that all memory
   allocations are affine to the right node.

2. Make cpu <-> node mapping properly dynamic.  Identify what sort of
   synchronization we'd need around the mapping changing dynamically.
   Note that we might not need much but it'll most likely need some.
   Build synchronization and notification infrastructure around it.

Thanks.

-- 
tejun
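
To make the race Tejun describes concrete, here is a reduced sketch of the kind
of node-affine allocation that walks possible CPUs before they are present. The
helpers are existing kernel APIs; the surrounding function is hypothetical:

	#include <linux/cpumask.h>
	#include <linux/init.h>
	#include <linux/slab.h>
	#include <linux/topology.h>

	/* Hypothetical per-CPU buffer setup: it consults cpu_to_node() for every
	 * possible CPU, including ones whose node mapping will only be known
	 * once they are hot-added, so a later change of that mapping races with
	 * these allocations. */
	static void *pcpu_buf[NR_CPUS];

	static int __init alloc_pcpu_buffers(void)
	{
		int cpu;

		for_each_possible_cpu(cpu)
			pcpu_buf[cpu] = kzalloc_node(PAGE_SIZE, GFP_KERNEL,
						     cpu_to_node(cpu));
		return 0;
	}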


Re: [PATCH v2 1/1] bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer

2017-08-24 Thread Benjamin Block
On Thu, Aug 24, 2017 at 10:45:56AM +0200, Christoph Hellwig wrote:
> >  /**
> > - * bsg_destroy_job - routine to teardown/delete a bsg job
> > + * bsg_teardown_job - routine to teardown a bsg job
> >   * @job: bsg_job that is to be torn down
> >   */
> > -static void bsg_destroy_job(struct kref *kref)
> > +static void bsg_teardown_job(struct kref *kref)
> 
> Why this rename?  The destroy name seems to be one of the most
> common patterns for the kref_put callbacks.
>

Hmm, I did it mostly so it is symmetric with bsg_prepare_job(), and it
doesn't really destroy the job-struct itself anymore. If there are other
things amiss I can change that along with them, if it bothers people.


Beste Grüße / Best regards,
  - Benjamin Block

> 
> Otherwise this looks fine:
> 
> Reviewed-by: Christoph Hellwig 
> 

-- 
Linux on z Systems Development / IBM Systems & Technology Group
  IBM Deutschland Research & Development GmbH 
Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294



Re: [PATCH 4/4] loop: fold loop_switch() into callers

2017-08-24 Thread Hannes Reinecke
On 08/24/2017 09:03 AM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> The comments here are really outdated, and blk-mq made flushing much
> simpler, so just fold the two cases into the callers.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  drivers/block/loop.c | 76 
> 
>  1 file changed, 11 insertions(+), 65 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 3/4] loop: add ioctl for changing logical block size

2017-08-24 Thread Hannes Reinecke
On 08/24/2017 09:03 AM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> This is a different approach from the first attempt in f2c6df7dbf9a
> ("loop: support 4k physical blocksize"). Rather than extending
> LOOP_{GET,SET}_STATUS, add a separate ioctl just for setting the block
> size.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  drivers/block/loop.c  | 24 
>  include/uapi/linux/loop.h |  1 +
>  2 files changed, 25 insertions(+)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 1a5b4ecf54ec..6b1d932028e3 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1047,6 +1047,7 @@ static int loop_clr_fd(struct loop_device *lo)
>   memset(lo->lo_encrypt_key, 0, LO_KEY_SIZE);
>   memset(lo->lo_crypt_name, 0, LO_NAME_SIZE);
>   memset(lo->lo_file_name, 0, LO_NAME_SIZE);
> + blk_queue_logical_block_size(lo->lo_queue, 512);
>   if (bdev) {
>   bdput(bdev);
>   invalidate_bdev(bdev);
> @@ -1330,6 +1331,24 @@ static int loop_set_dio(struct loop_device *lo, 
> unsigned long arg)
>   return error;
>  }
>  
> +static int loop_set_block_size(struct loop_device *lo, unsigned long arg)
> +{
> + if (lo->lo_state != Lo_bound)
> + return -ENXIO;
> +
> + if (arg < 512 || arg > PAGE_SIZE || !is_power_of_2(arg))
> + return -EINVAL;
> +
> + blk_mq_freeze_queue(lo->lo_queue);
> +
> + blk_queue_logical_block_size(lo->lo_queue, arg);
> + loop_update_dio(lo);
> +
> + blk_mq_unfreeze_queue(lo->lo_queue);
> +
> + return 0;
> +}
> +
>  static int lo_ioctl(struct block_device *bdev, fmode_t mode,
>   unsigned int cmd, unsigned long arg)
>  {
> @@ -1378,6 +1397,11 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
> mode,
>   if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
>   err = loop_set_dio(lo, arg);
>   break;
> + case LOOP_SET_BLOCK_SIZE:
> + err = -EPERM;
> + if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
> + err = loop_set_block_size(lo, arg);
> + break;
>   default:
>   err = lo->ioctl ? lo->ioctl(lo, cmd, arg) : -EINVAL;
>   }
> diff --git a/include/uapi/linux/loop.h b/include/uapi/linux/loop.h
> index c8125ec1f4f2..23158dbe2424 100644
> --- a/include/uapi/linux/loop.h
> +++ b/include/uapi/linux/loop.h
> @@ -88,6 +88,7 @@ struct loop_info64 {
>  #define LOOP_CHANGE_FD   0x4C06
>  #define LOOP_SET_CAPACITY0x4C07
>  #define LOOP_SET_DIRECT_IO   0x4C08
> +#define LOOP_SET_BLOCK_SIZE  0x4C09
>  
>  /* /dev/loop-control interface */
>  #define LOOP_CTL_ADD 0x4C80
> 
Hmm.

If the losetup / util-linux maintainers are on board with this, okay.

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 2/4] loop: set physical block size to PAGE_SIZE

2017-08-24 Thread Hannes Reinecke
On 08/24/2017 09:03 AM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> The physical block size is "the lowest possible sector size that the
> hardware can operate on without reverting to read-modify-write
> operations" (from the comment on blk_queue_physical_block_size()). Since
> loop does buffered I/O on the backing file by default, the RMW unit is a
> page. This isn't the case for direct I/O mode, but let's keep it simple.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  drivers/block/loop.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 54e091887199..1a5b4ecf54ec 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1764,6 +1764,8 @@ static int loop_add(struct loop_device **l, int i)
>   }
>   lo->lo_queue->queuedata = lo;
>  
> + blk_queue_physical_block_size(lo->lo_queue, PAGE_SIZE);
> +
>   /*
>* It doesn't make sense to enable merge because the I/O
>* submitted to backing file is handled page by page.
> 
Let's see it this one goes through ...

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 1/4] loop: get rid of lo_blocksize

2017-08-24 Thread Hannes Reinecke
On 08/24/2017 09:03 AM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> This is only used for setting the soft block size on the struct
> block_device once and then never used again.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  drivers/block/loop.c | 10 ++
>  drivers/block/loop.h |  1 -
>  2 files changed, 2 insertions(+), 9 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 1/2] powerpc/workqueue: update list of possible CPUs

2017-08-24 Thread Laurent Vivier
On 23/08/2017 15:26, Tejun Heo wrote:
> Hello, Michael.
> 
> On Wed, Aug 23, 2017 at 09:00:39PM +1000, Michael Ellerman wrote:
>>> I don't think that's true.  The CPU id used in kernel doesn't have to
>>> match the physical one and arch code should be able to pre-map CPU IDs
>>> to nodes and use the matching one when hotplugging CPUs.  I'm not
>>> saying that's the best way to solve the problem tho.
>>
>> We already virtualise the CPU numbers, but not the node IDs. And it's
>> the node IDs that are really the problem.
> 
> Yeah, it just needs to match up new cpus to the cpu ids assigned to
> the right node.

We are not able to assign the cpu ids to the right node before the CPU
is present, because the firmware doesn't provide the CPU <-> node id
mapping before that.

>>> It could be that the best way forward is making cpu <-> node mapping
>>> dynamic and properly synchronized.
>>
>> We don't need it to be dynamic (at least for this bug).
> 
> The node mapping for that cpu id changes *dynamically* while the
> system is running and that can race with node-affinity sensitive
> operations such as memory allocations.

Memory is mapped to the node through its own firmware entry, so I don't
think a cpu id change can affect memory affinity, and before we know the
node id of the CPU, the CPU is not present and thus it can't use memory.

>> Laurent is booting Qemu with a fixed CPU <-> Node mapping, it's just
>> that because some CPUs aren't present at boot we don't know what the
>> node mapping is. (Correct me if I'm wrong Laurent).
>>
>> So all we need is:
>>  - the workqueue code to cope with CPUs that are possible but not online
>>having NUMA_NO_NODE to begin with.
>>  - a way to update the workqueue cpumask when the CPU comes online.
>>
>> Which seems reasonable to me?
> 
> Please take a step back and think through the problem again.  You
> can't bandaid it this way.

Could you give some ideas or proposals?
As the firmware doesn't provide the information before the CPU is actually
plugged in, I really don't know how to handle this problem.

Thanks,
Laurent



Re: [PATCH] block: remove blk_free_devt in add_partition

2017-08-24 Thread weiping zhang
On Fri, Aug 18, 2017 at 11:54:45PM +0800, weiping zhang wrote:
> put_device(pdev) will eventually call pdev->type->release, and blk_free_devt
> has already been called in part_release(), so remove it.
> 
> Signed-off-by: weiping zhang 
> ---
>  block/partition-generic.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/block/partition-generic.c b/block/partition-generic.c
> index c5ec824..04d82dd 100644
> --- a/block/partition-generic.c
> +++ b/block/partition-generic.c
> @@ -391,7 +391,6 @@ struct hd_struct *add_partition(struct gendisk *disk, int 
> partno,
>   device_del(pdev);
>  out_put:
>   put_device(pdev);
> - blk_free_devt(devt);
>   return ERR_PTR(err);
>  }
>  
> -- 
> 2.9.4
> 

Hi Jens,

would you please take a look at this trivial patch at your convenience?

Thanks


[bug report] skd: Avoid that module unloading triggers a use-after-free

2017-08-24 Thread Dan Carpenter
Hello Bart Van Assche,

This is a semi-automatic email about new static checker warnings.

The patch 7277cc67b391: "skd: Avoid that module unloading triggers a 
use-after-free" from Aug 17, 2017, leads to the following Smatch 
complaint:

drivers/block/skd_main.c:3080 skd_free_disk()
 error: we previously assumed 'disk' could be null (see line 3074)

drivers/block/skd_main.c
  3073  
  3074  if (disk && (disk->flags & GENHD_FL_UP))

Existing code checked for NULL.  The new code shuffles things around.

  3075  del_gendisk(disk);
  3076  
  3077  if (skdev->queue) {
  3078  blk_cleanup_queue(skdev->queue);
  3079  skdev->queue = NULL;
  3080  disk->queue = NULL;
^^^
Now we don't check here.

  3081  }
  3082  
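Not part of the automatic report, but one way to keep the checks consistent
would be something like the following (just a sketch, assuming disk can
indeed still be NULL at this point):

	if (skdev->queue) {
		blk_cleanup_queue(skdev->queue);
		skdev->queue = NULL;
-		disk->queue = NULL;
+		if (disk)
+			disk->queue = NULL;
	}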

regards,
dan carpenter


Re: [PATCH 2/6] raid5: remove a call to get_start_sect

2017-08-24 Thread Christoph Hellwig
On Wed, Aug 23, 2017 at 11:23:38AM -0700, Shaohua Li wrote:
> On Wed, Aug 23, 2017 at 07:10:28PM +0200, Christoph Hellwig wrote:
> > The block layer always remaps partitions before calling into the
> > ->make_request methods of drivers.  Thus the call to get_start_sect in
> > in_chunk_boundary will always return 0 and can be removed.
> > 
> > Signed-off-by: Christoph Hellwig 
> > ---
> >  drivers/md/raid5.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index 0fc2748aaf95..d687aeb1b538 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -5092,10 +5092,12 @@ static int raid5_congested(struct mddev *mddev, int 
> > bits)
> >  static int in_chunk_boundary(struct mddev *mddev, struct bio *bio)
> >  {
> > struct r5conf *conf = mddev->private;
> > -   sector_t sector = bio->bi_iter.bi_sector + get_start_sect(bio->bi_bdev);
> > +   sector_t sector = bio->bi_iter.bi_sector;
> > unsigned int chunk_sectors;
> > unsigned int bio_sectors = bio_sectors(bio);
> >  
> > +   WARN_ON_ONCE(bio->bi_partno);
> > +

Meh, of course bi_partno is only added a few patches later, so this
breaks bisectability.  But given that the patch is already in, there's
probably nothing I can do here.


Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems

2017-08-24 Thread h...@lst.de
On Wed, Aug 23, 2017 at 06:21:55PM +, Bart Van Assche wrote:
> On Wed, 2017-08-23 at 19:58 +0200, Christoph Hellwig wrote:
> > +static blk_qc_t nvme_make_request(struct request_queue *q, struct bio *bio)
> > +{
> > +   struct nvme_ns_head *head = q->queuedata;
> > +   struct nvme_ns *ns;
> > +   blk_qc_t ret = BLK_QC_T_NONE;
> > +   int srcu_idx;
> > +
> > +   srcu_idx = srcu_read_lock(&head->srcu);
> > +   ns = srcu_dereference(head->current_path, &head->srcu);
> > +   if (unlikely(!ns || ns->ctrl->state != NVME_CTRL_LIVE))
> > +   ns = nvme_find_path(head);
> > +   if (likely(ns)) {
> > +   bio->bi_disk = ns->disk;
> > +   bio->bi_opf |= REQ_FAILFAST_TRANSPORT;
> > +   ret = generic_make_request_fast(bio);
> > +   } else if (!list_empty_careful(&head->list)) {
> > +   printk_ratelimited("no path available - requeing I/O\n");
> > +
> > +   spin_lock_irq(&head->requeue_lock);
> > +   bio_list_add(&head->requeue_list, bio);
> > +   spin_unlock_irq(&head->requeue_lock);
> > +   } else {
> > +   printk_ratelimited("no path - failing I/O\n");
> > +
> > +   bio->bi_status = BLK_STS_IOERR;
> > +   bio_endio(bio);
> > +   }
> > +
> > +   srcu_read_unlock(&head->srcu, srcu_idx);
> > +   return ret;
> > +}
> 
> Hello Christoph,
> 
> Since generic_make_request_fast() returns BLK_STS_AGAIN for a dying path:
> can the same kind of soft lockups occur with the NVMe multipathing code as
> with the current upstream device mapper multipathing code? See e.g.
> "[PATCH 3/7] dm-mpath: Do not lock up a CPU with requeuing activity"
> (https://www.redhat.com/archives/dm-devel/2017-August/msg00124.html).

I suspect the code is not going to hit it because we check the controller
state before trying to queue I/O on the lower queue.  But if you point
me to a good reproducer test case I'd like to check.

Also does the "single queue" case in your mail refer to the old
request code?  nvme only uses blk-mq so it would not hit that.
But either way I think get_request should be fixed to return
BLK_STS_IOERR if the queue is dying instead of BLK_STS_AGAIN.

> Another question about this code is what will happen if
> generic_make_request_fast() returns BLK_STS_AGAIN and the submit_bio() or
> generic_make_request() caller ignores the return value of the called
> function? A quick grep revealed that there is plenty of code that ignores
> the return value of these last two functions.

generic_make_request and generic_make_request_fast only return
the polling cookie (blk_qc_t), not a block status.  Note that we do
not use blk_get_request / blk_mq_alloc_request for the request allocation
of the request on the lower device, so unless the caller passed REQ_NOWAIT
and is able to handle BLK_STS_AGAIN we won't ever return it.


Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems

2017-08-24 Thread Christoph Hellwig
On Wed, Aug 23, 2017 at 06:53:00PM -0400, Keith Busch wrote:
> On Wed, Aug 23, 2017 at 07:58:15PM +0200, Christoph Hellwig wrote:
> > 
> > TODO: implement sysfs interfaces for the new subsystem and
> > subsystem-namespace object.  Unless we can come up with something
> > better than sysfs here..
> 
> Can we get symlinks from the multipath'ed nvme block device to the
> individual paths? I think it should be something like the following:
>  
> > @@ -2616,6 +2821,10 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, 
> > unsigned nsid)
> > if (ns->ndev && nvme_nvm_register_sysfs(ns))
> > pr_warn("%s: failed to register lightnvm sysfs group for 
> > identification\n",
> > ns->disk->disk_name);
> > +
> > +   if (new)
> > +   add_disk(ns->head->disk);
> +
> + sysfs_create_link(&disk_to_dev(ns->head->disk)->kobj,
> +   &disk_to_dev(ns->disk)->kobj,
> +   ns->disk->name);

Yeah, probably..


Re: [PATCH 06/10] nvme: track subsystems

2017-08-24 Thread Christoph Hellwig
On Wed, Aug 23, 2017 at 06:04:34PM -0400, Keith Busch wrote:
> > +   struct nvme_subsystem *subsys;
> > +
> > +   lockdep_assert_held(&nvme_subsystems_lock);
> > +
> > +   list_for_each_entry(subsys, &nvme_subsystems, entry) {
> > +   if (strcmp(subsys->subnqn, subsysnqn))
> > +   continue;
> > +   if (!kref_get_unless_zero(&subsys->ref))
> > +   continue;
> 
> You should be able to just return immediately here since there can't be
> a duplicated subsysnqn in the list.

We could have a race where we tear one subsystem instance down and
are setting up another one.

> > +   INIT_LIST_HEAD(&subsys->ctrls);
> > +   kref_init(&subsys->ref);
> > +   nvme_init_subnqn(subsys, ctrl, id);
> > +   mutex_init(&subsys->lock);
> > +
> > +   mutex_lock(&nvme_subsystems_lock);
> 
> This could be a spinlock instead of a mutex.

We could.  But given that the lock is not in the hot path there seems
to be no point in making it a spinlock.

> > +   found = __nvme_find_get_subsystem(subsys->subnqn);
> > +   if (found) {
> > +   /*
> > +* Verify that the subsystem actually supports multiple
> > +* controllers, else bail out.
> > +*/
> > +   kfree(subsys);
> > +   if (!(id->cmic & (1 << 1))) {
> > +   dev_err(ctrl->device,
> > +   "ignoring ctrl due to duplicate subnqn (%s).\n",
> > +   found->subnqn);
> > +   mutex_unlock(&nvme_subsystems_lock);
> > +   return -EINVAL;
> 
> Returning -EINVAL here will cause nvme_init_identify to fail. Do we want
> that to happen here? I think we want to be able to manage controllers
> in such a state, but just checking if there's a good reason to not allow
> them.

Without this we will get duplicate nvme_subsystem structures, messing
up the whole lookup.  We could mark them as buggy with a flag and
make sure controllers without CMIC bit 1 set will never be linked.


Re: [PATCH v2 1/1] bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer

2017-08-24 Thread Christoph Hellwig
>  /**
> - * bsg_destroy_job - routine to teardown/delete a bsg job
> + * bsg_teardown_job - routine to teardown a bsg job
>   * @job: bsg_job that is to be torn down
>   */
> -static void bsg_destroy_job(struct kref *kref)
> +static void bsg_teardown_job(struct kref *kref)

Why this rename?  The destroy name seems to be one of the most
common patterns for the kref_put callbacks.

Otherwise this looks fine:

Reviewed-by: Christoph Hellwig 


Re: [PATCH for-next] block: Fix left-over bi_bdev usage

2017-08-24 Thread Christoph Hellwig
On Wed, Aug 23, 2017 at 07:45:55PM -0400, Keith Busch wrote:
> The bi_bdev field was replaced with the gendisk. This patch just fixes
> an omission.

Thanks Keith.  I'm pretty sure I fixed exactly this one when
rebasing from my nvme tree to the block tree.  But I guess I didn't
finish the rebase fully, as usual - git rebase considered harmful :)


Acked-by: Christoph Hellwig 


Re: [PATCH V2 11/20] blk-mq: introduce helpers for operating ->dispatch list

2017-08-24 Thread Damien Le Moal


On 8/24/17 16:10, Ming Lei wrote:
> On Thu, Aug 24, 2017 at 09:59:13AM +0900, Damien Le Moal wrote:
>>
>> On 8/23/17 05:43, Bart Van Assche wrote:
>>> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
>>>> +static inline bool blk_mq_has_dispatch_rqs(struct blk_mq_hw_ctx *hctx)
>>>> +{
>>>> +	return !list_empty_careful(&hctx->dispatch);
>>>> +}
>>>> +
>>>> +static inline void blk_mq_add_rq_to_dispatch(struct blk_mq_hw_ctx *hctx,
>>>> +	struct request *rq)
>>>> +{
>>>> +	spin_lock(&hctx->lock);
>>>> +	list_add(&rq->queuelist, &hctx->dispatch);
>>>> +	blk_mq_hctx_set_dispatch_busy(hctx);
>>>> +	spin_unlock(&hctx->lock);
>>>> +}
>>>> +
>>>> +static inline void blk_mq_add_list_to_dispatch(struct blk_mq_hw_ctx *hctx,
>>>> +	struct list_head *list)
>>>> +{
>>>> +	spin_lock(&hctx->lock);
>>>> +	list_splice_init(list, &hctx->dispatch);
>>>> +	blk_mq_hctx_set_dispatch_busy(hctx);
>>>> +	spin_unlock(&hctx->lock);
>>>> +}
>>>> +
>>>> +static inline void blk_mq_add_list_to_dispatch_tail(struct blk_mq_hw_ctx *hctx,
>>>> +	struct list_head *list)
>>>> +{
>>>> +	spin_lock(&hctx->lock);
>>>> +	list_splice_tail_init(list, &hctx->dispatch);
>>>> +	blk_mq_hctx_set_dispatch_busy(hctx);
>>>> +	spin_unlock(&hctx->lock);
>>>> +}
>>>> +
>>>> +static inline void blk_mq_take_list_from_dispatch(struct blk_mq_hw_ctx *hctx,
>>>> +	struct list_head *list)
>>>> +{
>>>> +	spin_lock(&hctx->lock);
>>>> +	list_splice_init(&hctx->dispatch, list);
>>>> +	spin_unlock(&hctx->lock);
>>>> +}
>>>
>>> Same comment for this patch: these helper functions are so short that I'm 
>>> not
>>> sure it is useful to introduce these helper functions.
>>>
>>> Bart.
>>
>> Personally, I like those very much as they give a place to hook up
>> different dispatch_list handling without having to change blk-mq.c and
>> blk-mq-sched.c all over the place.
>>
I am thinking of SMR (zoned block device) support here since we need to
sort-insert write requests in blk_mq_add_rq_to_dispatch() and
>> blk_mq_add_list_to_dispatch_tail(). For this last one, the name would
> 
> Could you explain a bit why you sort-insert write requests in these two
> helpers, which are only triggered in the dispatch-busy case?

For host-managed zoned block devices (SMR disks), writes have to be
sequential per zone of the device, otherwise an "unaligned write error"
is returned by the device. This constraint is exposed to the user (FS,
device mapper or applications doing raw I/Os to the block device). So a
well behaved user (f2fs and dm-zoned in the kernel) will issue
sequential writes to zones (this includes write same and write zeroes).
However, the current blk-mq code does not guarantee that this
issuing order will be respected at dispatch time. There are multiple
reasons:

1) On submit, requests first go into per CPU context list. So if the
issuer thread is moved from one CPU to another in the middle of a
sequential write request sequence, the sequence would be split across
different lists. When a scheduler is used, this can get fixed by a nice
LBA ordering (mq-deadline would do that), but in the "none" scheduler
case, the requests in the CPU lists are merged into the hctx dispatch
queue in CPU number order, potentially reordering the write sequence.
Hence the sort to fix that.

2) Dispatch may happen in parallel with multiple contexts getting
requests from the scheduler (or the CPU context lists) and again break
write sequences. I think your patch partly addresses this with the BUSY
flag, but when requests are in the dispatch queue already. However, at
dispatch time, requests go directly from the scheduler (or CPU context
lists) into the dispatch caller local list, bypassing the hctx dispatch
queue. Write ordering gets broken again here.

3) SCSI layer locks a device zone when it sees a write to prevent
subsequent write to the same zone. This is to avoid reordering at the
HBA level (AHCI has a 100% chance of doing it). In effect, this is a
write QD=1 per zone implementation, and that causes requeue of write
requests. In that case, this is a requeue happening after ->queue_rq()
call, so using the dispatch context local list. Combine that with (2)
and reordering can happen here too.

I have a series implementing a "dispatcher", which is basically a set of
operations that a device driver can specify to handle the dispatch
queue. The operations are basically "insert" and "get", so what you have
as helpers. The default dispatcher uses operations very similar to your
helpers minus the busy bit. That is, the default dispatcher does not
change the current behavior. But a ZBC dispatcher can be installed by
the scsi layer for a ZBC drive at device scan time and do the sort
insert of write requests to address all the potential reordering.

That series goes beyond what you did, and I was waiting for your series
to stabilize to rebase (or rework) what I did on top of your changes,
which look like they would simplify handling of ZBC drives.
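To make the sort insert concrete, here is a rough sketch of what a ZBC
insert hook could look like (the function name is mine, and ordering by
LBA globally is a simplification -- a real version would only order
requests that target the same zone):

/*
 * Sketch only: insert rq into hctx->dispatch so that queued writes stay
 * in ascending LBA order and sequential write sequences are preserved.
 */
static void zbc_add_rq_to_dispatch(struct blk_mq_hw_ctx *hctx,
				   struct request *rq)
{
	struct list_head *insert = &hctx->dispatch;	/* default: tail */
	struct request *pos;

	spin_lock(&hctx->lock);
	if (op_is_write(req_op(rq))) {
		list_for_each_entry(pos, &hctx->dispatch, queuelist) {
			if (op_is_write(req_op(pos)) &&
			    blk_rq_pos(pos) > blk_rq_pos(rq)) {
				insert = &pos->queuelist;
				break;
			}
		}
	}
	/* list_add_tail() on "insert" places rq just before that node */
	list_add_tail(&rq->queuelist, insert);
	spin_unlock(&hctx->lock);
}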

Note: ZBC drives are 

Re: [PATCH V2 14/20] blk-mq-sched: improve IO scheduling on SCSI devcie

2017-08-24 Thread Ming Lei
On Tue, Aug 22, 2017 at 08:51:32PM +, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > This patch introduces a per-request_queue dispatch
> > list for this purpose, and only when all requests
> > in this list are dispatched out successfully can we
> > restart dequeuing requests from the sw/scheduler
> > queue and dispatching them to the lld.
> 
> Wasn't one of the design goals of blk-mq to avoid a per-request_queue list?

Yes, but why does a SCSI device have a per-request_queue depth?

Anyway, this single dispatch list won't hurt devices which don't have
a per-request_queue depth.

For a SCSI device which has q->queue_depth, if one hctx is stuck, all
hctxs are actually stuck, so the single dispatch list just fits this
kind of use case, doesn't it?

-- 
Ming


Re: [PATCH V2 11/20] blk-mq: introduce helpers for operating ->dispatch list

2017-08-24 Thread Ming Lei
On Thu, Aug 24, 2017 at 09:59:13AM +0900, Damien Le Moal wrote:
> 
> On 8/23/17 05:43, Bart Van Assche wrote:
> > On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> >> +static inline bool blk_mq_has_dispatch_rqs(struct blk_mq_hw_ctx *hctx)
> >> +{
> >> +	return !list_empty_careful(&hctx->dispatch);
> >> +}
> >> +
> >> +static inline void blk_mq_add_rq_to_dispatch(struct blk_mq_hw_ctx *hctx,
> >> +	struct request *rq)
> >> +{
> >> +	spin_lock(&hctx->lock);
> >> +	list_add(&rq->queuelist, &hctx->dispatch);
> >> +	blk_mq_hctx_set_dispatch_busy(hctx);
> >> +	spin_unlock(&hctx->lock);
> >> +}
> >> +
> >> +static inline void blk_mq_add_list_to_dispatch(struct blk_mq_hw_ctx *hctx,
> >> +	struct list_head *list)
> >> +{
> >> +	spin_lock(&hctx->lock);
> >> +	list_splice_init(list, &hctx->dispatch);
> >> +	blk_mq_hctx_set_dispatch_busy(hctx);
> >> +	spin_unlock(&hctx->lock);
> >> +}
> >> +
> >> +static inline void blk_mq_add_list_to_dispatch_tail(struct blk_mq_hw_ctx *hctx,
> >> +	struct list_head *list)
> >> +{
> >> +	spin_lock(&hctx->lock);
> >> +	list_splice_tail_init(list, &hctx->dispatch);
> >> +	blk_mq_hctx_set_dispatch_busy(hctx);
> >> +	spin_unlock(&hctx->lock);
> >> +}
> >> +
> >> +static inline void blk_mq_take_list_from_dispatch(struct blk_mq_hw_ctx *hctx,
> >> +	struct list_head *list)
> >> +{
> >> +	spin_lock(&hctx->lock);
> >> +	list_splice_init(&hctx->dispatch, list);
> >> +	spin_unlock(&hctx->lock);
> >> +}
> > 
> > Same comment for this patch: these helper functions are so short that I'm 
> > not
> > sure it is useful to introduce these helper functions.
> > 
> > Bart.
> 
> Personally, I like those very much as they give a place to hook up
> different dispatch_list handling without having to change blk-mq.c and
> blk-mq-sched.c all over the place.
> 
> I am thinking of SMR (zoned block device) support here since we need to
> sort-insert write requests in blk_mq_add_rq_to_dispatch() and
> blk_mq_add_list_to_dispatch_tail(). For this last one, the name would

Could you explain a bit why you sort-insert write requests in these two
helpers, which are only triggered in the dispatch-busy case?

> become a little awkward though. This sort insert would be to avoid
> breaking a sequential write request sequence sent by the disk user. This
> is needed since this reordering breakage cannot be solved only from the
> SCSI layer.

Basically this patchset tries to flush hctx->dispatch first, then
fetch requests from the scheduler queue after hctx->dispatch is flushed.

It isn't strictly done this way because the implementation is lockless.

So it looks like requests from hctx->dispatch won't break requests
coming from the scheduler.
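Roughly, the intended ordering is the following (just a sketch to
illustrate the idea, not the actual patch code; the scheduler-side helper
name is a placeholder):

/* Sketch: ->dispatch is flushed before anything new is dequeued. */
static void dispatch_order_sketch(struct request_queue *q,
				  struct blk_mq_hw_ctx *hctx)
{
	LIST_HEAD(rq_list);

	if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state)) {
		/* old requests first, don't touch the sw/scheduler queues */
		spin_lock(&hctx->lock);
		list_splice_init(&hctx->dispatch, &rq_list);
		spin_unlock(&hctx->lock);

		blk_mq_dispatch_rq_list(q, &rq_list);
		if (list_empty_careful(&hctx->dispatch))
			clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
		return;
	}

	/* only now is it safe to dequeue new requests */
	dequeue_from_sw_or_scheduler_queue(q, hctx);	/* placeholder */
}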


-- 
Ming


[PATCH 0/4] loop: support different logical block sizes

2017-08-24 Thread Omar Sandoval
From: Omar Sandoval 

Hi, everyone,

Here's another attempt at supporting different logical block sizes for
loop devices. First, some terminology:

 * Logical block size: the lowest possible block size that the storage
   device can address.
 * Physical block size: the lowest possible sector size that the
   hardware can operate on without reverting to read-modify-write
   operations. Always greater than or equal to the logical block size.

These are characteristics of the device itself tracked in the
request_queue. For loop devices, both are currently always 512 bytes.

struct block_device also has another block size, basically a "soft"
block size which is the preferred I/O size and can be changed from
userspace. For loop devices, this is PAGE_SIZE if the backing file is a
regular file or otherwise the soft block size of the backing block
device.
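For reference, all three values can already be read from userspace with
the standard block ioctls; a quick sketch (nothing new from this series):

#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd, lbs = 0, sbs = 0;
	unsigned int pbs = 0;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;
	ioctl(fd, BLKSSZGET, &lbs);	/* logical block size */
	ioctl(fd, BLKPBSZGET, &pbs);	/* physical block size */
	ioctl(fd, BLKBSZGET, &sbs);	/* soft block size */
	printf("logical=%d physical=%u soft=%d\n", lbs, pbs, sbs);
	close(fd);
	return 0;
}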

Patch 1 simplifies how the soft block size is set. I'm tempted to not
set it at all and use the default of PAGE_SIZE, but maybe someone is
depending on inheriting the soft block size from the backing block
device...

Patch 2 sets the physical block size of loop devices to a more
meaningful value for loop, PAGE_SIZE.

Patch 3 allows us to change the logical block size. It adds a new ioctl
instead of the previous approach (which extends LOOP_{GET,SET}_STATUS).
Hannes, I assume you had a specific reason for doing it the way you did,
care to explain?

Patch 4 is a cleanup.

This is based on my patch reverting the original block size support,
targeting 4.14. Thanks!

Omar Sandoval (4):
  loop: get rid of lo_blocksize
  loop: set physical block size to PAGE_SIZE
  loop: add ioctl for changing logical block size
  loop: fold loop_switch() into callers

 drivers/block/loop.c  | 112 --
 drivers/block/loop.h  |   1 -
 include/uapi/linux/loop.h |   1 +
 3 files changed, 40 insertions(+), 74 deletions(-)

-- 
2.14.1



[PATCH 1/4] loop: get rid of lo_blocksize

2017-08-24 Thread Omar Sandoval
From: Omar Sandoval 

This is only used for setting the soft block size on the struct
block_device once and then never used again.

Signed-off-by: Omar Sandoval 
---
 drivers/block/loop.c | 10 ++
 drivers/block/loop.h |  1 -
 2 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index f321b96405f5..54e091887199 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -573,8 +573,6 @@ static void do_loop_switch(struct loop_device *lo, struct 
switch_request *p)
mapping = file->f_mapping;
mapping_set_gfp_mask(old_file->f_mapping, lo->old_gfp_mask);
lo->lo_backing_file = file;
-   lo->lo_blocksize = S_ISBLK(mapping->host->i_mode) ?
-   mapping->host->i_bdev->bd_block_size : PAGE_SIZE;
lo->old_gfp_mask = mapping_gfp_mask(mapping);
mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
loop_update_dio(lo);
@@ -867,7 +865,6 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
struct file *file, *f;
struct inode*inode;
struct address_space *mapping;
-   unsigned lo_blocksize;
int lo_flags = 0;
int error;
loff_t  size;
@@ -911,9 +908,6 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
!file->f_op->write_iter)
lo_flags |= LO_FLAGS_READ_ONLY;
 
-   lo_blocksize = S_ISBLK(inode->i_mode) ?
-   inode->i_bdev->bd_block_size : PAGE_SIZE;
-
error = -EFBIG;
size = get_loop_size(lo, file);
if ((loff_t)(sector_t)size != size)
@@ -927,7 +921,6 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
set_device_ro(bdev, (lo_flags & LO_FLAGS_READ_ONLY) != 0);
 
lo->use_dio = false;
-   lo->lo_blocksize = lo_blocksize;
lo->lo_device = bdev;
lo->lo_flags = lo_flags;
lo->lo_backing_file = file;
@@ -947,7 +940,8 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
/* let user-space know about the new size */
kobject_uevent(&disk_to_dev(bdev->bd_disk)->kobj, KOBJ_CHANGE);
 
-   set_blocksize(bdev, lo_blocksize);
+   set_blocksize(bdev, S_ISBLK(inode->i_mode) ?
+ block_size(inode->i_bdev) : PAGE_SIZE);
 
lo->lo_state = Lo_bound;
if (part_shift)
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index fecd3f97ef8c..efe57189d01e 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -48,7 +48,6 @@ struct loop_device {
 
struct file *   lo_backing_file;
struct block_device *lo_device;
-   unsignedlo_blocksize;
void*key_data; 
 
gfp_t   old_gfp_mask;
-- 
2.14.1



[PATCH 2/4] loop: set physical block size to PAGE_SIZE

2017-08-24 Thread Omar Sandoval
From: Omar Sandoval 

The physical block size is "the lowest possible sector size that the
hardware can operate on without reverting to read-modify-write
operations" (from the comment on blk_queue_physical_block_size()). Since
loop does buffered I/O on the backing file by default, the RMW unit is a
page. This isn't the case for direct I/O mode, but let's keep it simple.

Signed-off-by: Omar Sandoval 
---
 drivers/block/loop.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 54e091887199..1a5b4ecf54ec 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1764,6 +1764,8 @@ static int loop_add(struct loop_device **l, int i)
}
lo->lo_queue->queuedata = lo;
 
+   blk_queue_physical_block_size(lo->lo_queue, PAGE_SIZE);
+
/*
 * It doesn't make sense to enable merge because the I/O
 * submitted to backing file is handled page by page.
-- 
2.14.1



[PATCH 4/4] loop: fold loop_switch() into callers

2017-08-24 Thread Omar Sandoval
From: Omar Sandoval 

The comments here are really outdated, and blk-mq made flushing much
simpler, so just fold the two cases into the callers.

Signed-off-by: Omar Sandoval 
---
 drivers/block/loop.c | 76 
 1 file changed, 11 insertions(+), 65 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 6b1d932028e3..e60b4ac77577 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -546,72 +546,12 @@ static int do_req_filebacked(struct loop_device *lo, 
struct request *rq)
}
 }
 
-struct switch_request {
-   struct file *file;
-   struct completion wait;
-};
-
 static inline void loop_update_dio(struct loop_device *lo)
 {
__loop_update_dio(lo, io_is_direct(lo->lo_backing_file) |
lo->use_dio);
 }
 
-/*
- * Do the actual switch; called from the BIO completion routine
- */
-static void do_loop_switch(struct loop_device *lo, struct switch_request *p)
-{
-   struct file *file = p->file;
-   struct file *old_file = lo->lo_backing_file;
-   struct address_space *mapping;
-
-   /* if no new file, only flush of queued bios requested */
-   if (!file)
-   return;
-
-   mapping = file->f_mapping;
-   mapping_set_gfp_mask(old_file->f_mapping, lo->old_gfp_mask);
-   lo->lo_backing_file = file;
-   lo->old_gfp_mask = mapping_gfp_mask(mapping);
-   mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
-   loop_update_dio(lo);
-}
-
-/*
- * loop_switch performs the hard work of switching a backing store.
- * First it needs to flush existing IO, it does this by sending a magic
- * BIO down the pipe. The completion of this BIO does the actual switch.
- */
-static int loop_switch(struct loop_device *lo, struct file *file)
-{
-   struct switch_request w;
-
-   w.file = file;
-
-   /* freeze queue and wait for completion of scheduled requests */
-   blk_mq_freeze_queue(lo->lo_queue);
-
-   /* do the switch action */
-   do_loop_switch(lo, &w);
-
-   /* unfreeze */
-   blk_mq_unfreeze_queue(lo->lo_queue);
-
-   return 0;
-}
-
-/*
- * Helper to flush the IOs in loop, but keeping loop thread running
- */
-static int loop_flush(struct loop_device *lo)
-{
-   /* loop not yet configured, no running thread, nothing to flush */
-   if (lo->lo_state != Lo_bound)
-   return 0;
-   return loop_switch(lo, NULL);
-}
-
 static void loop_reread_partitions(struct loop_device *lo,
   struct block_device *bdev)
 {
@@ -676,9 +616,14 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
goto out_putf;
 
/* and ... switch */
-   error = loop_switch(lo, file);
-   if (error)
-   goto out_putf;
+   blk_mq_freeze_queue(lo->lo_queue);
+   mapping_set_gfp_mask(old_file->f_mapping, lo->old_gfp_mask);
+   lo->lo_backing_file = file;
+   lo->old_gfp_mask = mapping_gfp_mask(file->f_mapping);
+   mapping_set_gfp_mask(file->f_mapping,
+lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
+   loop_update_dio(lo);
+   blk_mq_unfreeze_queue(lo->lo_queue);
 
fput(old_file);
if (lo->lo_flags & LO_FLAGS_PARTSCAN)
@@ -1601,12 +1546,13 @@ static void lo_release(struct gendisk *disk, fmode_t 
mode)
err = loop_clr_fd(lo);
if (!err)
return;
-   } else {
+   } else if (lo->lo_state == Lo_bound) {
/*
 * Otherwise keep thread (if running) and config,
 * but flush possible ongoing bios in thread.
 */
-   loop_flush(lo);
+   blk_mq_freeze_queue(lo->lo_queue);
+   blk_mq_unfreeze_queue(lo->lo_queue);
}
 
mutex_unlock(&lo->lo_ctl_mutex);
-- 
2.14.1



[PATCH 3/4] loop: add ioctl for changing logical block size

2017-08-24 Thread Omar Sandoval
From: Omar Sandoval 

This is a different approach from the first attempt in f2c6df7dbf9a
("loop: support 4k physical blocksize"). Rather than extending
LOOP_{GET,SET}_STATUS, add a separate ioctl just for setting the block
size.

Signed-off-by: Omar Sandoval 
---
 drivers/block/loop.c  | 24 
 include/uapi/linux/loop.h |  1 +
 2 files changed, 25 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 1a5b4ecf54ec..6b1d932028e3 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1047,6 +1047,7 @@ static int loop_clr_fd(struct loop_device *lo)
memset(lo->lo_encrypt_key, 0, LO_KEY_SIZE);
memset(lo->lo_crypt_name, 0, LO_NAME_SIZE);
memset(lo->lo_file_name, 0, LO_NAME_SIZE);
+   blk_queue_logical_block_size(lo->lo_queue, 512);
if (bdev) {
bdput(bdev);
invalidate_bdev(bdev);
@@ -1330,6 +1331,24 @@ static int loop_set_dio(struct loop_device *lo, unsigned 
long arg)
return error;
 }
 
+static int loop_set_block_size(struct loop_device *lo, unsigned long arg)
+{
+   if (lo->lo_state != Lo_bound)
+   return -ENXIO;
+
+   if (arg < 512 || arg > PAGE_SIZE || !is_power_of_2(arg))
+   return -EINVAL;
+
+   blk_mq_freeze_queue(lo->lo_queue);
+
+   blk_queue_logical_block_size(lo->lo_queue, arg);
+   loop_update_dio(lo);
+
+   blk_mq_unfreeze_queue(lo->lo_queue);
+
+   return 0;
+}
+
 static int lo_ioctl(struct block_device *bdev, fmode_t mode,
unsigned int cmd, unsigned long arg)
 {
@@ -1378,6 +1397,11 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
mode,
if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
err = loop_set_dio(lo, arg);
break;
+   case LOOP_SET_BLOCK_SIZE:
+   err = -EPERM;
+   if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
+   err = loop_set_block_size(lo, arg);
+   break;
default:
err = lo->ioctl ? lo->ioctl(lo, cmd, arg) : -EINVAL;
}
diff --git a/include/uapi/linux/loop.h b/include/uapi/linux/loop.h
index c8125ec1f4f2..23158dbe2424 100644
--- a/include/uapi/linux/loop.h
+++ b/include/uapi/linux/loop.h
@@ -88,6 +88,7 @@ struct loop_info64 {
 #define LOOP_CHANGE_FD 0x4C06
 #define LOOP_SET_CAPACITY  0x4C07
 #define LOOP_SET_DIRECT_IO 0x4C08
+#define LOOP_SET_BLOCK_SIZE0x4C09
 
 /* /dev/loop-control interface */
 #define LOOP_CTL_ADD   0x4C80
-- 
2.14.1



Re: [PATCH V2 11/20] blk-mq: introduce helpers for operating ->dispatch list

2017-08-24 Thread Ming Lei
On Tue, Aug 22, 2017 at 08:43:12PM +, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > +static inline bool blk_mq_has_dispatch_rqs(struct blk_mq_hw_ctx *hctx)
> > +{
> > +	return !list_empty_careful(&hctx->dispatch);
> > +}
> > +
> > +static inline void blk_mq_add_rq_to_dispatch(struct blk_mq_hw_ctx *hctx,
> > +	struct request *rq)
> > +{
> > +	spin_lock(&hctx->lock);
> > +	list_add(&rq->queuelist, &hctx->dispatch);
> > +	blk_mq_hctx_set_dispatch_busy(hctx);
> > +	spin_unlock(&hctx->lock);
> > +}
> > +
> > +static inline void blk_mq_add_list_to_dispatch(struct blk_mq_hw_ctx *hctx,
> > +	struct list_head *list)
> > +{
> > +	spin_lock(&hctx->lock);
> > +	list_splice_init(list, &hctx->dispatch);
> > +	blk_mq_hctx_set_dispatch_busy(hctx);
> > +	spin_unlock(&hctx->lock);
> > +}
> > +
> > +static inline void blk_mq_add_list_to_dispatch_tail(struct blk_mq_hw_ctx *hctx,
> > +	struct list_head *list)
> > +{
> > +	spin_lock(&hctx->lock);
> > +	list_splice_tail_init(list, &hctx->dispatch);
> > +	blk_mq_hctx_set_dispatch_busy(hctx);
> > +	spin_unlock(&hctx->lock);
> > +}
> > +
> > +static inline void blk_mq_take_list_from_dispatch(struct blk_mq_hw_ctx *hctx,
> > +	struct list_head *list)
> > +{
> > +	spin_lock(&hctx->lock);
> > +	list_splice_init(&hctx->dispatch, list);
> > +	spin_unlock(&hctx->lock);
> > +}
> 
> Same comment for this patch: these helper functions are so short that I'm not
> sure it is useful to introduce these helper functions.

They will become longer in the following patches.

-- 
Ming


Re: [PATCH V2 10/20] blk-mq-sched: introduce helpers for query, change busy state

2017-08-24 Thread Ming Lei
On Wed, Aug 23, 2017 at 02:02:20PM -0600, Jens Axboe wrote:
> On Tue, Aug 22 2017, Bart Van Assche wrote:
> > On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > > +static inline bool blk_mq_hctx_is_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> > > +{
> > > +	return test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> > > +}
> > > +
> > > +static inline void blk_mq_hctx_set_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> > > +{
> > > +	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> > > +}
> > > +
> > > +static inline void blk_mq_hctx_clear_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> > > +{
> > > +	clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> > > +}
> > 
> > Hello Ming,
> > 
> > Are these helper functions modified in a later patch? If not, please drop
> > this patch. One line helper functions are not useful and do not improve
> > readability of source code.
> 
> Agree, they just obfuscate the code. Only reason to do this is to do
> things like:

If you look at the following patch, these introduced functions are
modified a lot.

> 
> static inline void blk_mq_hctx_clear_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> {
> 	if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state))
> 		clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> }
> 
> to avoid unecessary RMW (and locked instruction). At least then you can
> add a single comment as to why it's done that way. Apart from that, I
> prefer to open-code it so people don't have to grep to figure out wtf
> blk_mq_hctx_clear_dispatch_busy() does.

Ok, will do that in this way.



-- 
Ming


Re: [PATCH V2 10/20] blk-mq-sched: introduce helpers for query, change busy state

2017-08-24 Thread Ming Lei
On Tue, Aug 22, 2017 at 08:41:14PM +, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > +static inline bool blk_mq_hctx_is_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> > +{
> > +	return test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> > +}
> > +
> > +static inline void blk_mq_hctx_set_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> > +{
> > +	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> > +}
> > +
> > +static inline void blk_mq_hctx_clear_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> > +{
> > +	clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> > +}
> 
> Hello Ming,
> 
> Are these helper functions modified in a later patch? If not, please drop
> this patch. One line helper functions are not useful and do not improve
> readability of source code.

They are definitely modified in a later patch :-)

-- 
Ming


Re: [PATCH V2 09/20] blk-mq: introduce BLK_MQ_F_SHARED_DEPTH

2017-08-24 Thread Ming Lei
On Tue, Aug 22, 2017 at 09:55:57PM +, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > +   /*
> > +* if there is q->queue_depth, all hw queues share
> > +* this queue depth limit
> > +*/
> > +   if (q->queue_depth) {
> > +   queue_for_each_hw_ctx(q, hctx, i)
> > +   hctx->flags |= BLK_MQ_F_SHARED_DEPTH;
> > +   }
> > +
> > +   if (!q->elevator)
> > +   goto exit;
> 
> Hello Ming,
> 
> It seems very fragile to me to set BLK_MQ_F_SHARED_DEPTH if and only if
> q->queue_depth != 0. Wouldn't it be better to let the block driver tell the
> block layer whether or not there is a queue depth limit across hardware
> queues, e.g. through a tag set flag?

One reason for not doing it that way is that q->queue_depth can be
changed via the sysfs interface.

Another reason is that it is better not to expose this flag to drivers
since it isn't necessary; that is to say, it is actually an internal flag.

-- 
Ming


Re: [PATCH V2 08/20] blk-mq-sched: use q->queue_depth as hint for q->nr_requests

2017-08-24 Thread Ming Lei
On Tue, Aug 22, 2017 at 08:20:20PM +, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > SCSI sets q->queue_depth from shost->cmd_per_lun, and
> > q->queue_depth is per request_queue and more related to
> > scheduler queue compared with hw queue depth, which can be
> > shared by queues, such as TAG_SHARED.
> > 
> > This patch trys to use q->queue_depth as hint for computing
>  
>  tries?

Will fix it in V3.

> > q->nr_requests, which should be more effective than
> > current way.
> 
> Reviewed-by: Bart Van Assche 

Thanks!

-- 
Ming


Re: [PATCH V2 06/20] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed

2017-08-24 Thread Ming Lei
On Wed, Aug 23, 2017 at 01:56:50PM -0600, Jens Axboe wrote:
> On Sat, Aug 05 2017, Ming Lei wrote:
> > During dispatching, we moved all requests from hctx->dispatch to
> > one temporary list, then dispatch them one by one from this list.
> > Unfortunately duirng this period, run queue from other contexts
> > may think the queue is idle, then start to dequeue from sw/scheduler
> > queue and still try to dispatch because ->dispatch is empty. This way
> > hurts sequential I/O performance because requests are dequeued when
> > lld queue is busy.
> > 
> > This patch introduces the state of BLK_MQ_S_DISPATCH_BUSY to
> > make sure that request isn't dequeued until ->dispatch is
> > flushed.
> 
> I don't like how this patch introduces a bunch of locked setting of a
> flag under the hctx lock. Especially since I think we can easily avoid
> it.

Actually the lock isn't needed for setting the flag; I will move it out
of the lock in V3.

> 
> > -   } else if (!has_sched_dispatch & !q->queue_depth) {
> > +   blk_mq_dispatch_rq_list(q, &rq_list);
> > +
> > +   /*
> > +* We may clear DISPATCH_BUSY just after it
> > +* is set from another context, the only cost
> > +* is that one request is dequeued a bit early,
> > +* we can survive that. Given the window is
> > +* too small, no need to worry about performance
> > +* effect.
> > +*/
> > +   if (list_empty_careful(&hctx->dispatch))
> > +   clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> 
> This is basically the only place where we modify it without holding the
> hctx lock. Can we move it into blk_mq_dispatch_rq_list()? The list is

The problem is that blk_mq_dispatch_rq_list() doesn't know if it is
handling requests from hctx->dispatch or from the sw/scheduler queue. We
only need to clear the bit after hctx->dispatch is flushed, so the
clearing can't be moved into blk_mq_dispatch_rq_list().

> generally empty, unless for the case where we splice residuals back. If
> we splice them back, we grab the lock anyway.
> 
> The other places it's set under the hctx lock, yet we end up using an
> atomic operation to do it.

As the comment explains, it doesn't need to be atomic.

-- 
Ming


Re: [PATCH V2 06/20] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed

2017-08-24 Thread Ming Lei
On Tue, Aug 22, 2017 at 08:09:32PM +, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > +   /*
> > +* Wherever DISPATCH_BUSY is set, blk_mq_run_hw_queue()
> > +* will be run to try to make progress, so it is always
> > +* safe to check the state here.
> > +*/
> > +   if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state))
> > +   return;
> 
> The comment above test_bit() is useful but does not explain the purpose of
> the early return. Is the purpose of the early return perhaps to serialize
> blk_mq_sched_dispatch_requests() calls? If so, please mention this.

If the bit is set, that means there are requests in hctx->dispatch, so
we return early to avoid dequeuing requests from the sw/scheduler queue
unnecessarily.

I thought the code was self-explanatory, so I didn't explain it; I will
add the comment.

-- 
Ming


Re: [PATCH] Revert "loop: support 4k physical blocksize"

2017-08-24 Thread Hannes Reinecke
On 08/23/2017 11:59 PM, Jens Axboe wrote:
> On 08/23/2017 03:54 PM, Omar Sandoval wrote:
>> From: Omar Sandoval 
>>
>> There's some stuff still up in the air, let's not get stuck with a
>> subpar ABI. I'll follow up with something better for 4.14.
> 
> I think that's a good idea, there was already too much late churn
> fixing up the 4k support. Dropped the fixups, added this.
> 


I'll never get it upstream ...

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH V2 05/20] blk-mq-sched: improve dispatching from sw queue

2017-08-24 Thread Ming Lei
On Tue, Aug 22, 2017 at 08:57:03PM +, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > +static inline void blk_mq_do_dispatch_ctx(struct request_queue *q,
> > + struct blk_mq_hw_ctx *hctx)
> > +{
> > +   LIST_HEAD(rq_list);
> > +   struct blk_mq_ctx *ctx = NULL;
> > +
> > +   do {
> > +   struct request *rq;
> > +
> > +   rq = blk_mq_dispatch_rq_from_ctx(hctx, ctx);
> > +   if (!rq)
> > +   break;
> > +   list_add(&rq->queuelist, &rq_list);
> > +
> > +   /* round robin for fair dispatch */
> > +   ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
> > +   } while (blk_mq_dispatch_rq_list(q, &rq_list));
> > +}
> 
> An additional question about this patch: shouldn't request dequeuing start
> at the software queue next to the last one from which a request got dequeued
> instead of always starting at the first software queue (struct blk_mq_ctx
> *ctx = NULL) to be truly round robin?

Looks like a good idea; I will introduce a ctx hint in hctx for where to
start when blk_mq_dispatch_rq_list() returns zero. No lock is needed since
it is just a hint.
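Something along these lines, presumably (the dispatch_from field is just
a placeholder name for the hint):

/* Sketch: remember where round-robin stopped so the next run starts there. */
static void blk_mq_do_dispatch_ctx_sketch(struct request_queue *q,
					  struct blk_mq_hw_ctx *hctx)
{
	LIST_HEAD(rq_list);
	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);

	do {
		struct request *rq;

		rq = blk_mq_dispatch_rq_from_ctx(hctx, ctx);
		if (!rq)
			break;
		list_add(&rq->queuelist, &rq_list);

		/* round robin for fair dispatch */
		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
	} while (blk_mq_dispatch_rq_list(q, &rq_list));

	/* just a hint, so a plain store without locking is fine */
	WRITE_ONCE(hctx->dispatch_from, ctx);
}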

-- 
Ming