Re: [dm-devel] [PATCH v13 3/9] block: add emulation for copy

2023-06-29 Thread Ming Lei
Hi Nitesh,

On Wed, Jun 28, 2023 at 12:06:17AM +0530, Nitesh Shetty wrote:
> For devices which do not support copy, copy emulation is added. It is
> required for in-kernel users like fabrics, where a file descriptor is

I can understand that the copy command helps FS GC and fabrics storage,
but it is still not very clear why copy emulation is needed for in-kernel
users. Is it just to cover both the copy command and the emulation behind
a single interface, or are there other purposes?

I'd suggest adding more words about the in-kernel users of copy emulation.

> not available and hence they can't use copy_file_range.
> Copy emulation is implemented by reading from the source into memory and
> writing to the corresponding destination asynchronously.
> Emulation is also used if copy offload fails or completes partially.

Per my understanding, this kind of emulation may not be as efficient as
doing it in userspace (two linked io_uring SQEs, a read and a write
sharing one buffer). But it is fine if there are real in-kernel users.
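
For concreteness, the userspace scheme above could look roughly like this
with liburing (a minimal sketch; names, sizes and error handling are
illustrative, not taken from the patchset):

#include <liburing.h>
#include <stdlib.h>
#include <sys/types.h>

/* copy [src_off, src_off+len) of src_fd to dst_off of dst_fd:
 * the read fills buf, and IOSQE_IO_LINK makes the write run only
 * after the read completes successfully. */
static int uring_copy_range(int src_fd, int dst_fd, off_t src_off,
			    off_t dst_off, unsigned int len)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	void *buf;
	int i, ret = 0;

	if (posix_memalign(&buf, 4096, len))
		return -1;
	if (io_uring_queue_init(8, &ring, 0)) {
		free(buf);
		return -1;
	}

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, src_fd, buf, len, src_off);
	sqe->flags |= IOSQE_IO_LINK;	/* chain the write after the read */

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, dst_fd, buf, len, dst_off);

	io_uring_submit(&ring);
	for (i = 0; i < 2; i++) {	/* reap both completions */
		io_uring_wait_cqe(&ring, &cqe);
		if (cqe->res < 0)
			ret = cqe->res;
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	free(buf);
	return ret;
}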


Thanks,
Ming



Re: [dm-devel] [PATCH v3] blk-mq: enforce op-specific segment limits in blk_insert_cloned_request

2023-03-01 Thread Ming Lei
On Tue, Feb 28, 2023 at 05:06:55PM -0700, Uday Shankar wrote:
> The block layer might merge together discard requests up until the
> max_discard_segments limit is hit, but blk_insert_cloned_request checks
> the segment count against max_segments regardless of the req op. This
> can result in errors like the following when discards are issued through
> a DM device and max_discard_segments exceeds max_segments for the queue
> of the chosen underlying device.
> 
> blk_insert_cloned_request: over max segments limit. (256 > 129)
> 
> Fix this by looking at the req_op and enforcing the appropriate segment
> limit - max_discard_segments for REQ_OP_DISCARDs and max_segments for
> everything else.
> 
> Signed-off-by: Uday Shankar 

Reviewed-by: Ming Lei 
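
For readers following along: the op-specific limit the patch describes
boils down to a helper along the lines of blk_rq_get_max_segments() in
block/blk.h (whether the fix reuses that exact helper is an assumption
here):

static inline unsigned int blk_rq_get_max_segments(struct request *rq)
{
	/* discards merge against their own, larger segment limit */
	if (req_op(rq) == REQ_OP_DISCARD)
		return queue_max_discard_segments(rq->q);
	return queue_max_segments(rq->q);
}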

Thanks,
Ming



Re: [dm-devel] A hang bug of dm on s390x

2023-02-15 Thread Ming Lei
On Wed, Feb 15, 2023 at 07:23:40PM +0800, Pingfan Liu wrote:
> Hi guys,
> 
> I encountered a hang issue on an s390x system. The tested kernel is
> not preemptible and boots with "nr_cpus=1".
> 
> The test steps:
>   umount /home
>   lvremove /dev/rhel_s390x-kvm-011/home
>   ## uncomment "snapshot_autoextend_threshold = 70" and
>   "snapshot_autoextend_percent = 20" in /etc/lvm/lvm.conf
> 
>   systemctl enable lvm2-monitor.service
>   systemctl start lvm2-monitor.service
> 
>   lvremove -y rhel_s390x-kvm-011/thinp
>   lvcreate -L 10M -T rhel_s390x-kvm-011/thinp
>   lvcreate -V 400M -T rhel_s390x-kvm-011/thinp -n src
>   mkfs.ext4 /dev/rhel_s390x-kvm-011/src
>   mount /dev/rhel_s390x-kvm-011/src /mnt
>   for((i=0;i<4;i++)); do dd if=/dev/zero of=/mnt/test$i.img
> bs=100M count=1; done
> 
> And the system hangs with the console log [1]
> 
> The related kernel config
> 
> CONFIG_PREEMPT_NONE_BUILD=y
> CONFIG_PREEMPT_NONE=y
> CONFIG_PREEMPT_COUNT=y
> CONFIG_SCHED_CORE=y
> 
> It turns out that when hanging, the kernel is stuck in the dead-loop
> in the function dm_wq_work()
> while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> spin_lock_irq(&md->deferred_lock);
> bio = bio_list_pop(&md->deferred);
> spin_unlock_irq(&md->deferred_lock);
> 
> if (!bio)
> break;
> thread_cpu = smp_processor_id();
> submit_bio_noacct(bio);
> }
> where dm_wq_work()->__submit_bio_noacct()->...->dm_handle_requeue()
> keeps generating new bios, and the condition "if (!bio)" cannot be
> met.
> 
> 
> After applying the following patch, the issue is gone.
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index e1ea3a7bd9d9..95c9cb07a42f 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -2567,6 +2567,7 @@ static void dm_wq_work(struct work_struct *work)
> break;
> 
> submit_bio_noacct(bio);
> +   cond_resched();
> }
>  }
> 
> But I think it is not a proper solution. And without this patch, if
> nr_cpus=1 is removed (the system has two cpus), the issue cannot be
> triggered. That is, with more than one cpu, the above loop can exit
> via the condition "if (!bio)".
> 
> Any ideas?

I think the patch is correct.

For a kernel built without CONFIG_PREEMPT on a single CPU core, if the
dm target (such as dm-thin) needs another workqueue or kthread for
handling IO, then the target side is blocked because dm_wq_work() holds
the only CPU; sooner or later the target has no resources left to handle
new IO from dm core and returns REQUEUE.

Then dm_wq_work() becomes a dead loop.


Thanks,
Ming



Re: [dm-devel] [RFC for-6.2/block V2] block: Change the granularity of io ticks from ms to ns

2022-12-07 Thread Ming Lei
On Wed, Dec 07, 2022 at 10:32:04PM +, Gulam Mohamed wrote:
> As per the review comment from Jens Axboe, I am re-sending this patch
> against "for-6.2/block".
> 
> 
> Use ktime to change the granularity of IO accounting in block layer from
> milli-seconds to nano-seconds to get the proper latency values for the
> devices whose latency is in micro-seconds. After changing the granularity
> to nano-seconds the iostat command, which was showing incorrect values for
> %util, is now showing correct values.

Please add the theory behind why using nanoseconds yields correct accounting.

> 
> We did not work on the patch to drop the logic for
> STAT_PRECISE_TIMESTAMPS yet. Will do it if this patch is ok.
> 
> The iostat command was run after starting the fio with following command
> on an NVME disk. For the same fio command, the iostat %util was showing
> ~100% for the disks whose latencies are in the range of microseconds.
> With the kernel changes (granularity to nano-seconds), the %util was
> showing correct values. Following are the details of the test and their
> output:
> 
> fio command
> ---
> [global]
> bs=128K
> iodepth=1
> direct=1
> ioengine=libaio
> group_reporting
> time_based
> runtime=90
> thinktime=1ms
> numjobs=1
> name=raw-write
> rw=randrw
> ignore_error=EIO:EIO
> [job1]
> filename=/dev/nvme0n1
> 
> Correct values after kernel changes:
> 
> iostat output
> -
> iostat -d /dev/nvme0n1 -x 1
> 
> Device   r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
> nvme0n1     0.08    0.05   0.06   128.00   128.00   0.07   6.50
> 
> Device   r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
> nvme0n1     0.08    0.06   0.06   128.00   128.00   0.07   6.30
> 
> Device   r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
> nvme0n1     0.06    0.05   0.06   128.00   128.00   0.06   5.70
> 
> From fio
> 
> Read Latency: clat (usec): min=32, max=2335, avg=79.54, stdev=29.95
> Write Latency: clat (usec): min=38, max=130, avg=57.76, stdev= 3.25

Can you explain a bit why the above %util is correct?
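
For what it's worth, a back-of-the-envelope reading of the quoted numbers
(assuming HZ=1000, i.e. a 1ms jiffy) shows why the granularity matters:

  per-IO period ≈ thinktime + latency ≈ 1000us + ~70us ≈ 1.07ms
  ms granularity: every jiffy that sees an IO counts as fully busy,
                  so %util approaches 100%
  ns granularity: only the ~70us of real latency counts as busy,
                  so %util ≈ 0.07/1.07 ≈ 6.5%, matching the output above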

BTW, %util is usually not important for SSDs, please see 'man iostat':

 %util
     Percentage of elapsed time during which I/O requests were issued to
     the device (bandwidth utilization for the device). Device saturation
     occurs when this value is close to 100% for devices serving requests
     serially. But for devices serving requests in parallel, such as RAID
     arrays and modern SSDs, this number does not reflect their
     performance limits.


Thanks, 
Ming


Re: [dm-devel] [PATCH v5 02/10] block: Add copy offload support infrastructure

2022-12-07 Thread Ming Lei
On Wed, Dec 07, 2022 at 11:24:00AM +0530, Nitesh Shetty wrote:
> On Tue, Nov 29, 2022 at 05:14:28PM +0530, Nitesh Shetty wrote:
> > On Thu, Nov 24, 2022 at 08:03:56AM +0800, Ming Lei wrote:
> > > On Wed, Nov 23, 2022 at 03:37:12PM +0530, Nitesh Shetty wrote:
> > > > On Wed, Nov 23, 2022 at 04:04:18PM +0800, Ming Lei wrote:
> > > > > On Wed, Nov 23, 2022 at 11:28:19AM +0530, Nitesh Shetty wrote:
> > > > > > Introduce blkdev_issue_copy which supports source and destination 
> > > > > > bdevs,
> > > > > > and an array of (source, destination and copy length) tuples.
> > > > > > Introduce REQ_COPY copy offload operation flag. Create a read-write
> > > > > > bio pair with a token as payload and submitted to the device in 
> > > > > > order.
> > > > > > Read request populates token with source specific information which
> > > > > > is then passed with write request.
> > > > > > This design is courtesy Mikulas Patocka's token based copy
> > > > > 
> > > > > I thought this patchset is just for enabling copy command which is
> > > > > supported by hardware. But turns out it isn't, because 
> > > > > blk_copy_offload()
> > > > > still submits read/write bios for doing the copy.
> > > > > 
> > > > > I am just wondering why not let copy_file_range() cover this kind of 
> > > > > copy,
> > > > > and the framework has been there.
> > > > > 
> > > > 
> > > > Main goal was to enable copy command, but community suggested to add
> > > > copy emulation as well.
> > > > 
> > > > blk_copy_offload - actually issues copy command in driver layer.
> > > > The way read/write BIOs are perceived is different for copy offload.
> > > > In copy offload we check the REQ_COPY flag in the NVMe driver layer
> > > > to issue the copy command. But we missed adding it in other drivers,
> > > > where they might be treated as normal READ/WRITE.
> > > > 
> > > > blk_copy_emulate - is used if we fail or if the device doesn't
> > > > support the native copy offload command. Here we do READ/WRITE.
> > > > Using copy_file_range for emulation might be possible, but we see
> > > > 2 issues here.
> > > > 1. We explored the possibility of pulling dm-kcopyd into the block
> > > > layer so that we can readily use it. But we found it had many
> > > > dependencies on the dm layer, so we later dropped that idea.
> > > 
> > > Is it just because dm-kcopyd supports async copy? If yes, I believe we
> > > can rely on io_uring for implementing an async copy_file_range, which
> > > will be a generic interface for async copy, and could get better perf.
> > >
> > 
> > It supports both sync and async, but it is used only inside the dm
> > layer. An async version of copy_file_range can help; io_uring can be
> > helpful for userspace, but in-kernel users can't use io_uring.
> > 
> > > > 2. For copy_file_range, at least for block devices, we saw a few
> > > > checks which fail it for raw block devices. At this point I don't
> > > > know much about the history of why such checks are present.
> > > 
> > > Got it, but IMO the check in generic_copy_file_checks() can be
> > > relaxed to cover blkdev cause splice does support blkdev.
> > > 
> > > Then your bdev offload copy work can be simplified into:
> > > 
> > > 1) implement .copy_file_range for def_blk_fops, suppose it is
> > > blkdev_copy_file_range()
> > > 
> > > 2) inside blkdev_copy_file_range()
> > > 
> > > - if the bdev supports offload copy, just submit one bio to the device,
> > > and this will be converted to one pt req to device
> > > 
> > > - otherwise, fallback to generic_copy_file_range()
> > >
> > 
> 
> Actually we sent the initial version with a single bio, but later the
> community suggested that two bios are a must for offload, the main
> reasoning being

Is there any link which holds the discussion?

> dm-layer, XCOPY, and copy-across-namespace compatibility.

But dm-kcopyd has supported bdev copy already, so once your patch is
ready, dm-kcopyd can just send one bio with REQ_COPY if the device
supports the offload command; otherwise the current dm-kcopyd code can
work as before.

> 
> > We will check the feasibility and try to implement the scheme in next
> > versions.
> > It would be helpful, i

Re: [dm-devel] [RFC] block: Change the granularity of io ticks from ms to ns

2022-12-06 Thread Ming Lei
On Wed, Dec 07, 2022 at 10:19:08AM +0800, Yu Kuai wrote:
> Hi,
> 
> On 2022/12/07 2:15, Gulam Mohamed wrote:
> > Use ktime to change the granularity of IO accounting in block layer from
> > milli-seconds to nano-seconds to get the proper latency values for the
> > devices whose latency is in micro-seconds. After changing the granularity
> > to nano-seconds the iostat command, which was showing incorrect values for
> > %util, is now showing correct values.
> 
> This patch didn't correct the counting of io_ticks; it just shrinks the
> accounting error from jiffies (ms) to ns. The problem that util can be
> smaller or larger still exists.

Agree.

> 
> However, I think this change makes sense considering that the error
> margin is much smaller, and the performance overhead should be minimal.
> 
> Hi, Ming, how do you think?

I remember that ktime_get() has non-negligible overhead. Is there any
test data (IOPS / CPU utilization) from running fio or t/io_uring on
null_blk with this patch?
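
For reference, one way to collect such numbers (a sketch; the exact
null_blk parameters and fio options are illustrative):

  modprobe null_blk queue_mode=2 irqmode=2 completion_nsec=0 submit_queues=1
  fio --name=ktime-test --filename=/dev/nullb0 --ioengine=io_uring \
      --direct=1 --rw=randread --bs=4k --iodepth=32 --numjobs=1 \
      --runtime=30 --time_based --cpus_allowed=0 --group_reporting

then comparing IOPS and CPU utilization (e.g. from mpstat) with and
without the patch applied.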


Thanks, 
Ming


Re: [dm-devel] [PATCH v5 02/10] block: Add copy offload support infrastructure

2022-11-23 Thread Ming Lei
On Wed, Nov 23, 2022 at 03:37:12PM +0530, Nitesh Shetty wrote:
> On Wed, Nov 23, 2022 at 04:04:18PM +0800, Ming Lei wrote:
> > On Wed, Nov 23, 2022 at 11:28:19AM +0530, Nitesh Shetty wrote:
> > > Introduce blkdev_issue_copy which supports source and destination bdevs,
> > > and an array of (source, destination and copy length) tuples.
> > > Introduce REQ_COPY copy offload operation flag. Create a read-write
> > > bio pair with a token as payload and submitted to the device in order.
> > > Read request populates token with source specific information which
> > > is then passed with write request.
> > > This design is courtesy Mikulas Patocka's token based copy
> > 
> > I thought this patchset is just for enabling copy command which is
> > supported by hardware. But turns out it isn't, because blk_copy_offload()
> > still submits read/write bios for doing the copy.
> > 
> > I am just wondering why not let copy_file_range() cover this kind of copy,
> > and the framework has been there.
> > 
> 
> The main goal was to enable the copy command, but the community
> suggested adding copy emulation as well.
> 
> blk_copy_offload - actually issues copy command in driver layer.
> The way read/write BIOs are perceived is different for copy offload.
> In copy offload we check the REQ_COPY flag in the NVMe driver layer to
> issue the copy command. But we missed adding it in other drivers, where
> they might be treated as normal READ/WRITE.
> 
> blk_copy_emulate - is used if we fail or if the device doesn't support
> the native copy offload command. Here we do READ/WRITE. Using
> copy_file_range for emulation might be possible, but we see 2 issues here.
> 1. We explored the possibility of pulling dm-kcopyd into the block layer
> so that we can readily use it. But we found it had many dependencies on
> the dm layer, so we later dropped that idea.

Is it just because dm-kcopyd supports async copy? If yes, I believe we
can rely on io_uring for implementing an async copy_file_range, which will
be a generic interface for async copy, and could get better perf.

> 2. For copy_file_range, at least for block devices, we saw a few checks
> which fail it for raw block devices. At this point I don't know much
> about the history of why such checks are present.

Got it, but IMO the check in generic_copy_file_checks() can be
relaxed to cover blkdev, since splice does support blkdev.

Then your bdev offload copy work can be simplified into:

1) implement .copy_file_range for def_blk_fops, suppose it is
blkdev_copy_file_range()

2) inside blkdev_copy_file_range()

- if the bdev supports offload copy, just submit one bio to the device,
and this will be converted to one pt req to device

- otherwise, fallback to generic_copy_file_range()
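
A skeletal version of that scheme might look as follows; note that
blkdev_supports_copy() and this blkdev_issue_copy() signature are
assumptions made for illustration, not the patchset's actual API:

static ssize_t blkdev_copy_file_range(struct file *file_in, loff_t pos_in,
				      struct file *file_out, loff_t pos_out,
				      size_t len, unsigned int flags)
{
	struct block_device *in = I_BDEV(file_in->f_mapping->host);
	struct block_device *out = I_BDEV(file_out->f_mapping->host);

	/* offload path: one bio, turned into a copy command by the driver */
	if (blkdev_supports_copy(in, out))
		return blkdev_issue_copy(in, pos_in >> SECTOR_SHIFT,
					 out, pos_out >> SECTOR_SHIFT,
					 len >> SECTOR_SHIFT, GFP_KERNEL);

	/* no offload available: fall back to read/write emulation */
	return generic_copy_file_range(file_in, pos_in, file_out, pos_out,
				       len, flags);
}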

> 
> When I was researching the pipe/splice code for supporting ublk zero
> copy[1], I got ideas for an async copy_file_range(), such as io_uring
> based direct splice and a user-backed intermediate buffer, still zero
> copy. If these ideas are finally implemented, we could get super-fast
> generic offload copy, and bdev copy is really covered too.
> > 
> > [1] 
> > https://lore.kernel.org/linux-block/20221103085004.1029763-1-ming@redhat.com/
> > 
> 
> Seems interesting; we will take a look at this.

BTW, that is probably one direction of ublk's async zero copy IO too.


Thanks, 
Ming



Re: [dm-devel] [PATCH v5 02/10] block: Add copy offload support infrastructure

2022-11-23 Thread Ming Lei
On Wed, Nov 23, 2022 at 11:28:19AM +0530, Nitesh Shetty wrote:
> Introduce blkdev_issue_copy which supports source and destination bdevs,
> and an array of (source, destination and copy length) tuples.
> Introduce the REQ_COPY copy offload operation flag. Create a read-write
> bio pair with a token as payload, submitted to the device in order.
> The read request populates the token with source-specific information,
> which is then passed along with the write request.
> This design is courtesy of Mikulas Patocka's token-based copy

I thought this patchset was just for enabling the copy command that is
supported by hardware, but it turns out it isn't, because blk_copy_offload()
still submits read/write bios for doing the copy.

I am just wondering why not let copy_file_range() cover this kind of
copy, since the framework is already there.

When I was researching the pipe/splice code for supporting ublk zero
copy[1], I got ideas for an async copy_file_range(), such as io_uring
based direct splice and a user-backed intermediate buffer, still zero
copy. If these ideas are finally implemented, we could get super-fast
generic offload copy, and bdev copy is really covered too.

[1] 
https://lore.kernel.org/linux-block/20221103085004.1029763-1-ming@redhat.com/

thanks,
Ming



[dm-devel] [PATCH] blk-mq: don't add non-pt request with ->end_io to batch

2022-10-27 Thread Ming Lei
dm-rq implements the ->end_io callback for requests issued to the
underlying queue, and such requests aren't passthrough requests.

Commit ab3e1d3bbab9 ("block: allow end_io based requests in the completion
batch handling") doesn't clear rq->bio and rq->__data_len for requests
with ->end_io in blk_mq_end_request_batch(). This is actually dangerous,
but so far it only affects nvme passthrough requests.

dm-rq needs to clean up the remaining bios in case of partial completion,
for which req->bio is required, so a use-after-free is triggered, and
hence the underlying clone request can't be completed in
blk_mq_end_request_batch().

Fix the panic by not adding such requests to the batch list. The issue
can be triggered simply by exposing an nvme pci device through dm-mpath.

Fixes: ab3e1d3bbab9 ("block: allow end_io based requests in the completion 
batch handling")
Cc: dm-devel@redhat.com
Cc: Mike Snitzer 
Reported-by: Changhui Zhong 
Signed-off-by: Ming Lei 
---
 include/linux/blk-mq.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index ba18e9bdb799..d6119c5d1069 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -853,7 +853,8 @@ static inline bool blk_mq_add_to_batch(struct request *req,
   struct io_comp_batch *iob, int ioerror,
   void (*complete)(struct io_comp_batch *))
 {
-   if (!iob || (req->rq_flags & RQF_ELV) || ioerror)
+   if (!iob || (req->rq_flags & RQF_ELV) || ioerror ||
+   (req->end_io && !blk_rq_is_passthrough(req)))
return false;
 
if (!iob->complete)
-- 
2.31.1




Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API

2022-06-30 Thread Ming Lei
On Wed, Jun 29, 2022 at 09:14:54PM -0400, Kent Overstreet wrote:
> On Thu, Jun 30, 2022 at 08:47:13AM +0800, Ming Lei wrote:
> > Or if I misunderstood your point, please cook a patch and I am happy to
> > take a close look, and posting one very raw idea with random data
> > structure looks not helpful much for this discussion technically.
> 
> Based it on your bio_rewind() patch - what do you think of this?
> 
> -- >8 --
> From: Kent Overstreet 
> Subject: [PATCH] block: add bio_(get|set)_pos()
> 
> Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removed
> the similar API for the following reasons:
> 
> ```
> It is pointed that bio_rewind_iter() is one very bad API[1]:
> 
> 1) bio size may not be restored after rewinding
> 
> 2) it causes some bogus change, such as 5151842b9d8732 (block: reset
> bi_iter.bi_done after splitting bio)
> 
> 3) rewinding really makes things complicated wrt. bio splitting
> 
> 4) unnecessary updating of .bi_done in fast path
> 
> [1] https://marc.info/?t=15354992425&r=1&w=2
> 
> So this patch takes Kent's suggestion to restore one bio into its original
> state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
> given now bio_rewind_iter() is only used by bio integrity code.
> ```
> 
> However, saving and restoring bi_iter isn't sufficient anymore, because
> of integrity and now the per-bio crypt context.
> 
> This patch implements the same functionality as bio_rewind(), based on a
> patch by Ming, but with a different (safer!) interface.
> 
>  - bio_get_pos() gets the current state of a bio, i.e. how far it has
>    been advanced and its current (remaining) size
>  - bio_set_pos() restores a bio to a previous state, advancing or
>    rewinding it as needed
> 
> Co-authored-by: Ming Lei 
> Signed-off-by: Kent Overstreet 
> ---
>  block/bio-integrity.c   | 19 +++
>  block/bio.c | 26 ++
>  block/blk-crypto-internal.h |  7 +++
>  block/blk-crypto.c  | 25 +
>  include/linux/bio.h | 22 ++
>  include/linux/blk_types.h   | 19 +++
>  include/linux/bvec.h| 36 +++-
>  7 files changed, 153 insertions(+), 1 deletion(-)
> 
> diff --git a/block/bio-integrity.c b/block/bio-integrity.c
> index 32929c89ba..06c2fe81fd 100644
> --- a/block/bio-integrity.c
> +++ b/block/bio-integrity.c
> @@ -378,6 +378,25 @@ void bio_integrity_advance(struct bio *bio, unsigned int bytes_done)
>   bvec_iter_advance(bip->bip_vec, &bip->bip_iter, bytes);
>  }
>  
> +/**
> + * bio_integrity_rewind - Rewind integrity vector
> + * @bio: bio whose integrity vector to update
> + * @bytes_done:  number of data bytes to rewind
> + *
> + * Description: This function calculates how many integrity bytes the
> + * number of completed data bytes correspond to and rewind the
> + * integrity vector accordingly.
> + */
> +void bio_integrity_rewind(struct bio *bio, unsigned int bytes_done)
> +{
> + struct bio_integrity_payload *bip = bio_integrity(bio);
> + struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
> + unsigned bytes = bio_integrity_bytes(bi, bytes_done >> 9);
> +
> + bip->bip_iter.bi_sector -= bio_integrity_intervals(bi, bytes_done >> 9);
> + bvec_iter_rewind(bip->bip_vec, &bip->bip_iter, bytes);
> +}
> +
>  /**
>   * bio_integrity_trim - Trim integrity vector
>   * @bio: bio whose integrity vector to update
> diff --git a/block/bio.c b/block/bio.c
> index b2425b8d88..bbf8aa4e62 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1329,6 +1329,32 @@ void __bio_advance(struct bio *bio, unsigned bytes)
>  }
>  EXPORT_SYMBOL(__bio_advance);
>  
> +/**
> + * bio_set_pos - restore a bio to a previous state, after having been 
> iterated
> + * or trimmed
> + * @bio: bio to reset
> + * @pos: pos to reset it to, from bio_get_pos()
> + */
> +void bio_set_pos(struct bio *bio, struct bio_pos pos)
> +{
> + int delta = bio->bi_iter.bi_done - pos.bi_done;
> +
> + if (delta > 0) {
> + if (bio_integrity(bio))
> + bio_integrity_rewind(bio, delta);
> + bio_crypt_rewind(bio, delta);
> + bio_rewind_iter(bio, &bio->bi_iter, delta);
> + } else {
> + bio_advance(bio, -delta);
> + }
> +
> + bio->bi_iter.bi_size = pos.bi_size;
> +
> + if (bio_integrity(bio))
> + bio_integrity_trim(bio);
> +}
> +EXPORT_SYMBOL(bio_set_pos);

Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API

2022-06-29 Thread Ming Lei
On Wed, Jun 29, 2022 at 02:11:54PM -0400, Kent Overstreet wrote:
> On Wed, Jun 29, 2022 at 02:07:08AM -0400, Mike Snitzer wrote:
> > Please try to dial down the hyperbole and judgment. Ming wrote this
> > code. And you haven't been able to point out anything _actually_ wrong
> > with it (yet).
> > 
> > This patch's header does need editing for clarity, but we can help
> > improve it and the documentation above bio_rewind() in the code.
> > 
> > > So, and I'm sorry I have to be the killjoy here, but hard NACK on this 
> > > patchset.
> > > Hard, hard NACK.
> > 
> > 
> > 
> > You see this bio_rewind() as history repeating itself, but it isn't
> > like what you ranted about in the past:
> > https://marc.info/?l=linux-block&m=153549921116441&w=2
> > 
> > I can certainly see why you think it similar at first glance. But this
> > patchset shows how bio_rewind() must be used, and how DM benefits from
> > using it safely (with no impact to struct bio or DM's per-bio-data).
> > 
> > bio_rewind() usage will be as niche as DM's use-case for it. If other
> > code respects the documented constraint, that the original bio's end
> > sector be preserved, then they can use it too.
> > 
> > The key is for a driver to maintain enough state to allow this fixed
> > end be effectively immutable. (DM happens to get this state "for free"
> > simply because it was already established for its IO accounting of
> > split bios).
> > 
> > The Linux codebase requires precision. This isn't new.
> 
> Mike, that's not justification for making things _more_ dangerous.
> 
> > 
> > > I'll be happy to assist in coming up with alternate, less dangerous 
> > > solutions
> > > though (and I think introducing a real bio_iter is overdue, so that's 
> > > probably
> > > the first thing we should look at).
> > 
> > It isn't dangerous. It is an interface whose constraint needs to be
> > respected. Just like is documented for a myriad other kernel
> > interfaces.
> > 
> > Factoring out a bio_iter will bloat struct bio for functionality most
> > consumers don't need. And gating DM's ability to achieve this
> > patchset's functionality with some overdue refactoring is really _not_
> > acceptable.
> 
> Mike, you're the one who's getting seriously hyperbolic here. You're getting
> frustrated because you've got this one thing you really want to get done, and
> you feel like you're running into a brick wall when I tell you "no".
> 
> And yes, coding in the kernel is a complicated, dangerous environment with 
> many
> rules that need to be respected.
> 
> That does not mean it's ok to be adding to that complexity, and making it even
> more dangerous, without a _really fucking good reason_. This doesn't fly. 
> Maybe
> it would if it was some device mapper private thing, but you're acting like 
> it's
> only going to be used by device mapper when you're trying to add it to the
> public interface for core block layer bio code. _That_ needs real 
> justification.
> 
> Also, bio_iter is something we should definitely be considering because of the
> way integrity and now crypt has been tacked on to struct bio.
> 
> When I originally wrote the modern bvec_iter code, the ability to use an
> iterator besides the one in struct bio was an important piece of 
> functionality,
> one that's still in use (including in device mapper; see
> __bio_for_each_segment()). The fact that we're growing additional data
> structures that in theory want to be iterated in lockstep with the main bio
> payload but _aren't_ iterated over with bi_iter is, at best, a code smell and 
> a
> lurking footgun.
> 
> However, I can see that the two of you are not likely take on figuring out how
> to clean that up, and truthfully I don't have the time right now either, much 
> as
> it pains me.
> 
> Here's an alternative approach:
> 
> The fundamental problem with bio_rewind() (and I know that you two are super
> serious that this is completely safe for your use case and no one else is 
> going
> to use it for anything else) is that we're using it to get back to some 
> initial
> state, but it's not invariant w.r.t. what's been done to the bio since then, 
> and
> the nature of the block layer is that that's a problem.
> 
> So here's what you do:
> 
> You bring back bi_done: bi_done counts bytes advanced, total, since the start
> of the bio. Then we introduce a type:
> 
> struct bio_pos {
>   unsignedbi_done;
>   unsignedbi_size;
> };
> 
> And two new functions:
> 
> struct bio_pos bio_get_pos(struct bio *)
> {
>   ...
> }
> 
> void bio_set_pos(struct bio *, struct bio_pos)
> {
>   ...
> }
> 
> That gets you the same functionality as bio_rewind(), but it'll be much more
> broadly useful.

What is the difference between bio_set_pos() and bio_rewind()? Both have
to restore bio->bi_iter (the sector part and the bvec part).

Also, how is ->bi_done, which 'counts bytes advanced', supposed to be
updated? You mean doing it in every bio_advance()? Then no: why do we
have to pay the cost for every

Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API

2022-06-28 Thread Ming Lei
On Tue, Jun 28, 2022 at 12:36:17PM -0400, Kent Overstreet wrote:
> On Tue, Jun 28, 2022 at 03:49:28PM +0800, Ming Lei wrote:
> > On Tue, Jun 28, 2022 at 12:26:10AM -0400, Kent Overstreet wrote:
> > > On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> > > > Not to mention bio_iter: bvec_iter is already 32 bytes, which is too
> > > > big to hold in a per-io data structure. With this patch, 8 bytes are
> > > > enough to rewind one bio if the end sector is fixed.
> > > 
> > > And with rewind, you're making an assumption about the state the iterator 
> > > is
> > > going to be in when the IO has completed.
> > > 
> > > What if the iterator was never advanced?
> > 
> > bio_rewind() works as expected if the iterator doesn't advance, since
> > the bytes between the recorded position and the end position aren't
> > changed, and the end position stays the same.
> > 
> > > 
> > > So say you check for that by saving some other part of the iterator - but 
> > > that
> > > may have legitimately changed too, if the bio was redirected (bi_sector 
> > > changes)
> > > or trimmed (bi_size changes)
> > > 
> > > I still think this is an inherently buggy interface, the way it's being 
> > > proposed
> > > to be used.
> > 
> > The patch did mention that the interface should be for situations in
> > which the end sector of the bio won't change.
> 
> But that's an assumption that you simply can't make!

Of course we can, at least for this DM use case: the bio is issued
from the FS or split by DM, it won't be issued to the underlying queue
any more, and it is simply owned by DM core code.

> 
> We allow block device drivers to be stacked in _any_ combination. After a bio 
> is
> completed it may have been partially advanced, fully advanced, trimmed, not
> trimmed, anything - and bi_sector and thus also bio_end_sector() may have
> changed, and will have if there's partition tables involved.

How can a (partial) bio advance change the bio's end sector?

The bio's end sector can be changed only when bio->bi_iter.bi_size is
changed manually (including via bio_trim()), or when ->bi_bdev is changed.
But inside one driver, if the bio is owned by that driver (such as when
the driver is the final-layer request-based driver), the assumption can
often be made.
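
To spell out the invariant being relied on (bio_end_sector() is defined
in include/linux/bio.h as bi_sector plus bio_sectors()):

	sector_t end = bio_end_sector(bio);	/* bi_sector + (bi_size >> 9) */

	bio_advance(bio, nbytes);		/* bi_sector grows by nbytes >> 9,
						 * bi_size shrinks by nbytes */

	/* their sum, i.e. the end sector, is unchanged by the advance */
	WARN_ON(bio_end_sector(bio) != end);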


Thanks,
Ming



Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API

2022-06-28 Thread Ming Lei
On Tue, Jun 28, 2022 at 12:13:06PM -0600, Jens Axboe wrote:
> On 6/27/22 10:20 PM, Kent Overstreet wrote:
> > On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> >> On Sun, Jun 26, 2022 at 04:14:58PM -0400, Kent Overstreet wrote:
> >>> On Fri, Jun 24, 2022 at 10:12:52PM +0800, Ming Lei wrote:
> >>>> Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removes
> >>>> the similar API because the following reasons:
> >>>>
> >>>> ```
> >>>> It is pointed that bio_rewind_iter() is one very bad API[1]:
> >>>>
> >>>> 1) bio size may not be restored after rewinding
> >>>>
> >>>> 2) it causes some bogus change, such as 5151842b9d8732 (block: reset
> >>>> bi_iter.bi_done after splitting bio)
> >>>>
> >>>> 3) rewinding really makes things complicated wrt. bio splitting
> >>>>
> >>>> 4) unnecessary updating of .bi_done in fast path
> >>>>
> >>>> [1] https://marc.info/?t=15354992425&r=1&w=2
> >>>>
> >>>> So this patch takes Kent's suggestion to restore one bio into its 
> >>>> original
> >>>> state via saving bio iterator(struct bvec_iter) in 
> >>>> bio_integrity_prep(),
> >>>> given now bio_rewind_iter() is only used by bio integrity code.
> >>>> ```
> >>>>
> >>>> However, it isn't easy to restore bio by saving 32 bytes bio->bi_iter, 
> >>>> and saving
> >>>> it only can't restore crypto and integrity info.
> >>>>
> >>>> Add bio_rewind() back for some use cases which may not be same with
> >>>> previous generic case:
> >>>>
> >>>> 1) most of bio has fixed end sector since bio split is done from front 
> >>>> of the bio,
> >>>> if driver just records how many sectors between current bio's start 
> >>>> sector and
> >>>> the bio's end sector, the original position can be restored
> >>>>
> >>>> 2) if one bio's end sector won't change, usually bio_trim() isn't 
> >>>> called, user can
> >>>> restore original position by storing sectors from current 
> >>>> ->bi_iter.bi_sector to
> >>>> bio's end sector; together by saving bio size, 8 bytes can restore to
> >>>> original bio.
> >>>>
> >>>> 3) dm's requeue use case: when BLK_STS_DM_REQUEUE happens, dm core needs 
> >>>> to
> >>>> restore to the original bio which represents current dm io to be 
> >>>> requeued.
> >>>> By storing sectors to the bio's end sector and dm io's size,
> >>>> bio_rewind() can restore such original bio, then dm core code needn't to
> >>>> allocate one bio beforehand just for handling BLK_STS_DM_REQUEUE which
> >>>> is actually one unusual event.
> >>>>
> >>>> 4) Not like original rewind API, this one needn't to add .bi_done, and 
> >>>> no any
> >>>> effect on fast path
> >>>
> >>> It seems like perhaps the real issue here is that we need a real bio_iter,
> >>> separate from bvec_iter, that also encapsulates iterating over integrity &
> >>> fscrypt. 
> >>
> >> Not to mention bio_iter: bvec_iter is already 32 bytes, which is too
> >> big to hold in a per-io data structure. With this patch, 8 bytes are
> >> enough to rewind one bio if the end sector is fixed.
> > 
> > Hold on though, does that check out? Why is that too big for per IO data
> > structures?
> > 
> > By definition these structures are only for IOs in flight, and we don't 
> > _want_
> > there to ever be very many of these or we're going to run into latency 
> > issues
> > due to queue depth.
> 
> It's much less about using whatever amount of memory for inflight IO,
> and much more about not bloating fast path structures (of which the bio
> is certainly one). All of this gunk has to be initialized for each IO,
> and that's the real issue.

Can't agree more, especially since the initialization would be done just
for the unusual DM_REQUEUE event (when a bio rewind is needed).


Thanks,
Ming



Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API

2022-06-28 Thread Ming Lei
On Tue, Jun 28, 2022 at 12:26:10AM -0400, Kent Overstreet wrote:
> On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> > Not to mention bio_iter: bvec_iter is already 32 bytes, which is too
> > big to hold in a per-io data structure. With this patch, 8 bytes are
> > enough to rewind one bio if the end sector is fixed.
> 
> And with rewind, you're making an assumption about the state the iterator is
> going to be in when the IO has completed.
> 
> What if the iterator was never advanced?

bio_rewind() works as expected if the iterator doesn't advance, since the
bytes between the recorded position and the end position aren't changed,
and the end position stays the same.

> 
> So say you check for that by saving some other part of the iterator - but that
> may have legitimately changed too, if the bio was redirected (bi_sector 
> changes)
> or trimmed (bi_size changes)
> 
> I still think this is an inherently buggy interface, the way it's being 
> proposed
> to be used.

The patch did mention that the interface should be for situations in
which the end sector of the bio won't change.


Thanks,
Ming



Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API

2022-06-28 Thread Ming Lei
On Tue, Jun 28, 2022 at 12:20:16AM -0400, Kent Overstreet wrote:
> On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> > On Sun, Jun 26, 2022 at 04:14:58PM -0400, Kent Overstreet wrote:
> > > On Fri, Jun 24, 2022 at 10:12:52PM +0800, Ming Lei wrote:
> > > > Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removes
> > > > the similar API because the following reasons:
> > > > 
> > > > ```
> > > > It is pointed that bio_rewind_iter() is one very bad API[1]:
> > > > 
> > > > 1) bio size may not be restored after rewinding
> > > > 
> > > > 2) it causes some bogus change, such as 5151842b9d8732 (block: reset
> > > > bi_iter.bi_done after splitting bio)
> > > > 
> > > > 3) rewinding really makes things complicated wrt. bio splitting
> > > > 
> > > > 4) unnecessary updating of .bi_done in fast path
> > > > 
> > > > [1] https://marc.info/?t=15354992425&r=1&w=2
> > > > 
> > > > So this patch takes Kent's suggestion to restore one bio into its 
> > > > original
> > > > state via saving bio iterator(struct bvec_iter) in 
> > > > bio_integrity_prep(),
> > > > given now bio_rewind_iter() is only used by bio integrity code.
> > > > ```
> > > > 
> > > > However, it isn't easy to restore bio by saving 32 bytes bio->bi_iter, 
> > > > and saving
> > > > it only can't restore crypto and integrity info.
> > > > 
> > > > Add bio_rewind() back for some use cases which may not be same with
> > > > previous generic case:
> > > > 
> > > > 1) most of bio has fixed end sector since bio split is done from front 
> > > > of the bio,
> > > > if driver just records how many sectors between current bio's start 
> > > > sector and
> > > > the bio's end sector, the original position can be restored
> > > > 
> > > > 2) if one bio's end sector won't change, usually bio_trim() isn't 
> > > > called, user can
> > > > restore original position by storing sectors from current 
> > > > ->bi_iter.bi_sector to
> > > > bio's end sector; together by saving bio size, 8 bytes can restore to
> > > > original bio.
> > > > 
> > > > 3) dm's requeue use case: when BLK_STS_DM_REQUEUE happens, dm core 
> > > > needs to
> > > > restore to the original bio which represents current dm io to be 
> > > > requeued.
> > > > By storing sectors to the bio's end sector and dm io's size,
> > > > bio_rewind() can restore such original bio, then dm core code needn't to
> > > > allocate one bio beforehand just for handling BLK_STS_DM_REQUEUE which
> > > > is actually one unusual event.
> > > > 
> > > > 4) Not like original rewind API, this one needn't to add .bi_done, and 
> > > > no any
> > > > effect on fast path
> > > 
> > > It seems like perhaps the real issue here is that we need a real bio_iter,
> > > separate from bvec_iter, that also encapsulates iterating over integrity &
> > > fscrypt. 
> > 
> > Not to mention bio_iter: bvec_iter is already 32 bytes, which is too
> > big to hold in a per-io data structure. With this patch, 8 bytes are
> > enough to rewind one bio if the end sector is fixed.
> 
> Hold on though, does that check out? Why is that too big for per IO data
> structures?
> 
> By definition these structures are only for IOs in flight, and we don't _want_
> there to ever be very many of these or we're going to run into latency issues
> due to queue depth.

I don't see that there is a 'queue depth' for bios or bio-based drivers.

32 bytes is already big, and the memory footprint increases too, since the
data has to be prepared for a possible future rewind. If crypt or
integrity is considered, it can be bigger.



Thanks, 
Ming



Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API

2022-06-27 Thread Ming Lei
On Sun, Jun 26, 2022 at 02:37:22PM -0700, Eric Biggers wrote:
> On Fri, Jun 24, 2022 at 10:12:52PM +0800, Ming Lei wrote:
> > diff --git a/block/blk-crypto.c b/block/blk-crypto.c
> > index a496aaef85ba..caae2f429fc7 100644
> > --- a/block/blk-crypto.c
> > +++ b/block/blk-crypto.c
> > @@ -134,6 +134,21 @@ void bio_crypt_dun_increment(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
> > }
> >  }
> >  
> > +/* Decrements @dun by @dec, treating @dun as a multi-limb integer. */
> > +void bio_crypt_dun_decrement(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
> > +unsigned int dec)
> > +{
> > +   int i;
> > +
> > +   for (i = 0; dec && i < BLK_CRYPTO_DUN_ARRAY_SIZE; i++) {
> > +   dun[i] -= dec;
> > +   if (dun[i] > inc)
> > +   dec = 1;
> > +   else
> > +   dec = 0;
> > +   }
> > +}
> 
> This doesn't compile.  Also this doesn't handle underflow into the next limb
> correctly.  A correct version would be:
> 
>   u64 prev = dun[i];
> 
>   dun[i] -= dec;
>   if (dun[i] > prev)
>   dec = 1;
>   else
>   dec = 0;

You are right, thanks for the review!
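
Putting Eric's correction into the whole helper, the fixed version would
presumably read (a sketch, not the final posted patch):

/* Decrements @dun by @dec, treating @dun as a multi-limb integer. */
void bio_crypt_dun_decrement(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
			     unsigned int dec)
{
	int i;

	for (i = 0; dec && i < BLK_CRYPTO_DUN_ARRAY_SIZE; i++) {
		u64 prev = dun[i];

		dun[i] -= dec;
		/* underflow in this limb borrows one from the next limb */
		if (dun[i] > prev)
			dec = 1;
		else
			dec = 0;
	}
}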

Thanks,
Ming



Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API

2022-06-27 Thread Ming Lei
On Sun, Jun 26, 2022 at 04:14:58PM -0400, Kent Overstreet wrote:
> On Fri, Jun 24, 2022 at 10:12:52PM +0800, Ming Lei wrote:
> > Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removes
> > the similar API because the following reasons:
> > 
> > ```
> > It is pointed that bio_rewind_iter() is one very bad API[1]:
> > 
> > 1) bio size may not be restored after rewinding
> > 
> > 2) it causes some bogus change, such as 5151842b9d8732 (block: reset
> > bi_iter.bi_done after splitting bio)
> > 
> > 3) rewinding really makes things complicated wrt. bio splitting
> > 
> > 4) unnecessary updating of .bi_done in fast path
> > 
> > [1] https://marc.info/?t=15354992425&r=1&w=2
> > 
> > So this patch takes Kent's suggestion to restore one bio into its 
> > original
> > state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
> > given now bio_rewind_iter() is only used by bio integrity code.
> > ```
> > 
> > However, it isn't easy to restore bio by saving 32 bytes bio->bi_iter, and 
> > saving
> > it only can't restore crypto and integrity info.
> > 
> > Add bio_rewind() back for some use cases which may not be same with
> > previous generic case:
> > 
> > 1) most of bio has fixed end sector since bio split is done from front of 
> > the bio,
> > if driver just records how many sectors between current bio's start sector 
> > and
> > the bio's end sector, the original position can be restored
> > 
> > 2) if one bio's end sector won't change, usually bio_trim() isn't called, 
> > user can
> > restore original position by storing sectors from current 
> > ->bi_iter.bi_sector to
> > bio's end sector; together by saving bio size, 8 bytes can restore to
> > original bio.
> > 
> > 3) dm's requeue use case: when BLK_STS_DM_REQUEUE happens, dm core needs to
> > restore to the original bio which represents current dm io to be requeued.
> > By storing sectors to the bio's end sector and dm io's size,
> > bio_rewind() can restore such original bio, then dm core code needn't to
> > allocate one bio beforehand just for handling BLK_STS_DM_REQUEUE which
> > is actually one unusual event.
> > 
> > 4) Not like original rewind API, this one needn't to add .bi_done, and no 
> > any
> > effect on fast path
> 
> It seems like perhaps the real issue here is that we need a real bio_iter,
> separate from bvec_iter, that also encapsulates iterating over integrity &
> fscrypt. 

Not to mention bio_iter: bvec_iter is already 32 bytes, which is too big
to hold in a per-io data structure. With this patch, 8 bytes are enough
to rewind one bio if the end sector is fixed.
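
As an illustration of that 8-byte bookkeeping (the struct and helper
names here are made up; dm's actual fields are io->sector_offset and the
dm io size):

struct io_pos {
	u32 sectors_to_end;	/* bio_end_sector(bio) - bi_iter.bi_sector */
	u32 size;		/* bi_iter.bi_size when recorded */
};

static void io_pos_save(struct bio *bio, struct io_pos *pos)
{
	pos->sectors_to_end = bio_end_sector(bio) - bio->bi_iter.bi_sector;
	pos->size = bio->bi_iter.bi_size;
}

static void io_pos_restore(struct bio *bio, struct io_pos *pos)
{
	u32 cur = bio_end_sector(bio) - bio->bi_iter.bi_sector;

	/* advancing kept the end sector fixed, so the delta to rewind is
	 * simply the change in distance-to-end, converted to bytes */
	bio_rewind(bio, (pos->sectors_to_end - cur) << 9);

	/* restoring the size is the caller's job per bio_rewind()'s doc */
	bio->bi_iter.bi_size = pos->size;
}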


Thanks,
Ming



[dm-devel] [PATCH 5.20 4/4] dm: add two stage requeue

2022-06-24 Thread Ming Lei
Commit 7dd76d1feec7 ("dm: improve bio splitting and associated IO accounting")
makes each dm io's original bio point to the same original bio from the
upper layer, so more than one dm io can share one original bio in case of
splitting. This is fine if all dm ios complete successfully. But if
BLK_STS_DM_REQUEUE is returned from a cloned bio, the current code will
requeue the shared original bio and cause the following issues:

1) the shared original bio has been trimmed and mapped to the last dm
   io, so it no longer matches if this original bio is requeued

2) more than one dm io completion may touch the single shared original
   bio; for example, the bio may have been submitted in one code path
   while another code path is ending it, so this way is very fragile.

The patch 'dm: fix BLK_STS_DM_REQUEUE handling when dm_io' can fix the
issue, but it still needs to clone one backing bio in case of a split.
We can solve the issue with a two-stage requeue, with the help of the
newly added bio_rewind(), so that the bio clone is only needed after
BLK_STS_DM_REQUEUE actually happens:

1) requeue the dm io onto the added requeue list and schedule it via the
newly added requeue work; this stage just clones/allocates a mapped
original bio for the requeue, and we recover the original bio via
bio_rewind().

2) the 2nd-stage requeue is the same as the original requeue, but
io->orig_bio points to the new cloned bio, which matches the requeued
dm io.
Signed-off-by: Ming Lei 
---
 drivers/md/dm-core.h |  11 +++-
 drivers/md/dm.c  | 131 ++-
 2 files changed, 116 insertions(+), 26 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index c954ff91870e..0545ce441427 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -22,6 +22,8 @@
 
 #define DM_RESERVED_MAX_IOS1024
 
+struct dm_io;
+
 struct dm_kobject_holder {
struct kobject kobj;
struct completion completion;
@@ -91,6 +93,14 @@ struct mapped_device {
spinlock_t deferred_lock;
struct bio_list deferred;
 
+   /*
+    * requeue work context is needed for cloning one new bio
+    * to represent the dm_io to be requeued, since each
+    * dm_io may point to the original bio from the FS.
+    */
+   struct work_struct requeue_work;
+   struct dm_io *requeue_list;
+
void *interface_ptr;
 
/*
@@ -272,7 +282,6 @@ struct dm_io {
atomic_t io_count;
struct mapped_device *md;
 
-   struct bio *split_bio;
/* The three fields represent mapped part of original bio */
struct bio *orig_bio;
unsigned int sector_offset; /* offset to end of orig_bio */
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ee22c763873f..c2b95b931c31 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -594,7 +594,6 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
 atomic_set(&io->io_count, 2);
 this_cpu_inc(*md->pending_io);
 io->orig_bio = bio;
-   io->split_bio = NULL;
 io->md = md;
 spin_lock_init(&io->lock);
 io->start_time = jiffies;
@@ -884,10 +883,32 @@ static int __noflush_suspending(struct mapped_device *md)
 return test_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
 }
 
+static void dm_requeue_add_io(struct dm_io *io, bool first_stage)
+{
+   struct mapped_device *md = io->md;
+
+   if (first_stage) {
+   struct dm_io *next = md->requeue_list;
+
+   md->requeue_list = io;
+   io->next = next;
+   } else {
   bio_list_add_head(&md->deferred, io->orig_bio);
+   }
+}
+
+static void dm_requeue_schedule(struct mapped_device *md, bool first_stage)
+{
+   if (first_stage)
   queue_work(md->wq, &md->requeue_work);
+   else
   queue_work(md->wq, &md->work);
+}
+
 /* Return true if the original bio is requeued */
-static bool dm_handle_requeue(struct dm_io *io)
+static bool dm_handle_requeue(struct dm_io *io, bool first_stage)
 {
-   struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
+   struct bio *bio = io->orig_bio;
bool need_requeue = (io->status == BLK_STS_DM_REQUEUE);
bool handle_eagain = (io->status == BLK_STS_AGAIN) &&
(bio->bi_opf & REQ_POLLED);
@@ -913,9 +934,9 @@ static bool dm_handle_requeue(struct dm_io *io)
 spin_lock_irqsave(&md->deferred_lock, flags);
if ((__noflush_suspending(md) &&
!WARN_ON_ONCE(dm_is_zone_write(md, bio))) ||
-   handle_eagain) {
+   handle_eagain || first_stage) {
/* NOTE early return due to BLK_STS_DM_REQUEUE below */
-   bio_list_add_head(&md->deferred, bio);
+   dm_requeue_add_io(io, first_stage);
requeued = true;
  

[dm-devel] [PATCH 5.20 3/4] dm: improve handling for DM_REQUEUE and AGAIN

2022-06-24 Thread Ming Lei
When BLK_STS_DM_REQUEUE is returned, or BLK_STS_AGAIN is returned for
POLLED io, we requeue the original bio onto the deferred list and ask
md->wq to re-submit it to the block layer.

Improve the handling in the following way:

1) unify handling for BLK_STS_DM_REQUEUE and BLK_STS_AGAIN, and clear
REQ_POLLED for BLK_STS_DM_REQUEUE too, for the sake of simplicity,
given BLK_STS_DM_REQUEUE is very unusual

2) queue md->wq explicitly in __dm_io_complete(), so requeue handling
becomes more robust

Signed-off-by: Ming Lei 
---
 drivers/md/dm.c | 58 +
 1 file changed, 34 insertions(+), 24 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index a9e5e429c150..ee22c763873f 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -884,20 +884,39 @@ static int __noflush_suspending(struct mapped_device *md)
return test_bit(DMF_NOFLUSH_SUSPENDING, >flags);
 }
 
-static void dm_handle_requeue(struct dm_io *io)
+/* Return true if the original bio is requeued */
+static bool dm_handle_requeue(struct dm_io *io)
 {
-   if (io->status == BLK_STS_DM_REQUEUE) {
-   struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
-   struct mapped_device *md = io->md;
+   struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
+   bool need_requeue = (io->status == BLK_STS_DM_REQUEUE);
+   bool handle_eagain = (io->status == BLK_STS_AGAIN) &&
+   (bio->bi_opf & REQ_POLLED);
+   struct mapped_device *md = io->md;
+   bool requeued = false;
+
+   if (need_requeue || handle_eagain) {
unsigned long flags;
+
+   if (bio->bi_opf & REQ_POLLED) {
+   /*
+* Upper layer won't help us poll split bio
+* (io->orig_bio may only reflect a subset of the
+* pre-split original) so clear REQ_POLLED in case
+* of requeue.
+*/
+   bio_clear_polled(bio);
+   }
+
/*
 * Target requested pushing back the I/O.
 */
 spin_lock_irqsave(&md->deferred_lock, flags);
-   if (__noflush_suspending(md) &&
-   !WARN_ON_ONCE(dm_is_zone_write(md, bio))) {
+   if ((__noflush_suspending(md) &&
+   !WARN_ON_ONCE(dm_is_zone_write(md, bio))) ||
+   handle_eagain) {
/* NOTE early return due to BLK_STS_DM_REQUEUE below */
 bio_list_add_head(&md->deferred, bio);
+   requeued = true;
} else {
/*
 * noflush suspend was interrupted or this is
@@ -907,6 +926,10 @@ static void dm_handle_requeue(struct dm_io *io)
}
 spin_unlock_irqrestore(&md->deferred_lock, flags);
}
+
+   if (requeued)
+   queue_work(md->wq, &md->work);
+   return requeued;
 }
 
 static void dm_io_complete(struct dm_io *io)
@@ -914,8 +937,9 @@ static void dm_io_complete(struct dm_io *io)
struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
struct mapped_device *md = io->md;
blk_status_t io_error;
+   bool requeued;
 
-   dm_handle_requeue(io);
+   requeued = dm_handle_requeue(io);
 
io_error = io->status;
if (dm_io_flagged(io, DM_IO_ACCOUNTED))
@@ -936,23 +960,9 @@ static void dm_io_complete(struct dm_io *io)
 if (unlikely(wq_has_sleeper(&md->wait)))
 wake_up(&md->wait);
 
-   if (io_error == BLK_STS_DM_REQUEUE || io_error == BLK_STS_AGAIN) {
-   if (bio->bi_opf & REQ_POLLED) {
-   /*
-    * Upper layer won't help us poll split bio (io->orig_bio
-    * may only reflect a subset of the pre-split original)
-    * so clear REQ_POLLED in case of requeue.
-    */
-   bio_clear_polled(bio);
-   if (io_error == BLK_STS_AGAIN) {
-   /* io_uring doesn't handle BLK_STS_AGAIN (yet) */
-   queue_io(md, bio);
-   return;
-   }
-   }
-   if (io_error == BLK_STS_DM_REQUEUE)
-   return;
-   }
+   /* We have requeued, so return now */
+   if (requeued)
+   return;
 
if (bio_is_flush_with_data(bio)) {
/*
-- 
2.31.1




[dm-devel] [PATCH 5.20 2/4] dm: add new helper for handling dm_io requeue

2022-06-24 Thread Ming Lei
Add helper of dm_handle_requeue() for handling dm_io requeue.

Signed-off-by: Ming Lei 
---
 drivers/md/dm.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 2b75f1ef7386..a9e5e429c150 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -884,13 +884,11 @@ static int __noflush_suspending(struct mapped_device *md)
 return test_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
 }
 
-static void dm_io_complete(struct dm_io *io)
+static void dm_handle_requeue(struct dm_io *io)
 {
-   blk_status_t io_error;
-   struct mapped_device *md = io->md;
-   struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
-
if (io->status == BLK_STS_DM_REQUEUE) {
+   struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
+   struct mapped_device *md = io->md;
unsigned long flags;
/*
 * Target requested pushing back the I/O.
@@ -909,6 +907,15 @@ static void dm_io_complete(struct dm_io *io)
}
 spin_unlock_irqrestore(&md->deferred_lock, flags);
}
+}
+
+static void dm_io_complete(struct dm_io *io)
+{
+   struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
+   struct mapped_device *md = io->md;
+   blk_status_t io_error;
+
+   dm_handle_requeue(io);
 
io_error = io->status;
if (dm_io_flagged(io, DM_IO_ACCOUNTED))
-- 
2.31.1




[dm-devel] [PATCH 5.20 0/4] block/dm: add bio_rewind for improving dm requeue

2022-06-24 Thread Ming Lei
Hello Guys,

The 1st patch adds bio_rewind(), which can restore a bio to its original
position by recording the sectors between the original position and the
bio's end sector, provided the bio's end sector won't change, which
should be very common.

The 2nd and 3rd patches clean up the dm code for handling requeue and
completion.

The last patch implements a two-stage dm io requeue to avoid allocating
one bio beforehand just for handling requeue, which is an unusual event.
The 1st-stage requeue is added for cloning and restoring the original bio
in wq context; the 2nd-stage requeue then uses that as the original bio
for handling the requeue.


Ming Lei (4):
  block: add bio_rewind() API
  dm: add new helper for handling dm_io requeue
  dm: improve handling for DM_REQUEUE and AGAIN
  dm: add two stage requeue

 block/bio-integrity.c   |  19 
 block/bio.c |  19 
 block/blk-crypto-internal.h |   7 ++
 block/blk-crypto.c  |  23 +
 drivers/md/dm-core.h|  11 ++-
 drivers/md/dm.c | 180 
 include/linux/bio.h |  21 +
 include/linux/bvec.h|  33 +++
 8 files changed, 271 insertions(+), 42 deletions(-)

-- 
2.31.1




[dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API

2022-06-24 Thread Ming Lei
Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removed
the similar API for the following reasons:

```
It is pointed that bio_rewind_iter() is one very bad API[1]:

1) bio size may not be restored after rewinding

2) it causes some bogus change, such as 5151842b9d8732 (block: reset
bi_iter.bi_done after splitting bio)

3) rewinding really makes things complicated wrt. bio splitting

4) unnecessary updating of .bi_done in fast path

[1] https://marc.info/?t=15354992425&r=1&w=2

So this patch takes Kent's suggestion to restore one bio into its original
state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
given now bio_rewind_iter() is only used by bio integrity code.
```

However, it isn't easy to restore a bio by saving the 32-byte
bio->bi_iter, and saving it alone can't restore the crypto and integrity
info.

Add bio_rewind() back for some use cases which may not be the same as the
previous generic case:

1) most bios have a fixed end sector, since bio splits are done from the
front of the bio; if the driver just records how many sectors lie between
the current bio's start sector and the bio's end sector, the original
position can be restored

2) if a bio's end sector won't change (usually bio_trim() isn't called),
the user can restore the original position by storing the sectors from
the current ->bi_iter.bi_sector to the bio's end sector; together with
the saved bio size, 8 bytes are enough to restore the original bio

3) dm's requeue use case: when BLK_STS_DM_REQUEUE happens, dm core needs
to restore the original bio which represents the current dm io to be
requeued. By storing the sectors to the bio's end sector and the dm io's
size, bio_rewind() can restore such an original bio, so dm core needn't
allocate one bio beforehand just for handling BLK_STS_DM_REQUEUE, which
is actually an unusual event

4) unlike the original rewind API, this one needn't add .bi_done, and has
no effect on the fast path

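A standalone model of case 2)'s arithmetic (plain userspace C,
illustrative only -- the kernel code operates on struct bvec_iter):

        #include <assert.h>
        #include <stdint.h>

        struct iter { uint64_t sector; uint32_t size; }; /* models bvec_iter */

        int main(void)
        {
                struct iter it = { .sector = 1000, .size = 64 << 9 };

                /* the 8 bytes saved up front */
                uint32_t to_end = it.size >> 9; /* sectors to the fixed end */
                uint32_t size = it.size;

                /* advancing moves the start forward; the end sector is fixed */
                it.sector += 48;
                it.size -= 48 << 9;

                /* rewind: recompute the fixed end sector, then step back */
                it.sector = (it.sector + (it.size >> 9)) - to_end;
                it.size = size;

                assert(it.sector == 1000 && it.size == 64 << 9);
                return 0;
        }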
Cc: Eric Biggers 
Cc: Kent Overstreet 
Cc: Dmitry Monakhov 
Cc: Martin K. Petersen 
Signed-off-by: Ming Lei 
---
 block/bio-integrity.c   | 19 +++
 block/bio.c | 19 +++
 block/blk-crypto-internal.h |  7 +++
 block/blk-crypto.c  | 23 +++
 include/linux/bio.h | 21 +
 include/linux/bvec.h| 33 +
 6 files changed, 122 insertions(+)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 32929c89ba8a..06c2fe81fdf2 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -378,6 +378,25 @@ void bio_integrity_advance(struct bio *bio, unsigned int 
bytes_done)
bvec_iter_advance(bip->bip_vec, &bip->bip_iter, bytes);
 }
 
+/**
+ * bio_integrity_rewind - Rewind integrity vector
+ * @bio:   bio whose integrity vector to update
+ * @bytes_done:number of data bytes to rewind
+ *
+ * Description: This function calculates how many integrity bytes the
+ * number of completed data bytes correspond to and rewind the
+ * integrity vector accordingly.
+ */
+void bio_integrity_rewind(struct bio *bio, unsigned int bytes_done)
+{
+   struct bio_integrity_payload *bip = bio_integrity(bio);
+   struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
+   unsigned bytes = bio_integrity_bytes(bi, bytes_done >> 9);
+
+   bip->bip_iter.bi_sector -= bio_integrity_intervals(bi, bytes_done >> 9);
+   bvec_iter_rewind(bip->bip_vec, &bip->bip_iter, bytes);
+}
+
 /**
  * bio_integrity_trim - Trim integrity vector
  * @bio:   bio whose integrity vector to update
diff --git a/block/bio.c b/block/bio.c
index 51c99f2c5c90..5318944b7b18 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1360,6 +1360,25 @@ void __bio_advance(struct bio *bio, unsigned bytes)
 }
 EXPORT_SYMBOL(__bio_advance);
 
+/**
+ * bio_rewind - rewind @bio by @bytes
+ * @bio: bio to rewind
+ * @bytes: how many bytes to rewind
+ *
+ * Update ->bi_iter of @bio by rewinding @bytes. Most bios have a fixed end
+ * sector, so it is easy to rewind from the end of the bio and restore its
+ * original position; the caller is responsible for restoring the bio size.
+ */
+void bio_rewind(struct bio *bio, unsigned bytes)
+{
+   if (bio_integrity(bio))
+   bio_integrity_rewind(bio, bytes);
+
+   bio_crypt_rewind(bio, bytes);
+   bio_rewind_iter(bio, &bio->bi_iter, bytes);
+}
+EXPORT_SYMBOL(bio_rewind);
+
 void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
struct bio *src, struct bvec_iter *src_iter)
 {
diff --git a/block/blk-crypto-internal.h b/block/blk-crypto-internal.h
index e6818ffaddbf..b723599bbf99 100644
--- a/block/blk-crypto-internal.h
+++ b/block/blk-crypto-internal.h
@@ -114,6 +114,13 @@ static inline void bio_crypt_advance(struct bio *bio, 
unsigned int bytes)
__bio_crypt_advance(bio, bytes);
 }
 
+void __bio_crypt_rewind(struct bio *bio, unsigned int bytes);

[dm-devel] [PATCH] dm: fix dm io BLK_STS_DM_REQUEUE

2022-06-23 Thread Ming Lei
Commit 7dd76d1feec7 ("dm: improve bio splitting and associated IO accounting")
removes the cloned bio when dm io splitting is needed. This makes
multiple dm io instances share the same original bio, and it works fine if
the IOs complete successfully. But a regression may be caused if
BLK_STS_DM_REQUEUE is returned from any one of the cloned ios.

In case of BLK_STS_DM_REQUEUE from one cloned io, only the mapped part
of the original bio for that exact dm io needs to be re-submitted.
However, since the original bio is shared among all dm io instances,
it actually only represents the last dm io instance, so
requeue can't work as expected. Also when more than one dm io is
requeued, the same original bio is requeued from every dm io's completion
handler, causing a race.

Fix the issue by still allocating one bio for completing io only, so that
io accounting can rely on ->orig_bio.

Based on an earlier version from Mike.

In theory, we could delay the bio clone until BLK_STS_DM_REQUEUE happens,
but that approach is a bit complicated: 1) the bio clone needs to be done
in task context; 2) a block interface for unwinding a bio is required.

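In outline, the fix below does the following (simplified; error handling
and the unchanged surroundings are omitted):

        mapped = bio_split(bio, mapped_sectors, GFP_NOIO, &md->queue->bio_split);
        bio_chain(mapped, bio);   /* bio now completes only after mapped does */
        io->bak_bio = mapped;     /* completion/requeue acts on exactly the
                                     mapped part, not on the shared bio */
        submit_bio_noacct(bio);   /* the remainder keeps going */

where mapped_sectors stands for bio_sectors(bio) - ci.sector_count.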
Cc: Benjamin Marzinski 
Fixes: 7dd76d1feec7 ("dm: improve bio splitting and associated IO accounting")
Signed-off-by: Ming Lei 
---
 drivers/md/dm-core.h |  2 +-
 drivers/md/dm.c  | 10 ++
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 54c0473a51dd..32f461c624c6 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -273,7 +273,7 @@ struct dm_io {
struct mapped_device *md;
 
/* The three fields represent mapped part of original bio */
-   struct bio *orig_bio;
+   struct bio *orig_bio, *bak_bio;
unsigned int sector_offset; /* offset to end of orig_bio */
unsigned int sectors;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 9ede55278eec..85d8f2f1c9c8 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -594,6 +594,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, 
struct bio *bio)
atomic_set(&io->io_count, 2);
this_cpu_inc(*md->pending_io);
io->orig_bio = bio;
+   io->bak_bio = NULL;
io->md = md;
spin_lock_init(&io->lock);
io->start_time = jiffies;
@@ -887,7 +888,7 @@ static void dm_io_complete(struct dm_io *io)
 {
blk_status_t io_error;
struct mapped_device *md = io->md;
-   struct bio *bio = io->orig_bio;
+   struct bio *bio = io->bak_bio ? io->bak_bio : io->orig_bio;
 
if (io->status == BLK_STS_DM_REQUEUE) {
unsigned long flags;
@@ -1693,9 +1694,10 @@ static void dm_split_and_process_bio(struct 
mapped_device *md,
 * Remainder must be passed to submit_bio_noacct() so it gets handled
 * *after* bios already submitted have been completely processed.
 */
-   bio_trim(bio, io->sectors, ci.sector_count);
-   trace_block_split(bio, bio->bi_iter.bi_sector);
-   bio_inc_remaining(bio);
+   io->bak_bio = bio_split(bio, bio_sectors(bio) - ci.sector_count,
+   GFP_NOIO, &md->queue->bio_split);
+   bio_chain(io->bak_bio, bio);
+   trace_block_split(io->bak_bio, bio->bi_iter.bi_sector);
submit_bio_noacct(bio);
 out:
/*
-- 
2.31.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 5/8] dm: always setup ->orig_bio in alloc_io

2022-04-16 Thread Ming Lei
On Fri, Apr 15, 2022 at 05:06:55PM -0400, Mike Snitzer wrote:
> On Thu, Apr 14 2022 at  8:14P -0400,
> Ming Lei  wrote:
> 
> > On Thu, Apr 14, 2022 at 01:45:33PM -0400, Mike Snitzer wrote:
> > > On Wed, Apr 13 2022 at 11:57P -0400,
> > > Ming Lei  wrote:
> > > 
> > > > On Wed, Apr 13, 2022 at 10:25:45PM -0400, Mike Snitzer wrote:
> > > > > On Wed, Apr 13 2022 at  8:36P -0400,
> > > > > Ming Lei  wrote:
> > > > > 
> > > > > > On Wed, Apr 13, 2022 at 01:58:54PM -0400, Mike Snitzer wrote:
> > > > > > > 
> > > > > > > The bigger issue with this patch is that you've caused
> > > > > > > dm_submit_bio_remap() to go back to accounting the entire 
> > > > > > > original bio
> > > > > > > before any split occurs.  That is a problem because you'll end up
> > > > > > > accounting that bio for every split, so in split heavy workloads 
> > > > > > > the
> > > > > > > IO accounting won't reflect when the IO is actually issued and 
> > > > > > > we'll
> > > > > > > regress back to having very inaccurate and incorrect IO 
> > > > > > > accounting for
> > > > > > > dm_submit_bio_remap() heavy targets (e.g. dm-crypt).
> > > > > > 
> > > > > > Good catch, but we know the length of mapped part in original bio 
> > > > > > before
> > > > > > calling __map_bio(), so io->sectors/io->offset_sector can be setup 
> > > > > > here,
> > > > > > something like the following delta change should address it:
> > > > > > 
> > > > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > > > > index db23efd6bbf6..06b554f3104b 100644
> > > > > > --- a/drivers/md/dm.c
> > > > > > +++ b/drivers/md/dm.c
> > > > > > @@ -1558,6 +1558,13 @@ static int __split_and_process_bio(struct 
> > > > > > clone_info *ci)
> > > > > >  
> > > > > > len = min_t(sector_t, max_io_len(ti, ci->sector), 
> > > > > > ci->sector_count);
> > > > > > clone = alloc_tio(ci, ti, 0, &len, GFP_NOIO);
> > > > > > +
> > > > > > +   if (ci->sector_count > len) {
> > > > > > +   /* setup the mapped part for accounting */
> > > > > > +   dm_io_set_flag(ci->io, DM_IO_SPLITTED);
> > > > > > +   ci->io->sectors = len;
> > > > > > +   ci->io->sector_offset = bio_end_sector(ci->bio) - 
> > > > > > ci->sector;
> > > > > > +   }
> > > > > > __map_bio(clone);
> > > > > >  
> > > > > > ci->sector += len;
> > > > > > @@ -1603,11 +1610,6 @@ static void dm_split_and_process_bio(struct 
> > > > > > mapped_device *md,
> > > > > > if (error || !ci.sector_count)
> > > > > > goto out;
> > > > > >  
> > > > > > -   /* setup the mapped part for accounting */
> > > > > > -   dm_io_set_flag(ci.io, DM_IO_SPLITTED);
> > > > > > -   ci.io->sectors = bio_sectors(bio) - ci.sector_count;
> > > > > > -   ci.io->sector_offset = bio_end_sector(bio) - 
> > > > > > bio->bi_iter.bi_sector;
> > > > > > -
> > > > > > bio_trim(bio, ci.io->sectors, ci.sector_count);
> > > > > > trace_block_split(bio, bio->bi_iter.bi_sector);
> > > > > > bio_inc_remaining(bio);
> > > > > > 
> > > > > > -- 
> > > > > > Ming
> > > > > > 
> > > > > 
> > > > > Unfortunately we do need splitting after __map_bio() because a dm
> > > > > target's ->map can use dm_accept_partial_bio() to further reduce a
> > > > > bio's mapped part.
> > > > > 
> > > > > But I think dm_accept_partial_bio() could be trained to update
> > > > > tio->io->sectors?
> > > > 
> > > > ->orig_bio is just for serving io accounting, but ->orig_bio isn't
> > > > passed to dm_accept_partial_bio(), and not gets updated after
> > > > dm_accept_partial_bio() is called.
> > > > 
> > >

Re: [dm-devel] [PATCH 5/8] dm: always setup ->orig_bio in alloc_io

2022-04-14 Thread Ming Lei
On Thu, Apr 14, 2022 at 01:45:33PM -0400, Mike Snitzer wrote:
> On Wed, Apr 13 2022 at 11:57P -0400,
> Ming Lei  wrote:
> 
> > On Wed, Apr 13, 2022 at 10:25:45PM -0400, Mike Snitzer wrote:
> > > On Wed, Apr 13 2022 at  8:36P -0400,
> > > Ming Lei  wrote:
> > > 
> > > > On Wed, Apr 13, 2022 at 01:58:54PM -0400, Mike Snitzer wrote:
> > > > > 
> > > > > The bigger issue with this patch is that you've caused
> > > > > dm_submit_bio_remap() to go back to accounting the entire original bio
> > > > > before any split occurs.  That is a problem because you'll end up
> > > > > accounting that bio for every split, so in split heavy workloads the
> > > > > IO accounting won't reflect when the IO is actually issued and we'll
> > > > > regress back to having very inaccurate and incorrect IO accounting for
> > > > > dm_submit_bio_remap() heavy targets (e.g. dm-crypt).
> > > > 
> > > > Good catch, but we know the length of mapped part in original bio before
> > > > calling __map_bio(), so io->sectors/io->offset_sector can be setup here,
> > > > something like the following delta change should address it:
> > > > 
> > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > > index db23efd6bbf6..06b554f3104b 100644
> > > > --- a/drivers/md/dm.c
> > > > +++ b/drivers/md/dm.c
> > > > @@ -1558,6 +1558,13 @@ static int __split_and_process_bio(struct 
> > > > clone_info *ci)
> > > >  
> > > > len = min_t(sector_t, max_io_len(ti, ci->sector), 
> > > > ci->sector_count);
> > > > clone = alloc_tio(ci, ti, 0, &len, GFP_NOIO);
> > > > +
> > > > +   if (ci->sector_count > len) {
> > > > +   /* setup the mapped part for accounting */
> > > > +   dm_io_set_flag(ci->io, DM_IO_SPLITTED);
> > > > +   ci->io->sectors = len;
> > > > +   ci->io->sector_offset = bio_end_sector(ci->bio) - 
> > > > ci->sector;
> > > > +   }
> > > > __map_bio(clone);
> > > >  
> > > > ci->sector += len;
> > > > @@ -1603,11 +1610,6 @@ static void dm_split_and_process_bio(struct 
> > > > mapped_device *md,
> > > > if (error || !ci.sector_count)
> > > > goto out;
> > > >  
> > > > -   /* setup the mapped part for accounting */
> > > > -   dm_io_set_flag(ci.io, DM_IO_SPLITTED);
> > > > -   ci.io->sectors = bio_sectors(bio) - ci.sector_count;
> > > > -   ci.io->sector_offset = bio_end_sector(bio) - 
> > > > bio->bi_iter.bi_sector;
> > > > -
> > > > bio_trim(bio, ci.io->sectors, ci.sector_count);
> > > > trace_block_split(bio, bio->bi_iter.bi_sector);
> > > > bio_inc_remaining(bio);
> > > > 
> > > > -- 
> > > > Ming
> > > > 
> > > 
> > > Unfortunately we do need splitting after __map_bio() because a dm
> > > target's ->map can use dm_accept_partial_bio() to further reduce a
> > > bio's mapped part.
> > > 
> > > But I think dm_accept_partial_bio() could be trained to update
> > > tio->io->sectors?
> > 
> > ->orig_bio is just for serving io accounting, but ->orig_bio isn't
> > passed to dm_accept_partial_bio(), and not gets updated after
> > dm_accept_partial_bio() is called.
> > 
> > If that is one issue, it must be one existed issue in dm io accounting
> > since ->orig_bio isn't updated when dm_accept_partial_bio() is called.
> 
> Recall that ->orig_bio is updated after the bio_split() at the bottom of
> dm_split_and_process_bio().
> 
> That bio_split() is based on ci->sector_count, which is reduced as a
> side-effect of dm_accept_partial_bio() reducing tio->len_ptr.  It is
> pretty circuitous so I can absolutely understand why you didn't
> immediately appreciate the interface.  The block comment above
> dm_accept_partial_bio() does a pretty comprehensive job of explaining.

Got it now, thanks for the explanation.

As you mentioned, it can be addressed in dm_accept_partial_bio()
by updating tio->io->sectors.


Thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 5/8] dm: always setup ->orig_bio in alloc_io

2022-04-13 Thread Ming Lei
On Wed, Apr 13, 2022 at 10:25:45PM -0400, Mike Snitzer wrote:
> On Wed, Apr 13 2022 at  8:36P -0400,
> Ming Lei  wrote:
> 
> > On Wed, Apr 13, 2022 at 01:58:54PM -0400, Mike Snitzer wrote:
> > > 
> > > The bigger issue with this patch is that you've caused
> > > dm_submit_bio_remap() to go back to accounting the entire original bio
> > > before any split occurs.  That is a problem because you'll end up
> > > accounting that bio for every split, so in split heavy workloads the
> > > IO accounting won't reflect when the IO is actually issued and we'll
> > > regress back to having very inaccurate and incorrect IO accounting for
> > > dm_submit_bio_remap() heavy targets (e.g. dm-crypt).
> > 
> > Good catch, but we know the length of mapped part in original bio before
> > calling __map_bio(), so io->sectors/io->offset_sector can be setup here,
> > something like the following delta change should address it:
> > 
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index db23efd6bbf6..06b554f3104b 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -1558,6 +1558,13 @@ static int __split_and_process_bio(struct clone_info 
> > *ci)
> >  
> > len = min_t(sector_t, max_io_len(ti, ci->sector), ci->sector_count);
> > clone = alloc_tio(ci, ti, 0, &len, GFP_NOIO);
> > +
> > +   if (ci->sector_count > len) {
> > +   /* setup the mapped part for accounting */
> > +   dm_io_set_flag(ci->io, DM_IO_SPLITTED);
> > +   ci->io->sectors = len;
> > +   ci->io->sector_offset = bio_end_sector(ci->bio) - ci->sector;
> > +   }
> > __map_bio(clone);
> >  
> > ci->sector += len;
> > @@ -1603,11 +1610,6 @@ static void dm_split_and_process_bio(struct 
> > mapped_device *md,
> > if (error || !ci.sector_count)
> > goto out;
> >  
> > -   /* setup the mapped part for accounting */
> > -   dm_io_set_flag(ci.io, DM_IO_SPLITTED);
> > -   ci.io->sectors = bio_sectors(bio) - ci.sector_count;
> > -   ci.io->sector_offset = bio_end_sector(bio) - bio->bi_iter.bi_sector;
> > -
> > bio_trim(bio, ci.io->sectors, ci.sector_count);
> > trace_block_split(bio, bio->bi_iter.bi_sector);
> > bio_inc_remaining(bio);
> > 
> > -- 
> > Ming
> > 
> 
> Unfortunately we do need splitting after __map_bio() because a dm
> target's ->map can use dm_accept_partial_bio() to further reduce a
> bio's mapped part.
> 
> But I think dm_accept_partial_bio() could be trained to update
> tio->io->sectors?

->orig_bio is just for serving io accounting, but ->orig_bio isn't
passed to dm_accept_partial_bio(), and not gets updated after
dm_accept_partial_bio() is called.

If that is one issue, it must be one existed issue in dm io accounting
since ->orig_bio isn't updated when dm_accept_partial_bio() is called.

So do we have to update it?

> 
> dm_accept_partial_bio() has been around for a long time, it keeps
> growing BUG_ONs that are actually helpful to narrow its use to "normal
> IO", so it should be OK.
> 
> Running 'make check' in a built cryptsetup source tree should be a
> good test for DM target interface functionality.

Care to share the test tree?

> 
> But there aren't automated tests for IO accounting correctness yet.

I did verify io accounting by running dm-thin with blk-throttle, and the
observed throughput matches the configured setting. I ran with both small
and large bs, so the non-split and split code paths are both covered.

Maybe you can add this kind of test to the automated dm io accounting tests.


Thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 5/8] dm: always setup ->orig_bio in alloc_io

2022-04-13 Thread Ming Lei
On Wed, Apr 13, 2022 at 01:58:54PM -0400, Mike Snitzer wrote:
> On Wed, Apr 13 2022 at  8:26P -0400,
> Ming Lei  wrote:
> 
> > On Wed, Apr 13, 2022 at 02:12:47AM -0400, Mike Snitzer wrote:
> > > On Tue, Apr 12 2022 at  9:56P -0400,
> > > Ming Lei  wrote:
> > > 
> > > > On Tue, Apr 12, 2022 at 04:52:40PM -0400, Mike Snitzer wrote:
> > > > > On Tue, Apr 12 2022 at  4:56P -0400,
> > > > > Ming Lei  wrote:
> > > > > 
> > > > > > The current DM codes setup ->orig_bio after __map_bio() returns,
> > > > > > and not only cause kernel panic for dm zone, but also a bit ugly
> > > > > > and tricky, especially the waiting until ->orig_bio is set in
> > > > > > dm_submit_bio_remap().
> > > > > > 
> > > > > > The reason is that one new bio is cloned from original FS bio to
> > > > > > represent the mapped part, which just serves io accounting.
> > > > > > 
> > > > > > Now we have switched to bdev based io accounting interface, and we
> > > > > > can retrieve sectors/bio_op from both the real original bio and the
> > > > > > added fields of .sector_offset & .sectors easily, so the new cloned
> > > > > > bio isn't necessary any more.
> > > > > > 
> > > > > > Not only fixes dm-zone's kernel panic, but also cleans up dm io
> > > > > > accounting & split a bit.
> > > > > 
> > > > > You're conflating quite a few things here.  DM zone really has no
> > > > > business accessing io->orig_bio (dm-zone.c can just as easily inspect
> > > > > the tio->clone, because it hasn't been remapped yet it reflects the
> > > > > io->origin_bio, so there is no need to look at io->orig_bio) -- but
> > > > > yes I clearly broke things during the 5.18 merge and it needs fixing
> > > > > ASAP.
> > > > 
> > > > You can just consider the cleanup part of this patches, :-)
> > > 
> > > I will.  But your following list doesn't reflect any "cleanup" that I
> > > saw in your patchset.  Pretty fundamental changes that are similar,
> > > but different, to the dm-5.19 changes I've staged.
> > > 
> > > > 1) no late assignment of ->orig_bio, and always set it in alloc_io()
> > > >
> > > > 2) no waiting on on ->origi_bio, especially the waiting is done in
> > > > fast path of dm_submit_bio_remap().
> > > 
> > > For 5.18 waiting on io->orig_bio just enables a signal that the IO was
> > > split and can be accounted.
> > > 
> > > For 5.19 I also plan on using late io->orig_bio assignment as an
> > > alternative to the full-blown refcounting currently done with
> > > io->io_count.  I've yet to quantify the gains with focused testing but
> > > in theory this approach should scale better on large systems with many
> > > concurrent IO threads to the same device (RCU is primary constraint
> > > now).
> > > 
> > > I'll try to write a bpfrace script to measure how frequently "waiting on
> > > io->orig_bio" occurs for dm_submit_bio_remap() heavy usage (like
> > > dm-crypt). But I think we'll find it is very rarely, if ever, waited
> > > on in the fast path.
> > 
> > The waiting depends on CPU and device's speed, if device is quicker than
> > CPU, the wait should be longer. Testing in one environment is usually
> > not enough.
> > 
> > > 
> > > > 3) no split for io accounting
> > > 
> > > DM's more recent approach to splitting has never been done for benefit
> > > or use of IO accounting, see this commit for its origin:
> > > 18a25da84354c6b ("dm: ensure bio submission follows a depth-first tree 
> > > walk")
> > > 
> > > Not sure why you keep poking fun at DM only doing a single split when:
> > > that is the actual design.  DM splits off orig_bio then recurses to
> > > handle the remainder of the bio that wasn't issued.  Storing it in
> > > io->orig_bio (previously io->bio) was always a means of reflecting
> > > things properly. And yes IO accounting is one use, the other is IO
> > > completion. But unfortunately DM's IO accounting has always been a
> > > mess ever since the above commit. Changes in 5.18 fixed that.
> > > 
> > But again, DM's splitting has _nothing_ to do with IO accounting.

Re: [dm-devel] [PATCH 5/8] dm: always setup ->orig_bio in alloc_io

2022-04-13 Thread Ming Lei
On Wed, Apr 13, 2022 at 02:12:47AM -0400, Mike Snitzer wrote:
> On Tue, Apr 12 2022 at  9:56P -0400,
> Ming Lei  wrote:
> 
> > On Tue, Apr 12, 2022 at 04:52:40PM -0400, Mike Snitzer wrote:
> > > On Tue, Apr 12 2022 at  4:56P -0400,
> > > Ming Lei  wrote:
> > > 
> > > > The current DM codes setup ->orig_bio after __map_bio() returns,
> > > > and not only cause kernel panic for dm zone, but also a bit ugly
> > > > and tricky, especially the waiting until ->orig_bio is set in
> > > > dm_submit_bio_remap().
> > > > 
> > > > The reason is that one new bio is cloned from original FS bio to
> > > > represent the mapped part, which just serves io accounting.
> > > > 
> > > > Now we have switched to bdev based io accounting interface, and we
> > > > can retrieve sectors/bio_op from both the real original bio and the
> > > > added fields of .sector_offset & .sectors easily, so the new cloned
> > > > bio isn't necessary any more.
> > > > 
> > > > Not only fixes dm-zone's kernel panic, but also cleans up dm io
> > > > accounting & split a bit.
> > > 
> > > You're conflating quite a few things here.  DM zone really has no
> > > business accessing io->orig_bio (dm-zone.c can just as easily inspect
> > > the tio->clone, because it hasn't been remapped yet it reflects the
> > > io->origin_bio, so there is no need to look at io->orig_bio) -- but
> > > yes I clearly broke things during the 5.18 merge and it needs fixing
> > > ASAP.
> > 
> > You can just consider the cleanup part of this patches, :-)
> 
> I will.  But your following list doesn't reflect any "cleanup" that I
> saw in your patchset.  Pretty fundamental changes that are similar,
> but different, to the dm-5.19 changes I've staged.
> 
> > 1) no late assignment of ->orig_bio, and always set it in alloc_io()
> >
> > 2) no waiting on on ->origi_bio, especially the waiting is done in
> > fast path of dm_submit_bio_remap().
> 
> For 5.18 waiting on io->orig_bio just enables a signal that the IO was
> split and can be accounted.
> 
> For 5.19 I also plan on using late io->orig_bio assignment as an
> alternative to the full-blown refcounting currently done with
> io->io_count.  I've yet to quantify the gains with focused testing but
> in theory this approach should scale better on large systems with many
> concurrent IO threads to the same device (RCU is primary constraint
> now).
> 
> I'll try to write a bpfrace script to measure how frequently "waiting on
> io->orig_bio" occurs for dm_submit_bio_remap() heavy usage (like
> dm-crypt). But I think we'll find it is very rarely, if ever, waited
> on in the fast path.

The waiting depends on CPU and device speed; if the device is quicker than
the CPU, the wait should be longer. Testing in one environment is usually
not enough.

> 
> > 3) no split for io accounting
> 
> DM's more recent approach to splitting has never been done for benefit
> or use of IO accounting, see this commit for its origin:
> 18a25da84354c6b ("dm: ensure bio submission follows a depth-first tree walk")
> 
> Not sure why you keep poking fun at DM only doing a single split when:
> that is the actual design.  DM splits off orig_bio then recurses to
> handle the remainder of the bio that wasn't issued.  Storing it in
> io->orig_bio (previously io->bio) was always a means of reflecting
> things properly. And yes IO accounting is one use, the other is IO
> completion. But unfortunately DM's IO accounting has always been a
> mess ever since the above commit. Changes in 5.18 fixed that.
> 
> But again, DM's splitting has _nothing_ to do with IO accounting.
> Splitting only happens when needed for IO submission given constraints
> of DM target(s) or underlying layers.

What I meant is that the bio returned from bio_split() is only for
io accounting. Yeah, the comment said it can be for io completion too,
but that is easily done without the split bio.

> 
> All said, I will look closer at your entire set and see if it better
> to go with your approach.  This patch in particular is interesting
> (avoids cloning and other complexity of bio_split + bio_chain):
> https://patchwork.kernel.org/project/dm-devel/patch/20220412085616.1409626-6-ming@redhat.com/

That patch shows we can avoid the extra split, and also that the
bio returned from bio_split() is for io accounting only.


thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 5/8] dm: always setup ->orig_bio in alloc_io

2022-04-12 Thread Ming Lei
On Tue, Apr 12, 2022 at 04:52:40PM -0400, Mike Snitzer wrote:
> On Tue, Apr 12 2022 at  4:56P -0400,
> Ming Lei  wrote:
> 
> > The current DM codes setup ->orig_bio after __map_bio() returns,
> > and not only cause kernel panic for dm zone, but also a bit ugly
> > and tricky, especially the waiting until ->orig_bio is set in
> > dm_submit_bio_remap().
> > 
> > The reason is that one new bio is cloned from original FS bio to
> > represent the mapped part, which just serves io accounting.
> > 
> > Now we have switched to bdev based io accounting interface, and we
> > can retrieve sectors/bio_op from both the real original bio and the
> > added fields of .sector_offset & .sectors easily, so the new cloned
> > bio isn't necessary any more.
> > 
> > Not only fixes dm-zone's kernel panic, but also cleans up dm io
> > accounting & split a bit.
> 
> You're conflating quite a few things here.  DM zone really has no
> business accessing io->orig_bio (dm-zone.c can just as easily inspect
> the tio->clone, because it hasn't been remapped yet it reflects the
> io->origin_bio, so there is no need to look at io->orig_bio) -- but
> yes I clearly broke things during the 5.18 merge and it needs fixing
> ASAP.

You can just consider the cleanup part of these patches, :-)

1) no late assignment of ->orig_bio, and always set it in alloc_io()

2) no waiting on ->orig_bio, especially since the waiting is done in
the fast path of dm_submit_bio_remap().

3) no split for io accounting


Thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 3/8] dm: pass 'dm_io' instance to dm_io_acct directly

2022-04-12 Thread Ming Lei
On Tue, Apr 12, 2022 at 04:28:59PM -0400, Mike Snitzer wrote:
> On Tue, Apr 12 2022 at  4:56P -0400,
> Ming Lei  wrote:
> 
> > All the other 4 parameters are retrieved from the 'dm_io' instance, so
> > not necessary to pass all four to dm_io_acct().
> > 
> > Signed-off-by: Ming Lei 
> 
> Yeah, commit 0ab30b4079e103 ("dm: eliminate copying of dm_io fields in
> dm_io_dec_pending") could've gone further to do what you've done here
> in this patch.
> 
> But it stopped short because of the additional "games" associated with
> the late assignment of io->orig_bio that is in the dm-5.19 branch.

OK, I will rebase on dm-5.19, but IMO the idea of late assignment of
io->orig_bio isn't good, and the same goes for splitting one bio just for
accounting; things shouldn't be so tricky.


Thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 8/8] dm: put all polled io into one single list

2022-04-12 Thread Ming Lei
If bio_split() isn't involved, it is a bit overkill to link the dm_io into
an hlist, given there is only a single dm_io in the list, so convert to a
singly linked list for holding all dm_io instances associated with this bio.

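The resulting push is plain head insertion on a NULL-terminated chain,
with the head kept in bio->bi_private (a sketch of the pattern):

        /* *head is NULL for the first -- and usually the only -- dm_io */
        io->next = *head;
        *head = io;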
Signed-off-by: Ming Lei 
---
 drivers/md/dm-core.h |  2 +-
 drivers/md/dm.c  | 46 +++-
 2 files changed, 25 insertions(+), 23 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 811c0ccbc63d..7f51957913e8 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -257,7 +257,7 @@ struct dm_io {
spinlock_t lock;
unsigned long start_time;
void *data;
-   struct hlist_node node;
+   struct dm_io *next;
struct task_struct *map_task;
struct dm_stats_aux stats_aux;
/* last member of dm_target_io is 'struct bio' */
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 2987f7cf7b47..db23efd6bbf6 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1492,7 +1492,7 @@ static bool __process_abnormal_io(struct clone_info *ci, 
struct dm_target *ti,
 }
 
 /*
- * Reuse ->bi_private as hlist head for storing all dm_io instances
+ * Reuse ->bi_private as dm_io list head for storing all dm_io instances
  * associated with this bio, and this bio's bi_private needs to be
  * stored in dm_io->data before the reuse.
  *
@@ -1500,14 +1500,14 @@ static bool __process_abnormal_io(struct clone_info 
*ci, struct dm_target *ti,
  * touch it after splitting. Meantime it won't be changed by anyone after
  * bio is submitted. So this reuse is safe.
  */
-static inline struct hlist_head *dm_get_bio_hlist_head(struct bio *bio)
+static inline struct dm_io **dm_poll_list_head(struct bio *bio)
 {
-   return (struct hlist_head *)&bio->bi_private;
+   return (struct dm_io **)&bio->bi_private;
 }
 
 static void dm_queue_poll_io(struct bio *bio, struct dm_io *io)
 {
-   struct hlist_head *head = dm_get_bio_hlist_head(bio);
+   struct dm_io **head = dm_poll_list_head(bio);
 
if (!(bio->bi_opf & REQ_DM_POLL_LIST)) {
bio->bi_opf |= REQ_DM_POLL_LIST;
@@ -1517,19 +1517,20 @@ static void dm_queue_poll_io(struct bio *bio, struct 
dm_io *io)
 */
io->data = bio->bi_private;
 
-   INIT_HLIST_HEAD(head);
-
/* tell block layer to poll for completion */
bio->bi_cookie = ~BLK_QC_T_NONE;
+
+   io->next = NULL;
} else {
/*
 * bio recursed due to split, reuse original poll list,
 * and save bio->bi_private too.
 */
-   io->data = hlist_entry(head->first, struct dm_io, node)->data;
+   io->data = (*head)->data;
+   io->next = *head;
}
 
-   hlist_add_head(&io->node, head);
+   *head = io;
 }
 
 /*
@@ -1682,18 +1683,16 @@ static bool dm_poll_dm_io(struct dm_io *io, struct 
io_comp_batch *iob,
 static int dm_poll_bio(struct bio *bio, struct io_comp_batch *iob,
   unsigned int flags)
 {
-   struct hlist_head *head = dm_get_bio_hlist_head(bio);
-   struct hlist_head tmp = HLIST_HEAD_INIT;
-   struct hlist_node *next;
-   struct dm_io *io;
+   struct dm_io **head = dm_poll_list_head(bio);
+   struct dm_io *list = *head;
+   struct dm_io *tmp = NULL;
+   struct dm_io *curr, *next;
 
/* Only poll normal bio which was marked as REQ_DM_POLL_LIST */
if (!(bio->bi_opf & REQ_DM_POLL_LIST))
return 0;
 
-   WARN_ON_ONCE(hlist_empty(head));
-
-   hlist_move_list(head, &tmp);
+   WARN_ON_ONCE(!list);
 
/*
 * Restore .bi_private before possibly completing dm_io.
@@ -1704,24 +1703,27 @@ static int dm_poll_bio(struct bio *bio, struct 
io_comp_batch *iob,
 * clearing REQ_DM_POLL_LIST here.
 */
bio->bi_opf &= ~REQ_DM_POLL_LIST;
-   bio->bi_private = hlist_entry(tmp.first, struct dm_io, node)->data;
+   bio->bi_private = list->data;
 
-   hlist_for_each_entry_safe(io, next, &tmp, node) {
-   if (dm_poll_dm_io(io, iob, flags)) {
-   hlist_del_init(&io->node);
+   for (curr = list, next = curr->next; curr; curr = next, next =
+   curr ? curr->next : NULL) {
+   if (dm_poll_dm_io(curr, iob, flags)) {
/*
 * clone_endio() has already occurred, so passing
 * error as 0 here doesn't override io->status
 */
-   dm_io_dec_pending(io, 0);
+   dm_io_dec_pending(curr, 0);
+   } else {
+   curr->next = tmp;
+   tmp = curr;
}
}
 
/* Not done? */
-   if (!hlist_empty(&tmp)) {
+   if (tmp) {
  

[dm-devel] [PATCH 7/8] dm: improve target io referencing

2022-04-12 Thread Ming Lei
Currently the target io's reference counter is grabbed before calling
__map_bio(); this isn't efficient since we can move the grabbing into
alloc_io().

Meantime it becomes the typical async io reference counter model: one
reference is for the submission side, the other is for the completion
side, and the io won't be completed until both sides are done.

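That is, the standard two-reference pattern (a sketch, not the patch
itself):

        atomic_set(&io->io_count, 2);   /* one ref per side */

        /* submission side, once all clones have been issued */
        if (atomic_dec_and_test(&io->io_count))
                dm_io_complete(io);

        /* completion side, from the last clone's endio */
        if (atomic_dec_and_test(&io->io_count))
                dm_io_complete(io);

Whichever side drops the last reference completes the io, so a clone
that completes before submission returns cannot complete the io early.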
Signed-off-by: Ming Lei 
---
 drivers/md/dm.c | 51 -
 1 file changed, 38 insertions(+), 13 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3c3ba6b4e19b..2987f7cf7b47 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -587,7 +587,9 @@ static struct dm_io *alloc_io(struct mapped_device *md, 
struct bio *bio)
io = container_of(tio, struct dm_io, tio);
io->magic = DM_IO_MAGIC;
io->status = 0;
-   atomic_set(>io_count, 1);
+
+   /* one is for submission, the other is for completion */
+   atomic_set(&io->io_count, 2);
this_cpu_inc(*md->pending_io);
io->orig_bio = bio;
io->md = md;
@@ -937,11 +939,6 @@ static inline bool dm_tio_is_normal(struct dm_target_io 
*tio)
!dm_tio_flagged(tio, DM_TIO_IS_DUPLICATE_BIO));
 }
 
-static void dm_io_inc_pending(struct dm_io *io)
-{
-   atomic_inc(&io->io_count);
-}
-
 /*
  * Decrements the number of outstanding ios that a bio has been
  * cloned into, completing the original io if necc.
@@ -1276,7 +1273,6 @@ static void __map_bio(struct bio *clone)
/*
 * Map the clone.
 */
-   dm_io_inc_pending(io);
tio->old_sector = clone->bi_iter.bi_sector;
 
if (unlikely(swap_bios_limit(ti, clone))) {
@@ -1358,11 +1354,12 @@ static void alloc_multiple_bios(struct bio_list *blist, 
struct clone_info *ci,
}
 }
 
-static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
+static int __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
  unsigned num_bios, unsigned *len)
 {
struct bio_list blist = BIO_EMPTY_LIST;
struct bio *clone;
+   int ret = 0;
 
switch (num_bios) {
case 0:
@@ -1371,15 +1368,19 @@ static void __send_duplicate_bios(struct clone_info 
*ci, struct dm_target *ti,
clone = alloc_tio(ci, ti, 0, len, GFP_NOIO);
dm_tio_set_flag(clone_to_tio(clone), DM_TIO_IS_DUPLICATE_BIO);
__map_bio(clone);
+   ret = 1;
break;
default:
alloc_multiple_bios(&blist, ci, ti, num_bios, len);
while ((clone = bio_list_pop(&blist))) {
dm_tio_set_flag(clone_to_tio(clone), 
DM_TIO_IS_DUPLICATE_BIO);
__map_bio(clone);
+   ret += 1;
}
break;
}
+
+   return ret;
 }
 
 static void __send_empty_flush(struct clone_info *ci)
@@ -1399,8 +1400,19 @@ static void __send_empty_flush(struct clone_info *ci)
ci->bio = &flush_bio;
ci->sector_count = 0;
 
-   while ((ti = dm_table_get_target(ci->map, target_nr++)))
-   __send_duplicate_bios(ci, ti, ti->num_flush_bios, NULL);
+   while ((ti = dm_table_get_target(ci->map, target_nr++))) {
+   int bios;
+
+   atomic_add(ti->num_flush_bios, &ci->io->io_count);
+   bios = __send_duplicate_bios(ci, ti, ti->num_flush_bios, NULL);
+   atomic_sub(ti->num_flush_bios - bios, &ci->io->io_count);
+   }
+
+   /*
+* alloc_io() takes one extra reference for submission, so the
+* reference won't reach 0 after the following subtraction
+*/
+   atomic_sub(1, &ci->io->io_count);
 
bio_uninit(ci->bio);
 }
@@ -1409,6 +1421,7 @@ static void __send_changing_extent_only(struct clone_info 
*ci, struct dm_target
unsigned num_bios)
 {
unsigned len;
+   int bios;
 
len = min_t(sector_t, ci->sector_count,
max_io_len_target_boundary(ti, dm_target_offset(ti, 
ci->sector)));
@@ -1420,7 +1433,13 @@ static void __send_changing_extent_only(struct 
clone_info *ci, struct dm_target
ci->sector += len;
ci->sector_count -= len;
 
-   __send_duplicate_bios(ci, ti, num_bios, );
+   atomic_add(num_bios, &ci->io->io_count);
+   bios = __send_duplicate_bios(ci, ti, num_bios, &len);
+   /*
+* alloc_io() takes one extra reference for submission, so the
+* reference won't reach 0 after the following subtraction
+*/
+   atomic_sub(num_bios - bios + 1, &ci->io->io_count);
 }
 
 static bool is_abnormal_io(struct bio *bio)
@@ -1603,9 +1622,15 @@ static void dm_split_and_process_bio(struct 
mapped_device *md,
 * Add every dm_io instance into the hlist_head which is stored in
 * bio->bi_private, so that dm_poll_bio can poll them

[dm-devel] [PATCH 6/8] dm: don't grab target io reference in dm_zone_map_bio

2022-04-12 Thread Ming Lei
dm_zone_map_bio() is only called from __map_bio(), in which the io's
reference is already grabbed, and the reference won't be released
until the bio is submitted, so it is no longer necessary to grab it in
dm_zone_map_bio().

Reviewed-by: Damien Le Moal 
Tested-by: Damien Le Moal 
Signed-off-by: Ming Lei 
---
 drivers/md/dm-core.h |  7 ---
 drivers/md/dm-zone.c | 10 --
 drivers/md/dm.c  |  7 ++-
 3 files changed, 6 insertions(+), 18 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index aefb080c230d..811c0ccbc63d 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -283,13 +283,6 @@ static inline void dm_io_set_flag(struct dm_io *io, 
unsigned int bit)
io->flags |= (1U << bit);
 }
 
-static inline void dm_io_inc_pending(struct dm_io *io)
-{
-   atomic_inc(&io->io_count);
-}
-
-void dm_io_dec_pending(struct dm_io *io, blk_status_t error);
-
 static inline struct completion *dm_get_completion_from_kobject(struct kobject 
*kobj)
 {
return &container_of(kobj, struct dm_kobject_holder, kobj)->completion;
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index c1ca9be4b79e..85d3c158719f 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -545,13 +545,6 @@ int dm_zone_map_bio(struct dm_target_io *tio)
return DM_MAPIO_KILL;
}
 
-   /*
-* The target map function may issue and complete the IO quickly.
-* Take an extra reference on the IO to make sure it does disappear
-* until we run dm_zone_map_bio_end().
-*/
-   dm_io_inc_pending(io);
-
/* Let the target do its work */
r = ti->type->map(ti, clone);
switch (r) {
@@ -580,9 +573,6 @@ int dm_zone_map_bio(struct dm_target_io *tio)
break;
}
 
-   /* Drop the extra reference on the IO */
-   dm_io_dec_pending(io, sts);
-
if (sts != BLK_STS_OK)
return DM_MAPIO_KILL;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index df1d013fb793..3c3ba6b4e19b 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -937,11 +937,16 @@ static inline bool dm_tio_is_normal(struct dm_target_io 
*tio)
!dm_tio_flagged(tio, DM_TIO_IS_DUPLICATE_BIO));
 }
 
+static void dm_io_inc_pending(struct dm_io *io)
+{
+   atomic_inc(&io->io_count);
+}
+
 /*
  * Decrements the number of outstanding ios that a bio has been
  * cloned into, completing the original io if necc.
  */
-void dm_io_dec_pending(struct dm_io *io, blk_status_t error)
+static void dm_io_dec_pending(struct dm_io *io, blk_status_t error)
 {
/* Push-back supersedes any I/O errors */
if (unlikely(error)) {
-- 
2.31.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 5/8] dm: always setup ->orig_bio in alloc_io

2022-04-12 Thread Ming Lei
The current DM code sets up ->orig_bio after __map_bio() returns, which
not only causes a kernel panic for dm zone, but is also a bit ugly and
tricky, especially the waiting until ->orig_bio is set in
dm_submit_bio_remap().

The reason is that a new bio is cloned from the original FS bio to
represent the mapped part, and it just serves io accounting.

Now we have switched to the bdev based io accounting interface, and we
can easily retrieve sectors/bio_op from both the real original bio and
the added fields .sector_offset & .sectors, so the new cloned bio isn't
necessary any more.

This not only fixes dm-zone's kernel panic, but also cleans up dm io
accounting & splitting a bit.

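A worked example of how the two fields recover the mapped part (the
numbers are illustrative):

        /* orig bio covers sectors 100..163; this dm_io maps the first 16 */
        io->sectors = 16;
        io->sector_offset = bio_end_sector(bio) - bio->bi_iter.bi_sector; /* 64 */

        /* later, after bio_trim() has advanced the shared bio to 116..163,
         * the end sector is unchanged, so the mapped part's start is: */
        sector_t sector = bio_end_sector(bio) - io->sector_offset; /* 100 */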
Signed-off-by: Ming Lei 
---
 drivers/md/dm-core.h |  8 ++-
 drivers/md/dm.c  | 51 ++--
 2 files changed, 32 insertions(+), 27 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 4277853c7535..aefb080c230d 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -247,7 +247,12 @@ struct dm_io {
blk_short_t flags;
atomic_t io_count;
struct mapped_device *md;
+
+   /* The three fields represent mapped part of original bio */
struct bio *orig_bio;
+   unsigned int sector_offset; /* offset to end of orig_bio */
+   unsigned int sectors;
+
blk_status_t status;
spinlock_t lock;
unsigned long start_time;
@@ -264,7 +269,8 @@ struct dm_io {
  */
 enum {
DM_IO_START_ACCT,
-   DM_IO_ACCOUNTED
+   DM_IO_ACCOUNTED,
+   DM_IO_SPLITTED
 };
 
 static inline bool dm_io_flagged(struct dm_io *io, unsigned int bit)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 31eacc0e93ed..df1d013fb793 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -509,8 +509,10 @@ static void dm_io_acct(struct dm_io *io, bool end)
/* If REQ_PREFLUSH set save any payload but do not account it */
if (bio_is_flush_with_data(bio))
sectors = 0;
-   else
+   else if (likely(!(dm_io_flagged(io, DM_IO_SPLITTED))))
sectors = bio_sectors(bio);
+   else
+   sectors = io->sectors;
 
if (!end)
bdev_start_io_acct(bio->bi_bdev, sectors, bio_op(bio),
@@ -518,10 +520,21 @@ static void dm_io_acct(struct dm_io *io, bool end)
else
bdev_end_io_acct(bio->bi_bdev, bio_op(bio), start_time);
 
-   if (unlikely(dm_stats_used(&md->stats)))
+   if (unlikely(dm_stats_used(&md->stats))) {
+   sector_t sector;
+
+   if (likely(!dm_io_flagged(io, DM_IO_SPLITTED))) {
+   sector = bio->bi_iter.bi_sector;
+   sectors = bio_sectors(bio);
+   } else {
+   sector = bio_end_sector(bio) - io->sector_offset;
+   sectors = io->sectors;
+   }
+
dm_stats_account_io(&md->stats, bio_data_dir(bio),
-   bio->bi_iter.bi_sector, bio_sectors(bio),
+   sector, sectors,
end, start_time, stats_aux);
+   }
 }
 
 static void __dm_start_io_acct(struct dm_io *io)
@@ -576,7 +589,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, 
struct bio *bio)
io->status = 0;
atomic_set(&io->io_count, 1);
this_cpu_inc(*md->pending_io);
-   io->orig_bio = NULL;
+   io->orig_bio = bio;
io->md = md;
io->map_task = current;
spin_lock_init(&io->lock);
@@ -1222,13 +1235,6 @@ void dm_submit_bio_remap(struct bio *clone, struct bio 
*tgt_clone)
/* Still in target's map function */
dm_io_set_flag(io, DM_IO_START_ACCT);
} else {
-   /*
-* Called by another thread, managed by DM target,
-* wait for dm_split_and_process_bio() to store
-* io->orig_bio
-*/
-   while (unlikely(!smp_load_acquire(&io->orig_bio)))
-   msleep(1);
dm_start_io_acct(io, clone);
}
 
@@ -1557,7 +1563,6 @@ static void dm_split_and_process_bio(struct mapped_device 
*md,
 struct dm_table *map, struct bio *bio)
 {
struct clone_info ci;
-   struct bio *orig_bio = NULL;
int error = 0;
 
init_clone_info(&ci, md, map, bio);
@@ -1573,22 +1578,16 @@ static void dm_split_and_process_bio(struct 
mapped_device *md,
if (error || !ci.sector_count)
goto out;
 
-   /*
-* Remainder must be passed to submit_bio_noacct() so it gets handled
-* *after* bios already submitted have been completely processed.
-* We take a clone of the original to store in ci.io->orig_bio to be
-* used by dm_end_io_acct() and for dm_io_complete() to use for
-* completion handling.
-  

[dm-devel] [PATCH 4/8] dm: switch to bdev based io accounting interface

2022-04-12 Thread Ming Lei
DM won't account sectors in flush IO; also, we can retrieve the sectors
from 'dm_io' to avoid allocating & updating a new original bio, which
will be done in the following patch.

So switch to the bdev based io accounting interface.

Signed-off-by: Ming Lei 
---
 drivers/md/dm.c | 21 -
 1 file changed, 8 insertions(+), 13 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ed85cd1165a4..31eacc0e93ed 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -504,29 +504,24 @@ static void dm_io_acct(struct dm_io *io, bool end)
unsigned long start_time = io->start_time;
struct mapped_device *md = io->md;
struct bio *bio = io->orig_bio;
-   bool is_flush_with_data;
-   unsigned int bi_size;
+   unsigned int sectors;
 
/* If REQ_PREFLUSH set save any payload but do not account it */
-   is_flush_with_data = bio_is_flush_with_data(bio);
-   if (is_flush_with_data) {
-   bi_size = bio->bi_iter.bi_size;
-   bio->bi_iter.bi_size = 0;
-   }
+   if (bio_is_flush_with_data(bio))
+   sectors = 0;
+   else
+   sectors = bio_sectors(bio);
 
if (!end)
-   bio_start_io_acct_time(bio, start_time);
+   bdev_start_io_acct(bio->bi_bdev, sectors, bio_op(bio),
+   start_time);
else
-   bio_end_io_acct(bio, start_time);
+   bdev_end_io_acct(bio->bi_bdev, bio_op(bio), start_time);
 
if (unlikely(dm_stats_used(&md->stats)))
dm_stats_account_io(&md->stats, bio_data_dir(bio),
bio->bi_iter.bi_sector, bio_sectors(bio),
end, start_time, stats_aux);
-
-   /* Restore bio's payload so it does get accounted upon requeue */
-   if (is_flush_with_data)
-   bio->bi_iter.bi_size = bi_size;
 }
 
 static void __dm_start_io_acct(struct dm_io *io)
-- 
2.31.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 3/8] dm: pass 'dm_io' instance to dm_io_acct directly

2022-04-12 Thread Ming Lei
All the other four parameters are retrieved from the 'dm_io' instance, so
it isn't necessary to pass all four to dm_io_acct().

Signed-off-by: Ming Lei 
---
 drivers/md/dm.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 62f7af815ef8..ed85cd1165a4 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -498,9 +498,12 @@ static bool bio_is_flush_with_data(struct bio *bio)
return ((bio->bi_opf & REQ_PREFLUSH) && bio->bi_iter.bi_size);
 }
 
-static void dm_io_acct(bool end, struct mapped_device *md, struct bio *bio,
-  unsigned long start_time, struct dm_stats_aux *stats_aux)
+static void dm_io_acct(struct dm_io *io, bool end)
 {
+   struct dm_stats_aux *stats_aux = &io->stats_aux;
+   unsigned long start_time = io->start_time;
+   struct mapped_device *md = io->md;
+   struct bio *bio = io->orig_bio;
bool is_flush_with_data;
unsigned int bi_size;
 
@@ -528,7 +531,7 @@ static void dm_io_acct(bool end, struct mapped_device *md, 
struct bio *bio,
 
 static void __dm_start_io_acct(struct dm_io *io)
 {
-   dm_io_acct(false, io->md, io->orig_bio, io->start_time, &io->stats_aux);
+   dm_io_acct(io, false);
 }
 
 static void dm_start_io_acct(struct dm_io *io, struct bio *clone)
@@ -557,7 +560,7 @@ static void dm_start_io_acct(struct dm_io *io, struct bio 
*clone)
 
 static void dm_end_io_acct(struct dm_io *io)
 {
-   dm_io_acct(true, io->md, io->orig_bio, io->start_time, &io->stats_aux);
+   dm_io_acct(io, true);
 }
 
 static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
-- 
2.31.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 2/8] dm: don't pass bio to __dm_start_io_acct and dm_end_io_acct

2022-04-12 Thread Ming Lei
io->orig_bio is always passed to __dm_start_io_acct and dm_end_io_acct,
so it isn't necessary for the two helpers to take a bio parameter.

Signed-off-by: Ming Lei 
---
 drivers/md/dm.c | 19 ---
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3c5fad7c4ee6..62f7af815ef8 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -526,16 +526,13 @@ static void dm_io_acct(bool end, struct mapped_device 
*md, struct bio *bio,
bio->bi_iter.bi_size = bi_size;
 }
 
-static void __dm_start_io_acct(struct dm_io *io, struct bio *bio)
+static void __dm_start_io_acct(struct dm_io *io)
 {
+   dm_io_acct(false, io->md, io->orig_bio, io->start_time, &io->stats_aux);
+   dm_io_acct(false, io->md, io->orig_bio, io->start_time, >stats_aux);
 }
 
 static void dm_start_io_acct(struct dm_io *io, struct bio *clone)
 {
-   /* Must account IO to DM device in terms of orig_bio */
-   struct bio *bio = io->orig_bio;
-
/*
 * Ensure IO accounting is only ever started once.
 * Expect no possibility for race unless DM_TIO_IS_DUPLICATE_BIO.
@@ -555,12 +552,12 @@ static void dm_start_io_acct(struct dm_io *io, struct bio 
*clone)
spin_unlock_irqrestore(&io->lock, flags);
}
 
-   __dm_start_io_acct(io, bio);
+   __dm_start_io_acct(io);
 }
 
-static void dm_end_io_acct(struct dm_io *io, struct bio *bio)
+static void dm_end_io_acct(struct dm_io *io)
 {
-   dm_io_acct(true, io->md, bio, io->start_time, >stats_aux);
+   dm_io_acct(true, io->md, io->orig_bio, io->start_time, &io->stats_aux);
 }
 
 static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
@@ -875,14 +872,14 @@ static void dm_io_complete(struct dm_io *io)
 
io_error = io->status;
if (dm_io_flagged(io, DM_IO_ACCOUNTED))
-   dm_end_io_acct(io, bio);
+   dm_end_io_acct(io);
else if (!io_error) {
/*
 * Must handle target that DM_MAPIO_SUBMITTED only to
 * then bio_endio() rather than dm_submit_bio_remap()
 */
-   __dm_start_io_acct(io, bio);
-   dm_end_io_acct(io, bio);
+   __dm_start_io_acct(io);
+   dm_end_io_acct(io);
}
free_io(io);
smp_wmb();
-- 
2.31.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 1/8] block: replace disk based account with bdev's

2022-04-12 Thread Ming Lei
'block device' is the generic type for the interface, and gendisk is
becoming more of a block layer internal type, so replace the disk based
accounting interface with bdev's.

Also add a 'start_time' parameter to bdev_start_io_acct() so that we
can cover device mapper's io accounting with the two bdev based interfaces.

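Driver-side usage then looks like this (a sketch; bdev, nr_sectors and op
stand for the caller's own values -- zram's conversion below is the real
example):

        unsigned long start = bdev_start_io_acct(bdev, nr_sectors, op, jiffies);
        /* ... carry out the I/O ... */
        bdev_end_io_acct(bdev, op, start);

DM passes its saved io->start_time instead of jiffies, which is what the
new start_time parameter is for.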
Signed-off-by: Ming Lei 
---
 block/blk-core.c  | 15 ---
 drivers/block/zram/zram_drv.c |  5 +++--
 include/linux/blkdev.h|  7 ---
 3 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 937bb6b86331..a3ae13b129ff 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1063,12 +1063,13 @@ unsigned long bio_start_io_acct(struct bio *bio)
 }
 EXPORT_SYMBOL_GPL(bio_start_io_acct);
 
-unsigned long disk_start_io_acct(struct gendisk *disk, unsigned int sectors,
-unsigned int op)
+unsigned long bdev_start_io_acct(struct block_device *bdev,
+unsigned int sectors, unsigned int op,
+unsigned long start_time)
 {
-   return __part_start_io_acct(disk->part0, sectors, op, jiffies);
+   return __part_start_io_acct(bdev, sectors, op, start_time);
 }
-EXPORT_SYMBOL(disk_start_io_acct);
+EXPORT_SYMBOL(bdev_start_io_acct);
 
 static void __part_end_io_acct(struct block_device *part, unsigned int op,
   unsigned long start_time)
@@ -1091,12 +1092,12 @@ void bio_end_io_acct_remapped(struct bio *bio, unsigned 
long start_time,
 }
 EXPORT_SYMBOL_GPL(bio_end_io_acct_remapped);
 
-void disk_end_io_acct(struct gendisk *disk, unsigned int op,
+void bdev_end_io_acct(struct block_device *bdev, unsigned int op,
  unsigned long start_time)
 {
-   __part_end_io_acct(disk->part0, op, start_time);
+   __part_end_io_acct(bdev, op, start_time);
 }
-EXPORT_SYMBOL(disk_end_io_acct);
+EXPORT_SYMBOL(bdev_end_io_acct);
 
 /**
  * blk_lld_busy - Check if underlying low-level drivers of a device are busy
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index e9474b02012d..adb5209a556a 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1675,9 +1675,10 @@ static int zram_rw_page(struct block_device *bdev, 
sector_t sector,
bv.bv_len = PAGE_SIZE;
bv.bv_offset = 0;
 
-   start_time = disk_start_io_acct(bdev->bd_disk, SECTORS_PER_PAGE, op);
+   start_time = bdev_start_io_acct(bdev->bd_disk->part0,
+   SECTORS_PER_PAGE, op, jiffies);
ret = zram_bvec_rw(zram, &bv, index, offset, op, NULL);
-   disk_end_io_acct(bdev->bd_disk, op, start_time);
+   bdev_end_io_acct(bdev->bd_disk->part0, op, start_time);
 out:
/*
 * If I/O fails, just return error(ie, non-zero) without
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 60d016138997..f680ba6f0ab2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1491,9 +1491,10 @@ static inline void blk_wake_io_task(struct task_struct 
*waiter)
wake_up_process(waiter);
 }
 
-unsigned long disk_start_io_acct(struct gendisk *disk, unsigned int sectors,
-   unsigned int op);
-void disk_end_io_acct(struct gendisk *disk, unsigned int op,
+unsigned long bdev_start_io_acct(struct block_device *bdev,
+unsigned int sectors, unsigned int op,
+unsigned long start_time);
+void bdev_end_io_acct(struct block_device *bdev, unsigned int op,
unsigned long start_time);
 
 void bio_start_io_acct_time(struct bio *bio, unsigned long start_time);
-- 
2.31.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 0/8] dm: io accounting & polling improvement

2022-04-12 Thread Ming Lei
Hello Guys,

The 1st patch adds the bdev based io accounting interface.

The 2nd ~ 5th patches improve dm's io accounting & splitting, and
meanwhile fix a kernel panic on dm-zone.

The other patches improve io polling & dm io reference handling.


Ming Lei (8):
  block: replace disk based account with bdev's
  dm: don't pass bio to __dm_start_io_acct and dm_end_io_acct
  dm: pass 'dm_io' instance to dm_io_acct directly
  dm: switch to bdev based io accounting interface
  dm: always setup ->orig_bio in alloc_io
  dm: don't grab target io reference in dm_zone_map_bio
  dm: improve target io referencing
  dm: put all polled io into one single list

 block/blk-core.c  |  15 +--
 drivers/block/zram/zram_drv.c |   5 +-
 drivers/md/dm-core.h  |  17 ++-
 drivers/md/dm-zone.c  |  10 --
 drivers/md/dm.c   | 190 +++---
 include/linux/blkdev.h|   7 +-
 6 files changed, 131 insertions(+), 113 deletions(-)

-- 
2.31.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio

2022-04-11 Thread Ming Lei
On Tue, Apr 12, 2022 at 09:28:46AM +0900, Damien Le Moal wrote:
> On 4/12/22 09:09, Ming Lei wrote:
> > On Tue, Apr 12, 2022 at 08:33:04AM +0900, Damien Le Moal wrote:
> >> On 4/11/22 23:18, Ming Lei wrote:
> >>>>>> This fixes the issue:
> >>>>>>
> >>>>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> >>>>>> index 3c5fad7c4ee6..3dd6735450c5 100644
> >>>>>> --- a/drivers/md/dm.c
> >>>>>> +++ b/drivers/md/dm.c
> >>>>>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device
> >>>>>> *md, struct bio *bio)
> >>>>>> io->status = 0;
> >>>>>> atomic_set(&io->io_count, 1);
> >>>>>> this_cpu_inc(*md->pending_io);
> >>>>>> -   io->orig_bio = NULL;
> >>>>>> +   io->orig_bio = bio;
> >>>>>> io->md = md;
> >>>>>> io->map_task = current;
> >>>>>> spin_lock_init(&io->lock);
> >>>>>>
> >>>>>> Otherwise, the dm-zone.c code sees a NULL orig_bio.
> >>>>>> However, this change may be messing up the bio accounting. Need to 
> >>>>>> check that.
> >>>>>
> >>>>> Looks it is one recent regression since:
> >>>>>
> >>>>> commit 0fbb4d93b38b ("dm: add dm_submit_bio_remap interface")
> >>>>
> >>>> Yep, saw that. Problem is, I really do not understand that change setting
> >>>> io->orig_bio *after* __map_bio() is called. It seems that the accounting
> >>>> is done on each fragment of the orig_bio instead of once for the entire
> >>>> BIO... So my "fix" above seems wrong. Apart from passing along orig_bio 
> >>>> as
> >>>> an argument to  __map_bio() from __split_and_process_bio(), I do not 
> >>>> think
> >>>> my change is correct. Thoughts ?
> >>>
> >>> Frankly speaking, both changing ->orig_bio after split and setting 
> >>> ->orig_bio
> >>> after ->map() looks ugly & tricky, and the following change should avoid 
> >>> the
> >>> issue, meantime simplify dm accounting a bit:
> >>>
> >>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> >>> index 3c5fad7c4ee6..f1fe83113608 100644
> >>> --- a/drivers/md/dm.c
> >>> +++ b/drivers/md/dm.c
> >>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device 
> >>> *md, struct bio *bio)
> >>>   io->status = 0;
> >>>   atomic_set(&io->io_count, 1);
> >>>   this_cpu_inc(*md->pending_io);
> >>> - io->orig_bio = NULL;
> >>> + io->orig_bio = bio;
> >>>   io->md = md;
> >>>   io->map_task = current;
> >>>   spin_lock_init(&io->lock);
> >>> @@ -1223,19 +1223,11 @@ void dm_submit_bio_remap(struct bio *clone, 
> >>> struct bio *tgt_clone)
> >>>* Account io->origin_bio to DM dev on behalf of target
> >>>* that took ownership of IO with DM_MAPIO_SUBMITTED.
> >>>*/
> >>> - if (io->map_task == current) {
> >>> + if (io->map_task == current)
> >>>   /* Still in target's map function */
> >>>   dm_io_set_flag(io, DM_IO_START_ACCT);
> >>> - } else {
> >>> - /*
> >>> -  * Called by another thread, managed by DM target,
> >>> -  * wait for dm_split_and_process_bio() to store
> >>> -  * io->orig_bio
> >>> -  */
> >>> - while (unlikely(!smp_load_acquire(&io->orig_bio)))
> >>> - msleep(1);
> >>> + else
> >>
> >> Curly brackets around the else here.
> >>
> >>>   dm_start_io_acct(io, clone);
> >>> - }
> >>>  
> >>>   __dm_submit_bio_remap(tgt_clone, disk_devt(io->md->disk),
> >>> tio->old_sector);
> >>> @@ -1562,7 +1554,7 @@ static void dm_split_and_process_bio(struct 
> >>> mapped_device *md,
> >>>struct dm_table *map, struct bio *bio)
> >>>  {
> >>>   struct clone_info ci;
> >>> - struct bio *orig_bio = NULL;
> >>> + str

Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio

2022-04-11 Thread Ming Lei
On Tue, Apr 12, 2022 at 08:33:04AM +0900, Damien Le Moal wrote:
> On 4/11/22 23:18, Ming Lei wrote:
> >>>> This fixes the issue:
> >>>>
> >>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> >>>> index 3c5fad7c4ee6..3dd6735450c5 100644
> >>>> --- a/drivers/md/dm.c
> >>>> +++ b/drivers/md/dm.c
> >>>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device
> >>>> *md, struct bio *bio)
> >>>> io->status = 0;
> >>>> atomic_set(&io->io_count, 1);
> >>>> this_cpu_inc(*md->pending_io);
> >>>> -   io->orig_bio = NULL;
> >>>> +   io->orig_bio = bio;
> >>>> io->md = md;
> >>>> io->map_task = current;
> >>>> spin_lock_init(&io->lock);
> >>>>
> >>>> Otherwise, the dm-zone.c code sees a NULL orig_bio.
> >>>> However, this change may be messing up the bio accounting. Need to check 
> >>>> that.
> >>>
> >>> Looks it is one recent regression since:
> >>>
> >>> commit 0fbb4d93b38b ("dm: add dm_submit_bio_remap interface")
> >>
> >> Yep, saw that. Problem is, I really do not understand that change setting
> >> io->orig_bio *after* __map_bio() is called. It seems that the accounting
> >> is done on each fragment of the orig_bio instead of once for the entire
> >> BIO... So my "fix" above seems wrong. Apart from passing along orig_bio as
> >> an argument to  __map_bio() from __split_and_process_bio(), I do not think
> >> my change is correct. Thoughts ?
> > 
> > Frankly speaking, both changing ->orig_bio after split and setting 
> > ->orig_bio
> > after ->map() looks ugly & tricky, and the following change should avoid the
> > issue, meantime simplify dm accounting a bit:
> > 
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index 3c5fad7c4ee6..f1fe83113608 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, 
> > struct bio *bio)
> > io->status = 0;
> > atomic_set(&io->io_count, 1);
> > this_cpu_inc(*md->pending_io);
> > -   io->orig_bio = NULL;
> > +   io->orig_bio = bio;
> > io->md = md;
> > io->map_task = current;
> > spin_lock_init(&io->lock);
> > @@ -1223,19 +1223,11 @@ void dm_submit_bio_remap(struct bio *clone, struct 
> > bio *tgt_clone)
> >  * Account io->origin_bio to DM dev on behalf of target
> >  * that took ownership of IO with DM_MAPIO_SUBMITTED.
> >  */
> > -   if (io->map_task == current) {
> > +   if (io->map_task == current)
> > /* Still in target's map function */
> > dm_io_set_flag(io, DM_IO_START_ACCT);
> > -   } else {
> > -   /*
> > -* Called by another thread, managed by DM target,
> > -* wait for dm_split_and_process_bio() to store
> > -* io->orig_bio
> > -*/
> > -   while (unlikely(!smp_load_acquire(&io->orig_bio)))
> > -   msleep(1);
> > +   else
> 
> Curly brackets around the else here.
> 
> > dm_start_io_acct(io, clone);
> > -   }
> >  
> > __dm_submit_bio_remap(tgt_clone, disk_devt(io->md->disk),
> >   tio->old_sector);
> > @@ -1562,7 +1554,7 @@ static void dm_split_and_process_bio(struct 
> > mapped_device *md,
> >  struct dm_table *map, struct bio *bio)
> >  {
> > struct clone_info ci;
> > -   struct bio *orig_bio = NULL;
> > +   struct bio *new_bio = NULL;
> > int error = 0;
> >  
> > init_clone_info(&ci, md, map, bio);
> > @@ -1578,22 +1570,14 @@ static void dm_split_and_process_bio(struct 
> > mapped_device *md,
> > if (error || !ci.sector_count)
> > goto out;
> >  
> > -   /*
> > -* Remainder must be passed to submit_bio_noacct() so it gets handled
> > -* *after* bios already submitted have been completely processed.
> > -* We take a clone of the original to store in ci.io->orig_bio to be
> > -* used by dm_end_io_acct() and for dm_io_complete() to use for
> > -* completion handling.
> > -*/
> 
> This comment should remain with some adjustment.

Fine, just felt the approach is

Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio

2022-04-11 Thread Ming Lei
On Mon, Apr 11, 2022 at 04:42:59PM +0900, Damien Le Moal wrote:
> On 4/11/22 16:34, Ming Lei wrote:
> > On Mon, Apr 11, 2022 at 11:55:14AM +0900, Damien Le Moal wrote:
> >> On 4/11/22 11:19, Damien Le Moal wrote:
> >>> On 4/11/22 10:04, Ming Lei wrote:
> >>>> On Mon, Apr 11, 2022 at 09:50:57AM +0900, Damien Le Moal wrote:
> >>>>> On 4/11/22 09:36, Ming Lei wrote:
> >>>>>> On Mon, Apr 11, 2022 at 09:18:56AM +0900, Damien Le Moal wrote:
> >>>>>>> On 4/9/22 02:12, Ming Lei wrote:
> >>>>>>>> dm_zone_map_bio() is only called from __map_bio in which the io's
> >>>>>>>> reference is grabbed already, and the reference won't be released
> >>>>>>>> until the bio is submitted, so no necessary to do it dm_zone_map_bio
> >>>>>>>> any more.
> >>>>>>>
> >>>>>>> I do not think that this patch is correct. Removing the extra 
> >>>>>>> reference on
> >>>>>>> the io can lead to problems if the io is completed in the target
> >>>>>>> ->map(ti, clone) call or before dm_zone_map_bio_end() is called for 
> >>>>>>> the
> >>>>>>> DM_MAPIO_SUBMITTED or DM_MAPIO_REMAPPED cases. dm_zone_map_bio_end() 
> >>>>>>> needs
> >>>>>>
> >>>>>> __map_bio():
> >>>>>>...
> >>>>>>dm_io_inc_pending(io);
> >>>>>>...
> >>>>>>dm_zone_map_bio(tio);
> >>>>>>...
> >>>>>
> >>>>> dm-crypt (for instance) may terminate the clone bio immediately in its
> >>>>> ->map() function context, resulting in the bio_endio(clone) ->
> >>>>> clone_endio() -> dm_io_dec_pending() call chain.
> >>>>>
> >>>>> With that, the io is gone and dm_zone_map_bio_end() will not have a 
> >>>>> valid
> >>>>> reference on the orig bio.
> >>>>
> >>>> Any target can complete io during ->map. Here looks nothing is special 
> >>>> with
> >>>> dm-crypt or dm-zone, why does only dm zone need extra reference?
> >>>>
> >>>> The reference counter is initialized as 1 in init_clone_info(), 
> >>>> dm_io_inc_pending()
> >>>> in __map_bio() increases it to 2, so after the above call chain you 
> >>>> mentioned is done,
> >>>> the counter becomes 1. The original bio can't be completed until 
> >>>> dm_io_dec_pending()
> >>>> in dm_split_and_process_bio() is called.
> >>>>
> >>>> Or maybe I miss any extra requirement from dm-zone?
> >>>
> >>> Something is wrong... With and without your patch, when I setup a dm-crypt
> >>> target on top of a zoned nullblk device, I get:
> >>>
> >>> [  292.596454] device-mapper: uevent: version 1.0.3
> >>> [  292.602746] device-mapper: ioctl: 4.46.0-ioctl (2022-02-22)
> >>> initialised: dm-devel@redhat.com
> >>> [  292.732217] general protection fault, probably for non-canonical
> >>> address 0xdc02:  [#1] PREEMPT SMP KASAN PTI
> >>> [  292.743724] KASAN: null-ptr-deref in range
> >>> [0x0010-0x0017]
> >>> [  292.751409] CPU: 0 PID: 4259 Comm: systemd-udevd Not tainted
> >>> 5.18.0-rc2+ #1458
> >>> [  292.758746] Hardware name: Supermicro Super Server/X11DPL-i, BIOS 3.3
> >>> 02/21/2020
> >>> [  292.766250] RIP: 0010:dm_zone_map_bio+0x146/0x1740 [dm_mod]
> >>> [  292.771938] Code: 00 00 4d 8b 65 10 48 8d 43 28 48 89 44 24 10 49 8d 44
> >>> 24 10 48 89 c2 48 89 44 24 18 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03
> >>> <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 78 0e 00 00 45 8b 7c 24 10 41
> >>> [  292.790946] RSP: 0018:8883cd847218 EFLAGS: 00010202
> >>> [  292.796260] RAX: dc00 RBX: 8885c5bcdce8 RCX:
> >>> 111034470027
> >>> [  292.803496] RDX: 0002 RSI: 0008 RDI:
> >>> 8885c5bcdc60
> >>> [  292.810732] RBP: 111079b08e4f R08: 8881a23801d8 R09:
> >>> 8881a238013f
> >>> [  292.817970] R10: 88821c594040 R11: 0001 R12:
> >>> 
> >>> [  292.82520

Re: [dm-devel] [PATCH] dm: dm-zone: Fix NULL pointer dereference in dm_zone_map_bio()

2022-04-11 Thread Ming Lei
On Mon, Apr 11, 2022 at 06:38:38PM +0900, Damien Le Moal wrote:
> Commit 0fbb4d93b38b ("dm: add dm_submit_bio_remap interface") changed
> the alloc_io() function to delay the initialization of struct dm_io
> orig_bio field, leaving this field as NULL until the first call to
> __split_and_process_bio() is executed for the user submitted BIO. This
> change causes a NULL pointer dereference in dm_zone_map_bio() when the
> original user BIO is inspected to detect the need for zone append
> command emulation.
> 
> Avoid this problem by adding a struct clone_info *ci argument to the
> __map_bio() function and a struct bio *orig_bio argument to
> dm_zone_map_bio(). Doing so, the call to dm_zone_map_bio() can be passed
> directly a pointer to the original user BIO using the bio field of
> struct clone_info.
> 
> Fixes: 0fbb4d93b38b ("dm: add dm_submit_bio_remap interface")
> Signed-off-by: Damien Le Moal 
> ---
>  drivers/md/dm-zone.c |  3 +--
>  drivers/md/dm.c  | 10 +-
>  drivers/md/dm.h  |  5 +++--
>  3 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index c1ca9be4b79e..772161f0b029 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -513,13 +513,12 @@ static bool dm_need_zone_wp_tracking(struct bio 
> *orig_bio)
>  /*
>   * Special IO mapping for targets needing zone append emulation.
>   */
> -int dm_zone_map_bio(struct dm_target_io *tio)
> +int dm_zone_map_bio(struct dm_target_io *tio, struct bio *orig_bio)
>  {
>   struct dm_io *io = tio->io;
>   struct dm_target *ti = tio->ti;
>   struct mapped_device *md = io->md;
>   struct request_queue *q = md->queue;
> - struct bio *orig_bio = io->orig_bio;
> - struct bio *clone = &tio->clone;
>   unsigned int zno;
>   blk_status_t sts;
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3c5fad7c4ee6..1d8f24f04c7d 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1258,7 +1258,7 @@ static noinline void __set_swap_bios_limit(struct 
> mapped_device *md, int latch)
>   mutex_unlock(>swap_bios_lock);
>  }
>  
> -static void __map_bio(struct bio *clone)
> +static void __map_bio(struct clone_info *ci, struct bio *clone)
>  {
>   struct dm_target_io *tio = clone_to_tio(clone);
>   int r;
> @@ -1287,7 +1287,7 @@ static void __map_bio(struct bio *clone)
>* map operation.
>*/
>   if (dm_emulate_zone_append(io->md))
> - r = dm_zone_map_bio(tio);
> + r = dm_zone_map_bio(tio, ci->bio);

It depends if bio_split() in dm_split_and_process_bio() can be triggered
for dm-zone. If it can be triggered, here the actual original bio should
be the one returned from bio_split().

Thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio

2022-04-11 Thread Ming Lei
On Mon, Apr 11, 2022 at 11:55:14AM +0900, Damien Le Moal wrote:
> On 4/11/22 11:19, Damien Le Moal wrote:
> > On 4/11/22 10:04, Ming Lei wrote:
> >> On Mon, Apr 11, 2022 at 09:50:57AM +0900, Damien Le Moal wrote:
> >>> On 4/11/22 09:36, Ming Lei wrote:
> >>>> On Mon, Apr 11, 2022 at 09:18:56AM +0900, Damien Le Moal wrote:
> >>>>> On 4/9/22 02:12, Ming Lei wrote:
> >>>>>> dm_zone_map_bio() is only called from __map_bio in which the io's
> >>>>>> reference is grabbed already, and the reference won't be released
> >>>>>> until the bio is submitted, so no necessary to do it dm_zone_map_bio
> >>>>>> any more.
> >>>>>
> >>>>> I do not think that this patch is correct. Removing the extra reference 
> >>>>> on
> >>>>> the io can lead to problems if the io is completed in the target
> >>>>> ->map(ti, clone) call or before dm_zone_map_bio_end() is called for the
> >>>>> DM_MAPIO_SUBMITTED or DM_MAPIO_REMAPPED cases. dm_zone_map_bio_end() 
> >>>>> needs
> >>>>
> >>>> __map_bio():
> >>>>  ...
> >>>>  dm_io_inc_pending(io);
> >>>>  ...
> >>>>  dm_zone_map_bio(tio);
> >>>>  ...
> >>>
> >>> dm-crypt (for instance) may terminate the clone bio immediately in its
> >>> ->map() function context, resulting in the bio_endio(clone) ->
> >>> clone_endio() -> dm_io_dec_pending() call chain.
> >>>
> >>> With that, the io is gone and dm_zone_map_bio_end() will not have a valid
> >>> reference on the orig bio.
> >>
> >> Any target can complete io during ->map. Here looks nothing is special with
> >> dm-crypt or dm-zone, why does only dm zone need extra reference?
> >>
> >> The reference counter is initialized as 1 in init_clone_info(), 
> >> dm_io_inc_pending()
> >> in __map_bio() increases it to 2, so after the above call chain you 
> >> mentioned is done,
> >> the counter becomes 1. The original bio can't be completed until 
> >> dm_io_dec_pending()
> >> in dm_split_and_process_bio() is called.
> >>
> >> Or maybe I miss any extra requirement from dm-zone?
> > 
> > Something is wrong... With and without your patch, when I setup a dm-crypt
> > target on top of a zoned nullblk device, I get:
> > 
> > [  292.596454] device-mapper: uevent: version 1.0.3
> > [  292.602746] device-mapper: ioctl: 4.46.0-ioctl (2022-02-22)
> > initialised: dm-devel@redhat.com
> > [  292.732217] general protection fault, probably for non-canonical
> > address 0xdc02:  [#1] PREEMPT SMP KASAN PTI
> > [  292.743724] KASAN: null-ptr-deref in range
> > [0x0010-0x0017]
> > [  292.751409] CPU: 0 PID: 4259 Comm: systemd-udevd Not tainted
> > 5.18.0-rc2+ #1458
> > [  292.758746] Hardware name: Supermicro Super Server/X11DPL-i, BIOS 3.3
> > 02/21/2020
> > [  292.766250] RIP: 0010:dm_zone_map_bio+0x146/0x1740 [dm_mod]
> > [  292.771938] Code: 00 00 4d 8b 65 10 48 8d 43 28 48 89 44 24 10 49 8d 44
> > 24 10 48 89 c2 48 89 44 24 18 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03
> > <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 78 0e 00 00 45 8b 7c 24 10 41
> > [  292.790946] RSP: 0018:8883cd847218 EFLAGS: 00010202
> > [  292.796260] RAX: dc00 RBX: 8885c5bcdce8 RCX:
> > 111034470027
> > [  292.803496] RDX: 0002 RSI: 0008 RDI:
> > 8885c5bcdc60
> > [  292.810732] RBP: 111079b08e4f R08: 8881a23801d8 R09:
> > 8881a238013f
> > [  292.817970] R10: 88821c594040 R11: 0001 R12:
> > 
> > [  292.825206] R13: 8885c5bcdc50 R14: 8881a238 R15:
> > 8885c5bcdd08
> > [  292.832442] FS:  7fe169b06b40() GS:0fc0()
> > knlGS:
> > [  292.840646] CS:  0010 DS:  ES:  CR0: 80050033
> > [  292.846481] CR2: 7ffd80a57a38 CR3: 0004b91b0006 CR4:
> > 007706f0
> > [  292.853722] DR0:  DR1:  DR2:
> > 
> > [  292.860957] DR3:  DR6: fffe0ff0 DR7:
> > 0400
> > [  292.868194] PKRU: 5554
> > [  292.870949] Call Trace:
> > [  292.873446]  
> > [  292.875593]  ? lock_is_held_type+0xd7/0x130
> > [  292.879860]  ? dm_set_zones_restrictions+0x8f0/0x8f0 [dm_mod]

Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio

2022-04-10 Thread Ming Lei
On Mon, Apr 11, 2022 at 09:50:57AM +0900, Damien Le Moal wrote:
> On 4/11/22 09:36, Ming Lei wrote:
> > On Mon, Apr 11, 2022 at 09:18:56AM +0900, Damien Le Moal wrote:
> >> On 4/9/22 02:12, Ming Lei wrote:
> >>> dm_zone_map_bio() is only called from __map_bio in which the io's
> >>> reference is grabbed already, and the reference won't be released
> >>> until the bio is submitted, so no necessary to do it dm_zone_map_bio
> >>> any more.
> >>
> >> I do not think that this patch is correct. Removing the extra reference on
> >> the io can lead to problems if the io is completed in the target
> >> ->map(ti, clone) call or before dm_zone_map_bio_end() is called for the
> >> DM_MAPIO_SUBMITTED or DM_MAPIO_REMAPPED cases. dm_zone_map_bio_end() needs
> > 
> > __map_bio():
> > ...
> > dm_io_inc_pending(io);
> > ...
> > dm_zone_map_bio(tio);
> > ...
> 
> dm-crypt (for instance) may terminate the clone bio immediately in its
> ->map() function context, resulting in the bio_endio(clone) ->
> clone_endio() -> dm_io_dec_pending() call chain.
> 
> With that, the io is gone and dm_zone_map_bio_end() will not have a valid
> reference on the orig bio.

Any target can complete io during ->map. Here looks nothing is special with
dm-crypt or dm-zone, why does only dm zone need extra reference?

The reference counter is initialized as 1 in init_clone_info(), 
dm_io_inc_pending()
in __map_bio() increases it to 2, so after the above call chain you mentioned 
is done,
the counter becomes 1. The original bio can't be completed until 
dm_io_dec_pending()
in dm_split_and_process_bio() is called.

Or maybe I miss any extra requirement from dm-zone?
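
For reference, here is a toy user-space model of the two-sided counting
just described (an illustration only, not kernel code):

#include <stdatomic.h>
#include <stdio.h>

/* Toy model of the dm_io reference counting described above: one
 * reference held by the submission side, one by the completion side;
 * the "original bio" completes only when the count hits zero.
 */
struct io_model { atomic_int io_count; };

static void dec_pending(struct io_model *io)
{
	if (atomic_fetch_sub(&io->io_count, 1) == 1)
		printf("original bio completed\n");
}

int main(void)
{
	struct io_model io = { .io_count = 1 };	/* init_clone_info() */

	atomic_fetch_add(&io.io_count, 1);	/* dm_io_inc_pending() in __map_bio() */
	dec_pending(&io);	/* clone ended early in ->map(): 2 -> 1, io alive */
	dec_pending(&io);	/* dm_split_and_process_bio(): 1 -> 0, bio completes */
	return 0;
}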


thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio

2022-04-10 Thread Ming Lei
On Mon, Apr 11, 2022 at 09:18:56AM +0900, Damien Le Moal wrote:
> On 4/9/22 02:12, Ming Lei wrote:
> > dm_zone_map_bio() is only called from __map_bio in which the io's
> > reference is grabbed already, and the reference won't be released
> > until the bio is submitted, so no necessary to do it dm_zone_map_bio
> > any more.
> 
> I do not think that this patch is correct. Removing the extra reference on
> the io can lead to problems if the io is completed in the target
> ->map(ti, clone) call or before dm_zone_map_bio_end() is called for the
> DM_MAPIO_SUBMITTED or DM_MAPIO_REMAPPED cases. dm_zone_map_bio_end() needs

__map_bio():
...
dm_io_inc_pending(io);
...
dm_zone_map_bio(tio);
...

dm_submit_bio():
dm_split_and_process_bio
init_clone_info(, md, map, bio)
atomic_set(&io->io_count, 1)
...
__map_bio()
...
dm_io_dec_pending()

The target io can only be completed after the above line returns,
so the extra reference in dm_zone_map_bio() doesn't make any difference,
does it?


Thanks, 
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 3/3] dm: put all polled io into one single list

2022-04-08 Thread Ming Lei
If bio_split() isn't involved, it is a bit overkill to link dm_io into an
hlist, given there is only a single dm_io in the list, so convert to a
single list for holding all dm_io instances associated with this bio.

Signed-off-by: Ming Lei 
---
 drivers/md/dm-core.h |  2 +-
 drivers/md/dm.c  | 46 +++-
 2 files changed, 25 insertions(+), 23 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 13136bd47cb3..e28f7d54d2b7 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -252,7 +252,7 @@ struct dm_io {
spinlock_t lock;
unsigned long start_time;
void *data;
-   struct hlist_node node;
+   struct dm_io *next;
struct task_struct *map_task;
struct dm_stats_aux stats_aux;
/* last member of dm_target_io is 'struct bio' */
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 528559ca2f91..84cb73462fcd 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1486,7 +1486,7 @@ static bool __process_abnormal_io(struct clone_info *ci, 
struct dm_target *ti,
 }
 
 /*
- * Reuse ->bi_private as hlist head for storing all dm_io instances
+ * Reuse ->bi_private as dm_io list head for storing all dm_io instances
  * associated with this bio, and this bio's bi_private needs to be
  * stored in dm_io->data before the reuse.
  *
@@ -1494,14 +1494,14 @@ static bool __process_abnormal_io(struct clone_info 
*ci, struct dm_target *ti,
  * touch it after splitting. Meantime it won't be changed by anyone after
  * bio is submitted. So this reuse is safe.
  */
-static inline struct hlist_head *dm_get_bio_hlist_head(struct bio *bio)
+static inline struct dm_io **dm_poll_list_head(struct bio *bio)
 {
-   return (struct hlist_head *)&bio->bi_private;
+   return (struct dm_io **)&bio->bi_private;
 }
 
 static void dm_queue_poll_io(struct bio *bio, struct dm_io *io)
 {
-   struct hlist_head *head = dm_get_bio_hlist_head(bio);
+   struct dm_io **head = dm_poll_list_head(bio);
 
if (!(bio->bi_opf & REQ_DM_POLL_LIST)) {
bio->bi_opf |= REQ_DM_POLL_LIST;
@@ -1511,19 +1511,20 @@ static void dm_queue_poll_io(struct bio *bio, struct 
dm_io *io)
 */
io->data = bio->bi_private;
 
-   INIT_HLIST_HEAD(head);
-
/* tell block layer to poll for completion */
bio->bi_cookie = ~BLK_QC_T_NONE;
+
+   io->next = NULL;
} else {
/*
 * bio recursed due to split, reuse original poll list,
 * and save bio->bi_private too.
 */
-   io->data = hlist_entry(head->first, struct dm_io, node)->data;
+   io->data = (*head)->data;
+   io->next = *head;
}
 
-   hlist_add_head(&io->node, head);
+   *head = io;
 }
 
 /*
@@ -1677,18 +1678,16 @@ static bool dm_poll_dm_io(struct dm_io *io, struct 
io_comp_batch *iob,
 static int dm_poll_bio(struct bio *bio, struct io_comp_batch *iob,
   unsigned int flags)
 {
-   struct hlist_head *head = dm_get_bio_hlist_head(bio);
-   struct hlist_head tmp = HLIST_HEAD_INIT;
-   struct hlist_node *next;
-   struct dm_io *io;
+   struct dm_io **head = dm_poll_list_head(bio);
+   struct dm_io *list = *head;
+   struct dm_io *tmp = NULL;
+   struct dm_io *curr, *next;
 
/* Only poll normal bio which was marked as REQ_DM_POLL_LIST */
if (!(bio->bi_opf & REQ_DM_POLL_LIST))
return 0;
 
-   WARN_ON_ONCE(hlist_empty(head));
-
-   hlist_move_list(head, &tmp);
+   WARN_ON_ONCE(!list);
 
/*
 * Restore .bi_private before possibly completing dm_io.
@@ -1699,24 +1698,27 @@ static int dm_poll_bio(struct bio *bio, struct 
io_comp_batch *iob,
 * clearing REQ_DM_POLL_LIST here.
 */
bio->bi_opf &= ~REQ_DM_POLL_LIST;
-   bio->bi_private = hlist_entry(tmp.first, struct dm_io, node)->data;
+   bio->bi_private = list->data;
 
-   hlist_for_each_entry_safe(io, next, &tmp, node) {
-   if (dm_poll_dm_io(io, iob, flags)) {
-   hlist_del_init(&io->node);
+   for (curr = list, next = curr->next; curr; curr = next, next =
+   curr ? curr->next : NULL) {
+   if (dm_poll_dm_io(curr, iob, flags)) {
/*
 * clone_endio() has already occurred, so passing
 * error as 0 here doesn't override io->status
 */
-   dm_io_dec_pending(io, 0);
+   dm_io_dec_pending(curr, 0);
+   } else {
+   curr->next = tmp;
+   tmp = curr;
}
}
 
/* Not done? */
-   if (!hlist_empty(&tmp)) {
+   if (tmp) {
  

[dm-devel] [PATCH 2/3] dm: improve target io referencing

2022-04-08 Thread Ming Lei
Currently the target io's reference counter is grabbed before calling
__map_bio(); this isn't efficient since the grab can be moved into
alloc_io().

Meantime it becomes the typical async io reference counting model: one
reference is for the submission side, the other is for the completion
side, and the io won't be completed until both sides are done.

Signed-off-by: Ming Lei 
---
 drivers/md/dm.c | 36 +---
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b8424a4b4725..528559ca2f91 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -579,7 +579,9 @@ static struct dm_io *alloc_io(struct mapped_device *md, 
struct bio *bio)
io = container_of(tio, struct dm_io, tio);
io->magic = DM_IO_MAGIC;
io->status = 0;
-   atomic_set(&io->io_count, 1);
+
+   /* one is for submission, the other is for completion */
+   atomic_set(&io->io_count, 2);
this_cpu_inc(*md->pending_io);
io->orig_bio = NULL;
io->md = md;
@@ -929,11 +931,6 @@ static inline bool dm_tio_is_normal(struct dm_target_io 
*tio)
!dm_tio_flagged(tio, DM_TIO_IS_DUPLICATE_BIO));
 }
 
-static void dm_io_inc_pending(struct dm_io *io)
-{
-   atomic_inc(&io->io_count);
-}
-
 /*
  * Decrements the number of outstanding ios that a bio has been
  * cloned into, completing the original io if necc.
@@ -1275,7 +1272,6 @@ static void __map_bio(struct bio *clone)
/*
 * Map the clone.
 */
-   dm_io_inc_pending(io);
tio->old_sector = clone->bi_iter.bi_sector;
 
if (unlikely(swap_bios_limit(ti, clone))) {
@@ -1357,11 +1353,12 @@ static void alloc_multiple_bios(struct bio_list *blist, 
struct clone_info *ci,
}
 }
 
-static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
+static int __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
  unsigned num_bios, unsigned *len)
 {
struct bio_list blist = BIO_EMPTY_LIST;
struct bio *clone;
+   int ret = 0;
 
switch (num_bios) {
case 0:
@@ -1370,15 +1367,19 @@ static void __send_duplicate_bios(struct clone_info 
*ci, struct dm_target *ti,
clone = alloc_tio(ci, ti, 0, len, GFP_NOIO);
dm_tio_set_flag(clone_to_tio(clone), DM_TIO_IS_DUPLICATE_BIO);
__map_bio(clone);
+   ret = 1;
break;
default:
alloc_multiple_bios(&blist, ci, ti, num_bios, len);
while ((clone = bio_list_pop(&blist))) {
dm_tio_set_flag(clone_to_tio(clone), 
DM_TIO_IS_DUPLICATE_BIO);
__map_bio(clone);
+   ret += 1;
}
break;
}
+
+   return ret;
 }
 
 static void __send_empty_flush(struct clone_info *ci)
@@ -1398,8 +1399,16 @@ static void __send_empty_flush(struct clone_info *ci)
ci->bio = &flush_bio;
ci->sector_count = 0;
 
-   while ((ti = dm_table_get_target(ci->map, target_nr++)))
-   __send_duplicate_bios(ci, ti, ti->num_flush_bios, NULL);
+   while ((ti = dm_table_get_target(ci->map, target_nr++))) {
+   int bios;
+
+   atomic_add(ti->num_flush_bios, &ci->io->io_count);
+   bios = __send_duplicate_bios(ci, ti, ti->num_flush_bios, NULL);
+   atomic_sub(ti->num_flush_bios - bios, &ci->io->io_count);
+   }
+
+   /* alloc_io() takes one extra reference for submission */
+   atomic_sub(1, &ci->io->io_count);
 
bio_uninit(ci->bio);
 }
@@ -1408,6 +1417,7 @@ static void __send_changing_extent_only(struct clone_info 
*ci, struct dm_target
unsigned num_bios)
 {
unsigned len;
+   int bios;
 
len = min_t(sector_t, ci->sector_count,
max_io_len_target_boundary(ti, dm_target_offset(ti, 
ci->sector)));
@@ -1419,7 +1429,11 @@ static void __send_changing_extent_only(struct 
clone_info *ci, struct dm_target
ci->sector += len;
ci->sector_count -= len;
 
-   __send_duplicate_bios(ci, ti, num_bios, &len);
+
+   atomic_add(num_bios, &ci->io->io_count);
+   bios = __send_duplicate_bios(ci, ti, num_bios, &len);
+   /* alloc_io() takes one extra reference for submission */
+   atomic_sub(num_bios - bios + 1, &ci->io->io_count);
 }
 
 static bool is_abnormal_io(struct bio *bio)
-- 
2.31.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio

2022-04-08 Thread Ming Lei
dm_zone_map_bio() is only called from __map_bio(), in which the io's
reference is grabbed already, and the reference won't be released
until the bio is submitted, so there is no need to do it in
dm_zone_map_bio() any more.

Cc: Damien Le Moal 
Signed-off-by: Ming Lei 
---
 drivers/md/dm-core.h |  7 ---
 drivers/md/dm-zone.c | 10 --
 drivers/md/dm.c  |  7 ++-
 3 files changed, 6 insertions(+), 18 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 4277853c7535..13136bd47cb3 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -277,13 +277,6 @@ static inline void dm_io_set_flag(struct dm_io *io, 
unsigned int bit)
io->flags |= (1U << bit);
 }
 
-static inline void dm_io_inc_pending(struct dm_io *io)
-{
-   atomic_inc(&io->io_count);
-}
-
-void dm_io_dec_pending(struct dm_io *io, blk_status_t error);
-
 static inline struct completion *dm_get_completion_from_kobject(struct kobject 
*kobj)
 {
return &container_of(kobj, struct dm_kobject_holder, kobj)->completion;
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index c1ca9be4b79e..85d3c158719f 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -545,13 +545,6 @@ int dm_zone_map_bio(struct dm_target_io *tio)
return DM_MAPIO_KILL;
}
 
-   /*
-* The target map function may issue and complete the IO quickly.
-* Take an extra reference on the IO to make sure it does disappear
-* until we run dm_zone_map_bio_end().
-*/
-   dm_io_inc_pending(io);
-
/* Let the target do its work */
r = ti->type->map(ti, clone);
switch (r) {
@@ -580,9 +573,6 @@ int dm_zone_map_bio(struct dm_target_io *tio)
break;
}
 
-   /* Drop the extra reference on the IO */
-   dm_io_dec_pending(io, sts);
-
if (sts != BLK_STS_OK)
return DM_MAPIO_KILL;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3c5fad7c4ee6..b8424a4b4725 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -929,11 +929,16 @@ static inline bool dm_tio_is_normal(struct dm_target_io 
*tio)
!dm_tio_flagged(tio, DM_TIO_IS_DUPLICATE_BIO));
 }
 
+static void dm_io_inc_pending(struct dm_io *io)
+{
+   atomic_inc(&io->io_count);
+}
+
 /*
  * Decrements the number of outstanding ios that a bio has been
  * cloned into, completing the original io if necc.
  */
-void dm_io_dec_pending(struct dm_io *io, blk_status_t error)
+static void dm_io_dec_pending(struct dm_io *io, blk_status_t error)
 {
/* Push-back supersedes any I/O errors */
if (unlikely(error)) {
-- 
2.31.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 0/3] dm: improvement on io referencing and io poll

2022-04-08 Thread Ming Lei
Hello,

The 1st patch removes get/put io reference in dm_zone_map_bio.

The 2nd one improves io referencing.

The 3rd patch switches to hold dm_io instance in single list.


Ming Lei (3):
  dm: don't grab target io reference in dm_zone_map_bio
  dm: improve target io referencing
  dm: put all polled io into one single list

 drivers/md/dm-core.h |  9 +
 drivers/md/dm-zone.c | 10 --
 drivers/md/dm.c  | 79 
 3 files changed, 51 insertions(+), 47 deletions(-)

-- 
2.31.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [RFC PATCH] io_uring: reissue in case -EAGAIN is returned after io issue returns

2022-04-05 Thread Ming Lei
On Mon, Apr 04, 2022 at 12:51:30PM -0400, Mike Snitzer wrote:
> On Sun, Apr 03 2022 at  7:45P -0400,
> Ming Lei  wrote:
> 
> > -EAGAIN still may return after io issue returns, and REQ_F_REISSUE is
> > set in io_complete_rw_iopoll(), but the req never gets chance to be handled.
> > io_iopoll_check doesn't handle this situation, and io hang can be caused.
> > 
> > Current dm io polling may return -EAGAIN after bio submission is
> > returned, also blk-throttle might trigger this situation too.
> > 
> > Cc: Mike Snitzer 
> > Signed-off-by: Ming Lei 
> 
> I first reverted commit 5291984004ed ("dm: fix bio polling to handle
> possibile BLK_STS_AGAIN") then applied this patch and verified this
> fixes the DM bio polling hangs.  Nice work!
> 
> But interestingly with this fio test (against dm-linear ontop of
> null_blk with queue_mode=2 submit_queues=8 poll_queues=2 bs=4096 gb=16):
> 
> fio --bs=4096 --ioengine=io_uring --fixedbufs --registerfiles --hipri=1 \
> --iodepth=16 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 \
> --filename=/dev/mapper/linear --direct=1 --runtime=20 --numjobs=16 \
> --rw=randread --name=test --group_reporting --norandommap

16 jobs in an io_uring/aio test is overkill.

> 
> I get 3186k IOPS with your patch to have io_uring retry (and commit
> 5291984004ed reverted), but 4305k IOPS if leave commit 5291984004ed
> applied (and DM resorts to retrying any -EAGAIN _without_ polling).

IMO, commit 5291984004ed shouldn't be reverted; it is reasonable for dm
to retry the underlying IO.

This patch is for making io_uring more reliable, since the current
io_uring code only handles -EAGAIN from the submission code path;
-EAGAIN/REISSUE isn't handled if it is returned during ->poll(), and
that causes the io hang.

Jens, what do you think of this patch? Does io_uring need to handle
-EAGAIN in this case?

> 
> Jens rightly pointed out to me that polling tests that exhaust tags
> are bogus anyway (because such unbounded IO defeats the point of
> polling).  Jens also thinks my result, with commit 5291984004ed
> applied, is somehow bogus and not to be trusted ;)  He is very likely
> correct, and the failing likely in the null_blk driver -- I'm
> skeptical of that driver given it cannot pass fio verify testing
> (e.g. --do_verify=1 --verify=crc32c --verify_async=1) with or without
> polling.

Because it is null block...

> 
> Review comments inlined below.
> 
> > ---
> >  fs/io-wq.h|  13 +
> >  fs/io_uring.c | 128 --
> >  2 files changed, 86 insertions(+), 55 deletions(-)
> > 
> > diff --git a/fs/io-wq.h b/fs/io-wq.h
> > index dbecd27656c7..4ca4863664fb 100644
> > --- a/fs/io-wq.h
> > +++ b/fs/io-wq.h
> > @@ -96,6 +96,19 @@ static inline void wq_list_add_head(struct 
> > io_wq_work_node *node,
> > WRITE_ONCE(list->first, node);
> >  }
> >  
> > +static inline void wq_list_remove(struct io_wq_work_list *list,
> > + struct io_wq_work_node *prev,
> > + struct io_wq_work_node *node)
> > +{
> > +   if (!prev)
> > +   WRITE_ONCE(list->first, node->next);
> > +   else
> > +   prev->next = node->next;
> > +
> > +   if (node == list->last)
> > +   list->last = prev;
> > +}
> > +
> >  static inline void wq_list_cut(struct io_wq_work_list *list,
> >struct io_wq_work_node *last,
> >struct io_wq_work_node *prev)
> > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > index 59e54a6854b7..6db5514e10ca 100644
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -2759,6 +2759,65 @@ static inline bool io_run_task_work(void)
> > return false;
> >  }
> >  
> > +#ifdef CONFIG_BLOCK
> > +static bool io_resubmit_prep(struct io_kiocb *req)
> > +{
> > +   struct io_async_rw *rw = req->async_data;
> > +
> > +   if (!req_has_async_data(req))
> > +   return !io_req_prep_async(req);
> > +   iov_iter_restore(>s.iter, >s.iter_state);
> > +   return true;
> > +}
> > +
> > +static bool io_rw_should_reissue(struct io_kiocb *req)
> > +{
> > +   umode_t mode = file_inode(req->file)->i_mode;
> > +   struct io_ring_ctx *ctx = req->ctx;
> > +
> > +   if (!S_ISBLK(mode) && !S_ISREG(mode))
> > +   return false;
> > +   if ((req->flags & REQ_F_NOWAIT) || (io_wq_current_is_worker() &&
> > +   !(ctx->flags & IORING_SETUP_IOPOLL)))
> > +

Re: [dm-devel] [PATCH v6 2/2] dm: support bio polling

2022-03-09 Thread Ming Lei
On Wed, Mar 09, 2022 at 09:11:26AM -0700, Jens Axboe wrote:
> On 3/8/22 6:13 PM, Ming Lei wrote:
> > On Tue, Mar 08, 2022 at 06:02:50PM -0700, Jens Axboe wrote:
> >> On 3/7/22 11:53 AM, Mike Snitzer wrote:
> >>> From: Ming Lei 
> >>>
> >>> Support bio(REQ_POLLED) polling in the following approach:
> >>>
> >>> 1) only support io polling on normal READ/WRITE, and other abnormal IOs
> >>> still fallback to IRQ mode, so the target io is exactly inside the dm
> >>> io.
> >>>
> >>> 2) hold one refcnt on io->io_count after submitting this dm bio with
> >>> REQ_POLLED
> >>>
> >>> 3) support dm native bio splitting, any dm io instance associated with
> >>> current bio will be added into one list which head is bio->bi_private
> >>> which will be recovered before ending this bio
> >>>
> >>> 4) implement .poll_bio() callback, call bio_poll() on the single target
> >>> bio inside the dm io which is retrieved via bio->bi_bio_drv_data; call
> >>> dm_io_dec_pending() after the target io is done in .poll_bio()
> >>>
> >>> 5) enable QUEUE_FLAG_POLL if all underlying queues enable QUEUE_FLAG_POLL,
> >>> which is based on Jeffle's previous patch.
> >>
> >> It's not the prettiest thing in the world with the overlay on bi_private,
> >> but at least it's nicely documented now.
> >>
> >> I would encourage you to actually test this on fast storage, should make
> >> a nice difference. I can run this on a gen2 optane, it's 10x the IOPS
> >> of what it was tested on and should help better highlight where it
> >> makes a difference.
> >>
> >> If either of you would like that, then send me a fool proof recipe for
> >> what should be setup so I have a poll capable dm device.
> > 
> > Follows steps for setup dm stripe over two nvmes, then run io_uring on
> > the dm stripe dev.
> 
> Thanks! Much easier when I don't have to figure it out... Setup:

Jens, thanks for running the test!

> 
> CPU: 12900K
> Drives: 2x P5800X gen2 optane (~5M IOPS each at 512b)
> 
> Baseline kernel:
> 
> sudo taskset -c 10 t/io_uring -d128 -b512 -s31 -c16 -p1 -F1 -B1 -n1 -R1 -X1 
> /dev/dm-0
> Added file /dev/dm-0 (submitter 0)
> polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> submitter=0, tid=1004
> IOPS=2794K, BW=1364MiB/s, IOS/call=31/30, inflight=(124)
> IOPS=2793K, BW=1363MiB/s, IOS/call=31/31, inflight=(62)
> IOPS=2789K, BW=1362MiB/s, IOS/call=31/30, inflight=(124)
> IOPS=2779K, BW=1357MiB/s, IOS/call=31/31, inflight=(124)
> IOPS=2780K, BW=1357MiB/s, IOS/call=31/31, inflight=(62)
> IOPS=2779K, BW=1357MiB/s, IOS/call=31/31, inflight=(62)
> ^CExiting on signal
> Maximum IOPS=2794K
> 
> generating about 500K ints/sec,

~5.6 IOs completed per interrupt on average (2794K IOPS / 500K ints/sec),
so irq coalescing looks to be working.

> and using 4k blocks:
> 
> sudo taskset -c 10 t/io_uring -d128 -b4096 -s31 -c16 -p1 -F1 -B1 -n1 -R1 -X1 
> /dev/dm-0
> Added file /dev/dm-0 (submitter 0)
> polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> submitter=0, tid=967
> IOPS=1683K, BW=6575MiB/s, IOS/call=24/24, inflight=(93)
> IOPS=1685K, BW=6584MiB/s, IOS/call=24/24, inflight=(124)
> IOPS=1686K, BW=6588MiB/s, IOS/call=24/24, inflight=(124)
> IOPS=1684K, BW=6581MiB/s, IOS/call=24/24, inflight=(93)
> IOPS=1686K, BW=6589MiB/s, IOS/call=24/24, inflight=(124)
> IOPS=1687K, BW=6593MiB/s, IOS/call=24/24, inflight=(128)
> IOPS=1687K, BW=6590MiB/s, IOS/call=24/24, inflight=(93)
> ^CExiting on signal
> Maximum IOPS=1687K
> 
> which ends up being bw limited for me, because the devices aren't linked
> gen4. That's about 1.4M ints/sec.

Looks like one interrupt completes just one IO with 4k bs, no irq
coalescing any more. The interrupts may not run on CPU 10, I guess.

> 
> With the patched kernel, same test:
> 
> sudo taskset -c 10 t/io_uring -d128 -b512 -s31 -c16 -p1 -F1 -B1 -n1 -R1 -X1 
> /dev/dm-0
> Added file /dev/dm-0 (submitter 0)
> polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> submitter=0, tid=989
> IOPS=4151K, BW=2026MiB/s, IOS/call=16/15, inflight=(128)
> IOPS=4159K, BW=2031MiB/s, IOS/call=15/15, inflight=(128)
> IOPS=4193K, BW=2047MiB/s, IOS/call=15/15, inflight=(128)
> IOPS=4191K, BW=2046MiB/s, IOS/call=15/15, inflight=(128)
> IOPS=4202K, BW=2052MiB/s, IOS/call=15/15, inflight=(128)
> ^CExiting on signal
> Maximum IOPS=4202K
> 
> with basically zero interrupts, and 4k

Re: [dm-devel] [PATCH v6 2/2] dm: support bio polling

2022-03-08 Thread Ming Lei
On Tue, Mar 08, 2022 at 06:02:50PM -0700, Jens Axboe wrote:
> On 3/7/22 11:53 AM, Mike Snitzer wrote:
> > From: Ming Lei 
> > 
> > Support bio(REQ_POLLED) polling in the following approach:
> > 
> > 1) only support io polling on normal READ/WRITE, and other abnormal IOs
> > still fallback to IRQ mode, so the target io is exactly inside the dm
> > io.
> > 
> > 2) hold one refcnt on io->io_count after submitting this dm bio with
> > REQ_POLLED
> > 
> > 3) support dm native bio splitting, any dm io instance associated with
> > current bio will be added into one list which head is bio->bi_private
> > which will be recovered before ending this bio
> > 
> > 4) implement .poll_bio() callback, call bio_poll() on the single target
> > bio inside the dm io which is retrieved via bio->bi_bio_drv_data; call
> > dm_io_dec_pending() after the target io is done in .poll_bio()
> > 
> > 5) enable QUEUE_FLAG_POLL if all underlying queues enable QUEUE_FLAG_POLL,
> > which is based on Jeffle's previous patch.
> 
> It's not the prettiest thing in the world with the overlay on bi_private,
> but at least it's nicely documented now.
> 
> I would encourage you to actually test this on fast storage, should make
> a nice difference. I can run this on a gen2 optane, it's 10x the IOPS
> of what it was tested on and should help better highlight where it
> makes a difference.
> 
> If either of you would like that, then send me a fool proof recipe for
> what should be setup so I have a poll capable dm device.

Follows steps for setup dm stripe over two nvmes, then run io_uring on
the dm stripe dev.

1) dm_stripe.perl

#!/usr/bin/perl -w
# Create a striped device across any number of underlying devices. The device
# will be called "stripe_dev" and have a chunk-size of 128k.

my $chunk_size = 128 * 2;
my $dev_name = "stripe_dev";
my $num_devs = @ARGV;
my @devs = @ARGV;
my ($min_dev_size, $stripe_dev_size, $i);

if (!$num_devs) {
die("Specify at least one device\n");
}

$min_dev_size = `blockdev --getsz $devs[0]`;
for ($i = 1; $i < $num_devs; $i++) {
my $this_size = `blockdev --getsz $devs[$i]`;
$min_dev_size = ($min_dev_size < $this_size) ?
$min_dev_size : $this_size;
}

$stripe_dev_size = $min_dev_size * $num_devs;
$stripe_dev_size -= $stripe_dev_size % ($chunk_size * $num_devs);

$table = "0 $stripe_dev_size striped $num_devs $chunk_size";
for ($i = 0; $i < $num_devs; $i++) {
$table .= " $devs[$i] 0";
}

`echo $table | dmsetup create $dev_name`;


2) test_poll_on_dm_stripe.sh

#!/bin/bash

RT=40
JOBS=1
HI=1
BS=4K

set -x
dmsetup remove_all

rmmod nvme
modprobe nvme poll_queues=2

sleep 2

./dm_stripe.perl /dev/nvme0n1 /dev/nvme1n1
sleep 1
DEV=/dev/mapper/stripe_dev

echo "io_uring hipri test"

fio --bs=$BS --ioengine=io_uring --fixedbufs --registerfiles \
    --hipri=$HI --iodepth=64 --iodepth_batch_submit=16 \
    --iodepth_batch_complete_min=16 \
    --filename=$DEV --direct=1 --runtime=$RT --numjobs=$JOBS \
    --rw=randread --name=test --group_reporting

Thanks, 
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v5 2/2] dm: support bio polling

2022-03-06 Thread Ming Lei
On Mon, Mar 07, 2022 at 10:41:31AM +0800, Ming Lei wrote:
> On Sun, Mar 06, 2022 at 07:25:11PM -0700, Jens Axboe wrote:
> > On 3/6/22 7:20 PM, Ming Lei wrote:
> > > On Sun, Mar 06, 2022 at 06:48:15PM -0700, Jens Axboe wrote:
> > >> On 3/6/22 2:29 AM, Christoph Hellwig wrote:
> > >>>> +/*
> > >>>> + * Reuse ->bi_end_io as hlist head for storing all dm_io instances
> > >>>> + * associated with this bio, and this bio's bi_end_io has to be
> > >>>> + * stored in one of 'dm_io' instance first.
> > >>>> + */
> > >>>> +static inline struct hlist_head *dm_get_bio_hlist_head(struct bio 
> > >>>> *bio)
> > >>>> +{
> > >>>> +  WARN_ON_ONCE(!(bio->bi_opf & REQ_DM_POLL_LIST));
> > >>>> +
> > >>>> +  return (struct hlist_head *)&bio->bi_end_io;
> > >>>> +}
> > >>>
> > >>> So this reuse is what I really hated.  I still think we should be able
> > >>> to find space in the bio by creatively shifting fields around to just
> > >>> add the hlist there directly, which would remove the need for this
> > >>> override and more importantly the quite cumbersome saving and restoring
> > >>> of the end_io handler.
> > >>
> > >> If it's possible, then that would be preferable. But I don't think
> > >> that's going to be easy to do...
> > > 
> > > I agree, now basically there isn't gap inside bio, so either adding one
> > > new field or reusing one existed field...
> > 
> > There'd no amount of re-arranging that'll free up 8 bytes, that's just
> > not happening. I'm not a huge fan of growing struct bio for that, and
> > the oddity here is mostly (to me) that ->bi_end_io is the one overlayed.
> > That would usually belong to the owner of the bio.
> > 
> > Maybe some commenting would help?
> 
> OK, ->bi_end_io is safe because it is only called until the bio is
> ended, so we can retrieve the list head and recover ->bi_end_io before
> polling.

->bi_private can be reused too, is that better?

Yeah, both belong to the owner (higher level storage), so the block layer
can't touch them inside submit_bio_noacct(); that is also why this trick
is safe.
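
As an aside, a rough sketch of how the ->bi_private reuse can look,
following the same save/restore pattern as the dm patch elsewhere in
this thread (simplified, not the literal code):

static void queue_poll_io(struct bio *bio, struct dm_io *io)
{
	if (!(bio->bi_opf & REQ_DM_POLL_LIST)) {
		bio->bi_opf |= REQ_DM_POLL_LIST;
		io->data = bio->bi_private;	/* save the owner's value */
		io->next = NULL;
	} else {
		struct dm_io *head = bio->bi_private;

		io->data = head->data;		/* keep the saved value */
		io->next = head;
	}
	bio->bi_private = io;			/* reuse as the list head */
}

/* Before ending/polling the bio, the owner's pointer is restored:
 * bio->bi_private = ((struct dm_io *)bio->bi_private)->data;
 * so the trick stays invisible outside dm.
 */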

Thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v5 2/2] dm: support bio polling

2022-03-06 Thread Ming Lei
On Sun, Mar 06, 2022 at 07:25:11PM -0700, Jens Axboe wrote:
> On 3/6/22 7:20 PM, Ming Lei wrote:
> > On Sun, Mar 06, 2022 at 06:48:15PM -0700, Jens Axboe wrote:
> >> On 3/6/22 2:29 AM, Christoph Hellwig wrote:
> >>>> +/*
> >>>> + * Reuse ->bi_end_io as hlist head for storing all dm_io instances
> >>>> + * associated with this bio, and this bio's bi_end_io has to be
> >>>> + * stored in one of 'dm_io' instance first.
> >>>> + */
> >>>> +static inline struct hlist_head *dm_get_bio_hlist_head(struct bio *bio)
> >>>> +{
> >>>> +WARN_ON_ONCE(!(bio->bi_opf & REQ_DM_POLL_LIST));
> >>>> +
> >>>> +return (struct hlist_head *)&bio->bi_end_io;
> >>>> +}
> >>>
> >>> So this reuse is what I really hated.  I still think we should be able
> >>> to find space in the bio by creatively shifting fields around to just
> >>> add the hlist there directly, which would remove the need for this
> >>> override and more importantly the quite cumbersome saving and restoring
> >>> of the end_io handler.
> >>
> >> If it's possible, then that would be preferable. But I don't think
> >> that's going to be easy to do...
> > 
> > I agree, now basically there isn't gap inside bio, so either adding one
> > new field or reusing one existed field...
> 
> There'd no amount of re-arranging that'll free up 8 bytes, that's just
> not happening. I'm not a huge fan of growing struct bio for that, and
> the oddity here is mostly (to me) that ->bi_end_io is the one overlayed.
> That would usually belong to the owner of the bio.
> 
> Maybe some commenting would help?

OK, ->bi_end_io is safe because it is only called until the bio is
ended, so we can retrieve the list head and recover ->bi_end_io before
polling.

> Is bi_next available at this point?

The same bio can be re-submitted to the block layer because of splitting,
and will be linked to current->bio_list[].

BTW, bio splitting can happen very often for some dm targets; that is why
we don't ignore bio splitting for dm polling.


Thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v5 2/2] dm: support bio polling

2022-03-06 Thread Ming Lei
On Sun, Mar 06, 2022 at 06:48:15PM -0700, Jens Axboe wrote:
> On 3/6/22 2:29 AM, Christoph Hellwig wrote:
> >> +/*
> >> + * Reuse ->bi_end_io as hlist head for storing all dm_io instances
> >> + * associated with this bio, and this bio's bi_end_io has to be
> >> + * stored in one of 'dm_io' instance first.
> >> + */
> >> +static inline struct hlist_head *dm_get_bio_hlist_head(struct bio *bio)
> >> +{
> >> +  WARN_ON_ONCE(!(bio->bi_opf & REQ_DM_POLL_LIST));
> >> +
> >> +  return (struct hlist_head *)&bio->bi_end_io;
> >> +}
> > 
> > So this reuse is what I really hated.  I still think we should be able
> > to find space in the bio by creatively shifting fields around to just
> > add the hlist there directly, which would remove the need for this
> > override and more importantly the quite cumbersome saving and restoring
> > of the end_io handler.
> 
> If it's possible, then that would be preferable. But I don't think
> that's going to be easy to do...

I agree; right now there is basically no gap inside bio, so it is either
adding one new field or reusing one existing field...


Thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v4 0/2] block/dm: support bio polling

2022-03-04 Thread Ming Lei
On Fri, Mar 04, 2022 at 04:26:21PM -0500, Mike Snitzer wrote:
> Hi,
> 
> I've rebased Ming's latest [1] ontop of dm-5.18 [2] (which is based on
> for-5.18/block). End result available in dm-5.18-biopoll branch [3]
> 
> These changes add bio polling support to DM.  Tested with linear and
> striped DM targets.
> 
> IOPS improvement was ~5% on my baremetal system with a single Intel
> Optane NVMe device (555K hipri=1 vs 525K hipri=0).
> 
> Ming has seen better improvement while testing within a VM:
>  dm-linear: hipri=1 vs hipri=0 15~20% iops improvement
>  dm-stripe: hipri=1 vs hipri=0 ~30% iops improvement
> 
> I'd like to merge these changes via the DM tree when the 5.18 merge
> window opens.  The first block patch that adds ->poll_bio to
> block_device_operations will need review so that I can take it
> through the DM tree.  Reason for going through the DM tree is there
> have been some fairly extensive changes queued in dm-5.18 that build
> on for-5.18/block.  So I think it easiest to just add the block
> depenency via DM tree since DM is first consumer of ->poll_bio
> 
> FYI, Ming does have another DM patch [4] that looks to avoid using
> hlist but I only just saw it.  bio_split() _is_ involved (see
> dm_split_and_process_bio) so I'm not exactly sure where he is going
> with that change. 

io_uring (polling) workloads often care about latency, so big IO requests
usually aren't involved, I guess. Then bio_split() is seldom called in
dm_split_and_process_bio(); for example, if 4k random IO is run on
dm-linear or dm-stripe via io_uring, bio_split() won't be hit.

A single list is enough here, and more efficient than hlist; it just needs
a little care when deleting an element from the list, since the linux
kernel doesn't have a generic singly-linked list implementation.
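
For the record, a minimal sketch of that care, using a pointer-to-pointer
walk so no separate prev node needs tracking (io_is_done() is just a
placeholder for the completion test):

	struct dm_io **pprev = &head;
	struct dm_io *io = head;

	while (io) {
		struct dm_io *next = io->next;

		if (io_is_done(io))
			*pprev = next;		/* unlink, no prev node needed */
		else
			pprev = &io->next;	/* advance the link to patch */
		io = next;
	}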

> But that is DM-implementation detail that we'll
> sort out.

Yeah, that patch also needs more testing.


Thanks, 
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH V3 0/3] block/dm: support bio polling

2022-03-01 Thread Ming Lei
On Tue, Mar 01, 2022 at 04:19:42PM -0500, Mike Snitzer wrote:
> On Mon, Feb 28 2022 at  7:58P -0500,
> Ming Lei  wrote:
> 
> > On Mon, Feb 28, 2022 at 11:27:44AM -0500, Mike Snitzer wrote:
> > > 
> > > Hey Ming,
> > > 
> > > I'd like us to follow-through with adding bio-based polling support.
> > > Kind of strange none of us that were sent this V3 ever responded,
> > > sorry about that!
> > > 
> > > Do you have interest in rebasing this patchset (against linux-dm.git's
> > > "dm-5.18" branch since there has been quite some churn)?  Or are you
> > > OK with me doing the rebase?
> > 
> > Hi Mike,
> > 
> > Actually I have one local v5.17 rebase:
> > 
> > https://github.com/ming1/linux/tree/my_v5.17-dm-io-poll
> > 
> > Also one for-5.18/block rebase which is done just now:
> > 
> > https://github.com/ming1/linux/tree/my_v5.18-dm-bio-poll
> > 
> > In my previous test on v5.17 rebase, the IOPS improvement is a bit small,
> > so I didn't post it out. Recently not get time to investigate
> > the performance further, so please feel free to work on it.
> 
> OK, I've rebased it on dm-5.18.
> 
> Can you please share the exact test(s) you were running?  I assume you
> were running directly against a request-based device and then
> comparing polling perf through dm-linear to the same underlying
> request-based device?

I run io_uring over dm-linear and dm-stripe, over two nvme disks with
2 poll_queues.

IOPS improvement can be observed, but not big.

Thanks,
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH V3 0/3] block/dm: support bio polling

2022-02-28 Thread Ming Lei
On Mon, Feb 28, 2022 at 11:27:44AM -0500, Mike Snitzer wrote:
> On Wed, Jun 23 2021 at  3:40P -0400,
> Ming Lei  wrote:
> 
> > Hello Guys,
> > 
> > Based on Christoph's bio based polling model[1], implement DM bio polling
> > with one very simple approach.
> > 
> > Patch 1 adds helper of blk_queue_poll().
> > 
> > Patch 2 adds .bio_poll() callback to block_device_operations, so bio
> > driver can implement its own logic for io polling.
> > 
> > Patch 3 implements bio polling for device mapper.
> > 
> > 
> > V3:
> > - patch style change as suggested by Christoph(2/3)
> > - fix kernel panic issue caused by nested dm polling, which is found
> >   & figured out by Jeffle Xu (3/3)
> > - re-organize setup polling code (3/3)
> > - remove RFC
> > 
> > V2:
> > - drop patch to add new fields into bio
> > - support io polling for dm native bio splitting
> > - add comment
> > 
> > Ming Lei (3):
> >   block: add helper of blk_queue_poll
> >   block: add ->poll_bio to block_device_operations
> >   dm: support bio polling
> > 
> >  block/blk-core.c |  18 +++---
> >  block/blk-sysfs.c|   4 +-
> >  block/genhd.c|   2 +
> >  drivers/md/dm-table.c|  24 +++
> >  drivers/md/dm.c  | 131 ++-
> >  drivers/nvme/host/core.c |   2 +-
> >  include/linux/blkdev.h   |   2 +
> >  7 files changed, 170 insertions(+), 13 deletions(-)
> > 
> > -- 
> > 2.31.1
> > 
> 
> Hey Ming,
> 
> I'd like us to follow-through with adding bio-based polling support.
> Kind of strange none of us that were sent this V3 ever responded,
> sorry about that!
> 
> Do you have interest in rebasing this patchset (against linux-dm.git's
> "dm-5.18" branch since there has been quite some churn)?  Or are you
> OK with me doing the rebase?

Hi Mike,

Actually I have one local v5.17 rebase:

https://github.com/ming1/linux/tree/my_v5.17-dm-io-poll

Also one for-5.18/block rebase which is done just now:

https://github.com/ming1/linux/tree/my_v5.18-dm-bio-poll

In my previous test on the v5.17 rebase, the IOPS improvement was a bit
small, so I didn't post it out. Recently I haven't had time to investigate
the performance further, so please feel free to work on it.


Thanks,
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH] dm rq: clear cloned bio ->bi_bdev to fix I/O accounting

2022-01-17 Thread Ming Lei
On Sun, Jan 16, 2022 at 11:51:17PM -0800, Christoph Hellwig wrote:
> So I actually noticed this during code inspection a while ago, but I
> think we need to use the actual underlying device instead of NULL here
> to keep our block layer gurantees.  See the patch in my queue below.
> 
> ---
> From 1e330b8e57fc0d6c6fb07c0ec2915dca5d7a585a Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig 
> Date: Thu, 13 Jan 2022 10:53:59 +0100
> Subject: block: assign bi_bdev for cloned bios in blk_rq_prep_clone
> 
> The cloned bios for the cloned request in blk_rq_prep_clone currently
> still point to the original bi_bdev.  This is harmless because dm-mpath

It breaks io accounting, so not harmless, but the code in this patch is
fine:

Reviewed-by: Ming Lei 


Thanks,
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH] dm rq: clear cloned bio ->bi_bdev to fix I/O accounting

2022-01-15 Thread Ming Lei
On Sat, Jan 15, 2022 at 1:59 AM Benjamin Marzinski  wrote:
>
> bio_clone_fast() sets the cloned bio to have the same ->bi_bdev as the
> source bio. This means that when request-based dm called setup_clone(),
> the cloned bio had its ->bi_bdev pointing to the dm device. After Commit
> 0b6e522cdc4a ("blk-mq: use ->bi_bdev for I/O accounting")
> __blk_account_io_start() started using the request's ->bio->bi_bdev for
> I/O accounting, if it was set. This caused IO going to the underlying
> devices to use the dm device for their I/O accounting.
>
> Request-based dm can't be used on top of partitions, so
> dm_rq_bio_constructor() can just clear the cloned bio's ->bi_bdev and
> have __blk_account_io_start() fall back to using rq->rq_disk->part0 for
> the I/O accounting.
>
> Signed-off-by: Benjamin Marzinski 
> ---
>  drivers/md/dm-rq.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index 579ab6183d4d..42099dc76e3c 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -328,6 +328,7 @@ static int dm_rq_bio_constructor(struct bio *bio, struct 
> bio *bio_orig,
> info->orig = bio_orig;
>     info->tio = tio;
> bio->bi_end_io = end_clone_bio;
> +   bio->bi_bdev = NULL;

Reviewed-by: Ming Lei 

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 1/2] block: add resubmit_bio_noacct()

2022-01-10 Thread Ming Lei
On Mon, Jan 10, 2022 at 02:03:16PM -0500, Mike Snitzer wrote:
> On Mon, Jan 10 2022 at 12:35P -0500,
> Christoph Hellwig  wrote:
> 
> > On Mon, Jan 10, 2022 at 03:51:40PM +0800, Ming Lei wrote:
> > > Add block layer API of resubmit_bio_noacct() for handling blk-throttle
> > > iops limit correctly. Typical use case is that bio split, and it isn't
> > > good to export blk_throtl_charge_bio_split() for drivers, so add new API
> > > for serving such purpose.
> > 
> > Umm, submit_bio_noacct is meant exactly for this case of resubmitting
> > a bio.  We should not need another API for that.
> > 
> 
> Ming is lifting code out of __blk_queue_split() for reuse (by DM in
> this instance, because it has its own bio_split+bio_chain).
> 
> Are you saying submit_bio_noacct() should be made to call
> blk_throtl_charge_bio_split() and blk_throtl_charge_bio_split() simply
> return if not a split bio? (not sure bio has enough context to know,

DM or MD may submit a split bio to the underlying queue directly, so we
can't simply do that. Some filesystems may call bio_split() too.

Or should we simply let blk_throtl_bio() cover everything? That is
basically what V1 did, and the only issue is that we can't do the
accounting when submit_bio_noacct() is called from blk_throtl_dispatch_work_fn().

> other than looking at some side-effect change from bio_chain)
> 
> But Ming: your __blk_queue_split() change seems wrong.
> Prior to your patch __blk_queue_split() did:
> 
> bio_chain(split, *bio);
> submit_bio_noacct(*bio);
> *bio = split;
> blk_throtl_charge_bio_split(*bio);
> 
> After your patch (effectively):
> 
> bio_chain(split, *bio);
> submit_bio_noacct(*bio);
> blk_throtl_charge_bio_split(bio);
> *bio = split;
> 
> Maybe that was intended? (or maybe it doesn't matter because bio_split
> copies fields with bio_clone_fast())?  Regardless, it is subtle.

It doesn't matter: blk_throtl_charge_bio_split() only accounts the bio
count, and here the split bio is submitted to the same queue.

> 
> Should blk_throtl_charge_bio_split() just be pushed down to
> bio_split()?

That would be fragile: is the bio allocated by bio_split() always
submitted eventually, and always to the same queue?

> 
> In general, such narrow hacks for how to properly resubmit split bios
> are asking for further trouble.

I don't think it is a hack; it is an approach that has already been
verified to work in blk-mq.

> As is, I'm having to triage new
> reports of bio-based accounting issues (which has called into question
> my hack/fix commit a1e1cb72d9649 ("dm: fix redundant IO accounting for
> bios that need splitting") that papered over this bigger issue of
> needing proper split IO accounting, so likely needs to be revisited).

Here the issue is just about bio number accounting.

BTW, maybe you can follow blk-mq's way: account only after the io is split,
e.g. move start_io_acct() to the end of __split_and_process_non_flush();
then the io start is accounted exactly once.


Thanks, 
Ming




Re: [dm-devel] [PATCH 1/2] block: add resubmit_bio_noacct()

2022-01-10 Thread Ming Lei
On Mon, Jan 10, 2022 at 09:35:22AM -0800, Christoph Hellwig wrote:
> On Mon, Jan 10, 2022 at 03:51:40PM +0800, Ming Lei wrote:
> > Add block layer API of resubmit_bio_noacct() for handling blk-throttle
> > iops limit correctly. Typical use case is that bio split, and it isn't
> > good to export blk_throtl_charge_bio_split() for drivers, so add new API
> > for serving such purpose.
> 
> Umm, submit_bio_noacct is meant exactly for this case of resubmitting
> a bio.  We should not need another API for that.

First, some background on the issue:

1) the IOPS limit throttle needs to cover split bios, since iostat
actually accounts split bios, and user space sets up the iops limit
throttle based on feedback from iostat/diskstat;

2) block throttle is block layer internal stuff, and we shouldn't expose
blk_throtl_charge_bio_split() to drivers.

Maybe rename the new API to submit_split_bio_noacct(); but we can't
simply reuse submit_bio_noacct(), otherwise blk_throtl_charge_bio_split()
would need to be exported.

Another idea is to clear BIO_THROTTLED before calling submit_bio_noacct();
meanwhile the blk-throttle code needs changes to avoid double accounting
of bio bytes, so the caller of submit_bio_noacct() still needs some change.
This way gives smooth IOPS throttling, but needs one extra call to
__blk_throtl_bio() for the split bio.

Any other idea for this bio split vs. iops limit issue?
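
A minimal sketch of that second idea (resubmit_split_bio() is a
hypothetical helper, and the double accounting of bio bytes in
blk-throttle would still have to be handled separately):

/* Strip BIO_THROTTLED so __blk_throtl_bio() sees the remainder bio
 * again and charges it against the iops limit, at the cost of one
 * extra pass through the throttle code.
 */
static void resubmit_split_bio(struct bio *bio)
{
	bio_clear_flag(bio, BIO_THROTTLED);
	submit_bio_noacct(bio);
}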


Thanks,
Ming




[dm-devel] [PATCH 1/2] block: add resubmit_bio_noacct()

2022-01-10 Thread Ming Lei
Add a block layer API, resubmit_bio_noacct(), for handling the blk-throttle
iops limit correctly. The typical use case is bio splitting, and it isn't
good to export blk_throtl_charge_bio_split() to drivers, so add a new API
serving this purpose.

Cc: lining 
Cc: Tejun Heo 
Cc: Chunguang Xu 
Signed-off-by: Ming Lei 
---
 block/blk-core.c   | 12 
 block/blk-merge.c  |  4 +---
 include/linux/blkdev.h |  1 +
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fd029c86d6ac..733fec7dc5d6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -910,6 +910,18 @@ void submit_bio_noacct(struct bio *bio)
 }
 EXPORT_SYMBOL(submit_bio_noacct);
 
+/*
+ * Usually for submitting one bio which has been checked by
+ * submit_bio_checks already. The typical use case is for handling
+ * blk-throttle iops limit correctly.
+ */
+void resubmit_bio_noacct(struct bio *bio)
+{
+   submit_bio_noacct(bio);
+   blk_throtl_charge_bio_split(bio);
+}
+EXPORT_SYMBOL(resubmit_bio_noacct);
+
 /**
  * submit_bio - submit a bio to the block device layer for I/O
  * @bio: The  bio which describes the I/O
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 4de34a332c9f..acc786d872e6 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -366,10 +366,8 @@ void __blk_queue_split(struct request_queue *q, struct bio 
**bio,
 
bio_chain(split, *bio);
trace_block_split(split, (*bio)->bi_iter.bi_sector);
-   submit_bio_noacct(*bio);
+   resubmit_bio_noacct(*bio);
*bio = split;
-
-   blk_throtl_charge_bio_split(*bio);
}
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 22746b2d6825..cce2db9fae1f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -600,6 +600,7 @@ static inline unsigned int blk_queue_depth(struct 
request_queue *q)
 extern int blk_register_queue(struct gendisk *disk);
 extern void blk_unregister_queue(struct gendisk *disk);
 void submit_bio_noacct(struct bio *bio);
+void resubmit_bio_noacct(struct bio *bio);
 
 extern int blk_lld_busy(struct request_queue *q);
 extern void blk_queue_split(struct bio **);
-- 
2.31.1




[dm-devel] [PATCH 0/2] block/dm: add resubmit_bio_noacct for fixing iops throttling

2022-01-09 Thread Ming Lei
Hello Guys,

Commit 4f1e9630afe6 ("blk-throtl: optimize IOPS throttle for large IO
scenarios") only fixes the iops throttle for blk-mq drivers. This patchset
adds the API resubmit_bio_noacct() so that we can use it to fix the same
issue in bio based drivers.

Meanwhile, fix the issue in device mapper via the added API; the issue
was reported by lining.

Ming Lei (2):
  block: add resubmit_bio_noacct()
  dm: use resubmit_bio_noacct to submit split bio

 block/blk-core.c   | 12 
 block/blk-merge.c  |  4 +---
 drivers/md/dm.c|  2 +-
 include/linux/blkdev.h |  1 +
 4 files changed, 15 insertions(+), 4 deletions(-)

Cc: lining 
Cc: Tejun Heo 
Cc: Chunguang Xu 
-- 
2.31.1




[dm-devel] [PATCH 2/2] dm: use resubmit_bio_noacct to submit split bio

2022-01-09 Thread Ming Lei
lining reported that the blk-throttle iops limit doesn't work correctly
for dm-thin. It turns out to be the same issue as the one addressed by commit
4f1e9630afe6 ("blk-throtl: optimize IOPS throttle for large IO scenarios").

So use the newly added block layer API to address the same issue.

Reported-by: lining 
Cc: Tejun Heo 
Cc: Chunguang Xu 
Signed-off-by: Ming Lei 
---
 drivers/md/dm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 280918cdcabd..8a58379e737c 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1562,7 +1562,7 @@ static void __split_and_process_bio(struct mapped_device 
*md,
 
bio_chain(b, bio);
trace_block_split(b, bio->bi_iter.bi_sector);
-   submit_bio_noacct(bio);
+   resubmit_bio_noacct(bio);
}
}
 
-- 
2.31.1




Re: [dm-devel] [PATCH 3/3] dm: mark dm queue as blocking if any underlying is blocking

2022-01-06 Thread Ming Lei
On Thu, Jan 06, 2022 at 10:40:51AM -0500, Mike Snitzer wrote:
> On Tue, Dec 21 2021 at  9:14P -0500,
> Ming Lei  wrote:
> 
> > dm request based driver doesn't set BLK_MQ_F_BLOCKING, so dm_queue_rq()
> > is supposed to not sleep.
> > 
> > However, blk_insert_cloned_request() is used by dm_queue_rq() for
> > queuing underlying request, but the underlying queue may be marked as
> > BLK_MQ_F_BLOCKING, so blk_insert_cloned_request() may become to block
> > current context, then rcu warning is triggered.
> > 
> > Fixes the issue by marking dm request based queue as BLK_MQ_F_BLOCKING
> > if any underlying queue is marked as BLK_MQ_F_BLOCKING, meantime we
> > need to allocate srcu beforehand.
> > 
> > Signed-off-by: Ming Lei 
> > ---
> >  drivers/md/dm-rq.c|  5 -
> >  drivers/md/dm-rq.h|  3 ++-
> >  drivers/md/dm-table.c | 14 ++
> >  drivers/md/dm.c   |  5 +++--
> >  drivers/md/dm.h   |  1 +
> >  5 files changed, 24 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> > index 579ab6183d4d..2297d37c62a9 100644
> > --- a/drivers/md/dm-rq.c
> > +++ b/drivers/md/dm-rq.c
> > @@ -535,7 +535,8 @@ static const struct blk_mq_ops dm_mq_ops = {
> > .init_request = dm_mq_init_request,
> >  };
> >  
> > -int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
> > +int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t,
> > +bool blocking)
> >  {
> > struct dm_target *immutable_tgt;
> > int err;
> > @@ -550,6 +551,8 @@ int dm_mq_init_request_queue(struct mapped_device *md, 
> > struct dm_table *t)
> > md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_STACKING;
> > md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
> > md->tag_set->driver_data = md;
> > +   if (blocking)
> > +   md->tag_set->flags |= BLK_MQ_F_BLOCKING;
> >  
> > md->tag_set->cmd_size = sizeof(struct dm_rq_target_io);
> > immutable_tgt = dm_table_get_immutable_target(t);
> 
> As you can see, dm_table_get_immutable_target(t) is called here ^
> 
> Rather than pass 'blocking' in, please just call dm_table_has_blocking_dev(t);
> 
> But not a big deal, I can clean that up once this gets committed...
> 
> > diff --git a/drivers/md/dm-rq.h b/drivers/md/dm-rq.h
> > index 1eea0da641db..5f3729f277d7 100644
> > --- a/drivers/md/dm-rq.h
> > +++ b/drivers/md/dm-rq.h
> > @@ -30,7 +30,8 @@ struct dm_rq_clone_bio_info {
> > struct bio clone;
> >  };
> >  
> > -int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t);
> > +int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t,
> > +bool blocking);
> >  void dm_mq_cleanup_mapped_device(struct mapped_device *md);
> >  
> >  void dm_start_queue(struct request_queue *q);
> > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> > index aa173f5bdc3d..e4bdd4f757a3 100644
> > --- a/drivers/md/dm-table.c
> > +++ b/drivers/md/dm-table.c
> > @@ -1875,6 +1875,20 @@ static bool dm_table_supports_write_zeroes(struct 
> > dm_table *t)
> > return true;
> >  }
> >  
> > +/* If the device can block inside ->queue_rq */
> > +static int device_is_io_blocking(struct dm_target *ti, struct dm_dev *dev,
> > + sector_t start, sector_t len, void *data)
> > +{
> > +   struct request_queue *q = bdev_get_queue(dev->bdev);
> > +
> > +   return blk_queue_blocking(q);
> > +}
> > +
> > +bool dm_table_has_blocking_dev(struct dm_table *t)
> > +{
> > +   return dm_table_any_dev_attr(t, device_is_io_blocking, NULL);
> > +}
> > +
> >  static int device_not_nowait_capable(struct dm_target *ti, struct dm_dev 
> > *dev,
> >  sector_t start, sector_t len, void *data)
> >  {
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index 280918cdcabd..2f72877752dd 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -1761,7 +1761,7 @@ static struct mapped_device *alloc_dev(int minor)
> >  * established. If request-based table is loaded: blk-mq will
> >  * override accordingly.
> >  */
> > -   md->disk = blk_alloc_disk(md->numa_node_id);
> > +   md->disk = blk_alloc_disk_srcu(md->numa_node_id);
> > if (!md->disk)
> > goto bad;
md->queue = md->disk->queue;

Re: [dm-devel] [PATCH 0/3] blk-mq/dm-rq: support BLK_MQ_F_BLOCKING for dm-rq

2021-12-22 Thread Ming Lei
On Tue, Dec 21, 2021 at 08:21:39AM -0800, Christoph Hellwig wrote:
> On Tue, Dec 21, 2021 at 10:14:56PM +0800, Ming Lei wrote:
> > Hello,
> > 
> > dm-rq may be built on blk-mq device which marks BLK_MQ_F_BLOCKING, so
> > dm_mq_queue_rq() may become to sleep current context.
> > 
> > Fixes the issue by allowing dm-rq to set BLK_MQ_F_BLOCKING in case that
> > any underlying queue is marked as BLK_MQ_F_BLOCKING.
> > 
> > DM request queue is allocated before allocating tagset, this way is a
> > bit special, so we need to pre-allocate srcu payload, then use the queue
> > flag of QUEUE_FLAG_BLOCKING for locking dispatch.
> 
> What is the benefit over just forcing bio-based dm-mpath for these
> devices?

At least an IO scheduler can't be used with bio based dm-mpath; there are
likely other drawbacks of bio based mpath too, and request based mpath is
often the default option. Maybe Mike has more input on bio vs request dm-mpath.



Thanks,
Ming




[dm-devel] [PATCH 3/3] dm: mark dm queue as blocking if any underlying is blocking

2021-12-21 Thread Ming Lei
The dm request based driver doesn't set BLK_MQ_F_BLOCKING, so dm_queue_rq()
is not supposed to sleep.

However, blk_insert_cloned_request() is used by dm_queue_rq() for
queuing the underlying request, and the underlying queue may be marked as
BLK_MQ_F_BLOCKING, so blk_insert_cloned_request() may block the
current context, triggering an RCU warning.

Fix the issue by marking the dm request based queue as BLK_MQ_F_BLOCKING
if any underlying queue is marked as BLK_MQ_F_BLOCKING; meanwhile we
need to allocate the SRCU instance beforehand.

Signed-off-by: Ming Lei 
---
 drivers/md/dm-rq.c|  5 -
 drivers/md/dm-rq.h|  3 ++-
 drivers/md/dm-table.c | 14 ++
 drivers/md/dm.c   |  5 +++--
 drivers/md/dm.h   |  1 +
 5 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 579ab6183d4d..2297d37c62a9 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -535,7 +535,8 @@ static const struct blk_mq_ops dm_mq_ops = {
.init_request = dm_mq_init_request,
 };
 
-int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
+int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t,
+bool blocking)
 {
struct dm_target *immutable_tgt;
int err;
@@ -550,6 +551,8 @@ int dm_mq_init_request_queue(struct mapped_device *md, 
struct dm_table *t)
md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_STACKING;
md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
md->tag_set->driver_data = md;
+   if (blocking)
+   md->tag_set->flags |= BLK_MQ_F_BLOCKING;
 
md->tag_set->cmd_size = sizeof(struct dm_rq_target_io);
immutable_tgt = dm_table_get_immutable_target(t);
diff --git a/drivers/md/dm-rq.h b/drivers/md/dm-rq.h
index 1eea0da641db..5f3729f277d7 100644
--- a/drivers/md/dm-rq.h
+++ b/drivers/md/dm-rq.h
@@ -30,7 +30,8 @@ struct dm_rq_clone_bio_info {
struct bio clone;
 };
 
-int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t);
+int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t,
+bool blocking);
 void dm_mq_cleanup_mapped_device(struct mapped_device *md);
 
 void dm_start_queue(struct request_queue *q);
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index aa173f5bdc3d..e4bdd4f757a3 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1875,6 +1875,20 @@ static bool dm_table_supports_write_zeroes(struct 
dm_table *t)
return true;
 }
 
+/* If the device can block inside ->queue_rq */
+static int device_is_io_blocking(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *data)
+{
+   struct request_queue *q = bdev_get_queue(dev->bdev);
+
+   return blk_queue_blocking(q);
+}
+
+bool dm_table_has_blocking_dev(struct dm_table *t)
+{
+   return dm_table_any_dev_attr(t, device_is_io_blocking, NULL);
+}
+
 static int device_not_nowait_capable(struct dm_target *ti, struct dm_dev *dev,
 sector_t start, sector_t len, void *data)
 {
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 280918cdcabd..2f72877752dd 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1761,7 +1761,7 @@ static struct mapped_device *alloc_dev(int minor)
 * established. If request-based table is loaded: blk-mq will
 * override accordingly.
 */
-   md->disk = blk_alloc_disk(md->numa_node_id);
+   md->disk = blk_alloc_disk_srcu(md->numa_node_id);
if (!md->disk)
goto bad;
md->queue = md->disk->queue;
@@ -2046,7 +2046,8 @@ int dm_setup_md_queue(struct mapped_device *md, struct 
dm_table *t)
switch (type) {
case DM_TYPE_REQUEST_BASED:
md->disk->fops = &dm_rq_blk_dops;
-   r = dm_mq_init_request_queue(md, t);
+   r = dm_mq_init_request_queue(md, t,
+   dm_table_has_blocking_dev(t));
if (r) {
DMERR("Cannot initialize queue for request-based dm 
mapped device");
return r;
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 742d9c80efe1..f7f92b272cce 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -60,6 +60,7 @@ int dm_calculate_queue_limits(struct dm_table *table,
  struct queue_limits *limits);
 int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
  struct queue_limits *limits);
+bool dm_table_has_blocking_dev(struct dm_table *t);
 struct list_head *dm_table_get_devices(struct dm_table *t);
 void dm_table_presuspend_targets(struct dm_table *t);
 void dm_table_presuspend_undo_targets(struct dm_table *t);
-- 
2.31.1




[dm-devel] [PATCH 2/3] block: add blk_alloc_disk_srcu

2021-12-21 Thread Ming Lei
Add blk_alloc_disk_srcu() so that we can allocate srcu inside request queue
for supporting blocking ->queue_rq().

dm-rq needs this API.

Signed-off-by: Ming Lei 
---
 block/genhd.c |  5 +++--
 include/linux/genhd.h | 12 
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 3c139a1b6f04..d21786fbb7bb 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1333,12 +1333,13 @@ struct gendisk *__alloc_disk_node(struct request_queue 
*q, int node_id,
 }
 EXPORT_SYMBOL(__alloc_disk_node);
 
-struct gendisk *__blk_alloc_disk(int node, struct lock_class_key *lkclass)
+struct gendisk *__blk_alloc_disk(int node, bool alloc_srcu,
+   struct lock_class_key *lkclass)
 {
struct request_queue *q;
struct gendisk *disk;
 
-   q = blk_alloc_queue(node, false);
+   q = blk_alloc_queue(node, alloc_srcu);
if (!q)
return NULL;
 
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 6906a45bc761..20259340b962 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -227,23 +227,27 @@ void blk_drop_partitions(struct gendisk *disk);
 struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
struct lock_class_key *lkclass);
 extern void put_disk(struct gendisk *disk);
-struct gendisk *__blk_alloc_disk(int node, struct lock_class_key *lkclass);
+struct gendisk *__blk_alloc_disk(int node, bool alloc_srcu,
+   struct lock_class_key *lkclass);
 
 /**
- * blk_alloc_disk - allocate a gendisk structure
+ * __alloc_disk - allocate a gendisk structure
  * @node_id: numa node to allocate on
+ * @alloc_srcu: allocate srcu instance for supporting blocking ->queue_rq
  *
  * Allocate and pre-initialize a gendisk structure for use with BIO based
  * drivers.
  *
  * Context: can sleep
  */
-#define blk_alloc_disk(node_id)
\
+#define __alloc_disk(node_id, alloc_srcu)  \
 ({ \
static struct lock_class_key __key; \
\
-   __blk_alloc_disk(node_id, &__key);  \
+   __blk_alloc_disk(node_id, alloc_srcu, &__key);  \
 })
+#define blk_alloc_disk(node_id) __alloc_disk(node_id, false)
+#define blk_alloc_disk_srcu(node_id) __alloc_disk(node_id, true)
 void blk_cleanup_disk(struct gendisk *disk);
 
 int __register_blkdev(unsigned int major, const char *name,
-- 
2.31.1




[dm-devel] [PATCH 1/3] block: split having srcu from queue blocking

2021-12-21 Thread Ming Lei
Now we reuse the queue flag QUEUE_FLAG_HAS_SRCU for both having SRCU and
BLK_MQ_F_BLOCKING. They are actually two things: one is that SRCU is
allocated inside the queue, the other is that we need to handle a blocking
->queue_rq. So far this works as expected.

dm-rq needs to set BLK_MQ_F_BLOCKING if any underlying queue is
marked as BLK_MQ_F_BLOCKING. But the dm queue is allocated before the
tagset, so one doable way is to always allocate SRCU for the dm
queue, then set BLK_MQ_F_BLOCKING for the tagset if required, and
meanwhile mark the request queue as supporting a blocking
->queue_rq.

So add one new flag of QUEUE_FLAG_BLOCKING for supporting blocking
->queue_rq only, and use one private field to describe if request
queue has allocated srcu instance.

Signed-off-by: Ming Lei 
---
 block/blk-core.c   | 2 +-
 block/blk-mq.c | 6 +++---
 block/blk-mq.h | 2 +-
 block/blk-sysfs.c  | 2 +-
 include/linux/blkdev.h | 5 +++--
 5 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 10619fd83c1b..7ba806a4e779 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -449,7 +449,7 @@ struct request_queue *blk_alloc_queue(int node_id, bool 
alloc_srcu)
return NULL;
 
if (alloc_srcu) {
-   blk_queue_flag_set(QUEUE_FLAG_HAS_SRCU, q);
+   q->has_srcu = true;
if (init_srcu_struct(q->srcu) != 0)
goto fail_q;
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0d7c9d3e0329..1408a6b8ccdc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -259,7 +259,7 @@ EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
  */
 void blk_mq_wait_quiesce_done(struct request_queue *q)
 {
-   if (blk_queue_has_srcu(q))
+   if (blk_queue_blocking(q))
synchronize_srcu(q->srcu);
else
synchronize_rcu();
@@ -4024,8 +4024,8 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set 
*set,
 int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
struct request_queue *q)
 {
-   WARN_ON_ONCE(blk_queue_has_srcu(q) !=
-   !!(set->flags & BLK_MQ_F_BLOCKING));
+   if (set->flags & BLK_MQ_F_BLOCKING)
+   blk_queue_flag_set(QUEUE_FLAG_BLOCKING, q);
 
/* mark the queue as mq asap */
q->mq_ops = set->ops;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 948791ea2a3e..9601918e2034 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -377,7 +377,7 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx 
*hctx,
 /* run the code block in @dispatch_ops with rcu/srcu read lock held */
 #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops)\
 do {   \
-   if (!blk_queue_has_srcu(q)) {   \
+   if (!blk_queue_blocking(q)) {   \
rcu_read_lock();\
(dispatch_ops); \
rcu_read_unlock();  \
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index e20eadfcf5c8..af89fabb58e3 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -736,7 +736,7 @@ static void blk_free_queue_rcu(struct rcu_head *rcu_head)
struct request_queue *q = container_of(rcu_head, struct request_queue,
   rcu_head);
 
-   kmem_cache_free(blk_get_queue_kmem_cache(blk_queue_has_srcu(q)), q);
+   kmem_cache_free(blk_get_queue_kmem_cache(q->has_srcu), q);
 }
 
 /* Unconfigure the I/O scheduler and dissociate from the cgroup controller. */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c80cfaefc0a8..d84abdb294c4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -365,6 +365,7 @@ struct request_queue {
 #endif
 
bool mq_sysfs_init_done;
+   bool has_srcu;
 
 #define BLK_MAX_WRITE_HINTS5
u64 write_hints[BLK_MAX_WRITE_HINTS];
@@ -385,7 +386,7 @@ struct request_queue {
 /* Keep blk_queue_flag_name[] in sync with the definitions below */
 #define QUEUE_FLAG_STOPPED 0   /* queue is stopped */
 #define QUEUE_FLAG_DYING   1   /* queue being torn down */
-#define QUEUE_FLAG_HAS_SRCU2   /* SRCU is allocated */
+#define QUEUE_FLAG_BLOCKING2   /* ->queue_rq may block */
 #define QUEUE_FLAG_NOMERGES 3  /* disable merge attempts */
 #define QUEUE_FLAG_SAME_COMP   4   /* complete on same CPU-group */
 #define QUEUE_FLAG_FAIL_IO 5   /* fake timeout */
@@ -423,7 +424,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct 
request_queue *q);
 
 #define blk_queue_stopped(q)   test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
#define blk_queue_dying(q) test_bit(QUEUE_FLAG_DYING, &(q)->queue_flags)

[dm-devel] [PATCH 0/3] blk-mq/dm-rq: support BLK_MQ_F_BLOCKING for dm-rq

2021-12-21 Thread Ming Lei
Hello,

dm-rq may be built on a blk-mq device which sets BLK_MQ_F_BLOCKING, so
dm_mq_queue_rq() may end up sleeping in the current context.

Fix the issue by allowing dm-rq to set BLK_MQ_F_BLOCKING in case
any underlying queue is marked as BLK_MQ_F_BLOCKING.

The DM request queue is allocated before the tagset, which is a bit
special, so we need to pre-allocate the SRCU payload, then use the queue
flag QUEUE_FLAG_BLOCKING for locking dispatch.
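
The dispatch locking mentioned above follows the usual RCU vs SRCU split,
roughly like this (a sketch based on __blk_mq_run_dispatch_ops() from
block/blk-mq.h, with the SRCU branch reconstructed from memory, so details
may differ):

#define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops)	\
do {								\
	if (!blk_queue_blocking(q)) {				\
		rcu_read_lock();				\
		(dispatch_ops);					\
		rcu_read_unlock();				\
	} else {						\
		int srcu_idx;					\
								\
		might_sleep_if(check_sleep);			\
		srcu_idx = srcu_read_lock((q)->srcu);		\
		(dispatch_ops);					\
		srcu_read_unlock((q)->srcu, srcu_idx);		\
	}							\
} while (0)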


Ming Lei (3):
  block: split having srcu from queue blocking
  block: add blk_alloc_disk_srcu
  dm: mark dm queue as blocking if any underlying is blocking

 block/blk-core.c   |  2 +-
 block/blk-mq.c |  6 +++---
 block/blk-mq.h |  2 +-
 block/blk-sysfs.c  |  2 +-
 block/genhd.c  |  5 +++--
 drivers/md/dm-rq.c |  5 -
 drivers/md/dm-rq.h |  3 ++-
 drivers/md/dm-table.c  | 14 ++
 drivers/md/dm.c|  5 +++--
 drivers/md/dm.h|  1 +
 include/linux/blkdev.h |  5 +++--
 include/linux/genhd.h  | 12 
 12 files changed, 44 insertions(+), 18 deletions(-)

-- 
2.31.1




Re: [dm-devel] Random high CPU utilization in blk-mq with the none scheduler

2021-12-13 Thread Ming Lei
On Tue, Dec 14, 2021 at 12:31:23AM +, Dexuan Cui wrote:
> > From: Ming Lei 
> > Sent: Sunday, December 12, 2021 11:38 PM
> 
> Ming, thanks so much for the detailed analysis!
> 
> > From the log:
> > 
> > 1) dm-mpath:
> > - queue depth: 2048
> > - busy: 848, and 62 of them are in sw queue, so run queue is often
> >   caused
> > - nr_hw_queues: 1
> > - dm-2 is in use, and dm-1/dm-3 is idle
> > - dm-2's dispatch busy is 8, that should be the reason why excessive CPU
> > usage is observed when flushing plug list without commit dc5fc361d891 in
> > which hctx->dispatch_busy is just bypassed
> > 
> > 2) iscsi
> > - dispatch_busy is 0
> > - nr_hw_queues: 1
> > - queue depth: 113
> > - busy=~33, active_queues is 3, so each LUN/iscsi host is saturated
> > - 23 active LUNs, 23 * 33 = 759 in-flight commands
> > 
> > The high CPU utilization may be caused by:
> > 
> > 1) big queue depth of dm mpath, the situation may be improved much if it
> > is reduced to 1024 or 800. The max allowed inflight commands from iscsi
> > hosts can be figured out, if dm's queue depth is much more than this number,
> > the extra commands need to dispatch, and run queue can be scheduled
> > immediately, so high CPU utilization is caused.
> 
> I think you're correct:
> with dm_mod.dm_mq_queue_depth=256, the max CPU utilization is 8%.
> with dm_mod.dm_mq_queue_depth=400, the max CPU utilization is 12%. 
> with dm_mod.dm_mq_queue_depth=800, the max CPU utilization is 88%.
> 
> The performance with queue_depth=800 is poor.
> The performance with queue_depth=400 is good.
> The performance with queue_depth=256 is also good, and there is only a 
> small drop comared with the 400 case.

That should be the reason why the issue isn't triggered with a real
I/O scheduler.

So far blk-mq doesn't provide a way to adjust the tags queue depth
dynamically.

But I don't understand the reason for the default dm_mq_queue_depth (2048):
in this situation each LUN can queue at most 113/3 (about 37) requests, and
3 LUNs are attached to a single iscsi host.

Mike, can you share why the default dm_mq_queue_depth is so big? It also
seems not to consider the underlying queue's depth. What is the biggest
dm-rq queue depth needed to saturate all underlying paths?

> 
> > 2) single hw queue, so contention should be big, which should be avoided
> > in big machine, nvme-tcp might be better than iscsi here
> > 
> > 3) iscsi io latency is a bit big
> > 
> > Even CPU utilization is reduced by commit dc5fc361d891, io performance
> > can't be good too with v5.16-rc, I guess.
> > 
> > Thanks,
> > Ming
> 
> Actually the I/O performance of v5.16-rc4 (commit dc5fc361d891 is included)
> is good -- it's about the same as the case where v5.16-rc4 + reverting
> dc5fc361d891 + dm_mod.dm_mq_queue_depth=400 (or 256).

The single hw queue may be the root cause of your issue: there is only a
single run_work, which can be touched by almost all CPUs (~200), so cache
ping-pong could be very serious.

Jens' patch may improve it more or less; please test it.

Thanks,
Ming




Re: [dm-devel] [PATCH 2/3] scsi: make sure that request queue quiesce and unquiesce balanced

2021-11-02 Thread Ming Lei
Hi James,

On Mon, Nov 01, 2021 at 09:43:27PM -0400, James Bottomley wrote:
> On Thu, 2021-10-21 at 22:59 +0800, Ming Lei wrote:
> > For fixing queue quiesce race between driver and block layer(elevator
> > switch, update nr_requests, ...), we need to support concurrent
> > quiesce
> > and unquiesce, which requires the two call balanced.
> > 
> > It isn't easy to audit that in all scsi drivers, especially the two
> > may
> > be called from different contexts, so do it in scsi core with one
> > per-device
> > bit flag & global spinlock, basically zero cost since request queue
> > quiesce
> > is seldom triggered.
> > 
> > Reported-by: Yi Zhang 
> > Fixes: e70feb8b3e68 ("blk-mq: support concurrent queue
> > quiesce/unquiesce")
> > Signed-off-by: Ming Lei 
> > ---
> >  drivers/scsi/scsi_lib.c| 45 ++
> > 
> >  include/scsi/scsi_device.h |  1 +
> >  2 files changed, 37 insertions(+), 9 deletions(-)
> > 
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index 51fcd46be265..414f4daf8005 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -2638,6 +2638,40 @@ static int
> > __scsi_internal_device_block_nowait(struct scsi_device *sdev)
> > return 0;
> >  }
> >  
> > +static DEFINE_SPINLOCK(sdev_queue_stop_lock);
> > +
> > +void scsi_start_queue(struct scsi_device *sdev)
> > +{
> > +   bool need_start;
> > +   unsigned long flags;
> > +
> > +   spin_lock_irqsave(&sdev_queue_stop_lock, flags);
> > +   need_start = sdev->queue_stopped;
> > +   sdev->queue_stopped = 0;
> > +   spin_unlock_irqrestore(&sdev_queue_stop_lock, flags);
> > +
> > +   if (need_start)
> > +   blk_mq_unquiesce_queue(sdev->request_queue);
> 
> Well, this is a classic atomic pattern:
> 
> if (cmpxchg(&sdev->queue_stopped, 1, 0))
>   blk_mq_unquiesce_queue(sdev->request_queue);
> 
> The reason to do it with atomics rather than spinlocks is
> 
>1. no need to disable interrupts: atomics are locked
>2. faster because a spinlock takes an exclusive line every time but the
>   read to check the value can be in shared mode in cmpxchg
>3. it's just shorter and better code.

You are right, I agree.

> 
> The only minor downside is queue_stopped now needs to be a u32.

Yeah, that is the reason I didn't take the atomic approach: it needs to
add one extra u32 to 'struct scsi_device'.
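
For reference, a sketch of what the cmpxchg variant would look like,
assuming queue_stopped were widened to a u32 in struct scsi_device:

void scsi_start_queue(struct scsi_device *sdev)
{
	/* cmpxchg() returns the old value: 1 means we did the 1 -> 0 flip */
	if (cmpxchg(&sdev->queue_stopped, 1, 0) == 1)
		blk_mq_unquiesce_queue(sdev->request_queue);
}

static void scsi_stop_queue(struct scsi_device *sdev, bool nowait)
{
	/* only the task that flips 0 -> 1 gets to quiesce the queue */
	if (cmpxchg(&sdev->queue_stopped, 0, 1) != 0)
		return;

	if (nowait)
		blk_mq_quiesce_queue_nowait(sdev->request_queue);
	else
		blk_mq_quiesce_queue(sdev->request_queue);
}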


Thanks,
Ming




[dm-devel] [PATCH 2/3] scsi: make sure that request queue quiesce and unquiesce balanced

2021-10-21 Thread Ming Lei
For fixing the queue quiesce race between driver and block layer (elevator
switch, nr_requests update, ...), we need to support concurrent quiesce
and unquiesce, which requires the two calls to be balanced.

It isn't easy to audit that in all scsi drivers, especially since the two may
be called from different contexts, so do it in scsi core with one per-device
bit flag and a global spinlock, at basically zero cost since request queue
quiesce is seldom triggered.

Reported-by: Yi Zhang 
Fixes: e70feb8b3e68 ("blk-mq: support concurrent queue quiesce/unquiesce")
Signed-off-by: Ming Lei 
---
 drivers/scsi/scsi_lib.c| 45 ++
 include/scsi/scsi_device.h |  1 +
 2 files changed, 37 insertions(+), 9 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 51fcd46be265..414f4daf8005 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2638,6 +2638,40 @@ static int __scsi_internal_device_block_nowait(struct 
scsi_device *sdev)
return 0;
 }
 
+static DEFINE_SPINLOCK(sdev_queue_stop_lock);
+
+void scsi_start_queue(struct scsi_device *sdev)
+{
+   bool need_start;
+   unsigned long flags;
+
+   spin_lock_irqsave(&sdev_queue_stop_lock, flags);
+   need_start = sdev->queue_stopped;
+   sdev->queue_stopped = 0;
+   spin_unlock_irqrestore(&sdev_queue_stop_lock, flags);
+
+   if (need_start)
+   blk_mq_unquiesce_queue(sdev->request_queue);
+}
+
+static void scsi_stop_queue(struct scsi_device *sdev, bool nowait)
+{
+   bool need_stop;
+   unsigned long flags;
+
+   spin_lock_irqsave(&sdev_queue_stop_lock, flags);
+   need_stop = !sdev->queue_stopped;
+   sdev->queue_stopped = 1;
+   spin_unlock_irqrestore(&sdev_queue_stop_lock, flags);
+
+   if (need_stop) {
+   if (nowait)
+   blk_mq_quiesce_queue_nowait(sdev->request_queue);
+   else
+   blk_mq_quiesce_queue(sdev->request_queue);
+   }
+}
+
 /**
  * scsi_internal_device_block_nowait - try to transition to the SDEV_BLOCK 
state
  * @sdev: device to block
@@ -2662,7 +2696,7 @@ int scsi_internal_device_block_nowait(struct scsi_device 
*sdev)
 * request queue.
 */
if (!ret)
-   blk_mq_quiesce_queue_nowait(sdev->request_queue);
+   scsi_stop_queue(sdev, true);
return ret;
 }
 EXPORT_SYMBOL_GPL(scsi_internal_device_block_nowait);
@@ -2689,19 +2723,12 @@ static int scsi_internal_device_block(struct 
scsi_device *sdev)
mutex_lock(&sdev->state_mutex);
err = __scsi_internal_device_block_nowait(sdev);
if (err == 0)
-   blk_mq_quiesce_queue(sdev->request_queue);
+   scsi_stop_queue(sdev, false);
mutex_unlock(&sdev->state_mutex);
 
return err;
 }
 
-void scsi_start_queue(struct scsi_device *sdev)
-{
-   struct request_queue *q = sdev->request_queue;
-
-   blk_mq_unquiesce_queue(q);
-}
-
 /**
  * scsi_internal_device_unblock_nowait - resume a device after a block request
  * @sdev:  device to resume
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 430b73bd02ac..ac74beccffa2 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -206,6 +206,7 @@ struct scsi_device {
unsigned rpm_autosuspend:1; /* Enable runtime autosuspend at device
 * creation time */
unsigned ignore_media_change:1; /* Ignore MEDIA CHANGE on resume */
+   unsigned queue_stopped:1;   /* request queue is quiesced */
 
bool offline_already;   /* Device offline message logged */
 
-- 
2.31.1




[dm-devel] [PATCH 3/3] dm: don't stop request queue after the dm device is suspended

2021-10-21 Thread Ming Lei
For fixing the queue quiesce race between driver and block layer (elevator
switch, nr_requests update, ...), we need to support concurrent quiesce
and unquiesce, which requires the two calls to be balanced.

__bind() is only called from dm_swap_table(), in which the dm device has
already been suspended, so it isn't necessary to stop the queue again. This
way, request queue quiesce and unquiesce stay balanced.
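
For context, the suspend path already stops a request-based queue. A
sketch of the relevant logic from __dm_suspend() in drivers/md/dm.c
(the helper name here is illustrative only):

/* __dm_suspend() already quiesces a request-based queue, so stopping
 * it again in __bind() would unbalance quiesce/unquiesce.
 */
static void dm_suspend_queue_if_request_based(struct mapped_device *md)
{
	if (dm_request_based(md))
		dm_stop_queue(md->queue);	/* blk_mq_quiesce_queue() */
}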

Reported-by: Yi Zhang 
Fixes: e70feb8b3e68 ("blk-mq: support concurrent queue quiesce/unquiesce")
Signed-off-by: Ming Lei 
---
 drivers/md/dm.c | 10 --
 1 file changed, 10 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 7870e6460633..727282d79b26 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1927,16 +1927,6 @@ static struct dm_table *__bind(struct mapped_device *md, 
struct dm_table *t,
 
dm_table_event_callback(t, event_callback, md);
 
-   /*
-* The queue hasn't been stopped yet, if the old table type wasn't
-* for request-based during suspension.  So stop it to prevent
-* I/O mapping before resume.
-* This must be done before setting the queue restrictions,
-* because request-based dm may be run just after the setting.
-*/
-   if (request_based)
-   dm_stop_queue(q);
-
if (request_based) {
/*
 * Leverage the fact that request-based DM targets are
-- 
2.31.1




[dm-devel] [PATCH 1/3] scsi: avoid to quiesce sdev->request_queue two times

2021-10-21 Thread Ming Lei
For fixing the queue quiesce race between driver and block layer (elevator
switch, nr_requests update, ...), we need to support concurrent quiesce
and unquiesce, which requires the two to be balanced.

blk_mq_quiesce_queue() calls blk_mq_quiesce_queue_nowait() to update the
quiesce depth and mark the flag, so scsi_internal_device_block() actually
ends up calling blk_mq_quiesce_queue_nowait() twice.

Fix the double quiesce and keep quiesce and unquiesce balanced.

Reported-by: Yi Zhang 
Fixes: e70feb8b3e68 ("blk-mq: support concurrent queue quiesce/unquiesce")
Signed-off-by: Ming Lei 
---
 drivers/scsi/scsi_lib.c | 29 ++---
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 30f7d0b4eb73..51fcd46be265 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2630,6 +2630,14 @@ scsi_target_resume(struct scsi_target *starget)
 }
 EXPORT_SYMBOL(scsi_target_resume);
 
+static int __scsi_internal_device_block_nowait(struct scsi_device *sdev)
+{
+   if (scsi_device_set_state(sdev, SDEV_BLOCK))
+   return scsi_device_set_state(sdev, SDEV_CREATED_BLOCK);
+
+   return 0;
+}
+
 /**
  * scsi_internal_device_block_nowait - try to transition to the SDEV_BLOCK 
state
  * @sdev: device to block
@@ -2646,24 +2654,16 @@ EXPORT_SYMBOL(scsi_target_resume);
  */
 int scsi_internal_device_block_nowait(struct scsi_device *sdev)
 {
-   struct request_queue *q = sdev->request_queue;
-   int err = 0;
-
-   err = scsi_device_set_state(sdev, SDEV_BLOCK);
-   if (err) {
-   err = scsi_device_set_state(sdev, SDEV_CREATED_BLOCK);
-
-   if (err)
-   return err;
-   }
+   int ret = __scsi_internal_device_block_nowait(sdev);
 
/*
 * The device has transitioned to SDEV_BLOCK.  Stop the
 * block layer from calling the midlayer with this device's
 * request queue.
 */
-   blk_mq_quiesce_queue_nowait(q);
-   return 0;
+   if (!ret)
+   blk_mq_quiesce_queue_nowait(sdev->request_queue);
+   return ret;
 }
 EXPORT_SYMBOL_GPL(scsi_internal_device_block_nowait);
 
@@ -2684,13 +2684,12 @@ EXPORT_SYMBOL_GPL(scsi_internal_device_block_nowait);
  */
 static int scsi_internal_device_block(struct scsi_device *sdev)
 {
-   struct request_queue *q = sdev->request_queue;
int err;
 
mutex_lock(&sdev->state_mutex);
-   err = scsi_internal_device_block_nowait(sdev);
+   err = __scsi_internal_device_block_nowait(sdev);
if (err == 0)
-   blk_mq_quiesce_queue(q);
+   blk_mq_quiesce_queue(sdev->request_queue);
mutex_unlock(&sdev->state_mutex);
 
return err;
-- 
2.31.1




[dm-devel] [PATCH 0/3] block: keep quiesce & unquiesce balanced for scsi/dm

2021-10-21 Thread Ming Lei
Hello Jens,

Recently we merged commit e70feb8b3e68 ("blk-mq: support concurrent queue
quiesce/unquiesce") for fixing the race between driver and block layer wrt.
queue quiesce.

Yi reported that srp/002 is broken with this patch; it turns out scsi and
dm don't actually keep the two balanced.

So fix dm and scsi and make srp/002 pass again.


Ming Lei (3):
  scsi: avoid to quiesce sdev->request_queue two times
  scsi: make sure that request queue quiesce and unquiesce balanced
  dm: don't stop request queue after the dm device is suspended

 drivers/md/dm.c| 10 --
 drivers/scsi/scsi_lib.c| 70 ++
 include/scsi/scsi_device.h |  1 +
 3 files changed, 49 insertions(+), 32 deletions(-)

-- 
2.31.1




Re: [dm-devel] dm-rq: don't queue request during suspend

2021-10-07 Thread Ming Lei
On Wed, Oct 06, 2021 at 10:15:58AM -0400, Mike Snitzer wrote:
> On Thu, Sep 23 2021 at  5:11P -0400,
> Ming Lei  wrote:
> 
> > DM uses blk-mq's quiesce/unquiesce to stop/start device mapper queue.
> > 
> > But blk-mq's unquiesce may come from outside events, such as elevator
> > switch, updating nr_requests or others, and request may come during
> > suspend, so simply ask for blk-mq to requeue it.
> > 
> > Fixes one kernel panic issue when running updating nr_requests and
> > dm-mpath suspend/resume stress test.
> > 
> > Signed-off-by: Ming Lei 
> > ---
> >  drivers/md/dm-rq.c | 8 
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> > index 5b95eea517d1..a896dea9750e 100644
> > --- a/drivers/md/dm-rq.c
> > +++ b/drivers/md/dm-rq.c
> > @@ -490,6 +490,14 @@ static blk_status_t dm_mq_queue_rq(struct 
> > blk_mq_hw_ctx *hctx,
> > struct mapped_device *md = tio->md;
> > struct dm_target *ti = md->immutable_target;
> >  
> > +   /*
> > +* blk-mq's unquiesce may come from outside events, such as
> > +* elevator switch, updating nr_requests or others, and request may
> > +* come during suspend, so simply ask for blk-mq to requeue it.
> > +*/
> > +   if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)))
> > +   return BLK_STS_RESOURCE;
> > +
> > if (unlikely(!ti)) {
> > int srcu_idx;
> > struct dm_table *map = dm_get_live_table(md, &srcu_idx);
> > -- 
> > 2.31.1
> > 
> 
> Hey Ming,
> 
> I've marked this for stable@ and queued this up.  BUT this test is
> racey, could easily be that device gets suspended just after your
> test.

Hello Mike,

I understand the device shouldn't be suspended after the test, given that
it is effectively just the following two tasks running back to back in the
test:

1) task1
- suspend device mapper
- resume device mapper

2) task2
- updating nr_requests of the device mapper

BTW, it is reported as RH BZ1891486, where it is easily reproduced;
however, it seems no suspended device is observed there.
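
For context, the nr_requests update path quiesces and then unquiesces the
queue unconditionally, which is how it can wake up a queue that dm has
quiesced for suspend. Roughly, simplified from blk_mq_update_nr_requests()
in block/blk-mq.c (details elided):

int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
{
	int ret = 0;

	blk_mq_freeze_queue(q);
	blk_mq_quiesce_queue(q);

	/* ... resize the per-hctx tag sets to nr ... */

	/* unconditional: may unquiesce a queue that dm suspended */
	blk_mq_unquiesce_queue(q);
	blk_mq_unfreeze_queue(q);
	return ret;
}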

thanks, 
Ming




[dm-devel] [PATCH] dm-rq: don't queue request during suspend

2021-09-23 Thread Ming Lei
DM uses blk-mq's quiesce/unquiesce to stop/start the device mapper queue.

But blk-mq's unquiesce may come from outside events, such as an elevator
switch, an nr_requests update or others, and a request may arrive during
suspend, so simply ask blk-mq to requeue it.

This fixes a kernel panic issue seen when running an nr_requests update
together with a dm-mpath suspend/resume stress test.

Signed-off-by: Ming Lei 
---
 drivers/md/dm-rq.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 5b95eea517d1..a896dea9750e 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -490,6 +490,14 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx 
*hctx,
struct mapped_device *md = tio->md;
struct dm_target *ti = md->immutable_target;
 
+   /*
+* blk-mq's unquiesce may come from outside events, such as
+* elevator switch, updating nr_requests or others, and request may
+* come during suspend, so simply ask for blk-mq to requeue it.
+*/
+   if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)))
+   return BLK_STS_RESOURCE;
+
if (unlikely(!ti)) {
int srcu_idx;
struct dm_table *map = dm_get_live_table(md, &srcu_idx);
-- 
2.31.1




Re: [dm-devel] [PATCH v3 2/8] scsi/sr: add error handling support for add_disk()

2021-09-06 Thread Ming Lei
On Mon, Aug 30, 2021 at 02:25:32PM -0700, Luis Chamberlain wrote:
> We never checked for errors on add_disk() as this function
> returned void. Now that this is fixed, use the shiny new
> error handling.
> 
> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Luis Chamberlain 
> ---
>  drivers/scsi/sr.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
> index 2942a4ec9bdd..72fd21844367 100644
> --- a/drivers/scsi/sr.c
> +++ b/drivers/scsi/sr.c
> @@ -779,7 +779,10 @@ static int sr_probe(struct device *dev)
>   dev_set_drvdata(dev, cd);
>   disk->flags |= GENHD_FL_REMOVABLE;
>   sr_revalidate_disk(cd);
> - device_add_disk(&sdev->sdev_gendev, disk, NULL);
> +
> + error = device_add_disk(&sdev->sdev_gendev, disk, NULL);
> + if (error)
> + goto fail_minor;

You don't undo register_cdrom(); maybe you can use kref_put(&cd->kref,
sr_kref_release) to simplify the error handling.
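
i.e. something along these lines in sr_probe() (a sketch only; the 'fail'
label is hypothetical, and it assumes sr_kref_release() already unwinds
register_cdrom() and puts the disk):

	error = device_add_disk(&sdev->sdev_gendev, disk, NULL);
	if (error) {
		/* drops the cdrom registration and the disk reference */
		kref_put(&cd->kref, sr_kref_release);
		goto fail;
	}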


Thanks,
Ming




Re: [dm-devel] [PATCH v3 1/8] scsi/sd: add error handling support for add_disk()

2021-09-06 Thread Ming Lei
On Mon, Aug 30, 2021 at 02:25:31PM -0700, Luis Chamberlain wrote:
> We never checked for errors on add_disk() as this function
> returned void. Now that this is fixed, use the shiny new
> error handling.
> 
> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Luis Chamberlain 
> ---
>  drivers/scsi/sd.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 610ebba0d66e..8c1273fff23e 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3487,7 +3487,11 @@ static int sd_probe(struct device *dev)
>   pm_runtime_set_autosuspend_delay(dev,
>   sdp->host->hostt->rpm_autosuspend_delay);
>   }
> - device_add_disk(dev, gd, NULL);
> +
> + error = device_add_disk(dev, gd, NULL);
> + if (error)
> + goto out_free_index;
> +

The error handling is actually wrong, see 


https://lore.kernel.org/linux-scsi/c93f3010-13c9-e07f-1458-b6b47a270...@acm.org/T/#t

Maybe you can base your change on that patch.


Thanks,
Ming




Re: [dm-devel] [RFC PATCH V2 3/3] dm: support bio polling

2021-06-30 Thread Ming Lei
On Mon, Jun 21, 2021 at 07:33:34PM +0800, JeffleXu wrote:
> 
> 
> On 6/18/21 10:39 PM, Ming Lei wrote:
> > From 47e523b9ee988317369eaadb96826323cd86819e Mon Sep 17 00:00:00 2001
> > From: Ming Lei 
> > Date: Wed, 16 Jun 2021 16:13:46 +0800
> > Subject: [RFC PATCH V3 3/3] dm: support bio polling
> > 
> > Support bio(REQ_POLLED) polling in the following approach:
> > 
> > 1) only support io polling on normal READ/WRITE, and other abnormal IOs
> > still fallback on IRQ mode, so the target io is exactly inside the dm
> > io.
> > 
> > 2) hold one refcnt on io->io_count after submitting this dm bio with
> > REQ_POLLED
> > 
> > 3) support dm native bio splitting, any dm io instance associated with
> > current bio will be added into one list which head is bio->bi_end_io
> > which will be recovered before ending this bio
> > 
> > 4) implement .poll_bio() callback, call bio_poll() on the single target
> > bio inside the dm io which is retrieved via bio->bi_bio_drv_data; call
> > dec_pending() after the target io is done in .poll_bio()
> > 
> > 4) enable QUEUE_FLAG_POLL if all underlying queues enable QUEUE_FLAG_POLL,
> > which is based on Jeffle's previous patch.
> > 
> > Signed-off-by: Ming Lei 
> > ---
> > V3:
> > - covers all comments from Jeffle
> > - fix corner cases when polling on abnormal ios
> > 
> ...
> 
> One bug and one performance issue, though I haven't investigated deep
> for both.
> 
> 
> kernel base: based on Jens' for-next, applying Christoph and Leiming's
> patchset.
> 
> 
> 1. One bug when there's DM device stack, e.g., dm-linear upon another
> dm-linear. Can be reproduced by following steps:
> 
> ```
> $ sudo dmsetup create tmpdev --table '0 2097152 linear /dev/nvme0n1 0'
> 
> $ cat tmp.table
> 0 2097152 linear /dev/mapper/tmpdev 0
> 2097152 2097152 linear /dev/nvme0n1 0
> 
> $ cat tmp.table | dmsetup create testdev
> 
> $ fio -name=test -ioengine=io_uring -iodepth=128 -numjobs=1 -thread
> -rw=randread -direct=1 -bs=4k -time_based -runtime=10 -cpus_allowed=6
> -filename=/dev/mapper/testdev -hipri=1
> ```
> 
> 
> BUG: unable to handle page fault for address: c01a6208
> #PF: supervisor write access in kernel mode
> #PF: error_code(0x0003) - permissions violation
> PGD 39740c067 P4D 39740c067 PUD 39740e067 PMD 1035db067 PTE 1ddf6f061
> Oops: 0003 [#1] SMP PTI
> CPU: 6 PID: 5899 Comm: fio Tainted: G S
> 5.13.0-0.1.git.81bcdc3.al7.x86_64 #1
> Hardware name: Inventec K900G3-10G/B900G3, BIOS A2.20 06/23/2017
> RIP: 0010:dm_submit_bio+0x171/0x3e0 [dm_mod]
> Code: 08 85 c0 0f 84 78 01 00 00 80 7c 24 2c 00 0f 84 b8 00 00 00 48 8b
> 53 38 48 8b 44 24 18 48 85 d2 48 8d 48 28 48 89 50 28 74 04 <48> 89 4a
> 08 48 89 4b 38 48 83 c3 38 48 89 58 30 41 f7 c5 fe ff ff
> RSP: 0018:9e5c45e1b9a0 EFLAGS: 00010286
> RAX: 8ab59fd50140 RBX: 8ab59fd50088 RCX: 8ab59fd50168
> RDX: c01a6200 RSI: 00052f08 RDI: 
> RBP: 8ab59fd501c8 R08:  R09: 
> R10: 9e5c45e1b950 R11: 0007 R12: 8ab4c2bc2000
> R13:  R14: 8ab4c2bc2548 R15: 8ab59fd50140
> FS:  7f555de42700() GS:8af33f18() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: c01a6208 CR3: 000124990005 CR4: 003706e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  submit_bio_noacct+0x144/0x3f0
>  ? submit_bio+0x42/0x120
>  submit_bio+0x42/0x120
>  blkdev_direct_IO+0x454/0x4b0
>  ? io_resubmit_prep+0x40/0x40
>  ? __fsnotify_parent+0xff/0x350
>  ? __fsnotify_parent+0x10f/0x350
>  ? generic_file_read_iter+0x83/0x150
>  generic_file_read_iter+0x83/0x150
>  blkdev_read_iter+0x41/0x50
>  io_read+0xe9/0x420
>  ? __cond_resched+0x16/0x40
>  ? __kmalloc_node+0x16e/0x4e0
>  ? memcg_alloc_page_obj_cgroups+0x32/0x90
>  ? io_issue_sqe+0x7e8/0x1260
>  io_issue_sqe+0x7e8/0x1260
>  ? io_submit_sqes+0x47b/0x1420
>  __io_queue_sqe+0x56/0x380
>  ? io_submit_sqes+0x120a/0x1420
>  io_submit_sqes+0x120a/0x1420
>  ? __x64_sys_io_uring_enter+0x1d2/0x3e0
>  __x64_sys_io_uring_enter+0x1d2/0x3e0
>  ? exit_to_user_mode_prepare+0x4c/0x210
>  do_syscall_64+0x36/0x70
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> RIP: 0033:0x7f55d3cb1b59
> Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89
> f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
> f0 ff ff 73 01 c3 48 8b 0d ff e2 2b 00 f7 d8 64 89 01 48
> 

[dm-devel] [PATCH V3 2/3] block: add ->poll_bio to block_device_operations

2021-06-23 Thread Ming Lei
Prepare for supporting IO polling for bio based drivers.

Add a ->poll_bio callback so that bio based drivers can provide their own
logic for polling a bio.

Signed-off-by: Ming Lei 
---
 block/blk-core.c   | 13 +
 block/genhd.c  |  2 ++
 include/linux/blkdev.h |  1 +
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1e24c71c6738..e585e549c291 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1113,7 +1113,8 @@ EXPORT_SYMBOL(submit_bio);
  */
 int bio_poll(struct bio *bio, unsigned int flags)
 {
-   struct request_queue *q = bio->bi_bdev->bd_disk->queue;
+   struct gendisk *disk = bio->bi_bdev->bd_disk;
+   struct request_queue *q = disk->queue;
blk_qc_t cookie = READ_ONCE(bio->bi_cookie);
int ret;
 
@@ -1125,10 +1126,14 @@ int bio_poll(struct bio *bio, unsigned int flags)
 
if (blk_queue_enter(q, BLK_MQ_REQ_NOWAIT))
return 0;
-   if (WARN_ON_ONCE(!queue_is_mq(q)))
-   ret = 0;/* not yet implemented, should not happen */
-   else
+
+   if (queue_is_mq(q))
ret = blk_mq_poll(q, cookie, flags);
+   else if (disk->fops->poll_bio)
+   ret = disk->fops->poll_bio(bio, flags);
+   else
+   ret = !WARN_ON_ONCE(1);
+
blk_queue_exit(q);
return ret;
 }
diff --git a/block/genhd.c b/block/genhd.c
index 5f5628216295..3dfb7d52e280 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -471,6 +471,8 @@ static void __device_add_disk(struct device *parent, struct 
gendisk *disk,
 {
int ret;
 
+   WARN_ON_ONCE(queue_is_mq(disk->queue) && disk->fops->poll_bio);
+
/*
 * The disk queue should now be all set with enough information about
 * the device for the elevator code to pick an adequate default
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index fc0ba0b80776..fc63155d2ac4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1858,6 +1858,7 @@ static inline void blk_ksm_unregister(struct 
request_queue *q) { }
 
 struct block_device_operations {
void (*submit_bio)(struct bio *bio);
+   int (*poll_bio)(struct bio *bio, unsigned int flags);
int (*open) (struct block_device *, fmode_t);
void (*release) (struct gendisk *, fmode_t);
int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned 
int);
-- 
2.31.1




[dm-devel] [PATCH V3 1/3] block: add helper of blk_queue_poll

2021-06-23 Thread Ming Lei
There are already 3 users, and there will be more, so add one such helper.

Reviewed-by: Chaitanya Kulkarni 
Reviewed-by: Jeffle Xu 
Reviewed-by: Hannes Reinecke 
Signed-off-by: Ming Lei 
---
 block/blk-core.c | 5 ++---
 block/blk-sysfs.c| 4 ++--
 drivers/nvme/host/core.c | 2 +-
 include/linux/blkdev.h   | 1 +
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 531176578221..1e24c71c6738 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -835,7 +835,7 @@ static noinline_for_stack bool submit_bio_checks(struct bio 
*bio)
}
}
 
-   if (!test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
+   if (!blk_queue_poll(q))
bio->bi_opf &= ~REQ_POLLED;
 
switch (bio_op(bio)) {
@@ -1117,8 +1117,7 @@ int bio_poll(struct bio *bio, unsigned int flags)
blk_qc_t cookie = READ_ONCE(bio->bi_cookie);
int ret;
 
-   if (cookie == BLK_QC_T_NONE ||
-   !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
+   if (cookie == BLK_QC_T_NONE || !blk_queue_poll(q))
return 0;
 
if (current->plug)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f78e73ca6091..93dcf2dfaafd 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -422,13 +422,13 @@ static ssize_t queue_poll_delay_store(struct 
request_queue *q, const char *page,
 
 static ssize_t queue_poll_show(struct request_queue *q, char *page)
 {
-   return queue_var_show(test_bit(QUEUE_FLAG_POLL, &q->queue_flags), page);
+   return queue_var_show(blk_queue_poll(q), page);
 }
 
 static ssize_t queue_poll_store(struct request_queue *q, const char *page,
size_t count)
 {
-   if (!test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
+   if (!blk_queue_poll(q))
return -EINVAL;
pr_info_ratelimited("writes to the poll attribute are ignored.\n");
pr_info_ratelimited("please use driver specific parameters instead.\n");
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fe0b8da3de7f..e31c7704ef4d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1025,7 +1025,7 @@ static void nvme_execute_rq_polled(struct request_queue 
*q,
 {
DECLARE_COMPLETION_ONSTACK(wait);
 
-   WARN_ON_ONCE(!test_bit(QUEUE_FLAG_POLL, &q->queue_flags));
+   WARN_ON_ONCE(!blk_queue_poll(q));
 
rq->cmd_flags |= REQ_POLLED;
rq->end_io_data = &wait;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 561b04117bd4..fc0ba0b80776 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -677,6 +677,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct 
request_queue *q);
 #define blk_queue_fua(q)   test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags)
 #define blk_queue_registered(q)test_bit(QUEUE_FLAG_REGISTERED, 
&(q)->queue_flags)
 #define blk_queue_nowait(q)test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
+#define blk_queue_poll(q)  test_bit(QUEUE_FLAG_POLL, &(q)->queue_flags)
 
 extern void blk_set_pm_only(struct request_queue *q);
 extern void blk_clear_pm_only(struct request_queue *q);
-- 
2.31.1




[dm-devel] [PATCH V3 3/3] dm: support bio polling

2021-06-23 Thread Ming Lei
Support bio (REQ_POLLED) polling with the following approach:

1) only support io polling on normal READ/WRITE, and other abnormal IOs
still fall back to IRQ mode, so the target io is exactly inside the dm
io.

2) hold one refcnt on io->io_count after submitting this dm bio with
REQ_POLLED

3) support dm native bio splitting: any dm io instance associated with
the current bio is added to one list whose head is stored in
bio->bi_end_io, which is restored before ending this bio

4) implement the .poll_bio() callback, call bio_poll() on the single target
bio inside the dm io which is retrieved via bio->bi_bio_drv_data; call
dec_pending() after the target io is done in .poll_bio()

5) enable QUEUE_FLAG_POLL if all underlying queues enable QUEUE_FLAG_POLL,
which is based on Jeffle's previous patch.

Signed-off-by: Ming Lei 
---
 drivers/md/dm-table.c |  24 
 drivers/md/dm.c   | 131 +-
 2 files changed, 152 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index ee47a332b462..b14b379442d2 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1491,6 +1491,12 @@ struct dm_target *dm_table_find_target(struct dm_table 
*t, sector_t sector)
return >targets[(KEYS_PER_NODE * n) + k];
 }
 
+static int device_not_poll_capable(struct dm_target *ti, struct dm_dev *dev,
+  sector_t start, sector_t len, void *data)
+{
+   return !blk_queue_poll(bdev_get_queue(dev->bdev));
+}
+
 /*
  * type->iterate_devices() should be called when the sanity check needs to
  * iterate and check all underlying data devices. iterate_devices() will
@@ -1541,6 +1547,11 @@ static int count_device(struct dm_target *ti, struct 
dm_dev *dev,
return 0;
 }
 
+static int dm_table_supports_poll(struct dm_table *t)
+{
+   return !dm_table_any_dev_attr(t, device_not_poll_capable, NULL);
+}
+
 /*
  * Check whether a table has no data devices attached using each
  * target's iterate_devices method.
@@ -2078,6 +2089,19 @@ void dm_table_set_restrictions(struct dm_table *t, 
struct request_queue *q,
 
dm_update_keyslot_manager(q, t);
blk_queue_update_readahead(q);
+
+   /*
+* Check for request-based device is remained to
+* dm_mq_init_request_queue()->blk_mq_init_allocated_queue().
+* For bio-based device, only set QUEUE_FLAG_POLL when all underlying
+* devices supporting polling.
+*/
+   if (__table_type_bio_based(t->type)) {
+   if (dm_table_supports_poll(t))
+   blk_queue_flag_set(QUEUE_FLAG_POLL, q);
+   else
+   blk_queue_flag_clear(QUEUE_FLAG_POLL, q);
+   }
 }
 
 unsigned int dm_table_get_num_targets(struct dm_table *t)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 363f12a285ce..cfc2e1915ec4 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -39,6 +39,8 @@
 #define DM_COOKIE_ENV_VAR_NAME "DM_COOKIE"
 #define DM_COOKIE_LENGTH 24
 
+#define REQ_SAVED_END_IO REQ_DRV
+
 static const char *_name = DM_NAME;
 
 static unsigned int major = 0;
@@ -72,6 +74,7 @@ struct clone_info {
struct dm_io *io;
sector_t sector;
unsigned sector_count;
+   boolsubmit_as_polled;
 };
 
 /*
@@ -99,6 +102,8 @@ struct dm_io {
blk_status_t status;
atomic_t io_count;
struct bio *orig_bio;
+   void*saved_bio_end_io;
+   struct hlist_node  node;
unsigned long start_time;
spinlock_t endio_lock;
struct dm_stats_aux stats_aux;
@@ -687,6 +692,8 @@ static struct dm_target_io *alloc_tio(struct clone_info 
*ci, struct dm_target *t
tio->ti = ti;
tio->target_bio_nr = target_bio_nr;
 
+   WARN_ON_ONCE(ci->submit_as_polled && !tio->inside_dm_io);
+
return tio;
 }
 
@@ -938,8 +945,14 @@ static void dec_pending(struct dm_io *io, blk_status_t 
error)
end_io_acct(io);
free_io(md, io);
 
-   if (io_error == BLK_STS_DM_REQUEUE)
+   if (io_error == BLK_STS_DM_REQUEUE) {
+   /*
+* Upper layer won't help us poll split bio, so
+* clear REQ_POLLED in case of requeue
+*/
+   bio->bi_opf &= ~REQ_POLLED;
return;
+   }
 
if ((bio->bi_opf & REQ_PREFLUSH) && bio->bi_iter.bi_size) {
/*
@@ -1366,6 +1379,9 @@ static int clone_bio(struct dm_target_io *tio, struct bio 
*bio,
 
__bio_clone_fast(clone, bio);
 
+   /* REQ_SAVED_END_IO shouldn't be inherited */
+   clone->bi_opf &= ~REQ_SAVED_END_IO;
+
r = bio_crypt_clone(clone, bio, GFP_NOIO);
if (r < 0)
return r;
@@ -1574,6 +1590,46 @@ static bool __process_abnormal_io(struct c

[dm-devel] [PATCH V3 0/3] block/dm: support bio polling

2021-06-23 Thread Ming Lei
Hello Guys,

Based on Christoph's bio based polling model[1], implement DM bio polling
with one very simple approach.

Patch 1 adds helper of blk_queue_poll().

Patch 2 adds .bio_poll() callback to block_device_operations, so bio
driver can implement its own logic for io polling.

Patch 3 implements bio polling for device mapper.


V3:
- patch style change as suggested by Christoph(2/3)
- fix kernel panic issue caused by nested dm polling, which is found
  & figured out by Jeffle Xu (3/3)
- re-organize setup polling code (3/3)
- remove RFC

V2:
- drop patch to add new fields into bio
- support io polling for dm native bio splitting
- add comment

Ming Lei (3):
  block: add helper of blk_queue_poll
  block: add ->poll_bio to block_device_operations
  dm: support bio polling

 block/blk-core.c |  18 +++---
 block/blk-sysfs.c|   4 +-
 block/genhd.c|   2 +
 drivers/md/dm-table.c|  24 +++
 drivers/md/dm.c  | 131 ++-
 drivers/nvme/host/core.c |   2 +-
 include/linux/blkdev.h   |   2 +
 7 files changed, 170 insertions(+), 13 deletions(-)

-- 
2.31.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [RFC PATCH V2 3/3] dm: support bio polling

2021-06-21 Thread Ming Lei
On Tue, Jun 22, 2021 at 10:26:15AM +0800, JeffleXu wrote:
> 
> 
> On 6/21/21 10:04 PM, Ming Lei wrote:
> > On Mon, Jun 21, 2021 at 07:33:34PM +0800, JeffleXu wrote:
> >>
> >>
> >> On 6/18/21 10:39 PM, Ming Lei wrote:
> >>> From 47e523b9ee988317369eaadb96826323cd86819e Mon Sep 17 00:00:00 2001
> >>> From: Ming Lei 
> >>> Date: Wed, 16 Jun 2021 16:13:46 +0800
> >>> Subject: [RFC PATCH V3 3/3] dm: support bio polling
> >>>
> >>> Support bio(REQ_POLLED) polling in the following approach:
> >>>
> >>> 1) only support io polling on normal READ/WRITE, and other abnormal IOs
> >>> still fallback on IRQ mode, so the target io is exactly inside the dm
> >>> io.
> >>>
> >>> 2) hold one refcnt on io->io_count after submitting this dm bio with
> >>> REQ_POLLED
> >>>
> >>> 3) support dm native bio splitting, any dm io instance associated with
> >>> current bio will be added into one list which head is bio->bi_end_io
> >>> which will be recovered before ending this bio
> >>>
> >>> 4) implement .poll_bio() callback, call bio_poll() on the single target
> >>> bio inside the dm io which is retrieved via bio->bi_bio_drv_data; call
> >>> dec_pending() after the target io is done in .poll_bio()
> >>>
> >>> 4) enable QUEUE_FLAG_POLL if all underlying queues enable QUEUE_FLAG_POLL,
> >>> which is based on Jeffle's previous patch.
> >>>
> >>> Signed-off-by: Ming Lei 
> >>> ---
> >>> V3:
> >>>   - covers all comments from Jeffle
> >>>   - fix corner cases when polling on abnormal ios
> >>>
> >> ...
> >>
> >> One bug and one performance issue, though I haven't investigated deep
> >> for both.
> >>
> >>
> >> kernel base: based on Jens' for-next, applying Christoph and Leiming's
> >> patchset.
> >>
> >>
> >> 1. One bug when there's DM device stack, e.g., dm-linear upon another
> >> dm-linear. Can be reproduced by following steps:
> >>
> >> ```
> >> $ sudo dmsetup create tmpdev --table '0 2097152 linear /dev/nvme0n1 0'
> >>
> >> $ cat tmp.table
> >> 0 2097152 linear /dev/mapper/tmpdev 0
> >> 2097152 2097152 linear /dev/nvme0n1 0
> >>
> >> $ cat tmp.table | dmsetup create testdev
> >>
> >> $ fio -name=test -ioengine=io_uring -iodepth=128 -numjobs=1 -thread
> >> -rw=randread -direct=1 -bs=4k -time_based -runtime=10 -cpus_allowed=6
> >> -filename=/dev/mapper/testdev -hipri=1
> >> ```
> >>
> >>
> >> BUG: unable to handle page fault for address: c01a6208
> >> #PF: supervisor write access in kernel mode
> >> #PF: error_code(0x0003) - permissions violation
> >> PGD 39740c067 P4D 39740c067 PUD 39740e067 PMD 1035db067 PTE 1ddf6f061
> >> Oops: 0003 [#1] SMP PTI
> >> CPU: 6 PID: 5899 Comm: fio Tainted: G S
> >> 5.13.0-0.1.git.81bcdc3.al7.x86_64 #1
> >> Hardware name: Inventec K900G3-10G/B900G3, BIOS A2.20 06/23/2017
> >> RIP: 0010:dm_submit_bio+0x171/0x3e0 [dm_mod]
> > 
> > It has been fixed in my local repo:
> > 
> > @@ -1608,6 +1649,7 @@ static void init_clone_info(struct clone_info *ci, 
> > struct mapped_device *md,
> > ci->map = map;
> > ci->io = alloc_io(md, bio);
> > ci->sector = bio->bi_iter.bi_sector;
> > +   ci->submit_as_polled = false;
> > 
> 
> It doesn't work in my test environment. Actually the following fix
> should be applied.
> 
> 
> @@ -1390,6 +1403,8 @@ static int clone_bio(struct dm_target_io *tio,
> struct bio *bio,
> if (bio_integrity(bio))
> bio_integrity_trim(clone);
> 
> +   clone->bi_opf &= ~REQ_SAVED_END_IO;
> +

This change is good, but it shouldn't fix the panic except for the
nested device map case. I will fold it into V3.

> return 0;
>  }
> 
> 
> The rationale is that REQ_SAVED_END_IO should be cleared once the bio
> *passes through* one device stack layer. Otherwise the cloned bio for
> the next layer will inherit the REQ_SAVED_END_IO flag, in which case
> 'cloned_bio->bi_end_io' (which actually acts as the hlist head) won't
> be initialized in dm_setup_polled_io(), and the kernel crashes when
> trying to insert into this hlist in __split_and_process_bio().

'cloned_bio' can't reach dm_submit_bio() if it isn't a DM bio.
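
For the record, the two fixes that came out of this thread can be
summarized in one sketch (assembled from the hunks quoted above;
surrounding code is elided, and details may differ from the posted V3):

```
/*
 * 1) init_clone_info() must initialize ->submit_as_polled explicitly,
 *    since the dm_io is not zeroed on allocation;
 * 2) clone_bio() must drop REQ_SAVED_END_IO, since the flag is only
 *    meaningful within one dm stacking layer.
 */
static void init_clone_info(struct clone_info *ci, struct mapped_device *md,
			    struct dm_table *map, struct bio *bio)
{
	ci->map = map;
	ci->io = alloc_io(md, bio);
	ci->sector = bio->bi_iter.bi_sector;
	ci->submit_as_polled = false;
}

/* in clone_bio(), right after __bio_clone_fast(clone, bio): */
clone->bi_opf &= ~REQ_SAVED_END_IO;
```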


Thanks,
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [RFC PATCH V2 3/3] dm: support bio polling

2021-06-21 Thread Ming Lei
On Mon, Jun 21, 2021 at 07:33:34PM +0800, JeffleXu wrote:
> 
> 
> On 6/18/21 10:39 PM, Ming Lei wrote:
> > From 47e523b9ee988317369eaadb96826323cd86819e Mon Sep 17 00:00:00 2001
> > From: Ming Lei 
> > Date: Wed, 16 Jun 2021 16:13:46 +0800
> > Subject: [RFC PATCH V3 3/3] dm: support bio polling
> > 
> > Support bio(REQ_POLLED) polling in the following approach:
> > 
> > 1) only support io polling on normal READ/WRITE, and other abnormal IOs
> > still fallback on IRQ mode, so the target io is exactly inside the dm
> > io.
> > 
> > 2) hold one refcnt on io->io_count after submitting this dm bio with
> > REQ_POLLED
> > 
> > 3) support dm native bio splitting, any dm io instance associated with
> > current bio will be added into one list which head is bio->bi_end_io
> > which will be recovered before ending this bio
> > 
> > 4) implement .poll_bio() callback, call bio_poll() on the single target
> > bio inside the dm io which is retrieved via bio->bi_bio_drv_data; call
> > dec_pending() after the target io is done in .poll_bio()
> > 
> > 4) enable QUEUE_FLAG_POLL if all underlying queues enable QUEUE_FLAG_POLL,
> > which is based on Jeffle's previous patch.
> > 
> > Signed-off-by: Ming Lei 
> > ---
> > V3:
> > - covers all comments from Jeffle
> > - fix corner cases when polling on abnormal ios
> > 
> ...
> 
> One bug and one performance issue, though I haven't investigated deep
> for both.
> 
> 
> kernel base: based on Jens' for-next, applying Christoph and Leiming's
> patchset.
> 
> 
> 1. One bug when there's DM device stack, e.g., dm-linear upon another
> dm-linear. Can be reproduced by following steps:
> 
> ```
> $ sudo dmsetup create tmpdev --table '0 2097152 linear /dev/nvme0n1 0'
> 
> $ cat tmp.table
> 0 2097152 linear /dev/mapper/tmpdev 0
> 2097152 2097152 linear /dev/nvme0n1 0
> 
> $ cat tmp.table | dmsetup create testdev
> 
> $ fio -name=test -ioengine=io_uring -iodepth=128 -numjobs=1 -thread
> -rw=randread -direct=1 -bs=4k -time_based -runtime=10 -cpus_allowed=6
> -filename=/dev/mapper/testdev -hipri=1
> ```
> 
> 
> BUG: unable to handle page fault for address: c01a6208
> #PF: supervisor write access in kernel mode
> #PF: error_code(0x0003) - permissions violation
> PGD 39740c067 P4D 39740c067 PUD 39740e067 PMD 1035db067 PTE 1ddf6f061
> Oops: 0003 [#1] SMP PTI
> CPU: 6 PID: 5899 Comm: fio Tainted: G S
> 5.13.0-0.1.git.81bcdc3.al7.x86_64 #1
> Hardware name: Inventec K900G3-10G/B900G3, BIOS A2.20 06/23/2017
> RIP: 0010:dm_submit_bio+0x171/0x3e0 [dm_mod]

It has been fixed in my local repo:

@@ -1608,6 +1649,7 @@ static void init_clone_info(struct clone_info *ci, struct 
mapped_device *md,
ci->map = map;
ci->io = alloc_io(md, bio);
ci->sector = bio->bi_iter.bi_sector;
+   ci->submit_as_polled = false;

> 
> 
> 2. Performance Issue
> 
> I test both on x86 (with only one NVMe) and aarch64 (with multiple NVMes).
> 
> The result (IOPS) on x86 is as expected:
> 
> Type      | IRQ  | Polling
> --------- | ---- | -------
> dm-linear | 239k | 357k
> 
> - dm-linear built upon one NVMe,bs=4k, iopoll=1, iodepth=128,
> numjobs=1, direct, randread, ioengine=io_uring

This data looks good.

> 
> 
> 
> While the result on aarch64 is a little confusing.
> 
> Type          | IRQ  | Polling
> ------------- | ---- | -------
> dm-linear [1] | 208k | 230k
> dm-linear [2] | 637k | 691k
> dm-stripe     | 310k | 354k
> 
> - dm-linear [1] built upon *one* NVMe,bs=4k, iopoll=1, iodepth=128,
> *numjobs=1*, direct, randread, ioengine=io_uring
> - dm-linear [2] built upon *three* NVMes,bs=4k, iopoll=1, iodepth=128,
> *numjobs=3*, direct, randread, ioengine=io_uring
> - dm-stripe built upon *three* NVMes,chunk_size=4k, bs=12k, iopoll=1,
> iodepth=128, numjobs=3, direct, randread, ioengine=io_uring
> 
> 
> Following is the corresponding test result of Leiming's last
> implementation for bio-based polling on aarch64.
>               IRQ    IOPOLL   ratio
> dm-linear [2] 639K   835K     ~30%
> dm-stripe     314K   408K     ~30%

The previous version polls one hw queue only once if bios are submitted
to the same hw queue. We might improve it in the future.


Thanks,
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

Re: [dm-devel] [RFC PATCH V2 3/3] dm: support bio polling

2021-06-21 Thread Ming Lei
On Mon, Jun 21, 2021 at 09:36:56AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 17, 2021 at 06:35:49PM +0800, Ming Lei wrote:
> > +   /*
> > +* Only support bio polling for normal IO, and the target io is
> > +* exactly inside the dm io instance
> > +*/
> > +   ci->io->submit_as_polled = !!(ci->bio->bi_opf & REQ_POLLED);
> 
> Nit: the !! is not needed.

OK.

> 
> > @@ -1608,6 +1625,22 @@ static void init_clone_info(struct clone_info *ci, 
> > struct mapped_device *md,
> > ci->map = map;
> > ci->io = alloc_io(md, bio);
> > ci->sector = bio->bi_iter.bi_sector;
> > +
> > +   if (bio->bi_opf & REQ_POLLED) {
> > +   INIT_HLIST_NODE(>io->node);
> > +
> > +   /*
> > +* Save .bi_end_io into dm_io, so that we can reuse .bi_end_io
> > +* for storing dm_io list
> > +*/
> > +   if (bio->bi_opf & REQ_SAVED_END_IO) {
> > +   ci->io->saved_bio_end_io = NULL;
> 
> So if it already was saved the list gets cleared here?  Can you explain
> this logic a little more?

Inside dm_poll_bio() we recognize a non-NULL ->saved_bio_end_io as
valid, so it has to be initialized here.

> 
> > +   } else {
> > +   ci->io->saved_bio_end_io = bio->bi_end_io;
> > +   INIT_HLIST_HEAD((struct hlist_head *)>bi_end_io);
> 
> I think you want to hide these casts in helpers that clearly document
> why this is safe rather than sprinkling the casts all over the code.
> I also wonder if there is any better way to structur this.

OK, I will add a helper of dm_get_bio_hlist_head() with comment.
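
A minimal sketch of what such a helper could look like (the name is
taken from the reply above; the BUILD_BUG_ON guard and the wording of
the comment are my assumptions, not the posted code):

```
/*
 * While a dm bio is submitted as polled, the storage of ->bi_end_io is
 * borrowed as an hlist_head linking all dm_io instances split from
 * this bio. This is only safe because the real ->bi_end_io has been
 * saved in the first dm_io (io->saved_bio_end_io) and is restored
 * before the bio is ended.
 */
static inline struct hlist_head *dm_get_bio_hlist_head(struct bio *bio)
{
	BUILD_BUG_ON(sizeof(struct hlist_head) != sizeof(bio->bi_end_io));

	return (struct hlist_head *)&bio->bi_end_io;
}
```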

> 
> > +static int dm_poll_bio(struct bio *bio, unsigned int flags)
> > +{
> > +   struct dm_io *io;
> > +   void *saved_bi_end_io = NULL;
> > +   struct hlist_head tmp = HLIST_HEAD_INIT;
> > +   struct hlist_head *head = (struct hlist_head *)>bi_end_io;
> > +   struct hlist_node *next;
> > +
> > +   /*
> > +* This bio can be submitted from FS as POLLED so that FS may keep
> > +* polling even though the flag is cleared by bio splitting or
> > +* requeue, so return immediately.
> > +*/
> > +   if (!(bio->bi_opf & REQ_POLLED))
> > +   return 0;
> 
> I can't really parse the comment, can you explain this a little more?
> But if we need this check, shouldn't it move to bio_poll()?

The upper layer keeps polling a bio it submitted with REQ_POLLED, but
the flag can be cleared by the driver or the block layer. Once it is
cleared, we should return immediately.

Yeah, we can move it to bio_poll().
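
Moving the check there might look like this (a sketch combining the
hunks from patch 2/3 with Christoph's suggested structure further down
in this thread; the current->plug handling is omitted for brevity):

```
int bio_poll(struct bio *bio, unsigned int flags)
{
	struct gendisk *disk = bio->bi_bdev->bd_disk;
	struct request_queue *q = disk->queue;
	blk_qc_t cookie = READ_ONCE(bio->bi_cookie);
	int ret = 0;

	/*
	 * The submitter may keep calling bio_poll() on a bio whose
	 * REQ_POLLED was cleared by splitting or requeue; bail out
	 * instead of calling into the driver.
	 */
	if (!(bio->bi_opf & REQ_POLLED))
		return 0;
	if (!blk_queue_poll(q))
		return 0;
	if (queue_is_mq(q) && cookie == BLK_QC_T_NONE)
		return 0;

	if (blk_queue_enter(q, BLK_MQ_REQ_NOWAIT))
		return 0;
	if (queue_is_mq(q))
		ret = blk_mq_poll(q, cookie, flags);
	else if (disk->fops->poll_bio)
		ret = disk->fops->poll_bio(bio, flags);
	else
		WARN_ON_ONCE(1);
	blk_queue_exit(q);
	return ret;
}
```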

> 
> > +   hlist_for_each_entry(io, , node) {
> > +   if (io->saved_bio_end_io && !saved_bi_end_io) {
> > +   saved_bi_end_io = io->saved_bio_end_io;
> > +   break;
> > +   }
> > +   }
> 
> So it seems like you don't use bi_cookie at all.  Why not turn
> bi_cookie into a union to stash the hlist_head and use that?

hlist_head is 'void *', but ->bi_cookie is 'unsigned int'.
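
A userspace illustration of the size mismatch (assuming a 64-bit
build; the type definitions below are simplified stand-ins, not the
kernel headers):

```
#include <stdint.h>

struct hlist_node;
struct hlist_head { struct hlist_node *first; }; /* pointer-sized: 8 bytes */
typedef uint32_t blk_qc_t;                       /* bi_cookie: 4 bytes */

_Static_assert(sizeof(struct hlist_head) > sizeof(blk_qc_t),
	       "an hlist_head cannot be stashed in bi_cookie");
```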


Thanks,
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [RFC PATCH V2 2/3] block: add ->poll_bio to block_device_operations

2021-06-21 Thread Ming Lei
On Mon, Jun 21, 2021 at 09:25:02AM +0200, Christoph Hellwig wrote:
> > +   struct gendisk *disk = bio->bi_bdev->bd_disk;
> > +   struct request_queue *q = disk->queue;
> > blk_qc_t cookie = READ_ONCE(bio->bi_cookie);
> > int ret;
> >  
> > -   if (cookie == BLK_QC_T_NONE || !blk_queue_poll(q))
> > +   if ((queue_is_mq(q) && cookie == BLK_QC_T_NONE) ||
> > +   !blk_queue_poll(q))
> > return 0;
> 
> How does polling for a bio without a cookie make sense even when
> polling bio based?

It isn't necessary to use bio->bi_cookie, which is why I don't use it;
this actually leaves one free 32-bit field in the bio for bio-based
drivers.

> 
> But if we come up for a good rationale for this I'd really
> split the conditions to make them more readable:
> 
>   if (!test_bit(QUEUE_FLAG_POLL, >queue_flags))
>   return 0;
>   if (queue_is_mq(q) && cookie == BLK_QC_T_NONE)
>   return 0;

OK.

> 
> > +   if (!queue_is_mq(q)) {
> > +   if (disk->fops->poll_bio) {
> > +   ret = disk->fops->poll_bio(bio, flags);
> > +   } else {
> > +   WARN_ON_ONCE(1);
> > +   ret = 0;
> > +   }
> > +   } else {
> > ret = blk_mq_poll(q, cookie, flags);
> 
> I'd go for someting like:
> 
>   if (queue_is_mq(q))
>   ret = blk_mq_poll(q, cookie, flags);
>   else if (disk->fops->poll_bio)
>   ret = disk->fops->poll_bio(bio, flags);
>   else
>   WARN_ON_ONCE(1);
> 
> with ret initialized to 0 at declaration time.

Fine.

> 
> >  struct block_device_operations {
> > void (*submit_bio)(struct bio *bio);
> > +   /* ->poll_bio is for bio driver only */
> 
> I'd drop the comment, this is already nicely documented in add_disk
> together with the actual check.  We also don't note this for submit_bio
> here.

OK.



thanks,
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [RFC PATCH V2 1/3] block: add helper of blk_queue_poll

2021-06-21 Thread Ming Lei
On Mon, Jun 21, 2021 at 09:20:36AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 17, 2021 at 06:35:47PM +0800, Ming Lei wrote:
> > There has been 3 users, and will be more, so add one such helper.
> > 
> > Reviewed-by: Chaitanya Kulkarni 
> > Reviewed-by: Jeffle Xu 
> > Reviewed-by: Hannes Reinecke 
> > Signed-off-by: Ming Lei 
> 
> I still don't like hiding a simple flag test like this, it just adds
> another step to grepping what is going on.

It is actually a common pattern in the block layer, since there are so
many such blk_queue_* macros. And it makes the check line shorter.
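
For reference, the family it joins in include/linux/blkdev.h (these
lines are from the patch as posted below):

```
#define blk_queue_fua(q)	test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags)
#define blk_queue_nowait(q)	test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
#define blk_queue_poll(q)	test_bit(QUEUE_FLAG_POLL, &(q)->queue_flags)
```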


Thanks,
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [RFC PATCH V2 3/3] dm: support bio polling

2021-06-18 Thread Ming Lei
On Fri, Jun 18, 2021 at 04:56:45PM -0400, Mike Snitzer wrote:
> [you really should've changed the subject of this email to
> "[RFC PATCH V3 3/3] dm: support bio polling"]
> 
> On Fri, Jun 18 2021 at 10:39P -0400,
> Ming Lei  wrote:
> 
> > From 47e523b9ee988317369eaadb96826323cd86819e Mon Sep 17 00:00:00 2001
> > From: Ming Lei 
> > Date: Wed, 16 Jun 2021 16:13:46 +0800
> > Subject: [RFC PATCH V3 3/3] dm: support bio polling
> > 
> > Support bio(REQ_POLLED) polling in the following approach:
> > 
> > 1) only support io polling on normal READ/WRITE, and other abnormal IOs
> > still fallback on IRQ mode, so the target io is exactly inside the dm
> > io.
> > 
> > 2) hold one refcnt on io->io_count after submitting this dm bio with
> > REQ_POLLED
> > 
> > 3) support dm native bio splitting, any dm io instance associated with
> > current bio will be added into one list which head is bio->bi_end_io
> > which will be recovered before ending this bio
> > 
> > 4) implement .poll_bio() callback, call bio_poll() on the single target
> > bio inside the dm io which is retrieved via bio->bi_bio_drv_data; call
> > dec_pending() after the target io is done in .poll_bio()
> > 
> > 4) enable QUEUE_FLAG_POLL if all underlying queues enable QUEUE_FLAG_POLL,
> > which is based on Jeffle's previous patch.
> 
> ^ nit: two "4)", last should be 5.
> 
> > 
> > Signed-off-by: Ming Lei 
> > ---
> > V3:
> > - covers all comments from Jeffle
> 
> Would really appreciate it if Jeffle could test these changes like he
> did previous dm IO polling patchsets he implemented.  Jeffle?

Yeah, I am looking forward to Jeffle's test too, :-)

> 
> > - fix corner cases when polling on abnormal ios
> > 
> >  drivers/md/dm-table.c |  24 
> >  drivers/md/dm.c   | 127 --
> >  2 files changed, 147 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> > index ee47a332b462..b14b379442d2 100644
> > --- a/drivers/md/dm-table.c
> > +++ b/drivers/md/dm-table.c
> > @@ -1491,6 +1491,12 @@ struct dm_target *dm_table_find_target(struct 
> > dm_table *t, sector_t sector)
> > return >targets[(KEYS_PER_NODE * n) + k];
> >  }
> >  
> > +static int device_not_poll_capable(struct dm_target *ti, struct dm_dev 
> > *dev,
> > +  sector_t start, sector_t len, void *data)
> > +{
> > +   return !blk_queue_poll(bdev_get_queue(dev->bdev));
> > +}
> > +
> >  /*
> >   * type->iterate_devices() should be called when the sanity check needs to
> >   * iterate and check all underlying data devices. iterate_devices() will
> > @@ -1541,6 +1547,11 @@ static int count_device(struct dm_target *ti, struct 
> > dm_dev *dev,
> > return 0;
> >  }
> >  
> > +static int dm_table_supports_poll(struct dm_table *t)
> > +{
> > +   return !dm_table_any_dev_attr(t, device_not_poll_capable, NULL);
> > +}
> > +
> >  /*
> >   * Check whether a table has no data devices attached using each
> >   * target's iterate_devices method.
> > @@ -2078,6 +2089,19 @@ void dm_table_set_restrictions(struct dm_table *t, 
> > struct request_queue *q,
> >  
> > dm_update_keyslot_manager(q, t);
> > blk_queue_update_readahead(q);
> > +
> > +   /*
> > +* Check for request-based device is remained to
> > +* dm_mq_init_request_queue()->blk_mq_init_allocated_queue().
> > +* For bio-based device, only set QUEUE_FLAG_POLL when all underlying
> > +* devices supporting polling.
> > +*/
> > +   if (__table_type_bio_based(t->type)) {
> > +   if (dm_table_supports_poll(t))
> > +   blk_queue_flag_set(QUEUE_FLAG_POLL, q);
> > +   else
> > +   blk_queue_flag_clear(QUEUE_FLAG_POLL, q);
> > +   }
> >  }
> >  
> >  unsigned int dm_table_get_num_targets(struct dm_table *t)
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index 363f12a285ce..df4a6a999014 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -39,6 +39,8 @@
> >  #define DM_COOKIE_ENV_VAR_NAME "DM_COOKIE"
> >  #define DM_COOKIE_LENGTH 24
> >  
> > +#define REQ_SAVED_END_IO REQ_DRV
> > +
> >  static const char *_name = DM_NAME;
> >  
> >  static unsigned int major = 0;
> > @@ -72,6 +74,7 @@ struct clone_info {
> > struct dm

Re: [dm-devel] [RFC PATCH V2 3/3] dm: support bio polling

2021-06-18 Thread Ming Lei
>From 47e523b9ee988317369eaadb96826323cd86819e Mon Sep 17 00:00:00 2001
From: Ming Lei 
Date: Wed, 16 Jun 2021 16:13:46 +0800
Subject: [RFC PATCH V3 3/3] dm: support bio polling

Support bio(REQ_POLLED) polling in the following approach:

1) only support io polling on normal READ/WRITE, and other abnormal IOs
still fallback on IRQ mode, so the target io is exactly inside the dm
io.

2) hold one refcnt on io->io_count after submitting this dm bio with
REQ_POLLED

3) support dm native bio splitting, any dm io instance associated with
current bio will be added into one list which head is bio->bi_end_io
which will be recovered before ending this bio

4) implement .poll_bio() callback, call bio_poll() on the single target
bio inside the dm io which is retrieved via bio->bi_bio_drv_data; call
dec_pending() after the target io is done in .poll_bio()

4) enable QUEUE_FLAG_POLL if all underlying queues enable QUEUE_FLAG_POLL,
which is based on Jeffle's previous patch.

Signed-off-by: Ming Lei 
---
V3:
- covers all comments from Jeffle
- fix corner cases when polling on abnormal ios

 drivers/md/dm-table.c |  24 
 drivers/md/dm.c   | 127 --
 2 files changed, 147 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index ee47a332b462..b14b379442d2 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1491,6 +1491,12 @@ struct dm_target *dm_table_find_target(struct dm_table 
*t, sector_t sector)
return >targets[(KEYS_PER_NODE * n) + k];
 }
 
+static int device_not_poll_capable(struct dm_target *ti, struct dm_dev *dev,
+  sector_t start, sector_t len, void *data)
+{
+   return !blk_queue_poll(bdev_get_queue(dev->bdev));
+}
+
 /*
  * type->iterate_devices() should be called when the sanity check needs to
  * iterate and check all underlying data devices. iterate_devices() will
@@ -1541,6 +1547,11 @@ static int count_device(struct dm_target *ti, struct 
dm_dev *dev,
return 0;
 }
 
+static int dm_table_supports_poll(struct dm_table *t)
+{
+   return !dm_table_any_dev_attr(t, device_not_poll_capable, NULL);
+}
+
 /*
  * Check whether a table has no data devices attached using each
  * target's iterate_devices method.
@@ -2078,6 +2089,19 @@ void dm_table_set_restrictions(struct dm_table *t, 
struct request_queue *q,
 
dm_update_keyslot_manager(q, t);
blk_queue_update_readahead(q);
+
+   /*
+* Check for request-based device is remained to
+* dm_mq_init_request_queue()->blk_mq_init_allocated_queue().
+* For bio-based device, only set QUEUE_FLAG_POLL when all underlying
+* devices supporting polling.
+*/
+   if (__table_type_bio_based(t->type)) {
+   if (dm_table_supports_poll(t))
+   blk_queue_flag_set(QUEUE_FLAG_POLL, q);
+   else
+   blk_queue_flag_clear(QUEUE_FLAG_POLL, q);
+   }
 }
 
 unsigned int dm_table_get_num_targets(struct dm_table *t)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 363f12a285ce..df4a6a999014 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -39,6 +39,8 @@
 #define DM_COOKIE_ENV_VAR_NAME "DM_COOKIE"
 #define DM_COOKIE_LENGTH 24
 
+#define REQ_SAVED_END_IO REQ_DRV
+
 static const char *_name = DM_NAME;
 
 static unsigned int major = 0;
@@ -72,6 +74,7 @@ struct clone_info {
struct dm_io *io;
sector_t sector;
unsigned sector_count;
+   boolsubmit_as_polled;
 };
 
 /*
@@ -99,6 +102,8 @@ struct dm_io {
blk_status_t status;
atomic_t io_count;
struct bio *orig_bio;
+   void*saved_bio_end_io;
+   struct hlist_node  node;
unsigned long start_time;
spinlock_t endio_lock;
struct dm_stats_aux stats_aux;
@@ -687,6 +692,8 @@ static struct dm_target_io *alloc_tio(struct clone_info 
*ci, struct dm_target *t
tio->ti = ti;
tio->target_bio_nr = target_bio_nr;
 
+   WARN_ON_ONCE(ci->submit_as_polled && !tio->inside_dm_io);
+
return tio;
 }
 
@@ -938,8 +945,14 @@ static void dec_pending(struct dm_io *io, blk_status_t 
error)
end_io_acct(io);
free_io(md, io);
 
-   if (io_error == BLK_STS_DM_REQUEUE)
+   if (io_error == BLK_STS_DM_REQUEUE) {
+   /*
+* Upper layer won't help us poll split bio, so
+* clear REQ_POLLED in case of requeue
+*/
+   bio->bi_opf &= ~REQ_POLLED;
return;
+   }
 
if ((bio->bi_opf & REQ_PREFLUSH) && bio->bi_iter.bi_size) {
/*
@@ -1574,6 +1587,32 @@ static bool __process_abnormal_io(struct clone_info *ci, 
struct dm_target *t

Re: [dm-devel] [RFC PATCH V2 3/3] dm: support bio polling

2021-06-18 Thread Ming Lei
Hello Jeffle,

On Fri, Jun 18, 2021 at 04:19:10PM +0800, JeffleXu wrote:
> 
> 
> On 6/17/21 6:35 PM, Ming Lei wrote:
> > Support bio(REQ_POLLED) polling in the following approach:
> > 
> > 1) only support io polling on normal READ/WRITE, and other abnormal IOs
> > still fallback on IRQ mode, so the target io is exactly inside the dm
> > io.
> > 
> > 2) hold one refcnt on io->io_count after submitting this dm bio with
> > REQ_POLLED
> > 
> > 3) support dm native bio splitting, any dm io instance associated with
> > current bio will be added into one list which head is bio->bi_end_io
> > which will be recovered before ending this bio
> > 
> > 4) implement .poll_bio() callback, call bio_poll() on the single target
> > bio inside the dm io which is retrieved via bio->bi_bio_drv_data; call
> > dec_pending() after the target io is done in .poll_bio()
> > 
> > 4) enable QUEUE_FLAG_POLL if all underlying queues enable QUEUE_FLAG_POLL,
> > which is based on Jeffle's previous patch.
> > 
> > Signed-off-by: Ming Lei 
> > ---
> >  drivers/md/dm-table.c |  24 +
> >  drivers/md/dm.c   | 111 --
> >  2 files changed, 132 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> > index ee47a332b462..b14b379442d2 100644
> > --- a/drivers/md/dm-table.c
> > +++ b/drivers/md/dm-table.c
> > @@ -1491,6 +1491,12 @@ struct dm_target *dm_table_find_target(struct 
> > dm_table *t, sector_t sector)
> > return >targets[(KEYS_PER_NODE * n) + k];
> >  }
> >  
> > +static int device_not_poll_capable(struct dm_target *ti, struct dm_dev 
> > *dev,
> > +  sector_t start, sector_t len, void *data)
> > +{
> > +   return !blk_queue_poll(bdev_get_queue(dev->bdev));
> > +}
> > +
> >  /*
> >   * type->iterate_devices() should be called when the sanity check needs to
> >   * iterate and check all underlying data devices. iterate_devices() will
> > @@ -1541,6 +1547,11 @@ static int count_device(struct dm_target *ti, struct 
> > dm_dev *dev,
> > return 0;
> >  }
> >  
> > +static int dm_table_supports_poll(struct dm_table *t)
> > +{
> > +   return !dm_table_any_dev_attr(t, device_not_poll_capable, NULL);
> > +}
> > +
> >  /*
> >   * Check whether a table has no data devices attached using each
> >   * target's iterate_devices method.
> > @@ -2078,6 +2089,19 @@ void dm_table_set_restrictions(struct dm_table *t, 
> > struct request_queue *q,
> >  
> > dm_update_keyslot_manager(q, t);
> > blk_queue_update_readahead(q);
> > +
> > +   /*
> > +* Check for request-based device is remained to
> > +* dm_mq_init_request_queue()->blk_mq_init_allocated_queue().
> > +* For bio-based device, only set QUEUE_FLAG_POLL when all underlying
> > +* devices supporting polling.
> > +*/
> > +   if (__table_type_bio_based(t->type)) {
> > +   if (dm_table_supports_poll(t))
> > +   blk_queue_flag_set(QUEUE_FLAG_POLL, q);
> > +   else
> > +   blk_queue_flag_clear(QUEUE_FLAG_POLL, q);
> > +   }
> >  }
> >  
> >  unsigned int dm_table_get_num_targets(struct dm_table *t)
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index 363f12a285ce..9745c3deacc3 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -39,6 +39,8 @@
> >  #define DM_COOKIE_ENV_VAR_NAME "DM_COOKIE"
> >  #define DM_COOKIE_LENGTH 24
> >  
> > +#define REQ_SAVED_END_IO REQ_DRV
> > +
> >  static const char *_name = DM_NAME;
> >  
> >  static unsigned int major = 0;
> > @@ -97,8 +99,11 @@ struct dm_io {
> > unsigned magic;
> > struct mapped_device *md;
> > blk_status_t status;
> > +   boolsubmit_as_polled;
> > atomic_t io_count;
> > struct bio *orig_bio;
> > +   void*saved_bio_end_io;
> > +   struct hlist_node  node;
> > unsigned long start_time;
> > spinlock_t endio_lock;
> > struct dm_stats_aux stats_aux;
> > @@ -687,6 +692,8 @@ static struct dm_target_io *alloc_tio(struct clone_info 
> > *ci, struct dm_target *t
> > tio->ti = ti;
> > tio->target_bio_nr = target_bio_nr;
> >  
> > +   WARN_ON_ONCE(ci->io->submit_as_polled && !tio->inside_dm_io);
> > +
> > return tio;
> >  }
> >  
> > @@ -9

Re: [dm-devel] [RFC PATCH V2 3/3] dm: support bio polling

2021-06-17 Thread Ming Lei
On Thu, Jun 17, 2021 at 06:35:49PM +0800, Ming Lei wrote:
> Support bio(REQ_POLLED) polling in the following approach:
> 
> 1) only support io polling on normal READ/WRITE, and other abnormal IOs
> still fallback on IRQ mode, so the target io is exactly inside the dm
> io.
> 
> 2) hold one refcnt on io->io_count after submitting this dm bio with
> REQ_POLLED
> 
> 3) support dm native bio splitting, any dm io instance associated with
> current bio will be added into one list which head is bio->bi_end_io
> which will be recovered before ending this bio
> 
> 4) implement .poll_bio() callback, call bio_poll() on the single target
> bio inside the dm io which is retrieved via bio->bi_bio_drv_data; call
> dec_pending() after the target io is done in .poll_bio()
> 
> 4) enable QUEUE_FLAG_POLL if all underlying queues enable QUEUE_FLAG_POLL,
> which is based on Jeffle's previous patch.
> 
> Signed-off-by: Ming Lei 

...

> @@ -938,8 +945,12 @@ static void dec_pending(struct dm_io *io, blk_status_t 
> error)
>   end_io_acct(io);
>   free_io(md, io);
>  
> - if (io_error == BLK_STS_DM_REQUEUE)
> + if (io_error == BLK_STS_DM_REQUEUE) {
> + /* not poll any more in case of requeue */
> + if (bio->bi_opf & REQ_POLLED)
> + bio->bi_opf &= ~REQ_POLLED;

It is no longer necessary to clear REQ_POLLED before requeuing, since
every dm_io is added into the hlist_head reused from bio->bi_end_io,
so all dm_io instances (including the one to be requeued) will be
polled.
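
In other words, the requeue branch of dec_pending() could shrink back
to (a sketch of the follow-up simplification, not code from the posted
patch):

```
if (io_error == BLK_STS_DM_REQUEUE) {
	/*
	 * No need to clear REQ_POLLED: the requeued dm_io stays on
	 * the per-bio hlist, so ->poll_bio() still reaches it.
	 */
	return;
}
```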

Thanks,
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [RFC PATCH V2 3/3] dm: support bio polling

2021-06-17 Thread Ming Lei
Support bio(REQ_POLLED) polling in the following approach:

1) only support io polling on normal READ/WRITE, and other abnormal IOs
still fallback on IRQ mode, so the target io is exactly inside the dm
io.

2) hold one refcnt on io->io_count after submitting this dm bio with
REQ_POLLED

3) support dm native bio splitting, any dm io instance associated with
current bio will be added into one list which head is bio->bi_end_io
which will be recovered before ending this bio

4) implement .poll_bio() callback, call bio_poll() on the single target
bio inside the dm io which is retrieved via bio->bi_bio_drv_data; call
dec_pending() after the target io is done in .poll_bio()

4) enable QUEUE_FLAG_POLL if all underlying queues enable QUEUE_FLAG_POLL,
which is based on Jeffle's previous patch.

Signed-off-by: Ming Lei 
---
 drivers/md/dm-table.c |  24 +
 drivers/md/dm.c   | 111 --
 2 files changed, 132 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index ee47a332b462..b14b379442d2 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1491,6 +1491,12 @@ struct dm_target *dm_table_find_target(struct dm_table 
*t, sector_t sector)
return >targets[(KEYS_PER_NODE * n) + k];
 }
 
+static int device_not_poll_capable(struct dm_target *ti, struct dm_dev *dev,
+  sector_t start, sector_t len, void *data)
+{
+   return !blk_queue_poll(bdev_get_queue(dev->bdev));
+}
+
 /*
  * type->iterate_devices() should be called when the sanity check needs to
  * iterate and check all underlying data devices. iterate_devices() will
@@ -1541,6 +1547,11 @@ static int count_device(struct dm_target *ti, struct 
dm_dev *dev,
return 0;
 }
 
+static int dm_table_supports_poll(struct dm_table *t)
+{
+   return !dm_table_any_dev_attr(t, device_not_poll_capable, NULL);
+}
+
 /*
  * Check whether a table has no data devices attached using each
  * target's iterate_devices method.
@@ -2078,6 +2089,19 @@ void dm_table_set_restrictions(struct dm_table *t, 
struct request_queue *q,
 
dm_update_keyslot_manager(q, t);
blk_queue_update_readahead(q);
+
+   /*
+* Check for request-based device is remained to
+* dm_mq_init_request_queue()->blk_mq_init_allocated_queue().
+* For bio-based device, only set QUEUE_FLAG_POLL when all underlying
+* devices supporting polling.
+*/
+   if (__table_type_bio_based(t->type)) {
+   if (dm_table_supports_poll(t))
+   blk_queue_flag_set(QUEUE_FLAG_POLL, q);
+   else
+   blk_queue_flag_clear(QUEUE_FLAG_POLL, q);
+   }
 }
 
 unsigned int dm_table_get_num_targets(struct dm_table *t)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 363f12a285ce..9745c3deacc3 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -39,6 +39,8 @@
 #define DM_COOKIE_ENV_VAR_NAME "DM_COOKIE"
 #define DM_COOKIE_LENGTH 24
 
+#define REQ_SAVED_END_IO REQ_DRV
+
 static const char *_name = DM_NAME;
 
 static unsigned int major = 0;
@@ -97,8 +99,11 @@ struct dm_io {
unsigned magic;
struct mapped_device *md;
blk_status_t status;
+   boolsubmit_as_polled;
atomic_t io_count;
struct bio *orig_bio;
+   void*saved_bio_end_io;
+   struct hlist_node  node;
unsigned long start_time;
spinlock_t endio_lock;
struct dm_stats_aux stats_aux;
@@ -687,6 +692,8 @@ static struct dm_target_io *alloc_tio(struct clone_info 
*ci, struct dm_target *t
tio->ti = ti;
tio->target_bio_nr = target_bio_nr;
 
+   WARN_ON_ONCE(ci->io->submit_as_polled && !tio->inside_dm_io);
+
return tio;
 }
 
@@ -938,8 +945,12 @@ static void dec_pending(struct dm_io *io, blk_status_t 
error)
end_io_acct(io);
free_io(md, io);
 
-   if (io_error == BLK_STS_DM_REQUEUE)
+   if (io_error == BLK_STS_DM_REQUEUE) {
+   /* not poll any more in case of requeue */
+   if (bio->bi_opf & REQ_POLLED)
+   bio->bi_opf &= ~REQ_POLLED;
return;
+   }
 
if ((bio->bi_opf & REQ_PREFLUSH) && bio->bi_iter.bi_size) {
/*
@@ -1590,6 +1601,12 @@ static int __split_and_process_non_flush(struct 
clone_info *ci)
if (__process_abnormal_io(ci, ti, ))
return r;
 
+   /*
+* Only support bio polling for normal IO, and the target io is
+* exactly inside the dm io instance
+*/
+   ci->io->submit_as_polled = !!(ci->bio->bi_opf & REQ_POLLED);
+
len = min_t(sector_t, max_io_len(ti, ci->sector), ci->sector_count);
 
r = __clone_and_map_data_bio(ci, ti,

[dm-devel] [RFC PATCH V2 2/3] block: add ->poll_bio to block_device_operations

2021-06-17 Thread Ming Lei
Prepare for supporting IO polling for bio based driver.

Add ->poll_bio callback so that bio driver can provide their own logic
for polling bio.

Signed-off-by: Ming Lei 
---
 block/blk-core.c   | 18 +-
 block/genhd.c  |  3 +++
 include/linux/blkdev.h |  2 ++
 3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1e24c71c6738..a1552ec8d608 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1113,11 +1113,13 @@ EXPORT_SYMBOL(submit_bio);
  */
 int bio_poll(struct bio *bio, unsigned int flags)
 {
-   struct request_queue *q = bio->bi_bdev->bd_disk->queue;
+   struct gendisk *disk = bio->bi_bdev->bd_disk;
+   struct request_queue *q = disk->queue;
blk_qc_t cookie = READ_ONCE(bio->bi_cookie);
int ret;
 
-   if (cookie == BLK_QC_T_NONE || !blk_queue_poll(q))
+   if ((queue_is_mq(q) && cookie == BLK_QC_T_NONE) ||
+   !blk_queue_poll(q))
return 0;
 
if (current->plug)
@@ -1125,10 +1127,16 @@ int bio_poll(struct bio *bio, unsigned int flags)
 
if (blk_queue_enter(q, BLK_MQ_REQ_NOWAIT))
return 0;
-   if (WARN_ON_ONCE(!queue_is_mq(q)))
-   ret = 0;/* not yet implemented, should not happen */
-   else
+   if (!queue_is_mq(q)) {
+   if (disk->fops->poll_bio) {
+   ret = disk->fops->poll_bio(bio, flags);
+   } else {
+   WARN_ON_ONCE(1);
+   ret = 0;
+   }
+   } else {
ret = blk_mq_poll(q, cookie, flags);
+   }
blk_queue_exit(q);
return ret;
 }
diff --git a/block/genhd.c b/block/genhd.c
index 5f5628216295..042dfa5b3f79 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -471,6 +471,9 @@ static void __device_add_disk(struct device *parent, struct 
gendisk *disk,
 {
int ret;
 
+   /* ->poll_bio is only for bio based driver */
+   WARN_ON_ONCE(queue_is_mq(disk->queue) && disk->fops->poll_bio);
+
/*
 * The disk queue should now be all set with enough information about
 * the device for the elevator code to pick an adequate default
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index fc0ba0b80776..6da6fb120148 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1858,6 +1858,8 @@ static inline void blk_ksm_unregister(struct 
request_queue *q) { }
 
 struct block_device_operations {
void (*submit_bio)(struct bio *bio);
+   /* ->poll_bio is for bio driver only */
+   int (*poll_bio)(struct bio *bio, unsigned int flags);
int (*open) (struct block_device *, fmode_t);
void (*release) (struct gendisk *, fmode_t);
int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned 
int);
-- 
2.31.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
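
For illustration, a bio-based driver adopting the new callback might
wire it up like this (a hypothetical sketch; my_submit_bio and
my_poll_bio are made-up names, and ->poll_bio is only valid for
non-blk-mq queues, as enforced by the WARN_ON_ONCE added to
__device_add_disk() above):

```
static void my_submit_bio(struct bio *bio)
{
	/* remap and submit the bio to the underlying device(s) */
}

static int my_poll_bio(struct bio *bio, unsigned int flags)
{
	/* poll the underlying queue(s) this bio was mapped to */
	return 0;
}

static const struct block_device_operations my_fops = {
	.owner      = THIS_MODULE,
	.submit_bio = my_submit_bio,
	.poll_bio   = my_poll_bio,
};
```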



[dm-devel] [RFC PATCH V2 1/3] block: add helper of blk_queue_poll

2021-06-17 Thread Ming Lei
There has been 3 users, and will be more, so add one such helper.

Reviewed-by: Chaitanya Kulkarni 
Reviewed-by: Jeffle Xu 
Reviewed-by: Hannes Reinecke 
Signed-off-by: Ming Lei 
---
 block/blk-core.c | 5 ++---
 block/blk-sysfs.c| 4 ++--
 drivers/nvme/host/core.c | 2 +-
 include/linux/blkdev.h   | 1 +
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 531176578221..1e24c71c6738 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -835,7 +835,7 @@ static noinline_for_stack bool submit_bio_checks(struct bio 
*bio)
}
}
 
-   if (!test_bit(QUEUE_FLAG_POLL, >queue_flags))
+   if (!blk_queue_poll(q))
bio->bi_opf &= ~REQ_POLLED;
 
switch (bio_op(bio)) {
@@ -1117,8 +1117,7 @@ int bio_poll(struct bio *bio, unsigned int flags)
blk_qc_t cookie = READ_ONCE(bio->bi_cookie);
int ret;
 
-   if (cookie == BLK_QC_T_NONE ||
-   !test_bit(QUEUE_FLAG_POLL, >queue_flags))
+   if (cookie == BLK_QC_T_NONE || !blk_queue_poll(q))
return 0;
 
if (current->plug)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f78e73ca6091..93dcf2dfaafd 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -422,13 +422,13 @@ static ssize_t queue_poll_delay_store(struct 
request_queue *q, const char *page,
 
 static ssize_t queue_poll_show(struct request_queue *q, char *page)
 {
-   return queue_var_show(test_bit(QUEUE_FLAG_POLL, >queue_flags), page);
+   return queue_var_show(blk_queue_poll(q), page);
 }
 
 static ssize_t queue_poll_store(struct request_queue *q, const char *page,
size_t count)
 {
-   if (!test_bit(QUEUE_FLAG_POLL, >queue_flags))
+   if (!blk_queue_poll(q))
return -EINVAL;
pr_info_ratelimited("writes to the poll attribute are ignored.\n");
pr_info_ratelimited("please use driver specific parameters instead.\n");
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fe0b8da3de7f..e31c7704ef4d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1025,7 +1025,7 @@ static void nvme_execute_rq_polled(struct request_queue 
*q,
 {
DECLARE_COMPLETION_ONSTACK(wait);
 
-   WARN_ON_ONCE(!test_bit(QUEUE_FLAG_POLL, >queue_flags));
+   WARN_ON_ONCE(!blk_queue_poll(q));
 
rq->cmd_flags |= REQ_POLLED;
rq->end_io_data = 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 561b04117bd4..fc0ba0b80776 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -677,6 +677,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct 
request_queue *q);
 #define blk_queue_fua(q)   test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags)
 #define blk_queue_registered(q)test_bit(QUEUE_FLAG_REGISTERED, 
&(q)->queue_flags)
 #define blk_queue_nowait(q)test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
+#define blk_queue_poll(q)  test_bit(QUEUE_FLAG_POLL, &(q)->queue_flags)
 
 extern void blk_set_pm_only(struct request_queue *q);
 extern void blk_clear_pm_only(struct request_queue *q);
-- 
2.31.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [RFC PATCH V2 0/3] block/dm: support bio polling

2021-06-17 Thread Ming Lei
Hello Guys,

Based on Christoph's bio based polling model[1], implement DM bio polling
with one very simple approach.

Patch 1 adds helper of blk_queue_poll().

Patch 2 adds .bio_poll() callback to block_device_operations, so bio
driver can implement its own logic for io polling.

Patch 3 implements bio polling for device mapper.

Any comments are welcome.

V2:
- drop patch to add new fields into bio
- support io polling for dm native bio splitting
- add comment

Ming Lei (3):
  block: add helper of blk_queue_poll
  block: add ->poll_bio to block_device_operations
  dm: support bio polling

 block/blk-core.c |  21 +---
 block/blk-sysfs.c|   4 +-
 block/genhd.c|   3 ++
 drivers/md/dm-table.c|  24 +
 drivers/md/dm.c  | 111 +--
 drivers/nvme/host/core.c |   2 +-
 include/linux/blkdev.h   |   3 ++
 7 files changed, 155 insertions(+), 13 deletions(-)

-- 
2.31.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


