Re: [dm-devel] [PATCH v7 5/5] loop: Add support for provision requests

2023-05-22 Thread Darrick J. Wong
On Mon, May 22, 2023 at 03:09:55PM -0700, Sarthak Kukreti wrote:
> On Mon, May 22, 2023 at 9:37 AM Darrick J. Wong  wrote:
> >
> > If someone calls fallocate(UNSHARE_RANGE) on a loop bdev, shouldn't
> > there be a way to pass that through to the fallocate call to the backing
> > file?
> >
> > --D
> >
> 
> Yeah, I think we could add a REQ_UNSHARE bit (similar to REQ_NOUNMAP) to pass 
> down the intent to the backing file (and possibly beyond...).
> 
> I took a stab at implementing it as a follow up patch so that there's
> less review churn on the current series. If it looks good, I can add
> it to the end of the series (or incorporate this into the existing
> block and loop patches):

It looks like a reasonable addition to the end of the series, assuming
that filling holes in thinp devices is cheap but unsharing snapshot
blocks is not.

> From: Sarthak Kukreti 
> Date: Mon, 22 May 2023 14:18:15 -0700
> Subject: [PATCH] block: Pass unshare intent via REQ_OP_PROVISION
> 
> Allow REQ_OP_PROVISION to pass in an extra REQ_UNSHARE bit to
> annotate unshare requests to underlying layers. Layers that support
> FALLOC_FL_UNSHARE will be able to use this as an indicator of which
> fallocate() mode to use.

> Signed-off-by: Sarthak Kukreti 
> ---
>  block/blk-lib.c   |  6 +-
>  block/fops.c  |  6 +-
>  drivers/block/loop.c  | 35 +--
>  include/linux/blk_types.h |  3 +++
>  include/linux/blkdev.h|  3 ++-
>  5 files changed, 44 insertions(+), 9 deletions(-)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 3cff5fb654f5..bea6f5a700b3 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -350,6 +350,7 @@ EXPORT_SYMBOL(blkdev_issue_secure_erase);
>   * @sector:  start sector
>   * @nr_sects:number of sectors to provision
>   * @gfp_mask:memory allocation flags (for bio_alloc)
> + * @flags:   controls detailed behavior
>   *
>   * Description:
>   *  Issues a provision request to the block device for the range of sectors.
> @@ -357,7 +358,7 @@ EXPORT_SYMBOL(blkdev_issue_secure_erase);
>   *  underlying storage pool to allocate space for this block range.
>   */
>  int blkdev_issue_provision(struct block_device *bdev, sector_t sector,
> - sector_t nr_sects, gfp_t gfp)
> + sector_t nr_sects, gfp_t gfp, unsigned flags)
>  {
>   sector_t bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
>   unsigned int max_sectors = bdev_max_provision_sectors(bdev);
> @@ -380,6 +381,9 @@ int blkdev_issue_provision(struct block_device *bdev, 
> sector_t sector,
>   bio->bi_iter.bi_sector = sector;
>   bio->bi_iter.bi_size = req_sects << SECTOR_SHIFT;
>  
> + if (flags & BLKDEV_UNSHARE_RANGE)

This is a provisioning flag, shouldn't this be ...
BLKDEV_PROVISION_UNSHARE or something?

> + bio->bi_opf |= REQ_UNSHARE;
> +
>   sector += req_sects;
>   nr_sects -= req_sects;
>   if (!nr_sects) {
> diff --git a/block/fops.c b/block/fops.c
> index be2e41f160bf..6848756f0557 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -659,7 +659,11 @@ static long blkdev_fallocate(struct file *file, int 
> mode, loff_t start,
>   case FALLOC_FL_KEEP_SIZE:
>   case FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE:
>   error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
> -len >> SECTOR_SHIFT, GFP_KERNEL);
> +len >> SECTOR_SHIFT, GFP_KERNEL,
> +(mode &
> + FALLOC_FL_UNSHARE_RANGE) ?
> +BLKDEV_UNSHARE_RANGE :
> +0);

You might want to do something about the six level indent here;
Linus hates that.
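
For example (untested sketch; assumes a local "flags" variable declared at
the top of blkdev_fallocate()):

	/* pick the provision flag before the call to keep the indent sane */
	flags = (mode & FALLOC_FL_UNSHARE_RANGE) ? BLKDEV_UNSHARE_RANGE : 0;
	error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
				       len >> SECTOR_SHIFT, GFP_KERNEL,
				       flags);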

--D

>   break;
>   case FALLOC_FL_ZERO_RANGE:
>   case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 7fe1a6629754..c844b145d666 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -306,6 +306,30 @@ static int lo_read_simple(struct loop_device *lo, struct 
> request *rq,
>   return 0;
>  }
>  
> +static bool validate_fallocate_mode(struct loop_device *lo, int mode)
> +{
> + bool ret = true;
> +
> + switch (mode) {
> + case FALLOC_FL_PUNCH_HOLE:
> + case FALLOC_FL_ZERO_RANGE:
> + if (!bdev_max_discard_sectors(lo->lo_device))
> + ret = false;
> + break;
> + case 0:
> + case FALLOC_FL_UNSHARE_RANGE:
> + if (!bdev_max_provision_sectors(lo->lo_device))
> + ret = false;
> + break;
> +
> + default:
> + ret = false;
> + }
> +
> + return ret;
> +}
> +
> +
>  static int lo_fallocate(struct loop_device *lo, struct 

[dm-devel] [PATCH v7 5/5] loop: Add support for provision requests

2023-05-22 Thread Sarthak Kukreti
On Mon, May 22, 2023 at 9:37 AM Darrick J. Wong  wrote:
>
> If someone calls fallocate(UNSHARE_RANGE) on a loop bdev, shouldn't
> there be a way to pass that through to the fallocate call to the backing
> file?
>
> --D
>

Yeah, I think we could add a REQ_UNSHARE bit (similar to REQ_NOUNMAP) to pass 
down the intent to the backing file (and possibly beyond...).

I took a stab at implementing it as a follow up patch so that there's less 
review churn on the current series. If it looks good, I can add it to the end 
of the series (or incorporate this into the existing block and loop patches):

From: Sarthak Kukreti 
Date: Mon, 22 May 2023 14:18:15 -0700
Subject: [PATCH] block: Pass unshare intent via REQ_OP_PROVISION

Allow REQ_OP_PROVISION to pass in an extra REQ_UNSHARE bit to
annotate unshare requests to underlying layers. Layers that support
FALLOC_FL_UNSHARE will be able to use this as an indicator of which
fallocate() mode to use.

Signed-off-by: Sarthak Kukreti 
---
 block/blk-lib.c   |  6 +-
 block/fops.c  |  6 +-
 drivers/block/loop.c  | 35 +--
 include/linux/blk_types.h |  3 +++
 include/linux/blkdev.h|  3 ++-
 5 files changed, 44 insertions(+), 9 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 3cff5fb654f5..bea6f5a700b3 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -350,6 +350,7 @@ EXPORT_SYMBOL(blkdev_issue_secure_erase);
  * @sector:start sector
  * @nr_sects:  number of sectors to provision
  * @gfp_mask:  memory allocation flags (for bio_alloc)
+ * @flags: controls detailed behavior
  *
  * Description:
  *  Issues a provision request to the block device for the range of sectors.
@@ -357,7 +358,7 @@ EXPORT_SYMBOL(blkdev_issue_secure_erase);
  *  underlying storage pool to allocate space for this block range.
  */
 int blkdev_issue_provision(struct block_device *bdev, sector_t sector,
-   sector_t nr_sects, gfp_t gfp)
+   sector_t nr_sects, gfp_t gfp, unsigned flags)
 {
sector_t bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
unsigned int max_sectors = bdev_max_provision_sectors(bdev);
@@ -380,6 +381,9 @@ int blkdev_issue_provision(struct block_device *bdev, 
sector_t sector,
bio->bi_iter.bi_sector = sector;
bio->bi_iter.bi_size = req_sects << SECTOR_SHIFT;
 
+   if (flags & BLKDEV_UNSHARE_RANGE)
+   bio->bi_opf |= REQ_UNSHARE;
+
sector += req_sects;
nr_sects -= req_sects;
if (!nr_sects) {
diff --git a/block/fops.c b/block/fops.c
index be2e41f160bf..6848756f0557 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -659,7 +659,11 @@ static long blkdev_fallocate(struct file *file, int mode, 
loff_t start,
case FALLOC_FL_KEEP_SIZE:
case FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE:
error = blkdev_issue_provision(bdev, start >> SECTOR_SHIFT,
-  len >> SECTOR_SHIFT, GFP_KERNEL);
+  len >> SECTOR_SHIFT, GFP_KERNEL,
+  (mode &
+   FALLOC_FL_UNSHARE_RANGE) ?
+  BLKDEV_UNSHARE_RANGE :
+  0);
break;
case FALLOC_FL_ZERO_RANGE:
case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 7fe1a6629754..c844b145d666 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -306,6 +306,30 @@ static int lo_read_simple(struct loop_device *lo, struct 
request *rq,
return 0;
 }
 
+static bool validate_fallocate_mode(struct loop_device *lo, int mode)
+{
+   bool ret = true;
+
+   switch (mode) {
+   case FALLOC_FL_PUNCH_HOLE:
+   case FALLOC_FL_ZERO_RANGE:
+   if (!bdev_max_discard_sectors(lo->lo_device))
+   ret = false;
+   break;
+   case 0:
+   case FALLOC_FL_UNSHARE_RANGE:
+   if (!bdev_max_provision_sectors(lo->lo_device))
+   ret = false;
+   break;
+
+   default:
+   ret = false;
+   }
+
+   return ret;
+}
+
+
 static int lo_fallocate(struct loop_device *lo, struct request *rq, loff_t pos,
int mode)
 {
@@ -316,11 +340,7 @@ static int lo_fallocate(struct loop_device *lo, struct 
request *rq, loff_t pos,
struct file *file = lo->lo_backing_file;
int ret;
 
-   if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE) &&
-   !bdev_max_discard_sectors(lo->lo_device))
-   return -EOPNOTSUPP;
-
-   if (mode == 0 && !bdev_max_provision_sectors(lo->lo_device))
+   if (!validate_fallocate_mode(lo, mode))
return -EOPNOTSUPP;
 
mode |= 

[dm-devel] [PATCH v3 7/7] block: Inline blk_mq_{, delay_}kick_requeue_list()

2023-05-22 Thread Bart Van Assche
Patch "block: Preserve the order of requeued requests" changed
blk_mq_kick_requeue_list() and blk_mq_delay_kick_requeue_list() into
blk_mq_run_hw_queues() and blk_mq_delay_run_hw_queues() calls
respectively. Inline blk_mq_{,delay_}kick_requeue_list() because these
functions are now too short to keep these as separate functions.

Cc: Christoph Hellwig 
Cc: Damien Le Moal 
Cc: Ming Lei 
Cc: Mike Snitzer 
Signed-off-by: Bart Van Assche 
---
 block/blk-flush.c|  4 ++--
 block/blk-mq-debugfs.c   |  2 +-
 block/blk-mq.c   | 16 +---
 drivers/block/ublk_drv.c |  6 +++---
 drivers/block/xen-blkfront.c |  1 -
 drivers/md/dm-rq.c   |  6 +++---
 drivers/nvme/host/core.c |  2 +-
 drivers/s390/block/scm_blk.c |  2 +-
 drivers/scsi/scsi_lib.c  |  2 +-
 include/linux/blk-mq.h   |  2 --
 10 files changed, 13 insertions(+), 30 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index dba392cf22be..22170036ddcb 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -191,7 +191,7 @@ static void blk_flush_complete_seq(struct request *rq,
spin_lock(>requeue_lock);
list_add_tail(>queuelist, >flush_list);
spin_unlock(>requeue_lock);
-   blk_mq_kick_requeue_list(q);
+   blk_mq_run_hw_queues(q, true);
break;
 
case REQ_FSEQ_DONE:
@@ -352,7 +352,7 @@ static void blk_kick_flush(struct request_queue *q, struct 
blk_flush_queue *fq,
list_add_tail(_rq->queuelist, >flush_list);
spin_unlock(>requeue_lock);
 
-   blk_mq_kick_requeue_list(q);
+   blk_mq_run_hw_queues(q, true);
 }
 
 static enum rq_end_io_ret mq_flush_data_end_io(struct request *rq,
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 68165a50951b..869cc62ed50f 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -145,7 +145,7 @@ static ssize_t queue_state_write(void *data, const char 
__user *buf,
} else if (strcmp(op, "start") == 0) {
blk_mq_start_stopped_hw_queues(q, true);
} else if (strcmp(op, "kick") == 0) {
-   blk_mq_kick_requeue_list(q);
+   blk_mq_run_hw_queues(q, true);
} else {
pr_err("%s: unsupported operation '%s'\n", __func__, op);
 inval:
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 52dffdc70480..34dcfc84d902 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1430,7 +1430,7 @@ void blk_mq_requeue_request(struct request *rq, bool 
kick_requeue_list)
spin_unlock_irqrestore(>requeue_lock, flags);
 
if (kick_requeue_list)
-   blk_mq_kick_requeue_list(q);
+   blk_mq_run_hw_queues(q, true);
 }
 EXPORT_SYMBOL(blk_mq_requeue_request);
 
@@ -1470,19 +1470,6 @@ static void blk_mq_process_requeue_list(struct 
blk_mq_hw_ctx *hctx)
blk_mq_insert_request(rq, 0);
 }
 
-void blk_mq_kick_requeue_list(struct request_queue *q)
-{
-   blk_mq_run_hw_queues(q, true);
-}
-EXPORT_SYMBOL(blk_mq_kick_requeue_list);
-
-void blk_mq_delay_kick_requeue_list(struct request_queue *q,
-   unsigned long msecs)
-{
-   blk_mq_delay_run_hw_queues(q, msecs);
-}
-EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
-
 static bool blk_mq_rq_inflight(struct request *rq, void *priv)
 {
/*
@@ -3537,7 +3524,6 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, 
struct hlist_node *node)
 
list_for_each_entry_safe(rq, next, , queuelist)
blk_mq_requeue_request(rq, false);
-   blk_mq_kick_requeue_list(hctx->queue);
 
blk_mq_run_hw_queue(hctx, true);
return 0;
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 539eada32861..4a3d579a25b5 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -900,7 +900,7 @@ static inline void __ublk_rq_task_work(struct request *req,
 */
if (unlikely(!mapped_bytes)) {
blk_mq_requeue_request(req, false);
-   blk_mq_delay_kick_requeue_list(req->q,
+   blk_mq_delay_run_hw_queues(req->q,
UBLK_REQUEUE_DELAY_MS);
return;
}
@@ -1290,7 +1290,7 @@ static void ublk_unquiesce_dev(struct ublk_device *ub)
 
blk_mq_unquiesce_queue(ub->ub_disk->queue);
/* We may have requeued some rqs in ublk_quiesce_queue() */
-   blk_mq_kick_requeue_list(ub->ub_disk->queue);
+   blk_mq_run_hw_queues(ub->ub_disk->queue, true);
 }
 
 static void ublk_stop_dev(struct ublk_device *ub)
@@ -2334,7 +2334,7 @@ static int ublk_ctrl_end_recovery(struct ublk_device *ub,
blk_mq_unquiesce_queue(ub->ub_disk->queue);
pr_devel("%s: queue unquiesced, dev id %d.\n",
__func__, header->dev_id);
-   blk_mq_kick_requeue_list(ub->ub_disk->queue);
+   blk_mq_run_hw_queues(ub->ub_disk->queue, 

[dm-devel] [PATCH v3 6/7] dm: Inline __dm_mq_kick_requeue_list()

2023-05-22 Thread Bart Van Assche
Since commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped
queues") the function __dm_mq_kick_requeue_list() is too short to keep it
as a separate function. Hence, inline this function.

Cc: Christoph Hellwig 
Cc: Damien Le Moal 
Cc: Ming Lei 
Cc: Mike Snitzer 
Signed-off-by: Bart Van Assche 
---
 drivers/md/dm-rq.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index f7e9a3632eb3..bbe1e2ea0aa4 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -168,21 +168,16 @@ static void dm_end_request(struct request *clone, 
blk_status_t error)
rq_completed(md);
 }
 
-static void __dm_mq_kick_requeue_list(struct request_queue *q, unsigned long 
msecs)
-{
-   blk_mq_delay_kick_requeue_list(q, msecs);
-}
-
 void dm_mq_kick_requeue_list(struct mapped_device *md)
 {
-   __dm_mq_kick_requeue_list(md->queue, 0);
+   blk_mq_kick_requeue_list(md->queue);
 }
 EXPORT_SYMBOL(dm_mq_kick_requeue_list);
 
 static void dm_mq_delay_requeue_request(struct request *rq, unsigned long 
msecs)
 {
blk_mq_requeue_request(rq, false);
-   __dm_mq_kick_requeue_list(rq->q, msecs);
+   blk_mq_delay_kick_requeue_list(rq->q, msecs);
 }
 
 static void dm_requeue_original_request(struct dm_rq_target_io *tio, bool 
delay_requeue)




Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-22 Thread Mike Snitzer
On Fri, May 19 2023 at  7:07P -0400,
Dave Chinner  wrote:

> On Fri, May 19, 2023 at 10:41:31AM -0400, Mike Snitzer wrote:
> > On Fri, May 19 2023 at 12:09P -0400,
> > Christoph Hellwig  wrote:
> > 
> > > FYI, I really don't think this primitive is a good idea.  In the
> > > concept of non-overwritable storage (NAND, SMR drives) the entire
> > > concept of a one-shot 'provisioning' that will guarantee later writes
> > > are always possible is simply bogus.
> > 
> > Valid point for sure, such storage shouldn't advertise support (and
> > will return -EOPNOTSUPP).
> > 
> > But the primitive still has utility for other classes of storage.
> 
> Yet the thing people are wanting us filesystem developers to use
> this with is thinly provisioned storage that has snapshot
> capability. That, by definition, is non-overwritable storage. These
> are the use cases people are asking filesystems to gracefully handle
> and report errors when the sparse backing store runs out of space.

DM thinp falls into this category but as you detailed it can be made
to work reliably. To carry that forward we need to first establish
the REQ_PROVISION primitive (with this series).

Follow-on associated dm-thinp enhancements can then serve as a reference
for how to take advantage of XFS's ability to operate reliably on
thinly provisioned storage.
 
> e.g. journal writes after a snapshot is taken on a busy filesystem
> are always an overwrite and this requires more space in the storage
> device for the write to succeed. ENOSPC from the backing device for
> journal IO is a -fatal error-. Hence if REQ_PROVISION doesn't
> guarantee space for overwrites after snapshots, then it's not
> actually useful for solving the real world use cases we actually
> need device-level provisioning to solve.
> 
> It is not viable for filesystems to have to reprovision space for
> in-place metadata overwrites after every snapshot - the filesystem
> may not even know a snapshot has been taken! And it's not feasible
> for filesystems to provision on demand before they modify metadata
> because we don't know what metadata is going to need to be modified
> before we start modifying metadata in transactions. If we get ENOSPC
> from provisioning in the middle of a dirty transaction, it's all
> over just the same as if we get ENOSPC during metadata writeback...
> 
> Hence what filesystems actually need is device provisioned space to
> be -always over-writable- without ENOSPC occurring.  Ideally, if we
> provision a range of the block device, the block device *must*
> guarantee all future writes to that LBA range succeeds. That
> guarantee needs to stand until we discard or unmap the LBA range,
> and for however many writes we do to that LBA range.
> 
> e.g. If the device takes a snapshot, it needs to reprovision the
> potential COW ranges that overlap with the provisioned LBA range at
> snapshot time. e.g. by re-reserving the space from the backing pool
> for the provisioned space so if a COW occurs there is space
> guaranteed for it to succeed.  If there isn't space in the backing
> pool for the reprovisioning, then whatever operation that triggers
> the COW behaviour should fail with ENOSPC before doing anything
> else

Happy to implement this in dm-thinp.  Each thin block will need a bit
to say if the block must be REQ_PROVISION'd at time of snapshot (and
the resulting block will need the same bit set).

Walking all blocks of a thin device and triggering REQ_PROVISION for
each will obviously make thin snapshot creation take more time.

I think this approach is better than having a dedicated bitmap hooked
off each thin device's metadata (with bitmap being copied and walked
at the time of snapshot). But we'll see... I'll get with Joe to
discuss further.

> Software devices like dm-thin/snapshot should really only need to
> keep a persistent map of the provisioned space and refresh space
> reservations for used space within that map whenever something that
> triggers COW behaviour occurs. i.e. a snapshot needs to reset the
> provisioned ranges back to "all ranges are freshly provisioned"
> before the snapshot is started. If that space is not available in
> the backing pool, then the snapshot attempt gets ENOSPC
> 
> That means filesystems only need to provision space for journals and
> fixed metadata at mkfs time, and they only need issue a
> REQ_PROVISION bio when they first allocate over-write in place
> metadata. We already have online discard and/or fstrim for releasing
> provisioned space via discards.
> 
> This will require some mods to filesystems like ext4 and XFS to
> issue REQ_PROVISION and fail gracefully during metadata allocation.
> However, doing so means that we can actually harden filesystems
> against sparse block device ENOSPC errors by ensuring they will
> never occur in critical filesystem structures

Yes, let's finally _do_ this! ;)

Mike


Re: [dm-devel] [PATCH v7 5/5] loop: Add support for provision requests

2023-05-22 Thread Darrick J. Wong
On Thu, May 18, 2023 at 03:33:26PM -0700, Sarthak Kukreti wrote:
> Add support for provision requests to loopback devices.
> Loop devices will configure provision support based on
> whether the underlying block device/file can support
> the provision request and upon receiving a provision bio,
> will map it to the backing device/storage. For loop devices
> over files, a REQ_OP_PROVISION request will translate to
> a fallocate mode 0 call on the backing file.
> 
> Signed-off-by: Sarthak Kukreti 
> ---
>  drivers/block/loop.c | 34 +++---
>  1 file changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index bc31bb7072a2..7fe1a6629754 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -311,16 +311,20 @@ static int lo_fallocate(struct loop_device *lo, struct 
> request *rq, loff_t pos,
>  {
>   /*
>* We use fallocate to manipulate the space mappings used by the image
> -  * a.k.a. discard/zerorange.
> +  * a.k.a. discard/provision/zerorange.
>*/
>   struct file *file = lo->lo_backing_file;
>   int ret;
>  
> - mode |= FALLOC_FL_KEEP_SIZE;
> + if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE) &&
> + !bdev_max_discard_sectors(lo->lo_device))
> + return -EOPNOTSUPP;
>  
> - if (!bdev_max_discard_sectors(lo->lo_device))
> + if (mode == 0 && !bdev_max_provision_sectors(lo->lo_device))
>   return -EOPNOTSUPP;
>  
> + mode |= FALLOC_FL_KEEP_SIZE;
> +
>   ret = file->f_op->fallocate(file, mode, pos, blk_rq_bytes(rq));
>   if (unlikely(ret && ret != -EINVAL && ret != -EOPNOTSUPP))
>   return -EIO;
> @@ -488,6 +492,8 @@ static int do_req_filebacked(struct loop_device *lo, 
> struct request *rq)
>   FALLOC_FL_PUNCH_HOLE);
>   case REQ_OP_DISCARD:
>   return lo_fallocate(lo, rq, pos, FALLOC_FL_PUNCH_HOLE);
> + case REQ_OP_PROVISION:
> + return lo_fallocate(lo, rq, pos, 0);

If someone calls fallocate(UNSHARE_RANGE) on a loop bdev, shouldn't
there be a way to pass that through to the fallocate call to the backing
file?

--D

>   case REQ_OP_WRITE:
>   if (cmd->use_aio)
>   return lo_rw_aio(lo, cmd, pos, ITER_SOURCE);
> @@ -754,6 +760,25 @@ static void loop_sysfs_exit(struct loop_device *lo)
>  _attribute_group);
>  }
>  
> +static void loop_config_provision(struct loop_device *lo)
> +{
> + struct file *file = lo->lo_backing_file;
> + struct inode *inode = file->f_mapping->host;
> +
> + /*
> +  * If the backing device is a block device, mirror its provisioning
> +  * capability.
> +  */
> + if (S_ISBLK(inode->i_mode)) {
> + blk_queue_max_provision_sectors(lo->lo_queue,
> + bdev_max_provision_sectors(I_BDEV(inode)));
> + } else if (file->f_op->fallocate) {
> + blk_queue_max_provision_sectors(lo->lo_queue, UINT_MAX >> 9);
> + } else {
> + blk_queue_max_provision_sectors(lo->lo_queue, 0);
> + }
> +}
> +
>  static void loop_config_discard(struct loop_device *lo)
>  {
>   struct file *file = lo->lo_backing_file;
> @@ -1092,6 +1117,7 @@ static int loop_configure(struct loop_device *lo, 
> fmode_t mode,
>   blk_queue_io_min(lo->lo_queue, bsize);
>  
>   loop_config_discard(lo);
> + loop_config_provision(lo);
>   loop_update_rotational(lo);
>   loop_update_dio(lo);
>   loop_sysfs_init(lo);
> @@ -1304,6 +1330,7 @@ loop_set_status(struct loop_device *lo, const struct 
> loop_info64 *info)
>   }
>  
>   loop_config_discard(lo);
> + loop_config_provision(lo);
>  
>   /* update dio if lo_offset or transfer is changed */
>   __loop_update_dio(lo, lo->use_dio);
> @@ -1830,6 +1857,7 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx 
> *hctx,
>   case REQ_OP_FLUSH:
>   case REQ_OP_DISCARD:
>   case REQ_OP_WRITE_ZEROES:
> + case REQ_OP_PROVISION:
>   cmd->use_aio = false;
>   break;
>   default:
> -- 
> 2.40.1.698.g37aff9b760-goog
> 




[dm-devel] [PATCH v11 6/9] nvmet: add copy command support for bdev and file ns

2023-05-22 Thread Nitesh Shetty
Add support for handling the nvme_cmd_copy command on the target.
For bdev-ns we call into blkdev_issue_copy, which the block layer
completes either by an offloaded copy request to the backend bdev or by
emulating the request.

For file-ns we call vfs_copy_file_range to service our request.

Currently the target always advertises copy capability by setting
NVME_CTRL_ONCS_COPY in the controller ONCS.

The loop target has copy support, which can be used to test copy offload.

Signed-off-by: Nitesh Shetty 
Signed-off-by: Anuj Gupta 
---
 drivers/nvme/target/admin-cmd.c   |  9 -
 drivers/nvme/target/io-cmd-bdev.c | 62 +++
 drivers/nvme/target/io-cmd-file.c | 52 ++
 drivers/nvme/target/loop.c|  6 +++
 drivers/nvme/target/nvmet.h   |  1 +
 5 files changed, 128 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 39cb570f833d..8e644b8ec0fd 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -433,8 +433,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req 
*req)
id->nn = cpu_to_le32(NVMET_MAX_NAMESPACES);
id->mnan = cpu_to_le32(NVMET_MAX_NAMESPACES);
id->oncs = cpu_to_le16(NVME_CTRL_ONCS_DSM |
-   NVME_CTRL_ONCS_WRITE_ZEROES);
-
+   NVME_CTRL_ONCS_WRITE_ZEROES | NVME_CTRL_ONCS_COPY);
/* XXX: don't report vwc if the underlying device is write through */
id->vwc = NVME_CTRL_VWC_PRESENT;
 
@@ -536,6 +535,12 @@ static void nvmet_execute_identify_ns(struct nvmet_req 
*req)
 
if (req->ns->bdev)
nvmet_bdev_set_limits(req->ns->bdev, id);
+   else {
+   id->msrc = (__force u8)to0based(BIO_MAX_VECS - 1);
+   id->mssrl = cpu_to_le16(BIO_MAX_VECS <<
+   (PAGE_SHIFT - SECTOR_SHIFT));
+   id->mcl = cpu_to_le32(le16_to_cpu(id->mssrl));
+   }
 
/*
 * We just provide a single LBA format that matches what the
diff --git a/drivers/nvme/target/io-cmd-bdev.c 
b/drivers/nvme/target/io-cmd-bdev.c
index c2d6cea0236b..d92dfe86c647 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -46,6 +46,18 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct 
nvme_id_ns *id)
id->npda = id->npdg;
/* NOWS = Namespace Optimal Write Size */
id->nows = to0based(bdev_io_opt(bdev) / bdev_logical_block_size(bdev));
+
+   if (bdev_max_copy_sectors(bdev)) {
+   id->msrc = id->msrc;
+   id->mssrl = cpu_to_le16((bdev_max_copy_sectors(bdev) <<
+   SECTOR_SHIFT) / bdev_logical_block_size(bdev));
+   id->mcl = cpu_to_le32((__force u32)id->mssrl);
+   } else {
+   id->msrc = (__force u8)to0based(BIO_MAX_VECS - 1);
+   id->mssrl = cpu_to_le16((BIO_MAX_VECS << PAGE_SHIFT) /
+   bdev_logical_block_size(bdev));
+   id->mcl = cpu_to_le32((__force u32)id->mssrl);
+   }
 }
 
 void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
@@ -184,6 +196,21 @@ static void nvmet_bio_done(struct bio *bio)
nvmet_req_bio_put(req, bio);
 }
 
+static void nvmet_bdev_copy_end_io(void *private, int comp_len)
+{
+   struct nvmet_req *req = (struct nvmet_req *)private;
+   u16 status;
+
+   if (comp_len == req->copy_len) {
+   req->cqe->result.u32 = cpu_to_le32(1);
+   status = errno_to_nvme_status(req, 0);
+   } else {
+   req->cqe->result.u32 = cpu_to_le32(0);
+   status = errno_to_nvme_status(req, (__force u16)BLK_STS_IOERR);
+   }
+   nvmet_req_complete(req, status);
+}
+
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 static int nvmet_bdev_alloc_bip(struct nvmet_req *req, struct bio *bio,
struct sg_mapping_iter *miter)
@@ -450,6 +477,37 @@ static void nvmet_bdev_execute_write_zeroes(struct 
nvmet_req *req)
}
 }
 
+/* At present we handle only one range entry, since copy offload is aligned 
with
+ * copy_file_range, only one entry is passed from block layer.
+ */
+static void nvmet_bdev_execute_copy(struct nvmet_req *req)
+{
+   struct nvme_copy_range range;
+   struct nvme_command *cmd = req->cmd;
+   int ret;
+   u16 status;
+
+   status = nvmet_copy_from_sgl(req, 0, , sizeof(range));
+   if (status)
+   goto out;
+
+   ret = blkdev_issue_copy(req->ns->bdev,
+   le64_to_cpu(cmd->copy.sdlba) << req->ns->blksize_shift,
+   req->ns->bdev,
+   le64_to_cpu(range.slba) << req->ns->blksize_shift,
+   (le16_to_cpu(range.nlb) + 1) << req->ns->blksize_shift,
+   nvmet_bdev_copy_end_io, (void *)req, GFP_KERNEL);
+   if (ret) {
+   req->cqe->result.u32 = cpu_to_le32(0);
+   status = blk_to_nvme_status(req, BLK_STS_IOERR);
+   goto 

Re: [dm-devel] [PATCH v11 1/9] block: Introduce queue limits for copy-offload support

2023-05-22 Thread Damien Le Moal
On 5/22/23 19:41, Nitesh Shetty wrote:
> Add device limits as sysfs entries,
> - copy_offload (RW)
> - copy_max_bytes (RW)
> - copy_max_bytes_hw (RO)
> 
> These limits help to split the copy payload in the block layer.
> copy_offload: used for setting copy offload (1) or emulation (0).
> copy_max_bytes: maximum total length of a copy in a single payload.
> copy_max_bytes_hw: reflects the device-supported maximum limit.
> 
> Reviewed-by: Hannes Reinecke 
> Signed-off-by: Nitesh Shetty 
> Signed-off-by: Kanchan Joshi 
> Signed-off-by: Anuj Gupta 
> ---
>  Documentation/ABI/stable/sysfs-block | 33 ++
>  block/blk-settings.c | 24 +++
>  block/blk-sysfs.c| 64 
>  include/linux/blkdev.h   | 12 ++
>  include/uapi/linux/fs.h  |  3 ++
>  5 files changed, 136 insertions(+)
> 
> diff --git a/Documentation/ABI/stable/sysfs-block 
> b/Documentation/ABI/stable/sysfs-block
> index c57e5b7cb532..e4d31132f77c 100644
> --- a/Documentation/ABI/stable/sysfs-block
> +++ b/Documentation/ABI/stable/sysfs-block
> @@ -155,6 +155,39 @@ Description:
>   last zone of the device which may be smaller.
>  
>  
> +What:/sys/block//queue/copy_offload
> +Date:April 2023
> +Contact: linux-bl...@vger.kernel.org
> +Description:
> + [RW] When read, this file shows whether offloading copy to a
> + device is enabled (1) or disabled (0). Writing '0' to this
> + file will disable offloading copies for this device.
> + Writing any '1' value will enable this feature. If the device
> + does not support offloading, then writing 1, will result in
> + error.

will result in an error.

> +
> +
> +What:/sys/block//queue/copy_max_bytes
> +Date:April 2023
> +Contact: linux-bl...@vger.kernel.org
> +Description:
> + [RW] This is the maximum number of bytes, that the block layer

Please drop the comma after bytes.

> + will allow for copy request. This will be smaller or equal to

will allow for a copy request. This value is always smaller...

> + the maximum size allowed by the hardware, indicated by
> + 'copy_max_bytes_hw'. Attempt to set value higher than

An attempt to set a value higher than...

> + 'copy_max_bytes_hw' will truncate this to 'copy_max_bytes_hw'.
> +
> +
> +What:/sys/block//queue/copy_max_bytes_hw
> +Date:April 2023
> +Contact: linux-bl...@vger.kernel.org
> +Description:
> + [RO] This is the maximum number of bytes, that the hardware

drop the comma after bytes

> + will allow in a single data copy request.

will allow for

> + A value of 0 means that the device does not support
> + copy offload.

Given that you do have copy emulation for devices that do not support hw
offload, how is the user supposed to know the maximum size of a copy request
when it is emulated? This is not obvious from looking at these parameters.

> +
> +
>  What:/sys/block//queue/crypto/
>  Date:February 2022
>  Contact: linux-bl...@vger.kernel.org
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 896b4654ab00..23aff2d4dcba 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -59,6 +59,8 @@ void blk_set_default_limits(struct queue_limits *lim)
>   lim->zoned = BLK_ZONED_NONE;
>   lim->zone_write_granularity = 0;
>   lim->dma_alignment = 511;
> + lim->max_copy_sectors_hw = 0;
> + lim->max_copy_sectors = 0;
>  }
>  
>  /**
> @@ -82,6 +84,8 @@ void blk_set_stacking_limits(struct queue_limits *lim)
>   lim->max_dev_sectors = UINT_MAX;
>   lim->max_write_zeroes_sectors = UINT_MAX;
>   lim->max_zone_append_sectors = UINT_MAX;
> + lim->max_copy_sectors_hw = ULONG_MAX;
> + lim->max_copy_sectors = ULONG_MAX;

UINT_MAX is not enough?

>  }
>  EXPORT_SYMBOL(blk_set_stacking_limits);
>  
> @@ -183,6 +187,22 @@ void blk_queue_max_discard_sectors(struct request_queue 
> *q,
>  }
>  EXPORT_SYMBOL(blk_queue_max_discard_sectors);
>  
> +/**
> + * blk_queue_max_copy_sectors_hw - set max sectors for a single copy payload
> + * @q:  the request queue for the device
> + * @max_copy_sectors: maximum number of sectors to copy
> + **/
> +void blk_queue_max_copy_sectors_hw(struct request_queue *q,
> + unsigned int max_copy_sectors)
> +{
> + if (max_copy_sectors > (COPY_MAX_BYTES >> SECTOR_SHIFT))
> + max_copy_sectors = COPY_MAX_BYTES >> SECTOR_SHIFT;
> +
> + q->limits.max_copy_sectors_hw = max_copy_sectors;
> + q->limits.max_copy_sectors = max_copy_sectors;
> +}
> +EXPORT_SYMBOL_GPL(blk_queue_max_copy_sectors_hw);
> +
>  /**
>   * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
>   * @q:  the request 

[dm-devel] [PATCH v11 9/9] null_blk: add support for copy offload

2023-05-22 Thread Nitesh Shetty
Implementation is based on the existing read and write infrastructure.
copy_max_bytes: a new configfs and module parameter is introduced, which
can be used to set the hardware/driver-supported maximum copy limit.

Suggested-by: Damien Le Moal 
Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Vincent Fu 
---
 drivers/block/null_blk/main.c | 108 --
 drivers/block/null_blk/null_blk.h |   8 +++
 2 files changed, 111 insertions(+), 5 deletions(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index b3fedafe301e..34e009b3ebd5 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -157,6 +157,10 @@ static int g_max_sectors;
 module_param_named(max_sectors, g_max_sectors, int, 0444);
 MODULE_PARM_DESC(max_sectors, "Maximum size of a command (in 512B sectors)");
 
+static unsigned long g_copy_max_bytes = COPY_MAX_BYTES;
+module_param_named(copy_max_bytes, g_copy_max_bytes, ulong, 0444);
+MODULE_PARM_DESC(copy_max_bytes, "Maximum size of a copy command (in bytes)");
+
 static unsigned int nr_devices = 1;
 module_param(nr_devices, uint, 0444);
 MODULE_PARM_DESC(nr_devices, "Number of devices to register");
@@ -409,6 +413,7 @@ NULLB_DEVICE_ATTR(home_node, uint, NULL);
 NULLB_DEVICE_ATTR(queue_mode, uint, NULL);
 NULLB_DEVICE_ATTR(blocksize, uint, NULL);
 NULLB_DEVICE_ATTR(max_sectors, uint, NULL);
+NULLB_DEVICE_ATTR(copy_max_bytes, uint, NULL);
 NULLB_DEVICE_ATTR(irqmode, uint, NULL);
 NULLB_DEVICE_ATTR(hw_queue_depth, uint, NULL);
 NULLB_DEVICE_ATTR(index, uint, NULL);
@@ -550,6 +555,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
_device_attr_queue_mode,
_device_attr_blocksize,
_device_attr_max_sectors,
+   _device_attr_copy_max_bytes,
_device_attr_irqmode,
_device_attr_hw_queue_depth,
_device_attr_index,
@@ -656,7 +662,8 @@ static ssize_t memb_group_features_show(struct config_item 
*item, char *page)
"poll_queues,power,queue_mode,shared_tag_bitmap,size,"
"submit_queues,use_per_node_hctx,virt_boundary,zoned,"
"zone_capacity,zone_max_active,zone_max_open,"
-   "zone_nr_conv,zone_offline,zone_readonly,zone_size\n");
+   "zone_nr_conv,zone_offline,zone_readonly,zone_size,"
+   "copy_max_bytes\n");
 }
 
 CONFIGFS_ATTR_RO(memb_group_, features);
@@ -722,6 +729,7 @@ static struct nullb_device *null_alloc_dev(void)
dev->queue_mode = g_queue_mode;
dev->blocksize = g_bs;
dev->max_sectors = g_max_sectors;
+   dev->copy_max_bytes = g_copy_max_bytes;
dev->irqmode = g_irqmode;
dev->hw_queue_depth = g_hw_queue_depth;
dev->blocking = g_blocking;
@@ -1271,6 +1279,78 @@ static int null_transfer(struct nullb *nullb, struct 
page *page,
return err;
 }
 
+static inline void nullb_setup_copy_read(struct nullb *nullb, struct bio *bio)
+{
+   struct nullb_copy_token *token = bvec_kmap_local(>bi_io_vec[0]);
+
+   token->subsys = "nullb";
+   token->sector_in = bio->bi_iter.bi_sector;
+   token->nullb = nullb;
+   token->sectors = bio->bi_iter.bi_size >> SECTOR_SHIFT;
+}
+
+static inline int nullb_setup_copy_write(struct nullb *nullb,
+   struct bio *bio, bool is_fua)
+{
+   struct nullb_copy_token *token = bvec_kmap_local(>bi_io_vec[0]);
+   sector_t sector_in, sector_out;
+   void *in, *out;
+   size_t rem, temp;
+   unsigned long offset_in, offset_out;
+   struct nullb_page *t_page_in, *t_page_out;
+   int ret = -EIO;
+
+   if (unlikely(memcmp(token->subsys, "nullb", 5)))
+   return -EINVAL;
+   if (unlikely(token->nullb != nullb))
+   return -EINVAL;
+   if (WARN_ON(token->sectors != bio->bi_iter.bi_size >> SECTOR_SHIFT))
+   return -EINVAL;
+
+   sector_in = token->sector_in;
+   sector_out = bio->bi_iter.bi_sector;
+   rem = token->sectors << SECTOR_SHIFT;
+
+   spin_lock_irq(>lock);
+   while (rem > 0) {
+   temp = min_t(size_t, nullb->dev->blocksize, rem);
+   offset_in = (sector_in & SECTOR_MASK) << SECTOR_SHIFT;
+   offset_out = (sector_out & SECTOR_MASK) << SECTOR_SHIFT;
+
+   if (null_cache_active(nullb) && !is_fua)
+   null_make_cache_space(nullb, PAGE_SIZE);
+
+   t_page_in = null_lookup_page(nullb, sector_in, false,
+   !null_cache_active(nullb));
+   if (!t_page_in)
+   goto err;
+   t_page_out = null_insert_page(nullb, sector_out,
+   !null_cache_active(nullb) || is_fua);
+   if (!t_page_out)
+   goto err;
+
+   in = kmap_local_page(t_page_in->page);
+   out = kmap_local_page(t_page_out->page);
+
+   memcpy(out + 

[dm-devel] [PATCH v11 7/9] dm: Add support for copy offload

2023-05-22 Thread Nitesh Shetty
Before enabling copy for a dm target, check if the underlying devices
and the dm target support copy. Avoid splits happening inside the dm
target: fail early if the request needs a split, as splitting copy
requests is currently not supported.

Signed-off-by: Nitesh Shetty 
---
 drivers/md/dm-table.c | 41 +++
 drivers/md/dm.c   |  7 ++
 include/linux/device-mapper.h |  5 +
 3 files changed, 53 insertions(+)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 1398f1d6e83e..b3269271e761 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1867,6 +1867,39 @@ static bool dm_table_supports_nowait(struct dm_table *t)
return true;
 }
 
+static int device_not_copy_capable(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *data)
+{
+   struct request_queue *q = bdev_get_queue(dev->bdev);
+
+   return !blk_queue_copy(q);
+}
+
+static bool dm_table_supports_copy(struct dm_table *t)
+{
+   struct dm_target *ti;
+   unsigned int i;
+
+   for (i = 0; i < t->num_targets; i++) {
+   ti = dm_table_get_target(t, i);
+
+   if (!ti->copy_offload_supported)
+   return false;
+
+   /*
+* target provides copy support (as implied by setting
+* 'copy_offload_supported')
+* and it relies on _all_ data devices having copy support.
+*/
+   if (!ti->type->iterate_devices ||
+ti->type->iterate_devices(ti,
+device_not_copy_capable, NULL))
+   return false;
+   }
+
+   return true;
+}
+
 static int device_not_discard_capable(struct dm_target *ti, struct dm_dev *dev,
  sector_t start, sector_t len, void *data)
 {
@@ -1949,6 +1982,14 @@ int dm_table_set_restrictions(struct dm_table *t, struct 
request_queue *q,
q->limits.discard_misaligned = 0;
}
 
+   if (!dm_table_supports_copy(t)) {
+   blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
+   q->limits.max_copy_sectors = 0;
+   q->limits.max_copy_sectors_hw = 0;
+   } else {
+   blk_queue_flag_set(QUEUE_FLAG_COPY, q);
+   }
+
if (!dm_table_supports_secure_erase(t))
q->limits.max_secure_erase_sectors = 0;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3b694ba3a106..ab9069090a7d 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1720,6 +1720,13 @@ static blk_status_t __split_and_process_bio(struct 
clone_info *ci)
if (unlikely(ci->is_abnormal_io))
return __process_abnormal_io(ci, ti);
 
+   if ((unlikely(op_is_copy(ci->bio->bi_opf)) &&
+   max_io_len(ti, ci->sector) < ci->sector_count)) {
+   DMERR("Error, IO size(%u) > max target size(%llu)\n",
+   ci->sector_count, max_io_len(ti, ci->sector));
+   return BLK_STS_IOERR;
+   }
+
/*
 * Only support bio polling for normal IO, and the target io is
 * exactly inside the dm_io instance (verified in dm_poll_dm_io)
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index a52d2b9a6846..04016bd76e73 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -398,6 +398,11 @@ struct dm_target {
 * bio_set_dev(). NOTE: ideally a target should _not_ need this.
 */
bool needs_bio_set_dev:1;
+
+   /*
+* copy offload is supported
+*/
+   bool copy_offload_supported:1;
 };
 
 void *dm_per_bio_data(struct bio *bio, size_t data_size);
-- 
2.35.1.500.gb896f729e2




[dm-devel] [PATCH v11 1/9] block: Introduce queue limits for copy-offload support

2023-05-22 Thread Nitesh Shetty
Add device limits as sysfs entries,
- copy_offload (RW)
- copy_max_bytes (RW)
- copy_max_bytes_hw (RO)

These limits help to split the copy payload in the block layer.
copy_offload: used for setting copy offload (1) or emulation (0).
copy_max_bytes: maximum total length of a copy in a single payload.
copy_max_bytes_hw: reflects the device-supported maximum limit.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Kanchan Joshi 
Signed-off-by: Anuj Gupta 
---
 Documentation/ABI/stable/sysfs-block | 33 ++
 block/blk-settings.c | 24 +++
 block/blk-sysfs.c| 64 
 include/linux/blkdev.h   | 12 ++
 include/uapi/linux/fs.h  |  3 ++
 5 files changed, 136 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block 
b/Documentation/ABI/stable/sysfs-block
index c57e5b7cb532..e4d31132f77c 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -155,6 +155,39 @@ Description:
last zone of the device which may be smaller.
 
 
+What:  /sys/block//queue/copy_offload
+Date:  April 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RW] When read, this file shows whether offloading copy to a
+   device is enabled (1) or disabled (0). Writing '0' to this
+   file will disable offloading copies for this device.
+   Writing any '1' value will enable this feature. If the device
+   does not support offloading, then writing 1, will result in
+   error.
+
+
+What:  /sys/block//queue/copy_max_bytes
+Date:  April 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RW] This is the maximum number of bytes, that the block layer
+   will allow for copy request. This will be smaller or equal to
+   the maximum size allowed by the hardware, indicated by
+   'copy_max_bytes_hw'. Attempt to set value higher than
+   'copy_max_bytes_hw' will truncate this to 'copy_max_bytes_hw'.
+
+
+What:  /sys/block//queue/copy_max_bytes_hw
+Date:  April 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RO] This is the maximum number of bytes, that the hardware
+   will allow in a single data copy request.
+   A value of 0 means that the device does not support
+   copy offload.
+
+
 What:  /sys/block//queue/crypto/
 Date:  February 2022
 Contact:   linux-bl...@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 896b4654ab00..23aff2d4dcba 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -59,6 +59,8 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->zoned = BLK_ZONED_NONE;
lim->zone_write_granularity = 0;
lim->dma_alignment = 511;
+   lim->max_copy_sectors_hw = 0;
+   lim->max_copy_sectors = 0;
 }
 
 /**
@@ -82,6 +84,8 @@ void blk_set_stacking_limits(struct queue_limits *lim)
lim->max_dev_sectors = UINT_MAX;
lim->max_write_zeroes_sectors = UINT_MAX;
lim->max_zone_append_sectors = UINT_MAX;
+   lim->max_copy_sectors_hw = ULONG_MAX;
+   lim->max_copy_sectors = ULONG_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -183,6 +187,22 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_discard_sectors);
 
+/**
+ * blk_queue_max_copy_sectors_hw - set max sectors for a single copy payload
+ * @q:  the request queue for the device
+ * @max_copy_sectors: maximum number of sectors to copy
+ **/
+void blk_queue_max_copy_sectors_hw(struct request_queue *q,
+   unsigned int max_copy_sectors)
+{
+   if (max_copy_sectors > (COPY_MAX_BYTES >> SECTOR_SHIFT))
+   max_copy_sectors = COPY_MAX_BYTES >> SECTOR_SHIFT;
+
+   q->limits.max_copy_sectors_hw = max_copy_sectors;
+   q->limits.max_copy_sectors = max_copy_sectors;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_copy_sectors_hw);
+
 /**
  * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
  * @q:  the request queue for the device
@@ -578,6 +598,10 @@ int blk_stack_limits(struct queue_limits *t, struct 
queue_limits *b,
t->max_segment_size = min_not_zero(t->max_segment_size,
   b->max_segment_size);
 
+   t->max_copy_sectors = min(t->max_copy_sectors, b->max_copy_sectors);
+   t->max_copy_sectors_hw = min(t->max_copy_sectors_hw,
+   b->max_copy_sectors_hw);
+
t->misaligned |= b->misaligned;
 
alignment = queue_limit_alignment_offset(b, start);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index a64208583853..826ab29beba3 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -212,6 +212,63 @@ 

[dm-devel] [PATCH v11 4/9] fs, block: copy_file_range for def_blk_ops for direct block device

2023-05-22 Thread Nitesh Shetty
For a block device opened with O_DIRECT, use copy_file_range to issue
device copy offload, and fall back to generic_copy_file_range in case
the device copy offload capability is absent.
Modify checks to allow bdevs to use copy_file_range.

Suggested-by: Ming Lei 
Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
---
 block/blk-lib.c| 23 +++
 block/fops.c   | 20 
 fs/read_write.c| 11 +--
 include/linux/blkdev.h |  3 +++
 mm/filemap.c   | 11 ---
 5 files changed, 63 insertions(+), 5 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index ba32545eb8d5..7d6ef85692a6 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -523,6 +523,29 @@ int blkdev_issue_copy(struct block_device *bdev_in, loff_t 
pos_in,
 }
 EXPORT_SYMBOL_GPL(blkdev_issue_copy);
 
+/* Returns the length of bytes copied */
+int blkdev_copy_offload(struct block_device *bdev_in, loff_t pos_in,
+ struct block_device *bdev_out, loff_t pos_out, size_t len,
+ gfp_t gfp_mask)
+{
+   struct request_queue *in_q = bdev_get_queue(bdev_in);
+   struct request_queue *out_q = bdev_get_queue(bdev_out);
+   int ret = 0;
+
+   if (blkdev_copy_sanity_check(bdev_in, pos_in, bdev_out, pos_out, len))
+   return 0;
+
+   if (blk_queue_copy(in_q) && blk_queue_copy(out_q)) {
+   ret = __blkdev_copy_offload(bdev_in, pos_in, bdev_out, pos_out,
+   len, NULL, NULL, gfp_mask);
+   if (ret < 0)
+   return 0;
+   }
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(blkdev_copy_offload);
+
 static int __blkdev_issue_write_zeroes(struct block_device *bdev,
sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
struct bio **biop, unsigned flags)
diff --git a/block/fops.c b/block/fops.c
index ab750e8a040f..df8985675ed1 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -614,6 +614,25 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct 
iov_iter *to)
return ret;
 }
 
+static ssize_t blkdev_copy_file_range(struct file *file_in, loff_t pos_in,
+   struct file *file_out, loff_t pos_out,
+   size_t len, unsigned int flags)
+{
+   struct block_device *in_bdev = I_BDEV(bdev_file_inode(file_in));
+   struct block_device *out_bdev = I_BDEV(bdev_file_inode(file_out));
+   int comp_len = 0;
+
+   if ((file_in->f_iocb_flags & IOCB_DIRECT) &&
+   (file_out->f_iocb_flags & IOCB_DIRECT))
+   comp_len = blkdev_copy_offload(in_bdev, pos_in, out_bdev,
+pos_out, len, GFP_KERNEL);
+   if (comp_len != len)
+   comp_len = generic_copy_file_range(file_in, pos_in + comp_len,
+   file_out, pos_out + comp_len, len - comp_len, flags);
+
+   return comp_len;
+}
+
#define BLKDEV_FALLOC_FL_SUPPORTED \
(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |   \
 FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
@@ -697,6 +716,7 @@ const struct file_operations def_blk_fops = {
.splice_read= generic_file_splice_read,
.splice_write   = iter_file_splice_write,
.fallocate  = blkdev_fallocate,
+   .copy_file_range = blkdev_copy_file_range,
 };
 
 static __init int blkdev_init(void)
diff --git a/fs/read_write.c b/fs/read_write.c
index a21ba3be7dbe..47e848fcfd42 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 #include 
@@ -1447,7 +1448,11 @@ static int generic_copy_file_checks(struct file 
*file_in, loff_t pos_in,
return -EOVERFLOW;
 
/* Shorten the copy to EOF */
-   size_in = i_size_read(inode_in);
+   if (S_ISBLK(inode_in->i_mode))
+   size_in = bdev_nr_bytes(I_BDEV(file_in->f_mapping->host));
+   else
+   size_in = i_size_read(inode_in);
+
if (pos_in >= size_in)
count = 0;
else
@@ -1708,7 +1713,9 @@ int generic_file_rw_checks(struct file *file_in, struct 
file *file_out)
/* Don't copy dirs, pipes, sockets... */
if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
return -EISDIR;
-   if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+
+   if ((!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode)) &&
+   (!S_ISBLK(inode_in->i_mode) || !S_ISBLK(inode_out->i_mode)))
return -EINVAL;
 
if (!(file_in->f_mode & FMODE_READ) ||
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a95c26faa8b6..a9bb7e3a8c79 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1054,6 +1054,9 @@ int blkdev_issue_secure_erase(struct block_device *bdev, 
sector_t sector,
 int 

[dm-devel] [PATCH v11 5/9] nvme: add copy offload support

2023-05-22 Thread Nitesh Shetty
For devices supporting native copy, the nvme driver receives read and
write requests with the BLK_COPY op flag.
For a read request the nvme driver populates the payload with the source
information.
For a write request the driver converts it to an nvme copy command using
the source information in the payload and submits it to the device.
The current design only supports a single source range.
This design is courtesy of Mikulas Patocka's token-based copy.

Add trace event support for nvme_copy_cmd.
Set the device copy limits to the queue limits.

Signed-off-by: Kanchan Joshi 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Javier González 
Signed-off-by: Anuj Gupta 
---
 drivers/nvme/host/constants.c |   1 +
 drivers/nvme/host/core.c  | 103 +-
 drivers/nvme/host/fc.c|   5 ++
 drivers/nvme/host/nvme.h  |   7 +++
 drivers/nvme/host/pci.c   |  27 -
 drivers/nvme/host/rdma.c  |   7 +++
 drivers/nvme/host/tcp.c   |  16 ++
 drivers/nvme/host/trace.c |  19 +++
 include/linux/nvme.h  |  43 +-
 9 files changed, 220 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
index bc523ca02254..01be882b726f 100644
--- a/drivers/nvme/host/constants.c
+++ b/drivers/nvme/host/constants.c
@@ -19,6 +19,7 @@ static const char * const nvme_ops[] = {
[nvme_cmd_resv_report] = "Reservation Report",
[nvme_cmd_resv_acquire] = "Reservation Acquire",
[nvme_cmd_resv_release] = "Reservation Release",
+   [nvme_cmd_copy] = "Copy Offload",
[nvme_cmd_zone_mgmt_send] = "Zone Management Send",
[nvme_cmd_zone_mgmt_recv] = "Zone Management Receive",
[nvme_cmd_zone_append] = "Zone Management Append",
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ccb6eb1282f8..aef7b59dbd61 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -754,6 +754,77 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
 }
 
+static inline void nvme_setup_copy_read(struct nvme_ns *ns, struct request 
*req)
+{
+   struct bio *bio = req->bio;
+   struct nvme_copy_token *token = bvec_kmap_local(>bi_io_vec[0]);
+
+   token->subsys = "nvme";
+   token->ns = ns;
+   token->src_sector = bio->bi_iter.bi_sector;
+   token->sectors = bio->bi_iter.bi_size >> 9;
+}
+
+static inline blk_status_t nvme_setup_copy_write(struct nvme_ns *ns,
+  struct request *req, struct nvme_command *cmnd)
+{
+   struct nvme_copy_range *range = NULL;
+   struct bio *bio = req->bio;
+   struct nvme_copy_token *token = bvec_kmap_local(>bi_io_vec[0]);
+   sector_t src_sector, dst_sector, n_sectors;
+   u64 src_lba, dst_lba, n_lba;
+   unsigned short nr_range = 1;
+   u16 control = 0;
+
+   if (unlikely(memcmp(token->subsys, "nvme", 4)))
+   return BLK_STS_NOTSUPP;
+   if (unlikely(token->ns != ns))
+   return BLK_STS_NOTSUPP;
+
+   src_sector = token->src_sector;
+   dst_sector = bio->bi_iter.bi_sector;
+   n_sectors = token->sectors;
+   if (WARN_ON(n_sectors != bio->bi_iter.bi_size >> 9))
+   return BLK_STS_NOTSUPP;
+
+   src_lba = nvme_sect_to_lba(ns, src_sector);
+   dst_lba = nvme_sect_to_lba(ns, dst_sector);
+   n_lba = nvme_sect_to_lba(ns, n_sectors);
+
+   if (WARN_ON(!n_lba))
+   return BLK_STS_NOTSUPP;
+
+   if (req->cmd_flags & REQ_FUA)
+   control |= NVME_RW_FUA;
+
+   if (req->cmd_flags & REQ_FAILFAST_DEV)
+   control |= NVME_RW_LR;
+
+   memset(cmnd, 0, sizeof(*cmnd));
+   cmnd->copy.opcode = nvme_cmd_copy;
+   cmnd->copy.nsid = cpu_to_le32(ns->head->ns_id);
+   cmnd->copy.sdlba = cpu_to_le64(dst_lba);
+
+   range = kmalloc_array(nr_range, sizeof(*range),
+   GFP_ATOMIC | __GFP_NOWARN);
+   if (!range)
+   return BLK_STS_RESOURCE;
+
+   range[0].slba = cpu_to_le64(src_lba);
+   range[0].nlb = cpu_to_le16(n_lba - 1);
+
+   cmnd->copy.nr_range = 0;
+
+   req->special_vec.bv_page = virt_to_page(range);
+   req->special_vec.bv_offset = offset_in_page(range);
+   req->special_vec.bv_len = sizeof(*range) * nr_range;
+   req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+   cmnd->copy.control = cpu_to_le16(control);
+
+   return BLK_STS_OK;
+}
+
 static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd)
 {
@@ -988,10 +1059,16 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct 
request *req)
ret = nvme_setup_discard(ns, req, cmd);
break;
case REQ_OP_READ:
-   ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
+   if (unlikely(req->cmd_flags & REQ_COPY))
+   nvme_setup_copy_read(ns, req);
+   else
+ 

Re: [dm-devel] [PATCH] 11-dm-mpath.rules: fix warnings reported by udevadm verify

2023-05-22 Thread Dmitry V. Levin
On Mon, May 22, 2023 at 12:19:04PM +0200, Martin Wilck wrote:
> On Sat, 2023-05-20 at 01:33 +0300, Dmitry V. Levin wrote:
> > Fix the following warnings reported by udevadm verify:
> > 
> > multipath/11-dm-mpath.rules:18 Whitespace after comma is expected.
> > multipath/11-dm-mpath.rules:73 Whitespace after comma is expected.
> > multipath/11-dm-mpath.rules:73 Whitespace after comma is expected.
> > multipath/11-dm-mpath.rules:78 Whitespace after comma is expected.
> > multipath/11-dm-mpath.rules:78 Whitespace after comma is expected.
> > multipath/11-dm-mpath.rules:78 Whitespace after comma is expected.
> > multipath/11-dm-mpath.rules: udev rules check failed
> > 
> > Signed-off-by: Dmitry V. Levin 
> 
> Maybe you should have mentioned that you've just invented this syntax
> rule yourself (https://github.com/systemd/systemd/pull/26980).
> I see no requirement for adding whitespace after a comma in the udev
> man page. 
> 
> Is this an attempt to change the udev rule syntax retroactively?

As you probably know, udevd silently accepts a much broader syntax; for
example, it needs neither a comma nor whitespace between KEY=VALUE
expressions, and I doubt this will ever change in the future.

In contrast, `udevadm verify` is a tool that checks syntactic, semantic,
and style correctness of udev rules files.  It indeed expects whitespace
after a comma in udev rules - a style most existing udev rules follow.

> Furthermore, there is actually whitespace after the comma in the code
> this patch changes, it just happens to be at the beginning of the
> following line, which your syntax check ignores.

When I saw the parser of udev rules used by udevd for the first time,
I was also surprised to find out that it discards all leading whitespace
regardless of line continuations.  As a result, that whitespace is not
visible to the syntax check at all.  So yes, you are literally correct,
there is whitespace there, but most existing udev rules add whitespace
between a comma and a backslash.


-- 
ldv



[dm-devel] [PATCH v11 0/9] Implement copy offload support

2023-05-22 Thread Nitesh Shetty
The patch series covers the points discussed in the past and most
recently at LSFMM'23[0].
We have covered the initially agreed requirements in this patchset, plus
additional features suggested by the community.
The patchset borrows Mikulas's token-based approach for the two-bdev
implementation.

This is the next iteration of our previous patchset, v10[1].

Overall series supports:

1. Driver
- NVMe Copy command (single NS, TP 4065), including support
in nvme-target (for block and file backend).

2. Block layer
- Block-generic copy (REQ_COPY flag), with interface
accommodating two block-devs
- Emulation, for in-kernel users when offload is natively
absent
- dm-linear support (for cases not requiring split)

3. User-interface
- copy_file_range

Testing
===
Copy offload can be tested on:
a. QEMU: NVMe simple copy (TP 4065), by setting the nvme-ns
parameters mssrl, mcl, and msrc. For more info see [2].
b. Null block device
c. NVMe Fabrics loopback.
d. blktests[3] (tests block/034-037, nvme/050-053)

Emulation can be tested on any device.

fio[4] can also be used for testing.

Infra and plumbing:
===
We populate the copy_file_range callback in def_blk_fops.
For devices that support copy offload, blkdev_copy_offload is used to
achieve an in-device copy.
However, for cases where the device doesn't support offload,
we fall back to generic_copy_file_range.
For in-kernel users (like NVMe fabrics), we use blkdev_issue_copy,
which implements its own emulation, as an fd is not available.
Checks in generic_copy_file_range are modified to support block devices.
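
A minimal sketch of this dispatch, assuming the helpers introduced by this
series (bdev_max_copy_sectors, blkdev_copy_offload) and simplifying away
partial-copy handling; the wrapper name and exact signatures below are
illustrative, not the patch code:

static ssize_t blkdev_copy_file_range(struct file *file_in, loff_t pos_in,
				      struct file *file_out, loff_t pos_out,
				      size_t len, unsigned int flags)
{
	/* Block-device inodes back these files, so I_BDEV() yields the bdevs. */
	struct block_device *bdev_in = I_BDEV(file_in->f_mapping->host);
	struct block_device *bdev_out = I_BDEV(file_out->f_mapping->host);
	ssize_t copied = 0;

	/* Try the device's native copy offload when both sides support it. */
	if (bdev_max_copy_sectors(bdev_in) && bdev_max_copy_sectors(bdev_out))
		copied = blkdev_copy_offload(bdev_in, pos_in, bdev_out,
					     pos_out, len, NULL, NULL,
					     GFP_KERNEL);

	/* No offload available, or the offload attempt failed entirely:
	 * fall back to the generic (read + write) path.
	 */
	if (copied <= 0)
		copied = generic_copy_file_range(file_in, pos_in, file_out,
						 pos_out, len, flags);

	return copied;
}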

Performance:

The major benefit of this copy-offload/emulation framework is
observed in fabrics setups, for copy workloads across the network.
The host sends the offload command over the network, and the actual copy
can be achieved using emulation on the target.
This results in higher performance and lower network consumption,
compared to reads and writes travelling across the network.
With the async design of copy offload/emulation, we see the
following improvements compared to userspace read + write on an
NVMeOF TCP setup:

Setup1: Network Speed: 1000Mb/s
Host PC: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
Target PC: AMD Ryzen 9 5900X 12-Core Processor
block size 8k:
710% improvement in IO BW (108 MiB/s to 876 MiB/s).
Network utilisation drops from  97% to 15%.
block-size 1M:
2532% improvement in IO BW (101 MiB/s to 2659 MiB/s).
Network utilisation drops from 89% to 0.62%.

Setup2: Network Speed: 100Gb/s
Server: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz, 72 cores
(host and target have the same configuration)
block-size 8k:
17.5% improvement in IO BW (794 MiB/s to 933 MiB/s).
Network utilisation drops from  6.75% to 0.16%.

Blktests[3]
==
tests/block/034,035: Runs copy offload and emulation on block
  device.
tests/block/036,037: Runs copy offload and emulation on null
  block device.
tests/nvme/050-053: Create a loop-backed fabrics device and
  run copy offload and emulation.

Future Work
===
- loopback device copy offload support
- upstream fio to use copy offload
- upstream blktest to use copy offload

These are to be taken up after this minimal series is agreed upon.

Additional links:
=
[0] 
https://lore.kernel.org/linux-nvme/CA+1E3rJ7BZ7LjQXXTdX+-0Edz=zt14mmpgmivczugb33c60...@mail.gmail.com/

https://lore.kernel.org/linux-nvme/f0e19ae4-b37a-e9a3-2be7-a5afb334a...@nvidia.com/

https://lore.kernel.org/linux-nvme/20230113094648.15614-1-nj.she...@samsung.com/
[1] 
https://lore.kernel.org/all/20230419114320.13674-1-nj.she...@samsung.com/
[2] 
https://qemu-project.gitlab.io/qemu/system/devices/nvme.html#simple-copy
[3] https://github.com/nitesh-shetty/blktests/tree/feat/copy_offload/v11
[4] https://github.com/vincentkfu/fio/commits/copyoffload-3.34-v11

Changes since v10:
=
- NVMeOF: optimization in NVMe fabrics (Chaitanya Kulkarni)
- NVMeOF: sparse warnings (kernel test robot)

Changes since v9:
=
- null_blk, improved documentation, minor fixes (Chaitanya Kulkarni)
- fio, expanded testing and minor fixes (Vincent Fu)

Changes since v8:
=
- null_blk, copy_max_bytes_hw is made a configfs parameter
  (Damien Le Moal)
- Negative error handling in copy_file_range (Christian Brauner)
 

[dm-devel] [PATCH v11 3/9] block: add emulation for copy

2023-05-22 Thread Nitesh Shetty
For devices which do not support copy, copy emulation is added.
It is required for in-kernel users like fabrics, where a file descriptor is
not available and hence they can't use copy_file_range.
Copy emulation is implemented by reading from the source into memory and
writing to the corresponding destination asynchronously.
Emulation is also used if copy offload fails or completes partially.

Signed-off-by: Nitesh Shetty 
Signed-off-by: Vincent Fu 
Signed-off-by: Anuj Gupta 
---
 block/blk-lib.c| 175 -
 block/blk-map.c|   4 +-
 include/linux/blkdev.h |   3 +
 3 files changed, 179 insertions(+), 3 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index ed089e703cb1..ba32545eb8d5 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -295,6 +295,172 @@ static int __blkdev_copy_offload(struct block_device 
*bdev_in, loff_t pos_in,
return blkdev_copy_wait_completion(cio);
 }
 
+static void *blkdev_copy_alloc_buf(sector_t req_size, sector_t *alloc_size,
+   gfp_t gfp_mask)
+{
+   int min_size = PAGE_SIZE;
+   void *buf;
+
+   while (req_size >= min_size) {
+   buf = kvmalloc(req_size, gfp_mask);
+   if (buf) {
+   *alloc_size = req_size;
+   return buf;
+   }
+   /* retry half the requested size */
+   req_size >>= 1;
+   }
+
+   return NULL;
+}
+
+static void blkdev_copy_emulate_write_endio(struct bio *bio)
+{
+   struct copy_ctx *ctx = bio->bi_private;
+   struct cio *cio = ctx->cio;
+   sector_t clen;
+
+   if (bio->bi_status) {
+   clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) - cio->pos_out;
+   cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+   }
+   kvfree(page_address(bio->bi_io_vec[0].bv_page));
+   bio_map_kern_endio(bio);
+   kfree(ctx);
+   if (atomic_dec_and_test(&cio->refcount)) {
+   if (cio->endio) {
+   cio->endio(cio->private, cio->comp_len);
+   kfree(cio);
+   } else
+   blk_wake_io_task(cio->waiter);
+   }
+}
+
+static void blkdev_copy_emulate_read_endio(struct bio *read_bio)
+{
+   struct copy_ctx *ctx = read_bio->bi_private;
+   struct cio *cio = ctx->cio;
+   sector_t clen;
+
+   if (read_bio->bi_status) {
+   clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+   cio->pos_in;
+   cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+   __free_page(read_bio->bi_io_vec[0].bv_page);
+   bio_map_kern_endio(read_bio);
+   kfree(ctx);
+
+   if (atomic_dec_and_test(&cio->refcount)) {
+   if (cio->endio) {
+   cio->endio(cio->private, cio->comp_len);
+   kfree(cio);
+   } else
+   blk_wake_io_task(cio->waiter);
+   }
+   }
+   schedule_work(&ctx->dispatch_work);
+   kfree(read_bio);
+}
+
+/*
+ * If native copy offload feature is absent, this function tries to emulate,
+ * by copying data from source to a temporary buffer and from buffer to
+ * destination device.
+ * Returns the length of bytes copied or error if encountered
+ */
+static int __blkdev_copy_emulate(struct block_device *bdev_in, loff_t pos_in,
+ struct block_device *bdev_out, loff_t pos_out, size_t len,
+ cio_iodone_t endio, void *private, gfp_t gfp_mask)
+{
+   struct request_queue *in = bdev_get_queue(bdev_in);
+   struct request_queue *out = bdev_get_queue(bdev_out);
+   struct bio *read_bio, *write_bio;
+   void *buf = NULL;
+   struct copy_ctx *ctx;
+   struct cio *cio;
+   sector_t buf_len, req_len, rem = 0;
+   sector_t max_src_hw_len = min_t(unsigned int,
+   queue_max_hw_sectors(in),
+   queue_max_segments(in) << (PAGE_SHIFT - SECTOR_SHIFT))
+   << SECTOR_SHIFT;
+   sector_t max_dst_hw_len = min_t(unsigned int,
+   queue_max_hw_sectors(out),
+   queue_max_segments(out) << (PAGE_SHIFT - SECTOR_SHIFT))
+   << SECTOR_SHIFT;
+   sector_t max_hw_len = min_t(unsigned int,
+   max_src_hw_len, max_dst_hw_len);
+
+   cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+   if (!cio)
+   return -ENOMEM;
+   atomic_set(&cio->refcount, 0);
+   cio->pos_in = pos_in;
+   cio->pos_out = pos_out;
+   cio->waiter = current;
+   cio->endio = endio;
+   cio->private = private;
+
+   for (rem = len; rem > 0; rem -= buf_len) {
+   req_len = min_t(int, max_hw_len, rem);
+
+   buf = blkdev_copy_alloc_buf(req_len, &buf_len, gfp_mask);
+   if 

[dm-devel] [PATCH v11 8/9] dm: Enable copy offload for dm-linear target

2023-05-22 Thread Nitesh Shetty
Setting copy_offload_supported flag to enable offload.

Signed-off-by: Nitesh Shetty 
---
 drivers/md/dm-linear.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index f4448d520ee9..1d1ee30bbefb 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -62,6 +62,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int 
argc, char **argv)
ti->num_discard_bios = 1;
ti->num_secure_erase_bios = 1;
ti->num_write_zeroes_bios = 1;
+   ti->copy_offload_supported = 1;
ti->private = lc;
return 0;
 
-- 
2.35.1.500.gb896f729e2




[dm-devel] [PATCH v11 2/9] block: Add copy offload support infrastructure

2023-05-22 Thread Nitesh Shetty
Introduce blkdev_issue_copy, which takes arguments similar to
copy_file_range and performs copy offload between two bdevs.
Introduce the REQ_COPY copy offload operation flag. A read-write
bio pair with a token as payload is created and submitted to the device
in order. The read request populates the token with source-specific
information, which is then passed along with the write request.
This design is courtesy of Mikulas Patocka's token-based copy approach.

Larger copies will be split based on the max_copy_sectors limit.

Signed-off-by: Nitesh Shetty 
Signed-off-by: Anuj Gupta 
---
 block/blk-lib.c   | 235 ++
 block/blk.h   |   2 +
 include/linux/blk_types.h |  25 
 include/linux/blkdev.h|   3 +
 4 files changed, 265 insertions(+)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index e59c3069e835..ed089e703cb1 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -115,6 +115,241 @@ int blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_discard);
 
+/*
+ * For synchronous copy offload/emulation, wait and process all in-flight BIOs.
+ * This must only be called once all bios have been issued so that the refcount
+ * can only decrease. This just waits for all bios to make it through
+ * blkdev_copy_write_endio.
+ */
+static int blkdev_copy_wait_completion(struct cio *cio)
+{
+   int ret;
+
+   if (cio->endio)
+   return 0;
+
+   if (atomic_read(&cio->refcount)) {
+   __set_current_state(TASK_UNINTERRUPTIBLE);
+   blk_io_schedule();
+   }
+
+   ret = cio->comp_len;
+   kfree(cio);
+
+   return ret;
+}
+
+static void blkdev_copy_offload_write_endio(struct bio *bio)
+{
+   struct copy_ctx *ctx = bio->bi_private;
+   struct cio *cio = ctx->cio;
+   sector_t clen;
+
+   if (bio->bi_status) {
+   clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) - cio->pos_out;
+   cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+   }
+   __free_page(bio->bi_io_vec[0].bv_page);
+   bio_put(bio);
+
+   kfree(ctx);
+   if (!atomic_dec_and_test(&cio->refcount))
+   return;
+   if (cio->endio) {
+   cio->endio(cio->private, cio->comp_len);
+   kfree(cio);
+   } else
+   blk_wake_io_task(cio->waiter);
+}
+
+static void blkdev_copy_offload_read_endio(struct bio *read_bio)
+{
+   struct copy_ctx *ctx = read_bio->bi_private;
+   struct cio *cio = ctx->cio;
+   sector_t clen;
+
+   if (read_bio->bi_status) {
+   clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT)
+   - cio->pos_in;
+   cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+   __free_page(read_bio->bi_io_vec[0].bv_page);
+   bio_put(ctx->write_bio);
+   bio_put(read_bio);
+   kfree(ctx);
+   if (atomic_dec_and_test(&cio->refcount)) {
+   if (cio->endio) {
+   cio->endio(cio->private, cio->comp_len);
+   kfree(cio);
+   } else
+   blk_wake_io_task(cio->waiter);
+   }
+   return;
+   }
+
+   schedule_work(&ctx->dispatch_work);
+   bio_put(read_bio);
+}
+
+static void blkdev_copy_dispatch_work(struct work_struct *work)
+{
+   struct copy_ctx *ctx = container_of(work, struct copy_ctx,
+   dispatch_work);
+
+   submit_bio(ctx->write_bio);
+}
+
+/*
+ * __blkdev_copy_offload - Use the device's native copy offload feature.
+ * We perform the copy operation by sending 2 bios:
+ * 1. First we send a read bio with the REQ_COPY flag along with a token,
+ * the source and the length. Once the read bio reaches the driver layer,
+ * the device driver adds all the source info to the token and does a fake
+ * completion.
+ * 2. Once the read operation completes, we issue a write with the REQ_COPY
+ * flag and the same token. In the driver layer, the token info is used to
+ * form a copy offload command.
+ *
+ * Returns the number of bytes copied or an error if encountered
+ */
+static int __blkdev_copy_offload(struct block_device *bdev_in, loff_t pos_in,
+   struct block_device *bdev_out, loff_t pos_out, size_t len,
+   cio_iodone_t endio, void *private, gfp_t gfp_mask)
+{
+   struct cio *cio;
+   struct copy_ctx *ctx;
+   struct bio *read_bio, *write_bio;
+   struct page *token;
+   sector_t copy_len;
+   sector_t rem, max_copy_len;
+
+   cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+   if (!cio)
+   return -ENOMEM;
+   atomic_set(&cio->refcount, 0);
+   cio->waiter = current;
+   cio->endio = endio;
+   cio->private = private;
+
+   max_copy_len = min(bdev_max_copy_sectors(bdev_in),
+   bdev_max_copy_sectors(bdev_out)) << SECTOR_SHIFT;
+
+   cio->pos_in = pos_in;
+   cio->pos_out = pos_out;
+ 

Re: [dm-devel] [PATCH] 11-dm-mpath.rules: fix warnings reported by udevadm verify

2023-05-22 Thread Martin Wilck
On Mon, 2023-05-22 at 13:49 +0300, Dmitry V. Levin wrote:
> On Mon, May 22, 2023 at 12:19:04PM +0200, Martin Wilck wrote:
> > On Sat, 2023-05-20 at 01:33 +0300, Dmitry V. Levin wrote:
> > > Fix the following warnings reported by udevadm verify:
> > > 
> > > multipath/11-dm-mpath.rules:18 Whitespace after comma is
> > > expected.
> > > multipath/11-dm-mpath.rules:73 Whitespace after comma is
> > > expected.
> > > multipath/11-dm-mpath.rules:73 Whitespace after comma is
> > > expected.
> > > multipath/11-dm-mpath.rules:78 Whitespace after comma is
> > > expected.
> > > multipath/11-dm-mpath.rules:78 Whitespace after comma is
> > > expected.
> > > multipath/11-dm-mpath.rules:78 Whitespace after comma is
> > > expected.
> > > multipath/11-dm-mpath.rules: udev rules check failed
> > > 
> > > Signed-off-by: Dmitry V. Levin 
> > 
> > Maybe you should have mentioned that you've just invented this
> > syntax
> > rule yourself (https://github.com/systemd/systemd/pull/26980).
> > I see no requirement for adding whitespace after a comma in the
> > udev
> > man page. 
> > 
> > Is this an attempt to change the udev rule syntax retroactively?
> 
> As you probably know, udevd silently accepts a much broader syntax; for
> example, it requires neither a comma nor whitespace between KEY=VALUE
> expressions, and I doubt this will ever change in the future.

Ok, that answers what I was asking. Thanks. Btw I did not know that
commas could be left out, I found out while I looked into this issue.

> In contrast, `udevadm verify` is a tool that checks syntactic,
> semantic,
> and style correctness of udev rules files.  

So this is about style and readability. Fair enough. It would have been
nice to mention that in the commit message.

> It indeed expects whitespace
> after a comma in udev rules - a style that most existing udev rules
> follow.

The multipath-tools rules do, too; they just don't add whitespace
between the comma and the line continuation marker '\'.

> but most existing udev rules add whitespace
> between a comma and a backslash.

I see. I'll apply this patch then (and the other one with the missing
comma, too), but unless you object, I'll add a note to the
commit message explaining that this is for improving readability
and coding style compliance. I want to avoid the impression that the
existing code is technically wrong, which it isn't.

Reviewed-by: Martin Wilck 




Re: [dm-devel] [PATCH] 11-dm-mpath.rules: fix warnings reported by udevadm verify

2023-05-22 Thread Martin Wilck
On Sat, 2023-05-20 at 01:33 +0300, Dmitry V. Levin wrote:
> Fix the following warnings reported by udevadm verify:
> 
> multipath/11-dm-mpath.rules:18 Whitespace after comma is expected.
> multipath/11-dm-mpath.rules:73 Whitespace after comma is expected.
> multipath/11-dm-mpath.rules:73 Whitespace after comma is expected.
> multipath/11-dm-mpath.rules:78 Whitespace after comma is expected.
> multipath/11-dm-mpath.rules:78 Whitespace after comma is expected.
> multipath/11-dm-mpath.rules:78 Whitespace after comma is expected.
> multipath/11-dm-mpath.rules: udev rules check failed
> 
> Signed-off-by: Dmitry V. Levin 

Maybe you should have mentioned that you've just invented this syntax
rule yourself (https://github.com/systemd/systemd/pull/26980).
I see no requirement for adding whitespace after a comma in the udev
man page. 

Is this an attempt to change the udev rule syntax retroactively?

Furthermore, there is actually whitespace after the comma in the code
this patch changes, it just happens to be at the beginning of the
following line, which your syntax check ignores.

Regards,
Martin




[dm-devel] [RFC] dm overlaybd: targets mapping OverlayBD image

2023-05-22 Thread Du Rui
OverlayBD is a novel layered block-level image format, designed for
containers and secure containers and applicable to virtual machines,
published at USENIX ATC '20:
https://www.usenix.org/system/files/atc20-li-huiba.pdf

OverlayBD already has a ContainerD non-core sub-project implementation
in userspace, as an accelerated container image service
https://github.com/containerd/accelerated-container-image

It can be much more efficient to do the decompression and mapping work
in the kernel within the device-mapper framework, in many circumstances
such as secure container runtimes, mobile devices, etc.

This patch contains a module, dm-overlaybd, which provides two kinds of
targets, dm-zfile and dm-lsmt, to expose a group of block-devices containing
an OverlayBD image as an overlaid read-only block-device.

Signed-off-by: Du Rui 
---
 .../device-mapper/dm-overlaybd.rst|  71 +++
 drivers/md/Kconfig|   2 +
 drivers/md/Makefile   |   1 +
 drivers/md/overlaybd/Kconfig  |  37 ++
 drivers/md/overlaybd/Makefile |   4 +
 drivers/md/overlaybd/dm-lsmt.c| 162 +
 drivers/md/overlaybd/dm-lsmtformat.c  | 575 ++
 drivers/md/overlaybd/dm-ovbd-blkfile.c| 134 
 drivers/md/overlaybd/dm-ovbd.c|  46 ++
 drivers/md/overlaybd/dm-ovbd.h|  45 ++
 drivers/md/overlaybd/dm-zfile.c   | 154 +
 drivers/md/overlaybd/dm-zfileformat.c | 455 ++
 12 files changed, 1686 insertions(+)
 create mode 100644 Documentation/admin-guide/device-mapper/dm-overlaybd.rst
 create mode 100644 drivers/md/overlaybd/Kconfig
 create mode 100644 drivers/md/overlaybd/Makefile
 create mode 100644 drivers/md/overlaybd/dm-lsmt.c
 create mode 100644 drivers/md/overlaybd/dm-lsmtformat.c
 create mode 100644 drivers/md/overlaybd/dm-ovbd-blkfile.c
 create mode 100644 drivers/md/overlaybd/dm-ovbd.c
 create mode 100644 drivers/md/overlaybd/dm-ovbd.h
 create mode 100644 drivers/md/overlaybd/dm-zfile.c
 create mode 100644 drivers/md/overlaybd/dm-zfileformat.c

diff --git a/Documentation/admin-guide/device-mapper/dm-overlaybd.rst 
b/Documentation/admin-guide/device-mapper/dm-overlaybd.rst
new file mode 100644
index ..ad48cc7b57c7
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-overlaybd.rst
@@ -0,0 +1,71 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+
+dm-overlaybd
+============
+
+The device-mapper OverlayBD (dm-overlaybd) module allows merging block-devices
+that contain OverlayBD format layers into a read-only OverlayBD block device.
+
+OverlayBD is a block-device based container image system, built on the
+randomly-readable compressed format ZFile and the multi-layer overlay
+block-device format LSMTFile.
+
+The following targets are provided by dm-overlaybd
+--------------------------------------------------
+
+- dm-zfile
+- dm-lsmt
+
+
+dm-zfile Table parameters
+-------------------------
+    <block-device path> <ZFile data length>
+
+Parameters:
+
+<block-device path>:
+Full pathname to the underlying block-device.
+<ZFile data length>:
+ZFile data length, in unit of byte.
+
+Examples:
+
+ZFile format data lying on /dev/vda, with a ZFile length of 658971539 bytes;
+the table should look like:
+
+zfile_decompressed /dev/vdb 658971539
+
+
+dm-lsmt Table parameters
+------------------------
+    <base block-device path> <data length>
+    [<upper block-device path> <data length>]
+
+Mandatory parameters:
+
+<base block-device path>:
+Full pathname to the underlying block-device, usually a mapped
+ZFile device.
+<data length>:
+ZFile data length, in unit of byte.
+
+
+Optional parameter:
+
+<upper block-device path> <data length>:
+Multi-layer LSMTFiles can be overlaid as a stack, just like container
+images.
+The upper layers are also described in the table as optional parameters.
+Each layer should give the full pathname of the underlying block-device,
+and a file size describing the LSMTFile data length, in unit of byte.
+
+
+Examples:
+
+The base layer is mapped by dm-zfile at /dev/mapper/lsmt_base, and one upper
+layer is /dev/mapper/lsmt_upper, which describes changes from the base layer.
+Assume that the base layer LSMTFile length is 1,006,923,776 bytes, and the
+upper layer is 682,206,208 bytes.
+
+merged /dev/mapper/lsmt_base 1006923776 /dev/mapper/lsmt_upper 682206208
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index b0a22e99bade..4ae4fb628712 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -656,4 +656,6 @@ config DM_AUDIT
  Enables audit logging of several security relevant events in the
  particular device-mapper targets, especially the integrity target.
 
+source "drivers/md/overlaybd/Kconfig"
+
 endif # MD
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 84291e38dca8..01b6da500a6a 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -84,6 +84,7 @@ obj-$(CONFIG_DM_INTEGRITY)+= dm-integrity.o
 obj-$(CONFIG_DM_ZONED) += dm-zoned.o
 obj-$(CONFIG_DM_WRITECACHE)+= dm-writecache.o
 obj-$(CONFIG_SECURITY_LOADPIN_VERITY)  += 

[dm-devel] [PATCH] 11-dm-mpath.rules: fix warnings reported by udevadm verify

2023-05-22 Thread Dmitry V. Levin
Fix the following warnings reported by udevadm verify:

multipath/11-dm-mpath.rules:18 Whitespace after comma is expected.
multipath/11-dm-mpath.rules:73 Whitespace after comma is expected.
multipath/11-dm-mpath.rules:73 Whitespace after comma is expected.
multipath/11-dm-mpath.rules:78 Whitespace after comma is expected.
multipath/11-dm-mpath.rules:78 Whitespace after comma is expected.
multipath/11-dm-mpath.rules:78 Whitespace after comma is expected.
multipath/11-dm-mpath.rules: udev rules check failed

Signed-off-by: Dmitry V. Levin 
---
 multipath/11-dm-mpath.rules | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/multipath/11-dm-mpath.rules b/multipath/11-dm-mpath.rules
index d191ae8d..c339f521 100644
--- a/multipath/11-dm-mpath.rules
+++ b/multipath/11-dm-mpath.rules
@@ -14,7 +14,7 @@ ENV{.MPATH_DEVICE_READY_OLD}="$env{MPATH_DEVICE_READY}"
 # multipath sets DM_SUBSYSTEM_UDEV_FLAG2 when it reloads a
 # table with no active devices. If this happens, mark the
 # device not ready
-ENV{DM_SUBSYSTEM_UDEV_FLAG2}=="1", ENV{MPATH_DEVICE_READY}="0",\
+ENV{DM_SUBSYSTEM_UDEV_FLAG2}=="1", ENV{MPATH_DEVICE_READY}="0", \
GOTO="mpath_action"
 
 # If the last path has failed mark the device not ready
@@ -68,13 +68,13 @@ ENV{MPATH_DEVICE_READY}=="0", ENV{DM_NOSCAN}="1"
 # Also skip all foreign rules if no path is available.
 # Remember the original value of DM_DISABLE_OTHER_RULES_FLAG
 # and restore it back once we have at least one path available.
-ENV{MPATH_DEVICE_READY}=="0", ENV{.MPATH_DEVICE_READY_OLD}=="1",\
-   ENV{DM_DISABLE_OTHER_RULES_FLAG_OLD}=="",\
+ENV{MPATH_DEVICE_READY}=="0", ENV{.MPATH_DEVICE_READY_OLD}=="1", \
+   ENV{DM_DISABLE_OTHER_RULES_FLAG_OLD}=="", \

ENV{DM_DISABLE_OTHER_RULES_FLAG_OLD}="$env{DM_UDEV_DISABLE_OTHER_RULES_FLAG}"
 ENV{MPATH_DEVICE_READY}=="0", ENV{DM_UDEV_DISABLE_OTHER_RULES_FLAG}="1"
-ENV{MPATH_DEVICE_READY}!="0", ENV{.MPATH_DEVICE_READY_OLD}=="0",\
-   
ENV{DM_UDEV_DISABLE_OTHER_RULES_FLAG}="$env{DM_DISABLE_OTHER_RULES_FLAG_OLD}",\
-   ENV{DM_DISABLE_OTHER_RULES_FLAG_OLD}="",\
+ENV{MPATH_DEVICE_READY}!="0", ENV{.MPATH_DEVICE_READY_OLD}=="0", \
+   
ENV{DM_UDEV_DISABLE_OTHER_RULES_FLAG}="$env{DM_DISABLE_OTHER_RULES_FLAG_OLD}", \
+   ENV{DM_DISABLE_OTHER_RULES_FLAG_OLD}="", \
ENV{DM_ACTIVATION}="1", ENV{MPATH_UNCHANGED}="0"
 
 # The code to check multipath state ends here. We need to set
-- 
ldv
