Re: [RFC PATCH v5 2/4] block: add simple copy support

2021-04-14 Thread Selva Jove
I agree with you. Will remove BLKDEV_COPY_NOEMULATION.

On Tue, Apr 13, 2021 at 6:03 AM Damien Le Moal  wrote:
>
> On 2021/04/12 23:35, Selva Jove wrote:
> > On Mon, Apr 12, 2021 at 5:55 AM Damien Le Moal  
> > wrote:
> >>
> >> On 2021/04/07 20:33, Selva Jove wrote:
> >>> Initially I started moving the dm-kcopyd interface to the block layer
> >>> as a generic interface.
> >>> Once I dig deeper in dm-kcopyd code, I figured that dm-kcopyd is
> >>> tightly coupled with dm_io()
> >>>
> >>> To move dm-kcopyd to block layer, it would also require dm_io code to
> >>> be moved to block layer.
> >>> It would cause havoc in dm layer, as it is the backbone of the
> >>> dm-layer and needs complete
> >>> rewriting of dm-layer. Do you see any other way of doing this without
> >>> having to move dm_io code
> >>> or to have redundant code ?
> >>
> >> Right. Missed that. So reusing dm-kcopyd and making it a common interface 
> >> will
> >> take some more efforts. OK, then. For the first round of commits, let's 
> >> forget
> >> about this. But I still think that your emulation could be a lot better 
> >> than a
> >> loop doing blocking writes after blocking reads.
> >>
> >
> > Current implementation issues read asynchronously and once all the reads are
> > completed, then the write is issued as whole to reduce the IO traffic
> > in the queue.
> > I agree that things can be better. Will explore another approach of
> > sending writes
> > immediately once reads are completed and with  plugging to increase the 
> > chances
> > of merging.
> >
> >> [...]
> >>>>> +int blkdev_issue_copy(struct block_device *src_bdev, int nr_srcs,
> >>>>> + struct range_entry *src_rlist, struct block_device 
> >>>>> *dest_bdev,
> >>>>> + sector_t dest, gfp_t gfp_mask, int flags)
> >>>>> +{
> >>>>> + struct request_queue *q = bdev_get_queue(src_bdev);
> >>>>> + struct request_queue *dest_q = bdev_get_queue(dest_bdev);
> >>>>> + struct blk_copy_payload *payload;
> >>>>> + sector_t bs_mask, copy_size;
> >>>>> + int ret;
> >>>>> +
> >>>>> + ret = blk_prepare_payload(src_bdev, nr_srcs, src_rlist, gfp_mask,
> >>>>> + , _size);
> >>>>> + if (ret)
> >>>>> + return ret;
> >>>>> +
> >>>>> + bs_mask = (bdev_logical_block_size(dest_bdev) >> 9) - 1;
> >>>>> + if (dest & bs_mask) {
> >>>>> + return -EINVAL;
> >>>>> + goto out;
> >>>>> + }
> >>>>> +
> >>>>> + if (q == dest_q && q->limits.copy_offload) {
> >>>>> + ret = blk_copy_offload(src_bdev, payload, dest, gfp_mask);
> >>>>> + if (ret)
> >>>>> + goto out;
> >>>>> + } else if (flags & BLKDEV_COPY_NOEMULATION) {
> >>>>
> >>>> Why ? whoever calls blkdev_issue_copy() wants a copy to be done. Why 
> >>>> would that
> >>>> user say "Fail on me if the device does not support copy" ??? This is a 
> >>>> weird
> >>>> interface in my opinion.
> >>>>
> >>>
> >>> BLKDEV_COPY_NOEMULATION flag was introduced to allow blkdev_issue_copy() 
> >>> callers
> >>> to use their native copying method instead of the emulated copy that I
> >>> added. This way we
> >>> ensure that dm uses the hw-assisted copy and if that is not present,
> >>> it falls back to existing
> >>> copy method.
> >>>
> >>> The other users who don't have their native emulation can use this
> >>> emulated-copy implementation.
> >>
> >> I do not understand. Emulation or not should be entirely driven by the 
> >> device
> >> reporting support for simple copy (or not). It does not matter which 
> >> component
> >> is issuing the simple copy call: an FS to a real device, and FS to a DM 
> >> device
> >> or a DM target driver. If the underlying device reported support for simple
> >> copy, use that. Otherwise, emulate with read/write. What am I missing here 
> >> ?
> >>
> >
> > blkdev_issue_copy() api will generally complete the copy-operation,
> > either by using
> > offloaded-copy or by using emulated-copy. The caller of the api is not
> > required to
> > figure the type of support. However, it can opt out of emulated-copy
> > by specifying
> > the flag BLKDEV_NOEMULATION. This is helpful for the case when the
> > caller already
> > has got a sophisticated emulation (e.g. dm-kcopyd users).
>
> This does not make any sense to me. If the user has already another mean of
> doing copies, then that user will not call blkdev_issue_copy(). So I really do
> not understand what the "opting out of emulated copy" would be useful for. 
> That
> user can check the simple copy support glag in the device request queue and 
> act
> accordingly: use its own block copy code when simple copy is not supported or
> use blkdev_issue_copy() when the device has simple copy. Adding that
> BLKDEV_COPY_NOEMULATION does not serve any purpose at all.
>
>
>
> --
> Damien Le Moal
> Western Digital Research


Re: [RFC PATCH v5 2/4] block: add simple copy support

2021-04-12 Thread Selva Jove
On Mon, Apr 12, 2021 at 5:55 AM Damien Le Moal  wrote:
>
> On 2021/04/07 20:33, Selva Jove wrote:
> > Initially I started moving the dm-kcopyd interface to the block layer
> > as a generic interface.
> > Once I dig deeper in dm-kcopyd code, I figured that dm-kcopyd is
> > tightly coupled with dm_io()
> >
> > To move dm-kcopyd to block layer, it would also require dm_io code to
> > be moved to block layer.
> > It would cause havoc in dm layer, as it is the backbone of the
> > dm-layer and needs complete
> > rewriting of dm-layer. Do you see any other way of doing this without
> > having to move dm_io code
> > or to have redundant code ?
>
> Right. Missed that. So reusing dm-kcopyd and making it a common interface will
> take some more efforts. OK, then. For the first round of commits, let's forget
> about this. But I still think that your emulation could be a lot better than a
> loop doing blocking writes after blocking reads.
>

Current implementation issues read asynchronously and once all the reads are
completed, then the write is issued as whole to reduce the IO traffic
in the queue.
I agree that things can be better. Will explore another approach of
sending writes
immediately once reads are completed and with  plugging to increase the chances
of merging.

> [...]
> >>> +int blkdev_issue_copy(struct block_device *src_bdev, int nr_srcs,
> >>> + struct range_entry *src_rlist, struct block_device 
> >>> *dest_bdev,
> >>> + sector_t dest, gfp_t gfp_mask, int flags)
> >>> +{
> >>> + struct request_queue *q = bdev_get_queue(src_bdev);
> >>> + struct request_queue *dest_q = bdev_get_queue(dest_bdev);
> >>> + struct blk_copy_payload *payload;
> >>> + sector_t bs_mask, copy_size;
> >>> + int ret;
> >>> +
> >>> + ret = blk_prepare_payload(src_bdev, nr_srcs, src_rlist, gfp_mask,
> >>> + , _size);
> >>> + if (ret)
> >>> + return ret;
> >>> +
> >>> + bs_mask = (bdev_logical_block_size(dest_bdev) >> 9) - 1;
> >>> + if (dest & bs_mask) {
> >>> + return -EINVAL;
> >>> + goto out;
> >>> + }
> >>> +
> >>> + if (q == dest_q && q->limits.copy_offload) {
> >>> + ret = blk_copy_offload(src_bdev, payload, dest, gfp_mask);
> >>> + if (ret)
> >>> + goto out;
> >>> + } else if (flags & BLKDEV_COPY_NOEMULATION) {
> >>
> >> Why ? whoever calls blkdev_issue_copy() wants a copy to be done. Why would 
> >> that
> >> user say "Fail on me if the device does not support copy" ??? This is a 
> >> weird
> >> interface in my opinion.
> >>
> >
> > BLKDEV_COPY_NOEMULATION flag was introduced to allow blkdev_issue_copy() 
> > callers
> > to use their native copying method instead of the emulated copy that I
> > added. This way we
> > ensure that dm uses the hw-assisted copy and if that is not present,
> > it falls back to existing
> > copy method.
> >
> > The other users who don't have their native emulation can use this
> > emulated-copy implementation.
>
> I do not understand. Emulation or not should be entirely driven by the device
> reporting support for simple copy (or not). It does not matter which component
> is issuing the simple copy call: an FS to a real device, and FS to a DM device
> or a DM target driver. If the underlying device reported support for simple
> copy, use that. Otherwise, emulate with read/write. What am I missing here ?
>

blkdev_issue_copy() api will generally complete the copy-operation,
either by using
offloaded-copy or by using emulated-copy. The caller of the api is not
required to
figure the type of support. However, it can opt out of emulated-copy
by specifying
the flag BLKDEV_NOEMULATION. This is helpful for the case when the
caller already
has got a sophisticated emulation (e.g. dm-kcopyd users).

>
> [...]
> >>> @@ -565,6 +569,12 @@ int blk_stack_limits(struct queue_limits *t, struct 
> >>> queue_limits *b,
> >>>   if (b->chunk_sectors)
> >>>   t->chunk_sectors = gcd(t->chunk_sectors, b->chunk_sectors);
> >>>
> >>> + /* simple copy not supported in stacked devices */
> >>> + t->copy_offload = 0;
> >>> + t->max_copy_sectors = 0;
> >>> + t->max_copy_range_sectors = 0;
> >>> + t->max_copy_nr_ranges = 0;
> >>
> >> You do not need this. Limits not explicitely initialized are 0 already.
> >> But I do not see why you can't support copy on stacked devices. That 
> >> should be
> >> feasible taking the min() for each of the above limit.
> >>
> >
> > Disabling stacked device support was feedback from v2.
> >
> > https://patchwork.kernel.org/project/linux-block/patch/20201204094659.12732-2-selvakuma...@samsung.com/
>
> Right. But the initialization to 0 is still not needed. The fields are already
> initialized to 0.
>
>
> --
> Damien Le Moal
> Western Digital Research


Re: [RFC PATCH v5 2/4] block: add simple copy support

2021-04-07 Thread Selva Jove
Initially I started moving the dm-kcopyd interface to the block layer
as a generic interface.
Once I dig deeper in dm-kcopyd code, I figured that dm-kcopyd is
tightly coupled with dm_io()

To move dm-kcopyd to block layer, it would also require dm_io code to
be moved to block layer.
It would cause havoc in dm layer, as it is the backbone of the
dm-layer and needs complete
rewriting of dm-layer. Do you see any other way of doing this without
having to move dm_io code
or to have redundant code ?


On Sat, Feb 20, 2021 at 10:29 AM Damien Le Moal  wrote:
>
> On 2021/02/20 11:01, SelvaKumar S wrote:
> > Add new BLKCOPY ioctl that offloads copying of one or more sources
> > ranges to a destination in the device. Accepts a 'copy_range' structure
> > that contains destination (in sectors), no of sources and pointer to the
> > array of source ranges. Each source range is represented by 'range_entry'
> > that contains start and length of source ranges (in sectors).
> >
> > Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
> > bio with control information as payload and submit to the device.
> > REQ_OP_COPY(19) is a write op and takes zone_write_lock when submitted
> > to zoned device.
> >
> > If the device doesn't support copy or copy offload is disabled, then
> > copy operation is emulated by default. However, the copy-emulation is an
> > opt-in feature. Caller can choose not to use the copy-emulation by
> > specifying a flag 'BLKDEV_COPY_NOEMULATION'.
> >
> > Copy-emulation is implemented by allocating memory of total copy size.
> > The source ranges are read into memory by chaining bio for each source
> > ranges and submitting them async and the last bio waits for completion.
> > After data is read, it is written to the destination.
> >
> > bio_map_kern() is used to allocate bio and add pages of copy buffer to
> > bio. As bio->bi_private and bio->bi_end_io are needed for chaining the
> > bio and gets over-written, invalidate_kernel_vmap_range() for read is
> > called in the caller.
> >
> > Introduce queue limits for simple copy and other helper functions.
> > Add device limits as sysfs entries.
> >   - copy_offload
> >   - max_copy_sectors
> >   - max_copy_ranges_sectors
> >   - max_copy_nr_ranges
> >
> > copy_offload(= 0) is disabled by default. This needs to be enabled if
> > copy-offload needs to be used.
> > max_copy_sectors = 0 indicates the device doesn't support native copy.
> >
> > Native copy offload is not supported for stacked devices and is done via
> > copy emulation.
> >
> > Signed-off-by: SelvaKumar S 
> > Signed-off-by: Kanchan Joshi 
> > Signed-off-by: Nitesh Shetty 
> > Signed-off-by: Javier González 
> > Signed-off-by: Chaitanya Kulkarni 
> > ---
> >  block/blk-core.c  | 102 --
> >  block/blk-lib.c   | 222 ++
> >  block/blk-merge.c |   2 +
> >  block/blk-settings.c  |  10 ++
> >  block/blk-sysfs.c |  47 
> >  block/blk-zoned.c |   1 +
> >  block/bounce.c|   1 +
> >  block/ioctl.c |  33 ++
> >  include/linux/bio.h   |   1 +
> >  include/linux/blk_types.h |  14 +++
> >  include/linux/blkdev.h|  15 +++
> >  include/uapi/linux/fs.h   |  13 +++
> >  12 files changed, 453 insertions(+), 8 deletions(-)
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 7663a9b94b80..23e646e5ae43 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -720,6 +720,17 @@ static noinline int should_fail_bio(struct bio *bio)
> >  }
> >  ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO);
> >
> > +static inline int bio_check_copy_eod(struct bio *bio, sector_t start,
> > + sector_t nr_sectors, sector_t max_sect)
> > +{
> > + if (nr_sectors && max_sect &&
> > + (nr_sectors > max_sect || start > max_sect - nr_sectors)) {
> > + handle_bad_sector(bio, max_sect);
> > + return -EIO;
> > + }
> > + return 0;
> > +}
> > +
> >  /*
> >   * Check whether this bio extends beyond the end of the device or 
> > partition.
> >   * This may well happen - the kernel calls bread() without checking the 
> > size of
> > @@ -738,6 +749,75 @@ static inline int bio_check_eod(struct bio *bio, 
> > sector_t maxsector)
> >   return 0;
> >  }
> >
> > +/*
> > + * Check for copy limits and remap source ranges if needed.
> > + */
> > +static int blk_check_copy(struct bio *bio)
> > +{
> > + struct blk_copy_payload *payload = bio_data(bio);
> > + struct request_queue *q = bio->bi_disk->queue;
> > + sector_t max_sect, start_sect, copy_size = 0;
> > + sector_t src_max_sect, src_start_sect;
> > + struct block_device *bd_part;
> > + int i, ret = -EIO;
> > +
> > + rcu_read_lock();
> > +
> > + bd_part = __disk_get_part(bio->bi_disk, bio->bi_partno);
> > + if (unlikely(!bd_part)) {
> > + rcu_read_unlock();
> > + goto out;
> > + }
> > +
> > + 

Re: [RFC PATCH v5 0/4] add simple copy support

2021-02-23 Thread Selva Jove
Dave,

copy_file_range() is work under progress.  FALLOC_FL_UNSHARE of fallocate()
use case sounds interesting. I will try to address both of them in the
next series.

Adding SCSI_XCOPY() support is not in the scope of this patchset. However
blkdev_issue_copy() interface is made generic so that it is possible to extend
to cross device XCOPY in future.


Thanks,
Selva


Re: [RFC PATCH v5 0/4] add simple copy support

2021-02-23 Thread Selva Jove
Thanks Su Yue. I'll update the link in the next series.

On Mon, Feb 22, 2021 at 12:23 PM Su Yue  wrote:
>
>
> On Fri 19 Feb 2021 at 20:45, SelvaKumar S
>  wrote:
>
> > This patchset tries to add support for TP4065a ("Simple Copy
> > Command"),
> > v2020.05.04 ("Ratified")
> >
> > The Specification can be found in following link.
> > https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
> >
>
> 404 not found.
> Should it be
> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs.zip
> ?
>
> > Simple copy command is a copy offloading operation and is  used
> > to copy
> > multiple contiguous ranges (source_ranges) of LBA's to a single
> > destination
> > LBA within the device reducing traffic between host and device.
> >
> > This implementation doesn't add native copy offload support for
> > stacked
> > devices rather copy offload is done through emulation. Possible
> > use
> > cases are F2FS gc and BTRFS relocation/balance.
> >
> > *blkdev_issue_copy* takes source bdev, no of sources, array of
> > source
> > ranges (in sectors), destination bdev and destination offset(in
> > sectors).
> > If both source and destination block devices are same and
> > copy_offload = 1,
> > then copy is done through native copy offloading. Copy emulation
> > is used
> > in other cases.
> >
> > As SCSI XCOPY can take two different block devices and no of
> > source range is
> > equal to 1, this interface can be extended in future to support
> > SCSI XCOPY.
> >
> > For devices supporting native simple copy, attach the control
> > information
> > as payload to the bio and submit to the device. For devices
> > without native
> > copy support, copy emulation is done by reading each source
> > range into memory
> > and writing it to the destination. Caller can choose not to try
> > emulation if copy offload is not supported by setting
> > BLKDEV_COPY_NOEMULATION flag.
> >
> > Following limits are added to queue limits and are exposed in
> > sysfs
> > to userspace
> >   - *copy_offload* controls copy_offload. set 0 to disable copy
> >   offload, 1 to enable native copy offloading support.
> >   - *max_copy_sectors* limits the sum of all source_range length
> >   - *max_copy_nr_ranges* limits the number of source ranges
> >   - *max_copy_range_sectors* limit the maximum number of sectors
> >   that can constitute a single source range.
> >
> >   max_copy_sectors = 0 indicates the device doesn't support copy
> > offloading.
> >
> >   *copy offload* sysfs entry is configurable and can be used
> > toggle
> > between emulation and native support depending upon the usecase.
> >
> > Changes from v4
> >
> > 1. Extend dm-kcopyd to leverage copy-offload, while copying
> > within the
> > same device. The other approach was to have copy-emulation by
> > moving
> > dm-kcopyd to block layer. But it also required moving core dm-io
> > infra,
> > causing a massive churn across multiple dm-targets.
> >
> > 2. Remove export in bio_map_kern()
> > 3. Change copy_offload sysfs to accept 0 or else
> > 4. Rename copy support flag to QUEUE_FLAG_SIMPLE_COPY
> > 5. Rename payload entries, add source bdev field to be used
> > while
> > partition remapping, remove copy_size
> > 6. Change the blkdev_issue_copy() interface to accept
> > destination and
> > source values in sector rather in bytes
> > 7. Add payload to bio using bio_map_kern() for copy_offload case
> > 8. Add check to return error if one of the source range length
> > is 0
> > 9. Add BLKDEV_COPY_NOEMULATION flag to allow user to not try
> > copy
> > emulation incase of copy offload is not supported. Caller can
> > his use
> > his existing copying logic to complete the io.
> > 10. Bug fix copy checks and reduce size of rcu_lock()
> >
> > Planned for next:
> > - adding blktests
> > - handling larger (than device limits) copy
> > - decide on ioctl interface (man-page etc.)
> >
> > Changes from v3
> >
> > 1. gfp_flag fixes.
> > 2. Export bio_map_kern() and use it to allocate and add pages to
> > bio.
> > 3. Move copy offload, reading to buf, writing from buf to
> > separate functions.
> > 4. Send read bio of copy offload by chaining them and submit
> > asynchronously.
> > 5. Add gendisk->part0 and part->bd_start_sect changes to
> > blk_check_copy().
> > 6. Move single source range limit check to blk_check_copy()
> > 7. Rename __blkdev_issue_copy() to blkdev_issue_copy and remove
> > old helper.
> > 8. Change blkdev_issue_copy() interface generic to accepts
> > destination bdev
> >   to support XCOPY as well.
> > 9. Add invalidate_kernel_vmap_range() after reading data for
> > vmalloc'ed memory.
> > 10. Fix buf allocoation logic to allocate buffer for the total
> > size of copy.
> > 11. Reword patch commit description.
> >
> > Changes from v2
> >
> > 1. Add emulation support for devices not supporting copy.
> > 2. Add *copy_offload* sysfs entry to enable and disable
> > copy_offload
> >   in devices 

Re: [RFC PATCH v5 3/4] nvme: add simple copy support

2021-02-22 Thread Selva Jove
Matthew,

Maximum Source Range Count (MSRC) is limited by u8. So the maximum
number of source ranges is 256 (0 base value). The number of pages
required to be sent to the device is at most 2. Since we are
allocating the memory using kmalloc_array(), we would get a continuous
physical segment. nvme_map_data() maps the physical segment either by
setting 2 PRP pointers or by SGL. So the copy command sends two pages
to the device for copying more than128 ranges.

On Sat, Feb 20, 2021 at 9:08 AM Matthew Wilcox  wrote:
>
> On Fri, Feb 19, 2021 at 06:15:16PM +0530, SelvaKumar S wrote:
> > + struct nvme_copy_range *range = NULL;
> [...]
> > + range = kmalloc_array(nr_range, sizeof(*range),
> > + GFP_ATOMIC | __GFP_NOWARN);
> [...]
> > + req->special_vec.bv_page = virt_to_page(range);
> > + req->special_vec.bv_offset = offset_in_page(range);
> > + req->special_vec.bv_len = sizeof(*range) * nr_range;
> [...]
> > +struct nvme_copy_range {
> > + __le64  rsvd0;
> > + __le64  slba;
> > + __le16  nlb;
> > + __le16  rsvd18;
> > + __le32  rsvd20;
> > + __le32  eilbrt;
> > + __le16  elbat;
> > + __le16  elbatm;
> > +};
>
> so ... at 32 bytes, you can get 128 per 4kB page.  What happens if you
> try to send down a command that attempts to copy 129 ranges?


Re: [dm-devel] [RFC PATCH v4 2/3] block: add simple copy support

2021-01-05 Thread Selva Jove
Hi Darrick,


On Tue, Jan 5, 2021 at 12:33 AM Darrick J. Wong  wrote:
>
> SelvaKumar S: This didn't show up on dm-devel, sorry for the OT reply...
>
> On Mon, Jan 04, 2021 at 12:47:11PM +, Damien Le Moal wrote:
> > On 2021/01/04 19:48, SelvaKumar S wrote:
> > > Add new BLKCOPY ioctl that offloads copying of one or more sources
> > > ranges to a destination in the device. Accepts copy_ranges that contains
> > > destination, no of sources and pointer to the array of source
> > > ranges. Each range_entry contains start and length of source
> > > ranges (in bytes).
> > >
> > > Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
> > > bio with control information as payload and submit to the device.
> > > REQ_OP_COPY(19) is a write op and takes zone_write_lock when submitted
> > > to zoned device.
> > >
> > > If the device doesn't support copy or copy offload is disabled, then
> > > copy is emulated by allocating memory of total copy size. The source
> > > ranges are read into memory by chaining bio for each source ranges and
> > > submitting them async and the last bio waits for completion. After data
> > > is read, it is written to the destination.
> > >
> > > bio_map_kern() is used to allocate bio and add pages of copy buffer to
> > > bio. As bio->bi_private and bio->bi_end_io is needed for chaining the
> > > bio and over written, invalidate_kernel_vmap_range() for read is called
> > > in the caller.
> > >
> > > Introduce queue limits for simple copy and other helper functions.
> > > Add device limits as sysfs entries.
> > > - copy_offload
> > > - max_copy_sectors
> > > - max_copy_ranges_sectors
> > > - max_copy_nr_ranges
> > >
> > > copy_offload(= 0) is disabled by default.
> > > max_copy_sectors = 0 indicates the device doesn't support native copy.
> > >
> > > Native copy offload is not supported for stacked devices and is done via
> > > copy emulation.
> > >
> > > Signed-off-by: SelvaKumar S 
> > > Signed-off-by: Kanchan Joshi 
> > > Signed-off-by: Nitesh Shetty 
> > > Signed-off-by: Javier González 
> > > ---
> > >  block/blk-core.c  |  94 ++--
> > >  block/blk-lib.c   | 223 ++
> > >  block/blk-merge.c |   2 +
> > >  block/blk-settings.c  |  10 ++
> > >  block/blk-sysfs.c |  50 +
> > >  block/blk-zoned.c |   1 +
> > >  block/bounce.c|   1 +
> > >  block/ioctl.c |  43 
> > >  include/linux/bio.h   |   1 +
> > >  include/linux/blk_types.h |  15 +++
> > >  include/linux/blkdev.h|  13 +++
> > >  include/uapi/linux/fs.h   |  13 +++
>
> This series should also be cc'd to linux-api since you're adding a new
> userspace api.
>

Sure. Will cc linux-api

>
> Alternately, save yourself the trouble of passing userspace API review
> by hooking this up to the existing copy_file_range call in the vfs.  /me
> hopes you sent blktests to check the operation of this thing, since none
> of the original patches made it to this list.
>

As the initial version had only source bdev, copy_file_rage call was not
viable. With this version, we have two bdev for source and destination.
I'll relook at that. Sure. Will add a new blktests for simple copy.

>
> If you really /do/ need a new kernel call for this, then please send in
> a manpage documenting the fields of struct range_entry and copy_range,
> and how users are supposed to use this.
>

Sure. Will send a manpage documentation once the plumbing is concrete.

>  plumbing...>
>
> > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > index f44eb0a04afd..5cadb176317a 100644
> > > --- a/include/uapi/linux/fs.h
> > > +++ b/include/uapi/linux/fs.h
> > > @@ -64,6 +64,18 @@ struct fstrim_range {
> > > __u64 minlen;
> > >  };
> > >
> > > +struct range_entry {
> > > +   __u64 src;
>
> Is this an offset?  Or the fd of an open bdev?

This is the source offset.

>
> > > +   __u64 len;
> > > +};
> > > +
> > > +struct copy_range {
> > > +   __u64 dest;
> > > +   __u64 nr_range;
> > > +   __u64 range_list;
>
> Hm, I think this is a pointer?  Why not just put the range_entry array
> at the end of struct copy_range?
>

As the size of the range_entry array can change dynamically depending on
nr_range, it was difficult to do copy_from_user() on copy_range structure in the
ioctl. So I decided to keep that as a pointer to range_entry array
instead of keeping
array at the end.

> > > +   __u64 rsvd;
>
> Also needs a flags argument so we don't have to add BLKCOPY2 in like 3
> months.
>

'rsvd' field is kept to support future copies. Will rename it.

> --D
>
> > > +};
> > > +
> > >  /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl 
> > > definitions */
> > >  #define FILE_DEDUPE_RANGE_SAME 0
> > >  #define FILE_DEDUPE_RANGE_DIFFERS  1
> > > @@ -184,6 +196,7 @@ struct fsxattr {
> > >  #define BLKSECDISCARD _IO(0x12,125)
> > >  #define BLKROTATIONAL _IO(0x12,126)
> > >  #define 

Re: [RFC PATCH v4 2/3] block: add simple copy support

2021-01-05 Thread Selva Jove
Thanks for the review, Damien.

On Mon, Jan 4, 2021 at 6:17 PM Damien Le Moal  wrote:
>
> On 2021/01/04 19:48, SelvaKumar S wrote:
> > Add new BLKCOPY ioctl that offloads copying of one or more sources
> > ranges to a destination in the device. Accepts copy_ranges that contains
> > destination, no of sources and pointer to the array of source
> > ranges. Each range_entry contains start and length of source
> > ranges (in bytes).
> >
> > Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
> > bio with control information as payload and submit to the device.
> > REQ_OP_COPY(19) is a write op and takes zone_write_lock when submitted
> > to zoned device.
> >
> > If the device doesn't support copy or copy offload is disabled, then
> > copy is emulated by allocating memory of total copy size. The source
> > ranges are read into memory by chaining bio for each source ranges and
> > submitting them async and the last bio waits for completion. After data
> > is read, it is written to the destination.
> >
> > bio_map_kern() is used to allocate bio and add pages of copy buffer to
> > bio. As bio->bi_private and bio->bi_end_io is needed for chaining the
> > bio and over written, invalidate_kernel_vmap_range() for read is called
> > in the caller.
> >
> > Introduce queue limits for simple copy and other helper functions.
> > Add device limits as sysfs entries.
> >   - copy_offload
> >   - max_copy_sectors
> >   - max_copy_ranges_sectors
> >   - max_copy_nr_ranges
> >
> > copy_offload(= 0) is disabled by default.
> > max_copy_sectors = 0 indicates the device doesn't support native copy.
> >
> > Native copy offload is not supported for stacked devices and is done via
> > copy emulation.
> >
> > Signed-off-by: SelvaKumar S 
> > Signed-off-by: Kanchan Joshi 
> > Signed-off-by: Nitesh Shetty 
> > Signed-off-by: Javier González 
> > ---
> >  block/blk-core.c  |  94 ++--
> >  block/blk-lib.c   | 223 ++
> >  block/blk-merge.c |   2 +
> >  block/blk-settings.c  |  10 ++
> >  block/blk-sysfs.c |  50 +
> >  block/blk-zoned.c |   1 +
> >  block/bounce.c|   1 +
> >  block/ioctl.c |  43 
> >  include/linux/bio.h   |   1 +
> >  include/linux/blk_types.h |  15 +++
> >  include/linux/blkdev.h|  13 +++
> >  include/uapi/linux/fs.h   |  13 +++
> >  12 files changed, 458 insertions(+), 8 deletions(-)
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 96e5fcd7f071..4a5cd3f53cd2 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -719,6 +719,17 @@ static noinline int should_fail_bio(struct bio *bio)
> >  }
> >  ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO);
> >
> > +static inline int bio_check_copy_eod(struct bio *bio, sector_t start,
> > + sector_t nr_sectors, sector_t maxsector)
> > +{
> > + if (nr_sectors && maxsector &&
> > + (nr_sectors > maxsector || start > maxsector - nr_sectors)) {
> > + handle_bad_sector(bio, maxsector);
> > + return -EIO;
> > + }
> > + return 0;
> > +}
> > +
> >  /*
> >   * Check whether this bio extends beyond the end of the device or 
> > partition.
> >   * This may well happen - the kernel calls bread() without checking the 
> > size of
> > @@ -737,6 +748,65 @@ static inline int bio_check_eod(struct bio *bio, 
> > sector_t maxsector)
> >   return 0;
> >  }
> >
> > +/*
> > + * Check for copy limits and remap source ranges if needed.
> > + */
> > +static int blk_check_copy(struct bio *bio)
> > +{
> > + struct block_device *p = NULL;
> > + struct request_queue *q = bio->bi_disk->queue;
> > + struct blk_copy_payload *payload;
> > + int i, maxsector, start_sect = 0, ret = -EIO;
> > + unsigned short nr_range;
> > +
> > + rcu_read_lock();
> > +
> > + p = __disk_get_part(bio->bi_disk, bio->bi_partno);
> > + if (unlikely(!p))
> > + goto out;
> > + if (unlikely(should_fail_request(p, bio->bi_iter.bi_size)))
> > + goto out;
> > + if (unlikely(bio_check_ro(bio, p)))
> > + goto out;
> > +
> > + maxsector =  bdev_nr_sectors(p);
> > + start_sect = p->bd_start_sect;
> > +
> > + payload = bio_data(bio);
> > + nr_range = payload->copy_range;
> > +
> > + /* cannot handle copy crossing nr_ranges limit */
> > + if (payload->copy_range > q->limits.max_copy_nr_ranges)
> > + goto out;
>
> If payload->copy_range indicates the number of ranges to be copied, you should
> name it payload->copy_nr_ranges.
>

Agreed. Will rename the entries.

> > +
> > + /* cannot handle copy more than copy limits */
>
> Why ? You could loop... Similarly to discard, zone reset etc.
>

Sure. Will add the support for handling copy larger than device limits.

>
> > + if (payload->copy_size > q->limits.max_copy_sectors)
> > + goto out;
>
> Shouldn't the list of source 

Re: [RFC PATCH v4 1/3] block: export bio_map_kern()

2021-01-04 Thread Selva Jove
Thanks Damien, Will update that.

On Mon, Jan 4, 2021 at 5:45 PM Damien Le Moal  wrote:
>
> On 2021/01/04 19:48, SelvaKumar S wrote:
> > Export bio_map_kern() so that copy offload emulation can use
> > it to add vmalloced memory to bio.
> >
> > Signed-off-by: SelvaKumar S 
> > ---
> >  block/blk-map.c| 3 ++-
> >  include/linux/blkdev.h | 2 ++
> >  2 files changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/block/blk-map.c b/block/blk-map.c
> > index 21630dccac62..50d61475bb68 100644
> > --- a/block/blk-map.c
> > +++ b/block/blk-map.c
> > @@ -378,7 +378,7 @@ static void bio_map_kern_endio(struct bio *bio)
> >   *   Map the kernel address into a bio suitable for io to a block
> >   *   device. Returns an error pointer in case of error.
> >   */
> > -static struct bio *bio_map_kern(struct request_queue *q, void *data,
> > +struct bio *bio_map_kern(struct request_queue *q, void *data,
> >   unsigned int len, gfp_t gfp_mask)
> >  {
> >   unsigned long kaddr = (unsigned long)data;
> > @@ -428,6 +428,7 @@ static struct bio *bio_map_kern(struct request_queue 
> > *q, void *data,
> >   bio->bi_end_io = bio_map_kern_endio;
> >   return bio;
> >  }
> > +EXPORT_SYMBOL(bio_map_kern);
>
> Simple copy support is a block layer code, so you I do not think you need 
> this.
> You only need to remove the static declaration of the function.
>
> >
> >  static void bio_copy_kern_endio(struct bio *bio)
> >  {
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 070de09425ad..81f9e7bec16c 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -936,6 +936,8 @@ extern int blk_rq_map_user(struct request_queue *, 
> > struct request *,
> >  struct rq_map_data *, void __user *, unsigned long,
> >  gfp_t);
> >  extern int blk_rq_unmap_user(struct bio *);
> > +extern struct bio *bio_map_kern(struct request_queue *q, void *data,
> > + unsigned int len, gfp_t gfp_mask);
> >  extern int blk_rq_map_kern(struct request_queue *, struct request *, void 
> > *, unsigned int, gfp_t);
> >  extern int blk_rq_map_user_iov(struct request_queue *, struct request *,
> >  struct rq_map_data *, const struct iov_iter *,
> >
>
>
> --
> Damien Le Moal
> Western Digital Research


Re: [RFC PATCH v3 1/2] block: add simple copy support

2020-12-13 Thread Selva Jove
On Fri, Dec 11, 2020 at 9:56 PM Johannes Thumshirn
 wrote:
>
> On 11/12/2020 15:57, SelvaKumar S wrote:
> [...]
> > +int blk_copy_emulate(struct block_device *bdev, struct blk_copy_payload 
> > *payload,
> > + gfp_t gfp_mask)
> > +{
> > + struct request_queue *q = bdev_get_queue(bdev);
> > + struct bio *bio;
> > + void *buf = NULL;
> > + int i, nr_srcs, max_range_len, ret, cur_dest, cur_size;
> > +
> > + nr_srcs = payload->copy_range;
> > + max_range_len = q->limits.max_copy_range_sectors << SECTOR_SHIFT;
> > + cur_dest = payload->dest;
> > + buf = kvmalloc(max_range_len, GFP_ATOMIC);
>
> Why GFP_ATOMIC and not the passed in gfp_mask? Especially as this is a 
> kvmalloc()
> which has the potential to grow quite big.
>
> > +int __blkdev_issue_copy(struct block_device *bdev, sector_t dest,
> > + sector_t nr_srcs, struct range_entry *rlist, gfp_t gfp_mask,
> > + int flags, struct bio **biop)
> > +{
>
> [...]
>
> > + total_size = struct_size(payload, range, nr_srcs);
> > + payload = kmalloc(total_size, GFP_ATOMIC | __GFP_NOWARN);
>
> Same here.
>
>
> > diff --git a/block/ioctl.c b/block/ioctl.c
> > index 6b785181344f..a4a507d85e56 100644
> > --- a/block/ioctl.c
> > +++ b/block/ioctl.c
> > @@ -142,6 +142,47 @@ static int blk_ioctl_discard(struct block_device 
> > *bdev, fmode_t mode,
> >   GFP_KERNEL, flags);
> >  }
> >
> > +static int blk_ioctl_copy(struct block_device *bdev, fmode_t mode,
> > + unsigned long arg, unsigned long flags)
> > +{
>
> [...]
>
> > +
> > + rlist = kmalloc_array(crange.nr_range, sizeof(*rlist),
> > + GFP_ATOMIC | __GFP_NOWARN);
>
> And here. I think this one can even be GFP_KERNEL.
>
>
>

Thanks. Will fix this.


Re: [RFC PATCH v3 1/2] block: add simple copy support

2020-12-13 Thread Selva Jove
On Fri, Dec 11, 2020 at 11:35 PM Keith Busch  wrote:
>
> On Fri, Dec 11, 2020 at 07:21:38PM +0530, SelvaKumar S wrote:
> > +int blk_copy_emulate(struct block_device *bdev, struct blk_copy_payload 
> > *payload,
> > + gfp_t gfp_mask)
> > +{
> > + struct request_queue *q = bdev_get_queue(bdev);
> > + struct bio *bio;
> > + void *buf = NULL;
> > + int i, nr_srcs, max_range_len, ret, cur_dest, cur_size;
> > +
> > + nr_srcs = payload->copy_range;
> > + max_range_len = q->limits.max_copy_range_sectors << SECTOR_SHIFT;
>
> The default value for this limit is 0, and this is the function for when
> the device doesn't support copy. Are we expecting drivers to set this
> value to something else for that case?

Sorry. Missed that. Will add a fix.

>
> > + cur_dest = payload->dest;
> > + buf = kvmalloc(max_range_len, GFP_ATOMIC);
> > + if (!buf)
> > + return -ENOMEM;
> > +
> > + for (i = 0; i < nr_srcs; i++) {
> > + bio = bio_alloc(gfp_mask, 1);
> > + bio->bi_iter.bi_sector = payload->range[i].src;
> > + bio->bi_opf = REQ_OP_READ;
> > + bio_set_dev(bio, bdev);
> > +
> > + cur_size = payload->range[i].len << SECTOR_SHIFT;
> > + ret = bio_add_page(bio, virt_to_page(buf), cur_size,
> > +offset_in_page(payload));
>
> 'buf' is vmalloc'ed, so we don't necessarily have congituous pages. I
> think you need to allocate the bio with bio_map_kern() or something like
> that instead with that kind of memory.
>

Sure. Will use bio_map_kern().

> > + if (ret != cur_size) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > +
> > + ret = submit_bio_wait(bio);
> > + bio_put(bio);
> > + if (ret)
> > + goto out;
> > +
> > + bio = bio_alloc(gfp_mask, 1);
> > + bio_set_dev(bio, bdev);
> > + bio->bi_opf = REQ_OP_WRITE;
> > + bio->bi_iter.bi_sector = cur_dest;
> > + ret = bio_add_page(bio, virt_to_page(buf), cur_size,
> > +offset_in_page(payload));
> > + if (ret != cur_size) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > +
> > + ret = submit_bio_wait(bio);
> > + bio_put(bio);
> > + if (ret)
> > + goto out;
> > +
> > + cur_dest += payload->range[i].len;
> > + }
>
> I think this would be a faster implementation if the reads were
> asynchronous with a payload buffer allocated specific to that read, and
> the callback can enqueue the write part. This would allow you to
> accumulate all the read data and write it in a single call.

Sounds like a better approach. Will add this implementation in v4.


Re: [RFC PATCH 2/2] nvme: add simple copy support

2020-12-02 Thread Selva Jove
On Tue, Dec 1, 2020 at 8:46 PM Keith Busch  wrote:
>
> On Tue, Dec 01, 2020 at 11:09:49AM +0530, SelvaKumar S wrote:
> > +static void nvme_config_copy(struct gendisk *disk, struct nvme_ns *ns,
> > +struct nvme_id_ns *id)
> > +{
> > + struct nvme_ctrl *ctrl = ns->ctrl;
> > + struct request_queue *queue = disk->queue;
> > +
> > + if (!(ctrl->oncs & NVME_CTRL_ONCS_COPY)) {
> > + queue->limits.max_copy_sectors = 0;
> > + blk_queue_flag_clear(QUEUE_FLAG_COPY, queue);
> > + return;
> > + }
> > +
> > + /* setting copy limits */
> > + ns->mcl = le64_to_cpu(id->mcl);
> > + ns->mssrl = le32_to_cpu(id->mssrl);
> > + ns->msrc = id->msrc;
>
> These are not used anywhere outside this function, so there's no need to
> add members to the struct.

Sure. Will remove these entries from nvme_ns.

>
> > + if (blk_queue_flag_test_and_set(QUEUE_FLAG_COPY, queue))
> > + return;
>
> The queue limits are not necessarily the same each time we're called to
> update the disk info, so this return shouldn't be here.
>

Makes sense.

> > +
> > + queue->limits.max_copy_sectors = ns->mcl * (1 << (ns->lba_shift - 9));
> > + queue->limits.max_copy_range_sectors = ns->mssrl *
> > + (1 << (ns->lba_shift - 9));
> > + queue->limits.max_copy_nr_ranges = ns->msrc + 1;
> > +}
>
> <>
>
> > @@ -2045,6 +2133,7 @@ static void nvme_update_disk_info(struct gendisk 
> > *disk,
> >   set_capacity_and_notify(disk, capacity);
> >
> >   nvme_config_discard(disk, ns);
> > + nvme_config_copy(disk, ns, id);
> >   nvme_config_write_zeroes(disk, ns);
> >
> >   if (id->nsattr & NVME_NS_ATTR_RO)
> > @@ -3014,6 +3103,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
> >   ctrl->oaes = le32_to_cpu(id->oaes);
> >   ctrl->wctemp = le16_to_cpu(id->wctemp);
> >   ctrl->cctemp = le16_to_cpu(id->cctemp);
> > + ctrl->ocfs = le32_to_cpu(id->ocfs);
>
> ocfs is not used anywhere.


Re: [RFC PATCH 1/2] block: add simple copy support

2020-12-01 Thread Selva Jove
Thanks for reporting the memory leak. Will add a fix.

On Tue, Dec 1, 2020 at 3:58 PM Aleksei Marov  wrote:
>
> On Tue, 2020-12-01 at 11:09 +0530, SelvaKumar S wrote:
> > + ret = __blkdev_issue_copy(bdev, dest, nr_srcs, rlist, gfp_mask, flags,
> > + );
> > + if (!ret && bio) {
> > + ret = submit_bio_wait(bio);
> > + if (ret == -EOPNOTSUPP)
> > + ret = 0;
> > +
> > + kfree(page_address(bio_first_bvec_all(bio)->bv_page) +
> > + bio_first_bvec_all(bio)->bv_offset);
> > + bio_put(bio);
> > + }
> > +
> > + return ret;
> > +}
> I think  there is an issue here that if bio_add_page  returns error in
> __blkdev_issue_copy then ret is -ENOMEM and we never do bio_put for bio
> allocated in  __blkdev_issue_copy so it is small memory leak.
>
>