Re: [RFC PATCH v2 0/2] add simple copy support
Hi!

> > Good idea. Are you thinking of a sysfs entry to select the backend?
>
> Not sure on this one, initially I thought of a sysfs file, but then
> how would you do it. One "global" sysfs entry is probably a bad idea.
> Having one per block device to select native vs emulation maybe? And
> a good way to benchmark.
>
> The other idea would be a benchmark loop on boot like the raid library
> does.

Doing copy benchmarking would require writes on the media, right? Kernel
should not do such stuff without userspace requesting it...

Best regards,
Pavel
--
http://www.livejournal.com/~pavelmachek
Re: [RFC PATCH v2 0/2] add simple copy support
On 08.12.2020 13:24, Johannes Thumshirn wrote:
> On 08/12/2020 14:13, Javier González wrote:
>> On 08.12.2020 12:37, Johannes Thumshirn wrote:
>>> On 08/12/2020 13:22, Javier González wrote:
>>>> Good idea. Are you thinking of a sysfs entry to select the backend?
>>>
>>> Not sure on this one, initially I thought of a sysfs file, but then
>>> how would you do it. One "global" sysfs entry is probably a bad idea.
>>> Having one per block device to select native vs emulation maybe? And
>>> a good way to benchmark.
>>
>> I was thinking a per block device to target the use case where a
>> certain implementation / workload is better one way or the other.
>
> Yes something along those lines.
>
>>> The other idea would be a benchmark loop on boot like the raid library
>>> does.
>>>
>>> Then on the other hand, there might be workloads that run faster with
>>> the emulation and some that run faster with the hardware acceleration.
>>>
>>> I think these points are the reason the last attempts got stuck.
>>
>> Yes. I believe that any benchmark we run would be biased in a certain
>> way. If we can move forward with a sysfs entry and default to legacy
>> path, we would not alter current behavior and enable NVMe copy offload
>> (for now) for those that want to use it. We can then build on top of it.
>>
>> Does this sound like a reasonable approach?
>
> Yes this sounds like a reasonable approach to me.

Cool. We will add this to the V3 then. Thanks Johannes!
Re: [RFC PATCH v2 0/2] add simple copy support
Hannes,

[Sorry I'm late to the discussion here, had a few fires going in
addition to the end of the kernel cycle]

> And we shouldn't forget that the main issue which killed all previous
> implementations was a missing QoS guarantee.

That and the fact that our arbitrary stacking situation was hard to
resolve. The QoS guarantee was somewhat addressed by Fred in T10. But I
agree we need some sort of toggle.

--
Martin K. Petersen      Oracle Linux Engineering
Re: [RFC PATCH v2 0/2] add simple copy support
On 08/12/2020 14:13, Javier González wrote:
> On 08.12.2020 12:37, Johannes Thumshirn wrote:
>> On 08/12/2020 13:22, Javier González wrote:
>>> Good idea. Are you thinking of a sysfs entry to select the backend?
>>
>> Not sure on this one, initially I thought of a sysfs file, but then
>> how would you do it. One "global" sysfs entry is probably a bad idea.
>> Having one per block device to select native vs emulation maybe? And
>> a good way to benchmark.
>
> I was thinking a per block device to target the use case where a
> certain implementation / workload is better one way or the other.

Yes something along those lines.

>> The other idea would be a benchmark loop on boot like the raid library
>> does.
>>
>> Then on the other hand, there might be workloads that run faster with
>> the emulation and some that run faster with the hardware acceleration.
>>
>> I think these points are the reason the last attempts got stuck.
>
> Yes. I believe that any benchmark we run would be biased in a certain
> way. If we can move forward with a sysfs entry and default to legacy
> path, we would not alter current behavior and enable NVMe copy offload
> (for now) for those that want to use it. We can then build on top of it.
>
> Does this sound like a reasonable approach?

Yes this sounds like a reasonable approach to me.
Re: [RFC PATCH v2 0/2] add simple copy support
On 08.12.2020 12:37, Johannes Thumshirn wrote:
> On 08/12/2020 13:22, Javier González wrote:
>> Good idea. Are you thinking of a sysfs entry to select the backend?
>
> Not sure on this one, initially I thought of a sysfs file, but then
> how would you do it. One "global" sysfs entry is probably a bad idea.
> Having one per block device to select native vs emulation maybe? And
> a good way to benchmark.

I was thinking a per block device to target the use case where a certain
implementation / workload is better one way or the other.

> The other idea would be a benchmark loop on boot like the raid library
> does.
>
> Then on the other hand, there might be workloads that run faster with
> the emulation and some that run faster with the hardware acceleration.
>
> I think these points are the reason the last attempts got stuck.

Yes. I believe that any benchmark we run would be biased in a certain
way. If we can move forward with a sysfs entry and default to legacy
path, we would not alter current behavior and enable NVMe copy offload
(for now) for those that want to use it. We can then build on top of it.

Does this sound like a reasonable approach?
Re: [RFC PATCH v2 0/2] add simple copy support
On 08/12/2020 13:22, Javier González wrote:
> Good idea. Are you thinking of a sysfs entry to select the backend?

Not sure on this one, initially I thought of a sysfs file, but then how
would you do it. One "global" sysfs entry is probably a bad idea. Having
one per block device to select native vs emulation maybe? And a good way
to benchmark.

The other idea would be a benchmark loop on boot like the raid library
does.

Then on the other hand, there might be workloads that run faster with
the emulation and some that run faster with the hardware acceleration.

I think these points are the reason the last attempts got stuck.
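[Editorial note: the logic being debated above — a per-device toggle that picks the native simple-copy path or a read/write emulation fallback — can be sketched in a few lines of userspace Python. Everything here (the `issue_copy` helper, the `ToyDev` class, the `backend` string) is purely illustrative and not part of the posted patches.]

```python
# Sketch of per-device backend selection: take the native simple-copy
# path when the toggle says "native" and the device supports it,
# otherwise fall back to a plain read/write emulation loop.
# All names are hypothetical; this is not the kernel implementation.

def issue_copy(dev, src_ranges, dest_lba, backend="emulation"):
    """Copy src_ranges (list of (lba, nr_sectors)) to dest_lba."""
    if backend == "native" and dev.supports_simple_copy:
        return dev.simple_copy(src_ranges, dest_lba)
    # Emulation: read each source range and write it out sequentially.
    out = dest_lba
    for lba, nr in src_ranges:
        data = dev.read(lba, nr)
        dev.write(out, data)
        out += nr
    return 0

class ToyDev:
    """In-memory 'device' with 512-byte logical blocks, no native copy."""
    supports_simple_copy = False

    def __init__(self, nr_sectors):
        self.blocks = [b"\0" * 512 for _ in range(nr_sectors)]

    def read(self, lba, nr):
        return self.blocks[lba:lba + nr]

    def write(self, lba, data):
        self.blocks[lba:lba + len(data)] = data

dev = ToyDev(64)
dev.write(0, [bytes([i]) * 512 for i in range(8)])
issue_copy(dev, [(0, 4), (4, 4)], 16)
assert dev.read(16, 8) == dev.read(0, 8)
```

Defaulting `backend` to `"emulation"` mirrors the "default to legacy path" proposal in the thread: behavior is unchanged unless someone flips the per-device knob.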
Re: [RFC PATCH v2 0/2] add simple copy support
On 08.12.2020 08:40, Johannes Thumshirn wrote:
> On 07/12/2020 20:27, Javier González wrote:
>> Good point. We can share some performance data on how Simple Copy
>> scales in terms of bw / latency and the CPU usage. Do you have
>> anything else in mind?
>
> With an emulation in the kernel, we could make the used "backend"
> implementation configurable. So if the emulation is faster, users can
> select the emulation, if the device is faster then the device. Kind of
> what the crypto and raid code do as well.

Good idea. Are you thinking of a sysfs entry to select the backend?

> I'm really interested in this work, as BTRFS relocation/balance will
> have potential benefits, but we need to get it right.

Agree. We will post a V3 with emulation and addressing other comments.
We can take it from there. If you have comments on V2, please send them
our way and we will take them in.
Re: [RFC PATCH v2 0/2] add simple copy support
On 08.12.2020 07:44, Hannes Reinecke wrote:
> On 12/7/20 11:12 PM, Douglas Gilbert wrote:
>> On 2020-12-07 9:56 a.m., Hannes Reinecke wrote:
>>> [...]
>>> And we shouldn't forget that the main issue which killed all previous
>>> implementations was a missing QoS guarantee.
>>
>> I have been doing quite a lot of work and testing in my sg driver
>> rewrite in the copy and compare area. The baselines for performance
>> are dd and io_uring-cp (in liburing). There are lots of ways to
>> improve on them.
>> [...]
>> Unfortunately on real SAS and NVMe SSDs that I tested, the performance
>> of the VERIFY and NVM Compare commands is underwhelming. For
>> comparison, using scsi_debug ramdisks, dd copy throughput was < 1 GB/s
>> and io_uring-cp was around 2-3 GB/s. The system was Ryzen 3600 based.
>
> Which is precisely my concern. Simple copy might be efficient for one
> particular implementation, but it might be completely off the board
> for others. But both will be claiming to support it, and us having no
> idea whether choosing simple copy will speed up matters or not.
>
> Without having a programmatic way to figure out the speed of the
> implementation we have to detect the performance ourselves, like the
> benchmarking loop RAID5 does.
> I was hoping to avoid that, and just ask the device/controller; but
> that turned out to be in vain.

I believe it makes sense to do extensive characterization to understand
how the host and device implementation behave. However, I do not believe
we will get far if the requirement is that any acceleration has to
outperform the legacy path under all circumstances and implementations.
At this moment in time, this is a fea
Re: [RFC PATCH v2 0/2] add simple copy support
On 07/12/2020 20:27, Javier González wrote:
> Good point. We can share some performance data on how Simple Copy scales
> in terms of bw / latency and the CPU usage. Do you have anything else in
> mind?

With an emulation in the kernel, we could make the used "backend"
implementation configurable. So if the emulation is faster, users can
select the emulation, if the device is faster then the device. Kind of
what the crypto and raid code do as well.

I'm really interested in this work, as BTRFS relocation/balance will
have potential benefits, but we need to get it right.
Re: [RFC PATCH v2 0/2] add simple copy support
On 12/7/20 11:12 PM, Douglas Gilbert wrote:
> On 2020-12-07 9:56 a.m., Hannes Reinecke wrote:
>> [...]
>> And we shouldn't forget that the main issue which killed all previous
>> implementations was a missing QoS guarantee.
>
> I have been doing quite a lot of work and testing in my sg driver
> rewrite in the copy and compare area. The baselines for performance
> are dd and io_uring-cp (in liburing). There are lots of ways to
> improve on them.
> [...]
> Unfortunately on real SAS and NVMe SSDs that I tested, the performance
> of the VERIFY and NVM Compare commands is underwhelming. For
> comparison, using scsi_debug ramdisks, dd copy throughput was < 1 GB/s
> and io_uring-cp was around 2-3 GB/s. The system was Ryzen 3600 based.

Which is precisely my concern. Simple copy might be efficient for one
particular implementation, but it might be completely off the board for
others. But both will be claiming to support it, and us having no idea
whether choosing simple copy will speed up matters or not.

Without having a programmatic way to figure out the speed of the
implementation we have to detect the performance ourselves, like the
benchmarking loop RAID5 does.
I was hoping to avoid that, and just ask the device/controller; but
that turned out to be in vain.

Cheers,

Hannes
--
Dr. Hannes Reinecke                     Kernel Storage Architect
h...@suse.de                            +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
Re: [RFC PATCH v2 0/2] add simple copy support
On 2020-12-07 9:56 a.m., Hannes Reinecke wrote:
> On 12/7/20 3:11 PM, Christoph Hellwig wrote:
>> [...]
>
> And we shouldn't forget that the main issue which killed all previous
> implementations was a missing QoS guarantee.
> It's nice to have simple copy, but if the implementation is _slower_
> than doing it by hand from the OS there is very little point in even
> attempting to do so.
> I can't see any provisions for that in the TPAR, leading me to the
> assumption that NVMe simple copy will suffer from the same issue.
>
> So if we can't address this I guess this attempt will fail, too.

I have been doing quite a lot of work and testing in my sg driver
rewrite in the copy and compare area. The baselines for performance are
dd and io_uring-cp (in liburing). There are lots of ways to improve on
them. Here are some:
  - the user data need never pass through the user space (could mmap it
    out during the READ if there is a good reason). Only the metadata
    (e.g. NVMe or SCSI commands) needs to come from the user space and
    errors, if any, reported back to the user space.
  - break a large copy (or compare) into segments, with each segment a
    "comfortable" size for the OS to handle, say 256 KB
  - there is one constraint: the READ in each segment must complete
    before its paired WRITE can commence
  - extra constraint for some zoned disks: WRITEs must be issued in
    order (assuming they are applied in that order; if not, need to
    wait until each WRITE completes)
  - arrange for the READ-WRITE pair in each segment to share the same
    bio
  - have multiple slots, each holding a segment (i.e. a bio and
    metadata to process a READ-WRITE pair)
  - re-use each slot's bio for the following READ-WRITE pair
  - issue the READs in each slot asynchronously and do an interleaved
    (io)poll for completion. Then issue the paired WRITE asynchronously
  - the above "slot" algorithm runs in one thread, so there can be
    multiple threads doing the same algorithm. The segment manager
    needs to be locked (or use atomics) so that each segment
    (identified by its starting LBAs) is issued once and only once when
    the next thread wants a segment to copy

Running multiple threads gives diminishing or even worsening returns.
Runtime metrics on lock contention and storage bus capacity may help in
choosing the number of threads. A simpler approach might be to add more
threads until the combined throughput increase is less than, say, 10%.

The 'compare' that I mention is based on the SCSI VERIFY(BYTCHK=1)
command (or the NVMe NVM Compare command). Using dd logic, a disk to
disk compare can be implemented with not much more work than changing
the WRITE to a VERIFY command. This is a different approach to the Linux
cmp utility, which READs in both sides and does a memcmp() type
operation. Using ramdisks (from the scsi_debug driver) the compare
operation (max ~ 10 GB/s) was actually faster than the copy
(max ~ 7 GB/s). I put this down to WRITE operations taking a write lock
over the store while the VERIFY only needs a read lock, so many VERIFY
operations can co-exist on the same store.

Unfortunately on real SAS and NVMe SSDs that I tested, the performance
of the VERIFY and NVM Compare commands is underwhelming. For comparison,
using scsi_debug ramdisks, dd copy throughput was < 1 GB/s and
io_uring-cp was around 2-3 GB/s. The system was Ryzen 3600 based.

Doug Gilbert
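[Editorial note: the segment/slot scheme described above can be sketched in userspace Python. This is file-backed rather than SCSI/NVMe, single-threaded, and synchronous, so it only illustrates the two ordering constraints — each segment's READ completes before its paired WRITE, and WRITEs go out in order — not Doug's actual sg code.]

```python
import os
import tempfile

SEG = 256 * 1024      # a "comfortable" segment size, per the post
NR_SLOTS = 4          # slot buffers cycle and are re-used, like the bios

def segmented_copy(src_path, dst_path):
    """Copy src to dst in SEG-sized segments using a small pool of slots.

    Each segment's read completes before its paired write is issued,
    and writes are issued strictly in order (the zoned-device
    constraint from the post)."""
    slots = [bytearray(SEG) for _ in range(NR_SLOTS)]
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        seg_no = 0
        while True:
            buf = slots[seg_no % NR_SLOTS]   # re-use the slot's buffer
            n = src.readinto(buf)            # the segment's READ
            if not n:
                break
            dst.write(memoryview(buf)[:n])   # paired WRITE, in order
            seg_no += 1

# Demo: copy a file that spans several segments and is not SEG-aligned.
d = tempfile.mkdtemp()
src = os.path.join(d, "src")
dst = os.path.join(d, "dst")
with open(src, "wb") as f:
    f.write(bytes(range(256)) * 4097)        # ~1 MiB, not a SEG multiple
segmented_copy(src, dst)
with open(src, "rb") as a, open(dst, "rb") as b:
    assert a.read() == b.read()
```

The real win Doug describes comes from issuing the per-slot READs asynchronously and polling for completion; the synchronous loop above only shows the data path and ordering.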
Re: [RFC PATCH v2 0/2] add simple copy support
On 07.12.2020 15:56, Hannes Reinecke wrote:
> On 12/7/20 3:11 PM, Christoph Hellwig wrote:
>> [...]
>
> And we shouldn't forget that the main issue which killed all previous
> implementations was a missing QoS guarantee.
> It's nice to have simple copy, but if the implementation is _slower_
> than doing it by hand from the OS there is very little point in even
> attempting to do so.
> I can't see any provisions for that in the TPAR, leading me to the
> assumption that NVMe simple copy will suffer from the same issue.
>
> So if we can't address this I guess this attempt will fail, too.

Good point. We can share some performance data on how Simple Copy scales
in terms of bw / latency and the CPU usage. Do you have anything else in
mind?
Re: [RFC PATCH v2 0/2] add simple copy support
On 07.12.2020 15:11, Christoph Hellwig wrote:
> So, I'm really worried about:
>
>  a) a good use case.  GC in f2fs or btrfs seem like good use cases, as
>     does accelerating dm-kcopyd.  I agree with Damien that lifting
>     dm-kcopyd to common code would also be really nice.  I'm not 100%
>     sure it should be a requirement, but it sure would be nice to have.
>     I don't think just adding an ioctl is enough of a use case for
>     complex kernel infrastructure.

We are looking at dm-kcopyd. I would have liked to start with a very
specific use case and build from there, but I see Damien's and Keith's
point of having a default path. I believe we can add this to the next
version.

>  b) We had a bunch of different attempts at SCSI XCOPY support from
>     IIRC Martin, Bart and Mikulas.  I think we need to pull them into
>     this discussion, and make sure whatever we do covers the SCSI
>     needs.

Agree. We discussed a lot about the scope and agreed that everything
outside of the specifics of Simple Copy requires the input from the ones
that have worked on XCOPY support in the past.

> On Fri, Dec 04, 2020 at 03:16:57PM +0530, SelvaKumar S wrote:
>> This patchset tries to add support for TP4065a ("Simple Copy
>> Command"), v2020.05.04 ("Ratified")
>>
>> The Specification can be found in following link.
>> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
>>
>> This is an RFC. Looking forward for any feedbacks or other alternate
>> designs for plumbing simple copy to IO stack.
>>
>> Simple copy command is a copy offloading operation and is used to copy
>> multiple contiguous ranges (source_ranges) of LBA's to a single
>> destination LBA within the device, reducing traffic between host and
>> device.
>>
>> This implementation accepts destination, no of sources and arrays of
>> source ranges from application and attach it as payload to the bio and
>> submits to the device.
>>
>> Following limits are added to queue limits and are exposed in sysfs
>> to userspace
>>   - *max_copy_sectors* limits the sum of all source_range length
>>   - *max_copy_nr_ranges* limits the number of source ranges
>>   - *max_copy_range_sectors* limits the maximum number of sectors
>>     that can constitute a single source range.
>>
>> Changes from v1:
>>
>> 1. Fix memory leak in __blkdev_issue_copy
>> 2. Unmark blk_check_copy inline
>> 3. Fix line break in blk_check_copy_eod
>> 4. Remove p checks and made code more readable
>> 5. Don't use bio_set_op_attrs and remove op and set bi_opf directly
>> 6. Use struct_size to calculate total_size
>> 7. Fix partition remap of copy destination
>> 8. Remove mcl,mssrl,msrc from nvme_ns
>> 9. Initialize copy queue limits to 0 in nvme_config_copy
>> 10. Remove return in QUEUE_FLAG_COPY check
>> 11. Remove unused OCFS
>>
>> SelvaKumar S (2):
>>   block: add simple copy support
>>   nvme: add simple copy support
>>
>>  block/blk-core.c          |  94 ++---
>>  block/blk-lib.c           | 123 ++
>>  block/blk-merge.c         |   2 +
>>  block/blk-settings.c      |  11
>>  block/blk-sysfs.c         |  23 +++
>>  block/blk-zoned.c         |   1 +
>>  block/bounce.c            |   1 +
>>  block/ioctl.c             |  43 +
>>  drivers/nvme/host/core.c  |  87 +++
>>  include/linux/bio.h       |   1 +
>>  include/linux/blk_types.h |  15 +
>>  include/linux/blkdev.h    |  15 +
>>  include/linux/nvme.h      |  43 -
>>  include/uapi/linux/fs.h   |  13
>>  14 files changed, 461 insertions(+), 11 deletions(-)
>>
>> --
>> 2.25.1
>
> ---end quoted text---
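[Editorial note: the three queue limits listed in the quoted cover letter compose in a straightforward way; the sketch below shows the checks a copy request would have to pass. The field names come from the cover letter, while the function name and the limit values are made up for illustration.]

```python
def check_copy_limits(ranges, max_copy_sectors,
                      max_copy_nr_ranges, max_copy_range_sectors):
    """ranges: list of (src_lba, nr_sectors) tuples.

    Returns True if the request fits the advertised queue limits from
    the cover letter: total sectors, number of ranges, and per-range
    length."""
    if len(ranges) > max_copy_nr_ranges:
        return False        # too many source ranges
    if any(nr > max_copy_range_sectors for _, nr in ranges):
        return False        # a single source range is too long
    if sum(nr for _, nr in ranges) > max_copy_sectors:
        return False        # sum of all source_range lengths too long
    return True

# Hypothetical limits: 2048 total sectors, 8 ranges, 512 per range.
assert check_copy_limits([(0, 512), (1024, 512)], 2048, 8, 512)
assert not check_copy_limits([(0, 513)], 2048, 8, 512)          # per-range
assert not check_copy_limits([(i, 1) for i in range(9)],
                             2048, 8, 512)                      # nr_ranges
assert not check_copy_limits([(i * 512, 512) for i in range(5)],
                             2048, 8, 512)                      # total
```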
Re: [RFC PATCH v2 0/2] add simple copy support
On 12/7/20 3:11 PM, Christoph Hellwig wrote:
> So, I'm really worried about:
>
>  a) a good use case.  GC in f2fs or btrfs seem like good use cases, as
>     does accelerating dm-kcopyd.  I agree with Damien that lifting
>     dm-kcopyd to common code would also be really nice.  I'm not 100%
>     sure it should be a requirement, but it sure would be nice to have.
>     I don't think just adding an ioctl is enough of a use case for
>     complex kernel infrastructure.
>  b) We had a bunch of different attempts at SCSI XCOPY support from
>     IIRC Martin, Bart and Mikulas.  I think we need to pull them into
>     this discussion, and make sure whatever we do covers the SCSI
>     needs.

And we shouldn't forget that the main issue which killed all previous
implementations was a missing QoS guarantee.
It's nice to have simple copy, but if the implementation is _slower_
than doing it by hand from the OS there is very little point in even
attempting to do so.
I can't see any provisions for that in the TPAR, leading me to the
assumption that NVMe simple copy will suffer from the same issue.

So if we can't address this I guess this attempt will fail, too.

Cheers,

Hannes
--
Dr. Hannes Reinecke                     Kernel Storage Architect
h...@suse.de                            +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
Re: [RFC PATCH v2 0/2] add simple copy support
So, I'm really worried about:

 a) a good use case.  GC in f2fs or btrfs seem like good use cases, as
    does accelerating dm-kcopyd.  I agree with Damien that lifting
    dm-kcopyd to common code would also be really nice.  I'm not 100%
    sure it should be a requirement, but it sure would be nice to have.
    I don't think just adding an ioctl is enough of a use case for
    complex kernel infrastructure.
 b) We had a bunch of different attempts at SCSI XCOPY support from
    IIRC Martin, Bart and Mikulas.  I think we need to pull them into
    this discussion, and make sure whatever we do covers the SCSI needs.

On Fri, Dec 04, 2020 at 03:16:57PM +0530, SelvaKumar S wrote:
> This patchset tries to add support for TP4065a ("Simple Copy Command"),
> v2020.05.04 ("Ratified")
> [...]
---end quoted text---
Re: [RFC PATCH v2 0/2] add simple copy support
On 2020/12/07 17:16, javier.g...@samsung.com wrote:
> On 07.12.2020 08:06, Damien Le Moal wrote:
>> On 2020/12/07 16:46, javier.g...@samsung.com wrote:
>>> [...]
>>> Keith, Damien,
>>>
>>> Can we do the block layer emulation with this patchset and then work in
>>> follow-up patchsets on (i) the FS interface with F2FS as a first user and
>>> (ii) other HW accelerations such as XCOPY?
>>
>> The initial patchset supporting NVMe simple copy and emulation copy, all
>> under an API that probably will be similar to that of dm-kcopyd, will
>> cover all block devices. Other hardware native support for copy
>> functions such as scsi extended copy can be added later under the hood
>> without any API changes (or minimal changes).
>
> Sounds good. That we can do. We will add a new patch for this.
>
>> I am not sure what you mean by "FS interface for F2FS": the block layer
>> API for this copy functionality will be what F2FS (and other FSes) will
>> call. That is the interface, no?
>
> Essentially yes. I mean adding the F2FS logic and potentially some
> helpers to the block layer to aid GC.

GC is very much special to each FS, so I do not think adding helpers to the
block layer will have value. We should stick to a pure block copy API for
that layer.

>>> For XCOPY, I believe we need to have a separate discussion as much work
>>> is already done that we should align to.
>>
>> I think Martin (added to this thread) and others have looked into it but
>> I do not think that anything made it into the kernel yet.
>
> Exactly. Looking at some of the code posted through time and recalling
> the discussions at LSF/MM, it seems like there are a number of things we
> are not addressing here that could be incorporated down the road, such as
> dedicated syscalls / extensions, multi namespace / device support, etc.

dm-kcopyd's interface supports copy between multiple devices. That of
course would not enable NVMe simple copy use, but it makes the interface
generic enough that we should not have any problem with other hardware copy
functions.

-- 
Damien Le Moal
Western Digital Research
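The read/write emulation path discussed in this thread (fall back to plain reads and writes when the device lacks a native copy command) can be sketched in userspace. This is an illustration only, not the patchset's actual code: `emulate_copy`, `struct copy_range`, and the in-memory "disk" buffer are hypothetical names standing in for the kernel's bio-based implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512

/* Hypothetical source-range descriptor: start sector and length in sectors,
 * mirroring the cover letter's "source_ranges". */
struct copy_range {
    uint64_t src_sector;
    uint64_t nr_sectors;
};

/*
 * Emulated copy: read each source range and write it back at the
 * destination, advancing the destination as ranges are consumed.
 * Here "disk" is a plain byte buffer standing in for the block device;
 * the kernel would instead issue read and write bios per range.
 */
static void emulate_copy(uint8_t *disk, uint64_t dst_sector,
                         const struct copy_range *ranges, int nr_ranges)
{
    for (int i = 0; i < nr_ranges; i++) {
        memmove(disk + dst_sector * SECTOR_SIZE,
                disk + ranges[i].src_sector * SECTOR_SIZE,
                ranges[i].nr_sectors * SECTOR_SIZE);
        dst_sector += ranges[i].nr_sectors;
    }
}
```

Note the semantics this assumes, matching the cover letter: multiple source ranges are concatenated contiguously at a single destination LBA.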
Re: [RFC PATCH v2 0/2] add simple copy support
On 07.12.2020 08:06, Damien Le Moal wrote:
> On 2020/12/07 16:46, javier.g...@samsung.com wrote:
>> [...]
>> Keith, Damien,
>>
>> Can we do the block layer emulation with this patchset and then work in
>> follow-up patchsets on (i) the FS interface with F2FS as a first user and
>> (ii) other HW accelerations such as XCOPY?
>
> The initial patchset supporting NVMe simple copy and emulation copy, all
> under an API that probably will be similar to that of dm-kcopyd, will
> cover all block devices. Other hardware native support for copy functions
> such as scsi extended copy can be added later under the hood without any
> API changes (or minimal changes).

Sounds good. That we can do. We will add a new patch for this.

> I am not sure what you mean by "FS interface for F2FS": the block layer
> API for this copy functionality will be what F2FS (and other FSes) will
> call. That is the interface, no?

Essentially yes. I mean adding the F2FS logic and potentially some helpers
to the block layer to aid GC.

>> For XCOPY, I believe we need to have a separate discussion as much work
>> is already done that we should align to.
>
> I think Martin (added to this thread) and others have looked into it but
> I do not think that anything made it into the kernel yet.

Exactly. Looking at some of the code posted through time and recalling the
discussions at LSF/MM, it seems like there are a number of things we are
not addressing here that could be incorporated down the road, such as
dedicated syscalls / extensions, multi namespace / device support, etc.
Re: [RFC PATCH v2 0/2] add simple copy support
On 2020/12/07 16:46, javier.g...@samsung.com wrote:
> On 04.12.2020 23:40, Keith Busch wrote:
>> On Fri, Dec 04, 2020 at 11:25:12AM +, Damien Le Moal wrote:
>>> [...]
>>> This is my 10 cents though, others may differ about this.
>>
>> Yes, I agree that copy emulation support should be included with the
>> hardware enabled solution.
>
> Keith, Damien,
>
> Can we do the block layer emulation with this patchset and then work in
> follow-up patchsets on (i) the FS interface with F2FS as a first user and
> (ii) other HW accelerations such as XCOPY?

The initial patchset supporting NVMe simple copy and emulation copy, all
under an API that probably will be similar to that of dm-kcopyd, will cover
all block devices. Other hardware native support for copy functions such as
scsi extended copy can be added later under the hood without any API
changes (or minimal changes).

I am not sure what you mean by "FS interface for F2FS": the block layer API
for this copy functionality will be what F2FS (and other FSes) will call.
That is the interface, no?

> For XCOPY, I believe we need to have a separate discussion as much work
> is already done that we should align to.

I think Martin (added to this thread) and others have looked into it but I
do not think that anything made it into the kernel yet.

-- 
Damien Le Moal
Western Digital Research
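The idea of one copy API with backends added "under the hood", plus the per-device native-vs-emulation selection discussed elsewhere in this thread, can be sketched as a simple dispatch. All names here (`copy_backend`, `blk_dev`, `issue_copy`, `force_emulation`) are hypothetical illustrations, not the patchset's symbols.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-device copy capability: the block layer picks a native
 * offload (NVMe simple copy, later scsi XCOPY) or read/write emulation. */
enum copy_backend { COPY_EMULATED, COPY_NVME_SIMPLE, COPY_SCSI_XCOPY };

struct blk_dev {
    enum copy_backend backend;
    bool force_emulation;  /* per-device sysfs knob floated in the thread */
};

/* Returns the backend actually used, so callers can see the dispatch.
 * Callers pass the same range payload either way; only the mechanism
 * underneath changes, which is what keeps the API stable. */
static enum copy_backend issue_copy(const struct blk_dev *dev)
{
    if (dev->force_emulation)
        return COPY_EMULATED;

    switch (dev->backend) {
    case COPY_NVME_SIMPLE:
    case COPY_SCSI_XCOPY:
        /* hand the ranges to the driver's native copy command */
        return dev->backend;
    default:
        /* no native support: fall back to read/write emulation */
        return COPY_EMULATED;
    }
}
```

Because the decision is per device, adding a new hardware backend later only extends the switch; filesystem and DM callers are untouched.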
Re: [RFC PATCH v2 0/2] add simple copy support
On 04.12.2020 23:40, Keith Busch wrote:
> On Fri, Dec 04, 2020 at 11:25:12AM +, Damien Le Moal wrote:
>> On 2020/12/04 20:02, SelvaKumar S wrote:
>>> [...]
>>
>> Same comment as before. I think this is a good start, but for this to be
>> really useful to users and kernel components alike, this really needs
>> copy emulation for drives that do not have a native copy feature,
>> similarly to what is done for write zeroes handling, for instance: if
>> the drive does not have a copy command (simple copy for NVMe or XCOPY
>> for scsi), then the block layer should issue read/write commands to
>> seamlessly execute the copy. Otherwise, this will only serve a small
>> niche of users and will not be optimal for FS and DM drivers that could
>> be simplified with a generic block layer copy functionality.
>>
>> This is my 10 cents though, others may differ about this.
>
> Yes, I agree that copy emulation support should be included with the
> hardware enabled solution.

Keith, Damien,

Can we do the block layer emulation with this patchset and then work in
follow-up patchsets on (i) the FS interface with F2FS as a first user and
(ii) other HW accelerations such as XCOPY?

For XCOPY, I believe we need to have a separate discussion as much work is
already done that we should align to.
Re: [RFC PATCH v2 0/2] add simple copy support
On Fri, Dec 04, 2020 at 11:25:12AM +, Damien Le Moal wrote:
> On 2020/12/04 20:02, SelvaKumar S wrote:
>> [...]
>
> Same comment as before. I think this is a good start, but for this to be
> really useful to users and kernel components alike, this really needs
> copy emulation for drives that do not have a native copy feature,
> similarly to what is done for write zeroes handling, for instance: if the
> drive does not have a copy command (simple copy for NVMe or XCOPY for
> scsi), then the block layer should issue read/write commands to
> seamlessly execute the copy. Otherwise, this will only serve a small
> niche of users and will not be optimal for FS and DM drivers that could
> be simplified with a generic block layer copy functionality.
>
> This is my 10 cents though, others may differ about this.

Yes, I agree that copy emulation support should be included with the
hardware enabled solution.
Re: [RFC PATCH v2 0/2] add simple copy support
On 2020/12/04 20:02, SelvaKumar S wrote:
> This patchset tries to add support for TP4065a ("Simple Copy Command"),
> v2020.05.04 ("Ratified").
>
> The specification can be found at the following link:
> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
>
> This is an RFC. Looking forward to any feedback or other alternate
> designs for plumbing simple copy into the IO stack.
>
> The simple copy command is a copy offloading operation used to copy
> multiple contiguous ranges (source_ranges) of LBAs to a single
> destination LBA within the device, reducing traffic between host and
> device.
>
> This implementation accepts a destination, the number of sources and an
> array of source ranges from the application, attaches them as a payload
> to the bio, and submits it to the device.
>
> The following limits are added to the queue limits and are exposed in
> sysfs to userspace:
>   - *max_copy_sectors* limits the sum of all source_range lengths
>   - *max_copy_nr_ranges* limits the number of source ranges
>   - *max_copy_range_sectors* limits the maximum number of sectors
>     that can constitute a single source range.
>
> [...]

Same comment as before. I think this is a good start, but for this to be
really useful to users and kernel components alike, this really needs copy
emulation for drives that do not have a native copy feature, similarly to
what is done for write zeroes handling, for instance: if the drive does not
have a copy command (simple copy for NVMe or XCOPY for scsi), then the
block layer should issue read/write commands to seamlessly execute the
copy. Otherwise, this will only serve a small niche of users and will not
be optimal for FS and DM drivers that could be simplified with a generic
block layer copy functionality.

This is my 10 cents though, others may differ about this.

-- 
Damien Le Moal
Western Digital Research
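The three queue limits from the cover letter imply a validation step before a copy request is issued. A hedged userspace sketch of such a check follows; `copy_request_valid` and the struct names are illustrative, not the patchset's `blk_check_copy`, and the limit values in the usage are made-up examples.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The three limits the cover letter exposes in sysfs. */
struct copy_limits {
    uint64_t max_copy_sectors;       /* cap on the sum of all range lengths */
    uint32_t max_copy_nr_ranges;     /* cap on the number of source ranges */
    uint64_t max_copy_range_sectors; /* cap on a single range's length */
};

struct range {
    uint64_t sector;
    uint64_t nr_sectors;
};

/* Reject a copy request that exceeds any advertised limit:
 * too many ranges, any single range too long, or total too large. */
static bool copy_request_valid(const struct copy_limits *l,
                               const struct range *r, uint32_t nr)
{
    uint64_t total = 0;

    if (nr == 0 || nr > l->max_copy_nr_ranges)
        return false;
    for (uint32_t i = 0; i < nr; i++) {
        if (r[i].nr_sectors == 0 ||
            r[i].nr_sectors > l->max_copy_range_sectors)
            return false;
        total += r[i].nr_sectors;
    }
    return total <= l->max_copy_sectors;
}
```

A real implementation would also need end-of-device checks on each source range and the destination (the `blk_check_copy_eod` fix in the v2 changelog suggests the patchset does exactly that).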
[RFC PATCH v2 0/2] add simple copy support
This patchset tries to add support for TP4065a ("Simple Copy Command"),
v2020.05.04 ("Ratified").

The specification can be found at the following link:
https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip

This is an RFC. Looking forward to any feedback or other alternate designs
for plumbing simple copy into the IO stack.

The simple copy command is a copy offloading operation used to copy
multiple contiguous ranges (source_ranges) of LBAs to a single destination
LBA within the device, reducing traffic between host and device.

This implementation accepts a destination, the number of sources and an
array of source ranges from the application, attaches them as a payload to
the bio, and submits it to the device.

The following limits are added to the queue limits and are exposed in sysfs
to userspace:
  - *max_copy_sectors* limits the sum of all source_range lengths
  - *max_copy_nr_ranges* limits the number of source ranges
  - *max_copy_range_sectors* limits the maximum number of sectors
    that can constitute a single source range.

Changes from v1:

 1. Fix memory leak in __blkdev_issue_copy
 2. Unmark blk_check_copy inline
 3. Fix line break in blk_check_copy_eod
 4. Remove p checks and make the code more readable
 5. Don't use bio_set_op_attrs; remove op and set bi_opf directly
 6. Use struct_size to calculate total_size
 7. Fix partition remap of copy destination
 8. Remove mcl, mssrl, msrc from nvme_ns
 9. Initialize copy queue limits to 0 in nvme_config_copy
10. Remove return in QUEUE_FLAG_COPY check
11. Remove unused OCFS

SelvaKumar S (2):
  block: add simple copy support
  nvme: add simple copy support

 block/blk-core.c          |  94 ++---
 block/blk-lib.c           | 123 ++
 block/blk-merge.c         |   2 +
 block/blk-settings.c      |  11
 block/blk-sysfs.c         |  23 +++
 block/blk-zoned.c         |   1 +
 block/bounce.c            |   1 +
 block/ioctl.c             |  43 +
 drivers/nvme/host/core.c  |  87 +++
 include/linux/bio.h       |   1 +
 include/linux/blk_types.h |  15 +
 include/linux/blkdev.h    |  15 +
 include/linux/nvme.h      |  43 -
 include/uapi/linux/fs.h   |  13
 14 files changed, 461 insertions(+), 11 deletions(-)

-- 
2.25.1
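The payload shape the cover letter describes (a destination, the number of sources, and an array of source ranges attached to the bio) and the `struct_size` change from v1 can be illustrated in userspace C. The struct layout and names here are hypothetical, not the actual uapi introduced by the patchset.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical copy payload: a destination plus a flexible array of
 * source ranges, echoing what the cover letter attaches to the bio. */
struct copy_payload {
    uint64_t dst_sector;
    uint32_t nr_ranges;
    struct {
        uint64_t src_sector;
        uint64_t nr_sectors;
    } ranges[];
};

/* Userspace stand-in for the kernel's struct_size() helper (v2 change #6):
 * header plus n trailing array elements, sized as one allocation. */
static size_t payload_size(uint32_t n)
{
    return sizeof(struct copy_payload) +
           n * sizeof(((struct copy_payload *)0)->ranges[0]);
}

static struct copy_payload *alloc_payload(uint64_t dst, uint32_t n)
{
    struct copy_payload *p = calloc(1, payload_size(n));
    if (p) {
        p->dst_sector = dst;
        p->nr_ranges = n;
    }
    return p;
}
```

Using a single allocation keeps the header and the variable-length range array contiguous, which is convenient when the whole thing is handed to the device as one command payload.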