Re: [RFC PATCH v5 0/4] add simple copy support
On 4/11/2021 10:26 PM, Javier González wrote: On 11.04.2021 12:10, Max Gurtovoy wrote: On 4/10/2021 9:32 AM, Javier González wrote: On 10 Apr 2021, at 02.30, Chaitanya Kulkarni wrote: On 4/9/21 17:22, Max Gurtovoy wrote: On 2/19/2021 2:45 PM, SelvaKumar S wrote: This patchset tries to add support for TP4065a ("Simple Copy Command"), v2020.05.04 ("Ratified"). The specification can be found at the following link. https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip The simple copy command is a copy offloading operation used to copy multiple contiguous ranges (source_ranges) of LBAs to a single destination LBA within the device, reducing traffic between host and device. This implementation doesn't add native copy offload support for stacked devices; rather, copy offload is done through emulation. Possible use cases are F2FS GC and BTRFS relocation/balance. *blkdev_issue_copy* takes a source bdev, the number of sources, an array of source ranges (in sectors), a destination bdev and a destination offset (in sectors). If both source and destination block devices are the same and copy_offload = 1, then the copy is done through native copy offloading. Copy emulation is used in other cases. As SCSI XCOPY can take two different block devices and the number of source ranges is equal to 1, this interface can be extended in the future to support SCSI XCOPY. Any idea why this TP wasn't designed for copy offload between 2 different namespaces in the same controller ? Yes, it was the first attempt, so the scope was kept simple. Further work is needed to add an incremental TP so that we can also do a copy between the namespaces of the same controller (if we can't already) and to namespaces that belong to a different controller. And a simple copy will be the case where src_nsid == dst_nsid ? Also, why are there multiple source ranges and only one dst range ? We could add a bit to indicate if this range is src or dst.. One of the target use cases was ZNS, in order to avoid fabric transfers during host GC. 
You can see how this plays well with several zone ranges and a single zone destination. If we start getting support in Linux through the different past copy offload efforts, I’m sure we can extend this TP in the future. But the "copy" command IMO is more general than the ZNS GC case, which can be a special case of copy, isn't it ? It applies to any namespace type, so yes. I just wanted to give you the background for the current "simple" scope through one of the use cases that was in mind. We can get a big benefit from offloading the data copy from one ns to another in the same controller, and even in different controllers in the same subsystem. Definitely. Do you think the extension should be to the "copy" command, or to create a new command "x_copy" for copying to a different destination ns ? I believe there is space for extensions to simple copy. But given the experience with XCOPY, I can imagine that changes will be incremental, based on very specific use cases. I think getting support upstream and bringing deployed cases is a very good start. Copying data (files) within the controller/subsystem from ns_A to ns_B using NVMf will reduce network BW and memory BW in the host server. This feature is well known and the use case is well known. The question is whether we implement it in a vendor-specific manner or we add it to the specification. I prefer adding it to the spec :)
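As a rough illustration of why multiple source ranges map to one destination, here is a minimal sketch of the range bookkeeping: each source entry names a starting LBA and a length, and the device writes the ranges back-to-back from the single destination LBA. The field names are illustrative, not the exact TP descriptor layout.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical mirror of a Simple Copy source-range entry:
 * a starting source LBA plus a length in logical blocks.
 * The command carries an array of these and ONE destination LBA. */
struct copy_range {
	uint64_t slba;	/* starting source LBA */
	uint32_t nlb;	/* number of logical blocks in this range */
};

/* Sum the lengths of all source ranges; the device lays them out
 * contiguously starting at the destination LBA, which is why a
 * single destination suffices (e.g. one ZNS zone fed by many
 * valid-data ranges during host GC). */
static uint64_t copy_total_lbs(const struct copy_range *r, size_t nr)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		total += r[i].nlb;
	return total;
}
```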
Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs
On 4/6/2021 2:53 PM, Jason Gunthorpe wrote: On Tue, Apr 06, 2021 at 08:09:43AM +0300, Leon Romanovsky wrote: On Tue, Apr 06, 2021 at 10:37:38AM +0800, Honggang LI wrote: On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote: From: Leon Romanovsky From Avihai, Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering imposed on PCI transactions, and thus can improve performance. Until now, relaxed ordering could be set only by user space applications for user MRs. The following patch series enables relaxed ordering for the kernel ULPs as well. Relaxed ordering is an optional capability, and as such, it is ignored by vendors that don't support it. The following test results show the performance improvement achieved Did you test this patchset with a CPU that does not support relaxed ordering? I don't think so, the CPUs that don't support RO are Intel's fourth/fifth generation and they are not interesting from a performance point of view. We observed significant performance degradation when running perftest with relaxed ordering enabled on an old CPU. https://github.com/linux-rdma/perftest/issues/116 The perftest case is slightly different, but you raise a valid point. We forgot to call pcie_relaxed_ordering_enabled() before setting the RO bit, and arguably this needs to be done in perftest too. No, the PCI device should not have the RO bit set in this situation. It is something mlx5_core needs to do. We can't push this into applications. pcie_relaxed_ordering_enabled() is called in drivers/net/ethernet/mellanox/mlx5/core/en_common.c so it probably needs to move to mlx5_core in this series. There should be no performance difference between asking for IBV_ACCESS_RELAXED_ORDERING when RO is disabled in the PCI config and not asking for it at all. Either the platform has working relaxed ordering that gives a performance gain and the RO config spec bit should be set, or it doesn't and the bit should be clear. is this the case today ? 
This is not something to decide in userspace, or in RDMA. At worst it becomes another platform-specific PCI tunable people have to set. I thought the old Haswell systems were quirked to disable RO globally anyhow? Jason
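Jason's point can be sketched in userspace terms: whether IBV_ACCESS_RELAXED_ORDERING helps should reduce to whether the Enable Relaxed Ordering bit in the PCIe Device Control register is set, and a quirked platform clears that bit once in the core driver rather than each application deciding. A minimal sketch; the quirk helper is hypothetical, but the bit value matches PCI_EXP_DEVCTL_RELAX_EN in the kernel headers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* Enable Relaxed Ordering bit in the PCIe Device Control register
 * (same value as PCI_EXP_DEVCTL_RELAX_EN in <uapi/linux/pci_regs.h>). */
#define PCI_EXP_DEVCTL_RELAX_EN 0x0010

/* What pcie_relaxed_ordering_enabled() ultimately reports for a
 * device: is the RO enable bit set in Device Control? */
static bool relaxed_ordering_enabled(uint16_t devctl)
{
	return devctl & PCI_EXP_DEVCTL_RELAX_EN;
}

/* Hypothetical model of a platform quirk: on a platform where RO
 * is known-broken (e.g. the quirked Haswell systems), the bit is
 * cleared once; ULPs then just observe the result. */
static uint16_t apply_platform_quirk(uint16_t devctl, bool platform_ro_ok)
{
	if (!platform_ro_ok)
		devctl &= (uint16_t)~PCI_EXP_DEVCTL_RELAX_EN;
	return devctl;
}
```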
Re: [RFC PATCH v5 0/4] add simple copy support
On 4/10/2021 9:32 AM, Javier González wrote: On 10 Apr 2021, at 02.30, Chaitanya Kulkarni wrote: On 4/9/21 17:22, Max Gurtovoy wrote: On 2/19/2021 2:45 PM, SelvaKumar S wrote: This patchset tries to add support for TP4065a ("Simple Copy Command"), v2020.05.04 ("Ratified"). The specification can be found at the following link. https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip The simple copy command is a copy offloading operation used to copy multiple contiguous ranges (source_ranges) of LBAs to a single destination LBA within the device, reducing traffic between host and device. This implementation doesn't add native copy offload support for stacked devices; rather, copy offload is done through emulation. Possible use cases are F2FS GC and BTRFS relocation/balance. *blkdev_issue_copy* takes a source bdev, the number of sources, an array of source ranges (in sectors), a destination bdev and a destination offset (in sectors). If both source and destination block devices are the same and copy_offload = 1, then the copy is done through native copy offloading. Copy emulation is used in other cases. As SCSI XCOPY can take two different block devices and the number of source ranges is equal to 1, this interface can be extended in the future to support SCSI XCOPY. Any idea why this TP wasn't designed for copy offload between 2 different namespaces in the same controller ? Yes, it was the first attempt, so the scope was kept simple. Further work is needed to add an incremental TP so that we can also do a copy between the namespaces of the same controller (if we can't already) and to namespaces that belong to a different controller. And a simple copy will be the case where src_nsid == dst_nsid ? Also, why are there multiple source ranges and only one dst range ? We could add a bit to indicate if this range is src or dst.. One of the target use cases was ZNS, in order to avoid fabric transfers during host GC. You can see how this plays well with several zone ranges and a single zone destination. 
If we start getting support in Linux through the different past copy offload efforts, I’m sure we can extend this TP in the future. But the "copy" command IMO is more general than the ZNS GC case, which can be a special case of copy, isn't it ? We can get a big benefit from offloading the data copy from one ns to another in the same controller, and even in different controllers in the same subsystem. Do you think the extension should be to the "copy" command, or to create a new command "x_copy" for copying to a different destination ns ?
Re: [RFC PATCH v5 0/4] add simple copy support
On 2/19/2021 2:45 PM, SelvaKumar S wrote: This patchset tries to add support for TP4065a ("Simple Copy Command"), v2020.05.04 ("Ratified"). The specification can be found at the following link. https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip The simple copy command is a copy offloading operation used to copy multiple contiguous ranges (source_ranges) of LBAs to a single destination LBA within the device, reducing traffic between host and device. This implementation doesn't add native copy offload support for stacked devices; rather, copy offload is done through emulation. Possible use cases are F2FS GC and BTRFS relocation/balance. *blkdev_issue_copy* takes a source bdev, the number of sources, an array of source ranges (in sectors), a destination bdev and a destination offset (in sectors). If both source and destination block devices are the same and copy_offload = 1, then the copy is done through native copy offloading. Copy emulation is used in other cases. As SCSI XCOPY can take two different block devices and the number of source ranges is equal to 1, this interface can be extended in the future to support SCSI XCOPY. Any idea why this TP wasn't designed for copy offload between 2 different namespaces in the same controller ? And a simple copy will be the case where src_nsid == dst_nsid ? Also, why are there multiple source ranges and only one dst range ? We could add a bit to indicate if this range is src or dst.. For devices supporting native simple copy, attach the control information as a payload to the bio and submit it to the device. For devices without native copy support, copy emulation is done by reading each source range into memory and writing it to the destination. The caller can choose not to try emulation if copy offload is not supported by setting the BLKDEV_COPY_NOEMULATION flag. The following limits are added to queue limits and are exposed in sysfs to userspace - *copy_offload* controls copy_offload. 
set 0 to disable copy offload, 1 to enable native copy offloading support. - *max_copy_sectors* limits the sum of all source_range lengths - *max_copy_nr_ranges* limits the number of source ranges - *max_copy_range_sectors* limits the maximum number of sectors that can constitute a single source range. max_copy_sectors = 0 indicates the device doesn't support copy offloading. The *copy_offload* sysfs entry is configurable and can be used to toggle between emulation and native support depending upon the use case. Changes from v4 1. Extend dm-kcopyd to leverage copy-offload while copying within the same device. The other approach was to have copy-emulation by moving dm-kcopyd to the block layer. But it also required moving core dm-io infra, causing massive churn across multiple dm-targets. 2. Remove export in bio_map_kern() 3. Change copy_offload sysfs to accept 0 or else 4. Rename copy support flag to QUEUE_FLAG_SIMPLE_COPY 5. Rename payload entries, add source bdev field to be used while partition remapping, remove copy_size 6. Change the blkdev_issue_copy() interface to accept destination and source values in sectors rather than bytes 7. Add payload to bio using bio_map_kern() for the copy_offload case 8. Add check to return error if one of the source range lengths is 0 9. Add BLKDEV_COPY_NOEMULATION flag to allow user to not try copy emulation in case copy offload is not supported. The caller can then use his existing copying logic to complete the I/O. 10. Bug fix copy checks and reduce size of rcu_lock() Planned for next: - adding blktests - handling larger (than device limits) copy - decide on ioctl interface (man-page etc.) Changes from v3 1. gfp_flag fixes. 2. Export bio_map_kern() and use it to allocate and add pages to bio. 3. Move copy offload, reading to buf, writing from buf to separate functions. 4. Send read bios of copy offload by chaining them and submit asynchronously. 5. Add gendisk->part0 and part->bd_start_sect changes to blk_check_copy(). 6. 
Move single source range limit check to blk_check_copy() 7. Rename __blkdev_issue_copy() to blkdev_issue_copy() and remove the old helper. 8. Make the blkdev_issue_copy() interface generic to accept a destination bdev, to support XCOPY as well. 9. Add invalidate_kernel_vmap_range() after reading data for vmalloc'ed memory. 10. Fix buf allocation logic to allocate a buffer for the total size of the copy. 11. Reword patch commit description. Changes from v2 1. Add emulation support for devices not supporting copy. 2. Add *copy_offload* sysfs entry to enable and disable copy_offload in devices supporting simple copy. 3. Remove simple copy support for stacked devices. Changes from v1: 1. Fix memory leak in __blkdev_issue_copy 2. Unmark blk_check_copy inline 3. Fix line break in blk_check_copy_eod 4. Remove p checks and make code more readable 5. Don't use bio_set_op_attrs; remove op and set bi_opf directly 6. Use struct_size to calculate total_size 7. Fix
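The emulation path described in the cover letter (read each source range into memory, then write it sequentially at the destination) can be sketched on a plain byte array. This is an illustrative model only, not the patchset's actual bio plumbing; the sector size and names are assumptions, and overlap between source and destination is not handled here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define SECTOR_SIZE 512

/* Illustrative analogue of the patchset's source range:
 * a starting sector and a length in sectors. */
struct src_range {
	uint64_t sector;
	uint64_t nr_sects;
};

/* Copy each source range, in order, to a destination that advances
 * by the length of every range copied -- the same shape as the
 * emulated read-into-buffer / write-out loop, but on memory. */
static void emulate_copy(uint8_t *disk, const struct src_range *rs,
			 size_t nr, uint64_t dst_sector)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		memmove(disk + dst_sector * SECTOR_SIZE,
			disk + rs[i].sector * SECTOR_SIZE,
			rs[i].nr_sects * SECTOR_SIZE);
		dst_sector += rs[i].nr_sects;
	}
}
```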
Re: [PATCH v2] infiniband: Fix a use after free in isert_connect_request
On 3/22/2021 6:13 PM, Lv Yunlong wrote: The device is obtained by isert_device_get() with a refcount of 1, and is assigned to isert_conn by isert_conn->device = device. When isert_create_qp() failed, the device was freed with isert_device_put(). Later, the device was used in isert_free_login_buf(isert_conn) via the isert_conn->device->ib_device statement. This patch frees the device in the correct order. Signed-off-by: Lv Yunlong --- drivers/infiniband/ulp/isert/ib_isert.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) looks good, Reviewed-by: Max Gurtovoy
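The underlying rule the fix restores is the usual refcount ordering: anything that still dereferences isert_conn->device (here, freeing the login buffer via device->ib_device) must run before the final isert_device_put(), which may free the device. A toy model of that ordering; the names are hypothetical stand-ins, not the isert code.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for the isert device refcount. The bug was
 * calling the put (possibly freeing the device) and *then*
 * touching device->ib_device to free the login buffer. */
struct toy_dev {
	int refcnt;
	bool freed;
};

static void toy_dev_put(struct toy_dev *d)
{
	if (--d->refcnt == 0)
		d->freed = true;	/* models kfree(device) */
}

/* Correct teardown: use the device for cleanup first, drop the
 * reference last. Returns true when no use-after-free occurred. */
static bool teardown_in_order(struct toy_dev *d)
{
	/* models isert_free_login_buf() touching d->ib_device */
	bool touched_freed_memory = d->freed;

	toy_dev_put(d);
	return !touched_freed_memory;
}
```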
Re: [PATCH 9/9] vfio/pci: export igd support into vendor vfio_pci driver
On 3/11/2021 1:37 PM, Christoph Hellwig wrote: On Wed, Mar 10, 2021 at 08:31:27AM -0400, Jason Gunthorpe wrote: Yes, that needs more refactoring. I'm viewing this series as a "statement of intent" and once we commit to doing this we can go through the bigger effort to split up vfio_pci_core and tidy its API. Obviously this is a big project, given the past comments I don't want to spend more effort here until we see a community consensus emerge that this is what we want to do. If we build a sub-driver instead the work is all in the trash bin. So my viewpoint here is that this work doesn't seem very useful for the existing subdrivers given how much compat pain there is. It definitely is the right way to go for a new driver. This brings us back to the first series that introduced the mlx5_vfio_pci driver without the igd and nvlink2 drivers. If we leave the subdrivers/extensions in vfio_pci_core it won't be logically right. If we put it in vfio_pci we'll need to maintain it and extend it if new functionality or bugs are reported. If we create new drivers for these devices, we'll use the compat layer and hopefully after a few years these users will be using only my_driver_vfio_pci and we'll be able to remove the compat layer (which is not so big). We tried almost all the options and now we need to progress and agree on the design. The effort is big and I hope we won't continue with experiments without a clear view of what exactly should be done. So we need a plan for how Jason's series and my series can live together and how we can start merging them gradually.
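A rough model of the compat layer being discussed: with CONFIG_VFIO_PCI_DRIVER_COMPAT=y, vfio_pci.ko keeps forwarding known device IDs to the vendor-specific behavior, and with it disabled the user binds the vendor module (e.g. igd-vfio-pci) explicitly. This is a hypothetical sketch of the dispatch, not the actual vfio code; the table entries are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Hypothetical compat table: vendor IDs that vfio_pci.ko would
 * hand to a vendor-specific module for backward compatibility. */
struct compat_entry {
	uint32_t vendor;
	const char *driver;	/* module that really handles it */
};

static const struct compat_entry compat_table[] = {
	{ 0x8086, "igd-vfio-pci" },		/* Intel IGD */
	{ 0x10de, "nvlink2gpu-vfio-pci" },	/* NVIDIA NVLINK2 GPU */
};

/* Resolve which driver serves a device: a compat match wins,
 * anything else stays plain vfio-pci with no special behavior. */
static const char *compat_resolve(uint32_t vendor)
{
	size_t i, n = sizeof(compat_table) / sizeof(compat_table[0]);

	for (i = 0; i < n; i++)
		if (compat_table[i].vendor == vendor)
			return compat_table[i].driver;
	return "vfio-pci";
}
```

Dropping the compat layer later would then amount to emptying this table and requiring explicit binds.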
Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
On 3/11/2021 9:54 AM, Alexey Kardashevskiy wrote: On 11/03/2021 13:00, Jason Gunthorpe wrote: On Thu, Mar 11, 2021 at 12:42:56PM +1100, Alexey Kardashevskiy wrote: btw can the id list have only vendor ids and not have device ids? The PCI matcher is quite flexible, see the other patch from Max for the igd ah cool, do this for NVIDIA GPUs then please, I just discovered another P9 system sold with NVIDIA T4s which is not in your list. I think it will make things easier down the road if you maintain an exact list Then why don't you do the exact list for Intel IGD? The commit log does not explain this detail. I expect the Intel team to review this series and give a more precise list. I did the best I could in finding a proper configuration for igd. But best practice is to be as narrow as possible as I hope this will eventually impact module autoloading and other details. The amount of device specific knowledge is too little to tie it up to device ids, it is a generic PCI driver with quirks. We do not have separate drivers for hardware which requires quirks. It provides its own capability structure exposed to userspace, that is absolutely not a "quirk" And how do you hope this should impact autoloading? I would like to autoload the most specific vfio driver for the target hardware. Is there an idea how it is going to work? For example, the Intel IGD driver and vfio-pci-igd - how should the system pick one? If there is no MODULE_DEVICE_TABLE in vfio-pci-xxx, is the user supposed to try binding all vfio-pci-xxx drivers until one binds? For example, in my local setup I did a POC patch that converts some drivers to be "manual binding only drivers". So the IGD driver will have the priority, the user will unbind the device from it, load igd-vfio-pci, bind the device to it, and the probe completes. For now we separated the driver core stuff until we all agree that this series is the right way to go + we also make sure it's backward compatible. 
If you someday need to support new GPU HW that needs a different VFIO driver then you are really stuck, because things become indeterminate if there are two drivers claiming the ID. We don't have the concept of "best match"; the driver core works on exact match.
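The "exact match" point can be sketched with a toy ID-table matcher: a PCI match entry may wildcard the device ID (the kernel uses PCI_ANY_ID for this), but once two drivers' tables cover the same device the winner is indeterminate, which is why each vendor driver should keep its table narrow. A hypothetical sketch, not the driver-core code.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define ANY_ID 0xffffffffu	/* stands in for PCI_ANY_ID */

struct pci_id {
	uint32_t vendor;
	uint32_t device;
};

/* An entry matches when each field is either the wildcard or an
 * exact value -- there is no notion of one match being "better"
 * than another, so overlapping tables in two drivers leave the
 * binding order up to chance. */
static int id_matches(const struct pci_id *tbl, size_t n,
		      uint32_t vendor, uint32_t device)
{
	size_t i;

	for (i = 0; i < n; i++)
		if ((tbl[i].vendor == ANY_ID || tbl[i].vendor == vendor) &&
		    (tbl[i].device == ANY_ID || tbl[i].device == device))
			return 1;
	return 0;
}
```

A vendor-only entry like `{ 0x10de, ANY_ID }` would claim every NVIDIA device, including ones (like the T4 mentioned above) the driver was never tested on; an exact list avoids that.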
Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
On 3/10/2021 4:19 PM, Alexey Kardashevskiy wrote: On 10/03/2021 23:57, Max Gurtovoy wrote: On 3/10/2021 8:39 AM, Alexey Kardashevskiy wrote: On 09/03/2021 19:33, Max Gurtovoy wrote: The new drivers introduced are nvlink2gpu_vfio_pci.ko and npu2_vfio_pci.ko. The first will be responsible for providing special extensions for NVIDIA GPUs with NVLINK2 support for P9 platform (and others in the future). The last will be responsible for POWER9 NPU2 unit (NVLink2 host bus adapter). Also, preserve backward compatibility for users that were binding NVLINK2 devices to vfio_pci.ko. Hopefully this compatibility layer will be dropped in the future Signed-off-by: Max Gurtovoy --- drivers/vfio/pci/Kconfig | 28 +++- drivers/vfio/pci/Makefile | 7 +- .../pci/{vfio_pci_npu2.c => npu2_vfio_pci.c} | 144 - drivers/vfio/pci/npu2_vfio_pci.h | 24 +++ ...pci_nvlink2gpu.c => nvlink2gpu_vfio_pci.c} | 149 +- drivers/vfio/pci/nvlink2gpu_vfio_pci.h | 24 +++ drivers/vfio/pci/vfio_pci.c | 61 ++- drivers/vfio/pci/vfio_pci_core.c | 18 --- drivers/vfio/pci/vfio_pci_core.h | 14 -- 9 files changed, 422 insertions(+), 47 deletions(-) rename drivers/vfio/pci/{vfio_pci_npu2.c => npu2_vfio_pci.c} (64%) create mode 100644 drivers/vfio/pci/npu2_vfio_pci.h rename drivers/vfio/pci/{vfio_pci_nvlink2gpu.c => nvlink2gpu_vfio_pci.c} (67%) create mode 100644 drivers/vfio/pci/nvlink2gpu_vfio_pci.h diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig index 829e90a2e5a3..88c89863a205 100644 --- a/drivers/vfio/pci/Kconfig +++ b/drivers/vfio/pci/Kconfig @@ -48,8 +48,30 @@ config VFIO_PCI_IGD To enable Intel IGD assignment through vfio-pci, say Y. -config VFIO_PCI_NVLINK2 - def_bool y +config VFIO_PCI_NVLINK2GPU + tristate "VFIO support for NVIDIA NVLINK2 GPUs" depends on VFIO_PCI_CORE && PPC_POWERNV help - VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs + VFIO PCI driver for NVIDIA NVLINK2 GPUs with specific extensions + for P9 Witherspoon machine. 
+ +config VFIO_PCI_NPU2 + tristate "VFIO support for IBM NPU host bus adapter on P9" + depends on VFIO_PCI_NVLINK2GPU && PPC_POWERNV + help + VFIO PCI specific extensions for IBM NVLink2 host bus adapter on P9 + Witherspoon machine. + +config VFIO_PCI_DRIVER_COMPAT + bool "VFIO PCI backward compatibility for vendor specific extensions" + default y + depends on VFIO_PCI + help + Say Y here if you want to preserve VFIO PCI backward + compatibility. vfio_pci.ko will continue to automatically use + the NVLINK2, NPU2 and IGD VFIO drivers when it is attached to + a compatible device. + + When N is selected the user must bind explicity to the module + they want to handle the device and vfio_pci.ko will have no + device specific special behaviors. diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index f539f32c9296..86fb62e271fc 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -2,10 +2,15 @@ obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o obj-$(CONFIG_VFIO_PCI) += vfio-pci.o +obj-$(CONFIG_VFIO_PCI_NPU2) += npu2-vfio-pci.o +obj-$(CONFIG_VFIO_PCI_NVLINK2GPU) += nvlink2gpu-vfio-pci.o vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o -vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2gpu.o vfio_pci_npu2.o vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o vfio-pci-y := vfio_pci.o + +npu2-vfio-pci-y := npu2_vfio_pci.o + +nvlink2gpu-vfio-pci-y := nvlink2gpu_vfio_pci.o diff --git a/drivers/vfio/pci/vfio_pci_npu2.c b/drivers/vfio/pci/npu2_vfio_pci.c similarity index 64% rename from drivers/vfio/pci/vfio_pci_npu2.c rename to drivers/vfio/pci/npu2_vfio_pci.c index 717745256ab3..7071bda0f2b6 100644 --- a/drivers/vfio/pci/vfio_pci_npu2.c +++ b/drivers/vfio/pci/npu2_vfio_pci.c @@ -14,19 +14,28 @@ * Author: Alex Williamson */ +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include #include #include #include #include +#include #include #include 
#include #include "vfio_pci_core.h" +#include "npu2_vfio_pci.h" #define CREATE_TRACE_POINTS #include "npu2_trace.h" +#define DRIVER_VERSION "0.1" +#define DRIVER_AUTHOR "Alexey Kardashevskiy " +#define DRIVER_DESC "NPU2 VFIO PCI - User Level meta-driver for POWER9 NPU NVLink2 HBA" + EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_npu2_mmap); struct vfio_pci_npu2_data { @@ -36,6 +45,10 @@ struct vfio_pci_npu2_data { unsigned int link_speed; /* The link speed from DT's ibm,nvlink-speed */ }; +struct npu
Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
On 3/10/2021 8:39 AM, Alexey Kardashevskiy wrote: On 09/03/2021 19:33, Max Gurtovoy wrote: The new drivers introduced are nvlink2gpu_vfio_pci.ko and npu2_vfio_pci.ko. The first will be responsible for providing special extensions for NVIDIA GPUs with NVLINK2 support for P9 platform (and others in the future). The last will be responsible for POWER9 NPU2 unit (NVLink2 host bus adapter). Also, preserve backward compatibility for users that were binding NVLINK2 devices to vfio_pci.ko. Hopefully this compatibility layer will be dropped in the future Signed-off-by: Max Gurtovoy --- drivers/vfio/pci/Kconfig | 28 +++- drivers/vfio/pci/Makefile | 7 +- .../pci/{vfio_pci_npu2.c => npu2_vfio_pci.c} | 144 - drivers/vfio/pci/npu2_vfio_pci.h | 24 +++ ...pci_nvlink2gpu.c => nvlink2gpu_vfio_pci.c} | 149 +- drivers/vfio/pci/nvlink2gpu_vfio_pci.h | 24 +++ drivers/vfio/pci/vfio_pci.c | 61 ++- drivers/vfio/pci/vfio_pci_core.c | 18 --- drivers/vfio/pci/vfio_pci_core.h | 14 -- 9 files changed, 422 insertions(+), 47 deletions(-) rename drivers/vfio/pci/{vfio_pci_npu2.c => npu2_vfio_pci.c} (64%) create mode 100644 drivers/vfio/pci/npu2_vfio_pci.h rename drivers/vfio/pci/{vfio_pci_nvlink2gpu.c => nvlink2gpu_vfio_pci.c} (67%) create mode 100644 drivers/vfio/pci/nvlink2gpu_vfio_pci.h diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig index 829e90a2e5a3..88c89863a205 100644 --- a/drivers/vfio/pci/Kconfig +++ b/drivers/vfio/pci/Kconfig @@ -48,8 +48,30 @@ config VFIO_PCI_IGD To enable Intel IGD assignment through vfio-pci, say Y. -config VFIO_PCI_NVLINK2 - def_bool y +config VFIO_PCI_NVLINK2GPU + tristate "VFIO support for NVIDIA NVLINK2 GPUs" depends on VFIO_PCI_CORE && PPC_POWERNV help - VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs + VFIO PCI driver for NVIDIA NVLINK2 GPUs with specific extensions + for P9 Witherspoon machine. 
+ +config VFIO_PCI_NPU2 + tristate "VFIO support for IBM NPU host bus adapter on P9" + depends on VFIO_PCI_NVLINK2GPU && PPC_POWERNV + help + VFIO PCI specific extensions for IBM NVLink2 host bus adapter on P9 + Witherspoon machine. + +config VFIO_PCI_DRIVER_COMPAT + bool "VFIO PCI backward compatibility for vendor specific extensions" + default y + depends on VFIO_PCI + help + Say Y here if you want to preserve VFIO PCI backward + compatibility. vfio_pci.ko will continue to automatically use + the NVLINK2, NPU2 and IGD VFIO drivers when it is attached to + a compatible device. + + When N is selected the user must bind explicity to the module + they want to handle the device and vfio_pci.ko will have no + device specific special behaviors. diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index f539f32c9296..86fb62e271fc 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -2,10 +2,15 @@ obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o obj-$(CONFIG_VFIO_PCI) += vfio-pci.o +obj-$(CONFIG_VFIO_PCI_NPU2) += npu2-vfio-pci.o +obj-$(CONFIG_VFIO_PCI_NVLINK2GPU) += nvlink2gpu-vfio-pci.o vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o -vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2gpu.o vfio_pci_npu2.o vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o vfio-pci-y := vfio_pci.o + +npu2-vfio-pci-y := npu2_vfio_pci.o + +nvlink2gpu-vfio-pci-y := nvlink2gpu_vfio_pci.o diff --git a/drivers/vfio/pci/vfio_pci_npu2.c b/drivers/vfio/pci/npu2_vfio_pci.c similarity index 64% rename from drivers/vfio/pci/vfio_pci_npu2.c rename to drivers/vfio/pci/npu2_vfio_pci.c index 717745256ab3..7071bda0f2b6 100644 --- a/drivers/vfio/pci/vfio_pci_npu2.c +++ b/drivers/vfio/pci/npu2_vfio_pci.c @@ -14,19 +14,28 @@ * Author: Alex Williamson */ +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include #include #include #include #include +#include #include #include 
#include #include "vfio_pci_core.h" +#include "npu2_vfio_pci.h" #define CREATE_TRACE_POINTS #include "npu2_trace.h" +#define DRIVER_VERSION "0.1" +#define DRIVER_AUTHOR "Alexey Kardashevskiy " +#define DRIVER_DESC "NPU2 VFIO PCI - User Level meta-driver for POWER9 NPU NVLink2 HBA" + EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_npu2_mmap); struct vfio_pci_npu2_data { @@ -36,6 +45,10 @@ struct vfio_pci_npu2_data { unsigned int link_speed; /* The link speed from DT's ibm,nvlink-speed */ }; +struct npu2_vfio_pci_device { + struct vfio_pci_core_device vdev; +}; + static size_t vfio_pci_npu2_rw(struct
[PATCH 9/9] vfio/pci: export igd support into vendor vfio_pci driver
Create a new driver igd_vfio_pci.ko that will be responsible for providing special extensions for INTEL Graphics card (GVT-d). Also preserve backward compatibility with vfio_pci.ko vendor specific extensions. Signed-off-by: Max Gurtovoy --- drivers/vfio/pci/Kconfig | 5 +- drivers/vfio/pci/Makefile | 4 +- .../pci/{vfio_pci_igd.c => igd_vfio_pci.c}| 147 +- drivers/vfio/pci/igd_vfio_pci.h | 24 +++ drivers/vfio/pci/vfio_pci.c | 4 + drivers/vfio/pci/vfio_pci_core.c | 15 -- drivers/vfio/pci/vfio_pci_core.h | 9 -- 7 files changed, 176 insertions(+), 32 deletions(-) rename drivers/vfio/pci/{vfio_pci_igd.c => igd_vfio_pci.c} (62%) create mode 100644 drivers/vfio/pci/igd_vfio_pci.h diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig index 88c89863a205..09d85ba3e5b2 100644 --- a/drivers/vfio/pci/Kconfig +++ b/drivers/vfio/pci/Kconfig @@ -37,17 +37,14 @@ config VFIO_PCI_INTX def_bool y if !S390 config VFIO_PCI_IGD - bool "VFIO PCI extensions for Intel graphics (GVT-d)" + tristate "VFIO PCI extensions for Intel graphics (GVT-d)" depends on VFIO_PCI_CORE && X86 - default y help Support for Intel IGD specific extensions to enable direct assignment to virtual machines. This includes exposing an IGD specific firmware table and read-only copies of the host bridge and LPC bridge config space. - To enable Intel IGD assignment through vfio-pci, say Y. 
- config VFIO_PCI_NVLINK2GPU tristate "VFIO support for NVIDIA NVLINK2 GPUs" depends on VFIO_PCI_CORE && PPC_POWERNV diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index 86fb62e271fc..298b2fb3f075 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -4,9 +4,9 @@ obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o obj-$(CONFIG_VFIO_PCI) += vfio-pci.o obj-$(CONFIG_VFIO_PCI_NPU2) += npu2-vfio-pci.o obj-$(CONFIG_VFIO_PCI_NVLINK2GPU) += nvlink2gpu-vfio-pci.o +obj-$(CONFIG_VFIO_PCI_IGD) += igd-vfio-pci.o vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o -vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o vfio-pci-y := vfio_pci.o @@ -14,3 +14,5 @@ vfio-pci-y := vfio_pci.o npu2-vfio-pci-y := npu2_vfio_pci.o nvlink2gpu-vfio-pci-y := nvlink2gpu_vfio_pci.o + +igd-vfio-pci-y := igd_vfio_pci.o diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/igd_vfio_pci.c similarity index 62% rename from drivers/vfio/pci/vfio_pci_igd.c rename to drivers/vfio/pci/igd_vfio_pci.c index 2388c9722ed8..bbbc432bca82 100644 --- a/drivers/vfio/pci/vfio_pci_igd.c +++ b/drivers/vfio/pci/igd_vfio_pci.c @@ -10,19 +10,32 @@ * address is also virtualized to prevent user modification. 
*/ +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include #include #include +#include #include #include #include "vfio_pci_core.h" +#include "igd_vfio_pci.h" #define OPREGION_SIGNATURE "IntelGraphicsMem" #define OPREGION_SIZE (8 * 1024) #define OPREGION_PCI_ADDR 0xfc -static size_t vfio_pci_igd_rw(struct vfio_pci_core_device *vdev, char __user *buf, - size_t count, loff_t *ppos, bool iswrite) +#define DRIVER_VERSION "0.1" +#define DRIVER_AUTHOR "Alex Williamson " +#define DRIVER_DESC "IGD VFIO PCI - User Level meta-driver for Intel Graphics Processing Unit" + +struct igd_vfio_pci_device { + struct vfio_pci_core_device vdev; +}; + +static size_t vfio_pci_igd_rw(struct vfio_pci_core_device *vdev, + char __user *buf, size_t count, loff_t *ppos, bool iswrite) { unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS; void *base = vdev->region[i].data; @@ -261,7 +274,7 @@ static int vfio_pci_igd_cfg_init(struct vfio_pci_core_device *vdev) return 0; } -int vfio_pci_igd_init(struct vfio_pci_core_device *vdev) +static int vfio_pci_igd_init(struct vfio_pci_core_device *vdev) { int ret; @@ -275,3 +288,131 @@ int vfio_pci_igd_init(struct vfio_pci_core_device *vdev) return 0; } + +static void igd_vfio_pci_release(void *device_data) +{ + struct vfio_pci_core_device *vdev = device_data; + + mutex_lock(>reflck->lock); + if (!(--vdev->refcnt)) { + vfio_pci_vf_token_user_add(vdev, -1); + vfio_pci_core_spapr_eeh_release(vdev); + vfio_pci_core_disable(vdev); + } + mutex_unlock(>reflck->lock); + + module_put(THIS_MODULE); +} + +static int igd_vfio_pci_open(void *device_data) +{ + struct vfio_pci_core_device *vdev = device_data; + int ret = 0;
[PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
The new drivers introduced are nvlink2gpu_vfio_pci.ko and npu2_vfio_pci.ko. The first will be responsible for providing special extensions for NVIDIA GPUs with NVLINK2 support for P9 platform (and others in the future). The last will be responsible for POWER9 NPU2 unit (NVLink2 host bus adapter). Also, preserve backward compatibility for users that were binding NVLINK2 devices to vfio_pci.ko. Hopefully this compatibility layer will be dropped in the future Signed-off-by: Max Gurtovoy --- drivers/vfio/pci/Kconfig | 28 +++- drivers/vfio/pci/Makefile | 7 +- .../pci/{vfio_pci_npu2.c => npu2_vfio_pci.c} | 144 - drivers/vfio/pci/npu2_vfio_pci.h | 24 +++ ...pci_nvlink2gpu.c => nvlink2gpu_vfio_pci.c} | 149 +- drivers/vfio/pci/nvlink2gpu_vfio_pci.h| 24 +++ drivers/vfio/pci/vfio_pci.c | 61 ++- drivers/vfio/pci/vfio_pci_core.c | 18 --- drivers/vfio/pci/vfio_pci_core.h | 14 -- 9 files changed, 422 insertions(+), 47 deletions(-) rename drivers/vfio/pci/{vfio_pci_npu2.c => npu2_vfio_pci.c} (64%) create mode 100644 drivers/vfio/pci/npu2_vfio_pci.h rename drivers/vfio/pci/{vfio_pci_nvlink2gpu.c => nvlink2gpu_vfio_pci.c} (67%) create mode 100644 drivers/vfio/pci/nvlink2gpu_vfio_pci.h diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig index 829e90a2e5a3..88c89863a205 100644 --- a/drivers/vfio/pci/Kconfig +++ b/drivers/vfio/pci/Kconfig @@ -48,8 +48,30 @@ config VFIO_PCI_IGD To enable Intel IGD assignment through vfio-pci, say Y. -config VFIO_PCI_NVLINK2 - def_bool y +config VFIO_PCI_NVLINK2GPU + tristate "VFIO support for NVIDIA NVLINK2 GPUs" depends on VFIO_PCI_CORE && PPC_POWERNV help - VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs + VFIO PCI driver for NVIDIA NVLINK2 GPUs with specific extensions + for P9 Witherspoon machine. 
+ +config VFIO_PCI_NPU2 + tristate "VFIO support for IBM NPU host bus adapter on P9" + depends on VFIO_PCI_NVLINK2GPU && PPC_POWERNV + help + VFIO PCI specific extensions for IBM NVLink2 host bus adapter on P9 + Witherspoon machine. + +config VFIO_PCI_DRIVER_COMPAT + bool "VFIO PCI backward compatibility for vendor specific extensions" + default y + depends on VFIO_PCI + help + Say Y here if you want to preserve VFIO PCI backward + compatibility. vfio_pci.ko will continue to automatically use + the NVLINK2, NPU2 and IGD VFIO drivers when it is attached to + a compatible device. + + When N is selected the user must bind explicity to the module + they want to handle the device and vfio_pci.ko will have no + device specific special behaviors. diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index f539f32c9296..86fb62e271fc 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -2,10 +2,15 @@ obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o obj-$(CONFIG_VFIO_PCI) += vfio-pci.o +obj-$(CONFIG_VFIO_PCI_NPU2) += npu2-vfio-pci.o +obj-$(CONFIG_VFIO_PCI_NVLINK2GPU) += nvlink2gpu-vfio-pci.o vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o -vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2gpu.o vfio_pci_npu2.o vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o vfio-pci-y := vfio_pci.o + +npu2-vfio-pci-y := npu2_vfio_pci.o + +nvlink2gpu-vfio-pci-y := nvlink2gpu_vfio_pci.o diff --git a/drivers/vfio/pci/vfio_pci_npu2.c b/drivers/vfio/pci/npu2_vfio_pci.c similarity index 64% rename from drivers/vfio/pci/vfio_pci_npu2.c rename to drivers/vfio/pci/npu2_vfio_pci.c index 717745256ab3..7071bda0f2b6 100644 --- a/drivers/vfio/pci/vfio_pci_npu2.c +++ b/drivers/vfio/pci/npu2_vfio_pci.c @@ -14,19 +14,28 @@ * Author: Alex Williamson */ +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include #include #include #include #include +#include #include #include 
#include #include "vfio_pci_core.h" +#include "npu2_vfio_pci.h" #define CREATE_TRACE_POINTS #include "npu2_trace.h" +#define DRIVER_VERSION "0.1" +#define DRIVER_AUTHOR "Alexey Kardashevskiy " +#define DRIVER_DESC "NPU2 VFIO PCI - User Level meta-driver for POWER9 NPU NVLink2 HBA" + EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_npu2_mmap); struct vfio_pci_npu2_data { @@ -36,6 +45,10 @@ struct vfio_pci_npu2_data { unsigned int link_speed; /* The link speed from DT's ibm,nvlink-speed */ }; +struct npu2_vfio_pci_device { + struct vfio_pci_core_device vdev; +}; + static size_t vfio_pci_npu2_rw(struct vfio_pci_core_device *vdev, char __user
[PATCH 7/9] vfio/pci_core: split nvlink2 to nvlink2gpu and npu2
This is a preparation for moving vendor specific code from vfio_pci_core to vendor specific vfio_pci drivers. The next step will be creating a dedicated module to NVIDIA NVLINK2 devices with P9 extensions and a dedicated module for Power9 NPU NVLink2 HBAs. Signed-off-by: Max Gurtovoy --- drivers/vfio/pci/Makefile | 2 +- drivers/vfio/pci/npu2_trace.h | 50 .../vfio/pci/{trace.h => nvlink2gpu_trace.h} | 27 +-- drivers/vfio/pci/vfio_pci_core.c | 2 +- drivers/vfio/pci/vfio_pci_core.h | 4 +- drivers/vfio/pci/vfio_pci_npu2.c | 222 ++ ...io_pci_nvlink2.c => vfio_pci_nvlink2gpu.c} | 201 +--- 7 files changed, 280 insertions(+), 228 deletions(-) create mode 100644 drivers/vfio/pci/npu2_trace.h rename drivers/vfio/pci/{trace.h => nvlink2gpu_trace.h} (72%) create mode 100644 drivers/vfio/pci/vfio_pci_npu2.c rename drivers/vfio/pci/{vfio_pci_nvlink2.c => vfio_pci_nvlink2gpu.c} (59%) diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index 16e7d77d63ce..f539f32c9296 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -5,7 +5,7 @@ obj-$(CONFIG_VFIO_PCI) += vfio-pci.o vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o -vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o +vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2gpu.o vfio_pci_npu2.o vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o vfio-pci-y := vfio_pci.o diff --git a/drivers/vfio/pci/npu2_trace.h b/drivers/vfio/pci/npu2_trace.h new file mode 100644 index ..c8a1110132dc --- /dev/null +++ b/drivers/vfio/pci/npu2_trace.h @@ -0,0 +1,50 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * VFIO PCI mmap/mmap_fault tracepoints + * + * Copyright (C) 2018 IBM Corp. All rights reserved. 
+ * Author: Alexey Kardashevskiy + */ + +#undef TRACE_SYSTEM +#define TRACE_SYSTEM vfio_pci + +#if !defined(_TRACE_VFIO_PCI_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_VFIO_PCI_H + +#include + +TRACE_EVENT(vfio_pci_npu2_mmap, + TP_PROTO(struct pci_dev *pdev, unsigned long hpa, unsigned long ua, + unsigned long size, int ret), + TP_ARGS(pdev, hpa, ua, size, ret), + + TP_STRUCT__entry( + __field(const char *, name) + __field(unsigned long, hpa) + __field(unsigned long, ua) + __field(unsigned long, size) + __field(int, ret) + ), + + TP_fast_assign( + __entry->name = dev_name(&pdev->dev), + __entry->hpa = hpa; + __entry->ua = ua; + __entry->size = size; + __entry->ret = ret; + ), + + TP_printk("%s: %lx -> %lx size=%lx ret=%d", __entry->name, __entry->hpa, + __entry->ua, __entry->size, __entry->ret) +); + +#endif /* _TRACE_VFIO_PCI_H */ + +#undef TRACE_INCLUDE_PATH +#define TRACE_INCLUDE_PATH ../../drivers/vfio/pci +#undef TRACE_INCLUDE_FILE +#define TRACE_INCLUDE_FILE npu2_trace + +/* This part must be outside protection */ +#include diff --git a/drivers/vfio/pci/trace.h b/drivers/vfio/pci/nvlink2gpu_trace.h similarity index 72% rename from drivers/vfio/pci/trace.h rename to drivers/vfio/pci/nvlink2gpu_trace.h index b2aa986ab9ed..2392b9d4c6c9 100644 --- a/drivers/vfio/pci/trace.h +++ b/drivers/vfio/pci/nvlink2gpu_trace.h @@ -62,37 +62,12 @@ TRACE_EVENT(vfio_pci_nvgpu_mmap, __entry->ua, __entry->size, __entry->ret) ); -TRACE_EVENT(vfio_pci_npu2_mmap, - TP_PROTO(struct pci_dev *pdev, unsigned long hpa, unsigned long ua, - unsigned long size, int ret), - TP_ARGS(pdev, hpa, ua, size, ret), - - TP_STRUCT__entry( - __field(const char *, name) - __field(unsigned long, hpa) - __field(unsigned long, ua) - __field(unsigned long, size) - __field(int, ret) - ), - - TP_fast_assign( - __entry->name = dev_name(&pdev->dev), - __entry->hpa = hpa; - __entry->ua = ua; - __entry->size = size; - __entry->ret = ret; - ), - - TP_printk("%s: %lx -> %lx size=%lx ret=%d", __entry->name,
__entry->hpa, - __entry->ua, __entry->size, __entry->ret) -); - #endif /* _TRACE_VFIO_PCI_H */ #undef TRACE_INCLUDE_PATH #define TRACE_INCLUDE_PATH ../../drivers/vfio/pci #undef TRACE_INCLUDE_FILE -#define TRACE_INCLUDE_FILE trace +#define TRACE_INCLUDE_FILE nvlink2gpu_trace /* This part must be outside protection */ #include diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index ba5dd4321487..4de8e352
[PATCH 6/9] vfio-pci-core: export vfio_pci_register_dev_region function
This function will be used to allow vendor drivers to register regions to be used and accessed by the core subsystem driver. This way, the core will use the region ops that are vendor specific and managed by the vendor vfio-pci driver. Next step that can be made is to move the logic of igd and nvlink2 to a dedicated module instead of managing their vendor specific extensions in the core driver. Signed-off-by: Max Gurtovoy --- drivers/vfio/pci/vfio_pci_core.c | 1 + drivers/vfio/pci/vfio_pci_core.h | 5 + 2 files changed, 6 insertions(+) diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 7b6be1e4646f..ba5dd4321487 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -711,6 +711,7 @@ int vfio_pci_register_dev_region(struct vfio_pci_core_device *vdev, return 0; } +EXPORT_SYMBOL_GPL(vfio_pci_register_dev_region); struct vfio_devices { struct vfio_device **devices; diff --git a/drivers/vfio/pci/vfio_pci_core.h b/drivers/vfio/pci/vfio_pci_core.h index 46eb3443125b..60b42df6c519 100644 --- a/drivers/vfio/pci/vfio_pci_core.h +++ b/drivers/vfio/pci/vfio_pci_core.h @@ -257,4 +257,9 @@ int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn); extern const struct pci_error_handlers vfio_pci_core_err_handlers; +int vfio_pci_register_dev_region(struct vfio_pci_core_device *vdev, + unsigned int type, unsigned int subtype, + const struct vfio_pci_regops *ops, + size_t size, u32 flags, void *data); + #endif /* VFIO_PCI_CORE_H */ -- 2.25.4
[PATCH 5/9] vfio/pci: introduce vfio_pci_device structure
This structure will hold the specific attributes for the generic vfio_pci.ko driver. It will be allocated by the vfio_pci driver, which will register with the vfio subsystem using vfio_pci_core_register_device and unregister using vfio_pci_core_unregister_device. In this way every future vfio_pci driver will be able to use this mechanism to set vendor specific attributes in its vendor specific structure and register with the subsystem core while utilizing the vfio_pci_core library. This is standard Linux subsystem behaviour; it will also make it easier for vfio_pci drivers to extend vfio_device_ops callbacks and to use the container_of mechanism (instead of passing void pointers around the stack). Signed-off-by: Max Gurtovoy --- drivers/vfio/pci/vfio_pci.c | 31 + drivers/vfio/pci/vfio_pci_core.c | 39 +--- drivers/vfio/pci/vfio_pci_core.h | 5 ++-- 3 files changed, 45 insertions(+), 30 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index 447c31f4e64e..dbc0a6559914 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include #include @@ -31,6 +32,10 @@ #define DRIVER_AUTHOR "Alex Williamson " #define DRIVER_DESC "VFIO PCI - User Level meta-driver" +struct vfio_pci_device { + struct vfio_pci_core_device vdev; +}; + static char ids[1024] __initdata; module_param_string(ids, ids, sizeof(ids), 0); MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio driver, format is \"vendor:device[:subvendor[:subdevice[:class[:class_mask\" and multiple comma separated entries can be specified"); @@ -139,21 +144,37 @@ static const struct vfio_device_ops vfio_pci_ops = { static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id) { - struct vfio_pci_core_device *vdev; + struct vfio_pci_device *vpdev; + int ret; if (vfio_pci_is_denylisted(pdev)) return -EINVAL; - vdev = vfio_create_pci_device(pdev, &vfio_pci_ops); - if (IS_ERR(vdev)) -
return PTR_ERR(vdev); + vpdev = kzalloc(sizeof(*vpdev), GFP_KERNEL); + if (!vpdev) + return -ENOMEM; + + ret = vfio_pci_core_register_device(&vpdev->vdev, pdev, &vfio_pci_ops); + if (ret) + goto out_free; return 0; + +out_free: + kfree(vpdev); + return ret; } static void vfio_pci_remove(struct pci_dev *pdev) { - vfio_destroy_pci_device(pdev); + struct vfio_device *vdev = dev_get_drvdata(&pdev->dev); + struct vfio_pci_core_device *core_vpdev = vfio_device_data(vdev); + struct vfio_pci_device *vpdev; + + vpdev = container_of(core_vpdev, struct vfio_pci_device, vdev); + + vfio_pci_core_unregister_device(core_vpdev); + kfree(vpdev); } static int vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn) diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 878a3609b916..7b6be1e4646f 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -1837,15 +1837,15 @@ static int vfio_pci_bus_notifier(struct notifier_block *nb, return 0; } -struct vfio_pci_core_device *vfio_create_pci_device(struct pci_dev *pdev, +int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev, + struct pci_dev *pdev, const struct vfio_device_ops *vfio_pci_ops) { - struct vfio_pci_core_device *vdev; struct iommu_group *group; int ret; if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL) - return ERR_PTR(-EINVAL); + return -EINVAL; /* * Prevent binding to PFs with VFs enabled, the VFs might be in use @@ -1857,18 +1857,12 @@ struct vfio_pci_core_device *vfio_create_pci_device(struct pci_dev *pdev, */ if (pci_num_vf(pdev)) { pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n"); - return ERR_PTR(-EBUSY); + return -EBUSY; } group = vfio_iommu_group_get(&pdev->dev); if (!group) - return ERR_PTR(-EINVAL); - - vdev = kzalloc(sizeof(*vdev), GFP_KERNEL); - if (!vdev) { - ret = -ENOMEM; - goto out_group_put; - } + return -EINVAL; vdev->pdev = pdev; vdev->vfio_pci_ops = vfio_pci_ops; @@ -1884,7 +1878,7 @@ struct vfio_pci_core_device *vfio_create_pci_device(struct
pci_dev *pdev, ret = vfio_add_group_dev(&pdev->dev, vfio_pci_ops, vdev); if (ret) - goto out_free; + goto out_group_put; ret = vfio_pci_reflck_attach(vdev); if (ret) @@ -1928,7 +1922,7 @@ struct vfio_pci_core_device *vfio_create_pci_device(struct
[PATCH 4/9] vfio-pci: introduce vfio_pci_core subsystem driver
Split the vfio_pci driver into two parts, the 'struct pci_driver' (vfio_pci) and a library of code (vfio_pci_core) that helps create a VFIO device on top of a PCI device. As before, vfio_pci.ko continues to present the same interface under sysfs and this change should have no functional impact. vfio_pci_core exposes an interface that is similar to a typical Linux subsystem, in that a pci_driver doing probe() can set up a number of details and then create the VFIO char device. Allowing another module to provide the pci_driver allows that module to customize how VFIO is set up, inject its own operations, and easily extend vendor specific functionality. This is complementary to how VFIO's mediated devices work. Instead of custom device lifecycle management and a special bus, drivers using this approach will rely on the normal driver core lifecycle (e.g. bind/unbind) management, and this is optimized to effectively support customization that makes only small modifications to what vfio_pci would do normally. This approach is also a pluggable alternative for the hard wired CONFIG_VFIO_PCI_IGD and CONFIG_VFIO_PCI_NVLINK2 "drivers" that are built into vfio-pci. Using this work, all of that code can be moved to dedicated device-specific modules and cleanly split out of the generic vfio_pci driver. Below is an example of adding a new driver to the vfio-pci subsystem:

                     +------------------+
                     |                  |
                     |       VFIO       |
                     |                  |
                     +------------------+

                     +------------------+
                     |                  |
                     |  VFIO_PCI_CORE   |
                     |                  |
                     +------------------+

 +--------------+  +-----------------+  +----------------+
 |              |  |                 |  |                |
 |   VFIO_PCI   |  |  MLX5_VFIO_PCI  |  |  IGD_VFIO_PCI  |
 |              |  |                 |  |                |
 +--------------+  +-----------------+  +----------------+

In this way mlx5_vfio_pci will use vfio_pci_core to register with the vfio subsystem and also use the generic PCI functionality exported from it. Additionally it will add the needed vendor specific logic for HW specific features such as Live Migration. Likewise, igd_vfio_pci will add special extensions for Intel Graphics cards (GVT-d).
Signed-off-by: Max Gurtovoy --- drivers/vfio/pci/Kconfig | 22 ++- drivers/vfio/pci/Makefile| 13 +- drivers/vfio/pci/vfio_pci.c | 247 drivers/vfio/pci/vfio_pci_core.c | 318 --- drivers/vfio/pci/vfio_pci_core.h | 113 +++ 5 files changed, 423 insertions(+), 290 deletions(-) create mode 100644 drivers/vfio/pci/vfio_pci.c diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig index ac3c1dd3edef..829e90a2e5a3 100644 --- a/drivers/vfio/pci/Kconfig +++ b/drivers/vfio/pci/Kconfig @@ -1,6 +1,6 @@ # SPDX-License-Identifier: GPL-2.0-only -config VFIO_PCI - tristate "VFIO support for PCI devices" +config VFIO_PCI_CORE + tristate "VFIO core support for PCI devices" depends on VFIO && PCI && EVENTFD select VFIO_VIRQFD select IRQ_BYPASS_MANAGER @@ -10,9 +10,17 @@ config VFIO_PCI If you don't know what to do here, say N. +config VFIO_PCI + tristate "VFIO support for PCI devices" + depends on VFIO_PCI_CORE + help + This provides a generic PCI support using the VFIO framework. + + If you don't know what to do here, say N. + config VFIO_PCI_VGA bool "VFIO PCI support for VGA devices" - depends on VFIO_PCI && X86 && VGA_ARB + depends on VFIO_PCI_CORE && X86 && VGA_ARB help Support for VGA extension to VFIO PCI. This exposes an additional region on VGA devices for accessing legacy VGA addresses used by @@ -21,16 +29,16 @@ config VFIO_PCI_VGA If you don't know what to do here, say N. config VFIO_PCI_MMAP - depends on VFIO_PCI + depends on VFIO_PCI_CORE def_bool y if !S390 config VFIO_PCI_INTX - depends on VFIO_PCI + depends on VFIO_PCI_CORE def_bool y if !S390 config VFIO_PCI_IGD bool "VFIO PCI extensions for Intel graphics (GVT-d)" - depends on VFIO_PCI && X86 + depends on VFIO_PCI_CORE && X86 default y help Support for Intel IGD specific extensions to enable direct @@ -42,6 +50,6 @@ config VFIO_PCI_IGD config VFIO_PCI_NVLINK2 def_bool y - depends on VFIO_PCI &
[PATCH 3/9] vfio-pci: rename vfio_pci_device to vfio_pci_core_device
This is a preparation patch for separating the vfio_pci driver to a subsystem driver and a generic pci driver. This patch doesn't change any logic. The new vfio_pci_core_device structure will be the main structure of the core driver and later on vfio_pci_device structure will be the main structure of the generic vfio_pci driver. Signed-off-by: Max Gurtovoy --- drivers/vfio/pci/vfio_pci_config.c | 68 +++--- drivers/vfio/pci/vfio_pci_core.c| 90 ++--- drivers/vfio/pci/vfio_pci_core.h| 76 drivers/vfio/pci/vfio_pci_igd.c | 14 ++--- drivers/vfio/pci/vfio_pci_intrs.c | 40 ++--- drivers/vfio/pci/vfio_pci_nvlink2.c | 20 +++ drivers/vfio/pci/vfio_pci_rdwr.c| 16 ++--- drivers/vfio/pci/vfio_pci_zdev.c| 10 ++-- 8 files changed, 167 insertions(+), 167 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index 5e9d24992207..122918d0f733 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -108,9 +108,9 @@ static const u16 pci_ext_cap_length[PCI_EXT_CAP_ID_MAX + 1] = { struct perm_bits { u8 *virt; /* read/write virtual data, not hw */ u8 *write; /* writeable bits */ - int (*readfn)(struct vfio_pci_device *vdev, int pos, int count, + int (*readfn)(struct vfio_pci_core_device *vdev, int pos, int count, struct perm_bits *perm, int offset, __le32 *val); - int (*writefn)(struct vfio_pci_device *vdev, int pos, int count, + int (*writefn)(struct vfio_pci_core_device *vdev, int pos, int count, struct perm_bits *perm, int offset, __le32 val); }; @@ -171,7 +171,7 @@ static int vfio_user_config_write(struct pci_dev *pdev, int offset, return ret; } -static int vfio_default_config_read(struct vfio_pci_device *vdev, int pos, +static int vfio_default_config_read(struct vfio_pci_core_device *vdev, int pos, int count, struct perm_bits *perm, int offset, __le32 *val) { @@ -197,7 +197,7 @@ static int vfio_default_config_read(struct vfio_pci_device *vdev, int pos, return count; } -static int 
vfio_default_config_write(struct vfio_pci_device *vdev, int pos, +static int vfio_default_config_write(struct vfio_pci_core_device *vdev, int pos, int count, struct perm_bits *perm, int offset, __le32 val) { @@ -244,7 +244,7 @@ static int vfio_default_config_write(struct vfio_pci_device *vdev, int pos, } /* Allow direct read from hardware, except for capability next pointer */ -static int vfio_direct_config_read(struct vfio_pci_device *vdev, int pos, +static int vfio_direct_config_read(struct vfio_pci_core_device *vdev, int pos, int count, struct perm_bits *perm, int offset, __le32 *val) { @@ -269,7 +269,7 @@ static int vfio_direct_config_read(struct vfio_pci_device *vdev, int pos, } /* Raw access skips any kind of virtualization */ -static int vfio_raw_config_write(struct vfio_pci_device *vdev, int pos, +static int vfio_raw_config_write(struct vfio_pci_core_device *vdev, int pos, int count, struct perm_bits *perm, int offset, __le32 val) { @@ -282,7 +282,7 @@ static int vfio_raw_config_write(struct vfio_pci_device *vdev, int pos, return count; } -static int vfio_raw_config_read(struct vfio_pci_device *vdev, int pos, +static int vfio_raw_config_read(struct vfio_pci_core_device *vdev, int pos, int count, struct perm_bits *perm, int offset, __le32 *val) { @@ -296,7 +296,7 @@ static int vfio_raw_config_read(struct vfio_pci_device *vdev, int pos, } /* Virt access uses only virtualization */ -static int vfio_virt_config_write(struct vfio_pci_device *vdev, int pos, +static int vfio_virt_config_write(struct vfio_pci_core_device *vdev, int pos, int count, struct perm_bits *perm, int offset, __le32 val) { @@ -304,7 +304,7 @@ static int vfio_virt_config_write(struct vfio_pci_device *vdev, int pos, return count; } -static int vfio_virt_config_read(struct vfio_pci_device *vdev, int pos, +static int vfio_virt_config_read(struct vfio_pci_core_device *vdev, int pos, int count, struct perm_bits *perm, int offset, __le32 *val) { @@ -396,7 +396,7 @@ static inline void p_setd(struct 
perm_bits *p, int off, u32 virt, u32 write) } /* Caller should hold memory_lock semaphore */ -bool __vfio_pci_memory_enabled(struct vfio_pci_device *vdev) +bool __vfio_pci_memory_enabled(struct vfio_pci_core_device
[PATCH 2/9] vfio-pci: rename vfio_pci_private.h to vfio_pci_core.h
This is a preparation patch for separating the vfio_pci driver to a subsystem driver and a generic pci driver. This patch doesn't change any logic. Signed-off-by: Max Gurtovoy --- drivers/vfio/pci/vfio_pci_config.c | 2 +- drivers/vfio/pci/vfio_pci_core.c | 2 +- drivers/vfio/pci/{vfio_pci_private.h => vfio_pci_core.h} | 6 +++--- drivers/vfio/pci/vfio_pci_igd.c | 2 +- drivers/vfio/pci/vfio_pci_intrs.c| 2 +- drivers/vfio/pci/vfio_pci_nvlink2.c | 2 +- drivers/vfio/pci/vfio_pci_rdwr.c | 2 +- drivers/vfio/pci/vfio_pci_zdev.c | 2 +- 8 files changed, 10 insertions(+), 10 deletions(-) rename drivers/vfio/pci/{vfio_pci_private.h => vfio_pci_core.h} (98%) diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index a402adee8a21..5e9d24992207 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -26,7 +26,7 @@ #include #include -#include "vfio_pci_private.h" +#include "vfio_pci_core.h" /* Fake capability ID for standard config space */ #define PCI_CAP_ID_BASIC 0 diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 65e7e6b44578..bd587db04625 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -28,7 +28,7 @@ #include #include -#include "vfio_pci_private.h" +#include "vfio_pci_core.h" #define DRIVER_VERSION "0.2" #define DRIVER_AUTHOR "Alex Williamson " diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_core.h similarity index 98% rename from drivers/vfio/pci/vfio_pci_private.h rename to drivers/vfio/pci/vfio_pci_core.h index 9cd1882a05af..b9ac7132b84a 100644 --- a/drivers/vfio/pci/vfio_pci_private.h +++ b/drivers/vfio/pci/vfio_pci_core.h @@ -15,8 +15,8 @@ #include #include -#ifndef VFIO_PCI_PRIVATE_H -#define VFIO_PCI_PRIVATE_H +#ifndef VFIO_PCI_CORE_H +#define VFIO_PCI_CORE_H #define VFIO_PCI_OFFSET_SHIFT 40 @@ -225,4 +225,4 @@ static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev, } #endif 
-#endif /* VFIO_PCI_PRIVATE_H */ +#endif /* VFIO_PCI_CORE_H */ diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c index 53d97f459252..0c599cd33d01 100644 --- a/drivers/vfio/pci/vfio_pci_igd.c +++ b/drivers/vfio/pci/vfio_pci_igd.c @@ -15,7 +15,7 @@ #include #include -#include "vfio_pci_private.h" +#include "vfio_pci_core.h" #define OPREGION_SIGNATURE "IntelGraphicsMem" #define OPREGION_SIZE (8 * 1024) diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c index 869dce5f134d..df1e8c8c274c 100644 --- a/drivers/vfio/pci/vfio_pci_intrs.c +++ b/drivers/vfio/pci/vfio_pci_intrs.c @@ -20,7 +20,7 @@ #include #include -#include "vfio_pci_private.h" +#include "vfio_pci_core.h" /* * INTx diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c index 9adcf6a8f888..326a704c4527 100644 --- a/drivers/vfio/pci/vfio_pci_nvlink2.c +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c @@ -19,7 +19,7 @@ #include #include #include -#include "vfio_pci_private.h" +#include "vfio_pci_core.h" #define CREATE_TRACE_POINTS #include "trace.h" diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c index a0b5fc8e46f4..667e82726e75 100644 --- a/drivers/vfio/pci/vfio_pci_rdwr.c +++ b/drivers/vfio/pci/vfio_pci_rdwr.c @@ -17,7 +17,7 @@ #include #include -#include "vfio_pci_private.h" +#include "vfio_pci_core.h" #ifdef __LITTLE_ENDIAN #define vfio_ioread64 ioread64 diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c index 229685634031..3e91d49fa3f0 100644 --- a/drivers/vfio/pci/vfio_pci_zdev.c +++ b/drivers/vfio/pci/vfio_pci_zdev.c @@ -19,7 +19,7 @@ #include #include -#include "vfio_pci_private.h" +#include "vfio_pci_core.h" /* * Add the Base PCI Function information to the device info region. -- 2.25.4
[PATCH 1/9] vfio-pci: rename vfio_pci.c to vfio_pci_core.c
This is a preparation patch for separating the vfio_pci driver to a subsystem driver and a generic pci driver. This patch doesn't change any logic. Signed-off-by: Max Gurtovoy --- drivers/vfio/pci/Makefile| 2 +- drivers/vfio/pci/{vfio_pci.c => vfio_pci_core.c} | 0 2 files changed, 1 insertion(+), 1 deletion(-) rename drivers/vfio/pci/{vfio_pci.c => vfio_pci_core.c} (100%) diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index eff97a7cd9f1..bbf8d7c8fc45 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -1,6 +1,6 @@ # SPDX-License-Identifier: GPL-2.0-only -vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o +vfio-pci-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o vfio-pci-$(CONFIG_S390) += vfio_pci_zdev.o diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci_core.c similarity index 100% rename from drivers/vfio/pci/vfio_pci.c rename to drivers/vfio/pci/vfio_pci_core.c -- 2.25.4
[PATCH v3 0/9] Introduce vfio-pci-core subsystem
.
- rename core structure to vfio_pci_core_device.
- create 3 new vendor drivers (igd_vfio_pci, npu2_vfio_pci, nvlink2gpu_vfio_pci) and preserve backward compatibility.
- rebase on top of Linux 5.11 tag.
- remove patches that were accepted as stand-alone.
- removed mlx5_vfio_pci driver from this series (3 vendor drivers will be enough for emphasizing the approach).

changes from v1:
- Create private and public vfio-pci structures (From Alex)
- register to vfio core directly from vfio-pci-core (for now, expose minimal public functionality to vfio pci drivers. This will remove the implicit behaviour from v1. More power to the drivers can be added in the future)
- Add patch 3/9 to emphasize the needed extension for LM feature (From Cornelia)
- take/release refcount for the pci module during core open/release
- update nvlink, igd and zdev to PowerNV, X86 and s390 extensions for vfio-pci core
- fix segfault bugs in current vfio-pci zdev code

Max Gurtovoy (9):
  vfio-pci: rename vfio_pci.c to vfio_pci_core.c
  vfio-pci: rename vfio_pci_private.h to vfio_pci_core.h
  vfio-pci: rename vfio_pci_device to vfio_pci_core_device
  vfio-pci: introduce vfio_pci_core subsystem driver
  vfio/pci: introduce vfio_pci_device structure
  vfio-pci-core: export vfio_pci_register_dev_region function
  vfio/pci_core: split nvlink2 to nvlink2gpu and npu2
  vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  vfio/pci: export igd support into vendor vfio_pci driver

 drivers/vfio/pci/Kconfig | 53 +-
 drivers/vfio/pci/Makefile | 20 +-
 .../pci/{vfio_pci_igd.c => igd_vfio_pci.c} | 159 +-
 drivers/vfio/pci/igd_vfio_pci.h | 24 +
 drivers/vfio/pci/npu2_trace.h | 50 +
 drivers/vfio/pci/npu2_vfio_pci.c | 364 +++
 drivers/vfio/pci/npu2_vfio_pci.h | 24 +
 .../vfio/pci/{trace.h => nvlink2gpu_trace.h} | 27 +-
 ...io_pci_nvlink2.c => nvlink2gpu_vfio_pci.c} | 296 +-
 drivers/vfio/pci/nvlink2gpu_vfio_pci.h | 24 +
 drivers/vfio/pci/vfio_pci.c | 2433 ++---
 drivers/vfio/pci/vfio_pci_config.c | 70 +-
 drivers/vfio/pci/vfio_pci_core.c | 2245 +++
 drivers/vfio/pci/vfio_pci_core.h | 242 ++
 drivers/vfio/pci/vfio_pci_intrs.c | 42 +-
 drivers/vfio/pci/vfio_pci_private.h | 228 --
 drivers/vfio/pci/vfio_pci_rdwr.c | 18 +-
 drivers/vfio/pci/vfio_pci_zdev.c | 12 +-
 18 files changed, 3528 insertions(+), 2803 deletions(-)
 rename drivers/vfio/pci/{vfio_pci_igd.c => igd_vfio_pci.c} (58%)
 create mode 100644 drivers/vfio/pci/igd_vfio_pci.h
 create mode 100644 drivers/vfio/pci/npu2_trace.h
 create mode 100644 drivers/vfio/pci/npu2_vfio_pci.c
 create mode 100644 drivers/vfio/pci/npu2_vfio_pci.h
 rename drivers/vfio/pci/{trace.h => nvlink2gpu_trace.h} (72%)
 rename drivers/vfio/pci/{vfio_pci_nvlink2.c => nvlink2gpu_vfio_pci.c} (57%)
 create mode 100644 drivers/vfio/pci/nvlink2gpu_vfio_pci.h
 create mode 100644 drivers/vfio/pci/vfio_pci_core.c
 create mode 100644 drivers/vfio/pci/vfio_pci_core.h
 delete mode 100644 drivers/vfio/pci/vfio_pci_private.h
-- 2.25.4
Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
On 2/2/2021 7:10 PM, Jason Gunthorpe wrote: On Tue, Feb 02, 2021 at 05:06:59PM +0100, Cornelia Huck wrote: On the other side, we have the zdev support, which both requires s390 and applies to any pci device on s390. Is there a reason why CONFIG_VFIO_PCI_ZDEV exists? Why not just always return the s390 specific data in VFIO_DEVICE_GET_INFO if running on s390? It would be like returning data from ACPI on other platforms. Agree. Do we all agree that I should remove it? We already have a check in the code: if (ret && ret != -ENODEV) { pci_warn(vdev->vpdev.pdev, "Failed to setup zPCI info capabilities\n"); return ret; } so in case it's not a zdev we should get -ENODEV and continue on the good flow. It really seems like part of vfio-pci-core Jason
Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
On 2/5/2021 2:42 AM, Alexey Kardashevskiy wrote: On 04/02/2021 23:51, Jason Gunthorpe wrote: On Thu, Feb 04, 2021 at 12:05:22PM +1100, Alexey Kardashevskiy wrote: It is system firmware (==bios) which puts stuff in the device tree. The stuff is: 1. emulated pci devices (custom pci bridges), one per nvlink, emulated by the firmware, the driver is "ibmnpu" and it is a part on the nvidia driver; these are basically config space proxies to the cpu's side of nvlink. 2. interconnect information - which of 6 gpus nvlinks connected to which nvlink on the cpu side, and memory ranges. So what is this vfio_nvlink driver supposed to be bound to? The "emulated pci devices"? Yes. A real GPU function? Yes. A real nvswitch function? What do you mean by this exactly? The cpu side of nvlink is "emulated pci devices", the gpu side is not in pci space at all, the nvidia driver manages it via the gpu's mmio or/and cfg space. Something else? Nope :) In this new scheme which you are proposing it should be 2 drivers, I guess. I see. So should it be nvidia_vfio_pci.ko ? and it will do the NVLINK stuff in case the class code matches and otherwise just work as simple vfio_pci GPU ? What about the second driver ? should it be called ibmnpu_vfio_pci.ko ? Jason
Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
On 2/3/2021 5:24 AM, Alexey Kardashevskiy wrote:

On 03/02/2021 04:41, Max Gurtovoy wrote:

On 2/2/2021 6:06 PM, Cornelia Huck wrote:

On Mon, 1 Feb 2021 11:42:30 -0700 Alex Williamson wrote:

On Mon, 1 Feb 2021 12:49:12 -0500 Matthew Rosato wrote:

On 2/1/21 12:14 PM, Cornelia Huck wrote:

On Mon, 1 Feb 2021 16:28:27 + Max Gurtovoy wrote:

This patch doesn't change any logic; it only aligns to the concept of vfio_pci_core extensions. Extensions that are related to a platform and not to a specific vendor of PCI devices should be part of the core driver. Extensions that are specific to a PCI device vendor should go to a dedicated vendor vfio-pci driver.

My understanding is that igd means support for Intel graphics, i.e. a strict subset of x86. If there are other future extensions that e.g. only make sense for some devices found only on AMD systems, I don't think they should all be included under the same x86 umbrella. Similar reasoning for nvlink, which only seems to cover support for some GPUs under Power and is not a general platform-specific extension IIUC.

We can arguably do the zdev -> s390 rename (as zpci appears only on s390, and all PCI devices will be zpci on that platform), although I'm not sure about the benefit.

As far as I can tell, there isn't any benefit for s390; it's just "re-branding" to match the platform name rather than the zdev moniker, which admittedly perhaps makes it clearer to someone outside of s390 that any PCI device on s390 is a zdev/zpci type, and thus will use this extension to vfio_pci(_core). This would still be true even if we added something later that builds atop it (e.g. a platform-specific device like ism-vfio-pci). Or for that matter, mlx5 via vfio-pci on s390x uses these zdev extensions today and would need to continue using them in a world where mlx5-vfio-pci.ko exists.

I guess all that to say: if such a rename matches the 'grand scheme' of this design where we treat arch-level extensions to vfio_pci(_core) as "vfio_pci_(arch)" then I'm not particularly opposed to the rename. But by itself it's not very exciting :)

This all seems like the wrong direction to me. The goal here is to modularize vfio-pci into a core library and derived vendor modules that make use of that core library. If existing device-specific extensions within vfio-pci cannot be turned into vendor modules through this support and are instead redefined as platform-specific features of the new core library, that feels like we're already admitting failure of this core library to support known devices, let alone future devices. IGD is a specific set of devices. They happen to rely on some platform-specific support, whose availability should be determined via the vendor module probe callback. Packing that support into an "x86" component as part of the core feels not only short-sighted, but also avoids addressing the issues around how userspace determines an optimal module to use for a device.

Hm, it seems that not all current extensions to the vfio-pci code are created equal. IIUC, we have igd and nvlink, which are sets of devices that only show up on x86 or ppc, respectively, and may rely on some special features of those architectures/platforms. The important point is that you have a device identifier that you can match a driver against.

Maybe you can supply the ids? Alexey K, I saw you've been working on the NVLINK2 for P9. Can you supply the exact ids that should be bound to this driver? I'll add it to V3.

Sorry, no. I never really had the exact ids; they keep popping up after years. The nvlink2 support can only work if the platform supports it, as there is nothing to do to the GPUs themselves; it is about setting up those nvlink2 links.

If the platform advertises certain features in the device tree, then any GPU in the SXM2 form factor (not sure about the exact abbrev, not a usual PCIe device) should just work. If the nvlink2 "subdriver" fails to find nvlinks (the platform does not advertise them, or some parameters are new to this subdriver, such as link-speed), we still fall back to generic vfio-pci and try passing through this GPU as a plain PCI device with no extra links. Semi-optimal, but if the user is mining cryptocoins, then highspeed links are not really that important :) And the nvidia driver is capable of running such GPUs without nvlinks. Thanks,

I see. But is the PCI function (the bound BDF) a GPU function or an NVLINK function? If it's an NVLINK function, then we should fail probing in the host vfio-pci driver. If it's a GPU function, it shouldn't be called the nvlink2 vfio-pci driver; it's just an extension in the GPU vfio-pci driver.

On the other side, we have the zdev support, which both requires s390 and applies to any pci device on s390. If we added special handling for ISM on s390, ISM would be in a category similar to igd/nvlink. Now, i
Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
On 2/3/2021 5:24 AM, Alexey Kardashevskiy wrote:

On 03/02/2021 04:41, Max Gurtovoy wrote:

On 2/2/2021 6:06 PM, Cornelia Huck wrote:

On Mon, 1 Feb 2021 11:42:30 -0700 Alex Williamson wrote:

On Mon, 1 Feb 2021 12:49:12 -0500 Matthew Rosato wrote:

On 2/1/21 12:14 PM, Cornelia Huck wrote:

On Mon, 1 Feb 2021 16:28:27 + Max Gurtovoy wrote:

This patch doesn't change any logic; it only aligns to the concept of vfio_pci_core extensions. Extensions that are related to a platform and not to a specific vendor of PCI devices should be part of the core driver. Extensions that are specific to a PCI device vendor should go to a dedicated vendor vfio-pci driver.

My understanding is that igd means support for Intel graphics, i.e. a strict subset of x86. If there are other future extensions that e.g. only make sense for some devices found only on AMD systems, I don't think they should all be included under the same x86 umbrella. Similar reasoning for nvlink, which only seems to cover support for some GPUs under Power and is not a general platform-specific extension IIUC.

We can arguably do the zdev -> s390 rename (as zpci appears only on s390, and all PCI devices will be zpci on that platform), although I'm not sure about the benefit.

As far as I can tell, there isn't any benefit for s390; it's just "re-branding" to match the platform name rather than the zdev moniker, which admittedly perhaps makes it clearer to someone outside of s390 that any PCI device on s390 is a zdev/zpci type, and thus will use this extension to vfio_pci(_core). This would still be true even if we added something later that builds atop it (e.g. a platform-specific device like ism-vfio-pci). Or for that matter, mlx5 via vfio-pci on s390x uses these zdev extensions today and would need to continue using them in a world where mlx5-vfio-pci.ko exists.

I guess all that to say: if such a rename matches the 'grand scheme' of this design where we treat arch-level extensions to vfio_pci(_core) as "vfio_pci_(arch)" then I'm not particularly opposed to the rename. But by itself it's not very exciting :)

This all seems like the wrong direction to me. The goal here is to modularize vfio-pci into a core library and derived vendor modules that make use of that core library. If existing device-specific extensions within vfio-pci cannot be turned into vendor modules through this support and are instead redefined as platform-specific features of the new core library, that feels like we're already admitting failure of this core library to support known devices, let alone future devices. IGD is a specific set of devices. They happen to rely on some platform-specific support, whose availability should be determined via the vendor module probe callback. Packing that support into an "x86" component as part of the core feels not only short-sighted, but also avoids addressing the issues around how userspace determines an optimal module to use for a device.

Hm, it seems that not all current extensions to the vfio-pci code are created equal. IIUC, we have igd and nvlink, which are sets of devices that only show up on x86 or ppc, respectively, and may rely on some special features of those architectures/platforms. The important point is that you have a device identifier that you can match a driver against.

Maybe you can supply the ids? Alexey K, I saw you've been working on the NVLINK2 for P9. Can you supply the exact ids that should be bound to this driver? I'll add it to V3.

Sorry, no. I never really had the exact ids; they keep popping up after years. The nvlink2 support can only work if the platform supports it, as there is nothing to do to the GPUs themselves; it is about setting up those nvlink2 links.

If the platform advertises certain features in the device tree, then any GPU in the SXM2 form factor (not sure about the exact abbrev, not a usual PCIe device) should just work. If the nvlink2 "subdriver" fails to find nvlinks (the platform does not advertise them, or some parameters are new to this subdriver, such as link-speed), we still fall back to generic vfio-pci and try passing through this GPU as a plain PCI device with no extra links. Semi-optimal, but if the user is mining cryptocoins, then highspeed links are not really that important :) And the nvidia driver is capable of running such GPUs without nvlinks. Thanks,

From the above I can conclude that nvlink2_vfio_pci will need to find nvlinks during the probe and fail in case it doesn't. Which platform driver is responsible to advertise it? Can we use aux_device to represent these nvlink/nvlinks? In case of a failure, the user can fall back to vfio-pci.ko and run without the nvlinks, as you said. This is deterministic behavior and the user will know exactly what he's getting from the vfio subsystem.

On the other side, we have the zdev support, which both
Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
On 2/2/2021 10:44 PM, Jason Gunthorpe wrote:

On Tue, Feb 02, 2021 at 12:37:23PM -0700, Alex Williamson wrote:

For the most part, this explicit bind interface is redundant to driver_override, which already avoids the duplicate ID issue.

No, the point here is to have the ID tables in the PCI drivers because they fundamentally only work with their supported IDs. The normal driver core ID tables are a replacement for all the hardwired if's in vfio_pci. driver_override completely disables all the ID checking; it seems only useful for vfio_pci, which works with everything. It should not be used with something like nvlink_vfio_pci.ko that needs ID checking.

This mechanism of driver_override seems weird to me. In case of hotplug, when both capable drivers (the native device driver and vfio-pci) are loaded, both will compete for the device. I think the proposed flag is very powerful: it fixes the original concern Alex had ("if we start adding ids for vfio drivers then we create conflicts with the native host driver") and it's very deterministic. In this way we'll bind explicitly to a driver.

And the way we'll choose a vfio-pci driver is by device_id + vendor_id + subsystem_device + subsystem_vendor. There shouldn't be 2 vfio-pci drivers that support a device with the same above 4 ids. If you don't find a suitable vendor-vfio-pci.ko, you'll try binding vfio-pci.ko. Each driver will publish its supported ids in sysfs to help the user decide.

Yes, this DRIVER_EXPLICIT_BIND_ONLY idea somewhat replaces driver_override because we could set the PCI any-match on vfio_pci and manage the driver binding explicitly instead. A driver id table doesn't really help for binding the device; ultimately, even if a device is in the id table, it might fail to probe due to the missing platform support that each of these igd and nvlink drivers expose. What happens depends on what makes sense for the driver: some missing optional support could continue without it, or it could fail.

IGD and nvlink can trivially go onwards and work if they don't find the platform support. Or they might want to fail; I think the mlx5 and probably nvlink drivers should fail, as they are intended to be coupled with userspace that expects to use their extended features. In those cases failing is a feature because it prevents the whole system from going into an unexpected state.

Jason
Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd
On 2/2/2021 6:06 PM, Cornelia Huck wrote:

On Mon, 1 Feb 2021 11:42:30 -0700 Alex Williamson wrote:

On Mon, 1 Feb 2021 12:49:12 -0500 Matthew Rosato wrote:

On 2/1/21 12:14 PM, Cornelia Huck wrote:

On Mon, 1 Feb 2021 16:28:27 + Max Gurtovoy wrote:

This patch doesn't change any logic; it only aligns to the concept of vfio_pci_core extensions. Extensions that are related to a platform and not to a specific vendor of PCI devices should be part of the core driver. Extensions that are specific to a PCI device vendor should go to a dedicated vendor vfio-pci driver.

My understanding is that igd means support for Intel graphics, i.e. a strict subset of x86. If there are other future extensions that e.g. only make sense for some devices found only on AMD systems, I don't think they should all be included under the same x86 umbrella. Similar reasoning for nvlink, which only seems to cover support for some GPUs under Power and is not a general platform-specific extension IIUC.

We can arguably do the zdev -> s390 rename (as zpci appears only on s390, and all PCI devices will be zpci on that platform), although I'm not sure about the benefit.

As far as I can tell, there isn't any benefit for s390; it's just "re-branding" to match the platform name rather than the zdev moniker, which admittedly perhaps makes it clearer to someone outside of s390 that any PCI device on s390 is a zdev/zpci type, and thus will use this extension to vfio_pci(_core). This would still be true even if we added something later that builds atop it (e.g. a platform-specific device like ism-vfio-pci). Or for that matter, mlx5 via vfio-pci on s390x uses these zdev extensions today and would need to continue using them in a world where mlx5-vfio-pci.ko exists.

I guess all that to say: if such a rename matches the 'grand scheme' of this design where we treat arch-level extensions to vfio_pci(_core) as "vfio_pci_(arch)" then I'm not particularly opposed to the rename. But by itself it's not very exciting :)

This all seems like the wrong direction to me. The goal here is to modularize vfio-pci into a core library and derived vendor modules that make use of that core library. If existing device-specific extensions within vfio-pci cannot be turned into vendor modules through this support and are instead redefined as platform-specific features of the new core library, that feels like we're already admitting failure of this core library to support known devices, let alone future devices. IGD is a specific set of devices. They happen to rely on some platform-specific support, whose availability should be determined via the vendor module probe callback. Packing that support into an "x86" component as part of the core feels not only short-sighted, but also avoids addressing the issues around how userspace determines an optimal module to use for a device.

Hm, it seems that not all current extensions to the vfio-pci code are created equal. IIUC, we have igd and nvlink, which are sets of devices that only show up on x86 or ppc, respectively, and may rely on some special features of those architectures/platforms. The important point is that you have a device identifier that you can match a driver against.

Maybe you can supply the ids? Alexey K, I saw you've been working on the NVLINK2 for P9. Can you supply the exact ids that should be bound to this driver? I'll add it to V3.

On the other side, we have the zdev support, which both requires s390 and applies to any pci device on s390. If we added special handling for ISM on s390, ISM would be in a category similar to igd/nvlink. Now, if somebody plugs a mlx5 device into an s390, we would want both the zdev support and the specialized mlx5 driver. Maybe zdev should be some kind of library that can be linked into normal vfio-pci, into vfio-pci-mlx5, and a hypothetical vfio-pci-ism? You always want zdev on s390 (if configured into the kernel).
[PATCH 9/9] vfio/pci: use powernv naming instead of nvlink2
This patch doesn't change any logic; it only aligns to the concept of vfio_pci_core extensions. Extensions that are related to a platform and not to a specific vendor of PCI devices should be part of the core driver. Extensions that are specific to a PCI device vendor should go to a dedicated vendor vfio-pci driver. For now, powernv extensions will include only nvlink2.

Signed-off-by: Max Gurtovoy
---
 drivers/vfio/pci/Kconfig                                    | 6 ++++--
 drivers/vfio/pci/Makefile                                   | 2 +-
 drivers/vfio/pci/vfio_pci_core.c                            | 4 ++--
 drivers/vfio/pci/{vfio_pci_nvlink2.c => vfio_pci_powernv.c} | 0
 drivers/vfio/pci/vfio_pci_private.h                         | 2 +-
 5 files changed, 8 insertions(+), 6 deletions(-)
 rename drivers/vfio/pci/{vfio_pci_nvlink2.c => vfio_pci_powernv.c} (100%)

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index c98f2df01a60..fe0264b3d02f 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -47,11 +47,13 @@ config VFIO_PCI_X86

 	  To enable Intel X86 extensions for vfio-pci-core, say Y.

-config VFIO_PCI_NVLINK2
+config VFIO_PCI_POWERNV
 	def_bool y
 	depends on VFIO_PCI_CORE && PPC_POWERNV
 	help
-	  VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
+	  VFIO PCI extensions for IBM PowerNV (Non-Virtualized) platform
+
+	  To enable POWERNV extensions for vfio-pci-core, say Y.

 config VFIO_PCI_S390
 	bool "VFIO PCI extensions for S390 platform"
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index d8ccb70e015a..442b7c78de4c 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_X86) += vfio_pci_x86.o
-vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_POWERNV) += vfio_pci_powernv.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_s390.o

 vfio-pci-y := vfio_pci.o
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index e0e258c37fb5..90cc728fffc7 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -337,7 +337,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 	}

 	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
-	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
+	    IS_ENABLED(CONFIG_VFIO_PCI_POWERNV)) {
 		ret = vfio_pci_nvdia_v100_nvlink2_init(vdev);
 		if (ret && ret != -ENODEV) {
 			pci_warn(pdev, "Failed to setup NVIDIA NV2 RAM region\n");
@@ -346,7 +346,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 	}

 	if (pdev->vendor == PCI_VENDOR_ID_IBM &&
-	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
+	    IS_ENABLED(CONFIG_VFIO_PCI_POWERNV)) {
 		ret = vfio_pci_ibm_npu2_init(vdev);
 		if (ret && ret != -ENODEV) {
 			pci_warn(pdev, "Failed to setup NVIDIA NV2 ATSD region\n");
diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_powernv.c
similarity index 100%
rename from drivers/vfio/pci/vfio_pci_nvlink2.c
rename to drivers/vfio/pci/vfio_pci_powernv.c
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index efc688525784..dc6a9191a704 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -173,7 +173,7 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
 	return -ENODEV;
 }
 #endif
-#ifdef CONFIG_VFIO_PCI_NVLINK2
+#ifdef CONFIG_VFIO_PCI_POWERNV
 extern int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev);
 extern int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev);
 #else
--
2.25.4
[PATCH 6/9] vfio-pci/zdev: fix possible segmentation fault issue
In case allocation fails, we must behave correctly and exit with error.

Signed-off-by: Max Gurtovoy
---
 drivers/vfio/pci/vfio_pci_zdev.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 175096fcd902..e9ef4239ef7a 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -71,6 +71,8 @@ static int zpci_util_cap(struct zpci_dev *zdev, struct vfio_info_cap *caps)
 	int ret;

 	cap = kmalloc(cap_size, GFP_KERNEL);
+	if (!cap)
+		return -ENOMEM;

 	cap->header.id = VFIO_DEVICE_INFO_CAP_ZPCI_UTIL;
 	cap->header.version = 1;
@@ -94,6 +96,8 @@ static int zpci_pfip_cap(struct zpci_dev *zdev, struct vfio_info_cap *caps)
 	int ret;

 	cap = kmalloc(cap_size, GFP_KERNEL);
+	if (!cap)
+		return -ENOMEM;

 	cap->header.id = VFIO_DEVICE_INFO_CAP_ZPCI_PFIP;
 	cap->header.version = 1;
--
2.25.4
[PATCH 8/9] vfio/pci: use x86 naming instead of igd
This patch doesn't change any logic; it only aligns to the concept of vfio_pci_core extensions. Extensions that are related to a platform and not to a specific vendor of PCI devices should be part of the core driver. Extensions that are specific to a PCI device vendor should go to a dedicated vendor vfio-pci driver. For now, x86 extensions will include only igd.

Signed-off-by: Max Gurtovoy
---
 drivers/vfio/pci/Kconfig                            | 13 ++++++-------
 drivers/vfio/pci/Makefile                           |  2 +-
 drivers/vfio/pci/vfio_pci_core.c                    |  2 +-
 drivers/vfio/pci/vfio_pci_private.h                 |  2 +-
 drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} |  0
 5 files changed, 9 insertions(+), 10 deletions(-)
 rename drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} (100%)

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 6e6c976b9512..c98f2df01a60 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -36,17 +36,16 @@ config VFIO_PCI_INTX
 	depends on VFIO_PCI_CORE
 	def_bool y if !S390

-config VFIO_PCI_IGD
-	bool "VFIO PCI extensions for Intel graphics (GVT-d)"
+config VFIO_PCI_X86
+	bool "VFIO PCI extensions for Intel X86 platform"
 	depends on VFIO_PCI_CORE && X86
 	default y
 	help
-	  Support for Intel IGD specific extensions to enable direct
-	  assignment to virtual machines. This includes exposing an IGD
-	  specific firmware table and read-only copies of the host bridge
-	  and LPC bridge config space.
+	  Support for Intel X86 specific extensions for VFIO PCI devices.
+	  This includes exposing an IGD specific firmware table and
+	  read-only copies of the host bridge and LPC bridge config space.

-	  To enable Intel IGD assignment through vfio-pci, say Y.
+	  To enable Intel X86 extensions for vfio-pci-core, say Y.

 config VFIO_PCI_NVLINK2
 	def_bool y
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 90962a495eba..d8ccb70e015a 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -5,7 +5,7 @@ obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
 obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o

 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
-vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_X86) += vfio_pci_x86.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_s390.o
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index c559027def2d..e0e258c37fb5 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -328,7 +328,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)

 	if (vfio_pci_is_vga(pdev) &&
 	    pdev->vendor == PCI_VENDOR_ID_INTEL &&
-	    IS_ENABLED(CONFIG_VFIO_PCI_IGD)) {
+	    IS_ENABLED(CONFIG_VFIO_PCI_X86)) {
 		ret = vfio_pci_igd_init(vdev);
 		if (ret && ret != -ENODEV) {
 			pci_warn(pdev, "Failed to setup Intel IGD regions\n");
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 7c3c2cd575f8..efc688525784 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -165,7 +165,7 @@ extern u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_device *vdev);
 extern void vfio_pci_memory_unlock_and_restore(struct vfio_pci_device *vdev,
 					       u16 cmd);
-#ifdef CONFIG_VFIO_PCI_IGD
+#ifdef CONFIG_VFIO_PCI_X86
 extern int vfio_pci_igd_init(struct vfio_pci_device *vdev);
 #else
 static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_x86.c
similarity index 100%
rename from drivers/vfio/pci/vfio_pci_igd.c
rename to drivers/vfio/pci/vfio_pci_x86.c
--
2.25.4
[PATCH 7/9] vfio/pci: use s390 naming instead of zdev
This patch doesn't change any logic; it only aligns to the concept of vfio_pci_core extensions. Extensions that are related to a platform and not to a specific vendor of PCI devices should be part of the core driver. Extensions that are specific to a PCI device vendor should go to a dedicated vendor vfio-pci driver.

Signed-off-by: Max Gurtovoy
---
 drivers/vfio/pci/Kconfig                              | 4 ++--
 drivers/vfio/pci/Makefile                             | 2 +-
 drivers/vfio/pci/vfio_pci_core.c                      | 2 +-
 drivers/vfio/pci/vfio_pci_private.h                   | 2 +-
 drivers/vfio/pci/{vfio_pci_zdev.c => vfio_pci_s390.c} | 2 +-
 5 files changed, 6 insertions(+), 6 deletions(-)
 rename drivers/vfio/pci/{vfio_pci_zdev.c => vfio_pci_s390.c} (98%)

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index dcb164d7d641..6e6c976b9512 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -54,8 +54,8 @@ config VFIO_PCI_NVLINK2
 	help
 	  VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs

-config VFIO_PCI_ZDEV
-	bool "VFIO PCI ZPCI device CLP support"
+config VFIO_PCI_S390
+	bool "VFIO PCI extensions for S390 platform"
 	depends on VFIO_PCI_CORE && S390
 	default y
 	help
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 9f67edca31c5..90962a495eba 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -7,7 +7,7 @@ obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
-vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_s390.o

 vfio-pci-y := vfio_pci.o
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a0a91331f575..c559027def2d 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -785,7 +785,7 @@ static long vfio_pci_core_ioctl(void *device_data, unsigned int cmd,
 		info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions;
 		info.num_irqs = VFIO_PCI_NUM_IRQS;

-		if (IS_ENABLED(CONFIG_VFIO_PCI_ZDEV)) {
+		if (IS_ENABLED(CONFIG_VFIO_PCI_S390)) {
 			int ret = vfio_pci_info_zdev_add_caps(vdev, &caps);

 			if (ret && ret != -ENODEV) {
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 1c3bb809b5c0..7c3c2cd575f8 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -188,7 +188,7 @@ static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
 }
 #endif

-#ifdef CONFIG_VFIO_PCI_ZDEV
+#ifdef CONFIG_VFIO_PCI_S390
 extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
 				       struct vfio_info_cap *caps);
 #else
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_s390.c
similarity index 98%
rename from drivers/vfio/pci/vfio_pci_zdev.c
rename to drivers/vfio/pci/vfio_pci_s390.c
index e9ef4239ef7a..b6421a7f767f 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_s390.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0+
 /*
- * VFIO ZPCI devices support
+ * VFIO PCI support for S390 platform (a.k.a ZPCI devices)
  *
  * Copyright (C) IBM Corp. 2020. All rights reserved.
  * Author(s): Pierre Morel
--
2.25.4
[PATCH 5/9] vfio-pci/zdev: remove unused vdev argument
Zdev static functions do not use the vdev argument. Remove it.

Signed-off-by: Max Gurtovoy
---
 drivers/vfio/pci/vfio_pci_zdev.c | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 7b20b34b1034..175096fcd902 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -24,8 +24,7 @@
 /*
  * Add the Base PCI Function information to the device info region.
  */
-static int zpci_base_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
-			 struct vfio_info_cap *caps)
+static int zpci_base_cap(struct zpci_dev *zdev, struct vfio_info_cap *caps)
 {
 	struct vfio_device_info_cap_zpci_base cap = {
 		.header.id = VFIO_DEVICE_INFO_CAP_ZPCI_BASE,
@@ -45,8 +44,7 @@ static int zpci_base_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
 /*
  * Add the Base PCI Function Group information to the device info region.
  */
-static int zpci_group_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
-			  struct vfio_info_cap *caps)
+static int zpci_group_cap(struct zpci_dev *zdev, struct vfio_info_cap *caps)
 {
 	struct vfio_device_info_cap_zpci_group cap = {
 		.header.id = VFIO_DEVICE_INFO_CAP_ZPCI_GROUP,
@@ -66,8 +64,7 @@ static int zpci_group_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
 /*
  * Add the device utility string to the device info region.
  */
-static int zpci_util_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
-			 struct vfio_info_cap *caps)
+static int zpci_util_cap(struct zpci_dev *zdev, struct vfio_info_cap *caps)
 {
 	struct vfio_device_info_cap_zpci_util *cap;
 	int cap_size = sizeof(*cap) + CLP_UTIL_STR_LEN;
@@ -90,8 +87,7 @@ static int zpci_util_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
 /*
  * Add the function path string to the device info region.
  */
-static int zpci_pfip_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
-			 struct vfio_info_cap *caps)
+static int zpci_pfip_cap(struct zpci_dev *zdev, struct vfio_info_cap *caps)
 {
 	struct vfio_device_info_cap_zpci_pfip *cap;
 	int cap_size = sizeof(*cap) + CLP_PFIP_NR_SEGMENTS;
@@ -123,21 +119,21 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
 	if (!zdev)
 		return -ENODEV;

-	ret = zpci_base_cap(zdev, vdev, caps);
+	ret = zpci_base_cap(zdev, caps);
 	if (ret)
 		return ret;

-	ret = zpci_group_cap(zdev, vdev, caps);
+	ret = zpci_group_cap(zdev, caps);
 	if (ret)
 		return ret;

 	if (zdev->util_str_avail) {
-		ret = zpci_util_cap(zdev, vdev, caps);
+		ret = zpci_util_cap(zdev, caps);
 		if (ret)
 			return ret;
 	}

-	ret = zpci_pfip_cap(zdev, vdev, caps);
+	ret = zpci_pfip_cap(zdev, caps);

 	return ret;
 }
--
2.25.4
[PATCH 4/9] mlx5-vfio-pci: add new vfio_pci driver for mlx5 devices
This driver will register to PCI bus and Auxiliary bus. In case the probe of both devices will succeed, we'll have a vendor specific VFIO PCI device. mlx5_vfio_pci use vfio_pci_core to register and create a VFIO device and use auxiliary_device to get the needed extension from the vendor device driver. If one of the probe() functions will fail, the VFIO char device will not be created. For now, only register and bind the auxiliary_device to the pci_device in case we have a match between the auxiliary_device id to the pci_device BDF. Later, vendor specific features such as live migration will be added and will be available to the virtualization software. Note: Although we've created the mlx5-vfio-pci.ko, the binding to vfio-pci.ko will still work as before. It's fully backward compatible. Of course, the extended vendor functionality will not exist in case one will bind the device to the generic vfio_pci.ko. Signed-off-by: Max Gurtovoy --- drivers/vfio/pci/Kconfig | 10 ++ drivers/vfio/pci/Makefile| 3 + drivers/vfio/pci/mlx5_vfio_pci.c | 253 +++ include/linux/mlx5/vfio_pci.h| 36 + 4 files changed, 302 insertions(+) create mode 100644 drivers/vfio/pci/mlx5_vfio_pci.c create mode 100644 include/linux/mlx5/vfio_pci.h diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig index b958a48f63a0..dcb164d7d641 100644 --- a/drivers/vfio/pci/Kconfig +++ b/drivers/vfio/pci/Kconfig @@ -65,3 +65,13 @@ config VFIO_PCI_ZDEV for zPCI devices passed through via VFIO on s390. Say Y here. + +config MLX5_VFIO_PCI + tristate "VFIO support for MLX5 PCI devices" + depends on VFIO_PCI_CORE && MLX5_CORE + select AUXILIARY_BUS + help + This provides a generic PCI support for MLX5 devices using the VFIO + framework. + + If you don't know what to do here, say N. 
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index 3f2a27e222cd..9f67edca31c5 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -2,6 +2,7 @@ obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o obj-$(CONFIG_VFIO_PCI) += vfio-pci.o +obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o @@ -9,3 +10,5 @@ vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o vfio-pci-y := vfio_pci.o + +mlx5-vfio-pci-y := mlx5_vfio_pci.o diff --git a/drivers/vfio/pci/mlx5_vfio_pci.c b/drivers/vfio/pci/mlx5_vfio_pci.c new file mode 100644 index ..4e6b256c74bf --- /dev/null +++ b/drivers/vfio/pci/mlx5_vfio_pci.c @@ -0,0 +1,253 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. + * Author: Max Gurtovoy + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "vfio_pci_core.h" + +#define DRIVER_VERSION "0.1" +#define DRIVER_AUTHOR "Max Gurtovoy " +#define DRIVER_DESC "MLX5 VFIO PCI - User Level meta-driver for NVIDIA MLX5 device family" + +/* 16k migration data size */ +#define MLX5_MIGRATION_REGION_DATA_SIZESZ_16K +/* Data section offset from migration region */ +#define MLX5_MIGRATION_REGION_DATA_OFFSET (sizeof(struct vfio_device_migration_info)) + +struct mlx5_vfio_pci_migration_info { + struct vfio_device_migration_info mig; + char data[MLX5_MIGRATION_REGION_DATA_SIZE]; +}; + +static LIST_HEAD(aux_devs_list); +static DEFINE_MUTEX(aux_devs_lock); + +static struct mlx5_vfio_pci_adev *mlx5_vfio_pci_find_adev(struct pci_dev *pdev) +{ + struct mlx5_vfio_pci_adev *mvadev, *found = NULL; + + mutex_lock(_devs_lock); + 
+	list_for_each_entry(mvadev, &aux_devs_list, entry) {
+		if (mvadev->madev.adev.id == pci_dev_id(pdev)) {
+			found = mvadev;
+			break;
+		}
+	}
+	mutex_unlock(&aux_devs_lock);
+
+	return found;
+}
+
+static int mlx5_vfio_pci_aux_probe(struct auxiliary_device *adev,
+				   const struct auxiliary_device_id *id)
+{
+	struct mlx5_vfio_pci_adev *mvadev;
+
+	mvadev = adev_to_mvadev(adev);
+
+	pr_info("%s aux probing bdf %02x:%02x.%d mdev is %s\n",
+		adev->name,
+		PCI_BUS_NUM(adev->id & 0x),
+		PCI_SLOT(adev->id & 0xff),
+		PCI_FUNC(adev->id & 0xff), dev_name(mvadev->madev.mdev->device));
+
+	mutex_lock(&aux_devs_lock);
+	list_add(&mvadev->entry, &aux_devs_list);
+	mutex_unlock(&aux_devs_lock);
+
+	return 0;
+}
+
+static void mlx5_vfi
[PATCH 3/9] vfio-pci-core: export vfio_pci_register_dev_region function
This function will be used to allow vendor drivers to register regions to be used and accessed by the core subsystem driver. This way, the core will use region ops that are vendor specific and managed by the vendor vfio-pci driver. A next step could be to move the igd and nvlink logic to dedicated modules instead of managing their functionality in the core driver.

Signed-off-by: Max Gurtovoy
---
 drivers/vfio/pci/vfio_pci_core.c    | 12 +-
 drivers/vfio/pci/vfio_pci_core.h    | 28 
 drivers/vfio/pci/vfio_pci_igd.c     | 16 --
 drivers/vfio/pci/vfio_pci_nvlink2.c | 25 +++--
 drivers/vfio/pci/vfio_pci_private.h | 34 +
 5 files changed, 64 insertions(+), 51 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index d5bf01132c23..a0a91331f575 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -395,7 +395,7 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
 	vdev->virq_disabled = false;

 	for (i = 0; i < vdev->num_regions; i++)
-		vdev->region[i].ops->release(vdev, &vdev->region[i]);
+		vdev->region[i].ops->release(&vdev->vpdev, &vdev->region[i]);

 	vdev->num_regions = 0;
 	kfree(vdev->region);
@@ -716,11 +716,12 @@ static int msix_mmappable_cap(struct vfio_pci_device *vdev,
 	return vfio_info_add_capability(caps, &header, sizeof(header));
 }

-int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
+int vfio_pci_register_dev_region(struct vfio_pci_core_device *vpdev,
 				 unsigned int type, unsigned int subtype,
 				 const struct vfio_pci_regops *ops,
 				 size_t size, u32 flags, void *data)
 {
+	struct vfio_pci_device *vdev = vpdev_to_vdev(vpdev);
 	struct vfio_pci_region *region;

 	region = krealloc(vdev->region,
@@ -741,6 +742,7 @@ int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
 	return 0;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_register_dev_region);

 struct vfio_devices {
 	struct vfio_device **devices;
@@ -928,7 +930,7 @@ static long vfio_pci_core_ioctl(void *device_data, unsigned int cmd,
 				return ret;

 			if
 (vdev->region[i].ops->add_capability) {
-				ret = vdev->region[i].ops->add_capability(vdev,
+				ret = vdev->region[i].ops->add_capability(&vdev->vpdev,
						&vdev->region[i], &caps);
 				if (ret)
 					return ret;
@@ -1379,7 +1381,7 @@ static ssize_t vfio_pci_rw(struct vfio_pci_device *vdev, char __user *buf,
 		return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
 	default:
 		index -= VFIO_PCI_NUM_REGIONS;
-		return vdev->region[index].ops->rw(vdev, buf,
+		return vdev->region[index].ops->rw(&vdev->vpdev, buf,
						   count, ppos, iswrite);
 	}
@@ -1622,7 +1624,7 @@ static int vfio_pci_core_mmap(void *device_data, struct vm_area_struct *vma)
 		if (region && region->ops && region->ops->mmap &&
 		    (region->flags & VFIO_REGION_INFO_FLAG_MMAP))
-			return region->ops->mmap(vdev, region, vma);
+			return region->ops->mmap(&vdev->vpdev, region, vma);
 		return -EINVAL;
 	}
 	if (index >= VFIO_PCI_ROM_REGION_INDEX)

diff --git a/drivers/vfio/pci/vfio_pci_core.h b/drivers/vfio/pci/vfio_pci_core.h
index 9833935af735..0b227ee3f377 100644
--- a/drivers/vfio/pci/vfio_pci_core.h
+++ b/drivers/vfio/pci/vfio_pci_core.h
@@ -15,6 +15,7 @@
 #define VFIO_PCI_CORE_H

 struct vfio_pci_device_ops;
+struct vfio_pci_region;

 struct vfio_pci_core_device {
 	struct pci_dev *pdev;
@@ -22,6 +23,29 @@ struct vfio_pci_core_device {
 	void	*dd_data;
 };

+struct vfio_pci_regops {
+	size_t	(*rw)(struct vfio_pci_core_device *vpdev, char __user *buf,
+		      size_t count, loff_t *ppos, bool iswrite);
+	void	(*release)(struct vfio_pci_core_device *vpdev,
+			   struct vfio_pci_region *region);
+	int	(*mmap)(struct vfio_pci_core_device *vpdev,
+			struct vfio_pci_region *region,
+			struct vm_area_struct *vma);
+	int	(*add_capability)(struct vfio_pci_core_device *vpdev,
+				  struct vfio_pci_region *region,
+				  struct vfio_info_cap *caps);
+};
+
+struct vfio_pci_region {
+	u32
[PATCH 2/9] vfio-pci: introduce vfio_pci_core subsystem driver
Split the vfio_pci driver into two parts, the 'struct pci_driver' (vfio_pci) and a library of code (vfio_pci_core) that helps create a VFIO device on top of a PCI device. As before, vfio_pci.ko continues to present the same interface under sysfs and this change should have no functional impact.

vfio_pci_core exposes an interface that is similar to a typical Linux subsystem, in that a pci_driver doing probe() can set up a number of details and then create the VFIO char device. Allowing another module to provide the pci_driver allows that module to customize how VFIO is set up, inject its own operations, and easily extend vendor-specific functionality. This is complementary to how VFIO's mediated devices work. Instead of custom device lifecycle management and a special bus, drivers using this approach rely on the normal driver core lifecycle (e.g. bind/unbind) management, and this is optimized to effectively support customization that makes only small modifications to what vfio_pci would do normally.

This approach is also a pluggable alternative for the hard-wired CONFIG_VFIO_PCI_IGD and CONFIG_VFIO_PCI_NVLINK2 "drivers" that are built into vfio-pci. Using this work, all of that code can be moved to dedicated device-specific modules and cleanly split out of the generic vfio_pci driver.

Below is an example of adding a new driver to the vfio-pci subsystem:

            +----------------+
            |                |
            |      VFIO      |
            |                |
            +----------------+

            +----------------+
            |                |
            | VFIO_PCI_CORE  |
            |                |
            +----------------+

 +--------------+    +---------------+
 |              |    |               |
 |   VFIO_PCI   |    | MLX5_VFIO_PCI |
 |              |    |               |
 +--------------+    +---------------+

In this way, mlx5_vfio_pci will use vfio_pci_core to register to the vfio subsystem and also use the generic PCI functionality exported from it. Additionally, it will add the needed vendor-specific logic for HW-specific features such as Live Migration.
Signed-off-by: Max Gurtovoy
---
 drivers/vfio/pci/Kconfig            |  24 +-
 drivers/vfio/pci/Makefile           |  13 +-
 drivers/vfio/pci/vfio_pci.c         | 202 ++
 drivers/vfio/pci/vfio_pci_config.c  |  56 ++--
 drivers/vfio/pci/vfio_pci_core.c    | 392 ++--
 drivers/vfio/pci/vfio_pci_core.h    |  45 
 drivers/vfio/pci/vfio_pci_igd.c     |   2 +-
 drivers/vfio/pci/vfio_pci_intrs.c   |  22 +-
 drivers/vfio/pci/vfio_pci_nvlink2.c |  24 +-
 drivers/vfio/pci/vfio_pci_private.h |   4 +-
 drivers/vfio/pci/vfio_pci_rdwr.c    |  14 +-
 drivers/vfio/pci/vfio_pci_zdev.c    |   2 +-
 12 files changed, 471 insertions(+), 329 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci.c
 create mode 100644 drivers/vfio/pci/vfio_pci_core.h

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 40a223381ab6..b958a48f63a0 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
-config VFIO_PCI
-	tristate "VFIO support for PCI devices"
+config VFIO_PCI_CORE
+	tristate "VFIO core support for PCI devices"
 	depends on VFIO && PCI && EVENTFD
 	select VFIO_VIRQFD
 	select IRQ_BYPASS_MANAGER
@@ -10,9 +10,17 @@ config VFIO_PCI
 	  If you don't know what to do here, say N.

+config VFIO_PCI
+	tristate "VFIO support for PCI devices"
+	depends on VFIO_PCI_CORE
+	help
+	  This provides a generic PCI support using the VFIO framework.
+
+	  If you don't know what to do here, say N.
+
 config VFIO_PCI_VGA
 	bool "VFIO PCI support for VGA devices"
-	depends on VFIO_PCI && X86 && VGA_ARB
+	depends on VFIO_PCI_CORE && X86 && VGA_ARB
 	help
 	  Support for VGA extension to VFIO PCI.  This exposes an additional
 	  region on VGA devices for accessing legacy VGA addresses used by
@@ -21,16 +29,16 @@ config VFIO_PCI_VGA
 	  If you don't know what to do here, say N.

 config VFIO_PCI_MMAP
-	depends on VFIO_PCI
+	depends on VFIO_PCI_CORE
 	def_bool y if !S390

 config VFIO_PCI_INTX
-	depends on VFIO_PCI
+	depends on VFIO_PCI_CORE
 	def_bool y if !S390

 config VFIO_PCI_IGD
 	bool "VFIO PCI extensions for Intel graphics (GVT-d)"
-	depends on VFIO_PCI && X86
+	depends on VFIO_PCI_CORE && X86
 	default y
 	help
 	  Support for Intel IGD specific extensions to enable direct
@@ -42,13 +50,13 @@ config VFIO_PCI_IGD
 config VFIO_PCI_NVLINK
[PATCH 1/9] vfio-pci: rename vfio_pci.c to vfio_pci_core.c
This is a preparation patch for separating the vfio_pci driver into a subsystem driver and a generic PCI driver. This patch doesn't change any logic.

Signed-off-by: Max Gurtovoy
---
 drivers/vfio/pci/Makefile                        | 2 +-
 drivers/vfio/pci/{vfio_pci.c => vfio_pci_core.c} | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename drivers/vfio/pci/{vfio_pci.c => vfio_pci_core.c} (100%)

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 781e0809d6ee..dd350b9b 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
-vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 vfio-pci-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci_core.c
similarity index 100%
rename from drivers/vfio/pci/vfio_pci.c
rename to drivers/vfio/pci/vfio_pci_core.c
--
2.25.4
[PATCH v2 0/9] Introduce vfio-pci-core subsystem
io-pci zdev code

TODOs:
1. Create a subsystem module for VFIO_MDEV. This can be used for vendor-specific scalable functions (SFs), for example.
2. Add live migration functionality for mlx5 SNAP devices (NVMe/Virtio-BLK).
3. Add live migration functionality for mlx5 VFs.
4. Add the needed functionality for mlx5_core.
5. Improve userspace and discovery.
6. Move the VGA stuff to the x86 extension.

I would like to thank the great team that was involved in this development, design and internal review: Oren, Liran, Jason, Leon, Aviad, Shahaf, Gary, Artem, Kirti, Neo, Andy and others.

This series applies cleanly on top of kernel 5.11-rc5+ commit 13391c60da33: "Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6" from Linus.

Note: live migration for MLX5 SNAP devices is WIP and can be the first example of adding a vendor extension to vfio-pci devices. As the changes to the subsystem must be defined as a pre-condition for this work, we've decided to split the submission for now.
Max Gurtovoy (9):
  vfio-pci: rename vfio_pci.c to vfio_pci_core.c
  vfio-pci: introduce vfio_pci_core subsystem driver
  vfio-pci-core: export vfio_pci_register_dev_region function
  mlx5-vfio-pci: add new vfio_pci driver for mlx5 devices
  vfio-pci/zdev: remove unused vdev argument
  vfio-pci/zdev: fix possible segmentation fault issue
  vfio/pci: use s390 naming instead of zdev
  vfio/pci: use x86 naming instead of igd
  vfio/pci: use powernv naming instead of nvlink2

 drivers/vfio/pci/Kconfig                      |   57 +-
 drivers/vfio/pci/Makefile                     |   16 +-
 drivers/vfio/pci/mlx5_vfio_pci.c              |  253 ++
 drivers/vfio/pci/vfio_pci.c                   | 2384 +
 drivers/vfio/pci/vfio_pci_config.c            |   56 +-
 drivers/vfio/pci/vfio_pci_core.c              | 2326 
 drivers/vfio/pci/vfio_pci_core.h              |   73 +
 drivers/vfio/pci/vfio_pci_intrs.c             |   22 +-
 ...{vfio_pci_nvlink2.c => vfio_pci_powernv.c} |   47 +-
 drivers/vfio/pci/vfio_pci_private.h           |   44 +-
 drivers/vfio/pci/vfio_pci_rdwr.c              |   14 +-
 .../pci/{vfio_pci_zdev.c => vfio_pci_s390.c}  |   28 +-
 .../pci/{vfio_pci_igd.c => vfio_pci_x86.c}    |   18 +-
 include/linux/mlx5/vfio_pci.h                 |   36 +
 14 files changed, 2916 insertions(+), 2458 deletions(-)
 create mode 100644 drivers/vfio/pci/mlx5_vfio_pci.c
 create mode 100644 drivers/vfio/pci/vfio_pci_core.c
 create mode 100644 drivers/vfio/pci/vfio_pci_core.h
 rename drivers/vfio/pci/{vfio_pci_nvlink2.c => vfio_pci_powernv.c} (89%)
 rename drivers/vfio/pci/{vfio_pci_zdev.c => vfio_pci_s390.c} (80%)
 rename drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} (89%)
 create mode 100644 include/linux/mlx5/vfio_pci.h
--
2.25.4
Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem
On 2/1/2021 6:32 AM, Alex Williamson wrote: On Sun, 31 Jan 2021 20:46:40 +0200 Max Gurtovoy wrote: On 1/28/2021 11:02 PM, Alex Williamson wrote: On Thu, 28 Jan 2021 17:29:30 +0100 Cornelia Huck wrote: On Tue, 26 Jan 2021 15:27:43 +0200 Max Gurtovoy wrote: On 1/26/2021 5:34 AM, Alex Williamson wrote: On Mon, 25 Jan 2021 20:45:22 -0400 Jason Gunthorpe wrote: On Mon, Jan 25, 2021 at 04:31:51PM -0700, Alex Williamson wrote: extensions potentially break vendor drivers, etc. We're only even hand waving that existing device specific support could be farmed out to new device specific drivers without even going to the effort to prove that. This is an RFC, not a complete patch series. The RFC is to get feedback on the general design before everyone commits a lot of resources and positions get dug in. Do you really think the existing device specific support would be a problem to lift? It already looks pretty clean with the vfio_pci_regops, looks easy enough to lift to the parent. So far the TODOs rather mask the dirty little secrets of the extension rather than showing how a vendor derived driver needs to root around in struct vfio_pci_device to do something useful, so probably porting actual device specific support rather than further hand waving would be more helpful. It would be helpful to get actual feedback on the high level design - something like this was already tried in May and didn't go anywhere - are you surprised that we are reluctant to commit a lot of resources doing a complete job just to have it go nowhere again? That's not really what I'm getting from your feedback, which indicates vfio-pci is essentially done, the mlx stub driver should be enough to see the direction, and additional concerns can be handled with TODO comments. Sorry if this is not construed as actual feedback; I think both Connie and I are making an effort to understand this and are being hampered by the lack of a clear API or a vendor driver that's anything more than vfio-pci plus an aux bus interface.
Thanks, I think I got the main idea and I'll try to summarize it: The separation into vfio-pci.ko and vfio-pci-core.ko is acceptable, and we do need it to be able to create a vendor-vfio-pci.ko driver in the future to include vendor special sauce inside. One other thing I'd like to bring up: What needs to be done in userspace? Does a userspace driver like QEMU need changes to actually exploit this? Does management software like libvirt need to be involved in decision making, or does it just need to provide the knobs to make the driver configurable? I'm still pretty nervous about the userspace aspect of this as well. QEMU and other actual vfio drivers are probably the least affected; at least for QEMU, it'll happily open any device that has a pointer to an IOMMU group that's reflected as a vfio group device. Tools like libvirt, on the other hand, actually do driver binding, and we need to consider how they make driver decisions. Jason suggested that the vfio-pci driver ought to be only spec compliant behavior, which sounds like some deprecation process of splitting out the IGD, NVLink, zpci, etc. features into sub-drivers and eventually removing that device specific support from vfio-pci. Would we expect libvirt to know, "this is an 8086 graphics device, try to bind it to vfio-pci-igd" or "uname -m says we're running on s390, try to bind it to vfio-zpci"? Maybe we expect derived drivers to only bind to devices they recognize, so libvirt could blindly try a whole chain of drivers, ending in vfio-pci. Obviously if we have competing drivers that support the same device in different ways, that quickly falls apart. I think we can leave common arch specific stuff, such as s390 (IIUC), in the core driver, and only create vfio_pci drivers for vendor/device/subvendor specific stuff. So on one hand you're telling us that the design principles here can be applied to various other device/platform specific support, but on the other you're saying, but don't do that...
I guess I was looking at nvlink2 as device specific. But let's update the nvlink2, s390 and IGD a bit:

1. s390 - rename config VFIO_PCI_ZDEV to config VFIO_PCI_S390 (it will include all needed tweaks for S390)
2. nvlink2 - rename config VFIO_PCI_NVLINK2 to config VFIO_PCI_P9 (it will include all needed tweaks for P9)
3. igd - rename config VFIO_PCI_IGD to config VFIO_PCI_X86 (it will include all needed tweaks for X86)

All 3 stay in vfio-pci-core.ko, since we might need the S390 stuff if we plug in a network adapter from vendor-A or an NVMe adapter from vendor-B, for example. This is platform specific and we don't want to duplicate it in each vendor driver. Same for P9 (nvlink2 is only a special case there) and X86.

Also, the competing drivers issue can also happen today, right? After adding a new_id to vfio_pci I don't know how Linux will behave if we plug a new device with the same id into the system.
Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem
On 1/28/2021 11:02 PM, Alex Williamson wrote: On Thu, 28 Jan 2021 17:29:30 +0100 Cornelia Huck wrote: On Tue, 26 Jan 2021 15:27:43 +0200 Max Gurtovoy wrote: On 1/26/2021 5:34 AM, Alex Williamson wrote: On Mon, 25 Jan 2021 20:45:22 -0400 Jason Gunthorpe wrote: On Mon, Jan 25, 2021 at 04:31:51PM -0700, Alex Williamson wrote: extensions potentially break vendor drivers, etc. We're only even hand waving that existing device specific support could be farmed out to new device specific drivers without even going to the effort to prove that. This is an RFC, not a complete patch series. The RFC is to get feedback on the general design before everyone commits a lot of resources and positions get dug in. Do you really think the existing device specific support would be a problem to lift? It already looks pretty clean with the vfio_pci_regops, looks easy enough to lift to the parent. So far the TODOs rather mask the dirty little secrets of the extension rather than showing how a vendor derived driver needs to root around in struct vfio_pci_device to do something useful, so probably porting actual device specific support rather than further hand waving would be more helpful. It would be helpful to get actual feedback on the high level design - something like this was already tried in May and didn't go anywhere - are you surprised that we are reluctant to commit a lot of resources doing a complete job just to have it go nowhere again? That's not really what I'm getting from your feedback, which indicates vfio-pci is essentially done, the mlx stub driver should be enough to see the direction, and additional concerns can be handled with TODO comments. Sorry if this is not construed as actual feedback; I think both Connie and I are making an effort to understand this and are being hampered by the lack of a clear API or a vendor driver that's anything more than vfio-pci plus an aux bus interface.
Thanks, I think I got the main idea and I'll try to summarize it: The separation into vfio-pci.ko and vfio-pci-core.ko is acceptable, and we do need it to be able to create a vendor-vfio-pci.ko driver in the future to include vendor special sauce inside. One other thing I'd like to bring up: What needs to be done in userspace? Does a userspace driver like QEMU need changes to actually exploit this? Does management software like libvirt need to be involved in decision making, or does it just need to provide the knobs to make the driver configurable? I'm still pretty nervous about the userspace aspect of this as well. QEMU and other actual vfio drivers are probably the least affected; at least for QEMU, it'll happily open any device that has a pointer to an IOMMU group that's reflected as a vfio group device. Tools like libvirt, on the other hand, actually do driver binding, and we need to consider how they make driver decisions. Jason suggested that the vfio-pci driver ought to be only spec compliant behavior, which sounds like some deprecation process of splitting out the IGD, NVLink, zpci, etc. features into sub-drivers and eventually removing that device specific support from vfio-pci. Would we expect libvirt to know, "this is an 8086 graphics device, try to bind it to vfio-pci-igd" or "uname -m says we're running on s390, try to bind it to vfio-zpci"? Maybe we expect derived drivers to only bind to devices they recognize, so libvirt could blindly try a whole chain of drivers, ending in vfio-pci. Obviously if we have competing drivers that support the same device in different ways, that quickly falls apart. I think we can leave common arch specific stuff, such as s390 (IIUC), in the core driver, and only create vfio_pci drivers for vendor/device/subvendor specific stuff. Also, the competing drivers issue can also happen today, right? After adding a new_id to vfio_pci I don't know how Linux will behave if we plug a new device with the same id into the system.
which driver will probe it? I'm not really afraid of competing drivers, since we can ask vendor vfio pci_drivers to add vendor_id, device_id, subsystem_vendor and subsystem_device, so we won't have this problem. I don't think there will be 2 drivers that drive the same device with these 4 ids. A userspace tool can have a map of ids to drivers and bind the device to the right vfio-pci vendor driver if it has one; if not, bind to vfio_pci.ko. Libvirt could also expand its available driver models for the user to specify a variant; I'd support that for overriding a choice that libvirt might make otherwise, but forcing the user to know this information is just passing the buck. We can add code to libvirt as mentioned above. Some derived drivers could probably actually include device IDs rather than only relying on dynamic ids, but then we get into the problem that we're competing with the native host driver for a device. The aux bus example here is essentially the least troublesome variation since it works in conjunction
Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem
On 1/28/2021 6:29 PM, Cornelia Huck wrote: On Tue, 26 Jan 2021 15:27:43 +0200 Max Gurtovoy wrote: Hi Alex, Cornelia and Jason, thanks for reviewing this. On 1/26/2021 5:34 AM, Alex Williamson wrote: On Mon, 25 Jan 2021 20:45:22 -0400 Jason Gunthorpe wrote: On Mon, Jan 25, 2021 at 04:31:51PM -0700, Alex Williamson wrote: We're supposed to be enlightened by a vendor driver that does nothing more than pass the opaque device_data through to the core functions, but in reality this is exactly the point of concern above. At a minimum that vendor driver needs to look at the vdev to get the pdev, The end driver already has the pdev; the RFC doesn't go enough into those bits, it is a good comment. The dd_data passed to vfio_create_pci_device() will be retrieved from the ops to get back to the end driver's data. This can cleanly include everything: the VF pci_device, PF pci_device, mlx5_core pointer, vfio_device and vfio_pci_device. This is why the example passes in the mvadev: + vdev = vfio_create_pci_device(pdev, &mlx5_vfio_pci_ops, mvadev); The mvadev has the PF, VF, and mlx5 core driver pointer. Getting that back out during the ops is enough to do what the mlx5 driver needs to do, which is relay migration related IOCTLs to the PF function via the mlx5_core driver so the device can execute them on behalf of the VF. but then what else does it look at, consume, or modify. Now we have vendor drivers misusing the core because it's not clear which fields are private and how public fields can be used safely, The kernel has never followed rigid rules for data isolation; it is normal to have whole private structs exposed in headers so that container_of can be used to properly compose data structures. I reject this assertion, there are plenty of structs that clearly indicate private vs public data or, as we've done in mdev, clearly mark the private data in a "private" header and provide access functions for public fields.
Including a "private" header to make use of a "library" is just wrong. In the example above, there's no way for the mlx vendor driver to get back dd_data without extracting it from struct vfio_pci_device itself. I'll create a better separation between private/public fields according to my understanding for the V2. I'll just mention that beyond this separation, future improvements will be needed and can be done incrementally. I don't think that we should do so many changes in one shot; the incremental process is safer from a subsystem point of view. I also think that upstreaming mlx5_vfio_pci.ko and upstreaming the vfio-pci separation into 2 modules don't have to happen in one shot. The design can probably benefit from tossing a non-mlx5 driver into the mix. So, let me suggest the zdev support for that experiment (see e6b817d4b8217a9528fcfd59719b924ab8a5ff23 and friends.) It is quite straightforward: it injects some capabilities into the info ioctl. A specialized driver would basically only need to provide special handling for the ioctl callback and just forward anything else. It also would not need any matching for device ids etc., as it would only make sense on s390x, but regardless of the device. It could, however, help us get an idea of what a good split would look like. AFAIU, s390x is related to the IBM architecture and not to a specific PCI device, so I guess it should stay in the core, as many PCI devices will need these capabilities on IBM systems. I think I'll use the NVLINK2 P9 stuff as an example of the split and add it to vfio_pci.ko instead of vfio_pci_core.ko as a first step. Later we can create a dedicated module for these devices (V100 GPUs). But again, to make our point in this RFC, I'll improve it for V2. Look at struct device, for instance. Most of that is private to the driver core. A few 'private to vfio-pci-core' comments would help; it is good feedback to make that more clear. extensions potentially break vendor drivers, etc.
We're only even hand waving that existing device specific support could be farmed out to new device specific drivers without even going to the effort to prove that. This is an RFC, not a complete patch series. The RFC is to get feedback on the general design before everyone commits a lot of resources and positions get dug in. Do you really think the existing device specific support would be a problem to lift? It already looks pretty clean with the vfio_pci_regops, looks easy enough to lift to the parent. So far the TODOs rather mask the dirty little secrets of the extension rather than showing how a vendor derived driver needs to root around in struct vfio_pci_device to do something useful, so probably porting actual device specific support rather than further hand waving would be more helpful. It would be helpful to get actual feedback on the high level design - something like this was already tried in May and didn't go anywhere - are you surprised that we are reluctant to commit a lot of resources doing a complete job just to have it go nowhere again?
Re: [PATCH RFC v2 08/10] vdpa: add vdpa simulator for block device
On 1/28/2021 4:41 PM, Stefano Garzarella wrote: From: Max Gurtovoy

This will allow running vDPA for the virtio block protocol.

Signed-off-by: Max Gurtovoy
[sgarzare: various cleanups/fixes]
Signed-off-by: Stefano Garzarella
---
v2:
- rebased on top of other changes (dev_attr, get_config(), notify(), etc.)
- memset to 0 the config structure in vdpasim_blk_get_config()
- used vdpasim pointer in vdpasim_blk_get_config()

v1:
- Removed unused headers
- Used cpu_to_vdpasim*() to store config fields
- Replaced 'select VDPA_SIM' with 'depends on VDPA_SIM' since a selected option can not depend on others [Jason]
- Start with a single queue for now [Jason]
- Add comments to memory barriers
---
 drivers/vdpa/vdpa_sim/vdpa_sim_blk.c | 145 +++
 drivers/vdpa/Kconfig                 |   7 ++
 drivers/vdpa/vdpa_sim/Makefile       |   1 +
 3 files changed, 153 insertions(+)
 create mode 100644 drivers/vdpa/vdpa_sim/vdpa_sim_blk.c

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim_blk.c b/drivers/vdpa/vdpa_sim/vdpa_sim_blk.c
new file mode 100644
index ..999f9ca0b628
--- /dev/null
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim_blk.c
@@ -0,0 +1,145 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VDPA simulator for block device.
+ *
+ * Copyright (c) 2020, Mellanox Technologies. All rights reserved.

I guess we can change the copyright from Mellanox to: Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. Thanks.
Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem
Hi Alex, Cornelia and Jason, thanks for reviewing this. On 1/26/2021 5:34 AM, Alex Williamson wrote: On Mon, 25 Jan 2021 20:45:22 -0400 Jason Gunthorpe wrote: On Mon, Jan 25, 2021 at 04:31:51PM -0700, Alex Williamson wrote: We're supposed to be enlightened by a vendor driver that does nothing more than pass the opaque device_data through to the core functions, but in reality this is exactly the point of concern above. At a minimum that vendor driver needs to look at the vdev to get the pdev, The end driver already has the pdev; the RFC doesn't go enough into those bits, it is a good comment. The dd_data passed to vfio_create_pci_device() will be retrieved from the ops to get back to the end driver's data. This can cleanly include everything: the VF pci_device, PF pci_device, mlx5_core pointer, vfio_device and vfio_pci_device. This is why the example passes in the mvadev: + vdev = vfio_create_pci_device(pdev, &mlx5_vfio_pci_ops, mvadev); The mvadev has the PF, VF, and mlx5 core driver pointer. Getting that back out during the ops is enough to do what the mlx5 driver needs to do, which is relay migration related IOCTLs to the PF function via the mlx5_core driver so the device can execute them on behalf of the VF. but then what else does it look at, consume, or modify. Now we have vendor drivers misusing the core because it's not clear which fields are private and how public fields can be used safely, The kernel has never followed rigid rules for data isolation; it is normal to have whole private structs exposed in headers so that container_of can be used to properly compose data structures. I reject this assertion, there are plenty of structs that clearly indicate private vs public data or, as we've done in mdev, clearly mark the private data in a "private" header and provide access functions for public fields. Including a "private" header to make use of a "library" is just wrong.
In the example above, there's no way for the mlx vendor driver to get back dd_data without extracting it from struct vfio_pci_device itself. I'll create a better separation between private/public fields according to my understanding for the V2. I'll just mention that beyond this separation, future improvements will be needed and can be done incrementally. I don't think that we should do so many changes in one shot; the incremental process is safer from a subsystem point of view. I also think that upstreaming mlx5_vfio_pci.ko and upstreaming the vfio-pci separation into 2 modules don't have to happen in one shot. But again, to make our point in this RFC, I'll improve it for V2. Look at struct device, for instance. Most of that is private to the driver core. A few 'private to vfio-pci-core' comments would help; it is good feedback to make that more clear. extensions potentially break vendor drivers, etc. We're only even hand waving that existing device specific support could be farmed out to new device specific drivers without even going to the effort to prove that. This is an RFC, not a complete patch series. The RFC is to get feedback on the general design before everyone commits a lot of resources and positions get dug in. Do you really think the existing device specific support would be a problem to lift? It already looks pretty clean with the vfio_pci_regops, looks easy enough to lift to the parent. So far the TODOs rather mask the dirty little secrets of the extension rather than showing how a vendor derived driver needs to root around in struct vfio_pci_device to do something useful, so probably porting actual device specific support rather than further hand waving would be more helpful. It would be helpful to get actual feedback on the high level design - something like this was already tried in May and didn't go anywhere - are you surprised that we are reluctant to commit a lot of resources doing a complete job just to have it go nowhere again?
That's not really what I'm getting from your feedback, indicating vfio-pci is essentially done, the mlx stub driver should be enough to see the direction, and additional concerns can be handled with TODO comments. Sorry if this is not construed as actual feedback, I think both Connie and I are making an effort to understand this and being hampered by lack of a clear api or a vendor driver that's anything more than vfio-pci plus an aux bus interface. Thanks, I think I got the main idea and I'll try to summarize it: The separation into vfio-pci.ko and vfio-pci-core.ko is acceptable, and we do need it to be able to create a vendor-vfio-pci.ko driver in the future to include vendor special sauce inside. The separation implementation, the question of what is private and what is public, and the core APIs to the various drivers should be improved or better demonstrated in the V2. I'll work on improving it and I'll send the V2. If you have some feedback of the things/fields/structs you think should remain private to
[PATCH 2/3] vfio-pci: introduce vfio_pci_core subsystem driver
Split the vfio_pci driver into two parts, the 'struct pci_driver' (vfio_pci) and a library of code (vfio_pci_core) that helps creating a VFIO device on top of a PCI device. As before, vfio_pci.ko continues to present the same interface under sysfs and this change should have no functional impact. vfio_pci_core exposes an interface that is similar to a typical Linux subsystem, in that a pci_driver doing probe() can setup a number of details and then create the VFIO char device. Allowing another module to provide the pci_driver allows that module to customize how VFIO is setup, inject its own operations, and easily extend vendor specific functionality. This is complementary to how VFIO's mediated devices work. Instead of custom device lifecycle management and a special bus, drivers using this approach will rely on the normal driver core lifecycle (eg bind/unbind) management, and this is optimized to effectively support customization that is only making small modifications to what vfio_pci would do normally. This approach is also a pluggable alternative for the hard wired CONFIG_VFIO_PCI_IGD and CONFIG_VFIO_PCI_NVLINK2 "drivers" that are built into vfio-pci. Using this work all of that code can be moved to dedicated device-specific modules and cleanly split out of the generic vfio_pci driver. Below is an example for adding a new driver to the vfio pci subsystem:

+----------------------------------+
|                                  |
|               VFIO               |
|                                  |
+----------------------------------+

+----------------------------------+
|                                  |
|          VFIO_PCI_CORE           |
|                                  |
+----------------------------------+

+----------------+ +---------------+
|                | |               |
|    VFIO_PCI    | | MLX5_VFIO_PCI |
|                | |               |
+----------------+ +---------------+

In this way mlx5_vfio_pci will use vfio_pci_core to register to the vfio subsystem and also use the generic PCI functionality exported from it. Additionally it will add the needed vendor specific logic for HW specific features such as Live Migration.
Signed-off-by: Max Gurtovoy
---
 drivers/vfio/pci/Kconfig            |  12 +-
 drivers/vfio/pci/Makefile           |  13 +-
 drivers/vfio/pci/vfio_pci.c         | 220
 drivers/vfio/pci/vfio_pci_core.c    | 255 +++-
 drivers/vfio/pci/vfio_pci_private.h |  21 +++
 5 files changed, 321 insertions(+), 200 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci.c

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 40a223381ab6..5f90be27fba0 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
-config VFIO_PCI
-	tristate "VFIO support for PCI devices"
+config VFIO_PCI_CORE
+	tristate "VFIO core support for PCI devices"
 	depends on VFIO && PCI && EVENTFD
 	select VFIO_VIRQFD
 	select IRQ_BYPASS_MANAGER
@@ -10,6 +10,14 @@ config VFIO_PCI
 	  If you don't know what to do here, say N.
 
+config VFIO_PCI
+	tristate "VFIO support for PCI devices"
+	depends on VFIO_PCI_CORE
+	help
+	  This provides a generic PCI support using the VFIO framework.
+
+	  If you don't know what to do here, say N.
+
 config VFIO_PCI_VGA
 	bool "VFIO PCI support for VGA devices"
 	depends on VFIO_PCI && X86 && VGA_ARB
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index dd350b9b..3f2a27e222cd 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,8 +1,11 @@
 # SPDX-License-Identifier: GPL-2.0-only
-vfio-pci-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
-vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
-vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
-vfio-pci-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o
-
+obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+
+vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o
+
+vfio-pci-y := vfio_pci.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
new file mode 100644
index 000000000000..4e115a136930
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -0,0 +1,220 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+ *
+ * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
+ * Author: Alex Williamson
+ *
+ * Deriv
[PATCH 3/3] mlx5-vfio-pci: add new vfio_pci driver for mlx5 devices
This driver will register to the PCI bus and the Auxiliary bus. In case the probe of both devices succeeds, we'll have a vendor specific VFIO PCI device. mlx5_vfio_pci uses vfio_pci_core to register and create a VFIO device and uses auxiliary_device to get the needed extension from the vendor device driver. If one of the probe() functions fails, the VFIO char device will not be created. For now, only register and bind the auxiliary_device to the pci_device in case we have a match between the auxiliary_device id and the pci_device BDF. Later, vendor specific features such as live migration will be added and will be available to the virtualization software. Note: Although we've created the mlx5-vfio-pci.ko, the binding to vfio-pci.ko will still work as before. It's fully backward compatible. Of course, the extended vendor functionality will not exist in case one binds the device to the generic vfio_pci.ko.

Signed-off-by: Max Gurtovoy
---
 drivers/vfio/pci/Kconfig         |  10 ++
 drivers/vfio/pci/Makefile        |   3 +
 drivers/vfio/pci/mlx5_vfio_pci.c | 253 +++
 include/linux/mlx5/vfio_pci.h    |  36 +
 4 files changed, 302 insertions(+)
 create mode 100644 drivers/vfio/pci/mlx5_vfio_pci.c
 create mode 100644 include/linux/mlx5/vfio_pci.h

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 5f90be27fba0..2133cd2f9c92 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -65,3 +65,13 @@ config VFIO_PCI_ZDEV
 	  for zPCI devices passed through via VFIO on s390.
 
 	  Say Y here.
+
+config MLX5_VFIO_PCI
+	tristate "VFIO support for MLX5 PCI devices"
+	depends on VFIO_PCI_CORE && MLX5_CORE
+	select AUXILIARY_BUS
+	help
+	  This provides a generic PCI support for MLX5 devices using the VFIO
+	  framework.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 3f2a27e222cd..9f67edca31c5 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -2,6 +2,7 @@
 obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
@@ -9,3 +10,5 @@ vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o
 
 vfio-pci-y := vfio_pci.o
+
+mlx5-vfio-pci-y := mlx5_vfio_pci.o
diff --git a/drivers/vfio/pci/mlx5_vfio_pci.c b/drivers/vfio/pci/mlx5_vfio_pci.c
new file mode 100644
index 000000000000..98cc2d037b0d
--- /dev/null
+++ b/drivers/vfio/pci/mlx5_vfio_pci.c
@@ -0,0 +1,253 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+ * Author: Max Gurtovoy
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "vfio_pci_private.h"
+
+#define DRIVER_VERSION "0.1"
+#define DRIVER_AUTHOR "Max Gurtovoy "
+#define DRIVER_DESC "MLX5 VFIO PCI - User Level meta-driver for NVIDIA MLX5 device family"
+
+static LIST_HEAD(aux_devs_list);
+static DEFINE_MUTEX(aux_devs_lock);
+
+static struct mlx5_vfio_pci_adev *mlx5_vfio_pci_find_adev(struct pci_dev *pdev)
+{
+	struct mlx5_vfio_pci_adev *mvadev, *found = NULL;
+
+	mutex_lock(&aux_devs_lock);
+	list_for_each_entry(mvadev, &aux_devs_list, entry) {
+		if (mvadev->madev.adev.id == pci_dev_id(pdev)) {
+			found = mvadev;
+			break;
+		}
+	}
+	mutex_unlock(&aux_devs_lock);
+
+	return found;
+}
+
+static int mlx5_vfio_pci_aux_probe(struct auxiliary_device *adev,
+		const struct auxiliary_device_id *id)
+{
+	struct mlx5_vfio_pci_adev *mvadev;
+
+	mvadev = adev_to_mvadev(adev);
+
+	pr_info("%s aux probing bdf %02x:%02x.%d mdev is %s\n",
+		adev->name,
+		PCI_BUS_NUM(adev->id & 0xffff),
+		PCI_SLOT(adev->id & 0xff),
+		PCI_FUNC(adev->id & 0xff), dev_name(mvadev->madev.mdev->device));
+
+	mutex_lock(&aux_devs_lock);
+	list_add(&mvadev->entry, &aux_devs_list);
+	mutex_unlock(&aux_devs_lock);
+
+	return 0;
+}
+
+static void mlx5_vfio_pci_aux_remove(struct auxiliary_device *adev)
+{
+	struct mlx5_vfio_pci_adev *mvadev = adev_to_mvadev(adev);
+	struct vfio_pci_device *vdev = dev_get_drvdata(&adev->dev);
+
+	/* TODO: is this the right thing to do ? maybe FLR ? */
+	if (vdev)
+		pci_reset_function(vdev->pdev);
+
+	mutex_lock(&aux_devs_lock);
+	list
[PATCH 1/3] vfio-pci: rename vfio_pci.c to vfio_pci_core.c
This is a preparation patch for separating the vfio_pci driver to a subsystem driver and a generic pci driver. This patch doesn't change any logic.

Signed-off-by: Max Gurtovoy
---
 drivers/vfio/pci/Makefile                        | 2 +-
 drivers/vfio/pci/{vfio_pci.c => vfio_pci_core.c} | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename drivers/vfio/pci/{vfio_pci.c => vfio_pci_core.c} (100%)

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 781e0809d6ee..dd350b9b 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
-vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 vfio-pci-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci_core.c
similarity index 100%
rename from drivers/vfio/pci/vfio_pci.c
rename to drivers/vfio/pci/vfio_pci_core.c
--
2.25.4
[PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem
Hi Alex and Cornelia, This series splits the vfio_pci driver into 2 parts: a pci driver and a subsystem driver that will also be a library of code. The pci driver, vfio_pci.ko, will be used as before and it will bind to the subsystem driver vfio_pci_core.ko to register to the VFIO subsystem. This patchset is fully backward compatible. This is a typical Linux subsystem framework behaviour. This framework can also be adopted by vfio_mdev devices as we'll see in the below sketch. This series is coming to solve the issues that were raised in the previous attempt for extending vfio-pci for vendor specific functionality: https://lkml.org/lkml/2020/5/17/376 by Yan Zhao. This solution is also deterministic in the sense that when a user binds to a vendor specific vfio-pci driver, it will get all the special goodies of the HW. This subsystem framework will also ease adding vendor specific functionality to VFIO devices in the future by allowing another module to provide the pci_driver that can set up a number of details before registering to the VFIO subsystem (such as injecting its own operations). Below we can see the proposed changes (this patchset only deals with the VFIO_PCI subsystem but it can easily be extended to the VFIO_MDEV subsystem as well):

+------------------------------------------------------------------------+
|                                                                        |
|                                  VFIO                                  |
|                                                                        |
+------------------------------------------------------------------------+

+----------------------------------+ +----------------------------------+
|                                  | |                                  |
|          VFIO_PCI_CORE           | |          VFIO_MDEV_CORE          |
|                                  | |                                  |
+----------------------------------+ +----------------------------------+

+---------------+ +---------------+ +---------------+ +----------------+
|               | |               | |               | |                |
|   VFIO_PCI    | | MLX5_VFIO_PCI | |   VFIO_MDEV   | | MLX5_VFIO_MDEV |
|               | |               | |               | |                |
+---------------+ +---------------+ +---------------+ +----------------+

First 2 patches introduce the above changes for vfio_pci and vfio_pci_core. Patch (3/3) introduces a new mlx5 vfio-pci module that registers to the VFIO subsystem using vfio_pci_core. It also registers to the Auxiliary bus for binding to mlx5_core, which is the parent of mlx5-vfio-pci devices. This will allow extending mlx5-vfio-pci devices with HW specific features such as Live Migration (mlx5_core patches are not part of this series, which only proposes the changes needed for the vfio pci subsystem).
These devices will be seen on the Auxiliary bus as:

mlx5_core.vfio_pci.2048 -> ../../../devices/pci0000:00/0000:00:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/mlx5_core.vfio_pci.2048
mlx5_core.vfio_pci.2304 -> ../../../devices/pci0000:00/0000:00:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.1/mlx5_core.vfio_pci.2304

2048 represents BDF 08:00.0 and 2304 represents BDF 09:00.0 in decimal view. In this manner, the administrator will be able to locate the correct vfio-pci module it should bind the desired BDF to (by finding the pointer to the module according to the Auxiliary driver of that BDF). In this way, we'll use the HW vendor driver core to manage the lifecycle of these devices. This is reasonable since only the vendor driver knows exactly about the status of its internal state and the capabilities of its accelerators, for example. TODOs: 1. For this RFC we still haven't cleaned all vendor specific stuff that was merged in the past into vfio_pci (such as VFIO_PCI_IGD and VFIO_PCI_NVLINK2). 2. Create a subsystem module for VFIO_MDEV. This can be used for vendor specific scalable functions (SFs), for example. 3. Add Live migration functionality for mlx5 SNAP devices (NVMe/Virtio-BLK). 4. Add Live migration functionality for mlx5 VFs. 5. Add the needed functionality for mlx5_core. I would like to thank the great team that was involved in this development, design and internal review: Oren, Liran, Jason, Leon, Aviad, Shahaf, Gary, Artem, Kirti, Neo, Andy and others. This series applies cleanly on top of kernel 5.11-rc2+ commit 2ff90100ace8: "Merge tag 'hwmon-for-v5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging" from Linus. Note: Live migration for MLX5 SNAP devices is WIP and will be the first example for adding vendor extension to vfio-pci devices. As the changes to the subsystem must be defined as a pre-condition for this work, we've decided to split the submission for now. Max Gurtovoy
Re: [PATCH v2 00/17] vdpa: generalize vdpa simulator
On 11/26/2020 4:49 PM, Stefano Garzarella wrote: This series moves the network device simulator in a new module (vdpa_sim_net) and leaves the generic functions in the vdpa_sim core module, allowing the possibility to add new vDPA device simulators. For now I removed the vdpa-blk simulator patches, since I'm still working on them and debugging the iotlb issues. Thanks to Max who started this work! I took his patches and extended them a bit. As Jason suggested, I simplified the "vdpa: split vdpasim to core and net modules" patch, moving some changes out in small patches. @Max: I put your Co-developed-by and Signed-off-by tags on these patches, let me know if it is okay for you, or if there is a better way to give credit to your work! Stefano, thanks for taking my initial series and bringing it to upstream level and thanks Jason for your reviews. I'm ok with the tags and hopefully I'll be able to help a bit in the submission in a couple of weeks. great progress ! v1: https://lists.linuxfoundation.org/pipermail/virtualization/2020-November/050677.html v2: - moved most of the patches before the vdpa-core/net split [Jason] - removed unnecessary headers - removed 'default n' in Kconfig entries [Jason] - added VDPASIM_IOTLB_LIMIT macro [Jason] - set vringh notify callback [Jason] - used VIRTIO terminology for in_iov/out_iov [Stefan] - simplified "vdpa: split vdpasim to core and net modules" patch, moving some changes out in small patches - left batch_mapping module parameter in the core [Jason] Max Gurtovoy (2): vdpa_sim: remove hard-coded virtq count vdpa: split vdpasim to core and net modules Stefano Garzarella (15): vdpa: remove unnecessary 'default n' in Kconfig entries vdpa_sim: remove unnecessary headers inclusion vdpa_sim: remove the limit of IOTLB entries vdpa_sim: rename vdpasim_config_ops variables vdpa_sim: add struct vdpasim_dev_attr for device attributes vdpa_sim: add device id field in vdpasim_dev_attr vdpa_sim: add supported_features field in vdpasim_dev_attr
vdpa_sim: add work_fn in vdpasim_dev_attr vdpa_sim: store parsed MAC address in a buffer vdpa_sim: make 'config' generic and usable for any device type vdpa_sim: add get_config callback in vdpasim_dev_attr vdpa_sim: set vringh notify callback vdpa_sim: use kvmalloc to allocate vdpasim->buffer vdpa_sim: make vdpasim->buffer size configurable vdpa_sim: split vdpasim_virtqueue's iov field in out_iov and in_iov
Re: [PATCH v3] nvme-rdma: handle nvme completion data length
On 10/25/2020 1:51 PM, zhenwei pi wrote: Hit a kernel warning: refcount_t: underflow; use-after-free. WARNING: CPU: 0 PID: 0 at lib/refcount.c:28 RIP: 0010:refcount_warn_saturate+0xd9/0xe0 Call Trace: nvme_rdma_recv_done+0xf3/0x280 [nvme_rdma] __ib_process_cq+0x76/0x150 [ib_core] ... The reason is that a zero-byte message was received from the target, and the host side continues to process without length checking, so the previous CQE is processed twice. Do a sanity check on the received data length, and try to recover from the corrupted CQE case. Because a zero-byte message is not defined in the spec, using zero-byte messages to detect dead connections on the transport layer is not standard; currently we still treat it as illegal. Thanks to Chao Leng & Sagi for suggestions. Signed-off-by: zhenwei pi --- drivers/nvme/host/rdma.c | 8 1 file changed, 8 insertions(+) Seems strange that the target sends zero byte packets. Can you specify which target this is and the scenario ?
Re: [PATCH v3 14/56] IB: fix kernel-doc markups
Thanks Mauro, small fix for iser. On 10/23/2020 7:33 PM, Mauro Carvalho Chehab wrote: Some functions have different names between their prototypes and the kernel-doc markup. Others need to be fixed, as kernel-doc markups should use this format: identifier - description Signed-off-by: Mauro Carvalho Chehab

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c b/drivers/infiniband/ulp/iser/iscsi_iser.c
index 3690e28cc7ea..84cebf937680 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.c
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.c
@@ -739,7 +739,7 @@ iscsi_iser_set_param(struct iscsi_cls_conn *cls_conn,
 }
 
 /**
- * iscsi_iser_set_param() - set class connection parameter
+ * iscsi_iser_conn_get_stats() - set class connection parameter

iscsi_iser_conn_get_stats() - get iscsi connection statistics

  * @cls_conn:	iscsi class connection
  * @stats:	iscsi stats to output
  *
Re: [PATCH] nvme: do not call del_gendisk() on a disk that was never added
On 6/7/2020 2:45 PM, Niklas Cassel wrote: device_add_disk() is negated by del_gendisk(). alloc_disk_node() is negated by put_disk(). In nvme_alloc_ns(), device_add_disk() is one of the last things being called in the success case, and only void functions are being called after this. Therefore this call should not be negated in the error path. The superfluous call to del_gendisk() leads to the following prints: [7.839975] kobject: '(null)' (1ff73734): is not initialized, yet kobject_put() is being called. [7.840865] WARNING: CPU: 2 PID: 361 at lib/kobject.c:736 kobject_put+0x70/0x120 Fixes: 33cfdc2aa696 ("nvme: enforce extended LBA format for fabrics metadata") Signed-off-by: Niklas Cassel --- An alternative would be to do like nvme_ns_remove(), i.e. in the error path; check if ns->disk->flags & GENHD_FL_UP is set, and only then call del_gendisk(). However, that seems unnecessary, since as nvme_alloc_ns() is currently written, we know that device_add_disk() does not need to be negated. drivers/nvme/host/core.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) Looks good, Reviewed-by: Max Gurtovoy
Re: linux-next: manual merge of the block tree with the rdma tree
On 6/3/2020 2:32 AM, Jason Gunthorpe wrote: On Wed, Jun 03, 2020 at 01:40:51AM +0300, Max Gurtovoy wrote: On 6/3/2020 12:37 AM, Jens Axboe wrote: On 6/2/20 1:09 PM, Jason Gunthorpe wrote: On Tue, Jun 02, 2020 at 01:02:55PM -0600, Jens Axboe wrote: On 6/2/20 1:01 PM, Jason Gunthorpe wrote: On Tue, Jun 02, 2020 at 11:37:26AM +0300, Max Gurtovoy wrote: On 6/2/2020 5:56 AM, Stephen Rothwell wrote: Hi all, Hi, This looks good to me. Can you share a pointer to the tree so we'll test it in our labs ? need to re-test: 1. srq per core 2. srq per core + T10-PI And both will run with shared CQ. Max, this is too much conflict to send to Linus between your own patches. I am going to drop the nvme part of this from RDMA. Normally I don't like applying partial series, but due to this tree split, you can send the rebased nvme part through the nvme/block tree at rc1 in two weeks.. Yes, I'll send it in 2 weeks. Actually I hoped the iSER patches for CQ pool will be sent in this series but eventually they were not. This way we could have taken only the iser part and the new API. I saw the pulled version too late since I wasn't CCed to it and it was already merged before I had a chance to warn you about possible conflict. I think in general we should try to add new RDMA APIs first with iSER/SRP and avoid conflicting trees. If you are careful we can construct a shared branch and if Jens/etc is willing he can pull the RDMA base code after RDMA merges the branch and then apply the nvme parts. This is how things work with netdev It is tricky and you have to plan for it during your submission step, but we should be able to manage in most cases if this comes up more often. I think we can construct a branch like this for dedicated series and delete it after the acceptance. In case of new APIs for RDMA that involve touching NVMe stuff - we'll create this branch and ask Jens to pull it as you suggested. 
And as a general note, I suggest we won't merge NVMe/RDMA stuff to rdma-next without cooperation with Jens. -Max. Jason
Re: linux-next: manual merge of the block tree with the rdma tree
On 6/3/2020 12:37 AM, Jens Axboe wrote: On 6/2/20 1:09 PM, Jason Gunthorpe wrote: On Tue, Jun 02, 2020 at 01:02:55PM -0600, Jens Axboe wrote: On 6/2/20 1:01 PM, Jason Gunthorpe wrote: On Tue, Jun 02, 2020 at 11:37:26AM +0300, Max Gurtovoy wrote: On 6/2/2020 5:56 AM, Stephen Rothwell wrote: Hi all, Hi, This looks good to me. Can you share a pointer to the tree so we'll test it in our labs ? need to re-test: 1. srq per core 2. srq per core + T10-PI And both will run with shared CQ. Max, this is too much conflict to send to Linus between your own patches. I am going to drop the nvme part of this from RDMA. Normally I don't like applying partial series, but due to this tree split, you can send the rebased nvme part through the nvme/block tree at rc1 in two weeks.. Yes, I'll send it in 2 weeks. Actually I hoped the iSER patches for CQ pool will be sent in this series but eventually they were not. This way we could have taken only the iser part and the new API. I saw the pulled version too late since I wasn't CCed to it and it was already merged before I had a chance to warn you about possible conflict. I think in general we should try to add new RDMA APIs first with iSER/SRP and avoid conflicting trees. Was going to comment that this is probably how it should have been done to begin with. If we have multiple conflicts like that between two trees, someone is doing something wrong... Well, on the other hand having people add APIs in one tree and then (promised) consumers in another tree later on has proven problematic in the past. It is best to try to avoid that, but in this case I don't think Max will have any delay to get the API consumer into nvme in two weeks. Having conflicting trees is a problem. If there's a dependency for two trees for some new work, then just have a separate branch that's built on those two. For NVMe core work, then it should include the pending NVMe changes. 
I guess it's hard to do so during the merge window since the block and rdma trees are not in sync. I think it would have been a good idea to add Jens to CC and mention that we're posting code that is maintained by 2 different trees in the cover letter.
Re: linux-next: manual merge of the block tree with the rdma tree
On 6/2/2020 5:56 AM, Stephen Rothwell wrote: Hi all, Hi, This looks good to me. Can you share a pointer to the tree so we'll test it in our labs ? need to re-test: 1. srq per core 2. srq per core + T10-PI And both will run with shared CQ. Today's linux-next merge of the block tree got a conflict in: drivers/nvme/target/rdma.c between commit: 5733111dcd97 ("nvmet-rdma: use new shared CQ mechanism") from the rdma tree and commits: b0012dd39715 ("nvmet-rdma: use SRQ per completion vector") b09160c3996c ("nvmet-rdma: add metadata/T10-PI support") from the block tree. I fixed it up (see below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts.
Re: [PATCH] nvme-tcp: constify static struct blk_mq_ops
Looks good, Reviewed-by: Max Gurtovoy
Re: [PATCH] nvme: tcp: remove redundant assignment to variable ret
On 9/5/2019 5:34 PM, Colin King wrote: From: Colin Ian King The variable ret is being initialized with a value that is never read and is being re-assigned immediately afterwards. The assignment is redundant and hence can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King --- drivers/nvme/host/tcp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 2d8ba31cb691..d91be6ddfe25 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -1824,7 +1824,7 @@ static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl) static int nvme_tcp_setup_ctrl(struct nvme_ctrl *ctrl, bool new) { struct nvmf_ctrl_options *opts = ctrl->opts; - int ret = -EINVAL; + int ret; ret = nvme_tcp_configure_admin_queue(ctrl, new); if (ret) Looks fine, Reviewed-by: Max Gurtovoy
Re: [PATCH v2] nvme: fix multipath crash when ANA desactivated
On 7/10/2019 12:29 AM, Christoph Hellwig wrote: On Sat, Jul 06, 2019 at 01:06:44PM +0300, Max Gurtovoy wrote: + /* check if multipath is enabled and we have the capability */ + if (!multipath) + return 0; + if (!ctrl->subsys || ((ctrl->subsys->cmic & (1 << 3)) != 0)) shouldn't it be: if (!ctrl->subsys || ((ctrl->subsys->cmic & (1 << 3)) == 0)) or if (!ctrl->subsys || !(ctrl->subsys->cmic & (1 << 3))) Otherwise, you don't really do any initialization and return 0 in case you have the capability, right ? Yes. FYI, my idea how to fix this would be something like: diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index a9a927677970..cdb3e5baa329 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -12,11 +12,6 @@ module_param(multipath, bool, 0444); MODULE_PARM_DESC(multipath, "turn on native support for multiple controllers per subsystem"); -inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl) -{ - return multipath && ctrl->subsys && (ctrl->subsys->cmic & (1 << 3)); -} - /* * If multipathing is enabled we need to always use the subsystem instance * number for numbering our devices to avoid conflicts between subsystems that @@ -622,7 +617,7 @@ int nvme_mpath_init(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id) { int error; - if (!nvme_ctrl_use_ana(ctrl)) + if (!multipath || !ctrl->subsys || !(ctrl->subsys->cmic & (1 << 3))) return 0; ctrl->anacap = id->anacap; diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 716a876119c8..14eca76bec5c 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -485,7 +485,10 @@ extern const struct attribute_group *nvme_ns_id_attr_groups[]; extern const struct block_device_operations nvme_ns_head_ops; #ifdef CONFIG_NVME_MULTIPATH -bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl); +static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl) +{ + return ctrl->ana_log_buf != NULL; +} void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns, struct 
nvme_ctrl *ctrl, int *flags); void nvme_failover_req(struct request *req); Yes this looks good.
Re: [PATCH v2] nvme: fix multipath crash when ANA desactivated
On 7/5/2019 5:05 PM, Marta Rybczynska wrote: Fix a crash with multipath activated. It happends when ANA log page is larger than MDTS and because of that ANA is disabled. The driver then tries to access unallocated buffer when connecting to a nvme target. The signature is as follows: [ 300.433586] nvme nvme0: ANA log page size (8208) larger than MDTS (8192). [ 300.435387] nvme nvme0: disabling ANA support. [ 300.437835] nvme nvme0: creating 4 I/O queues. [ 300.459132] nvme nvme0: new ctrl: NQN "nqn.0.0.0", addr 10.91.0.1:8009 [ 300.464609] BUG: unable to handle kernel NULL pointer dereference at 0008 [ 300.466342] #PF error: [normal kernel read fault] [ 300.467385] PGD 0 P4D 0 [ 300.467987] Oops: [#1] SMP PTI [ 300.468787] CPU: 3 PID: 50 Comm: kworker/u8:1 Not tainted 5.0.20kalray+ #4 [ 300.470264] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 300.471532] Workqueue: nvme-wq nvme_scan_work [nvme_core] [ 300.472724] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core] [ 300.474038] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 <66> 83 7d 08 00 0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48 [ 300.477374] RSP: 0018:a50e80fd7cb8 EFLAGS: 00010296 [ 300.478334] RAX: 0001 RBX: 9130f1872258 RCX: [ 300.479784] RDX: c06c4c30 RSI: 9130edad4280 RDI: 9130f1872258 [ 300.481488] RBP: R08: 0001 R09: 0044 [ 300.483203] R10: 0220 R11: 0040 R12: 9130f18722c0 [ 300.484928] R13: 9130f18722d0 R14: 9130edad4280 R15: 9130f18722c0 [ 300.486626] FS: () GS:9130f7b8() knlGS: [ 300.488538] CS: 0010 DS: ES: CR0: 80050033 [ 300.489907] CR2: 0008 CR3: 0002365e6000 CR4: 06e0 [ 300.491612] DR0: DR1: DR2: [ 300.493303] DR3: DR6: fffe0ff0 DR7: 0400 [ 300.494991] Call Trace: [ 300.495645] nvme_mpath_add_disk+0x5c/0xb0 [nvme_core] [ 300.496880] nvme_validate_ns+0x2ef/0x550 [nvme_core] [ 300.498105] ? 
nvme_identify_ctrl.isra.45+0x6a/0xb0 [nvme_core] [ 300.499539] nvme_scan_work+0x2b4/0x370 [nvme_core] [ 300.500717] ? __switch_to_asm+0x35/0x70 [ 300.501663] process_one_work+0x171/0x380 [ 300.502340] worker_thread+0x49/0x3f0 [ 300.503079] kthread+0xf8/0x130 [ 300.503795] ? max_active_store+0x80/0x80 [ 300.504690] ? kthread_bind+0x10/0x10 [ 300.505502] ret_from_fork+0x35/0x40 [ 300.506280] Modules linked in: nvme_tcp nvme_rdma rdma_cm iw_cm ib_cm ib_core nvme_fabrics nvme_core xt_physdev ip6table_raw ip6table_mangle ip6table_filter ip6_tables xt_comment iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_CHECKSUM iptable_mangle iptable_filter veth ebtable_filter ebtable_nat ebtables iptable_raw vxlan ip6_udp_tunnel udp_tunnel sunrpc joydev pcspkr virtio_balloon br_netfilter bridge stp llc ip_tables xfs libcrc32c ata_generic pata_acpi virtio_net virtio_console net_failover virtio_blk failover ata_piix serio_raw libata virtio_pci virtio_ring virtio [ 300.514984] CR2: 0008 [ 300.515569] ---[ end trace faa2eefad7e7f218 ]--- [ 300.516354] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core] [ 300.517330] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 <66> 83 7d 08 00 0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48 [ 300.520353] RSP: 0018:a50e80fd7cb8 EFLAGS: 00010296 [ 300.521229] RAX: 0001 RBX: 9130f1872258 RCX: [ 300.522399] RDX: c06c4c30 RSI: 9130edad4280 RDI: 9130f1872258 [ 300.523560] RBP: R08: 0001 R09: 0044 [ 300.524734] R10: 0220 R11: 0040 R12: 9130f18722c0 [ 300.525915] R13: 9130f18722d0 R14: 9130edad4280 R15: 9130f18722c0 [ 300.527084] FS: () GS:9130f7b8() knlGS: [ 300.528396] CS: 0010 DS: ES: CR0: 80050033 [ 300.529440] CR2: 0008 CR3: 0002365e6000 CR4: 06e0 [ 300.530739] DR0: DR1: DR2: [ 300.531989] DR3: DR6: fffe0ff0 DR7: 0400 [ 300.533264] Kernel panic - not syncing: Fatal exception [ 300.534338] Kernel Offset: 0x17c0 from 0x8100 (relocation 
range: 0x8000-0xbfff) [ 300.536227] ---[ end Kernel panic - not syncing: Fatal exception ]--- Signed-off-by: Marta Rybczynska Tested-by: Jean-Baptiste Riaux --- drivers/nvme/host/multipath.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/multipath.c
Re: [PATCH v2 2/2] nvmet-loop: Flush nvme_delete_wq when removing the port
On 7/4/2019 2:03 AM, Logan Gunthorpe wrote: After calling nvme_loop_delete_ctrl(), the controllers will not yet be deleted because nvme_delete_ctrl() only schedules work to do the delete. This means a race can occur if a port is removed but there are still active controllers trying to access that memory. To fix this, flush the nvme_delete_wq before returning from nvme_loop_remove_port() so that any controllers that might be in the process of being deleted won't access a freed port. Signed-off-by: Logan Gunthorpe --- drivers/nvme/target/loop.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c index 9e211ad6bdd3..da9cd07461fb 100644 --- a/drivers/nvme/target/loop.c +++ b/drivers/nvme/target/loop.c @@ -654,6 +654,14 @@ static void nvme_loop_remove_port(struct nvmet_port *port) mutex_lock(&nvme_loop_ports_mutex); list_del_init(&port->entry); mutex_unlock(&nvme_loop_ports_mutex); + + /* + * Ensure any ctrls that are in the process of being + * deleted are in fact deleted before we return + * and free the port. This is to prevent active + * ctrls from using a port after it's freed. + */ + flush_workqueue(nvme_delete_wq); } static const struct nvmet_fabrics_ops nvme_loop_ops = { Looks good: Reviewed-by: Max Gurtovoy
Re: [PATCH v2 1/2] nvmet: Fix use-after-free bug when a port is removed
On 7/5/2019 12:01 AM, Logan Gunthorpe wrote: On 2019-07-04 3:00 p.m., Max Gurtovoy wrote: Hi Logan, On 7/4/2019 2:03 AM, Logan Gunthorpe wrote: When a port is removed through configfs, any connected controllers are still active and can still send commands. This causes a use-after-free bug which is detected by KASAN for any admin command that dereferences req->port (like in nvmet_execute_identify_ctrl). To fix this, disconnect all active controllers when a subsystem is removed from a port. This ensures there are no active controllers when the port is eventually removed. so now we are enforcing controller existence with port configfs, right ? sounds reasonable. Correct. Did you run your patches with other transport (RDMA/TCP/FC) ? Just RDMA and loop. I suppose I could test with TCP but I don't have FC hardware. Great. the code looks good: Reviewed-by: Max Gurtovoy Logan
Re: [PATCH v2 1/2] nvmet: Fix use-after-free bug when a port is removed
Hi Logan, On 7/4/2019 2:03 AM, Logan Gunthorpe wrote: When a port is removed through configfs, any connected controllers are still active and can still send commands. This causes a use-after-free bug which is detected by KASAN for any admin command that dereferences req->port (like in nvmet_execute_identify_ctrl). To fix this, disconnect all active controllers when a subsystem is removed from a port. This ensures there are no active controllers when the port is eventually removed. so now we are enforcing controller existence with port configfs, right ? sounds reasonable. Did you run your patches with other transport (RDMA/TCP/FC) ?
Re: [PATCH 6/8] IB/srp: set virt_boundary_mask in the scsi host
On 6/17/2019 3:19 PM, Christoph Hellwig wrote: This ensures all proper DMA layer handling is taken care of by the SCSI midlayer. Signed-off-by: Christoph Hellwig Looks good, Reviewed-by: Max Gurtovoy
Re: [PATCH 5/8] IB/iser: set virt_boundary_mask in the scsi host
On 6/17/2019 3:19 PM, Christoph Hellwig wrote: This ensures all proper DMA layer handling is taken care of by the SCSI midlayer. Signed-off-by: Christoph Hellwig Looks good, Reviewed-by: Max Gurtovoy
Re: [PATCH v2 06/10] nvme/core: add mdev interfaces
On 5/3/2019 3:29 PM, Christoph Hellwig wrote: On Thu, May 02, 2019 at 02:47:57PM +0300, Maxim Levitsky wrote: If the mdev device driver also sets the NVME_F_MDEV_DMA_SUPPORTED, the mdev core will dma map all the guest memory into the nvme device, so that nvme device driver can use dma addresses as passed from the mdev core driver We really need a proper block layer interface for that so that uring or the nvme target can use pre-mapping as well. I think we can also find a way to use nvme-mdev for the target offload p2p feature. Don't see a big difference of taking NVMe queue and namespace/partition to guest OS or to P2P since IO is issued by external entity and polled outside the pci driver. thoughts ? ___ Linux-nvme mailing list linux-n...@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme
Re: [PATCH] IB/iser: remove uninitialized variable len
On 3/17/2019 1:05 AM, Colin King wrote: From: Colin Ian King The variable len is not being initialized and the uninitialized value is being returned. However, this return path is never reached because the default case in the switch statement returns -ENOSYS. Clean up the code by replacing the return -ENOSYS with a break for the default case and returning -ENOSYS at the end of the function. This allows len to be removed. Also remove redundant break that follows a return statement. Signed-off-by: Colin Ian King Looks good, Reviewed-by: Max Gurtovoy
Re: [PATCH] nvmet: disable direct I/O when unavailable
On 2/22/2019 7:55 AM, Johannes Thumshirn wrote: On 22/02/2019 01:41, Chaitanya Kulkarni wrote: [...] As per specified in the patch, this is only useful for testing, then we should modify the test scripts so that on creation of the ctrl we switch to the buffered I/O before running fio. Or on any other file-system that does not support DIO.. Do we really want to support these kind of filesystems for fabrics backend device ? only for testing ? What is the status of iSCSI/SRP targets in this case ? OR Similar result can be achieved by setting buffered I/O flag buffered_io=1 before enabling the name-space in the test script. Frankly, we have a ton of testing related special cases in the kernel. This one is a) simple and small, only 10 LoC, b) far away from the fast path or any other place where it could have any impact on legitimate users and c) it prints an informal message showing you what happened. Sorry but this is a https://xkcd.com/386/ moment. Byte, Johannes
Re: [PATCH] nvme: host: core: fix precedence of ternary operator
On 5/23/2018 5:56 PM, Ivan Bornyakov wrote: Ternary operator has lower precedence than bitwise or, so 'cdw10' was calculated wrong. Signed-off-by: Ivan Bornyakov <brnkv...@gmail.com> --- drivers/nvme/host/core.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index b070c659391f..1eba9b0cb9dc 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1577,7 +1577,7 @@ static int nvme_pr_reserve(struct block_device *bdev, u64 key, static int nvme_pr_preempt(struct block_device *bdev, u64 old, u64 new, enum pr_type type, bool abort) { - u32 cdw10 = nvme_pr_type(type) << 8 | abort ? 2 : 1; + u32 cdw10 = nvme_pr_type(type) << 8 | (abort ? 2 : 1); return nvme_pr_command(bdev, cdw10, old, new, nvme_cmd_resv_acquire); } @@ -1589,7 +1589,7 @@ static int nvme_pr_clear(struct block_device *bdev, u64 key) static int nvme_pr_release(struct block_device *bdev, u64 key, enum pr_type type) { - u32 cdw10 = nvme_pr_type(type) << 8 | key ? 1 << 3 : 0; + u32 cdw10 = nvme_pr_type(type) << 8 | (key ? 1 << 3 : 0); return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_release); } Looks good, Reviewed-by: Max Gurtovoy <m...@mellanox.com>
Re: [PATCH V2] nvme-rdma: fix double free in nvme_rdma_free_queue
Hi Jianchao, On 5/10/2018 12:42 PM, Jianchao Wang wrote: BUG: KASAN: double-free or invalid-free in nvme_rdma_free_queue+0xf6/0x110 [nvme_rdma] Workqueue: nvme-reset-wq nvme_rdma_reset_ctrl_work [nvme_rdma] Call Trace: dump_stack+0x91/0xeb print_address_description+0x6b/0x290 kasan_report_invalid_free+0x55/0x80 __kasan_slab_free+0x176/0x190 kfree+0xeb/0x310 nvme_rdma_free_queue+0xf6/0x110 [nvme_rdma] nvme_rdma_configure_admin_queue+0x1a3/0x4d0 [nvme_rdma] nvme_rdma_reset_ctrl_work+0x4e/0xd0 [nvme_rdma] process_one_work+0x3ca/0xaa0 worker_thread+0x4e2/0x6c0 kthread+0x18d/0x1e0 ret_from_fork+0x24/0x30 The double free is on ctrl->async_event_sqe. If any case fails before ctrl->async_event_sqe is allocated in nvme_rdma_configure_admin_queue, nvme_rdma_free_queue will be invoked. However, at the moment, the ctrl->async_event_sqe has not been allocated because it has been freed in nvme_rdma_reset_ctrl_work -> nvme_rdma_shutdown_ctrl -> nvme_rdma_destroy_admin_queue -> nvme_rdma_free_queue Signed-off-by: Jianchao Wang --- V2: handle it in nvme_rdma_free_queue and add some comment to explain it. I don't know exactly what Christoph meant but IMO the best place to allocate it is in nvme_rdma_alloc_queue just before calling "set_bit(NVME_RDMA_Q_ALLOCATED, &queue->flags);" then you will never get to double free since we clear the NVME_RDMA_Q_ALLOCATED bit in the beginning of nvme_rdma_free_queue. drivers/nvme/host/rdma.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index 966e0dd..fa5cf87 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -561,9 +561,18 @@ static void nvme_rdma_free_queue(struct nvme_rdma_queue *queue) return; if (nvme_rdma_queue_idx(queue) == 0) { - nvme_rdma_free_qe(queue->device->dev, - &queue->ctrl->async_event_sqe, - sizeof(struct nvme_command), DMA_TO_DEVICE); + /* + * async_event_sqe is not allocated in nvme_rdma_alloc_queue,
+ * so there are cases where NVME_RDMA_Q_ALLOCATED is set, but + * async_event_sqe is not allocated. To avoid double free, set + * async_event_sqe.data to NULL to indicate it has been freed. + */ + if (queue->ctrl->async_event_sqe.data) { + nvme_rdma_free_qe(queue->device->dev, + &queue->ctrl->async_event_sqe, + sizeof(struct nvme_command), DMA_TO_DEVICE); + queue->ctrl->async_event_sqe.data = NULL; + } } nvme_rdma_destroy_queue_ib(queue); -Max.
Re: [PATCH V2] nvme-rdma: stop queue first before free it in config admin queue
On 5/10/2018 12:42 PM, Jianchao Wang wrote: When any of the cases after nvme_rdma_start_queue in nvme_rdma_configure_admin_queue fails, the ctrl->queues[0] will be freed but the NVME_RDMA_Q_LIVE is still set. If nvme_rdma_stop_queue is invoked, we will incur use-after-free which will cause memory corruption. BUG: KASAN: use-after-free in rdma_disconnect+0x1f/0xe0 [rdma_cm] Read of size 8 at addr 8801dc3969c0 by task kworker/u16:3/9304 CPU: 3 PID: 9304 Comm: kworker/u16:3 Kdump: loaded Tainted: GW 4.17.0-rc3+ #20 Workqueue: nvme-delete-wq nvme_delete_ctrl_work Call Trace: dump_stack+0x91/0xeb print_address_description+0x6b/0x290 kasan_report+0x261/0x360 rdma_disconnect+0x1f/0xe0 [rdma_cm] nvme_rdma_stop_queue+0x25/0x40 [nvme_rdma] nvme_rdma_shutdown_ctrl+0xf3/0x150 [nvme_rdma] nvme_delete_ctrl_work+0x98/0xe0 process_one_work+0x3ca/0xaa0 worker_thread+0x4e2/0x6c0 kthread+0x18d/0x1e0 ret_from_fork+0x24/0x30 To fix it, call nvme_rdma_stop_queue for all the failed cases after nvme_rdma_start_queue. Signed-off-by: Jianchao Wang <jianchao.w.w...@oracle.com> --- V2: based on Sagi's suggestion, add out_stop_queue label and invoke nvme_rdma_stop_queue in all the failed cases after nvme_rdma_start_queue Looks good, Reviewed-by: Max Gurtovoy <m...@mellanox.com>
Re: [PATCH AUTOSEL for 4.14 039/161] IB/cq: Don't force IB_POLL_DIRECT poll context for ib_process_cq_direct
Hi Sasha, please consider taking a small fix for this one (also useful for 4.15): commit d3b9e8ad425cfd5b9116732e057f1b48e4d3bcb8 Author: Max Gurtovoy <m...@mellanox.com> Date: Mon Mar 5 20:09:48 2018 +0200 RDMA/core: Reduce poll batch for direct cq polling Fix warning limit for kernel stack consumption: drivers/infiniband/core/cq.c: In function 'ib_process_cq_direct': drivers/infiniband/core/cq.c:78:1: error: the frame size of 1032 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] Using smaller ib_wc array on the stack brings us comfortably below that limit again. Fixes: 246d8b184c10 ("IB/cq: Don't force IB_POLL_DIRECT poll context for ib_process_cq_direct") Reported-by: Arnd Bergmann <a...@arndb.de> Reviewed-by: Sergey Gorenko <serge...@mellanox.com> Signed-off-by: Max Gurtovoy <m...@mellanox.com> Signed-off-by: Leon Romanovsky <leo...@mellanox.com> Reviewed-by: Bart Van Assche <bart.vanass...@wdc.com> Acked-by: Arnd Bergmann <a...@arndb.de> Signed-off-by: Jason Gunthorpe <j...@mellanox.com> -Max. On 4/9/2018 3:20 AM, Sasha Levin wrote: From: Sagi Grimberg <s...@grimberg.me> [ Upstream commit 246d8b184c100e8eb6b4e8c88f232c2ed2a4e672 ] polling the completion queue directly does not interfere with the existing polling logic, hence drop the requirement. Be aware that running ib_process_cq_direct with non IB_POLL_DIRECT CQ may trigger concurrent CQ processing. This can be used for polling mode ULPs. 
Cc: Bart Van Assche <bart.vanass...@wdc.com> Reported-by: Steve Wise <sw...@opengridcomputing.com> Signed-off-by: Sagi Grimberg <s...@grimberg.me> [maxg: added wcs array argument to __ib_process_cq] Signed-off-by: Max Gurtovoy <m...@mellanox.com> Signed-off-by: Doug Ledford <dledf...@redhat.com> Signed-off-by: Sasha Levin <alexander.le...@microsoft.com> --- drivers/infiniband/core/cq.c | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c index f2ae75fa3128..c8c5a5a7f433 100644 --- a/drivers/infiniband/core/cq.c +++ b/drivers/infiniband/core/cq.c @@ -25,9 +25,10 @@ #define IB_POLL_FLAGS \ (IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS) -static int __ib_process_cq(struct ib_cq *cq, int budget) +static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc *poll_wc) { int i, n, completed = 0; + struct ib_wc *wcs = poll_wc ? : cq->wc; /* * budget might be (-1) if the caller does not @@ -35,9 +36,9 @@ static int __ib_process_cq(struct ib_cq *cq, int budget) * minimum here. */ while ((n = ib_poll_cq(cq, min_t(u32, IB_POLL_BATCH, - budget - completed), cq->wc)) > 0) { + budget - completed), wcs)) > 0) { for (i = 0; i < n; i++) { - struct ib_wc *wc = &cq->wc[i]; + struct ib_wc *wc = &wcs[i]; if (wc->wr_cqe) wc->wr_cqe->done(cq, wc); @@ -60,18 +61,20 @@ static int __ib_process_cq(struct ib_cq *cq, int budget) * @cq: CQ to process * @budget: number of CQEs to poll for * - * This function is used to process all outstanding CQ entries on a - * %IB_POLL_DIRECT CQ. It does not offload CQ processing to a different - * context and does not ask for completion interrupts from the HCA. + * This function is used to process all outstanding CQ entries. + * It does not offload CQ processing to a different context and does + * not ask for completion interrupts from the HCA. + * Using direct processing on CQ with non IB_POLL_DIRECT type may trigger + * concurrent processing.
* * Note: do not pass -1 as %budget unless it is guaranteed that the number * of completions that will be processed is small. */ int ib_process_cq_direct(struct ib_cq *cq, int budget) { - WARN_ON_ONCE(cq->poll_ctx != IB_POLL_DIRECT); + struct ib_wc wcs[IB_POLL_BATCH]; - return __ib_process_cq(cq, budget); + return __ib_process_cq(cq, budget, wcs); } EXPORT_SYMBOL(ib_process_cq_direct); @@ -85,7 +88,7 @@ static int ib_poll_handler(struct irq_poll *iop, int budget) struct ib_cq *cq = container_of(iop, struct ib_cq, iop); int completed; - completed = __ib_process_cq(cq, budget); + completed = __ib_process_cq(cq, budget, NULL); if (completed < budget) { irq_poll_complete(&cq->iop); if (ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0) @@ -105,7 +108,7 @@ static void ib_cq_poll_work(struct work_struct *work) struct ib_cq *cq = container_of(work, struct ib_cq, work); int completed; - completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE); + completed = __ib_process_cq(cq
Re: [PATCH v3] nvmet: fix nvmet_execute_write_zeroes function
On 4/2/2018 8:38 PM, Keith Busch wrote: Thanks, I've applied the patch with a simpler changelog explaining the bug. Thanks Rodrigo and Keith, I've tested with/w.o the patch and it works well (with the fix only). -Max.
Re: [PATCH] RDMA/core: reduce IB_POLL_BATCH constant
On 2/28/2018 8:55 PM, Doug Ledford wrote: On Wed, 2018-02-28 at 11:50 +0200, Max Gurtovoy wrote: On 2/28/2018 2:21 AM, Bart Van Assche wrote: On 02/27/18 14:15, Max Gurtovoy wrote: -static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc *poll_wc) +static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc *poll_wc, + int batch) { - int i, n, completed = 0; - struct ib_wc *wcs = poll_wc ? : cq->wc; + int i, n, ib_poll_batch, completed = 0; + struct ib_wc *wcs; + + if (poll_wc) { + wcs = poll_wc; + ib_poll_batch = batch; + } else { + wcs = cq->wc; + ib_poll_batch = IB_POLL_BATCH; + } Since this code has to be touched I think that we can use this opportunity to get rid of the "poll_wc ? : cq->wc" conditional and instead use what the caller passes. That will require to update all __ib_process_cq(..., ..., NULL) calls. I also propose to let the caller pass ib_poll_batch instead of figuring it out in this function. Otherwise the approach of this patch looks fine to me. Thanks Bart. I'll make these changes and submit. That sounds reasonable to me too, thanks for reworking and resubmitting. Sure, NP. We've run NVMe-oF and SRP with the new patch. I'll send it through Mellanox maintainers pull request. Thanks for reporting and reviewing. -Max.
Re: [PATCH] RDMA/core: reduce IB_POLL_BATCH constant
On 2/28/2018 2:21 AM, Bart Van Assche wrote: On 02/27/18 14:15, Max Gurtovoy wrote: -static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc *poll_wc) +static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc *poll_wc, + int batch) { - int i, n, completed = 0; - struct ib_wc *wcs = poll_wc ? : cq->wc; + int i, n, ib_poll_batch, completed = 0; + struct ib_wc *wcs; + + if (poll_wc) { + wcs = poll_wc; + ib_poll_batch = batch; + } else { + wcs = cq->wc; + ib_poll_batch = IB_POLL_BATCH; + } Since this code has to be touched I think that we can use this opportunity to get rid of the "poll_wc ? : cq->wc" conditional and instead use what the caller passes. That will require to update all __ib_process_cq(..., ..., NULL) calls. I also propose to let the caller pass ib_poll_batch instead of figuring it out in this function. Otherwise the approach of this patch looks fine to me. Thanks Bart. I'll make these changes and submit. Thanks, Bart. -Max.
Re: [PATCH] RDMA/core: reduce IB_POLL_BATCH constant
On 2/28/2018 12:09 AM, Jason Gunthorpe wrote: On Thu, Feb 22, 2018 at 05:39:09PM +0200, Sagi Grimberg wrote: The only reason why I added this array on-stack was to allow consumers that did not use ib_alloc_cq api to call it, but that seems like a wrong decision when thinking it over again (as probably these users did not set the wr_cqe correctly). How about we make ib_process_cq_direct use the cq wc array and add a WARN_ON statement (and fail it gracefully) if the caller used this API without calling ib_alloc_cq? but we tried to avoid cuncurrent access to cq->wc. Not sure its a valid use-case. But if there is a compelling reason to keep it as is, then we can do smaller on-stack array. Did we come to a conclusion what to do here? guys, what do you think about the following solution (untested): diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c index bc79ca8..59d2835 100644 --- a/drivers/infiniband/core/cq.c +++ b/drivers/infiniband/core/cq.c @@ -17,6 +17,7 @@ /* # of WCs to poll for with a single call to ib_poll_cq */ #define IB_POLL_BATCH 16 +#define IB_POLL_BATCH_DIRECT 8 /* # of WCs to iterate over before yielding */ #define IB_POLL_BUDGET_IRQ 256 @@ -25,17 +26,25 @@ #define IB_POLL_FLAGS \ (IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS) -static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc *poll_wc) +static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc *poll_wc, + int batch) { - int i, n, completed = 0; - struct ib_wc *wcs = poll_wc ? : cq->wc; - + int i, n, ib_poll_batch, completed = 0; + struct ib_wc *wcs; + + if (poll_wc) { + wcs = poll_wc; + ib_poll_batch = batch; + } else { + wcs = cq->wc; + ib_poll_batch = IB_POLL_BATCH; + } /* * budget might be (-1) if the caller does not * want to bound this call, thus we need unsigned * minimum here. 
*/ - while ((n = ib_poll_cq(cq, min_t(u32, IB_POLL_BATCH, + while ((n = ib_poll_cq(cq, min_t(u32, ib_poll_batch, budget - completed), wcs)) > 0) { for (i = 0; i < n; i++) { struct ib_wc *wc = &wcs[i]; @@ -48,7 +57,7 @@ static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc *poll_wc) completed += n; - if (n != IB_POLL_BATCH || + if (n != ib_poll_batch || (budget != -1 && completed >= budget)) break; } @@ -72,9 +81,9 @@ static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc *poll_wc) */ int ib_process_cq_direct(struct ib_cq *cq, int budget) { - struct ib_wc wcs[IB_POLL_BATCH]; + struct ib_wc wcs[IB_POLL_BATCH_DIRECT]; - return __ib_process_cq(cq, budget, wcs); + return __ib_process_cq(cq, budget, wcs, IB_POLL_BATCH_DIRECT); } EXPORT_SYMBOL(ib_process_cq_direct); @@ -88,7 +97,7 @@ static int ib_poll_handler(struct irq_poll *iop, int budget) struct ib_cq *cq = container_of(iop, struct ib_cq, iop); int completed; - completed = __ib_process_cq(cq, budget, NULL); + completed = __ib_process_cq(cq, budget, NULL, 0); if (completed < budget) { irq_poll_complete(&cq->iop); if (ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0) @@ -108,7 +117,7 @@ static void ib_cq_poll_work(struct work_struct *work) struct ib_cq *cq = container_of(work, struct ib_cq, work); int completed; - completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE, NULL); + completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE, NULL, 0); if (completed >= IB_POLL_BUDGET_WORKQUEUE || ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0) queue_work(ib_comp_wq, &cq->work); Jason
Re: [PATCH] RDMA/core: reduce IB_POLL_BATCH constant
On 2/21/2018 3:44 PM, Sagi Grimberg wrote: On Tue, 2018-02-20 at 21:59 +0100, Arnd Bergmann wrote: /* # of WCs to poll for with a single call to ib_poll_cq */ -#define IB_POLL_BATCH 16 +#define IB_POLL_BATCH 8 The purpose of batch polling is to minimize contention on the cq spinlock. Reducing the IB_POLL_BATCH constant may affect performance negatively. Has the performance impact of this change been verified for all affected drivers (ib_srp, ib_srpt, ib_iser, ib_isert, NVMeOF, NVMeOF target, SMB Direct, NFS over RDMA, ...)? Only the users of the DIRECT polling method use an on-stack array of ib_wc's. This is only the SRP drivers. The other two modes have use of a dynamically allocated array of ib_wc's that hangs off the ib_cq. These shouldn't need any reduction in the size of this array, and they are the common case. IMO a better solution would be to change ib_process_cq_direct to use a smaller on-stack array, and leave IB_POLL_BATCH alone. The only reason why I added this array on-stack was to allow consumers that did not use ib_alloc_cq api to call it, but that seems like a wrong decision when thinking it over again (as probably these users did not set the wr_cqe correctly). How about we make ib_process_cq_direct use the cq wc array and add a WARN_ON statement (and fail it gracefully) if the caller used this API without calling ib_alloc_cq? but we tried to avoid concurrent access to cq->wc. Why can't we use the solution I wrote above? -- diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c index bc79ca8215d7..cd3e9e124834 100644 --- a/drivers/infiniband/core/cq.c +++ b/drivers/infiniband/core/cq.c @@ -25,10 +25,10 @@ #define IB_POLL_FLAGS \ (IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS) -static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc *poll_wc) +static int __ib_process_cq(struct ib_cq *cq, int budget) { int i, n, completed = 0; - struct ib_wc *wcs = poll_wc ? : cq->wc; + struct ib_wc *wcs = cq->wc; /* * budget might be (-1) if the caller does not @@ -72,9 +72,9 @@ static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc *poll_wc) */ int ib_process_cq_direct(struct ib_cq *cq, int budget) { - struct ib_wc wcs[IB_POLL_BATCH]; - - return __ib_process_cq(cq, budget, wcs); + if (unlikely(WARN_ON_ONCE(!cq->wc))) + return 0; + return __ib_process_cq(cq, budget); } EXPORT_SYMBOL(ib_process_cq_direct); @@ -88,7 +88,7 @@ static int ib_poll_handler(struct irq_poll *iop, int budget) struct ib_cq *cq = container_of(iop, struct ib_cq, iop); int completed; - completed = __ib_process_cq(cq, budget, NULL); + completed = __ib_process_cq(cq, budget); if (completed < budget) { irq_poll_complete(&cq->iop); if (ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0) @@ -108,7 +108,7 @@ static void ib_cq_poll_work(struct work_struct *work) struct ib_cq *cq = container_of(work, struct ib_cq, work); int completed; - completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE, NULL); + completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE); if (completed >= IB_POLL_BUDGET_WORKQUEUE || ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0) queue_work(ib_comp_wq, &cq->work); -- -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] RDMA/core: reduce IB_POLL_BATCH constant
On 2/20/2018 11:47 PM, Chuck Lever wrote: On Feb 20, 2018, at 4:14 PM, Bart Van Assche wrote: On Tue, 2018-02-20 at 21:59 +0100, Arnd Bergmann wrote: /* # of WCs to poll for with a single call to ib_poll_cq */ -#define IB_POLL_BATCH 16 +#define IB_POLL_BATCH 8 The purpose of batch polling is to minimize contention on the cq spinlock. Reducing the IB_POLL_BATCH constant may affect performance negatively. Has the performance impact of this change been verified for all affected drivers (ib_srp, ib_srpt, ib_iser, ib_isert, NVMeOF, NVMeOF target, SMB Direct, NFS over RDMA, ...)? Only the users of the DIRECT polling method use an on-stack array of ib_wc's. This is only the SRP drivers. The other two modes have use of a dynamically allocated array of ib_wc's that hangs off the ib_cq. These shouldn't need any reduction in the size of this array, and they are the common case. IMO a better solution would be to change ib_process_cq_direct to use a smaller on-stack array, and leave IB_POLL_BATCH alone. Yup, good idea. You can define IB_DIRECT_POLL_BATCH to be 8 and use it in ib_process_cq_direct. *but* please make sure to use the right value in ib_poll_cq since the wcs array should be able to hold the requested amount of wcs. -Max. -- Chuck Lever
Re: [PATCH] nvme-pci: Fix incorrect use of CMB size to calculate q_depth
On 2/6/2018 11:48 AM, Sagi Grimberg wrote: Looks good, Reviewed-by: Sagi Grimberg <s...@grimberg.me> I'll pick this one up unless someone thinks I shouldn't.. Looks good to me (I can imagine what scenario failed this :) ), Reviewed-by: Max Gurtovoy <m...@mellanox.com> ___ Linux-nvme mailing list linux-n...@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme
Re: Backport Mellanox mlx5 patches to stable 4.9.y
On 2/1/2018 10:21 AM, Greg KH wrote: On Tue, Jan 30, 2018 at 10:12:51AM +0100, Marta Rybczynska wrote: Hello Mellanox maintainers, I'd like to ask you to OK backporting two patches in the mlx5 driver to the 4.9 stable tree (they're in master for some time already). We have multiple deployments on 4.9 that are running into the bug fixed by those patches. We're deploying patched kernels and the issue disappears. The patches are: 1410a90ae449061b7e1ae19d275148f36948801b net/mlx5: Define interface bits for fencing UMR wqe 6e8484c5cf07c7ee632587e98c1a12d319dacb7c RDMA/mlx5: set UMR wqe fence according to HCA cap Looks good, now queued up, thanks. thanks Greg. greg k-h
Re: [PATCH 0/3] IB-iSER: Adjustments for three function implementations
On 1/27/2018 8:17 PM, SF Markus Elfring wrote: From: Markus Elfring <elfr...@users.sourceforge.net> Date: Sat, 27 Jan 2018 19:02:34 +0100 Three update suggestions were taken into account from static source code analysis. Markus Elfring (3): Delete an error message for a failed memory allocation in iser_send_data_out() Delete an unnecessary variable initialisation in iser_send_data_out() Combine substrings for three messages drivers/infiniband/ulp/iser/iser_initiator.c | 16 ++-- 1 file changed, 6 insertions(+), 10 deletions(-) This series looks good to me, Reviewed-by: Max Gurtovoy <m...@mellanox.com>
Re: [PATCH] nvme: don't free uuid pointer before printing it
On 1/25/2018 10:09 AM, Johannes Thumshirn wrote: Commit df351ef73789 ("nvme-fabrics: fix memory leak when parsing host ID option") fixed the leak of 'p' but in case uuid_parse() fails the memory is freed before the error print that is using it. Free it after printing eventual errors. Signed-off-by: Johannes Thumshirn <jthumsh...@suse.de> Fixes: df351ef73789 ("nvme-fabrics: fix memory leak when parsing host ID option") Reported-by: Dan Carpenter <dan.carpen...@oracle.com> Cc: Roland Dreier <rol...@purestorage.com> --- drivers/nvme/host/fabrics.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index eb46967bb0d5..9cee72a80472 100644 --- a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -739,12 +739,13 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts, goto out; } ret = uuid_parse(p, &hostid); - kfree(p); if (ret) { pr_err("Invalid hostid %s\n", p); ret = -EINVAL; + kfree(p); goto out; } + kfree(p); break; case NVMF_OPT_DUP_CONNECT: opts->duplicate_connect = true; Looks good, Reviewed-by: Max Gurtovoy <m...@mellanox.com>
Re: [PATCH V5 1/2] nvme-pci: introduce RECONNECTING state to mark initializing procedure
On 1/18/2018 12:10 PM, Jianchao Wang wrote: After Sagi's commit (nvme-rdma: fix concurrent reset and reconnect), both nvme-fc/rdma have following pattern: RESETTING - quiesce blk-mq queues, teardown and delete queues/ connections, clear out outstanding IO requests... RECONNECTING - establish new queues/connections and some other initializing things. I guess we can call it NVME_CTRL_CONNECTING in later patch (more suitable name for this state now). Introduce RECONNECTING to nvme-pci transport to do the same mark. Then we get a coherent state definition among nvme pci/rdma/fc transports. Suggested-by: James Smart Signed-off-by: Jianchao Wang --- drivers/nvme/host/core.c | 2 +- drivers/nvme/host/pci.c | 19 +-- 2 files changed, 18 insertions(+), 3 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 230cc09..23b3e53 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -242,7 +242,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl, switch (new_state) { case NVME_CTRL_ADMIN_ONLY: switch (old_state) { - case NVME_CTRL_RESETTING: + case NVME_CTRL_RECONNECTING: changed = true; /* FALLTHRU */ default: diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 45f843d..05344be 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -1138,9 +1138,14 @@ static bool nvme_should_reset(struct nvme_dev *dev, u32 csts) */ bool nssro = dev->subsystem && (csts & NVME_CSTS_NSSRO); - /* If there is a reset ongoing, we shouldn't reset again. */ - if (dev->ctrl.state == NVME_CTRL_RESETTING) + /* If there is a reset/reinit ongoing, we shouldn't reset again. */ + switch (dev->ctrl.state) { + case NVME_CTRL_RESETTING: + case NVME_CTRL_RECONNECTING: return false; + default: + break; + } /* We shouldn't reset unless the controller is on fatal error state * _or_ if we lost the communication with it. 
@@ -2304,6 +2309,16 @@ static void nvme_reset_work(struct work_struct *work) if (dev->ctrl.ctrl_config & NVME_CC_ENABLE) nvme_dev_disable(dev, false); + /* +* Introduce RECONNECTING state from nvme-fc/rdma transports to mark the +* initializing procedure here. +*/ + if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RECONNECTING)) { + dev_warn(dev->ctrl.device, + "failed to mark controller RECONNECTING\n"); + goto out; + } + result = nvme_pci_enable(dev); if (result) goto out;
Re: [PATCH V4 1/2] nvme: add NVME_CTRL_RESET_PREPARE state
Hi Jianchao Wang, On 1/17/2018 6:54 AM, Jianchao Wang wrote: Currently, the ctrl->state will be changed to NVME_CTRL_RESETTING before queue the reset work. This is not so strict. There could be a big gap before the reset_work callback is invoked. In addition, there is some disable work in the reset_work callback, strictly speaking, not part of reset work, and could lead to some confusion. In addition, after set state to RESETTING and disable procedure, nvme-rdma/fc use NVME_CTRL_RECONNECTING to mark the setup and reconnect procedure. The RESETTING state has been narrowed. This patch add NVME_CTRL_RESET_PREPARE state to mark the reset_work or error recovery work, scheduling gap and disable procedure. After that, - For nvme-pci, nvmet-loop, set state to RESETTING, start initialization. - For nvme-rdma, nvme-fc, set state to RECONNECTING, start initialization or reconnect. Suggested-by: Christoph Hellwig Signed-off-by: Jianchao Wang --- drivers/nvme/host/core.c | 18 +++--- drivers/nvme/host/fc.c | 4 ++-- drivers/nvme/host/nvme.h | 8 drivers/nvme/host/pci.c | 25 + drivers/nvme/host/rdma.c | 2 +- drivers/nvme/target/loop.c | 5 + 6 files changed, 52 insertions(+), 10 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 230cc09..87d209f 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -87,7 +87,7 @@ static __le32 nvme_get_log_dw10(u8 lid, size_t size) int nvme_reset_ctrl(struct nvme_ctrl *ctrl) { - if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING)) + if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESET_PREPARE)) return -EBUSY; if (!queue_work(nvme_wq, &ctrl->reset_work)) return -EBUSY; @@ -260,7 +260,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl, break; } break; - case NVME_CTRL_RESETTING: + case NVME_CTRL_RESET_PREPARE: switch (old_state) { case NVME_CTRL_NEW: case NVME_CTRL_LIVE: @@ -271,10 +271,20 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl, break; } break; + + case NVME_CTRL_RESETTING: + switch 
(old_state) { + case NVME_CTRL_RESET_PREPARE: + changed = true; + /* FALLTHRU */ + default: + break; + } + break; case NVME_CTRL_RECONNECTING: switch (old_state) { case NVME_CTRL_LIVE: - case NVME_CTRL_RESETTING: + case NVME_CTRL_RESET_PREPARE: As I suggested in V3, please don't allow this transition. We'll move to NVME_CTRL_RECONNECTING from NVME_CTRL_RESETTING. I look on it like that: NVME_CTRL_RESET_PREPARE - "suspend" state NVME_CTRL_RESETTING - "resume" state you don't reconnect from "suspend" state, you must "resume" before you reconnect. changed = true; /* FALLTHRU */ default: @@ -286,6 +296,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl, case NVME_CTRL_LIVE: case NVME_CTRL_ADMIN_ONLY: case NVME_CTRL_RESETTING: + case NVME_CTRL_RESET_PREPARE: case NVME_CTRL_RECONNECTING: changed = true; /* FALLTHRU */ @@ -2660,6 +2671,7 @@ static ssize_t nvme_sysfs_show_state(struct device *dev, [NVME_CTRL_LIVE]= "live", [NVME_CTRL_ADMIN_ONLY] = "only-admin", [NVME_CTRL_RESETTING] = "resetting", + [NVME_CTRL_RESET_PREPARE] = "reset-prepare", [NVME_CTRL_RECONNECTING]= "reconnecting", [NVME_CTRL_DELETING]= "deleting", [NVME_CTRL_DEAD]= "dead", diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c index 306aee4..1ba0669 100644 --- a/drivers/nvme/host/fc.c +++ b/drivers/nvme/host/fc.c @@ -546,7 +546,7 @@ nvme_fc_resume_controller(struct nvme_fc_ctrl *ctrl) queue_delayed_work(nvme_wq, &ctrl->connect_work, 0); break; - case NVME_CTRL_RESETTING: + case NVME_CTRL_RESET_PREPARE: /* * Controller is already in the process of terminating the * association. No need to do anything further. The reconnect @@ -789,7 +789,7 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl) */ break; - case NVME_CTRL_RESETTING: + case NVME_CTRL_RESET_PREPARE: /* * Controller is already in the process of terminating the * association. No need to do anything further. The reconnect diff
Re: [Suspected-Phishing]Re: [PATCH V3 1/2] nvme: split resetting state into reset_prepare and resetting
On 1/15/2018 3:28 PM, Max Gurtovoy wrote:
> On 1/14/2018 11:48 AM, Sagi Grimberg wrote:
>>> Currently, the ctrl->state will be changed to NVME_CTRL_RESETTING
>>> before queuing the reset work. This is not so strict. There could be
>>> a big gap before the reset_work callback is invoked. In addition,
>>> there is some disable work in the reset_work callback which, strictly
>>> speaking, is not part of the reset work and could lead to some
>>> confusion.
>>>
>>> This patch splits NVME_CTRL_RESETTING into NVME_CTRL_RESET_PREPARE
>>> and NVME_CTRL_RESETTING. Before queuing the reset work, change the
>>> state to NVME_CTRL_RESET_PREPARE; after the disable work completes,
>>> change the state to NVME_CTRL_RESETTING.
>>
>> What guarantees that nvme_timeout will do the right thing? If it is
>> relying on the fact that nvme-pci sets the state to RESETTING only
>> _after_ quiescing the request queues, it needs to be documented, as
>> it's a real delicate dependency.
>>
>>> Suggested-by: Christoph Hellwig <h...@lst.de>
>>> Signed-off-by: Jianchao Wang <jianchao.w.w...@oracle.com>
>>> ---
>>>  drivers/nvme/host/core.c   | 17 +++--
>>>  drivers/nvme/host/fc.c     |  2 ++
>>>  drivers/nvme/host/nvme.h   |  1 +
>>>  drivers/nvme/host/pci.c    | 28 ++--
>>>  drivers/nvme/host/rdma.c   |  8
>>>  drivers/nvme/target/loop.c |  5 +
>>>  6 files changed, 53 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index 1e46e60..106a437 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -87,7 +87,7 @@ static __le32 nvme_get_log_dw10(u8 lid, size_t size)
>>>  int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
>>>  {
>>> -	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
>>> +	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESET_PREPARE))
>>>  		return -EBUSY;
>>>  	if (!queue_work(nvme_wq, &ctrl->reset_work))
>>>  		return -EBUSY;
>>> @@ -243,7 +243,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>>>  			break;
>>>  		}
>>>  		break;
>>> -	case NVME_CTRL_RESETTING:
>>> +	case NVME_CTRL_RESET_PREPARE:
>>>  		switch (old_state) {
>>>  		case NVME_CTRL_NEW:
>>>  		case NVME_CTRL_LIVE:
>>> @@ -253,10 +253,21 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>>>  			break;
>>>  		}
>>>  		break;
>>> +
>>> +	case NVME_CTRL_RESETTING:
>>> +		switch (old_state) {
>>> +		case NVME_CTRL_RESET_PREPARE:
>>> +			changed = true;
>>> +			/* FALLTHRU */
>>> +		default:
>>> +			break;
>>> +		}
>>> +		break;
>>>  	case NVME_CTRL_RECONNECTING:
>>>  		switch (old_state) {
>>>  		case NVME_CTRL_LIVE:
>>>  		case NVME_CTRL_RESETTING:
>>> +		case NVME_CTRL_RESET_PREPARE:

I forgot to add that we shouldn't move from RESET_PREPARE to
RECONNECTING (with my suggestion to rdma.c). We also need to consider
adding another check in nvmf_check_init_req
(drivers/nvme/host/fabrics.h) for the new state.

>>>  			changed = true;
>>>  			/* FALLTHRU */
>>>  		default:
>>> @@ -267,6 +278,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>>>  		switch (old_state) {
>>>  		case NVME_CTRL_LIVE:
>>>  		case NVME_CTRL_RESETTING:
>>> +		case NVME_CTRL_RESET_PREPARE:
>>>  		case NVME_CTRL_RECONNECTING:
>>>  			changed = true;
>>>  			/* FALLTHRU */
>>> @@ -2603,6 +2615,7 @@ static ssize_t nvme_sysfs_show_state(struct device *dev,
>>>  		[NVME_CTRL_NEW]		= "new",
>>>  		[NVME_CTRL_LIVE]	= "live",
>>>  		[NVME_CTRL_RESETTING]	= "resetting",
>>> +		[NVME_CTRL_RESET_PREPARE] = "reset-prepare",
>>>  		[NVME_CTRL_RECONNECTING]= "reconnecting",
>>>  		[NVME_CTRL_DELETING]	= "deleting",
>>>  		[NVME_CTRL_DEAD]	= "dead",
>>> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
>>> index 794e66e..516c1ea 100644
>>> --- a/drivers/nvme/host/fc.c
>>> +++ b/drivers/nvme/host/fc.c
>>> @@ -547,6 +547,7 @@ nvme_fc_resume_controller(struct nvme_fc_ctrl *ctrl)
>>>  		break;
>>>  	case NVME_CTRL_RESETTING:
>>> +	case NVME_CTRL_RESET_PREPARE:
>>>  		/*
>>>  		 * Controller is already in the process of terminating the
>>>  		 * association. No need to do anything further. The reconnect
>>> @@ -790,6 +791,7 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
>>>  		break;
>>>  	case NVME_CTRL_RESETTING:
>>> +	case NVME_CTRL_RESET_PREPARE:
>>>  		/*
>>>  		 * Controller is already in the process of terminating the
>>>  		 * association. No need to do anything further. The reconnect
>>
>> James, can you take a look at this?
>>
>>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>>> index f5800c3..e477c35 100644
>>> --- a/drivers/nvme/host/pci.c
>>> +++ b/drivers/nvme/host/pci.c
>>> @@ -1141,8 +1141,13 @@ static bool nvme_should_reset(struct nvme_dev *dev, u32 csts)
>>>  	bool nssro = dev->subsystem && (csts & NVME_CSTS_NSSRO);
>>>
>>>  	/* If there is a reset ongoing, we shouldn't reset again. */
>>> -	if (dev->ctrl.state == NVME_CTRL_RESETTING)
>>> +	switch (dev->ctrl.state) {
Re: [PATCH V3 1/2] nvme: split resetting state into reset_prepare and resetting
On 1/14/2018 11:48 AM, Sagi Grimberg wrote:
>> Currently, the ctrl->state will be changed to NVME_CTRL_RESETTING
>> before queuing the reset work. This is not so strict. There could be
>> a big gap before the reset_work callback is invoked. In addition,
>> there is some disable work in the reset_work callback which, strictly
>> speaking, is not part of the reset work and could lead to some
>> confusion.
>>
>> This patch splits NVME_CTRL_RESETTING into NVME_CTRL_RESET_PREPARE
>> and NVME_CTRL_RESETTING. Before queuing the reset work, change the
>> state to NVME_CTRL_RESET_PREPARE; after the disable work completes,
>> change the state to NVME_CTRL_RESETTING.
>
> What guarantees that nvme_timeout will do the right thing? If it is
> relying on the fact that nvme-pci sets the state to RESETTING only
> _after_ quiescing the request queues, it needs to be documented, as
> it's a real delicate dependency.
>
>> Suggested-by: Christoph Hellwig
>> Signed-off-by: Jianchao Wang
>> ---
>>  drivers/nvme/host/core.c   | 17 +++--
>>  drivers/nvme/host/fc.c     |  2 ++
>>  drivers/nvme/host/nvme.h   |  1 +
>>  drivers/nvme/host/pci.c    | 28 ++--
>>  drivers/nvme/host/rdma.c   |  8
>>  drivers/nvme/target/loop.c |  5 +
>>  6 files changed, 53 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 1e46e60..106a437 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -87,7 +87,7 @@ static __le32 nvme_get_log_dw10(u8 lid, size_t size)
>>  int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
>>  {
>> -	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
>> +	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESET_PREPARE))
>>  		return -EBUSY;
>>  	if (!queue_work(nvme_wq, &ctrl->reset_work))
>>  		return -EBUSY;
>> @@ -243,7 +243,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>>  			break;
>>  		}
>>  		break;
>> -	case NVME_CTRL_RESETTING:
>> +	case NVME_CTRL_RESET_PREPARE:
>>  		switch (old_state) {
>>  		case NVME_CTRL_NEW:
>>  		case NVME_CTRL_LIVE:
>> @@ -253,10 +253,21 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>>  			break;
>>  		}
>>  		break;
>> +
>> +	case NVME_CTRL_RESETTING:
>> +		switch (old_state) {
>> +		case NVME_CTRL_RESET_PREPARE:
>> +			changed = true;
>> +			/* FALLTHRU */
>> +		default:
>> +			break;
>> +		}
>> +		break;
>>  	case NVME_CTRL_RECONNECTING:
>>  		switch (old_state) {
>>  		case NVME_CTRL_LIVE:
>>  		case NVME_CTRL_RESETTING:
>> +		case NVME_CTRL_RESET_PREPARE:
>>  			changed = true;
>>  			/* FALLTHRU */
>>  		default:
>> @@ -267,6 +278,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>>  		switch (old_state) {
>>  		case NVME_CTRL_LIVE:
>>  		case NVME_CTRL_RESETTING:
>> +		case NVME_CTRL_RESET_PREPARE:
>>  		case NVME_CTRL_RECONNECTING:
>>  			changed = true;
>>  			/* FALLTHRU */
>> @@ -2603,6 +2615,7 @@ static ssize_t nvme_sysfs_show_state(struct device *dev,
>>  		[NVME_CTRL_NEW]		= "new",
>>  		[NVME_CTRL_LIVE]	= "live",
>>  		[NVME_CTRL_RESETTING]	= "resetting",
>> +		[NVME_CTRL_RESET_PREPARE] = "reset-prepare",
>>  		[NVME_CTRL_RECONNECTING]= "reconnecting",
>>  		[NVME_CTRL_DELETING]	= "deleting",
>>  		[NVME_CTRL_DEAD]	= "dead",
>> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
>> index 794e66e..516c1ea 100644
>> --- a/drivers/nvme/host/fc.c
>> +++ b/drivers/nvme/host/fc.c
>> @@ -547,6 +547,7 @@ nvme_fc_resume_controller(struct nvme_fc_ctrl *ctrl)
>>  		break;
>>  	case NVME_CTRL_RESETTING:
>> +	case NVME_CTRL_RESET_PREPARE:
>>  		/*
>>  		 * Controller is already in the process of terminating the
>>  		 * association. No need to do anything further. The reconnect
>> @@ -790,6 +791,7 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
>>  		break;
>>  	case NVME_CTRL_RESETTING:
>> +	case NVME_CTRL_RESET_PREPARE:
>>  		/*
>>  		 * Controller is already in the process of terminating the
>>  		 * association. No need to do anything further. The reconnect
>
> James, can you take a look at this?
>
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index f5800c3..e477c35 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -1141,8 +1141,13 @@ static bool nvme_should_reset(struct nvme_dev *dev, u32 csts)
>>  	bool nssro = dev->subsystem && (csts & NVME_CSTS_NSSRO);
>>
>>  	/* If there is a reset ongoing, we shouldn't reset again. */
>> -	if (dev->ctrl.state == NVME_CTRL_RESETTING)
>> +	switch (dev->ctrl.state) {
>> +	case NVME_CTRL_RESETTING:
>> +	case NVME_CTRL_RESET_PREPARE:
>>  		return false;
>> +	default:
>> +		break;
>> +	}
>
> Isn't the indentation off? Also in other places...
>
>> @@ -2292,7 +2303,7 @@ static void nvme_reset_work(struct work_struct *work)
>>  	bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
>>  	int result = -ENODEV;
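The hunk above widens the "reset already ongoing" check in
nvme_should_reset() from a single state comparison to a switch over
both halves of the split state. A minimal sketch of that guard
(illustrative names only, not the kernel code) shows why both states
must be treated as ongoing:

```c
#include <stdbool.h>
#include <assert.h>

/* Subset of the controller states relevant to the guard. */
enum ctrl_state {
	CTRL_LIVE,
	CTRL_RESET_PREPARE,	/* reset queued or disabling the controller */
	CTRL_RESETTING,		/* re-initializing after disable */
};

/* A reset is "ongoing" in either phase of the split; scheduling another
 * reset from nvme_should_reset()-style logic must be refused in both. */
static bool reset_ongoing(enum ctrl_state state)
{
	switch (state) {
	case CTRL_RESET_PREPARE:
	case CTRL_RESETTING:
		return true;
	default:
		return false;
	}
}
```

If only CTRL_RESETTING were checked, a timeout firing during the
prepare/disable window could queue a second, overlapping reset.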
Re: pci driver loads right after unload
Hi Greg/Bjorn,

On 1/2/2018 9:27 PM, Greg Kroah-Hartman wrote:
> On Tue, Jan 02, 2018 at 01:00:03PM -0600, Bjorn Helgaas wrote:
>> [+cc Greg, linux-kernel]
>>
>> Hi Max,
>>
>> Thanks for the report!
>>
>> On Tue, Jan 02, 2018 at 01:50:23AM +0200, Max Gurtovoy wrote:
>>> hi all,
>>> I encountered a strange phenomenon using 2 different pci drivers
>>> (nvme and mlx5_core) since 4.15-rc1: when I try to unload the
>>> modules using the "modprobe -r" cmd, it calls the .probe function
>>> right after calling the .remove function, and the module is not
>>> really unloaded. I think there is some race condition, because when
>>> I added a msleep(1000) after "pci_unregister_driver(&nvme_driver);"
>>> (in the nvme module testing; it also worked in mlx5_core), the
>>> issue seems to disappear.
>>
>> You say "since 4.15-rc1". Does that mean it's a regression? If so,
>> what's the most recent kernel that does not have this problem?
>> Worst case, you could bisect to find where it broke.
>>
>> I don't see anything obvious in the drivers/pci changes between
>> v4.14 and v4.15-rc1. Module loading and driver binding is mostly
>> driven by the driver core and udev. Maybe you could learn something
>> with "udevadm monitor" or by turning on some of the debug in
>> lib/kobject_uevent.c?
>
> This should be resolved in 4.15-rc6; there was a regression in -rc1
> in this area when dealing with uevents over netlink.
>
> Max, can you test -rc6 to verify if this is really fixed or not?

I've tested -rc6 and the issue doesn't reproduce. I'll continue
monitoring this scenario in -rc7. Can you point to the commit that
fixes the issue? I'll test the kernel with/without this patch.

> thanks,
>
> greg k-h

Cheers,
Max.