Re: [RFC PATCH v5 0/4] add simple copy support

2021-04-13 Thread Max Gurtovoy



On 4/11/2021 10:26 PM, Javier González wrote:

On 11.04.2021 12:10, Max Gurtovoy wrote:


On 4/10/2021 9:32 AM, Javier González wrote:
On 10 Apr 2021, at 02.30, Chaitanya Kulkarni wrote:


On 4/9/21 17:22, Max Gurtovoy wrote:

On 2/19/2021 2:45 PM, SelvaKumar S wrote:
This patchset tries to add support for TP4065a ("Simple Copy Command"),
v2020.05.04 ("Ratified").

The specification can be found at the following link:
https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip

The simple copy command is a copy-offloading operation used to copy
multiple contiguous ranges (source_ranges) of LBAs to a single destination
LBA within the device, reducing traffic between host and device.

This implementation doesn't add native copy offload support for stacked
devices; rather, copy offload is done through emulation. Possible use
cases are F2FS GC and BTRFS relocation/balance.

*blkdev_issue_copy* takes the source bdev, number of sources, array of
source ranges (in sectors), destination bdev, and destination offset (in
sectors). If both source and destination block devices are the same and
copy_offload = 1, then the copy is done through native copy offloading.
Copy emulation is used in other cases.

As SCSI XCOPY can take two different block devices and a number of source
ranges equal to 1, this interface can be extended in the future to support
SCSI XCOPY.

Any idea why this TP wasn't designed for copy offload between 2
different namespaces in the same controller ?

Yes, it was the first attempt, so it was kept simple.

Further work is needed to add an incremental TP so that we can also do a
copy between namespaces of the same controller (if we can't already) and
to namespaces that belong to a different controller.


And a simple copy will be the case where the src_nsid == dst_nsid ?

Also why there are multiple source ranges and only one dst range ? We
could add a bit to indicate if this range is src or dst..
One of the target use cases was ZNS in order to avoid fabric 
transfers during host GC. You can see how this plays well with 
several zone ranges and a single zone destination.


If we start getting support in Linux through the different past copy 
offload efforts, I’m sure we can extend this TP in the future.


But the "copy" command IMO is more general than the ZNS GC case, that 
can be a private case of copy, isn't it ?


It applies to any namespace type, so yes. I just wanted to give you the
background for the current "simple" scope through one of the use cases
that was in mind.

We can get a big benefit from offloading the data copy from one ns to
another in the same controller and even in different controllers in
the same subsystem.


Definitely.



Do you think the extension should be to the "copy" command, or should we
create a new command "x_copy" for copying to a different destination ns?


I believe there is space for extensions to simple copy. But given the
experience with XCOPY, I can imagine that changes will be incremental,
based on very specific use cases.

I think getting support upstream and bringing deployed cases is a very
good start.


Copying data (files) within the controller/subsystem from ns_A to ns_B 
using NVMf will reduce network BW and memory BW in the host server.


This feature is well known and the use case is well known.

The question is whether we implement it in a vendor-specific manner or add
it to the specification.


I prefer adding it to the spec :)




Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

2021-04-11 Thread Max Gurtovoy



On 4/6/2021 2:53 PM, Jason Gunthorpe wrote:

On Tue, Apr 06, 2021 at 08:09:43AM +0300, Leon Romanovsky wrote:

On Tue, Apr 06, 2021 at 10:37:38AM +0800, Honggang LI wrote:

On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:

From: Leon Romanovsky 

From Avihai,

Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
imposed on PCI transactions, and thus, can improve performance.

Until now, relaxed ordering could be set only by user space applications
for user MRs. The following patch series enables relaxed ordering for the
kernel ULPs as well. Relaxed ordering is an optional capability, and as
such, it is ignored by vendors that don't support it.

The following test results show the performance improvement achieved

Did you test this patchset with a CPU that does not support relaxed ordering?

I don't think so. The CPUs that don't support RO are Intel's
fourth/fifth generation, and they are not interesting from a
performance point of view.


We observed significant performance degradation when running perftest with
relaxed ordering enabled on an old CPU.

https://github.com/linux-rdma/perftest/issues/116

The perftest case is slightly different, but you raise a valid point.
We forgot to call pcie_relaxed_ordering_enabled() before setting the RO
bit, and arguably this needed to be done in perftest too.

No, the PCI device should not have the RO bit set in this situation.
It is something mlx5_core needs to do. We can't push this into
applications.


pcie_relaxed_ordering_enabled() is called in
drivers/net/ethernet/mellanox/mlx5/core/en_common.c, so we probably need
to move it to mlx5_core in this series.





There should be no performance difference from asking for
IBV_ACCESS_RELAXED_ORDERING when RO is disabled at the PCI config and
not asking for it at all.

Either the platform has working relaxed ordering that gives a
performance gain and the RO config spec bit should be set, or it
doesn't and the bit should be clear.


Is this the case today?



This is not something to decide in userspace, or in RDMA. At worst it
becomes another platform specific PCI tunable people have to set.

I thought the old Haswell systems were quirked to disable RO globally
anyhow?

Jason
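[Editorial note: the policy being argued for in this thread can be reduced to a tiny decision function. The sketch below is a userspace illustration, not kernel code; the flag name merely mirrors IBV_ACCESS_RELAXED_ORDERING, and the bool parameter stands in for the kernel's pcie_relaxed_ordering_enabled() check.]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag mirroring IBV_ACCESS_RELAXED_ORDERING. */
#define ACCESS_RELAXED_ORDERING (1u << 0)

/*
 * Sketch of the policy Jason describes: the RO bit is set only if the
 * caller requested it AND the PCI function has relaxed ordering enabled
 * in its config space (pcie_relaxed_ordering_enabled() in the kernel).
 * On platforms where RO is quirked off, requesting it is therefore a
 * no-op rather than a performance regression.
 */
static bool effective_relaxed_ordering(unsigned int access_flags,
				       bool pcie_ro_enabled)
{
	return (access_flags & ACCESS_RELAXED_ORDERING) && pcie_ro_enabled;
}
```

With this shape, asking for relaxed ordering on a platform with RO disabled behaves identically to not asking at all, which is exactly the property the thread demands.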


Re: [RFC PATCH v5 0/4] add simple copy support

2021-04-11 Thread Max Gurtovoy



On 4/10/2021 9:32 AM, Javier González wrote:

On 10 Apr 2021, at 02.30, Chaitanya Kulkarni  wrote:

On 4/9/21 17:22, Max Gurtovoy wrote:

On 2/19/2021 2:45 PM, SelvaKumar S wrote:
This patchset tries to add support for TP4065a ("Simple Copy Command"),
v2020.05.04 ("Ratified").

The specification can be found at the following link:
https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip

The simple copy command is a copy-offloading operation used to copy
multiple contiguous ranges (source_ranges) of LBAs to a single destination
LBA within the device, reducing traffic between host and device.

This implementation doesn't add native copy offload support for stacked
devices; rather, copy offload is done through emulation. Possible use
cases are F2FS GC and BTRFS relocation/balance.

*blkdev_issue_copy* takes the source bdev, number of sources, array of
source ranges (in sectors), destination bdev, and destination offset (in
sectors). If both source and destination block devices are the same and
copy_offload = 1, then the copy is done through native copy offloading.
Copy emulation is used in other cases.

As SCSI XCOPY can take two different block devices and a number of source
ranges equal to 1, this interface can be extended in the future to support
SCSI XCOPY.

Any idea why this TP wasn't designed for copy offload between 2
different namespaces in the same controller ?

Yes, it was the first attempt, so it was kept simple.

Further work is needed to add an incremental TP so that we can also do a
copy between namespaces of the same controller (if we can't already) and
to namespaces that belong to a different controller.


And a simple copy will be the case where the src_nsid == dst_nsid ?

Also why there are multiple source ranges and only one dst range ? We
could add a bit to indicate if this range is src or dst..

One of the target use cases was ZNS in order to avoid fabric transfers during 
host GC. You can see how this plays well with several zone ranges and a single 
zone destination.

If we start getting support in Linux through the different past copy offload 
efforts, I’m sure we can extend this TP in the future.


But the "copy" command IMO is more general than the ZNS GC case, that 
can be a private case of copy, isn't it ?


We can get a big benefit from offloading the data copy from one ns to
another in the same controller and even in different controllers in the
same subsystem.


Do you think the extension should be to the "copy" command, or should we
create a new command "x_copy" for copying to a different destination ns?



  


Re: [RFC PATCH v5 0/4] add simple copy support

2021-04-09 Thread Max Gurtovoy



On 2/19/2021 2:45 PM, SelvaKumar S wrote:

This patchset tries to add support for TP4065a ("Simple Copy Command"),
v2020.05.04 ("Ratified").

The specification can be found at the following link:
https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip

The simple copy command is a copy-offloading operation used to copy
multiple contiguous ranges (source_ranges) of LBAs to a single destination
LBA within the device, reducing traffic between host and device.

This implementation doesn't add native copy offload support for stacked
devices; rather, copy offload is done through emulation. Possible use
cases are F2FS GC and BTRFS relocation/balance.

*blkdev_issue_copy* takes the source bdev, number of sources, array of
source ranges (in sectors), destination bdev, and destination offset (in
sectors). If both source and destination block devices are the same and
copy_offload = 1, then the copy is done through native copy offloading.
Copy emulation is used in other cases.

As SCSI XCOPY can take two different block devices and a number of source
ranges equal to 1, this interface can be extended in the future to support
SCSI XCOPY.


Any idea why this TP wasn't designed for copy offload between 2 
different namespaces in the same controller ?


And a simple copy will be the case where the src_nsid == dst_nsid ?

Also why there are multiple source ranges and only one dst range ? We 
could add a bit to indicate if this range is src or dst..
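[Editorial note: a hypothetical sketch of the descriptor Max is suggesting here. All names are invented for illustration and are not part of the TP: each entry carries an is_dst bit, so one table can express N sources plus a destination, which would also cover the XCOPY-style case.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical range descriptor: instead of N source ranges plus a
 * single implicit destination, each entry says whether it is a source
 * or the destination, as proposed in the discussion above.
 */
struct copy_entry {
	uint64_t lba;
	uint32_t nr_sectors;
	bool     is_dst;	/* the suggested src/dst bit */
};

/*
 * A well-formed request in this scheme has exactly one destination
 * entry and at least one source entry.
 */
static bool copy_desc_valid(const struct copy_entry *e, int n)
{
	int dst = 0, src = 0;

	for (int i = 0; i < n; i++)
		e[i].is_dst ? dst++ : src++;
	return dst == 1 && src >= 1;
}
```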





For devices supporting native simple copy, attach the control information
as payload to the bio and submit it to the device. For devices without native
copy support, copy emulation is done by reading each source range into memory
and writing it to the destination. The caller can choose not to try
emulation when copy offload is not supported by setting the
BLKDEV_COPY_NOEMULATION flag.
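[Editorial note: a minimal userspace sketch of the emulation path described above, with plain pread/pwrite on one file descriptor standing in for the bio-based kernel implementation; the struct and function names are invented for illustration.]

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SECTOR_SIZE 512

/* Hypothetical mirror of the cover letter's (src_sector, nr_sectors)
 * source-range descriptor. */
struct copy_range {
	uint64_t src_sector;
	uint64_t nr_sectors;
};

/*
 * Emulated copy: read each source range into a buffer, then write the
 * buffers back-to-back starting at dst_sector, as the fallback path
 * does when the device lacks native simple-copy support.
 * Returns 0 on success, -1 on error.
 */
static int emulate_copy(int fd, const struct copy_range *ranges, int nr,
			uint64_t dst_sector)
{
	uint64_t dst_off = dst_sector * SECTOR_SIZE;

	for (int i = 0; i < nr; i++) {
		size_t len = ranges[i].nr_sectors * SECTOR_SIZE;
		char *buf = malloc(len);

		if (!buf)
			return -1;
		if (pread(fd, buf, len, ranges[i].src_sector * SECTOR_SIZE) != (ssize_t)len ||
		    pwrite(fd, buf, len, dst_off) != (ssize_t)len) {
			free(buf);
			return -1;
		}
		dst_off += len;
		free(buf);
	}
	return 0;
}
```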

The following limits are added to the queue limits and are exposed in sysfs
to userspace:
- *copy_offload* controls copy offload. Set 0 to disable copy
offload, 1 to enable native copy offloading support.
- *max_copy_sectors* limits the sum of all source_range lengths.
- *max_copy_nr_ranges* limits the number of source ranges.
- *max_copy_range_sectors* limits the maximum number of sectors
that can constitute a single source range.

max_copy_sectors = 0 indicates the device doesn't support copy
offloading.

The *copy_offload* sysfs entry is configurable and can be used to toggle
between emulation and native support depending upon the use case.
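[Editorial note: the limit semantics above can be sketched as a validation helper. This is a hypothetical userspace mirror of the queue-limit checks; the struct and function names are invented.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the queue limits named in the cover letter. */
struct copy_limits {
	uint64_t max_copy_sectors;	/* 0 => no copy offload */
	uint32_t max_copy_nr_ranges;
	uint64_t max_copy_range_sectors;
};

/*
 * Checks implied by the sysfs limits: every range must be non-zero
 * (a zero-length range is an error) and within max_copy_range_sectors,
 * the range count within max_copy_nr_ranges, and the summed length
 * within max_copy_sectors.
 */
static bool copy_request_ok(const struct copy_limits *l,
			    const uint64_t *range_sectors, uint32_t nr)
{
	uint64_t total = 0;

	if (!l->max_copy_sectors || !nr || nr > l->max_copy_nr_ranges)
		return false;
	for (uint32_t i = 0; i < nr; i++) {
		if (!range_sectors[i] ||
		    range_sectors[i] > l->max_copy_range_sectors)
			return false;
		total += range_sectors[i];
	}
	return total <= l->max_copy_sectors;
}
```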

Changes from v4

1. Extend dm-kcopyd to leverage copy offload while copying within the
same device. The other approach was to have copy emulation by moving
dm-kcopyd to the block layer, but that also required moving the core
dm-io infra, causing massive churn across multiple dm-targets.

2. Remove export in bio_map_kern()
3. Change copy_offload sysfs to accept 0 or else
4. Rename copy support flag to QUEUE_FLAG_SIMPLE_COPY
5. Rename payload entries, add source bdev field to be used while
partition remapping, remove copy_size
6. Change the blkdev_issue_copy() interface to accept destination and
source values in sector rather in bytes
7. Add payload to bio using bio_map_kern() for copy_offload case
8. Add a check to return an error if one of the source range lengths is 0
9. Add BLKDEV_COPY_NOEMULATION flag to allow the user not to try copy
emulation in case copy offload is not supported. The caller can then use
its existing copying logic to complete the I/O.
10. Fix bugs in copy checks and reduce the size of rcu_lock()

Planned for next:
- adding blktests
- handling larger (than device limits) copy
- decide on ioctl interface (man-page etc.)

Changes from v3

1. gfp_flag fixes.
2. Export bio_map_kern() and use it to allocate and add pages to bio.
3. Move copy offload, reading to buf, writing from buf to separate functions.
4. Send read bio of copy offload by chaining them and submit asynchronously.
5. Add gendisk->part0 and part->bd_start_sect changes to blk_check_copy().
6. Move single source range limit check to blk_check_copy()
7. Rename __blkdev_issue_copy() to blkdev_issue_copy and remove old helper.
8. Change blkdev_issue_copy() interface generic to accepts destination bdev
to support XCOPY as well.
9. Add invalidate_kernel_vmap_range() after reading data for vmalloc'ed memory.
10. Fix buf allocation logic to allocate a buffer for the total size of the copy.
11. Reword patch commit description.

Changes from v2

1. Add emulation support for devices not supporting copy.
2. Add *copy_offload* sysfs entry to enable and disable copy_offload
in devices supporting simple copy.
3. Remove simple copy support for stacked devices.

Changes from v1:

1. Fix memory leak in __blkdev_issue_copy
2. Unmark blk_check_copy inline
3. Fix line break in blk_check_copy_eod
4. Remove p checks and made code more readable
5. Don't use bio_set_op_attrs and remove op and set
bi_opf directly
6. Use struct_size to calculate total_size
7. Fix 

Re: [PATCH v2] infiniband: Fix a use after free in isert_connect_request

2021-03-25 Thread Max Gurtovoy



On 3/22/2021 6:13 PM, Lv Yunlong wrote:

The device is obtained by isert_device_get() with a refcount of 1,
and is assigned to isert_conn by isert_conn->device = device.
When isert_create_qp() fails, the device will be freed with
isert_device_put().

Later, the device is used in isert_free_login_buf(isert_conn)
via the isert_conn->device->ib_device statement. This patch
frees the device in the correct order.

Signed-off-by: Lv Yunlong 
---
  drivers/infiniband/ulp/isert/ib_isert.c | 16 
  1 file changed, 8 insertions(+), 8 deletions(-)


looks good,

Reviewed-by: Max Gurtovoy 




Re: [PATCH 9/9] vfio/pci: export igd support into vendor vfio_pci driver

2021-03-11 Thread Max Gurtovoy



On 3/11/2021 1:37 PM, Christoph Hellwig wrote:

On Wed, Mar 10, 2021 at 08:31:27AM -0400, Jason Gunthorpe wrote:

Yes, that needs more refactoring. I'm viewing this series as a
"statement of intent" and once we commit to doing this we can go
through the bigger effort to split up vfio_pci_core and tidy its API.

Obviously this is a big project, given the past comments I don't want
to send more effort here until we see a community consensus emerge
that this is what we want to do. If we build a sub-driver instead the
work is all in the trash bin.

So my viewpoint here is that this work doesn't seem very useful for
the existing subdrivers given how much compat pain there is. It
definitely is the right way to go for a new driver.


This brings us back to the first series that introduced the mlx5_vfio_pci
driver without the igd and nvlink2 drivers.


If we leave the subdrivers/extensions in vfio_pci_core, it won't be
logically right.


If we put it in vfio_pci, we'll need to maintain it and extend it if new
functionality is added or bugs are reported.


If we create new drivers for these devices, we'll use the compat layer,
and hopefully after a few years these users will be using only
my_driver_vfio_pci and we'll be able to remove the compat layer (which is
not so big).


We tried almost all the options and now we need to progress and agree on 
the design.


The effort is big, and I hope we won't continue with experiments without a
clear view of what exactly should be done.


So we need a plan for how Jason's series and my series can live together,
and how we can start merging them gradually.





Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers

2021-03-11 Thread Max Gurtovoy



On 3/11/2021 9:54 AM, Alexey Kardashevskiy wrote:



On 11/03/2021 13:00, Jason Gunthorpe wrote:

On Thu, Mar 11, 2021 at 12:42:56PM +1100, Alexey Kardashevskiy wrote:

btw can the id list have only vendor ids and not have device ids?


The PCI matcher is quite flexible, see the other patch from Max for
the igd

ah cool, do this for NVIDIA GPUs then please, I just discovered another P9
system sold with NVIDIA T4s which is not in your list.


I think it will make things easier down the road if you maintain an
exact list 



Then why don't you do the exact list for Intel IGD? The commit log
does not explain this detail.


I expect the Intel team to review this series and give a more precise list.

I did the best I could in finding a proper configuration for igd.






But best practice is to be as narrow as possible as I hope this will
eventually impact module autoloading and other details.


The amount of device-specific knowledge is too little to tie it to
device IDs; it is a generic PCI driver with quirks. We do not have
separate drivers for hardware which requires quirks.


It provides its own capability structure exposed to userspace; that is
absolutely not a "quirk".


And how do you hope this should impact autoloading?


I would like to autoload the most specific vfio driver for the target
hardware.



Is there an idea of how it is going to work? For example, the Intel IGD
driver and vfio-pci-igd - how should the system pick one? If there is
no MODULE_DEVICE_TABLE in vfio-pci-xxx, is the user supposed to try
binding all vfio-pci-xxx drivers until one binds?


For example, in my local setup I did a POC patch that converts some
drivers to be "manual binding only" drivers.


So the IGD driver will have priority; the user will unbind the device
from it, load igd-vfio-pci, and bind the device to it, ending with probing.


For now we separated the driver core stuff until we all agree that this 
series is the right way to go + we also make sure it's backward compatible.






If you someday need to support new GPU HW that needs a different VFIO
driver then you are really stuck because things become indeterminate
if there are two devices claiming the ID. We don't have the concept of
"best match", driver core works on exact match.






Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers

2021-03-10 Thread Max Gurtovoy



On 3/10/2021 4:19 PM, Alexey Kardashevskiy wrote:



On 10/03/2021 23:57, Max Gurtovoy wrote:


On 3/10/2021 8:39 AM, Alexey Kardashevskiy wrote:



On 09/03/2021 19:33, Max Gurtovoy wrote:

The new drivers introduced are nvlink2gpu_vfio_pci.ko and
npu2_vfio_pci.ko.
The first will be responsible for providing special extensions for
NVIDIA GPUs with NVLINK2 support for P9 platform (and others in the
future). The last will be responsible for POWER9 NPU2 unit (NVLink2 host
bus adapter).

Also, preserve backward compatibility for users that were binding
NVLINK2 devices to vfio_pci.ko. Hopefully this compatibility layer will
be dropped in the future

Signed-off-by: Max Gurtovoy 
---
  drivers/vfio/pci/Kconfig  |  28 +++-
  drivers/vfio/pci/Makefile |   7 +-
  .../pci/{vfio_pci_npu2.c => npu2_vfio_pci.c}  | 144 -
  drivers/vfio/pci/npu2_vfio_pci.h  |  24 +++
  ...pci_nvlink2gpu.c => nvlink2gpu_vfio_pci.c} | 149 +-
  drivers/vfio/pci/nvlink2gpu_vfio_pci.h    |  24 +++
  drivers/vfio/pci/vfio_pci.c   |  61 ++-
  drivers/vfio/pci/vfio_pci_core.c  |  18 ---
  drivers/vfio/pci/vfio_pci_core.h  |  14 --
  9 files changed, 422 insertions(+), 47 deletions(-)
  rename drivers/vfio/pci/{vfio_pci_npu2.c => npu2_vfio_pci.c} (64%)
  create mode 100644 drivers/vfio/pci/npu2_vfio_pci.h
  rename drivers/vfio/pci/{vfio_pci_nvlink2gpu.c => nvlink2gpu_vfio_pci.c} (67%)

  create mode 100644 drivers/vfio/pci/nvlink2gpu_vfio_pci.h

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 829e90a2e5a3..88c89863a205 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -48,8 +48,30 @@ config VFIO_PCI_IGD
      To enable Intel IGD assignment through vfio-pci, say Y.
  -config VFIO_PCI_NVLINK2
-    def_bool y
+config VFIO_PCI_NVLINK2GPU
+    tristate "VFIO support for NVIDIA NVLINK2 GPUs"
  depends on VFIO_PCI_CORE && PPC_POWERNV
  help
-  VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
+  VFIO PCI driver for NVIDIA NVLINK2 GPUs with specific extensions
+  for P9 Witherspoon machine.
+
+config VFIO_PCI_NPU2
+    tristate "VFIO support for IBM NPU host bus adapter on P9"
+    depends on VFIO_PCI_NVLINK2GPU && PPC_POWERNV
+    help
+  VFIO PCI specific extensions for IBM NVLink2 host bus adapter on P9
+  Witherspoon machine.
+
+config VFIO_PCI_DRIVER_COMPAT
+    bool "VFIO PCI backward compatibility for vendor specific 
extensions"

+    default y
+    depends on VFIO_PCI
+    help
+  Say Y here if you want to preserve VFIO PCI backward
+  compatibility. vfio_pci.ko will continue to automatically use
+  the NVLINK2, NPU2 and IGD VFIO drivers when it is attached to
+  a compatible device.
+
+  When N is selected the user must bind explicitly to the module
+  they want to handle the device and vfio_pci.ko will have no
+  device specific special behaviors.
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index f539f32c9296..86fb62e271fc 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -2,10 +2,15 @@
    obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+obj-$(CONFIG_VFIO_PCI_NPU2) += npu2-vfio-pci.o
+obj-$(CONFIG_VFIO_PCI_NVLINK2GPU) += nvlink2gpu-vfio-pci.o
  vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o

  vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
-vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2gpu.o vfio_pci_npu2.o

  vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o
    vfio-pci-y := vfio_pci.o
+
+npu2-vfio-pci-y := npu2_vfio_pci.o
+
+nvlink2gpu-vfio-pci-y := nvlink2gpu_vfio_pci.o
diff --git a/drivers/vfio/pci/vfio_pci_npu2.c b/drivers/vfio/pci/npu2_vfio_pci.c

similarity index 64%
rename from drivers/vfio/pci/vfio_pci_npu2.c
rename to drivers/vfio/pci/npu2_vfio_pci.c
index 717745256ab3..7071bda0f2b6 100644
--- a/drivers/vfio/pci/vfio_pci_npu2.c
+++ b/drivers/vfio/pci/npu2_vfio_pci.c
@@ -14,19 +14,28 @@
   *    Author: Alex Williamson 
   */
  +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
  #include 
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
    #include "vfio_pci_core.h"
+#include "npu2_vfio_pci.h"
    #define CREATE_TRACE_POINTS
  #include "npu2_trace.h"
  +#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "Alexey Kardashevskiy "
+#define DRIVER_DESC "NPU2 VFIO PCI - User Level meta-driver 
for POWER9 NPU NVLink2 HBA"

+
  EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_npu2_mmap);
    struct vfio_pci_npu2_data {
@@ -36,6 +45,10 @@ struct vfio_pci_npu2_data {
  unsigned int link_speed; /* The link speed from DT's ibm,nvlink-speed */

  };
  +struct npu

Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers

2021-03-10 Thread Max Gurtovoy



On 3/10/2021 8:39 AM, Alexey Kardashevskiy wrote:



On 09/03/2021 19:33, Max Gurtovoy wrote:

The new drivers introduced are nvlink2gpu_vfio_pci.ko and
npu2_vfio_pci.ko.
The first will be responsible for providing special extensions for
NVIDIA GPUs with NVLINK2 support for P9 platform (and others in the
future). The last will be responsible for POWER9 NPU2 unit (NVLink2 host
bus adapter).

Also, preserve backward compatibility for users that were binding
NVLINK2 devices to vfio_pci.ko. Hopefully this compatibility layer will
be dropped in the future

Signed-off-by: Max Gurtovoy 
---
  drivers/vfio/pci/Kconfig  |  28 +++-
  drivers/vfio/pci/Makefile |   7 +-
  .../pci/{vfio_pci_npu2.c => npu2_vfio_pci.c}  | 144 -
  drivers/vfio/pci/npu2_vfio_pci.h  |  24 +++
  ...pci_nvlink2gpu.c => nvlink2gpu_vfio_pci.c} | 149 +-
  drivers/vfio/pci/nvlink2gpu_vfio_pci.h    |  24 +++
  drivers/vfio/pci/vfio_pci.c   |  61 ++-
  drivers/vfio/pci/vfio_pci_core.c  |  18 ---
  drivers/vfio/pci/vfio_pci_core.h  |  14 --
  9 files changed, 422 insertions(+), 47 deletions(-)
  rename drivers/vfio/pci/{vfio_pci_npu2.c => npu2_vfio_pci.c} (64%)
  create mode 100644 drivers/vfio/pci/npu2_vfio_pci.h
  rename drivers/vfio/pci/{vfio_pci_nvlink2gpu.c => nvlink2gpu_vfio_pci.c} (67%)

  create mode 100644 drivers/vfio/pci/nvlink2gpu_vfio_pci.h

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 829e90a2e5a3..88c89863a205 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -48,8 +48,30 @@ config VFIO_PCI_IGD
      To enable Intel IGD assignment through vfio-pci, say Y.
  -config VFIO_PCI_NVLINK2
-    def_bool y
+config VFIO_PCI_NVLINK2GPU
+    tristate "VFIO support for NVIDIA NVLINK2 GPUs"
  depends on VFIO_PCI_CORE && PPC_POWERNV
  help
-  VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
+  VFIO PCI driver for NVIDIA NVLINK2 GPUs with specific extensions
+  for P9 Witherspoon machine.
+
+config VFIO_PCI_NPU2
+    tristate "VFIO support for IBM NPU host bus adapter on P9"
+    depends on VFIO_PCI_NVLINK2GPU && PPC_POWERNV
+    help
+  VFIO PCI specific extensions for IBM NVLink2 host bus adapter on P9
+  Witherspoon machine.
+
+config VFIO_PCI_DRIVER_COMPAT
+    bool "VFIO PCI backward compatibility for vendor specific 
extensions"

+    default y
+    depends on VFIO_PCI
+    help
+  Say Y here if you want to preserve VFIO PCI backward
+  compatibility. vfio_pci.ko will continue to automatically use
+  the NVLINK2, NPU2 and IGD VFIO drivers when it is attached to
+  a compatible device.
+
+  When N is selected the user must bind explicitly to the module
+  they want to handle the device and vfio_pci.ko will have no
+  device specific special behaviors.
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index f539f32c9296..86fb62e271fc 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -2,10 +2,15 @@
    obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+obj-$(CONFIG_VFIO_PCI_NPU2) += npu2-vfio-pci.o
+obj-$(CONFIG_VFIO_PCI_NVLINK2GPU) += nvlink2gpu-vfio-pci.o
  vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o

  vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
-vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2gpu.o vfio_pci_npu2.o

  vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o
    vfio-pci-y := vfio_pci.o
+
+npu2-vfio-pci-y := npu2_vfio_pci.o
+
+nvlink2gpu-vfio-pci-y := nvlink2gpu_vfio_pci.o
diff --git a/drivers/vfio/pci/vfio_pci_npu2.c b/drivers/vfio/pci/npu2_vfio_pci.c

similarity index 64%
rename from drivers/vfio/pci/vfio_pci_npu2.c
rename to drivers/vfio/pci/npu2_vfio_pci.c
index 717745256ab3..7071bda0f2b6 100644
--- a/drivers/vfio/pci/vfio_pci_npu2.c
+++ b/drivers/vfio/pci/npu2_vfio_pci.c
@@ -14,19 +14,28 @@
   *    Author: Alex Williamson 
   */
  +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
  #include 
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
    #include "vfio_pci_core.h"
+#include "npu2_vfio_pci.h"
    #define CREATE_TRACE_POINTS
  #include "npu2_trace.h"
  +#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "Alexey Kardashevskiy "
+#define DRIVER_DESC "NPU2 VFIO PCI - User Level meta-driver for 
POWER9 NPU NVLink2 HBA"

+
  EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_npu2_mmap);
    struct vfio_pci_npu2_data {
@@ -36,6 +45,10 @@ struct vfio_pci_npu2_data {
  unsigned int link_speed; /* The link speed from DT's ibm,nvlink-speed */

  };
  +struct npu2_vfio_pci_device {
+    struct vfio_pci_core_device    vdev;
+};
+
  static size_t vfio_pci_npu2_rw(struct 

[PATCH 9/9] vfio/pci: export igd support into vendor vfio_pci driver

2021-03-09 Thread Max Gurtovoy
Create a new driver igd_vfio_pci.ko that will be responsible for
providing special extensions for INTEL Graphics card (GVT-d).

Also preserve backward compatibility with vfio_pci.ko vendor specific
extensions.

Signed-off-by: Max Gurtovoy 
---
 drivers/vfio/pci/Kconfig  |   5 +-
 drivers/vfio/pci/Makefile |   4 +-
 .../pci/{vfio_pci_igd.c => igd_vfio_pci.c}| 147 +-
 drivers/vfio/pci/igd_vfio_pci.h   |  24 +++
 drivers/vfio/pci/vfio_pci.c   |   4 +
 drivers/vfio/pci/vfio_pci_core.c  |  15 --
 drivers/vfio/pci/vfio_pci_core.h  |   9 --
 7 files changed, 176 insertions(+), 32 deletions(-)
 rename drivers/vfio/pci/{vfio_pci_igd.c => igd_vfio_pci.c} (62%)
 create mode 100644 drivers/vfio/pci/igd_vfio_pci.h

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 88c89863a205..09d85ba3e5b2 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -37,17 +37,14 @@ config VFIO_PCI_INTX
def_bool y if !S390
 
 config VFIO_PCI_IGD
-   bool "VFIO PCI extensions for Intel graphics (GVT-d)"
+   tristate "VFIO PCI extensions for Intel graphics (GVT-d)"
depends on VFIO_PCI_CORE && X86
-   default y
help
  Support for Intel IGD specific extensions to enable direct
  assignment to virtual machines.  This includes exposing an IGD
  specific firmware table and read-only copies of the host bridge
  and LPC bridge config space.
 
- To enable Intel IGD assignment through vfio-pci, say Y.
-
 config VFIO_PCI_NVLINK2GPU
tristate "VFIO support for NVIDIA NVLINK2 GPUs"
depends on VFIO_PCI_CORE && PPC_POWERNV
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 86fb62e271fc..298b2fb3f075 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -4,9 +4,9 @@ obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
 obj-$(CONFIG_VFIO_PCI_NPU2) += npu2-vfio-pci.o
 obj-$(CONFIG_VFIO_PCI_NVLINK2GPU) += nvlink2gpu-vfio-pci.o
+obj-$(CONFIG_VFIO_PCI_IGD) += igd-vfio-pci.o
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
-vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o
 
 vfio-pci-y := vfio_pci.o
@@ -14,3 +14,5 @@ vfio-pci-y := vfio_pci.o
 npu2-vfio-pci-y := npu2_vfio_pci.o
 
 nvlink2gpu-vfio-pci-y := nvlink2gpu_vfio_pci.o
+
+igd-vfio-pci-y := igd_vfio_pci.o
diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/igd_vfio_pci.c
similarity index 62%
rename from drivers/vfio/pci/vfio_pci_igd.c
rename to drivers/vfio/pci/igd_vfio_pci.c
index 2388c9722ed8..bbbc432bca82 100644
--- a/drivers/vfio/pci/vfio_pci_igd.c
+++ b/drivers/vfio/pci/igd_vfio_pci.c
@@ -10,19 +10,32 @@
  * address is also virtualized to prevent user modification.
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
 #include "vfio_pci_core.h"
+#include "igd_vfio_pci.h"
 
 #define OPREGION_SIGNATURE "IntelGraphicsMem"
 #define OPREGION_SIZE  (8 * 1024)
 #define OPREGION_PCI_ADDR  0xfc
 
-static size_t vfio_pci_igd_rw(struct vfio_pci_core_device *vdev, char __user *buf,
- size_t count, loff_t *ppos, bool iswrite)
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "Alex Williamson "
+#define DRIVER_DESC "IGD VFIO PCI - User Level meta-driver for Intel 
Graphics Processing Unit"
+
+struct igd_vfio_pci_device {
+   struct vfio_pci_core_device vdev;
+};
+
+static size_t vfio_pci_igd_rw(struct vfio_pci_core_device *vdev,
+   char __user *buf, size_t count, loff_t *ppos, bool iswrite)
 {
unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
void *base = vdev->region[i].data;
@@ -261,7 +274,7 @@ static int vfio_pci_igd_cfg_init(struct vfio_pci_core_device *vdev)
return 0;
 }
 
-int vfio_pci_igd_init(struct vfio_pci_core_device *vdev)
+static int vfio_pci_igd_init(struct vfio_pci_core_device *vdev)
 {
int ret;
 
@@ -275,3 +288,131 @@ int vfio_pci_igd_init(struct vfio_pci_core_device *vdev)
 
return 0;
 }
+
+static void igd_vfio_pci_release(void *device_data)
+{
+   struct vfio_pci_core_device *vdev = device_data;
+
+   mutex_lock(&vdev->reflck->lock);
+   if (!(--vdev->refcnt)) {
+   vfio_pci_vf_token_user_add(vdev, -1);
+   vfio_pci_core_spapr_eeh_release(vdev);
+   vfio_pci_core_disable(vdev);
+   }
+   mutex_unlock(&vdev->reflck->lock);
+
+   module_put(THIS_MODULE);
+}
+
+static int igd_vfio_pci_open(void *device_data)
+{
+   struct vfio_pci_core_device *vdev = device_data;
+   int ret = 0;

[PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers

2021-03-09 Thread Max Gurtovoy
The new drivers introduced are nvlink2gpu_vfio_pci.ko and
npu2_vfio_pci.ko.
The former will be responsible for providing special extensions for
NVIDIA GPUs with NVLINK2 support on the P9 platform (and others in the
future). The latter will be responsible for the POWER9 NPU2 unit (NVLink2 host
bus adapter).

Also, preserve backward compatibility for users that were binding
NVLINK2 devices to vfio_pci.ko. Hopefully this compatibility layer will
be dropped in the future.

Signed-off-by: Max Gurtovoy 
---
 drivers/vfio/pci/Kconfig  |  28 +++-
 drivers/vfio/pci/Makefile |   7 +-
 .../pci/{vfio_pci_npu2.c => npu2_vfio_pci.c}  | 144 -
 drivers/vfio/pci/npu2_vfio_pci.h  |  24 +++
 ...pci_nvlink2gpu.c => nvlink2gpu_vfio_pci.c} | 149 +-
 drivers/vfio/pci/nvlink2gpu_vfio_pci.h|  24 +++
 drivers/vfio/pci/vfio_pci.c   |  61 ++-
 drivers/vfio/pci/vfio_pci_core.c  |  18 ---
 drivers/vfio/pci/vfio_pci_core.h  |  14 --
 9 files changed, 422 insertions(+), 47 deletions(-)
 rename drivers/vfio/pci/{vfio_pci_npu2.c => npu2_vfio_pci.c} (64%)
 create mode 100644 drivers/vfio/pci/npu2_vfio_pci.h
 rename drivers/vfio/pci/{vfio_pci_nvlink2gpu.c => nvlink2gpu_vfio_pci.c} (67%)
 create mode 100644 drivers/vfio/pci/nvlink2gpu_vfio_pci.h

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 829e90a2e5a3..88c89863a205 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -48,8 +48,30 @@ config VFIO_PCI_IGD
 
  To enable Intel IGD assignment through vfio-pci, say Y.
 
-config VFIO_PCI_NVLINK2
-   def_bool y
+config VFIO_PCI_NVLINK2GPU
+   tristate "VFIO support for NVIDIA NVLINK2 GPUs"
depends on VFIO_PCI_CORE && PPC_POWERNV
help
- VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
+ VFIO PCI driver for NVIDIA NVLINK2 GPUs with specific extensions
+ for P9 Witherspoon machine.
+
+config VFIO_PCI_NPU2
+   tristate "VFIO support for IBM NPU host bus adapter on P9"
+   depends on VFIO_PCI_NVLINK2GPU && PPC_POWERNV
+   help
+ VFIO PCI specific extensions for IBM NVLink2 host bus adapter on P9
+ Witherspoon machine.
+
+config VFIO_PCI_DRIVER_COMPAT
+   bool "VFIO PCI backward compatibility for vendor specific extensions"
+   default y
+   depends on VFIO_PCI
+   help
+ Say Y here if you want to preserve VFIO PCI backward
+ compatibility. vfio_pci.ko will continue to automatically use
+ the NVLINK2, NPU2 and IGD VFIO drivers when it is attached to
+ a compatible device.
+
+ When N is selected the user must bind explicitly to the module
+ they want to handle the device and vfio_pci.ko will have no
+ device specific special behaviors.
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index f539f32c9296..86fb62e271fc 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -2,10 +2,15 @@
 
 obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+obj-$(CONFIG_VFIO_PCI_NPU2) += npu2-vfio-pci.o
+obj-$(CONFIG_VFIO_PCI_NVLINK2GPU) += nvlink2gpu-vfio-pci.o
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
-vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2gpu.o vfio_pci_npu2.o
 vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o
 
 vfio-pci-y := vfio_pci.o
+
+npu2-vfio-pci-y := npu2_vfio_pci.o
+
+nvlink2gpu-vfio-pci-y := nvlink2gpu_vfio_pci.o
diff --git a/drivers/vfio/pci/vfio_pci_npu2.c b/drivers/vfio/pci/npu2_vfio_pci.c
similarity index 64%
rename from drivers/vfio/pci/vfio_pci_npu2.c
rename to drivers/vfio/pci/npu2_vfio_pci.c
index 717745256ab3..7071bda0f2b6 100644
--- a/drivers/vfio/pci/vfio_pci_npu2.c
+++ b/drivers/vfio/pci/npu2_vfio_pci.c
@@ -14,19 +14,28 @@
  * Author: Alex Williamson 
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 
 #include "vfio_pci_core.h"
+#include "npu2_vfio_pci.h"
 
 #define CREATE_TRACE_POINTS
 #include "npu2_trace.h"
 
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "Alexey Kardashevskiy "
+#define DRIVER_DESC "NPU2 VFIO PCI - User Level meta-driver for POWER9 NPU NVLink2 HBA"
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_npu2_mmap);
 
 struct vfio_pci_npu2_data {
@@ -36,6 +45,10 @@ struct vfio_pci_npu2_data {
unsigned int link_speed; /* The link speed from DT's ibm,nvlink-speed */
 };
 
+struct npu2_vfio_pci_device {
+   struct vfio_pci_core_device vdev;
+};
+
 static size_t vfio_pci_npu2_rw(struct vfio_pci_core_device *vdev,
char __user

[PATCH 7/9] vfio/pci_core: split nvlink2 to nvlink2gpu and npu2

2021-03-09 Thread Max Gurtovoy
This is a preparation for moving vendor specific code from
vfio_pci_core to vendor specific vfio_pci drivers. The next step will be
creating a dedicated module for NVIDIA NVLINK2 devices with P9 extensions
and a dedicated module for Power9 NPU NVLink2 HBAs.

Signed-off-by: Max Gurtovoy 
---
 drivers/vfio/pci/Makefile |   2 +-
 drivers/vfio/pci/npu2_trace.h |  50 
 .../vfio/pci/{trace.h => nvlink2gpu_trace.h}  |  27 +--
 drivers/vfio/pci/vfio_pci_core.c  |   2 +-
 drivers/vfio/pci/vfio_pci_core.h  |   4 +-
 drivers/vfio/pci/vfio_pci_npu2.c  | 222 ++
 ...io_pci_nvlink2.c => vfio_pci_nvlink2gpu.c} | 201 +---
 7 files changed, 280 insertions(+), 228 deletions(-)
 create mode 100644 drivers/vfio/pci/npu2_trace.h
 rename drivers/vfio/pci/{trace.h => nvlink2gpu_trace.h} (72%)
 create mode 100644 drivers/vfio/pci/vfio_pci_npu2.c
 rename drivers/vfio/pci/{vfio_pci_nvlink2.c => vfio_pci_nvlink2gpu.c} (59%)

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 16e7d77d63ce..f539f32c9296 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -5,7 +5,7 @@ obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
-vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2gpu.o vfio_pci_npu2.o
 vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o
 
 vfio-pci-y := vfio_pci.o
diff --git a/drivers/vfio/pci/npu2_trace.h b/drivers/vfio/pci/npu2_trace.h
new file mode 100644
index ..c8a1110132dc
--- /dev/null
+++ b/drivers/vfio/pci/npu2_trace.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * VFIO PCI mmap/mmap_fault tracepoints
+ *
+ * Copyright (C) 2018 IBM Corp.  All rights reserved.
+ * Author: Alexey Kardashevskiy 
+ */
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vfio_pci
+
+#if !defined(_TRACE_VFIO_PCI_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VFIO_PCI_H
+
+#include 
+
+TRACE_EVENT(vfio_pci_npu2_mmap,
+   TP_PROTO(struct pci_dev *pdev, unsigned long hpa, unsigned long ua,
+   unsigned long size, int ret),
+   TP_ARGS(pdev, hpa, ua, size, ret),
+
+   TP_STRUCT__entry(
+   __field(const char *, name)
+   __field(unsigned long, hpa)
+   __field(unsigned long, ua)
+   __field(unsigned long, size)
+   __field(int, ret)
+   ),
+
+   TP_fast_assign(
+   __entry->name = dev_name(&pdev->dev),
+   __entry->hpa = hpa;
+   __entry->ua = ua;
+   __entry->size = size;
+   __entry->ret = ret;
+   ),
+
+   TP_printk("%s: %lx -> %lx size=%lx ret=%d", __entry->name, __entry->hpa,
+   __entry->ua, __entry->size, __entry->ret)
+);
+
+#endif /* _TRACE_VFIO_PCI_H */
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH ../../drivers/vfio/pci
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE npu2_trace
+
+/* This part must be outside protection */
+#include 
diff --git a/drivers/vfio/pci/trace.h b/drivers/vfio/pci/nvlink2gpu_trace.h
similarity index 72%
rename from drivers/vfio/pci/trace.h
rename to drivers/vfio/pci/nvlink2gpu_trace.h
index b2aa986ab9ed..2392b9d4c6c9 100644
--- a/drivers/vfio/pci/trace.h
+++ b/drivers/vfio/pci/nvlink2gpu_trace.h
@@ -62,37 +62,12 @@ TRACE_EVENT(vfio_pci_nvgpu_mmap,
__entry->ua, __entry->size, __entry->ret)
 );
 
-TRACE_EVENT(vfio_pci_npu2_mmap,
-   TP_PROTO(struct pci_dev *pdev, unsigned long hpa, unsigned long ua,
-   unsigned long size, int ret),
-   TP_ARGS(pdev, hpa, ua, size, ret),
-
-   TP_STRUCT__entry(
-   __field(const char *, name)
-   __field(unsigned long, hpa)
-   __field(unsigned long, ua)
-   __field(unsigned long, size)
-   __field(int, ret)
-   ),
-
-   TP_fast_assign(
-   __entry->name = dev_name(&pdev->dev),
-   __entry->hpa = hpa;
-   __entry->ua = ua;
-   __entry->size = size;
-   __entry->ret = ret;
-   ),
-
-   TP_printk("%s: %lx -> %lx size=%lx ret=%d", __entry->name, __entry->hpa,
-   __entry->ua, __entry->size, __entry->ret)
-);
-
 #endif /* _TRACE_VFIO_PCI_H */
 
 #undef TRACE_INCLUDE_PATH
 #define TRACE_INCLUDE_PATH ../../drivers/vfio/pci
 #undef TRACE_INCLUDE_FILE
-#define TRACE_INCLUDE_FILE trace
+#define TRACE_INCLUDE_FILE nvlink2gpu_trace
 
 /* This part must be outside protection */
 #include 
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index ba5dd4321487..4de8e352

[PATCH 6/9] vfio-pci-core: export vfio_pci_register_dev_region function

2021-03-09 Thread Max Gurtovoy
This function will be used to allow vendor drivers to register regions
to be used and accessed by the core subsystem driver. This way, the core
will use the region ops that are vendor specific and managed by the
vendor vfio-pci driver.

The next step that can be made is to move the igd and nvlink2 logic to
dedicated modules instead of managing their vendor specific extensions in
the core driver.

Signed-off-by: Max Gurtovoy 
---
 drivers/vfio/pci/vfio_pci_core.c | 1 +
 drivers/vfio/pci/vfio_pci_core.h | 5 +
 2 files changed, 6 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 7b6be1e4646f..ba5dd4321487 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -711,6 +711,7 @@ int vfio_pci_register_dev_region(struct vfio_pci_core_device *vdev,
 
return 0;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_register_dev_region);
 
 struct vfio_devices {
struct vfio_device **devices;
diff --git a/drivers/vfio/pci/vfio_pci_core.h b/drivers/vfio/pci/vfio_pci_core.h
index 46eb3443125b..60b42df6c519 100644
--- a/drivers/vfio/pci/vfio_pci_core.h
+++ b/drivers/vfio/pci/vfio_pci_core.h
@@ -257,4 +257,9 @@ int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn);
 
 extern const struct pci_error_handlers vfio_pci_core_err_handlers;
 
+int vfio_pci_register_dev_region(struct vfio_pci_core_device *vdev,
+   unsigned int type, unsigned int subtype,
+   const struct vfio_pci_regops *ops,
+   size_t size, u32 flags, void *data);
+
 #endif /* VFIO_PCI_CORE_H */
-- 
2.25.4



[PATCH 5/9] vfio/pci: introduce vfio_pci_device structure

2021-03-09 Thread Max Gurtovoy
This structure will hold the specific attributes for the generic
vfio_pci.ko driver. It will be allocated by the vfio_pci driver, which
will register to the vfio subsystem using vfio_pci_core_register_device
and unregister using vfio_pci_core_unregister_device. In
this way every future vfio_pci driver will be able to use this mechanism
to set vendor specific attributes in its vendor specific structure and
register to the subsystem core while utilizing the vfio_pci_core library.

This is standard Linux subsystem behaviour. It will also make it easier
for vfio_pci drivers to extend the vfio_device_ops callbacks, and they
will be able to use the container_of mechanism as well (instead of
passing void pointers around the stack).

Signed-off-by: Max Gurtovoy 
---
 drivers/vfio/pci/vfio_pci.c  | 31 +
 drivers/vfio/pci/vfio_pci_core.c | 39 +---
 drivers/vfio/pci/vfio_pci_core.h |  5 ++--
 3 files changed, 45 insertions(+), 30 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 447c31f4e64e..dbc0a6559914 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -31,6 +32,10 @@
 #define DRIVER_AUTHOR   "Alex Williamson "
 #define DRIVER_DESC "VFIO PCI - User Level meta-driver"
 
+struct vfio_pci_device {
+   struct vfio_pci_core_device vdev;
+};
+
 static char ids[1024] __initdata;
 module_param_string(ids, ids, sizeof(ids), 0);
 MODULE_PARM_DESC(ids, "Initial PCI IDs to add to the vfio driver, format is 
\"vendor:device[:subvendor[:subdevice[:class[:class_mask\" and multiple 
comma separated entries can be specified");
@@ -139,21 +144,37 @@ static const struct vfio_device_ops vfio_pci_ops = {
 
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
-   struct vfio_pci_core_device *vdev;
+   struct vfio_pci_device *vpdev;
+   int ret;
 
if (vfio_pci_is_denylisted(pdev))
return -EINVAL;
 
-   vdev = vfio_create_pci_device(pdev, _pci_ops);
-   if (IS_ERR(vdev))
-   return PTR_ERR(vdev);
+   vpdev = kzalloc(sizeof(*vpdev), GFP_KERNEL);
+   if (!vpdev)
+   return -ENOMEM;
+
+   ret = vfio_pci_core_register_device(&vpdev->vdev, pdev, &vfio_pci_ops);
+   if (ret)
+   goto out_free;
 
return 0;
+
+out_free:
+   kfree(vpdev);
+   return ret;
 }
 
 static void vfio_pci_remove(struct pci_dev *pdev)
 {
-   vfio_destroy_pci_device(pdev);
+   struct vfio_device *vdev = dev_get_drvdata(&pdev->dev);
+   struct vfio_pci_core_device *core_vpdev = vfio_device_data(vdev);
+   struct vfio_pci_device *vpdev;
+
+   vpdev = container_of(core_vpdev, struct vfio_pci_device, vdev);
+
+   vfio_pci_core_unregister_device(core_vpdev);
+   kfree(vpdev);
 }
 
 static int vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 878a3609b916..7b6be1e4646f 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1837,15 +1837,15 @@ static int vfio_pci_bus_notifier(struct notifier_block *nb,
return 0;
 }
 
-struct vfio_pci_core_device *vfio_create_pci_device(struct pci_dev *pdev,
+int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev,
+   struct pci_dev *pdev,
const struct vfio_device_ops *vfio_pci_ops)
 {
-   struct vfio_pci_core_device *vdev;
struct iommu_group *group;
int ret;
 
if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
-   return ERR_PTR(-EINVAL);
+   return -EINVAL;
 
/*
 * Prevent binding to PFs with VFs enabled, the VFs might be in use
@@ -1857,18 +1857,12 @@ struct vfio_pci_core_device *vfio_create_pci_device(struct pci_dev *pdev,
 */
if (pci_num_vf(pdev)) {
pci_warn(pdev, "Cannot bind to PF with SR-IOV enabled\n");
-   return ERR_PTR(-EBUSY);
+   return -EBUSY;
}
 
	group = vfio_iommu_group_get(&pdev->dev);
if (!group)
-   return ERR_PTR(-EINVAL);
-
-   vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
-   if (!vdev) {
-   ret = -ENOMEM;
-   goto out_group_put;
-   }
+   return -EINVAL;
 
vdev->pdev = pdev;
vdev->vfio_pci_ops = vfio_pci_ops;
@@ -1884,7 +1878,7 @@ struct vfio_pci_core_device *vfio_create_pci_device(struct pci_dev *pdev,
 
	ret = vfio_add_group_dev(&pdev->dev, vfio_pci_ops, vdev);
if (ret)
-   goto out_free;
+   goto out_group_put;
 
ret = vfio_pci_reflck_attach(vdev);
if (ret)
@@ -1928,7 +1922,7 @@ struct vfio_pci_core_device *vfio_create_pci_device(struct

[PATCH 4/9] vfio-pci: introduce vfio_pci_core subsystem driver

2021-03-09 Thread Max Gurtovoy
Split the vfio_pci driver into two parts, the 'struct pci_driver'
(vfio_pci) and a library of code (vfio_pci_core) that helps creating a
VFIO device on top of a PCI device.

As before vfio_pci.ko continues to present the same interface under
sysfs and this change should have no functional impact.

vfio_pci_core exposes an interface that is similar to a typical
Linux subsystem, in that a pci_driver doing probe() can setup a number
of details and then create the VFIO char device.

Allowing another module to provide the pci_driver allows that module
to customize how VFIO is setup, inject its own operations, and easily
extend vendor specific functionality.

This is complementary to how VFIO's mediated devices work. Instead of
custom device lifecycle management and a special bus, drivers using
this approach rely on the normal driver core lifecycle (e.g.
bind/unbind) management. This is optimized to effectively support
customization that makes only small modifications to what vfio_pci
would do normally.

This approach is also a pluggable alternative for the hard wired
CONFIG_VFIO_PCI_IGD and CONFIG_VFIO_PCI_NVLINK2 "drivers" that are
built into vfio-pci. Using this work, all of that code can be moved to
dedicated device-specific modules and cleanly split out of the
generic vfio_pci driver.

Below is an example of adding a new driver to the vfio pci subsystem:
+-----------------+
|                 |
|      VFIO       |
|                 |
+-----------------+

+-----------------+
|                 |
|  VFIO_PCI_CORE  |
|                 |
+-----------------+

+--+ +---+ +--+
|  | |   | |  |
|  VFIO_PCI| | MLX5_VFIO_PCI | | IGD_VFIO_PCI |
|  | |   | |  |
+--+ +---+ +--+

In this way mlx5_vfio_pci will use vfio_pci_core to register to vfio
subsystem and also use the generic PCI functionality exported from it.
Additionally it will add the needed vendor specific logic for HW
specific features such as Live Migration. Same for the igd_vfio_pci that
will add special extensions for Intel Graphics cards (GVT-d).

Signed-off-by: Max Gurtovoy 
---
 drivers/vfio/pci/Kconfig |  22 ++-
 drivers/vfio/pci/Makefile|  13 +-
 drivers/vfio/pci/vfio_pci.c  | 247 
 drivers/vfio/pci/vfio_pci_core.c | 318 ---
 drivers/vfio/pci/vfio_pci_core.h | 113 +++
 5 files changed, 423 insertions(+), 290 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci.c

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index ac3c1dd3edef..829e90a2e5a3 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
-config VFIO_PCI
-   tristate "VFIO support for PCI devices"
+config VFIO_PCI_CORE
+   tristate "VFIO core support for PCI devices"
depends on VFIO && PCI && EVENTFD
select VFIO_VIRQFD
select IRQ_BYPASS_MANAGER
@@ -10,9 +10,17 @@ config VFIO_PCI
 
  If you don't know what to do here, say N.
 
+config VFIO_PCI
+   tristate "VFIO support for PCI devices"
+   depends on VFIO_PCI_CORE
+   help
+ This provides a generic PCI support using the VFIO framework.
+
+ If you don't know what to do here, say N.
+
 config VFIO_PCI_VGA
bool "VFIO PCI support for VGA devices"
-   depends on VFIO_PCI && X86 && VGA_ARB
+   depends on VFIO_PCI_CORE && X86 && VGA_ARB
help
  Support for VGA extension to VFIO PCI.  This exposes an additional
  region on VGA devices for accessing legacy VGA addresses used by
@@ -21,16 +29,16 @@ config VFIO_PCI_VGA
  If you don't know what to do here, say N.
 
 config VFIO_PCI_MMAP
-   depends on VFIO_PCI
+   depends on VFIO_PCI_CORE
def_bool y if !S390
 
 config VFIO_PCI_INTX
-   depends on VFIO_PCI
+   depends on VFIO_PCI_CORE
def_bool y if !S390
 
 config VFIO_PCI_IGD
bool "VFIO PCI extensions for Intel graphics (GVT-d)"
-   depends on VFIO_PCI && X86
+   depends on VFIO_PCI_CORE && X86
default y
help
  Support for Intel IGD specific extensions to enable direct
@@ -42,6 +50,6 @@ config VFIO_PCI_IGD
 
 config VFIO_PCI_NVLINK2
def_bool y
-   depends on VFIO_PCI &

[PATCH 3/9] vfio-pci: rename vfio_pci_device to vfio_pci_core_device

2021-03-09 Thread Max Gurtovoy
This is a preparation patch for separating the vfio_pci driver to a
subsystem driver and a generic pci driver. This patch doesn't change
any logic. The new vfio_pci_core_device structure will be the main
structure of the core driver and later on vfio_pci_device structure will
be the main structure of the generic vfio_pci driver.

Signed-off-by: Max Gurtovoy 
---
 drivers/vfio/pci/vfio_pci_config.c  | 68 +++---
 drivers/vfio/pci/vfio_pci_core.c| 90 ++---
 drivers/vfio/pci/vfio_pci_core.h| 76 
 drivers/vfio/pci/vfio_pci_igd.c | 14 ++---
 drivers/vfio/pci/vfio_pci_intrs.c   | 40 ++---
 drivers/vfio/pci/vfio_pci_nvlink2.c | 20 +++
 drivers/vfio/pci/vfio_pci_rdwr.c| 16 ++---
 drivers/vfio/pci/vfio_pci_zdev.c| 10 ++--
 8 files changed, 167 insertions(+), 167 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 5e9d24992207..122918d0f733 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -108,9 +108,9 @@ static const u16 pci_ext_cap_length[PCI_EXT_CAP_ID_MAX + 1] = {
 struct perm_bits {
u8  *virt;  /* read/write virtual data, not hw */
u8  *write; /* writeable bits */
-   int (*readfn)(struct vfio_pci_device *vdev, int pos, int count,
+   int (*readfn)(struct vfio_pci_core_device *vdev, int pos, int count,
  struct perm_bits *perm, int offset, __le32 *val);
-   int (*writefn)(struct vfio_pci_device *vdev, int pos, int count,
+   int (*writefn)(struct vfio_pci_core_device *vdev, int pos, int count,
   struct perm_bits *perm, int offset, __le32 val);
 };
 
@@ -171,7 +171,7 @@ static int vfio_user_config_write(struct pci_dev *pdev, int offset,
return ret;
 }
 
-static int vfio_default_config_read(struct vfio_pci_device *vdev, int pos,
+static int vfio_default_config_read(struct vfio_pci_core_device *vdev, int pos,
int count, struct perm_bits *perm,
int offset, __le32 *val)
 {
@@ -197,7 +197,7 @@ static int vfio_default_config_read(struct vfio_pci_device *vdev, int pos,
return count;
 }
 
-static int vfio_default_config_write(struct vfio_pci_device *vdev, int pos,
+static int vfio_default_config_write(struct vfio_pci_core_device *vdev, int pos,
 int count, struct perm_bits *perm,
 int offset, __le32 val)
 {
@@ -244,7 +244,7 @@ static int vfio_default_config_write(struct vfio_pci_device *vdev, int pos,
 }
 
 /* Allow direct read from hardware, except for capability next pointer */
-static int vfio_direct_config_read(struct vfio_pci_device *vdev, int pos,
+static int vfio_direct_config_read(struct vfio_pci_core_device *vdev, int pos,
   int count, struct perm_bits *perm,
   int offset, __le32 *val)
 {
@@ -269,7 +269,7 @@ static int vfio_direct_config_read(struct vfio_pci_device *vdev, int pos,
 }
 
 /* Raw access skips any kind of virtualization */
-static int vfio_raw_config_write(struct vfio_pci_device *vdev, int pos,
+static int vfio_raw_config_write(struct vfio_pci_core_device *vdev, int pos,
 int count, struct perm_bits *perm,
 int offset, __le32 val)
 {
@@ -282,7 +282,7 @@ static int vfio_raw_config_write(struct vfio_pci_device *vdev, int pos,
return count;
 }
 
-static int vfio_raw_config_read(struct vfio_pci_device *vdev, int pos,
+static int vfio_raw_config_read(struct vfio_pci_core_device *vdev, int pos,
int count, struct perm_bits *perm,
int offset, __le32 *val)
 {
@@ -296,7 +296,7 @@ static int vfio_raw_config_read(struct vfio_pci_device *vdev, int pos,
 }
 
 /* Virt access uses only virtualization */
-static int vfio_virt_config_write(struct vfio_pci_device *vdev, int pos,
+static int vfio_virt_config_write(struct vfio_pci_core_device *vdev, int pos,
  int count, struct perm_bits *perm,
  int offset, __le32 val)
 {
@@ -304,7 +304,7 @@ static int vfio_virt_config_write(struct vfio_pci_device *vdev, int pos,
return count;
 }
 
-static int vfio_virt_config_read(struct vfio_pci_device *vdev, int pos,
+static int vfio_virt_config_read(struct vfio_pci_core_device *vdev, int pos,
 int count, struct perm_bits *perm,
 int offset, __le32 *val)
 {
@@ -396,7 +396,7 @@ static inline void p_setd(struct perm_bits *p, int off, u32 virt, u32 write)
 }
 
 /* Caller should hold memory_lock semaphore */
-bool __vfio_pci_memory_enabled(struct vfio_pci_device *vdev)
+bool __vfio_pci_memory_enabled(struct vfio_pci_core_device

[PATCH 2/9] vfio-pci: rename vfio_pci_private.h to vfio_pci_core.h

2021-03-09 Thread Max Gurtovoy
This is a preparation patch for separating the vfio_pci driver to a
subsystem driver and a generic pci driver. This patch doesn't change
any logic.

Signed-off-by: Max Gurtovoy 
---
 drivers/vfio/pci/vfio_pci_config.c   | 2 +-
 drivers/vfio/pci/vfio_pci_core.c | 2 +-
 drivers/vfio/pci/{vfio_pci_private.h => vfio_pci_core.h} | 6 +++---
 drivers/vfio/pci/vfio_pci_igd.c  | 2 +-
 drivers/vfio/pci/vfio_pci_intrs.c| 2 +-
 drivers/vfio/pci/vfio_pci_nvlink2.c  | 2 +-
 drivers/vfio/pci/vfio_pci_rdwr.c | 2 +-
 drivers/vfio/pci/vfio_pci_zdev.c | 2 +-
 8 files changed, 10 insertions(+), 10 deletions(-)
 rename drivers/vfio/pci/{vfio_pci_private.h => vfio_pci_core.h} (98%)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index a402adee8a21..5e9d24992207 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -26,7 +26,7 @@
 #include 
 #include 
 
-#include "vfio_pci_private.h"
+#include "vfio_pci_core.h"
 
 /* Fake capability ID for standard config space */
 #define PCI_CAP_ID_BASIC   0
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 65e7e6b44578..bd587db04625 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -28,7 +28,7 @@
 #include 
 #include 
 
-#include "vfio_pci_private.h"
+#include "vfio_pci_core.h"
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson "
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_core.h
similarity index 98%
rename from drivers/vfio/pci/vfio_pci_private.h
rename to drivers/vfio/pci/vfio_pci_core.h
index 9cd1882a05af..b9ac7132b84a 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_core.h
@@ -15,8 +15,8 @@
 #include 
 #include 
 
-#ifndef VFIO_PCI_PRIVATE_H
-#define VFIO_PCI_PRIVATE_H
+#ifndef VFIO_PCI_CORE_H
+#define VFIO_PCI_CORE_H
 
 #define VFIO_PCI_OFFSET_SHIFT   40
 
@@ -225,4 +225,4 @@ static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
 }
 #endif
 
-#endif /* VFIO_PCI_PRIVATE_H */
+#endif /* VFIO_PCI_CORE_H */
diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
index 53d97f459252..0c599cd33d01 100644
--- a/drivers/vfio/pci/vfio_pci_igd.c
+++ b/drivers/vfio/pci/vfio_pci_igd.c
@@ -15,7 +15,7 @@
 #include 
 #include 
 
-#include "vfio_pci_private.h"
+#include "vfio_pci_core.h"
 
 #define OPREGION_SIGNATURE "IntelGraphicsMem"
 #define OPREGION_SIZE  (8 * 1024)
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 869dce5f134d..df1e8c8c274c 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -20,7 +20,7 @@
 #include 
 #include 
 
-#include "vfio_pci_private.h"
+#include "vfio_pci_core.h"
 
 /*
  * INTx
diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
index 9adcf6a8f888..326a704c4527 100644
--- a/drivers/vfio/pci/vfio_pci_nvlink2.c
+++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
@@ -19,7 +19,7 @@
 #include 
 #include 
 #include 
-#include "vfio_pci_private.h"
+#include "vfio_pci_core.h"
 
 #define CREATE_TRACE_POINTS
 #include "trace.h"
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index a0b5fc8e46f4..667e82726e75 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -17,7 +17,7 @@
 #include 
 #include 
 
-#include "vfio_pci_private.h"
+#include "vfio_pci_core.h"
 
 #ifdef __LITTLE_ENDIAN
 #define vfio_ioread64  ioread64
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 229685634031..3e91d49fa3f0 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -19,7 +19,7 @@
 #include 
 #include 
 
-#include "vfio_pci_private.h"
+#include "vfio_pci_core.h"
 
 /*
  * Add the Base PCI Function information to the device info region.
-- 
2.25.4



[PATCH 1/9] vfio-pci: rename vfio_pci.c to vfio_pci_core.c

2021-03-09 Thread Max Gurtovoy
This is a preparation patch for separating the vfio_pci driver to a
subsystem driver and a generic pci driver. This patch doesn't change any
logic.

Signed-off-by: Max Gurtovoy 
---
 drivers/vfio/pci/Makefile| 2 +-
 drivers/vfio/pci/{vfio_pci.c => vfio_pci_core.c} | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename drivers/vfio/pci/{vfio_pci.c => vfio_pci_core.c} (100%)

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index eff97a7cd9f1..bbf8d7c8fc45 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
-vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 vfio-pci-$(CONFIG_S390) += vfio_pci_zdev.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci_core.c
similarity index 100%
rename from drivers/vfio/pci/vfio_pci.c
rename to drivers/vfio/pci/vfio_pci_core.c
-- 
2.25.4



[PATCH v3 0/9] Introduce vfio-pci-core subsystem

2021-03-09 Thread Max Gurtovoy
changes from v2:
 - rename core structure to vfio_pci_core_device.
 - create 3 new vendor drivers (igd_vfio_pci, npu2_vfio_pci,
   nvlink2gpu_vfio_pci) and preserve backward compatibility.
 - rebase on top of Linux 5.11 tag.
 - remove patches that were accepted as stand-alone.
 - removed mlx5_vfio_pci driver from this series (3 vendor drivers will be
   enough for emphasizing the approach).

changes from v1:
 - Create a private and public vfio-pci structures (From Alex)
 - register to vfio core directly from vfio-pci-core (for now, expose
   minimal public functionality to vfio pci drivers. This will remove the
   implicit behaviour from v1. More power to the drivers can be added in
   the future)
 - Add patch 3/9 to emphasize the needed extension for LM feature (From
   Cornelia)
 - take/release refcount for the pci module during core open/release
 - update nvlink, igd and zdev to PowerNV, X86 and s390 extensions for
   vfio-pci core
 - fix segfault bugs in current vfio-pci zdev code

Max Gurtovoy (9):
  vfio-pci: rename vfio_pci.c to vfio_pci_core.c
  vfio-pci: rename vfio_pci_private.h to vfio_pci_core.h
  vfio-pci: rename vfio_pci_device to vfio_pci_core_device
  vfio-pci: introduce vfio_pci_core subsystem driver
  vfio/pci: introduce vfio_pci_device structure
  vfio-pci-core: export vfio_pci_register_dev_region function
  vfio/pci_core: split nvlink2 to nvlink2gpu and npu2
  vfio/pci: export nvlink2 support into vendor vfio_pci drivers
  vfio/pci: export igd support into vendor vfio_pci driver

 drivers/vfio/pci/Kconfig  |   53 +-
 drivers/vfio/pci/Makefile |   20 +-
 .../pci/{vfio_pci_igd.c => igd_vfio_pci.c}|  159 +-
 drivers/vfio/pci/igd_vfio_pci.h   |   24 +
 drivers/vfio/pci/npu2_trace.h |   50 +
 drivers/vfio/pci/npu2_vfio_pci.c  |  364 +++
 drivers/vfio/pci/npu2_vfio_pci.h  |   24 +
 .../vfio/pci/{trace.h => nvlink2gpu_trace.h}  |   27 +-
 ...io_pci_nvlink2.c => nvlink2gpu_vfio_pci.c} |  296 +-
 drivers/vfio/pci/nvlink2gpu_vfio_pci.h|   24 +
 drivers/vfio/pci/vfio_pci.c   | 2433 ++---
 drivers/vfio/pci/vfio_pci_config.c|   70 +-
 drivers/vfio/pci/vfio_pci_core.c  | 2245 +++
 drivers/vfio/pci/vfio_pci_core.h  |  242 ++
 drivers/vfio/pci/vfio_pci_intrs.c |   42 +-
 drivers/vfio/pci/vfio_pci_private.h   |  228 --
 drivers/vfio/pci/vfio_pci_rdwr.c  |   18 +-
 drivers/vfio/pci/vfio_pci_zdev.c  |   12 +-
 18 files changed, 3528 insertions(+), 2803 deletions(-)
 rename drivers/vfio/pci/{vfio_pci_igd.c => igd_vfio_pci.c} (58%)
 create mode 100644 drivers/vfio/pci/igd_vfio_pci.h
 create mode 100644 drivers/vfio/pci/npu2_trace.h
 create mode 100644 drivers/vfio/pci/npu2_vfio_pci.c
 create mode 100644 drivers/vfio/pci/npu2_vfio_pci.h
 rename drivers/vfio/pci/{trace.h => nvlink2gpu_trace.h} (72%)
 rename drivers/vfio/pci/{vfio_pci_nvlink2.c => nvlink2gpu_vfio_pci.c} (57%)
 create mode 100644 drivers/vfio/pci/nvlink2gpu_vfio_pci.h
 create mode 100644 drivers/vfio/pci/vfio_pci_core.c
 create mode 100644 drivers/vfio/pci/vfio_pci_core.h
 delete mode 100644 drivers/vfio/pci/vfio_pci_private.h

-- 
2.25.4



Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd

2021-02-11 Thread Max Gurtovoy



On 2/2/2021 7:10 PM, Jason Gunthorpe wrote:

On Tue, Feb 02, 2021 at 05:06:59PM +0100, Cornelia Huck wrote:


On the other side, we have the zdev support, which both requires s390
and applies to any pci device on s390.

Is there a reason why CONFIG_VFIO_PCI_ZDEV exists? Why not just always
return the s390 specific data in VFIO_DEVICE_GET_INFO if running on
s390?

It would be like returning data from ACPI on other platforms.


Agree.

Do we all agree that I should remove it?

We already have a check in the code:

if (ret && ret != -ENODEV) {
        pci_warn(vdev->vpdev.pdev,
                 "Failed to setup zPCI info capabilities\n");
        return ret;
}

So in case it's not a zdev, we should get -ENODEV and continue on the good flow.



It really seems like part of vfio-pci-core

Jason


Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd

2021-02-08 Thread Max Gurtovoy



On 2/5/2021 2:42 AM, Alexey Kardashevskiy wrote:



On 04/02/2021 23:51, Jason Gunthorpe wrote:

On Thu, Feb 04, 2021 at 12:05:22PM +1100, Alexey Kardashevskiy wrote:


It is system firmware (==bios) which puts stuff in the device tree. The
stuff is:
1. emulated pci devices (custom pci bridges), one per nvlink, 
emulated by
the firmware, the driver is "ibmnpu" and it is a part on the nvidia 
driver;

these are basically config space proxies to the cpu's side of nvlink.
2. interconnect information - which of 6 gpus nvlinks connected to 
which

nvlink on the cpu side, and memory ranges.


So what is this vfio_nvlink driver supposed to be bound to?

The "emulated pci devices"?


Yes.


A real GPU function?


Yes.


A real nvswitch function?


What do you mean by this exactly? The cpu side of nvlink is "emulated 
pci devices", the gpu side is not in pci space at all, the nvidia 
driver manages it via the gpu's mmio or/and cfg space.



Something else?


Nope :)
In this new scheme which you are proposing it should be 2 drivers, I 
guess.


I see.

So should it be nvidia_vfio_pci.ko? And will it do the NVLINK stuff in
case the class code matches, and otherwise just work as a simple
vfio_pci GPU driver?

What about the second driver? Should it be called ibmnpu_vfio_pci.ko?

Jason





Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd

2021-02-04 Thread Max Gurtovoy



On 2/3/2021 5:24 AM, Alexey Kardashevskiy wrote:



On 03/02/2021 04:41, Max Gurtovoy wrote:


On 2/2/2021 6:06 PM, Cornelia Huck wrote:

On Mon, 1 Feb 2021 11:42:30 -0700
Alex Williamson  wrote:


On Mon, 1 Feb 2021 12:49:12 -0500
Matthew Rosato  wrote:


On 2/1/21 12:14 PM, Cornelia Huck wrote:

On Mon, 1 Feb 2021 16:28:27 +
Max Gurtovoy  wrote:
This patch doesn't change any logic but only align to the 
concept of

vfio_pci_core extensions. Extensions that are related to a platform
and not to a specific vendor of PCI devices should be part of 
the core
driver. Extensions that are specific for PCI device vendor 
should go

to a dedicated vendor vfio-pci driver.
My understanding is that igd means support for Intel graphics, 
i.e. a

strict subset of x86. If there are other future extensions that e.g.
only make sense for some devices found only on AMD systems, I don't
think they should all be included under the same x86 umbrella.

Similar reasoning for nvlink, that only seems to cover support 
for some
GPUs under Power, and is not a general platform-specific 
extension IIUC.


We can arguably do the zdev -> s390 rename (as zpci appears only on
s390, and all PCI devices will be zpci on that platform), 
although I'm

not sure about the benefit.

As far as I can tell, there isn't any benefit for s390 it's just
"re-branding" to match the platform name rather than the zdev 
moniker,
which admittedly perhaps makes it more clear to someone outside of 
s390
that any PCI device on s390 is a zdev/zpci type, and thus will use 
this
extension to vfio_pci(_core).  This would still be true even if we 
added

something later that builds atop it (e.g. a platform-specific device
like ism-vfio-pci).  Or for that matter, mlx5 via vfio-pci on 
s390x uses
these zdev extensions today and would need to continue using them 
in a

world where mlx5-vfio-pci.ko exists.

I guess all that to say: if such a rename matches the 'grand 
scheme' of
this design where we treat arch-level extensions to 
vfio_pci(_core) as
"vfio_pci_(arch)" then I'm not particularly opposed to the 
rename.  But

by itself it's not very exciting :)

This all seems like the wrong direction to me.  The goal here is to
modularize vfio-pci into a core library and derived vendor modules 
that

make use of that core library.  If existing device specific extensions
within vfio-pci cannot be turned into vendor modules through this
support and are instead redefined as platform specific features of the
new core library, that feels like we're already admitting failure of
this core library to support known devices, let alone future devices.

IGD is a specific set of devices.  They happen to rely on some 
platform

specific support, whose availability should be determined via the
vendor module probe callback.  Packing that support into an "x86"
component as part of the core feels not only short sighted, but also
avoids addressing the issues around how userspace determines an 
optimal

module to use for a device.

Hm, it seems that not all current extensions to the vfio-pci code are
created equal.

IIUC, we have igd and nvlink, which are sets of devices that only show
up on x86 or ppc, respectively, and may rely on some special features
of those architectures/platforms. The important point is that you have
a device identifier that you can match a driver against.


Maybe you can supply the IDs?

Alexey K, I saw you've been working on the NVLINK2 for P9. Can you 
supply the exact IDs that should be bound to this driver?


I'll add it to V3.


Sorry no, I never really had the exact ids, they keep popping up after 
years.


The nvlink2 support can only work if the platform supports it as there 
is nothing to do to the GPUs themselves, it is about setting up those 
nvlink2 links. If the platform advertises certain features in the 
device tree - then any GPU in the SXM2 form factor (not sure about the 
exact abbrev, not an usual PCIe device) should just work.


If the nvlink2 "subdriver" fails to find nvlinks (the platform does 
not advertise them or some parameters are new to this subdriver, such 
as link-speed), we still fall back to generic vfio-pci and try passing 
through this GPU as a plain PCI device with no extra links. 
Semi-optimal but if the user is mining cryptocoins, then highspeed 
links are not really that important :) And the nvidia driver is 
capable of running such GPUs without nvlinks. Thanks,



I see.

But is the PCI function (the bound BDF) a GPU function or an NVLINK function?

If it's an NVLINK function, then we should fail probing in the host vfio-pci 
driver.


If it's a GPU function, then it shouldn't be called an nvlink2 vfio-pci 
driver. It's just an extension in the GPU vfio-pci driver.


On the other side, we have the zdev support, which both requires s390
and applies to any pci device on s390. If we added special handling for
ISM on s390, ISM would be in a category similar to igd/nvlink.

Now, i

Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd

2021-02-03 Thread Max Gurtovoy



On 2/3/2021 5:24 AM, Alexey Kardashevskiy wrote:



On 03/02/2021 04:41, Max Gurtovoy wrote:


On 2/2/2021 6:06 PM, Cornelia Huck wrote:

On Mon, 1 Feb 2021 11:42:30 -0700
Alex Williamson  wrote:


On Mon, 1 Feb 2021 12:49:12 -0500
Matthew Rosato  wrote:


On 2/1/21 12:14 PM, Cornelia Huck wrote:

On Mon, 1 Feb 2021 16:28:27 +
Max Gurtovoy  wrote:
This patch doesn't change any logic but only align to the 
concept of

vfio_pci_core extensions. Extensions that are related to a platform
and not to a specific vendor of PCI devices should be part of 
the core
driver. Extensions that are specific for PCI device vendor 
should go

to a dedicated vendor vfio-pci driver.
My understanding is that igd means support for Intel graphics, 
i.e. a

strict subset of x86. If there are other future extensions that e.g.
only make sense for some devices found only on AMD systems, I don't
think they should all be included under the same x86 umbrella.

Similar reasoning for nvlink, that only seems to cover support 
for some
GPUs under Power, and is not a general platform-specific 
extension IIUC.


We can arguably do the zdev -> s390 rename (as zpci appears only on
s390, and all PCI devices will be zpci on that platform), 
although I'm

not sure about the benefit.

As far as I can tell, there isn't any benefit for s390 it's just
"re-branding" to match the platform name rather than the zdev 
moniker,
which admittedly perhaps makes it more clear to someone outside of 
s390
that any PCI device on s390 is a zdev/zpci type, and thus will use 
this
extension to vfio_pci(_core).  This would still be true even if we 
added

something later that builds atop it (e.g. a platform-specific device
like ism-vfio-pci).  Or for that matter, mlx5 via vfio-pci on 
s390x uses
these zdev extensions today and would need to continue using them 
in a

world where mlx5-vfio-pci.ko exists.

I guess all that to say: if such a rename matches the 'grand 
scheme' of
this design where we treat arch-level extensions to 
vfio_pci(_core) as
"vfio_pci_(arch)" then I'm not particularly opposed to the 
rename.  But

by itself it's not very exciting :)

This all seems like the wrong direction to me.  The goal here is to
modularize vfio-pci into a core library and derived vendor modules 
that

make use of that core library.  If existing device specific extensions
within vfio-pci cannot be turned into vendor modules through this
support and are instead redefined as platform specific features of the
new core library, that feels like we're already admitting failure of
this core library to support known devices, let alone future devices.

IGD is a specific set of devices.  They happen to rely on some 
platform

specific support, whose availability should be determined via the
vendor module probe callback.  Packing that support into an "x86"
component as part of the core feels not only short sighted, but also
avoids addressing the issues around how userspace determines an 
optimal

module to use for a device.

Hm, it seems that not all current extensions to the vfio-pci code are
created equal.

IIUC, we have igd and nvlink, which are sets of devices that only show
up on x86 or ppc, respectively, and may rely on some special features
of those architectures/platforms. The important point is that you have
a device identifier that you can match a driver against.


Maybe you can supply the IDs?

Alexey K, I saw you've been working on the NVLINK2 for P9. Can you 
supply the exact IDs that should be bound to this driver?


I'll add it to V3.


Sorry no, I never really had the exact ids, they keep popping up after 
years.


The nvlink2 support can only work if the platform supports it as there 
is nothing to do to the GPUs themselves, it is about setting up those 
nvlink2 links. If the platform advertises certain features in the 
device tree - then any GPU in the SXM2 form factor (not sure about the 
exact abbrev, not an usual PCIe device) should just work.


If the nvlink2 "subdriver" fails to find nvlinks (the platform does 
not advertise them or some parameters are new to this subdriver, such 
as link-speed), we still fall back to generic vfio-pci and try passing 
through this GPU as a plain PCI device with no extra links. 
Semi-optimal but if the user is mining cryptocoins, then highspeed 
links are not really that important :) And the nvidia driver is 
capable of running such GPUs without nvlinks. Thanks,


From the above I can conclude that nvlink2_vfio_pci will need to find 
the nvlinks during probe and fail in case it doesn't find them.


Which platform driver is responsible for advertising it? Can we use an 
aux_device to represent these nvlinks?


In case of a failure, the user can fall back to vfio-pci.ko and run 
without the nvlinks, as you said.


This is deterministic behavior, and the user will know exactly what he's 
getting from the vfio subsystem.


On the other side, we have the zdev support, which both 

Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd

2021-02-02 Thread Max Gurtovoy



On 2/2/2021 10:44 PM, Jason Gunthorpe wrote:

On Tue, Feb 02, 2021 at 12:37:23PM -0700, Alex Williamson wrote:


For the most part, this explicit bind interface is redundant to
driver_override, which already avoids the duplicate ID issue.

No, the point here is to have the ID tables in the PCI drivers because
they fundamentally only work with their supported IDs. The normal
driver core ID tables are a replacement for all the hardwired if's in
vfio_pci.

driver_override completely disables all the ID checking, it seems only
useful for vfio_pci which works with everything. It should not be used
with something like nvlink_vfio_pci.ko that needs ID checking.


This mechanism of driver_override seems weird to me. In case of hotplug, 
when both capable drivers (the native device driver and vfio-pci) are 
loaded, both will compete for the device.
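
For reference, the existing driver_override flow being debated here is a sysfs mechanism; a minimal sketch (the BDF below is hypothetical, and the commands need root) looks like:

```shell
# Hypothetical device address; substitute a real BDF from lspci.
BDF=0000:65:00.0

# driver_override bypasses ID-table matching entirely: the named driver
# will accept this device on the next (re)bind attempt.
echo vfio-pci > /sys/bus/pci/devices/$BDF/driver_override

# Detach the currently bound driver, if any, then ask the driver core
# to re-probe the device, which now binds vfio-pci.
echo $BDF > /sys/bus/pci/devices/$BDF/driver/unbind
echo $BDF > /sys/bus/pci/drivers_probe
```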


I think the proposed flag is very powerful, it does fix the original 
concern Alex had ("if we start adding ids for vfio drivers then we 
create conflicts with the native host driver"), and it's very deterministic.


In this way we'll bind explicitly to a driver.

And the way we'll choose a vfio-pci driver is by device_id + vendor_id + 
subsystem_device + subsystem_vendor.


There shouldn't be two vfio-pci drivers that support a device with the 
same four IDs above.


If you don't find a suitable vendor-vfio-pci.ko, you'll try binding 
vfio-pci.ko.


Each driver will publish its supported IDs in sysfs to help the user 
decide.




Yes, this DRIVER_EXPLICIT_BIND_ONLY idea somewhat replaces
driver_override because we could set the PCI any match on vfio_pci and
manage the driver binding explicitly instead.


A driver id table doesn't really help for binding the device,
ultimately even if a device is in the id table it might fail to
probe due to the missing platform support that each of these igd and
nvlink drivers expose,

What happens depends on what makes sense for the driver, some missing
optional support could continue without it, or it could fail.

IGD and nvlink can trivially go onwards and work if they don't find
the platform support.

Or they might want to fail, I think the mlx5 and probably nvlink
drivers should fail as they are intended to be coupled with userspace
that expects to use their extended features.

In those cases failing is a feature because it prevents the whole
system from going into an unexpected state.

Jason


Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd

2021-02-02 Thread Max Gurtovoy



On 2/2/2021 6:06 PM, Cornelia Huck wrote:

On Mon, 1 Feb 2021 11:42:30 -0700
Alex Williamson  wrote:


On Mon, 1 Feb 2021 12:49:12 -0500
Matthew Rosato  wrote:


On 2/1/21 12:14 PM, Cornelia Huck wrote:

On Mon, 1 Feb 2021 16:28:27 +
Max Gurtovoy  wrote:
 

This patch doesn't change any logic but only align to the concept of
vfio_pci_core extensions. Extensions that are related to a platform
and not to a specific vendor of PCI devices should be part of the core
driver. Extensions that are specific for PCI device vendor should go
to a dedicated vendor vfio-pci driver.

My understanding is that igd means support for Intel graphics, i.e. a
strict subset of x86. If there are other future extensions that e.g.
only make sense for some devices found only on AMD systems, I don't
think they should all be included under the same x86 umbrella.

Similar reasoning for nvlink, that only seems to cover support for some
GPUs under Power, and is not a general platform-specific extension IIUC.

We can arguably do the zdev -> s390 rename (as zpci appears only on
s390, and all PCI devices will be zpci on that platform), although I'm
not sure about the benefit.

As far as I can tell, there isn't any benefit for s390 it's just
"re-branding" to match the platform name rather than the zdev moniker,
which admittedly perhaps makes it more clear to someone outside of s390
that any PCI device on s390 is a zdev/zpci type, and thus will use this
extension to vfio_pci(_core).  This would still be true even if we added
something later that builds atop it (e.g. a platform-specific device
like ism-vfio-pci).  Or for that matter, mlx5 via vfio-pci on s390x uses
these zdev extensions today and would need to continue using them in a
world where mlx5-vfio-pci.ko exists.

I guess all that to say: if such a rename matches the 'grand scheme' of
this design where we treat arch-level extensions to vfio_pci(_core) as
"vfio_pci_(arch)" then I'm not particularly opposed to the rename.  But
by itself it's not very exciting :)

This all seems like the wrong direction to me.  The goal here is to
modularize vfio-pci into a core library and derived vendor modules that
make use of that core library.  If existing device specific extensions
within vfio-pci cannot be turned into vendor modules through this
support and are instead redefined as platform specific features of the
new core library, that feels like we're already admitting failure of
this core library to support known devices, let alone future devices.

IGD is a specific set of devices.  They happen to rely on some platform
specific support, whose availability should be determined via the
vendor module probe callback.  Packing that support into an "x86"
component as part of the core feels not only short sighted, but also
avoids addressing the issues around how userspace determines an optimal
module to use for a device.

Hm, it seems that not all current extensions to the vfio-pci code are
created equal.

IIUC, we have igd and nvlink, which are sets of devices that only show
up on x86 or ppc, respectively, and may rely on some special features
of those architectures/platforms. The important point is that you have
a device identifier that you can match a driver against.


Maybe you can supply the IDs?

Alexey K, I saw you've been working on the NVLINK2 for P9. Can you 
supply the exact IDs that should be bound to this driver?


I'll add it to V3.



On the other side, we have the zdev support, which both requires s390
and applies to any pci device on s390. If we added special handling for
ISM on s390, ISM would be in a category similar to igd/nvlink.

Now, if somebody plugs a mlx5 device into an s390, we would want both
the zdev support and the specialized mlx5 driver. Maybe zdev should be
some kind of library that can be linked into normal vfio-pci, into
vfio-pci-mlx5, and a hypothetical vfio-pci-ism? You always want zdev on
s390 (if configured into the kernel).



[PATCH 9/9] vfio/pci: use powernv naming instead of nvlink2

2021-02-01 Thread Max Gurtovoy
This patch doesn't change any logic but only aligns to the concept of
vfio_pci_core extensions. Extensions that are related to a platform
and not to a specific vendor of PCI devices should be part of the
core driver. Extensions that are specific to a PCI device vendor
should go to a dedicated vendor vfio-pci driver.

For now, powernv extensions will include only nvlink2.

Signed-off-by: Max Gurtovoy 
---
 drivers/vfio/pci/Kconfig| 6 --
 drivers/vfio/pci/Makefile   | 2 +-
 drivers/vfio/pci/vfio_pci_core.c| 4 ++--
 drivers/vfio/pci/{vfio_pci_nvlink2.c => vfio_pci_powernv.c} | 0
 drivers/vfio/pci/vfio_pci_private.h | 2 +-
 5 files changed, 8 insertions(+), 6 deletions(-)
 rename drivers/vfio/pci/{vfio_pci_nvlink2.c => vfio_pci_powernv.c} (100%)

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index c98f2df01a60..fe0264b3d02f 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -47,11 +47,13 @@ config VFIO_PCI_X86
 
  To enable Intel X86 extensions for vfio-pci-core, say Y.
 
-config VFIO_PCI_NVLINK2
+config VFIO_PCI_POWERNV
def_bool y
depends on VFIO_PCI_CORE && PPC_POWERNV
help
- VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
+ VFIO PCI extensions for IBM PowerNV (Non-Virtualized) platform
+
+ To enable POWERNV extensions for vfio-pci-core, say Y.
 
 config VFIO_PCI_S390
bool "VFIO PCI extensions for S390 platform"
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index d8ccb70e015a..442b7c78de4c 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o 
vfio_pci_config.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_X86) += vfio_pci_x86.o
-vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_POWERNV) += vfio_pci_powernv.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_s390.o
 
 vfio-pci-y := vfio_pci.o
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index e0e258c37fb5..90cc728fffc7 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -337,7 +337,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
}
 
if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
-   IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
+   IS_ENABLED(CONFIG_VFIO_PCI_POWERNV)) {
ret = vfio_pci_nvdia_v100_nvlink2_init(vdev);
if (ret && ret != -ENODEV) {
pci_warn(pdev, "Failed to setup NVIDIA NV2 RAM 
region\n");
@@ -346,7 +346,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
}
 
if (pdev->vendor == PCI_VENDOR_ID_IBM &&
-   IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
+   IS_ENABLED(CONFIG_VFIO_PCI_POWERNV)) {
ret = vfio_pci_ibm_npu2_init(vdev);
if (ret && ret != -ENODEV) {
pci_warn(pdev, "Failed to setup NVIDIA NV2 ATSD 
region\n");
diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c 
b/drivers/vfio/pci/vfio_pci_powernv.c
similarity index 100%
rename from drivers/vfio/pci/vfio_pci_nvlink2.c
rename to drivers/vfio/pci/vfio_pci_powernv.c
diff --git a/drivers/vfio/pci/vfio_pci_private.h 
b/drivers/vfio/pci/vfio_pci_private.h
index efc688525784..dc6a9191a704 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -173,7 +173,7 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device 
*vdev)
return -ENODEV;
 }
 #endif
-#ifdef CONFIG_VFIO_PCI_NVLINK2
+#ifdef CONFIG_VFIO_PCI_POWERNV
 extern int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev);
 extern int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev);
 #else
-- 
2.25.4



[PATCH 6/9] vfio-pci/zdev: fix possible segmentation fault issue

2021-02-01 Thread Max Gurtovoy
In case the allocation fails, we must behave correctly and exit with an error.

Signed-off-by: Max Gurtovoy 
---
 drivers/vfio/pci/vfio_pci_zdev.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 175096fcd902..e9ef4239ef7a 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -71,6 +71,8 @@ static int zpci_util_cap(struct zpci_dev *zdev, struct 
vfio_info_cap *caps)
int ret;
 
cap = kmalloc(cap_size, GFP_KERNEL);
+   if (!cap)
+   return -ENOMEM;
 
cap->header.id = VFIO_DEVICE_INFO_CAP_ZPCI_UTIL;
cap->header.version = 1;
@@ -94,6 +96,8 @@ static int zpci_pfip_cap(struct zpci_dev *zdev, struct 
vfio_info_cap *caps)
int ret;
 
cap = kmalloc(cap_size, GFP_KERNEL);
+   if (!cap)
+   return -ENOMEM;
 
cap->header.id = VFIO_DEVICE_INFO_CAP_ZPCI_PFIP;
cap->header.version = 1;
-- 
2.25.4



[PATCH 8/9] vfio/pci: use x86 naming instead of igd

2021-02-01 Thread Max Gurtovoy
This patch doesn't change any logic but only aligns to the concept of
vfio_pci_core extensions. Extensions that are related to a platform
and not to a specific vendor of PCI devices should be part of the core
driver. Extensions that are specific to a PCI device vendor should go
to a dedicated vendor vfio-pci driver.

For now, x86 extensions will include only igd.

Signed-off-by: Max Gurtovoy 
---
 drivers/vfio/pci/Kconfig| 13 ++---
 drivers/vfio/pci/Makefile   |  2 +-
 drivers/vfio/pci/vfio_pci_core.c|  2 +-
 drivers/vfio/pci/vfio_pci_private.h |  2 +-
 drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} |  0
 5 files changed, 9 insertions(+), 10 deletions(-)
 rename drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} (100%)

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 6e6c976b9512..c98f2df01a60 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -36,17 +36,16 @@ config VFIO_PCI_INTX
depends on VFIO_PCI_CORE
def_bool y if !S390
 
-config VFIO_PCI_IGD
-   bool "VFIO PCI extensions for Intel graphics (GVT-d)"
+config VFIO_PCI_X86
+   bool "VFIO PCI extensions for Intel X86 platform"
depends on VFIO_PCI_CORE && X86
default y
help
- Support for Intel IGD specific extensions to enable direct
- assignment to virtual machines.  This includes exposing an IGD
- specific firmware table and read-only copies of the host bridge
- and LPC bridge config space.
+ Support for Intel X86 specific extensions for VFIO PCI devices.
+ This includes exposing an IGD specific firmware table and
+ read-only copies of the host bridge and LPC bridge config space.
 
- To enable Intel IGD assignment through vfio-pci, say Y.
+ To enable Intel X86 extensions for vfio-pci-core, say Y.
 
 config VFIO_PCI_NVLINK2
def_bool y
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 90962a495eba..d8ccb70e015a 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -5,7 +5,7 @@ obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
 obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o 
vfio_pci_config.o
-vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_X86) += vfio_pci_x86.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_s390.o
 
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index c559027def2d..e0e258c37fb5 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -328,7 +328,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 
if (vfio_pci_is_vga(pdev) &&
pdev->vendor == PCI_VENDOR_ID_INTEL &&
-   IS_ENABLED(CONFIG_VFIO_PCI_IGD)) {
+   IS_ENABLED(CONFIG_VFIO_PCI_X86)) {
ret = vfio_pci_igd_init(vdev);
if (ret && ret != -ENODEV) {
pci_warn(pdev, "Failed to setup Intel IGD regions\n");
diff --git a/drivers/vfio/pci/vfio_pci_private.h 
b/drivers/vfio/pci/vfio_pci_private.h
index 7c3c2cd575f8..efc688525784 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -165,7 +165,7 @@ extern u16 vfio_pci_memory_lock_and_enable(struct 
vfio_pci_device *vdev);
 extern void vfio_pci_memory_unlock_and_restore(struct vfio_pci_device *vdev,
   u16 cmd);
 
-#ifdef CONFIG_VFIO_PCI_IGD
+#ifdef CONFIG_VFIO_PCI_X86
 extern int vfio_pci_igd_init(struct vfio_pci_device *vdev);
 #else
 static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_x86.c
similarity index 100%
rename from drivers/vfio/pci/vfio_pci_igd.c
rename to drivers/vfio/pci/vfio_pci_x86.c
-- 
2.25.4



[PATCH 7/9] vfio/pci: use s390 naming instead of zdev

2021-02-01 Thread Max Gurtovoy
This patch doesn't change any logic but only aligns to the concept of
vfio_pci_core extensions. Extensions that are related to a platform and
not to a specific vendor of PCI devices should be part of the core
driver. Extensions that are specific to a PCI device vendor should go to
a dedicated vendor vfio-pci driver.

Signed-off-by: Max Gurtovoy 
---
 drivers/vfio/pci/Kconfig  | 4 ++--
 drivers/vfio/pci/Makefile | 2 +-
 drivers/vfio/pci/vfio_pci_core.c  | 2 +-
 drivers/vfio/pci/vfio_pci_private.h   | 2 +-
 drivers/vfio/pci/{vfio_pci_zdev.c => vfio_pci_s390.c} | 2 +-
 5 files changed, 6 insertions(+), 6 deletions(-)
 rename drivers/vfio/pci/{vfio_pci_zdev.c => vfio_pci_s390.c} (98%)

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index dcb164d7d641..6e6c976b9512 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -54,8 +54,8 @@ config VFIO_PCI_NVLINK2
help
  VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
 
-config VFIO_PCI_ZDEV
-   bool "VFIO PCI ZPCI device CLP support"
+config VFIO_PCI_S390
+   bool "VFIO PCI extensions for S390 platform"
depends on VFIO_PCI_CORE && S390
default y
help
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 9f67edca31c5..90962a495eba 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -7,7 +7,7 @@ obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o 
vfio_pci_config.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
-vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_s390.o
 
 vfio-pci-y := vfio_pci.o
 
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a0a91331f575..c559027def2d 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -785,7 +785,7 @@ static long vfio_pci_core_ioctl(void *device_data, unsigned 
int cmd,
info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions;
info.num_irqs = VFIO_PCI_NUM_IRQS;
 
-   if (IS_ENABLED(CONFIG_VFIO_PCI_ZDEV)) {
+   if (IS_ENABLED(CONFIG_VFIO_PCI_S390)) {
int ret = vfio_pci_info_zdev_add_caps(vdev, );
 
if (ret && ret != -ENODEV) {
diff --git a/drivers/vfio/pci/vfio_pci_private.h 
b/drivers/vfio/pci/vfio_pci_private.h
index 1c3bb809b5c0..7c3c2cd575f8 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -188,7 +188,7 @@ static inline int vfio_pci_ibm_npu2_init(struct 
vfio_pci_device *vdev)
 }
 #endif
 
-#ifdef CONFIG_VFIO_PCI_ZDEV
+#ifdef CONFIG_VFIO_PCI_S390
 extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
   struct vfio_info_cap *caps);
 #else
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_s390.c
similarity index 98%
rename from drivers/vfio/pci/vfio_pci_zdev.c
rename to drivers/vfio/pci/vfio_pci_s390.c
index e9ef4239ef7a..b6421a7f767f 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_s390.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0+
 /*
- * VFIO ZPCI devices support
+ * VFIO PCI support for S390 platform (a.k.a ZPCI devices)
  *
  * Copyright (C) IBM Corp. 2020.  All rights reserved.
  * Author(s): Pierre Morel 
-- 
2.25.4



[PATCH 5/9] vfio-pci/zdev: remove unused vdev argument

2021-02-01 Thread Max Gurtovoy
The zdev static functions do not use the vdev argument. Remove it.

Signed-off-by: Max Gurtovoy 
---
 drivers/vfio/pci/vfio_pci_zdev.c | 20 
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 7b20b34b1034..175096fcd902 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -24,8 +24,7 @@
 /*
  * Add the Base PCI Function information to the device info region.
  */
-static int zpci_base_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
-struct vfio_info_cap *caps)
+static int zpci_base_cap(struct zpci_dev *zdev, struct vfio_info_cap *caps)
 {
struct vfio_device_info_cap_zpci_base cap = {
.header.id = VFIO_DEVICE_INFO_CAP_ZPCI_BASE,
@@ -45,8 +44,7 @@ static int zpci_base_cap(struct zpci_dev *zdev, struct 
vfio_pci_device *vdev,
 /*
  * Add the Base PCI Function Group information to the device info region.
  */
-static int zpci_group_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
- struct vfio_info_cap *caps)
+static int zpci_group_cap(struct zpci_dev *zdev, struct vfio_info_cap *caps)
 {
struct vfio_device_info_cap_zpci_group cap = {
.header.id = VFIO_DEVICE_INFO_CAP_ZPCI_GROUP,
@@ -66,8 +64,7 @@ static int zpci_group_cap(struct zpci_dev *zdev, struct 
vfio_pci_device *vdev,
 /*
  * Add the device utility string to the device info region.
  */
-static int zpci_util_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
-struct vfio_info_cap *caps)
+static int zpci_util_cap(struct zpci_dev *zdev, struct vfio_info_cap *caps)
 {
struct vfio_device_info_cap_zpci_util *cap;
int cap_size = sizeof(*cap) + CLP_UTIL_STR_LEN;
@@ -90,8 +87,7 @@ static int zpci_util_cap(struct zpci_dev *zdev, struct 
vfio_pci_device *vdev,
 /*
  * Add the function path string to the device info region.
  */
-static int zpci_pfip_cap(struct zpci_dev *zdev, struct vfio_pci_device *vdev,
-struct vfio_info_cap *caps)
+static int zpci_pfip_cap(struct zpci_dev *zdev, struct vfio_info_cap *caps)
 {
struct vfio_device_info_cap_zpci_pfip *cap;
int cap_size = sizeof(*cap) + CLP_PFIP_NR_SEGMENTS;
@@ -123,21 +119,21 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_device 
*vdev,
if (!zdev)
return -ENODEV;
 
-   ret = zpci_base_cap(zdev, vdev, caps);
+   ret = zpci_base_cap(zdev, caps);
if (ret)
return ret;
 
-   ret = zpci_group_cap(zdev, vdev, caps);
+   ret = zpci_group_cap(zdev, caps);
if (ret)
return ret;
 
if (zdev->util_str_avail) {
-   ret = zpci_util_cap(zdev, vdev, caps);
+   ret = zpci_util_cap(zdev, caps);
if (ret)
return ret;
}
 
-   ret = zpci_pfip_cap(zdev, vdev, caps);
+   ret = zpci_pfip_cap(zdev, caps);
 
return ret;
 }
-- 
2.25.4



[PATCH 4/9] mlx5-vfio-pci: add new vfio_pci driver for mlx5 devices

2021-02-01 Thread Max Gurtovoy
This driver registers with both the PCI bus and the auxiliary bus. If
both probes succeed, we get a vendor-specific VFIO PCI device.
mlx5_vfio_pci uses vfio_pci_core to register and create a VFIO device,
and uses an auxiliary_device to get the needed extension from the
vendor device driver. If either probe() fails, the VFIO char device is
not created. For now, only register and bind the auxiliary_device to
the pci_device when the auxiliary_device id matches the pci_device BDF.
Later, vendor-specific features such as live migration will be added
and made available to the virtualization software.

Note: Although we've created mlx5-vfio-pci.ko, binding to vfio-pci.ko
still works as before; it is fully backward compatible. Of course, the
extended vendor functionality will not be available if the device is
bound to the generic vfio_pci.ko.

Signed-off-by: Max Gurtovoy 
---
 drivers/vfio/pci/Kconfig |  10 ++
 drivers/vfio/pci/Makefile|   3 +
 drivers/vfio/pci/mlx5_vfio_pci.c | 253 +++
 include/linux/mlx5/vfio_pci.h|  36 +
 4 files changed, 302 insertions(+)
 create mode 100644 drivers/vfio/pci/mlx5_vfio_pci.c
 create mode 100644 include/linux/mlx5/vfio_pci.h

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index b958a48f63a0..dcb164d7d641 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -65,3 +65,13 @@ config VFIO_PCI_ZDEV
  for zPCI devices passed through via VFIO on s390.
 
  Say Y here.
+
+config MLX5_VFIO_PCI
+   tristate "VFIO support for MLX5 PCI devices"
+   depends on VFIO_PCI_CORE && MLX5_CORE
+   select AUXILIARY_BUS
+   help
+ This provides a generic PCI support for MLX5 devices using the VFIO
+ framework.
+
+ If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 3f2a27e222cd..9f67edca31c5 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -2,6 +2,7 @@
 
 obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o 
vfio_pci_config.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
@@ -9,3 +10,5 @@ vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o
 
 vfio-pci-y := vfio_pci.o
+
+mlx5-vfio-pci-y := mlx5_vfio_pci.o
diff --git a/drivers/vfio/pci/mlx5_vfio_pci.c b/drivers/vfio/pci/mlx5_vfio_pci.c
new file mode 100644
index ..4e6b256c74bf
--- /dev/null
+++ b/drivers/vfio/pci/mlx5_vfio_pci.c
@@ -0,0 +1,253 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+ * Author: Max Gurtovoy 
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vfio_pci_core.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "Max Gurtovoy "
+#define DRIVER_DESC "MLX5 VFIO PCI - User Level meta-driver for NVIDIA 
MLX5 device family"
+
+/* 16k migration data size */
+#define MLX5_MIGRATION_REGION_DATA_SIZESZ_16K
+/* Data section offset from migration region */
+#define MLX5_MIGRATION_REGION_DATA_OFFSET  (sizeof(struct 
vfio_device_migration_info))
+
+struct mlx5_vfio_pci_migration_info {
+   struct vfio_device_migration_info mig;
+   char data[MLX5_MIGRATION_REGION_DATA_SIZE];
+};
+
+static LIST_HEAD(aux_devs_list);
+static DEFINE_MUTEX(aux_devs_lock);
+
+static struct mlx5_vfio_pci_adev *mlx5_vfio_pci_find_adev(struct pci_dev *pdev)
+{
+   struct mlx5_vfio_pci_adev *mvadev, *found = NULL;
+
+   mutex_lock(&aux_devs_lock);
+   list_for_each_entry(mvadev, &aux_devs_list, entry) {
+   if (mvadev->madev.adev.id == pci_dev_id(pdev)) {
+   found = mvadev;
+   break;
+   }
+   }
+   mutex_unlock(&aux_devs_lock);
+
+   return found;
+}
+
+static int mlx5_vfio_pci_aux_probe(struct auxiliary_device *adev,
+   const struct auxiliary_device_id *id)
+{
+   struct mlx5_vfio_pci_adev *mvadev;
+
+   mvadev = adev_to_mvadev(adev);
+
+   pr_info("%s aux probing bdf %02x:%02x.%d mdev is %s\n",
+   adev->name,
+   PCI_BUS_NUM(adev->id & 0x),
+   PCI_SLOT(adev->id & 0xff),
+   PCI_FUNC(adev->id & 0xff), 
dev_name(mvadev->madev.mdev->device));
+
+   mutex_lock(&aux_devs_lock);
+   list_add(&mvadev->entry, &aux_devs_list);
+   mutex_unlock(&aux_devs_lock);
+
+   return 0;
+}
+
+static void mlx5_vfi

[PATCH 3/9] vfio-pci-core: export vfio_pci_register_dev_region function

2021-02-01 Thread Max Gurtovoy
This function will be used to allow vendor drivers to register regions
to be used and accessed by the core subsystem driver. This way, the core
will use the region ops that are vendor specific and managed by the
vendor vfio-pci driver.

A next step would be to move the igd and nvlink logic to dedicated
modules instead of managing their functionality in the core
driver.

Signed-off-by: Max Gurtovoy 
---
 drivers/vfio/pci/vfio_pci_core.c| 12 +-
 drivers/vfio/pci/vfio_pci_core.h| 28 
 drivers/vfio/pci/vfio_pci_igd.c | 16 --
 drivers/vfio/pci/vfio_pci_nvlink2.c | 25 +++--
 drivers/vfio/pci/vfio_pci_private.h | 34 +
 5 files changed, 64 insertions(+), 51 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index d5bf01132c23..a0a91331f575 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -395,7 +395,7 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
vdev->virq_disabled = false;
 
for (i = 0; i < vdev->num_regions; i++)
-   vdev->region[i].ops->release(vdev, &vdev->region[i]);
+   vdev->region[i].ops->release(&vdev->vpdev, &vdev->region[i]);
 
vdev->num_regions = 0;
kfree(vdev->region);
@@ -716,11 +716,12 @@ static int msix_mmappable_cap(struct vfio_pci_device 
*vdev,
	return vfio_info_add_capability(caps, &header, sizeof(header));
 }
 
-int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
+int vfio_pci_register_dev_region(struct vfio_pci_core_device *vpdev,
 unsigned int type, unsigned int subtype,
 const struct vfio_pci_regops *ops,
 size_t size, u32 flags, void *data)
 {
+   struct vfio_pci_device *vdev = vpdev_to_vdev(vpdev);
struct vfio_pci_region *region;
 
region = krealloc(vdev->region,
@@ -741,6 +742,7 @@ int vfio_pci_register_dev_region(struct vfio_pci_device 
*vdev,
 
return 0;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_register_dev_region);
 
 struct vfio_devices {
struct vfio_device **devices;
@@ -928,7 +930,7 @@ static long vfio_pci_core_ioctl(void *device_data, unsigned 
int cmd,
return ret;
 
if (vdev->region[i].ops->add_capability) {
-   ret = vdev->region[i].ops->add_capability(vdev,
+   ret = vdev->region[i].ops->add_capability(&vdev->vpdev,
    &vdev->region[i], &caps);
if (ret)
return ret;
@@ -1379,7 +1381,7 @@ static ssize_t vfio_pci_rw(struct vfio_pci_device *vdev, 
char __user *buf,
return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
default:
index -= VFIO_PCI_NUM_REGIONS;
-   return vdev->region[index].ops->rw(vdev, buf,
+   return vdev->region[index].ops->rw(&vdev->vpdev, buf,
   count, ppos, iswrite);
}
 
@@ -1622,7 +1624,7 @@ static int vfio_pci_core_mmap(void *device_data, struct 
vm_area_struct *vma)
 
if (region && region->ops && region->ops->mmap &&
(region->flags & VFIO_REGION_INFO_FLAG_MMAP))
-   return region->ops->mmap(vdev, region, vma);
+   return region->ops->mmap(&vdev->vpdev, region, vma);
return -EINVAL;
}
if (index >= VFIO_PCI_ROM_REGION_INDEX)
diff --git a/drivers/vfio/pci/vfio_pci_core.h b/drivers/vfio/pci/vfio_pci_core.h
index 9833935af735..0b227ee3f377 100644
--- a/drivers/vfio/pci/vfio_pci_core.h
+++ b/drivers/vfio/pci/vfio_pci_core.h
@@ -15,6 +15,7 @@
 #define VFIO_PCI_CORE_H
 
 struct vfio_pci_device_ops;
+struct vfio_pci_region;
 
 struct vfio_pci_core_device {
struct pci_dev  *pdev;
@@ -22,6 +23,29 @@ struct vfio_pci_core_device {
void*dd_data;
 };
 
+struct vfio_pci_regops {
+   size_t  (*rw)(struct vfio_pci_core_device *vpdev, char __user *buf,
+ size_t count, loff_t *ppos, bool iswrite);
+   void(*release)(struct vfio_pci_core_device *vpdev,
+  struct vfio_pci_region *region);
+   int (*mmap)(struct vfio_pci_core_device *vpdev,
+   struct vfio_pci_region *region,
+   struct vm_area_struct *vma);
+   int (*add_capability)(struct vfio_pci_core_device *vpdev,
+ struct vfio_pci_region *region,
+ struct vfio_info_cap *caps);
+};
+
+struct vfio_pci_region {
+   u32  

[PATCH 2/9] vfio-pci: introduce vfio_pci_core subsystem driver

2021-02-01 Thread Max Gurtovoy
Split the vfio_pci driver into two parts: the 'struct pci_driver'
(vfio_pci) and a library of code (vfio_pci_core) that helps create a
VFIO device on top of a PCI device.

As before vfio_pci.ko continues to present the same interface under
sysfs and this change should have no functional impact.

vfio_pci_core exposes an interface that is similar to a typical
Linux subsystem, in that a pci_driver doing probe() can setup a number
of details and then create the VFIO char device.

Allowing another module to provide the pci_driver allows that module
to customize how VFIO is setup, inject its own operations, and easily
extend vendor specific functionality.

This is complementary to how VFIO's mediated devices work. Instead of
custom device lifecycle management and a special bus, drivers using
this approach rely on the normal driver core lifecycle (e.g.
bind/unbind) management, and this is optimized to effectively support
customization that makes only small modifications to what vfio_pci
would do normally.

This approach is also a pluggable alternative to the hard-wired
CONFIG_VFIO_PCI_IGD and CONFIG_VFIO_PCI_NVLINK2 "drivers" that are
built into vfio-pci. Using this work, all of that code can be moved to
dedicated device-specific modules and cleanly split out of the
generic vfio_pci driver.

Below is an example for adding new driver to vfio pci subsystem:
+--+
|  |
|   VFIO   |
|  |
+--+

+--+
|  |
|   VFIO_PCI_CORE  |
|  |
+--+

++ +---+
|| |   |
|  VFIO_PCI  | | MLX5_VFIO_PCI |
|| |   |
++ +---+

In this way mlx5_vfio_pci will use vfio_pci_core to register to vfio
subsystem and also use the generic PCI functionality exported from it.
Additionally it will add the needed vendor specific logic for HW
specific features such as Live Migration.
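The split relies on the usual kernel composition idiom: a vendor driver
embeds struct vfio_pci_core_device in its own state and recovers the
outer structure with container_of() (as the vpdev_to_vdev() helper in the
follow-on patch does). A userspace sketch of that idiom, with hypothetical
struct names standing in for the real ones:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace copy of the kernel's container_of() */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for struct vfio_pci_core_device */
struct core_device {
	int num_regions;
};

/* A vendor driver embeds the core device in its own state */
struct vendor_device {
	int vendor_flags;
	struct core_device core;
};

/* Recover the vendor structure from the core pointer the ops receive */
static struct vendor_device *to_vendor(struct core_device *core)
{
	return container_of(core, struct vendor_device, core);
}
```

This is why the core can hand callbacks a struct vfio_pci_core_device *
while each vendor driver still reaches its own private data without any
explicit back-pointer.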

Signed-off-by: Max Gurtovoy 
---
 drivers/vfio/pci/Kconfig|  24 +-
 drivers/vfio/pci/Makefile   |  13 +-
 drivers/vfio/pci/vfio_pci.c | 202 ++
 drivers/vfio/pci/vfio_pci_config.c  |  56 ++--
 drivers/vfio/pci/vfio_pci_core.c| 392 ++--
 drivers/vfio/pci/vfio_pci_core.h|  45 
 drivers/vfio/pci/vfio_pci_igd.c |   2 +-
 drivers/vfio/pci/vfio_pci_intrs.c   |  22 +-
 drivers/vfio/pci/vfio_pci_nvlink2.c |  24 +-
 drivers/vfio/pci/vfio_pci_private.h |   4 +-
 drivers/vfio/pci/vfio_pci_rdwr.c|  14 +-
 drivers/vfio/pci/vfio_pci_zdev.c|   2 +-
 12 files changed, 471 insertions(+), 329 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci.c
 create mode 100644 drivers/vfio/pci/vfio_pci_core.h

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 40a223381ab6..b958a48f63a0 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
-config VFIO_PCI
-   tristate "VFIO support for PCI devices"
+config VFIO_PCI_CORE
+   tristate "VFIO core support for PCI devices"
depends on VFIO && PCI && EVENTFD
select VFIO_VIRQFD
select IRQ_BYPASS_MANAGER
@@ -10,9 +10,17 @@ config VFIO_PCI
 
  If you don't know what to do here, say N.
 
+config VFIO_PCI
+   tristate "VFIO support for PCI devices"
+   depends on VFIO_PCI_CORE
+   help
+ This provides a generic PCI support using the VFIO framework.
+
+ If you don't know what to do here, say N.
+
 config VFIO_PCI_VGA
bool "VFIO PCI support for VGA devices"
-   depends on VFIO_PCI && X86 && VGA_ARB
+   depends on VFIO_PCI_CORE && X86 && VGA_ARB
help
  Support for VGA extension to VFIO PCI.  This exposes an additional
  region on VGA devices for accessing legacy VGA addresses used by
@@ -21,16 +29,16 @@ config VFIO_PCI_VGA
  If you don't know what to do here, say N.
 
 config VFIO_PCI_MMAP
-   depends on VFIO_PCI
+   depends on VFIO_PCI_CORE
def_bool y if !S390
 
 config VFIO_PCI_INTX
-   depends on VFIO_PCI
+   depends on VFIO_PCI_CORE
def_bool y if !S390
 
 config VFIO_PCI_IGD
bool "VFIO PCI extensions for Intel graphics (GVT-d)"
-   depends on VFIO_PCI && X86
+   depends on VFIO_PCI_CORE && X86
default y
help
  Support for Intel IGD specific extensions to enable direct
@@ -42,13 +50,13 @@ config VFIO_PCI_IGD
 
 config VFIO_PCI_NVLINK

[PATCH 1/9] vfio-pci: rename vfio_pci.c to vfio_pci_core.c

2021-02-01 Thread Max Gurtovoy
This is a preparation patch for separating the vfio_pci driver to a
subsystem driver and a generic pci driver. This patch doesn't change any
logic.

Signed-off-by: Max Gurtovoy 
---
 drivers/vfio/pci/Makefile| 2 +-
 drivers/vfio/pci/{vfio_pci.c => vfio_pci_core.c} | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename drivers/vfio/pci/{vfio_pci.c => vfio_pci_core.c} (100%)

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 781e0809d6ee..dd350b9b 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
-vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o 
vfio_pci_config.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 vfio-pci-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci_core.c
similarity index 100%
rename from drivers/vfio/pci/vfio_pci.c
rename to drivers/vfio/pci/vfio_pci_core.c
-- 
2.25.4



[PATCH v2 0/9] Introduce vfio-pci-core subsystem

2021-02-01 Thread Max Gurtovoy
io-pci zdev code

TODOs:
1. Create subsystem module for VFIO_MDEV. This can be used for vendor
   specific scalable functions for example (SFs).
2. Add Live migration functionality for mlx5 SNAP devices
   (NVMe/Virtio-BLK).
3. Add Live migration functionality for mlx5 VFs
4. Add the needed functionality for mlx5_core
5. Improve userspace and discovery
6. move VGA stuff to x86 extension

I would like to thank the great team that was involved in this
development, design and internal review:
Oren, Liran, Jason, Leon, Aviad, Shahaf, Gary, Artem, Kirti, Neo, Andy
and others.

This series applies cleanly on top of kernel 5.11-rc5+ commit 13391c60da33:
"Merge branch 'linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6"
from Linus.

Note: Live migration for MLX5 SNAP devices is WIP and can be the first
  example for adding vendor extension to vfio-pci devices. As the
  changes to the subsystem must be defined as a pre-condition for
  this work, we've decided to split the submission for now.

Max Gurtovoy (9):
  vfio-pci: rename vfio_pci.c to vfio_pci_core.c
  vfio-pci: introduce vfio_pci_core subsystem driver
  vfio-pci-core: export vfio_pci_register_dev_region function
  mlx5-vfio-pci: add new vfio_pci driver for mlx5 devices
  vfio-pci/zdev: remove unused vdev argument
  vfio-pci/zdev: fix possible segmentation fault issue
  vfio/pci: use s390 naming instead of zdev
  vfio/pci: use x86 naming instead of igd
  vfio/pci: use powernv naming instead of nvlink2

 drivers/vfio/pci/Kconfig  |   57 +-
 drivers/vfio/pci/Makefile |   16 +-
 drivers/vfio/pci/mlx5_vfio_pci.c  |  253 ++
 drivers/vfio/pci/vfio_pci.c   | 2384 +
 drivers/vfio/pci/vfio_pci_config.c|   56 +-
 drivers/vfio/pci/vfio_pci_core.c  | 2326 
 drivers/vfio/pci/vfio_pci_core.h  |   73 +
 drivers/vfio/pci/vfio_pci_intrs.c |   22 +-
 ...{vfio_pci_nvlink2.c => vfio_pci_powernv.c} |   47 +-
 drivers/vfio/pci/vfio_pci_private.h   |   44 +-
 drivers/vfio/pci/vfio_pci_rdwr.c  |   14 +-
 .../pci/{vfio_pci_zdev.c => vfio_pci_s390.c}  |   28 +-
 .../pci/{vfio_pci_igd.c => vfio_pci_x86.c}|   18 +-
 include/linux/mlx5/vfio_pci.h |   36 +
 14 files changed, 2916 insertions(+), 2458 deletions(-)
 create mode 100644 drivers/vfio/pci/mlx5_vfio_pci.c
 create mode 100644 drivers/vfio/pci/vfio_pci_core.c
 create mode 100644 drivers/vfio/pci/vfio_pci_core.h
 rename drivers/vfio/pci/{vfio_pci_nvlink2.c => vfio_pci_powernv.c} (89%)
 rename drivers/vfio/pci/{vfio_pci_zdev.c => vfio_pci_s390.c} (80%)
 rename drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} (89%)
 create mode 100644 include/linux/mlx5/vfio_pci.h

-- 
2.25.4



Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

2021-02-01 Thread Max Gurtovoy



On 2/1/2021 6:32 AM, Alex Williamson wrote:

On Sun, 31 Jan 2021 20:46:40 +0200
Max Gurtovoy  wrote:


On 1/28/2021 11:02 PM, Alex Williamson wrote:

On Thu, 28 Jan 2021 17:29:30 +0100
Cornelia Huck  wrote:
  

On Tue, 26 Jan 2021 15:27:43 +0200
Max Gurtovoy  wrote:

On 1/26/2021 5:34 AM, Alex Williamson wrote:

On Mon, 25 Jan 2021 20:45:22 -0400
Jason Gunthorpe  wrote:
   

On Mon, Jan 25, 2021 at 04:31:51PM -0700, Alex Williamson wrote:

extensions potentially break vendor drivers, etc.  We're only even hand
waving that existing device specific support could be farmed out to new
device specific drivers without even going to the effort to prove that.

This is an RFC, not a complete patch series. The RFC is to get feedback
on the general design before everyone commits a lot of resources and
positions get dug in.

Do you really think the existing device specific support would be a
problem to lift? It already looks pretty clean with the
vfio_pci_regops, looks easy enough to lift to the parent.
   

So far the TODOs rather mask the dirty little secrets of the
extension rather than showing how a vendor derived driver needs to
root around in struct vfio_pci_device to do something useful, so
probably porting actual device specific support rather than further
hand waving would be more helpful.

It would be helpful to get actual feedback on the high level design -
something like this was already tried in May and didn't go anywhere -
are you surprised that we are reluctant to commit a lot of resources
doing a complete job just to have it go nowhere again?

That's not really what I'm getting from your feedback, indicating
vfio-pci is essentially done, the mlx stub driver should be enough to
see the direction, and additional concerns can be handled with TODO
comments.  Sorry if this is not construed as actual feedback, I think
both Connie and I are making an effort to understand this and being
hampered by lack of a clear api or a vendor driver that's anything more
than vfio-pci plus an aux bus interface.  Thanks,

I think I got the main idea and I'll try to summarize it:

The separation into vfio-pci.ko and vfio-pci-core.ko is acceptable, and we
do need it to be able to create a vendor-vfio-pci.ko driver in the future
that includes the vendor's special sauce.

One other thing I'd like to bring up: What needs to be done in
userspace? Does a userspace driver like QEMU need changes to actually
exploit this? Does management software like libvirt need to be involved
in decision making, or does it just need to provide the knobs to make
the driver configurable?

I'm still pretty nervous about the userspace aspect of this as well.
QEMU and other actual vfio drivers are probably the least affected,
at least for QEMU, it'll happily open any device that has a pointer to
an IOMMU group that's reflected as a vfio group device.  Tools like
libvirt, on the other hand, actually do driver binding and we need to
consider how they make driver decisions. Jason suggested that the
vfio-pci driver ought to be only spec compliant behavior, which sounds
like some deprecation process of splitting out the IGD, NVLink, zpci,
etc. features into sub-drivers and eventually removing that device
specific support from vfio-pci.  Would we expect libvirt to know, "this
is an 8086 graphics device, try to bind it to vfio-pci-igd" or "uname
-m says we're running on s390, try to bind it to vfio-zpci"?  Maybe we
expect derived drivers to only bind to devices they recognize, so
libvirt could blindly try a whole chain of drivers, ending in vfio-pci.
Obviously if we have competing drivers that support the same device in
different ways, that quickly falls apart.

I think we can leave common arch specific stuff, such as s390 (IIUC) in
the core driver. And only create vfio_pci drivers for
vendor/device/subvendor specific stuff.

So on one hand you're telling us that the design principles here can be
applied to various other device/platform specific support, but on the
other you're saying, but don't do that...


I guess I was looking at nvlink2 as device specific.

But let's update the nvlink2, s390 and IGD a bit:

1. s390 - rename config VFIO_PCI_ZDEV to config VFIO_PCI_S390 (it will
include all needed tweaks for s390)


2. nvlink2 - rename config VFIO_PCI_NVLINK2 to config VFIO_PCI_P9 (it
will include all needed tweaks for P9)


3. igd - rename config VFIO_PCI_IGD to config VFIO_PCI_X86 (it will
include all needed tweaks for x86)


All three stay in vfio-pci-core.ko, since we might need the s390 stuff
if we plug in a network adapter from vendor A or an NVMe adapter from
vendor B, for example. This is platform specific and we don't want to
duplicate it in each vendor driver.


Same for P9 (and nvlink2 is only a special case in there) and X86.

  

Also, the competing drivers issue can happen today, right? After adding
a new_id to vfio_pci, I don't know how Linux will behave if we plug a
new device with the same id into the system. Which driver will probe it?

Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

2021-01-31 Thread Max Gurtovoy



On 1/28/2021 11:02 PM, Alex Williamson wrote:

On Thu, 28 Jan 2021 17:29:30 +0100
Cornelia Huck  wrote:


On Tue, 26 Jan 2021 15:27:43 +0200
Max Gurtovoy  wrote:

On 1/26/2021 5:34 AM, Alex Williamson wrote:

On Mon, 25 Jan 2021 20:45:22 -0400
Jason Gunthorpe  wrote:


On Mon, Jan 25, 2021 at 04:31:51PM -0700, Alex Williamson wrote:

extensions potentially break vendor drivers, etc.  We're only even hand
waving that existing device specific support could be farmed out to new
device specific drivers without even going to the effort to prove that.

This is an RFC, not a complete patch series. The RFC is to get feedback
on the general design before everyone commits a lot of resources and
positions get dug in.

Do you really think the existing device specific support would be a
problem to lift? It already looks pretty clean with the
vfio_pci_regops, looks easy enough to lift to the parent.


So far the TODOs rather mask the dirty little secrets of the
extension rather than showing how a vendor derived driver needs to
root around in struct vfio_pci_device to do something useful, so
probably porting actual device specific support rather than further
hand waving would be more helpful.

It would be helpful to get actual feedback on the high level design -
something like this was already tried in May and didn't go anywhere -
are you surprised that we are reluctant to commit a lot of resources
doing a complete job just to have it go nowhere again?

That's not really what I'm getting from your feedback, indicating
vfio-pci is essentially done, the mlx stub driver should be enough to
see the direction, and additional concerns can be handled with TODO
comments.  Sorry if this is not construed as actual feedback, I think
both Connie and I are making an effort to understand this and being
hampered by lack of a clear api or a vendor driver that's anything more
than vfio-pci plus an aux bus interface.  Thanks,

I think I got the main idea and I'll try to summarize it:

The separation into vfio-pci.ko and vfio-pci-core.ko is acceptable, and we
do need it to be able to create a vendor-vfio-pci.ko driver in the future
that includes the vendor's special sauce.

One other thing I'd like to bring up: What needs to be done in
userspace? Does a userspace driver like QEMU need changes to actually
exploit this? Does management software like libvirt need to be involved
in decision making, or does it just need to provide the knobs to make
the driver configurable?

I'm still pretty nervous about the userspace aspect of this as well.
QEMU and other actual vfio drivers are probably the least affected,
at least for QEMU, it'll happily open any device that has a pointer to
an IOMMU group that's reflected as a vfio group device.  Tools like
libvirt, on the other hand, actually do driver binding and we need to
consider how they make driver decisions. Jason suggested that the
vfio-pci driver ought to be only spec compliant behavior, which sounds
like some deprecation process of splitting out the IGD, NVLink, zpci,
etc. features into sub-drivers and eventually removing that device
specific support from vfio-pci.  Would we expect libvirt to know, "this
is an 8086 graphics device, try to bind it to vfio-pci-igd" or "uname
-m says we're running on s390, try to bind it to vfio-zpci"?  Maybe we
expect derived drivers to only bind to devices they recognize, so
libvirt could blindly try a whole chain of drivers, ending in vfio-pci.
Obviously if we have competing drivers that support the same device in
different ways, that quickly falls apart.


I think we can leave common arch specific stuff, such as s390 (IIUC) in 
the core driver. And only create vfio_pci drivers for 
vendor/device/subvendor specific stuff.


Also, the competing drivers issue can happen today, right? After adding
a new_id to vfio_pci, I don't know how Linux will behave if we plug a
new device with the same id into the system. Which driver will probe it?


I'm not really afraid of competing drivers, since we can ask vendor
vfio pci_drivers to add vendor_id, device_id, subsystem_vendor and
subsystem_device so we won't have this problem. I don't think there
will be two drivers that drive the same device with these four ids.


A userspace tool can keep a map of ids to drivers and bind the device to
the right vendor vfio-pci driver if one exists; if not, bind to vfio_pci.ko.
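The id-to-driver map suggested above - match on all four ids, fall back to
the generic driver - could be sketched like this. The table entries and
driver names here are made up for illustration; real ids would come from
the vendor vfio-pci drivers themselves:

```c
#include <assert.h>
#include <string.h>

struct id_map {
	unsigned short vendor, device, subvendor, subdevice;
	const char *driver;
};

/* Hypothetical table; not real device ids */
static const struct id_map map[] = {
	{ 0x15b3, 0x101e, 0x15b3, 0x0001, "mlx5-vfio-pci" },
};

/* Pick a vendor driver on a full four-id match, else fall back */
static const char *pick_driver(unsigned short vendor, unsigned short device,
			       unsigned short subvendor,
			       unsigned short subdevice)
{
	size_t i;

	for (i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
		if (map[i].vendor == vendor && map[i].device == device &&
		    map[i].subvendor == subvendor &&
		    map[i].subdevice == subdevice)
			return map[i].driver;
	}
	return "vfio-pci";	/* the spec-compliant generic driver */
}
```

Requiring all four ids to match is what makes the "two drivers claim the
same device" scenario unlikely, as argued above.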




Libvirt could also expand its available driver models for the user to
specify a variant, I'd support that for overriding a choice that libvirt
might make otherwise, but forcing the user to know this information is
just passing the buck.


We can add code to libvirt as mentioned above.



Some derived drivers could probably actually include device IDs rather
than only relying on dynamic ids, but then we get into the problem that
we're competing with native host driver for a device.  The aux bus
example here is essentially the least troublesome variation since it
works in conjunction 

Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

2021-01-31 Thread Max Gurtovoy



On 1/28/2021 6:29 PM, Cornelia Huck wrote:

On Tue, 26 Jan 2021 15:27:43 +0200
Max Gurtovoy  wrote:


Hi Alex, Cornelia and Jason,

thanks for the reviewing this.

On 1/26/2021 5:34 AM, Alex Williamson wrote:

On Mon, 25 Jan 2021 20:45:22 -0400
Jason Gunthorpe  wrote:
  

On Mon, Jan 25, 2021 at 04:31:51PM -0700, Alex Williamson wrote:
  

We're supposed to be enlightened by a vendor driver that does nothing
more than pass the opaque device_data through to the core functions,
but in reality this is exactly the point of concern above.  At a
minimum that vendor driver needs to look at the vdev to get the
pdev,

The end driver already havs the pdev, the RFC doesn't go enough into
those bits, it is a good comment.

The dd_data pased to the vfio_create_pci_device() will be retrieved
from the ops to get back to the end drivers data. This can cleanly
include everything: the VF pci_device, PF pci_device, mlx5_core
pointer, vfio_device and vfio_pci_device.

This is why the example passes in the mvadev:

+   vdev = vfio_create_pci_device(pdev, &mlx5_vfio_pci_ops, mvadev);

The mvadev has the PF, VF, and mlx5 core driver pointer.

Getting that back out during the ops is enough to do what the mlx5
driver needs to do, which is relay migration related IOCTLs to the PF
function via the mlx5_core driver so the device can execute them on
behalf of the VF.
  

but then what else does it look at, consume, or modify.  Now we have
vendor drivers misusing the core because it's not clear which fields
are private and how public fields can be used safely,

The kernel has never followed rigid rules for data isolation, it is
normal to have whole private structs exposed in headers so that
container_of can be used to properly compose data structures.

I reject this assertion, there are plenty of structs that clearly
indicate private vs public data or, as we've done in mdev, clearly
marking the private data in a "private" header and provide access
functions for public fields.  Including a "private" header to make use
of a "library" is just wrong.  In the example above, there's no way for
the mlx vendor driver to get back dd_data without extracting it from
struct vfio_pci_device itself.

I'll create a better separation between private/public fields according
to my understanding for the V2.

I'll just mention that beyond this separation, future improvements will
be needed and can be done incrementally.

I don't think we should do so many changes in one shot. The
incremental process is safer from the subsystem's point of view.

I also think that upstreaming mlx5_vfio_pci.ko and upstreaming the vfio-pci
separation into two modules don't have to happen in one shot.

The design can probably benefit from tossing a non-mlx5 driver into the
mix.

So, let me suggest the zdev support for that experiment (see
e6b817d4b8217a9528fcfd59719b924ab8a5ff23 and friends.) It is quite
straightforward: it injects some capabilities into the info ioctl. A
specialized driver would basically only need to provide special
handling for the ioctl callback and just forward anything else. It also
would not need any matching for device ids etc., as it would only make
sense on s390x, but regardless of the device. It could, however, help
us to get an idea what a good split would look like.


AFAIU, s390x is related to the IBM architecture and not to a specific PCI
device. So I guess it should stay in the core, as many PCI devices will
need these capabilities on IBM systems.


I think I'll use NVLINK2 P9 stuff as an example of the split and add it 
to vfio_pci.ko instead of vfio_pci_core.ko as a first step.


later we can create a dedicated module for these devices (V100 GPUs).


But again, to make our point in this RFC, I'll improve it for V2.

  

Look at struct device, for instance. Most of that is private to the
driver core.

A few 'private to vfio-pci-core' comments would help, it is good
feedback to make that more clear.
  

extensions potentially break vendor drivers, etc.  We're only even hand
waving that existing device specific support could be farmed out to new
device specific drivers without even going to the effort to prove that.

This is an RFC, not a complete patch series. The RFC is to get feedback
on the general design before everyone commits a lot of resources and
positions get dug in.

Do you really think the existing device specific support would be a
problem to lift? It already looks pretty clean with the
vfio_pci_regops, looks easy enough to lift to the parent.
  

So far the TODOs mask the dirty little secrets of the
extension rather than showing how a vendor derived driver needs to
root around in struct vfio_pci_device to do something useful, so
probably porting actual device specific support rather than further
hand waving would be more helpful.

It would be helpful to get actual feedback on the high level design -
something like this was already tried in May and didn't go anywhere -
are you surprised that we are reluctant to commit a lot of resources
doing a complete job just to have it go nowhere again?

Re: [PATCH RFC v2 08/10] vdpa: add vdpa simulator for block device

2021-01-31 Thread Max Gurtovoy



On 1/28/2021 4:41 PM, Stefano Garzarella wrote:

From: Max Gurtovoy 

This will allow running vDPA for virtio block protocol.

Signed-off-by: Max Gurtovoy 
[sgarzare: various cleanups/fixes]
Signed-off-by: Stefano Garzarella 
---
v2:
- rebased on top of other changes (dev_attr, get_config(), notify(), etc.)
- memset to 0 the config structure in vdpasim_blk_get_config()
- used vdpasim pointer in vdpasim_blk_get_config()

v1:
- Removed unused headers
- Used cpu_to_vdpasim*() to store config fields
- Replaced 'select VDPA_SIM' with 'depends on VDPA_SIM' since selected
   option can not depend on other [Jason]
- Start with a single queue for now [Jason]
- Add comments to memory barriers
---
  drivers/vdpa/vdpa_sim/vdpa_sim_blk.c | 145 +++
  drivers/vdpa/Kconfig |   7 ++
  drivers/vdpa/vdpa_sim/Makefile   |   1 +
  3 files changed, 153 insertions(+)
  create mode 100644 drivers/vdpa/vdpa_sim/vdpa_sim_blk.c

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim_blk.c b/drivers/vdpa/vdpa_sim/vdpa_sim_blk.c
new file mode 100644
index ..999f9ca0b628
--- /dev/null
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim_blk.c
@@ -0,0 +1,145 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VDPA simulator for block device.
+ *
+ * Copyright (c) 2020, Mellanox Technologies. All rights reserved.


I guess we can change the copyright from Mellanox to:

Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.

Thanks.



Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

2021-01-26 Thread Max Gurtovoy

Hi Alex, Cornelia and Jason,

thanks for reviewing this.

On 1/26/2021 5:34 AM, Alex Williamson wrote:

On Mon, 25 Jan 2021 20:45:22 -0400
Jason Gunthorpe  wrote:


On Mon, Jan 25, 2021 at 04:31:51PM -0700, Alex Williamson wrote:


We're supposed to be enlightened by a vendor driver that does nothing
more than pass the opaque device_data through to the core functions,
but in reality this is exactly the point of concern above.  At a
minimum that vendor driver needs to look at the vdev to get the
pdev,

The end driver already has the pdev; the RFC doesn't go into those bits
enough, it is a good comment.

The dd_data passed to the vfio_create_pci_device() will be retrieved
from the ops to get back to the end drivers data. This can cleanly
include everything: the VF pci_device, PF pci_device, mlx5_core
pointer, vfio_device and vfio_pci_device.

This is why the example passes in the mvadev:

+   vdev = vfio_create_pci_device(pdev, &mlx5_vfio_pci_ops, mvadev);

The mvadev has the PF, VF, and mlx5 core driver pointer.

Getting that back out during the ops is enough to do what the mlx5
driver needs to do, which is relay migration related IOCTLs to the PF
function via the mlx5_core driver so the device can execute them on
behalf of the VF.


but then what else does it look at, consume, or modify.  Now we have
vendor drivers misusing the core because it's not clear which fields
are private and how public fields can be used safely,

The kernel has never followed rigid rules for data isolation, it is
normal to have whole private structs exposed in headers so that
container_of can be used to properly compose data structures.

I reject this assertion, there are plenty of structs that clearly
indicate private vs public data or, as we've done in mdev, clearly
marking the private data in a "private" header and provide access
functions for public fields.  Including a "private" header to make use
of a "library" is just wrong.  In the example above, there's no way for
the mlx vendor driver to get back dd_data without extracting it from
struct vfio_pci_device itself.


I'll create a better separation between private/public fields according 
to my understanding for the V2.


I'll just mention that beyond this separation, future improvements will 
be needed and can be done incrementally.


I don't think that we should do so many changes in one shot. The 
incremental process is safer from the subsystem's point of view.


I also think that upstreaming mlx5_vfio_pci.ko and upstreaming the vfio-pci 
separation into 2 modules doesn't have to happen in one shot.


But again, to make our point in this RFC, I'll improve it for V2.




Look at struct device, for instance. Most of that is private to the
driver core.

A few 'private to vfio-pci-core' comments would help, it is good
feedback to make that more clear.
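[Editor's note: the container_of composition mentioned above can be
sketched in a few lines of plain C. The struct and function names below
are illustrative only, not the actual vfio-pci-core layout; the point is
how a vendor driver recovers its private wrapper from a core pointer
without the core exposing any private fields.]

```c
#include <stddef.h>

/* Minimal container_of, as used throughout the kernel. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Public core state, embeddable by vendor drivers (illustrative names). */
struct vfio_pci_core_device {
    int irq_type;              /* public: safe for drivers to read */
};

/* Vendor driver wraps the core struct with its private data. */
struct mlx5_vfio_pci_device {
    struct vfio_pci_core_device core;
    void *migration_state;     /* vendor-private, never seen by the core */
};

/* Callbacks receive a core pointer; container_of recovers the wrapper. */
static struct mlx5_vfio_pci_device *
to_mlx5_vdev(struct vfio_pci_core_device *core)
{
    return container_of(core, struct mlx5_vfio_pci_device, core);
}
```

This is the pattern that lets the core keep its fields private while the
vendor driver still gets back its own state in every callback.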


extensions potentially break vendor drivers, etc.  We're only even hand
waving that existing device specific support could be farmed out to new
device specific drivers without even going to the effort to prove that.

This is an RFC, not a complete patch series. The RFC is to get feedback
on the general design before everyone commits a lot of resources and
positions get dug in.

Do you really think the existing device specific support would be a
problem to lift? It already looks pretty clean with the
vfio_pci_regops, looks easy enough to lift to the parent.


So far the TODOs mask the dirty little secrets of the
extension rather than showing how a vendor derived driver needs to
root around in struct vfio_pci_device to do something useful, so
probably porting actual device specific support rather than further
hand waving would be more helpful.

It would be helpful to get actual feedback on the high level design -
something like this was already tried in May and didn't go anywhere -
are you surprised that we are reluctant to commit a lot of resources
doing a complete job just to have it go nowhere again?

That's not really what I'm getting from your feedback, indicating
vfio-pci is essentially done, the mlx stub driver should be enough to
see the direction, and additional concerns can be handled with TODO
comments.  Sorry if this is not construed as actual feedback, I think
both Connie and I are making an effort to understand this and being
hampered by lack of a clear api or a vendor driver that's anything more
than vfio-pci plus an aux bus interface.  Thanks,


I think I got the main idea and I'll try to summarize it:

The separation into vfio-pci.ko and vfio-pci-core.ko is acceptable, and we 
do need it to be able to create a vendor-vfio-pci.ko driver in the future 
that includes the vendor's special sauce.


The implementation of the separation, the question of what is private and 
what is public, and the core APIs exposed to the various drivers should be 
improved or better demonstrated in the V2.


I'll work on improving it and I'll send the V2.


If you have some feedback of the things/fields/structs you think should 
remain private to 

[PATCH 2/3] vfio-pci: introduce vfio_pci_core subsystem driver

2021-01-17 Thread Max Gurtovoy
Split the vfio_pci driver into two parts, the 'struct pci_driver'
(vfio_pci) and a library of code (vfio_pci_core) that helps creating a
VFIO device on top of a PCI device.

As before vfio_pci.ko continues to present the same interface under
sysfs and this change should have no functional impact.

vfio_pci_core exposes an interface that is similar to a typical
Linux subsystem, in that a pci_driver doing probe() can setup a number
of details and then create the VFIO char device.

Allowing another module to provide the pci_driver allows that module
to customize how VFIO is setup, inject its own operations, and easily
extend vendor specific functionality.

This is complementary to how VFIO's mediated devices work. Instead of
custom device lifecycle management and a special bus, drivers using
this approach will rely on the normal driver core lifecycle (eg
bind/unbind) management, and this is optimized to effectively support
customization that only makes small modifications to what vfio_pci
would do normally.

This approach is also a pluggable alternative for the hard wired
CONFIG_VFIO_PCI_IGD and CONFIG_VFIO_PCI_NVLINK2 "drivers" that are
built into vfio-pci. Using this work all of that code can be moved to
a dedicated device-specific modules and cleanly split out of the
generic vfio_pci driver.

Below is an example for adding new driver to vfio pci subsystem:
+--+
|  |
|   VFIO   |
|  |
+--+

+--+
|  |
|   VFIO_PCI_CORE  |
|  |
+--+

++ +---+
|| |   |
|  VFIO_PCI  | | MLX5_VFIO_PCI |
|| |   |
++ +---+

In this way mlx5_vfio_pci will use vfio_pci_core to register to vfio
subsystem and also use the generic PCI functionality exported from it.
Additionally it will add the needed vendor specific logic for HW
specific features such as Live Migration.

Signed-off-by: Max Gurtovoy 
---
 drivers/vfio/pci/Kconfig|  12 +-
 drivers/vfio/pci/Makefile   |  13 +-
 drivers/vfio/pci/vfio_pci.c | 220 
 drivers/vfio/pci/vfio_pci_core.c| 255 +++-
 drivers/vfio/pci/vfio_pci_private.h |  21 +++
 5 files changed, 321 insertions(+), 200 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci.c

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 40a223381ab6..5f90be27fba0 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
-config VFIO_PCI
-   tristate "VFIO support for PCI devices"
+config VFIO_PCI_CORE
+   tristate "VFIO core support for PCI devices"
depends on VFIO && PCI && EVENTFD
select VFIO_VIRQFD
select IRQ_BYPASS_MANAGER
@@ -10,6 +10,14 @@ config VFIO_PCI
 
  If you don't know what to do here, say N.
 
+config VFIO_PCI
+   tristate "VFIO support for PCI devices"
+   depends on VFIO_PCI_CORE
+   help
+ This provides a generic PCI support using the VFIO framework.
+
+ If you don't know what to do here, say N.
+
 config VFIO_PCI_VGA
bool "VFIO PCI support for VGA devices"
depends on VFIO_PCI && X86 && VGA_ARB
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index dd350b9b..3f2a27e222cd 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,8 +1,11 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
-vfio-pci-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
-vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
-vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
-vfio-pci-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o
-
+obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+
+vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o
+
+vfio-pci-y := vfio_pci.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
new file mode 100644
index ..4e115a136930
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -0,0 +1,220 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ * Author: Alex Williamson 
+ *
+ * Deriv

[PATCH 3/3] mlx5-vfio-pci: add new vfio_pci driver for mlx5 devices

2021-01-17 Thread Max Gurtovoy
This driver will register to the PCI bus and the Auxiliary bus. In case
the probe of both devices succeeds, we'll have a vendor specific VFIO
PCI device. mlx5_vfio_pci uses vfio_pci_core to register and create a
VFIO device and uses auxiliary_device to get the needed extension from
the vendor device driver. If one of the probe() functions fails, the
VFIO char device will not be created. For now, only register and bind
the auxiliary_device to the pci_device in case the auxiliary_device id
matches the pci_device BDF. Later, vendor specific features such as live
migration will be added and will be available to the virtualization
software.

Note: Although we've created the mlx5-vfio-pci.ko, the binding to
vfio-pci.ko will still work as before. It's fully backward compatible.
Of course, the extended vendor functionality will not exist in case one
will bind the device to the generic vfio_pci.ko.

Signed-off-by: Max Gurtovoy 
---
 drivers/vfio/pci/Kconfig |  10 ++
 drivers/vfio/pci/Makefile|   3 +
 drivers/vfio/pci/mlx5_vfio_pci.c | 253 +++
 include/linux/mlx5/vfio_pci.h|  36 +
 4 files changed, 302 insertions(+)
 create mode 100644 drivers/vfio/pci/mlx5_vfio_pci.c
 create mode 100644 include/linux/mlx5/vfio_pci.h

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 5f90be27fba0..2133cd2f9c92 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -65,3 +65,13 @@ config VFIO_PCI_ZDEV
  for zPCI devices passed through via VFIO on s390.
 
  Say Y here.
+
+config MLX5_VFIO_PCI
+   tristate "VFIO support for MLX5 PCI devices"
+   depends on VFIO_PCI_CORE && MLX5_CORE
+   select AUXILIARY_BUS
+   help
+ This provides a generic PCI support for MLX5 devices using the VFIO
+ framework.
+
+ If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 3f2a27e222cd..9f67edca31c5 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -2,6 +2,7 @@
 
 obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
@@ -9,3 +10,5 @@ vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o
 
 vfio-pci-y := vfio_pci.o
+
+mlx5-vfio-pci-y := mlx5_vfio_pci.o
diff --git a/drivers/vfio/pci/mlx5_vfio_pci.c b/drivers/vfio/pci/mlx5_vfio_pci.c
new file mode 100644
index ..98cc2d037b0d
--- /dev/null
+++ b/drivers/vfio/pci/mlx5_vfio_pci.c
@@ -0,0 +1,253 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+ * Author: Max Gurtovoy 
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vfio_pci_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "Max Gurtovoy "
+#define DRIVER_DESC "MLX5 VFIO PCI - User Level meta-driver for NVIDIA MLX5 device family"
+
+static LIST_HEAD(aux_devs_list);
+static DEFINE_MUTEX(aux_devs_lock);
+
+static struct mlx5_vfio_pci_adev *mlx5_vfio_pci_find_adev(struct pci_dev *pdev)
+{
+   struct mlx5_vfio_pci_adev *mvadev, *found = NULL;
+
+   mutex_lock(&aux_devs_lock);
+   list_for_each_entry(mvadev, &aux_devs_list, entry) {
+   if (mvadev->madev.adev.id == pci_dev_id(pdev)) {
+   found = mvadev;
+   break;
+   }
+   }
+   mutex_unlock(&aux_devs_lock);
+
+   return found;
+}
+
+static int mlx5_vfio_pci_aux_probe(struct auxiliary_device *adev,
+   const struct auxiliary_device_id *id)
+{
+   struct mlx5_vfio_pci_adev *mvadev;
+
+   mvadev = adev_to_mvadev(adev);
+
+   pr_info("%s aux probing bdf %02x:%02x.%d mdev is %s\n",
+   adev->name,
+   PCI_BUS_NUM(adev->id & 0xffff),
+   PCI_SLOT(adev->id & 0xff),
+   PCI_FUNC(adev->id & 0xff),
+   dev_name(mvadev->madev.mdev->device));
+
+   mutex_lock(&aux_devs_lock);
+   list_add(&mvadev->entry, &aux_devs_list);
+   mutex_unlock(&aux_devs_lock);
+
+   return 0;
+}
+
+static void mlx5_vfio_pci_aux_remove(struct auxiliary_device *adev)
+{
+   struct mlx5_vfio_pci_adev *mvadev = adev_to_mvadev(adev);
+   struct vfio_pci_device *vdev = dev_get_drvdata(&adev->dev);
+
+   /* TODO: is this the right thing to do ? maybe FLR ? */
+   if (vdev)
+   pci_reset_function(vdev->pdev);
+
+   mutex_lock(&aux_devs_lock);
+   list

[PATCH 1/3] vfio-pci: rename vfio_pci.c to vfio_pci_core.c

2021-01-17 Thread Max Gurtovoy
This is a preparation patch for separating the vfio_pci driver to a
subsystem driver and a generic pci driver. This patch doesn't change any
logic.

Signed-off-by: Max Gurtovoy 
---
 drivers/vfio/pci/Makefile| 2 +-
 drivers/vfio/pci/{vfio_pci.c => vfio_pci_core.c} | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename drivers/vfio/pci/{vfio_pci.c => vfio_pci_core.c} (100%)

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 781e0809d6ee..dd350b9b 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
-vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 vfio-pci-$(CONFIG_VFIO_PCI_ZDEV) += vfio_pci_zdev.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci_core.c
similarity index 100%
rename from drivers/vfio/pci/vfio_pci.c
rename to drivers/vfio/pci/vfio_pci_core.c
-- 
2.25.4



[PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

2021-01-17 Thread Max Gurtovoy
Hi Alex and Cornelia,

This series splits the vfio_pci driver into 2 parts: a pci driver and a
subsystem driver that will also be a library of code. The pci driver,
vfio_pci.ko, will be used as before and it will bind to the subsystem
driver vfio_pci_core.ko to register to the VFIO subsystem. This patchset
is fully backward compatible. This is typical Linux subsystem
framework behaviour. This framework can also be adopted by vfio_mdev
devices as we'll see in the sketch below.

This series is coming to solve the issues that were raised in the
previous attempt for extending vfio-pci for vendor specific
functionality: https://lkml.org/lkml/2020/5/17/376 by Yan Zhao.

This solution is also deterministic in the sense that when a user binds
to a vendor specific vfio-pci driver, it will get all the special
goodies of the HW.
 
This subsystem framework will also make it easier to add vendor specific
functionality to VFIO devices in the future by allowing another module
to provide the pci_driver, which can set up a number of details before
registering to the VFIO subsystem (such as injecting its own operations).

Below we can see the proposed changes (this patchset only deals with
VFIO_PCI subsystem but it can easily be extended to VFIO_MDEV subsystem
as well):

+--+
|  |
|VFIO  |
|  |
+--+

++++
||||
|  VFIO_PCI_CORE ||  VFIO_MDEV_CORE|
||||
++++

+---+ +--++---+ +--+
|   | |  ||   | |  |
|   | |  ||   | |  |
| VFIO_PCI  | | MLX5_VFIO_PCI|| VFIO_MDEV | |MLX5_VFIO_MDEV|
|   | |  ||   | |  |
|   | |  ||   | |  |
+---+ +--++---+ +--+

First 2 patches introduce the above changes for vfio_pci and
vfio_pci_core.

Patch (3/3) introduces a new mlx5 vfio-pci module that registers to VFIO
subsystem using vfio_pci_core. It also registers to Auxiliary bus for
binding to mlx5_core that is the parent of mlx5-vfio-pci devices. This
will allow extending mlx5-vfio-pci devices with HW specific features
such as Live Migration (mlx5_core patches are not part of this series
that comes for proposing the changes need for the vfio pci subsystem).

These devices will be seen on the Auxiliary bus as:
mlx5_core.vfio_pci.2048 -> ../../../devices/pci:00/:00:02.0/:05:00.0/:06:00.0/:07:00.0/mlx5_core.vfio_pci.2048
mlx5_core.vfio_pci.2304 -> ../../../devices/pci:00/:00:02.0/:05:00.0/:06:00.0/:07:00.1/mlx5_core.vfio_pci.2304

2048 represents BDF 08:00.0 and 2304 represents BDF 09:00.0 in decimal
view. In this manner, the administrator will be able to locate the
correct vfio-pci module it should bind the desired BDF to (by finding
the pointer to the module according to the Auxiliary driver of that
BDF).
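[Editor's note: the decimal suffix is pci_dev_id(), which packs the
address as (bus << 8) | devfn, so the mapping above (2048 -> 08:00.0,
2304 -> 09:00.0) can be checked with a tiny sketch. The helper names
below are illustrative, mirroring the kernel's PCI_BUS_NUM, PCI_SLOT and
PCI_FUNC macros.]

```c
/* pci_dev_id() in the kernel packs a device's address as (bus << 8) | devfn,
 * and that value is reused as the auxiliary device id suffix above.
 * These helpers decode it back into the bus:slot.func form. */
static unsigned int aux_id_bus(unsigned int id)  { return (id >> 8) & 0xff; }
static unsigned int aux_id_slot(unsigned int id) { return (id & 0xff) >> 3; }
static unsigned int aux_id_func(unsigned int id) { return id & 0x07; }
```

With these, 2048 (0x800) decodes to bus 0x08, slot 0, function 0, i.e.
08:00.0, and 2304 (0x900) decodes to 09:00.0, matching the listing above.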

In this way, we'll use the HW vendor driver core to manage the lifecycle
of these devices. This is reasonable since only the vendor driver knows
exactly the status of its internal state and the capabilities of
its accelerators, for example.

TODOs:
1. For this RFC we still haven't cleaned all the vendor specific stuff that
   was merged in the past into vfio_pci (such as VFIO_PCI_IGD and
   VFIO_PCI_NVLINK2).
2. Create subsystem module for VFIO_MDEV. This can be used for vendor
   specific scalable functions for example (SFs).
3. Add Live migration functionality for mlx5 SNAP devices
   (NVMe/Virtio-BLK).
4. Add Live migration functionality for mlx5 VFs
5. Add the needed functionality for mlx5_core

I would like to thank the great team that was involved in this
development, design and internal review:
Oren, Liran, Jason, Leon, Aviad, Shahaf, Gary, Artem, Kirti, Neo, Andy
and others.

This series applies cleanly on top of kernel 5.11-rc2+ commit 2ff90100ace8:
"Merge tag 'hwmon-for-v5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging"
from Linus.
from Linus.

Note: Live migration for MLX5 SNAP devices is WIP and will be the first
  example for adding vendor extension to vfio-pci devices. As the
  changes to the subsystem must be defined as a pre-condition for
  this work, we've decided to split the submission for now.

Max Gurtovoy

Re: [PATCH v2 00/17] vdpa: generalize vdpa simulator

2020-11-26 Thread Max Gurtovoy



On 11/26/2020 4:49 PM, Stefano Garzarella wrote:

This series moves the network device simulator in a new module
(vdpa_sim_net) and leaves the generic functions in the vdpa_sim core
module, allowing the possibility to add new vDPA device simulators.

For now I removed the vdpa-blk simulator patches, since I'm still working
on them and debugging the iotlb issues.

Thanks to Max that started this work! I took his patches and extended a bit.

As Jason suggested, I simplified the "vdpa: split vdpasim to core and
net modules" patch, moving some changes out in small patches.
@Max: I put your Co-developed-by and Signed-off-by tags on these patches,
let me know if it is okay for you, or if there is a better way to give
credit to your work!


Stefano,

thanks for taking my initial series and bringing it to upstream level 
and thanks Jason for your reviews.


I'm ok with the tags and hopefully I'll be able to help a bit with the 
submission in a couple of weeks.


great progress !



v1: 
https://lists.linuxfoundation.org/pipermail/virtualization/2020-November/050677.html

v2:
   - moved most of the patches before the vdpa-core/net split [Jason]
   - removed unnecessary headers
   - removed 'default n' in Kconfig entries [Jason]
   - added VDPASIM_IOTLB_LIMIT macro [Jason]
   - set vringh notify callback [Jason]
   - used VIRTIO terminology for in_iov/out_iov [Stefan]
   - simplified "vdpa: split vdpasim to core and net modules" patch,
 moving some changes out in small patches
   - left batch_mapping module parameter in the core [Jason]

Max Gurtovoy (2):
   vdpa_sim: remove hard-coded virtq count
   vdpa: split vdpasim to core and net modules

Stefano Garzarella (15):
   vdpa: remove unnecessary 'default n' in Kconfig entries
   vdpa_sim: remove unnecessary headers inclusion
   vdpa_sim: remove the limit of IOTLB entries
   vdpa_sim: rename vdpasim_config_ops variables
   vdpa_sim: add struct vdpasim_dev_attr for device attributes
   vdpa_sim: add device id field in vdpasim_dev_attr
   vdpa_sim: add supported_features field in vdpasim_dev_attr
   vdpa_sim: add work_fn in vdpasim_dev_attr
   vdpa_sim: store parsed MAC address in a buffer
   vdpa_sim: make 'config' generic and usable for any device type
   vdpa_sim: add get_config callback in vdpasim_dev_attr
   vdpa_sim: set vringh notify callback
   vdpa_sim: use kvmalloc to allocate vdpasim->buffer
   vdpa_sim: make vdpasim->buffer size configurable
   vdpa_sim: split vdpasim_virtqueue's iov field in out_iov and in_iov




Re: [PATCH v3] nvme-rdma: handle nvme completion data length

2020-10-28 Thread Max Gurtovoy



On 10/25/2020 1:51 PM, zhenwei pi wrote:

Hit a kernel warning:
refcount_t: underflow; use-after-free.
WARNING: CPU: 0 PID: 0 at lib/refcount.c:28

RIP: 0010:refcount_warn_saturate+0xd9/0xe0
Call Trace:
  
  nvme_rdma_recv_done+0xf3/0x280 [nvme_rdma]
  __ib_process_cq+0x76/0x150 [ib_core]
  ...

The reason is that a zero-byte message is received from the target, and
the host side continues to process it without length checking; then the
previous CQE is processed twice.

Do sanity check on received data length, try to recovery for corrupted
CQE case.

Because a zero-byte message is not defined in the spec, using zero-byte
messages to detect dead connections at the transport layer is not
standard; currently we still treat it as illegal.

Thanks to Chao Leng & Sagi for suggestions.

Signed-off-by: zhenwei pi 
---
  drivers/nvme/host/rdma.c | 8 
  1 file changed, 8 insertions(+)
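[Editor's note: the fix described above boils down to rejecting any
receive completion shorter than an NVMe completion queue entry before
the payload is touched. The sketch below is schematic only, not the
exact patch, which lives in nvme_rdma_recv_done(); the constant and
function names are illustrative.]

```c
#include <stdbool.h>

/* An NVMe completion queue entry is 16 bytes.  Any RDMA receive
 * completion shorter than that (including the zero-byte message
 * described above) cannot carry a valid CQE and is rejected before
 * processing, so the previous CQE is never consumed twice. */
#define NVME_COMPLETION_SIZE 16u

static bool nvme_rdma_recv_len_ok(unsigned int byte_len)
{
    return byte_len >= NVME_COMPLETION_SIZE;
}
```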


Seems strange that the target sends zero-byte packets.

Can you specify which target this is and the scenario?



Re: [PATCH v3 14/56] IB: fix kernel-doc markups

2020-10-24 Thread Max Gurtovoy

Thanks Mauro, small fix for iser

On 10/23/2020 7:33 PM, Mauro Carvalho Chehab wrote:

Some functions have different names between their prototypes
and the kernel-doc markup.

Others need to be fixed, as kernel-doc markups should use this format:
 identifier - description

Signed-off-by: Mauro Carvalho Chehab 

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c b/drivers/infiniband/ulp/iser/iscsi_iser.c
index 3690e28cc7ea..84cebf937680 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.c
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.c
@@ -739,7 +739,7 @@ iscsi_iser_set_param(struct iscsi_cls_conn *cls_conn,
  }
  
  /**

- * iscsi_iser_set_param() - set class connection parameter
+ * iscsi_iser_conn_get_stats() - set class connection parameter


iscsi_iser_conn_get_stats() - get iscsi connection statistics



   * @cls_conn:iscsi class connection
   * @stats:   iscsi stats to output
   *
  


Re: [PATCH] nvme: do not call del_gendisk() on a disk that was never added

2020-06-07 Thread Max Gurtovoy



On 6/7/2020 2:45 PM, Niklas Cassel wrote:

device_add_disk() is negated by del_gendisk().
alloc_disk_node() is negated by put_disk().

In nvme_alloc_ns(), device_add_disk() is one of the last things being
called in the success case, and only void functions are being called
after this. Therefore this call should not be negated in the error path.

The superfluous call to del_gendisk() leads to the following prints:
[7.839975] kobject: '(null)' (1ff73734): is not initialized, yet kobject_put() is being called.
[7.840865] WARNING: CPU: 2 PID: 361 at lib/kobject.c:736 kobject_put+0x70/0x120

Fixes: 33cfdc2aa696 ("nvme: enforce extended LBA format for fabrics metadata")
Signed-off-by: Niklas Cassel 
---
An alternative would be to do what nvme_ns_remove() does, i.e. in the error
path, check if ns->disk->flags & GENHD_FL_UP is set, and only then call
del_gendisk(). However, that seems unnecessary, since as nvme_alloc_ns()
is currently written, we know that device_add_disk() does not need to be
negated.
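[Editor's note: the alternative described above amounts to gating
del_gendisk() on GENHD_FL_UP, which device_add_disk() sets. A rough
userspace sketch with stubbed types follows; the flag value matches
older include/linux/genhd.h, and the helper name is illustrative.]

```c
#include <stdbool.h>
#include <stddef.h>

/* GENHD_FL_UP is set by device_add_disk(); checking it tells us
 * whether del_gendisk() is actually needed (illustration only). */
#define GENHD_FL_UP 0x0010

struct gendisk { int flags; };

static bool disk_was_added(const struct gendisk *disk)
{
    return disk && (disk->flags & GENHD_FL_UP);
}
```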

  drivers/nvme/host/core.c | 4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)


Looks good,

Reviewed-by: Max Gurtovoy 




Re: linux-next: manual merge of the block tree with the rdma tree

2020-06-03 Thread Max Gurtovoy



On 6/3/2020 2:32 AM, Jason Gunthorpe wrote:

On Wed, Jun 03, 2020 at 01:40:51AM +0300, Max Gurtovoy wrote:

On 6/3/2020 12:37 AM, Jens Axboe wrote:

On 6/2/20 1:09 PM, Jason Gunthorpe wrote:

On Tue, Jun 02, 2020 at 01:02:55PM -0600, Jens Axboe wrote:

On 6/2/20 1:01 PM, Jason Gunthorpe wrote:

On Tue, Jun 02, 2020 at 11:37:26AM +0300, Max Gurtovoy wrote:

On 6/2/2020 5:56 AM, Stephen Rothwell wrote:

Hi all,

Hi,

This looks good to me.

Can you share a pointer to the tree so we'll test it in our labs ?

need to re-test:

1. srq per core

2. srq per core + T10-PI

And both will run with shared CQ.

Max, this is too much conflict to send to Linus between your own
patches. I am going to drop the nvme part of this from RDMA.

Normally I don't like applying partial series, but due to this tree
split, you can send the rebased nvme part through the nvme/block tree
at rc1 in two weeks..

Yes, I'll send it in 2 weeks.

Actually I hoped the iSER patches for the CQ pool would be sent in this series
but eventually they were not.

This way we could have taken only the iser part and the new API.

I saw the pulled version too late since I wasn't CCed to it and it was
already merged before I had a chance to warn you about possible conflict.

I think in general we should try to add new RDMA APIs first with iSER/SRP
and avoid conflicting trees.

If you are careful we can construct a shared branch and if Jens/etc is
willing he can pull the RDMA base code after RDMA merges the branch
and then apply the nvme parts. This is how things work with netdev

It is tricky and you have to plan for it during your submission step,
but we should be able to manage in most cases if this comes up more
often.


I think we can construct a branch like this for a dedicated series and 
delete it after acceptance.


In case of new APIs for RDMA that involve touching NVMe stuff - we'll 
create this branch and ask Jens to pull it as you suggested.


And as a general note, I suggest we don't merge NVMe/RDMA stuff into 
rdma-next without cooperating with Jens.


-Max.



Jason


Re: linux-next: manual merge of the block tree with the rdma tree

2020-06-02 Thread Max Gurtovoy



On 6/3/2020 12:37 AM, Jens Axboe wrote:

On 6/2/20 1:09 PM, Jason Gunthorpe wrote:

On Tue, Jun 02, 2020 at 01:02:55PM -0600, Jens Axboe wrote:

On 6/2/20 1:01 PM, Jason Gunthorpe wrote:

On Tue, Jun 02, 2020 at 11:37:26AM +0300, Max Gurtovoy wrote:

On 6/2/2020 5:56 AM, Stephen Rothwell wrote:

Hi all,

Hi,

This looks good to me.

Can you share a pointer to the tree so we'll test it in our labs ?

need to re-test:

1. srq per core

2. srq per core + T10-PI

And both will run with shared CQ.

Max, this is too much conflict to send to Linus between your own
patches. I am going to drop the nvme part of this from RDMA.

Normally I don't like applying partial series, but due to this tree
split, you can send the rebased nvme part through the nvme/block tree
at rc1 in two weeks..


Yes, I'll send it in 2 weeks.

Actually I hoped the iSER patches for the CQ pool would be sent in this 
series but eventually they were not.


This way we could have taken only the iser part and the new API.

I saw the pulled version too late since I wasn't CCed to it and it was 
already merged before I had a chance to warn you about possible conflict.


I think in general we should try to add new RDMA APIs first with 
iSER/SRP and avoid conflicting trees.




Was going to comment that this is probably how it should have been
done to begin with. If we have multiple conflicts like that between
two trees, someone is doing something wrong...

Well, on the other hand having people add APIs in one tree and then
(promised) consumers in another tree later on has proven problematic
in the past. It is best to try to avoid that, but in this case I don't
think Max will have any delay to get the API consumer into nvme in two
weeks.

Having conflicting trees is a problem. If there's a dependency for
two trees for some new work, then just have a separate branch that's
built on those two. For NVMe core work, then it should include the
pending NVMe changes.


I guess it's hard to do so during the merge window since the block and rdma 
trees are not in sync.


I think it would have been a good idea to add Jens to CC and mention in 
the cover letter that we're posting code that is maintained by two 
different trees.





Re: linux-next: manual merge of the block tree with the rdma tree

2020-06-02 Thread Max Gurtovoy



On 6/2/2020 5:56 AM, Stephen Rothwell wrote:

Hi all,


Hi,

This looks good to me.

Can you share a pointer to the tree so we'll test it in our labs ?

need to re-test:

1. srq per core

2. srq per core + T10-PI

And both will run with shared CQ.




Today's linux-next merge of the block tree got a conflict in:

   drivers/nvme/target/rdma.c

between commit:

   5733111dcd97 ("nvmet-rdma: use new shared CQ mechanism")

from the rdma tree and commits:

   b0012dd39715 ("nvmet-rdma: use SRQ per completion vector")
   b09160c3996c ("nvmet-rdma: add metadata/T10-PI support")

from the block tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.



Re: [PATCH] nvme-tcp: constify static struct blk_mq_ops

2020-05-31 Thread Max Gurtovoy

Looks good,

Reviewed-by: Max Gurtovoy 



Re: [PATCH] nvme: tcp: remove redundant assignment to variable ret

2019-09-05 Thread Max Gurtovoy



On 9/5/2019 5:34 PM, Colin King wrote:

From: Colin Ian King 

The variable ret is being initialized with a value that is never read
and is being re-assigned immediately afterwards. The assignment is
redundant and hence can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King 
---
  drivers/nvme/host/tcp.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 2d8ba31cb691..d91be6ddfe25 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1824,7 +1824,7 @@ static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl 
*ctrl)
  static int nvme_tcp_setup_ctrl(struct nvme_ctrl *ctrl, bool new)
  {
struct nvmf_ctrl_options *opts = ctrl->opts;
-   int ret = -EINVAL;
+   int ret;
  
  	ret = nvme_tcp_configure_admin_queue(ctrl, new);

if (ret)


Looks fine,

Reviewed-by: Max Gurtovoy 



Re: [PATCH v2] nvme: fix multipath crash when ANA desactivated

2019-07-10 Thread Max Gurtovoy



On 7/10/2019 12:29 AM, Christoph Hellwig wrote:

On Sat, Jul 06, 2019 at 01:06:44PM +0300, Max Gurtovoy wrote:

+   /* check if multipath is enabled and we have the capability */
+   if (!multipath)
+   return 0;
+   if (!ctrl->subsys || ((ctrl->subsys->cmic & (1 << 3)) != 0))

shouldn't it be:

if (!ctrl->subsys || ((ctrl->subsys->cmic & (1 << 3)) == 0))

or

if (!ctrl->subsys || !(ctrl->subsys->cmic & (1 << 3)))


Otherwise, you don't really do any initialization and return 0 in case you have 
the capability, right ?

Yes.  FYI, my idea how to fix this would be something like:

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index a9a927677970..cdb3e5baa329 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -12,11 +12,6 @@ module_param(multipath, bool, 0444);
  MODULE_PARM_DESC(multipath,
"turn on native support for multiple controllers per subsystem");
  
-inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)

-{
-   return multipath && ctrl->subsys && (ctrl->subsys->cmic & (1 << 3));
-}
-
  /*
   * If multipathing is enabled we need to always use the subsystem instance
   * number for numbering our devices to avoid conflicts between subsystems that
@@ -622,7 +617,7 @@ int nvme_mpath_init(struct nvme_ctrl *ctrl, struct 
nvme_id_ctrl *id)
  {
int error;
  
-	if (!nvme_ctrl_use_ana(ctrl))

+   if (!multipath || !ctrl->subsys || !(ctrl->subsys->cmic & (1 << 3)))
return 0;
  
  	ctrl->anacap = id->anacap;

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 716a876119c8..14eca76bec5c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -485,7 +485,10 @@ extern const struct attribute_group 
*nvme_ns_id_attr_groups[];
  extern const struct block_device_operations nvme_ns_head_ops;
  
  #ifdef CONFIG_NVME_MULTIPATH

-bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl);
+static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
+{
+   return ctrl->ana_log_buf != NULL;
+}
  void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns,
struct nvme_ctrl *ctrl, int *flags);
  void nvme_failover_req(struct request *req);


Yes this looks good.



Re: [PATCH v2] nvme: fix multipath crash when ANA desactivated

2019-07-06 Thread Max Gurtovoy



On 7/5/2019 5:05 PM, Marta Rybczynska wrote:

Fix a crash with multipath activated. It happens when the ANA log
page is larger than MDTS and because of that ANA is disabled.
The driver then tries to access an unallocated buffer when connecting
to an nvme target. The signature is as follows:

[  300.433586] nvme nvme0: ANA log page size (8208) larger than MDTS (8192).
[  300.435387] nvme nvme0: disabling ANA support.
[  300.437835] nvme nvme0: creating 4 I/O queues.
[  300.459132] nvme nvme0: new ctrl: NQN "nqn.0.0.0", addr 10.91.0.1:8009
[  300.464609] BUG: unable to handle kernel NULL pointer dereference at 
0008
[  300.466342] #PF error: [normal kernel read fault]
[  300.467385] PGD 0 P4D 0
[  300.467987] Oops:  [#1] SMP PTI
[  300.468787] CPU: 3 PID: 50 Comm: kworker/u8:1 Not tainted 5.0.20kalray+ #4
[  300.470264] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  300.471532] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[  300.472724] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core]
[  300.474038] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 41 55 41 
54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 <66> 83 7d 08 00 
0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48
[  300.477374] RSP: 0018:a50e80fd7cb8 EFLAGS: 00010296
[  300.478334] RAX: 0001 RBX: 9130f1872258 RCX: 
[  300.479784] RDX: c06c4c30 RSI: 9130edad4280 RDI: 9130f1872258
[  300.481488] RBP:  R08: 0001 R09: 0044
[  300.483203] R10: 0220 R11: 0040 R12: 9130f18722c0
[  300.484928] R13: 9130f18722d0 R14: 9130edad4280 R15: 9130f18722c0
[  300.486626] FS:  () GS:9130f7b8() 
knlGS:
[  300.488538] CS:  0010 DS:  ES:  CR0: 80050033
[  300.489907] CR2: 0008 CR3: 0002365e6000 CR4: 06e0
[  300.491612] DR0:  DR1:  DR2: 
[  300.493303] DR3:  DR6: fffe0ff0 DR7: 0400
[  300.494991] Call Trace:
[  300.495645]  nvme_mpath_add_disk+0x5c/0xb0 [nvme_core]
[  300.496880]  nvme_validate_ns+0x2ef/0x550 [nvme_core]
[  300.498105]  ? nvme_identify_ctrl.isra.45+0x6a/0xb0 [nvme_core]
[  300.499539]  nvme_scan_work+0x2b4/0x370 [nvme_core]
[  300.500717]  ? __switch_to_asm+0x35/0x70
[  300.501663]  process_one_work+0x171/0x380
[  300.502340]  worker_thread+0x49/0x3f0
[  300.503079]  kthread+0xf8/0x130
[  300.503795]  ? max_active_store+0x80/0x80
[  300.504690]  ? kthread_bind+0x10/0x10
[  300.505502]  ret_from_fork+0x35/0x40
[  300.506280] Modules linked in: nvme_tcp nvme_rdma rdma_cm iw_cm ib_cm 
ib_core nvme_fabrics nvme_core xt_physdev ip6table_raw ip6table_mangle 
ip6table_filter ip6_tables xt_comment iptable_nat nf_nat_ipv4 nf_nat 
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_CHECKSUM iptable_mangle 
iptable_filter veth ebtable_filter ebtable_nat ebtables iptable_raw vxlan 
ip6_udp_tunnel udp_tunnel sunrpc joydev pcspkr virtio_balloon br_netfilter 
bridge stp llc ip_tables xfs libcrc32c ata_generic pata_acpi virtio_net 
virtio_console net_failover virtio_blk failover ata_piix serio_raw libata 
virtio_pci virtio_ring virtio
[  300.514984] CR2: 0008
[  300.515569] ---[ end trace faa2eefad7e7f218 ]---
[  300.516354] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core]
[  300.517330] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 41 55 41 
54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 <66> 83 7d 08 00 
0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48
[  300.520353] RSP: 0018:a50e80fd7cb8 EFLAGS: 00010296
[  300.521229] RAX: 0001 RBX: 9130f1872258 RCX: 
[  300.522399] RDX: c06c4c30 RSI: 9130edad4280 RDI: 9130f1872258
[  300.523560] RBP:  R08: 0001 R09: 0044
[  300.524734] R10: 0220 R11: 0040 R12: 9130f18722c0
[  300.525915] R13: 9130f18722d0 R14: 9130edad4280 R15: 9130f18722c0
[  300.527084] FS:  () GS:9130f7b8() 
knlGS:
[  300.528396] CS:  0010 DS:  ES:  CR0: 80050033
[  300.529440] CR2: 0008 CR3: 0002365e6000 CR4: 06e0
[  300.530739] DR0:  DR1:  DR2: 
[  300.531989] DR3:  DR6: fffe0ff0 DR7: 0400
[  300.533264] Kernel panic - not syncing: Fatal exception
[  300.534338] Kernel Offset: 0x17c0 from 0x8100 (relocation 
range: 0x8000-0xbfff)
[  300.536227] ---[ end Kernel panic - not syncing: Fatal exception ]---

Signed-off-by: Marta Rybczynska 
Tested-by: Jean-Baptiste Riaux 
---
  drivers/nvme/host/multipath.c | 8 ++--
  1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/multipath.c 

Re: [PATCH v2 2/2] nvmet-loop: Flush nvme_delete_wq when removing the port

2019-07-05 Thread Max Gurtovoy



On 7/4/2019 2:03 AM, Logan Gunthorpe wrote:

After calling nvme_loop_delete_ctrl(), the controllers will not
yet be deleted because nvme_delete_ctrl() only schedules work
to do the delete.

This means a race can occur if a port is removed but there
are still active controllers trying to access that memory.

To fix this, flush the nvme_delete_wq before returning from
nvme_loop_remove_port() so that any controllers that might
be in the process of being deleted won't access a freed port.

Signed-off-by: Logan Gunthorpe 
---
  drivers/nvme/target/loop.c | 8 
  1 file changed, 8 insertions(+)

diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 9e211ad6bdd3..da9cd07461fb 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -654,6 +654,14 @@ static void nvme_loop_remove_port(struct nvmet_port *port)
mutex_lock(&nvme_loop_ports_mutex);
list_del_init(&port->entry);
mutex_unlock(&nvme_loop_ports_mutex);
+
+   /*
+* Ensure any ctrls that are in the process of being
+* deleted are in fact deleted before we return
+* and free the port. This is to prevent active
+* ctrls from using a port after it's freed.
+*/
+   flush_workqueue(nvme_delete_wq);
  }
  
  static const struct nvmet_fabrics_ops nvme_loop_ops = {


Looks good:

Reviewed-by: Max Gurtovoy 



Re: [PATCH v2 1/2] nvmet: Fix use-after-free bug when a port is removed

2019-07-05 Thread Max Gurtovoy



On 7/5/2019 12:01 AM, Logan Gunthorpe wrote:


On 2019-07-04 3:00 p.m., Max Gurtovoy wrote:

Hi Logan,

On 7/4/2019 2:03 AM, Logan Gunthorpe wrote:

When a port is removed through configfs, any connected controllers
are still active and can still send commands. This causes a
use-after-free bug which is detected by KASAN for any admin command
that dereferences req->port (like in nvmet_execute_identify_ctrl).

To fix this, disconnect all active controllers when a subsystem is
removed from a port. This ensures there are no active controllers
when the port is eventually removed.

so now we are enforcing controller existence with port configfs, right ?
sounds reasonable.

Correct.


Did you run your patches with other transport (RDMA/TCP/FC) ?

Just RDMA and loop. I suppose I could test with TCP but I don't have FC
hardware.


Great.

the code looks good:

Reviewed-by: Max Gurtovoy 



Logan


Re: [PATCH v2 1/2] nvmet: Fix use-after-free bug when a port is removed

2019-07-04 Thread Max Gurtovoy

Hi Logan,

On 7/4/2019 2:03 AM, Logan Gunthorpe wrote:

When a port is removed through configfs, any connected controllers
are still active and can still send commands. This causes a
use-after-free bug which is detected by KASAN for any admin command
that dereferences req->port (like in nvmet_execute_identify_ctrl).

To fix this, disconnect all active controllers when a subsystem is
removed from a port. This ensures there are no active controllers
when the port is eventually removed.


so now we are enforcing controller existence with port configfs, right ? 
sounds reasonable.


Did you run your patches with other transport (RDMA/TCP/FC) ?




Re: [PATCH 6/8] IB/srp: set virt_boundary_mask in the scsi host

2019-06-24 Thread Max Gurtovoy



On 6/17/2019 3:19 PM, Christoph Hellwig wrote:

This ensures all proper DMA layer handling is taken care of by the
SCSI midlayer.

Signed-off-by: Christoph Hellwig 


Looks good,

Reviewed-by: Max Gurtovoy 



Re: [PATCH 5/8] IB/iser: set virt_boundary_mask in the scsi host

2019-06-24 Thread Max Gurtovoy

On 6/17/2019 3:19 PM, Christoph Hellwig wrote:

This ensures all proper DMA layer handling is taken care of by the
SCSI midlayer.

Signed-off-by: Christoph Hellwig 



Looks good,

Reviewed-by: Max Gurtovoy 



Re: [PATCH v2 06/10] nvme/core: add mdev interfaces

2019-05-03 Thread Max Gurtovoy



On 5/3/2019 3:29 PM, Christoph Hellwig wrote:

On Thu, May 02, 2019 at 02:47:57PM +0300, Maxim Levitsky wrote:

If the mdev device driver also sets the
NVME_F_MDEV_DMA_SUPPORTED, the mdev core will
dma map all the guest memory into the nvme device,
so that nvme device driver can use dma addresses as passed
from the mdev core driver

We really need a proper block layer interface for that so that
uring or the nvme target can use pre-mapping as well.


I think we can also find a way to use nvme-mdev for the target offload 
p2p feature.


I don't see a big difference between handing an NVMe queue and 
namespace/partition to a guest OS or to P2P, since I/O is issued by an 
external entity and polled outside the PCI driver.


thoughts ?




___
Linux-nvme mailing list
linux-n...@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme


Re: [PATCH] IB/iser: remove uninitialized variable len

2019-03-17 Thread Max Gurtovoy



On 3/17/2019 1:05 AM, Colin King wrote:

From: Colin Ian King

The variable len is not being initialized and the uninitialized
value is being returned. However, this return path is never reached
because the default case in the switch statement returns -ENOSYS.
Clean up the code by replacing the return -ENOSYS with a break
for the default case and returning -ENOSYS at the end of the
function.  This allows len to be removed.  Also remove redundant
break that follows a return statement.

Signed-off-by: Colin Ian King


Looks good,

Reviewed-by: Max Gurtovoy 



Re: [PATCH] nvmet: disable direct I/O when unavailable

2019-02-24 Thread Max Gurtovoy



On 2/22/2019 7:55 AM, Johannes Thumshirn wrote:

On 22/02/2019 01:41, Chaitanya Kulkarni wrote:
[...]

As specified in the patch, this is only useful for testing, so we
should modify the test scripts to switch to buffered I/O on creation
of the ctrl before running fio.

Or on any other file-system that does not support DIO..


Do we really want to support this kind of filesystem as a fabrics 
backend device? Only for testing?


What is the status of iSCSI/SRP targets in this case ?





OR

Similar result can be achieved by setting buffered I/O flag
buffered_io=1 before enabling the name-space in the test script.

Frankly, we have a ton of testing related special cases in the kernel.

This one is a) simple and small, only 10 LoC, b) far away from the fast
path or any other place where it could have any impact on legitimate
users and c) it prints an informational message showing you what happened.

Sorry but this is a https://xkcd.com/386/ moment.

Byte,
Johannes


Re: [PATCH] nvme: host: core: fix precedence of ternary operator

2018-05-23 Thread Max Gurtovoy



On 5/23/2018 5:56 PM, Ivan Bornyakov wrote:

The ternary operator has lower precedence than bitwise OR, so 'cdw10' was
calculated incorrectly.

Signed-off-by: Ivan Bornyakov <brnkv...@gmail.com>
---
  drivers/nvme/host/core.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index b070c659391f..1eba9b0cb9dc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1577,7 +1577,7 @@ static int nvme_pr_reserve(struct block_device *bdev, u64 
key,
  static int nvme_pr_preempt(struct block_device *bdev, u64 old, u64 new,
enum pr_type type, bool abort)
  {
-   u32 cdw10 = nvme_pr_type(type) << 8 | abort ? 2 : 1;
+   u32 cdw10 = nvme_pr_type(type) << 8 | (abort ? 2 : 1);
return nvme_pr_command(bdev, cdw10, old, new, nvme_cmd_resv_acquire);
  }
  
@@ -1589,7 +1589,7 @@ static int nvme_pr_clear(struct block_device *bdev, u64 key)
  
  static int nvme_pr_release(struct block_device *bdev, u64 key, enum pr_type type)

  {
-   u32 cdw10 = nvme_pr_type(type) << 8 | key ? 1 << 3 : 0;
+   u32 cdw10 = nvme_pr_type(type) << 8 | (key ? 1 << 3 : 0);
return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_release);
  }
  



Looks good,

Reviewed-by: Max Gurtovoy <m...@mellanox.com>



Re: [PATCH V2] nvme-rdma: fix double free in nvme_rdma_free_queue

2018-05-16 Thread Max Gurtovoy

Hi Jianchao,

On 5/10/2018 12:42 PM, Jianchao Wang wrote:

BUG: KASAN: double-free or invalid-free in nvme_rdma_free_queue+0xf6/0x110 
[nvme_rdma]
Workqueue: nvme-reset-wq nvme_rdma_reset_ctrl_work [nvme_rdma]
Call Trace:
  dump_stack+0x91/0xeb
  print_address_description+0x6b/0x290
  kasan_report_invalid_free+0x55/0x80
  __kasan_slab_free+0x176/0x190
  kfree+0xeb/0x310
  nvme_rdma_free_queue+0xf6/0x110 [nvme_rdma]
  nvme_rdma_configure_admin_queue+0x1a3/0x4d0 [nvme_rdma]
  nvme_rdma_reset_ctrl_work+0x4e/0xd0 [nvme_rdma]
  process_one_work+0x3ca/0xaa0
  worker_thread+0x4e2/0x6c0
  kthread+0x18d/0x1e0
  ret_from_fork+0x24/0x30

The double free is on ctrl->async_event_sqe.
If any case fails before ctrl->async_event_sqe is allocated in
nvme_rdma_configure_admin_queue, nvme_rdma_free_queue will be
invoked. However, at the moment, the ctrl->async_event_sqe has
not been allocated because it has been freed in
nvme_rdma_reset_ctrl_work
   -> nvme_rdma_shutdown_ctrl
 ->nvme_rdma_destroy_admin_queue
   -> nvme_rdma_free_queue

Signed-off-by: Jianchao Wang 
---

V2:
handle it in nvme_rdma_free_queue and add some comment to explain it.


I don't know exactly what Christoph meant but IMO the best place to 
allocate it is in nvme_rdma_alloc_queue just before calling


"set_bit(NVME_RDMA_Q_ALLOCATED, &queue->flags);"

then you will never get to double free since we clear the 
NVME_RDMA_Q_ALLOCATED bit in the beginning of nvme_rdma_free_queue.




  drivers/nvme/host/rdma.c | 15 ---
  1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 966e0dd..fa5cf87 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -561,9 +561,18 @@ static void nvme_rdma_free_queue(struct nvme_rdma_queue 
*queue)
return;
  
  	if (nvme_rdma_queue_idx(queue) == 0) {

-   nvme_rdma_free_qe(queue->device->dev,
-   &queue->ctrl->async_event_sqe,
-   sizeof(struct nvme_command), DMA_TO_DEVICE);
+   /*
+* async_event_sqe is not allocated in nvme_rdma_alloc_queue.
+* so there are cases where NVME_RDMA_Q_ALLOCATED is set, but
+* async_event_sqe is not allocated. To avoid double free, set
+* async_event_sqe.data to NULL to indicate it has been freed.
+*/
+   if (queue->ctrl->async_event_sqe.data) {
+   nvme_rdma_free_qe(queue->device->dev,
+   &queue->ctrl->async_event_sqe,
+   sizeof(struct nvme_command), DMA_TO_DEVICE);
+   queue->ctrl->async_event_sqe.data = NULL;
+   }
}
  
  	nvme_rdma_destroy_queue_ib(queue);





-Max.




Re: [PATCH V2] nvme-rdma: stop queue first before free it in config admin queue

2018-05-16 Thread Max Gurtovoy



On 5/10/2018 12:42 PM, Jianchao Wang wrote:

When any of cases after nvme_rdma_start_queue in
nvme_rdma_configure_admin_queue fails, the ctrl->queues[0] will be
freed but the NVME_RDMA_Q_LIVE is still set. If nvme_rdma_stop_queue
is invoked, we will incur use-after-free which will cause memory
corruption.
BUG: KASAN: use-after-free in rdma_disconnect+0x1f/0xe0 [rdma_cm]
Read of size 8 at addr 8801dc3969c0 by task kworker/u16:3/9304

CPU: 3 PID: 9304 Comm: kworker/u16:3 Kdump: loaded Tainted: GW 
4.17.0-rc3+ #20
Workqueue: nvme-delete-wq nvme_delete_ctrl_work
Call Trace:
  dump_stack+0x91/0xeb
  print_address_description+0x6b/0x290
  kasan_report+0x261/0x360
  rdma_disconnect+0x1f/0xe0 [rdma_cm]
  nvme_rdma_stop_queue+0x25/0x40 [nvme_rdma]
  nvme_rdma_shutdown_ctrl+0xf3/0x150 [nvme_rdma]
  nvme_delete_ctrl_work+0x98/0xe0
  process_one_work+0x3ca/0xaa0
  worker_thread+0x4e2/0x6c0
  kthread+0x18d/0x1e0
  ret_from_fork+0x24/0x30

To fix it, call nvme_rdma_stop_queue for all the failed cases after
nvme_rdma_start_queue.

Signed-off-by: Jianchao Wang <jianchao.w.w...@oracle.com>
---

V2:
based on Sagi's suggestion, add out_stop_queue label and invoke
nvme_rdma_stop_queue in all the failed cases after nvme_rdma_start_queue



Looks good,

Reviewed-by: Max Gurtovoy <m...@mellanox.com>



Re: [PATCH AUTOSEL for 4.14 039/161] IB/cq: Don't force IB_POLL_DIRECT poll context for ib_process_cq_direct

2018-04-09 Thread Max Gurtovoy

Hi Sasha,
please consider taking a small fix for this one (also useful for 4.15):

commit d3b9e8ad425cfd5b9116732e057f1b48e4d3bcb8
Author: Max Gurtovoy <m...@mellanox.com>
Date:   Mon Mar 5 20:09:48 2018 +0200

RDMA/core: Reduce poll batch for direct cq polling

Fix warning limit for kernel stack consumption:

drivers/infiniband/core/cq.c: In function 'ib_process_cq_direct':
drivers/infiniband/core/cq.c:78:1: error: the frame size of 1032 bytes
is larger than 1024 bytes [-Werror=frame-larger-than=]

Using smaller ib_wc array on the stack brings us comfortably below that
limit again.

Fixes: 246d8b184c10 ("IB/cq: Don't force IB_POLL_DIRECT poll 
context for ib_process_cq_direct")

Reported-by: Arnd Bergmann <a...@arndb.de>
Reviewed-by: Sergey Gorenko <serge...@mellanox.com>
    Signed-off-by: Max Gurtovoy <m...@mellanox.com>
Signed-off-by: Leon Romanovsky <leo...@mellanox.com>
Reviewed-by: Bart Van Assche <bart.vanass...@wdc.com>
Acked-by: Arnd Bergmann <a...@arndb.de>
Signed-off-by: Jason Gunthorpe <j...@mellanox.com>


-Max.


On 4/9/2018 3:20 AM, Sasha Levin wrote:

From: Sagi Grimberg <s...@grimberg.me>

[ Upstream commit 246d8b184c100e8eb6b4e8c88f232c2ed2a4e672 ]

polling the completion queue directly does not interfere
with the existing polling logic, hence drop the requirement.
Be aware that running ib_process_cq_direct with non IB_POLL_DIRECT
CQ may trigger concurrent CQ processing.

This can be used for polling mode ULPs.

Cc: Bart Van Assche <bart.vanass...@wdc.com>
Reported-by: Steve Wise <sw...@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <s...@grimberg.me>
[maxg: added wcs array argument to __ib_process_cq]
Signed-off-by: Max Gurtovoy <m...@mellanox.com>
Signed-off-by: Doug Ledford <dledf...@redhat.com>
Signed-off-by: Sasha Levin <alexander.le...@microsoft.com>
---
  drivers/infiniband/core/cq.c | 23 +--
  1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index f2ae75fa3128..c8c5a5a7f433 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -25,9 +25,10 @@
  #define IB_POLL_FLAGS \
(IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS)
  
-static int __ib_process_cq(struct ib_cq *cq, int budget)

+static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc *poll_wc)
  {
int i, n, completed = 0;
+   struct ib_wc *wcs = poll_wc ? : cq->wc;
  
  	/*

 * budget might be (-1) if the caller does not
@@ -35,9 +36,9 @@ static int __ib_process_cq(struct ib_cq *cq, int budget)
 * minimum here.
 */
while ((n = ib_poll_cq(cq, min_t(u32, IB_POLL_BATCH,
-   budget - completed), cq->wc)) > 0) {
+   budget - completed), wcs)) > 0) {
for (i = 0; i < n; i++) {
-   struct ib_wc *wc = &cq->wc[i];
+   struct ib_wc *wc = &wcs[i];
  
  			if (wc->wr_cqe)

wc->wr_cqe->done(cq, wc);
@@ -60,18 +61,20 @@ static int __ib_process_cq(struct ib_cq *cq, int budget)
   * @cq:   CQ to process
   * @budget:   number of CQEs to poll for
   *
- * This function is used to process all outstanding CQ entries on a
- * %IB_POLL_DIRECT CQ.  It does not offload CQ processing to a different
- * context and does not ask for completion interrupts from the HCA.
+ * This function is used to process all outstanding CQ entries.
+ * It does not offload CQ processing to a different context and does
+ * not ask for completion interrupts from the HCA.
+ * Using direct processing on CQ with non IB_POLL_DIRECT type may trigger
+ * concurrent processing.
   *
   * Note: do not pass -1 as %budget unless it is guaranteed that the number
   * of completions that will be processed is small.
   */
  int ib_process_cq_direct(struct ib_cq *cq, int budget)
  {
-   WARN_ON_ONCE(cq->poll_ctx != IB_POLL_DIRECT);
+   struct ib_wc wcs[IB_POLL_BATCH];
  
-	return __ib_process_cq(cq, budget);

+   return __ib_process_cq(cq, budget, wcs);
  }
  EXPORT_SYMBOL(ib_process_cq_direct);
  
@@ -85,7 +88,7 @@ static int ib_poll_handler(struct irq_poll *iop, int budget)

struct ib_cq *cq = container_of(iop, struct ib_cq, iop);
int completed;
  
-	completed = __ib_process_cq(cq, budget);

+   completed = __ib_process_cq(cq, budget, NULL);
if (completed < budget) {
irq_poll_complete(&cq->iop);
if (ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
@@ -105,7 +108,7 @@ static void ib_cq_poll_work(struct work_struct *work)
struct ib_cq *cq = container_of(work, struct ib_cq, work);
int completed;
  
-	completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE);

+   completed = __ib_process_cq(cq

Re: [PATCH AUTOSEL for 4.14 039/161] IB/cq: Don't force IB_POLL_DIRECT poll context for ib_process_cq_direct

2018-04-09 Thread Max Gurtovoy

Hi Sasha,
please consider taking a small fix for this one (also useful for 4.15):

commit d3b9e8ad425cfd5b9116732e057f1b48e4d3bcb8
Author: Max Gurtovoy 
Date:   Mon Mar 5 20:09:48 2018 +0200

RDMA/core: Reduce poll batch for direct cq polling

Fix warning limit for kernel stack consumption:

drivers/infiniband/core/cq.c: In function 'ib_process_cq_direct':
drivers/infiniband/core/cq.c:78:1: error: the frame size of 1032 bytes
is larger than 1024 bytes [-Werror=frame-larger-than=]

Using smaller ib_wc array on the stack brings us comfortably below that
limit again.

Fixes: 246d8b184c10 ("IB/cq: Don't force IB_POLL_DIRECT poll 
context for ib_process_cq_direct")

Reported-by: Arnd Bergmann 
Reviewed-by: Sergey Gorenko 
Signed-off-by: Max Gurtovoy 
Signed-off-by: Leon Romanovsky 
Reviewed-by: Bart Van Assche 
Acked-by: Arnd Bergmann 
Signed-off-by: Jason Gunthorpe 


-Max.


On 4/9/2018 3:20 AM, Sasha Levin wrote:

From: Sagi Grimberg 

[ Upstream commit 246d8b184c100e8eb6b4e8c88f232c2ed2a4e672 ]

polling the completion queue directly does not interfere
with the existing polling logic, hence drop the requirement.
Be aware that running ib_process_cq_direct with non IB_POLL_DIRECT
CQ may trigger concurrent CQ processing.

This can be used for polling mode ULPs.

Cc: Bart Van Assche 
Reported-by: Steve Wise 
Signed-off-by: Sagi Grimberg 
[maxg: added wcs array argument to __ib_process_cq]
Signed-off-by: Max Gurtovoy 
Signed-off-by: Doug Ledford 
Signed-off-by: Sasha Levin 
---
  drivers/infiniband/core/cq.c | 23 +++++++++++++----------
  1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index f2ae75fa3128..c8c5a5a7f433 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -25,9 +25,10 @@
  #define IB_POLL_FLAGS \
(IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS)
  
-static int __ib_process_cq(struct ib_cq *cq, int budget)
+static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc *poll_wc)
  {
int i, n, completed = 0;
+   struct ib_wc *wcs = poll_wc ? : cq->wc;
  
  	/*

 * budget might be (-1) if the caller does not
@@ -35,9 +36,9 @@ static int __ib_process_cq(struct ib_cq *cq, int budget)
 * minimum here.
 */
while ((n = ib_poll_cq(cq, min_t(u32, IB_POLL_BATCH,
-   budget - completed), cq->wc)) > 0) {
+   budget - completed), wcs)) > 0) {
for (i = 0; i < n; i++) {
-   struct ib_wc *wc = &cq->wc[i];
+   struct ib_wc *wc = &wcs[i];
  
  			if (wc->wr_cqe)

wc->wr_cqe->done(cq, wc);
@@ -60,18 +61,20 @@ static int __ib_process_cq(struct ib_cq *cq, int budget)
   * @cq:   CQ to process
   * @budget:   number of CQEs to poll for
   *
- * This function is used to process all outstanding CQ entries on a
- * %IB_POLL_DIRECT CQ.  It does not offload CQ processing to a different
- * context and does not ask for completion interrupts from the HCA.
+ * This function is used to process all outstanding CQ entries.
+ * It does not offload CQ processing to a different context and does
+ * not ask for completion interrupts from the HCA.
+ * Using direct processing on CQ with non IB_POLL_DIRECT type may trigger
+ * concurrent processing.
   *
   * Note: do not pass -1 as %budget unless it is guaranteed that the number
   * of completions that will be processed is small.
   */
  int ib_process_cq_direct(struct ib_cq *cq, int budget)
  {
-   WARN_ON_ONCE(cq->poll_ctx != IB_POLL_DIRECT);
+   struct ib_wc wcs[IB_POLL_BATCH];
  
-	return __ib_process_cq(cq, budget);

+   return __ib_process_cq(cq, budget, wcs);
  }
  EXPORT_SYMBOL(ib_process_cq_direct);
  
@@ -85,7 +88,7 @@ static int ib_poll_handler(struct irq_poll *iop, int budget)

struct ib_cq *cq = container_of(iop, struct ib_cq, iop);
int completed;
  
-	completed = __ib_process_cq(cq, budget);

+   completed = __ib_process_cq(cq, budget, NULL);
if (completed < budget) {
irq_poll_complete(&cq->iop);
if (ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
@@ -105,7 +108,7 @@ static void ib_cq_poll_work(struct work_struct *work)
struct ib_cq *cq = container_of(work, struct ib_cq, work);
int completed;
  
-	completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE);

+   completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE, NULL);
if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
queue_work(ib_comp_wq, &cq->work);



Re: [PATCH v3] nvmet: fix nvmet_execute_write_zeroes function

2018-04-03 Thread Max Gurtovoy



On 4/2/2018 8:38 PM, Keith Busch wrote:

Thanks, I've applied the patch with a simpler changelog explaining
the bug.


Thanks Rodrigo and Keith, I've tested with/without the patch and it works 
well (with the fix only).


-Max.



___
Linux-nvme mailing list
linux-n...@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme





Re: [PATCH] RDMA/core: reduce IB_POLL_BATCH constant

2018-03-01 Thread Max Gurtovoy



On 2/28/2018 8:55 PM, Doug Ledford wrote:

On Wed, 2018-02-28 at 11:50 +0200, Max Gurtovoy wrote:


On 2/28/2018 2:21 AM, Bart Van Assche wrote:

On 02/27/18 14:15, Max Gurtovoy wrote:

-static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc
*poll_wc)
+static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc
*poll_wc,
+  int batch)
   {
-   int i, n, completed = 0;
-   struct ib_wc *wcs = poll_wc ? : cq->wc;
+   int i, n, ib_poll_batch, completed = 0;
+   struct ib_wc *wcs;
+
+   if (poll_wc) {
+   wcs = poll_wc;
+   ib_poll_batch = batch;
+   } else {
+   wcs = cq->wc;
+   ib_poll_batch = IB_POLL_BATCH;
+   }


Since this code has to be touched I think that we can use this
opportunity to get rid of the "poll_wc ? : cq->wc" conditional and
instead use what the caller passes. That will require to update all
__ib_process_cq(..., ..., NULL) calls. I also propose to let the caller
pass ib_poll_batch instead of figuring it out in this function.
Otherwise the approach of this patch looks fine to me.


Thanks Bart.
I'll make these changes and submit.


That sounds reasonable to me too, thanks for reworking and resubmitting.



Sure, NP.
We've run NVMe-oF and SRP with the new patch.
I'll send it through Mellanox maintainers pull request.

Thanks for reporting and reviewing.

-Max.




Re: [PATCH] RDMA/core: reduce IB_POLL_BATCH constant

2018-02-28 Thread Max Gurtovoy



On 2/28/2018 2:21 AM, Bart Van Assche wrote:

On 02/27/18 14:15, Max Gurtovoy wrote:
-static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc 
*poll_wc)
+static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc 
*poll_wc,

+  int batch)
  {
-   int i, n, completed = 0;
-   struct ib_wc *wcs = poll_wc ? : cq->wc;
+   int i, n, ib_poll_batch, completed = 0;
+   struct ib_wc *wcs;
+
+   if (poll_wc) {
+   wcs = poll_wc;
+   ib_poll_batch = batch;
+   } else {
+   wcs = cq->wc;
+   ib_poll_batch = IB_POLL_BATCH;
+   }


Since this code has to be touched I think that we can use this 
opportunity to get rid of the "poll_wc ? : cq->wc" conditional and 
instead use what the caller passes. That will require to update all 
__ib_process_cq(..., ..., NULL) calls. I also propose to let the caller 
pass ib_poll_batch instead of figuring it out in this function. 
Otherwise the approach of this patch looks fine to me.


Thanks Bart.
I'll make these changes and submit.



Thanks,

Bart.


-Max.




Re: [PATCH] RDMA/core: reduce IB_POLL_BATCH constant

2018-02-27 Thread Max Gurtovoy



On 2/28/2018 12:09 AM, Jason Gunthorpe wrote:

On Thu, Feb 22, 2018 at 05:39:09PM +0200, Sagi Grimberg wrote:



The only reason why I added this array on-stack was to allow consumers
that did not use ib_alloc_cq api to call it, but that seems like a
wrong decision when thinking it over again (as probably these users
did not set the wr_cqe correctly).

How about we make ib_process_cq_direct use the cq wc array and add
a WARN_ON statement (and fail it gracefully) if the caller used this
API without calling ib_alloc_cq?


but we tried to avoid concurrent access to cq->wc.


Not sure its a valid use-case. But if there is a compelling
reason to keep it as is, then we can do smaller on-stack
array.


Did we come to a conclusion what to do here?


guys,
what do you think about the following solution (untested):


diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index bc79ca8..59d2835 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -17,6 +17,7 @@

 /* # of WCs to poll for with a single call to ib_poll_cq */
 #define IB_POLL_BATCH  16
+#define IB_POLL_BATCH_DIRECT   8

 /* # of WCs to iterate over before yielding */
 #define IB_POLL_BUDGET_IRQ 256
@@ -25,17 +26,25 @@
 #define IB_POLL_FLAGS \
(IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS)

-static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc 
*poll_wc)
+static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc 
*poll_wc,

+  int batch)
 {
-   int i, n, completed = 0;
-   struct ib_wc *wcs = poll_wc ? : cq->wc;
-
+   int i, n, ib_poll_batch, completed = 0;
+   struct ib_wc *wcs;
+
+   if (poll_wc) {
+   wcs = poll_wc;
+   ib_poll_batch = batch;
+   } else {
+   wcs = cq->wc;
+   ib_poll_batch = IB_POLL_BATCH;
+   }
/*
 * budget might be (-1) if the caller does not
 * want to bound this call, thus we need unsigned
 * minimum here.
 */
-   while ((n = ib_poll_cq(cq, min_t(u32, IB_POLL_BATCH,
+   while ((n = ib_poll_cq(cq, min_t(u32, ib_poll_batch,
budget - completed), wcs)) > 0) {
for (i = 0; i < n; i++) {
struct ib_wc *wc = &wcs[i];
@@ -48,7 +57,7 @@ static int __ib_process_cq(struct ib_cq *cq, int 
budget, struct ib_wc *poll_wc)


completed += n;

-   if (n != IB_POLL_BATCH ||
+   if (n != ib_poll_batch ||
(budget != -1 && completed >= budget))
break;
}
@@ -72,9 +81,9 @@ static int __ib_process_cq(struct ib_cq *cq, int 
budget, struct ib_wc *poll_wc)

  */
 int ib_process_cq_direct(struct ib_cq *cq, int budget)
 {
-   struct ib_wc wcs[IB_POLL_BATCH];
+   struct ib_wc wcs[IB_POLL_BATCH_DIRECT];

-   return __ib_process_cq(cq, budget, wcs);
+   return __ib_process_cq(cq, budget, wcs, IB_POLL_BATCH_DIRECT);
 }
 EXPORT_SYMBOL(ib_process_cq_direct);

@@ -88,7 +97,7 @@ static int ib_poll_handler(struct irq_poll *iop, int 
budget)

struct ib_cq *cq = container_of(iop, struct ib_cq, iop);
int completed;

-   completed = __ib_process_cq(cq, budget, NULL);
+   completed = __ib_process_cq(cq, budget, NULL, 0);
if (completed < budget) {
irq_poll_complete(&cq->iop);
if (ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
@@ -108,7 +117,7 @@ static void ib_cq_poll_work(struct work_struct *work)
struct ib_cq *cq = container_of(work, struct ib_cq, work);
int completed;

-   completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE, NULL);
+   completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE, NULL, 0);
if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
queue_work(ib_comp_wq, &cq->work);





Jason






Re: [PATCH] RDMA/core: reduce IB_POLL_BATCH constant

2018-02-21 Thread Max Gurtovoy



On 2/21/2018 3:44 PM, Sagi Grimberg wrote:



On Tue, 2018-02-20 at 21:59 +0100, Arnd Bergmann wrote:

/* # of WCs to poll for with a single call to ib_poll_cq */
-#define IB_POLL_BATCH    16
+#define IB_POLL_BATCH    8


The purpose of batch polling is to minimize contention on the cq 
spinlock.
Reducing the IB_POLL_BATCH constant may affect performance 
negatively. Has
the performance impact of this change been verified for all affected 
drivers
(ib_srp, ib_srpt, ib_iser, ib_isert, NVMeOF, NVMeOF target, SMB 
Direct, NFS

over RDMA, ...)?


Only the users of the DIRECT polling method use an on-stack
array of ib_wc's. This is only the SRP drivers.

The other two modes have use of a dynamically allocated array
of ib_wc's that hangs off the ib_cq. These shouldn't need any
reduction in the size of this array, and they are the common
case.

IMO a better solution would be to change ib_process_cq_direct
to use a smaller on-stack array, and leave IB_POLL_BATCH alone.


The only reason why I added this array on-stack was to allow consumers
that did not use ib_alloc_cq api to call it, but that seems like a
wrong decision when thinking it over again (as probably these users
did not set the wr_cqe correctly).

How about we make ib_process_cq_direct use the cq wc array and add
a WARN_ON statement (and fail it gracefully) if the caller used this
API without calling ib_alloc_cq?


but we tried to avoid concurrent access to cq->wc.
Why can't we use the solution I wrote above ?



--
diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index bc79ca8215d7..cd3e9e124834 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -25,10 +25,10 @@
  #define IB_POLL_FLAGS \
     (IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS)

-static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc 
*poll_wc)

+static int __ib_process_cq(struct ib_cq *cq, int budget)
  {
     int i, n, completed = 0;
-   struct ib_wc *wcs = poll_wc ? : cq->wc;
+   struct ib_wc *wcs = cq->wc;

     /*
  * budget might be (-1) if the caller does not
@@ -72,9 +72,9 @@ static int __ib_process_cq(struct ib_cq *cq, int 
budget, struct ib_wc *poll_wc)

   */
  int ib_process_cq_direct(struct ib_cq *cq, int budget)
  {
-   struct ib_wc wcs[IB_POLL_BATCH];
-
-   return __ib_process_cq(cq, budget, wcs);
+   if (unlikely(WARN_ON_ONCE(!cq->wc)))
+   return 0;
+   return __ib_process_cq(cq, budget);
  }
  EXPORT_SYMBOL(ib_process_cq_direct);

@@ -88,7 +88,7 @@ static int ib_poll_handler(struct irq_poll *iop, int 
budget)

     struct ib_cq *cq = container_of(iop, struct ib_cq, iop);
     int completed;

-   completed = __ib_process_cq(cq, budget, NULL);
+   completed = __ib_process_cq(cq, budget);
     if (completed < budget) {
     irq_poll_complete(&cq->iop);
     if (ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
@@ -108,7 +108,7 @@ static void ib_cq_poll_work(struct work_struct *work)
     struct ib_cq *cq = container_of(work, struct ib_cq, work);
     int completed;

-   completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE, NULL);
+   completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE);
     if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
     ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
     queue_work(ib_comp_wq, &cq->work);
--
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




Re: [PATCH] RDMA/core: reduce IB_POLL_BATCH constant

2018-02-21 Thread Max Gurtovoy



On 2/20/2018 11:47 PM, Chuck Lever wrote:




On Feb 20, 2018, at 4:14 PM, Bart Van Assche  wrote:

On Tue, 2018-02-20 at 21:59 +0100, Arnd Bergmann wrote:

/* # of WCs to poll for with a single call to ib_poll_cq */
-#define IB_POLL_BATCH  16
+#define IB_POLL_BATCH  8


The purpose of batch polling is to minimize contention on the cq spinlock.
Reducing the IB_POLL_BATCH constant may affect performance negatively. Has
the performance impact of this change been verified for all affected drivers
(ib_srp, ib_srpt, ib_iser, ib_isert, NVMeOF, NVMeOF target, SMB Direct, NFS
over RDMA, ...)?


Only the users of the DIRECT polling method use an on-stack
array of ib_wc's. This is only the SRP drivers.

The other two modes have use of a dynamically allocated array
of ib_wc's that hangs off the ib_cq. These shouldn't need any
reduction in the size of this array, and they are the common
case.

IMO a better solution would be to change ib_process_cq_direct
to use a smaller on-stack array, and leave IB_POLL_BATCH alone.


Yup, good idea.
you can define IB_DIRECT_POLL_BATCH to be 8 and use it in 
ib_process_cq_direct. *but* please make sure to use the right value in 
ib_poll_cq since the wcs array should be able to hold the requested 
amount of wcs.


-Max.



--
Chuck Lever









Re: [PATCH] nvme-pci: Fix incorrect use of CMB size to calculate q_depth

2018-02-06 Thread Max Gurtovoy



On 2/6/2018 11:48 AM, Sagi Grimberg wrote:

Looks good,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>

I'll pick this one up unless someone thinks I shouldn't..


Looks good to me (I can imagine what scenario failed this :) ),

Reviewed-by: Max Gurtovoy <m...@mellanox.com>







Re: Backport Mellanox mlx5 patches to stable 4.9.y

2018-02-01 Thread Max Gurtovoy



On 2/1/2018 10:21 AM, Greg KH wrote:

On Tue, Jan 30, 2018 at 10:12:51AM +0100, Marta Rybczynska wrote:

Hello Mellanox maintainers,
I'd like to ask you to OK backporting two patches in mlx5 driver to 4.9 stable
tree (they're in master for some time already).

We have multiple deployment in 4.9 that are running into the bug fixed by those
patches. We're deploying patched kernels and the issue disappears.

The patches are:
1410a90ae449061b7e1ae19d275148f36948801b net/mlx5: Define interface bits for 
fencing UMR wqe
6e8484c5cf07c7ee632587e98c1a12d319dacb7c RDMA/mlx5: set UMR wqe fence according 
to HCA cap


Looks good, now queued up, thanks.


thanks Greg.



greg k-h






Re: [PATCH 0/3] IB-iSER: Adjustments for three function implementations

2018-01-28 Thread Max Gurtovoy



On 1/27/2018 8:17 PM, SF Markus Elfring wrote:

From: Markus Elfring <elfr...@users.sourceforge.net>
Date: Sat, 27 Jan 2018 19:02:34 +0100

Three update suggestions were taken into account
from static source code analysis.

Markus Elfring (3):
   Delete an error message for a failed memory allocation in 
iser_send_data_out()
   Delete an unnecessary variable initialisation in iser_send_data_out()
   Combine substrings for three messages

  drivers/infiniband/ulp/iser/iser_initiator.c | 16 ++--
  1 file changed, 6 insertions(+), 10 deletions(-)



This series looks good to me,

Reviewed-by: Max Gurtovoy <m...@mellanox.com>




Re: [PATCH] nvme: don't free uuid pointer before printing it

2018-01-25 Thread Max Gurtovoy



On 1/25/2018 10:09 AM, Johannes Thumshirn wrote:

Commit df351ef73789 ("nvme-fabrics: fix memory leak when parsing host ID
option") fixed the leak of 'p' but in case uuid_parse() fails the memory
is freed before the error print that is using it.

Free it after printing eventual errors.

Signed-off-by: Johannes Thumshirn <jthumsh...@suse.de>
Fixes: df351ef73789 ("nvme-fabrics: fix memory leak when parsing host ID 
option")
Reported-by: Dan Carpenter <dan.carpen...@oracle.com>
Cc: Roland Dreier <rol...@purestorage.com>
---
  drivers/nvme/host/fabrics.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index eb46967bb0d5..9cee72a80472 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -739,12 +739,13 @@ static int nvmf_parse_options(struct nvmf_ctrl_options 
*opts,
goto out;
}
ret = uuid_parse(p, &hostid);
-   kfree(p);
if (ret) {
pr_err("Invalid hostid %s\n", p);
ret = -EINVAL;
+   kfree(p);
goto out;
}
+   kfree(p);
break;
case NVMF_OPT_DUP_CONNECT:
opts->duplicate_connect = true;




Looks good,

Reviewed-by: Max Gurtovoy <m...@mellanox.com>



Re: [PATCH V5 1/2] nvme-pci: introduce RECONNECTING state to mark initializing procedure

2018-01-18 Thread Max Gurtovoy



On 1/18/2018 12:10 PM, Jianchao Wang wrote:

After Sagi's commit (nvme-rdma: fix concurrent reset and reconnect),
both nvme-fc/rdma have following pattern:
RESETTING - quiesce blk-mq queues, teardown and delete queues/
connections, clear out outstanding IO requests...
RECONNECTING - establish new queues/connections and some other
initializing things.


I guess we can call it NVME_CTRL_CONNECTING in a later patch (a more 
suitable name for this state now).



Introduce RECONNECTING to nvme-pci transport to do the same mark.
Then we get a coherent state definition among nvme pci/rdma/fc
transports.

Suggested-by: James Smart 
Signed-off-by: Jianchao Wang 
---
  drivers/nvme/host/core.c |  2 +-
  drivers/nvme/host/pci.c  | 19 +--
  2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 230cc09..23b3e53 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -242,7 +242,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
switch (new_state) {
case NVME_CTRL_ADMIN_ONLY:
switch (old_state) {
-   case NVME_CTRL_RESETTING:
+   case NVME_CTRL_RECONNECTING:
changed = true;
/* FALLTHRU */
default:
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 45f843d..05344be 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1138,9 +1138,14 @@ static bool nvme_should_reset(struct nvme_dev *dev, u32 
csts)
 */
bool nssro = dev->subsystem && (csts & NVME_CSTS_NSSRO);
  
-	/* If there is a reset ongoing, we shouldn't reset again. */

-   if (dev->ctrl.state == NVME_CTRL_RESETTING)
+   /* If there is a reset/reinit ongoing, we shouldn't reset again. */
+   switch (dev->ctrl.state) {
+   case NVME_CTRL_RESETTING:
+   case NVME_CTRL_RECONNECTING:
return false;
+   default:
+   break;
+   }
  
  	/* We shouldn't reset unless the controller is on fatal error state

 * _or_ if we lost the communication with it.
@@ -2304,6 +2309,16 @@ static void nvme_reset_work(struct work_struct *work)
if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
nvme_dev_disable(dev, false);
  
+	/*

+* Introduce RECONNECTING state from nvme-fc/rdma transports to mark the
+* initializing procedure here.
+*/
+   if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RECONNECTING)) {
+   dev_warn(dev->ctrl.device,
+   "failed to mark controller RECONNECTING\n");
+   goto out;
+   }
+
result = nvme_pci_enable(dev);
if (result)
goto out;




Re: [PATCH V4 1/2] nvme: add NVME_CTRL_RESET_PREPARE state

2018-01-17 Thread Max Gurtovoy

hi Jianchao Wang,

On 1/17/2018 6:54 AM, Jianchao Wang wrote:

Currently, ctrl->state is changed to NVME_CTRL_RESETTING
before the reset work is queued. This is not very strict: there could
be a big gap before the reset_work callback is invoked. In addition,
some of the work done in the reset_work callback (the disable
procedure) is, strictly speaking, not part of the reset work and
could lead to confusion.

In addition, after the state is set to RESETTING and the disable
procedure runs, nvme-rdma/fc use NVME_CTRL_RECONNECTING to mark the
setup and reconnect procedure. The RESETTING state has been narrowed.

This patch adds the NVME_CTRL_RESET_PREPARE state to mark the
reset_work or error recovery work, the scheduling gap, and the
disable procedure.
After that,
  - For nvme-pci, nvmet-loop, set state to RESETTING, start
initialization.
  - For nvme-rdma, nvme-fc, set state to RECONNECTING, start
initialization or reconnect.

Suggested-by: Christoph Hellwig 
Signed-off-by: Jianchao Wang 
---
  drivers/nvme/host/core.c   | 18 +++---
  drivers/nvme/host/fc.c |  4 ++--
  drivers/nvme/host/nvme.h   |  8 
  drivers/nvme/host/pci.c| 25 +
  drivers/nvme/host/rdma.c   |  2 +-
  drivers/nvme/target/loop.c |  5 +
  6 files changed, 52 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 230cc09..87d209f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -87,7 +87,7 @@ static __le32 nvme_get_log_dw10(u8 lid, size_t size)
  
  int nvme_reset_ctrl(struct nvme_ctrl *ctrl)

  {
-   if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+   if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESET_PREPARE))
return -EBUSY;
if (!queue_work(nvme_wq, &ctrl->reset_work))
return -EBUSY;
@@ -260,7 +260,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
break;
}
break;
-   case NVME_CTRL_RESETTING:
+   case NVME_CTRL_RESET_PREPARE:
switch (old_state) {
case NVME_CTRL_NEW:
case NVME_CTRL_LIVE:
@@ -271,10 +271,20 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
break;
}
break;
+
+   case NVME_CTRL_RESETTING:
+   switch (old_state) {
+   case NVME_CTRL_RESET_PREPARE:
+   changed = true;
+   /* FALLTHRU */
+   default:
+   break;
+   }
+   break;
case NVME_CTRL_RECONNECTING:
switch (old_state) {
case NVME_CTRL_LIVE:
-   case NVME_CTRL_RESETTING:
+   case NVME_CTRL_RESET_PREPARE:


As I suggested in V3, please don't allow this transition.
We'll move to NVME_CTRL_RECONNECTING from NVME_CTRL_RESETTING.

I look at it like this:

NVME_CTRL_RESET_PREPARE - "suspend" state
NVME_CTRL_RESETTING - "resume" state

You don't reconnect from the "suspend" state; you must "resume" before 
you reconnect.




changed = true;
/* FALLTHRU */
default:
@@ -286,6 +296,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
case NVME_CTRL_LIVE:
case NVME_CTRL_ADMIN_ONLY:
case NVME_CTRL_RESETTING:
+   case NVME_CTRL_RESET_PREPARE:
case NVME_CTRL_RECONNECTING:
changed = true;
/* FALLTHRU */
@@ -2660,6 +2671,7 @@ static ssize_t nvme_sysfs_show_state(struct device *dev,
[NVME_CTRL_LIVE]= "live",
[NVME_CTRL_ADMIN_ONLY]  = "only-admin",
[NVME_CTRL_RESETTING]   = "resetting",
+   [NVME_CTRL_RESET_PREPARE]   = "reset-prepare",
[NVME_CTRL_RECONNECTING]= "reconnecting",
[NVME_CTRL_DELETING]= "deleting",
[NVME_CTRL_DEAD]= "dead",
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 306aee4..1ba0669 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -546,7 +546,7 @@ nvme_fc_resume_controller(struct nvme_fc_ctrl *ctrl)
queue_delayed_work(nvme_wq, &ctrl->connect_work, 0);
break;
  
-	case NVME_CTRL_RESETTING:

+   case NVME_CTRL_RESET_PREPARE:
/*
 * Controller is already in the process of terminating the
 * association. No need to do anything further. The reconnect
@@ -789,7 +789,7 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
 */
break;
  
-	case NVME_CTRL_RESETTING:

+   case NVME_CTRL_RESET_PREPARE:
/*
 * Controller is already in the process of terminating the
 * association.  No need to do anything further. The reconnect
diff 

Re: [Suspected-Phishing]Re: [PATCH V3 1/2] nvme: split resetting state into reset_prepate and resetting

2018-01-15 Thread Max Gurtovoy



On 1/15/2018 3:28 PM, Max Gurtovoy wrote:



On 1/14/2018 11:48 AM, Sagi Grimberg wrote:



Currently, ctrl->state is changed to NVME_CTRL_RESETTING
before the reset work is queued. This is not very strict: there could
be a big gap before the reset_work callback is invoked. In addition,
some of the work done in the reset_work callback (the disable
procedure) is, strictly speaking, not part of the reset work and
could lead to confusion.

This patch splits NVME_CTRL_RESETTING into NVME_CTRL_RESET_PREPARE
and NVME_CTRL_RESETTING: before queueing the reset work, the state
changes to NVME_CTRL_RESET_PREPARE; after the disable work completes,
it changes to NVME_CTRL_RESETTING.


What guarantees that nvme_timeout will do the right thing? If it
is relying on the fact that nvme-pci sets the state to RESETTING
only _after_ quiescing the request queues, it needs to be documented,
as it's a real delicate dependency.





Suggested-by: Christoph Hellwig <h...@lst.de>
Signed-off-by: Jianchao Wang <jianchao.w.w...@oracle.com>
---
  drivers/nvme/host/core.c   | 17 +++--
  drivers/nvme/host/fc.c |  2 ++
  drivers/nvme/host/nvme.h   |  1 +
  drivers/nvme/host/pci.c    | 28 ++--
  drivers/nvme/host/rdma.c   |  8 
  drivers/nvme/target/loop.c |  5 +
  6 files changed, 53 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1e46e60..106a437 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -87,7 +87,7 @@ static __le32 nvme_get_log_dw10(u8 lid, size_t size)
  int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
  {
-    if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+    if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESET_PREPARE))
  return -EBUSY;
  if (!queue_work(nvme_wq, &ctrl->reset_work))
  return -EBUSY;
@@ -243,7 +243,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
  break;
  }
  break;
-    case NVME_CTRL_RESETTING:
+    case NVME_CTRL_RESET_PREPARE:
  switch (old_state) {
  case NVME_CTRL_NEW:
  case NVME_CTRL_LIVE:
@@ -253,10 +253,21 @@ bool nvme_change_ctrl_state(struct nvme_ctrl 
*ctrl,

  break;
  }
  break;
+
+    case NVME_CTRL_RESETTING:
+    switch (old_state) {
+    case NVME_CTRL_RESET_PREPARE:
+    changed = true;
+    /* FALLTHRU */
+    default:
+    break;
+    }
+    break;
  case NVME_CTRL_RECONNECTING:
  switch (old_state) {
  case NVME_CTRL_LIVE:
  case NVME_CTRL_RESETTING:
+    case NVME_CTRL_RESET_PREPARE:


I forgot to add that we shouldn't move from RESET_PREPARE to 
RECONNECTING (with my suggestion to rdma.c).
We also need to consider adding another check in nvmf_check_init_req() 
(drivers/nvme/host/fabrics.h) for the new state.




  changed = true;
  /* FALLTHRU */
  default:
@@ -267,6 +278,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
  switch (old_state) {
  case NVME_CTRL_LIVE:
  case NVME_CTRL_RESETTING:
+    case NVME_CTRL_RESET_PREPARE:
  case NVME_CTRL_RECONNECTING:
  changed = true;
  /* FALLTHRU */
@@ -2603,6 +2615,7 @@ static ssize_t nvme_sysfs_show_state(struct 
device *dev,

  [NVME_CTRL_NEW]    = "new",
  [NVME_CTRL_LIVE]    = "live",
  [NVME_CTRL_RESETTING]    = "resetting",
+    [NVME_CTRL_RESET_PREPARE]    = "reset-prepare",
  [NVME_CTRL_RECONNECTING]= "reconnecting",
  [NVME_CTRL_DELETING]    = "deleting",
  [NVME_CTRL_DEAD]    = "dead",
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 794e66e..516c1ea 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -547,6 +547,7 @@ nvme_fc_resume_controller(struct nvme_fc_ctrl *ctrl)
  break;
  case NVME_CTRL_RESETTING:
+    case NVME_CTRL_RESET_PREPARE:
  /*
   * Controller is already in the process of terminating the
   * association. No need to do anything further. The reconnect
@@ -790,6 +791,7 @@ nvme_fc_ctrl_connectivity_loss(struct 
nvme_fc_ctrl *ctrl)

  break;
  case NVME_CTRL_RESETTING:
+    case NVME_CTRL_RESET_PREPARE:
  /*
   * Controller is already in the process of terminating the
   * association.  No need to do anything further. The reconnect


James, can you take a look at this?


diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f5800c3..e477c35 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1141,8 +1141,13 @@ static bool nvme_should_reset(struct nvme_dev 
*dev, u32 csts)

  bool nssro = dev->subsystem && (csts & NVME_CSTS_NSSRO);
  /* If there is a reset ongoing, we shouldn't reset again. */
-    if (dev->ctrl.state == NVME_CTRL_R

Re: [PATCH V3 1/2] nvme: split resetting state into reset_prepate and resetting

2018-01-15 Thread Max Gurtovoy



On 1/14/2018 11:48 AM, Sagi Grimberg wrote:



Currently, ctrl->state is changed to NVME_CTRL_RESETTING
before the reset work is queued. This is not very strict: there could
be a big gap before the reset_work callback is invoked. In addition,
some of the work done in the reset_work callback (the disable
procedure) is, strictly speaking, not part of the reset work and
could lead to confusion.

This patch splits NVME_CTRL_RESETTING into NVME_CTRL_RESET_PREPARE
and NVME_CTRL_RESETTING: before queueing the reset work, the state
changes to NVME_CTRL_RESET_PREPARE; after the disable work completes,
it changes to NVME_CTRL_RESETTING.


What guarantees that nvme_timeout will do the right thing? If it
is relying on the fact that nvme-pci sets the state to RESETTING
only _after_ quiescing the request queues, it needs to be documented,
as it's a real delicate dependency.





Suggested-by: Christoph Hellwig 
Signed-off-by: Jianchao Wang 
---
  drivers/nvme/host/core.c   | 17 +++--
  drivers/nvme/host/fc.c |  2 ++
  drivers/nvme/host/nvme.h   |  1 +
  drivers/nvme/host/pci.c    | 28 ++--
  drivers/nvme/host/rdma.c   |  8 
  drivers/nvme/target/loop.c |  5 +
  6 files changed, 53 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1e46e60..106a437 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -87,7 +87,7 @@ static __le32 nvme_get_log_dw10(u8 lid, size_t size)
  int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
  {
-    if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+    if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESET_PREPARE))
  return -EBUSY;
  if (!queue_work(nvme_wq, &ctrl->reset_work))
  return -EBUSY;
@@ -243,7 +243,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
  break;
  }
  break;
-    case NVME_CTRL_RESETTING:
+    case NVME_CTRL_RESET_PREPARE:
  switch (old_state) {
  case NVME_CTRL_NEW:
  case NVME_CTRL_LIVE:
@@ -253,10 +253,21 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
  break;
  }
  break;
+
+    case NVME_CTRL_RESETTING:
+    switch (old_state) {
+    case NVME_CTRL_RESET_PREPARE:
+    changed = true;
+    /* FALLTHRU */
+    default:
+    break;
+    }
+    break;
  case NVME_CTRL_RECONNECTING:
  switch (old_state) {
  case NVME_CTRL_LIVE:
  case NVME_CTRL_RESETTING:
+    case NVME_CTRL_RESET_PREPARE:
  changed = true;
  /* FALLTHRU */
  default:
@@ -267,6 +278,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
  switch (old_state) {
  case NVME_CTRL_LIVE:
  case NVME_CTRL_RESETTING:
+    case NVME_CTRL_RESET_PREPARE:
  case NVME_CTRL_RECONNECTING:
  changed = true;
  /* FALLTHRU */
@@ -2603,6 +2615,7 @@ static ssize_t nvme_sysfs_show_state(struct 
device *dev,

  [NVME_CTRL_NEW]    = "new",
  [NVME_CTRL_LIVE]    = "live",
  [NVME_CTRL_RESETTING]    = "resetting",
+    [NVME_CTRL_RESET_PREPARE]    = "reset-prepare",
  [NVME_CTRL_RECONNECTING]= "reconnecting",
  [NVME_CTRL_DELETING]    = "deleting",
  [NVME_CTRL_DEAD]    = "dead",
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 794e66e..516c1ea 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -547,6 +547,7 @@ nvme_fc_resume_controller(struct nvme_fc_ctrl *ctrl)
  break;
  case NVME_CTRL_RESETTING:
+    case NVME_CTRL_RESET_PREPARE:
  /*
   * Controller is already in the process of terminating the
   * association. No need to do anything further. The reconnect
@@ -790,6 +791,7 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl 
*ctrl)

  break;
  case NVME_CTRL_RESETTING:
+    case NVME_CTRL_RESET_PREPARE:
  /*
   * Controller is already in the process of terminating the
   * association.  No need to do anything further. The reconnect


James, can you take a look at this?


diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f5800c3..e477c35 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1141,8 +1141,13 @@ static bool nvme_should_reset(struct nvme_dev 
*dev, u32 csts)

  bool nssro = dev->subsystem && (csts & NVME_CSTS_NSSRO);
  /* If there is a reset ongoing, we shouldn't reset again. */
-    if (dev->ctrl.state == NVME_CTRL_RESETTING)
+    switch (dev->ctrl.state) {
+    case NVME_CTRL_RESETTING:
+    case NVME_CTRL_RESET_PREPARE:
  return false;
+    default:
+    break;
+    }


Isn't the indentation off? Also in other places...

@@ -2292,7 +2303,7 @@ static void nvme_reset_work(struct work_struct 
*work)

  bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
  int 

Re: [PATCH V3 1/2] nvme: split resetting state into reset_prepate and resetting

2018-01-15 Thread Max Gurtovoy



On 1/14/2018 11:48 AM, Sagi Grimberg wrote:



Currently, the ctrl->state will be changed to NVME_CTRL_RESETTING
before queue the reset work. This is not so strict. There could be
a big gap before the reset_work callback is invoked. In addition,
there is some disable work in the reset_work callback, strictly
speaking, not part of reset work, and could lead to some confusion.

This patch splits the NVME_CTRL_RESETTING into NVME_CTRL_RESET_PREPARE
and NVME_CTRL_RESETTING. Before queue the reset work, changes state
to NVME_CTRL_RESET_PREPARE, after disable work completes, changes
state to NVME_CTRL_RESETTING.


What guarantees that nvme_timeout will do the right thing? If it
is relying on the fact that nvme-pci sets the state to RESETTING
only _after_ quiescing the requests queues it needs to be documented
as its a real delicate dependency.





Suggested-by: Christoph Hellwig 
Signed-off-by: Jianchao Wang 
---
  drivers/nvme/host/core.c   | 17 +++--
  drivers/nvme/host/fc.c |  2 ++
  drivers/nvme/host/nvme.h   |  1 +
  drivers/nvme/host/pci.c    | 28 ++--
  drivers/nvme/host/rdma.c   |  8 
  drivers/nvme/target/loop.c |  5 +
  6 files changed, 53 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1e46e60..106a437 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -87,7 +87,7 @@ static __le32 nvme_get_log_dw10(u8 lid, size_t size)
  int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
  {
-    if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+    if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESET_PREPARE))
  return -EBUSY;
  if (!queue_work(nvme_wq, >reset_work))
  return -EBUSY;
@@ -243,7 +243,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
  break;
  }
  break;
-    case NVME_CTRL_RESETTING:
+    case NVME_CTRL_RESET_PREPARE:
  switch (old_state) {
  case NVME_CTRL_NEW:
  case NVME_CTRL_LIVE:
@@ -253,10 +253,21 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
  break;
  }
  break;
+
+    case NVME_CTRL_RESETTING:
+    switch (old_state) {
+    case NVME_CTRL_RESET_PREPARE:
+    changed = true;
+    /* FALLTHRU */
+    default:
+    break;
+    }
+    break;
  case NVME_CTRL_RECONNECTING:
  switch (old_state) {
  case NVME_CTRL_LIVE:
  case NVME_CTRL_RESETTING:
+    case NVME_CTRL_RESET_PREPARE:
  changed = true;
  /* FALLTHRU */
  default:
@@ -267,6 +278,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
  switch (old_state) {
  case NVME_CTRL_LIVE:
  case NVME_CTRL_RESETTING:
+    case NVME_CTRL_RESET_PREPARE:
  case NVME_CTRL_RECONNECTING:
  changed = true;
  /* FALLTHRU */
@@ -2603,6 +2615,7 @@ static ssize_t nvme_sysfs_show_state(struct 
device *dev,

  [NVME_CTRL_NEW]    = "new",
  [NVME_CTRL_LIVE]    = "live",
  [NVME_CTRL_RESETTING]    = "resetting",
+    [NVME_CTRL_RESET_PREPARE]    = "reset-prepare",
  [NVME_CTRL_RECONNECTING]= "reconnecting",
  [NVME_CTRL_DELETING]    = "deleting",
  [NVME_CTRL_DEAD]    = "dead",
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 794e66e..516c1ea 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -547,6 +547,7 @@ nvme_fc_resume_controller(struct nvme_fc_ctrl *ctrl)
  break;
  case NVME_CTRL_RESETTING:
+    case NVME_CTRL_RESET_PREPARE:
  /*
   * Controller is already in the process of terminating the
   * association. No need to do anything further. The reconnect
@@ -790,6 +791,7 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)

  break;
  case NVME_CTRL_RESETTING:
+    case NVME_CTRL_RESET_PREPARE:
  /*
   * Controller is already in the process of terminating the
   * association.  No need to do anything further. The reconnect


James, can you take a look at this?


diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f5800c3..e477c35 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1141,8 +1141,13 @@ static bool nvme_should_reset(struct nvme_dev *dev, u32 csts)

  bool nssro = dev->subsystem && (csts & NVME_CSTS_NSSRO);
  /* If there is a reset ongoing, we shouldn't reset again. */
-    if (dev->ctrl.state == NVME_CTRL_RESETTING)
+    switch (dev->ctrl.state) {
+    case NVME_CTRL_RESETTING:
+    case NVME_CTRL_RESET_PREPARE:
  return false;
+    default:
+    break;
+    }


Isn't the indentation off? Also in other places...

@@ -2292,7 +2303,7 @@ static void nvme_reset_work(struct work_struct *work)

  bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
  int result = -ENODEV;
-    if 

Re: pci driver loads right after unload

2018-01-03 Thread Max Gurtovoy

Hi Greg/Bjorn,

On 1/2/2018 9:27 PM, Greg Kroah-Hartman wrote:

On Tue, Jan 02, 2018 at 01:00:03PM -0600, Bjorn Helgaas wrote:

[+cc Greg, linux-kernel]

Hi Max,

Thanks for the report!

On Tue, Jan 02, 2018 at 01:50:23AM +0200, Max Gurtovoy wrote:

hi all,
I encountered a strange phenomenon using 2 different pci drivers
(nvme and mlx5_core) since 4.15-rc1:
when I try to unload the modules using the "modprobe -r" cmd, it calls
the .probe function right after calling the .remove function and the
module is not really unloaded.
I think there is some race condition, because when I added a
msleep(1000) after "pci_unregister_driver(&nvme_driver);" (in the
nvme module testing; it also worked in mlx5_core), the issue
seems to disappear.


You say "since 4.15-rc1".  Does that mean it's a regression?  If so,
what's the most recent kernel that does not have this problem?  Worst
case, you could bisect to find where it broke.

I don't see anything obvious in the drivers/pci changes between v4.14
and v4.15-rc1.  Module loading and driver binding is mostly driven by
the driver core and udev.  Maybe you could learn something with
"udevadm monitor" or by turning on some of the debug in
lib/kobject_uevent.c?



This should be resolved in 4.15-rc6, there was a regression in -rc1 in
this area when dealing with uevents over netlink.

Max, can you test -rc6 to verify if this is really fixed or not?


I've tested -rc6 and the issue doesn't repro.
I'll continue monitoring this scenario in -rc7.
Can you point to the commit that fixes the issue? I'll test the
kernel with and without this patch.




thanks,

greg k-h



Cheers,
Max.

