Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices

2023-09-25 Thread Michael S. Tsirkin
On Mon, Sep 25, 2023 at 09:40:59PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote:
> > > VDPA is very different from this. You might call them both mediation,
> > > sure, but then you need another word to describe the additional
> > > changes VDPA is doing.
> > 
> > Sorry about hijacking the thread a little bit, but could you
> > call out some of the changes that are the most problematic
> > for you?
> 
> I don't really know these details.

Maybe you should then desist from saying things like "It entirely fails
to achieve the most important thing it needs to do!" You are not making
any new friends by saying this about a piece of software without
knowing the details.

-- 
MST



Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices

2023-09-25 Thread Michael S. Tsirkin
On Mon, Sep 25, 2023 at 09:40:59PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 25, 2023 at 03:44:11PM -0400, Michael S. Tsirkin wrote:
> > > VDPA is very different from this. You might call them both mediation,
> > > sure, but then you need another word to describe the additional
> > > changes VDPA is doing.
> > 
> > Sorry about hijacking the thread a little bit, but could you
> > call out some of the changes that are the most problematic
> > for you?
> 
> I don't really know these details. The operators have an existing
> virtio world that is ABI toward the VM for them, and they do not want
> *anything* to change. The VM should be unaware whether the virtio device is
> created by old hypervisor software or new DPU software. It presents
> exactly the same ABI.
> 
> So the challenge really is to convince people that VDPA delivers that, and
> frankly, I don't think it does. ABI toward the VM is very important
> here.

And to complete the picture, it is the DPU software/firmware that
is responsible for maintaining this ABI in your ideal world?


> > > In this model the DPU is an extension of the hypervisor/qemu
> > > environment and we shift code from x86 side to arm side to increase
> > > security, save power and increase total system performance.
> > 
> > I think I begin to understand. On the DPU you have some virtio
> > devices but also some non-virtio devices.  So you have to
> > use VFIO to talk to the DPU. Reusing VFIO to talk to virtio
> > devices too simplifies things for you.
> 
> Yes
> 
> > If guests will see vendor-specific devices from the DPU anyway, it
> > will be impossible to migrate such guests away from the DPU so the
> > cross-vendor migration capability is less important in this
> > use-case.  Is this a good summary?
> 
> Well, sort of. As I said before, the vendor here is the cloud
> operator, not the DPU supplier. The guest will see an AWS virtio-net
> function, for example.
> 
> The operator will ensure that all their different implementations of
> this function will interwork for migration.
> 
> So within the closed world of a single operator live migration will
> work just fine.
> 
> Since the hypervisors controlled by the operator only migrate within
> the operator's own environment anyhow, it is an already solved problem.
> 
> Jason


Okay the picture emerges I think. Thanks! I'll try to summarize later
for everyone's benefit.


-- 
MST



RE: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices

2023-09-25 Thread Parav Pandit via Virtualization



> From: Jason Wang 
> Sent: Tuesday, September 26, 2023 10:08 AM

> Right, so if we'd consider migration from virtio to vDPA, it needs to be 
> designed
> in a way that allows more involvement from hypervisor other than coupling it
> with a specific interface (like admin virtqueues).
It is not tied to the admin virtqueues.
One way to use it is via admin commands, as in [1].
One could define it without admin commands, by explaining the technical
difficulties where admin commands may not or cannot work.

[1] https://lists.oasis-open.org/archives/virtio-comment/202309/msg00061.html
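
To illustrate the shape of this: a legacy register access trapped by the
vfio driver is relayed to the device as an admin command on the owner PF,
roughly like the sketch below. All names here are hypothetical; the real
command set is the one proposed in [1].

/* Hypothetical sketch: relay a trapped legacy common-config write into
 * an admin command on the owning PF. Names are made up, see [1]. */
static int legacy_cfg_write_sketch(struct virtio_device *pf_vdev,
				   u8 offset, u8 size, const void *buf)
{
	struct admin_cmd_sketch cmd = {
		.opcode = LEGACY_COMMON_CFG_WRITE,	/* per [1] */
		.offset = offset,
		.size	= size,
		.data	= buf,
	};

	/* The device applies legacy semantics internally and completes
	 * the command only when the write has taken effect, so the VF
	 * needs no legacy MMIO BAR0. */
	return admin_cmd_exec_sketch(pf_vdev, &cmd);
}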


RE: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices

2023-09-25 Thread Parav Pandit via Virtualization



> From: Jason Wang 
> Sent: Tuesday, September 26, 2023 10:07 AM


> 
> If you can't find a way to make legacy drivers work, use modern.
>
Understood.
This vfio series makes the legacy drivers work.
Thanks.
 
> That's it.
> 
> Thanks



Re: [PATCH 09/16] vdpa/mlx5: Allow creation/deletion of any given mr struct

2023-09-25 Thread Jason Wang
On Tue, Sep 12, 2023 at 9:02 PM Dragos Tatulea  wrote:
>
> This patch adapts the mr creation/deletion code to be able to work with
> any given mr struct pointer. All the APIs are adapted to take an extra
> parameter for the mr.
>
> mlx5_vdpa_create/delete_mr doesn't need a ASID parameter anymore. The
> check is done in the caller instead (mlx5_set_map).
>
> This change is needed for a followup patch which will introduce an
> additional mr for the vq descriptor data.
>
> Signed-off-by: Dragos Tatulea 
> ---

Thinking about this decoupling, I have a question.

We advertise 2 address spaces and 2 groups, so we actually don't know,
for example, which address space will be used by the dvq.

And actually we allow the user space to do something like

set_group_asid(dvq_group, 0)
set_map(0)
set_group_asid(dvq_group, 1)
set_map(1)

I wonder if decoupling like this patch does can work, and why.

It looks to me that the easiest way is to let each AS be backed by an MR.
Then we don't even need to care about the dvq and cvq.
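
To make that concrete, a rough sketch (untested; field and constant
names like mr[] and MLX5_VDPA_NUM_AS are hypothetical, the helpers are
the ones this series ends up with):

/* Untested sketch: back each address space with its own MR, indexed
 * by ASID, so set_map() only ever touches the MR of that ASID. */
struct mlx5_vdpa_mr *mr[MLX5_VDPA_NUM_AS];

static int set_map_sketch(struct mlx5_vdpa_dev *mvdev, unsigned int asid,
			  struct vhost_iotlb *iotlb)
{
	/* Drop whatever MR currently backs this ASID ... */
	mlx5_vdpa_destroy_mr(mvdev, asid);
	/* ... and rebuild it from the new iotlb; dvq/cvq don't matter. */
	return mlx5_vdpa_create_mr(mvdev, iotlb, asid);
}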

Thanks


Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices

2023-09-25 Thread Jason Wang
On Mon, Sep 25, 2023 at 8:26 PM Jason Gunthorpe  wrote:
>
> On Mon, Sep 25, 2023 at 10:34:54AM +0800, Jason Wang wrote:
>
> > > Cloud vendors will similarly use DPUs to create PCI functions that
> > > meet the cloud vendor's internal specification.
> >
> > This can only work if:
> >
> > 1) the internal specification has finer grain than the virtio spec
> > 2) so it can define what is not implemented in the virtio spec (like
> > migration and compatibility)
>
> Yes, and that is what is happening. Realistically the "spec" is just a
> piece of software that the Cloud vendor owns which is simply ported to
> multiple DPU vendors.
>
> It is the same as VDPA. If VDPA can make multiple NIC vendors
> consistent then why do you have a hard time believing we can do the
> same thing just on the ARM side of a DPU?

I don't. We all know vDPA can do more than virtio.

>
> > All of the above doesn't seem to be possible or realistic now, and it
> > actually has a risk to be not compatible with virtio spec. In the
> > future when virtio has live migration supported, they want to be able
> > to migrate between virtio and vDPA.
>
> Well, that is for the spec to design.

Right, so if we'd consider migration from virtio to vDPA, it needs to
be designed in a way that allows more involvement from hypervisor
other than coupling it with a specific interface (like admin
virtqueues).

>
> > > So, as I keep saying, in this scenario the goal is no mediation in the
> > > hypervisor.
> >
> > That's pretty fine, but I don't think trapping + relaying is not
> > mediation. Does it really matter what happens after trapping?
>
> It is not mediation in the sense that the kernel driver does not in
> any way make decisions on the behavior of the device. It simply
> transforms an IO operation into a device command and relays it to the
> device. The device still fully controls its own behavior.
>
> VDPA is very different from this. You might call them both mediation,
> sure, but then you need another word to describe the additional
> changes VDPA is doing.
>
> > > It is pointless, everything you think you need to do there
> > > is actually already being done in the DPU.
> >
> > Well, migration or even Qemu could be offloaded to DPU as well. If
> > that's the direction that's pretty fine.
>
> That's silly, of course qemu/kvm can't run in the DPU.

KVM can't for sure but part of Qemu could. This model has been used.

>
> However, we can empty qemu and the hypervisor out so all it does is
> run kvm and run vfio. In this model the DPU does all the OVS, storage,
> "VPDA", etc. qemu is just a passive relay of the DPU PCI functions
> into VM's vPCI functions.
>
> So, everything VDPA was doing in the environment is migrated into the
> DPU.

It really depends on the use cases. For example, in the case of DPU
what if we want to provide multiple virtio devices through a single
VF?

>
> In this model the DPU is an extension of the hypervisor/qemu
> environment and we shift code from x86 side to arm side to increase
> security, save power and increase total system performance.

That's pretty fine.

Thanks

>
> Jason
>


Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices

2023-09-25 Thread Jason Wang
On Tue, Sep 26, 2023 at 11:45 AM Parav Pandit  wrote:
>
>
>
> > From: Michael S. Tsirkin 
> > Sent: Tuesday, September 26, 2023 12:06 AM
>
> > One can thinkably do that wait in hardware, though. Just defer completion 
> > until
> > read is done.
> >
> Once OASIS defines such a new interface, and if some hw vendor _actually_
> wants to build such complex hw, maybe the vfio driver can adapt to it.

It is you who tries to revive legacy in the spec. We all know legacy
is tricky but works.

> When we worked with you, we discussed that such hw does not have enough
> returns, and hence the technical committee chose to proceed with admin
> commands.

I don't think my questions regarding the legacy transport got good
answers at that time. What's more, we all know the spec allows us to fix,
work around, or even deprecate a feature.

Thanks


Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices

2023-09-25 Thread Jason Wang
On Tue, Sep 26, 2023 at 12:01 PM Parav Pandit  wrote:
>
>
>
> > From: Jason Wang 
> > Sent: Tuesday, September 26, 2023 8:03 AM
> >
> > It's the implementation details in legacy. The device needs to make sure
> > the reset is done before get_status returns, so the driver can work.
> It is part of the 0.9.5 and 1.x specification, as I quoted above.

What I meant is: legacy devices need to find their way to make legacy
drivers work. That's how legacy works.

It's too late to add any normative requirements to the 0.95 spec. So the
device behaviour is actually defined by the legacy drivers. That is why it
is tricky.

If you can't find a way to make legacy drivers work, use modern.

That's it.

Thanks


RE: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices

2023-09-25 Thread Parav Pandit via Virtualization



> From: Jason Wang 
> Sent: Tuesday, September 26, 2023 8:03 AM
> 
> It's the implementation details in legacy. The device needs to make sure
> the reset is done before get_status returns, so the driver can work.
It is part of the 0.9.5 and 1.x specification, as I quoted above.


RE: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices

2023-09-25 Thread Parav Pandit via Virtualization



> From: Michael S. Tsirkin 
> Sent: Tuesday, September 26, 2023 12:06 AM

> One can thinkably do that wait in hardware, though. Just defer completion 
> until
> read is done.
>
Once OASIS defines such a new interface, and if some hw vendor _actually_
wants to build such complex hw, maybe the vfio driver can adapt to it.
When we worked with you, we discussed that such hw does not have enough
returns, and hence the technical committee chose to proceed with admin
commands.
I will skip re-discussing it all over again here.

The current virtio spec delivers the best trade-offs of functionality,
performance, and lightweight implementation, with a forward path towards
more features, such as migration, as Jason explained.
All with near-zero driver, qemu, and sw involvement for a rapidly growing
feature set...


Re: [PATCH 08/16] vdpa/mlx5: Rename mr destroy functions

2023-09-25 Thread Jason Wang
On Tue, Sep 12, 2023 at 9:02 PM Dragos Tatulea  wrote:
>
> Make mlx5_destroy_mr symmetric to mlx5_create_mr.
>
> Signed-off-by: Dragos Tatulea 

Acked-by: Jason Wang 

Thanks

> ---
>  drivers/vdpa/mlx5/core/mlx5_vdpa.h |  4 ++--
>  drivers/vdpa/mlx5/core/mr.c|  6 +++---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c  | 12 ++--
>  3 files changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h 
> b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
> index 554899a80241..e1e6e7aba50e 100644
> --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h
> +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
> @@ -118,8 +118,8 @@ int mlx5_vdpa_handle_set_map(struct mlx5_vdpa_dev *mvdev, 
> struct vhost_iotlb *io
>  bool *change_map, unsigned int asid);
>  int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb 
> *iotlb,
> unsigned int asid);
> -void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev);
> -void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int 
> asid);
> +void mlx5_vdpa_destroy_mr_resources(struct mlx5_vdpa_dev *mvdev);
> +void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid);
>  int mlx5_vdpa_update_cvq_iotlb(struct mlx5_vdpa_dev *mvdev,
> struct vhost_iotlb *iotlb,
> unsigned int asid);
> diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c
> index fde00497f4ad..00dcce190a1f 100644
> --- a/drivers/vdpa/mlx5/core/mr.c
> +++ b/drivers/vdpa/mlx5/core/mr.c
> @@ -507,7 +507,7 @@ static void _mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev 
> *mvdev, unsigned int asid
> mr->initialized = false;
>  }
>
> -void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int 
> asid)
> +void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid)
>  {
> struct mlx5_vdpa_mr *mr = &mvdev->mr;
>
> @@ -518,9 +518,9 @@ void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev 
> *mvdev, unsigned int asid)
> mutex_unlock(&mvdev->mkey_mtx);
>  }
>
> -void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev)
> +void mlx5_vdpa_destroy_mr_resources(struct mlx5_vdpa_dev *mvdev)
>  {
> -   mlx5_vdpa_destroy_mr_asid(mvdev, 
> mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP]);
> +   mlx5_vdpa_destroy_mr(mvdev, 
> mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP]);
> prune_iotlb(mvdev);
>  }
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 061d8f7a661a..4d759ab96319 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -2644,7 +2644,7 @@ static int mlx5_vdpa_change_map(struct mlx5_vdpa_dev 
> *mvdev,
> goto err_mr;
>
> teardown_driver(ndev);
> -   mlx5_vdpa_destroy_mr_asid(mvdev, asid);
> +   mlx5_vdpa_destroy_mr(mvdev, asid);
> err = mlx5_vdpa_create_mr(mvdev, iotlb, asid);
> if (err)
> goto err_mr;
> @@ -2660,7 +2660,7 @@ static int mlx5_vdpa_change_map(struct mlx5_vdpa_dev 
> *mvdev,
> return 0;
>
>  err_setup:
> -   mlx5_vdpa_destroy_mr_asid(mvdev, asid);
> +   mlx5_vdpa_destroy_mr(mvdev, asid);
>  err_mr:
> return err;
>  }
> @@ -2797,7 +2797,7 @@ static void mlx5_vdpa_set_status(struct vdpa_device 
> *vdev, u8 status)
>  err_driver:
> unregister_link_notifier(ndev);
>  err_setup:
> -   mlx5_vdpa_destroy_mr(&ndev->mvdev);
> +   mlx5_vdpa_destroy_mr_resources(&ndev->mvdev);
> ndev->mvdev.status |= VIRTIO_CONFIG_S_FAILED;
>  err_clear:
> up_write(&ndev->reslock);
> @@ -2824,7 +2824,7 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev)
> unregister_link_notifier(ndev);
> teardown_driver(ndev);
> clear_vqs_ready(ndev);
> -   mlx5_vdpa_destroy_mr(&ndev->mvdev);
> +   mlx5_vdpa_destroy_mr_resources(&ndev->mvdev);
> ndev->mvdev.status = 0;
> ndev->mvdev.suspended = false;
> ndev->cur_num_vqs = 0;
> @@ -2944,7 +2944,7 @@ static void mlx5_vdpa_free(struct vdpa_device *vdev)
> ndev = to_mlx5_vdpa_ndev(mvdev);
>
> free_resources(ndev);
> -   mlx5_vdpa_destroy_mr(mvdev);
> +   mlx5_vdpa_destroy_mr_resources(mvdev);
> if (!is_zero_ether_addr(ndev->config.mac)) {
> pfmdev = pci_get_drvdata(pci_physfn(mvdev->mdev->pdev));
> mlx5_mpfs_del_mac(pfmdev, ndev->config.mac);
> @@ -3474,7 +3474,7 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev 
> *v_mdev, const char *name,
>  err_res2:
> free_resources(ndev);
>  err_mr:
> -   mlx5_vdpa_destroy_mr(mvdev);
> +   mlx5_vdpa_destroy_mr_resources(mvdev);
>  err_res:
> mlx5_vdpa_free_resources(&ndev->mvdev);
>  err_mpfs:
> --
> 2.41.0
>


Re: [PATCH 07/16] vdpa/mlx5: Collapse "dvq" mr add/delete functions

2023-09-25 Thread Jason Wang
On Tue, Sep 12, 2023 at 9:02 PM Dragos Tatulea  wrote:
>
> Now that the cvq code is out of mlx5_vdpa_create/destroy_mr, the "dvq"
> functions can be folded into their callers.
>
> Having "dvq" in the naming will no longer be accurate in the downstream
> patches.
>
> Signed-off-by: Dragos Tatulea 

Acked-by: Jason Wang 

Thanks

> ---
>  drivers/vdpa/mlx5/core/mr.c | 16 +---
>  1 file changed, 5 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c
> index 587300e7c18e..fde00497f4ad 100644
> --- a/drivers/vdpa/mlx5/core/mr.c
> +++ b/drivers/vdpa/mlx5/core/mr.c
> @@ -489,7 +489,7 @@ static void destroy_user_mr(struct mlx5_vdpa_dev *mvdev, 
> struct mlx5_vdpa_mr *mr
> }
>  }
>
> -static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned 
> int asid)
> +static void _mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev, unsigned int 
> asid)
>  {
> struct mlx5_vdpa_mr *mr = &mvdev->mr;
>
> @@ -513,7 +513,7 @@ void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev 
> *mvdev, unsigned int asid)
>
> mutex_lock(&mvdev->mkey_mtx);
>
> -   _mlx5_vdpa_destroy_dvq_mr(mvdev, asid);
> +   _mlx5_vdpa_destroy_mr(mvdev, asid);
>
> mutex_unlock(&mvdev->mkey_mtx);
>  }
> @@ -524,9 +524,9 @@ void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev)
> prune_iotlb(mvdev);
>  }
>
> -static int _mlx5_vdpa_create_dvq_mr(struct mlx5_vdpa_dev *mvdev,
> -   struct vhost_iotlb *iotlb,
> -   unsigned int asid)
> +static int _mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev,
> +   struct vhost_iotlb *iotlb,
> +   unsigned int asid)
>  {
> struct mlx5_vdpa_mr *mr = &mvdev->mr;
> int err;
> @@ -550,12 +550,6 @@ static int _mlx5_vdpa_create_dvq_mr(struct mlx5_vdpa_dev 
> *mvdev,
> return 0;
>  }
>
> -static int _mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev,
> -   struct vhost_iotlb *iotlb, unsigned int asid)
> -{
> -   return _mlx5_vdpa_create_dvq_mr(mvdev, iotlb, asid);
> -}
> -
>  int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb 
> *iotlb,
> unsigned int asid)
>  {
> --
> 2.41.0
>


Re: [PATCH 06/16] vdpa/mlx5: Take cvq iotlb lock during refresh

2023-09-25 Thread Jason Wang
On Tue, Sep 12, 2023 at 9:02 PM Dragos Tatulea  wrote:
>
> The reslock is taken while refresh is called but iommu_lock is more
> specific to this resource. So take the iommu_lock during cvq iotlb
> refresh.
>
> Based on Eugenio's patch [0].
>
> [0] https://lore.kernel.org/lkml/20230112142218.725622-4-epere...@redhat.com/
>
> Suggested-by: Eugenio Pérez 
> Signed-off-by: Dragos Tatulea 

Acked-by: Jason Wang 

Thanks

> ---
>  drivers/vdpa/mlx5/core/mr.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c
> index fcb6ae32e9ed..587300e7c18e 100644
> --- a/drivers/vdpa/mlx5/core/mr.c
> +++ b/drivers/vdpa/mlx5/core/mr.c
> @@ -590,11 +590,19 @@ int mlx5_vdpa_update_cvq_iotlb(struct mlx5_vdpa_dev 
> *mvdev,
> struct vhost_iotlb *iotlb,
> unsigned int asid)
>  {
> +   int err;
> +
> if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid)
> return 0;
>
> +   spin_lock(&mvdev->cvq.iommu_lock);
> +
> prune_iotlb(mvdev);
> -   return dup_iotlb(mvdev, iotlb);
> +   err = dup_iotlb(mvdev, iotlb);
> +
> +   spin_unlock(&mvdev->cvq.iommu_lock);
> +
> +   return err;
>  }
>
>  int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev)
> --
> 2.41.0
>


Re: [PATCH 05/16] vdpa/mlx5: Decouple cvq iotlb handling from hw mapping code

2023-09-25 Thread Jason Wang
On Tue, Sep 12, 2023 at 9:02 PM Dragos Tatulea  wrote:
>
> The handling of the cvq iotlb is currently coupled with the creation
> and destruction of the hardware mkeys (mr).
>
> This patch moves cvq iotlb handling into its own function and shifts it
> to a scope that is not related to mr handling. As cvq handling is just a
> prune_iotlb + dup_iotlb cycle, put it all in the same "update" function.
> Finally, the destruction path is handled by directly pruning the iotlb.
>
> After this move is done the ASID mr code can be collapsed into a single
> function.
>
> Signed-off-by: Dragos Tatulea 
> ---
>  drivers/vdpa/mlx5/core/mlx5_vdpa.h |  3 ++
>  drivers/vdpa/mlx5/core/mr.c| 57 +++---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c  |  7 ++--
>  3 files changed, 28 insertions(+), 39 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h 
> b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
> index 3748f027cfe9..554899a80241 100644
> --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h
> +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
> @@ -120,6 +120,9 @@ int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, 
> struct vhost_iotlb *iotlb,
> unsigned int asid);
>  void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev);
>  void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int 
> asid);
> +int mlx5_vdpa_update_cvq_iotlb(struct mlx5_vdpa_dev *mvdev,
> +   struct vhost_iotlb *iotlb,
> +   unsigned int asid);
>  int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev);
>
>  #define mlx5_vdpa_warn(__dev, format, ...)   
>   \
> diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c
> index 7bd0883b8b25..fcb6ae32e9ed 100644
> --- a/drivers/vdpa/mlx5/core/mr.c
> +++ b/drivers/vdpa/mlx5/core/mr.c
> @@ -489,14 +489,6 @@ static void destroy_user_mr(struct mlx5_vdpa_dev *mvdev, 
> struct mlx5_vdpa_mr *mr
> }
>  }
>
> -static void _mlx5_vdpa_destroy_cvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned 
> int asid)
> -{
> -   if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid)
> -   return;
> -
> -   prune_iotlb(mvdev);
> -}
> -
>  static void _mlx5_vdpa_destroy_dvq_mr(struct mlx5_vdpa_dev *mvdev, unsigned 
> int asid)
>  {
> struct mlx5_vdpa_mr *mr = &mvdev->mr;
> @@ -522,25 +514,14 @@ void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev 
> *mvdev, unsigned int asid)
> mutex_lock(&mvdev->mkey_mtx);
>
> _mlx5_vdpa_destroy_dvq_mr(mvdev, asid);
> -   _mlx5_vdpa_destroy_cvq_mr(mvdev, asid);
>
> mutex_unlock(&mvdev->mkey_mtx);
>  }
>
>  void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev)
>  {
> -   mlx5_vdpa_destroy_mr_asid(mvdev, 
> mvdev->group2asid[MLX5_VDPA_CVQ_GROUP]);
> mlx5_vdpa_destroy_mr_asid(mvdev, 
> mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP]);
> -}
> -
> -static int _mlx5_vdpa_create_cvq_mr(struct mlx5_vdpa_dev *mvdev,
> -   struct vhost_iotlb *iotlb,
> -   unsigned int asid)
> -{
> -   if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid)
> -   return 0;
> -
> -   return dup_iotlb(mvdev, iotlb);
> +   prune_iotlb(mvdev);
>  }
>
>  static int _mlx5_vdpa_create_dvq_mr(struct mlx5_vdpa_dev *mvdev,
> @@ -572,22 +553,7 @@ static int _mlx5_vdpa_create_dvq_mr(struct mlx5_vdpa_dev 
> *mvdev,
>  static int _mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev,
> struct vhost_iotlb *iotlb, unsigned int asid)
>  {
> -   int err;
> -
> -   err = _mlx5_vdpa_create_dvq_mr(mvdev, iotlb, asid);
> -   if (err)
> -   return err;
> -
> -   err = _mlx5_vdpa_create_cvq_mr(mvdev, iotlb, asid);
> -   if (err)
> -   goto out_err;
> -
> -   return 0;
> -
> -out_err:
> -   _mlx5_vdpa_destroy_dvq_mr(mvdev, asid);
> -
> -   return err;
> +   return _mlx5_vdpa_create_dvq_mr(mvdev, iotlb, asid);
>  }
>
>  int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb 
> *iotlb,
> @@ -620,7 +586,24 @@ int mlx5_vdpa_handle_set_map(struct mlx5_vdpa_dev 
> *mvdev, struct vhost_iotlb *io
> return err;
>  }
>
> +int mlx5_vdpa_update_cvq_iotlb(struct mlx5_vdpa_dev *mvdev,
> +   struct vhost_iotlb *iotlb,
> +   unsigned int asid)
> +{
> +   if (mvdev->group2asid[MLX5_VDPA_CVQ_GROUP] != asid)
> +   return 0;
> +
> +   prune_iotlb(mvdev);
> +   return dup_iotlb(mvdev, iotlb);
> +}
> +
>  int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev)
>  {
> -   return mlx5_vdpa_create_mr(mvdev, NULL, 0);
> +   int err;
> +
> +   err = mlx5_vdpa_create_mr(mvdev, NULL, 0);
> +   if (err)
> +   return err;
> +
> +   return mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, 0);
>  }

Nit: still a little bit of coupling, but anyhow:

Acked-by: Jason Wang 

Thanks

> diff --git 

Re: [PATCH 04/16] vdpa/mlx5: Create helper function for dma mappings

2023-09-25 Thread Jason Wang
On Tue, Sep 12, 2023 at 9:02 PM Dragos Tatulea  wrote:
>
> Necessary for upcoming cvq separation from mr allocation.
>
> Signed-off-by: Dragos Tatulea 

Acked-by: Jason Wang 

Thanks


> ---
>  drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 +
>  drivers/vdpa/mlx5/core/mr.c| 5 +
>  drivers/vdpa/mlx5/net/mlx5_vnet.c  | 4 ++--
>  3 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h 
> b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
> index ca56242972b3..3748f027cfe9 100644
> --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h
> +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
> @@ -120,6 +120,7 @@ int mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev, 
> struct vhost_iotlb *iotlb,
> unsigned int asid);
>  void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev);
>  void mlx5_vdpa_destroy_mr_asid(struct mlx5_vdpa_dev *mvdev, unsigned int 
> asid);
> +int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev);
>
>  #define mlx5_vdpa_warn(__dev, format, ...)   
>   \
> dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, 
> __func__, __LINE__, \
> diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c
> index 5a1971fcd87b..7bd0883b8b25 100644
> --- a/drivers/vdpa/mlx5/core/mr.c
> +++ b/drivers/vdpa/mlx5/core/mr.c
> @@ -619,3 +619,8 @@ int mlx5_vdpa_handle_set_map(struct mlx5_vdpa_dev *mvdev, 
> struct vhost_iotlb *io
>
> return err;
>  }
> +
> +int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev)
> +{
> +   return mlx5_vdpa_create_mr(mvdev, NULL, 0);
> +}
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 37be945a0230..d34c19b4e139 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -2836,7 +2836,7 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev)
> ++mvdev->generation;
>
> if (MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) {
> -   if (mlx5_vdpa_create_mr(mvdev, NULL, 0))
> +   if (mlx5_vdpa_create_dma_mr(mvdev))
> mlx5_vdpa_warn(mvdev, "create MR failed\n");
> }
> up_write(&ndev->reslock);
> @@ -3441,7 +3441,7 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev 
> *v_mdev, const char *name,
> goto err_mpfs;
>
> if (MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) {
> -   err = mlx5_vdpa_create_mr(mvdev, NULL, 0);
> +   err = mlx5_vdpa_create_dma_mr(mvdev);
> if (err)
> goto err_res;
> }
> --
> 2.41.0
>


Re: [PATCH] vringh: don't use vringh_kiov_advance() in vringh_iov_xfer()

2023-09-25 Thread Jason Wang
On Mon, Sep 25, 2023 at 6:31 PM Stefano Garzarella  wrote:
>
> In the while loop of vringh_iov_xfer(), `partlen` could be 0 if one of
> the `iov` has 0 length.
> In this case, we should skip the iov and go to the next one.
> But calling vringh_kiov_advance() with 0 length does not cause the
> advancement, since it returns immediately if asked to advance by 0 bytes.
>
> Let's restore the code that was there before commit b8c06ad4d67d
> ("vringh: implement vringh_kiov_advance()"), avoiding using
> vringh_kiov_advance().
>
> Fixes: b8c06ad4d67d ("vringh: implement vringh_kiov_advance()")
> Cc: sta...@vger.kernel.org
> Reported-by: Jason Wang 
> Signed-off-by: Stefano Garzarella 

Acked-by: Jason Wang 

Thanks

> ---
>  drivers/vhost/vringh.c | 12 +++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c
> index 955d938eb663..7b8fd977f71c 100644
> --- a/drivers/vhost/vringh.c
> +++ b/drivers/vhost/vringh.c
> @@ -123,8 +123,18 @@ static inline ssize_t vringh_iov_xfer(struct vringh *vrh,
> done += partlen;
> len -= partlen;
> ptr += partlen;
> +   iov->consumed += partlen;
> +   iov->iov[iov->i].iov_len -= partlen;
> +   iov->iov[iov->i].iov_base += partlen;
>
> -   vringh_kiov_advance(iov, partlen);
> +   if (!iov->iov[iov->i].iov_len) {
> +   /* Fix up old iov element then increment. */
> +   iov->iov[iov->i].iov_len = iov->consumed;
> +   iov->iov[iov->i].iov_base -= iov->consumed;
> +
> +   iov->consumed = 0;
> +   iov->i++;
> +   }
> }
> return done;
>  }
> --
> 2.41.0
>
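
For context, the no-op advance is visible in vringh_kiov_advance() itself
(a paraphrased sketch of drivers/vhost/vringh.c, not the verbatim code):

/* With len == 0 the loop body never runs, so a zero-length iov element
 * is never stepped over and the while loop in vringh_iov_xfer() spins
 * forever. */
void vringh_kiov_advance(struct vringh_kiov *iov, size_t len)
{
	while (len && iov->i < iov->used) {
		size_t partlen = min(iov->iov[iov->i].iov_len, len);

		/* ... consume partlen, move iov->i forward when the
		 * current element is emptied ... */
		len -= partlen;
	}
}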


Re: [RFC 3/7] vhost: Add 3 new uapi to support iommufd

2023-09-25 Thread Jason Wang
On Sun, Sep 24, 2023 at 1:05 AM Cindy Lu  wrote:
>
> VHOST_VDPA_SET_IOMMU_FD: bind the device to iommufd device
>
> VDPA_DEVICE_ATTACH_IOMMUFD_AS: Attach a vdpa device to an iommufd
> address space specified by IOAS id.
>
> VDPA_DEVICE_DETACH_IOMMUFD_AS: Detach a vdpa device
> from the iommufd address space
>
> Signed-off-by: Cindy Lu 
> ---
>  drivers/vhost/vdpa.c   | 191 +
>  include/uapi/linux/vhost.h |  71 ++
>  2 files changed, 262 insertions(+)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index ec32f785dfde..91da012084e9 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -25,6 +26,8 @@
>
>  #include "vhost.h"
>
> +MODULE_IMPORT_NS(IOMMUFD);
> +
>  enum {
> VHOST_VDPA_BACKEND_FEATURES =
> (1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2) |
> @@ -69,6 +72,15 @@ static void vhost_vdpa_iotlb_unmap(struct vhost_vdpa *v,
>struct vhost_iotlb *iotlb, u64 start,
>u64 last, u32 asid);
>
> +void vhost_vdpa_lockdep_assert_held(struct vdpa_device *vdpa)
> +{
> +   struct vhost_vdpa *v = vdpa_get_drvdata(vdpa);
> +
> +   if (WARN_ON(!v))
> +   return;
> +   lockdep_assert_held(&v->vdev.mutex);
> +}
> +
>  static inline u32 iotlb_to_asid(struct vhost_iotlb *iotlb)
>  {
> struct vhost_vdpa_as *as = container_of(iotlb, struct
> @@ -497,6 +509,173 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
>
> return ops->suspend(vdpa);
>  }
> +static long vhost_vdpa_tmp_set_iommufd(struct vhost_vdpa *v, void __user 
> *argp)
> +{
> +   struct device *dma_dev = vdpa_get_dma_dev(v->vdpa);
> +   struct vhost_vdpa_set_iommufd set_iommufd;
> +   struct vdpa_device *vdpa = v->vdpa;
> +   struct iommufd_ctx *ictx;
> +   unsigned long minsz;
> +   u32 pt_id, dev_id;
> +   struct fd f;
> +   long r = 0;
> +   int idx;
> +
> +   minsz = offsetofend(struct vhost_vdpa_set_iommufd, ioas_id);
> +   if (copy_from_user(&set_iommufd, argp, minsz))
> +   return -EFAULT;
> +
> +   if (set_iommufd.group_id >= v->nvqs)
> +   return -ENOBUFS;

Needs to be v->ngroups, but as replied below, I think it should be the asid.
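
E.g. something like this (untested sketch; whether the right bound lives
in v->vdpa->ngroups or a cached v->ngroups depends on the final design):

	/* Bound the index by the group (or AS) count, not by the
	 * number of virtqueues. */
	if (set_iommufd.group_id >= v->vdpa->ngroups)
		return -ENOBUFS;

	idx = array_index_nospec(set_iommufd.group_id, v->vdpa->ngroups);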

> +
> +   idx = array_index_nospec(set_iommufd.group_id, v->nvqs);
> +
> +   /* Unset IOMMUFD */
> +   if (set_iommufd.iommufd < 0) {
> +   if (!test_bit(idx, vdpa->vq_bitmap))
> +   return -EINVAL;
> +
> +   if (!vdpa->iommufd_ictx || !vdpa->iommufd_device)
> +   return -EINVAL;
> +   if (atomic_read(&vdpa->iommufd_users)) {
> +   atomic_dec(&vdpa->iommufd_users);
> +   return 0;
> +   }
> +   vdpa_iommufd_unbind(v->vdpa);
> +   vdpa->iommufd_device = NULL;
> +   vdpa->iommufd_ictx = NULL;
> +   clear_bit(idx, vdpa->vq_bitmap);
> +   return iommu_attach_device(v->domain, dma_dev);
> +   }
> +   /* First opened virtqueue of this vdpa device */
> +
> +   if (!vdpa->vq_bitmap) {
> +   vdpa->vq_bitmap = bitmap_alloc(v->nvqs, GFP_KERNEL);
> +   }
> +   ///if (test_bit(idx, vdpa->vq_bitmap))
> +   //return -EBUSY;
> +
> +   /* For same device but different groups, ++refcount only */
> +   if (vdpa->iommufd_device)
> +   goto out_inc;
> +
> +   r = -EBADF;
> +   f = fdget(set_iommufd.iommufd);
> +   if (!f.file)
> +   goto out_bitmap_free;
> +
> +   r = -EINVAL;
> +   ictx = iommufd_ctx_from_file(f.file);
> +   if (IS_ERR(ictx))
> +   goto out_fdput;
> +
> +   if (v->domain)
> +   iommu_detach_device(v->domain, dma_dev);
> +
> +#if 0
> +   iommu_group_add_device(iommu_group_alloc(), &vdpa->dev);
> +#endif
> +   pt_id = set_iommufd.ioas_id;
> +   r = vdpa_iommufd_bind(vdpa, ictx, &pt_id, &dev_id);
> +   if (r)
> +   goto out_reattach;
> +
> +   set_iommufd.out_dev_id = dev_id;
> +   set_iommufd.out_hwpt_id = pt_id;
> +   r = copy_to_user(argp + minsz, &set_iommufd.out_dev_id,
> +sizeof(set_iommufd.out_dev_id) +
> +sizeof(set_iommufd.out_hwpt_id)) ?
> +   -EFAULT :
> +   0;
> +   if (r)
> +   goto out_device_unbind;
> +   printk(KERN_ERR "[%s] %d called %p\n", __func__, __LINE__,
> +  vdpa->iommufd_ictx);
> +
> +   vdpa->iommufd_ictx = ictx;
> +
> +out_inc:
> +   atomic_inc(&vdpa->iommufd_users);
> +   set_bit(idx, vdpa->vq_bitmap);
> +
> +   goto out_fdput;
> +
> +out_device_unbind:
> +
> +   vdpa_iommufd_unbind(vdpa);
> +out_reattach:
> +
> +   iommu_attach_device(v->domain, dma_dev);
> +   

Re: [RFC 1/7] vhost/iommufd: Add the functions support iommufd

2023-09-25 Thread Jason Wang
On Sun, Sep 24, 2023 at 1:05 AM Cindy Lu  wrote:
>
> Add a new file, vhost/iommufd.c, to support iommufd. This file contains
> the iommufd functions for the emulated device and the physical device.
>
> Signed-off-by: Cindy Lu 
> ---
>  drivers/vhost/iommufd.c | 151 
>  drivers/vhost/vhost.h   |  21 ++
>  2 files changed, 172 insertions(+)
>  create mode 100644 drivers/vhost/iommufd.c
>
> diff --git a/drivers/vhost/iommufd.c b/drivers/vhost/iommufd.c
> new file mode 100644
> index ..080858f76fd5
> --- /dev/null
> +++ b/drivers/vhost/iommufd.c
> @@ -0,0 +1,151 @@
> +#include 
> +#include 
> +
> +#include "vhost.h"
> +
> +MODULE_IMPORT_NS(IOMMUFD);
> +
> +int vdpa_iommufd_bind(struct vdpa_device *vdpa, struct iommufd_ctx *ictx,
> + u32 *ioas_id, u32 *device_id)
> +{
> +   int ret;
> +
> +   vhost_vdpa_lockdep_assert_held(vdpa);
> +
> +   /*
> +* If the driver doesn't provide this op then it means the device does
> +* not do DMA at all. So nothing to do.
> +*/
> +   if (!vdpa->config->bind_iommufd)
> +   return 0;
> +
> +   ret = vdpa->config->bind_iommufd(vdpa, ictx, device_id);
> +   if (ret)
> +   return ret;
> +
> +   ret = vdpa->config->attach_ioas(vdpa, ioas_id);
> +   if (ret)
> +   goto err_unbind;
> +   vdpa->iommufd_attached = true;
> +
> +   return 0;
> +
> +err_unbind:
> +   if (vdpa->config->unbind_iommufd)
> +   vdpa->config->unbind_iommufd(vdpa);
> +   return ret;
> +}
> +
> +void vdpa_iommufd_unbind(struct vdpa_device *vdpa)
> +{
> +   vhost_vdpa_lockdep_assert_held(vdpa);
> +
> +   if (vdpa->config->unbind_iommufd)
> +   vdpa->config->unbind_iommufd(vdpa);
> +}
> +
> +int vdpa_iommufd_physical_bind(struct vdpa_device *vdpa,
> +  struct iommufd_ctx *ictx, u32 *out_device_id)
> +{
> +   struct device *dma_dev = vdpa_get_dma_dev(vdpa);
> +   struct iommufd_device *idev;
> +
> +   idev = iommufd_device_bind(ictx, dma_dev, out_device_id);
> +   if (IS_ERR(idev))
> +   return PTR_ERR(idev);
> +   vdpa->iommufd_device = idev;
> +   return 0;
> +}
> +EXPORT_SYMBOL_GPL(vdpa_iommufd_physical_bind);
> +
> +void vdpa_iommufd_physical_unbind(struct vdpa_device *vdpa)
> +{
> +   vhost_vdpa_lockdep_assert_held(vdpa);
> +
> +   if (vdpa->iommufd_attached) {
> +   iommufd_device_detach(vdpa->iommufd_device);
> +   vdpa->iommufd_attached = false;
> +   }
> +   iommufd_device_unbind(vdpa->iommufd_device);
> +   vdpa->iommufd_device = NULL;
> +}
> +EXPORT_SYMBOL_GPL(vdpa_iommufd_physical_unbind);
> +
> +int vdpa_iommufd_physical_attach_ioas(struct vdpa_device *vdpa, u32 *pt_id)
> +{
> +   unsigned int flags = 0;
> +
> +   return iommufd_device_attach(vdpa->iommufd_device, pt_id);
> +}
> +EXPORT_SYMBOL_GPL(vdpa_iommufd_physical_attach_ioas);
> +
> +static void vdpa_emulated_unmap(void *data, unsigned long iova,
> +   unsigned long length)
> +{
> +   struct vdpa_device *vdpa = data;
> +
> +   vdpa->config->dma_unmap(vdpa, 0, iova, length);
> +}
> +
> +static const struct iommufd_access_ops vdpa_user_ops = {
> +   .needs_pin_pages = 1,

Note while rebasing that simulators support VA, so there is no page pinning in that case.

static bool use_va = true;
module_param(use_va, bool, 0444);
MODULE_PARM_DESC(use_va, "Enable/disable the device's ability to use VA");

So we need to handle that case as well.

(Note that it looks like VA mode is broken, I may need some time to fix that).
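
Roughly (sketch only, reusing vdpa_user_ops from this patch; struct
vdpa_device already carries a use_va flag that could drive the choice):

/* Sketch: VA-mode parents copy to/from userspace themselves, so
 * iommufd does not need to pin pages for them. */
static const struct iommufd_access_ops vdpa_user_ops_pa = {
	.needs_pin_pages = 1,
	.unmap = vdpa_emulated_unmap,
};

static const struct iommufd_access_ops vdpa_user_ops_va = {
	/* no pinning in VA mode */
	.unmap = vdpa_emulated_unmap,
};

/* at bind time:
 * ops = vdpa->use_va ? &vdpa_user_ops_va : &vdpa_user_ops_pa; */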

> +   .unmap = vdpa_emulated_unmap,
> +};
> +
> +int vdpa_iommufd_emulated_bind(struct vdpa_device *vdpa,
> +  struct iommufd_ctx *ictx, u32 *out_device_id)
> +{
> +   vhost_vdpa_lockdep_assert_held(vdpa);
> +
> +   vdpa->iommufd_ictx = ictx;
> +   iommufd_ctx_get(ictx);
> +   struct iommufd_device *idev;
> +
> +   idev = iommufd_device_bind(ictx, vdpa->dma_dev, out_device_id);

This seems inappropriate for emulated devices, as it deals with concepts
that only exist for physical devices, like the IOMMU domain etc.

If possible, please refer to how VFIO handles this (I guess it should
have something).
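
For reference, VFIO's emulated path binds an iommufd_access (with an
unmap callback) instead of an iommufd_device. A rough equivalent here
might look like the sketch below; note the iommufd_access_create()
signature has changed across kernel versions, so treat this as a shape,
not an API reference:

int vdpa_iommufd_emulated_bind(struct vdpa_device *vdpa,
			       struct iommufd_ctx *ictx, u32 *out_device_id)
{
	struct iommufd_access *access;

	vhost_vdpa_lockdep_assert_held(vdpa);

	access = iommufd_access_create(ictx, &vdpa_user_ops, vdpa,
				       out_device_id);
	if (IS_ERR(access))
		return PTR_ERR(access);

	iommufd_ctx_get(ictx);
	vdpa->iommufd_ictx = ictx;
	vdpa->iommufd_access = access;
	return 0;
}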

Thanks


Re: [RFC 5/7] vdpa: Add new vdpa_config_ops

2023-09-25 Thread Jason Wang
On Sun, Sep 24, 2023 at 1:06 AM Cindy Lu  wrote:
>
> Add new vdpa_config_ops to support iommufd
>
> Signed-off-by: Cindy Lu 
> ---
>  include/linux/vdpa.h | 34 +-
>  1 file changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
> index 6d0f5e4e82c2..4ada5bd6f90e 100644
> --- a/include/linux/vdpa.h
> +++ b/include/linux/vdpa.h
> @@ -5,6 +5,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -91,6 +92,12 @@ struct vdpa_device {
> struct vdpa_mgmt_dev *mdev;
> unsigned int ngroups;
> unsigned int nas;
> +   struct iommufd_access *iommufd_access;
> +   struct iommufd_device *iommufd_device;
> +   struct iommufd_ctx *iommufd_ictx;
> +   unsigned long *vq_bitmap;
> +   atomic_t iommufd_users;
> +   bool iommufd_attached;
>  };
>
>  /**
> @@ -282,6 +289,15 @@ struct vdpa_map_file {
>   * @iova: iova to be unmapped
>   * @size: size of the area
>   * Returns integer: success (0) or error (< 0)
> + * @bind_iommufd:  use vdpa_iommufd_physical_bind for an IOMMU
> + * backed device.
> + * otherwise use vdpa_iommufd_emulated_bind
> + * @unbind_iommufd:use vdpa_iommufd_physical_unbind for an IOMMU
> + * backed device.
> + * otherwise, use vdpa_iommufd_emulated_unbind
> + * @attach_ioas:   use vdpa_iommufd_physical_attach_ioas for an
> + * IOMMU backed device.
> + * @detach_ioas:   Opposite of attach_ioas

Those should be marked as mandatory only for parents with specific
translations (e.g. the simulator and mlx5_vdpa).

Or is there anything I missed?
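
(For context: a parent opting in would presumably just wire the helpers
from patch 1 into its config ops; a sketch of what patch 6 likely does
for vdpa_sim, not copied from the actual patch:)

static const struct vdpa_config_ops vdpasim_config_ops = {
	/* ... existing ops ... */
	.bind_iommufd	= vdpa_iommufd_emulated_bind,
	.unbind_iommufd	= vdpa_iommufd_emulated_unbind,
	.attach_ioas	= vdpa_iommufd_emulated_attach_ioas,
	.detach_ioas	= vdpa_iommufd_emulated_detach_ioas,
};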

Thanks


>   * @free:  Free resources that belongs to vDPA (optional)
>   * @vdev: vdpa device
>   */
> @@ -341,6 +357,12 @@ struct vdpa_config_ops {
>  u64 iova, u64 size);
> int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group,
>   unsigned int asid);
> +   /* IOMMUFD ops */
> +   int (*bind_iommufd)(struct vdpa_device *vdev, struct iommufd_ctx 
> *ictx,
> +   u32 *out_device_id);
> +   void (*unbind_iommufd)(struct vdpa_device *vdev);
> +   int (*attach_ioas)(struct vdpa_device *vdev, u32 *pt_id);
> +   int (*detach_ioas)(struct vdpa_device *vdev);
>
> /* Free device resources */
> void (*free)(struct vdpa_device *vdev);
> @@ -510,4 +532,14 @@ struct vdpa_mgmt_dev {
>  int vdpa_mgmtdev_register(struct vdpa_mgmt_dev *mdev);
>  void vdpa_mgmtdev_unregister(struct vdpa_mgmt_dev *mdev);
>
> -#endif /* _LINUX_VDPA_H */
> +int vdpa_iommufd_physical_bind(struct vdpa_device *vdpa,
> +  struct iommufd_ctx *ictx, u32 *out_device_id);
> +void vdpa_iommufd_physical_unbind(struct vdpa_device *vdpa);
> +int vdpa_iommufd_physical_attach_ioas(struct vdpa_device *vdpa, u32 *pt_id);
> +int vdpa_iommufd_emulated_bind(struct vdpa_device *vdpa,
> +  struct iommufd_ctx *ictx, u32 *out_device_id);
> +void vdpa_iommufd_emulated_unbind(struct vdpa_device *vdpa);
> +int vdpa_iommufd_emulated_attach_ioas(struct vdpa_device *vdpa, u32 *pt_id);
> +int vdpa_iommufd_emulated_detach_ioas(struct vdpa_device *vdpa);
> +
> +#endif
> --
> 2.34.3
>


Re: [RFC 0/7] vdpa: Add support for iommufd

2023-09-25 Thread Jason Wang
On Sun, Sep 24, 2023 at 1:05 AM Cindy Lu  wrote:
>
> Hi All,
> I really apologize for the delay. This is the draft RFC for
> iommufd support for vdpa. This code provides the basic functionality
> for iommufd support.
>
> The code was tested and passed in device vdpa_sim_net
> The qemu code is
> https://gitlab.com/lulu6/gitlabqemutmp/-/tree/iommufdRFC
> The kernel code is
> https://gitlab.com/lulu6/vhost/-/tree/iommufdRFC
>
> ToDo
> 1. this code is out of date and needs to be cleaned up and rebased on the
> latest code
> 2. this code has some workarounds: I skip the checks for
> iommu_group and CACHE_COHERENCY; there are also some misc issues, like the
> need to add a mutex for iommufd operations
> 3. only tested with the emulated device, other modes not tested yet
>
> After addressing these problems I will send out a new version of the RFC. I
> will provide the code in 3 weeks

Something more needs to be done after a quick glance at the codes.

1) The support for device with platform IOMMU support
2) The support for multiple ASes per device

...

Thanks

>
> Thanks
> Cindy
> Signed-off-by: Cindy Lu 
> The test step is
> 1. create vdpa_sim device
> ...
> vdpa dev add name vdpa15 mgmtdev vdpasim_net
> ...
> 2. load the VM with the command
>   -object iommufd,id=iommufd0 \
>   -device 
> virtio-net-pci,netdev=vhost-vdpa1,disable-legacy=on,disable-modern=off\
>   -netdev 
> type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa1,iommufd=iommufd0\
>
> 3. in guest VM you can find the vdpa_sim port works well.
> [root@ubuntunew ~]# ifconfig eth0
> eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
> inet6 fe80::5054:ff:fe12:3456  prefixlen 64  scopeid 0x20<link>
> ether 52:54:00:12:34:56  txqueuelen 1000  (Ethernet)
> RX packets 53  bytes 9108 (8.8 KiB)
> RX errors 0  dropped 0  overruns 0  frame 0
> TX packets 53  bytes 9108 (8.8 KiB)
> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>
> [root@ubuntunew ~]# ./test.sh eth0
> [  172.815279] pktgen: Packet Generator for packet performance testing. 
> Version: 2.75
> Adding queue 0 of eth0
> Configuring devices eth0@0
> Running... ctrl^C to stop
>
> [root@ubuntunew ~]# ifconfig eth0
> eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
> inet6 fe80::5054:ff:fe12:3456  prefixlen 64  scopeid 0x20<link>
> ether 52:54:00:12:34:56  txqueuelen 1000  (Ethernet)
> RX packets 183455  bytes 11748533 (11.2 MiB)
> RX errors 0  dropped 0  overruns 0  frame 0
> TX packets 183473  bytes 11749685 (11.2 MiB)
> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>
> Cindy Lu (7):
>   vhost/iommufd: Add the functions support iommufd
>   Kconfig: Add the new file vhost/iommufd
>   vhost: Add 3 new uapi to support iommufd
>   vdpa: change the map/unmap process to support iommufd
>   vdpa: Add new vdpa_config_ops
>   vdpa_sim :Add support for iommufd
>   iommufd: Skip the CACHE_COHERENCY and iommu group check
>
>  drivers/iommu/iommufd/device.c   |   6 +-
>  drivers/vdpa/vdpa_sim/vdpa_sim.c |   8 ++
>  drivers/vhost/Kconfig|   1 +
>  drivers/vhost/Makefile   |   1 +
>  drivers/vhost/iommufd.c  | 151 +++
>  drivers/vhost/vdpa.c | 201 +++
>  drivers/vhost/vhost.h|  21 
>  include/linux/vdpa.h |  34 +-
>  include/uapi/linux/vhost.h   |  71 +++
>  9 files changed, 490 insertions(+), 4 deletions(-)
>  create mode 100644 drivers/vhost/iommufd.c
>
> --
> 2.34.3
>


Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices

2023-09-25 Thread Zhu, Lingshan



On 9/26/2023 2:36 AM, Michael S. Tsirkin wrote:

On Mon, Sep 25, 2023 at 08:26:33AM +, Parav Pandit wrote:



From: Jason Wang 
Sent: Monday, September 25, 2023 8:00 AM

On Fri, Sep 22, 2023 at 8:25 PM Parav Pandit  wrote:



From: Jason Gunthorpe 
Sent: Friday, September 22, 2023 5:53 PM



And what's more, using MMIO BAR0 then it can work for legacy.

Oh? How? Our team didn't think so.

It does not. It was already discussed.
The device reset in legacy is not synchronous.

How do you know this?


Not sure of the motivation for repeating the same discussion already done in
OASIS with you and others in the past.

Anyways, please find the answer below.

About reset,
The legacy device specification did not enforce the below-cited driver
requirement of 1.0.

"The driver SHOULD consider a driver-initiated reset complete when it reads device 
status as 0."
  
[1] https://ozlabs.org/~rusty/virtio-spec/virtio-0.9.5.pdf

Basically, I think any drivers that did not read status (linux pre 2011)
before freeing memory under DMA have a reset path that is racy wrt DMA, since
memory writes are posted, and IO writes, while not posted, have a completion
that does not order posted transactions, e.g. from the pci express spec:
 D2b
 An I/O or Configuration Write Completion is permitted to pass a
Posted Request.
Having said that, there were a ton of driver races discovered on this
path in the years since; I suspect if one cares about this then
just avoiding stress on reset is wise.




The drivers do not wait for reset to complete; it was written for the sw
backend.

Do you see there's a flush after reset in the legacy driver?


Yes. It only flushes the write by reading it. The driver does not _wait_
for the reset to complete within the device like above.

One can thinkably do that wait in hardware, though. Just defer completion until
read is done.
I agree with MST. At least Intel devices work fine with vfio-pci and 
legacy driver without any changes.

So far so good.

Thanks
Zhu Lingshan



Please see the reset flow of the 1.x device below.
In fact the comment of the 1.x device also needs to be updated to indicate
that the driver needs to wait for the device to finish the reset.
I will send a separate patch improving this comment of vp_reset() to match
the spec.

static void vp_reset(struct virtio_device *vdev)
{
 struct virtio_pci_device *vp_dev = to_vp_device(vdev);
 struct virtio_pci_modern_device *mdev = &vp_dev->mdev;

 /* 0 status means a reset. */
 vp_modern_set_status(mdev, 0);
 /* After writing 0 to device_status, the driver MUST wait for a read of
  * device_status to return 0 before reinitializing the device.
  * This will flush out the status write, and flush in device writes,
  * including MSI-X interrupts, if any.
  */
 while (vp_modern_get_status(mdev))
 msleep(1);
 /* Flush pending VQ/configuration callbacks. */
 vp_synchronize_vectors(vdev);
}



static void vp_reset(struct virtio_device *vdev) {
 struct virtio_pci_device *vp_dev = to_vp_device(vdev);
 /* 0 status means a reset. */
 vp_legacy_set_status(&vp_dev->ldev, 0);
 /* Flush out the status write, and flush in device writes,
  * including MSi-X interrupts, if any. */
 vp_legacy_get_status(&vp_dev->ldev);
 /* Flush pending VQ/configuration callbacks. */
 vp_synchronize_vectors(vdev);
}

Thanks




Hence MMIO BAR0 is not the best option in real implementations.



Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices

2023-09-25 Thread Jason Wang
On Mon, Sep 25, 2023 at 4:26 PM Parav Pandit  wrote:
>
>
>
> > From: Jason Wang 
> > Sent: Monday, September 25, 2023 8:00 AM
> >
> > On Fri, Sep 22, 2023 at 8:25 PM Parav Pandit  wrote:
> > >
> > >
> > > > From: Jason Gunthorpe 
> > > > Sent: Friday, September 22, 2023 5:53 PM
> > >
> > >
> > > > > And what's more, using MMIO BAR0 then it can work for legacy.
> > > >
> > > > Oh? How? Our team didn't think so.
> > >
> > > It does not. It was already discussed.
> > > The device reset in legacy is not synchronous.
> >
> > How do you know this?
> >
> Not sure of the motivation for repeating the same discussion already done in
> OASIS with you and others in the past.

That is exactly the same point.

It's too late to define the legacy behaviour accurately in the spec so
people will be lost in the legacy maze easily.

>
> Anyways, please find the answer below.
>
> About reset,
> The legacy device specification did not enforce the below-cited driver
> requirement of 1.0.
>
> "The driver SHOULD consider a driver-initiated reset complete when it reads 
> device status as 0."

We are talking about how to make devices work for legacy drivers. So
it has nothing to do with 1.0.

>
> [1] https://ozlabs.org/~rusty/virtio-spec/virtio-0.9.5.pdf
>
> > > The drivers do not wait for reset to complete; it was written for the sw
> > backend.
> >
> > Do you see there's a flush after reset in the legacy driver?
> >
> Yes. It only flushes the write by reading it. The driver does not _wait_
> for the reset to complete within the device like above.

It's the implementation details in legacy. The device needs to make
sure the reset is done before get_status returns, so the driver can work.

That's all.


Re: [PATCH V3] io_uring: fix IO hang in io_wq_put_and_exit from do_exit()

2023-09-25 Thread Ming Lei
On Mon, Sep 25, 2023 at 05:17:10PM -0400, Stefan Hajnoczi wrote:
> On Fri, Sep 15, 2023 at 03:04:05PM +0800, Jason Wang wrote:
> > On Fri, Sep 8, 2023 at 11:25 PM Ming Lei  wrote:
> > >
> > > On Fri, Sep 08, 2023 at 08:44:45AM -0600, Jens Axboe wrote:
> > > > On 9/8/23 8:34 AM, Ming Lei wrote:
> > > > > On Fri, Sep 08, 2023 at 07:49:53AM -0600, Jens Axboe wrote:
> > > > >> On 9/8/23 3:30 AM, Ming Lei wrote:
> > > > >>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> > > > >>> index ad636954abae..95a3d31a1ef1 100644
> > > > >>> --- a/io_uring/io_uring.c
> > > > >>> +++ b/io_uring/io_uring.c
> > > > >>> @@ -1930,6 +1930,10 @@ void io_wq_submit_work(struct io_wq_work 
> > > > >>> *work)
> > > > >>>   }
> > > > >>>   }
> > > > >>>
> > > > >>> + /* It is fragile to block POLLED IO, so switch to NON_BLOCK */
> > > > >>> + if ((req->ctx->flags & IORING_SETUP_IOPOLL) && def->iopoll_queue)
> > > > >>> + issue_flags |= IO_URING_F_NONBLOCK;
> > > > >>> +
> > > > >>
> > > > >> I think this comment deserves to be more descriptive. Normally we
> > > > >> absolutely cannot block for polled IO, it's only OK here because 
> > > > >> io-wq
> > > > >
> > > > > Yeah, we don't do that until commit 2bc057692599 ("block: don't make 
> > > > > REQ_POLLED
> > > > > imply REQ_NOWAIT") which actually push the responsibility/risk up to
> > > > > io_uring.
> > > > >
> > > > >> is the issuer and not necessarily the poller of it. That generally 
> > > > >> falls
> > > > >> upon the original issuer to poll these requests.
> > > > >>
> > > > >> I think this should be a separate commit, coming before the main fix
> > > > >> which is below.
> > > > >
> > > > > Looks fine, actually IO_URING_F_NONBLOCK change isn't a must, and the
> > > > > approach in V2 doesn't need this change.
> > > > >
> > > > >>
> > > > >>> @@ -3363,6 +3367,12 @@ __cold void io_uring_cancel_generic(bool 
> > > > >>> cancel_all, struct io_sq_data *sqd)
> > > > >>>   finish_wait(&tctx->wait, &wait);
> > > > >>>   } while (1);
> > > > >>>
> > > > >>> + /*
> > > > >>> +  * Reap events from each ctx, otherwise these requests may take
> > > > >>> +  * resources and prevent other contexts from being moved on.
> > > > >>> +  */
> > > > >>> + xa_for_each(&tctx->xa, index, node)
> > > > >>> + io_iopoll_try_reap_events(node->ctx);
> > > > >>
> > > > >> The main issue here is that if someone isn't polling for them, then 
> > > > >> we
> > > > >
> > > > > That is actually what this patch is addressing, :-)
> > > >
> > > > Right, that part is obvious :)
> > > >
> > > > >> get to wait for a timeout before they complete. This can delay exit, 
> > > > >> for
> > > > >> example, as we're now just waiting 30 seconds (or whatever the 
> > > > >> timeout
> > > > >> is on the underlying device) for them to get timed out before exit 
> > > > >> can
> > > > >> finish.
> > > > >
> > > > > For the issue on null_blk, device timeout handler provides
> > > > > forward-progress, such as requests are released, so new IO can be
> > > > > handled.
> > > > >
> > > > > However, not all devices support timeout, such as virtio device.
> > > >
> > > > That's a bug in the driver, you cannot sanely support polled IO and not
> > > > be able to deal with timeouts. Someone HAS to reap the requests and
> > > > there are only two things that can do that - the application doing the
> > > > polled IO, or if that doesn't happen, a timeout.
> > >
> > > OK, then device driver timeout handler has new responsibility of covering
> > > userspace accident, :-)
> 
> Sorry, I don't have enough context so this is probably a silly question:
> 
> When an application doesn't reap a polled request, why doesn't the block
> layer take care of this in a generic way? I don't see anything
> driver-specific about this.

The block layer doesn't have the knowledge to handle that; io_uring knows
the application is exiting, and can help to reap the events.

But the big question is whether there really can be an IO timeout for
virtio-blk. If there is, the reap done in io_uring may never return and
cause other issues, so if it is done in io_uring, it can be thought of as
just a sort of improvement.

The real bug fix is still in the device driver; usually only the driver's
timeout handler can provide a forward-progress guarantee.

> 
> Driver-specific behavior would be sending an abort/cancel upon timeout.
> virtio-blk cannot do that because there is no such command in the device
> specification at the moment. So simply waiting for the polled request to
> complete is the only thing that can be done (aside from resetting the
> device), and it's generic behavior.

Then it looks unsafe to support IO polling for virtio-blk; maybe disable it
by default for now, until the virtio-blk spec starts to support IO abort?

Thanks,
Ming


Re: [PATCH V3] io_uring: fix IO hang in io_wq_put_and_exit from do_exit()

2023-09-25 Thread Stefan Hajnoczi
On Fri, Sep 15, 2023 at 03:04:05PM +0800, Jason Wang wrote:
> On Fri, Sep 8, 2023 at 11:25 PM Ming Lei  wrote:
> >
> > On Fri, Sep 08, 2023 at 08:44:45AM -0600, Jens Axboe wrote:
> > > On 9/8/23 8:34 AM, Ming Lei wrote:
> > > > On Fri, Sep 08, 2023 at 07:49:53AM -0600, Jens Axboe wrote:
> > > >> On 9/8/23 3:30 AM, Ming Lei wrote:
> > > >>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> > > >>> index ad636954abae..95a3d31a1ef1 100644
> > > >>> --- a/io_uring/io_uring.c
> > > >>> +++ b/io_uring/io_uring.c
> > > >>> @@ -1930,6 +1930,10 @@ void io_wq_submit_work(struct io_wq_work *work)
> > > >>>   }
> > > >>>   }
> > > >>>
> > > >>> + /* It is fragile to block POLLED IO, so switch to NON_BLOCK */
> > > >>> + if ((req->ctx->flags & IORING_SETUP_IOPOLL) && def->iopoll_queue)
> > > >>> + issue_flags |= IO_URING_F_NONBLOCK;
> > > >>> +
> > > >>
> > > >> I think this comment deserves to be more descriptive. Normally we
> > > >> absolutely cannot block for polled IO, it's only OK here because io-wq
> > > >
> > > > Yeah, we don't do that until commit 2bc057692599 ("block: don't make
> > > > REQ_POLLED imply REQ_NOWAIT"), which actually pushes the
> > > > responsibility/risk up to io_uring.
> > > >
> > > >> is the issuer and not necessarily the poller of it. That generally falls
> > > >> upon the original issuer to poll these requests.
> > > >>
> > > >> I think this should be a separate commit, coming before the main fix
> > > >> which is below.
> > > >
> > > > Looks fine, actually IO_URING_F_NONBLOCK change isn't a must, and the
> > > > approach in V2 doesn't need this change.
> > > >
> > > >>
> > > >>> @@ -3363,6 +3367,12 @@ __cold void io_uring_cancel_generic(bool 
> > > >>> cancel_all, struct io_sq_data *sqd)
> > > >>>   finish_wait(&tctx->wait, &wait);
> > > >>>   } while (1);
> > > >>>
> > > >>> + /*
> > > >>> +  * Reap events from each ctx, otherwise these requests may take
> > > >>> +  * resources and prevent other contexts from being moved on.
> > > >>> +  */
> > > >>> + xa_for_each(&tctx->xa, index, node)
> > > >>> + io_iopoll_try_reap_events(node->ctx);
> > > >>
> > > >> The main issue here is that if someone isn't polling for them, then we
> > > >
> > > > That is actually what this patch is addressing, :-)
> > >
> > > Right, that part is obvious :)
> > >
> > > >> get to wait for a timeout before they complete. This can delay exit, for
> > > >> example, as we're now just waiting 30 seconds (or whatever the timeout
> > > >> is on the underlying device) for them to get timed out before exit can
> > > >> finish.
> > > >
> > > > For the issue on null_blk, device timeout handler provides
> > > > forward-progress, such as requests are released, so new IO can be
> > > > handled.
> > > >
> > > > However, not all devices support timeout, such as virtio device.
> > >
> > > That's a bug in the driver, you cannot sanely support polled IO and not
> > > be able to deal with timeouts. Someone HAS to reap the requests and
> > > there are only two things that can do that - the application doing the
> > > polled IO, or if that doesn't happen, a timeout.
> >
> > OK, then the device driver timeout handler has the new responsibility of
> > covering userspace accidents. :-)

Sorry, I don't have enough context so this is probably a silly question:

When an application doesn't reap a polled request, why doesn't the block
layer take care of this in a generic way? I don't see anything
driver-specific about this.

Driver-specific behavior would be sending an abort/cancel upon timeout.
virtio-blk cannot do that because there is no such command in the device
specification at the moment. So simply waiting for the polled request to
complete is the only thing that can be done (aside from resetting the
device), and it's generic behavior.

Thanks,
Stefan

> >
> > We may document this requirement for driver.
> >
> > So far the only one should be virtio-blk, and the two virtio storage
> > drivers never implement timeout handler.
> >
> 
> Adding Stefan for more comments.
> 
> Thanks
> 



Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices

2023-09-25 Thread Michael S. Tsirkin
On Mon, Sep 25, 2023 at 03:53:18PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 25, 2023 at 02:16:30PM -0400, Michael S. Tsirkin wrote:
> 
> > I do want to understand if there's a use-case that vdpa does not address
> > simply because it might be worth while to extend it to do so, and a
> > bunch of people working on it are at Red Hat and I might have some input
> > into how that labor is allocated. But if the use-case is simply "has to
> > be vfio and not vdpa" then I guess not.
> 
> If you strip away all the philosophical arguing, VDPA has no way to
> isolate the control and data virtqs into different IOMMU configurations
> with this single PCI function.

Aha, so address space/PASID support then?

> The existing HW VDPA drivers provided device specific ways to handle
> this.
> 
> Without DMA isolation you can't assign the high speed data virtq's to
> the VM without mediating them as well.
> 
> > It could be that we are using mediation differently - in my world it's
> > when there's some host software on the path between guest and hardware,
> > and this qualifies.  
> 
> That is pretty general. As I said to Jason, if you want to use it that
> way then you need to make up a new word to describe what VDPA does as
> there is a clear difference in scope between this VFIO patch (relay IO
> commands to the device) and VDPA (intercept all the control plane,
> control virtq and bring it to a RedHat/qemu standard common behavior)

IIUC VDPA itself does not really bring it to either RedHat or qemu
standard, it just allows userspace to control behaviour - if userspace
is qemu then it's qemu deciding how it behaves. Which I guess this
doesn't. Right?  RedHat's not in the picture at all I think.

> > There is also a question of capability. Specifically iommufd support
> > is lacking in vdpa (though there are finally some RFC patches to
> > address that). All this is fine, could be enough to motivate
> > a work like this one.
> 
> I've answered many times, you just don't semm to like the answers or
> dismiss them as not relevant to you.
> 
> Jason


Not really - I think I lack some of the picture, so I don't fully
understand. Or maybe I missed something else.

-- 
MST



Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices

2023-09-25 Thread Michael S. Tsirkin
On Mon, Sep 25, 2023 at 09:26:07AM -0300, Jason Gunthorpe wrote:
> > > So, as I keep saying, in this scenario the goal is no mediation in the
> > > hypervisor.
> > 
> > That's pretty fine, but I don't think trapping + relaying is not
> > mediation. Does it really matter what happens after trapping?
> 
> It is not mediation in the sense that the kernel driver does not in
> any way make decisions on the behavior of the device. It simply
> transforms an IO operation into a device command and relays it to the
> device. The device still fully controls its own behavior.
> 
> VDPA is very different from this. You might call them both mediation,
> sure, but then you need another word to describe the additional
> changes VDPA is doing.

Sorry about hijacking the thread a little bit, but could you
call out some of the changes that are the most problematic
for you?

> > > It is pointless, everything you think you need to do there
> > > is actually already being done in the DPU.
> > 
> > Well, migration or even Qemu could be offloaded to the DPU as well. If
> > that's the direction, that's pretty fine.
> 
> That's silly, of course qemu/kvm can't run in the DPU.
> 
> However, we can empty qemu and the hypervisor out so all it does is
> run kvm and run vfio. In this model the DPU does all the OVS, storage,
> "VPDA", etc. qemu is just a passive relay of the DPU PCI functions
> into VM's vPCI functions.
> 
> So, everything VDPA was doing in the environment is migrated into the
> DPU.
> 
> In this model the DPU is an extension of the hypervisor/qemu
> environment and we shift code from x86 side to arm side to increase
> security, save power and increase total system performance.
> 
> Jason

I think I begin to understand. On the DPU you have some virtio
devices but also some non-virtio devices.  So you have to
use VFIO to talk to the DPU. Reusing VFIO to talk to virtio
devices too, simplifies things for you. If guests will see
vendor-specific devices from the DPU anyway, it will be impossible
to migrate such guests away from the DPU so the cross-vendor
migration capability is less important in this use-case.
Is this a good summary?


-- 
MST



Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices

2023-09-25 Thread Michael S. Tsirkin
On Mon, Sep 25, 2023 at 08:26:33AM +, Parav Pandit wrote:
> 
> 
> > From: Jason Wang 
> > Sent: Monday, September 25, 2023 8:00 AM
> > 
> > On Fri, Sep 22, 2023 at 8:25 PM Parav Pandit  wrote:
> > >
> > >
> > > > From: Jason Gunthorpe 
> > > > Sent: Friday, September 22, 2023 5:53 PM
> > >
> > >
> > > > > And what's more, using MMIO BAR0 then it can work for legacy.
> > > >
> > > > Oh? How? Our team didn't think so.
> > >
> > > It does not. It was already discussed.
> > > The device reset in legacy is not synchronous.
> > 
> > How do you know this?
> >
> I am not sure of the motivation for repeating the same discussion
> already held in OASIS with you and others in the past.
> 
> Anyways, please find the answer below.
> 
> About reset,
> The legacy device specification did not enforce the 1.0 driver
> requirement cited below.
> 
> "The driver SHOULD consider a driver-initiated reset complete when it reads 
> device status as 0."
>  
> [1] https://ozlabs.org/~rusty/virtio-spec/virtio-0.9.5.pdf

Basically, I think any drivers that did not read status (linux pre 2011)
before freeing memory under DMA have a reset path that is racy wrt DMA, since 
memory writes are posted and IO writes while not posted have completion
that does not order posted transactions, e.g. from pci express spec:
D2b
An I/O or Configuration Write Completion 37 is permitted to pass a 
Posted Request.
having said that there were a ton of driver races discovered on this
path in the years since, I suspect if one cares about this then
just avoiding stress on reset is wise.



> > > The drivers do not wait for reset to complete; it was written for the
> > > sw backend.
> > 
> > Do you see there's a flush after reset in the legacy driver?
> > 
Yes, it only flushes the write by reading it back. The driver does not
_wait_ for the reset to complete within the device, unlike the 1.x flow
above.

One could conceivably do that wait in hardware, though. Just defer the
completion until the read is done.

> Please see the reset flow of the 1.x device below.
> In fact, the comment in the 1.x device code also needs to be updated to
> indicate that the driver needs to wait for the device to finish the reset.
> I will send a separate patch improving this comment in vp_reset() to match
> the spec.
> 
> static void vp_reset(struct virtio_device *vdev)
> {
> 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> 	struct virtio_pci_modern_device *mdev = &vp_dev->mdev;
> 
> 	/* 0 status means a reset. */
> 	vp_modern_set_status(mdev, 0);
> 	/* After writing 0 to device_status, the driver MUST wait for a read of
> 	 * device_status to return 0 before reinitializing the device.
> 	 * This will flush out the status write, and flush in device writes,
> 	 * including MSI-X interrupts, if any.
> 	 */
> 	while (vp_modern_get_status(mdev))
> 		msleep(1);
> 	/* Flush pending VQ/configuration callbacks. */
> 	vp_synchronize_vectors(vdev);
> }
> 
> 
> > static void vp_reset(struct virtio_device *vdev)
> > {
> > 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> > 	/* 0 status means a reset. */
> > 	vp_legacy_set_status(&vp_dev->ldev, 0);
> > 	/* Flush out the status write, and flush in device writes,
> > 	 * including MSI-X interrupts, if any. */
> > 	vp_legacy_get_status(&vp_dev->ldev);
> > 	/* Flush pending VQ/configuration callbacks. */
> > 	vp_synchronize_vectors(vdev);
> > }
> > 
> > Thanks
> > 
> > 
> > 
> > > Hence MMIO BAR0 is not the best option in real implementations.
> > >
> 


Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices

2023-09-25 Thread Michael S. Tsirkin
On Fri, Sep 22, 2023 at 01:19:28PM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 22, 2023 at 11:39:19AM -0400, Michael S. Tsirkin wrote:
> > On Fri, Sep 22, 2023 at 09:25:01AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Sep 22, 2023 at 11:02:50AM +0800, Jason Wang wrote:
> > > > On Fri, Sep 22, 2023 at 3:53 AM Jason Gunthorpe  wrote:
> > > > >
> > > > > On Thu, Sep 21, 2023 at 03:34:03PM -0400, Michael S. Tsirkin wrote:
> > > > >
> > > > > > that's easy/practical.  If instead VDPA gives the same speed with 
> > > > > > just
> > > > > > shadow vq then keeping this hack in vfio seems like less of a 
> > > > > > problem.
> > > > > > Finally if VDPA is faster then maybe you will reconsider using it ;)
> > > > >
> > > > > It is not all about the speed.
> > > > >
> > > > > VDPA presents another large and complex software stack in the
> > > > > hypervisor that can be eliminated by simply using VFIO.
> > > > 
> > > > vDPA supports standard virtio devices so how did you define
> > > > complexity?
> > > 
> > > As I said, VFIO is already required for other devices in these VMs. So
> > > anything incremental over base-line vfio-pci is complexity to
> > > minimize.
> > > 
> > > Everything vdpa does is either redundant or unnecessary compared to
> > > VFIO in these environments.
> > > 
> > > Jason
> > 
> > Yes, but you know, there are all kinds of environments.  I guess you
> > consider yours the most mainstream and important, and are sure it will
> > always stay like this.  But if there's a driver that does what you need
> > then you use that.
> 
> Come on, you are the one saying we cannot do things in the best way
> possible because you want your way of doing things to be the only way
> allowed. Which of us thinks "yours the most mainstream and important" ??
> 
> I'm not telling you to throw away VPDA, I'm saying there are
> legimitate real world use cases where VFIO is the appropriate
> interface, not VDPA.
> 
> I want choice, not dogmatic exclusion that there is Only One True Way.

I don't particularly think there's only one way; vfio is already there.
I am specifically thinking about this patch, which for example
muddies the waters a bit: normally vfio exposes the device
with the same ID, but suddenly it changes the ID as visible to the guest.
But again, whether doing this kind of thing is OK is more up to Alex than me.

I do want to understand if there's a use-case that vdpa does not address
simply because it might be worth while to extend it to do so, and a
bunch of people working on it are at Red Hat and I might have some input
into how that labor is allocated. But if the use-case is simply "has to
be vfio and not vdpa" then I guess not.




> > You really should be explaining what vdpa *does not* do that you
> > need.
> 
> I think I've done that enough, but if you have been following my
> explanation you should see that the entire point of this design is to
> allow a virtio device to be created inside a DPU to a specific
> detailed specification (eg an AWS virtio-net device, for instance)
> 
> The implementation is in the DPU, and only the DPU.
> 
> At the end of the day VDPA uses mediation and creates some
> RedHat/VDPA/Qemu virtio-net device in the guest. It is emphatically
> NOT a perfect recreation of the "AWS virtio-net" we started out with.
> 
> It entirely fails to achieve the most important thing it needs to do!

It could be that we are using mediation differently - in my world it's
when there's some host software on the path between guest and hardware,
and this qualifies.  The difference between what this patch does and
what vdpa does seems quantitative, not qualitative. Which might be
enough to motivate this work, I don't mind. But you seem to feel
it is qualitative and I am genuinely curious about it, because
if yes then it might lead e.g. the virtio standard in new directions.

I can *imagine* all kind of reasons to want to use vfio as compared to vdpa;
here are some examples I came up with, quickly:
- maybe you have drivers that poke at registers not in virtio spec:
  vfio allows that, vdpa by design does not
- maybe you are using vfio with a lot of devices already and don't want
  to special-case handling for virtio devices on the host
do any of the above motivations ring a bell? Some of the things you
said seem to hint at that. If yes maybe include this in the cover
letter.

There is also a question of capability. Specifically iommufd support
is lacking in vdpa (though there are finally some RFC patches to
address that). All this is fine, could be enough to motivate
a work like this one. But I am very curious to know if there
is any other capability lacking in vdpa. I asked already and you
didn't answer so I guess not?




> Yishai will rework the series with your remarks, we can look again on
> v2, thanks for all the input!
> 
> Jason


Re: [PATCH 1/9] drm/amd/pm: Annotate struct smu10_voltage_dependency_table with __counted_by

2023-09-25 Thread Alex Deucher
On Mon, Sep 25, 2023 at 1:52 PM Kees Cook  wrote:
>
> On Mon, Sep 25, 2023 at 08:30:30AM +0200, Christian König wrote:
> > Am 22.09.23 um 19:41 schrieb Alex Deucher:
> > > On Fri, Sep 22, 2023 at 1:32 PM Kees Cook  wrote:
> > > > Prepare for the coming implementation by GCC and Clang of the 
> > > > __counted_by
> > > > attribute. Flexible array members annotated with __counted_by can have
> > > > their accesses bounds-checked at run-time checking via 
> > > > CONFIG_UBSAN_BOUNDS
> > > > (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
> > > > functions).
> > > >
> > > > As found with Coccinelle[1], add __counted_by for struct 
> > > > smu10_voltage_dependency_table.
> > > >
> > > > [1] 
> > > > https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
> > > >
> > > > Cc: Evan Quan 
> > > > Cc: Alex Deucher 
> > > > Cc: "Christian König" 
> > > > Cc: "Pan, Xinhui" 
> > > > Cc: David Airlie 
> > > > Cc: Daniel Vetter 
> > > > Cc: Xiaojian Du 
> > > > Cc: Huang Rui 
> > > > Cc: Kevin Wang 
> > > > Cc: amd-...@lists.freedesktop.org
> > > > Cc: dri-de...@lists.freedesktop.org
> > > > Signed-off-by: Kees Cook 
> > > Acked-by: Alex Deucher 
> >
> > Mhm, I'm not sure if this is a good idea. That is a structure filled in by
> > the firmware, isn't it?
> >
> > That would imply that we might need to byte swap count before it is
> > checkable.
>
> The script found this instance because of this:
>
> static int smu10_get_clock_voltage_dependency_table(struct pp_hwmgr *hwmgr,
> struct smu10_voltage_dependency_table **pptable,
> uint32_t num_entry, const DpmClock_t *pclk_dependency_table)
> {
> uint32_t i;
> struct smu10_voltage_dependency_table *ptable;
>
> ptable = kzalloc(struct_size(ptable, entries, num_entry), GFP_KERNEL);
> if (NULL == ptable)
> return -ENOMEM;
>
> ptable->count = num_entry;
>
> So the implication is that it's native byte order... but you tell me! I
> certainly don't want this annotation if it's going to break stuff. :)

In this case, the code is for an integrated GPU in an x86 CPU so the
firmware and driver endianness match.  You wouldn't find a standalone
dGPU that uses this structure.  In this case it's ok.  False alarm.

Alex
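
For readers following along, a hedged sketch of the pattern under
discussion; the structure names and the le32_to_cpu() line are
illustrative assumptions about what a byte-swap-aware variant would look
like, not actual amdgpu code:

#include <linux/types.h>

struct example_record {
	u32 clk;
	u32 vol;
};

/* Run-time bounds checks on entries[] are driven by 'count'. */
struct example_dep_table {
	u32 count;
	struct example_record entries[] __counted_by(count);
};

/*
 * If 'count' were filled in by little-endian firmware on a big-endian
 * host, it would have to be swapped before the annotation means
 * anything, e.g.:
 *
 *	table->count = le32_to_cpu(fw->count);
 *
 * Here driver and firmware endianness match, so plain assignment works.
 */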

Re: [PATCH 1/9] drm/amd/pm: Annotate struct smu10_voltage_dependency_table with __counted_by

2023-09-25 Thread Kees Cook
On Mon, Sep 25, 2023 at 08:30:30AM +0200, Christian König wrote:
> Am 22.09.23 um 19:41 schrieb Alex Deucher:
> > On Fri, Sep 22, 2023 at 1:32 PM Kees Cook  wrote:
> > > Prepare for the coming implementation by GCC and Clang of the __counted_by
> > > attribute. Flexible array members annotated with __counted_by can have
> > > their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
> > > (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
> > > functions).
> > > 
> > > As found with Coccinelle[1], add __counted_by for struct 
> > > smu10_voltage_dependency_table.
> > > 
> > > [1] 
> > > https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
> > > 
> > > Cc: Evan Quan 
> > > Cc: Alex Deucher 
> > > Cc: "Christian König" 
> > > Cc: "Pan, Xinhui" 
> > > Cc: David Airlie 
> > > Cc: Daniel Vetter 
> > > Cc: Xiaojian Du 
> > > Cc: Huang Rui 
> > > Cc: Kevin Wang 
> > > Cc: amd-...@lists.freedesktop.org
> > > Cc: dri-de...@lists.freedesktop.org
> > > Signed-off-by: Kees Cook 
> > Acked-by: Alex Deucher 
> 
> Mhm, I'm not sure if this is a good idea. That is a structure filled in by
> the firmware, isn't it?
> 
> That would imply that we might need to byte swap count before it is
> checkable.

The script found this instance because of this:

static int smu10_get_clock_voltage_dependency_table(struct pp_hwmgr *hwmgr,
struct smu10_voltage_dependency_table **pptable,
uint32_t num_entry, const DpmClock_t *pclk_dependency_table)
{
uint32_t i;
struct smu10_voltage_dependency_table *ptable;

ptable = kzalloc(struct_size(ptable, entries, num_entry), GFP_KERNEL);
if (NULL == ptable)
return -ENOMEM;

ptable->count = num_entry;

So the implication is that it's native byte order... but you tell me! I
certainly don't want this annotation if it's going to break stuff. :)

-- 
Kees Cook

Re: [PATCH 3/9] drm/i915/selftests: Annotate struct perf_series with __counted_by

2023-09-25 Thread Kees Cook
On Mon, Sep 25, 2023 at 12:08:36PM +0200, Andrzej Hajda wrote:
> 
> 
> On 22.09.2023 19:32, Kees Cook wrote:
> > Prepare for the coming implementation by GCC and Clang of the __counted_by
> > attribute. Flexible array members annotated with __counted_by can have
> > their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
> > (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
> > functions).
> > 
> > As found with Coccinelle[1], add __counted_by for struct perf_series.
> > 
> > [1] 
> > https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
> > 
> > Cc: Jani Nikula 
> > Cc: Joonas Lahtinen 
> > Cc: Rodrigo Vivi 
> > Cc: Tvrtko Ursulin 
> > Cc: David Airlie 
> > Cc: Daniel Vetter 
> > Cc: Chris Wilson 
> > Cc: John Harrison 
> > Cc: Andi Shyti 
> > Cc: Matthew Brost 
> > Cc: intel-...@lists.freedesktop.org
> > Cc: dri-de...@lists.freedesktop.org
> > Signed-off-by: Kees Cook 
> 
> I am surprised this is the only finding in i915; I would have expected more.

I'm sure there are more, but it's likely my Coccinelle pattern didn't
catch it. There are many many flexible arrays in drm. :)

$ grep -nRH '\[\];$' drivers/gpu/drm include/uapi/drm | grep -v :extern | wc -l
122

If anyone has some patterns I can add to the Coccinelle script, I can
take another pass at it.

> Anyway:
> 
> Reviewed-by: Andrzej Hajda 

Thank you!

-Kees

-- 
Kees Cook


Re: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices

2023-09-25 Thread Michael S. Tsirkin
On Fri, Sep 22, 2023 at 01:22:33PM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 22, 2023 at 11:40:58AM -0400, Michael S. Tsirkin wrote:
> > On Fri, Sep 22, 2023 at 12:15:34PM -0300, Jason Gunthorpe wrote:
> > > On Fri, Sep 22, 2023 at 11:13:18AM -0400, Michael S. Tsirkin wrote:
> > > > On Fri, Sep 22, 2023 at 12:25:06PM +, Parav Pandit wrote:
> > > > > 
> > > > > > From: Jason Gunthorpe 
> > > > > > Sent: Friday, September 22, 2023 5:53 PM
> > > > > 
> > > > > 
> > > > > > > And what's more, using MMIO BAR0 then it can work for legacy.
> > > > > > 
> > > > > > Oh? How? Our team didn't think so.
> > > > > 
> > > > > It does not. It was already discussed.
> > > > > The device reset in legacy is not synchronous.
> > > > > The drivers do not wait for reset to complete; it was written for the 
> > > > > sw backend.
> > > > > Hence MMIO BAR0 is not the best option in real implementations.
> > > > 
> > > > Or maybe they made it synchronous in hardware, that's all.
> > > > After all same is true for the IO BAR0 e.g. for the PF: IO writes
> > > > are posted anyway.
> > > 
> > > IO writes are not posted in PCI.
> > 
> > Aha, I was confused. Thanks for the correction. I guess you just buffer
> > subsequent transactions while reset is going on and reset quickly enough
> > for it to be seamless then?
> 
> From a hardware perspective the CPU issues a non-posted IO write and
> then it stops processing until the far side returns an IO completion.
> 
> Using that you can emulate what the SW virtio model did and delay the
> CPU from restarting until the reset is completed.
> 
> Since MMIO is always posted, this is not possible to emulate directly
> using MMIO.
> 
> Converting IO into non-posted admin commands is a fairly close
> recreation to what actual HW would do.
> 
> Jason

I thought you asked how it is possible for hardware to support reset if
all it does is replace the IO BAR with a memory BAR. The answer is that
since 2011 the reset has been followed by a read of the status field (a
driver behavior not much older than the MSI-X support from 2009 that
this code assumes). If one uses a Linux driver from 2011 onwards, then
all you need to do is defer the response to this read until after the
reset is complete.
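
To make that concrete, a toy device-model sketch of deferring the status
read; all names and structure here are hypothetical, not from any real
device or spec text:

#include <stdbool.h>
#include <stdint.h>

struct toy_dev {
	uint8_t status;
	bool reset_pending;	/* set when the driver writes 0 to status */
};

/*
 * Sketch: the posted status write only latches a pending reset; the
 * next status read stalls until the internal reset has finished, so a
 * 2011+ Linux driver polling for status == 0 observes a reset that is
 * effectively synchronous.
 */
static uint8_t toy_read_status(struct toy_dev *d)
{
	if (d->reset_pending) {
		/* quiesce DMA, drain queues, clear internal state here */
		d->reset_pending = false;
		d->status = 0;
	}
	return d->status;
}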

If you are using older drivers or other OSes, then reset using a posted
write after the device has operated for a while might not be safe: e.g.
you might trigger races if you remove drivers from the system or
trigger hot unplug. For example:

static void virtio_pci_remove(struct pci_dev *pci_dev)
{



unregister_virtio_device(&vp_dev->vdev);

^^^ triggers reset, then releases memory



pci_disable_device(pci_dev);

^^^ blocks DMA by clearing bus master

}

here you could see some DMA into memory that has just been released.


As Jason mentions hardware exists that is used under one of these two
restrictions on the guest (Linux since 2011 or no resets while DMA is
going on), and it works fine with these existing guests.

Given the restrictions, virtio TC didn't elect to standardize this
approach and instead opted for the heavier approach of
converting IO into non-posted admin commands in software.


-- 
MST



Re: [PATCH v2 1/2] iommu/virtio: Make use of ops->iotlb_sync_map

2023-09-25 Thread Robin Murphy

On 2023-09-25 14:29, Jason Gunthorpe wrote:

On Mon, Sep 25, 2023 at 02:07:50PM +0100, Robin Murphy wrote:

On 2023-09-23 00:33, Jason Gunthorpe wrote:

On Fri, Sep 22, 2023 at 07:07:40PM +0100, Robin Murphy wrote:


virtio isn't setting ops->pgsize_bitmap for the sake of direct mappings
either; it sets it once it's discovered any instance, since apparently it's
assuming that all instances must support identical page sizes, and thus once
it's seen one it can work "normally" per the core code's assumptions. It's
also I think the only driver which has a "finalise" bodge but *can* still
properly support map-before-attach, by virtue of having to replay mappings
to every new endpoint anyway.


Well it can't quite do that since it doesn't know the geometry - it
all is sort of guessing and hoping it doesn't explode on replay. If it
knows the geometry it wouldn't need finalize...


I think it's entirely reasonable to assume that any direct mappings
specified for a device are valid for that device and its IOMMU. However, in
the particular case of virtio, it really shouldn't ever have direct mappings
anyway, since even if the underlying hardware did have any, the host can
enforce the actual direct-mapping aspect itself, and just present them as
unusable regions to the guest.


I assume this machinery is for the ARM GIC ITS page


Again, that's irrelevant. It can only be about whether the actual
->map_pages call succeeds or not. A driver could well know up-front that all
instances support the same pgsize_bitmap and aperture, and set both at
->domain_alloc time, yet still be unable to handle an actual mapping without
knowing which instance(s) that needs to interact with (e.g. omap-iommu).


I think this is a different issue. The domain is supposed to represent
the actual io pte storage, and the storage is supposed to exist even
when the domain is not attached to anything.

As we said with tegra-gart, it is a bug in the driver if all the
mappings disappear when the last device is detached from the domain.
Driver bugs like this turn into significant issues with vfio/iommufd
as this will result in warn_on's and memory leaking.

So, I disagree that this is something we should be allowing in the API
design. map_pages should succeed (memory allocation failures aside) if
an IOVA within the aperture and valid flags are presented, regardless
of the attachment status. Calling map_pages with an IOVA outside the
aperture should be a caller bug.

It looks omap is just mis-designed to store the pgd in the omap_iommu,
not the omap_iommu_domain :( pgd is clearly a per-domain object in our
API. And why does every instance need its own copy of the identical
pgd?


The point wasn't that it was necessarily a good and justifiable example, 
just that it is one that exists, to demonstrate that in general we have 
no reasonable heuristic for guessing whether ->map_pages is going to 
succeed or not other than by calling it and seeing if it succeeds or 
not. And IMO it's a complete waste of time thinking about ways to make 
such a heuristic possible instead of just getting on with fixing 
iommu_domain_alloc() to make the problem disappear altogether. Once 
Joerg pushes out the current queue I'll rebase and resend v4 of the bus 
ops removal, then hopefully get back to despairing at the hideous pile 
of WIP iommu_domain_alloc() patches I currently have on top of it...


Thanks,
Robin.


[PATCH v2] MAINTAINERS: Add myself as mlx5_vdpa driver

2023-09-25 Thread Dragos Tatulea via Virtualization
As Eli Cohen moved to other work, I'll be the contact point for
mlx5_vdpa.

Acked-by: Jason Wang 
Signed-off-by: Dragos Tatulea 
---
Changes since v1:
- Fix alphabetical sorting.
- Make naming consistent with other MELLANOX entries.
---
 MAINTAINERS | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 3dde038545d8..611c31b3b9c9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13677,6 +13677,12 @@ F: drivers/infiniband/hw/mlx5/
 F: include/linux/mlx5/
 F: include/uapi/rdma/mlx5-abi.h
 
+MELLANOX MLX5 VDPA DRIVER
+M: Dragos Tatulea 
+L: virtualization@lists.linux-foundation.org
+S: Supported
+F: drivers/vdpa/mlx5/
+
 MELLANOX MLXCPLD I2C AND MUX DRIVER
 M: Vadim Pasternak 
 M: Michael Shych 
-- 
2.41.0



[PATCH] crypto: virtio-crypto: call finalize with bh disabled

2023-09-25 Thread Gonglei (Arei) via Virtualization
Doing ipsec produces a spinlock recursion warning.
This is due to not disabling BH in the crypto completion function.

Fixes: 59ca6c93387d3 ("virtio-crypto: implement RSA algorithm")
Reported-by: Halil Pasic 
Signed-off-by: Gonglei 
---
 drivers/crypto/virtio/virtio_crypto_akcipher_algs.c | 3 ++-
 drivers/crypto/virtio/virtio_crypto_skcipher_algs.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/crypto/virtio/virtio_crypto_akcipher_algs.c 
b/drivers/crypto/virtio/virtio_crypto_akcipher_algs.c
index 2621ff8a9376..19e2898977d3 100644
--- a/drivers/crypto/virtio/virtio_crypto_akcipher_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_akcipher_algs.c
@@ -61,8 +61,9 @@ static void virtio_crypto_akcipher_finalize_req(
vc_akcipher_req->src_buf = NULL;
vc_akcipher_req->dst_buf = NULL;
	virtcrypto_clear_request(&vc_akcipher_req->base);
-
+	local_bh_disable();
	crypto_finalize_akcipher_request(vc_akcipher_req->base.dataq->engine, req, err);
+	local_bh_enable();
 }
 
 static void virtio_crypto_dataq_akcipher_callback(struct virtio_crypto_request 
*vc_req, int len)
diff --git a/drivers/crypto/virtio/virtio_crypto_skcipher_algs.c 
b/drivers/crypto/virtio/virtio_crypto_skcipher_algs.c
index 23c41d87d835..661c1102583e 100644
--- a/drivers/crypto/virtio/virtio_crypto_skcipher_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_skcipher_algs.c
@@ -566,9 +566,10 @@ static void virtio_crypto_skcipher_finalize_req(
 AES_BLOCK_SIZE, 0);
kfree_sensitive(vc_sym_req->iv);
	virtcrypto_clear_request(&vc_sym_req->base);
-
+   local_bh_disable();
crypto_finalize_skcipher_request(vc_sym_req->base.dataq->engine,
   req, err);
+   local_bh_enable();
 }
 
 static struct virtio_crypto_algo virtio_crypto_algs[] = { {
-- 
2.23.0




Re: virtio-sound linux driver conformance to spec

2023-09-25 Thread Matias Ezequiel Vara Larsen
On Mon, Sep 25, 2023 at 10:04:33AM +0900, Anton Yakovlev wrote:
> Hello Matias,
> 
> On 20.09.2023 22:18, Matias Ezequiel Vara Larsen wrote:
> > On Tue, Sep 19, 2023 at 11:52:27AM -0400, Michael S. Tsirkin wrote:
> > > On Tue, Sep 19, 2023 at 04:18:57PM +0200, Matias Ezequiel Vara Larsen 
> > > wrote:
> > > > On Tue, Sep 19, 2023 at 05:43:56AM -0400, Michael S. Tsirkin wrote:
> > > > > On Wed, Sep 13, 2023 at 05:04:24PM +0200, Matias Ezequiel Vara Larsen 
> > > > > wrote:
> > > > > > Hello,
> > > > > > 
> > > > > > This email is to report a behavior of the Linux virtio-sound driver 
> > > > > > that
> > > > > > looks like it is not conforming to the VirtIO specification. The 
> > > > > > kernel
> > > > > > driver is moving buffers from the used ring to the available ring
> > > > > > without knowing if the content has been updated from the user. If 
> > > > > > the
> > > > > > device picks up buffers from the available ring just after it is
> > > > > > notified, it happens that the content is old.
> > > > > 
> > > > > Then, what happens, exactly? Do things still work?
> > > > 
> > > > We are currently developing a vhost-user backend for virtio-sound and
> > > > what happens is that if the backend implementation decides to copy the
> > > > content of a buffer from a request that just arrived in the available
> > > > ring, it gets the old content, thus reproducing some sections twice.
> > > > For example, we observe that when issuing `aplay FrontLeft.wav`, we hear
> > > > `Front, front left...`. To fix this issue, our current implementation
> > > > delays reading from guest memory until the audio engine requires it.
> > > > However, the first implementation should also work, since it conforms
> > > > to the specification.
> > > > 
> > > > Matias
> > > 
> > > Sounds like it. How hard is it to change the behaviour though?
> > > Does it involve changing userspace?
> > 
> > AFAIU, a fix for the driver may be to somehow wait until userspace
> > updates the buffer before adding it to the available ring.
> > So far, when the device notifies the driver that a new buffer is in the
> > used ring, the driver calls the virtsnd_pcm_msg_complete() function to
> > do:
> > ```
> > schedule_work(>elapsed_period);
> > 
> > virtsnd_pcm_msg_send(vss);
> > ```
> > It first defers the notification to userspace regarding an elapsed period
> > and then enqueues the request again in the available ring.
> > `schedule_work()` defers the call to the
> > `virtsnd_pcm_period_elapsed()` function, which issues
> > `snd_pcm_period_elapsed(vss->substream);` to notify userspace.
> > My proposal would be that the driver could also defer
> > `virtsnd_pcm_msg_send(vss)` to execute just after
> > `snd_pcm_period_elapsed(vss->substream)`. Something like this:
> > 
> > diff --git a/sound/virtio/virtio_pcm.c b/sound/virtio/virtio_pcm.c
> > index c10d91fff2fb..41f1e74c8478 100644
> > --- a/sound/virtio/virtio_pcm.c
> > +++ b/sound/virtio/virtio_pcm.c
> > @@ -309,6 +309,7 @@ static void virtsnd_pcm_period_elapsed(struct 
> > work_struct *work)
> >  container_of(work, struct virtio_pcm_substream, 
> > elapsed_period);
> >  snd_pcm_period_elapsed(vss->substream);
> > +   virtsnd_pcm_msg_send(vss);
> >   }
> >   /**
> > diff --git a/sound/virtio/virtio_pcm_msg.c b/sound/virtio/virtio_pcm_msg.c
> > index aca2dc1989ba..43f0078b1152 100644
> > --- a/sound/virtio/virtio_pcm_msg.c
> > +++ b/sound/virtio/virtio_pcm_msg.c
> > @@ -321,7 +321,6 @@ static void virtsnd_pcm_msg_complete(struct 
> > virtio_pcm_msg *msg,
> >  schedule_work(>elapsed_period);
> > -   virtsnd_pcm_msg_send(vss);
> >  } else if (!vss->msg_count) {
> >  wake_up_all(>msg_empty);
> >  }
> > 
> > 
> > I tested it and it looks like it fixes the issue. However, I am not sure
> > if this is enough, since I do not know whether the buffers have already
> > been updated by the time `snd_pcm_period_elapsed()` returns.
> 
> It's just a lucky coincidence that this change helped in your case.
> 
> Instead, to solve the problem, it's necessary to implement read/write()
> operations for the substream, and disable MMAP access mode.
> 
> I'll try, but I'm not sure I'll have enough time for this in the near future.
> 
> 
Hello Anton,

Thanks, I will try to propose a better fix in the next few weeks. I
am not familiar with sound drivers, so I may require some help. I'll let
you know if I have any questions.

Thanks, Matias.
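
For reference, a rough sketch of the direction Anton suggests, in generic
ALSA terms; the flag selection and the name example_pcm_hw are assumptions,
not the actual virtio-sound definition:

#include <sound/pcm.h>

/*
 * Sketch: a PCM hardware description that deliberately omits
 * SNDRV_PCM_INFO_MMAP / SNDRV_PCM_INFO_MMAP_VALID. Userspace is then
 * forced through read/write, so the driver copies each period at send
 * time and never queues a buffer whose contents userspace could still
 * change behind its back.
 */
static const struct snd_pcm_hardware example_pcm_hw = {
	.info = SNDRV_PCM_INFO_INTERLEAVED |
		SNDRV_PCM_INFO_BLOCK_TRANSFER |
		SNDRV_PCM_INFO_BATCH,
	/* rates, formats, channels, buffer/period sizes omitted */
};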



Re: [PATCH 1/9] drm/amd/pm: Annotate struct smu10_voltage_dependency_table with __counted_by

2023-09-25 Thread Alex Deucher
On Mon, Sep 25, 2023 at 10:07 AM Alex Deucher  wrote:
>
> On Mon, Sep 25, 2023 at 2:30 AM Christian König
>  wrote:
> >
> > Am 22.09.23 um 19:41 schrieb Alex Deucher:
> > > On Fri, Sep 22, 2023 at 1:32 PM Kees Cook  wrote:
> > >> Prepare for the coming implementation by GCC and Clang of the 
> > >> __counted_by
> > >> attribute. Flexible array members annotated with __counted_by can have
> > >> their accesses bounds-checked at run-time checking via 
> > >> CONFIG_UBSAN_BOUNDS
> > >> (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
> > >> functions).
> > >>
> > >> As found with Coccinelle[1], add __counted_by for struct 
> > >> smu10_voltage_dependency_table.
> > >>
> > >> [1] 
> > >> https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
> > >>
> > >> Cc: Evan Quan 
> > >> Cc: Alex Deucher 
> > >> Cc: "Christian König" 
> > >> Cc: "Pan, Xinhui" 
> > >> Cc: David Airlie 
> > >> Cc: Daniel Vetter 
> > >> Cc: Xiaojian Du 
> > >> Cc: Huang Rui 
> > >> Cc: Kevin Wang 
> > >> Cc: amd-...@lists.freedesktop.org
> > >> Cc: dri-de...@lists.freedesktop.org
> > >> Signed-off-by: Kees Cook 
> > > Acked-by: Alex Deucher 
> >
> > Mhm, I'm not sure if this is a good idea. That is a structure filled in
> > by the firmware, isn't it?
> >
> > That would imply that we might need to byte swap count before it is
> > checkable.
>
> True. Good point.  Same for the other amdgpu patch.

Actually the other patch is fine.  That's just a local structure.

Alex

>
> Alex
>
> >
> > Regards,
> > Christian.
> >
> > >
> > >> ---
> > >>   drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.h | 2 +-
> > >>   1 file changed, 1 insertion(+), 1 deletion(-)
> > >>
> > >> diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.h 
> > >> b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.h
> > >> index 808e0ecbe1f0..42adc2a3dcbc 100644
> > >> --- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.h
> > >> +++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.h
> > >> @@ -192,7 +192,7 @@ struct smu10_clock_voltage_dependency_record {
> > >>
> > >>   struct smu10_voltage_dependency_table {
> > >>  uint32_t count;
> > >> -   struct smu10_clock_voltage_dependency_record entries[];
> > >> +   struct smu10_clock_voltage_dependency_record entries[] 
> > >> __counted_by(count);
> > >>   };
> > >>
> > >>   struct smu10_clock_voltage_information {
> > >> --
> > >> 2.34.1
> > >>
> >

Re: [PATCH 1/9] drm/amd/pm: Annotate struct smu10_voltage_dependency_table with __counted_by

2023-09-25 Thread Alex Deucher
On Mon, Sep 25, 2023 at 2:30 AM Christian König
 wrote:
>
> Am 22.09.23 um 19:41 schrieb Alex Deucher:
> > On Fri, Sep 22, 2023 at 1:32 PM Kees Cook  wrote:
> >> Prepare for the coming implementation by GCC and Clang of the __counted_by
> >> attribute. Flexible array members annotated with __counted_by can have
> >> their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
> >> (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
> >> functions).
> >>
> >> As found with Coccinelle[1], add __counted_by for struct 
> >> smu10_voltage_dependency_table.
> >>
> >> [1] 
> >> https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
> >>
> >> Cc: Evan Quan 
> >> Cc: Alex Deucher 
> >> Cc: "Christian König" 
> >> Cc: "Pan, Xinhui" 
> >> Cc: David Airlie 
> >> Cc: Daniel Vetter 
> >> Cc: Xiaojian Du 
> >> Cc: Huang Rui 
> >> Cc: Kevin Wang 
> >> Cc: amd-...@lists.freedesktop.org
> >> Cc: dri-de...@lists.freedesktop.org
> >> Signed-off-by: Kees Cook 
> > Acked-by: Alex Deucher 
>
> Mhm, I'm not sure if this is a good idea. That is a structure filled in
> by the firmware, isn't it?
>
> That would imply that we might need to byte swap count before it is
> checkable.

True. Good point.  Same for the other amdgpu patch.

Alex

>
> Regards,
> Christian.
>
> >
> >> ---
> >>   drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.h | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.h 
> >> b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.h
> >> index 808e0ecbe1f0..42adc2a3dcbc 100644
> >> --- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.h
> >> +++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.h
> >> @@ -192,7 +192,7 @@ struct smu10_clock_voltage_dependency_record {
> >>
> >>   struct smu10_voltage_dependency_table {
> >>  uint32_t count;
> >> -   struct smu10_clock_voltage_dependency_record entries[];
> >> +   struct smu10_clock_voltage_dependency_record entries[] 
> >> __counted_by(count);
> >>   };
> >>
> >>   struct smu10_clock_voltage_information {
> >> --
> >> 2.34.1
> >>
>

Re: [PATCH v2 1/2] iommu/virtio: Make use of ops->iotlb_sync_map

2023-09-25 Thread Jason Gunthorpe
On Mon, Sep 25, 2023 at 02:07:50PM +0100, Robin Murphy wrote:
> On 2023-09-23 00:33, Jason Gunthorpe wrote:
> > On Fri, Sep 22, 2023 at 07:07:40PM +0100, Robin Murphy wrote:
> > 
> > > virtio isn't setting ops->pgsize_bitmap for the sake of direct mappings
> > > either; it sets it once it's discovered any instance, since apparently 
> > > it's
> > > assuming that all instances must support identical page sizes, and thus 
> > > once
> > > it's seen one it can work "normally" per the core code's assumptions. It's
> > > also I think the only driver which has a "finalise" bodge but *can* still
> > > properly support map-before-attach, by virtue of having to replay mappings
> > > to every new endpoint anyway.
> > 
> > Well it can't quite do that since it doesn't know the geometry - it
> > all is sort of guessing and hoping it doesn't explode on replay. If it
> > knows the geometry it wouldn't need finalize...
> 
> I think it's entirely reasonable to assume that any direct mappings
> specified for a device are valid for that device and its IOMMU. However, in
> the particular case of virtio, it really shouldn't ever have direct mappings
> anyway, since even if the underlying hardware did have any, the host can
> enforce the actual direct-mapping aspect itself, and just present them as
> unusable regions to the guest.

I assume this machinery is for the ARM GIC ITS page

> Again, that's irrelevant. It can only be about whether the actual
> ->map_pages call succeeds or not. A driver could well know up-front that all
> instances support the same pgsize_bitmap and aperture, and set both at
> ->domain_alloc time, yet still be unable to handle an actual mapping without
> knowing which instance(s) that needs to interact with (e.g. omap-iommu).

I think this is a different issue. The domain is supposed to represent
the actual io pte storage, and the storage is supposed to exist even
when the domain is not attached to anything.

As we said with tegra-gart, it is a bug in the driver if all the
mappings disappear when the last device is detached from the domain.
Driver bugs like this turn into significant issues with vfio/iommufd
as this will result in warn_on's and memory leaking.

So, I disagree that this is something we should be allowing in the API
design. map_pages should succeed (memory allocation failures aside) if
an IOVA within the aperture and valid flags are presented, regardless
of the attachment status. Calling map_pages with an IOVA outside the
aperture should be a caller bug.
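
As a minimal sketch of the stated rule, using the aperture fields that
already exist in struct iommu_domain (the helper name is made up):

#include <linux/iommu.h>

/* Sketch: a map_pages caller must stay inside the domain's aperture. */
static bool example_iova_in_aperture(const struct iommu_domain *domain,
				     unsigned long iova, size_t size)
{
	return iova >= domain->geometry.aperture_start &&
	       iova + size - 1 <= domain->geometry.aperture_end;
}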

It looks omap is just mis-designed to store the pgd in the omap_iommu,
not the omap_iommu_domain :( pgd is clearly a per-domain object in our
API. And why does every instance need its own copy of the identical
pgd?

Jason


Re: [PATCH v2 1/2] iommu/virtio: Make use of ops->iotlb_sync_map

2023-09-25 Thread Robin Murphy

On 2023-09-23 00:33, Jason Gunthorpe wrote:

On Fri, Sep 22, 2023 at 07:07:40PM +0100, Robin Murphy wrote:


virtio isn't setting ops->pgsize_bitmap for the sake of direct mappings
either; it sets it once it's discovered any instance, since apparently it's
assuming that all instances must support identical page sizes, and thus once
it's seen one it can work "normally" per the core code's assumptions. It's
also I think the only driver which has a "finalise" bodge but *can* still
properly support map-before-attach, by virtue of having to replay mappings
to every new endpoint anyway.


Well it can't quite do that since it doesn't know the geometry - it
all is sort of guessing and hoping it doesn't explode on replay. If it
knows the geometry it wouldn't need finalize...


I think it's entirely reasonable to assume that any direct mappings 
specified for a device are valid for that device and its IOMMU. However, 
in the particular case of virtio, it really shouldn't ever have direct 
mappings anyway, since even if the underlying hardware did have any, the 
host can enforce the actual direct-mapping aspect itself, and just 
present them as unusable regions to the guest.



What do you think about something like this to replace
iommu_create_device_direct_mappings(), that does enforce things
properly?


I fail to see how that would make any practical difference. Either the
mappings can be correctly set up in a pagetable *before* the relevant device
is attached to that pagetable, or they can't (if the driver doesn't have
enough information to be able to do so) and we just have to really hope
nothing blows up in the race window between attaching the device to an empty
pagetable and having a second try at iommu_create_device_direct_mappings().
That's a driver-level issue and has nothing to do with pgsize_bitmap either
way.


Except we don't detect this in the core code correctly, that is my
point. We should detect the aperture conflict, not pgsize_bitmap to
check if it is the first or second try.


Again, that's irrelevant. It can only be about whether the actual 
->map_pages call succeeds or not. A driver could well know up-front that 
all instances support the same pgsize_bitmap and aperture, and set both 
at ->domain_alloc time, yet still be unable to handle an actual mapping 
without knowing which instance(s) that needs to interact with (e.g. 
omap-iommu).


Thanks,
Robin.


Re: [PATCH v2 1/2] iommu/virtio: Make use of ops->iotlb_sync_map

2023-09-25 Thread Jason Gunthorpe
On Mon, Sep 25, 2023 at 10:48:21AM +0800, Baolu Lu wrote:
> On 9/23/23 7:33 AM, Jason Gunthorpe wrote:
> > On Fri, Sep 22, 2023 at 07:07:40PM +0100, Robin Murphy wrote:
> > 
> > > virtio isn't setting ops->pgsize_bitmap for the sake of direct mappings
> > > either; it sets it once it's discovered any instance, since apparently 
> > > it's
> > > assuming that all instances must support identical page sizes, and thus 
> > > once
> > > it's seen one it can work "normally" per the core code's assumptions. It's
> > also I think the only driver which has a "finalise" bodge but *can* still
> > > properly support map-before-attach, by virtue of having to replay mappings
> > > to every new endpoint anyway.
> > Well it can't quite do that since it doesn't know the geometry - it
> > all is sort of guessing and hoping it doesn't explode on replay. If it
> > knows the geometry it wouldn't need finalize...
> 
> The ultimate solution to this problem seems to be to add device pointer
> to the parameter of ops->domain_alloc. So once the domain is allocated,
> it is fully initialized. Attaching this domain to a device that is not
> compatible will return -EINVAL, then the caller has to allocate a new
> domain for this device.

Sure, it looks like this, and it works already for default domain
setup.

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 8599394c9157d7..1697f5a2370785 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -637,15 +637,10 @@ static void viommu_event_handler(struct virtqueue *vq)
 
 /* IOMMU API */
 
-static struct iommu_domain *viommu_domain_alloc(unsigned type)
+static struct viommu_domain *__viommu_domain_alloc(void)
 {
struct viommu_domain *vdomain;
 
-   if (type != IOMMU_DOMAIN_UNMANAGED &&
-   type != IOMMU_DOMAIN_DMA &&
-   type != IOMMU_DOMAIN_IDENTITY)
-   return NULL;
-
vdomain = kzalloc(sizeof(*vdomain), GFP_KERNEL);
if (!vdomain)
return NULL;
@@ -654,16 +649,32 @@ static struct iommu_domain *viommu_domain_alloc(unsigned type)
	spin_lock_init(&vdomain->mappings_lock);
vdomain->mappings = RB_ROOT_CACHED;
 
+   return vdomain;
+}
+
+static struct iommu_domain *viommu_domain_alloc(unsigned type)
+{
+   struct viommu_domain *vdomain;
+
+   /*
+* viommu sometimes creates an identity domain out of a
+* paging domain, that can't use the global static.
+* During attach it will fill in an identity page table.
+*/
+   if (type != IOMMU_DOMAIN_IDENTITY)
+   return NULL;
+   vdomain = __viommu_domain_alloc();
+   if (!vdomain)
+   return NULL;
	return &vdomain->domain;
 }
 
 static int viommu_domain_finalise(struct viommu_endpoint *vdev,
- struct iommu_domain *domain)
+ struct viommu_domain *vdomain)
 {
int ret;
unsigned long viommu_page_size;
struct viommu_dev *viommu = vdev->viommu;
-   struct viommu_domain *vdomain = to_viommu_domain(domain);
 
viommu_page_size = 1UL << __ffs(viommu->pgsize_bitmap);
if (viommu_page_size > PAGE_SIZE) {
@@ -680,13 +691,13 @@ static int viommu_domain_finalise(struct viommu_endpoint *vdev,
 
vdomain->id = (unsigned int)ret;
 
-   domain->pgsize_bitmap   = viommu->pgsize_bitmap;
-   domain->geometry = viommu->geometry;
+   vdomain->domain.pgsize_bitmap = viommu->pgsize_bitmap;
+   vdomain->domain.geometry = viommu->geometry;
 
vdomain->map_flags  = viommu->map_flags;
vdomain->viommu = viommu;
 
-   if (domain->type == IOMMU_DOMAIN_IDENTITY) {
+   if (vdomain->domain.type == IOMMU_DOMAIN_IDENTITY) {
if (virtio_has_feature(viommu->vdev,
   VIRTIO_IOMMU_F_BYPASS_CONFIG)) {
vdomain->bypass = true;
@@ -717,6 +728,24 @@ static void viommu_domain_free(struct iommu_domain *domain)
kfree(vdomain);
 }
 
+static struct iommu_domain *viommu_domain_alloc_paging(struct device *dev)
+{
+   struct viommu_domain *vdomain;
+   vdomain = __viommu_domain_alloc();
+   if (!vdomain)
+   return NULL;
+
+   if (dev) {
+   struct viommu_endpoint *vdev = dev_iommu_priv_get(dev);
+
+   if (viommu_domain_finalise(vdev, vdomain)) {
+   viommu_domain_free(&vdomain->domain);
+   return NULL;
+   }
+   }
+   return &vdomain->domain;
+}
+
 static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev)
 {
int i;
@@ -732,7 +761,7 @@ static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev)
 * Properly initialize the domain now that we know which viommu
 * owns it.
 */
-   ret = viommu_domain_finalise(vdev, domain);
+   

Re: [PATCH] virtio_console: Annotate struct port_buffer with __counted_by

2023-09-25 Thread Amit Shah
On Fri, 2023-09-22 at 10:51 -0700, Kees Cook wrote:
> Prepare for the coming implementation by GCC and Clang of the __counted_by
> attribute. Flexible array members annotated with __counted_by can have
> their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
> (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
> functions).
> 
> As found with Coccinelle[1], add __counted_by for struct port_buffer.
> 
> [1] 
> https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
> 
> Cc: Amit Shah 
> Cc: Arnd Bergmann 
> Cc: Greg Kroah-Hartman 
> Cc: virtualization@lists.linux-foundation.org
> Signed-off-by: Kees Cook 
> ---
>  drivers/char/virtio_console.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
> index 680d1ef2a217..431e9e5bf9c1 100644
> --- a/drivers/char/virtio_console.c
> +++ b/drivers/char/virtio_console.c
> @@ -106,7 +106,7 @@ struct port_buffer {
>   unsigned int sgpages;
>  
>   /* sg is used if spages > 0. sg must be the last in is struct */
> - struct scatterlist sg[];
> + struct scatterlist sg[] __counted_by(sgpages);
>  };

Reviewed-by: Amit Shah 

Greg, please pick this up.

Thanks,

Amit


[PATCH] vringh: don't use vringh_kiov_advance() in vringh_iov_xfer()

2023-09-25 Thread Stefano Garzarella
In the while loop of vringh_iov_xfer(), `partlen` could be 0 if one of
the `iov` entries has 0 length.
In this case, we should skip the iov and go to the next one.
But calling vringh_kiov_advance() with 0 length does not cause the
advancement, since it returns immediately if asked to advance by 0 bytes.

Let's restore the code that was there before commit b8c06ad4d67d
("vringh: implement vringh_kiov_advance()"), avoiding using
vringh_kiov_advance().

Fixes: b8c06ad4d67d ("vringh: implement vringh_kiov_advance()")
Cc: sta...@vger.kernel.org
Reported-by: Jason Wang 
Signed-off-by: Stefano Garzarella 
---
 drivers/vhost/vringh.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c
index 955d938eb663..7b8fd977f71c 100644
--- a/drivers/vhost/vringh.c
+++ b/drivers/vhost/vringh.c
@@ -123,8 +123,18 @@ static inline ssize_t vringh_iov_xfer(struct vringh *vrh,
done += partlen;
len -= partlen;
ptr += partlen;
+   iov->consumed += partlen;
+   iov->iov[iov->i].iov_len -= partlen;
+   iov->iov[iov->i].iov_base += partlen;
 
-   vringh_kiov_advance(iov, partlen);
+   if (!iov->iov[iov->i].iov_len) {
+   /* Fix up old iov element then increment. */
+   iov->iov[iov->i].iov_len = iov->consumed;
+   iov->iov[iov->i].iov_base -= iov->consumed;
+
+   iov->consumed = 0;
+   iov->i++;
+   }
}
return done;
 }
-- 
2.41.0



RE: [PATCH vfio 11/11] vfio/virtio: Introduce a vfio driver over virtio devices

2023-09-25 Thread Parav Pandit via Virtualization


> From: Jason Wang 
> Sent: Monday, September 25, 2023 8:00 AM
> 
> On Fri, Sep 22, 2023 at 8:25 PM Parav Pandit  wrote:
> >
> >
> > > From: Jason Gunthorpe 
> > > Sent: Friday, September 22, 2023 5:53 PM
> >
> >
> > > > And what's more, by using MMIO BAR0 it can work for legacy.
> > >
> > > Oh? How? Our team didn't think so.
> >
> > It does not. It was already discussed.
> > The device reset in legacy is not synchronous.
> 
> How do you know this?
>
I am not sure of the motivation for repeating the same discussion already
held in OASIS with you and others in the past.

Anyway, please find the answer below.

About reset:
The legacy device specification [1] never enforced the following
requirement, which 1.0 places on drivers:

"The driver SHOULD consider a driver-initiated reset complete when it reads
device status as 0."
[1] https://ozlabs.org/~rusty/virtio-spec/virtio-0.9.5.pdf

> > The drivers do not wait for reset to complete; it was written for the sw
> > backend.
> 
> Do you see there's a flush after reset in the legacy driver?
> 
Yes, but it only flushes the write by reading it back. The driver does not
_wait_ for the reset to complete within the device, as the 1.x flow below does.

Please see the reset flow of a 1.x device below.
In fact, the comment in the 1.x driver also needs to be updated to indicate
that the driver must wait for the device to finish the reset.
I will send a separate patch improving this comment in vp_reset() to match
the spec.

static void vp_reset(struct virtio_device *vdev)
{
	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
	struct virtio_pci_modern_device *mdev = &vp_dev->mdev;

	/* 0 status means a reset. */
	vp_modern_set_status(mdev, 0);
	/* After writing 0 to device_status, the driver MUST wait for a read of
	 * device_status to return 0 before reinitializing the device.
	 * This will flush out the status write, and flush in device writes,
	 * including MSI-X interrupts, if any.
	 */
	while (vp_modern_get_status(mdev))
		msleep(1);
	/* Flush pending VQ/configuration callbacks. */
	vp_synchronize_vectors(vdev);
}


> static void vp_reset(struct virtio_device *vdev)
> {
> 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> 	/* 0 status means a reset. */
> 	vp_legacy_set_status(&vp_dev->ldev, 0);
> 	/* Flush out the status write, and flush in device writes,
> 	 * including MSI-X interrupts, if any. */
> 	vp_legacy_get_status(&vp_dev->ldev);
> 	/* Flush pending VQ/configuration callbacks. */
> 	vp_synchronize_vectors(vdev);
> }
> 
> Thanks
> 
> 
> 
> > Hence MMIO BAR0 is not the best option in real implementations.
> >
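
The practical consequence for a legacy-over-1.x translation layer is that
it must absorb the wait itself, since a legacy guest driver will never
poll. A hedged sketch (function name and timeout are hypothetical, not
from any posted patch):

static int shim_handle_legacy_reset(struct virtio_pci_modern_device *mdev)
{
	unsigned long timeout = jiffies + msecs_to_jiffies(100);

	vp_modern_set_status(mdev, 0);
	/* Poll for completion on the guest's behalf: the 1.x spec
	 * requires this of drivers, but the legacy guest won't do it. */
	while (vp_modern_get_status(mdev)) {
		if (time_after(jiffies, timeout))
			return -ETIMEDOUT;
		msleep(1);
	}
	return 0;
}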

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 00/16] vdpa: Add support for vq descriptor mappings

2023-09-25 Thread Dragos Tatulea via Virtualization
On Tue, 2023-09-12 at 16:01 +0300, Dragos Tatulea wrote:
> This patch series adds support for vq descriptor table mappings which
> are used to improve vdpa live migration downtime. The improvement comes
> from using smaller mappings which take less time to create and destroy
> in hw.
> 
Gentle ping.

Note that I will have to send a v2. The changes in mlx5_ifc.h will first
need to be merged separately into the mlx5-next branch [0] and then pulled
from there when the series is applied.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git/log/?h=mlx5-next

Thanks,
Dragos

> The first part adds the vdpa core changes from Si-Wei [0].
> 
> The second part adds support in mlx5_vdpa:
> - Refactor the mr code to be able to cleanly add descriptor mappings.
> - Add hardware descriptor mr support.
> - Properly update iotlb for cvq during ASID switch.
> 
> [0] https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com
> 
> Dragos Tatulea (13):
>   vdpa/mlx5: Create helper function for dma mappings
>   vdpa/mlx5: Decouple cvq iotlb handling from hw mapping code
>   vdpa/mlx5: Take cvq iotlb lock during refresh
>   vdpa/mlx5: Collapse "dvq" mr add/delete functions
>   vdpa/mlx5: Rename mr destroy functions
>   vdpa/mlx5: Allow creation/deletion of any given mr struct
>   vdpa/mlx5: Move mr mutex out of mr struct
>   vdpa/mlx5: Improve mr update flow
>   vdpa/mlx5: Introduce mr for vq descriptor
>   vdpa/mlx5: Enable hw support for vq descriptor mapping
>   vdpa/mlx5: Make iotlb helper functions more generic
>   vdpa/mlx5: Update cvq iotlb mapping on ASID change
>   Cover letter: vdpa/mlx5: Add support for vq descriptor mappings
> 
> Si-Wei Liu (3):
>   vdpa: introduce dedicated descriptor group for virtqueue
>   vhost-vdpa: introduce descriptor group backend feature
>   vhost-vdpa: uAPI to get dedicated descriptor group id
> 
>  drivers/vdpa/mlx5/core/mlx5_vdpa.h |  31 +++--
>  drivers/vdpa/mlx5/core/mr.c    | 191 -
>  drivers/vdpa/mlx5/core/resources.c |   6 +-
>  drivers/vdpa/mlx5/net/mlx5_vnet.c  | 100 ++-
>  drivers/vhost/vdpa.c   |  27 
>  include/linux/mlx5/mlx5_ifc.h  |   8 +-
>  include/linux/mlx5/mlx5_ifc_vdpa.h |   7 +-
>  include/linux/vdpa.h   |  11 ++
>  include/uapi/linux/vhost.h |   8 ++
>  include/uapi/linux/vhost_types.h   |   5 +
>  10 files changed, 264 insertions(+), 130 deletions(-)
> 
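
For reviewers of the uAPI patches, a hedged userspace sketch of how the
new descriptor-group query might be consumed (ioctl name taken from the
series titles; exact semantics are an assumption until the series is
merged):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

static int get_desc_group(int vdpa_fd, unsigned int vq_index,
			  uint32_t *group)
{
	struct vhost_vring_state state = { .index = vq_index };

	/* Returns the dedicated ASID group backing this vq's
	 * descriptor table, distinct from the buffer mapping group. */
	if (ioctl(vdpa_fd, VHOST_VDPA_GET_VRING_DESC_GROUP, &state) < 0)
		return -1;
	*group = (uint32_t)state.num;
	return 0;
}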

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 1/9] drm/amd/pm: Annotate struct smu10_voltage_dependency_table with __counted_by

2023-09-25 Thread Christian König via Virtualization

On 22.09.23 at 19:41, Alex Deucher wrote:

On Fri, Sep 22, 2023 at 1:32 PM Kees Cook  wrote:

Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).

As found with Coccinelle[1], add __counted_by for struct
smu10_voltage_dependency_table.

[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci

Cc: Evan Quan 
Cc: Alex Deucher 
Cc: "Christian König" 
Cc: "Pan, Xinhui" 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Xiaojian Du 
Cc: Huang Rui 
Cc: Kevin Wang 
Cc: amd-...@lists.freedesktop.org
Cc: dri-de...@lists.freedesktop.org
Signed-off-by: Kees Cook 

Acked-by: Alex Deucher 


Mhm, I'm not sure if this is a good idea. That is a structure filled in
by the firmware, isn't it?

That would imply that we might need to byte-swap count before it is
checkable.

Regards,
Christian.
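
A sketch of the fixup Christian is alluding to (hypothetical helper, not
in the patch): on big-endian hosts the firmware-written counter would have
to be converted in place before entries[] is ever indexed, or the
__counted_by bounds check compares indices against a byte-swapped value.

static void smu10_fixup_dep_table(struct smu10_voltage_dependency_table *t)
{
	/* Run once, before any entries[i] access. */
	t->count = le32_to_cpu((__force __le32)t->count);
}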




---
  drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.h b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.h
index 808e0ecbe1f0..42adc2a3dcbc 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.h
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.h
@@ -192,7 +192,7 @@ struct smu10_clock_voltage_dependency_record {
 
 struct smu10_voltage_dependency_table {
 	uint32_t count;
-	struct smu10_clock_voltage_dependency_record entries[];
+	struct smu10_clock_voltage_dependency_record entries[] __counted_by(count);
 };
 
 struct smu10_clock_voltage_information {
--
2.34.1



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization